LETTER
Communicated by Misha Tsodyks
Mean-Driven and Fluctuation-Driven Persistent Activity in Recurrent Networks

Alfonso Renart∗
[email protected] Departamento de Física Teórica, Universidad Autónoma de Madrid, Cantoblanco 28049, Madrid, Spain, and Volen Center for Complex Systems, Brandeis University, Waltham, MA 02254, U.S.A.
Rubén Moreno-Bote
[email protected] Departamento de Física Teórica, Universidad Autónoma de Madrid, Cantoblanco 28049, Madrid, Spain
Xiao-Jing Wang
[email protected] Volen Center for Complex Systems, Brandeis University, Waltham, MA 02254, U.S.A.
Néstor Parga
[email protected] Departamento de Física Teórica, Universidad Autónoma de Madrid, Cantoblanco 28049, Madrid, Spain
Spike trains from cortical neurons show a high degree of irregularity, with coefficients of variation (CV) of their interspike interval (ISI) distribution close to or higher than one. It has been suggested that this irregularity might be a reflection of a particular dynamical state of the local cortical circuit in which excitation and inhibition balance each other. In this “balanced” state, the mean current to the neurons is below threshold, and firing is driven by current fluctuations, resulting in irregular Poisson-like spike trains. Recent data show that the degree of irregularity in neuronal spike trains recorded during the delay period of working memory experiments is the same for both low-activity states of a few Hz and for elevated, persistent activity states of a few tens of Hz. Since the difference between these persistent activity states cannot be due to external factors coming from sensory inputs, this suggests that the underlying
∗ Current address: Center for Molecular and Behavioral Neuroscience, Rutgers University, Newark, NJ 07102 USA.
Neural Computation 19, 1–46 (2007) © 2006 Massachusetts Institute of Technology
network dynamics might support coexisting balanced states at different firing rates. We use mean-field techniques to study the possible existence of multiple balanced steady states in recurrent networks of current-based leaky integrate-and-fire (LIF) neurons. To assess the degree of balance of a steady state, we extend existing mean-field theories so that not only the firing rate, but also the coefficient of variation of the interspike interval distribution of the neurons, is determined self-consistently. Depending on the connectivity parameters of the network, we find bistable solutions of different types. If the local recurrent connectivity is mainly excitatory, the two stable steady states differ mainly in the mean current to the neurons. In this case, the mean drive in the elevated persistent activity state is suprathreshold and typically characterized by low spiking irregularity. If the local recurrent excitatory and inhibitory drives are both large and nearly balanced, or even dominated by inhibition, two stable states coexist, both with subthreshold current drive. In this case, the spiking variability in both the resting state and the mnemonic persistent state is large, but the balance condition implies parameter fine-tuning. Since the degree of required fine-tuning increases with network size and, on the other hand, the size of the fluctuations in the afferent current to the cells increases for small networks, overall we find that fluctuation-driven persistent activity in the very simplified type of models we analyze is not a robust phenomenon. Possible implications of considering more realistic models are discussed.
1 Introduction

The spike trains of cortical neurons recorded in vivo are irregular and consistent, to a first approximation, with a Poisson process, possessing a roughly exponential interspike interval (ISI) distribution (except at very short intervals) and a coefficient of variation (CV) of the ISI close to one (Softky & Koch, 1993). The possible implications of this fact for the basic principles of cortical organization have motivated a large number of studies during the past 10 years (Softky & Koch, 1993; Shadlen & Newsome, 1994, 1998; Tsodyks & Sejnowski, 1995; van Vreeswijk & Sompolinsky, 1996, 1998; Zador & Stevens, 1998; Harsch & Robinson, 2000). An important idea analyzed by some of these studies was that the apparent inconsistency between the cortical neuron working as an integrator of a very large number of inputs, over a relatively long time constant of the order of 10 to 20 ms, and its irregular spiking could be resolved if the cell received similar amounts of excitatory and inhibitory drive. In this way, the mean drive to the cell is subthreshold, and spikes are the result of fluctuations, which occur irregularly, thus leading to a high CV (Gerstein & Mandelbrot, 1964). Although the implications of this result were first studied in a feedforward architecture (Shadlen & Newsome, 1994), it was soon discovered that a state
in which excitation and inhibition balance each other, resulting in irregular spiking, was a robust dynamical attractor in recurrent networks (Tsodyks & Sejnowski, 1995; van Vreeswijk & Sompolinsky, 1996, 1998); that is, under very general conditions, a recurrent network settles down into a state of this sort. Although the original studies characterizing quantitatively the degree of spiking irregularity in the cortex were done using data from sensory cortices, it has since been shown that neurons in higher-order associative areas like the prefrontal cortex (PFC) also spike irregularly (Shinomoto, Sakai, & Funahashi, 1999; Compte et al., 2003) (see Figure 1). This is interesting because it is well known that cells in the PFC (Fuster & Alexander, 1971; Funahashi, Bruce, & Goldman-Rakic, 1989; Miller, Erickson, & Desimone, 1996; Romo, Brody, Hernández, & Lemus, 1999), as well as those in other associative cortices like the inferotemporal (Miyashita & Chang, 1988) or posterior parietal cortex (Gnadt & Andersen, 1988; Chafee & Goldman-Rakic, 1998), show activity patterns that are selective to stimuli no longer present to the animal and are thus being held in working memory. The activity of these neurons seems to be able to switch, on presentation of an appropriate brief, transient input, from a basal spontaneous activity level to a higher activity state. When the dimensionality of the stimulus to be remembered is low (e.g., the position of an LED on a computer screen or the frequency of a vibrotactile stimulus), the mnemonic activity during the delay period when the stimulus is absent seems to be graded (Funahashi et al., 1989; Romo et al., 1999), whereas when the dimensionality of the stimulus is high (e.g., a complex image), the single neurons seem to choose from a small number of discrete activity states (Miyashita & Chang, 1988; Miller et al., 1996). This last coding scheme is referred to as object working memory. Since there is no explicit sensory input present during the delay period in a working memory task, the neuronal activity must be a result of the dynamics of the relevant neural circuit. There is a long tradition of modeling studies that have described delay-period activity as a reflection of dynamical attractors in multistable (usually bistable) networks presumed to represent the local cortical environment of the neurons recorded in the neurophysiological experiments (Hopfield, 1982; Amit & Tsodyks, 1991; Ben-Yishai, Lev Bar-Or, & Sompolinsky, 1995; Amit & Brunel, 1997b; Brunel, 2000a; Compte, Brunel, Goldman-Rakic, & Wang, 2000; Brunel & Wang, 2001; Hansel & Mato, 2001; Cai, Tao, Shelly, & McLaughlin, 2004). Originally inspired by models and techniques from the statistical mechanics of disordered systems, network models of persistent activity have progressively become more faithful to the biological circuits that they seek to describe. The landmark study (Amit & Brunel, 1997b) provided an extended mean-field description of the activity of a recurrent network of current-based leaky integrate-and-fire (LIF) neurons. One of its main achievements was to use the theory of diffusion processes to provide an intuitive, compact
[Figure 1 appears here. Panels show, for the delay period, histograms of CV (top row) and CV2 (bottom row) across cells in the fixation (F), preferred (PR), and nonpreferred (NP) conditions, together with population comparisons (p = 8.5e−15 for CV; p = 1.6e−23 for CV2).]
Figure 1: CV of the ISI of neurons in monkey prefrontal cortex during a spatial working memory task. The monkey made saccades to remembered locations on a computer screen after a delay period of a few seconds. On each trial, a dot of light (cue stimulus) was briefly shown in one of eight to-be-remembered locations, equidistant from the fixation point but at different angles. After the delay period, starting with the disappearance of the cue stimulus and terminating with the disappearance of the fixation point, the monkey made a saccade to the remembered location. Top and bottom rows correspond, respectively, to the CV and CV2 (CV calculated using only consecutive ISIs, to try to compensate for possible slow nonstationarities in the neurons' instantaneous frequency) computed from spike trains of prefrontal cortical neurons recorded from monkeys performing an oculomotor spatial working memory task. Results shown correspond to analysis of the activity during the delay period of the task. The spike trains are irregular (CV ∼ 1), and to a similar extent, both on trials in which the preferred (PR; middle column) positional cue for the cell was held in working memory (higher firing rate during the delay period) and on trials with the nonpreferred (NP; right column) positional cue for the particular neuron (lower firing rate during the delay period). See Compte et al. (2003) for details. Adapted with permission from Compte et al. (2003).
description of the spontaneous, low-rate, basal activity state of cortical cells in terms of self-consistency equations that included information about both the mean and the fluctuations of the afferent current to the cell. The theory proposed was both simple and accurate, and matched well the properties of simulated LIF networks. The spontaneous activity state in Amit and Brunel (1997b) is effectively the balanced state described above, in which the recurrent connectivity is
dominated by inhibition and firing is due to the occurrence of positive fluctuations in the drive to the neurons. However, in Amit and Brunel (1997b), this same model was used to describe the coexistence of the spontaneous activity state with a persistent activity state with a physiologically plausible firing rate that would correspond to the spiking observed during the delay period in object working memory tasks, such as seen, for example, in Miyashita and Chang (1988). Although the model, with its large number of subsequent improvements, has been successful in providing a fairly accurate description of simulated spiking networks, no effort has yet been made to study systematically the relationship between multistability and the irregularity of the spike trains, especially in the elevated activity state. As we will show below, the qualitative organization of the connectivity in the recurrent network determines not only the existence of a fluctuation-driven balanced spontaneous activity state in the network, but also the existence of bistability in the network and whether the elevated activity states are fluctuation driven. In order to perform a systematic analysis of the types of persistent activity that can be obtained in a network of current-based LIF neurons, two steps are important. First, we believe that the scaling of the synaptic connections with the number of afferent synapses per neuron should be made explicit. This approach was taken in the studies of the balanced state (Tsodyks & Sejnowski, 1995; van Vreeswijk & Sompolinsky, 1996), but is not present in the Amit and Brunel (1997b) framework. As we shall see, when the scaling is made explicit and the network is studied in the limit of a large number of connections per cell, the difference between the behavior of alternative circuit organizations (or architectures) becomes qualitative. Second, it would be desirable to be able to check for the spike train irregularity within the theory. In Amit and Brunel (1997b), spiking was assumed to be Poisson and, hence, to have a CV equal to 1. Poisson spike trains are completely characterized by a single number, the instantaneous firing probability, so there is nothing more to say about the spike train once its firing rate has been given. A general self-consistent description of the higher-order moments of spiking in a recurrent network of LIF neurons is extremely difficult, as the calculation of the moments of the ISI distribution becomes prohibitively complicated when the input current to a particular cell contains temporal correlations (although see Moreno-Bote & Parga, 2006). However, based on our study of the input-output properties of the LIF neuron under the influence of correlated inputs (Moreno, de la Rocha, Renart, & Parga, 2002), we have constructed a self-consistent description of the first two moments of the current to the neurons in the network, which relaxes the Poisson assumption and which we expect to be valid if the temporal correlations in the spike trains in the network are sufficiently short. Some of the results presented here have already been published in abstract form.
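As an aside for readers who want to compute the two irregularity measures of Figure 1 from their own data, the following is a minimal sketch. The function name is ours; the CV2 implemented here follows the standard consecutive-ISI definition (the average of 2|ISI_{n+1} − ISI_n| / (ISI_{n+1} + ISI_n)), which matches the caption's description of a local measure that compensates for slow rate drifts.

```python
import numpy as np

def cv_and_cv2(spike_times):
    """CV of the full ISI distribution and the local, consecutive-ISI CV2.

    CV2 compares only adjacent ISIs, so slow nonstationarities in the
    instantaneous rate inflate it far less than they inflate the CV.
    """
    isi = np.diff(np.sort(spike_times))
    cv = isi.std() / isi.mean()
    cv2 = np.mean(2.0 * np.abs(isi[1:] - isi[:-1]) / (isi[1:] + isi[:-1]))
    return cv, cv2

# sanity check on a Poisson spike train (~20 Hz): both values should be near 1
rng = np.random.default_rng(0)
spikes = np.cumsum(rng.exponential(1.0 / 20.0, size=5000))
print(cv_and_cv2(spikes))
```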
2 Methods

We consider a network of current-based leaky integrate-and-fire neurons. The voltage difference across each neuron's membrane evolves in time according to the following equation,

\[ \frac{dV(t)}{dt} = -\frac{V(t)}{\tau_m} + I(t), \]

with voltages being measured relative to the leak potential of the neuron. When the depolarization reaches a threshold voltage that we set at V_th = 20 mV, a spike is emitted, and the voltage is clamped at a reset potential V_r = 10 mV during a refractory period τ_ref = 2 ms, after which the voltage continues to integrate the input current. The membrane time constant is τ_m = 10 ms. When the neuron is inserted in a network, I(t) represents the total synaptic current, which is assumed to be a linear sum of the contributions from each individual presynaptic cell. We consider the simplest description of the synaptic interaction between the pre- and postsynaptic neurons, according to which each presynaptic action potential provokes an instantaneous "kick" in the depolarization of the postsynaptic cell. The network is composed of N_E excitatory and N_I inhibitory cells randomly connected so that each cell receives C_E excitatory and C_I inhibitory contacts, each with an efficacy ("kick" size) J_{Ej} and J_{Ik}, respectively (j = 1, ..., C_E; k = 1, ..., C_I). The total afferent current into the cell can be represented as

\[ I(t) = \sum_{j=1}^{C_E} J_{Ej}\, s_j(t) - \sum_{k=1}^{C_I} J_{Ik}\, s_k(t), \]
where s_{j(k)}(t) represents the spike train from the jth excitatory (kth inhibitory) neuron. Since, according to this description, the effect of a presynaptic spike on the voltage of the postsynaptic neuron is instantaneous, s(t) is a collection of Dirac delta functions, that is, s(t) ≡ Σ_j δ(t − t_j), where t_j are the spike times.

2.1 Mean-Field Description. Spike trains in the model are stochastic, with an instantaneous firing rate (i.e., a probability of measuring a spike in (t, t + dt) per unit time) denoted by ν(t) = ν. The second-order statistics of the process are characterized by its connected two-point correlation function C(t, t′), giving the joint probability density (above chance) that two spikes happen at (t, t + dt) and at (t′, t′ + dt), that is, C(t, t′) ≡ ⟨s(t)s(t′)⟩ − ⟨s(t)⟩⟨s(t′)⟩. Stochastic spiking in network models is usually assumed to follow Poisson statistics, which is both a fairly good approximation to what is commonly observed experimentally (see, e.g.,
Softky & Koch, 1993; Compte et al., 2003) and convenient technically, since Poisson spike trains lack any temporal correlations. For Poisson spike trains, C(t, t′) = ν δ(t′ − t), where ν is the instantaneous firing probability. We have previously analyzed the effect of temporal correlations in the afferents to an LIF neuron on its firing rate (Moreno et al., 2002). Temporal correlations measured in vivo are often well fitted by an exponential (Bair, Zohary, & Newsome, 2001). We considered exponential correlations of the form

\[ C(t, t') = \nu \left[ \delta(t' - t) + (F_\infty - 1)\, \frac{e^{-|t'-t|/\tau_c}}{2\tau_c} \right], \tag{2.1} \]
where F_∞ is the Fano factor of the spike train for infinitely long time windows. The Fano factor in a window of length T is defined as the ratio between the variance and the mean of the spike count in the window. It is illustrative to calculate it for our process,

\[ F_T \equiv \frac{\sigma^2_{N(T)}}{\langle N(T) \rangle}, \]

where N(T) is the (stochastic) spike count in a window of length T,

\[ N(T) = \int_0^T dt\, s(t), \]

so that ⟨N(T)⟩ = νT, and the spike count variance is given by

\[ \sigma^2_{N(T)} \equiv \int_0^T \int_0^T dt\, dt'\, C(t, t') = \nu T + \nu (F_\infty - 1) \left[ T - \tau_c \left( 1 - e^{-T/\tau_c} \right) \right]. \]
When the time window is long compared to the correlation time constant, that is, T ≫ τ_c, then σ²_{N(T)} ∝ F_∞ νT; hence our use of the factor F_∞ in the definition of the correlation function. An interesting point to note is that for time windows that are long compared to the correlation time constant, the variance of the spike count is linear in time, which is a signature of independence across time, that is, independent variances add up (for the Poisson process, (σ²_{N(T)})_Poisson = νT, so that (F_T)_Poisson = 1). If the characteristic time of the postsynaptic cell integrating this stochastic current (its membrane time constant) is very long compared with τ_c, we expect that the main effect of the deviation from Poisson of the input spike trains will be on the amplitude of the current variance, with the parameter τ_c playing only a marginal role, as it does not appear in σ²_{N(T)} when T ≫ τ_c. As we show
below, a rigorous analysis of the effect of correlations on the mean firing rate of an LIF neuron confirms this intuitive picture.

The postsynaptic cell receives many inputs. Recall that the total current is given (we consider for simplicity for this discussion that the cell receives C inputs from a single, large, homogeneous population composed of N neurons) by I(t) = J Σ_{j=1}^{C} s_j(t). Thus, the mean and correlation function of the total afferent current to a given cell are

\[ \langle I(t) \rangle = C J \nu \]
\[ C_I(t, t') = \langle I(t) I(t') \rangle - \langle I(t) \rangle \langle I(t') \rangle = J^2 \sum_{ij} \left[ \langle s_i(t) s_j(t') \rangle - \langle s_i(t) \rangle \langle s_j(t') \rangle \right] = C J^2 C(t, t') + C(C-1) J^2 C_{cc}(t, t'), \]

where C(t, t′) is the (auto)correlation function in equation 2.1 and C_cc(t, t′) is the cross-correlogram between any two given cells of the pool of presynaptic inputs (which we have assumed to be the same for all pairs). We restrict our analysis to very sparse random networks—networks with C ≪ N—so that the fraction of synaptic inputs shared by any two given neurons can be assumed to be negligible. In this case, the cross-correlation between the spike trains of the two cells C_cc(t, t′) will be zero. This approximation simplifies the analysis of the network behavior significantly and allows for a self-consistent solution for the network's steady states. Thus, the temporal structure of the total current to the cell is described by

\[ C_I(t, t') = \sigma_0^2 \left[ \delta(t - t') + \frac{\alpha}{2\tau_c} e^{-|t-t'|/\tau_c} \right] \tag{2.2} \]

with σ_0² = C J² ν and α = F_∞ − 1.
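The role of F_∞ here, and the renewal relation F_∞ = CV² invoked below, are easy to verify numerically. The following sketch uses gamma-distributed ISIs purely as an illustrative renewal process (not part of the paper's model): for counting windows much longer than any correlation time, the Fano factor converges to the squared CV.

```python
import numpy as np

rng = np.random.default_rng(1)
shape = 4.0                       # gamma ISIs: CV^2 = 1/shape = 0.25
rate = 20.0                       # Hz
isi = rng.gamma(shape, 1.0 / (shape * rate), size=1_000_000)
spikes = np.cumsum(isi)

cv2 = (isi.std() / isi.mean()) ** 2
T = 5.0                           # counting window, long compared to the ISIs
counts, _ = np.histogram(spikes, np.arange(0.0, spikes[-1], T))
fano = counts.var() / counts.mean()
print(f"CV^2 = {cv2:.3f}   F_T = {fano:.3f}")   # both approach 1/shape
```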
We have previously calculated the output firing rate of an LIF neuron subject to an exponentially correlated input (Moreno et al., 2002). The calculation is done using the diffusion approximation (Ricciardi, 1977) in which the discontinuous voltage trajectories are approximated by those obtained from an equivalent diffusion process. The diffusion approximation is expected to give accurate results when the overall rate of the input process is high, with the amplitude of each individual event being very small (Ricciardi, 1977). For small but finite τc , the analytic calculation of the firing rate of the cell can be done only when the deviation of the input current from a white noise
process is small; that is, it has to be done assuming that k ≡ √(τ_c/τ_m) ≪ 1 and that α ≪ 1. More specifically, we found that if k = 0, then the firing rate can be calculated for arbitrary values of α, but if k is small but finite, then an expression can be found for the case when both k and α are small (see Moreno et al., 2002, for details). If k = 0, then the result one obtains is that the firing rate of the neuron is given by the same expression that one finds for the case of a white noise input, but with an effective variance that takes into account the change in amplitude of the fluctuations due to the non-Poisson nature of the inputs. The effective variance is equal to

\[ \sigma_{\text{eff}}^2 = \sigma_0^2 (1 + \alpha) = C J^2 \nu F_\infty, \]
which is exactly the slope of the linear increase with the size of the time window T of the variance in the spike count N_I(T) of the total input current. This result can be understood in terms of the Fano factor calculation outlined above. Assuming k = 0 is equivalent to assuming an infinitely long time window for the calculation of the Fano factor, and in those conditions we also saw that the only effect of the temporal correlations is to renormalize the variance of the spike count with respect to the Poisson case. In order to set up a self-consistent scenario, we have to close the loop by calculating a measure of the variability of the postsynaptic cell and relating it to the same property in the spike trains of its inputs. To do this, we note that if the spike trains in the model can be described as renewal processes, they obey a property relating spike count variability and ISI variability: F_∞ = CV², if a point process is renewal (Cox, 1962). Renewal point processes are characterized by having independent ISIs, which are not necessarily exponentially distributed. Since we are assuming that the temporal correlations in the spike trains are short anyway, and the firing rates of the cells in the persistent activity states that we are interested in are not very high, we expect the renewal assumption to be appropriate. The final step is to make sure that the result for the firing rate (the inverse of the mean ISI) in terms of the effective variance also holds for higher moments of the postsynaptic ISI, not only for the first, and this is indeed the case (Renart, 2000); that is, the CV of the ISI when k = 0 is given by the same expression as when the input is a white noise process, but with a renormalized variance equal to σ²_eff. Thus, under the assumptions described above, there is a way of computing the output rate and CV of an LIF neuron solely in terms of the rate and CV of its presynaptic inputs. In the steady states, both input and output
firing rate and CV will be the same, and this provides a pair of equations that determine these quantities self-consistently. In the remainder of the letter, we thus use the common expressions for the mean and CV of the first passage time of the Ornstein-Uhlenbeck (OU) process,

\[ \nu^{-1} = \tau_{\text{ref}} + \tau_m \sqrt{\pi} \int_{\frac{V_{\text{res}} - \mu_V}{\sigma_V \sqrt{2}}}^{\frac{V_{\text{th}} - \mu_V}{\sigma_V \sqrt{2}}} dx\, e^{x^2} \left[ 1 + \text{erf}(x) \right] \tag{2.3} \]

\[ \text{CV}^2 = 2\pi (\nu \tau_m)^2 \int_{\frac{V_{\text{res}} - \mu_V}{\sigma_V \sqrt{2}}}^{\frac{V_{\text{th}} - \mu_V}{\sigma_V \sqrt{2}}} dx\, e^{x^2} \int_{-\infty}^{x} dy\, e^{y^2} \left[ 1 + \text{erf}(y) \right]^2, \tag{2.4} \]
where µ_V and σ_V² are the mean and variance of the depolarization of the postsynaptic cell (in the absence of threshold; Ricciardi, 1977). In a stationary situation, they are related to the mean µ and variance σ² of the afferent current to the cell by

\[ \mu_V = \tau_m \mu; \qquad \sigma_V^2 = \frac{1}{2} \tau_m \sigma^2. \]

Following the arguments above, the mean and (effective) variance of the current to the cells are given by

\[ \mu = C J \nu \qquad \sigma^2 \equiv \sigma_{\text{eff}}^2 = C J^2 \nu\, \text{CV}^2 \]

for the mean and variance of the white noise input current. Finally, it is easy to show that if the presynaptic afferents to the cell come from a set of different statistically homogeneous subpopulations, the previous expressions generalize readily to

\[ \mu_i = \sum_j C_{ij} J_{ij} \nu_j \tag{2.5} \]
\[ \sigma_i^2 \equiv \sigma_{i,\text{eff}}^2 = \sum_j C_{ij} J_{ij}^2 \nu_j \text{CV}_j^2 \tag{2.6} \]
as long as the timescales of the correlations in the spike trains of the neurons in the different subpopulations are all of the same order. Inhibitory subpopulations are characterized by negative connection strengths.

2.2 Dynamics. A detailed characterization of the dynamics of the activity of the network is beyond the scope of this work. Since our main interest is the steady states of the network, we use a simple, effective dynamical
scheme that is consistent with the self-consistent equations that determine the steady states. In particular, we use the subthreshold dynamics of the first two moments of the depolarization in terms of the first two moments of the current (Ricciardi, 1977; Gillespie, 1992):

\[ \frac{d\mu_V}{dt} = -\frac{\mu_V}{\tau_m} + \mu; \qquad \frac{d\sigma_V^2}{dt} = -\frac{\sigma_V^2}{\tau_m/2} + \sigma^2. \tag{2.7} \]
In using these equations, our assumption is that the activity of the population follows instantaneously the distribution of the depolarization. Thus, at every point in time, we use expressions 2.5 and 2.6 for the µ and σ² appearing in the right-hand side of equations 2.7, which depend on the rate ν(µ_V, σ_V) and CV(µ_V, σ_V) as given in equations 2.3 and 2.4. The only dynamical variables are therefore µ_V and σ_V² (Amit & Brunel, 1997b; Mascaro & Amit, 1999).

2.3 Numerical Analysis of the Analytic Results. The phase plane analysis of the reduced network was done using both custom-made C++ code and the program XPPaut. The integrals in equations 2.3 and 2.4 were calculated analytically for very large and very small values of the limits of integration (using asymptotic expressions for the error function; Abramowitz & Stegun, 1970) and numerically for values of the integration limits of order one. The corresponding C++ code was incorporated into XPPaut through the use of dynamically linked libraries for phase plane analysis. Some of the cusp diagrams were calculated without the use of XPPaut by the direct approach of looking for values of the parameters at which the number of fixed points changed abruptly.
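As a rough Python counterpart of that procedure, the sketch below evaluates equations 2.3 and 2.4 by direct quadrature. It is a simplified stand-in for the C++/XPPaut code, not the authors' implementation: it is reliable only for integration limits of moderate size (the erfcx identity tames the subthreshold tail, but no asymptotic expansions are used), and the nested integral makes it slow; all names are ours.

```python
import numpy as np
from scipy.special import erfcx
from scipy.integrate import quad

TAU_M, TAU_REF = 0.010, 0.002        # s
V_TH, V_RES = 20.0, 10.0             # mV

def _g(x):
    """e^{x^2} (1 + erf(x)), written via erfcx to avoid underflow for x << 0."""
    return erfcx(-x) if x <= 0 else 2.0 * np.exp(x * x) - erfcx(x)

def rate_and_cv(muV, sigmaV):
    """Mean rate and CV of the OU first-passage time, equations 2.3 and 2.4."""
    a = (V_RES - muV) / (sigmaV * np.sqrt(2.0))
    b = (V_TH - muV) / (sigmaV * np.sqrt(2.0))
    nu = 1.0 / (TAU_REF + TAU_M * np.sqrt(np.pi) * quad(_g, a, b)[0])
    # inner integrand e^{y^2}[1 + erf(y)]^2 = _g(y)^2 e^{-y^2}; decays for y << 0
    inner = lambda x: quad(lambda y: _g(y) ** 2 * np.exp(-y * y), -30.0, x)[0]
    dbl = quad(lambda x: np.exp(x * x) * inner(x), a, b)[0]
    cv = nu * TAU_M * np.sqrt(2.0 * np.pi * dbl)
    return nu, cv

print(rate_and_cv(muV=19.0, sigmaV=1.0))   # a subthreshold, fluctuation-driven point
```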
2.4 Numerical Simulations. We simulated a network identical to the one used in the mean-field description (see the captions of Figures 12 and 13 for parameters). In the simulation, on every time step dt = 50 µs, we check which neurons in the network receive any spikes. The membrane potential of cells that do not receive spikes is integrated analytically. The membrane potential of cells that receive spikes is also integrated analytically within that dt, taking into account the postsynaptic potentials (PSPs) but assuming that there is no threshold. Only at the end of the time step is it checked whether the membrane potential is above threshold. If this is the case, the neuron is said to have produced a spike. This procedure effectively introduces a (very short but nonzero) synaptic integration time constant. Emitted spikes are fed back into the network using a system of queues to account for the synaptic delays (Mattia & Del Giudice, 2000).
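A stripped-down version of this scheme is sketched below. Network size, connectivity, kick sizes, and the constant external drive are placeholders (the actual parameters are in the captions of Figures 12 and 13); the essential ingredients from the text are there: exact subthreshold integration over each dt, PSP kicks added within the step, the threshold checked only at the end of the step, and spikes delivered through a delay queue.

```python
import numpy as np

rng = np.random.default_rng(2)
N, C = 1000, 100                    # cells and inputs per cell (placeholders)
DT, D = 5e-5, 20                    # 50 us step; delay of D steps (1 ms)
TAU_M, V_TH, V_RES, T_REF = 0.010, 20.0, 10.0, 0.002
DECAY = np.exp(-DT / TAU_M)
MU_EXT = 21.0                       # deterministic external drive (mV), placeholder

pre = rng.integers(0, N, size=(N, C))                 # presynaptic partners
J = np.where(rng.random((N, C)) < 0.8, 0.2, -0.6)     # E/I kick sizes (mV)

V = rng.uniform(V_RES, V_TH, N)
refr = np.zeros(N)                  # remaining refractory time per cell
queue = np.zeros((D, N))            # queue[d]: kicks arriving d steps from now

for step in range(20_000):
    kicks = queue[step % D].copy()
    queue[step % D] = 0.0
    # exact integration of leak + constant drive over DT, then add the kicks;
    # no threshold is applied inside the step
    V = V * DECAY + MU_EXT * (1.0 - DECAY) + kicks
    V[refr > 0.0] = V_RES           # clamped at reset while refractory
    refr = np.maximum(refr - DT, 0.0)
    # threshold check only at the end of the step
    spk = np.flatnonzero((V >= V_TH) & (refr == 0.0))
    V[spk] = V_RES
    refr[spk] = T_REF
    if spk.size:
        # enqueue the resulting PSPs; this slot is read again at step + D
        queue[step % D] += (J * np.isin(pre, spk)).sum(axis=1)
```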
3 Results

3.1 Network Architecture and Scaling Considerations. We first study the issue of how the architecture of the network determines the qualitative nature of the steady-state solutions for the firing rate of the cells in the network. In particular, we are interested in understanding under which conditions there are multiple steady-state solutions (bistability or, in general, multistability) in networks in which cells fire with a high degree of irregularity. We consider a network for object working memory composed of a number of nonoverlapping subpopulations, or columns, defined by selectivity with respect to a given external stimulus. Each subpopulation contains both excitatory and inhibitory cells. The synaptic properties of neurons within the same column are assumed to be statistically identical. Thus, the column is, in our network, the minimal signaling unit at the average level; that is, all the neurons within a column behave identically on average. As will be shown below, the type of bistability realizable for the column depends critically on its size (more specifically, on the number of connections that a given cell receives from within its own column). Different architectures will thus be considered in which the total number of afferent connections per cell C is constant (and large) but the number of columns in the network varies, effectively varying the number of afferent connections from a given column to a cell. A multicolumnar architecture of this sort is inspired by the anatomical organization of the PFC, in which columnarly organized putative excitatory cells and interneurons show similar response profiles during working memory tasks (Rao, Williams, & Goldman-Rakic, 1999). As noted in section 1, many of the properties of the network can be inferred from a scaling analysis. In the limit in which the connectivity is very sparse, so that correlations between the spike trains of different cells can be neglected (see section 2), the relevant parameter is the number of afferent connections per neuron C. We will investigate the behavior of the network in the limit C → ∞ (the "extensive" limit) since, in this case, the different scenarios become qualitatively different. Of course, physical neurons receive a finite number of connections, but the rationale is that the physical solution can be considered a small perturbation of the solution found in the C = ∞ case, which is much easier to characterize. One should keep in mind that even if C becomes very large, we still need to impose the sparseness condition for our analysis to be valid, which implies that it should always hold that N ≫ C. When considering current-based scenarios in the extensive limit, one is forced to normalize the connection strengths (the size of the unitary PSPs, which we denote generically by J) by (some power of) the number of connections per cell C, in order to keep the total afferent current within the dynamic range of the neuron (whose order of magnitude is given by the distance between reset and threshold). As we show below,
different scaling schemes of J with C lead to different relative magnitudes of the mean and fluctuations of the afferent current into the cells in the extensive limit, and this in turn determines the type of steady-state solutions for the network. We thus proceed to analyze the expressions for the mean and variance of the afferent current (see equations 2.5 and 2.6) under different scaling assumptions. We consider multicolumnar networks in which the C afferent connections to a given cell come from N_c different "columns" (each contributing C_c connections, so that C = N_c C_c). Each column is composed of an excitatory and an inhibitory subpopulation. The multicolumnar structure of the network is specified by the following scaling relationships,

\[ N_c \equiv n_c C^{\alpha}, \qquad C_c \equiv n_c^{-1} C^{1-\alpha}, \]
with 0 ≤ α ≤ 1 and n_c order one, that is, independent of C. The case α = 0 corresponds to a finite number n_c of columns, each contributing a number of connections of order C. The case α = 1 corresponds to an extensive number of columns, each contributing a number of connections of order one—that is, a fixed number as the total number of connections C grows. Although connection strengths between the different subpopulations can all be different, we assume that they can be classified into two types according to their scaling with C: those between cells within the same column, of strength J_w, and those between cells belonging to different columns, of strength J_b (the scaling is assumed to be the same for excitatory and for inhibitory connections). We define

\[ J_w \equiv j_w C^{-\alpha_w}, \qquad J_b \equiv j_b C^{-\alpha_b}, \]

where α_w, α_b > 0 and the j's are all order one. In these conditions, the afferent current to the excitatory or inhibitory cells (it does not matter which, for this analysis) from their own subpopulation is characterized by

\[ \mu_{\text{in}} = C_c \left[ J_{E_w} \nu_{E_{\text{in}}} - J_{I_w} \nu_{I_{\text{in}}} \right] = C^{1-\alpha-\alpha_w} \left[ j_{E_w} \nu_{E_{\text{in}}} - j_{I_w} \nu_{I_{\text{in}}} \right] n_c^{-1} \equiv C^{1-\alpha-\alpha_w} f_{\mu_{\text{in}}} \]

\[ \sigma_{\text{in}}^2 = C_c \left[ J_{E_w}^2 \nu_{E_{\text{in}}} \text{CV}_{E_{\text{in}}}^2 + J_{I_w}^2 \nu_{I_{\text{in}}} \text{CV}_{I_{\text{in}}}^2 \right] = C^{1-\alpha-2\alpha_w} \left[ j_{E_w}^2 \nu_{E_{\text{in}}} \text{CV}_{E_{\text{in}}}^2 + j_{I_w}^2 \nu_{I_{\text{in}}} \text{CV}_{I_{\text{in}}}^2 \right] n_c^{-1} \equiv C^{1-\alpha-2\alpha_w} f_{\sigma_{\text{in}}}, \]
where the f ’s are linear combinations of rates and CVs weighted by factors
of order one. We proceed by assuming that all other columns are in the same state ν_{E,I out}, CV_{E,I out}, so that the current to the cells in the column under focus from the rest of the network is characterized by

\[ \mu_{\text{out}} = (N_c - 1) C_c \left[ J_{E_b} \nu_{E_{\text{out}}} - J_{I_b} \nu_{I_{\text{out}}} \right] = C^{1-\alpha_b} \left( 1 - n_c^{-1} C^{-\alpha} \right) \left[ j_{E_b} \nu_{E_{\text{out}}} - j_{I_b} \nu_{I_{\text{out}}} \right] \equiv C^{1-\alpha_b} \left( 1 - n_c^{-1} C^{-\alpha} \right) f_{\mu_{\text{out}}} \]

\[ \sigma_{\text{out}}^2 = (N_c - 1) C_c \left[ J_{E_b}^2 \nu_{E_{\text{out}}} \text{CV}_{E_{\text{out}}}^2 + J_{I_b}^2 \nu_{I_{\text{out}}} \text{CV}_{I_{\text{out}}}^2 \right] = C^{1-2\alpha_b} \left( 1 - n_c^{-1} C^{-\alpha} \right) \left[ j_{E_b}^2 \nu_{E_{\text{out}}} \text{CV}_{E_{\text{out}}}^2 + j_{I_b}^2 \nu_{I_{\text{out}}} \text{CV}_{I_{\text{out}}}^2 \right] \equiv C^{1-2\alpha_b} \left( 1 - n_c^{-1} C^{-\alpha} \right) f_{\sigma_{\text{out}}}. \]

In the extensive limit, the terms (1 − n_c^{−1} C^{−α}) become equal to one if α > 0 and are of order one if α = 0, in which case they can be included as an extra multiplicative factor in the f terms. We thus omit them from now on. In addition to their recurrent inputs, cells receive a similar number of external excitatory inputs as well, but since we are interested in the generation of irregularity by the network, we will assume this external drive to be deterministic, that is, characterized by

\[ \mu_{\text{ext}} = C J_{\text{ext}} \nu_{\text{ext}} = C^{1-\alpha_{\text{ext}}} j_{\text{ext}} \nu_{\text{ext}} \equiv C^{1-\alpha_{\text{ext}}} f_{\text{ext}}, \]

with J_ext = j_ext C^{−α_ext} and α_ext > 0. The scaling with C of the different components of the total afferent current is thus given by

\[ \mu_{\text{in}} = C^{1-\alpha-\alpha_w} f_{\mu_{\text{in}}} \qquad \sigma_{\text{in}}^2 = C^{1-\alpha-2\alpha_w} f_{\sigma_{\text{in}}} \]
\[ \mu_{\text{out}} = C^{1-\alpha_b} f_{\mu_{\text{out}}} \qquad \sigma_{\text{out}}^2 = C^{1-2\alpha_b} f_{\sigma_{\text{out}}} \]
\[ \mu_{\text{ext}} = C^{1-\alpha_{\text{ext}}} f_{\text{ext}}. \]

If α, α_b, α_w are such that the variances vanish as C → ∞, the corresponding networks will consist of regularly spiking neurons. Since we are interested in irregular spiking, we therefore look for solutions in which σ_in², σ_out², or both remain order one in the C → ∞ limit. There are several ways to achieve this.

3.1.1 Scenario 1: Homogeneous Balanced Network. This case is associated with the choice α = 0. In this case, the size of the columns is of the same order as the size of the whole network (i.e., the number of columns, n_c, is order one), in which case the in and out quantities become equivalent. A finite variance is achieved by setting α_w = α_b = 1/2, that is, J ∝ 1/√C. This scenario is equivalent to the network studied originally in Tsodyks and Sejnowski (1995) and van Vreeswijk and Sompolinsky (1996, 1998). In such
a network, the mean input from the recurrent network grows as the square root of the number of inputs,

\[ \mu_{\text{in}} + \mu_{\text{out}} = \sqrt{C} \left( f_{\mu_{\text{in}}} + f_{\mu_{\text{out}}} \right). \]

This quantity can be positive or negative depending on the excitation-inhibition balance in the network. The overall mean input into the neurons is obtained by adding the external input: µ = µ_in + µ_out + µ_ext. In order not to saturate the dynamic range of the cell in the extensive limit, the overall mean current into the neurons should remain of order one as C → ∞. Hence, it is required that

\[ \mu = \sqrt{C} \left[ f_{\mu_{\text{in}}} + f_{\mu_{\text{out}}} + f_{\text{ext}}\, C^{\frac{1}{2} - \alpha_{\text{ext}}} \right] \sim O(1), \tag{3.1} \]

which is possible only if the term in square brackets vanishes as 1/√C. If the synapses from the external inputs vanish like 1/C (α_ext = 1), then the external input has a negligible contribution to the overall mean input to the cells. In this case, since both f_µin and f_µout are linear combinations of the firing rates of the neurons inside and outside the column under focus, equation 3.1 effectively becomes, in the extensive limit, a set of linear homogeneous equations for the activity of the different columns (note that although we have, for brevity, written only one, there are four equations like equation 3.1, for the excitatory and inhibitory subpopulations inside and outside the column under focus). Thus, unless the matrix of coefficients of the firing rates in f_µin and f_µout for the excitatory and inhibitory subpopulations is rank deficient, the only solution of equation 3.1 in the extensive limit is a silent, zero-rate network (van Vreeswijk & Sompolinsky, 1998). On the other hand, if α_ext = 1/2, the linear system defined by equations 3.1 in the extensive limit is no longer homogeneous. Hence, in the general case (except for degeneracies), if α_ext = 1/2, there is a single self-consistent solution for the firing rates in the network, in which the activity in each subpopulation is proportional to the external drive ν_ext (van Vreeswijk & Sompolinsky, 1998). This highlights the importance of a powerful external excitatory drive. When α_ext = 1/2, the external drive by itself would drive the cells to saturation if the recurrent connections were inactivated. In the presence of the recurrent connections, the activities in the excitatory and inhibitory subpopulations adjust themselves to compensate this massive external drive. The firing rates in the self-consistent solution correspond to the only way in which this compensation can occur for all the subpopulations at the same time. It follows that inasmuch as the different inputs to the neuron combine linearly, unless the connectivity matrix is degenerate, which requires some kind of fine-tuning mechanism, bistability in a large, homogeneous balanced network is not a robust property.
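To make the argument concrete, here is a two-population (E, I) version of the balance condition with α_ext = 1/2: the O(√C) bracket in equation 3.1 must vanish, giving an inhomogeneous linear system whose unique solution is proportional to ν_ext. The coupling values below are arbitrary illustrative numbers, not taken from the paper.

```python
import numpy as np

# order-one couplings (illustrative); E and I receive different inhibition
jEE, jEI, jIE, jII = 1.0, 2.0, 1.0, 1.8
jEx, jIx = 1.0, 0.8
nu_ext = 5.0

# balance: jEE*nuE - jEI*nuI + jEx*nu_ext = 0
#          jIE*nuE - jII*nuI + jIx*nu_ext = 0
A = np.array([[jEE, -jEI],
              [jIE, -jII]])
nuE, nuI = np.linalg.solve(A, [-jEx * nu_ext, -jIx * nu_ext])
print(nuE, nuI)   # both rates scale linearly with nu_ext, as in the text
```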
3.1.2 Scenario 2: Homogeneous Multicolumnar Balanced Network. Since the linearity condition that results in a unique solution follows from the mean current to the cells from within the column growing with C, we impose that µ_in ∼ O(1), that is, 1 − α − α_w = 0. If this is the case, the variance coming from within the column goes as C^{α−1}. We consider first the case α < 1. In these conditions, the variance coming from the activity inside the column vanishes for large C. To keep the total variance finite, we set α_b = 1/2. If we also choose α = 1/2, then α_w = α_b = 1/2, so the network is homogeneous in that all the connection strengths scale similarly with C regardless of whether they connect cells belonging to the same or different columns. Since α = 1/2, there are many columns in the network, and the number of connections coming from inside a particular column is a very small fraction (which decreases like 1/√C) of the total number of afferent connections to the cell. The fact that α_b = 1/2 implies that the mean current coming from outside the column will still grow like √C. Thus, in order for the cells not to saturate, the excitation and the inhibition from outside the column have to balance precisely; the steady state of the rest of the network becomes, again, a unique, linear function of the external input to the cells (where again we choose α_ext = 1/2 to avoid the quiescent state). However, now the mean current coming from inside the column is independent of C, so the steady-state activity inside the column is not determined by a set of linear equations. Instead, it should be determined self-consistently using the nonlinear transfer function in equation 2.3, which, in principle, permits bistability. This scenario is, in fact, equivalent to the one studied in Brunel (2000b), where a systematic analysis of the conditions in which bistability in a network like this can exist has been performed. (Although no explicit scaling of the synaptic connection strengths with C was assumed in Brunel, 2000b, the essential fact that the total variance to the cells in the subpopulation that supports bistability is constant is considered in that article.) As will be shown in detail below, the fact that the potential multiple steady-state solutions in this scenario differ only in the mean current to the cells, not in their variance (which is fixed by the balance condition on the rates outside the column), leads necessarily to a lower (in general, significantly lower) CV of the activity of the cells in the elevated persistent activity state. Therefore, in a network with J ∝ 1/√C scaling, bistability is possible in small subsets of neurons comprising a fraction ∝ 1/√C of the total number of connections per cell, but the elevated persistent activity states are characterized by a change in the mean drive to the cells at constant variance, and, as we show below, this leads to a significant decrease in the spiking irregularity in the elevated persistent activity states.
on their own activity. Thus, in addition to the condition 1 − α − αw = 0 necessary for bistability, we have to impose that σin2 ∝ C α−1 be independent of C, that is, α = 1, which implies αw = 0. In these conditions, the extensive number of connections per cell come from an extensive number of columns, with the number of connections from each column remaining a finite number. The αw = 0 condition reflects the fact that since cells receive only a finite number of intracolumnar connections, the strength of these connections does not need to scale in any specific way with C. As for the activity outside the√network, one could now, in principle, choose either J b ∝ 1/C or J b ∝ 1/ C (corresponding to αb = 1, 1/2, respectively), since there is already a finite amount of variance coming from within the column. In the first case, the rest of the network contributes only a noiseless deterministic current whose exact amount has to be determined selfconsistently, and in the second it contributes both to the total mean and variance of the afferent current to the neurons in the √ column. In this last case, as in the previous two scenarios, the J b ∝ 1/ C scaling results in the need for balance between the total excitation and inhibition outside the network, which (again, if αext = 1/2) leads to a unique solution for the activity of the rest of the population linear in the external drive to these neurons. In this scenario, the network is heterogeneous since the strength of the connections from neurons within the same column is larger than those from neurons in other columns. Since the rate and CV of the cells inside the column have to be determined self-consistently in this case, we proceed to do a systematic quantitative analysis of this scenario in the next section. From the scaling considerations described in this section, it is already clear, though, that a potential bistable solution with high CV is possible only in a small network.
3.2 Types of Bistable Solutions in a Reduced Network. In this section, we consider the network described in scenario 3 in the previous section, with the choice α_b = 1/2, and analyze the types of steady-state solutions for the activity in a particular column of finite size. The rest of the network is in a balanced state, and its activity is completely decoupled from the activity of the column, which is too small in size to make a difference in the overall input to the rest of the cells. For our present purposes, all that matters about the afferents from outside the column (from both the rest of the network and the external ones) is that they provide a finite net input to the cells in the column. We denote the mean and variance of that fixed external current by µ^ext_{E,I}/τ_m and 2(σ^ext_{E,I})²/τ_m, where the factors with the membrane time constant τ_m have been included so that µ^ext_{E,I} and (σ^ext_{E,I})² represent the contribution to the mean and variance of the postsynaptic depolarization (in the absence of threshold) in the steady states arising from outside the column.
The net input to the excitatory and inhibitory populations in the column under focus is thus characterized by

\[ \mu_E = c_{EE} j_{EE} \nu_E - c_{EI} j_{EI} \nu_I + \frac{\mu_E^{\text{ext}}}{\tau_m} \]
\[ \mu_I = c_{IE} j_{IE} \nu_E - c_{II} j_{II} \nu_I + \frac{\mu_I^{\text{ext}}}{\tau_m} \]
\[ \sigma_E^2 = c_{EE} j_{EE}^2 \nu_E \text{CV}_E^2 + c_{EI} j_{EI}^2 \nu_I \text{CV}_I^2 + \frac{(\sigma_E^{\text{ext}})^2}{\tau_m/2} \]
\[ \sigma_I^2 = c_{IE} j_{IE}^2 \nu_E \text{CV}_E^2 + c_{II} j_{II}^2 \nu_I \text{CV}_I^2 + \frac{(\sigma_I^{\text{ext}})^2}{\tau_m/2}, \]

where the number-of-connection and connection-strength parameters c and j are all order one. We proceed by simplifying this scheme further in order to reduce the dimensionality of the system from four to two, which will allow a systematic exploration of the effect of all the parameters on the type of steady-state solutions of the network. In particular, we make the inputs to the excitatory and inhibitory populations identical,

\[ c_{IE} = c_{EE} \equiv c_E \qquad c_{II} = c_{EI} \equiv c_I \]
\[ j_{IE} = j_{EE} \equiv j_E \qquad j_{II} = j_{EI} \equiv j_I \]
\[ \mu_E^{\text{ext}} = \mu_I^{\text{ext}} \equiv \mu^{\text{ext}} \qquad (\sigma_E^{\text{ext}})^2 = (\sigma_I^{\text{ext}})^2 \equiv (\sigma^{\text{ext}})^2, \]

so that the whole column becomes statistically identical: ν_E = ν_I ≡ ν and CV_E = CV_I ≡ CV. For simplicity, we also assume that the number of excitatory and inhibitory inputs is the same: c_E = c_I = c. Thus, we are left with a system with four parameters,

\[ c_\mu = c (j_E - j_I); \qquad c_\sigma = \sqrt{c (j_E^2 + j_I^2)}; \qquad \mu^{\text{ext}}; \qquad \sigma^{\text{ext}}, \tag{3.2} \]

all with units of mV, and two dynamical variables (from equation 2.7)

\[ \frac{d\mu_V}{dt} = -\frac{\mu_V}{\tau_m} + \mu; \qquad \frac{d\sigma_V^2}{dt} = -\frac{\sigma_V^2}{\tau_m/2} + \sigma^2, \]

where

\[ \mu = c_\mu \nu + \frac{\mu^{\text{ext}}}{\tau_m}; \qquad \sigma^2 = c_\sigma^2 \nu \text{CV}^2 + \frac{(\sigma^{\text{ext}})^2}{\tau_m/2} \tag{3.3} \]
Figure 2: Mean firing rate ν (left), CV (middle), and product νCV2 (right) of the LIF neuron as a function of the mean and standard deviation of the depolarization. Parameters: Vth = 20 mV, Vres = 10 mV, τm = 10 ms, and τref = 2 ms.
and

\[ \nu(\mu_V, \sigma_V)^{-1} = \tau_{\text{ref}} + \tau_m \sqrt{\pi} \int_{\frac{V_{\text{res}} - \mu_V}{\sigma_V \sqrt{2}}}^{\frac{V_{\text{th}} - \mu_V}{\sigma_V \sqrt{2}}} dx\, e^{x^2} \left[ 1 + \text{erf}(x) \right] \tag{3.4} \]

\[ \text{CV}^2(\mu_V, \sigma_V) = 2\pi (\nu \tau_m)^2 \int_{\frac{V_{\text{res}} - \mu_V}{\sigma_V \sqrt{2}}}^{\frac{V_{\text{th}} - \mu_V}{\sigma_V \sqrt{2}}} dx\, e^{x^2} \int_{-\infty}^{x} dy\, e^{y^2} \left[ 1 + \text{erf}(y) \right]^2. \tag{3.5} \]
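Before mapping out the fixed points systematically, note that this reduced system can be integrated directly. The sketch below runs the protocol used later in Figure 4a (a transient elevation of µ^ext) by forward Euler on the moment equations, reading ν and CV from the quadrature sketch of section 2.3 (assumed here to be importable from a hypothetical module lif_moments). Parameters follow the Figure 4 caption; this naive integrator is only a sketch and is not guaranteed to reproduce the published traces exactly.

```python
import numpy as np
from lif_moments import rate_and_cv   # hypothetical module holding the 2.3 sketch

TAU_M = 0.010                         # s
C_MU, C_SIG2 = 7.2, 1.0 ** 2          # mV, mV^2 (Figure 4 caption)
MU_EXT0, SIG_EXT2 = 18.0, 0.65 ** 2   # baseline external drive
DT = 2e-4                             # Euler step (slow: two quadratures per step)

muV, sigV2 = 18.0, 0.43               # start near the low-activity state
for step in range(15_000):            # 3 s, as in Figure 4a
    t = step * DT
    mu_ext = 19.0 if 0.0 <= t < 0.5 else MU_EXT0   # transient elevation
    nu, cv = rate_and_cv(muV, np.sqrt(sigV2))
    mu = C_MU * nu + mu_ext / TAU_M                 # equation 3.3
    sig2 = C_SIG2 * nu * cv ** 2 + SIG_EXT2 / (TAU_M / 2.0)
    muV += DT * (-muV / TAU_M + mu)                 # moment dynamics, eq. 2.7
    sigV2 += DT * (-sigV2 / (TAU_M / 2.0) + sig2)

print(rate_and_cv(muV, np.sqrt(sigV2)))  # should sit at the elevated fixed point
```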
The parameters c_µ and c_σ² measure the strength of the feedback that the activity in the column produces on the mean and variance of the current to the cells. c_µ can be less than, equal to, or greater than zero. A value greater (less) than zero implies that the activity in the column has a net excitatory (inhibitory) effect on the neurons. In general, we assume the positive parameter c_σ² to be independent of c_µ (implying that the recurrent feedback on the mean and on the variance can be manipulated independently). Note, however, that since j_I/j_E > 0, c_σ² cannot be arbitrarily small; that is, c_σ² > c_µ²/c. Equations 3.4 and 3.5 are plotted as a function of µ_V and σ_V in Figure 2.

3.2.1 Mean- and Fluctuation-Driven Bistability in the Reduced System. The nullclines for the two equations 3.3 are given by

\[ \nu(\mu_V, \sigma_V) = \frac{\mu_V - \mu^{\text{ext}}}{\tau_m c_\mu}; \qquad \nu(\mu_V, \sigma_V)\, \text{CV}^2(\mu_V, \sigma_V) = \frac{\sigma_V^2 - (\sigma^{\text{ext}})^2}{(\tau_m/2)\, c_\sigma^2}. \]
The values of (µ_V, σ_V) that satisfy these equations are shown in Figure 3 for several values of the parameters c_µ, c_σ. The nullclines for the mean (see Figure 3, left) are the projection onto the (µ_V, σ_V) plane of the intersection of the surface in Figure 2, left, with a plane parallel to the σ_V axis, shifted by µ^ext and tilted with slope 1/(τ_m c_µ). Since the mean firing
[Figure 3 appears here. Left: µ_V nullclines for c_µ < 0, c_µ = 0, and c_µ > 0, with µ^ext = 15 mV. Right: σ_V nullclines for c_σ = 0 and c_σ > 0, with σ^ext = 2 mV.]
Figure 3: Nullclines of the equation for the mean µ_V (left) and standard deviation σ_V (right). The nullcline for the mean depends only on c_µ, and the one for the standard deviation depends only on c_σ.
rate as a function of the average depolarization changes curvature (it has an inflection point near threshold, where firing changes from being driven by fluctuations to being driven by the mean), the nullcline for the mean has a "knee" when the net feedback c_µ is large enough and excitatory. Similarly, the nullclines for the standard deviation of the depolarization (see Figure 3, right) are the projection onto the (µ_V, σ_V) plane of the intersection of the surface in Figure 2, right, with a parabolic surface parallel to the µ_V axis, shifted by (σ^ext)² and with curvature 2/(τ_m c_σ²). Again, this curve can display a "knee" for high enough values of the net strength of the feedback onto the variance, c_σ². The fixed points of the system are given by the points of intersection of the two nullclines. We now show, through two particular examples, the main result of this letter: depending on the degree of balance between excitation and inhibition, two types of bistability can exist: mean driven and fluctuation driven.

Mean-Driven Bistability. Figure 4 shows a particular example of the type of bistability obtained for low-moderate values of c_σ and moderate-high values of c_µ. Figure 4a shows the time evolution of the firing rate (bottom) and CV (top) in the network when the external drive to the cells is transiently elevated. In response to the transient input, the network switches from a low-rate, high-CV basal state into an elevated activity state. For this particular type of bistability, the CV in this state is low. The nullclines for this example are shown in Figure 4b. The σ_V nullcline is essentially parallel to the µ_V axis, and it intercepts the µ_V nullcline (which has a pronounced "knee") at three points: one below (stable), one around (unstable), and one above
[Figure 4 appears here. The two fixed points marked in panel b: Rate = 35.5 Hz, CV = 0.24 (elevated state) and Rate = 1.42 Hz, CV = 0.94 (basal state).]
Figure 4: Example of mean-driven bistability. (a) CV (top) and firing rate (bottom) in the network as a function of time. Between t = 0 s and t = 0.5 s (dashed lines), the mean of the external drive to the neurons was elevated from 18 mV to 19 mV, causing the network to switch to its elevated activity fixed point. In this fixed point, the CV is low. (b) Nullclines for this example. The two stable fixed points differ primarily in the mean current that the cells are receiving, with an essentially constant variance. Hence, the CV in the elevated persistent-activity fixed point is low. Parameters: µext = 18 mV, σ ext = 0.65 mV, c µ = 7.2 mV, c σ = 1 mV. Dotted line: neuronal threshold.
(stable) threshold. The stable fixed point below threshold corresponds to the state of the system before the external input is transiently elevated in Figure 4a. It is typically characterized by a low rate and a relatively high CV, as subthreshold spiking is fluctuation driven and thus irregular. However, since the CV drops fast for µV > Vth at the values of σV at which the µV
nullcline bends (see Figure 2, middle), the CV in the suprathreshold fixed point is typically low (but see section 4). Fluctuations in the current to the cells play little role in determining the spike times of the cells in this elevated persistent activity state. Qualitatively, this is the type of excitation-driven bistability that has been thoroughly analyzed over the past few years. It is expected to be present in networks in which small populations of excitatory cells can be bistable in the presence of global inhibition by virtue of selective synaptic potentiation. It is also expected to be present in larger populations if the fluctuations arising from the recurrent connections are weak compared to those coming from external afferents, for instance, due to synaptic filtering (Wang, 1999).

Fluctuation-Driven Bistability. If the connectivity is such that the mean drive to the neurons is only weakly dependent on the activity, that is, c_µ is small, but at the same time the activity has a strong effect on the variance, that is, c_σ is large, the system can also have two stable fixed points, as shown in the example in Figure 5 (same format as in the previous figure). In this situation, however, the two fixed points are subthreshold, and they differ primarily in the variance of the current to the cells. Hence, spiking in both fixed points is fluctuation driven, and the CV is high in both of them; in particular, it is slightly higher in the elevated activity fixed point (see Figure 5a). This type of bistability can be realized only if there is a precise balance between the net excitatory and inhibitory drive to the cells. Since c_σ must be large in order for the σ_V nullcline to display a "knee," both the net excitation and inhibition should be large, and in these conditions, a small c_µ can be achieved only if the balance between the two is precise. This suggests that this regime will be sensitive to changes in the parameters determining the connectivity; that is, it will require fine-tuning, a conclusion that is supported by the analysis below. Mean- and fluctuation-driven bistability are not discrete phenomena. Depending on the values of the parameters, the elevated activity fixed point can rotate in the (µ_V, σ_V) plane, spanning values intermediate between those shown in the examples in Figures 4 and 5. We thus now proceed to a systematic characterization of all possible behaviors of the reduced system as a function of its four parameters.

3.2.2 Effect of the External Current. Since c_σ² > 0, the σ_V nullcline always bends upward (see Figure 3); that is, the values of σ_V on the nullcline are always larger than σ^ext. Assuming for simplicity that c_σ can be arbitrarily low, this implies that no bistability can exist unless the external variance is low enough. In particular, for every value of the external mean µ^ext, there is a critical value σ^ext_{c1}, defined as the value of σ_V at which the first two derivatives of the µ_V nullcline vanish (see Figure 6, middle). For σ^ext > σ^ext_{c1}, the two nullclines can cross only once, and therefore no bistability of any
[Figure 5 appears here. The two fixed points marked in panel b: Rate = 59.3 Hz, CV = 1.17 (elevated state) and Rate = 2.4 Hz, CV = 1.02 (basal state).]
Figure 5: Example of fluctuation-driven bistability. (a) CV (top) and firing rate (bottom) in the network as a function of time. Between t = 0 s and t = 0.5 s (dashed lines), the standard deviation of the external drive to the neurons was elevated from 5 mV to 7 mV, causing the network to switch to its elevated activity fixed point. In this fixed point, the CV is slightly higher than in the basal state. (b) Nullclines for this example. The two stable fixed points differ primarily in the variance of the current that the cells are receiving, with little change in the mean. Hence, the CV in the elevated persistent-activity fixed point is slightly higher than in the low-activity fixed point; that is, the CV increases with the rate. Parameters: µext = 5 mV, σ ext = 5 mV, c µ = 5 mV, c σ = 20.2 mV. Dotted line: Neuronal threshold.
kind is possible in the reduced system (see Figure 6, left). For values of σ ext only slightly lower than this critical value, the jump between the low- and the high-activity stable fixed points in the (µV , σV ) plane is approximately horizontal, so the type of bistability obtained is mean driven. For lower
Figure 6: The external variance determines the existence and types of bistability possible in the network. For σ^ext > σ^ext_{c1} (left), no bistability is possible. σ^ext = σ^ext_{c1} marks the onset of bistability (middle). At σ^ext = σ^ext_{c2}, bistability becomes possible in a perfectly balanced network (a network with c_µ = 0) (right).
values of the external variance, a point is eventually reached at which bistability becomes possible in a perfectly balanced network. Again, for each µ^ext, one can define a second critical value σ^ext_{c2} in the following way: σ^ext_{c2} is the value of σ^ext at which the point where the derivative at the inflection point of the σ_V nullcline is infinite occurs at a value of µ_V equal to µ^ext (see Figure 6, right). For values of σ^ext < σ^ext_{c2}, bistability is possible in networks in which the net recurrent feedback is inhibitory. Since both critical values of σ^ext are functions of the external mean, they define curves in the (µ^ext, σ^ext) plane. These curves are plotted in Figure 7. Both are decreasing functions of µ^ext and meet at threshold, implying that bistability in the reduced network is possible only for subthreshold mean external inputs (see section 4).

3.2.3 Phase Diagrams of the Reduced Network. For each point in the (µ^ext, σ^ext) plane, the external current is completely characterized, and the only two parameters left to be specified are c_µ, c_σ. In particular, in the regions where bistability is possible, it will exist only for appropriate values of c_µ and c_σ. The two insets in Figure 7 show phase diagrams in the (c_µ, c_σ) plane showing the regions of bistability at two representative points: one in which σ^ext_{c2} < σ^ext < σ^ext_{c1}, in which bistability is possible only in excitation-dominated networks (top-right inset), and one in which σ^ext < σ^ext_{c2}, in which bistability is possible in both excitation- and inhibition-dominated networks (bottom-left inset). In this latter case, the region enclosed by the curve in which bistability can exist stretches to the left, including the region with c_µ ≤ 0. We have further characterized the nature of the fixed-point solutions in these two cases by plotting the rate and CV at each point in the (c_µ, c_σ) plane at which bistability is possible, as well as the ratio between the rate and CV in the high- versus low-activity states. Instead of showing this in the
Figure 7: Phase diagram with the effect of the mean and variance of the external current on the existence and types of bistability in the network. The two insets represent the regions of bistability in the (c_µ, c_σ) plane at the corresponding points in the (µ^ext, σ^ext) plane. Fluctuation-driven bistability is possible only near and below the lower critical line σ^ext = σ^ext_c2(µ^ext). Top-right and bottom-left insets correspond to µ^ext = 10 mV, σ^ext = 4 mV and µ^ext = 10 mV, σ^ext = 3 mV, respectively.
(c_µ, c_σ) plane, we have inverted Equations 3.2 to show (assuming a constant c = 100) the results as a function of the unitary EPSP j_E and of the ratio of the unitary inhibitory to excitatory PSPs, j_I/j_E, which measures the degree of balance in the network; that is, j_I/j_E = 1 implies perfect balance:
j_E = [c_µ + √(2c c_σ² − c_µ²)] / (2c);   j_I/j_E = [√(2c c_σ² − c_µ²) − c_µ] / [√(2c c_σ² − c_µ²) + c_µ].   (3.6)
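As a numerical sanity check of equation 3.6, the sketch below inverts (c_µ, c_σ) back to (j_E, j_I/j_E) and verifies the round trip. The underlying definitions c_µ = c(j_E − j_I) and c_σ² = c(j_E² + j_I²) are our assumption (they are the ones under which equation 3.6 is an exact inverse); all numerical values are illustrative.

```python
import numpy as np

def couplings_from_c(c_mu, c_sigma, c=100):
    """Equation 3.6: recover the unitary EPSP j_E and the balance ratio j_I/j_E."""
    root = np.sqrt(2 * c * c_sigma**2 - c_mu**2)
    j_E = (c_mu + root) / (2 * c)
    ratio = (root - c_mu) / (root + c_mu)
    return j_E, ratio

# Round trip, assuming c_mu = c (j_E - j_I) and c_sigma^2 = c (j_E^2 + j_I^2):
j_E, j_I, c = 0.2, 0.18, 100
c_mu = c * (j_E - j_I)                      # 2.0 mV
c_sigma = np.sqrt(c * (j_E**2 + j_I**2))    # ~2.69 mV
print(couplings_from_c(c_mu, c_sigma, c))   # -> (0.2, 0.9) as expected
```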
In Figure 8 we show the results for the case where the external variance is higher than σ^ext_c2, so that bistability is possible only if the net recurrent connectivity is excitatory. Overall, the shape of the bistable region in this space is a diagonal to the right. This means that closer to the balanced region, the net excitatory drive (proportional to j_E) has to be higher in order for bistable solutions to exist. The low-activity fixed point (left column) is subthreshold, and thus spiking is fluctuation driven, characterized by a high CV. In this case, the high-activity fixed point is suprathreshold, so the CV in this fixed point is in general small (see the bottom-right panel). Of course,
Figure 8: Bistability in an excitation-dominated network in the (j_E, j_I/j_E) plane. (Top and bottom left) Firing rate and CV in the low-activity fixed point. (Top and bottom right) Ratio of the firing rate and CV between the high- and low-activity fixed points. (Inset) Same as the bottom-right panel in the (c_µ, c_σ) plane. Parameters: c = 100, µ^ext = 10 mV, and σ^ext = 4 mV.
very close to the cusp, at the onset of bistability, the CV (and rate) in both fixed points is similar.

In Figure 9 the same information is shown for the case where the external variance is lower than σ^ext_c2, so that bistability is possible when the recurrent connectivity is dominated by excitation or by inhibition. In this case, the region where the CV in the high- and low-activity fixed points is similar is larger, corresponding to situations in which j_I/j_E ∼ 1, that is, excitation and inhibition in the network are roughly balanced. Only in this region is the firing rate in the high-activity state not too high, ≲ 100 Hz. In this regime, when excitation dominates, the rate in the high-activity state becomes very high. Note also that the transition between the relatively low-rate, high-CV regime and the very high-rate, low-CV regime at j_I/j_E ∼ 0.9 is relatively abrupt.

Finally, we can use the relations 3.6 to study quantitatively the effect of the number of connections c on the regions of bistability, something we
Figure 9: Mean- and fluctuation-driven bistability in the (j_E, j_I/j_E) plane. Panels as in Figure 8. (Inset) Portion of the bistability region with fluctuation-driven fixed points in the (c_µ, c_σ) plane. When the network is approximately balanced, that is, j_I/j_E ∼ 1, the CV in the high-activity state is high. Parameters: c = 100, µ^ext = 10 mV, and σ^ext = 3 mV.
did in section 3.1 at a qualitative level based on scaling considerations. Figure 10 shows the effect of increasing the number of afferent connections per cell on the shape of the region of bistability for σ^ext = 4 mV (left) and for σ^ext = 3 mV (right). The results are clearer when shown in the (j_E c, j_I/j_E) plane, where j_E c represents the net mean excitatory drive to the neurons. The range of values of the net excitatory drive in which bistability is allowed in the excitation-dominated regime, where j_I/j_E < 1, does not depend very strongly on c. However, for both σ^ext = 4 mV and σ^ext = 3 mV, when inhibition and excitation become more balanced, a higher net excitatory drive is needed. In particular, when σ^ext = 3 mV, the bistable region always includes the balanced network, j_I/j_E = 1, but the range of values of j_I/j_E ∼ 1 where bistability is possible (being in this case fluctuation driven) shrinks considerably. Thus, as noted in section 3.1, bistability in a large, balanced network requires a precise balance of excitation and inhibition.
Figure 10: Effect of the number of connections c on the phase diagram in the (j_E c, j_I/j_E) plane for the case where µ^ext = 10 mV and σ^ext = 4 mV (left) and σ^ext = 3 mV (right). Note that in the right panel, j_I/j_E = 1 is always in the region of bistability, but the range of values of j_I/j_E ∼ 1 in the bistable region decreases significantly with c.
The precise balance between excitation and inhibition required to obtain solutions with fluctuation-driven bistability is also evident when one analyzes the effect of changing the net input to the excitatory and inhibitory subpopulations within a column. In this case, the excitatory and inhibitory mean firing rate and CV become different. We have chosen to study the effect of different ratios of excitation and inhibition to the excitatory and inhibitory populations. In particular, defining

γ_E ≡ j_EI / j_EE   and   γ_I ≡ j_II / j_IE,
we have considered the effect of having γ_E ≠ γ_I while still assuming that the excitatory connections to the excitatory and inhibitory populations are equal, that is, j_EE = j_IE ≡ j_E. To proceed with the analysis, we started by specifying a point in the parameter space of the symmetric network in which excitation and inhibition were identical by choosing a value for (µ^ext, σ^ext, c_µ, c_σ). Then, fixing c = 200, we used the relationships 3.6 to solve for j_E and γ and, defining γ_E ≡ γ, we found which values of γ_I resulted in bistable solutions. Correspondingly, when γ_I = γ_E, the two subpopulations within the column become identical again. We performed this analysis for two initial sets of parameters of the symmetric network: one corresponding to mean-driven and the other to fluctuation-driven bistability. The results of this analysis are shown in Figure 11.
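As a sketch of the first step of this procedure (the bistability scan itself requires the full mean-field rate equations, which are not reproduced here), one can recover the symmetric-network couplings from equation 3.6 for the fluctuation-driven parameter set used in Figure 11; the helper below is the same hypothetical inversion used earlier, with the c_µ and c_σ definitions assumed there.

```python
import numpy as np

def couplings_from_c(c_mu, c_sigma, c):
    root = np.sqrt(2 * c * c_sigma**2 - c_mu**2)
    return (c_mu + root) / (2 * c), (root - c_mu) / (root + c_mu)

# Fluctuation-driven starting point of Figure 11 (right column), with c = 200:
j_E, gamma = couplings_from_c(c_mu=0.5, c_sigma=19.0, c=200)
print(j_E, gamma)   # j_E ~ 0.95 mV, gamma ~ 0.997: large, nearly balanced couplings
# gamma_E is fixed to this value of gamma; gamma_I is then scanned around it.
```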
Figure 11: Effect of different levels of input to the excitatory and inhibitory subpopulations. The ratio γ between the inhibitory and excitatory connection strengths was allowed to be different for each subpopulation. Left and right columns correspond to mean- and fluctuation-driven bistability in the corresponding situation for the symmetric network. The network is bistable for values of γ_I/γ_E within the dashed lines. Parameters for the left column: µ^ext = 18 mV, σ^ext = 0.65 mV, c_µ = 7.2 mV, c_σ = 1 mV. Parameters for the right column: µ^ext = 10 mV, σ^ext = 3 mV, c_µ = 0.5 mV, c_σ = 19 mV.
The type of bistability does not change qualitatively depending on the value of γ_I/γ_E in the mean-driven case (left column). For the right column, however, the original fluctuation-driven regime is quickly abolished as γ_I/γ_E increases, leading to very high activity and low CV in the high-activity fixed point. Note that the size of the bistable region is also much smaller in this case.

3.3 Numerical Simulations. We conducted numerical simulations of our network to investigate whether the two types of bistable states that the mean-field analysis predicts, the mean-driven and the fluctuation-driven regimes, can be realized. In addition to the approximations that we are forced to make in order to construct the mean-field theory itself, the more qualitative and robust result that fluctuation-driven bistable points require large fluctuations in relatively small networks with relatively large
synapses leads to the question of whether potential fixed points in this very noisy network will indeed be stable. We found that under certain conditions, both types of bistable points can be realized in numerical simulations. In Figure 12 we show an example of a network supporting bistability in the mean-driven regime. In this network, the recurrent connections are dominated by excitation, and the mean of the external drive leaves the neurons very close to threshold, with the fluctuations of this external drive being small. As expected, the irregularity in the elevated-activity fixed point is low, with a CV ∼ 0.2. The mean µ_V in this fixed point is above threshold. The mean-field theory predicts for the same network a rate of 46.7 Hz and a CV of 0.21. An example of another network showing bistability, this time in the fluctuation-driven regime, is shown in Figure 13. In this network, the recurrent connections are dominated by inhibition, and the external drive leaves the membrane potential relatively far from threshold on average but has large fluctuations. Taking into account only consecutive ISIs, the temporal irregularity is still large: CV2 ∼ 0.8. The spike trains in the elevated activity state are quite irregular, partly, but not only, due to large temporal fluctuations in the instantaneous network activity. The mean-field prediction for these parameters gives a rate of 91.5 Hz and a CV of 1.6. In order for the elevated activity states to be stable, in both the mean-driven and, especially, the fluctuation-driven regimes, we needed to use a wide distribution of synaptic delays in the recurrent connections: between 1 and 10 ms for Figure 12 and between 1 and 50 ms for Figure 13. Narrow distributions of delays lead to oscillations, which destabilize the stationary activity states. The emergence and properties of these oscillations in a network similar to the one we study here have been described in Brunel (2000a). Although such long synaptic delays are not expected to be found in connections between neighboring local cortical neurons, our network is extremely simple and lacks many elements of biological realism that would work in the same direction as the wide distributions of delays. Among these are slow and saturating synaptic interactions (NMDA-mediated excitation; Wang, 1999) and heterogeneity in cellular and synaptic properties. The large and slow temporal fluctuations in the instantaneous rate in Figure 13 are due to the large fluctuations in the nearly balanced external and recurrent drive to the cells and the wide distribution of synaptic delays. These fluctuations lead to high trial-to-trial variability in the activity of the network, as shown in Figure 14. In this figure, we show nine trials with identical parameters as in Figure 13 and only different seeds for the random number generator. In each panel, the mean instantaneous activity across all nine trials (the thick line) is shown along with the activity in the individual trial. Sometimes the large fluctuations lead to the activity returning to the basal spontaneous state. Other times they provoke long-lasting periods of elevated firing (above average). Nevertheless, on a large fraction of the trials, a memory of the stimulus persists for several seconds.
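For reference, the two irregularity measures quoted above can be computed from a simulated spike train as in the sketch below; we assume the common definition CV2 = ⟨2|ISI_{n+1} − ISI_n| / (ISI_{n+1} + ISI_n)⟩ over consecutive interspike intervals, which is insensitive to slow rate modulations, whereas the ordinary CV is not.

```python
import numpy as np

def cv_and_cv2(spike_times):
    """CV of the ISI distribution, and the local measure CV2 computed from
    consecutive ISI pairs (assumed definition: 2|I2 - I1| / (I2 + I1), averaged)."""
    isi = np.diff(np.sort(np.asarray(spike_times)))
    cv = isi.std() / isi.mean()
    cv2 = np.mean(2 * np.abs(isi[1:] - isi[:-1]) / (isi[1:] + isi[:-1]))
    return cv, cv2

# For a stationary Poisson train, both measures are close to 1:
rng = np.random.default_rng(0)
train = np.cumsum(rng.exponential(scale=0.1, size=2000))  # ~10 Hz for ~200 s
print(cv_and_cv2(train))
```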
Figure 12: Numerical simulations of a bistable network in the mean-driven regime. The rate of the external afferents was raised between 200 and 300 ms (vertical bars). (Top) Raster display of the activity of 200 neurons in the network. (Middle) Instantaneous network activity (temporal bin of 10 ms). The dashed line represents the average network activity during the delay period, 53.4 Hz. (Lower panels) Distribution across cells of the rate (left), CV (middle), and CV2 (right) during the delay period. The fact that the CV and CV2 are very similar reflects the stationarity of the instantaneous activity. Single-cell parameters as in the caption of Figure 2. The network consists of two populations of excitatory and inhibitory cells (1000 neurons each) connected at random with 0.1 probability. Delays are uniformly distributed between 1 and 10 ms. External spikes are all excitatory, with PSP size 0.09 mV. The external rate is 19.25 kHz. This leads to µ^ext = 17.325 mV and σ^ext = 0.883 mV. Recurrent EPSPs and IPSPs are 0.138 mV and −0.05 mV, respectively, leading to c_µ = 8.8 mV and c_σ = 1.468 mV.
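The external-drive statistics quoted in the captions of Figures 12 and 13 are consistent with the standard diffusion-approximation relations for delta-like PSPs, µ = Σ J ν τ_m and σ² = Σ J² ν τ_m / 2, with a membrane time constant of τ_m = 10 ms. This value is inferred from the quoted numbers (the single-cell parameters are given in Figure 2, which is not shown here):

```python
import numpy as np

TAU_M = 0.010  # membrane time constant (s), inferred from the quoted numbers

def external_stats(psp_rate_pairs):
    """Mean and s.d. of the free membrane potential for Poisson inputs:
    mu = sum J*nu*tau_m, sigma^2 = sum J^2*nu*tau_m/2 (diffusion approximation)."""
    mu = sum(J * nu * TAU_M for J, nu in psp_rate_pairs)
    var = sum(J**2 * nu * TAU_M / 2 for J, nu in psp_rate_pairs)
    return mu, np.sqrt(var)

print(external_stats([(0.09, 19250)]))              # Fig. 12: (17.325, 0.883)
print(external_stats([(1.85, 780), (-1.85, 500)]))  # Fig. 13: (5.18, 4.68)
```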
Figure 13: Numerical simulations of a bistable network in the fluctuation-driven regime. Panels as in Figure 12. Parameters as in Figure 12 except the distribution of delays, which is uniform between 1 and 50 ms. External spikes are excitatory, with PSP size 1.85 mV and rate 0.78 kHz, and inhibitory, with PSP size −1.85 mV and rate 0.5 kHz. This leads to µ^ext = 5.18 mV and σ^ext = 4.68 mV. Recurrent EPSPs and IPSPs are 1.85 mV and −1.98 mV, respectively, leading to c_µ = −13 mV and c_σ = 27.1 mV.
Figure 14: Trial-to-trial variability in the fluctuation-driven regime. Each panel is a different repetition of the same trial, in a network identical to the one described in Figure 13. The thick line represents the average across all nine trials, and the thin line is the instantaneous network activity in the given trial. Vertical bars mark the time during which the rate of the external inputs is elevated.
In the mean-driven regime, the trial-to-trial variability is very low (not shown). We conclude that despite quantitative differences in the rate and CV between the mean-field theory and the simulations, it is possible, albeit difficult, to find both mean-driven and fluctuation-driven bistability in small networks of LIF neurons.

4 Discussion

In this letter, we have aimed at an understanding of the different ways in which a simple network of current-based LIF neurons can be organized in order to support bistability, the coexistence of two steady-state solutions for the activity of the network that can be selected by transient external stimulation. We have shown that in addition to the well-known case in which strong excitatory feedback can lead to bistability, bistability can also be obtained when the recurrent connectivity is nearly balanced, or even when its net effect is inhibitory, provided that an increase in the activity in
the network provides a large enough increase in the size of the fluctuations of the current afferent to the cells. When bistability is obtained in this fashion, the CV in both steady states is close to one, as found experimentally (see Figure 1; Compte et al., 2003). We have done a systematic analysis at the mean-field level (and a partial one through numerical simulations) of a reduced network where the activity in the excitatory and the inhibitory subpopulations was equal by construction (implying balance at the level of the output activity) and studied which types of bistable solutions are obtained depending on the level of balance in the currents (the parameter c_µ), that is, balance at the level of the inputs. This simple model allows for a complete understanding of the role of the different model parameters.

The first phenomenon, which we have termed mean-driven bistability, can essentially be traced back to the shape of the curve relating the mean firing rate of the cell to the average current it is receiving (at a constant noise level; Brunel, 2000b), that is, the f-I curve. In order for bistability to exist, this curve should be a sigmoid, for which it is enough that the neurons possess a threshold and a refractory period. If, in addition, the low-activity fixed point is to have nonzero activity (consistent with the fact that cells in the cortex fire spontaneously), then the neuron should display nonzero activity for subthreshold mean currents. This can be achieved if the current is noisy, where the noise is due to the spiking irregularity of the inputs to the cell. When this type of bistability is considered in a network of LIF neurons, the mean current to the cells in the high-activity fixed point is above threshold. Under general assumptions, this leads invariably to fairly regular spiking in this high-activity fixed point. Of course, tuning the parameters of the current in such a way that the mean current in the high-activity fixed point is only very marginally suprathreshold will result in only a small decrease of the CV with respect to the low-activity fixed point (e.g., Figure 2 in Brunel & Wang, 2001). On the other hand, in this scenario, it is relatively easy (it does not take much tuning) for the firing rate in the elevated persistent activity state not to be very much higher than that in the low-activity state, for example, below 100 Hz (see Figure 8).
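This noisy f-I curve can be sketched with the classic first-passage-time formula for the LIF neuron driven by white noise (e.g., Ricciardi, 1977): subthreshold mean drive still produces a nonzero rate, and the curve saturates through the refractory period. Threshold, reset, and time constants below are illustrative choices, not the parameters of this paper.

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import erf

def lif_rate(mu, sigma, theta=20.0, v_reset=10.0, tau_m=0.010, tau_ref=0.002):
    """Stationary rate (Hz) of an LIF neuron with white-noise input of mean mu
    and s.d. sigma (mV). Note exp(u**2) overflows for very large |u|; the
    moderate values used here are safe."""
    integrand = lambda u: np.exp(u**2) * (1.0 + erf(u))
    lo, hi = (v_reset - mu) / sigma, (theta - mu) / sigma
    integral, _ = quad(integrand, lo, hi)
    return 1.0 / (tau_ref + tau_m * np.sqrt(np.pi) * integral)

for mu in (15.0, 25.0):                 # sub- and suprathreshold mean drive
    print(mu, lif_rate(mu, sigma=5.0))  # nonzero rate even for mu < theta
```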
When the recurrent connectivity is balanced, bistable solutions can exist in which both fixed points are subthreshold, so that spiking in both fixed points is fluctuation driven and thus fairly irregular. This can be the case if the fluctuations in the depolarization due to current from outside the column are low enough (see Figure 6). However, in order for these solutions to exist, first, the overall inputs to the excitatory and the inhibitory subpopulations should be close enough (ensuring balance at the level of the firing activity in the network); second, both of these inputs, the one to the excitatory and the one to the inhibitory subpopulation, should themselves also be balanced (be composed of similar amounts of excitation and inhibition); and third, both the net excitatory drive and the net inhibitory drive to the cells should be large. This third condition, if the first two are satisfied, results in a high effective fluctuation feedback gain: an increase in the activity of the cells results in a large increase in the size of the fluctuations in the afferent current to the neurons (a large value of the parameter c_σ). However, it also implies that the excitation-inhibition balance condition will be quite stringent; it will require tuning, especially when the network is large. In addition, if this balance is slightly broken, since both excitation and inhibition are large, the corresponding firing rate in the elevated persistent activity state becomes very large, for example, significantly higher than 100 Hz (see Figure 9). In fact, based just on scaling considerations (see section 3.1), one can conclude that this type of bistability can be present (unless one allows for perfect tuning) only in small networks. If the network is large, the excitation-inhibition balance condition has, in general, a single (albeit very robust) solution (van Vreeswijk & Sompolinsky, 1996, 1998). It is intriguing, however, that several lines of evidence in fact suggest a fairly precise balance of local excitation and inhibition in cortical circuits, at both the output level (Rao et al., 1999) and the input level (Anderson, Carandini, & Ferster, 2000; Shu, Hasenstaub, & McCormick, 2003; Marino et al., 2005).
4.1 Limitations of the Present Approach. Most of the results we have presented are based on an exhaustive analysis of the stationary states of a mean-field description of a simple network of LIF neurons. Several limitations of our approach should be noted. First, in order to be able to go beyond the Poisson assumption, we have had to make a number of approximations (discussed in section 2) that are expected to be valid only in limited regions of the large parameter space. Second, we have focused only on the stationary fixed points of the system, neglecting an examination of any oscillatory solutions. Oscillations in networks of LIF neurons in the high-noise regime have been extensively studied by Brunel and collaborators (see, e.g., Brunel & Hakim, 1999; Brunel, 2000a; Brunel & Wang, 2003). Third, in order to be able to provide an analytical description, we have considered a very simplified network lacking many aspects of biological realism known to affect network dynamics, most importantly a more realistic description of synaptic dynamics (Wang, 1999; Brunel & Wang, 2001). Finally, the use of a mean-field description based on the diffusion approximation to study small networks with large synaptic PSPs might lead to problems, since the diffusion approximation assumes these PSPs are (infinitely) small. Large PSPs might lead to fluctuations that are too strong, which would destabilize the analytically predicted fixed points. In order to check that the main qualitative conclusion of this study was not an artifact of the mean-field approach, we simulated the network of LIF neurons used in the mean-field description, adding synaptic delays. Provided the distribution of delays was wide, we observed both types of bistable solutions. However, as expected, the fluctuation-driven persistent activity states show large temporal fluctuations that sometimes are enough to destabilize them.
The evidence we have provided is suggestive, but given the limitations listed above, it is not conclusive. Addressing the limitations of this work will involve using recently developed analytical techniques (Moreno-Bote & Parga, 2006), along with a systematic exploration of the behavior of more realistic recurrent networks through numerical simulations.

4.2 Self-Consistent Second-Order Statistics. The extended mean-field theory we have used builds on the landmark study by Amit and Brunel (1997b), which provided a general-purpose theory for the study of the different types of steady states in recurrent networks of spiking neurons in the presence of noise, while leaving room for different degrees of biophysical realism. Our contribution has been to try to go beyond the Poisson assumption in order to allow a self-consistent solution for the second-order statistics of the spike trains in the network. If the spike trains are assumed to be Poisson, there is only one parameter to be determined self-consistently: the firing rate. Under the approximations made in this letter, the statistics are characterized by two parameters, the firing rate and the CV, which provides information about the degree of irregularity of the spiking activity.

In order to go beyond the Poisson assumption, we have assumed that the spike trains in the network can be described as renewal processes with a very short correlation time. In these conditions, for time windows large compared to this correlation time, the Fano factor of the process is constant, but instead of being one, as for a Poisson process, it is equal to CV². This motivates our strategy of neglecting the temporal aspect of the deviation from Poisson, which is extremely complicated to deal with analytically, and keeping only its effect on the amplitude of the correlations. We have done this by using the expressions for the rate and CV of the first passage time of the OU process with a renormalized variance that takes into account the CV of the inputs. If the time constant of the autocorrelation of the process is exactly zero, this approximation becomes exact (Moreno et al., 2002), so we have assumed it will still be qualitatively valid if the correlation time constant is small. In this way, we have been able to solve for the CV of the neurons in the steady states self-consistently.

It has to be stressed that the fact that the individual inputs to a neuron are considered independent does not imply that the overall input process, made of the sum of each individual component, is Poisson. Informally, in order for the superposition to converge to a (homogeneous) Poisson process of rate λ, two conditions have to be met: given any set S on the time axis (say, any time interval), calling N_i^1 the probability of observing one spike in S from process i, and N_i^{≥2} the probability of observing two or more spikes in S from process i, the superposition of the i = 1, . . . , N processes will converge to a Poisson process if lim_{N→∞} Σ_{i=1}^{N} N_i^1 = λ|S| (with max_i{N_i^1} → 0 as N → ∞) and if lim_{N→∞} Σ_{i=1}^{N} N_i^{≥2} = 0 (see, e.g., Daley & Vere-Jones, 1988). The autocorrelated renewal processes that we consider in this letter do not meet the second condition, which can also be seen in the fact that the superposition process has an autocorrelation given by equation 2.2, not by a Dirac delta function, as would be the case for a Poisson process. Despite this, it might be the case that if, instead of any set S, one considers only a given time window T, both conditions could approximately be met in T, and we could say that the superposition process is locally Poisson in T. Whether this locally Poisson train will have the same effect on the postsynaptic cell as a real Poisson train of the same rate depends on a number of factors and has been studied in detail in Moreno et al. (2002) for the case of exponential autocorrelations. Other types of autocorrelation structures, for instance, regular spike trains, could lead to different results. This is an open problem.
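A quick numerical illustration of this point (ours; all parameters arbitrary): pooling many sparse renewal trains that are individually quite regular yields a pooled ISI distribution with CV near 1, even though, as argued above, the superposition is not exactly Poisson because its autocorrelation is not a delta function.

```python
import numpy as np

rng = np.random.default_rng(1)

def gamma_renewal_train(rate, order, t_max):
    """Renewal train with gamma-distributed ISIs; CV = 1/sqrt(order) (0.5 for order 4)."""
    isi = rng.gamma(order, 1.0 / (order * rate), size=int(3 * rate * t_max) + 10)
    t = np.cumsum(isi)
    return t[t < t_max]

trains = [gamma_renewal_train(rate=1.0, order=4, t_max=100.0) for _ in range(200)]
pooled = np.sort(np.concatenate(trains))
isi = np.diff(pooled)
print(isi.std() / isi.mean())   # close to 1, although each train has CV = 0.5
```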
4.3 Current-Based versus Conductance-Based Descriptions. We have analyzed a network of current-based LIF neurons. The motivations for this choice are that current-based LIF neurons are simpler to analyze, especially in the presence of noise, than conductance-based LIF neurons, and also that there were a number of unresolved issues raised in the current-based framework that we have made an attempt to clarify. In particular, we were interested in understanding whether the framework of Amit and Brunel (1997b) could be used to produce bistable solutions in balanced networks like those studied in Tsodyks and Sejnowski (1995) and van Vreeswijk and Sompolinsky (1996, 1998) outside the context of multistability in recurrent networks. An important issue has been the relationship between the different possible scalings of the connection strengths with the number of afferents and the types of bistability attainable in large networks, when the number of afferents per cell tends to infinity. This analysis shows that large, homogeneous networks using the J ∼ 1/√C scaling needed to retain a significant amount of fluctuation at large C do not support bistability in a robust manner, a result already implicitly present in van Vreeswijk and Sompolinsky (1996, 1998). Reasonably robust bistability in homogeneous balanced networks requires that they be small.

Does one expect these conclusions to hold qualitatively if one considers the more realistic case of conductance-based synaptic inputs to the cells? The answer to this question is uncertain. In particular, scaling relationships between J and C, absolutely unavoidable in current-based scenarios to keep a finite input to the cells in the extensive-C limit, are not necessary when synaptic inputs are assumed to induce a transient change in conductance. In the presence of conductances, the steady-state voltage is automatically independent of C for large C, regardless of the value of the unitary synaptic conductances. In fact, assuming a simple model for a cell having only leak, excitatory, and inhibitory synaptic conductances, the steady-state voltage in the absence of threshold is given by

V_ss = (g_L/g_Tot) V_L + (g_E/g_Tot) V_E + (g_I/g_Tot) V_I,
where V_L, V_E, V_I are the reversal potentials of the respective currents; g_L, g_E, g_I are the total leak, excitatory, and inhibitory conductances in the cell; and g_Tot = g_L + g_E + g_I. (This expression ignores the effect of temporal fluctuations in the conductances, but it is a good approximation, since the mean conductance, being by definition positive, is expected to be much larger than its fluctuations.) The steady-state voltage is just the average of the different reversal potentials, each weighted by the relative amount of conductance that the respective current is carrying. Of course, each of the total synaptic conductances is proportional to the number of inputs, but since the steady-state voltage is just a weighted sum, it does not explode even if C tends to infinity. It might seem that, in fact, the infinite-C limit leads to an ill-defined model, as the membrane time constant vanishes in this case as 1/C. If C_m is the membrane capacitance, then

τ_m = C_m/g_Tot ∼ 1/C,
assuming that g_{E,I} ∼ C. We believe, however, that this is an artifact due to an incorrect way of defining the model in the large-C limit. It implicitly assumes that the number of synaptic inputs grows at a constant membrane area, thus increasing the local channel density indefinitely. A more appropriate way of taking the large-C limit is to fix the relative densities of the different channels per unit area and then let the area become large. In this case, both C_m and the total leak conductance of the cell (proportional to the number of leak channels) will grow with the total area. This way of taking the limit respects the well-known decrease in membrane time constant as the synaptic input to the cell grows, while retaining a well-defined, nonzero membrane time constant in the extensive-C limit (in this case, the range of values that τ_m can take is determined by the local differences in channel density, which is independent of the total channel number). A crucial difference with the current-based cell is the behavior of the variance of the depolarization in the large-C limit. A simple estimate of this variance can be obtained by ignoring the threshold and considering only the low-pass filtering effect of the membrane (with time constant τ_m) on a gaussian noise current of variance σ_I² and time constant τ_s. It is straightforward to calculate the variance of the depolarization in these conditions, resulting in

σ_V² = (σ_I²/g_Tot²) · τ_s/(τ_s + τ_m).
If the inputs are independent, both the variance of the current and the total conductance of the cell are proportional to C, which implies that σ_V² ∼ 1/C for large C. Therefore, the statistics of the depolarization in conductance-based and current-based neurons show a very different dependence on the number of inputs to the cell. In particular, it is unclear whether the main organizational principle behind the balanced state in the current-based framework, that is, the J ∼ 1/√C scaling that is needed to retain a finite variance in the C → ∞ limit and that leads to the set of linear equations that specify the single solution for the activity in the balanced network, is relevant in a conductance-based framework. A rigorous study of this problem is beyond the scope of this work, but it is one of the outstanding challenges for understanding the basic principles of cortical organization.
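The three expressions above can be combined into a one-screen sketch of the claimed scalings; all parameter values are arbitrary and chosen only to expose the C-dependence (V_ss bounded, τ_m ∼ 1/C at fixed membrane area, σ_V ∼ 1/√C). The current variance is represented schematically by a term proportional to C, as in the text.

```python
import numpy as np

V_L, V_E, V_I = -70.0, 0.0, -80.0        # reversal potentials (mV), illustrative
g_L, C_m, tau_s = 10.0, 100.0, 0.002     # leak conductance, capacitance, syn. time const.
g_e1, g_i1 = 0.01, 0.04                  # unitary conductances, illustrative

for C in (100, 1000, 10000):
    g_E, g_I = C * g_e1, C * g_i1        # total synaptic conductances grow with C
    g_tot = g_L + g_E + g_I
    V_ss = (g_L * V_L + g_E * V_E + g_I * V_I) / g_tot   # stays bounded as C grows
    tau_m = C_m / g_tot                                   # shrinks like 1/C
    var_I = C * 1.0                      # schematic: current variance grows like C
    sigma_V = np.sqrt(var_I / g_tot**2 * tau_s / (tau_s + tau_m))  # ~ 1/sqrt(C)
    print(C, round(V_ss, 2), round(sigma_V, 5))
```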
4.4 Correlations and Synaptic Time Constants. Our mean-field description assumes that the network is in the extreme sparse limit, N ≫ C, in which the fraction of common input shared by two neurons is negligible, leading to vanishing correlations between the afferent currents to different cells in the large-C, large-N limit. This is a crucial assumption, since it causes the variance of the depolarization in the network to be the sum of the variances of the individual spike trains, that is, proportional to C. If the correlation coefficient is finite as C → ∞, the variance is proportional to C² (see, e.g., Moreno et al., 2002). In a current-based network, a J ∼ 1/C scaling would then lead to a nonvanishing variance in the large-C limit without a stringent balance condition, and in a conductance-based network, it would lead to a C-independent variance for large C. This suggests that correlations between the cells in the recurrent network should have a large effect on both their input-output properties (Zohary, Shadlen, & Newsome, 1994; Salinas & Sejnowski, 2000; Moreno et al., 2002) and the network dynamics. The issue is, however, not straightforward, as simulations of irregular spiking networks with realistic connectivity parameters, which do show weak but significant cross-correlations between neurons (Amit & Brunel, 1997a), seem to be well described by the mean-field theory in which correlations are neglected (Amit & Brunel, 1997a; Brunel & Wang, 2001). Noise correlations measured experimentally are small but significant, with normalized correlation coefficients in the range of a few percent to a few tenths (for a review, see, e.g., Salinas & Sejnowski, 2001). It would thus be desirable to extend the current mean-field theory to incorporate the effect of cross-correlations and to understand under which conditions their effect is important. The first steps in this direction have already been taken (Moreno et al., 2002; Moreno-Bote & Parga, 2006). The arguments of the previous section suggest that a correct treatment of correlations might be especially important in large networks of conductance-based neurons.
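The C versus C² statement follows from the elementary variance of a sum of C inputs with pairwise correlation coefficient ρ, Var = Cσ² + C(C−1)ρσ², which the following lines make concrete (values illustrative):

```python
sigma2, rho = 1.0, 0.05
for C in (100, 1000, 10000):   # any fixed rho > 0 makes the C^2 term dominate
    print(C, C * sigma2 + C * (C - 1) * rho * sigma2)
```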
An issue of similar importance for the understanding of the high irregularity of cortical spike trains is the relationship between the time constant (or time constants) of the inputs to the neuron and its (effective) membrane time constant. Indeed, the need for a balance between excitation and inhibition in order to have high output spiking variability when receiving many irregular inputs exists only if the membrane time constant is relatively long—in particular, long enough that if the afferents are all excitatory, the input fluctuations are averaged out. If the membrane time constant is very short, large, short-lived fluctuations are needed to make the postsynaptic cell fire, and these occur only irregularly, even if all the afferents are excitatory (see, e.g., Figure 1 in Shadlen & Newsome, 1994). These considerations seem relevant since cortical cells receive large numbers of inputs that have spontaneous activity, thus putting the cell into a high-conductance state (see, e.g., Destexhe, Rudolph, & Pare, 2003) in which its effective membrane time constant is short—on the order of only a few milliseconds (Bernander, Douglas, Martin, & Koch, 1991; Softky, 1994). It has also been recognized that when the synaptic time constant is large compared to the membrane time constant, spiking in the subthreshold regime becomes very irregular, and in particular, the distribution of firing rates becomes bimodal. Qualitatively, in these conditions, the depolarization follows the current instead of integrating it. Relative to the timescale of the membrane, fluctuations are long-lived, and this separates two different timescales for spiking (which result in the bimodality of the firing-rate distribution), depending on whether the size of a fluctuation is such that the total current is subthreshold (i.e., no spiking, leading to a large peak of the firing-rate histogram at zero) or suprathreshold (leading to a nonzero peak in the firing-rate distribution) (Moreno-Bote & Parga, 2005). In these conditions, neurons seem “bursty,” and the CV of the ISI is high. Interestingly, recent evidence confirms this bimodality of the firing-rate distribution in spiking activity recorded in vivo in the visual cortex (Carandini, 2004). Increases in the synaptic-to-membrane time constant ratio leading to more irregular spiking can be due to a number of factors: a very short membrane time constant if the neuron is in a high-conductance state, a relatively long excitatory synaptic drive if there is a substantial NMDA component in the EPSPs, or even long-lived dendrosomatic current sources, for instance, due to the existence of “calcium spikes” generated in the dendrites. There is evidence that irregular current applied to the dendrites of pyramidal cells results in higher CVs than the same current applied to the soma (Larkum, Senn, & Luscher, 2004).

4.5 Parameter Fine-Tuning. In order for both stable firing rate states of the networks we have studied to display significant spiking irregularity, the afferent current to the cells in both states needs to be subthreshold. We have shown that this requires a significant amount of parameter fine-tuning, especially when the number of connections per neuron is large. Parameter
fine-tuning is a problem, since biological networks are heterogeneous and cellular and synaptic properties change in time. Regarding this issue, though, some considerations are in order. First, the model we have considered is extremely simple, especially at the single-neuron level. We have already pointed out possible consequences of considering properties such as longer synaptic time constants or some degree of correlations between the spiking activity of different neurons. Another biophysical property that we expect to have a large impact is short-term synaptic plasticity. In the presence of depressing synapses, the postsynaptic current is no longer linear in the presynaptic firing rate, thus acting as an activity-dependent gain control mechanism (Tsodyks & Markram, 1997; Abbott, Varela, Sen, & Nelson, 1997). It remains to be explored to what extent balanced bistability in networks of neurons exhibiting these properties becomes a more robust phenomenon. Second, synaptic weights (as well as intrinsic properties; Desai, Rutherford, & Turrigiano, 1999) can adapt in an activity-dependent manner to keep the overall activity in a recurrent network within an appropriate operational range (Turrigiano, Leslie, Desai, Rutherford, & Nelson, 1998). Delicate computational tasks, which seem to require fine-tuning, can be rendered robust through the use of these types of activity-dependent homeostatic rules (Renart, Song, & Wang, 2003). It will be interesting to study whether homeostatic plasticity (Turrigiano, 1999) can be used to relax some of the fine-tuning constraints described in this letter.

4.6 Multicolumnar Networks and Hierarchical Organization. The fact that bistability is not a robust property of large, homogeneous balanced networks suggests that the functional units of working memory could correspond to small subpopulations (Rao et al., 1999). In addition, we have shown that bistability in a small, reduced network is possible only for subthreshold external inputs (see section 3.2.2). At the same time, it is known that a nonzero activity balanced state requires a very large (suprathreshold) excitatory drive (see section 3.1 and van Vreeswijk & Sompolinsky, 1998). This seems to point to a hierarchical organization: large networks receive massive excitation from long-distance projections, and this external excitation sets up a balanced state in the network. Globally, the activity in the large, balanced network follows the external input linearly. This large, balanced network then provides an already balanced (subthreshold) input to smaller subcomponents, which, in these conditions (in particular, if the variance of this subthreshold input is small enough; see Figure 7), can display more complex nonlinear behavior such as bistability. From the point of view of the smaller subnetworks, the balanced subthreshold input can be considered external, since the size of this network is too small to make a difference in the global activity of the larger network (despite being recurrently connected, the activities in the large and small networks effectively
decouple). In the cortex, the larger balanced network could correspond to a whole cortical column. Indeed, long-range projections between columns are mostly excitatory (see, e.g., Douglas & Martin, 2004). Within a column, the smaller networks that interact through both excitation and inhibition could anatomically correspond to microcolumns (Rao et al., 1999) or, more generally, define functional assemblies (Hebb, 1949).

5 Summary

General principles of cortical organization (large numbers of active synaptic inputs per neuron) and function (irregular spiking statistics) put strong constraints on working memory models of spiking neurons. We have provided evidence that a network of current-based LIF neurons can exhibit bistability with the high persistent activity driven by either the mean or the fluctuations in the input to the cells. The fluctuation-driven bistability regime requires a strict excitation-inhibition balance that needs parameter tuning. It remains a challenge for future research to analyze systematically the conditions under which nonlinear phenomena such as bistability can exist robustly in large networks of more biophysically plausible conductance-based and correlated spiking neurons. It is also conceivable that additional biological mechanisms, such as homeostatic regulation, are important for solving the fine-tuning problem and ensuring a desired excitation-inhibition balance in cortical circuits. Progress in this direction will provide insight into the microcircuit mechanisms of working memory, such as that found in the prefrontal cortex.

Acknowledgments

We are indebted to Jaime de la Rocha for providing the code for the numerical simulations and to Albert Compte for providing the data for Figure 1. A.R. thanks N. Brunel for pointing out previous related work, and A. Amarasingham for discussions on point processes. Support was provided by the National Institute of Mental Health (MH62349, DA016455), the A. P. Sloan Foundation and the Swartz Foundation, and the Spanish Ministry of Education and Science (BFM 2003-06242).

References

Abbott, L. F., Varela, J. A., Sen, K., & Nelson, S. B. (1997). Synaptic depression and cortical gain control. Science, 275, 220–224.
Abramowitz, M., & Stegun, I. A. (1970). Tables of mathematical functions. New York: Dover.
Amit, D. J., & Brunel, N. (1997a). Dynamics of a recurrent network of spiking neurons before and following learning. Network, 8, 373–404.
Amit, D. J., & Brunel, N. (1997b). Model of global spontaneous activity and local structured activity during delay periods in the cerebral cortex. Cerebral Cortex, 7, 237–252.
Amit, D. J., & Tsodyks, M. V. (1991). Quantitative study of attractor neural network retrieving at low spike rates II: Low-rate retrieval in symmetric networks. Network, 2, 275–294.
Anderson, J. S., Carandini, M., & Ferster, D. (2000). Orientation tuning of input conductance, excitation, and inhibition in cat primary visual cortex. J. Neurophysiol., 84, 909–926.
Bair, W., Zohary, E., & Newsome, W. T. (2001). Correlated firing in macaque visual area MT: Time scales and relationship to behavior. J. Neurosci., 21, 1676–1697.
Ben-Yishai, R., Lev Bar-Or, R., & Sompolinsky, H. (1995). Theory of orientation tuning in visual cortex. Proc. Natl. Acad. Sci. USA, 92, 3844–3848.
Bernander, O., Douglas, R. J., Martin, K. A., & Koch, C. (1991). Synaptic background activity influences spatiotemporal integration in single pyramidal cells. Proc. Natl. Acad. Sci. USA, 88, 11569–11573.
Brunel, N. (2000a). Dynamics of sparsely connected networks of excitatory and inhibitory spiking neurons. J. Comput. Neurosci., 8, 183–208.
Brunel, N. (2000b). Persistent activity and the single cell f-I curve in a cortical network model. Network, 11, 261–280.
Brunel, N., & Hakim, V. (1999). Fast global oscillations in networks of integrate-and-fire neurons with low firing rates. Neural Computation, 11, 1621–1671.
Brunel, N., & Wang, X.-J. (2001). Effects of neuromodulation in a cortical network model of object working memory dominated by recurrent inhibition. J. Comput. Neurosci., 11, 63–85.
Brunel, N., & Wang, X.-J. (2003). What determines the frequency of fast network oscillations with irregular neural discharges? I. Synaptic dynamics and excitation-inhibition balance. J. Neurophysiol., 90, 415–430.
Cai, D., Tao, L., Shelley, M., & McLaughlin, D. W. (2004). An effective kinetic representation of fluctuation-driven networks with application to simple and complex cells in visual cortex. Proc. Natl. Acad. Sci., 101, 7757–7762.
Carandini, M. (2004). Amplification of trial-to-trial response variability by neurons in visual cortex. PLOS Biol., 2, E264.
Chafee, M. V., & Goldman-Rakic, P. S. (1998). Matching patterns of activity in primate prefrontal area 8a and parietal area 7ip neurons during a spatial working memory task. J. Neurophysiol., 79, 2919–2940.
Compte, A., Brunel, N., Goldman-Rakic, P. S., & Wang, X.-J. (2000). Synaptic mechanisms and network dynamics underlying spatial working memory in a cortical network model. Cerebral Cortex, 10, 910–923.
Compte, A., Constantinidis, C., Tegnér, J., Raghavachari, S., Chafee, M., Goldman-Rakic, P. S., & Wang, X.-J. (2003). Temporally irregular mnemonic persistent activity in prefrontal neurons of monkeys during a delayed response task. J. Neurophysiol., 90, 3441–3454.
Cox, D. R. (1962). Renewal theory. New York: Wiley.
Daley, D. J., & Vere-Jones, D. (1988). An introduction to the theory of point processes. New York: Springer.
Desai, N. S., Rutherford, L. C., & Turrigiano, G. G. (1999). Plasticity in the intrinsic excitability of cortical pyramidal neurons. Nat. Neurosci., 2, 515–520.
Destexhe, A., Rudolph, M., & Pare, D. (2003). The high-conductance state of neocortical neurons in vivo. Nat. Rev. Neurosci., 4, 739–751.
Douglas, R. J., & Martin, K. A. (2004). Neuronal circuits of the neocortex. Ann. Rev. Neurosci., 27, 419–451.
Funahashi, S., Bruce, C. J., & Goldman-Rakic, P. S. (1989). Mnemonic coding of visual space in the monkey's dorsolateral prefrontal cortex. J. Neurophysiol., 61, 331–349.
Fuster, J. M., & Alexander, G. (1971). Neuron activity related to short-term memory. Science, 173, 652–654.
Gerstein, G. L., & Mandelbrot, B. (1964). Random walk models for the spike activity of a single neuron. Biophys. J., 4, 41–68.
Gillespie, D. T. (1992). Markov processes: An introduction for physical scientists. Orlando, FL: Academic Press.
Gnadt, J. W., & Andersen, R. A. (1988). Memory related motor planning activity in posterior parietal cortex of macaque. Exp. Brain Res., 70, 216–220.
Hansel, D., & Mato, G. (2001). Existence and stability of persistent states in large neuronal networks. Phys. Rev. Lett., 86, 4175–4178.
Harsch, A., & Robinson, H. P. (2000). Postsynaptic variability of firing in rat cortical neurons: The roles of input synchronization and synaptic NMDA receptor conductance. J. Neurosci., 20, 6181–6192.
Hebb, D. O. (1949). Organization of behavior. New York: Wiley.
Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci. USA, 79, 2554–2558.
Larkum, M. E., Senn, W., & Luscher, H. M. (2004). Top-down dendritic input increases the gain of layer 5 pyramidal neurons. Cereb. Cortex, 14, 1059–1070.
Marino, J., Schummers, J., Lyon, D. C., Schwabe, L., Beck, O., Wiesing, P., & Obermayer, K. (2005). Invariant computations in local cortical networks with balanced excitation and inhibition. Nat. Neurosci., 8, 194–201.
Mascaro, M., & Amit, D. J. (1999). Effective neural response function for collective population states. Network, 10, 351–373.
Mattia, M., & Del Giudice, P. (2000). Efficient event-driven simulation of large networks of spiking neurons and dynamical synapses. Neural Computation, 12, 2305–2329.
Miller, E. K., Erickson, C. A., & Desimone, R. (1996). Neural mechanisms of visual working memory in prefrontal cortex of the macaque. J. Neurosci., 16, 5154–5167.
Miyashita, Y., & Chang, H. S. (1988). Neuronal correlate of pictorial short-term memory in the primate temporal cortex. Nature, 331, 68–70.
Moreno, R., de la Rocha, J., Renart, A., & Parga, N. (2002). Response of spiking neurons to correlated inputs. Phys. Rev. Lett., 89, 288101.
Moreno-Bote, R., & Parga, N. (2005). Membrane potential and response properties of populations of cortical neurons in the high conductance state. Phys. Rev. Lett., 94, 088103.
Moreno-Bote, R., & Parga, N. (2006). Auto- and cross-correlograms for the spike response of LIF neurons with slow synapses. Phys. Rev. Lett., 96, 028101.
Rao, S. G., Williams, G. V., & Goldman-Rakic, P. S. (1999). Isodirectional tuning of adjacent interneurons and pyramidal cells during working memory: Evidence for microcolumnar organization in PFC. J. Neurophysiol., 81, 1903–1916.
Renart, A. (2000). Multi-modular memory systems. Unpublished doctoral dissertation, Universidad Autónoma de Madrid.
Renart, A., Song, P., & Wang, X.-J. (2003). Robust spatial working memory through homeostatic synaptic scaling in heterogeneous cortical networks. Neuron, 38, 473–485.
Ricciardi, L. M. (1977). Diffusion processes and related topics in biology. Berlin: Springer-Verlag.
Romo, R., Brody, C. D., Hernández, A., & Lemus, L. (1999). Neuronal correlates of parametric working memory in the prefrontal cortex. Nature, 399, 470–474.
Salinas, E., & Sejnowski, T. J. (2000). Impact of correlated synaptic input on output firing rate and variability in simple neuronal models. Journal of Neuroscience, 20, 6193–6209.
Salinas, E., & Sejnowski, T. J. (2001). Correlated neuronal activity and the flow of neural information. Nat. Rev. Neurosci., 2, 539–550.
Shadlen, M. N., & Newsome, W. T. (1994). Noise, neural codes and cortical organization. Current Opinion in Neurobiol., 4, 569–579.
Shadlen, M. N., & Newsome, W. T. (1998). The variable discharge of cortical neurons: Implications for connectivity, computation, and information coding. J. Neurosci., 18, 3870–3896.
Shinomoto, S., Sakai, Y., & Funahashi, S. (1999). The Ornstein-Uhlenbeck process does not reproduce spiking statistics of neurons in prefrontal cortex. Neural Comput., 11, 935–951.
Shu, Y., Hasenstaub, A., & McCormick, D. A. (2003). Turning on and off recurrent balanced cortical activity. Nature, 423, 288–293.
Softky, W. R. (1994). Sub-millisecond coincidence detection in active dendritic trees. Neuroscience, 58, 13–41.
Softky, W. R., & Koch, C. (1993). The highly irregular firing of cortical cells is inconsistent with temporal integration of random EPSPs. J. Neurosci., 13, 334–350.
Tsodyks, M. V., & Markram, H. (1997). The neural code between neocortical pyramidal neurons depends on neurotransmitter release probability. Proc. Natl. Acad. Sci. USA, 94, 719–723.
Tsodyks, M. V., & Sejnowski, T. (1995). Rapid state switching in balanced cortical network models. Network, 6, 111–124.
Turrigiano, G. G. (1999). Homeostatic plasticity in neuronal networks: The more things change, the more they stay the same. Trends in Neurosci., 22, 221–227.
Turrigiano, G. G., Leslie, K. R., Desai, N. S., Rutherford, L. C., & Nelson, S. B. (1998). Activity-dependent scaling of quantal amplitude in neocortical neurons. Nature, 391, 892–896.
van Vreeswijk, C., & Sompolinsky, H. (1996). Chaos in neuronal networks with balanced excitatory and inhibitory activity. Science, 274, 1724–1726.
van Vreeswijk, C., & Sompolinsky, H. (1998). Chaotic balanced state in a model of cortical circuits. Neural Computation, 10, 1321–1371.
Wang, X.-J. (1999). Synaptic basis of cortical persistent activity: The importance of NMDA receptors to working memory. J. Neurosci., 19, 9587–9603.
Zador, A. M., & Stevens, C. F. (1998). Input synchrony and the irregular firing of cortical neurons. Nat. Neurosci., 1, 210–217.
Zohary, E., Shadlen, M. N., & Newsome, W. T. (1994). Correlated neuronal discharge rate and its implications for psychophysical performance. Nature, 370, 140–143.
Received August 22, 2005; accepted May 17, 2006.
LETTER
Communicated by Alain Destexhe
Exact Subthreshold Integration with Continuous Spike Times in Discrete-Time Neural Network Simulations

Abigail Morrison
[email protected]
Computational Neurophysics, Institute of Biology III, and Bernstein Center for Computational Neuroscience, Albert-Ludwigs-University, 79104 Freiburg, Germany
Sirko Straube [email protected] Computational Neurophysics, Institute of Biology III, Albert-Ludwigs-University, 79104 Freiburg, Germany
Hans Ekkehard Plesser
[email protected]
Department of Mathematical Sciences and Technology, Norwegian University of Life Sciences, N-1432 Ås, Norway
Markus Diesmann [email protected] Computational Neurophysics, Institute of Biology III, and Bernstein Center for Computational Neuroscience, Albert-Ludwigs-University, 79104 Freiburg, Germany
Very large networks of spiking neurons can be simulated efficiently in parallel under the constraint that spike times are bound to an equidistant time grid. Within this scheme, the subthreshold dynamics of a wide class of integrate-and-fire-type neuron models can be integrated exactly from one grid point to the next. However, the loss in accuracy caused by restricting spike times to the grid can have undesirable consequences, which has led to interest in interpolating spike times between the grid points to retrieve an adequate representation of network dynamics. We demonstrate that the exact integration scheme can be combined naturally with off-grid spike events found by interpolation. We show that by exploiting the existence of a minimal synaptic propagation delay, the need for a central event queue is removed, so that the precision of event-driven simulation on the level of single neurons is combined with the efficiency of time-driven global scheduling. Further, for neuron models with linear subthreshold dynamics, even local event queuing can be avoided, resulting in much greater efficiency on the single-neuron level. These ideas are exemplified by two implementations of a widely used neuron model. We present a measure for the efficiency of network simulations in terms of their integration error and show that for a wide range of input spike rates, the novel techniques we present are both more accurate and faster than standard techniques.

Neural Computation 19, 47–79 (2007)    © 2006 Massachusetts Institute of Technology

1 Introduction

A major problem in the simulation of cortical neural networks has been their high connectivity. With each neuron receiving input from the order of 10^4 other neurons, simulations are demanding in terms of memory as well as simulation time requirements. Techniques to study such highly connected systems of 10^5 and more neurons, using distributed computing, are now available (see, e.g., Hammarlund & Ekeberg, 1998; Harris et al., 2003; Morrison, Mehring, Geisel, Aertsen, & Diesmann, 2005). The question remains as to whether a time-driven or event-driven simulation algorithm (Fujimoto, 2000; Zeigler, Praehofer, & Kim, 2000; Sloot, Kaandorp, Hoekstra, & Overeinder, 1999; Ferscha, 1996) should be used. At first glance, the choice of event-driven algorithms seems natural, because a neuron can be described as emitting and absorbing point events (spikes) in continuous time. In fact, for neuron models with linear subthreshold dynamics and postsynaptic potentials without rise time, highly efficient algorithms exist (see, e.g., Mattia & Del Giudice, 2000). These exploit the fact that threshold crossings can occur only at the impact times of excitatory events. If more general types of neuron models are considered, the global algorithmic framework becomes much more complicated. For example, each neuron may be required to "look ahead" to determine when it will fire in the absence of new events. The global algorithm then either updates the neuron with the shortest latency or delivers the event with the most imminent arrival time (whichever is shorter) and revises the latency calculations for the neurons receiving the event. (See Marian, Reilly, & Mackey, 2002; Makino, 2003; Rochel & Martinez, 2003; Lytton & Hines, 2005; and Brette, 2006, for refined versions of such an algorithm.) This decision process clearly comes at a cost and becomes unwieldy for networks of high connectivity: if each neuron is receiving input spikes at a conservative average rate of 1 Hz from each of 10^4 synapses, it needs to process a spike every 0.1 ms, and this limits the characteristic integration step size. Therefore, time-driven algorithms have been found useful for the simulation of large, highly connected networks. Here, each neuron is updated on an equidistant time grid, and the emission and absorption times of spikes are restricted to the grid (see section 3). The temporal spacing of the grid is called the computation time step h. Consider a network of 10^5 neurons as described above. At a computation time step of 0.1 ms, a time-driven algorithm carries out 10^9 neuron updates per second, the same number as required for the event-driven algorithm. In this situation, the time-driven scheme is necessarily faster than the event-driven scheme because the costs of the actual updates are the same and there is no overhead caused by the scheduling of events.
However, in this letter, we criticize this view and argue that in order to arrive at a relevant measure of efficiency, simulation time should be analyzed as a function of the integration error rather than the update interval. Whether a time-driven or event-driven scheme yields a better performance from this perspective depends on the required accuracy of the simulation and the network spike rate, and is not immediately apparent from considerations of complexity.

In the time-driven framework, Rotter and Diesmann (1999) showed that for a wide class of neuron models, the linearity of the subthreshold dynamics can be exploited to integrate the neuron state exactly from one grid point to the next by performing a single matrix-vector multiplication. Here, the computation time step simultaneously determines the accuracy with which incoming spikes influence the subthreshold dynamics and the timescale at which threshold crossings can be detected. However, Hansel, Mato, Meunier, and Neltner (1998) showed that forcing spikes onto the grid can significantly distort the synchronization dynamics of certain networks. Reducing the computation step ameliorates the problem only slowly, as the integration error declines linearly with h (see section 5.1). The problem was solved (Hansel et al., 1998; Shelley & Tao, 2001) by interpolating the membrane potential between grid points to give a better approximation of the time of threshold crossing and evaluating the effect of incoming spikes on the neuronal state in continuous time.

In this work, we demonstrate that the techniques developed for the exact integration of the subthreshold dynamics (Rotter & Diesmann, 1999) and for the interpolation of spike times (Hansel et al., 1998; Shelley & Tao, 2001) can be successfully combined. By requiring that the minimal synaptic propagation delay be at least as large as the computation time step, all events can be queued at their target neurons rather than relying on a central event queue to maintain causality in the network. This reduces the complexity of the global scheduling algorithm—that is, deciding which neuron should be updated, and how far—to the simple time-driven case, whereby each neuron in turn is advanced in time by a fixed amount. Therefore, the global overhead costs are no more than in a traditional discrete-time simulation, and yet on the level of the individual neuron, spikes can be processed and emitted in continuous time with the accuracy of an event-driven algorithm. This approach represents a hybridization of traditional time-driven and event-driven algorithms: the scheme is time driven on the global level to advance the system time but event driven on the level of the individual neurons.

The exact integration method is predicated on the linearity of the subthreshold dynamics. We show that this property can be further exploited, as the order of incoming events is not relevant for calculating the neuron state. This completely removes the need for storing and sorting individual events, and therefore also for dynamic data structures, while maintaining the high precision of the event-driven approach.
We illustrate these ideas by comparing three implementations of the same widely used integrate-and-fire neuron model. As in previous work, the scaling behavior of the integration error is considered as a function of the computational resolution. However, in contrast to these works, we also analyze the run time and memory consumption of a large neuronal network model as a function of integration error, thus defining a measure of efficiency that can be applied to any competing model implementations. This analysis reveals that the novel scheme of embedding continuous-time implementations in a discrete-time framework can in many cases result in simulations that are both more accurate and faster than a given purely discrete-time simulation. This is possible because the new scheme achieves the same accuracy at larger computation time steps. Depending on the rate of events to be processed by the neuron, the gain in simulation speed due to an increased step size can more than compensate for the increased complexity of processing continuous-time input events. The scheme presented is well suited for distributed computing.

In section 2, we describe the neuron model used as an example in the remainder of the article, and then we review the techniques for integrating the dynamics of a neural network in discrete time steps in section 3. Subsequently, in section 4, we present two implementations solving the single-neuron dynamics between grid points but handling the incoming and emitted spikes in continuous time. The performance of these implementations with respect to integration error, run time, and memory requirements is analyzed in section 5. We show that the choice of which implementation should be used for a given problem depends on a trade-off between these factors. The concepts of time-driven and event-driven simulation of large neural networks are discussed in section 6 in the light of our findings. The numerical techniques underlying the reported results are given in the appendix.

The conceptual and algorithmic work described here is a module in our long-term collaborative project to provide the technology for neural systems simulations (Diesmann & Gewaltig, 2002). Preliminary results have been presented in abstract form (Morrison, Hake, Straube, Plesser, & Diesmann, 2005).

2 Example Neuron Model

Although the methods in this letter can be applied to any neuron model reducible to a system of linear differential equations, for clarity, we compare various implementations of one particular physical model: a current-based integrate-and-fire neuron with postsynaptic currents represented as α-functions. The dynamics of the membrane potential V is

\dot{V} = -\frac{V}{\tau_m} + \frac{1}{C} I,
where τ_m is the membrane time constant, C is the capacitance of the membrane, and I is the input current to the neuron. The current arises as a superposition of the synaptic currents and any external current. The time course of the synaptic current ι due to one incoming spike is

\iota(t) = \hat{\iota}\, \frac{e\, t}{\tau_\alpha}\, e^{-t/\tau_\alpha},
where ι̂ is the peak value of the current and τ_α is the rise time. When the membrane potential reaches a given threshold value Θ, the membrane potential is clamped to zero for an absolute refractory period τ_r. The values for these parameters used in this article are τ_m = 10 ms, C = 250 pF, Θ = 20 mV, τ_r = 2 ms, ι̂ = 103.4 pA, and τ_α = 0.1 ms.

3 Exact Integration of Subthreshold Dynamics in a Discrete-Time Simulation

The dynamics of the neuron model described in section 2 is linear and can therefore be reformulated to give a particularly efficient implementation for a discrete-time simulation (Rotter & Diesmann, 1999). We refer to this traditional, discrete-time approach as the grid-constrained implementation. Making the substitutions

y_1 = \frac{d}{dt}\iota + \frac{1}{\tau_\alpha}\iota, \qquad y_2 = \iota, \qquad y_3 = V,

where y_i is the ith component of the state vector y, we arrive at the following system of linear equations:

\dot{y} = A y = \begin{pmatrix} -\frac{1}{\tau_\alpha} & 0 & 0 \\ 1 & -\frac{1}{\tau_\alpha} & 0 \\ 0 & \frac{1}{C} & -\frac{1}{\tau_m} \end{pmatrix} y, \qquad y(0) = \begin{pmatrix} \hat{\iota}\,\frac{e}{\tau_\alpha} \\ 0 \\ 0 \end{pmatrix},
where y(0) is the initial condition for a postsynaptic potential originating at time t = 0. The exact solution of this system is given by y(t) = P(t) y(0), where P(t) = e^{At} denotes the matrix exponential of At, which is an exact mathematical expression (see, e.g., Golub & van Loan, 1996). For a fixed time step h, the state of the system can be propagated from one grid position to the next by y_{t+h} = P(h) y_t. This is an efficient method because P(h) is constant and has to be calculated only once at the beginning of a simulation. Moreover, P(h) can be obtained in closed form (Diesmann, Gewaltig, Rotter,
& Aertsen, 2001), for example, using symbolic algebra software such as Maple (Heck, 2003) or Mathematica (Wolfram, 2003), and can therefore be evaluated by simple expressions in the implementation language (see also section A.3). The complete update for the neuron state may be written as

y_{t+h} = P(h) y_t + x_{t+h},    (3.1)

assuming incoming spikes are constrained to the grid, as the linearity of the system permits the initial conditions for all spikes arriving at a given grid point to be lumped together into one term,

x_{t+h} = \sum_{k \in S_{t+h}} \begin{pmatrix} \frac{e}{\tau_\alpha}\,\hat{\iota}_k \\ 0 \\ 0 \end{pmatrix}.    (3.2)
Here S_{t+h} is the set of indices k ∈ {1, . . . , K} of synapses that deliver a spike to the neuron at time t + h, and ι̂_k represents the "weight" of synapse k. Note that the ι̂_k may be arbitrarily signed and may also vary over the course of the simulation. The new neuron state y_{t+h} is the exact solution to the subthreshold neural dynamics at time t + h, including all events that arrive at time t + h. This assumes that the neuron does not itself produce a spike in the interval (t, t + h]. If the membrane potential y^3_{t+h} exceeds the threshold value Θ, the neuron communicates a spike event to the network with a time stamp of t + h; the membrane potential is subsequently clamped to 0 in the interval [t + h, t + h + τ_r] (see Figure 1A). The earliest grid point at which a neuron could produce its next spike is therefore t + 2h + τ_r. Note that for a grid-constrained implementation, τ_r must be an integer multiple of h, because the membrane potential is evaluated only at grid points, and we define it to be nonzero.

3.1 Computational Requirements. In order to preserve causality, it is necessary that there is a minimal synaptic delay of h. Otherwise, if a neuron spiked at time t + h and its synapses had a propagation delay of 0, then this event would seem to arrive at some of its targets at t + h and at some of them at t + 2h, depending on the order in which the neurons are updated. In practice, simulations are generally performed with synaptic delays that are greater than the time step h, and so some technique must be used to store events that have already been produced by a neuron but are not due to arrive at their targets for several time steps. In a grid-constrained simulation, only delays that are an integer multiple of h can be considered because incoming spikes can be handled only at grid points. Consequently, pending events can be stored in a data structure analogous to a looped tape device (see Morrison, Mehring, et al., 2005).
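To make the scheme concrete, a grid-constrained update loop along the lines of equations 3.1 and 3.2 can be sketched as follows. This is a minimal sketch rather than the reference implementation: the names (tape, deliver) are ours, the matrix exponential is taken from scipy instead of a closed-form expression, and the refractory period is reduced to a simple clamp.

```python
import numpy as np
from scipy.linalg import expm

# Parameters from section 2 (units: ms, mV, pA, pF).
tau_m, C, theta, tau_a, iota_hat = 10.0, 250.0, 20.0, 0.1, 103.4
h = 0.1                                  # computation time step (ms)
delay_steps = 10                         # synaptic delay of 1 ms, in units of h

# Subthreshold dynamics y' = A y with y = (y1, y2, y3 = V).
A = np.array([[-1.0 / tau_a,  0.0,          0.0],
              [ 1.0,         -1.0 / tau_a,  0.0],
              [ 0.0,          1.0 / C,     -1.0 / tau_m]])
P = expm(A * h)                          # propagator, computed once

y = np.zeros(3)
tape = np.zeros(delay_steps + 1)         # looped tape device: one weight sum per slot
head = 0

def deliver(weight, steps=delay_steps):
    """Queue a spike's weight 'steps' segments ahead of the reading head."""
    tape[(head + steps) % tape.size] += weight

deliver(iota_hat)                        # example input spike, visible 1 ms later

for step in range(5000):                 # 500 ms of grid-constrained simulation
    x1 = np.e / tau_a * tape[head]       # lumped initial condition, equation 3.2
    tape[head] = 0.0
    head = (head + 1) % tape.size
    y = P @ y + np.array([x1, 0.0, 0.0]) # exact update, equation 3.1
    if y[2] >= theta:                    # threshold test on the grid
        # a spike with time stamp (step + 1) * h would be communicated here
        y[2] = 0.0                       # clamp; full refractory logic omitted
```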
Figure 1: Spike generation and refractory periods for a grid-constrained (A) and a continuous-time implementation (B). The spike threshold is indicated by the dashed horizontal line. The solid black curve shows the membrane potential time course for a neuron subject to a suprathreshold constant input current leading to a threshold crossing in the interval (t, t + h]. The gray vertical lines indicate the discrete time grid with spacing h. The refractory period τ_r in this example is set to its minimal value h. Filled circles denote observable values of the membrane potential; unfilled circles denote supporting points that are not observable. (A) In the grid-constrained implementation, the spike is emitted at t + h. During the refractory period τ_r, the membrane potential is clamped to zero. (B) In the continuous-time implementation, the threshold crossing is found by interpolation (here linear: black dashed line) at time t_Θ. The spike is emitted with the time stamp t + h and with an offset with respect to t of δ = t_Θ − t. The neuron is refractory from t_Θ until t_Θ + τ_r, during which period the membrane potential is clamped to zero. At grid point t + 2h, a finite membrane potential can be observed.
If a neuron emits a spike at time t that has a delay of d, the simulation algorithm waits until all neurons have completed their updates for the integration step (t − h, t] and then delivers the event to its target(s). The event is placed in the tape device of the target neuron d/h segments on from the current reading position. It will then become visible to the target neuron at the grid point t + d, when the neuron is performing the update for the integration step (t + d − h, t + d].

Recalling that the initial conditions for all events arriving at a given time may be lumped together (see equation 3.2) and that two of the three components of the initial conditions vector are zero, the segments of the looped tape device can be very simple. Each segment contains just one value, which is incremented by the weight of every spike event delivered there. In other words, when the neuron performs the integration step (t, t + h], the segment visible to the reading head contains the first component of x_{t+h} up to the scaling factor of e/τ_α. In terms of memory, the looped tape device needs as many segments as computation steps are required to cover the maximum synaptic delay, plus an additional segment to represent the events arriving at t + h. Therefore, performing a given simulation with a smaller time step will require more memory. The model described in section 2 has only a single synaptic dynamics and so requires only one tape device; models using multiple types of synaptic dynamics can be implemented in this framework by providing them with the corresponding number of tape devices.

4 Continuous-Time Implementation of Exact Integration

4.1 Canonical Implementation. The most obvious way to reconcile exact integration and precise spike timing within a discrete-time simulation is to store the precise times of incoming events. In order to represent this information, an offset must be assigned to each spike event in addition to the time stamp. This offset is measured from the beginning of the interval in which the spike was produced: a spike generated at time t + δ receives a time stamp of t + h and an offset of δ (see Figure 1B). Given a sorted list of event offsets {δ_1, δ_2, . . . , δ_n} with δ_i ≤ h, which become visible to a neuron in the step (t, t + h], exact integration of the subthreshold dynamics can be performed from the beginning of the time step to the beginning of the list:

y_{t+δ_1} = P(δ_1) y_t + x_{δ_1};

then along the list:

y_{t+δ_2} = P(δ_2 − δ_1) y_{t+δ_1} + x_{δ_2}
  ⋮
y_{t+δ_n} = P(δ_n − δ_{n−1}) y_{t+δ_{n−1}} + x_{δ_n};
and finally from the end of the list to the end of the time step:
y_{t+h} = P(h − δ_n) y_{t+δ_n}.
The final term y_{t+h} is the exact solution for the neuron dynamics at time t + h. This sequence is illustrated in Figure 2B. This is assuming that the neuron does not produce a spike or emerge from its refractory period during this interval. These special cases are described in more detail below.

4.1.1 Spike Generation. In the grid-constrained implementation, the neuron state is inspected at the end of each time step to see if it meets its spiking criteria. In the case of the neuron model described in section 3, the criterion is y^3 ≥ Θ, where Θ is the threshold. In this implementation, the neuron state can be inspected after every step of the process described in section 4.1. If y^3_{t+δ_i} < Θ and y^3_{t+δ_{i+1}} ≥ Θ, then the membrane potential of the neuron reached threshold between t + δ_i and t + δ_{i+1}. As the dynamics of this model is not invertible, the time t_Θ of this threshold passing can be determined only by interpolating the membrane potential in the interval (t + δ_i, t + δ_{i+1}]. For this article, linear, quadratic, and cubic interpolation schemes were investigated. After the threshold crossing, the neuron is refractory for the duration of its refractory period τ_r. The membrane potential y^3 is set to zero and need not be calculated during this period, although the other components of the neuron state continue to be updated as in section 4.1. At the end of the time interval, an event is dispatched with a discrete time stamp of t + h and an offset of δ = t_Θ − t (see Figure 1B).

4.1.2 Emerging from the Refractory Period. The neuron emerges from its refractory period in the time step defined by t < t_Θ + τ_r ≤ t + h (see Figure 1B). In contrast to the grid-constrained implementation, τ_r does not have to be an integer multiple of h. For a continuous time τ_r, a grid position t can always be found such that t_Θ + τ_r comes within (t, t + h]. However, the implementation is simpler when τ_r is a nonzero integer multiple of h. To calculate the neuron state at time t + h exactly, the interval is divided into two subintervals: (t, t_Θ + τ_r] and (t_Θ + τ_r, t + h]. In the first period, the neuron is still refractory, so when performing the exact integration along the incoming events as in section 4.1, y^3 is not calculated and remains at zero. At the end of this period, the neuron state y_{t_Θ+τ_r} is the exact solution for the dynamics at the end of the refractory period. In the second period, the neuron is no longer refractory, and so the exact integration can be performed as usual (including the calculation of y^3). The neuron state y_{t+h} is therefore an exact solution to the neuron dynamics at time t + h.
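A condensed sketch of this update for one step (t, t + h] is given below, with linear interpolation (the lowest order investigated here) for the threshold crossing. It is an illustration under simplifying assumptions, not the reference implementation: the propagator is recomputed for every substep instead of being cached or evaluated in closed form, and the refractory bookkeeping of section 4.1.2 is omitted.

```python
import numpy as np
from scipy.linalg import expm

def update_canonical(y, events, h, A, tau_a, theta):
    """One canonical update step (t, t+h].

    y      -- state (y1, y2, y3 = V) at time t
    events -- iterable of (offset, weight) pairs with 0 < offset <= h
    Returns (state at t+h, spike offset within the step or None).
    """
    spike, t_last = None, 0.0
    for delta, w in sorted(events):              # integrate along the sorted list
        v_old = y[2]
        y = expm(A * (delta - t_last)) @ y       # exact propagation to the event
        if spike is None and v_old < theta <= y[2]:
            frac = (theta - v_old) / (y[2] - v_old)
            spike = t_last + frac * (delta - t_last)   # linear interpolation
        y[0] += np.e / tau_a * w                 # impulse initial condition x
        t_last = delta
    v_old = y[2]
    y = expm(A * (h - t_last)) @ y               # propagate to the end of the step
    if spike is None and v_old < theta <= y[2]:
        frac = (theta - v_old) / (y[2] - v_old)
        spike = t_last + frac * (h - t_last)
    return y, spike
```

A spike found this way would be dispatched with time stamp t + h and the returned offset, as in Figure 1B.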
Figure 2: Handling of incoming spikes in the grid-constrained (A), the canonical (B), and the prescient (C) implementations. In each panel, the solid curve represents the excursion of the membrane potential in response to two incoming spikes (gray vertical lines). Filled circles denote observable values of the membrane potential; unfilled circles denote supporting points that are not observable. The gray horizontal arrows beneath each panel indicate the propagation steps performed during the time step (t, t + h]. Dashed arrows in A indicate that in the grid-constrained implementation, input spikes are effectively shifted to the next point on the grid. The observable membrane potentials following spike impact are identical and exact in B and C but differ from those in A.
4.1.3 Computational Requirements. As in the grid-constrained implementation, a minimum spike propagation delay of h is necessary in order to ensure that all events due at the neuron between t and t + h have arrived by the time the neuron performs its update for that interval. The simple looped event buffer described in section 3.1 must be extended to store the weights and offsets for incoming events separately. As the number of events arriving in an interval cannot be known ahead of time, this structure must be capable of dynamic memory allocation, which reduces its cache effectiveness. However, information about the minimum propagation delay in the network can be utilized to streamline the data structure so that its size does not depend on h, which compensates to some extent for the use of dynamic memory. Finally, as the events may arrive in any order, the buffer must also be capable of sorting the events with respect to increasing offset, which for a general-purpose sorting algorithm has complexity O(n log n). An alternative representation of the timing data is a priority queue; in practice, this was not quite as efficient as the looped tape device.

In contrast to the grid-constrained implementation, delays can now be represented in continuous time. A spike arrival time t + δ + d, where t + δ is a continuous spike generation time and d is a continuous time delay, can always be decomposed into a discrete grid point and a continuous offset δ′. However, for notational and implementational convenience (see section 6), we assume d to be a nonzero integer multiple of h.

4.2 Prescient Implementation. In the implementation described in section 4.1, the neuron state at the end of a time step is calculated by integrating along a sorted list of events. However, as the subthreshold dynamics is linear, it is not dependent on the order of events. In this section, an implementation is presented that exploits this fact to reduce the computational complexity and dynamic memory requirements of the canonical implementation.

4.2.1 Receiving an Event. Consider a spike event generated during the step (t − h, t] with offset δ and transmitted with delay d. This spike will be processed during the update step (t + d − h, t + d], its effect being observable for the first time at t + d. Since the correct spike arrival time is t + d − h + δ, when the algorithm delivers the spike to the neuron, we evolve the effect of the spike input from the arrival time to the end of the interval at t + d using

ỹ_{t+d} = P(h − δ) x.

Therefore, instead of storing the entire event as for the canonical implementation, the three components of its effect on the neuron state can be stored in an event buffer at the position d instead.
Due to the linearity of the system, these components can be summed for all events due to arrive at the neuron in a given time step, regardless of the order in which the algorithm delivers them to the neuron or the order in which they are to become visible to the neuron. As we exploit the fact that the effect of an event on the neuron can be calculated before the event becomes visible to the neuron, we call this the prescient implementation.

4.2.2 Calculating the Subthreshold Dynamics. At the beginning of each time interval (t, t + h], the total effect of all events arriving within that step on the three components of the neuron state at time t + h is already stored in the event buffer. Calculating the neuron state at the end of the time step is therefore simple:

y_{t+h} = P(h) y_t + ỹ_{t+h}.

The new neuron state y_{t+h} is the exact solution to the neuron dynamics at time t + h, including all events that arrived within the step (t, t + h]. This is depicted in Figure 2C. As with the canonical implementation, there are two special cases that need to be treated with more care: a time step in which a spike is generated and one in which the neuron emerges from its refractory period.

4.2.3 Spike Generation. The process that generates a spike is very similar to that for the canonical implementation described in section 4.1.1. In this case, as the timing of the incoming events is no longer known, the neuron state can be inspected only at the end of the time step rather than at each incoming event, and so the length of the interpolation interval is h rather than the interspike interval of incoming events.

4.2.4 Emerging from the Refractory Period. As for the canonical implementation, the time step in which the neuron emerges from its refractory period is divided into two subintervals: (t, t_Θ + τ_r] and (t_Θ + τ_r, t + h]. Setting t_em = t_Θ + τ_r − t, the neuron state at the end of the refractory period can be calculated as follows:

y_{t+t_em} = P(t_em) y_t,
y^3_{t+t_em} ← 0.
Having emerged from the refractory period, the membrane potential is no longer clamped to zero and can develop normally during the remainder of the time step (see Figure 1B):

y_{t+h} = P(h − t_em) y_{t+t_em} + ỹ_{t+h}.
However, this overestimates the effect of the events arriving in the time interval (t, t + h] on the membrane potential. The summation of the components of these events was predicated on the assumption of linear dynamics, but as the membrane potential is clamped to zero until t + t_em, this assumption does not hold. Any events arriving at the neuron before its emergence from its refractory period should have no effect on its membrane potential before this point, yet adding the full value of the third component of ỹ_{t+h} assumes that they do. As a corrective measure, the effect of the new events on the membrane potential can be considered to be linear within the small interval (t, t + h], and the membrane potential can be adjusted accordingly:

y^3_{t+h} ← y^3_{t+h} − γ ỹ^3_{t+h},

with γ = t_em / h.
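The prescient scheme as a whole can be sketched as follows. Again this is an illustration, not the reference implementation: the class and method names are ours, the propagators would in practice be evaluated in closed form rather than via expm for every event, and the spike test and the refractory correction with γ = t_em/h are left to the caller.

```python
import numpy as np
from scipy.linalg import expm

class PrescientNeuron:
    """Sketch of sections 4.2.1-4.2.2: per-step accumulation of propagated effects."""

    def __init__(self, A, h, max_delay_steps, tau_a):
        self.A, self.h, self.tau_a = A, h, tau_a
        self.P = expm(A * h)                           # full-step propagator, cached
        self.buf = np.zeros((max_delay_steps + 1, 3))  # one state sum per slot
        self.head = 0
        self.y = np.zeros(3)

    def deliver(self, weight, delta, delay_steps):
        """Receive a spike with offset delta (0 < delta <= h) and integer delay.

        A real implementation would use the closed form of P(h - delta)
        rather than calling expm for every event."""
        x = np.array([np.e / self.tau_a * weight, 0.0, 0.0])
        y_tilde = expm(self.A * (self.h - delta)) @ x  # evolve effect to grid point
        self.buf[(self.head + delay_steps) % len(self.buf)] += y_tilde

    def step(self):
        """y_{t+h} = P(h) y_t + y~_{t+h}; spike detection and the refractory
        correction (gamma = t_em / h) are the caller's responsibility."""
        y_tilde = self.buf[self.head].copy()
        self.buf[self.head] = 0.0
        self.head = (self.head + 1) % len(self.buf)
        self.y = self.P @ self.y + y_tilde
        return self.y
```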
4.2.5 Computational Requirements. As in the grid-constrained and canonical implementations, a minimum spike propagation delay of h is required to preserve causality. The looped tape device described in section 3.1 needs to be able to store the three components of the neuron state rather than just the weight of the incoming events. Alternatively, three event buffers can be used, capable of storing one component each. Unlike the buffer devices for the canonical implementation, they need to store only one value per time step rather than one for each incoming spike, so there is no time overhead for sorting the values. Moreover, they do not require dynamic memory allocation and so are more cache effective.

5 Performance

In order to compare error scaling for the different implementations and interpolation orders, a simple single-neuron simulation was chosen. As the system is deterministic and nonchaotic, reducing the computation time step h causes the simulation results to converge to the exact solution, so error measures can be well defined. To investigate the costs incurred by simulating at finer resolutions or using computationally more expensive off-grid neuron implementations, a network simulation was chosen. This is fairer than a single-neuron simulation, as the run-time penalties of applications requiring more memory will come into play only if the application is large enough not to fit easily into the processor's cache memory. Furthermore, it is only when performing network simulations that the bite of long simulation times per neuron is really felt.
Figure 3: Scaling of error in membrane potential as a function of the computational resolution in double logarithmic representation. (A) Canonical implementation. (B) Prescient implementation. No interpolation, circles; linear interpolation, plus signs; quadratic interpolation, diamonds; cubic interpolation, multiplication signs. In both cases, the triangles show the behavior of the grid-constrained implementation, and the gray lines indicate the slopes expected for scaling of orders first to fourth with an arbitrary intercept of the vertical axis.
5.1 Single-Neuron Simulations. Each experiment consisted of 40 trials of 500 ms each, during which a neuron of the type described in section 2 was stimulated with a constant excitatory current of 412 pA and unique realizations of an excitatory Poissonian spike train of 1.3 × 10^4 Hz and an inhibitory Poissonian spike train of 3 × 10^3 Hz. The spike times of the Poissonian input trains were represented in continuous time. Parameters are as in section 2, but the peak value of the current resulting from an inhibitory spike was a factor of 6.25 greater than that of an excitatory spike to ensure a balance between excitation and inhibition. The output firing rate was 12.7 Hz. The experiment was repeated for each implementation with each interpolation order over a wide range of computational resolutions.

As the membrane potential and spike times cannot be calculated analytically for this protocol, the canonical implementation with cubic interpolation at the finest resolution (2^−13 ms ≈ 0.12 µs) was defined to be the reference simulation for each realization of the input spike train. As a measure of the error in calculating the membrane potential, the deviation of the actual membrane potential from the reference membrane potential was sampled every millisecond for all the trials. In Figure 3, the median of these deviations is plotted as a function of the computational resolution in double logarithmic representation. In both the canonical implementation (see Figure 3A) and the prescient implementation (see Figure 3B), the same scaling behavior can be seen: for an interpolation order of n, the error in membrane potential scales with order n + 1 (see section A.4). The error has a lower bound at 10^−14, which can be seen for very fine resolutions using cubic interpolation. This represents the greatest numerical precision possible for this physical quantity using the standard representation of floating-point numbers (see section A.1).
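The continuous-time input protocol is straightforward to reproduce; the following sketch (our own code, with an arbitrary seed) draws such Poissonian trains from cumulative exponential interspike intervals:

```python
import numpy as np

def poisson_train(rate_hz, t_max_ms, rng):
    """Continuous-time Poissonian spike train: cumulative exponential ISIs."""
    mean_isi_ms = 1000.0 / rate_hz
    n = int(2 * t_max_ms / mean_isi_ms) + 100     # generous bound on spike count
    times = np.cumsum(rng.exponential(mean_isi_ms, size=n))
    return times[times < t_max_ms]

rng = np.random.default_rng(1)                    # seed is arbitrary
exc = poisson_train(1.3e4, 500.0, rng)            # excitatory train, 1.3 x 10^4 Hz
inh = poisson_train(3.0e3, 500.0, rng)            # inhibitory train, 3 x 10^3 Hz
```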
Figure 4: Scaling of error in spike times as a function of the computational resolution. (A) Canonical implementation. (B) Prescient implementation. Symbols and lines as in Figure 3.
Interestingly, the error for the canonical implementation also saturates at coarse resolutions. This is because interpolation is performed between incoming events rather than across the whole time step, as in the case of the prescient implementation. Consequently, the effective computational resolution cannot be any coarser than the average interspike interval of the incoming spike train (in this case, 1/16 ms), and this determines the maximum error of the canonical implementation. Note that the error of the grid-constrained implementation scales in the same way as that of the canonical and prescient implementations with no interpolation. However, due to the fact that incoming spikes are forced to the grid, the absolute error is greater for this implementation.

The accuracy of a simulation is not determined by the membrane potential alone; the precision of spike times is of at least as much relevance. The median of the differences between the actual and the reference spike times is shown in Figure 4. As with the error in membrane potential, using an interpolation order of n results in a scaling of order n + 1, and the error has a lower bound that is exhibited at very fine resolution when using cubic interpolation. Furthermore, a similar upper bound on the error is observed for the canonical implementation at coarse resolutions. However, in this case, the grid-constrained implementation exhibits not only the same scaling, but a similar absolute error to the continuous-time implementations with no interpolation. Recalling that all the simulations receive identical continuous-time Poisson spike trains, the only difference remaining between the grid-constrained implementation and the continuous-time implementations without interpolation is that the former ignores the offset of the incoming spikes and treats them as if they were produced on the
grid, whereas the latter process the incoming spikes precisely. This reveals that handling incoming spikes precisely confers no real advantage if outgoing spikes are generated without interpolation and thereby forced to the grid. One might therefore conclude that the precise handling of incoming spikes is unnecessary and that the single-neuron integration error could be significantly improved just by performing an appropriate interpolation to generate spikes, while treating incoming spikes as if they were produced on the grid. In fact, this is not the case. If incoming spikes are treated as if on the grid, the error in the membrane potential decreases only linearly with h, thus limiting the accuracy of higher-order methods in determining threshold crossings. This is corroborated by simulations (data not shown). Substantial improvement in the accuracy of single-neuron simulations requires both techniques: precise handling of incoming spikes and interpolation of outgoing spikes.

5.2 Network Simulations. In order to determine the efficiency of the various implementations, a balanced recurrent network was adapted from Brunel (2000). The network contained 10,240 excitatory and 2560 inhibitory neurons and had a connection probability of 0.1, resulting in a total of 15.6 × 10^6 synapses. The inhibitory synapses were a factor of 6.25 stronger than the excitatory synapses, and each neuron received a constant excitatory current of 412 pA as its sole external input. Membrane potentials were initialized to values chosen from a uniform random distribution over [−Θ/2, 0.99Θ]. In this configuration, the network fires at approximately 12.7 Hz in the asynchronous irregular regime, which recreates the input statistics used in the single-neuron simulations. The synaptic delay was 1 ms, and the network was simulated for 1 biological second.

The simulation time and memory requirements for the network simulation described above are shown in Figure 5. For the simulation time (see Figure 5A), it can be seen that at coarse resolutions, the grid-constrained implementation is significantly faster than the prescient implementation, which in turn is faster than the canonical implementation. This is due to the fact that the cost of processing spikes is essentially independent of the computational resolution and manifests as an implementation-dependent constant contribution to the simulation time, which is particularly dominant at coarse resolutions. The difference in speed between the canonical and prescient implementations results from the use of dynamic as opposed to static data structures, and to a lesser extent from the cost of sorting incoming spikes in the canonical implementation. As the computation time step decreases, the simulation times converge, because the cost of updating the neuron dynamics in the absence of events, which is the same for all implementations, is inversely proportional to the resolution and so manifests as a scaling with exponent −1 at small computation time steps (see Figure 5A).
Figure 5: Simulation time (A) and memory requirements (B) for a network simulation as functions of computational resolution in double logarithmic representation. Triangles, grid-constrained neuron; plus signs, canonical implementation with cubic interpolation; circles, prescient implementation with cubic interpolation. Other interpolation orders for the canonical and the prescient implementations result in practically identical curves and are therefore not shown. For details of the simulation, see the text.
It is clear that in general, at the same computation time step, the continuous-time implementations must be slower than the grid-constrained implementation, as the former perform a propagation for every incoming spike and the latter does not. The increased costs concomitant with higher interpolation orders proved to be negligible in a network simulation.

An increase in memory requirements can be observed for all implementations (see Figure 5B) as the resolutions become finer. Although all implementations require much the same amount of memory at coarser resolutions, for finer resolutions, the canonical implementation requires the least memory, followed by the grid-constrained implementation, and the prescient implementation requires the most. It is clear that for a wide range of resolutions, the memory required by the rest of the network, specifically the synapses, dominates the total memory requirements (Morrison, Mehring, et al., 2005). As the resolution becomes finer, the memory required for the input buffers for the neurons plays a greater role. The spike buffer for the canonical implementation is independent of the resolution (see section 4.1.3), and so it might seem that it should not require more memory at finer resolution. However, all implementations tested here also have a buffer for a piece-wise constant input current. In addition to this, the grid-constrained implementation has one buffer for the weights of incoming spikes, and the prescient implementation has three buffers—one for each component of the state vector. All of these buffers require memory in inverse proportion to the resolution, thus explaining the ordering of the curves. Generally a smaller application is more cache effective than a larger one, and this may explain why the canonical implementation exhibits slightly lower
simulation times than the other implementations at very fine resolutions (see Figure 5A).

5.3 Conjunction of Integration Error and Run-Time Costs. The considerations in section 5.2 of how the simulation time and memory requirements increase with finer resolutions are of limited practical relevance to a scientist with a particular problem to investigate. More interesting in this case is how much precision bang you get for your simulation time buck. Unlike the single neuron, however, the network described above is a chaotic system (Brunel, 2000). Any deviation at all between simulations will lead, in a short time, to totally different results on the microscopic level, such as the evoked spike patterns. Such a deviation can even be caused by differences in the tiny round-off errors that occur if floating-point numbers are summed in a different order. Because of this, these simulations do not converge on the microscopic level as the single-neuron simulations do, and for that reason the so-called accuracy of a simulation cannot be taken at face value. We therefore relate the cost of network simulations to the accuracy of single-neuron simulations with comparable input statistics.

In Figure 6, the simulation time and memory requirements data from Figure 5 are combined with the accuracy of the corresponding single-neuron simulations shown in Figure 4, thus eliminating the computational resolution as a parameter. Figure 6A shows the simulation time as a function of spike-time error for the three implementations. This graph can be read in two directions: horizontally and vertically. By reading the graph horizontally, we can determine which implementation will give the best accuracy for a given affordable simulation time. Reading the graph vertically allows us to determine which implementation will result in the shortest simulation time for a given acceptable accuracy. Concentrating on the latter interpretation, it can be seen from the intersection of the lines corresponding to the prescient and grid-constrained implementations (vertical dashed line in Figure 6A) that if an error greater than 2.3 × 10^−2 ms is acceptable, the grid-constrained implementation is faster. For better accuracy than this, the prescient implementation is more effective. If an appropriate time step is chosen, the prescient implementation can simulate more accurately and more quickly than a given grid-constrained simulation in this regime. Only for very high accuracy can a lower simulation time be achieved using the canonical implementation.

Similarly, Figure 6B, which shows the memory requirements as a function of spike-time error in double logarithmic representation, can be read in both directions. This shows qualitatively the same relationship as in Figure 6A, but the point at which one would switch from the prescient implementation to the canonical implementation in order to conserve memory occurs for larger errors. The flatness of the curves for the continuous-time implementations shows that it is possible to increase the accuracy of a simulation considerably without having to worry about the memory requirements.
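Reading off the intersection of two such curves can be automated. The sketch below is illustrative analysis code operating on hypothetical measured arrays (error and simulation time per time step h, errors sorted in ascending order); it interpolates simulation time against log-error and returns an equivalence error of the kind plotted in Figure 6C:

```python
import numpy as np

def equivalence_error(err_a, time_a, err_b, time_b):
    """Error at which implementations a and b need equal simulation time.

    err_* and time_* are hypothetical measured curves sampled over a range
    of time steps h, with the error arrays sorted in ascending order."""
    lo = max(err_a.min(), err_b.min())
    hi = min(err_a.max(), err_b.max())
    grid = np.geomspace(lo, hi, 1000)            # common error axis
    ta = np.interp(np.log(grid), np.log(err_a), time_a)
    tb = np.interp(np.log(grid), np.log(err_b), time_b)
    return grid[np.argmin(np.abs(ta - tb))]      # closest-to-crossing point
```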
Figure 6: Analysis of simulation time and memory requirements for a network simulation as functions of spike-time error for the single-neuron simulation and input spike rate. (A) Simulation time as a function of spike-time error in double logarithmic representation. Triangles, grid-constrained neuron; plus signs, canonical implementation with cubic interpolation; circles, prescient implementation with cubic interpolation. Data combined from Figure 4 and Figure 5. For errors smaller than 2.3 × 10^−2 ms (vertical dashed line), a continuous-time implementation with an appropriately chosen computation step size gives better performance. (B) Memory consumption as a function of spike-time error in double logarithmic representation. Symbols as in A. (C) Error in spike time for which the prescient and grid-constrained implementations require the same simulation time (for different appropriate choices of h) as a function of input spike rate. The unfilled circle indicates the equivalence error for a network rate of 12.7 Hz (input rate 16 kHz), that is, the intersection marked by the vertical dashed line in A. The gray line is a linear fit to the data (slope −0.95). (D) Simulation time as a function of input spike rate for h = 0.125 ms, symbols as in A. The gray lines are linear fits to the data (slopes 1.3, 0.8, and 0.3 s per kHz from top to bottom).
In general, the point at which the prescient implementation at an appropriate computational resolution can produce a faster and more accurate simulation than a grid-constrained simulation will depend on the rate of events a neuron has to handle. To investigate this relationship, the network described in section 5.2 was simulated with different input currents and inhibitory synapse strengths to generate a range of different firing rates in the asynchronous irregular regime. The single-neuron integration error at which the prescient and the grid-constrained implementations require the same simulation time is shown in Figure 6C as a function of the input spike rate. This equivalence error depends linearly on the input spike rate, demonstrating that in the parameter space of useful accuracies and realistic input rates, there is a wide regime where the prescient is faster than the grid-constrained implementation. Underlying the benign nature of the comparative effectiveness analyzed in Figure 6C is the dependence of simulation time on the rate of events. For all implementations, the simulation time increases practically linearly with the input spike rate (see Figure 6D), albeit with different slopes.

5.4 Artificial Synchrony. The previous section has shown that in many situations, continuous-time implementations achieve a desired single-neuron integration error more effectively than the grid-constrained implementation. However, continuous-time implementations have an advantage compared to the grid-constrained implementation beyond the single-neuron integration error. In a network simulation carried out with the grid-constrained implementation, the spikes of all neurons are aligned to the temporal grid defined by the computation time step. This causes artificial synchronization between neurons that may distort measures of synchronization and correlation on the network level. To demonstrate this effect, Hansel et al. (1998) investigated a small network of N integrate-and-fire neurons with excitatory all-to-all coupling. Here, we extend their analysis to the three implementations under study and provide a comparison of single-neuron and network-level integration error. In contrast to their study, our network is constructed from the model neuron introduced in section 2. The time constant of the synaptic current τ_α is adjusted to the rise time of the synaptic current in the original model, which was described by a β-function. The synaptic delay d and the absolute refractory period τ_r are set to the maximum computation time step h investigated in this section. Note that this choice of d and τ_r means that these parameters can be represented at all computational resolutions, thus ensuring that all simulations using the grid-constrained implementation are solving the same dynamical system. Figure 7A illustrates the synchrony in the network as a function of synaptic strength.
Figure 7: Synchronization error in a network simulation. (A) Network synchrony (see equation 5.1) as a function of synaptic strength in a fully connected network of N = 128 neurons (τ_α = (3/2) ln 3 ms, synaptic delay d = τ_r = 0.25 ms) with excitatory coupling (cf. Hansel et al., 1998). Neurons are driven by a suprathreshold DC current I_0 = 575 pA, with no further external input. The initial membrane potential is V_i(0) = (τ_m/C) I_0 [1 − exp(−γ ((i − 1)/N)(T/τ_m))], where i ∈ {1, . . . , N} is the neuron index, T the period in the absence of coupling, and γ = 0.5 controls the initial coherence. The simulation time is 10 s, and V is recorded in intervals of 1 ms between 5 s and 10 s. Synaptic strength is expressed as the amplitude of a postsynaptic current relative to the rheobase current I_∞ = (C/τ_m)Θ = 500 pA and multiplied by the number of neurons N. Other parameters as in section 2. Canonical implementation as reference (h = 2^−10 ms, gray curve) and prescient implementation (h = 2^−2 ms, circles), both with cubic interpolation; grid-constrained implementation (h = 2^−5 ms, triangles). (B) Synchronization error as a function of the computation time step in double logarithmic representation: grid-constrained implementation, triangles; prescient implementation, circles; canonical implementation, plus signs. (C) Synchronization error as a function of the single-neuron integration error for the grid-constrained and the prescient implementation, same representation as in B. The gray lines are linear fits to the data.
Following Hansel et al. (1998), synchrony is defined as the variance of the population-averaged membrane potential normalized by the population-averaged variance of the membrane potential time course,

S = \frac{\big\langle \langle V_i(t) \rangle_i^2 \big\rangle_t - \big\langle \langle V_i(t) \rangle_i \big\rangle_t^2}{\big\langle \langle V_i(t)^2 \rangle_t - \langle V_i(t) \rangle_t^2 \big\rangle_i},    (5.1)
where ⟨·⟩_i indicates averaging over the N neurons and ⟨·⟩_t indicates averaging over time. This is a measure of coherence in the limit of large N, with S = 1 for full synchronization and S = 0 for the asynchronous state. The grid-constrained implementation exhibits a considerable error in synchrony, which vanishes in approaching the asynchronous regime. The prescient implementation accurately preserves network synchrony even with a significantly larger computation time step.

The error in synchrony is quantified in Figure 7B as the root mean square of the relative deviation of S with respect to the reference solution, estimated over the range of synaptic strength investigated. Note that this includes the asynchronous regime, where errors are small in general. The prescient implementation is easily an order of magnitude more accurate than the grid-constrained implementation and is itself outperformed by the canonical implementation. In addition, for the continuous-time implementations, the error in synchrony drops more rapidly with decreasing computational time step h than for the grid-constrained implementation.

However, at the same h, different integration methods exhibit a different integration error for the single-neuron dynamics (see Figure 4). Therefore, to accentuate network effects, continuous-time and grid-constrained implementations should be compared at the same single-neuron integration error. To this end, we proceed as follows: the network spike rate is approximately 80 Hz, corresponding to an input spike rate of some 10 kHz. In Figure 7C, the error in network synchrony at a given computational time step h is plotted as a function of the spike-time error of a single neuron driven with an input spike rate of approximately 10 kHz, simulated at the same h. For single-neuron errors of 10^−2 ms and above, the grid-constrained implementation results in considerable errors in network synchrony. Spike-timing errors of 10^−2 ms and below are required for the grid-constrained and the prescient implementations to achieve a synchronization error in the 1% range or better. Interestingly, the grid-constrained implementation exhibits a larger synchronization error than the prescient implementation for identical single-neuron integration errors.
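Equation 5.1 translates directly into a few lines of array code. The following transcription is our own (the sanity checks use synthetic data): for an N × T array of recorded membrane potentials it returns S ≈ 1 for identical traces and S ≈ 1/N for independent ones, consistent with the limits stated above.

```python
import numpy as np

def synchrony(V):
    """Synchrony S of equation 5.1 for membrane potentials V of shape (N, T)."""
    num = V.mean(axis=0).var()        # variance over time of <V_i(t)>_i
    den = V.var(axis=1).mean()        # <variance over time of V_i(t)>_i
    return num / den

rng = np.random.default_rng(0)
common = rng.normal(size=1000)
print(synchrony(np.tile(common, (128, 1))))     # fully synchronous: S = 1
print(synchrony(rng.normal(size=(128, 1000))))  # independent: S near 1/128
```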
6 Discussion

We have shown that exact integration techniques are compatible with continuous-time handling of spike events within a discrete-time simulation. This combination of techniques achieves arbitrarily high accuracy (up to machine precision) without incurring any extra management costs in the global algorithm, such as a central event queue or looking ahead to see which neuron will fire next. This is particularly important for the study of large networks with frequent events, as the cost of managing events can become prohibitive (Mattia & Del Giudice, 2000; Reutimann, Giugliano, & Fusi, 2003). We introduced a canonical implementation that illustrates the principles of combining these techniques and a prescient implementation that further exploits the linearity of the subthreshold dynamics. The latter implementation simplifies the neuron update algorithm and requires only static data structures and no queuing, leading to a better time and accuracy performance than the canonical implementation. We compared interpolating polynomials of orders 1 to 3 and discovered that the increased numerical complexity of the higher-order interpolations was not reflected in the run time, which is dominated by other factors. Furthermore, it was shown that the highest-order interpolation performed stably. This suggests that the highest-order interpolation should be used, as the greater accuracy is obtained at negligible cost. We have investigated the nature of the trade-off between accuracy and simulation time/memory and demonstrated that for a large range of input spike rates, it is possible to find a combination of continuous-time implementation and computation time step that fulfills a given maximum error requirement both more accurately and faster than a grid-constrained simulation. This measure of efficiency is based on truly large-scale networks (12,800 neurons, 15.6 million synapses).

The techniques described here have several possible extensions. First, the canonical implementation places no constraints on the neuron model used beyond the physically plausible requirement that the membrane potential is thrice continuously differentiable. It may therefore be used to implement essentially any kind of neuronal dynamics, including neurons with conductance-based synapses. The prescient implementation further requires that the neuron's dynamics is linear; it may thus be used for a wide range of model neurons with current-based synapses. The neuron model we implemented does not have invertible dynamics, and so the determination of its spike time necessarily involves approximation. For some neuron models, it is possible to determine the precise spike time without recourse to approximation, such as the Lapicque model (Tuckwell, 1988) and its descendants (Mirollo & Strogatz, 1990). Such models can, of course, also be implemented in this framework, but they would have only the same precision as a classical event-based simulation if they were canonically implemented (obviously without interpolation); a prescient implementation would be able to represent the subthreshold dynamics exactly on the grid but would entail the use of approximative methods to determine the spike times. Although we investigated polynomial interpolation, other methods of spike time determination such as Newton's method can be implemented with no change to the conceptual framework.

Second, most constraints imposed in terms of the computational time step h may be relaxed. As indicated in section 4.1.3, the restriction of
delays to nonzero integer multiples of h can be relaxed to any floating-point number ≥ h. When a neuron spikes, the offset of the spike could then be combined on the fly with the delay to create an integer component and a revised offset, thus allowing the spike to be delivered correctly by the global discrete-time algorithm, and processed correctly by the continuous-time neuron model. This relaxation would come at the memory cost of having to store delays as floating-point numbers rather than integers and the computational cost of having to perform the on-the-fly delay decomposition. Furthermore, h is currently a parameter of the entire network, so it is the same for all neurons in a given simulation. Given that the minimum propagation delay already defines natural synchronization points to preserve causality in the simulation, it would be possible to allow h to be chosen individually for each neuron in the network, or even to use variable time steps, while still maintaining consistency with the global discrete-time algorithm. Finally, the techniques are compatible with the distributed computing techniques described in Morrison, Mehring, et al. (2005), requiring only that the spike-time offsets are communicated in addition to the indices of the spiking neurons. This increases the bulk but not the frequency of communication, as it is still sufficient to communicate in intervals determined by the minimum propagation delay. A similar minimum delay principle is used by Lytton and Hines (2005), again suggesting a convergence of time-driven and event-driven approaches.

When investigating a particular system, it is worthwhile considering what accuracy is necessary. For the networks described in section 5.2, it would be pointless to simulate with a very small time step, as they are chaotic. As long as the relevant macroscopic measures are preserved, any time step is as good or as bad as any other. However, a good rule of thumb is that it should be possible to discriminate spike times an order of magnitude more accurately than the characteristic timescales of the macroscopic phenomena to be observed, such as the temporal structure of cross-correlations. Note that even if the characteristics of the system to be investigated suggest that a grid-constrained implementation is optimal, the availability of equivalent continuous-time implementations is still advantageous: should the suspicion arise that an observed phenomenon is an artifact of the grid constraints, they can be used to test this without altering any other part of the simulation. For much the same reason, it is very useful to be able to modify the time step without having to adjust the rest of the simulation.

The network studied in section 5.4 illustrates two more important points. First, it demonstrates that exceedingly small single-neuron integration errors may be required to accurately capture network synchronization. Second, it is clear from Figure 7C that continuous-time implementations are better at rendering macroscopic measures such as synchrony correctly: in conditions where the grid-constrained and prescient implementations
achieve the same single-neuron spike-timing error, the prescient implementation yields a significantly smaller error in network synchrony.

Accuracy cannot be improved on indefinitely: Figures 3 and 4 show that the errors in both membrane potential and spike timing saturate at about 10^−14 mV and 10^−14 ms, respectively, for a time step of around 10^−3 ms. The saturation accuracy is close to the maximal precision of the standard floating-point representation of the computer, and so 10^−3 ms represents a lower bound on the range of useful h. An upper bound is determined by the physical properties of the system. First, h may not be larger than the minimum propagation delay in the network or the refractory period of the neurons. Second, using a large h increases the danger that a spike is missed. This can occur if the true trajectory of the membrane potential passes through the threshold within a step but is subthreshold again by the end of the step. This is less of an issue for the canonical implementation, as the check for a suprathreshold membrane potential is not just performed at the end of every step but also at the arrival of every incoming event (see section 4.1.1).

There is a common perception that event-driven algorithms are exact and time-driven algorithms are approximate. However, both parts of this perception are generally false. With respect to the first part, event-driven algorithms are not by the nature of the algorithm more exact than time-driven algorithms. It depends on the dynamics of the neuron model whether an event-driven algorithm can find an exact solution, just as it does for time-driven algorithms. For a restricted class of models, the spike times can be calculated exactly through inversion of the dynamics. For other models, approximate methods to determine the spike times need to be employed. With respect to the second part, time-driven algorithms are not necessarily approximate. A discrete-time algorithm does not imply that spike times have to be constrained onto the grid, as shown by Hansel et al. (1998) and Shelley and Tao (2001). Moreover, the subthreshold dynamics for a large class of neuron models can be integrated exactly (Rotter & Diesmann, 1999). Here we combine these insights to show that the degree of approximation in a simulation is not determined by whether an event-driven or a time-driven algorithm is used but by the dynamics of the neuron model.

A further question is whether the terms time-driven and event-driven should even be used in this mutually exclusive way. In our algorithm, neuron implementations treating incoming and outgoing spikes in continuous time are seamlessly integrated into a global discrete-time algorithm. Should this therefore be considered a time-driven or an event-driven algorithm? We believe that this combination of techniques represents a hybrid algorithm that is globally time driven but locally event driven. Similarly, when designing a distributed simulation algorithm (Morrison, Mehring, et al., 2005), it was shown that a time-driven neuron updating algorithm can be successfully combined with event-driven synapse updating, again suggesting that no dogmatic distinction between the two approaches need
72
A. Morrison, S. Straube, H. Plesser, and M. Diesmann
be made. However, although we were able to demonstrate the potential advantages of a hybrid algorithm, these findings do not in principle rule out the existence of pure event-driven or time-driven algorithms with identical universality and better performance for a given set of parameters than the schemes presented here. In closing, we express our hope that this work can help defuse the at times heated debate between advocates of event-driven and time-driven algorithms for the simulation of neuronal networks. Appendix: Numerical Techniques In this appendix, we present the numerical techniques employed to achieve the reported accuracy. A.1 Accuracy of Floating-Point Representation of Physical Quantities. The double representation of floating-point numbers (Press, Teukolsky, Vetterling, & Flannery, 1992) used by standard computer hardware limits the accuracy with which physical quantities can be stored. The machine precision ε is the smallest number for which the double representation of 1 + ε is different from the representation of 1. Consequently, the absolute error σx of a quantity x is limited by the magnitude of x, σx ≈ 2log2 x · ε. In double representation, we have ε = 2−52 ≈ 2.22 · 10−16 . Membrane potential values y are on the order of 20 mV; the lower limit of the integration error in the membrane potential is therefore on the order of 5 · 10−15 mV. According to the rules of error propagation, the error in the time of threshold crossing σ depends on the error in membrane potential as σ = |1/ y˙ |. Typical values of the derivative | y˙ | of the membrane potential are on the order of 1 mV per ms (single-neuron simulations, data not shown), from which we obtain 5 · 10−15 ms as a lower bound for the error in spike timing. Therefore, the observed integration errors at which the simulations saturate are close to the limits imposed by the double representation for both physical quantities. A.2 Representation of Spike Times. We have seen in section A.1 that the absolute error σx depends on the magnitude of the quantity x. As a consequence, the error of spike times recorded in double representation increases with simulation time. An additional error is introduced if the computation time step h cannot be represented as a double (e.g., 0.1 ms).
Continuous Spike Times in Exact Discrete-Time Simulations
73
Therefore, we record spike times as a pair of two values {t + h, δ}. The first one is an integral number in units of h represented as a long int specifying the computation step in which the spike was emitted. The second one is the offset of the spike time in units of ms represented as a double. If h is a power of 2 in units of ms, both values can be represented as doubles without loss of accuracy. A.3 Evaluating the Update Equation. The implementation of the update equation 3.1 in the target programming language requires attention to numerical detail if utmost precision is desired. We were able to reduce membrane potential errors in a nonspiking simulation from some 10−12 mV to about 10−15 mV by careful rearrangement of terms. Although details may depend significantly on processor architecture and compiler optimization strategies, we will briefly recount the implementation we found optimal. The matrix-vector multiplication in equation 3.1 describes updates of the form y ← (1 − e − τ )a + e − τ y, h
h
where a and y are of order unity, while h τ so that e − τ ≈ 1. For a time step h −12 of h = 2 ms and a time constant of τ = 10 ms, one has γ = 1 − e − τ ≈ 10−5 . The quantity γ can be computed accurately for small values of h/τ using the function expm1(x) provided by current numeric libraries (C standard library; see also Galassi et al., 2001). Using double resolution, γ will have some 15 significant digits, spanning roughly from 10−5 to 10−20 , and all h of these digits are nontrivial. The exponential e − τ may be computed to 15 significant digits using exp(x); the first five of these will be trivial nines, though, leaving just 10 nontrivial digits, which furthermore span down to only 10−15 . We thus rewrite the equation above entirely in terms of γ , h
y ← γ a + (1 − γ )y, and finally as y ← γ (a − y) + y. In our experience, this final form yields the most accurate results, as the full precision of γ is retained as long as possible. Note that computing the 1 − γ term above discards the five least significant digits of γ . When several terms need to be added, they should be organized according to their expected magnitude starting with the smallest components. A.4 Polynomial Interpolation of the Membrane Potential. In order to approximate the time of threshold crossing, the membrane potential yt3
74
A. Morrison, S. Straube, H. Plesser, and M. Diesmann
known at grid points t with spacing h can be interpolated with polynomials of different order. For the purpose of this section, we drop the index specifying the membrane potential as the third component of the state vector. Without loss of generality, we assume that the threshold crossing occurs at time δ ∗ in the interval (0, h]. The corresponding values of the membrane potential are denoted y0 < and yh ≥ , respectively. The threshold crossings are found using the explicit formulas for the roots of polynomials of order n = 1, 2, and 3 (Weisstein, 1999). In order to constrain the polynomials, we exploit the fact that the derivative of the membrane potential can be easily obtained from the state vector at both sides of the interval. For the grid-constrained (n = 0) simulation and linear (n = 1) interpolation, we demonstrate why the error in spike timing decreases with h n+1 . A.4.1 Grid-Constrained Simulation. In the variables defined above, the approximate time of threshold crossing δ equals the computation time step h; the spike is reported to occur at the right border of the interval (0, h]. Assuming the membrane potential y(t) to be exact, the error in membrane potential with respect to the value at the exact point of threshold crossing is = yh − y(δ ∗ ). Let us require that y(t) is sufficiently often differentiable and that the derivatives assume finite values. We can then express the membrane potential as a Taylor expansion originating at the left border of the interval y(t) = y0 + y˙0 t + O(t 2 ). Considering terms up to first order, we obtain = {y0 + y˙0 h} − {y0 + y˙0 δ ∗ } = y˙0 (h − δ ∗ ). Hence, reaches its maximum amplitude at δ ∗ = 0, and we can write || ≤ | y˙0 |h. The error in spike timing is σ = h − δ∗ = and |σ | ≤ h.
1 y˙0
Continuous Spike Times in Exact Discrete-Time Simulations
75
A.4.2 Linear Interpolation. A polynomial of order 1 is naturally constrained by the values of the membrane potential (y0 and yh ) at both ends of the interval. With yt = a t + b, the set of equations specifying the coefficients of the polynomial (a and b) is −1 a 0 1 y0 = b h 1 yh −1 −1 y0 h −h = yh 1 0 (yh − y0 )h −1 . = y0 Thus, in normalized form, we need to solve 0=
yh − y0 δ + (y0 − ) h
(A.1)
for δ to find the approximate time of threshold crossing. At the exact point of threshold crossing δ ∗ , the error in membrane potential is =
yh − y0 ∗ δ + y0 − y(δ ∗ ). h
(A.2)
Let us require that y(t) is sufficiently often differentiable and that the derivatives assume finite values. We can then express the membrane potential as a Taylor expansion originating at the left border of the interval y(t) = y0 + y˙0 t + 12 y¨ 0 t 2 + O(t 3 ). Considering terms up to second order, we obtain 1 1 2 = y˙0 δ ∗ + y¨ 0 hδ ∗ + y0 − y0 + y˙0 δ ∗ + y¨ 0 δ ∗ 2 2 1 2 = y¨ 0 (δ ∗ h − δ ∗ ). 2 The time of threshold crossing is bounded by the interval (0, h]. Hence, reaches its maximum amplitude at δ ∗ = 12 h, and we can write || ≤
1 | y¨ 0 | h 2 . 8
76
A. Morrison, S. Straube, H. Plesser, and M. Diesmann
Noting that y(δ ∗ ) = in equation A.2, we have =
yh − y0 ∗ δ + (y0 − ), h
(A.3)
and subtracting equation 4.1 from 4.3, we obtain =
yh − y0 ∗ (δ − δ). h
Thus, the error in spike timing is σ = δ∗ − δ =
h . yh − y0
With the help of the expansion h 1 y¨ 0 = − h, yh − y0 y˙0 2 y˙0 2 we arrive at |σ | ≤
1 y¨ 0 2 h . 8 y˙0
A.4.3 Quadratic Interpolation. Using a polynomial of order 2, we can add an additional constraint to the interpolating function. We decide for the derivative of the membrane potential at the left border of the interval y˙ 0 . With yt = a t 2 + bt + c, we have −1 0 0 1 y0 a b = h 2 h 1 yh c y˙ 0 0 1 0 −2 −2 −1 y0 −h h h 0 1 yh = 0 y˙ 0 1 0 0 (yh − y0 )h −2 − y˙ 0 h −1 . y˙ 0 = y0 Thus, in normalized form, we need to solve 0 = δ2 +
(y0 − )h 2 y˙ 0 h 2 δ+ . yh − y0 − y˙ 0 h yh − y0 − y˙ 0 h
Continuous Spike Times in Exact Discrete-Time Simulations
77
The solution can be obtained by the quadratic formula. The GSL (Galassi et al., 2001) implements an appropriate solver. Generally there are two real solutions: the desired one inside the interval (0, h] and one outside. A.4.4 Cubic Interpolation. A polynomial of order 3 enables us to constrain the interpolation further by the derivative of the membrane potential at the right border of the interval y˙ h . With yt = a t 3 + bt 2 + ct + d, we have −1 0 0 0 1 a y0 3 2 b h h h 1 yh = c 0 0 1 0 y˙ 0 d y˙ h 3h 2 2h 1 0 y0 h −2 2h −2h −3 h −2 −3h −2 3h −2 −2h −1 −h −1 yh = 0 0 1 0 y˙ 0 y˙ h 1 0 0 0 2(y0 − yh )h −3 + ( y˙ 0 + y˙ h )h −2 3(yh − y0 )h −2 − (2 y˙ 0 + y˙ h )h −1 . = y˙ 0 y0 Thus, in normalized form, we need to solve 0 = δ3 + +
3(yh − y0 )h − (2 y˙ 0 + y˙ h )h 2 2 δ 2(y0 − yh ) + ( y˙ 0 + y˙ h )h
y˙ 0 h 3 (y0 − )h 3 δ+ . 2(y0 − yh ) + ( y˙ 0 + y˙ h )h 2(y0 − yh ) + ( y˙ 0 + y˙ h )h
The solution can be found by the cubic formula. There is at least one real solution in the interval (0, h]. It is convenient to chose a substitution that avoids the intermediate occurrence of complex quantities (e.g., Weisstein, 1999). The GSL (Galassi et al., 2001) implements an appropriate solver. If the interval contains more than one solution, the time of threshold crossing is defined by the left-most root. Acknowledgments The new address of A. M. and M. D. is Computational Neuroscience Group, RIKEN Brain Science Institute, Wako, Japan. The new address of S.S. is Human-Neurobiologie, University of Bremen, Germany. We acknowledge Johan Hake for the interesting discussions that started the project.
78
A. Morrison, S. Straube, H. Plesser, and M. Diesmann
As always, Stefan Rotter and the members of the NEST collaboration (in particular Marc-Oliver Gewaltig) were very helpful. We thank Jochen Eppler for assisting with the C++ coding. We acknowledge the two anonymous referees whose stimulating questions helped us to improve the letter. This work was partially funded by DAAD 313-PPP-N4-lk, DIP F1.2, BMBF Grant 01GQ0420 to the Bernstein Center for Computational Neuroscience Freiburg, and EU Grant 15879 (FACETS). As of the date this letter is published, the NEST initiative (www.nest-initiative.org) makes available the different implementations of the neuron model used as an example in this article, in source code form under its public license. All simulations were carried out using the parallel computing facilities of the Norwegian Uni˚ versity of Life Sciences at As. References Brette, R. (2006). Exact simulation of integrate-and-fire models with synaptic conductances. Neural Comput., 18, 2004–2027. Brunel, N. (2000). Dynamics of sparsely connected networks of excitatory and inhibitory spiking neurons. J. Comput. Neurosci., 8(3), 183–208. Diesmann, M., & Gewaltig, M.-O. (2002). NEST: An environment for neural systems simulations. In T. Plesser & V. Macho (Eds.), Forschung und wisschenschaftliches ¨ Rechnen, Beitr¨age zum Heinz-Billing-Preis 2001 (pp. 43–70). Gottingen: Gesellschaft ¨ wissenschaftliche Datenverarbeitung. fur Diesmann, M., Gewaltig, M.-O., Rotter, S., & Aertsen, A. (2001). State space analysis of synchronous spiking in cortical neural networks. Neurocomputing, 38–40, 565– 571. Ferscha, A. (1996). Parallel and distributed simulation of discrete event systems. In A. Y. Zomaya (Ed.), Parallel and distributed computing handbook (pp. 1003–1041). New York: McGraw-Hill. Fujimoto, R. M. (2000). Parallel and distributed simulation systems. New York: Wiley. Galassi, M., Davies, J., Theiler, J., Gough, B., Jungman, G., Booth, M., & Rossi, F. (2001). Gnu scientific library: Reference manual. Bristol: Network Theory Limited. Golub, G. H., & van Loan, C. F. (1996). Matrix computations (3rd ed.). Baltimore, MD: Johns Hopkins University Press. Hammarlund, P., & Ekeberg, O. (1998). Large neural network simulations on multiple hardware platforms. J. Comput. Neurosci., 5(4), 443–459. Hansel, D., Mato, G., Meunier, C., & Neltner, L. (1998). On numerical simulations of integrate-and-fire neural networks. Neural Comput., 10(2), 467–483. Harris, J., Baurick, J., Frye, J., King, J., Ballew, M., Goodman, P., & Drewes, R. (2003). A novel parallel hardware and software solution for a large-scale biologically realistic cortical simulation (Tech. Rep.). Las Vegas: University of Nevada. Heck, A. (2003). Introduction to Maple (3rd ed.). Berlin: Springer-Verlag. Lytton, W. W., & Hines, M. L. (2005). Independent variable time-step integration of individual neurons for network simulations. Neural Comput., 17, 903–921. Makino, T. (2003). A discrete-event neural network simulator for general neuron models. Neural Comput. and Applic., 11, 210–223.
Continuous Spike Times in Exact Discrete-Time Simulations
79
Marian, I., Reilly, R. G., & Mackey, D. (2002). Efficient event-driven simulation of spiking neural networks. In Proceedings of the 3. WSES International Conference on Neural Networks and Applications. Interlaken, Switzerland. Mattia, M., & Del Giudice, P. (2000). Efficient event-driven simulation of large networks of spiking neurons and dynamical synapses. Neural Comput., 12(10), 2305– 2329. Mirollo, R. E., & Strogatz, S. H. (1990). Synchronization of pulse-coupled biological oscillators. SIAM J. Appl. Math., 50(6), 1645–1662. Morrison, A., Hake, J., Straube, S., Plesser, H. E., & Diesmann, M. (2005). Precise spike timing with exact subthreshold integration in discrete time network simu¨ lations. Proceedings of the 30th Gottingen Neurobiology Conference. Neuroforum, 1(Suppl.), 205B. Morrison, A., Mehring, C., Geisel, T., Aertsen, A., & Diesmann, M. (2005). Advancing the boundaries of high connectivity network simulation with distributed computing. Neural Comput., 17(8), 1776–1801. Press, W. H., Teukolsky, S. A., Vetterling, W. T., & Flannery, B. P. (1992). Numerical recipes in C (2nd ed.). Cambridge: Cambridge University Press. Reutimann, J., Giugliano, M., & Fusi, S. (2003). Event-driven simulation of spiking neurons with stochastic dynamics. Neural Comput., 15, 811–830. Rochel, O., & Martinez, D. (2003). An event-driven framework for the simulation of networks of spiking neurons. In ESANN’2003 Proceedings—European Symposium on Artifical Neural Networks (pp. 295–300). Bruges, Belgium: d-side Publications. Rotter, S., & Diesmann, M. (1999). Exact digital simulation of time-invariant linear systems with applications to neuronal modeling. Biol. Cybern., 81(5/6), 381–402. Shelley, M. J., & Tao, L. (2001). Efficient and accurate time-stepping schemes for integrate-and-fire neuronal networks. J. Comput. Neurosci., 11(2), 111–119. Sloot, A., Kaandorp, J. A., Hoekstra, G., & Overeinder, B. J. (1999). Distributed simulation with cellular automata: Architecture and applications. In J. Pavelka, G. Tel, & M. Bartosek (Eds.), SOFSEM’99, LNCS (pp. 203–248). Berlin: SpringerVerlag. Tuckwell, H. C. (1988). Introduction to theoretical neurobiology. Cambridge: Cambridge University Press. Weisstein, E. W. (1999). CRC concise encyclopedia of mathematics. Boca Raton, FL: CRC Press. Wolfram, S. (2003). The mathematica book (5th ed.). Champaign, IL: Wolfram Media Incorporated. Zeigler, B. P., Praehofer, H., & Kim, T. G. (2000). Theory of modeling and simulation: Integrating discrete event and continuous complex dynamic systems (2nd ed.). Amsterdam: Academic Press.
Received August 12, 2005; accepted May 25, 2006.
LETTER
Communicated by Daniel Amit
The Road to Chaos by Time-Asymmetric Hebbian Learning in Recurrent Neural Networks Colin Molter [email protected] Laboratory for Dynamics of Emergent Intelligence, RIKEN Brain Science Institute, Wako, Saitama, 351-0198, Japan, and Laboratory of Artificial Intelligence, IRIDIA, Universit´e Libre de Bruxelles, 1050 Brussels, Belgium
Utku Salihoglu [email protected] Laboratory of Artificial Intelligence, IRIDIA, Universit´e Libre de Bruxelles, 1050 Brussels, Belgium
Hugues Bersini [email protected] Laboratory of Artificial Intelligence, IRIDIA, Universit´e Libre de Bruxelles, 1050 Brussels, Belgium
This letter aims at studying the impact of iterative Hebbian learning algorithms on the recurrent neural network’s underlying dynamics. First, an iterative supervised learning algorithm is discussed. An essential improvement of this algorithm consists of indexing the attractor information items by means of external stimuli rather than by using only initial conditions, as Hopfield originally proposed. Modifying the stimuli mainly results in a change of the entire internal dynamics, leading to an enlargement of the set of attractors and potential memory bags. The impact of the learning on the network’s dynamics is the following: the more information to be stored as limit cycle attractors of the neural network, the more chaos prevails as the background dynamical regime of the network. In fact, the background chaos spreads widely and adopts a very unstructured shape similar to white noise. Next, we introduce a new form of supervised learning that is more plausible from a biological point of view: the network has to learn to react to an external stimulus by cycling through a sequence that is no longer specified a priori. Based on its spontaneous dynamics, the network decides “on its own” the dynamical patterns to be associated with the stimuli. Compared with classical supervised learning, huge enhancements in storing capacity and computational cost have been observed. Moreover, this new form of supervised learning, by being more “respectful” of the network intrinsic dynamics, maintains much more structure in Neural Computation 19, 80–110 (2007)
C 2006 Massachusetts Institute of Technology
Road to Chaos in Recurrent Neural Networks
81
the obtained chaos. It is still possible to observe the traces of the learned attractors in the chaotic regime. This complex but still very informative regime is referred to as “frustrated chaos.” 1 Introduction Synaptic plasticity is now widely accepted as a basic mechanism underlying learning and memory. There is experimental evidence that neuronal activity can affect synaptic strength through both long-term potentiation and longterm depression (Bliss & Lomo, 1973). Inspired by or forecasting this biological fact, a large number of learning “rules,” specifying how activity and training experience change synaptic efficacies, have been proposed (Hebb, 1949; Sejnowski, 1977). Such learning rules have been essential for the construction of most models of associative memory (among others, Amari, 1977; Hopfield, 1982; Amari & Maginu, 1988; Amit, 1995; Brunel, Carusi, & Fusi, 1997; Fusi, 2002; Amit & Mongillo, 2003). In such models, the neural network maps the structure of information contained in the external or internal environment into embedded attractors. Since Amari, Grossberg, and Hopfield precursor works (Amari, 1972; Hopfield, 1982; Grossberg, 1992), the privileged regime to code information has been fixed-point attractors. Many theoretical and experimental works have shown and discussed the limited storing capacity of these attractor network (Amit, Gutfreund, & Sompolinsky, 1987; Gardner, 1987; Gardner & Derrida, 1989; Amit & Fusi, 1994; and Domany, van Hemmen, & Schulten, 1995, for a review). However, many neurophysiological reports (Nicolis & Tsuda, 1985; Skarda & Freeman, 1987; Babloyantz & Loureno, 1994; Rodriguez et al., 1999; and Kenet, Bibitchkov, Tsodyks, Grinvald, & Arieli, 2003) tend to indicate that brain dynamics is much more complex than fixed points and is more faithfully characterized by cyclic and weak chaotic regimes. In line with these results, in this article, we propose to map stimuli to spatiotemporal limit cycle attractors of the network’s dynamics. A learned stimulus is no longer expected to stabilize the network into a steady state (which could in some cases correspond to a minimum of a Lyapunov function). Instead, the stimulus is expected to drive the network into a specific spatiotemporal cyclic trajectory. This cyclic trajectory is still considered an attractor since content addressability is expected: before presentation of the stimulus, the network could follow another trajectory, and the stimulus could be corrupted with noise. By relying on spatiotemporal cyclic attractors, the famous theoretical results on the limited capacity of Hopfield network no longer apply. In fact, the extension of encoding attractors to cycles potentially boosts this storing capacity. Suppose a network is composed of two neurons that can have only two values: −1 and +1. Without paying attention to noise and generalization, only four fixed-point attractors can be exploited, whereas by adding cycles, this number increases. For instance,
82
C. Molter, U. Salihoglu, and H. Bersini
cycles of length two are: (+1, +1)(+1, −1) (+1, +1)(−1, +1) (+1, +1)(−1, −1) (+1, −1)(−1, +1) (+1, −1)(−1, −1) (−1, +1)(−1, −1). In a given network, with a fixed topology and parameterization, the number of cyclic attractors is obviously inferior to the number of fixed points (a cycle iterates through a succession of unstable equilibrium). An indexing of the memorized attractors restricted to the initial conditions, as classically done in Hopfield networks, would not allow a full exploitation of all of these potential cyclic attractors. This is the reason that in the experimental framework presented here, the indexing is done instead by means of an added external input layer that continuously feeds the network with different external stimuli. Each external stimulus modifies the parameterization of the network—thus, the possible dynamical regimes and the set of potential attractors. The goal of this letter is not to calculate the “maximum storage capacity” of these “cyclic attractors networks,”1 either theoretically2 or experimentally. Rather, we intend to discuss the potential dynamical regimes (oscillations and chaos) that allow this new form of information storage. The experimental results we present show how the theoretical limitation of fixed-point attractor networks can easily be overcome by adding the input layer and exploitating the cyclic attractors. By studying small fully connected networks, we have shown previously (Molter & Bersini, 2003) how a synaptic matrix randomly generated allows the exploitation of a huge number of cyclic attractors for storing information. In this letter, according to our previous results (Molter, Salihoglu, & Bersini, 2005a, 2005b), a time-asymmetric Hebbian rule is proposed to encode the information. This rule is related to experimental observations showing an asymmetrical time window of synaptic plasticity in pyramidal cells during tetanic stimulus conditions (Levy & Steward, 1983; Bi & Poo, 1999). Information stored into the network consists of a set of pair data. Each datum is composed of an external stimulus and a series of patterns through which the network is expected to iterate when the stimulus feeds the network (this series corresponds to the limit cycle attractor). Traditionally, the information to be stored is either installed by means of a supervised mechanism (e.g., Hopfield, 1982) or discovered on the spot by an unsupervised version, revealing some statistical regularities in the data presented to the 1 To paraphrase Gardner’s article “Maximum Storage Capacity in Neural Networks” (1987). 2 The fact that we are not working at the thermodynamic limit, as in population models, would render such analysis very difficult.
Road to Chaos in Recurrent Neural Networks
83
net (e.g. Amit & Brunel, 1994). In this letter, two different forms of learning are studied and compared. In the first, the information to be learned (the external stimulus and the limit cycle attractor) is prespecified and installed as a result of a classical supervised algorithm. However, such supervised learning has always raised serious problems at a biological level, due to its top-down nature, and a cognitive level. Who would take responsibility to look inside the brain of the learner, that is, to decide the information to associate with the external stimulus and exert supervision during the learning task? To answer the last question, in the second form of learning proposed here, the semantics of the attractors to be associated with the feeding stimulus is left unprescribed: the network is taught to associate external stimuli with original attractors, not specified a priori. This perspective remains in line with the very old philosophical conviction of constructivism, which has been modernized in neural network terms by several authors (among others, Varela, Thompson, & Rosch, 1991; Erdi, 1996; Tsuda, 2001). One operational form has achieved great popularity as a neural net implementation of statistical clustering algorithms (Kohonen, 1982; Grossberg, 1992). To differentiate between the two learning procedures, the first one is called out-supervised, since the information is fully specified from outside. In contrast, the second one is called in-supervised, since the learning maps stimuli to cyclic attractors “derived” from the ones spontaneously proposed by the network.3 Quite naturally, we show that this in-supervised learning leads to increased storing capacity. The aim of this letter is not only to show how stimuli could be mapped to limit cycle attractors of the network’s dynamics. It also focuses on the network’s background dynamics: the dynamics observed when unlearned stimuli feed the network. More precisely, in line with theoretical investigations of the onset and the nature of chaotic dynamics in deterministic dynamical systems (Eckmann & Ruelle, 1985; Sompolinsky, Crisanti, & Sommers, 1988; van Vreeswijk & Sompolinsky, 1996; Hansel & Sompolinsky, 1996), this letter computes and analyzes the background presence of chaotic dynamics. The presence of chaos in recurrent neural networks (RNNs) and the benefit gained by its presence is still an open question. However, since the seminal paper by Skarda and Freeman (1987) dedicated to chaos in the rabbit brain, many authors have shared the idea that chaos is the ideal regime to store and efficiently retrieve information in neural networks (Freeman, 2002; Guillot & Dauce, 2002; Pasemann, 2002; Kaneko & Tsuda, 2003). Chaos, although very simply produced, 3 This algorithm is still supervised in the sense that the mapped cyclic attractors are only “derived” from the spontaneous network dynamics. External supervision is still partly needed, which raises question about biological plausibility. However, even if the proposed in-supervised algorithm is improved or if it ends as part of a more biologically plausible learning procedure, we think that the dynamical results obtained here will remain qualitatively valid.
84
C. Molter, U. Salihoglu, and H. Bersini
inherently possesses an infinite number of cyclic regimes that can be exploited for coding information. Moreover, it randomly wanders around these unstable regimes in a spontaneous way, thus rapidly proposing alternative responses to external stimuli and able to switch easily from one of these potential attractors to another in response to any coming stimulus. This article maintains this line of thinking by forcing the coding of information in robust cyclic attractors and experimentally showing that the more information is to be stored, the more chaos appears as a regime in the back, erratically itinerating, that is, moving from place to place, among brief appearances of these attractors. Chaos appears to be the consequence of the learning, not the cause. However, it appears as a helpful consequence that widens the net’s encoding capacity and diminishes the existence of spurious attractors. By comparing the nature of the chaotic dynamics obtained from the two learning procedures, we show that more structure in the background chaotic regimes is obtained when the network “chooses” by itself the limit cycle attractors to encode the stimuli (the in-supervised learning). The nature of these chaotic regimes can be related to the classic intermittent type of chaos (Pomeau & Manneville, 1980) and to its extension to biological networks, originally called frustrated chaos (Bersini, 1998), in reminiscence of the frustration phenomenon occurring in both recurrent neural nets and spin glasses. This chaos is related to chaotic itinerancy (Kaneko, 1992; Tsuda, 1992), which has been suggested to be of biological significance (Tsuda, 2001). The plan of the letter is as follows. Section 2 describes the model as well as the learning task. Section 3 describes the out-supervised learning procedure, where stimuli are encoded in predefined limit cycle attractors. Section 4 describes the in-supervised learning procedure, where stimuli are encoded in the limit cycle attractors derived from those spontaneously proposed by the network. Section 5 computes and compares networks’ encoding capacity, as well as the content addressability of encoded information. Section 6 compares and discusses the proportion and the nature of the observed chaotic dynamics when the network is presented with unlearned stimuli. 2 Model and Learning Task Descriptions This section describes the model used in our simulations as well as the learning task. 2.1 The Model. The network is fully connected. Each neuron’s activation is a function of other neurons’ impact and an external stimulus. The neurons’ activation f is continuous and updated synchronously by discrete time step. The mathematical description of such a network is a classic
Road to Chaos in Recurrent Neural Networks
85
one. The activation value of a neuron xi at a discrete time step n + 1 is xi (n + 1) = f (g neti (n)) neti (n) =
N
wij x j (n) +
j=1
M
wis ιs ,
(2.1)
s=1
where N is the number of neurons, M is the number of units composing the stimulus, g is the slope parameter, wij is the weight between the neurons j and i, wis is the weight between the external stimulus’ unit s and the neuron i, and ιs is the unit s of the external stimulus. The saturating activation function f is taken continuous (here tanh) to ease the study of the networks’ dynamical properties. The network’s size, N, and the stimulus’s size, M, have been set to 25 in this article for legibility. Of course, this size has an impact on both the encoding capacity and the background dynamics. This impact will not be discussed. Another impact not discussed here is the value of the slope parameter g, set to 3 in the following.4 The main purpose of this article is to analyze how the learning of stimuli in spatiotemporal attractors affects the network’s background dynamics. When storing information in fixed-point attractors, the temporal update rule can indifferently be asynchronous or synchronous. This is no longer the case when storing information in cycle attractors, for which the updating must necessarily be synchronous: it is a global activity due to one pattern that generates the next one. To compare the network’s continuous internal states with bit patterns, a filter layer quantizing the internal states, based on the sign function, is added (Omlin, 2001). It defines the output vector o:
o i = −1 ⇐⇒ xi < 0 oi = 1
⇐⇒ xi ≥ 0,
(2.2)
where xi is the internal state of the neuron i and o i is its associated output (i.e., its visible value). This filter layer enables it to perform symbolic investigations on the dynamical attractors. Figure 1 represents a period 2 sequence unfolding in a network of four neurons. The persistent external stimulus feeding the network appears Figure 1A. Given that the internal state of neurons is continuous, the internal states (see Figure 1B) are filtered (see Figure 1C) to enable the comparison with the stored data. 4
The impact of this parameter has been discussed in Dauce, Quoy, Cessac, Doyon, & Samuelides (1998). They have demonstrated how the slope parameter can be used as a route to chaos.
86
C. Molter, U. Salihoglu, and H. Bersini
Figure 1: (A) A fully recurrent neural network (N = 4) fed by a persistent external stimulus. (B). Three shots of the network’s states. Each represents the internal state of the network at a successive time step. (C) After filtering, a cycle of period 2 can be seen.
2.2 The Learning Task. Two different forms of supervised learning are proposed in this article. However, both consist in storing a set of q external stimuli in spatiotemporal cycles of the network’s internal dynamics. The data set is written as D = D1 , . . . , Dq ,
(2.3)
where each datum Dµ is defined by a pair composed of a pattern χ µ corresponding to the external stimulus feeding the network and a sequence of patterns ς µ,i , i = 1, . . . , lµ to store in a dynamical attractor: Dµ = χ µ , (ς µ,1 , . . . , ς µ,lµ )
µ = 1, . . . , q ,
(2.4)
where lµ is the period of the sequence µ and may vary from one datum to another. Each pattern µ is defined by assigning digital values to all neurons: µ
χ µ = {χi , i = 1, . . . , M} µ,k
ς µ,k = {ςi
, i = 1, . . . , N}
with with
µ
χi ∈ {−1, 1} µ,k
ςi
∈ {−1, 1}.
(2.5)
Road to Chaos in Recurrent Neural Networks
87
3 The Out-Supervised Learning Algorithm This first learning task is straightforward and consists of storing a welldefined data set. It means that each datum stored in the network is fully specified a priori: each external stimulus must be associated with a prespecified limit cycle attractor of the network’s dynamics. By suppressing the external stimulus and defining all the sequences’ periods lµ to 1, this task is reduced to the classical learning task originally proposed by Hopfield: storing pattern in fixed-point attractors of the underlying RNN’s dynamics. The learning task described above turns out to generalize the one proposed by Hopfield. For ease of reading, when patterns are stored in fixed-point attractors, they are noted by ξ µ . 3.1 Introduction: Hopfield’s Autoassociative Model. In the basic Hopfield model (Hopfield, 1982), all connections need to be symmetric, no autoconnection can exist, and the update rule must be synchronous. Hopfield has proven that these constraints are sufficient to define a Lyapunov function H for the system:5 1 wij xi x j . 2 N
H=−
N
(3.1)
i=1 j=1
Each state variation produced by the system’s equation entails a nonpositive variation of H: H ≤ 0. The existence of such a decreasing function ensures convergence to fixed-point attractors. Each local minimum of the Lyapunov function represents one fixed point of the dynamics. These local minima can be used to store patterns. This kind of network is akin to a content-addressable memory since any stored item will be retrieved when the network dynamics is initiated with a vector of activation values sufficiently overlapping the stored pattern.6 In such a case, the network dynamics is initiated in the desired item’s basin of attraction, spontaneously driving the network dynamics to converge to this specific item. The set of patterns can be stored in the network by using the following Hebbian learning rule, which obviously respects the constraints of the Hopfield model (symmetric connections and no autoconnection): 1 µ µ ξi ξ j N p
wij =
wii = 0.
(3.2)
µ=1
5 A Lyapunov function defines a lower-bounded function whose derivative is decreasing in time. 6 This remains valid only when learning a few uncorrelated patterns. In other cases, the network converges to any fixed-point attractor.
88
C. Molter, U. Salihoglu, and H. Bersini
However, this kind of rule leads to drastic storage limitations. An in-depth analysis of the Hopfield model’s storing capacity has been done by Amit et al. (1987) by relying on a mean-field approach and on replica methods originally developed for spin-glass models. Their theoretical results show that these types of networks, when coupled with this learning rule, are unlikely to store more than 0.14 N uncorrelated random patterns. 3.2 Iterative Version of the Hebbian Learning Rule 3.2.1 Learning Fixed Points. A better way of storing patterns is given by an iterative version of the Hebbian rule (Gardner, 1987; for a detailed description of this algorithm, see van Hemmen & Kuhn (1995), and Forrest & Wallace, 1995). The principle of this algorithm is as follows: at each learning iteration, the stability of every nominal pattern ξ µ is tested. Whenever one pattern has not yet reached stability, the responsible neuron i sees its connectivity reinforced by adding a Hebbian term to all the synaptic connections impinging on it, µ
µ
wij → wij + εs ξi ξ j ,
(3.3)
where εs defines the learning rate. All patterns to be learned are repeatedly tested for stability, and once all are stable, the learning is complete. This learning algorithm is incremental since the learning of new information can be done by preserving all information that has already been learned. It has been proved (Gardner, 1987) that by using this procedure, the capacity can be increased up to 2N uncorrelated random patterns. In our model, stored cycles are indexed by the use of external stimuli. These external stimuli are responsible for a modification of the underlying network’s internal dynamics and, consequently, increasing the number of potential attractors, as well as the size of their basins of attraction. The connection weights between the external stimuli and the neurons are learned by adopting the same approach as given in equation 3.3. When one pattern is not yet stable, the responsible neuron i sees its connectivity reinforced by adding a Hebbian term to all of the synaptic connections impinging on it (see equation 3.3), including connections coming from the external stimulus: wik → wik + εb χkµ ξiµ ,
(3.4)
where εb defines the learning rate applied on the external stimulus’s connections and which may differ from εs . In order to not only store the patterns but also ensure sufficient content addressability, we must try to “excavate” the basins of attraction. Two approaches are commonly proposed in the literature. The first aims at getting
Road to Chaos in Recurrent Neural Networks
89
the alignment of the spin of the neuron (+1 or −1) together with its local field to be not just positive (the requirement to ensure stability) but greater than a given minimum bound. The second approach attempts explicitly to enlarge the domains of attraction around each nominal pattern. To do so, the network is trained to associate noisy versions of each nominal pattern with the desired pattern, following a given number of iterations expected to be sufficient for convergence. This second approach is the one adopted in this article. Two noise parameters are introduced to tune noise during the learning phase: the noise imposed on the internal states lns and the noise imposed on the external stimulus lnb .7 3.2.2 Learning Cycles. The learning rule defined in equation 3.3 naturally leads to asymmetrical weights’ values. It is no longer possible to define a Lyapunov function for this system, the main consequence being to include cycles in the set of “memory bags.” As for fixed points, the network can be trained to converge to such limit cycles attractors by modifying equations 3.3 and 3.4. This time, the weights wij and wis are modified according to the expected value of neuron i at time t + 1 and the expected value of neuron j and of the external stimulus at time t: µ,ν+1
ςj
µ,ν+1
χi ,
wij → wij + εs ςi
wis → wis + εb ςi
µ,ν µ
(3.5)
where εs and εb , respectively, define the learning rate and the stimulus learning rate. This time-asymmetric Hebbian rule can be related to the asymmetric time window of synaptic plasticity observed in pyramidal cells during tetanic stimulus conditions (Levy & Steward, 1983; Bi & Poo, 1999). 3.2.3 Adaptation of the Algorithm to Continuous Activation Functions. When working with continuous state neurons, we are no longer working with cyclic attractors but with limit cycle attractors. As a consequence, the µ
µ 7 A noisy pattern ξ lns is obtained from a pattern to learn ξ by choosing a set of lns items, randomly chosen among all the initial pattern’s items, and by switching their sign. Thus, d H (lns ) defines the Hamming distance between the two patterns:
d H (lns ) =
N i=1
di = 0 di
where
di = 1
µ
µ
if
ξi ξi,lns = 1
(items equals)
if
µ µ ξi ξi,lns
(items differents)
= −1
In this article, the Hamming distance is normalized to range in [0, 100].
.
90
C. Molter, U. Salihoglu, and H. Bersini
algorithm needs to be adapted in order to prevent the learned data from vanishing after a few iterations. One step iteration does not guarantee longterm stability of the internal states since observations are performed using a filter layer. The adaptation consists of waiting a certain number of cycles before testing the correctness of the obtained attractor. The halting test for discrete neurons is given by the following equation, ∀µ, ν if (x(0) = ς µ,ν ) → x(1) = ς µ,ν+1
⇒ stop;
(3.6)
for continuous neurons, it becomes ∀µ, ν if (x(0) = ς µ,ν ) → (o(1) = o(lµ + 1) = . . . = o(T ∗ lµ + 1) = ς µ,ν+1 )
⇒ stop,
(3.7)
where T is a further parameter of our algorithm (set to 10 in all the experiments) and o is the output filtered vector defined in equation 1.2. 4 In-Supervised Learning algorithm As shown in section 5, the encoding capacities of networks learning in the out-supervised way described above are fairly good. However, these results are disappointing compared to the potential capacity observed in random networks (Molter & Bersini, 2003). Moreover, section 6 shows how learning too many cycle attractors in an out-supervised way leads to the kind of blackout catastrophe similar to the ones observed in fixed-point attractor networks (Amit, 1989). Here, the network’s background regime becomes fully chaotic and similar to white noise. Learning prespecified data appears to be too constraining for the network. This section introduces an in-supervised learning algorithm, more plausible from a biological point of view: the network has to learn to react to an external stimulus by cycling through a sequence that is not specified a priori but is obtained following an internal mechanism. In other words, the information is generated through the learning procedure that assigns a meaning to each external stimulus. There is an important tradition of less supervised learning in neural nets since the seminal work of Kohonen and Grossberg. This tradition enters in resonance with writings in cognitive psychology and constructivist philosophy (among others, Piaget, 1963; Varela et al., 1991; Erdi, 1996; and Tsuda, 2001). The algorithm presented now can be seen as a dynamical extension in the spirit of this preliminary work where the coding scheme relies on cycles instead of single neurons. 4.1 Description of the Learning Task. The main characteristic of this new algorithm lies in the nature of the learned information: only the external stimuli are known before learning. The limit cycle attractor associated
Road to Chaos in Recurrent Neural Networks
91
with an external stimulus is identified through the learning procedure: the procedure enforces a mapping between each stimulus of the data set and a limit cycle attractor of the network’s inner dynamic, whatever it is. Hence, the aim of the learning procedure is twofold: first, it proposes a dynamical way to code the information (i.e., to associate a meaning with the external stimuli), and then it learns it (through a classical supervised procedure). Before mapping, the data set is defined by Dbm (bm standing for “before mapping”): 1 q , . . . , Dbm Dbm = Dbm
µ
Dbm = χ µ
µ = 1, . . . , q .
(4.1)
After mapping, the data set becomes 1 q , . . . , Dam Dam = Dam
µ Dam = χ µ , (ς µ,1 , . . . , ς µ,lµ )
µ = 1, . . . , q ,
(4.2)
where lµ is the period of the learned cycle. 4.2 Description of the Algorithm. Inputs of this algorithm are a data set Dbm to learn (see equation 4.1), and a range [mincs , maxcs ] that defines the bounds of the accepted periods of the limit cycle attractors coding the information. This algorithm can be broken down in three phases that are constantly iterated until convergence: 1. Remapping stimuli into spatiotemporal cyclic attractors. During this phase, the network is presented with an external stimulus that drives it into a temporal attractor outputµ (which can be chaotic). Since the idea is to constrain the network as little as possible, a meaning is assigned to the stimulus by associating it with a close cyclic version of the attractor outputµ , called cycleµ , an original8 attractor respecting the periodic bounds [mincs , maxcs ]. This step is iterated for all the stimuli of the data set; 2. Learning the information. Once a new attractor cycleµ has been proposed for each stimulus, it is tentatively learned by means of a supervised procedure. However, to avoid constraining the network too much, only a limited number of iterations are performed, even if no convergence has been reached. 3. End test. if all stimuli are successfully associated with different cyclic attractors, the in-supervised learning stops, otherwise the whole process is repeated.
8 Original means that each pattern composing the limit cycle attractor must be different from all other patterns of all cycleµ .
92
C. Molter, U. Salihoglu, and H. Bersini
Table 1: Pseudocode of the In-Supervised algorithm. A/ re-mapping stimuli to spatiotemporal cyclic attractors µ
1. ∀ data Dbm to learn, µ = 1, . . . , q a. Stimulation of the network i. The stimulus is initialized with χ µ . ii. The states are initialized with ς µ,1 which are obtained from the previous iteration (or random at first). iii.To skip the transient, the network is simulated some steps. iv. The states ς µ,i crossed by the network’s dynamics are stored in outputµ . b. Proposal of an attractor code i. If outputµ is not a cycle of period lesser or equal to maxcs ⇒ compression process (see Table 2): outputµ → cycleµ . ii. If outputµ is not a cycle of period greater or equal to mincs ⇒ extension process (see Table 2): outputµ → cycleµ . iii.If a pattern contained in cycleµ is too correlated with any other patterns, this pattern is slightly modified to make it “original”. µ
2. The data set Dtemp is created where Dtemp = (χ µ , cycleµ ) B/ learning the information 3. Using an iterative supervised learning algorithm, the data set Dtemp is tentatively learned a limited number of time steps. C/ end test µ
4. If ∀ data Dbm , the network iterates through valid limit cycle attractors ⇒ finished else goto 1
The pseudocode of this in-supervised learning algorithm is described in Table 1. The algorithm presented here learns to map external stimuli into the network’s cyclic attractors in a very unconstrained way. Provided these attractors are derived from the ones spontaneously proposed by the network, a form of supervision is still needed (how to create an original cycle and how to know if the proposed cycle is original). However, recent neurophysiological observations have shown that in some cases, synaptic plasticity is effectively guided by a supervised procedure (Gutfreund, Zheng, & Knudsen, 2002; Franosch, Lingenheil, & van Hemmen, 2005). The supervised procedure proposed here to create and test original cycles (see Table 2) has no real biological grounding. This part of the algorithm might be improved in the future to reinforce its biological likelihood. Nevertheless, we think that whatever supervised procedure is chosen, the part of the study concerned with the capacity of the network and the background chaos remains valid.
Road to Chaos in Recurrent Neural Networks
93
Table 2: Routines Used When Creating an Attractor Code. Compression process Since outputµ = ς µ,1 , . . . , ς µ,maxcs , . . . is a cycle of period greater than maxcs (it could even be chaotic): ⇒ cycleµ is generated by truncating outputµ : cycleµ = ς µ,1 , . . . , ς µ, ps (the compression period ps ∈ [mincs , maxcs ] is another parameter). Extension process Since outputµ = ς µ,1 , . . . , ς µ,q with q < mincs :
⇒ cycleµ is generated by duplicating outputµ : cycleµ = ς µ,1 , . . . , ς µ,q , ς µ,1 , . . . such that size(cycleµ ) = pe , where pe ∈ [mincs , maxcs ] is the extension period, another parameter, randomly given or fixed;
5 Performance Regarding the Encoding Capacity Encoding capacities of networks having learned by using the outsupervised algorithm and networks having learned by using the insupervised algorithm are compared in Figure 2. Here, to ease the comparison, we have enforced the in-supervised algorithm to learn mappings to cycles of fixed period. However, because this algorithm enables mapping stimuli to limit cycle attractors of variable period (between mincs and maxcs ), better performance could be achieved. The number of iterations required to learn the specified data set is plotted as a function of the size of this data set.9 We have seen in section 3.2 that it is possible to enforce greater content addressability by training the network to associate noisy versions of each nominal pattern with the desired limit cycle attractors. Here, the two parameters enforcing the content addressability (lns and lnb ) have been set to 0 in order to display the maximum encoding capacities. As expected, the in-supervised learning procedure, by letting the network decide how to map the stimuli, outperforms its out-supervised counterpart where all the mappings are specified before the learning.10 Robustness to noise is compared in Figure 3. Content addressability has been enforced as much as possible by means of the learning-with-noise procedure. By computing the normalized Hamming distance between the obtained attractor and the initially learned attractor, robustness to noise 9
Each iteration represents 100 weight modifications defined in equation 3.6. The mapping stimuli to fixed-point attractors are not compared, since to obtain relevant results, we have to enforce content addressability by learning with noise. See also section 6.1.1. 10
50
C. Molter, U. Salihoglu, and H. Bersini
20
30
40
5
10
10
10
3
3
5
0
Number of iterations
94
0
10
20
30
40
50
60
70
Number of stimuli mapped to cycles 3, 5 and 10
Figure 2: Encoding capacities’ comparison between out-supervised learning (filled circles) and in-supervised learning (squares). The number of iterations required to learn the specified data set is plotted according to data set size. Results obtained from three different types of data sets are represented: data sets composed of stimuli associated with cycles of period 3, 5, and 10. Each value has been obtained from statistics of 100 different data sets.
indicates how well the dynamics is able to retrieve the original association when both the external stimulus and the network’s internal state are perturbed by noise. The two plots in Figure 3 show the normalized Hamming distance between the stored sequence and the recovered sequence according to the noise injected in the external stimulus and the initial states. This noise is quantified by computing the initial overlaps m0b and m0s .11 The noise injected in the network’s internal state is measured by computing the smallest overlap between the internal state and every pattern composing the limit cycle attractor associated with the external stimulus. In-supervised learning, compared to out-supervised learning, considerably improves robustness. For instance, when learning 6 period–4 cycles, in the in-supervised case (see Figure 3B), stored sequences are very robust to noise, while in the out-supervised case (see Figure 3A), content addressability is no longer observed. By adding a tiny amount of noise, the correlation between the stored sequence and the recovered one goes to zero (the normalized Hamming distance is equal to 50). These figures also show that in the in-supervised case, the external stimulus plays a stronger role in indexing the stored data.
11
The overlap mµ between two patterns is given by mµ =
N 1 µ,idcyc µ,idcyc ςi ςi,noisy . N i=1
Two patterns that match perfectly have an overlap equal to 1; it equals −1 in the opposite case. The overlap m and the Hamming distance dh are related by dh = N m+1 2 .
Road to Chaos in Recurrent Neural Networks
95
Figure 3: Content addressability obtained for the learned networks. The normalized Hamming distance between the expected sequence and the obtained sequence is plotted according to the noise injected in both the initial states (m0s ) and the external stimulus (m0b ). Results for the out-supervised Hebbian algorithm (A) are compared with the ones obtained for the in-supervised Hebbian algorithm (B).
One can say that the in-supervised learning mechanism implicitly supplies the network with an important robustness to noise. This could be explained by the following considerations. First, the coding attractors are derived from the ones spontaneously proposed by the network. Second, they need to have large and stable basins of attraction in order to resist the process of trial, error, and adaptation that characterizes this iterative remapping procedure. 6 Dynamical Analysis Tests performed here aim at analyzing the so-called background or spontaneous dynamics obtained when the network is presented with external stimuli other than the learned ones. In the first two sections, analyses are performed to quantify the presence and proportion of chaotic dynamics. Section 6.1 uses two measures: the network’s mean Lyapunov exponent and the probability of having chaotic dynamics. Section 6.2 uses symbolic analyses12 to quantify the proportion of the different types of symbolic attractors found. It shows how chaotic dynamics help to prevent the proliferation of spurious data. Qualitative analyses of the nature of the chaotic dynamics obtained are performed in the last section (section 6.3) by means of classical tools such as return maps,
12 Symbolic analyses are performed by analyzing the output of the filter layer instead of directly analyzing the network’s internal state.
96
C. Molter, U. Salihoglu, and H. Bersini
power spectra, and Lyapunov spectra. Furthermore an innovative measure is developed to assess the presence of frustrated chaotic dynamics. 6.1 Quantitative Dynamical Analysis: The Background Regime. Quantitative analyses are performed here using two kinds of measures: the mean Lyapunov exponent and the probability of having chaotic dynamics. Both measures come from statistics on 100 learned networks. For each learned network, dynamics obtained from randomly chosen external stimuli and initial states have been tested (1000 different configurations). These tests aim at analyzing the so-called background or spontaneous dynamics obtained by stimulating the network with external stimuli and initial states different from the learned ones. The computation of the first Lyapunov exponent is done empirically by computing the evolution of a tiny perturbation (renormalized at each time step) performed on an attractor state (for more details, see, Wolf, Swift, Swinney, & Vastano, 1984; Albers, Sprott, & Dechert, 1998). This exponent indicates how fast the system’s history is lost. While stable dynamics have negative Lyapunov exponents, Lyapunov exponents bigger than 0 are the signature of chaotic dynamics, and when the biggest Lyapunov exponent is very high, the system’s dynamics may be seen as equivalent to a turbulent state. Here, to distinguish chaotic dynamics from quasi-periodic regimes, dynamics is said to be chaotic if the Lyapunov exponent is greater than a given value slightly above zero (in practice, if the Lyapunov exponent is greater than 0.01). Obtained results are made more meaningful by comparing global dynamics of learned networks (indexed with L ) with global dynamics of networks obtained randomly (without learning). However, to constrain random networks to behave as a kind of “surrogate network” (indexed with S ), they must have the same mean µ and the same standard deviation σ for their weight distributions (neuron-neuron and stimulus-neuron): µ wijL = µ wijS µ wbiL = µ wbiS µ wiiL = µ wiiS σ wijL = σ wijS σ wbiL = σ wbiS σ wiiL = σ wiiS .
(6.1)
These surrogate random networks enable the measurement of the weight distribution’s impact. The random distribution has been chosen gaussian. 6.1.1 Network Stabilization Through Hebbian Learning of Static Patterns. Figure 4A shows the mean Lyapunov exponents and probabilities of having chaos for networks with learned data encoded in fixed-point attractors only. To avoid weight distribution to converge to the identity matrix (if ∀i, j : wii > 0 and wij ≈ 0, all patterns are learned but without robustness), noise has been added during the training period. We can observe that:
Figure 4: Mean Lyapunov exponents and probability of chaos in RNNs, as a function of the learning set's size. Out-supervised learned networks (filled circles) are compared with in-supervised learned networks (squares). In both cases, comparisons are performed with their surrogate random networks (in gray). Since these results come from statistics, mean and standard deviation are plotted. (A) Network stabilization can be observed after Hebbian learning of static patterns. (B) Chaotic dynamics appear after mapping stimuli to period 3 cycles.
• The encoding capacity (with content addressability) of in-supervised learned networks is greater than the maximum encoding capacities (without content addressability) obtained from theoretical results (Gardner, 1987). The explanation lies in the presence of the input layer, which modifies the network's internal dynamics and enables other attractors to appear.
• Mean Lyapunov exponents of learned networks are always negative. When the learning tasks are made more complex, the networks are more and more constrained. The mean Lyapunov exponent increases but still remains below a negative upper bound.
• It is nearly impossible to find chaotic dynamics in the spontaneous regimes of learned networks, even after intensive learning of static patterns.
• Surrogate random networks show very different results, clearly indicating that the learned weight distribution is anything but random.
• The same trends are observed when relying on both the out-supervised algorithm and the in-supervised algorithm.
The iterative Hebbian learning algorithm described here is likely to keep all connections approximately symmetric, preserving the stability of learned networks, while this is no longer the case for random networks.

6.1.2 Hebbian Learning of Cycles: A New Road to Chaos. If learning data in fixed-point attractors stabilizes the network, learning sequences in limit cycle attractors leads to diametrically opposed results. Figure 4B compares the presence of chaotic dynamics in 25-neuron networks learned with different data sets of period 3 cycles. We can observe that:
• Networks learned through the out-supervised algorithm become more and more chaotic as the learning task is intensified. At the end, the probability of falling into chaotic dynamics is equal to one, with high mean Lyapunov exponents. This kind of dynamics is ergodic and turns out to be reminiscent of the "blackout catastrophe" observed in fixed-point attractor networks (Amit, 1989; Fusi, 2002). Although the outcome is the same—the network becomes unable to retrieve any of the memorized patterns—here the blackout arises through a progressive increase of the chaotic dynamics.
• The same trend shows up in the in-supervised case; however, even after intensive periods of learning, the networks do not become fully chaotic. The dynamics never turns into an ergodic one: chaotic attractors and limit-cycle attractors coexist.
• Compared to surrogate random networks, in both cases learning contributes to structure the dynamics (at the very least, learned networks have to remember the learned data!). However, when the learning task is intensified, the differences between the random and the learned networks tend to vanish in the out-supervised case.
The absence of full chaos in in-supervised learned networks is explained by the fact that this algorithm is based on a process of trial, error, and adaptation, which provides robustness and prevents full chaos. As the data set to learn grows, learning takes more and more time, but at the same time, the number of remappings increases and forces large basins of stable dynamics. The network becomes more and more constrained, and complex, but not fully chaotic.
Figure 5: Road to chaos observed when learning four cycles of period 7 in a recurrent network of 25 neurons through an iterative supervised Hebbian algorithm (the number of learning iterations is indicated on the x-axis). The network is initialized randomly, and following a given transient, the average state of the network is plotted on the y-axis. The mean Lyapunov exponent demonstrates the growing presence of chaotic dynamics.
Obtaining chaos by learning an increasing number of cycles with a time-asymmetric Hebbian mechanism can be seen as a new road to chaos, illustrated in Figure 5. In contrast with the classical roads shaped by the gradual variation of a control parameter, this new road relies on an external mechanism simultaneously modifying a set of parameters to fulfill an encoding task. When learning cycles, the network is prevented from stabilizing in fixed-point attractors. The more cycles there are to learn, the more the network is externally constrained and the more the regime turns out to be spontaneously chaotic.

6.2 Quantitative Dynamical Analysis: Symbolic Analyses. After learning, when the network is presented with a learned stimulus while its initial state is correctly chosen, the expected spatiotemporal attractor is observed. Section 5 showed that noise tolerance is expected: the external stimulus or the network's internal state can be slightly modified without affecting the result. However, in some cases, the network fails to "understand" the stimulus, and an unexpected symbolic attractor appears in the output. The question is whether this attractor should be considered spurious data. The intuitive idea is that if the attractor is chaotic or if its period differs from that of the learned data, it is easy to recognize at a glance, and thus to discard. This becomes more difficult if the observed attractor's period is the same as that of the learned attractors. In this case, it is in fact impossible to know whether the information is relevant without comparing it with all the learned data. As a consequence, we define as spurious data an attractor having the same period as the learned data but still different from all of them.
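This classification can be summarized in a small helper. The following is a sketch under our own assumptions (not the authors' code): a detected attractor is represented by its period together with its pattern sequence, or None when no cycle is found, and the 10% tolerance mirrors the Hamming-distance criterion introduced in the next section.

```python
import numpy as np

def classify_attractor(attr, learned_patterns, learned_period, tol=0.10):
    """Label a symbolic attractor as chaotic, out-of-range, retrieved, or
    spurious. `attr` is a (period, sequence) pair, or None when no cycle
    was detected; `learned_patterns` holds the stored sequences, all of
    period `learned_period`. Cyclic shifts are ignored for brevity."""
    if attr is None:                      # no periodicity detected
        return "chaotic"
    period, seq = attr
    if period != learned_period:          # period differs from the learned data
        return "out-of-range"
    seq = np.asarray(seq)
    # normalized Hamming distance to the closest learned sequence
    d = min(float(np.mean(seq != np.asarray(p))) for p in learned_patterns)
    return "retrieved" if d <= tol else "spurious"
```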
Figure 6: Proportion of the different symbolic attractors obtained during the spontaneous activity of artificial neural networks learned using, respectively, the out-supervised algorithm (A) and the in-supervised algorithm (B). Two types of mappings are analyzed: stimuli mapped to fixed-point attractors and stimuli mapped to spatiotemporal attractors of period 4.
As a result, two classification schemes are used to differentiate the symbolic attractors obtained. The first scheme is based on periods. This criterion distinguishes among chaotic attractors, periodic attractors whose periods differ from the learned ones (named out-of-range attractors), and periodic attractors having the same period as the learned ones. The aim of the second classification scheme is to differentiate these last attractors, based on the normalized Hamming distance between them and the closest learned data (i.e., the attractors at a distance less than 10%, less than 20%, and so on).

Figure 6 shows the proportion of the different types of attractors found as the size of the learned data set is increased. Results obtained with the in-supervised and the out-supervised algorithms while mapping stimuli to spatiotemporal attractors of various periods are compared. For each data set size, statistics are obtained from 100 different learned networks, and each time, 1000 symbolic attractors obtained from random stimuli and internal states have been classified. When stimuli are mapped into fixed-point attractors, the proportion of chaotic attractors and of out-of-range attractors falls rapidly to zero. In contrast, the number of spurious data increases drastically.
Figure 7: Proportions of the different types of attractors observed in in-supervised learned networks while the noise injected in a previously learned stimulus is varied. Networks' initial states were randomly initialized. (A) Thirty stimuli are mapped to fixed-point attractors. (B) Thirty stimuli are mapped to period 4 cycles. One hundred learned networks, each with 1000 configurations, have been tested.
In fact, both learning procedures tend to stabilize the network by enforcing symmetric weights and positive auto-connections.13 When stimuli are mapped to cyclic attractors, both learning procedures lead to an increasing number of chaotic attractors. The more the network learns, the more its spontaneous regime tends to be chaotic. Still, we have to differentiate the two learning procedures. Out-supervised learning, due to its very constraining nature, drives the network into a fully chaotic state that prevents it from learning more than 13 period 4 cycles. The less constraining and more natural in-supervised learning task leads to different behavior. This time, the network does not become fully chaotic, and the storage capacity is enhanced by a factor as large as 4. Unfortunately, the number of spurious data also increases, and noticeable proportions of spurious data are visible when the network is fed with random stimuli.

The aim of Figure 7 is to analyze the proportion of the different types of attractors observed when the external stimulus is progressively shifted from a learned stimulus to a random one. Again, the network's initial state is set completely at random. Two types of in-supervised mappings are compared: stimuli mapped to fixed-point attractors and stimuli mapped to attractors of period 4. When unnoised learned stimuli are presented to the network, striking results appear. For fixed-point learned networks, in more than 80% of the observations, we are facing spurious data. Indeed, the distance between the observed attractors and the expected one is larger than 10%, and they have the same period (period 1).

13 If no noise is injected during the learning procedure, the network converges to the identity matrix.
By contrast, for period 4 learned networks, the probability of recovering the perfect cycle increases to 56%. Moreover, 62% of the obtained cycles are at a distance less than 10% from the expected ones, and the amounts of chaotic and out-of-range attractors are, respectively, 23% and 8%. The space left for spurious data becomes less than 5%. Thus, if we imagine a procedure where the network's state is slightly modified whenever a chaotic trajectory is encountered, until a cyclic attractor is found, we are nearly certain of obtaining the correct mapping.

Because of the pervasive amount of spurious data in networks trained on fixed-point attractors, it becomes difficult to imagine these networks as working memories. By contrast, the presence of chaotic attractors helps to prevent the proliferation of spurious data in networks trained on cyclic attractors, while good storage capacity is possible by relying on in-supervised mappings.

6.3 Qualitative Dynamical Analysis: Nature of Chaos. The preceding section gave quantitative analyses of the presence of chaotic dynamics in learned networks. This section introduces a more qualitative description of the different types of chaotic regimes encountered in these networks. Numerous techniques exist to characterize chaotic dynamics (Eckmann & Ruelle, 1985). In the first part of this section, three well-known tools are used: chaotic dynamics are characterized by means of return maps, power spectra, and analysis of their Lyapunov spectrum (a sketch for computing the first two of these diagnostics follows the list below). The computation of the Lyapunov spectrum is performed through Gram-Schmidt reorthogonalization of the evolved system's Jacobian matrix (which is estimated at each time step from the system's equations). This method is detailed in Wolf et al. (1984). In the second part of this section, to obtain a better understanding of the notion of frustrated chaos, a new type of measure is proposed, indicating how chaotic dynamics is built on nearby limit cycle attractors.

6.3.1 Preliminary Analyses. From return maps and power spectra analysis, three types of chaotic regimes have been identified in these networks. Figure 8 shows the power spectra and the return maps of these three generic types. They are labeled here white noise, deep chaos, and informative chaos. For each of them, a return map has been plotted for one particular neuron and for the network's mean signal. We observe that:
• White noise. The power spectrum of this type of chaos (see Figure 8A) shows a total lack of structure; all the frequencies are represented nearly equally. The associated return map is completely filled, with a bigger density of points at the edges, indicating the presence of saturation. No useful information can be obtained from such chaos.
Figure 8: Return maps of the network’s mean signal (upper figures), return maps of a particular neuron (center figures), and power spectra of the network’s mean signal (lower figures) for chaotic regimes encountered in random (A) and learned (B, C) networks. The Lyapunov exponent of the corresponding dynamics is given.
• Deep chaos. The power spectrum of this type of chaos shows more structure but is still very similar to white noise (see Figure 8B). The return map, however, is very similar to the previous one and does not seem to provide any useful information.
• Informative chaos. The power spectrum looks very informative (see Figure 8C). Different peaks show up, indicating the presence of nearby limit cycles. The associated return map shows more structure, but still with the presence of saturation.
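The first two of these diagnostics are straightforward to compute. The sketch below (our own illustration, not the authors' code) builds the return map and the power spectrum of the network's mean signal; a flat spectrum suggests white-noise-like chaos, while isolated peaks betray the nearby limit cycles of informative chaos.

```python
import numpy as np

def chaos_diagnostics(mean_signal):
    """Return map and power spectrum of a scalar signal."""
    s = np.asarray(mean_signal, dtype=float)
    return_map = np.stack([s[:-1], s[1:]], axis=1)    # points (s(n), s(n+1))
    power = np.abs(np.fft.rfft(s - s.mean())) ** 2    # power spectrum
    freqs = np.fft.rfftfreq(s.size)
    return return_map, freqs, power
```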
The most relevant result is the possibility of predicting the type of chaos preferentially encountered by knowing the learning procedure used, as well as the size of the data set to be learned. Chaotic dynamics encountered in surrogate random networks are nearly always similar to white noise. This explains the large mean Lyapunov exponent obtained for these networks.
Figure 9: The first 10 Lyapunov exponents. (A) After out-supervised learning of 10 period 4 cycles. (B) After in-supervised learning of 20 period 4 cycles. Each time, 10 different learning sets have been tested. For each obtained network, 100 chaotic dynamics have been studied.
When learning by means of the out-supervised procedure, different types of chaotic regimes appear depending on the data set's size. For small data sets, informative chaos and deep chaos coexist. As the data set's size increases, the more we want to learn, the more the chaotic regimes go from informative chaos to an almost white-noise one. Learning too much information in an out-supervised way leads to networks showing very uninformative deep chaos similar to white noise. In other words, having too many competing limit cycle attractors forces the network's dynamics to become almost random, losing its structure and behaving as white noise. All information about hypothetical nearby limit cycle attractors is lost. No more content addressability can be obtained. When learning by means of the in-supervised procedure, chaotic dynamics are nearly always of the third type: very informative chaos shows up. By not predefining the internal representations of the network, this type of learning preserves more structure in the chaotic dynamics.

The dynamical structure of the informative chaos (Figure 8C) reveals the existence of nearby competing attractors. It shows the phenomenon of frustration obtained when the network hesitates between two (or more) nearby cycles, passing from one to another. To provide a better understanding of this type of chaos, Figure 9 compares Lyapunov spectra obtained from deep chaos in out-supervised learned networks and from informative chaos in in-supervised learned networks. From this figure, deep chaos can be related to "hyperchaos" (Rössler, 1983). In hyperchaos, more than one positive Lyapunov exponent is expected, and these exponents are expected to be high. By contrast, Lyapunov spectra obtained in informative chaos after in-supervised learning are characteristic of chaotic itinerancies (Kaneko & Tsuda, 2003). In this type of chaos, the dynamics is attracted to learned memories, which is indicated by negative Lyapunov exponents, while at the same time it escapes from them, which is indicated by the presence of at least one positive Lyapunov exponent.
Figure 10: Probability of the presence of the nearby limit cycle attractors in chaotic dynamics (x-axis). By slowly shifting the external stimulus from a previously learned stimulus (region a) to another learned stimulus (region b), the network's dynamics goes from the limit cycle attractor associated with the former stimulus to the limit cycle attractor associated with the latter. The probabilities of the presence of cycle a and cycle b are plotted (black). The Lyapunov exponent of the obtained dynamics is also plotted (gray). (A) After out-supervised learning, hyperchaos is observed in between the learned attractors. (B) The same conditions lead to frustrated chaos after in-supervised learning.
This positive Lyapunov exponent must be only slightly positive so as not to completely erase the system's history, thus keeping traces of the learned memories. This regime of frustration is reinforced by some modes of neutral stability, indicated by the presence of many exponents whose values are close to zero.

6.3.2 The Frustrated Chaos. The main difference between frustrated chaos (Bersini & Sener, 2002) and similar forms of chaos well known in the literature, like intermittency chaos (Pomeau & Manneville, 1980) and chaotic itinerancy (Ikeda, Matsumoto, & Otsuka, 1989; Kaneko, 1992; Tsuda, 1992), lies in the way it is obtained and in the possible characterization of the dynamics in terms of the encoded attractors. While all of these very structured chaotic regimes are characterized by strong cyclic components among which the dynamics randomly itinerates, the key difference lies in the transparency and the exploitation of those cycles. In frustrated chaos, those cycles are the basic stages of the road to chaos. It is by forcing those cycles in the network (by tuning the connection parameters) that the chaos finally shows up. One way of forcing these cycles is the time-asymmetric Hebbian learning adopted in this letter. Frustrated chaos is a dynamical regime that appears in a network when the global structure is such that local connectivity patterns responsible for stable and meaningful oscillatory behaviors are intertwined, leading to mutually competing attractors and unpredictable itinerancy among brief appearances of these attractors.
To obtain a better understanding of this type of chaos, Figure 10 compares hyperchaos and frustrated chaos through the probability of presence of the nearby limit cycle attractors in the chaotic dynamics. In the figure on the left, the network has learned to associate 2 data with limit cycle attractors of period 10 by using the out-supervised Hebbian algorithm. This algorithm strongly constrains the network, and as a consequence, chaotic dynamics appear very uninformative: by shifting the external stimulus from one attractor to another, strong chaos shows up in between (indicated by the Lyapunov exponent), where any information concerning these two limit cycle attractors is lost. In contrast, when mapping four stimuli to period 4 cycles using the in-supervised algorithm (Figure 10, right), when driving the dynamics by shifting the external stimulus from one attractor to another, the chaos encountered on the road appears much more structured: small Lyapunov exponents and a strong presence of the nearby limit cycles are easy to observe, shifting progressively from one attractor to the other.

7 Conclusion

This letter studies the possibility of encoding information in spatiotemporal cyclic attractors of a network's internal state. Two versions of a time-asymmetric Hebbian learning algorithm are proposed. The first is a classical out-supervised version, where the cyclic attractors are predefined and need to be replicated by the internal dynamics of the network. The second is an in-supervised version, where the cyclic attractors to be associated with the external stimuli are left unprescribed and derived from the ones spontaneously proposed by the network.

First, experimental results show that the encoding performances obtained in the in-supervised case are greatly enhanced compared to the ones obtained in the out-supervised case. This is intuitively understandable since in-supervised learning is much less constraining for the network. Second, experimental results analyze the background dynamical regime of the network: the dynamics observed when unlearned stimuli are presented to the network. It is empirically shown that the more information the network has to store in its attractors, the more its spontaneous dynamical regime tends to be chaotic. Chaos is in fact the biggest pool of potential cyclic attractors. Asymmetric Hebbian learning can thus be seen as an alternative road to chaos. Adopting out-supervised learning and increasing the amount of information to store, the background chaos spreads widely and adopts a very unstructured shape similar to white noise. The reason lies in the constraints imposed by this learning process, which become harder and harder for the network to satisfy. In contrast, in-supervised learning, by being more "respectful" of the network's intrinsic dynamics, maintains much more structure in the obtained chaos. It is still possible to observe in the chaotic regime the traces of the learned attractors. Following
our previous considerations, we call this complex but still very structured and informative regime frustrated chaos. This behavior can be related to experimental findings where ongoing cortical activity has been shown to encompass a set of dynamically switching cortical states that correspond to stimulus-evoked activity (Kenet et al., 2003).

Symbolic investigations have been performed on the spatiotemporal attractors obtained when the network is in a random state and presented with random or noisy stimuli. Different types of attractors are observed in the output. Spurious data have been defined as attractors having the same period as the learned data but still different from all of them. This follows the intuitive idea that the other kinds of attractors (like chaotic attractors) are easily recognizable at a glance and unlikely to be confused with learned data. In the case of spurious data, it is impossible to know whether the observed attractors bear useful information without comparing them with all the learned data. Networks where the information is coded in fixed-point attractors are easily corrupted with spurious data, which makes their exploitation very delicate. By contrast, when the information is stored in cyclic attractors using in-supervised learning, chaotic attractors appear as the background regime of the net. They are likely to play a beneficial role by preventing the proliferation of spurious data and helping the recovery of the correct mapping: if a previously learned stimulus is presented to the network, either the network will find the correct mapping, or it will iterate through a chaotic trajectory.

While not as easy to interpret and "engineerize" as its classical supervised counterpart, in-supervised learning makes more sense when adopting both a biological and a cognitive perspective. Biological systems propose their own way to treat external impacts by slightly perturbing their inner workings. Here the information is still meaningful to the extent that the external impact is associated in an unequivocal way with an attractor. As the famous Colombian neurophysiologist Rodolfo Llinás (2001) likes to say: "A person's waking life is a dream modulated by the senses."
References

Albers, D., Sprott, J., & Dechert, W. (1998). Routes to chaos in neural networks with random weights. International Journal of Bifurcation and Chaos, 8, 1463–1478.
Amari, S. (1972). Learning pattern sequences by self-organizing nets of threshold elements. IEEE Transactions on Computers, 21, 1197–1206.
Amari, S. (1977). Neural theory of association and concept-formation. Biological Cybernetics, 26, 175–185.
Amari, S., & Maginu, K. (1988). Statistical neurodynamics of associative memory. Neural Networks, 1, 63–73.
Amit, D. (1989). Modeling brain function. Cambridge: Cambridge University Press.
Amit, D. (1995). The Hebbian paradigm reintegrated: Local reverberations as internal representations. Behavioral and Brain Sciences, 18, 617–657.
Amit, D., & Brunel, N. (1994). Learning internal representations in an attractor neural network with analogue neurons. Network: Computation in Neural Systems, 6, 359–388.
Amit, D., & Fusi, S. (1994). Learning in neural networks with material synapses. Neural Computation, 6, 957–982.
Amit, D., Gutfreund, H., & Sompolinsky, H. (1987). Statistical mechanics of neural networks near saturation. Ann. Phys., 173, 30–67.
Amit, D., & Mongillo, G. (2003). Spike-driven synaptic dynamics generating working memory states. Neural Computation, 15, 565–596.
Babloyantz, A., & Lourenço, C. (1994). Computation with chaos: A paradigm for cortical activity. Proceedings of the National Academy of Sciences, 91, 9027–9031.
Bersini, H. (1998). The frustrated and compositional nature of chaos in small Hopfield networks. Neural Networks, 11, 1017–1025.
Bersini, H., & Sener, P. (2002). The connections between the frustrated chaos and the intermittency chaos in small Hopfield networks. Neural Networks, 15, 1197–1204.
Bi, G., & Poo, M. (1999). Distributed synaptic modification in neural networks induced by patterned stimulation. Nature, 401, 792–796.
Bliss, T., & Lomo, T. (1973). Long-lasting potentiation of synaptic transmission in the dentate area of the anaesthetized rabbit following stimulation of the perforant path. Journal of Physiology, 232, 331–356.
Brunel, N., Carusi, F., & Fusi, S. (1997). Slow stochastic Hebbian learning of classes of stimuli in a recurrent neural network. Network: Computation in Neural Systems, 9, 123–152.
Dauce, E., Quoy, M., Cessac, B., Doyon, B., & Samuelides, M. (1998). Self-organization and dynamics reduction in recurrent networks: Stimulus presentation and learning. Neural Networks, 11, 521–533.
Domany, E., van Hemmen, J., & Schulten, K. (Eds.). (1995). Models of neural networks (2nd ed.). Berlin: Springer.
Eckmann, J., & Ruelle, D. (1985). Ergodic theory of chaos and strange attractors. Reviews of Modern Physics, 57(3), 617–656.
Erdi, P. (1996). The brain as a hermeneutic device. Biosystems, 38, 179–189.
Forrest, B., & Wallace, D. (1995). Models of neural networks (2nd ed.). Berlin: Springer.
Franosch, J.-M., Lingenheil, M., & van Hemmen, J. (2005). How a frog can learn what is where in the dark. Physical Review Letters, 95, 1–4.
Freeman, W. (2002). Biocomputing. Norwell, MA: Kluwer.
Fusi, S. (2002). Hebbian spike-driven synaptic plasticity for learning patterns of mean firing rates. Biological Cybernetics, 87, 459–470.
Gardner, E. (1987). Maximum storage capacity in neural networks. Europhysics Letters, 4, 481–485.
Gardner, E., & Derrida, B. (1989). Three unfinished works on the optimal storage capacity of networks. J. Physics A: Math. Gen., 22, 1983–1994.
Grossberg, S. (1992). Neural networks and natural intelligence. Cambridge, MA: MIT Press.
Guillot, A., & Dauce, E. (Eds.). (2002). Approche dynamique de la cognition artificielle. Paris: Hermès Science.
Gutfreund, Y., Zheng, W., & Knudsen, E. I. (2002). Gated visual input to the central auditory system. Science, 297, 1556–1559.
Hansel, D., & Sompolinsky, H. (1996). Chaos and synchrony in a model of a hypercolumn in visual cortex. Journal of Computational Neuroscience, 3, 7–34.
Hebb, D. (1949). The organization of behavior. New York: Wiley-Interscience.
Hopfield, J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences USA, 79, 2554–2558.
Ikeda, K., Matsumoto, K., & Otsuka, K. (1989). Maxwell-Bloch turbulence. Progress of Theoretical Physics, 99(Suppl.), 295–324.
Kaneko, K. (1992). Pattern dynamics in spatiotemporal chaos. Physica D, 34, 1–41.
Kaneko, K., & Tsuda, I. (2003). Chaotic itinerancy. Chaos: Focus Issue on Chaotic Itinerancy, 13(3), 926–936.
Kenet, T., Bibitchkov, D., Tsodyks, M., Grinvald, A., & Arieli, A. (2003). Spontaneously emerging cortical representations of visual attributes. Nature, 425, 954–956.
Kohonen, T. (1982). Self-organized formation of topologically correct feature maps. Biological Cybernetics, 43, 59–69.
Levy, W., & Steward, O. (1983). Temporal contiguity requirements for long term associative potentiation/depression in the hippocampus. Neuroscience, 8, 791–797.
Llinás, R. (2001). I of the vortex: From neurons to self. Cambridge, MA: MIT Press.
Molter, C., & Bersini, H. (2003). How chaos in small Hopfield networks makes sense of the world. In Proceedings of the International Joint Conference on Neural Networks Conference. Piscataway, NJ: IEEE Press.
Molter, C., Salihoglu, U., & Bersini, H. (2005a). Introduction of a Hebbian unsupervised learning algorithm to boost the encoding capacity of Hopfield networks. In Proceedings of the International Joint Conference on Neural Networks—IJCNN Conference, Montreal. Los Alamitos, CA: IEEE Computer Society Press.
Molter, C., Salihoglu, U., & Bersini, H. (2005b). Learning cycles brings chaos in continuous Hopfield networks. In Proceedings of the International Joint Conference on Neural Networks—IJCNN Conference. Los Alamitos, CA: IEEE Computer Society Press.
Nicolis, J., & Tsuda, I. (1985). Chaotic dynamics of information processing: The "magic number seven plus minus two" revisited. Bulletin of Mathematical Biology, 47, 343–365.
Omlin, C. (2001). Understanding and explaining DRN behavior. In K. Kremer (Ed.), A field guide to dynamical recurrent networks. Piscataway, NJ: IEEE Press.
Pasemann, F. (2002). Complex dynamics and the structure of small neural networks. Network: Computation in Neural Systems, 13(2), 195–216.
Piaget, J. (1963). The psychology of intelligence. New York: Routledge.
Pomeau, Y., & Manneville, P. (1980). Intermittent transitions to turbulence in dissipative dynamical systems. Comm. Math. Phys., 74, 189–197.
Rodriguez, E., George, N., Lachaux, J., Renault, B., Martinerie, J., & Varela, F. (1999). Perception's shadow: Long-distance synchronization of human brain activity. Nature, 397, 430–433.
Rössler, O. E. (1983). The chaotic hierarchy. Zeitschrift für Naturforschung, 38a, 788–801.
Sejnowski, T. (1977). Storing covariance with nonlinearly interacting neurons. J. Math. Biol., 4, 303–321.
Skarda, C., & Freeman, W. (1987). How brains make chaos in order to make sense of the world. Behavioral and Brain Sciences, 10, 161–195.
Sompolinsky, H., Crisanti, A., & Sommers, H. (1988). Chaos in random neural networks. Physical Review Letters, 61, 258–262.
Tsuda, I. (1992). Dynamic link of memory-chaotic memory map in nonequilibrium neural networks. Neural Networks, 5, 313–326.
Tsuda, I. (2001). Toward an interpretation of dynamic neural activity in terms of chaotic dynamical systems. Behavioral and Brain Sciences, 24, 793–847.
van Hemmen, J., & Kuhn, R. (1995). Models of neural networks (2nd ed.). Berlin: Springer.
van Vreeswijk, C., & Sompolinsky, H. (1996). Chaos in neuronal networks with balanced excitatory and inhibitory activity. Science, 274(5293), 1724–1726.
Varela, F., Thompson, E., & Rosch, E. (1991). The embodied mind: Cognitive science and human experience. Cambridge, MA: MIT Press.
Wolf, A., Swift, J., Swinney, H., & Vastano, J. (1984). Determining Lyapunov exponents from a time series. Physica D, 16, 285–317.
Received April 6, 2005; accepted May 30, 2006.
LETTER
Communicated by Herbert Jaeger
Analysis and Design of Echo State Networks

Mustafa C. Ozturk
[email protected]
Dongming Xu
[email protected]
José C. Príncipe
[email protected]
Computational NeuroEngineering Laboratory, Department of Electrical and Computer Engineering, University of Florida, Gainesville, FL 32611, U.S.A.
The design of echo state network (ESN) parameters relies on the selection of the maximum eigenvalue of the linearized system around zero (spectral radius). However, this procedure does not quantify in a systematic manner the performance of the ESN in terms of approximation error. This article presents a functional space approximation framework to better understand the operation of ESNs and proposes an information-theoretic metric, the average entropy of echo states, to assess the richness of the ESN dynamics. Furthermore, it provides an interpretation of the ESN dynamics rooted in system theory as families of coupled linearized systems whose poles move according to the input signal dynamics. With this interpretation, a design methodology for functional approximation is put forward where ESNs are designed with uniform pole distributions covering the frequency spectrum to abide by the richness metric, irrespective of the spectral radius. A single bias parameter at the ESN input, adapted with the modeling error, configures the ESN spectral radius to the input-output joint space. Function approximation examples compare the proposed design methodology versus the conventional design.

1 Introduction

Dynamic computational models require the ability to store and access the time history of their inputs and outputs. The most common dynamic neural architecture is the time-delay neural network (TDNN) that couples delay lines with a nonlinear static architecture where all the parameters (weights) are adapted with the backpropagation algorithm. The conventional delay line utilizes ideal delay operators, but delay lines with local first-order recursive filters have been proposed by Werbos (1992) and extensively studied in the gamma model (de Vries, 1991; Principe, de Vries, & de Oliveira, 1993). Chains of first-order integrators are interesting because they effectively decrease the number of delays necessary to create time embeddings (Principe, 2001).

Neural Computation 19, 111–138 (2007)
© 2006 Massachusetts Institute of Technology
Recurrent neural networks (RNNs) implement a different type of embedding that is largely unexplored. RNNs are perhaps the most biologically plausible of the artificial neural network (ANN) models (Anderson, Silverstein, Ritz, & Jones, 1977; Hopfield, 1984; Elman, 1990), but they are not well understood theoretically (Siegelmann & Sontag, 1991; Siegelmann, 1993; Kremer, 1995). One of the main practical problems with RNNs is the difficulty of adapting the system weights. Various algorithms, such as backpropagation through time (Werbos, 1990) and real-time recurrent learning (Williams & Zipser, 1989), have been proposed to train RNNs; however, these algorithms suffer from computational complexity, resulting in slow training, complex performance surfaces, the possibility of instability, and the decay of gradients through the topology and time (Haykin, 1998). The problem of decaying gradients has been addressed with special processing elements (PEs) (Hochreiter & Schmidhuber, 1997). Alternative second-order training methods based on extended Kalman filtering (Singhal & Wu, 1989; Puskorius & Feldkamp, 1994; Feldkamp, Prokhorov, Eagen, & Yuan, 1998) and the multistreaming training approach (Feldkamp et al., 1998) provide more reliable performance and have enabled practical applications in identification and control of dynamical systems (Kechriotis, Zervas, & Monolakos, 1994; Puskorius & Feldkamp, 1994; Delgado, Kambhampati, & Warwick, 1995).

Recently, two new recurrent network topologies have been proposed: the echo state network (ESN) by Jaeger (2001, 2002a; Jaeger & Haas, 2004) and the liquid state machine (LSM) by Maass (Maass, Natschläger, & Markram, 2002). ESNs possess a highly interconnected and recurrent topology of nonlinear PEs that constitutes a "reservoir of rich dynamics" (Jaeger, 2001) and contains information about the history of input and output patterns. The outputs of these internal PEs (echo states) are fed to a memoryless but adaptive readout network (generally linear) that produces the network output. The interesting property of ESNs is that only the memoryless readout is trained, whereas the recurrent topology has fixed connection weights. This reduces the complexity of RNN training to simple linear regression while preserving a recurrent topology, but obviously places important constraints on the overall architecture that have not yet been fully studied. Similar ideas have been explored independently by Maass and formalized in the LSM architecture. LSMs, although formulated quite generally, are mostly implemented as neural microcircuits of spiking neurons (Maass et al., 2002), whereas ESNs are dynamical ANN models. Both attempt to model biological information processing using similar principles. We focus on the ESN formulation in this letter.

The echo state condition is defined in terms of the spectral radius (the largest among the absolute values of the eigenvalues of a matrix, denoted by ‖·‖) of the reservoir's weight matrix (‖W‖ < 1). This condition states that the dynamics of the ESN is uniquely controlled by the input, and the effect of the initial states vanishes. The current design of ESN parameters
relies on the selection of the spectral radius. However, there are many possible weight matrices with the same spectral radius, and unfortunately they do not all perform at the same level of mean square error (MSE) for functional approximation. A similar problem exists in the design of the LSM. LSMs have been shown to possess universal approximation given the separation property (SP) for the liquid (reservoir in ESNs) and the approximation property (AP) for the readout (Maass et al., 2002). SP is quantified by a kernel-quality measure proposed in Maass, Legenstein, and Bertschinger (2005) that is based on the rank of a matrix formed by the system states corresponding to different input signals. The kernel quality is a measure of the complexity and diversity of the nonlinear operations carried out by the liquid on its input stream in order to boost the classification power of a subsequent linear decision hyperplane (Maass et al., 2005). A variation of SP has been proposed in Bertschinger and Natschläger (2004), and it has been argued that complex calculations can be best carried out by networks on the boundary between ordered and chaotic dynamics.

In this letter, we are interested in studying the ESN for functional approximation (filters that map input functions u(·) of time onto output functions y(·) of time). We see two major shortcomings with the current ESN approach that uses the echo state condition as a design principle. First, fixing the reservoir parameters for function approximation means that the information about the desired response is conveyed only to the output projection. This is not optimal, and strategies to select different reservoirs for different applications have not been devised. Second, imposing a constraint only on the spectral radius is a weak condition to properly set the parameters of the reservoir, as experiments show (different randomizations with the same spectral radius perform differently for the same problem; see Figure 2).

This letter aims to address these two problems by proposing a framework, a metric, and a design principle for ESNs. The framework is a signal processing interpretation of bases and projections in functional spaces to describe and understand the ESN architecture. According to this interpretation, the ESN states implement a set of basis functionals (representation space) constructed dynamically by the input, while the readout simply projects the desired response onto this representation space. The metric to describe the richness of the ESN dynamics is an information-theoretic quantity, the average state entropy (ASE). Entropy measures the amount of information contained in a given random variable (Shannon, 1948). Here, the random variable is the instantaneous echo state from which the entropy for the overall state (vector) is estimated. The probability density function (pdf) in a differential geometric framework should be thought of as a volume form; that is, in our case, the pdf of the state vector describes the metric of the state space manifold (Amari, 1990). Moreover, Cox (1946) established information as a coordinate-free metric in the state manifold. Therefore, entropy becomes a global descriptor of information that quantifies the volume of the manifold defined by the random variable. Due to the
time dependency of the states, the state entropy averaged over time (ASE) is an appropriate estimate of the volume of the state manifold.

The design principle specifies that one should consider independently the correlation among the bases and the spectral radius. In the absence of any information about the desired response, the ESN states should be designed with the highest ASE, independent of the spectral radius. We interpret the ESN dynamics as a combination of time-varying linear systems obtained from the linearization of the ESN nonlinear PEs in a small, local neighborhood of the current state. The design principle means that the poles of the linearized ESN reservoir should have uniform pole distributions to generate echo states with the most diverse pole locations (which correspond to the uniformity of time constants). Effectively, this will create the least correlated bases for a given spectral radius, which corresponds to the largest volume spanned by the basis set. When the designer has no other information about the desired response to set the bases, this principle distributes the system's degrees of freedom uniformly in space. It approximates for ESNs the well-known property of orthogonal bases.

The unresolved issue that ASE does not quantify is how to set the spectral radius, which depends again on the desired mapping. The concept of memory depth, as explained in Principe et al. (1993) and Jaeger (2002a), is helpful in understanding the issues associated with the spectral radius. The correlation time of the desired response (as estimated by the first zero of the autocorrelation function) gives an indication of the type of spectral radius required (a long correlation time requires a high spectral radius). Alternatively, a simple adaptive bias is added at the ESN input to control the spectral radius, integrating the information from the input-output joint space into the ESN bases. For sigmoidal PEs, the bias adjusts the operating points of the reservoir PEs, which has the net effect of adjusting the volume of the state manifold as required to approximate the desired response with a small error. This letter shows that ESNs designed with this strategy obtain systematically better results in a set of experiments when compared with the conventional ESN design.
2 Analysis of Echo State Networks

2.1 Echo States as Bases and Projections. Let us consider the architecture and recursive update equation of a typical ESN more closely. Consider the recurrent discrete-time neural network given in Figure 1 with M input units, N internal PEs, and L output units. The value of the input units at time n is u(n) = [u_1(n), u_2(n), . . . , u_M(n)]^T, that of the internal units is x(n) = [x_1(n), x_2(n), . . . , x_N(n)]^T, and that of the output units is y(n) = [y_1(n), y_2(n), . . . , y_L(n)]^T. The connection weights are given in an N × M weight matrix W^in = (w_ij^in) for connections between the input and the internal PEs, in an N × N matrix W = (w_ij) for connections between the internal PEs, in an L × N matrix W^out = (w_ij^out) for connections from the PEs to the output units, and in an N × L matrix W^back = (w_ij^back) for the connections that project back from the output to the internal PEs (Jaeger, 2001).
Figure 1: An echo state network (ESN). The ESN is composed of two parts: a fixed-weight (‖W‖ < 1) recurrent network and a linear readout. The recurrent network is a reservoir of highly interconnected dynamical components, whose states are called echo states. The memoryless linear readout is trained to produce the output.
The activation of the internal PEs (echo state) is updated according to

x(n + 1) = f(W^in u(n + 1) + W x(n) + W^back y(n)),    (2.1)

where f = (f_1, f_2, . . . , f_N) are the internal PEs' activation functions. Here, all f_i's are hyperbolic tangent functions, f(x) = (e^x − e^−x)/(e^x + e^−x). The output from the readout network is computed according to

y(n + 1) = f^out(W^out x(n + 1)),    (2.2)
where f^out = (f_1^out, f_2^out, . . . , f_L^out) are the output units' nonlinear functions (Jaeger, 2001, 2002a). Generally, the readout is linear, so f^out is the identity.
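For concreteness, equations 2.1 and 2.2 translate directly into code. The following is a minimal sketch with our own variable names, assuming tanh PEs and a linear (identity) readout:

```python
import numpy as np

def esn_run(W, W_in, u, W_out=None, W_back=None, y_fb=None, x0=None):
    """Iterate equation 2.1 and, when W_out is given, apply the linear
    readout of equation 2.2. u has shape (T, M); returns the (T, N)
    matrix of echo states (and the outputs when W_out is provided)."""
    N, T = W.shape[0], u.shape[0]
    x = np.zeros(N) if x0 is None else np.asarray(x0, dtype=float)
    X = np.zeros((T, N))
    for n in range(T):
        net = W_in @ u[n] + W @ x             # W_in u(n+1) + W x(n)
        if W_back is not None and y_fb is not None:
            net += W_back @ y_fb[n]           # + W_back y(n)
        x = np.tanh(net)                      # equation 2.1
        X[n] = x
    if W_out is None:
        return X
    return X, X @ W_out.T                     # equation 2.2 with identity f_out
```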
ESNs resemble the RNN architecture proposed in Puskorius and Feldkamp (1996) and also used by Sanchez (2004) in brain-machine interfaces. The critical difference is the dimensionality of the hidden recurrent PE layer and the adaptation of the recurrent weights. We submit that the ideas of approximation theory in functional spaces (bases and projections), so useful in adaptive signal processing (Principe, 2001), should be utilized to understand the ESN architecture.

Let h(u(t)) be a real-valued function of a real-valued vector u(t) = [u_1(t), u_2(t), . . . , u_M(t)]^T. In functional approximation, the goal is to estimate the behavior of h(u(t)) as a combination of simpler functions ϕ_i(t), called the basis functionals, such that its approximant, ĥ(u(t)), is given by

ĥ(u(t)) = Σ_{i=1}^N a_i ϕ_i(t).
Here, a i ’s are the projections of h(u(t)) onto each basis function. One of the central questions in practical functional approximation is how to choose the set of bases to approximate a given desired signal. In signal processing, the choice normally goes for a complete set of orthogonal basis, independent of the input. When the basis set is complete and can be made as large as required, fixed bases work wonders (e.g., Fourier decompositions). In neural computing, the basic idea is to derive the set of bases from the input signal through a multilayered architecture. For instance, consider a single hidden layer TDNN with N PEs and a linear output. The hiddenlayer PE outputs can be considered a set of nonorthogonal basis functionals dependent on the input, ϕi (u(t)) = g
b i j u j (t) .
j
b i j ’s are the input layer weights, and g is the PE nonlinearity. The approximation produced by the TDNN is then ˆ h(u(t)) =
N
a i ϕi (u(t)),
(2.3)
i=1
where a i ’s are the weights of the output layer. Notice that the b i j ’s adapt the bases and the a i ’s adapt the projection in the projection space. Here the goal is to restrict the number of bases (number of hidden layer PEs) because their number is coupled with the number of parameters to adapt, which has an impact on generalization and training set size, for example. Usually,
Usually, since all of the parameters of the network are adapted, the best basis in the joint (input and desired signals) space as well as the best projection can be achieved and represents the optimal solution. The output of the TDNN is a linear combination of its internal representations, but to achieve a basis set (even if nonorthogonal), linear independence among the ϕ_i(u(t))'s must be enforced. Ito, Shah and Poon, and others have shown that this is indeed the case (Ito, 1996; Shah & Poon, 1999), but a thorough discussion is outside the scope of this article.

The ESN (and the RNN) architecture can also be studied in this framework. The states of equation 2.1 correspond to the basis set, which are recursively computed from the input, output, and previous states through W^in, W, and W^back. Notice, however, that none of these weight matrices is adapted; that is, the functional bases in the ESN are uniquely defined by the input and the initial selection of weights. In a sense, ESNs are trading the adaptive connections in the RNN hidden layer for a brute force approach of creating fixed, diversified dynamics in the hidden layer. For an ESN with a linear readout network, the output equation (y(n + 1) = W^out x(n + 1)) has the same form as equation 2.3, where the ϕ_i's and a_i's are replaced by the echo states and the readout weights, respectively. The readout weights are adapted on the training data, which means that the ESN is able to find the optimal projection in the projection space, just like the RNN or the TDNN.

A similar perspective of bases and projections for information processing in biological networks has been proposed by Pouget and Sejnowski (1997). They explored the possibility that the responses of neurons in parietal cortex serve as basis functions for the transformations from the sensory input to the motor responses. They proposed that "the role of spatial representations is to code the sensory inputs and posture signals in a format that simplifies subsequent computation, particularly in the generation of motor commands."

The central issue in ESN design is exactly the nonadaptive nature of the basis set. Parameter sets in the reservoir that provide linearly independent states and possess a given spectral radius may define drastically different projection spaces because the correlation among the bases is not constrained. A simple experiment was designed to demonstrate that selecting the ESN parameters by constraining only the spectral radius is not the most suitable strategy for function approximation. Consider a 100-unit ESN where the input signal is sin(2πn/10π). Mimicking Jaeger (2001), the goal is to let the ESN generate the seventh power of the input signal. Different realizations of a randomly connected 100-unit ESN were constructed where the entries of W are set to 0.4, −0.4, and 0 with probabilities of 0.025, 0.025, and 0.95, respectively. This corresponds to a spectral radius of 0.88. Input weights are set to +1 or −1 with equal probabilities, and W^back is set to zero. The input is applied for 300 time steps, and the echo states are calculated using equation 2.1. The next step is to train the linear readout.
Figure 2: Performances of ESNs for different realizations of W with the same weight distribution. The weight values are set to 0.4, −0.4, and 0 with probabilities of 0.025, 0.025, and 0.95. All realizations have the same spectral radius of 0.88. In the 50 realizations, MSEs vary from 5.9 × 10^−9 to 8.9 × 10^−5. Results show that for each set of random weights that provide the same spectral radius, the correlation or degree of redundancy among the bases will change, and different performances are encountered in practice.
One method to determine the optimal output weight matrix, W^out, in the mean square error (MSE) sense (where the MSE is defined by O = (1/2)(d − y)^T(d − y)) is to use the Wiener solution given by Haykin (2001):

W^out = E[xx^T]^−1 E[xd] ≅ [(1/N) Σ_n x(n)x(n)^T]^−1 [(1/N) Σ_n x(n)d(n)].    (2.4)
Here, E[·] denotes the expected value operator, and d denotes the desired signal. Figure 2 depicts the MSE values for 50 different realizations of the ESNs. As observed, even though each ESN has the same sparseness and spectral radius, the MSE values obtained vary greatly among the different realizations. The minimum MSE value obtained among the 50 realizations is 5.9 × 10^−9, whereas the maximum MSE is 8.9 × 10^−5.
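The experiment of Figure 2 can be replicated along the following lines. This is a sketch under our assumptions: it reuses the esn_run helper sketched in section 2.1, the least-squares fit is numerically equivalent to the Wiener solution of equation 2.4, and the spectral radius of each probabilistic realization is only approximately 0.88.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 100, 300
u = np.sin(2 * np.pi * np.arange(T) / (10 * np.pi))[:, None]  # input signal
d = u[:, 0] ** 7                               # desired: seventh power of input

mses = []
for _ in range(50):                            # 50 reservoir realizations
    # entries of W: 0.4, -0.4, 0 with probabilities 0.025, 0.025, 0.95
    W = rng.choice([0.4, -0.4, 0.0], size=(N, N), p=[0.025, 0.025, 0.95])
    W_in = rng.choice([1.0, -1.0], size=(N, 1))   # input weights set to +1 or -1
    X = esn_run(W, W_in, u)                       # echo states via equation 2.1
    w_out, *_ = np.linalg.lstsq(X, d, rcond=None) # readout, equivalent to eq. 2.4
    e = d - X @ w_out
    mses.append(0.5 * float(e @ e))               # MSE as defined in the text
print(min(mses), max(mses))
```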
This experiment demonstrates that a design strategy based solely on the spectral radius is not sufficient to specify the system architecture for function approximation. It shows that for each set of random weights that provide the same spectral radius, the correlation or degree of redundancy among the bases will change, and different performances are encountered in practice.

2.2 ESN Dynamics as a Combination of Linear Systems. It is well known that the dynamics of a nonlinear system can be approximated by those of a linear system in a small neighborhood of an equilibrium point (Kuznetsov, Kuznetsov, & Marsden, 1998). Here, we perform the analysis with hyperbolic tangent nonlinearities and approximate the ESN dynamics by the dynamics of the linearized system in the neighborhood of the current system state. Hence, when the system operating point varies over time, the linear system approximating the ESN dynamics changes. We are particularly interested in the movement of the poles of the linearized ESN. Consider the update equation for the ESN without output feedback,

x(n + 1) = f(W^in u(n + 1) + W x(n)).

Linearizing the system around the current state x(n), one obtains the Jacobian matrix, J(n + 1), defined by
J(n + 1) =
[ f′(net_1(n))w_11   f′(net_1(n))w_12   · · ·   f′(net_1(n))w_1N ]
[ f′(net_2(n))w_21   f′(net_2(n))w_22   · · ·   f′(net_2(n))w_2N ]
[       · · ·                · · ·              · · ·        · · ·       ]
[ f′(net_N(n))w_N1   f′(net_N(n))w_N2   · · ·   f′(net_N(n))w_NN ]

= diag(f′(net_1(n)), f′(net_2(n)), . . . , f′(net_N(n))) · W = F(n) · W.    (2.5)
Here, net_i(n) is the ith entry of the vector W^in u(n + 1) + W x(n), and w_ij denotes the (i, j)th entry of W. The poles of the linearized system at time n + 1 are given by the eigenvalues of the Jacobian matrix J(n + 1).1
1 The transfer function of a linear system x(n + 1) = Ax(n) + Bu(n) is X(z)/U(z) = (zI − A)^−1 B = [Adjoint(zI − A)/det(zI − A)] B. The poles of the transfer function can be obtained by solving det(zI − A) = 0. The solution corresponds to the eigenvalues of A.
As the amplitude of each PE changes, the local slope changes, and so the poles of the linearized system are time varying, although the parameters of the ESN are fixed.

In order to visualize the movement of the poles, consider an ESN with 100 states. The entries of the internal weight matrix are chosen to be 0, 0.4, and −0.4 with probabilities 0.9, 0.05, and 0.05, and W is scaled such that a spectral radius of 0.95 is obtained. Input weights are set to +1 or −1 with equal probabilities. A sinusoidal signal with a period of 100 is fed to the system, and the echo states are computed according to equation 2.1. Then the Jacobian matrix and its eigenvalues are calculated using equation 2.5. Figure 3 shows the pole tracks of the linearized ESN for different input values. A single ESN with fixed parameters implements a combination of many linear systems with varying pole locations, hence many different time constants that modulate the richness of the reservoir of dynamics as a function of input amplitude. Higher-amplitude portions of the signal tend to saturate the nonlinear function and cause the poles to shrink toward the origin of the z-plane (decreasing the spectral radius), which results in a system with a large stability margin. When the input is close to zero, the poles of the linearized ESN are close to the maximal spectral radius chosen, decreasing the stability margin. Compared to its linear counterpart, an ESN with the same number of states provides a more detailed coverage of the z-plane dynamics, which illustrates the power of nonlinear systems. Similar results can be obtained using signals of different shapes at the ESN input.

A key corollary of the above analysis is that the spectral radius of an ESN can be adjusted using a constant bias signal at the ESN input without changing the recurrent connection matrix, W. The application of a nonzero constant bias will move the operating point to regions of the sigmoid function closer to saturation and always decrease the spectral radius due to the shape of the nonlinearity.2 The relevance of bias in terms of overall system performance has also been discussed in Jaeger (2002b) and Bertschinger and Natschläger (2004), but here we approach it from a system theory perspective and explain its effect on reservoir dynamics.

2 Assume W has nondegenerate eigenvalues and corresponding linearly independent eigenvectors, and consider the eigendecomposition W = PDP^−1, where P is the eigenvector matrix and D is the diagonal matrix of eigenvalues (D_ii) of W. Since F(n) and D are diagonal, J(n + 1) = F(n)W = F(n)(PDP^−1) = P(F(n)D)P^−1 is the eigendecomposition of J(n + 1). Here, each entry of F(n)D, f′(net_i(n))D_ii, is an eigenvalue of J. Therefore, |f′(net_i(n))D_ii| ≤ |D_ii| since f′(net_i) ≤ f′(0).
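The pole tracks of Figure 3 can be reproduced in a few lines (a sketch with an arbitrary random seed; F(n) uses f′(x) = 1 − tanh²(x) for the tanh PEs):

```python
import numpy as np

rng = np.random.default_rng(1)
N, T = 100, 100
W = rng.choice([0.0, 0.4, -0.4], size=(N, N), p=[0.9, 0.05, 0.05])
W *= 0.95 / np.max(np.abs(np.linalg.eigvals(W)))     # scale spectral radius to 0.95
W_in = rng.choice([1.0, -1.0], size=(N, 1))
u = np.sin(2 * np.pi * np.arange(T) / 100)[:, None]  # sinusoid of period 100

x = np.zeros(N)
poles = []                                   # eigenvalues of J(n+1) at each step
for n in range(T):
    net = W_in @ u[n] + W @ x                # net_i(n)
    F = np.diag(1.0 - np.tanh(net) ** 2)     # F(n): local slopes f'(net_i(n))
    poles.append(np.linalg.eigvals(F @ W))   # J(n+1) = F(n) W, equation 2.5
    x = np.tanh(net)                         # advance the state (equation 2.1)
```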
Figure 3: Pole tracks of the linearized ESN with 100 PEs as the input goes through one cycle. An ESN with fixed parameters implements a combination of linear systems with varying pole locations. (A) One cycle of a sinusoidal signal with a period of 100. (B–E) The positions of the poles of the linearized systems when the input values are at points B, C, D, and E in panel A. (F) The cumulative pole locations show the movement of the poles as the input changes. Due to the varying pole locations, different time constants modulate the richness of the reservoir of dynamics as a function of input amplitude. Higher-amplitude signals tend to saturate the nonlinear function and cause the poles to shrink toward the origin of the z-plane (decreasing the spectral radius), which results in a system with a large stability margin. When the input is close to zero, the poles of the linearized ESN are close to the maximal spectral radius chosen, decreasing the stability margin. Compared to a linear counterpart, an ESN with the same number of states provides a more detailed coverage of the z-plane dynamics, which illustrates the power of nonlinear systems.
Previous research was aware of the influence of the diversity of the recurrent layer outputs on the overall performance of ESNs and LSMs, and several metrics to quantify this diversity have been proposed (Jaeger, 2001; Maass et al., 2005). Here, our approach of bases and projections leads to a new metric. We propose the instantaneous state entropy to quantify the distribution of instantaneous amplitudes across the ESN states. The entropy of the instantaneous ESN states is an appropriate measure of performance in function approximation because the ESN output is a mere weighted combination of the instantaneous values of the ESN states. If the echo states' instantaneous amplitudes are concentrated on only a few values across the ESN state dynamic range, the ability to approximate an arbitrary desired response by weighting the states is limited (and wasteful, due to redundancy between the different states), and performance will suffer. On the other hand, if the ESN states provide a diversity of instantaneous amplitudes, it is much easier to achieve the desired mapping. Hence, the instantaneous entropy of the states appears to be a good measure of the richness of dynamics available to instantaneous mappers. Due to the time structure of signals, the average state entropy (ASE), defined as the state entropy averaged over time, will be the parameter used to quantify the diversity in the dynamical reservoir of the ESN. Moreover, entropy has been proposed as an appropriate measure of the volume of a signal manifold (Cox, 1946; Amari, 1990); here, ASE measures the volume of the echo state manifold spanned by the trajectories. Renyi's quadratic entropy is employed because it is a global measure of information and because an efficient nonparametric estimator of it, which avoids explicit pdf estimation, has been developed (Principe, Xu, & Fisher, 2000). Renyi's entropy with parameter γ for a random variable X with pdf f_X(x) is given by (Renyi, 1970)

H_\gamma(X) = \frac{1}{1 - \gamma} \log E\left[ f_X^{\gamma - 1}(X) \right].
Renyi's quadratic entropy is obtained for γ = 2 (for γ → 1, Shannon's entropy is recovered). Given N samples {x_1, x_2, . . . , x_N} drawn from the unknown pdf to be estimated, Parzen windowing approximates the underlying pdf by

f_X(x) = \frac{1}{N} \sum_{i=1}^{N} K_\sigma(x - x_i),
where K_σ is the kernel function with kernel size σ. Renyi's quadratic entropy can then be estimated by (Principe et al., 2000)

H_2(X) = -\log \left( \frac{1}{N^2} \sum_{j} \sum_{i} K_\sigma(x_j - x_i) \right). \quad (3.1)
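A direct transcription of this estimator into Python/NumPy might look as follows (a sketch; the gaussian form of K_σ is an assumption consistent with the Parzen windowing above):

```python
import numpy as np

def renyi_quadratic_entropy(x, sigma):
    """Estimate H2(X) from samples x using equation 3.1 with a gaussian kernel."""
    x = np.asarray(x, dtype=float)
    d = x[:, None] - x[None, :]                                  # pairwise x_j - x_i
    K = np.exp(-d ** 2 / (2.0 * sigma ** 2)) / (np.sqrt(2.0 * np.pi) * sigma)
    return -np.log(K.mean())                                     # -log((1/N^2) sum_j sum_i K)
```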
The instantaneous state entropy is estimated using equation 3.1, where the samples are the entries of the state vector x(n) = [x_1(n), x_2(n), . . . , x_N(n)]^T of an ESN with N internal PEs. Results will be shown with a gaussian kernel, with the kernel size chosen to be 0.3 of the standard deviation of the entries of the state vector. We will show that ASE is a sensitive parameter for quantifying the approximation properties of ESNs by experimentally demonstrating that ESNs with different spectral radii, and even ESNs with the same spectral radius, display different ASEs.

Let us consider the same 100-unit ESN that we used in the previous section, built with three different spectral radii (0.2, 0.5, and 0.8) and driven by the input signal sin(2πn/20). Figure 4A depicts the echo states over 200 time ticks. The instantaneous state entropy is calculated at each time step using equation 3.1 and plotted in Figure 4B. First, note that the instantaneous state entropy changes over time with the distribution of the echo states, as we would expect, since the state entropy depends on the input signal, which also changes in this case. Second, as the spectral radius increases, the diversity of the echo states increases. For a spectral radius of 0.2, the echo states' instantaneous amplitudes are concentrated on only a few values, which is wasteful due to redundancy between the different states. In practice, to quantify the overall representation ability over time, we use ASE, which takes the values −0.735, −0.007, and 0.335 for the spectral radii 0.2, 0.5, and 0.8, respectively. Moreover, even for the same spectral radius, several ASEs are possible: Figure 4C shows the ASEs of 50 different realizations of ESNs with the same spectral radius of 0.5, which shows that ASE is a finer descriptor of the dynamics of the reservoir. Although we have presented an experiment with a sinusoidal signal, similar results are obtained for other inputs as long as the input dynamic range is properly selected.

Maximizing ASE means that the diversity of the states over time is largest and should provide a basis set that is as uncorrelated as possible. This condition is unfortunately not a guarantee that the ESN so designed will perform best, because the basis set in ESNs is created independent of the desired response, and the application may require a small spectral radius. However, we maintain that when the desired response is not accessible for the design of the ESN bases, or when the same reservoir is to be used for a number of problems, the default strategy should be to maximize the ASE of the state vector.
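Using the estimator sketched after equation 3.1, ASE for a run of echo states could be computed as follows (a sketch; X is assumed to hold one state vector per row, and the kernel size follows the 0.3-times-standard-deviation choice in the text):

```python
import numpy as np

def average_state_entropy(X):
    """ASE: instantaneous state entropy (equation 3.1) averaged over time."""
    entropies = []
    for x_n in X:                          # x_n is the state vector at time step n
        sigma = 0.3 * np.std(x_n)          # kernel size choice from the text
        entropies.append(renyi_quadratic_entropy(x_n, sigma))
    return float(np.mean(entropies))
```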
The following section addresses the design of ESNs with high ASE values and a simple mechanism to adjust the reservoir dynamics without changing the recurrent connection weights.

4 Designing Echo State Networks

4.1 Design of the Echo State Recurrent Connections. According to the interpretation of ESNs as coupled linear systems, the design of the internal connection matrix, W, will be based on the distribution of the poles of the linearized system around the zero state. Our proposal is to design the ESN such that the linearized system has a uniform pole distribution inside the unit circle of the z-plane. With this design, the system dynamics will include a uniform coverage of time constants, arising from the uniform distribution of the poles, which also decorrelates the basis functionals as much as possible. This principle was chosen by analogy to the identification of linear systems using Kautz filters (Kautz, 1954), which shows that the best approximation of a given transfer function by a linear system of finite order is achieved when the poles are placed in the neighborhood of the spectral resonances. When no information is available about the desired response, we should spread the poles uniformly in order to anticipate good approximations to arbitrary mappings.

We again use a maximum entropy principle to distribute the poles uniformly inside the unit circle. The constraints of a circle as the boundary condition for discrete linear systems and of complex conjugate pole locations are easy to include in the pole distribution (Thogula, 2003). The poles are first initialized at random locations, the quadratic Renyi's entropy is calculated by equation 3.1, and the poles are moved such that the entropy of the new distribution increases over iterations (Erdogmus & Principe, 2002). This method efficiently finds a uniform coverage of the unit circle with an arbitrary number of poles. The system with uniform pole locations can be interpreted using linear system theory: the poles that are close to the unit circle correspond to many sharp bandpass filters specializing in different frequency regions, whereas the inner poles realize filters of larger frequency support. Moreover, the different orientations (angles) of the poles create filters of different center frequencies.

Now the problem is to construct an internal weight matrix from the pole locations (the eigenvalues of W).
Figure 4: Examples of echo states and instantaneous state entropy. (A) Outputs of the echo states (100 PEs) produced by ESNs with spectral radii of 0.2, 0.5, and 0.8, from top to bottom, respectively. The diversity of the echo states increases as the spectral radius increases. Within the dynamic range of the echo states, systems with a smaller spectral radius can generate only uneven representations, while for a spectral radius of 0.8 the outputs of the echo states distribute almost uniformly within their dynamic range. (B) Instantaneous state entropy calculated using equation 3.1. The information contained in the echo states changes over time according to the input amplitude; therefore, the richness of the representation is controlled by the input amplitude. Moreover, the value of ASE increases with the spectral radius. (C) ASEs of 50 different realizations of ESNs with the same spectral radius of 0.5. The plot shows that ASE is a finer descriptor of the dynamics of the reservoir than the spectral radius.
In principle, we would like to create a sparse matrix, so we started with the sparsest matrix (that still has an inverse), which is the direct canonical structure given by Kailath (1980):
W = \begin{pmatrix} -a_1 & -a_2 & \cdots & -a_{N-1} & -a_N \\ 1 & 0 & \cdots & 0 & 0 \\ 0 & 1 & \cdots & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & 1 & 0 \end{pmatrix}. \quad (4.1)
The characteristic polynomial of W is

l(s) = \det(sI - W) = s^N + a_1 s^{N-1} + a_2 s^{N-2} + \cdots + a_N = (s - p_1)(s - p_2) \cdots (s - p_N), \quad (4.2)
where the p_i's are the eigenvalues and the a_i's are the coefficients of the characteristic polynomial of W. Here we know the pole locations of the linear system obtained from the linearization of the ESN, so using equation 4.2 we can obtain the characteristic polynomial and construct the W matrix in the canonical form of equation 4.1. We will call the ESN constructed according to the uniform pole principle the ASE-ESN. All other solutions with the same eigenvalues can be obtained as Q^{-1}WQ, where Q is any nonsingular matrix.

To corroborate our hypothesis, we would like to show that a linearized ESN whose recurrent weight matrix has eigenvalues uniformly distributed inside the unit circle creates higher ASE values for a given spectral radius than ESNs with random internal connection weight matrices. We consider an ESN with 30 states and use our procedure to create the W matrix of the ASE-ESN for different spectral radii in [0.1, 0.95]. Similarly, we constructed ESNs with sparse random W matrices under different sparseness constraints. This corresponds to a weight distribution taking the values 0, c, and −c with probabilities p_1, (1 − p_1)/2, and (1 − p_1)/2, where p_1 defines the sparseness of W and c is a constant whose value depends on the desired spectral radius. We also created W matrices with values uniformly distributed between −1 and 1 (U-ESN) and scaled to obtain a given spectral radius (Jaeger & Haas, 2004). Then, for different Win matrices, we ran the ESNs with the sinusoidal input given in section 3 and calculated ASE. Figure 5 compares the ASE values averaged over 1000 realizations. As observed from the figure, the ASE-ESN with uniform pole distribution generates a higher ASE on average for all spectral radii compared to ESNs with sparse or uniform random connections. This approach is conceptually similar to Jeffreys's maximum entropy prior (Jeffreys, 1946): it will provide a consistently good response for the largest class of problems.
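The construction can be sketched as follows (Python/NumPy). For brevity, the sketch samples self-conjugate poles roughly uniformly over the disk of the desired spectral radius, instead of running the entropy-maximization iteration described above; `canonical_W` then realizes equations 4.2 and 4.1:

```python
import numpy as np

def uniform_poles(N, radius, rng=None):
    """Sample N self-conjugate poles roughly uniformly over a disk."""
    rng = rng or np.random.default_rng()
    m = N // 2
    r = radius * np.sqrt(rng.uniform(size=m))       # sqrt gives uniform area density
    theta = rng.uniform(0.0, np.pi, size=m)
    p = r * np.exp(1j * theta)
    poles = np.concatenate([p, np.conj(p)])         # complex conjugate pairs
    if N % 2:                                       # odd N needs one real pole
        poles = np.append(poles, radius * (2.0 * rng.uniform() - 1.0))
    return poles

def canonical_W(poles):
    """Build W in the canonical form of equation 4.1 from its eigenvalues."""
    a = np.real(np.poly(poles))                     # [1, a_1, ..., a_N], equation 4.2
    N = len(poles)
    W = np.zeros((N, N))
    W[0, :] = -a[1:]                                # first row: -a_1 ... -a_N
    W[1:, :-1] = np.eye(N - 1)                      # ones on the subdiagonal
    return W
```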
Figure 5: Comparison of ASE values obtained for the ASE-ESN (W with uniform eigenvalue distribution), ESNs with random W matrices (sparseness 0.07, 0.1, and 0.2), and the U-ESN with weights uniformly distributed between −1 and 1. ASE values are calculated for networks with spectral radii from 0.1 to 0.95. The ASE-ESN with uniform pole distribution generates a higher ASE on average for all spectral radii compared to ESNs with random connections.
Concentrating the poles of the linearized system in certain regions of the space provides good performance only if the desired response has energy in that part of the space, as is well known from the theory of Kautz filters (Kautz, 1954).

4.2 Design of the Adaptive Bias. In conventional ESNs, only the output weights are trained, optimizing the projections of the desired response onto the basis functions (echo states). Since the dynamical reservoir is fixed, the basis functions are input dependent only. However, since function approximation is a problem in the joint space of the input and desired signals, a penalty in performance will be incurred. Based on the linearization analysis, which shows the crucial importance of the operating point of the PE nonlinearity in defining the echo state dynamics, we propose to use a single external adaptive bias to adjust the effective spectral radius of an ESN. Notice that, according to the linearization analysis, the bias can only reduce the spectral radius. The information used to adapt the bias is the training MSE, so the spectral radius of the system is modulated by information derived from the approximation error. With this simple mechanism, some information from the joint input-output space is incorporated into the definition of the projection space of the ESN.
The beauty of this method is that the spectral radius can be adjusted by a single parameter that is external to the system, without changing the reservoir weights.

The training of the bias can be accomplished easily. Indeed, since the parameter space is only one-dimensional, a simple line search method can be employed efficiently to optimize the bias. Among the different line search algorithms, we use a search based on Fibonacci numbers for the selection of the points to be evaluated (Wilde, 1964); the Fibonacci search method minimizes the maximum number of evaluations needed to reduce the interval of uncertainty to within a prescribed length. In our problem, a bias value is picked according to the Fibonacci search; for each value of the bias, the training data are applied to the ESN and the echo states are calculated; then the corresponding optimal output weights and the objective function (MSE) are evaluated to pick the next bias value. Alternatively, gradient-based methods can be utilized to optimize the bias, due to their simplicity and low computational cost. The system update equation with an external bias signal, b, is given by

x(n + 1) = f(W^{in} u(n + 1) + W^{in} b + W x(n)).

The update equation for b is given by

\frac{\partial O(n+1)}{\partial b} = -e \cdot W^{out} \times \frac{\partial x(n+1)}{\partial b} \quad (4.3)

= -e \cdot W^{out} \times \dot{f}(net_{n+1}) \cdot \left( W^{in} + W \times \frac{\partial x(n)}{\partial b} \right). \quad (4.4)

Here, O is the MSE defined previously.
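A gradient-based version of this update could be sketched as follows (Python/NumPy; it assumes tanh PEs so that ḟ(net) = 1 − x(n + 1)², a single scalar input, a scalar output with a trained readout vector Wout, and a hypothetical step size lr — none of these choices are fixed by the text):

```python
import numpy as np

def adapt_bias(b, u, d, x0, W, Win, Wout, lr=1e-3):
    """Adapt the input bias b by the gradient of equations 4.3-4.4."""
    x, dx_db = x0.copy(), np.zeros_like(x0)
    for n in range(len(u) - 1):
        x_new = np.tanh(Win * u[n + 1] + Win * b + W @ x)
        fprime = 1.0 - x_new ** 2               # f'(net) for tanh PEs
        dx_db = fprime * (Win + W @ dx_db)      # recursion inside equation 4.4
        e = d[n + 1] - Wout @ x_new             # instantaneous output error
        b -= lr * (-e * (Wout @ dx_db))         # equation 4.3: stochastic gradient step
        x = x_new
    return b
```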
This algorithm may suffer from problems similar to those observed when training recurrent networks with gradient-based methods. However, we observed that the performance surface is rather simple. Moreover, since the search parameter is one-dimensional, the gradient vector can assume only one of two directions; hence, imprecision in the gradient estimate should affect the speed of convergence but normally not change the correct gradient direction.

5 Experiments

This section presents a variety of experiments that test the validity of the ESN design scheme proposed in the previous section.

5.1 Short-Term Memory Capacity. This experiment compares the short-term memory (STM) capacity of ESNs with the same spectral radius, using the framework presented in Jaeger (2002a). Consider an ESN with a single input signal, u(n), optimally trained with the desired signal u(n − k), for a given delay k.
Denoting the optimal output signal y_k(n), the k-delay STM capacity of the network, MC_k, is defined as the squared correlation coefficient between u(n − k) and y_k(n) (Jaeger, 2002a). The STM capacity, MC, of the network is defined as MC = \sum_{k=1}^{\infty} MC_k. STM capacity measures how accurately delayed versions of the input signal are recovered by optimally trained output units. Jaeger (2002a) has shown that the memory capacity for recalling an independent and identically distributed (i.i.d.) input with an N-unit RNN with linear output units is bounded by N.

We use ESNs with 20 PEs and a single input unit. The ESNs are driven by an i.i.d. random input signal, u(n), uniformly distributed over [−0.5, 0.5]. The goal is to train the ESN to generate the delayed versions of the input, u(n − 1), . . . , u(n − 40). We used four different ESNs: R-ESN, U-ESN, ASE-ESN, and BASE-ESN. The R-ESN is the randomly connected ESN used in Jaeger (2002a), where the entries of the W matrix are set to 0, 0.47, and −0.47 with probabilities 0.8, 0.1, and 0.1, respectively; this corresponds to a sparse connectivity of 20% and a spectral radius of 0.9. The entries of W of the U-ESN are uniformly distributed over [−1, 1] and scaled to obtain a spectral radius of 0.9. The ASE-ESN also has a spectral radius of 0.9 and is designed with uniform poles. The BASE-ESN has the same recurrent weight matrix as the ASE-ESN plus an adaptive bias at its input. In each ESN, the input weights are set to 0.1 or −0.1 with equal probability, and direct connections from the input to the output are allowed, whereas W^{back} is set to 0 (Jaeger, 2002a). The echo states are calculated using equation 2.1 for 200 samples of the input signal, and the first 100 samples, corresponding to the initial transient, are eliminated. Then the output weight matrix is calculated using equation 2.4. For the BASE-ESN, the bias is trained for each task. All networks are run with a test input signal, and the corresponding outputs and MC_k are calculated.

Figure 6 shows the k-delay STM capacity (averaged over 100 trials) of each ESN for delays 1, . . . , 40 on the test signal. The STM capacities of the R-ESN, U-ESN, ASE-ESN, and BASE-ESN are 13.09, 13.55, 16.70, and 16.90, respectively. First, the ESNs with uniform pole distribution (ASE-ESN and BASE-ESN) have MCs that are much larger than that of the randomly generated ESN of Jaeger (2002a), in spite of all having the same spectral radius; in fact, the STM capacity of the ASE-ESN is close to the theoretical maximum of N = 20. A closer look at the figure shows that the R-ESN performs slightly better than the ASE-ESN for delays less than 9. For small k, a large ASE degrades performance because these tasks do not need a long memory depth; however, this drawback of high ASE for small k is recovered by the BASE-ESN, which reduces the ASE to the level required by the task. Overall, the addition of the bias to the ASE-ESN increases the STM capacity from 16.70 to 16.90. The U-ESN, whose weights take many distinct values, shows only slightly better STM than the R-ESN, which uses just three distinct weight values. It is also significant that the MC will be very poor for an ESN with a smaller spectral radius, even with an adaptive bias, since the problem requires a large ASE and the bias can only reduce ASE.
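For concreteness, MC_k can be computed from a test input u and the trained k-delay output y_k as a squared correlation coefficient, with the infinite sum truncated at the largest delay of interest (a sketch; the array names are ours):

```python
import numpy as np

def k_delay_capacity(u, y_k, k):
    """MC_k: squared correlation between u(n - k) and y_k(n)."""
    return np.corrcoef(u[:-k], y_k[k:])[0, 1] ** 2

# Truncated STM capacity, e.g., summing over k = 1..40 as in the experiment:
# MC = sum(k_delay_capacity(u, outputs[k], k) for k in range(1, 41))
```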
Figure 6: The k-delay STM capacity of each ESN for delays 1, . . . , 40 computed using the test signal. The results are averaged over 100 different realizations of each ESN type with the specifications given in the text for different W and Win matrices. The STM capacities of R-ESN, U-ESN, ASE-ESN, and BASE-ESN are 13.09, 13.55, 16.70, and 16.90, respectively.
This experiment demonstrates the suitability of maximizing ASE in tasks that require a substantial memory length.

5.2 Binary Parity Check. The effect of the adaptive bias was marginal in the previous experiment, since the nature of the problem required large ASE values. However, there are tasks whose optimal solutions require smaller ASE values and a smaller spectral radius; these are the tasks where the adaptive bias becomes a crucial design parameter in our methodology.

Consider an ESN with 100 internal units and a single input unit. The ESN is driven by a binary input signal, u(n), that assumes the values 0 or 1. The goal is to train the ESN to generate the m-bit parity of the last m bits received, where m is 3, . . . , 8. As in the previous experiments, we used the R-ESN, ASE-ESN, and BASE-ESN topologies. The R-ESN is a randomly connected ESN whose W matrix entries are set to 0, 0.06, and −0.06 with probabilities 0.8, 0.1, and 0.1, respectively; this corresponds to a sparse connectivity of 20% and a spectral radius of 0.3. The ASE-ESN and BASE-ESN are designed with a spectral radius of 0.9. The input weights are set to 1 or −1 with equal probability, and direct connections from the input to the output are allowed, whereas W^{back} is set to 0. The echo states are calculated using equation 2.1 for 1000 samples of the input signal, and the first 100 samples, corresponding to the initial transient, are eliminated.
Figure 7: The number of wrong decisions made by each ESN for m = 3, . . . , 8 in the binary parity check problem. The results are averaged over 100 different realizations of the R-ESN, ASE-ESN, and BASE-ESN for different W and Win matrices, with the specifications given in the text. The total numbers of wrong decisions for m = 3, . . . , 8 are 722, 862, and 699 for the R-ESN, ASE-ESN, and BASE-ESN, respectively.
Then the output weight matrix is calculated using equation 2.4. For the ESN with the adaptive bias, the bias is trained for each task. The binary decision is made by a threshold detector that compares the output of the ESN to 0.5.

Figure 7 shows the number of wrong decisions (averaged over 100 different realizations) made by each ESN for m = 3, . . . , 8. The total numbers of wrong decisions for m = 3, . . . , 8 are 722, 862, and 699 for the R-ESN, ASE-ESN, and BASE-ESN, respectively. The ASE-ESN performs poorly because the nature of the problem requires a short time constant for a fast response, whereas the ASE-ESN has a large spectral radius. For 5-bit parity, the R-ESN makes no wrong decisions, whereas the ASE-ESN makes 53. The BASE-ESN performs much better than the ASE-ESN and slightly better than the R-ESN, since the adaptive bias effectively reduces the spectral radius. Note that for m = 7 and 8, the ASE-ESN performs similarly to the R-ESN, since these tasks require access to a longer input history, which offsets the need for a fast response. Indeed, the bias in the BASE-ESN takes effect when there are errors (m > 4) and when the task benefits from a smaller spectral radius. The optimal bias values are approximately 3.2, 2.8, 2.6, and 2.7 for m = 3, 4, 5, and 6, respectively; for m = 7 or 8, there is a wide range of bias values (between 0 and 3) that result in similar MSE values.
In summary, this experiment clearly demonstrates the power of the bias signal to configure the ESN reservoir according to the mapping task.

5.3 System Identification. This section presents a function approximation task where the aim is to identify a nonlinear dynamical system. The unknown system is defined by the difference equation

y(n + 1) = 0.3\, y(n) + 0.6\, y(n - 1) + f(u(n)),

where

f(u) = 0.6 \sin(\pi u) + 0.3 \sin(3\pi u) + 0.1 \sin(5\pi u).

The input to the system is chosen to be sin(2πn/25). We used three different ESNs—R-ESN, ASE-ESN, and BASE-ESN—with 30 internal units and a single input unit. The W matrix of each ESN is scaled to a spectral radius of 0.95. The R-ESN is a randomly connected ESN whose W matrix entries are set to 0, 0.35, and −0.35 with probabilities 0.8, 0.1, and 0.1, respectively. In each ESN, the input weights are set to 1 or −1 with equal probability, and direct connections from the input to the output are allowed, whereas W^{back} is set to 0. The optimal output weights are calculated using equation 2.4. The MSE values (averaged over 100 realizations) for the R-ESN and ASE-ESN are 1.23 × 10^{−5} and 1.83 × 10^{−6}, respectively. The addition of the adaptive bias to the ASE-ESN reduces the MSE from 1.83 × 10^{−6} to 3.27 × 10^{−9}.
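The training data for this task can be generated directly from the difference equation above (a sketch; the run length T is our choice, not specified in the text):

```python
import numpy as np

def f(u):
    return 0.6 * np.sin(np.pi * u) + 0.3 * np.sin(3 * np.pi * u) \
         + 0.1 * np.sin(5 * np.pi * u)

T = 1000                                    # hypothetical run length
u = np.sin(2 * np.pi * np.arange(T) / 25)   # system input
y = np.zeros(T)
for n in range(1, T - 1):                   # iterate the difference equation
    y[n + 1] = 0.3 * y[n] + 0.6 * y[n - 1] + f(u[n])
```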
6 Discussion

The great appeal of echo state networks (ESNs) and liquid state machines (LSMs) is their ability to construct arbitrary mappings of signals with rich and time-varying temporal structure without requiring adaptation of the free parameters of the recurrent layer. The echo state condition allows the recurrent connections to be fixed, with training limited to the linear output layer. However, the literature has not elucidated how to properly choose the recurrent parameters for system identification applications. Here, we provide an alternative framework that interprets the echo states as a set of functional bases formed by fixed nonlinear combinations of the input. The linear readout at the output stage simply computes the projection of the desired output space onto this representation space. We further introduce an information-theoretic criterion, ASE, to better understand and evaluate the capability of a given ESN to construct such a representation layer. The average entropy of the distribution of the echo states quantifies the volume spanned by the bases; this volume should be as large as possible in order to achieve the smallest correlation among the bases and to cope with arbitrary mappings. However, not all function approximation problems require the same memory depth, which is coupled to the spectral radius. The effective spectral radius of an ESN can be optimized for the given problem with the help of an external bias signal that is adapted using joint input-output space information. The interesting property of this method, when applied to ESNs built from sigmoidal nonlinearities, is that it allows fine-tuning of the system dynamics for a given problem through a single external adaptive bias input, without changing the internal system parameters. In our opinion, the combination of the largest possible ASE with the adaptation of the spectral radius by the bias produces the most parsimonious pole locations of the linearized ESN when no knowledge about the mapping is available to optimally locate the basis functionals. Moreover, the bias can easily be trained with either a line search method or a gradient-based method, since it is one-dimensional. We have illustrated experimentally that designing the ESN by maximizing ASE, with adaptation of the spectral radius by the bias, provides consistently better performance across tasks that require different memory depths. This suggests that this two-parameter design methodology is preferable to the spectral radius criterion proposed by Jaeger, while still being easily incorporated into the ESN design.

Our experiments demonstrate that the ASE of an ESN with uniform linearized poles is maximized when the spectral radius of the recurrent weight matrix approaches one (instability). It is interesting to relate this observation to the computational properties found in dynamical systems "at the edge of chaos" (Packard, 1988; Langton, 1990; Mitchell, Hraber, & Crutchfield, 1993; Bertschinger & Natschläger, 2004). Langton stated that when cellular automata rules are evolved to perform a complex computation, evolution tends to select rules with "critical" parameter values, which correlate with a phase transition between ordered and chaotic regimes. Recently, similar conclusions were suggested for LSMs (Bertschinger & Natschläger, 2004). Langton's interpretation of the edge of chaos was questioned by Mitchell et al. (1993). Here, we provide a system-theoretic view and explain the computational behavior by the diversity of dynamics achieved with linearizations that have poles close to the unit circle. According to our results, the spectral radius of the optimal ESN for function approximation is problem dependent, and in general it is impossible to forecast the computational performance as the system approaches instability (that is, as the spectral radius of the recurrent weight matrix approaches one). However, allowing the system to modulate the spectral radius by either output or internal biasing may allow a system close to instability to solve various problems requiring different spectral radii.

Our emphasis here has been mostly on ESNs without output feedback connections. However, the proposed design methodology can also be applied to ESNs with output feedback. Both feedforward and feedback connections contribute to specifying the bases that create the projection space.
At the same time, there are applications where the output feedback contributes to the system dynamics in a different fashion. For example, it has been shown that a fixed-weight (fully trained) RNN with output feedback can implement a family of functions (meta-learners) (Prokhorov, Feldkamp, & Tyukin, 2002). In meta-learning, the role of the output feedback in the network is to bias the system toward different regions of its dynamics, providing the multiple input-output mappings required (Santiago & Lendaris, 2004). However, these results could not be replicated with ESNs (Prokhorov, 2005). We believe that more work needs to be done on output feedback in the context of ESNs, but we also suspect that the echo state condition may restrict the system dynamics for this type of problem.

There are many interesting issues to be researched in this exciting new area. Besides serving as an evaluation tool, ASE may also be utilized to train the ESN's representation layer in an unsupervised fashion: with the SIG (stochastic information gradient) described in Erdogmus, Hild, and Principe (2003), we can easily adapt extra weights linking the outputs of the recurrent states so as to maximize the output entropy. Output entropy maximization is a well-known criterion for creating independent components (Bell & Sejnowski, 1995), and here it means that the echo states would become as independent as possible. This would circumvent the linearization of the dynamical system used to set the recurrent weights and would continuously fine-tune, in an unsupervised manner, the parameters of the ESN across different inputs. However, it goes against the idea of a fixed ESN reservoir.

The reservoir of recurrent PEs can be thought of as a new form of time-to-space mapping. Unlike the delay line that forms an embedding (Takens, 1981), this mapping may have the advantage of filtering noise and producing representations with better SNRs at the peaks of the input, which is very appealing for signal processing and seems to be used in biology. However, further theoretical work is necessary in order to understand the embedding capabilities of ESNs. One disadvantage of the correlated ESN bases lies in the design of the readout: gradient-based algorithms will be very slow to converge (due to the large eigenvalue spread of the modes), and even if recursive methods are used, their stability may be compromised by the condition number of the matrix. However, our recent results incorporating an L1-norm penalty in the LMS algorithm (Rao et al., 2005) show great promise for solving this problem.

Finally, we would like to comment briefly on the implications of these models for neurobiology and computational neuroscience. The work of Pouget and Sejnowski (1997) has shown that the available physiological data are consistent with the hypothesis that the response of a single neuron in the parietal cortex serves as a basis function generated by the sensory input in a nonlinear fashion. In other words, the neurons transform the sensory input into a format (representation space) in which the subsequent computation is simplified. Then, whenever a motor command (the output of the biological system) needs to be generated,
this simple computation to read out the neuronal activity is performed. There is an intriguing similarity between Pouget and Sejnowski's interpretation of neuronal activity and our interpretation of the echo states in the ESN. We believe that similar ideas can be applied to improve the design of microcircuit implementations of LSMs. First, the framework of functional space interpretation (bases and projections) is also applicable to microcircuits. Second, the ASE measure may be directly utilized for LSM states, because the states are normally low-pass-filtered before the readout. However, how to control the ASE by changing the liquid dynamics is unclear; perhaps global control of the thresholds or of the bias currents would be able to accomplish bias control, as in ESNs with sigmoid PEs.
Acknowledgments

This work was partially supported by NSF ECS-0422718, NSF CNS-0540304, and ONR N00014-1-1-0405.
References

Amari, S.-I. (1990). Differential-geometrical methods in statistics. New York: Springer.
Anderson, J., Silverstein, J., Ritz, S., & Jones, R. (1977). Distinctive features, categorical perception, and probability learning: Some applications of a neural model. Psychological Review, 84, 413–451.
Bell, A. J., & Sejnowski, T. J. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7(6), 1129–1159.
Bertschinger, N., & Natschläger, T. (2004). Real-time computation at the edge of chaos in recurrent neural networks. Neural Computation, 16(7), 1413–1436.
Cox, R. T. (1946). Probability, frequency, and reasonable expectation. American Journal of Physics, 14(1), 1–13.
de Vries, B. (1991). Temporal processing with neural networks—the development of the gamma model. Unpublished doctoral dissertation, University of Florida.
Delgado, A., Kambhampati, C., & Warwick, K. (1995). Dynamic recurrent neural network for system identification and control. IEE Proceedings—Control Theory and Applications, 142(4), 307–314.
Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14(2), 179–211.
Erdogmus, D., Hild, K. E., & Principe, J. (2003). Online entropy manipulation: Stochastic information gradient. IEEE Signal Processing Letters, 10(8), 242–245.
Erdogmus, D., & Principe, J. (2002). Generalized information potential criterion for adaptive system training. IEEE Transactions on Neural Networks, 13(5), 1035–1044.
Feldkamp, L. A., Prokhorov, D. V., Eagen, C., & Yuan, F. (1998). Enhanced multistream Kalman filter training for recurrent networks. In J. Suykens & J. Vandewalle (Eds.), Nonlinear modeling: Advanced black-box techniques (pp. 29–53). Dordrecht, Netherlands: Kluwer.
Haykin, S. (1998). Neural networks: A comprehensive foundation (2nd ed.). Upper Saddle River, NJ: Prentice Hall.
Haykin, S. (2001). Adaptive filter theory (4th ed.). Upper Saddle River, NJ: Prentice Hall.
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780.
Hopfield, J. (1984). Neurons with graded response have collective computational properties like those of two-state neurons. Proceedings of the National Academy of Sciences, 81, 3088–3092.
Ito, Y. (1996). Nonlinearity creates linear independence. Advances in Computational Mathematics, 5(1), 189–203.
Jaeger, H. (2001). The echo state approach to analyzing and training recurrent neural networks (Tech. Rep. No. 148). Bremen: German National Research Center for Information Technology.
Jaeger, H. (2002a). Short term memory in echo state networks (Tech. Rep. No. 152). Bremen: German National Research Center for Information Technology.
Jaeger, H. (2002b). Tutorial on training recurrent neural networks, covering BPPT, RTRL, EKF and the "echo state network" approach (Tech. Rep. No. 159). Bremen: German National Research Center for Information Technology.
Jaeger, H., & Haas, H. (2004). Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication. Science, 304(5667), 78–80.
Jeffreys, H. (1946). An invariant form for the prior probability in estimation problems. Proceedings of the Royal Society of London, A 186, 453–461.
Kailath, T. (1980). Linear systems. Upper Saddle River, NJ: Prentice Hall.
Kautz, W. (1954). Transient synthesis in time domain. IRE Transactions on Circuit Theory, 1(3), 29–39.
Kechriotis, G., Zervas, E., & Manolakos, E. S. (1994). Using recurrent neural networks for adaptive communication channel equalization. IEEE Transactions on Neural Networks, 5(2), 267–278.
Kremer, S. C. (1995). On the computational power of Elman-style recurrent networks. IEEE Transactions on Neural Networks, 6(5), 1000–1004.
Kuznetsov, Y., Kuznetsov, L., & Marsden, J. (1998). Elements of applied bifurcation theory (2nd ed.). New York: Springer-Verlag.
Langton, C. G. (1990). Computation at the edge of chaos. Physica D, 42, 12–37.
Maass, W., Legenstein, R. A., & Bertschinger, N. (2005). Methods for estimating the computational power and generalization capability of neural microcircuits. In L. K. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing systems, 17 (pp. 865–872). Cambridge, MA: MIT Press.
Maass, W., Natschläger, T., & Markram, H. (2002). Real-time computing without stable states: A new framework for neural computation based on perturbations. Neural Computation, 14(11), 2531–2560.
Mitchell, M., Hraber, P., & Crutchfield, J. (1993). Revisiting the edge of chaos: Evolving cellular automata to perform computations. Complex Systems, 7, 89–130.
Packard, N. (1988). Adaptation towards the edge of chaos. In J. A. S. Kelso, A. J. Mandell, & M. F. Shlesinger (Eds.), Dynamic patterns in complex systems (pp. 293–301). Singapore: World Scientific.
Pouget, A., & Sejnowski, T. J. (1997). Spatial transformations in the parietal cortex using basis functions. Journal of Cognitive Neuroscience, 9(2), 222–237.
Principe, J. (2001). Dynamic neural networks and optimal signal processing. In Y. Hu & J. Hwang (Eds.), Neural networks for signal processing (Vol. 6-1, pp. 6–28). Boca Raton, FL: CRC Press.
Principe, J. C., de Vries, B., & de Oliveira, P. G. (1993). The gamma filter—a new class of adaptive IIR filters with restricted feedback. IEEE Transactions on Signal Processing, 41(2), 649–656.
Principe, J., Xu, D., & Fisher, J. (2000). Information theoretic learning. In S. Haykin (Ed.), Unsupervised adaptive filtering (pp. 265–319). Hoboken, NJ: Wiley.
Prokhorov, D. (2005). Echo state networks: Appeal and challenges. In Proc. of International Joint Conference on Neural Networks (pp. 1463–1466). Montreal, Canada.
Prokhorov, D., Feldkamp, L., & Tyukin, I. (2002). Adaptive behavior with fixed weights in recurrent neural networks: An overview. In Proc. of International Joint Conference on Neural Networks (pp. 2018–2022). Honolulu, HI.
Puskorius, G. V., & Feldkamp, L. A. (1994). Neurocontrol of nonlinear dynamical systems with Kalman filter trained recurrent networks. IEEE Transactions on Neural Networks, 5(2), 279–297.
Puskorius, G. V., & Feldkamp, L. A. (1996). Dynamic neural network methods applied to on-vehicle idle speed control. Proceedings of the IEEE, 84(10), 1407–1420.
Rao, Y., Kim, S., Sanchez, J., Erdogmus, D., Principe, J. C., Carmena, J., Lebedev, M., & Nicolelis, M. (2005). Learning mappings in brain machine interfaces with echo state networks. In 2005 IEEE International Conference on Acoustics, Speech, and Signal Processing. Philadelphia.
Renyi, A. (1970). Probability theory. New York: Elsevier.
Sanchez, J. C. (2004). From cortical neural spike trains to behavior: Modeling and analysis. Unpublished doctoral dissertation, University of Florida.
Santiago, R. A., & Lendaris, G. G. (2004). Context discerning multifunction networks: Reformulating fixed weight neural networks. In Proc. of International Joint Conference on Neural Networks (pp. 189–194). Budapest, Hungary.
Shah, J. V., & Poon, C.-S. (1999). Linear independence of internal representations in multilayer perceptrons. IEEE Transactions on Neural Networks, 10(1), 10–18.
Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27, 623–656.
Siegelmann, H. T. (1993). Foundations of recurrent neural networks. Unpublished doctoral dissertation, Rutgers University.
Siegelmann, H. T., & Sontag, E. (1991). Turing computability with neural nets. Applied Mathematics Letters, 4(6), 77–80.
Singhal, S., & Wu, L. (1989). Training multilayer perceptrons with the extended Kalman algorithm. In D. S. Touretzky (Ed.), Advances in neural information processing systems, 1 (pp. 133–140). San Mateo, CA: Morgan Kaufmann.
Takens, F. (1981). Detecting strange attractors in turbulence. In D. A. Rand & L.-S. Young (Eds.), Dynamical systems and turbulence (pp. 366–381). Berlin: Springer.
Thogula, R. (2003). Information theoretic self-organization of multiple agents. Unpublished master's thesis, University of Florida.
Werbos, P. (1990). Backpropagation through time: What it does and how to do it. Proceedings of the IEEE, 78(10), 1550–1560.
Werbos, P. (1992). Neurocontrol and supervised learning: An overview and evaluation. In D. White & D. Sofge (Eds.), Handbook of intelligent control (pp. 65–89). New York: Van Nostrand Reinhold.
Wilde, D. J. (1964). Optimum seeking methods. Upper Saddle River, NJ: Prentice Hall.
Williams, R. J., & Zipser, D. (1989). A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1, 270–280.
Received December 28, 2004; accepted June 1, 2006.
LETTER
Communicated by Ralph M. Siegel
Invariant Global Motion Recognition in the Dorsal Visual System: A Unifying Theory

Edmund T. Rolls
[email protected]
Simon M. Stringer
[email protected]
Oxford University, Centre for Computational Neuroscience, Department of Experimental Psychology, Oxford OX1 3UD, England
The motion of an object (such as a wheel rotating) is seen as consistent independent of its position and size on the retina. Neurons in higher cortical visual areas respond to these global motion stimuli invariantly, but neurons in early cortical areas with small receptive fields cannot represent this motion, not only because of the aperture problem but also because they do not have invariant representations. In a unifying hypothesis with the design of the ventral cortical visual system, we propose that the dorsal visual system uses a hierarchical feedforward network architecture (V1, V2, MT, MSTd, parietal cortex) with training of the connections with a short-term memory trace associative synaptic modification rule to capture what is invariant at each stage. Simulations show that the proposal is computationally feasible, in that invariant representations of the motion flow fields produced by objects self-organize in the later layers of the architecture. The model produces invariant representations of the motion flow fields produced by global in-plane motion of an object, in-plane rotational motion, looming versus receding of the object, and object-based rotation about a principal axis. Thus, the dorsal and ventral visual systems may share some similar computational principles.

Neural Computation 19, 139–169 (2007)
© 2006 Massachusetts Institute of Technology

1 Introduction

A key issue in understanding the cortical mechanisms that underlie motion perception is how we perceive the motion of objects, such as a rotating wheel, invariantly with respect to position on the retina and size. For example, we perceive the wheel shown in Figures 1 and 4a as rotating clockwise independent of its position on the retina. This occurs even though the local motion for the wheels in the different positions may be opposite (as indicated in the dashed box in Figure 1). How could this invariance of the visual motion perception of objects arise in the visual system?
Figure 1: A wheel rotating clockwise at different locations on the retina. How can a network learn to represent the clockwise rotation independent of the location of the moving object? The dashed box shows that local motion cues available at the beginning of the visual system are ambiguous about the direction of rotation when the stimulus is seen in different locations. One rotating wheel is presented at any one time, but the need is to develop a representation of the fact that in the case shown, the rotating flow field is always clockwise, independent of the location of the flow field and even though the local motion cues may be ambiguous, as shown in the dashed box.
Invariant motion representations are known to be developed in the cortical dorsal visual system. Motion-sensitive neurons in V1 have small receptive fields (in the range of 1–2 degrees at the fovea) and therefore cannot detect global motion; this is part of the aperture problem (Wurtz & Kandel, 2000). Neurons in MT, which receives inputs from V1 and V2, have larger receptive fields (e.g., 5 degrees at the fovea) and are able to respond to planar global motion, such as a field of small dots in which the majority (in practice, as little as 55%) move in one direction, or to the overall direction of a
moving plaid whose orthogonal grating components have motion at 45 degrees to the overall motion (Movshon, Adelson, Gizzi, & Newsome, 1985; Newsome, Britten, & Movshon, 1989). Further on in the dorsal visual system, some neurons in macaque visual area MST (but not MT) respond to rotating flow fields or to looming with considerable translation invariance (Graziano, Andersen, & Snowden, 1994; Geesaman & Andersen, 1996).

It is known that single neurons in the ventral visual system have translation-, size-, and even view-invariant representations of stationary objects (Rolls & Deco, 2002; Desimone, 1991; Tanaka, 1996; Logothetis & Sheinberg, 1996; Rolls, 1992, 2000, 2006). A theory that can account for this uses a feature hierarchy network (Fukushima, 1980; Rolls, 1992; Wallis & Rolls, 1997; Riesenhuber & Poggio, 1999) combined with an associative Hebb-like learning rule (in which the synaptic weights increase in proportion to the pre- and postsynaptic firing rates) with a short-term memory of, for example, 1 s, to enable different instances of the stimulus to be associated together as visual objects transform continuously from second to second in the world (Földiák, 1991; Rolls, 1992; Wallis & Rolls, 1997; Bartlett & Sejnowski, 1998; Rolls & Milward, 2000; Stringer & Rolls, 2000, 2002; Rolls & Deco, 2002). In a unifying hypothesis, we propose here that the analysis of invariant motion in the dorsal visual system uses a similar architecture and learning rule but, in contrast, utilizes as its inputs neurons that respond to local motion of the type found in the primary visual cortex, V1 (Wurtz & Kandel, 2000; Duffy, 2004; Bair & Movshon, 2004). A feature of the theory is that motion in the visual field is computed only once, in V1 (by processes that take into account luminance changes across short times), and that the representations of motion that develop in the dorsal visual system require no further computation of time-delay-related firing to compute motion. The theory is of interest because it proposes that some aspects of the computations in parts of the cerebral cortex that appear to be involved in different types of visual function, the dorsal and ventral visual systems, may in fact be performed by similar organizational and computational principles.

2 The Theory and Its Implementation in a Model

2.1 The Theory. We propose that the general architecture of the dorsal visual system areas we consider is a feedforward feature hierarchy network, the inputs to which are local motion-sensitive neurons of V1 with receptive fields of approximately 1 degree in diameter (see Figure 2). There is convergence from stage to stage, so that a neuron at any one stage need receive only a limited number of inputs from the preceding stage, yet by the end of the network an effectively global computation, taking into account information derived from different parts of the retina, can have been performed. Within each cortical layer of the architecture (or layer of the network), local lateral inhibition implemented by inhibitory feedback neurons produces competition between the neurons, in such a way that fast-firing neurons inhibit other neurons in the vicinity, so that the overall activity within an area is kept within bounds.
Figure 2: Stylized image of the hierarchical organization (layers 1–4) in the dorsal as well as the ventral visual system. The architecture is captured by the VisNet model, in which convergence through the network is designed to provide fourth-layer neurons with information from across the entire input retina.
The competition may be nonlinear, due in part to the threshold nonlinearity of neurons, and this competition, helped by the diluted connectivity (i.e., the fact that only a low proportion of the neurons are connected), enables some neurons to respond to particular combinations of the inputs received from the preceding area (Rolls & Deco, 2002; Deco & Rolls, 2005). These aspects of the architecture potentially enable single neurons at higher stages of the network to respond to combinations of the local motion inputs from V1 to the first layer of the network. These combinations, helped by the increasingly larger receptive fields, could include global motion of partly randomly moving dots (and of plaids) over
areas as large as 5 degrees in MT (Wurtz & Kandel, 2000; Duffy & Wurtz, 1996). In the architecture shown in Figure 2, layer 1 might correspond to MT; layer 2 to MST, which has receptive fields of 15–65 degrees in diameter; and layers 3 and 4 to areas in the parietal cortex, such as 7a, and to areas in the cortex in the superior temporal sulcus, which receives from parietal visual areas and in which view-invariant object-based motion is represented (Hasselmo, Rolls, Baylis, & Nalwa, 1989; Sakata, Shibutani, Ito, & Tsurugai, 1986). The synaptic plasticity between the layers of neurons has a Hebbian associative component in order to enable the system to build reliable representations in which the same neurons are activated by particular stimuli on different occasions, in what is effectively a hierarchical multilayer competitive network (Rolls, 1992; Wallis & Rolls, 1997; Rolls & Deco, 2002). Such processes might enable neurons in layer 2 of Figure 4a to respond to, for example, a wheel rotating clockwise in one position on the retina (e.g., neuron A in layer 2).

A key issue not addressed by the architecture described so far is how rotation (e.g., of a small wheel rotating clockwise) in one part of the retina activates the same neurons at the end of the network as when it is presented on a different part of the retina (see Figure 1). We propose that an associative synaptic learning rule with a short-term memory trace of neuronal activity is used between the layers to solve this problem. The idea is that if, at a high level of the architecture (labeled layer 2/3 in Figure 4a), a wheel rotating clockwise is activating a neuron in one position on the retina, the activated neurons remain active during a short delay period (of, e.g., 1 s) while the object moves to another location on the retina (e.g., the right position in Figure 4a). Then, with the postsynaptic neurons still active from the motion at the left position, the newly active synapses onto the layer 2/3 neuron (C) show associative modification, resulting in neuron C learning, in an unsupervised way, to respond to the wheel rotating clockwise in either the left or the right position on the retina. The idea is, just as for the ventral visual system (Rolls & Deco, 2002), that whatever the convergence allows to be learned at each stage of the hierarchy will be learned by this invariance algorithm, resulting in neurons higher in the hierarchy having higher- and higher-level invariance properties, including view-invariant object-based motion. More formally, the rule we propose is identical to the one proposed for the ventral visual system (Földiák, 1991; Rolls, 1992; Wallis & Rolls, 1997; Rolls & Deco, 2002), as follows:

\Delta w_j = \alpha\, \bar{y}^{\tau-1} x_j^{\tau}, \quad (2.1)

where the trace \bar{y}^{\tau} is updated according to

\bar{y}^{\tau} = (1 - \eta)\, y^{\tau} + \eta\, \bar{y}^{\tau-1}, \quad (2.2)
and we have the following definitions:

x_j: the jth input to the neuron
\bar{y}^{\tau}: trace value of the output of the neuron at time step τ
w_j: synaptic weight between the jth input and the neuron
y: output from the neuron
α: learning rate; annealed between unity and zero
η: trace parameter; the optimal value varies with the presentation sequence length

The parameter η may be set anywhere in the interval [0, 1], and for the simulations described here, η was set to 0.8, which works well with nine transforms for each object in the stimulus set (Wallis & Rolls, 1997). (A discussion of the good performance of this rule, and of its relation to other versions of trace learning rules, including the point that the trace can be implemented in the presynaptic firing, is provided by Rolls & Milward, 2000, and Rolls & Stringer, 2001. We note that in the version of the rule used here, equation 2.1, the trace is calculated from the postsynaptic firing in the preceding time step (ȳ^{τ−1}) but not the current time step; analogous performance is obtained if the firing in the current time step is also included (Rolls & Milward, 2000; Rolls & Stringer, 2001).)

The temporal trace in the brain could be implemented by a number of processes, as simple as the continuing firing of neurons for several hundred ms after a stimulus has disappeared or moved (as shown to be present for at least inferior temporal neurons in masking experiments—Rolls & Tovee, 1994; Rolls, Tovee, Purcell, Stewart, & Azzopardi, 1994), or by the long time constant of NMDA receptors and the resulting entry of calcium into neurons. An important idea here is that the temporal properties of the biologically implemented learning mechanism make it well suited to detecting the relevant continuities in the world of real motion of objects. The system uses the underlying continuity in the world to help itself learn the invariances of, for example, the motions that are typical of objects.
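A minimal sketch of this rule for a single neuron (Python/NumPy; x is the vector of presynaptic rates and y the postsynaptic rate at the current time step; the default α is illustrative, while η = 0.8 follows the text):

```python
import numpy as np

def trace_update(w, x, y, ybar_prev, alpha=0.1, eta=0.8):
    """One step of the trace learning rule of equations 2.1 and 2.2."""
    w = w + alpha * ybar_prev * x             # equation 2.1: uses the preceding trace
    ybar = (1.0 - eta) * y + eta * ybar_prev  # equation 2.2: update the memory trace
    return w, ybar
```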
2.2 The Network Architecture. The model we used for the simulations was VisNet, which was developed as a model of hierarchical processing in the ventral visual system and which uses trace learning to develop invariant representations of stationary objects (Wallis & Rolls, 1997; Rolls & Milward, 2000; Rolls & Deco, 2002). The simulations performed here utilized the latest version of the model (VisNet2), with the same parameters as used by Rolls and Milward (2000) for their investigations of the formation of invariant representations in the ventral visual system. These parameters were kept identical for all the simulations described here. The difference is that instead of using simple cell-like inputs that respond to stationary oriented bars and edges (with four spatial frequencies and four orientations), in the modeling described here we used motion-related inputs that capture some of the relevant properties of neurons present in V1 as part of the primate magnocellular (M) system (Wurtz & Kandel, 2000; Duffy, 2004; Rolls & Deco, 2002).

Table 1: VisNet Dimensions.

Layer         Dimensions       Number of Connections   Radius
Layer 4       32 × 32          100                     12
Layer 3       32 × 32          100                     9
Layer 2       32 × 32          100                     6
Layer 1       32 × 32          201                     6
Input layer   128 × 128 × 8    —                       —
Table 2: Lateral Inhibition Parameters.

    Layer    Radius, σ    Contrast, δ
    1        1.38         1.5
    2        2.7          1.5
    3        4.0          1.6
    4        6.0          1.4
VisNet is a four-layer feedforward network with unsupervised competitive learning at each layer. For each layer, the forward connections to individual cells are derived from a topologically corresponding region of the preceding layer, with connection probabilities based on a gaussian distribution (see Figure 2). These distributions are defined by a radius that contains approximately 67% of the connections from the preceding layer. Typical values are given in Table 1. Within each layer there is competition between neurons, which is graded rather than winner-take-all and is implemented in two stages. First, to implement lateral inhibition, the firing rates of the neurons within a layer (each calculated as the dot product of the vector of presynaptic firing rates and the synaptic weight vector of the neuron, followed by a linear activation function to produce a firing rate) are convolved with a spatial filter I, where δ controls the contrast and σ controls the width, and a and b index the distance from the center of the filter:
    I_{a,b} = −δ e^{−(a² + b²)/σ²}                   if a ≠ 0 or b ≠ 0,
    I_{a,b} = 1 − Σ_{(a′,b′) ≠ (0,0)} I_{a′,b′}      if a = 0 and b = 0.    (2.3)
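As a concrete illustration, the following sketch builds the filter of equation 2.3 and applies it to a layer's map of firing rates (our paraphrase; the filter support radius and the border handling of the convolution are assumptions, not details taken from the original implementation):

import numpy as np
from scipy.signal import convolve2d

def lateral_inhibition_filter(sigma, delta, radius):
    # Build the filter I of equation 2.3 on a (2*radius + 1)^2 grid.
    a = np.arange(-radius, radius + 1)
    A, B = np.meshgrid(a, a, indexing="ij")
    I = -delta * np.exp(-(A**2 + B**2) / sigma**2)
    # Center term: 1 minus the sum of all off-center terms.
    I[radius, radius] = 0.0
    I[radius, radius] = 1.0 - I.sum()
    return I

def apply_lateral_inhibition(rates, sigma, delta, radius=8):
    # Convolve the layer's 2D firing-rate map with the filter.
    return convolve2d(rates, lateral_inhibition_filter(sigma, delta, radius),
                      mode="same", boundary="fill")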
Typical lateral inhibition parameters are given in Table 2. Next, contrast enhancement is applied by means of a sigmoid function:

    y = f_sigmoid(r) = 1 / (1 + e^{−2β(r − α)}),    (2.4)
Table 3: Sigmoid Parameters.

    Layer    Percentile    Slope β
    1        99.2          190
    2        98            40
    3        88            75
    4        91            26
where r is the firing rate after lateral inhibition, y is the firing rate after contrast enhancement, and α and β are the sigmoid threshold and slope, respectively. The parameters α and β are constant within each layer, although α is adjusted to control the sparseness of the firing rates. For example, to set the sparseness to, say, 5%, the threshold is set to the value of the 95th percentile point of the firing rates r within the layer. Typical parameters for the sigmoid function are shown in Table 3.

The trace learning rule (Földiák, 1991; Rolls, 1992; Wallis & Rolls, 1997; Rolls & Milward, 2000; Rolls & Stringer, 2001; Rolls & Deco, 2002) is that shown in equation 2.1 and encourages neurons to develop invariant responses to input patterns that tend to occur close together in time, because these are likely to be from the same moving object.

2.3 The Motion Inputs to the Network. The images presented to the network represent local motion signals with small receptive fields. These local visual motion (or local optic flow) input signals are similar to those of neurons in V1 in that they have small receptive fields and cannot detect global motion because of the aperture problem (Wurtz & Kandel, 2000). At each pixel coordinate in the 128 × 128 image, a direction of local motion/optic flow is defined. The global optic flow patterns used in the different experiments occupied part of this 128 × 128 image, as described for each experiment below. At each coordinate there are eight cells, with optimal responses to local flow directions spaced 45 degrees apart; that is, the cells are tuned to local optic flow directions of 0, 45, 90, . . . , 315 degrees. The firing rate of each cell is set equal to a gaussian function of the difference between the cell's preferred direction and the actual direction of local optic flow. The standard deviation of this gaussian was 20 degrees. The number of inputs from the arrays of motion-sensitive cells to each cell in the first layer of the network is 201, selected probabilistically as a gaussian function of distance, as described above and in more detail elsewhere (Rolls & Milward, 2000). The local motion signals are given to the network, and not computed in the simulations, because the aim of the simulations is to test the theory that, given local motion inputs of the type known to be present in early cortical processing (Wurtz & Kandel, 2000), the trace learning mechanism described can, in a hierarchical network, account for a range of the types of global motion neuron that are found in the dorsal stream visual cortical areas.
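The input encoding just described is straightforward to state in code. In this hypothetical sketch, the 45 degree spacing and the 20 degree tuning width are from the text; wrapping the direction difference onto the circle is our implementation choice:

import numpy as np

PREFERRED_DEG = np.arange(8) * 45.0   # preferred directions: 0, 45, ..., 315
SIGMA_DEG = 20.0                      # gaussian tuning width from the text

def motion_input_rates(flow_direction_deg):
    # Firing rates of the eight direction-tuned cells at one image node.
    # Angular difference wrapped onto [-180, 180) degrees.
    d = (flow_direction_deg - PREFERRED_DEG + 180.0) % 360.0 - 180.0
    return np.exp(-d**2 / (2.0 * SIGMA_DEG**2))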
2.4 Training and Test Procedure. To train the network, each stimulus is presented to VisNet in a randomized sequence of locations or orientations with respect to VisNet's input retina. The different locations were spaced 32 pixels apart on the 128 × 128 retina. At each stimulus presentation, the activation of individual neurons is calculated, then the neuronal firing rates are calculated, and then the synaptic weights are updated. Each time a stimulus has been presented in all the training locations or orientations, a new stimulus is chosen at random and the process is repeated. The presentation of all the stimuli through all locations or orientations constitutes one epoch of training. In this manner, the network is trained one layer at a time, starting with layer 1 and finishing with layer 4. In the investigations described here, the numbers of training epochs for layers 1 to 4 were 50, 100, 100, and 75, respectively, as these have been shown in previous work to provide good performance (Wallis & Rolls, 1997; Rolls & Milward, 2000). The learning rates α in equation 2.1 for layers 1 to 4 were 0.09, 0.067, 0.05, and 0.04.

Two measures of performance were used to assess the ability of the output layer of the network to develop neurons that respond with view invariance to individual stimuli or objects (see Rolls & Milward, 2000). A single cell information measure was applied to individual cells in layer 4 and measures how much information is available from the response of a single cell about which stimulus was shown, independently of view. The measure was the stimulus-specific information, or surprise, I(s, R), which is the amount of information the set of responses R has about a specific stimulus s. (The mutual information between the whole set of stimuli S and of responses R is the average across stimuli of this stimulus-specific information. Note that r is an individual response from the set of responses R.)

    I(s, R) = Σ_{r∈R} P(r|s) log₂ [P(r|s) / P(r)].    (2.5)
The calculation procedure was identical to that described by Rolls, Treves, Tovee, and Panzeri (1997), with the following exceptions. First, no correction was made for the limited number of trials, because in VisNet each measurement of a response is exact, with no variation due to sampling on different trials. Second, the binning procedure used equispaced rather than equipopulated bins. This small modification was useful because the data provided by VisNet can produce perfectly discriminating responses with little trial-to-trial variability. Because the cells in VisNet can have bimodally distributed responses, equipopulated bins could fail to separate the two modes perfectly. (This is because one of the equipopulated bins might contain responses from both of the modes.) The number of bins used was equal to or less than the number of trials per stimulus, that is, for VisNet, the number of positions on the retina (Rolls et al., 1997).
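A minimal sketch of this single cell measure with equispaced bins (our paraphrase of the procedure, not the authors' code; responses are assumed noiseless, with one trial per location, as described above):

import numpy as np

def single_cell_information(responses, stimuli, n_bins):
    # responses: firing rate of one cell on each trial (here, each location)
    # stimuli:   stimulus label for each trial
    # Equispaced bins spanning the observed response range.
    edges = np.linspace(responses.min(), responses.max(), n_bins + 1)
    binned = np.clip(np.digitize(responses, edges[1:-1]), 0, n_bins - 1)
    p_r = np.bincount(binned, minlength=n_bins) / len(binned)
    info = {}
    for s in np.unique(stimuli):
        sel = binned[stimuli == s]
        p_r_s = np.bincount(sel, minlength=n_bins) / len(sel)
        nz = p_r_s > 0
        # Equation 2.5: stimulus-specific information I(s, R).
        info[s] = np.sum(p_r_s[nz] * np.log2(p_r_s[nz] / p_r[nz]))
    return info  # the performance measure reported is max(info.values())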
Because VisNet operates as a form of competitive net to perform categorization of the inputs received, good performance of a neuron will be characterized by large responses to one or a few stimuli regardless of their position on the retina (or other transform) and small responses to the other stimuli. We are thus interested in the maximum amount of information that a neuron provides about any of the stimuli rather than the average amount of information it conveys about the whole set S of stimuli (known as the mutual information). Thus, for each cell, the performance measure was the maximum amount of information the cell conveyed about any one stimulus (with a check, in practice always satisfied, that the cell had a large response to that stimulus, as a large response is what a correctly operating competitive net should produce to an identified category). In many of the graphs in this article, the amount of information each of the 50 most informative cells had about any stimulus is shown.

A multiple cell information measure, the average amount of information that is obtained about which stimulus was shown from a single presentation of a stimulus from the responses of all the cells, enabled measurement of whether, across a population of cells, information about every object in the set was provided. Procedures for calculating the multiple cell information measure are given by Rolls, Treves, and Tovee (1997) and Rolls and Milward (2000). The multiple cell information measure is the mutual information I(S, R), that is, the average amount of information about the set of stimuli S that is obtained from a single stimulus presentation from the responses of all the cells. For the multiple cell analysis, the set of responses R consists of response vectors comprising the responses from each cell. Ideally, we would like to calculate

    I(S, R) = Σ_{s∈S} P(s) I(s, R).    (2.6)
However, the information cannot be measured directly from the probability table P(r, s) embodying the relationship between a stimulus s and the response rate vector r provided by the firing of the set of neurons to a presentation of that stimulus. (Note that "stimulus" refers to an individual object that can occur with different transforms, e.g., translation or size; see Wallis & Rolls, 1997.) This is because the dimensionality of the response vectors is too large to be adequately sampled by trials. Therefore, a decoding procedure is used, in which the stimulus s that gave rise to the particular firing-rate response vector on each trial is estimated. This involves, for example, maximum likelihood estimation or dot product decoding. For example, given a response vector r to a single presentation of a stimulus, its similarity to the average response vector of each neuron to each stimulus is used to estimate, by a dot product comparison, which stimulus was shown. The probabilities of its being each of the stimuli can be estimated in this way.
Details are provided by Rolls et al. (1997). A probability table is then constructed of the real stimuli s and the decoded stimuli s′. From this probability table, the mutual information is calculated as

    I(S, S′) = Σ_{s,s′} P(s, s′) log₂ [P(s, s′) / (P(s) P(s′))].    (2.7)
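The decoding step and the confusion-table information of equation 2.7 can be sketched as follows (a minimal dot product version only; the maximum likelihood variant and the probability-weighted table used in the full procedure of Rolls et al., 1997, are omitted, and integer stimulus labels are assumed):

import numpy as np

def multiple_cell_information(train_mean, test_vectors, test_labels):
    # train_mean:   (n_stimuli, n_cells) average response vector per stimulus
    # test_vectors: (n_trials, n_cells) population response on each trial
    # test_labels:  (n_trials,) integer index of the real stimulus s
    # Dot product decoding: the decoded stimulus s' is the one whose mean
    # response vector is most similar to the trial's response vector.
    decoded = np.argmax(test_vectors @ train_mean.T, axis=1)
    n = train_mean.shape[0]
    # Probability table P(s, s') of real versus decoded stimuli.
    P = np.zeros((n, n))
    np.add.at(P, (test_labels, decoded), 1.0)
    P /= P.sum()
    Ps, Psp = P.sum(axis=1), P.sum(axis=0)   # P(s) and P(s')
    nz = P > 0
    # Equation 2.7: mutual information I(S, S').
    return np.sum(P[nz] * np.log2(P[nz] / np.outer(Ps, Psp)[nz]))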
The multiple cell information was calculated using, for each stimulus, the five cells with the highest information values for that stimulus. Thus, in this letter, 10 cells were used in the multiple cell information analysis.

3 Simulation Results

We now describe simulations with the neural network model described in section 2 that enabled us to test this theory.

3.1 Experiment 1: Global Planar Motion. Motion-sensitive neurons in V1 have small receptive fields (in the range 1–2 degrees at the fovea) and therefore cannot detect global motion; this is part of the aperture problem (Wurtz & Kandel, 2000). As described in section 1, neurons in MT have larger receptive fields and are able to respond to planar global motion (Movshon et al., 1985; Newsome et al., 1989). Here we show that the hierarchical feature network we propose can solve this global planar motion problem and, moreover, that the performance is improved by using a trace rather than a purely associative synaptic modification rule. Invariance is addressed in later simulations.

The network was trained on two 100 × 100 stimuli representing noisy left and right global planar motion (see Figure 3a). During the training, cells developed that responded to either left or right global motion but not to both (see Figure 3), with 1 bit of information representing perfect discrimination of left from right. The untrained network with initial random synaptic weights, tested as a control, showed much poorer performance, as shown in Figure 3.

It might be expected that some global planar motion sensitivity would be developed by a purely Hebbian learning rule, and indeed this has been demonstrated (under somewhat different training conditions) by Sereno (1989) and Sereno and Sereno (1991). This occurs because on any single trial with one average direction of global motion, neurons at intermediate layers will tend to receive on average inputs that reflect the current average global planar motion and will thus learn to respond optimally to the current inputs that represent that motion direction. We showed that the trace learning rule used here performed better than a Hebb rule, which produced only neurons with 0.0 bits, given that the motion stimulus patches presented in our simulations were in nonoverlapping locations, as illustrated in Figure 1.
[Figure 3 appears here. (a) Stimulus 1: global planar motion left; stimulus 2: global planar motion right. (b) Single cell analysis: information (bits) versus cell rank, for the trace-trained and random (untrained) networks. (c) Multiple cell analysis: information (bits) versus number of cells, for the trace-trained and random networks.]
Figure 3: Experiment 1. (a) The two motion stimuli used in experiment 1 were noisy global planar motion left (left) and noisy global planar motion right (right), present throughout the 128 × 128 retina. Each arrow in this and subsequent figures represents the local direction of optic flow. The size of the optic flow pattern was 100 × 100 pixels, not the 2 × 4 shown in the diagram. The noise was introduced into each image stimulus by inverting the direction of optic flow at a random set of 45% of the image nodes. This meant that it would not be possible to determine the directional bias of the flow field by examining the optic flow over local regions of the retina. Instead, the overall directional bias could be determined only by analyzing the whole image. (b) When trained with the trace rule, equation 2.1, some single cells in layer 4 conveyed 1 bit of information about whether the global motion was left or right, and this is perfect performance. (The single cell information is shown for the 50 most selective cells.) (c) The multiple cell information measures, used to show that different neurons are tuned to different stimuli (see section 2.4), indicate that over a set of neurons, information about the whole stimulus set was present. (The information values for one cell are the average of 10 cells selected from the 50 most selective cells, and hence the value is not exactly 1 bit.)
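The noisy planar-motion stimuli described in this caption are simple to generate; a hypothetical sketch (the 45% inversion figure is from the caption, while representing the flow field as a per-node direction in degrees is our own choice):

import numpy as np

def noisy_planar_motion(direction_deg, size=100, invert_frac=0.45, rng=None):
    # Every node carries the global flow direction; the direction is then
    # inverted (rotated by 180 degrees) at a random 45% of the nodes.
    rng = np.random.default_rng() if rng is None else rng
    flow = np.full((size, size), float(direction_deg))
    mask = rng.random((size, size)) < invert_frac
    flow[mask] = (flow[mask] + 180.0) % 360.0
    return flow  # per-node local optic flow directions, in degrees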
A further reason for the better performance of the trace rule is that on successive trials, the average global motion identifiable by a single intermediate-layer neuron from the probabilistic inputs will be a better estimate (a temporal average) of the true global motion, and this will be utilized in the learning. These results show that the network architecture is able to develop global motion representations from the noisy local motion patterns. Indeed, we emphasize that neurons in the input to VisNet had only local and not global motion information, as shown by the fact that the average amount of information the 50 most selective input cells had about the global motion was 0.0 bits.

3.2 Experiment 2: Rotating Wheel. Neurons in MST, but not MT, are responsive to rotation with considerable translation invariance (Graziano et al., 1994). The aim of this simulation was to determine whether layer 4 cells in our network develop position-invariant representations of wheels rotating clockwise (as shown in Figure 4a) versus anticlockwise. The stimuli consist only of optic flow fields around the rim of a geometric circle, with radius 16 unless otherwise stated. The local motion inputs from the wheel in the two positions shown are ambiguous where the wheels are close to each other in Figure 4a. The network was expected to solve the problem as illustrated in Figure 4a.

The results in Figures 4b to 4d show perfect performance on position invariance when trained with the trace rule but not when untrained. The perfect performance is shown by neurons that responded to, for example, clockwise but not anticlockwise rotation and did this for each of the nine training positions. Figure 4e shows perfect size invariance for some layer 4 cells when the network was trained with the trace rule with three different radii of the wheels: 10, 16, and 22. These results show that the network architecture is able to develop location- and size-invariant representations of the global rotating-wheel motion patterns even though the neurons in the input layer receive information from only a small local region of the retina.

We note that the position-invariant global motion results shown in Figure 4 were not due to chance mappings of the two stimuli through the network and were a result of the training, in that the position-invariant information about whether the global motion was clockwise or anticlockwise was 0.0 bits for both the single and the multiple cell information in the untrained ("random") network. Corresponding differences between the trained and the untrained networks were found in all the other experiments described in this article.

3.3 Experiment 3: Looming. Neurons in macaque dorsal stream visual area MSTd respond to looming stimuli with considerable translation invariance (Graziano et al., 1994; Geesaman & Andersen, 1996).
We tested whether the network could learn to respond to small patches of looming versus contracting motion typically generated by objects as they are seen successively at different locations on the retina. The network was trained on two circular flow patterns representing looming toward and looming away, as shown in Figure 5a. The stimuli are circular optic flow fields, with the direction of flow either away from (left) or toward (right) the center of the circle, and with radius 16 unless otherwise stated.

The results shown in Figures 5b to 5d show perfect performance on position invariance when trained with the trace rule but not when untrained. The perfect performance is shown by neurons that responded to, for example, looming toward but not movement away and did this for each of the nine training positions. Simulations were run for various optic flow field diameters to test the robustness of the results, and in all cases tested (which included radii of 10 and 20 as well as intermediate values), cells developed a transform (location) invariant representation in the output layer.
Figure 4: Experiment 2. (a) Two rotating wheels at different locations rotating in opposite directions. The local flow field is ambiguous. Clockwise or counterclockwise rotation can be diagnosed only by a global flow computation, and it is shown how the network is expected to solve the problem to produce position-invariant global-motion-sensitive neurons. One rotating wheel is presented at any one time, but the need is to develop a representation of the fact that in the case shown, the rotating flow field is always clockwise, independent of the location of the flow field. (b) Single cell information measures showing that some layer 4 neurons have perfect performance of 1 bit (clockwise versus anticlockwise) after training with the trace rule, but not with random initial synaptic weights in the untrained control condition. (c) The multiple cell information measures show that small groups of neurons have perfect performance. (d) Position invariance illustrated for a single cell from layer 4, which responded only to the clockwise rotation, and for every one of the nine positions. (e) Size invariance illustrated for a single cell from layer 4, which, after training with three different radii of rotating wheel, responded only to anticlockwise rotation, independent of the size of the rotating wheels. (For the position-invariant simulations, the wheel rims overlapped, but are shown slightly separated in Figure 1 for clarity.) The training grid spacing was 32 pixels, and the radii of the wheels were 16 pixels. This ensured that the rims of the wheels in adjacent training grid locations overlapped. One wheel was shown on any one trial. On successive trials, the wheel rotating clockwise was shown in each of the nine locations, allowing the trace learning rule to build location-invariant representations of the wheel rotating in one direction. In the next set of training trials, the wheel was shown rotating in the opposite direction in each of the nine locations. For the size-invariant simulations, the network was trained and tested with the set of clockwise versus anticlockwise rotating wheels presented in three different sizes.
[Figure 4 appears here. (a) Architecture schematic with example neurons A, B, and C: input layer = V1/V2, local motion; layer 1 = MT, global planar motion, intermediate receptive field size; layer 2 = MT, rotational motion, large receptive field size; layer 3 = MSTd or higher, rotational motion with invariance, larger receptive field size. (b) Single cell analysis: information (bits) versus cell rank, trace versus random. (c) Multiple cell analysis: information (bits) versus number of cells, trace versus random. (d) Firing rate versus location index for 'clock' and 'anticlock' rotation for a single layer 4 cell. (e) Firing rate versus wheel radius for 'clock' and 'anticlock' rotation for a single layer 4 cell. (The panel titles identify the plotted cells as layer 4 cells (24,13) and (7,11).)]
These results show that the network architecture is able to develop invariant representations of the global looming motion patterns even though the neurons in the input layer receive information from only a small local region of the retina.

3.4 Experiment 4: Rotating Cylinder. Some neurons in the macaque cortex in the anterior part of the superior temporal sulcus (which receives inputs from both the dorsal and ventral visual streams; Ungerleider & Mishkin, 1982; Seltzer & Pandya, 1978; Rolls & Deco, 2002) respond to a head when it is rotating clockwise about its own axis but not counterclockwise, regardless of whether it is upright or inverted (Hasselmo et al., 1989). The result of the inversion experiment shows that these neurons are not just responding to global flow across the visual field but are taking into account information about the shape and features of the object. Some neurons in the parietal cortex may also respond to motion of an object about one of its axes in an object-based way (Sakata et al., 1986). In experiment 4, we tested whether the network could self-organize to form neurons that represent global motion in an object-based coordinate frame.

The network was trained on two stimuli, with four transforms of each. Figure 6a shows stimulus 1, which is a cylinder with shading at the top rotating clockwise about its own (top-defined) axis. Stimulus 1 is shown in its upright and inverted transforms. Stimulus 2 is the same cylinder with shading at the top, but rotating anticlockwise about its own vertical axis. The stimuli were presented in a single location, but to solve the problem, the network must form some neurons that respond to the clockwise rotation of the shaded cylinder independently of which of the four transforms was shown: upright (0 degrees), 90 degrees, inverted (180 degrees), and 270 degrees.
Figure 5: Experiment 3. (a) The two motion stimuli were flow fields looming toward (left) and looming away (right). The stimuli are circular optic flow fields, with the direction of flow either away from (left) or toward (right) the center of the circle. Local motion cells near, for example, the intersection of the two stimuli cannot distinguish between the two global motion patterns. Location-invariant representations (for nine different locations) of stimuli looming toward or moving away from the observer were learned, as shown by the single cell information measures (b) and multiple cell information measures (c) (using the same conventions as in Figure 3), if the network was trained with the trace rule but not if it was untrained. (d) Position invariance illustrated for a single cell from layer 4, which responded only to moving away, and for every one of the nine positions. (The network was trained and tested with the stimuli presented in a 3 × 3 grid of nine retinal locations, as in experiment 1. The training grid spacing was 32 pixels, and the radii of the circular looming stimuli were 16 pixels. This ensured that the edges of the looming stimuli in adjacent training grid locations overlapped, as shown in the dashed box of Figure 5a.)
[Figure 5 appears here. (a) Stimulus 1: looming towards; stimulus 2: moving away. (b) Single cell analysis: information (bits) versus cell rank, trace versus random. (c) Multiple cell analysis: information (bits) versus number of cells, trace versus random. (d) Firing rate of layer 4 cell (9,7) versus location index for the 'towards' and 'away' stimuli.]
Other neurons should self-organize to respond to view-invariant counterclockwise rotation. For this experiment, additional information about surface luminance must be fed into the first layer of the network in order for the network to be able to distinguish between the clockwise and anticlockwise rotating cylinders. Additional retinal inputs to the first layer of the network therefore came from a 128 × 128 array of luminance-sensitive cells. The cells within the luminance array are maximally activated by the shaded region of the cylinder image; elsewhere the luminance inputs are zero. The number of inputs from the array of luminance-sensitive cells to each cell in the first layer of the network was 50.

The results shown in Figures 6b and 6c show perfect performance for many single cells, and across multiple cells, in representing the direction of rotation of the shaded cylinder about its own axis regardless of which of the four transforms was shown, when trained with the trace rule but not when untrained. Simulations were run for various sizes of the cylinders, including height = 40 and diameter = 20. For all simulations, cells developed a transform (e.g., upright, inverted) invariant representation in the output layer. That is, some cells responded to one of the stimuli in all four of its transforms (i.e., orientations) but not to the other stimulus. These results show that the network architecture is able to develop object-centered view-invariant representations of the global motion patterns representing the two rotating cylinders, even though the neurons in the input layer receive information from only a small, local region of the retina.
Figure 6: Experiment 4. (a) Stimulus 1, which is a cylinder with shading at the top rotating clockwise about its own (top-defined) axis. Stimulus 1 is shown in its upright and inverted transforms. Stimulus 2 is the same cylinder with shading at the top, but rotating anticlockwise about its own axis. Invariant representations were formed, with some cells coding for the object rotating clockwise about its own axis and other cells coding for the object rotating anticlockwise, invariantly with respect to which of the four transforms (0 degrees = upright, 90 degrees, 180 degrees = inverted, and 270 degrees) was viewed, as shown by the single cell information measures (b) and multiple cell information measures (c) (using the same conventions as in Figure 3). Because only eight images in one location form the training set, some single cells with the random untrained connectivity had, by chance, some information about which stimulus was shown, but cells performed the correct mapping only if the network was trained with the trace rule.
[Figure 6 appears here. (a) Stimulus 1: cylinder rotating clockwise when viewed from the shaded end; stimulus 2: cylinder rotating anticlockwise when viewed from the shaded end; each shown in its upright and inverted transforms. (b) Single cell analysis: information (bits) versus cell rank, trace versus random. (c) Multiple cell analysis: information (bits) versus number of cells, trace versus random.]
3.5 Experiment 5: Optic Flow Analysis of Real Images: Translation Invariance. In experiments 5 and 6, we extend this research by testing the operation of the model when the optic flow inputs to the network are extracted by a motion analysis algorithm operating on the successive images generated by moving objects. The optic flow fields generated by a moving object were calculated as described next and were used to set the firing of the motion-selective cells, the properties of which are described in section 2.3. These optic flow algorithms use an image gradient-based method, which exploits the relationship between the spatial and temporal gradients of intensity, to compute the local optic flow throughout the image. The image flow constraint equation

    I_x U + I_y V + I_t = 0

is approximated at each pixel location by algebraic finite difference approximations in space and time (Horn & Schunck, 1981). Systems of these finite difference equations are then solved for the local image velocity (U, V) within each 4 × 4 pixel block of the image. The images of the rotating objects were generated using OpenGL.

In experiment 5, we investigated the learning of translation-invariant representations of the optic flow vector fields generated by clockwise versus anticlockwise rotation of the tetrahedron stimulus illustrated in Figure 7a. The network was trained with the two optic flow patterns generated in nine different locations, as in experiments 2 and 3. The flow fields used to train the network were generated by the object rotating through one degree of angle. The single cell information measures (see Figure 7b) and multiple cell information measures (see Figure 7c) (using the same conventions as in Figure 3) show that the maximal information, 1 bit, was reached by single cells and with the multiple cell information measure. The dashed line shows the control condition of a network with random untrained connectivity. This experiment shows that the model can operate well and learn translation-invariant representations with motion flow fields actually extracted from the successive images produced by a rotating object.

3.6 Experiment 6: Optic Flow Analysis of Real Images: Rotation Invariance. In experiment 6, we investigated the learning of rotation-invariant representations of the optic flow vector fields generated by clockwise versus anticlockwise rotation of the spoked wheel stimulus illustrated in Figure 8a. (The algorithm for generating the optic flow field is described in section 3.5.) The radius of the spoked wheel was 50 pixels on the 128 × 128 background. The rotation was in-plane, and the optic flow fields used as input to the network were extracted from the changing images of the object, each separated by one degree, as it rotated through 360 degrees. The single cell information measures (see Figure 8b) and multiple cell information measures (see Figure 8c) (using the same conventions as in Figure 3) show that the maximal information, 1 bit, was almost reached by single cells and by the multiple cell information measure. The dashed line shows the control condition of a network with random untrained connectivity.
[Figure 7 appears here. (a) Optic flow fields produced by a tetrahedron rotating clockwise or anticlockwise. (b) Single cell analysis: information (bits) versus cell rank, trace versus random. (c) Multiple cell analysis: information (bits) versus number of cells, trace versus random.]
Figure 7: Experiment 5. Translation-invariant representations of the optic flow vector fields generated by clockwise versus anticlockwise rotation of the tetrahedron stimulus illustrated. The optic flow field used as input to the network was extracted from the changing images of the object as it rotated. The single cell information measures (b) and multiple cell information measures (c) (using the same conventions as in Figure 3) show that the maximal information, 1 bit, was reached by single cells and by the multiple cell information measure. The dashed line shows the control condition of a network with random untrained connectivity.
This experiment shows that the model can operate well and learn rotation-invariant representations with motion flow fields actually extracted from a very large number of successive images produced by a rotating object. Because of the large number of closely spaced training images used in this simulation, it is likely that the crucial type of learning was continuous transformation learning (Stringer, Perry, Rolls, & Proske, 2006). Consistent with this, the learning rate was set to the lower value of 7.2 × 10^−5 for all layers in experiment 6 (cf. Stringer et al., 2006).
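For completeness, the gradient-based flow extraction used in experiments 5 and 6 (see section 3.5) can be sketched as a per-block least-squares solve. This is our minimal paraphrase, not the code used in the study; the finite-difference stencils, the absence of presmoothing, and the handling of image borders are all assumptions:

import numpy as np

def block_optic_flow(img1, img2, block=4):
    # Spatial and temporal intensity gradients via simple finite differences.
    Ix = np.gradient(img1, axis=1)
    Iy = np.gradient(img1, axis=0)
    It = img2 - img1
    H, W = img1.shape          # assumed divisible by the block size
    flow = np.zeros((H // block, W // block, 2))
    for i in range(0, H, block):
        for j in range(0, W, block):
            # Stack the constraint Ix*U + Iy*V + It = 0 over the block and
            # solve the overdetermined system for the local velocity (U, V).
            A = np.stack([Ix[i:i + block, j:j + block].ravel(),
                          Iy[i:i + block, j:j + block].ravel()], axis=1)
            b = -It[i:i + block, j:j + block].ravel()
            uv, *_ = np.linalg.lstsq(A, b, rcond=None)
            flow[i // block, j // block] = uv
    return flow

The recovered (U, V) per block would then be converted to a local flow direction to drive the direction-tuned input cells of section 2.3.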
[Figure 8 appears here. (a) Optic flow fields produced by a spoked wheel rotating clockwise or anticlockwise. (b) Single cell analysis: information (bits) versus cell rank, trace versus random. (c) Multiple cell analysis: information (bits) versus number of cells, trace versus random.]
Figure 8: Experiment 6. In-plane rotation-invariant representations of the optic flow vector fields generated by a spoked wheel rotating clockwise or anticlockwise. The optic flow field used as input to the network was extracted from the changing images of the object as it rotated through 360 degrees, each image separated by 1 degree. The single cell information measures (b) and multiple cell information measures (c) (using the same conventions as in Figure 3) show that the maximal information, 1 bit, was reached by single cells and by the multiple cell information measure. The dashed line shows the control condition of a network with random untrained connectivity.
3.7 Experiment 7: Generalization to Untrained Images. To investigate whether the representations of object-based motion such as circular rotation learned with the approach introduced in this article would generalize usefully to the flow fields generated by other objects moving in the same way, we trained the network on the optic flow vector fields generated by clockwise versus anticlockwise rotation of the spoked wheel stimulus illustrated in Figure 8. The training images rotated through 90 degrees in 1 degree steps. We then tested generalization to the new, untrained image shown in Figure 9a. The single and multiple cell information plots in Figure 9b show that information was available about the direction of rotation (clockwise versus anticlockwise) of the untrained test images.
[Figure 9 appears here. (a) Generalization: training with a rotating spoked wheel, followed by testing with a regular grid rotating clockwise or anticlockwise. (b) Single and multiple cell analysis: information (bits) versus cell rank and versus number of cells, trace versus random. (c) Responses of a typical layer 4 cell (17,17) to the spoked wheel after training with the spoked wheel alone: firing rate versus orientation (deg) for clockwise and anticlockwise rotation. (d) Responses of the same cell to the grid: firing rate versus orientation (deg) for clockwise and anticlockwise rotation.]
Figure 9: Experiment 7. Generalization to untrained images. The network was trained on the optic flow vector fields generated by the spoked wheel stimulus illustrated in Figure 8 rotating clockwise or anticlockwise. (a) Generalization to the new untrained image shown at the top right of the Figure was then tested. (b) The single and multiple cell information plots show that information was available about the direction of rotation (clockwise versus anticlockwise) of the untrained test images. (c) The firing rate of a fourth layer cell to the clockwise and anticlockwise rotations of the trained image illustrated in Figure 8. (d) The firing rate of the same fourth layer cell to the clockwise and anticlockwise rotations of the untrained image illustrated in Figure 9a.
Although the information was not as high as 1 bit, which would have indicated perfect generalization, individual cells did generalize usefully to the new images, as shown in Figures 9c and 9d. For example, Figure 9c shows the firing rate of a fourth-layer cell to the clockwise and anticlockwise rotations of the trained image illustrated in Figure 8. Figure 9d shows the firing rate of the same fourth-layer cell to the clockwise and anticlockwise rotations of the untrained image illustrated in Figure 9a. The neuron responded correctly to almost all the anticlockwise rotation shifts, and correctly to many of the clockwise rotation shifts, though some noise was evident in its responses to the untrained images. Overall, the results demonstrate useful generalization: after training with one object, the ability to represent rotation transferred to an untrained, different object.
4 Discussion

We have presented a hierarchical feature analysis theory of the operation of parts of the dorsal visual system, which provides a computational account of how transform-invariant representations of the flow fields generated by moving objects could be formed in the cerebral cortex. The theory uses a modified Hebb rule with a short-term temporal trace of preceding activity to enable whatever is invariant across short time intervals at any stage of the dorsal motion system to be associated together. The theory can account for many types of invariance and has been tested by simulation for position and size invariance. The simulations show that the network can develop global planar representations from noisy local motion inputs (experiment 1), invariant representations of rotating optic flow fields (experiment 2), invariant representations of looming optic flow fields (experiment 3), and invariant representations of asymmetrical objects rotating about one of their axes (experiment 4). These are fundamental problems in motion analysis, and they have all been studied neurophysiologically, including local versus planar motion (Movshon et al., 1985; Newsome et al., 1989); position-invariant representation of rotating flow fields and looming (Lagae, Maes, Raiguel, Xiao, & Orban, 1994); and object-based rotation (Hasselmo et al., 1989; Sakata et al., 1986). The model thus shows principles by which the different types of motion-related invariant neuronal responses in the dorsal cortical visual system could be produced. The theory is unifying in the sense that the same theory, but with different inputs, can account for invariant representations of objects in the ventral visual system (Rolls, 1992; Wallis & Rolls, 1997; Elliffe, Rolls, & Stringer, 2002; Rolls & Deco, 2002). It is a strength of the unifying concept introduced in this article that the same hierarchical network that can perform computations of the type important in the ventral visual system can also perform computations of a type important in the dorsal visual system.
Our simulations support the hypothesis that the response properties that distinguish MT and MST neurons from V1 neurons are determined in part by the sizes of their receptive fields, with a larger receptive field needed to analyze some global motion patterns. Similar conclusions were drawn from simulation experiments performed by Sereno (1989) and Sereno and Sereno (1991). This type of self-organization can occur with a Hebbian associative learning rule operating on the feedforward connections to a competitive network. However, experiment 1 showed that even for the computation of planar global motion in intermediate layers such as MT, a trace-based associative learning rule is better than a purely associative Hebbian rule with noisy (probabilistic) local motion inputs, because the trace rule allows temporal averaging to contribute to the learning. In experiments 2 and 3, the trace rule is crucial to the success of the learning, in that the stimuli did not overlap when presented in different training locations, so that the only process by which the different transforms can be linked is the temporal trace learning rule implemented in the model (Rolls & Milward, 2000; Rolls & Stringer, 2001). (We note that in a new development, it has been shown that if different transforms of the training stimuli do overlap continuously in space, then this overlap can provide a useful learning principle for forming invariant representations and requires only associative synaptic modification; Stringer et al., 2006. It would be of interest to extend this concept, which has been applied to the ventral visual system, to the dorsal visual system.)

One type of perceptual analysis that can be understood with the theory and simulations described here is how neurons can self-organize to respond to the motion inputs produced by small objects when they are seen on different parts of the retina. This is achieved by using memory-trace-based synaptic modification in the type of architecture illustrated in Figure 4a. The crucial stage for this learning is the top layer in Figure 4a, labeled layer 2/3. The forward connections to the neurons in this layer can form the required representation if they use a trace or similar learning rule and the object motion occurs with some temporospatial continuity. (Temporospatial continuity has been shown to be important in human face invariance learning [Wallis & Bulthoff, 2001], and spatial continuity over continuous transforms may be a useful learning principle [Stringer et al., 2006].) It is this aspect of the architecture that is formally similar to the architecture of the ventral visual system, which can learn invariant representations of stationary objects. The only difference required of the networks is that the ventral visual stream network should receive inputs from neurons that respond to stationary features such as lines or edges and that the dorsal visual stream network should receive inputs from neurons that respond to local motion cues. It is this concept that allows us to propose a unifying hypothesis that applies to some of the computations performed by both the ventral and the dorsal visual streams.
The way in which position-invariant representations develop in the model is illustrated in Figure 4a, where, in the top layer, labeled layer 3, individual neurons receive information from different parts of layer 2, in which different neurons can represent the same object motion but in different parts of visual space. In the model, layer 2 can thus be thought of as corresponding to some neurons in area MT, in which direction selectivity for elementary optic flow components such as rotation, deformation, and expansion and contraction is not position invariant (Lagae et al., 1994). Layer 3 in the model can in the same way be thought of as corresponding to area MST, in which direction selectivity for elementary optic flow components such as rotation, deformation, and expansion and contraction is position invariant for 40% of neurons (Lagae et al., 1994). A further correspondence between the model and the brain is that neurons that respond to global planar motion are found in the brain in area MT (Movshon et al., 1985; Newsome et al., 1989) and in the model in layer 1, whereas neurons in V1 and V2 do not respond to global motion (Movshon et al., 1985; Newsome et al., 1989) and correspond in the model to the input layer of Figure 4a.

Another type of perceptual analysis that can be understood with the theory and simulations described here is the object-based, view-independent representation of objects, exemplified by the ability to see that an "ended" object is rotating clockwise about one of its axes. It was shown in experiment 4 that these representations can be formed by combining information from both the dorsal visual stream (about global motion) and the ventral visual stream (about object shape and/or luminance features). For these representations to be learned, a trace associative or similar learning rule must be used while the object transforms from one view to another (e.g., from upright to inverted).

A hierarchical network with the general architecture shown in Figure 2, with separate analyses of form and motion that are combined at a final stage (as in experiment 4), is also useful for biological motion, such as representing a person walking (Giese & Poggio, 2003). However, the network described by Giese and Poggio is not very biologically plausible, in that it performs MAX functions to help with the computational issue of transform invariance and does not self-organize on the basis of its inputs, so that it must be largely hand-wired. The issue here is that Giese and Poggio suppose that a MAX function is performed to select the maximally active afferent to a neuron, but there is no account of how afferents of just one type (e.g., a bar with a particular orientation and contrast) come to be received by a given neuron. Not only is no principle suggested by which this could be achieved, but also no learning algorithm is given to achieve it. We suggest therefore that it would be of interest to investigate whether the more biologically plausible self-organizing type of network described in this article can learn, on the basis of the inputs being received, to respond to biological motion. To do this, some form of sequence sensitivity would be useful.
The theory described here is appropriate for the global motion analysis required to analyze the flow fields of objects as they translate, rotate, expand (loom), or contract, as shown in experiments 1 to 3. The theory thus provides a model of some of the computations that appear to occur along the pathway V1–V2–MT–MST, as neurons of these types are generated along this pathway (see section 1). The theory described here can also account for global motion in an object-based coordinate frame, as shown in experiment 4. Neurons with these properties have been found in the cortex in the anterior part of the macaque superior temporal sulcus, in which neurons respond to a head when it is rotating clockwise about its own axis but not counterclockwise, regardless of whether it is upright or inverted (Hasselmo et al., 1989). The result of the inversion experiment shows that these neurons are not just responding to global flow across the visual field but are taking into account information about the shape and features of the object. Area STPa (the cortex in the anterior part of the macaque superior temporal sulcus) contains neurons that respond to a rotating sphere (Anderson & Siegel, 2005), and as shown in experiment 4, the present theory could account for such neurons. Whether the present model could account for the structure-from-motion selectivity also observed for these neurons is not yet known. The theory could also account for neurons in area 7a of the parietal cortex that may respond to motion of an object about one of its axes in an object-based way (Sakata et al., 1986). Neurons have also been found in the primary motor cortex (M1) that respond similarly to neurons in area 7a when a monkey is solving a visually presented maze (Crowe, Chafee, Averbeck, & Georgopoulos, 2004), but their visual properties are not sufficiently understood to know whether the present model might apply. Area LIP contains neurons that perform processing related to saccadic eye movements to visual targets (Andersen, 1997), and the present theory may not apply to this type of processing.

The model of processing utilized here, a series of hierarchically organized competitive networks with convergence at each stage (as illustrated in Figure 2), is intended to capture some of the main anatomical and physiological characteristics of the ventral visual stream of visual cortical areas and to provide a model for how processing in these areas could operate, as described in detail elsewhere (Rolls & Deco, 2002; Rolls & Treves, 1998). To enable learning along this pathway to result, by self-organization, in the correct representations being formed, associative learning using a short-term memory trace has been proposed (Rolls, 1992; Wallis & Rolls, 1997; Rolls & Milward, 2000; Rolls & Stringer, 2001; Rolls & Deco, 2002). Another approach, used in continuous transformation learning, utilizes associative learning without a temporal trace and relies on close exemplars of stimuli being provided during the training (Stringer et al., 2006). What we propose here is that similar connectivity and learning processes in the series of cortical pathways in the dorsal visual stream that includes V1–V2–MT–MST, with onward connections to the cortex in the superior temporal sulcus and area 7a, could account for the invariant representations of the flow fields produced by moving objects.
In relation to the number of stimuli that could be learned by the system, we note that the network simulated is relatively small and was designed to illustrate the new computational hypotheses introduced here rather than to analyze the capacity of such feature hierarchical systems. We note in particular that the network simulated has 1024 neurons in each layer and 100 inputs to each neuron in layers 2 to 4. In contrast, it has been estimated that perhaps half of the macaque brain is involved in visual processing, and typically each neuron has on the order of 10^4 inputs. It will be of interest in the future to use much larger simulations to address the capacity issues of this class of network. However, we note that because the network can generalize to rotational flow fields generated by untrained stimuli, as shown in experiment 7, separate representations for the flow fields generated by every object may not be required, and this helps to reduce the number of separate representations that the network may be required to learn.

In contrast to some other theories, the theory developed here utilizes a single unified approach to self-organization in the dorsal and ventral visual systems. Predictions of the theory described here include the following. First, use of a trace rule in the dorsal as well as the ventral visual system is predicted. (Thus, differences between the two systems in, for example, the time constants of NMDA receptors or in persistent poststimulus firing, either of which could implement a temporal trace, would not be expected.) Second, a feature hierarchy is a useful way of understanding details of the operation of the ventral visual system, but it can now be used as a clarifying concept for how the details of representations in the dorsal visual system may be built. Third, the theory predicts that neurons specialized for motion detection by using differences in the arrival times of sensory inputs from different retinal locations need occur at only one stage of the system (e.g., in V1) and need not occur elsewhere in the dorsal visual system. These are labeled as local motion neurons in Figure 4a.

Acknowledgments

This research was supported by the Wellcome Trust and by the Medical Research Council.

References

Andersen, R. A. (1997). Multimodal integration for the representation of space in the posterior parietal cortex. Philosophical Transactions of the Royal Society of London B, 352, 1421–1428.

Anderson, K. C., & Siegel, R. M. (2005). Three-dimensional structure-from-motion selectivity in the anterior superior temporal polysensory area, STPa, of the behaving monkey. Cerebral Cortex, 15, 1299–1307.
Bair, W., & Movshon, J. A. (2004). Adaptive temporal integration of motion in direction-selective neurons in macaque visual cortex. Journal of Neuroscience, 24, 7305–7323.

Bartlett, M. S., & Sejnowski, T. J. (1998). Learning viewpoint-invariant face representations from visual experience in an attractor network. Network: Computation in Neural Systems, 9, 399–417.

Crowe, D. A., Chafee, M. V., Averbeck, B. B., & Georgopoulos, A. P. (2004). Participation of primary motor cortical neurons in a distributed network during maze solution: Representation of spatial parameters and time-course comparison with parietal area 7a. Experimental Brain Research, 158, 28–34.

Deco, G., & Rolls, E. T. (2005). Neurodynamics of biased competition and cooperation for attention: A model with spiking neurons. Journal of Neurophysiology, 94, 295–313.

Desimone, R. (1991). Face-selective cells in the temporal cortex of monkeys. Journal of Cognitive Neuroscience, 3, 1–8.

Duffy, C. J. (2004). The cortical analysis of optic flow. In L. M. Chalupa & J. S. Werner (Eds.), The visual neurosciences (Vol. 2, pp. 1260–1283). Cambridge, MA: MIT Press.

Duffy, C. J., & Wurtz, R. H. (1996). Optic flow, posture, and the dorsal visual pathway. In T. Ono, B. L. McNaughton, S. Molotchnikoff, E. T. Rolls, & H. Nishijo (Eds.), Perception, memory and emotion: Frontiers in neuroscience (pp. 63–77). Cambridge: Cambridge University Press.

Elliffe, M. C. M., Rolls, E. T., & Stringer, S. M. (2002). Invariant recognition of feature combinations in the visual system. Biological Cybernetics, 86, 59–71.

Földiák, P. (1991). Learning invariance from transformation sequences. Neural Computation, 3, 194–200.

Fukushima, K. (1980). Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36, 193–202.

Geesaman, B. J., & Andersen, R. A. (1996). The analysis of complex motion patterns by form/cue invariant MSTd neurons. Journal of Neuroscience, 16, 4716–4732.

Giese, M. A., & Poggio, T. (2003). Neural mechanisms for the recognition of biological movements. Nature Reviews Neuroscience, 4, 179–192.

Graziano, M. S. A., Andersen, R. A., & Snowden, R. J. (1994). Tuning of MST neurons to spiral motions. Journal of Neuroscience, 14, 54–67.

Hasselmo, M. E., Rolls, E. T., Baylis, G. C., & Nalwa, V. (1989). Object-centered encoding by face-selective neurons in the cortex in the superior temporal sulcus of the monkey. Experimental Brain Research, 75, 417–429.

Horn, B. K. P., & Schunck, B. G. (1981). Determining optical flow. Artificial Intelligence, 17, 185–203.

Lagae, L., Maes, H., Raiguel, S., Xiao, D.-K., & Orban, G. A. (1994). Responses of macaque STS neurons to optic flow components: A comparison of areas MT and MST. Journal of Neurophysiology, 71, 1597–1626.

Logothetis, N. K., & Sheinberg, D. L. (1996). Visual object recognition. Annual Review of Neuroscience, 19, 577–621.

Movshon, J. A., Adelson, E. H., Gizzi, M. S., & Newsome, W. T. (1985). The analysis of moving visual patterns. In C. Chagas, R. Gattass, & C. Gross (Eds.), Pattern recognition mechanisms (pp. 117–151). New York: Springer-Verlag.
Newsome, W. T., Britten, K. H., & Movshon, J. A. (1989). Neuronal correlates of a perceptual decision. Nature, 341, 52–54.

Riesenhuber, M., & Poggio, T. (1999). Hierarchical models of object recognition in cortex. Nature Neuroscience, 2, 1019–1025.

Rolls, E. T. (1992). Neurophysiological mechanisms underlying face processing within and beyond the temporal cortical visual areas. Philosophical Transactions of the Royal Society, 335, 11–21.

Rolls, E. T. (2000). Functions of the primate temporal lobe cortical visual areas in invariant visual object and face recognition. Neuron, 27, 205–218.

Rolls, E. T. (2006). The representation of information about faces in the temporal and frontal lobes of primates including humans. Neuropsychologia. PMID: 16797609.

Rolls, E. T., & Deco, G. (2002). Computational neuroscience of vision. New York: Oxford University Press.

Rolls, E. T., & Milward, T. (2000). A model of invariant object recognition in the visual system: Learning rules, activation functions, lateral inhibition, and information-based performance measures. Neural Computation, 12, 2547–2572.

Rolls, E. T., & Stringer, S. M. (2001). Invariant object recognition in the visual system with error correction and temporal difference learning. Network: Computation in Neural Systems, 12, 111–129.

Rolls, E. T., & Tovee, M. J. (1994). Processing speed in the cerebral cortex and the neurophysiology of visual masking. Proceedings of the Royal Society, B, 257, 9–15.

Rolls, E. T., Tovee, M. J., Purcell, D. G., Stewart, A. L., & Azzopardi, P. (1994). The responses of neurons in the temporal cortex of primates, and face identification and detection. Experimental Brain Research, 101, 474–484.

Rolls, E. T., & Treves, A. (1998). Neural networks and brain function. New York: Oxford University Press.

Rolls, E. T., Treves, A., & Tovee, M. J. (1997). The representational capacity of the distributed encoding of information provided by populations of neurons in the primate temporal visual cortex. Experimental Brain Research, 114, 149–162.

Rolls, E. T., Treves, A., Tovee, M., & Panzeri, S. (1997). Information in the neuronal representation of individual stimuli in the primate temporal visual cortex. Journal of Computational Neuroscience, 4, 309–333.

Sakata, H., Shibutani, H., Ito, Y., & Tsurugai, K. (1986). Parietal cortical neurons responding to rotary movement of visual space stimulus in space. Experimental Brain Research, 61, 658–663.

Seltzer, B., & Pandya, D. N. (1978). Afferent cortical connections and architectonics of the superior temporal sulcus and surrounding cortex in the rhesus monkey. Brain Research, 149, 1–24.

Sereno, M. I. (1989). Learning the solution to the aperture problem for pattern motion with a Hebb rule. In D. Touretzky (Ed.), Advances in neural information processing systems, 1 (pp. 468–476). San Mateo, CA: Morgan Kaufmann.

Sereno, M. I., & Sereno, M. E. (1991). Learning to see rotation and dilation with a Hebb rule. In D. Touretzky & R. Lippmann (Eds.), Advances in neural information processing systems, 3 (pp. 320–326). San Mateo, CA: Morgan Kaufmann.

Stringer, S. M., Perry, G., Rolls, E. T., & Proske, J. H. (2006). Learning invariant object recognition in the visual system with continuous transformations. Biological Cybernetics, 94, 128–142.
Invariant Global Motion Recognition
169
Stringer, S. M., & Rolls, E. T. (2000). Position invariant recognition in the visual system with cluttered environments. Neural Networks, 13, 305–315. Stringer, S. M., & Rolls, E. T. (2002). Invariant object recognition in the visual system with novel views of 3D objects. Neural Computation, 14, 2585–2596. Tanaka, K. (1996). Inferotemporal cortex and object vision. Annual Review of Neuroscience, 19, 109–139. Ungerleider, L. G., & Mishkin, M. (1982). Two cortical visual systems. In D. J. Ingle, M. A. Goodale, & R. J. W. Mansfield (Eds.), Analysis of visual behavior (pp. 549–586). Cambridge, MA: MIT Press. Wallis, G., & Bulthoff, H. H. (2001). Effects of temporal assocation on recognition memory. Proceedings of the National Academy of Sciences, 98, 4800–4804. Wallis, G., & Rolls, E. T. (1997). Invariant face and object recognition in the visual system. Progress in Neurobiology, 51, 167–194. Wurtz, R. H., & Kandel, E. R. (2000). Perception of motion depth and form. In E. R. Kandel, J. H. Schwartz, & T. M. Jessell (Eds.), Principles of neural science, 4th ed. (pp. 548–571). New York: McGraw-Hill.
Received January 11, 2006; accepted May 15, 2006.
LETTER
Communicated by Mike Arnold
Recurrent Cerebellar Loops Simplify Adaptive Control of Redundant and Nonlinear Motor Systems
John Porrill
[email protected]
Paul Dean
[email protected]
Centre for Signal Processing in Neuroimaging and Systems Neuroscience, Department of Psychology, University of Sheffield, Sheffield S10 2TP, U.K.
We have described elsewhere an adaptive filter model of cerebellar learning in which the cerebellar microcircuit acts to decorrelate motor commands from their sensory consequences (Dean, Porrill, & Stone, 2002). Learning stability required the cerebellar microcircuit to be embedded in a recurrent loop, and this has been shown to lead to a simple and modular adaptive control architecture when applied to the linearized 3D vestibulo-ocular reflex (Porrill, Dean, & Stone, 2004). Here we investigate the properties of recurrent loop connectivity in the case of redundant and nonlinear motor systems and illustrate them using the example of kinematic control of a simulated two-joint robot arm. We demonstrate that (1) the learning rule does not require unavailable motor error signals or complex neural reference structures to estimate such signals (i.e., it solves the motor error problem) and (2) control of redundant systems is not subject to the nonconvexity problem in which incorrect average motor commands are learned for end-effector positions that can be accessed in more than one arm configuration. These properties suggest a central functional role for the closed cerebellar loops, which have been shown to be ubiquitous in motor systems (e.g., Kelly & Strick, 2003).

1 Introduction

The grace and economy of animal movement suggest that neural control methods are likely to be of interest to robotics. The neural structure particularly associated with coordinated movement is the cerebellum, whose function seems to be the fine-tuning of motor skills by elaborating incomplete or approximate commands issued by higher levels of the motor system (Brindley, 1964). But although the microcircuitry of cerebellar cortex has inspired models for over 30 years (Albus, 1971; Eccles, Ito, & Szentágothai, 1967; Marr, 1969), the results appear to have produced relatively meager benefits for robotics. As Marr himself commented, “In my own case, the cerebellar study . . . disappointed me, because even if the theory was
correct, it did not enlighten one about the motor system—it did not, for example, tell one how to go about programming a mechanical arm” (Marr, 1982, p. 15). Part of the problem is that the control function of any individual region of the cerebellum depends not only on its internal microcircuitry but also on the way it is connected with other parts of the motor system (Lisberger, 1998), and in many cases the details of this connectivity are still not well understood. However, recent anatomical investigations have suggested that there may be a common feature of cerebellar connectivity: a recurrent architecture. It appears as if individual regions of the cerebellum (1) receive a copy of the low-level motor commands that constitute system output and (2) apply their corrections to the high-level command. These “multiple closed-loop circuits represent a fundamental architectural feature of cerebro-cerebellar interactions,” and it is a challenge to future studies to “determine the computations that are supported by this architecture” (Kelly & Strick, 2003). Determining these computations might also throw light on how biological control methods using the cerebellum might be of use to robotics. Our initial attempts to address this issue considered adaptation of the angular vestibular-ocular reflex (aVOR), a classic preparation for studying basic cerebellar function (Boyden, Katoh, & Raymond, 2004; Carpenter, 1988; Ito, 1970). In this reflex, a head-rotation signal from vestibular sensors is used to counter-rotate the eye to maintain stability of the retinal image. Visual processing delays of ∼100 ms mean that movement of the retinal image (a retinal slip) is used as an error signal for calibration of the aVOR rather than for online control, and this calibration requires the floccular region of the cerebellum. In our simulations of the aVOR, the characteristics of the oculomotor plant (eye muscles plus orbital tissue) were altered, and the flocculus (modeled as an adaptive filter; see section 2) was required to learn the appropriate plant compensation. Stable and robust learning was achieved by connecting the filter so that it decorrelated the retinal slip signal from a copy of the motor command sent to the eye muscles (Dean, Porrill, & Stone, 2002; Porrill, Dean, & Stone, 2004) and sent its output to join the vestibular input. This arrangement constitutes an example of the recurrent architecture described above, and ensuring the stability of the decorrelation control algorithm is therefore a candidate for its computational role. Here we extend these findings to derive some theoretical properties of the recurrent cerebellar architecture, which indicate that it may play a central role in simplifying the adaptive control of nonlinear and redundant biological motor systems. These properties will be illustrated by comparison of forward and recurrent architectures in simulated calibration of the inverse kinematic control of a two-degree-of-freedom (2dof) robot arm. Part of this work has been reported in abstract form (Porrill & Dean, 2004).
Figure 1: (a) Schematic representation of the cerebellar microcircuit. MF inputs u_k are analyzed in the granule cell layer (only one MF input is shown; in reality, GCs receive direct input from multiple MFs and indirect input from many more via recurrent connections, not shown here, to PFs), and the GC output signals p_j are distributed along the PFs. The ith PC makes contact with many PFs that drive its simple spike output v_i. The cell also has a CF input e_i, which is assumed in Marr-Albus models to act as a teaching signal for the synaptic weights w_ij. (b) Interpretation of the cerebellar microcircuit as an adaptive filter. Granule cell processing is modeled as a filter p_i = G_i(u_1, u_2, ...), and each PC outputs a weighted sum of its PF inputs. (Figure modified from Dean, Porrill, & Stone, 2004.)
2 The Adaptive Filter Model

We will use what is perhaps the simplest implementation of the Marr-Albus architecture: the adaptive filter (Fujita, 1982). The microcircuit based around a Purkinje cell (PC) and its computational interpretation as an adaptive filter model are shown schematically in Figure 1.
The mossy fibers (MFs) carry input signals (u_1, u_2, ...) that are processed in the granule cell layer to produce the signals (p_1, p_2, ...) carried on the parallel fibers (PFs). This process is interpreted as a massive expansion-recoding of the mossy fiber inputs of the form

    p_j = G_j(u_1, u_2, \ldots).    (2.1)
In the nondynamic problems to be considered here, the G_j are functions of the current values of (u_1, u_2, ...). Dynamic problems can be tackled by allowing the p_j to encode aspects of the past history of the u_i; in that case, the G_j are required to be more general causal functionals such as tapped delay lines. We are interested in the output of a subset of Purkinje cells that take their inputs from common parallel fibers. The output v_i of the ith such PC is modeled as a weighted linear combination of its PF inputs,

    v_i = \sum_j w_{ij} p_j,    (2.2)

where the coefficient w_ij is the synaptic weight of the jth PF on the ith PC. The combined transformation from the vector of inputs u = (u_1, u_2, ...) to the vector of outputs v = (v_1, v_2, ...) can then be written conveniently as a vector equation,

    v = C(u) = \sum_{ij} w_{ij} G_{ij}(u),    (2.3)
where G_{ij}(u) = (0, ..., G_j(u), ..., 0) is the vector with the jth parallel fiber signal as its ith entry and zeros elsewhere. In Marr-Albus models, the climbing fiber input e_i to the Purkinje cell is assumed to act as a teaching signal. The qualitative properties of long-term depression (LTD) and long-term potentiation (LTP) at PF/PC synapses are consistent with the anti-Hebbian heterosynaptic covariance learning rule (Sejnowski, 1977):

    \delta w_{ij} = -\beta \langle e_i p_j \rangle.    (2.4)
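As a minimal concrete illustration of equations 2.2 and 2.4, the sketch below trains a single linear readout with a stochastic, sample-by-sample version of the covariance rule. All names and parameter values are ours, and the teaching signal is simply taken to be the output error of the unit:

```python
import numpy as np

rng = np.random.default_rng(0)
n_pf, beta = 20, 0.05                 # number of PFs and learning rate (ours)

w_target = rng.normal(size=n_pf)      # weights of the filter to be learned
w = np.zeros(n_pf)                    # PC weights, initially zero

for _ in range(2000):
    p = rng.normal(size=n_pf)         # PF rates (relative to tonic rate)
    v = w @ p                         # PC output, equation 2.2
    e = v - w_target @ p              # teaching signal on the climbing fiber
    w -= beta * e * p                 # covariance/LMS rule, equation 2.4

print(np.max(np.abs(w - w_target)))   # ~0: weights converge to the target
```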
The covariance rule 2.4 is identical in form to the least mean square (LMS) learning rule of adaptive control theory (Widrow & Stearns, 1985). Note that in all the above formulas, neural signals are assumed to be coded as firing rates relative to a tonic firing rate and hence can take both positive and negative values.

3 Example Problem: Learning Inverse Kinematics

We will show that recurrent loops can simplify the adaptive control of nonlinear and redundant motor systems. We derive the results for a problem
Figure 2: (a) Geometry of the planar, two-degree-of-freedom robot arm. Motor commands (m1 , m2 ) specify joint angles as shown. The position (x1 , x2 ) of the end effector is specified in Cartesian coordinates in arbitrary units. (b) Example plant compensation problem. The arm controller is initially calibrated for the arm lengths l1 = 1.1, l2 = 1.8 of the gray arm. This arm is shown reaching accurately to a point on the gray polar grid covering the work space. When used with the black arm (lengths l1 = 1, l2 = 2), this approximate controller reaches to points on the distorted (black) grid. The task of the cerebellum is to compensate for this miscalibration; reaching errors (δx1 , δx2 ) are provided in Cartesian coordinates.
presenting both of these difficulties: learning robot inverse kinematics. This problem is of general interest, since it is equivalent to the generic problem of learning right inverse mappings. The theoretical results obtained in the general case will be illustrated throughout by application to the inverse kinematic problem for the 2dof robot arm shown in Figure 2, where the geometry is particularly easy to intuit. Although this is a very simple system, it exhibits strong nonlinearity and a discrete redundancy. We begin by recalling some terminology. The forward kinematics or plant model of a robot arm is the mapping x = P(m) from motor commands m = (m_1, ..., m_M) to the end-effector positions x = (x_1, ..., x_S). An inverse kinematics is a mapping m = Q(x) that calculates the motor commands corresponding to a given end-effector position. Since this implies that x = P(Q(x)), an inverse kinematics is a right inverse P^{-1} for P. For redundant systems, a given position can be reached using more than one choice of motor command; in this case, we will use the notation Q = P^{-1} to denote a particular choice of inverse kinematics. For the 2dof arm, there are only two choices of motor command for a given end-effector position. This is an example of a discrete redundancy. More complex systems can have continuous redundancies in which there is a
Figure 3: Forward architecture. The desired position input xd produces a motor command m = B(xd ) + C(xd ), which is the sum of contributions from a fixed element B and an adaptive cerebellar element C. This input to the plant P produces the output position x. Training the weights of C requires proximal or motor error δm rather than distal or sensory error δx = x − xd . Hence, in this forward architecture, motor error must be estimated by backpropagation via a reference matrix R ≈ ∂P−1 /∂x. This requires detailed prior knowledge of the motor plant.
continuum of motor commands available for each end-effector position. These are sometimes called redundant degrees of freedom. The cerebellum has been described as a “repair shop,” compensating for miscalibration (due to damage, fatigue, or development, for example) of the motor plant (Robinson, 1974). It is this adaptive plant compensation problem that we will model here; that is, we assume that an approximate inverse kinematics controller B ≈ P^{-1} is available to the system and that the function of the cerebellum is to supplement this controller to produce more accurate movements. Learning is supervised by making the reaching error,

    \delta x = x - x_d = P(B(x_d)) - x_d,    (3.1)
available to the learning system, where x_d is the target (desired) position. We call this quantity sensory error since it can be measured by an appropriate sensor and to distinguish it from motor error, which will be defined later. In the 2dof robot arm example shown in Figure 2b, the approximate controller B is the inverse kinematics for a robot with arm lengths that are ±10% in error. Thus, when required to reach to positions on the polar grid shown in Figure 2, the arm actually moves to positions on the overlaid distorted grid.

4 The Forward Learning Architecture

To highlight the properties of recurrent connectivity, we begin by considering the problems encountered in implementing a learning rule for the alternative forward connectivity shown schematically in Figure 3. The motor command to the plant is produced by an open-loop filter that is the sum
B + C of the fixed element B and the adaptive cerebellar component C; this combination will be an inverse kinematics B + C = P^{-1} if C takes the value

    C^* = \sum_{ij} w^*_{ij} G_{ij} = P^{-1} - B.    (4.1)
We assume that the granule cell basis functions G_{ij} satisfy the matching condition, that is, that synaptic weights w^*_{ij} can be found such that the above equation holds for the range of P^{-1} and B under consideration. To obtain a learning rule similar to the covariance rule (see equation 2.4), we introduce the concept of motor error (Gomi & Kawato, 1992). Motor error δm is the error in motor command responsible for the sensory error δx. Minimizing expected square motor error,

    E_m = \frac{1}{2} \langle |\delta m|^2 \rangle = \frac{1}{2} \langle \delta m^t \delta m \rangle,    (4.2)
(where a superscript t denotes the matrix transpose), leads to a simple learning rule because motor error is linearly related to synaptic weight error,

    \delta m = C(x_d) - C^*(x_d) = \sum_{ij} (w_{ij} - w^*_{ij}) G_{ij}(x_d).    (4.3)
Using this expression, the gradient of expected square motor error is

    \frac{\partial E_m}{\partial w_{ij}} = \left\langle \delta m^t \frac{\partial \delta m}{\partial w_{ij}} \right\rangle = \langle \delta m^t G_{ij}(x_d) \rangle = \langle \delta m_i p_j \rangle,    (4.4)
giving the gradient descent learning rule,

    \delta w_{ij} = -\beta \langle \delta m_i p_j \rangle,    (4.5)
where β is a small, positive constant. If we label Purkinje cells such that the ith PC output contributes to the ith component of motor error, then comparison with the covariance learning rule (see equation 2.4) shows that the teaching signal e_i provided on the climbing fiber input to the ith Purkinje cell must be the ith component of motor error,

    e_i = \delta m_i.    (4.6)
This apparently simple prescription is complicated by the fact that motor error is not in itself an observable quantity. It is a derived quantity given by the equation

    \delta m = P^{-1}(x) - P^{-1}(x_d).    (4.7)
This leads to an obvious circularity in that the rule for learning inverse kinematics requires prior knowledge of that same inverse kinematics. This circularity can be circumvented to some extent by supposing that all errors are small, so that

    \delta m \approx \frac{\partial P^{-1}}{\partial x} \delta x,    (4.8)
and then replacing the unknown Jacobian ∂P^{-1}/∂x in this error backpropagation rule by a fixed approximation,

    R \approx \frac{\partial P^{-1}}{\partial x}.    (4.9)
If R were exact, then, with J = ∂P/∂m the forward Jacobian, the product JR would be the identity matrix. To ensure stable learning, the approximate R must estimate motor error correctly up to a strict positive realness (SPR) condition, which in this static case requires that the symmetric part of the matrix JR be positive definite. The hypothetical neural structures required to implement this transformation R and recover motor error from observable sensory error,

    \delta m \approx R \, \delta x \quad \text{or} \quad \delta m_i \approx \sum_k R_{ik} \delta x_k,    (4.10)

have been called reference structures (Gomi & Kawato, 1992), so we will call R the reference matrix.

5 The Motor Error Problem

We refer to the requirement that the climbing fibers carry the unobservable motor error signal rather than the observed sensory error signal as the motor error problem. Although the forward architecture has been applied successfully to a number of real and simulated control tasks (notably in the form of the feedback error learning architecture; Gomi & Kawato, 1992, 1993), we will argue here that for generic biological control systems, the complexity of the neural reference structures it requires makes the forward architecture implausible. It is clear that the complexity of the reference structure is multiplicative in the dimension of the control and sensor space. For a task in which M muscles control the output of N sensors, there are MN entries in the reference matrix R. For example, in our 2dof robot arm problem, four real numbers must be hard-wired to guarantee learning. For more realistic motor tasks in biological systems (such as reaching while preserving balance), values of 100 or more for MN would not be unreasonable.
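The static SPR condition is straightforward to check numerically. In the sketch below, the forward Jacobian J and the candidate reference matrices are made-up 2 × 2 examples, not values from the simulations; the function simply tests whether the symmetric part of JR is positive definite:

```python
import numpy as np

def satisfies_spr(J, R):
    """Static SPR test: is the symmetric part of J @ R positive definite?"""
    S = J @ R
    return bool(np.all(np.linalg.eigvalsh(0.5 * (S + S.T)) > 0))

J = np.array([[1.0, 0.3],
              [0.2, 1.5]])            # hypothetical forward Jacobian dP/dm
R_exact = np.linalg.inv(J)            # exact inverse: JR = I, so SPR holds
R_flipped = -R_exact                  # sign error: SPR fails, learning diverges

print(satisfies_spr(J, R_exact), satisfies_spr(J, R_flipped))  # True False
```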
Figure 4: (a) The dots show the arrangement of RBF centers, and the circle shows the receptive field radius for the forward architecture. This configuration leads to an accurate fit if exact motor error is provided to the learning algorithm (not shown). (b) A snapshot of performance during learning that illustrates the need for multiple reference structures in nonlinear problems. The reference structure R is chosen as the exact Jacobian at the grid point marked with a small circle. Although the learned (black) grid overlays the exact (gray) grid more accurately in the neighborhood of O (compare with Figure 3 bottom), performance has clearly deteriorated over the remainder of the work space. Learning rate β = 0.0005. The effect of reducing the learning rate is to delay but not abolish the divergence. (c) An illustration of the redundancy or nonconvexity problem. The arm controller is set up to consistently choose the configurations shown by the solid and dotted arms when reaching into the top or bottom half of the work space. However, when reaching to points in the shaded horizontal sector, the arm retains the configuration used for the previous target. Hence, arm configuration is chosen effectively randomly in this sector, and the system fails to learn. Learning rate is as above. (Exact motor errors were used in this part of the simulation.)
In fact, this analysis understates the problem, since biological motor systems are often nonlinear, and hence the reference structures are valid only locally. This behavior will be illustrated for the 2dof robot arm calibration problem described above (see Figure 2). Details of the radial basis function (RBF) implementation are given in the appendix. Figure 4b shows a snapshot of performance during learning. In this example, the reference matrix R was chosen to be the exact inverse Jacobian at the point O. Clearly this choice satisfies the SPR condition in a neighborhood of O, and hence in this region, where R provides good estimates of motor error, reaching accuracy initially improves. However, outside this region, the sign of a component of motor error is wrongly estimated, and errors in this component diverge catastrophically. This instability will eventually spread to the point O itself because of overlap between adjacent RBFs. To ensure stable learning in this example would require at least three reference structures valid on three different sectors of the work space.
This requires 3 × 4 = 12 prespecified parameters (not including the extra parameters needed to specify the region of validity of each reference structure). For a general inverse kinematics problem, we must specify MNK parameters, where K is the number of regions required to guarantee positive definiteness of JR in each region. Finally, we note that in the dynamic case, learning must be dynamically stable. For example, in the linear case, the required reference structure is a matrix of transfer functions R(iω), and an SPR condition must be satisfied by the matrix transfer function J(iω)R(iω) at each frequency, further increasing the complexity of the motor error problem.

6 The Redundancy Problem

Most artificial and biological motor systems are redundant; that is, different motor commands can produce the same output. Such redundancy leads to a classic supervised learning problem called the nonconvexity problem: if the training data for the learning system associate multiple motor commands with a given position, and if this set of motor commands is nonconvex, then the system will learn an inappropriate weighted average motor command at that position. The forward architecture shown in Figure 3 is subject to the redundancy problem whenever the controller B is allowed to produce different output commands for the same input. This type of behavior is common in motor tasks; for example, a robot arm configuration might be determined by a combination of convenience and movement history. This type of behavior is illustrated for the 2dof arm in Figure 4c. In this experiment, one arm configuration is used in the top half of the work space and the opposite configuration in the bottom half. However, when reaching into the central sector, the controller reuses the configuration from the previous position (this kind of behavior is common to avoid work space obstacles), and hence the configuration chosen in the central sector is essentially random. While learning succeeds in the top and bottom sectors of the work space, the failure to learn in the central sector is evident from the distorted (black) grid of learned positions. This convexity problem can be avoided by providing auxiliary variables ξ to the learning component such that the combination (x, ξ) unambiguously specifies the motor state of the system (although identifying such variables in practice can be nontrivial). For example, the discrete redundancy found in the 2dof arm requires a discrete variable ξ = ±1 to identify the particular arm configuration to be used. This solution is not particularly satisfactory since it breaks modularity, forcing a controller whose task is simply to reach to a given position to take responsibility for choosing the required arm configuration. More interesting from our point of view, this solution also increases the complexity of the reference structure, since the number K of reference
Figure 5: Recurrent architecture. Here the motor command generated by the fixed element B is the input to the adaptive element C. The output of C is then used as a correction to the input to B. This loop implements Brindley’s (1964) influential suggestion that the cerebellum elaborates commands coming from higher-level structures in the context of information about the current state of the organism. In this recurrent architecture, the sensory error δx becomes effectively proximal to C and, as demonstrated in the text, can be used directly as a teaching signal.
matrices required must be further increased to reflect the dependence on the extra parameters ξ. For example, to learn correctly in the 2dof arm example in Figure 4c, the two different arm configurations clearly require different reference matrices; hence, to allow both configurations over the whole work space increases the number of hard-wired parameters by a factor of 2 to 2 × 12 = 24. Even in this simple example, the complexity of the reference structure required to support learning is beginning to approach that of the structure to be learned. Note that the situation is not helped by adding a conventional error feedback loop to the adaptive forward architecture (as in the feedback error learning model). This loop also requires a motor error signal, and since error is available only in sensory coordinates, different reference structures are required for different arm configurations.

7 The Recurrent Architecture

As we noted in section 1, the forward architecture just described ignores a major feature, the recurrent connectivity, of the circuitry in which the cerebellar microcircuit is embedded. An alternative recurrent architecture reflecting this connectivity is shown schematically in Figure 5. Although the analysis of recurrent networks and their learning rules can be very complex (e.g., Pearlmutter, 1995), this architecture is an important exception; in particular, we will show that no backpropagation step is required in the learning rule. The analysis proceeds in two stages. First, a plausible cerebellar learning rule is derived by the familiar method of gradient descent. This derivation does not provide a rigorous proof of convergence because it requires a small-weight-error approximation. Second, a Lyapunov function
for the learning rule is derived and used to demonstrate convergence of the learning rule without the need for the small-weight-error approximation. We are able to simplify the treatment because we deal only with kinematics. Hence, we can idealize the recurrent loop as an algebraic loop in which the output m of the closed loop shown in Figure 5 satisfies the implicit equation

    m = B(x_d + C(m)).    (7.1)
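Equation 7.1 defines m only implicitly, but (as in the appendix simulations) it can be solved by iterating the loop to a fixed point. A minimal sketch with toy linear stand-ins for B and C (all names and values are illustrative):

```python
import numpy as np

def solve_loop(B, C, x_d, m0, tol=1e-8, max_iter=100):
    """Iterate m <- B(x_d + C(m)) to a fixed point of equation 7.1."""
    m = np.asarray(m0, dtype=float)
    for _ in range(max_iter):
        m_next = B(x_d + C(m))
        if np.max(np.abs(m_next - m)) < tol:
            break
        m = m_next
    return m

# Toy example: B(x) = 0.8 x and C(m) = 0.1 m give m = 0.8 x_d / 0.92.
m = solve_loop(lambda x: 0.8 * x, lambda m: 0.1 * m,
               x_d=np.array([1.0]), m0=np.zeros(1))
print(m)  # [0.8695...]
```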
Clearly control is possible only if the fixed element B has some left inverse B^{-1}. Applying this inverse to equation 7.1, we find that the desired position input x_d is related to the actual motor command m by the equation

    x_d = B^{-1}(m) - C(m).    (7.2)
Again, we assume the matching condition, so that weights exist for which C takes the exact value C^*. For this choice of C, the desired position will equal the actual output position, that is,

    x = P(m) = B^{-1}(m) - C^*(m),    (7.3)
from which we derive the following expression for the desired cerebellar filter:

    C^* = B^{-1} - P.    (7.4)
By subtracting equation 7.2 from equation 7.3, we find that δx = x − x_d = C(m) − C^*(m), and substituting C = \sum_{ij} w_{ij} G_{ij} and C^* = \sum_{ij} w^*_{ij} G_{ij} gives the following simple relationship between sensory error and synaptic weight error:

    \delta x = P(m) - x_d = C(m) - C^*(m) = \sum_{ij} (w_{ij} - w^*_{ij}) G_{ij}(m).    (7.5)
(This is analogous to equation 4.3, relating motor error and synaptic weight error for the forward architecture.) Although this equation is at first sight linear in the weights w_ij, this appearance is misleading, since the argument m also depends implicitly on the w_ij. However, the appearance of linearity is close enough to the truth to allow us to derive a simple learning rule. If weight errors are small, the second term in the derivative,

    \frac{\partial \delta x}{\partial w_{ij}} = G_{ij}(m) + \sum_{kl} (w_{kl} - w^*_{kl}) \frac{\partial G_{kl}(m)}{\partial w_{ij}},    (7.6)
can be neglected to give the approximation

    \frac{\partial \delta x}{\partial w_{ij}} \approx G_{ij}(m).    (7.7)
Using this result, we can derive an approximate gradient-descent learning rule by minimizing expected square sensory error (rather than motor error, as in the previous section). Defining

    E_s = \frac{1}{2} \langle |\delta x|^2 \rangle = \frac{1}{2} \langle \delta x^t \delta x \rangle,    (7.8)
its gradient is

    \frac{\partial E_s}{\partial w_{ij}} = \left\langle \delta x^t \frac{\partial \delta x}{\partial w_{ij}} \right\rangle \approx \langle \delta x^t G_{ij}(m) \rangle = \langle \delta x_i p_j \rangle,    (7.9)
leading to the approximate gradient-descent learning rule

    \delta w_{ij} = -\beta \langle \delta x_i p_j \rangle.    (7.10)
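In code, the contrast with the forward rule 4.5 is a single line: the weight update consumes the raw sensory error directly. A schematic, runnable fragment (all values illustrative):

```python
import numpy as np

beta = 0.05
p = np.array([0.2, 0.7, 0.1])     # PF activity on the current trial
dx = np.array([0.05, -0.02])      # observed sensory (reaching) error
w = np.zeros((2, 3))              # PC weights, one row per output component

# Recurrent architecture, equation 7.10: sensory error is the teacher;
# contrast with the forward rule 4.5, which needs dm_est = R @ dx first.
w -= beta * np.outer(dx, p)
```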
In this learning rule, no Jacobian appears, and hence no reference structures embodying prior knowledge of plant parameters are required. Comparison with the covariance learning rule (see equation 2.4) shows that the teaching signal required on the climbing fibers in the recurrent architecture is now the sensory error:

    e_i = \delta x_i.    (7.11)
Although this local learning rule has been derived as an approximate gradient-descent rule, its properties are more easily determined by a Lyapunov analysis; this analysis is simplified if we work with the continuous update form of the learning rule:

    \dot{w}_{ij} = -\beta \, \delta x_i p_j.    (7.12)
As a Lyapunov function, we use the sum square synaptic weight error,

    V = \frac{1}{2} \sum_{ij} (w_{ij} - w^*_{ij})^2,    (7.13)
which has time derivative

    \dot{V} = \sum_{ij} \dot{w}_{ij} (w_{ij} - w^*_{ij}) = -\beta \sum_i \delta x_i \sum_j (w_{ij} - w^*_{ij}) p_j.    (7.14)
Substituting into this expression the expression for sensory error above (equation 7.5), we find that

    \dot{V} = -\beta |\delta x|^2.    (7.15)
Since its derivative is nonpositive, V is a Lyapunov function, that is, a positive function that decreases over time as learning proceeds. It is unnecessary to appeal to the Lyapunov theorems to determine the behavior of this system sufficiently for practical purposes. The equation above shows that over a fixed period of time, the sum square synaptic weight error decreases by an amount proportional to the mean square sensory error; hence, it is clear that the system can make RMS sensory errors above a certain magnitude only for a limited time, since V would otherwise become negative, which is impossible.

8 The Motor Error Problem Is Solved

It is clear that this architecture solves the motor error problem in that there is no longer a need for unavailable motor error signals on the climbing fibers or for complex reference structures to estimate them. Figure 6a shows the performance of the recurrent architecture for the 2dof arm problem described in section 3 (see the appendix for implementation details). It can be seen that the arm now recalibrates successfully over the whole work space. Figure 6c (top) shows the decrease in position error during training; this decrease is stochastic in nature, as would be expected from the approximate stochastic gradient-descent rule. In contrast, Figure 6c (bottom) shows the monotonic decrease of sum square weight error V during training predicted by the Lyapunov analysis, with the greatest decreases taking place where large errors are made.

9 The Nonconvexity Problem Is Solved

In the recurrent architecture, the nonconvexity problem is easily solved because it does not arise. The adaptive element C takes motor commands as input, and its task is to learn to associate them with a corrective output. Since the motor command completely determines the current configuration, there is no ambiguity to be resolved. Although we illustrate this property below for the discrete redundancy of the 2dof arm, the reasoning above clearly applies to both discrete and continuous redundancies. This property confers remarkable modularity on the recurrent architecture. It means that the connectivity of a cerebellar controller is determined solely by the task and is independent of low-level details such as the particular algorithm chosen for resolving joint angle redundancy or how, for example, reciprocal innervation allocates the tension in antagonistic muscle pairs.
Figure 6: (a) This panel illustrates successful recalibration by the recurrent architecture. After training, the learned (black) grid overlays the exact (gray) grid over the whole work space (compare with the initial performance in Figure 2). Learning rate β = 0.05. (b) The grid shows the region of motor space corresponding to the robot work space. The dots and the circle indicate RBF centers and the receptive field radius. (c) The two graphs illustrate the stochastic decrease in squared position error (top), in the same units of length as Figure 2, and the associated monotonic decrease in sum square synaptic weight error (bottom), as predicted by theory. The behavior at the positions marked by arrows illustrates the fact that faster decrease in sum square weight error is associated with larger position error, as predicted by the Lyapunov equation 7.15.
This property is illustrated in Figure 7a, where the recurrent architecture is applied to the redundant reaching problem described in section 6. It can be seen that learning is now satisfactory over the whole work space. The only modification to the net for the task of Figure 7 is the need for RBF centers covering points in motor space associated with the alternative arm configuration. This grid of RBF centers is shown in Figure 7b. The situation would be only slightly different for a continuous redundancy; in this case, new RBF centers would be needed to cover all points in motor command space accessible by the redundant degrees of freedom.

10 Discussion

As argued in section 1, one reason that cerebellar-inspired models have been of modest use to robotics is that cerebellar connectivity is often poorly understood (Lisberger, 1998). The general idea that identical cerebellar microcircuits can be wired up to perform a wide range of motor and
Figure 7: (a) This panel shows that the recurrent architecture solves the redundant reaching example described in section 6, for which the forward architecture fails (compare with Figure 4). The learned (black) grid overlays the exact (gray) grid accurately over the whole work space, including the horizontal 60 degree sector in which both arm configurations are used. Learning rate β = 0.05. (b) This panel shows the separate grids in motor space associated with the two arm configurations. The grid of dots and the circle indicate the RBF centers and receptive field radius. The dark gray regions highlight motor commands used to generate the arm configurations used consistently in the top and bottom sectors of the work space, and the light gray regions highlight motor commands that generate the two alternative configurations used in the central sector of the work space (the unshaded regions are not used). Since the arm configurations that are ambiguous in task space are represented in separate regions of motor space, they are learned independently in the recurrent architecture.
cognitive functions (Ito, 1984, 1997) is well appreciated, but identifying specific instances has proved difficult. Here we have explored the computational capacities of the cerebellar microcircuit embedded in a recurrent architecture for adaptive feedforward control of nonlinear redundant systems as exemplified by a simulated 2dof robot arm. We have shown that the architecture solves the distal error problem, copes naturally with the redundancy/convexity problem, and gives enhanced modularity. We now compare it with alternative architectures from both computational and biological perspectives.

10.1 Computational Control. The distal error problem arises whenever output errors are used to train internal parameters (Jordan, 1996; Jordan & Wolpert, 2000). It is a fundamental obstacle in neural learning systems, and the consequent lack of biological plausibility of learning rules in neural net supervised learning algorithms has become a cliché. There have been two main previous approaches to solving the distal error problem: (1) continue
to use standard architectures and hypothesize the existence of structures implementing the required error backpropagation schemes, or (2) look for those special architectures in which output errors are themselves sufficient for training. The forward learning architecture (see Figure 3) appears poorly suited for solving the distal error problem. Considerable ingenuity has been expended on the feedback error learning scheme developed by Kawato and coworkers (Gomi & Kawato, 1992, 1993; for a recent rigorous treatment, see Nakanishi & Schaal, 2004) in order to rescue this architecture, but even so, substantial difficulties remain. In feedback error learning, the adaptive component is embedded in a feedback controller so that the estimated motor error δm_est is used as both a training signal and a feedback error term. As we have noted, feedback error learning imposes SPR conditions on the accuracy of the motor error estimate and hence requires complex reference structures for generic redundant and nonlinear systems. There have been theoretical attempts to remove the SPR condition. For example, Miyamura and Kimura (2002) avoid the necessity for SPR at the cost of requiring large gains in the conventional error feedback loop; this is unacceptable in autonomous and biological systems, since one of the primary reasons for using adaptive control is to avoid the destabilizing effect of large feedback gains given inevitable feedback delays. Despite the difficulties we have noted here, feedback error learning has been usefully applied to online learning by autonomous robots in numerous contexts (e.g., Dean, Mayhew, Thacker, & Langdon, 1991; Mayhew, Zheng, & Cornell, 1992). It is clear that feedback error learning remains a useful approach for problems in which simplifying features of the motor plant mean that the reference structures are easily estimated. We have presented a general architecture for tracking control in which output error can be used directly for training. Other architectures of this type in the literature have been designed for specific control problems. For example, the adaptive scheme for control of robot manipulators proposed by Slotine and Li (1989) relies on special features of the problem of controlling joint angles using joint torques. We note also the adaptive schemes for particular single-input–single-output nonlinear systems considered by Patino and Liu (2000) and Nakanishi and Schaal (2004). Although none of these architectures tackles the generic problems of nonlinearity and redundancy considered here, it is interesting to note that they also use recurrent architectures in an essential way, supporting the idea that recurrent connectivity may play a fundamental role in simplifying biological motor control.
10.2 Biological Plausibility. From a biological perspective, the recurrent architecture appears more plausible in the context of plant compensation than the forward architecture, with its requirements for feedback error learning, for three reasons.
First, as pointed out in section 1, anatomical evidence indicates that the recurrent architecture is a feature of many cerebellar microzones. In addition, where it is available, electrophysiological evidence has specifically identified efferent-copy information as part of the mossy fiber input to particular regions of the cerebellum. Important examples are (1) the primate flocculus and ventral paraflocculus, responsible for adaptation of the vestibulo-ocular reflex (VOR), where extensive recordings have shown that about three-quarters of their mossy fiber inputs carry eye-movement-related signals (Miles, Fuller, Braitman, & Dow, 1980); (2) the oculomotor vermis, responsible for saccadic adaptation, where about 25% of mossy fibers show short-latency burst firing in association with saccades that closely resembles the activity of excitatory burst neurons in the paramedian pontine reticular formation (Ohtsuka & Noda, 1992); and (3) regions of cerebellar cortex associated with control of limb movement by the red nucleus, which receive an efferent copy of rubrospinal output to the cerebellum, related to limb position and velocity (Keifer & Houk, 1994). Thus, the defining feature of the recurrent architecture used here appears to be present for all the cerebellar microzones that have been adequately investigated. Second, the recurrent architecture allows use of sensory error signals, that is, the sensory consequences of inaccurate motor commands, which are physically available signals. Previous inability to use such distal error signals has been a fundamental obstacle in neural learning systems, so that architectures such as the one described here, in which distal error can be used directly as a teaching signal, have fundamental importance as basic components of biological learning systems. Hence, if the recurrent architecture were used biologically, we would expect cerebellar climbing fibers to carry sensory information. In contrast, a central consequence of feedback error learning is the identification of climbing fiber signals with motor error. In the words of Gomi and Kawato (1992), “Our view that the climbing fibers carry control error information, the difference between the instructions and the motor act, is common to most cerebellar motor-learning models; however ours is unique in that this error information is represented in motor-command coordinates.” This view runs into both theoretical and empirical problems. From a theoretical point of view, the use of the motor error signal requires not only new and as yet unidentified neural reference structures to recover motor error from observable sensory errors, but also new and as yet unidentified mechanisms to calibrate these structures. Again in the words of Gomi and Kawato (1992), “The most interesting and challenging theoretical problem [raised by FEL] is setting an appropriate inverse reference model in the feedback controller at the spinal and brainstem levels.” From the point of view of experimental evidence, it appears that many climbing fibers are in fact strongly activated by sensory inputs, such as touch, pain, muscle sense, or, in the case of the VOR, retinal slip (e.g., Apps & Garwicz, 2005; De Zeeuw et al., 1998; Simpson, Wylie, & De Zeeuw, 1996).
In certain cases where nonsensory signals have been identified in climbing-fiber discharge (e.g., Andersson, Garwicz, & Hesslow, 1988; Gibson, Horn, & Pong, 2002), their effect seems to be to emphasize the unpredicted sensory consequences of a movement by gating the expected consequences. Such gated signals are still physically available sensory signals. In other instances, for example, retinal slip in the horizontal VOR, it appears that the effect of nonsensory modulation is to produce a two-valued slip signal that conveys information about the direction of image movement but not its speed (Highstein, Porrill, & Dean, 2005; Simpson, Belton, Suh, & Winkelman, 2002). One line of evidence often used to support feedback error learning concerns ocular following, where externally imposed sliplike movements of the retinal image drive compensatory eye movements. It has been shown that the associated complex spike discharge in the flocculus can be predicted from either slip or eye movement signals (Kobayashi et al., 1998). However, given that a later article states that “only the velocity of the retinal error (retinal slip) was statistically sufficient” (Yamamoto, Kobayashi, Takemura, Kawano, & Kawato, 2002, p. 1558) to reproduce the observed variability in climbing fiber discharge, it is not clear that this evidence decisively supports the existence of a motor error signal, even in these special and ambiguous circumstances. Although we have not considered dynamics here, a further advantage of the recurrent architecture is its capacity to generate long-time-constant signals when plant compensation requires integrator-like processes (Dean et al., 2002; Porrill et al., 2004). This desirable feature of the recurrent architecture was pointed out for eye-position control by Galiana and Outerbridge (1984) and has been incorporated in other models of gaze holding (e.g., Glasauer, 2003). It is unclear how forward architectures could be used to achieve the observed performance of the neural integrator (Robinson, 1974).

10.3 Further Developments. Our previous work on plant compensation in the VOR (Dean et al., 2002; Porrill et al., 2004) established the capabilities of the recurrent architecture for dynamic control of linear systems. Here we have extended these results to the kinematic control of redundant and nonlinear systems. It is clearly a priority to extend these results to dynamic nonlinear control problems. It is also important to implement the decorrelation-control scheme in a robot, and this work is currently in progress. We have emphasized the limitations of the feedback-error-learning architecture, but its combination of feedback and feedforward controllers does offer considerable advantages for robust online control. It appears that a possibly similar strategy is used biologically to control gaze stability, combining the optokinetic (feedback) and vestibulo-ocular (feedforward) reflexes (Carpenter, 1988). As we will show elsewhere, the recurrent architecture can also be embedded stably and naturally in a conventional feedback loop.
Finally, although the recurrent architecture does not need forward connections for plant compensation (and so they are not shown in Figure 5), such connections are also ubiquitous for cerebellar microzones. We conjecture that once plant compensation has been achieved, a microzone could in principle use these inputs for a wide range of purposes, including sensory calibration and predictive control.

Appendix: Technical Details

Explicit details of the forward and recurrent algorithms for the robotic application are supplied below. This is a vanilla RBF implementation, since our intention is to concentrate on the nature of the error signal and the learning rule. Both accuracy and learning speed could be greatly improved by optimizing the choice of centers and transforming to an optimal basis of receptive fields. The forward kinematics for the 2dof robot arm with arm lengths (l_1, l_2) is given by

    x_1 = P_1(m_1, m_2) = l_1 \cos m_1 + l_2 \cos(m_1 + m_2)
    x_2 = P_2(m_1, m_2) = l_1 \sin m_1 + l_2 \sin(m_1 + m_2).    (A.1)
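Equation A.1 transcribes directly into code (the function name and default lengths below are ours, matching the black arm of Figure 2):

```python
import numpy as np

def forward_kinematics(m, l1=1.0, l2=2.0):
    """End-effector position, equation A.1, for joint angles m = (m1, m2)."""
    x1 = l1 * np.cos(m[0]) + l2 * np.cos(m[0] + m[1])
    x2 = l1 * np.sin(m[0]) + l2 * np.sin(m[0] + m[1])
    return np.array([x1, x2])

print(forward_kinematics(np.array([0.0, np.pi / 2])))  # [1. 2.]
```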
The brainstem component of the controller is defined as the exact inverse kinematics for a robot with slightly different arm lengths (l'_1, l'_2),

    m_1 = B_1(x_1, x_2) = \tan^{-1}\frac{x_2}{x_1} - \tan^{-1}\frac{l'_2 \sin \xi\theta}{l'_1 + l'_2 \cos \xi\theta}
    m_2 = B_2(x_1, x_2) = \xi\theta,    (A.2)

where

    \theta = \cos^{-1}\frac{x_1^2 + x_2^2 - l'^2_1 - l'^2_2}{2 l'_1 l'_2},    (A.3)

and the choice of ξ = ±1 determines the arm configuration.
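Equations A.2 and A.3 can be sketched as follows; we use arctan2 for the first term so that all quadrants of the work space are handled, and the defaults l1p = 1.1, l2p = 1.8 are the miscalibrated lengths of Figure 2b (function and parameter names are ours):

```python
import numpy as np

def brainstem_controller(x, l1p=1.1, l2p=1.8, xi=1):
    """Approximate inverse kinematics B, equations A.2 and A.3."""
    theta = np.arccos((x[0]**2 + x[1]**2 - l1p**2 - l2p**2) / (2 * l1p * l2p))
    m1 = np.arctan2(x[1], x[0]) - np.arctan2(l2p * np.sin(xi * theta),
                                             l1p + l2p * np.cos(xi * theta))
    return np.array([m1, xi * theta])

# With the true lengths this inverts A.1 exactly; with the miscalibrated
# defaults it reproduces the distorted reaching grid of Figure 2b.
```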
In the forward architecture, the parallel fiber signals are given by gaussian RBFs,

    p_j = G_j(x_1, x_2) = \exp\left(-\frac{1}{2\sigma^2}\left[(x_1 - c_{1j})^2 + (x_2 - c_{2j})^2\right]\right),    (A.4)
with the centers (c_1j, c_2j) chosen on a rectangular grid covering the work space and with σ equal to √2/2 times the maximum grid spacing (see Figure 4). The forward architecture implies the following expression for
the motor commands:

    m_1 = B_1(x_1, x_2) + \sum_j w_{1j} p_j
    m_2 = B_2(x_1, x_2) + \sum_j w_{2j} p_j.    (A.5)
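A sketch of the RBF expansion A.4 and the forward command A.5; the grid range, number of centers, and target below are our illustrative choices:

```python
import numpy as np

def B(x, l1p=1.1, l2p=1.8, xi=1):
    # brainstem controller, equations A.2/A.3 (see the earlier sketch)
    th = np.arccos((x[0]**2 + x[1]**2 - l1p**2 - l2p**2) / (2 * l1p * l2p))
    m1 = np.arctan2(x[1], x[0]) - np.arctan2(l2p * np.sin(xi * th),
                                             l1p + l2p * np.cos(xi * th))
    return np.array([m1, xi * th])

def rbf_features(z, centers, sigma):
    """Gaussian RBF activations, equations A.4/A.9, for a 2D input z."""
    return np.exp(-np.sum((centers - z)**2, axis=1) / (2 * sigma**2))

g = np.linspace(0.5, 2.5, 6)                        # illustrative grid range
centers = np.array([(a, b) for a in g for b in g])
sigma = np.sqrt(2) / 2 * (g[1] - g[0])              # sqrt(2)/2 * grid spacing

W = np.zeros((2, len(centers)))                     # weights start at zero
x_d = np.array([1.0, 2.0])                          # desired position
m = B(x_d) + W @ rbf_features(x_d, centers, sigma)  # equation A.5
print(m)
```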
The unknown weights w_ij were initially set to 0. The two components of sensory error are given by the formula

    \delta x_1 = P_1(m_1, m_2) - x_1
    \delta x_2 = P_2(m_1, m_2) - x_2,    (A.6)

from which the two components of motor error are estimated using a 2 × 2 reference matrix R:

    \begin{pmatrix} \delta m^{est}_1 \\ \delta m^{est}_2 \end{pmatrix} = \begin{pmatrix} R_{11} & R_{12} \\ R_{21} & R_{22} \end{pmatrix} \begin{pmatrix} \delta x_1 \\ \delta x_2 \end{pmatrix}.    (A.7)

RBF weights are updated using the learning rule

    w^{new}_{ij} = w^{old}_{ij} - \beta \, \delta m^{est}_i p_j.    (A.8)
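Putting A.5 through A.8 together, one training step of the forward architecture looks like the self-contained sketch below. Here R is the exact inverse Jacobian at a single work-space point, obtained by finite differences, mirroring the choice used for Figure 4b; everything else (names, grid, target) is our illustration:

```python
import numpy as np

l1, l2 = 1.0, 2.0                    # true arm lengths
l1p, l2p = 1.1, 1.8                  # lengths assumed by the controller B
beta = 0.0005

def P(m):                            # forward kinematics, equation A.1
    return np.array([l1 * np.cos(m[0]) + l2 * np.cos(m[0] + m[1]),
                     l1 * np.sin(m[0]) + l2 * np.sin(m[0] + m[1])])

def B(x, xi=1):                      # inverse kinematics, equations A.2/A.3
    th = np.arccos((x[0]**2 + x[1]**2 - l1p**2 - l2p**2) / (2 * l1p * l2p))
    m1 = np.arctan2(x[1], x[0]) - np.arctan2(l2p * np.sin(xi * th),
                                             l1p + l2p * np.cos(xi * th))
    return np.array([m1, xi * th])

g = np.linspace(0.5, 2.5, 6)
centers = np.array([(a, b) for a in g for b in g])
sigma = np.sqrt(2) / 2 * (g[1] - g[0])

def feats(z):                        # gaussian RBFs, equation A.4
    return np.exp(-np.sum((centers - z)**2, axis=1) / (2 * sigma**2))

x0 = np.array([1.0, 2.0])            # point at which R is calibrated
eps = 1e-6                           # finite-difference step
J = np.column_stack([(P(B(x0) + eps * e) - P(B(x0))) / eps for e in np.eye(2)])
R = np.linalg.inv(J)                 # fixed reference matrix, equation 4.9

W = np.zeros((2, len(centers)))
x_d = np.array([1.2, 1.8])           # one training target
p = feats(x_d)
m = B(x_d) + W @ p                   # equation A.5
dx = P(m) - x_d                      # sensory error, equation A.6
dm_est = R @ dx                      # estimated motor error, equation A.7
W -= beta * np.outer(dm_est, p)      # weight update, equation A.8
```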
In the alternative recurrent architecture, the PF signals are given by RBFs,

    p_j = G_j(m_1, m_2) = \exp\left(-\frac{1}{2\sigma^2}\left[(m_1 - c_{1j})^2 + (m_2 - c_{2j})^2\right]\right),    (A.9)

with centers chosen on a rectangular grid in motor space. The grid was chosen to cover the image of the work space in motor space (see Figure 6b). The recurrent architecture of Figure 5 implies the following equation for the motor commands:

    m_1 = B_1\left(x_1 + \sum_j w_{1j} p_j, \; x_2 + \sum_j w_{2j} p_j\right)
    m_2 = B_2\left(x_1 + \sum_j w_{1j} p_j, \; x_2 + \sum_j w_{2j} p_j\right).    (A.10)

This is an implicit equation for the motor command, since the p_j depend on the m_i via equation A.9. Its solution was obtained at each trial by iterating m^{n+1} = B(x + C(m^n)) to convergence (a relative accuracy of 10^{-4} required at most 10 iterations in the simulations reported here). This off-line iteration is necessary to allow the arm to make discontinuous movements between randomly chosen grid points. If waypoints x^n are closely sampled from a continuous curve, the more natural alternative online procedure m^{n+1} = B(x^n + C(m^n)) can be used.
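The full recurrent scheme (A.9 through A.11) then fits in a few dozen lines. The following self-contained sketch iterates A.10 to its fixed point and trains on random targets; the motor-space grid, sampling ranges, learning rate, and iteration count are our untuned choices, so it illustrates the algorithm rather than reproducing the simulations reported above:

```python
import numpy as np

l1, l2 = 1.0, 2.0                    # true arm lengths
l1p, l2p = 1.1, 1.8                  # lengths assumed by the controller B
beta = 0.05
rng = np.random.default_rng(0)

def P(m):                            # forward kinematics, equation A.1
    return np.array([l1 * np.cos(m[0]) + l2 * np.cos(m[0] + m[1]),
                     l1 * np.sin(m[0]) + l2 * np.sin(m[0] + m[1])])

def B(x, xi=1):                      # inverse kinematics, equations A.2/A.3
    c = (x[0]**2 + x[1]**2 - l1p**2 - l2p**2) / (2 * l1p * l2p)
    th = np.arccos(np.clip(c, -1.0, 1.0))
    m1 = np.arctan2(x[1], x[0]) - np.arctan2(l2p * np.sin(xi * th),
                                             l1p + l2p * np.cos(xi * th))
    return np.array([m1, xi * th])

# RBF grid in motor space, equation A.9 (layout is our choice).
g1 = np.linspace(-2.5, 1.5, 9)
g2 = np.linspace(0.5, 3.0, 9)
centers = np.array([(a, b) for a in g1 for b in g2])
sigma = np.sqrt(2) / 2 * max(g1[1] - g1[0], g2[1] - g2[0])

def feats(m):
    return np.exp(-np.sum((centers - m)**2, axis=1) / (2 * sigma**2))

W = np.zeros((2, len(centers)))

def command(x_d, n_iter=20):         # iterate equation A.10 to a fixed point
    m = B(x_d)
    for _ in range(n_iter):
        m = B(x_d + W @ feats(m))
    return m

for trial in range(5000):            # train on random reachable targets
    r, a = rng.uniform(1.2, 2.8), rng.uniform(0.2, np.pi / 2)
    x_d = r * np.array([np.cos(a), np.sin(a)])
    m = command(x_d)
    dx = P(m) - x_d                  # sensory error, equation A.6
    W -= beta * np.outer(dx, feats(m))   # learning rule, equation A.11

# Reaching error after training (should be small if learning has converged).
print(np.linalg.norm(P(command(np.array([1.0, 2.0]))) - [1.0, 2.0]))
```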
Again, the unknown weights w_ij were initially set to 0. Sensory error obtained from equation A.6 above was used directly in the learning rule

    w^{new}_{ij} = w^{old}_{ij} - \beta \, \delta x_i p_j.    (A.11)
Values of the optimal weights w^*_{ij} are required to calculate the sum square weight errors plotted in Figure 6. These were obtained by direct batch minimization of sum square reaching error calculated over a subsampled grid of points in the work space. We have primarily investigated convergence in the circumstances in which the architecture is believed to operate biologically, that is, in repair-shop mode with changes in plant characteristics of 10% to 15%. However, simulations have indicated that it also converges from initial weights corresponding to grossly degraded performance, although we have not investigated this systematically.

Acknowledgments

This work was supported by EPSRC grant GR/T10602/01 under their Novel Computation Initiative.

References

Albus, J. S. (1971). A theory of cerebellar function. Mathematical Biosciences, 10, 25–61.
Andersson, G., Garwicz, M., & Hesslow, G. (1988). Evidence for a GABA-mediated cerebellar inhibition of the inferior olive in the cat. Experimental Brain Research, 72, 450–456.
Apps, R., & Garwicz, M. (2005). Anatomical and physiological foundations of cerebellar information processing. Nature Reviews Neuroscience, 6(4), 297–311.
Boyden, E. S., Katoh, A., & Raymond, J. L. (2004). Cerebellum-dependent learning: The role of multiple plasticity mechanisms. Annual Review of Neuroscience, 27, 581–609.
Brindley, G. S. (1964). The use made by the cerebellum of the information that it receives from sense organs. International Brain Research Organization Bulletin, 3, 80.
Carpenter, R. H. S. (1988). Movements of the eyes (2nd ed.). London: Pion.
De Zeeuw, C. I., Simpson, J. I., Hoogenraad, C. C., Galjart, N., Koekkoek, S. K. E., & Ruigrok, T. J. H. (1998). Microcircuitry and function of the inferior olive. Trends in Neurosciences, 21(9), 391–400.
Dean, P., Mayhew, J. E. W., Thacker, N., & Langdon, P. M. (1991). Saccade control in a simulated robot camera-head system: Neural net architectures for efficient learning of inverse kinematics. Biological Cybernetics, 66, 27–36.
Dean, P., Porrill, J., & Stone, J. V. (2002). Decorrelation control by the cerebellum achieves oculomotor plant compensation in simulated vestibulo-ocular reflex. Proceedings of the Royal Society of London, Series B, 269(1503), 1895–1904.
Dean, P., Porrill, J., & Stone, J. V. (2004). Visual awareness and the cerebellum: Possible role of decorrelation control. Progress in Brain Research, 144, 61–75.
Eccles, J. C., Ito, M., & Szentágothai, J. (1967). The cerebellum as a neuronal machine. Berlin: Springer-Verlag.
Fujita, M. (1982). Adaptive filter model of the cerebellum. Biological Cybernetics, 45, 195–206.
Galiana, H. L., & Outerbridge, J. S. (1984). A bilateral model for central neural pathways in vestibuloocular reflex. Journal of Neurophysiology, 51(2), 210–241.
Gibson, A. R., Horn, K. M., & Pong, M. (2002). Inhibitory control of olivary discharge. Annals of the New York Academy of Sciences, 978, 219–231.
Glasauer, S. (2003). Cerebellar contribution to saccades and gaze holding—a modeling approach. Annals of the New York Academy of Sciences, 1004, 206–219.
Gomi, H., & Kawato, M. (1992). Adaptive feedback control models of the vestibulocerebellum and spinocerebellum. Biological Cybernetics, 68(2), 105–114.
Gomi, H., & Kawato, M. (1993). Neural network control for a closed-loop system using feedback-error-learning. Neural Networks, 6, 933–946.
Highstein, S. M., Porrill, J., & Dean, P. (2005). Report on a workshop concerning the cerebellum and motor learning. Held in St Louis October 2004. Cerebellum, 4, 1–11.
Ito, M. (1970). Neurophysiological aspects of the cerebellar motor control system. International Journal of Neurology (Montevideo), 7, 162–176.
Ito, M. (1984). The cerebellum and neural control. New York: Raven Press.
Ito, M. (1997). Cerebellar microcomplexes. International Review of Neurobiology, 41, 475–487.
Jordan, M. I. (1996). Computational aspects of motor control and motor learning. In H. Heuer & S. Keele (Eds.), Handbook of perception and action, Vol. 2: Motor skills (pp. 71–120). London: Academic Press.
Jordan, M. I., & Wolpert, D. M. (2000). Computational motor control. In M. S. Gazzaniga (Ed.), The new cognitive neurosciences (2nd ed., pp. 601–618). Cambridge, MA: MIT Press.
Keifer, J., & Houk, J. C. (1994). Motor function of the cerebellorubrospinal system. Physiological Reviews, 74(3), 509–542.
Kelly, R. M., & Strick, P. L. (2003). Cerebellar loops with motor cortex and prefrontal cortex of a nonhuman primate. Journal of Neuroscience, 23(23), 8432–8444.
Kobayashi, Y., Kawano, K., Takemura, A., Inoue, Y., Kitama, T., Gomi, H., & Kawato, M. (1998). Temporal firing patterns of Purkinje cells in the cerebellar ventral paraflocculus during ocular following responses in monkeys II. Complex spikes. Journal of Neurophysiology, 80(2), 832–848.
Lisberger, S. G. (1998). Cerebellar LTD: A molecular mechanism of behavioral learning? Cell, 92(6), 701–704.
Marr, D. (1969). A theory of cerebellar cortex. Journal of Physiology, 202, 437–470.
Marr, D. (1982). Vision: A computational investigation into the human representation and processing of visual information. San Francisco: Freeman.
Mayhew, J. E. W., Zheng, Y., & Cornell, S. (1992). The adaptive control of a four-degrees-of-freedom stereo camera head. Philosophical Transactions of the Royal Society of London, Series B, 337, 315–326.
Miles, F. A., Fuller, J. H., Braitman, D. J., & Dow, B. M. (1980). Long-term adaptive changes in primate vestibuloocular reflex. III. Electrophysiological observations in flocculus of normal monkeys. Journal of Neurophysiology, 43, 1437–1476.
Miyamura, A., & Kimura, H. (2002). Stability of feedback error learning scheme. Systems and Control Letters, 45, 303–316.
Nakanishi, J., & Schaal, S. (2004). Feedback error learning and nonlinear adaptive control. Neural Networks, 17(10), 1453–1465.
Ohtsuka, K., & Noda, H. (1992). Burst discharges of mossy fibers in the oculomotor vermis of macaque monkeys during saccadic eye movements. Neuroscience Research, 15(1–2), 102–114.
Patino, H. D., & Liu, D. (2000). Neural network–based model reference adaptive control system. IEEE Transactions on Systems, Man and Cybernetics—Part B: Cybernetics, 30, 198–203.
Pearlmutter, B. A. (1995). Gradient calculations for dynamic recurrent neural networks. IEEE Transactions on Neural Networks, 6(5), 1212–1228.
Porrill, J., & Dean, P. (2004). Recurrent cerebellar loops simplify adaptive control of redundant and nonlinear motor systems. In 2004 Abstract Viewer/Itinerary Planner (Program No. 989.984). Washington, DC: Society for Neuroscience.
Porrill, J., Dean, P., & Stone, J. V. (2004). Recurrent cerebellar architecture solves the motor error problem. Proceedings of the Royal Society of London, Series B, 271, 789–796.
Robinson, D. A. (1974). The effect of cerebellectomy on the cat’s vestibulo-ocular integrator. Brain Research, 71, 195–207.
Sejnowski, T. J. (1977). Storing covariance with nonlinearly interacting neurons. Journal of Mathematical Biology, 4, 303–321.
Simpson, J. I., Belton, T., Suh, M., & Winkelman, B. (2002). Complex spike activity in the flocculus signals more than the eye can see. Annals of the New York Academy of Sciences, 978, 232–236.
Simpson, J. I., Wylie, D. R., & De Zeeuw, C. I. (1996). On climbing fiber signals and their consequence(s). Behavioral and Brain Sciences, 19(3), 384–398.
Slotine, J. J. E., & Li, W. P. (1989). Composite adaptive-control of robot manipulators. Automatica, 25(4), 509–519.
Widrow, B., & Stearns, S. D. (1985). Adaptive signal processing. Upper Saddle River, NJ: Prentice Hall.
Yamamoto, K., Kobayashi, Y., Takemura, A., Kawano, K., & Kawato, M. (2002). Computational studies on acquisition and adaptation of ocular following responses based on cerebellar synaptic plasticity. Journal of Neurophysiology, 87(3), 1554–1571.
Received August 8, 2005; accepted May 30, 2006.
LETTER
Communicated by Stephen José Hanson
Free-Lunch Learning: Modeling Spontaneous Recovery of Memory
J. V. Stone
[email protected]
Psychology Department, Sheffield University, Sheffield S10 2TP, England
P. E. Jupp [email protected] School of Mathematics and Statistics, St. Andrews University, St. Andrews KY16 9SS, Scotland
After a language has been learned and then forgotten, relearning some words appears to facilitate spontaneous recovery of other words. More generally, relearning partially forgotten associations induces recovery of other associations in humans, an effect we call free-lunch learning (FLL). Using neural network models, we prove that FLL is a necessary consequence of storing associations as distributed representations. Specifically, we prove that (1) FLL becomes increasingly likely as the number of synapses (connection weights) increases, suggesting that FLL contributes to memory in neurophysiological systems, and (2) the magnitude of FLL is greatest if inactive synapses are removed, suggesting a computational role for synaptic pruning in physiological systems. We also demonstrate that FLL is different from generalization effects conventionally associated with neural network models. As FLL is a generic property of distributed representations, it may constitute an important factor in human memory. 1 Introduction A popular aphorism states that “there’s no such thing as a free lunch.” However, in the context of learning theory, we propose that there is. In previous work, free-lunch learning (FLL) has been demonstrated using a task in which participants learned the positions of letters on a nonstandard computer keyboard (Stone, Hunkin, & Hornby, 2001). After a period of forgetting, participants relearned a proportion of these letter positions. Crucially, it was found that this relearning induced recovery of the nonrelearned letter positions. Preliminary results suggest that FLL also occurs using face stimuli. If the brain stores information as distributed representations, then each neuron contributes to the storage of many associations, so that relearning Neural Computation 19, 194–217 (2007)
C 2006 Massachusetts Institute of Technology
Free-Lunch Learning
195
Figure 1: Free-lunch learning protocol. Two subsets of associations A1 and A2 are learned. After partial forgetting (see text), performance error E pre on subset A1 is measured. Subset A2 is then relearned to preforgetting levels of performance, and performance error E post on subset A1 is remeasured. If E post < E pre then FLL has occurred, and the amount of FLL is δ = E pre − E post .
some old and partially forgotten associations affects the integrity of other old associations. Using neural network models, we show that relearning some associations does not disrupt other stored associations but actually restores them. In essence, recovery occurs in neural network models because each association is distributed among all connection weights (synapses) between units (model neurons). After partial forgetting, relearning some of the associations forces all of the weights closer to preforgetting values, resulting in improved performance even on nonrelearned associations. 1.1 The Geometry of Free-Lunch Learning. The protocol used to examine FLL here is as follows (see Figure 1). First, learn a set of n1 + n2 associations A = A1 ∪ A2 consisting of two intermixed subsets A1 and A2 of n1 and n2 associations, respectively. After all learned associations A have been partially forgotten, measure performance on subset A1 . Finally, relearn only subset A2 , and then remeasure performance on subset A1 . FLL occurs if relearning subset A2 improves performance on A1 . Unless stated otherwise, we assume that for a network with n connection weights, n ≥ n1 + n2 . For the present, we assume that the network has one output unit and two input units, which implies n = 2 connection weights and that A1 and A2 each consist of n1 = n2 = 1 association, as in Figure 2. Input units are
196
a
J. Stone and P. Jupp
b
Figure 2: Geometry of free-lunch learning. (a) A network with two input units and one output unit, with connection weights wa and wb , defines a weight vector w = (wa , wb ). The network learns two associations A1 and A2 , where (for example) A1 is the mapping from input vector x1 = (x11 , x12 ) to desired output value d1 ; learning consists of adjusting w until the network output y1 = w · x1 equals d1 . (b) Each association A1 and A2 defines a constraint line L 1 and L 2 , respectively. The intersection of L 1 and L 2 defines a point w0 that satisfies both constraints, so that zero error on A1 and A2 is obtained if w = w0 . After partial forgetting, w is a randomly chosen point w1 on the circle C with radius r , and performance error E pre on A1 is the squared distance p 2 . After relearning A2 , the weight vector w2 is in L 2 , and performance error E post on A1 is q 2 . FLL occurs if δ = E pre − E post > 0, or equivalently if Q = p 2 − q 2 > 0. Relearning A2 has one of three possible effects, depending on the position of w1 on C: (1) if w1 is under the larger (dashed) arc C F L L as shown here, then p 2 > q 2 (δ > 0) and therefore FLL is observed; (2) if w1 is under the smaller (dotted) arc, then p 2 < q 2 (δ < 0), and therefore negative FLL is observed; and (3) if w1 is at the critical point wcrit , then p 2 = q 2 (δ = 0). Given that w1 is a randomly chosen point on C and that the length of C F L L is SF L L , the probability of FLL is P(δ > 0) = SF L L /πr (i.e., the proportion of C F L L under the upper semicircle of C).
connected to the output unit via weights wa and wb , which define a weight vector w = (wa , wb ). Associations A1 and A2 consist of different mappings from the input vectors x1 = (x11 , x12 ) and x2 = (x21 , x22 ) to desired output values d1 and d2 , respectively. If a network is presented with input vectors x1 and x2 , then its output values are y1 = w · x1 = wa x11 + wb x12 and y2 = w · x2 = wa x21 + wb x22 , respectively. Network performance error for k = 2 k associations is defined as E(w, A) = i=1 (di − yi )2 .
Free-Lunch Learning
197
The weight vector w defines a point in the (wa , wb )-plane. For an input vector x1 , there are many different combinations of weight values wa and wb that give the desired output d1 . These combinations lie on a straight line L 1 , because the network output is a linear weighted sum of input values. A corresponding constraint line L 2 exists for A2 . The intersection of L 1 and L 2 therefore defines the only point w0 that satisfies both constraints, so that zero error on A1 and A2 is obtained if and only if w = w0 . Without loss of generality, we define the origin w0 to be the intersection of L 1 and L 2 . We now consider the geometric effect of partial forgetting of both associations, followed by relearning A2 . This geometric account applies to a network with two weights (see Figure 2) and depends on the following observation: if the length of the input vector x1 = 1, then the performance error E(w, A1 ) = (d1 − y1 )2 of a network with weight vector w when tested on association A1 is equal to the squared distance between w and the constraint line L 1 (see appendix C). For example, if w is in L 1 , then E(w, A1 ) = 0, but as the distance between w and L 1 increases, so E(w, A1 ) must increase. For the purposes of this geometric account, we assume that x1 = x2 = 1. Partial forgetting is induced by adding isotropic noise v to the weight vector w = w0 . This effectively moves w to a randomly chosen point w1 = w0 + v on the circle C of radius r = v, where r represents the amount of forgetting. For a network with w = w1 , learning A2 moves w to the nearest point w2 on L 2 (see appendix B), so that w2 is the orthogonal projection of w1 on L 2 . Before relearning A2 , the performance error E pre on A1 is the squared distance p 2 between w1 and its orthogonal projection on L 1 (see appendix C). After relearning A2 , the performance error E post is the squared distance q 2 between w2 and its orthogonal projection on L 1 . The amount of FLL is δ = E pre − E post and, for a network with two weights, is equal to Q = p 2 − q 2 . The probability P(δ > 0) of FLL given L 1 and L 2 is equal to the proportion of points on C for which δ > 0 (or, equivalently, for which Q > 0). For example, averaging over all subsets A1 and A2 , there is the probability P(δ > 0) = 0.68 that relearning A2 induces FLL of A1 (see Figure 5), a probability that increases with the number of weights (see theorem 3). If we drop the assumption that a network has only two input units, then we can consider subsets A1 and A2 with n1 > 1 and n2 > 1 associations. If the number of connection weights n ≥ max(n1 , n2 ), then A1 and A2 define an (n − n1 )-dimensional subspace L 1 and an (n − n2 )-dimensional subspace L 2 , respectively. The intersection of L 1 and L 2 corresponds to weight vectors that generate zero error on A = A1 ∪ A2 . Finally, we can drop the assumption that a network has only one output unit, because the connections to each output unit can be considered as a distinct network, in which case our results can be applied to the network associated with each output unit.
198
J. Stone and P. Jupp
2 Methods Given a network with n input units and one output unit, the set A of associations consisted of k input vectors (x1 , . . . , xk ) and k corresponding desired scalar output values (d1 , . . . , dk ). Each input vector comprises n elements x = (x1 , . . . , xn ). The values of xi and di were chosen from a gaussian distribution with unit variance (i.e., σx2 = σd2 = 1). A network’s output yi is a weighted sum of input values yi = w · xi = kj=1 w j xi j , where xi j is the jth value of the ith input vector xi , and each weight wi is one input-output connection. Given that the network error for a given set of k associations is E(w, A) = k k 2 i=1 (di − yi ) , the derivative ∇ E (w) = 2 i=1 (di − yi )xi of E with respect to w yields the delta learning rule wnew = wold − η∇ E (wold ) , where η is the learning rate, which is adjusted according to the number of weights. A learning trial consists of presenting the k input vectors to the network and then updating the weights using the delta rule. Learning was stopped when ∇ E (w) < k0.001, where ∇ E (w) is the magnitude of the gradient. Initial learning of the k = n associations in A = A1 ∪ A2 was performed by solving a set of n simultaneous equations using a standard method, after which perfect performance on all n associations was obtained. Partial forgetting was induced by adding an isotropic noise vector v with r = v = 1. Relearning the n2 = n/2 associations in A2 was implemented with k = n2 using the delta rule. 3 Results Our four main theorems are summarized here, and proofs are provided in the appendixes. These theorems apply to a network with n weights that learns n1 + n2 associations A = A1 ∪ A2 and, after partial forgetting, relearns the n2 associations in A2 . Theorem 1.
The probability P(δ > 0) of FLL is greater than 0.5.
Theorem 2.
The expected amount of FLL per association in A1 is
E[δ/n1 ] =
n2 E[x2 ]E[v2 ]. n2
(3.1)
For given values of E[x2 ] and E v2 , the value of n2 , which maximizes E[δ/n1 ] (subject to n1 + n2 ≤ n), is n2 = n − n1 . If each input vector x = (x1 , . . . , xn ) is chosen from an isotropic (e.g., isotropic gaussian) distribution and the variance of xi is σx2 , then E x2 = nσx2 . If σx2 is the same for all n, then the state of a neuron (with a typical
Free-Lunch Learning
199
sigmoidal transfer function) would be in a constantly saturated state as the number of synapses increases. One way to prevent this saturation is to assume that the efficacy of synapses on a given neuron decreases as the number of synapses increases. If forgetting is caused primarily by learning spurious inputs, then the delta learning rule used here implies that the “amount of forgetting” v is approximately independent of n. We therefore assume that v and σx2 are constant, and for convenience, we set v = 1 and σx2 = 1. Substituting these values into equation 3.1 yields E[δ/n1 ] =
n2 . n
(3.2)
Using these assumptions, simulations of networks with n = 2 and n = 100 weights agree with equation 3.2, as shown in Figure 3. The role of pruning can be demonstrated as follows. Consider a network with 100 input units and one output unit with n = 100 weights. If n2 = 90 associations are relearned out of an original set of n1 + n2 = 100 associations, then E[δ/n1 ] = n2 /n = 0.90. However, if n = 1000, then E[δ/n1 ] = 0.09. In general, as the number n − (n1 + n2 ) of unpruned redundant weights increases, so E[δ/n1 ] decreases. Therefore, E[δ/n1 ] is maximized if n1 + n2 = n. If n1 + n2 < n, then the expected amount of FLL is not maximal and can therefore be increased by pruning redundant weights until n = n1 + n2 (see Figure 4). Note that for a particular network, performance error E post on A1 after learning A2 can be zero. For example, if w = w∗ in Figure 2, then p = q = 0, which implies that δ/n1 = E post = q 2 = 0. Theorem 3.
The probability P(δ > 0) of FLL of A1 satisfies
P(δ > 0) > 1 −
a 0 (n, n1 , n2 ) + a 1 (n, n2 ) var (x2 )/E[x2 ]2 , n1 n2 (n + 2)2
(3.3)
where a 0 (n, n1 , n2 ) = 2 n1 (n + 2)(n − n2 ) + n(n − n2 ) + n(n + 2)(n − 1) a 1 (n, n2 ) = n2 (2n + n2 + 6).
(3.4) (3.5)
Theorem 3 implies that if the numbers (n1 and n2 ) of associations in A1 and A2 are fixed nonzero proportions of the number n of connection weights 2 (n1 /n and n2 /n, respectively) and var x2 /nE x2 → 0 as n → ∞, then P(δ > 0) → 1 as n → ∞; and the probability that each of the n1 associations in A1 exhibits FLL is P(δ/n1 > 0) = P(δ > 0) because δ > 0 iff δ/n1 > 0. For example, if we assume that each input vector x = (x1 , . . . , xn ) is chosen from an isotropic (e.g., isotropic gaussian) distribution and the
200
J. Stone and P. Jupp
2 variance of xi is σx2 , then var x2 /E x2 = 2/n. This ensures that 2 2 2 var x /nE x → 0 as n → ∞, and therefore that P(δ > 0) → 1 as n → ∞. Using this assumption, an approximation of the right-hand side of equation 3.3 yields P(δ > 0) > 1 −
2(1 + α1 − α1 α2 ) 2(2 + α2 + 6/n) − , nα1 α2 α1 α2 (n + 2)2
(3.6)
where α1 = n1 /n and α2 = n2 /n. In this form, it is easy to see that P(δ > 0) → 1 as n → ∞. We briefly consider the case n1 ≥ n and n2 ≥ n, so that each of L 1 and L 2 is a single point. If the distance D between these points is much less than v, then simple geometry shows that performance error E pre on A1 is large and that relearning A2 reduces this error for any v (i.e., with probability 1) with E post ∝ D2 , even in the absence of initial learning of A1 and A2 (see equation A.18 in appendix A). A similar conclusion is implicit in Atkins and Murre (1998). Theorem 4. If, instead of relearning A2 , the network learns a new subset A3 (drawn from the same distribution as A2 ), then the expected amount of FLL is less than the expected amount of FLL after relearning subset A2 . Learning A3 is analogous to the control condition used with human participants (Stone et al., 2001), and the finding that the amount of recovery after learning A3 is less than the amount of recovery after relearning A2 is predicted by theorem 4.
Figure 3: Distribution of free-lunch learning. (a) Histogram of amount of FLL δ/n1 per association, based on 1000 runs, for a network with n = 2 weights (see section 2). After learning two association subsets (η = 0.1), A1 and A2 , containing n1 = 1 and n2 = 1 associations (respectively), the network has a weight vector w0 . Forgetting is then induced by adding a noise vector v with v2 = 1 to w0 . One association A2 is then relearned, and the change in performance on A1 is measured as δ/n1 (see Figure 2). Negative values indicate that performance on A1 decreases after relearning A2 . (b) Histogram of amount of FLL δ/n1 per association for a network with n = 100 weights and η = 0.005, with A1 and A2 each consisting of n1 = n2 = 50 associations, using the same protocol as in (a). In both (a) and (b), the mean value of δ/n1 is about 0.5, as predicted by equation 3.2. As the number of associations learned increases, the amount of FLL becomes more tightly clustered around δ/n1 = 0.5, as demonstrated in these two histograms, and the probability of FLL increases (also see Figure 5).
Free-Lunch Learning
201
202
J. Stone and P. Jupp
Figure 4: Effect of pruning on free-lunch learning. Graph of the expected amount of FLL per association E[δ/n1 ] as a function of the total number n1 + n2 of learned associations in A = A1 ∪ A2 , as given in equation 3.2. In this example, the number of connection weights is fixed at n = 100, and the number of associations in A = A1 ∪ A2 increases from n1 + n2 = 2 to n1 + n2 = 100. The number n2 of relearned associations in A2 is a constant proportion (0.5) of the associations in A. If n1 + n2 ≤ n, then the network contains n − (n1 + n2 ) unpruned redundant connections. Thus, pruning effectively increases as n1 + n2 increases because, as the number n1 + n2 of associations grows, so the number of unpruned redundant connections decreases. The expected amount of FLL per association E[δ/n1 ] increases as the amount of pruning increases.
4 Discussion Theorems 1 to 4 provide the first proof that relearning induces nontransient recovery, where postrecovery error is potentially zero. This contrasts with the usually small and transient recovery that occurs during the initial phase of relearning forgotten associations (Hinton & Plaut, 1987; Atkins & Murre, 1998), and during learning of new associations (Harvey & Stone, 1996). In particular, theorem 2 is predictive inasmuch as it suggests that the amount of FLL in humans should be (1) proportional to the amount of forgetting of A = A1 ∪ A2 and (2) proportional to the proportion n2 /(n1 + n2 ) of associations relearned after partial forgetting of A. We have assumed that the number n1 + n2 of associations A = A1 ∪ A2 encoded by a given neuron is not greater than the number n of input connections (synapses) to that neuron. Given that each neuron typically has
Free-Lunch Learning
203
Figure 5: Probability of free-lunch learning. The probability P(δ > 0) of FLL of associations A1 as a function of the total number n1 + n2 of learned associations A = A1 ∪ A2 for networks with n = n1 + n2 weights. Each of the two subsets of associations A1 and A2 consists of n1 = n2 = n/2 associations. After learning and then partially forgetting A, performance on A1 was measured. P(δ > 0) is the probability that performance on subset A1 is better after subset A2 has been relearned than it is before A2 has been relearned. Solid line: Empirical estimate of P(δ > 0). Each data point is based on 10,000 runs, where each run uses input vectors chosen from an isotropic gaussian distribution (see section 2). Dashed line: Theoretical lower bound on the probability of FLL, as given by theorems 1 and 3, assuming that input vectors are chosen from an isotropic (e.g., isotropic gausssian) distribution.
many thousands of synapses (e.g., cerebellar Purkinje cells), it seems likely that this assumption is valid. However, the total amount of FLL is maximal if n1 = n2 = n/2, so that the full potential of FLL can be realized only if n1 + n2 = n. This optimum number of synapses can be achieved if inactive (i.e., redundant) synapses are pruned. Pruning may therefore contribute to FLL in physiological systems (Purves & Lichtman, 1980; Goldin, Segal, & Avignone, 2001). We have also assumed that a delta rule is used to learn associations between inputs and desired outputs. This general type of supervised learning is thought to be implemented by the cerebellum and basal ganglia (Doya, 1999). Models of the cerebellum (Dean, Porrill, & Stone, 2002) use a delta rule to implement learning. Similarly, models of the basal ganglia (Nakahara, Itoh, Kawagoe, Takikawa, & Hikosaka, 2004) use a temporally discounted
204
J. Stone and P. Jupp
form of delta rule, the temporal difference rule. This temporal difference rule has also been used to model learning in humans (Seymour et al., 2004), and (under mild conditions) is equivalent to the standard delta rule (Sutton, 1988). Indeed, from a purely computational perspective, it is difficult to conceive how these forms of associative learning could be implemented without some form of delta rule. Our analysis is based on the assumption that the network model is linear. Of course, many nonlinear networks can be approximated by linear networks, but it is possible that the results derived here have limited applicability to certain classes of nonlinear networks. Relation to Task Generalization. It is only natural to ask how FLL relates to tasks that a human might learn. One obvious but vital condition for FLL is that different associations must be encoded by a common set of neuronal connections. Aside from this condition, it might be thought that relearning A2 improves performance on A1 because A1 and A2 are somehow related (as in Hanson & Negishi, 2002; Dienes, Altmann, & Gao, 1999), so that learning A2 generalizes to A1 . This form of task generalization can occur if A1 and A2 are related as follows. If the input-output pairs in A1 and A2 are sampled from a sufficiently smooth function f and n1 n and n2 n, then A1 and A2 are statistically related, and therefore the weights induced by learning A1 are similar to those induced by learning A2 . Consequently, the resultant network input-output functions g1 and g2 (respectively) both approximate the function f (i.e., g1 ≈ g2 ≈ f ). In this case, learning A2 yields good performance on A1 . In the context of FLL, if A1 ∪ A2 is learned, forgotten, and then A2 is relearned, performance on A1 will also improve. However, the reason for this improvement is obvious and trivial: it is simply that A1 and A2 are statistically related and large enough (i.e., with n1 n and n2 n) to induce similar network functions. In contrast, the effect described in this letter does not depend on statistical similarity between A1 and A2 . Crucial assumptions are that n1 + n2 ≤ n, n1 < n, and n2 < n, so that learning the n2 associations in A2 in a network with n weights is underconstrained. This implies that the network function induced by learning A1 has no particular relation to the network function induced by learning A2 , even if A1 and A2 are sampled from the same function f (provided A1 and A2 are disjoint sets). For example, if A1 and A2 each consists of one association sampled from a linear function f (i.e., a line), then learning A2 in a linear network (as in Figure 2a) induces a linear network function g1 (i.e., a line) that intersects with f but is otherwise unconstrained. Thus, learning A2 does not necessarily yield good performance on A1 . The FLL effect reported here depends on relearning after forgetting. To cite an extreme example, if unicycling and learning French were encoded by a common set of neurons, then, after forgetting both, relearning unicycling could improve your French (although the mechanism involved here is unrelated to that described in Harvey & Stone, 1996). Thus, FLL contrasts
Free-Lunch Learning
205
with the task generalization outlined above, where it is obvious that both A1 and A2 induce similar network functions. Motivated by the demonstration that recovery occurs in humans (Stone et al., 2001; Coltheart & Byng, 1989; Weekes & Coltheart, 1996) (but not in all studies—Atkins, 2001), we have proven that FLL occurs in network models. The analysis presented here suggests that FLL is a necessary and generic consequence of storing information in distributed systems rather than a side effect peculiar to a particular class of artificial neural nets. Moreover, the generic nature of FLL suggests that it is largely independent of the type (i.e., artificial or physiological) of network used to learn associations. FLL appears to be a fundamental property of distributed representations. Given the reliance of neuronal systems on distributed representations, FLL may be a ubiquitous feature of learning and memory. It is likely that any organism that did not take advantage of such a fundamental and ubiquitous effect would be at a severe selective disadvantage. Appendix A: Analysis of Free-Lunch Learning We proceed by deriving expressions for E pre , E post , and δ = E pre − E post . We prove that if n1 + n2 ≤ n, then the expected value of δ is positive. We then prove that if n1 + n2 ≤ n, the probability P(δ > 0) of FLL is greater than 0.5, that its lower bound increases with n (if n1 /n and n2 /n are fixed), and that this bound approaches unity as n increases. A.1 Definition of Performance Error. For an artificial neural network (ANN) with weight vector w, we define the performance error for input vectors x1 , . . . , xc and desired outputs d1 , . . . , dc to be E(x1 , . . . , xc ; w, d1 , . . . , dc ) =
c
(w · xi − di )2 .
(A.1)
i=1
By putting X = (x1 , . . . , xc )T , d = (d1 , . . . , dc )T and E(X; w, d) = E(x1 , . . . , xc ; w, d1 , . . . , dc ), we can write equation A.1 succinctly as E(X; w, d) = Xw − d2 .
(A.2)
Given a c × n matrix X and a c-dimensional vector d, let L X,d be the affine subspace, L X,d = w : XT Xw = XT d ,
206
J. Stone and P. Jupp
of Rn . Since i. rk XT X ≤ rk (X), ii. XT Xa = 0 ⇒ aT XT Xa = 0 ⇒ Xa = 0, it follows that rk XT X = rk (X) (where rk denotes the rank of a matrix), and so L X,d is nonempty.
(A.3)
If X and d are consistent (i.e., there is a w such that Xw = d), then L X,d = {w : Xw = d}. A.2 Comparison of Performance Errors. Given weight vectors w1 and w2 , a matrix X of input vectors, and a vector d of desired outputs, define δ(w1 , w2 ; X, d) = E pre − E post , ˜ be any element of where E pre = E(X; w1 , d) and E post = E(X; w2 , d). Let w L X,d . Then δ(w1 , w2 ; X, d) = Xw1 − d2 − Xw2 − d2 = Xw1 2 − Xw2 2 − 2 (w1 − w2 )T XT d ˜ = (w1 − w2 )T XT X (w1 + w2 ) − 2 (w1 − w2 )T XT Xw ˜ . = (w1 − w2 )T XT X (w1 + w2 − 2w)
(A.4)
Suppose given ni × n matrices Xi and ni -dimensional vectors di (for i = 1, 2). Put L i = L Xi ,di
for i = 1, 2.
If Xi has rank ni , then Xi = Ti Zi for unique ni × ni and ni × n matrices Ti and Zi with Ti upper triangular and Zi ZiT = Ini . Note that the matrix ZiT Zi represents the operator that projects onto the image of XiT Xi , and so ZiT Zi XiT Xi = XiT Xi .
(A.5)
Free-Lunch Learning
207
Let w0 be an element of L X,d , where
X1 d1 X= d= , X2 d2 that is,
X1T X1 + X2T X2 w0 = X1T d1 + X2T d2 .
(A.6)
(By equation A.3, such a w0 always exists.) Given v in Rn , put w1 = w0 + v. Let w02 and w2 be the orthogonal projections of w0 and w1 , respectively, onto L 2 . Then X2T X2 w02 = X2T d2
w2 = w02 + In −
Z2T Z2
(A.7) (w1 − w02 ) .
Manipulation gives w1 − w2 = Z2T Z2 (v + w0 − w02 ) ,
(A.8)
and so w1 + w2 − 2w0 = 2In − Z2T Z2 v − Z2T Z2 (w0 − w02 ) .
(A.9)
˜ be any element of L X1 ,d1 . Then equations A.4, A.6, A.7 to A.9, and A.5 Let w yield δ(w1 , w2 ; X1 , d1 ) ˜ = (w1 − w2 )T X1T X1 (w1 + w2 − 2w) = (w1 − w2 )T X1T X1 (w1 + w2 ) − 2 (w1 − w2 )T X1T d1 = (w1 − w2 )T X1T X1 (w1 + w2 − 2w0 ) − 2 (w1 − w2 )T X2T X2 (w0 − w02 ) = (v + w0 − w02 )T Z2T Z2 X1T X1 (w1 + w2 − 2w0 ) − 2 (v + w0 − w02 )T Z2T Z2 X2T X2 (w0 − w02 ) = (v + w0 − w02 )T Z2T Z2 X1T X1 2In − Z2T Z2 v − (v + w0 − w02 )T Z2T Z2 X1T X1 Z2T Z2 (w0 − w02 ) − 2 (v + w0 − w02 )T Z2T Z2 X2T X2 (w0 − w02 )
208
J. Stone and P. Jupp
= vT Z2T Z2 X1T X1 2In − Z2T Z2 v − 2 (w0 − w02 )T Z2T Z2 X1T X1 In − Z2T Z2 − X2T X2 v − (w0 − w02 )T Z2T Z2 2X2T X2 + X1T X1 Z2T Z2 (w0 − w02 ) = vT Z2T Z2 X1T X1 2In − Z2T Z2 v − 2 (w0 − w02 )T Z2T Z2 X1T X1 In − Z2T Z2 − X2T X2 v −(w0 − w02 )T (2X2T X2 + Z2T Z2 X1T X1 Z2T Z2 ) (w0 − w02 ) .
(A.10)
A.3 Moments of Isotropic Distributions. In order to obtain results on the distribution of performance error, it is useful to have some moments of isotropic distributions. Let u be uniformly distributed on Sn−1 , and let A and B be n × n matrices. The formulas for the second and fourth moments of u given in equations 9.6.1 and 9.6.2 of Mardia and Jupp (2000), together with some algebraic manipulation, yield tr (A) E uT Au = n
(A.11)
T
tr (AB) + tr AB + tr (A) tr (B) E uT AuuT Bu = n(n + 2) 2 ntr A + ntr AAT − 2tr (A)2 T . var u Au = n2 (n + 2)
(A.12) (A.13)
Now let x be isotropically distributed on Rn , that is, Ux has the same distribution as x for all orthogonal n × n matrices U. Then writing x = xu with u = 1 and using equations A.11 to A.13 gives E x2 tr (A) E xT Ax = n 4 E x tr (AB) + tr ABT + tr (A) tr (B) T T E x Axx Bx = n(n + 2)
var xT Ax =
(A.14)
E x4 ntr A2 + ntr AAT − 2tr (A)2 n2 (n + 2)
+
var x2 tr (A)2 . n2
(A.15)
Free-Lunch Learning
209
A.4 Distribution of Performance Error. Now suppose that X1 , d1 , X2 , d2 , and v are random and satisfy X1 and v are independent, the distribution of X1 is isotropic,
(A.16)
v has an isotropic distribution, where conditions A.16 mean that UX1 V has the same distribution as X1 for all orthogonal n1 × n1 matrices U and all orthogonal n × n matrices V. Then equation A.10 yields E [δ(w1 , w2 ; X1 , d1 ) |X1 , X2 ] E v2 T = tr X1 X1 Z2T Z2 n − (w0 − w02 )T 2X2T X2 + Z2T Z2 X1T X1 Z2T Z2 (w0 − w02 ) .
(A.17)
Taking expectations over X1 and X2 in equation A.17 gives the following general result on FLL: E[δ(w1 , w2 ; X1 , d1 )] > 0 iff n2 E (w0 − w02 )T 2X2T X2 + Z2T Z2 X1T X1 Z2T Z2 (w0 − w02 ) E[v2 ] > . n1 n2 (A.18) The intuitive interpretation of this result is that if E v2 is large enough, then there is FLL, whereas if P (w0 = w02 ) > 0 then “negative FLL” can occur. In particular, if n1 + n2 ≤ n and P (v = 0) > 0, then there is FLL. A.5 The Case n1 + n2 ≤ n. In this section we assume that X1 , d1 , X2 and d2 are random and that (X1 , d1 ), (X2 , d2 ) and v are independent,
(A.19)
the distribution of v is isotropic.
(A.20)
We suppose also that n1 + n2 ≤ n, and that the distributions of X1 , d1 , X2 , and d2 are continuous. Then, with probability 1, X1 w0 = d1 and X2 w0 = d2 , so that w02 = w0 and equation A.10 reduces to δ(w1 , w2 ; X1 , d1 ) = vT Z2T Z2 X1T X1 2In − Z2T Z2 v.
(A.21)
210
J. Stone and P. Jupp
A.5.1 FLL Is More Probable Than Not. Let w∗1 be the reflection of w1 in L 2 , that is, w∗1 = w2 − (w1 − w2 ) . Consideration of the parallelogram with vertices at w0 , w1 , w∗1 , and w1 + w∗1 − w0 gives 2 X1 (w1 − w0 ) 2 + X1 (w∗1 − w0 ) 2 = X1 [w1 − w0 ] + w∗1 − w0 2 + X1 [w1 − w0 ] − w∗1 − w0 2 = 4 X1 (w2 − w0 ) 2 + X1 (w1 − w2 ) 2 , so that (since d1 = X1 w0 ) δ(w1 , w2 ; X1 , d1 ) + δ(w∗1 , w2 ; X1 , d1 ) = X1 (w1 − w0 ) 2 + X1 (w∗1 − w0 ) 2 − 2X1 (w2 − w0 ) 2 = 2X1 (w1 − w2 ) 2 ≥ 0. Thus if δ(w1 , w2 ; X1 , d1 ) < 0, then δ(w∗1 , w2 ; X1 , d1 ) > 0. If v is distributed isotropically, then w∗1 − w0 is distributed isotropically, so that δ(w∗1 , w2 ; X1 , d1 ) has the same distribution (conditionally on X1 , d1 and X2 ) as δ(w1 , w2 ; X1 , d1 ), and so P(δ(w1 , w2 ; X1 , d1 ) < 0|X1 , d1 , X2 ) ≤ P(δ(w∗1 , w2 ; X1 , d1 ) > 0|X1 , d1 , X2 ) = P(δ(w1 , w2 ; X1 , d1 ) > 0|X1 , d1 , X2 ). (A.22) Further, if v ∈ L 2 \ L 1 , then w2 = w1 = w∗1 , so that δ(w1 , w2 ; X1 , d1 ) = δ(w∗1 , w2 ; X1 , d1 ) > 0. By continuity of δ, there is a neighborhood of v on which δ(w1 , w2 ; X1 , d1 ) > 0 and δ(w∗1 , w2 ; X1 , d1 ) > 0. Thus, if L 2 \ L 1 = ∅, then equation A.22 can be refined to P(δ(w1 , w2 ; X1 , d1 ) < 0|X1 , d1 , X2 ) < P(δ(w∗1 , w2 ; X1 , d1 ) < 0|X1 , d1 , X2 ). (A.23) Since P(L 2 ⊂ L 1 ) = 0 and P(δ(w1 , w2 ; X1 , d1 ) < 0|X1 , d1 , X2 ) is a continuous function of X1 , d1 and X2 , it follows from equation A.23 that P(δ(w1 , w2 ; X1 , d1 ) < 0) < P(δ(w1 , w2 ; X1 , d1 ) > 0), which implies the following result.
Free-Lunch Learning
211
Theorem 1 P(δ(w1 , w2 ; X 1 , d1 ) > 0) > 0.5. This implies that the median of δ(w 1 , w2 ; X 1 , d1 ) is positive. A.5.2 A Lower Bound for P(δ > 0). Our proof depends on Chebyshev’s inequality, which states that for any positive value of t, P(|δ − E[δ]| ≥ t) ≤
var(δ) , t2
where var(δ) denotes the variance of δ. If we set t = E[δ], then (since, by equation A.28, E[δ] > 0) P (δ ≤ 0) ≤
var (δ) E [δ]2
.
(A.24)
This provides a lower bound for the probability of FLL. We prove that this bound approaches unity as n approaches infinity. Now we assume (in addition to conditions A.19 and A.20) that the distributions of X1 and X2 are isotropic.
(A.25)
It follows from equations A.21, A.14, and A.15 that n1 In 2In − Z2T Z2 v E [δ(w1 , w2 ; X1 , d1 ) |Z2 , v ] = vT Z2T Z2 E x2 n n1 2 T T (A.26) = E x v Z2 Z2 v, n where x is the first column of X1T , and var (δ(w1 , w2 ; X1 , d1 ) |Z2 , v )
E x4 (n − 2)Z2 v4 + nZ2 v2 2In − Z2T Z2 v2 = n1 n2 (n + 2) var x2 Z2 v4 + . n2
(A.27)
Since v has an isotropic distribution, equations A.26, A.11, and A.13 imply that E [δ(w1 , w2 ; X1 , d1 ) |Z2 , v ] =
n1 n2 2 E x v2 . n2
(A.28)
212
J. Stone and P. Jupp
Given that there are n1 associations in the subset A1 that is not relearned, equation A.28 implies the following theorem about the expected amount of recovery per association in A1 . Theorem 2 E
n2 δ(w 1 , w 2 ; X 1 , d1 ) 2 2 Z2 , v = n2 E[x ]v . n1
(A.29)
Equations A.26 and A.13 also imply that var (E [δ(w1 , w2 ; X1 , d1 ) |Z2 , v ] |Z2 , v ) n 2 v4 2nn2 − 2n22 1 2 E x = n n2 (n + 2) 2 2n21 n2 (n − n2 )E x2 v4 , = n4 (n + 2)
(A.30)
and it follows from equations A.27 and A.12 that E [var (δ(w1 , w2 ; X1 , d1 ) |Z2 , v ) |Z2 , v ]
n1 v3 E x4 (n − 2)n2 (n2 + 2) + nn2 (2n − n2 + 2) = n(n + 2) n2 (n + 2) var x2 n2 (n2 + 2) + n2 =
n1 n2 v4 4 E x 2(n2 + 2n − n2 − 2) + var x2 (n + 2)(n2 + 2) . 3 2 n (n + 2) (A.31)
Then equations A.30 and A.31 give var (δ(w1 , w2 ; X1 , d1 ) |Z2 , v ) 2 2n21 n2 (n − n2 )E x2 v4 = n4 (n + 2) +
n1 n2 v4 4 E x 2(n2 + 2n − n2 − 2) + var x2 (n + 2)(n2 + 2) n3 (n + 2)2
Free-Lunch Learning
213
n1 n2 v4 {2[n1 (n + 2)(n − n2 ) + n(n − n2 ) + n(n + 2)(n − 1)]E[x2 ]2 n4 (n + 2)2
=
+ n2 (2n + n2 + 6)var(x2 )}, and so var (δ(w1 , w2 ; X1 , d1 ) |Z2 , v ) 2
E [δ(w1 , w2 ; X1 , d1 ) |Z2 , v ]
=
a 0 (n, n1 , n2 ) + a 1 (n, n2 )γ (n) , n1 n2 (n + 2)2
where a 0 (n, n1 , n2 ) = 2{n1 (n + 2)(n − n2 ) + n(n − n2 ) + n(n + 2)(n − 1)} a 1 (n, n2 ) = n2 (2n + n2 + 6) var x2 γ (n) = 2 . E x2 Chebyshev’s inequality implies the following theorem. Theorem 3 P (δ(w1 , w 2 ; X 1 , d1 ) ≤ 0 |Z2 , v ) ≤
a 0 (n, n1 , n2 ) + a 1 (n, n1 , n2 )γ (n) . n1 n2 (n + 2)2
Since the right-hand side does not depend on Z2 or v, this gives the following result. If γ (n)/n → 0 and n1 /n, n2 /n are bounded away from zero as n → ∞, then P (δ(w1 , w2 ; X1 , d1 ) > 0) → 1,
n → ∞.
If x ∼ N 0, σx2 In ,
Example.
then E x2 = nσx2 ,
var x2 = 2nσx4 ,
γ (n) =
2 , n
and so P(δ(w1 , w2 ; X1 , d1 ) > 0) → 1,
n → ∞,
provided that n1 /n and n2 /n are bounded away from zero.
214
J. Stone and P. Jupp
A.5.3 Learning A3 Instead of A2 . Now suppose that relearning of A2 is replaced by learning another subset A3 of n2 associations. Let the matrix X3 and vector d3 be such that the subspace L 3 corresponding to A3 has the form L 3 = L X3 ,d3 . Let w3 and w13 denote the orthogonal projections of w1 onto L 3 and L 1 ∩ L 3 , respectively. Then (A.32) w3 = w13 + In − Z3T Z3 (w1 − w13 ) , and so w1 = w3 + Z3T Z3 (w1 − w13 ) .
(A.33)
˜ = w13 , and equations A.33 and A.32, we have From equation A.4 with w δ(w1 , w3 ; X1 , d1 ) = (w1 − w3 )T X1T X1 (w1 + w3 − 2w13 ) = (w1 − w13 )T Z3T Z3 X1T X1 (w1 + w3 − 2w13 ) ˜ , = (v − ω) ˜ T Z3T Z3 X1T X1 2In − Z3T Z3 (v − ω)
(A.34)
where ω˜ = w13 − w0 . Since X1 w0 = X1 w13 , equation A.34 can be expanded as δ(w1 , w3 ; X1 , d1 )
= vT Z3T Z3 X1T X1 2In − Z3T Z3 v − vT Z3T Z3 X1T X1 2In − Z3T Z3 ω˜ − ω˜ T Z3T Z3 X1T X1 2In − Z3T Z3 v ˜ − ω˜ T Z3T Z3 X1T X1 Z3T Z3 ω,
and so E [δ(w1 , w3 ; X1 , d1 )|X1 , d1 , X2 , d2 , X3 , d3 ] E v2 T = tr Z3 Z3 X1T X1 2In − Z3T Z3 − ω˜ T Z3T Z3 X1T X1 Z3T Z3 ω˜ n E v2 T ˜ 2. = tr X1 X1 Z3T Z3 − X1 Z3T Z3 ω n Now assume that (X1 , d1 ), (X2 , d2 ), (X3 , d3 ) and v are independent, the distributions of X1 , X2 , X3 and v are isotropic.
Free-Lunch Learning
Since
215
E v2 T E v2 T E tr X1 X1 Z2T Z2 = E tr X1 X1 Z3T Z3 n n = E [δ(w1 , w2 ; X1 , d1 )] ,
we have the following theorem. Theorem 4 E[δ(w1 , w 3 ; X 1 , d1 )] ≤ E [δ(w 1 , w2 ; X 1 , d1 )] . Appendix B: Behavior of the Gradient Algorithm If E is regarded as a function of w, then differentiation of equation A.2 shows that the gradient of E at w is ∇ E (w) = 2XT (Xw − d) . Then for any algorithm that takes an initial w(0) to w(1) , w(2) , . . . using steps w(t+1) − w(t) in the direction of ∇ E (w(t) ) , w(t) − w(0) is in the image of XT X, and so is orthogonal to L X,d . It follows that if Xw(t) − d2 → minw Xw − d2 as t → ∞, then w(t) converges to the orthogonal projection of w(0) onto L X,d . Appendix C: The Geometry of Performance Error When n1 = 1 Given associations A1 and A2 , we prove that if n1 = 1 and input vectors have unit length (so that x1 = 1), then the difference δ in performance errors on association A1 of w1 (i.e., after partial forgetting) and w2 (i.e., after relearning A2 ) is equal to the difference Q = p 2 − q 2 . This proof supports the geometric account given in the article and in Figure 2 and does not (in general) apply if n1 > 1. We begin by proving that (if n1 = 1 and x1 = 1) the performance error of an association A1 for an arbitrary weight vector w1 is equal to the squared distance p 2 between w1 and its orthogonal projection w1 onto the affine subspace L 1 corresponding to A1 . If n1 = 1, then L 1 has the form L 1 = {w : w · x1 = d1 } for some x1 and d1 . Given an arbitrary weight vector w1 , we define the performance error on association A1 as equivalent to E(w1 , A1 ) = (w1 · x1 − d1 )2 .
(C.1)
216
J. Stone and P. Jupp
The orthogonal projection w1 of w1 onto L 1 is w1 = w1 +
d1 − w1 · x1 x1 , x1 2
(C.2)
so that d1 = w1 · x1 .
(C.3)
Substituting equation C.3 into C.1 and using C.2 yields E(w1 , A1 ) = w1 − w1 2 x1 2 = p 2 x1 2 .
(C.4)
Now suppose that x1 = 1. Then E(w1 , A1 ) = p 2 , that is, the performance error is equal to the squared distance between the weight vectors w1 and w1 . The same line of reasoning can be applied to prove that E(w2 , A1 ) = q 2 . Thus, the difference δ in performance error on A1 for weight vectors w1 and w2 is δ = E(w1 , A1 ) − E(w2 , A1 ) = p2 − q 2 = Q. Acknowledgments Thanks to S. Isard for substantial help with the analysis presented here; to R. Lister, S. Eglen, P. Parpia, A. Farthing, P. Warren, K. Gurney, N. Hunkin, and two anonymous referees for comments; and J. Porrill for useful discussions. References Atkins, P. (2001). What happens when we relearn part of what we previously knew? Predictions and constraints for models of long-term memory. Psychological Research, 65(3), 202–215.
Free-Lunch Learning
217
Atkins, P., & Murre, J., (1998). Recovery of unrehearsed items in connectionist models. Connection Science, 10(2), 99–119. Coltheart, M., & Byng, S. (1989). A treatment for surface dyslexia. In X. Seron (Ed.), Cognitive approaches in neuropsychological rehabilitation. Mahwah, NJ: Erlbaum. Dean, P., Porrill, J., & Stone, J. V. (2002). Decorrelation control by the cerebellum achieves oculomotor plant compensation in simulated vestibulo-ocular reflex. Proceedings Royal Society (B), 269(1503), 1895–1904. Dienes, Z., Altmann, G., & Gao, S.-J. (1999). Mapping across domains without feedback: A neural network model of transfer of implicit knowledge. Cognitive Science, 23, 53–82. Doya, K. (1999). What are the computations of the cerebellum, the basal ganglia and the cerebral cortex? Neural Networks, 12(7–8), 961–974. Goldin, M., Segal, M., & Avignone, E. (2001). Functional plasticity triggers formation and pruning of dendritic spines in cultured hippocampal networks. J. Neuroscience, 21(1), 186–193. Hanson, S. J., & Negishi, M. (2002). On the emergence of rules in neural networks. Neural Computation, 14, 2245–2268. Harvey, I., & Stone, J.V. (1996). Unicycling helps your French: Spontaneous recovery of associations by learning unrelated tasks. Neural Computation, 8, 697–704. Hinton, G., & Plaut, D. (1987). Using fast weights to deblur old memories. In Proceedings Ninth Annual Conference of the Cognitive Science Society, Seattle WA, 177–186. Mardia, K. V., & Jupp, P. E. (2000). Directional statistics. New York: Wiley. Nakahara, H., Itoh, H., Kawagoe, R., Takikawa, Y., & Hikosaka, O. (2004). Dopamine neurons can represent context-dependent prediction error. Neuron, 41(2), 269–280. Purves D., & Lichtman, J. (1980). Elimination of synapses in the developing nervous system. Science, 210, 153–157. Seymour, B., O’Doherty, J. P., Dayan, P., Koltzenburg, M., Jones, A. K., Dolan, R. J., Friston, K. J., & Frackowiak, R. (2004). Temporal difference models describe higher order learning in humans. Nature, 429, 664–667. Stone, J. V., Hunkin, N. M., & Hornby, A. (2001). Predicting spontaneous recovery of memory. Nature, 414, 167–168. Sutton, R. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3, 9–44. Weekes, B., & Coltheart, M. (1996). Surface dyslexia and surface dysgraphia: Treatment studies and their theoretical implications. Cognitive Neuropsychology, 13, 277–315.
Received August 1, 2005; accepted May 19, 2006.
LETTER
Communicated by Mark Girolami
Linear Multilayer ICA Generating Hierarchical Edge Detectors Yoshitatsu Matsuda [email protected]
Kazunori Yamaguchi [email protected] Kazunori Yamaguchi Laboratory, Department of General Systems Studies, Graduate School of Arts and Sciences, University of Tokyo, Tokyo, Japan 153-8902
In this letter, a new ICA algorithm, linear multilayer ICA (LMICA), is proposed. There are two phases in each layer of LMICA. One is the mapping phase, where a two-dimensional mapping is formed by moving more highly correlated (nonindependent) signals closer with the stochastic multidimensional scaling network. Another is the local-ICA phase, where each neighbor (namely, highly correlated) pair of signals in the mapping is separated by MaxKurt algorithm. Because in LMICA only a small number of highly correlated pairs have to be separated, it can extract edge detectors efficiently from natural scenes. We conducted numerical experiments and verified that LMICA generates hierarchical edge detectors from large-size natural scenes. 1 Introduction Independent component analysis (ICA) is a recently developed method in the fields of signal processing and artificial neural networks, and it has been shown to be quite useful for the blind separation problem (Jutten & H´erault, 1991; Comon, 1994; Bell & Sejnowski, 1995; Cardoso & Laheld, 1996). The linear ICA is formalized as follows. Let s and A be N-dimensional source signals and an N × N mixing matrix. Then the observed signals x are defined as x = As.
(1.1)
The problem to solve is to find out A (or the inverse, W) when only the observed (mixed) signals are given. In other words, ICA blindly extracts the source signals from M samples of the observed signals as follows: Sˆ = WX,
(1.2)
where X is an N × M matrix of the observed signals and Sˆ is the estimate of the source signals. Although this is a typical ill-conditioned problem, ICA Neural Computation 19, 218–230 (2007)
C 2006 Massachusetts Institute of Technology
Linear Multilayer ICA Generating Hierarchical Edge Detectors
219
can solve it if the source signals are generated according to independent and nongaussian probability distributions. Concretely speaking, ICA algorithms find out W by minimizing a criterion (called the contrast function) ˆ that is defined by higher-order statistics (e.g., kurtosis) of components of S, and many methods have been proposed for the minimization, for example, fast ICA (Hyv¨arinen, 1999) and the relative gradient algorithm (Cardoso & Laheld, 1996). Because all of these previous algorithms try to estimate all the N2 components of W rigorously, their time complexity is O(N2 ), which is intractable for large N. For actual data such as natural scenes, W is given not randomly but according to some underlying structure in the data. LMICA (linear multilayer ICA), which we proposed previously (Matsuda & Yamaguchi, 2003, 2004, 2005b), utilizes such structure for estimating W. By gradually improving an estimate of W, it could finally find out a fairly good estimate of W quite efficiently. This letter extends this work with improvement of the algorithm, using the two-dimensional stochastic multidimensional scaling (MDS) network, and exhaustive numerical experiments (generating hierarchical edge detectors). This letter is organized as follows. In section 2, the algorithm is described. In section 3, numerical experiments verify that LMICA can extract edge detectors efficiently from natural scenes, and it can generate hierarchical edge detectors from large-size scenes. This letter is concluded in section 4. 2 Algorithm 2.1 Basic Idea. It is assumed that X is whitened initially. LMICA tries to extract the independent components approximately by repetition of the following two phases: the mapping phase, which brings more highly correlated signals nearer, and the local-ICA phase, where each neighbor pair of signals in the mapping is separated by the MaxKurt algorithm (Cardoso, 1999). Intuitively, the mapping phase finds only a few pairs that are more “important” for estimating W, and the local-ICA phase optimizes only the “important” pairs. The mechanism of LMICA is illustrated in Figure 1. Note that this illustration shows the ideal case where N signals are separated in only O (log N) layers. Although this does not hold for an arbitrary W, it will be shown in section 3 that natural scenes can be separated quite effectively by this method with a two-dimensional mapping. 2.2 Mapping Phase. In the mapping phase, current signals X are arranged 2 2in a two-dimensional array so that pairs of signals taking higher k xik x jk are placed nearer. In order to efficiently calculate such an array, the stochastic MDS network in Matsuda and Yamaguchi (2005a) is used. Its main procedure is the repetition of the calculation of the center of gravity and the slide of each signal toward the center in proportion to its value.
220
Y. Matsuda and K. Yamaguchi
1
2
3
4
5
6
7
8 local-ICA
1-2
1-2
3-4
3-4
5-6
5-6
7-8
7-8 mapping
1-2
3-4
1-2
3-4
5-6
7-8
5-6
7-8 local-ICA
1-4
1-4
1-4
1-4
5-8
5-8
5-8
5-8
1-4
5-8
1-4
5-8
1-4
5-8
1-4
5-8
mapping
local-ICA 1-8
1-8
1-8
1-8
1-8
1-8
1-8
1-8
Figure 1: LMICA (the ideal case). Each number from 1 to 8 means a source signal. In the first local-ICA phase, each neighbor pair of the completely mixed signals (denoted 1-8) is partially separated into 1-4 and 5-8. Next, the mapping phase rearranges the partially separated signals so that more highly correlated signals are nearer. In consequence, the four 1-4 signals (similarly, 5-8 ones) are brought nearer. Then the local-ICA phase partially separates the pairs of neighbor signals into 1-2, 3-4, 5-6, and 7-8. By repetition of these two phases, LMICA can extract all the sources quite efficiently.
The original stochastic MDS network is given as follows (see Matsuda & Yamaguchi, 2005a, for details): 1. Place given signals Z = (zik ) on a two-dimensional array randomly, where each signal i (the ith row of Z) corresponds to a component of the array through a one-to-one correspondence. The x- and ycoordinates of a signal i on the array are denoted as mi1 and mi2 . Note that mi1 and mi2 take discrete values. 2. Pick up a column of Z randomly, whose index is denoted as p. The column can be regarded as a randomly selected sample p of signals. 3. Calculate the center of gravity for the sample p by g
ml (Z, p) = where l = 1 or 2.
i zi p mil , i zi p
(2.1)
Linear Multilayer ICA Generating Hierarchical Edge Detectors STEP 1 5
STEP 2
2 23 17 7
5
STEP 3
2 23 17
7
2 23 17
STEP 4 7
5
2 23 17 12
8 20 15 10 21
8 20 10 12 21
8 20 10
10 15 6 11 3
10 15 6 11
3
8 20 10 12 21 10 15 6 11 3
10 7
25 4 24
25 4 24
9
25 4
25 4 24 1
1
9
13 14 16 22 19
7 unit to be updated new coordinates of 7 offset vector of 7 15 destination unit
2 21
5
221
1
13 14 16 22 19
24 1
9
6 11 3 9
13 14 16 22 19
13 14 16 22 19
unit to be shifted
new offset vector of 7
start and end points medium points 10 12 medium units
Figure 2: Discretized update on the stochastic MDS network (excerpted from Matsuda & Yamaguchi, 2005a, pp. 289). It illustrates each step of the discretized update rule. Each small square represents a signal on the discrete two-dimensional array (denoted a unit in this figure). In the first step, the destination is calculated by adding the current offset vector to the new coordinates. Second, the signals on the route are detected. Third, the signal to be updated is moved by exchanging the signals on the route one by one. Finally, the fraction under the discretization is preserved as the offset vector.
4. Calculate the new coordinate mil for each signal i by g mil = mil − λzi p mil − ml ,
(2.2)
where λ is the step size. 5. Update the coordinates of each signal on the array approximately according to equation 2.2 under the constraints that every coordinate mil has to be on the discrete two-dimensional array. Such discretized updates are conducted by giving an offset vector (which preserves the rounded off fraction under the discretization) to each signal and exchanging the signals one by one on the array. The mechanism is roughly illustrated in Figure 2. The details are omitted in this letter. 6. Terminate the process if a termination condition is satisfied. Otherwise, return to step 2. It can be shown that this process approximately minimizes an error function, C=
(zik z jk ) (mil − m jl )2 , i, j
k
l
(2.3)
222
Y. Matsuda and K. Yamaguchi
under the conditions that the location of each signal (mil , mi2 ) is bound to a unit in the discrete two-dimensional array through a one-to-one correspondence. Because the minimization of C makes more highly correlated pairs of signals (taking larger k (zik z jk )) be closer (smaller l (mil − m jl )2 ), it is shown that this process generates a topographic mapping where highly correlated signals are placed near each other. Some modifications are needed in order to make the original stochastic MDS network suitable for LMICA:
r r
r
zi p is given as xi2p − 1 where p is selected randomly from 1 to M at each update. Because the convergence of the original network is too slow and tends to drop into a local minimum if the signals Z take continuous values, the following elaboration is utilized. The signals are classified into two groups for each i: the positive group σ + consisting of signals satisfying zi p > 0 and the negative one σ − of zi p < 0. Then equations 2.1 and 2.2 are calculated for each group. Note that each signal of negative group is moved away from the center of gravity. Numerical experiments (omitted in this article) showed that this improvement gave more accurate results more efficiently than the original network. Two learning stages are often used in order to make local neighbor relations more accurate if there are many signals. The first is the usual stage for the whole array. The second is the locally tuning stage where the array is divided into some small areas and the algorithm is applied to each area.
The total procedure of the mapping phase for given X, W, A, and a given two-dimensional array is described as follows: Mapping Phase 1. Allocate each signal i to a component of the two-dimensional array by a randomly selected one-to-one correspondence. 2. Repeat the following steps over T times with a decreasing step size λ: a. Select p randomly from {1, . . . , M}, and let zi p = xi2p − 1 for each i. b. Calculate the two centers of gravity: g+ g− i∈σ + zi p mil i∈σ − zi p mil ml (Z, p) = and ml (Z, p) = . z + i p i∈σ i∈σ − zi p (2.4) c. Update eachmil by g+ mil − λzi p mil − ml zi p > 0 mil := (2.5) g− mil − λzi p mil − ml zi p < 0 under the constraints that every mil is on the discrete array.
Linear Multilayer ICA Generating Hierarchical Edge Detectors
223
3. Divide the array into small areas, and apply the above process to each area if there are many signals. 4. Rearrange the rows of X and W and the columns of A according to the generated array. 2.3 Local-ICA Phase. In the local-ICA phase, the following contrast function φ(X) (the minus sum of kurtoses) is used (which is the same one in MaxKurt algorithm; Cardoso, 1999): 4 i,k xik φ (X) = − −3 . (2.6) M φ(X) is minimized by “rotating” nearest-neighbor pairs of signals on the array. For each nearest-neighbor pair (i, j), a rotation matrix R(θ ) is given as cos θ sin θ R(θ ) = . (2.7) − sin θ cos θ Then the optimal angle θˆ is given as 4 4 θˆ = argminθ − xik , + x jk
(2.8)
k where xik = cos θ · xik + sin θ · x jk and x jk = − sin θ · xik + cos θ · x jk . After some tedious transformation of this equation (see Cardoso, 1999), it is shown that θˆ is determined analytically by the following equations:
sin 4θˆ =
αi j αi2j
+
βi2j
and cos 4θˆ =
where αi j =
3 x jk − xik x 3jk and βi j = xik
k
βi j αi2j
+ βi2j
k
,
(2.9)
4 2 2 xik + x 4jk − 6xik x jk
4
.
(2.10)
Now, the procedure of the local-ICA phase for given X, W, A, and an array is described as follows: Local-ICA Phase 1. For every nearest-neighbor pair of signals on the array, (i, j), a. Find out the optimal angle θˆ by equation 2.9. b. Rotate the corresponding parts of X, W, and A.
224
Y. Matsuda and K. Yamaguchi
2.4 Complete Algorithm. The complete algorithm of LMICA for any given observed signals X and array fixing the shape of the mapping is given by repeating the mapping phase and the local-ICA phase alternately: Linear Multilayer ICA Algorithm 1. Initial settings: Let X be a whitened observed signal and W and A be the N × N identity matrix. 2. Repetition: Do the following two phases alternately over L times. a. Do the mapping phase in section 2.2 b. Do the local-ICA phase in section 2.3. 2.5 Some Remarks 2.5.1 Relation to MaxKurt Algorithm. Equation 2.9 is the same the as MaxKurt algorithm (Cardoso, 1999). The crucial difference between LMICA and MaxKurt is that LMICA optimizes just the nearest-neighbor pairs instead of all the N(N−1) ones in MaxKurt. In LMICA, the pairs with higher 22 2 costs (higher k xik x jk ) are brought nearer in the mapping phase. Approximations of independent components can be extracted effectively by optimizing just the neighbor pairs. 2.5.2 Prewhitening. Although LMICA is applicable to any prewhitened signals, the selection of the whitening method is actually crucial for its performance. It is shown in section 3.1 that ZCA (zero-phase component analysis) is more suitable than principal component analysis (PCA) if natural scenes are given as the observed signals. The ZCA filter is given as X := − 2 X, 1
(2.11)
where is the covariance matrix of X. It has been known that the ZCA filter whitens the given signals with preserving the spatial relationship if natural scenes are given as X (Li & Atick, 1994; Bell & Sejnowski, 1997). 3 Results It is well known that various local edge detectors can be extracted from natural scenes by the standard ICA algorithm (Bell & Sejnowski, 1997; van Hateren & van der Schaaf, 1998). Here, LMICA was applied to the same problem. 3.1 Small Natural Scenes. Thirty thousand samples of natural scenes of 12 × 12 pixels were given as the observed signals X. That is, N and M were 144 and 30,000. Original natural scenes were downloaded from http://www.cis.hut.fi/projects/ica/data/images/. X is then whitened by
Linear Multilayer ICA Generating Hierarchical Edge Detectors
225
Decrease of Contrast Function LMICA (a) original topography (b) MaxKurt (c) LMICA for PCA (d)
minus kurtosis
-4 -5 -6 -7 -8 0
3000
6000 9000 12000 times of optimizations for pairs
15000
Figure 3: Decreasing curves of the contrast function φ along the times of optimizations for pairs of signals. (a) The standard LMICA. (b) LMICA using the identity mapping in the mapping phase, which preserves the original topography for the ZCA filter. (c) MaxKurt. (d) LMICA for the PCA-whitened signals.
ZCA. In the mapping phase, a 12 × 12 array was used, where the learning 100 length T was 100,000 with the step size λ = 100+t (t is the current time, which is the number of updates to the point). The calculation time for 400 layers of LMICA with Intel 2.8 GHz CPU was about 40 minutes, about 60% of which was for the mapping phase. It shows that the mapping phase is so efficient that its computational costs are not a bottleneck in LMICA. Figure 3 shows the decreasing curves of the contrast function φ along the times of optimizations (rotations) for pairs of signals, where the values of φ are averaged over 10 trials for independently sampled Xs. For comparison, the following four experiments were done: (1) LMICA; (2) LMICA using not the optimized rearrangement but the simple identity mapping in the mapping phase, which preserves the original 12 × 12 topography of scenes for the ZCA filter; (3) MaxKurt, where all the pairs are optimized one by one in random order; and (4) LMICA for the PCA-whitened observed signals; Note that one layer of LMICA without the mapping phase and one iteration of MaxKurt are equivalent to 12 × 11 × 2 = 264 and 144×143 = 10,296 times 2 of pair optimizations, respectively. In Figure 3, the standard LMICA gives the best solution everywhere except only a few first optimizations and the late ones. Though LMICA using the original topography is the best within about the first 1000 optimizations, it rapidly converged to local minima. On the other hand, MaxKurt and LMICA for PCA became superior to the others only after more than 10,000 optimizations.
Figure 4: Edge detectors from natural scenes of 12 × 12 pixels after 5280 optimizations. Each shows 144 edge detectors of 12 × 12 pixels from A. (a) LMICA. (b) LMICA using the original topography (namely, the identity mapping). (c) MaxKurt. (d) Fast ICA (g(u) = tanh(u)).
Figures 4a to 4c show the edge detectors after 5280 optimizations (equivalent to the twentieth layer of LMICA) by LMICA, LMICA using the original topography, and MaxKurt, respectively. For comparison, Figure 4d is the result of fast ICA with the widely used nonlinear function g(u) = tanh(u). Figures 4a and 4d show that LMICA could quite rapidly generate edge detectors similar to those in fast ICA. It is especially surprising that the number of optimizations in LMICA (5280) is about half of the number of degrees of freedom of the mixing matrix A (10,296). This suggests that LMICA gives an
Figure 5: Representative edge detectors from large natural scenes at each layer: (a) at the 0th layer (ZCA), (b) at the 10th layer, (c) at the 50th layer, and (d) at the 300th layer. They show 20 representative edge detectors of A from scenes of 64 × 64 pixels at each layer.
effective model for the ICA processing of natural scenes. There are no clear edges in Figures 4b and 4c. Each detector in these figures is extremely localized and has no (or only a very weak) orientation preference. This shows that the mapping phase of LMICA plays a crucial role in the rapid formation of edge detectors.

3.2 Large Natural Scenes. Here, 100,000 samples of natural scenes of 64 × 64 pixels were given as X. Fast ICA, MaxKurt, and other well-known ICA algorithms are not applicable to such a large-scale problem because they require huge computations. For example, fast ICA based on kurtosis spent about 45 minutes processing the small images (30,000 samples of 12 × 12 pixels). Under the rough assumption that the calculation time is proportional to the number of samples and the number of parameters to be estimated, it would require about 2000 hours to process the large images. This estimate is rather optimistic because it assumes that the number of updates for
Figure 6: Histograms of edge lengths at each layer of LMICA for large natural scenes: (a) at the 0th layer, (b) at the 10th layer, (c) at the 50th layer, and (d) at the 300th layer (horizontal axis: length of edges; vertical axis: frequency). The lengths were calculated as the full width at half maximum (FWHM) of gaussian approximations of the Hilbert transform of W, by a method similar to that of van Hateren and van der Schaaf (1998).
convergence is constant regardless of the number of parameters. In the mapping phase, a 64 × 64 array was used for T = 1,500,000; it was then divided into 16 arrays of 16 × 16 components, and each array was optimized for T = 500,000. LMICA was carried out over L = 300 layers and consumed about 170 hours on an Intel 2.8 GHz CPU. Figure 5 shows some representative edge detectors at the 0th (ZCA filter), 10th, 50th, and 300th layers. The histograms of the lengths of edges at each layer are shown in Figure 6. At the 0th layer (ZCA), there are only pixel detectors of length zero (see Figures 5a and 6a). Many short edge detectors were generated after just 10 layers (see Figures 5b and 6b). At the 50th layer, the edges were longer (see Figures 5c and 6c). At the final, 300th layer (see Figures 5d and 6d), there are some long edges. In addition, it is interesting that some "compound" detectors were observed, where multiple edges seem to be included in a single detector.
4 Conclusion

In this letter, we proposed linear multilayer ICA (LMICA). We carried out numerical experiments on natural scenes, which verified that LMICA can extract edge detectors quite efficiently. We also showed that LMICA can generate hierarchical edge detectors from large natural scenes, where short edges exist in lower layers and longer edges in higher ones.

Although some multilayer models have been employed in ICA (e.g., Hyvärinen & Hoyer, 2000, and Hoyer & Hyvärinen, 2002), the purpose of and rationale for them are quite different from those of LMICA. Their multilayer networks have been proposed for constructing more powerful models of ICA, which include nonlinear connections and allow some dependencies between sources. On the other hand, LMICA is constructed only for the efficient calculation of the usual linear ICA. Nevertheless, it seems interesting that the structures of their multilayer models (locally connected multilayer networks) are quite similar to that of LMICA. It may be promising to extend LMICA with some nonlinearity.

We are planning to apply LMICA to applications in image processing, such as image compression and digital watermarking, and to utilize it for other large-scale problems such as text mining. In addition, we are trying to explore a faster method for the mapping phase; some batch-learning techniques may be promising.

We also note the known fact that edge detectors are not always formed at the maximum of the kurtoses; some different nonlinearity (e.g., tanh) is often needed, so the choice of nonlinearity in the contrast function is quite important and sensitive in the usual ICA models, such as the InfoMax model in Bell and Sejnowski (1997). On the other hand, LMICA can generate hierarchical edge detectors by gradually increasing simple kurtoses. This suggests that LMICA may be able to extract edge detectors by using more general contrast functions, but further experiments will be needed to test this hypothesis.

Finally, LMICA with kurtoses is expected to be as sensitive to outliers as other cumulant-based ICA algorithms are, so different contrast functions would be needed for noisy signals. In addition, LMICA is applicable only if the number of sources is the same as that of the observed signals, because it utilizes the ZCA filter. In order to apply LMICA to undercomplete cases, a new whitening method that can decrease the number of signals while preserving the topography of images would have to be developed. It is more difficult for LMICA to deal with overcomplete cases, because it is based on the simplest contrast function without any generative model of sources. Nevertheless, it seems interesting that LMICA could generate many edge detectors with different lengths at every layer. These detectors appear to form overcomplete bases, though without any theoretical foundation so far. Further analysis of these bases may be promising.
References

Bell, A. J., & Sejnowski, T. J. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7, 1129–1159.
Bell, A. J., & Sejnowski, T. J. (1997). The "independent components" of natural scenes are edge filters. Vision Research, 37(23), 3327–3338.
Cardoso, J.-F. (1999). High-order contrasts for independent component analysis. Neural Computation, 11(1), 157–192.
Cardoso, J.-F., & Laheld, B. (1996). Equivariant adaptive source separation. IEEE Transactions on Signal Processing, 44(12), 3017–3030.
Comon, P. (1994). Independent component analysis—a new concept? Signal Processing, 36, 287–314.
Hoyer, P. O., & Hyvärinen, A. (2002). A multi-layer sparse coding network learns contour coding from natural images. Vision Research, 42(12), 1593–1605.
Hyvärinen, A. (1999). Fast and robust fixed-point algorithms for independent component analysis. IEEE Transactions on Neural Networks, 10(3), 626–634.
Hyvärinen, A., & Hoyer, P. O. (2000). Emergence of phase and shift invariant features by decomposition of natural images into independent feature subspaces. Neural Computation, 12(7), 1705–1720.
Jutten, C., & Hérault, J. (1991). Blind separation of sources (part I): An adaptive algorithm based on neuromimetic architecture. Signal Processing, 24(1), 1–10.
Li, Z., & Atick, J. J. (1994). Towards a theory of the striate cortex. Neural Computation, 6, 127–146.
Matsuda, Y., & Yamaguchi, K. (2003). Linear multilayer ICA algorithm integrating small local modules. In Proceedings of ICA2003 (pp. 403–408). Nara, Japan.
Matsuda, Y., & Yamaguchi, K. (2004). Linear multilayer independent component analysis using stochastic gradient algorithm. In C. G. Puntonet & A. Prieto (Eds.), Independent component analysis and blind source separation—ICA2004 (pp. 303–310). Berlin: Springer-Verlag.
Matsuda, Y., & Yamaguchi, K. (2005a). An efficient MDS-based topographic mapping algorithm. Neurocomputing, 64, 285–299.
Matsuda, Y., & Yamaguchi, K. (2005b). Linear multilayer independent component analysis for large natural scenes. In L. K. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing systems, 17 (pp. 897–904). Cambridge, MA: MIT Press.
van Hateren, J. H., & van der Schaaf, A. (1998). Independent component filters of natural images compared with simple cells in primary visual cortex. Proceedings of the Royal Society of London: B, 265, 359–366.
Received December 16, 2005; accepted April 5, 2006.
LETTER
Communicated by Andries P. Engelbrecht
Functional Network Topology Learning and Sensitivity Analysis Based on ANOVA Decomposition

Enrique Castillo
[email protected]
Department of Applied Mathematics and Computational Sciences, University of Cantabria and University of Castilla–La Mancha, Spain

Noelia Sánchez-Maroño
[email protected]

Amparo Alonso-Betanzos
[email protected]
Computer Science Department, University of A Coruña, Spain

Carmen Castillo
[email protected]
Department of Civil Engineering, University of Castilla–La Mancha, Spain
A new methodology for learning the topology of a functional network from data, based on the ANOVA decomposition technique, is presented. The method determines sensitivity (importance) indices that allow a decision to be made as to which set of interactions among variables is relevant and which is irrelevant to the problem under study. This immediately suggests the network topology to be used in a given problem. Moreover, local sensitivities to small changes in the data can be easily calculated. In this way, the dual optimization problem gives the local sensitivities. The methods are illustrated by their application to artificial and real examples.
1 Introduction

Functional networks (FN) have been proposed by E. Castillo (1998; Castillo, Cobo, Gutiérrez, & Pruneda, 1998) and have been shown to be successful in solving many physical or engineering problems (see Castillo & Gutiérrez, 1998; Castillo, Cobo, Gutiérrez, & Pruneda, 2000). FN are a generalization of neural networks (NN) that combine knowledge about the structure of the problem and data (Castillo et al., 1998). There are important differences between FN and NN. Standard NN usually have a rigid topology (only the number of layers and the number of neurons can be chosen) and fixed neural functions, so that only the weights are learned. In FN, the neural
topology is initially free, and the neural functions can be selected from a wide range of families, so that both the topology and the parameters of the neural functions are learned. In addition, FN incorporate knowledge on the network topology or the neural functions, or both. This knowledge can come from two main sources: (1) the knowledge the user has about the problem being solved, which can be written in terms of certain properties or characteristics of the model, or (2) the available data, which can be scrutinized to obtain the structure.

The authors cited have described how the first type of knowledge can be used to derive the network topology. This is done mainly with the help of functional equations, which both suggest an initial topology and allow this initial topology to be simplified, leading to an equivalent simpler one. Thus, one of the main tools for building the network topology from this type of knowledge is functional equations. Such equations have been extensively described in the literature (see, e.g., Aczél, 1966; Castillo & Ruíz-Cobo, 1992; Castillo, Cobo, Gutiérrez, & Pruneda, 1999; Castillo, Iglesias, & Ruíz-Cobo, 2004). In this way, and although FN can also be used as black boxes, the network is a white (rather than a black) box, where the structure has a knowledge-based foundation. However, no efficient and clean method for deriving the network topology from data has yet been devised. This is one of the aims of this article.

Sensitivity analysis is another area of great interest (Saltelli, Chan, & Scott, 2000; Castillo, Gutiérrez, & Hadi, 1997; Chatterjee & Hadi, 1988; Hadi & Nyquist, 2002); the concern is not only with learning a model but with the sensitivity of the model parameters to the data. Sensitivity can be understood as local or global. Local sensitivity aims at discovering how the model changes as small modifications are made to the data and at determining which data values have the greatest influence on the model when changed in small increments. Global sensitivity aims at discovering how sensitive the model is to the inputs and whether some inputs can be removed without a significant loss in output quality. Different studies have been carried out regarding model approximation (Sacks, Mitchell, & Wynn, 1989; Koehler & Owen, 1996; Currin, Mitchell, Morris, & Ylvisaker, 1991). Some of them use regression strategies similar to, but different from, the one described in Jiang and Owen (2001).

In this letter, we present a methodology based on ANOVA decomposition that permits the topology of a functional network to be learned from data. The ANOVA decomposition also permits global sensitivity indices to be obtained for each variable and for each interaction between variables. The topology of a functional network is derived from this information, and low- and high-order interactions among variables are easily determined using ANOVA. Finally, a local sensitivity analysis can be carried out too. In this way, the final model can be fully defined.

The letter is structured as follows. Section 2 gives a quick introduction to functional networks and describes the ANOVA decomposition, including global sensitivity indices. Section 3 describes the proposed method for
learning the network topology and the local and global sensitivity indices from data. Section 4 presents two illustrative examples, and section 5 contains the conclusions.

2 Background Knowledge

2.1 A Brief Introduction to Functional Networks. Functional networks (FN) are a generalization of neural networks that brings together domain knowledge, to determine the structure of the problem, and data, to estimate the unknown functional neurons (Castillo et al., 1998). In FN, there are two types of learning to deal with this domain and data knowledge:

- Structural learning, which includes the initial topology of the network and its posterior simplification using functional equations, leading to a simpler equivalent structure.
- Parametric learning, concerned with the estimation of the neuron functions. This can be done by considering linear combinations of given functional families and estimating the associated parameters from the available data. Note that this type of learning generalizes the idea of estimating the weights of the connections in a neural network.
In FN, not only are arbitrary neural functions allowed, but they are initially assumed to be multiargument and vector-valued functions; that is, they depend on several arguments and are multivariate. This is shown in Figure 1a. In this figure, we can also see some relevant differences between neural and functional networks. Note that the FN has no weights, and the parameters to be learned are incorporated into the neural functions f_i, i = 1, 2, 3. These neural functions are unknown functions from a given family (e.g., polynomial functions) to be estimated during the learning process. For example, the neural functions f_i could be approximated by

f_i(x_i, x_j) = \sum_{k_i=0}^{m_i} a_{i k_i} x_i^{k_i} + \sum_{k_j=0}^{m_j} a_{i k_j} x_j^{k_j},

and the parameters to be learned are the coefficients a_{i k_i} and a_{i k_j}. As each function f_i is learned, a different function is obtained for each neuron. It can be argued that functional networks require domain knowledge for deriving the functional equations and make assumptions about the form the unknown functions should take. However, like neural networks, functional networks can also be used as a black box.

2.2 The ANOVA Decomposition. The analysis of variance (ANOVA) was developed by Fisher (1918) and later further developed by other authors (Efron & Stein, 1981; Sobol, 1969; Hoeffding, 1948).
Figure 1: (a) A neural network. (b) A functional network.
According to Sobol (2001), any square integrable function f(x_1, \ldots, x_n) defined on the unit hypercube [0, 1]^n can be written as

y = f(x_1, \ldots, x_n) = f_0 + \sum_{i_1=1}^{n} f_{i_1}(x_{i_1}) + \sum_{i_1=1}^{n-1} \sum_{i_2=i_1+1}^{n} f_{i_1 i_2}(x_{i_1}, x_{i_2}) + \sum_{i_1=1}^{n-2} \sum_{i_2=i_1+1}^{n-1} \sum_{i_3=i_2+1}^{n} f_{i_1 i_2 i_3}(x_{i_1}, x_{i_2}, x_{i_3}) + \cdots + f_{1 2 \ldots n}(x_1, x_2, \ldots, x_n),   (2.1)

where the term f_0 is a constant and corresponds to the function with no arguments.
Functional Network Topology Learning and Sensitivity Analysis
235
The decomposition, equation 2.1, is called ANOVA iff

\int_0^1 f_{i_1 i_2 \ldots i_k}(x_{i_1}, x_{i_2}, \ldots, x_{i_k}) \, dx_{i_r} \equiv 0; \quad \forall r = 1, 2, \ldots, k; \; \forall i_1, i_2, \ldots, i_k; \; \forall k,   (2.2)

where the index i_r points to any of the elements in the set x_1, x_2, \ldots, x_n. Hence, the functions corresponding to the different summands are orthogonal, that is,

\int_0^1 \int_0^1 \cdots \int_0^1 f_{i_1 i_2 \ldots i_k}(x_{i_1}, x_{i_2}, \ldots, x_{i_k}) \, f_{j_1 j_2 \ldots j_\ell}(x_{j_1}, x_{j_2}, \ldots, x_{j_\ell}) \, d\mathbf{x} = 0, \quad \forall (i_1, i_2, \ldots, i_k) \neq (j_1, j_2, \ldots, j_\ell),

where x = (x_1, x_2, \ldots, x_n). It is important to notice that the conditions in equation 2.2 are sufficient for the different component functions to be orthogonal. In addition, it can be shown that they are unique, in the sense of L^2 equivalence classes. Note that since the above decomposition includes terms with all possible kinds of interactions among the variables x_1, x_2, \ldots, x_n, it allows these interactions to be determined. The main advantage of this decomposition is that there are explicit formulas for obtaining the different summands or components of f(x_1, \ldots, x_n). Sobol (2001) provides the following expressions:
f_0 = \int_0^1 \cdots \int_0^1 f(x_1, \ldots, x_n) \, dx_1 \, dx_2 \cdots dx_n,   (2.3)

f_i(x_i) = \int_0^1 \cdots \int_0^1 f(x_1, \ldots, x_n) \prod_{k=1;\, k \neq i}^{n} dx_k \; - \; f_0,   (2.4)

f_{ij}(x_i, x_j) = \int_0^1 \cdots \int_0^1 f(x_1, \ldots, x_n) \prod_{k=1;\, k \neq i,j}^{n} dx_k \; - \; f_0 \; - \; \sum_{k=i,j} f_k(x_k),   (2.5)

and so on. In other words, the function f(x_1, \ldots, x_n) can always be written as the sum of 2^n orthogonal summands (see equation 2.1). If f(x_1, \ldots, x_n) is square integrable, then all f_{i_1 i_2 \ldots i_k}(x_{i_1}, x_{i_2}, \ldots, x_{i_k}), for all combinations i_1, i_2, \ldots, i_k and k = 1, 2, \ldots, n, are also square integrable.
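As an illustration of equations 2.3 and 2.4, here is a brief numerical sketch (assuming numpy; the test function, taken from the example in section 4.1, and the grid size are our choices) that recovers f_0 and the first-order component f_1 by integrating out the remaining variables on a midpoint grid:

```python
import numpy as np

def f(x1, x2, x3):
    # The three-input example function used later in section 4.1 (eq. 4.1).
    return 1 + x1 + x1**2 + x1*x2 + 2*x1*x2*x3

n = 100
t = (np.arange(n) + 0.5) / n                  # midpoint grid on (0, 1)
X1, X2, X3 = np.meshgrid(t, t, t, indexing="ij")
F = f(X1, X2, X3)

f0 = F.mean()                                 # eq. 2.3: integrate over all variables
f1 = F.mean(axis=(1, 2)) - f0                 # eq. 2.4: integrate out x2 and x3

print(round(f0, 4))                           # about 7/3 = 2.3333
print(np.allclose(f1, t**2 + 2*t - 4/3, atol=1e-3))  # matches eq. 4.2: True
```

The higher-order components follow the same pattern: integrate out all variables not in the component and subtract the already-computed lower-order terms, as in equation 2.5.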
Squaring f(x_1, \ldots, x_n) and integrating over (0, 1)^n gives (the cross terms vanish by the orthogonality conditions above)

\int_0^1 \cdots \int_0^1 f^2(x_1, \ldots, x_n) \, dx_1 \, dx_2 \cdots dx_n = f_0^2 + \sum_{i_1=1}^{n} \int f_{i_1}^2(x_{i_1}) \, dx_{i_1} + \sum_{i_1=1}^{n-1} \sum_{i_2=i_1+1}^{n} \int f_{i_1 i_2}^2(x_{i_1}, x_{i_2}) \, dx_{i_1} dx_{i_2} + \cdots + \int f_{1 2 \ldots n}^2(x_1, \ldots, x_n) \, d\mathbf{x},   (2.6)
and calling

D = \int_0^1 \cdots \int_0^1 f^2(x_1, \ldots, x_n) \, dx_1 \, dx_2 \cdots dx_n - f_0^2,   (2.7)

D_{i_1 i_2 \ldots i_k} = \int_0^1 \cdots \int_0^1 f_{i_1 i_2 \ldots i_k}^2(x_{i_1}, x_{i_2}, \ldots, x_{i_k}) \, d\mathbf{x},   (2.8)

the result is

D = \sum_{k=1}^{n} \sum_{i_1 < i_2 < \cdots < i_k} D_{i_1 i_2 \ldots i_k}.   (2.9)
The constant D is called the variance because, if (x_1, x_2, \ldots, x_n) is a uniform random variable in the unit hypercube, then D is the variance of f(x_1, x_2, \ldots, x_n). Thus, the following set of global sensitivity indices, adding up to one, can be defined:

S_{i_1 i_2 \ldots i_k} = \frac{D_{i_1 i_2 \ldots i_k}}{D} \quad \Leftrightarrow \quad \sum_{i_1 i_2 \ldots i_k} S_{i_1 i_2 \ldots i_k} = 1.   (2.10)
The practical importance of the ANOVA decomposition arises from the following facts:

1. Every square integrable function can be decomposed as the sum of orthogonal functions, including all interaction levels.
2. This decomposition is unique.
3. There are explicit formulas for determining this decomposition in terms of f(x_1, \ldots, x_n).
4. The variance of the initial function can be obtained by summing up the variances of the components, and this permits global sensitivity indices that sum to one to be assigned to the different functional components.

3 Learning Algorithm

In this section, we present a method (denominated AFN, i.e., ANOVA decomposition and functional networks) for learning the functional components of any general function f(x_1, \ldots, x_n) from data and for calculating local and global sensitivity indices. Consider a data set {(x_{1m}, x_{2m}, \ldots, x_{nm}; y_m) | m = 1, 2, \ldots, M}—a sample of size M of n input variables (X_1, X_2, \ldots, X_n)—and one output variable Y. The algorithm works as follows.

3.1 Step 1: Select a Set of Approximating Orthonormal Functions. Each functional component f_{i_1 i_2 \ldots i_k}(x_{i_1}, x_{i_2}, \ldots, x_{i_k}) is approximated as

f_{i_1 i_2 \ldots i_k}(x_{i_1}, x_{i_2}, \ldots, x_{i_k}) \approx \sum_{j=1}^{k^*_{i_1 i_2 \ldots i_k}} c^*_{i_1 i_2 \ldots i_k; j} \, h^*_{i_1 i_2 \ldots i_k; j}(x_{i_1}, x_{i_2}, \ldots, x_{i_k}),   (3.1)

where the c^*_{i_1 i_2 \ldots i_k; j} are real constants, and

\{ h^*_{i_1 i_2 \ldots i_k; j}(x_{i_1}, x_{i_2}, \ldots, x_{i_k}) \mid j = 1, 2, \ldots, k^*_{i_1 i_2 \ldots i_k} \}   (3.2)
is a set of basis functions (e.g., polynomials, sinusoids, exponentials, splines, wavelets), which must be orthonormalized. One possibility consists of using one of the families of univariate orthogonal functions, for example, Legendre polynomials, forming tensor products with them, and selecting a subset of them. Although these polynomials are defined with respect to a uniform weighting of [−1, 1], they can easily be mapped to the interval [0, 1]. Hermite and Chebyshev polynomials, Fourier functions, and Haar wavelets also provide univariate basis functions. Another alternative consists of using a family of functions and the Gram-Schmidt technique, which is implemented in some numerical libraries, to orthonormalize these basis functions.
For the sake of completeness, we give a third alternative. Since they must satisfy the ANOVA constraints, equation 2.2, we must have

\int_0^1 \sum_{j=1}^{k^*_{i_1 i_2 \ldots i_k}} c^*_{i_1 i_2 \ldots i_k; j} \, h^*_{i_1 i_2 \ldots i_k; j}(x_{i_1}, x_{i_2}, \ldots, x_{i_k}) \, dx_{i_r} \equiv 0; \quad \forall r = 1, 2, \ldots, k; \; \forall i_1, i_2, \ldots, i_k; \; \forall k \in \{1, 2, \ldots, n\}.   (3.3)
Note that in spite of the fact that these conditions represent an uncountably infinite number of constraints, only a finite number of them are independent. Thus, only some subset of the constants {c^*_{i_1 i_2 \ldots i_k; j} | j = 1, 2, \ldots, k^*_{i_1 i_2 \ldots i_k}} remains free. After renaming the free constants {c_{i_1 i_2 \ldots i_k; j} | j = 1, 2, \ldots, k_{i_1 i_2 \ldots i_k}} and orthonormalizing the basis functions, equation 3.1 becomes

f_{i_1 i_2 \ldots i_k}(x_{i_1}, x_{i_2}, \ldots, x_{i_k}) \approx \sum_{j=1}^{k_{i_1 i_2 \ldots i_k}} c_{i_1 i_2 \ldots i_k; j} \, p_{i_1 i_2 \ldots i_k; j}(x_{i_1}, x_{i_2}, \ldots, x_{i_k}).   (3.4)

Note that the initial set of basis functions, equation 3.2, changes to

\{ p_{i_1 i_2 \ldots i_k; j}(x_{i_1}, x_{i_2}, \ldots, x_{i_k}) \mid j = 1, 2, \ldots, k_{i_1 i_2 \ldots i_k} \}.   (3.5)
If one uses the tensor product technique, the multivariate normalized basis functions p_{i_1 i_2 \ldots i_k; j}(x_{i_1}, x_{i_2}, \ldots, x_{i_k}) can be selected from the set of tensor products

\left\{ \prod_{j=1}^{k} p_{i_j; r_j}(x_{i_j}) \;\middle|\; 1 \leq r_j \leq k_{i_j} \right\}.   (3.6)
Note that the size of this set grows exponentially with k and that we may be interested in selecting a small subfamily. For example, if we select as univariate basis functions polynomials of degree r, the tensor product technique leads, for the m-dimensional basis functions, to polynomials of degree r × m, which is too high. Thus, we can limit the degree of the corresponding m-multivariate basis to contain only polynomials of degree d_m or less. This is what we have done in the examples presented in this letter.

Example 1 (Univariate Functions): Consider the basis of univariate polynomial functions

{h^*_{1;1}(x_1), h^*_{1;2}(x_1), h^*_{1;3}(x_1), h^*_{1;4}(x_1)} = {1, x_1, x_1^2, x_1^3}.
After imposing the constraints, equations 2.3 to 2.5, we get the reduced basis

{h_{1;1}(x_1), h_{1;2}(x_1), h_{1;3}(x_1)} = {2x_1 − 1, 3x_1^2 − 1, 4x_1^3 − 1},

which after normalization leads to

{p_{1;1}(x_1), p_{1;2}(x_1), p_{1;3}(x_1)} = {√3 (2x_1 − 1), √5 (6x_1^2 − 6x_1 + 1), √7 (20x_1^3 − 30x_1^2 + 12x_1 − 1)}.

Example 2 (Multivariate Functions): If the function is multivariate, the normalized basis can be obtained as the tensor product of univariate basis functions. For example, consider the basis of 16 bivariate functions

{1, x_2, x_2^2, x_2^3, x_1, x_1 x_2, x_1 x_2^2, x_1 x_2^3, x_1^2, x_1^2 x_2, x_1^2 x_2^2, x_1^2 x_2^3, x_1^3, x_1^3 x_2, x_1^3 x_2^2, x_1^3 x_2^3},

which, after imposing the constraints, equations 2.3 to 2.5, and normalizing, leads to the normalized basis

3 (−1 + 2x_1)(−1 + 2x_2),
√15 (−1 + 2x_1)(1 − 6x_2 + 6x_2^2),
√21 (−1 + 2x_1)(−1 + 12x_2 − 30x_2^2 + 20x_2^3),
√15 (1 − 6x_1 + 6x_1^2)(−1 + 2x_2),
5 (1 − 6x_1 + 6x_1^2)(1 − 6x_2 + 6x_2^2),
√35 (1 − 6x_1 + 6x_1^2)(−1 + 12x_2 − 30x_2^2 + 20x_2^3),
√21 (−1 + 12x_1 − 30x_1^2 + 20x_1^3)(−1 + 2x_2),
√35 (−1 + 12x_1 − 30x_1^2 + 20x_1^3)(1 − 6x_2 + 6x_2^2),
7 (−1 + 12x_1 − 30x_1^2 + 20x_1^3)(−1 + 12x_2 − 30x_2^2 + 20x_2^3).

This family of 9 functions is the family in equation 3.6; that is, it comes from the tensor products of the normalized functions in example 1. Note that these bases can be obtained independently of the data set, which means that they are valid for learning any data set. So once calculated, they can be stored and used when needed. In addition, since multivariate functions have associated tensor product bases of univariate functions, one needs to calculate and store only the bases associated with univariate functions.
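A small numerical check of examples 1 and 2 (our sketch, assuming numpy; names are illustrative): the univariate polynomials p_{1;j} and their tensor products should form an orthonormal family on the unit interval and the unit square, respectively.

```python
import numpy as np

# Orthonormal univariate basis from example 1 (shifted Legendre polynomials).
p = [
    lambda x: np.sqrt(3) * (2*x - 1),
    lambda x: np.sqrt(5) * (6*x**2 - 6*x + 1),
    lambda x: np.sqrt(7) * (20*x**3 - 30*x**2 + 12*x - 1),
]

t = (np.arange(2000) + 0.5) / 2000          # midpoint grid on (0, 1)
P = np.stack([q(t) for q in p])             # rows are p_{1;1}, p_{1;2}, p_{1;3}

# Gram matrix of the univariate family: should be close to the identity.
print(np.round(P @ P.T / t.size, 4))

# One bivariate tensor product from example 2: p_{1;2}(x1) p_{1;2}(x2),
# which equals 5 (1 - 6x1 + 6x1^2)(1 - 6x2 + 6x2^2); squared norm about 1.
X1, X2 = np.meshgrid(t, t, indexing="ij")
g = p[1](X1) * p[1](X2)
print(np.round((g**2).mean(), 4))
```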
240
˜ A. Alonso-Betanzos, and C. Castillo E. Castillo, N. S´anchez-Marono,
3.2 Step 2: Learn the Coefficients by Least Squares and Obtain Local Sensitivity Indices. According to our approximation, we have

y_m = f(x_{1m}, \ldots, x_{nm}) = f_0 + \sum_{i_1=1}^{n} \sum_{j=1}^{k_{i_1}} c_{i_1; j} \, p_{i_1; j}(x_{i_1 m}) + \sum_{i_1=1}^{n-1} \sum_{i_2=i_1+1}^{n} \sum_{j=1}^{k_{i_1 i_2}} c_{i_1 i_2; j} \, p_{i_1 i_2; j}(x_{i_1 m}, x_{i_2 m}) + \sum_{i_1=1}^{n-2} \sum_{i_2=i_1+1}^{n-1} \sum_{i_3=i_2+1}^{n} \sum_{j=1}^{k_{i_1 i_2 i_3}} c_{i_1 i_2 i_3; j} \, p_{i_1 i_2 i_3; j}(x_{i_1 m}, x_{i_2 m}, x_{i_3 m}) + \cdots + \sum_{j=1}^{k_{1 2 \ldots n}} c_{1 2 \ldots n; j} \, p_{1 2 \ldots n; j}(x_{1 m}, x_{2 m}, \ldots, x_{n m}).   (3.7)
The error for the mth data value is

\epsilon_m(\mathbf{y}, \mathbf{x}, \mathbf{c}) = y_m - f_0 - \sum_{i_1=1}^{n} \sum_{j=1}^{k_{i_1}} c_{i_1; j} \, p_{i_1; j}(x_{i_1 m}) - \sum_{i_1=1}^{n-1} \sum_{i_2=i_1+1}^{n} \sum_{j=1}^{k_{i_1 i_2}} c_{i_1 i_2; j} \, p_{i_1 i_2; j}(x_{i_1 m}, x_{i_2 m}) - \cdots - \sum_{j=1}^{k_{1 2 \ldots n}} c_{1 2 \ldots n; j} \, p_{1 2 \ldots n; j}(x_{1 m}, x_{2 m}, \ldots, x_{n m}),   (3.8)

where y, x are the data vectors and c is the vector including all the unknown coefficients. Hence, to estimate the constants c, that is, all c_{i_1 i_2 \ldots i_k; j}, ∀k, j, the following minimization problem has to be solved:

\underset{f_0, \mathbf{c}}{\text{Minimize}} \quad Q = \sum_{m=1}^{M} \epsilon_m^2(\mathbf{y}, \mathbf{x}, \mathbf{c}).   (3.9)
The minimization problem, equation 3.9, leads to a linear system of equations with a unique solution. However, due to its tensor character, its organization in a standard form as Ac = b requires renumbering of
the unknowns c, which is not a trivial task. Alternatively, one can use standard optimization packages, such as GAMS (Brooke, Kendrik, Meeraus, & Raman, 1998), to solve equation 3.9. This is the option used in this letter. However, to obtain the local sensitivities of Q to small changes in the data, the following modified problem, which is equivalent to equation 3.9, can be solved instead:

\underset{f_0, \mathbf{c}; \, \mathbf{y}', \mathbf{x}'}{\text{Minimize}} \quad Q = \sum_{m=1}^{M} \epsilon_m^2(\mathbf{y}', \mathbf{x}', \mathbf{c}),   (3.10)

subject to

\mathbf{y}' = \mathbf{y} : \boldsymbol{\lambda},   (3.11)
\mathbf{x}' = \mathbf{x} : \boldsymbol{\delta},   (3.12)
where λ and δ are the vectors of dual variables, which give the local sensitivities of Q with respect to the data values y and x, respectively. This is so because the dual variables associated with any primal constrained optimization problem are known to be the partial derivatives of the optimal value of the objective function with respect to changes in the right-hand-side parameters of the corresponding equality constraints. Since in this artificial optimization problem, equations 3.10 to 3.12, the right-hand sides of the equality constraints 3.11 and 3.12 are the data y and x, respectively, the partial derivatives of Q* (the optimal value of Q) with respect to the data, that is, the sensitivities sought, are the values of the corresponding dual variables.

3.3 Step 3: Obtain the Global Sensitivity Indices. Since the resulting basis functions have already been orthonormalized, the global sensitivity indices (importance factors) are the sums of the squares of the coefficients, that is,

S_{i_1 i_2 \ldots i_k} = \sum_{j=1}^{k_{i_1 i_2 \ldots i_k}} c_{i_1 i_2 \ldots i_k; j}^2, \quad \forall (i_1, i_2, \ldots, i_k).   (3.13)
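To make steps 1 to 3 concrete, the following sketch (our illustration in numpy, assuming the univariate basis of example 1 and tensor products limited to total degree 3; not the GAMS-based implementation used in the letter) fits the coefficients by least squares and groups the squared coefficients by functional component, as in equation 3.13:

```python
import numpy as np
from itertools import combinations

# Orthonormal univariate basis from example 1; p[a] has degree a + 1.
p = [
    lambda x: np.sqrt(3) * (2*x - 1),
    lambda x: np.sqrt(5) * (6*x**2 - 6*x + 1),
    lambda x: np.sqrt(7) * (20*x**3 - 30*x**2 + 12*x - 1),
]

def design_matrix(X, max_degree=3):
    """Step 1: constant plus tensor-product columns of total degree <= max_degree."""
    M, n = X.shape
    cols, comps = [np.ones(M)], [()]
    for k in range(1, n + 1):                        # interaction order
        for idx in combinations(range(n), k):        # which variables interact
            for degs in np.ndindex(*(len(p),) * k):  # basis function per variable
                if sum(d + 1 for d in degs) <= max_degree:
                    col = np.ones(M)
                    for i, d in zip(idx, degs):
                        col = col * p[d](X[:, i])
                    cols.append(col)
                    comps.append(idx)
    return np.column_stack(cols), comps

# Step 2: sample the function of section 4.1 and solve eq. 3.9 by least squares.
rng = np.random.default_rng(0)
X = rng.random((100, 3))
y = 1 + X[:, 0] + X[:, 0]**2 + X[:, 0]*X[:, 1] + 2*X[:, 0]*X[:, 1]*X[:, 2]
H, comps = design_matrix(X)
c, *_ = np.linalg.lstsq(H, y, rcond=None)

# Step 3 (eq. 3.13): sum squared coefficients per component; normalize by D.
D_comp = {}
for coef, comp in zip(c[1:], comps[1:]):
    D_comp[comp] = D_comp.get(comp, 0.0) + coef**2
D = sum(D_comp.values())
for comp, d in sorted(D_comp.items()):
    print(comp, round(d / D, 4))   # reproduces the noise-free indices of Table 1
```

Because the target function lies in the span of these basis functions, the fit is exact and the normalized indices match the σ = 0 row reported in section 4.1.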
3.4 Step 4: Return Solution. The algorithm returns the following information:

- An estimation of the coefficients f_0 and {c_{i_1 i_2 \ldots i_k; j}; ∀(i_1, i_2, \ldots, i_k; j)}, which, considering the basis functions p_{i_1 i_2 \ldots i_k; j}(x_{i_1}, x_{i_2}, \ldots, x_{i_k}) and equation 3.7, allow an approximation of the solution of the problem under study to be obtained.
- Local and global sensitivity indices.
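Regarding the local sensitivity output, one way to read the duals λ of equation 3.11 numerically (our sketch, not part of the letter: by the envelope theorem, ∂Q*/∂y_m at the least-squares optimum equals twice the mth residual) is:

```python
import numpy as np

rng = np.random.default_rng(1)
H = rng.random((50, 4))                  # toy design matrix, as built in step 2
y = rng.random(50)

c, *_ = np.linalg.lstsq(H, y, rcond=None)
lam = 2 * (y - H @ c)                    # candidate duals: lambda_m = 2 * residual_m

def Q_star(yv):
    """Optimal value of Q (eq. 3.9) as a function of the data y."""
    cv, *_ = np.linalg.lstsq(H, yv, rcond=None)
    return np.sum((yv - H @ cv) ** 2)

e, h = np.zeros(50), 1e-6                # finite-difference check of dQ*/dy_0
e[0] = h
print(lam[0], (Q_star(y + e) - Q_star(y - e)) / (2 * h))  # the two agree
```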
For simplicity, only one output variable was considered; however, an extension to the case of several outputs can be immediately obtained by solving the corresponding minimization problems. Thus, the algorithm above allows us to obtain an approximation of a given function f in terms of its functional components. Moreover, global sensitivity indices are derived for each functional component; therefore, the topology of a functional network can be considerably simplified, taking into account only the components with higher sensitivities. Suppose that \hat{f}(x_1, \ldots, x_n) is the approximation of f after removing unimportant interactions, that is, when only the subset of important interactions is included in the final model; then the mean squared error (assuming a uniform distribution in the unit cube) is

MSE = \int_0^1 \int_0^1 \cdots \int_0^1 \left[ \hat{f}(x_1, \ldots, x_n) - f(x_1, \ldots, x_n) \right]^2 d\mathbf{x}.

Sometimes it is more convenient to obtain the normalized mean squared error (NMSE), which is adimensional and defined as

NMSE = \frac{MSE}{\text{Var}[f(x_1, \ldots, x_n)]}.
4 Application Examples

In this section, two application examples are described to illustrate the methodology proposed above.

4.1 Learning a Three-Input Nonlinear Function. The aim of this example is to demonstrate that the global sensitivity indices recovered by the proposed algorithm from data are exactly the same as those that can be obtained directly from the original function. We will also show that the resulting global sensitivity indices are hardly affected by noise. For illustration, we selected the following function:

y = f(x_1, x_2, x_3) = 1 + x_1 + x_1^2 + x_1 x_2 + 2 x_1 x_2 x_3.   (4.1)
First, we calculate the exact ANOVA components using equations 2.3 to 2.5, leading to

f_0 = 7/3;  f_1(x_1) = −4/3 + 2 x_1 + x_1^2;  f_2(x_2) = −1/2 + x_2;
f_3(x_3) = −1/4 + x_3/2;  f_{12}(x_1, x_2) = 1/2 − x_1 − x_2 + 2 x_1 x_2;
f_{13}(x_1, x_3) = 1/4 − x_1/2 − x_3/2 + x_1 x_3;  f_{23}(x_2, x_3) = 1/4 − x_2/2 − x_3/2 + x_2 x_3;
f_{123}(x_1, x_2, x_3) = −1/4 + x_1/2 + x_2/2 + x_3/2 − x_1 x_2 − x_1 x_3 − x_2 x_3 + 2 x_1 x_2 x_3,   (4.2)

Table 1: Sensitivity Results with Different Levels of Noise.

σ      S1      S2      S3      S12     S13     S23     S123
0.00   0.8361  0.0922  0.0231  0.0307  0.0077  0.0077  0.0026
0.02   0.8359  0.0922  0.0230  0.0308  0.0077  0.0078  0.0026
0.04   0.8356  0.0921  0.0233  0.0309  0.0076  0.0080  0.0026
0.06   0.8360  0.0917  0.0231  0.0304  0.0081  0.0080  0.0026
0.08   0.8336  0.0924  0.0237  0.0317  0.0081  0.0077  0.0028
0.10   0.8335  0.0918  0.0238  0.0311  0.0086  0.0084  0.0028
0.12   0.8312  0.0931  0.0242  0.0314  0.0088  0.0081  0.0032
0.14   0.8305  0.0940  0.0241  0.0305  0.0089  0.0091  0.0029
0.16   0.8321  0.0901  0.0237  0.0319  0.0095  0.0096  0.0031
0.18   0.8268  0.0925  0.0244  0.0327  0.0110  0.0099  0.0028
0.20   0.8267  0.0930  0.0238  0.0331  0.0103  0.0096  0.0035

Note: Mean values of 100 replications.
whose sum gives exactly the function in equation 4.1. Next, the exact variance D = 0.9037 of the function f and the sensitivity indices of the functional components were calculated using equations 2.7 to 2.10. These are shown in the first row of Table 1, corresponding to σ = 0.

Subsequently, we show that we can learn the function f, its ANOVA decomposition, and the global indices from data. To this end, a sample of size m = 100 was generated for each input variable {x_{ik} | i = 1, 2, 3; k = 1, 2, \ldots, m} from a uniform U(0, 1) random variable, and {y_k | k = 1, 2, \ldots, m} was calculated using equation 4.1. Next, the AFN method was applied in order to obtain an approximation of the function f in equation 4.1. The algorithm starts by considering a set of basis functions for the univariate functions. As the function to be estimated was known (because this is an illustrative example), a suitable set of basis functions was formed by third-degree polynomials. These basis functions were orthonormalized as described in example 1. The multivariate functions, with two and three arguments, were approximated with the tensor product functions obtained from the corresponding third-degree univariate polynomials (see example 2). However, only polynomials of third degree or lower were used in the approximation.
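The exact values above can be checked numerically. The following sketch (assuming numpy; the grid sizes are arbitrary choices of ours) integrates the squared components of equation 4.2 with the midpoint rule and reproduces D = 0.9037 and the first row of Table 1:

```python
import numpy as np

# Exact ANOVA components from equation 4.2.
f1   = lambda x: -4/3 + 2*x + x**2
f2   = lambda x: -1/2 + x
f3   = lambda x: -1/4 + x/2
f12  = lambda a, b: 1/2 - a - b + 2*a*b
f13  = lambda a, b: 1/4 - a/2 - b/2 + a*b
f23  = f13                                   # same functional form, in (x2, x3)
f123 = lambda a, b, c: -1/4 + a/2 + b/2 + c/2 - a*b - a*c - b*c + 2*a*b*c

mid = lambda n: (np.arange(n) + 0.5) / n     # midpoint grid on (0, 1)
t = mid(512)
A, B = np.meshgrid(t, t, indexing="ij")
U, V, W = np.meshgrid(mid(128), mid(128), mid(128), indexing="ij")

D = {"1": (f1(t)**2).mean(), "2": (f2(t)**2).mean(), "3": (f3(t)**2).mean(),
     "12": (f12(A, B)**2).mean(), "13": (f13(A, B)**2).mean(),
     "23": (f23(A, B)**2).mean(), "123": (f123(U, V, W)**2).mean()}
Dtot = sum(D.values())
print(round(Dtot, 4))                                 # 0.9037
print({k: round(v / Dtot, 4) for k, v in D.items()})  # first row of Table 1
```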
After solving the minimization problem, the exact function was recovered by the algorithm:

f(x_1, \ldots, x_n) = 1 + x_1 + x_1^2 + x_1 x_2 + 2 x_1 x_2 x_3.

Next, the global sensitivity indices were calculated from the coefficients using equation 3.13 in step 3 of the algorithm. These coefficients were the exact ones (i.e., with no error), as can be confirmed in Table 1 in the row labeled σ = 0.00. Finally, to test the sensitivity of the results to noise, the learning process was repeated for different noise levels, adding to y in equation 4.1 a normal N(0, σ²) noise with σ = 0.00, 0.02, \ldots, 0.20. The results obtained are shown in Table 1. It can be observed that the sensitivity indices are barely affected by the noise level.

Once the global sensitivity indices have been calculated, one can decide which interactions must be included in the model. This immediately defines the network topology. Since 98.18% of the variance in the illustrative example can be obtained by the approximation

f(x_1, \ldots, x_n) \approx f_0 + f_1(x_1) + f_2(x_2) + f_3(x_3) + f_{12}(x_1, x_2),   (4.3)

one can decide to remove all other interactions. The final topology of the network obtained using our method is shown in Figure 2b, whereas Figure 2a shows the network topology used when the proposed methodology is not applied. Comparing both network topologies, it can be observed that the application of the algorithm achieves considerable simplification while practically maintaining the quality of the outputs. To illustrate the performance of the proposed method, Table 2 shows the means and standard deviations (in parentheses) of the MSE (mean squared error) and NMSE (normalized mean squared error) for the training (80% of the sample) and testing (20% of the sample) samples, obtained with 100 replications. Since the values are small, they reveal good performance. In addition, Table 2 includes the errors obtained with increasing noise values (σ).

4.2 Case Example: A Vertical Breakwater Problem. Breakwaters are constructed to provide sheltered bays for ships and protect harbor facilities. Moreover, in ports open to rough seas, they play a key role in port operations. Since sea waves are enormously powerful, it is not an easy matter to construct structures to mitigate their power. When designing a breakwater (see Figure 3), one looks for the optimal cross section that minimizes construction and maintenance costs during the breakwater's useful life and also satisfies reliability constraints guaranteeing that the work is reasonably safe for each failure mode. Optimization of this engineering
Figure 2: Network topologies corresponding to the three-input function in the example. (a) Using the available data directly, without applying the proposed methodology. (b) Using the knowledge obtained by the proposed methodology to eliminate some interaction terms.
design is extremely important because of the corresponding reduction in the associated cost. Analysis of the failure probabilities of the breakwater requires calculating failure probabilities and the annual failure rate for each failure mode. This implies determining the pressure produced on the breakwater crownwall by the sea waves. This problem involves nine input variables (listed in Table 3) and four output variables: p1 , p2 , p3 , and pu (the first three pi are, respectively, water pressure at the mean, freeboard and base levels of the concrete block, and pu is the maximum subpressure value).
Table 2: Simulation Results for a Sample Size n = 100 with Different Levels of Noise.

σ      MSE tr             NMSE tr            MSE test           NMSE test
0.00   0.00000 (0.00000)  0.00000 (0.00000)  0.00000 (0.00000)  0.00000 (0.00000)
0.02   0.00035 (0.00006)  0.00040 (0.00009)  0.00054 (0.00018)  0.00061 (0.00022)
0.04   0.00140 (0.00023)  0.00159 (0.00039)  0.00228 (0.00090)  0.00258 (0.00111)
0.06   0.00312 (0.00052)  0.00352 (0.00082)  0.00494 (0.00149)  0.00556 (0.00183)
0.08   0.00570 (0.00095)  0.00643 (0.00153)  0.00944 (0.00326)  0.01067 (0.00420)
0.10   0.00891 (0.00162)  0.00997 (0.00230)  0.01371 (0.00465)  0.01546 (0.00589)
0.12   0.01257 (0.00248)  0.01398 (0.00363)  0.02062 (0.00761)  0.02304 (0.00975)
0.14   0.01681 (0.00300)  0.01853 (0.00409)  0.02661 (0.00973)  0.02926 (0.01096)
0.16   0.02203 (0.00389)  0.02433 (0.00554)  0.03563 (0.01453)  0.03956 (0.01772)
0.18   0.02789 (0.00489)  0.03052 (0.00695)  0.04186 (0.01439)  0.04571 (0.01657)
0.20   0.03490 (0.00623)  0.03753 (0.00873)  0.05572 (0.02170)  0.06034 (0.02652)

Notes: Mean values of 100 replications. Standard deviations appear in parentheses.
In the case of the vertical breakwater, approximating formulas for calculating p_1, p_2, p_3, and p_u were given by Goda (1972) on the basis of his own theoretical and laboratory studies. These were later extended by other authors (Takahashi, 1996), obtaining:

p_1 = 0.5 (1 + \cos\theta)(\alpha_1 + \alpha_4 \cos^2\theta) \, w_0 H_D   (4.4)
p_2 = \alpha_2 p_1   (4.5)
p_3 = \alpha_3 p_1   (4.6)
p_u = 0.5 (1 + \cos\theta) \alpha_1 \alpha_2 \, w_0 H_D,   (4.7)

Figure 3: Typical cross section of a vertical breakwater.

where the nondimensional coefficients are given by

\alpha_1 = 0.6 + 0.5 \left[ \frac{4\pi h / L}{\sinh(4\pi h / L)} \right]^2   (4.8)

\alpha_2 = 1 - \frac{h'}{h} \left[ 1 - \frac{1}{\cosh(2\pi h / L)} \right]   (4.9)

\alpha_3 = 1 - \min\left[ 1, \; \frac{h_c}{0.75 (1 + \cos\theta) H_D} \right]   (4.10)

\alpha_4 = \max(\alpha_5, \alpha_I)   (4.11)

\alpha_5 = \min\left( (1 - d/h)(H_D/d)^2 / 3, \; 2d/H_D \right)   (4.12)

\alpha_I = \alpha_{I0} \, \alpha_{I1}   (4.13)

\alpha_{I0} = \begin{cases} H_D/d & \text{if } H_D \le 2d \\ 2 & \text{otherwise} \end{cases}   (4.14)

\alpha_{I1} = \begin{cases} \cos\delta_2 / \cosh\delta_1 & \text{if } \delta_2 \le 0 \\ 1 / \left[ \cosh\delta_1 (\cosh\delta_2)^{1/2} \right] & \text{otherwise} \end{cases}   (4.15)

\delta_1 = \begin{cases} 20\,\delta_{11} & \text{if } \delta_{11} \le 0 \\ 15\,\delta_{11} & \text{otherwise} \end{cases}   (4.16)

\delta_{11} = 0.93 \left( B_m/L - 0.12 \right) + 0.36 \left[ (h - d)/h - 0.6 \right]   (4.17)

\delta_2 = \begin{cases} 4.9\,\delta_{22} & \text{if } \delta_{22} \le 0 \\ 3\,\delta_{22} & \text{otherwise} \end{cases}   (4.18)

\delta_{22} = -0.36 \left( B_m/L - 0.12 \right) + 0.93 \left[ (h - d)/h - 0.6 \right].   (4.19)
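For readers who want to reproduce the data generation, the following is a direct transcription of equations 4.4 to 4.19 into Python (a sketch under the reconstruction above; the function name and argument order are ours, θ is in radians, h' is written hp, and consistent units are assumed):

```python
import numpy as np

def goda_takahashi(theta, w0, HD, h, L, d, hp, hc, Bm):
    """Wave pressures (p1, p2, p3, pu) from equations 4.4 to 4.19."""
    a1 = 0.6 + 0.5 * ((4*np.pi*h/L) / np.sinh(4*np.pi*h/L))**2          # eq. 4.8
    a2 = 1 - (hp/h) * (1 - 1/np.cosh(2*np.pi*h/L))                      # eq. 4.9
    a3 = 1 - min(1.0, hc / (0.75 * (1 + np.cos(theta)) * HD))           # eq. 4.10
    a5 = min((1 - d/h) * (HD/d)**2 / 3, 2*d/HD)                         # eq. 4.12
    d11 = 0.93*(Bm/L - 0.12) + 0.36*((h - d)/h - 0.6)                   # eq. 4.17
    d22 = -0.36*(Bm/L - 0.12) + 0.93*((h - d)/h - 0.6)                  # eq. 4.19
    d1 = 20*d11 if d11 <= 0 else 15*d11                                 # eq. 4.16
    d2 = 4.9*d22 if d22 <= 0 else 3*d22                                 # eq. 4.18
    aI0 = HD/d if HD <= 2*d else 2.0                                    # eq. 4.14
    aI1 = (np.cos(d2)/np.cosh(d1) if d2 <= 0
           else 1/(np.cosh(d1) * np.sqrt(np.cosh(d2))))                 # eq. 4.15
    a4 = max(a5, aI0 * aI1)                                             # eqs. 4.11, 4.13
    p1 = 0.5 * (1 + np.cos(theta)) * (a1 + a4*np.cos(theta)**2) * w0*HD # eq. 4.4
    pu = 0.5 * (1 + np.cos(theta)) * a1 * a2 * w0 * HD                  # eq. 4.7
    return p1, a2*p1, a3*p1, pu                                         # eqs. 4.5, 4.6
```

Evaluating this function on randomly drawn design points yields input-output pairs like those used to train the networks described below.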
Table 3: Input Variables for the Vertical Breakwater Problem.

θ     incidence angle of waves
w_0   water unit weight
H_D   design wave height
h     water depth
L     wave length
d     submerged depth of the vertical breakwater
h'    submerged depth of the vertical breakwater, including the width of the armor layer
h_c   free depth of the concrete block on the lee side
B_m   berm sea-side length
As can be observed, these formulas are very complicated and have continuity problems in their first derivatives because they use different expressions in different intervals that do not have the desired regularity properties at the change points. As a consequence, some optimization solvers, such as GAMS, fail to obtain the optimal solution in some cases. Therefore, more complex solvers, which are more costly in terms of computational power and time, are needed. To solve this problem, the idea is to generate a random set of input variables, calculate the corresponding outputs using the above formulas, and then use these data to train a network that may solve the problem satisfactorily. This application is very important from a practical point of view because it means that an approximation can be obtained for the formulas in equations 4.4 to 4.19 without continuity or regularity problems. Therefore, the optimization problem can be stated, and the optimal solution obtained, without any special computational requirements.

Almost all the variables involved in the breakwater problem can be represented as powers of fundamental magnitudes such as length and mass. Therefore, the breakwater problem can be simplified by the application of dimensional analysis, specifically its main theorem, the Π-theorem (Bridgman, 1922; Buckingham, 1914, 1915). This theorem states that any physical relation in terms of n variables can be rewritten as a new relation involving r fewer variables. When this theorem is applied, the input and output variables are transformed into dimensionless monomials. Different transformations are possible, but using engineering knowledge on breakwaters allows one to determine, without knowledge of formulas 4.4 to 4.19, that the dimensionless output variables
\frac{p_1}{w_0 H_D}, \; \frac{p_2}{w_0 H_D}, \; \frac{p_3}{w_0 H_D}, \; \frac{p_u}{w_0 H_D}   (4.20)

can be written in terms of the set of dimensionless input variables

\theta, \; \frac{h}{L}, \; \frac{h'}{h}, \; \frac{d}{h}, \; \frac{H_D}{d}, \; \frac{h_c}{H_D}, \; \frac{B_m}{L}.   (4.21)

Table 4: Monomials Required for Estimating the Variables p_1, p_2, p_3, and p_u.

Output/Input     h/L   h'/h   θ    d/h   H_D/d   B_M/L   h_c/H_D
p_1/(w_0 H_D)     √           √    √     √       √
p_2/(w_0 H_D)     √     √     √    √     √       √
p_3/(w_0 H_D)     √           √    √     √       √       √
p_u/(w_0 H_D)     √     √     √
Not all the dimensionless monomials in equation 4.21 are needed for estimating the dimensionless outputs (see equations 4.4 to 4.19). Table 4 summarizes the required ones. In order to choose the input data points for learning these functions, several different criteria can be used; knowledge about their ranges and statistical distribution is required. (For some interesting work on this and related problems, readers are referred to Sacks et al., 1989; Koehler & Owen, 1996; and Currin et al., 1991.) Knowledge about ranges is easy to obtain from experienced engineers. However, given the lack of knowledge about the statistical distribution of the input data values in existing breakwaters, a set S of 2000 input data points (1600 for training and 400 for testing) was generated randomly, assuming that each dimensionless variable in equation 4.21 is U(0, 1), that is, normalized in the interval (0, 1), in order to apply the methodology proposed. This is a reasonable assumption covering the actual ranges of the variables involved. Note that the ranges correspond to nondimensional variables. To show the performance of the proposed methodology (AFN), all of the dimensionless inputs will be considered for the estimation of the desired dimensionless outputs. It will be demonstrated that only the relevant input interactions appear in the learned models. In this way, we illustrate how one can use data from a given problem to obtain appropriate knowledge that can assist in deriving a topology for functional networks. To our knowledge, there is no other efficient method for deriving functional network topologies from data. Twenty replications with different starting values for the parameters were performed for each case studied. This example focuses on the estimation of p_u/(w_0 H_D), but p_1/(w_0 H_D), p_2/(w_0 H_D), and p_3/(w_0 H_D) were also learned in a similar way, and the most significant results are presented. From this point, we will refer to the dimensionless outputs as p_1, p_2, p_3, and p_u.
Table 5: Mean, Standard Deviation (STD), and Minimum and Maximum Values of the Normalized Mean Square Error (NMSE) for the Test Data Set.

Degrees          Mean NMSE       (STD)             Min             Max
2, 1, 1, 1, 1    2.3900 × 10^−1  (3.8184 × 10^−2)  1.9346 × 10^−1  3.2621 × 10^−1
2, 2, 1, 1, 1    7.4135 × 10^−2  (4.8200 × 10^−3)  6.5495 × 10^−2  8.4480 × 10^−2
2, 3, 1, 1, 1    6.6946 × 10^−2  (4.0007 × 10^−3)  5.9426 × 10^−2  7.6709 × 10^−2
2, 4, 1, 1, 1    6.7852 × 10^−2  (4.0342 × 10^−3)  5.9443 × 10^−2  7.7210 × 10^−2
3, 1, 1, 1, 1    9.4133 × 10^−1  (5.2477 × 10^−2)  8.3121 × 10^−1  1.0367
3, 3, 3, 1, 1    9.8419 × 10^−1  (5.7424 × 10^−2)  8.6991 × 10^−1  1.0875
3, 4, 1, 1, 1    9.5612 × 10^−1  (6.0454 × 10^−2)  8.3115 × 10^−1  1.0819
3, 6, 9, 1, 1    9.6607 × 10^−1  (6.6378 × 10^−2)  8.6482 × 10^−1  1.1104
2, 4, 3, 4, 5    7.4111 × 10^−2  (6.4231 × 10^−3)  6.1047 × 10^−2  8.5550 × 10^−2
3, 6, 3, 4, 5    1.0440          (6.9930 × 10^−2)  9.2294 × 10^−1  1.1540
2, 3, 1          6.8237 × 10^−2  (4.2030 × 10^−3)  5.8822 × 10^−2  7.4263 × 10^−2

Notes: Results for the estimation of p_u using the proposed methodology. The last row shows the results obtained when the AFN method was applied and a simplified topology was used.
The complexity of the proposed method grows exponentially with the number of inputs. To estimate p_u, seven input variables are given; therefore, a complex model is derived if all levels of interaction are considered. Thus, a simpler model was chosen in the first instance, and its complexity was gradually increased. This model is obtained by considering a reduced number of univariate functions and limiting the multivariate functions obtained from the corresponding tensor products. As in the previous example, the polynomial family was selected for estimating the desired output. The first column in Table 5 shows the different degrees taken into account; the first number is the degree for the univariate functions, the second one the degree for the bivariate functions, and so on. Note that functions of six and seven arguments were not included because there is no improvement in the performance results when more relations are considered. The global sensitivity indices related to the best performance results, and the corresponding total sensitivity indices, are shown in Tables 6 and 7, respectively. To be concise, all the monomials, or relations between them, with index values under 0.0005 were removed from Table 6. It can be seen that the monomials with the highest sensitivity values are the three required ones (the first three rows in Table 6; see Table 4). Besides, the relation between the necessary variables h/L and h'/h (fourth row in Table 6) is the most important one, although its value is low compared to those of the individual monomials. All other variables and relations have lower sensitivity values and thus were not considered. After removing the unimportant factors, the topology of the functional network is much simpler.
Table 6: Global Sensitivity Indices for the Univariate and Bivariate Polynomials When Estimating p_u.

                      Replication
Monomials        1       2       3       4       5       Mean
h/L              0.6644  0.6598  0.6544  0.6671  0.6594  0.6622
h'/h             0.2680  0.2763  0.2785  0.2626  0.2729  0.2728
θ                0.0169  0.0197  0.0161  0.0179  0.0197  0.0172
h/L, h'/h        0.0429  0.0362  0.0443  0.0440  0.0396  0.0392
h/L, θ           0.0033  0.0040  0.0029  0.0042  0.0046  0.0040
h/L, d/h         0.0002  0.0002  0.0000  0.0005  0.0002  0.0002
h/L, H_D/d       0.0002  0.0002  0.0005  0.0001  0.0005  0.0002
h'/h, θ          0.0022  0.0010  0.0013  0.0011  0.0007  0.0015
h'/h, d/h        0.0002  0.0006  0.0001  0.0000  0.0001  0.0002

Notes: Results obtained by the first five replications and mean of the 20 replications. Rows with values under 0.0005 are not included. Shown below the line are the global sensitivity indices for the relations between monomials.
Table 7: Total Sensitivity Indices When Estimating p_u.

                      Replication
Variables        1       2       3       4       5       Mean
h/L              0.7114  0.7007  0.7024  0.7163  0.7044  0.7061
h'/h             0.3135  0.3143  0.3243  0.3080  0.3138  0.3141
θ                0.0228  0.0254  0.0207  0.0238  0.0255  0.0234
d/h              0.0007  0.0011  0.0007  0.0011  0.0008  0.0010
H_D/d            0.0007  0.0012  0.0008  0.0005  0.0012  0.0010
B_M/L            0.0007  0.0007  0.0009  0.0012  0.0010  0.0010
h_c/H_D          0.0004  0.0004  0.0008  0.0011  0.0009  0.0008

Note: Results obtained by the first 5 replications and mean of the 20 replications.
It has only three univariate functions (second-degree polynomials) and a bivariate function relating the variables h/L and h'/h. However, the testing results after training this functional network are very similar to those obtained previously, as expected, because only the nonrelevant factors were removed. These results are presented in the last row of Table 5 to facilitate comparison. Note that the number of parameters is drastically reduced: there are 9 parameters when the proper dimensionless monomials are employed (3 univariate functions with 2 parameters each and 1 bivariate function with 3 parameters), whereas 77 parameters are used when all the dimensionless monomials are taken into account, considering the same degree for the polynomials (7 univariate functions with 2 parameters each and 21 bivariate functions with 3 parameters each).
Table 8: Mean, Standard Deviation (STD), and Minimum and Maximum Values of the Normalized Mean Square Error (NMSE) for the Test Set When Estimating p_1, p_2, and p_3.

       Method   Mean NMSE       (STD)             Min             Max
p_1    AFN      1.8693 × 10^−1  (2.6462 × 10^−2)  1.4714 × 10^−1  2.4397 × 10^−1
       FN       1.9795 × 10^−1  (2.6624 × 10^−2)  1.5455 × 10^−1  2.4882 × 10^−1
p_2    AFN      7.7311 × 10^−2  (7.0881 × 10^−3)  6.5819 × 10^−2  9.0760 × 10^−2
       FN       9.1324 × 10^−2  (1.0700 × 10^−2)  7.3875 × 10^−2  1.1522 × 10^−1
p_3    AFN      7.7058 × 10^−2  (1.1104 × 10^−2)  6.2961 × 10^−2  1.0444 × 10^−1
       FN       1.0552 × 10^−1  (1.7653 × 10^−2)  7.7863 × 10^−2  1.3590 × 10^−1
Table 9: Total Sensitivity Indices for the Best Approximation of p_1, p_2, and p_3.

Input/Output   p_1        p_2        p_3
h/L            0.550433   0.652522   0.183044
h'/h           0.003544   0.323328   0.001595
θ              0.220720   0.040430   0.129206
d/h            0.107570   0.016706   0.035828
H_D/d          0.124541   0.019912   0.041625
B_M/L          0.083042   0.013103   0.027788
h_c/H_D        0.003075   0.001099   0.636814
Similar experiments were carried out for the other output variables (p_1, p_2, and p_3). However, for simplicity and clarity, only the most significant results are included. The best performance results obtained by applying the proposed methodology (AFN) are shown in Tables 8 and 9, indicating the total sensitivity index for each input variable. The variables with the lowest values are those not checked in Table 4, so the proposed method is also able to discard the irrelevant variables in these cases. Considering the topology derived from the application of the proposed method, a functional network was then trained for each output, and its performance results are also included in Table 8 (rows entitled FN). It is important to remark that the simplification of the topology means a significant reduction in the number of parameters while maintaining the performance of the approach.

By applying the proposed method, interactions between variables are removed and irrelevant variables are found. This suggests that the AFN method can be applied as a feature selection method, that is, a method that reduces the number of original features or variables by selecting a subset of them. The advantages of reducing the number of inputs have been extensively discussed in the machine learning literature (Kohavi & John, 1997; Guyon & Elisseeff, 2003). Two of the most important ones are that
Table 10: Mean, Standard Deviation, and Minimum and Maximum Values of the Normalized MSE for the Test Set When Estimating p_u, Considering All the Inputs and Only the Three Relevant Ones.

Neurons   Mean NMSE        (STD)             Min              Max

Considering the seven variables
5         1.9880 × 10^−3   (1.2739 × 10^−3)  1.7747 × 10^−4   4.8721 × 10^−3
10        4.8390 × 10^−4   (1.0625 × 10^−3)  1.7887 × 10^−6   3.7790 × 10^−3
15        2.5818 × 10^−4   (7.5651 × 10^−4)  1.0661 × 10^−6   2.6543 × 10^−3
20        1.5951 × 10^−6   (2.2556 × 10^−6)  2.5091 × 10^−7   1.0151 × 10^−5
25        1.4011 × 10^−6   (2.6363 × 10^−6)  8.3305 × 10^−8   1.1963 × 10^−5
27        8.0122 × 10^−7   (8.1440 × 10^−7)  1.4289 × 10^−7   3.7679 × 10^−6
30        2.5952 × 10^−6   (6.3535 × 10^−6)  7.0063 × 10^−8   2.4735 × 10^−5

Using the three relevant variables
5         1.4199 × 10^−3   (9.9263 × 10^−4)  1.5737 × 10^−4   2.8286 × 10^−3
10        2.4766 × 10^−5   (6.7618 × 10^−5)  1.8441 × 10^−6   3.0382 × 10^−4
15        1.1263 × 10^−6   (9.7697 × 10^−7)  1.5265 × 10^−7   3.8548 × 10^−6
20        2.1324 × 10^−7   (2.7517 × 10^−7)  3.4903 × 10^−8   1.2734 × 10^−6
25        1.1236 × 10^−7   (1.0665 × 10^−7)  1.8267 × 10^−8   4.3665 × 10^−7
27        7.2760 × 10^−8   (4.1195 × 10^−8)  1.0599 × 10^−8   1.9125 × 10^−7
30        8.8819 × 10^−8   (7.7426 × 10^−8)  1.8998 × 10^−8   2.6956 × 10^−7

Note: Neurons refers to the number of neurons in the hidden layer.
it allows reducing computational complexity, and it improves performance results. As a feature selection method, the AFN method should indicate how to choose the relevant variables. This information is provided by the total sensitivity indices (TSI), which give a ranking of the variables in terms of their variance. Besides, a threshold is required to determine the variables to be selected, in such a way that variables with TSIs under the threshold are discarded. The establishment of this threshold is not a trivial issue and will be discussed in a further study. As a preliminary approach, the AFN method was used as a feature selection method for the breakwater problem. The threshold was established at 1%; that is, variables with a TSI over 1% were chosen (see Tables 7 and 9). Then multilayer perceptrons (MLP) were used to solve the breakwater problem in both cases (using all the given inputs and using only the inputs selected by the AFN method). The hyperbolic tangent was employed as the transfer function and the MSE as the performance function. The Levenberg-Marquardt learning algorithm was used to train the network (Levenberg, 1944; Marquardt, 1963). Two thousand epochs were employed because they were enough for the MLP to converge. The results are very representative in the case of p_u, because the number of inputs is drastically reduced from 7 to 3. Table 10 shows these results. As can be seen, the reduction in the number of inputs leads to better performance when the same number of neurons in the hidden layer is used,
Table 11: Mean, Standard Deviation (STD), and Minimum and Maximum Values of the Normalized Mean Square Error (NMSE) for the Test Set When Estimating p_1, p_2, and p_3, Considering All the Inputs (All) and Only the Relevant (Rel.) Ones.

       N^a   Var.^b   Mean NMSE       (STD)             Min             Max
p_1    35    All      1.4148 × 10^−2  (2.1045 × 10^−2)  2.8752 × 10^−3  9.8051 × 10^−2
       37    Rel.     3.6129 × 10^−3  (9.4444 × 10^−4)  2.2487 × 10^−3  5.0922 × 10^−3
p_2    21    All      1.0843 × 10^−2  (8.4366 × 10^−3)  4.1416 × 10^−3  3.6060 × 10^−2
       22    Rel.     7.3305 × 10^−3  (4.7671 × 10^−3)  2.6945 × 10^−3  2.2399 × 10^−2
p_3    37    All      6.3983 × 10^−3  (4.9059 × 10^−3)  2.4433 × 10^−3  2.2456 × 10^−2
       39    Rel.     3.9794 × 10^−3  (2.0662 × 10^−3)  2.0258 × 10^−3  9.1660 × 10^−3

Notes: ^a N refers to the number of neurons in the hidden layer. ^b Var. refers to the number of variables.
although it involves a smaller set of weights; that is, the network topology is simplified. The best results achieved when estimating p1, p2, and p3 are shown in Table 11, although there the variable reduction is not so important. Again, better performance results are achieved.

5 Conclusions and Future Work

In this article, a new methodology, the AFN method, based on ANOVA decomposition and functional networks, was presented and described. This methodology permits a simplified topology for a functional network to be learned from the available data. As stated in section 1, to our knowledge, there is no other method that permits this to be done. In addition to the advantages inherited from the ANOVA decomposition (uniqueness and orthogonality), the proposed methodology has the following advantages:

- Global sensitivity indices can be obtained from the application of the AFN. These can be used to establish the relevance of each functional component; consequently, they determine the input variables and relations to be selected. If a particular variable has no influence (in isolation or related to others), it can be discarded as an input for the functional or neural network.
- It allows learning and simplifying the topology of a functional or neural network. If the variables required for estimating a specific function are important, then the topology of the functional network should include these relationships.
- All existing multivariate interactions among variables are identified by the global sensitivity indices.
- Local sensitivity indices. Although the letter was focused on the global sensitivity indices for deriving the functional network topology, it is important to note that the proposed methodology also provides local sensitivity indices. These indices could be used to detect outliers in the samples or to perform selective sampling (these will be a future line of research for us).
- Several alternatives for selecting the orthonormal basis functions for the proposed approximation are available; moreover, a new one has been presented here. As the orthonormalization is easily accomplished by one of these alternatives, applying the AFN requires only a minimization problem to be solved.
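As an aside, the TSI-threshold rule used earlier for the breakwater problem is straightforward to express in code. The following is a minimal Python sketch of ours, not taken from the article; the function name is illustrative, and the TSI values shown are made up for the example rather than taken from the tables:

```python
def select_variables(tsi, threshold=0.01):
    """Indices of variables whose total sensitivity index exceeds the threshold (1% here)."""
    return [i for i, s in enumerate(tsi) if s > threshold]

# hypothetical TSIs for seven inputs; three survive the 1% threshold
print(select_variables([0.42, 0.003, 0.31, 0.008, 0.19, 0.002, 0.005]))  # [0, 2, 4]
```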
The suitability of the proposed methodology was illustrated by its application to a real engineering problem: the design of a vertical breakwater. The performance results obtained were compared with those obtained using functional and neural networks to solve the same problem. It was demonstrated that although the AFN obtains similar performance results, it also returns a set of sensitivity indices that permits the initial topologies to be simplified (functional networks) or some of the input variables to be eliminated (neural networks). In view of the results obtained, future work will involve adapting the proposed methodology for use as a feature subset selection method. A detailed study will be carried out comparing the proposed methodology with existing feature selection methods.

Acknowledgments

We are indebted to the Spanish Ministry of Science and Technology with FEDER funds (Projects DPI2002-04172-C04-02 and TIC-2003-00600), to the Xunta de Galicia (Project PGIDIT04-PXIC10502), and to Iberdrola for partial support of this work.

References

Aczél, J. (1966). Lectures on functional equations and their applications. New York: Academic Press.
Bridgman, P. (1922). Dimensional analysis. New Haven, CT: Yale University Press.
Brooke, A., Kendrik, D., Meeraus, A., & Raman, R. (1998). GAMS: A user's guide. Washington, DC: GAMS Development Corporation.
Buckingham, E. (1914). On physically similar systems: Illustrations of dimensional equations. Phys. Rev., 4, 345–376.
Buckingham, E. (1915). Model experiments and the form of empirical equations. Trans. ASME, 37, 263.
Castillo, E. (1998). Functional networks. Neural Processing Letters, 7, 151–159.
Castillo, E., Cobo, A., Gutiérrez, J. M., & Pruneda, E. (1998). Functional networks with applications. Boston: Kluwer.
Castillo, E., Cobo, A., Gutiérrez, J. M., & Pruneda, R. E. (1999). Working with differential, functional and difference equations using functional networks. Applied Mathematical Modeling, 23, 89–107.
Castillo, E., Cobo, A., Gutiérrez, J. M., & Pruneda, R. E. (2000). Functional networks: A new neural network based methodology. Computer-Aided Civil and Infrastructure Engineering, 15, 90–106.
Castillo, E., & Gutiérrez, J. M. (1998). Nonlinear time series modeling and prediction using functional networks: Extracting information masked by chaos. Physics Letters A, 244, 71–84.
Castillo, E., Gutiérrez, J., & Hadi, A. (1997). Sensitivity analysis in discrete Bayesian networks. IEEE Transactions on Systems, Man and Cybernetics, 26, 412–423.
Castillo, E., Iglesias, A., & Ruíz-Cobo, R. (2004). Functional equations in applied sciences. New York: Elsevier.
Castillo, E., & Ruíz-Cobo, R. (1992). Functional equations in science and engineering. New York: Marcel Dekker.
Chatterjee, S., & Hadi, A. S. (1988). Sensitivity analysis in linear regression. New York: Wiley.
Currin, C., Mitchell, T., Morris, M., & Ylvisaker, D. (1991). Bayesian prediction of deterministic functions, with applications to the design and analysis of computer experiments. Journal of the American Statistical Association, 86, 953–963.
Efron, B., & Stein, C. (1981). The jackknife estimate of variance. Annals of Statistics, 9, 586–596.
Fisher, R. (1918). The correlation between relatives on the supposition of Mendelian inheritance. Transactions of the Royal Society, 52, 399–433.
Goda, Y. (1972). Laboratory investigation of wave pressure exerted upon vertical and composite walls. Coastal Engineering, 15, 81–90.
Guyon, I., & Elisseef, A. (2003). An introduction to variable and feature selection. Journal of Machine Learning Research, 3, 1157–1182.
Hadi, A., & Nyquist, H. (2002). Sensitivity analysis in statistics. Journal of Statistical Studies: A Special Volume in Honor of Professor Mir Masoom Ali's 65th Birthday, 125–138.
Hoeffding, W. (1948). A class of statistics with asymptotically normal distribution. Annals of Mathematical Statistics, 19, 293–325.
Jiang, T., & Owen, A. (2001). Quasi-regression with shrinkage. Mathematics and Computers in Simulation, 62, 231–241.
Koehler, J., & Owen, A. (1996). Computer experiments: Design and analysis of experiments. In S. Ghosh & C. R. Rao (Eds.), Handbook of statistics, 13 (pp. 261–308). New York: Elsevier Science.
Kohavi, R., & John, G. (1997). Wrappers for feature subset selection. Artificial Intelligence, 97(1–2), 273–324.
Levenberg, K. (1944). A method for the solution of certain non-linear problems in least squares. Quarterly Journal of Applied Mathematics, 2(2), 164–168.
Marquardt, D. W. (1963). An algorithm for least-squares estimation of nonlinear parameters. Journal of the Society of Industrial and Applied Mathematics, 11(2), 431–441.
Sacks, J., Mitchell, W. W. T., & Wynn, H. (1989). Design and analysis of computer experiments. Statistical Science, 4(4), 409–435.
Saltelli, A., Chan, K., & Scott, M. (2000). Sensitivity analysis. New York: Wiley.
Sobol, I. M. (1969). Multidimensional quadrature formulas and Haar functions. Moscow: Nauka. (In Russian)
Sobol, I. M. (2001). Global sensitivity indices for nonlinear mathematical models and their Monte Carlo estimates. Mathematics and Computers in Simulation, 55, 271–280.
Takahashi, S. (1996). Design of vertical breakwaters (Tech. Rep. No. 34). Yokosuka, Japan: Port and Harbour Research Institute, Ministry of Transport.
Received August 12, 2004; accepted June 17, 2006.
LETTER
Communicated by Erin Bredensteiner
Second-Order Cone Programming Formulations for Robust Multiclass Classification

Ping Zhong
[email protected]
College of Science, China Agricultural University, Beijing, 100083, China
Masao Fukushima [email protected] Department of Applied Mathematics and Physics, Graduate School of Informatics, Kyoto University, Kyoto, 606-8501, Japan
Multiclass classification is an important and ongoing research subject in machine learning. Current support vector methods for multiclass classification implicitly assume that the parameters in the optimization problems are known exactly. However, in practice, the parameters have perturbations, since they are estimated from the training data, which are usually subject to measurement noise. In this article, we propose linear and nonlinear robust formulations for multiclass classification based on the M-SVM method. The preliminary numerical experiments confirm the robustness of the proposed method.

1 Introduction

Given L labeled examples known to come from K (> 2) classes, $T = \{(x_p, \theta_p)\}_{p=1}^{L} \subset X \times Y$, where $X \subset \mathbb{R}^N$ and $Y = \{1, \dots, K\}$, multiclass classification refers to the construction of a discriminant function from the input space X onto the unordered set of classes Y. Support vector machines (SVMs) serve as a useful and popular tool for classification. Recent developments in the study of SVMs show that there are roughly two types of approaches to tackle the multiclass classification problem. One is to construct and fuse several binary classifiers, such as "one-against-all" (Bottou et al., 1994; Vapnik, 1998), "one-against-one" (Hastie & Tibshirani, 1998; Kressel, 1999), directed acyclic graph SVM (DAGSVM; Platt, Cristianini, & Shawe-Taylor, 2000), error-correcting output code (ECOC; Allwein, Schapire, & Singer, 2001; Dietterich & Bakiri, 1995), the K-SVCR method (Angulo, Parra, & Català, 2003), and the ν-K-SVCR
Neural Computation 19, 258–282 (2007)
C 2006 Massachusetts Institute of Technology
method (Zhong & Fukushima, 2006), among others. The other, called "all-together," is to consider all data in one optimization formulation (Bennett & Mangasarian, 1994; Bredensteiner & Bennett, 1999; Guermeur, 2002; Vapnik, 1998; Weston & Watkins, 1998; Yajima, 2005). In this letter, we focus on the second approach.
There are several all-together methods. The method independently proposed by Vapnik (1998) and Weston and Watkins (1998) is similar to one-against-all. It constructs K two-class discriminants, where each discriminant separates a single class from all the others. Hence, there are K decision functions, but all are obtained by solving one optimization problem. Bennett and Mangasarian (1994) constructed a piecewise-linear discriminant for K-class classification by a single linear program. The method called M-SVM (Bredensteiner & Bennett, 1999) extends their method to generate a kernel-based nonlinear K-class discriminant by solving a convex quadratic program. Although the original forms proposed by Vapnik (1998), Weston and Watkins (1998), and Bredensteiner and Bennett (1999) are different, they are not only equivalent to each other but also equivalent to that proposed by Guermeur (2002). Based on M-SVM, linear programming formulations have been proposed in a low-dimensional feature subspace (Yajima, 2005).
In the methods noted, the parameters in the optimization problems are implicitly assumed to be known exactly. However, in practice, these parameters have perturbations, since they are estimated from training data that are usually corrupted by measurement noise. As pointed out by Goldfarb and Iyengar (2003), the solutions of optimization problems are sensitive to parameter perturbations. Errors in the input space tend to get amplified in the decision function, which often results in misclassification. So it is useful to explore formulations that yield discriminants robust to such estimation errors. In this article, we propose a robust formulation of M-SVM, which is represented as a second-order cone program (SOCP). The second-order cone (SOC) in $\mathbb{R}^n$ ($n \ge 1$), also called the Lorentz cone, is the convex cone defined by

$$\mathcal{K}_n = \left\{ \begin{pmatrix} z_0 \\ \bar z \end{pmatrix} : z_0 \in \mathbb{R},\ \bar z \in \mathbb{R}^{n-1},\ \|\bar z\| \le z_0 \right\},$$

where $\|\cdot\|$ denotes the Euclidean norm. The SOCP is a special class of convex optimization problems involving SOC constraints, which can be efficiently solved by interior point methods. Work related to SOCP can be seen, for example, in Alizadeh and Goldfarb (2003), Fukushima, Luo, and Tseng (2002), Hayashi, Yamashita, and Fukushima (2005), and Lobo, Vandenberghe, Boyd, and Lébret (1998).
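As a quick aside, the cone membership test is easy to play with numerically. The following minimal Python/NumPy sketch is our illustration, not part of the letter; the function name in_soc and the tolerance are assumptions:

```python
import numpy as np

def in_soc(z, tol=1e-12):
    """Check whether z = (z0, zbar) lies in the second-order (Lorentz) cone K_n."""
    z0, zbar = z[0], z[1:]
    return np.linalg.norm(zbar) <= z0 + tol

# ||(0.6, 0.6)|| ~= 0.849 <= 1, so this point is in K_3
print(in_soc(np.array([1.0, 0.6, 0.6])))  # True
```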
The letter is organized as follows. We first propose a robust formulation for piecewise-linear M-SVM in section 2 and then construct a robust classifier based on the dual SOCP formulation in section 3. In section 4, we extend the robust classifier to the piecewise-nonlinear M-SVM case. Section 5 gives numerical results. Section 6 concludes the letter.

2 Robust Piecewise-Linear M-SVM Formulation

For each i, let $\mathcal{A}^i$ be a set of examples in the N-dimensional real space $\mathbb{R}^N$ with cardinality $l_i$. Let $A^i$ be an $l_i \times N$ matrix whose rows are the examples in $\mathcal{A}^i$. The pth example in $\mathcal{A}^i$ and the pth row of $A^i$ are both denoted $A_p^i$. Let $e^i$ denote the vector of ones of dimension $l_i$. For each i, let $w^i$ be a vector in $\mathbb{R}^N$ and $b^i$ be a real number. The sets $\mathcal{A}^i$, $i = 1, \dots, K$, are called piecewise-linearly separable (Bredensteiner & Bennett, 1999) if there exist $w^i$ and $b^i$, $i = 1, \dots, K$, such that

$$A^i w^i - b^i e^i > A^i w^j - b^j e^i, \quad i, j = 1, \dots, K, \quad i \ne j.$$
Piecewise-linear M-SVM can be formulated as follows (Bredensteiner & Bennett, 1999):

$$\min_{w,\,b,\,y}\ \nu\left(\frac{1}{2}\sum_{i=1}^{K}\sum_{j=1}^{i-1}\|w^i-w^j\|^2+\frac{1}{2}\sum_{i=1}^{K}\|w^i\|^2\right)+(1-\nu)\sum_{i=1}^{K}\sum_{j=1,\,j\ne i}^{K}(e^i)^T y^{ij}$$
$$\text{s.t.}\quad A^i(w^i-w^j)-(b^i-b^j)e^i+y^{ij}\ge e^i,\quad y^{ij}\ge 0,\quad i,j=1,\dots,K,\ i\ne j, \tag{2.1}$$

where $\nu\in(0,1]$,

$$w=\big((w^1)^T,(w^2)^T,\dots,(w^K)^T\big)^T\in\mathbb{R}^{KN}, \tag{2.2}$$
$$b=\big(b^1,b^2,\dots,b^K\big)^T\in\mathbb{R}^{K}, \tag{2.3}$$
$$y=\big((y^{12})^T,\dots,(y^{1K})^T,\dots,(y^{K1})^T,\dots,(y^{K(K-1)})^T\big)^T\in\mathbb{R}^{L(K-1)}, \tag{2.4}$$

and $L=\sum_{i=1}^{K}l_i$. When ν = 1, equation 2.1 is the formulation for the piecewise-linearly separable case. Otherwise, it is the formulation for the piecewise-linearly inseparable case. Figure 1 shows an example of a piecewise-linearly separable M-SVM for three classes in two dimensions.
The training data $A^i$, $i=1,\dots,K$, used in problem 2.1, are implicitly assumed to be known exactly. However, in practice, training data are often corrupted by measurement noise. Errors in the input space tend to get amplified in the decision function, which often results in misclassification. For example, suppose each example in Figure 1 is allowed to move in a sphere (see Figure 2). The original discriminants cannot separate the training data
[Figure 1 appears here: three point sets A1, A2, A3 in the plane separated by the pairwise discriminants (w1 − w2, b1 − b2), (w1 − w3, b1 − b3), and (w2 − w3, b2 − b3), with the margin hyperplanes (w1 − w2)^T x = (b1 − b2) ± 1 drawn.]

Figure 1: Three classes separated by piecewise-linear M-SVM discriminants.

[Figure 2 appears here: the same three classes, with each example allowed to move within a sphere.]

Figure 2: An example of the effect of measurement noises.
sets in the worst case. It will be useful to explore formulations that can yield discriminants robust to such estimation errors. In the following, we discuss such a formulation. We assume

$$\hat A_p^i = A_p^i + \rho_p^i (a_p^i)^T, \tag{2.5}$$

where $\hat A_p^i$ is the actual value of the training data and $\rho_p^i (a_p^i)^T$ is the measurement noise, with $a_p^i \in \mathbb{R}^N$, $\|a_p^i\| = 1$, and $\rho_p^i \ge 0$ being a given constant. Denote the unit sphere in $\mathbb{R}^N$ by $U = \{a \in \mathbb{R}^N : \|a\| = 1\}$. The robust
version of formulation 2.1 can be stated as follows:

$$\min_{w,\,b,\,y}\ \nu\left(\frac{1}{2}\sum_{i=1}^{K}\sum_{j=1}^{i-1}\|w^i-w^j\|^2+\frac{1}{2}\sum_{i=1}^{K}\|w^i\|^2\right)+(1-\nu)\sum_{i=1}^{K}\sum_{j=1,\,j\ne i}^{K}(e^i)^T y^{ij}$$
$$\text{s.t.}\quad A_p^i(w^i-w^j)+\rho_p^i (a_p^i)^T(w^i-w^j)-(b^i-b^j)+y_p^{ij}\ge 1,\quad \forall\, a_p^i\in U,$$
$$y_p^{ij}\ge 0,\quad p=1,\dots,l_i,\ i,j=1,\dots,K,\ i\ne j. \tag{2.6}$$
Since $\min\{\rho_p^i (a_p^i)^T(w^i-w^j) : a_p^i \in U\} = -\rho_p^i\|w^i-w^j\|$, problem 2.6 is equivalent to the following SOCP:

$$\min_{w,\,b,\,y}\ \nu\left(\frac{1}{2}\sum_{i=1}^{K}\sum_{j=1}^{i-1}\|w^i-w^j\|^2+\frac{1}{2}\sum_{i=1}^{K}\|w^i\|^2\right)+(1-\nu)\sum_{i=1}^{K}\sum_{j=1,\,j\ne i}^{K}(e^i)^T y^{ij}$$
$$\text{s.t.}\quad A_p^i(w^i-w^j)-\rho_p^i\|w^i-w^j\|-(b^i-b^j)+y_p^{ij}\ge 1,$$
$$y_p^{ij}\ge 0,\quad p=1,\dots,l_i,\ i,j=1,\dots,K,\ i\ne j. \tag{2.7}$$
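The worst-case reduction used to pass from problem 2.6 to problem 2.7 can be checked numerically. Here is a small sketch of ours (the dimension, sample size, and seed are arbitrary choices), comparing the sampled minimum of $\rho\,a^T v$ over unit vectors a with the closed form $-\rho\|v\|$:

```python
import numpy as np

rng = np.random.default_rng(0)
v = rng.normal(size=5)      # plays the role of w^i - w^j
rho = 0.3

# sample unit vectors a on the sphere U and take the smallest value of rho * a^T v
a = rng.normal(size=(200000, 5))
a /= np.linalg.norm(a, axis=1, keepdims=True)
print(rho * (a @ v).min())            # approaches the worst case below
print(-rho * np.linalg.norm(v))       # closed-form worst case used in problem 2.7
```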
Denote $Q=(K+1)I_{KN}-\Gamma$, with $I_{KN}$ being the identity matrix of order $KN$ and

$$\Gamma=\begin{pmatrix} I_N & I_N & \cdots & I_N\\ I_N & I_N & \cdots & I_N\\ \vdots & \vdots & \ddots & \vdots\\ I_N & I_N & \cdots & I_N \end{pmatrix}\in\mathbb{R}^{KN\times KN}.$$

Denote $e=[\underbrace{(e^1)^T,\dots,(e^1)^T}_{K-1},\dots,\underbrace{(e^K)^T,\dots,(e^K)^T}_{K-1}]^T\in\mathbb{R}^{L(K-1)}$. The objective function of problem 2.7 can then be expressed compactly as

$$\frac{\nu}{2}\,w^T Q w+(1-\nu)e^T y. \tag{2.8}$$
Additionally, Q is a symmetric positive definite matrix, which can be inferred from the following proposition. The proof of the proposition is omitted since it is similar to that given by Yajima (2005).

Proposition 1. Denote $C=\sqrt{K+1}\,I_{KN}-\frac{\sqrt{K+1}-1}{K}\,\Gamma$. Then

1. $Q=C^2$.
2. C is nonsingular, and $C^{-1}=\frac{1}{\sqrt{K+1}}\,I_{KN}+\frac{\sqrt{K+1}-1}{K\sqrt{K+1}}\,\Gamma$.
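Proposition 1 is easy to verify numerically for small K and N. The sketch below is our illustration (Gamma stands for the block matrix Γ of N × N identity blocks defined above); it confirms both parts:

```python
import numpy as np

K, N = 3, 4
I = np.eye(K * N)
Gamma = np.tile(np.eye(N), (K, K))   # K x K blocks, each equal to I_N
Q = (K + 1) * I - Gamma
C = np.sqrt(K + 1) * I - (np.sqrt(K + 1) - 1) / K * Gamma
Cinv = I / np.sqrt(K + 1) + (np.sqrt(K + 1) - 1) / (K * np.sqrt(K + 1)) * Gamma
print(np.allclose(C @ C, Q))      # part 1: Q = C^2
print(np.allclose(C @ Cinv, I))   # part 2: the closed-form inverse
```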
Let $H^{ij}$ be the $KN\times N$ matrix with all blocks being $N\times N$ zero matrices except the ith block being $I_N$ and the jth block being $-I_N$:

$$H^{ij}=[O,\dots,O,I_N,O,\dots,O,-I_N,O,\dots,O]^T. \tag{2.9}$$

Then, by equation 2.2, we get

$$w^i-w^j=(H^{ij})^T w. \tag{2.10}$$

Let $r^{ij}$ be the K-dimensional vector with all components being zero except the ith component being 1 and the jth component being −1:

$$r^{ij}=[0,\dots,0,1,0,\dots,0,-1,0,\dots,0]^T. \tag{2.11}$$

Then, by equation 2.3, we get

$$b^i-b^j=(r^{ij})^T b. \tag{2.12}$$

Let $h_p^{ij}$ be the $L(K-1)$-dimensional vector with all components being zero except the $\big((K-1)\sum_{k=1}^{i-1}l_k+(j-1)l_i+p\big)$th component being 1:

$$h_p^{ij}=[0,\dots,0,\dots,0,1,0,\dots,0,\dots,0]^T. \tag{2.13}$$

Then, by equation 2.4, we get

$$y_p^{ij}=(h_p^{ij})^T y. \tag{2.14}$$

By equations 2.10, 2.12, and 2.14, the first constraint in problem 2.7 can be rewritten as follows:

$$\rho_p^i\,\big\|(H^{ij})^T w\big\| \le A_p^i (H^{ij})^T w-(r^{ij})^T b+(h_p^{ij})^T y-1. \tag{2.15}$$
Therefore, by equations 2.8 and 2.15 and proposition 1, formulation 2.7 can be written as follows:

$$\min_{w,\,b,\,y,\,t}\ \nu t+(1-\nu)e^T y$$
$$\text{s.t.}\quad \frac{1}{2}\|Cw\|^2\le t,$$
$$\rho_p^i\,\big\|(H^{ij})^T w\big\|\le A_p^i(H^{ij})^T w-(r^{ij})^T b+(h_p^{ij})^T y-1,$$
$$p=1,\dots,l_i,\ i,j=1,\dots,K,\ i\ne j,\quad y\ge 0. \tag{2.16}$$

Furthermore, formulation 2.16 can be cast as the following SOCP:

$$\min_{w,\,b,\,y,\,t}\ \nu t+(1-\nu)e^T y$$
$$\text{s.t.}\quad \left\|\begin{pmatrix}\sqrt{2}\,Cw\\ 1-t\end{pmatrix}\right\|\le 1+t,$$
$$\rho_p^i\,\big\|(H^{ij})^T w\big\|\le A_p^i(H^{ij})^T w-(r^{ij})^T b+(h_p^{ij})^T y-1,$$
$$p=1,\dots,l_i,\ i,j=1,\dots,K,\ i\ne j,\quad y\ge 0. \tag{2.17}$$
3 Robust Piecewise-Linear M-SVM Classifier

In this section, we construct a robust piecewise-linear M-SVM classifier based on the dual formulation of problem 2.17.

3.1 Dual of the Robust Piecewise-Linear M-SVM Formulation. Denote

$$\bar A=\big[B_1^T,B_2^T,\dots,B_K^T\big]^T\in\mathbb{R}^{L(K-1)\times KN} \tag{3.1}$$
with

$$B_i=\begin{pmatrix} -A^i & \cdots & O & A^i & O & \cdots & O\\ \vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots\\ O & \cdots & -A^i & A^i & O & \cdots & O\\ O & \cdots & O & A^i & -A^i & \cdots & O\\ \vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots\\ O & \cdots & O & A^i & O & \cdots & -A^i \end{pmatrix}\in\mathbb{R}^{l_i(K-1)\times KN}.$$

Denote

$$\bar H=\big[\bar M_1^T,\bar M_2^T,\dots,\bar M_K^T\big]^T\in\mathbb{R}^{LN(K-1)\times KN}, \tag{3.2}$$

where

$$\bar M_i=\begin{pmatrix} -M_i & \cdots & O & M_i & O & \cdots & O\\ \vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots\\ O & \cdots & -M_i & M_i & O & \cdots & O\\ O & \cdots & O & M_i & -M_i & \cdots & O\\ \vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots\\ O & \cdots & O & M_i & O & \cdots & -M_i \end{pmatrix}\in\mathbb{R}^{l_i N(K-1)\times KN},$$

with

$$M_i := M_i(\rho)=\big[\rho_1^i I_N,\dots,\rho_{l_i}^i I_N\big]^T\in\mathbb{R}^{l_i N\times N},\quad i=1,\dots,K.$$

Denote

$$\bar E=\big[E_1^T,E_2^T,\dots,E_K^T\big]^T\in\mathbb{R}^{L(K-1)\times K} \tag{3.3}$$

with

$$E_i=\begin{pmatrix} -e^i & \cdots & 0 & e^i & 0 & \cdots & 0\\ \vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots\\ 0 & \cdots & -e^i & e^i & 0 & \cdots & 0\\ 0 & \cdots & 0 & e^i & -e^i & \cdots & 0\\ \vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots\\ 0 & \cdots & 0 & e^i & 0 & \cdots & -e^i \end{pmatrix}\in\mathbb{R}^{l_i(K-1)\times K}.$$
We can derive the following dual of problem 2.17 (see appendix A):

$$\max_{\alpha,\,s,\,\sigma,\,\tau}\ e^T\alpha-(\sigma+\tau)$$
$$\text{s.t.}\quad \bar E^T\alpha=0,\quad \alpha\le(1-\nu)e,\quad \sigma-\tau=\nu,$$
$$\left\|\begin{pmatrix}-\frac{1}{\sqrt{2(K+1)}}(\bar A^T\alpha+\bar H^T s)\\ \tau\end{pmatrix}\right\|\le\sigma,$$
$$\|s_p^{ij}\|\le\alpha_p^{ij},\quad p=1,\dots,l_i,\ i,j=1,\dots,K,\ j\ne i, \tag{3.4}$$

where

$$\alpha=\big[(\alpha^{12})^T,\dots,(\alpha^{1K})^T,\dots,(\alpha^{K1})^T,\dots,(\alpha^{K(K-1)})^T\big]^T\in\mathbb{R}^{L(K-1)}, \tag{3.5}$$
$$s=\big[(s_1^{12})^T,\dots,(s_{l_1}^{12})^T,\dots,(s_1^{1K})^T,\dots,(s_{l_1}^{1K})^T,\dots,(s_1^{K(K-1)})^T,\dots,(s_{l_K}^{K(K-1)})^T\big]^T\in\mathbb{R}^{LN(K-1)}. \tag{3.6}$$
In addition, we get the following complementary equations at optimality:

$$\begin{pmatrix}\alpha_p^{ij}\\ s_p^{ij}\end{pmatrix}^T\begin{pmatrix}A_p^i(H^{ij})^T w-(r^{ij})^T b+(h_p^{ij})^T y-1\\ \rho_p^i(H^{ij})^T w\end{pmatrix}=0,\quad p=1,\dots,l_i,\ i,j=1,\dots,K,\ j\ne i, \tag{3.7}$$

$$\begin{pmatrix}\sigma\\ -\frac{1}{\sqrt{2(K+1)}}(\bar A^T\alpha+\bar H^T s)\\ \tau\end{pmatrix}^T\begin{pmatrix}1+t\\ \sqrt{2}\,Cw\\ 1-t\end{pmatrix}=0, \tag{3.8}$$

$$\big((1-\nu)e-\alpha\big)^T y=0. \tag{3.9}$$
3.2 Robust Classifier. From formulation 3.4, we get σ > 0. In fact, if σ = 0, then τ = 0, and the third constraint of formulation 3.4 becomes ν = 0, which contradicts ν > 0. By the complementary equation 3.8, we have the following implications (see appendix B for the complementary conditions in SOCP, equations B.1 to B.3). If

$$\left\|\begin{pmatrix}-\frac{1}{\sqrt{2(K+1)}}(\bar A^T\alpha+\bar H^T s)\\ \tau\end{pmatrix}\right\|<\sigma,$$

then

$$\left\|\begin{pmatrix}\sqrt{2}\,Cw\\ 1-t\end{pmatrix}\right\|=1+t=0.$$

But this contradicts t ≥ 0. So we must have

$$\left\|\begin{pmatrix}-\frac{1}{\sqrt{2(K+1)}}(\bar A^T\alpha+\bar H^T s)\\ \tau\end{pmatrix}\right\|=\sigma.$$

Since σ > 0, we have

$$\left\|\begin{pmatrix}\sqrt{2}\,Cw\\ 1-t\end{pmatrix}\right\|=1+t.$$

Hence, there exists µ > 0 such that

$$\sqrt{2}\,Cw=\frac{\mu}{\sqrt{2(K+1)}}(\bar A^T\alpha+\bar H^T s)\quad\text{and}\quad 1-t=-\mu\tau. \tag{3.10}$$

In addition, it is easy to get the following equalities by proposition 1:

$$C^{-1}\bar A^T=\frac{1}{\sqrt{K+1}}\bar A^T\quad\text{and}\quad C^{-1}\bar H^T=\frac{1}{\sqrt{K+1}}\bar H^T. \tag{3.11}$$
Hence, by equations 3.10 and 3.11, we get

$$w=\frac{t-1}{2\tau(K+1)}(\bar A^T\alpha+\bar H^T s).$$

Furthermore, by equations 2.2, 3.1, and 3.2, we get

$$w^i=\frac{t-1}{2\tau(K+1)}\sum_{j=1,\,j\ne i}^{K}\left(\sum_{p=1}^{l_i}\alpha_p^{ij}(A_p^i)^T-\sum_{p=1}^{l_j}\alpha_p^{ji}(A_p^j)^T+\sum_{p=1}^{l_i}\rho_p^i s_p^{ij}-\sum_{p=1}^{l_j}\rho_p^j s_p^{ji}\right).$$

Therefore, the decision functions are given by

$$f_i(x)=x^T w^i-b^i=\frac{t-1}{2\tau(K+1)}\sum_{j=1,\,j\ne i}^{K}\left(\sum_{p=1}^{l_i}\alpha_p^{ij}x^T(A_p^i)^T-\sum_{p=1}^{l_j}\alpha_p^{ji}x^T(A_p^j)^T+\sum_{p=1}^{l_i}\rho_p^i x^T s_p^{ij}-\sum_{p=1}^{l_j}\rho_p^j x^T s_p^{ji}\right)-b^i,\quad i=1,\dots,K. \tag{3.12}$$

In particular, if we set $\rho_p^i=0$, $i=1,\dots,K$, $p=1,\dots,l_i$, then equation 3.12 becomes

$$f_i(x)=\frac{t-1}{2\tau(K+1)}\sum_{j=1,\,j\ne i}^{K}\left(\sum_{p=1}^{l_i}\alpha_p^{ij}x^T(A_p^i)^T-\sum_{p=1}^{l_j}\alpha_p^{ji}x^T(A_p^j)^T\right)-b^i,\quad i=1,\dots,K. \tag{3.13}$$

Since $\rho_p^i=0$, $p=1,\dots,l_i$, $i=1,\dots,K$, implies that the parameter perturbations are not considered (cf. equation 2.5), equation 3.13 corresponds to the discriminants for the case of no measurement noise. With these decision functions, the classification of an example x is to find a class i such that $f_i(x)=\max\{f_1(x),\dots,f_K(x)\}$.
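In code, this final classification step is a one-line argmax. Here is a minimal sketch of ours; it assumes the linear decision values $f_i(x)=x^T w^i-b^i$ have already been assembled from the dual solution as in equation 3.12 or 3.13, and the function name is illustrative:

```python
import numpy as np

def classify(x, W, b):
    """Assign x to the class with the largest decision value f_i(x) = x^T w_i - b_i.
    W has the vectors w_i as rows; classes are numbered 1..K."""
    return int(np.argmax(W @ x - b)) + 1
```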
4 Robust Piecewise-Nonlinear M-SVM Classifier

The above discussion is concerned with the piecewise-linear case. In this section, the analysis will be extended to the nonlinear case.
To construct separating functions in a higher-dimensional feature space, a nonlinear mapping $\psi:X\to F$ is used to transform the original examples into the feature space, which is equipped with the inner product defined by $k(x,x')=\langle\psi(x),\psi(x')\rangle$, where $k(\cdot,\cdot):\mathbb{R}^N\times\mathbb{R}^N\to\mathbb{R}$ is a function called a kernel. Typical choices of kernels include polynomial kernels $k(x,x')=(x^Tx'+1)^d$ with an integer parameter d and radial basis function (RBF) kernels $k(x,x')=\exp(-\|x-x'\|^2/\kappa)$ with a real parameter κ.

4.1 Robust Piecewise-Nonlinear M-SVM Formulation. We assume

$$\psi\big((\hat A_p^i)^T\big)=\psi\big((A_p^i)^T\big)+\tilde\rho_p^i\,\tilde a_p^i,\quad \tilde a_p^i\in\tilde U, \tag{4.1}$$

where $\tilde U$ is a unit sphere in the feature space. For the nonlinear case, $\tilde\rho_p^i$ in the feature space associated with a kernel $k(\cdot,\cdot)$ can be computed as

$$\tilde\rho_p^i=\big\|\psi\big((\hat A_p^i)^T\big)-\psi\big((A_p^i)^T\big)\big\|
=\Big(\big\langle\psi((\hat A_p^i)^T),\psi((\hat A_p^i)^T)\big\rangle-2\big\langle\psi((\hat A_p^i)^T),\psi((A_p^i)^T)\big\rangle+\big\langle\psi((A_p^i)^T),\psi((A_p^i)^T)\big\rangle\Big)^{1/2}$$
$$=\Big(k\big((\hat A_p^i)^T,(\hat A_p^i)^T\big)-2k\big((\hat A_p^i)^T,(A_p^i)^T\big)+k\big((A_p^i)^T,(A_p^i)^T\big)\Big)^{1/2}.$$

For example, for RBF kernels, since

$$k\big((\hat A_p^i)^T,(\hat A_p^i)^T\big)=1,\quad k\big((\hat A_p^i)^T,(A_p^i)^T\big)=\exp\big(-(\rho_p^i)^2/\kappa\big),\quad k\big((A_p^i)^T,(A_p^i)^T\big)=1,$$

we have

$$\tilde\rho_p^i=\Big(2-2\exp\big(-(\rho_p^i)^2/\kappa\big)\Big)^{1/2}. \tag{4.2}$$
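Equation 4.2 is simple to evaluate; the sketch below is our illustration (the function name is an assumption), mapping an input-space noise radius ρ to its feature-space counterpart ρ̃ for the RBF kernel:

```python
import numpy as np

def rho_feature_space(rho, kappa):
    """Feature-space noise radius for the RBF kernel (equation 4.2)."""
    return np.sqrt(2.0 - 2.0 * np.exp(-rho**2 / kappa))

# the noise level and kernel width used in the experiments of section 5
print(rho_feature_space(0.3, 2.0))   # ~ 0.297
```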
The robust version of the piecewise-nonlinear M-SVM can be expressed as follows:

$$\min_{w,\,b,\,y}\ \nu\left(\frac{1}{2}\sum_{i=1}^{K}\sum_{j=1}^{i-1}\|w^i-w^j\|^2+\frac{1}{2}\sum_{i=1}^{K}\|w^i\|^2\right)+(1-\nu)\sum_{i=1}^{K}\sum_{j=1,\,j\ne i}^{K}(e^i)^T y^{ij}$$
$$\text{s.t.}\quad \big(\psi((A_p^i)^T)\big)^T(w^i-w^j)+\tilde\rho_p^i(\tilde a_p^i)^T(w^i-w^j)-(b^i-b^j)+y_p^{ij}\ge 1,\quad\forall\,\tilde a_p^i\in\tilde U,$$
$$y_p^{ij}\ge 0,\quad p=1,\dots,l_i,\ i,j=1,\dots,K,\ i\ne j,$$

which can be rewritten as the following SOCP:

$$\min_{w,\,b,\,y}\ \nu\left(\frac{1}{2}\sum_{i=1}^{K}\sum_{j=1}^{i-1}\|w^i-w^j\|^2+\frac{1}{2}\sum_{i=1}^{K}\|w^i\|^2\right)+(1-\nu)\sum_{i=1}^{K}\sum_{j=1,\,j\ne i}^{K}(e^i)^T y^{ij}$$
$$\text{s.t.}\quad \big(\psi((A_p^i)^T)\big)^T(w^i-w^j)-\tilde\rho_p^i\|w^i-w^j\|-(b^i-b^j)+y_p^{ij}\ge 1,$$
$$y_p^{ij}\ge 0,\quad p=1,\dots,l_i,\ i,j=1,\dots,K,\ i\ne j. \tag{4.3}$$
Denote

$$\tilde A=\big[\tilde B_1^T,\tilde B_2^T,\dots,\tilde B_K^T\big]^T,$$

where

$$\tilde B_i=\begin{pmatrix} -\Phi(A^i) & \cdots & O & \Phi(A^i) & O & \cdots & O\\ \vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots\\ O & \cdots & -\Phi(A^i) & \Phi(A^i) & O & \cdots & O\\ O & \cdots & O & \Phi(A^i) & -\Phi(A^i) & \cdots & O\\ \vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots\\ O & \cdots & O & \Phi(A^i) & O & \cdots & -\Phi(A^i) \end{pmatrix},$$

with

$$\Phi(A^i)=\big[\psi((A_1^i)^T),\dots,\psi((A_{l_i}^i)^T)\big]^T.$$

Denote

$$\tilde H=\big[\tilde M_1^T,\tilde M_2^T,\dots,\tilde M_K^T\big]^T,$$

where

$$\tilde M_i=\begin{pmatrix} -M_i & \cdots & O & M_i & O & \cdots & O\\ \vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots\\ O & \cdots & -M_i & M_i & O & \cdots & O\\ O & \cdots & O & M_i & -M_i & \cdots & O\\ \vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots\\ O & \cdots & O & M_i & O & \cdots & -M_i \end{pmatrix},$$

with

$$M_i := M_i(\tilde\rho)=\big[\tilde\rho_1^i I_N,\dots,\tilde\rho_{l_i}^i I_N\big]^T. \tag{4.4}$$
In a similar manner to that of getting formulation 3.4, we get the dual of problem 4.3 as follows:

$$\max_{\alpha,\,s,\,\sigma,\,\tau}\ e^T\alpha-(\sigma+\tau)$$
$$\text{s.t.}\quad \bar E^T\alpha=0,\quad \alpha\le(1-\nu)e,\quad \sigma-\tau=\nu,$$
$$\left\|\begin{pmatrix}-\frac{1}{\sqrt{2(K+1)}}(\tilde A^T\alpha+\tilde H^T s)\\ \tau\end{pmatrix}\right\|\le\sigma,$$
$$\|s_p^{ij}\|\le\alpha_p^{ij},\quad p=1,\dots,l_i,\ i,j=1,\dots,K,\ j\ne i. \tag{4.5}$$
4.2 Robust Classifier in a Feature Subspace. In the previous section, we obtained the robust formulation 4.5 in the feature space. However, the feature space F may have an arbitrarily large dimension, possibly infinite. Usually, kernel principal component analysis (KPCA) (Schölkopf, Smola, & Müller, 1998; Yajima, 2005) is used for feature extraction. In this section, we first reduce the feature space F to an S-dimensional subspace with S < L by KPCA and then construct the corresponding robust classifier of piecewise-nonlinear M-SVM in the subspace.
Consider the kernel matrix $G=\big(k((A_p^i)^T,(A_q^j)^T)\big)\in\mathbb{R}^{L\times L}$ associated with a kernel $k(\cdot,\cdot)$. Since G is a symmetric positive semidefinite matrix, there is an orthogonal matrix V such that $G=V\Lambda V^T$, where Λ is a diagonal matrix whose diagonal elements are the eigenvalues $\lambda_i\ge 0$, $i=1,\dots,L$, of G, and $v_i$, $i=1,\dots,L$, the columns of V, are the corresponding eigenvectors. Suppose $\lambda_1\ge\lambda_2\ge\dots\ge\lambda_L$. Select the S (< L)
largest positive eigenvalues and the corresponding eigenvectors. Denote $D_S=[\sqrt{\lambda_1}v_1,\sqrt{\lambda_2}v_2,\dots,\sqrt{\lambda_S}v_S]$, where the components of $v_i$ are written as follows:

$$v_i=\big(v_{i,1}^1,\dots,v_{i,l_1}^1,v_{i,1}^2,\dots,v_{i,l_2}^2,\dots,v_{i,1}^K,\dots,v_{i,l_K}^K\big)^T.$$

Define the vectors

$$u_i:=\sum_{j=1}^{K}\sum_{p=1}^{l_j}\frac{v_{i,p}^j\,\psi\big((A_p^j)^T\big)}{\sqrt{\lambda_i}},\quad i=1,\dots,S.$$

Then we have

$$u_i^T u_i=\frac{1}{\lambda_i}v_i^T G v_i=1$$

and

$$u_i^T u_j=\frac{1}{\sqrt{\lambda_i\lambda_j}}\,v_i^T G v_j=0,\quad i\ne j.$$

Therefore, $\{u_1,u_2,\dots,u_S\}$ forms an orthogonal basis of an S-dimensional subspace of F. Let $\psi_S(x)$ be the S-dimensional subcoordinate of $\psi(x)$, which is given by
$$\psi_S(x)=\left(\frac{1}{\sqrt{\lambda_1}}\sum_{j=1}^{K}\sum_{p=1}^{l_j}v_{1,p}^j\,k\big(x,(A_p^j)^T\big),\ \dots,\ \frac{1}{\sqrt{\lambda_S}}\sum_{j=1}^{K}\sum_{p=1}^{l_j}v_{S,p}^j\,k\big(x,(A_p^j)^T\big)\right)^T. \tag{4.6}$$

Then, similar to equation 3.12, we can get the decision functions associated with the robust formulation of piecewise-nonlinear M-SVM in the feature subspace as follows:

$$f_i(x)=\frac{t-1}{2\tau(K+1)}\sum_{j=1,\,j\ne i}^{K}\left(\sum_{p=1}^{l_i}\alpha_p^{ij}\,\psi_S(x)^T\psi_S\big((A_p^i)^T\big)-\sum_{p=1}^{l_j}\alpha_p^{ji}\,\psi_S(x)^T\psi_S\big((A_p^j)^T\big)+\sum_{p=1}^{l_i}\tilde\rho_p^i\,\psi_S(x)^T s_p^{ij}-\sum_{p=1}^{l_j}\tilde\rho_p^j\,\psi_S(x)^T s_p^{ji}\right)-b^i,\quad i=1,\dots,K. \tag{4.7}$$
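Computing the subcoordinates of equation 4.6 requires only the eigendecomposition of the kernel matrix and kernel evaluations against the training set. The following is a minimal sketch of ours; the function name is illustrative, and the top S eigenvalues are assumed positive:

```python
import numpy as np

def kpca_subcoords(G, X_train, x, kernel, S):
    """S-dimensional subcoordinates psi_S(x) of equation 4.6.
    G is the L x L kernel matrix, G[p, q] = kernel(X_train[p], X_train[q])."""
    lam, V = np.linalg.eigh(G)                 # eigh returns ascending eigenvalues
    lam, V = lam[::-1][:S], V[:, ::-1][:, :S]  # keep the S largest
    k_x = np.array([kernel(x, xp) for xp in X_train])
    return (V.T @ k_x) / np.sqrt(lam)          # i-th entry: v_i^T k_x / sqrt(lambda_i)
```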
Table 1: Description of Iris, Wine, and Glass Data Sets.

Name    Dimension (N)   Number of Classes (K)   Number of Examples (L)
Iris         4                   3                      150
Wine        13                   3                      178
Glass        9                   6                      214
5 Preliminary Numerical Results

In this section, through numerical experiments, we examine the performance of the robust piecewise-nonlinear M-SVM formulation and the original model for multiclass classification problems. We use the RBF kernel in the experiments. As described in section 4.2, we first construct an L × L kernel matrix G associated with the RBF kernel for the training data set. Then we decompose G and select an appropriate number S. Using equation 4.6, we obtain the S-dimensional subcoordinate of each point. The problems used in the experiments are the robust model 4.5 and the original model obtained by setting $\tilde\rho=0$ in equation 4.4. In the latter model, we have $\tilde H=O$. Thus, we may write the problem as follows:

$$\max_{\alpha,\,\sigma,\,\tau}\ e^T\alpha-(\sigma+\tau)$$
$$\text{s.t.}\quad \bar E^T\alpha=0,\quad \alpha\le(1-\nu)e,\quad \sigma-\tau=\nu,\quad \left\|\begin{pmatrix}-\frac{1}{\sqrt{2(K+1)}}\tilde A^T\alpha\\ \tau\end{pmatrix}\right\|\le\sigma. \tag{5.1}$$

The experiments were implemented on a PC (1 GB RAM, CPU 3.00 GHz) using SeDuMi 1.05 (Sturm, 2001) as a solver. This solver was developed by J. Sturm for optimization problems over symmetric cones, including SOCP. Some experimental results on real-world data sets taken from the UCI machine learning repository (Blake & Merz, 1998) are reported below. Table 1 gives a description of the data sets. In the experiments, the data sets were normalized to lie between −1 and 1. For simplicity, we set all $\rho_p^i$ in equation 2.5 to a constant ρ. The measurement noise $a_p^i$ was generated randomly from the normal distribution and scaled to the unit sphere. Two experiments were performed. In the first, an appropriate value of S for obtaining reasonable discriminants was sought. The second experiment was
Table 2: Results for Iris, Wine, and Glass Data Sets with Noise (ρ = 0.3, κ = 2, ν = 0.05).

                              Iris                    Wine                    Glass
Ra     Model             S    Rt      PT          S    Rt      PT          S    Rt      PT
0.5    Robust (I)        1    0.5364  62.67       5    0.5179  90.0        1    0.5737  35.24
0.5    Original (II)     1    0.5364  60.67       5    0.5179  88.89       1    0.5737  31.43
0.6    Robust (I)        2    0.7950  89.33       9    0.6102  88.89       2    0.6826  66.67
0.6    Original (II)     2    0.7950  87.33       9    0.6102  80.0        2    0.6826  32.86
0.7    Robust (I)        2    0.7950  89.33      16    0.7103  87.78       3    0.7523  66.67
0.7    Original (II)     2    0.7950  87.33      16    0.7103  82.22       3    0.7523  38.57
0.8    Robust (I)        3    0.8836  85.33       —    —       —           4    0.8002  66.67
0.8    Original (II)     3    0.8836  84.0        —    —       —           4    0.8002  45.24
0.99   Robust (I)       12    0.9911  88.0        —    —       —           —    —       —
0.99   Original (II)    12    0.9911  86.67       —    —       —           —    —       —

Note: PT: percentage of tenfold testing correctness on the validation set.
Table 3: Percentage of Tenfold Test Correctness for the Data Sets with Noise (κ = 2, ν = 0.05).

                                             ρ
Data Set (S)   Model            0       0.1     0.2     0.3     0.4     0.5
Iris (2)       Robust (I)      94.0    88.67   88.0    89.33   91.33   90.0
Iris (2)       Original (II)   94.0    87.33   87.33   87.33   86.67   85.33
Wine (5)       Robust (I)      98.33   91.11   90.0    90.0    87.78   84.44
Wine (5)       Original (II)   98.33   88.89   88.83   88.89   85.56   82.22
Glass (4)      Robust (I)      59.52   66.19   65.71   66.67   66.67   66.67
Glass (4)      Original (II)   59.52   46.19   45.71   45.24   49.05   49.52
conducted on the three data sets with the measurement noise. Tenfold cross-validation was used in the experiments.
In order to seek an appropriate value of S, a ratio $R_a$ is set, chosen from the set {0.5, 0.6, 0.7, 0.8, 0.99}. For each value of $R_a$, we find the smallest integer S such that $\sum_{i=1}^{S}\lambda_i/\sum_{i=1}^{L}\lambda_i\ge R_a$, and let $R_t:=\sum_{i=1}^{S}\lambda_i/\sum_{i=1}^{L}\lambda_i$. At the same time, we test the accuracy on the validation set by computing the percentage of tenfold testing correctness. Table 2 contains these three kinds of results for the robust model and the original model on the Iris, Wine, and Glass data sets with the measurement noise scaled by ρ = 0.3. When $R_a$ is large, we were unable to solve the problems for the Wine and Glass data sets because of memory limitations. Nevertheless, it can be seen from Table 2 that values of $R_t$ from around 50% up to 70% yield reasonable discriminants. Moreover, in all cases, S is much smaller than the data size L.
Table 3 shows the percentage of tenfold testing correctness for the robust model and the original model on the three data sets with various noise levels ρ. In particular, ρ = 0 means that there is no noise on the data sets; in this case, the robust model reduces to the original model. It can be observed that the performance of the robust model is consistently better than that of the original model, especially for the Glass data set. In addition, the correctness for the original model on the three data sets with noise is much lower than the results when ρ = 0. For the linear case, we also find that the correctness for the original model on the data sets with noise is lower than the correctness on the data sets without noise. For example, the correctness for the Iris, Wine, and Glass data sets without noise is 90.67%, 97.78%, and 57.14%, respectively. However, when ρ = 0.5, the corresponding correctness is 87.33%, 78.89%, and 48.57%, respectively. For the robust model, when ρ = 0.5, the corresponding correctness is 90.67%, 81.11%, and 56.19%, respectively.
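The rule used here for choosing S translates directly into a few lines of code. A sketch of ours, under the assumption that eigvals holds the eigenvalues of the kernel matrix G:

```python
import numpy as np

def select_S(eigvals, Ra):
    """Smallest S with sum(lambda_1..S)/sum(lambda_1..L) >= Ra; returns (S, Rt)."""
    lam = np.sort(eigvals)[::-1]               # lambda_1 >= ... >= lambda_L
    ratios = np.cumsum(lam) / lam.sum()
    S = int(np.searchsorted(ratios, Ra)) + 1   # first index reaching the target ratio
    return S, float(ratios[S - 1])
```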
6 Conclusion

In this letter, we have established robust linear and nonlinear formulations for multiclass classification based on the M-SVM method. KPCA has been used to reduce the feature space to an S-dimensional subspace. The preliminary numerical experiments show that the performance of the robust model is better than that of the original model. Unfortunately, the conic convex optimization solver SeDuMi 1.05 (Sturm, 2001) used in our numerical experiments could solve problems only for small data sets. Sequential minimal optimization (SMO) techniques (Platt, 1999) are essential in large-scale implementations of SVMs. Future subjects for investigation include developing SMO-based robust algorithms for multiclass classification.
Appendix A: Dual of Formulation 2.17

In order to get the dual of problem 2.17, we first state a more general primal and dual form of the SOCP. The notations used in section A.1 are independent of those in the other parts of the letter.

A.1 A General Primal and Dual Pair. For the SOCP

$$\min_{x,\,y,\,z}\ c^T x+d^T y+e^T z$$
$$\text{s.t.}\quad A^T x+B^T y+C^T z=f,\quad y\ge 0,\quad z\in\mathcal{K}_{n_1}\times\dots\times\mathcal{K}_{n_l}, \tag{A.1}$$

its dual is written as follows:

$$\max_{w,\,u,\,v}\ f^T w$$
$$\text{s.t.}\quad Aw=c,\quad Bw+u=d,\quad Cw+v=e,\quad u\ge 0,\quad v\in\mathcal{K}_{n_1}\times\dots\times\mathcal{K}_{n_l}. \tag{A.2}$$

Now consider the problem

$$\min_{x,\,y,\,z}\ c^T x+d^T y$$
$$\text{s.t.}\quad \|\bar G_i^T x+q_i\|\le g_i^T x+h_i^T y+r_i^T z+a_i,\quad i=1,\dots,m,\quad y\ge 0.$$

This problem can be formulated as follows:

$$\min_{x,\,y,\,z,\,\zeta}\ c^T x+d^T y$$
$$\text{s.t.}\quad \zeta_i-\begin{pmatrix}g_i^T x+h_i^T y+r_i^T z+a_i\\ \bar G_i^T x+q_i\end{pmatrix}=0,\quad i=1,\dots,m,$$
$$\zeta_i\in\mathcal{K}_{n_i},\quad i=1,\dots,m,\quad y\ge 0, \tag{A.3}$$
which can be further rewritten as

$$\min_{x,\,y,\,z,\,\zeta}\ c^T x+d^T y$$
$$\text{s.t.}\quad \begin{pmatrix} -G_1 & -G_2 & \cdots & -G_m\\ -H_1 & -H_2 & \cdots & -H_m\\ -R_1 & -R_2 & \cdots & -R_m\\ I & O & \cdots & O\\ O & I & \cdots & O\\ \vdots & \vdots & \ddots & \vdots\\ O & O & \cdots & I \end{pmatrix}^T\begin{pmatrix}x\\ y\\ z\\ \zeta_1\\ \zeta_2\\ \vdots\\ \zeta_m\end{pmatrix}=\begin{pmatrix}\beta_1\\ \beta_2\\ \vdots\\ \beta_m\end{pmatrix},$$
$$\zeta_i\in\mathcal{K}_{n_i},\quad i=1,\dots,m,\quad y\ge 0,$$

where $G_i=[g_i,\bar G_i]$, $H_i=[h_i,O]$, $R_i=[r_i,O]$, and $\beta_i=[a_i,q_i^T]^T$, $i=1,\dots,m$. In view of the primal-dual pair A.1 and A.2, we obtain the dual of problem A.3 as follows:

$$\max_{\eta,\,\lambda}\ -\sum_{i=1}^{m}\beta_i^T\eta_i$$
$$\text{s.t.}\quad \sum_{i=1}^{m}G_i\eta_i=c,\quad \sum_{i=1}^{m}H_i\eta_i+\lambda=d,\quad \sum_{i=1}^{m}R_i\eta_i=0,$$
$$\lambda\ge 0,\quad \eta_i\in\mathcal{K}_{n_i},\quad i=1,\dots,m. \tag{A.4}$$
A.2 Dual of Problem 2.17. In the following, we derive the dual of formulation 2.17. The primal problem 2.17 can be put in the following equivalent form:

$$\min_{x,\,b,\,y}\ (0^T,\nu)\,x+(1-\nu)e^T y$$
$$\text{s.t.}\quad \left\|\begin{pmatrix}\sqrt{2}\,C & 0\\ 0^T & -1\end{pmatrix}x+\begin{pmatrix}0\\ 1\end{pmatrix}\right\|\le(0^T,1)\,x+1,$$
$$\rho_p^i\,\big\|\big[(H^{ij})^T,0\big]\,x\big\|\le A_p^i\big[(H^{ij})^T,0\big]\,x+(h_p^{ij})^T y-(r^{ij})^T b-1,$$
$$p=1,\dots,l_i,\ i,j=1,\dots,K,\ i\ne j,\quad y\ge 0,$$

where $x=[w^T,t]^T$. Then, by equation A.4, we get the dual of problem 2.17 as follows:
$$\max_{\alpha,\,s,\,\sigma,\,\tau}\ \sum_{i=1}^{K}\sum_{j=1,\,j\ne i}^{K}\sum_{p=1}^{l_i}\alpha_p^{ij}-(\sigma+\tau) \tag{A.5}$$
$$\text{s.t.}\quad \sqrt{2}\,C^T\xi+\sum_{i=1}^{K}\sum_{j=1,\,j\ne i}^{K}\sum_{p=1}^{l_i}\Big(\alpha_p^{ij}H^{ij}(A_p^i)^T+\rho_p^i H^{ij}s_p^{ij}\Big)=0, \tag{A.6}$$
$$\sigma-\tau=\nu, \tag{A.7}$$
$$\sum_{i=1}^{K}\sum_{j=1,\,j\ne i}^{K}\sum_{p=1}^{l_i}\alpha_p^{ij}h_p^{ij}+\lambda=(1-\nu)e, \tag{A.8}$$
$$-\sum_{i=1}^{K}\sum_{j=1,\,j\ne i}^{K}\sum_{p=1}^{l_i}\alpha_p^{ij}r^{ij}=0, \tag{A.9}$$
$$\left\|\begin{pmatrix}\xi\\ \tau\end{pmatrix}\right\|\le\sigma, \tag{A.10}$$
$$\|s_p^{ij}\|\le\alpha_p^{ij},\quad p=1,\dots,l_i,\ i,j=1,\dots,K,\ i\ne j, \tag{A.11}$$
$$\lambda\ge 0. \tag{A.12}$$
By equations 2.9, 3.1, and 3.5, we get

$$\sum_{i=1}^{K}\sum_{j=1,\,j\ne i}^{K}\sum_{p=1}^{l_i}\alpha_p^{ij}H^{ij}(A_p^i)^T=\bar A^T\alpha. \tag{A.13}$$

By equations 3.2 and 3.6, we get

$$\sum_{i=1}^{K}\sum_{j=1,\,j\ne i}^{K}\sum_{p=1}^{l_i}\rho_p^i H^{ij}s_p^{ij}=\bar H^T s. \tag{A.14}$$

Hence, by equations A.13 and A.14, we can express equation A.6 compactly as follows:

$$\sqrt{2}\,C^T\xi+\bar A^T\alpha+\bar H^T s=0. \tag{A.15}$$

By equations A.15 and 3.11, we get the following equation:

$$\xi=-\frac{1}{\sqrt{2(K+1)}}(\bar A^T\alpha+\bar H^T s). \tag{A.16}$$

By equations 2.13 and 3.5, we have

$$\sum_{i=1}^{K}\sum_{j=1,\,j\ne i}^{K}\sum_{p=1}^{l_i}\alpha_p^{ij}h_p^{ij}=\alpha.$$

Hence, equation A.8 can be expressed as follows:

$$(1-\nu)e-\lambda-\alpha=0. \tag{A.17}$$

By equations 2.11, 3.3, and 3.5, we can rewrite equation A.9 as follows:

$$-\bar E^T\alpha=0. \tag{A.18}$$
Combining equations A.16 to A.18, the problem given in A.5 to A.12 can be written as problem 3.4.

Appendix B: Complementarity Conditions of SOCP

Let $\mathrm{bd}\,\mathcal{K}_n$ denote the boundary of $\mathcal{K}_n$:

$$\mathrm{bd}\,\mathcal{K}_n=\left\{\begin{pmatrix}z_0\\ \bar z\end{pmatrix}\in\mathcal{K}_n : \|\bar z\|=z_0\right\}.$$

Let $\mathrm{int}\,\mathcal{K}_n$ denote the interior of $\mathcal{K}_n$:

$$\mathrm{int}\,\mathcal{K}_n=\left\{\begin{pmatrix}z_0\\ \bar z\end{pmatrix}\in\mathcal{K}_n : \|\bar z\|<z_0\right\}.$$

For two elements $\begin{pmatrix}z_0\\ \bar z\end{pmatrix}\in\mathcal{K}_n$ and $\begin{pmatrix}z_0'\\ \bar z'\end{pmatrix}\in\mathcal{K}_n$,

$$\begin{pmatrix}z_0\\ \bar z\end{pmatrix}^T\begin{pmatrix}z_0'\\ \bar z'\end{pmatrix}=0$$

if and only if the following conditions are satisfied (Lobo et al., 1998):

$$\begin{pmatrix}z_0\\ \bar z\end{pmatrix}\in\mathrm{int}\,\mathcal{K}_n\ \Rightarrow\ \bar z'=0,\ z_0'=0, \tag{B.1}$$
$$\begin{pmatrix}z_0'\\ \bar z'\end{pmatrix}\in\mathrm{int}\,\mathcal{K}_n\ \Rightarrow\ \bar z=0,\ z_0=0, \tag{B.2}$$
$$\begin{pmatrix}z_0\\ \bar z\end{pmatrix}\in\mathrm{bd}\,\mathcal{K}_n\setminus\{0\}\ \Rightarrow\ \begin{pmatrix}z_0'\\ \bar z'\end{pmatrix}=\mu\begin{pmatrix}z_0\\ -\bar z\end{pmatrix}, \tag{B.3}$$
where µ > 0 is a constant. These three conditions are regarded as a generalization of the complementary slackness conditions in linear programming.

Acknowledgments

P.Z. is supported in part by a grant-in-aid from the Ministry of Education, Culture, Sports, Science and Technology of Japan and by National Science Foundation of China grant no. 70601033. M.F. is supported in part by a Scientific Research Grant-in-Aid from the Japan Society for the Promotion of Science.

References

Alizadeh, F., & Goldfarb, D. (2003). Second-order cone programming. Math. Program., Ser. B, 95, 3–51.
Allwein, E. L., Schapire, R. E., & Singer, Y. (2001). Reducing multiclass to binary: A unifying approach for margin classifiers. Journal of Machine Learning Research, 1, 113–141.
Angulo, C., Parra, X., & Català, A. (2003). K-SVCR: A support vector machine for multi-class classification. Neurocomputing, 55, 57–77.
Bennett, K. P., & Mangasarian, O. L. (1994). Multicategory discrimination via linear programming. Optimization Methods and Software, 3, 27–39.
Blake, C. L., & Merz, C. J. (1998). UCI repository of machine learning databases. University of California. Available online at http://www.ics.uci.edu/∼mlearn/MLRepository.html.
Bottou, L., Cortes, C., Denker, J. S., Drucker, H., Guyon, I., Jackel, L. D., LeCun, Y., Müller, U. A., Säckinger, E., Simard, P., & Vapnik, V. (1994). Comparison of classifier methods: A case study in handwriting digit recognition. In IAPR (Ed.), Proceedings of the International Conference on Pattern Recognition (pp. 77–82). Piscataway, NJ: IEEE Computer Society Press.
Bredensteiner, E. J., & Bennett, K. P. (1999). Multicategory classification by support vector machines. Computational Optimization and Applications, 12, 53–79.
Dietterich, T. G., & Bakiri, G. (1995). Solving multi-class learning problems via error-correcting output codes. Journal of Artificial Intelligence Research, 2, 263–286.
Fukushima, M., Luo, Z. Q., & Tseng, P. (2002). Smoothing functions for second-order-cone complementarity problems. SIAM Journal on Optimization, 12, 436–460.
Goldfarb, D., & Iyengar, G. (2003). Robust convex quadratically constrained programs. Mathematical Programming, 97, 495–515.
Guermeur, Y. (2002). Combining discriminant models with new multi-class SVMs. Pattern Analysis and Applications, 5, 168–179.
Hastie, T. J., & Tibshirani, R. J. (1998). Classification by pairwise coupling. In M. I. Jordan, M. J. Kearns, & S. A. Solla (Eds.), Advances in neural information processing systems, 10 (pp. 507–513). Cambridge, MA: MIT Press.
Hayashi, S., Yamashita, N., & Fukushima, M. (2005). A combined smoothing and regularization method for monotone second-order cone complementarity problems. SIAM Journal on Optimization, 15, 593–615.
Kressel, U. (1999). Pairwise classification and support vector machines. In B. Schölkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods: Support vector learning (pp. 255–268). Cambridge, MA: MIT Press.
Lobo, M. S., Vandenberghe, L., Boyd, S., & Lébret, H. (1998). Applications of second-order cone programming. Linear Algebra and Applications, 284, 193–228.
Platt, J. (1999). Sequential minimal optimization: A fast algorithm for training support vector machines. In B. Schölkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods: Support vector learning (pp. 185–208). Cambridge, MA: MIT Press.
Platt, J., Cristianini, N., & Shawe-Taylor, J. (2000). Large margin DAGs for multiclass classification. In S. A. Solla, T. K. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems, 12 (pp. 547–553). Cambridge, MA: MIT Press.
Schölkopf, B., Smola, A., & Müller, K. (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10, 1299–1319.
Sturm, J. (2001). Using SeDuMi, a Matlab toolbox for optimization over symmetric cones. Department of Econometrics, Tilburg University, Netherlands.
Vapnik, V. (1998). Statistical learning theory. New York: Wiley.
Weston, J., & Watkins, C. (1998). Multi-class support vector machines (CSD-TR-9804). Egham, UK: Royal Holloway, University of London.
Yajima, Y. (2005). Linear programming approaches for multicategory support vector machines. European Journal of Operational Research, 162, 514–531.
Zhong, P., & Fukushima, M. (2006). A new multi-class support vector algorithm. Optimization Methods and Software, 21, 359–372.
Received September 19, 2005; accepted May 24, 2006.
LETTER
Communicated by Grace Wahba
Fast Generalized Cross-Validation Algorithm for Sparse Model Learning

S. Sundararajan
[email protected]
Philips Electronics India Ltd., Ulsoor, Bangalore, India
Shirish Shevade [email protected] Computer Science and Automation, Indian Institute of Science, Bangalore, India
S. Sathiya Keerthi [email protected] Yahoo! Research, 2821 Mission College Blvd., Santa Clara, CA 95054, USA
We propose a fast, incremental algorithm for designing linear regression models. The proposed algorithm generates a sparse model by optimizing multiple smoothing parameters using the generalized cross-validation approach. The performances on synthetic and real-world data sets are compared with other incremental algorithms such as Tipping and Faul’s fast relevance vector machine, Chen et al.’s orthogonal least squares, and Orr’s regularized forward selection. The results demonstrate that the proposed algorithm is competitive. 1 Introduction In recent years, there has been a lot of focus on designing sparse models in machine learning. For example, the support vector machine (SVM) (Cortes & Vapnik, 1995) and the relevance vector machine (RVM) (Tipping, 2001; Tipping & Faul, 2003) have been proven to provide sparse solutions to both regression and classification problems. Some of the earlier successful approaches for regression problems include Chen, Cowan, and Grant’s (1991) orthogonal least squares and Orr’s (1995b) regularized forward selection algorithms. In this letter, we consider only the regression problem. Often the target function model takes the form
$$y(x)=\sum_{m=1}^{M}w_m\phi_m(x). \tag{1.1}$$
Neural Computation 19, 283–301 (2007)
C 2006 Massachusetts Institute of Technology
For notational convenience, we do not indicate the dependency of y on w, the weight vector. The available choices in selecting the set of basis vectors $\{\phi_m(x),\ m=1,\dots,M\}$ make the model flexible. Given a data set $\{x_n,t_n\}_{n=1}^{N}$, $t_n\in\mathbb{R}\ \forall n$, we can write the target vector $t=(t_1,\dots,t_N)^T$ as the sum of the approximation vector $y(x)=(y(x_1),\dots,y(x_N))^T$ and an error vector η,

$$t=\Phi w+\eta, \tag{1.2}$$

where $\Phi=(\phi_1,\phi_2,\dots,\phi_M)$ is the design matrix and $\phi_i$ is the ith column vector of Φ, of size N × 1, representing the response of the ith basis vector for all the input samples $x_j$, $j=1,\dots,N$. One possible way to obtain this design matrix is to use a gaussian kernel with $\Phi_{ij}=\exp\big(-\frac{\|x_i-x_j\|^2}{2z^2}\big)$. The error vector η can be modeled as an independent zero-mean gaussian vector with variance $\sigma^2$. In this context, controlling the model complexity to avoid overfitting is an important task. This problem has been addressed in the past by regularization approaches (Bishop, 1995). In the classical approach, the sum squared error with weight decay regularization having a single regularization parameter α controls the trade-off between fitting the training data and smoothing the output function. In the local smoothing approach, each weight in w is associated with a regularization or ridge parameter (Hoerl & Kennard, 1970; Orr, 1995a; Denison & George, 2000). The interested reader can refer to Denison and George (2000) for a discussion of generalized ridge regression and the Bayesian approach. For our discussion, we consider the weight decay regularizer with multiple regularization parameters. In this case, the optimal weight vector is obtained by minimizing the following cost function (for a given set of hyperparameter values α, σ²):

$$C(w,\alpha,\sigma^2)=\frac{1}{2\sigma^2}\|t-\Phi w\|^2+\frac{1}{2}w^T A w, \tag{1.3}$$

where A is a diagonal matrix with elements $\alpha=(\alpha_1,\dots,\alpha_M)^T$, and the optimal weight vector is given by

$$w=\frac{1}{\sigma^2}S^{-1}\Phi^T t, \tag{1.4}$$

where $S=R+A$ and $R=\frac{\Phi^T\Phi}{\sigma^2}$. Note that this solution depends on the product $\alpha\sigma^2$. This solution is the same as the maximum a posteriori (MAP) solution obtainable from defining a gaussian likelihood and a gaussian prior for the weights in a Bayesian framework (Tipping, 2001).
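Equation 1.4 is a standard generalized ridge solve. A minimal Python/NumPy sketch of ours (the function name is an assumption) computes the optimal weights for given hyperparameters:

```python
import numpy as np

def map_weights(Phi, t, alpha, sigma2):
    """Optimal weights of equation 1.4: w = S^{-1} Phi^T t / sigma^2,
    with S = Phi^T Phi / sigma^2 + diag(alpha)."""
    S = Phi.T @ Phi / sigma2 + np.diag(alpha)
    return np.linalg.solve(S, Phi.T @ t) / sigma2
```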
The hyperparameters are typically selected using iterative approaches like marginal likelihood maximization (Tipping, 2001). In practice, many of the $\alpha_i$ approach ∞. This results in the removal of the associated basis vectors, thereby making the model sparse. The final model consists of a small number of basis vectors L (L ≪ M), called relevance vectors, and is hence known by the name relevance vector machine (RVM). This procedure is computationally intensive and needs $O(M^3)$ effort at least for the initial iterations while starting with the full model (M = N, or M = N + 1 if the bias term is included). Hence, it is not suitable for large data sets. This limitation was addressed in Tipping and Faul (2003), where a computationally efficient algorithm was proposed. In this algorithm, basis vectors are added sequentially starting from an empty model. It also allows deleting basis vectors that may subsequently become redundant. There are various other basis vector selection heuristics that can be used to design sparse models (Chen et al., 1991; Orr, 1995b). Chen et al. (1991) proposed an orthogonal least-squares algorithm as a forward regression procedure to select the basis vectors. At each step of the regression, the increment to the explained variance of the desired output is maximized. Orr (1995b) proposed an algorithm that combines regularization and cross-validated selection of basis vectors. Some other promising incremental approaches are the algorithms of Csató and Opper (2002), Lawrence, Seeger, and Herbrich (2003), Seeger, Williams, and Lawrence (2003), and Smola and Bartlett (2000). But they apply to gaussian processes and are not directly related to the problem formulation addressed in this article. Generalized cross-validation (GCV) is another important approach for the selection of hyperparameters and has been shown to exhibit good generalization performance (Sundararajan & Keerthi, 2001; Orr, 1995a). Orr (1995a) used the GCV approach to estimate the multiple smoothing parameters of the full model. This approach, however, is not suitable for large data sets due to its computational complexity. Therefore, there is a need to devise a computationally efficient algorithm based on the GCV approach for handling large data sets. In this letter, we propose a new fast, incremental GCV algorithm that can be used to design sparse models exhibiting good generalization performance. In the algorithm, we start with an empty model and sequentially add basis functions to reduce the GCV error. The GCV error can also be reduced by deleting those basis functions that subsequently become redundant. This important feature offsets the inherent greediness exhibited by other sequential algorithms. This algorithm has the same computational complexity as that of the algorithm given in Tipping and Faul (2003) and is suitable for large data sets as well. Preliminary results on synthetic and real-world benchmark data sets indicate that the new approach gains in generalization, but at the expense of a moderate increase in the number of basis vectors.
The letter is organized as follows. In section 2, we describe the GCV approach and compare the GCV error function with marginal likelihood function. Section 3 describes the fast, incremental algorithm, computational complexity, and the numerical issues involved; the update expressions mentioned in this section are detailed in appendixes A to F. In section 4, we present the simulation results. Section 5 concludes the letter. 2 Generalized Cross Validation The standard techniques that estimate the prediction error of a given model are the leave-one-out (LOO) cross-validation (Stone, 1974) and the closely related GCV described in Golub, Heath, and Wahba (1979) and Orr (1995b). The generalization performance of the GCV approach is quite good, like that of the LOO cross-validation approach. The advantage in using the GCV error is that it takes a much simpler form compared to the LOO error and is given by N V(α, σ 2 ) = N
− y(xi ))2 , (tr(P))2
i=1 (ti
(2.1)
where P=I−
S−1 T . σ2
(2.2)
When this GCV error is minimized with respect to the hyperparameters, many of the αi approach ∞, making the model sparse. Equation 2.1 can be written as V(α, σ 2 ) = N
t T P2 t . (tr(P))2
(2.3)
Note that P is dependent only on ζ = ασ 2 . This means that we can get rid of σ 2 from equations 1.4 and 2.1. Then it is sufficient to work directly with ζ . However, using the optimal ζ obtained from minimizing the GCV error, the noise level can be computed from σˆ 2 =
t T P2 t . tr(P)
(2.4)
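For reference, the GCV error of equation 2.3 can be computed directly, at $O(N^3)$ cost, so only for small problems. A sketch of ours under the definitions above (the function name is an assumption):

```python
import numpy as np

def gcv_error(Phi, t, alpha, sigma2):
    """GCV error of equation 2.3 with P = I - Phi S^{-1} Phi^T / sigma^2."""
    N = len(t)
    S = Phi.T @ Phi / sigma2 + np.diag(alpha)
    P = np.eye(N) - Phi @ np.linalg.solve(S, Phi.T) / sigma2
    Pt = P @ t
    return N * (Pt @ Pt) / np.trace(P) ** 2    # t^T P^2 t uses the symmetry of P
```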
We now discuss the algorithm proposed by Orr to determine the optimal set of hyperparameters.
2.1 Orr's Algorithm. Starting with the full model, Orr (1995a) proposed an iterative scheme to minimize the GCV error, equation 2.3, with respect to the hyperparameters. Although this algorithm was originally described in terms of the variable $\zeta_j$, we describe it here using the variables $\alpha_j$ and $\sigma^2$ for convenience. Each $\alpha_j$ is optimized in turn, while the others are held fixed. The minimization thus proceeds by a series of one-dimensional minimizations. This can be achieved by rewriting equation 2.3 using

$$t^T P^2 t=\frac{a_j\Delta_j^2-2b_j\Delta_j+c_j}{\Delta_j^2},\qquad \mathrm{tr}(P)=\delta_j-\frac{\epsilon_j}{\Delta_j},$$

where

$$a_j=t^T P_j^2 t \tag{2.5}$$
$$b_j=(t^T P_j^2\phi_j)(t^T P_j\phi_j) \tag{2.6}$$
$$c_j=(\phi_j^T P_j^2\phi_j)(t^T P_j\phi_j)^2 \tag{2.7}$$
$$\delta_j=\mathrm{tr}(P_j) \tag{2.8}$$
$$\epsilon_j=\phi_j^T P_j^2\phi_j. \tag{2.9}$$

The above relationships follow from the rank-one update relationship between P and $P_j$,

$$P=P_j-\frac{1}{\Delta_j}P_j\phi_j\phi_j^T P_j, \tag{2.10}$$

where $P_j=I-\frac{\Phi_j S_j^{-1}\Phi_j^T}{\sigma^2}$ and

$$\Delta_j=\phi_j^T P_j\phi_j+\alpha_j\sigma^2. \tag{2.11}$$

Here, $\Phi_j$ denotes the matrix Φ with the column $\phi_j$ removed. Therefore, $S_j=R_j+A_j$, with the subscript j having the same interpretation as in $\Phi_j$. Note that $S_j$ does not contain $\alpha_j$ and, hence, $P_j$ also does not contain $\alpha_j$. Thus, equation 2.3 can be seen as a rational polynomial in $\alpha_j$ alone, with a single minimum in the range [0, ∞]. The details of minimizing this polynomial with respect to $\alpha_j$ are given in appendix A. After one complete cycle in which each parameter is optimized once, the GCV score is calculated and compared with the score at the end of the previous cycle. If a significant decrease has occurred, a new cycle is begun; otherwise, the algorithm
terminates. As detailed in Orr (1995a), the computational complexity of one cycle of the above algorithm is $O(N^3)$, at least during the first few iterations. Although this will consequently reduce to $O(LN^2)$ as the basis vectors are pruned, this algorithm is not suitable for large data sets.

2.2 Comparison of GCV Error with Marginal Likelihood. We now compare the GCV error and the marginal likelihood by studying their dependence on $\alpha_j$. In the marginal likelihood method, the optimal $\alpha_j$'s are determined by maximizing the marginal likelihood with respect to $\alpha_j$; the GCV error is minimized to get the optimal $\alpha_j$'s.
First, we study the behavior of the GCV error with reference to $\alpha_j$. The denominator of the GCV error contains $\mathrm{tr}(P)=\delta_j-\frac{\epsilon_j}{\Delta_j}$, where $\delta_j$ and $\epsilon_j$ are independent of $\alpha_j$ and P is a positive semidefinite matrix. Further, $\mathrm{tr}(P)$ increases monotonically as a function of $\alpha_j$. Therefore, maximizing $\mathrm{tr}(P)$ (in order to minimize the GCV error) will prefer the optimal value of $\alpha_j$ to be ∞. Thus, the denominator term in equation 2.3 prefers a simple model. The term $t^T P^2 t$ in the numerator of equation 2.3 is the squared error at the optimal weight vector in equation 1.4. Let $g(\alpha)=t^T P^2 t$. Differentiating $g(\alpha)$ with respect to $\alpha_j$, we get

$$\frac{\partial g(\alpha)}{\partial\alpha_j}=\frac{2\sigma^2}{\Delta_j^2}\left(b_j-\frac{c_j}{\Delta_j}\right),$$

where $b_j$ and $c_j$ are independent of $\alpha_j$ and $c_j,\Delta_j\ge 0$. If $b_j$ is nonpositive, then $g(\alpha_j)$ is a monotonically decreasing function of $\alpha_j$. Minimization of $g(\alpha_j)$ with respect to $\alpha_j$ would thus prefer $\alpha_j$ to be ∞. On the other hand, if $b_j$ is positive, the minimum of $g(\alpha_j)$ would depend on the sign of $\sigma^2 b_j\bar s_j-c_j$, where $\bar s_j=\phi_j^T\Sigma_{-j}^{-1}\phi_j$, $\Sigma_{-j}$ is Σ with the contribution of basis vector j removed, and $\Sigma=\sigma^2 I+\Phi A^{-1}\Phi^T$. Note that $P=\sigma^2\Sigma^{-1}$. Therefore, the optimal choice of $\alpha_j$ using the GCV error method depends on the trade-off between the data-dependent term in the numerator and the term in the denominator that prefers $\alpha_j$ to be ∞.
The logarithm of the marginal likelihood function is

$$\mathcal{L}(\alpha)\stackrel{\mathrm{def}}{=}-\frac{1}{2}\big(N\log(2\pi)+\log|\Sigma|+t^T\Sigma^{-1}t\big).$$

Considering the dependence of $\mathcal{L}(\alpha)$ on a single hyperparameter $\alpha_j$, the above equation can be rewritten (Tipping & Faul, 2003) as

$$\mathcal{L}(\alpha)=\mathcal{L}(\alpha_{-j})+l(\alpha_j),$$

where $l(\alpha_j)=\frac{1}{2}\big[\log\big(\frac{\alpha_j}{\alpha_j+\bar s_j}\big)+\frac{\bar q_j^2}{\alpha_j+\bar s_j}\big]$. $\mathcal{L}(\alpha_{-j})$ is the marginal likelihood with $\phi_j$ excluded and is thus independent of $\alpha_j$. Here, we have defined $\bar q_j\stackrel{\mathrm{def}}{=}\phi_j^T\Sigma_{-j}^{-1}t$. Note that $\bar q_j$ and $\bar s_j$ are independent of $\alpha_j$. The second term in $l(\alpha_j)$ comes from the data-dependent term, $t^T\Sigma^{-1}t$, in the logarithm of the marginal likelihood function, and maximization of this term prefers $\alpha_j$ to be zero. On the other hand, the first term in $l(\alpha_j)$ comes from the Ockham factor (Tipping, 2001), and maximization of this term chooses $\alpha_j$ to be ∞. So the optimal value of $\alpha_j$ is a trade-off between the data-dependent term and the Ockham factor.
Note that $t^T\Sigma^{-1}t=\frac{1}{\sigma^2}t^T P^2 t+w^T A w$. The term on the left-hand side of this equation appears in the negative logarithm of the marginal likelihood function, while the first term on the right-hand side appears in the numerator of the GCV error. The key difference in the choice of the basis function is the additional term that is present in the data-dependent term of the marginal likelihood. For the marginal likelihood method, it has been shown that for a given basis function j, if $\bar q_j^2\le\bar s_j$, then the optimal value of $\alpha_j$ is ∞; otherwise, the optimal value is $\frac{\bar s_j^2}{\bar q_j^2-\bar s_j}$ (Tipping & Faul, 2003). For the GCV method, the optimal value of $\alpha_j$ depends on $\bar q_j$, $\bar s_j$, and some other quantities detailed in appendix A. Further, the sufficient condition for a basis function to be irrelevant is $\bar q_j^2\le\bar s_j$ for the marginal likelihood method and $b_j\le 0$ for the GCV error method. Note that $b_j$ is independent of $\bar s_j$ but dependent on $\bar q_j$. In general, a relevance vector obtained using the marginal likelihood (or GCV error) method need not be relevant in the GCV error (or marginal likelihood) method. This fact was also observed in our experiments.
We now discuss the effect of scaling on the GCV error. First, note that the GCV error is a function of P. With the scaling of the output t, there will be an associated scaling of the basis functions. However, P is invariant to such scaling (see equation 2.2). This makes the GCV error invariant to scaling. A similar result holds for the log marginal likelihood function.

3 Fast GCV Algorithm

In this section, we describe the fast GCV (FGCV) algorithm, which constructs the model sequentially starting from an empty model. Basis vectors are added sequentially, and their weightings are modified to get the maximum reduction in the GCV error. The GCV error can also be decreased by deleting those basis vectors that subsequently become redundant. By maintaining the following set of variables for every basis vector m, we can find the optimal value of $\alpha_m$ for every basis vector and the corresponding GCV error efficiently:
$$r_m=t^T P\phi_m \tag{3.1}$$
$$\gamma_m=\phi_m^T P\phi_m \tag{3.2}$$
$$\xi_m=t^T P^2\phi_m \tag{3.3}$$
$$u_m=\phi_m^T P^2\phi_m. \tag{3.4}$$

In addition, we need

$$v=\mathrm{tr}(P) \tag{3.5}$$
$$q=t^T P^2 t. \tag{3.6}$$
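Before stating the algorithm, it may help to see the rank-one step of equation 2.10 in code, since it is what makes maintaining the quantities above cheap. A sketch of ours (the function name is illustrative; P_j is the matrix of equation 2.10 with basis vector phi_j excluded):

```python
import numpy as np

def include_basis(P_j, phi_j, alpha_j, sigma2):
    """Rank-one update of equation 2.10: form P from P_j when phi_j enters with ridge alpha_j."""
    Pphi = P_j @ phi_j
    Delta_j = phi_j @ Pphi + alpha_j * sigma2   # equation 2.11
    return P_j - np.outer(Pphi, Pphi) / Delta_j
```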
Further, after every minimization process, these variables can be updated efficiently using the rank-one update given in equation 2.10. We now give the algorithm and discuss the relevant implementation details and the storage and computational requirements.

3.1 Algorithm

1. Initialize σ² to some reasonable value (e.g., var(t) × .1).
2. Select the first basis vector $\phi_k$ (which forms the initial relevance vector set), and set the corresponding $\alpha_k$ to its optimal value. The remaining α's are notionally set to ∞.
3. Initialize $S^{-1}$, w (which are scalars initially), and the variables given in equations 3.1 to 3.6.
4. $\alpha_k^{\mathrm{old}} := \alpha_k$.
5. For all j, find the optimal solution $\alpha_j$, keeping the remaining $\alpha_i$, $i \ne j$, fixed, and the corresponding GCV error. Select the basis vector k for which the reduction in the GCV error is maximum.
6. If $\alpha_k^{\mathrm{old}} < \infty$ and $\alpha_k < \infty$, the relevance vector set remains unchanged.
7. If $\alpha_k^{\mathrm{old}} = \infty$ and $\alpha_k < \infty$, add $\phi_k$ to the relevance vector set.
8. If $\alpha_k^{\mathrm{old}} < \infty$ and $\alpha_k = \infty$, then delete $\phi_k$ from the relevance vector set.
9. Update $S^{-1}$, w, and the variables given in equations 3.1 to 3.6.
10. Estimate the noise level using equation 2.4. This step may be repeated once in, for example, five iterations.
11. If there is no significant change in the values of α and σ², then stop. Otherwise, go to step 4.

3.2 Implementation Details

1. Since we start from the empty model, the basis vector that gives the minimum GCV error is selected as the first basis vector (step 2). The details of this procedure are given in appendix B. The first basis vector
can also be selected based on the largest normalized projection onto the target vector, as suggested in Tipping and Faul (2003).
2. The initialization of the relevant variables in step 3 is described in appendix C.
3. Appendixes D and A describe the method to estimate the optimal α_i and the corresponding GCV error (step 5).
4. The variables in step 9 are updated using the details given in appendix E for the case in step 6 (reestimation) or step 8 (deletion) of the algorithm. The details corresponding to step 7 (addition) are given in appendix F.
5. In practice, numerical accuracy may be affected as the iterations progress. More specifically, the quantities γ_m and u_m, which are expected to remain nonnegative, may become negative while updating the variables. When either of these two quantities becomes negative, it is a good idea to compute the quantities in equations 3.1 to 3.6 afresh using direct computations. If the problem still persists (this typically happens when the matrix P becomes ill conditioned, for example, when the width parameter z used in a gaussian kernel is large), we terminate the algorithm.
6. When the noise level is also to be estimated (step 10), all the relevant variables are calculated afresh. This computation is simplified by expanding the matrix P using equation 2.2.
7. In an experiment, some of the α_j may reach 0, as can be seen in appendix A. This may affect the solution. Therefore, it is useful to set such α_j to a relatively small value, α_min. Setting this value to 1/N was found to work well in practice.

3.3 Storage and Computational Complexity.

The storage requirements of the FGCV algorithm are greater than those of the fast RVM (FRVM) algorithm in Tipping and Faul (2003) and arise from the additional variables ξ_m and u_m to be maintained. However, they are still linear in N. The computational requirements of the proposed algorithm are similar to those of the FRVM algorithm. Step 5 of the algorithm has a computational complexity of O(N). This is possible using the expressions given in appendix D. The computational complexity of the reestimation or deletion of a basis vector is O(LN), while that of the addition of a basis vector is O(N²). Step 10 of the algorithm, however, has a computational complexity of O(LN²), as it requires reestimating the relevant variables. We observed that the FGCV algorithm was 1.25 to 1.5 times slower than the FRVM algorithm in our simulations for these main reasons:
• The error function is shallow near the solution, which results in more iterations.
• There is a higher number of relevance vectors at the solution.
• The additional variables, ξ_m and u_m, are updated.

Figure 1: Results on the Friedman2 data set. [Box plots of the test NMSE (top) and the number of basis vectors (bottom) for algorithms 1–4.]

Table 1: Statistical Significance (Wilcoxon Signed Rank) Test Results on the Friedman2 Data Set.

         FGCV       FRVM      RFS
OLS      .47        9.4e-5    .119
RFS      .013       .097
FRVM     1.3e-12
4 Simulations

The proposed FGCV algorithm is evaluated on four popular benchmark data sets and compared with the algorithms described in Tipping and Faul (2003), Chen et al. (1991), and Orr (1995b), referred to as fast RVM (FRVM), orthogonal least squares (OLS), and regularized forward selection (RFS), respectively. Two of these data sets were generated as described by Friedman (1991) and are referred to as Friedman2 and Friedman3. The input
Figure 2: Results on the Friedman3 data set. [Box plots of the test NMSE (top) and the number of basis vectors (bottom) for algorithms 1–4.]

Table 2: Statistical Significance (Wilcoxon Signed Rank) Test Results on the Friedman3 Data Set.

         FGCV       FRVM      RFS
OLS      4.6e-9     .002      .054
RFS      2.1e-6     .139
FRVM     3.0e-6
dimension for each of these data sets was four. For these data sets, the training set consisted of 200 randomly generated examples, while the test set had 1000 noise-free examples and the experiment was repeated 100 times for different training set examples. We report the normalized mean squared error (NMSE) (normalized with respect to the output variance) on the test set. The third data set used was the Boston housing data set obtained from the StatLib archive (http://lib.stat.cmu.edu/datasets/boston). This data set comprises 506 examples with 13 variables. The data were split into 481/25 training/testing splits randomly, and the partitioning was repeated 100 times independently. The fourth data set used was the Abalone data set (ftp://ftp.ics.uci.edu/pub/machine-learning-databases/abalone/). After mapping the gender encoding (male/female/infant) into {(1,0,0), (0,1,0),
Figure 3: Results on the Boston Housing data set. [Box plots of the test MSE (top) and the number of basis vectors (bottom) for algorithms 1–4.]

Table 3: Statistical Significance (Wilcoxon Signed Rank) Test Results on the Boston Housing Data Set.

         FGCV       FRVM      RFS
OLS      2.6e-5     .096      .156
RFS      .003       .647
FRVM     2.0e-5
(0,0,1)}, the 10-dimensional data were split into 3000/1177 training/testing splits randomly. The partitioning was repeated 10 times independently. (The exact partitions for the last two data sets were obtained from http://www.gatsby.ucl.ac.uk/~chuwei/regression.html.) For all the data sets, a gaussian kernel was used, and the width parameter was chosen by fivefold cross-validation. For the OLS and RFS algorithms, the readily available Matlab functions (http://www.anc.ed.ac.uk/~mjo/software/rbf2.zip) were used with the default settings. We adhered to the guidelines provided in Tipping and Faul (2003) for FRVM. The results obtained using the four algorithms (FGCV, algorithm 1; FRVM, algorithm 2; RFS, algorithm 3; OLS, algorithm 4) on these data sets are presented in Figures 1 to 4. From these box plots, it is clear that
Figure 4: Results on the Abalone data set. [Box plots of the test MSE (top) and the number of basis vectors (bottom) for algorithms 1–4.]

Table 4: Statistical Significance (Wilcoxon Signed Rank) Test Results on the Abalone Data Set.

         FGCV       FRVM      RFS
OLS      9.8e-4     9.8e-4    .403
RFS      .023       .005
FRVM     .462
the FGCV algorithm generalizes well compared to the other algorithms. However, this happens at the expense of a moderate increase in the number of basis vectors as compared to the FRVM algorithm. It is worth noting that the sparseness of the solution obtained from the FGCV algorithm is still very good. On the Abalone data set, the FRVM algorithm is slightly better than the FGCV algorithm on average. The box plots in Figures 1 to 4 show that the distribution of MSE is nonsymmetrical. Therefore, we used Wilcoxon matched-pair signed-rank tests to compare the four algorithms; the results are given in Tables 1 to 4. Each box in the table compares an algorithm in the column to an algorithm in the row. The null hypothesis is that the two medians of the test error are the
same, while the alternative hypothesis is that they are different. The p-value of this hypothesis test is given in the box. The following comparisons are made with respect to a significance level of 0.05. If a p-value is smaller than 0.05, then the algorithm in the column (row) is statistically superior to the algorithm in the row (column). Table 1 suggests that for the Friedman2 data set, the FGCV algorithm is statistically superior to the FRVM and RFS algorithms, but it is not significantly different from the OLS algorithm. The FGCV algorithm is statistically superior to all the other algorithms on the Friedman3 and Boston Housing data sets, as is evident from Tables 2 and 3. For the Abalone data set, the results in Table 4 show that the performances of the FGCV and the FRVM algorithms are not significantly different. However, these algorithms are statistically superior to the OLS and RFS algorithms.

We also compared the FGCV and FRVM algorithms with the gaussian process regression (GPR) algorithm on the Boston Housing and Abalone data sets (results reported in http://www.gatsby.ucl.ac.uk/~chuwei/regression.html). These comparisons were also done using the Wilcoxon matched-pair signed-rank test with a significance level of .05. For the Boston Housing data set, the performance comparison gave p-values of 3.4e-10 (FGCV) and 1.22e-12 (FRVM). This shows that the GPR algorithm (with all basis vectors) is statistically superior to both the FGCV and FRVM algorithms on this data set. A similar comparison on the Abalone data set resulted in p-values of .012 (FGCV) and .095 (FRVM). On this data set, the GPR algorithm is statistically superior to the FGCV algorithm, while it is not statistically superior to the FRVM algorithm.
5 Conclusion

We have proposed a fast, incremental GCV algorithm for designing sparse regression models. This algorithm is very efficient and constructs the model sequentially, starting from an empty model. In each iteration, it adds, deletes, or reestimates basis vectors depending on the maximum reduction in the GCV error. The experimental results suggest that, considering the requirements of sparseness, good generalization performance, and computational complexity, the FGCV algorithm is competitive. Clearly, this algorithm is an excellent alternative to the FRVM algorithm of Tipping and Faul (2003).

We mainly compared our approach against OLS, RFS, and FRVM since they are quite directly related to our problem formulation. We also compared against GPR. We did not compare against the other sparse incremental GP algorithms mentioned in Csato and Opper (2002), Lawrence et al. (2003), Seeger et al. (2003), and Smola and Bartlett (2000), since we felt that those methods would be slightly inferior to GPR. But those comparisons could be interesting, especially if we compare at the same levels of sparsity. This will
be taken up in future work. It will also be interesting to extend the proposed algorithm to classification problems.

Appendix A: α Estimation Using GCV Approach

Using equations 2.5 to 2.9, the numerator of the derivative of the GCV error with respect to α_j can be shown to take the form g_j + h_j α_j, where

g_j = (δ_j b_j − a_j Λ_j) ψ_j − (δ_j c_j − b_j Λ_j),   h_j = (δ_j b_j − a_j Λ_j) σ²,   and   ψ_j = φ_j^T P_j φ_j.   (A.1)
It is easy to verify that the denominator of the derivative of the GCV error with respect to α_j is nonnegative. Noting that α_j ≥ 0, the solution can be obtained directly using g_j and h_j, or using the sign information of the gradient. More specifically, the optimal solution α_j (lying in the interval [0, ∞)) is obtained from one of the following possibilities:

• If g_j < 0 and h_j < 0, then α_j = ∞.
• If g_j > 0 and h_j > 0, then α_j = 0.
• If g_j < 0 and h_j > 0, then a unique solution exists and is given by α_j = −g_j/h_j.
• If g_j > 0 and h_j < 0, then α_j = −g_j/h_j is a zero crossing, but the derivative changes from a positive to a negative value while crossing it. Therefore, the solution can lie at either 0 or ∞; in this case, we evaluate the function value at 0 and ∞ and choose the right one.
• When h_j = 0, the solution depends on the sign of g_j.
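This case analysis is mechanical enough to transcribe directly; the following sketch is illustrative (in the ambiguous fourth case it returns None, leaving the caller to compare the GCV error at 0 and ∞):

```python
import math

def optimal_alpha(g_j, h_j):
    """Minimizer over [0, inf) of a function whose derivative has numerator
    g_j + h_j * alpha and a nonnegative denominator (appendix A)."""
    if h_j == 0:
        return math.inf if g_j < 0 else 0.0
    if g_j < 0 and h_j < 0:
        return math.inf            # derivative negative everywhere
    if g_j > 0 and h_j > 0:
        return 0.0                 # derivative positive everywhere
    if g_j < 0 and h_j > 0:
        return -g_j / h_j          # unique interior minimum
    return None                    # g_j > 0, h_j < 0: compare GCV at 0 and inf
```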
Appendix B: Selection of the First Basis Vector

Two quantities that are of interest in finding the GCV error for all the basis vectors are given by

t^T P_m² t = a_m − 2b_m/Δ_m + c_m/Δ_m²   (B.1)

tr(P_m) = N − φ_m^T φ_m / (φ_m^T φ_m + α_m σ²),   (B.2)

where P_m = I − φ_m S_m⁻¹ φ_m^T / σ², S_m = (φ_m^T φ_m + α_m σ²)/σ², and Δ_m = σ² S_m. Then the optimal solution α_m can be obtained, as described in appendix A, from the coefficients 2.5 to 2.9 using P_j = I. After substituting the optimal solution α_m into equations B.1 and B.2, we can evaluate the GCV error for a given m.
Finally, the basis vector j is selected as the index for which the GCV error is minimum.

Appendix C: Initialization

Once the first basis vector, φ_j, is selected based on the GCV error with the optimal α_j, initialization of all the relevant quantities is done as follows:

r_m = t^T φ_m − (t^T φ_j)(φ_j^T φ_m)/Δ_j   (C.1)

γ_m = φ_m^T φ_m − (φ_j^T φ_m)²/Δ_j   (C.2)

ξ_m = t^T φ_m − 2(t^T φ_j)(φ_j^T φ_m)/Δ_j + (t^T φ_j)(φ_j^T φ_m)(φ_j^T φ_j)/Δ_j²   (C.3)

u_m = φ_m^T φ_m − 2(φ_j^T φ_m)²/Δ_j + (φ_j^T φ_j)(φ_j^T φ_m)²/Δ_j²   (C.4)

v = N − φ_j^T φ_j/Δ_j   (C.5)

q = t^T t − 2(t^T φ_j)²/Δ_j + (t^T φ_j)²(φ_j^T φ_j)/Δ_j²,   (C.6)

where Δ_j = φ_j^T φ_j + α_j σ². Further, w = t^T φ_j/Δ_j, S⁻¹ = [σ²/Δ_j], and Φ = [φ_j]. Note that S⁻¹ is a single-element matrix when there is a single basis vector.

Appendix D: Computing α_j
Here, we find the set of coefficients required to compute the optimal α_j from the set of quantities defined in equations 3.1 to 3.6. All the results given below are obtained using the rank-one update relationship between P and P_j. With

o_j = α_j σ² r_j / (α_j σ² − γ_j)   and   ψ_j = α_j σ² γ_j / (α_j σ² − γ_j),

we have

a_j = q + 2 ξ_j o_j/(α_j σ²) + Λ_j o_j²/Δ_j²   (D.1)

b_j = o_j (Λ_j o_j/Δ_j + ξ_j Δ_j/(α_j σ²))   (D.2)

c_j = Λ_j o_j²,   (D.3)

where Δ_j = ψ_j + α_j σ² and Λ_j = u_j Δ_j²/(α_j σ²)². Further,

g_j = (δ_j b_j − a_j Λ_j) ψ_j − (δ_j c_j − b_j Λ_j)   (D.4)

h_j = (δ_j b_j − a_j Λ_j) σ²,   (D.5)
where δ_j = v + Λ_j/Δ_j. For j not in the relevance vector set, we do not have to find the above set of quantities with P_j, since α_j = ∞ and P = P_j. Therefore, the quantities in equations 2.5 to 2.9 required to compute the optimal α_j can be found in a much simpler way. Next, the optimal solution α_j is obtained using g_j and h_j, as mentioned earlier. (See the discussion in the paragraph below equation A.1.)
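Assembling these coefficients from the maintained quantities is then only a few lines; the sketch below is a hedged transcription of equations D.1 to D.5 as reconstructed here, for a basis vector currently in the model (the function name is illustrative):

```python
def gcv_alpha_coefficients(r_j, gamma_j, xi_j, u_j, q, v, alpha_j, sigma2):
    """g_j and h_j of appendix D from the maintained quantities (eqs. 3.1-3.6)."""
    as2 = alpha_j * sigma2
    o_j = as2 * r_j / (as2 - gamma_j)
    psi_j = as2 * gamma_j / (as2 - gamma_j)
    delta_cap = psi_j + as2                     # Delta_j
    lam = u_j * delta_cap**2 / as2**2           # Lambda_j
    a_j = q + 2 * xi_j * o_j / as2 + lam * o_j**2 / delta_cap**2       # (D.1)
    b_j = o_j * (lam * o_j / delta_cap + xi_j * delta_cap / as2)       # (D.2)
    c_j = lam * o_j**2                                                 # (D.3)
    delta_j = v + lam / delta_cap
    g_j = (delta_j * b_j - a_j * lam) * psi_j - (delta_j * c_j - b_j * lam)  # (D.4)
    h_j = (delta_j * b_j - a_j * lam) * sigma2                               # (D.5)
    return g_j, h_j
```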
Appendix E: Reestimation and Deletion of a Basis Vector

Recall that the matrix P = I − Φ S⁻¹ Φ^T / σ². Then, with σ² fixed (at least for the iteration under consideration or for a few iterations), a change in α_j (which is essentially the reestimation process itself) results in a change in S⁻¹. Let s_j denote the jth column and s_jj the jth diagonal element of S⁻¹. Note that the computations for deletion are similar to the ones used for reestimation except for the coefficient K_j. In the case of deletion, K_j = 1/s_jj, and in the case of reestimation, K_j = (s_jj + (α̂_j − α_j)⁻¹)⁻¹. Here, α̂_j denotes the new optimal solution. The final set of computations required for the reestimation or deletion of basis vectors is given below. Defining ρ_jm = s_j^T Φ^T φ_m, we have

r_m = r_m + K_j w_j ρ_jm   (E.1)

γ_m = γ_m + (K_j/σ²) ρ_jm².   (E.2)

Defining χ_jm = s_j^T A S⁻¹ Φ^T φ_m, we have

u_m = u_m + (K_j/σ²) ρ_jm (2χ_jm + τ_j ρ_jm),   (E.3)

where τ_j = (K_j/σ²) s_j^T Φ^T Φ s_j. Defining κ_j = w^T A s_j, we have

ξ_m = ξ_m + K_j ρ_jm κ_j + K_j w_j (χ_jm + ρ_jm τ_j)   (E.4)

v = v + τ_j   (E.5)

q = q + K_j σ² w_j (2κ_j + τ_j w_j).   (E.6)

Finally, w = w − K_j w_j s_j and S⁻¹ = S⁻¹ − K_j s_j s_j^T. Although the set of equations given above is common to the reestimation and deletion procedures, in the deletion procedure the jth row and/or column is to be removed from S⁻¹, w, and Φ after making the necessary updates.
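A hedged transcription of these updates follows; Phi_all (all candidate basis vectors) and Phi (current relevance vectors) are conventions we introduce for the sketch, and the code follows the equations as reconstructed above rather than the authors' implementation.

```python
import numpy as np

def update_after_reestimation(K_j, j, Phi, Phi_all, A, S_inv, w,
                              r, gamma, xi, u, v, q, sigma2):
    """Apply the rank-one updates (E.1)-(E.6) after alpha_j changes.

    s_j is the jth column of S^{-1}; K_j follows the text's convention
    (1/s_jj for deletion, (s_jj + (alpha_hat - alpha)^{-1})^{-1} otherwise).
    """
    s_j = S_inv[:, j]
    w_j = w[j]
    Phi_s = Phi @ s_j                                   # N-vector
    rho = Phi_all.T @ Phi_s                             # rho_jm over all m
    chi = Phi_all.T @ (Phi @ (S_inv @ (A @ s_j)))       # chi_jm over all m
    tau = (K_j / sigma2) * (Phi_s @ Phi_s)              # tau_j
    kappa = w @ (A @ s_j)                               # kappa_j
    r = r + K_j * w_j * rho                                        # (E.1)
    gamma = gamma + (K_j / sigma2) * rho**2                        # (E.2)
    u = u + (K_j / sigma2) * rho * (2 * chi + tau * rho)           # (E.3)
    xi = xi + K_j * rho * kappa + K_j * w_j * (chi + rho * tau)    # (E.4)
    v = v + tau                                                    # (E.5)
    q = q + K_j * sigma2 * w_j * (2 * kappa + tau * w_j)           # (E.6)
    w = w - K_j * w_j * s_j
    S_inv = S_inv - K_j * np.outer(s_j, s_j)
    return r, gamma, xi, u, v, q, w, S_inv
```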
Appendix F: Adding a New Basis Vector

On adding the new basis vector j, the dimension of Φ changes, and a new finite α_j gets defined. Defining l_j = (1/σ²) S⁻¹ Φ^T φ_j, e_j = φ_j − Φ l_j, and µ_jm = e_j^T φ_m, we get

r_m = r_m − w_j µ_jm   (F.1)

γ_m = γ_m − (s_jj/σ²) µ_jm²,   (F.2)

where s_jj = 1/(γ_j/σ² + α_j) and w_j = (s_jj/σ²) r_j. Next, defining ν_jm = µ_jm − l_j^T A S⁻¹ Φ^T φ_m, we have

ξ_m = ξ_m − (s_jj/σ²) µ_jm (ξ_j − w_j u_j) − w_j ν_jm   (F.3)

u_m = u_m + (s_jj/σ²) µ_jm ((s_jj/σ²) u_j µ_jm − 2ν_jm).   (F.4)

Next,

v = v − (s_jj/σ²) u_j   (F.5)

q = q + w_j (w_j u_j − 2ξ_j).   (F.6)

Also,

S⁻¹ = [ S⁻¹ + s_jj l_j l_j^T    −s_jj l_j
        −s_jj l_j^T             s_jj ].   (F.7)

Finally, w = [ w − w_j l_j ; w_j ] and Φ = [Φ φ_j].
References

Bishop, C. (1995). Neural networks for pattern recognition. Oxford: Clarendon Press.

Chen, S., Cowan, C. F. N., & Grant, P. M. (1991). Orthogonal least squares learning for radial basis function networks. IEEE Transactions on Neural Networks, 2, 302–309.

Cortes, C., & Vapnik, V. N. (1995). Support vector networks. Machine Learning, 20, 273–297.

Csato, L., & Opper, M. (2002). Sparse on-line gaussian processes. Neural Computation, 14(3), 641–668.
Denison, D., & George, E. (2000). Bayesian prediction using adaptive ridge estimators (Tech. Rep.). London: Department of Mathematics, Imperial College. Available online at http://stats.ma.ic.ac.uk/dgtd/public_html/Papers/grr.ps.

Friedman, J. H. (1991). Multivariate adaptive regression splines. Annals of Statistics, 19, 1–141.

Golub, G. H., Heath, M., & Wahba, G. (1979). Generalized cross-validation as a method for choosing a good ridge parameter. Technometrics, 21, 215–223.

Hoerl, A. E., & Kennard, R. W. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12, 55–67.

Lawrence, N., Seeger, M., & Herbrich, R. (2003). Fast sparse gaussian process methods: The informative vector machine. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15 (pp. 609–616). Cambridge, MA: MIT Press.

Orr, M. J. L. (1995a). Local smoothing of radial basis function networks. In Proceedings of the International Symposium on Neural Networks. Hsinchu, Taiwan.

Orr, M. J. L. (1995b). Regularization in the selection of radial basis function centers. Neural Computation, 7(3), 606–623.

Seeger, M., Williams, C., & Lawrence, N. D. (2003). Fast forward selection to speed up sparse gaussian process regression. In C. M. Bishop & B. J. Frey (Eds.), Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics. San Francisco: Morgan Kaufmann.

Smola, A. J., & Bartlett, P. L. (2000). Sparse greedy gaussian process regression. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 619–625). Cambridge, MA: MIT Press.

Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions (with discussion). Journal of the Royal Statistical Society (Series B), 36, 111–147.

Sundararajan, S., & Keerthi, S. S. (2001). Predictive approaches for choosing hyperparameters in gaussian processes. Neural Computation, 13(5), 1103–1118.

Tipping, M. E. (2001). Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research, 1, 211–244.

Tipping, M. E., & Faul, A. (2003). Fast marginal likelihood maximisation for sparse Bayesian models. In C. M. Bishop & B. J. Frey (Eds.), Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics. San Francisco: Morgan Kaufmann.
Received August 29, 2005; accepted June 1, 2006.
Addendum

In “Low-Dimensional Maps Encoding Dynamics in Entorhinal Cortex and Hippocampus” by Dmitri Pervouchine, Theoden Netoff, Horacio Rotstein, John White, Mark Cunningham, Miles Whittington, and Nancy Kopell (Vol. 18, No. 11: 2617–2650), the following paragraph was omitted at the end of Appendix section A.2:

The synaptic gating variable s_j obeys first-order kinetics according to the equation

∂s_j/∂t = α_j (1 − s_j) [1 + tanh((v_j − v_th)/v_sl)] − β_j s_j,

where α_j and β_j are the respective synapse rise and decay rate constants, v_th = 0, and v_sl = 4. The decay rate constants were chosen to achieve the desired synapse decay time (for instance, β_j ≈ 0.2 for τ = 5, and β_j ≈ 0.05 for τ = 20). The rise rate constant α_j = 20 used in the simulations was not accessible in the experiments.
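For completeness, a minimal sketch of integrating this gating equation (the step size and function name are illustrative choices, and the parameter values follow the quoted paragraph):

```python
import math

def synaptic_gate_step(s, v, dt, alpha=20.0, beta=0.2, v_th=0.0, v_sl=4.0):
    """One Euler step of the gating kinetics from the addendum;
    beta ~ 0.2 corresponds to a decay time of tau = 5."""
    ds = alpha * (1.0 - s) * (1.0 + math.tanh((v - v_th) / v_sl)) - beta * s
    return s + dt * ds
```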
ARTICLE
Communicated by S. Coombes
The Astrocyte as a Gatekeeper of Synaptic Information Transfer

Vladislav Volman
[email protected]
School of Physics and Astronomy, Raymond and Beverly Sackler Faculty of Exact Sciences, Tel-Aviv University, 69978 Tel-Aviv, Israel

Eshel Ben-Jacob
[email protected]
School of Physics and Astronomy, Raymond and Beverly Sackler Faculty of Exact Sciences, Tel-Aviv University, 69978 Tel-Aviv, Israel, and Center for Theoretical Biological Physics, University of California at San Diego, La Jolla, CA 92093-0319, U.S.A.

Herbert Levine
[email protected]
Center for Theoretical Biological Physics, University of California at San Diego, La Jolla, CA 92093-0319, U.S.A.
We present a simple biophysical model for the coupling between synaptic transmission and the local calcium concentration on an enveloping astrocytic domain. This interaction enables the astrocyte to modulate the information flow from presynaptic to postsynaptic cells in a manner dependent on previous activity at this and other nearby synapses. Our model suggests a novel, testable hypothesis for the spike timing statistics measured for rapidly firing cells in culture experiments.

1 Introduction

In recent years, evidence has been mounting regarding the possible role of glial cells in the dynamics of neural tissue (Volterra & Meldolesi, 2005; Haydon, 2001; Newman, 2003; Takano et al., 2006). For astrocytes in particular, the specific association of processes with synapses and the discovery of two-way astrocyte-neuron communication have demonstrated the inadequacy of the previously held view regarding the purely supportive role of these glial cells. Instead, future progress requires rethinking how the dynamics of the coupled neuron-glial network can store, recall, and process information.

At the level of cell biophysics, some of the mechanisms underlying the so-called tripartite synapse (Araque, Parpura, Sanzgiri, & Haydon, 1999) are becoming clearer. For example, it is now well established that astrocytic

Neural Computation 19, 303–326 (2007)
© 2007 Massachusetts Institute of Technology
mGlu receptors detect synaptic activity and respond via activation of the calcium-induced calcium release pathway, leading to elevated Ca²⁺ levels. The spread of these elevations within a microdomain of one cell can coordinate the activity of disparate synapses that are associated with the same microdomain (Perea & Araque, 2002). Moreover, it might even be possible to transmit information directly from domain to domain and even from astrocyte to astrocyte if the excitation level is strong enough to induce either intracellular or intercellular calcium waves (Cornell-Bell, Finkbeiner, Cooper, & Smith, 1990; Charles, Merrill, Dirksen, & Sanderson, 1991; Cornell-Bell & Finkbeiner, 1991). One sign of the maturity of our understanding is the formulation of semiquantitative models for this aspect of neuron-glial communication (Nadkarni & Jung, 2004; Sneyd, Wetton, Charles, & Sanderson, 1995; Hofer, Venance, & Giaume, 2003).

There is also information flow in the opposite direction, from astrocyte to synapse. Direct experimental evidence for this, via the detection of the modulation of synaptic transmission as a function of the state of the glial cells, will be reviewed in more detail below. One of the major goals of this work is to introduce a simple phenomenological model for this interaction. The model will take into account both a deterministic effect of high Ca²⁺ in the astrocytic process, namely, the reduction of the postsynaptic response to incoming spikes on the presynaptic axon (Araque, Parpura, Sanzgiri, & Haydon, 1998a), and a stochastic effect, namely, the increase in the frequency of observed miniature postsynaptic current events uncorrelated with any input (Araque, Parpura, Sanzgiri, & Haydon, 1998b). There are also direct NMDA-dependent effects on the postsynaptic neuron of astrocyte-emitted factors (Perea & Araque, 2005), which are not considered here.

As we will show, the coupling allows the astrocyte to act as a gatekeeper for the synapse. By this, we mean that the amount of data transmitted across the synapse can be modulated by astrocytic dynamics. These dynamics may be controlled mostly by other synapses, in which case the gatekeeping will depend on dynamics external to the specific synapse under consideration. Alternatively, the dynamics may depend mostly on excitation from the selfsame synapse, in which case the behavior of the entire system is determined self-consistently. Here we focus on the latter possibility and leave for future work the discussion of how this mechanism could lead to multisynaptic coupling.

Our ideas regarding the role of the astrocyte offer a new explanation for observations regarding firing patterns in cultured neuronal networks. In particular, spontaneous bursting activity in these networks is regulated by a set of rapidly firing neurons, which we refer to as spikers; these neurons exhibit spiking even during long interburst intervals and hence must have some form of self-consistent self-excitation. We model these neurons as containing astrocyte-mediated self-synapses (autapses) (Segal, 1991, 1994; Bekkers & Stevens, 1991) and show that this hypothesis naturally accounts
for the observed unusual interspike interval distribution. Additional tests of this hypothesis are proposed at the end.

2 Experimental Observations

2.1 Cultured Neuronal Networks. The cultured neuronal networks presented here are self-generated from dissociated cultures of mixed cortical neurons and glial cells drawn from 1-day-old Charles River rats. The dissection, cell dissociation, and recording procedures were previously described in detail (Segev et al., 2002). Briefly, following dissection, neurons are dispersed by enzymatic treatment and mechanical dissociation. Then the cells are homogeneously plated on multielectrode arrays (MEA, Multi-Channel Systems), precoated with Poly-L-Lysine. The culture medium was DMEM (Sigma), enriched with serum, and was changed every two days. Plated cultures are placed on the MEA board (B-MEA-1060, Multi Channel Systems) for simultaneous long-term noninvasive recordings of neuronal activity from several neurons at a time. Recorded signals are digitized and stored for off-line analysis on a PC via an A-D board (Microstar DAP) and data acquisition software (Alpha-Map, Alpha Omega Engineering). Noninvasive recording of the network's activity (action potentials) is possible due to the capacitive coupling that some of the neurons form with some of the electrodes. Since typically one electrode can record signals from several neurons, a specially developed spike-sorting algorithm (Hulata, Segev, Shapira, Benveniste, & Ben-Jacob, 2000) is utilized to reconstruct single-neuron-specific spike series.

Although there are no externally provided guiding stimulations or chemical cues, relatively intense dynamical activity is spontaneously generated within several days. The activity is marked by the formation of synchronized bursting events (SBEs): short (∼200 ms) time windows during which most of the recorded neurons participate in relatively rapid firing (Segev & Ben-Jacob, 2001). These SBEs are separated by long intervals (several seconds or more) of sporadic neuronal firing of most of the neurons. A few neurons (referred to as spiker neurons) exhibit rapid firing even during the inter-SBE time intervals. These neurons also exhibit much faster firing rates during the SBEs, and their interspike interval distribution is marked by long-tail behavior (see Figure 4).

2.2 Interspike Interval (ISI) Increments Distribution. One of the tools used to compare model results with measured spike data concerns the distribution of increments in the spike times, defined as δ(i) = ISI(i + 1) − ISI(i), i ≥ 1. The distribution of δ(i) will have heavy tails if there is a wide range of interspike intervals and rapid transitions from one type of interval to the next. For example, rapid transitions from bursting events to occasional interburst firings will lead to such a tail. Applying this analysis to the recorded spike data of cultured cortical networks, Segev et al. (2002) found that distributions of neurons' ISI increments can be well fitted with Levy functions over three decades in time.
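As a minimal sketch of this analysis (the function name and binning are illustrative choices, not from the paper):

```python
import numpy as np

def isi_increment_distribution(spike_times, bins=50):
    """Normalized histogram of ISI increments delta(i) = ISI(i+1) - ISI(i)
    computed from a single neuron's array of spike times."""
    isi = np.diff(np.asarray(spike_times))
    delta = np.diff(isi)
    return np.histogram(delta, bins=bins, density=True)
```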
3 The Model

In this section we present the mathematical details of the models employed in this work. Readers interested mainly in the conclusions can skip directly to section 4. The basic notion we use is that standard synapse models must be modified to account for the astrocytic modulation, depending, of course, on the calcium level. In turn, the astrocytic calcium level is affected by synaptic activity; for this, we use the Li-Rinzel model, where the IP3 concentration parameter governing the excitability is increased on neurotransmitter release. These ingredients suffice to demonstrate what we mean by gatekeeping. Finally, we apply this model to the case of an autaptic oscillator, which requires the introduction of neuronal dynamics. For this, we chose the Morris-Lecar model as a generic example of a type-I firing system. None of our results would be altered by a different choice as long as we retain the tangent-bifurcation structure, which allows for arbitrarily long interspike intervals.

3.1 TUM Synapse Model. To describe the kinetics of a synaptic terminal, we have used the model of an activity-dependent synapse first introduced by Tsodyks, Uziel, and Markram (2000). In this model, the effective synaptic strength evolves according to the following equations:

ẋ = z/τ_rec − u x δ(t − t_sp),
ẏ = −y/τ_in + u x δ(t − t_sp),   (3.1)
ż = y/τ_in − z/τ_rec.
Here, x, y, and z are the fractions of synaptic resources in the recovered, active, and inactive states, respectively. For an excitatory glutamatergic synapse, the values attained by these variables can be associated with the dynamics of vesicular glutamate. As an example, the value of y in this formulation is proportional to the amount of glutamate that is being released during the synaptic event, and the value of x is proportional to the size of the readily releasable vesicle pool. The time series t_sp denotes the arrival times of presynaptic spikes, τ_in is the characteristic decay time of postsynaptic currents (PSCs), and τ_rec is the recovery time from synaptic depression. Upon arrival of a spike at the presynaptic terminal at time t_sp, a fraction u of available synaptic resources is transferred from the recovered
Table 1: Parameters Used in Simulations.

P0       0.5               η_mean   1.2 · 10⁻³ µA cm⁻²   σ        0.1
τ_Ca     4 sec             κ        0.5 sec⁻¹            τ_IP3    7 sec
r_IP3    7.2 mM sec⁻¹      IP3*     0.16 µM              c1       0.185
v1       6 sec⁻¹           v2       0.11 sec⁻¹           v3       0.9 µM sec⁻¹
k3       0.1 µM            d1       0.13 µM              d2       1.049 µM
d3       0.9434 µM         d5       0.08234 µM           a2       0.2 µM⁻¹ sec⁻¹
c0       2.0 µM            g_Ca     1.1 mS cm⁻²          g_K      2.0 mS cm⁻²
g_L      0.5 mS cm⁻²       V_Ca     100 mV               V_K      −70 mV
V_L      −35 mV            V1       −1 mV                V2       15 mV
V3       10 mV             V4       14.5 mV              φ        0.3
I_base   0.34 µA cm⁻²      τ_d      10 msec              τ_rec    100 msec
u        0.1               A        10.00 µA cm⁻²
state to the active state. Once in the active state, synaptic resources rapidly decay to the inactive state, from which they recover within a timescale τ_rec. Since the typical times are assumed to satisfy τ_rec ≫ τ_in, the model predicts the onset of short-term synaptic depression after a period of high-frequency repetitive firing. The onset of depression can be controlled by the variable u, which describes the effective use of synaptic resources by the incoming spike. In the original TUM model, the variable u is taken to be constant for an excitatory postsynaptic neuron; in what follows, we will set u = 0.1. Other parameter choices for these equations, as well as for the rest of the model equations, are presented in Table 1. To complete the specification, it is assumed that the resulting PSC, arriving at the model neuron's soma through the synapse, depends linearly on the fraction of available synaptic resources. Hence, the total synaptic current seen by a neuron is I_syn(t) = A y(t), where A stands for the absolute synaptic strength. At this stage, we do not take into account the long-term effects associated with the plasticity of neuronal somata and take the parameter A to be time independent.
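As a concrete sketch, one forward-Euler step of equations 3.1 can be written as below. Here u = 0.1 and τ_rec = 100 msec follow Table 1, while τ_in = 10 msec is an assumption (possibly corresponding to τ_d in Table 1; the model only requires τ_rec ≫ τ_in), as is the step size.

```python
def tum_step(x, y, z, dt, spike, u=0.1, tau_in=0.010, tau_rec=0.100):
    """One Euler step of the TUM synapse, equations 3.1 (times in seconds).

    `spike` is True if a presynaptic spike arrives during this step; the
    delta-function terms then move a fraction u of the recovered resources
    x instantaneously into the active state y. The PSC is I_syn = A * y.
    """
    dx = z / tau_rec
    dy = -y / tau_in
    dz = y / tau_in - z / tau_rec
    x, y, z = x + dt * dx, y + dt * dy, z + dt * dz
    if spike:
        released = u * x
        x -= released
        y += released
    return x, y, z
```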
3.2 Astrocyte Response. Astrocytes adjacent to synaptic terminals respond to neuronal action potentials by binding glutamate to their metabotropic glutamate receptors (Porter & McCarthy, 1996). The activation of these receptors then triggers the production of IP3, which consequently serves to modulate the intracellular concentration of calcium ions; the effective rate of IP3 production depends on the amount of transmitter released during the synaptic event. We therefore assume that the production of intracellular IP3 in the astrocyte is given by

d[IP3]/dt = (IP3* − [IP3])/τ_IP3 + r_IP3 y.   (3.2)

This equation is similar to the formulation used by Nadkarni and Jung (2004), with some important differences. First, the effective rate of IP3 production depends not on the potential of the neuronal membrane, but on the amount of neurotransmitter that is being released into the synaptic cleft. Hence, as the resources of the synapse are depleted (due to depression), there will be less transmitter released, and therefore IP3 will be produced at lower rates, leading eventually to a decay of the calcium concentration. Second, as neurotransmitter is released also during spontaneous synaptic events (noise), the latter will also influence the production of IP3 and the subsequent calcium oscillations.

3.3 Astrocyte. To model the dynamics of a single astrocytic domain, we use the Li-Rinzel model (Li & Rinzel, 1994; Nadkarni & Jung, 2004), which has been specifically developed to take into account the IP3-dependent dynamical changes in the concentration of cytosolic Ca²⁺. This is based on the theoretical studies of Nadkarni and Jung (2004), where it is decisively demonstrated that astrocytic Ca²⁺ oscillations may account for the spontaneous activity of neurons. The intracellular concentration of Ca²⁺ in the astrocyte is described by the following set of equations:

d[Ca²⁺]/dt = −J_chan − J_pump − J_leak   (3.3)
dq/dt = α_q (1 − q) − β_q q.   (3.4)

Here, q is the fraction of activated IP3 receptors. The fluxes of currents through the ER membrane are given by the following expressions:

J_chan = c1 v1 m∞³ n∞³ q³ ([Ca²⁺] − [Ca²⁺]_ER)   (3.5)

J_pump = v3 [Ca²⁺]² / (k3² + [Ca²⁺]²)   (3.6)

J_leak = c1 v2 ([Ca²⁺] − [Ca²⁺]_ER),   (3.7)

where

m∞ = [IP3] / ([IP3] + d1)   (3.8)

n∞ = [Ca²⁺] / ([Ca²⁺] + d5)   (3.9)

α_q = a2 d2 ([IP3] + d1) / ([IP3] + d3)   (3.10)

β_q = a2 [Ca²⁺].   (3.11)

The reversal Ca²⁺ concentration ([Ca²⁺]_ER) is obtained after requiring conservation of the overall Ca²⁺ concentration:

[Ca²⁺]_ER = (c0 − [Ca²⁺]) / c1.   (3.12)
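A hedged Euler-step sketch of the astrocyte equations 3.2 to 3.12 follows; the dictionary keys and function name are illustrative conventions, with parameter values meant to be taken from Table 1, and y_glu denotes the active synaptic resource fraction y that drives IP3 production.

```python
def li_rinzel_step(ca, q, ip3, y_glu, dt, p):
    """One Euler step of the astrocyte model, equations 3.2-3.12."""
    ca_er = (p['c0'] - ca) / p['c1']                                  # (3.12)
    m_inf = ip3 / (ip3 + p['d1'])                                     # (3.8)
    n_inf = ca / (ca + p['d5'])                                       # (3.9)
    j_chan = p['c1'] * p['v1'] * (m_inf * n_inf * q)**3 * (ca - ca_er)  # (3.5)
    j_pump = p['v3'] * ca**2 / (p['k3']**2 + ca**2)                   # (3.6)
    j_leak = p['c1'] * p['v2'] * (ca - ca_er)                         # (3.7)
    alpha_q = p['a2'] * p['d2'] * (ip3 + p['d1']) / (ip3 + p['d3'])   # (3.10)
    beta_q = p['a2'] * ca                                             # (3.11)
    ca_new = ca + dt * (-j_chan - j_pump - j_leak)                    # (3.3)
    q_new = q + dt * (alpha_q * (1 - q) - beta_q * q)                 # (3.4)
    ip3_new = ip3 + dt * ((p['ip3_star'] - ip3) / p['tau_ip3']
                          + p['r_ip3'] * y_glu)                       # (3.2)
    return ca_new, q_new, ip3_new
```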
3.4 Glia-Synapse Interaction. Astrocytes affect synaptic vesicle release in a calcium-dependent manner. Rather than attempt a complete biophysical model of the complex chain of events leading from calcium rise to vesicle release (Gandhi & Stevens, 2003), we proceed in a phenomenological manner. We define a dynamical variable f that phenomenologically captures this interaction. When the concentration of calcium in its synapse-associated process exceeds a threshold, we assume that the astrocyte emits a finite amount of neurotransmitter into the perisynaptic space, thus altering the state of a nearby synapse; this interaction occurs via glutamate binding to presynaptic mGlu and NMDA receptors (Zhang et al., 2004). As the internal astrocyte resource of neurotransmitter is finite, we include the saturation term (1 − f) in the dynamical equation for f. The final form is

ḟ = −f/τ_Ca²⁺ + (1 − f) κ Θ([Ca²⁺] − [Ca²⁺]_threshold).   (3.13)

Given this assumption, equations 3.1 should be modified to take this modulation into account. We assume the following simple form:

ẋ = z/τ_rec − (1 − f) u x δ(t − t_sp) − x η(f)   (3.14)

ẏ = −y/τ_in + (1 − f) u x δ(t − t_sp) + x η(f).   (3.15)

In equations 3.14 and 3.15, η(f) represents a noise term modeling the increased occurrence of mini-PSCs. The fact that a noise increase accompanies an amplitude decrease is partially due to competition for synaptic resources between these two release modes (Otsu et al., 2004). Based on experimental observations, we prescribe that the dependence of η(f) on f is such that the rate of noise occurrence (the frequency of η(f) in a fixed time step) increases with increasing f, but the amplitude distribution (modeled here as a gaussian-distributed variable centered around a positive mean) remains unchanged. For the rate of noise occurrence, we chose the following functional dependence:

P(f) = P0 exp(−[(1 − f)/(√2 σ)]²),   (3.16)
with P0 representing the maximal frequency of η(f) in a fixed time step. Note that although both synaptic terminals and astrocytes utilize glutamate for their signaling purposes, we assume the two processes to be independent. In so doing, we rely on existing biophysical experiments demonstrating that whereas the presynaptic terminal releases glutamate into the synaptic cleft, astrocytes selectively target extrasynaptic glutamate receptors (Araque et al., 1998a, 1998b). Hence, synaptic transmission does not interfere with the astrocyte-to-synapse signaling.

3.5 Neuron Model. We describe the neuronal dynamics with a simplified two-component Morris-Lecar model (Morris & Lecar, 1981),

V̇ = −I_ion(V, W) + I_ext(t)   (3.17)

Ẇ(V) = φ (W∞(V) − W(V)) / τ_W(V),   (3.18)
with I_ion(V, W) representing the contribution of the internal ionic Ca²⁺, K⁺, and leakage currents, with their corresponding channel conductivities g_Ca, g_K, and g_L being constant:

I_ion(V, W) = g_Ca m∞(V)(V − V_Ca) + g_K W(V)(V − V_K) + g_L(V − V_L).   (3.19)

I_ext represents all the external current sources stimulating the neuron, such as signals received through its synapses, glia-derived currents, and artificial stimulations, as well as any noise sources. In the absence of any such stimulation, the fraction of open potassium channels, W(V), relaxes toward its limiting curve (nullcline) W∞(V), which is described by the sigmoid function

W∞(V) = (1/2)[1 + tanh((V − V1)/V2)]   (3.20)

within a characteristic timescale given by

τ_W(V) = 1 / cosh((V − V1)/(2V2)).   (3.21)

In contrast to this, it is assumed in the Morris-Lecar model that calcium channels are activated instantaneously. Accordingly, the fraction of open Ca²⁺ channels obeys the following equation:

m∞(V) = (1/2)[1 + tanh((V − V3)/V4)].   (3.22)
For an isolated neuron, rendered with a single autapse, one has I_ext(t) = I_syn(t) + I_base, where I_syn(t) is the current arriving through the self-synapse and I_base is some constant background current. In this work, we assume that I_base is such that, when acting alone, it causes the neuron to fire at a very low constant rate. Of course, these two terms enter the equation additively, and the dynamics depends on the total external current. Nonetheless, it is important to separate these terms, as only one of them enters through the synapse; it is only this term that is modulated by astrocytic glutamate release and only this term that would be changed by synaptic blockers. As we will note later, the baseline current may also be due to astrocytes, albeit via a direct current injected into the neuronal soma. In anticipation of a better future understanding of this term, we consider it separately from the constant appearing in the leak current (g_L V_L), although there is clearly some redundancy in the way these two terms set the operating point of the neuron.

4 Results

4.1 Synaptic Model. In simple models of neural networks, the synapse is considered to be a passive element that directly transmits information, in the form of arriving spikes on the presynaptic terminal, to postsynaptic currents. It has been known for a long time that more complex synaptic dynamics can affect this transfer. One such effect concerns the finite reservoir of presynaptic vesicle resources and was modeled by Tsodyks, Uziel, and Markram (TUM) (Tsodyks et al., 2000). Spike trains with too high a frequency will be attenuated by a TUM synapse, as there is insufficient recovery from one arrival to the next. To demonstrate this effect, we fed the TUM synaptic model with an actual spike train recorded from a neuron in a cultured network (shown in Figure 1a); the resulting postsynaptic current (PSC) is shown in Figure 1b. As expected, there is attenuation of the PSC height during time windows with high rates of presynaptic spiking input.

4.2 Effect of Presynaptic Gating. Our goal is to extend the TUM model to include the interaction of the synapse with an astrocytic process imagined to be wrapped around the synaptic cleft. The effects of astrocytes on stimulated synaptic transmission are well established. Araque et al. (1998a) report that astrocyte stimulation reduced the magnitude of
Figure 1: The generic effect of an astrocyte on the presynaptic depression, as captured by our phenomenological model (see text for details). To illustrate the effect of presynaptic depression and the astrocyte influence, we feed a model synapse with the input of spikes taken from the recorded activity of a cultured neuronal network (see the text and Segev & Ben-Jacob, 2001 for details). (a) The input sequence of spikes that is fed into the model presynaptic terminal. (b) Each spike arriving at the model presynaptic terminal results in a postsynaptic current (PSC). The strength of the PSC depends on the amount of available synaptic resources, and the synaptic depression effect is clearly observable during spike trains with relatively high frequency. (c) The effect of a periodic gating function, f(t) = 0.5 + f_0 sin(ωt), shown in (d). The period of the oscillation, T = 2π/ω = 2 sec, is taken to be compatible with the typical timescales of variations in the intraglial Ca²⁺ concentration. Note the reduction in the PSC near the maxima of f, along with the elevated baseline resulting from the increase in the rate of spontaneous presynaptic transfer.
action-potential-evoked excitatory and inhibitory synaptic currents by decreasing the probability of evoked transmitter release. Specifically, presynaptic metabotropic glutamate receptors (mGluRs) have been shown to affect the stimulated synaptic transmission by regulating presynaptic voltage-gated calcium channels, which eventually leads to the reduction of calcium flux during the incoming spike and results in a decrease of the amplitude of synaptic transmission. These results are best shown in Figure 8 of their paper, which presents the amplitude of the evoked EPSC both before and after stimulation of an associated astrocyte. Note that we are referring here to “faithful” synapses—those that transmit almost all of the incoming spikes. Effects of astrocytic stimulation on highly stochastic synapses, namely, the increase in fidelity (Kang, Jiang, Goldman, & Nedergaard, 1998), are not studied here. In addition, astrocytes were shown to increase the frequency of spontaneous synaptic events. In detail, Araque et al. (1998b) have shown that
astrocyte stimulation increases the frequency of miniature postsynaptic currents (mPSCs) without modifying their amplitude distribution, suggesting that astrocytes act to increase the probability of vesicular release from the presynaptic terminal. Although the exact mechanism is unknown, this effect is believed to be mediated by NMDA receptors located at the presynaptic terminal. It is important to note that the two kinds of astrocytic influence on the synapse (decrease of the probability of evoked release and increase in the probability of spontaneous release) do not contradict each other. Evoked transmitter release depends on the calcium influx through calcium channels that can be inhibited by the activation of presynaptic mGluRs. On the other hand, the increase in the probability of spontaneous release follows from the activation of presynaptic NMDA channels. In addition, spontaneous activity can deplete the vesicle pool (in terms of either number or filling) and hence directly lower triggered release amplitudes (Otsu et al., 2004).

We model these effects by two modifications of the TUM model. First, we introduce a gating function f that modulates the stimulated release in a calcium-dependent manner. This term will cause the synapse to turn off at high calcium. This presynaptic gating effect is demonstrated in Figure 1c, where we show the resulting PSC corresponding to a case in which f is chosen to vary periodically with a timescale consistent with astrocytic calcium dynamics. The effect on the recorded spike train data is quite striking. The second effect, the increase of stochastic release in the absence of any input, is included as an f-dependent noise term in the TUM equations. This will be important as we turn to a self-consistent calculation of the synapse coupled to a dynamical astrocyte.

4.3 The Gatekeeping Effect. We close the synapse-glia-synapse feedback loop by including the effect of presynaptic activity on the intracellular Ca²⁺ dynamics in the astrocyte, which in turn sets the value of the gating function f. Nadkarni and Jung (2004) have argued that the basic calcium phenomenology in the astrocyte, arising via the glutamate-induced production of IP3, can be studied using the Li-Rinzel model. What emerges from their work is that the dynamics of the intra-astrocyte Ca²⁺ level depends on the intensity of the presynaptic spike train, acting as an information integrator over a timescale on the order of seconds; the level of Ca²⁺ in the astrocyte increases according to the summation of the synaptic spikes over time. If the total number of spikes is low, the Ca²⁺ concentration in the astrocyte remains below a self-amplification threshold level and simply decays back to its resting level with some characteristic time. However, things change dramatically when a sufficiently intense set of signals arrives across the synapse. Now the Ca²⁺ concentration overshoots its linear response level, followed by decaying oscillations.

Given our results, these high Ca²⁺ levels in the astrocyte will in fact attenuate spike information that arrives subsequent to strong bursts of activity. We illustrate this time-delayed gatekeeping (TDGK) effect in
Figure 2: The gatekeeping effect in a glia-gated synapse. (a) The input sequence of spikes, which is composed of several copies of the sequence shown in Figure 1, separated by segments of long quiescent time. The resulting time series may be viewed as bursts of action potentials arriving at the model presynaptic terminal. The first burst of spikes results in the elevation of the free astrocyte Ca²⁺ concentration (b), but this elevation alone is not sufficient to evoke an oscillatory response. An additional elevation of Ca²⁺, leading to the emergence of oscillation, is provided by the second burst of spikes arriving at the presynaptic terminal. Once the astrocytic Ca²⁺ crosses a predefined threshold, it starts to exert a modulatory influence back on the presynaptic terminal. In the model, this is manifested by the rising dynamics of the gating function (c). Note that as the decay time of the gating function f is on the order of seconds, the astrocyte influence on the presynaptic terminal persists even after the concentration of astrocyte Ca²⁺ has fallen. This is best seen from (d), where we show the profile of the PSC. The third burst of spikes arriving at the presynaptic terminal is modulated due to the astrocyte, even though the concentration of Ca²⁺ is relatively low at that time. This modulation extends also to the fourth burst of spikes, which together with the third burst leads again to the oscillatory response of astrocyte Ca²⁺. Taken together, these results illustrate a temporally nonlocal gatekeeping effect of glia cells.
Figure 2. We constructed a spike train by placing a time delay in between segments of recorded sequences. As can be seen, since the degree of activity during the first two segments exceeds the threshold level, there is attenuation of the late-arriving segments. Thus, the information passed through the synapse is modulated by previously arriving data.

4.4 Autaptic Excitatory Neurons. Our new view of synaptic dynamics will have broad consequences for making sense of neural circuitry. To illustrate this prospect, we turn to the study of an autaptic oscillator (Seung, Lee, Reis, & Tank, 2000), by which we mean an excitatory neuron
that exhibits repeated spiking driven at least in part by self-synapses (Segal, 1991, 1994; Bekkers & Stevens, 1991; Lubke, Markram, Frotscher, & Sakmann, 1996). By including the coupling of a model neuron to our synapse system, we can investigate both the case of an associated astrocyte with externally imposed temporal behavior and the case where the astrocyte dynamics is itself determined by feedback from this particular synapse. Finally, we should be clear that when we refer to one synapse, we are also dealing implicitly with the case of multiple self-synapses, all coupled to the same astrocytic domain, which in turn exhibits correlated dynamics in its processes connecting to these multiple sites. It is important to note that this same modulation can in fact correlate multiple synapses connecting distinct neurons coupled to the same astrocyte. The effect of this new multisynaptic coupling on the spatiotemporal flow of information in a model network will be described elsewhere.

We focus on an excitatory neuron modeled with Morris-Lecar dynamics, as described in section 3. We add some external bias current so as to place the neuron in a state slightly beyond the saddle-node bifurcation, where it would spontaneously oscillate at a very low frequency in the absence of any synaptic input. We then assume that this neuron has a self-synapse (autapse). An excitatory self-synapse clearly has the possibility of causing a much higher spiking rate than would otherwise be the case; this behavior without any astrocyte influence is shown in Figure 3. The existence of autaptic neurons was originally demonstrated in cultured networks (Segal, 1991, 1994; Bekkers & Stevens, 1991) but has been detected in intact neocortex as well (Lubke et al., 1996). Importantly, these can be either inhibitory or excitatory. There has been some speculation regarding the role of autapses in memory (Seung et al., 2000), but this is not our concern here.

Are such neurons observed experimentally? In Figure 4 we show a typical raster plot recorded from a cultured neural network grown from a dissociated mixture of glial and neuronal cortical cells taken from 1-day-old Charles River rats (see section 2). The spontaneous activity of the network is marked by synchronized bursting events (SBEs)—short (several 100s of ms) periods during which most of the recorded neurons show relatively rapid firing—separated by long (order of seconds) time intervals of sporadic neuronal firing of most of the neurons (Segev & Ben-Jacob, 2001). Only a small fraction of special neurons (termed spiker neurons) exhibit rapid firing also during inter-SBE intervals. These spiker neurons also exhibit much higher firing rates during the SBEs.

But the behavior of these rapidly firing neurons does not match that expected of the simple autaptic oscillator. The major differences, as illustrated by comparing Figures 3 and 4, are (1) the existence of long interspike intervals for the spikers, marked by a long-tail (Levy) distribution of the increments of the interspike intervals, and (2) the beating or burstlike rate modulation in the temporal ordering of the spike train.
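To make the simulation loop concrete, the sketch below gives one Euler step of the Morris-Lecar equations 3.17 to 3.22, followed by a schematic of the per-step update order for the glia-gated autaptic loop. The dictionary-based parameter passing, the step size, and the integration scheme are illustrative conventions, not the authors' code.

```python
import numpy as np

def morris_lecar_step(V, W, I_ext, dt, p):
    """One Euler step of equations 3.17-3.22 (voltages in mV)."""
    m_inf = 0.5 * (1.0 + np.tanh((V - p['V3']) / p['V4']))        # (3.22)
    w_inf = 0.5 * (1.0 + np.tanh((V - p['V1']) / p['V2']))        # (3.20)
    tau_w = 1.0 / np.cosh((V - p['V1']) / (2.0 * p['V2']))        # (3.21)
    i_ion = (p['gCa'] * m_inf * (V - p['VCa'])                    # (3.19)
             + p['gK'] * W * (V - p['VK'])
             + p['gL'] * (V - p['VL']))
    V_new = V + dt * (-i_ion + I_ext)                             # (3.17)
    W_new = W + dt * p['phi'] * (w_inf - W) / tau_w               # (3.18)
    return V_new, W_new

# Schematic per-step order for the glia-gated autaptic neuron:
#   I_ext = A * y + I_base                       (autaptic PSC plus bias)
#   V, W  = morris_lecar_step(V, W, I_ext, dt, p)
#   on a spike of V: update x, y, z with release scaled by (1 - f)  (eqs. 3.14-3.15)
#   Ca, q, IP3 updated from y via the Li-Rinzel step                (eqs. 3.2-3.12)
#   f updated from Ca via equation 3.13
```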
Figure 3: Activity of a model neuron containing a self-synapse (autapse), as modeled by the classical Tsodyks-Uziel-Markram model of synaptic transmission. In this case, it is possible to recover some of the features of cortical rapidly firing neurons, namely, the relatively high-frequency persistent activity. However, the resulting time series of action potentials for such a model neuron, shown in (a), is almost periodic. Due to the self-synapse, a periodic series of spikes results in the periodic pattern for the postsynaptic current, shown in (b), which closes the self-consistency loop by causing the model neuron to generate a periodic time series of spikes. A further difference between the model neuron and cortical rapidly firing neurons is seen upon comparing the corresponding distributions of ISI increments, plotted on a double-logarithmic scale. These distributions, shown in (c), disclose that, contrary to the cortical rapidly firing neurons, the increments distribution for the model neuron with a TUM autapse (diamonds) is gaussian (seen as a stretched parabola on a double-log scale), pointing to the existence of a characteristic timescale. On the other hand, distributions for cortical neurons (circles) decay algebraically and are much broader. The distribution of the model neuron has been vertically shifted for clarity of comparison.
Figure 4: Electrical activity of in-vitro cortical networks. These cultured networks are spontaneously formed from a dissociated mixture of cortical neurons and glial cells drawn from 1-day-old Charles River rats. The cells are homogeneously spread over a lithographically specified area of Poly-D-Lysine for attachment to the recording electrodes. The activity of a network is marked by the formation of synchronized bursting events (SBEs), short (∼100–400 msec) periods of time during which most of the recorded neurons are active. (a) Raster plot of recorded activity, showing a sample of a few SBEs. The time axis is divided into 10⁻¹ s bins. Each row is a binary bar code representation of the activity of an individual neuron; the bars mark detection of spikes. Note that while the majority of the recorded neurons are firing rapidly mostly during SBEs, some neurons are marked by persistent intense activity (e.g., neuron no. 12). This property supports the notion that the activity of these neurons is autonomous and hence self-amplified. (b) A zoomed view of a sample synchronized bursting event. Note that each neuron has its own pattern of activity during the SBE. To assess the differences in activity between ordinary neurons and neurons that show intense firing between the SBEs, for each neuron we constructed the series of increments of interspike intervals (ISI), defined as δ(i) = ISI(i + 1) − ISI(i), i ≥ 1. The distributions of δ(i), shown in (c), disclose that the dynamics of ordinary neurons (squares) is similar to the dynamics of rapidly firing neurons (circles), up to the timescale of 100 msec, corresponding to the width of a typical SBE. Note that since increments of interspike intervals are analyzed, the increased rate of neuron firing does not necessarily affect the shape of the distribution. Yet above the characteristic time of 100 msec, the distributions diverge, possibly indicating the existence of additional mechanisms governing the activity of rapidly firing neurons on a longer timescale. Note that for normal neurons, there is another peak at typical interburst intervals (> seconds), not shown here.
Motivated by the above and the glial gatekeeping effect studied earlier, we proceed to test whether an autaptic oscillator with a glial-regulated self-synapse brings the model into better agreement. In Figure 5 we show that the activity of such a modified model does exhibit the additional modulation. The basic mechanism results from the fact that after a period of rapid firing of the neuron, the astrocyte intracellular Ca²⁺ concentration (shown in Figure 5b) exceeds the critical threshold for time-delayed attenuation. This then stops the activity and gives rise to large interspike intervals. The distributions shown in Figure 5 are a much better match to experimental data for time intervals up to 100 msec.
5 Robustness Tests 5.1 Stochastic Li-Rinzel Model. One of the implicit assumptions of our model for astrocyte-synapse interaction is related to the deterministic nature of astrocyte calcium release. It is assumed that in the absence of any I P3 signals from the associated synapses, the astrocyte will stay “silent,” in the sense that there will be no spontaneous Ca 2+ events. However, it should be kept in mind that the equations for the calcium channel dynamics used in the context of Li-Rinzel model in fact describe the collective behavior of large numbers of channels. In reality, experimental evidence indicates that the calcium release channels in astrocytes are spatially organized in small clusters of 20 to 50 channels—the so-called microdomains. These microdomains were found to contain small membrane leaflets (of O(10 nm) thick), wrapping around the synapses and potentially able to synchronize ensembles of synapses. This finding calls for a new view of astrocytes as cells with multiple functional and structural compartments. The microdomains (within the same astrocyte) have been observed to generate the spontaneous Ca 2+ signals. As the passage of the calcium ions through a single channel is subject to fluctuations, the stochastic aspects can become important for small clusters of channels. Inclusion of stochastic effects can explain the generation of calcium puffs: fast, localized elevations of calcium concentration. Hence, it is important to test the possible effect of stochastic calcium events on the model’s behavior. We achieve this goal by replacing the deterministic Li-Rinzel model with its stochastic version, obtained using Langevin approximation, as has been recently described by Shuai and Jung (2003). With the Langevin approach, the equation for the fraction of open calcium channels is modified and takes the following form:
dq/dt = αq (1 − q) − βq q + ξ(t),   (5.1)
[Figure 5, panels (a)-(d): time series, 0-20 sec; panel (e): probability distribution vs. δ(ISI) [msec] on log-log axes.]
Figure 5: The activity of a model neuron containing a glia-gated autapse. The equations of synaptic transmission for this case have been modified to take into account the influence of the synaptically associated astrocyte, as explained in the text. The resulting spike time series, shown in (a), deviates from periodicity due to the slow modulation of the synapse by the adjacent astrocyte. The relatively intense activity at the presynaptic terminal activates astrocyte receptors, which in turn leads to the production of IP3 and subsequent oscillations of free astrocyte Ca2+ concentration. The period of these oscillations, shown in (b), is much larger than the characteristic time between spikes arriving at the presynaptic terminal. Because the Ca2+ dynamics is oscillatory, so also will be the dynamics of the gating function f, as is seen from (c), and the period of oscillations for f will follow the period of Ca2+ oscillations. The periodic behavior of f leads to slow periodic modulation of the PSC pattern (shown in (d)), which closes the self-consistency loop by causing a neuron to fire in a burstlike manner. Additional information is obtained after comparison of distributions for ISI increments, shown in (e). Contrary to results for the model neuron with a simple autapse (see Figure 4c), the distribution for a glia-gated autaptic model neuron (diamonds) now closely follows the distributions of two sample recorded cortical rapidly firing neurons (circles), up to the characteristic time of ∼100 msec, which corresponds to the width of a typical SBE. The heavy tails of the recorded distributions above this characteristic time indicate that network mechanisms are involved in shaping the form of the distribution on longer timescales.
[Figure 6, panels (a)-(d): time series, 0-80 sec.]
Figure 6: The dynamical behavior of an astrocyte-gated model autaptic neuron, including the stochastic release of calcium from the astrocyte's ER. Shown are the results of the simulation when calcium release from intracellular stores is mediated by a cluster of N = 10 channels. The generic form of the spike time series (shown in (a)) does not differ from that obtained for the deterministic model. Namely, even for the stochastic model, the neuron is still firing in a burstlike manner. Although the temporal profile of astrocyte calcium (b) is irregular, the resulting dynamics of the gating function (c) is relatively smooth, stemming from the choice of the gating function dynamics (being an integration over the calcium profile). As a result, the PSC profile (d) does not differ much from the corresponding PSC profile obtained for the deterministic model.
in which the stochastic term, ξ(t), has the following properties:

⟨ξ(t)⟩ = 0,   (5.2)

⟨ξ(t)ξ(t′)⟩ = [αq (1 − q) + βq q] δ(t − t′) / N.   (5.3)
In the limit of very large cluster size, N → ∞, the effect of stochastic Ca2+ release is not significant. By contrast, the dynamics of calcium release are greatly modified for small cluster sizes. A typical spike-time series of a glia-gated autaptic neuron, obtained for a cluster size of N = 10 channels, is shown in Figure 6a. Note that while there are considerable fluctuations in the concentration of astrocyte calcium (see Figure 6b), the dynamics of the gating function (see Figure 6c) is less irregular. This follows because our choice of the gating function corresponds to the integration of calcium events. We have also checked that the distribution of interspike intervals is practically unchanged (data not shown). All told, our results indicate that including the stochastic nature of the release of calcium from the astrocyte ER does not affect the dynamics of our model autaptic neuron in any significant way.
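As a concrete illustration of this scheme, the sketch below integrates equation 5.1 with the noise statistics of equations 5.2 and 5.3 using an Euler-Maruyama step. The rates alpha_q and beta_q are constant placeholders; in the actual model they follow the Li-Rinzel kinetics and depend on the instantaneous IP3 and Ca2+ concentrations.

```python
import numpy as np

# Euler-Maruyama integration of eq. 5.1 with the noise of eqs. 5.2 and 5.3.
# alpha_q and beta_q are constant placeholders, not the Li-Rinzel rates.

def simulate_q(N=10, T=20.0, dt=1e-3, q0=0.5, alpha_q=2.0, beta_q=1.0, seed=0):
    rng = np.random.default_rng(seed)
    q = np.empty(int(T / dt) + 1)
    q[0] = q0
    for i in range(q.size - 1):
        drift = alpha_q * (1.0 - q[i]) - beta_q * q[i]
        var = (alpha_q * (1.0 - q[i]) + beta_q * q[i]) / N   # eq. 5.3, 1/N scaling
        q[i + 1] = q[i] + drift * dt + np.sqrt(var * dt) * rng.normal()
        q[i + 1] = min(max(q[i + 1], 0.0), 1.0)              # keep fraction in [0, 1]
    return q

q_small = simulate_q(N=10)     # small cluster: strong fluctuations
q_large = simulate_q(N=1000)   # large cluster: approaches the deterministic limit
```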
[Figure 7: probability distribution vs. δ(ISI) [msec] on log-log axes, comparing τf = 4 sec, κ = 5×10⁻⁴ with τf = 40 sec, κ = 1×10⁻⁴.]
Figure 7: Distributions of interspike interval increments for the model of an astrocyte-gated autaptic neuron with slow dynamics of the gating function, as compared with the corresponding distribution for the deterministic Li-Rinzel model. Due to the slow dynamics of the gating function, the transitions between different phases of bursting are blurred, resulting in a weaker tail for the distribution of interspike interval increments.
5.2 The Correlation Time of the Gating Function. Another assumption made in our model concerns the dynamics of the gating function. We have assumed a simple first-order differential equation for the dynamics of our phenomenological gating function and have selected timescales that are believed to be consistent with the influence of astrocytes on synaptic terminals. However, because the exact nature of the underlying processes (and corresponding timescales) is unknown, it is important to test the robustness of the model to variations in the gating function dynamics. To do that, we altered the baseline dynamics of the gating function to have a slower characteristic decay time and a slower rate of accumulation; for example, we can set τf = 40 sec and κ = 0.1 sec⁻¹. Simulations show that the only effect is a slight blurring of the transition between different phases of the bursting, as would be expected. This can best be detected by looking at the distribution of interspike interval increments for the case of slow gating dynamics. The distribution, shown in Figure 7, has a weaker tail as compared to the distribution obtained for the faster gating dynamics. This result follows because for slower gating, the modulation of the postsynaptic current is weaker. Hence, the transitions from intense firing to low-frequency spiking are less abrupt, resulting in a relatively low proportion of large increments. It is worth remembering that large increments of interspike intervals reflect sudden changes in dynamics, which are eliminated by the blurring. Clearly, the model with fast gating does a better job of fitting the spiker data.

[Figure 8, panels (a)-(d): time series, 0-90 sec.]

Figure 8: The dynamical behavior of an astrocyte-gated model autaptic neuron with slowly oscillating background current. Shown are the results of the simulation when Ibase ∝ sin(2πt/T), T = 10 sec. The mean level of Ibase is set so as to put a neuron in the quiescent phase for half a period. The resulting spike time series (a) disclose the burstlike firing of a neuron, with the superimposed oscillatory dynamics of a background current. The variations in the concentration of astrocyte calcium (b) are much more temporally localized, and so is the resulting dynamics of the gating function (c). Consequently, the PSC profile (d) strongly reflects the burstlike synaptic transmission efficacy, thus forcing the neuron to fire in a burstlike manner and closing the self-consistency loop.

5.3 Time-Dependent Background Current. All of the main results were obtained under the assumption of a constant background current feeding into the neuronal soma, such that when acting alone, this current forces the model neuron to fire at a very low frequency. One may justly argue that there is no such thing as a constant current. Indeed, if the background current is to reflect biological reality, it should possess some dynamics. For example, a better match would be to imagine the background current to be associated with the activity in adjacent astrocytes (see, e.g., Angulo, Kozlov, Charpak, & Audinat, 2004). To test this, we simulated a glia-gated autaptic neuron subject to a slowly oscillating (T = 10 sec) background current. For this case, we found that the behavior of the model is generically the same. Yet now the transitions between the bursting phases are sharper (see Figure 8a). This in turn leads to sharper modulation of the postsynaptic currents (shown in Figure 8d). We can confirm this by noting that the distribution of interspike interval increments has a slightly heavier tail, as compared to the distribution obtained for
the case of constant background current (data not shown). On the other hand, replacing the constant current with the oscillating one introduces a typical frequency not seen in the actual spiker data. This artificial problem will presumably disappear when the background current is determined self-consistently as part of the overall network activity. Similarly, the key to extending the increment distributions to longer timescales seems to be getting the network feedback to the spikers to regulate the interburst timing, which at the moment is too regular. This will be presented in a future publication.

6 Discussion

In this article, we have proposed that the regulation of synaptic transmission by astrocytic calcium dynamics is a critical new component of neural circuitry. We have used existing biophysical experiments to construct a coupled synapse-astrocyte model to illustrate this regulation and explore its consequences for an autaptic oscillator, arguably the most elementary neural circuit. Our results can be compared to data taken from cultured neuron networks. This comparison reveals that the glial gatekeeping effect appears to be necessary for an understanding of the interspike interval distribution of observed rapidly firing spiker neurons, for timescales up to about 100 msec.

Of course, many aspects of our modeling are quite simplified as compared to the underlying biophysics. We have investigated the sensitivity of our results to the modification of some of the parameters of our model as well as the addition of more complex dynamics for the various parts of our system. Our results with regard to the interspike interval are exceedingly robust.

This work should be viewed as a step toward understanding the full dynamical consequences brought about by the strong reciprocal couplings between synapses and the glial processes that envelop them. We have focused on the fact that astrocytic emissions shut down synaptic transmission when the activity becomes too high. This mechanism appears to be a necessary part of the regulation of spiker activity; without it, spikers would fire too often, too regularly. Related work by S. Nadkarni and P. Jung (private communication, July 2005) focuses on a different aspect: that of increased fidelity of synaptic release (for otherwise highly stochastic synapses) due to glia-mediated increases in presynaptic calcium levels. As our working assumption is that the spikers are most likely to be neurons with "faithful" autapses, this effect does not play a role in our attempt to compare to the experimental data. It will of course be necessary to combine these two different pieces to obtain a more complete picture.

The application to spikers is just one way in which our new synaptic dynamics may alter our thinking about neural circuits. This particular application is appealing and informative but must at the moment be
considered an untested hypothesis. Future experimental work must test the assumption that spikers have significant excitatory autaptic coupling, that pharmacological blockage of the synaptic current reverts their firing to low-frequency, almost periodic patterns, and that cutting the feedback loop with the enveloping astrocyte eliminates the heavy-tail increment distribution. Work toward achieving these tests is ongoing.

In the experimental system, a purported autaptic neuron is part of an active network and would therefore receive input currents from the other neurons in the network. This more complex input would clearly alter the very-long-time interspike interval distribution, especially given the existence of a new interburst timescale in the problem. Similarly, the current approach of adding a constant background current to the neuron is not realistic; the actual background current, due to such processes as glial-generated currents in the cell soma, would again alter the long-time distribution. Preliminary tests have shown that these effects could extend the range of agreement between autaptic oscillator statistics and experimental measurements.

Just as the network provides additional input for the spiker, the spiker provides part of the stimulation that leads to the bursting dynamics. Future work will endeavor to create a fully self-consistent network model to explore the overall activity patterns of this system. One issue that needs investigation concerns the role that glia might have in coordinating the action of neighboring synapses. It is well known that a single astrocytic process might contact thousands of synapses; if the calcium excitation spreads from being a local increase in a specific terminus to being a more widespread phenomenon within the glial cell body, neighboring synapses can become dynamically coupled. The role of this extra complexity in shaping the burst structure and time sequence is as yet unknown.

Acknowledgments

We thank Gerald M. Edelman for insightful conversation about the possible role of glia. Eugene Izhikevich, Peter Jung, Suhita Nadkarni, and Itay Baruchi are acknowledged for useful comments and for the critical reading of an earlier version of this manuscript. V. V. thanks the Center for Theoretical Biological Physics for hospitality. This work has been supported in part by the NSF-sponsored Center for Theoretical Biological Physics (grant numbers PHY-0216576 and PHY-0225630) and by the Maguy-Glass Chair in Physics of Complex Systems.

References

Angulo, M., Kozlov, A., Charpak, S., & Audinat, E. (2004). Glutamate released from glial cells synchronizes neuronal activity in the hippocampus. J. Neurosci., 24(31), 6920–6927.
Araque, A., Parpura, V., Sanzgiri, R., & Haydon, P. (1998a). Glutamate-dependent astrocyte modulation of synaptic transmission between cultured hippocampal neurons. Eur. J. Neurosci., 10(6), 2129–2142.
Araque, A., Parpura, V., Sanzgiri, R., & Haydon, P. (1998b). Glutamate-dependent astrocyte modulation of synaptic transmission between cultured hippocampal neurons. J. Neurosci., 18(17), 6822–6829.
Araque, A., Parpura, V., Sanzgiri, R., & Haydon, P. (1999). Tripartite synapses: Glia, the unacknowledged partner. Trends in Neurosci., 22(5), 208–215.
Bekkers, J., & Stevens, C. (1991). Excitatory and inhibitory autaptic currents in isolated hippocampal neurons maintained in cell culture. Proc. Nat. Acad. Sci., 88, 7834–7838.
Charles, A., Merrill, J., Dirksen, E., & Sanderson, M. (1991). Inter-cellular signaling in glial cells: Calcium waves and oscillations in response to mechanical stimulation and glutamate. Neuron, 6, 983–992.
Cornell-Bell, A., & Finkbeiner, S. (1991). Ca2+ waves in astrocytes. Cell Calcium, 12, 185–204.
Cornell-Bell, A., Finkbeiner, S., Cooper, M., & Smith, S. (1990). Glutamate induces calcium waves in cultured astrocytes: Long-range glial signaling. Science, 247, 470–473.
Gandhi, A., & Stevens, C. (2003). Three modes of synaptic vesicular release revealed by single-vesicle imaging. Nature, 423, 607–613.
Haydon, P. (2001). Glia: Listening and talking to the synapse. Nat. Rev. Neurosci., 2(3), 185–193.
Hofer, T., Venance, L., & Giaume, C. (2003). Control and plasticity of inter-cellular calcium waves in astrocytes. J. Neurosci., 22, 4850–4859.
Hulata, E., Segev, R., Shapira, Y., Benveniste, M., & Ben-Jacob, E. (2000). Detection and sorting of neural spikes using wavelet packets. Phys. Rev. Lett., 85, 4637–4640.
Kang, J., Jiang, L., Goldman, S., & Nedergaard, M. (1998). Astrocyte-mediated potentiation of inhibitory synaptic transmission. Nat. Neurosci., 1, 683–692.
Li, Y., & Rinzel, J. (1994). Equations for inositol-triphosphate receptor-mediated calcium oscillations derived from a detailed kinetic model: A Hodgkin-Huxley like formalism. J. Theor. Biol., 166, 461–473.
Lubke, J., Markram, H., Frotscher, M., & Sakmann, B. (1996). Frequency and dendritic distributions of autapses established by layer-5 pyramidal neurons in developing rat cortex. J. Neurosci., 16, 3209–3218.
Morris, C., & Lecar, H. (1981). Voltage oscillations in the barnacle giant muscle fiber. Biophys. J., 35, 193–213.
Nadkarni, S., & Jung, P. (2004). Spontaneous oscillations of dressed neurons: A new mechanism for epilepsy? Phys. Rev. Lett., 91(26).
Newman, E. (2003). New roles for astrocytes: Regulation of synaptic transmission. Trends in Neurosci., 26(10), 536–542.
Otsu, Y., Shahrezaei, V., Li, B., Raymond, L., Delaney, K., & Murphy, T. (2004). Competition between phasic and asynchronous release for recovered synaptic vesicles at developing hippocampal autaptic synapses. J. Neurosci., 24(2), 420–433.
Perea, G., & Araque, A. (2002). Communication between astrocytes and neurons: A complex language. J. Physiol., 96, 199–207.
Perea, G., & Araque, A. (2005). Properties of synaptically evoked astrocyte calcium signals reveal synaptic information processing by astrocytes. J. Neurosci., 25, 2192–2203.
Porter, J., & McCarthy, K. (1996). Hippocampal astrocytes in situ respond to glutamate released from synaptic terminals. J. Neurosci., 16(16), 5073–5081.
Segal, M. (1991). Epileptiform activity in microcultures containing one excitatory hippocampal neuron. J. Neuroanat., 65, 761–770.
Segal, M. (2004). Endogenous bursts underlie seizurelike activity in solitary excitatory hippocampal neurons in microculture. J. Neurophysiol., 72, 1874–1884.
Segev, R., & Ben-Jacob, E. (2001). Spontaneous synchronized bursting activity in 2D neural networks. Physica A, 302, 64–69.
Segev, R., Benveniste, M., Hulata, E., Cohen, N., Paleski, A., Kapon, E., Shapira, Y., & Ben-Jacob, E. (2002). Long term behavior of lithographically prepared in-vitro neural networks. Phys. Rev. Lett., 88, 118102.
Seung, H., Lee, D., Reis, B., & Tank, D. (2000). The autapse: A simple illustration of short-term analog memory storage by tuned synaptic feedback. J. Comp. Neurosci., 9, 171–185.
Shuai, J., & Jung, P. (2003). Langevin modeling of intra-cellular calcium dynamics. In M. Falcke & D. Malchow (Eds.), Understanding calcium dynamics—experiments and theory (pp. 231–252). Berlin: Springer.
Sneyd, J., Wetton, B., Charles, A., & Sanderson, M. (1995). Intercellular calcium waves mediated by diffusion of inositol triphosphate: A two-dimensional model. Am. J. Physiology, 268, C1537–C1545.
Takano, T., Tian, G., Peng, W., Lou, N., Libionka, W., Han, X., & Nedergaard, M. (2006). Astrocyte-mediated control of cerebral blood flow. Nat. Neurosci., 9(2), 260–267.
Tsodyks, M., Uziel, A., & Markram, H. (2000). Synchrony generation in recurrent networks with frequency-dependent synapses. J. Neurosci., 20(RC50), 1–5.
Volterra, A., & Meldolesi, J. (2005). Astrocytes, from brain glue to communication elements: The revolution continues. Nat. Neurosci., 6, 626–640.
Zhang, Q., Pangrsic, T., Kreft, M., Krzan, M., Li, N., Sul, J., Halassa, M., van Bockstaele, E., Zorec, R., & Haydon, P. (2004). Fusion-related release of glutamate from astrocytes. J. Biol. Chem., 279, 12724–12733.
Received January 6, 2006; accepted May 25, 2006.
LETTER
Communicated by Gert Cauwenberghs
Thermodynamically Equivalent Silicon Models of Voltage-Dependent Ion Channels
Kai M. Hynna
[email protected]
Kwabena Boahen
[email protected]
Department of Bioengineering, University of Pennsylvania, Philadelphia, PA 19104, U.S.A.
We model ion channels in silicon by exploiting similarities between the thermodynamic principles that govern ion channels and those that govern transistors. Using just eight transistors, we replicate—for the first time in silicon—the sigmoidal voltage dependence of activation (or inactivation) and the bell-shaped voltage dependence of its time constant. We derive equations describing the dynamics of our silicon analog and explore its flexibility by varying various parameters. In addition, we validate the design by implementing a channel with a single activation variable. The design's compactness allows tens of thousands of copies to be built on a single chip, facilitating the study of biologically realistic models of neural computation at the network level in silicon.

1 Neural Models

A key computational component within the neurons of the brain is the ion channel. These channels display a wide range of voltage-dependent responses (Llinas, 1988). Some channels open as the cell depolarizes; others open as the cell hyperpolarizes; a third kind exhibits transient dynamics, opening and then closing in response to changes in the membrane voltage.

The voltage dependence of the channel plays a functional role. Cells in the gerbil medial superior olivary complex possess a potassium channel that activates on depolarization and helps in phase-locking the response to incoming auditory information (Svirskis, Kotak, Sanes, & Rinzel, 2002). Thalamic cells possess a hyperpolarization-activated cation current that contributes to rhythmic bursting in thalamic neurons during periods of sleep by depolarizing the cell from hyperpolarized levels (McCormick & Pape, 1990). More ubiquitously, action potential generation is the result of voltage-dependent dynamics of a sodium channel and a delayed rectifier potassium channel (Hodgkin & Huxley, 1952).

Researchers use a variety of techniques to study voltage-dependent channels, each possessing distinct advantages and disadvantages.

Neural Computation 19, 327–350 (2007)
© 2007 Massachusetts Institute of Technology
Neurobiologists perform experiments on real cells, both in vivo and in vitro; however, while working with real cells eliminates the need for justifying assumptions within a model, limitations in technology restrict recordings to tens of neurons. Computational neuroscientists create computer models of real cells, using simulations to test the function of a channel within a single cell or a network of cells. While providing a great deal of flexibility, computational neuroscientists are often limited by the processing power of the computer and must constantly balance the complexity of the model with practical simulation times. For instance, a Sun Fire 480R takes 20 minutes to simulate 1 second of the 4000-neuron network (M. Shelley, personal communication, 2004) of Tao, Shelley, McLaughlin, and Shapley (2004), just 1 mm² (a single hypercolumn) of a sublayer (4Cα) of the primary visual cortex (V1).

An emerging medium for modeling neural circuits is the silicon chip, a technique at the heart of neuromorphic engineering. To model the brain, neuromorphic engineers use the transistor's physical properties to create silicon analogs of neural circuits. Rather than build abstractions of neural processing, which make gross simplifications of brain function, the neuromorphic engineer designs circuit components, such as ion channels, from which silicon neurons are built. Silicon is an attractive medium because a single chip can have thousands of heterogeneous silicon neurons that operate in real time. Thus, network phenomena can be studied without waiting hours, or days, for a simulation to run.

To date, however, neuromorphic models have not captured the voltage dependence of the ion channel's temporal dynamics, a problem outstanding since 1991, the year Mahowald and Douglas (1991) published their seminal work on silicon neurons. Neuromorphic circuits are constrained by surface area on the silicon die, limiting their complexity, as more complex circuits translate to fewer silicon neurons on a chip. In the face of this trade-off, previous attempts at designing neuromorphic models of voltage-dependent ion channels (Mahowald & Douglas, 1991; Simoni, Cymbalyuk, Sorensen, Calabrese, & DeWeerth, 2004) sacrificed the time constant's voltage dependence, keeping the time constant fixed.

In some cases, however, this nonlinear property is critical. An example is the low-threshold calcium channel in the thalamus's relay neurons. The time constant for inactivation can vary over an order of magnitude depending on the membrane voltage. This variation defines the relative lengths of the interburst interval (long) and the burst duration (short) when the cell bursts rhythmically. Sodium channels involved in spike generation also possess a voltage-dependent inactivation time constant that varies from a peak of approximately 8 ms just below spike threshold to as fast as 1 ms at the peak of the voltage spike (Hodgkin & Huxley, 1952). Ignoring this variation by fixing the time constant alters the dynamics that shape the action potential. For example, lowering the maximum (peak) time constant would reduce the
size of the voltage spike due to faster inactivation of sodium channels below threshold. Inactivation of these channels is a factor in the failure of action potential propagation in Purkinje cells (Monsivais, Clark, Roth, & Hausser, 2005). On the other hand, increasing the minimum time constant at more depolarized levels—that is, near the peak voltage of the spike—would increase the width of the action potential, as cell repolarization begins once the potassium channels overcome the inactivating sodium channel. A wider spike could influence the cell's behavior through several mechanisms, such as those triggered by increased calcium entry through voltage-dependent calcium channels.

In this letter, we present a compact circuit that models the nonlinear dynamics of the ion channel's gating particles. Our circuit is based on linear thermodynamic models of ion channels (Destexhe & Huguenard, 2000), which apply thermodynamic considerations to the gating particle's movement, due to conformational changes of the ion channel protein in an electric field. Similar considerations of the transistor make clear that both the ion channel and the transistor operate under similar principles. This observation, originally recognized by Carver Mead (1989), allows us to implement the voltage dependence of the ion channel's temporal dynamics, while at the same time using fewer transistors than previous neuromorphic models that do not possess these nonlinear dynamics. With a more compact design, we can incorporate a larger number of silicon neurons on a chip without sacrificing biological realism.

The next section provides a brief tutorial on the similarities between the underlying physics of thermodynamic models and that of transistors. In section 3, we derive a circuit that captures the gating particle's dynamics. In section 4, we derive the equations defining the dynamics of our variable circuit, and section 5 describes the implementation of an ion channel population with a single activation variable. Finally, we discuss the ramifications of this design in the conclusion.

2 Ion Channels and Transistors

Thermodynamic models of ion channels are founded on Hodgkin and Huxley's (empirical) model of the ion channel. A channel model consists of a series of independent gating particles whose binary state—open or closed—determines the channel permeability. A Hodgkin-Huxley (HH) variable represents the probability of a particle being in the open state or, with respect to the channel population, the fraction of gating particles that are open. The kinetics of the variable are simply described by

(1 − u) ⇄ u   [forward rate α(V), backward rate β(V)],   (2.1)
where α(V) and β(V) define the voltage-dependent transition rates between the states (indicated by the arrows), u is the HH variable, and (1 − u) represents the closed fraction. α(V) and β(V) define the voltage-dependent dynamics of the two-state gating particle. At steady state, the total number of gating particles that are opening—that is, the opening flux, which depends on the number of channels closed and the opening rate—is balanced by the total number of gating particles that are closing (the closing flux). Increasing one of the transition rates—through, for example, a shift in the membrane voltage—will increase the respective flow of particles changing state; the system will find a new steady state with new open and closed fractions such that the fluxes again cancel each other out. We can describe these dynamics simply using a differential equation:

du/dt = α(V)(1 − u) − β(V) u.   (2.2)
The first term represents the opening flux, the product of the opening transition rate α(V) and the fraction of particles closed (1 − u). The second term represents the closing flux. Depending on which flux is larger (opening or closing), the fraction of open channels u will increase or decrease accordingly. Equation 2.2 is often expressed in the following form:

du/dt = −(1/τu(V)) (u − u∞(V)),   (2.3)
where

u∞(V) = α(V) / (α(V) + β(V)),   (2.4)

τu(V) = 1 / (α(V) + β(V)),   (2.5)
represent the steady-state level and time constant (respectively) for u. This form is much more intuitive to use, as it describes, for a given membrane voltage, where u will settle and how fast. In addition, these quantities are much easier for neuroscientists to extract from real cells through voltage-clamp experiments. We will come back to the form of equation 2.3 in section 4. For now, we will focus on the dynamics of the gating particle in terms of transition rates.

Figure 1: Energy diagram of a reaction. The transition rates between two states are dependent on the heights of the energy barriers (ΔGC and ΔGO), the differences in energy between the activated state (G*) and the initial states (GC or GO). Thus, the time constant depends on the height of the energy barriers, and the steady state depends on the difference in energy between the closed and open states (ΔG).

In thermodynamic models, state changes of a gating particle are related to changes in the conformation of the ion channel protein (Hill & Chen, 1972; Destexhe & Huguenard, 2000, 2001). Each state possesses a certain energy (see Figure 1), dependent on the interactions of the protein molecule with the electric field across the membrane. For a state transition to occur, the molecule must overcome an energy barrier (see Figure 1), defined as the difference in energy between the initial state and an intermediate activated state. The size of the barrier controls the rate of transition between states (Hille, 1992):

α(V) = α0 e^(−ΔGC(V)/RT),   (2.6)

β(V) = β0 e^(−ΔGO(V)/RT),   (2.7)

where α0 and β0 are constants representing base transition rates (at zero barrier height), ΔGC(V) and ΔGO(V) are the voltage-dependent energy barriers, R is the gas constant, and T is the temperature in Kelvin. Changes in the membrane voltage of the cell, and thus the electric field across the membrane, influence the energies of the protein's conformations differently, changing the sizes of the barriers and altering the transition rates between states. Increasing a barrier decreases the respective transition rate, slowing the dynamics, since fewer proteins will have sufficient energy. The steady state depends on the energy difference between the two states. For a difference of zero and equivalent base transition rates, particles are
equally distributed between the two states. Otherwise, the state with lower energy is the preferred one.

The voltage dependence of an energy barrier has many components, both linear and nonlinear. Linear thermodynamic models, as the name implies, assume that the linear voltage dependence dominates. This dependence may be produced by the movement of a monopole or dipole through an electric field (Hill & Chen, 1972; Stevens, 1978). In this situation, the above rate equations simplify to

α(V) = A e^(−b1(V − VH)/RT),   (2.8)

β(V) = A e^(−b2(V − VH)/RT),   (2.9)
where VH and A represent the half-activation voltage and rate, while b1 and b2 define the linear relationship between each barrier and the membrane voltage. The magnitude of the linear term depends on such factors as the net movement of charge or net change in charge due to the conformation of the channel protein. Thus, ion channels use structural differences to define different membrane voltage dependencies.

While linear thermodynamic models have simple governing equations, they possess a significant flaw: time constants can reach extremely small values at voltages where either α(V) or β(V) becomes large (see equation 2.5), which is unrealistic since it does not occur in biology. Adding nonlinear terms in the energy expansion of α(V) and β(V) can counter this effect (Destexhe & Huguenard, 2000). Other solutions involve either saturating the transition rate (Willms, Baro, Harris-Warrick, & Guckenheimer, 1999) or using a three-state model (Destexhe & Huguenard, 2001), where the forward and reverse transition rates between two of the states are fixed, effectively setting the maximum transition rate.

Linear models, however, bear the closest resemblance to the MOS transistor, which operates under similar thermodynamic principles. Short for metal oxide semiconductor, the MOS transistor is named for its structure: a metallic gate (today, a polysilicon gate) atop a thin oxide, which insulates the gate from a semiconductor channel. The channel, part of the body or substrate of the transistor, lies between two heavily doped regions called the source and the drain (see Figure 2).

There are two types of MOS transistors: negative or n-type (NMOS) and positive or p-type (PMOS). NMOS transistors possess a drain and a source that are negatively doped—areas where the charge carriers are negatively charged electrons. These two areas exist within a p-type substrate, a positively doped area, where the charge carriers are positively charged holes. A PMOS transistor consists of a p-type source and drain within an n-type well. While the rest of this discussion focuses on NMOS transistor operation, the same principles apply to PMOS transistors, except that the charge carrier is of the opposite sign.
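To see the flaw concretely, the sketch below evaluates a two-state linear model (equations 2.4, 2.5, 2.8, and 2.9) over a voltage sweep; all parameter values are illustrative choices rather than fitted channel data. The time constant collapses toward zero at both ends of the sweep, the unrealistic behavior that the saturation and three-state remedies above are designed to prevent.

```python
import numpy as np

# Two-state linear thermodynamic model (eqs. 2.4, 2.5, 2.8, 2.9).
# All parameter values are illustrative assumptions, not fitted data.
RT = 25.4e-3                               # thermal energy as a voltage scale [V]
A, b1, b2, VH = 100.0, -0.5, 0.5, -0.05    # b1 < 0 makes this an activation variable

V = np.linspace(-0.2, 0.1, 401)            # membrane voltage sweep [V]
alpha = A * np.exp(-b1 * (V - VH) / RT)    # opening rate, eq. 2.8
beta  = A * np.exp(-b2 * (V - VH) / RT)    # closing rate, eq. 2.9

u_inf = alpha / (alpha + beta)             # sigmoidal steady state, eq. 2.4
tau_u = 1.0 / (alpha + beta)               # time constant, eq. 2.5

# The linear model's flaw: tau_u -> 0 at both extremes, because one of the
# exponential rates diverges there.
print(f"tau range: {tau_u.min():.2e} .. {tau_u.max():.2e} s")
```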
Figure 2: MOS transistor. (a) Cross-section of an n-type MOS transistor. The transistor has four terminals: source (S), drain (D), gate (G), and bulk (B), sometimes referred to as the back-gate. (b) Symbols for the two transistor types: NMOS (left) and PMOS (right). The transistor is a symmetric device, and thus the direction of its current—by convention, the flow of positive charges—indicates the drain and the source. In an NMOS, current flows from drain to source, as indicated by the arrow. Conversely, current flows from source to drain in a PMOS.
In the subthreshold regime, charge flows across the channel by diffusion from the source end of the channel, where the density is high, to the drain, where the density is low. Governed by the same laws of thermodynamics that govern protein conformations, the density of charge carriers at the source and drain ends of the channel depends exponentially on the size of the energy barriers there (see Figure 3). These energy barriers exist due to a built-in potential difference, and thus an electric field, between the channel and the source or the drain.

Adjusting the voltage at the source, or the drain, changes the charge carriers' energy level. For the NMOS transistor's negatively charged electrons, increasing the source voltage decreases the energy level; hence, the barrier height increases. This decreases the charge density at that end of the channel, as fewer electrons have the energy required to overcome the barrier. The voltage applied to the gate, which influences the potential at the surface of the channel, has the opposite effect: increasing it (e.g., from VG to VG1 in Figure 3) decreases the barrier height—at both ends of the channel. Factoring in the exponential charge density dependence on barrier height yields the relationship between an NMOS transistor's channel current and its terminal voltages (Mead, 1989):

Ids = Ids0 (e^((κVGB − VSB)/UT) − e^((κVGB − VDB)/UT)),   (2.10)
where κ describes the relationship between the gate voltage and the potential at the channel surface. UT is called the thermal voltage (25.4 mV at room temperature), and Ids0 is the baseline diffusion current, defined by the barrier introduced when the oppositely doped regions (p-type and n-type) were fabricated. Note that, for clarity, UT will not appear in the remaining transistor current equations, as all transistor voltages from here on are given in units of UT. When VDB exceeds VSB by 4 UT or more, the drain term becomes negligible and is ignored; the transistor is then said to be in saturation.

Figure 3: Energy diagram of a transistor. The vertical dimension represents the energy of negative charge carriers (electrons) within an NMOS transistor, while the horizontal dimension represents location within the transistor. φS and φD are the energy barriers faced by electrons attempting to enter the channel from the source and drain, respectively. VD, VS, and VG are the terminal voltages, designated by their subscripts. During transistor operation, VD > VS, and thus φS < φD. VG1 represents another scenario with a higher gate voltage. (Adapted from Mead, 1989.)

The similarities in the underlying physics of ion channels and transistors allow us to use transistors as thermodynamic isomorphs of ion channels. In both, there is a linear relationship between the energy barrier and the controlling voltage. For the ion channel, either isolated charges or dipoles of the channel protein have to overcome the electric field created by the voltage across the membrane. For the transistor, electrons, or holes, have to overcome the electric field created by the voltage difference between the source, or drain, and the transistor channel. In both instances, the transport of charge across the energy barrier is governed by a Boltzmann distribution, which results in an exponential voltage dependence. In the next section, we use these similarities to design an efficient transistor representation of the gating particle dynamics.

3 Variable Circuit

Based on the discussion from the previous section, it is tempting to think we may be able to use a single transistor to model the gating dynamics of a channel particle completely. However, obtaining the transition rates solves only part of the problem. We still need to multiply the rates by the number of gating particles in each state to obtain the opening and closing fluxes and then integrate the flux difference to update the particle counts (see equation 2.2). A capacitor can perform the integration if we use charge to represent particle count and current to represent flux. The voltage on the capacitor, which is linearly proportional to its charge, yields the result.

Figure 4: Channel variable circuit. (a) The voltage uV represents the logarithm of the channel variable u. VO and VC are linearly related to the membrane voltage, with slopes of opposite sign. uH and uL are adjustable bias voltages. (b) Two transistors (N3 and N4) are added to saturate the variable's opening and closing rates; the bias voltages uτH and uτL set the saturation level.

The first sign of trouble appears when we attempt to connect a capacitor to the transistor's source (or drain) terminal to perform the integration. As the capacitor integrates the current, the voltage changes, and hence the transistor's barrier height changes. Thus, the barrier height depends on the particle count, which is not the case in biology; gating particles do not (directly) affect the barrier height when they switch state. Our only remaining option, the gate voltage, is unsuitable for defining the barrier height, as it influences the barrier at both ends of the channel identically. α(V) and β(V), however, demonstrate opposite dependencies on the membrane voltage; that is, one increases while the other decreases.

We can resolve this conundrum by connecting two transistors to a single capacitor (see Figure 4a). Each transistor defines an energy barrier for one of the transition rates: transistor N1 uses its source and gate voltages (uL and VC, respectively) to define the closing rate, and transistor N2 uses its drain and gate voltages (uH and VO) to define the opening rate (where uH > uL). We integrate the difference in transistor currents on the capacitor Cu to update the particle count. Notice that neither barrier
depends on the capacitor voltage uV. Thus, uV becomes representative of the fraction of open channels; it increases as particles switch to the open state.

How do we compute the fluxes from the transition rates? If uV directly represented the particle count, we would take the product of uV and the transition rates. However, we can avoid multiplying altogether if uV represents the logarithm of the open fraction rather than the open fraction itself. uV's dynamics are described by the differential equation

Cu duV/dt = Ids0 e^(κVO) (e^(−uV) − e^(−uH)) − Ids0 e^(κVC) e^(−uL)
          = Ids0 e^(κVO−uH) (e^(−(uV−uH)) − 1) − Ids0 e^(κVC−uL),   (3.1)
where Ids0 and κ are transistor parameters (defined in equation 2.10), and VO, VC, uH, uL, and uV are voltages (defined in Figure 4a). We assume N1 remains in saturation during the channel's operation; that is, uV > uL + 4 UT, making the drain voltage's influence negligible.

The analogies between equation 3.1 and equation 2.2 become clear when we divide the latter by u. Our barriers—N1's source-gate for the closing rate and N2's drain-gate for the opening rate—correspond to α(V) and β(V), while e^(−(uV−uH)) corresponds to u⁻¹. Thus, our circuit computes (and integrates) the net flux divided by u, the open fraction. Fortuitously, the net flux scaled by the open fraction is exactly what we need to update the fraction's logarithm, since d log(u)/dt = (du/dt)/u. Indeed, substituting uV = log u + uH—our log-domain representation of u—into equation 3.1 yields

(Qu/u) du/dt = Ids0 e^(κVO−uH) (1/u − 1) − Ids0 e^(κVC−uL)

du/dt = (Ids0/Qu) e^(κVO−uH) (1 − u) − (Ids0/Qu) e^(κVC−uL) u,   (3.2)
where Qu = Cu UT . If we design VC and VO to be functions of the membrane voltage V, equation 3.2 becomes directly analogous to equation 2.2. In linear thermodynamic models, the transition rates depend exponentially on the membrane voltage. We can realize this by designing VC and VO to be linear functions of V, albeit with slopes of opposite sign. The opposite slopes ensure that as the membrane voltage shifts in one direction, the opening and closing rates will change in opposite directions relative to each other. Thus far in our circuit design, we have not specified whether the variable activates or inactivates as the membrane voltage increases. Recall that for activation, the gating particle opens as the cell depolarizes, whereas for
inactivation, the gating particle opens as the cell hyperpolarizes. In our circuit, whether activation or inactivation occurs depends on how we define VO and VC with respect to V. Increasing VO, and decreasing VC, with V defines an activation variable, as at depolarized levels this results in α(V) > β(V). Conversely, increasing VC, and decreasing VO, with V defines an inactivation variable, as now at depolarized voltages, β(V) > α(V), and the variable will equilibrate in a closed state.

Naturally, our circuit has the same limitation that all two-state linear thermodynamic models have: its time constant approaches zero when either VO or VC grows large, as the transition rates α(V) and β(V) become unrealistically large. This shortcoming is easily rectified by imposing an upper limit on the transition rates, as has been done for other thermodynamic models (Willms et al., 1999). We realize this saturation by placing two additional transistors in series with the original two (see Figure 4b). With these transistors, the transition rates become

α(V) = (Ids0/Qu) e^(κVO−uH) / (1 + e^(κ(VO−uτH))),   (3.3)

β(V) = (Ids0/Qu) e^(κVC−uL) / (1 + e^(κ(VC−uτL))),   (3.4)
where the single exponentials in equation 3.2 are now scaled by an additional exponential term. The voltages uτH and uτL (fixed biases) set the maximum transition rate for opening and closing, respectively. That is, when VO < uτH − 4 UT, α(V) ∝ e^(κVO), a rate function exponentially dependent on the membrane voltage. But when VO > uτH + 4 UT, α(V) ∝ e^(κuτH), limiting the transition rate and fixing the minimum time constant for channel opening. The behavior of β(V) is similarly defined by VC's value relative to uτL.

In the following section, we explore how the channel variable computed by our circuit changes with the membrane voltage and how quickly it approaches steady state. To do so, we must relate the steady state and time constant to the opening and closing rates and specify how the circuit's opening and closing voltages depend on the membrane voltage.

4 Circuit Operation

To help understand the operation of the channel circuit and the influence of various circuit parameters, we will derive u∞(V) and τu(V) for the circuit in Figure 4b using equations 2.4 and 2.5 and the transistor rate equations (equations 3.3 and 3.4), limiting our presentation to the activation version. The derivation, and the influence of various parameters, is similar for the inactivation version.
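Before deriving the closed forms, a simulation sketch may help fix ideas. The code below evaluates the saturating rates of equations 3.3 and 3.4 for a clamped membrane voltage and computes the resulting first-order relaxation of u (equation 3.2), mimicking a voltage-clamp step. All voltages are in units of UT, and every numerical value is an illustrative assumption, not a bias from the fabricated chip; I0 stands for Ids0/Qu.

```python
import numpy as np

# Voltage-clamp sketch of the channel variable (eqs. 3.2-3.4). All values are
# illustrative assumptions; voltages are in units of UT.
kappa, I0 = 0.7, 0.014
phi_o, gamma_o = 0.0, 1.0                 # VO = phi_o + gamma_o * V (eq. 4.1)
phi_c, gamma_c = 30.0, 0.3                # VC = phi_c - gamma_c * V (eq. 4.2)
uH, uL, utauL = 10.0, 4.0, 26.0
utauH = utauL + (uH - uL) / kappa         # matches the two minimum time constants

def rates(V):
    VO, VC = phi_o + gamma_o * V, phi_c - gamma_c * V
    a = I0 * np.exp(kappa * VO - uH) / (1.0 + np.exp(kappa * (VO - utauH)))  # eq. 3.3
    b = I0 * np.exp(kappa * VC - uL) / (1.0 + np.exp(kappa * (VC - utauL)))  # eq. 3.4
    return a, b

def clamp_response(V, t, u0=0.0):
    """First-order relaxation of u toward u_inf under a clamped voltage V."""
    a, b = rates(V)
    u_inf, tau = a / (a + b), 1.0 / (a + b)
    return u_inf + (u0 - u_inf) * np.exp(-t / tau)

t = np.linspace(0.0, 5e-3, 500)           # 5 ms after the voltage step [s]
traces = {V: clamp_response(V, t) for V in (26.0, 30.0, 34.0)}  # cf. Figure 7
```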
For the activation version of the channel circuit (see Figure 4b), we define the opening and closing voltages' dependence on the membrane voltage as

VO = φo + γo V,   (4.1)

VC = φc − γc V,   (4.2)
where φo, γo, φc, and γc are positive constants representing the offsets and slopes for the opening and closing voltages. Additional circuitry is required to define these constants; one example is described in the next section. In this section, however, we will leave the definition as such while we derive the equations for the circuit.

Under certain restrictions (see appendix A), u's steady-state level has a sigmoidal voltage dependence:

u∞(V) = 1 / (1 + exp(−(V − Vmid_u)/V*u)),   (4.3)

where

Vmid_u = (φc − φo + (uH − uL)/κ) / (γo + γc),   (4.4)

V*u = UT / ((γo + γc) κ).   (4.5)
Figure 5a shows how the sigmoid arises from the transition rates and, through them, its relationship to the voltage biases. The midpoint of the sigmoid, where the open probability equals half, occurs when α (V) = β (V); it will thus shift with any voltage biases (φo , φc , uH , or uL ) that scale either of the transition rate currents. For example, increasing uH reduces α (V), shifting the midpoint to higher voltages. The slope of the sigmoid around the midpoint is defined by the slopes γo and γc of VO (V) and VC (V). To obtain the sigmoid shape, we restricted the effect of saturation. It is assumed that the bias voltages uτ H and uτ L are set such that saturation is negligible in the linear midsegment of the sigmoid (i.e., VO < uτ H − 4 UT and VC < uτ L − 4 UT when V ∼ Vmid u ). That is why α (V) and β (V) appear to be pure exponentials in Figure 5a. This restriction is reasonable as saturation is supposed to occur only for large excursions from Vmid u , where it imposes a lower limit on the time constant. Therefore, under this assumption, the sigmoid lacks any dependence on the biases uτ H and uτ L .
[Figure 5, panel (a): u∞(V) together with the opening and closing transition rates (dashed), plotted against V; panel (b): τu/τmin against V.]
Figure 5: Steady state and time constants for channel circuit. (a) The variable’s steady-state value (u∞ ) changes sigmoidally with membrane voltage (V), dictated by the ratio of the opening and closing rates (dashed lines). The midpoint occurs when the rates are equal, and hence its horizontal location is affected by the bias voltages (uH and uL ) applied to the circuit (see Figure 4b). (b) The variable’s time constant (τu ) has a bell-shaped dependence on the membrane voltage (V), dictated by the reciprocal of the opening and closing rates (dashed lines). The time constant diverges from these asymptotes at intermediate voltages, where neither rate dominates; it follows the reciprocal of their sum, peaking when the sum is minimized.
Under certain further restrictions (see appendix A), u's time constant has a bell-shaped voltage dependence:

τu(V) = τmin [1 + 1 / (e^((V−V1u)/V*1u) + e^(−(V−V2u)/V*2u))],   (4.6)

where

V1u = (uτH − φo)/γo,
V*1u = UT/(κγo),
V2u = (φc − uτH + (uH − uL)/κ)/γc,
V*2u = UT/(κγc),

and τmin = (Qu/Ids0) e^(−(κuτH − uH)). Figure 5b shows how the bell shape arises from the transition rates and, through them, its relationship to the voltage biases. For large excursions of the membrane voltage, one transition rate dominates, and the time constant closely follows its inverse. For small excursions, neither rate dominates, and the time constant diverges from the inverses, peaking at the membrane voltage where the sum of the transition rates is minimized.
To obtain the bell shape, we saturated the opening and closing rates at the same level by setting κuτH − uH = κuτL − uL. Though not strictly necessary, this assumption simplifies the expression for τu(V) by matching the minimum time constants at hyperpolarized and depolarized voltages, yielding the result given in equation 4.6. The bell shape also requires this so-called minimum time constant to be smaller than the peak time constant in the absence of saturation.

The free parameters within the circuit—φo, γo, φc, γc, uH, uL, uτH, and uτL—allow for much flexibility in designing a channel. Appendix B provides an expanded discussion on the influence of the various parameters in the equations above. In the following section, we present measurements from a simple activation channel designed using this circuit, which was fabricated in a standard 0.25 µm CMOS process.
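The closed forms are straightforward to evaluate directly. The sketch below plugs the parameter values later fitted to the fabricated channel (see Figure 8 in section 5) into equations 4.3 and 4.6; only the voltage grid is an assumption.

```python
import numpy as np

# Steady state (eq. 4.3) and time constant (eq. 4.6), evaluated with the values
# fitted to the fabricated channel (Figure 8 in section 5).
Vmid, Vstar = 423.0, 28.8          # Vmid_u, V*u [mV]
tau_min = 0.0425                   # [ms]
V1, V1s = 571.3, 36.9              # V1u, V*1u [mV]
V2, V2s = 169.8, 71.8              # V2u, V*2u [mV]

V = np.linspace(200.0, 700.0, 501)                                # [mV], assumed grid
u_inf = 1.0 / (1.0 + np.exp(-(V - Vmid) / Vstar))                 # eq. 4.3
tau_u = tau_min * (1.0 + 1.0 / (np.exp((V - V1) / V1s)
                                + np.exp(-(V - V2) / V2s)))       # eq. 4.6
```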
5 A Simple Activation Channel

Our goal here is to implement an activating channel to serve as a concrete example and examine its behavior through experiment. We start with the channel variable circuit, which computes the logarithm of the channel variable, and attach its output voltage to the gate of a transistor, which uses the subthreshold regime's exponential current-voltage relationship to invert the logarithm. The current this transistor produces, which is directly proportional to the variable, can be injected directly into a silicon neuron (Hynna & Boahen, 2001) or can be used to define a conductance (Simoni et al., 2004). The actual choice is irrelevant for the purposes of this article, which demonstrates only the channel variable.

In addition to the output transistor, we also need circuitry to compute the opening and closing voltages from the membrane voltage. For the opening voltage (VO), we simply use a wire to tie it to the membrane voltage (V), which yields a slope of unity (γo = 1) and an intercept of zero (φo = 0). For the closing voltage (VC), we use four transistors to invert the membrane voltage. The end result is shown in Figure 6. For the voltage inverter circuit we chose, the closing voltage's intercept φc = κVC0 (set by a bias voltage VC0) and its slope γc = κ²/(κ + 1) (set by the transistor parameter defined in equation 2.10). Since κ ≈ 0.7, the closing voltage has a shallower slope than the opening voltage, which makes the closing rate change more gradually, skewing the bell curve in the hyperpolarizing direction as intended for the application in which this circuit was used (Hynna, 2005).

This eight-transistor design captures the ion channel's nonlinear dynamics, which we demonstrated by performing voltage clamp experiments (see Figure 7). As the command voltage (i.e., step size) increases, the output current's time course and final amplitude both change. The clustering and speed at low and high voltages are what we would expect from a sigmoidal steady-state dependence with a bell-shaped time constant.
Figure 6: A simple activating channel. A voltage inverter (N5-8) produces the closing voltage (VC ); a channel variable circuit (N1-3) implements the variable’s dynamics in the log domain (uV ); and an antilog transistor (N4) produces a current (IT ) proportional to the variable. The opening voltage (VO ) is identical to the membrane voltage (V). The series transistor (N2) sets the minimum time constant at depolarized levels. The same circuit can be used to implement an inactivating channel simply by swapping VO and VC .
Figure 7: Channel circuit’s measured voltage-dependent activation. When the membrane voltage is stepped to increasing levels, from the same starting level, the output current becomes increasingly larger, approaching its steady-state amplitudes at varying speeds.
The relationship between this output current (IT) and the activation variable, defined as u = e^(uV−uH), has the form

IT = u^κ ĪT.   (5.1)
[Figure 8, panel (a): u∞ vs. membrane voltage (V); panel (b): τu (ms) vs. membrane voltage (V).]
Figure 8: Channel circuit's measured sigmoid and bell curve. (a) Dependence of activation on membrane voltage in steady state, captured by sweeping the membrane voltage slowly and recording the normalized current output. (b) Dependence of time constant on membrane voltage, extracted from the curves in Figure 7 by fitting exponentials. Fits (solid lines) are of equations 4.3 and 4.6: Vmid_u = 423.0 mV, V*u = 28.8 mV, τmin = 0.0425 ms, V1u = 571.3 mV, V*1u = 36.9 mV, V2u = 169.8 mV, V*2u = 71.8 mV.
Its maximum value is ĪT = e^(κuH − uG), and its exponent is κ ≈ 0.7 (the same transistor parameter). It is possible to achieve an exponent of unity, or even a square or a cube, but this requires a lot more transistors.

We measured the sigmoidal change in activation directly, by sweeping the membrane voltage slowly, and its bell-shaped time constant indirectly, by fitting the voltage clamp data in Figure 7 with exponentials. The results are shown in Figure 8; the relationship above was used to obtain u from IT. The solid lines are the fits of equations 4.3 and 4.6, which reasonably capture the behavior in both sets of data. The range in the time constant data is limited due to the experimental protocol used. Since we modulated only the step size from a fixed hyperpolarized position, we need a measurable change in the steady-state output current to be able to measure the temporal dynamics for opening. However, this worked to our benefit as, given the range of the fit, there was no need to modify equation 4.6 to allow the time constant to go to zero at hyperpolarized levels (this circuit omits the second saturation transistor in Figure 4b).

All of the extracted parameters from the fit are reasonably close to our expectations—based on equations 4.3 and 4.6 and our applied voltage biases—except for V*2u. For κ ≈ 0.7 and UT = 25.4 mV, we expected V*2u ≈ 125 mV, but our fit yielded V*2u ≈ 71.8 mV. There are two possible explanations, not mutually exclusive. First, the fact that equation 4.6 assumes the presence of a saturation transistor, in addition to the limited data along the left side of the bell curve, may have contributed to the underfitting of that value. Second, κ is not constant within the chip but possesses voltage dependence. Overall, however, the analysis matches the performance of the circuit reasonably well.
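The reduction used for Figure 8 can be sketched as follows: invert equation 5.1 to convert the measured output current into the activation variable, then fit the sigmoid of equation 4.3. The measurement arrays below are synthetic placeholders, and scipy's curve_fit is merely one convenient fitting choice.

```python
import numpy as np
from scipy.optimize import curve_fit

kappa = 0.7                              # transistor parameter

# Placeholder "measurements": membrane voltage [mV] and normalized output
# current. Real data would come from the chip's voltage-clamp sweeps.
V_m = np.linspace(300.0, 600.0, 31)
I_T = (1.0 / (1.0 + np.exp(-(V_m - 423.0) / 28.8))) ** kappa

u = I_T ** (1.0 / kappa)                 # invert eq. 5.1: u = (IT / IT_max)^(1/kappa)

def sigmoid(V, Vmid, Vstar):             # eq. 4.3
    return 1.0 / (1.0 + np.exp(-(V - Vmid) / Vstar))

popt, _ = curve_fit(sigmoid, V_m, u, p0=(400.0, 30.0))
Vmid_fit, Vstar_fit = popt               # recovers ~423.0 mV and ~28.8 mV here
```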
6 Conclusion

We showed that the transistor is a thermodynamic isomorph of a channel gating particle. The analogy is accomplished by considering the operation of both within the framework of energy models. Both involve the movement of charge within an electric field: for the channel, due to conformations of the ion channel protein; for the transistor, due to charge carriers entering the transistor channel. Using this analogy, we generated a compact channel variable circuit.

We demonstrated our variable circuit's operation by implementing a simple channel with a single activation variable, showing that the steady state is sigmoid and the time constant bell shaped. Our measured results, obtained through voltage clamp experiments, matched our analytical results, derived from knowledge of transistors. Bias voltages applied to the circuit allow us to shift the sigmoid and the bell curve and set the bell curve's height independently. However, the sigmoid's slope, and its location relative to the bell curve, which is determined by the slope, cannot be changed (it is set by the MOS transistor's κ parameter). Our variable circuit is not limited to activation variables: reversing the opening and closing voltages' linear dependence on the membrane voltage will change the circuit into an inactivation variable. In addition, channels that activate and inactivate are easily modeled by including additional circuitry to multiply the variable circuits' output currents (Simoni et al., 2004; Delbruck, 1991).

The change in temporal dynamics of gating particles plays a critical role in some voltage-gated ion channels. As discussed in section 1, the inactivation time constant of T channels in thalamic relays changes dramatically, defining properties of the relay cell burst response, such as the interburst interval and the length of the burst itself. Activation time constants are also influential: they can modify the delay with which the channel responds (Zhan, Cox, Rinzel, & Sherman, 1999), an important determinant of a neuron's temporal precision. Incorporating these nonlinear temporal dynamics into silicon models will yield further insights into neural computation.

Equally important in our design, not only were we able to capture the nonlinear dynamics of gating particles, we were able to do so using fewer transistors than previous silicon models. Rather than exploit the parallels between transistors and ion channels, as we did, previous silicon modelers attempted to "linearize" the transistor, to make it approximate a resistor. After designing a circuit to accomplish this, the resistor's value had to be adjusted dynamically, so more circuitry was added to filter the membrane voltage. The time constant of this filter was kept constant, sacrificing the ion channel's voltage-dependent nonlinear dynamics for simplicity. We avoided all these complications by recognizing that the transistor is a
thermodynamic isomorph of the ion channel and were thus able to produce a compact replica. The size of the circuit is an important consideration in silicon models, as smaller circuits translate to more neurons on a silicon die. To illustrate, a simple silicon neuron (Zaghloul & Boahen, 2004) with a single input synapse, possessing the activation channel from section 5, requires about 330 µm² of area. This corresponds to around 30,000 neurons on a silicon die 10 mm² in area. Adding circuits, such as inactivation for the channel, increases the area of the cell design, reducing the size of the population on the chip (assuming, of course, that the total area of the die remains constant). To compensate for larger cell footprints, we can either increase the size of the whole silicon die (which costs money) or simply incorporate multiple chips into the system, easily doubling or tripling the network size. And unlike computer simulations, the increase in network size comes at minimal cost in performance or "simulation" time. Of course, like all other modeling media, silicon has its drawbacks. For one, silicon is not forgiving of design flaws: once the chip has been fabricated, we are limited to manipulating our models using only the external voltage biases within our design. This places a great deal of importance on verification of the final design before submitting it for fabrication; the total time from starting the design to receiving the fabricated chip can be on the order of 6 to 12 months. An additional characteristic of the silicon fabrication process is mismatch, a term referring to the variability among fabricated copies of the same design within a silicon chip (Pavasovic, Andreou, & Westgate, 1994). Within an array of silicon neurons, this translates into heterogeneity within the population. While we can take steps to reduce the variability within an array, generally at the expense of area, this mismatch can also be considered a feature, since biology must deal with variability as well. When our silicon models reproduce biological phenomena despite this parameter variability, their robustness lends them credence. And when we discover network phenomena within our chips, these are likely to be found in biology as well, as they will be robust to biological heterogeneity. With the ion channel design described in this article and our ability to expand our networks at little cost, we have a great deal of potential for building many of the neural systems within the brain, which consists of numerous layers of cells, each possessing its own distinct characteristics. Not only do we have the opportunity to study the role of an ion channel within an individual cell, we also have the potential to study its influence on the dynamics of a population of cells, and hence its role in neural computation.
Appendix A: Derivations

We can use the transition rates for our channel circuit (see equations 3.3 and 3.4) to calculate the voltage dependence of its steady state and time constant. Starting with the steady-state equation, equation 2.4,
\[
u_\infty(V) = \frac{\alpha(V)}{\alpha(V)+\beta(V)}
= \frac{\dfrac{I_{ds0}}{Q_u}\,\dfrac{e^{-u_H}}{e^{-\kappa V_O}+e^{-\kappa u_{\tau H}}}}{\dfrac{I_{ds0}}{Q_u}\,\dfrac{e^{-u_H}}{e^{-\kappa V_O}+e^{-\kappa u_{\tau H}}}+\dfrac{I_{ds0}}{Q_u}\,\dfrac{e^{-u_L}}{e^{-\kappa V_C}+e^{-\kappa u_{\tau L}}}}
= \frac{1}{1+\dfrac{e^{-\kappa V_O}+e^{-\kappa u_{\tau H}}}{e^{-\kappa V_C}+e^{-\kappa u_{\tau L}}}\,e^{u_H-u_L}}.
\]
Throughout the linear segment of the sigmoid, we set the voltage biases such that uτH > VO + 4UT and uτL > VC + 4UT. These restrictions essentially marginalize uτH and uτL: by the time either of the terms with these two biases becomes significant, the steady state will be sufficiently close to either unity or zero that their influence is negligible. Therefore, we drop the exponential terms with uτH and uτL; substituting equations 4.1 and 4.2 yields the desired result, equation 4.3. For the time constant, we substitute equations 3.3 and 3.4 into equation 4.1:
\[
\tau_u(V) = \frac{1}{\alpha(V)+\beta(V)}
= \frac{Q_u}{I_{ds0}}\,\frac{1}{\dfrac{e^{-u_H}}{e^{-\kappa V_O}+e^{-\kappa u_{\tau H}}}+\dfrac{e^{-u_L}}{e^{-\kappa V_C}+e^{-\kappa u_{\tau L}}}}
= \frac{Q_u}{I_{ds0}}\,\frac{1}{\dfrac{1}{e^{-(\kappa V_O-u_H)}+e^{-(\kappa u_{\tau H}-u_H)}}+\dfrac{1}{e^{-(\kappa V_C-u_L)}+e^{-(\kappa u_{\tau L}-u_L)}}}.
\]
To equalize the minimum time constant at hyperpolarized and depolarized levels, we establish the following relationship: $\kappa u_{\tau H} - u_H = \kappa u_{\tau L} - u_L$. After additional algebraic manipulation, the time constant becomes
\[
\tau_u(V) = \tau_{\min}\left[\,1+\frac{e^{-\kappa(V_O-u_{\tau H})}\,e^{-\kappa(V_C-u_{\tau L})}}{e^{-\kappa(V_O-u_{\tau H})}+e^{-\kappa(V_C-u_{\tau L})}+2}-\frac{1}{e^{-\kappa(V_O-u_{\tau H})}+e^{-\kappa(V_C-u_{\tau L})}+2}\,\right],
\]
where $\tau_{\min} = (Q_u/I_{ds0})\,e^{-(\kappa u_{\tau H}-u_H)}$ is the minimum time constant. To reduce the expression further, we need to understand the relative magnitudes of the various terms. We can drop the constant in both denominators, as one of the exponentials there will always be significantly larger. Since VO and VC have slopes of opposite sign with respect to V, the sum of the exponentials in the denominator peaks at a membrane voltage somewhere within the middle of its operational voltage range. As it happens, this peak is close to the midpoint of the sigmoid (see below), where we have defined uτH > VO + 4UT and uτL > VC + 4UT. Thus, at the peak, we know the constant is negligible. As the membrane voltage moves away from the peak, in either direction, one of the exponentials continues to increase while the other decreases. Thus, with the restriction on the bias voltages uτH and uτL, the sum of the exponentials will always be much larger than the constant in the denominator. By the same logic, we can disregard the final term, since the sum of the exponentials will always be substantially larger than the numerator, making that fraction negligible over the whole membrane voltage range. With these assumptions, and substituting the dependence of VO and VC on V (see equations 4.1 and 4.2), we obtain the desired result, equation 4.6.

Appendix B: Circuit Flexibility

This section is more theoretical in nature, using the steady-state and time-constant equations derived in appendix A to provide insight into how the various parameters influence the two voltage dependencies (steady state and time constant). It is likely of interest only to those who wish to use our approach for modeling ion channels. An important consideration in generating these models is defining the location (i.e., the membrane voltage) and magnitude of the peak time constant. Unlike the minimum time constant, which is determined simply by the difference between the bias voltages uτH and uH (or uτL and uL), no bias directly controls the maximum time constant, since it is the point at which the sum of the transition rates is minimized (see Figure 5b). The voltage at which it occurs, however, is easily determined from equation 4.6:
\[
V_{\tau\,\mathrm{pk}} = V_u^{\mathrm{mid}} + \frac{U_T \log\left[\gamma_c/\gamma_o\right]}{\kappa\,(\gamma_o+\gamma_c)}. \tag{B.1}
\]
Thus, where the bell curve lies relative to the sigmoid, whose midpoint lies at $V_u^{\mathrm{mid}}$, is determined by the opening and closing voltages' slopes (γo and γc). Consequently, changing these slopes is the only way to displace the bell curve relative to the sigmoid. Shifting the bell curve by changing
Figure 9: Varying the closing voltage's slope (γc). (a) Changing γc adjusts both the slope and midpoint of the steady-state sigmoid. (b) γc also affects the location and height (relative to the minimum) of the time constant's bell curve; the change in height (plotted logarithmically) is dramatic. (c, d) Same as in a and b, except that we adjusted the bias voltage uL to compensate for the change in γc, so the sigmoid's midpoint and the bell curve's height remain the same. The sigmoid's slope does change, as it did before, and the bell curve's location shifts as well, though much less than before. In these plots, φo = 400 mV, uH = 400 mV, uL = 50 mV, uτH = 700 mV, and γo = 1.0. γc's values are 0.5 (thin solid line), 0.75 (short dashed line), 1.0 (long dashed line), and 1.25 (thick solid line).
a parameter other than γo or γc will automatically shift the sigmoid by the same amount. As we change the opening and closing voltages' slopes (γo and γc), two things happen. First, due to $V_u^{\mathrm{mid}}$'s dependence on these parameters (see equation 4.4), the sigmoid and the bell curve shift together. Second, due to the dependence we just described (see equation B.1), the bell curve shifts relative to the sigmoid. These effects are illustrated in Figures 9a and 9b for γc (γo behaves similarly). To eliminate the first effect while preserving the second, we can compensate for the sigmoid's shift by adjusting the bias voltage uL, which scales the closing rate (see equation 4.2). Consequently, uL also rescales the left part of the bell curve, where the closing rate is dominant, reducing its height relative to the minimum and canceling the shift due to the first effect. This is demonstrated in Figures 9c and 9d: the sigmoid remains fixed, while the bell curve shifts (slightly) due to the second effect. Although the bias voltage uL cancels the shift γc produces in the sigmoid, it does not compensate for the change in the sigmoid's slope. We can shift the sigmoid
Figure 10: Varying the closing voltage's intercept (φc). (a) Changing φc shifts the steady-state sigmoid's midpoint, leaving its slope unaffected. (b) φc also affects the time constant bell curve's location and height (relative to the minimum). In these plots, φo = 0 mV, uH = 400 mV, uL = 50 mV, uτH = 700 mV, γo = 1.0, and γc = 0.5. φc's values are 400 mV (thin solid line), 425 mV (short dashed line), 450 mV (long dashed line), and 475 mV (thick solid line).
Figure 11: Setting the bell curve's location and height independently. (a) Changing the opening and closing voltages' intercepts (φo and φc) together shifts the bell curve's location without affecting its height. The steady-state sigmoid (not shown) moves with the bell curve (see equation B.1). (b) Changing φo and φc in opposite ways increases the height (relative to the minimum) without affecting the location. The steady-state sigmoid (not shown) stays put as well. In these plots, uH = 400 mV, uL = 50 mV, uτH = 700 mV, γo = 1.0, and γc = 0.5. In both a and b, φc = 400 mV and φo = 0 mV for the thin solid line. In a, both φc and φo increment by 25 mV from the thin solid line to the short dashed line, to the long dashed line, and to the thick solid line. In b, φc increments and φo decrements by 25 mV from the thin solid line to the short dashed line, to the long dashed line, and to the thick solid line.
and leave its slope unaffected by changing the opening and closing voltages’ intercepts, as shown in Figure 10 for φc (φo behaves similarly). However, the bell curve shifts by the same amount, and its height changes as well, since φc rescales the closing rate. Thus, it is not possible to shift the bell curve relative to the sigmoid without changing the latter’s slope; this is evident from equations 4.5 and B.1.
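These trade-offs are easy to explore numerically. The sketch below evaluates the appendix A expressions for u∞ and τu/τmin and checks the bell curve's peak against equation B.1, using Figure 10's base parameters. The linear forms VO = γoV + φo and VC = −γcV + φc, and the normalization of all voltages by UT inside the exponentials, are our assumptions based on the parameter descriptions; they are not written out in this section. uτL is obtained from the equalization relation κuτH − uH = κuτL − uL.

```python
import numpy as np

UT, kappa = 25.4, 0.7                     # mV; transistor parameters
gamma_o, gamma_c = 1.0, 0.5               # assumed slopes of V_O and V_C
phi_o, phi_c = 0.0, 400.0                 # assumed intercepts (mV)
u_H, u_L, u_tauH = 400.0, 50.0, 700.0     # bias voltages (mV), Figure 10
u_tauL = u_tauH + (u_L - u_H) / kappa     # equalization relation: 200 mV

V = np.linspace(200.0, 800.0, 6001)       # membrane voltage sweep (mV)
V_O = gamma_o * V + phi_o                 # opening voltage (assumed form)
V_C = -gamma_c * V + phi_c                # closing voltage (assumed form)

# Steady state (appendix A), all voltages normalized by U_T:
ratio = (np.exp(-kappa * V_O / UT) + np.exp(-kappa * u_tauH / UT)) \
      / (np.exp(-kappa * V_C / UT) + np.exp(-kappa * u_tauL / UT))
u_inf = 1.0 / (1.0 + ratio * np.exp((u_H - u_L) / UT))

# Time constant relative to tau_min (appendix A):
A = np.exp(-kappa * (V_O - u_tauH) / UT)
B = np.exp(-kappa * (V_C - u_tauL) / UT)
tau_rel = 1.0 + (A * B - 1.0) / (A + B + 2.0)

V_mid = V[np.argmin(np.abs(u_inf - 0.5))]       # sigmoid midpoint
V_pk = V[np.argmax(tau_rel)]                    # numerical bell-curve peak
V_pk_B1 = V_mid + UT * np.log(gamma_c / gamma_o) / (kappa * (gamma_o + gamma_c))
print(V_mid, V_pk, V_pk_B1)   # ~600 mV, ~583 mV, ~583 mV: B.1 checks out
```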
We can set the bell curve’s location and height independently if we change the opening and closing voltages’ intercepts by the same amount or by equal and opposite amounts, respectively. Equal changes in the intercepts (φo and φc ) shift the opening and closing rate curves by the same amount, thus shifting the bell curve (and the sigmoid) without affecting its height (see Figure 11a), whereas equal and opposite changes shift the opening and closing rate curves apart, leaving the point where they cross at the same location while rescaling the value of the rate there. As a result, the bell curve’s height is changed without affecting its location (see Figure 11b). Choosing values for γo and γc , however, presents a trade-off. These two parameters define the dependence of VO and VC on V and are not external biases like the other parameters; rather, they are defined by the fabrication process through the transistor parameter κ. We can define their relationships with κ through the use of different circuits; in our simple activation channel (see section 5), γc = κ 2 /(κ + 1), as defined by the four transistors that invert the membrane voltage. There are a couple of drawbacks. First, not all values for γo or γc are possible using only a few transistors. Expressed another way, a trade-off needs to be made between achieving more precise values for γo or γc and using fewer transistors within the design. The other drawback is that after fabrication, γo and γc can no longer be modified, as they are defined as functions of the transistor parameter κ. These issues merit special consideration before submitting the chip for fabrication. References Delbruck, T. “Bump” circuits for computing similarity and dissimilarity of analog voltages. In Neural Networks, 1991, IJCNN-91-Seattle International Joint Conference on (Vol. 1, pp. 475–479). Piscataway, NJ: IEEE. Destexhe, A., & Huguenard, J. R. (2000). Nonlinear thermodynamic models of voltage-dependent currents. J. Comput. Neurosci., 9(3), 259–270. Destexhe, A., & Huguenard, J. R. (2001). Which formalism to use for voltagedependent conductances? In R. C. Cannon & E. D. Schutter (Eds.), Computational neuroscience: Realistic modeling for experimentalists (pp. 129–157). Boca Raton, FL: CRC. Hill, T. L., & Chen, Y. (1972). On the theory of ion transport across the nerve membrane. VI. free energy and activation free energies of conformational change. Proc. Natl. Acad. Sci. U.S.A., 69(7), 1723–1726. Hille, B. (1992). Ionic channels of excitable membranes (2nd ed.). Sunderland, MA: Sinauer Associates. Hodgkin, A. L., & Huxley, A. F. (1952). A quantitative description of membrane current and its application to conduction and excitation in nerve. J. Physiol., 117(4), 500–544. Hynna, K. (2005). T channel dynamics in a silicon LGN. Unpublished doctoral dissertation, University of Pennsylvania.
Hynna, K., & Boahen, K. (2001). Space-rate coding in an adaptive silicon neuron. Neural Networks, 14(6–7), 645–656.
Llinas, R. R. (1988). The intrinsic electrophysiological properties of mammalian neurons: Insights into central nervous system function. Science, 242(4886), 1654–1664.
Mahowald, M., & Douglas, R. (1991). A silicon neuron. Nature, 354(6354), 515–518.
McCormick, D. A., & Pape, H. C. (1990). Properties of a hyperpolarization-activated cation current and its role in rhythmic oscillation in thalamic relay neurones. J. Physiol., 431, 291–318.
Mead, C. (1989). Analog VLSI and neural systems. Reading, MA: Addison-Wesley.
Monsivais, P., Clark, B. A., Roth, A., & Hausser, M. (2005). Determinants of action potential propagation in cerebellar Purkinje cell axons. J. Neurosci., 25(2), 464–472.
Pavasovic, A., Andreou, A. G., & Westgate, C. R. (1994). Characterization of subthreshold MOS mismatch in transistors for VLSI systems. J. VLSI Signal Process. Syst., 8(1), 75–85.
Simoni, M. F., Cymbalyuk, G. S., Sorensen, M. E., Calabrese, R. L., & DeWeerth, S. P. (2004). A multiconductance silicon neuron with biologically matched dynamics. IEEE Transactions on Biomedical Engineering, 51(2), 342–354.
Stevens, C. F. (1978). Interactions between intrinsic membrane protein and electric field: An approach to studying nerve excitability. Biophys. J., 22(2), 295–306.
Svirskis, G., Kotak, V., Sanes, D. H., & Rinzel, J. (2002). Enhancement of signal-to-noise ratio and phase locking for small inputs by a low-threshold outward current in auditory neurons. J. Neurosci., 22(24), 11019–11025.
Tao, L., Shelley, M., McLaughlin, D., & Shapley, R. (2004). An egalitarian network model for the emergence of simple and complex cells in visual cortex. Proc. Natl. Acad. Sci. U.S.A., 101(1), 366–371.
Willms, A. R., Baro, D. J., Harris-Warrick, R. M., & Guckenheimer, J. (1999). An improved parameter estimation method for Hodgkin-Huxley models. Journal of Computational Neuroscience, 6(2), 145–168.
Zaghloul, K. A., & Boahen, K. (2004). Optic nerve signals in a neuromorphic chip II: Testing and results. IEEE Transactions on Biomedical Engineering, 51(4), 667–675.
Zhan, X. J., Cox, C. L., Rinzel, J., & Sherman, S. M. (1999). Current clamp and modeling studies of low-threshold calcium spikes in cells of the cat's lateral geniculate nucleus. J. Neurophysiol., 81(5), 2360–2373.
Received October 6, 2005; accepted June 20, 2006.
LETTER
Communicated by Harry Erwin
Spatiotemporal Conversion of Auditory Information for Cochleotopic Mapping

Osamu Hoshino
[email protected]
Department of Intelligent Systems Engineering, Ibaraki University, Hitachi, Ibaraki 316-8511, Japan
Auditory communication signals such as monkey calls are complex FM vocal sounds and in general induce action potentials with different timing in the primary auditory cortex. A delay line scheme is one effective way of detecting such spike timing. However, the scheme is not straightforwardly applicable if the time intervals of the signals exceed the latency of the delay lines. In fact, monkey calls often extend over longer time intervals (hundreds of milliseconds to seconds), beyond the latency times observed in the brain (less than several hundred milliseconds). Here, we propose a cochleotopic map analogous to the retinotopic map in vision. We show that information about monkey calls could be mapped onto a cochleotopic cortical network as spatiotemporal firing patterns of neurons, which can then be decomposed into simple (linearly sweeping) FM components and integrated into unified percepts by higher cortical networks. We suggest that this spatiotemporal conversion of auditory information may be essential for developing the cochleotopic map, which could serve as the foundation for later processing, namely monkey call identification by higher cortical areas.

1 Introduction

Frequency modulation (FM) is a critical parameter for constructing auditory communication signals in both humans and monkeys. For example, human speech contains FM sounds called formant transitions that are critical for encoding consonant-vowel combinations such as "ga," "da," and "ba" (Liberman, Cooper, Shankweiler, & Studdert-Kennedy, 1967; Suga, 1995). Monkeys use complex FM sounds, the so-called monkey calls, which are considered a precursor of human speech (Poremba et al., 2004; Holden, 2004), for social interactions with other members of their species (Symmes, Newman, Talmage-Riggs, & Lieblich, 1979; Janik, 2000; Tyack, 2000). Auditory FM signals differ in informational structure from other sensory signals in that they are processed in a time-dependent manner (or characterized by time-varying spectral frequencies), while color and orientation
in vision, odorants in olfaction, and tastants in gustation are in most cases characterized in a time-independent manner. Understanding how the auditory cortex encodes and detects the streams of spectral information arising from the temporal structure of FM sounds is one of the most challenging problems in auditory cognitive neuroscience. Auditory signals sent from receptor neurons enter the primary auditory area (AI). Neurons of the AI are arranged in an orderly fashion and form the so-called tonotopic map, in which the tonotopic representation of the cochlea is well preserved (Yost, 1994). Each neuron on the map tends to respond best to its characteristic frequency. Because of this tonotopic organization, these neurons can be activated sequentially, generating action potentials with different timing, when stimulated with an FM sound. It is interesting to ask how the higher cortical areas to which the AI projects detect the timing of these action potentials so that the brain can identify the applied FM sound. One effective way of detecting the timing of action potentials is to use delay lines. Jeffress (1948) proposed a theory of sound localization in which an interaural time difference, which expresses the azimuthal location of a sound source, is detected by neurons that receive signals from both ears through distinct delay lines. Suga (1995) proposed a multiple delay line scheme for echo sound detection in bats. Bi and Poo (1999) demonstrated in a cultured hippocampal neuronal network that the timing of action potentials could be detected when relevant delay lines are properly chosen. The timing of action potentials in these processes was relatively short, ranging from submilliseconds to tens of milliseconds, for which the delay line scheme worked well. However, the information about the FM sounds contained in monkey calls is in general expressed over longer time intervals of hundreds of milliseconds to seconds. The delay line scheme is unlikely to be applicable to them, because the delay lines observed in the brain are at most several hundred milliseconds long (Miller, 1987). This implies that the brain might employ another strategy for identifying individual calls. A speculative cochleotopic map was proposed in analogy to the retinotopic map in vision (Rauschecker, 1998). The retinotopic map expresses information about the location of a moving bar in a two-dimensional visual space that is projected onto the retina. As the bar moves in the visual space, the map shows a spatiotemporal firing pattern of neurons, and the axes of the map indicate the location of the bar in the two-dimensional visual space. In the cochleotopic mapping scheme, the location of neuronal activation moves along the frequency axis as the frequency of an FM sound sweeps. However, the exact auditory variable for the other axis is still unknown. We propose here a hypothetical two-dimensional (frequency axis, propagation axis) cochleotopic neural network (NCO network) for the AI, on which information about FM sounds is mapped in a spatiotemporal manner. The propagation axis is assumed for the unknown axis. Stimulation
with a pure (single-frequency) tone activates a peripheral neuron, whose activity propagates along the propagation axis, that is, along its isofrequency band. When the NCO network is stimulated with an FM sound, peripheral neurons are sequentially activated, and their activity then propagates along their isofrequency bands. As a consequence, the information about the applied FM sound is expressed as a specific spatiotemporal firing pattern in the NCO network dynamics. Such activity propagation has been reported in the AI. Hess and Scheich (1996) stimulated Mongolian gerbils with pure tones (1 kHz to 16 kHz) and recorded the activity of AI neurons. They found that neuronal activation propagated along isofrequency bands at all frequencies. Taniguchi, Horikawa, Moriyama, and Nasu (1992) stimulated guinea pigs with pure tones (1 kHz to 30 kHz) and recorded the activity of the anterior field (field A), in which tonotopic organization is well preserved. They found that the focal activation beginning in field A propagated in two directions: along isofrequency bands and toward field DC. There is also neurophysiological evidence that FM sounds are precisely detected by auditory neurons. Neurons of the lateral belt (LB) (Rauschecker, 1998), which receives signals from the AI, responded to FM sounds. These neurons were organized in a relatively orderly fashion depending on the sweep rate (between slow and fast) and direction (upward or downward) of the FMs. Based on these experimental findings, we construct a neural network model (NFM) for the LB, to which the NCO network projects. A given FM sound evokes a specific spatiotemporal firing pattern in the NCO network, to which a certain group of NFM neurons (an NFM column) responds, thereby identifying the applied FM. It is also well known that LB neurons respond to monkey calls (Rauschecker, 1997). Monkey calls, as vocal signatures, are complex FM sounds and play an important role in identifying individuals, especially when the visual system is unavailable, as in a luxuriant forest (Symmes et al., 1979). To detect such complex FM sounds, we construct a higher neural network model (NID network) for the STGr (rostral portion of the superior temporal gyrus), to which the LB projects. The NID network receives selective projections from the NFM network. When a monkey call is presented to the NCO network, multiple NFM columns are sequentially activated in a specific order. The NID network integrates the sequence of the dynamic NFM columns, thereby identifying that call. Based on the proposed cochleotopic mapping scheme, we investigate how FM sound information is encoded and detected. Applying simple (linearly sweeping) and complex (monkey call) FM sounds to the NCO network, we record the activities of the neurons. By analyzing them statistically, we try to understand the neuronal mechanisms that underlie FM sound information processing in the auditory cortex.
2 Neural Network Model

2.1 Outline of the Model. The NCO network, modeling the AI, is organized in a tonotopic fashion, as shown in Figure 1A. When the frequency of an applied FM sound sweeps upward, the neuronal activation at the periphery (p1) sweeps from f1 to f40 and then moves along the propagation axis (the isofrequency bands). The filled circles schematically indicate a neuronal firing pattern induced by a simple (linearly sweeping) upward FM sound at a certain time after stimulus onset; the gray circles indicate a firing pattern induced by a downward FM sound. The spatiotemporal firing pattern of the NCO network expresses combinatorial information about the direction and the sweep rate of the applied FM sound. Neurons of the NFM network, modeling the LB, receive convergent projections from the NCO network (solid and dashed lines) and detect the upward (black circles) and downward (gray circles) FM sounds. We made sets of selective convergent projections from the NCO to the NFM network in order to detect specific linearly sweeping FM sounds (a sketch of one such projection set follows this section). Neurons within isofrequency bands are connected by excitatory and inhibitory delay lines, as shown in Figure 1B. The excitatory connections (solid lines), running from the periphery toward the center, are the major driving force for the neuronal activation to move along the propagation axis, and the inhibitory connections (dashed lines) were employed to sharpen the spatiotemporal firing patterns. This specific circuitry was used to functionally express the propagation axis, for which evidence in the AI is accumulating (Hess & Scheich, 1996; Taniguchi et al., 1992). For simplicity, there are no connections between isofrequency bands. Neurons within NFM columns are connected with each other via excitatory synapses, and neurons between NFM columns are connected via inhibitory synapses. This circuitry enables each NFM column to respond selectively to a specific linearly sweeping FM sound. Figures 1C and 1D are schematic drawings of neuronal responses to a linearly sweeping FM sound. When the NCO network is stimulated with an upward FM sound sweeping at a slow (Figure 1C, left), intermediate (Figure 1C, center), or fast rate (Figure 1C, right), the activation area moves from the lower left to the upper right (arrows). When a downward FM sound is presented, the activation area moves from the lower right to the upper left (see Figure 1D). When the activation area reaches a certain position (gray ellipses), the neurons send action potentials to the NFM network via the selective feedforward projections and activate the NFM column corresponding to the applied FM (black ellipses). The neuronal activation of the other NCO regions (dashed ellipses) could also be used for FM detection. Nevertheless, we chose the firing patterns marked by the gray ellipses, because these patterns appear first in the time course with the maximal number of neurons simultaneously active. This enables the NFM network to respond reliably and rapidly to the applied FM sounds.
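As referenced above, the following sketch illustrates one way the selective convergent projections could be constructed. It builds the feedforward weights from one target firing pattern: the set of NCO neurons that an upward sweep at 20 Hz per ms activates simultaneously (the pattern marked by the gray ellipse in Figure 1C). The stripe geometry, derived from the per-band dwell time and the propagation speed, is our reading of the scheme; the paper does not give an explicit formula.

```python
import numpy as np

# Hypothetical construction of one set of selective projections L_{i,kj}
# (the variable names are ours). Band k fires k*dwell ms after sweep onset;
# at the moment the last band (k = 39) is activated, the activity of band k
# has traveled for (39 - k)*dwell ms along the propagation axis, i.e. it
# sits about (39 - k)*dwell/delay steps from the periphery. Those positions
# form the diagonal stripe that projects (with weight 0.1, as in section
# 2.2) onto the N_FM column tuned to this sweep.
n_f, n_p = 40, 40        # frequency x propagation grid of the N_CO network
delay = 10.0             # ms per propagation step (Δt)
dwell = 100.0 / 20.0     # ms spent in each 100 Hz band at 20 Hz per ms

L = np.zeros((n_f, n_p)) # weight from N_CO neuron (k, j) to this column
for k in range(n_f):
    j = int(round((n_f - 1 - k) * dwell / delay))
    if j < n_p:
        L[k, j] = 0.1
print(np.argwhere(L > 0)[:4])  # a diagonal stripe from (0, 20) to (39, 0)
```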
Figure 1: Neural network model. (A) The NCO network is organized in a tonotopic manner. The NFM network receives selective projections from the NCO network. Among the NFM columns, c1–c10 and c31–c40 detect FM sounds that sweep linearly in the downward and upward directions, respectively. For clarity, only two sets of firing patterns (black and gray circles) and projections (solid and dashed lines), for an upward and a downward FM sound, are depicted. (B) Neuronal connections within isofrequency bands of the NCO network. Neurons are connected via excitatory (solid lines) and inhibitory (dashed lines) delay lines, where Δt denotes a signal transmission delay time. (C) Schematic drawings of the spatiotemporal neuronal responses of the NCO network and those of the NFM network to simple (linearly sweeping) upward FM sounds. Activity patterns for FM sounds that sweep at a slow (left), intermediate (center), and fast (right) rate are shown. Arrows indicate the directions of movement of the active areas. (D) Schematic drawings as in C for downward FM sounds.
2.2 Model Description. The dynamic evolution of the membrane potentials of the NCO and NFM neurons is defined, respectively, by
\[
\tau_{CO}\,\frac{du^{CO}_{ki}(t)}{dt} = -u^{CO}_{ki}(t) + \sum_{j=1}^{M_D}\left[ w^{ex}_{ki,k(i-j)}\,S^{CO}_{k(i-j)}(t-j\Delta t) + w^{ih}_{ki,k(i+j)}\,S^{CO}_{k(i+j)}(t-j\Delta t)\right], \tag{2.1}
\]
\[
\tau_{FM}\,\frac{du^{FM}_{i}(t)}{dt} = -u^{FM}_{i}(t) + \sum_{k=f1}^{f40}\sum_{j=p1}^{p40} L_{i,kj}\,S^{CO}_{kj}(t-\Delta t_{FM}) + \sum_{j=1\,(j\neq i)}^{M_{FM}} w^{FM}_{ij}\,S^{FM}_{j}(t), \tag{2.2}
\]
where
\[
\mathrm{Prob}\left[S^{Y}_{i}(t)=1\right] = f_{Y}\!\left(u^{Y}_{i}(t)\right), \qquad f_{Y}(u) = \frac{1}{1+e^{-\eta_{Y}(u-\theta_{Y})}} \qquad (Y = CO,\,FM). \tag{2.3}
\]
Here $u^{CO}_{ki}(t)$ and $u^{FM}_{i}(t)$ are the membrane potentials of the ith NCO neuron of the kth (k = f1–f40) isofrequency band and of the ith NFM neuron at time t, respectively. $\tau_Y$ (Y = CO, FM) is the decay time of the membrane potential in network $N_Y$. $M_D$ is the number of excitatory or inhibitory input delay lines that a single NCO neuron receives from other neurons within its isofrequency band. $w^{ex}_{ki,k(i-j)}$ and $w^{ih}_{ki,k(i+j)}$ are, respectively, the excitatory and inhibitory synaptic connection strengths from neuron (i − j) to i and from (i + j) to i of the kth isofrequency band. $S^{CO}_{kj}(t-j\Delta t)=1$ expresses an action potential of the jth NCO neuron of the kth isofrequency band, where $j\Delta t$ denotes a signal transmission delay time (see Figure 1B). (f1–f40, p1–p40) denotes the locations of neurons on the two-dimensional NCO map (see Figure 1A). $L_{i,kj}$ is the strength of the synaptic connection from the jth NCO neuron of the kth isofrequency band to the ith NFM neuron. $\Delta t_{FM}$ is the signal transmission delay time from the NCO to the NFM network. $M_{FM}$ is the number of NFM neurons. $w^{FM}_{ij}$ is the synaptic connection strength from the jth to the ith NFM neuron, and $S^{FM}_{j}(t)=1$ expresses an action potential of the jth NFM neuron. $\{w^{FM}_{ij}\}$ was set so that neurons within NFM columns (c1–c40; see Figure 1A) mutually excite one another and neurons between NFM columns laterally inhibit one another. $\eta_Y$ and $\theta_Y$ are the steepness and the threshold of the sigmoid function $f_Y$ for Y neurons. Equation 2.3 defines the probability of firing of neuron i; that is, the
probability of $S^{Y}_{i}(t)=1$ is given by the function $f_Y$. After firing, the neuron's membrane potential is reset to 0. FM sound stimuli are applied to the peripheral neurons of the NCO network. The dynamic evolution of the membrane potentials of these neurons is defined by
\[
\tau_{CO}\,\frac{du^{CO}_{k1}(t)}{dt} = -u^{CO}_{k1}(t) + \sum_{j=1}^{M_D} w^{ih}_{k1,k(1+j)}\,S^{CO}_{k(1+j)}(t-j\Delta t) + \alpha\, I_{k1}(t), \tag{2.4}
\]
where $I_{k1}(t)$ is the input stimulus to the peripheral neuron of the kth isofrequency band, that is, the neuron located at (k, p1; k = f1–f40; see Figure 1A), and α is the intensity of the input. Note that the peripheral neurons receive only inhibitory input, from the (1 + j)th neurons with delays of $j\Delta t$, and no delayed excitatory input. Network parameter values are as follows. The numbers of neurons are 40 (f1–f40) × 40 (p1–p40) for the NCO network and 40 (c1–c40) × 12 for the NFM network; $\tau_{CO}$ = 10 ms, $\tau_{FM}$ = 10 ms, $\theta_Y$ = 0.7, $\eta_Y$ = 10.0, and $M_D$ = 3; $w^{ex}_{ki,k(i-j)}$ = 5.0 and $w^{ih}_{ki,k(i+j)}$ = −0.5; $\Delta t$ = 10 ms and $\Delta t_{FM}$ = 20 ms. $L_{i,kj}$ was selectively set to either 0.1 or 0, as described in section 2.1, so that the specific firing patterns induced in the NCO network dynamics activate their corresponding NFM columns (see Figures 1C and 1D). $w^{FM}_{ij}$ = 0.3 within and −5.0 between NFM columns. α = 8.0, and $I_{k1}(t)$ = 1 for an input and 0 for no input.

3 Results

3.1 Tuning Property to Simple FM Sounds. We show here how information about simple (linearly sweeping) FM sounds can be expressed as spatiotemporal firing patterns in the NCO network dynamics. We also show how this auditory information is transferred to and detected by specific neuronal columns of the NFM network. The response properties (action potential generation) of NFM neurons are compared with those observed experimentally. An upward FM sound sweeping linearly at 20 Hz per ms induces a specific spatiotemporal firing pattern in the NCO network, in which the neuronal activation moves from the lower left toward the upper right (see Figure 2A). When the activity reaches a certain point (time = 200 ms), the active neurons send action potentials to the NFM network and stimulate the corresponding NFM column (arrow) at time ≈ 220 ms. The difference in activation time between the NCO (200 ms) and NFM (220 ms) networks arises from the signal transmission delay between the two networks, $\Delta t_{FM}$ = 20 ms (see equation 2.2).
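To make the dynamics concrete, here is a minimal Euler-discretized sketch of a single isofrequency band, using the parameter values just listed (equations 2.1, 2.3, and 2.4). The 1 ms integration step and the 50 ms pure-tone input are our assumptions; the paper does not state its numerical scheme, and the feedforward stage to the NFM network is omitted.

```python
import numpy as np

rng = np.random.default_rng(1)
dt, tau_CO, theta, eta = 1.0, 10.0, 0.7, 10.0   # ms; threshold; steepness
MD, w_ex, w_ih = 3, 5.0, -0.5                   # delay lines and weights
delay = 10                                      # Δt = 10 ms = 10 steps of dt
alpha, T, N = 8.0, 500, 40                      # input gain; steps; neurons

u = np.zeros(N)                 # membrane potentials along the band
S = np.zeros((T, N))            # spike history, needed for delayed inputs
I = np.zeros(T); I[:50] = 1.0   # 50 ms pure tone at this band's frequency

for t in range(1, T):
    drive = np.zeros(N)
    for j in range(1, MD + 1):              # equation 2.1's delayed sums
        td = t - j * delay
        if td < 0:
            continue
        drive[j:] += w_ex * S[td, :-j]      # excitation from j steps peripheral
        drive[:-j] += w_ih * S[td, j:]      # inhibition from j steps central
    drive[0] += alpha * I[t]                # equation 2.4: the periphery gets
                                            # the stimulus and no excitation
    u += (dt / tau_CO) * (-u + drive)       # Euler step of the leaky dynamics
    p = 1.0 / (1.0 + np.exp(-eta * (u - theta)))   # equation 2.3
    S[t] = rng.random(N) < p
    u[S[t] > 0] = 0.0                       # reset to 0 after firing

# First spike times should increase with distance from the periphery,
# showing the activity propagating along the band (compare Figure 2A).
first = [int(S[:, i].argmax()) if S[:, i].any() else -1 for i in range(N)]
print(first[:10])
```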
We assumed f1 = 8 kHz to f40 = 11.9 kHz (see Figure 1A), with the isofrequency bands placed at an even interval of 100 Hz per band. These frequencies are within the range observed in squirrel monkeys (Symmes et al., 1979) and were employed to investigate how complex FM sounds such as monkey calls could be identified, as will be shown in sections 3.2 and 3.3. Figure 2B presents the total spike counts (top) and raster plots (middle) for the neurons of a given NFM column when stimulated with different upward FM sounds (bottom). The columnar neurons show specific sensitivity to the upward FM sound (20 Hz per ms) but respond much less to downward FM sounds (see Figure 2C). This tendency is largely consistent with that observed in macaque monkeys (Rauschecker, 1998). Although the tuning of the NFM columnar neurons to the applied FM sound (sweep rate = 20 Hz per ms; upward) is evident, these neurons also show weak responses to the other upward FM sounds with different sweep rates (see the arrows in Figure 2B). Such weak responsiveness to the "irrelevant" FM sounds is due to the overlapping of NCO firing patterns, as schematically shown in Figure 3. The set of NCO neurons within the solid ellipse, which are simultaneously activated by the FM stimulus (sweep rate = 20 Hz per ms; upward), send action potentials to the relevant NFM column and maximally activate it (top left). However, the subsets of NCO neurons within the overlapping regions (gray and black) can also be activated by the irrelevant upward FM sounds (26.7 and 13.3 Hz per ms, respectively). These neurons send a relatively small number of action potentials to the same NFM column, which results in weaker neural responses in the column (bottom left and top right).
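For concreteness, the sketch below shows one way a linearly sweeping FM stimulus could be converted into the peripheral inputs $I_{k1}(t)$ of the 40 bands; the exact input construction is our assumption. At 20 Hz per ms, the sweep crosses each 100 Hz band in 5 ms and traverses all 40 bands in roughly 200 ms, consistent with the activation time noted above.

```python
import numpy as np

f_lo, n_bands, band_hz = 8000.0, 40, 100.0   # f1 = 8 kHz, 100 Hz per band
sweep_rate = 20.0                            # Hz per ms, upward sweep
dt = 1.0                                     # ms per step
T = int(n_bands * band_hz / sweep_rate / dt) # ~200 steps to cross all bands

I = np.zeros((T, n_bands))                   # I_{k1}(t) for every band k
for t in range(T):
    f = f_lo + sweep_rate * t * dt           # instantaneous frequency (Hz)
    k = min(int((f - f_lo) // band_hz), n_bands - 1)
    I[t, k] = 1.0                            # drive the band containing f

# Each band is driven for band_hz / sweep_rate = 5 ms here; at 13.3 or
# 26.7 Hz per ms the dwell time becomes 7.5 or 3.75 ms, which changes the
# slope of the spatiotemporal pattern (the basis of Figure 3's overlap).
print(I.sum(axis=0)[:5])                     # ~5 steps of drive per band
```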
Figure 2: Response property of the NCO and NFM networks. (A) Time courses of neuronal activation when stimulated with an upward FM sound sweeping at 20 Hz per ms. The level of neuronal activity is expressed for each neuron as the number of action potentials observed within a time interval (bin = 10 ms). Arrows indicate the responses of NFM neurons to the applied FM sound. (B) Spike counts (top) and raster plots (middle) for the NFM column neurons when stimulated with different upward FM sounds (bottom). The spike count is the number of action potentials of the NFM column neurons observed within a time interval (bin = 10 ms). (C) Spike counts (top) and raster plots (middle) for the NFM column when stimulated with different downward FM sounds (bottom). The NFM column shows specific sensitivity to the upward FM sound sweeping at 20 Hz per ms.
Figure 3: Overlapping property of NCO firing patterns. Thin dashed, solid, and thick dashed areas schematically depict the firing patterns generated by FM sounds sweeping at 26.7 Hz per ms, 20 Hz per ms, and 13.3 Hz per ms, respectively. A certain NFM column is activated maximally by its preferred FM sound (20 Hz per ms) (top left). The overlapping regions (gray and black) are activated not only by the upward FM at 20 Hz per ms but also by those sweeping at 26.7 Hz per ms (gray region) and 13.3 Hz per ms (black region). This results in weaker neuronal responses in the same NFM column (bottom left and top right).
3.2 Tuning Property to Complex FM Sounds. We show here how complex FM sounds such as monkey calls can be expressed as specific spatiotemporal firing patterns in the NCO network dynamics. We also show how the information about individual monkey calls can be decomposed into simple (linearly sweeping) FM components. We used artificial isolation peeps (IPs) as monkey calls (Symmes et al., 1979); their pitch profiles are shown in Figure 4A. A specific spatiotemporal firing pattern is induced in the NCO network when it is stimulated with the IP of monkey X, which then activates multiple NFM columns in a sequential manner (see Figure 4B). We observed distinct sequential orders of columnar activation for monkey calls X, Y, and Z. Figure 4C presents the details of the sequences of columnar activation for monkey X (thick solid line), Y (thin solid line), and Z (dashed line), indicating that the IPs are decomposed into specific sequences of simple (linearly sweeping) FM components. In the model, the neurons of a currently active NFM column continue firing, even without any excitatory input, until another dynamic NFM column emerges, that is, until its neurons begin to fire. For example, NFM column c31 is activated at time = 0.22 s (upward arrow) and continues firing (downward arrow) until NFM column c1 begins to fire (at 0.34 s; rightward arrow).
Figure 4: Spatiotemporal conversion of information about isolation peeps (IPs). (A) Profiles of the artificial IPs for monkey calls (X, Y, Z) used in the simulations. (B) Time courses of neuronal activation of the NCO and NFM networks induced by the IP of monkey X. (C) Sequences of dynamic NFM columns induced by the IPs of monkey X (thick solid line), Y (thin solid line), and Z (dashed line). Each NFM column, c31–c40 and c1–c10, responds to a specific linearly sweeping FM component of the IPs. NFM columns c11–c30 were not assigned to detect FM sounds in this simulation.
This self-generated continuous firing could be mediated by the mutual excitation within NFM columns. In the next section, we try to identify these monkey calls by an integrative process, to which this persistent neuronal firing effectively contributes.

3.3 Monkey Call Identification. We showed in the previous section that the information about complex FM sounds, the IPs of monkeys, can be decomposed into simple (linearly sweeping) FM components. This is reminiscent of how early visual systems decompose complex visual objects into simple features such as edges, orientations, and colors; higher visual areas integrate these features so that the visual images of the objects can be reconstructed as unified percepts. To identify the sequences of the dynamic NFM columns, we extended the model, adding a higher (integrative) neural network NID (see Figure 5A). We made selective feedforward projections from the NFM to the NID network (solid lines) in order to integrate the series of FM components. The specific NID column (filled circles; NID) assigned to detect a certain IP receives convergent inputs from the NFM columns (filled circles; NFM) that are sequentially activated by the FM components constituting the IP (filled circles; NCO). The NID network may correspond to a lateral belt (LB) area, which is known to respond to monkey calls, or to the rostral portion of the superior temporal gyrus (STGr), to which the LB projects (Rauschecker, 1998), as we assumed here. The dynamic evolution of the membrane potentials of NID neurons is defined by
\[
\tau_{ID}\,\frac{du^{ID}_{i}(t)}{dt} = -u^{ID}_{i}(t) + \sum_{j=1}^{M_{FM}} L^{ID}_{ij}\,S^{FM}_{j}(t-\Delta t_{ID}) + \sum_{j=1\,(j\neq i)}^{M_{ID}} w^{ID}_{ij}\,S^{ID}_{j}(t), \tag{3.1}
\]
where action potentials are generated according to equation 2.3, and the other parameter values are the same as those for the NFM network (see equation 2.2). As shown in Figure 5B, when the NCO network is stimulated with the IP of monkey X, multiple NFM columns are sequentially activated, which then activate the call-relevant NID column at 367 ms. In this integrative process, the NID column receives consecutive action potentials from the NFM columns, as noted in the previous section (see Figure 4C). This continuous activation of the NID column gradually depolarizes its neurons and allows them to fire when their membrane potentials reach threshold, thereby identifying the applied IP. Figure 5C presents the identification process for monkey Z.
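A minimal sketch of this integration step (equation 3.1) with two competing NID units is given below. Unit 0 is wired to all three NFM columns of a call's FM sequence, while unit 1 receives only the first column, loosely mimicking the monkey V versus monkey A competition described later. The column-level drive (an idealized 12 spikes per ms from an active 12-neuron column, times the 0.1 connection weight), the omission of the 20 ms delay ΔtID, and the two-unit reduction are all our simplifications.

```python
import numpy as np

rng = np.random.default_rng(2)
dt, tau_ID, theta, eta = 1.0, 10.0, 0.7, 10.0
w_lat = -5.0                          # lateral inhibition between N_ID units
T = 550

# Columnar activity: three N_FM columns fire back to back for 150 ms each,
# idealized as 12 spikes per ms from the 12 neurons of an active column.
col = np.zeros((T, 3))
for c in range(3):
    col[100 + 150 * c : 250 + 150 * c, c] = 12.0

L = 0.1 * np.array([[1.0, 1.0, 1.0],  # unit 0 <- columns 0, 1, 2
                    [1.0, 0.0, 0.0]]) # unit 1 <- column 0 only

u = np.zeros(2)
S = np.zeros((T, 2))
for t in range(1, T):
    drive = L @ col[t] + w_lat * S[t - 1][::-1]  # feedforward + rival's spike
    u += (dt / tau_ID) * (-u + drive)
    p = 1.0 / (1.0 + np.exp(-eta * (u - theta)))
    S[t] = rng.random(2) < p
    u[S[t] > 0] = 0.0

# Unit 1 can fire only while column 0 is active; unit 0 keeps firing through
# the whole sequence and suppresses its rival, thereby identifying the call.
print(S[:250].sum(axis=0), S[250:].sum(axis=0))
```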
Figure 5: Monkey call identification. (A) A higher (integrative) neural network NID is added to the original model (see Figure 1A); it integrates the specific sequences of dynamic NFM columns induced by individual IPs. (B,C) Time courses of neuronal activation in the NCO, NFM, and NID networks when stimulated with the IPs of monkeys X (B) and Z (C).
Figure 6 illustrates how similar monkey calls can be distinguished from each other. The IP of monkey V (see Figure 6A, left) has a spectrogram whose first part (time = 0–0.1 s; solid line) is similar to that of monkey A (dashed line). When the NCO network is stimulated with this IP (see Figure 6B), multiple NFM columns are sequentially activated.
Figure 6: Identification of monkey calls that have similar auditory spectrograms. (A) Profiles of the artificial IPs for monkeys V and W (solid lines). The dashed lines denote the IP of monkey A. (B,C) Time courses of neuronal activation in the NCO, NFM, and NID networks when stimulated with the IPs of monkeys V (B) and W (C).
These then activate the two NID columns corresponding to the IPs of monkeys V and A (at 330 ms). The two dynamic NID columns compete for a while (at 330–446 ms), and the NID column corresponding to monkey V finally prevails (at 500 ms). The neuronal competition between the two dynamic NID columns arises from the lateral inhibition between NID columns. In contrast, when the IP of monkey W, whose spectrogram is similar to monkey A's in its last part (0.3–0.4 s; see Figure 6A, right), is presented, the NID column corresponding to the IP of monkey W is selectively activated (at 338 ms) without competition, as shown in Figure 6C. Note that the time required
to identify monkey W (338 ms; see Figure 6C) is shorter than that for monkey V (500 ms; see Figure 6B), presumably because there is less neuronal competition between dynamic NID columns. These results indicate that the last part of the spectrogram might be useful for further analysis if circumstances require it, although relying on it is more time consuming.

4 Discussion

We have proposed the hypothesis that (1) information about monkey calls could be mapped onto a cochleotopic cortical network as spatiotemporal firing patterns, (2) which can then be decomposed into simple (linearly sweeping) FM components and (3) integrated into unified percepts by higher cortical networks. For the cochleotopic two-dimensional map (hypothesis 1), we assumed activity propagation along the isofrequency axis (bands) in order to generate a distinct spatiotemporal firing pattern for a given monkey call. Imaging studies (Taniguchi et al., 1992; Hess & Scheich, 1996; Song et al., 2005) have provided evidence of such activity propagation in the primary auditory cortex (AI). When stimuli alternated between pure tones of 1 and 8 kHz (Hess & Scheich, 1996), the activity propagation was confined to the corresponding low and high isofrequency bands. To our knowledge, the exact form of the spatiotemporal firing patterns evoked by FM stimulation has not been identified yet, so we simply extended this scheme: the neuronal activity propagates along the multiple isofrequency bands corresponding to the tone frequencies constituting the applied FM sound. The actual spatiotemporal firing patterns induced by monkey calls might be rather complex because of interactions between isofrequency bands or influences from other brain regions. Accordingly, the information about individual monkey calls could be encoded more precisely in the auditory cortex. Nevertheless, the proposed simple cochleotopic map was sufficient to generate distinct spatiotemporal firing patterns and worked well as the foundation for later sound processing, namely monkey call identification. The delay lines proposed for the propagation axis (see Figure 1A) were our speculation, prompted by a recent experimental study (Song et al., 2005). That study demonstrated that an electrical pulse, applied focally within an isofrequency band, triggered activity propagation along the band that was similar to tone-evoked activation. When the auditory thalamus was chemically lesioned, the electrically evoked activity in the AI was not affected, but the tone-evoked activity was abolished. Based on these results, it was suggested that intracortical connectivity in the AI enables neuronal activity to propagate along isofrequency bands. The underlying neuronal mechanisms of activity propagation in the AI are not yet fully understood, but we assumed intracortical connectivity via delay lines (see Figure 1B) to produce the activity propagation. Note that these intracortical delay lines are relatively short (less than tens of milliseconds),
which could be neurophysiologically plausible, as addressed in section 1. For expressing the information about simple (linearly sweeping) FM components (hypothesis 2), we assumed neurons that respond selectively to the sweep rates and directions of FMs. Neurophysiological studies (Rauschecker, 1997, 1998; Tian & Rauschecker, 1998) demonstrated that many neurons of the lateral belt areas, to which the primary auditory cortex (AI) projects, responded better to more complex stimuli, such as FMs and band-passed noises, than to pure tones. These neurons were highly selective for the rates and directions of FMs. Neurons of the anterolateral (AL) and caudolateral (CL) belt areas responded better to slower and faster FM sweep rates, respectively, and neurons of the posterior auditory field were highly selective for one direction. The detailed organization of these (rate- and direction-selective) neurons has not been clearly identified yet, but we represented them in a simple, functional manner (see NFM; Figure 1A). To integrate the sequence of linearly sweeping FMs expressing a specific monkey call into a unified percept (hypothesis 3), we assumed neurons that respond selectively to that call. Neurophysiological studies (Rauschecker, 1997, 1998) demonstrated that neurons of the lateral belt responded to a certain class of monkey calls. Very few neurons responded to a single call, and most neurons responded to a number of calls. These results imply that the lateral belt is not the end stage of monkey call processing. Higher cortical areas such as the rostral portion of the superior temporal gyrus (STGr) or the prefrontal cortex, or both, might be responsible for monkey call identification (Rauschecker, 1998). The exact cortical areas whose neurons respond selectively to individual monkey calls have not been clearly identified, but we assumed such "call-selective" neurons in the STGr (see NID; Figure 5A). The selective projections from the NFM to the NID network (or from the lateral belt to the STGr) were our speculation, introduced so that the model could identify individual monkey calls, that is, detect the sequence of FM components. Coincidence detection based on a delay line scheme, as addressed in section 1, is not applicable to longer signals such as monkey calls (more than 500 ms), because the delay lines observed in the brain are at most several hundred milliseconds. The delay lines proposed here for the propagation axis (NCO; Figure 1B) are shorter (tens of milliseconds); they are speculative but neurophysiologically plausible. The idea of this study thus rests on a temporal-to-spatiotemporal conversion of auditory information mediated by short, plausible delay lines (tens of milliseconds), not on a coincidence detection scheme. For the propagation axis, we assumed delay lines between neighboring neurons ranging from 10 to 30 ms (see section 2.2). This architecture allowed activity in the cochleotopic neural network (NCO) to propagate along isofrequency bands, as observed in the auditory cortex (Taniguchi et al., 1992; Hess & Scheich, 1996). To our knowledge, such delay lines have not been reported
in the auditory cortex. However, Hess and Scheich (1996) pointed out that activity propagation along isofrequency bands might be closely related to the distribution of response latencies in the AI, and Tian and Rauschecker (1998) found a response-latency distribution of 23 ± 12 ms in the AI; the range of delay lines used for the NCO network as an AI area is within this observed range. We assumed activity propagation along isofrequency bands. However, activity propagation across isofrequency bands has also been reported (Taniguchi et al., 1992; Hess & Scheich, 1996), for which the interaction between different isofrequency bands might be responsible. In those studies, neuronal activation propagated in both (isofrequency and tonotopic-gradient) directions, with the peak activity shifted along the isofrequency bands (Hess & Scheich, 1996). Although a spatiotemporal firing pattern propagating in both directions might contribute to encoding auditory information, presumably in a more precise manner, we modeled propagation only along the isofrequency bands. Panchev and Wermter (2004) proposed a neural network model that can detect temporal sequences on timescales from hundreds of milliseconds to several seconds. The network consisted of integrate-and-fire neurons with active dendrites and dynamic synapses. They applied the model to recognizing words such as bat, tab, cab, and cat. Each word was expressed as a sequence of phonemes (for example, c → a → t for cat). A spike train of a single input neuron encoded each phoneme of a word, with a 100 ms delay between the onset times of successive phonemes. After training based on spike-timing-dependent plasticity, a single output neuron was able to detect a particular sequence of phonemes, that is, to identify a specific word. Their model could be an alternative, especially for the integration network (NID; Figure 5A) that detects the sequences of simple (linearly sweeping) FMs. Sargolini and colleagues (2006) found evidence of conjunctive representation of position, direction, and velocity in the entorhinal cortex of rats exploring two-dimensional environments. In the medial entorhinal cortex (MEC), the network of grid cells constituted a spatial coordinate system in which positional information was represented, and head direction cells were responsible for representing head-directional information. Grid cells were co-localized with head direction cells and conjunctive (grid and head direction) cells, and the running speed of the rat modulated these cells. The researchers suggested that the conjunctive cells might update the representation of spatial location by integrating positional, directional, and velocity information in the grid cell network during navigation. Such a conjunctive representation may be an alternative for the spatiotemporal representation of monkey calls, in which the information about spectral components, sweep rates, and sweep directions, as well as their combinatorial information, may be represented by distinct cell types in the primary auditory cortex.
In humans, it has been suggested that a voice contains information not only about speech but also about an "auditory face" that allows us to identify individuals (Belin, Fecteau, & Bedard, 2004). This is called auditory face perception and is thought to rely on a neurocognitive scheme similar to that proposed for visual face perception. Among the vocal components relevant to human speech processing, formants (Fitch, 1997) and syllables (Belin & Zatorre, 2003) might be candidate components for identifying individuals. We suggest that monkey call identification may also be a kind of auditory face perception, one making use of FM components. There is evidence that the inferior colliculus (IC) encodes the spectrotemporal acoustic patterns of species-specific calls. For example, Suta, Kvasnak, Popelar, and Syka (2003) investigated the neuronal representation of specific calls in the IC of guinea pigs, recording the responses of individual IC neurons of anesthetized animals to four typical calls (purr, chutter, chirp, and whistle). A majority of neurons (55% of 124 units) responded to all calls; a small portion (3%) responded to only one call or to none of them. Time-reversed versions of the calls elicited, on average, weaker responses. The researchers concluded that IC neurons do not respond selectively to specific calls but rather encode the spectrotemporal acoustic patterns of the calls. Maki and Riquimaroux (2002) recorded the responses of IC neurons of gerbils to two distinct FM sounds that had the same spectral components (5–12 kHz) but opposite sweep directions, one sweeping upward and one downward. The upward FM generated much stronger responses than the downward FM. The researchers suggested that this directional selectivity to FM sweeps implies that the IC may encode the spectrotemporal acoustic patterns of species-specific calls. These experiments imply that the encoding of spectrotemporal acoustic images of specific calls takes place, in part, in the IC, which presumably makes the spatiotemporal pattern of neuronal activation in the present cochleotopic map more complex. We hope the details will become known in the near future.

5 Conclusion

In this study, we have proposed a cochleotopic map analogous to the retinotopic map in vision. When the cochleotopic (NCO) network was stimulated with a monkey call, the peripheral neurons located on the frequency axis were sequentially activated. The active area moved along the propagation axis, by which the information about the call was mapped as a spatiotemporal firing pattern in the cochleotopic network dynamics. This spatiotemporal conversion was quite effective in enabling the NFM network to decompose the call information into simple (linearly sweeping) FM components, from which the higher network (NID) was able to integrate these components into a unified percept, that is, to identify the call.
We suggest that the information about monkey calls could be mapped on a cochleotopic cortical network as spatiotemporal firing patterns, which can then be decomposed into simple (linearly sweeping) FM components and integrated into unified percepts by higher cortical networks. The spatiotemporal conversion of auditory information may be essential for developing the cochleotopic map, which could serve as the foundation for later processing, or monkey call identification, by higher cortical areas.

Acknowledgments

I am grateful to Yuishi Iwasaki for productive discussions and to Hiromi Ohta for her encouragement throughout the study. I am also grateful to the reviewers for giving me valuable comments and suggestions on the earlier draft.

References

Belin, P., Fecteau, S., & Bedard, C. (2004). Thinking the voice: Neural correlates of voice perception. Trends Cogn. Sci., 8, 129–135.
Belin, P., & Zatorre, R. J. (2003). Adaptation to speaker’s voice in right anterior temporal lobe. Neuroreport, 14, 2105–2109.
Bi, G. Q., & Poo, M. M. (1999). Distributed synaptic modification in neural networks induced by patterned stimulation. Nature, 401, 792–796.
Fitch, W. T. (1997). Vocal tract length and formant frequency dispersion correlate with body size in rhesus macaques. J. Acoust. Soc. Am., 102, 1213–1222.
Hess, A., & Scheich, H. (1996). Optical and FDG mapping of frequency-specific activity in auditory cortex. Neuroreport, 7, 2643–2677.
Holden, C. (2004). The origin of speech. Science, 303, 1316–1319.
Janik, V. M. (2000). Whistle matching in wild bottlenose dolphins (Tursiops truncatus). Science, 289, 1355–1357.
Jeffress, L. A. (1948). A place theory of sound localization. J. Comp. Physiol. Psychol., 41, 35–39.
Liberman, A. M., Cooper, F. S., Shankweiler, D. P., & Studdert-Kennedy, M. (1967). Perception of the speech code. Psychol. Rev., 74, 431–461.
Maki, K., & Riquimaroux, H. (2002). Time-frequency distribution of neuronal activity in the gerbil inferior colliculus responding to auditory stimuli. Neurosci. Lett., 331, 1–4.
Miller, R. (1987). Representation of brief temporal patterns, Hebbian synapses, and the left-hemisphere dominance for phoneme recognition. Psychobiol., 15, 241–247.
Panchev, C., & Wermter, S. (2004). Spike-timing-dependent synaptic plasticity: From single spikes to spike trains. Neurocomputing, 58–60, 365–371.
Poremba, A., Malloy, M., Saunders, R. C., Carson, R. E., Herscovitch, P., & Mishkin, M. (2004). Species-specific calls evoke asymmetric activity in the monkey’s temporal poles. Nature, 427, 448–451.
Rauschecker, J. P. (1997). Processing of complex sounds in the auditory cortex of cat, monkey, and man. Acta Otolaryngol. (Stockh.), 532, 34–38.
Rauschecker, J. P. (1998). Parallel processing in the auditory cortex of primates. Audiol. Neuro-Otol., 3, 86–103.
Sargolini, F., Fyhn, M., Hafting, T., McNaughton, B. L., Witter, M. P., Moser, M. B., & Moser, E. I. (2006). Conjunctive representation of position, direction, and velocity in entorhinal cortex. Science, 312, 758–762.
Song, W. J., Kawaguchi, H., Totoki, S., Inoue, Y., Katura, T., Maeda, S., Inagaki, S., Shirasawa, H., & Nishimura, M. (2005). Cortical intrinsic circuits can support activity propagation through an isofrequency strip of the guinea pig primary auditory cortex. Cereb. Cortex, 16, 718–729.
Suga, N. (1995). Processing of auditory information carried by species-specific complex sounds. In M. S. Gazzaniga (Ed.), The cognitive neurosciences (pp. 295–313). Cambridge, MA: MIT Press.
Suta, D., Kvasnak, E., Popelar, J., & Syka, J. (2003). Representation of species-specific vocalizations in the inferior colliculus of the guinea pig. J. Neurophysiol., 90, 3794–3808.
Symmes, D., Newman, J. D., Talmage-Riggs, G., & Lieblich, A. K. (1979). Individuality and stability of isolation peeps in squirrel monkeys. Anim. Behav., 27, 1142–1152.
Taniguchi, I., Horikawa, J., Moriyama, T., & Nasu, M. (1992). Spatio-temporal pattern of frequency representation in the auditory cortex of guinea pigs. Neurosci. Lett., 146, 37–40.
Tian, B., & Rauschecker, J. P. (1998). Processing of frequency-modulated sounds in the cat’s posterior auditory field. J. Neurophysiol., 79, 2629–2642.
Tyack, P. L. (2000). Dolphins whistle a signature tune. Science, 289, 1310–1311.
Yost, W. A. (1994). The neural response and the auditory code. In W. A. Yost (Ed.), Fundamentals of hearing (pp. 116–133). San Diego, CA: Academic Press.
Received January 19, 2006; accepted June 9, 2006.
LETTER
Communicated by Gal Chechik
Reducing the Variability of Neural Responses: A Computational Theory of Spike-Timing-Dependent Plasticity

Sander M. Bohte
[email protected]
Netherlands Centre for Mathematics and Computer Science (CWI), 1098 SJ Amsterdam, The Netherlands
Michael C. Mozer
[email protected]
Department of Computer Science, University of Colorado, Boulder, CO, U.S.A.
Experimental studies have observed synaptic potentiation when a presynaptic neuron fires shortly before a postsynaptic neuron and synaptic depression when the presynaptic neuron fires shortly after. The dependence of synaptic modulation on the precise timing of the two action potentials is known as spike-timing dependent plasticity (STDP). We derive STDP from a simple computational principle: synapses adapt so as to minimize the postsynaptic neuron’s response variability to a given presynaptic input, causing the neuron’s output to become more reliable in the face of noise. Using an objective function that minimizes response variability and the biophysically realistic spike-response model of Gerstner (2001), we simulate neurophysiological experiments and obtain the characteristic STDP curve along with other phenomena, including the reduction in synaptic plasticity as synaptic efficacy increases. We compare our account to other efforts to derive STDP from computational principles and argue that our account provides the most comprehensive coverage of the phenomena. Thus, reliability of neural response in the face of noise may be a key goal of unsupervised cortical adaptation.

1 Introduction

Experimental studies have observed synaptic potentiation when a presynaptic neuron fires shortly before a postsynaptic neuron and synaptic depression when the presynaptic neuron fires shortly after (Markram, Lübke, Frotscher, & Sakmann, 1997; Bell, Han, Sugawara, & Grant, 1997; Zhang, Tao, Holt, Harris, & Poo, 1998; Bi & Poo, 1998; Debanne, Gähwiler, & Thompson, 1998; Feldman, 2000; Sjöström, Turrigiano, & Nelson, 2001; Nishiyama, Hong, Mikoshiba, Poo, & Kato, 2000).
Neural Computation 19, 371–403 (2007)
© 2007 Massachusetts Institute of Technology
Figure 1: (A) Measuring STDP experimentally. Presynaptic and postsynaptic spike pairs are repeatedly induced at a fixed interval t_pre-post, and the resulting change to the strength of the synapse is assessed. (B) Change in synaptic strength after repeated spike pairing as a function of the difference in time between the presynaptic and postsynaptic spikes. A presynaptic-before-postsynaptic spike induces LTP, and a postsynaptic-before-presynaptic spike induces LTD (data points were obtained by digitizing figures in Zhang et al., 1998). We have superimposed an exponential fit of LTP and LTD.
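Such a fit is commonly summarized as one decaying exponential per branch of the window. A minimal sketch of this functional form is given below; the amplitudes and time constants are illustrative placeholders, not the values fitted to the Zhang et al. (1998) data, which the caption does not report.

```python
import numpy as np

def stdp_window(dt_pre_post, a_ltp=1.0, tau_ltp=15.0, a_ltd=0.5, tau_ltd=25.0):
    """Double-exponential STDP window of the kind superimposed in Figure 1B.
    dt_pre_post < 0: presynaptic spike leads (LTP branch);
    dt_pre_post > 0: postsynaptic spike leads (LTD branch).
    All parameters are illustrative, not fitted values."""
    dt = np.asarray(dt_pre_post, dtype=float)
    return np.where(dt < 0,
                    a_ltp * np.exp(dt / tau_ltp),    # decays as pre leads more
                    -a_ltd * np.exp(-dt / tau_ltd))  # decays as post leads more

# Example: relative weight change across the +/-40 ms pairing window.
lags = np.linspace(-40.0, 40.0, 9)
print(np.round(stdp_window(lags), 3))
```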
The dependence of synaptic modulation on the precise timing of the two action potentials, known as spike-timing dependent plasticity (STDP), is depicted in Figure 1. Typically, plasticity is observed only when the presynaptic and postsynaptic spikes occur within a 20 to 30 ms time window, and the transition from potentiation to depression is very rapid. The effects are long lasting and are therefore referred to as long-term potentiation (LTP) and depression (LTD). An important observation is that the relative magnitude of the LTP component of STDP decreases with increased synaptic efficacy between presynaptic and postsynaptic neuron, whereas the magnitude of LTD remains roughly constant (Bi & Poo, 1998). This finding has led to the suggestion that the LTP component of STDP might best be modeled as additive, whereas the LTD component is better modeled as being multiplicative (Kepecs, van Rossum, Song, & Tegner, 2002). For detailed reviews of STDP see Bi and Poo (2001), Roberts and Bell (2002), and Dan and Poo (2004). Because these intriguing findings appear to describe a fundamental learning mechanism in the brain, a flurry of models has been developed that focus on different aspects of STDP. A number of studies focus on biochemical models that explain the underlying mechanisms giving rise to STDP (Senn, Markram, & Tsodyks, 2000; Bi, 2002; Karmarkar, Najarian, & Buonomano, 2002; Saudargiene, Porr, & Wörgötter, 2004; Porr & Wörgötter, 2003). Many researchers have also focused on models that explore the consequences of STDP-like learning rules in an ensemble of spiking neurons
(Gerstner, Kempter, van Hemmen, & Wagner, 1996; Kempter, Gerstner, & van Hemmen, 1999, 2001; Song, Miller, & Abbott, 2000; van Rossum, Bi, & Turrigiano, 2000; Izhikevich & Desai, 2003; Abbott & Gerstner, 2004; Burkitt, Meffin, & Grayden, 2004; Shon, Rao, & Sejnowski, 2004; Legenstein, Naeger, & Maass, 2005), and a comprehensive review of the different types and conclusions can be found in Porr and Wörgötter (2003). Finally, a recent trend is to propose models that provide fundamental computational justifications for STDP. This article proposes a novel justification, and we explore the consequences of this justification in detail. Most commonly, STDP is viewed as a type of asymmetric Hebbian learning with a temporal dimension. However, this perspective is hardly a fundamental computational rationale, and one would hope that such an intuitively sensible learning rule would emerge from a first-principle computational justification. Several researchers have tried to derive a learning rule yielding STDP from first principles. Dayan and Häusser (2004) show that STDP can be viewed as an optimal noise-removal filter for certain noise distributions. However, even a small variation from these noise distributions yields quite different learning rules, and the noise statistics of biological neurons are unknown. Similarly, Porr and Wörgötter (2003) propose an unsupervised learning rule based on the correlation of bandpass-filtered inputs with the derivative of the output and show that the weight change rule is qualitatively similar to STDP. Hopfield and Brody (2004) derive learning rules that implement ongoing network self-repair. In some circumstances, a qualitative similarity to STDP is found, but the shape of the learning rule depends on both network architecture and task. M. Eisele (private communication, April 2004) has shown that an STDP-like learning rule can be derived from the goal of maintaining the relevant connections in a network. Rao and Sejnowski (1999, 2001) suggest that STDP may be related to prediction, in particular to temporal difference (TD) learning. They argue that STDP emerges when a neuron attempts to predict its membrane potential at some time t from the potential at time t − Δt. As Dayan (2002) points out, however, temporal difference learning depends on an estimate of the prediction error, which will be very hard to obtain. Rather, a quantity that might be called an activity difference can be computed, and the learning rule is then better characterized as a “correlational learning rule between the stimuli, and the differences in successive outputs” (Dayan, 2002; see also Porr & Wörgötter, 2003, appendix B). Furthermore, Dayan argues that for true prediction, the model has to show that the learning rule works for biologically realistic timescales. The qualitative nature of the modeling makes it unclear whether a quantitative fit can be obtained. Finally, the derived difference rule is inherently unstable, as it does not impose any bounds on synaptic efficacies; also, STDP emerges only for a narrow range of Δt values.
Chechik (2003) relates STDP to information theory via maximization of mutual information between input and output spike trains. This approach derives the LTP portion of STDP but fails to yield the LTD portion. Nonetheless, an information-theoretic approach is quite elegant and has proven valuable in explaining other neural learning phenomena (e.g., Linsker, 1989). The account we describe in this article also exploits an information-theoretic approach. We are not the only ones to appreciate the elegance of information-theoretic accounts. In parallel with a preliminary presentation of our work at the NIPS 2004 conference, two quite similar information-theoretic accounts also appeared (Bell & Parra, 2005; Toyoizumi, Pfister, Aihara, & Gerstner, 2005). It will be easiest to explain the relationship of these accounts to our own once we have presented ours. The computational approaches of Chechik (2003), Dayan and Häusser (2004), and Porr and Wörgötter (2003) are all premised on a rate-based neuron model that disregards the relative timing of spikes. It seems quite odd to argue for STDP using neural firing rate: if spike timing is irrelevant to information transmission, then STDP is likely an artifact and is not central to understanding mechanisms of neural computation. Further, as Dayan and Häusser (2004) note, because STDP is not quite additive in the case of multiple input or output spikes that are near in time (Froemke & Dan, 2002), one should consider interpretations that are based on individual spikes, not aggregates over spike trains. In this letter, we present an alternative theoretical motivation for STDP from a spike-based neuron model that takes the specific times of spikes into account. We conjecture that a fundamental objective of cortical computation is to achieve reliable neural responses, that is, neurons should produce the identical response—in both the number and timing of spikes—given a fixed input spike train. Reliability is an issue if neurons are affected by noise influences, because noise leads to variability in a neuron’s dynamics and therefore in its response. Minimizing this variability will reduce the effect of noise and will therefore increase the informativeness of the neuron’s output signal. The source of the noise is not important; it could be intrinsic to a neuron (e.g., a time-varying threshold), or it could originate in unmodeled external sources that cause fluctuations in the membrane potential uncorrelated with a particular input. We are not suggesting that increasing neural reliability is the only objective of learning. If it were, a neuron would do well to shut off and give no response regardless of the input. Rather, reliability is but one of many objectives that learning tries to achieve. This form of unsupervised learning must, of course, be complemented by other unsupervised, supervised, and reinforcement learning objectives that allow an organism to achieve its goals and satisfy drives. We return to this issue below and in our conclusions section.
We derive STDP from the following computational principle: synapses adapt so as to minimize the variability in the timing of the spikes of the postsynaptic neuron’s output in response to given presynaptic input spike trains. This variability reduction causes the response of a neuron to become more deterministic and less sensitive to noise, which provides an obvious computational benefit. In our simulations, we follow the methodology of neurophysiological experiments. This approach leads to a detailed fit to key experimental results. We model not only the shape (sign and time course) of the STDP curve, but also the fact that potentiation of a synapse depends on the efficacy of the synapse; it decreases with increased efficacy. In addition to fitting these key STDP phenomena, the model allows us to make predictions regarding the relationship between properties of the neuron and the shape of the STDP curve. The detailed quantitative fit to data makes our work unique among first-principle computational accounts. Before delving into the details, we give a basic intuition about the approach. Noise in spiking neuron dynamics leads to variability in the number and timing of spikes. Given a particular input, one spike train might be more likely than others, but the output is nondeterministic. By the response variability minimization principle, adaptation should reduce the likelihood of these other possibilities. To be concrete, consider a particular experimental paradigm. In Zhang et al. (1998), a presynaptic neuron is identified with a weak synapse to a postsynaptic neuron, such that this presynaptic input is unlikely to cause the postsynaptic neuron to fire. However, the postsynaptic neuron can be induced to fire via a second presynaptic connection. In a typical trial, the presynaptic neuron is induced to fire a single spike, and with a variable delay, the postsynaptic neuron is also induced to fire (typically) a single spike. To increase the likelihood of the observed postsynaptic response, other response possibilities must be suppressed. With presynaptic input preceding the postsynaptic spike, the most likely alternative response is no output spikes at all. Increasing the synaptic connection weight should then reduce the possibility of this alternative response. With presynaptic input following the postsynaptic spike, the most likely alternative response is a second output spike. Decreasing the synaptic connection weight should reduce the possibility of this alternative response. Because both of these alternatives become less likely as the lag between pre- and postsynaptic spikes is increased, one would expect that the magnitude of synaptic plasticity diminishes with the lag, as is observed in the STDP curve. Our approach to reducing response variability given a particular input pattern involves computing the gradient of the response variability with respect to the synaptic weights in a differentiable model of spiking neuron behavior. We use the spike response model (SRM) of Gerstner (2001) with a stochastic threshold, where the stochastic threshold models fluctuations of the membrane potential or
the threshold outside experimental control. For the stochastic SRM, the response probability is differentiable with respect to the synaptic weights, allowing us to calculate the gradient of the response variability with respect to the weights. Learning is presumed to take a gradient step to reduce the response variability. In modeling neurophysiological experiments, we demonstrate that this learning rule yields the typical STDP curve. We can predict the relationship between the exact shape of the STDP curve and physiologically measurable parameters, and we show that our results are robust to the choice of the few free parameters of the model. Many important machine learning algorithms in the literature seek local optimizers. It is often the case that the initial conditions, which determine which local optimizer will be found, can be controlled to avoid unwanted local optimizers. For example, with neural networks, weights are initialized near the origin; large initial weights would lead to degenerate solutions. And K-means has many degenerate and suboptimal solutions; consequently, careful initialization of cluster centers is required. In the case of our model’s learning algorithm, the initial conditions also avoid the degenerate local optimizer. These initial conditions correspond to the original weights of the synaptic connections and are constrained by the specific methodology of the experiments that we model: the subthreshold input must have a small but nonzero connection strength, and the suprathreshold input must have a large connection strength (less than 10%, more than 70% probability of activating the target, respectively). Given these conditions, the local optimizer that our learning algorithm discovers is an extremely good fit to the experimental data. In parallel with our work, two other groups of authors have proposed explanations of STDP in terms of neurons maximizing an information-theoretic measure for the spike-response model (Bell & Parra, 2005; Toyoizumi et al., 2005). Toyoizumi et al. (2005) maximize the mutual information between the inputs from a pool of presynaptic neurons and the output of a single postsynaptic neuron, whereas Bell and Parra (2005) maximize sensitivity between a pool of (possibly correlated) presynaptic neurons and a pool of postsynaptic neurons. Bell and Parra use a causal SRM model and do not obtain the LTD component of STDP. As we will show, when the objective function is minimization of (conditional) response variability, obtaining LTD critically depends on a stochastic neural response. In the derivation of Toyoizumi et al. (2005), LTD, which is very weak in magnitude, is attributed to the refractoriness of the spiking neuron (via the autocorrelation function), where they use questionably strong and enduring refractoriness. In our framework, refractoriness suppresses noise in the neuron after spiking, and we show that in our simulations, strong refraction in fact diminishes the LTD component of STDP. Furthermore, the mathematical derivation of Toyoizumi et al. is valid only for an essentially constant membrane potential with small fluctuations, a condition clearly violated in experimental
conditions studied by neurophysiologists. It is unclear whether the derivation would hold under more realistic conditions. Neither of these approaches thus far succeeds in quantitatively modeling specific experimental data with neurobiologically realistic timing parameters, and neither explains, as our model does, the relative reduction of STDP as synaptic efficacy increases. Nonetheless, these models make an interesting contrast to ours by suggesting a computational principle of optimization of information transmission, as contrasted with our principle of neural response variability reduction. Experimental tests might be devised to distinguish between these competing theories. In section 2 we describe the sSRM, and in section 3 we derive the minimal-entropy gradient. In section 4 we describe the STDP experiment, which we simulate in section 5. We conclude in section 6.

2 The Stochastic Spike Response Model

The spike response model (SRM), defined by Gerstner (2001), is a generic integrate-and-fire model of a spiking neuron that closely corresponds to the behavior of a biological spiking neuron and is characterized in terms of a small set of easily interpretable parameters (Jolivet, Lewis, & Gerstner, 2003; Paninski, Pillow, & Simoncelli, 2005). The standard SRM formulation describes the temporal evolution of the membrane potential based on past neuronal events, specifically as a weighted sum of postsynaptic potentials (PSPs) modulated by reset and threshold effects of previous postsynaptic spiking events. The general idea is depicted in Figure 2; formally (following Gerstner, 2001), the membrane potential u_i(t) of cell i at time t is defined as

$$u_i(t) = \sum_{f_i \in G_i^t} \eta(t - f_i) + \sum_{j \in \Gamma_i} w_{ij} \sum_{f_j \in G_j^t} \varepsilon(t \mid f_j, G_i^t), \tag{2.1}$$
where Γ_i is the set of inputs connected to neuron i; G_i^t is the set of times prior to t at which neuron i has spiked, with firing times f_i ∈ G_i^t; w_ij is the synaptic weight from neuron j to neuron i; ε(t | f_j, G_i^t) is the PSP in neuron i due to an input spike from neuron j at time f_j, given postsynaptic firing history G_i^t; and η(t − f_i) is the refractory response due to the postsynaptic spike at time f_i. To model the postsynaptic potential ε in a leaky integrate-and-fire neuron, a spike of presynaptic neuron j emitted at time f_j generates a postsynaptic current α(t − f_j) for t > f_j. In the absence of postsynaptic firing, this kernel (following Gerstner & Kistler, 2002, eqs. 4.62–4.56, pp. 114–115) can be computed as

$$\varepsilon(t \mid f_j) = \int_{f_j}^{t} \exp\left(-\frac{t - s}{\tau_m}\right) \alpha(s - f_j)\, ds, \tag{2.2}$$
Figure 2: Membrane potential u(t) of a neuron as a sum of weighted excitatory PSP kernels due to impinging spikes. Arrival of PSPs marked by arrows. Once the membrane potential reaches threshold, it is reset, and a reset function η is added to model the recovery effects of the threshold.
where τ_m is the decay time of the postsynaptic neuron’s membrane potential. Consider an exponentially decaying postsynaptic current α(t) of the form

$$\alpha(t) = \frac{1}{\tau_s}\, H(t) \exp\left(-\frac{t}{\tau_s}\right) \tag{2.3}$$
(see Figure 3A), where τ_s is the decay time of the current and H(t) is the Heaviside function. In the absence of postsynaptic firing, this current contributes a postsynaptic potential of the form

$$\varepsilon(t \mid f_j) = \frac{1}{1 - \tau_s/\tau_m} \left[ \exp\left(-\frac{t - f_j}{\tau_m}\right) - \exp\left(-\frac{t - f_j}{\tau_s}\right) \right] H(t - f_j), \tag{2.4}$$
with current decay time constant τ_s and membrane decay time constant τ_m. When the postsynaptic neuron fires after the presynaptic spike arrives—at some time f̂_i following the presynaptic spike at time f_j—the membrane potential is reset, and only the remaining synaptic current α(t′) for t′ > f̂_i is integrated in equation 2.2. Following Gerstner, 2001 (section 4.4, equation 1.66), the PSP that takes such postsynaptic firing into account can be written as

$$\varepsilon(t \mid f_j, \hat{f}_i) = \begin{cases} \varepsilon(t \mid f_j), & \hat{f}_i < f_j, \\ \exp\left(-\dfrac{\hat{f}_i - f_j}{\tau_s}\right) \varepsilon(t \mid f_j), & \hat{f}_i \ge f_j. \end{cases} \tag{2.5}$$
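The kernels of equations 2.3 to 2.5 are simple to evaluate numerically. The sketch below (Python/numpy) assumes the τ_s = 2.5 ms and τ_m = 10 ms values quoted later for Figure 5; the reset branch implements the residual-current attenuation just described, with the decaying sign of the exponent as reconstructed above.

```python
import numpy as np

TAU_S, TAU_M = 2.5, 10.0  # current and membrane decay times (ms), section 5 values

def alpha(t):
    """Exponentially decaying postsynaptic current, equation 2.3."""
    t = np.asarray(t, dtype=float)
    return np.where(t >= 0, np.exp(-np.maximum(t, 0.0) / TAU_S) / TAU_S, 0.0)

def eps(t, f_j):
    """PSP kernel in the absence of postsynaptic firing, equation 2.4."""
    d = np.asarray(t, dtype=float) - f_j
    dp = np.maximum(d, 0.0)
    return np.where(d >= 0,
                    (np.exp(-dp / TAU_M) - np.exp(-dp / TAU_S))
                    / (1.0 - TAU_S / TAU_M),
                    0.0)

def eps_reset(t, f_j, f_hat_i):
    """PSP kernel with one postsynaptic reset at f_hat_i, equation 2.5: after
    the reset, only the residual synaptic current is reintegrated, which
    amounts to attenuating the kernel by the undecayed current fraction."""
    if f_hat_i < f_j:
        return eps(t, f_j)
    return np.exp(-(f_hat_i - f_j) / TAU_S) * eps(t, f_j)
```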
Figure 3: (A) α(t) function. Synaptic input modeled as exponentially decaying current. (B) Postsynaptic potential due to a synaptic input in the absence of postsynaptic firing (solid line), and with postsynaptic firing once and twice (dotted resp. dashed lines; postsynaptic spikes indicated by arrow). (C) Reset function η(t). (D) Spike probability ρ(u) as a function of potential u for different values of α and β parameters.
This function is depicted in Figure 3B, for the cases when a postsynaptic spike occurs both before and after the presynaptic spike. In principle, this formulation can be expanded to include the postsynaptic neuron firing more than once after the onset of the postsynaptic potential. However, for fast current decay times τ_s, it is useful to consider only the residual current input for the first postsynaptic spike after onset and assume that any further postsynaptic spiking is modeled by a postsynaptic potential reset to zero from that point on. The reset response η(t) models two phenomena. First, a neuron can be in a refractory period: it simply cannot spike again for about a millisecond after a spiking event. Second, after the emission of a spike, the threshold of the neuron may initially be elevated and then recover to the original value (Kandel, Schwartz, & Jessell, 2000). The SRM models this behavior as negative contributions to the membrane potential (see equation 2.1): with s = t − f̂_i denoting the time since the postsynaptic spike, the refractory
reset function is defined as (Gerstner, 2001):
$$\eta(s) = \begin{cases} U_{\mathrm{abs}}, & 0 < s < \delta_r, \\ U_{\mathrm{abs}} \exp\left(-\dfrac{s - \delta_r}{\tau_r^f}\right) + U_r \exp\left(-\dfrac{s}{\tau_r^s}\right), & s \ge \delta_r, \end{cases} \tag{2.6}$$
where a large negative impulse U_abs models the absolute refractory period, with duration δ_r; the absolute refractory contribution smoothly resets via a fast-decaying exponential with time constant τ_r^f. The term U_r models the slow exponential recovery of the elevated threshold with time constant τ_r^s. The function η is depicted in Figure 3C. We made a minor modification to the SRM described in Gerstner (2001) by relaxing the constraint that τ_r^s = τ_m and also by smoothing the absolute refractory function (such smoothing is mentioned in Gerstner, but is not explicitly defined). In all simulations, we use δ_r = 1 ms, τ_r^s = 3 ms, and τ_r^f = 0.25 ms (in line with estimates for biological neurons; Kandel et al., 2000; the smoothing parameter was chosen to be fast compared to τ_r^s). The SRM we just described is deterministic. Gerstner (2001) introduces a stochastic variant of the SRM (sSRM) by incorporating the notion of a stochastic firing threshold: given membrane potential u_i(t), the probability of the neuron firing at time t is specified by ρ(u_i(t)). Herrmann and Gerstner (2001) find that for a reasonable escape-rate noise model of the integration of current in real neurons, the probability of firing is small and constant for small potentials, but around a threshold ϑ, the probability increases linearly with the potential. In our simulations, we use such a function,

$$\rho(v) = \frac{\beta}{\alpha} \left\{ \ln\left[1 + \exp(\alpha(\vartheta - v))\right] - \alpha(\vartheta - v) \right\}, \tag{2.7}$$
where α determines the abruptness of the constant-to-linear transition in the neighborhood of threshold ϑ and β determines the slope of the linear increase beyond ϑ. This function is depicted in Figure 3D for several values of α and β. We also conducted simulation experiments with sigmoidal and exponential density functions and found no qualitative difference in the results.

3 Minimizing Conditional Entropy

We now derive the rule for adjusting the weight from a presynaptic input neuron j to a postsynaptic neuron i so as to minimize the entropy of i’s response given a particular spike train from j. A spike train is described by the set of all times at which a neuron i emitted spikes within some interval between 0 and T, denoted G_i^T. We assume the interval is wide enough that the occurrence of spikes outside
the interval does not influence the state of a neuron within the interval (e.g., through threshold reset effects). This assumption allows us to treat intervals as independent of each other. The set of input spikes received by neuron i during this interval is denoted F_i^T, which is just the union of the output spike trains of all connected presynaptic neurons j: F_i^T = {G_j^T, ∀ j ∈ Γ_i}. Given input spikes F_i^T, the stochastic nature of neuron i may lead not only to the observed response G_i^T but also to a range of other possibilities. Denote the set of possible responses Ω_i, where G_i^T ∈ Ω_i. Further, let binary variable σ(t) denote the state of the neuron in the time interval [t, t + Δt), where σ(t) = 1 means the neuron spikes and σ(t) = 0 means no spike. A response ξ ∈ Ω_i is then equivalent to [σ(0), σ(Δt), . . . , σ(T)]. Given a probability density p(ξ) over all possible responses ξ, the differential entropy of neuron i’s response conditional on input F_i^T is then defined as
$$h_i(F_i^T) = -\int_{\Omega_i} p(\xi) \log p(\xi)\, d\xi. \tag{3.1}$$
According to our hypothesis, a neuron adjusts its weights so as to minimize the conditional response variability. Such an adjustment is obtained by performing gradient descent on the weighted likelihood of the response, which corresponds to the conditional entropy, with respect to the weights,
$$\Delta w_{ij} = -\gamma\, \frac{\partial h_i(F_i^T)}{\partial w_{ij}}, \tag{3.2}$$
with learning rate γ . In this section, we compute the right-hand side of equation 3.2 for an sSRM neuron. Substituting the entropy definition of equation 3.1 into equation 3.2, we obtain:
$$\frac{\partial h_i(F_i^T)}{\partial w_{ij}} = -\frac{\partial}{\partial w_{ij}} \int_{\Omega_i} p(\xi) \log p(\xi)\, d\xi = -\int_{\Omega_i} \frac{\partial \log p(\xi)}{\partial w_{ij}}\, p(\xi) \left( \log p(\xi) + 1 \right) d\xi. \tag{3.3}$$

We closely follow Xie and Seung (2004) to derive ∂ log p(ξ)/∂w_ij for a differentiable neuron model firing at times G_i^T. First, we factorize p(ξ):

$$p(\xi) = \prod_{t=0}^{T} P\left(\sigma(t) \mid \{\sigma(t'),\ \forall t' < t\}\right). \tag{3.4}$$
The states σ(t) are conditionally independent, as the probability for a neuron i to fire during [t, t + Δt) is determined by the spike probability density of the membrane potential:

$$\sigma(t) = \begin{cases} 1 & \text{with probability } p_i = \rho_i(t)\,\Delta t, \\ 0 & \text{with probability } \bar{p}_i = 1 - p_i(t), \end{cases}$$

with ρ_i(t) shorthand for the spike probability density of the membrane potential, ρ(u_i(t)); this equation holds for sufficiently small Δt (see also Xie & Seung, 2004, for more details). We note further that

$$\frac{\partial \ln p(\xi)}{\partial w_{ij}} \equiv \frac{1}{p(\xi)} \frac{\partial p(\xi)}{\partial w_{ij}} \qquad \text{and} \qquad \frac{\partial \rho_i(t)}{\partial w_{ij}} = \frac{\partial \rho_i(t)}{\partial u_i(t)} \frac{\partial u_i(t)}{\partial w_{ij}}. \tag{3.5}$$
It is straightforward to derive:

$$\begin{aligned} \frac{\partial \log p(\xi)}{\partial w_{ij}} &= \int_{t=0}^{T} \frac{\partial \rho_i(t)}{\partial u_i(t)} \frac{\partial u_i(t)}{\partial w_{ij}}\, \frac{\sum_{f_i \in G_i^T} \delta(t - f_i) - \rho_i(t)}{\rho_i(t)}\, dt \\ &= -\int_{t=0}^{T} \rho_i'(t)\, \varepsilon(t \mid f_j, \hat{f}_i)\, dt + \sum_{f_i \in G_i^T} \frac{\rho_i'(f_i)}{\rho_i(f_i)}\, \varepsilon(f_i \mid f_j, \hat{f}_i), \end{aligned} \tag{3.6}$$

where ρ_i′(t) ≡ ∂ρ_i(t)/∂u_i(t) and δ(t − f_i) is the Dirac delta, and we use that in the sSRM formulation,

$$\frac{\partial u_i(t)}{\partial w_{ij}} = \varepsilon(t \mid f_j, \hat{f}_i).$$

The term ρ_i′(t) in equation 3.6 can be computed for any differentiable spike probability function. In the case of equation 2.7,

$$\rho_i'(t) = \frac{\beta}{1 + \exp\left(\alpha(\vartheta - u_i(t))\right)}.$$
383
Substituting our model for ρi (t), ρi (t) from equation 2.7 into equation 3.6, we obtain
∂ log( p(ξ )) = −β ∂wi j +
f i ∈GiT
T t=0
(t| f j , f i ) dt 1 + exp[α(ϑ − ui (t))]
( f i | f j , f i ) . α {ln(1 + exp[α (ϑ − ui ( f i ))]) − α (ϑ − ui ( f i ))}(1 + exp[α(ϑ − ui ( f i ))]) (3.7)
Equation 3.7 can be substituted into equation 3.3, which, when integrated, provides the gradient-descent weight update that implements conditional entropy minimization (see equation 3.2). The hypothesis under exploration is that this gradient-descent weight update yields STDP. Unfortunately, an analytic solution to equation 3.3 (and hence equation 3.2) is not readily obtained. Nonetheless, numerical methods can be used to obtain a solution. We are not suggesting a neuron performs numerical integration of this sort in real time. It would be preposterous to claim biological realism for an instantaneous integration over all possible responses ξ ∈ Ω_i, as specified by equation 3.3. Consequently, we have a dilemma: What use is a computational theory of STDP if the theory demands intensive computations that could not possibly be performed by a neuron in real time? This dilemma can be circumvented in two ways. First, the resulting learning rule might be cached in some form through evolution so that the computation is not necessary. That is, the solution—the STDP curve itself—may be built into a neuron. As such, our computational theory provides an argument for why neurons have evolved to implement the STDP learning rule. Second, the specific response produced by a neuron on a single trial might be considered a sample from the distribution p(ξ), and the integration in equation 3.3 can be performed by a sampling process over repeated trials; each trial would produce a stochastic gradient step.
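The second, sampling-based reading of equation 3.3 is easy to make concrete. The sketch below draws a response ξ bin by bin (equation 3.4), evaluates log p(ξ) and the gradient of equation 3.6 for that single sample, and averages single-trial estimates. The potential trace and kernel are toy stand-ins, and reset feedback on u is not re-simulated, so this illustrates the estimator rather than the full model.

```python
import numpy as np

rng = np.random.default_rng(0)
DT, T = 0.1, 60.0                    # bin width and trial length (ms)
ALPHA, BETA, THETA = 2.0, 0.1, 1.0   # illustrative sSRM noise parameters

def rho(v):                          # equation 2.7 (softplus intensity)
    return (BETA / ALPHA) * np.log1p(np.exp(ALPHA * (v - THETA)))

def rho_prime(v):                    # its derivative with respect to v
    return BETA / (1.0 + np.exp(ALPHA * (THETA - v)))

def sampled_entropy_gradient(u, eps_ij):
    """One stochastic sample of the integrand of equation 3.3.
    u: membrane potential per bin (resets not re-simulated here);
    eps_ij: PSP kernel of the studied synapse per bin.
    Returns the single-trial estimate (log p(xi) + 1) * d log p(xi)/d w_ij."""
    p_bin = rho(u) * DT                      # per-bin spike probability
    spikes = rng.random(u.size) < p_bin      # sample a response xi (eq. 3.4)
    log_p = np.sum(np.where(spikes, np.log(p_bin), np.log1p(-p_bin)))
    # equation 3.6: -int rho' * eps dt, plus (rho'/rho) * eps at spike times
    dlogp = -np.sum(rho_prime(u) * eps_ij) * DT
    dlogp += np.sum((rho_prime(u)[spikes] / rho(u)[spikes]) * eps_ij[spikes])
    return (log_p + 1.0) * dlogp

t = np.arange(0.0, T, DT)
u = 1.2 * (np.exp(-t / 10.0) - np.exp(-t / 2.5)) / (1.0 - 2.5 / 10.0)  # toy trace
gamma = 0.01
grads = [sampled_entropy_gradient(u, u / 1.2) for _ in range(1000)]
dw = gamma * np.mean(grads)   # equation 3.2: dw = -gamma * dh/dw, estimated
```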
3.1 Numerical Computation. In this section, we describe the procedure for numerically evaluating equation 3.2 via Simpson’s integration (Hennion, 1962). This integration is performed over the set of possible responses Ω_i (see equation 3.3) within the time interval [0 . . . T]. The set Ω_i can be divided into disjoint subsets Ω_i^n, each containing exactly n spikes: Ω_i = ∪_n Ω_i^n.
Using this breakdown,

$$\frac{\partial h_i(F_i^T)}{\partial w_{ij}} = -\int_{\Omega_i} \frac{\partial \log g(\xi)}{\partial w_{ij}}\, g(\xi) \left( \log g(\xi) + 1 \right) d\xi = -\sum_{n=0}^{\infty} \int_{\Omega_i^n} \frac{\partial \log g(\xi)}{\partial w_{ij}}\, g(\xi) \left( \log g(\xi) + 1 \right) d\xi. \tag{3.8}$$
It is illustrative to walk through the alternatives. For n = 0, there is only one response given the input. Assuming the probability of n = 0 spikes is p_0, the n = 0 term of equation 3.8 reads:

$$\left.\frac{\partial h_i(F_i^T)}{\partial w_{ij}}\right|_{n=0} = p_0 \left( \log p_0 + 1 \right) \int_{t=0}^{T} -\rho_i'(t)\, \varepsilon(t \mid f_j)\, dt. \tag{3.9}$$
The probability p_0 is the probability of the neuron not having fired between t = 0 and t = T, given inputs F_i^T resulting in membrane potential u_i(t) and hence probability of firing at time t of ρ(u_i(t)):

$$p_0 = S[0, T] = \exp\left(-\int_{t=0}^{T} \rho(u_i(t))\, dt\right), \tag{3.10}$$
which is equal to the survival function S for a nonhomogeneous Poisson process with probability density ρ(u_i(t)) for t ∈ [0 . . . T]. (We use the inclusive/exclusive notation for S: S(0, T) computes the function excluding the end points; S[0, T] is inclusive.) For n = 1, we must consider all responses containing exactly one output spike: G_i^T = {f_i^1}, f_i^1 ∈ [0, T]. Assuming that neuron i fires only at time f_i^1 with probability p_1(f_i^1), the n = 1 term of equation 3.8 reads
$$\left.\frac{\partial h_i(F_i^T)}{\partial w_{ij}}\right|_{n=1} = \int_{f_i^1=0}^{f_i^1=T} p_1(f_i^1) \left( \log p_1(f_i^1) + 1 \right) \left[ \int_{t=0}^{T} -\rho_i'(t)\, \varepsilon(t \mid f_j, f_i^1)\, dt + \frac{\rho_i'(f_i^1)}{\rho_i(f_i^1)}\, \varepsilon(f_i^1 \mid f_j, f_i^1) \right] df_i^1. \tag{3.11}$$

The probability p_1(f_i^1) is computed as

$$p_1(f_i^1) = S[0, f_i^1)\, \rho_i(f_i^1)\, S(f_i^1, T], \tag{3.12}$$
where the membrane potential now incorporates one reset at t = f_i^1:

$$u_i(t) = \eta(t - f_i^1) + \sum_{j \in \Gamma_i} w_{ij} \sum_{f_j \in F_j^t} \varepsilon(t \mid f_j, f_i^1).$$
For n = 2, we must consider all responses containing exactly two output spikes: G_i^T = {f_i^1, f_i^2} for f_i^1, f_i^2 ∈ [0, T]. Assuming that neuron i fires at f_i^1 and f_i^2 with probability p_2(f_i^1, f_i^2), the n = 2 term of equation 3.8 reads:
$$\left.\frac{\partial h_i(F_i^T)}{\partial w_{ij}}\right|_{n=2} = \int_{f_i^1=0}^{f_i^1=T} \int_{f_i^2=f_i^1}^{f_i^2=T} p_2(f_i^1, f_i^2) \left( \log p_2(f_i^1, f_i^2) + 1 \right) \left[ \int_{t=0}^{T} -\rho_i'(t)\, \varepsilon(t \mid f_j, f_i^1, f_i^2)\, dt + \frac{\rho_i'(f_i^1)}{\rho_i(f_i^1)}\, \varepsilon(f_i^1 \mid f_j, f_i^1, f_i^2) + \frac{\rho_i'(f_i^2)}{\rho_i(f_i^2)}\, \varepsilon(f_i^2 \mid f_j, f_i^1, f_i^2) \right] df_i^1\, df_i^2. \tag{3.13}$$

The probability p_2(f_i^1, f_i^2) can again be expressed in terms of the survival function,

$$p_2(f_i^1, f_i^2) = S[0, f_i^1)\, \rho_i(f_i^1)\, S(f_i^1, f_i^2)\, \rho_i(f_i^2)\, S(f_i^2, T], \tag{3.14}$$
with u_i(t) = η(t − f_i^1) + η(t − f_i^2) + Σ_{j∈Γ_i} w_ij Σ_{f_j∈F_j^t} ε(t | f_j, f_i^1, f_i^2). This procedure can be extended for n > 2 following the pattern above. In our simulation of the STDP experiments, the probability of obtaining zero, one, or two spikes already accounted for 99.9% of all possible responses; adding responses with three spikes (n = 3) brought this figure up to ≈ 99.999%, which is close to the accuracy of our numerical computation. In practice, we found that taking n = 3 into account made no significant contribution to computing Δw, and we did not compute higher-order terms, as the cumulative probability of these responses was below our numerical precision. For the results we present later, we used only terms n ≤ 2; we demonstrate that this is sufficient in appendix A. In this section, we have replaced an integral over possible spike sequences Ω_i with an integral over the times of two output spikes, f_i^1 and f_i^2, which we compute numerically.
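A minimal sketch of these survival-function building blocks follows: p_0 from equation 3.10 and the density p_1(f) from equation 3.12 on a time grid. The post-reset intensities are passed in as a precomputed (hypothetical) array rho_after, and plain rectangle-rule sums stand in for the Simpson quadrature used in the letter.

```python
import numpy as np

def p_zero(rho0, dt):
    """Equation 3.10: probability of no output spike in [0, T]."""
    return np.exp(-np.sum(rho0) * dt)

def p_one(rho0, rho_after, dt):
    """Equation 3.12 on a grid: p1(f) = S[0, f) rho(f) S(f, T].
    rho0: intensity per bin with no postsynaptic spike yet;
    rho_after[k, :]: intensity per bin when a reset occurred in bin k
    (i.e., computed from u with eta(t - f^1) added)."""
    cum0 = np.concatenate(([0.0], np.cumsum(rho0))) * dt   # integral of rho0
    n = rho0.size
    p1 = np.empty(n)
    for k in range(n):
        s_before = np.exp(-cum0[k])                           # survive [0, f)
        s_after = np.exp(-np.sum(rho_after[k, k + 1:]) * dt)  # survive (f, T]
        p1[k] = s_before * rho0[k] * s_after
    return p1   # density over f; p1.sum() * dt approximates P(exactly 1 spike)

# Example with a constant intensity and a reset that halves it afterward:
dt, n = 0.1, 600
rho0 = np.full(n, 0.02)
rho_after = np.tile(rho0 * 0.5, (n, 1))
print(p_zero(rho0, dt), p_one(rho0, rho_after, dt).sum() * dt)
```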
Figure 4: Experimental setup of Zhang et al. (1998).
4 Simulation Methodology

We modeled in detail the experiment of Zhang et al. (1998) involving asynchronous costimulation of convergent inputs. In this experiment, depicted in Figure 4, a postsynaptic neuron is identified that has two neurons projecting to it: one weak (subthreshold) and one strong (suprathreshold). The subthreshold input results in depolarization of the postsynaptic neuron, but the depolarization is not strong enough to cause the postsynaptic neuron to spike. The suprathreshold input is strong enough to induce a spike in the postsynaptic neuron. Plasticity of the synapse between the subthreshold input and the postsynaptic neuron is measured as a function of the timing between the subthreshold and postsynaptic neurons’ spikes (t_pre-post) by varying the intervals between induced spikes in the subthreshold and the suprathreshold inputs (t_pre-pre). This measurement yields the well-known STDP curve (see Figure 1B). In most experimental studies of STDP, the postsynaptic neuron is induced to spike not via a suprathreshold neuron, but rather by depolarizing current injection directly into the postsynaptic neuron. To model experiments that induce spiking via current injection, additional assumptions must be made in the spike response model framework. Because these assumptions are not well established in the literature, we have focused on the synaptic input technique of Zhang et al. (1998). In section 5.1, we propose a method for modeling a depolarizing current injection in the spike response model. The Zhang et al. (1998) experiment imposes four constraints on a simulation: (1) the suprathreshold input alone causes spiking more than 70% of the time; (2) the subthreshold input alone causes spiking less than 10% of the time; (3) synchronous firing of suprathreshold or subthreshold inputs causes LTP if and only if the postsynaptic neuron fires; and (4) the time constants of the excitatory PSPs (EPSPs)—τ_s and τ_m in the sSRM—are
in the range of 1 to 5 ms and 7 to 15 ms, respectively. These constraints remove many free parameters from our simulation. We do not explicitly model the two input cells; instead, we model the EPSPs they produce. The magnitude of these EPSPs is picked to satisfy the experimental constraints: in most simulations, unless reported otherwise, the suprathreshold EPSP (w_supra) alone causes a spike in the postsynaptic neuron on 85% of trials, and the subthreshold EPSP (w_sub) alone causes a spike on fewer than 0.1% of trials. In our principal re-creation of the experiment (see Figure 5), we added normally distributed variation to w_supra and w_sub to simulate the experimental selection process of finding suitable supra-subthreshold input pairs, according to w′_supra = w_supra + N(0, σ_supra) and w′_sub = w_sub + N(0, σ_sub) (we controlled the random variation for conditions outside the specified firing probability ranges). Free parameters of the simulation are ϑ and β in the spike probability function (α can be folded into ϑ) and the magnitude (U_r, U_abs) and time constants (τ_r^s, τ_r^f, δ_r) of the reset. We can further investigate how the results depend on the exact strengths of the subthreshold and suprathreshold EPSPs. The dependent variable of the simulation is t_pre-pre, and we measure the time of the postsynaptic spike to determine t_pre-post. In the experimental protocol, a pair of inputs is repeatedly stimulated at a specific interval t_pre-pre at a low frequency of 0.1 Hz. The weight update for a given t_pre-pre is measured by comparing the size of the EPSC before stimulation and (about) half an hour after stimulation. In terms of our model, this repeated stimulation can be considered as drawing a response ξ from the stochastic conditional response density p(ξ). We estimate the expected weight update for this density p(ξ) for a given t_pre-pre using equation 3.2, approximating the integral by a summation over all time-discretized output responses consisting of 0, 1, or 2 spikes. Note that performing the weight update computation like this implicitly assumes that the synaptic efficacies in the experiment do not change much during repeated stimulation; since long-term synaptic changes require the synthesis of, for example, proteins, this seems a reasonable assumption, also reflected in the half hour or so that the experimentalists wait after stimulation before measuring the new synaptic efficacy.
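The pair-selection step just described can be summarized as rejection sampling around the base efficacies. In the sketch below, the firing-probability test is a crude stand-in (a simple weight window), since the real criterion requires simulating the sSRM for each candidate pair; the base weights W_SUPRA and W_SUB and the acceptance bounds are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
W_SUPRA, W_SUB = 2.0, 0.5                          # illustrative base efficacies
SIG_SUPRA, SIG_SUB = 0.33 * W_SUPRA, 0.10 * W_SUB  # perturbations of Figure 5B

def satisfies_criteria(w_supra, w_sub):
    """Stand-in for the experimental selection test: the suprathreshold
    input alone should fire the target on >70% of trials, the subthreshold
    input on <10%. A full check would simulate the sSRM per candidate."""
    return (w_supra > 0.6 * W_SUPRA) and (0.0 < w_sub < 2.0 * W_SUB)

def draw_input_pair():
    """Perturbed supra/subthreshold weights for one pre-pre pairing, with
    rejection of draws outside the specified firing-probability ranges."""
    while True:
        pair = (W_SUPRA + rng.normal(0.0, SIG_SUPRA),
                W_SUB + rng.normal(0.0, SIG_SUB))
        if satisfies_criteria(*pair):
            return pair

pairs = [draw_input_pair() for _ in range(20)]  # one pair per t_pre-pre point
```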
5 Results

Figure 5A shows an STDP curve produced by the model, obtained by plotting the estimated weight update of equation 3.2 against t_pre-post for fixed supra- and subthreshold inputs. Specifically, we vary the difference in time between subthreshold and suprathreshold inputs (a pre-pre pair), and we compute the expected gradient for the subthreshold input w_sub over all responses of the postsynaptic neuron via equation 3.2. We thus obtain a value for Δw for each t_pre-pre data point; we then compute Δw_sub(%) as the
Figure 5: (A) STDP: experimental data (triangles) and model fit (solid line). (B) Added simulation data points with perturbed weights w′_supra and w′_sub (crosses). STDP data redrawn from Zhang et al. (1998). Model parameters: τ_s = 2.5 ms, τ_m = 10 ms; sub- and suprathreshold weight perturbation in B: σ_supra = 0.33 w_supra, σ_sub = 0.1 w_sub. (C) Model fit compared to previous generative models (Chechik, 2003, short dashed line; Toyoizumi et al., 2005, long dashed line; data points and curves were obtained by digitizing figures in the original papers). Free parameters of Chechik (2003) and Toyoizumi et al. (2005) were fit to the experimental data as described in appendix B.
relative percentage change of synaptic efficacy: Δw_sub(%) = Δw/w_sub × 100%.¹ For each t_pre-pre, the corresponding value t_pre-post is determined by calculating for each input pair the average time at which the postsynaptic neuron fires relative to the subthreshold input. Together, this results in a set of (t_pre-post, Δw_sub(%)) data points.

¹ We set the global learning rate γ in equation 3.2 such that the simulation curve is scaled to match the neurophysiological results. In all other experiments where we use the relative percentage change Δw_sub(%), the same value for γ is used.
The continuous graph in Figure 5A is obtained by repeating this procedure for fixed supra- and subthreshold weights and connecting the resultant points. In Figure 5B, the supra- and subthreshold weights in the simulation are randomly perturbed for each pre-pre pair, to simulate the fact that in the experiment, different pairs of neurons are selected for each pre-pre pair, leading inevitably to variation in the synaptic strengths. Mild variation of the input weights yields the “scattering” data points of the relative weight changes similar to the experimentally observed data. Clearly, the mild variation we apply is small relative to the observed in vivo distributions of synaptic weights in the brain (e.g., Song, Sjöström, Reigl, Nelson, & Chklovskii, 2005). However, Zhang et al. (1998) did not sample randomly from synapses in the brain but rather selected synapses that had a particularly narrow range of initial EPSPs to satisfy the criteria for “supra-” and “subthreshold” synapses (see also section 4). Hence, the experimental variance was particularly small (see Figure 1e of Zhang et al., 1998), and our variation of the size of the EPSP is in line with the observed variations in the experimental results of Zhang et al. (1998). The model produces a good quantitative fit to the experimental data points (triangles), especially compared to other related work as discussed in section 1, and robustly obtains the typical LTP and LTD time windows associated with STDP. In Figure 5C, we show our model fit compared to the models of Toyoizumi et al. (2005) and Chechik (2003). Our model obtained the lowest sum squared error (1.25 versus 1.63 and 3.27,² respectively; see appendix B for methods)—this despite the lack of data in the region t_pre-post = 0, . . . , 10 ms in the Zhang et al. (1998) experiment, where the difference in LTD behavior is most pronounced. The qualitative shape of the STDP curve is robust to settings of the spiking neuron model’s parameters, as we will illustrate shortly. Additionally, we found that the type of spike probability function ρ (exponential, sigmoidal, or linear) is not critical. Our model accounts for an additional finding that has not been explained by alternative theories: the relative magnitude of LTP decreases as the efficacy of the synapse between the subthreshold input and the postsynaptic target neuron increases; in contrast, LTD remains roughly constant (Bi & Poo, 1998). Figure 6A shows this effect in the experiment of Bi and Poo (1998), and Figure 6B shows the corresponding result from our model. We compute the magnitude of LTP and LTD for the peak modulation (i.e., t_pre-post = −5 for LTP and t_pre-post = +5 for LTD) as the amplitude of

² Of note for this comparison is that our spiking neuron model uses a more sophisticated difference of exponentials (see equation 2.4) to describe the EPSP, whereas the spiking neuron models in Toyoizumi et al. (2005) and Chechik (2003) use a single exponential. These other models might be improved using the more sophisticated EPSP function.
Figure 6: Dependence of LTP and LTD magnitude on efficacy of the subthreshold input. (A) Experimental data redrawn from Bi and Poo (1998). (B) Simulation result.
the subthreshold EPSP is increased. The model’s explanation for this phenomenon is simple: as the synaptic weight increases, its effect saturates, and a small change to the weight does little to alter its influence. Consequently, the gradient of the entropy with respect to the weight goes toward zero. Similar saturation effects are observed in gradient-based learning methods with nonlinear response functions such as backpropagation. As we mentioned earlier, other theories have had difficulty reproducing the typical shape of the LTD component of STDP. In Chechik (2003), the shape is predicted to be near uniform, and in Toyoizumi et al. (2005), the shape depends on the autocorrelation. In our stochastic spike response model, this component arises due to the stochastic variation in the neural response: in the specific STDP experiment, reduction of variability is achieved by reducing the probability of multiple output spikes. To argue for this conclusion, we performed simulations that make our neuron model less variable in various ways, and each of these manipulations results in a reduction in the LTD component of STDP. In Figures 7A and 7B, we make the threshold more deterministic by increasing the values of α and β in the spike probability density function. In Figure 7C, we increase the magnitude of the refractory response η, which will prevent spikes following the initial postsynaptic response. And finally, in Figure 7D, we increase the efficacy of the suprathreshold input, which prevents the postsynaptic neuron’s potential from hovering in the region where the stochasticity of the threshold can induce a spike. Modulation of all of these variables makes the threshold more deterministic and decreases LTD relative to LTP. Our simulation results are robust to biologically realizable variation in the parameters of the sSRM model. For example, time constants of the EPSPs can be varied with no qualitative effect on the STDP curves.
Figure 7: Dependence of relative LTP and LTD on (A) the parameter α of the stochastic threshold function, (B) the parameter β of the stochastic threshold function, (C) the magnitude of refraction, η, and (D) efficacy of the suprathreshold synapse, expressed as p(fire|supra), the probability that the postsynaptic neuron will fire when receiving only the suprathreshold input. Larger values of p(fire|supra) correspond to a weaker suprathreshold synapse. In all graphs, the weight gradient for individual curves is normalized to peak LTP for comparison purposes.
Figures 8A and 8B show the effect of manipulating the membrane potential decay time τ_m and the EPSP rise time τ_s, respectively. Note that manipulation of these time constants does predict a systematic effect on STDP curves. Increasing τ_m increases the duration of both the LTP and LTD windows, whereas decreasing τ_s leads to a faster transition from LTP to LTD. Both predictions could be tested experimentally by correlating time constants of individual neurons studied with the time course of their STDP curves.

5.1 Current Injection. We mentioned earlier that in many STDP experiments, an action potential is induced in the postsynaptic neuron not via a suprathreshold presynaptic input, but via a depolarizing current injection. In order to model experiments using current injection, we must
Figure 8: Influence of time constants of the sSRM model on the shape of the STDP curve: (A) varying the membrane potential time-constant τm and (B) varying the EPSP rise time constant τs . In both figures, the magnitude of LTP and LTD has been normalized to 1 for each curve to allow for easy examination of the effect of the manipulation on temporal characteristics of the STDP curves.
characterize the current function and its effect on the postsynaptic neuron. In this section, we make such a proposal framed in terms of the spike response model and report simulation results using current injection. We model the injected current I(t) as a rectangular step function,

$$I(t) = I_c\, H(t - f_I)\, H\left([f_I + \Delta_I] - t\right), \tag{5.1}$$
where the current of magnitude I_c is switched on at t = f_I and off at t = f_I + Δ_I. In the Zhang et al. (1998) experiment, Δ_I is 2 ms, a value we adopted for our simulations as well. The resulting postsynaptic potential ε_c is
$$\varepsilon_c(t) = \int_0^t \exp\left(-\frac{t - s}{\tau_m}\right) I(s)\, ds. \tag{5.2}$$
In the absence of postsynaptic firing, the membrane potential of an integrate-and-fire neuron in response to a step current is (Gerstner, 2001):

$$\varepsilon_c(t \mid f_I) = I_c \left(1 - \exp[-(t - f_I)/\tau_m]\right). \tag{5.3}$$
In the presence of postsynaptic firing at time f̂_i, we assume—as we did previously in equation 2.5—a reset and subsequent integration of the residual
Figure 9: Voltage response of a spiking neuron for a 2 ms current injection in the spike response model. Solid curve: The postsynaptic neuron produces no spike, and the potential due to the injected current decays with the membrane time constant τm . Dotted curve: The postsynaptic neuron spikes while the current is still being applied. Dashed curve: The postsynaptic neuron spikes after application of the current has terminated (moment of postsynaptic spiking indicated by arrows).
current:
$$\varepsilon_c(t \mid \hat{f}_i) = H(\hat{f}_i - t) \int_0^t \exp\left(-\frac{t - s}{\tau_m}\right) I(s)\, ds + H(t - \hat{f}_i) \int_{\hat{f}_i}^{t} \exp\left(-\frac{t - s}{\tau_m}\right) I(s)\, ds. \tag{5.4}$$
These ε_c kernels are depicted in Figure 9 for a postsynaptic spike occurring at various times f̂_i. In our simulations, we chose the current magnitude I_c to be large enough to elicit spiking of the target neuron with probability greater than 0.7. Figure 10A shows the STDP curve obtained using the current injection model, for the exact same model parameter settings used to produce the result based on a suprathreshold synaptic input (depicted in Figure 5A), superimposed on the experimental STDP data obtained by depolarizing current injection from Zhang et al. (1998). Figure 10B additionally superimposes the earlier result on the current injection result, and the two curves are difficult to distinguish. As in the earlier result, variation of model parameters has little appreciable effect on the model’s behavior using the current injection paradigm, suggesting that current injection versus synaptic input makes little difference to the nature of STDP.
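Since equations 5.1 to 5.4 reduce to a one-dimensional integral, the ε_c kernel can be evaluated by direct quadrature. The sketch below reproduces the three cases of Figure 9; the magnitude I_c and its normalization are illustrative (the text leaves the scaling of equation 5.3 loose).

```python
import numpy as np

TAU_M = 10.0    # membrane time constant (ms)
DELTA_I = 2.0   # pulse duration (ms), as in Zhang et al. (1998)

def eps_c(t, f_I, f_hat_i=None, i_c=1.0, ds=0.01):
    """Equations 5.1-5.4 by direct quadrature: potential due to a
    rectangular current pulse of magnitude i_c on [f_I, f_I + DELTA_I],
    with an optional reset (residual-current reintegration) at f_hat_i."""
    lo = 0.0 if (f_hat_i is None or t < f_hat_i) else f_hat_i
    s = np.arange(lo, t, ds)
    current = i_c * ((s >= f_I) & (s < f_I + DELTA_I))       # equation 5.1
    return np.sum(np.exp(-(t - s) / TAU_M) * current) * ds   # equations 5.2/5.4

# The three cases of Figure 9: no spike, spike during, and spike after the pulse.
for f_hat in (None, 1.0, 4.0):
    print(f_hat, round(eps_c(6.0, f_I=0.0, f_hat_i=f_hat), 4))
```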
Figure 10: (A) STDP curve obtained for SRM with current injection (solid curve) compared with experimental data for depolarizing current injection (circles; redrawn from Zhang et al., 1998). (B) Comparing STDP curves for both current injection (solid curve) and suprathreshold input (dashed curve) models. The same model parameters are used for both curves. Experimental data redrawn from Zhang et al. (1998) for current injection (circles) and suprathreshold input (triangles) paradigms are superimposed.
6 Discussion

In this letter, we explored a fundamental computational principle: that synapses adapt so as to minimize the variability of a neuron’s response in the face of noisy inputs, yielding more reliable neural representations. From this principle, instantiated as entropy minimization, we derived the STDP learning curve. Importantly, the simulation methodology we used to derive the curve closely follows the procedure used in neurophysiological experiments (Zhang et al., 1998): assuming variation in sub- and suprathreshold synaptic efficacies from experimental pair to pair even recovers the noisy scattering of efficacy changes. Our simulations furthermore obtain an STDP curve that is robust to model parameters and details of the noise distribution. Our results are critically dependent on the use of Gerstner’s stochastic spike response model, whose dynamics are a good approximation to those of a biological spiking neuron. The sSRM has the virtue of being characterized by parameters that are readily related to neural dynamics, and its dynamics are differentiable such that we can derive a gradient-descent learning rule that minimizes the response variability of a postsynaptic neuron given a particular set of input spikes. Our model predicts the shape of the STDP curve and how it relates to properties of a neuron’s response function. These predictions may be empirically testable if a diverse population of cells can be studied. The predictions include the following. First, the width of the LTD and LTP windows depends on the (excitatory) PSP time constants (see Figures 8A
and 8B). Second, the strength of LTD relative to LTP depends on the degree of noise in the neuron’s response; the LTD strength is related to the noise level. Our model can also characterize the nature of the learning curve for experimental situations that deviate from the boundary conditions of Zhang et al. (1998). In Zhang et al., the subthreshold and suprathreshold inputs produced postsynaptic firing with probability less than .10 and greater than .70, respectively. Our model can predict the consequences of violating these conditions. For example, when the subthreshold input is very strong or the suprathreshold input is very weak, our model produces strictly LTD, that is, anti-Hebbian learning. The consequence of a strong subthreshold input is shown in Figure 6B, and the consequence of a weak suprathreshold input is shown in Figure 7D. Intuitively, this simulation result makes sense because—in the first case—the most likely alternative response of the postsynaptic neuron is to produce more than one spike, and—in the second case—the most likely alternative response is no postsynaptic spike at all. In both cases, synaptic depression reduces the probability of the alternative response. We note that such strictly anti-Hebbian learning has been reported in relation to STDP-type experiments (Roberts & Bell, 2002). For very noisy thresholds and for weak suprathreshold inputs, our model produces an LTD dip before LTP (see Figure 7D). This dip is in fact also present in the work of Chechik (2003). We find it intriguing that this dip is also observed in the experimental results of Nishiyama et al. (2000). The explanation for this dip may be along the same lines as the explanation for the LTD window: given the very noisy threshold, the subthreshold input may occasionally cause spiking, and decreasing its weight would decrease response variability. This may not be offset by the increase due to its contribution to the spike caused by the suprathreshold input, as it is too early to have much influence. With careful consideration of experimental conditions and neuron parameters, it may be possible to reconcile the somewhat discrepant STDP curves obtained in the literature using our model. In our model, the transition from LTP to LTD occurs at a slight offset from t_pre-post = 0: if the subthreshold input fires 1 to 2 ms before the postsynaptic neuron fires (on average), then neither potentiation nor depression occurs. This offset of 1 to 2 ms is attributable to the current decay time constant, τ_s. The neurophysiological data are not sufficiently precise to determine the exact offset of the LTP-LTD transition in real neurons. Unfortunately, few experimental data points are recorded near t_pre-post = 0. However, the STDP curve of our model does pass through the one data point in that region (see Figure 5A), so the offset may be a real phenomenon. The main focus of the simulations in this letter was to replicate the experimental paradigm of Zhang et al. (1998), in which a suprathreshold presynaptic neuron is used to induce the postsynaptic neuron to fire. The Zhang et al. (1998) study is exceptional in that most other experimental studies of STDP use a depolarizing current injection to induce the
postsynaptic neuron to fire. We are not aware of any established model for current injection within the SRM framework; we therefore proposed one in section 5.1. The proposed model is an idealized abstraction of current injection that does not take into account effects like the current onset and offset fluctuations inherent in such experimental methods. Even with these limitations in mind, the current injection model produced STDP curves very similar to the ones obtained by simulating suprathreshold-input-induced postsynaptic firing.

The simulations reported in this letter account for classical STDP experiments in which a single presynaptic spike is paired with a single postsynaptic spike. The same methodology can be applied to model experimental paradigms involving multiple presynaptic or postsynaptic spikes, or both, although the computation involved becomes nontrivial. We are currently engaged in modeling data from the multispike experiments of Froemke and Dan (2002).

We note that one set of simulation results we reported is particularly pertinent for comparing and contrasting our model to the related model of Toyoizumi et al. (2005). The simulations reported in Figure 7 suggest that noise in our model is critical for obtaining the LTD component of STDP and that parameters that reduce noise in the neural response also reduce LTD. We found that increasing the strength of neuronal refraction reduces response variability and therefore diminishes the LTD component of STDP. This notion is also put forward in very recent work by Pfister, Toyoizumi, Barber, and Gerstner (2006), where an STDP-like rule arises from a supervised learning procedure that aims to obtain spikes at times specified by a teacher; the LTD component in that work also depends on the probability of stochastic activity. In sharp contrast, Toyoizumi et al. (2005) suggest that neuronal refraction is responsible for LTD. Because the two models are quite similar, it seems unlikely that they genuinely make opposite predictions; the discrepancy may instead be due to Toyoizumi et al.'s reliance on analytical approximations to solve the mathematical problem at hand, which limits the validity of comparisons between that model and biological experiments.

It is useful to reflect on the philosophy of choosing the reduction of spike train variability as a target function, as it so obviously has the degenerate but energy-efficient solution of emitting no spikes at all. The usefulness of our approach clearly relies on the stochastic gradient reaching a local optimum in the likelihood space that does not always correspond to the degenerate solution. We compute the gradient of the input weights with respect to the conditionally independent sequence of response intervals $[t, t + \Delta t]$. The gradient approach tries to push the probability of the responses in these intervals to either 0 or 1, irrespective of what the response is (not firing or firing). We find that in the sSRM spiking neuron model, this gradient can be toward either state of each response interval, which can be attributed
to the monotonically increasing spike probability density as a function of the membrane potential. This spike probability density allows neurons to become very reliable by firing spikes only at specific times, at least when starting from a set of input weights that, given the input pattern, is likely to induce a spike in the postsynaptic neuron. The fact that the target function is the reduction of postsynaptic spike train variability does predict that, for small inputs impinging on a postsynaptic target and causing only occasional firing, the average weight update would reduce these inputs to zero.

We have modeled the experimental studies in some detail, beyond the level of detail achieved by other researchers investigating STDP. Even a model with an entirely heuristic learning rule has value if it obtains a better fit to the data than other models of similar complexity. Our model has a learning rule that goes beyond heuristic: the learning rule is derived from a computational objective. To some, this objective may not be as exciting as more elaborate objectives like information maximization. As it is, our model stands alone among the contenders in providing a first-principles account of STDP that fits experimental data extremely well. Might there be a mathematically sexier model? We certainly hope so, but it has not yet been discovered.

We reiterate the point that our learning objective is viewed as but one of many objectives operating in parallel. The question remains as to why neurons would respond in such a highly variable way to fixed input spike trains: a more deterministic threshold would eliminate the need for any minimization of response variability. We can only speculate that the variability in neuronal responses may also well serve these other objectives, such as exploitation or exploration in reinforcement learning or the exploitation of stochastic resonance phenomena (e.g., Hahnloser, Sarpeshkar, Mahowald, Douglas, & Seung, 2000).

It is interesting to note that minimization of conditional response variability corresponds to one part of the equation that maximizes mutual information. The mutual information I between inputs X and outputs Y is defined as I(X, Y) = H(Y) − H(Y|X). Hence, minimizing the conditional entropy H(Y|X) (our objective), along with the secondary unsupervised objective of maximizing the marginal entropy H(Y), maximizes the mutual information. The former objective, maximizing H(Y), is notoriously hard to compute (e.g., see Bell & Parra, 2005, for an extensive discussion), whereas, as we have shown, the latter objective, conditional entropy minimization, can be computed relatively easily via stochastic gradient descent. Indeed, in this light, it
is a virtue of our model that we account for the experimental data with only one component of the mutual information objective (taking the responses in the experimental conditions as the set of responses Y). The experiments that uncovered STDP are relatively simple in nature and lack any interaction with other (output) neurons, and we may speculate that STDP is the degenerate reflection of information maximization in the absence of such interactions. If subsequent work shows that STDP can be explained by mutual information maximization (without the drawbacks of existing work, such as the rate-based treatment of Chechik, 2003, or the unrealistic autocorrelation function and the difficulty of relating to biological parameters of Toyoizumi et al., 2005), our work will have helped to tease apart the components of the objective that are necessary and sufficient for explaining the data.

Appendix A: Higher-Order Spike Probabilities

To compute $\Delta w$, we stop at n = 2 because, in the experimental conditions that we model, the contribution of n > 2 spikes is vanishingly small. We find that the probability of three spikes occurring is typically $< 10^{-5}$, and the n = 3 term does not contribute significantly, as shown, for example, in Figure 11.

Figure 11: STDP graphs for $\Delta w$ computed using the terms n ≤ 2 (solid line) and n ≤ 3 (crossed solid line).

Intuitively, it seems implausible that the gradient of the conditional response entropy is dominated by terms corresponding to highly improbable events. This could be the case only if the gradient on the probability of getting three or more spikes were much larger than the gradient on getting, say, two spikes.
Given the setup of the model, with an increasing probability of firing a spike as a function of the membrane potential, it is easy to see that changing a weight will change the probability of obtaining two spikes much more than the probability of obtaining three spikes. Hence, the entropy gradient from the components n ≤ 2 will be (in practice, much) larger than the gradient for the terms n = 3, 4, . . .. As we remarked before, in our simulation of the experimental setup, the probability of obtaining three spikes given the input was computed to be $< 10^{-5}$; the overall probability was computed to be at most $10^{-6}$. The probability of n = 4 was below the precision of our simulation.

Appendix B: Sum-Squared-Error Parameter Fitting

To compare the different STDP models in section 5, we use linear regression in the free parameters to minimize the sum-squared error between the model curves and the experimental data. For m experimental data points $\{(\Delta t_1, \Delta w_1), \ldots, (\Delta t_m, \Delta w_m)\}$ and model curve $\Delta w = f(\Delta t)$, we report for each model curve the sum-squared error $E^2$ for those values of the free parameters that minimize it:

$$E^2 = \min_{\text{free parameters}} \sum_{i=1}^{m} \left(\Delta w_i - f(\Delta t_i)\right)^2.$$
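This minimization is an ordinary linear least-squares problem whenever the free parameters enter the model curve linearly, as the scaling parameter γ does in our model. The following is a minimal sketch of such a fit (in Python; the data values and the stand-in model curve are hypothetical placeholders, not the authors' actual data or code):

import numpy as np

def fit_sse(dt, dw, basis):
    # Least-squares fit of linear coefficients; returns (coefficients, E^2).
    # dt, dw: arrays of experimental (Delta t_i, Delta w_i) pairs.
    # basis: function mapping dt to an (m x p) design matrix, one column
    # per free parameter that enters the model curve linearly.
    X = basis(dt)
    coef, *_ = np.linalg.lstsq(X, dw, rcond=None)  # minimizes the sum-squared error
    resid = dw - X @ coef
    return coef, float(resid @ resid)

# Example: a one-parameter model Delta w = gamma * f(Delta t), where f is a
# hypothetical stand-in for the precomputed gradient curve.
f = lambda dt: np.sign(-dt) * np.exp(-np.abs(dt) / 10.0)
dt = np.array([-20.0, -10.0, -2.0, 2.0, 10.0, 20.0])   # ms, hypothetical
dw = np.array([0.5, 0.7, 0.9, -0.4, -0.2, -0.1])       # hypothetical data
gamma, E2 = fit_sse(dt, dw, lambda t: f(t)[:, None])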
For our model, linear regression is performed on the scaling parameter γ in equation 3.2, which relates the gradient obtained with the stated model parameters to the weight change. Where possible for the other models, we set model parameters to correspond to the values observed in the experimental conditions described in section 4.

For the model by Chechik (2003), the weight update is computed as the sum of a positive exponential and a negative damping contribution,

$$\Delta w = \gamma\, H(-\Delta t)\, e^{\Delta t/\tau} - H(\Delta t + \Lambda)\, H(\Lambda - \Delta t)\, K,$$

where $\Delta t$ is computed as $t_{pre} - t_{post}$, K denotes the negative damping contribution that is applied over a time window $\Lambda$ before and after the postsynaptic spike, and $H(\cdot)$ denotes the Heaviside function. The time constant τ is related to the decay of the EPSP, and we set it to the same value we use for our model: 10 ms. Linear regression to find the minimal sum-squared error is performed on the free parameters γ, K, and $\Lambda$.
In Toyoizumi et al. (2005), the learning rule is the sum of two terms,

$$\Delta w = \gamma \left[\, \epsilon^2(\Delta t) - \mu_0\, (\phi^2 \!*\! \epsilon)(\Delta t) \,\right],$$

where $\epsilon(t)$ is the EPSP, modeled as $\epsilon(t) \propto \exp(-t/\tau)$, and $\mu_0 (\phi^2 \!*\! \epsilon)(\Delta t)$ is a function of the autocorrelation function of the neuron, $(\phi^2 \!*\! \epsilon)(\Delta t)$, times the spontaneous neural activity in the absence of input, $\mu_0$. The EPSP decay time constant used in Toyoizumi et al. was already set to τ = 10 ms, and for the two terms in the sum we used the functions described by Figures 2A and 2B in Toyoizumi et al. We performed linear regression on the one free parameter, γ. Note that for this model, we obtain better LTD, and hence a better $E^2$, for larger values of $\mu_0$ than those used in Toyoizumi et al. (2005). Even then, however, $E^2$ remains worse than for the other two models, and the spontaneous neural activity becomes unrealistically large.

Acknowledgments

We thank Tony Bell, Lucas Parra, and Gary Cottrell for insightful comments and encouragement. We also thank the anonymous reviewers for constructive feedback, which allowed us to improve the quality of our work and this article. The work of S.M.B. was supported by the Netherlands Organization for Scientific Research (NWO), TALENT S-62 588 and VENI 639.021.203. The work of M.C.M. was supported by National Science Foundation grants BCS 0339103 and CSE-SMA 0509521.

References

Abbott, L., & Gerstner, W. (2004). Homeostasis and learning through spike-timing dependent plasticity. In D. Hansel, C. Chow, B. Gutkin, & C. Meunier (Eds.), Methods and models in neurophysics: Proceedings of the Les Houches Summer School 2003. Amsterdam: Elsevier.

Bell, C. C., Han, V. Z., Sugawara, Y., & Grant, K. (1997). Synaptic plasticity in a cerebellum-like structure depends on temporal order. Nature, 387, 278–281.

Bell, A., & Parra, L. (2005). Maximising sensitivity in a spiking network. In L. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing systems, 17 (pp. 121–128). Cambridge, MA: MIT Press.

Bi, G.-Q. (2002). Spatiotemporal specificity of synaptic plasticity: Cellular rules and mechanisms. Biol. Cybern., 87, 319–332.

Bi, G.-Q., & Poo, M.-M. (1998). Synaptic modifications in cultured hippocampal neurons: Dependence on spike timing, synaptic strength, and postsynaptic cell type. J. Neurosci., 18(24), 10464–10472.

Bi, G.-Q., & Poo, M.-M. (2001). Synaptic modification by correlated activity: Hebb's postulate revisited. Ann. Rev. Neurosci., 24, 139–166.
Burkitt, A., Meffin, H., & Grayden, D. (2004). Spike timing-dependent plasticity: The relationship to rate-based learning for models with weight dynamics determined by a stable fixed-point. Neural Computation, 16(5), 885–940.

Chechik, G. (2003). Spike-timing-dependent plasticity and relevant mutual information maximization. Neural Computation, 15, 1481–1510.

Dan, Y., & Poo, M.-M. (2004). Spike timing-dependent plasticity of neural circuits. Neuron, 44, 23–30.

Dayan, P. (2002). Matters temporal. Trends in Cognitive Sciences, 6(3), 105–106.

Dayan, P., & Häusser, M. (2004). Plasticity kernels and temporal statistics. In S. Thrun, L. Saul, & B. Schölkopf (Eds.), Advances in neural information processing systems, 16. Cambridge, MA: MIT Press.

Debanne, D., Gähwiler, B., & Thompson, S. (1998). Long-term synaptic plasticity between pairs of individual CA3 pyramidal cells in rat hippocampal slice cultures. J. Physiol., 507, 237–247.

Feldman, D. (2000). Timing-based LTP and LTD at vertical inputs to layer II/III pyramidal cells in rat barrel cortex. Neuron, 27, 45–56.

Froemke, R., & Dan, Y. (2002). Spike-timing-dependent synaptic modification induced by natural spike trains. Nature, 416, 433–438.

Gerstner, W. (2001). A framework for spiking neuron models: The spike response model. In F. Moss & S. Gielen (Eds.), The handbook of biological physics (Vol. 4, pp. 469–516). Amsterdam: Elsevier.

Gerstner, W., Kempter, R., van Hemmen, J. L., & Wagner, H. (1996). A neural learning rule for sub-millisecond temporal coding. Nature, 383, 76–78.

Gerstner, W., & Kistler, W. (2002). Spiking neuron models. Cambridge: Cambridge University Press.

Hahnloser, R. H. R., Sarpeshkar, R., Mahowald, M. A., Douglas, R. J., & Seung, H. S. (2000). Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit. Nature, 405, 947–951.

Hennion, P. E. (1962). Algorithm 84: Simpson's integration. Communications of the ACM, 5(4), 208.

Herrmann, A., & Gerstner, W. (2001). Noise and the PSTH response to current transients: I. General theory and application to the integrate-and-fire neuron. J. Comp. Neurosci., 11, 135–151.

Hopfield, J., & Brody, C. (2004). Learning rules and network repair in spike-timing-based computation. PNAS, 101(1), 337–342.

Izhikevich, E., & Desai, N. (2003). Relating STDP to BCM. Neural Computation, 15, 1511–1523.

Jolivet, R., Lewis, T., & Gerstner, W. (2003). The spike response model: A framework to predict neuronal spike trains. In O. Kaynak, E. Alpaydin, E. Oja, & L. Yu (Eds.), Proc. Joint International Conference ICANN/ICONIP 2003 (pp. 846–853). Berlin: Springer.

Kandel, E. R., Schwartz, J., & Jessell, T. M. (2000). Principles of neural science. New York: McGraw-Hill.

Karmarkar, U., Najarian, M., & Buonomano, D. (2002). Mechanisms and significance of spike-timing dependent plasticity. Biol. Cybern., 87, 373–382.

Kempter, R., Gerstner, W., & van Hemmen, J. (1999). Hebbian learning and spiking neurons. Phys. Rev. E, 59(4), 4498–4514.
Kempter, R., Gerstner, W., & van Hemmen, J. (2001). Intrinsic stabilization of output rates by spike-based Hebbian learning. Neural Computation, 13, 2709–2742.

Kepecs, A., van Rossum, M., Song, S., & Tegner, J. (2002). Spike-timing-dependent plasticity: Common themes and divergent vistas. Biol. Cybern., 87, 446–458.

Legenstein, R., Naeger, C., & Maass, W. (2005). What can a neuron learn with spike-timing-dependent plasticity? Neural Computation, 17, 2337–2382.

Linsker, R. (1989). How to generate ordered maps by maximizing the mutual information between input and output signals. Neural Computation, 1, 402–411.

Markram, H., Lübke, J., Frotscher, M., & Sakmann, B. (1997). Regulation of synaptic efficacy by coincidence of postsynaptic APs and EPSPs. Science, 275, 213–215.

Nishiyama, M., Hong, K., Mikoshiba, K., Poo, M.-M., & Kato, K. (2000). Calcium stores regulate the polarity and input specificity of synaptic modification. Nature, 408, 584–588.

Paninski, L., Pillow, J., & Simoncelli, E. (2005). Comparing integrate-and-fire models estimated using intracellular and extracellular data. Neurocomputing, 65–66, 379–385.

Pfister, J.-P., Toyoizumi, T., Barber, D., & Gerstner, W. (2006). Optimal spike-timing dependent plasticity for precise action potential firing. Neural Computation, 18, 1318–1348.

Porr, B., & Wörgötter, F. (2003). Isotropic sequence order learning. Neural Computation, 15(4), 831–864.

Rao, R., & Sejnowski, T. (1999). Predictive sequence learning in recurrent neocortical circuits. In S. A. Solla, T. K. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems, 12 (pp. 164–170). Cambridge, MA: MIT Press.

Rao, R., & Sejnowski, T. (2001). Spike-timing-dependent plasticity as temporal difference learning. Neural Computation, 13, 2221–2237.

Roberts, P., & Bell, C. (2002). Spike timing dependent synaptic plasticity in biological systems. Biol. Cybern., 87, 392–403.

Saudargiene, A., Porr, B., & Wörgötter, F. (2004). How the shape of pre- and postsynaptic signals can influence STDP: A biophysical model. Neural Computation, 16, 595–625.

Senn, W., Markram, H., & Tsodyks, M. (2000). An algorithm for modifying neurotransmitter release probability based on pre- and postsynaptic spike timing. Neural Computation, 13, 35–67.

Shon, A., Rao, R., & Sejnowski, T. (2004). Motion detection and prediction through spike-timing dependent plasticity. Network: Comput. Neural Syst., 15, 179–198.

Sjöström, P., Turrigiano, G., & Nelson, S. (2001). Rate, timing, and cooperativity jointly determine cortical synaptic plasticity. Neuron, 32, 1149–1164.

Song, S., Miller, K., & Abbott, L. (2000). Competitive Hebbian learning through spike-timing-dependent synaptic plasticity. Nature Neuroscience, 3, 919–926.

Song, S., Sjöström, P. J., Reigl, M., Nelson, S., & Chklovskii, D. B. (2005). Highly nonrandom features of synaptic connectivity in local cortical circuits. PLoS Biology, 3(3), e68.

Toyoizumi, T., Pfister, J.-P., Aihara, K., & Gerstner, W. (2005). Spike-timing dependent plasticity and mutual information maximization for a spiking neuron model. In L. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing systems, 17 (pp. 1409–1416). Cambridge, MA: MIT Press.
van Rossum, M., Bi, G.-Q., & Turrigiano, G. (2000). Stable Hebbian learning from spike time dependent plasticity. J. Neurosci., 20, 8812–8821.

Xie, X., & Seung, H. (2004). Learning in neural networks by reinforcement of irregular spiking. Physical Review E, 69, 041909.

Zhang, L., Tao, H., Holt, C., Harris, W., & Poo, M.-M. (1998). A critical window for cooperation and competition among developing retinotectal synapses. Nature, 395, 37–44.
Received April 6, 2005; accepted June 19, 2006.
LETTER
Communicated by Alexandre Pouget
Fast Population Coding

Quentin J. M. Huys
[email protected]
Gatsby Computational Neuroscience Unit, University College London, London WC1N 3AR, U.K.
Richard S. Zemel [email protected] Department of Computer Science, University of Toronto, Toronto, Ontario, Canada M5S 3H5
Rama Natarajan [email protected] Department of Computer Science, University of Toronto, Toronto, Ontario, Canada M5S 3H5
Peter Dayan [email protected] Gatsby Computational Neuroscience Unit, University College London, London WC1N 3AR, U.K.
Uncertainty coming from the noise in its neurons and the ill-posed nature of many tasks plagues neural computations. Maybe surprisingly, many studies show that the brain manipulates these forms of uncertainty in a probabilistically consistent and normative manner, and there is now a rich theoretical literature on the capabilities of populations of neurons to implement computations in the face of uncertainty. However, one major facet of uncertainty has received comparatively little attention: time. In a dynamic, rapidly changing world, data are only temporarily relevant. Here, we analyze the computational consequences of encoding stimulus trajectories in populations of neurons. For the most obvious, simple, instantaneous encoder, the correlations induced by natural, smooth stimuli engender a decoder that requires access to information that is nonlocal both in time and across neurons. This formally amounts to a ruinous representation. We show that there is an alternative encoder that is computationally and representationally powerful in which each spike contributes independent information; it is independently decodable, in other words. We suggest this as an appropriate foundation for understanding time-varying population codes. Furthermore, we show how adaptation to temporal stimulus statistics emerges directly from the demands of simple decoding.

Neural Computation 19, 404–441 (2007)

© 2007 Massachusetts Institute of Technology
1 Introduction

From the earliest neurophysiological investigations in the cortex, it became apparent that sensory and motor information is represented in the joint activity of large populations of neurons (Barlow, 1953; Georgopoulos, Schwartz, & Kettner, 1983). There are by now substantial ideas and data about how these representations are formed (Rao, Olshausen, & Lewicki, 2002), how information can be decoded from recordings of this activity (Paradiso, 1988; Snippe & Koenderinck, 1992; Seung & Sompolinsky, 1993), and how various sorts of computations, including uncertainty-sensitive, Bayesian optimal statistical processing, can be performed through the medium of feedforward and recurrent connections among the populations (Pouget, Zhang, Deneve, & Latham, 1998; Deneve, Latham, & Pouget, 2001). Critical issues that have emerged from these analyses are the forms of correlations between neurons in the populations, whether these correlations are significant for decoding and computation, and what sorts of prior information are relevant to computations and can be incorporated by such networks.

However, many theoretical investigations into population coding have so far somewhat neglected a major dimension of coding: time. This is despite the beautiful and influential analyses of circumstances in which individual spikes contribute importantly to the representation of rapidly varying stimuli (Bialek, Rieke, de Ruyter van Steveninck, & Warland, 1991; Reinagel & Reid, 2000; Rieke, Warland, de Ruyter van Steveninck, & Bialek, 1997; Johansson & Birznieks, 2004) and the importance accorded to fast-timescale spiking by some practical investigations into population coding (Wilson & McNaughton, 1993; Schwartz, 1994; Zhang, Ginzburg, McNaughton, & Sejnowski, 1998; Brown, Frank, Tang, Quirk, & Wilson, 1998). The assumption is often made that encoded objects do not vary quickly with time and that therefore spike counts in the population suffice. Even some approaches that consider fast decoding (Brunel & Nadal, 1998; Van Rullen & Thorpe, 2001) treat stimuli as being discrete and separate rather than as evolving along whole trajectories.

In this letter, we study the generic computational consequences of population coding in time. We analyze decoding in time as a proxy for computation in time, as it is the most comprehensive computation that can be performed (accessing all information present); decoding therefore constitutes a canonical test (Brown et al., 1998; Zhang et al., 1998). We consider a regime in which stimuli are not static and create sparse trains of spikes. Decoding trajectory information from these population spike trains is thoroughly ill posed, and prior information about what trajectories are likely
comes to play a critical role. We show that optimal decoding with ecological priors formally couples together the spikes, making trajectory inference computationally very hard. We thus consider the prospects for neural populations to recode the information about the trajectory into new sets of spikes that do support simple computations. Phenomena reminiscent of adaptation emerge as a by-product of the maintenance of a computationally advantageous code.

We analyze the extension of one of the simplest ideas about population codes for static stimuli (Snippe & Koenderinck, 1992) to the case of trajectories. This links a neurally plausible population encoding model with a naturally realistic gaussian process prior. Unlike some previous work on decoding in time (Brown et al., 1998; Zhang et al., 1998; Smith & Brown, 2003), we do not confine ourselves to recursively specifiable priors and can therefore treat smoother cases. It is these smooth priors that render decoding, and likely other computations, hard and that inspire an energy-based (product of experts) recoding (Hinton, 1999; Zemel, Huys, Natarajan, & Dayan, 2005), which makes for readier probabilistic computation.

Section 2 starts with a simple encoding model. It introduces the need for priors, their shape, and analytical results for decoding in time. Section 3 shows how priors determine the form in which information is available to downstream neurons. We show that the decoder corresponding to the simple encoder can be extraordinarily complex, meaning that the encoded information is not readily available to downstream neurons. Finally, section 4 proposes a representation that has comparable power but for which decoding requires vastly less downstream computation.

2 A Gaussian Process Prior Approach

As a motivating example, consider tennis. The player returning a serve has to predict the position of the ball based on data acquired in fractions of a second. Experts compensate for the extraordinarily sparse stimulus information with a very rich temporal prior over ball trajectories and thus make predictions that are accurate enough to guarantee many a winning return.

Figure 1 illustrates the setting of this article more formally. It shows an array of neurons with partially overlapping tuning functions that emit spikes in response to a stimulus that varies in time. These could be V1 neurons responding to a target (the tennis ball) as it moves through their receptive fields, or hippocampal neurons with place fields firing as a rat explores an environment. The task is to decode the spikes in time, that is, to recover the trajectory of the stimulus (the ball's position, say) based on the spikes, a knowledge of the neuronal tuning functions (cf. Brown et al., 1998; Zhang et al., 1998, for hippocampal examples), and some knowledge about the temporal characteristics of the stimulus (the prior). In Figure 1, the ordinate represents the stimulus space (here one-dimensional for illustrative purposes) and the abscissa, time.
Figure 1: The problem: Reconstructing the stimulus as a function of time, given the spikes emitted by a population of neurons. When a neuron with preferred stimulus si emits a spike at time t, a black dot is plotted at (t, si ). A few example tuning functions are shown in gray. The ordinate represents stimulus space, with each neuron being positioned according to its preferred stimulus si . The decoding problem is related to fitting a line through these points, which is achievable only if there is prior information about the line to be fitted (e.g., the order of a polynomial fit or the smoothness).
Neuron i has preferred stimulus $s_i$. If it emits a spike $\xi^i_t$ at time t, a dot is drawn at position $(t, s_i)$. The dots in Figure 1 thus represent the spiking activity of the entire population of neurons over time. Our aim is to find, for each observation time T, a distribution over likely stimulus values $s_T$ given all the spikes previous to T. This is related to fitting a line representing the trajectory of the stimulus through the points. It is a thoroughly ill-posed problem, for instance, because we are not given any information about the stimulus at all between the spikes.

To solve this ill-posed problem, we have to bring in additional knowledge in the form of a prior distribution about likely stimulus trajectories. The prior distribution specifies the temporal characteristics of the trajectories (e.g., how smooth they are) and also whether they live within some constrained part of the stimulus space. Subjects are assumed to possess such prior information ahead of time, for instance, from previous exposures to trajectories (a good tennis player will have seen many serves).

To gain analytical insight into the structure of decoding in this temporally rich case, we consider a very simple spiking model $p(\xi^i_t|s_t)$ (cf. Snippe & Koenderinck, 1992, for the static case), augmented with a simple prior over stimulus trajectories $p(\mathbf{s})$. We thereafter follow standard approaches (Zhang et al., 1998; Brown et al., 1998) by performing causal decoding, thus recovering the distribution $p(s_T|\xi)$ over the current stimulus $s_T$ at time T given all the J spikes $\xi \equiv \{\xi^{i_j}_{t_j}\}_{j=1}^{J}$ at times $0 < \{t_j\}_{j=1}^{J} < T$ in the observation period
([0, T)), emitted by the entire population. Here, i = 1, . . . , N designates the neuron that emitted the spike. To state the problem in mathematical terms, we can write (at least for the case in which there is no spike at time T itself)

$$p(s_T|\xi) \propto p(s_T)\, p(\xi|s_T) \qquad (2.1)$$

$$= p(s_T) \int d\mathbf{s}_T\; p(\xi|\mathbf{s}_T)\, p(\mathbf{s}_T|s_T), \qquad (2.2)$$

where, being slightly notationally sloppy, we are integrating over stimulus trajectories $\mathbf{s}_T$ up to, but not including, time T, restricted to just those trajectories that end at $s_T$. Equation 2.2 lays bare the two parts of the definition of the problem. One is the likelihood $p(\xi|\mathbf{s}_T)$ of the spikes given the trajectory. This will be assumed to arise from a Poisson-gaussian spiking model. The other is the prior

$$p(s_T)\, p(\mathbf{s}_T|s_T) = p(\mathbf{s}) \qquad (2.3)$$

over the trajectories. This will be assumed to be a gaussian process.

2.1 Poisson-Gaussian Spiking Model.

We first define the spiking model. Let $\phi_i(s)$ be the tuning function of neuron i, and assume independent, inhomogeneous, and instantaneous Poisson neurons (Snippe & Koenderinck, 1992; Brown et al., 1998; Barbieri et al., 2004). Let j be an index running over all the spikes in the population, with i(j) reporting the index of the neuron that spiked at time $t_j$. Then, from the basic definition of an inhomogeneous Poisson process, the likelihood of a particular population spike train ξ given the stimulus trajectory $\mathbf{s}_T$ can be written as
$$p(\xi|\mathbf{s}_T) = \prod_j \phi_{i(j)}(s_{t_j})\, \exp\left(-\sum_i \int_0^T dt\; \phi_i(s_t)\right) \qquad (2.4)$$

$$\propto \prod_j \phi_{i(j)}(s_{t_j}), \qquad (2.5)$$

assuming that the trajectories are such that we can swap the order of the sum and the integral in the exp(·), that the tuning functions are sufficiently dense that the summed spiking rate is constant, independent of the location of the stimulus $s_t$, and that no two neurons ever fire together.
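A minimal simulation sketch of this encoder follows (Python; all parameter values and the stimulus trajectory are illustrative, and the gaussian shape of the tuning functions anticipates the assumption made in equation 2.6 below):

import numpy as np

rng = np.random.default_rng(0)
dt = 1e-3                                # time step (s)
T = 0.3                                  # observation period (s)
phi_max = 50.0                           # peak firing rate (Hz)
sigma = 0.1                              # tuning width
s_pref = np.linspace(-1.0, 1.0, 41)      # preferred stimuli s_i

t = np.arange(0.0, T, dt)
s_traj = 0.5 * np.sin(2 * np.pi * 3 * t) # stand-in stimulus trajectory s_t

# Rate of neuron i at time t: phi_i(s_t) = phi_max * exp(-(s_t - s_i)^2 / (2 sigma^2)).
rates = phi_max * np.exp(-(s_traj[None, :] - s_pref[:, None]) ** 2 / (2 * sigma ** 2))
spikes = rng.random(rates.shape) < rates * dt   # Bernoulli approximation to Poisson
spike_neuron, spike_bin = np.nonzero(spikes)    # i(j) and the bin of each spike j
spike_times = t[spike_bin]                      # the spike times t_j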
Finally, we assume squared-exponential (gaussian) tuning functions,

$$\phi_i(s_{t_j}) = \phi_{\max} \exp\left(-\frac{(s_{t_j} - s_i)^2}{2\sigma^2}\right),$$

where $\phi_{\max}$ is the maximal firing rate of a neuron and $s_i$ the ith neuron's preferred stimulus. Combining this with our previous assumptions (see equation 2.5) and completing the square implies that

$$p(\xi|\mathbf{s}_T) \propto \phi_{\max} \exp\left(-\frac{(\mathbf{s}_\xi - \boldsymbol{\theta})^T (\mathbf{s}_\xi - \boldsymbol{\theta})}{2\sigma^2}\right), \qquad (2.6)$$
where the spikes from the entire population have been ordered in time; the jth component of both $\mathbf{s}_\xi$ and $\boldsymbol{\theta}$ corresponds to the jth spike and is, respectively, the stimulus at that spike's time $t_j$ and the preferred stimulus $s_i$ of the neuron that produced it. Note that time is continuous here.

2.2 Gaussian Process Prior.

The prior $p(\mathbf{s})$ defines a distribution over stimulus trajectories that are continuous in time. However, $p(\xi|\mathbf{s}_T)$ in equation 2.6 depends on the trajectory only through its values at the times $t_j$ at which neurons in the population spike. Thus, in the integral in equation 2.2, we can formally marginalize or integrate out all the nonspiking times, making the key quantity to be defined by the prior $p(\mathbf{s}_\xi, s_T)$. For a gaussian process (GP), this quantity is a multivariate gaussian, defined by its (J + 1)-dimensional mean vector $\mathbf{m}$ and covariance matrix C, which can in general depend on the times $t_j$. We write the distribution as

$$p(\mathbf{s}_\xi, s_T) \sim \mathcal{N}(\mathbf{m}, C), \qquad C_{t_j t_{j'}} = c\, \exp\left(-\alpha\, |t_j - t_{j'}|^{\zeta}\right). \qquad (2.7)$$
The parameter ζ ≥ 0 dictates the smoothness and the correlation structure of the process. If ζ = 0, then the stimulus is assumed to be constant (we sometimes call this the static case). Setting ζ = 1 corresponds to assuming that the stimulus evolves as an Ornstein-Uhlenbeck (OU) or first-order autoregressive process. This is the generative model underlying Kalman filters (Twum-Danso & Brockett, 2001), and it generates an autocorrelation with the Fourier spectrum ∼1/ω² often observed experimentally (Atick, 1992; Dong & Atick, 1995; Wang, Liu, Sanchez-Vives, & McCormick, 2003). This can be generalized to nth-order autoregressive processes. Setting ζ = 2 leads to the opposite end of the spectrum, with smooth trajectories that are non-Markovian. The parameter α dictates the temporal extent of the correlations and c their overall size (c also parameterizes the scale of the overall process). Example trajectories drawn from these priors for ζ = {1, 2} are shown in Figure 2. For most of the letter, we will let $\mathbf{m} = 0$.
Figure 2: Example trajectories drawn from the prior distribution in equation 2.7. (A) Examples for the smooth covariance matrix with ζ = 2. (B) The OU covariance matrix, ζ = 1.
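Trajectories like those in Figure 2 can be drawn directly from equation 2.7; here is a minimal sketch (Python; the parameter values are illustrative, and a small jitter term is added for numerical positive-definiteness):

import numpy as np

def gp_prior_samples(t, c, alpha, zeta, n_samples, rng):
    # Covariance C_{t t'} = c * exp(-alpha * |t - t'|^zeta), as in equation 2.7.
    D = np.abs(t[:, None] - t[None, :])
    C = c * np.exp(-alpha * D ** zeta) + 1e-9 * np.eye(len(t))
    return rng.multivariate_normal(np.zeros(len(t)), C, size=n_samples)

rng = np.random.default_rng(1)
t = np.linspace(0.0, 0.25, 250)                             # a 250 ms grid
ou_paths = gp_prior_samples(t, 0.05, 20.0, 1, 3, rng)       # zeta = 1 (OU)
smooth_paths = gp_prior_samples(t, 0.05, 200.0, 2, 3, rng)  # zeta = 2 (smooth)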
Assuming a GP prior with a particular covariance matrix is exactly equivalent to regularizing the autocorrelation of the trajectory.

2.3 Posterior.

Making these assumptions, we can write down the posterior distribution $p(s_T|\xi)$ analytically by solving equation 2.2. It is a simple gaussian distribution with mean µ(T) and variance ν²(T), given in terms of the tuning function width σ, the vector $\boldsymbol{\theta}$, and the covariance matrix C.

All three terms in equation 2.2 are now defined. The conditional distribution $p(\mathbf{s}_\xi|s_T)$ is given in terms of the partitioned covariance matrix C,

$$p(\mathbf{s}_\xi|s_T) = \mathcal{N}_{\mathbf{s}_\xi}\!\left(C_{\xi T}\, C_{TT}^{-1}\, s_T,\;\; C_{\xi\xi} - C_{\xi T}\, C_{TT}^{-1}\, C_{T\xi}\right),$$

where $C_{\xi\xi}$ is the covariance matrix of the stimulus at all the spike times, $C_{T\xi}$ and $C_{\xi T}$ are vectors with the cross-covariances between the spike times and the observation time T, and $C_{TT}$ is the marginal (static) stimulus prior variance at the observation time (constant for the stationary processes considered here). The corresponding partitioning of the matrix C is

$$C = \begin{pmatrix} C_{\xi\xi} & C_{\xi T} \\ C_{T\xi} & C_{TT} \end{pmatrix}. \qquad (2.8)$$
The remaining two terms in equation 2.2 are given by $p(s_T) = \mathcal{N}_{s_T}(0, C_{TT})$ and equation 2.6. As the integral in equation 2.2 is a convolution of two gaussians, the variances add, and the integral evaluates to

$$p(\xi|s_T) = \mathcal{N}_{\boldsymbol{\theta}}\!\left(C_{\xi T}\, C_{TT}^{-1}\, s_T,\;\; C_{\xi\xi} - C_{\xi T}\, C_{TT}^{-1}\, C_{T\xi} + I\sigma^2\right).$$

Finally, taking a product with $p(s_T)$, renormalizing, and applying the matrix inversion lemmas (see appendix A), we get

$$\mu(T) = \mathbf{k}(\xi, T) \cdot \boldsymbol{\theta}(T) \qquad (2.9)$$

$$\nu^2(T) = C_{TT} - \mathbf{k}(\xi, T) \cdot C_{\xi T} \qquad (2.10)$$

$$\mathbf{k}(\xi, T) = C_{T\xi}\, \left(C_{\xi\xi} + I\sigma^2\right)^{-1}. \qquad (2.11)$$
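Equations 2.9 to 2.11 translate directly into code; a minimal sketch follows (Python; the spike times, preferred stimuli, and prior parameters are illustrative):

import numpy as np

def decode(T, spike_times, theta, sigma, c, alpha, zeta):
    # Posterior mean, variance, and temporal kernel at observation time T.
    tj = np.asarray(spike_times)
    C_xx = c * np.exp(-alpha * np.abs(tj[:, None] - tj[None, :]) ** zeta)  # C_{xi xi}
    C_Tx = c * np.exp(-alpha * np.abs(T - tj) ** zeta)                     # C_{T xi}
    C_TT = c
    k = np.linalg.solve(C_xx + sigma ** 2 * np.eye(len(tj)), C_Tx)         # eq. 2.11
    mu = k @ np.asarray(theta)                                             # eq. 2.9
    nu2 = C_TT - k @ C_Tx                                                  # eq. 2.10
    return mu, nu2, k

# Example: three spikes from neurons with the given preferred stimuli.
mu, nu2, k = decode(T=0.25, spike_times=[0.05, 0.12, 0.21],
                    theta=[-0.2, 0.1, 0.3], sigma=0.1, c=0.05, alpha=20.0, zeta=1)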
The mean µ(T) of the gaussian posterior is thus a weighted sum of the preferred stimuli of the neurons that emitted the spikes. The weights are given by what we term the temporal kernel $\mathbf{k}(\xi, T)$. As we will see, the weight given to each spike depends strongly on the time at which it occurred; a spike that occurred in the distant past is given small weight. The posterior variance depends only on C and σ². Remember that C depends only on the times of the spikes, not on the identities of the neurons that fired them. The posterior variance ν², similar to that of a Kalman filter, thus depends only on when data are observed, not on what data. This property depends on the squared-exponential nature of the tuning functions φ, and other tuning functions (e.g., with nonzero baselines) may not share it; however, this will not affect the conclusions reached below. The posterior distribution $p(s_T|\xi)$ is well known in the GP literature as the predictive distribution (MacKay, 2003).

2.4 Structure of the Code.

The operations needed to evaluate the posterior $p(s_T|\xi)$ give us insight into the structure of the code and will be analyzed in section 3 for various priors. If the posterior is a function of combinations of spikes, postsynaptic neurons have to have simultaneous access to all those spikes. This point is critical for temporal codes, as the spikes to which access is required are spread out in time. Only if spikes are interpretable independently can they be forgotten once they have been used for inference: all the information the spikes contribute to some future time T′ > T is then contained within $p(s_T|\xi)$. If the posterior depends on combinations of spikes (as will be the case for ecological, smooth priors), the information that can be extracted from a spike about times T′ > T is not entirely contained within $p(s_T|\xi)$. As a result, past spikes have to be stored and the posterior recomputed using them, an operation that is nonlocal in time.

We will show that under ecological priors, the posterior depends on spike combinations and is thus complex. Decoding for the simple encoder (the spiking model) is thus hard. In section 4, we will illustrate the type
of computations ("recoding") a network has to perform to access all the information. This will be equivalent to finding a new, complex encoder in time for which decoding is simple.

3 Effect of the Prior

The effect of the prior manifests itself very clearly in the temporal kernels $\mathbf{k}(\xi, T)$ from equation 2.11 and in the independence structure of the code. We show this by analyzing a representative set of priors in terms of both the behavior of the temporal kernels and the structure of the code, including priors that generate constant, varyingly rough, and entirely smooth trajectories. (MATLAB example code can be downloaded from http://www.gatsby.ucl.ac.uk/~qhuys/code.html.)

3.1 Constant Stimulus Prior ζ = 0.

We first show that our treatment of the time-varying case is an exact generalization of the case in which the stimulus is fixed (does not change relative to the mean $\mathbf{m}$), by rederiving classical results for static stimuli. Snippe and Koenderinck (1992) have shown that the posterior mean and variance (under a flat prior) are given by a weighted spike count,

$$\mu(T) = \frac{\sum_i n_i(T)\, s_i}{J(T)}, \qquad \nu^2(T) = \frac{\sigma^2}{J(T)}, \qquad (3.1)$$

where $n_i(T) = \int_0^T dt\; \xi^i_t$ is the ith neuron's spike count and $J(T) = \sum_i n_i(T)$ is the total population spike count at time T. If we let ζ = 0, the matrix $C_{\xi\xi} = c\, \mathbf{n}\mathbf{n}^T$, where $\mathbf{n}$ is a J(T) × 1 vector of ones. Equations 2.9 to 2.11 can then be solved analytically:
$$\left[\left(C_{\xi\xi} + I\sigma^2\right)^{-1}\right]_{ij} = \frac{(\sigma^2 + c\,J(T))\,\delta_{ij} - c}{\sigma^2\,(\sigma^2 + c\,J(T))}$$

$$\mathbf{k}(\xi, T) = \frac{c}{\sigma^2 + c\,J(T)}\, \mathbf{n}$$

$$\mu(T) = \frac{c \sum_i n_i(T)\, s_i}{\sigma^2 + c\,J(T)}$$

$$\nu^2(T) = \frac{c\,\sigma^2}{\sigma^2 + c\,J(T)},$$
which is exactly analogous to equation 3.1 with an informative prior. The temporal kernel $\mathbf{k}(\xi, T)$ does not decay but is flat, with a magnitude proportional to 1/J(T). The contribution of each neuron to the mean µ(T) is given by its spike count $n_i(T)$.
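A quick numerical check of this closed form against the general equations 2.9 to 2.11 (Python; the numerical values are illustrative):

import numpy as np

c, sigma = 0.05, 0.1
theta = np.array([-0.2, 0.1, 0.3, 0.1])   # preferred stimuli of the J spiking neurons
J = len(theta)

# General solution (eqs. 2.9-2.11) with the flat (zeta = 0) covariance.
C_xx = c * np.ones((J, J))
C_Tx = c * np.ones(J)
k = np.linalg.solve(C_xx + sigma ** 2 * np.eye(J), C_Tx)
mu, nu2 = k @ theta, c - k @ C_Tx

# Closed form derived above.
mu_cf = c * theta.sum() / (sigma ** 2 + c * J)
nu2_cf = c * sigma ** 2 / (sigma ** 2 + c * J)
assert np.allclose([mu, nu2], [mu_cf, nu2_cf])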
Figure 3: Comparison of static and dynamic inference. Throughout, the posterior density p(sT |ξ ) is indicated by gray shading, the spikes are vertical (gray) lines with dots, and the true stimulus is the line at the top of each plot. (A) Static stimulus, constant temporal kernel. (B) Moving stimulus, constant temporal kernel. (C) Static stimulus, decaying temporal kernel. (D) Moving stimulus, decaying temporal kernel. A and D show that only a match between true stimulus statistics and prior allows the posterior to capture the stimulus well.
Each spike is given the same weight, which is a sensible approach only if spikes are eternally informative about the stimulus. This is true only if the covariance matrix is flat, which itself implies that the only time-varying component of the stimulus is in the mean $\mathbf{m}$ and not in the covariance C.

If the stimulus is a varying function of time s(t), spikes at time t are informative only about the stimulus at times t′ close to t, and the influence of each spike on the posterior should fade away with time. This is illustrated in Figure 3. Figure 3A shows the present static case, where the stimulus does indeed not move; over time, the posterior $p(s_T|\xi)$ sharpens up around the true value. However, if the stimulus does move, the posterior ends up at the wrong value (see Figure 3B). If the temporal kernel $\mathbf{k}(\xi, T)$ decays, this amounts to downweighting spikes observed in the more distant past. In the following, we analyze the behavior of $p(s_T|\xi)$ and the optimal temporal kernel $\mathbf{k}(\xi, T)$ for various stimulus autocorrelation functions. Figure 3C shows that a decaying kernel leads to a posterior that widens in between spikes. This is incorrect if the stimulus is static, but Figure 3D shows how such a decaying temporal kernel would, in contrast to Figure 3B, allow $p(s_T|\xi)$ to track the moving stimulus correctly.
Figure 4: Posterior distribution $p(s_T|\xi)$ for the OU prior. Same representation as in Figure 1. The dashed line shows the actual stimulus trajectory used to generate the spikes, the dots are the spikes, the posterior distribution is in gray scale, and the solid line shows the posterior mean. Between spikes, the posterior mean decays exponentially back toward the mean $\mathbf{m}$ (here 0), and the variance approaches the static prior variance $C_{TT}$.
3.2 Nonsmooth (Ornstein-Uhlenbeck) Prior ζ = 1.

Setting ζ = 1 in the definition of the prior (see equation 2.7) corresponds to assuming that the stimulus evolves as a random walk with drift to zero (an OU process):

$$ds = -(1 - e^{-\alpha})\, s(t)\, dt + \sqrt{c\,(1 - e^{-2\alpha})\, dt}\;\; dN(t), \qquad (3.2)$$
with gaussian noise N(t) ∼ N(0, 1) and parameters as in equation 2.7. The OU process is the underlying generative process assumed by standard Kalman filters. The simplicity of Kalman-filter-like formulations explains some of their wide applicability and success (Brown et al., 1998; Barbieri et al., 2004). However, as indicated visually by the example trajectories in Figure 2, the rough trajectories this prior favors are not a good model of smooth biological movements (see also section 5).

Figure 4 shows a sampled stimulus trajectory, sample spikes generated from it, and the posterior distribution $p(s_T|\xi)$. The mean of the posterior does a good job of tracking the true underlying stimulus trajectory and is never more than two standard deviations away from it. Between spikes, the mean simply moves back to zero (albeit rather slowly, given the parameters associated with the figure shown).

Figure 5A displays example temporal kernels $\mathbf{k}(\xi, T)$ for inference in this process. They are very close to exponentials (note the logarithmic ordinate). This makes intuitive sense, as an OU process is a first-order Markov process (it can be rewritten as a first-order difference equation). In fact, assuming the spikes arrive regularly (replacing each of the interspike intervals (ISIs) by their average value $\Delta = \frac{1}{J}\sum_j (t_j - t_{j-1}) \propto 1/\phi_{\max}$) allows us to write the jth component of $\mathbf{k}(\xi, T)$ as $k_j \approx d_1 \lambda_1^{j-1}$, where $d_1$ and $\lambda_1$ are constants defined in appendix B.
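In unit time steps, equation 3.2 reduces to the update $s_{k+1} = e^{-\alpha} s_k + \sqrt{c(1 - e^{-2\alpha})}\, \eta_k$, whose stationary statistics should match the ζ = 1 prior of equation 2.7. A minimal numerical check (Python; the parameter values are illustrative):

import numpy as np

rng = np.random.default_rng(2)
c, alpha = 1.0, 0.1
n_steps, n_paths = 2000, 4000

s = np.sqrt(c) * rng.standard_normal(n_paths)   # start from stationarity
paths = np.empty((n_steps, n_paths))
for k in range(n_steps):
    paths[k] = s
    s = np.exp(-alpha) * s + np.sqrt(c * (1 - np.exp(-2 * alpha))) \
        * rng.standard_normal(n_paths)

# Empirical autocovariance at a few lags vs. the prior's c * exp(-alpha * lag).
for lag in (1, 5, 10):
    emp = np.mean(paths[:-lag] * paths[lag:])
    print(lag, emp, c * np.exp(-alpha * lag))    # the two should agree closely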
Figure 5: OU temporal kernels for ζ = 1. (A) Examples of temporal kernels. The top traces are for lower and the bottom for higher average firing rates. The gray traces show temporal kernels for Poisson spike trains. The components of the vector k(ξ, T) are plotted against the corresponding spike time. The dashed black traces show temporal kernels for regular spike arrivals (metronomic temporal kernels). The true (gray) temporal kernels are relatively tightly bunched around the metronomic temporal kernel. The firing rate affects the slope of the kernel but not its overall scale. (B) The effect of the time since the last spike on the temporal kernel is an overall multiplicative scaling. There is no effect on the slope.
For such metronomic spiking, $\mathbf{k}(\xi, T)$ is thus really simply a decaying exponential. Somewhat similar expressions can be obtained for the original case of Poisson-distributed ISIs (see appendix B).

Figure 5A shows that the metronomic approximation provides a generally good fit, capturing especially the slope of the true temporal kernels, which depends mostly on the correlation length α and the maximal (or average) firing rate $\phi_{\max}$. The remaining quality of the fit is influenced most strongly by the match between $\Delta$ and the time since the last spike $T - t_J$ (which takes its effect through $C_{T\xi}$ in equations 2.8 and 2.9 to 2.11). This determines the overall scale of the temporal kernel.

The factors influencing the slope of the temporal kernel and its height do not interact greatly; that is, $T - t_J$ does not affect the slope (shape) of the temporal kernel, only its magnitude, as shown in Figure 5B (metronomic temporal kernels are used for clarity, but the argument applies equally to the exact kernels). Conversely, $\Delta$ affects mostly the slope.
Figure 6: Comparison between exact and metronomic kernels. Same representation as in Figure 4. (A) Exact posterior $p(s_T|\xi)$. (B) Approximate posterior derived by replacing all ISIs by $\Delta$ but keeping $T - t_J$. This corresponds to approximating the true kernels with the metronomic kernels in Figure 5. The approximation is very close.
Replacing the true temporal kernels by metronomic temporal kernels, that is, replacing all ISIs by $\Delta$ but keeping the time since the last spike $T - t_J$, does not greatly degrade $p(s_T|\xi)$ (cf. Figures 6A and 6B).

The dependence in Figure 5B can be understood by writing out the integrand of equation 2.2 in detail for the OU prior. This factorizes over potentials involving duplets of spikes because, as we show in appendix B, $C^{-1}$ is tridiagonal, implying that the elements of $C^{-1}$ involve only two successive spikes:

$$p(\mathbf{s}_\xi, s_T) \propto \exp\left(-\frac{1}{2}\begin{pmatrix} \mathbf{s}_\xi \\ s_T \end{pmatrix}^{\!T} C^{-1} \begin{pmatrix} \mathbf{s}_\xi \\ s_T \end{pmatrix}\right) = \exp\left(-\frac{1}{2}\left(\sum_{j=1}^{J+1} s_{t_j}^2\, C^{-1}_{t_j t_j} + 2\sum_{j=1}^{J} s_{t_j}\, C^{-1}_{t_j, t_{j+1}}\, s_{t_{j+1}}\right)\right)$$

$$p(\mathbf{s}_\xi, s_T) = \psi(s_T) \prod_{j=1}^{J} \psi(s_{t_j}, s_{t_{j+1}}), \qquad (3.3)$$

where $t_J$ stands for the time of the last spike, $t_{J-1}$ for the time of the penultimate one, and so on, and the observation time is $T = t_{J+1}$. Note that the last equality implies that the determinant also factors over spike pairs. This means that the integrations over each spike in the main equation 2.2 can be written in a recursive form akin to that used in message-passing algorithms (MacKay, 2003) and the exact Kalman filter.
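The structural claim, that $C^{-1}$ is tridiagonal for the OU prior but dense for the smooth prior of section 3.3, is easy to verify numerically; a minimal sketch (Python; spike times and parameters are illustrative, with a small jitter added before inversion):

import numpy as np

t = np.sort(np.random.default_rng(3).uniform(0.0, 0.3, size=8))  # spike times
for zeta in (1, 2):
    C = np.exp(-20.0 * np.abs(t[:, None] - t[None, :]) ** zeta)
    Cinv = np.linalg.inv(C + 1e-9 * np.eye(len(t)))
    # Largest entry more than one step off the diagonal: essentially zero
    # for zeta = 1 (tridiagonal), but large for zeta = 2 (dense).
    print(zeta, np.abs(np.triu(Cinv, k=2)).max())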
Figure 7: Natural trajectories are smooth. (A) Position of a rat freely exploring a square environment. (B) Covariance function of the position along the ordinate (gray, dashed line) and a quadratic approximation (black, solid line). Note the logarithmic ordinate. The smoothing applied to eliminate artifacts was of a timescale short enough not to interfere with the overall shape of the covariance function.
3.3 Smooth Prior ζ = 2.

Setting ζ = 2 in the definition of the prior (see equation 2.7) corresponds to assuming that the stimulus evolves as a non-Markov random walk. Trajectories with this autocovariance function are smooth (Figure 2A shows some sample trajectories generated from the prior) and infinitely differentiable. The smoothness makes it a more ecologically relevant prior for Bayesian decoding of movement-related trajectories than nonsmooth priors, since natural objects (and limbs) move along smooth trajectories rather than jumping. As an example, Figure 7A shows trajectories of a rat exploring a square environment (data kindly provided by Lever, Wills, Cacucci, Burgess, & O'Keefe, 2002). Not only are these natural trajectories smooth, but Figure 7B also shows that a squared-exponential covariance function closely approximates the real covariance function.¹

¹ Only the center of the covariance function is shown here. Due to the small size of the environment, the rat runs back and forth along the entire available length, and there are oscillating flanks to the covariance function for delays larger than those shown.
Figure 8: Posterior distribution p(sT |ξ ) for the smooth prior. Same representation as in Figure 4. The arrow highlights where the smooth prior uses spike combinations to constrain higher-order statistics of the process, such as velocity, acceleration, and jerk. While the smooth prior correctly predicts that the stimulus will continue away from the mean before returning back, the OU process can predict a decay only back to the mean (see Figure 9). The first spike on the left is the very first spike observed. As the spike history becomes more extensive, the posterior distribution is seen to sharpen up and follow the stimulus accurately.
Figure 8 is the equivalent of Figure 4 for the smooth case and shows the posterior $p(s_T|\xi)$. The main dynamical difference between inference in this smooth case and inference in the OU case is indicated by the arrow in the figure. While the OU process simply decays back to the mean (here, zero for simplicity), the dynamics of the smooth posterior mean are much richer. In the absence of spikes, the mean continues in its current direction for a while before reversing back. As can be seen, this gives a better fit to the underlying stimulus trajectory (the black dashed line) than would otherwise have been achieved. It arises directly from the fact that the correlations extend beyond the last spike (indeed, into the entire past).

For comparison, Figure 9 shows the posterior when the wrong prior is used. The stimulus was generated from the smooth prior, but the OU prior was used to infer the posterior. The arrow indicates where the infelicity of the inaccurate posterior is most apparent, falling back to zero instead of predicting that the stimulus will continue to move farther away from zero. In terms of difference equations, the larger extent of correlations intuitively means that the higher-order derivatives of the process are also "constrained" by the covariance C.

The simple exponential temporal kernels observed in the OU process cannot give rise to the reversals observed in the smooth process. Figure 10A shows the temporal kernels for the smooth process, which have a distinctively different flavor from the OU temporal kernels (shown in Figure 5), including oscillating terms multiplying the exponential decay.
Figure 9: Posterior distribution p(sT |ξ ) for smooth stimulus but wrongly assuming an OU prior. The posterior is consistently wider than it should be (cf. Figure 8). The arrow points out where the prediction is qualitatively wrong: the OU prior allows for decay back only to zero, unlike the smooth prior. Note also that the beneficial effect of a larger spike history observed in Figure 8 is absent here.
Figure 10: Temporal kernels for the smooth prior. (A) Exact (gray solid) and metronomic (black dashed) temporal kernels for the smooth prior with ζ = 2. The metronomic kernels again provide a close fit. (B) The metronomic temporal kernels change in a complex manner as the observation time T is moved away from the time of the last spike. Unlike in the OU case, this is not just a recursively implementable multiplication. (C) The same qualitative behavior arises for kernels derived from the empirical covariance function of the rat trajectories.
Most important, the oscillating terms allow the weight assigned to a spike to dip below zero; that is, a spike initially signifies proximity of the stimulus to the neuron's preferred stimulus but later swaps over, signaling that the stimulus is not there anymore. This feature of the temporal kernels gives rise to the reversals seen in the posterior mean.
As in the OU case, the metronomic temporal kernel based on equal ISIs gives a good description of the temporal kernel, mostly for spikes in the more distant past. Replacing the true temporal kernels by metronomic temporal kernels (but keeping the exact time since the last spike $T - t_J$) again does not affect the posterior strongly. Nevertheless, the Kullback-Leibler divergence between the true posterior and the metronomic posterior is larger in the smooth than in the OU case (data not shown), indicating that the exact timing of spikes is more important for smooth inference.

Unlike in the OU case, there is no simple analytical expression for the metronomic temporal kernel (let alone the true temporal kernel). In particular, Figure 10B shows that changing the time since the last observed spike $T - t_J$ does not simply scale the temporal kernel but also changes its shape (it produces a complicated phase shift of the oscillating component). Again, for clarity, the metronomic kernels are used as an illustration, but the argument also applies to the exact kernels. Local structure has complex global consequences in the smooth case, with a single new spike requiring individual reweighting of all past spikes depending on their precise times. By comparison, for the OU process, the reweighting involves multiplication by a single factor. Figure 10C shows that this temporal kernel complexity is also a feature of the temporal kernel derived from the covariance function of the empirical rat trajectories in Figure 7.

The fundamental difference between the OU and the smooth temporal kernels arises from the difference in the factorization properties of the prior. Because the inverse of the covariance matrix for ζ ∉ {0, 1}, and specifically for ζ = 2, is dense, it does not factorize over spike combinations and therefore does not allow a recursive form. To see that a recurrence relation is possible only for the OU prior, which factorizes across duplets of spikes, write (denoting by $s_J$ the stimulus at the time $t_J$ of the last spike, by $\mathbf{s}_{\bar{J}}$ the stimulus at the times of all the spikes apart from the last, and by $\xi_J$ and $\xi_{\bar{J}}$ the corresponding spikes)

$$p(s_T|\xi) = \int ds_J \int d\mathbf{s}_{\bar{J}}\;\; p(s_T, s_J, \mathbf{s}_{\bar{J}}|\xi),$$

by expanding and integrating over $s_J$ and $\mathbf{s}_{\bar{J}}$,

$$\propto \int ds_J\;\, p(s_T, s_J)\, p(\xi_J|s_J) \int d\mathbf{s}_{\bar{J}}\;\, p(\mathbf{s}_{\bar{J}}, \xi_{\bar{J}}|s_T, s_J),$$

using Bayes' rule and the instantaneity of spiking,

$$= \int ds_J\;\, p(s_T, s_J)\, p(\xi_J|s_J) \int d\mathbf{s}_{\bar{J}}\;\, p(\xi_{\bar{J}}|\mathbf{s}_{\bar{J}})\, p(\mathbf{s}_{\bar{J}}|s_T, s_J),$$

again because the spikes are instantaneous,

$$= \int ds_J\;\, p(s_T, s_J)\, p(\xi_J|s_J)\, m_T(s_T, s_J, \xi_{\bar{J}}). \qquad (3.4)$$
Were $m_T(s_T, s_J, \xi_{\bar{J}})$ independent of $s_T$, this would be exactly like a recursive update equation, with $p(s_T, s_J)$ being the transition probability from the last observed spike to the inference time T, $p(\xi_J|s_J)$ being the innovation due to the last observation (the likelihood of the last observed spike), and the message $m_T(s_J, \xi_{\bar{J}})$ propagating the uncertainty from all the spikes other than the last to the last one. However, for general priors, $p(\mathbf{s}_{\bar{J}}|s_T, s_J)$, and therefore also $m_T(s_T, s_J, \xi_{\bar{J}})$, do depend on $s_T$, so all spikes have to be used to infer the posterior at each time T. To make $m_T$ independent of $s_T$, the prior has to be Markov in individual spike timings, with

$$p(\mathbf{s}_{\bar{J}}|s_T, s_J) = p(\mathbf{s}_{\bar{J}}|s_J), \qquad (3.5)$$

which makes

$$m_T(s_T, s_J, \xi_{\bar{J}}) = \int d\mathbf{s}_{\bar{J}}\;\, p(\xi_{\bar{J}}|\mathbf{s}_{\bar{J}})\, p(\mathbf{s}_{\bar{J}}|s_J) \qquad (3.6)$$

$$\equiv m_T(s_J, \xi_{\bar{J}}), \qquad (3.7)$$
which is indeed independent of sT . So for the OU process, the last message mT (s J , ξ J ) merely needs to be multiplied by the transition probability (see Figure 5B). However, the smooth temporal kernel changes shape in a complex way (corresponding to the dependence of the message mT (sT , s J , ξ J ) in equation 3.4 on sT ). Again, this means that all spikes have to be kept in memory for full inference. Note, finally, that this conclusion, and the fact that there is a recursive form for the OU process, do not depend on the particular spiking model assumed, verifying the assertion that the choice of squared exponential tuning functions, although mathematically helpful, does not pose limitations on our conclusions. 3.4 Intermediate (Autoregressive) Processes. There are cases intermediate to the smooth and the OU process that allow a partially recursive formulation. For instance, the metronomic OU process can be generalized to an autoregressive model of nth order by writing st =
3.4 Intermediate (Autoregressive) Processes

There are cases intermediate to the smooth and the OU process that allow a partially recursive formulation. For instance, the metronomic OU process can be generalized to an autoregressive model of $n$th order by writing

$$s_t = \sum_{i=1}^{n} \beta_i s_{t-i} + \sqrt{c}\,\eta_t. \tag{3.8}$$
In this case, the inverse covariance matrix $C^{-1}$ is $(2n+1)$-diagonal (see appendix C), with entries determined directly by the $\beta_i$. This implies that the posterior factorizes over cliques $\psi$ involving $n + 1$ spikes (see equation 3.3) and that inference will be Markov in groups of $n$ spikes. Zhang et al. (1998) find that a two-step Bayesian decoder, which is an AR(2) process in our terms, significantly improves decoding of hippocampal place cell data.
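A quick numerical check of the bandedness claim (our sketch, not the authors' code): build the inverse covariance of equation C.1 from hypothetical AR coefficients and inspect the band.

```python
import numpy as np

def ar_inverse_covariance(beta, T):
    """Inverse covariance of an AR(n) process over T time steps (equation C.1,
    omitting the stationarity correction of equation C.4)."""
    n = len(beta)
    b = np.concatenate(([1.0], -np.asarray(beta, dtype=float)))  # [1, -b1, ..., -bn]
    Cinv = np.zeros((T, T))
    for t in range(T - n):
        Bt = np.zeros(T)
        Bt[t:t + n + 1] = b
        Cinv += np.outer(Bt, Bt)
    return Cinv

Cinv = ar_inverse_covariance([1.6, -0.64], T=10)   # AR(2); coefficients hypothetical
# Entries more than n = 2 off the diagonal vanish, so C^{-1} is 5-diagonal.
print(np.allclose(np.triu(Cinv, k=3), 0.0))        # True
```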
[Figure 11 appears here. Panel A: trajectories plotted as space versus time (s); panel B: kernel size (weight) versus $T - t_j$ (s).]
Figure 11: Autoregressive processes of increasing order. (A) Samples from processes of order n = {1, 2, 3, 5, 10} from top to bottom. The top process corresponds to an OU process. (B) Metronomic temporal kernels k(ξ , T) corresponding to the processes in A. The different lines (in varying shades of gray) correspond to increasing the observation time T as in Figures 5 and 10.
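The following sketch (ours, not the letter's code) generates trajectories like those in Figure 11A, using the appendix C construction in which the coefficients follow from the binomial expansion of an OU process applied to the $n$th difference; the values of $\beta_0$ and $c$ are hypothetical.

```python
import numpy as np
from math import comb

rng = np.random.default_rng(1)

def ar_sample(n, beta0=0.95, c=1e-4, T=500):
    """Sample an AR(n) trajectory whose nth difference evolves as an OU process
    (appendix C): beta_i = C(n, i) * beta0 * (-beta0)**(i - 1)."""
    beta = np.array([comb(n, i) * beta0 * (-beta0) ** (i - 1) for i in range(1, n + 1)])
    s = np.zeros(T)
    for t in range(n, T):
        s[t] = beta @ s[t - n:t][::-1] + np.sqrt(c) * rng.normal()
    return s

samples = {n: ar_sample(n) for n in (1, 2, 3, 5, 10)}   # orders as in Figure 11A
```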
Figure 11A shows sample trajectories from such processes of increasing order. The coefficient vector $\beta$ was set here such that the $n$th difference of the processes evolved as an OU process (see appendix C). The higher the order, the smoother the processes that can be generated, and the more oscillations are apparent in the temporal kernels. The OU and the smooth processes (see section 3.3) are at opposite ends of this spectrum, with tridiagonal and dense inverse matrices, respectively. The higher the order, the greater the complexity of the code. Indeed, the complexity grows exponentially (since groups of $n$ spikes have to be considered and the number of such groups increases exponentially). While natural stimulus trajectories may not be indefinitely differentiable, the exponential increase in complexity implies that any smoothness has great potential to render the code complex.

4 Expert Spikes for Efficient Computation

Complex codes, following, for instance, from the assumption of natural smooth priors, render the information inherent in the spikes hard to extract. Efficient computation in time requires access to all encoded information and
thus requires that the complex temporal structure of the code be taken into account. Here, we show that information present in the complex codes can be re-represented using codes that are straightforward to decode and use in key probabilistic computations. Specifically, we propose to decode each spike independently and multiply together the contributions from all spikes. This corresponds to treating each spike as an independent expert in a product of experts (PoE) setting (Hinton, 1999):

$$\hat{p}(s_T|\xi) = \frac{1}{Z(T)} \exp\left(\sum_i \sum_t g_i(s,t)\, \xi^i_{T-t}\right). \tag{4.1}$$
That is, each time a spike $\xi^i$ occurs, it contributes its same projection kernel $\exp(g_i(s,t))$ to the posterior distribution $\hat{p}(s_T|\xi)$. To put it another way, for each spike, we add the same stereotyped contribution to the log posterior and then renormalize. From the discussion in the preceding sections, it is immediately apparent that the PoE approximation is a better approximation for the OU case than for the smooth case. In the following, we first derive an approximate analytical expression for separable projection kernels $g_i(s,t) = f_i(s)h(t)$ based on metronomic spikes and the OU prior. We then remove any restrictions and derive nonparametric, nonseparable $g_i(s,t)$ for both the OU and the smooth temporal kernel and show that these still perform better for the OU process than for the smooth process. Finally, we infer a new set of spikes $\rho \neq \xi$ such that decoding according to the PoE model produces a posterior distribution $\hat{p}(s_T|\rho)$ that matches the true posterior distribution $p(s_T|\xi)$ well for both OU and smooth priors.

4.1 Approximate Projection Kernels

4.1.1 Metronomic Projection Kernels. Section 3.2 showed that for the OU process, the weight accorded a spike is approximately a decreasing exponential function of the time elapsed since its occurrence and that replacing the true temporal kernels by the metronomic temporal kernels (without fixing the time since the last spike at $\langle\Delta\rangle$) gives a qualitatively good approximation (see Figure 4). This suggests writing an approximate distribution with spatiotemporally separable projective kernels,

$$\hat{p}(s_T|\xi) \propto \prod_i \phi_i(s)^{\sum_t \xi^i_{T-t} e^{-\beta t}} \tag{4.2}$$
$$= \exp\left(\sum_i \log(\phi_i(s)) \sum_t e^{-\beta t}\, \xi^i_{T-t}\right) \tag{4.3}$$
[Figure 12 appears here. Panels A and B: space versus time (s).]
Figure 12: Separable projection kernel for the OU process: comparison of true $p(s_T|\xi)$ (A) and $\hat{p}(s_T|\xi)$ from equation 4.3 (B). The left arrows indicate where the variance of the approximate distribution diverges toward $\infty$ as $T - t_J \to \infty$ rather than approaching $C_{TT}$. The right arrows show the effect of this on the mean of the approximate posterior, which returns to the prior mean $m = 0$ more rapidly than the true posterior.
to use exactly the form of equation 4.1. We can thus also write

$$\hat{p}(s_T|A) \propto \prod_i \phi_i(s)^{A_i(T)}, \tag{4.4}$$
where $A_i(T)$ can be seen as an equivalent "activity" of each neuron. The performance of this approximation is shown in Figure 12 for the OU process (see also Zemel et al., 2005). There are a few differences between Figures 4 and 12. Keeping the $\phi_i(s)$ as before, the variance of this approximation is $\hat{\nu}^2(T) = \sigma^2/\sum_i A_i(T)$. As the last observed spike recedes into the past, this approaches infinity (left arrows in Figure 12), and the mean returns to zero (right arrows in Figure 12). This is different from the case of exact inference, which approaches the static prior with variance $C_{TT}$. The mean $\hat{\mu}(T) = \sum_i s_i A_i(T)/\sum_j A_j(T)$ is always normalized and returns to zero more slowly than the variance increases. This introduces an inaccuracy, since the true OU temporal kernels (shown in Figure 5) are not normalized, $\sum_t k_t(\xi, T) < 1$, which arises because of the weight given to the spatial prior. For the smooth case, no simple approximation of the form of equation 4.3 is viable. This can be seen, for instance, from the fact that the smooth temporal kernels (see Figure 10) dip below zero (making it tricky to use them in products).
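As a concrete illustration (a sketch with made-up numbers, not the authors' code), the separable approximation of equations 4.2 to 4.4 reduces decoding to maintaining one scalar activity per neuron, from which the approximate mean and variance above follow directly:

```python
import numpy as np

s_pref = np.linspace(-1.0, 1.0, 11)     # preferred stimuli s_i (assumed grid)
beta, sigma = 20.0, 0.3                 # temporal decay rate and tuning width (assumed)
spikes = {3: [0.04, 0.11], 7: [0.09]}   # neuron index -> spike times (hypothetical)

def activities(T):
    """Equivalent activities A_i(T) = sum_t xi^i_{T-t} exp(-beta * t)."""
    A = np.zeros_like(s_pref)
    for i, times in spikes.items():
        A[i] = sum(np.exp(-beta * (T - tj)) for tj in times if tj <= T)
    return A

A = activities(T=0.15)
var_hat = sigma**2 / A.sum()              # diverges as the last spike recedes
mean_hat = (s_pref * A).sum() / A.sum()   # decays back to the prior mean m = 0
print(mean_hat, var_hat)
```

As $T$ grows, the total activity shrinks and `var_hat` grows without bound, which is the failure marked by the left arrows in Figure 12.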
Figure 13: Projection kernels inferred by equation 4.5 for OU (A) and smooth (B) priors. Stimulus trajectories and corresponding population spike trains $\xi$ were generated until the update equations converged (approximately $2 \cdot 10^4$ spike trains). Both kernels have the shape of a difference of gaussians at $t = 0$ and fall off exponentially with time. There is little nonseparable structure in either case.
4.1.2 Inferring Full Spatiotemporal Projection Kernels $g_i(s,t)$. To apply expression 4.1 to the smooth case, we inferred $g_i(s,t)$ in a nonparametric way by discretizing time and space over which the distributions are defined and minimizing the Kullback-Leibler divergence between the discretized versions $p(s_T|\xi)$ and $\hat{p}(s_T|\xi)$ with respect to the projection kernels,

$$g_i(s,t) \leftarrow g_i(s,t) - \varepsilon \nabla_{g_i(s,t)} D_{KL}(p(s_T|\xi)\,||\,\hat{p}(s_T|\xi)), \tag{4.5}$$
where $D_{KL}(p(s)||q(s)) = \int ds\, p(s)\log\frac{p(s)}{q(s)}$. Given that our approximation 4.1 is related to restricted Boltzmann machines (RBMs), it is not surprising that the gradient has a form akin to the wake-sleep algorithm (Hinton, Dayan, Frey, & Neal, 1995):

$$\nabla_{g_i(s,t)} D_{KL}(p(s_T|\xi)\,||\,\hat{p}(s_T|\xi)) = \sum_T \left[\hat{p}(s_T|\xi) - p(s_T|\xi)\right]\xi_i(T-t). \tag{4.6}$$
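In code, one gradient step of equations 4.5 and 4.6 might look as follows (a sketch on a discretized grid; the true posteriors `p_true` and spike trains `xi` are assumed to be given, and all array shapes are our own conventions):

```python
import numpy as np

def poe_posterior(g, xi, T):
    """PoE posterior of equation 4.1 on a spatial grid; g has shape
    (neurons, space, lags) and xi has shape (neurons, time)."""
    N, S, L = g.shape
    log_p = np.zeros(S)
    for t in range(min(T + 1, L)):
        log_p += g[:, :, t].T @ xi[:, T - t]      # sum_i g_i(s, t) xi^i_{T-t}
    p = np.exp(log_p - log_p.max())
    return p / p.sum()

def kernel_gradient_step(g, xi, p_true, eps=0.1):
    """One step of equation 4.5, with the gradient of equation 4.6."""
    N, S, L = g.shape
    grad = np.zeros_like(g)
    for T in range(xi.shape[1]):
        diff = poe_posterior(g, xi, T) - p_true[T]     # [p_hat - p] at time T
        for t in range(min(T + 1, L)):
            grad[:, :, t] += np.outer(xi[:, T - t], diff)
    return g - eps * grad
```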
Figure 14: Projection kernels are independent of contrast. The left-most panel shows an OU kernel for the same contrast ($\phi_{max}$) as in Figure 13; the contrast is doubled in the middle and quadrupled in the right panel. All these are off-center kernels with the same parameters as used in the other figures. Despite a slight slant toward the mean, the kernels are approximately separable.

Figure 13 shows the projection kernels inferred for the OU prior (see Figure 13A) and the smooth prior (see Figure 13B). Both start, for $t = 0$, with a spatial profile similar to a difference of gaussians (DOG) and then fall off as exponentials of time. The kernels $g_i(s,t)$ shown here are for neurons $i$ with $s_i$ close to 0, the center of the gaussian prior over the trajectories. The projection kernels shown are for the same parameter settings as Figures 4 and 8, and the faster decay of the smooth projection kernels is due to the shorter correlation timescale. For the OU process, the kernels for neurons $i$ with $s_i > 0$ become slightly slanted toward $-1$ over time (and the converse holds for those with $s_i < 0$) to capture the decay to the mean (zero), which is only a function of the distance from the mean. This effect is noticeable for the OU but very small for the smooth kernels. Figure 14 shows off-center OU kernels inferred for different contrasts (by varying $\phi_{max}$). As can be seen, the kernels are invariant to the contrast, and the slant effect is small. For the parameter range explored here, both projection kernels are approximately separable, indicating that the analytically derived motivation above may be close to optimal and that, in the PoE framework of equation 4.1, separable projection kernels may be the optimal choice even for the smooth prior. However, simply using these projection kernels to interpret the original spikes $\xi$ results in an approximation that is far from perfect, especially in the smooth case. Figure 15 compares the true posterior distribution and that given by the approximation with the above projection kernels. The cost of independent decoding is quantified in Figure 15A using
$$\left\langle \frac{1}{T}\sum_t \frac{D(p(s_T|\xi)\,||\,\hat{p}(s_T|\xi))}{H(p(s_T|\xi))} \right\rangle_{p(s,\xi)}, \tag{4.7}$$
where $H(p)$ is the entropy of $p$ and the average is over many stimulus trajectories $s \sim \mathcal{N}(0, C)$ and spikes $\xi \sim p(\xi|s)$. This quantity can also be interpreted as a percentage information loss. It is larger for the smooth than for the OU process, showing that the OU process suffers much less from the approximation than the smooth prior. Visually, there are no gross differences between $p(s_T|\xi)$ and $\hat{p}(s_T|\xi)$ for the OU prior (see Figures 15B and 15D). However, for the smooth prior, the arrows in Figures 15C and 15E indicate areas where a large mismatch is introduced by the independent treatment of the spikes, which discards all information contained in spike combinations. This mismatch is entirely to be expected.

Figure 15: Comparison of true distribution $p(s_T|\xi)$ and approximate distribution $\hat{p}(s_T|\xi)$ given by equation 4.1 with projection kernels inferred by equation 4.5 and shown in Figure 13. Organization is the same as in previous figures. (A) $\langle \frac{1}{T}\sum_t D(p(s_T|\xi)||\hat{p}(s_T|\xi))/H(p(s_T|\xi))\rangle_{p(\xi,s)} \pm 1$ standard deviation for both priors. (B, C) $p(s_T|\xi)$. (D, E) The corresponding $\hat{p}(s_T|\xi)$ for the same spikes. (B, D) A stimulus generated from the OU prior. (C, E) The smooth prior. $\hat{p}(s_T|\xi)$ is a good approximation for the OU prior but fails for the smooth prior. The arrows indicate where the approximation fails fundamentally in a similar way to that shown in Figure 9.
4.2 Recoding: Finding Expert Spikes

The previous section has shown that an independent interpretation of spikes is more costly with the smooth than with the OU prior. In this section, we show that it is possible to find a new set of "expert" spikes $\rho$, such that each spike can be interpreted independently and the posterior distribution is matched closely for both the OU and the smooth prior. This recoding thus takes spikes $\xi$ that are redundant in a decoding sense and produces a new set of spikes $\rho$ that can be easily used for efficient neural computation because the decoding redundancy has been eliminated. We first infer real-valued activities $a_\xi$ and then proceed to infer actual spikes $\rho$. We use neurally implausible methods to infer the new set of spikes $\rho$. In a companion paper we will explore the capability of neurally plausible spiking networks to do this recoding and to use the resulting simple code for probabilistic computations in time (see also Zemel, Huys, Natarajan, & Dayan, 2004).
Figure 16: Inferring activities A for the OU prior. (A) True posterior $p(s_T|\xi)$. (B) Approximate posterior $\hat{p}(s_T|A)$, which matches arbitrarily well (for this example, $\langle D_{KL}\rangle_T \sim 10^{-5}$ and the entropy $\langle H\rangle_T \sim 2$, making the information loss $I \sim 10^{-5}$). (C) Activities A for all neurons. The vertical black lines with dots indicate the original spike times $\xi$. Each thin line along the gray surface is the "activity" of one neuron as a function of time. There is a small amount of activity away from the spikes, but zeroing this affects the match between $p(s_T|\xi)$ and $\hat{p}(s_T|A)$ only marginally.
4.2.1 Activities. Given a set of projection kernels $g_i(s,t)$ from the previous section, we can go back and infer the optimal activities $A \geq 0$ of neurons by writing

$$\hat{p}(s_T|A) \propto \exp\left(\sum_{i,t} A_i(T-t)\, g_i(s,t)\right). \tag{4.8}$$
If we let $A_i(T-t) = \exp(B_i(T-t))$ and minimize with respect to $B$ the Kullback-Leibler divergence from the true posterior, we simultaneously enforce $A \geq 0$:

$$B_i(t) \leftarrow B_i(t) - \varepsilon \nabla_{B_i(t)} D_{KL}(p(s_T|\xi)\,||\,\hat{p}(s_T|A)). \tag{4.9}$$
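A sketch of this update (our own discretization, not the authors' code; for compactness, `B[i, t]` parameterizes the activity at lag `t` relative to a single inference time, and `p_true` is the target posterior on the spatial grid):

```python
import numpy as np

def activity_step(B, g, p_true, eps=0.1):
    """One step of equation 4.9: gradient descent on B with A = exp(B) >= 0.
    Shapes: B is (neurons, lags), g is (neurons, space, lags), p_true is (space,)."""
    A = np.exp(B)                                  # positivity is automatic
    log_p = np.einsum('it,ist->s', A, g)           # equation 4.8 in the log domain
    p_hat = np.exp(log_p - log_p.max())
    p_hat /= p_hat.sum()
    grad_A = np.einsum('s,ist->it', p_hat - p_true, g)  # dD_KL/dA (wake-sleep form)
    return B - eps * grad_A * A                    # chain rule: dD_KL/dB = dD_KL/dA * A
```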
The results of this procedure are shown for both the OU process (see Figure 16) and for the smooth process (see Figure 17). Figures 16A and 17A show the true spikes $\xi$ and the corresponding distribution $p(s_T|\xi)$. Figures 16B and 17B show the approximate distributions $\hat{p}(s_T|A)$ defined in equation 4.8 for the optimal activities $A$ inferred with equation 4.9. The continuous nature of the activation functions means that they can contain as much information as the distribution itself, and indeed we find empirically that arbitrarily close matches are possible (exemplified by the two figures; in both cases $\langle D_{KL}\rangle_T \sim 10^{-5}$). Figures 16C and 17C finally show the inferred activities $A$. Most important, we see that the inferred activities (one gray line for each neuron) are very sparse (in time and across neurons), suggesting that there might indeed be a set of (zero-one) spikes that leads to a good approximation via equation 4.1. On the other hand, the activities line up closely with the original spikes $\xi$ (vertical black lines with dots), and it may be that the approximations with the original spikes in the previous paragraph already gave us the best possible approximation. For the OU prior (see Figure 16), the activities in spikeless times are extremely small, and zeroing them does not significantly worsen the approximation with $\hat{p}(s_T|A)$. However, for the smooth prior (see Figure 17), there is residual activity between the peaks, the zeroing of which significantly worsens the quality of approximation by $\hat{p}(s_T|A)$ (data not shown).

Figure 17: Inferring activities A for the smooth prior. (A) True posterior $p(s_T|\xi)$. (B) Approximate posterior $\hat{p}(s_T|A)$, which matches arbitrarily well (for this example, $\langle D_{KL}\rangle_T \sim 10^{-5}$ and the entropy $\langle H\rangle_T \sim 2$, making the information loss $I \sim 10^{-5}$). (C) Activities A for all neurons. The vertical black lines with dots indicate the original spike times $\xi$. Each thin line along the gray surface is the "activity" of one neuron as a function of time. There is a small amount of activity away from the spikes, which allows the approximation $\hat{p}(s_T|A)$ to "bend" between spikes. Unlike in the OU case, zeroing this small activity affects the match between $p(s_T|\xi)$ and $\hat{p}(s_T|A)$ strongly.

The very close approximation found here is not surprising. The distribution to be matched and the activities are discretized on the same spatial grid and are both positive, real quantities. As we have not imposed any constraints on $A$ beyond positivity, the entropy of $A$ can match any entropy in the distributions $p(s_T|\xi)$. This, however, will change drastically when the activities are forced to be binary.

4.2.2 Spikes via Simulated Annealing. To check whether there exists in fact a set of spikes $\rho \neq \xi$ such that decoding according to equation 4.1 results in a posterior distribution $\hat{p}(s_T|\rho)$ that matches $p(s_T|\xi)$ closely for the smooth
prior, we assume, as before, the projection kernels $g_i(s,t)$ inferred above and find a set of spikes $\rho$ that minimizes $D_{KL}(p(s_T|\xi)\,||\,\hat{p}(s_T|\rho))$, where $\hat{p}(s_T|\rho)$ is given by

$$\hat{p}(s_T|\rho) = \frac{1}{Z(T)} \exp\left(\sum_i \sum_t g_i(s,t)\, \rho^i_{T-t}\right). \tag{4.10}$$

This is the same interpretation we gave the original spikes in equation 4.1. Our aim is thus to find a new set of spikes that satisfies

$$\rho = \arg\min_{\rho} D_{KL}(p(s_T|\xi)\,||\,\hat{p}(s_T|\rho)). \tag{4.11}$$
This is a highly nonconvex discrete problem, so we applied standard simulated annealing techniques.² Figure 18 shows the results. Figure 18A shows the distribution $p(s_T|\xi)$ based on the original spikes. Figure 18B shows $\hat{p}(s_T|\rho)$ using the projection kernels shown in Figure 13. The arrow in Figure 18B indicates where the new set of spikes performs better than the original, independently interpreted, spikes and matches the shifting distribution by adding a new spike. This is a qualitative improvement on what is possible by interpreting the original spikes according to equation 4.1 (see Figure 15) and is most pronounced for the smooth case. From the close match between $p(s_T|\xi)$ and $\hat{p}(s_T|A)$, we expect the overall minimum KL divergence to depend strongly on the projective kernels. Figure 18C shows the effect of scaling the inferred projection kernel $g_i(s,t)$. The increase in accuracy due to an increase in the number of spikes offsets the decrease in accuracy due to the absence of prior information. The more spikes are allowed, the closer this scheme is to the one where rates are allowed. Thus, the prior has here literally been replaced by spikes; the input spikes have been "augmented" with spikes that represent information contributed by past spikes in accordance with a prior over stimulus trajectories. Figure 18D shows relative average KL divergences over 100 sets of new spike trains, for different scalings of the $g_i(s,t)$. In general, the projection kernels found here form an overcomplete basis set. By scaling them down and allowing more spikes, we come closer to the setting in the previous section where we allowed continuous activities rather than 0-1 spikes.

² From the very strong sensitivity of our simulated annealing results to the procedure used to reduce the temperature, we infer that the optima are not very well separated, with a number of similar sets of spike trains doing approximately equally well. We rendered the procedure more global by evaluating, at each step, the decrease in cost that would accompany switching every spike and accepting one of the best switches probabilistically.

Figure 18: Inferring new spikes for the smooth prior. (A) Original spikes with $p(s_T|\xi)$. (B) $\hat{p}(s_T|\rho)$ with projection kernels $g_i(s,t)$ given by equation 4.5. (C) $\hat{p}(s_T|\rho)$ with scaled projection kernels $g_i(s,t)/4$. Note the increased firing rate. Arrows explained in the text. (D) Percentage change in KL divergences between $p(s_T|\xi)$ and $\hat{p}(s_T|\rho)$ for projection kernels scaled by factors of 1/2 and 1/4 relative to the KL divergence of the unscaled kernel. (E) Cross-correlations between neurons for a few different time lags. Black lines: original spikes $\xi$. Gray lines: recoded spikes $\rho$ for unscaled projection kernels. The autocorrelation has been scaled to unity, except at zero lag, where the autocorrelation was excluded and scaling was performed with respect to the maximal cross-correlation.

Note the different coding strategy indicated by the arrows, especially in Figure 18C. Here, spikes are positioned such that they take into account
what has already been expressed by previous spikes: spikes are positioned with respect to the distribution that has already been encoded. To put it another way, there are explicit relations among the new spikes that are not directly explained by the stimulus itself. Figure 18E shows this more clearly. The black traces show the stimulus-based (“signal”) correlations of the original spikes ξ . The gray lines show the correlations of the recoded spikes ρ. At lag 0 (bottom of the Figure), flanks appear in the cross-correlation functions, but at greater lags, the cross-correlations are flatter for the recoded than for the original spikes. Requiring independently decodable spikes has introduced instantaneous correlations and flattened the spatial profile of cross-correlations over time—a sign of adaptation to temporal statistics. Thus, here we find that the maintenance of a simple code results in the emergence of what appear to be adaptive properties.
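For concreteness, a generic single-flip annealer for equation 4.11 might look like the following (our sketch, with hypothetical hyperparameters; the authors' variant, per footnote 2, is more global, evaluating the cost change of every possible flip and accepting one of the best probabilistically):

```python
import numpy as np

rng = np.random.default_rng(0)

def anneal_spikes(cost, shape, temp=1.0, decay=0.999, steps=20000):
    """Minimize cost(rho) over binary spike arrays rho (equation 4.11).
    `cost` should return the summed KL divergence DKL(p(s_T|xi) || p_hat(s_T|rho))."""
    rho = np.zeros(shape, dtype=int)
    c = cost(rho)
    best, best_c = rho.copy(), c
    for _ in range(steps):
        i, t = rng.integers(shape[0]), rng.integers(shape[1])
        rho[i, t] ^= 1                                   # propose flipping one spike
        c_new = cost(rho)
        if c_new < c or rng.random() < np.exp((c - c_new) / temp):
            c = c_new                                    # accept the proposal
            if c < best_c:
                best, best_c = rho.copy(), c
        else:
            rho[i, t] ^= 1                               # reject: undo the flip
        temp *= decay                                    # cooling schedule
    return best
```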
5 Discussion

Here, we have analyzed the structure of a Bayesian, optimal decoder in a simple, analytically tractable model. The results are a direct generalization of decoding in the static gaussian-Poisson encoding model (Snippe & Koenderinck, 1992). We show that the structure of the decoder depends on the prior over stimulus trajectories in time, that realistic priors render decoding hard (nonlocal in time and space), and that an independent code in which information is readily available for computational purposes exists. Finally, we showed that apparently adaptive properties of a coding scheme may result from the requirement of a constant, simple code. We are currently working on a biologically plausible network that approximates this recoding and uses the resulting code for flexible probabilistic computations (see also Zemel et al., 2004).

The main innovation in our work is the way we used the gaussian process prior over stimulus trajectories. Figure 7 indicates that the exponential prior with $\zeta = 2$ is a good model of natural movements, as they tend to be smooth. Classically, natural stimuli have been characterized as having autocorrelation structures that fall off exponentially (Dong & Atick, 1995), corresponding to a power spectrum that falls off as a power law function of frequency ($\propto 1/\omega^b$). However, smooth trajectories in time have a much faster (exponential) spectral falloff, $\propto \exp(-\omega^2)$. Most previous work has assumed priors within the OU class (Brown et al., 1998; Smith & Brown, 2003; Barbieri et al., 2004; Kemere et al., 2004; Gao, Black, Bienenstock, Shoham, & Donoghue, 2002), perhaps because of the recursive formulation of decoding. However, Zhang et al. (1998) use a two-step Bayesian decoder corresponding to a second-order autoregressive process (AR(2)) with coefficients that fall off as squared exponentials (their equation 43). This two-step decoder is much more competent than a one-step decoder (corresponding to an AR(1) process) on hippocampal place cell data. Of course, more complex priors are also possible. For instance, Kemere et al. (2004) point one way forward. They showed the benefits for decoding from motor cortex spikes of using a rich, modular prior based on separate models for each of the (seven) possible arm movements to be extracted.

In terms of our framework, Figure 15A illustrates the cost of neglecting the prior temporal structure and treating all spikes independently. The differences between inference in the smooth and OU case (e.g., the overshoot in Figure 8, which is not seen in Figure 4) also indicate qualitatively what information is lost by applying Kalman-filter-like formulations to decoding. The absolute magnitude of this effect depends on the specifics of the true model, and so remains an empirical question for psychophysical or physiological test. If spikes are dense relative to the movement in the stimulus (the likelihood term in equation 2.2 dominates, either via a very small noise (low $\sigma$) or high firing rates (large $\phi_{max}$)), the contribution of the
prior will be small, and approximation by a recursive prior (as in Brown et al., 1998; Zhang et al., 1998) may suffice. However, if spikes are sparse, the prior will be more important and approximations more costly. Finally, in many cases, the correct prior can be acquired only from experience, which itself may be costly. While it is sensible to expect nervous systems to acquire detailed and correct informative priors (Körding & Wolpert, 2004; Körding, Ku, & Wolpert, 2004; Adams, Graf, & Ernst, 2004), it remains to be seen whether incorporation of informative priors is generally feasible for decoding applications in engineering domains (e.g., brain-machine interfaces).

The historical approach to population coding is based on ideas of Fisher information. The Fisher information arises from notions of asymptotic normality where there are many "data"—long spike-counting windows and many neurons. In the asymptotic limit, the posterior distribution is well approximated by a gaussian with width $(JI_F)^{-1}$, where $J$ is the number of data points, or spikes in our case. This is a linear expansion where each data point (spike) contributes the same amount $1/I_F$ to the variance of the posterior. We, like others before us (Brunel & Nadal, 1998; Bethge, Rotermund, & Pawelzik, 2002), are interested in the case where the population as a whole has emitted few spikes, as indicated in Figure 1—in regimes far from the asymptotic limit. For us, spikes can contribute varied amounts: some (typically the most recent) very much more than others (the most distant). As an analog of Fisher information, it is possible to study the dependence of the posterior variance $\nu^2(T)$ on the width of the encoding tuning functions $\sigma^2$ and the dimensionality. In our simple model, we find results similar to previously reported ones (Snippe & Koenderinck, 1992; Zhang & Sejnowski, 1999) (data not shown). However, as we are always in the sparse spike limit, the information per spike is of most relevance, and the posterior variance is strictly increasing in $\sigma$, the width of the encoding tuning functions, independent of the dimensionality. If there were dense spiking, the population firing rate (Zhang & Sejnowski, 1999; Silberberg, Bethge, Markram, Pawelzik, & Tsodyks, 2004; DeCharms & Zador, 2000; Knight, 1972) might carry enough information to overwhelm any prior.

Three assumptions about the encoding model merit discussion. First, the bell-shaped form of the tuning functions, asymptoting at 0, is only very roughly realistic. However, the arguments in sections 3.2 and 3.3 about the recursive and nonrecursive structure of the decoder depend on the nature of the prior ($\zeta = 1$ and $\zeta = 2$), and so will generalize. The most fundamental change would be that the variance of the posterior would depend not only on spike timing but also on the relative tuning preferences of the neurons that emit the spikes.

Second, we assume an instantaneous relationship between the rate of the inhomogeneous Poisson process and the stimulus. This should be formally straightforward to relax if the dependence on the stimulus history can be approximated by a linear filter (a discrete sum), as is standard for
linear-nonlinear-Poisson-like model neurons (Paninski, 2003). In that case, the likelihood term in equation 2.5 will become a function of the stimulus at a number of times, each of which enters equation 2.2. Each spike will thus contribute as many entries to the covariance matrix as its linear filter extends in time. The extra complexity of decoding in this case is not completely clear.

The third questionable assumption is that spiking is independent across time (the Poisson process assumption) and neurons. The actual degree of independence is, of course, hotly debated (Averbeck et al., 2006; DeCharms & Zador, 2000). In fact, the signal correlations that we assume induce some of the same issues for decoding as the noise correlations discussed in those references. Thus, our work can be seen as casting this debate in a slightly different light by showing that it is possible to recode correlated spiking in a particular, independently interpretable, form. Our companion paper (Natarajan et al., 2006) considers a network- (rather than a simulated-annealing-) based implementation of recoding. Note that Nirenberg, Carcieri, Jacobs, and Latham (2001) have studied independent interpretability (in the spikes of retinal ganglion cells) in a less model-dependent manner.

The advantage of independent interpretability (based on equation 4.1) is not confined to decoding. For instance, combining information from different modalities—as in multisensory integration (Ernst & Banks, 2002; Hillis, Ernst, Banks, & Landy, 2002) or sensorimotor integration (Zemel et al., 2005; Körding & Wolpert, 2004)—becomes straightforward and requires only an addition in the log domain or single-neuron (Chance, Abbott, & Reyes, 2002; Salinas & Abbott, 1996; Poirazi, Brannon, & Mel, 2003a, 2003b) or population (Deneve et al., 2001) multiplication.

Recoding produces spike trains that lack temporal redundancy. It is the exact dynamic analog of efficient coding approaches based on producing population activities lacking, for example, spatial correlations (Srinivasan, Laughlin, & Dubs, 1982; Atick, 1992; Nirenberg et al., 2001). As such, it displays phenomena that are strongly reminiscent of adaptation in the static domain. Note, however, that the rationale for adaptation here is different—it is a by-product of the requirement for a computationally efficient code. This rationale may find application outside our particular domain.

Finally, the probabilistic coding here arises only from the ill-posed nature of recovering the stimulus trajectory from a sparse set of noisy spikes. A more fundamental form of so-called computational uncertainty (Zemel, Dayan, & Pouget, 1998) arises in cases such as the aperture problem (Weiss & Adelson, 1998), when the information in the sensory input is ambiguous in a way that does not depend on noise. Various approaches to computational uncertainty have been suggested in the static case (Anderson, 1994; Barber, Clark, & Anderson, 2003; Zemel et al., 1998; Sahani & Dayan, 2003), but their extension to our dynamic framework remains an open problem.
Appendix A: Matrix Inversion Lemmas

For a matrix partitioned as in equations 2.8, or generally

$$E = \begin{pmatrix} A & B \\ C & D \end{pmatrix},$$

it can be shown that the following equalities hold by inverting $E$:

$$(D - CA^{-1}B)^{-1} = D^{-1} + D^{-1}C(A - BD^{-1}C)^{-1}BD^{-1} \tag{A.1}$$
$$(D - CA^{-1}B)^{-1}CA^{-1} = D^{-1}C(A - BD^{-1}C)^{-1}. \tag{A.2}$$

These identities allow us to perform the final multiplication $p(\xi|s_T)p(s_T)$, here written in the log domain, and then to renormalize:

$$\log(p(\xi|s_T)p(s_T)) = -\frac{1}{2}\Big[ s_T\big(C_{TT}^{-1} + C_{TT}^{-1}C_{T\xi}(C_{\xi\xi} - C_{\xi T}C_{TT}^{-1}C_{T\xi} + I\sigma^2)^{-1}C_{\xi T}C_{TT}^{-1}\big)s_T - 2s_T C_{TT}^{-1}C_{T\xi}(C_{\xi\xi} - C_{\xi T}C_{TT}^{-1}C_{T\xi} + I\sigma^2)^{-1}\cdot\theta \Big] + \text{const.},$$

thus

$$\nu^2(T) = \big(C_{TT}^{-1} + C_{TT}^{-1}C_{T\xi}(C_{\xi\xi} + I\sigma^2 - C_{\xi T}C_{TT}^{-1}C_{T\xi})^{-1}C_{\xi T}C_{TT}^{-1}\big)^{-1}.$$

Making the following substitutions,

$$A = C_{\xi\xi} + I\sigma^2, \quad B = C_{\xi T}, \quad C = C_{T\xi}, \quad D = C_{TT},$$

allows us to apply equation A.1 and write out the variance in equation 2.10. The mean finally is obtained by writing out

$$\mu(T) = \nu^2(T)\, C_{TT}^{-1}C_{T\xi}(C_{\xi\xi} - C_{\xi T}C_{TT}^{-1}C_{T\xi} + I\sigma^2)^{-1}\cdot\theta = (D - CA^{-1}B)\,D^{-1}C(A - BD^{-1}C)^{-1}\cdot\theta$$

and applying equation A.2 to directly yield equation 2.9.
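The two identities are easy to check numerically (a sketch with a random positive definite test matrix; the block sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 3
M = rng.normal(size=(n + m, n + m))
E = M @ M.T + (n + m) * np.eye(n + m)   # symmetric positive definite, so blocks invert
A, B = E[:n, :n], E[:n, n:]
C, D = E[n:, :n], E[n:, n:]
inv = np.linalg.inv

lhs1 = inv(D - C @ inv(A) @ B)                                    # equation A.1
rhs1 = inv(D) + inv(D) @ C @ inv(A - B @ inv(D) @ C) @ B @ inv(D)
lhs2 = inv(D - C @ inv(A) @ B) @ C @ inv(A)                       # equation A.2
rhs2 = inv(D) @ C @ inv(A - B @ inv(D) @ C)
print(np.allclose(lhs1, rhs1), np.allclose(lhs2, rhs2))           # True True
```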
Appendix B: OU Process

Replacing each of the ISIs by the average value $\langle\Delta\rangle$, we get a Kac-Murdock-Szegö Toeplitz matrix for which the analytical inverse is (Dow, 2003):

$$C = c\begin{pmatrix} 1 & \rho & \rho^2 & \rho^3 \\ \rho & 1 & \rho & \rho^2 \\ \rho^2 & \rho & 1 & \rho \\ \rho^3 & \rho^2 & \rho & 1 \end{pmatrix}, \qquad C^{-1} = \frac{1}{c(1-\rho^2)}\begin{pmatrix} 1 & -\rho & 0 & 0 \\ -\rho & 1+\rho^2 & -\rho & 0 \\ 0 & -\rho & 1+\rho^2 & -\rho \\ 0 & 0 & -\rho & 1 \end{pmatrix},$$
where $\rho = \exp(-\alpha\langle\Delta\rangle)$. Rewriting equation 2.11 as $k(\xi, T) = C_{T\xi}C_{\xi\xi}^{-1}(C_{\xi\xi}^{-1} + I/\sigma^2)^{-1}/\sigma^2$, we note that $C_{T\xi}C_{\xi\xi}^{-1} \approx \delta_{i1}$; only the first component of this vector is one, and all others are zero. The second factor is $A = (C_{\xi\xi}^{-1} + I/\sigma^2)^{-1}$, where

$$A^{-1} = (C^{-1} + I/\sigma^2) = \frac{1}{(a-1)\sigma^2}\begin{pmatrix} a & -\rho & 0 & 0 & 0 \\ -\rho & a+\rho^2 & -\rho & 0 & 0 \\ 0 & -\rho & a+\rho^2 & -\rho & 0 \\ 0 & 0 & -\rho & a+\rho^2 & -\rho \\ 0 & 0 & 0 & -\rho & a \end{pmatrix},$$

where $a = \frac{c}{\sigma^2}(1-\rho^2) + 1$. We know $A^{-1}A = I$. Neglecting the prefactor for a moment, the first row of $A$ (which is the one of interest) therefore has to satisfy the following recurrence relation:

$$A_{2,1} = (aA_{1,1} - 1)/\rho \tag{B.1}$$
$$A_{k+2,1} = (a/\rho + \rho)A_{k+1,1} - A_{k,1} \quad \text{for } N > 3 \tag{B.2}$$
$$A_{N,1}/A_{N-1,1} = \rho/a. \tag{B.3}$$
Equation B.2 is a simple two-term linear recurrence equation and can be solved with boundary conditions given by equations B.1 and B.3. The characteristic equation of equation B.2 is $r^2 - (a/\rho + \rho)r + 1 = 0$, with real roots

$$\lambda_{1,2} = \frac{1}{2}\left( a/\rho + \rho \pm \sqrt{(a/\rho + \rho)^2 - 4} \right).$$

Including the boundary conditions leads to a solution

$$A_{n,1} = d_1\lambda_1^{n-1} + d_2\lambda_2^{n-1}$$
$$d_1 = \left( a - \lambda_1\rho - (a - \lambda_2\rho)\,\frac{a\lambda_1 - \rho}{a\lambda_2 - \rho}\left(\frac{\lambda_1}{\lambda_2}\right)^{N-2} \right)^{-1}$$
$$d_2 = \frac{1 - d_1(a - \lambda_1\rho)}{a - \lambda_2\rho}.$$
One of the eigenvalues will always be greater than 1, the other less than 1, but both are positive. As $C_{\xi\xi}$ is symmetric, so are $A^{-1}$ and $A$, and the first column of $A$ is equal to its first row, which we pick out by premultiplying with $C_{T\xi}C_{\xi\xi}^{-1}$. This vector $A_{1,1:N}$ is exactly the sum of two exponentials we saw when using regular spikes to infer the temporal kernel $k(\xi, T)$, and the $n$th component of $k(\xi, T)$, $k_n$, is given by

$$k_n = [C_{T\xi}(C_{\xi\xi} + I\sigma^2)^{-1}]_n = (a-1)\sigma^2 A_{n,1} = (a-1)\sigma^2\big(d_1\lambda_1^{n-1} + d_2\lambda_2^{n-1}\big). \tag{B.4}$$

If $\lambda_1$ is the larger eigenvalue, we see that the corresponding coefficient $d_1$ will be $\approx (\lambda_2/\lambda_1)^N$, which is very small. The contribution of the larger $\lambda$ will grow only very slowly and be seen only for the very distant spikes. On the other hand, $d_2$ will be $\approx 1/(a - \lambda_2\rho)$. For all intents and purposes, the temporal kernel will be decaying exponentially with a negative spike time constant $\log\lambda_2$. Furthermore, if the second boundary condition (for time 0) is moved to $-\infty$, the result is a pure exponential. Both the analytical and numerical kernels are plotted in Figure 5.

Relaxing the assumption of metronomic spiking gives a matrix $A^{-1}$ that is still tridiagonal but the elements of which are not equal. Writing matrix $C$ as

$$C = \begin{pmatrix} 1 & a & ab & abd \\ a & 1 & b & bd \\ ab & b & 1 & d \\ abd & bd & d & 1 \end{pmatrix}, \qquad C^{-1} = \begin{pmatrix} \frac{1}{1-a^2} & -\frac{a}{1-a^2} & 0 & 0 \\ -\frac{a}{1-a^2} & \frac{1-a^2b^2}{(1-a^2)(1-b^2)} & -\frac{b}{1-b^2} & 0 \\ 0 & -\frac{b}{1-b^2} & \frac{1-b^2d^2}{(1-b^2)(1-d^2)} & -\frac{d}{1-d^2} \\ 0 & 0 & -\frac{d}{1-d^2} & \frac{1}{1-d^2} \end{pmatrix}, \tag{B.5}$$

where $a = ce^{-\alpha|t_1-t_2|}$, $b = ce^{-\alpha|t_2-t_3|}$, and so on, leads to a set of equations similar to equations B.2 and B.3 but including more terms.

Appendix C: Autoregressive Processes of Second and Higher Order

An $n$th-order gaussian autoregressive sequence of length $T$ as produced by equation 3.8 can be written as a sample from a multivariate normal distribution in the following way: Let $b = [1, -\beta_1, -\beta_2, -\beta_3, \ldots, -\beta_n]$, and let $B_t = [0_t, b, 0_{T-n-t}]$, where $0_t$ stands for a vector of zeros of length $t$. The inverse covariance matrix of the process is given by

$$C^{-1} = \sum_{t=0}^{T-n-1} B_t B_t^T. \tag{C.1}$$
For the coefficients of $b$ to define a stationary and finite process, $C$ must be Toeplitz. One way of generating a finite process from the $b$ is by letting the $n$th derivative of the process evolve as an OU process,

$$s_t^{(n)} = \beta_0 s_{t-1}^{(n)} + \sqrt{c}\,\eta_t, \tag{C.2}$$

in which case the coefficients of the vector $b$ are given by

$$\beta_i = \binom{n}{i}\beta_0(-\beta_0)^{i-1}, \tag{C.3}$$

where $\binom{n}{i}$ is the binomial coefficient.
where n Ci is the binomial coefficient. To enforce stationarity, we have to finally perform a subtraction, C
−1
=
T−1
Bt BtT
t=0
−
T
Bt−1 Bt−T ,
(C.4)
t =T−n
where we abuse notation and B −1 stands for Bt−1 = [0t , β N , β N−1 , . . . , β1 , 0T−n−t ]. Acknowledgments We thank Jeff Beck, Sophie Den`eve, Peter Latham, M´at´e Lengyel, Liam Paninski, and Peggy Seri`es for helpful discussions and for reading versions of the manuscript. This work was supported by the BIBA consortium (Q.H., P.D.), the UCL Medical School M.B./Ph.D. program (Q.H.), the Gatsby Charitable Foundation (P.D.), and NSERC and CIHRC NET program (R.N., R.Z.). References Adams, W. J., Graf, E. W., & Ernst, M. O. (2004). Experience can change the lightfrom-above prior. Nat. Neurosci., 7(10), 1057–1058. Anderson, C. H. (1994). Basic elements of biological computational systems. Int. J. Modern Physics C, 5(2), 135–137. Atick, J. J. (1992). Could information theory provide an ecological theory of sensory processing? Network, 3, 213–251.
Averbeck, B. B., Latham, P. E., & Pouget, A. (2006). Neural correlations, population coding and computation. Nat. Rev. Neurosci., 7(5), 358–366.
Barber, M. J., Clark, J. W., & Anderson, C. H. (2003). Neural representation of probabilistic information. Neural Comp., 15, 1843–1864.
Barbieri, R., Frank, L. M., Nguyen, D. P., Quirk, M. C., Solo, V., Wilson, M. A., & Brown, E. N. (2004). Dynamic analyses of information encoding in neural ensembles. Neural Comp., 16, 277–307.
Barlow, H. B. (1953). Summation and inhibition in the frog's retina. J. Physiol., 137, 69–88.
Bethge, M., Rotermund, D., & Pawelzik, K. (2002). Optimal short-term population coding: When Fisher information fails. Neural Comp., 14(2), 303–319.
Bialek, W., Rieke, F., de Ruyter van Steveninck, R. R., & Warland, D. (1991). Reading a neural code. Science, 252(5014), 1854–1857.
Brown, E. N., Frank, L. M., Tang, D., Quirk, M. C., & Wilson, M. A. (1998). A statistical paradigm for neural spike train decoding applied to position prediction from ensemble firing patterns of rat hippocampal place cells. J. Neurosci., 18(18), 7411–7425.
Brunel, N., & Nadal, J.-P. (1998). Mutual information, Fisher information, and population coding. Neural Comp., 10, 1731–1757.
Chance, F. S., Abbott, L. F., & Reyes, A. D. (2002). Gain modulation from background synaptic input. Neuron, 35(4), 773–782.
DeCharms, C., & Zador, A. (2000). Neural representation and the cortical code. Annu. Rev. Neurosci., 23, 613–647.
Deneve, S., Latham, P. E., & Pouget, A. (2001). Efficient computation and cue integration with noisy population codes. Nat. Neurosci., 4(8), 826–831.
Dong, D. W., & Atick, J. J. (1995). Statistics of natural time-varying images. Network: Computation in Neural Systems, 6, 345–358.
Dow, M. (2003). Explicit inverses of Toeplitz and associated matrices. Austr. New Zealand Industr. Appl. Math. J. E, 44, E185–E215.
Ernst, M. O., & Banks, M. S. (2002). Humans integrate visual and haptic information in a statistically optimal fashion. Nature, 415(6870), 429–433.
Gao, Y., Black, M. J., Bienenstock, E., Shoham, S., & Donoghue, J. P. (2002). Probabilistic inference of hand motion from neural activity in motor cortex. In T. G. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14. Cambridge, MA: MIT Press.
Georgopoulos, A. P., Schwartz, A. B., & Kettner, R. E. (1986). Neuronal population coding of movement direction. Science, 233(4771), 1416–1419.
Hillis, J. M., Ernst, M. O., Banks, M. S., & Landy, M. S. (2002). Combining sensory information: Mandatory fusion within, but not between, senses. Science, 298(5598), 1627–1630.
Hinton, G. E. (1999). Products of experts. In Ninth International Conference on Artificial Neural Networks (ICANN 9) (Vol. 1, pp. 1–6). Piscataway, NJ: IEEE.
Hinton, G. E., Dayan, P., Frey, B. J., & Neal, R. M. (1995). The wake-sleep algorithm for unsupervised neural networks. Science, 268(5214), 1158–1161.
Johansson, R. S., & Birznieks, I. (2004). First spikes in ensembles of human tactile afferents code complex spatial fingertip events. Nat. Neurosci., 7, 170–177.
Kemere, C., Santhanam, G., Yu, B. M., Ryu, S., Meng, T., & Shenoy, K. V. (2004). Model-based decoding of reaching movements for prosthetic systems. In Proceedings of the 26th Annual International Conference of the IEEE EMBS (pp. 4524–4528). Piscataway, NJ: IEEE Press.
Knight, B. W. (1972). Dynamics of encoding in a population of neurons. J. Gen. Physiol., 59, 734–766.
Körding, K. P., Ku, S. P., & Wolpert, D. M. (2004). Bayesian integration in force estimation. J. Neurophysiol., 92(5), 3161–3165.
Körding, K., & Wolpert, D. M. (2004). Bayesian integration in sensorimotor learning. Nature, 427(6971), 244–247.
Lever, C., Wills, T., Cacucci, F., Burgess, N., & O'Keefe, J. (2002). Long-term plasticity in the hippocampal place cell representation of environmental geometry. Nature, 426, 90–94.
MacKay, D. J. (2003). Information theory, inference and learning algorithms. Cambridge: Cambridge University Press.
Nirenberg, S., Carcieri, S. M., Jacobs, A. L., & Latham, P. E. (2001). Retinal ganglion cells act largely as independent encoders. Nature, 411, 698–701.
Paninski, L. (2003). Convergence properties of three spike-triggered analysis techniques. Network, 14(3), 437–464.
Paradiso, M. A. (1988). A theory for the use of visual orientation information which exploits the columnar structure of striate cortex. Biol. Cybern., 58(1), 35–49.
Poirazi, P., Brannon, T., & Mel, B. W. (2003a). Arithmetic of subthreshold synaptic summation in a model CA1 pyramidal cell. Neuron, 37, 977–987.
Poirazi, P., Brannon, T., & Mel, B. W. (2003b). Pyramidal neuron as 2-layer neural network. Neuron, 37, 989–999.
Pouget, A., Dayan, P., & Zemel, R. S. (2003). Inference and computation with population codes. Annu. Rev. Neurosci., 26, 381–410.
Pouget, A., Zhang, K., Deneve, S., & Latham, P. E. (1998). Statistically efficient estimation using population coding. Neural Comp., 10(2), 373–401.
Rao, R. P. N., Olshausen, B. A., & Lewicki, M. S. (Eds.). (2002). Probabilistic models of the brain: Perception and neural function. Cambridge, MA: MIT Press.
Reinagel, P., & Reid, R. C. (2000). Temporal coding of visual information in the thalamus. J. Neurosci., 20(14), 5392–5400.
Rieke, F., Warland, D., de Ruyter van Steveninck, R., & Bialek, W. (1997). Spikes: Exploring the neural code. Cambridge, MA: MIT Press.
Sahani, M., & Dayan, P. (2003). Doubly distributional population codes: Simultaneous representation of uncertainty and multiplicity. Neural Comp., 15, 2255–2279.
Salinas, E., & Abbott, L. F. (1996). A model of multiplicative neural responses in parietal cortex. Proc. Natl. Acad. Sci. USA, 93, 11956–11961.
Schwartz, A. B. (1994). Direct cortical representation of drawing. Science, 265, 540–543.
Seung, S. H., & Sompolinsky, H. (1993). Simple models for reading neuronal population codes. Proc. Natl. Acad. Sci. USA, 90(22), 10749–10753.
Silberberg, G., Bethge, M., Markram, H., Pawelzik, K., & Tsodyks, M. (2004). Dynamics of population rate codes in ensembles of neocortical neurons. J. Neurophysiol., 91(2), 704–709.
Smith, A. C., & Brown, E. N. (2003). Estimating a state-space model from point process observations. Neural Comp., 15, 965–991.
Snippe, H. P., & Koenderinck, J. J. (1992). Discrimination thresholds for channel-coded systems. Biol. Cybern., 66, 543–551.
Srinivasan, M. V., Laughlin, S. B., & Dubs, A. (1982). Predictive coding: A fresh view of inhibition in the retina. Proc. R. Soc. Lond. B Biol. Sci., 216(1253), 427–459.
Twum-Danso, N., & Brockett, R. (2001). Trajectory estimation from place cell data. Neural Networks, 14, 835–844.
Van Rullen, R., & Thorpe, S. J. (2001). Rate coding versus temporal order coding: What the retinal ganglion cells tell the visual cortex. Neural Comp., 13(6), 1255–1283.
Wang, X.-J., Liu, Y., Sanchez-Vives, M. V., & McCormick, D. A. (2003). Adaptation and temporal decorrelation by single neurons in the primary visual cortex. J. Neurophysiol., 89(6), 3279–3293.
Weiss, Y., & Adelson, E. H. (1998). Slow and smooth: A Bayesian theory for the combination of local motion signals in human vision (A.I. Memo 1624). Cambridge, MA: MIT, Artificial Intelligence Laboratory.
Wilson, M. A., & McNaughton, B. L. (1993). Dynamics of the hippocampal ensemble code for space. Science, 261(5124), 1055–1058.
Zemel, R. S., Dayan, P., & Pouget, A. (1998). Probabilistic interpretation of population codes. Neural Comp., 10(2), 403–430.
Zemel, R. S., Huys, Q. J. M., Natarajan, R., & Dayan, P. (2005). Probabilistic computation in spiking populations. In L. K. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing systems, 17 (pp. 1609–1616). Cambridge, MA: MIT Press.
Zhang, K., Ginzburg, I., McNaughton, B. L., & Sejnowski, T. J. (1998). Interpreting neuronal population activity by reconstruction: Unified framework with application to hippocampal place cells. J. Neurophysiol., 79, 1017–1044.
Zhang, K., & Sejnowski, T. J. (1999). Neuronal tuning: To sharpen or to broaden? Neural Comp., 11(1), 75–84.
Received October 17, 2005; accepted June 17, 2006.
LETTER
Communicated by Joshua Gold
The Basal Ganglia and Cortex Implement Optimal Decision Making Between Alternative Actions Rafal Bogacz [email protected] Department of Computer Science, University of Bristol, Bristol BS8 1UB, U.K.
Kevin Gurney [email protected] Department of Psychology, University of Sheffield, Sheffield S10 2TP, U.K.
Neurophysiological studies have identified a number of brain regions critically involved in solving the problem of action selection or decision making. In the case of highly practiced tasks, these regions include cortical areas hypothesized to integrate evidence supporting alternative actions and the basal ganglia, hypothesized to act as a central switch in gating behavioral requests. However, despite our relatively detailed knowledge of basal ganglia biology and its connectivity with the cortex and numerical simulation studies demonstrating selective function, no formal theoretical framework exists that supplies an algorithmic description of these circuits. This article shows how many aspects of the anatomy and physiology of the circuit involving the cortex and basal ganglia are exactly those required to implement the computation defined by an asymptotically optimal statistical test for decision making: the multihypothesis sequential probability ratio test (MSPRT). The resulting model of basal ganglia provides a new framework for understanding the computation in the basal ganglia during decision making in highly practiced tasks. The predictions of the theory concerning the properties of particular neuronal populations are validated in existing experimental data. Further, we show that this neurobiologically grounded implementation of MSPRT outperforms other candidates for neural decision making, that it is structurally and parametrically robust, and that it can accommodate cortical mechanisms for decision making in a way that complements those in basal ganglia.

Neural Computation 19, 442–477 (2007)
© 2007 Massachusetts Institute of Technology

1 Introduction

Recent experimental results have established that both the cortex and the basal ganglia are involved in decision making between alternative actions (Chevalier, Vacher, Deniau, & Desban, 1985; Deniau & Chevalier, 1985; Medina & Reiner, 1995; Redgrave, Prescott, & Gurney, 1999; Schall, 2001; Shadlen
& Newsome, 2001; Smith, Bevan, Shink, & Bolam, 1998). However, it is necessary to distinguish between two phases of developing task competence (Ashby & Spiering, 2004; Shadmehr & Holcomb, 1997). In the acquisition or learning phase, actions appropriate in a given behavioral state (i.e., a combination of ongoing behavior and stimulus) are being developed, usually driven by external reward. In this phase, the main difficulty lies in finding the optimal policy: the mapping between states and actions that maximizes reward (Sutton & Barto, 1998). Consistent with this requirement, there is a great deal of evidence that the basal ganglia act as a substrate for reinforcement learning (O'Doherty et al., 2004; Samejima, Ueda, Doya, & Kimura, 2005; Schultz, Dayan, & Montague, 1997), and several models for this process have been proposed (Doya, 2000; Frank, Seeberger, & O'Reilly, 2004; Montague, Dayan, & Sejnowski, 1996). In contrast, in the proficient phase, the mapping between stimulus and an appropriate response is well established, and the requirement is one of performing action selection or decision making. This consists of identifying the current behavioral state and executing the known appropriate action as soon as a certain level of confidence in the identification is reached (Gold & Shadlen, 2001, 2002). There is a great deal of evidence (reviewed below) for the involvement of the basal ganglia in action selection.

We view the hypotheses that basal ganglia perform reinforcement learning and that they perform action selection as complementary. Thus, at any particular time, the basal ganglia perform action selection proficiently with respect to a suite of alternatives that have already been learned. Subsequent learning phases will modify that suite of alternatives and shape the profile of selections that can be made. In this article, we focus on the neural mechanisms underlying decision making in the proficient phase. As such, this is the phase under discussion if no qualifier is specified. We return to the relation of our work to task acquisition and the learning phase in section 6.

1.1 Action Selection and Decision Making

Experimental data show that during the decision process in visual discrimination tasks, neurons in cortical areas representing alternative actions gradually increase their firing rate, thereby accumulating evidence supporting these alternatives (Schall, 2001; Shadlen & Newsome, 2001). Hence, the models of decision making based on neurophysiological data (Shadlen & Newsome, 2001; Wang, 2002) assume that there exist connections from neurons representing stimuli to the appropriate cortical neurons representing actions (these connections may develop during the months of training the animals undergo before these experiments). These cortical connections are assumed to encode the stimulus-response mapping. However, even in simple, highly constrained laboratory tasks, there will be more than one possible response, and so there is a problem of action selection in which the representation for the correct response has to take control of the animal's motor plant. In natural, ethological settings, this problem is
exacerbated because, under these circumstances, there are usually multiple, complex sensory streams demanding a variety of behaviors. The problem of action selection was recently addressed by Redgrave et al. (1999), who conceived of it as the resolution of conflict between command centers throughout the brain competing for behavioral expression. These authors examined the problem from a computational perspective and argued that competitions between brain centers vying for expression were best resolved by a central switch examining the urgency or salience of each action request, and that anatomical and physiological evidence pointed to the basal ganglia as the neural substrate for this switch. Thus, the basal ganglia receive widespread input from all over the brain (Parent & Hazrati, 1995a) and, in their quiescent state, send tonic inhibition to midbrain and brain stem targets implicated in executing motor actions, thus blocking cortical control over these actions (Chevalier et al., 1985; Deniau & Chevalier, 1985). Actions are supposed to be selected when neurons in the output nuclei have their activity reduced (under control of the rest of the basal ganglia), thereby disinhibiting their targets (Chevalier et al., 1985; Deniau & Chevalier, 1985). The selection hypothesis for basal ganglia has been tested in biologically realistic computational models in a variety of anatomical contexts by several authors (Brown, Bullock, & Grossberg, 2004; Frank, 2005; Gurney, Prescott, & Redgrave, 2001a, 2001b; Humphries & Gurney, 2002).

In sum, the research reviewed above indicates that during decision making among alternative actions, cortical regions associated with the alternatives integrate evidence supporting each one and that the basal ganglia act as a central switch by evaluating this evidence and enabling those behavioral requests that are best supported (most salient).

1.2 Scope of the Letter

We have argued that between bouts of learning—that is, during proficient phases of activity—the primary computational role of the basal ganglia is to act as an action selection mechanism, mediating resolution of the action selection problem by gating behavioral requests. It is this mode of basal ganglia operation that we address in this article. Biologically realistic network models (Gurney et al., 2001a, 2001b) have shown how the basal ganglia could perform the required selection computation. However, such models fail to elucidate possible analytic descriptions of the computation (i.e., selection) that provide it with a theoretical grounding. This letter provides an analytic description of the function of a circuit involving cortex and basal ganglia by showing how an optimal abstract decision algorithm maps onto the anatomy and physiology of this circuit.

The main goal of this letter is to provide a new algorithmic framework for understanding the computation in the basal ganglia in the proficient phase. The algorithm relates to computations being performed at the systems level of description of the basal ganglia, that is, considering the circuit to be a
set of interacting neuronal populations described by their overall firing rate (Dayan, 2001). This does not preclude the possibility that computations may be performed at other levels of description (Gurney, Prescott, Wickens, & Redgrave, 2004) dealing with microcircuits, membranes, or molecular signalling pathways. However, as far as computation performed by virtue of the organization of the basal ganglia in toto is concerned, we argue that this will have an integrity apparent at the systems level, while acknowledging that it must ultimately be consistent with lower-level models. In view of these methodological considerations, we therefore deliberately do not attempt to incorporate the overwhelming amount of knowledge available for the basal ganglia at the microanatomical, physiological, and molecular levels of description.

The letter is organized as follows. Section 2 reviews relevant background material concerning the neurobiology and theory of decision making in cortex and basal ganglia. Section 3 deals with the central technical argument and proposes how the basal ganglia may perform action selection in an optimal way. Section 4 shows how the specific experimental predictions of the theory are verified by existing data. Section 5 compares the performance of the proposed model against other models of decision making. Section 6 discusses the relation of this work to other theories of action selection and published experimental data.

2 Review of the Neurobiology and Theory of Decision Making

This section reviews the material critical for an understanding of our model. The theory of optimal decision making in perceptual tasks has hitherto been grounded almost exclusively in cortical mechanisms. In contrast, we develop the theory of decision making in this article with respect to the neural circuit involving both the cortex and the basal ganglia. Thus, in proposing a neural mechanism for optimal decision making, we link two strands of research: that dealing with putative cortical decision mechanisms and that dealing with action selection in the basal ganglia. Elements from both areas are therefore required background material. First, we present the theory of optimal decision making, and then we review those aspects of basal ganglia anatomy and physiology critical for the model.

2.1 Decision Making and Cortical Integration

The neural basis of decision making in cortex has been studied extensively using single-cell recordings (Britten, Shadlen, Newsome, & Movshon, 1993; Kim & Shadlen, 1999; Schall, 2001). Typically these studies have used a direction-of-motion discrimination task with fields of drifting random dots, with responses via saccadic eye movements. During these experiments, the mapping between dot movement direction and required response was kept constant for many weeks of training, so that these studies describe the proficient phase of task acquisition. After stimulus onset, neurons in cortical sensory areas (e.g.,
area MT in the visual motion task) respond if their receptive fields encounter the stimulus and are appropriately tuned to the overall direction of motion (Britten et al., 1993; Kim & Shadlen, 1999). However, the instantaneous firing rates in MT are noisy, probably reflecting the uncertainty inherent in the stimulus and its neural representation. Further, this noise is such that decisions based on the activity of MT neurons at a given moment in time would be inaccurate, because the largest firing rate does not always indicate the direction of coherent motion in the stimulus. Therefore, a statistical interpretation is required. An often used hypothesis (Gold & Shadlen, 2001, 2002) is that populations of neurons in MT encode evidence for a particular perceptual decision. To formalize this, denote the evidence supporting decision i, (i is “left” or “right”) provided at time t, by xi (t). Then, under the neural encoding hypothesis, xi (t) corresponds to the total activity of MT neurons selective for direction i at time t. The decision-making process can be defined as one of finding which xi (t) has the highest mean (Gold & Shadlen, 2001, 2002). To solve it, it appears that subsequent cortical areas are invoked to accumulate evidence over time. Thus, in the motion discrimination task, neurons in the lateral intraparietal area (LIP) and frontal eye field (FEF) (which are implicated in the response via saccadic eye movements) gradually increase their firing rate (Schall, 2001; Shadlen & Newsome, 2001) and could therefore be computing Yi (T) =
\sum_{t=1}^{T} x_i(t) \qquad (2.1)
over the temporal interval [1, T] (where we assume for simplicity a discrete representation of time). The accumulated evidence Y_i(T) may now be used in making a decision about which x_i(t) has the highest mean.

2.2 Modeling the Decision Criterion. The above description of cortical integration leaves open a central question: When should a neural mechanism stop the integration and execute the action with the highest accumulated evidence Y_i(T)? A simple solution to this problem is to execute an action as soon as any Y_i(T) exceeds a certain decision threshold, yielding the so-called race model (Vickers, 1970). However, this model does not perform optimally. For example, in the case of a decision between two alternatives, it is more efficient to compute the difference between the accumulated evidence supporting the two alternatives and execute an action as soon as this difference crosses a positive or a negative decision threshold. This procedure is known as a random walk (Laming, 1968; Stone, 1960) or a diffusion (Ratcliff, 1978) model, and it may be shown to implement a statistical decision test known as the sequential probability ratio test (SPRT) (Barnard, 1946; Wald, 1947). The SPRT is optimal in the following sense: among all decision methods
allowing a certain probability of error, it requires the shortest period of sampling the x_i; that is, it minimizes decision time (Wald & Wolfowitz, 1948).

2.3 The MSPRT. For more than two alternatives, there is no single optimal test in the sense that SPRT is optimal for two alternatives, but there are tests that are asymptotically optimal; that is, they minimize decision time for a fixed probability of error when this probability decreases to zero (Dragalin, Tartakovsky, & Veeravalli, 1999). These tests are the so-called multihypothesis SPRTs (MSPRTs) (Baum & Veeravalli, 1994; Dragalin et al., 1999), and for two alternatives, they simplify to the SPRT. While it has been shown that MSPRT may be performed in a two-layer connectionist network (McMillen & Holmes, 2006), the required complexities in this model militate against any obvious implementation in the brain (and, in particular, the cortex). We now introduce the MSPRT (Baum & Veeravalli, 1994). A decision among N alternative actions can be formulated in the same way as for the case of the two alternatives described in section 2.1. That is, it amounts to finding which x_i(t) has the highest mean (Gold & Shadlen, 2001, 2002). Let us define a set of N hypotheses H_i such that, roughly speaking, each H_i corresponds to x_i(t) having the highest mean. More precisely, we define H_i analogously to its definition for two alternatives (Gold & Shadlen, 2001, 2002): H_i is the hypothesis that the x_i(t) come from independent and identically distributed (i.i.d.) normal distributions with mean µ+ and standard deviation σ, while the x_{j≠i}(t) come from i.i.d. normal distributions with mean µ− and standard deviation σ, where µ+ > µ−. Bearing in mind that we are integrating evidence up until some time T, denote the entirety of sensory evidence available up to T by input(T) = {x_i(t) : 1 ≤ i ≤ N, 1 ≤ t ≤ T}. The MSPRT (Baum & Veeravalli, 1994) is equivalent to the following decision criterion at time T: for each alternative i, compute the conditional probability of hypothesis H_i given sensory inputs so far, P_i(T) = P(H_i | input(T)), and execute an action as soon as any of the P_i(T) exceeds a certain decision threshold. Hence, sensory information is gathered until the estimated probability of one of the inputs having the highest mean exceeds the decision threshold. Appendix A describes how P_i(T) can be computed on the basis of sensory evidence. In particular, it shows that the logarithm of P_i(T), which we denote by L_i(T), is given by
L_i(T) = y_i(T) - \ln \sum_{k=1}^{N} \exp(y_k(T)) \qquad (2.2)
where yi (T) is proportional to the accumulated evidence supporting action i. In particular, yi (T) = g ∗ Yi (T), where Yi (T) is the accumulated evidence
supporting action i (see section 2.1) and g∗ is a constant. We will refer to y_i(T) as the salience of action i. Thus, MSPRT implies that an action should be selected as soon as any of the L_i(T) exceeds a fixed decision threshold. Equation 2.2 is the basis for mapping MSPRT onto the basal ganglia. However, before we proceed with this process, we describe some of the intuitive properties of MSPRT, as implemented in equation 2.2. The right-hand side of equation 2.2 has two terms. The first term, y_i(T), is simply the salience and, on its own, describes a race model (Vickers, 1970). This term therefore incorporates information about the absolute size of the salience of the currently “winning” alternative. The second term in equation 2.2 occurs in subsequent analysis throughout the article, and we denote it by S(T), where

S(T) = \ln \sum_{k=1}^{N} \exp(y_k(T)). \qquad (2.3)
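To make these two terms concrete, the following minimal Python sketch (our illustration, not part of the original formulation) computes S(T) and the log posteriors L_i(T) of equations 2.2 and 2.3 from a vector of saliences; the subtraction of the maximum is only a standard numerical safeguard and does not change the result.

```python
import numpy as np

def log_posteriors(y):
    """L_i(T) = y_i(T) - ln(sum_k exp(y_k(T))), equation 2.2.

    y: array of saliences y_i(T) = g* Y_i(T), one entry per action.
    Subtracting max(y) before exponentiating avoids overflow and
    cancels exactly in the final difference.
    """
    y = np.asarray(y, dtype=float)
    shifted = y - y.max()
    S = y.max() + np.log(np.exp(shifted).sum())  # S(T), equation 2.3
    return y - S                                 # log-softmax; always <= 0

# Example: three actions, the first with the largest salience.
L = log_posteriors([2.0, 0.5, 0.1])
print(L, np.exp(L).sum())  # the posteriors P_i(T) sum to 1
```

Note that the outputs are never positive and that exponentiating them recovers posterior probabilities summing to 1, two properties used in section 3.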
The term S(T) includes summation over all alternatives and does not depend on i. S(T) therefore decreases the value of all L_i(T) by the same amount, thereby increasing the minimum salience required for an action to be selected. It may therefore be thought of as representing response conflict, because its value is increased by more actions having high salience. In this way, S(T) allows incorporation of information about the difference between the salience of the currently winning alternative and its competitors. The degree of scaling of the salience required for action selection implied by the particular form of S(T) is critical for optimal decision making; it allows a much lower average decision time for fixed accuracy than when the scaling is not present (i.e., the race model), as will be shown in section 5.

2.4 Basal Ganglia Connectivity. The basal ganglia connectivity used in our study contains the major pathways known to exist in basal ganglia anatomy and was based on that used in the model of Gurney et al. (2001a). Figure 1A shows this connectivity for the rat in cartoon form (for reviews of basal ganglia anatomy, see Gerfen & Wilson, 1996; Mink, 1996; Smith et al., 1998). Cortex sends excitatory projections to the striatum (Nakano, Kayahara, Tsutsumi, & Ushiro, 2000) and subthalamic nucleus (STN) (Smith et al., 1998). The striatum is the largest basal ganglia nucleus and is divided into two populations of projection neurons differentiated, inter alia, by their anatomical targets and preferential dopamine receptor type (Gerfen & Young, 1988). The neurons in one striatal subpopulation send focused inhibitory projections to the basal ganglia output nuclei: the substantia nigra pars reticulata (SNr) and entopeduncular nucleus (EP) (the homologue of the primate globus pallidus internal segment, GPi). These striatal neurons are associated with D1-type dopamine receptors (Smith et al., 1998) and,
Figure 1: Comparison of connectivity of the basal ganglia and of a network implementing the multihypothesis sequential probability ratio test (MSPRT). (A) Connectivity of basal ganglia nuclei and their cortical afferents in the rat (modified from Gurney et al., 2001a). Connections and nuclei denoted by dashed lines are not essential for the implementation of MSPRT. (B) Architecture of the network implementing MSPRT. The equations show expressions calculated by each layer of neurons.
together with their targets (SNr, EP), constitute the so-called “direct pathway” (in section 3 we associate this direct pathway with the first term, y_i(T), in equation 2.2). Neurons in the other striatal population are also inhibitory, send focused projections to the globus pallidus (GP) (globus pallidus external segment, or GPe, in primate), and are associated with D2-type dopamine receptors (Smith et al., 1998). Neurons in the STN are glutamatergic and send diffuse excitatory projections to SNr/EP and GP (Parent & Hazrati, 1993, 1995a). The GP sends inhibitory connections to the output nuclei (Bevan, Smith, & Bolam, 1996). This second striatal population therefore gives rise to an indirect pathway to the output nuclei via GP and STN (in section 3, we propose that the nuclei traversed by the indirect pathway are involved in the computation of the term S(T)). The output nuclei send widespread inhibitory connections to the midbrain, brain stem (Faull & Mehler, 1978; Kha et al., 2001), and thalamus (Alexander, DeLong, & Strick, 1986).

2.5 Neuronal Selectivity in the Basal Ganglia. There is much evidence (reviewed below) pointing to a topographic representation of functionality within the basal ganglia and their associated thalamocortical circuitry, leading to the hypothesis that these circuits support a range of discrete channels associated with different actions. This concept will be important for the model; in section 3, we associate each channel with a term L_i(T). At the largest scale of organization, Alexander et al. (1986) divided the loops from cortex, through basal ganglia and thalamus, and back to cortex into five parallel, segregated circuits (cf. Nakano et al., 2000) associated with different functionality. Since this article treats the problem of action selection, we focus on the motor and oculomotor circuits and return to consider information processing in the limbic and two prefrontal circuits in section 6. Within the motor circuit, studies of awake primates in behavioral tasks have established that all basal ganglia nuclei have somatotopic organization. Thus, Crutcher and DeLong (1984a, 1984b) have shown that neurons selective for arm, leg, and face are located within different parts of the striatum. Further, within each body part, there are clusters of neurons responding selectively before and during movement of individual joints (often in only a single direction) (Crutcher & DeLong, 1984a, 1984b). Similarly, Georgopoulos, DeLong, and Crutcher (1983) found neurons in other basal ganglia nuclei (STN, GP, EP) that were selective to the direction and speed of individual movements. These observations led Alexander et al. (1986) to propose that “the motor circuit may be composed of multiple, parallel subcircuits or channels concerned with movement of individual body parts,” which traverse all nuclei of the basal ganglia. The notion of channels was incorporated into the computational model of Gurney et al. (2001a), who proposed that each action is associated anatomically with a discrete neural population within each nucleus. Channels are therefore defined at the input nuclei (striatum and STN) as populations
innervated by the cortical afferents associated with each action. Channel populations in other nuclei (GP, EP, SNr) are then defined by focused projections from corresponding striatal populations.
3 The Basal Ganglia Implements Selection Using MSPRT

We now show how the MSPRT test defined by equation 2.2 may be performed in a biologically constrained network model of the basal ganglia. For simplicity of explanation, we first show how equation 2.2 maps onto a model of the basal ganglia including only a subset of the known anatomical connections (we exclude the connections marked by dashed lines in Figure 1A). Subsequently we demonstrate the mapping onto the model with the complete set of connectivity. The mapping between equation 2.2 and the network is shown graphically in Figure 1B. In our decomposition, each channel (see section 2.5) is associated with an action i and a term L_i(T) in the MSPRT. Hence, we assume that there is a finite number N of available actions represented in a discrete (or “localist”) fashion (the topic of action representations is dealt with further in section 6). We note first that the L_i(T) are always negative or equal to 0: because L_i(T) = ln P_i(T) and the P_i(T) are probabilities, we have P_i(T) ≤ 1, and so ln P_i(T) ≤ ln 1 = 0. Therefore, the L_i(T) themselves cannot be represented as firing rates in neuronal populations (since neurons cannot have negative firing rates). This may be overcome by assigning the network output OUT_i to −L_i(T), that is,
OUT_i(T) = -y_i(T) + \ln \sum_{k=1}^{N} \exp(y_k(T)). \qquad (3.1)
The decision is now made whenever any output decreases its activity below the threshold. Notice that this is consonant with the supposed action of basal ganglia outputs in performing selection by disinhibition of target structures (Chevalier et al., 1985; Deniau & Chevalier, 1985). As described in section 1, we propose, along with others (Schall, 2001; Shadlen & Newsome, 2001), that quantities like yi (T), representing salience, are computed in cortical regions that project to basal ganglia. In the motion discrimination example (described in section 2.1), yi (T) would be computed in FEF, which is known to innervate the basal ganglia (Parthasarathy, Schall, & Graybiel, 1992). Since yi (T) is the product of the raw accumulated evidence Yi (T) and a scaling factor, g ∗ , we interpret g ∗ as the gain that cortex introduces in computing the salience (Brown et al., 2005). As shown in appendix A, the MSPRT algorithm specifies g ∗ exactly. Thus, there is an
optimal gain:

g^* = \frac{\mu^+ - \mu^-}{\sigma^2}, \qquad (3.2)
where µ+, µ−, σ parameterize the cortical inputs (see section 2.3). We return to the question of parametric robustness with respect to gain later. Equation 3.1, describing the activity of the basal ganglia output nuclei, includes two terms, the first of which we propose is computed within the direct pathway, while the second term is computed within the pathway traversing STN and GP. The first term in equation 3.1, −y_i(T), is an inhibitory component and cannot be supplied by cortex, since its efferents are glutamatergic. We argue therefore that one function of the population of GABAergic striatal projection neurons with D1 receptors (see Figure 1A) is to provide an inhibitory copy of the salience signal to the output nuclei. Turning to the second term in equation 3.1, this is S(T) (defined in equation 2.3), which supplies an excitatory contribution to the output nuclei. A key aspect of S(T) is that it involves summing over channels. The source of excitation in the basal ganglia is the STN, which sends diffuse projections to the basal ganglia output nuclei (Parent & Smith, 1987). Thus, each output neuron receives many afferents from widespread sources within STN, and so it is plausible that they are performing a summation over channels. In the network model, this is reflected in the fact that neurons in each channel i of the output nuclei compute the quantity OUT_i(T) = −y_i(T) + Σ(T), where

\Sigma(T) = \sum_{i=1}^{N} STN_i(T). \qquad (3.3)
The model then implements MSPRT if Σ(T) = S(T). We now show, first in outline and then more rigorously, how the form of STN_i(T) required in order to ensure Σ(T) = S(T) may be enabled by the interaction between STN and GP and the characteristic transfer functions of their neurons. A first correspondence between equations 2.3 and 3.3 involves summation over channels. Second, since STN receives input y_i(T) from the cortex, this suggests that the STN firing rate should be proportional to the exponent of its input. We also propose that the logarithm in equation 2.3 comes from interactions between STN and GP. The log transform may be thought of as a compression of the range of STN activity, plausibly derived from GP inhibition, since this is, in turn, under STN control. Thus, rather than supplying a fixed decrement in STN activity through a fixed level of inhibition, GP increases its inhibition in response to increased activity in STN. We now formalize these requirements, resulting in quantitative predictions about the input-output relations of STN and GP neurons. First, we require that the firing rate of neurons in STN is proportional to an
exponential function of its inputs:

STN_i(T) = \exp(y_i(T) - GP_i(T)). \qquad (3.4)
Since STN projects diffusely to GP (Parent & Hazrati, 1995b), we assume that the STN input to GP channel i is Σ(T) rather than STN_i(T). The required log transform is obtained by supposing that the firing rate of GP channel i, GP_i(T), is given by

GP_i(T) = \Sigma(T) - \ln(\Sigma(T)), \qquad (3.5)
since, substituting equation 3.5 into equation 3.4, summing over i, and solving for Σ(T) then yields Σ(T) = S(T). In summary, an implementation of MSPRT defined by equations 3.1 to 3.5 may be realized by a subset of basal ganglia anatomy, defined in Figure 1B, if the behavior of neurons in STN and GP follows equations 3.4 and 3.5. As described so far, the model lacks two known pathways within the basal ganglia that were shown in Figure 1A. First, the GP functionality defined in equation 3.5 omits afferents from striatal projection neurons associated with D2-type dopamine receptors. Second, GP projections to the output nuclei have not been included in equation 3.1. It has been proposed that these pathways play a critical role in the learning phase, when they block actions that have been punished (Frank et al., 2004). This function is not included in our model, because we address only the computation in the proficient phase. Appendix B shows that incorporation of these pathways into an anatomically more complete scheme still admits a model of the basal ganglia that supports MSPRT. Therefore, the model with all pathways shown in Figure 1A also achieves the optimal performance of the MSPRT.

4 Predicted Requirements for STN and GP Physiology Are Validated by Existing Data

In this section we compare the predictions of equations 3.4 and 3.5, concerning the firing rates of STN and GP neurons as a function of their input, with published experimental data. In order to make this comparison, model variables (e.g., y_i(T), STN_i(T), GP_i(T)) are assumed to be proportional to experimentally observed neuronal firing rates. Note, however, that proportionality constants are not uniquely specified by the model, because a change in any such constant for a particular nucleus can be absorbed by rescaling the weights in projections from this nucleus to other areas. (The use of interpathway weights is illustrated in the anatomically more complete model described in appendix B.) The forms for STN and GP functionality given in equations 3.4 and 3.5 were derived on the basis of the known anatomy of the basal ganglia and the
assumption that the network involving cortex and basal ganglia implements MSPRT. Since we did not use the physiological properties of STN and GP neurons in deriving equations 3.4 and 3.5, these equations represent predictions of the model for the physiological properties of STN and GP, thereby providing an independent means for testing the model. These predictions are very strong; in particular, the theory of section 3 implies that the firing rate of STN neurons should be proportional to the exponent of their input. Such a relation is highly unusual in most neural populations. Furthermore, such a relation is very different from the STN input-output relations assumed by other models: Gurney et al. (2001a) assume a piecewise linear relation, while Frank et al. (2005) assume a sigmoid relation.

The response properties of STN neurons have been studied extensively (Hallworth, Wilson, & Bevan, 2003; Overton & Greenfield, 1995; Wilson, Weyrick, Terman, Hallworth, & Bevan, 2004). Typically they have nonzero spontaneous firing and can achieve unusually high firing rates. Our proposed exponential form for firing rate as a function of input (see equation 3.4) explains these features since, in the absence of input, the model gives nonzero (unity) output, and exp(.) is a rapidly growing function yielding potentially high firing rates. In order to test the prediction of equation 3.4 quantitatively, we fitted exponential functions to firing rate data in the literature. Figure 2H shows the pooled results of this exercise based on two studies (Hallworth et al., 2003; Wilson et al., 2004). The fit to an exponential function is a good one, consistent with the prediction in equation 3.4.

Second, the theory makes predictions, defined by equation 3.5, concerning the firing rate of GP. First, we show that the function defined by equation 3.5 is roughly linear if we make the reasonable assumption that N (the number of channels or available actions) is large. Thus, since y_i(T) > 0, then from equation 2.3, S(T) is bounded below by ln(N), so that S(T) increases with N. Now, for large S(T), S(T) ≫ ln(S(T)), so that the linear term in equation 3.5 dominates, and GP_i(T) becomes an approximately linear function of its input S(T). We therefore predict that GP neurons display a roughly linear relation between input and firing rate, and two studies validate this. Nambu and Llinas (1994) have established that, for those GP neurons that are most influential on the population firing rate, the firing rate is indeed well approximated by a linear function of the injected current (see Figure 2I), a result that is in agreement with an earlier study by Kita and Kitai (1991). In any case, a model in which GP neurons obey an exactly linear input-firing rate relation departs little in performance from the model in which GP is described by equation 3.5 (see section 5.3). Finally, it is intriguing to note that GP also includes two types of neurons whose input-output properties are logarithmic (Nambu & Llinas, 1994). Thus microcircuits within GP making use of intranucleus inhibitory collaterals (Nambu & Llinas, 1997) could also support the exact computation required by equation 3.5 for MSPRT.
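As an indication of how the exponential fits reported above can be performed, here is a minimal Python sketch; the current and rate values are hypothetical stand-ins for the digitized data of Hallworth et al. (2003) and Wilson et al. (2004). Fitting f = a exp(bI) is linear in log space, so a single linear regression suffices.

```python
import numpy as np

# Hypothetical current/rate pairs standing in for digitized STN data.
current = np.array([20.0, 40.0, 60.0, 80.0, 100.0])  # pA
rate = np.array([8.0, 14.0, 27.0, 48.0, 90.0])       # Hz

# ln f = ln a + b * I, so one linear fit recovers both parameters.
b, log_a = np.polyfit(current, np.log(rate), 1)
a = np.exp(log_a)
print(f"f(I) = {a:.2f} * exp({b:.4f} * I)")

# The rescaling used to pool neurons in Figure 2H: plotting the pairs
# (b * I_j, f_j / a) collapses every well-fit neuron onto exp(x).
```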
Figure 2: Firing rates f of STN and GP neurons as a function of input current I. (A–D) The panels replot data on the firing rate of STN neurons presented in Hallworth et al. (2003), Figures 4b, 4f, 12d, and 13d, respectively (control condition). (E–G) The panels replot the data from STN presented in Wilson et al. (2004), Figures 1c, 2c, and 2f, respectively (control condition). Only firing rates below 135 Hz are shown. Lines show the best fit of the function f = a exp(bI). (H) Scaled data from A–G, (f_j/a, bI_j), plotted on the same axes for all neurons. (I) Number of spikes n produced by a GP neuron of type II (Nambu & Llinas, 1994) in a 242 ms stimulation interval using current injection I. (The data used in this figure, kindly provided by Atsushi Nambu, come from the same neuron analyzed in Figure 5g of Nambu and Llinas, 1994.)
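Before turning to performance, the circuit computation of section 3 can be summarized in executable form. The sketch below is our illustration under the simplifying assumption that the STN-GP loop relaxes to its steady state by discrete iteration (the letter analyzes only the fixed point, and convergence of such an iteration is not guaranteed for all inputs); at the fixed point Σ(T) = S(T), so the outputs equal −L_i(T).

```python
import numpy as np

def bg_outputs(y, n_iter=100):
    """Relax the STN-GP loop of equations 3.3-3.5 to its fixed point.

    y: saliences y_i(T). The discrete iteration is our simplifying
    assumption; it converges for these values but not necessarily
    for all inputs.
    """
    y = np.asarray(y, dtype=float)
    stn = np.exp(y)                      # initial STN activity
    for _ in range(n_iter):
        sigma = stn.sum()                # Sigma(T), equation 3.3
        gp = sigma - np.log(sigma)       # equation 3.5, same for every channel
        stn = np.exp(y - gp)             # equation 3.4
    return -y + stn.sum()                # OUT_i(T) = -y_i(T) + Sigma(T)

y = np.array([1.2, 0.4, 0.1])
out = bg_outputs(y)
L = y - np.log(np.exp(y).sum())          # L_i(T) from equation 2.2
print(np.allclose(out, -L))              # True: at the fixed point Sigma = S
```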
5 Performance of MSPRT Model of the Basal Ganglia and Its Variants

The performance of the algorithmically defined model described in section 3 was investigated in simulation.
5.1 Simulation Methods. In all numerical experiments described in this section, we simulated a decision process between N alternative actions, with plausible parameters describing sensory evidence. The evidence x_i(t) was accumulated in integrators Y_i in time steps of δt = 1 ms. For the correct alternative i, evidence x_i(t) was generated from a normal distribution with mean µ+δt and variance σ²δt, while for other alternatives, x_j(t) was generated from a normal distribution with mean µ−δt and variance σ²δt. It transpires, in fact, that we require only the value of the difference µ+ − µ−, rather than the individual means themselves (see appendix A). This was estimated from a sample participant in experiment 1 from the study of Bogacz, Brown, Moehlis, Holmes, and Cohen (2006), that is, µ+ − µ− = 1.41. An estimate of σ was taken from the same experiment to be 0.33. For each set of parameters, a decision threshold was found numerically that resulted in an error rate of 1% ± 0.2% (SE); this search for the threshold was repeated 10 times. For each of these 10 thresholds, the decision time was then found in simulation, and their average was used to construct the data points.

5.2 MSPRT in the Basal Ganglia Outperforms Alternative Decision Mechanisms. It is instructive to see quantitatively how the performance of the MSPRT model compares with that of two other standard models of decision making in the brain: the race model (Vickers, 1970) and a model proposed by Usher and McClelland (2001) (henceforth, the UM model). While the MSPRT has been shown to be asymptotically optimal as the error approaches zero, its performance with finitely large errors has to be evaluated numerically. To do this, we conducted simulations for differing numbers of competing inputs, N, for all three models, with a 1% error rate. Figure 3A shows that the MSPRT consistently outperforms both the UM and race models (especially in the more realistic large-N regime). This result is in agreement with recent work by McMillen and Holmes (2006) (who also showed another feature visible in Figure 3A: that as N increases, the performance of the UM model asymptotically approaches that of the race model). For N = 2, the performance of the MSPRT and UM models is very similar since, in this instance, the latter approximates SPRT (Bogacz et al., 2006; Brown et al., 2005).

5.3 The Model Is Parametrically Robust. As previously noted, MSPRT specifies a unique value of the cortical gain parameter g∗ if MSPRT is to be faithfully implemented. We now analyze the performance of the model with different values of gain g ≠ g∗ in a general cortical integrator relation y_i(T) = gY_i(T) (instead of y_i(T) = g∗Y_i(T)). Equation 3.2 implies that the optimal value of gain g∗ depends on the parameters of the inputs to the cortical integrators (µ+, µ−, σ). These parameters are task specific and are therefore unlikely to be known to any neural decision system. It is therefore essential for any biologically realistic implementation of MSPRT
Figure 3: Comparison of decision times (DT) of the various models described in the text. Simulation details for the MSPRT model are given in section 5.1. (A) Comparison of DT of the MSPRT model, the Usher and McClelland (2001) (UM) model, and the race model (Vickers, 1970) for different numbers of alternative actions. The standard error of the mean decision time estimate (SEM) was always lower than 7.3 ms. The inhibition and decay parameters of the UM model were set to 100. The isolated filled triangles show DTs for the linearized GP model with two parameter pairs (g, a) (see section 5.3). The triangle pointing up shows DT for the default values g = g∗ and a = 1, and the triangle pointing down for the optimal values g = 0.4g∗ and a = 0.84. The isolated square shows the DT for a version of the MSPRT model in which simple cortical integrators are replaced by a UM model with both decay and inhibition parameters equal to 10. (B) Robustness of the MSPRT model under variation in the gain parameter. The solid line shows the dependence of DT for N = 10 alternatives on the value of parameter g (expressed via the ratio g/g∗). Error bars indicate SEM. The dashed line shows the decision time of the UM model.
that this mechanism does not significantly deviate from optimality under variation of the gain g from its optimal value, g ∗ . Figure 3B shows that the model is indeed robust to changes in g. If g > g ∗ , the decision time does not increase, while if g < g ∗ , the decision time does increase, but it never exceeds that of the UM model (see appendix C). Hence, even if the parameters of the inputs to the cortical integrators are not known, the performance may be optimized by setting g as high as possible. Turning to the functionality of GP, we evaluated the decrease in performance under an exact linearization of GP with respect to that shown in Figure 3A, for N = 10 alternatives. To do this, suppose the firing rate of a neuron in GP is a linear function of its input from STN with proportionality constant a :
GP_i(T) = a \sum_{j=1}^{N} STN_j(T) = a S(T). \qquad (5.1)
Substituting equation 5.1 into equation 3.4 and summing over i yields
\ln(S(T)) + a S(T) = \ln \sum_{i=1}^{N} \exp(y_i(T)). \qquad (5.2)
Equation 5.2 does not have a closed-form solution for S(T), and so this is found by solving equation 5.2 numerically. Decision times (DT) were contingent on the values of the parameters g and a. With the default values g = g∗ and a = 1, DT = 607 ms (SEM ± 3 ms), which is better than the UM model (DT = 628 ms) or the race model (DT = 676 ms). A parameter search yielded optimal performance, with DT = 545 ms (SEM ± 3 ms), with g = 0.4g∗ and a = 0.84.

5.4 Competition May Occur in Both Cortex and Basal Ganglia. The UM model (as well as the model of cortical decision making in area LIP by Wang, 2002) assumes that cortical integrators not only integrate evidence (as in the MSPRT model) but also actively compete with one another. Appendix D shows that if the cortex performs a computation equivalent to the UM model, then the activity levels of the basal ganglia output nuclei are exactly the same as in the original MSPRT model. As a consequence, the decision times remain the same, and the system as a whole still achieves optimal performance. This is illustrated in Figure 3A, where an isolated open square symbol shows the DT for the model augmented with UM-based cortical processing; this DT is the same as for the original MSPRT model.
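For readers wishing to reproduce the flavor of these simulations, the following Python sketch implements the recipe of section 5.1 with the MSPRT stopping rule. The posterior threshold is a hypothetical stand-in for the numerically tuned thresholds used in the letter, so the resulting error rate is only approximately 1%.

```python
import numpy as np

rng = np.random.default_rng(0)

def mspr_trial(n_alt=10, mu_plus=1.41, mu_minus=0.0, sigma=0.33,
               dt=0.001, p_threshold=0.99):
    """One decision trial following the recipe of section 5.1.

    Only mu_plus - mu_minus matters (appendix A), so mu_minus = 0 is a
    harmless normalization; p_threshold is a hypothetical stand-in for
    the numerically tuned threshold used in the letter.
    """
    g_star = (mu_plus - mu_minus) / sigma ** 2       # equation 3.2
    mu = np.full(n_alt, mu_minus)
    mu[0] = mu_plus                                  # alternative 0 is correct
    Y, t = np.zeros(n_alt), 0
    while True:
        t += 1
        Y += rng.normal(mu * dt, sigma * np.sqrt(dt))        # evidence x_i(t)
        y = g_star * Y                                       # saliences
        L = y - y.max() - np.log(np.exp(y - y.max()).sum())  # equation 2.2
        if L.max() > np.log(p_threshold):
            return t * dt, int(np.argmax(L) == 0)

times, correct = zip(*(mspr_trial() for _ in range(200)))
print(f"mean DT = {np.mean(times):.3f} s, accuracy = {np.mean(correct):.2f}")
```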
6 Discussion

6.1 Summary. Our main result is that a circuit involving cortex and the basal ganglia may be devoted to implementing a powerful (asymptotically optimal) decision mechanism (MSPRT) in a parametrically robust way. Further, our results suggest a division between a core functional anatomy (shown in Figure 1B) and additional pathways (incorporated in the anatomically complete model) that may serve other purposes (e.g., enhancement of robustness and learning) without compromising MSPRT. In addition, the MSPRT model was shown to outperform other decision mechanisms. While the UM and race models avail themselves of simple network implementations, the sophisticated architecture and neural functionality of the basal ganglia appear to have evolved to support the more powerful MSPRT, allowing the brain to make accurate decisions substantially faster than the simpler mechanisms. The model also made several predictions about the physiological properties of STN and GP neurons, which, while consistent with existing data, provide a challenge for further experimental studies to test them in vitro and in vivo with synaptic input.

6.2 Relationship to Other Models of Decision Making

6.2.1 Action Selection in Basal Ganglia. The model described in this article has exactly the same architecture (shown in Figure 1A) as a previous model of action selection in the basal ganglia in the proficient phase (Gurney et al., 2001a). This architecture has been shown before to exhibit appropriate action selection and switching properties in a computational model (Gurney et al., 2001b). The underlying architecture has also been shown to be functionally robust (from a selection perspective) in a variety of settings. Thus, it has also been shown to perform these functions within the anatomical context supplied by associated thalamocortical loops (Humphries & Gurney, 2002), under the addition of further circuitry intrinsic to the basal ganglia (Gurney, Prescott, et al., 2004), when embodied in a complete, behaving autonomous agent (Prescott, Montes-Gonzalez, Gurney, Humphries, & Redgrave, 2006), and in spiking neuron models (Humphries, Stewart, & Gurney, in press; Stewart, Gurney, & Humphries, 2005), which are constrained by significantly more physiological detail than their systems-level counterparts. The model of Gurney et al. (2001a) and the MSPRT model are consistent in proposing similar functions of individual nuclei. In particular, we suppose here that the GP plays a crucial role in limiting STN activity (via a log transform). This function is similar to that proposed for GP in Gurney et al. (2001a), in which GP automatically limits the excitation of basal ganglia output nuclei in order to allow network mechanisms to perform selection. This work differs from Gurney et al. (2001a) in that it provides an analytic description for the computation performed during the selection, thereby
providing a new framework for understanding why the basal ganglia are organized in the way they are. The function of STN in our model is similar to that posited by Frank (2005), who proposed that it “can dynamically modulate the threshold for executing responses depending on the degree of response conflict present.” The novelty of our work lies in specifying precisely how STN should modulate this threshold to optimize the performance.

6.2.2 Bayesian Decision Making. MSPRT can be viewed as a Bayesian method of decision making, as it is based on evaluating conditional probabilities using Bayes' theorem (see appendix A). Recently Yu and Dayan (2005) proposed a Bayesian model of attentional modulation of decision making. In this model, the final layer of an abstract neural network performs a computation equivalent to that accomplished by the outputs of the MSPRT model of the basal ganglia presented here (i.e., each output unit computes exponents of L_i(T)). The novelty of our work lies in showing how this computation may be performed by an identified, biological network of neurons, the basal ganglia.

6.2.3 Cortical Decision Making. From the theoretical perspective, the use of integrated evidence leaves open the possibility that the cortex may operate as the first stage of a two-stage decision process, in which cortical mechanisms of the kind posited in the UM model (for example) make a first-pass filter for actions with small saliences, thereby preventing these requests from propagating to the basal ganglia for further processing. This possibility was confirmed by mathematical analysis and simulation, where identical results were obtained after incorporation of a first stage consisting of a UM network (rather than the simple integration implied in equation 2.1).

6.2.4 Integration of Evidence. The model presented here assumes that cortical neurons integrate evidence in support of alternative actions. Several mechanisms have been proposed for how this may occur; for example, Wang (2002) proposed that the integration occurs via excitatory connections in the cortex. However, if the basal ganglia model presented here is to be a universally applicable solution to the problem of action selection, then the appearance of accumulated evidence must be guaranteed at its inputs under all circumstances, irrespective of the specific action request and brain system generating it. Anatomically, the basal ganglia form a component of loops consisting of projections from cortex to basal ganglia, then to thalamus, and back to cortex (Alexander & Crutcher, 1990). In a computational study of basal ganglia and cortex, Humphries and Gurney (2002) showed that cortical regions receiving thalamic input had their activity levels amplified beyond those of sensory areas that did not. This raises the intriguing
possibility that the feedback in these loops could serve to bootstrap evidence accumulation in cortex so that no separate mechanism is required.

6.2.5 Reinforcement Learning. As stated in section 1, this article focuses on the proficient phase of task acquisition. However, during learning, it is still necessary to identify the behavioral state (stimulus and ongoing behavior) and represent the stimulus-response mapping in some way. Both processes are aspects of the decision-making process discussed in this article, and so it is useful to speculate how it might be possible to unify the accounts of decision making and learning into a single coherent framework. To develop these ideas, we consider a version of the motion discrimination task described in section 2.1 in which the stimulus-response mapping is constantly modified and hence must be learned by the animal. In this case, it is unlikely that the stimulus-response mapping would be represented in the connections between MT and FEF, because these connections would have to be rapidly and continuously modified. Many models based on experimental data would assume that in this experiment, the stimulus-response mapping would be stored in much more plastic synapses in the prefrontal cortex or the striatum (Doya, 2000; Miller & Cohen, 2001; O'Doherty et al., 2004). This is also the kind of scheme considered by Ashby, Alfonso-Reese, Turken, and Waldron (1998) in the context of category learning with motor response. Here, early stimulus-response mappings are learned in the basal ganglia, while slower consolidation takes place in direct mappings between sensory and motor cortices. Further, we note that behavioral state identification during learning could rely on integration of information, just as it does during the proficient phase in the theory presented here. However, in the case of learning, the integration may occur in regions different from those used during the proficient phase. For example, Gold and Shadlen (2003) have shown that in the version of the motion discrimination task in which the mapping between stimulus and direction of saccade is not known during the information integration, the integration does not occur in FEF. If a unified framework encompassing action selection and learning could be developed along the lines outlined above, it would simultaneously allow predictions of the probabilities of taking alternative actions, as well as the probability distributions of onsets of action initiations (reaction times). However, before such a unified account is developed, a number of questions must be answered, in particular, where in the brain the evidence supporting alternative actions is computed in the learning phase. To address this question, we look forward to studies of neuronal responses in cortex and basal ganglia in the version of the motion discrimination task described above.

6.2.6 Working Memory. O'Reilly and Frank (2006) have proposed that the basal ganglia gate access to working memory and decide whether a newly presented stimulus should be stored in working memory. According to
Alexander et al.'s (1986) multiple loop scheme, this kind of decision would be performed by the dorsolateral prefrontal circuit. As noted in section 2.5, this letter focuses on the motor and oculomotor circuits. It would therefore be interesting to investigate whether the basal ganglia implementing MSPRT could also optimize selection within working memory and, indeed, cognitive selection in general.

6.3 Relationship to Other Experimental Data

6.3.1 Psychological Data. A model of decision making must be consistent with the rich body of psychological data concerning reaction times (RT). Other psychological models are consistent with these data (Ratcliff, Van Zandt, & McKoon, 1999; Usher & McClelland, 2001), and consistency in this respect will therefore not distinguish our model in favor of these alternatives but, rather, is a necessary requirement for its psychological plausibility. Our model is indeed completely consonant with the account of RT data in two-alternative choice paradigms given by the diffusion and SPRT models (e.g., Ratcliff et al., 1999), since, under these circumstances, MSPRT reduces to SPRT. For more than two alternatives, it is interesting to note that the decision time of the MSPRT model (shown in Figure 3A) is approximately proportional to the logarithm of the number of alternatives (cf. McMillen & Holmes, 2006), thus following the experimentally observed Hick's law (Teichner & Krebs, 1974) describing RT as a function of the number of choices. Further support for the basal ganglia as a psychologically plausible response mechanism is provided in recent work by Stafford and Gurney (2004, in press).

6.3.2 The Neural Representation of Actions. In this letter, we assumed for simplicity that there was an anatomically separate channel for each possible action. However, this raises the question of what constitutes a separate action. For example, is moving one's hand 10 cm to the left a different action from moving it 15 cm to the left? If not, then how are these actions differentiated? If they are different actions (with respect to basal ganglia selection), then the number of actions is potentially infinite, and the basal ganglia are confronted with the seemingly impossible task of representing an infinite number of discrete channels. Neurophysiological data provide a clue to a possible answer to these questions. Georgopoulos et al. (1983) studied neuronal responses in basal ganglia in a task (akin to the example above) in which different stimuli required an animal to move its hand in the same direction with three different amplitudes. They noticed that some hand-selective neurons had activity proportional to movement amplitude, while other neurons had activity inversely proportional to the amplitude (they responded most for short movements). This suggests that although movements of different joints may be represented by separate neuronal populations (channels), the
fine-tuning of the movement is represented in a distributed fashion within a channel. It will therefore be of interest to extend the current theory to incorporate coding by distributed representations. The current MSPRT model describes the activity of a neuronal population selective for each alternative within each nucleus by a single variable (corresponding to activity in a localist unit). Recently Bogacz (2006) has shown how to map linear localist decision networks into computationally equivalent distributed decision networks and derived the parameters of decision networks with distributed representations implementing SPRT. Although this network mapping process cannot be directly applied to the MSPRT model (because it includes nonlinear processing in STN), it is likely that a similar approach for the particular types of nonlinearities present in the MSPRT model may be developed.

6.3.3 Responses of Striatal Neurons. As noted in section 1, the MSPRT model aims to provide a general framework for understanding computation in the basal ganglia as a whole during action selection in the proficient phase. Therefore, the model does not aim to incorporate all known data on basal ganglia neurons and, in particular, does not aim to explain data relating to the learning phase (during which the striatum is known to play a prominent role). However, it is of interest to see whether the behavior of the population of striatal projection neurons in the proficient phase is consistent with our ideas. In the MSPRT model, we assume, in accordance with experimental observations (Crutcher & DeLong, 1984b; Georgopoulos et al., 1983), that during the proficient phase, the activity of striatal neurons encoding certain actions reflects the activity in corresponding cortical motor or oculomotor regions. Note that the above assumption does not prevent striatal neurons selective for particular actions from being modulated by expected reward in the proficient phase, as it has been shown that the cortical integrators are modulated by expected reward in the study by Platt and Glimcher (1999), in which the stimulus-response mapping was kept constant for many weeks of training and experiment. We assumed that in the proficient phase, the neurons in striatum selective for actions should reflect the activity of cortical integrators. This predicts that in the motion discrimination task described in section 2.1, striatal neurons selective for alternative directions of eye movements should exhibit gradually increasing firing rates similar to those in the cortex. This prediction may seem to contradict the observation that striatal neurons are bistable (Wilson, 1995), with an active “up” state and an inactive “down” state. However, Okamoto, Isomura, Takada, and Fukai (2005) have shown recently that if neurons are bistable and the probability of the onset of the up state depends on the magnitude of input, these neurons may implement information integration, and their activity averaged across trials may increase linearly.
6.3.4 Dopaminergic Modulation. In this article, we do not analyze the influence of dopamine on basal ganglia computation. However, there are two types of dopamine release within the basal ganglia. First, phasic release (brief pulses of dopamine) is associated with salient or unexpected behavioral events. In particular, it has been proposed that the phasic dopamine signal represents a variable in temporal difference reinforcement learning, namely, the reward prediction error (i.e., the difference between the actual and the predicted levels of expected reward) (Montague et al., 1996; Schultz et al., 1997). Since we do not address learning here, we do not consider phasic release further. A second type of dopamine release provides tonic or background levels, severe lowering of which can result in Parkinson's disease (Obeso et al., 2000). A recent modeling study of the basal ganglia circuit (Gurney, Humphries, Wood, Prescott, & Redgrave, 2004) has indicated that tonic levels of dopamine may influence the speed-accuracy trade-off in making responses. This is consistent with experimental results showing that the level of tonic dopamine influences RTs (Amalric & Koob, 1987; Amalric, Moukhles, Nieoullon, & Daszuta, 1995). It will be interesting to determine theoretically to what extent the tonic dopamine level can influence the speed-accuracy trade-off while still preserving the optimality of MSPRT, and whether this mechanism can play a role in finding the speed-accuracy trade-off that maximizes the rate of reward acquisition in tasks including repeating sequences of choices (Bogacz et al., 2006; Simen, Cohen, & Holmes, 2006).

Appendix A: The MSPRT

This section supplies details of the derivation of equation 2.2. Noting the definition in the main text, P_i(T) = P(H_i | input(T)), then, from Bayes' theorem,

P_i(T) = \frac{P(input(T) \mid H_i)\, P(H_i)}{P(input(T))}. \qquad (A.1)
Notice that the hypotheses H_i are mutually exclusive (as x_i cannot simultaneously have mean µ+ and µ−). Further, we assume that the set of hypotheses H_i covers all possibilities concerning the distribution of sensory inputs (a standard assumption in statistical testing). Then the denominator of equation A.1 can be written as
P(input(T)) = \sum_{k=1}^{N} P(input(T) \wedge H_k), \qquad (A.2)
but from the definition of conditional probability, P(input(T) ∧ H_k) = P(input(T)|H_k) P(H_k), so that equation A.1 can be written as

P_i(T) = \frac{P(input(T) \mid H_i)\, P(H_i)}{\sum_{k=1}^{N} P(input(T) \mid H_k)\, P(H_k)}. \qquad (A.3)
We assume that we do not have any prior knowledge about which of the hypotheses is more likely, so all prior probabilities must be equal to each other, that is, P(H_i) = 1/N; hence, they cancel in equation A.3, and we obtain the original form of MSPRT given by Baum and Veeravalli (1994):

P_i(T) = \frac{P(input(T) \mid H_i)}{\sum_{k=1}^{N} P(input(T) \mid H_k)}. \qquad (A.4)
We now compute the logarithm of P_i, defined by equation A.4:

L_i(T) = \ln P_i(T) = \ln P(input(T) \mid H_i) - \ln \sum_{k=1}^{N} \exp\big( \ln P(input(T) \mid H_k) \big). \qquad (A.5)
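Equations A.4 and A.5 are two routes to the same quantity, which is easy to confirm numerically; in this small sketch (ours), the log likelihoods are arbitrary stand-ins.

```python
import numpy as np

rng = np.random.default_rng(1)
log_lik = rng.normal(size=4)              # stand-ins for ln P(input(T)|H_k)

# Equation A.4: posteriors with equal priors are a softmax of log likelihoods.
P = np.exp(log_lik) / np.exp(log_lik).sum()

# Equation A.5: the same quantity computed directly in log space.
L = log_lik - np.log(np.exp(log_lik).sum())

print(np.allclose(np.log(P), L))          # True
```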
Equation A.5 already has a form similar to that in equation 2.2. We now show how to obtain equation 2.2 exactly. We first compute the term ln P(input(T)|H_i) that occurs in equation A.5:

\ln P(input(T) \mid H_i) = \ln \prod_{t=1}^{T} \Big( f_{(\mu^+,\sigma)}(x_i(t)) \prod_{j=1,\, j\neq i}^{N} f_{(\mu^-,\sigma)}(x_j(t)) \Big) = \sum_{t=1}^{T} \ln f_{(\mu^+,\sigma)}(x_i(t)) + \sum_{j=1,\, j\neq i}^{N} \sum_{t=1}^{T} \ln f_{(\mu^-,\sigma)}(x_j(t)),
where f_{(µ,σ)} denotes the probability density function of a normal distribution with mean µ and standard deviation σ. Therefore,

\ln P(input(T) \mid H_i) = \sum_{t=1}^{T} \Big( \ln \frac{1}{\sqrt{2\pi}\,\sigma} - \frac{(x_i(t) - \mu^+)^2}{2\sigma^2} \Big) + \sum_{j=1,\, j\neq i}^{N} \sum_{t=1}^{T} \Big( \ln \frac{1}{\sqrt{2\pi}\,\sigma} - \frac{(x_j(t) - \mu^-)^2}{2\sigma^2} \Big)
= NT \ln \frac{1}{\sqrt{2\pi}\,\sigma} + \frac{1}{2\sigma^2} \Big( -T\big( (\mu^+)^2 + (N-1)(\mu^-)^2 \big) + \sum_{j=1}^{N} \sum_{t=1}^{T} \big( -x_j(t)^2 + 2\mu^- x_j(t) \big) \Big) + \frac{\mu^+ - \mu^-}{\sigma^2} \sum_{t=1}^{T} x_i(t).

The first two terms on the right-hand side do not depend on i, and so, denoting their sum by C, we have

\ln P(input(T) \mid H_i) = C + g^* Y_i(T), \qquad (A.6)
where g^* = (µ+ − µ−)/σ², and Y_i(T) is defined in equation 2.1. Hence, substituting equation A.6 into A.5, we obtain

L_i(T) = C + g^* Y_i(T) - \ln \Big( \exp(C) \sum_{k=1}^{N} \exp(g^* Y_k(T)) \Big) = g^* Y_i(T) - \ln \sum_{k=1}^{N} \exp(g^* Y_k(T)),
which gives equation 2.2.

Appendix B: Basal Ganglia Model, Including All Major Pathways

To accomplish inclusion of all pathways shown in Figure 1A, we introduce a model with weighted connection strengths. For simplicity, we assume that all the excitatory weights are equal to 1 and denote the inhibitory weight from nucleus A to nucleus B by w_{A→B}. Denote the striatal projection neurons whose dopaminergic receptors are predominantly of D1 and D2 types by S1 and S2, respectively. The activity levels of basal ganglia nuclei in the network of Figure 1A are then given by:

GP_i = \sum_{j=1}^{N} STN_j - \ln \sum_{j=1}^{N} STN_j - w_{S2\to GP}\, y_i \qquad (B.1)

STN_i = \exp(y_i - w_{GP\to STN}\, GP_i) \qquad (B.2)

OUT_i = -w_{S1\to OUT}\, y_i + \sum_{j=1}^{N} STN_j - w_{GP\to OUT}\, GP_i. \qquad (B.3)
We now derive constraints on the weights that must be satisfied so that OUT_i(T) = −L_i(T) as defined by equation 2.2. Substituting equation B.1 into B.2,

STN_i = \exp\Big( y_i (1 + w_{GP\to STN} w_{S2\to GP}) - w_{GP\to STN} \sum_{j=1}^{N} STN_j + w_{GP\to STN} \ln \sum_{j=1}^{N} STN_j \Big).
Summing over i and rearranging terms, we get

\sum_{i=1}^{N} STN_i = \sum_{i=1}^{N} \exp\big( y_i (1 + w_{GP\to STN} w_{S2\to GP}) \big) \cdot \exp\Big( -w_{GP\to STN} \sum_{i=1}^{N} STN_i \Big) \cdot \Big( \sum_{i=1}^{N} STN_i \Big)^{w_{GP\to STN}}.
Taking the logarithm of both sides and rearranging terms,

\ln \sum_{i=1}^{N} \exp\big( y_i (1 + w_{GP\to STN} w_{S2\to GP}) \big) = (1 - w_{GP\to STN}) \ln \sum_{i=1}^{N} STN_i + w_{GP\to STN} \sum_{i=1}^{N} STN_i. \qquad (B.4)
Now substitute equation B.1 into B.3:

OUT_i = -y_i (w_{S1\to OUT} - w_{GP\to OUT} w_{S2\to GP}) + (1 - w_{GP\to OUT}) \sum_{j=1}^{N} STN_j + w_{GP\to OUT} \ln \sum_{j=1}^{N} STN_j. \qquad (B.5)
Comparing equations B.4 and B.5, we note that if the following condition is satisfied,

w_{GP\to OUT} = 1 - w_{GP\to STN}, \qquad (B.6)
then we can substitute equation B.4 into B.5 and obtain

OUT_i = -y_i (w_{S1\to OUT} - w_{GP\to OUT} w_{S2\to GP}) + \ln \sum_{j=1}^{N} \exp\big( y_j (1 + w_{GP\to STN} w_{S2\to GP}) \big). \qquad (B.7)
In order to satisfy OUT_i(T) = −L_i(T), the two coefficients of y_i in equation B.7 must be equal, and the gain coefficient must be modified accordingly. Thus, the following must be satisfied:

w_{S1\to OUT} - w_{GP\to OUT} w_{S2\to GP} = 1 + w_{GP\to STN} w_{S2\to GP}. \qquad (B.8)
Substituting the constraint B.6 into B.8 gives

w_{S1\to OUT} = 1 + w_{S2\to GP}. \qquad (B.9)
The optimal value of the gain must be modified and becomes

g^* = \frac{1}{1 + w_{GP\to STN} w_{S2\to GP}} \cdot \frac{\mu^+ - \mu^-}{\sigma^2}.
Note that with the additional pathways, the optimal gain g∗ is less than that required without these pathways. In summary, the network in Figure 1A implements MSPRT if the inhibitory weights satisfy constraints B.6 and B.9.
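The constraints B.6, B.8, and B.9 can be checked numerically. In the following sketch (our illustration, with arbitrary weight values chosen to satisfy the constraints), equations B.1 and B.2 are relaxed to their fixed point by iteration, which converges for these values, and the resulting outputs match equation B.7.

```python
import numpy as np

# Hypothetical free weights; the remaining two follow from B.6 and B.9.
w_gp_stn = 0.5                      # w_GP->STN
w_s2_gp = 0.3                       # w_S2->GP
w_gp_out = 1 - w_gp_stn             # constraint B.6
w_s1_out = 1 + w_s2_gp              # constraint B.9

y = np.array([0.8, 0.2, -0.1])      # saliences

stn = np.exp(y)                     # initial STN activity
for _ in range(200):                # relax B.1 and B.2 to the fixed point
    gp = stn.sum() - np.log(stn.sum()) - w_s2_gp * y   # B.1
    stn = np.exp(y - w_gp_stn * gp)                    # B.2
out = -w_s1_out * y + stn.sum() - w_gp_out * gp        # B.3 (gp has converged)

scale = 1 + w_gp_stn * w_s2_gp      # coefficient appearing in B.7 and B.8
target = -(scale * y - np.log(np.exp(scale * y).sum()))
print(np.allclose(out, target))     # True at the fixed point
```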
Appendix C: Performance for Nonoptimal Values of Parameter g

This section describes the performance of the basal ganglia model when parameter g has nonoptimal values. Section C.1 shows that if g > g∗, the decision time does not increase since, as g → ∞, the model converges to the other asymptotically optimal test, MSPRTb. Section C.2 shows that if g < g∗, the decision time does increase, but it never exceeds that of the UM model (Usher & McClelland, 2001) because, as g → 0, the proposed model converges to an approximation of the UM model.

C.1 Overestimation of Gain. We first describe the other asymptotically optimal test MSPRTb (Dragalin et al., 1999) and go on to show that as g → ∞, the model approximates MSPRTb. In this test, after each sample,
the following ratios are computed (Dragalin et al., 1999):

RB_i(T) = \frac{P(input(T) \mid H_i)}{\max_{1 \le k \le N,\, k \neq i} P(input(T) \mid H_k)}.
The decision is made whenever one of the ratios exceeds a threshold. From equation A.6, the logarithms of the above ratios for the hypotheses defined in the main text become

LB_i(T) = g^* \Big( Y_i(T) - \max_{1 \le k \le N,\, k \neq i} Y_k(T) \Big). \qquad (C.1)
The decision is made whenever one of the LB_i(T) exceeds a threshold. Let Y_{m1}(T), Y_{m2}(T), ..., Y_{mi}(T), ..., Y_{mN}(T) be the ordered sequence of Y_i(T), with Y_{m1}(T) the largest member. Then, according to equation C.1, the decision is made whenever Y_{m1}(T) − Y_{m2}(T) exceeds a threshold. We now show that the basal ganglia model in the main text works in just this way as g → ∞. The log ratio L_i(T) in equation 2.2 of the main text is equal to

L_i(T) = g Y_i(T) - \ln \Big( \exp(g Y_{m1}(T)) + \exp(g Y_{m2}(T)) + \sum_{j=3}^{N} \exp(g Y_{mj}(T)) \Big)
       = g Y_i(T) - \ln \Big( \exp(g Y_{m1}(T)) \Big[ 1 + \exp\big( g(Y_{m2}(T) - Y_{m1}(T)) \big) + \sum_{j=3}^{N} \exp\big( g(Y_{mj}(T) - Y_{m1}(T)) \big) \Big] \Big)
       = g Y_i(T) - g Y_{m1}(T) - \ln \Big[ 1 + \exp\big( g(Y_{m2}(T) - Y_{m1}(T)) \big) + \sum_{j=3}^{N} \exp\big( g(Y_{mj}(T) - Y_{m1}(T)) \big) \Big].
A decision will be made when the largest among the L_i(T) exceeds a threshold z, that is, when

g Y_{m1}(T) - g Y_{m1}(T) - \ln \Big[ 1 + \exp\big( g(Y_{m2}(T) - Y_{m1}(T)) \big) + \sum_{j=3}^{N} \exp\big( g(Y_{mj}(T) - Y_{m1}(T)) \big) \Big] > z.
The first two terms cancel. Taking the negative, applying the exp(.) function, and subtracting 1 gives

\exp\big( g(Y_{m2}(T) - Y_{m1}(T)) \big) + \sum_{j=3}^{N} \exp\big( g(Y_{mj}(T) - Y_{m1}(T)) \big) < z',

where z' is another threshold. After some manipulation, we have

\exp\big( g(Y_{m2}(T) - Y_{m1}(T)) \big) \Big[ 1 + \sum_{j=3}^{N} \exp\big( g(Y_{mj}(T) - Y_{m2}(T)) \big) \Big] < z'. \qquad (C.2)

Since the Y_i(T) are sums of samples from continuous normal distributions, Y_{m2}(T) > Y_{m3}(T) with probability 1 for T > 0. Therefore, as g → ∞, the expressions g(Y_{mj}(T) − Y_{m2}(T)) → −∞, and the content of the square brackets in equation C.2 converges to 1. Thus, taking the logarithm and dividing by −g gives Y_{m1}(T) − Y_{m2}(T) > z'' (where z'' is also a threshold). Hence, as g goes to infinity, the proposed model makes a decision whenever the difference between the two largest Y_i(T) exceeds a threshold, which is equivalent to MSPRTb.
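This limiting behavior is easy to see numerically. In the sketch below (our illustration, with arbitrary accumulator values), −ln(−L_m1(T))/g approaches Y_m1(T) − Y_m2(T) as g grows, so thresholding the largest L_i(T) becomes equivalent to thresholding the difference of the two largest accumulators.

```python
import numpy as np

rng = np.random.default_rng(2)
Y = np.sort(rng.normal(size=5))[::-1]      # ordered accumulators, Y[0] largest

for g in [1.0, 10.0, 100.0]:
    eps = np.exp(g * (Y[1:] - Y[0])).sum() # sum of exp(g(Y_mj - Y_m1)), j >= 2
    L_m1 = -np.log1p(eps)                  # largest log posterior, equation 2.2
    print(g, -np.log(-L_m1) / g, Y[0] - Y[1])  # middle value -> Y_m1 - Y_m2
```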
C.2 Underestimation of Gain. In this section we show that as g → 0, the model converges to an approximation of the UM model. This model consists of N mutually inhibiting leaky integrators whose dynamics are described by

\dot{u}_i(T) = x_i(T) - k u_i(T) - w \sum_{j=1,\, j\neq i}^{N} u_j(T), \qquad u_i(0) = 0, \qquad (C.3)
where k determines a leakage rate constant and w is the weight of inhibitory connections between the integrators. McMillen and Holmes (2006) showed that the performance of the UM model is optimized when k = w and both go to infinity. In this case, putting Y_i(T) = \int_0^T x_i(t)\, dt, the decision is made whenever any of the following differences,

LUM_i(T) = Y_i(T) - \frac{1}{N} \sum_{j=1}^{N} Y_j(T),
exceeds a decision threshold (McMillen & Holmes, 2006). We now show that exactly the same criterion for a decision applies in the proposed basal ganglia model when g → 0. For small g, we may approximate the right-hand side of equation 2.2 of the main text by retaining only linear terms in Taylor expansions. Thus, working on the exponential,

L_i(T) \to g Y_i(T) - \ln \sum_{k=1}^{N} (1 + g Y_k(T)) = g Y_i(T) - \ln \Big( N + g \sum_{k=1}^{N} Y_k(T) \Big),
and then the ln() function,

L_i(T) \to g Y_i(T) - \ln N - \frac{g}{N} \sum_{k=1}^{N} Y_k(T).
A decision will be made whenever any L_i(T) exceeds a threshold z, that is, when

g Y_i(T) - \ln N - \frac{g}{N} \sum_{k=1}^{N} Y_k(T) > z.
Moving ln N to the right-hand side and dividing by g, we get the following condition for decision:

Y_i(T) - \frac{1}{N} \sum_{k=1}^{N} Y_k(T) > z'.

Hence, as g → 0, the proposed basal ganglia model makes a decision under the same conditions as the UM model with optimal values of inhibition and decay (w = k, and w, k → ∞).
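The opposite limit can be checked the same way; in this sketch (ours, with arbitrary accumulator values), (L_i(T) + ln N)/g approaches the UM decision variable Y_i(T) − (1/N)Σ_k Y_k(T) as g shrinks.

```python
import numpy as np

rng = np.random.default_rng(3)
Y = rng.normal(size=6)                     # accumulator values

for g in [1.0, 0.1, 0.001]:
    y = g * Y
    L = y - np.log(np.exp(y).sum())        # equation 2.2
    deviation = (L + np.log(len(Y))) / g - (Y - Y.mean())
    print(g, np.abs(deviation).max())      # shrinks roughly linearly in g
```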
Appendix D: Model with Competing Cortical Integrators

Let us consider the UM model described in section C.2, in which decay is equal to inhibition (w = k; this is one of the conditions required for optimal performance of the UM model, see above) and the integrators have optimal gain. Then equation C.3 becomes

\dot{u}_i(T) = g^* x_i(T) - w \sum_{j=1}^{N} u_j(T), \qquad u_i(0) = 0. \qquad (D.1)
Note that in the above case, all the integrators receive exactly the same inhibition; hence, in the limit of intervals between samples going to 0, the relationship between the variables y_i of the MSPRT model and u_i of the UM model becomes

\forall i \quad y_i(T) = u_i(T) + c(T),
that is, the variables differ by a term c(T) which, although it varies over time (it will typically be negative), is the same for all integrators. However, we now show that if the same term c(T) is added to the cortical firing rate of all integrators, the activity of the output nuclei does not change. Thus,

OUT_i(T) = -(y_i(T) + c(T)) + \ln \sum_{j=1}^{N} \exp(y_j(T) + c(T))
         = -y_i(T) - c(T) + \ln \Big( \exp(c(T)) \sum_{j=1}^{N} \exp(y_j(T)) \Big)
         = -y_i(T) + \ln \sum_{j=1}^{N} \exp(y_j(T)).
Since the activity of the output nuclei does not change, a version of the MSPRT model in which simple cortical integration is replaced by the UM model of equation D.1 achieves the same decision times (for any fixed error rate) as the original MSPRT model. The above also resolves a correspondence problem that arises in the MSPRT model because it assumes that at the beginning of the decision process $y_i = 0$, whereas the baseline firing rate of cortical neurons is nonzero.
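This invariance is straightforward to confirm numerically; the snippet below (with made-up illustrative values) does so:

    import numpy as np

    def out(y):
        # OUT_i(T) = -y_i + ln(sum_j exp(y_j)), computed for all i at once
        return -y + np.log(np.sum(np.exp(y)))

    y = np.array([0.2, 0.5, 0.1])            # hypothetical integrator states
    c = -0.3                                 # common offset, typically negative
    print(np.allclose(out(y), out(y + c)))   # prints True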
Acknowledgments

This work was supported by EPSRC grants EP/C516303/1 and EP/C514416/1. We thank Atsushi Nambu for providing data used in Figure 2I. We thank Philip Holmes, Jonathan D. Cohen, Paul Overton, Sander Nieuwenhuis, Eric Shea-Brown, and Tobias Larsen for reading an earlier version of the manuscript and for very useful comments, and Peter Redgrave for discussion.

References

Alexander, G. E., & Crutcher, M. D. (1990). Functional architecture of basal ganglia circuits: Neural substrates of parallel processing. Trends Neurosci., 13(7), 266–271.
Alexander, G. E., DeLong, M. R., & Strick, P. L. (1986). Parallel organization of functionally segregated circuits linking basal ganglia and cortex. Annu. Rev. Neurosci., 9, 357–381.
Amalric, M., & Koob, G. F. (1987). Depletion of dopamine in the caudate nucleus but not in nucleus accumbens impairs reaction-time performance in rats. J. Neurosci., 7(7), 2129–2134.
Amalric, M., Moukhles, H., Nieoullon, A., & Daszuta, A. (1995). Complex deficits on reaction time performance following bilateral intrastriatal 6-OHDA infusion in the rat. Eur. J. Neurosci., 7(5), 972–980.
Ashby, F. G., Alfonso-Reese, L. A., Turken, A. U., & Waldron, E. M. (1998). A neuropsychological theory of multiple systems in category learning. Psychol. Rev., 105, 442–481.
Ashby, F. G., & Spiering, B. J. (2004). The neurobiology of category learning. Behav. Cogn. Neurosci. Rev., 3(2), 101–113.
Barnard, G. (1946). Sequential tests in industrial statistics. Journal of the Royal Statistical Society Supplement, 8, 1–26.
Baum, C. W., & Veeravalli, V. V. (1994). A sequential procedure for multihypothesis testing. IEEE Transactions on Information Theory, 40, 1996–2007.
Bevan, M. D., Smith, A. D., & Bolam, J. P. (1996). The substantia nigra as a site of synaptic integration of functionally diverse information arising from the ventral pallidum and the globus pallidus in the rat. Neuroscience, 75(1), 5–12.
Bogacz, R. (2006). Optimal decision networks with distributed representation. Manuscript submitted for publication.
Bogacz, R., Brown, E., Moehlis, J., Holmes, P., & Cohen, J. D. (2006). The physics of optimal decision making: A formal analysis of models of performance in two-alternative forced choice tasks. Psychol. Rev., 113, 700–765.
Britten, K. H., Shadlen, M. N., Newsome, W. T., & Movshon, J. A. (1993). Responses of neurons in macaque MT to stochastic motion signals. Vis. Neurosci., 10(6), 1157–1169.
Brown, E., Gao, J., Holmes, P., Bogacz, R., Gilzenrat, M., & Cohen, J. D. (2005). Simple networks that optimize decisions. International Journal of Bifurcations and Chaos, 15, 803–826.
Brown, J. W., Bullock, D., & Grossberg, S. (2004). How laminar frontal cortex and basal ganglia circuits interact to control planned and reactive saccades. Neural Netw., 17(4), 471–510.
Chevalier, G., Vacher, S., Deniau, J. M., & Desban, M. (1985). Disinhibition as a basic process in the expression of striatal functions. I. The striato-nigral influence on tecto-spinal/tecto-diencephalic neurons. Brain Res., 334(2), 215–226.
Crutcher, M. D., & DeLong, M. R. (1984a). Single cell studies of the primate putamen. I. Functional organization. Exp. Brain Res., 53(2), 233–243.
Crutcher, M. D., & DeLong, M. R. (1984b). Single cell studies of the primate putamen. II. Relations to direction of movement and pattern of muscular activity. Exp. Brain Res., 53(2), 244–258.
Dayan, P. (2001). Levels of analysis in neural modeling. In R. A. Wilson & F. C. Keil (Eds.), Encyclopedia of cognitive science. London: Macmillan.
Deniau, J. M., & Chevalier, G. (1985). Disinhibition as a basic process in the expression of striatal functions. II. The striato-nigral influence on thalamocortical cells of the ventromedial thalamic nucleus. Brain Res., 334(2), 227–233.
Doya, K. (2000). Complementary roles of basal ganglia and cerebellum in learning and motor control. Curr. Opin. Neurobiol., 10(6), 732–739.
Dragalin, V. P., Tartakovsky, A. G., & Veeravalli, V. V. (1999). Multihypothesis sequential probability ratio tests. Part I: Asymptotic optimality. IEEE Transactions on Information Theory, 45, 2448–2461.
Faull, R. L., & Mehler, W. R. (1978). The cells of origin of nigrotectal, nigrothalamic and nigrostriatal projections in the rat. Neuroscience, 3(11), 989–1002.
Frank, M. J. (2005). When and when not to use your subthalamic nucleus: Lessons from a computational model of the basal ganglia. Paper presented at the International Workshop on Modelling Natural Action Selection, Edinburgh.
Frank, M. J., Seeberger, L. C., & O'Reilly, R. C. (2004). By carrot or by stick: Cognitive reinforcement learning in Parkinsonism. Science, 306(5703), 1940–1943.
Georgopoulos, A. P., DeLong, M. R., & Crutcher, M. D. (1983). Relations between parameters of step-tracking movements and single cell discharge in the globus pallidus and subthalamic nucleus of the behaving monkey. J. Neurosci., 3(8), 1586–1598.
Gerfen, C. R., & Wilson, C. J. (1996). The basal ganglia. In A. Bjorklund, T. Hokfelt, & L. Swanson (Eds.), Handbook of chemical neuroanatomy (Vol. 12, pp. 369–466). New York: Elsevier.
Gerfen, C. R., & Young, W. S. III. (1988). Distribution of striatonigral and striatopallidal peptidergic neurons in both patch and matrix compartments: An in situ hybridization histochemistry and fluorescent retrograde tracing study. Brain Res., 460(1), 161–167.
Gold, J. I., & Shadlen, M. N. (2001). Neural computations that underlie decisions about sensory stimuli. Trends Cogn. Sci., 5(1), 10–16.
Gold, J. I., & Shadlen, M. N. (2002). Banburismus and the brain: Decoding the relationship between sensory stimuli, decisions, and reward. Neuron, 36(2), 299–308.
Gold, J. I., & Shadlen, M. N. (2003). The influence of behavioral context on the representation of a perceptual decision in developing oculomotor commands. J. Neurosci., 23(2), 632–651.
Gurney, K., Humphries, M., Wood, R., Prescott, T. J., & Redgrave, P. (2004). Testing computational hypotheses of brain systems function: A case study with the basal ganglia. Network, 15(4), 263–290.
Gurney, K., Prescott, T. J., & Redgrave, P. (2001a). A computational model of action selection in the basal ganglia. I. A new functional anatomy. Biol. Cybern., 84(6), 401–410.
Gurney, K., Prescott, T. J., & Redgrave, P. (2001b). A computational model of action selection in the basal ganglia. II. Analysis and simulation of behaviour. Biol. Cybern., 84(6), 411–423.
Gurney, K., Prescott, T. J., Wickens, J. R., & Redgrave, P. (2004). Computational models of the basal ganglia: From robots to membranes. Trends Neurosci., 27(8), 453–459.
Hallworth, N. E., Wilson, C. J., & Bevan, M. D. (2003). Apamin-sensitive small conductance calcium-activated potassium channels, through their selective coupling to voltage-gated calcium channels, are critical determinants of the precision, pace, and pattern of action potential generation in rat subthalamic nucleus neurons in vitro. J. Neurosci., 23(20), 7525–7542.
Humphries, M. D., & Gurney, K. N. (2002). The role of intra-thalamic and thalamocortical circuits in action selection. Network, 13(1), 131–156.
Humphries, M. D., Stewart, R. D., & Gurney, K. N. (in press). A physiologically plausible model of action selection and oscillatory activity in the basal ganglia. J. Neurosci.
Kha, H. T., Finkelstein, D. I., Tomas, D., Drago, J., Pow, D. V., & Horne, M. K. (2001). Projections from the substantia nigra pars reticulata to the motor thalamus of the rat: Single axon reconstructions and immunohistochemical study. J. Comp. Neurol., 440(1), 20–30.
Kim, J. N., & Shadlen, M. N. (1999). Neural correlates of a decision in the dorsolateral prefrontal cortex of the macaque. Nat. Neurosci., 2(2), 176–185.
Kita, H., & Kitai, S. T. (1991). Intracellular study of rat globus pallidus neurons: Membrane properties and responses to neostriatal, subthalamic and nigral stimulation. Brain Res., 564(2), 296–305.
Laming, D. R. J. (1968). Information theory of choice reaction time. New York: Wiley.
McMillen, T., & Holmes, P. (2006). The dynamics of choice among multiple alternatives. Journal of Mathematical Psychology, 50, 30–57.
Medina, L., & Reiner, A. (1995). Neurotransmitter organization and connectivity of the basal ganglia in vertebrates: Implications for the evolution of basal ganglia. Brain Behav. Evol., 46(4–5), 235–258.
Miller, E. K., & Cohen, J. D. (2001). An integrative theory of prefrontal cortex function. Annu. Rev. Neurosci., 24, 167–202.
Mink, J. W. (1996). The basal ganglia: Focused selection and inhibition of competing motor programs. Prog. Neurobiol., 50(4), 381–425.
Montague, P. R., Dayan, P., & Sejnowski, T. J. (1996). A framework for mesencephalic dopamine systems based on predictive Hebbian learning. J. Neurosci., 16(5), 1936–1947.
Nakano, K., Kayahara, T., Tsutsumi, T., & Ushiro, H. (2000). Neural circuits and functional organization of the striatum. J. Neurol., 247(Suppl. 5), V1–V15.
Nambu, A., & Llinas, R. (1994). Electrophysiology of globus pallidus neurons in vitro. J. Neurophysiol., 72(3), 1127–1139.
Nambu, A., & Llinas, R. (1997). Morphology of globus pallidus neurons: Its correlation with electrophysiology in guinea pig brain slices. J. Comp. Neurol., 377(1), 85–94.
Obeso, J. A., Rodriguez-Oroz, M. C., Rodriguez, M., Lanciego, J. L., Artieda, J., Gonzalo, N., & Olanow, C. W. (2000). Pathophysiology of the basal ganglia in Parkinson's disease. Trends Neurosci., 23(10 Suppl.), S8–S19.
O'Doherty, J., Dayan, P., Schultz, J., Deichmann, R., Friston, K., & Dolan, R. J. (2004). Dissociable roles of ventral and dorsal striatum in instrumental conditioning. Science, 304(5669), 452–454.
Okamoto, H., Isomura, Y., Takada, M., & Fukai, T. (2005). Temporal integration by stochastic dynamics of a recurrent network of bistable neurons. Paper presented at the Computational Cognitive Neuroscience meeting, Washington, DC.
O'Reilly, R. C., & Frank, M. J. (2006). Making working memory work: A computational model of learning in the frontal cortex and basal ganglia. Neural Computation, 18, 283–328.
Overton, P. G., & Greenfield, S. A. (1995). Determinants of neuronal firing pattern in the guinea-pig subthalamic nucleus: An in vivo and in vitro comparison. J. Neural Transm. Park. Dis. Dement. Sect., 10(1), 41–54.
Parent, A., & Hazrati, L. N. (1993). Anatomical aspects of information processing in primate basal ganglia. Trends Neurosci., 16(3), 111–116.
Parent, A., & Hazrati, L. N. (1995a). Functional anatomy of the basal ganglia. I. The cortico-basal ganglia-thalamo-cortical loop. Brain Res. Rev., 20(1), 91–127.
Parent, A., & Hazrati, L. N. (1995b). Functional anatomy of the basal ganglia. II. The place of subthalamic nucleus and external pallidum in basal ganglia circuitry. Brain Res. Rev., 20(1), 128–154.
Parent, A., & Smith, Y. (1987). Organization of efferent projections of the subthalamic nucleus in the squirrel monkey as revealed by retrograde labeling methods. Brain Res., 436(2), 296–310.
Parthasarathy, H. B., Schall, J. D., & Graybiel, A. M. (1992). Distributed but convergent ordering of corticostriatal projections: Analysis of the frontal eye field and the supplementary eye field in the macaque monkey. J. Neurosci., 12(11), 4468–4488.
Platt, M. L., & Glimcher, P. W. (1999). Neural correlates of decision variables in parietal cortex. Nature, 400(6741), 233–238.
Prescott, T. J., Montes-Gonzalez, F. M., Gurney, K., Humphries, M. D., & Redgrave, P. (2006). A robot model of the basal ganglia: Behavior and intrinsic processing. Neural Networks, 19(1), 31–61.
Ratcliff, R. (1978). A theory of memory retrieval. Psychol. Rev., 85, 59–108.
Ratcliff, R., Van Zandt, T., & McKoon, G. (1999). Connectionist and diffusion models of reaction time. Psychol. Rev., 106, 261–300.
Redgrave, P., Prescott, T. J., & Gurney, K. (1999). The basal ganglia: A vertebrate solution to the selection problem? Neuroscience, 89(4), 1009–1023.
Samejima, K., Ueda, Y., Doya, K., & Kimura, M. (2005). Representation of action-specific reward values in the striatum. Science, 310(5752), 1337–1340.
Schall, J. D. (2001). Neural basis of deciding, choosing and acting. Nat. Rev. Neurosci., 2(1), 33–42.
Schultz, W., Dayan, P., & Montague, P. R. (1997). A neural substrate of prediction and reward. Science, 275(5306), 1593–1599.
Shadlen, M. N., & Newsome, W. T. (2001). Neural basis of a perceptual decision in the parietal cortex (area LIP) of the rhesus monkey. J. Neurophysiol., 86(4), 1916–1936.
Shadmehr, R., & Holcomb, H. H. (1997). Neural correlates of motor memory consolidation. Science, 277(5327), 821–825.
Simen, P. A., Cohen, J. D., & Holmes, P. (2006). Rapid decision threshold modulation by reward rate in a neural network. Neural Networks, 19(8), 1013–1026.
Smith, Y., Bevan, M. D., Shink, E., & Bolam, J. P. (1998). Microcircuitry of the direct and indirect pathways of the basal ganglia. Neuroscience, 86(2), 353–387.
Stafford, T., & Gurney, K. (2004). The role of response mechanisms in determining reaction time performance: Pieron's law revisited. Psychon. Bull. Rev., 11, 975–987.
Stafford, T., & Gurney, K. (in press). The basal ganglia as the selection mechanism in a cognitive task. Philos. Trans. R. Soc. Lond. B Biol. Sci.
Stewart, R. D., Gurney, K., & Humphries, M. D. (2005). A large-scale model of the sensorimotor basal ganglia: Evoked responses and selection properties. Paper presented at the Society for Neuroscience meeting, Washington, DC.
Stone, M. (1960). Models for choice reaction time. Psychometrika, 25, 251–260.
Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning. Cambridge, MA: MIT Press.
Teichner, W. H., & Krebs, M. J. (1974). Laws of visual choice reaction time. Psychol. Rev., 81, 75–98.
Usher, M., & McClelland, J. L. (2001). The time course of perceptual choice: The leaky, competing accumulator model. Psychol. Rev., 108(3), 550–592.
Vickers, D. (1970). Evidence for an accumulator model of psychophysical discrimination. Ergonomics, 13, 37–58.
Wald, A. (1947). Sequential analysis. New York: Wiley.
Wald, A., & Wolfowitz, J. (1948). Optimum character of the sequential probability ratio test. Annals of Mathematical Statistics, 19, 326–339.
Wang, X. J. (2002). Probabilistic decision making by slow reverberation in cortical circuits. Neuron, 36(5), 955–968.
Wilson, C. J. (1995). The contribution of cortical neurons to firing patterns of striatal spiny neurons. In J. C. Houk, J. L. Davis, & D. G. Beiser (Eds.), Models of information processing in the basal ganglia (pp. 22–50). Cambridge, MA: MIT Press.
Wilson, C. J., Weyrick, A., Terman, D., Hallworth, N. E., & Bevan, M. D. (2004). A model of reverse spike frequency adaptation and repetitive firing of subthalamic nucleus neurons. J. Neurophysiol., 91(5), 1963–1980.
Yu, A. J., & Dayan, P. (2005). Inference, attention, and decision in a Bayesian neural architecture. In L. K. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing systems, 17 (pp. 1577–1584). Cambridge, MA: MIT Press.
Received December 17, 2005; accepted May 19, 2006.
LETTER
Communicated by Olivier Faugeras
Realistically Coupled Neural Mass Models Can Generate EEG Rhythms

Roberto C. Sotero
[email protected]
Nelson J. Trujillo-Barreto
[email protected]

Yasser Iturria-Medina
[email protected]
Cuban Neuroscience Center, Brain Dynamics Department, Havana, Cuba

Felix Carbonell
[email protected]
Juan C. Jimenez
[email protected]
Instituto de Cibernética, Matemática y Física, Departamento de Sistemas Adaptativos, Havana, Cuba
We study the generation of EEG rhythms by means of realistically coupled neural mass models. Previous neural mass models were extended to model cortical voxels and the thalamus. Interactions between voxels of the same and of other cortical areas and with the thalamus were taken into account. Voxels within the same cortical area were coupled (short-range connections) with both excitatory and inhibitory connections, while coupling between areas (long-range connections) was considered to be excitatory only. Short-range connection strengths were modeled by using a connectivity function depending on the distance between voxels. Coupling strength parameters between areas were defined from empirical anatomical data, employing the information obtained from probabilistic paths, which were tracked by water diffusion imaging techniques and used to quantify white matter tracts in the brain. Each cortical voxel was then described by a set of 16 random differential equations, while the thalamus was described by a set of 12 random differential equations. Thus, for analyzing the neuronal dynamics emerging from the interaction of several areas, a large system of differential equations needs to be solved. The sparseness of the estimated anatomical connectivity matrix reduces the number of connection parameters substantially, making the solution of this system faster. Simulations of human brain rhythms were carried out in order to test the model. Physiologically plausible results were obtained based on this anatomically constrained neural mass model.

Neural Computation 19, 478–512 (2007)
C 2007 Massachusetts Institute of Technology
1 Introduction

We aim at describing a generative model of EEG rhythms based on anatomically constrained coupling of neural mass models. Although the biophysical mechanisms underlying oscillatory activity in neurons are well studied, the origin and function of global characteristics (such as EEG rhythms) of large populations of neurons are still unknown. In this context, computational models have proven capable of providing insight into this problem.

Two modeling approaches have been used to study the behavior of neuronal populations. One approach is based on the study of large computational neuronal networks in which each cell is realistically represented by multiple compartments for the soma, axon, and dendrites, and each synaptic connection is modeled explicitly (Traub & Miles, 1991). A disadvantage of such realistic modeling is that it requires high computational power. For this reason, simplified versions of such models, in which only a one-compartment model is taken into account, have been used (Wang, Golomb, & Rinzel, 1995; Rinzel, Terman, Wang, & Ermentrout, 1998). However, even in this case, the use of such detailed models makes it difficult to determine the influence of each model parameter on the generated average network characteristics.

The second approach is based on the use of neural mass models (NMMs). In contrast to neuronal networks, NMMs describe the dynamics of cortical columns and brain areas by using only a few parameters, but without much detail. In this approach, spatially averaged magnitudes are assumed to characterize the collective behavior of populations of neurons of a given type, instead of modeling single cells and their interactions in a detailed network (Wilson & Cowan, 1972; Lopes da Silva, Hoeks, Smits, & Zetterberg, 1974; Zetterberg, Kristiansson, & Mossberg, 1978; van Rotterdam, Lopes da Silva, van den Ende, Viergever, & Hermans, 1982; Jansen & Rit, 1995; Valdés, Jiménez, Riera, Biscay, & Ozaki, 1999; Wendling, Bellanger, Bartolomei, & Chauvel, 2000; David & Friston, 2003).

Despite their simplicity, NMMs have been very helpful in the study of EEG rhythms. For example, in order to study the origin of the alpha rhythm, Lopes da Silva et al. (1974) developed an NMM comprising two interacting populations of neurons, the thalamocortical relay cells and the inhibitory interneurons, which were interconnected by a negative feedback loop. Later, this model was extended to include both positive and negative feedback loops (Zetterberg et al., 1978; Jansen, Zouridakis, & Brandt, 1993). In Jansen and Rit (1995), alpha and beta activities were replicated by using two coupled neural mass models representing cortical areas, with delays in the interconnections between them. Wendling et al. (2000) investigated the generation of epileptic spikes by extending Jansen's model to multiple coupled areas, but explored the dynamics of only three areas for different values of the connectivity constants. David and Friston (2003) considered that a cortical area comprises several noninteracting neuronal populations, which were in turn described by Jansen's models. This model allowed them
to reproduce the whole spectrum of EEG signals by changing the coupling between two of these areas, thus showing the importance of parameters related to the coupling between different brain regions for the resulting dynamics.

Due to the lack of information about actual connectivity patterns, previous approaches have been devoted to the study of the temporal evolution of neuronal populations rather than their spatial dynamics. However, the development of techniques such as diffusion weighted magnetic resonance imaging (DWMRI) over the past 20 years has made the noninvasive study of the anatomical circuitry of the living human brain possible. This has put at our disposal extremely valuable information and gives us a unique opportunity to formulate models that take into account a more realistic view of the central nervous system.

In this article, we extend Jansen's model to characterize the dynamics of a cortical voxel. Building on this, our goal is to study the rhythms obtained when coupling several areas comprising large numbers of interconnected voxels. In this approach, voxels of the same cortical area are assumed to be coupled with excitatory and inhibitory connections (short-range connections), while connections between areas (long-range connections) are assumed to be excitatory only. This is an extension of the model of David and Friston (2003) because interactions between voxels within the same area are taken into account. Additionally, we introduce anatomical constraints on the coupling strength parameters (CSP) between different brain regions. This is accomplished by using CSPs estimated from human DWMRI data (Iturria-Medina, Canales-Rodríguez, Melié-García, & Valdés-Hernández, 2005). The thalamus is also modeled using a single NMM, mutually coupled with the cortical areas. We show that this anatomically constrained NMM can reproduce the EEG rhythms recorded at the scalp, another step in the validation of NMMs as a valuable tool for describing the activity of large populations of neurons.

The letter is organized as follows. In section 2, the NMMs used for a cortical voxel and the thalamus are formulated. Section 3 is devoted to the modeling of short-range and long-range interactions. The integrated thalamocortical model is formulated in section 4. In section 5, the observation model that links the activity in a cortical voxel with the EEG recorded on the scalp surface is presented. Section 6 describes the numerical method used for integrating the huge system of random differential equations obtained. Finally, sections 7 and 8 are devoted to a description of the simulations as well as a discussion of the results.

2 Models for a Cortical Voxel and the Thalamus

In this work, we model the neuronal activity in a cortical voxel by extending previous neural mass models that have been used to describe neural dynamics at smaller spatial scales.
Figure 1: The basic neural mass models used. (A) Model for a cortical voxel. The dynamics is obtained by the interaction of three populations of neurons: excitatory pyramidal cells, excitatory spiny stellate cells, and inhibitory interneurons (smooth stellate cells). Input to the system is modeled by the pulse density p(t). This reduces to Jansen’s model when removing the dashed-line loop. (B) Model for the thalamus. Interaction between one excitatory and one inhibitory subpopulation, representing thalamocortical and reticular-thalamic neurons, respectively.
In particular, Jansen and Rit (1995) studied the temporal dynamics of a cortical column by means of the interaction of interconnected populations of excitatory pyramidal and stellate cells as well as inhibitory interneurons. This model (see Figure 1A, except for the dashed-line loop) capitalizes on the basic circuitry of the cortical column (Mountcastle, 1978, 1997). Pyramidal cells receive feedback from
excitatory and inhibitory interneurons, which receive only excitatory input from pyramidal cells. Nevertheless, at the spatial scale of voxels, this simplified circuitry is no longer valid, and more complicated interactions need to be considered. While a cortical column has a diameter of 150 to 300 µm, the length of a voxel usually ranges from 1 to 3 mm, comprising several interconnected columns. At this spatial scale, pyramidal-to-pyramidal connections become increasingly important, accounting for the majority of intracortical fibers (Braitenberg & Schüz, 1998). In our approach, Jansen's model is extended to include this kind of connection by adding a self-excitatory loop for the pyramidal population (the dashed-line loop in Figure 1A). Zetterberg et al. (1978) had already used a generic model of this kind for describing the temporal dynamics of a neuronal mass consisting of two excitatory and one inhibitory population. Although they were able to produce signals resembling EEG background activity, their model had no specific neural substrate.

Usually in this type of modeling, the dynamics of each neuronal population is represented by two transformations (Jansen & Rit, 1995; Zetterberg et al., 1978). First, the average pulse density of action potentials entering a population is converted into an average postsynaptic membrane potential by means of linear convolutions with impulse responses $h_e(t)$ and $h_i(t)$ for the excitatory and the inhibitory case, respectively:
\[
h_e(t) = A a t e^{-a t}, \tag{2.1}
\]
\[
h_i(t) = B b t e^{-b t}. \tag{2.2}
\]
Parameters $A$ and $B$ represent the maximum amplitude of excitatory (EPSP) and inhibitory postsynaptic potentials (IPSP), respectively, while the lumped parameters $a$ and $b$ depend on passive membrane time constants and other distributed delays in the dendritic network. The second (nonlinear) transformation accounts for the conversion of the average membrane potential of the population into an average pulse density of action potentials, which is accomplished by means of a sigmoid function,
\[
S(v) = \frac{2 e_0}{1 + e^{r (v_0 - v)}}, \tag{2.3}
\]
where $2 e_0$ is the maximum firing rate, $v_0$ is the postsynaptic potential (PSP) corresponding to the firing rate $e_0$, and the parameter $r$ controls the steepness of the sigmoid function. The interaction between the different populations is characterized by five connectivity constants, $c_1, c_2, c_3, c_4, c_5$ (see Figure 1A), which represent the average number of synaptic contacts. The first four constants have the
same interpretation as in Jansen and Rit (1995) and were related in that work based on physiological arguments:
\[
c_2 = 0.8 c_1, \qquad c_3 = c_4 = 0.25 c_1. \tag{2.4}
\]
Similarly, $c_5$ accounts for interactions between pyramidal cells. Using Laplace transforms, this convolution model can easily be expressed in terms of the following system of differential equations:
\[
\begin{aligned}
\dot{y}_1(t) &= y_4(t) \\
\dot{y}_2(t) &= y_5(t) \\
\dot{y}_3(t) &= y_6(t) \\
\dot{y}_4(t) &= A a \left\{ p(t) + c_5 S(y_1(t) - y_2(t)) + c_2 S(c_1 y_3(t)) \right\} - 2 a y_4(t) - a^2 y_1(t) \\
\dot{y}_5(t) &= B b c_4 S(c_3 y_3(t)) - 2 b y_5(t) - b^2 y_2(t) \\
\dot{y}_6(t) &= A a S(y_1(t) - y_2(t)) - 2 a y_6(t) - a^2 y_3(t),
\end{aligned} \tag{2.5}
\]
where $y_1(t)$ and $y_3(t)$ are average EPSPs and $y_2(t)$ is the average IPSP (see Figure 1A).

From classical electrophysiology, pyramidal cells are considered to be the main contributors to the EEG measured at the scalp surface. This is known to stem from two basic reasons: (1) the specific location of excitatory and inhibitory synaptic inputs on the pyramidal neuron and (2) the spatial organization of these cells in the form of a palisade oriented perpendicular to the cortex (Niedermeyer & Lopes da Silva, 1987; Nunez, 1981). That is, afferent excitatory connections make contact predominantly on apical dendrites, creating an extracellular negative potential (a sink of current due to Na+ entering the cell), while inhibitory synapses reach mainly the soma of the neuron, creating an extracellular positive potential (a source of current due to Cl- entering the cell). Thus, at a macroscopic level, the potential field generated by a synchronously activated palisade of neurons behaves like that of a dipole layer and can be considered the basis of EEG generation. Note that in our approach, $y_1(t)$ and $y_2(t)$ are the result of excitatory and inhibitory synapses acting on the pyramidal cell population. Thus, the difference $y_1(t) - y_2(t)$ can be taken, to a fairly good approximation, to be proportional to the EEG signal (Jansen & Rit, 1995).

The input to the model in Figure 1A is the process $p(t)$, which represents the basal stochastic activity within the voxel plus the extrinsic input. This extrinsic input does not include afferences from voxels of the same and other cortical areas; rather, it models specific stimulus-dependent input that modulates the activity of the neuronal population.
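To make system 2.5 concrete, the following Python sketch integrates it with a simple Euler scheme. It is only an illustration: the step size, duration, input statistics, and the value of $c_5$ are our own assumptions (no single standard value of $c_5$ is given in Table 1 below), while the remaining parameters follow the alpha-rhythm entries of that table.

    import numpy as np

    A, B = 3.25, 22.0            # synaptic gains (mV)
    a, b = 100.0, 50.0           # lumped rate constants (1/s)
    e0, v0, r = 5.0, 6.0, 0.56   # sigmoid parameters
    c1 = 150.0
    c2, c3, c4 = 0.8 * c1, 0.25 * c1, 0.25 * c1
    c5 = 10.0                    # pyramidal self-excitation (assumed value)

    def S(v):
        return 2.0 * e0 / (1.0 + np.exp(r * (v0 - v)))

    dt, T = 1e-4, 2.0            # Euler step and duration (assumed)
    rng = np.random.default_rng(0)
    y = np.zeros(6)
    eeg = []
    for _ in range(int(T / dt)):
        p = rng.normal(220.0, 22.0)   # pulse-density input (assumed statistics)
        y1, y2, y3, y4, y5, y6 = y
        dy = np.array([
            y4,
            y5,
            y6,
            A * a * (p + c5 * S(y1 - y2) + c2 * S(c1 * y3)) - 2 * a * y4 - a**2 * y1,
            B * b * c4 * S(c3 * y3) - 2 * b * y5 - b**2 * y2,
            A * a * S(y1 - y2) - 2 * a * y6 - a**2 * y3,
        ])
        y = y + dt * dy
        eeg.append(y[0] - y[1])       # EEG proxy: y1(t) - y2(t), as in the text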
Additionally, our approach capitalizes on specific properties of the thalamic nuclei in order to model the contribution of this subcortical structure, which is the main relay station of sensory input on its way to the cortex. The thalamus comprises two major populations of neurons: thalamocortical (TC) excitatory relay neurons and thalamic inhibitory reticular (RE) neurons. Only the axons of TC neurons project to the cortex, while RE neurons form an inhibitory network that surrounds the thalamus and projects inhibitory connections to TC neurons exclusively (Destexhe & Sejnowski, 2001). This circuitry allowed us to formulate the mass model shown in Figure 1B, which Lopes da Silva et al. (1974) used for simulating the alpha rhythm. Note that the model is used to describe the dynamics of the thalamus as a whole, without taking into account any subdivision into voxels. The motivation for this was the physical separation of the two types of neuronal populations that form the thalamus. On the other hand, the absence of any preferential orientation of the neurons in the thalamus, together with the depth of this structure, has led to the idea that it does not generate any measurable voltage at the scalp surface. Thus, in our approach, no direct thalamic contribution to the EEG is considered. A more sophisticated model of thalamic activity could be used to elucidate its actual contribution to EEG generation, but that would entail a detailed knowledge of its microscopic cytoarchitectonic organization, which we lack at the moment.
3 Short-Range versus Long-Range Interactions

If we want to model the dynamics of macroscopic cortical areas, we have to augment the model described above to account for the specific neuronal interactions in the brain. In the cytoarchitectonic organization of the cerebral cortex, two types of connections can be distinguished: short-range and long-range connections (SRCs and LRCs, respectively). SRCs are made by cortical neurons with local branches that can reach a maximum of about 10 mm (Braitenberg, 1978; Abeles, 1991; Jirsa & Haken, 1997), corresponding to the length of the longest horizontal axon collaterals plus the length of the longest dendrites. LRCs are mainly formed by axons of pyramidal cells that cross the white matter and connect cortical regions that can be up to 100 mm apart (Abeles, 1991; Jirsa & Haken, 1997).

Several differences between these two systems can be established, not only in their characteristic lengths but also in their specificity and function (Abeles, 1991; Braitenberg & Schüz, 1998). While SRCs can be described in statistical terms, with an effective connection strength that diminishes exponentially with the distance between neuronal populations, LRCs do not show such a regular and smooth pattern but have a high degree of specificity. Additionally, LRCs terminate preferentially on the apical dendrites of pyramidal cells, while local collaterals terminate more on basal dendrites. In the following sections, we describe how this differentiation of cortical interactions is incorporated into our model.
3.1 Within-Area Interactions

First, the SRCs that mediate neuronal interactions within a cortical area will be described. A diagram of the proposed augmented model for a given voxel is shown in Figure 2A. We assume that voxels of the same area are coupled by means of both excitatory and inhibitory connections, where the pyramidal cell population is the target of all the input to the system. Excitatory input comes from the pyramidal and spiny stellate cell populations, with coupling strengths described by connectivity functions that depend on the distance $|x_m - x_j|$ between voxels (Jirsa & Haken, 1996, 1997):
\[
k_{e1}^{mj} = \frac{1}{2 \sigma_{e1}} e^{-\frac{|x_m - x_j|}{\sigma_{e1}}}, \tag{3.1}
\]
\[
k_{e2}^{mj} = \frac{1}{2 \sigma_{e2}} e^{-\frac{|x_m - x_j|}{\sigma_{e2}}}. \tag{3.2}
\]
In equations 3.1 and 3.2, $k_{e1}^{mj}$ and $k_{e2}^{mj}$ are the strengths of the connections that pyramidal cells and excitatory interneurons of voxel $m$ make on pyramidal cells of voxel $j$, respectively. Delays in these connections can be modeled by linear transformations similar to equation 2.1, as in Jansen and Rit (1995). We use $h_{d2}(t)$ and $h_{d3}(t)$, respectively:
\[
h_{d2}(t) = A a_{d2} t e^{-a_{d2} t}, \tag{3.3}
\]
\[
h_{d3}(t) = A a_{d3} t e^{-a_{d3} t}. \tag{3.4}
\]
Similar equations are formulated for within-area inhibitory connections, which are considered by means of the connection strength $k_i^{mj}$ that inhibitory interneurons (smooth stellate cells) of voxel $m$ exert on pyramidal cells of voxel $j$, that is,
\[
k_i^{mj} = \frac{1}{2 \sigma_i} e^{-\frac{|x_m - x_j|}{\sigma_i}}, \tag{3.5}
\]
\[
h_{d4}(t) = B b_{d4} t e^{-b_{d4} t}. \tag{3.6}
\]
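The shape of these kernels can be illustrated with a few lines of Python. This is our sketch, not part of the model code: the one-dimensional grid is an assumption, with the grid spacing borrowed from the 4.25 mm voxel size used later in section 7.1, and the $\sigma$ values taken from Table 1.

    import numpy as np

    sigma_e1, sigma_e2, sigma_i = 10.0, 10.0, 2.0   # Table 1 values
    x = np.arange(0.0, 30.0, 4.25)                  # voxel positions along one axis (mm)
    d = np.abs(x[:, None] - x[None, :])             # |x_m - x_j| distance matrix

    k_e1 = np.exp(-d / sigma_e1) / (2 * sigma_e1)   # pyramidal -> pyramidal (eq. 3.1)
    k_e2 = np.exp(-d / sigma_e2) / (2 * sigma_e2)   # spiny stellate -> pyramidal (eq. 3.2)
    k_i  = np.exp(-d / sigma_i)  / (2 * sigma_i)    # smooth stellate -> pyramidal (eq. 3.5)

Because $\sigma_i$ is much smaller than $\sigma_{e1}$ and $\sigma_{e2}$, the inhibitory kernel is sharply localized while the excitatory kernels extend over several voxels.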
3.2 Between-Area Interactions

Interactions between distant cortical areas are determined by LRCs, which in our approach are mediated by pyramidal-to-pyramidal cell connections. There is no clarity about the rules governing the system of LRCs (Braitenberg & Schüz, 1998). Nevertheless, the emergence of new imaging techniques like diffusion weighted magnetic resonance imaging (DWMRI) in the past few years has opened a window on the study of this type of connection in the living human brain. DWMRI is based on the fact that the trajectories followed by water molecules during diffusion processes reflect the microscopic environment of brain tissues. Nervous fibers can then be characterized by tracing probabilistic paths
Figure 2: The thalamocortical model. (A) Input-output relationship for the neural mass model of a cortical voxel. A cortical voxel receives excitatory afferences from pyramidal cells and excitatory stellate cells of voxels of the same area, from pyramidal cells of other cortical areas, and from TC relay neurons. Inhibitory afferences are received from inhibitory interneurons of voxels of the same area. The pulse density p(t) accounts for the basal stochastic activity within the voxel. (B) Input-output relationship for the neural mass model of the thalamus. TC relay neurons send axonal projections to the cortex, where they make synapses on the three populations of neurons considered. TC cells receive backward connections from cortical pyramidal neurons. The pulse density pth (t) accounts for the background stochastic activity within the thalamus plus stimulus related inputs.
that represent the preferred direction of water diffusion along the nervous fibers in each voxel. Finally, measures of anatomical connectivity reflecting the interrelation between voxels and areas can be defined and calculated based on these paths. In this article, the LRC strength $K^{i,n}$ between two areas $i$ and $n$ will be estimated from actual DWMRI data.

Most work in the field of DWMRI has defined measures of between-voxel rather than between-area anatomical connectivity. For example, for characterizing the probability of connection of a point R with a point P, one can use the number of probabilistic paths leaving P and entering R divided by the number of probabilistic paths generated in P (Behrens et al., 2003; Hagmann et al., 2003; Parker, Wheeler-Kingshott, & Barker, 2002; Koch, Norris, & Hund-Georgiadis, 2002). In Tuch (2002), the probability of connection between two points is estimated as the probability of the most probable path joining both points, while in Parker et al. (2002) and Tournier, Calamante, Gadian, and Connelly (2003), it is taken as the probability of the path that minimizes a cost function depending on the diffusion data. However, diffusion data generally show high isotropy within the gray matter, giving poor information about the distribution of nerve fibers. Consequently, in most cases it is not convenient to trace probabilistic paths within gray matter areas.

Another approach for studying connectivity in the brain is to use only the points at the boundaries of gray matter regions, characterizing connection strengths between zones and not between the individual voxels comprising them. In this line of work, Iturria-Medina, Canales-Rodríguez et al. (2005) and Iturria-Medina, Valdés-Hernández et al. (2005) proposed a model for obtaining anatomical connection strengths (ACS) between brain zones. In their method, the ACS between two regions A and B ($C_{A \leftrightarrow B}$) was defined as being proportional to the cross-sectional area of the total cylindrical-shaped volume of the nervous fibers shared by these regions. The ACS was then calculated as the area occupied by the connector volume over the surfaces of the connected zones,
\[
C_{A \leftrightarrow B} = A_{T(A,B)} + B_{T(A,B)}, \tag{3.7}
\]
where $A_{T(A,B)}$ and $B_{T(A,B)}$ represent the areas defined by the fibers (both afferent and efferent) joining A with B over the surfaces of A and B, respectively. These areas were evaluated from a DWMRI data set by measuring the number of target voxels on each surface of the generated probabilistic paths that simulate fiber trajectories, under the assumption that those paths are contained inside the connector volume. Notice that methods for reconstructing fiber paths using DWMRI data cannot distinguish between excitatory and inhibitory connections. However, in the proposed model, LRCs are considered to be excitatory only. This assumption is consistent with physiological and anatomical studies demonstrating that pyramidal cells of layer III are the main source of the
feedforward projections to other cortical areas (Gilbert & Wiesel, 1979, 1983; Martin & Whitteridge, 1984; McGuire, Gilbert, Rivlin, & Wiesel, 1991; Douglas & Martin, 2004). Thus, information obtained from DWMRI data can be used as a measure of the CSP ($K^{i,n}$) between any two given areas $i$ and $n$. For modeling delays in long-range connections, we use
\[
h_{d1}(t) = A a_{d1} t e^{-a_{d1} t}. \tag{3.8}
\]
Like parameters $a$ and $b$ of Jansen's model, the constants $a_{d1}$, $a_{d2}$, $a_{d3}$, and $b_{d4}$ account for passive membrane time constants and other distributed delays in the dendritic network.

4 Thalamocortical Model

In this section we present the final mathematical formulation of the proposed thalamocortical model, with SRCs and LRCs included. As shown in Figure 2A, intra- and intervoxel interactions are described by nine connectivity constants, $c_1$ to $c_9$. The first five constants, $c_1$ to $c_5$, have already been defined in section 2. Constant $c_6$ represents the number of synapses made by pyramidal cells of one voxel on voxels of other areas, $c_7$ is the number of synapses made by pyramidal cells on voxels of the same area, and constants $c_8$ and $c_9$ are the numbers of synapses made by excitatory and inhibitory interneurons, respectively, on pyramidal cells of voxels within the same area. Thus, the dynamics of the neuronal activity in a cortical voxel is described by a system of 16 differential equations:
\[
\begin{aligned}
\dot{y}_1^{nj}(t) &= y_9^{nj}(t) \qquad \dot{y}_2^{nj}(t) = y_{10}^{nj}(t) \qquad \dot{y}_3^{nj}(t) = y_{11}^{nj}(t) \qquad \dot{y}_4^{nj}(t) = y_{12}^{nj}(t) \\
\dot{y}_5^{nj}(t) &= y_{13}^{nj}(t) \qquad \dot{y}_6^{nj}(t) = y_{14}^{nj}(t) \qquad \dot{y}_7^{nj}(t) = y_{15}^{nj}(t) \qquad \dot{y}_8^{nj}(t) = y_{16}^{nj}(t) \\
\dot{y}_9^{nj}(t) &= Aa \left[ c_5 S\left( y_1^{nj}(t) - y_2^{nj}(t) \right) + c_2 S\left( y_3^{nj}(t) \right) + K^{th,n} c_{3t} S(x_4(t)) \right] \\
&\quad + Aa \sum_{\substack{m=1 \\ m \neq j}}^{M_n} \left[ k_{e1}^{mj} c_7 S\left( y_6^{nm}(t) \right) + k_{e2}^{mj} c_8 S\left( y_7^{nm}(t) \right) \right] + Aa \sum_{\substack{i=1 \\ i \neq n}}^{N} K^{i,n} \sum_{m=1}^{M_i} c_6 S\left( y_5^{im}(t) \right) \\
&\quad - 2 a y_9^{nj}(t) - a^2 y_1^{nj}(t) + Aa\, p^{nj}(t) \\
\dot{y}_{10}^{nj}(t) &= Bb \left[ c_4 S\left( y_4^{nj}(t) \right) + \sum_{\substack{m=1 \\ m \neq j}}^{M_n} k_i^{mj} c_9 S\left( y_8^{nm}(t) \right) \right] - 2 b y_{10}^{nj}(t) - b^2 y_2^{nj}(t) \\
\dot{y}_{11}^{nj}(t) &= Aa \left[ c_1 S\left( y_1^{nj}(t) - y_2^{nj}(t) \right) + K^{th,n} c_{4t} S(x_5(t)) \right] - 2 a y_{11}^{nj}(t) - a^2 y_3^{nj}(t) \\
\dot{y}_{12}^{nj}(t) &= Aa \left[ c_3 S\left( y_1^{nj}(t) - y_2^{nj}(t) \right) + K^{th,n} c_{5t} S(x_6(t)) \right] - 2 a y_{12}^{nj}(t) - a^2 y_4^{nj}(t) \\
\dot{y}_{13}^{nj}(t) &= A a_{d1} S\left( y_1^{nj}(t) - y_2^{nj}(t) \right) - 2 a_{d1} y_{13}^{nj}(t) - a_{d1}^2 y_5^{nj}(t) \\
\dot{y}_{14}^{nj}(t) &= A a_{d2} S\left( y_1^{nj}(t) - y_2^{nj}(t) \right) - 2 a_{d2} y_{14}^{nj}(t) - a_{d2}^2 y_6^{nj}(t) \\
\dot{y}_{15}^{nj}(t) &= A a_{d3} S\left( y_3^{nj}(t) \right) - 2 a_{d3} y_{15}^{nj}(t) - a_{d3}^2 y_7^{nj}(t) \\
\dot{y}_{16}^{nj}(t) &= B b_{d4} S\left( y_4^{nj}(t) \right) - 2 b_{d4} y_{16}^{nj}(t) - b_{d4}^2 y_8^{nj}(t),
\end{aligned} \tag{4.1}
\]
where the superscript $nj$ refers to voxel $j$ in area $n$. The system has extrinsic inputs given by connections from other cortical voxels and from the thalamus, as well as an intrinsic input $p^{nj}(t)$ that represents baseline stochastic activity within the voxel. The output is taken as $y_1^{nj}(t) - y_2^{nj}(t)$, which from section 2 can be considered proportional to the generated EEG signal. Note that close voxels separated by the boundary of two neighboring regions are considered to be coupled through LRCs solely, even if they are close enough for the SRCs to be important. This simplification can easily be avoided by adding the relevant terms to the equations for $\dot{y}_9^{nj}$ and $\dot{y}_{10}^{nj}$ in equation 4.1. Nevertheless, in the calculations shown here, all anatomical areas were chosen large enough, and the activation regions within each area were located distant enough from each other, that boundary effects could be neglected. This allowed us to keep the equations simple for the sake of a better understanding of the model.

In the case of the thalamus, the main output of the system is mediated by TC cells, which are assumed to send axonal projections to the three populations of neurons in the cortex: pyramidal, spiny stellate, and smooth stellate cells (see Figure 2). On the contrary, only pyramidal cells project back to the thalamus and make synapses on TC neurons. This model is described by the following set of 12 differential equations:
\[
\begin{aligned}
\dot{x}_1(t) &= x_7(t) \qquad \dot{x}_2(t) = x_8(t) \qquad \dot{x}_3(t) = x_9(t) \\
\dot{x}_4(t) &= x_{10}(t) \qquad \dot{x}_5(t) = x_{11}(t) \qquad \dot{x}_6(t) = x_{12}(t) \\
\dot{x}_7(t) &= A_t a_t \left[ \sum_{i=1}^{N} K^{th,i} \sum_{m=1}^{M_i} c_6 S\left( y_5^{im}(t) \right) + p_{th}(t) \right] - 2 a_t x_7(t) - a_t^2 x_1(t) \\
\dot{x}_8(t) &= B_t b_t c_{2t} S(c_{1t} x_3(t)) - 2 b_t x_8(t) - b_t^2 x_2(t) \\
\dot{x}_9(t) &= A_t a_t S(x_1(t) - x_2(t)) - 2 a_t x_9(t) - a_t^2 x_3(t) \\
\dot{x}_{10}(t) &= A_t a_{d1t} S(x_1(t) - x_2(t)) - 2 a_{d1t} x_{10}(t) - a_{d1t}^2 x_4(t) \\
\dot{x}_{11}(t) &= A_t a_{d2t} S(x_1(t) - x_2(t)) - 2 a_{d2t} x_{11}(t) - a_{d2t}^2 x_5(t) \\
\dot{x}_{12}(t) &= A_t a_{d3t} S(x_1(t) - x_2(t)) - 2 a_{d3t} x_{12}(t) - a_{d3t}^2 x_6(t),
\end{aligned} \tag{4.2}
\]
The pulse density input $p_{th}(t)$ in this case can be modeled as the sum of a stimulus (visual or auditory stimulation, for example) and a basal random activity in the thalamus. $K^{th,i}$ is the CSP between the thalamus and cortical area $i$ and is obtained from the anatomical connectivity matrix as described in section 3.2.

At this point, it is worth making some comments about the mathematical formulation of systems 4.1 and 4.2. The two neural mass models correspond to a mathematical object called a random differential equation (RDE) (Hasminskii, 1980), that is, an equation of the form $dx(t) = f(x(t), p(t))\,dt$, where $p(t)$ is an external stochastic process. Indeed, in equations 4.1 and 4.2, the nonautonomous inputs $p^{nj}(t)$ and $p_{th}(t)$ are external stochastic processes. Note that for each fixed realization of $p^{nj}(t)$ and $p_{th}(t)$, the systems 4.1 and 4.2 are in fact nonautonomous ordinary differential equations. This approach differs from the neural mass models proposed in Jansen and Rit (1995), Wendling et al. (2000), and David and Friston (2003), which mainly use stochastic differential equations (SDEs), that is, equations of the form $dx(t) = f(x(t))\,dt + g(x(t))\,dW(t)$, where the differential $dW(t)$ of the Wiener process $W(t)$ is a gaussian white noise. Notice that the mathematical formulations of SDEs and RDEs are quite different (Gard, 1988). However, both types of equations are related in the particular case $p(t) = W(t)$. That is, the RDE $dx(t) = f(x(t), p(t))\,dt$ can be transformed into an SDE by adding the equation $dp(t) = dW(t)$. Hence, the framework of RDEs allows us to consider not only Wiener processes but also other types of models for the noise $p(t)$.
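The practical consequence of the RDE viewpoint can be sketched in a few lines of Python (our illustration; a toy scalar field stands in for systems 4.1 and 4.2): once a realization of $p(t)$ is fixed, any deterministic ODE solver applies.

    import numpy as np
    from scipy.integrate import solve_ivp

    rng = np.random.default_rng(0)
    t_grid = np.linspace(0.0, 1.0, 201)
    # One fixed realization of the stochastic input p(t) (statistics as in sec. 7.1)
    p_samples = rng.normal(10000.0, 1000.0, t_grid.size)

    def p_of_t(t):
        return np.interp(t, t_grid, p_samples)   # frozen noise path

    def f(t, x):
        # Toy scalar stand-in for the model: dx = (p(t) - x) dt
        return p_of_t(t) - x

    sol = solve_ivp(f, (0.0, 1.0), [0.0], t_eval=t_grid)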
Besides, numerical calculations can be significantly simplified by avoiding the elements of stochastic integration theory usually required for SDEs. Finally, the existence and uniqueness of the solution of systems 4.1 and 4.2 follow from classical arguments. That is, given a realization of the stochastic processes $p^{nj}(t)$ and $p_{th}(t)$, it is easily seen that the right-hand sides of systems 4.1 and 4.2 are given by continuously differentiable functions of their respective arguments (for any choice of the corresponding parameters). Besides, the function $S(v(t))$ defined in equation 2.3, the only source of nonlinearity present in both systems, satisfies a Lipschitz condition (it has a bounded first derivative). According to theorem 3.1 in Hasminskii (1980), these two conditions (continuity and Lipschitz) are enough to guarantee that our model is well posed (i.e., existence and uniqueness of the solution).

5 EEG Observation Model

We now relate the average membrane potential of the pyramidal cells of each voxel to the observed EEG on the scalp surface. The current dipole due to a PSP is approximately (Hämäläinen, Hari, Ilmoniemi, Knuutila, & Lounasmaa, 1993)
\[
j = \frac{\pi d^2}{4} \sigma_{in} V = \varepsilon V, \tag{5.1}
\]
\[
\varepsilon = \frac{\pi d^2}{4} \sigma_{in}, \tag{5.2}
\]
where $d$ is the diameter of the dendrite, $\sigma_{in}$ is the intracellular conductivity, and $V$ is the change of voltage during a PSP (for a single PSP, $j \approx 20$ fAm; Hämäläinen et al., 1993). The equivalent current dipole (ECD) due to $N$ PSPs is then
\[
J = \sum_{i=1}^{N} \alpha_i j_i \approx \varepsilon \sum_{i=1}^{N} \alpha_i V_i, \tag{5.3}
\]
where $\alpha_i$ is $+1$ for an EPSP and $-1$ for an IPSP. As discussed in section 2, we can assume that, for one voxel,
\[
y_1(t) - y_2(t) = \sum_{i=1}^{N} \alpha_i V_i, \tag{5.4}
\]
and
\[
J(t) = \varepsilon \left( y_1(t) - y_2(t) \right). \tag{5.5}
\]
On the other hand, from the forward problem of the EEG, we know that the voltage measured at the scalp surface due to a current density distribution is given by
\[
\phi(r_s, t) = \int K(r_s, r_g)\, J(r_g, t)\, d^3 r_g, \tag{5.6}
\]
where the kernel $K(r_s, r_g)$ is the electric lead field, which summarizes the electric and geometric properties of the conducting media (brain, skull, and scalp), and $J(r_g, t)$ is the primary current density (PCD) vector. The indices $s$ and $g$ run over the sensor (electrode) and generator (voxel) spaces, respectively, and $t$ denotes time. The lead field matrix $K(r_s, r_g)$ is commonly known and can be calculated by means of the reciprocity theorem (Plonsey, 1963; Rush & Driscoll, 1969). Thus, the EEG generated at a given array of electrodes distributed over the scalp surface due to the activation of one voxel can easily be calculated by multiplying the lead field by expression 5.5, both evaluated for the given voxel.
where the kernel K (rs , rg ) is the electric lead field, which summarizes the electric and geometric properties of the conducting media (brain, skull, and scalp), and J (r g , t) is the primary current density (PCD) vector. The indices s and g run over the sensor (electrodes) and generator (voxels) spaces, respectively, and t denotes time. The lead field matrix K (rs , rg ) is commonly known and can be calculated by means of the reciprocity theorem (Plonsey, 1963; Rush & Driscoll, 1969). Thus, the EEG generated due to the activation of one voxel at a given array of electrodes distributed over the scalp surface can easily be calculated by multiplication of the lead field by expression 5.5, both evaluated for the given voxel. 6 Numerical Method As we mentioned in section 4, an RDE is just a nonautonomous ordinary differential equation (ODE) coupled with a stochastic process ( p(t) and pth (t) in our model). Thus, in principle, RDEs can be integrated by applying conventional numerical methods for ODEs. However, it is well known that classical methods in ODEs may introduce “ghost” solutions and numerical instabilities when applied to nonlinear equations, which is critical when dealing with high-dimensional problems (as in our case). On the other hand, very few researchers have studied systematically the properties of such methods for the numerical integration of RDEs (Grune & Kloeden, 2001; Carbonell, Jimenez, Biscay, & de la Cruz, 2005). In this letter we choose the local linearization (LL) for RDE (Carbonell et al., 2005) to solve systems 4.1 and 4.2 (see a description of this method in the appendix). The rationale for this choice is the fact that the LL method improves the order of convergence and stability properties of conventional numerical integrators (Carbonell et al., 2005) for RDEs. Moreover, the LL approach has been key for constructing efficient and stable numerical schemes for the integration and estimation of various classes of random dynamical systems (Shoji & Ozaki, 1998; Prakasa-Rao, 1999; Schurz, 2002). 7 Results In this section, computational simulations are used to demonstrate the capability of the model to reproduce the temporal dynamics and spectral characteristics of the EEG signal as recorded on the scalp surface. A number of model predictions arising from the results obtained are also shown,
7 Results

In this section, computational simulations are used to demonstrate the capability of the model to reproduce the temporal dynamics and spectral characteristics of the EEG signal as recorded on the scalp surface. A number of model predictions arising from the results are also presented and can be further tested experimentally. Additionally, the sensitivity of the results to changes in model parameters is studied.
Figure 3: (A) Average anatomical connectivity matrix estimated from diffusion weighted magnetic resonance imaging data of two human subjects. This matrix contains the connection strength between 71 brain areas. These areas were segmented based on the Probabilistic Brain Atlas developed at the Montreal Neurological Institute. The values are normalized between 0 and 1. (B) Random matrix calculated with the same degree of sparseness as that of the anatomical connectivity matrix in A.
7.1 Description of the Simulation Method

In the simulations, 71 brain areas were defined based on the Probabilistic MRI Atlas (PMA) produced by the Montreal Neurological Institute (Evans et al., 1993, 1994; Collins, Neelin, Peters, & Evans, 1994; Mazziotta, Toga, Evans, Fox, & Lancaster, 1995). Anatomical connectivity matrices between these areas, corresponding to two human subjects, were calculated and averaged. The elements of the resulting average matrix were then used as CSPs for between-area interactions. As shown in Figure 3, the sparseness of this anatomical connectivity matrix allows for a significant reduction in the number of CSPs and thus reduces the computational cost of solving the large RDE system that describes the model presented here.

Initial conditions were set to zero in all simulations, and an integration step size of 5 ms was used. The first 100 points of the simulated signals were discarded in order to avoid transient behavior, and the input $p(t)$ was a gaussian process with mean and standard deviation of 10,000 and 1,000 pulses per second, respectively. The remaining model parameter values were chosen to vary across space, with values drawn from a uniform distribution in the interval between 5% below and 5% above the standard values reported in Table 1. This choice was motivated by the inhomogeneities found in cortical tissue.
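This jitter convention can be expressed in a couple of lines of Python (a sketch; the parameter subset shown is an arbitrary selection from Table 1):

    import numpy as np

    rng = np.random.default_rng(0)
    standard = {"A": 3.25, "B": 22.0, "a": 100.0, "b": 50.0}   # subset of Table 1
    n_voxels = 18905
    # One independent draw per voxel, uniform within 5% of the standard value.
    jittered = {k: rng.uniform(0.95 * v, 1.05 * v, n_voxels)
                for k, v in standard.items()}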
Table 1: Values of Model Parameters and Physiological Interpretation.

Parameters with the same value in all simulations:
- $A$, $B$ (average synaptic gain for cortical voxels): $A = 3.25$ mV, $B = 22$ mV
- $e_0$, $v_0$, $r$ (parameters of the nonlinear sigmoid function): $e_0 = 5$ s$^{-1}$, $v_0 = 6$ mV, $r = 0.56$ mV$^{-1}$
- $\sigma_{e1}$, $\sigma_{e2}$, $\sigma_i$ (parameters of the connectivity functions): $\sigma_{e1} = 10$, $\sigma_{e2} = 10$, $\sigma_i = 2$
- $c_6$, $c_7$, $c_8$, $c_9$ (average number of synaptic contacts between the neuronal populations of different cortical voxels): $c_6 = 200$, $c_7 = 100$, $c_8 = 100$, $c_9 = 100$

Parameters with different values for each simulation:
- $a$, $b$: average synaptic time constants for excitatory and inhibitory populations
- $a_{d1}$, $a_{d2}$, $a_{d3}$, $b_{d4}$: average time delays on efferent connections from a population, for cortical voxels
- $c_1$ (with $c_2 = 0.8 c_1$, $c_3 = c_4 = 0.25 c_1$) and $c_5$: average number of synaptic contacts between the neuronal populations within a cortical voxel
- $A_t$, $B_t$: synaptic gains for the thalamus
- $c_{1t}$, $c_{2t}$, $c_{3t}$, $c_{4t}$, $c_{5t}$: number of synaptic contacts made by the neuronal populations of the thalamus
- $a_t$, $b_t$: average synaptic time constants for TC and RE populations

Actual values used in each case:
- Delta rhythm: $a = 20$ s$^{-1}$, $b = 20$ s$^{-1}$, $a_{d1} = 15$ s$^{-1}$, $a_{d2} = 20$ s$^{-1}$, $a_{d3} = 20$ s$^{-1}$, $b_{d4} = 20$ s$^{-1}$, $c_1 = 50$
- Theta rhythm: $a = 50$ s$^{-1}$, $b = 40$ s$^{-1}$, $a_{d1} = 15$ s$^{-1}$, $a_{d2} = 50$ s$^{-1}$, $a_{d3} = 50$ s$^{-1}$, $b_{d4} = 40$ s$^{-1}$, $c_1 = 50$
- Alpha rhythm: $a = 100$ s$^{-1}$, $b = 50$ s$^{-1}$, $a_{d1} = 33$ s$^{-1}$, $a_{d2} = 100$ s$^{-1}$, $a_{d3} = 100$ s$^{-1}$, $b_{d4} = 50$ s$^{-1}$, $c_1 = 150$, $A_t = 3.25$ mV, $B_t = 22$ mV, $c_{1t} = 50$, $c_{2t} = 50$, $c_{3t} = 80$, $c_{4t} = 100$, $c_{5t} = 80$, $a_t = 100$, $b_t = 40$
- Beta rhythm: $a = 100$ s$^{-1}$, $b = 100$ s$^{-1}$, $a_{d1} = 33$ s$^{-1}$, $a_{d2} = 100$ s$^{-1}$, $a_{d3} = 100$ s$^{-1}$, $b_{d4} = 100$ s$^{-1}$, $c_1 = 180$
- Gamma rhythm: $a = 100$ s$^{-1}$, $b = 100$ s$^{-1}$, $a_{d1} = 33$ s$^{-1}$, $a_{d2} = 100$ s$^{-1}$, $a_{d3} = 100$ s$^{-1}$, $b_{d4} = 100$ s$^{-1}$, $c_1 = 250$
The motor cortex, for example, has a relatively sparse population of neurons, while sensory cortices tend to be more densely populated than average. Furthermore, even within a given cytoarchitectonic region, cortical thickness and neuronal densities vary considerably (Abeles, 1991). However, despite these differences, it is believed that all cortical regions process information locally according to the same principles. Parameter values will also depend on voxel size. Nevertheless, because it is difficult to establish the exact nature of this dependence, the selection of a plausible range of variation of the parameter values (see Table 1) was based on previous modeling studies (Lopes da Silva et al., 1974; Zetterberg et al., 1978; Jansen & Rit, 1995; David & Friston, 2003). Parameters used in previous works and parameters introduced for the first time in this study were then tuned so that the model could produce the different rhythms observed in the human EEG.

In all simulations, our generator space consists of a 3D regular grid of points that represent the possible generators of the EEG inside the brain, while the measurement space is defined by the array of sensors where the EEG is recorded. In this work, 41,850 grid points (defining voxels of size 4.25 mm × 4.25 mm × 4.25 mm) and a dense array of 120 electrodes are placed in registration with the PMA. The 3D grid is further clipped by the gray matter, which consists of all 71 segmented brain regions (18,905 grid points remaining). The electrode positions were determined by extending and refining the standard 10/20 system (FP1, FP2, F3, F4, C3, C4, P3, P4, O1, O2, F7, F8, T3, T4, T5, T6, Fz, Cz, and Pz). The physical head model constructed in this way allows us to easily compute the electric lead field matrix that relates the PCD inside the brain to the voltage measured at the sensor locations.

7.2 Generation of Alpha Rhythm

Experimental data from fMRI as well as solutions of the EEG inverse problem show that the generators of the different brain rhythms are not localized in the same cortical areas (Goldman, Stern, Engel, & Cohen, 2002; Martínez-Montes, Valdés-Sosa, Miwakeichi, Goldman, & Cohen, 2004). In the case of the alpha rhythm, it has been demonstrated that increased EEG alpha power is correlated with decreased fMRI signal in multiple regions of occipital, superior temporal, inferior frontal, and cingulate cortex, and with increased power in the thalamus and insula (Goldman et al., 2002). In this study, these regions were segmented and subdivided into 12 anatomical areas selected from the PMA (see Table 2), which were further subdivided based on their hemispheric location (left or right). An illustrative visualization of some of these regions with the corresponding fiber tracts estimated from DWMRI is shown in Figure 4A.

The 24 defined brain areas, together with their corresponding anatomical connectivity values, were used in our model to generate alpha activity; the results are shown in Figure 5. The temporal dynamics and the power spectrum of the signal at the electrodes of maximum amplitude (O1 and O2) show reasonable agreement with experimentally observed EEG (see Figure 5A).
Table 2: Brain Areas Selected for Each Rhythm Simulation.

Alpha rhythm:
- Cuneus (left and right)
- Lingual gyrus (left and right)
- Lateral occipitotemporal gyrus (left and right)
- Insula (left and right)
- Inferior occipital gyrus (left and right)
- Superior occipital gyrus (left and right)
- Medial occipitotemporal gyrus (left and right)
- Occipital pole (left and right)
- Cingulate region (left and right)
- Inferior frontal gyrus (left and right)
- Superior temporal gyrus (left and right)
- Thalamus (left and right)

Delta, beta, theta, and gamma rhythms:
- Medial front-orbital gyrus (left and right)
- Middle frontal gyrus (left and right)
- Precentral gyrus (left and right)
- Lateral front-orbital gyrus (left and right)
- Medial frontal gyrus (left and right)
- Superior frontal gyrus (left and right)
- Inferior frontal gyrus (left and right)
Figure 4: Visualization of the fiber tracts connecting the brain regions used in the simulations as calculated from DWMRI. (A) Alpha rhythm: left and right superior temporal gyri (1, 2); left and right thalamus (3, 4); left and right medial occipitotemporal gyri (5, 6); left and right occipital poles (7, 8). (B) Beta, gamma, delta, and theta rhythms: left and right superior frontal gyri (1, 2); left and right inferior frontal gyri (3, 4).
Realistic Neural Mass Models
497
Figure 5: Simulation of alpha rhythm (8–12 Hz). (A) Temporal dynamics and spectrum of the signal at electrodes O1 and O2. (B) Topographic distribution of the alpha power on the scalp surface. The spatial distribution of the 120 electrodes’ system used in the simulation is also shown.
Figure 5A). The amplitude of the signal varies between 30 and 50 µV for both electrodes, and the maximum of the spectrum was at 9.98 Hz for electrode O1 and 9.90 Hz for electrode O2. These values are in the range of actual recordings. Other qualitative properties, like the waxing and waning characteristic of actual alpha rhythm, as well as interhemispheric symmetry, are also observed. Regarding the topographic distribution of the activity,
498
R. Sotero et al.
Figure 6: Reactivity test simulation. A trapezoidal input of 12 seconds duration and amplitude of 106 pulses per second is used to simulate visual input at the thalamic level. The response of the system results in a significant reduction in the amplitude of both the signal and the peak of the alpha power and a shift of the power spectrum toward higher frequencies.
Figure 5B shows that the simulated EEG is maximal over the occipital regions and diminishes in amplitude toward frontal areas, which has also been described for alpha rhythm. The influence of an external input (stimulus) on the simulated alpha activity was also studied, and results are depicted in Figure 6. Actual alpha rhythm is known to be temporarily blocked by an influx of light (eye opening). This is known as reactivity. The degree of reactivity varies; the alpha rhythm may be blocked, suppressed, or attenuated with voltage
Realistic Neural Mass Models
499
reduction (Niedermeyer & Lopes da Silva, 1987). In the simulations shown in Figure 6, a trapezoidal input representing visual stimulation of 12 seconds duration and amplitude of 106 pulses per second was given to the thalamus. As shown in the figure, the effect of this input was to attenuate the EEG signal and shift the whole EEG spectrum toward higher frequencies with a reduction in the amplitude of the power, which is also observed in actual reactivity tests. 7.3 Generation of Delta, Theta, Beta, and Gamma Rhythms. For the simulation of delta, theta, beta, and gamma rhythms, 7 brain regions (medial-front orbital gyrus, middle frontal gyrus, precental gyrus, lateralfront orbital gyrus, medial frontal gyrus, superior frontal gyrus, and inferior frontal gyrus) were selected and subdivided based on their hemispheric location (left or right), for a total of 14 brain areas. Some of these areas, with the corresponding fiber tracts connecting them, are shown in Figure 4B. In these studies, 30 seconds of EEG data were simulated for each rhythm. Figures 7, 8, 9, and 10 show the temporal dynamics and the spectrum of the signal obtained at electrodes Fp1 and Fp2 for the delta, theta, beta, and gamma rhythms simulations, respectively. Amplitudes and spectral frequency bands obtained are within the range of experimentally observed EEG. Note the characteristic interhemispheric asymmetry in the frequency of the peak for the beta rhythm (see Figure 9). That is, the maximum of the spectrum is located at 18.46 Hz for electrode Fp1 and at 16.76 Hz for electrode Fp2. The rest of the simulated rhythms do not show a significant interhemispheric frequency asymmetry. Finally note that the power spectrum on the scalp surface was maximum at frontal areas in all cases and reaches minimum amplitudes at occipital electrodes. 7.4 Influence of the Model Parameters on the Generation of the EEG Rhythms. Parameter values in Table 1 show that by increasing c 1 (and thus c 2 , c 3 , and c 4 ), while keeping the other parameters constant, the EEG spectrum shifts toward higher frequencies, going from beta to gamma frequency bands. On the other hand, keeping c 1 constant, we can switch from delta to theta rhythm by increasing parameters associated with membrane time constants and dendritic tree time delays. However, as there are 32 parameters in the model (excluding the connectivity matrix values), it is difficult to study the influence of each of them on the output of the model. That is, parameter values presented in Table 1 are not the only ones capable of producing EEG rhythms. Additionally, neural mass models can produce the same EEG rhythm with different sets of parameter values, as shown by David and Friston (2003), which varied excitatory and inhibitory time constants and obtained regions of the parameters’ phase-space within which the same rhythm was obtained. In this section, we perform similar analyses for parameters a and b.
500
R. Sotero et al.
Figure 7: Simulation of delta rhythm (1–4 Hz). (A) Temporal dynamics and power spectrum of the signal at electrodes Fp1 and Fp2. (B) Topographic distribution of the delta power on the scalp surface. The spatial distribution of the 120 electrodes’ system used in the simulation is also shown.
In the first set of simulations the areas involved in the alpha rhythm simulation study were selected, and then parameters a , b were varied from 10 s−1 to 500 s−1 , with a step size of 10 s−1 , while the rest of the parameters were kept constant with the values in Table 1 that correspond to alpha rhythm. For each case, the resulting dynamics was assigned to an EEG
Realistic Neural Mass Models
501
Figure 8: Simulation of theta rhythm (4–8 Hz). (A) Temporal dynamics and power spectrum of the signal at electrodes Fp1 and Fp2. (B) Topographic distribution of the theta power on the scalp surface. The spatial distribution of the 120 electrodes’ system used in the simulation is also shown.
band. A phase diagram is shown in Figure 11A. Two phases are obtained depending on parameter values: one corresponding to alpha rhythms and the other corresponding to noise.
502
R. Sotero et al.
Figure 9: Simulation of beta rhythm (12–30 Hz). (A) Temporal dynamics and power spectrum of the signal at electrodes Fp1 and Fp2. (B) Topographic distribution of the theta power on the scalp surface. The spatial distribution of the 120 electrodes’ system used in the simulation is also shown.
In the second study, parameters a , b were also varied from 10 s−1 to 500 s−1 with a step size of 10 s−1 , but now the brain areas selected were the ones involved in the generation of delta, theta, beta, and gamma rhythms. As in the previous case, the other parameters were kept constant with the values shown in Table 1 for beta rhythm. The corresponding phase diagram
Realistic Neural Mass Models
503
Figure 10: Simulation of gamma rhythm (30–70 Hz). (A) Temporal dynamics and power spectrum of the signal at electrodes Fp1 and Fp2. (B) Topographic distribution of the gamma power on the scalp surface. The spatial distribution of the 120 electrodes’ system used in the simulation is also shown.
is shown in Figure 11B. As can be seen, five different phases are obtained corresponding to delta, theta, beta, and gamma rhythms, as well as noise. However, alpha rhythm (the small spots in white between theta and beta areas in Figure 11B) was also obtained for three pairs of values (b, a ): (340,180), (340,190), (120,280).
504
R. Sotero et al.
Figure 11: Phase-space diagram of the average synaptic time constants for excitatory (a ) and inhibitory (b) populations. (A) The chosen regions were the ones used previously in the alpha rhythm simulation. The phase space splits into two regions that separate noise from alpha rhythm. (B) The selected areas were the ones used in delta, theta, beta, and gamma rhythm simulations. The phase space splits into different regions—one for each rhythm and noise. Nevertheless, for three pairs of values (b, a ), alpha rhythm was also obtained (white points between theta and beta areas). (C) The same areas as in A were selected but with values extracted from the random matrix of Figure 3B. (D) The same areas as in B but with values extracted from the random matrix. In C and D, for very large physiologically unrealistic values of (b, a ), several rhythms appear.
Thus, by changing a or b, or both, the system undergoes phase transitions that allow the same model to generate the different EEG rhythm. Note also that, as expected, in both sets of simulations, noise is obtained for high values of a or b, or both, which corresponds to processes decaying very rapidly and thus occurring at very high (nonphysiological) frequencies. One of the main goals of this article is the validation of the use of anatomical connectivity information obtained from DWMRI to model the LRC strength between multiple coupled neural mass models representing cortical areas and the capability of this modeled system to generate different EEG rhythms. So far, we have proven that our model is capable of generating alpha, delta, theta, beta, and gamma rhythm in a wide range of parameter values. Nevertheless, an important question remains: How important is the
Realistic Neural Mass Models
505
Figure 12: Effect of the reduction of connectivity parameters on alpha and beta rhythms. CSPs between areas were diminished by 50%, and parameter c 1 was also diminished to c 1 = 50. The dotted line is the spectrum when using the standard values, and the solid line represents the reduced parameters.
use of the actual connectivity matrix values? In other words, wouldn’t be possible to generate EEG rhythms with a completely different connectivity matrix? To elucidate this matter, the two sets of simulations previously described were replicated, but in this case, the random connectivity matrix shown in Figure 3B was used instead for the coupling between the brain areas involved in the simulations. This random matrix was generated with the same degree of sparseness as the actual connectivity matrix. Results are shown in Figures 11C and 11D. As can be seen, no rhythm emerged for the range of values (b, a ) from 10 s−1 to 500 s−1 in these cases. Only for extremely high and nonphysiological values (corresponding to average population time constants of less than 1 ms) are some rhythms obtained. An interesting case is to analyze the impact of decreased connection strengths (see section 8). For this, the values of the anatomical connectivity matrix were decreased 50% and parameter c 1 was reduced to c 1 = 50 for the cases of the alpha and beta rhythms generation models. The results are shown in Figure 12. For both alpha and beta, the peak amplitude of the power spectrum reduces for its corresponding frequency bands, while the delta and theta bands were accentuated. 8 Discussion In this work, previous neural mass models (Lopes da Silva et al., 1974; Zetterberg et al., 1978; Jansen & Rit, 1995) were generalized and combined to account for intravoxel interactions, as well as short-range and long-range connections representing interactions between cortical voxels within the same area and with voxels of other cortical areas, respectively. The thalamus
506
R. Sotero et al.
was also described by a neural mass model and coupled to the cortical areas. Anatomical information from DWMRI was introduced to constrain coupling strength parameters between different brain regions. This allowed us to couple multiple brain areas with real anatomical connectivity values and study the rhythms emerging from these interactions across different spatial scales. Knowing the average depolarization of pyramidal cells in each voxel and the lead field matrix that characterizes the spatial distribution of the electromagnetic field in the head, we obtained the voltage signals recorded at the location of given electrodes on the scalp surface. For validating the model, several simulations were carried out. The parameters that were tuned in each simulation were the time delays of efferent connections and connectivity constants between the different populations within a voxel. The choice of parameter was given by its importance in the generation of different rhythms, as demonstrated by previous work in this field (Jansen & Rit, 1995; David & Friston, 2003). The temporal dynamics and the spectrum of human EEG rhythms were reproduced by the proposed model. EEG rhythms were also obtained at the right spatial locations. In the case of alpha rhythm, the activity was maximal at occipital electrodes and diminished toward the frontal electrodes. Delta, theta, beta, and gamma rhythms were maximal at frontal electrodes. The amplitudes of the rhythms were also within the range observed in actual EEG recordings. Interhemispheric symmetry was found in the case of alpha rhythm, since the difference between the peaks of EEG spectrum at electrodes O1 and O2 was less than 0.1 Hz. In contrast beta rhythm showed interhemispheric asymmetry; that is, there was a difference of 1.7 Hz between the peaks of the EEG spectrum at Fp1 and Fp2. For delta, theta, and gamma rhythms, interhemispheric symmetry with respect to the peak frequency of the spectrum was obtained. In clinical practice, no EEG is complete without a reactivity test (Niedermeyer & Lopes da Silva, 1987). For this reason, we made a reactivity test simulation, finding that the effect of stimulation was to shift the spectrum toward higher frequencies together with a reduction in amplitude, which also agrees with actual EEG recordings in healthy subjects. Instead of the anatomical connectivity matrix, values from a random matrix were also used to couple brain areas. In this case, some rhythms appear for extremely high values of parameters a and b. However, these values of a and b correspond to an average population time constant of less than 1 ms, which does not make much sense from the physiological point of view. Thus, it can be concluded that in the range of physiologically plausible values, no rhythm was obtained when using a nonrealistic connectivity matrix. Finally, the model was used to study the effect of a reduction in all connectivity parameters values predicting an increase in delta and theta power (see Figure 12), along with a decrease in alpha and beta peaks. It is
Realistic Neural Mass Models
507
interesting to note that this slowing of oscillatory activity is a prominent functional abnormality that has been reported in EEG and MEG studies of Alzheimer disease (AD) patients (Coben, Danziger, & Berg, 1983; Penttil¨a, Partanen, Soininen, & Riekkinen, 1985; Schreiter-Gasser, Gasser, & Ziegler, 1993; Berendse, Verbunt, Scheltens, van Dijk, & Jonkman, 2000). On the other hand it is also recognized that AD patients show a progressive cortical disconnection syndrome due to the loss of large pyramidal neurons in cortical layers III and V (Pearson, Esiri, Hiorns, Wilcock, & Powell, 1985; Morrison, Scherr, Lewis, Campbell, & Bloom, 1986). Related to this, it has also been found that the total callosal area is significantly reduced in patients suffering from this pathology (Rose et al., 2000; Hampel et al., 1998). According to our model, the shift of the EEG spectrum toward low frequencies found in AD could be associated with the cortical disconnection syndrome that is reflected in the decreased values of the coupling strength parameters, although the model does not exclude contributions from other mechanisms. As we know, lumped models are simplifications of actual brain structures, so further improvements could be made by including more realistic models in order to describe a wider range of phenomena. Another limitation of our approach is that DWMRI cannot distinguish between efferent and afferent fibers (this is reflected in the symmetry of the connectivity matrix; see Figure 3A), and connectivity values thus could be overestimated. Moreover, the validity of the connectivity matrix values depends on the assumption that probabilistic paths calculated based on the information obtained from DWMRI reflect the actual connections between brain areas. While this letter focused on the EEG forward problem, the next logical step is the estimation of the parameters of this model from data. Previous work has accomplished this task. For example, in Valdes et al. (1999), the neural mass model described in Zetterberg et al. (1978) was fitted to actual EEG alpha data. In that work, the LL approach was used to discretize Zetterberg’s model, which was then reformulated in a state-space formalism that could be integrated by using a nonlinear Kalman filter scheme. A maximum likelihood procedure was then employed to estimate the model parameters for several EEG data sets. Bayesian inference, dynamic causal modeling (Friston, Harrison, & Penny, 2003; David, Harrison, Kilner, Penny, & Friston, 2004), and genetic algorithms (Jansen, Balaji Kavaipatti, & Markusson, 2001) have also been used for estimating the parameters of neural mass models of the type presented here. A more ambitious goal is the integration of EEG with functional imaging techniques such as functional magnetic resonance imaging and positron emission tomography. The use of this model as a framework in which electrical, hemodynamic, and metabolic activities can be coupled is currently under study and will be the subject of a future publication.
508
R. Sotero et al.
Appendix: Local Linearization Method for RDE Assume that a k-dimensional random process p(t), t ∈ [t0 , T] and a nonlinear function f are given, and define the d-dimensional RDE: dy(t) = f (y(t), p(t))dt y(t0 ) = y0 . Then, given a step size h,the local linearization scheme that solves numerically the equation above at the time instants tn = t0 + nh, n = 0, 1, . . . , is given by yn+1 = yn + Le Cn h R, where L = [Id , 0d x2 ], R = [01x(d+1) , 1]T , and the (d + 2) × (d + 2) matrix Cn is defined by blocks as
f˙ y (t)(yn , p(tn )) f˙ p (t)(yn , p(tn )) p(tn+1 h)− p(tn ) f (yn , p(tn )) Cn = 0 0 1 0 0 0 Here, f˙ y (t) and f˙ p (t) denote the derivatives of f with respect to the variables y and p, respectively. Due to the high dimensionality of our problem, the main task in the implementation of the LL scheme is the computation of the matrix exponential e Cn h . The use of the Krylov subspace methods (Sidje, 1998) is strongly recommended for that purpose. Indeed, that method can be used to compute the vector v = e Cn h R, and then the operation Lv reduces to take the first d elements of v. Acknowledgments We thank Trinidad Virues, Elena Cuspineda, and Thomas Wennekers for helpful comments. References Abeles, M. (1991). Corticonics: Neural circuits of the cerebral cortex. Cambridge: Cambridge University Press. Behrens, T., Johansen-Berg, H., Woolrich, M. W., Smith, S. M., WheelerKingshott, C., Boulby, P. A., Barker, G. J., Sillery, E. L., Sheehan, K., Ciccarelli, O., Thompson, A. J., Brady, J. M., & Matthews, P. M. (2003). Non-invasive mapping of connections between human thalamus and cortex using diffusion imaging. Nature Neuroscience, 6, 750–757.
Realistic Neural Mass Models
509
Berendse, H. W., Verbunt, J. P., Scheltens, P., van Dijk, B. W., & Jonkman, E. J. (2000). Magnetoencephalographic analysis of cortical activity in Alzheimer’s disease: A pilot study. Clin. Neurophysiol., 111, 604–612. Braitenberg V. (1978). Cortical architectonics: General and areal. In M. A. B. Brazier and M. Petche (Eds.), Architectonics of the cerebral cortex (pp. 443–465). New York: Raven Press. ¨ A. (1998). Cortex: Statistics and geometry of neuronal connecBraitenberg, V., & Schuz, tivity. Berlin: Springer-Verlag. Carbonell, F., Jimenez, J. C., Biscay, R. J., & de la Cruz, H. (2005). The local linearization method for numerical integration of random differential equations. BIT Numerical Mathematics, 45(1), 1–14. Coben, L. A., Danziger, W. L., & Berg, L. (1983). Frequency analysis of the resting awake EEG in mild senile dementia of Alzheimer type. Electroencephalogr. Clin. Neurophysiol., 55, 372–380. Collins, D. L., Neelin, P., Peters, R. M., & Evans, A. C. (1994). Automatic 3D intersubject registration of MR volumetric in standardized Talairach space. J. Comp. Assist. Tomog., 18, 192–205. David, O., & Friston, K. J. (2003). A neural mass model for MEG/EEG: Coupling and neuronal dynamics. NeuroImage., 20, 1743–1755. David, O., Harrison, L., Kilner, J., Penny, W., & Friston, K. J. (2004). Studying effective connectivity with a neural mass model of evoked MEG/EEG responses. In Proceedings of the 14th International Conference on Biomagnetism BIOMAG 2004 (pp. 136–138). Boston. Destexhe, A., & Sejnowski, T. J. (2001). Thalamocortical assemblies: How ion channels, single neurons, and large-scale networks organize sleep oscillations. New York: Oxford University Press. Douglas, R. J., & Martin, K. A. C. (2004). Neuronal circuits of the neocortex. Annu. Rev. Neurosci., 27, 419–51. Evans, A. C., Collins, D. L., Mills, S. R., Brown, E. D., Kelly, R. L & Peters, T. M. (1993). 3D statistical neuroanatomical models from 305 MRI volumes. Proc. In IEEE Nuclear Science Symposium and Medical Imaging Conference (pp. 1813–1817). London: MTP Press. Evans, A .C., Collins, D. L., Neelin, P., MacDonald, D., Kamber, M., & Marrett, T. S., 1994. Three-dimensional correlative imaging: Applications in human brain mapping. In: Thatcher, R., Hallet, M., Zeffiro, T., John, E. R., & Huerta, M. (Eds.), Functional Neuroimaging: Technical Foundations. Academic Press, New York, pp. 145–161. Friston, K. J., Harrison, L., & Penny, W. (2003). Dynamic causal modelling. NeuroImage, 19, 1273–1302. Gard, T. C. (1988). Introduction to stochastic differential equations. New York: Marcel Dekker. Gilbert, C. D., & Wiesel, T. N. (1979). Morphology and intracortical projections of functionally characterised neurones in the cat visual cortex. Nature, 280, 120–125. Gilbert, C. D., & Wiesel, T. N. (1983). Functional organization of the visual cortex. Prog. Brain Res., 58, 209–218. Goldman, R. I., Stern, J. M., Engel, J., & Cohen, M. S. (2002). Simultaneous EEG and fMRI of the alpha rhythm. NeuroReport, 13(18), 2487–2492.
510
R. Sotero et al.
Grune, L., & Kloeden, P. E. (2001). Pathwise approximation of random ordinary differential equation. BIT, 41, 711–721. Hagmann, P., Thiran, J. P., Jonasson, L., Vandergheynst, P., Clarke, S., Maeder, P., & Meuli, R. (2003). DTI mapping of human brain connectivity: Statistical fibre tracking and virtual dissection. NeuroImage, 19, 545–554. H¨am¨al¨ainen, M., Hari, R., Ilmoniemi, R. J., Knuutila, J., & Lounasmaa, O. V. (1993). Magnetoencephalography—theory, instrumentation and applications to noninvasive studies of the working human brain. Rev. Modern Phys., 65, 413–497. Hampel, H., Teipel, S. J., Alexander, G. E., Horwitz, B., Teichberg, D., Schapiro, M. B., & Rapoport, S. I. (1998). Corpus callosum atrophy is a possible indicator of regionand cell type–specific neuronal degeneration in Alzheimer disease: A magnetic resonance imaging analysis. Arch. Neurol., 55, 193–198. Hasminskii, R. Z. (1980). Stochastic stability of differential equations. Alphen aan den Rijn, The Netherlands: Sijthoff Noordhoff. Iturria-Medina, Y., Canales-Rodr´ıguez, E., Meli´e-Garc´ıa, L., & Vald´es-Hern´andez, P. (2005). Bayesian formulation for fiber tracking. Presented at the 11th Annual Meeting of the Organization for Human Brain Mapping, Toronto, Ontario, Canada. Iturria-Medina, Y., Vald´es-Hern´andez, P., & Canales-Rodriguez, E. (2005). Measures of anatomical connectivity obtained from neuro diffusion images. Presented at the 11th Annual Meeting of the Organization for Human Brain Mapping, Toronto, Ontario, Canada. Jansen, B. H., Balaji Kavaipatti, A., & Markusson, O. (2001). Evoked potential enhancement using a neurophysiologically-based model. Methods of Information in Medicine, 40, 338–345. Jansen, B. H., & Rit, V. G. (1995). Electroencephalogram and visual evoked potential generation in a mathematical model of coupled cortical columns. Biol. Cybern., 73, 357–366. Jansen, B. H., Zouridakis, G., & Brandt, M. E. (1993). A neurophysiologically-based model of flash visual evoked potentials. Biol. Cybern., 68, 275–283. Jirsa, V. K., & Haken, H. (1996). Field theory of electromagnetic brain activity. Phys. Rev. Lett., 77(5), 960–963. Jirsa, V. K., & Haken, H. (1997). A derivation of a macroscopic field theory of the brain from the quasi-microscopic neural dynamics. Physica D, 99, 503–526. Koch, M. A., Norris, D. G., & Hund-Georgiadis, M. (2002). An investigation of functional and anatomical connectivity using magnetic resonance imaging. NeuroImage, 16, 241–250. Lopes da Silva, F. H., Hoeks, A., Smits, H., & Zetterberg, L. H. (1974). Model of brain rhythmic activity: The alpha-rhythm of the thalamus. Kybernetik, 15, 27–37. Martin, K. A. C., & Whitteridge, D. (1984). Form, function and intracortical projections of spiny neurones in the striate visual cortex of the cat. J. Physiol, 353, 463–504. Mart´ınez-Montes, E., Vald´es-Sosa, P. A., Miwakeichi, F., Goldman, R. I., & Cohen, M. S. (2004). Concurrent EEG/fMRI analysis by multiway partial least squares. NeuroImage, 22, 1023–1034. Mazziotta, J. C., Toga, A., Evans, A. C., Fox, P., & Lancaster, J. (1995). A probabilistic atlas of the human brain: Theory and rationale for its development. Neuroimage, 2, 89–101.
Realistic Neural Mass Models
511
McGuire, B. A., Gilbert, C., Rivlin, P. K., & Wiesel, T. N. (1991). Targets of horizontal connections in macaque primary visual cortex. J. Comp. Neurol., 305, 370– 392. Morrison, J. H., Scherr, S., Lewis, D. A., Campbell, M. J., & Bloom, F. E. (1986). The laminar and regional distribution of neocortical somatostatin and neuritic plaques: Implications for Alzheimer’s disease as a global neocortical disconnection syndrome. In A. B. Scheibel, A. F. Weschler (Eds.), The biological substrates of Alzheimer’s disease (pp. 115–131). New York: Academic Press. Mountcastle V. B. (1978). An organizing principle for cerebral function. In G. M. Edelman, V. B. Mountcastle (Eds.), The mindful brain (pp. 7–50). Cambridge, MA: MIT Press. Mountcastle, V. B. (1997). The columnar organization of the neocortex. Brain, 120, 701–722. Niedermeyer, E. & Lopes da Silva, F. (1987). Electroencephalography: Basic principlesclinical applications and related fields (2nd ed.). Baltimore: Urban & Schwarzenberg. Nunez, P. L. (1981). Electric fields of the brain. New York: Oxford University Press. Parker, G. J., Wheeler-Kingshott, C. A., & Barker, G. J. (2002). Estimating distributed anatomical connectivity using fast marching methods and diffusion tensor imaging. IEEE Transactions on Medical Imaging, 21(5), 505–512. Pearson, R. C. A., Esiri, M. M., Hiorns, R. W., Wilcock, G. K., & Powell, T. P. S. (1985). Anatomical correlates of the distribution of the pathological changes in the neocortex in Alzheimer’s disease. Proc. Natl. Acad. Sci. USA, 82, 4531–4534. Penttil¨a, M., Partanen, J. V., Soininen, H., & Riekkinen, P. J. (1985). Quantitative analysis of occipital EEG in different stages of Alzheimer’s disease. Electroencephalogr. Clin. Neurophysiol., 60, 1–6. Plonsey, R. (1963). Reciprocity applied to volume conductors and the ECG. IEEE Trans. BioMedical Electronics, 10, 9–12. Prakasa-Rao, B. L. S. (1999). Statistical inference for diffussion type processes. New York: Oxford University Press. Rinzel, J., Terman, D., Wang, X.-J., & Ermentrout, B. (1998). Propagating activity patterns in large-scale inhibitory neuronal networks. Science, 279, 1351–1355. Rose, S. E., Chen, F., Chalk, J. B., Zelaya, F. O., Strugnell, W. E., Benson, M., Semple, J., & Doddrell, D. M. (2000). Loss of connectivity in Alzheimer’s disease: An evaluation of white matter tract integrity with colour coded MR diffusion tensor imaging. J. Neurol. Neurosurg. Psychiatry, 69, 528–530. Rush, S., & Driscoll, D. A. (1969). EEG electrode sensitivity–an application of reciprocity. IEEE Trans on Biomed. Eng., 16(1), 15–22. Schreiter-Gasser, U., Gasser, T., & Ziegler, P. (1993). Quantitative EEG analysis in early onset Alzheimer’s disease: A controlled study. Electroencephalogr. Clin. Neurophysiol., 86, 15–22. Schurz H. (2002). Numerical analysis of stochastic differential equations without tears. In D. Kannan & V. Lakahmikamtham (Eds.), Handbook of stochastic analysis applications (pp. 237–358). New York: Marcell Dekker. Shoji, I., & Ozaki, T. (1998). Estimation for nonlinear stochastic differential equations by a local linearization method. Stochast. Anal. Appl., 16, 733–752. Sidje, R. B. (1998). EXPOKIT: Software package for computing matrix exponentials. AMC Trans. Math. Software, 24, 130–156.
512
R. Sotero et al.
Tournier, J. D., Calamante, F., Gadian, D. G., & Connelly, A. (2003). Diffusionweighted magnetic resonance imaging fibre tracking using a front evolution algorithm. NeuroImage, 20, 276–288. Traub, R. D., & Miles, R. (1991). Neuronal networks of the hippocampus. Cambridge: Cambridge University Press. Tuch, D. (2002). Diffusion MRI of complex neural architecture. Unpublished doctoral dissertation, Massachusetts Institute of Technology. Valdes, P. A., Jim´enez, J. C., Riera, J., Biscay, R., & Ozaki, T. (1999). Nonlinear EEG analysis based on a neural mass model. Biol. Cybern., 81, 415–424. van Rotterdam, A., Lopes da Silva, F. H., van den Ende, J., Viergever, M. A., & Hermans, A. J. (1982). A model of the spatial temporal characteristic of the alpha rhythm. Bull. Math. Biol., 44, 238–305. Wang, X.-J., Golomb, D., & Rinzel, J. (1995). Emergent spindle oscillations and intermittent burst firing in a thalamic model: Specific neuronal parameters. Proc. Natl. Acad. Sci. USA, 92, 5577–5581. Wendling, F., Bellanger, J. J., Bartolomei, F., & Chauvel, P. (2000). Relevance of nonlinear lumped-parameter models in the analysis of depth-EEG epilectic signals. Biol. Cybern., 83, 367–378. Wilson, H. R., & Cowan, J. D. (1972). Excitatory and inhibitory interactions in localized populations of model neurons. Biophys. J., 12, 1–24. Zetterberg, L. H., Kristiansson, L., & Mossberg, K. (1978). Performance of a model for a local neuron population. Biol. Cybern., 31, 15–26.
Received December 20, 2005; accepted June 9, 2006.
LETTER
Communicated by Aapo Hyvarinen
Dimension Selection for Feature Selection and Dimension Reduction with Principal and Independent Component Analysis Inge Koch [email protected] Department of Statistics, School of Mathematics, University of New South Wales, Sydney, NSW 2052 Australia
Kanta Naito [email protected] Department of Mathematics, Faculty of Science and Engineering, Shimane University, Matsue 690-8504 Japan
This letter is concerned with the problem of selecting the best or most informative dimension for dimension reduction and feature extraction in high-dimensional data. The dimension of the data is reduced by principal component analysis; subsequent application of independent component analysis to the principal component scores determines the most nongaussian directions in the lower-dimensional space. A criterion for choosing the optimal dimension based on bias-adjusted skewness and kurtosis is proposed. This new dimension selector is applied to real data sets and compared to existing methods. Simulation studies for a range of densities show that the proposed method performs well and is more appropriate for nongaussian data than existing methods. 1 Introduction This letter proposes a dimension selector for the most informative lowerdimensional subspace in feature extraction and dimension-reduction problems. Our methodology is based on separating the informative part of the data from noise in high-dimensional problems: it reduces high-dimensional data and represents them on a smaller number of new directions that contain the interesting information inherent in the original data. Our algorithm combines principal component analysis (PCA) and independent component analysis (ICA) and makes use of objective- and data-driven criteria in the selection of the best dimension. A combined PC-IC algorithm has been used in a number of papers, including Prasad, Arcot, and Koch (2005), Gilmour and Koch (2004), Koch, Marron, and Chen (2005) and is also implicitly referred to in many ICA algorithms (see Hyv¨arinen, Karhunen, & Oja, 2001). Our approach is motivated Neural Computation 19, 513–545 (2007)
C 2007 Massachusetts Institute of Technology
514
I. Koch and K. Naito
by Prasad et al. (2005), who proposed a feature selection and extraction method for efficient classification by choosing the best subset of features using PC-IC. Their method resulted in a very efficient algorithm that is also highly competitive in terms of classification accuracy. Their research was driven by finding a smaller number of predictors or features for efficient classification, while our research focuses on the more general problem of dimension reduction in high-dimensional data. Features selection can be regarded as a special case of dimension reduction: representing or approximating high-dimensional data in terms of a smaller number of new directions that span a lower-dimensional space. PCA alone achieves dimension reduction of high-dimensional data. PCA is well understood (see Anderson, 1985, Mardia, Kent, & Bibby, 1979), easily accessible, and implemented in many software packages. However, the lower-dimensional directions chosen by PCA do not always preserve or represent the structure inherent in the original data. Finding interesting and informative structure in data is one of the two main aims of ICA. ICA can be applied to the original or to PC-reduced lower-dimensional data. In either case, it aims to find new directions, and the number of new directions is at most that of the data that form the input to ICA. A mixture of PCA and ICA seems to be reasonable since considerable computational effort is required to implement ICA, especially for high-dimensional data, and hence dimension reduction without loss of information or with minimal loss is essential as a step preceding ICA. This is the reason we adopt PCA as a first step prior to ICA. Different approaches to ICA and software packages are available (see Cardoso, 1999; Bach & Jordan, 2002; Amari, 2002). We follow the approach of Hyv¨arinen et al. (2001) and use their software, FastICA2.4 (Hyv¨arinen & Oja, 2005). More specifically, we use their skewness and kurtosis approximation to negentropy, since these provide fast computation as well as easy interpretation. Dimension-reduction techniques generally provide a constructive approach to representing data in terms of new directions in lower-dimensional spaces, with the number of new directions varying from one to the (possibly very high) dimension of the original data. Traditionally in PCA, one is looking for an elbow, or one uses the 95% (resp. 99.9%) rule; that is, one uses the dimension whose contribution to total variance is at least 95% (resp. 99.9%). The latter is the route Prasad et al. (2005) have taken, and this choice of dimension clearly lacks objectivity. A number of (gaussian) model-based approaches to dimension reduction have been proposed in the literature over the past few years. We review these approaches in section 5. If the model fits the data, then these methods may be exactly what is needed, and good dimension-reduction solutions are expected to be obtained with any of them. Furthermore, in specific applications, prior or additional information can help determine the dimension. A practical adjustment of this kind has been discussed in Beckmann and Smith (2004). However, to our knowledge, there do not appear to exist
Dimension Selection with PC-IC
515
nonparametric or model-free objective approaches that will determine the right number of new directions or the best dimension for a lowerdimensional space. We shall see in section 6 in particular that the gaussian model–based approaches are not adequate in many cases and that better results can be achieved with our nonparametric method. The main purpose of this article is to propose a generic nonparametric approach to dimension selection that is based on theory and objective criteria without requiring a model or knowledge of the distribution of the data and is data driven. Of course, if additional knowledge about the distribution of the data exists, integrating such knowledge is expected to improve the performance, but we will not be concerned with special cases here. Closely related to the selection of this dimension is the choice of the direction vectors in this lower-dimensional subspace. In an ICA-based framework, the natural candidates are those IC directions based on the same objective function as the selector—in our case, skewness or kurtosis. Finding those directions is the aim of any good ICA algorithm, so we will not be concerned with this issue here, but will focus instead on the choice of the dimension for the lower-dimensional subspace. We use the following notations throughout this letter. Let X denote a d-dimensional random vector with mean µ and covariance matrix . For k = 3, 4, define the moments βk (X) = max Bk (α|X), α∈S d−1
(1.1)
where k α T (X − µ) − 3δk,4 , Bk (α|X) = E √ α T α S d−1 is the unit sphere in Rd , and δk,4 is the Kronecker delta. The moments β3 (X) and β4 (X) are the multivariate skewness and kurtosis of X, respectively, in the sense of Malkovich and Afifi (1973). Note that multivariate skewness was originally defined as the square of β3 (X) above, but many authors since then have used it in the form given here, as this is more in keeping with the one-dimensional version. The multivariate skewness and kurtosis are both defined such that they are zero for gaussian random vectors. For N independent observations X1 , . . . , XN , the sample skewness and sample kurtosis are defined as Bk (α|X1 , . . . , XN ) βk (X1 , . . . , XN ) = max α∈S d−1
516
I. Koch and K. Naito
for k = 3, 4, where k N T 1 (X − X) α i − 3δk,4 . Bk (α|X1 , . . . , XN ) = √ T α Sα N i=1
(1.2)
Here, X denotes the sample mean and S the sample covariance matrix. We will base our feature extraction and dimension reduction for highdimensional data on these βk ’s. It should be noted that βk (X) is not a function of X but just a property of the distribution of X, while βk (X1 , . . . , XN ) is a random variable. Expressing X and (X1 , . . . , XN ) in terms of βk (X) and βk (X1 , . . . , XN ), respectively, is necessary to explain our technique in the subsequent discussion. This type of statistic, based on the projections and summary statistics, is commonly used in the statistical science comunity. A typical and powerful tool based on projections is projection pursuit (see Huber, 1985; Jones & Sibson, 1987; Friedman, 1987), in which the nonlinear structure of high-dimensional data is explored through low-dimensional projections. The optimal low-dimensional directions are determined via a projection index that aims to detect nongaussianity. Hence, statistics for testing gaussianity could be based on the projection index, and typical instances are the βk ’s above. The projection index proposed in Jones and Sibson (1987) is nothing other than a weighted quadratic combination of βk ’s. A close relation between ICA and projection pursuit has been discussed by several authors (see, e.g., Cardoso, 1999). This letter is organized as follows. Section 2 describes our methodology and algorithm. Details are given in both the population version and the empirical (sample) version. A criterion for selecting the best or most informative dimension is proposed in section 3. The idea to construct well-known Akaike information criteria (AIC) is exploited to develop the proposed criterion. We argue that the gaussian distribution could be seen as the null structure for developing our criterion and present the necessary theoretical results. In section 4, we report the results of applying our selection method to real data sets. Section 5 discusses and appraises some model-based approaches to dimension reduction. The reports of Monte Carlo simulations and comparisons with the model-based approached discussed in section 5 are provided in section 6. All proofs of theoretical results are summarized in appendix A.
2 PC-IC Algorithm Our algorithm for dimension reduction and feature extraction is described in this section. It combines PCA and ICA.
Dimension Selection with PC-IC
517
2.1 PCA and ICA. ICA can be used to extract features in multivariate structure (Hyv¨arinen et al., 2001). Feature extraction and testing for gaussianity (normality) with the skewness and kurtosis approximations in ICA were discussed in Prasad et al. (2005) and Koch et al. (2005). The relevant skeleton of the algorithm in those articles contains two steps: reduce and maximize: Reduce. For 2 ≤ p < d, determine the p-dimensional subspace consisting of the first p principal component directions obtained via PCA. Maximize. Find the most nongaussian direction vector for the pdimensional data using ICA. The usual PCA is applied to high-dimensional data in the reduce step, and the reduced p-dimensional data are obtained as the p-dimensional principal component scores. ICA is applied to these p-dimensional principal component scores in the maximize step to find the most prominent structure. We shall refer to this algorithm as the PC-IC algorithm. 2.2 Algorithm in Detail: Population Version. We describe here the PCIC algorithm for the population setting. Assume that all components of X have fourth moment. Consider the spectral decomposition of = T , where = diag{λ1 , . . . , λd }, λ1 ≥ λ2 ≥ · · · ≥ λd > 0 are the eigenvalues of , and = [γ1 , . . . , γd ] is an orthogonal matrix whose column γ is the eigenvector of belonging to λ for = 1, . . . , d. For simplicity of notation, we assume that has rank d. Note that our arguments below are not affected if the rank of is r < d. The only change required for this situation is to consider values for p ≤ r instead of p < d. Let p = [γ1 , . . . , γ p ], and let p = diag{λ1 , . . . , λ p } for p < d. We obtain a dimension reduction of X to p dimensions via a transformation defined by Y = Tp (X − µ). The Y vector is known as the p-dimensional PC score. To extract features inherent in the original structure of X, the PC-IC algorithm applies FastICA (Hyv¨arinen et al., 2001) to this Y in the following way. First, the sphering of Y is accomplished by Z( p) = −1/2 Y = −1/2 Tp (X − µ). p p Note that Y and hence Z( p) are p-dimensional random vectors. The PC-IC algorithm finds the direction αo,k , which results in βk (Z( p)) for k = 3, 4, as
518
I. Koch and K. Naito
in equation 1.1: αo,k = arg max Bk (α|Z( p)). α∈S p−1
(2.1)
2.3 Algorithm in Detail: Empirical Version. The empirical version is based on the sample covariance matrix S of the data X1 , . . . , XN . The spectral decomposition S = T with descending order of eigenvalues naturally provides submatrices p and p defined similarly as p and p , respectively. From these we obtain the p-dimensional sphered data, Z( p)i = −1/2 µ), Tp (Xi − p
(2.2)
for i = 1, . . . , N, where µ = X. In the empirical version corresponding to equation 2.1, we let IC1k denote the PC-IC direction which gives rise to βk (Z( p)1 , . . . , Z( p) N ). So for k = 3, 4, Bk (α|Z( p)1 , . . . , Z( p) N ), αo,k = arg max IC1k = α∈S p−1
(2.3)
with Bk (α|Z( p)1 , . . . , Z( p) N ), as in equation 1.2. We explore the underlying structure of the original data and extract T T informative features from the projection ( αo,k Z( p)1 , . . . , αo,k Z( p) N ) of the reduced data Z( p)i (i = 1, . . . , N) onto IC1k (k = 3, 4). In Prasad et al. (2005), a similar algorithm, which is based on the first p directions IC1-IC p, was shown to be efficient in extracting features of high-dimensional data. A crucial—and still unanswered—question is that of the choice of p, the dimension onto which the data are projected from their original d dimensions. Prasad et al.’s heuristic procedure for selecting the dimension p is based on the variance contribution in the PCA step: they use the cut-off dimension p95% which results in a contribution to variance of at least 95%. This criterion may be reasonable if we analyze multivariate data by PCA only and do not have any other relevant information. However, a criterion based on the variance contribution does appear to be rather arbitrary and does not seem appropriate in a combined PC-IC, since ICA results in directions based on maximizing skewness and kurtosis. Our claim is that the dimension should be selected in accordance with the objective functions skewness and kurtosis if dimension reduction and feature extraction are implemented through these quantities. 3 Selection of the Dimension This section presents the procedure for choosing the best dimension. 3.1 Methodology. The first direction, IC1k (k = 3, 4), is the most nongaussian direction, at least in theory. This direction and its degree of
Dimension Selection with PC-IC
519
nongaussianity will depend on the input data. If the dimension of the input data is too small, ICA may not find interesting directions, since valuable information may have been lost. On the other hand, a large p requires a higher computational effort and the data may still contain noise or irrelevant information. A compromise between data complexity and available information is therefore required. A naive idea is to select the dimension p as pk = arg max∗ βk (Z( )1 , . . . , Z( ) N ), 2≤ ≤ p
for k = 3, 4, where p ∗ = min{N, d} is the upper bound (or p ∗ = r , where r denotes the rank of the sample covariance matrix). It is easy to see that this selection rule has the trivial solution pk = p ∗ for both k = 3, 4. This implies that this criterion is not meaningful. Therefore, we need to formulate a criterion that includes a penalty term for large dimensions. We exploit the idea of including the AIC in our problem. It is known that the AIC is based on a bias-adjusted version of the log likelihood. Since βk (Z( p)1 , . . . , Z( p) N ) is an estimator of βk (Z( p)), it is natural to reduce the bias of the estimator, and we therefore aim to evaluate
βk (Z( p)1 , . . . , Z( p) N ) − βk (Z( p)), Biask ( p) = E for k = 3, 4. The evaluation of this bias now raises the question of which population should be employed. 3.2 Null Structure. What distribution should we assume as the null structure in our setting? We recall that our objective in the ICA step is to pursue a nongaussian structure as measured by skewness or kurtosis. To judge whether the obtained nongaussian structure is real, it is necessary to evaluate the performance of βk (Z( p)1 , . . . , Z( p) N ) under the structure with null skewness and kurtosis. This suggests that the multivariate gaussian distribution itself is one of the right null structures. This kind of argument has been employed by Sun (1991) in conjunction with the significance level in projection pursuit. In what follows, all distributional results will be addressed under the assumption of gaussianity of X. We therefore note that βk (Z( p)) = 0 for k = 3, 4 and for all p ≤ p ∗ under the gaussianity assumption. Further, the gaussianity assumption implies that βk (Z( p)1 , . . . , Z( p) N ) is an estimator of zero, and so
βk (Z( p)1 , . . . , Z( p) N ) . Biask ( p) = E 3.3 Theoretical Results. Our problem requires the evaluation of Biask ( p) under the assumption that the X1 , . . . , XN are independently and identically distributed random vectors from the d−dimensional gaussian
520
I. Koch and K. Naito
distribution Nd (µ, ). Since skewness and kurtosis are location-scale invariant, without loss of generality we can assume that µ = 0 and = Id , the identity matrix. In this setting, as N → ∞,
N βk (Z( p)1 , . . . , Z( p) N ) ⇒ Tk ( p) k!
for any p ≤ p ∗ , where ⇒ denotes convergence in distribution and Tk ( p) is the maximum of a zero-mean gaussian random field on S p−1 with a suitable covariance function (see Machado, 1983; Kuriki & Takemura, 2001). Hence, for large N, E
N βk (Z( p)1 , . . . , Z( p) N ) E [Tk ( p)] k!
for k = 3, 4. Kuriki and Takemura (2001) consider the probability P (Tk ( p) > a ) for a > 0. As this tail probability is difficult to calculate exactly, they give upper and lower bounds for this probability. We make use of this idea and develop upper and lower bounds for the expected value of Tk ( p) for k = 3, 4 by noting that
E [Tk ( p)] =
∞
P (Tk ( p) > a ) da .
0
This leads to our main result. Theorem 1. For k = 3, 4, we have L Bk ( p) ≤ E [Tk ( p)] ≤ U Bk ( p). For each dimension p, L Bk ( p) =
p−1
ω p−e,k p−e,ρ− p+e (tan2 θk ),
e=0,e:even
U Bk ( p) = L Bk ( p) +
2k − 2 3k − 2
1/2
E χρ [1 − (θk , p)] ,
(3.1)
where χ denotes a chi-distributed random variable with degrees of freedom, ρ = ρ( p, k) =
p+k−1 , k
Dimension Selection with PC-IC
θk = cos
−1
2k − 2 3k − 2
521
1/2 , e/2
[( p + 1) /2] , [( p + 1 − e) /2] (e/2)! p−1 2k − 2 (θk , p) = ω p−e,k B( p−e)/2,(ρ− p+e)/2 3k − 2 e=0 e:even ω p−e,k = (−1) k e/2
( p−1)/2
k−1 k
and B α,β (·) denotes the upper-tail probability of the beta distribution given by
Bα,β (a ) =
ξ α−1 (1 − ξ )β−1 dξ B(α, β)
1
a
with the usual beta function B(α, β). Further, for b > 0 and positive integers m and n, m,n (b) = m − m,n (b) with m = E [χm ] =
m,n (b) =
∞
∞
a2
0
√ [(m + 1) /2] 2 , (m/2) ¯ n (bξ )dξ da , gm (ξ )G
¯ n (t) = P(χn2 > t) for t ≥ 0. Exwhere gm (ξ ) is the density function of χm2 and G pressions for m,n (b) are m,2ν (b) = E [χm ] √
m+1
1 b+1
m,2ν−1 (b) = m,1 (b) + E [χm ] √ ×
ν−2 i=0
i ν−1 b 1 1 1+ , i b + 1 B (i, (m + 1) /2) i=1
1 i + 1/2
1
m+1
b+1 i+1/2
b b+1
1 . B (i + 1/2, (m + 1) /2)
3.4 Practical Calculations. Exact evaluation of the upper and lower bounds U Bk ( p) and L Bk ( p) in theorem 1 is not easy. Even with Mathematica, the terms p−e,ρ− p+e (b) in p−e,ρ− p+e (b) are not easily calculated for some combinations of ρs and ps (in particular, for the cases ρ − p + e = 3mod4),
522
I. Koch and K. Naito
while the terms p−e,k are not a problem. The reason these expressions are difficult to evaluate is that a summation of a large number of terms is required; the range of the individual terms is very large, and the weights are positive and negative. We therefore do not evaluate the formulas for p−e,ρ− p+e (b) directly but aim to obtain an approximate value. From theorem 3.1 in Kuriki and Takemura (2001), a general form of our target can be expressed as
∞
m,n (b) =
∞
a2
0
gm (ξ )G n (bξ )dξ da ,
where G n (·) is the cumulative distribution function of χn2 . To obtain an approximation, we consider a truncated version of m,n (b), defined as
T
m,n (b|T) = 0
∞
a2
gm (ξ )G n (bξ )dξ da .
Consider a random variable U, uniformly distributed on an interval [0, T], and let FU denote the cumulative distribution function of U. Furthermore, let Y ∼ χm2 . Assume that Y and U are mutually independent. Then we have
m,n (b|T) =
T
E Y 1{Y ≥ a 2 }G n (bY) da
0
= T EU,Y [1{Y ≥ U 2 }G n (bY)] = T E Y {G n (bY)EU [1{Y ≥ U 2 }]} √ = T E Y [FU ( Y)G n (bY)], and hence m,n (b|T) can be expressed as the expectation of Y only. Therefore, a Monte Carlo estimate for m,n (b|T) is m,n (b|T) =
N T FU ( Yj )G n (bYj ), N j=1
where the Y1 , . . . , YN are independently and identically distributed random variables from the χm2 distribution. m,n (b|T) is an unbiased estimate of m,n (b|T), but not Obviously of m,n (b). Moreover, it underestimates m,n (b); this follows from the definition of the
truncated version. The mean squared error (MSE) of m,n (b|T), MSE m,n (b|T) , is given as follows:
Dimension Selection with PC-IC
523
Theorem 2. As T → ∞ and N → ∞,
T2 m,n (b|T) = O exp(−T)T m−2 + . MSE N For practical purposes, we suggest using the 100(1 − α)% point of the 2 χm2 distribution as Te , which we denote by χm,α for each m. (Recall that m = p − e as in theorem 1.) Then, using the expression of the squared bias (see the proof of theorem 2), it follows that α 2 1 2 MSE m,n (b|χm,α ) = O + . b N
From theorem 1, the actual value of b is tan2 θk , that is, 3/4 for k = 3 and 2/3 for k = 4. If we choose α = 10−10 , (α/b)2 ≈ 1.33 × 10−10 for k = 3, (α/b)2 ≈ 1.5 × 10−10 for k = 4. This (α/b)2 is the quantity that will not vanish, but it can be as small as required, so will be negligible for practical purposes. Now let L BkT ( p) =
p−1
ω p−e,k p−e,ρ− p+e (tan2 θk |Te )
e=0,e:even
be a truncated version of L Bk ( p), where the Te s are chosen differently for m,n (b|T) appropriately yields the estimate each e. Then combining T L B k ( p) =
p−1
p−e,ρ− p+e (tan2 θk |Te ) ω p−e,k
e=0,e:even 2 of L Bk ( p|T). Hence, if we use independent samples distributed as χ p−e for p−e,ρ− p+e , it follows that each
MSE L B k ( p|T) =
p−1
p−e,ρ− p+e (tan2 θk |Te ) . ω2p−e,k MSE
e=0,e:even
3.5 The Proposed Criterion in Practice. The second term on the righthand side of equation 3.1 can be calculated with sufficient accuracy, for example, with Mathematica. So we define U B k ( p) = L B k ( p|T) +
2k − 2 3k − 2
1/2
E χρ [1 − (θk , p)] .
(3.2)
524
I. Koch and K. Naito
The actual values of U B k ( p) are tabulated in Table 3 in appendix B. Our criterion for choosing the best or most informative dimension is based on the bias-adjusted version of βk given by Ik ( p) =
N B k ( p). βk (Z( p)1 , . . . , Z( p) N ) − U k!
(3.3)
For k = 3, 4 the dimension selector is thus given by pk = arg max ∗ Ik ( p), k = 3, 4. 2≤ p≤ p
(3.4)
Here, we have chosen U Bk ( p) as an appropriate substitute for E[Tk ( p)]. Our calculations show that L Bk ( p) vanishes very rapidly with increasing dimension, and thus it contains little information, while the extra term in U Bk ( p) dominates the behavior of U Bk ( p). Since U Bk ( p) is an upper bound for E[Tk ( p)], Ik ( p) is smaller than the corresponding expression with E[Tk ( p)] instead of U B k ( p). It is not clear whether this affects the determination of pk in equation 3.4. 4 Implementation: Real Data 4.1 The Data Sets. We apply the bias-adjusted selector of the best or most informative dimension to a number of real data sets. Apart from the last two, all our data sets are available on the UCI Machine Learning Repository (see Blake & Merz, 1998). References and more information about these data sets are available on their Web site. Some of these data sets have also been used in Prasad et al. (2005) in their feature selection. We calculate only the first IC from the p-dimensional scores, while Prasad et al. obtained and used all p directions IC1, . . . , IC p for their efficient and accurate classification. The Heroin Market data consists of 17 key indicator data series with observations (counts) collected over 66 consecutive months from January 1997 to December 2002. These data series form part of a broader project funded by the National Drug Law Enforcement Research Foundation (NDLERF). (For a detailed description of these series, see Gilmour & Koch, 2004.) This data set is different from all others used here in that it contains only 17 independent observations, but each observation has 66 features or dimensions. Thus, we are dealing with a HDLSS (high-dimension, low-sample-size) problem. As we shall see, the skewness-based criterion was able to find the most informative dimension, while the kurtosis-based dimension did not. Finally, the Swiss Bank Notes data set stems from Flury and Riedwyl (1988). It deals with six measurements of 100 real and 100 counterfeit old
Dimension Selection with PC-IC
525
Swiss 1000 franc bank notes. A preliminary PCA showed that the first principal component scores are bimodal, and PC1 contributes about 67% to the total variance. In the data sets Bupa Liver, Abalone, Glass, Wine, and Breast Cancer, the different features vary greatly in their range. Features on a large scale dominate the contribution to variance and would normally be picked as outliers in a PC analysis. As our primary interest is the determination of the optimal dimension rather than the specific derived features (see Hastie & Tibshirani, 2002), we used most data sets in their raw and standardized version in our PC-IC analysis. The “Comment” column in Table 1 indicates whether the raw data or the standardized data were used. Here, standardization refers to centering with the sample mean and normalizing with the standard deviation, usually across features, with the exception of the Heroin data set, where we standardized each time series, as the time series (rather than the features) were on different scales. 4.2 Calculation of the Best Dimension. Theorem 1 gives an expression for the upper bounds U Bk , which are used in the bias-adjusted selector, equation 3.4. The calculations of the optimal dimension were carried out in Matlab 7.0.4. For the PC part, we used pcaSM (see Marron, 2004), and for the IC part we used FastICA2.4 (see Hyv¨arinen & Oja, 2005). The data sets vary in their dimensions, from 6 to 33, and number of independent samples. For most of the data sets, we considered all samples as well as smaller subsets of the samples. Obviously the performance of pk depends on the sample size. We shall see in this and the next section that the larger the sample size, the higher the value of pk . For each data combination, that is, for a data set in its raw or standardized version, with all samples or with a subset of samples, we calculate the p-dimensional sphered data Z( p)i (i = 1, . . . , N) as in equation 2.2 for p = 2, . . . , p ∗ , with p ∗ = r ≤ min{N, d} where r denotes the rank of the covariance matrix S. For the p-dimensional sphered data, we calculate the most nongaussian direction with FastICA and the absolute skewness and absolute kurtosis. FastICA is an iterative technique with randomly chosen starting positions. We observed that different runs of ICA (even for the same input data) can converge to different components or different local optima. We found empirically that 10 runs of IC1 for each data combination were sufficient to guarantee a global skewness or kurtosis maximum of IC1. Using equations 3.3 and 3.4 we determine the best dimension. 4.3 Results and Interpretation. The results of the best dimension selection are given in Table 1. The second column details the dimension of each data combination: N denotes the size of the sample, and d denotes the dimension of the original feature space. The relative contributions to
variance due to the most nongaussian subspace are given in the Percentage Variance column. The last two columns contain the skewness and kurtosis dimensions p3 and p4. Figures in parentheses in the p4 column denote a local maximum at the dimension given in parentheses. If the values of pk differed for skewness and kurtosis, we report two numbers in the variance column: the first refers to the variance contribution obtained for the skewness dimension and the second to the contribution due to the kurtosis dimension. Table 1 highlights the following:
- For the same size of the sample, the raw data resulted in a lower pk value than the standardized data.
- Smaller samples result in a smaller pk value.
- The p3 and p4 values are mostly identical.
- pk does not appear to be related to the variance contribution of 95%.
How can we interpret the results presented in Table 1? Consider the Swiss Bank Notes. Measurements of six different quantities were made for 200 bank notes, and these six quantities correspond to properties of the bank notes, such as their length. Our method selected the first four PC dimensions. This means that these PC combinations are regarded as best in capturing the nongaussian nature of the data. Since they are linear combinations of the original measurements, we learn how to combine the original six measurements in such a way that the dimensionality of the problem is reduced while the most relevant information is preserved. For nongaussian data such as the Swiss Bank Notes, our criterion takes advantage of the nature of the data. For gaussian data, other dimension-selection criteria exist, and we briefly describe two such methods in section 5. Different criteria lead to different solutions and therefore have different interpretations.

To conclude this section, we return to the dimension-selection results of Table 1. If the raw data contain outliers or observations on a different scale, a high contribution to variance is obtained with a small number of PC directions. Outlier directions are highly nongaussian, and a low-dimensional space will therefore be the most informative; that is, ICA will find the most nongaussian directions within a low-dimensional space. Subsets of the original data set resulted in a lower pk value than the total sample. This is consistent with the test for gaussianity carried out inside ICA. In the next section, we consider this aspect of the dimension selector further. FastICA mostly results in very similar values for p3 and p4. When the p4 value was clearly bigger than p3, the calculations showed that the kurtosis approximation had a local maximum at p = p3.
Table 1: Dimension Selection for Real Data.

Data Set        N×d       Comment                             Percentage Variance   p3   p4
Bupa Liver      345×6     Raw                                 97.2445                3    3
                100×6     Raw, first 100 records              97.821                 2    2
                345×6     Standardized                        95.7750                5    5
                100×6     Standardized, first 100 records     95.5136                5    5
Swiss Notes     200×6     Raw                                 97.3140                4    4
Abalone         4177×8    Raw                                 100                    7    7
                1000×8    Raw, 1000 random records            99.9979; 99.9991       6    7
                100×8     Raw, first 100 records              99.9767; 99.9993       3    6 (3)
                4177×8    Standardized                        100                    6    7
                1000×8    Standardized, 1000 random records   98.3901; 99.2634       4    5
                100×8     Standardized, first 100 records     98.1336; 99.7993       4    6 (4)
Glass           214×9     Raw                                 98.2290; 99.8338       5    6
                214×9     Standardized                        99.2726                7    7
Wine            178×13    Raw                                 99.9997; 100           8   10
                178×13    Standardized                        89.3368                7    7
Breast Cancer   569×30    Raw                                 100                   15   27 (15)
                100×30    Raw, first 100 records              99.8967; 100           2   24 (2)
                569×30    Standardized                        99.9999; 100          25   26
                100×30    Standardized, first 100 records     96.6162; 98.0303      11   13
Ionosphere      351×33    Raw                                 69.5810; 99.4328       7   31 (7)
                200×33    Raw, first 200 records              57.0032; 96.7084       5   25
                100×33    Raw, records 101–200                77.3381; 91.9988       9   16 (9)
Heroin market   17×66     Raw                                 99.9386                2    2
                17×66     Standardized                        67.4945                3    3
A comparison of the pk values we obtained with the dimension p95% chosen in Prasad et al. (2005) (which is based on explaining 95% of the variance) indicates that contribution to variance is not a good predictor of the best nongaussian dimension. Indeed, other effects, such as the differing scales of measurements, outliers, and sample size, affect the dimension of the most nongaussian space more strongly.

5 PCA-Based Approaches

We describe some model-based approaches to dimension reduction that have been used in the recent literature and examine their applicability. These approaches fall into two main groups:
- Probabilistic principal component analysis
- Bayesian principal component analysis
The approaches have the same starting point: a common underlying model,

$$X = AV + \mu + \epsilon, \qquad (5.1)$$
where X denotes a d-dimensional vector of observations with mean μ, V denotes a p-dimensional vector of latent variables, and ε ∼ N(0, σ²I_d) represents independent gaussian noise. In this framework, p < d, A is a d × p matrix, and both A and p are assumed to be unknown. The aim is to determine the unknown dimension p of the latent variables V.

Probabilistic principal component analysis is described in Tipping and Bishop (1999) and Bishop (1999). Their key idea is that the dimension of the latent variables is the value of p that minimizes the estimated prediction error. We denote this solution by p^(P). In addition to the assumptions of equation 5.1, we assume the following:

P1: The latent vectors are independent and normal with V ∼ N(0, I_p).
P2: The observations are normal with X ∼ N(μ, AA^T + σ²I_d).
P3: The conditional distributions X|V and V|X are normal with means and variances derived from the means and variances of X, V, and ε.

Under assumptions P1 to P3, the estimated prediction error is the negative log likelihood. For a sample of size N from this model, the log likelihood is given by

$$L = -\frac{N}{2}\Big[\, d \log(2\pi) + \log|C| + \mathrm{tr}(C^{-1}S) \,\Big], \qquad (5.2)$$

where C = AA^T + σ²I_d.
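For concreteness, a minimal Matlab sketch of how equation 5.2 can be used to pick p^(P) is given below. It relies on the standard Tipping-Bishop result that, at the maximum likelihood solution, the log likelihood depends only on the eigenvalues of S; the variable names and this profiled form are our own shorthand, not notation from the paper.

```matlab
% Hedged sketch: choosing p^(P) by maximizing the PPCA log likelihood of
% equation 5.2 at its maximum likelihood solution (Tipping & Bishop, 1999).
% lam: eigenvalues of the sample covariance S, sorted in decreasing order; N: sample size.
d = numel(lam);
L = -inf(1, d-1);
for p = 1:d-1
    sig2 = mean(lam(p+1:d));              % ML estimate of the noise variance
    L(p) = -(N/2) * ( d*log(2*pi) + sum(log(lam(1:p))) ...
                      + (d-p)*log(sig2) + d );
end
[Lmax, pP] = max(L);                      % p^(P) minimizes the estimated prediction error
```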
The second approach, Bayesian principal component analysis, is described in Penny, Roberts, and Everson (2001) and Beckmann and Smith (2004). It is based on the key idea of a Bayesian framework proposed in Minka (2000). We first describe the main ideas of Minka's method and then discuss how these ideas are used in Penny et al. and in Beckmann and Smith. Similar to Tipping and Bishop (1999), Minka (2000) starts with the gaussian model of equation 5.1, which satisfies P1 and P2. Instead of employing the maximum likelihood solution for A, σ, and the mean μ, Minka proposes a Bayesian framework that requires evaluation of the conditional probabilities pr(X₁, ..., X_N | A, V, μ, σ²). Minka uses a noninformative prior for the mean μ and Laplace's method to approximate the conditional density. The final probability is conditioned on the dimension p of the latent variables V only:

$$p(X_1, \ldots, X_N \mid p) \approx pr(U)\, \Big(\prod_{j=1}^{p} \lambda_j\Big)^{-N/2} \hat{\sigma}^{-N(d-p)}\, (2\pi)^{(m+p)/2}\, |A_Z|^{-1/2}\, N^{-p/2}, \qquad (5.3)$$

with

$$m = dp - p(p+1)/2, \qquad pr(U) = 2^{-p} \prod_{j=1}^{p} \Gamma((d-j+1)/2)\, \pi^{-(d-j+1)/2},$$

$$|A_Z| = \prod_{i=1}^{p} \prod_{j=i+1}^{d} (\hat{\lambda}_j^{-1} - \hat{\lambda}_i^{-1})(\lambda_i - \lambda_j)\, N,$$

$$\hat{\lambda}_i = \lambda_i \ \text{for}\ i \le p, \qquad \hat{\lambda}_j = \hat{\sigma}^2 \ \text{for}\ j > p.$$
The product on the right-hand side of equation 5.3 is used to estimate the likelihood, and the value p^(B) that maximizes this product is the solution to the latent variable dimension selection.

Penny et al. (2001) use the linear mixing framework of ICA, described in equation 5.1, which deviates from the classical ICA framework in that it includes additive noise. In their model, μ = 0, and assumptions P1 to P3 are replaced by the weaker assumptions B1 and B2:

B1: The latent vectors are independent.
B2: The observations X conditional on A and V are normal with (X|A, V) ∼ N(AV, σ²I_d).

For this framework, they extend Tipping and Bishop's (1999) approach in their section 12.3 by including a rotation matrix Q in the expression for the mixing matrix A, as is usual in ICA. In their section 12.4, the authors state that "for gaussian sources the ICA model reduces to PCA." From there on, they propose the use of Minka's method to determine the dimension p, but they also discuss "flexible" source models. We will return to this point.

The essence of Beckmann and Smith (2004) is Minka's method; however, the authors take into account the limited data structure and the structure of noise as found in their FMRI applications. The latter refers to preprocessing of the noise, which includes standardizing the noise. For gaussian sources V, they propose the use of Minka's p^(B), but they differ from Minka by using the mean as an estimate for μ instead of Minka's prior distribution. After determining p^(B), the authors calculate the ICA rotation matrix Q and the independent sources within a standard ICA framework. This last step
does not appear to relate to the choice of dimension, but it is relevant to the discussion of the applicability of their methodology.

Tipping and Bishop (1999) and Minka (2000) propose clear, easy-to-use, and theoretically well-founded methods for selecting a lower dimension for the latent variables. Both articles make it very clear that the sources as well as the noise are gaussian. These properties enable them to combine PCA ideas with gaussian likelihood maximization and result in elegant solutions. The two solutions will, of course, be different, as different problems are solved (probabilistic and Bayesian, respectively). Penny et al. (2001) and Beckmann and Smith (2004) have extended these algorithms to ICA frameworks, that is, to nongaussian sources. Penny et al. offer some discussion regarding a number of other source distributions and flexible source models and give two such examples where reference is made to specific (nongaussian) source distributions and how to accommodate them in Minka's gaussian framework. This extension will no doubt work in the specific cases but does not lead to a general methodology. In contrast, Beckmann and Smith combine two approaches with inconsistent assumptions: finding the best p^(B) by assuming that the sources are gaussian, and then finding p^(B) nongaussian sources with standard ICA methodology. Since real data are never purely gaussian, and since, in practice, ICA finds nongaussian structure even in gaussian randomly generated data, the method could lead to reasonable results for some data. The question of interest here is whether the PCA-based methods and their extensions described in Tipping and Bishop (1999) and Minka (2000) can result in the right dimension reduction without knowledge of the underlying model or distribution.

Like the initial approaches of Tipping and Bishop and Minka, the method we propose is generic and supported by theory. We are not interested in finding modifications for specific cases as suggested by Penny et al. (2001) and Beckmann and Smith (2004). In the following section, we compare our method with those of Tipping and Bishop (1999) and Minka (2000). As the simulations will demonstrate, for the realistic nongaussian sources we use, neither of these methods is appropriate, and both are inferior to our method in terms of finding the correct dimension. This result is not surprising, since the assumptions of their models are violated. These comparisons strongly indicate the need for our method, which does not replace the existing methods but complements them.

6 Implementation: Simulation Study

6.1 Nongaussianity, Sample Size, and PC Projections. We test the performance of our approach in simulations based on a mixture of gaussian and nongaussian directions. We determine the number of nongaussian directions with the PC-IC algorithm and FastICA. For comparison, we also
include results obtained with the probabilistic and the Bayesian approaches discussed in section 5. Our theoretical arguments are based on the limit case N → ∞, while in practice, we deal with small and moderate sample sizes. The simulation results are therefore not expected to give a precise answer, especially for smaller sample sizes.

A powerful test for nongaussianity is given by the negentropy: for gaussian data, the negentropy is zero, while positive negentropy indicates deviations from gaussianity. The notion of negentropy relies on the joint probability density of the data. Density estimation for multivariate or high-dimensional data becomes more complex as the dimension increases, and good estimates in high dimensions require a large number of observations. Even then, current density estimation methods are not always very accurate. For this reason, approximations to the negentropy are commonly used. FastICA uses the skewness and kurtosis as approximations to the negentropy (referred to as nonlinearities in their framework). Below we discuss the capability of skewness and kurtosis to distinguish gaussian from nongaussian models. This analysis informs our choice of simulation models.

6.1.1 Small Sample Sizes. If the sample size is small, the empirical distribution of nongaussian data may not differ sufficiently from that of a gaussian sample of the same size. This means that in a test for gaussianity, the data appear consistent with the gaussian model and will not be rejected as nongaussian. Thus, a technique such as ICA, which tests the deviation from gaussianity in the new directions, tends to underestimate the number of nongaussian directions, and hence also the dimension of the most nongaussian subspace. For small samples, pk is therefore likely to be smaller than the true number of nongaussian dimensions.

6.1.2 Large Sample Sizes. If the sample size is large, the model is a better approximation to the theory, and the dimension pk should therefore give an accurate description of the presence of nongaussian structure in the data. In the large sample case, nongaussianity tests are powerful and easily find deviations from gaussianity. Thus, nongaussian directions will appear as nongaussian. FastICA employs a very sensitive test for nongaussianity: indeed, even for samples from the multivariate gaussian distribution, FastICA will find nongaussian directions (see Koch, Marron, & Chen, 2005). Thus, for large samples, the number of nongaussian directions ICA finds could be an overestimate. Consequently, pk is likely to be bigger than the true number of nongaussian dimensions.

6.1.3 Skewness as a Test for Nongaussianity. The skewness is based on the third moment of a random vector or sample. Symmetric distributions have zero skewness. As the skewness is location and scale invariant, the
skewness approximation to negentropy will not be able to differentiate between the gaussian distribution and a distribution that is symmetric about its mean. In particular, ICA-based skewness tests will not be able to differentiate between gaussian and uniform data, or between data from the gaussian and the double exponential distributions, and pk will severely underestimate the nongaussian structure.

6.1.4 Kurtosis as a Test for Nongaussianity. The kurtosis is based on the fourth moment of a random vector or sample. Kurtosis as in equation 1.2 is adjusted to yield zero kurtosis for multivariate gaussian random vectors. For supergaussian distributions, kurtosis is positive, while subgaussian distributions lead to negative kurtosis values. However, the kurtosis test in FastICA is biased toward positive kurtosis, and it is therefore harder to differentiate subgaussian distributions from the gaussian distribution. Consequently, for subgaussian data, pk is smaller than the true number of subgaussian data directions. Indeed, our simulations with the uniform distribution confirmed this bias in the kurtosis estimates.

6.1.5 Support of the Distribution. The gaussian distribution is characterized by its infinite support. Distributions with compact support are therefore clearly distinct from those with infinite support. Neither skewness nor kurtosis tests are able to differentiate on the basis of the support of the distribution. Specifically, the beta distribution with positive and equal parameter values has skewness and kurtosis similar to those of the gaussian distribution, and FastICA is not able to distinguish between them. So, again, pk is smaller than the true number of nongaussian data directions.

6.1.6 PC Projections. The first step in the PC-IC algorithm consists of reducing the d-dimensional original data to p-dimensional scores for p = 2, ..., r, where r = min{min(d, N), rank(S)}. The PC scores are weighted averages of the original data. As the dimension d becomes large (d ≥ 30), a central limit effect seems to be noticeable in the projected data; thus, the scores are more gaussian than the original data, and pk will be smaller than the true number of nongaussian directions.
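A quick numeric illustration of the points in sections 6.1.3 to 6.1.5 (a sketch; the quoted targets are the population skewness and excess kurtosis of each distribution):

```matlab
% Sample skewness and excess kurtosis for three reference distributions.
N = 1e6;
sk = @(y) mean((y - mean(y)).^3) / std(y, 1)^3;       % skewness
ku = @(y) mean((y - mean(y)).^4) / std(y, 1)^4 - 3;   % excess kurtosis (0 for gaussian)
g = randn(N, 1);               % gaussian:       skewness ~ 0, kurtosis ~ 0
u = rand(N, 1);                % uniform:        skewness ~ 0, kurtosis ~ -1.2
e = -log(rand(N, 1)) / 2;      % Exp(lambda=2):  skewness ~ 2, kurtosis ~ 6
fprintf('gaussian:    %6.3f %6.3f\n', sk(g), ku(g));
fprintf('uniform:     %6.3f %6.3f\n', sk(u), ku(u));
fprintf('exponential: %6.3f %6.3f\n', sk(e), ku(e));
```

The uniform case shows why a skewness-based test is blind to symmetric nongaussianity, while its negative kurtosis marks it as subgaussian.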
6.2 Simulation Models and Results

6.2.1 Simulation Strategy. The above discussion informs our choice of distributions and our simulation strategy. Since the accuracy of pk depends on the type of distribution as well as the sample size, we demonstrate its performance across a number of distributions and sample sizes and also show its relative accuracy by comparing it for models that differ only by the number of nongaussian dimensions but are the same otherwise.
Table 2: Dimensions d for Simulated Models with dng Nongaussian Dimensions.

d     5   5   7   7   10   10   10   20   20   20   20
dng   3   4   3   5    3    4    6    3    4    5    6
6.2.2 Nongaussian Models and the Mixed Data. The development of our approach and theoretical arguments requires N independent observations X₁, ..., X_N without reference to any model. The only assumption we require is that the observations have fourth moments. Such d-dimensional data correspond best to the generic ICA description:

$$X = AV, \qquad (6.1)$$

with A the d × d mixing matrix and V a d-dimensional vector of independent sources. This ICA model is different from that in equation 5.1, which forms the basis of the PCA-type approaches. We choose three nongaussian families of distributions, which set them apart from the gaussian distribution in ways that skewness or kurtosis should be able to discern:
- The beta distribution with parameters 5 and 1.2. This distribution has compact support and is asymmetric.
- A mixture of two gaussians consisting of one-quarter of the samples from the gaussian distribution N(−1, 0.49) and three-quarters of the data from the gaussian N(4, 1.44). This distribution is strongly bimodal and asymmetric.
- The exponential distribution with parameter λ = 2. This is a supergaussian distribution that is also asymmetric.
For each of these models, we generate independent source data. We generate N × dng independent random samples from the distribution F, where F denotes any one of the three distributions above, and N × dg independent samples from the multivariate gaussian distribution with zero mean and covariance matrix I_dg. Then d = dng + dg denotes the dimension of the independent source data V. In the simulations, we vary d and dng within d. For the sample size N, we use the values N = 20, 50, 100, 200, 500, 1000, 2000. Table 2 summarizes the combinations we use.

We apply a d × d mixing matrix A to the independent source data V₁, ..., V_N to obtain the observed or mixed data X₁, ..., X_N given in equation 6.1. We use a class of Higham test matrices (see Davies & Higham, 2000). These matrices are generated in Matlab with the command A = gallery('randcolu', d). The columns in the random matrix A have unit length. The singular values of this matrix can be specified, but we do not use
this option. Matrices of this form are invertible. For all simulations based on a specific dimension d, the same matrix A = A(d) is used.

6.2.3 Approaches for Calculating the Dimension p. We apply the PC-IC algorithm to mixed data X₁, ..., X_N obtained from sources V = V(N, d, dng, F). The independent sources are characterized by d, dng, the nongaussian distribution F, and the sample size N. For p = 2, ..., d, we determine the most nongaussian IC1 direction by taking the maximum over 10 IC1 repetitions obtained with FastICA2.4. From these values for each p, we determine the pk (k = 3, 4) as in equation 3.3. Our method is based on skewness and kurtosis. We carry out separate simulations for skewness and kurtosis, which result in the values p3 for skewness and p4 for kurtosis.

In addition, we apply the probabilistic PCA approach of Tipping and Bishop (1999), which we refer to as PPCA, and the Bayesian approach of Minka (2000), referred to as BPCA, to the data X₁, ..., X_N. For the probabilistic PCA, we calculate the log likelihood of equation 5.2 and find its maximizer p^(P). For the Bayesian approach, we find the maximizer p^(B) of equation 5.3. In the PCA-based simulations, p ranges over the values {1, ..., d − 1}. Our approach considers values p ∈ {2, ..., d}, since ICA requires at least a two-dimensional subspace to find a nongaussian direction. As our approach is nonparametric and treats all observations the same way, the flexible sources referred to in Penny et al. (2001) are not appropriate, nor is the preprocessing of the noise as described in Beckmann and Smith (2004) applicable, since there is no additive noise in the model of equation 6.1. Thus, the algorithms we can compare our method with are the probabilistic PCA given in equation 5.2 and the Bayesian PCA given in equation 5.3.

6.2.4 Simulation Results. For 1000 runs with the sources V(N, d, dng, F), we recorded the proportion of times that dimension p was selected. Figures 1 to 3 show typical performances for our method and the two PCA-based methods. The pattern is the same in each figure. Each row of subplots presents the results of one method. In the first row, we present our skewness results, referred to as PC-IC skewness in the figures. In Figures 1 and 2, the second row shows our kurtosis results (PC-IC kurtosis). The second last row displays the results of PPCA, and the last row shows the results of BPCA. We display the results for the sample sizes N = 50, 100, 200, 500, and 1000. Each column of subplots displays the results of a fixed sample size; the first column has results for N = 50, the next for N = 100, and the last for N = 1000. The x-axis of each subfigure displays the range of dimensions 1, ..., d, and the y-axis shows the percentage that a particular dimension p was the maximizer in 1000 runs. The correct dimension, namely, the number of nongaussian dimensions, is indicated by a square shape in each plot, for example, dimension 4 in Figure 1.
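As an illustration, the data generation of section 6.2.2 for the bimodal mixture can be sketched in Matlab as follows (a sketch under our reading of the model; only the gallery call is taken directly from the text):

```matlab
% Generate mixed data X = A V with dng bimodal and dg gaussian source dimensions.
N = 500; d = 10; dng = 3; dg = d - dng;
comp = rand(N, dng) < 0.25;                       % one-quarter from the first component
Vng = comp  .* (-1 + 0.7 * randn(N, dng)) + ...   % N(-1, 0.49)
     (~comp) .* ( 4 + 1.2 * randn(N, dng));       % N(4, 1.44)
V = [Vng, randn(N, dg)];                          % append independent gaussian sources
A = gallery('randcolu', d);                       % Higham test matrix, unit-length columns
X = V * A';                                       % rows of X are the mixed observations X_i = A V_i
```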
Figure 1: Proportion p against dimension for five-dimensional distribution. Sample sizes N = 50, 100, 200, 500, 1000 vary along rows. p with PC-IC skewness (top row), with PC-IC kurtosis (second row), with PPCA (third row), with BPCA (bottom row). Correct dimension indicated by a square, here at p = 4.
Figure 1 displays results for low dimensions. For five-dimensional data, we use the beta distribution for four of the independent sources. The results are displayed in the 20 subfigures of Figure 1. The PC-IC skewness criterion (top row) finds the correct dimension for N = 100 and 200. For larger values of the sample size, p3 is a little bigger than the true value. This is consistent with our comments about large samples. The results obtained with kurtosis are not as good as those for skewness. Although kurtosis finds the correct dimension for the two largest sample sizes, the percentages for all values of p are quite high. Further, skewness is computationally preferable, as kurtosis calculations take two to four times longer. The third row shows the results from PPCA: for all sample sizes, the maximum occurs at the correct dimension. Finally, the results for BPCA indicate that BPCA finds the correct dimension for the smallest sample size (N = 50), but for all other sample sizes, BPCA has a maximum at p^(B) = 1.

Figure 2 shows the results for 10-dimensional data. Here we use the bimodal distribution in three dimensions. For this moderate dimension, PC-IC skewness (top row) finds the correct dimension for the larger sample sizes and underestimates it by one for the smaller values. This is expected: p3 tends to increase with sample size.
Figure 2: Proportion p against dimension for 10-dimensional distribution with three bimodal dimensions. Sample sizes N = 50, 100, 200, 500, 1000 vary along rows. p with PC-IC skewness (top row), with PC-IC kurtosis (second row), with PPCA (third row), with BPCA (bottom row). Correct dimension indicated by a square, here at p = 3.
The kurtosis results are very similar in that the correct dimension is underestimated by one for the smaller sample sizes, but again, the skewness results are superior to the kurtosis results. The PPCA plots all have a maximum at 1. Unlike the results for PC-IC skewness, the dimension of the maximum does not change with sample size and indeed is wrong in each case. BPCA shows a trend that was not so clear for the five-dimensional data of Figure 1: for low sample sizes, the maximum of BPCA occurs at the last value of its range, d − 1, and for higher sample sizes, the maximum occurs at 1.

Figure 3 shows the results for 20-dimensional data with four bimodal dimensions. As the kurtosis results are inferior to the skewness results, we present only the skewness results here. For the three larger samples, PC-IC skewness determined the correct dimension convincingly. For 20-dimensional data, sample sizes of 50 or 100 are very small, and it is therefore not surprising that the skewness criterion underestimates the dimension. The second row of figures shows that PPCA picked d − 1 = 19 as the dimension of the latent variables for each sample size. This choice is far from the correct choice: four dimensions.
Figure 3: Proportion p against dimension for 20-dimensional distribution with four bimodal dimensions. Sample sizes N = 50, 100, 200, 500, 1000 vary along rows. p with PC-IC skewness (top row), with PPCA (middle row), with BPCA (bottom row). Correct dimension indicated by a square, here at p = 4.
As in the results in Figure 2, PPCA appears to be independent of the sample size, but this time it picks the other extreme dimension, d − 1, instead of 1 as in the 10-dimensional data. We will comment on this discrepancy between the two results in the last paragraph of this section. BPCA shows the same pattern as for the 10-dimensional data of Figure 2: for the two smallest sample sizes, the maximum occurs at the largest dimension (at p^(B) = 19), and for the two largest sample sizes, the maximum occurs at 1. For N = 200, we note that the peak has now moved to p^(B) = 3, which happens to be close to the true value.

These figures show that PC-IC with the skewness criterion produces good results for nongaussian distributions of diverse dimensions and different sample sizes. PC-IC with skewness is mostly superior to PC-IC with kurtosis in determining the correct dimension. This is a little surprising, but the results are very convincing. Since skewness is calculated more efficiently than kurtosis with FastICA, we recommend the use of PC-IC with skewness. When the dimension of the data is small, as in Figure 1, PC-IC with skewness yields correct results for the smaller sample sizes we considered, while for larger samples, the selected dimension can be too large. For small N and moderate or larger dimensions d ≥ 10, p3 tends to be a little smaller than dng, but as N increases, PC-IC selects the correct dimension. These results are consistent with the observations of section 6.1 and the results in Table 1.

A comparison between PC-IC with skewness and the PCA-based methods shows convincingly that our method is clearly superior to either of these two for nongaussian data and models as in equation 6.1. This is not
surprising, since both methods were developed for gaussian data with additive gaussian noise. PPCA appeared to select the correct dimension for the low-dimensional simulations based on the beta distribution. In some sense, this distribution is closer to the normal distribution than the bimodal one, so a possible explanation for the good performance of PPCA is that it shows some distributional robustness; that is, PPCA can handle models that are close to the gaussian model for which it was designed. The bimodal distribution is very different from the gaussian distribution, and it is therefore not surprising that PPCA does not yield such good results. What is curious, however, is the jump from the lowest to the highest dimension that occurred between Figures 2 and 3, or between dimensions 10 and 20. Further simulations (not presented here) showed that PPCA depended strongly on the model, here the class of mixing matrices, so PPCA is not model robust. These simulations indicate that the class of models and distributions for which PPCA works is quite restrictive and thus highlight a serious drawback of PPCA. In contrast, BPCA appears to be model robust and always has a very clear peak where the maximum occurs, but BPCA appears too sensitive to sample size. We observed (but do not show this here) that the selected dimension decreases from d − 1 to 1 as the sample size increases from 100 to 120. In summary, the results show that our method produces good results where neither of the PCA-based methods is applicable, and so it complements these methods.
7 Conclusion

A criterion for choosing the dimension in dimension-reduction problems has been proposed. The criterion is essentially based on an approximation of bias-adjusted skewness and kurtosis for PC scores. Applications to real data sets reveal that the proposed criterion returns a higher dimension than p95% based on PC alone. This is to be expected, since nongaussianity cannot be captured by the variance of low-dimensional projections alone, which is what PCA aims for. Our proposed criterion includes the pursuit of nongaussianity as a vital part, and hence it should yield higher dimensions. We have shown in simulation studies that the criterion performs well for simulated structures and performs better than existing PCA-based methods for nongaussian data. The ICA part of the criterion seems to make its performance quite similar to that of testing gaussianity based on skewness and kurtosis, and in fact this has been observed in our simulation studies. Therefore, it is desirable to consider simulations of structures that are used in the literature as alternative hypotheses for testing gaussianity. As mentioned in section 6.1.5, more accurate selection of the dimension would be possible provided we could differentiate between compact and noncompact support of the underlying structure. However, we reserve this problem for future work.
Appendix A: Proof of Theoretical Results

The following lemma is useful for proving theorem 1.

Lemma 1.
(1) For b ≥ 0, we have

$$\int_0^\infty \bar{G}_m[(b+1)t^2]\, dt = \frac{E[\chi_m]}{\sqrt{b+1}} = \sqrt{\frac{2}{b+1}}\, \frac{\Gamma((m+1)/2)}{\Gamma(m/2)}.$$

(2) Fix integers i and m and put k = 2i + m. For a, b ≥ 0, the tail probability Ḡ_k satisfies

$$\bar{G}_k[(b+1)a^2]\, \frac{(2b)^i\, \Gamma(k/2)}{(b+1)^{k/2}\, \Gamma(m/2)} = \int_{a^2}^\infty e^{-bt/2}\, (bt)^i\, g_m(t)\, dt.$$

(3) Fix integers i and m and put ℓ = 2i + 1 + m. For a, b ≥ 0, the tail probability Ḡ_ℓ satisfies

$$\bar{G}_\ell[(b+1)a^2]\, \frac{b^{(2i+1)/2}\, \Gamma(\ell/2)}{(b+1)^{\ell/2}\, \Gamma(m/2)} = \int_{a^2}^\infty e^{-bt/2} \Big(\frac{bt}{2}\Big)^{(2i+1)/2} g_m(t)\, dt.$$
Proof of Lemma 1.
1. With Ḡ_m(t²) = P(χ²_m ≥ t²) = P(χ_m ≥ t) for t ≥ 0, using the change of variables u = t², we find that

$$\int_0^\infty \bar{G}_m(t^2)\, dt = \int_0^\infty \bar{G}_m(u)\, \frac{du}{2\sqrt{u}} = \bar{G}_m(u)\sqrt{u}\,\Big|_0^\infty + \int_0^\infty \sqrt{u}\, g_m(u)\, du = E[\chi_m],$$

since the first term of the partial integration equals 0. From this equation, applying the change of variable u = (b+1)t², we get

$$\int_0^\infty \bar{G}_m[(b+1)t^2]\, dt = \int_0^\infty \bar{G}_m(u)\, \frac{du}{2\sqrt{u(b+1)}} = \frac{1}{\sqrt{b+1}}\, E[\chi_m] = \sqrt{\frac{2}{b+1}}\, \frac{\Gamma((m+1)/2)}{\Gamma(m/2)}.$$

2. Fix i and m. For a, b ≥ 0, let K denote the integral
$$K \equiv \int_{a^2}^\infty e^{-bt/2}\, (bt)^i\, g_m(t)\, dt = \int_{a^2}^\infty e^{-(b+1)t/2}\, b^i\, \frac{t^{(2i+m)/2-1}}{2^{m/2}\, \Gamma(m/2)}\, dt = \int_{a^2}^\infty e^{-(b+1)t/2}\, b^i\, t^{(2i+m)/2-1}\, \frac{2^i\, \Gamma((2i+m)/2)}{2^{(2i+m)/2}\, \Gamma((2i+m)/2)\, \Gamma(m/2)}\, dt.$$

Put k = 2i + m. The change of variable r = (b+1)t and the definition of the tail probability Ḡ_k lead to

$$K = \frac{(2b)^i\, \Gamma(k/2)}{\Gamma(m/2)} \int_{a^2}^\infty \frac{e^{-(b+1)t/2}\, t^{k/2-1}}{2^{k/2}\, \Gamma(k/2)}\, dt = \frac{(2b)^i\, \Gamma(k/2)}{\Gamma(m/2)} \int_{(b+1)a^2}^\infty \frac{e^{-r/2}\, r^{k/2-1}}{2^{k/2}\, \Gamma(k/2)\, (b+1)^{k/2}}\, dr = \frac{(2b)^i\, \Gamma(k/2)}{(b+1)^{k/2}\, \Gamma(m/2)}\, \bar{G}_k[(b+1)a^2].$$
3. Fix i and m. For a, b ≥ 0, let L denote the integral

$$L \equiv \int_{a^2}^\infty e^{-bt/2} \Big(\frac{bt}{2}\Big)^{(2i+1)/2} g_m(t)\, dt = \int_{a^2}^\infty e^{-(b+1)t/2}\, \frac{b^{(2i+1)/2}\, t^{(2i+1+m)/2-1}}{2^{(2i+1+m)/2}\, \Gamma(m/2)}\, dt = \int_{a^2}^\infty e^{-(b+1)t/2}\, b^{(2i+1)/2}\, t^{(2i+1+m)/2-1}\, \frac{\Gamma((2i+1+m)/2)}{2^{(2i+1+m)/2}\, \Gamma((2i+1+m)/2)\, \Gamma(m/2)}\, dt.$$

With ℓ = 2i + 1 + m and the change of variable r = (b+1)t, we get

$$L = \frac{b^{(2i+1)/2}\, \Gamma(\ell/2)}{\Gamma(m/2)} \int_{a^2}^\infty \frac{e^{-(b+1)t/2}\, t^{\ell/2-1}}{2^{\ell/2}\, \Gamma(\ell/2)}\, dt = \frac{b^{(2i+1)/2}\, \Gamma(\ell/2)}{\Gamma(m/2)} \int_{(b+1)a^2}^\infty \frac{e^{-r/2}\, r^{\ell/2-1}}{2^{\ell/2}\, \Gamma(\ell/2)\, (b+1)^{\ell/2}}\, dr = \frac{b^{(2i+1)/2}\, \Gamma(\ell/2)}{(b+1)^{\ell/2}\, \Gamma(m/2)}\, \bar{G}_\ell[(b+1)a^2].$$
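As a sanity check, part 1 of the lemma is easy to verify numerically (a sketch assuming a recent Matlab with integral; Ḡ_m is the χ²_m survival function, computed via the regularized upper incomplete gamma function):

```matlab
% Numerical check of lemma 1, part 1, for arbitrary m and b.
m = 5; b = 2.3;
Gbar = @(t) gammainc((b + 1) * t.^2 / 2, m / 2, 'upper');  % Gbar_m[(b+1)t^2]
lhs = integral(Gbar, 0, Inf);
rhs = sqrt(2 / (b + 1)) * gamma((m + 1) / 2) / gamma(m / 2);
fprintf('lhs = %.6f, rhs = %.6f\n', lhs, rhs);             % the two values agree
```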
Proof of Theorem 1. From theorems 3.1 and 3.3 of Kuriki and Takemura (2001), we have

$$Q_{k,p}^L(a) \le P(T_k(p) \ge a) \le Q_{k,p}^U(a)$$

for a > 0, where

$$Q_{k,p}^L(a) = \sum_{\substack{e=0 \\ e:\mathrm{even}}}^{p-1} \omega_{p-e,k}\, Q_{p-e,\rho-p+e}(a^2, \tan^2\theta_k),$$

$$Q_{k,p}^U(a) = Q_{k,p}^L(a) + \Big(1 - \frac{\mathrm{Vol}(M_{\theta_k})}{\mathrm{Vol}(S^{\rho-1})}\Big)\, \bar{G}_\rho\big(a^2(1 + \tan^2\theta_k)\big),$$

$$Q_{m,n}(a, b) = \int_a^\infty g_m(\xi)\, \{1 - \bar{G}_n(b\xi)\}\, d\xi,$$

Vol(M_{θ_k}) is the volume of the tube M_{θ_k}, which is fully described in Kuriki and Takemura (2001), and Vol(S^{ρ−1}) is the total volume of S^{ρ−1}. Using the expressions for Q_{k,p}^U and Q_{k,p}^L, it follows that

$$LB_k(p) = \sum_{\substack{e=0 \\ e:\mathrm{even}}}^{p-1} \omega_{p-e,k} \int_0^\infty Q_{p-e,\rho-p+e}(a^2, b)\, da \le E[T_k(p)] \le \sum_{\substack{e=0 \\ e:\mathrm{even}}}^{p-1} \omega_{p-e,k} \int_0^\infty Q_{p-e,\rho-p+e}(a^2, b)\, da + \Big(1 - \frac{\mathrm{Vol}(M_{\theta_c})}{\mathrm{Vol}(S^{\rho-1})}\Big) \int_0^\infty \bar{G}_\rho\big((1 + \tan^2\theta_c)a^2\big)\, da = UB_k(p).$$
The ratio Vol(M_{θ_c})/Vol(S^{ρ−1}) can be expressed in terms of (θ_k, p) as in lemmas 3.3 and 3.4 of Kuriki and Takemura (2001). Direct but tedious calculations of the integral Q_{m,n}(a², b), by means of formulas 18.18 and 18.19 in Johnson, Kotz, and Balakrishnan (1994) and the previous lemma, yield the expressions tabulated in the theorem.

Proof of Theorem 2. First, we note that

$$\mathrm{MSE}\big[\widehat{\Lambda}_{m,n}(b|T)\big] = E\big[\widehat{\Lambda}_{m,n}(b|T) - \Lambda_{m,n}(b)\big]^2 = \mathrm{Bias}^2\big[\widehat{\Lambda}_{m,n}(b|T)\big] + \mathrm{Var}\big[\widehat{\Lambda}_{m,n}(b|T)\big] = \big[\Lambda_{m,n}(b|T) - \Lambda_{m,n}(b)\big]^2 + \mathrm{Var}\big[\widehat{\Lambda}_{m,n}(b|T)\big].$$
∞
|m,n (b) − m,n (b|T)| =
∞
∞
gm (ξ )G n (bξ )dξ da
∞
a2
T
=
∞
a2
T
≤
gm (ξ )dξ da
¯ m (a 2 )da G
T
2 2 i 1 ∞ a a exp − da = i! T 2 2 (m/2)−1
i=0
((i + 1)/2) ¯ i+1 (T 2 ) G √ i+1 2 i! i=0
(m/2)−1
=
((i + 1)/2) √ i+1 2 i! i=0
(m/2)−1
¯ m/2 (T 2 ) ≤G
= O(exp(−T/2)T (m/2)−1 ). The proof for odd m is similar using equation 18.19 in Johnson et al. (1994). The evaluation of the variance is trivial. Appendix B: Table of Upper Bounds The table here (see Table 3) contains the upper bounds U B k ( p) for dimensions p = 2, . . . , 50 and k = 3, 4 as in equation (7) of section 3.5. The kurtosis
Table 3: Skewness and Kurtosis Upper Bounds.

Dimension  UB-skewness  UB-kurtosis    Dimension  UB-skewness  UB-kurtosis
2          1.5993       1.7536         27         45.6915      128.2290
3          2.9790       2.8356         28         48.1634      137.4000
4          3.3430       4.5500         29         50.6783      146.8870
5          4.4404       6.4576         30         53.2353      156.6900
6          5.6317       8.6776         31         55.8339      166.8090
7          6.9225       11.2116        32         58.4734      177.2450
8          8.2636       14.0606        33         61.1532      187.9970
9          9.6954       17.2250        34         63.8726      199.0650
10         11.1995      20.7051        35         66.6312      210.4490
11         12.7727      24.5010        36         69.4283      222.1490
12         14.4123      28.6129        37         72.2634      234.1660
13         16.1157      33.0409        38         75.1380      246.4990
14         17.8806      37.7849        39         78.0458      259.1480
15         19.7050      42.8451        40         80.9920      272.1130
16         21.5870      48.2214        41         83.9745      285.3950
17         23.5251      53.9138        42         86.9630      298.9930
18         25.5157      59.9224        43         90.0460      312.9070
19         27.5629      66.2473        44         93.1343      327.1370
20         29.6600      72.8883        45         96.2571      341.6840
21         31.8075      79.8455        46         99.7390      356.5460
22         34.0042      87.1189        47         102.6050     371.7250
23         36.2491      94.7085        48         105.8290     387.2200
24         38.5413      102.6140       49         109.0860     403.0320
25         40.8796      110.8360       50         112.3760     419.1600
26         43.2633      119.3750
The kurtosis values increase much faster than the skewness bounds, and both increase at a faster-than-linear rate.

Acknowledgments

We gratefully acknowledge the reviewers' suggestions to include comparisons with probabilistic dimension selection methods.

References

Amari, S.-I. (2002). Independent component analysis (ICA) and method of estimating functions. IEICE Trans. Fundamentals, E85-A(3), 540–547.
Anderson, T. W. (1985). Introduction to multivariate statistical analysis (2nd ed.). New York: Wiley.
Bach, F. R., & Jordan, M. I. (2002). Kernel independent component analysis. Journal of Machine Learning Research, 3, 1–48.
Beckmann, C. F., & Smith, S. M. (2004). Probabilistic independent component analysis for functional magnetic resonance imaging. IEEE Trans. Med. Imaging, 23(2), 137–152.
Bishop, C. M. (1999). Bayesian PCA. In M. Kearns, S. Solla, & D. Cohn (Eds.), Advances in neural information processing systems, 11 (pp. 382–388). Cambridge, MA: MIT Press.
Blake, C., & Merz, C. (1998). UCI repository of machine learning databases. Available online at http://www.ics.uci.edu/~mlearn/MLRepository.html and http://www.kernel-machines.com/.
Cardoso, J.-F. (1999). High-order contrasts for independent component analysis. Neural Computation, 11(1), 157–192.
Davies, P. I., & Higham, N. J. (2000). Numerically stable generation of correlation matrices and their factors. BIT, 40, 640–651.
Flury, B., & Riedwyl, H. (1988). Multivariate statistics: A practical approach. Cambridge: Cambridge University Press.
Friedman, J. H. (1987). Exploratory projection pursuit. Journal of the American Statistical Association, 82, 249–266.
Gilmour, S., & Koch, I. (2004). Understanding Australian illicit drug markets: Unsupervised learning with independent component analysis. In Proceedings of the 2004 Intelligent Sensors, Sensor Networks and Information Processing Conference, ISSNIP (pp. 271–276). Melbourne, Australia: ARC Research Network on Sensor Networks.
Hastie, T., & Tibshirani, R. (2002). Independent component analysis through product density estimation. In S. Thrun, L. K. Saul, & B. Schölkopf (Eds.), Advances in neural information processing systems, 16 (pp. 649–656). Cambridge, MA: MIT Press.
Huber, P. J. (1985). Projection pursuit (with discussion). Annals of Statistics, 13, 435–475.
Hyvärinen, A., Karhunen, J., & Oja, E. (2001). Independent component analysis. New York: Wiley.
Hyvärinen, A., & Oja, E. (2005). FastICA2.4. Available online at http://www.cis.hut.fi/projects/ica/fastica/code/dlcode.html.
Johnson, N. L., Kotz, S., & Balakrishnan, N. (1994). Continuous univariate distributions (2nd ed.). New York: Wiley.
Jones, M. C., & Sibson, R. (1987). What is projection pursuit? Journal of the Royal Statistical Society, Series A, 150, 1–36.
Koch, I., Marron, J. S., & Chen, J. (2005). Independent component analysis and simulation of nongaussian populations of kidneys. Manuscript submitted for publication.
Kuriki, S., & Takemura, A. (2001). Tail probabilities of the maxima of multilinear forms and their applications. Annals of Statistics, 29, 328–371.
Machado, S. G. (1983). Two statistics for testing for multivariate normality. Biometrika, 70, 713–718.
Malkovich, J. F., & Afifi, A. A. (1973). On tests for multivariate normality. Journal of the American Statistical Association, 68, 176–179.
Mardia, K. V., Kent, J., & Bibby, J. (1979). Multivariate analysis. Orlando, FL: Academic Press.
Marron, J. S. (2004). pcaSM.m. Available online at http://www.stat.unc.edu/postscript/papers/marron/Matlab7Software/General/.
Minka, T. P. (2000). Automatic choice of dimensionality for PCA. In S. A. Solla, T. K. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems, 12 (pp. 598–604). Cambridge, MA: MIT Press.
Penny, W. D., Roberts, S. J., & Everson, R. M. (2001). ICA: Model order selection and dynamic source models. In S. J. Roberts & R. M. Everson (Eds.), Independent component analysis: Principles and practice (pp. 299–314). Cambridge: Cambridge University Press.
Prasad, M., Arcot, S., & Koch, I. (2005). Deriving significant features in high dimensional data with independent component analysis. Manuscript submitted for publication.
Sun, J. (1991). Significance levels in exploratory projection pursuit. Biometrika, 78, 759–769.
Tipping, M. E., & Bishop, C. M. (1999). Probabilistic principal component analysis. Journal of the Royal Statistical Society, Series B, 61, 611–622.
Received November 21, 2005; accepted June 16, 2006.
LETTER
Communicated by Te-Won Lee
One-Bit-Matching Theorem for ICA, Convex-Concave Programming on Polyhedral Set, and Distribution Approximation for Combinatorics Lei Xu [email protected] Department of Computer Science and Engineering, Chinese University of Hong Kong, Shatin, NT, Hong Kong, P. R. C.
According to the proof by Liu, Chiu, and Xu (2004) of the so-called one-bit-matching conjecture (Xu, Cheung, & Amari, 1998a), all the sources can be separated as long as there is a one-to-one same-sign correspondence between the kurtosis signs of all source probability density functions (pdf's) and the kurtosis signs of all model pdf's, which is widely believed and implicitly supported by many empirical studies. However, this proof is made only in a weak sense: the conjecture is true when the global optimal solution of an independent component analysis criterion is reached. Thus, it cannot support the successes of many existing iterative algorithms that usually converge at one of the local optimal solutions. This article presents a new mathematical proof in a strong sense: the conjecture is also true when any one of the local optimal solutions is reached, obtained with the help of investigating convex-concave programming on a polyhedral set. Theorems are also provided not only on partial separation of sources when there is a partial matching between the kurtosis signs, but also on an interesting duality of maximization and minimization on source separation. Moreover, corollaries are obtained from this duality, with supergaussian sources separated by maximization and subgaussian sources separated by minimization. Also, a corollary is obtained to confirm the symmetric orthogonalization implementation of the kurtosis extreme approach for separating multiple sources in parallel, which works empirically but has lacked a mathematical proof. Furthermore, a linkage has been set up to combinatorial optimization from a distribution approximation perspective and a Stiefel manifold perspective, with algorithms that guarantee convergence as well as satisfaction of constraints.

1 Introduction

Independent component analysis (ICA) aims at blindly separating the independent sources s from an unknown linear mixture x = As via y = Wx. Tong, Inouye, and Liu (1993) showed that y recovers s up to constant scales and a permutation of components when the components of y become

Neural Computation 19, 546–569 (2007)
© 2007 Massachusetts Institute of Technology
component-wise independent and at most one of them is gaussian. The problem was formalized by Comon (1994) as ICA. Although ICA has been studied from different perspectives, such as minimum mutual information (MMI) (Bell & Sejnowski, 1995; Amari, Cichocki, & Yang, 1996) and maximum likelihood (ML) (Cardoso, 1998), in the case that W is invertible, all such approaches are equivalent to minimizing the following cost function:

$$D(W) = \int p(y; W)\, \ln \frac{p(y; W)}{\prod_{i=1}^{n} q(y_i)}\, dy, \qquad (1.1)$$
where q(y_i) is the predetermined model probability density function (pdf) and p(y; W) is the distribution of y = Wx. With each model pdf q(y_i) prefixed, however, this approach works only when the components of y are either all subgaussian (Amari et al., 1996) or all supergaussian (Bell & Sejnowski, 1995). To solve this problem, it is suggested that each model pdf q(y_i) be a flexibly adjustable density that is learned together with W, with the help of either a mixture of sigmoid functions that learns the cumulative distribution function (cdf) of each source (Xu, Yang, & Amari, 1996; Xu, Cheung, Yang, & Amari, 1997) or a mixture of parametric pdf's (Xu, 1997; Xu, Cheung, & Amari, 1998b). A learned parametric mixture-based ICA (LPMICA) algorithm is derived, with successful results on sources that can be either subgaussian or supergaussian, as well as any combination of both types. The mixture model was also adopted in the ICA algorithm by Pearlmutter and Parra (1996), although it did not explicitly target separating the mixed sub- and supergaussian sources.

It has also been found that a rough estimate of each source pdf or cdf may be enough for source separation. For instance, a simple sigmoid function such as tanh(x) seems to work well on supergaussian sources (Bell & Sejnowski, 1995), and a mixture of only two or three gaussians may be enough (Xu et al., 1998b) for mixed sub- and supergaussian sources. This leads to the so-called one-bit-matching conjecture (Xu et al., 1998a), which states that "all the sources can be separated as long as there is a one-to-one same-sign correspondence between the kurtosis signs of all source pdf's and the kurtosis signs of all model pdf's." This conjecture has been implicitly supported by several other ICA studies (Girolami, 1998; Everson & Roberts, 1999; Lee, Girolami, & Sejnowski, 1999; Welling & Weber, 2001). Cheung and Xu (2000) gave a mathematical analysis for the case involving only two subgaussian sources. Amari, Chen, and Cichocki (1997) also studied the stability of an ICA algorithm at the correct separation points via its relation to the nonlinearity φ(y_i) = d ln q_i(y_i)/dy_i, but without touching the circumstances under which the sources can be separated. Recently, the conjecture on multiple sources was proved mathematically in a weak sense (Liu et al., 2004). When only skewness and kurtosis of
sources are considered, with Es = 0 and Ess^T = I, and the model pdf's skewness is designed to be zero, the problem min_W D(W) of equation 1.1 is simplified to

$$\max_{RR^T = I} J(R), \qquad J(R) = \sum_{i=1}^{n} \sum_{j=1}^{n} r_{ij}^4\, \nu_j^s\, k_i^m, \qquad n \ge 2, \qquad (1.2)$$
where R = (r_{ij})_{n×n} = WA is an orthonormal matrix, ν_j^s is the kurtosis of the source s_j, and k_i^m is a constant with the same sign as the kurtosis ν_i^m of the model q(y_i).¹ It is further proved that the global maximum of equation 1.2 can be reached only by setting R to a permutation matrix up to a certain sign indeterminacy. However, this proof still cannot support the successes of many existing iterative ICA algorithms that typically implement gradient-based local search and thus usually converge to one of the local optimal solutions.

In the next section of this article, all the local maxima of equation 1.2 are investigated using special convex-concave programming on a polyhedral set, from which we prove the one-bit-matching conjecture in a strong sense: it is true when any one of the local maxima of equation 1.2 is reached. Theorems are also provided on separation of sources when there is a partial matching between the kurtosis signs and on an interesting duality of maximization and minimization. Moreover, corollaries are obtained stating that the duality makes it possible to separate supergaussian sources by maximization and subgaussian sources by minimization. Another corollary also confirms the symmetric orthogonalization implementation of the kurtosis extreme approach for separating multiple sources in parallel, which works empirically but previously lacked a mathematical proof (Hyvärinen, Karhunen, & Oja, 2001).

In section 3, we discuss that equation 1.2, with R being a permutation matrix up to certain sign indeterminacy, becomes equivalent to a special example of the following combinatorial optimization:

$$\min_V E_o(V), \qquad V = \{v_{ij},\ i = 1, \ldots, N,\ j = 1, \ldots, M\}, \qquad \text{subject to}$$

$$C_r:\ \sum_{j=1}^{M} v_{ij} = 1,\ i = 1, \ldots, N; \qquad C_c:\ \sum_{i=1}^{N} v_{ij} = 1,\ j = 1, \ldots, M; \qquad C_b:\ v_{ij} \in \{0, 1\}. \qquad (1.3)$$
¹ For the specific expression of k_i^m and the proof that k_i^m is a constant with the same sign as the kurtosis ν_i^m of the model q(y_i), refer to theorem 1 of Liu et al. (2004).
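For N = M, the constraints C_r, C_c, and C_b confine V to the permutation matrices, which is easy to check directly (a minimal Matlab illustration under that assumption):

```matlab
% A random permutation matrix satisfies all three constraint sets.
V = eye(4); V = V(randperm(4), :);
ok = all(V(:) == 0 | V(:) == 1) ...   % Cb: binary entries
   && all(sum(V, 2) == 1) ...         % Cr: each row sums to 1
   && all(sum(V, 1) == 1);            % Cc: each column sums to 1
```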
Figure 1: Convex set and polyhedral set.
This connection suggests investigating combinatorial optimization not only from a distribution approximation perspective, finding a simple distribution to approximate the Gibbs distribution induced from E_o(V), but also from a perspective of gradient flow searching within the Stiefel manifold, with algorithms that guarantee convergence as well as constraint satisfaction.

2 One-Bit-Matching Theorem and Extension

2.1 An Introduction to Convex Programming. To facilitate the mathematical analysis, we briefly introduce convex programming. A set S in R^n is said to be convex if for x₁ ∈ S and x₂ ∈ S we have λx₁ + (1 − λ)x₂ ∈ S for any 0 ≤ λ ≤ 1. Shown in Figure 1 are examples of convex sets. As an important special case of convex sets, a set in R^n is called a polyhedral set if it is the intersection of a finite number of closed half-spaces, that is, S = {x : a_i^T x ≤ α_i, for i = 1, ..., m}, where a_i is a nonzero vector and α_i is a scalar for i = 1, ..., m. The second and third images in Figure 1 are two examples. Let S be a nonempty convex set. A vector x ∈ S is called an extreme point of S if x = λx₁ + (1 − λ)x₂ with x₁ ∈ S, x₂ ∈ S, and 0 < λ < 1 implies that x = x₁ = x₂. We denote the set of extreme points by E, illustrated in Figure 1.

Let f : S → R, where S is a nonempty convex set in R^n. As shown in Figure 2, the function f is said to be convex on S if

$$f(\lambda x_1 + (1 - \lambda)x_2) \le \lambda f(x_1) + (1 - \lambda) f(x_2) \qquad (2.1)$$
for x₁ ∈ S, x₂ ∈ S, and for 0 < λ < 1. The function f is called strictly convex on S if the above inequality holds strictly for each distinct pair x₁ ∈ S, x₂ ∈ S and for 0 < λ < 1. The function f is called concave (strictly concave) on S if −f is convex (strictly convex) on S.

Figure 2: Convex and concave function.

Consider an optimization problem min_{x∈S} f(x). If x̄ ∈ S and f(x) ≥ f(x̄) for each x ∈ S, then x̄ is called a global optimal solution. If x̄ ∈ S and there exists an ε-neighborhood N_ε(x̄) around x̄ such that f(x) ≥ f(x̄) for each x ∈ S ∩ N_ε(x̄), then x̄ is called a local optimal solution. Similarly, if x̄ ∈ S and f(x) > f(x̄) for all x ∈ S ∩ N_ε(x̄), x ≠ x̄, for some ε, then x̄ is called a strict local optimal solution. An optimization problem min_{x∈S} f(x) is called a convex programming problem if f is a convex function and S is a convex set.

Lemma 1.
a. Let S be a nonempty open convex set in R^n, and let f : S → R be twice differentiable on S. If its Hessian matrix is positive definite at each point in S, then f is strictly convex.
b. Let S be a nonempty convex set in R^n, and let f : S → R be convex on S. Consider the problem min_{x∈S} f(x), and suppose that x̄ is a local optimal solution. Then (i) x̄ is a global optimal solution; (ii) if either x̄ is a strict local minimum or f is strictly convex, then x̄ is the unique global optimal solution.
c. Let S be a nonempty compact polyhedral set in R^n, and let f : S → R be a strictly convex function on S. Consider the problem max_{x∈S} f(x). All the local maxima are reached at extreme points of S.

Statements a and b are common knowledge, and statement c is not difficult to understand. As illustrated in Figure 2, assume x̄ is a local maximum but not an extreme point. We may find x₁ ∈ N_ε(x̄), x₂ ∈ N_ε(x̄) such that x̄ = λx₁ + (1 − λ)x₂ for some 0 < λ < 1. It follows from equation 2.1 that f(x̄) < λf(x₁) + (1 − λ)f(x₂) ≤ max[f(x₁), f(x₂)], which contradicts x̄ being a local maximum. At an extreme point x of S, in contrast, x = λx₁ + (1 − λ)x₂ with x₁ ∈ S, x₂ ∈ S, and 0 < λ < 1 implies that x = x₁ = x₂, which does not contradict the definition of a strictly convex function given after equation 2.1. That is, a local maximum can be reached only at one of the extreme points of S. (For details about convex programming, see, e.g., Bazaraa, Sherali, & Shetty, 1993.)

2.2 One-Bit-Matching Theorem. This section aims at showing that every local maximum of J(R) on RR^T = I by equation 1.2 is reached at R,
which is a permutation matrix up to sign indeterminacy at its nonzero elements, as long as there is a one-to-one same-sign correspondence between the kurtosis of all source pdf's and the kurtosis of all model pdf's. The proving line is sketched in two key steps. First, we divide the set C of constraint equations RR^T = I into two nonoverlapping sets C_n ∪ C_o. That is, for R^T = [r₁, r₂, ..., r_n] with r_j = [r_{1j}, r_{2j}, ..., r_{nj}]^T, we have

$$C_n:\ \sum_{i=1}^{n} r_{ij}^2 = 1,\ j = 1, \ldots, n, \quad \text{for normalization},$$

$$C_o:\ r_i^T r_j = 0,\ i = 1, \ldots, n-1,\ j = i+1, \ldots, n, \quad \text{for orthogonalization}. \qquad (2.2)$$
We find all the local maxima of max s.t. C_n J(R) by proving lemma 2. Second, we consider how these local maxima are affected by adding the constraint set C_o. Local maxima of max s.t. C_n J(R) that do not satisfy C_o will be discarded. We find that adding C_o does not bring any extra local maximum. As a result, we reach our aim in theorem 1.

We let p_{ij} = r_{ij}² and turn max s.t. C_n J(R) into the following problem:

$$\max_{P \in S} J(P), \qquad J(P) = \sum_{i=1}^{n} \sum_{j=1}^{n} p_{ij}^2\, \nu_j^s\, k_i^m, \qquad P = (p_{ij})_{n \times n}, \qquad n \ge 2,$$

$$S = \Big\{ p_{ij},\ i, j = 1, \ldots, n : \sum_{j=1}^{n} p_{ij} = 1\ \text{for}\ i = 1, \ldots, n,\ \text{and every}\ p_{ij} \ge 0 \Big\}, \qquad (2.3)$$

where ν_j^s and k_i^m are the same as in equation 1.2, and S becomes a convex set, or more precisely a polyhedral set. From every local maximum solution P = [p_{ij}] of equation 2.3, we can get a subset of local maxima R = [r_{ij}] of max s.t. C_n J(R) with r_{ij} = 0 for p_{ij} = 0 and r_{ij} = ±1 for p_{ij} = 1. We stack P into a vector vec[P] of n² elements and compute the Hessian H_P with respect to vec[P], resulting in

H_P is an n² × n² diagonal matrix with each diagonal element being ν_j^s k_i^m.   (2.4)

Thus, whether J(P) is convex can be checked simply via the signs of the products ν_j^s k_i^m. We use 𝓔_{n×k} to denote a family of matrices, with each member E_{n×k} ∈ 𝓔_{n×k} being an n × k matrix in which every row consists of zero elements except that one and only one element is 1.
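The diagonal form in equation 2.4 can be seen in one line, since J(P) is separable in the entries of P (the factor 2 below is immaterial for the sign-based convexity argument):

$$\frac{\partial^2 J(P)}{\partial p_{ij}\, \partial p_{i'j'}} = 2\, \nu_j^s\, k_i^m\, \delta_{ii'}\, \delta_{jj'},$$

so H_P is diagonal, and its definiteness is read off directly from the signs of the ν_j^s k_i^m.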
Lemma 2. When either ν_i^s > 0, k_i^m > 0 for all i, or ν_i^s < 0, k_i^m < 0 for all i, every local maximum of J(P) is reached at a P ∈ 𝓔_{n×n}.

Proof. We have every ν_j^s k_i^m > 0, and thus it follows from equation 2.4 and lemma 1a that J(P) is strictly convex on the polyhedral set S. It further follows from lemma 1c that all the local maxima of J(P) are reached at the polyhedral set's extreme points, which satisfy ∑_{j=1}^n p_{ij} = 1 for i = 1, ..., n; that is, each local maximum P ∈ 𝓔_{n×n}.

Lemma 3.
a. Given p = [p₁, ..., p_n] ∈ P = {p : p_j ≥ 0, j = 1, ..., n, ∑_{j=1}^n p_j = 1}, a local maximum of J(p) = ∑_{j=1}^n p_j² β_j is reached at:
  i. p = [e_k : 0] for k ≥ 1 with β_j > 0, j = 1, ..., k, and β_j < 0, j = k+1, ..., n, where the row vector e_k ∈ 𝓔_{1×k} and 0 is a row vector consisting of n − k zeros.²
  ii. One p ∈ P that is not of the form [e_k : 0], when β_j < 0, j = 1, ..., n.³
b. For an unknown 0 < k < n with ν_i^s > 0, k_i^m > 0, i = 1, ..., k, and ν_i^s < 0, k_i^m < 0, i = k+1, ..., n, every local maximum of J(P) in equation 2.3 is reached at

$$P = \begin{pmatrix} P_1 & 0 \\ 0 & P_2 \end{pmatrix}, \qquad P_1 \in \mathcal{E}_{k \times k},\ P_2 \in \mathcal{E}_{(n-k) \times (n-k)}. \qquad (2.5)$$

Proof.
a. Let J(p) = J⁺(p⁺) + J⁻(p⁻) with p = [p⁺ : p⁻] and
$$J^+(\mathbf{p}^+) = \sum_{j=1}^{k} p_j^2\, \beta_j, \qquad J^-(\mathbf{p}^-) = \sum_{j=k+1}^{n} p_j^2\, \beta_j. \qquad (2.6)$$
J⁻(p⁻) ≤ 0 is strictly concave by lemma 1a because β_j < 0, and thus it has only one maximum, at p⁻ = 0. Therefore, all the local maxima of J(p) are reached at [p⁺ : 0] and are determined by all the local maxima of J⁺(p⁺) on the polyhedral set {∑_{j=1}^k p_j = 1, p_j ≥ 0, j = 1, ..., k}. It follows from lemma 1a that J⁺(p⁺) is strictly convex on this set since β_j > 0. It further follows from lemma 1c that all its local maxima are reached at its extreme points; that is, each local maximum is reached at p⁺ = e_k.

² In the rest of this letter, we use 0 to denote a matrix with all its elements zero, without explicitly indicating its dimension, including the cases of a row vector or a column vector.
³ To illustrate, we observe f(x, y) = −ax² − by², a > 0, b > 0, subject to x + y = 1, x ≥ 0, y ≥ 0. A local maximum of f(x, y) is reached at x = b/(a+b), y = a/(a+b).
One-Bit-Matching ICA Theorem and Distribution-Based Combinatorics
553
for β j < 0, j = 1, . . . , n, it follows from lemma 1a that J (p) =
nParticularly 2 its only maximum at p = 0 is not j=1 p j β j is strictly concave. However,
reachable because the constraint nj=1 p j = 1. Instead, the maximum of J (p) subject to this constraint is reached at one p ∈ P that is not in the form [e k : 0].
b. Notice that the constraint nj=1 pi j = 1 is imposed on only the ith row p(i) = [ pi1 , . . . , pin ], and J (P) in equation 2.3 is additive. The problem of finding the local maxima of J (P) s.t. P ∈ S is equivalent to separately finding the local maxima
J (p(i) ) =
n
pi2j ν sj kim , subject to
j=1
n
pi j = 1, pi j ≥ 0, j = 1, . . . , n.
j=1
(2.7) For each i ≤ k, we have βi j = ν sj kim > 0, a nd j = 1, . . . , k and βi j < 0, j = (i)
k + 1, . . . , n. Therefore, it follows from lemma 3a that p(i) = [e k : 0] with (i) e k ∈ E1×k , for i = 1, . . . , k. For each i > k, we have βi j < 0, j = 1, . . . , k (i) (i) and βi j > 0, j = k + 1, . . . , n. Similarly, we have p(i) = [0 : e n−k ] with e n−k ∈ E1×(n−k) , for i = k + 1, . . . , n. In summary, we get P given by equation 2.5. From every local maximum solution P = [ pi j ] in lemmas 2 and 3b, √ all the matrices R = [ri j ] with ri j = ± pi j are local maxima R = [ri j ] of max s.t. Cn J (R). In other words, all the local maxima of max s.t. Cn J (R) can be summarized as follows:
R = {R = [ri j ] : ri j =
0,
if pi j = 0,
±1
if pi j = 1.
.
(2.8)
We will return to consider J (R) on RRT = I by equation 1.2 by adding the orthogonal constraint Co in equation 2.2 to max s.t. Cn J (R). Local maxima in R that do not satisfy Co will be discarded. Also, we show that adding in Co will not bring any extra local maximum. Along this line, we can prove the following theorem: Theorem 1. Every local maximum of J (R) on RRT = I by equation 1.2 is reached at R, which is a permutation matrix up to sign indeterminacy at its nonzero elements, as long as there is a one-to-one same-sign correspondence between the kurtosis of all source pdf’s and the kurtosis of all model pdf’s.
554
L. Xu
Proof. We check every local maximum J (R) s.t. Cn , that is, every solution in R by equation 2.8, whether it satisfies the orthogonal constraint Co in equation 2.2. If not, it is discarded. If it does, it is put into the following solution set, P = {R : every R ∈ R that satisfies Co },
(2.9)
which is not empty and each P ∈ P is either a permutation matrix or a variant with one or more nonzero elements switched to a negative sign. We check whether every R∗ ∈ P is a local maximum of J (R) on RRT = I by equation 1.2. Since R∗ is a local maximum J (R) s.t. Cn , there is a small enough neighbor area NR∗ with R∗ ∈ NR∗ such that every R ∈ NR∗ , R = R∗ satisfies Cn and J (R ) < J (R∗ ). If R∗ also satisfies Co , R∗ must belong to the intersection NR∗ ∩ NCo , where NCo denotes the set of all the points that satisfy Co . It further follows from equation 2.2 that NR∗ ∩ NCo is also a neighbor area of R∗ , within which we have J (R) < J (R∗ ) for every R ∈ NR∗ ∩ NCo , R = R∗ . That is, each R∗ ∈ P by equation 2.9 is a local maximum of J (R) on RRT = I by equation 1.2. We show that adding Co to J (R) s.t. Cn will not create any extra local maximum. Without Co , it is linked by pi j = ri2j that the problem is equivalent to separately finding the local maxima of J (p(i) ) by equation 2.7 for i = 1, . . . , n, respectively. These individual optimizations may occur within the same n-dimensional space or across several different n-dimensional spaces that may have certain overlap. With Co added in, these individual optimizations are forced to be implemented in the n different n-dimensional spaces that are orthogonal to each other. The effect of this process is equivalent to forcing the valid local maxima of J (p(i) ) by equation 2.7 to be located on a more restricted polyhedral set. Without Co , it follows from lemma 3a (i) that the valid local maxima of J (p(i) ) by equation 2.7 should be located on a polyhedral set as follows:
i =
pi j = 0 for ν sj kim < 0, pi j ≥ 0 for ν sj kim > 0 and
pi j ≥0
pi j = 1 . (2.10)
With Co added, i becomes more restrictive, with one or more constraints of type pi j ≥ 0 being further restricted into those of type pi j = 0. Since there will be at least one P by equation 2.5 in an n × n nonsingular matrix, for every i there will be at least one j with its corresponding pi j remaining to be type pi j ≥ 0; that is, no situation stated by lemma 3a (ii) occurs. Thus, the valid local maxima of J (p(i) ) by equation 2.7 on a more restricted polyhedral set is merely a subset of those local maxima before imposing the extra restrictions. In other words, no extra local maximum will be created.
One-Bit-Matching ICA Theorem and Distribution-Based Combinatorics
555
Summarizing the above two aspects and noticing that kim has the same sign as the kurtosis νim of the model density q i (yi ), the theorem is proved. Equation 1.2 is obtained from equation 1.1 by considering only the skewness and kurtosis and with the model pdf’s without skewness. In such an approximative sense, all the sources can also be separated by a local searching ICA algorithm (e.g., a gradient-based algorithm) obtained from equation 1.1 as long as there is an one-to-one same-sign correspondence between the kurtosis of all source pdf’s and the kurtosis of all model pdf’s. Moreover, this approximation can be removed by an ICA algorithm obtained directly from equation 1.2. Under the one-to-one kurtosis sign-matching assumption, we can derive a local search algorithm that is equivalent to maximizing the problem by equation 1.2 directly. A prewhitening is made on observed samples such that we can consider the samples of x with E x = 0, E xx T = I . As a result, it follows from I = E xx T = AEss T AT and Ess T = I that AAT = I , that is, A is orthonormal. Thus, an orthonormal W is considered to let y = Wx become independent among its components by max J (W), J (W) =
WW T =I
n
y
kim νi ,
(2.11)
i=1
y
where νi = E yi4 − 3, i = 1, . . . , n, and kim , i = 1, . . . , n are prespecified constants with the same sign as the kurtosis νim . We can derive its gradient ∇W J (W) and then project it onto WW T = I , which results in an iterative updating algorithm for updating W in a way similar to equations 3.17 and 3.18 at the end of section 3.3. Such an ICA algorithm actually maximizes the problem by equation 1.2 directly by noticing y = Wx = W As = Rs, R = W A, RRT = I , and thus y
νi =
n
ri4j ν sj , i = 1, . . . , n.
(2.12)
j=1
That is, the problem by equation 2.11 is equivalent to the problem by equation 1.2. In other words, under the one-to-one kurtosis sign-matching assumption, it follows from theorem 1 that all the sources can be separated by an ICA algorithm in an exact sense as long as equation 2.12 holds. However, theorem 1 does not tell us how such a kurtosis sign matching is built, which is attempted via equation 1.1 through learning each model pdf q i (yi ) together with learning W (Xu et al., 1996; Xu et al., 1997; Xu et al., 1998b) as well as further advances given in Lee et al. (1999) and Welling and Weber (2001) or by equation 103 in Xu (2003). Still, it remains an open problem whether these efforts or the possibility of
556
L. Xu
developing other techniques can guarantee such a one-to-one kurtosis signmatching either surely or in some probabilistic sense, which deserves future investigations. 2.3 No Matching and Partial Matching. Next, we consider what happens when one-to-one kurtosis sign correspondence does not hold. We start at the extreme situation with the following lemma: Lemma 4 (no matching). When either νis > 0, kim < 0, ∀i or νis < 0, kim > 0, ∀i, J (P) has only one maximum that is not in En×n . Proof. From equation 2.4 and lemma 1a, J (P) is strictly concave since ν sj kim < 0 for every term. Thus, it follows from lemma 1b that it has only one maximum that is not at the extreme points of S. Lemma 5 (partial matching). Given two unknown integers k, m with 0 < k < m < n and provided that νis > 0 , kim > 0 , i = 1 , . . . , k, νis kim < 0 , i = k + 1 , . . . , m, and νis < 0 , kim < 0 , i = m + 1 , . . . , n, every local maximum of J (P) is reached at P by equation 2.5 with P1 ∈ Em×k , P2 ∈ E(n−m)×(n−k) , when νis < 0 , kim > 0 , i = k + 1 , . . . , m; P1 ∈ Ek×m , P2 ∈ E(n−k)×(n−m) , when νis > 0 , kim < 0 , i = k + 1 , . . . , m. (2.13)
Proof. When νis < 0, kim > 0, i = k + 1, . . . , m, for each i ≤ m, we have βi j = ν sj kim > 0, j = 1, . . . , k and βi j < 0, j = k + 1, . . . , n, and it follows (i)
(i)
from equation 2.7 and lemma 3a that p(i) = [e k : 0] with e k ∈ E1×k , for i = 1, . . . , m. For each i > m, we have βi j < 0, j = 1, . . . , k and βi j > (i) (i) 0, j = k + 1, . . . , n, and we have p(i) = [0 : e n−k ] with e n−k ∈ E1×(n−k) , for i = m + 1, . . . , n. Thus, we get P with P1 ∈ Em×k , P2 ∈ E(n−m)×(n−k) . The situation of νis > 0, kim < 0, i = k + 1, . . . , m is just a swap of the position (i, j). We can get P simply by matrix transposition. Theorem 2. Given two unknown integers k, m with 0 < k < m < n, and provided that νis > 0 , kim > 0 , i = 1 , . . . , k, νis kim < 0 , i = k + 1 , . . . , m, and νis < 0 , kim < 0 , i = m + 1 , . . . , n,every local maximum of J (R) on RRT = I 0
by equation 1.2 is reached at R = 0 R¯ subject to a 2 × 2 block permutation, where is a (k + n − m) × (k + n − m) permutation matrix up to sign inde¯ is an (m − k) × (m − k) orthonorterminacy at its nonzero elements, while R ¯R ¯ T = I , but usually not a permutation matrix up to sign mal matrix with R indeterminacy.
One-Bit-Matching ICA Theorem and Distribution-Based Combinatorics
557
Proof. Similar to the proof of theorem 1, we consider the effect of imposing the orthogonal constraint Co in equation 2.2 to R by equation 2.8 with P by equation 2.5 but P1 , P2 by equation 2.13. We give more details for the cases of νis < 0, kim > 0, i = 1, . . . , k, for each i ≤ m, while the cases of νis > 0, kim < 0, i = k + 1, . . . , m can be understood simply by matrix transposition. Although the proof is similar to that for the proof of theorem 1, there is a key difference. Considering the row rank, P1 ∈ Em×k is k < m, P2 ∈ E(n−m)×(n−m) is n − m, and thus P by equation 2.5 is k + n − m < n. Since a permutation matrix remains a permutation matrix after any permutation, without losing generality, we can say that the first k + n − m rows of R are at the rank k + n − m. Co will not only force the rest of the m − k columns of the k + n − m rows to become 0 but also force every pi j ≥ 0 f or ν sj kim > 0 in i by equation 2.10 to become type pi j = 0 in the rest of the m − k rows of R such that the first k + n − m columns of the m − k rows also become 0. To keep Cn in equation 2.2 satisfied, it has to be imposed on each row of the rest of the m − k columns with all the (m − k) × (m − k) elements featured by ν sj kim < 0. 0 As a result, we get R = 0 R¯ subject to a 2 × 2 block permutation. Being the same as that in theorem 2, is a (k + n − m) × (k + n − m) permutation ¯ we have R ¯R ¯ T = I from RRT = matrix up to sign indeterminacy, while for R, I . Moreover, it follows from lemma 3a (ii) that the local maxima of J (p(i) ) ¯ is usually not by equation 2.7 are reached at one p(i) not in E1×n . That is, R a permutation matrix up to sign indeterminacy. In other words, there will be k + n − m sources that can be successfully separated in help of a local searching ICA algorithm when there are k + n − m pairs of matching between the kurtosis signs of source and model pdf’s. However, the remaining m − k sources are not separable. Suppose that the kurtosis sign of each model is described by a binary random variable ξi with 1 for + and 0 for −, that is, p(ξi ) = 0.5ξi 0.51−ξi . When there are k sources with their kurtosis signs positive, there is a proban bility p( i=1 ξi = k) of having a one-to-one kurtosis sign correspondence even when model pdf’s are prefixed without knowing the kurtosis signs of sources. Moreover, even when a one-to-one kurtosis sign correspondence does not hold for all the sources, will still be n − | − k| sources re there n coverable with a probability p( i=1 ξi = ). This explains not only why early ICA studies (Amari et al., 1996; Bell & Sejnowski, 1995) work in some cases while failing in other cases due to the predetermined model pdf’s, but also why some existing heuristic ICA algorithms always work to some extent. 2.4 Maximum Kurtosis versus Minimum Kurtosis. Interestingly, it can be observed that changing the maximization in equations 1.2, 2.3, and 2.11, into the minimization will lead to similar results, which are summarized in lemma and theorem.
558
L. Xu
Lemma 6. a. When either νis > 0 , kim > 0 , ∀i or νis < 0 , kim < 0 , ∀i, J (P) has only one minimum that is not in En×n . b. When either νis > 0 , kim < 0 , ∀i or νis < 0, kim > 0 , ∀i, every local minimum of J (P) is reached at a P ∈ En×n . c. For an unknown 0 < k < n with νis < 0 , kim > 0 , i = 1 , . . . , k and νis > 0 , kim< 0 , i = k + 1 , . . . , n, every local minimum of J (P) is reached at P 0
P = 01 P2 with P1 ∈ Ek×k , P2 ∈ E(n−k)(n−k) . d. For two unknown integers k, m with 0 < k < m < n with νis > 0 , kim > 0 , i = 1 , . . . , k, νis kim < 0 , i = k + 1 , . . . , m, and νis < 0 , kim < 0, i = m + 0 P 1 , . . . , n, every local minimum of J (P) is reached at P = P2 01 with either P1 ∈ Ek×(n−m) , P2 ∈ E(n−k)×m when νis > 0 , kim < 0 , i = k + 1 , . . . , m or P1 ∈ Em×(n−k) , P2 ∈ E(n−m)×k when νis < 0 , kim > 0 , i = k + 1 , . . . , m.
Proof. The proof is similar to those in proving lemmas 2 to 5. The key difference is a shift in focus from the maximization of a convex function on a polyhedral set to the minimization of a concave function on a polyhedral set, with swaps between minimum and maximum, maxima and minima, convex and concave, and positive and negative, respectively. The key point is that lemma 1 remains correct after these swaps. Similar to theorem 2, from the above lemma we get: Theorem 3. a. When either νis kim < 0 , i = 1 , . . . , n or νis < 0 , kim > 0 , i = 1 , . . . , k and νis > 0 , kim < 0 , i = k + 1 , . . . , n for an unknown 0 < k < n, every local minimum of J (R) on RRT = I by equation 1.2 is reached at a permutation matrix R up to sign indeterminacy at its nonzero elements. b. For two unknown integers k, m with 0 < k < m < n with νis > 0 , kim > 0 , i = 1 , . . . , k, νis kim < 0 , i = k + 1 , . . . , m, and νis < 0 , kim < 0 , i = m + T 1 , . . . , n, local minimum of J (R) on RR = I by equation 1.2 is reached every 0
at R = 0 R¯ subject to a 2 × 2 block permutation. When m + k ≥ n, is an (n − m + n − k) × (n − m + n − k) permutation matrix up to sign inde¯ is an (m + k − n) × (m + k − n) terminacy at its nonzero elements, while R T ¯ ¯ orthonormal matrix with R R = I , but usually not a permutation matrix up to sign indeterminacy. When m + k < n, is a (k + m) × (k + m) permuta¯ is an tion matrix up to sign indeterminacy at its nonzero elements, while R ¯R ¯ T = I , but usually (n − k − m) × (n − k − m) orthonormal matrix with R not a permutation matrix up to sign indeterminacy.
In a comparison of theorems 2 and 3, when m + k ≥ n, comparing n − m + n − k with k + n − m, more sources can be separated by minimization
One-Bit-Matching ICA Theorem and Distribution-Based Combinatorics
559
than maximization if k < 0.5n and by maximization than minimization if k > 0.5n. When m + k < n, comparing k + m with k + n − m, more sources can be separated by minimization than maximization if m > 0.5n, and by maximization than minimization if m < 0.5n. We further consider a special case that kim = 1, ∀i. In this case, equation 1.2 is simplified into J (R) =
n n
ri4j ν sj , n ≥ 2.
(2.14)
i=1 j=1
From theorem 2 at n = m, we can easily obtain the following corollary: Corollary 1. For an unknown integer 0 < k < n with νis > 0 , i = 1 , . . . , k and T νis < 0 , i = k + 1 , . . . ,n, every local maximum of J (R) on RR = I by equation 0
2.14 is reached at R = 0 R¯ subject to a 2 × 2 block permutation, where is a k × k permutation matrix up to sign indeterminacy at its nonzero elements, while ¯ is an (n − k) × (n − k) orthonormal matrix with R ¯R ¯ T = I , but usually not a R permutation matrix up to sign indeterminacy. Similarly, from theorem 3 we also get:
Corollary 2. For an unknown integer k with 0 < k < n with νis > 0 , i = T 1 , . . . , k and νis < 0 , i = k + 1 , . . ., n, every local minimum of J (R) on RR = I ¯ 0 R
by equation 1.2 is reached at R = 0 subject to a 2 × 2 block permutation, where is an (n − k) × (n − k) permutation matrix up to sign indeterminacy at ¯ is a k × k orthonormal matrix with R ¯R ¯ T = I , but its nonzero elements, while R usually not a permutation matrix up to sign indeterminacy. According to corollary 1, k sources of supergaussian components are separated by max RRT =I J (R), while n − k sources of subgaussian components are separated by min RRT =I J (R) according to corollary 2. In implementation, from equation 2.11, we get J (W) =
n
y
νi .
(2.15)
i=1
Then, from maxWWT =I J (W) we get k sources of super gaussian components and from minWWT =I J (W) we get n − k sources of subgaussian components. Thus, instead of learning one-to-one kurtosis sign matching, the problem can be turned into one of selecting supergaussian components from y = Wx with W obtained via maxWWT =I J (W) and of selecting subgaussian components from y = Wx with W obtained via minWWT =I J (W). Though we know neither k nor which components of y should be selected, we can pick those
560
L. Xu
with positive signs as supergaussian ones after maxWWT =I J (W) and pick those with negative signs as subgaussian ones after minWWT =I J (W). The y reason comes from νi = nj=1 ri4j ν sj and the above corollaries. By corollary 1, the kurtosis of each supergaussian component of y is simply one of ν sj > 0, j = 1, . . . , k. Although the kurtosis of each of the other components in y is a weighted combination of ν sj < 0, j = k + 1, . . . , n, the kurtosis signs of these will all remain negative. Similarly, we can find out those subgaussian components according to corollary 2. Anther corollary can be obtained from equation 2.11 by considering a y special case that kim = sign[νi ], ∀i:
max J (W), J (W) =
WW T =I
n y ν . i
(2.16)
i=1
This leads to what is called a kurtosis extreme approach and extensions (Delfosse & Loubation, 1995; Moreau & Macchi, 1996; Hyvarinen et al., 2001), where studies began by extracting one source by a vector w and then extracting multiple sources by either sequentially implementing the onevector algorithm such that the newly extracted vector is orthogonal to previous ones or implementing the one-vector algorithm on all the vectors of W in parallel together with a symmetric orthogonalization at each iterative step, as suggested by (Hyvarinen et al. 2001). In the literature, the success of using one vector w to extract one source has been proved mathematically, and the proof can be carried easily to sequentially extracting a new source with its corresponding vector w being orthogonal to the subspace spanned by its previous ones. However, this mathematical proof is not applicable to the above symmetric orthogonalization-based implementation of the one-vector algorithm in parallel with all the vectors of W. Actually, what Hyvarinen et al., (2001) suggested can only ensure a convergence of a symmetric orthogonalization-based algorithm but cannot guarantee that this local searching-featured iterative algorithm will converge to a solution that can separate all the sources, though experiments have usually been successful.
y When νi = nj=1 ri4j ν sj holds, which is true only when the prewhitening can be made perfectly, it follows from equation 2.16 that
min J (R), J (R) =
RRT =I
n n
ri4j ν sj ,
(2.17)
i=1 j=1
which is covered by lemma 2 and theorem 2. Thus, we can directly prove the following corollary:
One-Bit-Matching ICA Theorem and Distribution-Based Combinatorics
561
y Corollary 3. As long as νi = nj=1 ri4j ν sj holds, every local minimum of the above J (R) on RRT = I is reached at a permutation matrix up to sign indeterminacy. Actually, it provides a mathematical proof on the success of the symmetric orthogonalization-based implementation of the one-vector algorithm in parallel on separating all the sources. 3 Combinatorial Optimization, Distribution Approximation, and Stiefel Manifold The combinatorial optimization problem by equation 1.3 has been encountered in various applications and still remains difficult to solve. Many efforts have also been made in the literature of neural networks since Hopfield and Tank (1985). As summarized in Xu (2003), these efforts can be roughly classified according to the features on dealing with Cecol , Cer ow , and Cb . Although almost all the neural network motivated approaches are parallel implementable, they share one unfavorable feature: these intuitive approaches have no theoretical guarantee on convergence to even a feasible solution. Interestingly, focusing on local maxima only, equations 1.2 and 2.3 can be regarded as special examples of the combinatorial optimization problem by equation 1.3 simply by regarding pi j or ri j as vi j . Although such a linkage is not useful for ICA, where we do not need to seek a global optimization, a link from equation 1.3 to 1.2 and even 1.1 leads to two interesting questions. Could we consider the combinatorial optimization by equation 1.3 from the perspective of approximating one distribution by another simple model distribution, as in equation 1.1? Can we replace the constraints Cecol , Cer ow , and Cb by the Stiefel manifold RRT = I for developing a more effective implementation? 3.1 A Distribution Approximation Perspective of Combinatorial Optimization. Finding a global minimization solution of E o (V) under a set of constraints is equivalent to finding a global peak of the following Gibbs distribution, e − β Eo (V) , Zβ 1
p(V, β) =
Zβ =
e − β Eo (V) , 1
(3.1)
V
subject to the constraints, since maxV p(V, β) is equivalent to maxV ln p(V, β) or minV E o (V). Usually this p(V, β) has many local maximums, so it is difficulty, to get the peak Vp . To avoid this difficulty, we use a simple distribution q (V) to approximate p(V, β) on a domain Dv such that the global peak of q (V) is easy to find and that p(V, β) and q (V) share the
562
L. Xu
same peak Vp ∈ Dv , where Dv is considered using the following support of p(V, β), Dε (β) = {V : p(V, β) > ε, a small consta nt ε > 0},
(3.2)
under the control of a parameter β. For a sequence β0 > β1 , . . . > βt , we have Dε (βt ) ⊂ . . . Dε (β1 ) ⊂ Dε (β0 ), which includes the global minimization solution of E o (V), since the equivalence of maxV p(V, β) to minV E o (V) is irrelevant to β. Therefore, we can find a sequence q 0 (V), q 1 (V), . . . , q t (V) that approximates p(V, β) on the shrinking domain Dε (β). For a large βt , p(V, β) has large support, and thus q (V) adapts the overall configuration of p(V, β) in the large domain Dε (β). As βt reduces, q t (V) concentrates more and more on adapting the detailed configuration of p(V, β) around the global peak solution Vp ∈ Dε . As long as β0 is large enough and β reduces slowly enough toward zero, we can find the global minimization solution of E o (V). We adopt the following for implementing such a distribution approximation: min K L(q , p), K L(q , p) = p
q (V) ln
V∈Dv
q (V) . p(V, β)
(3.3)
Alternatively, we can also consider min p K L(q , p), which leads us to a class of Metroplois sampling-based mean-field approaches. (For details, see section II(B) in Xu, 2003.) Here, we consider only equation 3.3 with q (V) in the following simple forms: q 1 (V) = Z1−1 q 2 (V) =
e vi j ln qi j ,
0 ≤ q i j ≤ ∞,
Z1 =
i, j v
q i ji j (1 − q i j )1−vi j ,
i, j
e vi j ln qi j ,
i, j
0 ≤ q i j ≤ 1,
(3.4)
i, j
and from the constraints in equation 1.3, we have
Cc :
N vi j = 1, j = 1, . . . , M, C r :
M vi j = 1, i = 1, . . . , N;
i=1
j=1
vi j =
Z q i j Zi1j
qi j ,
,
for q 1 (V), Zi j = e vkl ln qkl , for q 2 (V), k=i,l= j k,l
(3.5)
where x denotes the expectation of the random variable x. When N, M are large, we have Zi j ≈ Z1 , and thus vi j ≈ q i j for the case of q 1 (V).
One-Bit-Matching ICA Theorem and Distribution-Based Combinatorics
563
Putting equation 3.4 into equation 3.3 and after certain derivations as shown in Xu (2003), equation 3.3 becomes equivalent to
min E({q i j }), subject to equation 3.5, qi j
1 E({q i j }) = E o ({q i j }) + β
q i j ln q i j ,
ij
i j [q i j
for q 1 (V),
ln q i j + (1 − q i j ) ln (1 − q i j )].
for q 2 (V).
The case for q 1 (V) interprets the Lagrange transform approach with the barrier i, j vi j ln vi j in Xu (1994) and justifies the intuitive treatment of simply regarding the discrete vi j as an analog variable between the interval [0, 1]. From this perspective, these analog variables are the parameters of the simple distribution that we use to approximate the Gibbs distribution induced from the cost E o (V) of the discrete variables. Subsequently, a discrete solution will be recovered from these analog parameters of q i j . Similarly, the case for q 2 (V) interprets and justifies the Lagrange trans
form approach with the barrier i, j [vi j ln vi j + (1 − vi j ) ln (1 − vi j )] in Xu (1995), where this barrier is intuitively argued to be better than the barrier
v i, j i j ln vi j because it gives a U shape curve. Here, this intuitive preference can also be justified from equation 3.5 since there is an approximation Zi j ≈ Z1 used for q 1 (V) while transforming vi j into q i j , but no approximation for q 2 (V). Moreover, both barriers are the special cases S(vi j ) = vi j and S(vi j ) = vi j /(1 − vi j ) of a family of barrier functions equivalent to minimizing the leaking energy in the classical Hopfield network (Xu, 1995).
3.2 Lagrange-Enforcing Algorithms. A general iterative procedure was proposed in Xu (1994) and then refined in Xu (1995) for minimizing E({q i j }) by considering the following Lagrange barrier costs: N M 1 col E({q i j }) = E o ({q i j }) + λj qi j − 1 β j=1
+ B(q i j ) +
N
λri ow
i=1
B(q i j ) =
ij
q i j ln q i j ,
i j [q i j
i=1
M
q i j − 1 ,
j=1
q 1 (V),
ln q i j + (1 − q i j ) ln (1 − q i j )], q 2 (V).
(3.6)
564
L. Xu
Following the derivation made in Xu (1994, 1995), it follows from that
q iej
=
e
1 a i b j exp β1 1
1+a i b j exp
∂ E o ({q i j }) ∂q i j 1 ∂ E o ({q i j β ∂q i j
,
}) ,
for q 1 (V), for q 2 (V);
a i = exp (λri ow ) ,
∂ E({q i j }) ∂q i j
=0
b j = exp λcol j (3.7)
∂ E ({q })
∂ 2 E({q })
when o∂qi j i j is irrelevant to q i j and ∂ 2 qi ji j > 0. If other variables are fixed at their old values, E({q i j }) is minimized at q i j = q iej , given by equation 3.7. Thus, we can update e old , for a η > 0 small enough, either q inew = q iej or q inew = q iold j j j + η qi j − qi j (3.8) which will reduce E({q i j }) monotonically, since each q iej − q iold j is the descending direction of E({q i j }) along the coordinate q i j . We can see that the constraints C c , C r are satisfied by q inew as long as they j old e are satisfied by both q i j and q i j . What needs to be done is to enforce N
q iej = 1, j = 1, . . . , M;
i=1
M
q iej = 1, i = 1, . . . , N,
(3.9)
j=1
which is achieved by an ENFORCING-LAGRANGE iteration loop, for example, by iteratively solving the above N + M nonlinear equations with N N M respect to {a i }i=1 , {b j } M j=1 or equivalently finding {a i }i=1 , {b j } j=1 that reaches a global minimum (i.e., zero) of
P({q i j }) =
N M j=1
i=1
2 q iej − 1
2 N M + q iej − 1 . i=1
(3.10)
j=1
As this ENFORCING-LAGRANGE loop converges, we have N M N M col r ow λj qi j − 1 + λi q i j − 1 < ε j=1 i=1 i=1 j=1
(3.11)
for an arbitrarily small ε. Thus, the fact that equation 3.8 can reduce E({q i j }) monotonically is equivalent to being able to reduce β1 E o ({q i j }) + B(q i j )
One-Bit-Matching ICA Theorem and Distribution-Based Combinatorics
565
monotonically. When β becomes small enough, it is also equivalent to equation 3.8 reducing E o (θ ) monotonically under equation 3.11, because B(q i j ) is bounded for q i j ∈ [0, 1]. In summary, we get the following iterative procedure that guarantees convergence to a feasible solution that is a minimum of E o ({q i j }) and satisfies C c and C r : Step 0: Initialize {q i j } such that they satisfy the constraints Cec , Cer by equation 3.5, Step 1: Get {q iej } by equation 3.7 and then call an inner ENFORCINGLAGRANGE loop until it converges or terminates according to a given checking criterion on the satisfaction of equation 3.11. It results in a specific setting on a i , b j , j = 1, . . . , M, i = 1, . . . , N. new Step 2: Update q iold by using equation 3.8 either sequentially or in j to q i j parallel. Step 3: Check whether the procedure converges according to a pregiven criterion. If yes, stop; otherwise, go to step 1. Based on the obtained {q i j }, we can get a discrete solution simply by a threshold Th > 0 as follows: $ vi j =
1, 0,
if q i j > Th , and resolve a tie heuristically. otherwise,
(3.12)
Moreover, we can also consider explicitly the constraints Cec and get $ vi j =
1, 0,
if j = arg maxk q ik , and resolve a tie heuristically. otherwise,
(3.13)
Similarly, we can also consider explicitly the constraints Cer , as well as both Cec and Cer . 3.3 Stiefel Gradient Flow Algorithms. Alternatively, the study on equation 1.2 via equation 2.3 motivates another way to handle the constraints Cecol , Cer ow , and Cb : let vi j = ri2j , and then use RRT = I to guarantee the constraints, Cecol , Cer ow as well as a relaxed version of Cb (i.e., 0 ≤ vi j ≤ 1). That is, problem equation 1.3 becomes
min
RRT =I f or N≤M
Eo
% &i=N, j=M ri2j i=1, j=1 ,
i=N, j=M
R = {ri j }i=1, j=1 .
(3.14)
566
L. Xu
We consider the problems with ∂ 2 E o (V) is negative definite, ∂vec[V]∂vec[V]T
(3.15)
or E o (V) in a form similar to J (P) in equation 2.3,
E o (V) = −
n n
vi2j a j b i ,
(3.16)
i=1 j=1
with a i > 0, b i > 0, i = 1, . . . , k and a i < 0, b i < 0, i = k + 1, . . . , n after an appropriate permutation on either or both of [a 1 , . . . , a n ] and [b 1 , . . . , b n ]. Similar to the study of equation 2.3, maximizing E o (V) under the constraints Cecol , Cer ow , and vi j ≥ 0 will imply satisfying Cb . In other words, the solutions of equation 3.14 and equation 1.3 are the same. Thus, we can solve the hard problem of combinatorial optimization by equation 1.3 using a gradient flow on the Stiefel manifold RRT = I to maximize the problem by equation 3.14. At least a local optimal solution of equation 1.3 can be reached, with all the constraints Cecol , Cer ow , and Cb guaranteed automatically. To get an appropriate updating flow on the Stiefel manifold RRT = I , we first compute the gradient ∇V E o (V) and then get G R = ∇V E o (V) ◦ R, where the notation ◦ means that a 11 a 12 b b a b a b ◦ 11 12 = 11 11 12 12 . a 21 a 22 b 21 b 22 a 21 b 21 a 22 b 22 Given a small disturbance δ on RRT = I , it follows from RRT = I that the solution of δ RRT + Rδ RT = 0 must satisfy δ R = ZR + U(I − RT R), U is a m × d matrix and Z = −Z is an asymmetric matrix.
(3.17)
From Tr [G TR δ R] = Tr [G TR {ZR + U(I − RT R)}] = Tr [(G R RT )T Z] + Tr [(G R (I − RT R))T U], we get U(I − RT R) = U, (a) T T T (b) Z = G R R − RG R , U = G R (I − R R), δ R = ZR, ZR + U, (c) Rnew = Rold + γt δ R.
(3.18)
That is, we can use one of the above three choices of δ R as the updating direction of R. A general technique for optimization on the Stiefel manifold
One-Bit-Matching ICA Theorem and Distribution-Based Combinatorics
567
that was elaborately discussed by Edelman, Arias, and Smith (1998) can also be adopted for implementing the problem by equation 3.14. 4 Conclusion The one-to-one kurtosis sign-matching conjecture has been proved in a strong sense that every local maximum of max RRT =I J (R) by equation 1.2 is reached at a permutation matrix up to a certain sign indeterminacy if there is a one-to-one same-sign correspondence between the kurtosis signs of all source pdfs and the kurtosis signs of all model pdf’s. That is, all the sources can be separated by a local search ICA algorithm. Theorems have been provided not only on partial separation of sources when there is a partial matching between the kurtosis signs, but also on an interesting duality of maximization and minimization on source separation. Moreover, corollaries are obtained to state that seeking a one-to-one same-sign correspondence can be replaced by using the duality; supergaussian sources can be separated by maximization and subgaussian sources can be separated by minimization. Furthermore, a corollary is obtained to provide a mathematical proof of the success of symmetric orthogonalization implementation of the kurtosis extreme approach. There still remain problems for study. First, the success of the efforts based on equation 1.1 (Xu et al., 1996; Xu et al., 1997; Xu et al., 1998b; Lee et al., 1999; Welling & Weber, 2001; Xu, 2003) can be explained as their ability to build up a one-to-one kurtosis sign matching. However, we still need a mathematical analysis to guarantee that these approaches can achieve this matching exactly or in some probabilistic sense. Second, a theoretical guarantee on either the kurtosis extreme approach or the approach of extracting supergaussian sources via maximization and subgaussian sources via minimization is true only when the prewhitening can be made perfectly. It remains to be studied on comparison of the two approaches as well as approaches based on equation 1.1. Also, comparison may deserve to be made on convergence rates of different ICA algorithms. Last, but not least, the linkage of the problem by equation 1.3 to equation 1.2 and equation 2.3, as well as equation 1.1, leads us to both a distribution approximation and a Stiefel manifold perspective of combinatorial optimization with algorithms that guarantee both convergence and satisfaction of constraints, which also deserve further investigation. Acknowledgments The work was done in preparation for a possible RGC earmarked grant (CUHK 417707E) to be supported by the Research Grant Council of the Hong Kong SAR. A preliminary version of this work was presented as a plenary talk at the Second International Symposium of Neural Networks (Xu, 2005).
568
L. Xu
References Amari, S., Chen, T.-P., & Cichocki, A. (1997). Stability analysis of adaptive blind source separation. Neural Networks, 10, 1345–1351. Amari, S. I., Cichocki, A., & Yang, H. (1996). A new learning algorithm for blind separation of sources. In D. Toureteky, M. Mozer, & M. Hasselmo (Eds.), Advances in neural information processing, 8 (pp. 757–763). Cambridge, MA: MIT Press. Bazaraa, M. S., Sherall, H. D., & Shetty, C. M. (1993). Nonlinear programming: Theory and algorithms. New York: Wiley. Bell, A., & Sejnowski, T. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7, 1129–1159. Cardoso, J.-F. (1998). Blind signal separation: Statistical principles. Proc. IEEE, 86(10), 2009–2025. Cheung, C. C., & Xu, L. (2000). Some global and local convergence analysis on the information-theoretic independent component analysis approach. Neurocomputing, 30, 79–102. Comon, P. (1994). Independent component analysis: A new concept? Signal Processing, 36, 287–314. Delfosse, N., & Loubation, P. (1995). Adaptive blind separation of independent sources: A deflation approach. Signal Processing, 45, 59–83. Edelman, A., Arias, T. A., & Smith, S. T. (1998). The geometry of algorithms with orthogonality constraints. SIAM J. Matrix Anal. Appl., 20, 303–353. Everson, R., & Roberts, S. (1999). Independent component analysis: A flexible nonlinearity and decorrelating manifold approach. Neural Computation, 11, 1957–1983. Girolami, M. (1998). An alternative perspective on adaptive independent component analysis algorithms. Neural Computation, 10, 2103–2114. Hopfield, J. J., & Tank, D. W. (1985). Neural computation of decisions in optimization problems. Biological Cybernetics, 52, 141–152. Hyvarinen, A., Karhunen, J., & Oja, A. (2001), Independent component analysis. New York: Wiley. Lee, T. W., Girolami, M., & Sejnowski, T. J. (1999). Independent component analysis using an extended infomax algorithm for mixed subgaussian and supergaussian sources. Neural Computation, 11, 417–441. Liu, Z. Y., Chiu, K. C., & Xu, L. (2004). One-bit-matching conjecture for independent component analysis. Neural Computation, 16, 383–399. Moreau, E., & Macchi, O. (1996). High order contrasts for self-adaptive source separation. International Journal of Adaptive Control and Signal Processing, 10, 1996. Pearlmutter, B. A., & Parra, L. C. (1996). A context-sensitive genaralization of ICA. In Proc. of Int. Conf. on Neural Information Processing, 151–157. Hong Kong: SpringerVerlag. Tong, L., Inouye, Y., & Liu, R. (1993). Waveform-preserving blind estimation of multiple independent sources. Signal Processing, 41, 2461–2470. Welling, M., & Weber, M. (2001). A constrained EM algorithm for independent component analysis. Neural Computation, 13, 677–689. Xu, L. (1994). Combinatorial optimization neural nets based on a hybrid of Lagrange and transformation approaches. In Proc. of World Congress on Neural Networks (pp. 399–404). San Diego, CA.
One-Bit-Matching ICA Theorem and Distribution-Based Combinatorics
569
Xu, L., (1995). On the hybrid LT combinatorial optimization: New U-shape barrier, sigmoid activation, least leaking energy and maximum entropy. In Proc. of Intl. Conf. on Neural Information Processing (pp. 309–312). Beijing, China. Xu, L. (1997). Bayesian ying-yang learning-based ICA models. In Proc. 1997 IEEE Signal Processing Soc. Workshop on Neural Networks for Signal Processing VII (pp. 476– 485). Piscataway, NJ: IEEE. Xu, L. (2003). Distribution approximation, combinatorial optimization, and Lagrange-barrier. In Proc. of International Joint Conference on Neural Networks 2003 (pp. 2354–2359). Piscataway, NJ: IEEE. Xu, L. (2005). One-bit-matching ICA theorem, convex-Concave programming, and Combinatorial optimization. In Advances in neural networks (pp. 5–20). Berlin: Springer-Verlag. Xu, L., Cheung, C. C., & Amari, S. I. (1998a). Further results on nonlinearity and separtion capability of a liner mixture ICA method and learned LPM. In C. Fyfe (Ed.), Proceedings of the International ICSC Workshop on Independence and Artificial Neural Networks (pp. 39–44). Tenerife, Spain. Xu, L., Cheung, C. C., & Amari, S. I. (1998b). Learned parametric mixture based ICA algorithm. Neurocomputing, 22, 69–80. Xu, L., Cheung, C. C., Yang, H. H., & Amari, S. I. (1997). Independent component analysis by the information-theoretic approach with mixture of density. In Proc. of 1997 IEEE-INNS International Joint Conference on Neural Networks (Vol. 3, pp. 1821–1826). Piscataway, NJ: IEEE. Xu, L., Yang, H. H., & Amari, S. I. (1996, April). Signal source separation by mixtures: Accumulative distribution functions or mixture of bell-shape density distribution functions. Paper presented at the Frontier Forum, Institute of Physical and Chemical Research, Japan.
Received January 18, 2005; accepted June 20, 2006.
LETTER
Communicated by Wei-nian Li
Boundedness and Stability for Integrodifferential Equations Modeling Neural Field with Time Delay Xuyang Lou [email protected]
Baotong Cui [email protected] Research Center of Control Science and Engineering, Southern Yangtze University, Wuxi, Jiangsu 214122, P.R.C.
In this letter, delayed integrodifferential equations modeling neural field (DIEMNF) is studied. The model of DIEMNF is first established as a modified neural field model. Second, it has been proved that if the interconnection of neural field is symmetric, then every trajectory of the system converges to an equilibrium. The boundedness condition for integrodifferential equations modeling neural field with delay is obtained. Moreover, when the interconnection is asymmetric, we give a sufficient condition that can guarantee that the equilibrium of the DIEMNF is unique and is also a global attractor. 1 Introduction Neural networks have many applications in pattern recognition, image processing, association, and other areas. Some of these applications require that the equilibrium points of the designed network are stable (see Zhang, Li, & Liao, 2005). Therefore, it is vital to study the stability of neural networks. However, time delays can be encountered in implementing artificial neural networks (see Hopfield, 1984; Roska & Chua, 1992), and these delays are often a source of oscillation and instability in neural networks. This situation has motivated the study of the stability of delayed neural networks. In recent years, the stability of delayed neural networks (DNN) has been investigated by many researchers (e.g., Arik 2000; Cao & Wang, 2003; Cao, 2000, 2001a, 2001b; Cui & Lou, 2006; Lou and Cui, 2006a, 2006b; Zhang, Wei, & Xu, 2005; Liao, Chen, & Sanchez, 2002; Zeng & Wang, 2006; Lu & Chen, 2006). In addition, a continuum of neurons is useful whenever waves of neural activity have to be modeled (see Wilde, 1997). If the states of the individual neurons cannot be measured, it is often possible to measure the electrical or magnetic fields generated by the waves of activity. This has furthered neural field theory, which has been studied in past years and has found many applications in different areas, for example, perceptual learning (see Shi, Huang, & Zhang, 2004), traveling waves of excitation (see Daniel & Neural Computation 19, 570–581 (2007)
Boundedness and Stability for DIEMNF
571
Andreas, 2002), and knowledge learning (see Luo, Wen, & Huang, 2002; Jirsa, Jantzen, Fuchs, & Kelso, 2001). Neural field models can be used to understand the transformation mechanism, dynamical behavior, capability, and limitation of neural network models by studying globally topological and geometrical structures on parameter spaces of neural networks. The model includes many important models arising from mathematical biology and neural networks, which have been intensively and extensively studied. This class of models typically take the form of integrodifferential equations. For example, Amari (1990, 1991) considers a neural field where infinitely many neurons are continuously put in a two-dimensional field. Also, neural field models have had an important impact in helping to develop an understanding of the dynamics seen in brain slice preparations (see Liley, Cadusch, & Wright, 1999). Recently, more rigorous results of the existence and uniqueness for an equilibrium point of neural networks with asymmetrical connections or with general connections have been achieved (see Chen & Amari, 2001; Yang & Dillon, 1994). It is now necessary to consider the case for the existence of time delays. Moreover, these results provide the motivation and momentum for us to pursue an investigation of DIEMNF qualitative behaviors. This letter is organized as following. In section 2, we establish some elementary lemmas that will be used in later sections. In section 3, we give a boundedness condition for DIEMNF. In section 4, we give the conditions for the global stability of DIEMNF. The main results include two parts. First, we use the Lyapunov method to investigate the convergence of DIEMNF having symmetric interconnection; then we give the conditions for the global stability of DIEMNF having asymmetric interconnection by using a different form than used in section 1. Finally, the conclusion is put forward in section 5. Let χ = C() and take f = maxx∈ | f (x)| for all f ∈ χ; then χ is a Banach space. The set of bounded linear operators mapping χ into itself is briefly denoted by [χ], particularly the identity operator I1 ∈ [χ]. A point λ of the complex plane is called a regular point of an operator A ∈ [χ] if [χ] contains the operator (A − λI1 )−1 . The set ρ(A) of all regular points of an operator A is open. Its complement σ (A) is called the spectrum of A and defined by κ(A), which is defined by κ(A) = max Re(λ). λ∈σ (A)
2 System Description and Preliminaries In this letter we propose the following DIEMNF, ∂u(t, x) = −a u(t, x) + W(x)g(u(t − τ, x)) ∂t + T(x, y)g(u(t, y)) dy + I (x) = F (u), |g(u(t, x))| ≤ M,
(2.1)
572
X. Lou and B. Cui
where = D ( D : a bounded, open, and connected subset of R2 ). Let u(t, x) = [u1 (t, x), u2 (t, x), . . . , un (t, x)]T represent the state (input) of the system at position x in at time t; I (x) = [I1 (x), I2 (x), . . . , In (x)]T be the external input at position x in ; and g(u(t, x)) = [g1 (u(t, x)), g2 (u(t, x)), . . . , gn (u(t, x))]T be the output of the system at position x at time t having range |g(u(t, x))| ≤ M. The typical input-output relation g(·) is a continuous and monotone-increasing function, and it can be chosen as sigmoid, tanh, arctan, or something else. W(x) denotes n × n dimension the connection weights. For all x, y ∈ , let T(x, y) represent the n × n dimension interconnection, which connects the neuron at the position y to the neuron at position x; a is a positive constant. ξ is called an equilibrium of DIEMNF if F (ξ ) = 0. In order to proceed with the analysis, we need to establish some definitions and elementary facts. Definition 1.
A dynamical system on χ is a mapping
: R+ × χ → χ,
R+ = [0, ∞)
such that: 1. (·, u) : R+ → χ, continuous, for all u ∈ χ. 2. (t, ·) : χ → χ, continuous, for all t ∈ R+ . 3. (0, u) = u, for all u ∈ χ. 4. (t + s, u) = (t, (s, u)), for all t, s ∈ R+ , u ∈ χ. Definition 2. F is Fr´echet differentiable at ξ ∈ χ if there exists a bounded linear operator Fξ ∈ [χ] such that
lim
h→0
F (ξ + h) − F (ξ ) − Fξ (h) h
= 0,
(2.2)
Fξ (h) is the so-called Fr´echet derivative of F at ξ . Lemma 1. Suppose that the spectrum of the bounded linear operator A lies in the interior of the left half plane, so that e At ≤ N0 e v0 t , and suppose condition F (u) ≤ q u (t ∈ R+ , u ≤ ρ, q > 0) is satisfied for q < Nv00 . Then du = Au + F (u) has a negative Bohl exponent at zero. dt Proof.
Daleckii and Krein (1974, pp. 285–289) for the proof.
Lemma 2. Let w(t, x) = ∂t∂ u(t, x), where u(t) is a continuously differentiable function from a compact subset ∈ R2 to R. If w(t) ≤ Ee −ηt
Boundedness and Stability for DIEMNF
573
for some positive E < ∞ and η < ∞, then u(t) converges to some ξ in χ. Proof.
Since
u2 − u(t1 ) =
t2
t1
w(t) dt t2
≤
w(t) dt ≤
t1
t2
Ee −ηt dt ≤
t1
E ηt1 (e − e ηt2 ), η
it follows that for all > 0, there exists an N ∈ N such that t1 ≥ N, t2 ≥ N and
E ηt1 (e − e ηt2 ) ≤ , η
where N denote the set of natural numbers. Then by Cauchy convergence theorem (Yoshida, 1980), u(t) → ξ as t → ∞. Lemma 3. Let u(t, x) be a solution of DIEMNF. If u(t) converges to a limit ξ , then ξ is an equilibrium of DIEMNF. Proof. Suppose that u(t, x) be a solution of DIEMNF, and (0, u(t)) = u(t) → ξ as t → ∞. Clearly, ∂t∂ ξ = 0. Assume that ξ is not an equilibrium, that is, F (ξ ) = 0. Then there exists h > 0, h with sufficiently small such that
(h, ξ ) = ξ . But
(h, ξ ) = (h, lim (0, u(t))) = lim (h, (0, u(t))) t→∞
t→∞
= lim (h, u(t)) = ξ. t→∞
(2.3)
This is a contradiction, so F (ξ ) = 0. Thus, ξ is an equilibrium. Lemma 4. (Halanay’s inequality, Cao & Wang, 2004). Let a > b > 0 and v(t) be a nonnegative continuous function on [t0 − τ, t0 ], and satisfy the following inequality: D+ v(t) ≤ −a v(t) + b sup v(s), t ≥ t0 , t−τ ≤s≤t
where τ is a nonnegative constant. Then there exist constants k. λ > 0 satisfies v(t) ≤ ke −λ(t−t0 ) , t ≥ t0 ,
574
X. Lou and B. Cui
where k = supt−τ ≤s≤t v(s), λ is a unique, positive solution of the following equation: λ = a − be λτ . 3 Boundedness In this section, we give the boundedness condition for DIEMNF, equation 2.1. Theorem 1. Let T be continuous on × with being a compact subset. Then equation 2.1 implies that any solution u(t) is bounded if a > 0. Proof.
Since
∂u(t, x) = −a u(t, x) + W(x)g(u(t − τ, x)) ∂t + T(x, y)g(u(t, y)) dy + I (x),
∂ ∂t
u(t, x) −
(3.1)
1 1 I (x) = −a u(t, x) − I (x) + W(x)g(u(t − τ, x)) a a T(x, y)g(u(t, y)) dy, (3.2) +
then u(t, x) −
t 1 1 e −a (t−s) (W(x)g(u(s − τ, x)))ds I (x) = e −a t u(0, x) − I (x) + a a 0
t + e −a (t−s) T(x, y)g(u(s, y)) dy ds. (3.3)
0
So it follows, t 1 e −a (t−s) (W(x)g(u(s − τ, x))) ds I (x) 1 − e −a t + a 0
t −a (t−s) + e T(x, y)g(u(s, y)) dy ds. (3.4)
u(t, x) = e −a t u(0, x) +
0
Boundedness and Stability for DIEMNF
575
Obviously one obtains t 1 −a t e a s (|W(x)||g(u(s − τ, x))|) ds |u(t, x)| ≤ |u(0, x)| + |I (x)| + e a 0
t + e −a t eas |T(x, y)||g(u(s, y))|dy ds. (3.5) 0
Since g(u(t, x)) is bounded, that is, |g(u(t, x))| ≤ M, then if a > 0, the right < ∞ for all t ∈ R+ . The part of equation 3.5, is bounded, that is, u(t) ≤ M proof is complete. 4 Stability of the Equilibrium In this section, we give the main results of this letter. For symmetric interconnection case, we prove the convergence of DIEMNF, equation 2.1. And we give a sufficient condition for the global stability of DIEMNF when the interconnection is asymmetric. 4.1 Asymptotic Stability when the Interconnection Is Symmetric. According to lemma 3, there exists an equilibrium u∗ (x) of DIEMNF, equation 2.1. In the following, we always set v(t, x) = u(t, x) − u∗ (x), where u∗ (x) is an equilibrium (F (u∗ (x)) = 0) and u(t, x) is an arbitrary solution of DIEMNF. Then ∂ ∂ ∂ v(t, x) = u(t, x) − u∗ = F (u) − F (u∗ ) ∂t ∂t ∂t = −a (v(t, x) + u∗ (x)) + W(x)g(v(t − τ, x) + u∗ (x)) + T(x, y)g(v(t, y) + u∗ (y)) dy
= −a v(t, x) + W(x)[g(v(t − τ, x) + u∗ (x)) − g(u∗ (x))] + T(x, y)[g(v(t, y) + u∗ (y)) − g(u∗ (y))]dy
= −a v(t, x) + W(x)[g (u∗ (x))v(t − τ, x) + o(v(t, x))] + T(x, y)[g (u∗ (y))v(t, y) + o(v(t, y))]dy.
(4.1)
By definition 2, the Fr´echet derivative of F at u∗ (x) is
Fu∗ (v(x)) = (−a I + B)v(t, x) + Hv(t, x),
(4.2)
576
X. Lou and B. Cui
where Bv(x) =
T(x, y)g (u∗ (y))v(t, y) dy, Hv(x) = W(x)g (u∗ (x))v(t − τ, x). (4.3)
Theorem 2. The equilibrium u∗ (x) is asymptotically stable if κ(B + H) < a , where B, H are compact operators that satisfy Bv(x) = ∗ T(x, y)g (u (y))v(t, y) dy and Hv(x) = W(x)g (u∗ (x))v(t − τ, x), respec tively. Proof.
Equation 4.1 can be written as
∂ v = F (u∗ + v) = Fu∗ (v) + o(v). ∂t
(4.4)
By lemma 1, if κ(Fu∗ ) < 0, then equation 4.4 has a negative Bohl exponent at zero, that is, u∗ is asymptotically stable. From equation 4.2, it follows that
κ(Fu∗ ) = κ(−a I + B + H) ≤ −a +
max
λ∈σ (B+H)
Re(λ).
(4.5)
And from equation 4.3, it is clear that B, H are compact operators, since T(x, y) ·g (u∗ (y)) is continuous. Hence, σ (B + H) is a compact subset of C, so maxλ∈σ (B+H) Re(λ) exists. Therefore,
κ(Fu∗ ) < 0 ⇔ max Re(λ) = κ(B + H) < a . λ∈σ (B+h)
Thus, the equilibrium u∗ is asymptotically stable if κ(B + H) < a . 4.2 Global Stability when the Interconnection Is Asymmetric. In this section, we give a condition to guarantee the global stability for DIEMNF, equation 2.1. Here, T(x, y) = T(y, x) is not assumed. d In the following, we always assume that w(t) = dt u(t). Then ∂u(t, x) = −a u(t, x) + W(x)g(u(t − τ, x)) ∂t T(x, y)g(u(t, y)) dy + I (x) +
Boundedness and Stability for DIEMNF
577
and ∂w(t, x) = −a w(t, x) + W(x)g (v(t − τ, x))w(t, x) ∂t + T(x, y)g (u(t, y))w(t, y) dy.
And we denote D+ (w(t)) = lim sup h→0
w(t + h) − w(t) . h
By using linearization, we derive w(t + h, x) = w(t, x) + h
∂ w(t, x) + O(h 2 ) ∂t
= (1 − a h)w(t, x) + hW(x)g (v(t − τ, x))w(t, x) +h T(x, y)g (u(t, y))w(t, y) dy + O(h 2 )
(4.6)
and v(t + h, x) = v(t, x) + h
∂ v(t, x) + O(h 2 ) ∂t
= (1 − a h)v(t, x) + hW(x)[g(v(t − τ, x) + u∗ (x)) − g(u∗ (x))] +h T(x, y)[g(v(t, x) + u∗ (x)) − g(u∗ (x))] dy + O(h 2 ),
(4.7)
where v(t) = u(t) − u∗ (u∗ is an equilibrium and u(t, x) is an arbitrary solution of DIEMNF, equation 2.1. Theorem 3.
If there exist two positive constants β, η such that
|g (x)| < β, ∀x ∈ ,
(4.8)
βW(x) < η, −a + β |T(x, y)|dy ≤ −2η,
(4.9)
(4.10)
then the system 2.1 owns a global attractor such that every solution converges to it.
578
X. Lou and B. Cui
Proof.
From equations 4.9 and 4.10, we have
−a + βW(x) + β
|T(x, y)|dy ≤ −η.
(4.11)
Choose a nonzero solution u(t, x) from DIEMNF, equation 2.1, arbitrarily. From equations 4.6 and 4.11, it follows that
∂ w(t + h) = max w(t, x) + h w(t, x) + O(h 2 ) x∈ ∂t ≤ |1 − a h| · max |w(t, x)| + hβ · max |W(x)||w(t, x)| x∈
x∈
+ hβ · max x∈
|T(x, y)||w(t, y)|dy + O(h 2 ).
It is easy to see there exist h → 0 such that 1 − a h > 0. Then we obtain w(t + h) ≤ (1 − a h)w(t) + hβ · W(x)w(t) + hβ · |T(x, y)|w(t)dy + O(h 2 )
≤ (1 − a h)w(t) + hβ
a −η w(t) + O(h 2 ) β
= (1 − a h)w(t) + h(a − η)w(t) + O(h 2 ).
(4.12)
Then D+ (w(t)) = lim sup h→0
≤ lim sup h→0
w(t + h) − w(t) h [1 − a h + h(a − η)]w(t) − w(t) ≤ −ηw(t), h (4.13)
which implies w(t) ≤ w(0)e −ηt . Then by lemma 2, u(t) → ξ as t → ∞, for some ξ ∈ χ. And lemma 3 guarantees that u∗ is an equilibrium. Second, we show that every solution of DIEMNF, equation 2.1, converges to u∗ . Let u(t, x) be any other solution from u(t, x) and v(t) = u(t) − u∗ . From
Boundedness and Stability for DIEMNF
579
equation 4.9, it follows that v(t + h) = max (1 − a h)v(t) + hW(x)[g(v(t − τ, x) + u∗ (x)) − g(u∗ (x))] x∈
+h
T(x, y)[g(v(t, x) + u∗ (x)) − g(u∗ (x))]dy + O(h 2 )|
≤ |1 − a h| · max |v(t, x)| x∈
+ h · max |W(x)||g(v(t − τ, x) + u∗ (x)) − g(u∗ (x))| x∈
+ h · max x∈
|T(x, y)||g(v(t, x) + u∗ (x)) − g(u∗ (x))|dy + O(h 2 )
≤ |1 − a h| · v(t) + h · max |W(x)||g ( f (x))||v(t − τ, x)| + h · max x∈
x∈
|T(x, y)||g ( f (y))||v(t, y)|dy + O(h 2 )
≤ |1 − a h| · v(t) + hβ · W(x) · v(t − τ ) |T(x, y)| · v(t)dy + O(h 2 ) + hβ ·
≤ |1 − a h| · w(t) + hβ
a −η w(t) β
+ hβ · W(x) · v(t − τ ) + O(h 2 ) = |1 − a h| · v(t) + h(a − η)v(t) + hβ · W(x) · v(t − τ ) + O(h 2 ).
(4.14)
Then D+ (v(t)) = lim sup h→0
v(t + h) − v(t) h
1 ≤ lim sup ((1 − a h + h(a − η))v(t) h→0 h + hβ · W(x) · v(t − τ ) − v(t)) ≤ −ηv(t) + β · W(x) · v(t).
(4.15)
Since equation 4.9 holds, we have η > β · W(x). So from lemma 4, it follows that v(t) ≤ v(0)e −ε(t−t0 ) ,
(4.16)
580
X. Lou and B. Cui
where ε = η − β · W(x)e ετ . Thus, v(t) → 0 as t → ∞; equivalently, v(t) → 0 as t → ∞. Since v(t, x) = u(t, x) + u∗ (x), u(t) → u∗ as t → ∞, and the proof is complete. 5 Conclusion In this letter, we studied the boundedness and stability of DIEMNF and obtained sufficient conditions for the boundedness and stability of DIEMNF when the interconnection is symmetric. Moreover, when the interconnection is asymmetric, we give a sufficient condition that can guarantee that the equilibrium of the DIEMNF is unique by using a different method. Acknowledgments This work is supported by the National Natural Science Foundation of China (No. 60674026) and the Science Foundation of Southern Yangtze University (No. 103000-21050323). References Amari, S. I. (1990). Mathematical foundations of neurocomputation. Proc. IEEE on Neural Networks, 78, 1443–1463. Amari, S. I. (1991). Mathematical theory of neural learning. New Generation Computing, 8(4), 281–294. Arik, S. (2000). Stability analysis of delayed neural networks. IEEE Trans. Circ. Syst. I, 47(7), 1089–1092. Cao, J. D. (2000). Global exponential stability and existence of periodic solutions of delayed CNNs. Sci. China (series E), 30(6), 541–549. Cao, J. D. (2001a). A set of stability criteria for delayed cellular neural networks. IEEE Trans. Circ. Syst. I, 48(4), 494–498. Cao, J. D. (2001b). Global stability conditions for delayed CNNS. IEEE Trans. Circ. Syst. I, 48, 1330–1333. Cao, J., & Wang, J. (2003). Global asymptotic stability of a general class of recurrent neural networks with time-varying delays. IEEE Trans. Circ. Syst. I, 50, 34– 44. Cao, J., & Wang, J. (2004). Absolute exponential stability of recurrent neural networks with Lipschitz-continuous activation functions and time delays. Neural Networks, 17(3), 379–390. Chen, T. P., & Amari, S. I. (2001). Stability of asymmetric Hopfield networks. IEEE Trans. Neural Networks, 12(1), 159–163. Cui, B. T., & Lou, X. Y. (2006). Global asymptotic stability of BAM neural networks with distributed delays and reaction-diffusion terms. Chaos, Solitons, and Fractals, 27, 1347–1354. Daleckii, J. L., & Krein, M. G. (1974). Stability of solutions of differential equations in Banach space. Providence. RI: American Mathematical Society.
Boundedness and Stability for DIEMNF
581
Daniel, C., & Andreas, V. M. H. (2002). Traveling waves of excitation in neural field models: Equivalence of rate descriptions and integrate-and-fire dynamics. Neural Computation, 14, 1651–1667. Hopfield, J. J. (1984). Neurons with graded response have collective computational properties like those of two-state neurons. Proc. Natl Acad. Sci., 81, 3088–3092. Jirsa, V. K., Jantzen, K. J., Fuchs, A., & Kelso, J. A. S. (2001). Neural field dynamics on the folded three-dimensional cortical sheet and its forward EEG and MEG. In Information processing in medical imaging (pp. 286–299). Berlin: SpringerVerlag. Liao, X., Chen, G., & Sanchez, E. N. (2002). LMI-based approach for asymptotically stability analysis of delayed neural networks. IEEE Trans. Circ. Syst. I, 49, 1033– 1039. Liley, D. T. J., Cadusch, P. J., & Wright, J. J. (1999). A continuum theory of electrocortical activity. Neurocomputing, 26–27, 795–800. Lou, X. Y., & Cui, B. T. (2006a). Absolute exponential stability analysis of delayed bi-directional associative memory neural networks. Chaos, Solitons, and Fractals, 31(3), 695–701. Lou, X. Y., & Cui, B. T. (2006b). New LMI conditions for delay-dependent asymptotic stability of delayed Hopfield neural networks. Neurocomputing, 69(16–18), 2374– 2378. Lu, W. L., & Chen, T. P. (2006). Dynamical behaviors of delayed neural network systems with discontinuous activation functions. Neural Computation, 18(3), 683– 708. Luo, S. W., Wen, J. W., & Huang, H. (2002). Knowledge-increasable learning behaviors research of neural field. In Proceedings of the First International Conference on Machine Learning and Cybernetics (pp. 586–590). Piscataway, NJ: IEEE Press. Roska, T., & Chua, L. O. (1992). Cellular neural networks with non-linear and delaytype template elements and non-uniform grids. International Journal of Circuit Theory and Applications, 20, 469–482. Shi, Z. Z., Huang, Y. P., & Zhang, J. (2004). Neural field model for perceptual learning. In Third IEEE International Conference on Cognitive Informatics (ICCI’04), (pp. 192– 198). Wilde, P. D. (1997). Neural network models. Berlin: Springer-Verlag. Yang, H., & Dillon, T. S. (1994). Exponential stability and oscillation of Hopfield graded response neural network. IEEE Trans. Neural Networks, 5, 719–729. Yoshida, K. (1980). Functional analysis. Berlin: Springer-Verlag. Zeng, Z. G., & Wang, J. (2006). Multiperiodicity and exponential attractivity evoked by periodic external inputs in delayed cellular neural networks. Neural Computation, 18(4), 848–870. Zhang, H. B., Li, C. G., & Liao, X. F. (2005). A note on the robust stability of neural networks with time delay. Chaos, Solitons and Fractals, 25, 357–360. Zhang, Q., Wei, X. P., & Xu, J. (2005). On global exponential stability of delayed cellular neural networks with time-varying delays. Applied Mathematics and Computation, 162, 679–686.
Received April 14, 2006; accepted June 9, 2006.
ARTICLE
Communicated by Maneesh Sahani
Temporal Symmetry in Primary Auditory Cortex: Implications for Cortical Connectivity Jonathan Z. Simon [email protected] Department of Electrical and Computer Engineering, Department of Biology, Institute for Systems Research, and Program in Neuroscience and Cognitive Sciences, University of Maryland, College Park, MD 20742–3311, U.S.A.
Didier A. Depireux [email protected] Anatomy and Neurobiology, University of Maryland, Baltimore, MD 21201, U.S.A.
David J. Klein [email protected] Department of Electrical and Computer Engineering and Institute for Systems Research, University of Maryland, College Park, MD 20742–3311, U.S.A.
Jonathan B. Fritz [email protected] Institute for Systems Research, University of Maryland, College Park, MD 20742–3311, U.S.A.
Shihab A. Shamma [email protected] Department of Electrical and Computer Engineering, Institute for Systems Research, and Program in Neuroscience and Cognitive Sciences, University of Maryland, College Park, MD 20742–3311, U.S.A.
Neurons in primary auditory cortex (AI) in the ferret (Mustela putorius) that are well described by their spectrotemporal response field (STRF) are found also to have a distinctive property that we call temporal symmetry. For temporally symmetric neurons, every temporal cross-section of the STRF (impulse response) is given by the same function of time, except for a scaling and a Hilbert rotation. This property held in 85% of neurons (123 out of 145) recorded from awake animals and in 96% of neurons (70 out of 73) recorded from anesthetized animals. This property of temporal symmetry is highly constraining for possible models of functional neural connectivity within and into AI. We find that the simplest models of functional thalamic input, from the ventral medial geniculate body
(MGB), into the entry layers of AI are ruled out because they are incompatible with the constraints of the observed temporal symmetry. This is also the case for the simplest models of functional intracortical connectivity. Plausible models that do generate temporal symmetry, from both thalamic and intracortical inputs, are presented. In particular, we propose that two specific characteristics of the thalamocortical interface may be responsible. The first is a temporal mismatch between the fast dynamics of the thalamus and the slow responses of the cortex. The second is that all thalamic inputs into a cortical module (or a cluster of cells) must be restricted to one point of entry (or one cell in the cluster). This latter property implies a lack of correlated horizontal interactions across cortical modules during the STRF measurements. The implications of these insights in the auditory system, and comparisons with similar properties in the visual system, are explored.

1 Introduction

The spectrotemporal response field (STRF) is one measurement used to describe how an auditory neuron responds to a spectrally dynamic sound (Aertsen & Johannesma, 1981b; Eggermont, Aertsen, Hermes, & Johannesma, 1981; Hermes, Aertsen, Johannesma, & Eggermont, 1981; Johannesma & Eggermont, 1983; Smolders, Aertsen, & Johannesma, 1979). It is related to the spectral response area of a neuron, defined roughly as the range of frequencies and intensities of pure tones that elicit excitatory or inhibitory responses (though response area measurements are blind to timing information within the responses). The STRF is a measure of the spectral and dynamic properties of auditory response areas and has been used in a variety of auditory areas, in both mammals and birds, using a variety of stimuli ranging from simple (e.g., auditory ripples) to complex (e.g., natural sounds; Aertsen & Johannesma, 1981a; deCharms, Blake, & Merzenich, 1998; Epping & Eggermont, 1985; Escabi & Schreiner, 2002; Fritz, Shamma, Elhilali, & Klein, 2003; Kowalski, Versnel, & Shamma, 1995; Linden, Liu, Sahani, Schreiner, & Merzenich, 2003; Miller, Escabi, Read, & Schreiner, 2002; Qin, Chimoto, Sakai, & Sato, 2004; Rutkowski, Shackleton, Schnupp, Wallace, & Palmer, 2002; Schafer, Rubsamen, Dorrscheidt, & Knipschild, 1992; Schreiner & Calhoun, 1994; Sen, Theunissen, & Doupe, 2001; Shamma & Versnel, 1995; Shamma, Versnel, & Kowalski, 1995; Theunissen, Sen, & Doupe, 2000; Valentine & Eggermont, 2004; Yeshurun, Wollberg, & Dyn, 1987). The stimuli and techniques, adapted from studies of visual processing (De Valois & De Valois, 1988) and psychoacoustic research (Green, 1986; Hillier, 1991; Summers & Leek, 1994), apply linear and nonlinear systems theory to measure the response area of auditory units. From the systems theoretic point of view, the spike train output (averaged over many trials) is determined by the spectrotemporal profile of a broadband, dynamic,
Figure 1: (a) A simulated fully separable STRF, with spectral and temporal one-dimensional cross-sections as insets (all horizontal slices have the same profile, as do vertical slices). (b) A simulated high-rank STRF with no particular symmetries. (c) A simulated temporally symmetric and quadrant-separable STRF of rank 2. This STRF's symmetry is not obviously visible. This STRF was created by adding to the STRF in a the same STRF except with the spectral cross-section shifted upward by 3/4 octave and the temporal cross-section Hilbert-rotated (see below) by 30 degrees.
acoustic input (Depireux, Simon, & Shamma, 1998). As can be seen from the example in Figure 1, the STRF includes quantitative information about how the neuron determines its firing rate as a function of both spectrum (a vertical cross-section gives firing rate as a function of frequency at some moment in time) and time (a horizontal cross-section gives firing rate as a function of time for a single frequency). There are also analogs of the STRF in other sensory modalities. The most prominent are in vision, where the STRF (De Valois & De Valois, 1988) partially inspired the study of auditory STRFs, and the somatosensory domain (DiCarlo & Johnson, 1999, 2000, 2002; Ghazanfar & Nicolelis, 1999). Any other sensory modality with the concept of a spatial receptive field or response area can generalize it to include the dimension of time and stands to gain from the methodology (Ghazanfar & Nicolelis, 2001; Linden & Schreiner, 2003).
with very different spectrotemporal profiles, the STRF is (approximately) independent of stimulus set (Klein et al., 2006). The most important consequence of linearity, the superposition principle (Papoulis, 1987), requires that the responses to combinations of spectrotemporal envelopes be linearly additive and predictable from the STRF. These results were successfully applied to predictions of responses to spectra composed of multiple moving ripples (Klein et al., 2006; Kowalski et al., 1996). It should be noted that many researchers have also observed a significant proportion of units that are unpredictable or poorly responsive and cannot be described by purely linear assumptions (Theunissen et al., 2000; Ulanovsky, Las, & Nelken, 2003). Nor does a response that is robustly linear disallow nonlinearities: neurons with a substantially linear component typically contain substantial nonlinearities as well, such as having firing rates above the mean with a much larger dynamic range than firing rates below the mean. In particular, response profiles typically have strong, static nonlinearities, and yet their linear response is still robust (van Dijk, Wit, Segenhout, & Tubis, 1994).

In this work, we show that those neurons in AI that are well described by STRFs have a special property, which we call temporal symmetry. Temporal symmetry means that all temporal cross-sections of any STRF are the same time function (i.e., impulse response), except for a scaling and a Hilbert rotation (defined below). We further show that temporal symmetry has strong implications for the functional neural connectivity of neurons in AI, in both their thalamic input—from the ventral medial geniculate body (MGB)—and their intracortical inputs. In fact, most simple, otherwise compelling models of functional neural connectivity of neurons in AI are disallowed physiologically because they violate the property of temporal symmetry. Other models, still biologically plausible, are suggested that obey the temporal symmetry property.

Most of the mathematical treatments discussed in this work arose from the context of linear systems. It is crucial, however, that the linear systems treatment itself lies in the context of the more general nonlinear framework, for example, of Volterra and Wiener (Eggermont, 1993; Rugh, 1981). Thus, although the system has strong nonlinearities in addition to its linearity, as long as the linear component of the overall response is robust, the estimated STRF itself should be robust. This robust STRF, with its property of temporal symmetry, can justifiably be used to strongly constrain models of neural connectivity in AI. To reiterate, the presence of strong nonlinearities is consistent with the presence of a robust linear component, and that robust linear component (here, the STRF) places strong constraints on models of neural connectivity.

2 Methods

2.1 Defining the Spectrotemporal Response Field

In the auditory system, the STRF is a function of both time and frequency, h(t, x), where t is
response time (e.g., in ms) and x = log_2(f/f_0) is the number of octaves above a reference frequency f_0. The firing rate of the neuron r(t) has a linear component r_lin(t) given by the linear, temporal convolution of its STRF, h(t, x), with the spectrotemporal envelope of the stimulus s(t, x):

$$r_{\mathrm{lin}}(t) = \int dt' \int dx\, s(t - t', x)\, h(t', x) = \int dx\, s(t, x) *_t h(t, x), \tag{2.1}$$
where *_t means convolution in the t-dimension (but not x). The full firing rate r(t) will differ from the linear rate r_lin(t) to the extent that the system is not entirely linear, but the STRF determines all of the firing rate that is linear with respect to the spectrotemporal envelope of the stimulus (Depireux et al., 1998). Crucially, even if the system has strong nonlinearities, so long as the linear properties are robust, r_lin(t) will be consistently determined entirely by the STRF and the spectrotemporal envelope of the stimulus.

There are several straightforward interpretations of the STRF, all ultimately equivalent. Four are presented below. The first two interpret the two-dimensional STRF as a collection of one-dimensional response profiles. The third interprets the entire STRF as the spectrotemporal representation of an optimal (acoustic) stimulus. The fourth is a general qualitative scheme for predicting the response to any broadband stimulus from the STRF.

2.1.1 Spectral Response Field Interpretation

Any STRF cross-section at a single moment in time (i.e., a vertical cross-section) can be interpreted as an instantaneous spectral response field (see Figure 2a, bottom left). In this interpretation, a peak in the response field indicates a high (instantaneous) spike rate when the stimulus has enhanced power in that spectral band. A dip in the response field indicates a low (instantaneous) spike rate when the stimulus has enhanced power in that spectral band (e.g., from side-band inhibition). A cross-section can be examined at any instant in time, so the STRF can be interpreted as a time-evolving response field.

2.1.2 Impulse Response Interpretation

Any STRF cross-section at a single frequency (i.e., a horizontal cross-section) can be interpreted as a narrowband impulse response (see Figure 2a, top right). In this interpretation, the impulse response is the response to the instantaneous presentation of high power in a narrow band. There is a separate impulse response for every frequency channel, so the STRF can be interpreted as a spectrally ordered collection of impulse responses.
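To make equation 2.1 concrete, here is a minimal numerical sketch (our illustration, not code from the original study): the stimulus envelope and the STRF are assumed to share a common sampling grid, the integral over x is approximated by a sum over discrete frequency channels, and dt' by the sample spacing.

```python
import numpy as np

def linear_rate(strf, stim, dt=0.001):
    """Linear response component r_lin(t) of equation 2.1.

    strf: (n_freq, n_lag) array, h(t', x) sampled at spacing dt.
    stim: (n_freq, n_time) array, spectrotemporal envelope s(t, x).
    """
    n_freq, n_time = stim.shape
    r_lin = np.zeros(n_time)
    for i in range(n_freq):
        # temporal convolution *_t within one frequency channel, truncated to n_time
        r_lin += np.convolve(stim[i], strf[i])[:n_time]
    return r_lin * dt  # dt stands in for the dt' integration measure
```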
Figure 2: (a) An experimentally measured STRF, with several spectral and temporal one-dimensional cross-sections. (b) The same STRF interpreted as the spectrogram of an optimal stimulus. (c) An intricate stimulus and how different areas of the stimulus spectrogram contribute to the neuron’s firing rate at any given moment (single unit/awake: z004b03-p-tor.a1-2).
2.1.3 Optimal Stimulus Interpretation

The entire STRF can be interpreted as a whole by flipping the time axis and interpreting the new image as proportional to the spectrogram of a stimulus. In this picture, stimulus time evolves (from left to right), growing less negative until it stops at t = 0 (see Figure 2b). The stimulus is optimal in a very specific sense: of all stimuli with the same power, the (linear estimate of the) stimulus that gives the highest
spike rate for that power is proportional to the time-reversed STRF. This is a straightforward result from linear systems theory (see, e.g., deCharms et al., 1998; Papoulis, 1987).

2.1.4 General Qualitative Interpretation

The response of the entire neuron to any broadband stimulus can be estimated by convolving features of the stimulus spectrogram with features of the STRF (see Figure 2c). Regions of the STRF that are positive (excitatory) will contribute positively to the firing rate when the stimulus has enhanced power in that spectral band (enhanced relative to the background stimulus level). Similarly, regions of the STRF that are negative (inhibitory) will contribute negatively to the firing rate when the stimulus has enhanced power in that spectral band. Naturally, regions of the STRF that are positive (excitatory) will also contribute negatively to the firing rate when the stimulus has diminished power in that spectral band (diminished relative to the background stimulus level). Less intuitively, but still naturally, regions of the STRF that are negative (inhibitory) will contribute positively to the firing rate when the stimulus has diminished power in that spectral band (since the tendency to reduce firing rate is itself weakened). The firing rate at any given moment is the sum of all these products, with the appropriate weighting. This last interpretation is really just a verbal description of equation 2.1.

2.2 Measuring the STRF

The STRF can be estimated in many different ways, but all are equivalent to inverting equation 2.1, that is, cross-correlating the full neural response rate r(t) with the spectrotemporal envelope of the stimulus s(t, x). This is also known as spike-triggered averaging. Many types of stimuli can be used to measure an STRF as long as they are sufficiently spectrotemporally rich. Among the stimuli used are auditory ripples (e.g., Depireux et al., 1998; Kowalski et al., 1996; Miller & Schreiner, 2000; Qiu, Schreiner, & Escabi, 2003), auditory m-sequences (Kvale & Schreiner, 1997), random chords (e.g., deCharms et al., 1998; Valentine & Eggermont, 2004), and spectrotemporally rich natural sounds (e.g., Theunissen et al., 2000).
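A minimal sketch of the spike-triggered-average estimate follows (our illustration, not the authors' analysis pipeline; it is unbiased only for spectrotemporally white stimuli and omits the inverse-repeat correction described in section 2.15):

```python
import numpy as np

def strf_reverse_correlation(stim, rate, n_lags):
    """Cross-correlate the response with the stimulus envelope (section 2.2).

    stim: (n_freq, n_time) spectrotemporal envelope; rate: (n_time,) trial-averaged rate.
    Returns an (n_freq, n_lags) STRF estimate.
    """
    n_freq, n_time = stim.shape
    strf = np.zeros((n_freq, n_lags))
    for lag in range(n_lags):
        # response-weighted average of the stimulus `lag` samples in the past
        strf[:, lag] = stim[:, : n_time - lag] @ rate[lag:]
    return strf / rate.sum()
```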
2.3 Relationship to Vision and the Spatiotemporal Response Field

The visual system has neurons that are well characterized by the analogous quantity, the STRF h(t, x⃗), whose arguments are the two-dimensional angular distance x⃗ and time t, and whose temporal convolution with a spatiotemporal stimulus s(t, x⃗) (e.g., drifting contrast gratings) gives the linear firing rate of the cell:

$$r_{\mathrm{lin}}(t) = \int d^2\vec{x}\; s(t, \vec{x}) *_t h(t, \vec{x}), \tag{2.2}$$
which is the same as equation 2.1 but with retinotopic position x⃗ instead of cochleotopic position x. All the methodology, above and below, relevant to STRFs h(t, x) also applies to STRFs h(t, x⃗), with the following substitutions: x → x⃗, ∫dx → ∫d²x⃗, and Ωx → Ω⃗ · x⃗. The applications below will apply only to the extent that cortical visual processing is comparable to cortical auditory processing and to their respective physiological properties, and, of course, depend on which area within cortex is being characterized. Stimuli used to calculate visual STRFs must have contrast that changes in both space and time. Typical stimuli include drifting contrast gratings, randomly changing dots or bars, m-sequences, and more complex patterns (see, e.g., De Valois, Cottaris, Mahon, Elfar, & Wilson, 2000; De Valois & De Valois, 1988; Reid, Victor, & Shapley, 1997; Richmond, Optican, & Spitzer, 1990; Sutter, 1992; Victor, 1992). These spatiotemporally rich stimuli can be compared to the spectrotemporally rich auditory stimuli described above (auditory ripples, random chords, and spectrotemporally rich natural sounds). We use the abbreviation STRF to apply to both spectral (auditory) and spatial (visual) cases. Context will make clear to which case it refers.

2.4 Rank and Separability

The rank of a two-dimensional function, such as an STRF or a spectrotemporal modulation transfer function (MTF_ST), captures one aspect of how simple the function is. When a two-dimensional function is the simple product of two one-dimensional functions, that is, h(t, x) = f(t)g(x), this captures an important notion of simplicity. When this occurs, the function is of rank 1. When the sum of two products is required, for example, h(t, x) = f_A(t)g_A(x) + f_B(t)g_B(x), the function is of rank 2. (In cases of rank 2 and higher, we demand that each temporal function f_i(t) be linearly independent of every other temporal function f_j(t), and the same for the spectral functions g; otherwise, we could have used a smaller number of terms.) A rank 2 function is clearly not as simple as a rank 1 function but nevertheless can be expressed rather concisely. In general, the rank of any two-dimensional function is the minimum number of simple products needed to describe the function. (When the functions are approximated as discrete, the definition of rank is identical to the definition of the algebraic rank of a matrix.) An STRF of rank 1, also called fully separable, can be written

$$h_{FS}(t, x) = f(t)g(x), \tag{2.3}$$
which has this simple interpretation: the temporal processing of the STRF is performed independently of the spectral processing (and, of course, vice versa). A simple model of a neuron with this property is that its spectral processing is due purely to inputs from presynaptic neurons with a range
of center frequencies, while the temporal processing is due to integration in the soma of all inputs arriving from all dendrites. For many peripheral neurons, this model is a good one. An STRF of rank 1 is also called fully separable because its processing separates cleanly into independent spectral and temporal processing stages. The example STRF in Figure 1a is fully separable, and this can be verified by noting that all spectral (vertical) cross-sections have the same shape (the shape of the spectral function g(x)), differing only in amplitude (and possibly sign). Similarly, all temporal (horizontal) cross-sections have the same shape (the shape of the temporal function f(t)), differing only in amplitude (and possibly sign). An STRF of rank 2 is somewhat less simple and somewhat less straightforward to interpret:

$$h_{R2}(t, x) = f_A(t)g_A(x) + f_B(t)g_B(x). \tag{2.4}$$
One interpretation comes from noting that this h_R2(t, x) can be written as the sum of two fully separable STRFs, h_FS^A(t, x) = f_A(t)g_A(x) and h_FS^B(t, x) = f_B(t)g_B(x). This implies a possible, but less than satisfying, interpretation: the neuron has exactly two neural inputs, each of which has a fully separable STRF, and then simply adds them. Below we present more realistic interpretations, consistent with known physiology. An STRF described by a generic two-dimensional function would not necessarily have the same physiologically motivated interpretations or models. An STRF of general rank N can be written

$$h_{RN}(t, x) = \underbrace{f_A(t)g_A(x) + f_B(t)g_B(x) + \cdots + f_Z(t)g_Z(x)}_{N\ \text{terms}}. \tag{2.5}$$
As the rank of an STRF increases, more and more complexity is permitted. Figure 1b demonstrates a simulated STRF of high rank (though still well localized in time and spectrum). STRFs of this complexity are not seen in AI (Klein et al., 2006). In general, higher rank implies more STRF complexity. Lower rank suggests there is a specific property (constraint) that causes this simplicity.

2.5 Singular Value Decomposition Analysis of the STRF

Singular value decomposition (SVD) is a method that can be applied to any finite-dimensional matrix (e.g., a discretized version of the STRF) to establish both its rank and a unique reexpression of the matrix as the sum of terms whose number is the rank of the matrix (Hansen, 1997; Press, Teukolsky, Vetterling, & Flannery, 1986). The SVD of a matrix M takes the form

$$M_{ij} = \underbrace{\Sigma_A u_{Ai} v^T_{Aj} + \Sigma_B u_{Bi} v^T_{Bj} + \cdots + \Sigma_Z u_{Zi} v^T_{Zj}}_{N\ \text{terms}}, \tag{2.6}$$
where N is the rank of the matrix, the u and v are vectors normalized to have unit power, and each Σ is its term's root mean square (RMS) power. If we discretize the STRF into a finite number of frequencies and time steps, x = {x_i} = (x_1, . . . , x_M) and t = {t_j} = (t_1, . . . , t_N), so that h(t_i, x_j) = h_ij, we see that

$$h_{ij} = \underbrace{\Sigma_A u_A(x_i)v_A(t_j) + \Sigma_B u_B(x_i)v_B(t_j) + \cdots + \Sigma_Z u_Z(x_i)v_Z(t_j)}_{N\ \text{terms}}, \tag{2.7}$$
where N is the rank of the STRF, the u and v vectors are normalized to have unit power, and each Σ is the RMS power of its term. This is the same as equation 2.5, except that time and frequency have been discretized, and thus the STRF has been discretized also. What makes SVD unique among decompositions is that (1) it automatically orders the terms by decreasing power: Σ_A > Σ_B > · · · > Σ_Z; (2) each column u_A, u_B, . . . , u_Z is orthogonal to all the others; and (3) each row v_A^T, v_B^T, . . . , v_Z^T is orthogonal to all the others. The mathematical specifics are described well in textbooks (see, e.g., Press et al., 1986) and will not be covered here. Mathematically, SVD is intimately related to principal component analysis (PCA), and both are used for a variety of analytic purposes, including noise reduction (Hansen, 1997). Since measured STRFs are made with noisy measurements (the noise arising from both neural variability and instrument noise), the true rank of the STRF must be estimated. There are a variety of methods to do this (Stewart, 1993), but all use the same conceptual framework: once the power of the noise is estimated, then all SVD components with power greater than the noise can be considered signal, and the number of components satisfying this criterion is the estimate of the rank. This estimate of rank is biased (more noise results in a lower rank estimate), but it has been shown that for the range of signal-to-noise ratios and the STRFs used in this study, noise is not an impediment to measuring high rank (Klein et al., 2006). SVD also motivates us to recast equation 2.5 into its continuous form,

$$h_{RN}(t, x) = \underbrace{\Sigma_A v_A(t)u_A(x) + \Sigma_B v_B(t)u_B(x) + \cdots + \Sigma_Z v_Z(t)u_Z(x)}_{N\ \text{terms}}, \tag{2.8}$$

where the u and v functions have unit power and each Σ is the RMS power of its term. Compared to equation 2.5, it is more complex but less arbitrary: decompositions of the form of equations 2.3, 2.4, and 2.5 are not unique, since amplitude can be arbitrarily shifted between the temporal and spectral components. In equation 2.8, all amplitude information is explicitly shared within each term by its Σ_i coefficient. When the technique of SVD, which is designed for discrete matrices, is applied to continuous two-dimensional functions, as in the case of equation 2.8, it is called the singular value expansion (Hansen, 1997).
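A minimal sketch of this rank-estimation recipe (our illustration; the noise_power threshold stands in for the estimators cited above):

```python
import numpy as np

def estimate_rank_and_truncate(strf, noise_power):
    """Rank estimate per section 2.5: keep SVD terms whose power exceeds the noise."""
    U, s, Vt = np.linalg.svd(strf, full_matrices=False)
    rank = int(np.sum(s > noise_power))       # biased downward when noise is large
    # rebuild the STRF from the `rank` strongest terms of equation 2.7
    strf_trunc = (U[:, :rank] * s[:rank]) @ Vt[:rank]
    return rank, strf_trunc
```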
We will go back and forth between the continuous and discretized versions of the STRF without loss of generality (so long as N is finite), depending on which formalism is more beneficial.

2.6 Hilbert Transform and Partial Hilbert Transforms and Rotations

We now discuss the Hilbert transform, a standard tool in signal processing and necessary for the phenomenon of temporal symmetry. The Hilbert transform of a function produces the same function but with all its phase components shifted by 90 degrees. This can be seen in the Fourier domain. For a function f(t) with Fourier transform F(ω), that is,

$$F(\omega) = \mathcal{F}_\omega[f(t)] = \int dt\, f(t)\, e^{-j\omega t}, \qquad f(t) = \mathcal{F}_t^{-1}[F(\omega)] = (2\pi)^{-1} \int d\omega\, F(\omega)\, e^{j\omega t}, \tag{2.9}$$
the Hilbert transform, designated by H or ∧, is defined by

$$\hat{f}(t) = H[f(t)] = \mathcal{F}_t^{-1}\!\left[\operatorname{sgn}(\omega)\, e^{j\pi/2}\, F(\omega)\right], \tag{2.10}$$
where e^{jπ/2} = j is a rotation by 90 degrees in the complex plane (the role of sgn(ω) guarantees that the Hilbert transform of a real function is itself a real function). This rotation of phase by 90 degrees means that the Hilbert transform of any sine wave is a cosine wave, and the Hilbert transform of any cosine wave is the negative sine wave, but unlike differentiation, the amplitude is unchanged by the operation. An important property of the Hilbert transform is that it is orthogonal to the original function, and yet it still has the same frequency content (aside from the DC component, i.e., its mean, which is zeroed out). f̂(t) is said to be “in quadrature” with f(t); a demonstration is illustrated in Figure 3. For the remainder of this section, we assume that any function f(t) that will be Hilbert transformed has mean zero (or has had its mean subtracted manually). The double application of a Hilbert transform, since applying two successive 90 degree rotations is equivalent to one 180 degree rotation, is just a sign inversion:

$$H[H[f(t)]] = \hat{\hat{f}}(t) = -f(t). \tag{2.11}$$
It is also useful to define a partial Hilbert transform. A Hilbert transform of a function can be viewed as a 90 degree rotation in a mixing angle plane, so one can define a partial version of the transform:

$$f^\theta(t) = \sin\theta\, \hat{f}(t) + \cos\theta\, f(t). \tag{2.12}$$
Figure 3: An example of a function and its Hilbert transform. (a) A function (in black) overlaid with its Hilbert transform (in gray); the two are orthogonal. (b) The magnitude of the Fourier transform of the function (in black) overlaid with the magnitude of the Fourier transform of its Hilbert transform (in gray); they overlap exactly. (c) The phase of the Fourier transform of the function (in black) overlaid with the phase of the Fourier transform of its Hilbert transform (in gray). The difference is exactly ±90 degrees (dashed line).
In this convention, note that f̂(t) = f^{π/2}(t), f(t) = f^0(t), and f̂̂(t) = f^π(t) = −f(t). Thus, a partial Hilbert transform still has the same frequency content as the original function, but its phase “rotation” is not restricted to 90 degrees and can be any angle on the complex plane. Physiological examples of the Hilbert transform have been demonstrated in the visual system and have been named “lagged” cells (De Valois et al., 2000; Humphrey & Weller, 1988; Mastronarde, 1987a, 1987b). These lagged cells are located in the lateral geniculate nucleus (LGN), one of the visual thalamic nuclei. We will continue this nomenclature and call any neuron whose impulse response is the Hilbert transform of another the lagged version of the latter. We will further generalize and call any neuron whose impulse response is the partial Hilbert transform (Hilbert rotation) of another the “partially lagged” version of the latter. Note that the lag is a phase lag, not a time lag. The full and partial Hilbert transform or rotation is not restricted to the time domain and is equally applicable to the spectral domain—for example,

$$H[g(x)] = \hat{g}(x), \tag{2.13}$$
$$H[H[g(x)]] = \hat{\hat{g}}(x) = -g(x), \tag{2.14}$$
$$g^\theta(x) = \sin\theta\, \hat{g}(x) + \cos\theta\, g(x). \tag{2.15}$$
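Because the sign convention of equation 2.10 (H[sin] = cos, H[cos] = −sin) differs from some library routines, here is a minimal FFT-based sketch of the Hilbert rotation of equation 2.12 (our illustration, assuming a uniformly sampled input):

```python
import numpy as np

def hilbert_rotation(f, theta):
    """Partial Hilbert transform f^theta(t) = sin(theta) f_hat(t) + cos(theta) f(t) (eq. 2.12)."""
    f = np.asarray(f, dtype=float)
    f = f - f.mean()                      # the text assumes zero-mean functions
    F = np.fft.fft(f)
    omega = np.fft.fftfreq(f.size)
    f_hat = np.fft.ifft(1j * np.sign(omega) * F).real   # sgn(w) e^{j pi/2} F(w), eq. 2.10
    return np.sin(theta) * f_hat + np.cos(theta) * f
```

Setting theta = pi reproduces the sign inversion of equation 2.11, and theta = pi/2 gives the full Hilbert transform.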
2.7 Temporal Symmetry

An important class of STRFs consists of those for which all temporal cross-sections (i.e., each cross-section at a constant spectral index x_c) of the given STRF are related to each other by a simple scaling, g, and rotation, θ, of the same time function,

$$h(t, x_c) = g_{x_c}\, f^{\theta_{x_c}}(t), \tag{2.16}$$
where each scaling and rotation can depend on x_c. Since this is then true for all spectral indices x, we call the system temporally symmetric and write it in the functional form

$$h_{TS}(t, x) = g(x)\, f^{\theta(x)}(t). \tag{2.17}$$
The meaning is still the same: all temporal cross-sections are related to each other by a simple scaling and rotation of the same time function. There is only one function of t, that is, f(t), and Hilbert rotations of it (demonstrated in Figure 4). Using the definition of the Hilbert rotation, equation 2.12, we can reexpress equation 2.17 to show explicitly that a temporally symmetric STRF is rank 2 (i.e., is the sum of two linearly independent product terms):

$$h_{TS}(t, x) = g(x) f^{\theta(x)}(t) = g(x)\cos\theta(x)\, f(t) + g(x)\sin\theta(x)\, \hat{f}(t) = f(t)g_A(x) + \hat{f}(t)g_B(x), \tag{2.18}$$

where

$$g_A(x) = g(x)\cos\theta(x), \quad g_B(x) = g(x)\sin\theta(x), \quad \tan\theta(x) = \frac{g_B(x)}{g_A(x)}, \quad g^2(x) = g_A^2(x) + g_B^2(x). \tag{2.19}$$

We will often use the form of equation 2.18, which is completely equivalent to equation 2.17. In equation 2.18 it is explicit that a temporally symmetric STRF has rank 2 and cannot have higher rank. For systems that are not exactly temporally symmetric but are of rank 2, or for systems that have been truncated by SVD to rank 2, we can define an index of temporal symmetry, η_t. This index ranges from 0 to 1, where η_t = 1 for the temporally symmetric case and η_t = 0 when the two time functions are temporally unrelated.
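To make equation 2.18 concrete, a minimal sketch (our illustration; the waveforms and spectral weights are arbitrary choices, and hilbert_rotation is the sketch from section 2.6) builds a temporally symmetric STRF from a single temporal function and verifies numerically that its rank is 2:

```python
import numpy as np

t = np.linspace(0.0, 0.25, 256)                      # 250 ms of response time
x = np.linspace(-2.0, 2.0, 64)                       # octaves around a reference frequency
f = np.exp(-t / 0.05) * np.sin(2 * np.pi * 12 * t)   # the single temporal function f(t)
f_hat = hilbert_rotation(f, np.pi / 2)               # its Hilbert transform
g_A = np.exp(-x**2)                                  # spectral weights are unconstrained
g_B = x * np.exp(-x**2)

h_ts = np.outer(g_A, f) + np.outer(g_B, f_hat)       # equation 2.18, rows = x, columns = t
print(np.linalg.matrix_rank(h_ts))                   # prints 2
```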
Figure 4: (a) The simulated temporally symmetric and quadrant-separable STRF from Figure 1c and five fixed-frequency cross-sections, corresponding to five temporal impulse responses. (b) The same five impulse responses but individually Hilbert-rotated and rescaled. (c) The same Hilbert-rotated and rescaled impulse responses superimposed. The Hilbert rotation phases were calculated by taking the negative phase of the complex correlation coefficient between the analytic signal of each temporal cross-section and the analytic signal of the fourth temporal cross-section.
First we put equation 2.18, which is explicitly rank 2, into the form of equation 2.8:

$$h_{R2}(t, x) = \Sigma_A v_A(t)u_A(x) + \Sigma_B v_B(t)u_B(x). \tag{2.20}$$
Since the u and v functions have unit power, we define the index of temporal symmetry to be the magnitude of the normalized complex inner product between the two temporal analytic signals (Cohen, 1995),

$$\eta_t = \frac{1}{2}\left|\int \left(v_A(t) + j\hat{v}_A(t)\right)^{*} \left(v_B(t) + j\hat{v}_B(t)\right) dt\right|, \tag{2.21}$$
where ∗ is the complex conjugate operator. The rank 1 case, since it is automatically temporally symmetric, is also given the value η_t = 1.
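A minimal sketch of equations 2.20 and 2.21 (our illustration): take the two leading temporal singular vectors of the measured STRF, form their analytic signals, and compute half the magnitude of their inner product. Because only the magnitude is used, the sign convention of the library Hilbert transform does not matter.

```python
import numpy as np
from scipy.signal import hilbert  # analytic signal v(t) + j*v_hat(t)

def temporal_symmetry_index(strf):
    """eta_t of equation 2.21 for the rank 2 truncation of an STRF (rows = x, columns = t)."""
    _, _, Vt = np.linalg.svd(strf, full_matrices=False)
    z_A = hilbert(Vt[0] - Vt[0].mean())   # zero mean, per the Hilbert transform convention
    z_B = hilbert(Vt[1] - Vt[1].mean())
    return 0.5 * np.abs(np.sum(np.conj(z_A) * z_B))
```

Applied to the temporally symmetric h_ts constructed in section 2.7's sketch, this returns a value close to 1.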
Temporal symmetry’s cousin, spectral symmetry, can be defined analogously:

$$h_{SS}(t, x) = f(t)\, g^{\theta(t)}(x) = f(t)\cos\theta(t)\, g(x) + f(t)\sin\theta(t)\, \hat{g}(x) = f_A(t)g(x) + f_B(t)\hat{g}(x), \tag{2.22}$$

where

$$f_A(t) = f(t)\cos\theta(t), \quad f_B(t) = f(t)\sin\theta(t), \quad \tan\theta(t) = \frac{f_B(t)}{f_A(t)}, \quad f^2(t) = f_A^2(t) + f_B^2(t), \tag{2.23}$$

and

$$\eta_s = \frac{1}{2}\left|\int \left(u_A(x) + j\hat{u}_A(x)\right)^{*} \left(u_B(x) + j\hat{u}_B(x)\right) dx\right|. \tag{2.24}$$
2.8 Spectrotemporal Modulation Transfer Functions (MTF_ST)

Just as any STRF may have the property of temporal or spectral symmetry, it may also have the property of quadrant separability. Quadrant separability is most easily described in terms of the spectrotemporal modulation transfer function (MTF_ST), which is presented here. The STRF, which is two-dimensional, can also be represented by its two-dimensional Fourier transform or its closely related partner, the MTF_ST,

$$H(w, \Omega) = \mathcal{F}_{w,\Omega}[h(t, -x)] = \int dt \int dx\, h(t, x)\, e^{2\pi j(-wt + \Omega x)}, \tag{2.25}$$

where w and Ω are the coordinates Fourier-conjugate to t and x, respectively (see Depireux et al., 1998, for sign conventions). Examples are shown in Figure 5. It follows that the inverse Fourier transform of H(w, Ω) gives the STRF of the cell:

$$h(t, x) = \mathcal{F}^{-1}_{t,-x}[H(w, \Omega)]. \tag{2.26}$$
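A minimal sketch of a discrete MTF_ST and a quadrant-separability score (our illustration; the FFT axis ordering, sign conventions, and quadrant indexing are assumptions about the discretization rather than the authors' procedure). The score is the fraction of a quadrant's power captured by its leading singular value, which is 1 for a perfectly separable quadrant:

```python
import numpy as np

def quadrant1_separability(strf):
    """Power fraction of the leading singular value within quadrant 1 of the MTF_ST."""
    # H(w, Omega) as a 2-D FFT of h(t, -x); rows index x (hence the flip), columns index t
    H = np.fft.fftshift(np.fft.fft2(strf[::-1, :]))
    n_x, n_t = H.shape
    q1 = H[n_x // 2 + 1:, n_t // 2 + 1:]   # w > 0, Omega > 0 (up to sign conventions)
    s = np.linalg.svd(q1, compute_uv=False)
    return s[0]**2 / np.sum(s**2)
```

For the fully separable and temporally symmetric examples above, this fraction should be near 1; for a sum of two translated separable STRFs (section 2.12), it should drop well below 1.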
The MTF_ST is a Fourier transform of the STRF and is also used to characterize auditory processing. For example, to the extent that the STRF represents a stimulus that the neuron prefers, the Fourier transform provides an analytical description of the features of that stimulus. Power at low w, which has dimensions of cycles per second, or Hz, corresponds to smoother temporal features, or slower temporal evolution. Power at high w corresponds to finer temporal features and faster temporal resolution.
Figure 5: (a) The simulated fully separable MTF_ST generated by the STRF in Figure 1a. Phase is given by hue (scale on right) and amplitude by intensity. Since the STRF is fully separable, the MTF_ST is separable as well in both amplitude (all horizontal slices have the same intensity profiles, as do vertical slices) and phase. Separability leads to a phase profile that is the direct sum of a purely temporally dependent phase and a purely spectrally dependent phase. When the phase is primarily linear, as it is for STRFs well localized in spectrum and time, the phase profile is diagonal, with slope determined by the location in spectrum and time of the STRF. (b) The simulated MTF_ST generated by the high-rank STRF in Figure 1b. Phase as in a. Since there is no particular symmetry in the STRF, there is no particular symmetry in the MTF_ST. To the extent that the STRF is localized in spectrum and time, the phase slope is approximately constant. (c) The simulated MTF_ST generated by the temporally symmetric and quadrant-separable STRF of rank 2. The symmetry is now more visible than in Figure 1. This MTF_ST is somewhat directionally selective: quadrant 2 (characterizing responses to sounds with an upward spectral glide) is strong and clearly separable within the quadrant; quadrant 1 (characterizing responses to sounds with a downward spectral glide) is weak. Quadrants 3 and 4 are the complex conjugates of quadrants 1 and 2, by equation 2.27.
Figure 6: (a) An STRF equal to the sum of two simulated fully separable STRFs, identical to each other except translated in time and spectrum (the one with lower best frequency and shorter delay is displayed in Figure 1a). This results in a (not fully separable) strongly velocity-selective STRF. (b) The MTFST of the STRF in a . The spectrotemporal modulation transfer function is clearly not a vertical column–horizontal row product within each quadrant, and therefore the entire spectrotemporal modulation transfer function cannot be quadrant separable. Because the spectrotemporal modulation transfer function is not quadrant separable, it cannot be temporally symmetric. Phase is given by hue (scale on right); amplitude is given by intensity. (c) An STRF equal to the sum of two simulated temporally symmetric STRFs: identical to each other except translated in time and spectrum (the one with higher best frequency and shorter delay is displayed in Figure 1c). (d) The first ten singular values (from SVD) of the STRF show a rank of 4, which cannot be temporally symmetric (temporal symmetry requires a rank of 2).
Lower (versus higher) Ω, which has dimensions of cycles per octave, corresponds to smoother (versus finer) scale spectral features, such as broad (versus sharp) peaks or formants (versus harmonics). The four possible sign combinations of w and Ω break the MTF_ST into four quadrants, numbered 1 (w, Ω > 0), 2 (w < 0, Ω > 0), 3 (w, Ω < 0), and 4 (w > 0, Ω < 0). From equation 2.25 it can be seen that H(w, Ω) is a complex-valued function. Because h(t, x) is purely real, the MTF_ST has a complex-conjugate symmetry,

$$H(-w, -\Omega) = H^*(w, \Omega). \tag{2.27}$$
Equation 2.27 also holds for the Fourier transform of any real function of t and x. This means that the value of the MTF_ST at any point in quadrant 3 is fully determined by the value at the reflected point in quadrant 1 (and similarly for the pair quadrant 4 and quadrant 2). The MTF_ST, and its properties and interpretations, is discussed in greater detail elsewhere (Depireux et al., 1998, 2006).

2.9 Directionality and Quadrant Separability

When the STRF is separable, the MTF_ST is separable, because the Fourier transform of the STRF is given by the simple products of the Fourier transforms of f(t) and g(x). It was noticed that even when the STRF is not separable, the quadrants of the MTF_ST are still individually separable (Klein et al., 2006; Kowalski et al., 1996), but neither the significance nor the origin of this property was well understood. (See McLean & Palmer, 1994, for the analogous case in vision.) With the discovery of temporal symmetry and the relationship between temporal symmetry and quadrant separability, the significance is now clear, as will be shown. One of the most useful properties of the Fourier representation (the MTF_ST) over the spectrotemporal representation (the STRF) is that in the Fourier representation, the contributions of the response to stimuli with upward- and downward-moving spectral features are explicitly segregated in different quadrants. The response to any downward-moving component is governed entirely by the MTF_ST in quadrant 1, and the response to any upward-moving component is governed entirely by the MTF_ST in quadrant 2. Quadrant separability is a particular generalized symmetry property that an STRF and its MTF_ST may have, but it is obvious only when seen in the MTF_ST domain: within each quadrant, the MTF_ST is separable. For example, in quadrant 1, where both w and Ω are positive, the MTF_ST is the simple product of a horizontal (temporal) function and a vertical (spectral) function. Similarly, in quadrant 2, where w is negative and Ω is positive, the MTF_ST is the simple product of a different horizontal (temporal) function and a different vertical (spectral) function. An example is shown in
Figure 5c. Quadrant-separable MTF_STs are defined and characterized with more mathematical detail in the appendix. Historically in audition, the property of quadrant separability was noticed when the spectrotemporal modulation transfer function measured in quadrant 1 was separable, and the spectrotemporal modulation transfer function measured in quadrant 2 was also separable, but the two separable functions were not the same (Depireux et al., 2001; Kowalski, Depireux, & Shamma, 1996) (if it is separable in quadrants 1 and 2, then it is automatically separable in quadrants 3 and 4, from equation 2.27). In vision studies, quadrant separability was invoked for the notion of directional selectivity (Watson & Ahumada, 1985). In this work, we argue that quadrant separability in the auditory system is due entirely to temporal symmetry. Quadrant separability is a property of both the STRF and the MTF_ST, though visible only in the MTF_ST. Nevertheless, since the STRF and MTF_ST are just different representations of the same response properties, it is a property that is held (or not) by both. It is shown in the appendix that a quadrant-separable STRF can always be written in the form

$$h_{QS}(t, x) = f_A(t)g_A(x) + \hat{f}_B(t)\hat{g}_B(x) + \hat{f}_A(t)g_B(x) + f_B(t)\hat{g}_A(x), \tag{2.28}$$
which is shown in the appendix to be of rank 4 unless additional symmetries, such as those discussed later, reduce the rank to 2 or 1.

2.10 Quadrant Separability and Temporal Symmetry

Comparing equation 2.28 to equation 2.18, we can see that the temporally symmetric h_TS(t, x) has the same form as h_QS(t, x) for the special case that f_B(t) = 0. Thus, a temporally symmetric STRF is automatically quadrant separable. It is not a generic quadrant-separable STRF, since its rank is not 4 but 2 (by inspection of equation 2.18). It nevertheless possesses the defining property of quadrant separability: each quadrant of its MTF_ST is separately separable (e.g., see Figure 5c). We will use this property below to show that an STRF that is not quadrant separable cannot be temporally symmetric.

2.11 Quadrant Separability and General Symmetries

There are three ways of taking the generic quadrant-separable STRF of rank 4 and finding a generalized symmetry that causes it to be of lower rank. The generalized symmetry of temporal Hilbert symmetry has already been discussed. It can be seen by taking the most general form of a quadrant-separable STRF, equation 2.28, and noting that setting f_B(t) = 0 = f̂_B(t), so that only f_A(t) and f̂_A(t) survive as temporal functions, reduces the number of independent components (i.e., the rank) from 4 to 2. The generalized symmetry of spectral symmetry works analogously, since the mathematics is blind to the difference between time and spectrum. It can be seen by noting that setting g_B(x) = 0 = ĝ_B(x), so that only g_A(x) and
602
J. Simon, D. Depireux, D. Klein, J. Fritz, and S. Shamma
ĝ_A(x) survive as spectral functions, also reduces the number of independent components (i.e., the rank) from 4 to 2. This gives the spectrally symmetric, quadrant-separable STRF:

$$h_{SS}(t, x) = f_A(t)g_A(x) + f_B(t)\hat{g}_A(x). \tag{2.29}$$
We are not aware of any physiological system in which this generalized symmetry is realized. Finally, a third generalized symmetry is pure directional selectivity. In the case of this generalized symmetry, we set

$$f_A(t) = \hat{f}_B(t), \qquad g_A(x) = \hat{g}_B(x), \tag{2.30}$$
which, when combined with the identity that the double application of the Hilbert operator is a Hilbert rotation of 180 degrees and so is equivalent to multiplication by −1, gives

$$h_{DS}(t, x) = 2\left[f_A(t)g_A(x) - \hat{f}_A(t)\hat{g}_A(x)\right]. \tag{2.31}$$
This is the case much discussed in the vision literature when the STRF is interpreted as the visual STRF: both temporal and spectral functions are added in quadrature (Adelson & Bergen, 1985; Barlow & Levick, 1965; Borst & Egelhaaf, 1989; Chance, Nelson, & Abbott, 1998; De Valois et al., 2000; Emerson & Gerstein, 1977; Heeger, 1993; Maex & Orban, 1996; McLean & Palmer, 1994; Smith, Snowden, & Milne, 1994; Suarez, Koch, & Douglas, 1995; Watson & Ahumada, 1985). The result is a purely directionally selective response field, and again is rank 2. These three symmetries reduce the rank of a quadrant-separable spectrotemporal modulation transfer function from 4 to 2. In the appendix, it is proven that these are the only symmetries that can reduce the rank to 2 and that a quadrant-separable STRF can never be of rank 3 (the rank 1 case is fully separable). We are not aware of any physiological system that possesses a generic quadrant-separable STRF, that is, of rank 4.
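As a numerical check of equation 2.31 (our illustration, reusing f, f_hat, g_A, and hilbert_rotation from the earlier sketches), the purely directionally selective construction is rank 2, and its MTF_ST power is strongly asymmetric between quadrants 1 and 2:

```python
import numpy as np

g_A_hat = hilbert_rotation(g_A, np.pi / 2)                 # Hilbert transform along x
h_ds = 2 * (np.outer(g_A, f) - np.outer(g_A_hat, f_hat))   # equation 2.31

H = np.fft.fftshift(np.fft.fft2(h_ds[::-1, :]))
n_x, n_t = H.shape
q1 = H[n_x // 2 + 1:, n_t // 2 + 1:]    # one direction of spectral motion
q2 = H[n_x // 2 + 1:, : n_t // 2]       # the opposite direction
print(np.linalg.matrix_rank(h_ds))                          # prints 2
print(np.sum(np.abs(q1)**2) / np.sum(np.abs(q2)**2))        # far from 1: directional
```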
2.12 Counterexamples

After the preceding examples, one might be tempted to believe that all rank 2 STRFs are also quadrant separable, but this can be shown false with a simple counterexample. The form of equation 2.28 demonstrates how difficult it is to make a quadrant-separable transfer function by combining fully separable inputs: half of the terms are required to be very specific functionals of the other half. Any departure leads to total inseparability. For example, an STRF that
is the linear sum of two separable STRFs,

$$h_{R2}(t, x) = h^A_{FS}(t, x) + h^B_{FS}(t, x), \tag{2.32}$$
has a spectrotemporal modulation transfer function that is the linear sum of two separable spectrotemporal modulation transfer functions:

$$H_{R2}(w, \Omega) = H^A_{FS}(w, \Omega) + H^B_{FS}(w, \Omega). \tag{2.33}$$
This is not, in general, the form of a quadrant-separable spectrotemporal modulation transfer function. As an example, the sum of two (almost) identical, fully separable STRFs, whose only difference is that one is translated spectrally and temporally with respect to the other, gives an STRF that is strongly velocity selective and is not quadrant separable. This is demonstrated in Figure 6b, where the MTF_ST clearly is not quadrant separable. This proves that the STRF in Figure 6a cannot be temporally symmetric. It is also simple to show an example of an STRF that is the sum of two temporally symmetric STRFs but itself is not temporally symmetric. As an example, the sum of two (almost) identical, temporally symmetric STRFs, whose only difference is that one is translated spectrally and temporally with respect to the other, gives an STRF that has rank 4, not the rank of 2 required by temporal symmetry. This is demonstrated in Figures 6c and 6d.

2.13 Surgery and Animal Preparation

Data were collected from 11 domestic ferrets (Mustela putorius) supplied by Marshall Farms (Rochester, NY). Eight of these ferrets were anesthetized during recording; full procedural details of the surgery are provided in Shamma, Fleshman, Wiser, and Versnel (1993). These ferrets were anesthetized with sodium pentobarbital (40 mg/kg) and maintained under deep anesthesia during the surgery. Once the recording session started, a combination of ketamine (8 mg/kg/hr), xylazine (1.6 mg/kg/hr), atropine (10 µg/kg/hr), and dexamethasone (40 µg/kg/hr) was given throughout the experiment by continuous intravenous infusion, together with dextrose, 5% in Ringer solution, at a rate of 1 cc/kg/hr, to maintain metabolic stability. The ectosylvian gyrus, which includes the primary auditory cortex, was exposed by craniotomy, and the dura was reflected. The contralateral ear canal was exposed and partly resected, and a cone-shaped speculum containing a miniature speaker (Sony MDR-E464) was sutured to the meatal stump. The remaining three ferrets were used for awake recordings, with full surgical procedural details in Fritz et al. (2003). In these experiments, ferrets were habituated to lie calmly in a restraining tube for periods of up to 4 to 6 hours. A head-post was surgically implanted on the ferret’s skull (anesthetized with sodium pentobarbital, 40 mg/kg, and maintained under deep
anesthesia during the surgery) and used to hold the animal’s head in a stable position during the daily neurophysiological recording sessions. All experimental procedures were approved by the University of Maryland Animal Care and Use Committee and were in accord with NIH guidelines.

2.14 Recordings, Spike Sorting, and Selection Criteria

Action potentials from single units were recorded using tungsten microelectrodes with 5 to 7 MΩ tip impedances (at 1 kHz). In each animal, electrode penetrations were made orthogonal to the cortical surface. In each penetration, cells were typically isolated at depths of 350 to 600 µm, corresponding to cortical layers III and IV (Shamma et al., 1993). In four anesthetized animals, neural signals were fed through a window discriminator, and the time of spike occurrence relative to stimulus delivery was stored using a computer. In the other seven animals, the original neural electrical signals were stored for further processing off-line. Using Matlab software designed in-house, action potentials were then manually classified as belonging to one or more single units, and the spike times for each unit were recorded. The action potentials assigned to a single class met the following criteria: (1) the peaks of the spike waveforms exceeded four times the standard deviation of the entire recording; (2) each spike waveform was less than 2 ms in duration and consisted of a clear, positive deflection followed immediately by a negative deflection; (3) the spike waveform classes were not visibly different from each other in amplitude, shape, or time course; (4) the histogram of interspike intervals evidenced a minimum time between spikes (refractory period) of at least 1 ms; and (5) the spike activity persisted throughout the recording session. This procedure occasionally produced units with very low spike counts. After consulting the distribution of spike counts for all units, units that fired less than half a spike per second were excluded from further analysis, since a neuron with such a low spike rate requires longer stimulus durations to analyze.

2.15 Stimuli and STRF Measurement

The stimuli used were temporally orthogonal ripple combinations (TORCs), as described by Klein et al. (2006). TORCs are more complex than individually presented dynamic ripples, which are instances of bandpassed noise whose spectral and temporal envelopes are cosinusoidal and can be thought of as auditory analogs of the drifting contrast gratings used in vision studies (Shamma & Versnel, 1995; Shamma et al., 1995). The spectrotemporal envelope of a TORC is composed of sums of the spectrotemporal envelopes of temporally orthogonal dynamic ripples. Temporally orthogonal means that no two ripple components of a given stimulus share the same temporal modulation rate (their temporal correlation is zero); therefore, each component evokes a different frequency in the linear portion of the response. Each TORC spectrotemporal envelope is composed from six dynamic ripples having the same spectral density (in cyc/oct) but different w spanning the range of 4 to 24 Hz. In the
reverse-correlation operation, the 4 Hz response component is orthogonal to all stimulus components besides the 4 Hz ripple, the 8 Hz component is correlated only with the 8 Hz ripple, and so on. Fifteen distinct TORC envelopes are presented, with spectral density ranging from −1.4 cycles per octave to +1.4 cycles per octave in steps of 0.2 cycles per octave. Each of those 15 TORCs is then presented again but with the reverse polarity of its spectrotemporal envelope (the inverse-repeat method) to remove systematic errors due to even-order nonlinearities (Klein et al., 2006; Moller, 1977; Wickesberg & Geisler, 1984). Multiple sweeps were presented for each stimulus. Sweeps of different stimuli, separated by 3 to 6 s of silence, were presented in a pseudorandom order, until a neuron was exposed to between 55 and 110 periods (13.75–27.5 s) of each stimulus. All stimuli had an 8 ms rise and fall time. All stimuli were gated and fed through an equalizer into the earphone. Calibration of the sound delivery system (to obtain a flat frequency response up to 20 kHz) was performed in situ with the use of a 1/8 in Brüel & Kjær 4170 probe microphone. In the anesthetized case, the microphone was inserted into the ear canal through the wall of the speculum to within 5 mm of the tympanic membrane; the speculum and microphone setup closely resembles that suggested by Evans (1979). In the awake case, the stimuli were delivered through inserted earphones that were calibrated in situ at the beginning of each experiment. STRFs were measured by reverse correlation, that is, spike-triggered averaging (Klein, Depireux, Simon, & Shamma, 2000; Klein et al., 2006). Only the sustained portions of the responses were analyzed: the first 250 ms of poststimulus onset response was not used. To ensure reliable estimates, neurons with STRFs whose estimated signal-to-noise ratio was worse than 2 were excluded (Klein et al., 2006).

3 Results

3.1 Temporal Properties of the STRF

In an arbitrary STRF, the spectral and temporal dimensions of the response are not necessarily related in any way. For example, the temporal cross-sections (impulse responses) at different frequencies (x) need not be systematically related to each other in any specific manner. However, Figure 7 illustrates an unanticipated result that we found to be prevalent in our data: all temporal cross-sections of a given STRF are related to each other by a simple scaling and rotation of the same time function. For example, if we designate the impulse response at the best frequency (1.2 kHz) to be the function f(t) = f^{θ=0}(t) = f^0(t), then the cross-section at 0.72 kHz is approximately its scaled inverse (≈ −0.50 f(t) = 0.50 f^π(t)); at 0.92 kHz, it is the scaled and lagged version (≈ 0.33 f^{π/2}(t)). A schematic depiction of this STRF response property is shown in Figure 7. This property was defined above, in equation 2.18, to be temporal symmetry. In the following section, we quantify and demonstrate the existence of this response property in almost all of our neurons.
In general, a pure scaling (i.e., no rotation, θ = 0) among the temporal cross-sections of the STRF is expected if the STRF is fully separable, that is, if it can be decomposed into one product of a temporal and a spectral function: h_FS(t, x) = f(t)g(x). However, only 50% of our cells can be considered fully separable in the awake population and 67% in the anesthetized population; the remainder are all not fully separable. Consequently, this highly constraining and ubiquitous relationship involving a scaling and a rotation, described mathematically by equation 2.18, must imply another basic characteristic of cell responses, as we discuss next.

3.1.1 Rank and Temporal Symmetry in AI

To examine spectrotemporal interactions, we applied SVD analysis to all STRFs derived from AI neurons in our experiments (see Klein et al., 2006, for the method), with results in Table 1. In the awake recordings, the STRF rank was found to be best approximated as rank 1 or 2 for 98% of the neurons, and in the anesthetized case, 97%. That is, the STRF is of the form of either equation 2.3 or equation 2.4. In general, an STRF need not be of such a low rank at all. For example, Figure 1b shows an otherwise plausible simulated high-rank STRF. Does low rank reflect a simplicity to the underlying neural circuitry? In the awake recordings, STRFs of rank 1 constituted 50% of all neurons; in the anesthetized recordings, STRFs of rank 1 constituted 67% of all neurons. These STRFs are fully separable, and hence all temporal cross-sections of a given STRF are automatically related by a simple scaling. For rank 2 STRFs (awake 48%, anesthetized 30%), however, there is no such relationship. In fact, for a rank 2 STRF, expressed as equation 2.4, there is no mathematical need for any particular relationship between its temporal cross-sections, but physiological evidence provides some.
Figure 7: Temporal symmetry demonstrated in the experimentally measured STRFs of two example neurons. (a) A rank 2 (not fully separable) STRF and five fixed-frequency cross-sections, giving five temporal impulse responses. (b) The same five impulse responses but Hilbert-rotated and rescaled to have power equal to that of the impulse response at the best frequency (left panel), and the same Hilbert-rotated and rescaled impulse responses superimposed (right panel). The temporal symmetry index for this neuron is η_t = 0.90. (c) Another rank 2 STRF and five cross-sections (left panel), and the corresponding five impulse responses, Hilbert-rotated, rescaled to have equal power, and superimposed. This neuron’s temporal symmetry index is η_t = 0.73, illustrating that the temporal symmetry index need not be overly close to unity to demonstrate temporal symmetry (single units/awake: D2-4-03-p-c.a1-3, R2-6-03-p-2-a.a1-2).
Table 1: Population Distributions of STRFs Recorded from All Neurons, According to Animal State (Awake/Anesthetized), STRF Rank, and Temporal Symmetry.

                                                  Awake              Anesthetized
                                             Number  Percent      Number  Percent
Rank 1                                          72      50           49      67
Rank 2                                          70      48           22      30
  Temporally symmetric                          51                   21
  Nontemporally symmetric                       19                    1
Rank 3                                           3       2            2       3
Total                                          145     100           73     100
Rank 1 + rank 2 temporally symmetric           123      85           70      96
Rank 2 nontemporally symmetric + rank 3         22      15            3       4
Total                                          145     100           73     100
temporal symmetry index defined in equation 2.21. A temporal symmetry index near 1 means that $f_A(t)$ and $f_B(t)$ are not arbitrary, but instead are closely related by a Hilbert transform. Comparing equation 2.4 and the last line of equation 2.18, we see that we must have

$$f_B(t) = \hat{f}_A(t). \qquad (3.1)$$
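Numerically, the rank 2 truncation and the comparison of $f_B(t)$ against $\hat{f}_A(t)$ reduce to a few lines of SVD. The sketch below is a hedged reconstruction: we assume the temporal symmetry index of equation 2.21 is, up to normalization details, the magnitude of the inner product between the unit-norm $f_B(t)$ and the unit-norm Hilbert transform of $f_A(t)$; the exact definition in the paper may differ:

```python
import numpy as np
from scipy.signal import hilbert

def temporal_symmetry_index(strf):
    """SVD of an STRF (rows = frequency x, columns = time t), followed by a
    comparison of the second temporal singular vector against the Hilbert
    transform of the first (cf. equations 2.21 and 3.1)."""
    u, s, vt = np.linalg.svd(strf, full_matrices=False)
    f_a, f_b = vt[0], vt[1]            # unit-norm temporal cross-sections
    f_a_hat = np.imag(hilbert(f_a))    # Hilbert transform of f_A
    f_a_hat /= np.linalg.norm(f_a_hat)
    return np.abs(np.dot(f_b, f_a_hat))
```

For a temporally symmetric STRF the index is near 1; for a rank 2 STRF whose two temporal functions are unrelated, it is generically much smaller.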
Temporal symmetry in an STRF is an extremely restrictive property. SVD guarantees that $f_B(t)$ be orthogonal to $f_A(t)$, but does not restrict its frequency content in any substantive way (Stewart, 1990, 1991, 1993). $f_B(t) = \hat{f}_A(t)$ is special since $\hat{f}_A(t)$ is the only function orthogonal to $f_A(t)$ that has the same frequency content as $f_A(t)$. In this sense, there is only one time function in the STRF, the one characterized by $f_A(t)$. This is not the case in the spectral dimension, where no single special spectral function is picked out: $g_A(x)$ and $g_B(x)$ are mathematically and physiologically unconstrained, and the population distribution for the analogous spectral symmetry index in Figure 8b shows no such close relationship. There is no evidence for spectral symmetry. Note also that fully separable (rank 1) neurons are not included in the population shown in Figure 8, since they are automatically temporally symmetric and spectrally symmetric.

Alternatively, we can begin with the STRF in the form of equation 2.17 and explain why the temporal cross-sections in our STRFs exhibit the specific relationship depicted in Figure 7. The impulse response at any $x$ can be thought of as a linear combination of $f_A(t)$ and $\hat{f}_A(t)$, which by equation 2.18 is always a scaled version of a Hilbert-rotated $f_A(t)$.

Another test for the presence of temporal symmetry arises from comparing STRFs approximated by two different means: the first two terms of the singular value expansion versus the first term of the quadrant-separable
Figure 8: The population distributions of the symmetry indices for all neurons with rank 2 (not fully separable) STRFs: awake $N = 70$ (black) and anesthetized $N = 22$ (gray). (a) The population distributions of the temporal symmetry index. The populations are heavily biased toward high temporal symmetry; cf. the temporal symmetry indices of the STRFs in Figure 7, which range from 0.73 to 0.90. In the awake population, 51 neurons had a temporal symmetry index greater than 0.65, as did 20 in the anesthetized population. (b) For comparison, the same statistics but for spectral symmetry. The population is spread over the full range of values, despite potential tonotopic arguments for a narrow distribution near 1.
expansion in singular values (see Klein et al., 2006, for details). The former (rank 2 truncation) is of rank 2 by construction. The latter (quadrant-separable truncation), since it is quadrant separable by construction, should be of rank 4 unless the STRF has a symmetry. Since there is no mathematical reason they should give the same result, the near-unity correlation coefficient between the two shown in Figure 9a is experimental evidence that they are identical up to measurement error: the quadrant-separable STRF is actually of rank 2 and therefore possesses a symmetry. Quadrant-separable STRFs with rank less than 4 must be temporally symmetric, spectrally symmetric, or directionally selective. The results shown in Figure 8b rule out spectral symmetry, and the analogous analysis for the index of directional selectivity (not shown) rules out directional selectivity. Therefore, the STRFs' symmetry must be temporal symmetry.

For comparison, Figure 9b shows two other distributions. On the right is the same distribution as above, but for rank 4 truncations instead of rank 2. The correlations decrease, indicating that rank 2 (and hence temporal symmetry) is a better estimate than rank 4 (generic quadrant-separable STRFs are of rank 4, and only symmetric STRFs are of rank 2). The change must be small, since SVD orders contributions to the rank by decreasing power, but it need not have been negative. In the center of Figure 9b is the distribution of the same quantity as in Figure 9a, but with the STRFs permuted: the rank 2 truncation of each STRF is correlated with the quadrant-separable
Figure 9: (a) The population distributions of the correlations for all rank 2 neurons between their rank 2 and quadrant-separable estimates: awake (black) and anesthetized (gray). The high correlations indicate that the quadrant-separable truncation may be of rank 2, evidence of temporal symmetry. (b) For comparison, the distributions of the comparable correlations. (Right) The population distributions of the correlations for all neurons between their rank 4 and quadrant-separable estimates: awake (black), and anesthetized (gray). The correlations become worse for both populations, despite the fact that generic quadrant-separable STRFs are of rank 4, providing more evidence that the quadrant-separable STRFs are of rank 2, implying temporal symmetry. (Center) The populations of permuted STRFs: the rank 2 estimate of each STRF is correlated with the quadrant-separable estimate of every other STRF: awake (dark gray hash) and anesthetized (light gray hash). The population is scaled by the ratio of the number of STRFs to the number of permuted STRF pairs.
truncation of every other STRF. This distribution is broadly peaked around 0 (with the population scale normalized to be the same as in Figure 9a). This demonstrates that the skewness of the population toward unity in Figure 9a is not due to potentially confounding factors, such as the STRFs' dominant power in the first ∼100 ms, or having rank 2, or being quadrant separable.

Thus, the results shown in Figures 8 and 9 provide evidence that all of these rank 2 neurons are indeed temporally symmetric. Temporal symmetry, however, is highly restrictive. There are many ways to obtain a rank 2 STRF and only a very small subset of them is temporally symmetric, yet almost all AI rank 2 STRFs are. This finding must be a consequence of a fundamental anatomical and physiological constraint on the way AI units are driven by
the spectrotemporally dynamic stimuli. In the remainder of this article, we demonstrate the simplest possible explanations that can give rise to these observed temporal symmetry properties.

3.2 Implications and Interpretation of Temporal Symmetry for Neural Connectivity and Thalamic Inputs to AI. Neurons in AI receive thalamic inputs from ventral MGB, via layers III and IV (see Read, Winer, & Schreiner, 2002, for a recent review; Smith & Populin, 2001). In this section, we analyze the effects of the constraints of temporal symmetry on the temporal and spectral components of the thalamic inputs to an AI neuron. We are not analyzing the cells with fully separable STRFs directly, but it will be shown below that their analysis proceeds almost identically. Note again that the linear equations used in this analysis do not assume that all processing of inputs is linear; rather, they assume that the linear component of the processing is strong and robust, in addition to whatever nonlinear components of the processing exist.

The analysis below shows that there are physiologically reasonable models consistent with temporally symmetric neurons in and throughout AI. The models presented here require two features: that the STRFs of the thalamic inputs be fully separable and that some of the thalamic inputs be lagged (phase shifted), whether at the output of the thalamic neurons themselves or at their synapses onto AI neurons. Ventral MGB neurons possess STRFs consistent with being fully separable (Miller et al., 2002; Miller, Escabi, & Schreiner, 2001; L. Miller, personal communication, October 1999), but there has been no systematic study. Lagged neurons have not been reported in ventral MGB, though they exist in visual thalamus (Saul & Humphrey, 1990), and other lagging mechanisms that may be present in the auditory system are discussed below. Conversely, it is very difficult to construct physiologically reasonable models consistent with temporally symmetric neurons in AI without these two features. We know of no such models, and we were not able to construct any. From this, we are forced to predict that these two features will be found. Independently of whether this occurs, however, some explanation is still needed for the temporal symmetry displayed so strongly in AI, and in this section we provide a reasonable basis.

3.2.1 Simplistic Model of Thalamic Inputs. The last line of equation 2.18,

$$h_{TS}(t,x) = f_A(t)\, g_A(x) + \hat{f}_A(t)\, g_B(x), \qquad (3.2)$$
explicitly demonstrates that the temporally symmetric STRF is rank 2. A simplistic interpretation of equation 3.2 is that a cell with a temporally symmetric STRF has two fully separable inputs, for example, two cells in ventral MGB. Each of those two input cells has the same temporal
Figure 10: Schematics depicting simple models represented by equations 3.2 to 3.11. (a) Summing the inputs of two fully separable (FS) cells whose temporal functions are in quadrature ($\theta = 0$, $\theta = \pi/2$) results in a temporally symmetric (TS) cell (see equation 3.2). (b) Summing the inputs of two fully separable (FS) cells whose temporal functions are in partial quadrature ($\theta = 0$, $\theta \neq 0$) still results in a temporally symmetric (TS) cell (see equation 3.3). (c) Summing the inputs of many fully separable (FS) cells whose temporal functions are in partial quadrature still results in a temporally symmetric (TS) cell (see equation 3.6). (d) Summing the inputs of many fully separable (FS) cells whose temporal functions are in partial quadrature, with differing impulse responses (at high frequencies), into a neuron whose somatic impulse response is slow, results in a temporally symmetric (TS) cell (see equation 3.12). (Inset) A spectral schematic of the low-pass nature of the slow somatic impulse response: $K_A(f)$ is the Fourier transform of $k_A(t)$.
processing behavior except that one, whose temporal processing is characterized by $\hat{f}_A(t)$, is the Hilbert transform of (lagged with respect to) the other, whose temporal processing is characterized by $f_A(t)$. Mathematically, $\hat{f}(t)$ is in quadrature with $f(t)$. The two inputs may have different spectral distributions of their own inputs or different synaptic weights as inputs to the cortical cell: $g_A(x) \neq g_B(x)$. This is shown schematically in Figure 10a. (It is
also consistent with this model that $g_A(x)$ could equal $g_B(x)$, but such a case reduces to the simple instance of a fully separable STRF.) This interpretation is mathematically concise and explicitly demonstrates that the temporally symmetric STRF is rank 2. This is the maximally reduced form, and one may proceed to expand the form to demonstrate other forms that its inputs are permitted to take, for example, allowing many inputs from thalamic or even intracortical connections.

The goal of this section is to relax the strict decomposition implied by equation 3.2 (and imposed arbitrarily by SVD), to rewrite it in a form that allows for physiologically reasonable inputs and physiologically reasonable somatic processing, and to analyze the restrictions that are imposed on those inputs. The more realistic decompositions below will allow us to reasonably model the thalamic inputs and the somatic processing by identifying them with terms in the decomposition, in a substantially more realistic interpretation than the simple interpretations presented above. The more realistic decompositions have an additional role as well: they contrast with models or decompositions that, while appearing physiologically reasonable otherwise, conflict with the data and so can now be ruled out.

3.2.2 Generalized Simplistic Model of Thalamic Inputs. First, the severity of the full Hilbert transform can be relaxed. The model decomposition can use a partial Hilbert transform $f^\theta(t)$ (see equation 2.12) instead of the full Hilbert transform. In equation 3.3, the first term of the decomposition has some temporal impulse response $f_A(t)$ and an arbitrary spectral response field $g_C(x)$. The second term has an impulse response $f_A^\theta(t)$ that is some Hilbert rotation of the first impulse response, and an arbitrary spectral response field $g_D(x)$:

$$h_{TS}(t,x) = f_A(t)\, g_C(x) + f_A^\theta(t)\, g_D(x), \qquad (3.3)$$
which is equivalent to the previous decomposition in equation 3.2:

$$
\begin{aligned}
h(t,x) &= f_A(t)\, g_C(x) + f_A^\theta(t)\, g_D(x) \\
&= f_A(t)\, g_C(x) + \left( \sin\theta\, \hat{f}_A(t) + \cos\theta\, f_A(t) \right) g_D(x) \\
&= f_A(t) \underbrace{\left( g_C(x) + \cos\theta\, g_D(x) \right)}_{g_A(x)} + \hat{f}_A(t) \underbrace{\sin\theta\, g_D(x)}_{g_B(x)} \\
&= f_A(t)\, g_A(x) + \hat{f}_A(t)\, g_B(x) = h_{TS}(t,x), \qquad (3.4)
\end{aligned}
$$
for any θ different from zero. This is physiologically more relevant since a partial (θ < 90 degrees) Hilbert transform may be a simpler operation for a neural process to perform than a full transform. The physiological
interpretation of the first line of equation 3.4 is that the two temporal functions of the independent input classes need not be related by a full Hilbert transform (fully lagged cell); it is sufficient that one be a Hilbert rotation of the other (partially lagged cell). This is shown schematically in Figure 10b.

It should be noted that exact Hilbert rotations for any $\theta \neq 0$ are ruled out by causality: it can be shown (e.g., Papoulis, 1987) that the Hilbert transform of a causal filter is acausal. Therefore, we would never expect an exact Hilbert transform (or exact temporal quadrature), only an approximate Hilbert transform. This is exemplified in the visual system, where the Hilbert transform behavior of lagged cells is only a good approximation in the frequency band 1 to 16 Hz (Saul & Humphrey, 1990). In the auditory system we expect similar behavior: the (partial or full) Hilbert rotation will be performed accurately only in the relevant frequency band.

To restate, AI models that attempt a decomposition of the form

$$h(t,x) = f_A(t)\, g_C(x) + f_B(t)\, g_D(x) \qquad (3.5)$$

where $f_B(t) \neq f_A^\theta(t)$ for any $\theta$ are ruled out. This is because they violate the temporal symmetry property found in our STRFs. An example of a rank 2 STRF that is of the form of equation 3.5 but with $f_B(t) \neq f_A^\theta(t)$ was shown in Figure 6a. It is not temporally symmetric and is therefore ruled out as a model.

3.2.3 Multiple Input Model. Continuing to add more physiological realism to the decomposition, we can allow each of the two pathways to be made of multiple inputs as long as their temporal structure is related, shown schematically in Figure 10c. In equation 3.6, the first sum in the decomposition is made up of a number of individual input components ($m = 1, \ldots, M$) that all have the same temporal impulse response $f_A(t)$ but may have individual spectral response fields $g_{C_m}(x)$. The second sum is made up of a number of individual input components ($n = 1, \ldots, N$) that each may have a different temporal impulse response $f_A^{\theta_n}(t)$, but all are given by some individual Hilbert rotation ($\theta_n$) of the initial impulse response, with individual spectral response fields $g_{D_n}(x)$:

$$h_{TS}(t,x) = \sum_{m=1}^{M} f_A(t)\, g_{C_m}(x) + \sum_{n=1}^{N} f_A^{\theta_n}(t)\, g_{D_n}(x). \qquad (3.6)$$
This decomposition is leading toward one based on the many neural inputs (i.e., N + M, which may be a very large number), each of which may have different spectral response fields and different Hilbert rotations, but must be related in their temporal structure. This is equivalent to the previous
decomposition or to that in equation 3.2:

$$
\begin{aligned}
h(t,x) &= \sum_{m=1}^{M} f_A(t)\, g_{C_m}(x) + \sum_{n=1}^{N} f_A^{\theta_n}(t)\, g_{D_n}(x) \\
&= f_A(t) \sum_{m=1}^{M} g_{C_m}(x) + f_A(t) \sum_{n=1}^{N} \cos\theta_n\, g_{D_n}(x) + \hat{f}_A(t) \sum_{n=1}^{N} \sin\theta_n\, g_{D_n}(x) \\
&= f_A(t) \underbrace{\left( \sum_{m=1}^{M} g_{C_m}(x) + \sum_{n=1}^{N} \cos\theta_n\, g_{D_n}(x) \right)}_{g_A(x)} + \hat{f}_A(t) \underbrace{\sum_{n=1}^{N} \sin\theta_n\, g_{D_n}(x)}_{g_B(x)} \\
&= f_A(t)\, g_A(x) + \hat{f}_A(t)\, g_B(x) = h_{TS}(t,x). \qquad (3.7)
\end{aligned}
$$
The physiological interpretation of equation 3.6 is that the cortical cell may receive inputs from many thalamic cells, not just two, and those many cells need not be identical to each other. Nor do we require that the spectral response fields be related in any way, though they may be (they will likely have similar best frequencies due to the tonotopic organization of the auditory system, but they may have different spectral response field shapes). The inputs have been broken into two groups corresponding to the two terms of equation 3.3. The first input group consists of M ordinary unlagged inputs. The second input group consists of N lagged inputs. The phase lags may all be the same, or they may be different. To restate, we conclude that models that attempt a decomposition of the form

$$h_{TS}(t,x) = \sum_{m=1}^{M} f_A(t)\, g_{C_m}(x) + \sum_{n=1}^{N} f_{B_n}(t)\, g_{D_n}(x) \qquad (3.8)$$

where $f_{B_n}(t) \neq f_A^{\theta_n}(t)$ for any $\theta_n$ are ruled out. Such models implicitly violate the temporal symmetry property.

3.2.4 Multiple Fast Input and Slow Output Model. There is another physiologically motivated relaxation we can allow in our progressive
decomposition, to allow the thalamic temporal inputs to not be so carefully matched, as long as the differences do not enter into the cortical computation (e.g., at high frequencies). For instance, we note that the temporal processing of the cell possessing this STRF is well characterized as the convolution of its faster presomatic (multiple input) stage and its slower somatic (aggregate processing) stage. We can thus separate the entire temporal component of the cortical processing, $f(t)$, which is a temporal impulse response, into the convolution of impulse responses from a presomatic stage, $k(t)$, and a somatic stage, $k_A(t)$:

$$f(t) = k(t) * k_A(t). \qquad (3.9)$$
The presomatic stage may contain high frequencies (e.g., the output of thalamic ventral MGB neurons) that will not be passed through the slower soma. In filter terminology, the somatic stage is a much lower-frequency bandpass filter than the presomatic stage (the soma acts as an integrator, which is a form of low-pass filtering). The convolution operation is used since the filters (characterized by impulse responses) are in series. The reason for making this separation is that we can allow each input component $f_i(t)$ to be different as long as the difference is zero when passed through the somatic component (Yeshurun, Wollberg, Dyn, & Allon, 1985),

$$\left( k_i(t) - k_j(t) \right) * k_A(t) = 0 \quad \text{for all } i, j, \qquad (3.10)$$

$$k_i(t) * k_A(t) = k_j(t) * k_A(t) = f_A(t) \quad \text{for all } i, j, \qquad (3.11)$$
and still obtain a temporally symmetric STRF. Physiologically, we are allowing the thalamic inputs to have greatly differing temporal impulse responses, so long as all the differences appear on timescales significantly faster than the somatic processing timescale (Creutzfeldt, Hellweg, & Schreiner, 1980; Rouiller, de Ribaupierre, Toros-Morel, & de Ribaupierre, 1981):

$$h_{TS}(t,x) = \left[ \sum_{m=1}^{M} k_{A_m}(t)\, g_{C_m}(x) + \sum_{n=1}^{N} k_{A_n}^{\theta_n}(t)\, g_{D_n}(x) \right] * k_A(t). \qquad (3.12)$$
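The claim behind equations 3.10 to 3.12, that input differences confined to fast timescales are erased by the slow somatic filter, can be illustrated directly. In this sketch the somatic time constant, the input kernels, and the 500 Hz disparity between them are all invented for illustration:

```python
import numpy as np

dt = 0.0001                        # 10 kHz sampling
t = np.arange(0, 0.256, dt)

# A slow somatic low-pass stage (20 ms time constant, an invented value).
k_soma = np.exp(-t / 0.020)
k_soma /= k_soma.sum()

# Two thalamic impulse responses that differ only at fast timescales.
k_i = np.exp(-t / 0.004)
k_j = k_i + 0.5 * np.sin(2 * np.pi * 500 * t) * np.exp(-t / 0.004)

f_i = np.convolve(k_i, k_soma)[: len(t)]
f_j = np.convolve(k_j, k_soma)[: len(t)]

# The residual difference after the somatic stage is strongly suppressed,
# approximating equation 3.10.
print(np.max(np.abs(f_i - f_j)) / np.max(np.abs(f_i)))
```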
In equation 3.12, the decomposition is broken up into a presomatic input stage and a somatic processing stage, shown schematically in Figure 10d. The somatic processing stage is the final convolution with the somatic (low-pass) temporal impulse response $k_A(t)$. The input stage looks very similar to the previous decomposition, except that now all the individual input components may have temporal responses that differ from each other, as long as the differences are confined to high frequencies (fast timescales) and the components remain identical at low frequencies (slow timescales). This is equivalent to
the previous decomposition or to that in equation 3.2:

$$
\begin{aligned}
h(t,x) &= \left[ \sum_{m=1}^{M} k_{A_m}(t)\, g_{C_m}(x) + \sum_{n=1}^{N} k_{A_n}^{\theta_n}(t)\, g_{D_n}(x) \right] * k_A(t) \\
&= \sum_{m=1}^{M} \left( k_{A_m}(t) * k_A(t) \right) g_{C_m}(x) + \sum_{n=1}^{N} \left( k_{A_n}^{\theta_n}(t) * k_A(t) \right) g_{D_n}(x) \\
&= \sum_{m=1}^{M} f_A(t)\, g_{C_m}(x) + \sum_{n=1}^{N} f_A^{\theta_n}(t)\, g_{D_n}(x) \\
&= h_{TS}(t,x). \qquad (3.13)
\end{aligned}
$$
Note that the Hilbert rotation of the input stage $k(t)$ is conferred on the entire temporal filter $f(t)$, since

$$
\begin{aligned}
k^\theta(t) * k_A(t) &= \left( \cos\theta\, k(t) + \sin\theta\, \hat{k}(t) \right) * k_A(t) \\
&= \cos\theta \left( k(t) * k_A(t) \right) + \sin\theta \left( \hat{k}(t) * k_A(t) \right) \\
&= \cos\theta\, f_A(t) + \sin\theta\, \hat{f}_A(t) = f_A^\theta(t), \qquad (3.14)
\end{aligned}
$$
which uses the Hilbert transform property

$$\hat{k}_1(t) * k_2(t) = H[k_1(t)] * k_2(t) = H[k_1(t) * k_2(t)] = \hat{k}_{12}(t), \qquad (3.15)$$
where $k_1(t) * k_2(t) = k_{12}(t)$. That is, once lagging is performed in the thalamus, the effects of lagging continue up the pathway. The physiological interpretation of equation 3.12 is that the thalamic inputs are quite unrestricted in their similarity. We do not require that all thalamic inputs have the same temporal impulse response, plus Hilbert rotations of that same temporal impulse response, as would be implied by equation 3.6. We require only that the low-frequency components of the thalamic impulse responses be the same and that there be partial lagging in some subset of the population. High-frequency components of the temporal impulse responses, and all aspects of the spectral response fields, are completely unrestricted. In fact, high-frequency components of the temporal impulse responses need not even be lagged, since they are filtered out.
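Equation 3.15 is the key step, and it is easy to verify numerically: the Hilbert transform commutes with convolution, so lagging applied in the thalamus survives subsequent filtering. A small check, using circular convolution so that the FFT-based identity is exact:

```python
import numpy as np
from scipy.signal import hilbert

rng = np.random.default_rng(1)
n = 1024
k1 = rng.standard_normal(n)
k2 = rng.standard_normal(n)

def H(x):
    """Discrete Hilbert transform: imaginary part of the analytic signal."""
    return np.imag(hilbert(x))

def circ_conv(a, b):
    return np.real(np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)))

lhs = circ_conv(H(k1), k2)        # k1_hat * k2
rhs = H(circ_conv(k1, k2))        # H[k1 * k2], as in equation 3.15
print(np.max(np.abs(lhs - rhs)))  # ~1e-12: the two sides agree
```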
To restate, models that attempt a decomposition of the form

$$h_{TS}(t,x) = \left[ \sum_{m=1}^{M} k_{A_m}(t)\, g_{C_m}(x) + \sum_{n=1}^{N} k_{A_n}^{\theta_n}(t)\, g_{D_n}(x) \right] * k_A(t) \qquad (3.16)$$

where $k_i(t) * k_A(t) \neq k_j(t) * k_A(t)$ for some $i, j$ are ruled out.

From the signal processing point of view, because all the thalamic inputs $k_i(t)$ and $k_j(t)$, as impulse responses, are so fast compared to the slow somatic filter $k_A(t)$ they are convolved with, they are indistinguishable from a pure impulse $\delta(t)$. More rigorously, in the frequency domain, the thalamic input transfer functions (Fourier transform representations of the thalamic input impulse responses) change only gradually over the typical range of the somatic frequencies (a few tens of Hz). A transfer function with approximately constant amplitude and constant phase is approximately the impulse function $\delta(t)$. That the thalamic inputs may also contain delays does not change the result: a transfer function with approximately constant amplitude and linear phase is approximately the delayed impulse function $\delta(t - t_0)$. Even differential delays among inputs can be tolerated if the differences are not too great, since a relative delay shift of 2 ms changes phase by only 4% of a cycle at 20 Hz, and less at slower frequencies. For instance, differential delays of a few milliseconds that arise from different thalamic inputs or from different dendritic branches of a cortical neuron are too small to break the temporal symmetry.

Let us summarize the implications of temporal symmetry for thalamic inputs to AI: the thalamic inputs are almost unrestricted, especially in their spectral support, but they must have the same low-frequency temporal structure (e.g., approximately constant amplitude and linear phase up to a few tens of Hz). Any lagged inputs (with lag significantly different from zero, but not necessarily near 90 degrees) will cause the cortical STRF to lose full separability but maintain temporal symmetry. This is a prediction of the reasonably simple but still physiologically plausible models presented above. There are no constraints imposed by temporal symmetry on the spectral response fields of the thalamic inputs (despite tonotopy in both ventral MGB and AI), nor are there any constraints on the individual thalamic inputs' high-frequency (fast) impulse response components. The main assumptions of the models are that the thalamic inputs are fully separable and that the soma in AI includes a low-pass filter.

3.3 Implications and Interpretation of AI STRF Temporal Properties for Neural Connectivity and Intracortical Inputs to AI. Neurons in AI receive their predominant thalamic (ventral MGB) input in layers III and IV (Smith & Populin, 2001). Other neurons in a given local circuit receive primarily intracortical inputs, and neurons receiving thalamic inputs receive intracortical inputs as well. Temporal symmetry of the STRF places
strong constraints on the nature of these intracortical inputs as well. Unlike the thalamic inputs, which are assumed above to be fully separable, the STRFs of cortical inputs are assumed not to be fully separable, but they do have temporal symmetry. In this section, we explore the constraints on the temporal and spectral components of the intracortical inputs to an AI neuron.

3.3.1 Adding Cortical Feedforward Connections. The first generalization is that we allow the temporally symmetric AI neuron to project to another AI neuron, which itself receives no other input, shown schematically in Figure 11a. This second neuron is then also temporally symmetric, and its STRF is given by the STRF of its input, convolved with the intrinsic impulse response of the second neuron:

$$h_{TS}(t,x) = \left[ \sum_{m=1}^{M} k_{A_m}(t)\, g_{C_m}(x) + \sum_{n=1}^{N} k_{A_n}^{\theta_n}(t)\, g_{D_n}(x) \right] * k_A(t) * k_2(t). \qquad (3.17)$$

Before we show the next generalization that maintains temporal symmetry, it is instructive to see a model generalization that does not maintain temporal symmetry and so is disallowed by the physiology. The sum of two temporally symmetric STRFs is not, in general, temporally symmetric:

$$
\begin{aligned}
h_1^{TS}(t,x) &= f_{A_1}(t)\, g_{A_1}(x) + \hat{f}_{A_1}(t)\, g_{B_1}(x) \\
h_2^{TS}(t,x) &= f_{A_2}(t)\, g_{A_2}(x) + \hat{f}_{A_2}(t)\, g_{B_2}(x) \\
h(t,x) &= h_1^{TS}(t,x) + h_2^{TS}(t,x) \\
&= f_{A_1}(t)\, g_{A_1}(x) + \hat{f}_{A_1}(t)\, g_{B_1}(x) + f_{A_2}(t)\, g_{A_2}(x) + \hat{f}_{A_2}(t)\, g_{B_2}(x) \\
&\neq h_{TS}(t,x), \qquad (3.18)
\end{aligned}
$$
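This failure is easy to reproduce numerically: two temporally symmetric STRFs built on unrelated temporal and spectral functions sum to a rank 4 STRF. A minimal sketch with invented parameters:

```python
import numpy as np
from scipy.signal import hilbert

rng = np.random.default_rng(2)
nt, nx = 256, 32
t = np.arange(nt) / 1000.0

def ts_strf(freq, tau):
    """A temporally symmetric STRF f_A g_A + f_A_hat g_B with random spectra."""
    f = np.exp(-t / tau) * np.sin(2 * np.pi * freq * t)
    f_hat = np.imag(hilbert(f))
    return (np.outer(rng.standard_normal(nx), f)
            + np.outer(rng.standard_normal(nx), f_hat))

h = ts_strf(12.0, 0.03) + ts_strf(20.0, 0.05)   # incompatible temporal functions

s = np.linalg.svd(h, compute_uv=False)
print(s[:6] / s[0])    # four appreciable singular values: rank 4, as in eq. 3.18
```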
The sum generically gives an STRF of rank 4, which cannot be temporally symmetric, since a temporally symmetric STRF must be of rank 2. There do exist configurations in which a special relationship between the several spectral functions does result in temporal symmetry, but they are highly restrictive: for example, when $g_{A_1}(x) = g_{A_2}(x)$ and $g_{B_1}(x) = g_{B_2}(x)$. An example of an STRF that is of the form of equation 3.18, where there is no special relationship between the spectral and temporal components of its constituents, is shown in Figure 6c. It is not temporally symmetric and is therefore ruled out as a model.

3.3.2 Adding Cortical Feedback Connections. The next generalization is at a higher level, in that it is the first generalization we consider that has
Figure 11: Schematics depicting models that are more complex. (a) Using the output of a temporally symmetric (TS) neuron as sole input to another neuron results in a temporally symmetric (TS) neuron (see equation 3.17). (b) Feedback from such a temporally symmetric neuron whose sole source is the first temporally symmetric neuron is still self-consistently temporally symmetric (see equation 3.19). (c) Multiple examples of feedback and feedforward: The initial neuron TS 1 provides temporal symmetry to all other neurons in the network due to its role as sole input for the network. All other neurons inherit the temporal symmetry, and the feedback is also self-consistently temporally symmetric.
feedback:

$$
\begin{aligned}
h_1^{TS}(t,x) &= \left[ \sum_{m=1}^{M} k_{A_m}(t)\, g_{C_m}(x) + \sum_{n=1}^{N} k_{A_n}^{\theta_n}(t)\, g_{D_n}(x) \right] * k_A(t) + h_2^{TS}(t,x) \\
h_2^{TS}(t,x) &= h_1^{TS}(t,x) * k_2(t). \qquad (3.19)
\end{aligned}
$$
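The self-consistency of equation 3.19 can be checked numerically by iterating the feedback loop to its fixed point. In the sketch below the feedforward drive, the feedback kernel $k_2(t)$, and the loop gain are all invented; the loop gain is kept below 1 so the iteration converges:

```python
import numpy as np
from scipy.signal import hilbert, fftconvolve

rng = np.random.default_rng(3)
nt, nx = 256, 32
t = np.arange(nt) / 1000.0

f_a = np.exp(-t / 0.03) * np.sin(2 * np.pi * 12 * t)
f_a_hat = np.imag(hilbert(f_a))
base = (np.outer(rng.standard_normal(nx), f_a)
        + np.outer(rng.standard_normal(nx), f_a_hat))   # feedforward drive (TS)

k2 = np.exp(-t / 0.010)
k2 *= 0.2 / k2.sum()        # loop gain < 1: the fixed-point iteration converges

h1 = base.copy()
for _ in range(50):         # iterate equation 3.19 to its fixed point
    h2 = fftconvolve(h1, k2[None, :], axes=1)[:, :nt]
    h1 = base + h2

s = np.linalg.svd(h1, compute_uv=False)
print(s[:4] / s[0])   # still two appreciable singular values: rank 2
```

Rank 2 is preserved exactly here because the feedback convolution acts only on the two temporal generators, and temporal symmetry is preserved (up to edge effects) because the Hilbert transform commutes with convolution (equation 3.15).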
This is an important generalization, shown schematically in Figure 11b. The proof that the result is temporally symmetric is by causal induction: line two of equation 3.19 guarantees that the STRF of neuron 2 is temporally symmetric with the same spectral functions $g_A(x)$ and $g_B(x)$ as neuron 1. Then, since the result of line 1 is the sum of two temporally symmetric STRFs sharing the same $g_A(x)$ and $g_B(x)$, the resulting STRF is also temporally symmetric (and shares the same $g_A(x)$ and $g_B(x)$). A more rigorous proof follows straightforwardly when using the MTFST instead of the STRF.

This illustrates the base of a large class of generalizations: any network of cortical neurons in one self-contained circuit, including feedforward and feedback connections, will remain temporally symmetric as long as all spectral information arises from a single source that receives all thalamic input. That is, if input into the cortical neural circuit arises, ultimately, from a single source receiving all thalamic input (e.g., one neuron in layer III/IV), then all neurons in that circuit will be temporally symmetric, and all will share the same spectral functions $g_A(x)$ and $g_B(x)$. A substantially more complex example is illustrated schematically in Figure 11c.

The converse to this rule of a single source of thalamic input to a cortical circuit is that almost any other input will break temporal symmetry and is therefore disallowed by the physiology. As shown in equation 3.18, any neuron that receives input from two sources whose temporal symmetries are incompatible will not be temporally symmetric and is therefore disallowed. Essentially, each local circuit must get its input from a single layer III/IV cell. This is a strong claim: it demands that neurons in the local circuit but not in layer III/IV must also be temporally symmetric. Generally, inputs from neighboring or distant circuits will have a different spectral balance of thalamic inputs and would break temporal symmetry if strongly coupled, at least at the timescales important to these STRFs (a few to a few tens of Hz). Only if the neighboring cortical cell were to have a compatible temporal symmetry would it be allowed to take part in the circuit without violating temporal symmetry.

Fully separable cells in AI are also well explained by all of the above models, except without the second, lagged thalamic population. All of the results, generalizations, and restrictions apply equally strongly, since anything that breaks temporal symmetry increases the rank and so does not allow the STRF to be of rank 1 either; and maintaining temporal symmetry is consistent with full separability if there are no lagged inputs to increase the STRF rank from 1 to 2.

4 Discussion

Temporal symmetry is a property that STRFs may possess, but there is no mathematical reason that they must. Nevertheless, in the awake case, 85% of neurons measured in AI were temporally symmetric ($\eta_t > 0.65$),
and 96% in the anesthetized population (including those that are fully separable). This property merits serious attention. Simple models of functional thalamic input to AI in which the temporal functions are unconstrained, or even identical but with relative delays of more than a few milliseconds, do not possess the property of temporal symmetry and so must be ruled out. On the other hand, it is quite possible for a model to achieve temporal symmetry: the only restriction on the thalamic inputs is that their low-pass-filtered versions (at rates below 20 Hz, that is, those commensurate with cortical integration time constants) must be identical. Additionally, some of the ventral MGB inputs must be (at least partially) lagged in order for the cortical neuron to be temporally symmetric without being fully separable.

Additional constraints are forced on neurons in AI receiving input from other neurons in AI, whether from simple feedforward connections or as part of a complicated feedback-feedforward network. Simply taking two temporally symmetric neurons as input will not generate a temporally symmetric neuron. Since the temporal profiles of cortical neurons are not identical, the only way to maintain temporal symmetry is to keep the same spectral balance of the lagged and nonlagged components ($g_A(x)$ and $g_B(x)$ in equation 3.2). The only sure way to accomplish this is to restrict all inputs to the local circuit network to a single input neuron, for example, in layer III/IV. Another possibility, not ruled out, is to carefully match the balance of the lagged and nonlagged components even when coming from another column. It seems this would be difficult to match well enough to avoid breaking temporal symmetry, but it is not disallowed by this analysis. It is consistent with the findings of Read et al. (2001), who note that horizontal connections match not only in best frequency but also in spectral bandwidth.

One can rule out the possibility that temporal symmetry is due to a simple interaction between ordinary transmission delays and limited temporal modulation bandwidths. For instance, if the STRF's temporal bandwidth is narrow, an ordinary time delay (with phase linear in frequency) might appear to be a constant phase lag. If this effect were important, the temporal symmetry index should show a negative correlation with temporal bandwidth. In fact, the correlation between temporal bandwidth (of the first singular value temporal cross-section) and temporal symmetry index is small and positive: 0.24 for the awake case and 0.21 for the anesthetized case.

Of course, the experimental results of Figures 8 and 9 do not show absolute temporal symmetry, only a large degree of it. Similarly, intracortical connections from neighboring or distant circuits are not ruled out absolutely; rather, their strength must be small and their effects more subtle. Indeed, for neurons whose STRF is noisy, the "noise" may be due to inputs from neighboring or distant circuits, whether thalamic or intracortical, that are not time-locked.
In addition, since all the STRFs in this analysis were derived entirely from the steady-state portion of the response, this analysis and its models do not apply to neural connections that are active only during onset. This is a limitation of any study of steady-state responses. The neural connectivity it predicts is a functional connectivity: the connectivity active during the steady state. Nevertheless, the steady-state response is important in auditory processing (e.g., for background sounds and auditory streaming), and it is unlikely that steady-state functional connectivity is completely unrelated to other measures of functional connectivity.

It should be noted that the possibility of thalamic inputs being not fully separable but still temporally symmetric (still with all inputs having the same temporal function) is not strictly ruled out mathematically. It amounts to a rearranging of the terms within the largest parentheses of equation 3.12. There is no physiological evidence against the existence of not fully separable, temporally symmetric neurons in auditory thalamus with all inputs having the same temporal function. In fact, it is possible to shift the models presented here down one level, using input from fully separable neurons in the inferior colliculus (IC) instead of the thalamus and using thalamocortical loops instead of intracortical feedback. In this case, it would be necessary to find a mechanism that lags (Hilbert rotates) some of the IC inputs. The essential predictions are the same: all input to the circuit must be through a single source, and any exceptions would break the temporal symmetry, but the source has been moved down one level. However, this does add another layer of complexity and of strong assumptions to the models, so it appears less likely to be relevant.

If, however, ventral MGB neurons are found with substantially more complicated STRF structures (e.g., quadrant separable but not temporally symmetric, or even inseparable without any symmetries, both of which are more complicated than all of the AI neurons described here), then the models proposed here would not be able to accommodate them, and another explanation will have to be found for the temporal symmetry found in AI.

The specific models proposed by the equations above (e.g., equations 3.3, 3.6, 3.12, 3.17, and 3.19) are somewhat more general than the precise descriptive text that follows from them. For instance, lagged cortical cells are allowed in place of the lagged thalamic cells: in the macaque visual system, for example, there are both lagged and nonlagged V1 cortical neurons, in addition to the same two populations in visual thalamus (De Valois et al., 2000). The models above do not distinguish between those cortical inputs and those thalamic inputs. Similarly, the models do not depend on the mechanism by which input neurons are lagged, whether by thalamic inhibition (Mastronarde, 1987a, 1987b) or synaptic depression (Chance et al., 1998); by equation 3.3, the models do not even depend on the sign or magnitude of the phase lag. Nevertheless, this does not invalidate the models' ease of falsifiability.
The example of a feedback network described by equation 3.19 and depicted in Figure 11b is the simplest of systems that include feedback. The more complex model depicted in Figure 11c serves as an example of a more complex cortical network with feedback, but the class of feedback systems is so much larger than that of feedforward systems that it is beyond the scope of this article to categorize all of them.

One important feature of these models is their ease of falsifiability. Most interactions will break temporal symmetry. Any functionally connected pair of neurons in AI with temporally symmetric STRFs, but with incompatible impulse responses and spectral response fields, would invalidate the model. Finding direct evidence is also possible, and we propose one experiment as an example: simultaneous recording from two functionally connected neurons in AI with strong temporally symmetric STRFs. While presenting steady-state auditory stimuli, electrically stimulate the neuron whose output is the other's input in a way that is incompatible with temporal symmetry but compatible with the steady-state stimulus. If these models are correct, the STRF of the other neuron will remain well formed but lose temporal symmetry.

Additional caveats are in order, delineating the types of functional connections permitted and disallowed by the measured temporal symmetry. The STRF captures many features of the response in AI, but not all. For instance, typical STRFs in AI only extend up to a few tens of Hz (Depireux et al., 2001), since the neurons cannot lock to rates higher than that; yet some spikes within the response show remarkably high temporal precision across different presentations, as fast as a millisecond (see, e.g., Elhilali, Fritz, Klein, Simon, & Shamma, 2004). Since STRFs do not capture effects at this timescale, predictions made from the temporal symmetry of the STRFs cannot place any constraints on functional connections that take effect only at this timescale. Similarly, since the STRFs are measured only from the sustained response and do not use the onset response, constraints cannot be placed on functional connections that occur only in the first few hundred milliseconds.

These results have been obtained from STRFs measured by a single method (using TORC stimuli) and still need to be verified for STRFs measured using different stimuli. Use of the SVD truncation (see equation 2.20) and calculation of the temporal symmetry index (see equation 2.21) is straightforward for STRFs from any source. The main caveat in applying this technique to other data is that the STRFs should be sufficiently error free, especially with respect to systematic error (Klein et al., 2006). Too much error in the STRF estimate leads to the second term in the singular value truncation being dominated by noise, which corrupts the estimate of the temporal symmetry index.

How much of this work generalizes beyond AI? In vision, it appears that temporal and spectral symmetry measurements for spatiotemporal response fields are rarely, if ever, made (these measurements would require
the entire spatiotemporal response field to be measured, not just the temporal response at the best spatial frequency, or the converse). It is certainly true that many spatiotemporal response fields show strong directional selectivity, and it is clear from equation 2.31 that an STRF that is quadrant separable and purely directionally selective is also temporally symmetric.

The population distributions of the temporal symmetry index, shown in Figure 8a, exhibit some differences between the awake and anesthetized populations, the most pronounced of which is the low-symmetry tail in the awake population. The similarities are more profound than the differences, however: the large majority of STRFs show pronounced temporal symmetry. This is especially striking when compared with the spectral case, shown in Figure 8b, which has a broad distribution in both the awake and anesthetized populations.

Let us also restate an earlier point regarding the validity of drawing conclusions from a linear systems analysis of a system known to have strong nonlinearities. Linear systems are a subcategory of those nonlinear systems well described by Volterra and Wiener expansions (Eggermont, 1993; Rugh, 1981). If, in particular, the system is well described by a linear kernel followed by a static nonlinearity, the measured STRF is still proportional to that linear kernel (Bussgang, 1952; Nykamp & Ringach, 2002), since the stimuli used here, though discretized in rate and spectral density, are approximately spherical. If the system's nonlinearities are more general, or the static nonlinearity is sensitive to the residual nonsphericity of the stimuli, then the measured STRF will contain additional stimulus-dependent, nonlinear contributions. Crucially, it has been shown that those potential higher-order nonlinearities are not dominant (i.e., the linear properties are robust) in ferret AI, since using several stimulus sets with widely differing spectrotemporal properties produces measured STRFs that do not differ substantially from each other (Klein et al., 2006). This implies that the STRF measured here is dominantly proportional to a linear kernel (followed by some static nonlinearity) and that the analysis performed here is valid. To the extent that there is higher-order contamination in the STRF, however, the analysis is affected in unknown ways. For example, there are related neural systems for which the dominant response has not been seen to be linear (Machens, Wehr, & Zador, 2004; Sahani & Linden, 2003).

The most immediate remaining question is whether ventral MGB neurons have fully separable STRFs and whether any have lagged outputs. If neither of these holds true, then it becomes very difficult to explain temporal symmetry, which is so easily broken. Another set of questions revolves around the generation of STRFs themselves. What distinguishes AI neurons that have clean STRFs from those that have noisy STRFs, or none at all? Only some (large) fraction of AI neurons have recognizable STRFs, and only some (large) fraction of those recognizable STRFs have a high signal-to-noise ratio (no specific statistics have been gathered, and a careful
analysis is beyond the scope of this article). What is the cause of this noise, and how is it related to (presumably) non-phase-locked inputs?

Appendix

A.1 Temporal Symmetry Ambiguities. There are ambiguities and symmetries in the decomposition in equation 3.2, $h_{TS}(t,x) = f_A(t)\, g_A(x) + \hat{f}_A(t)\, g_B(x)$. The transformation parameterized by an amplitude and phase $(\lambda, \theta)$,

$$
\begin{aligned}
f_A(t) &\to \lambda f_A^\theta(t) = \lambda \left( \cos\theta\, f_A(t) + \sin\theta\, \hat{f}_A(t) \right) \\
\hat{f}_A(t) = f_A^{\pi/2}(t) &\to \lambda f_A^{\theta + \pi/2}(t) = \lambda \left( -\sin\theta\, f_A(t) + \cos\theta\, \hat{f}_A(t) \right) \\
g_A(x) &\to \lambda^{-1} \left( \cos\theta\, g_A(x) + \sin\theta\, g_B(x) \right) \\
g_B(x) &\to \lambda^{-1} \left( -\sin\theta\, g_A(x) + \cos\theta\, g_B(x) \right), \qquad (A.1)
\end{aligned}
$$
leaves equation 3.2 invariant. This indicates a twofold symmetry, which here means that this decomposition is ambiguous (nonunique). The amplitude symmetry, parameterized by $\lambda$, demonstrates that the power (strength) of an STRF cannot be ascribed to either the spectral or the temporal cross-section. A standard method of removing this ambiguity is to normalize the cross-sections and include an explicit amplitude,

$$h_{TS}(t,x) = A\, v_A(t)\, u_A(x) + B\, \hat{v}_A(t)\, u_B(x), \qquad (A.2)$$
where the $u_i$ and $v_i$ are of unit norm. This method is used in SVD. The rotation transformation parameter $\theta$ merely establishes which of the possible rotations of $f_A^\theta(t)$ we choose to compare against all others. For example, in Figure 7, the temporal cross-section at the best frequency (i.e., the frequency at which the STRF reaches its most extreme value) is chosen to be the standard impulse response against which all others are compared. To do this, we choose $\theta$ such that $h_{TS}(t, x_{BF}) = f_A(t)\, g_A(x_{BF})$ and $\lambda$ such that $g_A(x_{BF}) = 1$.

A.2 Quadrant-Separable Spectrotemporal Modulation Transfer Functions. For mathematical convenience, we name and define the cross-sectional functions referred to without names above in section 2.9. In quadrant 1, where both $w$ and $\Omega$ are positive, $H(w,\Omega) = F_1^+(w)\, G_1^+(\Omega)$, where the index 1 refers to quadrant 1 and the $+$ indicates that the function is defined only for positive argument. In quadrant 2, where $w$ is negative and $\Omega$ is positive, $H(w,\Omega) = F_2^{+*}(-w)\, G_2^+(\Omega)$, where the index 2 refers to quadrant 2, the $+$ indicates that the function is defined only for positive argument, $-w$
is positive, and the complex conjugation makes later equations significantly simpler. Thus, a quadrant-separable MTFST can be expressed as (valid in all four quadrants)

$$
\begin{aligned}
H_{QS}(w,\Omega) &= F_1^+(w)\Theta(w)\, G_1^+(\Omega)\Theta(\Omega) + F_2^+(w)\Theta(w)\, G_2^{+*}(-\Omega)\Theta(-\Omega) \\
&\quad + F_1^{+*}(-w)\Theta(-w)\, G_1^{+*}(-\Omega)\Theta(-\Omega) + F_2^{+*}(-w)\Theta(-w)\, G_2^+(\Omega)\Theta(\Omega), \qquad (A.3)
\end{aligned}
$$
where $\Theta(\cdot)$ is the step function (1 for positive argument and 0 for negative argument). In equation A.3 it is explicit that the spectrotemporal modulation transfer function in quadrants 3 ($w < 0$, $\Omega < 0$) and 4 ($w > 0$, $\Omega < 0$) is the complex conjugate of quadrants 1 and 2, respectively. This can be written more compactly as

$$H_{QS}(w,\Omega) = F_1(w)\, G_1(\Omega) + F_2(w)\, G_2(\Omega) - \hat{F}_1(w)\, \hat{G}_1(\Omega) + \hat{F}_2(w)\, \hat{G}_2(\Omega), \qquad (A.4)$$

where

$$
\begin{aligned}
F_i(w) &= F_i^+(w)\Theta(w) + F_i^{+*}(-w)\Theta(-w) \\
G_i(\Omega) &= G_i^+(\Omega)\Theta(\Omega) + G_i^{+*}(-\Omega)\Theta(-\Omega) \qquad (A.5)
\end{aligned}
$$

are defined for both positive and negative argument (and are complex conjugate symmetric by construction), $i = (1, 2)$, and the Hilbert transform is defined in transfer function space as

$$
\begin{aligned}
\hat{F}(w) &= j\, \mathrm{sgn}(w)\, F(w) \\
\hat{G}(\Omega) &= j\, \mathrm{sgn}(\Omega)\, G(\Omega). \qquad (A.6)
\end{aligned}
$$
Equation A.4 demonstrates that a quadrant-separable spectrotemporal modulation transfer function can be decomposed into the sum of four fully separable spectrotemporal modulation transfer functions, where the last two terms are fully determined by the first two and the requirement of quadrant separability. This is the first time we see that a generic quadrant-separable MTFST (and its STRF) is of rank 4, not 2. Roughly speaking, this comes from the combination of two independent product terms in quadrants 1 and 2, plus the restriction that the MTFST must be complex conjugate symmetric. An equivalent decomposition, equally concise but more amenable to comparison with the definitions motivated earlier, of a quadrant-separable spectrotemporal modulation transfer function into the sum of four fully
separable spectrotemporal modulation transfer functions is

$$H_{QS}(w,\Omega) = F_A(w)\, G_A(\Omega) + \hat{F}_B(w)\, \hat{G}_B(\Omega) + \hat{F}_A(w)\, G_B(\Omega) + F_B(w)\, \hat{G}_A(\Omega), \qquad (A.7)$$

where

$$
\begin{aligned}
F_A(w) &= \tfrac{1}{\sqrt{2}} \left( F_1(w) + F_2(w) \right) \\
F_B(w) &= \tfrac{1}{\sqrt{2}} \left( -\hat{F}_1(w) + \hat{F}_2(w) \right) \\
G_A(\Omega) &= \tfrac{1}{\sqrt{2}} \left( G_1(\Omega) + G_2(\Omega) \right) \\
G_B(\Omega) &= \tfrac{1}{\sqrt{2}} \left( -\hat{G}_1(\Omega) + \hat{G}_2(\Omega) \right). \qquad (A.8)
\end{aligned}
$$

Taking the inverse Fourier transform gives the canonical form of the quadrant-separable STRF,

$$h_{QS}(t,x) = f_A(t)\, g_A(x) + \hat{f}_B(t)\, \hat{g}_B(x) + \hat{f}_A(t)\, g_B(x) + f_B(t)\, \hat{g}_A(x), \qquad (A.9)$$

which is also equation 2.28.
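Equation A.9 can be checked numerically: with generic generators it produces a rank 4 STRF, and imposing $f_B(t) = \hat{f}_A(t)$ (temporal symmetry) collapses it to rank 2. In this sketch the generators are random band-limited signals built from integer harmonics, so that the FFT-based Hilbert transform is exact; all parameters are invented:

```python
import numpy as np
from scipy.signal import hilbert

rng = np.random.default_rng(4)
nt, nx = 256, 64

def hat(v):
    return np.imag(hilbert(v))

def harmonics(n):
    """Random signal from integer harmonics (no DC or Nyquist component)."""
    x = np.arange(n)
    q = rng.integers(2, n // 4, size=8)
    p = rng.uniform(0, 2 * np.pi, size=8)
    v = np.cos(2 * np.pi * np.outer(q, x) / n + p[:, None]).sum(axis=0)
    return v / np.linalg.norm(v)

f_a, f_b = harmonics(nt), harmonics(nt)
g_a, g_b = harmonics(nx), harmonics(nx)

def build_h(fb):
    """Equation A.9, with rows indexed by x and columns by t."""
    return (np.outer(g_a, f_a) + np.outer(hat(g_b), hat(fb))
            + np.outer(g_b, hat(f_a)) + np.outer(hat(g_a), fb))

s_generic = np.linalg.svd(build_h(f_b), compute_uv=False)
s_ts = np.linalg.svd(build_h(hat(f_a)), compute_uv=False)
print(s_generic[:5] / s_generic[0])   # four appreciable values: rank 4
print(s_ts[:5] / s_ts[0])             # two appreciable values: rank 2
```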
A.3 Quadrant Separability Ambiguities. There are ambiguities and symmetries in the decomposition in equation A.3 (and, correspondingly, equations A.4 and A.7). The transformation parameterized by a pair of amplitudes and phases $(\Lambda_1, \theta_1, \Lambda_2, \theta_2)$,

$$
\begin{aligned}
F_1^+ &\to F_1^+\, \Lambda_1 \exp(j\theta_1) & G_1^+ &\to G_1^+\, \Lambda_1^{-1} \exp(-j\theta_1) \\
F_2^+ &\to F_2^+\, \Lambda_2 \exp(j\theta_2) & G_2^+ &\to G_2^+\, \Lambda_2^{-1} \exp(-j\theta_2), \qquad (A.10)
\end{aligned}
$$

or, equivalently,

$$
\begin{aligned}
F_1(w) &\to \Lambda_1 F_1^{\theta_1}(w) & G_1(\Omega) &\to \Lambda_1^{-1} G_1^{-\theta_1}(\Omega) \\
F_2(w) &\to \Lambda_2 F_2^{\theta_2}(w) & G_2(\Omega) &\to \Lambda_2^{-1} G_2^{-\theta_2}(\Omega), \qquad (A.11)
\end{aligned}
$$
leaves equation A.3 invariant, representing an ambiguous (nonunique) decomposition. The spectrotemporal version of the transformation is

$$
\begin{aligned}
f_1(t) &\to \Lambda_1 f_1^{\theta_1}(t) & g_1(x) &\to \Lambda_1^{-1} g_1^{-\theta_1}(x) \\
f_2(t) &\to \Lambda_2 f_2^{\theta_2}(t) & g_2(x) &\to \Lambda_2^{-1} g_2^{-\theta_2}(x). \qquad (A.12)
\end{aligned}
$$
The amplitude symmetry, parameterized by $\Lambda_1$ and $\Lambda_2$, demonstrates that the amplitude (or power) in one quadrant of a spectrotemporal modulation transfer function cannot be ascribed to either the spectral or the temporal cross-section. Again, a standard method of removing this ambiguity is to normalize the cross-sections and include an explicit amplitude:

$$H_{ST}^{QS}(w,\Omega) = V_1(w)\, \Lambda_1\, U_1(\Omega) + V_2(w)\, \Lambda_2\, U_2(\Omega) - \hat{V}_1(w)\, \Lambda_1\, \hat{U}_1(\Omega) + \hat{V}_2(w)\, \Lambda_2\, \hat{U}_2(\Omega), \qquad (A.13)$$

where the $U_i$ and $V_i$ are of unit norm. This method is used in singular value decomposition (SVD). The spectrotemporal version is

$$h_{QS}(t,x) = v_1(t)\, \Lambda_1\, u_1(x) + v_2(t)\, \Lambda_2\, u_2(x) - \hat{v}_1(t)\, \Lambda_1\, \hat{u}_1(x) + \hat{v}_2(t)\, \Lambda_2\, \hat{u}_2(x). \qquad (A.14)$$
The phase ambiguity, parameterized by $\theta_1$ and $\theta_2$, can be useful. By judicious choice of this transformation, it can always be arranged that $F_2$ be orthogonal to $\hat{F}_1$ and, simultaneously, $G_2$ be orthogonal to $\hat{G}_1$. This is useful when performing detailed calculations (e.g., the analytic expression below for the singular values of a quadrant-separable spectrotemporal modulation transfer function), because it explicitly decomposes the spectrotemporal modulation transfer function, viewed as a linear operator, into components that act on orthogonal subspaces.

A.4 Rank of Quadrant-Separable Spectrotemporal Response Fields. This section uses SVD to prove that the rank of a generic quadrant-separable STRF is 4. We begin with equation A.14. In this form, the symmetry generated by $\Lambda_1$ and $\Lambda_2$ in equation A.11 has already been fixed, but we can still use the symmetry generated by $\theta_1$ and $\theta_2$ in equation A.11 to simplify later calculations. We quantify the nonorthogonality of the terms in
equation A.14 by the four angles,

$$
\begin{aligned}
\cos\theta_V &= v_1 \cdot v_2 = \hat{v}_1 \cdot \hat{v}_2 & \cos\theta_U &= u_1 \cdot u_2 = \hat{u}_1 \cdot \hat{u}_2 \\
\cos\varphi_V &= v_1 \cdot \hat{v}_2 = -\hat{v}_1 \cdot v_2 & \cos\varphi_U &= u_1 \cdot \hat{u}_2 = -\hat{u}_1 \cdot u_2, \qquad (A.15)
\end{aligned}
$$

where $w_i \cdot w_j := \int w_i\, w_j$, and $v_i \cdot v_j = \hat{v}_i \cdot \hat{v}_j = \delta_{ij} = u_i \cdot u_j = \hat{u}_i \cdot \hat{u}_j$. By making an explicit transformation with $\theta_1$ and $\theta_2$ from equation A.12 such that

$$\tan(\theta_1 - \theta_2) = -\frac{\cos\varphi_V}{\cos\theta_V}, \qquad \tan(\theta_1 + \theta_2) = \frac{\cos\varphi_U}{\cos\theta_U}, \qquad (A.16)$$

it can then be shown that after the transformation,

$$v_1 \cdot \hat{v}_2 = 0 = u_1 \cdot \hat{u}_2, \qquad (A.17)$$
and there are new values for $\cos\theta_V = v_1 \cdot v_2$ and $\cos\theta_U = u_1 \cdot u_2$ as well. This step is not necessary, but it reduces the complexity of the following equations. To avoid double counting, we also need to restrict the range of either $\theta_U$ or $\theta_V$ (just as on a sphere, we let the azimuthal angle run over the entire equator but enforce the polar angle to run over only half a great circle of longitude). We do this by enforcing that $\cos\theta_U$ be positive. We then apply SVD to equation A.14 with the constraint of equation A.17. The singular values $\lambda_1$, $\lambda_2$, $\lambda_3$, and $\lambda_4$ are the square roots of the eigenvalues $\lambda^2$ of the operator formed by taking the inner product of the STRF with itself along the $t$ axis:
$$
\begin{aligned}
\int dt\, h_{QS}(t,x)\, h_{QS}^\dagger(t,x') &= \Lambda_1^2\, u_1(x)u_1(x') + \Lambda_2^2\, u_2(x)u_2(x') \\
&\quad + \Lambda_1\Lambda_2 \cos\theta_V \left( u_1(x)u_2(x') + u_2(x)u_1(x') \right) \\
&\quad + \Lambda_1^2\, \hat{u}_1(x)\hat{u}_1(x') + \Lambda_2^2\, \hat{u}_2(x)\hat{u}_2(x') \\
&\quad - \Lambda_1\Lambda_2 \cos\theta_V \left( \hat{u}_1(x)\hat{u}_2(x') + \hat{u}_2(x)\hat{u}_1(x') \right). \qquad (A.18)
\end{aligned}
$$
Each pair of lines of the right-hand side of equation A.18 can then be orthogonalized separately. In fact, after we orthogonalize the first two lines, the last two lines can be orthogonalized by substituting $\cos\theta_V \to -\cos\theta_V$ and taking the Hilbert transform of each component function. To orthogonalize the first two lines, we look for the
eigenvalues $\lambda^2$ and coefficients $c_1$ and $c_2$ such that

$$\zeta(x) = c_1 u_1(x) + c_2 u_2(x) \qquad (A.19)$$

is the eigenfunction solving the eigenvalue equation,

$$\int dx' \left[ \Lambda_1^2\, u_1(x)u_1(x') + \Lambda_2^2\, u_2(x)u_2(x') + \Lambda_1\Lambda_2 \cos\theta_V \left( u_1(x)u_2(x') + u_2(x)u_1(x') \right) \right] \zeta(x') = \lambda^2 \zeta(x). \qquad (A.20)$$
Putting equation A.19 into equation A.20 and expanding gives a pair of solutions defined (up to normalization) by the two roots of the quadratic equation

$$\left( \frac{c_2}{c_1} \right)^2 \left( \Lambda_1^2 \cos\theta_U + \Lambda_1\Lambda_2 \cos\theta_V \right) + \frac{c_2}{c_1} \left( \Lambda_1^2 - \Lambda_2^2 \right) - \left( \Lambda_2^2 \cos\theta_U + \Lambda_1\Lambda_2 \cos\theta_V \right) = 0, \qquad (A.21)$$

and the corresponding two solutions of

$$\lambda^2 = \Lambda_1^2 + \Lambda_1\Lambda_2 \cos\theta_V \cos\theta_U + \frac{c_2}{c_1} \left( \Lambda_1^2 \cos\theta_U + \Lambda_1\Lambda_2 \cos\theta_V \right). \qquad (A.22)$$
Solving similarly for the eigenfunctions and eigenvalues of the last two lines of the right-hand side of equation A.18 gives the same as equations A.21 and A.22, but with $\cos\theta_V \to -\cos\theta_V$. The temporal eigenfunctions are given by using the operator formed by taking the inner product of the STRF with itself along the $x$ axis,

$$\int dx\, h_{QS}(t,x)\, h_{QS}^\dagger(t',x), \qquad (A.23)$$
and give the same eigenvalues. Explicitly, the four eigenvalues $\lambda^2$ are given by the four solutions of

$$\lambda^2 = \frac{\Lambda_1^2 + \Lambda_2^2}{2} + \Lambda_1\Lambda_2 \cos\theta_V \cos\theta_U \pm \sqrt{ \left( \frac{\Lambda_1^2 + \Lambda_2^2}{2} \right)^2 - \Lambda_1^2\Lambda_2^2 \left( 1 - \cos^2\theta_V - \cos^2\theta_U \right) + \left( \Lambda_1^2 + \Lambda_2^2 \right) \Lambda_1\Lambda_2 \cos\theta_V \cos\theta_U }, \qquad (A.24)$$

and the same with $\cos\theta_V \to -\cos\theta_V$. The four singular values are given by the four positive square roots of $\lambda^2$. Because the system is explicitly orthogonal, the rank is given by the number of nonzero values of $\lambda^2$. Generically, the four values of $\lambda^2$ are nonzero, with the only exceptions described below.
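As a consistency check on the algebra, the two roots asserted by equation A.24 can be verified symbolically against the 2 × 2 coefficient matrix obtained by expanding equation A.20 in the nonorthogonal basis $\{u_1, u_2\}$ (the matrix below is our own restatement of equations A.21 and A.22, not notation from the text):

```python
import sympy as sp

L1, L2, cV, cU = sp.symbols('Lambda_1 Lambda_2 c_V c_U', real=True)
lam2 = sp.symbols('lambda2', real=True)

# Coefficient matrix of the eigenproblem behind equations A.19 to A.22.
M = sp.Matrix([
    [L1**2 + L1 * L2 * cV * cU, L1**2 * cU + L1 * L2 * cV],
    [L2**2 * cU + L1 * L2 * cV, L2**2 + L1 * L2 * cV * cU],
])

# The two roots claimed by equation A.24 (the B branch follows from cV -> -cV).
mean = (L1**2 + L2**2) / 2 + L1 * L2 * cV * cU
disc = ((L1**2 + L2**2) / 2)**2 \
       - L1**2 * L2**2 * (1 - cV**2 - cU**2) \
       + (L1**2 + L2**2) * L1 * L2 * cV * cU

charpoly = M.charpoly(lam2).as_expr()
for root in (mean + sp.sqrt(disc), mean - sp.sqrt(disc)):
    assert sp.expand(charpoly.subs(lam2, root)) == 0

print("equation A.24 roots satisfy the characteristic polynomial")
```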
and the same with cos θV → − cos θV . The four singular values are given by the four positive square roots of λ2 . Because the system is explicitly orthogonal, the rank is given by the number of nonzero values of λ2 . Generically, the four values of λ2 are nonzero, with the only exceptions described below. A.5 Lower-Rank Quadrant-Separable Spectrotemporal Response Fields. This section uses SVD to prove that if a quadrant-separable STRF has rank less than the generic rank of 4, then this is caused by specific symmetries, and the rank must be reduced to 2 (or 1). Label the four eigenvalues λ2 as λ2A± for the two solutions to equation A.24 and λ2B± for the two solutions to equation A.24 but with cos θV → − cos θV . First, we consider the case that neither 1 nor √ 2 vanishes (we will consider the alternative case later). Define 0 = 1 2 and 22 1 21 α2 = , (A.25) + 2 22 21 where it can be shown that α 2 > 1 for any 1 and 2 . Then the SVD eigenvalues can be written as 2 α +cos θV cos θU
2 λ2 = 20 , 2 2 2 ±
α + cos θV cos θU
− (1 + cos θV cos θU ) + (cos θV + cos θU )
(A.26) and the same with cos θV → − cos θV . Note for reference below that since α 2 > 1, the top row expression of equation A.26 is nonnegative, as is the expression under the square root. It can also be shown that the top row expression is greater than or equal to the square root expression, guaranteeing that all of λ2A± and λ2B± are nonnegative. The loss of rank (from 4 to a smaller number) corresponds to at least one of the eigenvalues vanishing, since the rank is the number of nonzero eigenvalues. Because of the nonnegativity properties of the components of equation A.26, we always have λ2A+ > λ2A− and λ2B+ > λ2B− , and so if only one of the eigenvalues vanishes, it must be λ2A− or λ2B− . First, consider the case of λ2A− = 0. This can occur only when the top row expression of equation A.26 cancels the square root expression exactly or, since they are both nonnegative, when their squares cancel exactly. This condition, when simplified, can be written as
$$\left( 1 - \cos^2\theta_V \right) \left( 1 - \cos^2\theta_U \right) = 0, \qquad (A.27)$$

which can occur only when $\cos^2\theta_V = 1$ or $\cos^2\theta_U = 1$. The case $\cos^2\theta_V = 1$ results in only two nonzero eigenvalues, with values $\lambda^2 = 2\Lambda_0^2 \left( \alpha^2 \pm \cos\theta_U \right)$, and two zero eigenvalues, $\lambda^2_{A-} = \lambda^2_{B-} = 0$, giving an
STRF of rank 2 (not rank 3, the guess associated with the initial assumption of $\lambda^2_{A-} = 0$). From equation A.15, this also results in $v_2(t) = \pm v_1(t)$, which is necessary and sufficient for temporal symmetry: the STRF has only one temporal function and its Hilbert transform. Analogously, the case $\cos^2\theta_U = 1$ results in spectral symmetry. There are only two nonzero eigenvalues, with values $\lambda^2 = 2\Lambda_0^2 \left( \alpha^2 \pm \cos\theta_V \right)$, and two zero eigenvalues, $\lambda^2_{A-} = \lambda^2_{B-} = 0$, giving an STRF of rank 2. Considering the case of $\lambda^2_{B-} = 0$ leads to identical conclusions.

Aside from the degenerate case in which only one nonzero eigenvalue remains (which is the fully separable case), the only other possibility of low rank arises when either $\Lambda_1$ or $\Lambda_2$ vanishes. If $\Lambda_1 = 0$, then the quadrant-separable STRF of equation A.14 is explicitly in the form of equation 2.31, giving this STRF the symmetry of pure directional selectivity. Equation A.24 simplifies dramatically and results in only two nonzero eigenvalues (both with value $\Lambda_2^2$), giving a rank 2 STRF. Similarly, if $\Lambda_2 = 0$, the STRF is purely directionally selective in the opposite direction, resulting in only two nonzero eigenvalues (with value $\Lambda_1^2$), and also rank 2.

Thus, we have proven that whenever the rank of a quadrant-separable STRF is less than 4, it must be rank 2 and temporally symmetric; rank 2 and spectrally symmetric; rank 2 and directionally selective; or rank 1 and fully separable.

Acknowledgments

S.A.S. was supported by Office of Naval Research Multidisciplinary University Research Initiative Grant N00014-97-1-0501. D.A.D. was supported by the National Institute on Deafness and Other Communication Disorders, National Institutes of Health, Grant R01 DC005937. We thank Shantanu Ray for technical assistance in electronics and major contributions in customized software design, and Sarah Newman and Tamar Vardi for assistance with training animals. We also thank Alan Saul and three anonymous reviewers from an earlier submission for their helpful and insightful comments.

References

Adelson, E. H., & Bergen, J. R. (1985). Spatiotemporal energy models for the perception of motion. J. Opt. Soc. Am. A, 2(2), 284–299.
Aertsen, A. M., & Johannesma, P. I. (1981a). A comparison of the spectrotemporal sensitivity of auditory neurons to tonal and natural stimuli. Biol. Cybern., 42(2), 145–156.
Aertsen, A. M., & Johannesma, P. I. (1981b). The spectrotemporal receptive field: A functional characteristic of auditory neurons. Biol. Cybern., 42(2), 133–143.
Barlow, H. B., & Levick, W. R. (1965). The mechanism of directionally selective units in rabbit's retina. J. Physiol., 178(3), 477–504.
Received January 31, 2005; accepted June 19, 2006.
LETTER
Communicated by Walter Senn
Optimality Model of Unsupervised Spike-Timing-Dependent Plasticity: Synaptic Memory and Weight Distribution Taro Toyoizumi [email protected] School of Computer and Communication Sciences and Brain-Mind Institute, École Polytechnique Fédérale de Lausanne, CH-1015 Lausanne EPFL, Switzerland, and Department of Complexity Science and Engineering, Graduate School of Frontier Sciences, University of Tokyo, Tokyo 153-8505, Japan
Jean-Pascal Pfister [email protected] School of Computer and Communication Sciences and Brain-Mind Institute, École Polytechnique Fédérale de Lausanne, CH-1015 Lausanne EPFL, Switzerland
Kazuyuki Aihara [email protected] Department of Information and Systems, Institute of Industrial Science, University of Tokyo, 153-8505, Tokyo, Japan, and Aihara Complexity Modelling Project, ERATO, JST, Shibuya-ku, Tokyo 151-0064, Japan
Wulfram Gerstner [email protected] School of Computer and Communication Sciences and Brain-Mind Institute, École Polytechnique Fédérale de Lausanne, CH-1015 Lausanne EPFL, Switzerland
We studied the hypothesis that synaptic dynamics is controlled by three basic principles: (1) synapses adapt their weights so that neurons can effectively transmit information, (2) homeostatic processes stabilize the mean firing rate of the postsynaptic neuron, and (3) weak synapses adapt more slowly than strong ones, while maintenance of strong synapses is costly. Our results show that a synaptic update rule derived from these principles shares features with spike-timing-dependent plasticity, is sensitive to correlations in the input, and is useful for synaptic memory. Moreover, input selectivity (sharply tuned receptive fields) of postsynaptic neurons develops only if stimuli with strong features are presented. Sharply tuned neurons can coexist with unselective ones, and the distribution of synaptic weights can be unimodal or bimodal. The formulation of synaptic dynamics through an optimality criterion provides a simple graphical argument for the stability of synapses, necessary for synaptic memory. Neural Computation 19, 639–671 (2007)
C 2007 Massachusetts Institute of Technology
1 Introduction

Synaptic changes are thought to be involved in learning, memory, and cortical plasticity, but the exact relation between microscopic synaptic properties and macroscopic functional consequences remains highly controversial. In experimental preparations, synaptic changes can be induced by specific stimulation conditions defined through pre- and postsynaptic firing rates (Bliss & Lomo, 1973; Dudek & Bear, 1992), postsynaptic membrane potential (Kelso, Ganong, & Brown, 1986), calcium entry (Malenka, Kauer, Zucker, & Nicoll, 1988; Lisman, 1989), or spike timing (Markram, Lübke, Frotscher, & Sakmann, 1997; Bi & Poo, 2001). In the theoretical community, conditions for synaptic changes are formulated as synaptic update rules or learning rules (von der Malsburg, 1973; Bienenstock, Cooper, & Munro, 1982; Miller, Keller, & Stryker, 1989) (for reviews, see Gerstner & Kistler, 2002; Dayan & Abbott, 2001; Cooper, Intrator, Blais, & Shouval, 2004), but the exact features that make a synaptic update rule a suitable candidate for cortical plasticity and memory are unclear. From a theoretical point of view, a synaptic learning rule should (1) be sensitive to correlations between pre- and postsynaptic neurons (Hebb, 1949) in order to respond to correlations in the input (Oja, 1982); it should (2) allow neurons to develop input selectivity (e.g., receptive fields) (Bienenstock et al., 1982; Miller et al., 1989) in the presence of strong input features, but (3) the distribution of synaptic strength should remain unimodal otherwise (Gütig, Aharonov, Rotter, & Sompolinsky, 2003). Furthermore, (4) synaptic memories should show a high degree of stability (Fusi, Drew, & Abbott, 2005) and nevertheless remain plastic (Grossberg, 1987). Moreover, experiments suggest that plasticity rules are (5) sensitive to the presynaptic firing rate (Dudek & Bear, 1992), but (6) depend also on the exact timing of the pre- and postsynaptic spikes (Markram et al., 1997; Bi & Poo, 2001). Many other experimental features could be added to this list, for example, the role of intracellular calcium and of NMDA receptors, but we will not do so (see Bliss & Collingridge, 1993, and Malenka & Nicoll, 1993, for reviews). The items in the above list are not necessarily exclusive, and the relative importance of a given aspect may vary from one subsystem to the next; for example, synaptic memory maintenance might be more important for a long-term memory system than for primary sensory cortices. Nevertheless, all of the above aspects seem to be important features of synaptic plasticity. However, the development of theoretical learning rules that exhibit all of the above properties has posed problems in the past. For example, traditional learning rules that have been proposed as an explanation of receptive field development (Bienenstock et al., 1982; Miller et al., 1989) exhibit a spontaneous separation of synaptic weights into two groups, even if the input shows no or only weak correlations. This is difficult to reconcile with experimental results in visual cortex of young rats, where a
unimodal distribution was found (Sjöström, Turrigiano, & Nelson, 2001). Moreover, model neurons that specialize early in development on one subset of features cannot readily readapt later. Other learning rules, however, that exhibit a unimodal distribution of synaptic weights (Gütig et al., 2003) do not lead to long-term stability of synaptic changes. In this article, we show that all of features 1 to 6 emerge naturally in a theoretical model where we require only a limited number of objectives that will be formulated as postulates. In particular, we study how the conflicting demands of synaptic memory maintenance, plasticity, and the distribution of synaptic weights could be satisfied by our model. Although the postulates are rather general and could be adapted to arbitrary neural systems, we had in mind excitatory synapses in neocortex or hippocampus and exclude inhibitory synapses and synapses in specialized systems such as the calyx of Held in the auditory pathway. Our arguments are based on three postulates:

A. Synapses adapt their weights so as to allow neurons to efficiently transmit information. More precisely, we impose a theoretical postulate that the mutual information I between presynaptic spike trains and postsynaptic firing be optimized. Such a postulate stands in the tradition of earlier theoretical work (Linsker, 1989; Bell & Sejnowski, 1995) but is formulated here on the level of spike trains rather than rates.

B. Homeostatic processes act on synapses to ensure that the long-term average of the neuronal firing rate becomes close to a target rate that is characteristic for each neuron. Synaptic rescaling and related mechanisms could be a biophysical implementation of homeostasis (Turrigiano & Nelson, 2004). The theoretical reason for such a postulate is that sustained high firing rates are costly from an energetic point of view (Laughlin, de Ruyter van Steveninck, & Anderson, 1998; Levy & Baxter, 2002).

C. C1: Maintenance of strong synapses is costly in terms of biophysical machinery, in particular in view of continued protein synthesis (Fonseca, Nägerl, Morris, & Bonhoeffer, 2004). C2: Synaptic plasticity is slowed for very weak synapses in order to avoid an (implausible) transition from excitatory to inhibitory synapses.

Optimality approaches have a long tradition in the theoretical neurosciences and have been utilized in two different ways. First, optimality approaches allow deriving strict theoretical bounds against which the performance of real neural systems can be compared (Barlow, 1956; Laughlin, 1981; Britten, Shadlen, Newsome, & Movshon, 1992; de Ruyter van Steveninck & Bialek, 1995). Second, they have been used as a conceptual framework, since they allow connecting functional objectives (e.g., "be reliable!") and constraints (e.g., "don't use too much energy!") with electrophysiological properties of single neurons and synapses or neuronal populations (Barlow, 1961; Linsker, 1989; Atick & Redlich, 1990; Levy & Baxter,
2002; Seung, 2003). Our study, a derivation of synaptic update rules from an optimality viewpoint, follows this second, conceptual approach.

2 The Model

2.1 Neuron Model. We simulated a single stochastic point neuron model with N = 100 input synapses. Presynaptic spikes at synapse j are denoted by their arrival times $t_j^f$ and evoke excitatory postsynaptic potentials (EPSPs) with time course $\exp[-(t - t_j^f)/\tau_m]$ for $t \ge t_j^f$, where $\tau_m = 20$ ms is the membrane time constant. Recent experiments have shown that action potentials propagating back into the dendrite can partially suppress EPSPs measured at the soma (Froemke, Poo, & Dan, 2005). Since our model neuron has no spatial structure, we included EPSP suppression by a phenomenological amplitude factor $a(t_j^f - \hat t)$ that depends on the time difference between presynaptic spike arrival and the spike trigger time $\hat t$ of the last (somatic) action potential of the postsynaptic neuron. In the absence of EPSP suppression, the amplitude of a single EPSP at synapse j is characterized by its weight $w_j$ and its duration by the membrane time constant $\tau_m$. Summation of the EPSPs caused by presynaptic spike arrival at all 100 explicitly modeled synapses gives the total postsynaptic potential

$$u(t) = u_r + \sum_{j=1}^{N} w_j \sum_{t_j^f} \exp\left(-\frac{t - t_j^f}{\tau_m}\right) a(t_j^f - \hat t), \qquad (2.1)$$

where $u_r = -70$ mV is the resting potential and the sum runs over all spike arrival times $t_j^f$ in the recent past, $\hat t < t_j^f \le t$. The EPSP suppression factor takes a value of zero if $t_j^f < \hat t$ and is modeled for $t_j^f \ge \hat t$ as exponential recovery $a(t_j^f - \hat t) = 1 - \exp[-(t_j^f - \hat t)/\tau_a]$ with time constant $\tau_a = 50$ ms (see Figure 1a) unless stated otherwise. The parameters $w_j$ for $1 \le j \le N$ denote the synaptic weights of the N = 100 synapses and are updated using a learning rule discussed below. In order to account for unspecific background input that was not modeled explicitly, spikes were generated probabilistically with density
$$\rho(t) = \rho_r + [u(t) - u_r] \cdot g, \qquad (2.2)$$
where ρr = 1 Hz is the spontaneous firing rate (in the absence of spike input at the 100 explicitly modeled synapses) and g = 12.5 Hz/mV is a gain factor. Thus, the instantaneous spike density increases linearly with the total postsynaptic potential u(t). Note, however, that due to EPSP suppression, the total postsynaptic potential increases sublinearly with the number of
input spikes, and so does the mean firing rate of the postsynaptic neuron (see Figure 1b). The neuron model is simulated in discrete time with time steps of $\Delta t = 1$ ms on a standard personal computer using custom-made software written in Matlab.
Figure 1: Properties of the stochastically spiking neuron model. (a) Neuron model and EPSP suppression. EPSPs arriving just after a postsynaptic spike at t = 0 are attenuated by a factor $(1 - e^{-t/\tau_a})$ and recover exponentially to their maximal amplitude $w_j$. (b) The output firing rate $\bar\rho$ of a neuron that receives stochastic spike input at rate ν at all 100 synapses. Each presynaptic spike evokes an EPSP with maximal amplitude $w_j = 0.4$ mV. (c) Interspike interval (ISI) distribution with input frequency ν = 10 Hz (solid line), 20 Hz (dashed line), and 30 Hz (dotted line) at all 100 synapses. (d) Autocorrelation function of postsynaptic action potentials at an input frequency of 10 Hz (solid line), 20 Hz (dashed line), and 30 Hz (dotted line).
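The paper's simulations used custom Matlab code that is not reproduced here. Purely as an illustration of equations 2.1 and 2.2, the following minimal Python sketch simulates the stochastic neuron in 1 ms steps; the constant 10 Hz Poisson input and all variable names are our assumptions for the demo, while the parameter values are those quoted in the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Parameters from section 2.1; the 10 Hz Poisson input is our demo assumption.
N, dt = 100, 1e-3                 # synapses, 1 ms time step
tau_m, tau_a = 20e-3, 50e-3       # membrane and recovery time constants (s)
u_r, rho_r, g = -70.0, 1.0, 12.5  # rest (mV), spontaneous rate (Hz), gain (Hz/mV)
w = np.full(N, 0.4)               # EPSP amplitudes w_j (mV)
nu, T = 10.0, 10.0                # input rate (Hz), simulated time (s)

epsp = np.zeros(N)   # per-synapse sum of exp[-(t - t_j^f)/tau_m] * a(t_j^f - t_hat)
t_hat = -np.inf      # time of the last postsynaptic spike
n_post = 0

for step in range(int(T / dt)):
    t = step * dt
    pre = rng.random(N) < nu * dt                  # Poisson spike arrivals
    a = 1.0 - np.exp(-(t - t_hat) / tau_a)         # suppression at arrival, eq. 2.1
    epsp = epsp * np.exp(-dt / tau_m) + pre * a
    u = u_r + np.sum(w * epsp)                     # total potential, eq. 2.1 (mV)
    rho = rho_r + (u - u_r) * g                    # spike density, eq. 2.2 (Hz)
    if rng.random() < rho * dt:                    # stochastic firing
        t_hat, n_post = t, n_post + 1
        epsp[:] = 0.0   # only arrivals after t_hat contribute (a = 0 before t_hat)

print(f"mean output rate ~ {n_post / T:.1f} Hz")
```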
2.2 Objective Function. Postulates A and B have been used previously (Toyoizumi, Pfister, Aihara, & Gerstner, 2005a) and lead to an optimality criterion $L = I - \gamma D$, where I is the mutual information between presynaptic input and postsynaptic output and D a measure of the distance of the mean firing rate of the neuron from its target rate. The parameter γ scales the importance of the information term I (postulate A) compared to the homeostatic term D (postulate B). It was shown that optimization of L by gradient ascent yields a synaptic update rule that shows sensitivity to correlations (see point 1 above) and input selectivity (see point 2 above), and depends on presynaptic firing rates (see point 5 above) (Toyoizumi et al., 2005a). However, while the learning rule in Toyoizumi et al. (2005a) showed some dependence on spike timing, it did not (without additional assumptions) have the typical features of spike-timing-dependent plasticity (STDP) as measured in vitro (point 6 above), and it exhibited, like earlier models (Bienenstock et al., 1982), spontaneous synaptic specialization, even for very weak input features, which is in contrast to point 3 above. In this earlier theoretical study, synaptic potentiation was artificially stopped at some upper bound $w^{max}$ (and synaptic depression was stopped at weight w = 0), so as to ensure that weights w stayed in a regime $0 \le w \le w^{max}$. In this letter, we make the more realistic assumption that strong weights are more likely to show depression than weaker ones but do not impose a hard upper bound. Similarly, we require that adaptation speed is slowed for very weak synapses but do not impose a hard bound at zero weight. We will show that with these assumptions, the resulting synaptic update rule shows properties of STDP (see point 6 above), is suitable for memory retention (see point 4 above), and leads to synaptic specialization when driven by strong input (see point 3 above), while keeping properties 1, 2, and 5, which were found in Toyoizumi et al. (2005a). To avoid hard upper bounds for the synapses, we use postulate C1 and add a term to the optimality criterion L that is proportional to $w^2$ (i.e., the square of the synaptic weight) and proportional to the presynaptic firing rate. This term comes with a negative sign, since a cost is associated with big weights. Hence, from our optimality viewpoint, synapses change so as to maximize the quantity

$$L = I - \gamma D - \lambda \Gamma, \qquad (2.3)$$

where I is the information to be maximized, D is a measure of the firing-rate mismatch to be minimized, and $\Gamma$ is the cost induced by strong synapses, also to be minimized. The factors γ and λ control the relative importance of the three terms. In other words, synapses adjust their weights so as to be able to transmit information while keeping the mean firing rate and synaptic weights at low values. Thus, our three postulates A, B, and C give rise to one unified optimality criterion L. We hypothesize that a significant part of findings regarding synaptic potentiation and depression can be conceptually understood as the synapse's attempt to optimize the criterion L.
The learning rule used for the update of the synaptic weights $w_j$ is derived from the objective function 2.3, $L = I - \gamma D - \lambda\Gamma$, which contains three terms. The first term is the mutual information between the ensemble of 100 input spike trains, spanning the interval of a single trial from 0 to T (the ensemble of all presynaptic trains is formally denoted by $X(T) = \{x_j(t) = \sum_{t_j^f} \delta(t - t_j^f) \mid j = 1, \ldots, 100,\ 0 \le t < T\}$), and the output spike train of the postsynaptic neuron over the same interval (denoted by $Y(T) = \{y(t) = \sum_{t^f_{post}} \delta(t - t^f_{post}) \mid 0 \le t < T\}$, where the $t^f_{post}$ represent output spike timings), that is,

$$I = \left\langle \log \frac{P(Y|X)}{P(Y)} \right\rangle_{Y,X}, \qquad (2.4)$$
where the angular brackets $\langle \cdot \rangle_{Y,X}$ denote averaging over all combinations of input and output spike trains.¹ Here, $P(Y|X)$ is the conditional probability density of our stochastic neuron model to generate a specific spike train Y with (one or several) spike times $\{t^f_{post}\}$ during a trial of duration T given 100 known input spike trains X. This conditional probability density is given as a product of the instantaneous probabilities $\rho(t^f_{post})$ of firing at the postsynaptic spike times $\{t^f_{post}\}$ and the probability of not firing elsewhere, that is,

$$P(Y|X) = \prod_{t^f_{post}} \rho(t^f_{post}) \, \exp\left(-\int_0^T \rho(t)\, dt\right). \qquad (2.5)$$
Similarly, $P(Y)$ is the probability of generating the same output spike train Y not knowing the input. Here "not knowing the input" implies that we have to average over all possible inputs so as to get the expected instantaneous firing density $\bar\rho(t)$ at time t. However, because of the EPSP suppression factor, the expected firing density will also depend on the last output spike before t. We therefore define

$$\bar\rho(t) = \langle \rho(t) \rangle_{X(t)|Y(t)}, \qquad (2.6)$$
that is, we average over the inputs but keep the knowledge of the previous output spikes $t^f_{post} < t$. $P(Y)$ is then given by a formula analogous to equation 2.5, but with ρ replaced by $\bar\rho$. Hence, given our neuron model, both $P(Y|X)$ and $P(Y)$ in equation 2.4 are well defined.

¹ From now on, when X or Y are without argument, we take implicitly X ≡ X(T) and Y ≡ Y(T), i.e., the spike trains over the full interval T.
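In discrete time ($\Delta t = 1$ ms), equation 2.5 becomes a product over bins, and its logarithm is easy to evaluate numerically. Below is a minimal sketch of this, assuming arrays `rho` (instantaneous rate in Hz) and `y` (0/1 spike indicators) from a simulation such as the one above; the $\log \Delta t$ term simply converts the density of equation 2.5 into a probability of the binned spike pattern.

```python
import numpy as np

def log_likelihood(rho, y, dt=1e-3):
    """Discretized log of eq. 2.5: sum over spike bins of log[rho(t) dt]
    minus the integral of rho over the trial."""
    rho, y = np.asarray(rho, float), np.asarray(y, int)
    return float(np.sum(y * np.log(rho * dt)) - np.sum(rho * dt))

# Toy usage: a constant 5 Hz process observed for 2 s in 1 ms bins
rng = np.random.default_rng(1)
rho = np.full(2000, 5.0)
y = (rng.random(2000) < rho * 1e-3).astype(int)
print(log_likelihood(rho, y))
```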
The information term I of equation 2.4 is the formal instantiation of postulate A. The second term is the homeostatic term

$$D = \left\langle \log \frac{P(Y)}{\tilde P(Y)} \right\rangle_{Y}, \qquad (2.7)$$
which compares the actual distribution of output spike trains $P(Y)$ with that of an ideal distribution $\tilde P(Y)$ generated by the same neuron firing at a target rate of $\tilde\rho = 5$ Hz (formula 2.5 with ρ replaced by $\tilde\rho$). Mathematically speaking, D is the Kullback-Leibler distance between the two distributions (Cover & Thomas, 1991), but in practice, we may think of D simply as a measure of the difference between actual and target firing rates (Toyoizumi et al., 2005a). The term D is our mathematical formulation of postulate B. The third term is the cost associated with strong synapses. We assume that the cost increases quadratically with the synaptic weights but that only synapses that have been activated in the past contribute to the cost. Hence, the mathematical formulation of postulate C1 yields a cost

$$\Gamma = \frac{1}{2} \sum_j w_j^2\, \langle n_j \rangle_X, \qquad (2.8)$$
where $n_j$ is the number of presynaptic spikes that have arrived at synapse j during the duration T of the interval under consideration. Cost terms that are quadratic in the synaptic weights are common in the theoretical literature (Miller & MacKay, 1994), but the specific dependence on presynaptic spiking induced by the factor $n_j$ in equation 2.8 is not. The dependence of $\Gamma$ on presynaptic spike arrival means that in our model, only activated synapses contribute to the cost. The specific formulation of $\Gamma$ is mainly due to theoretical reasons to be discussed below. The intuition is that activation of a synapse in the absence of any postsynaptic activity can weaken the synapse if the factor λ is sufficiently positive (see also Figure 3d). The restriction of the cost to previously activated synapses is reminiscent of synaptic tagging (Frey & Morris, 1997; Fonseca et al., 2004), although any relation must be seen as purely hypothetical. The three terms are given a relative importance by choosing γ = 0.1 (for a discussion of this parameter, see Toyoizumi et al., 2005a) and λ = 0.026 so as to achieve a baseline of zero in the STDP function (see appendix A).

2.3 Synaptic Update Rule. We optimize the synaptic weights by gradient ascent,

$$\Delta w_j = \alpha(w_j)\, \frac{\partial L}{\partial w_j}, \qquad (2.9)$$
with a weight-dependent update rate $\alpha(w_j)$. According to postulate C2, plasticity is reduced for very small weights. For simplicity, we chose

$$\alpha(w_j) = 4 \times 10^{-2}\, \frac{w_j^4}{w_j^4 + w_s^4},$$

where $w_s = 0.2$ mV; that is, learning slows for weak synapses with EPSP amplitudes around or less than 0.2 mV. Note that updates according to equation 2.9 are always uphill; however, because of the $w_j$ dependence of α, the ascent is not necessarily along the steepest gradient. Using the same mathematical arguments as in Toyoizumi et al. (2005a), we can transform the optimization by gradient ascent into a synaptic update rule. First, differentiating each term, we find

$$\frac{\partial I}{\partial w_j} = \left\langle \frac{1}{P(Y|X)} \frac{\partial P(Y|X)}{\partial w_j}\, \log \frac{P(Y|X)}{P(Y)} \right\rangle_{Y,X}, \qquad (2.10)$$

$$\frac{\partial D}{\partial w_j} = \left\langle \frac{1}{P(Y|X)} \frac{\partial P(Y|X)}{\partial w_j}\, \log \frac{P(Y)}{\tilde P(Y)} \right\rangle_{Y,X}, \qquad (2.11)$$

$$\frac{\partial \Gamma}{\partial w_j} = w_j\, \langle n_j \rangle_X. \qquad (2.12)$$
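Before continuing with the derivation, note that the soft bound of postulate C2 enters only through the prefactor $\alpha(w_j)$, which amounts to a one-line function; a quick sketch (names ours, reused in the larger sketch later in this section):

```python
import numpy as np

def alpha(w, w_s=0.2, a0=4e-2):
    """Weight-dependent update rate: near zero for EPSP amplitudes well
    below w_s = 0.2 mV, saturating at a0 for strong synapses."""
    w = np.asarray(w, float)
    return a0 * w**4 / (w**4 + w_s**4)

print(alpha([0.05, 0.2, 0.4, 4.0]))  # [~1.6e-04, 0.02, ~0.038, ~0.04]
```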
We will rewrite the terms appearing in equations 2.10 and 2.11 by introducing the auxiliary variables

$$c_j(t) = \frac{d\rho/du\,\big|_{u=u(t)}}{\rho(t)}\, \bigl[y(t) - \rho(t)\bigr] \int_0^\infty ds\, \epsilon(s)\, x_j(t-s) \qquad (2.13)$$

and

$$B^{post}(t) = y(t) \log \frac{\rho(t)}{\bar\rho(t)} - \bigl(\rho(t) - \bar\rho(t)\bigr) - \gamma \left[ y(t) \log \frac{\bar\rho(t)}{\tilde\rho} - \bigl(\bar\rho(t) - \tilde\rho\bigr) \right]. \qquad (2.14)$$

Using the definitions in equations 2.13 and 2.14, we find the derivative of the conditional probability density that appears in equations 2.10 and 2.11:

$$\frac{\partial P(Y|X)}{\partial w_j} = P(Y|X) \int_0^T c_j(t')\, dt' \qquad (2.15)$$

and

$$\log \frac{P(Y|X)}{P(Y)} - \gamma \log \frac{P(Y)}{\tilde P(Y)} = \int_0^T B^{post}(t)\, dt. \qquad (2.16)$$
As a first interpretation, we may say that $c_j$ represents the causal correlation between input and output spikes (corrected for the expected correlation),
and $B^{post}$ is a function of postsynaptic quantities, namely, the output spikes y, the current firing rate ρ via the membrane potential u, the average firing rate $\bar\rho$, and the target firing rate $\tilde\rho$. More precisely, $B^{post}$ compares the actual output with the expected output and, modulated by a factor γ, the expected output with the target. Hence, with the results from equations 2.10 to 2.16, the derivative of the objective function is written in terms of averaged quantities $\langle\cdot\rangle_{Y,X}$ as

$$\frac{\partial L}{\partial w_j} = \left\langle \int_0^T dt \left[ \int_0^T c_j(t')\, dt'\; B^{post}(t) - \lambda w_j x_j(t) \right] \right\rangle_{Y,X}. \qquad (2.17)$$
An important property of $c_j$ is that its average $\langle c_j \rangle_{Y|X}$ vanishes. On the other hand, the correlations between $c_j(t')$ and $B^{post}(t)$ are limited by the timescale $\tau_{AC}$ of the autocorrelation function of the output spike train. Hence, we can limit the integration to the relevant timescales without loss of generality and introduce an exponential cut-off factor with time constant $\tau_C > \tau_{AC}$:

$$C_j(t) = \lim_{\varepsilon \to +0} \int_0^{t+\varepsilon} c_j(t')\, e^{-(t-t')/\tau_C}\, dt'. \qquad (2.18)$$
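Since equation 2.18 is a causal exponential filter, $C_j$ can be computed online; differentiating with respect to t gives the equivalent leaky-integrator form (a standard identity, not spelled out in the text, which we use in the code sketch later in this section):

$$\frac{dC_j}{dt} = -\frac{1}{\tau_C}\, C_j(t) + c_j(t).$$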
With this factor $C_j$, we find a batch learning rule (i.e., with expectations over the input and output statistics on the right-hand side) of the form

$$\frac{\partial L}{\partial w_j} \approx \int_0^T dt\, \left\langle C_j(t)\, B^{post}(t) - \lambda w_j x_j(t) \right\rangle_{Y,X}. \qquad (2.19)$$
Finally, for a slow learning rate α and stationary input statistics, the system becomes self-averaging (i.e., expectations can be dropped due to automatic temporal averaging; Gerstner & Kistler, 2002), so that we arrive at the online gradient learning rule,

$$\frac{dw_j}{dt} = \alpha(w_j) \left[ C_j(t)\, B^{post}(t) - \lambda w_j x_j(t) \right]. \qquad (2.20)$$
See Figure 2 for an illustration of the dynamics of $C_j$ and $B^{post}$. The last term has the form of a "weight decay" term common in artificial neural networks (Hertz, Krogh, & Palmer, 1991) and arises from the derivative of the weight-dependent cost term $\Gamma$. The parameter λ is set such that $dw_j/dt = 0$ for large enough $|t^{pre} - t^{post}|$ in the STDP in vitro paradigm. A few steps of calculation (see appendix A) yield λ = 0.026. In our simulations, we take $\tau_C = 100$ ms for the cut-off factor in equation 2.18.
Figure 2: Illustration of the dynamics of $C_j$ (top), $B^{post}$ (middle), and $\Delta w/w = (w - w_{init})/w_{init}$ (bottom) during a single pair of pre- and postsynaptic spikes (indicated schematically at the very top). We always start from an initial weight $w_{init} = 4$ mV and induce postsynaptic firing at time $t^{post} = 0$. (a) Pre-before-post timing with $t^{pre} = -10$ ms (solid line) induces a large potentiation, whereas $t^{pre} = -50$ ms (dashed line) induces almost no potentiation. (b) Due to the EPSP suppression factor, a post-before-pre timing with $t^{pre} = 10$ ms (solid line) induces a large depression, whereas $t^{pre} = 50$ ms (dashed line) induces a smaller depression. The marks (circle, cross, square, and diamond) correspond to the weight change due to a single pair of pre- and postsynaptic spikes after weights have converged to their new values. Note that in Figure 3, the corresponding marks indicate the weight change after 60 pairs of pre- and postsynaptic spikes.
For a better understanding of the learning dynamics defined in equation 2.20, let us look more closely at Figure 2a. The time course $\Delta w/w$ of the potentiation has three components: first, a negative jump at the moment of the presynaptic spike induced by the weight decay term; second, a slow increase in the interval between pre- and postsynaptic spike times induced by $C_j(t)\, B^{post}(t) > 0$; and third, a positive jump immediately after the postsynaptic spike induced by the singularity in $B^{post}$ combined with a positive $C_j$. As it is practically difficult to calculate $\bar\rho(t) = \langle \rho(t) \rangle_{X(t)|Y(t)}$, we estimate $\bar\rho$ by the running average of the output spikes, that is,

$$\tau_{\bar\rho}\, \frac{d\bar\rho_{est}}{dt} = -\bar\rho_{est}(t) + y(t), \qquad (2.21)$$
with $\tau_{\bar\rho} = 1$ min. This approximation is valid if the characteristics of the stimulus and output spike trains are stationary and uncorrelated. To simulate in vitro STDP experiments, the initial value of $\bar\rho_{est}$ is set equal to the injected pulse frequency. Other, more accurate estimates of $\bar\rho_{est}$ are possible but lead to no qualitative change of results (data not shown). In the simulations of Figures 4 to 6, the initial synaptic strength is set to $w_{init} = 0.4 \pm 0.04$ mV.
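Collecting equations 2.13, 2.14, 2.18, 2.20, and 2.21, one time step of the online rule can be sketched as follows. This is our own minimal Python reading of the rule, not the authors' code: the filtered input trace of equation 2.13 is taken to be the membrane kernel $\exp(-s/\tau_m)$, the cut-off integral of equation 2.18 is implemented as the leaky integrator noted above, and delta functions are represented as `y/dt` and `x/dt` in their respective bins.

```python
import numpy as np

# Parameter values from the text; variable and function names are ours.
dt = 1e-3                        # 1 ms time step
tau_m, tau_C = 20e-3, 100e-3     # EPSP and cut-off time constants (s)
tau_bar = 60.0                   # running-average time constant (1 min)
gamma, lam = 0.1, 0.026          # weights of the D and cost terms
g, rho_tilde = 12.5, 5.0         # gain (Hz/mV) and target rate (Hz)

def alpha(w, w_s=0.2, a0=4e-2):
    # Weight-dependent update rate (postulate C2)
    return a0 * w**4 / (w**4 + w_s**4)

def learning_step(w, eps, C, rho_bar, x, y, rho):
    """One time step of the online rule, eq. 2.20.
    w: weights (mV); eps: filtered input traces; C: filtered correlations;
    rho_bar: rate estimate (Hz, must stay > 0); x: 0/1 presynaptic spikes
    (length N); y: 0 or 1 postsynaptic spike; rho: instantaneous rate
    (Hz, > 0 since rho_r = 1 Hz)."""
    eps = eps * np.exp(-dt / tau_m) + x              # input trace of eq. 2.13
    c = (g / rho) * (y / dt - rho) * eps             # c_j(t); drho/du = g by eq. 2.2
    C = C * np.exp(-dt / tau_C) + c * dt             # leaky integral, eq. 2.18
    rho_bar += (dt / tau_bar) * (y / dt - rho_bar)   # running average, eq. 2.21
    B = ((y / dt) * (np.log(rho / rho_bar) - gamma * np.log(rho_bar / rho_tilde))
         - (rho - rho_bar) + gamma * (rho_bar - rho_tilde))   # eq. 2.14
    w = w + alpha(w) * (C * B - lam * w * x / dt) * dt        # eq. 2.20
    return w, eps, C, rho_bar

# Example initialization (rho_bar must start positive; we use the target rate)
N = 100
w, eps, C, rho_bar = np.full(N, 0.4), np.zeros(N), np.zeros(N), rho_tilde
```

Note that each presynaptic spike then decrements the weight by $\alpha(w_j)\,\lambda\, w_j$, which produces the negative jump proportional to the EPSP amplitude described above.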
2.4 Stimulation Paradigm

2.4.1 Simulated STDP In Vitro Paradigm. For the simulations of Figure 3, spike timings $t^{pre} = t_j^f$ at synapse j and postsynaptic spike times $t^{post}$ are imposed with a given relative timing $t^{pre} - t^{post}$. For the calculation of the total STDP effect according to a typical in vitro stimulation paradigm, the pairing of pre- and postsynaptic spikes is repeated until 60 spike pairs have been accumulated. Spike pairs are triggered at a frequency of 1 Hz except for Figure 3c, where the stimulation frequency was varied.

2.4.2 Simulated Stochastic Spike Arrival. In most simulations, presynaptic spike arrival was modeled as Poisson spike input at either a fixed rate (homogeneous Poisson process) or a modulated rate (inhomogeneous Poisson process). For example, for the simulations in Figure 6, with a gaussian profile, spike arrival at synapse j is generated by an inhomogeneous Poisson process with the following characteristics. During a segment of 200 ms, the rate is fixed at $\nu_j = (\nu^{max} - \nu_0) \exp[-0.01\, d(j-k)^2] + \nu_0$, where $\nu_0 = 1$ Hz is the baseline firing rate and $d(j-k)$ is the difference between indices j and k. The value of k denotes the location of the maximum and was reset every 200 ms to a value chosen stochastically between 1 and 100. (As indicated in the main text, presynaptic neurons in Figure 6 were considered to have a ring topology, which has been implemented by evaluating the difference as $d(j-k) = \min\{|j-k|,\ 100 - |j-k|\}$.) However, in the simulations for Figure 5, input spike trains were not independent Poisson, but we included spike-spike correlations, as sketched in the example below. A correlation index of c = 0.2 implies that between a given pair of synapses, 20% of spikes have identical timing. More generally, for a given value of c within a group of synaptic inputs, 100c percent of spike arrival times are identical at an arbitrary pair of synapses within the group.
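A minimal sketch of the two stochastic stimulation paradigms just described (all names, and the specific $\nu^{max} = 30$ Hz, are our choices for the demo): correlated trains are built as the union of one shared and one private Poisson source, so that a fraction c of spike times is identical at any pair of synapses in the group, and the gaussian rate profile lives on a ring.

```python
import numpy as np

rng = np.random.default_rng(2)
dt = 1e-3

def correlated_poisson(n_syn, rate, c, n_steps):
    """0/1 spike array of shape (n_steps, n_syn). Spikes are the union of a
    shared Poisson source (rate c*rate) and private sources (rate (1-c)*rate),
    so a fraction c of spike times is identical at any pair of synapses."""
    common = rng.random(n_steps) < c * rate * dt
    private = rng.random((n_steps, n_syn)) < (1.0 - c) * rate * dt
    return (private | common[:, None]).astype(int)

def ring_rates(n_syn=100, k=50, nu_max=30.0, nu0=1.0):
    """Gaussian rate bump (Hz) centered on synapse k with ring topology;
    in the paper, k is redrawn at random every 200 ms."""
    j = np.arange(1, n_syn + 1)
    d = np.minimum(np.abs(j - k), n_syn - np.abs(j - k))
    return (nu_max - nu0) * np.exp(-0.01 * d**2) + nu0

spikes = correlated_poisson(n_syn=20, rate=10.0, c=0.2, n_steps=10_000)  # 10 s
print(f"empirical rate: {spikes.mean() / dt:.1f} Hz")
```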
Figure 3: The synaptic update rule of the model shares features with STDP. (a) STDP function (percentage change of EPSP amplitude as a function of $t^{pre} - t^{post}$) determined using 60 pre-and-post spike pairs injected at 1 Hz. The initial EPSP amplitudes are 4 mV (dashed line) and 6 mV (dotted line). Marks (circle, cross, square, and diamond) correspond respectively to $t^{pre} = -50$ ms, $t^{pre} = -10$ ms, $t^{pre} = 10$ ms, and $t^{pre} = 50$ ms, also depicted in Figure 2. (b) The percentage change in EPSP amplitude after 60 pre-and-post spike pairs injected at 1 Hz for pre-before-post timing ($t^{pre} - t^{post} = -10$ ms, solid line) and post-before-pre timing ($t^{pre} - t^{post} = +10$ ms, dashed line) as a function of initial EPSP amplitude. Our model results qualitatively resemble experimental data (see Figure 5 in Bi & Poo, 1998). (c) Frequency dependence of the STDP function: spike pairs are presented at frequencies of 0.5 Hz (dashed line), 1 Hz (dotted line), and 2 Hz (dot-dashed line). The STDP function exhibits only a weak sensitivity to the change in stimulation frequency. (d) STDP function for different choices of model parameters. The extension of the synaptic depression zone for pre-after-post timing ($t^{pre} - t^{post} > 0$) depends on the timescale $\tau_a$ of EPSP suppression (dot-dashed line, $\tau_a = 50$ ms; dashed line, $\tau_a = 25$ ms). The dotted line shows the STDP function in the absence of the weight-dependent cost term $\Gamma$. The STDP function exhibits a positive offset, indicating that without the cost term $\Gamma$, unpaired presynaptic spikes would lead to potentiation, a non-Hebbian form of plasticity.
3 Results

The mathematical formulation of postulates A, B, and C1 led to an optimality criterion L that was optimized by changing synaptic weights in an uphill direction. In order to include postulate C2, the adaptation speed was made to depend on the current value of the synaptic weight so that plasticity was significantly slowed for synapses with excitatory postsynaptic potentials (EPSPs) of amplitude less than 0.2 mV (see section 2 for details). As in a previous study based on postulates A and B (Toyoizumi et al., 2005a), the optimization of synaptic weights can be understood as a synaptic update rule that depends on presynaptic spike arrival, postsynaptic spike firing, the postsynaptic membrane potential, and the mean firing rate of the postsynaptic neuron. In addition, the synaptic update rule in this study included a term that decreases the synaptic weight upon presynaptic spike arrival by a small amount proportional to the EPSP amplitude (see section 2 for details). This term can be traced back to the additional weight-dependent cost term $\Gamma$ in equation 2.3 that accounts for postulate C1. In order to study the consequences of the synaptic update rule derived from postulates A, B, and C, we used computer simulations of a model neuron that received presynaptic spike trains at 100 synapses. Each presynaptic spike evoked an EPSP with exponential time course (time constant $\tau_m = 20$ ms). In order to account for dendritic interaction between somatic action potentials and postsynaptic potentials, the amplitude of EPSPs was suppressed immediately after postsynaptic spike firing (Froemke et al., 2005) and recovered with time constant $\tau_a = 50$ ms (see Figure 1a). As a measure of the weight of a synapse j, we used the EPSP amplitude $w_j$ at this synapse in the absence of EPSP suppression. With all synaptic parameters $w_j$ set to a fixed value, the model neuron fired stochastically with a mean firing rate $\bar\rho$ that increases with the presynaptic spike arrival rate (see Figure 1b), has a broad interspike interval distribution (see Figure 1c), and has an autocorrelation function with a trough of 10 to 50 ms that is due to reduced excitability immediately after a spike because of EPSP suppression (see Figure 1d).

3.1 The Learning Rule Exhibits STDP. In a first set of plasticity experiments, we explored the behavior of the model system under a simulated in vitro paradigm as used in typical STDP experiments (Bi & Poo, 1998). In order to study the influence of the pre- and postsynaptic activity on the changes of weights as predicted by our online learning rule in equation 2.20, we plotted in Figure 2 the postsynaptic factor $B^{post}$ and the correlation term $C_j$ that both appear on the right-hand side of equation 2.20, together with the induced weight change $\Delta w/w$ as a function of time. Indeed, the learning rule predicts positive weight changes when the presynaptic spike occurs 10 ms before the postsynaptic one and negative weight changes under reversed timing. For a comparison with experimental results, we used 60 pairs of pre- and postsynaptic spikes applied at a frequency of 1 Hz and recorded the total change $\Delta w$ in EPSP amplitude. The experiment is repeated with different spike timings, and the result is plotted as a function of the spike timing difference $t^{pre} - t^{post}$. As in experiments (Markram et al., 1997; Zhang, Tao, Holt, Harris, & Poo, 1998; Bi & Poo, 1998, 2001; Sjöström et al., 2001), we find that synapses are potentiated if presynaptic spikes occur about 10 ms before a postsynaptic action potential, but are depressed if the timing is reversed. Compared to synapses with amplitudes in the range of 1 or 2 mV, synapses that are exceptionally strong show a reduced effect of potentiation
for pre-before-post timing, or even depression (see Figures 3a and 3b), in agreement with experiments on cultured hippocampal neurons (Bi & Poo, 1998). The shape of the STDP function depends only weakly on the stimulation frequency (see Figure 3c), even though a significant reduction of the potentiation amplitude with increasing frequency can be observed. In a recent experimental study (Froemke et al., 2005), a strong correlation between the timescale of EPSP suppression (which was found to depend on dendritic location) and the duration of the long-term depression (LTD) part of the STDP function was observed. Since our model neuron had no spatial structure, we artificially changed the time constant of EPSP suppression in the model equations. We found that indeed only the LTD part of the STDP function was affected, whereas the LTP part remained unchanged (see Figure 3d). In order to study the influence of the weight-dependent cost term $\Gamma$ in our optimality criterion L, we systematically changed the parameter λ in equation 2.3. For λ = 0, the weight-dependent cost term has no influence, and because the postsynaptic firing rate is close to the desired rate, synaptic plasticity in our model is mainly controlled by information maximization. In this case, synapses with a reasonable EPSP amplitude of one or a few millivolts are always strengthened, even for post-before-pre timing (see Figure 3d, dashed line). This can be intuitively understood, since an increase of synaptic weight is always beneficial for information transmission except if spike arrival occurs immediately after a postsynaptic spike. In this case, the postsynaptic neuron is insensitive, so that no information can be transmitted. Nevertheless, information transmission is maximal in a situation where the presynaptic spike occurs just before the postsynaptic one. The weight-dependent cost term $\Gamma$ derived from postulate C is essential to shift the dashed line in Figure 3d to negative values so as to induce synaptic depression in our STDP paradigm. The optimal value of λ = 0.026, which ensures that for large spike timing differences $|t^{pre} - t^{post}|$ neither potentiation nor depression occurs, has been estimated from a simple analytical argument (see appendix A).

3.2 Both Unimodal and Bimodal Synapse Distributions Are Stable. Under random spike arrival with a rate of 10 Hz at all 100 synapses, synaptic weights show little variability, with a typical EPSP amplitude in the range of 0.4 mV. This unspecific pattern of synapses stays stable even if 20 out of the 100 synapses are subject to a common rate modulation between 1 and 30 Hz (see Figure 4a). However, if modulation of presynaptic firing rates becomes strong, the synapses rapidly develop a specific pattern with large values of weights at synapses with rate-modulated spike input and weak weights at those synapses that received input at fixed rates (synaptic specialization; see Figures 4a and 4b), making the neuron highly selective to input at one group of synapses (input selectivity). Thus, our synaptic update rule is capable of selecting strong features in the input, but also allows a stable, unspecific
pattern of synapses in case of weak input. This is in contrast to most other Hebbian learning rules, where unspecific patterns of synapses are unstable, so that synaptic weights move spontaneously toward their upper or lower bounds (Miller et al., 1989; Miller & MacKay, 1994; Gerstner, Kempter, van Hemmen, & Wagner, 1996; Kempter, Gerstner, & van Hemmen, 1999; Song, Miller, & Abbott, 2000). After induction of synaptic specialization by strong modulation of presynaptic input, we reduced the rate modulation back to the value that previously led to an unspecific pattern of synapses.
Figure 4: Potentiation and depression depend on the presynaptic firing rate. Twenty synapses of group 1 receive input with common rate modulation, while the 80 synapses of group 2 (synapse index 21–100) receive Poisson input at a constant rate of 10 Hz. The spike arrival rate in group 1 switches stochastically every 200 ms between a low rate $\nu_l = 1$ Hz and a high rate $\nu_h$ taken as a parameter. (a) Evolution of synaptic weights as a function of time for different amplitudes of rate modulation, that is, $\nu_h$ changes from 10 Hz during the first hour to 30 Hz, then to 50 Hz, 30 Hz, and back to 10 Hz. During the first 2 hours of stimulation, an unspecific distribution of synapses remains stable even though a slight decrease of weights in group 2 can be observed when the stimulus switches to $\nu_h = 30$ Hz. A specialization of the synaptic pattern with large weights for synapses in group 1 is induced during the third hour of stimulation and remains stable thereafter. (b) Top: Mean synaptic weights (same data as in a) of group 1 ($\bar w_1$, blue line) and group 2 ($\bar w_2$, green line). Bottom: The stimulation paradigm $\nu_h$ as a function of time. Note that at $\nu_h = 30$ Hz (second and fourth hour of stimulation), both an unspecific pattern of synapses with little difference between $\bar w_1$ and $\bar w_2$ (second hour, top) and a highly specialized pattern (fourth hour, top, large difference between solid and dashed lines) are possible. (c) The value of the objective function L (average value per second of time) in color code as a function of the mean synaptic weight in group 1 (y-axis, $\bar w_1$) and group 2 (x-axis, $\bar w_2$) during stimulation with $\nu_h = 30$ Hz. Two maxima can be perceived: a broad maximum for the unspecific synapse pattern ($\bar w_2 \approx \bar w_1 \approx 0.4$ mV) and a pronounced elongated maximum for the specialized synapse pattern ($\bar w_2 \approx 0$; $\bar w_1 \approx 0.8$ mV). The dashed line indicates a one-dimensional section (see d) through the two-dimensional surface. (d) Objective function L as a function of the difference $\bar w_2 - \bar w_1$ between the mean synaptic weights in groups 2 and 1 along the line indicated in c. For $\nu_h = 30$ Hz (dashed, same data as in c), two maxima are visible: a broad maximum at $\bar w_2 - \bar w_1 \approx 0$ and a narrow but higher maximum corresponding to the specialized synapse pattern to the very left of the graph. For $\nu_h = 50$ Hz (dotted line), the broad maximum at $\bar w_2 - \bar w_1 \approx 0$ disappears, and only the specialized synapse pattern remains, whereas for $\nu_h = 10$ Hz (solid line), the broad maximum of the unspecific synapse pattern dominates.
We found that the strong synapses remained strong and weak synapses remained weak; synaptic specialization was stable against a change in the input (see Figure 4b). This result shows that synaptic dynamics exhibits hysteresis, which is an indication of bistability: for the same input, both an unspecific pattern of synapses and synaptic specialization are stable solutions of synaptic plasticity under our learning rule. Indeed, under rate modulation between 1 and 30 Hz for 20 out of the 100 synapses, the objective function L shows two local maxima (see Figure 4c): a sharp maximum corresponding to synaptic specialization (mean EPSP amplitude about 0.8 mV for synapses receiving rate-modulated input and less than 0.1 mV for synapses receiving constant input) and a broader but slightly lower maximum where both groups of synapses have a mean EPSP amplitude in the range of 0.3 to 0.5 mV (see appendix B for details on the method). In additional simulations, we confirmed that both the unspecific pattern of synapses and the selective pattern representing synaptic specialization remained stable over several hours of continued stimulation with rate-modulated input (data not shown). Bistability of selective and unspecific synapse patterns was consistently observed for rates modulated between 1 and 10 Hz or between 1 and 30 Hz, but the unspecific synapse pattern was unstable if the rate was modulated between 1 and 50 Hz, consistent with the weight dependence of our objective function L (see Figure 4d).
3.3 Retention of Synaptic Memories. In order to study synaptic memory retention with our learning rule, we induced synaptic specialization by stimulating 20 out of the 100 synapses with correlated spike input (spike-spike correlation index c = 0.2; see section 2.4 for details). The remaining 80 synapses received uncorrelated Poisson spike input. The mean firing rate (10 Hz) was identical at all synapses. After 60 minutes of correlated input at the group of 20 synapses, the stimulus was switched to uncorrelated spike input at the same rate. We studied how well synaptic specialization was maintained as a function of time after induction (see Figures 5a and 5b). Synaptic specialization was defined by a bimodality index that compared the distribution of EPSP amplitudes at synapses that received correlated input with those receiving uncorrelated input. For each of the two groups of synapses, we calculated the mean w̄ = ⟨w_j⟩ and the variance σ² = ⟨[w_j − w̄]²⟩: w̄_A and σ_A² for the group of synapses receiving correlated input and w̄_B and σ_B² for those receiving uncorrelated input. We then approximated the two distributions by gaussian functions. The bimodality index depends on the overlap between the two gaussians and is given by

b = \frac{1}{2}\left[\operatorname{erf}\!\left(\frac{\bar w_A - \hat s}{\sqrt{2}\,\sigma_A}\right) + \operatorname{erf}\!\left(\frac{\hat s - \bar w_B}{\sqrt{2}\,\sigma_B}\right)\right],

where erf(x) = (2/√π) ∫_0^x exp(−t²) dt is the error function and ŝ is one of the two crossing points of the two gaussians such that w̄_B < ŝ < w̄_A.

The two distributions (strong and weak synapses) started to separate within the first 5 minutes and remained well separated even after the correlated memory-inducing stimulus was replaced by a random stimulus (see Figure 5b). In order to study how synaptic memory retention depended on the induction paradigm, the experiment was repeated with different values of the correlation index c that characterizes the spike-spike correlations during the induction period. For correlations c < 0.1, the two synaptic distributions are not well separated at the end of the induction period (bimodality index < 0.9), but they are well separated for c ≥ 0.15 (see Figure 5c). Good separation at the end of the induction period alone is, however, not sufficient to guarantee retention of synaptic memory, since for the synaptic distribution induced by stimulation with c = 0.15, specialization breaks down at t = 80 min, that is, after only 20 minutes of memory retention (see Figure 5d). On the other hand, for c = 0.2 and larger, synaptic memory is retained over several hours (only the first hour is plotted in Figure 5d). The long duration of synaptic memory in our model can be explained by the reduced adaptation speed of synapses with weights close to zero (postulate C2). If weak synapses change only slowly because of reduced adaptation speed, strong synapses must stay strong because of homeostatic processes that keep the mean activity of the postsynaptic neuron close to a target value. Moreover, the terms in the online learning rule derived from information maximization favor the bimodal distribution. Reduced adaptation speed of weak synapses could be caused by a cascade of
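As an illustration, the bimodality index above can be computed in a few lines. The following sketch is a reconstruction of the described procedure, not the authors' code; it assumes the two gaussian densities actually cross between the two group means (the typical case once the groups have separated), and locates the crossing point ŝ with a root finder.

```python
import numpy as np
from scipy.special import erf
from scipy.optimize import brentq

def bimodality_index(w_A, w_B):
    """Bimodality index b for two groups of synaptic weights (section 3.3).

    Each group is approximated by a gaussian with the group's empirical
    mean and standard deviation; s_hat is the crossing point of the two
    densities with mean_B < s_hat < mean_A.
    """
    m_A, s_A = np.mean(w_A), np.std(w_A)
    m_B, s_B = np.mean(w_B), np.std(w_B)

    def density(x, m, s):
        return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2.0 * np.pi))

    # Crossing point of the two gaussians between the two means; assumes
    # the densities cross there (true for well-separated distributions).
    s_hat = brentq(lambda x: density(x, m_A, s_A) - density(x, m_B, s_B),
                   m_B, m_A)
    return 0.5 * (erf((m_A - s_hat) / (np.sqrt(2.0) * s_A))
                  + erf((s_hat - m_B) / (np.sqrt(2.0) * s_B)))

# Example: well-separated groups give b close to one.
rng = np.random.default_rng(0)
print(bimodality_index(rng.normal(0.8, 0.1, 20), rng.normal(0.05, 0.03, 80)))
```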
Figure 5: Synaptic memory induced by input with spike-spike correlations. (a) Evolution of 100 synaptic weights (vertical axis) as a function of time. During the first 60 minutes, synapses of group A (j = 1, . . . , 20) receive Poisson stimulation at 10 Hz with correlated spike input (c = 0.2), and those of group B (j = 21, . . . , 100) receive uncorrelated spike input at 10 Hz. (b) Distribution of the EPSP amplitudes across the 100 synapses after t = 5 min (dotted line), t = 60 min (solid line), and t = 120 min (dot-dashed line). The red lines denote group A, the blue ones group B. (c) Mean EPSP amplitude of group A (solid red line) and group B (dashed blue line) at t = 60 min for different values of the correlation c of the input applied to group A. (d) Bimodality index b of the two groups of weights as a function of time. The memory induction by correlated spike input to group A stops at t = 60 min. Memory retention is studied during the following 60 minutes. A bimodality index close to one implies, as for the case with c = 0.2, that synaptic memory is well retained.
intracellular biochemical processing stages with different time constants, as suggested by Fusi, Drew, & Abbott (2005). Thus our synaptic update rule allows for retention of synaptic memories over timescales that are significantly longer than the memory induction time, as necessary for
any memory system. Nevertheless, synaptic memory in our model will eventually decay if random firing of pre- and postsynaptic neurons persists, in agreement with experimental results (Abraham, Logan, Greenwood, & Dragunow, 2002; Zhou, Tao, & Poo, 2003). We note that in the absence of presynaptic activity, the weights remain unchanged, since the decay of synaptic weights is conditioned on presynaptic spike arrival (see equation 2.20).

3.4 Receptive Field Development. Synaptic plasticity is thought to be involved not only in memory (Hebb, 1949), but also in the development of cortical circuits (Hubel & Wiesel, 1962; Katz & Shatz, 1996; von der Malsburg, 1973; Bienenstock et al., 1982; Miller et al., 1989) and, possibly, cortical reorganization (Merzenich, Nelson, Stryker, Cynader, Schoppmann, & Zook, 1984; Buonomano & Merzenich, 1998). To study how our synaptic update rule would behave during development, we used a standard paradigm of input selectivity (Yeung, Shouval, Blais, & Cooper, 2004), which is considered to be a simplified scenario of receptive field development. Our model neuron was stimulated by a gaussian firing rate profile spanning the 100 input synapses (see Figure 6a). The center of the gaussian was shifted every 200 ms to an arbitrarily chosen presynaptic neuron. In order to avoid border effects, neuron number 100 was considered a neighbor of neuron number 1; that is, we can visualize the presynaptic neurons as being located on a ring. Nine postsynaptic neurons with slightly different initial values of synaptic weights received identical input from the same set of 100 presynaptic neurons.

During 1 hour of stimulus presentation, six of the nine neurons developed synaptic specialization, leading to input selectivity. The optimal stimulus for these six neurons varies (see Figures 6b and 6c), so that any gaussian stimulus at an arbitrary location excites at least one of the postsynaptic neurons. In other words, the six postsynaptic neurons have developed input selectivity with different but partially overlapping receptive fields. The distribution of synaptic weights for the selective neurons is bimodal, with a first peak for very weak synapses (EPSP amplitudes less than 0.1 mV) and a second peak around EPSP amplitudes of 0.6 mV (see Figure 6d); the amplitude distribution of the unselective neurons is broader, with a single peak at around 0.4 mV.

The number of postsynaptic neurons showing synaptic specialization depends on the total stimulation time and the strength of the stimulus. If the stimulation time is extended to 3 hours instead of 1 hour, all postsynaptic neurons become selective using the same stimulation parameters as before. However, if the maximal presynaptic firing rate at the center of the gaussian is reduced to 40 Hz instead of 50 Hz, only six of the nine neurons are selective after 3 hours of stimulation, and with a further reduction of the maximal rate to 30 Hz, only a single neuron is selective after 3 hours of stimulation (data not shown). We hypothesize that the coexistence of unselective and selective
Figure 6: The synaptic update rule leads to input selectivity of the postsynaptic neuron. (a) Gaussian firing rate profile across the 100 presynaptic neurons. The center of the gaussian is shifted randomly every 200 ms. Presynaptic neurons fire stochastically and send their spikes to nine postsynaptic neurons. (b) Evolution of synaptic weights of the nine postsynaptic neurons. Some neurons become specialized for a certain input pattern in the early phase of learning, others become specialized later, and the last three neurons have not yet become specialized. Since the input spike trains are identical for all nine neurons, the specialization is due to noise in the spike generator of the postsynaptic neurons. (c) Final synaptic weight values of the nine output neurons after 1 hour of stimulus presentation. (d) The distribution of EPSP amplitudes after 1 hour of stimulation for (top) the specialized output neurons 1, 2, . . . , 6; (middle) the nonspecialized neurons 7, 8, 9; (bottom) all nine output neurons.
neurons during development could explain the broad distribution of EPSP amplitudes seen in some experiments (e.g., Sjöström et al., 2001, in rat visual cortex). For example, if we sample the synaptic distribution across all nine postsynaptic cells, we find the distribution shown at the bottom of Figure 6d. If the number of unspecific neurons were higher, the relative importance of synapses with EPSP amplitudes of less than 0.1 mV would diminish. If the number of specialized neurons increased, the distribution would turn into a clear-cut bimodal one, which would be akin to
sampling an ensemble of two-state synapses with all-or-none potentiation on a synapse-by-synapse basis (Petersen, Malenka, Nicoll, & Hopfield, 1998, in rat hippocampus).

4 Discussion

4.1 What Can We and What Can We Not Expect from Optimality Models? Optimality models can be used to clarify concepts, but they are unable to make specific predictions about molecular implementations. In fact, the synaptic update rule derived in this article shares functional features with STDP and classical long-term potentiation (LTP), but it is blind with respect to interesting questions such as the role of NMDA, kainate, endocannabinoid, or CaMKII in the induction and maintenance of potentiation and depression (Bliss & Collingridge, 1993; Bortolotto, Lauri, Isaac, & Collingridge, 2003; Frey & Morris, 1997; Lisman, 2003; Malenka & Nicoll, 1993; Sjöström, Turrigiano, & Nelson, 2004). If molecular mechanisms are the focus of interest, detailed mechanistic models of synaptic plasticity (Senn, Tsodyks, & Markram, 2001; Yeung et al., 2004) should be preferred. On the other hand, the mere fact that similar forms of LTP or LTD seem to be implemented across various neural systems by different molecular mechanisms leads us to speculate that common functional roles of synapses are potentially more important for understanding synaptic dynamics than the specific way that these functions are implemented. Ideally, optimality approaches such as the one developed in this article should be helpful to put seemingly diverse experimental or theoretical results into a coherent framework. We have listed in section 1 a couple of points, partially linked to experimental results and partially linked to earlier theoretical investigations. Our aim has been to connect these points and trace them back to a small number of basic principles. Let us return to our initial list and discuss the points in the light of the results of the preceding section.

4.2 Correlations. Hebb (1949) postulated an increase in synaptic coupling in case of repeated coactivation of pre- and postsynaptic neurons as a useful concept for memory storage in recurrent networks. In our optimality framework defined by equation 2.3, correlation-dependent learning is not imposed explicitly but arises from the maximization of information transmission between pre- and postsynaptic neurons. Indeed, information transmission is possible only if there are correlations between pre- and postsynaptic neurons. Information transmission is maximized if these correlations are increased. Gradient ascent of the information term hence leads to a synaptic update rule that is sensitive to correlations between pre- and postsynaptic neurons (see equation 2.20). An increase of synaptic weights enhances these correlations and maximizes information transmission. We emphasize that in contrast to Hebb (1949), we do not invoke memory
formation and recall as a reason for correlation dependence, but information transmission. Similar to other learning rules (Linsker, 1986; Oja, 1982), the sensitivity of our update rule to correlations between pre- and postsynaptic neurons gives the synaptic dynamics a sensitivity to correlations in the input, as demonstrated in Figures 4 to 6.

4.3 Input Selectivity. During cortical development, cortical neurons develop input selectivity, typically quantified as the width of receptive fields (Hubel & Wiesel, 1962). As illustrated in the scenario of Figure 6, our synaptic update rule shows input selectivity and hence stands in the research tradition of many other studies (von der Malsburg, 1973; Bienenstock et al., 1982; Miller et al., 1989). Input selectivity in our model arises through the combination of the correlation sensitivity of synapses discussed above with the homeostatic term D in equation 2.3 (Toyoizumi et al., 2005a). Since the homeostatic term keeps the mean rate of the postsynaptic neuron close to a target rate, it leads effectively to a normalization of the total synaptic input, similar to the sliding threshold mechanism in the Bienenstock-Cooper-Munro rule (Bienenstock et al., 1982). Normalization of the total synaptic input through firing rate stabilization has also been seen in previous STDP models (Song et al., 2000; Kempter et al., 1999; Kempter, Gerstner, & van Hemmen, 2001). Its effect is similar to explicit normalization of synaptic weights, a well-known mechanism to induce input selectivity (von der Malsburg, 1973; Miller & MacKay, 1994), but our model does not need an explicit normalization step. So far we have studied only a single or a small number of independent postsynaptic neurons, but we expect that, as in many other studies (e.g., Erwin, Obermayer, & Schulten, 1995; Song & Abbott, 2001; Cooper et al., 2004), our synaptic update rule would yield feature maps if applied to a network of many weakly interacting cortical neurons.

4.4 Unimodal versus Bimodal Synapse Distributions. In several previous models of rate-based or spike-timing-based synaptic dynamics, synaptic weights always evolved toward a bimodal distribution, with some synapses close to zero and others close to maximal weight (Miller et al., 1989; Miller & MacKay, 1994; Kempter et al., 1999; Song & Abbott, 2001; Toyoizumi et al., 2005a). Thus, synapses specialize on certain features of the input, even if the input has weak or no correlation at all, which seems questionable from a functional point of view and disagrees with experimentally found distributions of EPSP amplitudes in rat visual cortex (Sjöström et al., 2001). As shown in Figures 4 to 6, an unspecific pattern of synapses with a broad distribution of EPSP amplitudes is stable with our synaptic update rule if correlations in the input are weak. Synaptic specialization (bimodal distribution) develops only if synaptic inputs show a high degree of correlations on the level of spikes or firing rates. Thus, neurons specialize only on highly significant input features, not in response to noise. A
similar behavior was noted in two recent models of spike-timing-dependent plasticity (Gütig, Aharonov, Rotter, & Sompolinsky, 2003; Yeung et al., 2004). Going beyond those studies, we also demonstrated stability of the specialized synapse distribution over some time, even if the amount of correlation through rate modulation is reduced after induction of synaptic specialization (see Figure 4). Thus, for the same input characteristics, our model can show a unimodal or bimodal distribution of synaptic weights, depending on the stimulation history. Moreover, we note that in the state of bimodal distribution, the EPSP amplitudes at the depressed synapses are so small that in an experimental setting, they could easily remain undetected or be classified as a "silent" synapse (Kullmann, 2003). A large proportion of silent synapses has previously been shown to be consistent with optimal memory storage (Brunel, Hakim, Isope, Nadal, & Barbour, 2004). Furthermore, our results show that in a given population of cells, neurons with specialized synapses can coexist with others that have a broad and unspecific synapse distribution. We speculate that these nonspecialized neurons could then be recruited later for new stimuli, as hypothesized in earlier models of neural networks (Grossberg, 1987). In passing, we note that in models of synaptic plasticity with multiplicative weight dependence, the distribution of weights is always unimodal (Rubin, Lee, & Sompolinsky, 2001; van Rossum, Bi, & Turrigiano, 2000), but in this case, retention of synaptic memories is difficult.

4.5 Synaptic Memory. In the absence of presynaptic input, synapses in our model do not change significantly. Furthermore, our results show that even in the presence of random pre- and postsynaptic firing activity, synaptic memories can be retained over several hours, although a slow decay occurs. Essential for long-term memory maintenance under random spike arrival is the reduced adaptation speed for small values of synaptic weights, as formulated in postulate C2. This postulate is similar in spirit to a recent theory of Fusi et al. (2005), with two important differences. First, we work with a continuum of synaptic states (characterized by the value of w), whereas Fusi et al. assume a cascade of a finite number of discrete internal synaptic states where transitions are unidirectional and characterized by different time constants. The scheme of Fusi et al. guarantees slow (i.e., not exponential) decay of memories, whereas in our case, decay is always exponential, even if the time constant is weight dependent. Second, Fusi et al. use a range of different time constants for both the potentiated and the unpotentiated state, whereas in our model, it is sufficient to have a slower time constant for the unpotentiated state only. If the change of unpotentiated synapses is slow compared to homeostatic regulation of the mean firing rate, then the maintenance of the strong synapses is given by homeostasis (and also supported by information maximization). The fact that our model has been formulated in terms of a continuum of synaptic weights was for convenience only. Alternatively, it is conceivable
to define a number of internal synaptic states that give rise to binary, or a small number of, synaptic weight values (Fusi, 2002; Fusi et al., 2005; Abarbanel, Talathi, Gibb, & Rabinovich, 2005). The actual number of synaptic states is unknown, with conflicting evidence (Petersen et al., 1998; Lisman, 2003).

4.6 Rate Dependence. Our results show that common rate modulation in one group of synapses strengthens these synapses if the modulation amplitude is strong enough. In contrast, an increase of rates to a fixed value of 40 Hz (without modulation) in one group of synapses while another group of synapses receives background firing at 10 Hz does not lead to synaptic specialization, only to a minor readjustment of weights (data not shown). For a comparison with experimental results, it is important to note that rate dependence is typically measured with extracellular stimulation of presynaptic pathways. We assimilate repeated extracellular stimulation with a strong and common modulation of spike arrival probability at one group of synapses (as opposed to an increased rate of a homogeneous Poisson process). Under this interpretation, our results are qualitatively consistent with experiments.

4.7 STDP. Our results show that our synaptic update rule shares several features with STDP as found in experiments (Markram et al., 1997; Bi & Poo, 1998, 2001; Sjöström et al., 2001). The timescale of the potentiation part of the STDP function depends in our model on the duration of EPSPs. The timescale of the depression part is determined by the duration of EPSP suppression, in agreement with experiments (Froemke et al., 2005). Our model shows that the relative importance of LTP and LTD depends on the initial value of the synaptic weight in a way similar to that found in experiments (Bi & Poo, 1998). However, for EPSP amplitudes between 0.5 and 2 mV, the LTP part clearly dominates over LTD in our model, which seems to have less experimental support. Also, the frequency dependence of STDP in our model is less pronounced than in experiments. Models of STDP have previously been formulated on a phenomenological level (Gerstner et al., 1996; Song et al., 2000; Gerstner & Kistler, 2002) or on a molecular level (Lisman, 2003; Senn et al., 2001; Yeung et al., 2004). Only recently have models derived from optimality concepts moved to the center of interest (Toyoizumi et al., 2005a; Toyoizumi, Pfister, Aihara, & Gerstner, 2005b; Bohte & Mozer, 2005; Chechik, 2003). There are important differences between the present model and the existing "optimal" models. Chechik (2003) used information maximization but limited his approach to static input patterns, while we consider arbitrary inputs. Bell and Parra (2005) minimize output entropy, and Bohte and Mozer (2005) maximize spike reliability, whereas we maximize the information between full input and output spike trains. None of these studies considered optimization under homeostatic and maintenance cost constraints.
After we introduced the homeostatic constraint in a previous study, which gave rise to a learning rule with several interesting properties (Toyoizumi et al., 2005a), we realized that this model did not exhibit properties of STDP in an in vitro situation without some additional assumptions. Indeed, as shown in Figure 3d, the STDP function derived from information maximization alone exhibits no depression in the absence of an additional weight-dependent cost term in the optimality function. The weight-dependent cost term introduced in this letter hence plays a crucial role in STDP, since it shifts the STDP function to more negative values.

4.8 How Realistic Is a Weight-Dependent Cost Term? The weight-dependent cost term in the optimality criterion L depends quadratically on the value of the weights of all synapses converging onto the same postsynaptic neuron. This turns out to be equivalent to a "decay" term in the synaptic update (see the last term on the right-hand side of equation 2.20). Such decay terms are common in the theoretical literature (Bienenstock et al., 1982; Oja, 1982), but the question is whether such a decay term (leading to a slow depression of synapses) is realistic from a biological point of view. We emphasize that the decay term in our synaptic update rule is proportional to presynaptic activity. Thus, in contrast to existing models in the theoretical literature (Bienenstock et al., 1982; Oja, 1982), a synapse that receives no input is protected against slow decay. The specific form of the "decay" term considered in this article was such that synaptic weights decreased with each spike arrival, but presynaptic activity could also be represented in the decay term by the mean firing rate rather than spike arrival, with no qualitative changes to the results. An important aspect of our cost term is that only synapses that have recently been activated are at risk regarding weight decay. We speculate that the weight-dependent cost term could, in a loose sense, be related to the number of plasticity factors that synapses require and compete for during the first hours of synaptic maintenance (Fonseca et al., 2004). According to the synaptic tagging hypothesis (Frey & Morris, 1997), only synapses that have been activated in the recent past compete for plasticity factors, while unpotentiated synapses do not suffer from decay (Fonseca et al., 2004). We emphasize that such a link of our cost term to the competition for plasticity factors is purely hypothetical. Many relevant details of tagging and competition for synaptic maintenance are omitted in our approach.

4.9 Predictions and Experimental Tests. In order to achieve synaptic memory that is stable over several hours, the reduced adaptation speed for weak synapses formulated in postulate C2 turns out to be essential. Thus, an essential assumption of our model is testable: for synapses with extremely
small EPSP amplitudes, in particular silent synapses, the induction of both LTP and LTD should require stronger stimulation, or stimulation sustained over longer times, compared to synapses of average strength. This aspect is distinct from models (Gütig et al., 2003) that postulate for weak synapses a reduced adaptation speed for depression only, but a maximal effect of potentiation. Thus, comparison of LTP induction for silent synapses (Kullmann, 2003) with that for average synapses should allow differentiating between the two models. In an alternative formulation, synaptic memory could also be achieved by making strong synapses resistant to further changes. As an aside, we note that the model of Fusi et al. (2005) assumes a reduced speed of transition between several internal synaptic states, so that a transition would not necessarily be visible as a change in synaptic weight.

A second test of our model concerns the pattern of synaptic weights converging on the same postsynaptic neuron. Our results suggest that early in development, most neurons would show an unspecific synapse pattern, that is, a distribution of EPSP amplitudes with a single but broad peak, whereas later, a sizable fraction of neurons would show a pattern of synaptic specialization with some strong synapses and many silent ones, that is, a bimodal distribution of EPSP amplitudes. Ideally, the effect would be seen by scanning all the synapses of individual postsynaptic neurons; it remains to be seen if modern imaging and staining methods will allow doing this. Alternatively, by electrophysiological methods, distributions of synaptic strengths could be built up by averaging over many synapses on different neurons (Sjöström et al., 2001). In this case, our model would predict that during development, the histogram of EPSP amplitudes would change in two ways (see Figure 6d): (1) the number of silent synapses increases, so that the amplitude of the sharp peak at small EPSP amplitude grows, and (2) the location of the second, broader peak shifts to larger values of the EPSP amplitudes. Furthermore, and in contrast to other models where the unimodal distribution is unstable, the transition to a bimodal distribution depends in our model on the stimulation paradigm.

4.10 Limitations and Extensions. We emphasize that properties of our synaptic update rule have so far been tested only for single neurons in an unsupervised learning paradigm. Extensions are possible in several directions. First, instead of single neurons, a large, recurrent network could be considered. This could, on one side, further our understanding of the model properties in the context of cortical map development (Erwin et al., 1995), and on the other side, scrutinize the properties of the synaptic update rule as a functional memory in recurrent networks (Amit & Fusi, 1994). Second, instead of unsupervised learning, where the synaptic update rule treats all stimuli alike whether they are behaviorally relevant or not, a reward-based learning scheme could be considered (Dayan & Abbott, 2001; Seung, 2003; Pfister et al., 2006). Behaviorally relevant situations can be taken into
account by optimizing reward instead of information transmission (Schultz, Dayan, & Montague, 1997).

Appendix A: Determination of the Parameter λ

The parameter λ is set to give Δw_j = 0 for large enough |t_pre − t_post| in a simulated STDP in vitro paradigm. In order to find the appropriate value of λ, we separately consider the effects of a presynaptic spike and a postsynaptic one, which is possible since they are assumed to occur at a large temporal distance. Since a postsynaptic spike alone does not change synaptic strength (C_j(t) = 0, always), we choose a λ that gives no synaptic change when a presynaptic spike alone is induced. For a given presynaptic spike at t_j^f, we have

C_j(t) = -g \int_0^t \epsilon\big(t' - t_j^f\big)\, e^{-(t-t')/\tau_m}\, dt' = -g\, \frac{\tau_C \tau_m}{\tau_C - \tau_m} \left( e^{-(t - t_j^f)/\tau_m} - e^{-(t - t_j^f)/\tau_C} \right).   (A.1)

Since ρ̃ ≈ ρ_r and ρ̄ ≈ ρ_r in this in vitro setting, the factor B^post in equation 2.14 is approximated as

B^{\mathrm{post}}(t) \approx -w_j\, g\, e^{-(t - t_j^f)/\tau_m}.   (A.2)

Hence, we find the effect of a presynaptic spike as

\Delta w = \int_0^T \frac{dw_j}{dt}\, dt = w_j\, g^2\, \frac{\tau_C \tau_m}{\tau_C - \tau_m} \left( \frac{\tau_C \tau_m}{\tau_C + \tau_m} - \frac{\tau_m}{2} \right) - \lambda w_j.   (A.3)

The condition of no synaptic change gives λ = g² [τ_C τ_m/(τ_C − τ_m)] [τ_m τ_C/(τ_m + τ_C) − τ_m/2]. We used this λ in the numerical code.
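As a quick numerical sanity check of equation A.3, the following snippet verifies that, with λ chosen as above, the net weight change from an isolated presynaptic spike vanishes. The parameter values are purely illustrative placeholders, not the ones used in the paper.

```python
# Numerical check of equation A.3: with lambda set to the value derived above,
# the net weight change Delta_w caused by an isolated presynaptic spike is zero.
g, tau_m, tau_C = 1.0, 0.01, 1.0  # hypothetical values (seconds); not from the paper

prefactor = tau_C * tau_m / (tau_C - tau_m)
bracket = tau_m * tau_C / (tau_m + tau_C) - tau_m / 2.0
lam = g ** 2 * prefactor * bracket  # condition of no synaptic change

w_j = 0.5  # arbitrary synaptic weight
delta_w = w_j * g ** 2 * prefactor * bracket - lam * w_j
assert abs(delta_w) < 1e-12  # equation A.3 evaluates to zero by construction
```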
Appendix B: Weight-Dependent Evaluation of the Optimality Criterion

In Figures 4c and 4d, the optimality criterion has been evaluated as a function of an artificial weight distribution. Specifically, values of synaptic weights have been chosen stochastically from two gaussian distributions, with mean w̄1 and standard deviation σ1 for group 1 and mean w̄2 and standard deviation σ2 for group 2. In order to account for differences in standard deviations due to the
weight-dependent update rate α(w), we chose σ(w̄) = 0.1 mV · w̄⁴/(w_s⁴ + w̄⁴), which gives a variance of synaptic weights in both groups consistent with the variance seen in Figure 4a. For a fixed set of synaptic weight values, the network is simulated during a trial time of 30 minutes while the synaptic update rule is turned off, and the objective function L defined in equation 2.3 is evaluated using ρ̄_est from equation 2.21. The result is divided by the trial time and plotted in Figures 4c and 4d in units of s⁻¹. The mesh size of mean synaptic strength is 0.04 mV. The one-dimensional plot in Figure 4d is taken along the direction w̄2 = 0.8 mV − w̄1.

Acknowledgments

This research is supported by the Japan Society for the Promotion of Science and Grant-in-Aid for JSPS Fellows 03J11691 (T.T.), the Swiss National Science Foundation 200020-108097/1 (J.P.P.), and Grant-in-Aid for Scientific Research on Priority Areas 17022012 from MEXT of Japan (K.A.).

References

Abarbanel, H., Talathi, S., Gibb, L., & Rabinovich, M. (2005). Synaptic plasticity with discrete state synapses. Physical Review E, 72, art. no. 031917.
Abraham, W., Logan, B., Greenwood, J., & Dragunow, M. (2002). Induction and experience-dependent consolidation of stable long-term potentiation lasting months in the hippocampus. J. Neuroscience, 22, 9626–9634.
Amit, D., & Fusi, S. (1994). Learning in neural networks with material synapses. Neural Computation, 6, 957–982.
Atick, J., & Redlich, A. (1990). Towards a theory of early visual processing. Neural Computation, 4, 559–572.
Barlow, H. (1956). Retinal noise and absolute threshold. J. Opt. Soc. Am., 46, 634–639.
Barlow, H. B. (1961). Possible principles underlying the transformation of sensory messages. In W. A. Rosenbluth (Ed.), Sensory communication (pp. 217–234). Cambridge, MA: MIT Press.
Bell, A. J., & Parra, L. C. (2005). Maximising sensitivity in a spiking network. In L. K. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing systems, 17 (pp. 121–128). Cambridge, MA: MIT Press.
Bell, A., & Sejnowski, T. (1995). An information maximization approach to blind separation and blind deconvolution. Neural Computation, 7, 1129–1159.
Bi, G., & Poo, M. (1998). Synaptic modifications in cultured hippocampal neurons: Dependence on spike timing, synaptic strength, and postsynaptic cell type. J. Neurosci., 18, 10464–10472.
Bi, G., & Poo, M. (2001). Synaptic modification of correlated activity: Hebb's postulate revisited. Ann. Rev. Neurosci., 24, 139–166.
Bienenstock, E., Cooper, L., & Munroe, P. (1982). Theory of the development of neuron selectivity: Orientation specificity and binocular interaction in visual cortex. Journal of Neuroscience, 2, 32–48.
Bliss, T. V. P., & Collingridge, G. L. (1993). A synaptic model of memory: Long-term potentiation in the hippocampus. Nature, 361, 31–39.
Bliss, T., & Lomo, T. (1973). Long-lasting potentiation of synaptic transmission in the dentate area of the anaesthetized rabbit following stimulation of the perforant path. J. Physiol., 232, 351–356.
Bohte, S. M., & Mozer, M. C. (2005). Reducing spike train variability: A computational theory of spike-timing dependent plasticity. In L. K. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing systems, 17 (pp. 201–208). Cambridge, MA: MIT Press.
Bortolotto, Z. A., Lauri, S., Isaac, J. T. R., & Collingridge, G. L. (2003). Kainate receptors and the induction of mossy fibre long-term potentiation. Phil. Trans. R. Soc. Lond. B: Biological Sciences, 358, 657–666.
Britten, K., Shadlen, M., Newsome, W., & Movshon, J. (1992). The analysis of visual motion: A comparison of neuronal and psychophysical performance. J. Neuroscience, 12, 4745–4765.
Brunel, N., Hakim, V., Isope, P., Nadal, J.-P., & Barbour, B. (2004). Optimal information storage and the distribution of synaptic weights: Perceptron versus Purkinje cell. Neuron, 43, 745–757.
Buonomano, D., & Merzenich, M. (1998). Cortical plasticity: From synapses to maps. Annual Review of Neuroscience, 21, 149–186.
Chechik, G. (2003). Spike-timing-dependent plasticity and relevant mutual information maximization. Neural Computation, 15, 1481–1510.
Cooper, L., Intrator, N., Blais, B., & Shouval, H. Z. (2004). Theory of cortical plasticity. Singapore: World Scientific.
Cover, T., & Thomas, J. (1991). Elements of information theory. New York: Wiley.
Dayan, P., & Abbott, L. F. (2001). Theoretical neuroscience. Cambridge, MA: MIT Press.
de Ruyter van Steveninck, R. R., & Bialek, W. (1995). Reliability and statistical efficiency of a blowfly movement-sensitive neuron. Phil. Trans. R. Soc. Lond. Ser. B, 348, 321–340.
Dudek, S. M., & Bear, M. F. (1992). Homosynaptic long-term depression in area CA1 of hippocampus and effects of N-methyl-D-aspartate receptor blockade. Proc. Natl. Acad. Sci. USA, 89, 4363–4367.
Erwin, E., Obermayer, K., & Schulten, K. (1995). Models of orientation and ocular dominance columns in the visual cortex: A critical comparison. Neural Comput., 7, 425–468.
Fonseca, R., Nägerl, U., Morris, R., & Bonhoeffer, T. (2004). Competition for memory: Hippocampal LTP under the regimes of reduced protein synthesis. Neuron, 44, 1011–1020.
Frey, U., & Morris, R. (1997). Synaptic tagging and long-term potentiation. Nature, 385, 533–536.
Froemke, R. C., Poo, M.-M., & Dan, Y. (2005). Spike-timing-dependent synaptic plasticity depends on dendritic location. Nature, 434, 221–225.
Fusi, S. (2002). Hebbian spike-driven synaptic plasticity for learning patterns of mean firing rates. Biol. Cybern., 87, 459–470.
Fusi, S., Annunziato, M., Badoni, D., Salamon, A., & Amit, D. J. (2000). Spike-driven synaptic plasticity: Theory, simulation, VLSI implementation. Neural Computation, 12, 2227–2258.
Fusi, S., Drew, P., & Abbott, L. (2005). Cascade models of synaptically stored memories. Neuron, 45, 599–611.
Gerstner, W., Kempter, R., van Hemmen, J. L., & Wagner, H. (1996). A neuronal learning rule for sub-millisecond temporal coding. Nature, 383, 76–78.
Gerstner, W., & Kistler, W. K. (2002). Spiking neuron models. Cambridge: Cambridge University Press.
Grossberg, S. (1987). The adaptive brain I. Dordrecht: Elsevier.
Gütig, R., Aharonov, R., Rotter, S., & Sompolinsky, H. (2003). Learning input correlations through non-linear temporally asymmetric Hebbian plasticity. J. Neuroscience, 23, 3697–3714.
Hebb, D. O. (1949). The organization of behavior. New York: Wiley.
Hertz, J., Krogh, A., & Palmer, R. G. (1991). Introduction to the theory of neural computation. Redwood City, CA: Addison-Wesley.
Hubel, D. H., & Wiesel, T. N. (1962). Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. J. Physiol. (London), 160, 106–154.
Katz, L. C., & Shatz, C. J. (1996). Synaptic activity and the construction of cortical circuits. Science, 274, 1133–1138.
Kelso, S. R., Ganong, A. H., & Brown, T. H. (1986). Hebbian synapses in hippocampus. Proc. Natl. Acad. Sci. USA, 83, 5326–5330.
Kempter, R., Gerstner, W., & van Hemmen, J. L. (1999). Hebbian learning and spiking neurons. Phys. Rev. E, 59, 4498–4514.
Kempter, R., Gerstner, W., & van Hemmen, J. L. (2001). Intrinsic stabilization of output rates by spike-based Hebbian learning. Neural Computation, 13, 2709–2741.
Kullmann, D. M. (2003). Silent synapses: What are they telling us about long-term potentiation? Phil. Trans. R. Soc. Lond. B: Biological Sciences, 358, 727–733.
Laughlin, S. (1981). A simple coding procedure enhances a neuron's information capacity. Z. Naturforschung, 36, 910–912.
Laughlin, S., de Ruyter van Steveninck, R. R., & Anderson, J. (1998). The metabolic cost of neural information. Nature Neuroscience, 1, 36–41.
Levy, W., & Baxter, R. (2002). Energy-efficient neuronal computation via quantal synaptic failures. J. Neuroscience, 22, 4746–4755.
Linsker, R. (1986). From basic network principles to neural architecture: Emergence of spatial-opponent cells. Proc. Natl. Acad. Sci. USA, 83, 7508–7512.
Linsker, R. (1989). How to generate ordered maps by maximizing the mutual information between input and output signals. Neural Computation, 1(3), 402–411.
Lisman, J. (1989). A mechanism for Hebb and anti-Hebb processes underlying learning and memory. Proc. Natl. Acad. Sci. USA, 86, 9574–9578.
Lisman, J. (2003). Long-term potentiation: Outstanding questions and attempted synthesis. Phil. Trans. R. Soc. Lond. B: Biological Sciences, 358, 829–842.
Malenka, R. C., Kauer, J., Zucker, R., & Nicoll, R. A. (1988). Postsynaptic calcium is sufficient for potentiation of hippocampal synaptic transmission. Science, 242, 81–84.
Malenka, R. C., & Nicoll, R. A. (1993). NMDA-receptor-dependent plasticity: Multiple forms and mechanisms. Trends Neurosci., 16, 480–487.
Markram, H., Lübke, J., Frotscher, M., & Sakmann, B. (1997). Regulation of synaptic efficacy by coincidence of postsynaptic APs and EPSPs. Science, 275, 213–215.
Merzenich, M., Nelson, R., Stryker, M., Cynader, M., Schoppmann, A., & Zook, J. (1984). Somatosensory cortical map changes following digit amputation in adult monkeys. J. Comparative Neurology, 224, 591–605.
Miller, K., Keller, J. B., & Stryker, M. P. (1989). Ocular dominance column development: Analysis and simulation. Science, 245, 605–615.
Miller, K. D., & MacKay, D. J. C. (1994). The role of constraints in Hebbian learning. Neural Computation, 6, 100–126.
Oja, E. (1982). A simplified neuron model as a principal component analyzer. J. Mathematical Biology, 15, 267–273.
Petersen, C., Malenka, R., Nicoll, R., & Hopfield, J. (1998). All-or-none potentiation of CA3-CA1 synapses. Proc. Natl. Acad. Sci. USA, 95, 4732–4737.
Pfister, J.-P., Toyoizumi, T., Barber, D., & Gerstner, W. (2006). Optimal spike-timing dependent plasticity for precise action potential firing in supervised learning. Neural Computation, 18, 1309–1339.
Rubin, J., Lee, D. D., & Sompolinsky, H. (2001). Equilibrium properties of temporally asymmetric Hebbian plasticity. Physical Review Letters, 86, 364–367.
Schultz, W., Dayan, P., & Montague, R. (1997). A neural substrate for prediction and reward. Science, 275, 1593–1599.
Senn, W., Tsodyks, M., & Markram, H. (2001). An algorithm for modifying neurotransmitter release probability based on pre- and postsynaptic spike timing. Neural Computation, 13, 35–67.
Seung, H. (2003). Learning in spiking neural networks by reinforcement of stochastic synaptic transmission. Neuron, 40, 1063–1073.
Sjöström, P., Turrigiano, G., & Nelson, S. (2001). Rate, timing, and cooperativity jointly determine cortical synaptic plasticity. Neuron, 32, 1149–1164.
Sjöström, P., Turrigiano, G., & Nelson, S. (2004). Endocannabinoid-dependent neocortical layer-5 LTD in the absence of postsynaptic spiking. J. Neurophysiol., 92, 3338–3343.
Song, S., & Abbott, L. (2001). Column and map development and cortical re-mapping through spike-timing dependent plasticity. Neuron, 32, 339–350.
Song, S., Miller, K., & Abbott, L. (2000). Competitive Hebbian learning through spike-time-dependent synaptic plasticity. Nature Neuroscience, 3, 919–926.
Toyoizumi, T., Pfister, J.-P., Aihara, K., & Gerstner, W. (2005a). Generalized Bienenstock-Cooper-Munro rule for spiking neurons that maximizes information transmission. Proc. National Academy Sciences (USA), 102, 5239–5244.
Toyoizumi, T., Pfister, J.-P., Aihara, K., & Gerstner, W. (2005b). Spike-timing dependent plasticity and mutual information maximization for a spiking neuron model. In L. K. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing systems, 17 (pp. 1409–1416). Cambridge, MA: MIT Press.
Turrigiano, G., & Nelson, S. (2004). Homeostatic plasticity in the developing nervous system. Nature Reviews Neuroscience, 5, 97–107.
van Rossum, M. C. W., Bi, G. Q., & Turrigiano, G. G. (2000). Stable Hebbian learning from spike timing-dependent plasticity. J. Neuroscience, 20, 8812–8821.
von der Malsburg, C. (1973). Self-organization of orientation selective cells in the striate cortex. Kybernetik, 14, 85–100.
Yeung, L. C., Shouval, H., Blais, B., & Cooper, L. (2004). Synaptic homeostasis and input selectivity follow from a model of calcium dependent plasticity. Proc. Nat. Acad. Sci. USA, 101, 14943–14948.
Zhang, L., Tao, H., Holt, C., Harris, W. A., & Poo, M.-M. (1998). A critical window for cooperation and competition among developing retinotectal synapses. Nature, 395, 37–44.
Zhou, Q., Tao, H., & Poo, M. (2003). Reversal and stabilization of synaptic modifications in the developing visual system. Science, 300, 1953–1957.
Received January 13, 2006; accepted June 22, 2006.
LETTER
Communicated by Robert Kass
Nonparametric Modeling of Neural Point Processes via Stochastic Gradient Boosting Regression

Wilson Truccolo
Wilson [email protected]

John P. Donoghue∗
John [email protected]

Neuroscience Department, Brown University, Providence, RI 02912, U.S.A.
Statistical nonparametric modeling tools that enable the discovery and approximation of functional forms (e.g., tuning functions) relating neural spiking activity to relevant covariates are desirable tools in neuroscience. In this article, we show how stochastic gradient boosting regression can be successfully extended to the modeling of spiking activity data while preserving their point process nature, thus providing a robust nonparametric modeling tool. We formulate stochastic gradient boosting in terms of approximating the conditional intensity function of a point process in discrete time and use the standard likelihood of the process to derive the loss function for the approximation problem. To illustrate the approach, we apply the algorithm to the modeling of primary motor and parietal spiking activity as a function of spiking history and kinematics during a two-dimensional reaching task. Model selection, goodness of fit via the time rescaling theorem, model interpretation via partial dependence plots, ranking of covariates according to their relative importance, and prediction of peri-event time histograms are illustrated and discussed. Additionally, we use the tenfold cross-validated log likelihood of the modeled neural processes (67 cells) to compare the performance of gradient boosting regression to two alternative approaches: standard generalized linear models (GLMs) and Bayesian P-splines with Markov chain Monte Carlo (MCMC) sampling. In our data set, gradient boosting outperformed both Bayesian P-splines (in approximately 90% of the cells) and GLMs (100%). Because of its good performance and computational efficiency, we propose stochastic gradient boosting regression as an off-the-shelf nonparametric tool for initial analyses of large neural data sets (e.g., more than 50 cells; more than 10^5 samples per cell) with corresponding multidimensional covariate spaces (e.g., more than four covariates). In the cases where a
∗ J. P. Donoghue is a cofounder and shareholder in Cyberkinetics, Inc., a neurotechnology company that is developing neural prosthetic devices.
Neural Computation 19, 672–705 (2007)
functional form might be amenable to a more compact representation, gradient boosting might also lead to the discovery of simpler, parametric models.

1 Introduction

Modeling neural spiking activity as a function of covariates, such as spiking history, neural ensemble activity, stimuli, and behavior, is a developing field in neuroscience, with many challenges and problems yet to be overcome. Recent work has provided a solid formulation for the parametric modeling of spiking activity, while preserving its point process nature, based on maximum likelihood analysis (Brillinger, 1988; Chornoboy, Schramm, & Karr, 1988; Brown, Barbieri, Eden, & Frank, 2003; Paninski, 2004; Truccolo, Eden, Fellows, Donoghue, & Brown, 2005; Okatan, Wilson, & Brown, 2005). However, there still remains a need for new statistical tools that enable the discovery and approximation of functional forms relating spiking activity and covariates, rather than simply imposing a priori parametric forms. Previous attempts to address this challenge have relied on cardinal cubic spline modeling (Frank, Eden, Solo, Wilson, & Brown, 2002), Zernike polynomials (Barbieri et al., 2002), smoothing splines in the context of generalized additive models (Kass & Ventura, 2001), and Bayesian free-knot spline smoothing (DiMatteo, Genovese, & Kass, 2001; Kaufman, Ventura, & Kass, 2005). While these attempts have successfully generated new insights in their applications, there still is a need for new approaches that combine high fitting performance with computational efficiency in analyses of large neural data sets. For example, motor neurophysiology studies using multiple multielectrode arrays now permit the simultaneous recording of hundreds of cells, with often more than 10^5 samples per cell, and at the same time generate measurements of behaviorally relevant high-dimensional covariates, including arm kinematics in joint space, joint torques, and electromyograms (EMGs) (Hatsopoulos, Joshi, & O'Leary, 2004; Truccolo, Vargas, Philip, & Donoghue, 2005). Off-the-shelf, computationally efficient modeling algorithms that require a minimum amount of fine-tuning, while at the same time maintaining good fit, are required for initial analyses of these data sets.

Here, we approach this challenge from a different perspective. Recent developments in the statistical and machine learning communities have provided powerful new tools for nonparametric modeling (Hastie, Tibshirani, & Friedman, 2001; Vapnik, 1998). In the particular case of interest here, this progress originated from the insight gained by relating the nonparametric regression problem to numerical function approximation via gradient descent methods (Breiman, 1997; Friedman, 1999). This insight has led to the development of stochastic gradient boosting regression (Friedman, 2001, 2002). Rather than optimizing in parameter space, as performed in parametric modeling, the approximation problem is formulated
in terms of optimization in function space. Intuitively, stochastic gradient boosting can be seen as a regularized fitting of the residuals in a gradual iterative fashion. This approach has provided both a new theoretical perspective for understanding nonparametric modeling algorithms such as ensemble of experts and other boosting algorithms,¹ and a general off-the-shelf regression tool (Hastie et al., 2001).

¹Most current investigations of boosting approaches originated with the AdaBoost algorithm proposed by Freund and Schapire (1995) for classification problems. The relations between AdaBoost and additive logistic regression were first established by Friedman, Hastie, and Tibshirani (2000). Relations between boosting, maximum likelihood estimation, and Kullback-Leibler divergence were explored by Lebanon and Lafferty (2002) in the context of exponential family models. Breiman (1997, 1999), Friedman (1999, 2001), and Mason, Baxter, Bartlett, & Frean (1999) were perhaps the first to formulate boosting in terms of greedy function approximation via gradient descent and to extend it to several types of loss functions. Consistency analyses of boosting algorithms have been presented by Lugosi and Vayatis (2004) and Zhang (2004). Of particular interest here is the result in Rosset, Zhu, & Hastie (2004) showing that boosting trees approximately (and in some cases exactly) minimize the loss criterion with an L1-norm constraint on the estimated parameter vector. Relations between boosting, L1-constrained regression methods (e.g., Lasso), and sparse representations have also been discussed in Friedman, Hastie, Rosset, Tibshirani, & Zhu (2004) and in Hastie et al. (2001). Additionally, an interpretation of boosting integrating its formulation in terms of both gradient function optimization and margin maximization (Schapire, Freund, Bartlett, & Lee, 1998) has been provided by Rosset et al. (2004). In comparison to other nonparametric approaches, stochastic gradient boosting has been shown (Friedman, 2001) to outperform multivariate additive regression approaches based on smoothing splines (MARS; Friedman, 1991). In our exposition and approach to modeling neural point processes (section 3), we have followed Friedman's statistical perspective.

Here, we show how stochastic gradient boosting can be successfully extended to the statistical modeling of neural point processes. We introduce the neural point process modeling problem in the next section. Succinctly, a point process joint probability can be completely specified by its conditional intensity function. The modeling problem can thus be expressed in terms of approximating the conditional intensity of the process as a function of the neuron's own spiking history and other covariates of interest. We then formulate stochastic gradient boosting in the context of conditional intensity function approximation. Because stochastic gradient boosting allows arbitrary (differentiable) loss functions, our application uses the standard likelihood function for general point processes to derive the loss function for the conditional intensity function optimization problem. To illustrate the theory, we apply the algorithm to spike train data recorded from primary motor (M1) and parietal (5d) cortices in a behaving monkey performing a standard center-out reaching task. Goodness-of-fit assessment of the function approximation is done by applying the time-rescaling theorem (Brown et al., 2001). Additional tools for model interpretation and selection are also discussed and illustrated. Finally, we perform an extensive
comparison between stochastic gradient boosting regression and two alternative approaches: standard log-linear models in the parameters (GLMs) and Bayesian P-splines fitted via Markov chain Monte Carlo (MCMC) sampling. We use the tenfold cross-validated log likelihoods of the neural point processes as the criterion for the performance comparison on a 67-cell data set.
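As a side note, the time-rescaling check mentioned above can be sketched in a few lines once a binned spike train and an estimated conditional intensity (defined in the next section) are available. This is only an illustration of the standard recipe from Brown et al. (2001), under the assumption of a discrete-time intensity sampled on the same bins as the spike train; it is not the authors' implementation.

```python
import numpy as np
from scipy.stats import kstest

def time_rescaling_ks(dN, lam, delta):
    """Kolmogorov-Smirnov goodness-of-fit via the time-rescaling theorem.

    dN    : binary spike indicator per time bin
    lam   : estimated conditional intensity per bin (spikes/s)
    delta : bin width in seconds

    Under a correct model, the rescaled interspike intervals
    z_j = integral of lam between successive spike times are i.i.d.
    exponential(1), so u_j = 1 - exp(-z_j) should be uniform on [0, 1).
    """
    integrated = np.cumsum(np.asarray(lam) * delta)  # integrated intensity
    spike_bins = np.flatnonzero(np.asarray(dN) > 0)
    z = np.diff(integrated[spike_bins])              # rescaled ISIs
    u = 1.0 - np.exp(-z)
    return kstest(u, "uniform")
```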
2 Neural Point Processes: Stochastic Formulation

A neural point process can be completely characterized by its conditional intensity function (CIF; Daley & Vere-Jones, 2003), defined as

\lambda(t \mid x(t)) = \lim_{\Delta \to 0} \frac{P[N(t + \Delta) - N(t) = 1 \mid x(t)]}{\Delta},   (2.1)

where P[·|·] is a conditional probability and N(t) denotes the number of spikes counted in the time interval (0, t] for t ∈ (0, T]. The term x(t) represents covariates of interest: the spiking history up to (but not including) time t and stimuli-behavioral covariates at varied time lags with respect to the spiking activity. The CIF is a strictly positive function that gives a history-dependent generalization of the rate function of a Poisson process. From equation 2.1 and for small Δ, we have that λ(t|x(t))Δ gives approximately the neuron's spiking probability in the time interval (t, t + Δ].

A discrete time representation of the point process is often preferable. To obtain this representation, we choose a large integer K and partition the observation interval (0, T] into K subintervals {(t_{k−1}, t_k]}_{k=1}^K, each of length Δ = T K^{−1}. We choose K large so that there is at most one spike per subinterval. Let N_k = N(t_k), N_{1:k} = N_{0:t_k}, and x_k = x(t_k). Because we choose K large, the differences ΔN_k = N_k − N_{k−1} result in a binary time series of zero and one counts, which corresponds to the usual representation of spike trains. For any discrete time neural point process, the joint probability density of the spike train, that is, the joint probability density of J spike times 0 < u_1 < u_2 < . . . < u_J ≤ T in (0, T] conditioned on history and covariates, can be expressed as (Daley & Vere-Jones, 2003; Truccolo, Eden et al., 2005)

P(\Delta N_{1:K} \mid x_{1:K}) = \exp\left\{ \sum_{k=1}^{K} \Delta N_k \log[\lambda(t_k \mid x_k)\Delta] - \sum_{k=1}^{K} \lambda(t_k \mid x_k)\Delta \right\} + o(\Delta^J).   (2.2)
The likelihood for the point process is readily available from equation 2.2.² This probability distribution belongs to the exponential family, with the natural parameter given by

\theta_k = \log[\lambda(t_k \mid x_k)\Delta] = F(x_k).   (2.3)

²In section 7, our data consist of multiple realizations or trials of the neural point process. The extension of equation 2.2 to reflect multiple realizations is straightforward. However, for notational simplicity, we leave the existence of multiple realizations implicit.
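For concreteness, the log of equation 2.2 can be evaluated for a binned spike train in a few lines. The following is only an illustrative sketch, not the authors' implementation:

```python
import numpy as np

def point_process_loglik(dN, lam, delta):
    """Log of equation 2.2 (dropping the o(Delta^J) term).

    dN    : binary spike counts Delta N_k, one entry per bin
    lam   : conditional intensity lambda(t_k | x_k) per bin, in spikes/s
    delta : bin width Delta, in seconds
    """
    dN = np.asarray(dN, dtype=float)
    lam = np.asarray(lam, dtype=float)
    return np.sum(dN * np.log(lam * delta)) - np.sum(lam * delta)

# Example: a homogeneous 20 Hz Poisson spike train in 1 ms bins.
rng = np.random.default_rng(1)
delta, rate, K = 0.001, 20.0, 100_000
dN = (rng.random(K) < rate * delta).astype(float)
print(point_process_loglik(dN, np.full(K, rate), delta))
```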
The problem we address here is the approximation of F(x) above. We will present a new approach to this problem, which consists of treating this approximation as a regularized optimization in function space and using stochastic gradient boosting regression to solve it.

3 Optimization in Function Space and Stochastic Gradient Boosting

Consider the random variables (y, x). In general, y and x = [x_1, . . . , x_p] can be any continuous, discrete, or mixed set of variables. In the specific case of a discrete time point process, y is a binary sequence. The target function to be approximated is the function F*(x), which relates the covariate vector x to y while minimizing the expected value of some arbitrary (differentiable) loss function L(y, F(x)) over the joint distribution of all (y, x) variables, that is,

F^* = \arg\min_F E_{y,x}\, L(y, F(x)) = \arg\min_F E_x \big[ E_y \big( L(y, F(x)) \mid x \big) \big].   (3.1)

This is an optimization problem in function space. In other words, we consider F(x) evaluated at each point x to be a "parameter" and seek to minimize

E_y [\, L(y, F(x)) \mid x \,]   (3.2)

at each individual x directly with respect to F(x), observing that pointwise minimization of equation 3.2 is equivalent to minimization of equation 3.1. Numerical optimization methods are often required to solve this type of problem. Commonly they involve an iterative procedure where the approximation to F*(x) takes the form

F^*(x) \approx \hat{F}(x) = \sum_{m=0}^{M} f_m(x),   (3.3)
where f_0(x) is an initial guess and {f_m(x)}_{m=1}^M are successive increments (steps or boosts), each dependent on the preceding steps. In the particular case of function approximation via steepest-descent gradient, the mth step has a (steepest-descent) direction defined by the negative gradient vector g_m(x) of the loss function with respect to F(x) and a magnitude ρ_m obtained via line search.

Given finite data samples {y_k, x_k}_{k=1}^K, smoothing is required to improve generalization of the approximation over nonsampled covariate values. This is achieved by introducing a parametric model at each of the successive steps, such that the approximation in equation 3.3 becomes

\hat{F}(x) = \hat{F}_0(x) + \sum_{m=1}^{M} \beta_m h(x; a_m),   (3.4)
where h(x; a) is a regressor or base learner, β_m and a_m = {a_1, a_2, ...} are parameters, and F̂_0(x) is a constant. A "greedy stagewise" approach for solving equation 3.1 under the assumptions in equations 3.3 and 3.4 results (see section A.1 in the appendix) in searching, at each iteration step, for the member of the parameterized class h(x; a_m) that produces the vector h_m = {h(x_k; a_m)}_{k=1}^{K} ∈ R^K most parallel to the negative gradient vector g_m = {g_{mk}(x_k)}_{k=1}^{K} ∈ R^K. For most choices of base learners, the most parallel vector can be obtained by solving the least-squares problem

a_m = arg min_{a,β} Σ_{k=1}^{K} [g_{mk}(x_k) − β h(x_k; a)]².  (3.5)
This gives the direction for the steepest-descent step. If needed, other minimization criteria can be used in place of equation 3.5. The magnitude of the step is obtained by the line search

ρ_m = arg min_ρ Σ_{k=1}^{K} L( y_k, F̂_{m−1}(x_k) + ρ h(x_k; a_m) ).  (3.6)
Therefore, the update rule for the approximation at each step becomes

F̂_m(x) = F̂_{m−1}(x) + ρ_m h(x; a_m).  (3.7)
Fitting the regressor to the negative gradient effectively works as gradient smoothing. For the case of the squared loss function L(y, F(x)) = (1/2)[y − F(x)]², the negative gradient is simply the usual residual, and the procedure corresponds to residual fitting.
Figure 1: Loss function dependence on number of iterations (trees) and shrinkage parameters. The negative of the log likelihood for the point process (cell A) is computed on validation data for three different choices of shrinkage values α and as a function of the number of trees in the approximation. The dotted line shows the value of the loss function computed on the training data for α = 0.001.
3.1 Regularization. In order to improve generalization and avoid overfitting, the approximation can be constrained by the choice of the number of steps or iterations and by weighting the contribution of each base learner. A shrinkage strategy is adopted for the weighting (Copas, 1983); that is, the update rule at each iteration becomes

F̂_m(x) = F̂_{m−1}(x) + α ρ_m h(x; a_m),  0 < α ≤ 1.  (3.8)
The actual value for α may depend on the data (see Figure 1), but usually small values around α = 0.001 are chosen. With a fixed shrinkage parameter, an optimal number of iterations is estimated by cross-validation on a test data set, where we assess whether additional iteration steps significantly improve the minimization of the loss function. As we mentioned in section 1, consistency and convergence analyses have recently been provided by Lugosi and Vayatis (2004) and Zhang (2004). In addition, Rosset et al. (2004) have shown that boosting trees regression approximately (and in some cases exactly) minimizes its loss criterion with an L1-norm constraint on the estimated parameter vector. Some initial
discussion relating boosting, L1-constrained regression methods, and sparse representation has appeared in Friedman et al. (2004) and Hastie et al. (2001).

3.2 Variance Reduction via Stochastic Subsampling. Bootstrap subsampling has been shown to reduce the variance of iterative approximations, such as those performed in bagging procedures (Breiman, 1996, 1999). In a similar fashion, stochastic subsampling can be adopted (Friedman, 2002) to reduce variance in gradient boosting estimates. Specifically, at each iteration, a subsample of the training data is drawn at random (without replacement) from the full training data set. This randomly selected subsample is then used to fit the base learner and compute the model update for the current iteration. Previous analysis suggests that a good size for the subsample is about half the size of the total available training data (Friedman, 2002).

3.3 Stochastic Gradient Boosting: General Algorithm. Under the general formulation given above, the algorithm for stochastic gradient boosting can be expressed as:

1. Set the initial value F̂_0(x) = arg min_μ Σ_{k=1}^{K} L(y_k, μ).
2. For m = 1 to M do:
3. Select a random subsample of size K̂, {y_k̂, x_k̂}_{k̂=1}^{K̂}, without replacement.
4. Compute the components of the negative gradient vector from the subsample:
   g_{mk̂} = −[ ∂L(y_k̂, F(x_k̂)) / ∂F(x_k̂) ]_{F(x)=F̂_{m−1}(x)},  k̂ = 1, ..., K̂.
5. Fit h(x; a) to the negative gradient by solving a_m = arg min_{a,β} Σ_{k̂=1}^{K̂} [g_{mk̂} − β h(x_k̂; a)]².
6. ρ_m = arg min_ρ Σ_{k̂=1}^{K̂} L( y_k̂, F̂_{m−1}(x_k̂) + ρ h(x_k̂; a_m) ).
7. Update F̂_m(x) = F̂_{m−1}(x) + α ρ_m h(x; a_m).
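A compact sketch of steps 1 through 7 in Python may help fix ideas (our illustration, not the authors' C implementation; we use scikit-learn's DecisionTreeRegressor as the base learner, SciPy's minimize_scalar for steps 1 and 6, and assume loss(y, F) and neg_grad(y, F) implement an arbitrary differentiable loss and its negative gradient):

import numpy as np
from scipy.optimize import minimize_scalar
from sklearn.tree import DecisionTreeRegressor

def stochastic_gradient_boost(x, y, loss, neg_grad, M=10000, alpha=0.001,
                              subsample=0.5, max_leaf_nodes=9, seed=0):
    rng = np.random.default_rng(seed)
    K = len(y)
    # Step 1: constant initial value minimizing the loss.
    F0 = minimize_scalar(lambda mu: loss(y, np.full(K, mu))).x
    F = np.full(K, F0)
    trees, steps = [], []
    for m in range(M):                                           # step 2
        idx = rng.choice(K, int(subsample * K), replace=False)   # step 3
        g = neg_grad(y[idx], F[idx])                             # step 4
        tree = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes)
        tree.fit(x[idx], g)            # step 5: least-squares fit to -gradient
        h = tree.predict(x[idx])
        rho = minimize_scalar(lambda r: loss(y[idx], F[idx] + r * h)).x  # step 6
        F = F + alpha * rho * tree.predict(x)    # step 7, with shrinkage alpha
        trees.append(tree)
        steps.append(alpha * rho)
    return F0, trees, steps

In this sketch, the multiplier β of step 5 is absorbed into the tree's terminal node values, and predictions at new covariate values are obtained as F0 plus the step-weighted sum of the tree predictions.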
3.4 Choosing the Base Learner: Regression Trees and Interaction Order in the Approximation. Thus far, we have not defined the type of parameterized model h(x; a). In principle, we could boost GLMs, spline models, or many other different models. However, regression trees are commonly chosen, based on two motivations. First, approximation using an ensemble of regression trees is a simple yet powerful approach. In this case, each base learner h(x; a) is an R-terminal-node regression tree. At each mth iteration, a regression tree partitions the space of all joint values of the covariates x into R disjoint sets or regions {S_{rm}}_{r=1}^{R} and predicts a separate constant value in each of these regions, such that

h(x; a_m) = Σ_{r=1}^{R} c_{rm} 1{x ∈ S_{rm}},  (3.9)
where c_{rm} is the terminal node mean. Therefore, the parameter a_m specifies the splitting variables, the splitting locations that define the disjoint regions {S_{rm}}_{r=1}^{R}, and the terminal node means of the tree at the mth iteration. To fit a regression tree at each iteration, we need to decide on the splitting variables and split points and also what topology (shape) the tree should have. We follow the CART algorithm (Breiman, Friedman, Olshen, & Stone, 1984). In this case, the partition of the space is restricted to recursive binary partitions (i.e., at each node, we partition the current variable into two disjoint spaces: left and right of the splitting point). Binary splits are preferred to multiway splits because the latter tend to fragment the data too quickly, leaving insufficient data at the next level down. Additionally, multiway splits can be achieved in gradient boosting trees by a series of binary splits, each occurring in one of the iteration steps (Hastie et al., 2001). Finding the best binary partition in terms of minimum sum of squares is generally computationally infeasible, so a greedy algorithm is adopted. Starting with all of the data, consider a splitting variable j and split point s, and define the pair of half-planes

S_1(j, s) = {x | x_j ≤ s} and S_2(j, s) = {x | x_j > s}.  (3.10)
Then we seek the splitting variable j and split point s that solve

min_{j,s} [ min_{c_1} Σ_{x_k ∈ S_1(j,s)} (y_k − c_1)² + min_{c_2} Σ_{x_k ∈ S_2(j,s)} (y_k − c_2)² ].  (3.11)
For any choice of j and s, the inner minimization is solved by

ĉ_1 = ave(y_k | x_k ∈ S_1(j, s)) and ĉ_2 = ave(y_k | x_k ∈ S_2(j, s)).  (3.12)
The computation of the terminal node mean depends on the distribution adopted for the data (see equation 4.4). For each splitting variable, the determination of the split point s can be done very quickly, and hence, by scanning through all of the inputs, determination of the best pair (j, s) is feasible.
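A sketch of this exhaustive search (our illustration; production CART implementations sort each variable once and update the two sums incrementally rather than recomputing them):

import numpy as np

def best_split(x, y):
    # Greedy search for the pair (j, s) solving equation 3.11.
    # x: (K, p) covariate matrix; y: (K,) working response.
    best_j, best_s, best_sse = None, None, np.inf
    K, p = x.shape
    for j in range(p):
        v = np.unique(x[:, j])
        # candidate split points: midpoints between adjacent unique values
        for s in (v[:-1] + v[1:]) / 2.0:
            left = x[:, j] <= s
            c1, c2 = y[left].mean(), y[~left].mean()   # equation 3.12
            sse = ((y[left] - c1) ** 2).sum() + ((y[~left] - c2) ** 2).sum()
            if sse < best_sse:
                best_j, best_s, best_sse = j, s, sse
    return best_j, best_s, best_sse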
Having found the best split, we partition the data into the two resulting regions and repeat the splitting process on each of them; this process is then repeated on all of the resulting regions. The second motivation for choosing regression trees is that by controlling the number of terminal nodes, it is also possible to control the interaction order in the approximation. Usually, in order for F̂(x) to approximate the target F*(x) well, it is important to consider how close the interaction order of F̂(x) is required to be to that of F*(x). Interaction order can be thought of as follows. Suppose that the target function can be expressed in terms of the expansion

F*(x) = Σ_i F_i(x_i) + Σ_{ij} F_{ij}(x_i, x_j) + Σ_{ijk} F_{ijk}(x_i, x_j, x_k) + ...  (3.13)
A tree with splits on only one of the covariates falls in the first sum of equation 3.13, a tree with splits on two covariates falls in the second sum, and so on. The highest interaction order is limited by the number p of covariates, and, given the additive nature of the boosting procedure, a tree with R terminal nodes produces a function approximation with interaction order of at most min(R − 1, p). In practice, for a large number of covariates, a good approximation can be obtained with a much lower order.³ Depending on the case, we can either select the interaction order based on testing on a single test data set or by an n-fold cross-validation procedure, or fix the order a priori to capture, for example, only second-order interactions among the covariates.

4 Loss Function for Point Process Modeling

In order to apply stochastic gradient boosting to neural point processes, we define the loss function at each iteration step to be the negative of the log likelihood for the point process. For a given CIF model in log scale, log[λ(t_k|x_k)Δ] = F̂(x_k), the loss function becomes

L_m({ΔN_k, F̂_{m−1}(x_k)}_{k=1}^{K}) = −Σ_{k=1}^{K} [ ΔN_k F̂_{m−1}(x_k) − exp{F̂_{m−1}(x_k)} ],  (4.1)
3 See Friedman (2001) for a discussion of the selection of interaction order. Note also that because we are modeling the natural parameter of the point process distribution log λ, a model containing only the first term in equation 3.13 will already lead to multiplicative effects among the covariates.
where again K is the total number of subintervals, or the number of samples. The kth component of the negative gradient at each iteration is then

g_{mk} = −∂L_m(ΔN_k, F̂_{m−1}(x_k)) / ∂F̂_{m−1}(x_k) = ΔN_k − exp{F̂_{m−1}(x_k)}.  (4.2)
The initial value is given by

F̂_0(x) = log( (1/K) Σ_{k=1}^{K} ΔN_k ),  (4.3)

and the terminal node estimates in the regression tree are given by

c_{rm} = log( Σ_{k=1}^{K} 1{x_k ∈ S_{rm}} ΔN_k / Σ_{k=1}^{K} 1{x_k ∈ S_{rm}} exp[F̂_{m−1}(x_k)] ).  (4.4)
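The quantities in equations 4.1 through 4.4 translate directly into code; a minimal sketch under the sign conventions above (our illustration; dN stands for the binned counts ΔN_k and F for the current approximation evaluated at the samples):

import numpy as np

def poisson_loss(dN, F):
    # equation 4.1: negative point-process log likelihood (up to a constant)
    return np.sum(np.exp(F) - dN * F)

def neg_gradient(dN, F):
    # equation 4.2: the residual between observed and predicted counts
    return dN - np.exp(F)

def initial_value(dN):
    # equation 4.3: log of the mean count per subinterval
    return np.log(dN.mean())

def terminal_node_value(dN_node, F_node):
    # equation 4.4: estimate for the samples falling in region S_rm
    return np.log(dN_node.sum() / np.exp(F_node).sum())

The first two functions can be passed as the loss and neg_grad arguments of the generic boosting sketch in section 3.3.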
5 Goodness-of-Fit Analysis via the Time-Rescaling Theorem

Given spike train and covariate time series, we can approximate the CIF, λ̂(t|x(t))Δ = exp{F̂(x(t))}, using the stochastic gradient boosting approach just described. Assessing the goodness of fit of the approximation can be implemented by applying the time-rescaling theorem for point processes as follows (Brown et al., 2001; Papangelou, 1972; Girsanov, 1960). We start by computing the rescaled times z_j from the estimated conditional intensity and from the spike times as

z_j = 1 − exp( −∫_{u_j}^{u_{j+1}} λ̂(t|x(t)) dt ),  (5.1)
for j = 1, ..., J − 1 spike times. The time-rescaling theorem states that the z_j's will be independent uniformly distributed random variables on the interval [0, 1) if and only if the estimated CIF approximates well the true conditional intensity function of the process. Because the transformation in equation 5.1 is one-to-one, any statistical assessment that measures the agreement between the z_j's and a uniform distribution directly evaluates how well the original model agrees with the spike train data. Additional tests to check the independence of the rescaled times can be formulated (Truccolo, Eden et al., 2005) but will not be employed here. A Kolmogorov-Smirnov (K-S) test can be constructed by ordering the z_j's from smallest to largest and then plotting the values of the cumulative distribution function of the uniform density, defined as b_j = (j − 1/2)/J for j = 1, ..., J, against the ordered z_j's. We term these plots K-S plots. If the
model is correct, then the points should lie on a 45 degree line. Confidence bounds for the degree of agreement between a model and the data may be constructed using the distribution of the K-S statistic (Johnson & Kotz, 1970). For moderate to large sample sizes, the 95% confidence bounds are well approximated (Johnson & Kotz, 1970) by b_j ± 1.36 J^{−1/2}.
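A sketch of the rescaling and the K-S plot coordinates (our illustration; lam holds the estimated CIF sampled on the Δ grid, in Hz, and spike_bins the indices k at which ΔN_k = 1):

import numpy as np

def ks_plot_points(spike_bins, lam, delta=0.001):
    # integral of the estimated CIF, approximated by a Riemann sum
    Lambda = np.cumsum(lam) * delta
    # rescaled times between successive spikes (equation 5.1)
    z = np.sort(1.0 - np.exp(-np.diff(Lambda[spike_bins])))
    J = len(z)
    b = (np.arange(1, J + 1) - 0.5) / J   # uniform quantiles b_j
    band = 1.36 / np.sqrt(J)              # approximate 95% bounds
    return z, b, band                     # plot z against b, with b +/- band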
6 Model Interpretation: Relative Importance and Partial Dependence Plots

6.1 Relative Importance of Individual Covariates. Given an approximated CIF, we would like to assess the relative importance or contribution of each covariate. This assessment is particularly important in more exploratory studies, where the set of covariates is large and little a priori knowledge is available about their individual relevance. A measure of relative importance should reflect the relative influence of a variable x_l on the variation of F̂(x). Assessing the influence of a covariate on the variation of F̂(x) by expected partial derivatives is not possible given the piecewise constant approximations provided by regression trees. Instead, we use the relative importance measure of Breiman et al. (1984). We start by computing it for a single tree T_m = h(x; a_m) as

Î²_{lm}(T_m) = Σ_{r=1}^{R−1} î²_r 1(v(r) = l),  (6.1)
where the summation is over the nonterminal nodes r of the R-terminal-node tree. The term v(r) gives the splitting variable associated with node r in the tree. The term î²_r is the corresponding empirical improvement in squared error achieved by splitting, at the point given by the splitting variable, a region S_r into two left and right subregions (S_{r−}, S_{r+}). The empirical improvement in squared error is given by the weighted squared difference between the constants fit to the left and right subregions. The final relative importance measure with respect to the approximation F̂(x), including the entire collection of regression trees, is obtained by averaging equation 6.1 over all of the trees:

Î²_l = (1/M) Σ_{m=1}^{M} Î²_{lm}(T_m).  (6.2)
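When the base learners are scikit-learn regression trees, each tree's feature_importances_ attribute exposes these summed squared-error improvements (up to a within-tree normalization), so equation 6.2 reduces to an average over the ensemble (a sketch under that assumption):

import numpy as np

def relative_importance(trees):
    # equation 6.2: average the single-tree importances over all M trees
    imp = np.mean([t.feature_importances_ for t in trees], axis=0)
    return 100.0 * imp / imp.sum()   # rescaled so the values sum to 100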
6.2 Partial Dependence Plots. Beyond assessing the relative importance of individual covariates, we would like to gain insight into the functional form relating response variable and covariates. That might be done by graphically plotting the partial dependence of the approximation on a
small subset of the covariates z_l = {z_1, ..., z_l} ⊂ {x_1, ..., x_p}. This partial dependence can be formulated in terms of the expected value of F̂(x) with respect to the probability of the complement subset z_{\l} (Friedman, 2001; see also section A.2 of the appendix). An empirical approximation to this partial dependence is straightforward to obtain in the case of regression trees based on single-variable splits. In this case, the partial dependence of F̂(x) on a specified target variable subset z_l is evaluated given only the tree, without reference to the data. For a specified set of values for the variables z_l, a weighted traversal of the tree is performed. At the root of the tree, a weight of 1 is assigned. For each nonterminal node visited, if its split variable is in the target subset z_l, the appropriate left or right daughter node is visited and the weight is not modified. If the node's split variable is a member of z_{\l}, then both daughters are visited and the current weight is multiplied by the fraction of training observations that went left or right, respectively, at that node. Each terminal node visited during the traversal is assigned the current value of the weight. When the tree traversal is complete, the weighted average of the F̂(x) values over those terminal nodes visited during the traversal provides an empirical estimate of the expected value of F̂(x) with respect to the complement subset. For a collection of M regression trees, the results from the individual trees are simply averaged.
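The weighted traversal just described can be sketched for a single scikit-learn regression tree as follows (our illustration; target_idx lists the indices of the covariates in z_l, and z their fixed values):

import numpy as np

def partial_dependence_tree(tree, target_idx, z):
    t = tree.tree_
    total, stack = 0.0, [(0, 1.0)]     # (node id, weight); root weight is 1
    while stack:
        node, w = stack.pop()
        if t.children_left[node] == t.children_right[node]:  # terminal node
            total += w * t.value[node, 0, 0]
            continue
        f = t.feature[node]
        left, right = t.children_left[node], t.children_right[node]
        if f in target_idx:
            # split on a target variable: follow the branch selected by z
            branch = left if z[target_idx.index(f)] <= t.threshold[node] else right
            stack.append((branch, w))
        else:
            # split on a complement variable: visit both daughters, weighting
            # by the fraction of training samples that went to each side
            n = t.weighted_n_node_samples
            stack.append((left, w * n[left] / n[node]))
            stack.append((right, w * n[right] / n[node]))
    return total

def partial_dependence(trees, steps, target_idx, z):
    # combine the per-tree results; for a boosted expansion, the trees enter
    # with their step sizes alpha * rho_m
    return sum(s * partial_dependence_tree(t, target_idx, z)
               for t, s in zip(trees, steps))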
7 Applications: Modeling Spiking Activity as a Function of Spiking History and Kinematics

To illustrate this approach to neural point process modeling, in this section we apply stochastic gradient boosting to spiking activity recorded from two cells, one in M1 and the other in 5d, of a behaving monkey. These two cells are henceforth referred to as cells A and B, respectively. They were chosen based on their rich modulation by intended and executed kinematics. Details of the basic recording hardware and surgical protocols are available elsewhere (Paninski, Fellows, Hatsopoulos, & Donoghue, 2004). Following task training, two Bionic Technologies LLC (BTL, Salt Lake City, UT) 100-electrode silicon arrays were implanted in M1 and 5d areas related to sensorimotor control of the arm. One monkey (M. mulatta) was operantly conditioned to perform a standard center-out radial reaching task with an instructed delay. The monkey learned to use a manipulandum in a horizontal plane. The manipulandum controlled a cursor on a monitor oriented vertically in front of her. Each trial consisted of moving the cursor to one of eight radially displaced targets on the screen. The task started with the monkey moving the manipulandum to a center holding position. Soon after, an instructed delay cue specifying the target for the reach was presented for 1.5 sec. Finally, a go cue, appearing randomly in the time interval 1–2 sec after the disappearance of the instructed delay cue, signaled the monkey to reach for the remembered target. The hand position signal in 2D
was digitized and resampled to 1 kHz. Hand velocities and accelerations were obtained via standard Savitzky-Golay derivatives.

7.1 Covariate List, Data Set, Boosting Meta-Parameters, and Computational Issues. The following covariates were selected. To incorporate simple spiking history effects, we selected the time elapsed since the last spike. This covariate will be denoted by s_{t_k}. To introduce the kinematics effects, we chose position, velocity, and acceleration in 2D. These kinematics covariates are denoted by q(t_k + τ_q), q̇(t_k + τ_q̇), and q̈(t_k + τ_q̈), respectively. When referring to specific components of a kinematics vector, we use subscripts, for example, q = [q_1, q_2]′. The investigation of kinematics effects at multiple time lags for each covariate, although straightforward, is beyond the scope of this article. For this reason, we chose to present the simpler case of a single time lag for each covariate. The choice of single optimal time lags was based on mutual information analyses (Paninski et al., 2004). For cell A, the time delays for position, velocity, and acceleration were set to −0.15, 0.25, and 0.25 sec, respectively. For cell B, these delays were set to zero. In order to account for the planning of movement direction during the instructed delay period, we included in the model the covariate target direction, denoted by φ and henceforth referred to as the intended movement direction in a given trial. This is not to be confused with the actual movement direction derived from the velocity vector. Our covariate vector thus consisted of x_{t_k} = [s_{t_k}, φ, q_{t_k+τ_q}, q̇_{t_k+τ_q̇}, q̈_{t_k+τ_q̈}]′, with p = 8 covariates. We analyzed 268 trials, distributed almost uniformly across the eight directions. The recorded neural point process was time discretized by taking Δ = 0.001 sec. Given the known refractory period after spiking, this choice ensured that no more than a single event would be observed in each subinterval of the partition. In each trial, only the data in the time interval [−0.4, 0.5] sec, with respect to the movement onset at time zero, were used. Typical hand movements lasted approximately 0.4 sec, with bell-shaped velocity profiles peaking around 0.2 sec after movement onset. After time discretization, the number of samples, including data from all of the trials, resulted in K = 241,200 samples. The indexes k for the resulting samples or tuples {y_k, x_k}_{k=1}^{K} were reordered by a single random permutation. That was done to ensure that the training and validation data sets contained samples originating from everywhere in the experimental session, not just, for example, the beginning or the end. The meta-parameters for the boosting algorithm were set as follows. Two-thirds of the data set were separated for model fitting, and the remaining data were used for validation tests. For the stochastic subsampling at each iteration, half of the model-fitting data set was randomly subsampled without replacement. The number of terminal nodes for the regression trees was set to R = 9, that is, p + 1. The shrinkage coefficient was set to α = 0.001. This choice was based on the analysis of the loss function computed on
the validation data set for different shrinkage parameters and different numbers of regression trees. In the example shown for cell A in Figure 1, with weak shrinkage, for example, α = 0.1, large generalization errors start to occur after just 50 to 100 trees, resulting in poor approximations. Much better performance was obtained for α = 0.001, with the improvement in the loss function remaining roughly constant after approximately 11,000 trees. Similar results were obtained for cell B and other cells in the recorded ensemble (not shown). The gradient boosting algorithm was implemented in C. It took about 3 hours of computational time on a 2 GHz dual-processor (2 GB RAM) computer to fit a model for a single cell. We used a computer cluster to run the cross-validation analyses presented in the last part of this section and in the next section.

7.2 Approximated Conditional Intensity Function and Goodness-of-Fit Analysis. To provide a glimpse of the type of CIF approximations generated by an ensemble of regression trees, Figure 2 shows an example for cell A. In this case, two main features are observed. First, after each spike, the CIF captures a refractory period (the CIF approximates zero) and a quick rebound excitation. Second, around 200 ms before movement onset, the cell begins to be modulated by the kinematics. This effect peaks about 100 ms prior to movement onset. (See also Figures 4 and 5 for details on the contributions of spiking history and kinematics to the approximation.) With the exception of sharp changes reflecting refractory periods after spiking, the temporal evolution of the estimated CIF is otherwise relatively smooth. Once the CIFs of the two cells were approximated by the ensemble of regression trees, a goodness-of-fit assessment by time rescaling was done on the validation data set. Figure 3 shows that overall, only minor deviations are observed for the quantiles close to 0.5 (cell A) and to 0.25 (cell B). These deviations could have arisen from a lack of fit of the approximation itself or from the absence in the chosen covariate set of other variables relevant to the modeled spiking activity.

7.3 Relative Importance and Partial Dependence Plots. Table 1 shows the estimated relative importance for each of the covariates. Different covariate rankings were observed for the two cells. For cell A, the most important covariates were spiking history and velocity. For cell B, velocity and position in the vertical axis, together with spiking history, were ranked highest. Partial dependence plots are shown in Figures 4 and 5 to display the marginal effects of subsets of the covariates. Each partial dependence plot shows the contribution of a subset in terms of the approximated function F̂, that is, the CIF in log scale. The partial dependence plots do not include the constant term F̂_0(x) of the approximation. The estimated effects of spiking history in terms of time elapsed since the last spike are shown in Figure 4A. An increase in the partial dependence, implying also an increase in spiking
Figure 2: Example of an approximated conditional intensity function. The resulting CIF, approximated via stochastic gradient boosting with approximately 11,000 trees (top plot), is shown together with the related spike train data for that time segment (bottom plot). The top plot also shows the magnitudes (arbitrary units) of the position (dotted), velocity (solid), and acceleration (dashed) vectors. This example was taken from cell A, during a reach to a target to the right (0 degrees). Movement onset is at time zero.
Figure 3: K-S goodness-of-fit plots via time rescaling. The K-S plots for cells A and B are shown in plots A and B, respectively. Goodness-of-fit assessment was performed on validation data.
probability, follows after a very short refractory period, just 3 to 4 ms after a spike. This feature is confirmed by the interspike interval histogram of the cell (not shown), which shows a peak around 3 ms, confirming the existence of spike doublets (see also Chen & Fetz, 2005). A broader increase in spiking probability appears centered around 30 ms. The contribution of spiking history in cell B has similar features, with the exception of doublets (not shown). Figure 4B shows the partial dependence on the intended movement direction for cell A. This covariate contributed a constant (throughout the trial) value that depended on the direction of the target in a given trial. Its effects can also be clearly seen prior to movement onset, as shown by the peri-movement time histograms in Figure 6A. Intended direction effects were minimal for cell B.

Figure 4: Partial dependence plots: spiking history and intended movement direction. (A) The partial dependence on the time elapsed since the last spike is shown for cell A. (B) The partial dependence on the intended movement direction (eight discrete states) is shown for the same cell.

Table 1: Covariates' Relative Importance.

Cell A             Cell B
1. s       32.68   1. q̇_2   25.01
2. q̇_1    28.08   2. q_2    22.57
3. q̇_2    14.99   3. s      18.61
4. q_1      7.17   4. q̈_1    8.98
5. q̈_1     5.49   5. q̇_1    7.41
6. φ        5.29   6. q_1     6.73
7. q_2      3.17   7. q̈_2    6.22
8. q̈_2     3.13   8. φ       4.26
Figure 5: Partial dependence plots for kinematics. The partial dependences for position (q_1 and q_2, in cm), velocity (q̇_1 and q̇_2, in cm/s), and acceleration (q̈_1 and q̈_2, in cm/s²) at the specified time lags (see text) for cell A are shown in plots A, B, and C, respectively.
The partial dependencies on kinematics for cell A are summarized in Figure 5. The strongest partial contributions are given by the velocity covariate. For this cell, the velocity tuning narrowed at higher speeds. A dependence of preferred direction on speed, with a shift spanning about 50 degrees, was also observed. The partial dependence on position, though small, showed a nonmonotonic relation with saturation. For acceleration, the functional form was close to sigmoidal.

7.4 Prediction of Peri-Movement Time Histograms Based on Covariate State. In neurophysiology, the analysis of the average neuronal response to stimuli or behavioral events commonly plays an important role in identifying covariates relevant to the neuron's spiking activity. In our example, this average response corresponds to what is usually referred to as the peri-movement time histogram (PMTH). Here we illustrate how to do a simple assessment of the predictive power of the chosen covariates in terms of their ability to predict the time course of this average response. This can
be done by comparing the predicted PMTH, derived from the single-trial approximated CIFs, to the empirical PMTH. The predicted PMTH is therefore a function of the covariates, while the empirical PMTH is simply a time-locked average response that does not explicitly model the covariates' effects. This comparison between predicted and empirical PMTHs (assuming the empirical PMTH provides a good estimate of the PMTH) can complement the goodness-of-fit assessment based on the time-rescaling theorem. Commonly, the time-rescaling-based assessment will show some lack of fit without, on the one hand, specifying a source for the fitting problem or, on the other hand, pointing to what aspects the model captures well despite the lack of fit. The comparison between predicted and empirically estimated PMTHs can then be used as an additional diagnostic tool to check whether the modeled covariate effects capture at least the temporal features observed in the peri-event time histograms. The empirical PMTH was obtained by first time-locking the single-trial binary sequences {ΔN_k} to the time of hand movement onset and then averaging them across trials. This average was then smoothed by a gaussian (SD = 0.05 sec) kernel and finally expressed as spiking frequency in Hz units.⁴ To obtain the predicted PMTH, we first ran stochastic gradient boosting on a model-fitting data set containing all of the trials except the trial whose CIF was being estimated. This corresponds to a leave-one-trial-out cross-validation procedure. Then, given the fitted model and observed covariates, we estimated the CIF for the selected trial. This was done for all of the trials. The predicted PMTH was finally obtained by time-locking the estimated single-trial CIFs to the onset of movement and then averaging them across trials. Figure 6 shows both the predicted and empirical PMTHs. The empirical PMTHs show strong pre- and post-movement effects for cell A, but only post-movement effects for cell B. Overall, predicted and empirical PMTHs agree well for both cells. For these two cells, we found that the selected kinematics and spiking history covariates alone could account for most of the features in the time course of activation. Similar results were obtained for the other cells in the recorded ensemble (approximately 60 cells) in the two cortical areas (not shown). These results, together with the goodness-of-fit assessment by time rescaling, suggest that the CIF approximations have explained most of the cells' spiking behavior and that the model captured well the dependence of the time course of the cells' activity on movement parameters. The small disagreements between predicted and empirical PMTHs at the extremes of the time interval [−0.4, 0.5] could have arisen from nonmodeled motor preparation and reward effects
4 We could have obtained a better estimate of the "empirical" PMTH, with corresponding confidence intervals, by fitting Bayesian splines to the histogram as a function of time. However, we found this to be unnecessary for these illustrative examples.
Figure 6: Prediction of the PMTH based on the single-trial approximated conditional intensity functions. The predicted (black) and empirical (gray) PMTHs for cells A and B are shown in plots A and B, respectively. The eight plots show the PMTH conditioned on the eight radial directions of the executed reaches.
or from estimation problems resulting from a lower signal-to-noise ratio at these extremes.

8 Comparison to Other Approaches: GLMs and Bayesian P-Splines

We compare stochastic gradient boosting regression to two alternative approaches: a generalized linear model and Bayesian penalized splines. The GLM framework applied to neural point processes has been described in detail in Truccolo, Eden et al. (2005). Briefly, in the GLM approach, the conditional intensity function was modeled as λ = exp{μ + α_1 x_1 + ... + α_p x_p}. The Bayesian P-splines approach is described below.
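For reference, the GLM comparison model can be fit with standard software; a minimal sketch using statsmodels (our illustration; X_covariates and dN are assumed names for the covariate matrix and the binned spike train):

import statsmodels.api as sm

# Poisson regression with log link: log lambda_k = mu + alpha_1 x_1 + ... + alpha_p x_p
X = sm.add_constant(X_covariates)   # prepend the intercept mu
fit = sm.GLM(dN, X, family=sm.families.Poisson()).fit()
loglik = fit.llf                    # log likelihood of the fitted model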
8.1 Bayesian P-Splines. Bayesian P-splines are a state-of-the-art nonparametric regression tool. We choose Bayesian P-splines (Brezger & Lang, 2006) instead of a Bayesian free-knot spline approach using reversible-jump MCMC (DiMatteo et al., 2001) because of their faster computation. Consider the following model for the conditional intensity function,

log λ_i = f_1(x_{i1}) + f_2(x_{i2}) + ... + f_p(x_{ip}) + v_i′γ,  (8.1)
where the f_j(x_j), j = 1, ..., p, are smooth functions and v_i′γ represents the strictly linear part of the predictor. In the Eilers and Marx (1996) approach, it is assumed that the unknown functions can be approximated by M_j = k_j + l B-splines of degree l (usually cubic) with equally spaced knots

ζ_{j0} = x_{j,min} < ζ_{j1} < ... < ζ_{j,k_j−1} < ζ_{j,k_j} = x_{j,max}  (8.2)

over the domain of x_j. Denoting the mth basis function by B_{jm}, we obtain

f_j(x_j) = Σ_{m=1}^{M_j} β_{jm} B_{jm}(x_j).  (8.3)
By defining the n × M_j design matrices X_j with the elements in row i and column m given by X_j(i, m) = B_{jm}(x_{ij}), we can rewrite the predictor in matrix notation as

log λ = X_1 β_1 + X_2 β_2 + ... + X_p β_p + Vγ.  (8.4)
Here β_j = (β_{j1}, ..., β_{jM_j})′, j = 1, ..., p, corresponds to the vectors of unknown regression coefficients. The matrix V is the usual design matrix for linear effects. To overcome the well-known difficulties involved with regression splines, Eilers and Marx (1996) suggested a relatively large number of knots (usually between 20 and 40) to ensure enough flexibility, together with a roughness penalty on adjacent regression coefficients to regularize the problem and avoid overfitting. Theoretically, because roughness is controlled by the penalty term, once a minimum number of knots is reached, the fit given by a P-spline should be almost independent of the knot number and location. In their frequentist approach, Eilers and Marx proposed penalties based on squared rth-order differences. Here we choose second-order differences, in analogy to second-order derivatives in smoothing splines. In the Bayesian P-splines approach, the second-order differences are replaced with their stochastic analogs, that is, second-order random walks defined by

β_{jm} = 2β_{j,m−1} − β_{j,m−2} + u_{jm},  (8.5)
with gaussian error u_{jm} ∼ N(0, σ_j²) and diffuse priors β_{j1} and β_{j2} ∝ const for initial values.⁵ While first-order random walks would penalize abrupt jumps β_{jm} − β_{j,m−1} between successive parameters, second-order random walks penalize deviations from the linear trend 2β_{j,m−1} − β_{j,m−2}. The amount of smoothing is controlled by the parameter σ_j², which corresponds to the inverse smoothing parameter in the traditional smoothing spline approach. By defining an additional hyperprior for the variance parameters, the amount of smoothness can be estimated simultaneously with the regression coefficients. We assign the conjugate prior for σ_j², which is an inverse gamma prior with hyperparameters a_j and b_j, that is, σ_j² ∼ IG(a_j, b_j). Based on Brezger and Lang's (2006) experience from extensive simulation studies, we use a_j = b_j = 0.001. Bayesian inference is based on the posterior of the model, which is given by

p(β_1, ..., β_p, σ_1², ..., σ_p², γ | y, X) ∝ L(y, X, β_1, ..., β_p, σ_1², ..., σ_p², γ)
  × Π_{j=1}^{p} (σ_j²)^{−rk(K_j)/2} exp( −(1/(2σ_j²)) β_j′ K_j β_j )
  × Π_{j=1}^{p} (σ_j²)^{−a_j−1} exp( −b_j/σ_j² ),  (8.6)

where L(·) is the likelihood. The last two terms on the right are the priors for the spline coefficients and for the variance of the random walk, respectively; rk(·) denotes the matrix rank, and the term K_j is a penalty matrix that depends on the prior assumptions about the smoothness of f_j and the type of covariate. This penalty matrix is of the form K_j = D′D, where D is in our case a second-order difference matrix. Spline parameters and smoothing terms were estimated by MCMC sampling. Good mixing properties of the Markov chain are provided by a sampling scheme that approximates the full conditionals of the regression parameters via IWLS (iteratively weighted least squares) and uses them as proposals in a Metropolis-Hastings algorithm, as proposed by Gamerman (1997). We set the number of MCMC iterations to 20,000, with the first 5,000 iterations discarded as the burn-in part. In order to reduce correlations between the MCMC samples, the chain was thinned by keeping only the samples from every 10 iterations. We visually inspected the resulting Markov chains to verify convergence. Cubic spline bases with 30 uniformly spaced knots were used.
5 In cases where the estimated function is highly oscillatory, the assumption of a global variance parameter may be relaxed by using u_{jm} ∼ N(0, σ_j²/δ_{jm}), where both σ_j² and δ_{jm} are hyperparameters to be estimated.
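For concreteness, the second-order difference matrix D and the penalty K_j = D′D appearing in equation 8.6 can be constructed directly (a sketch; M is the number of basis functions for one covariate):

import numpy as np

def second_order_penalty(M):
    # rows of D carry the pattern (1, -2, 1): squared second differences
    D = np.diff(np.eye(M), n=2, axis=0)   # shape (M - 2, M)
    return D.T @ D                        # penalty matrix K_j = D'D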
In the computation of the Bayesian P-splines, we used the publicly available software BayesX, developed by Brezger, Kneib, and Lang (2005). Besides its good MCMC mixing properties, especially designed for Bayesian P-splines, this software has been shown to be 60 to 280 times faster than the other well-known alternative for Bayesian inference, the software WinBUGS (Spiegelhalter, Thomas, Best, & Lunn, 2003).

8.2 Cross-Validation Results. We used a 10-fold cross-validation scheme to compare the different modeling approaches. Exactly the same 10 test data sets were used for each modeling approach, and each approach used the same eight covariates described in section 7. We modeled all 67 cells available in the data set. As a comparison criterion, we adopted the average cross-validated log likelihood of the modeled neural process. Because in Bayesian splines the introduction of interaction terms is usually restricted in practice to interaction surfaces (second-order interactions, modeled, for example, by tensor products of B-splines) and because their inclusion would dramatically increase the computational time for these comparisons, we opted not to include higher-order interaction terms in any of the modeling approaches. Therefore, the gradient boosting regression models included trees with splits on a single covariate only. Additionally, to prevent gradient boosting from having an unfair advantage in cross-validation performance, we did not estimate the optimal number of trees from the loss function computed on a small test data subset. Instead, we set the shrinkage to a small value, α = 0.001, and fixed the number of trees at a high number, 40,000, for all of the modeled cells. Hyperparameters for the Bayesian splines and additional parameters for the MCMC sampling were specified as described. For all of the cells, we obtained high acceptance rates in the Markov chains. For comparison purposes, we also computed the cross-validated log likelihood for a model that consisted of only a constant-mean Poisson process, independent of any of the covariates. We refer to this model as the noise model. Additionally, we used the differences among the cross-validated log likelihoods (Δ log likelihood) from the different models. More specifically, the log likelihoods from the GLM, Bayesian P-splines, or noise models were subtracted from the corresponding log likelihoods of the gradient boosting regression models. In this way, a positive Δ log likelihood means a better performance for the gradient boosting models on the test data sets. Figure 7 summarizes the comparison results over the 67 cells. Overall, the gradient boosting regression models performed much better for all of the 67 cells when compared to the noise and GLM models, as expected, and performed better in about 90% of the cells when compared to Bayesian P-splines. A preliminary analysis contrasting the types of estimated covariate effects on cell A based on gradient boosting regression and Bayesian P-splines is given in Figure 8. We focused on the shapes of the estimated effects rather than on their absolute magnitudes and chose the velocity covariate because, for most of the cells, it is ranked as the most important covariate.
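The comparison criterion can be sketched as follows (our illustration; fit_a and fit_b stand for any two of the fitting procedures, each returning a predictor for F̂, and loglik for the point-process log likelihood of equation 2.2 — all hypothetical helpers):

import numpy as np
from sklearn.model_selection import KFold

def cv_delta_loglik(X, dN, fit_a, fit_b, loglik, n_splits=10, seed=0):
    # average cross-validated log-likelihood difference; positive favors fit_a
    deltas = []
    for train, test in KFold(n_splits, shuffle=True, random_state=seed).split(X):
        Fa = fit_a(X[train], dN[train])(X[test])   # fit on the training folds,
        Fb = fit_b(X[train], dN[train])(X[test])   # evaluate on the held-out fold
        deltas.append(loglik(dN[test], Fa) - loglik(dN[test], Fb))
    return np.mean(deltas)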
Figure 7: Comparison between the different modeling approaches using cross-validated log likelihood. (Top) The histogram of the differences between the average 10-fold cross-validated log likelihoods from the gradient boosting regression and Bayesian P-splines models. (Middle) The histogram of the differences between the gradient boosting regression and GLM models. (Bottom) The histogram of the differences between the gradient boosting regression and noise models (a Poisson process with constant spiking rate, independent of the covariates). Total N is 67 modeled cells. Positive differences (Δ log likelihoods) mean that the gradient boosting regression models performed better on the test data than the comparison modeling approach did.
Compared to the gradient boosting estimate (see Figure 8A), the Bayesian P-spline estimate seems to oversmooth (see Figure 8B), missing many of the structures captured by the gradient boosting regression trees model. The gradient boosting estimate clearly partitions the work space into four regions (movements up and down, left and right), with fine structure especially visible in two of these quadrants. However, this structure could simply be an artifact resulting from the piecewise constant nature of regression tree approximations. Assuming for the moment that this fine structure truly exists, it could then be that a different spline approach, not dependent on MCMC sampling, would better capture it. To address this issue, we estimated a P-spline model using the recently developed MGCV algorithm in R (Wood, 2004). Although this algorithm does not provide the full Bayesian inference setup of Bayesian P-splines, it still allows for the automatic estimation of smoothness parameters. Figure 8C shows the resulting estimate obtained
Figure 8: Comparison of the estimated dependence on velocity from Bayesian P-splines, gradient boosting regression, and other spline-based methods. In all plots, the estimated marginal effects of the velocity covariate for cell A are shown as 2D surfaces. (A) Estimated dependence obtained from stochastic gradient boosting. (B) From Bayesian P-splines. (C) Penalized cubic splines (MGCV). (D) Top view of the estimated dependence in A. (E) Top view of the estimated dependence in C. (F) Penalized cubic splines with shrinkage (top view; estimated using MGCV). The range of velocity comprises the actually observed velocities in the fitted data. Velocities are given in cm/s. See the text for details.
by fitting the model with all covariates and using penalized cubic splines (basis dimension set to 32). Interestingly, the effect for velocity estimated via MGCV is overall closer to the gradient boosting estimate than the Bayesian P-spline estimate is. However, it pays the high price of significant oscillations (overshoots and undershoots) in certain regions, which probably makes the smoother Bayesian P-spline approximation better. More important, these oscillations appear exactly in regions where the piecewise constant approximation has sizable jumps (see Figure 8E). Figure 8F shows another example, now using MGCV with penalized cubic splines with shrinkage. Instead of minimizing the oscillation artifacts, it actually increases them. Similar results were obtained with B-spline bases. (We could not compute the model using thin-plate splines in MGCV because of memory allocation problems.) Therefore, we conclude that the fine structure estimated by the gradient boosting regression cannot easily be explained as a pure artifact of the piecewise constant approximation. Instead, this fine structure is likely to reflect features of the true underlying relation between spiking and velocity.

9 Discussion

We have demonstrated how stochastic gradient boosting regression can be successfully extended to the modeling of neural point processes, thus providing a robust nonparametric modeling approach for neural spiking data while preserving their point process nature. Furthermore, our choice of loss function keeps this nonparametric modeling approach close in spirit to the (regularized) maximum likelihood principle. Analysis, on validation data, of the loss function's dependence on the number of iterations (trees), together with the application of the time-rescaling theorem, provided the approach with both model selection and a rigorous goodness-of-fit assessment. Partial dependence plots and analysis of averaged CIFs were shown to enhance the model's interpretability, which is important in neurophysiological studies of the relation between spiking activity and observed covariates. Our cross-validation analyses showed that gradient boosting regression can perform similarly or superior to a state-of-the-art Bayesian regression tool. We speculate that one of the possible reasons for this better performance with respect to Bayesian splines might be related to the choice of base learners. The preliminary analysis provided in Figure 8 seems to argue in this direction. Additionally, as already pointed out by Friedman (2001), methods that use smooth functions tend to be adversely affected by outliers and wide-tailed distributions. On the other hand, the piecewise constant approximation in regression trees is more robust to these adverse effects. However, although the issue of smooth versus piecewise constant approximation seems to shed some light on why gradient boosting performed better, there are many other, equally important features in
gradient boosting that could also have contributed to this better performance. These features are shrinkage, stochastic subsampling, and especially the iterative regularized fitting of residuals. This makes properly explaining why gradient boosting performed better a nontrivial problem, and a thorough analysis of it is beyond the scope of this article. From a theoretical point of view, we believe that Bayesian splines may provide comparable or higher performance if one spends enough time fine-tuning the prior distributions and the parameters of the MCMC sampling and selecting a smaller subset of covariates. While this fine-tuning and selection is feasible for smaller selected data subsets, it is precisely what one would like to avoid in initial analyses of large data sets. It is in that sense that we see gradient boosting regression as an off-the-shelf tool. Given its good performance and computational efficiency, we propose stochastic gradient boosting regression as an off-the-shelf modeling tool for the analysis of the large neural data sets currently generated by multiple-array recording technology (e.g., more than 50 cells, number of samples K > 10^5) and corresponding measurements of multidimensional (dimension > 4) covariate spaces. Additionally, while in gradient boosting regression trees we can easily attempt to capture arbitrary high-order interactions among the covariates, spline-based methods are in practice currently limited to second-order interaction effects, usually modeled with tensor products of two univariate spline bases. The introduction of second-order interaction effects by tensor product splines in generalized additive models (GAMs) can in practice be infeasible even in the case of low-dimensional covariate spaces. That is because of memory storage problems resulting from the requirement in current algorithms that projections of the covariate data on these tensor product bases be explicitly stored in matrices. These matrices can be prohibitively large given the typical size of neural point process data sets (K > 10^5). The use of Bayesian P-splines would alleviate the memory storage problem. Currently, however, these methods are not a practical choice because of their computational cost and the difficulty of designing MCMC samplers with good mixing properties in high-dimensional parameter spaces once interaction effects are added. Alternative methods based on kernel regression algorithms (e.g., kernel logistic regression and gaussian process regression) currently face similar difficulties. Most available algorithms require the explicit representation of kernel matrices as 2D arrays and are therefore limited to a few hundred thousand samples. Finally, another advantage of stochastic gradient boosting regression is that it can easily handle mixed metric and discrete (categorical) covariates. Mixed covariates present difficulties for GAM algorithms based exclusively on spline bases. Confidence intervals for the approximated functions and partial covariate effects can in principle be obtained by bootstrapping gradient boosting regression models. However, the introduction of bootstrapping would make gradient boosting regression less attractive in terms of computational
efficiency. We recommend the following approach. If more detailed statistical inference based on estimated confidence intervals is required after an initial modeling analysis performed with stochastic gradient boosting regression, one can focus on a selected smaller subset of modeled cells and perhaps lower-dimensional covariate sets. Bootstrapping of gradient boosting models could then be used. Alternatively, one could consider the application of Bayesian splines to this selected smaller data subset in order to assess parameter uncertainty. To address these issues, we are also considering a Bayesian formulation of boosting regression trees (work in preparation). In neuroscience, modeling the relations between neural spiking and stimulus-behavioral covariates has been intrinsically associated with the problem of neural population decoding. In this regard, neural point process decoding based on models fitted through stochastic gradient boosting can be easily implemented using sequential Monte Carlo particle filters (Doucet, de Freitas, & Gordon, 2001; Gao, Black, Bienenstock, Shoham, & Donoghue, 2001; Brockwell, Rojas, & Kass, 2004). The spiking probability in the time interval ((k − 1)Δ, kΔ] given the estimated states, necessary to compute the importance sampling weights for the particles, is simply obtained as P(ΔN_k | x_k) = exp{ΔN_k F̂(x_k) − exp{F̂(x_k)}}. The application of gradient boosting models to other neural decoding approaches that require the computation of partial derivatives of the CIF with respect to covariates (Eden, Frank, Barbieri, Solo, & Brown, 2004) will be problematic because of the piecewise constant nature of the regression trees approximation.
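The particle weight update just mentioned is a one-liner (a sketch; F_hat stands for the boosted approximation evaluated at each particle's proposed state):

import numpy as np

def log_weight(dN_k, F_hat):
    # log P(dN_k | x_k) = dN_k * F_hat - exp(F_hat)
    return dN_k * F_hat - np.exp(F_hat)

# normalized importance weights across particles:
# w = np.exp(lw - lw.max()); w /= w.sum()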
We can think of several directions for improvement of stochastic gradient boosting regression for neural point processes. The algorithm is generic regarding the types of base learners or regressors that can be used at each iteration step. We plan to investigate the use of penalized low-dimensional B-splines as base learners, for example. Boosting splines could be used in cases where smooth approximations are preferable; this choice might also result in fewer boosting iterations. With respect to our goal of identifying the features of the functional form relating spiking and covariates, and despite the contribution of partial dependence plots in this regard, model interpretability remains a topic for further development. As mentioned before, one advantage of an ensemble of regression trees is that the modeling of interaction effects among the covariates can be easily implemented. Estimated second-order interaction effects can be easily visualized in 2D plots. However, an analysis of higher-order interaction effects and their contribution to spiking activity is beyond the assessment provided by the partial dependence plots. We are also considering more efficient algorithmic implementations of boosting regression in terms of least angle regression (Efron, Hastie, Johnstone, & Tibshirani, 2004). A comparison to other related nonparametric modeling algorithms, such as Random Forests (Breiman, 2001), is underway.

Our applications to M1 and 5d data in this article were intended as illustrative examples of the algorithm. Refined models will likely require larger sets of covariates covering more extensive spiking history effects, as well as the introduction of kinematics covariates at multiple time lags. In addition, movement covariates in coordinate frames other than the Cartesian coordinates adopted here and covariates involving dynamic forces are other crucial topics for investigation. A detailed study of the relations between spiking and movement covariates in these two areas is the topic of a future report. We are also currently applying stochastic gradient boosting to the study of cortico-cortical interactions between primary motor (M1) and parietal (5d) cortices during sensorimotor control of reaching (Truccolo, Vargas et al., 2005).

Appendix

For completeness, we describe in more detail two aspects of stochastic gradient boosting. We follow the exposition in Friedman (2001).

A.1 Gradient Smoothing via "Greedy-Stagewise" Approximation. Given the representation
F*(x) ≈ F̂(x) = Σ_{m=0}^{M} f_m(x),  (A.1)
for the function solving the optimization problem in equation 3.1, and under sufficient regularity conditions such that we can interchange differentiation and integration, each f_m(x) can be implemented as the mth step in a steepest-descent gradient optimization,

f_m(x) = ρ_m g_m(x) = −ρ_m E_y[ ∂L(y, F(x)) / ∂F(x) | x ]_{F(x)=F̂_{m−1}(x)},  (A.2)
where g_m is the negative gradient determining the (steepest-descent) direction of the step, ρ_m determines the magnitude of the step, and F̂_{m−1}(x) = Σ_{j=0}^{m−1} f_j(x). Formulated in this simple pointwise optimization manner, this nonparametric approach would result in poor generalization when applied to finite data sets {y_k, x_k}_{k=1}^{K}. The approximation would lack generalization when estimating F*(x) at x-values other than the ones provided by the sample data set. This shortcoming can be dealt with by imposing smoothness on the approximation. Gradient smoothing can be achieved by adopting a parameterized form of f_m(x) and performing parameter
optimization at each of the iteration steps. A common parameterization is to express each f_m(x) in terms of a regressor or base learner h(x; a_m), such that F̂(x) = Σ_{m=1}^{M} β_m h(x; a_m). The optimization problem then becomes
{β̂_m, â_m}_{m=1}^{M} = arg min_{{β_m, a_m}_{1}^{M}} Σ_{k=1}^{K} L( y_k, Σ_{m=1}^{M} β_m h(x_k; a_m) ).  (A.3)
Instead of solving the above at once (i.e., for all m = 1, . . . , M simultaneously), a “greedy stagewise” approach is motivated as follows. First, proceeding in an iterative fashion as before, the problem to be solved at each step is
(β̂_m, â_m) = arg min_{β,a} Σ_{k=1}^{K} L( y_k, F̂_{m−1}(x_k) + β h(x_k; a) ),  (A.4)
with F̂_{m−1}(x) = Σ_{j=0}^{m−1} β_j h(x; a_j). Second, the problem can be simplified by noticing that it suffices to find the member of the parameterized class h(x; a_m) that produces the vector h_m = {h(x_k; a_m)}_{k=1}^{K} most parallel to the negative gradient vector g_m = {g_{mk}(x_k)}_{k=1}^{K} ∈ R^K, where the K components of the negative gradient are given by
g_{mk}(x_k) = −[ ∂L(y_k, F(x_k)) / ∂F(x_k) ]_{F(x)=F̂_{m−1}(x)},  k = 1, ..., K.  (A.5)
Note that the h_m most parallel to g_m is also the one most correlated with g_m(x) over the data distribution. For most types of base learners, the most parallel vector can be obtained by solving the least-squares problem

a_m = arg min_{a,β} Σ_{k=1}^{K} [g_{mk}(x_k) − β h(x_k; a)]².  (A.6)
Other minimization criteria can be used in place of equation A.6 if needed. In summary, by proceeding in a greedy-stagewise manner, the hard gradient smoothing problem in equation A.3 has been replaced by the much simpler problem of least-squares fitting of h(x; a) to the negative gradient, followed by only a single parameter optimization for the magnitude of the gradient step ρ_m in

ρ_m = arg min_ρ Σ_{k=1}^{K} L( y_k, F̂_{m−1}(x_k) + ρ h(x_k; a_m) ).  (A.7)
A.2 Partial Dependence Plots. Given a data set {y_k, x_k}_{k=1}^{K}, with x = [x_1, x_2, ..., x_p]′, let z_l be a chosen target subset, of size l, of the covariate variables x,

z_l = {z_1, ..., z_l} ⊂ {x_1, ..., x_p},  (A.8)

and z_{\l} be the complement set, such that

z_{\l} ∪ z_l = x.  (A.9)
If one conditions on specific values for the variables in z_{\l}, then F̂(x) can be considered as a function only of the variables in the chosen subset z_l:

F̂_{z_{\l}}(z_l) = F̂(z_l | z_{\l}).  (A.10)
In general, the functional form of $\hat{F}_{z_{\setminus l}}(z_l)$ will depend on the particular values chosen for $z_{\setminus l}$. If this dependence is not too strong, then the average function

$$\bar{F}_l(z_l) = E_{z_{\setminus l}}\!\left[\hat{F}(x)\right] = \int \hat{F}(z_l, z_{\setminus l})\, p_{\setminus l}(z_{\setminus l})\, dz_{\setminus l}, \tag{A.11}$$

can represent a useful summary of the partial dependence of $\hat{F}(x)$ on the chosen variable subset $z_l$. Here $p_{\setminus l}(z_{\setminus l})$ is the marginal probability density of $z_{\setminus l}$,

$$p_{\setminus l}(z_{\setminus l}) = \int p(x)\, dz_l, \tag{A.12}$$

where $p(x)$ is the joint density of all the covariates $x$. An empirical estimate of this quantity can be obtained from the data. In the case where the dependence of $\hat{F}(x)$ on the chosen covariate subset is additive or multiplicative, the form of $\hat{F}_{z_{\setminus l}}(z_l)$ does not depend on the joint values of the complement covariates $z_{\setminus l}$, and equation A.11 provides a complete description of the variation of $\hat{F}(x)$ over the chosen covariate subset. The approximation of the partial dependence measure based on the reasoning above is simplified in the case of regression trees by the procedure given in section 6.2.
Acknowledgments

W. T. thanks Emery Brown and Uri Eden for discussions on point process theory, Ji Zhu for discussions on boosting algorithms, and Carlos Vargas for the data recordings. This work was supported by ONR NRD-339, NIH NS-25074, and DARPA MDA 972-00-1-0026.
Received August 16, 2005; accepted July 11, 2006.
LETTER
Communicated by Bard Ermentrout
Synchrony of Neuronal Oscillations Controlled by GABAergic Reversal Potentials

Ho Young Jeong
[email protected]
Center for Neural Science, New York University, New York, NY 10003, U.S.A.
Boris Gutkin
[email protected]
Group for Neural Theory, DEC, ENS-Paris; Collège de France; and URA 2169 Récepteurs et Cognition, Institut Pasteur, Paris, France
GABAergic synapse reversal potential is controlled by the concentration of chloride. This concentration can change significantly during development and as a function of neuronal activity. Thus, GABA inhibition can be hyperpolarizing, shunting, or partially depolarizing. Previous results pinpointed the conditions under which hyperpolarizing inhibition (or depolarizing excitation) can lead to synchrony of neural oscillators. Here we examine the role of the GABAergic reversal potential in the generation of synchronous oscillations in circuits of neural oscillators. Using weakly coupled oscillator analysis, we show when shunting and partially depolarizing inhibition can produce synchrony, asynchrony, and coexistence of the two. In particular, we show that this depends critically on such factors as the firing rate, the speed of the synapse, spike frequency adaptation, and, most important, the dynamics of spike generation (type I versus type II). We back up our analysis with simulations of small circuits of conductance-based neurons, as well as large-scale networks of neural oscillators. The simulation results are compatible with the analysis: for example, when bistability is predicted analytically, the large-scale network shows clustered states.

1 Introduction

Several studies (Ben-Ari, 2001, 2002) have shown that the effect of GABAergic synapses on the postsynaptic membrane varies according to the developmental stage of neuronal circuits. In the adult brain, the glutamatergic and GABAergic synapses act as the major sources of excitation and inhibition, respectively, through ionotropic ligand-gated channels. The balance between excitation and inhibition is essential to sculpt oscillatory patterns of neural activity such as theta or gamma activity. In contrast to the adult, GABA synaptic neurotransmission in the developing

Neural Computation 19, 706–729 (2007)
© 2007 Massachusetts Institute of Technology
brain circuits is depolarizing (excitatory) as a result of a high intracellular concentration of chloride ions. This excitatory effect is associated with the early network oscillations (ENOs) in the CA1 region (Garaschuk, Hanse, & Konnerth, 1998) and the giant depolarizing potentials (GDPs) in the CA3 region of the neonatal hippocampus (Ben-Ari, Cherubini, Corradetti, & Gaiarsa, 1989). Both the ENOs and the GDPs reflect underlying network activity with spontaneous oscillatory patterns during early postnatal development. It is believed that GABA-mediated depolarization is a key factor in the development of central nervous structures.

GABAergic synaptic channels are permeable to chloride ions, whose concentration controls the GABA synaptic reversal potential. In developing neurons, the reversal potential for chloride ($E_{Cl}$) sits at a more depolarized level than the resting membrane potential ($E_{rest}$); hence the depolarizing action of the GABA synapse. In the rodent hippocampus, $E_{Cl}$ decreases with age during the immediate postnatal period and arrives at the typical resting potential of −65 mV after about two weeks (Owens, Boyce, Davis, & Kriegstein, 1996). $E_{Cl}$ in the adult can be below $E_{rest}$ but, depending on the activity of the postsynaptic neuron, can ride up toward $E_{rest}$, producing so-called shunting inhibition.

Synchronous patterns have been recognized as one of the most prominent features of neuronal activity found in the brain (see Buzsaki & Draguhn, 2004, for review). For example, synchrony of neuronal firing has been observed during sensory information processing such as visual pattern recognition (Gray, König, Engel, & Singer, 1989), odor recognition (MacLeod & Laurent, 1996), and sensorimotor tasks (Abeles, Bergman, Margalit, & Vaadia, 1993). Several computational studies have revealed that GABA-mediated inhibition is capable of synchronizing the activity in interneuronal networks when the decay time of the inhibitory synaptic current is slow relative to the intrinsic membrane recovery time (Bush & Sejnowski, 1996; Hansel & Mato, 2003; van Vreeswijk, Abbott, & Ermentrout, 1994; Wang, 2003). These results are in good agreement with physiological results showing that GABA_A-receptor-mediated inhibition in hippocampal interneuronal circuits, without the involvement of pyramidal neurons, can give rise to coherent 40 Hz oscillations (Whittington, Traub, & Jefferys, 1995). Spontaneous coherent oscillations at 40 Hz have also been demonstrated in hippocampal slices under cholinergic modulation, suggesting that synchronization can occur in networks of both interneurons and pyramidal cells provided that the balance between excitatory and inhibitory synaptic transmission is maintained (Fisahn, Pike, Buhl, & Paulsen, 1998).

In developing neuronal circuits, spontaneous coherent oscillatory patterns of activity such as ENOs and GDPs are clearly present. These behaviors appear at developmental stages when glutamatergic neuronal transmission has not yet matured but the depolarizing GABAergic synapses are fully functional, suggesting that the actions of GABA play a
crucial role in generating the neuronal activity in the neonatal hippocampus (Ben-Ari, 2002). However, this appears to be at odds with previous theoretical results pointing out that synchronization of neuronal oscillations is unlikely to be realized by excitation alone (except under specific conditions on the neurons themselves; see below) but is achieved with the help of synaptic inhibition (Ermentrout, 1996; Gerstner & Kistler, 2002; Hansel, Mato, & Meunier, 1995; Neltner & Hansel, 2001). A partial explanation exists for the conditions under which intrinsically oscillating neurons can synchronize through excitation (Crook, Ermentrout, & Bower, 1998; Ermentrout, Pascal, & Gutkin, 2001). The key was the slow voltage-dependent potassium current (the M current), which can shift a neuron from a type I to a type II oscillator. In addition, the calcium-dependent potassium current underlying the afterhyperpolarization (AHP) gave rise to synchronization by making neurons insensitive to inputs that would normally desynchronize the circuit. However, previous theoretical analyses considered synaptic inputs not as conductances but as currents and thus could not show how changes in the reversal potential of GABAergic synapses affect synchronization properties. Moreover, the analysis presented in such work was based on pairs of mutually coupled neural oscillators, so most of these studies did not consider how a large neuronal network would act in the parameter regions treated by the two-neuron analysis.

To address these issues, we consider three representative regimes of the GABAergic synapse in reciprocally coupled neuronal circuits: hyperpolarizing, shunting, and depolarizing. For these regimes, we investigate how synchrony is affected by the adaptation current present in neonatal neurons. Through both theoretical analysis and numerical simulations, we show that adaptation-induced low firing rates combined with depolarizing GABAergic synapses may be the key to the formation of developing neuronal circuits through synchronization. Parts of this work have been presented previously in conference proceedings (Jeong & Gutkin, 2005).

2 The Model and Methods

2.1 Neuron Model. In this work, we use a single-compartment conductance-based model that contains the currents required for spiking: a voltage-dependent transient sodium current ($I_{Na}$), a delayed-rectifier potassium current ($I_K$), and a nonspecific leak current ($I_L$). In addition, we add a high-threshold calcium current (L current), $I_{Ca}$, and a calcium-dependent potassium current, $I_{AHP}$, since these currents are associated with the generation of neuronal activity in neonatal rat CA3 pyramidal neurons in vitro (Tanabe, Gahwiler, & Gerber, 1998). In particular, the voltage-independent afterhyperpolarization current produces spike frequency adaptation, driven by calcium entry through the L-type calcium channels, without changing the bifurcation structure of the neuron model. The model is a simple variant of Traub's
neuron model (Traub & Miles, 1991), and its mathematical description and parameters are borrowed from Ermentrout and colleagues (Ermentrout et al., 2001). We use this particular model so as to make direct comparisons with the previously published work (e.g., Ermentrout et al., 2001). The membrane potential $V$ of a neuron obeys the following current-balance equation,

$$C_m \frac{dV}{dt} + I_L + I_{Na} + I_K + I_{Ca} + I_{AHP} = I, \tag{2.1}$$
where $C_m = 1$ µF/cm² and the leak current is $I_L = g_L(V - E_L)$; $I$ is the injected current. The voltage-dependent sodium and potassium currents are described by gating variables $x$ that satisfy first-order kinetics,

$$\frac{dx}{dt} = \alpha_x(V)(1 - x) - \beta_x(V)\,x, \tag{2.2}$$
which can be considered as the mean activity of a large number of channels. The rates $\alpha_x$ and $\beta_x$ are functions of the voltage. For the sodium current, $I_{Na} = \bar{g}_{Na} m^3 h (V - E_{Na})$, where $\alpha_m(V) = 0.32(V + 54)/(1 - \exp(-(V + 54)/4))$, $\beta_m(V) = 0.284(V + 27)/(\exp((V + 27)/5) - 1)$, $\alpha_h(V) = 0.128 \exp(-(V + 50)/18)$, and $\beta_h(V) = 4/(1 + \exp(-(V + 27)/5))$. For the delayed-rectifier potassium current, $I_K = \bar{g}_K n^4 (V - E_K)$, where $\alpha_n(V) = 0.032(V + 52)/(1 - \exp(-(V + 52)/5))$ and $\beta_n(V) = 0.5 \exp(-(V + 57)/40)$. The high-threshold calcium current is $I_{Ca} = \bar{g}_{Ca} m_\infty (V - E_{Ca})$, where $m_\infty = 1/(1 + \exp(-(V + 25)/2.5))$. The voltage-independent calcium-activated potassium current is $I_{AHP} = \bar{g}_{AHP}\, ([Ca^{2+}]/([Ca^{2+}] + 1))\, (V - E_K)$. The change in the intracellular calcium concentration $[Ca^{2+}]$ is given by

$$\frac{d[Ca^{2+}]}{dt} = -\epsilon\, I_{Ca} - \frac{[Ca^{2+}]}{\tau_{Ca}}, \tag{2.3}$$
where $\epsilon$ is proportional to the ratio of the membrane area to the volume beneath the membrane, and $\tau_{Ca}$ is the relaxation time constant of $[Ca^{2+}]$. Here, we choose the same parameter values as Wang (1998): $\epsilon = 0.002$ µM (ms µA)⁻¹ cm² and $\tau_{Ca} = 80$ ms. The parameter values for spike generation are chosen to ensure a type I membrane, since there is some evidence that cortical neurons have such dynamics (Gerstner & Kistler, 2002; Tateno, Harsch, & Robinson, 2004): $E_K = -100$, $E_{Na} = 50$, $E_L = -67$, $E_{Ca} = 120$ (in mV); $\bar{g}_K = 80$, $\bar{g}_{Na} = 100$, $g_L = 0.2$, $\bar{g}_{Ca} = 1$ (in mS/cm²). The parameters $\bar{g}_{AHP}$ and $I$ are varied in the simulations to control the adaptation and the firing rate of the model. Figure 1 illustrates that the neuron model shows typical type I membrane behavior.
[Figure 1 appears here: (A) F-I curve (firing rate vs. injected current I, in µA/cm²) with voltage-trace insets; (B) bifurcation diagram of V (mV) vs. I, with stable and unstable branches and a saddle-node at SN = 0.4917.]
Figure 1: F-I curve and bifurcation diagram. (A) The onset of firing occurs at the saddle-node (SN) bifurcation, where the frequency is zero. Even if the adaptation current is relatively strong ($\bar{g}_{AHP} = 2$), the neuron fires at an arbitrarily low frequency. The smooth firing-rate curves are fitted to the data obtained at each injected current value. Insets illustrate the firing with and without the adaptation current. (B) The bifurcation analysis of the Traub neuron shows that the model has a saddle-node bifurcation point when $\bar{g}_{AHP} = 1$, the typical pattern of a type I membrane.
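For readers who wish to reproduce the F-I curve, the model of equations 2.1 to 2.3 can be integrated directly. The following Python sketch uses simple forward Euler with the parameter values quoted in the text; the initial conditions and the name `eps` for the calcium influx constant are our assumptions (the original analysis was done in XPPAUT).

```python
import numpy as np

# Voltage-dependent rate functions of the Traub-type model (equation 2.2).
def a_m(V): return 0.32 * (V + 54) / (1 - np.exp(-(V + 54) / 4))
def b_m(V): return 0.284 * (V + 27) / (np.exp((V + 27) / 5) - 1)
def a_h(V): return 0.128 * np.exp(-(V + 50) / 18)
def b_h(V): return 4 / (1 + np.exp(-(V + 27) / 5))
def a_n(V): return 0.032 * (V + 52) / (1 - np.exp(-(V + 52) / 5))
def b_n(V): return 0.5 * np.exp(-(V + 57) / 40)

def simulate(I_inj, g_AHP=1.0, T=500.0, dt=0.01):
    """Forward-Euler integration of equations 2.1-2.3 (C_m = 1 uF/cm^2)."""
    E_K, E_Na, E_L, E_Ca = -100.0, 50.0, -67.0, 120.0
    g_K, g_Na, g_L, g_Ca = 80.0, 100.0, 0.2, 1.0
    eps, tau_Ca = 0.002, 80.0
    V, m, h, n, Ca = -67.0, 0.0, 1.0, 0.0, 0.0   # assumed initial conditions
    trace = np.empty(int(T / dt))
    for i in range(trace.size):
        m_inf = 1 / (1 + np.exp(-(V + 25) / 2.5))   # L-current activation
        I_Ca = g_Ca * m_inf * (V - E_Ca)
        I_AHP = g_AHP * (Ca / (Ca + 1)) * (V - E_K)
        I_ion = (g_L * (V - E_L) + g_Na * m**3 * h * (V - E_Na)
                 + g_K * n**4 * (V - E_K) + I_Ca + I_AHP)
        dV = I_inj - I_ion                           # equation 2.1
        dm = a_m(V) * (1 - m) - b_m(V) * m
        dh = a_h(V) * (1 - h) - b_h(V) * h
        dn = a_n(V) * (1 - n) - b_n(V) * n
        dCa = -eps * I_Ca - Ca / tau_Ca              # equation 2.3
        V, m, h, n, Ca = (V + dt * dV, m + dt * dm, h + dt * dh,
                          n + dt * dn, Ca + dt * dCa)
        trace[i] = V
    return trace
```

Counting spikes in the returned trace for a range of `I_inj` values would reproduce a curve of the kind shown in Figure 1A.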
As Ermentrout et al. (2001) pointed out, the neuron is in the resting state at low values of injected current and begins to fire repetitively at an arbitrarily low frequency (see Figure 1A). The bifurcation diagram shows that the neuron model has a stable limit cycle when a saddle point and a stable node coalesce and disappear (see Figure 1B). Note that even a strong adaptation current does not affect the bifurcation structure of the neuron at equilibrium, but it causes the neuron to fire at a lower rate when a current above the critical value is injected (as compared to the nonadapting case). The adapting $I_{AHP}$ provides delayed negative feedback on the neuron's excitability through a slow, outward potassium current.

2.2 Synaptic Model. The release of the GABA neurotransmitter that binds to the GABA_A receptors is well described by a first-order kinetic model (Destexhe, Mainen, & Sejnowski, 1998), which represents the transitions between the open and closed states of the receptors. If $S$ is defined as the fraction of receptors in the open state, its evolution is governed by the kinetics of the GABA_A receptors and the concentration of transmitter $[T]$,

$$\frac{dS}{dt} = \alpha [T](1 - S) - \beta S, \tag{2.4}$$
where $\alpha$ and $\beta$ are the forward and backward rate constants of the GABA_A receptors. In this work, $\alpha$ is fixed at 2.0 mM⁻¹ms⁻¹, but $\beta$ is varied to investigate how the decay time constant influences synchrony. For $[T]$, it is convenient
[Figure 2 appears here: (A) gating variable S vs. time (msec) for β = 0.1, 0.5, and 2.0; (B) presynaptic and postsynaptic V (mV) vs. time for E_GABA = 0, −30, −65, and −80 mV.]
Figure 2: Effects of the synaptic decay time and influence of the reversal potential on the postsynaptic membrane potential. (A) The shape of $S$ changes according to the backward reaction rate $\beta$. As $\beta$ increases, the synapse becomes faster, and the stability of the phase dynamics varies. (B) Transient postsynaptic membrane potentials caused by presynaptic release of GABAergic neurotransmitter. Depending on the reversal potential of the GABAergic synapse, the postsynaptic response is excitatory ($E_{GABA}$ = 0 and −30 mV), shunting (−65 mV), or inhibitory (−80 mV).
to use a continuous function to transform the presynaptic voltage into the transmitter concentration. If it is assumed that all reactions involved are relatively fast, they can be considered to be at steady state, and the transform of the presynaptic voltage to the transmitter concentration can be obtained from the stationary relationship

$$[T](V_{pre}) = \frac{T_{max}}{1 + \exp(-(V_{pre} - V_p)/K_p)}, \tag{2.5}$$
where $T_{max}$ is the maximal concentration of transmitter in the synaptic cleft, and $V_p$ and $K_p$ determine the threshold and steepness of the release. We use the values chosen by Ermentrout et al. (2001): $T_{max} = 1$ mM, $V_p = -10$ mV, and $K_p = 10$ mV. Note that the synapse speeds up as the backward rate constant $\beta$ increases (see Figure 2A). The postsynaptic current $I_{GABA}$ is given by

$$I_{GABA} = \bar{g}_{GABA}\, S(t)\, (V - E_{GABA}), \tag{2.6}$$
where $\bar{g}_{GABA}$ is the maximal conductance of the GABA_A receptors and $E_{GABA}$ the reversal potential. This synaptic model is inserted into the membrane equation of the neuron (see equation 2.1). Once a fast chemical synapse activates, a rapid and transient change can be observed in the postsynaptic potential.
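As a concrete illustration, equations 2.4 to 2.6 amount to one extra state variable per synapse. A minimal Python sketch follows; the conductance value used here is an arbitrary illustrative choice, not a parameter from the article.

```python
import numpy as np

def transmitter(V_pre, T_max=1.0, Vp=-10.0, Kp=10.0):
    """Stationary transmitter concentration [T] (equation 2.5)."""
    return T_max / (1 + np.exp(-(V_pre - Vp) / Kp))

def gaba_step(S, V_pre, V_post, dt, alpha=2.0, beta=0.5,
              g_GABA=0.1, E_GABA=-65.0):
    """One Euler step of the gating variable S (equation 2.4) and the
    resulting postsynaptic current (equation 2.6)."""
    T = transmitter(V_pre)
    S = S + dt * (alpha * T * (1 - S) - beta * S)
    I_GABA = g_GABA * S * (V_post - E_GABA)
    return S, I_GABA
```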
When the synapse is excitatory, the membrane potential rapidly depolarizes and then returns slowly to its resting state, producing an excitatory postsynaptic potential (EPSP). In contrast, an inhibitory synapse gives rise to a transient hyperpolarization of the postsynaptic membrane, an inhibitory postsynaptic potential (IPSP). Figure 2B illustrates how the reversal potential of the GABAergic synapse influences the postsynaptic membrane potential.

2.3 Phase Reduction. In general, it is difficult to predict the behavior of coupled neurons with nonlinear properties. However, in the limit of weak coupling, we can reduce the behavior of a pair of coupled neurons to a simple phase model and analyze the stability of its phase-locked solutions (Kuramoto, 1984; Rinzel & Ermentrout, 1998; Ermentrout et al., 2001; Pikovsky, Rosenblum, & Kurths, 2001; Winfree, 2001). If the individual neurons have identical properties and stable limit cycles, and their coupling is weak enough not to affect the amplitude of each neuron's limit cycle (i.e., it has only a weak effect on the voltage), then the full system can be reduced to a model for the difference in the phases of the neurons. In this section, we recap how to derive the phase model and then apply it to specific conductance-based neuron models. Consider a system governed by ordinary differential equations,

$$\frac{dX}{dt} = F(X), \tag{2.7}$$
and suppose that this system has a stable periodic solution. We now consider an identical pair of oscillators $X_i$ ($i = 1, 2$) that are weakly coupled: the coupling does not distort the waveforms but perturbs only the relative phase between the coupled oscillators. We write them in the general form

$$\frac{dX_i}{dt} = F(X_i) + g_c\, G_i(X_1, X_2). \tag{2.8}$$
Using the frequencies of the uncoupled oscillators, $\omega_i = 1$, we rewrite equation 2.8 in terms of a phase $\theta$ on the limit cycle. From the chain rule, we obtain

$$\frac{d\theta_i}{dt} = \sum_k \frac{\partial \theta_i}{\partial X_k}\left[ F(X_k) + g_c\, G_k(X_1, X_2) \right] \tag{2.9}$$
$$= \omega_i + g_c \sum_k \frac{\partial \theta_i}{\partial X_k}\, G_k(X_1, X_2), \qquad k = 1, 2. \tag{2.10}$$
For $0 < g_c \ll 1$, the solution of equation 2.8 has the form $X_i(t) = X_0(\theta_i) + O_i(g_c)$, since the deviations of $X$ from the limit cycle of the uncoupled oscillators, $X_0$, are small. Thus, to first approximation, we can neglect these deviations and calculate the second term on the right-hand side. Because $\theta$ changes slowly on the timescale of $T$, we can average over the full period of
oscillation $T$, keeping $\theta_i$ fixed, and obtain the phase dynamics:

$$\frac{d\theta_i}{dt} = \omega_i + \frac{g_c}{T} \int_0^T Z(t)\, G_i\big(t,\, t + (\theta_j - \theta_i)T\big)\, dt \tag{2.11}$$
$$= \omega_i + H_i(\theta_j - \theta_i), \qquad i \neq j, \tag{2.12}$$
where $Z(t)$ describes the phase shift due to a perturbation and is called the infinitesimal phase response curve (PRC). Note that $H$ is a $T$-periodic averaged interaction function. Since the main reason to derive these equations is to study the synchronization properties of weakly coupled oscillators, we consider a pair of identical oscillators that are weakly and symmetrically coupled. Then equations 2.12 for the two oscillators satisfy $H_i = H_j = H$ and $\omega_{0i} = \omega_{0j} = \omega_0$, and they can be rewritten as the dynamics of the phase difference between the coupled neurons, $\phi = \theta_j - \theta_i$:
$$\frac{d\phi}{dt} = H(-\phi) - H(\phi) \tag{2.13}$$
$$= -2 H_{odd}(\phi), \tag{2.14}$$
where $H_{odd}$ is the odd component of $H$. Since phase-locked solutions appear at values of $\phi$ where $H_{odd}(\phi)$ crosses the zero axis, there are always at least two roots: synchronous ($\phi = 0$) and antiphase ($\phi = T/2$). From equation 2.14, it is clear that a phase-locked solution $\bar{\phi}$ is stable when $H'_{odd}(\bar{\phi})$ is positive. In particular, the synchronous solution is stable if $H'(0) > 0$. In this article, we define synchrony as a zero phase lag between two coupled identical neurons.

So far, we have dealt with the general form of the weakly coupled oscillator equations. For mutually coupled neurons with GABA_A receptors (see equations 2.1–2.6), the averaged interaction function $H$ is given by

$$H(\phi) = \frac{\bar{g}_{GABA_A}}{T} \int_0^T Z(t)\, S(t + \phi T)\, \big(E_{GABA_A} - V(t)\big)\, dt. \tag{2.15}$$
This equation shows that all the changes in the function H arise from three factors: the intrinsic property of membrane dynamics, the synaptic function S(t), and the synaptic reversal potential.
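For completeness, the averaging in equation 2.15 is straightforward to carry out numerically once the PRC and the limit cycle have been computed (e.g., with XPPAUT). The following is a minimal Python sketch under the assumption that Z, S, and V are sampled on a common grid over one period; the function names are ours.

```python
import numpy as np

def interaction_fn(Z, S, V, g_gaba, E_gaba):
    """Numerical average of equation 2.15 over one period.  Z, S, and V are
    arrays sampling the PRC, the synaptic gating variable, and the membrane
    voltage over one cycle of the uncoupled limit cycle; the phase shift phi
    is realized by circularly shifting S.  Returns H on the same phase grid."""
    n = len(Z)
    H = np.empty(n)
    for i in range(n):              # i corresponds to phi = i/n of a period
        H[i] = g_gaba * np.mean(Z * np.roll(S, -i) * (E_gaba - V))
    return H

def odd_part(H):
    """H_odd(phi) = (H(phi) - H(-phi)) / 2 on the periodic grid; zeros of
    H_odd with positive slope are the stable phase-locked solutions."""
    H_neg = np.roll(H[::-1], 1)     # maps index i to (-i) mod n
    return 0.5 * (H - H_neg)
```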
2.4 Populations of Globally Coupled Neurons. The analysis of mutual synchronization of two coupled oscillators can be extended to large networks through the simple differential equation

$$\frac{d\theta_k}{dt} = \omega_k + \frac{\bar{g}_{GABA_A}}{N} \sum_{j=1}^{N} H(\theta_j - \theta_k), \tag{2.16}$$
where $N$ is the number of neurons and $H$ is the two-neuron interaction function defined in equation 2.15. It is generally impossible to obtain the function $H$ in closed form because the response function $Z$ can be determined analytically only in special cases. Since $H$ is periodic with period $T$, however, it can be approximated by a Fourier expansion (five components are used here). If the individual neurons are identical and their interactions homogeneous, their natural frequencies $\omega$ can be set to 1 without loss of generality. Note that large networks can show transitions between the synchronous and asynchronous states when the synaptic strength is varied as a control parameter (Kuramoto, 1984); however, we treat the weak-coupling limit in order to compare with the two-neuron synchronization analysis. To characterize the collective behavior of the $N$ neurons, it is convenient to introduce the order parameter (Kuramoto, 1984; Hansel, Mato, & Meunier, 1993)

$$Z_n(t) = \frac{1}{N} \sum_{j=1}^{N} \exp\!\big(i n \theta_j(t)\big) \qquad (n = 1, 2), \tag{2.17}$$
which measures the degree of phase coherence. The instantaneous degree of coherent behavior in the network can be further described by the squared modulus of $Z_n(t)$:

$$R_n^2(t) = \frac{1}{N^2}\left[\left(\sum_{j=1}^{N} \cos(n\theta_j(t))\right)^{\!2} + \left(\sum_{j=1}^{N} \sin(n\theta_j(t))\right)^{\!2}\right] \tag{2.18}$$
$$= \frac{1}{N^2} \sum_{j,k=1}^{N} \cos\!\big(n\theta_j(t) - n\theta_k(t)\big), \tag{2.19}$$

which is called the synchronization index. From equation 2.19, it is obvious that the network is in the synchronous state when $R_n$ converges to 1 in time, while the completely asynchronous state results when it goes to 0. A network with two equal-size clusters, in particular, gives $R_1 = 0$ but $R_2 = 1$.
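A direct Euler simulation of the phase network in equation 2.16, with H represented by a truncated Fourier series and coherence tracked through the order parameters of equations 2.17 to 2.19, can be sketched as follows. The coupling strength, step size, and coefficient format are illustrative assumptions of ours.

```python
import numpy as np

def simulate_network(H_fourier, N=100, g=0.05, T_end=100.0, dt=0.01, seed=0):
    """Euler simulation of the globally coupled phase model (equation 2.16),
    with H given by truncated Fourier series coefficients (a_m, b_m).
    Returns the synchronization indices R1(t), R2(t) of equation 2.19."""
    rng = np.random.default_rng(seed)
    theta = rng.uniform(0, 2 * np.pi, N)          # random initial phases
    a, b = H_fourier                              # cosine and sine coefficients
    R1, R2 = [], []
    for _ in range(int(T_end / dt)):
        dphi = theta[None, :] - theta[:, None]    # entry [k, j] = theta_j - theta_k
        H = sum(a[m] * np.cos((m + 1) * dphi) +
                b[m] * np.sin((m + 1) * dphi) for m in range(len(a)))
        theta = theta + dt * (1.0 + (g / N) * H.sum(axis=1))
        for n, R in ((1, R1), (2, R2)):           # |Z_n| = R_n (eq. 2.17)
            R.append(np.abs(np.mean(np.exp(1j * n * theta))))
    return np.array(R1), np.array(R2)
```

Tracking R1 and R2 jointly distinguishes full synchrony (both near 1), full asynchrony (both near 0), and two-cluster states (R1 near 0 with R2 near 1), as discussed above.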
All numerical simulations and bifurcation analyses are done with the XPPAUT software (Ermentrout, 2002).

3 Results

3.1 Changes in the Synaptic Reversal Potential. We first investigate how changes in the GABA reversal potential affect the synchronization of two type I neural oscillators. Here, the neurons are induced to fire repetitively by application of a constant current input. In this first case, the neurons are nonadapting. In Figure 3, we plot the infinitesimal PRCs and the asymptotic values of the phase difference between two neurons coupled with a GABAergic synapse. As the reversal potential increases, meaning that the synapse becomes partially depolarizing, a pitchfork bifurcation appears at a critical value of the reversal potential. At this point, the antisynchronous state becomes unstable, and two asynchronous states (i.e., states with a nonzero phase lag) emerge. These asynchronous states move closer to the synchronous state as the GABAergic reversal potential increases. This trend is enhanced when the two (here nonadapting) neurons have a low firing rate due to a lower amplitude of the input current (see Figures 3A and 3B), yet even at $E_{GABA} = 0$, the phase difference is still large (see Figure 3B).

When we control the firing rate of the neuron with spike-frequency adaptation, the asynchronous states tend to approach the synchronous state, and the phase difference between the two neurons becomes nearly 0 when the reversal potential moves above the cell's resting potential. Figure 3C shows that the slow process associated with the adaptation current plays a crucial role in synchronization. This might result from the partial refractoriness due to the adaptation current, which makes the neurons insensitive to inputs arriving just after a spike. At the same firing rate as the nonadapting case (10 Hz), the adaptation current causes the neuron to have a more flattened PRC at the beginning of the firing cycle and leads to near synchrony (see Figures 3B and 3C). We note that even with the adaptation current, the neuron model is a type I membrane, because the PRC is nonnegative, so a brief depolarizing injected current does not delay the phase.

The bifurcation diagram of the phase difference also shows that in the shunting region, where the reversal potential is nearly equal to the resting potential of the membrane ($E_{rest}$), the only stable solutions are antiphase (asynchronous), regardless of whether the neuron is adapting. Hyperpolarizing GABAergic synapses can give stable synchronous states in addition to the asynchronous state in the absence of adaptation, leading to bistability of phase-locked solutions. Neurons with adaptation do not have a stable in-phase locked solution, despite hyperpolarizing GABAergic reversal potentials. This appears to differ from previous results (van Vreeswijk et al., 1994; Hansel & Mato, 2003). However, when the synapse is sufficiently slow, the adapting system regains a stable synchronous state (e.g., see Figure 4C). Therefore, the bistability of the phase dynamics can
[Figure 3 appears here: left, infinitesimal PRCs; right, phase difference (φ/T) vs. synaptic reversal potential (mV), with E_rest marked, for (A) 40 Hz without adaptation, (B) 10 Hz without adaptation, (C) 10 Hz with adaptation.]
Figure 3: Both the GABAergic reversal potential and the AHP current affect synchrony. The infinitesimal PRC is shown in the left panels, and the bifurcation diagrams of the phase difference between two neurons as a function of the reversal potential are plotted in the right panels. In the phase-difference diagrams, solid lines denote stable solutions and dashed lines unstable solutions. The resting potential is marked on each diagram. Here we control the firing rate of the model by changing the injected current. (A) 40 Hz without adaptation. (B) 10 Hz without adaptation. (C) 10 Hz with adaptation.
[Figure 4 appears here: phase difference (φ/T) vs. β for E_GABA = 0, −65, and −80 mV.]
Figure 4: Bifurcation diagrams of the phase difference as the control parameter $\beta$ changes. (A) Depolarizing region ($E_{GABA}$ = 0 mV). (B) Shunting region ($E_{GABA}$ = −65 mV). (C) Hyperpolarizing region ($E_{GABA}$ = −80 mV) of GABAergic synapses.
be considered a general property of two neurons coupled with hyperpolarizing GABAergic synapses. If this analysis is extended to a large neuronal network, the system could break into multicluster states, as previously suggested (van Vreeswijk, 1996).

3.2 Effects of Synaptic Speed on Synchronization. In the previous section, we saw that the shape of the synaptic function plays an important role in controlling the synchronous states for GABAergic synapses. Figure 4 shows how the synchronous behavior of two neurons is affected by the backward rate constant $\beta$ of the GABAergic synapse. As $\beta$ increases, the synapse speeds up and its decay time becomes shorter (see Figure 2A). For depolarizing coupling, the phase difference between the two neurons tends toward zero as $\beta$ increases; however, no value of $\beta$ leads to exact synchronization. In the shunting region, the stability of the antiphase-locked solutions is independent of the synaptic decay time. Hyperpolarizing GABAergic synapses usually lead to antiphase solutions, but when $\beta$ decreases below a critical point, representing a sufficiently slow synapse, an additional stable synchronous state appears (as seen above). As a result, the backward rate constant $\beta$ changes the stability of the phase difference via a supercritical and a subcritical pitchfork bifurcation for the depolarizing and hyperpolarizing GABAergic synapse, respectively. The shunting GABAergic synapse always leads to antisynchrony.

The weakly coupled oscillator analysis shows that the stability of the phase difference between two neurons is affected by three factors: the adaptation current, the synaptic function, and the reversal potential (see equation 2.15). Furthermore, numerical simulations of five neurons that are mutually coupled with GABAergic synapses confirm these results and imply that the two-neuron analysis can be extended to greater numbers of neurons or a large neuronal network (see Figure 5).
[Figure 5 appears here: membrane potential V (mV) vs. time (msec) for five coupled neurons, with slow synapses (top) and fast synapses (bottom).]
Figure 5: Numerical simulations show that the two-neuron analysis can be successfully applied to five coupled neurons. With depolarizing GABAergic connections ($E_{GABA}$ = −20 mV), the model settles into asynchrony when the synapses are sufficiently slow (top, $\beta$ = 0.1). In contrast, the model gives rise to near synchrony when the synapses are fast (bottom, $\beta$ = 2).
3.3 Large Neuronal Network. Numerical simulations show that the two-neuron PRC analysis successfully predicts the synchronization behavior of five mutually connected neurons. To extend our results to coherent behavior in large-scale networks, we use a globally coupled phase-model network with 100 neurons (see equation 2.16). In the network, the synaptic coupling is based on the interaction function $H$ between two neurons. If we assume that the interactions are homogeneous, the model can be simulated using the approximate interaction function $\tilde{H}(\phi)$ obtained from the Fourier expansion of the interaction function between two coupled neurons. Under conditions in which two neurons have bistable synchronous and asynchronous solutions, we expect the network to break up into cluster states; previous results indeed suggested that a network with inhibitory coupling can break up into cluster states (van Vreeswijk, 1996).

Figure 6 illustrates that for a large network whose interaction function reflects the presence of adaptation (skewed PRC), excitatory coupling leads
[Figure 6 appears here: left, interaction functions H and their Fourier approximations H̃; right, time courses of R1 and R2, for (A) E_GABA = 0 mV with adaptation, (B) E_GABA = −80 mV with adaptation, (C) E_GABA = −80 mV without adaptation.]
Figure 6: Synchrony and onset of cluster states in large networks. The averaged phase interaction functions $\tilde{H}$ (solid lines) approximated from the two-neuron model (dotted lines) are plotted in the left panels. With five components in the Fourier expansion, we obtain results compatible with the two-neuron analysis. The evolution of $R_1$ and $R_2$ of the complex order parameters is shown in the right panels. Note that the initial condition of the network is selected randomly, since the cluster state depends on it. (A) $E_{GABA}$ = 0 mV with adaptation. (B) $E_{GABA}$ = −80 mV with adaptation. (C) $E_{GABA}$ = −80 mV without adaptation.
[Figure 7 appears here: (A) PRC vs. time (msec); (B) bifurcation diagram of V (mV) vs. injected current I (µA/cm²), with a subcritical Hopf bifurcation at HB = 5.4983.]
Figure 7: The M current yields type II spike-generating dynamics in a neuron model. (A) PRC showing positive and negative regions indicative of type II dynamics. (B) Bifurcation diagram for a type II neuron as a function of injected current. A subcritical Hopf bifurcation leads to the onset of oscillations.
to synchronization (see Figure 6A), whereas inhibitory coupling leads to desynchronization (see Figure 6B). These behaviors of large neuronal networks are in good agreement with the two-neuron analysis. In addition, we show that the network forms a two-cluster state (see Figure 6C) when the coupling is hyperpolarizing and reflects nonadapting neurons (a PRC without a strong left skew). The order parameters (see equation 2.19) effectively capture the state of the network. When both $R_1$ and $R_2$ converge together to 1 or to 0, the neurons are perfectly synchronized or desynchronized, respectively (see Figures 6A and 6B). However, when these two quantities converge to different values (see Figure 6C), the network is in a two-cluster state.

3.4 Stability of Phase-Locked States, Type II Neurons. In the previous sections, we examined how the GABAergic reversal potential influences the stability of the synchronous and asynchronous solutions for a pair, a small circuit, and a large network of neural oscillators. For that analysis, we assumed that the neuron shows type I membrane dynamics, meaning that spiking arises through a saddle-node bifurcation and the PRC is purely positive. However, it was previously shown that type II oscillators have different synchronization properties (Ermentrout, 1996). In particular, type II membranes can synchronize with excitatory (depolarizing) synapses (Ermentrout et al., 2001). The key to this analysis is that the phase resetting curve has negative as well as positive regions, which makes the zero-phase-lag solution stable. Ermentrout et al. (2001) and Gutkin, Ermentrout, and Reyes (2005) showed that a type I membrane can be changed into type II through a voltage-dependent negative feedback such as the low-threshold slow potassium current $I_M$, as opposed to the high-threshold calcium-mediated $I_{AHP}$ used above. The difference between these two currents lies in their effective thresholds: the $I_M$ current has a lower effective threshold, while the AHP current has a higher effective threshold.
[Figure 8 appears here: H_odd vs. time (msec) for E_GABA = −65, −70, −90, and −96 mV, with insets magnifying the zero phase-lag region.]
Figure 8: $H_{odd}$ for the type II neuron (see above) at different reversal potentials of the synapse; the reversal potentials are marked on the figure. For $E_{GABA}$ near and above $E_{rest}$, the slope of $H_{odd}$ at the zero phase-lag intersection is positive, and thus synchrony is stable. Synchrony loses stability when $E_{GABA}$ drops below −92.1 mV. Insets magnify the zero phase-lag region. (A) $E_{GABA}$ = −65 mV. (B) $E_{GABA}$ = −70 mV. (C) $E_{GABA}$ = −90 mV. (D) $E_{GABA}$ = −96 mV.
Interestingly, the M current affects the bifurcation structure: the stable solution branch now loses stability at a subcritical Hopf bifurcation point (compare Figures 1B and 7B). Since stable oscillations appear at a nonzero frequency near a subcritical Hopf bifurcation, the M current changes the dynamics from type I to type II. In this case, the PRC is no longer purely positive but changes sign during the firing cycle (see Figure 7A). The low-threshold slow potassium current is $I_M = \bar{g}_M w (V - E_K)$, where $dw/dt = (w_\infty(V) - w)/\tau_w(V)$, with $w_\infty(V) = 1.0/(1.0 + \exp(-(V + 35)/10.0))$ and $\tau_w(V) = 100/(3.3 \exp((V + 35)/20.0) + \exp(-(V + 35)/20.0))$. The parameter $\bar{g}_M$ is set to 2 mS/cm².
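For reference, the M-current specification above translates directly into code. The following Python sketch (our own, with the stated parameter values) returns the current and the gating-variable derivative, to be added to the current-balance equation 2.1.

```python
import numpy as np

def m_current(V, w, g_M=2.0, E_K=-100.0):
    """Low-threshold slow potassium (M) current and the time derivative of
    its gating variable w, as specified in the text."""
    w_inf = 1.0 / (1.0 + np.exp(-(V + 35) / 10.0))
    tau_w = 100.0 / (3.3 * np.exp((V + 35) / 20.0) + np.exp(-(V + 35) / 20.0))
    I_M = g_M * w * (V - E_K)
    dw = (w_inf - w) / tau_w
    return I_M, dw
```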
[Figure 9 appears here: phase difference (φ/T) vs. synaptic reversal potential (mV), with E_rest marked.]

Figure 9: Bifurcation diagram as a function of $E_{GABA}$ for a type II neuron. Synchrony is stable and asynchrony is unstable above the resting potential. In the range near rest (down to −92.1 mV), synchrony and asynchrony coexist; below that, only asynchrony is stable. A dot marks the resting potential.
Figure 8 shows the odd component of the interaction function $H$ for various levels of the GABAergic reversal potential, for a type II Traub model firing at 30 Hz. A positive (negative) slope of $H_{odd}$ at a phase-locked solution denotes a stable (unstable) solution. Note that synchrony is stable for a reversal potential of −90 mV but becomes unstable at −96 mV. In Figure 9, the bifurcation diagram is plotted with the GABAergic reversal potential as the control parameter. In contrast to the type I neuron, the synchronous phase-locked solution remains stable for almost all GABAergic reversal potentials, losing stability only at $E_{GABA}$ = −92.1 mV. Since hyperpolarizing GABAergic synapses also give rise to a stable antisynchronous solution, the system shows bistability. However, the bistable region disappears when the GABAergic reversal potential drops below −92 mV, which differs from the type I pattern. The shunting GABAergic reversal potential also gives rise to bistable synchronous and asynchronous solutions. The depolarizing GABAergic synapse leads to stable synchrony, which fits with the earlier work showing that excitatory coupling synchronizes type II neuronal oscillators. Furthermore, even partially depolarizing GABA can lead to synchrony.
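The graphical stability test used in Figures 8 and 9 is easy to automate: given $H_{odd}$ sampled over one period, find its zero crossings and check the sign of the slope. A minimal sketch of ours, not code from the article:

```python
import numpy as np

def phase_locked_solutions(H_odd, T):
    """Locate the zero crossings of H_odd on the periodic grid and classify
    them: since d(phi)/dt = -2 * H_odd(phi) (equation 2.14), a crossing with
    positive slope is a stable phase lock, one with negative slope unstable."""
    n = len(H_odd)
    dt = T / n
    roots = []
    for i in range(n):
        a, b = H_odd[i], H_odd[(i + 1) % n]   # periodic neighbor
        if a == 0.0 or a * b < 0.0:
            slope = (b - a) / dt
            roots.append((i * dt, "stable" if slope > 0 else "unstable"))
    return roots
```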
[Figure 10 appears here: V (mV) vs. time (msec) for coupled neuron pairs at E_GABA = −60 mV, −67 mV (two initial conditions), and −100 mV.]
Figure 10: Examples of neuron pairs (type II neurons) coupled with reciprocal GABA synapses. (A) E GABA = −60 mV. (B) E GABA = −67 mV with two different initial conditions. (C) E GABA = −100 mV.
We further validate our weak-coupling analysis with direct simulations of a small circuit of type II neurons. Figure 10 shows a coupled pair of neurons with the M current. We see that for synapses with a reversal potential of −60 mV, the neurons readily synchronize, while for synapses with $E_{GABA}$ = −67 mV, the cells either fall out of synchrony or synchronize, depending on the initial conditions. For $E_{GABA}$ = −100 mV, only antisynchrony is stable. The heuristic explanation for these results is that when a synapse reverses above the minimal voltage attained by the neural membrane, it acts as depolarizing for part of the firing cycle. Given that a cell with sufficient AHP spends a significant part of the firing cycle below the resting potential, synapses that reverse at or near $E_{rest}$ are depolarizing for most of the firing cycle. Figure 11 illustrates this for the Traub neuron with $I_M$ adaptation. Thus, heuristically, we fall back into the generic type II situation: depolarizing coupling synchronizes.

4 Conclusion

In this report, we have investigated the role of the GABAergic reversal potential in controlling the stability of various phase-locked oscillations in coupled neuronal oscillators. The GABAergic reversal potential changes significantly during development as well as during increases in firing activity in interneuronal subnetworks. Such GABA-driven activity has been proposed to be important in structuring the developing hippocampal as well as neocortical circuitry, particularly through the possible formation of synchronous oscillations and cell assemblies. Here we focused specifically on this GABAergic reversal and showed how it interacts with a number of biophysical parameters to control the emergence of various firing configurations.
Figure 11: The relationship between the GABAergic reversal potential and the membrane voltage during a firing cycle explains why shunting inhibition synchronizes type II neurons. Here we see one firing cycle with three different $E_{GABA}$ values. Light gray marks the portion of the cycle where the GABAergic synapse is depolarizing; dark gray, where it is hyperpolarizing. $E_{rest}$ is at −67 mV, and the minimal voltage attained by the neuron ($E_{min}$) is −96 mV. As can be seen, a GABAergic synapse with a reversal potential slightly above the resting membrane potential is depolarizing for virtually all of the firing cycle; hence, it acts as "excitation." A shunting synapse ($E_{GABA} = E_{rest}$) depolarizes the membrane for more than half of the firing cycle. A synapse with a reversal potential below $E_{min}$ is hyperpolarizing.
[Figure 11 appears here: V (mV) vs. time (msec) over one firing cycle for E_GABA = −60, −67, and −96 mV.]
We have found that, for type I oscillators without adaptation, hyperpolarizing GABAergic coupling leads either to stable synchrony or to bistability between a synchronous and an asynchronous state. Depolarizing GABAergic synapses, on the other hand, lead to a stable phase-locked state with a partial phase shift (the so-called third mode). This third mode approaches synchrony when the firing rate of the oscillators is low. When adaptation is included, which results in a lower firing rate, hyperpolarizing GABAergic synapses lead to stable asynchrony, while depolarizing GABAergic synapses stabilize the synchronous state. We then showed that, for various GABAergic reversal potentials and a given low firing rate, the speed of the synapse changes the stability of the phase-locked states: hyperpolarizing GABA renders the asynchrony stable for relatively slower synapses, and slow depolarizing GABA yields stability for synchrony. This confirms previous results (van Vreeswijk, 1996). However, we have found that for shunting GABA (reversing at −65 mV, the resting potential of the model), asynchrony is stable for all parameters examined. Direct simulations confirm this picture.

The results of our analysis appear to be in agreement with previous simulation studies of increased GABA reversal potential in interneuronal networks (Wang & Buzsaki, 1996). A recent simulation study of a realistic model of a hippocampal interneuronal network suggests that partially depolarizing (shunting) inhibition in fact helps the robustness of gamma-band coherent oscillations (Vida, Bartos, & Jonas, 2006). This result is intriguing in light of the analysis presented here, and the reasons for the differences may lie in the different coupling regimes considered (weak here versus strong in Vida et al., 2006), in the heterogeneity of the network, or both. Another possibility is the intrinsic firing properties of the neuronal model considered (type I versus type II). In particular, we found that for partially depolarizing synapses that are sufficiently fast, a solution close to synchrony may appear for type I oscillators (results not shown), whereas for type II oscillators, a reversal potential above rest gives a stable synchronous solution (coexisting with the antisynchrony). However, a formal analysis of why the results differ remains a topic of investigation.

We have found that neurons coupled with inhibition (hyperpolarizing GABA) can exhibit coexisting synchronous and antiphase solutions when the synapses are sufficiently slow. Since this situation leads to bistability of the phase model, we predict that a large neuronal network has the potential to develop into a cluster state, depending on the initial condition of the phase dynamics. We then examined the behavior of large-scale networks with synaptic parameters close to those identified in vitro and with various reversal potentials. We have found that the large-scale networks are consistent with the analysis of neuron pairs and the small-circuit simulations. In cases where the $H_{odd}$ analysis predicts synchrony, the large-scale network shows synchrony; in cases where the small-scale analysis predicts bistability (coexistence of
synchronous and asynchronous solutions), the large-scale networks break up into multiple antiphase clusters of synchronized neurons.

We further found that the dynamics of spike generation can control the stability of synchrony with partially depolarizing GABA synapses. When the spike frequency adaptation in the neuron is due to a low-threshold K-current ($I_M$), the dynamics of spike generation change from type I to type II. In these neurons, depolarizing and shunting GABA synapses synchronize. We thus show that synchronization due to shunting inhibition may, in principle, be controlled by neuromodulators that control the spike-generating properties of cortical neurons. As has been previously suggested (Gutkin et al., 2005), acetylcholine (ACh), by blocking the $I_M$ current, can readily convert neurons between type I and type II spike generation. Since a high-ACh condition is concomitant with states of vigilance and cognitive effort, while during sleep and rest ACh is low, we may speculate that shunting inhibition plays a synchronizing role during the latter as opposed to the former states, and perhaps in a lower-frequency band.

While our results begin to explore how the changing properties of GABAergic synapses, in concert with intrinsic neuronal properties, may control patterns of activity during development, the origin of the global firing patterns seen in development (for example, the bursting activity leading to the giant depolarizing potentials in the hippocampus) remains to be explored. Furthermore, how such patterns lead to the formation of neural circuitry remains a topic for future study.

Acknowledgments

We thank Peter Dayan for fruitful discussion and Brent Doiron for his comments on the manuscript. This work was supported in part by the Gatsby Foundation and CNRS (B.S.G.).

References

Abeles, M., Bergman, H., Margalit, E., & Vaadia, E. (1993). Spatiotemporal firing patterns in the frontal cortex of behaving monkeys. J. Neurophysiol., 70(4), 1629–1638.
Ben-Ari, Y. (2001). Developing networks play a similar melody. Trends in Neurosci., 24, 353–360.
Ben-Ari, Y. (2002). Excitatory actions of GABA during development: The nature of the nurture. Nature Rev. Neurosci., 3, 728–739.
Ben-Ari, Y., Cherubini, E., Corradetti, R., & Gaiarsa, J. L. (1989). Giant synaptic potentials in immature rat CA3 hippocampal neurones. J. Physiol., 416(1), 303–325.
Bush, P., & Sejnowski, T. (1996). Inhibition synchronizes sparsely connected cortical neurons within and between columns in realistic network models. J. Comput. Neurosci., 3(2), 91–110.
Buzsaki, G., & Draguhn, A. (2004). Neuronal oscillations in cortical networks. Science, 304, 1926–1929.
Crook, S. M., Ermentrout, B., & Bower, J. M. (1998). Spike frequency adaptation affects the synchronization properties of networks of cortical oscillators. Neural Comp., 10, 837–854.
Destexhe, A., Mainen, Z. F., & Sejnowski, T. J. (1998). Kinetic models of synaptic transmission. In C. Koch & I. Segev (Eds.), Methods in neuronal modeling. Cambridge, MA: MIT Press.
Ermentrout, B. (1996). Type I membranes, phase resetting curves, and synchrony. Neural Comp., 8, 979–1001.
Ermentrout, B. (2002). Simulating, analyzing, and animating dynamical systems. Philadelphia: SIAM.
Ermentrout, B., Pascal, M., & Gutkin, B. (2001). The effects of spike frequency adaptation and negative feedback on the synchronization of neural oscillators. Neural Comp., 13, 1285–1310.
Fisahn, A., Pike, F. G., Buhl, E. H., & Paulsen, O. (1998). Cholinergic induction of network oscillations at 40 Hz in the hippocampus in vitro. Nature, 394, 186–189.
Garaschuk, O., Hanse, E., & Konnerth, A. (1998). Developmental profile and synaptic origin of early network oscillations in the CA1 region of rat neonatal hippocampus. J. Physiol., 507, 219–236.
Gerstner, W., & Kistler, W. (2002). Spiking neuron models. Cambridge: Cambridge University Press.
Gray, C. M., König, P., Engel, A. K., & Singer, W. (1989). Oscillatory responses in cat visual cortex exhibit inter-columnar synchronization which reflects global stimulus properties. Nature, 338, 334–337.
Gutkin, B., Ermentrout, B., & Reyes, A. (2005). Phase response curves determine the responses of neurons to transient inputs. J. Neurophys., 94, 1623–1635.
Hansel, D., & Mato, G. (2003). Asynchronous states and the emergence of synchrony in large networks of interacting excitatory and inhibitory neurons. Neural Comput., 15(1), 1–56.
Hansel, D., Mato, G., & Meunier, C. (1993). Clustering and slow switching in globally coupled phase oscillators. Phys. Rev. E, 48, 3470–3477.
Hansel, D., Mato, G., & Meunier, C. (1995). Synchrony in excitatory neural networks. Neural Comp., 7, 307–337.
Jeong, H. Y., & Gutkin, B. (2005). Study on the role of GABAergic synapses in synchronization. Neurocomputing, 65–66, 859–868.
Kuramoto, Y. (1984). Chemical oscillations, waves, and turbulence. New York: Springer-Verlag.
MacLeod, K., & Laurent, G. (1996). Distinct mechanisms for synchronization and temporal patterning of odor-encoding cell assemblies. Science, 274(5289), 976–979.
Neltner, L., & Hansel, D. (2001). On synchrony of weakly coupled neurons at low firing rate. Neural Comp., 13, 765–774.
Owens, D. F., Boyce, L. H., Davis, M. B. E., & Kriegstein, A. R. (1996). Excitatory GABA responses in embryonic and neonatal cortical slices demonstrated by gramicidin perforated-patch recordings and calcium imaging. J. Neurosci., 16(20), 6414–6423.
Pikovsky, A., Rosenblum, M., & Kurths, J. (2001). Synchronization: A universal concept in nonlinear sciences. Cambridge: Cambridge University Press.
Rinzel, J., & Ermentrout, B. (1998). Analysis of neural excitability and oscillations. In C. Koch & I. Segev (Eds.), Methods in neuronal modeling. Cambridge, MA: MIT Press.
Tanabe, M., Gahwiler, B. H., & Gerber, U. (1998). L-type Ca2+ channels mediate the slow Ca2+-dependent afterhyperpolarization current in rat CA3 pyramidal cells in vitro. J. Neurophysiol., 80(5), 2268–2273.
Tateno, T., Harsch, A., & Robinson, H. P. C. (2004). Threshold firing frequency-current relationships of neurons in rat somatosensory cortex: Type 1 and Type 2 dynamics. J. Neurophysiol., 92, 2283–2294.
Traub, R. D., & Miles, R. (1991). Neuronal networks of the hippocampus. Cambridge: Cambridge University Press.
van Vreeswijk, C. (1996). Partial synchronization in populations of pulse-coupled oscillators. Phys. Rev. E, 54, 5522–5537.
van Vreeswijk, C., Abbott, L. F., & Ermentrout, B. (1994). When inhibition not excitation synchronizes neural firing. J. Comp. Neurosci., 1, 313–321.
Vida, I., Bartos, M., & Jonas, P. (2006). Shunting inhibition improves robustness of gamma oscillations in hippocampal interneuron networks by homogenizing firing rates. Neuron, 49, 107–117.
Wang, X.-J. (1998). Calcium coding and adaptive temporal computation in cortical pyramidal neurons. J. Neurophysiol., 79(3), 1549–1566.
Wang, X.-J. (2003). Neural oscillations. In L. Nadel (Ed.), Encyclopedia of cognitive science (pp. 272–280). New York: Macmillan.
Wang, X.-J., & Buzsaki, G. (1996). Gamma oscillations by synaptic inhibition in a hippocampal interneuronal network model. J. Neurosci., 16, 6402–6413.
Whittington, M. A., Traub, R. D., & Jefferys, J. G. R. (1995). Synchronized oscillations in interneuron networks driven by metabotropic glutamate receptor activation. Nature, 373, 612–615.
Winfree, A. T. (2001). The geometry of biological time (2nd ed.). New York: Springer-Verlag.
Received January 3, 2006; accepted July 24, 2006.
LETTER
Communicated by Andrew Barto
Reinforcement Learning State Estimator

Jun Morimoto
[email protected]
JST, ICORP, Computational Brain Project, 4-1-8 Honcho, Kawaguchi, Saitama 332-0012, Japan
Kenji Doya
[email protected]
Initial Research Project, Okinawa Institute of Science and Technology, Uruma, Okinawa 904-2234, Japan
In this study, we propose a novel use of reinforcement learning for estimating hidden variables and parameters of nonlinear dynamical systems. A critical issue in hidden-state estimation is that we cannot directly observe estimation errors. However, by defining errors of observable variables as a delayed penalty, we can apply a reinforcement learning framework to state estimation problems. Specifically, we derive a method to construct a nonlinear state estimator by finding an appropriate feedback input gain using the policy gradient method. We tested the proposed method on single-pendulum dynamics and show that the joint angle variable could be successfully estimated by observing only the angular velocity, and vice versa. In addition, we show that we could acquire a state estimator for the pendulum swing-up task in which a swing-up controller is also acquired by reinforcement learning simultaneously. Furthermore, we demonstrate that it is possible to estimate the dynamics of the pendulum itself while the hidden variables are estimated in the pendulum swing-up task. Application of the proposed method to a two-linked biped model is also presented.

Neural Computation 19, 730–756 (2007)
© 2007 Massachusetts Institute of Technology

1 Introduction

Reinforcement learning is widely used to find policies to accomplish tasks through trial and error (Sutton & Barto, 1998). However, standard reinforcement learning algorithms like Q-learning can perform badly when the state variables of the environment are not fully observable. In this letter, we propose a dual reinforcement learning paradigm in which an agent learns to predict the hidden state of the environment and also to take actions in the environment. There are three major approaches to deal with partially observable Markov decision processes (POMDPs): memoryless stochastic policy
learning, memory-based state augmentation, and model-based state prediction. Memoryless policy gradient methods that achieve better average performance over hidden variables are used in Jaakkola, Singh, and Jordan (1995), Baird and Moore (1999), Meuleau, Kim, and Kaelbling (2001), and Kimura and Kobayashi (1998). However, even when stochastic policies are used, the learning performance of a policy for POMDPs can be significantly worse than that of a policy acquired for Markov decision processes (MDPs). External memory, which stores the history of state transitions and defines an augmented state space to recover the MDP assumption, is used in Meuleau, Peshkin, Kim, and Kaelbling (1999) and McCallum (1995). However, even with external memory, we cannot always estimate hidden variables from observed variables. For example, we need an infinite impulse response filter such as an integrator to estimate position from velocity. Furthermore, a simple integrator does not work well if the sensor data include observation noise or if we do not know the initial position.

If we have a dynamic model of an environment, it is possible to use a state estimator for POMDPs. If the dynamics is linear and deterministic, we can design an observer (Luenberger, 1971), which is used in control theory to estimate the hidden states. In the case that the dynamics is linear and stochastic, we can use a Kalman filter, which is widely known as the optimal state estimator for linear dynamics with gaussian noise (Kalman & Bucy, 1961). When the state transition probability distribution is nongaussian, the particle filter (Doucet, Godsill, & Andrieu, 2000) is used in many studies (Thrun, 2000; Arulampalam, Maskell, Gordon, & Clapp, 2002). Wan and van der Merwe (2000) proposed unscented filters for the nongaussian case.

In this study, we propose a reinforcement learning state estimator (RLSE), a novel learning method for estimating hidden variables of nonlinear dynamical systems to cope with POMDPs. Evaluation of state estimation policies is difficult because we cannot observe estimation errors for hidden states. However, by defining errors of observable variables as a penalty in the reinforcement learning framework, state estimation problems can be considered as delayed reward problems, and we can apply reinforcement learning to learn policies for hidden state estimation. We show that the state estimator can be acquired by using the policy gradient method (Kimura & Kobayashi, 1998) and demonstrate that we can acquire the state estimator for a pendulum swing-up task in which a swing-up controller is also acquired simultaneously by reinforcement learning. Although we can use the extended Kalman filter not only for estimating state variables but also for estimating model parameters of the pendulum dynamics (Wan & Nelson, 1997), the extended Kalman filter requires the form of the target dynamics model. In this study, we show that the swing-up controller can be acquired through estimating the hidden variables even in the case that the dynamics model of the pendulum is
not known. We also apply our proposed framework to a biped walking task.

In section 2, we introduce the nonlinear state estimation problem. It is well known that optimal control and optimal estimation for linear dynamics are dual problems; that is, we can solve an optimal estimation problem as an optimal control problem of a dual system. We present our idea that the parameters of the state estimator can be derived by solving an optimal control problem even for nonlinear dynamics (Morimoto & Doya, 2002). In section 3, we introduce the reinforcement learning state estimator (RLSE), showing our approach to acquire the state estimation policy in a reinforcement learning framework. Section 4 contains our simulation results, obtained by applying the proposed method to a single pendulum dynamics and a two-linked biped robot model.

2 Nonlinear State Estimator

We consider a state estimator for nonlinear stochastic dynamics:

x(k + 1) = f(x(k), u(k)) + n(k),  n(k) ∼ N(0, Σ),  (2.1)
y(k) = h(x(k)) + v(k),  v(k) ∼ N(0, R),  (2.2)

where x ∈ X ⊂ R^n is the state, u ∈ U ⊂ R^m is the control input, y ∈ Y ⊂ R^l is the observed output, n(k) is the system noise, where N(0, Σ) represents a normal distribution with zero mean and covariance Σ, and v(k) denotes the observation noise, where N(0, R) represents a normal distribution with zero mean and covariance R. We consider the state estimator that has the form

x̂(k + 1) = f(x̂(k), u(k)) + a(k),  (2.3)

where x̂ ∈ X ⊂ R^n is the estimated state and a ∈ A ⊂ R^n is a state estimator input to correct the current estimation by using observation y. In this study, we define the state estimation problem as the problem of finding a stochastic state estimation policy π(x̂, y, a) = P(a | x̂, y) to reduce the estimation error. Here we can consider this estimation problem as a POMDP because we cannot directly access the state x, and it is necessary to decide actions for this estimation task according to the current observation y(k), where y(k) is given by the conditional probability distribution

P(y(k) | x(k)) = (1/√((2π)^l |R|)) exp{−(y(k) − h(x(k)))^T R^{-1} (y(k) − h(x(k)))}.  (2.4)
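To fix ideas, the following minimal sketch simulates this setup. It is an illustration under our own assumptions (NumPy, with generic placeholder functions f and h supplied by the user), not code from the letter:

    import numpy as np

    rng = np.random.default_rng(0)

    def step_plant(x, u, f, h, Sigma, R):
        """One step of the stochastic plant, equations 2.1 and 2.2."""
        n = rng.multivariate_normal(np.zeros(x.size), Sigma)   # system noise n(k)
        v = rng.multivariate_normal(np.zeros(R.shape[0]), R)   # observation noise v(k)
        return f(x, u) + n, h(x) + v                           # (x(k+1), y(k))

    def step_estimator(x_hat, u, a, f):
        """One step of the state estimator, equation 2.3. The correction
        input a(k) is exactly what the estimation policy must supply,
        since x(k) itself is never observed."""
        return f(x_hat, u) + a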
The difficulty of finding this policy is that we cannot directly observe the error for the hidden states. For linear dynamics,

x(k + 1) = Ax(k) + Bu(k) + n(k),  (2.5)
y(k) = Cx(k) + v(k),  (2.6)

we can design an observer,

x̂(k + 1) = Ax̂(k) + Bu(k) + L(y(k) − Cx̂(k)),  (2.7)

where the observer gain L is selected to stabilize the error dynamics,

e(k + 1) = (A − LC)e(k),  (2.8)
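For intuition, a minimal sketch of this classical construction follows. The plant matrices are our own illustrative choice (a crude Euler discretization of a pendulum linearized around its stable equilibrium, anticipating section 4.1.1), and the gain is obtained by pole placement with SciPy, one standard way to stabilize A − LC:

    import numpy as np
    from scipy.signal import place_poles

    # Illustrative discrete-time plant (equations 2.5 and 2.6).
    A = np.array([[1.0, 0.01], [-0.098, 1.0]])
    B = np.array([[0.0], [0.01]])
    C = np.array([[1.0, 0.0]])

    # Choose L so that the error dynamics e(k+1) = (A - LC)e(k) of
    # equation 2.8 are stable: place the eigenvalues of (A - LC) inside
    # the unit circle via pole placement on the dual pair (A^T, C^T).
    L = place_poles(A.T, C.T, [0.5, 0.6]).gain_matrix.T

    def observer_step(x_hat, u, y):
        """One step of the observer, equation 2.7."""
        return A @ x_hat + B @ u + L @ (y - C @ x_hat)

    assert np.all(np.abs(np.linalg.eigvals(A - L @ C)) < 1.0)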
Equation 2.8 can be derived by subtracting equation 2.7 from equation 2.5, without considering the noise input n(k) in equation 2.5, where e(k) = x(k) − x̂(k). In this case, the estimation policy is represented as a deterministic policy, a(k) = L(y(k) − Cx̂(k)); that is, π(x̂(k), y(k), a(k)) = P(a(k) | x̂(k), y(k)) = δ(a(k) − L(y(k) − Cx̂(k))), where δ() denotes the Dirac delta function.

Our solution to this estimation problem is to define the accumulated errors of observable variables as the objective function and solve the optimal control problem using a reinforcement learning framework. In the next section, we introduce our proposed method to learn the estimation policy.

3 Reinforcement Learning State Estimator

Here we introduce the reinforcement learning state estimator (RLSE), a new method to acquire a state estimation policy based on a reinforcement learning framework. We consider a discrete-time formulation of reinforcement learning with the following system dynamics:

x(k + 1) = f(x(k), u(k)) + n(k),  (3.1)
y(k) = h(x(k)) + v(k),  (3.2)
x̂(k + 1) = f̂(x̂(k), u(k); p(k)) + a(k),  (3.3)

where a ∈ A ⊂ R^n is the output of a state estimation policy (see section 3.2), which is input to the state estimator; p denotes the parameter vector of the approximated plant model f̂ and is the output of a model estimation policy (see section 3.3). The reward is defined as

r(x̂, y, a) = e(y − h(x̂)) − c(a),  (3.4)
where e(y − h(x̂)) denotes the reward function for correct estimation, and c(a) denotes the cost function for changing the estimator dynamics (Morimoto & Doya, 2002). The function e has its maximum value when the size of the error |y − h(x̂)| equals zero and, for a fixed direction of the error vector, decreases monotonically as the size of the error grows. The cost function c(a) corresponds to the amplitude of the process noise: the state estimator tends to be more strongly affected by the observed output y(k) when the cost for changing the estimator dynamics is smaller. For example, we can use a negative log-likelihood function as the reward and the cost functions (Crassidis & Junkins, 2004).

The basic goal is to find a policy πw(x̂, y, a) = P(a | x̂, y; w) that maximizes the expectation of the discounted accumulated reward,

E{V(k) | πw} = E{ Σ_{i=k}^{∞} γ^{i−k} r(x̂(i + 1), y(i + 1), a(i)) | πw },  (3.5)

where V(k) is the actual return, w is the parameter vector of the policy πw, and γ is the discount factor. We therefore essentially solve a smoothing problem rather than a filtering problem, because future estimation error enters equation 3.5 without explicit backward calculation.

Figure 1 shows a schematic diagram of the proposed learning system. The RLSE consists of a state estimator and a reinforcement learning (RL) module that learns an estimation policy πw. The state estimator takes the control input u and the output a of the RL module, that is, the output of the estimation policy, as input variables. The state estimator then outputs the estimated state x̂ as the RLSE output. Parameters p of the plant model, used both in the state estimator and in a controller, can also be estimated by the RLSE.

The problem of learning a state estimation policy is partially observable because we have hidden variables x to be estimated. We use a policy gradient method proposed by Kimura and Kobayashi (1998) in our RLSE framework. Although the policy gradient method can be applied directly to POMDPs without learning a state estimator for some target problems (Jaakkola et al., 1995; Baird & Moore, 1999; Kimura & Kobayashi, 1998), we will show with the pendulum swing-up problem in the following sections that many tasks require the state estimator. The RLSE can be used even when the order of a plant is known but its form is not, and when there are hidden state variables that we cannot directly observe.

3.1 Gradient Calculation and Policy Parameter Update. In the policy gradient methods, we calculate the gradient direction of the expectation of the actual return with respect to the parameters w of a policy.
Figure 1: Reinforcement learning state estimator (RLSE). The RLSE consists of a state estimator and an RL module to learn an estimation policy. The state estimator takes control input u and the output of RL module a, that is, the output of the estimation policy, as input variables. Then the state estimator outputs estimated state xˆ as the RLSE output. Parameters of the plant model p used in the state estimator and used in a controller can also be estimated by the RLSE.
Kimura and Kobayashi (1998) suggested that we can estimate the expectation of the gradient direction as

∂/∂w E{V(0) | πw} ≈ E{ Σ_{k=0}^{∞} (V(k) − V̂(x(k))) ∂ ln πw/∂w | πw },  (3.6)

where V̂(x(k)) is an approximation of the value function V^{πw}(x) = E{V(k) | x(k) = x, πw}.

3.1.1 Value Function Approximation. The value function is approximated using a normalized gaussian network (Doya, 2000; see appendix C),

V̂(x̂) = Σ_{i=1}^{N} v_i b_i(x̂),  (3.7)

where v_i is the ith parameter, and N is the number of basis functions (see appendix C). An approximation error of the value function is represented
by the TD error (Sutton & Barto, 1998):

δ(k) = r(k) + γV̂(x̂(k + 1)) − V̂(x̂(k)).  (3.8)

We update the parameters of the value function approximator using the TD(0) method (Sutton & Barto, 1998),

v_i(k + 1) = v_i(k) + αδ(k)b_i(x̂(k)),  (3.9)

where α is the learning rate.

3.1.2 Policy Parameter Update. We update the parameters w of a policy by using the estimated gradient direction in equation 3.6. Kimura and Kobayashi (1998) showed that we can estimate the gradient direction by using the TD error,

E{ Σ_{k=0}^{∞} (V(k) − V̂(x(k))) ∂ ln πw/∂w | πw } = E{ Σ_{k=0}^{∞} δ(k)z(k) | πw },  (3.10)

where z is the eligibility trace of the parameter w. Then we can update the parameter w as

w(k + 1) = w(k) + βδ(k)z(k),  (3.11)

where the eligibility trace is updated as

z(k) = ηz(k − 1) + ∂ ln πw/∂w |_{w=w(k)},  (3.12)
where η is the decay factor for the eligibility trace. Equation 3.10 can be derived if the condition η = γ is satisfied.

3.2 State Estimation Policy. We construct the state estimation policy based on the normal distribution,

πw(x̂, y, a_j) = (1/(√(2π) σ_j(x̂))) exp( −(a_j − µ_j(x̂, y))² / (2σ_j²(x̂)) ),  (3.13)

where a_j is the jth output of the policy πw. The mean output µ of the policy πw is given by a feedback gain matrix L and the estimation error y − h(x̂) for the observable variables,

µ_j(x̂, y) = L_j(x̂)(y − h(x̂)),  (3.14)
and the jth row vector L_j is modeled by the normalized gaussian network,

L_j(x̂) = Σ_{i=1}^{N} w_{ij}^µ b_i(x̂).  (3.15)

Here, w_{ij}^µ ∈ R^l denotes the ith parameter vector for the jth output of the policy πw, and N is the number of basis functions. We represent the variance σ_j of the policy πw using a sigmoid function (Kimura & Kobayashi, 1998),

σ_j(x̂) = σ_0 / (1 + exp(−σ_j^w(x̂))),  (3.16)

where σ_0 denotes the scaling parameter, and the state-dependent parameter σ_j^w(x̂) is also modeled by the normalized gaussian network,

σ_j^w(x̂) = Σ_{i=1}^{N} w_{ij}^σ b_i(x̂),  (3.17)

where w_{ij}^σ denotes the ith parameter for the variance of the jth output. Considering equation 3.13, the eligibilities of the policy parameters are given as

∂ ln πw/∂w_{ij}^µ = ((a_j − µ_j)/σ_j²) (y − h(x̂)) b_i(x̂),  (3.18)
∂ ln πw/∂w_{ij}^σ = (((a_j − µ_j)² − σ_j²)(1 − σ_j/σ_0)/σ_j²) b_i(x̂).  (3.19)
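To make the policy concrete, the following minimal sketch samples one output a_j from equation 3.13 and evaluates the eligibilities of equations 3.18 and 3.19; the function name and array shapes are our own assumptions:

    import numpy as np

    def sample_and_eligibility(b, w_mu, w_sigma, err, sigma0, rng):
        """b: basis activations b_i(x_hat), shape (N,); err: y - h(x_hat), shape (l,);
        w_mu: shape (N, l), parameters of the gain row L_j (equation 3.15);
        w_sigma: shape (N,), variance parameters (equations 3.16 and 3.17)."""
        L_j = b @ w_mu                                    # gain row L_j(x_hat)
        mu = float(L_j @ err)                             # mean, equation 3.14
        sigma = sigma0 / (1.0 + np.exp(-(b @ w_sigma)))   # width, equation 3.16
        a = rng.normal(mu, sigma)                         # draw from equation 3.13
        elig_mu = (a - mu) / sigma**2 * np.outer(b, err)  # equation 3.18
        elig_sigma = ((a - mu)**2 - sigma**2) * (1.0 - sigma / sigma0) / sigma**2 * b  # equation 3.19
        return a, elig_mu, elig_sigma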
We update the parameters by applying the update rules in equations 3.11 and 3.12.

3.3 Model Estimation Policy. Since we cannot always have exact information on the target dynamics, it is also necessary to estimate the parameters of the dynamics model. In this study, we use a reinforcement learning framework based on the policy gradient method (Kimura & Kobayashi, 1998) to estimate the parameters of the dynamics model. The model estimation policy is constructed based on the normal distribution,

πw_p(x̂, p_j) = (1/(√(2π) σ_j^p(x̂))) exp( −(p_j − µ_j^p(x̂))² / (2(σ_j^p(x̂))²) ),  (3.20)
where p_j is the jth output of the policy πw_p. The mean output µ^p of the policy πw_p is given by

µ_j^p(x̂) = Σ_{i=1}^{N} w_{ij}^{µp} b_i(x̂).  (3.21)

Here, w_{ij}^{µp} denotes the ith parameter for the jth output of the policy πw_p, and N denotes the number of basis functions. We represent the variance σ_j^p of the policy πw_p using a sigmoid function (Kimura & Kobayashi, 1998),

σ_j^p(x̂) = σ_0^p / (1 + exp(−σ_j^{wp}(x̂))),  (3.22)

where σ_0^p denotes the scaling parameter, and the state-dependent parameter σ_j^{wp}(x̂) is also modeled by the normalized gaussian network,

σ_j^{wp}(x̂) = Σ_{i=1}^{N} w_{ij}^{σp} b_i(x̂),  (3.23)

where w_{ij}^{σp} denotes the ith parameter for the variance of the jth output. Considering equation 3.20, the eligibilities of the policy parameters are given as

∂ ln πw_p/∂w_{ij}^{µp} = ((p_j − µ_j^p)/(σ_j^p)²) b_i(x̂),  (3.24)
∂ ln πw_p/∂w_{ij}^{σp} = (((p_j − µ_j^p)² − (σ_j^p)²)(1 − σ_j^p/σ_0^p)/(σ_j^p)²) b_i(x̂).  (3.25)
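Both policies share the same learning machinery. One combined learning step, wiring together the TD error of equation 3.8, the TD(0) critic update of equation 3.9, and the trace-based policy update of equations 3.11 and 3.12, can be sketched as follows (an illustration with our own names; inputs are NumPy arrays):

    def learning_step(v, w, z, b_now, b_next, r, grad_log_pi,
                      alpha=0.5, beta=0.5, gamma=0.99, eta=0.95):
        """v: critic parameters v_i; w: policy parameters; z: eligibility traces;
        b_now, b_next: basis activations at x_hat(k) and x_hat(k+1);
        grad_log_pi: eligibility from equations 3.18-3.19 or 3.24-3.25.
        The default constants match values used later in section 4.1."""
        delta = r + gamma * (v @ b_next) - (v @ b_now)   # TD error, equation 3.8
        v = v + alpha * delta * b_now                    # TD(0) update, equation 3.9
        z = eta * z + grad_log_pi                        # trace update, equation 3.12
        w = w + beta * delta * z                         # policy update, equation 3.11
        return v, w, z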
We update the parameters by applying the update rules in equations 3.11 and 3.12.

4 Simulation

In this section, we demonstrate the performance of the RLSE. First, in section 4.1, we apply the RLSE to a single-pendulum model to display its basic properties and performance. In section 4.2, we apply it to a two-linked biped model, which has more complicated dynamics than a single-pendulum model, to show that the RLSE can be applied to a more difficult problem.

In section 4.1, we first show that the RLSE can stabilize the error dynamics (see equation 2.8) of the linearized pendulum model (see section 4.1.1).
Second, we apply it to the pendulum model to estimate the state variables of free-swing movement (see section 4.1.2). Third, we show that it is possible to acquire a controller to swing the pendulum up by using state variables estimated by the RLSE; here, the RLSE is also acquired simultaneously (see sections 4.1.3, 4.1.4, and 4.1.5). In section 4.1.3, we compare swing-up performance among a memoryless policy gradient method, a memory-based policy gradient method, and a value-gradient-based method (Doya, 2000) using the RLSE. We demonstrate that only the value-gradient-based method using the RLSE can acquire a swing-up controller when only the angular velocity can be observed. In section 4.1.4, we show that the state estimator can be acquired even in the case that we do not know a precise parameter of the pendulum; in other words, we show that we can acquire control, state estimation, and model estimation policies simultaneously. Section 4.1.5 shows that we can acquire an estimation policy for the pendulum swing-up task even when the dynamics of the pendulum is unknown. Finally, in section 4.2, we show that the RLSE can be used to estimate state variables of the biped model. The task in this case is for the biped robot to walk down a slope without falling over. We could simultaneously acquire both control and estimation policies for the biped walking task.

4.1 Application to a Single Pendulum. The proposed method was applied to the pendulum dynamics

ml²ω̇ = −µω − mgl sin θ + T,  (4.1)

where θ is the angle from the bottom position, ω is the angular velocity, T is the input torque, µ = 0.01 is the coefficient of friction, m = 1.0 kg is the weight of the pendulum, l = 1.0 m is the length of the pendulum, and g = 9.8 m/s² is the acceleration due to gravity (see Figure 2). We define the state vector as x = (θ, ω)^T and the control output as u = T. Here we consider a one-dimensional output y. Thus, the dynamics is given by the system

x(k + 1) = f(x(k), u(k)) + n(k),  (4.2)
y(k) = Cx(k) + v(k),  (4.3)

where n(k) and v(k) are noise inputs with parameters Σ = diag{0.01, 0.01}Δt and R = 1.0Δt, and the discrete-time dynamics of the pendulum is

f(x(k), u(k)) = x(k) + ∫_{kΔt}^{(k+1)Δt} F(x(s), u_c(s)) ds,  (4.4)
Figure 2: Single pendulum swing-up task. θ: joint angle, T: input torque, l: length of the pendulum, m: mass of the pendulum.
F(x(t), u_c(t)) = [ ω ; −(g/l) sin θ − (µ/(ml²)) ω ] + [ 0 ; 1/(ml²) ] u_c,  (4.5)
where Δt is the time step of the discrete-time system, and u_c is the control input in continuous time. We define the state estimator dynamics as

x̂(k + 1) = f(x̂(k), u(k)) + a(k),  (4.6)

where x̂(k) = (θ̂(k), ω̂(k))^T is the estimated state and a(k) is the output of the estimation policy. We define the reward function for the state estimation task as

r = [ exp( −(y − Cx̂)²/σ_r² ) − Σ_{j=1}^{2} c(a_j) ] Δt,  (4.7)
where σ_r² = 0.25 and c(a_j) is the output cost term. We saturate the output of the policy using the sigmoid function in equation A.3 (Doya, 2000; see appendix A), with the maximum output a^max = (a_1^max, a_2^max)^T = (5.0, 5.0)^T. Then we use the corresponding cost function,

c(a_j) = d ∫_0^{a_j} Φ^{−1}(u/a_j^max) du,  (4.8)
where d = 0.1 and Φ is a monotonically increasing function (see appendix A). The learning parameters were set as follows. The learning rate for the value function was α = 0.5, the learning rate for the policy was β = 0.5, the discount factor was γ = 0.99, and the decay factor for the eligibility trace was η = 0.95. A 5 × 5 basis normalized gaussian network over the two-dimensional space x̂ was used for the state estimation policy in equations 3.15 and 3.17 and for the value function in equation 3.7. We located basis functions on a grid with an even interval in each dimension of the input space (−π ≤ θ ≤ π, −3π ≤ ω ≤ 3π). In the current simulations, the centers are fixed in a grid, which is analogous to the "boxes" approach (Barto, Sutton, & Anderson, 1983) often used in discrete RL. Grid allocation of the basis functions enables efficient calculation of their activation as the outer product of the activation vectors for individual input variables. We used the fourth-order Runge-Kutta method with a time step of Δt = 0.01 sec to derive the discrete pendulum dynamics in equations 4.4 and 4.6 and put a zero-order hold on the input, u_c(s) = u(k) for kΔt ≤ s < (k + 1)Δt.

4.1.1 Estimation of Free-Swing Trajectories: Linear Pendulum. We first consider a linear problem in order to check whether the poles of the error dynamics learned by the RLSE are inside the unit circle. Thus, we consider the pendulum dynamics only near the stable equilibrium point x = (0, 0)^T,

F(x(t), u(t)) = [ 0, 1 ; −g/l, −µ/(ml²) ] x(t) + [ 0 ; 1/(ml²) ] u_c(t),  (4.9)
where the discrete pendulum dynamics was derived with equation 4.4. Here, since the RLSE attempts to learn a policy to estimate free-swing trajectories, we always set the control input to zero, u(k) = 0. We consider two types of observation noise with different amplitudes: R = 1.0Δt and R = 10.0Δt. Each trial was started from an initial state x(0) = (θ(0), 0.0)^T and an initial estimate x̂(0) = (θ̂(0), 0.0)^T, where θ(0) was selected from a uniform distribution that ranged over −π/4 < θ(0) < π/4. Then θ̂(0) was selected from a uniform distribution around θ(0), which ranged over −π/6 + θ(0) < θ̂(0) < π/6 + θ(0). Each trial was terminated after 20 sec.

We applied an estimation policy acquired after 100 learning trials to estimate the linear pendulum states and used a deterministic policy, represented by the mean of the acquired stochastic policy, to illustrate the estimation performance. The initial state was set as x(0) = (−π/6, 0.0)^T and x̂(0) = (−π/3, π/2)^T. Figures 3A and 3B show the successful estimation of the linearized pendulum state, and Figure 3C indicates that the poles of the error dynamics stayed within the unit circle; that is, the absolute eigenvalues |λ| of the error dynamics are always less than one, |λ| < 1, during state estimation. Furthermore,
Figure 3: Estimated pendulum states and absolute eigenvalues of error dynamics. (A) Estimated angular velocity ω. (B) Estimated joint angle θ. (C) Time course of absolute eigenvalues |λ(A − LC)|.
the RLSE acquired different estimation policies for different observation noise. The error dynamics had slower convergence (the absolute eigenvalues |λ| had larger values) with greater observation noise (R = 10.0Δt), all results consistent with the properties of the Kalman filter.

4.1.2 Estimation of Free-Swing Trajectories: Nonlinear Pendulum. Now we estimate free-swing trajectories of the nonlinear pendulum dynamics, equation 4.4, by using the RLSE. Each trial was started from an initial state x(0) = (θ(0), ω(0))^T, where θ(0) was selected from a uniform distribution that ranged over −π/2 < θ(0) < π/2, and ω(0) was selected from a uniform distribution ranging over −π/2 < ω(0) < π/2. In addition, each trial was started from an initial estimate x̂(0) = (θ̂(0), ω̂(0))^T, where θ̂(0) was selected from a uniform distribution around θ(0) that ranged over −π/4 + θ(0) < θ̂(0) < π/4 + θ(0), and ω̂(0) was selected from a uniform distribution around ω(0) that ranged over −π/4 + ω(0) < ω̂(0) < π/4 + ω(0). Each trial was terminated after 20 sec. The following three conditions are considered:

1. Joint angular velocity ω is observed: C = [0 1] in equation 4.3.
2. Joint angle θ is observed: C = [1 0] in equation 4.3.
3. The combined state of joint angle θ and angular velocity ω is observed: C = [1 1] in equation 4.3.

Figure 4 illustrates the learning performance of the RLSE under the three different conditions. We applied the policy acquired after 100 learning trials to estimate free-swing trajectories under each condition, using the deterministic policy represented by the mean of the acquired stochastic policy to demonstrate the estimation performance. Initial states were set as x(0) = (−π/2, 0.0)^T and x̂(0) = (−π, π)^T. The results reveal that the acquired policies successfully estimated the true state trajectories (see Figure 5).
Figure 4: Learning performance of free-swing trajectory estimation. (Average over 10 simulation runs. We filtered the data with a moving average of 10 trials.) If the initial estimated state xˆ (0) is equal to the actual state x(0) and the RLSE perfectly follows an actual state trajectory, the average reward becomes 1 without considering cost term c(a).
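As a concrete reference for the simulations in this section, the discrete pendulum dynamics of equations 4.4 and 4.5 (fourth-order Runge-Kutta with a zero-order hold on the input) can be sketched as follows. The parameter values follow equation 4.1; the function names are our own:

    import numpy as np

    m, l, mu, g, dt = 1.0, 1.0, 0.01, 9.8, 0.01   # equation 4.1 and Delta-t

    def F(x, uc):
        """Continuous-time pendulum dynamics, equation 4.5; x = (theta, omega)."""
        theta, omega = x
        return np.array([omega,
                         -(g / l) * np.sin(theta)
                         - mu / (m * l**2) * omega + uc / (m * l**2)])

    def f(x, u):
        """Discrete dynamics f(x(k), u(k)) of equation 4.4: one Runge-Kutta step
        with a zero-order hold, u_c(s) = u(k) for k*dt <= s < (k+1)*dt."""
        k1 = F(x, u)
        k2 = F(x + 0.5 * dt * k1, u)
        k3 = F(x + 0.5 * dt * k2, u)
        k4 = F(x + dt * k3, u)
        return x + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)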
4.1.3 Pendulum Swing-Up Task in POMDPs: With Exact Model. Here we try to acquire the swing-up policies while simultaneously estimating the state variables of the pendulum. First, we assume that we have an exact model of the pendulum. We apply value-gradient-based reinforcement learning (Doya, 2000) to acquire pendulum swing-up policies in POMDPs. The task of swinging up a pendulum (Doya, 2000) is a simple but strongly nonlinear control task. Here we assume that the swing-up policies cannot observe the position θ of the pendulum; only the angular velocity ω can be observed. The reward function for the swing-up control policies is given as

q(x, u) = {−cos θ − ν(u)}Δt,  (4.10)

where ν(u) is a cost function for the control variable u (see appendix A). We use the same reward function as in equation 4.7 for the estimation policies. Each trial was started from an initial state x(0) = (θ(0), ω(0))^T, where θ(0) was selected from a uniform distribution that ranged over −π/4 < θ(0) < π/4, and ω(0) was selected from a uniform distribution ranging over −π/4 < ω(0) < π/4. In addition, each trial was started from an initial estimate x̂(0) = (θ̂(0), ω̂(0))^T, where θ̂(0) was selected from a uniform distribution around θ(0) that ranged over −π/4 + θ(0) < θ̂(0) < θ(0) + π/4, and ω̂(0) was selected from a uniform distribution around ω(0), which ranged over −π/4 + ω(0) < ω̂(0) < ω(0) + π/4. Each trial was terminated after 20 sec. As a measure of the swing-up performance, we defined the time for which the pendulum
Figure 5: Estimation of free-swing trajectories: (A–C) Observed state. (D–F) Estimated joint angle θ . (G–I) Estimated angular velocity ω. (A, D, G) Joint angle θ is observed. (B, E, H) Joint angular velocity ω is observed. (C, F, I) Combined state of joint angle and angular velocity θ + ω is observed.
stayed up (π − π/4 < |θ| < π + π/4) as t_up. A trial was regarded as successful when t_up > 10 sec. We used the number of trials made before achieving 10 successful trials as the measure of the learning speed and limited the output torque of the controller to u^max = 3.0 N·m. Appendix A explains the implementation of the value-gradient-based policy.

Figure 6A shows the learning performance on the pendulum swing-up task. After 85 trials (averaged over 10 simulation runs), the swing-up policies and the state estimation policies were acquired. This result reveals that the RLSE can obtain estimation policies that are capable of supporting the acquisition of pendulum swing-up policies.

Several studies (Jaakkola et al., 1995; Baird & Moore, 1999; Kimura & Kobayashi, 1998; Meuleau et al., 1999) suggested that we can find a good memoryless or memory-based policy for POMDPs using the policy gradient method. Here we directly applied the policy gradient method (Kimura & Kobayashi, 1998) to learn the swing-up controller (see appendix B). If full
Figure 6: Learning performance for the pendulum swing-up task (average over 10 simulation runs). (A) Value-gradient polices with RLSE (we filtered the data with a moving average of 10 trials). (B) Direct use of policy gradient (we filtered the data with a moving average of 10 trials).
state information (θ, ω) is available to the controller, this policy gradient controller can acquire the swing-up policy within 171 trials. Now we apply this controller to the pendulum swing-up task under the same POMDP assumption: the controller cannot observe the position θ of the pendulum. The solid line in Figure 6B shows the learning performance over 500 trials. This result indicates that the pendulum swing-up policy cannot be acquired by observing only the angular velocity ω of the pendulum using the policy gradient method.

We also applied the policy gradient method with external memory. Because the pendulum dynamics is represented by a second-order differential equation, we defined the state space of the controller as (ω(t), ω(t − T))^T, where T = 0.5 sec, which is a quarter of the actual period of the linearized pendulum. The dashed line in Figure 6B shows the learning performance over 500 trials. This result also indicates that the pendulum swing-up policy cannot be acquired through observing only the angular velocity ω of the pendulum, even when using the external memory.

Figure 7 shows swing-up trajectories using the RLSE from an initial state x(0) = (θ(0), ω(0))^T = (0.0, 0.0)^T and an initial estimate x̂(0) = (θ̂(0), ω̂(0))^T = (−π/2, π)^T.

4.1.4 Pendulum Swing-Up Task in POMDPs: With Modeling Errors. In the previous section, it was assumed that we have an exact model of the pendulum dynamics. However, especially in a real environment, it is difficult to have a perfect model of the target dynamics. Here, we try to acquire an estimation policy while at the same time also learning a model parameter of the pendulum. In other words, we attempt to acquire the swing-up control policy, the state estimation policy, and the model estimation policy
Figure 7: Swing-up trajectories and estimation performance. The dash-dotted line represents the upright position. The initial state x(0) = (θ(0), ω(0))^T = (0.0, 0.0)^T; the initial estimate x̂(0) = (θ̂(0), ω̂(0))^T = (−π/2, π)^T. (A) Estimated joint angle θ̂. (B) Estimated angular velocity ω̂.
simultaneously. A 1 × 1 (that is, only one) basis normalized gaussian network was used as the model estimation policy in equations 3.21 and 3.23. We set the decay factor for the eligibility trace of the model estimation policy to η = 0.90.

After 227 trials and 128 trials (averaged over 10 simulation runs), the swing-up policies and the state estimation policies were acquired with the initially incorrect models l = 0.5 m and l = 2.0 m, respectively (see Figure 8A). Figure 8B shows the learning performance of the model estimation policy with the two different modeling errors. The results reveal that the RLSE could acquire the correct model parameter during learning of the swing-up task and the state estimation task. The estimated mean model parameter over 10 simulation runs after 500 trials was l = 1.03 m with variance 1.05 × 10⁻², starting from the initially incorrect model l = 0.5 m. With the initially incorrect model l = 2.0 m, the estimated mean model parameter over 10 simulation runs after 500 trials was l = 1.08 m with variance 0.27 × 10⁻².

4.1.5 Pendulum Swing-Up Task in POMDPs: With Unknown Model. Here we show that the RLSE can be used even when the form of the pendulum dynamics is unknown. We assume only that the pendulum model is an input-affine system and that the input matrix of equation 4.5 is known:

F(x̂(t), u(t)) = [ ω ; f_ω(x) ] + [ 0 ; 1 ] u_c.  (4.11)
The model estimation policy introduced in section 3.3 can be directly used to estimate the dynamics model of the pendulum. We used the output
Figure 8: Learning the swing-up controller and the state estimation policy with the model estimation policy. l_init represents the initial model parameter of the pendulum length. (A) Learning performance of the swing-up policy when using the RLSE (average over 10 simulation runs; we filtered the data with a moving average of 10 trials). (B) Estimated length of the pendulum at each trial.
of the model estimation policy as f_ω(x) ≈ 10.0 × p. A 9 × 4 basis normalized gaussian network was used for the model estimation policy in equations 3.21 and 3.23. We used the learning parameters σ_r = 0.2, α = 1.0, β = 1.0, γ = 0.98, and η = 0.90 for the state estimation policy and η = 0.96 for the model estimation policy.

Figure 9A shows an acquired pendulum dynamics. A sinusoidal shape that represents the pendulum dynamics (see equation 4.5) was successfully acquired. Figure 9B illustrates the learning performance of the swing-up policy while estimating the pendulum dynamics and state variables. After 692 trials (averaged over 10 simulation runs), the swing-up policies and the state estimation policies were acquired with an initially unknown model.

4.2 Application to a Biped Model. Finally, the RLSE is applied to a biped model (see Figure 10). The task in this case is for the biped robot to walk down a slope without falling over. A biped dynamics model introduced in Goswami, Thuilot, and Espiau (1996) was employed here. Although it is well known that this two-linked biped model can walk down the slope without any actuation, it is necessary to set proper initial angular velocities at each degree of freedom to generate this passive walking pattern. In this study, we try to acquire a controller that can make the biped robot walk even when the initial angular velocities are not suitable for passive walking.

We define the state vector as x = (θ_sw, θ_st, ω_sw, ω_st)^T, where θ_sw and θ_st are the leg angles and ω_sw and ω_st are the leg angular velocities, and we define the torque applied at the hip joint, u = T, as the action. We assume that the covariances of the system noise n(k) and the observation noise v(k) are Σ = diag(0.01, 0.01, 0.01, 0.01)Δt and R = diag(0.01, 0.01)Δt, respectively.
Figure 9: Learning the state estimation policy together with the dynamics of the pendulum in the pendulum swing-up task. (A) Acquired pendulum model. A sinusoidal shape that represents the pendulum dynamics was successfully acquired. (B) Learning performance of the swing-up policy while estimating pendulum dynamics and state variables (average over 10 simulation runs; we filtered the data with a moving average of 10 trials).
Figure 10: Biped model: m H = 10 kg, m = 5 kg, a = b = 0.5 m, φ = 0.03 rad.
This biped model includes a discontinuity when the swing foot touches the ground. Furthermore, we assume that only the leg angles θ_sw and θ_st can be observed; in other words, we consider the linear observation function h(x(k)) = Cx(k) with observation matrix C = [1 0 0 0; 0 1 0 0] in equation 2.2. We also assume that we know the exact biped model. The learning parameters were set as follows. The learning rate for the value function was α = 0.5, the learning rate for the policy was β = 1.0, the discount factor was γ = 0.98, and the decay factor for the eligibility trace was η = 0.97. A 5 × 5 × 10 × 10 basis normalized gaussian network over the four-dimensional space x̂ was used for the state estimation policy in equations 3.15 and 3.17 and for the value function in equation 3.7. We
Figure 11: (A) Learning performance of the estimation policy for the biped walking task (average over 10 simulation runs; we filtered the data with a moving average of 10 trials). (B) Learning performance for the biped walking task. The vertical axis represents the number of walking steps (average over 10 simulation runs; we filtered the data with a moving average of 10 trials).
located basis functions on a grid with an even interval in each dimension of the input space (−π/4 ≤ θ_sw ≤ π/4, −π/4 ≤ θ_st ≤ π/4, −3π ≤ ω_sw ≤ 3π, −3π ≤ ω_st ≤ 3π). We used the fourth-order Runge-Kutta method with a time step of Δt = 0.01 sec to derive the discrete biped dynamics and used the reward function

r = [ exp( −(y − Cx̂)^T S (y − Cx̂) ) − Σ_{j=1}^{4} c(a_j) ] Δt,  (4.12)

where S = diag(1/σ_r², 1/σ_r²) with σ_r² = 0.001. We saturate the output of the policy using the sigmoid function in equation A.3, with the maximum output a^max = (a_1^max, a_2^max, a_3^max, a_4^max) = (150.0, 150.0, 150.0, 150.0). Again, we used the value-gradient-based policy (see appendix A). A penalty of −0.5 is given to the biped controller when the robot falls over.

Each trial was started from an initial state x(0) = (0.0, 0.0, ω_sw(0), ω_st(0))^T, where ω_sw(0) and ω_st(0) were selected from uniform distributions that ranged over 2.0 < ω_sw(0) < 2.2 rad/sec and −0.7 < ω_st(0) < −0.5 rad/sec. In addition, each trial was started from an initial estimate x̂(0) = (0.0, 0.0, ω̂_sw(0), ω̂_st(0))^T, where ω̂_sw(0) and ω̂_st(0) were selected from uniform distributions ranging over −π/6 + ω_sw(0) < ω̂_sw(0) < π/6 + ω_sw(0) and −π/6 + ω_st(0) < ω̂_st(0) < π/6 + ω_st(0). Each trial was terminated after 10 sec. A trial was regarded as successful when the number of steps was more than 10.

Figure 11A shows the learning performance of the estimation policies. The RLSE successfully acquired estimation policies in the biped walking task. Figure 11B illustrates the learning performance for the biped
Figure 12: Biped walking trajectory and estimation performance. Initial state x = (θ_sw, θ_st, ω_sw, ω_st)^T = (0.0, 0.0, 2.2, −0.7)^T; initial estimate x̂ = (θ̂_sw, θ̂_st, ω̂_sw, ω̂_st)^T = (0.0, 0.0, 2.0, −0.5)^T. (A) Swing leg angle θ_sw. (B) Stance leg angle θ_st. (C) Swing leg angular velocity ω_sw. (D) Stance leg angular velocity ω_st.
walking task. The biped controllers were successfully acquired after 273 trials (averaged over 10 simulation runs).

Figure 12 shows the estimation performance for a biped walking trajectory. The RLSE could estimate state variables even for the biped model. Although slight estimation errors remained in the angular velocities, biped walking policies could be successfully acquired by using RL. Figure 13A displays the initial performance of the control policy: the biped fell over right after its first step. Figure 13B gives the acquired performance of the control policy after 500 trials. The RLSE could be successfully used to acquire a biped walking controller.

5 Discussion

A conventional way to solve POMDPs is to use a belief state representation. By doing so, the problem can be considered as an MDP in continuous state space (Littman, Cassandra, & Kaelbling, 1995). However, the conventional methods can be applied only to small problems in discrete state space, whereas the RLSE can be applied to problems in continuous state space.

Backprop-through-time algorithms (Erdogmus, Genc, & Principe, 2002; Xu, Erdogmus, & Principe, 2005) have been proposed to learn a state estimator. Although this approach can be applied to state estimation problems
Figure 13: Simultaneous learning of the biped walking controller and the state estimator. The solid line represents the right leg, and the dotted line represents the left leg. Initial state x = (θ_sw, θ_st, ω_sw, ω_st)^T = (0.0, 0.0, 2.2, −0.7)^T; initial estimate x̂ = (θ̂_sw, θ̂_st, ω̂_sw, ω̂_st)^T = (0.0, 0.0, 2.0, −0.5)^T. (A) Initial performance. (B) Acquired performance.
in continuous state space, backprop-through-time algorithms can be used only for off-line estimation. On the other hand, RL can improve the parameters of a state estimator through online updates, since RL can reduce an accumulated estimation error over time instead of reducing only an instantaneous estimation error.

Application of genetic algorithms to the problem of learning a state estimator was proposed by Porter and Passino (1995). In the genetic algorithm approach, the gain parameter L of the state estimator in equation 3.15 was constant, whereas the RLSE acquires a state-dependent gain L(x̂). A constant gain L can limit the performance of the state estimator if we apply the state estimator to nonlinear dynamics.

Thau (1973) and Raghavan and Hedrick (1994) proposed nonlinear observers designed to guarantee the convergence of the error dynamics even for a nonlinear environment. However, this approach treats any nonlinear term of the target dynamics as a modeling error from a linear dynamics. Consequently, the applications of this method are limited.

The duality between estimation and control is discussed in Crassidis and Junkins (2004). They showed that the RTS smoother can be derived from optimal control theory. They used the negative log-likelihood function for the
reward and the cost functions, that is, e(y − h(x̂)) = −(1/2)(y − h(x̂))^T R^{−1} (y − h(x̂)) and c(a) = (1/2) a^T Σ^{−1} a in equation 3.4. To acquire the RTS smoother, we need an off-line calculation to derive the costate vector. In our method, we can acquire estimation policies without explicit backward calculation by using the reinforcement learning framework.

6 Conclusion

We proposed a novel use of reinforcement learning, the reinforcement learning state estimator (RLSE), to estimate hidden variables and parameters of nonlinear dynamical systems. The RLSE was applied to the pendulum swing-up task, and simultaneous learning of the RLSE and a swing-up controller, which used the estimated state of the RLSE, was accomplished, whereas the direct use of the policy gradient method could not accomplish this task. We also showed that a swing-up policy could be acquired even in the case that the RLSE initially held an incorrect model of the pendulum or did not know the form of the model. Application of the RLSE to the two-linked biped model was also presented.

In this study, we used a gaussian distribution to represent an estimation policy. In the future, it may be possible to use a mixture-of-gaussians representation for the estimation policy to cope with nongaussian state transition probabilities. Because the state estimation policy and the model estimation policy are mutually dependent, future work will involve analysis of our method to show the relationship between the accuracy of state estimation and the model approximation performance.

Appendix A: Value-Gradient-Based Policy for Pendulum Swing-Up Task

The approximated value function is represented as

V̂^c(x) = Σ_{i=1}^{N_c} v_i^c b_i(x),  (A.1)
where v_i^c is the ith parameter of the function approximator and b_i(x) is the normalized gaussian basis function (see appendix C). A 21 × 21 basis normalized gaussian network over the two-dimensional space x̂ was used for the value function approximation in the swing-up task. A 5 × 5 × 17 × 17 basis normalized gaussian network over the four-dimensional space x̂ was used for the value function approximation in the biped walking task.
We use the value-gradient-based policy represented as

u = u^max Φ( (∂V/∂x) (∂f(x, u)/∂u) ),  (A.2)

where u^max = 3.0 N·m for the pendulum swing-up task and u^max = 20 N·m for the biped walking task, and we limit the amplitude of the action using the sigmoid function

Φ(x) = (2/π) arctan( (π/2) x ).  (A.3)

This corresponds to defining the cost function in equation 4.10 as

ν(u) = d^c ∫_0^u Φ^{−1}(u′/u^max) du′,  (A.4)
where we set the parameter for the cost function to d^c = 0.1 for both tasks. The learning rate for the value function was α = 1.0, and the discount factor was γ = 0.99.

Appendix B: Policy Gradient Method for the Pendulum Swing-Up Task

The value function is represented as in equation A.1. Then we construct the model for the swing-up policy based on the normal distribution πw^c(x, u) = P(u | x; w^c),

πw^c(x, u) = (1/(√(2π) σ^c(x))) exp( −(u − µ^c(x))² / (2(σ^c(x))²) ),  (B.1)
where u is the output of the policy πw^c. We limit the amplitude of the action using the sigmoid function in equation A.3. The mean output µ^c of the policy πw^c is modeled by the normalized gaussian network,

µ^c(x) = Σ_{i=1}^{N_c} w_i^{µc} b_i^c(x),  (B.2)

where w_i^{µc} denotes the parameter for the output of the policy πw^c, N_c denotes the number of basis functions, and b^c(x) is the normalized gaussian basis
function (see appendix C). We represent the variance σ^c of the policy πw^c using a sigmoid function,

σ^c(x̂) = σ_0^c / (1 + exp(−σ^{wc}(x̂))),  (B.3)

where σ_0^c denotes the scaling parameter, and the state-dependent parameter σ^{wc}(x̂) is also modeled by the normalized gaussian network,

σ^{wc}(x̂) = Σ_{i=1}^{N_c} w_i^{σc} b_i(x̂),  (B.4)

where w_i^{σc} denotes the ith parameter for the variance of the output. Considering equation B.1, the eligibilities of the policy parameters are given as

∂ ln πw^c/∂w_i^{µc} = ((u − µ^c)/(σ^c)²) b_i(x̂),  (B.5)
∂ ln πw^c/∂w_i^{σc} = (((u − µ^c)² − (σ^c)²)(1 − σ^c/σ_0^c)/(σ^c)²) b_i(x̂).  (B.6)
We update the parameters by using the update rules in equations 3.11 and 3.12. A 21 × 21 basis normalized gaussian network over the two-dimensional space x̂ was used for the policy represented in equations B.2 and B.4. The learning rate for the control policy was β = 1.0, and the decay factor for the eligibility trace was η = 0.90.

Appendix C: Normalized Gaussian Network

Basis functions for the normalized gaussian network are represented as

b_i(x) = ξ_i(x) / Σ_{j=1}^{N} ξ_j(x),  (C.1)

where N is the number of basis functions and

ξ_j(x) = e^{ −‖s_j^T (x − c_j)‖² }.  (C.2)
The vectors c j and s j define the center and the size of the jth basis function, respectively.
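A minimal sketch of these basis functions, with a grid of centers as in section 4.1 (the function name and the widths are our own illustrative choices):

    import numpy as np

    def ngnet_basis(x, centers, scales):
        """Normalized gaussian basis of equations C.1 and C.2;
        centers and scales hold the c_j and s_j as rows."""
        xi = np.exp(-np.sum((scales * (x - centers))**2, axis=1))   # equation C.2
        return xi / np.sum(xi)                                      # equation C.1

    # Example: a 5 x 5 grid over (theta, omega) as in section 4.1.
    theta_c = np.linspace(-np.pi, np.pi, 5)
    omega_c = np.linspace(-3 * np.pi, 3 * np.pi, 5)
    centers = np.array([(t, o) for t in theta_c for o in omega_c])
    scales = np.ones_like(centers)                  # illustrative widths s_j
    b = ngnet_basis(np.array([0.1, -0.2]), centers, scales)   # sums to 1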
Acknowledgments

We thank Hirokazu Tanaka and the reviewers for their helpful information and comments.
References

Arulampalam, M. S., Maskell, S., Gordon, N., & Clapp, T. (2002). A tutorial on particle filters for online nonlinear/non-gaussian Bayesian tracking. IEEE Transactions on Signal Processing, 50(2), 174–188.
Baird, L. C., & Moore, A. W. (1999). Gradient descent for general reinforcement learning. In M. S. Kearns, S. Solla, & D. Cohn (Eds.), Advances in neural information processing systems, 11. Cambridge, MA: MIT Press.
Barto, A. G., Sutton, R. S., & Anderson, C. W. (1983). Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, 13, 834–846.
Crassidis, J. L., & Junkins, J. L. (2004). Optimal estimation of dynamic systems. London: Chapman and Hall.
Doucet, A., Godsill, S. J., & Andrieu, C. (2000). On sequential Monte Carlo sampling methods for Bayesian filtering. Statistics and Computing, 10, 197–208.
Doya, K. (2000). Reinforcement learning in continuous time and space. Neural Computation, 12(1), 219–245.
Erdogmus, D., Genc, A. U., & Principe, J. C. (2002). A neural network perspective to extended Luenberger observers. Institute of Measurement and Control, 35(2), 10–16.
Goswami, A., Thuilot, B., & Espiau, B. (1996). Compass-like biped robot part I: Stability and bifurcation of passive gaits (Tech. Rep. No. RR-2996). Rocquencourt, France: INRIA.
Jaakkola, T., Singh, S. P., & Jordan, M. I. (1995). Reinforcement learning algorithm for partially observable Markov decision problems. In G. Tesauro, D. Touretzky, & T. Leen (Eds.), Advances in neural information processing systems, 7 (pp. 345–352). Cambridge, MA: MIT Press.
Kalman, R. E., & Bucy, R. S. (1961). New results in linear filtering and prediction theory. Transactions of the ASME, Series D, Journal of Basic Engineering, 83(1), 95–108.
Kimura, H., & Kobayashi, S. (1998). An analysis of actor/critic algorithms using eligibility traces: Reinforcement learning with imperfect value functions. In Proceedings of the 15th Int. Conf. on Machine Learning (pp. 284–292). San Francisco: Morgan Kaufmann.
Littman, M. L., Cassandra, A. R., & Kaelbling, L. P. (1995). Learning policies for partially observable environments: Scaling up. In A. Prieditis & S. Russell (Eds.), Proceedings of the Twelfth International Conference on Machine Learning (pp. 362–370). San Francisco: Morgan Kaufmann.
Luenberger, D. G. (1971). An introduction to observers. IEEE Transactions on Automatic Control, 16, 596–602.
McCallum, R. A. (1995). Reinforcement learning with selective perception and hidden state. Unpublished doctoral dissertation, University of Rochester.
Meuleau, N., Kim, K. E., & Kaelbling, L. P. (2001). Exploration in gradient-based reinforcement learning (Tech. Rep., AI Memo 2001-003). Cambridge, MA: MIT.
Meuleau, N., Peshkin, L., Kim, K., & Kaelbling, L. P. (1999). Learning finite-state controllers for partially observable environments. In Proceedings of the Fifteenth International Conference on Uncertainty in Artificial Intelligence (pp. 427–436). San Francisco: Morgan Kaufmann.
Morimoto, J., & Doya, K. (2002). Development of an observer by using reinforcement learning. In Proc. of the 12th Annual Conference of the Japanese Neural Network Society (pp. 275–278). Tottori, Japan. (In Japanese)
Porter, L. L., & Passino, K. M. (1995). Genetic adaptive observers. Engineering Applications of Artificial Intelligence, 8(3), 261–269.
Raghavan, I. R., & Hedrick, J. K. (1994). Observer design for a class of nonlinear systems. International Journal of Control, 59(2), 515–528.
Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. Cambridge, MA: MIT Press.
Thau, F. E. (1973). Observing the state of nonlinear dynamic systems. International Journal of Control, 17(3), 471–479.
Thrun, S. (2000). Monte Carlo POMDPs. In S. A. Solla, T. K. Leen, & K. R. Müller (Eds.), Advances in neural information processing systems, 12 (pp. 1064–1070). Cambridge, MA: MIT Press.
Wan, E., & van der Merwe, R. (2000). The unscented Kalman filter for nonlinear estimation. In Proc. of IEEE Symposium 2000 (AS-SPCC). Piscataway, NJ: IEEE.
Wan, E. A., & Nelson, A. T. (1997). Dual Kalman filtering methods for nonlinear prediction, smoothing, and estimation. In M. Mozer, M. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9 (pp. 793–799). Cambridge, MA: MIT Press.
Xu, J. W., Erdogmus, D., & Principe, J. C. (2005). Minimum error entropy Luenberger observer. In Proc. of American Control Conference. Piscataway, NJ: IEEE.
Received May 10, 2005; accepted July 17, 2006.
LETTER
Communicated by Yoshua Bengio
Training Recurrent Networks by Evolino

Jürgen Schmidhuber
[email protected]
IDSIA, 6928 Manno (Lugano), Switzerland, and TU Munich, 85748 Garching, München, Germany
Daan Wierstra [email protected]
Matteo Gagliolo [email protected]
Faustino Gomez [email protected] IDSIA, 6928 Manno (Lugano), Switzerland
In recent years, gradient-based LSTM recurrent neural networks (RNNs) solved many previously RNN-unlearnable tasks. Sometimes, however, gradient information is of little use for training RNNs, due to numerous local minima. For such cases, we present a novel method: EVOlution of systems with LINear Outputs (Evolino). Evolino evolves weights to the nonlinear, hidden nodes of RNNs while computing optimal linear mappings from hidden state to output, using methods such as pseudoinverse-based linear regression. If we instead use quadratic programming to maximize the margin, we obtain the first evolutionary recurrent support vector machines. We show that Evolino-based LSTM can solve tasks that Echo State nets (Jaeger, 2004a) cannot and achieves higher accuracy in certain continuous function generation tasks than conventional gradient descent RNNs, including gradient-based LSTM.

Neural Computation 19, 757–779 (2007)
© 2007 Massachusetts Institute of Technology

1 Introduction

Recurrent neural networks (RNNs; Pearlmutter, 1995; Robinson & Fallside, 1987; Rumelhart & McClelland, 1986; Werbos, 1974; Williams, 1989) are mathematical abstractions of biological nervous systems that can perform complex mappings from input sequences to output sequences. In principle, one can wire them up just like microprocessors; hence, RNNs can compute anything a traditional computer can compute (Schmidhuber, 1990). In particular, they can approximate any dynamical system with arbitrary precision (Siegelmann & Sontag, 1991). However, unlike traditional, programmed computers, RNNs learn their behavior from a training set of correct example sequences. As training sequences are fed to the network, the error between the actual and desired network output is minimized using
758
J. Schmidhuber, D. Wierstra, M. Gagliolo, and F. Gomez
gradient descent, whereby the connection weights are gradually adjusted in the direction that reduces this error most rapidly. Potential applications include adaptive robotics, speech recognition, attentive vision, music composition, and innumerable others where retaining information from arbitrarily far in the past can be critical to making optimal decisions.

Recently, echo state networks (ESNs; Jaeger, 2004a) and a similar approach, liquid state machines (Maass, Natschläger, & Markram, 2002), have attracted significant attention. Composed primarily of a large pool of hidden neurons (typically hundreds or thousands) with fixed random weights, ESNs are trained by computing a set of weights from the pool to the output units using fast, linear regression. The idea is that with so many random hidden units, the pool is capable of very rich dynamics that just need to be correctly tapped by setting the output weights appropriately. ESNs have the best-known error rates on the Mackey-Glass time-series prediction task (Jaeger, 2004a). The drawback of ESNs is that the only truly computationally powerful, nonlinear part of the net does not learn, whereas previous supervised, gradient-based learning algorithms for sequence-processing RNNs (Pearlmutter, 1995; Robinson & Fallside, 1987; Schmidhuber, 1992; Werbos, 1988; Williams, 1989) adjust all weights of the net, not just the output weights.

Unfortunately, early RNN architectures could not learn to look far back into the past because they made gradients either vanish or blow up exponentially with the size of the time lag (Hochreiter, 1991; Hochreiter, Bengio, Frasconi, & Schmidhuber, 2001). A recent RNN called long short-term memory (LSTM; Hochreiter & Schmidhuber, 1997a), however, overcomes this fundamental problem through a specialized architecture that maintains constant error flow back through time and thus does not impose any unrealistic bias toward recent events. Using gradient-based learning for both linear and nonlinear nodes, LSTM networks can efficiently solve many tasks that were previously unlearnable using RNNs (e.g., Gers & Schmidhuber, 2001; Gers, Schmidhuber, & Cummins, 2000; Gers, Schraudolph, & Schmidhuber, 2002; Graves & Schmidhuber, 2005; Hochreiter & Schmidhuber, 1997a; Pérez-Ortiz, Gers, Eck, & Schmidhuber, 2003; Schmidhuber, Gers, & Eck, 2002).

However, even when using LSTM, gradient-based learning algorithms can sometimes yield suboptimal results because rough error surfaces can often lead to inescapable local minima. As we showed (Hochreiter & Schmidhuber, 1997b; Schmidhuber, Hochreiter, & Bengio, 2001), many RNN problems involving long-term dependencies that were considered challenging benchmarks in the 1990s turned out to be trivial in that they could be solved by random weight guessing. That is, these problems were difficult only because learning relied solely on gradient information. There was actually a high density of solutions in the weight space, but the error surface was too rough to be exploited using the local gradient. By repeatedly selecting weights at random, the network does not get
stuck in a local minimum and eventually happens on one of the plentiful solutions. One popular method that exploits random weight guessing in a more efficient and principled way is to search the space of RNN weight matrices (Miglino, Lund, & Nolfi, 1995; Miller, Todd, & Hedge, 1989; Nolfi, Floreano, Miglino, & Mondada, 1994; Sims, 1994; Yamauchi & Beer, 1994; Yao, 1993) using evolutionary algorithms (Holland, 1975; Rechenberg, 1973; Schwefel, 1977). The applicability of such methods is actually broader than that of gradient-based algorithms, since no teacher is required to specify target trajectories for the RNN output nodes. In particular, recent progress has been made with cooperatively coevolving recurrent neurons, each with its own rather small, local search space of possible weight vectors (Gomez, 2003; Moriarty & Miikkulainen, 1996; Potter & De Jong, 1995). This approach can quickly learn to solve difficult reinforcement learning control tasks (Gomez & Schmidhuber, 2005a; Gomez, 2003; Moriarty, 1997), including ones that require use of deep memory (Gomez & Schmidhuber, 2005b). Successfully evolved networks of this type are currently still rather small, with no more than several hundred weights or so. At least for supervised applications, however, such methods may be unnecessarily slow, since they do not exploit gradient information about good directions in the search space.

To overcome such drawbacks, in what follows, we limit the domain of evolutionary methods to the weight vectors of hidden units, while using fast traditional methods for finding optimal linear maps from hidden to output units. We present a general framework for training RNNs called EVOlution of recurrent systems with LINear Outputs (Evolino; Schmidhuber, Wierstra, & Gomez, 2005; Wierstra, Gomez, & Schmidhuber, 2005). Evolino evolves weights to the nonlinear, hidden nodes while computing optimal linear mappings from hidden state to output, using methods such as pseudoinverse-based linear regression (Penrose, 1955) or support vector machines (Vapnik, 1995), depending on the notion of optimality employed. This generalizes methods such as those of Maillard and Gueriot (1997) and Ishii et al. (Ishii, van der Zant, Bečanović, & Plöger, 2004; van der Zant, Bečanović, Ishii, Kobialka, & Plöger, 2004) that evolve radial basis functions and ESNs, respectively. Applied to the LSTM architecture, Evolino can solve tasks that ESNs (Jaeger, 2004a) cannot and achieves higher accuracy in certain continuous function generation tasks than conventional gradient descent RNNs, including gradient-based LSTM (G-LSTM).

The next section describes the Evolino framework as well as two specific instances, PI-Evolino (section 2.3) and Evoke (section 2.4), that both combine a cooperative coevolution algorithm called Enforced SubPopulations (section 2.1) with LSTM (section 2.2). In section 3, we apply Evolino to four time series prediction problems, and in section 4 we provide some concluding remarks.
Figure 1: Evolino network: A recurrent neural network receives sequential inputs u(t) and produces the vector (φ_1, φ_2, . . . , φ_n) at every time step t. These values are linearly combined with the weight matrix W to yield the network's output vector y(t). While the RNN is evolved, the output layer weights are computed using a fast, optimal method such as linear regression or quadratic programming.
2 Evolino

Evolino is a general framework for supervised sequence learning that combines neuroevolution (i.e., the evolution of neural networks) and analytical linear methods that are optimal in some sense, such as linear regression or quadratic programming (see section 2.4). The underlying principle of Evolino is that often a linear model can account for a large number of properties of a problem. Properties that require nonlinearity and recurrence are then dealt with by evolution.

Figure 1 illustrates the basic operation of an Evolino network. The output of the network at time t, y(t) ∈ R^m, is computed by the following formulas:

    y(t) = W φ(t),                                    (2.1)
    φ(t) = f(u(t), u(t − 1), . . . , u(0)),           (2.2)
where φ(t) ∈ R^n is the output of a recurrent neural network f(·) and W is a weight matrix. Note that because the networks are recurrent, f(·) is indeed
a function of the entire input history, u(t), u(t − 1), . . . , u(0). In the case of maximum margin classification problems (Vapnik, 1995), we may compute W by quadratic programming. In what follows, however, we focus on mean squared error minimization problems and compute W by linear regression.

In order to evolve an f(·) that minimizes the error between y and the correct output, d, of the system being modeled, Evolino does not specify a particular evolutionary algorithm, but rather stipulates only that networks be evaluated using the following two-phase procedure. In the first phase, a training set of sequences obtained from the system, {u^i, d^i}, i = 1..k, each of length l^i, is presented to the network. For each sequence u^i, starting at time t = 0, each input pattern u^i(t) is successively propagated through the recurrent network to produce a vector of activations φ^i(t) that is stored as a row in a (Σ_i^k l^i) × n matrix Φ. Associated with each φ^i(t) is a target vector d^i(t) in matrix D containing the correct output values for each time step. Once all k sequences have been seen, the output weights W (the output layer in Figure 1) are computed using linear regression from Φ to D. The column vectors in Φ (i.e., the values of each of the n outputs over the entire training set) form a nonorthogonal basis that is combined linearly by W to approximate D. In the second phase, the training set is presented to the network again, but now the inputs are propagated through the recurrent network f(·) and the newly computed output connections to produce predictions y(t). The error in the prediction, or the residual error, is then used as the fitness measure to be minimized by evolution. Alternatively, the error on a previously unseen validation set, or the sum of training and validation error, can be minimized. (A sketch of this two-phase evaluation is given below.)

Neuroevolution is normally applied to reinforcement learning tasks where correct network outputs (i.e., targets) are not known a priori. Evolino uses neuroevolution for supervised learning to circumvent the problems of gradient-based approaches. In order to obtain the precision required for time-series prediction, we do not try to evolve a network that makes predictions directly. Instead, the network outputs a set of vectors that form a basis for linear regression. The intuition is that finding a sufficiently good basis is easier than trying to find a network that models the system accurately on its own.

One possible instantiation of Evolino that we have explored thus far with promising results coevolves the recurrent nodes of LSTM networks using a variant of the enforced subpopulations (ESP) neuroevolution algorithm. The next sections describe ESP, LSTM, and the details of how they are combined in the Evolino framework to form two algorithms: PI-Evolino, which uses the mean squared error optimality criterion, and Evoke, which uses the maximum margin.
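To make the two-phase procedure concrete, here is a minimal numpy sketch of the pseudoinverse variant. The rnn_forward function, which maps one input sequence to its matrix of hidden activations, is a hypothetical interface for illustration; everything else follows the description above.

```python
import numpy as np

def evaluate_evolino(rnn_forward, sequences, targets):
    """Two-phase Evolino evaluation (MSE/pseudoinverse variant).

    rnn_forward(u) returns the (l_i x n) activations phi^i for one
    input sequence u; sequences and targets are lists of arrays.
    """
    # Phase 1: stack all activations into Phi ((sum_i l_i) x n) and the
    # corresponding correct outputs into D.
    Phi = np.vstack([rnn_forward(u) for u in sequences])
    D = np.vstack(targets)

    # Output weights via the Moore-Penrose pseudoinverse; this minimizes
    # the summed squared error ||Phi W - D||^2 (here W is the transpose
    # of the W in equation 2.1, since Phi stores each phi(t) as a row).
    W = np.linalg.pinv(Phi) @ D

    # Phase 2: rerun the network and score the residual error. With a
    # deterministic network and no backprojection, the activations are
    # unchanged, so Phi can simply be reused.
    residual_mse = np.mean((Phi @ W - D) ** 2)
    return residual_mse, W  # fitness to be minimized, plus the readout
```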
2.1 Enforced SubPopulations (ESP). Enforced SubPopulations differs from standard neuroevolution methods in that instead of evolving complete networks, it coevolves separate subpopulations of network components or neurons (see Figure 2).
Figure 2: Enforced subpopulations (ESP). The population of neurons is segregated into subpopulations. Networks are formed by randomly selecting one neuron from each subpopulation. A neuron accumulates a fitness score by adding the fitness of each network in which it participates. The best neurons within each subpopulation are mated to form new neurons. The network shown here is an LSTM network with four memory cells (the triangular shapes).
ESP searches the space of networks indirectly by sampling the possible networks that can be constructed from the subpopulations of neurons. Network evaluations serve to provide a fitness statistic that is used to produce better neurons that can eventually be combined to form a successful network. This cooperative coevolutionary approach is an extension of symbiotic, adaptive neuroevolution (SANE; Moriarty & Miikkulainen, 1996), which also evolves neurons, but in a single population. By using separate subpopulations, ESP accelerates the specialization of neurons into the different subfunctions needed to form good networks because members of different evolving subfunction types are prevented from mating. Subpopulations also reduce noise in the neuron fitness measure because each evolving neuron type is guaranteed to be represented in every network that is formed. Both of these features allow ESP to evolve networks more efficiently than SANE (Gomez & Miikkulainen, 1999).

ESP normally uses crossover to recombine neurons. However, for the Evolino variant, where fine local search is desirable, ESP uses Cauchy-distributed mutation to produce all new individuals, making the approach in effect an evolution strategy (Schwefel, 1995). More concretely, evolution proceeds as follows:
1. Initialization. The number of hidden units H in the networks that will be evolved is specified, and a subpopulation of n neuron chromosomes is created for each hidden unit. Each chromosome encodes a neuron's input and recurrent connection weights with a string of random real numbers.

2. Evaluation. A neuron is selected at random from each of the H subpopulations and combined to form a recurrent network. The network is evaluated on the task and awarded a fitness score. The score is added to the cumulative fitness of each neuron that participated in the network. This procedure is repeated until each neuron has participated in m evaluations.

3. Reproduction. For each subpopulation, the neurons are ranked by fitness, and the top quarter of the chromosomes, or parents, in each subpopulation are duplicated. The copies, or children, are mutated by adding noise to all of their weight values from the Cauchy distribution f(x) = α / (π(α² + x²)), where the parameter α determines the width of the distribution. The children then replace the lowest-ranking half of their corresponding subpopulation.

4. Repeat the evaluation–reproduction cycle until a sufficiently fit network is found.

If during evolution, the fitness of the best network evaluated so far does not improve for a predetermined number of generations, a technique called burst mutation is used. The idea of burst mutation is to search the space of modifications to the best solution found so far. When burst mutation is activated, the best neuron in each subpopulation is saved, the other neurons are deleted, and new neurons are created for each subpopulation by adding Cauchy-distributed noise to its saved neuron. Evolution then resumes, but now searching in a neighborhood around the previous best solution. Burst mutation injects new diversity into the subpopulations and allows ESP to continue evolving after the initial subpopulations have converged.
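The generation loop below is a minimal numpy sketch of steps 2 and 3 (burst mutation omitted). The interface is an assumption made for illustration, not the authors' implementation: a fitness_fn that scores a network assembled from one chromosome per subpopulation, with higher fitness assumed better (for the residual-error fitness of section 2, one would negate the error).

```python
import numpy as np

rng = np.random.default_rng(0)

def esp_generation(subpops, fitness_fn, m=10, alpha=1e-3):
    """One ESP generation. subpops is a list of (n, L) arrays, one
    subpopulation of n chromosomes per hidden unit."""
    H, n = len(subpops), subpops[0].shape[0]
    fitness = np.zeros((H, n))
    trials = np.ones((H, n))  # avoids division by zero for unseen neurons
    # Evaluation: assemble networks from randomly chosen neurons until
    # each neuron has participated in roughly m evaluations on average.
    for _ in range(m * n):
        idx = rng.integers(n, size=H)
        score = fitness_fn([subpops[h][idx[h]] for h in range(H)])
        fitness[np.arange(H), idx] += score
        trials[np.arange(H), idx] += 1
    # Reproduction: rank by mean fitness; duplicate the top quarter,
    # Cauchy-mutate the copies, and replace the bottom half with them.
    for h in range(H):
        order = np.argsort(-fitness[h] / trials[h])
        pop = subpops[h][order]
        parents = pop[: n // 4]
        children = np.tile(parents, (2, 1))
        children += alpha * rng.standard_cauchy(children.shape)
        pop[n - len(children):] = children
        subpops[h] = pop
    return subpops
```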
2.2 Long Short-Term Memory. LSTM is a recurrent neural network purposely designed to learn long-term dependencies via gradient descent. The unique feature of the LSTM architecture is the memory cell, which is capable of maintaining its activation indefinitely (see Figure 3). Memory cells consist of a linear unit, which holds the state of the cell, and three gates that can open or close over time. The input gate “protects” a neuron from its input: only when the gate is open can inputs affect the internal state of the neuron. The output gate lets the state out to other parts of the network, and the forget gate enables the state to “leak” activity when it is no longer useful.
Figure 3: Long short-term memory. The figure shows an LSTM memory cell. The cell has an internal state S together with a forget gate (G_F) that determines how much the state is attenuated at each time step. The input gate (G_I) controls access to the cell by the external inputs that are summed into the unit, and the output gate (G_O) controls when and how much the cell fires. Small, dark nodes represent the multiplication function.
The state of cell i is computed by

    s_i(t) = net_i(t) g_i^in(t) + g_i^forget(t) s_i(t − 1),        (2.3)

where g^in and g^forget are the activations of the input and forget gates, respectively, and net is the weighted sum of the external inputs (indicated by the Σ in Figure 3),

    net_i(t) = h( Σ_j w_ij^cell c_j(t − 1) + Σ_k w_ik^cell u_k(t) ),        (2.4)

where h is usually the identity function and c_j is the output of cell j:

    c_j(t) = tanh( g_j^out(t) s_j(t) ),        (2.5)
where g^out is the output gate of cell j. The amount each gate g_i of memory cell i is open or closed at time t is calculated by

    g_i^type(t) = σ( Σ_j w_ij^type c_j(t − 1) + Σ_k w_ik^type u_k(t) ),        (2.6)

where type can be input, output, or forget, and σ is the standard sigmoid function. The gates receive input from the output of other cells c_j and from the external inputs to the network.
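As a quick illustration, here is a direct numpy transcription of equations 2.3 to 2.6 for a single memory cell; the dictionary-of-weight-vectors interface is merely a convenience for this sketch.

```python
import numpy as np

def sigma(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell_step(w, s_prev, c_prev, u, h=lambda x: x):
    """One time step of an LSTM memory cell. w holds four weight
    vectors over the concatenation [c(t-1), u(t)]: w['cell'], w['in'],
    w['out'], w['forget']; s_prev is this cell's state s_i(t-1), c_prev
    the vector of all cell outputs c_j(t-1), u the external inputs."""
    x = np.concatenate([c_prev, u])      # every gate sees c(t-1) and u(t)
    net = h(w['cell'] @ x)               # equation 2.4 (h = identity here)
    g_in = sigma(w['in'] @ x)            # equation 2.6, input gate
    g_forget = sigma(w['forget'] @ x)    # equation 2.6, forget gate
    g_out = sigma(w['out'] @ x)          # equation 2.6, output gate
    s = g_in * net + g_forget * s_prev   # equation 2.3
    c = np.tanh(g_out * s)               # equation 2.5
    return s, c
```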
2.3 Combining LSTM, ESP, and Pseudoinverse in Evolino. We apply our general Evolino framework to the LSTM architecture, using ESP for evolution and regression for computing linear mappings from hidden state to outputs. ESP coevolves subpopulations of LSTM memory cells instead of standard recurrent neurons (see Figure 2). Each chromosome is a string containing the external input weights and the input, output, and forget gate weights, for a total of 4(I + H) weights in each memory cell chromosome, where I is the number of external inputs and H is the number of memory cells in the network. There are four sets of I + H weights because the three gates and the cell itself receive input from outside the cell and from the other cells. Figure 4 shows how the memory cells are encoded in an ESP chromosome. Each chromosome in a subpopulation encodes the connection weights for a cell's input, output, and forget gates, and external inputs.

The linear regression method used to compute the output weights (W in equation 2.1) is the Moore-Penrose pseudoinverse method, which is both fast and optimal in the sense that it minimizes the summed squared error (Penrose, 1955; compare Maillard & Gueriot, 1997, for an application to feedforward RBF nets, and Ishii et al., 2004, for an application to echo state networks). The vector φ(t) consists of both the cell outputs c_i and their internal states s_i, so that the pseudoinverse computes two connection weights for each memory cell. We refer to the connections from internal states to the output units as "output peephole" connections, since they peer into the interior of the cells.

For continuous function generation, backprojection (or teacher forcing, in standard RNN terminology) is used, where the predicted outputs are fed back as inputs in the next time step: φ(t) = f(u(t), y(t − 1), u(t − 1), . . . , y(0), u(0)). During training, the correct target values are backprojected, in effect "clamping" the network's outputs to the right values. During testing, the network backprojects its own predictions. This technique is also used by ESNs, but whereas ESNs do not change the backprojection connection weights, Evolino evolves them, treating them like any other input to the network. In the experiments described below, backprojection was found useful
Figure 4: Genotype-phenotype mapping. Each chromosome (genotype, left) in a subpopulation encodes the external input, and input, output, and forget gate weights of an LSTM memory cell (phenotype, right). The weights leading out of the state (S) and output (O) units are not encoded in the genotype, but are instead computed at evaluation time by linear regression.
for continuous-function generation tasks, but interferes to some extent with performance in the discrete context-sensitive language task.

2.4 Evoke: Evolino for Recurrent Support Vector Machines. As outlined in section 2, the Evolino framework does not prescribe a particular optimality criterion for computing the output weights. If we replace mean squared error with the maximum margin criterion of support vector machines (SVMs; Vapnik, 1995), the optimal linear output weights can be computed using, for example, quadratic programming, as in traditional SVMs. We call this Evolino variant EVOlution of systems with KErnel-based outputs (Evoke; Schmidhuber, Gagliolo, Wierstra, & Gomez, 2006). The Evoke variant of equation 2.1 becomes
    y(t) = w_0 + Σ_{i=1}^{k} Σ_{j=0}^{l_i} w_ij K(φ(t), φ^i(j)),        (2.7)

where φ(t) ∈ R^n is, again, the output of the recurrent neural network f(·) at time t (see equation 2.2), K(·, ·) is a predefined kernel function, and the weights w_ij correspond to the k training sequences φ^i, each of length l_i, and are computed with the support vector algorithm.
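A minimal sketch of this kernel readout follows; the gaussian kernel matches the one used in the experiments below, while the stored activations and the weights w_ij are assumed to come from a separate SVM training step.

```python
import numpy as np

def gaussian_kernel(x, y, sd=2.0):
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sd ** 2))

def evoke_output(phi_t, support, weights, w0, kernel=gaussian_kernel):
    """Evoke readout of equation 2.7. support is the list of stored
    activation vectors phi^i(j) over all k training sequences, and
    weights holds the corresponding w_ij from the SVM solver."""
    return w0 + sum(w * kernel(phi_t, phi_ij)
                    for w, phi_ij in zip(weights, support))
```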
SVMs are powerful regressors and classifiers that make predictions based on a linear combination of kernel basis functions. The kernel maps the input feature space to a higher-dimensional space where the data are linearly separable (in classification) or can be approximated well with a hyperplane (in regression). A limited way of applying existing SVMs to time-series prediction (Mukherjee, Osuna, & Girosi, 1997; Müller et al., 1997) or classification (Salomon et al., 2002) is to build a training set, either by transforming the sequential input into some static domain (e.g., a frequency and phase representation) or by considering restricted, fixed time windows of m sequential input values. One alternative, presented in Shimodaira, Noma, Nakai, and Sagayama (2002), is to average kernel distances between elements of input sequences aligned to m points. Of course, such approaches are bound to fail if there are temporal dependencies exceeding m steps. In a more sophisticated approach by Suykens and Vandewalle (2000), a window of m previous output values is fed back as input to a recurrent model with a fixed kernel.

So far, however, there has not been any recurrent SVM that learns to create internal state representations for sequence learning tasks involving time lags of arbitrary length between important input events. For example, consider the task of correctly classifying arbitrary instances of the context-free language a^n b^n (n a's followed by n b's, for arbitrary integers n > 0). For Evoke, the evolved RNN is a preprocessor for a standard SVM kernel. The combination of both can be viewed as an adaptive kernel learning a task-specific distance measure between pairs of input sequences. Although Evoke uses SVM methods, it can solve several tasks that traditional SVMs cannot solve even in principle. We will see that it also outperforms recent state-of-the-art RNNs on certain tasks, including echo state networks (ESNs; Jaeger, 2004a) and previous gradient descent RNNs (Hochreiter & Schmidhuber, 1997a; Pearlmutter, 1995; Robinson & Fallside, 1987; Rumelhart & McClelland, 1986; Werbos, 1974; Williams, 1989).

3 Experiments

Experiments with PI-Evolino were carried out on four test problems: context-sensitive languages, multiple superimposed sine waves, parity problem with display, and the Mackey-Glass time series. The first two were chosen to highlight Evolino's ability to perform well in both discrete and continuous domains, and to solve tasks that neither ESNs (Jaeger, 2004a) nor traditional gradient-descent RNNs (Pearlmutter, 1995; Robinson & Fallside, 1987; Rumelhart & McClelland, 1986; Werbos, 1974; Williams, 1989) can solve well. We also report successful experiments with
Evoke-trained RSVMs on these tasks. The parity problem with display demonstrates PI-Evolino's ability to cope with rough error surfaces that confound gradient-based approaches, including G-LSTM. Although the Mackey-Glass system can be modeled accurately by nonrecurrent systems (Jaeger, 2004a), it was selected to compare PI-Evolino with ESNs, the reference method on this widely used time-series benchmark.

3.1 Context-Sensitive Grammars. Context-sensitive languages are languages that cannot be recognized by deterministic finite-state automata and are therefore more complex in some respects than regular languages. In general, determining whether a string of symbols belongs to a context-sensitive language requires remembering all the symbols in the string seen so far, ruling out the use of nonrecurrent architectures.

To compare Evolino-based LSTM with published results for G-LSTM (Gers & Schmidhuber, 2001), we chose the language a^n b^n c^n. The task was implemented using networks with four input units—one for each symbol (a, b, c) plus the start symbol S—and four output units—one for each symbol plus the termination symbol T. Symbol strings were presented sequentially to the network, with each symbol's corresponding input unit set to 1 and the other three set to −1. At each time step, the network must predict the possible symbols that could come next in a legal string. Legal strings in a^n b^n c^n are those in which the number of a's, b's, and c's is equal—for example, ST, SabcT, SaabbccT, and SaaabbbcccT. So for n = 3, the set of input and target values would be (see the encoding sketch below):

    Input:  S   a   a   a   b  b  b  c  c  c
    Target: a/T a/b a/b a/b b  b  c  c  c  T

Evolino-based LSTM networks were evolved using eight different training sets, each containing legal strings with values for n as shown in the first column of Table 1. In the first four sets, n ranges from 1 to k, where k = 10, 20, 30, 40. The second four sets consist of just two training samples each and were intended to test how well the methods could induce the language from a nearly minimal number of examples.

LSTM networks with memory cells were evolved (four for PI-Evolino and five for Evoke), with random initial values for the weights between −0.1 and 0.1 for Evolino and between −5.0 and 5.0 for Evoke, and the inputs were multiplied by 100. The Cauchy noise parameter α for both mutation and burst mutation was set to 0.00001 for Evolino and to 0.1 for Evoke; 50% of the mutations are kept within these bounds. In keeping with the setup in Gers and Schmidhuber (2001), we added a bias unit to the forget gates and output gates with values of +1.5 and −1.5, respectively. For Evoke, the parameters of the SVM module were chosen heuristically: a gaussian kernel with standard deviation 2.0 and capacity 100.0. Evolino evaluates fitness on the entire training set, but Evoke uses a slightly different way of evaluating fitness: while the training set consists of the first half of the strings, fitness was defined as performance on the second half of the data, the validation set. Evolution was terminated after 50 generations, after which the best network in each simulation was tested.
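For concreteness, a small helper that builds this encoding is sketched below; the 1/−1 coding and next-symbol targets follow the description above, while the function name and the string representation of the targets are our own.

```python
import numpy as np

SYMBOLS = ['S', 'a', 'b', 'c']   # input units; 'T' replaces 'S' on output

def encode_anbncn(n):
    """Inputs and prediction targets for one legal a^n b^n c^n string
    (n >= 1). Inputs are 1-of-4 with -1 elsewhere; targets list the
    symbols that may legally come next."""
    string = ['S'] + ['a'] * n + ['b'] * n + ['c'] * n
    X = -np.ones((len(string), 4))
    for t, sym in enumerate(string):
        X[t, SYMBOLS.index(sym)] = 1.0
    # Legal continuations: after S, 'a' or 'T'; while reading a's, 'a'
    # or 'b'; then the counted-down b's and c's, ending with 'T'.
    targets = (['a/T'] + ['a/b'] * n + ['b'] * (n - 1)
               + ['c'] * n + ['T'])
    return X, targets
```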
Table 1: Results for the a^n b^n c^n Language.

    Training Data    Standard PI-Evolino    Tuned PI-Evolino    Gradient-LSTM
    1..10            1..29                  1..53               1..28
    1..20            1..67                  1..95               1..66
    1..30            1..93                  1..355              1..91
    1..40            1..101                 1..804              1..120
    10,11            4..14                  3..35               10..11
    20,21            13..36                 5..39               17..23
    30,31            26..39                 3..305              29..32
    40,41            32..53                 1..726              35..45

Notes: The table compares pseudoinverse-based Evolino (PI-Evolino) with gradient-based LSTM (G-LSTM) on the a^n b^n c^n language task. "Standard" refers to Evolino with the parameter settings used for both discrete and continuous domains (a^n b^n c^n and superimposed sine waves). The "tuned" version is biased to the language task: we additionally squash the cell input with the tanh function. The left-most column shows the set of strings used for training in each of the experiments. The other three columns show the set of legal strings to which each method could generalize after 50 generations (3000 evaluations), averaged over 20 runs. The upper training sets contain all strings up to the indicated length. The lower training sets contain only a single pair. PI-Evolino generalizes better than G-LSTM, most notably when trained on only two examples of correct behavior. The G-LSTM results are taken from Gers and Schmidhuber (2001).
Table 1 compares the results of Evolino-based LSTM, using the pseudoinverse as supervised learning module (PI-Evolino), with those of G-LSTM from Gers and Schmidhuber (2001); standard PI-Evolino uses parameter settings that are a compromise between discrete and continuous domains. If we set h to the tanh function, we obtain tuned PI-Evolino. We never managed to train ESNs to solve this task, presumably because the random prewiring of ESNs rarely represents an algorithm for solving such context-sensitive language problems.

The standard PI-Evolino networks had generalization very similar to that of G-LSTM on the 1..k training sets, but slightly better on the two-example training sets. Tuned PI-Evolino showed a dramatic improvement over G-LSTM on all of the training sets, most remarkably on the two-example sets, where it was able to generalize on average to all strings up to n = 726 after being trained on only n = {40, 41}.
Figure 5: Internal state activations. The state activations for the four memory cells of an Evolino network being presented the string a^800 b^800 c^800. The plot clearly shows how some units function as "counters," recording how many a's and b's have been seen. More complex, nonlinear behavior by the gates is not shown.
Evoke's performance was superior for the 1..10 and 1..20 training sets, generalizing up to n = 257 and n = 374, respectively, but degraded for larger training sets, for which both PI-Evolino and G-LSTM achieved better results. Figure 5 shows the internal states of each of the four memory cells of one of the networks evolved by PI-Evolino while processing a^800 b^800 c^800.

3.2 Parity Problem with Display. The parity problem with display involves classifying sequences consisting of 1's and −1's according to whether the number of 1's is even or odd. The target, which depends on the entire sequence, is a display of 10 × 10 output neurons depicting O for odd and E for even. The display prevents the task from being solved by guessing the network weights (Hochreiter & Schmidhuber, 1997b) and makes the error gradient very rough. We trained PI-Evolino with two memory cells on 50 random sequences of length between 100 and 110. Unlike G-LSTM, which typically cannot solve this task due to a lack of global gradient information, PI-Evolino learned a perfect display classification on a test set within 30 generations in all 20 experiments.

3.3 Mackey-Glass Time-Series Prediction. The Mackey-Glass system (MGS; Mackey & Glass, 1977) is a standard benchmark for chaotic time-series prediction. The system produces an irregular time series that is
generated by the following differential equation:

    ẏ(t) = α y(t − τ) / (1 + y(t − τ)^β) − γ y(t),

where the parameters are usually set to α = 0.2, β = 10, γ = 0.1. The system is chaotic whenever the delay τ > 16.8. We use the most common value for the delay, τ = 17. Although the MGS can be modeled accurately using feedforward networks with a time window on the input, we compare PI-Evolino to ESNs (currently the best method for MGS) in this domain to show its capacity for making precise predictions. We used the same setup in our experiments as in Jaeger (2004a). (A sketch of generating this series is given below.)

Networks were evolved in the following way. During the first phase of an evaluation, the network predicts the next function value for 3000 time steps with the benefit of the backprojected target from the previous time step. For the first 100 washout time steps, the vectors φ(t) are not collected; only the φ(t), t = 101..3000, are used to calculate the output weights using the pseudoinverse. During the second phase, the previous target is backprojected only during the washout time, after which the network runs freely by backprojecting its own predictions. The fitness score assigned to the network is the MSE on time steps 101..3000.

Networks with 30 memory cells were evolved for 200 generations with a Cauchy noise α of 10^−7. A bias input of 1.0 was added to the network, the backprojection values were scaled by a factor of 0.1, and the cell input was squashed with the tanh function. At the end of an evolutionary run, the best network found was tested by having it predict using the backprojected previous target for the first 3000 steps and then run freely from time step 3001 to 3084.¹ The average NRMSE_84 for PI-Evolino with 30 cells over the 15 runs was 1.9 × 10^−3, compared to 10^−4.2 for ESNs with 1000 neurons (Jaeger, 2004a). The PI-Evolino results are currently the second-best reported so far. Figure 6 shows the performance of an Evolino network on the MG time series with even fewer memory cells, after 50 generations. Because this network has fewer parameters, it is unable to achieve the same precision as with 30 neurons, but it demonstrates how Evolino can learn such functions very quickly—in this case, within approximately 3 minutes of CPU time.
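Here is a minimal way to generate the series, assuming simple Euler integration with unit step size; Jaeger (2004a) and the experiments above may use a finer integration scheme, so treat the constants as illustrative.

```python
import numpy as np

def mackey_glass(n_steps, tau=17, alpha=0.2, beta=10.0, gamma=0.1, y0=1.2):
    """Mackey-Glass series by Euler integration with step size 1
    (a sketch; finer step sizes give a more faithful trajectory)."""
    y = np.full(n_steps + tau, y0)
    for t in range(tau, n_steps + tau - 1):
        y_tau = y[t - tau]  # the delayed value y(t - tau)
        y[t + 1] = y[t] + alpha * y_tau / (1.0 + y_tau ** beta) - gamma * y[t]
    return y[tau:]          # drop the constant warm-up segment
```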
¹The normalized root mean square error (NRMSE_84) 84 steps after the end of the training sequence is the standard comparison measure used for this problem.
Figure 6: Performance of PI-Evolino on the Mackey-Glass time series. The plot shows both the Mackey-Glass system and the prediction made by a typical Evolino-based LSTM network evolved for 50 generations. The obvious difference between the system and the prediction during the first 100 steps is due to the washout time. The inset shows a magnification that illustrates more clearly the deviation between the two curves.

Table 2: PI-Evolino Results for Multiple Superimposed Sine Waves.

    Number of Sines    Number of Cells    Training NRMSE    Generalization NRMSE
    2                  10                 2.01 × 10^−3      4.15 × 10^−3
    3                  15                 2.44 × 10^−3      8.04 × 10^−3
    4                  20                 1.51 × 10^−2      1.10 × 10^−1
    5                  20                 1.60 × 10^−2      1.66 × 10^−1

Notes: The training NRMSE is calculated on time steps 100 to 400 (i.e., the washout time is not included in the measure). The generalization NRMSE is calculated for time steps 400 to 700 (averaged over 20 experiments).

3.4 Multiple Superimposed Sine Waves. Learning to generate a sinusoidal signal is a relatively simple task that requires only one bit of memory to indicate whether the current network output is greater or less than the previous output. When sine waves with frequencies that are not integer multiples of each other are superimposed, the resulting signal becomes much harder to predict because its wavelength can be extremely long; there is a large number of time steps before the periodic signal repeats. Generating
such a signal accurately without recurrence would require a prohibitively large time delay window using a feedforward architecture. Jaeger (2004b) reports that echo state networks are unable to learn functions composed of even two superimposed oscillators, in particular sin(0.2x) + sin(0.311x). The reason is that the dynamics of all the neurons in the ESN pool are coupled, while this task requires that both of the two underlying oscillators be represented by the network's internal state.

Here we show how Evolino-based LSTM can not only solve the two-sine function mentioned above, but also more complex functions formed by superimposing up to three more sine waves. Each of the functions was constructed by Σ_{i=1}^{n} sin(λ_i x), where n is the number of sine waves and λ_1 = 0.2, λ_2 = 0.311, λ_3 = 0.42, λ_4 = 0.51, and λ_5 = 0.74 (see the sketch below).

For this task, PI-Evolino networks were evolved using the same setup and procedure as for the Mackey-Glass system except that steps 101..400 were used to calculate the output weights in the first evaluation phase and fitness in the second phase. Also, the inputs were multiplied by 0.01 instead of 100. Again, during the first 100 washout time steps, the vectors φ(t) were not collected for computing the pseudoinverse.
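A one-line generator for these target signals, sketched here for completeness (the integer sampling grid is our assumption):

```python
import numpy as np

LAMBDAS = [0.2, 0.311, 0.42, 0.51, 0.74]

def superimposed_sines(n_sines, n_steps):
    """Target signal sum_{i=1}^{n} sin(lambda_i * x), sampled at
    x = 0, 1, ..., n_steps - 1."""
    x = np.arange(n_steps)
    return sum(np.sin(lam * x) for lam in LAMBDAS[:n_sines])
```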
Figure 7: Performance of PI-Evolino on the superimposed sine wave tasks. The plots show the behavior of a typical network produced after a specified number of generations: 50 for the two-, three-, and four-sine functions and 150 for the five-sine function. The first 300 steps of each function, in the left column, were used for training. The curves in the right column show values predicted by the networks (dashed curves) further into the future versus the corresponding reference signal (solid curves). While the onset of noticeable prediction error occurs earlier as more sines are added, the networks still track the correct behavior for hundreds of time steps, even for the five-sine case.
The first three tasks, n = 2, 3, 4, used subpopulations of size 40, and simulations were run for 50 generations. The five-sine wave task, n = 5, proved much more difficult to learn, requiring a larger subpopulation size of 100, and simulations were allowed to run for 150 generations. At the end of each run, the best network was tested for generalization on data points from time steps 401..700, making predictions using backprojected previous predictions.
Figure 8: Internal representation of two-sine function. The upper graph shows the output of a PI-Evolino LSTM network with three cells predicting the two-sine function. The three pairs of lower graphs show the output (upper) and output peephole (lower) values of each cell in the network multiplied by their respective (pseudoinverse-generated) output weight. These six signals are added to generate the signal in the upper graph.
For Evoke, a slightly different setting was used, in which networks were evolved to minimize the sum of training and validation error, on points 100..400 and 400..700, respectively, and tested on points 700..1000. The
weight range was set to [−1.0, 1.0], and a gaussian kernel with standard deviation 2.0 and capacity 10.0 was used for the SVM. Table 2 shows the number of memory cells used for each task and the average NRMSE on both the training set and the testing set for the best network found during each evolutionary run of PI-Evolino. Evoke achieved a relatively low generalization NRMSE of 1.03 × 10^−2 on the two-sine problem, but gave unsatisfactory results for three or more sines.

Figure 7 shows the behavior of one of the successful networks from each of the tasks. The column on the left shows the target signal from Table 2 and the output generated by the network on the training set. The column on the right shows the same curves forward in time to show the generalization capability of the networks. For the two-sine function, even after 9000 time steps, the network continues to generate the signal accurately. As more sines are added, the prediction error grows more quickly, but the overall behavior of the signal is still retained.

Figure 8 reveals how the two-sine wave is represented internally by a typical PI-Evolino network. For illustration, a less accurate network containing only 3 cells instead of 10 is shown. The upper graph shows the overall output of the network, while the other graphs show the output peephole and output activity of each cell multiplied by the corresponding pseudoinverse-generated output weight. Although the network can generate the function accurately for thousands of time steps, it does not do so by implementing sinusoidal oscillators. Instead, each cell by itself behaves in a manner that is qualitatively similar to the two-sine function, but scaled, translated, and phase-shifted. These six separate signals are added together to produce the network output.
4 Conclusion

The human brain is a biological, learning RNN. Previous successes with artificial RNNs have been limited by problems overcome by the LSTM architecture. Its algorithms for shaping not only the linear but also the nonlinear parts allow LSTM to learn to solve tasks unlearnable by standard feedforward nets, support vector machines, hidden Markov models, and previous RNNs. Previous work on LSTM has focused on gradient-based G-LSTM (Gers & Schmidhuber, 2001; Gers et al., 2000, 2002; Graves & Schmidhuber, 2005; Hochreiter & Schmidhuber, 1997a; Pérez-Ortiz et al., 2003; Schmidhuber et al., 2002). Here we introduced the novel Evolino class of supervised learning algorithms for such nets that overcomes certain problems of gradient-based RNNs with local minima. Successfully tested instances with hidden coevolving recurrent neurons use either the pseudoinverse to minimize the MSE of the linear mapping from hidden units to outputs (PI-Evolino), or quadratic programming to maximize the margin.
The latter yields the first evolutionary recurrent SVMs or RSVMs, trained by an Evolino variant called Evoke. In the experiments of our pilot study, RSVMs generally performed better than G-LSTM and previous gradient-based RNNs but typically worse than PI-Evolino. One possible reason for this could be that the kernel mapping of the SVM component induces a more rugged fitness landscape that makes evolutionary search harder.

All of the evolved networks were comparatively small, usually featuring fewer than 3000 weights. For large data sets, such as those used in speech recognition, we typically need much larger LSTM networks with on the order of 100,000 weights (Graves & Schmidhuber, 2005). On such problems, we have so far generally obtained better results with G-LSTM than with Evolino. This seems to reaffirm the heuristic that evolution of large parameter sets is often harder than gradient search in such sets. Currently it is unclear when exactly to favor one over the other. Future work will explore hybrids combining G-LSTM and Evolino in an attempt to leverage the best of both worlds. We will also explore ways of improving the performance of Evoke, including the coevolution of SVM kernel parameters. We have barely tapped the set of possible applications of our new approaches; in principle, any learning task that requires some sort of adaptive short-term memory may benefit.

Acknowledgments

This research was partially funded by SNF grant 200020-107534, EU MindRaces project FP6 511931, and a CSEM Alpnach grant to J. S.

References

Gers, F. A., & Schmidhuber, J. (2001). LSTM recurrent networks learn simple context free and context sensitive languages. IEEE Transactions on Neural Networks, 12(6), 1333–1340.
Gers, F. A., Schmidhuber, J., & Cummins, F. (2000). Learning to forget: Continual prediction with LSTM. Neural Computation, 12(10), 2451–2471.
Gers, F. A., Schraudolph, N., & Schmidhuber, J. (2002). Learning precise timing with LSTM recurrent networks. Journal of Machine Learning Research, 3, 115–143.
Gomez, F. J. (2003). Robust nonlinear control through neuroevolution. Unpublished doctoral dissertation, University of Texas at Austin.
Gomez, F., & Miikkulainen, R. (1999). Solving non-Markovian control tasks with neuroevolution. In Proceedings of the 16th International Joint Conference on Artificial Intelligence. San Francisco: Morgan Kaufmann.
Gomez, F., & Schmidhuber, J. (2005a). Evolving modular fast-weight networks for control. In W. Duch, J. Kacprzyk, E. Oja, & S. Zadrozny (Eds.), Proceedings of the Fifteenth International Conference on Artificial Neural Networks: ICANN-05 (pp. 383–389). Berlin: Springer.
Gomez, F. J., & Schmidhuber, J. (2005b). Co-evolving recurrent neurons learn deep memory POMDPs. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO-05). Berlin: Springer-Verlag.
Graves, A., & Schmidhuber, J. (2005). Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18, 602–610.
Hochreiter, S. (1991). Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, Technische Universität München.
Hochreiter, S., Bengio, Y., Frasconi, P., & Schmidhuber, J. (2001). Gradient flow in recurrent nets: The difficulty of learning long-term dependencies. In S. C. Kremer & J. F. Kolen (Eds.), A field guide to dynamical recurrent neural networks. Piscataway, NJ: IEEE Press.
Hochreiter, S., & Schmidhuber, J. (1997a). Long short-term memory. Neural Computation, 9(8), 1735–1780.
Hochreiter, S., & Schmidhuber, J. (1997b). LSTM can solve hard long time lag problems. In M. C. Mozer, M. I. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9 (pp. 473–479). Cambridge, MA: MIT Press.
Holland, J. H. (1975). Adaptation in natural and artificial systems. Ann Arbor: University of Michigan Press.
Ishii, K., van der Zant, T., Bečanović, V., & Plöger, P. G. (2004). Identification of motion with echo state network. In Proc. IEEE Oceans04 (pp. 1205–1230). Kobe, Japan: IEEE.
Jaeger, H. (2004a). Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication. Science, 304, 78–80.
Jaeger, H. (2004b). The echo state approach to recurrent neural networks. Available online at http://www.faculty.iu-bremen.de/hjaeger/courses/SeminarSpring04/ESNStandardSlides.pdf.
Maass, W., Natschläger, T., & Markram, H. (2002). A fresh look at real-time computation in generic recurrent neural circuits (Tech. Rep.). Graz: Institute for Theoretical Computer Science, TU Graz.
Mackey, M. C., & Glass, L. (1977). Oscillation and chaos in physiological control systems. Science, 197, 287–289.
Maillard, E. P., & Gueriot, D. (1997). RBF neural network, basis functions and genetic algorithms. In IEEE International Conference on Neural Networks (pp. 2187–2190). Piscataway, NJ: IEEE.
Miglino, O., Lund, H., & Nolfi, S. (1995). Evolving mobile robots in simulated and real environments. Artificial Life, 2(4), 417–434.
Miller, G., Todd, P., & Hedge, S. (1989). Designing neural networks using genetic algorithms. In Proceedings of the 3rd International Conference on Genetic Algorithms (pp. 379–384). San Francisco: Morgan Kaufmann.
Moriarty, D. E. (1997). Symbiotic evolution of neural networks in sequential decision tasks (Tech. Rep. UT-AI97-257). PhD thesis, Department of Computer Sciences, University of Texas at Austin.
Moriarty, D. E., & Miikkulainen, R. (1996). Efficient reinforcement learning through symbiotic evolution. Machine Learning, 22, 11–32.
Mukherjee, S., Osuna, E., & Girosi, F. (1997). Nonlinear prediction of chaotic time series using support vector machines. In J. Principe, L. Giles, N. Morgan, &
E. Wilson (Eds.), IEEE Workshop on Neural Networks for Signal Processing VII (p. 511). Piscataway, NJ: IEEE Press.
Müller, K., Smola, A., Rätsch, G., Schölkopf, B., Kohlmorgen, J., & Vapnik, V. (1997). Predicting time series with support vector machines. In W. Gerstner, A. Germond, M. Hasler, & J.-O. Nicoud (Eds.), Proceedings of the Seventh International Conference on Artificial Neural Networks: ICANN-97 (pp. 999–1004). Berlin: Springer-Verlag.
Nolfi, S., Floreano, D., Miglino, O., & Mondada, F. (1994). How to evolve autonomous robots: Different approaches in evolutionary robotics. In R. A. Brooks & P. Maes (Eds.), Proceedings of the Fourth International Workshop on the Synthesis and Simulation of Living Systems (Artificial Life IV) (pp. 190–197). Cambridge, MA: MIT Press.
Pearlmutter, B. A. (1995). Gradient calculations for dynamic recurrent neural networks: A survey. IEEE Transactions on Neural Networks, 6(5), 1212–1228.
Penrose, R. (1955). A generalized inverse for matrices. Proceedings of the Cambridge Philosophical Society, 51, 406–413.
Pérez-Ortiz, J. A., Gers, F. A., Eck, D., & Schmidhuber, J. (2003). Kalman filters improve LSTM network performance in problems unsolvable by traditional recurrent nets. Neural Networks, 16(2), 241–250.
Potter, M. A., & De Jong, K. A. (1995). Evolving neural networks with collaborative species. In Proceedings of the 1995 Summer Computer Simulation Conference (pp. 340–345). Ottawa: Society of Computer Simulation.
Rechenberg, I. (1973). Evolutionsstrategie—Optimierung technischer Systeme nach Prinzipien der biologischen Evolution. Stuttgart, Germany: Fromman-Holzboog.
Robinson, A. J., & Fallside, F. (1987). The utility driven dynamic error propagation network (Tech. Rep. CUED/F-INFENG/TR.1). Cambridge: Cambridge University, Engineering Department.
Rumelhart, D. E., & McClelland, J. L. (Eds.). (1986). Parallel distributed processing (Vol. 1). Cambridge, MA: MIT Press.
Salomon, J., King, S., & Osborne, M. (2002). Framewise phone classification using support vector machines. In Proceedings of the International Conference on Spoken Language Processing. Denver.
Schmidhuber, J. (1990). Dynamische neuronale Netze und das fundamentale raumzeitliche Lernproblem. Unpublished doctoral dissertation, Technische Universität München.
Schmidhuber, J. (1992). A fixed size storage O(n³) time complexity learning algorithm for fully recurrent continually running networks. Neural Computation, 4(2), 243–248.
Schmidhuber, J., Gagliolo, M., Wierstra, D., & Gomez, F. (2006). Evolino for recurrent support vector machines. In Proceedings of ESANN'06. Evere, Belgium: d-side.
Schmidhuber, J., Gers, F., & Eck, D. (2002). Learning nonregular languages: A comparison of simple recurrent networks and LSTM. Neural Computation, 14(9), 2039–2041.
Schmidhuber, J., Hochreiter, S., & Bengio, Y. (2001). Evaluating benchmark problems by random guessing. In S. C. Kremer & J. F. Kolen (Eds.), A field guide to dynamical recurrent neural networks. Piscataway, NJ: IEEE Press.
Schmidhuber, J., Wierstra, D., & Gomez, F. J. (2005). Evolino: Hybrid neuroevolution/optimal linear search for sequence prediction. In Proceedings of
the 19th International Joint Conference on Artificial Intelligence (IJCAI). San Francisco: Morgan Kaufmann.
Schwefel, H. P. (1977). Numerische Optimierung von Computer-Modellen. Basel: Birkhäuser.
Schwefel, H. P. (1995). Evolution and optimum seeking. New York: Wiley Interscience.
Shimodaira, H., Noma, K.-I., Nakai, M., & Sagayama, S. (2002). Dynamic time-alignment kernel in support vector machine. In T. G. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14. Cambridge, MA: MIT Press.
Siegelmann, H. T., & Sontag, E. D. (1991). Turing computability with neural nets. Applied Mathematics Letters, 4(6), 77–80.
Sims, K. (1994). Evolving virtual creatures. In A. Glassner (Ed.), Proceedings of SIGGRAPH '94 (Orlando, Florida, July 1994) (pp. 15–22). New York: ACM Press.
Suykens, J., & Vandewalle, J. (2000). Recurrent least squares support vector machines. IEEE Transactions on Circuits and Systems-I, 47(7), 1109–1114.
van der Zant, T., Bečanović, V., Ishii, K., Kobialka, H.-U., & Plöger, P. G. (2004). Finding good echo state networks to control an underwater robot using evolutionary computations. In Proceedings of the 5th IFAC Symposium on Intelligent Autonomous Vehicles (IAV04). New York: Elsevier.
Vapnik, V. (1995). The nature of statistical learning theory. New York: Springer.
Werbos, P. J. (1974). Beyond regression: New tools for prediction and analysis in the behavioral sciences. Unpublished doctoral dissertation, Harvard University.
Werbos, P. J. (1988). Generalization of backpropagation with application to a recurrent gas market model. Neural Networks, 1, 339–365.
Wierstra, D., Gomez, F. J., & Schmidhuber, J. (2005). Modeling non-linear dynamical systems with Evolino. In Proc. GECCO 2005. New York: ACM Press.
Williams, R. J. (1989). Complexity of exact gradient computation algorithms for recurrent neural networks (Tech. Rep. NU-CCS-89-27). Boston: Northeastern University, College of Computer Science.
Yamauchi, B. M., & Beer, R. D. (1994). Sequential behavior and learning in evolved dynamical neural networks. Adaptive Behavior, 2(3), 219–246.
Yao, X. (1993). A review of evolutionary artificial neural networks. International Journal of Intelligent Systems, 8, 539–567.
Received December 2, 2005; accepted May 15, 2006.
LETTER
Communicated by Erkki Oja
A Generalized Divergence Measure for Nonnegative Matrix Factorization

Raul Kompass [email protected] FU Berlin, Institut für Mathematik und Informatik, 14152 Berlin, Germany
This letter presents a general parametric divergence measure. The measure includes as special cases the quadratic error and the Kullback-Leibler divergence. A parametric generalization of the two different multiplicative update rules for nonnegative matrix factorization by Lee and Seung (2001) is shown to lead to locally optimal solutions of the nonnegative matrix factorization problem with this new cost function. Numeric simulations demonstrate that the new update rule may improve the convergence speed obtained with the quadratic distance. A proof of convergence is given that, as in Lee and Seung, uses an auxiliary function known from the expectation-maximization theoretical framework.

Neural Computation 19, 780–791 (2007)
C 2007 Massachusetts Institute of Technology
1 Introduction

Nonnegative matrix factorization (NMF) is a technique of representational learning that consists in finding an approximation X ≈ MY of a matrix X = (x_ij), whose elements are nonnegative, by a product of two matrices M = (m_ik) and Y = (y_kj), which also are under the constraint of having nonnegative elements. The columns of X may be interpreted as input data, the columns of M as components, and the columns of Y as representations. Because of nonnegativity, there is no known general exact solution for the approximation problem, unlike PCA in the quadratic distance case.

The nonnegativity constraint, which for itself is biologically plausible because neural activity has no negative values, leads to an important property of the resulting representations. They tend to be parts based because the positive components may only be scaled and added to approximate the input. This is a main argument for the biological significance of NMF.

Lee and Seung (2001) presented two solutions for the NMF approximation problem: two pairs of iteration formulas for Y and M that decrease either the quadratic distance,

    ‖X − MY‖² = Σ_ij ( x_ij − (MY)_ij )²,        (1.1)
or the Kullback-Leibler divergence,

    D(X‖MY) = Σ_ij ( x_ij log( x_ij / (MY)_ij ) − x_ij + (MY)_ij ).        (1.2)
Their formulas, which are derived with a gradient-descent approach, have the advantage of converging rapidly. This is achieved by diagonal rescaling. The step-size parameter η, which in gradient descent is a rather small constant, is element-wise set to a variable and possibly large value with the new technique. As a result, the addition and subtraction terms of gradient descent occur as numerator and denominator in a multiplicative update formula, and the parameter η is gone. Lee and Seung (2001) proved convergence of the new formulas with an auxiliary function following the same strategy as in the expectation-maximization (EM) theoretical framework (Dempster, Laird, & Rubin, 1977).

For the case of quadratic distance, the NMF iteration formulas given by Lee and Seung (2001) in our notation are as follows:

    y_aμ ← y_aμ (M^T X)_aμ / (M^T M Y)_aμ,        m_ia ← m_ia (X Y^T)_ia / (M Y Y^T)_ia.        (1.3)
For the minimization of the Kullback-Leibler divergence D(X‖MY), these equations take a somewhat different form:

    y_aμ ← y_aμ ( Σ_i m_ia x_iμ / (MY)_iμ ) / ( Σ_k m_ka ),        m_ia ← m_ia ( Σ_μ y_aμ x_iμ / (MY)_iμ ) / ( Σ_ν y_aν ).        (1.4)
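As a point of reference, here is a minimal numpy sketch of both update pairs; the eps terms, which guard against division by zero, are our addition.

```python
import numpy as np

def nmf_step_quadratic(X, M, Y, eps=1e-9):
    """One multiplicative update pair of equation 1.3 (quadratic distance)."""
    Y = Y * (M.T @ X) / (M.T @ M @ Y + eps)
    M = M * (X @ Y.T) / (M @ Y @ Y.T + eps)
    return M, Y

def nmf_step_kl(X, M, Y, eps=1e-9):
    """One multiplicative update pair of equation 1.4 (KL divergence).
    Reusing R for both updates is a simplification; strictly, MY should
    be recomputed after Y changes."""
    R = X / (M @ Y + eps)  # the inhibited input X./(MY)
    Y = Y * (M.T @ R) / (M.sum(axis=0)[:, None] + eps)
    M = M * (R @ Y.T) / (Y.sum(axis=1)[None, :] + eps)
    return M, Y
```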
The action of the iteration formulas, 1.3 and 1.4, may be related to different forms of learning known from neural network theory. In the quadratic distance case, equation 1.3, the update rules perform a multiplicative activation and divisive error correction. The update of M is a multiplicative version of Karhunen and Oja's subspace rule for PCA (Oja, 1989). Instead of setting Y = M^T X as in Oja (1989), a similar multiplication and error-correcting division is employed here in the update of Y. In the Kullback-Leibler divergence case, the updates may be understood as an additional action of a divisory form of preintegration lateral inhibition (Spratling & Johnson, 2002) in the multiplicative as well as in the error-correcting divisory part of the update. This leads to the replacement of the terms X and MY by X./(MY) and (MY)./(MY) = 1 in the update formulas, where ./ denotes elementwise division. Through this strong form of inhibition, the denominators in equation 1.4 reduce to normalization terms. The error correction does not operate anymore.

In biological networks, both forms of learning may act concurrently, and this raises the question of what happens if we reduce the preintegration lateral
782
R. Kompass
inhibition in equation 1.4 so that the resulting denominators still perform some error correction. In the most fortunate scenario, we might yield an even faster convergence than with the NMF formulas, equations 1.3 and 1.4. In the worst case, we might experience a loss of convergence of the iteration. This work gives answers to this question. We present an intermediate version of NMF update with restricted preintegration lateral inhibition. We also present a general parametric divergence measure that contains the quadratic distance and Kullback-Leibler divergence as special cases. We then show that the intermediate NMF update leads to no increase of the new divergence with the according parameter. Finally, we give numerical examples that show that the new update iterates well and may even lead to faster approximation than the original formula. The mathematical proofs presented in the appendix use the same approach known from EM theory (Dempster et al., 1977) like those in Lee and Seung (2001). 2 Generalization of the NMF Iteration Formulas Equations 1.3 and 1.4 differ only in the roles of X and MY within the equations. The term xiµ in equation 1.3 is substituted by xiµ /(MY)iµ in equation 1.4. Likewise, the term (MY)iµ in equation 1.3 is substituted by 1, which is the same as (MY)iµ /(MY)iµ . We associate the division by (MY)iµ with preintegration lateral inhibition (Spratling & Johnson, 2002). We di1−α minish the inhibition by dividing instead by (MY)iµ , where α ∈ [0, 1]. In this way, we obtain the following two generalized parametric NMF update equations: α−1 i mia xiµ (MY)iµ ya µ ← ya µ α k mka (MY)kµ
mia ← mia
α−1 ya µ xiµ (MY)iµ . α ν ya ν (MY)iν
µ
(2.1)
The new parameter α may be set so that either of the two cost functions for NMF approximation is locally minimized. The quadratic distance case in equation 2.1 corresponds to α = 1. The Kullback-Leibler divergence case corresponds to α = 0. With these generalized NMF (GNMF) formulas, the interesting question now is what happens if we use the intermediate values of α ∈ (0, 1) for iteration. We will show that the iteration converges as well and that the resulting Y and W interpolate between the solutions of equations 1.3 and 1.4. 3 The Divergence Measure Dα Another question that naturally arises at this point is whether there is a divergence measure Dα (X, MY) that interpolates between X − MY2 and
A Generalized Divergence Measure for Nonnegative Matrix Factorization 783
α
Dα(1,b)
α
1.5 1
0.5 0 − 0.5
1
1
0 .5
0 .5
0
0
0 .5
1
1 .5
0
b
0
0 .5
1
1 .5
b
Figure 1: Generalized distance measure Dα . α = 1 corresponds to quadratic distance; α = 0 corresponds to Kullback-Leibler divergence.
D(X||MY) so that the GNMF for a certain value of α ∈ (0, 1) leads to a decrease of Dα (X, MY). The search for an interpolating divergence measure may at first be restricted to the case of input dimension n = 1 and number of inputs m = 1. The divergence of higher-dimensional input matrices may then be obtained by summing the distances of the corresponding elements as in equations 1.1 and 1.2. We found that such a desired divergence exists. It has the following form: α α a a −b + b α (b − a ) : α ∈ (0, 1] α Dα (a , b) := . (3.1) a (log a − log b) + (b − a ) : α = 0 2 and D0 (a , b) = D(X||MY). Dα (a , b) is continObviously D1 (a , b) = (a a α−−bb) α = log a − log b. Dα (a , b) is a divergence meauous in α because lim α α→0
sure because Dα (a , a ) = 0 and because with a ∂ Dα (a , b) = −(α + 1) 1−α − b α , ∂b b
(3.2)
Dα (a , b) increases as b moves farther away from a for a ,b > 0 and α > −1. Figure 1 exhibits plots of the distance measure Dα (a , b) for certain values of α. The Kullback-Leibler divergence deviates from quadratic distance by its asymmetry. With constant absolute difference |a − b| of a and b, the case b < a results in a higher divergence than the opposite case, b > a . In the new divergence measure Dα , the degree of asymmetry is adjustable by the parameter α. Values of α below 0 lead to an even stronger asymmetry than
784
R. Kompass
in Kullback-Leibler divergence. For α > 1, the asymmetry has the opposite direction. We define Dα (X, MY) :=
Dα (xij , (MY)ij ).
(3.3)
ij
For the gradients of this measure with respect to M and Y, we obtain xiµ ∂ α Dα (X, MY) = −(1 + α) mia − mka (MY)kµ 1−α ∂ ya µ (MY)iµ i k xiµ ∂ α Dα (X, MY) = −(1 + α) y − (MY)iν ya ν . 1−α a µ ∂mia µ (MY)iµ ν
(3.4)
(3.5)
Ordinary gradient descent might be applied with these expressions. This, however, does not ensure that the elements of Y and M remain positive. In the same way as in Lee and Seung (2001), in the simple gradient-descent additive update rule, yaµ ← yaµ + ηaµ
mia
i
xiµ 1−α (MY)iµ
−
mka (MY)αkµ
,
(3.6)
k
we therefore set ηa µ =
yaµ α , m k ka (MY)kµ
(3.7)
which leads to the multiplicative update rule for Y in equation 2.1. In a similar manner, the multiplicative update for M is obtained. Theorem 1. For α ∈ [0 , 1 ] the generalized divergence Dα (X, MY) is nonincreasing under the update rules, equation 2.1. The generalized divergence is invariant under these updates if and only if M and Y are at a stationary point of the divergence. The proof is given in the appendix. 4 Numerical Results We performed numerical simulations to test how iteration according to the new update, equation 2.1, proceeds and how the results interpolate between the outcomes of minimization of quadratic distance, equation 1.3, and of
A Generalized Divergence Measure for Nonnegative Matrix Factorization 785
Figure 2: Approximation of GNMF for different values of α. Quadratic versus Kullback-Leibler distance. (Left) USPS data (digits), dim(Y) = 20. (Right) CBCL data (faces), dim(Y) = 49.
Kullback-Leibler divergence, equation 1.4. We investigated the iteration with α varying in 10 steps between 0 and 1 with two data sets: the U.S. Postal Service (USPS) digit data (LeCun et al., 1989) and the CBCL face data, which were also used in the first publication on NMF (MIT Center for Biological and Computation Learning, 2000). The USPS, data are given as 16 × 16 = 256 dimensional input, and the CBCL data are given as 19 × 19 pixels, which results in an input dimension of 381. The GNMF of the USPS data was computed with a Y-dimension of r = 20, and the CBCL data were factored with r = 49, as in Lee and Seung (1999). The pixel brightness intensities were linearly scaled to cover the range [0, 1]. The two data sets have different characteristics. The overall pixel brightness intensity of the face data is unimodal, with the mode at about 0.35; for the digit data, it is bimodal because most elements are nearly or exactly zero (black pixels) and there is a mode close to 1 (white). In order to limit the computational cost, we took only the first 1000 samples of either data set. Figure 2 shows the final quadratic distance and Kullback-Leibler divergence of the approximations MY to the input X for the different values of α after 64, 128, 256, or 512 iterations. The distances in the plot (see equations 1.1 and 1.2) are scaled by dividing by the number of elements in X and averaged across 20 runs of GNMF with random initial values of Y and M. Confidence intervals (95%) of means that are not corrected for multiple comparisons are also plotted. The numerical results demonstrate that the generalized update formulas iterate well and lead to meaningful results for intermediate values of α. With many iteration steps (512), there is a gradual transition between quadratic distance and Kullback-Leibler divergence approximation. Values of α close to 0 and 1 lead to considerable improvement of one measure without sacrificing the other. In an NMF approximation task, where the cost function is a mixture of both divergence measures, the new formulas seem to be clearly
786
R. Kompass
superior to the old ones. We tested this for a very large number of iterations (5000) for α ∈ {0, 0.5, 1} for both data sets. Table 1 displays the results. An intermediate value of α = 12 was found to lead to costs of approximately 1.5% to 6% above the absolute minimum for both divergence measures, whereas the standard NMF updates, which correspond to α = 0 or α = 1, optimize one measure but lead to a 11% to 23% increase of the other. With few (128 or fewer) iteration steps, we observe that the best approximation in the quadratic sense is achieved for α in the range [0.3..0.8]. Initially for such parameter values, the iteration seems to converge faster than with α = 1. This could result from the simultaneous action of both forms of error correction discussed in section 1. For α outside the interval [0, 1], the iteration with the above data did not converge. 5 Discussion We extended the NMF iteration formulas given in Lee and Seung (2001), which were designed to minimize the quadratic distance (see equation 1.1) or Kullback-Leibler distance (see equation 1.2) to a continuous family of iterations with parameter α that locally minimize a new general cost function, equations 3.1 and 3.3. Numerical examples indicate that the new iteration yields good approximation in terms of both original cost functions when the parameter α is set to intermediate values (α ∈ (0.4, 0.8)). In section 1, we argued that for such parameter values, the NMF update can be interpreted as a combination of the two forms of learning error correction and presynaptic lateral inhibition (Spratling & Johnson, 2002). The numerical results show that such a combination may result in improved approximation with a fixed moderate number of iteration steps. Preliminary results not presented here show that the advantage is preserved in a biologically more realistic serialized “online” form of NMF update. Also, other variants of preintegration lateral inhibition might be beneficial. Spratling and Johnson (2002) used a rectified subtraction, which we did not explore in conjunction with the multiplicative NMF update. In general, we believe that the combination of the different forms of learning and error correction deserves further investigation. With our extension, the advantage of simplicity of NMF is preserved. The theoretical treatment of convergence (see the appendix) has become even simpler by being reduced to only one parametric case. The parameter α, a new degree of freedom, determines the asymmetry of the cost function. An asymmetric cost function, like Kullback-Leibler divergence, for example, in pattern recognition, more strongly penalizes misses in the reproduction of input data and better tolerates overshootings of these reproductions. It seems plausible within the context of representation of inputs for further processing in biological vision because missing something in a natural environment may be much more disadvantageous than an overamplification of represented activity. In general, it is not known how asymmetric the costs
0 0.5 1 0 0.5 1
USPS
CBCL
α
Data Set
0.0462 0.0421 0.0415 0.00261 0.00222 0.00215
Quadratic Distance −1
11.4% 1.5% 0 21.4% 3.5% 0
D Dmin
0.0556 0.0530 0.0545 0.00479 0.00448 0.00473
D1/2
−1
4.8% 0 2.8% 7.0% 0 5.7%
D Dmin
0.0726 0.0743 0.0814 0.0111 0.0117 0.0136
Kullback-Leibler Divergence
Table 1: Cost Functions D1 , D1/2 , and D0 for USPS and CBCL Data after 5000 GNMF Updates with α ∈ {0, 0.5, 1}.
−1 0 2.4% 12.2% 0 5.1% 22.8%
D Dmin
A Generalized Divergence Measure for Nonnegative Matrix Factorization 787
788
R. Kompass
are in a given task. Therefore, the new formulas might help adjust the cost function to an optimal shape when NMF is applied to pattern recognition. Investigation of the dependence of recognition performance on α certainly depends on the methods used for subsequent processing and classification and therefore is beyond the scope of this work. We hope that future research will be dedicated to this problem and to the problem of finding converging NMF update rules for parameters α outside the interval (0, 1). Appendix: Convergence Proof of Theorem 1 We closely follow Lee and Seung (2001). We show that for an arbitrary fixed α ∈ (0, 1) in every iteration, the cost function F (Y, M) does not increase. First, we prove this for the update of Y. For that purpose, as in the expectation-maximization algorithm (Dempster et al., 1977), we use an auxiliary function. ˜ is an auxiliary function for F (Y) if for every Y and Y˜ G(Y, Y)
Definition 1.
G(Y, Y) = F (Y)
and
˜ ≥ F (Y) G(Y, Y)
(A.1)
hold. ˜ the cost function F (Y) is nonLemma 1. With an auxiliary function G(Y, Y), increasing under the update Ynew = arg minY G(Y, Yold ). Proof.
(A.2)
F (Ynew ) ≤ G(Ynew , Yold ) ≤ G(Yold , Yold ) = F (Yold )
The cost function may be written in the following form: F (Y) =
1 1+α 1 + α xij − xij (MY)αij + (MY)1+α ij . α α ij
Lemma 2.
ij
(A.3)
ij
The EM update rule applied with the function
˜ = G(Y, Y)
1 1+α 1 + α mik y˜ kj xij − xij ˜ ij α α i jk (MY) ij
+
ykj1+α i jk
y˜ kjα
˜ α mik (MY) ij
˜ ij (MY) mik ykj mik y˜ kj
α
(A.4)
A Generalized Divergence Measure for Nonnegative Matrix Factorization 789
leads to the GNMF iteration update in Y given by equation 2.1. We have
Proof.
∂ 1 + α mik y˜ kj xij ˜ ij ∂ ya µ α i jk (MY) = (1 + α)
y˜ a µ ya µ
1−α
˜ ij (MY) mik ykj mik y˜ kj
α
α−1 ˜ iµ xiµ (MY) mia
(A.5)
i
and
1+α yaµ α ∂ yij α ˜ ˜ αkµ m (M Y) = (1 + α) mka (MY) ki kj ∂ ya µ i jk y˜ ijα y˜ aµ k
(A.6)
˜ takes its minimum in Y at a point where ∂ G(Y, Y) ˜ = 0. This corG(Y, Y) ∂Y responds to equality of the right sides of equations A.5 and A.6, which simplifies to yaµ
˜ αkµ = y˜ aµ mka (MY)
k
α−1 ˜ iµ mia xiµ (MY) ,
(A.7)
i
which is equivalent to the GNMF iteration in Y given in equation 2.1. Lemma 3.
˜ is an auxiliary function forF (Y). G(Y, Y)
˜ = F (Y). It remains to show that Proof. It is easy to verify that G(Y, Y) ˜ ˜ G(Y, Y) ≥ F (Y) ∀ Y, Y. We prove the inequality separately for the summands ˜ The first summands are identical. in G(Y, Y). mik y˜ kj Because k (MY) = 1, we may interpret k mik ykj as a convex combi˜ ij
˜
Y)i nation of the points zik = (M mik ykj with the coefficients λik = mik y˜ k α ∈ (0, 1) the function x −→ x α is concave. Therefore, we have
mik y˜ kj ˜ ij (MY) k
˜ ij (MY) mik ykj mik y˜ kj
α ≤
mik y˜ k ˜ i (MY)
. For
α mik ykj
,
k
and the inequality of the second summands follows from this.
(A.8)
790
R. Kompass
To prove inequality for the third summands, we have to show that
mik
ykj1+α
˜ α≥ (MY) ij
y˜ kjα
k
ij
mik ykj
k
ij
α .
mik ykj
(A.9)
k
Because M and Y may take arbitrary positive values, it is sufficient and necessary to prove the inequality for one fixed index combination (i, j) only. Hence, we have to show
k
y1+α mk k α y˜ k
α ≥
mk y˜ k
k
mk yk
k
α mk yk
.
(A.10)
k
With the substitutions zi = y˜ i /yi and ni = mi yi , we get
nk zk−α
k
α nk zk
≥
k
k
nk
α nk
.
(A.11)
k
−α This inequality follows from convexity of the function x −→ x because with the coefficients nk / l nl ,
−α nk zk −α k nk zk k ≥ k nk k nk
(A.12)
holds.
To prove that the update of M does not increase Dα , one may use the auxiliary function ˜ ia ya j 1 1+α 1 + α m ˜ = G(M, M) xij − xij ˜ ij α α i ja ( MY) ij
+
i ja
yij
1+α α mia MY˜ ij m ˜ ia
˜ ij ( MY) mia ya j m ˜ ia ya j
α
(A.13)
and proceed in an analog way. Acknowledgments We thank Raul Rojas for helpful discussions and for comments on a draft of this article. This work was supported by DFG grant KO 2265/2-1.
A Generalized Divergence Measure for Nonnegative Matrix Factorization 791
References Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39, 1–38. LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., & Jackel, L. (1989). Backpropagation applied to handwritten zip code recognition. Neural Computation, 1, 541–551. Lee, D. D., & Seung, H. S. (1999). Learning parts of objects by non-negative matrix factorization. Nature, 401, 788–791. Lee, D. D., & Seung, H. S. (2001). Algorithms for non-negative matrix factorization. In T. K. Leen, T. G. Diatterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 556–562). Cambridge, MA: MIT Press. MIT center for biological and computation learning. (2000). CBCL face database #1. Available online at http://cbcl.mit.edu/cbcl/software-datasets/facedata2.html. Oja, E. (1989). Neural networks, principal components and subspaces. International Journal of Neural Systems, 1(1), 61–68. Spratling, M. W., & Johnson, M. H. (2002). Pre-integration lateral inhibition enhances unsupervised learning. Neural Computation, 14(9), 2157–2179.
Received January 25, 2006; accepted June 17, 2006.
LETTER
Communicated by Ralf Herbrich
Support Vector Ordinal Regression Wei Chu [email protected] Center for Computational Learning Systems, Columbia University, New York, NY 10115, U.S.A.
S. Sathiya Keerthi [email protected] Yahoo! Research, Media Studios North, Burbank, CA 91504, U.S.A.
In this letter, we propose two new support vector approaches for ordinal regression, which optimize multiple thresholds to define parallel discriminant hyperplanes for the ordinal scales. Both approaches guarantee that the thresholds are properly ordered at the optimal solution. The size of these optimization problems is linear in the number of training samples. The sequential minimal optimization algorithm is adapted for the resulting optimization problems; it is extremely easy to implement and scales efficiently as a quadratic function of the number of examples. The results of numerical experiments on some benchmark and real-world data sets, including applications of ordinal regression to information retrieval, verify the usefulness of these approaches. 1 Introduction We consider the supervised learning problem of predicting variables of ordinal scale, a setting that bridges metric regression and classification and referred to as ranking learning or ordinal regression. Ordinal regression arises frequently in social science and information retrieval, where human preferences play a major role. The training samples are labeled by ranks, which exhibits an ordering among the different categories. In contrast to metric regression problems, these ranks are of finite types, and the metric distances between the ranks are not defined. These ranks are also different from the labels of multiple classes in classification problems due to the existence of the ordering information. There are several approaches for tackling ordinal regression problems in the domain of machine learning. The naive idea is to transform the ordinal scales into numerical values and then solve the problem as a standard regression problem. Kramer, Widmer, Pfahringer, & DeGroeve (2001) investigated the use of a regression tree learner in this way. A problem with this approach is that there might be no principled way of devising an Neural Computation 19, 792–815 (2007)
C 2007 Massachusetts Institute of Technology
Support Vector Ordinal Regression
793
appropriate mapping function since the true metric distances between the ordinal scales are unknown in most of the tasks. Another idea is to decompose the original ordinal regression problem into a set of binary classification tasks. Frank and Hall (2001) converted an ordinal regression problem into nested binary classification problems that encode the ordering of the original ranks and then organized the results of these binary classifiers in some ad hoc way for prediction. It is also possible to formulate the original problem as a large, augmented binary classification problem. Har-Peled, Roth, and Zimak (2002) proposed a constraint classification approach that provides a unified framework for solving ranking and multiclassification problems. Herbrich, Graepel, and Obermayer (2000) applied the principle of structural risk minimization (Vapnik, 1995) to ordinal regression, leading to a new distribution-independent learning algorithm based on a loss function between pairs of ranks. The main difficulty with these two algorithms (Har-Peled et al., 2002, Herbrich et al., 2000) is that the problem size of these formulations is a quadratic function of the training data size. As for sequential learning, Crammer and Singer (2002) proposed a proceptron-based online algorithm for rank prediction, known as the PRank algorithm. Shashua and Levin (2003) generalized the support vector formulation for ordinal regression by finding r − 1 thresholds that divide the real line into r consecutive intervals for the r ordered categories. However there is a problem with their approach: the ordinal inequalities on the thresholds, b 1 ≤ b 2 ≤ . . . ≤ br −1 , are not included in their formulation. This omission may result in disordered thresholds at the solution on some unfortunate cases (see section 4.1 for an example). It is important to make any method free of bad situations where it will not work. In this letter, we propose two new approaches for support vector ordinal regression. The first takes only the adjacent ranks into account in determining the thresholds, exactly as Shashua and Levin (2003) proposed, but we introduce explicit constraints in the problem formulation that enforce the inequalities on the thresholds. The second approach is entirely new; it considers the training samples from all the ranks to determine each threshold. Interestingly, we show that in this second approach, the ordinal inequality constraints on the thresholds are automatically satisfied at the optimal solution, though there are no explicit constraints on these thresholds. For both approaches, the size of the optimization problems is linear in the number of training samples. We show that the popular SMO algorithm (Platt, 1999; Keerthi, Shevade, Bhattacharyya, & Murthy, 2001) for support vector machines can be easily adapted for the two approaches. The resulting algorithms scale efficiently; empirical analysis shows that the cost is roughly a quadratic function of the problem size. Using several benchmark data sets, we demonstrate that the generalization capabilities of the two approaches are much better than that of the naive approach of doing standard regression on the ordinal labels.
794
W. Chu and S. Keerthi
The letter is organized as follows. In section 2, we present the first approach with explicit inequality constraints on the thresholds, derive the optimality conditions for the dual problem, and adapt the sequential minimal optimization (SMO) algorithm for the solution. In section 3 we present the second approach with implicit constraints. In section 4 we do an empirical study to show the scaling properties of the two algorithms and their generalization performance. We conclude in section 5. Throughout this letter, we use x to denote the input vector of the ordinal regression problem and φ(x) to denote the feature vector in a highdimensional reproducing kernel Hilbert space (RKHS) related to x by transformation. All computations are done using the reproducing kernel function only, which is defined as K(x, x ) = φ(x) · φ(x ) ,
(1.1)
where · denotes the inner product in the RKHS. Without loss of generality, we consider an ordinal regression problem with r ordered categories and denote these categories as consecutive integers Y = {1, 2, . . . , r } to keep the known ordering information. In the jth category, where j ∈ Y , the number of training samples is denoted as n j , and the ith training sample is j j denoted as xi where xi ∈ Rd . The total number of training samples rj=1 n j is denoted as n. b j , j = 1, . . . , r − 1 denote the (r − 1) thresholds. 2 Approach 1: Explicit Constraints on Thresholds As a powerful computational tool for supervised learning, support vector machines (SVMs) map the input vectors into feature vectors in a high¨ dimensional RKHS (Vapnik, 1995; Scholkopf & Smola 2002), where a linear machine is constructed by minimizing a regularized functional. For binary classification (a special case of ordinal regression with r = 2), SVMs find an optimal direction that maps the feature vectors into function values on the real line, and a single optimized threshold is used to divide the real line into two regions for the two classes, respectively. In the setting of ordinal regression, the support vector formulation could attempt to find an optimal mapping direction w, and r − 1 thresholds, which define r − 1 parallel discriminant hyperplanes for the r ranks accordingly. For each threshold b j , Shashua and Levin (2003) suggested considering the samples from the two adjacent categories, j and j + 1, for empirical errors (see Figure 1 for an illustration). More exactly, each sample in the jth category should have a function value that is less than the lower margin b j − 1; j j otherwise, w · φ(xi ) − (b j − 1) is the error (denoted as ξi ). Similarly, each sample from the ( j + 1)th category should have a function value that is j+1 greater than the upper margin b j + 1; otherwise, (b j + 1) − w · φ(xi ) is
Support Vector Ordinal Regression
795
Figure 1: An illustration of the definition of slack variables ξ and ξ ∗ for the thresholds. The samples from different ranks, represented as circles filled with different patterns, are mapped by w · φ(x) onto the axis of function value. Note that a sample from rank j + 1 could be counted twice for errors if it is sandwiched by b j+1 − 1 and b j + 1 where b j+1 − 1 < b j + 1, and the samples from rank j + 2, j − 1, and others never give contributions to the threshold b j . ∗ j+1
the error (denoted as ξi ).1 Shashua and Levin (2003) generalized the primal problem of SVMs to ordinal regression as follows: r −1
1 min ∗ w · w + C w,b,ξ ,ξ 2 j=1
nj
n j+1
j
ξi +
i=1
∗ j+1
ξi
,
(2.1)
i=1
subject to j j w · φ xi − b j ≤ −1 + ξi j ξi ≥ 0, for i = 1, . . . , n j j+1 ∗ j+1 w · φ xi − b j ≥ +1 − ξi ∗ j+1 ξi ≥ 0, for i = 1, . . . , n j+1 ,
(2.2)
where j runs over 1, . . . , r − 1 and C > 0. A problem with the above formulation is that the natural ordinal inequalities on the thresholds, that is, b 1 ≤ b 2 ≤ . . . ≤ br −1 cannot be guaranteed to hold at the solution. To tackle this problem, we explicitly include the
∗ j+1
1 The superscript ∗ in ξ denotes that the error is associated with a sample in the i adjacent upper category of the jth threshold.
796
W. Chu and S. Keerthi
following constraints in equation 2.2: b j−1 ≤ b j , for j = 2, . . . , r − 1.
(2.3)
2.1 Primal and Dual Problems. By introducing two auxiliary variables b 0 = −∞ and br = +∞, the modified primal problem in equations 2.1 to 2.3 can be equivalently written as follows,
j 1 ∗j ξi + ξi , w · w + C 2 nj
r
min ∗
w,b,ξ ,ξ
(2.4)
j=1 i=1
subject to j j j w · φ xi − b j ≤ −1 + ξi , ξi ≥ 0, ∀i, j j ∗j ∗j w · φ xi − b j−1 ≥ +1 − ξi , ξi ≥ 0, ∀i, j b j−1 ≤ b j , ∀ j.
(2.5)
The dual problem can be derived by standard Lagrangian techniques. Let j j ∗j ∗j αi ≥ 0, γi ≥ 0, αi ≥ 0, γi ≥ 0, and µ j ≥ 0 be the Lagrangian multipliers for the inequalities in equation 2.5. The Lagrangian for the primal problem is
j 1 ∗j ξi + ξi Le = w · w + C 2 r
nj
j=1 i=1
n r j
−
j j j αi − 1 + ξi − w · φ xi + b j
j=1 i=1 n r j
−
∗j
αi
∗ j ∗j − 1 + ξi + w · φ xi − b j−1
j=1 i=1 n r j
−
n r j
j
j
γi ξi −
j=1 i=1
j=1 i=1
∗j ∗j
γi ξi −
r
µ j (b j − b j−1 ).
(2.6)
j=1
The Karush-Kuhn-Tucker (KKT) conditions for the primal problem require the following to hold:
∗j ∂Le j j αi − αi φ(xi ) = 0 =w − ∂w r
nj
j=1 i=1
(2.7)
Support Vector Ordinal Regression
∂Le ∂ξi
j
∂Le ∗j
∂ξi
j
797
j
= C − αi − γi = 0, ∀i, ∀ j ∗j
∗j
= C − αi − γi
(2.8)
= 0, ∀i, ∀ j
(2.9)
j ∗ j+1 ∂Le =− αi − µ j + αi + µ j+1 = 0, ∀ j. ∂b j nj
n j+1
i=1
i=1
Note that the dummy variables associated with b 0 and br , that is, µ1 , µr , αi∗1 ’s, and αir ’s, are always zero. The conditions 2.8 and 2.9 give rise j ∗j to the constraints 0 ≤ αi ≤ C and 0 ≤ αi ≤ C, respectively. Let us now apply Wolfe duality theory to the primal problem. By introducing the KKT conditions 2.7 to 2.9 into the Lagrangian, equation 2.6, and applying the kernel trick, equation 1.1, the dual problem becomes a maximization problem involving the Lagrangian multipliers α, α ∗ , and µ: max
α,α ∗ ,µ
j
∗j
αi + αi
−
j,i
1 ∗j j ∗ j j j j α − αi αi − αi K xi , xi , 2 j,i j ,i i
(2.10)
subject to j
0 ≤ αi ≤ C, ∗ j+1 0 ≤ αi ≤ C, j n n j+1 j ∗ j+1 αi + µ j = αi + µ j+1 , i=1
µ j ≥ 0,
i=1
∀i, ∀ j ∀i, ∀ j (2.11)
∀j ∀ j,
where j runs over 1, . . . , r − 1. Leaving the dummy variables out of account, the size of the optimization problem is 2n − n1 − nr (α and α ∗ ) plus r − 2 (for µ). The dual problem, equations 2.10 and 2.11, is a convex quadratic programming problem. Once the α, α ∗ and µ are obtained by solving this problem, w is obtained from equation 2.7. The determination of the b j ’s is addressed in the next section. The discriminant function value for a new input vector x is f (x) = w · φ(x) =
∗j j j αi − αi K xi , x .
(2.12)
j,i
The predictive ordinal decision function is given by arg min{i : f (x) < b i }. i
798
W. Chu and S. Keerthi
2.2 Optimality Conditions for the Dual. To derive proper stopping conditions for algorithms that solve the dual problem and also determine the thresholds b j ’s, it is important to write down the optimality conditions for the dual. Though the resulting conditions that are derived below look a bit clumsy because of the notations, the ideas behind them are very much similar to those for the binary SVM classifier case, as the optimization problem is actually composed of r − 1 binary classifiers with a shared mapping direction w and the additional constraint 2.3. The Lagrangian for the dual can be written as follows: Ld =
1 ∗j j ∗ j j j j αi − αi αi − αi K xi , xi 2 j,i j ,i +
r −1
j n n j+1 j ∗ j+1 βj αi − αi + µ j − µ j+1
j=1
−
i=1
j
i=1 ∗j
∗j
j
ηi αi + ηi αi
−
j,i
j j πi C − αi
j,i
∗j
+ πi
∗ j
C − αi
r −1
−
λjµj −
∗j
j
αi + αi
,
j,i
j=2
j
∗j
j
∗j
where the Lagrangian multipliers ηi , ηi , πi , πi , and λ j are nonnegative, while β j can take any value. The KKT conditions associated with α and α ∗ can be given as follows: ∂Ld j ∂αi
j
j
j
j
= − f (xi ) − 1 − ηi + πi + β j = 0, πi ≥ 0
j j j j j ηi ≥ 0, πi C − αi = 0, ηi αi = 0, for i = 1, . . . , n j
∂Ld ∗ j+1 ∂αi ∗ j+1
πi
j+1 ∗ j+1 ∗ j+1 − 1 − ηi = f xi + πi − βj = 0
∗ j+1 ∗ j+1 αi
ηi
∗ j+1
≥ 0, ηi
∗ j+1
≥ 0, πi
∗ j+1
C − αi
= 0, for i = 1, . . . , n j+1 ,
=0 (2.13)
where f (x) is defined as in equation 2.12 and j runs over 1, . . . , r − 1, while the KKT conditions associated with the µ j are β j − β j−1 − λ j = 0, λ j µ j = 0, λ j ≥ 0,
(2.14)
Support Vector Ordinal Regression
799
where j = 2, . . . , r − 1. The conditions in equation 2.13 can be regrouped into the following six cases: Case 1: Case 2: Case 3: Case 4: Case 5: Case 6:
j
αi = 0 j 0 < αi < C j αi = C ∗ j+1 αi =0 ∗ j+1 0 < αi
j f xi + 1 ≤ β j j f xi + 1 = β j j f xi + 1 ≥ β j j+1 f xi − 1 ≥ βj j+1 f xi − 1 = βj j+1 − 1 ≤ βj. f xi
We can classify any variable into one of the following six sets: j def j I0a = i ∈ {1, . . . , n j } : 0 < αi < C j def ∗ j+1 I0b = i ∈ {1, . . . , n j+1 } : 0 < αi
j def j
j
j
j
def j
j
j
j
Let us define I0 = I0a ∪ I0b , Iup = I0 ∪ I1 ∪ I3 , and Ilow = I0 ∪ I2 ∪ I4 . We furj i ther define Fup (β j ) on the set Iup as i Fup (β j )
j j j f xi + 1 if i ∈ I0a ∪ I3 j+1 = j j f xi − 1 if i ∈ I0b ∪ I1 j
i and Flow (β j ) on the set Ilow as
i Flow (β j )
j j j f xi + 1 if i ∈ I0a ∪ I2 j+1 = j j − 1 if i ∈ I0b ∪ I4 . f xi
Then the conditions can be simplified as j
j
i i β j ≤ Fup (β j ) ∀i ∈ Iup and β j ≥ Flow (β j ) ∀i ∈ Ilow ,
which can be compactly written as j
j for j = 1, . . . , r − 1, blow ≤ b up j
j
(2.15) j
j
i i where b up = min{Fup (β j ) : i ∈ Iup } and blow = max{Flow (β j ) : i ∈ Ilow }.
800
W. Chu and S. Keerthi
The KKT conditions in equation 2.14 indicate that the condition, β j−1 ≤ β j always holds and that β j−1 = β j if µ j > 0. To merge conditions 2.14 and 2.15, let us define k k j j B˜ low = max blow : k = 1, . . . , j and B˜ up = min b up : k = j, . . . , r − 1 , where j = 1, . . . , r − 1. The overall optimality conditions can be simply written as j
j ∀ j, Blow ≤ Bup
(2.16)
where j Blow
=
j+1 B˜ low if µ j+1 > 0 j = and Bup j B˜ low otherwise
j−1 B˜ up if µ j > 0 j B˜ up otherwise.
(2.17)
It should be easy to see from the above sequence of steps that {β j } and {µ j } are optimal for the problem in equations 2.10 and 2.11 iff equation 2.16 is satisfied. We introduce a tolerance parameter τ > 0, usually 0.001, to define approximate optimality conditions. The overall stopping condition becomes j j max Blow − Bup : j = 1, . . . , r − 1 ≤ τ.
(2.18)
From the conditions in equations 2.13 and 2.2, it is easy to see the close relationship between the b j ’s in the primal problem and the multipliers β j ’s. In particular, at the optimal solution, β j and b j are identical. Thus, j j b j can be taken to be any value from the interval, [Blow , Bup ]. We can rej j 1 solve any nonuniqueness by simply taking b j = 2 (Blow + Bup ). Note that the KKT conditions in equation 2.14, coming from the additive constraints in equation 2.3 we introduced in Shashua and Levin’s (2003) formulation, j−1 j j−1 j enforce Blow ≤ Blow and Bup ≤ Bup at the solution, which guarantee that the thresholds specified in these feasible regions will satisfy the inequality constraints b j−1 ≤ b j ; without the constraints in equation 2.3, the thresholds might be disordered at the solution! Here we essentially consider an optimization problem of multiple learning tasks where individual optimality conditions are coupled together by a joint constraint. We believe this is a nontrivial extension and useful in other applications. The SMO algorithm (Platt, 1999; Keerthi et al., 2001) can be adapted for the solution of equations 2.10 and 2.11. The details are given in the appendix.
Support Vector Ordinal Regression
801
Figure 2: An illustration on the new definition of slack variables ξ and ξ ∗ that imposes implicit constraints on the thresholds. All the samples are mapped by w · φ(x) onto the axis of function values. Note the term ξ3i∗1 in this graph.
3 Approach 2: Implicit Constraints on Thresholds In this section we present a new approach to support vector ordinal regression. Instead of considering only the empirical errors from the samples of adjacent categories to determine a threshold, we allow the samples in all the categories to contribute errors for each threshold. This kind of loss function was also suggested by Srebro, Rennie, and Jaakkola (2005) for collaborative filtering. A very nice property of this approach is that the ordinal inequalities on the thresholds are satisfied automatically at the optimal solution in spite of the fact that such constraints on the thresholds are not explicitly included in the new formulation. We give a proof on this property in the following section. Figure 2 explains the new definition of slack variables ξ and ξ ∗ . For a threshold b j , the function values of all the samples from all the lower categories should be less than the lower margin b j − 1; if that does not j hold, then ξki = w · φ(xik ) − (b j − 1) is taken as the error associated with the sample xik for b j , where k ≤ j. Similarly, the function values of all the samples from the upper categories should be greater than the upper margin ∗j b j + 1; otherwise, ξki = (b j + 1) − w · φ(xik ) is the error associated with the sample xik for b j , where k > j. Here, the subscript ki denotes that the slack variable is associated with the ith input sample in the kth category; the superscript j denotes that the slack variable is associated with the lower categories of b j ; and the superscript ∗ j denotes that the slack variable is associated with the upper categories of b j .
802
W. Chu and S. Keerthi
3.1 Primal Problem. By taking all the errors associated with all r − 1 thresholds into account, the primal problem can be defined as follows: j nk nk r −1 r 1 j ∗ j min w · w + C ξki + ξki w,b,ξ ,ξ ∗ 2 j=1
k=1 i=1
(3.1)
k= j+1 i=1
subject to j w · φ xik − b j ≤ −1 + ξki ,
j
ξki ≥ 0
for k = 1, . . . , j and i = 1, . . . , nk ∗j ∗j w · φ xik − b j ≥ +1 − ξki , ξki ≥ 0 for k = j + 1, . . . , r and i = 1, . . . , nk ,
(3.2)
where j runs over 1, . . . , r − 1. Note that there are r − 1 inequality constraints for each sample xik (one for each threshold). To prove the inequalities on the thresholds at the optimal solution, let us consider the situation where w is fixed and only the b j ’s are optimized. j ∗j Note that the ξki and ξki are automatically determined once the b j are given. To eliminate these variables, let us define, for 1 ≤ k ≤ r , def Iklow (b) = i ∈ {1, . . . , nk } : w · φ xik − b ≥ −1 , def up Ik (b) = i ∈ {1, . . . , nk } : w · φ xik − b ≤ 1 . It is easy to see that b j is optimal iff it minimizes the function
e j (b) =
j
w · φ xik − b + 1
k=1 i∈I low (b) k
+
r
− w · φ xik + b + 1 .
(3.3)
k= j+1 i∈I up (b) k
Let B j denote the set of all minimizers of e j (b). By convexity, B j is a closed interval. Given two intervals B1 = [c 1 , d1 ] and B2 = [c 2 , d2 ], we say B1 ≤ B2 if c 1 ≤ c 2 and d1 ≤ d2 . Lemma 1.
B1 ≤ B2 ≤ · · · ≤ Br–1 .
Support Vector Ordinal Regression
Proof.
803
The “right-side derivative” of e j with respect to b is
g j (b) = −
j k=1
|Iklow (b)| +
r
up
k= j+1
|Ik (b)|.
(3.4)
Take any one j, and consider B j = [c j , d j ] and B j+1 = [c j+1 , d j+1 ]. Suppose c j > c j+1 . Define b j = c j and b j+1 = c j+1 . Since b j+1 is strictly to the left of the interval B j that minimizes e j , we have g j (b j+1 ) < 0. Since b j+1 is a minimizer of e j+1 , we also have g j+1 (b j+1 ) ≥ 0. Thus, we have g j+1 (b j+1 ) − g j (b j+1 ) > 0; also, by equation 3.4, we get low up (b j+1 ) − I j+1 (b j+1 ), 0 < g j+1 (b j+1 ) − g j (b j+1 ) = − I j+1 which is impossible. In a similar way, d j > d j+1 is also not possible. This proves the lemma. If the optimal b j are all unique,2 then lemma 1 implies that the b j satisfy the natural ordinal ordering. Even when one or more b j ’s are nonunique, lemma 1 says that there exist choices for the b j that obey the natural ordering. The fact that the order preservation comes about automatically is interesting and nontrivial, which differs from the PRank algorithm (Crammer & Singer, 2002), where the order preservation on the thresholds is easily brought in by their update rule. It is also worth noting that lemma 1 holds even for an extended problem formulation that allows the use of different costs (different C values) for j different misclassifications (class k misclassified as class j can have a Ck ). In applications such as collaborative filtering, such a problem formulation can be very appropriate; for example, an A-rated movie that is misrated as D may need to be penalized much more than if a B-rated movie is misrated, as D. Shashua and Levin’s (2003) formulation and its extension given in section 2 of this article do not precisely support such a differential cost structure. This is another good reason in support of the implicit problem formulation of the current section. j
j
∗j
∗j
3.2 Dual Problem. Let αki ≥ 0, γki ≥ 0, αki ≥ 0, and γki ≥ 0 be the Lagrangian multipliers for the inequalities in equation 3.2. Using ideas parallel to those in section 2.1, we can show that the dual of equations 3.1 and 3.2 is the following maximization problem that involves only the multipliers α
2
term
If, in the primal problem, we regularize the b j ’s as well (i.e., include the extra cost 2 j b j /2), then the b j ’s are guaranteed to be unique. Lemma 1 still holds in this case.
804
W. Chu and S. Keerthi
and α ∗ : 1 max − α,α ∗ 2 k,i k ,i +
k−1 k,i
k−1
∗j αki
+
j αki
k −1
j=k
j=1 ∗j αki
−
r −1
r −1
∗j
αk i −
j=1
r −1
j αk i K xik , xik
j=k
j αki
,
(3.5)
j=k
j=1
subject to j n k
k=1 i=1 j 0 ≤ αki ∗j 0 ≤ αki
n r k
j
αki =
∗j
αki
∀j
k= j+1 i=1
≤C
∀ j and k ≤ j
≤C
∀ j and k > j.
(3.6)
The dual problem 3.5 and 3.6 is a convex quadratic programming problem. The size of the optimization problem is (r − 1)n, where n = rk=1 nk is the total number of training samples. The discriminant function value for a new input vector x is k−1 r −1 ∗j j f (x) = w · φ(x) = αki − αki K xik , x . k,i
j=1
(3.7)
j=k
The predictive ordinal decision function is given by arg min{i : f (x) < b i }. i
The ideas of adapting the SMO algorithm to equations 3.5 and 3.6 are similar to those in the appendix. The resulting suboptimization problem is analogous to the case of standard update in the appendix, where only one of the equality constraints from equation 3.6 is involved. Full details of the SMO algorithm are given in our technical report (Chu & Keerthi, 2005). 4 Numerical Experiments We have implemented the two SMO algorithms for the ordinal regression formulations with explicit constraints (EXC) and implicit constraints (IMC),3 along with the algorithm of Shashua and Levin (2003) for comparison purpose. The function caching technique and the double-loop scheme
3 The source code of the two algorithms written in ANSI C can be found online at http://www.gatsby.ucl.ac.uk/∼chuwei/svor.htm.
Support Vector Ordinal Regression
805
proposed by Keerthi et al. (2001) have been incorporated in the implementation for efficiency. We begin this section with a simple data set to illustrate the typical behavior of the three algorithms and then empirically study the scaling properties of our algorithms. Then we compare the generalization performance of our algorithms against standard support vector regression on eight benchmark data sets for ordinal regression. The following gaussian kernel was used in these experiments:
d κ K(x, x ) = exp − (xς − xς )2 , 2
(4.1)
ς =1
where κ > 0 and xς denotes the ςth element of the input vector x. The tolerance parameter τ was set to 0.001 for all the algorithms. We have utilized two evaluation metrics that quantify the accuracy of predicted ordinal scales { yˆ 1 , . . . , yˆ t } with respect to true targets {y1 , . . . , yt }: 1. Mean absolute error is the average deviation of the prediction from the t true target, that is, 1t i=1 | yˆ i − yi |, in which we treat the ordinal scales as consecutive integers. 2. Mean zero-one error the fraction of incorrect predictions on individual samples. 4.1 Grading Data Set. The grading data set was used in Chapter 4 of Johnson and Albert (1999) as an example of the ordinal regression problem.4 There are 30 samples of students’ scores. The “SAT-math score” and “grade in prerequisite probability course” of these students are used as input features, and their final grades are taken as the targets. In our experiments, the six students with final grade A or E were not used, and the feature associated with the “grade in prerequisite probability course” was treated as a continuous variable though it had an ordinal scale. In Figure 3, we present the solution obtained by the three algorithms using the gaussian kernel, equation 4.1, with κ = 0.5 and the regularization factor value of C = 1. In this particular setting, the solution to Shashua and Levin’s (2003) formulation has disordered thresholds b 2 < b 1 , as shown in Figure 3 (left); the formulation with explicit constraints corrects this disorder and yields equal values for the two thresholds, as shown in Figure 3 (middle). 4.2 Scaling. In this experiment, we empirically studied how the two SMO algorithms scale with respect to training data size and the number of ordinal scales in the target. The California Housing data set was used in the 4 The grading data set is available online at http://www.mathworks.com/support/ books/book1593.jsp.
806
W. Chu and S. Keerthi
Figure 3: Training results of the three algorithms using a gaussian kernel on the grading data set. The discriminant function values are presented as contour graphs indexed by the two thresholds. The circles denote the students with grade D, the dots denote grade C, and the squares denote grade B.
Figure 4: Plots of CPU time versus training data size on log-log scale, indexed by the estimated slopes, respectively. We used the gaussian kernel with κ = 1 and the regularization factor value of C = 100 in the experiment.
scaling experiments.5 Twenty-eight training data sets with sizes ranging from 100 to 5,000 were generated by random selection from the original data set. The continuous target variable of the California Housing data was discretized to an ordinal scale by using 5 or 10 equal-frequency bins. The standard support vector regression (SVR) was used as a baseline, in which the ordinal targets were treated as continuous values and = 0.1. These data sets were trained by the three algorithms using a gaussian kernel with κ = 1 and a regularization factor value of C = 100. Figure 4 gives plots of the computational costs of the three algorithms as functions of the problem size, for the two cases of 5 and 10 target bins. Our algorithms scale well with scaling exponents between 2.13 and 2.33, while the scaling exponent of SVR is about 2.40 in this case. This near-quadratic property in scaling 5 The California Housing data set can be found online at http://lib.stat.cmu.edu/ datasets/.
Support Vector Ordinal Regression
807
comes from the sparseness property of SVMs, that is, nonsupport vectors affect the computational cost only mildly. The EXC and IMC algorithms cost more than the SVR approach due to the larger problem size. For large sizes, the cost of EXC is only about two times that of SVR in this case. As expected, we also noticed that the computational cost of IMC is dependent on r , the number of ordinal scales in the target. The cost for 10 ranks is observed to be roughly five times that for five ranks, whereas the cost of EXC is nearly the same for the two cases. These observations are consistent with the size of the optimization problems. The problem size of IMC is (r − 1)n (which is heavily influenced by r ), while the problem size of EXC is about 2n + r (which largely depends on n only since we usually have n r ). This factor of efficiency can be a key advantage for the EXC formulation. 4.3 Benchmark data sets. Next, we compared the generalization performance of the two approaches against the naive approach of using standard support vector regression (SVR) and the method (SLA) of Shashua and Levin (2003). We collected eight benchmark data sets that were used for metric regression problems.6 The target values were discretized into ordinal quantities using equal-frequency binning. For each data set, we generated two versions by discretizing the target values into 5 or 10 ordinal scales, respectively. We randomly partitioned each data set into training and test splits, as specified in Table 1. The partitioning was repeated 20 times independently. The input vectors were normalized to zero mean and unit variance, coordinate-wise. The gaussian kernel, equation 4.1, was used for all the algorithms. Fivefold cross validation was used to determine the optimal values of model parameters (the gaussian kernel parameter κ and the regularization factor C) involved in the problem formulations, and the test error was obtained using the optimal model parameters for each formulation. The initial search was done on a 7 × 7 coarse grid linearly spaced in the region {(log10 C, log10 κ)| − 3 ≤ log10 C ≤ 3, −3 ≤ log10 κ ≤ 3}, followed by a fine search on a 9 × 9 uniform grid linearly spaced by 0.2 in the (log10 C, log10 κ) space. The ordinal targets were treated as continuous values in standard SVR, and the predictions for test cases were rounded to the nearest ordinal scale. The insensitive zone parameter, of SVR, was fixed at 0.1. The test results of these algorithms are recorded in Tables 1 and 2. It is very clear that the generalization capabilities of the three ordinal regression algorithms are better than that of the approach of SVR. The performance of Shashua and Levin’s method is similar to our EXC approach, as expected, since the two formulations are pretty much the same. Our ordinal algorithms are comparable on the mean zero-one error, but the results also show the IMC algorithm yields much more stable results on mean absolute error than the
6 These regression data sets are available online at http://www.liacc.up.pt/∼ltorgo/ Regression/DataSets.html.
27 6 13 8 32 21 8 16
Pyrimidines MachineCPU Boston Abalone Bank Computer California Census
50/24 150/59 300/206 1000/3177 3000/5182 4000/4182 5000/15640 6000/16784
Training/Test 0.552±0.102 0.454±0.054 0.354±0.033 0.555±0.016∗ 0.590±0.004∗ 0.289±0.007∗ 0.451±0.004∗ 0.522±0.005∗
SVR 0.525±0.095 0.423±0.060 0.336±0.033 0.522±0.015 0.528±0.004 0.270±0.006 0.424±0.003 0.473±0.003
EXC
Mean Zero-one Error
0.517±0.086 0.431±0.054 0.332±0.024 0.527±0.009 0.537±0.004∗ 0.273±0.007 0.426±0.004 0.478±0.004∗
IMC 0.673±0.160 0.495±0.067 0.376±0.033 0.679±0.014∗ 0.712±0.006∗ 0.306±0.008∗ 0.515±0.004∗ 0.619±0.006∗
SVR
0.623±0.120 0.458±0.067 0.362±0.036 0.662±0.005 0.674±0.006∗ 0.288±0.007 0.494±0.004 0.576±0.003
EXC
Mean Absolute Error
0.615±0.127 0.462±0.062 0.357±0.024 0.657±0.011 0.661±0.005 0.289±0.007 0.491±0.005 0.573±0.005
IMC
Notes: The targets of these benchmark data sets were discretized into five equal-frequency bins. d denotes the input dimension and “training/test” denotes the partition size. The results are the averages over 20 trials, along with the standard deviation. We use boldface to indicate the cases in which the average value is the lowest among the results of the three algorithms. Asterisks indicate the cases significantly worse than the winning entry; a p-value threshold of 0.01 in Wilcoxon rank sum test was used to decide this.
d
Data Set
Partition
Table 1: Test Results of the Three Algorithms Using a Gaussian Kernel.
808 W. Chu and S. Keerthi
SLA
0.756±0.073 0.643±0.057 0.561±0.023 0.739±0.008∗ 0.759±0.005∗ 0.462±0.006 0.640±0.003 0.699±0.002
SVR
0.777±0.068∗ 0.693±0.056∗ 0.589±0.025∗ 0.758±0.017∗ 0.786±0.004∗ 0.494±0.006∗ 0.677±0.003∗ 0.735±0.004∗ 0.752±0.063 0.661±0.056 0.569±0.025 0.736±0.011 0.744±0.005 0.462±0.005 0.640±0.003 0.699±0.002
EXC 0.719±0.066 0.655±0.045 0.561±0.026 0.732±0.007 0.751±0.005∗ 0.473±0.005∗ 0.639±0.003 0.705±0.002∗
IMC 1.404±0.184 1.048±0.141 0.785±0.052 1.407±0.021∗ 1.471±0.010∗ 0.632±0.011∗ 1.070±0.008∗ 1.283±0.009∗
SVR 1.400±0.255 1.002±0.121 0.765±0.057 1.389±0.027∗ 1.414±0.012∗ 0.597±0.010 1.068±0.006∗ 1.271±0.007∗
SLA
1.331±0.193 0.986±0.127 0.773±0.049 1.391±0.021∗ 1.512±0.017∗ 0.602±0.009 1.068±0.005∗ 1.270±0.007∗
EXC
Mean Absolute Error
1.294±0.204 0.990±0.115 0.747±0.049 1.361±0.013 1.393±0.011 0.596±0.008 1.008±0.005 1.205±0.007
IMC
Notes: The partition sizes of these benchmark data sets were same as in Table 1, but the targets were discretized by 10 equal-frequency bins. The results are the averages over 20 trials, along with the standard deviation. We use boldface to indicate the lowest average value among the results of the four algorithms. Asterisks indicate the cases significantly worse than the winning entry; a p-value threshold of 0.01 in Wilcoxon rank sum test was used to decide this.
Pyrimidines MachineCPU Boston Abalone Bank Computer California Census
Data Set
Mean Zero-One Error
Table 2: Test Results of the Four Algorithms Using a Gaussian Kernel.
Support Vector Ordinal Regression 809
810
W. Chu and S. Keerthi
EXC algorithm.7 From the view of the formulations, EXC considers only the extremely worst samples between successive ranks, whereas IMC takes all the samples into account. Thus, the outliers may affect the results of EXC significantly, while the results of IMC are relatively more stable in both validation and test. 4.4 Information Retrieval. Ranking learning arises frequently in information retrieval. Hersh, Buckley, Leone, and Hickam (1994) generated the OHSUMED data set,8 which consists of 348,566 references and 106 queries with their respective ranked results. The relevance level of the references with respect to the given textual query was assessed by human experts using a three-rank scale: definitely, possibly, or not relevant. In our experiments, we used the results of query 1, which contains 107 assessed references (19 definitely relevant, 14 possibly relevant, and 74 irrelevant) taken from the database. In order to apply our algorithms, the bag-of-words representation was used to translate these reference documents into vectors. We computed for all documents the vector of term frequencies (TF) components scaled by inverse document frequencies (IDF). The TFIDF is a weighted scheme for the bag-of-words representation that gives higher weights to terms that occur very rarely in all documents. We used the rainbow software released by McCallum (1998) to scan the title and abstract of these references for the bag-of-words representation. In the preprocessing, we skipped the terms in the stoplist,9 and restricted ourselves to terms that appear in at least 3 of the 107 documents. This results in 462 distinct terms. Each document is represented by its TFIDF vector with 462 elements. To account for different document lengths, we normalized the length of each document vector to unity (Joachims, 1998). We randomly selected a subset of the 107 references (with size chosen from {20, 30, . . . , 60}) for training and then tested them on the remaining references. For each size, the random selection was repeated 100 times. The generalization performance of the two support vector algorithms for ordinal regression was compared against the naive approach of using support vector regression. The linear kernel K(xi , x j ) = xi · x j was employed for the three algorithms. The test results of the three algorithms are presented as box plots in Figure 5. In this case, the zero-one error rates are almost at same level, but the naive approach yields a much worse absolute error rate than the two ordinal SVM algorithms.
j
∗ j+1
As a plausible argument, ξi + ξi in equation 2.1 of EXC is an upper bound on the j j ∗j zero-one error of the ith example, while, in equation 2.17 of IMC, k=1 ξki + rk= j+1 ξki is an upper bound on the absolute error. Note that in all the examples, we use consecutive integers to represent the ordinal scales. 8 This data set is publicly available online at ftp://medir.ohsu.edu/pub/ohsumed/. 9 The stoplist is the SMART systems’ list of 524 common words, like the and of. 7
Support Vector Ordinal Regression
811
Figure 5: Test results of the three algorithms on the subset of the OHSUMED data relating to query 1, over 100 trials. The grouped boxes represent the results of SVR (left), EXC (middle), and IMC (right) at different training data sizes. The notched boxes have lines at the lower quartile, median, and upper quartile values. The whiskers are lines extending from each end of the box to the most extreme data value within 1.5·IQR (interquartile range) of the box. Outliers are data with values beyond the ends of the whiskers, which are displayed by dots. The left graph gives mean zero-one error, and the right graph gives mean absolute error.
5 Conclusion In this article, we proposed two new approaches to support vector ordinal regression that determine r − 1 parallel discriminant hyperplanes for the r ranks by using r − 1 thresholds. The ordinal inequality constraints on the thresholds are imposed explicitly in the first approach and implicitly in the second one. The problem size of the two approaches is linear in the number of training samples. We also designed SMO algorithms that scale only about quadratically with the problem size. The results of numerical experiments verified that the generalization capabilities of these approaches are much better than the naive approach of applying standard regression. Appendix: SMO Algorithm In this section we adapt the SMO algorithm (Platt 1999; Keerthi et al., 2001) for the solution of equations 2.10 and 2.11. The key idea of SMO consists of starting with a valid initial point and optimizing only one pair of variables at a time while fixing all the other variables. The suboptimization problem of the two active variables can be solved analytically. Table 3 presents an outline of the SMO implementation for our optimization problem. In order to determine the pair of active variables to optimize, we select the active threshold first. The index of the active threshold is defined as j j J J J = arg max j {Blow − Bup }. Let us assume that Blow and Bup are actually
812
W. Chu and S. Keerthi
Table 3: Basic Framework of the SMO Algorithm for Support Vector Ordinal Regression Using Explicit Threshold Constraints. Start at a Valid point, α, α ∗ and µ, that satisfy equation 2.11, find the current j j Bup and Blow ∀ j Do
SMO Loop
1. Determine the active threshold J 2. Optimize the pair of active variables and the set µa j
j
3. Compute Bup and Blow ∀ j at the new point while the optimality condition, equation 2.18, has not been satisfied Return α, α ∗ and b
Exit
j
j
o defined by blow and b upu , respectively, and that the two multipliers associated jo j with blow and b upu are αo and αu . The pair of multipliers (αo , αu ) is optimized from the current point (αoold , αuold ) to reach the new point, (αonew , αunew ). It is possible that jo = ju . In this case, named as cross update, more than one equality constraint in equation 2.11 is involved in the optimization that may update the variable set µa = {µmin{ jo , ju }+1 , . . . , µmax{ jo , ju } }, a subset of µ. In the case of jo = ju , named as standard update, only one equality constraint is involved, and the variables of µ are kept intact, that is, µa = ∅. These suboptimization problems can be solved analytically, and the detailed formulas for updating are given in the following. The following equality constraints have to be taken into account in the consequent suboptimization problem:
$$\sum_{i=1}^{n^j} \alpha_i^j + \mu^j = \sum_{i=1}^{n^{j+1}} \alpha_i^{*\,j+1} + \mu^{j+1}, \qquad \min\{j_o, j_u\} \le j \le \max\{j_o, j_u\}. \tag{A.1}$$
Under these constraints, the suboptimization problem for $\alpha_o$, $\alpha_u$, and $\mu_a$ can be written as

$$\min_{\alpha_o,\,\alpha_u,\,\mu_a}\ \frac{1}{2}\sum_{j,i}\sum_{j',i'} \left(\alpha_i^{*j} - \alpha_i^j\right)\left(\alpha_{i'}^{*j'} - \alpha_{i'}^{j'}\right) K\!\left(x_i^j, x_{i'}^{j'}\right) - \alpha_o - \alpha_u, \tag{A.2}$$
subject to the bound constraints $0 \le \alpha_o \le C$, $0 \le \alpha_u \le C$, and $\mu_a \ge 0$. The equality constraints, equation A.1, can be summarized for $\alpha_o$ and $\alpha_u$ as a linear constraint $\alpha_o + s_o s_u \alpha_u = \alpha_o^{\rm old} + s_o s_u \alpha_u^{\rm old} = \rho$, where

$$s_o = \begin{cases} -1 & \text{if } i_o \in I_{0a}^{j_o} \cup I_2^{j_o} \\ +1 & \text{if } i_o \in I_{0b}^{j_o} \cup I_4^{j_o} \end{cases} \qquad s_u = \begin{cases} -1 & \text{if } i_u \in I_{0a}^{j_u} \cup I_3^{j_u} \\ +1 & \text{if } i_u \in I_{0b}^{j_u} \cup I_1^{j_u}. \end{cases} \tag{A.3}$$
It is quite straightforward to verify that for the above optimization problem, which is restricted to $\alpha_o$, $\alpha_u$, and $\mu_a$, one can use the same derivations that led to equation 2.16 to show that the optimality condition for the restricted problem is nothing but $b_{\rm low}^{j_o} \le b_{\rm up}^{j_u}$. Since we started with this condition being violated, we are guaranteed that the solution of the restricted optimization problem will lead to a strict improvement in the objective function. To solve the restricted optimization problem, first consider the case of a positive-definite kernel matrix. For this case, the unbounded solution can be exactly determined as

$$\alpha_o^{\rm new} = \alpha_o^{\rm old} + s_o \Delta\mu, \qquad \alpha_u^{\rm new} = \alpha_u^{\rm old} - s_u \Delta\mu, \qquad \mu_j^{\rm new} = \mu_j^{\rm old} - d_\mu \Delta\mu \quad \forall \mu_j \in \mu_a, \tag{A.4}$$

where

$$d_\mu = \begin{cases} +1 & \text{if } j_o \le j_u \\ -1 & \text{if } j_o > j_u \end{cases} \tag{A.5}$$

and

$$\Delta\mu = \frac{-f(x_o) + f(x_u) + s_o - s_u}{K(x_o, x_o) + K(x_u, x_u) - 2K(x_o, x_u)}. \tag{A.6}$$
Now we need to check the box-bound constraints on $\alpha_o$ and $\alpha_u$ and the constraint $\mu_j \ge 0$. We have to adjust $\Delta\mu$ toward 0 to draw an out-of-range solution back to the nearest boundary point of the feasible region and update the variables accordingly, as in equation A.4. Note that the cross update allows $\alpha_o$ and $\alpha_u$ to be associated with the same sample. For the positive semidefinite kernel matrix case, the denominator of equation A.6 can vanish. Still, the numerator alone defines the descent direction, which needs to be traversed until the boundary is hit in order to get the optimal solution. It should be noted that working with just any violating pair of variables does not automatically ensure the convergence of the SMO algorithm. However, since the most violating pair is always chosen, the ideas of Keerthi and Gilbert (2002) can be adapted to prove the convergence of the SMO algorithm.
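As an illustration of the update and clipping just described, here is a hedged Python sketch of one pair step. All names are assumptions made for the example (f_o stands for f(x_o), K_oo for K(x_o, x_o), and so on), and the generic interval clipping stands in for the "adjust Δµ toward 0" rule.

def smo_pair_step(alpha_o, alpha_u, mu_a, f_o, f_u, s_o, s_u, d_mu,
                  K_oo, K_uu, K_ou, C):
    eta = K_oo + K_uu - 2.0 * K_ou
    num = -f_o + f_u + s_o - s_u
    # Unbounded step, equation A.6; for a semidefinite kernel eta can vanish,
    # and the numerator then only fixes the direction of descent.
    delta = num / eta if eta > 1e-12 else float("inf") * (1.0 if num >= 0 else -1.0)

    lo, hi = float("-inf"), float("inf")

    def intersect(coef, value, lower, upper):
        # Restrict delta so that lower <= value + coef * delta <= upper.
        nonlocal lo, hi
        if coef > 0:
            lo, hi = max(lo, (lower - value) / coef), min(hi, (upper - value) / coef)
        else:
            lo, hi = max(lo, (upper - value) / coef), min(hi, (lower - value) / coef)

    intersect(s_o, alpha_o, 0.0, C)            # 0 <= alpha_o + s_o * delta <= C
    intersect(-s_u, alpha_u, 0.0, C)           # 0 <= alpha_u - s_u * delta <= C
    for m in mu_a:                             # mu_j - d_mu * delta >= 0
        intersect(-d_mu, m, 0.0, float("inf"))
    delta = min(max(delta, lo), hi)            # draw back to the feasible region

    return (alpha_o + s_o * delta, alpha_u - s_u * delta,
            [m - d_mu * delta for m in mu_a])  # new point, equation A.4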
Acknowledgments

Part of the work was carried out at the Institute for Pure and Applied Mathematics, University of California, from April to June 2004. W.C. was supported by NIH Grant 1 P01 GM63208. The revision work was partly supported by a research contract from Consolidated Edison. The reviewers' thoughtful comments are gratefully appreciated.

References

Chu, W., & Keerthi, S. S. (2005). New approaches to support vector ordinal regression (Tech. Rep.). Burbank, CA: Yahoo! Research Labs.
Crammer, K., & Singer, Y. (2002). Pranking with ranking. In T. G. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14 (pp. 641–647). Cambridge, MA: MIT Press.
Frank, E., & Hall, M. (2001). A simple approach to ordinal classification. In Proceedings of the European Conference on Machine Learning (pp. 145–165). Berlin: Springer.
Har-Peled, S., Roth, D., & Zimak, D. (2002). Constraint classification: A new approach to multiclass classification and ranking. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15 (pp. 785–792). Cambridge, MA: MIT Press.
Herbrich, R., Graepel, T., & Obermayer, K. (2000). Large margin rank boundaries for ordinal regression. In P. J. Bartlett, B. Schölkopf, D. Schuurmans, & A. J. Smola (Eds.), Advances in large margin classifiers (pp. 115–132). Cambridge, MA: MIT Press.
Hersh, W., Buckley, C., Leone, T., & Hickam, D. (1994). OHSUMED: An interactive retrieval evaluation and new large test collection for research. In Proceedings of the 17th Annual ACM SIGIR Conference (pp. 192–201). Berlin: Springer.
Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. In C. Nédellec & C. Rouveirol (Eds.), Proceedings of ECML-98, 10th European Conference on Machine Learning (pp. 137–142). Berlin: Springer-Verlag.
Johnson, V. E., & Albert, J. H. (1999). Ordinal data modeling (Statistics for social science and public policy). Heidelberg: Springer-Verlag.
Keerthi, S. S., & Gilbert, E. G. (2002). Convergence of a generalized SMO algorithm for SVM classifier design. Machine Learning, 46, 351–360.
Keerthi, S. S., Shevade, S. K., Bhattacharyya, C., & Murthy, K. R. K. (2001). Improvements to Platt's SMO algorithm for SVM classifier design. Neural Computation, 13, 637–649.
Kramer, S., Widmer, G., Pfahringer, B., & DeGroeve, M. (2001). Prediction of ordinal classes using regression trees. Fundamenta Informaticae, 47, 1–13.
McCallum, A. (1998). Rainbow. Available online at http://www.cs.umass.edu/~mccallum/bow/rainbow/.
Platt, J. C. (1999). Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods: Support vector learning (pp. 185–208). Cambridge, MA: MIT Press.
Schölkopf, B., & Smola, A. J. (2002). Learning with kernels. Cambridge, MA: MIT Press.
Shashua, A., & Levin, A. (2003). Ranking with large margin principle: Two approaches. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15 (pp. 937–944). Cambridge, MA: MIT Press.
Srebro, N., Rennie, J. D. M., & Jaakkola, T. (2005). Maximum-margin matrix factorization. In L. K. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing systems, 17. Cambridge, MA: MIT Press.
Vapnik, V. N. (1995). The nature of statistical learning theory. New York: Springer-Verlag.
Received November 17, 2005; accepted June 29, 2006.
LETTER
Communicated by Klaus Brinker
Neighborhood Property–Based Pattern Selection for Support Vector Machines

Hyunjung Shin
[email protected]
Department of Industrial and Information Systems Engineering, Ajou University, Wonchun-dong, Yeoungtong-gu, 443–749, Suwon, Korea, and Friedrich Miescher Laboratory, Max Planck Society, 72076, Tübingen, Germany
Sungzoon Cho [email protected] Seoul National University, Shillim-Dong, Kwanak-Gu, 151-744, Seoul, Korea
The support vector machine (SVM) has been spotlighted in the machine learning community because of its theoretical soundness and practical performance. When applied to a large data set, however, it requires a large memory and a long training time. To cope with this practical difficulty, we propose a pattern selection algorithm based on neighborhood properties. The idea is to select only the patterns that are likely to be located near the decision boundary. Those patterns are expected to be more informative than randomly selected patterns. The experimental results provide promising evidence that it is possible to successfully employ the proposed algorithm ahead of SVM training.

Neural Computation 19, 816–855 (2007)  © 2007 Massachusetts Institute of Technology

1 Introduction

The support vector machine (SVM) has been spotlighted in the machine learning community because of its theoretical soundness and practical performance. SVM has been highly successful in practical applications as diverse as face detection and recognition, handwritten character and digit recognition, and text detection and categorization (Byun & Lee, 2002; Dumais, 1998; Heisele, Poggio, & Pontil, 2000; Joachims, 2002; Moghaddam & Yang, 2000; Osuna, Freund, & Girosi, 1997; Schölkopf, Burges, & Smola, 1999). However, when applied to a large data set, SVM training can become computationally intractable. In the formulation of SVM quadratic programming (QP), the dimension of the kernel matrix (M × M) is equal to the number of training patterns (M) (Vapnik, 1999). Thus, when the data set is huge, training cannot be finished in a reasonable time, even if the kernel matrix could be loaded in memory. Most standard SVM QP solvers, such as MINOS, CPLEX, LOQO, and the MATLAB QP routines, have a time complexity of O(M³), and the solvers using decomposition methods have a time complexity of
I · O(Mq + q³), where I is the number of iterations and q is the size of the working set: Chunking, SMO, SVMlight, and SOR (Hearst, Schölkopf, Dumais, Osuna, & Platt, 1997; Joachims, 2002; Schölkopf et al., 1999; Platt, 1999). Needless to say, I increases as M increases. Empirical studies have estimated the run time of common decomposition methods to be proportional to O(M^p), where p varies from approximately 1.7 to 3.0 depending on the problem (Hush, Kelly, Scovel, & Steinwart, 2006; Laskov, 2002). Moreover, SVM requires heavy or repetitive computation to find a satisfactory model: for instance, which kernel is favorable over the others (among RBF, polynomial, sigmoid, and others) and, additionally, how to choose the kernel parameters (width of basis function, polynomial degree, offset, and scale, respectively). There has been a considerable amount of work related to this topic, referred to as kernel learning, hyperkernels, and kernel alignment, for example (see Ong, Smola, & Williamson, 2005, and Sonnenburg, Rätsch, Schäfer, & Schölkopf, 2006). But the most common approaches for parameter selection depend on (cross-)validation. This implies that one should train an SVM model multiple times until a proper model is found.

One way to circumvent this computational burden is to select some training patterns in advance. One of the distinguishing merits of SVM theory is that it clarifies which patterns are important to training. These patterns are distributed near the decision boundary and fully and succinctly define the classification task at hand (Cauwenberghs & Poggio, 2001; Pontil & Verri, 1998; Vapnik, 1999). Furthermore, the subset of support vectors (SVs) is almost identical regardless of which kernel function is chosen for training (Schölkopf, Burges, & Vapnik, 1995). From a computational point of view, it is therefore worth identifying a subset of would-be SVs in preprocessing and then training the SVM model with the smaller set.

There have been several approaches to pattern selection that reduce the number of training patterns. Lyhyaoui et al. (1999) proposed 1-nearest-neighbor searching from the opposite class after class-wise clustering. But this method presumes no possible class overlap in the training set to find the patterns near the decision boundary. Almeida, Braga, and Braga (2000) conducted k-means clustering. A cluster is defined as homogeneous if it consists of patterns from the same class and heterogeneous otherwise. All of the patterns from a homogeneous cluster are replaced by a single centroid pattern, while the patterns from a heterogeneous cluster are all selected. The drawback is that it is not clear how to determine the number of clusters. Koggalage and Halgamuge (2004) also employed clustering to select the patterns from the training set. Their approach is quite similar to Almeida et al.'s (2000): clustering is conducted on the entire training set first, and the patterns that belong to the heterogeneous clusters are chosen. For a homogeneous cluster, on the contrary, the patterns along the rim of the cluster are selected instead of the centroid of Almeida et al. (2000). It is a relatively safer approach, since even in homogeneous clusters, patterns near the decision boundary can exist if the cluster's boundary is almost in contact
with the decision boundary. But it has a relative shortcoming as well, in that patterns far away from the decision boundary are also picked as long as they lie along the rim. Furthermore, the setting of the radius and the definition of the width of the rim are still unclear. In the meantime, for the reduced SVM (RSVM) of Lee and Mangasarian (2001), Zheng, Lu, Zheng, and Xu (2003) proposed using the centroids of clusters instead of random samples. This is more reasonable, since the centroids are more representative than random samples. However, these clustering-based algorithms have a common weakness: the selected patterns are fully dependent on the clustering performance, which can be unstable. Liu and Nakagawa (2001) made a related performance comparison. Sohn and Dagli (2001) suggested a slightly different approach that utilizes fuzzy class membership through k nearest neighbors. The score of fuzzy class membership is translated as a probability of how deeply a pattern belongs to a class. By these scores, the patterns having a weak probability are eliminated from the training set. However, this approach overlooks the importance of the patterns near the decision boundary; they are treated the same as outliers (noisy patterns far from the decision boundary).

Active learning shares the issue of significant pattern identification with pattern selection (Brinker, 2003; Campbell, Cristianini, & Smola, 2000; Schohn & Cohn, 2000). However, there are substantial differences between them. First, the primary motivation of active learning comes from the high cost of obtaining labeled training patterns, not from that of training itself. For instance, in industrial process modeling, obtaining even a single training pattern may require several days. In e-mail filtering, obtaining training patterns is not expensive, but it takes many hours to label them. In pattern selection, however, the labeled training patterns are assumed to already exist. Second, active learning alternates between training with a newly introduced pattern and making queries over the next pattern. On the contrary, pattern selection runs only once before training, as a preprocessor.

In this letter, we propose the neighborhood property–based pattern selection algorithm (NPPS). The practical time complexity of NPPS is O(vM), where v is the number of patterns in the overlap region around the decision boundary. We utilize the k nearest neighbors to look around each pattern's periphery. The first neighborhood property is that "a pattern located near the decision boundary tends to have more heterogeneous neighbors in its class membership." The second neighborhood property dictates that "a noisy pattern tends to belong to a different class from that of its neighbors." And the third neighborhood property is that "the neighbors of a pattern near the decision boundary tend to be located near the decision boundary as well." The first property is used for identifying the patterns located near the decision boundary. The second property is used for removing the patterns located on the wrong side of the decision boundary. And the third property is used for skipping unnecessary distance calculations, thus accelerating the pattern selection procedure. The letter also shows how to choose k for the
proposed algorithm. Note that this choice has been skipped or dealt with as trivial in other pattern selection methods that employ either k-means clustering or the k-nearest-neighbor rule.

The remainder of this letter is organized as follows. Section 2 briefly explains SVM theory, in particular, the patterns critically affecting the training. Section 3 presents the proposed method, NPPS, which selects the patterns near the decision boundary. Section 4 explains how to choose the number of neighbors k, the parameter of the proposed method. Section 5 provides the experimental results on artificial data sets, real-world benchmarking data sets, and a real-world marketing data set. We conclude with some future work in section 6.

2 Support Vector Machines and Critical Training Patterns

Support vector machines (SVMs) are a general class of statistical learning architectures that perform structural risk minimization on a nested set structure of separating hyperplanes (Cristianini & Shawe-Taylor, 2000; Schölkopf & Smola, 2002; Vapnik, 1999). Consider a binary classification problem with M patterns (x_i, y_i), i = 1, ..., M, where x_i ∈ ℝ^d and y_i ∈ {−1, 1}. Let us assume that patterns with y_i = 1 belong to class 1, while those with y_i = −1 belong to class 2. SVM training involves solving the following quadratic programming problem, which yields the largest margin (2/‖w‖) between classes:

$$\min_{w,\,\xi}\ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{M}\xi_i \quad \text{s.t.}\quad y_i\left(w\cdot\Phi(x_i)+b\right)\ge 1-\xi_i,\ \ \xi_i\ge 0,\ \ i=1,\dots,M, \tag{2.1}$$
where w ∈ ℝ^d, b ∈ ℝ (see Figure 1), and C ∈ ℝ is an error tolerance parameter. Equation 2.1 is the most general SVM formulation, allowing both nonseparable and nonlinear cases. The ξ_i's are nonnegative slack variables, which play the role of allowing a certain level of misclassification in a nonseparable case. Φ(·) is a mapping function for a nonlinear case that projects patterns from the input space into a feature space. This nonlinear mapping is performed implicitly by employing a kernel function K(x, x′) to avoid the costly calculation of the inner products Φ(x) · Φ(x′).¹ There are three typical
1 Aside from the computational efficiency gained by the dot-product replacement, kernels provide operational benefits as well: it is easy and natural to work on (or integrate) various types of data, such as vectors, sequences, text, images, and graphs, and to detect very general types of relations therein.
Figure 1: SVM classification problem: Through a mapping function Φ(·), the class patterns are linearly separated in a feature space. The patterns determining both margin hyperplanes are outlined. The decision boundary is the hyperplane halfway between the margins.
kernel functions, RBF, polynomial, and sigmoid, in due order:

$$K(x, x') = \exp\!\left(-\|x - x'\|^2 / 2\sigma^2\right), \quad K(x, x') = (x \cdot x' + 1)^p, \quad K(x, x') = \tanh\!\left(\rho(x \cdot x') + \delta\right). \tag{2.2}$$
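For concreteness, the three kernels of equation 2.2 can be transcribed directly. This is a minimal sketch in which x and z are NumPy vectors and sigma, p, rho, and delta are the respective kernel parameters.

import numpy as np

def rbf_kernel(x, z, sigma):
    # exp(-||x - z||^2 / (2 sigma^2))
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

def polynomial_kernel(x, z, p):
    # (x . z + 1)^p
    return (np.dot(x, z) + 1.0) ** p

def sigmoid_kernel(x, z, rho, delta):
    # tanh(rho (x . z) + delta)
    return np.tanh(rho * np.dot(x, z) + delta)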
The optimal solution of equation 2.1 yields a decision function of the following form:

$$f(x) = \mathrm{sign}\left(w \cdot \Phi(x) + b\right) = \mathrm{sign}\left(\sum_{i=1}^{M} y_i \alpha_i\, \Phi(x_i) \cdot \Phi(x) + b\right) = \mathrm{sign}\left(\sum_{i=1}^{M} y_i \alpha_i K(x_i, x) + b\right), \tag{2.3}$$
where the α_i's are nonnegative Lagrange multipliers associated with the training patterns. The solutions, the α_i's, are obtained from the dual problem of equation 2.1, which minimizes the convex quadratic objective
function under constraints:

$$\min_{0 \le \alpha_i \le C} W(\alpha_i, b) = \frac{1}{2}\sum_{i,j=1}^{M} \alpha_i \alpha_j y_i y_j K(x_i, x_j) - \sum_{i=1}^{M} \alpha_i + b\sum_{i=1}^{M} y_i \alpha_i.$$
The first-order conditions on W(α_i, b) reduce to the Karush-Kuhn-Tucker (KKT) conditions,

$$\frac{\partial W(\alpha_i, b)}{\partial \alpha_i} = y_i \sum_{j=1}^{M} y_j K(x_i, x_j)\,\alpha_j + y_i b - 1 = y_i \bar{f}(x_i) - 1 = g_i, \qquad \frac{\partial W(\alpha_i, b)}{\partial b} = \sum_{j=1}^{M} y_j \alpha_j = 0, \tag{2.4}$$
where f̄(·) is the function inside the parentheses of sign in equation 2.3. The KKT complementarity condition, equation 2.4, partitions the training pattern set into three categories according to the corresponding α_i's:

(a) g_i > 0  →  α_i = 0 : irrelevant patterns
(b) g_i = 0  →  0 < α_i < C : margin support vectors
(c) g_i < 0  →  α_i = C : error support vectors
Figure 2 illustrates these categories (Cauwenberghs & Poggio, 2001; Pontil & Verri, 1998). The patterns belonging to category a lie outside the margins and are thus irrelevant to training, while the patterns belonging to categories b and c are critical and directly affect training. They are called support vectors (SVs). The patterns of category b lie strictly on the margin; hence, they are called margin SVs. The patterns of category c lie between the two margins; hence, they are called error SVs, but they are not necessarily misclassified. (Note that there is another type of SV belonging to the category of error SVs: incorrectly labeled patterns that can be located very far from the decision boundary. We regard them as outliers that do not contribute to margin construction. We focus only on the SVs located around the decision boundary, not those located deep in the realm of the opposite class.) Going back to equation 2.3, we can now see that the decision function is a linear combination of kernels on only those critical training patterns (the SVs), because the patterns corresponding to α_i = 0 have no influence on the decision result:

$$f(x) = \mathrm{sign}\left(\sum_{i=1}^{M} y_i \alpha_i K(x_i, x) + b\right) = \mathrm{sign}\left(\sum_{i \in \mathrm{SVs}} y_i \alpha_i K(x_i, x) + b\right). \tag{2.5}$$
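Equation 2.5 translates into a few lines of code. The sketch below assumes the support vectors, their labels, and their multipliers are given as sequences, with any kernel function (such as rbf_kernel above) passed in.

import numpy as np

def svm_decide(x, sv_x, sv_y, sv_alpha, b, kernel):
    # Only the support vectors enter the sum; patterns with alpha_i = 0
    # have no influence on the decision (equation 2.5).
    s = sum(y_i * a_i * kernel(x_i, x)
            for x_i, y_i, a_i in zip(sv_x, sv_y, sv_alpha))
    return np.sign(s + b)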
Figure 2: Three categories of training patterns.
Equation 2.5 leads us to an attempt to reduce the whole training set to a subset of would-be SVs.

3 Neighborhood Property–Based Pattern Selection

To circumvent memory- and time-demanding SVM training, we propose a preprocessing algorithm: the neighborhood property–based pattern selection algorithm (NPPS). The idea of the algorithm is to select only those patterns located around the decision boundary, since they are the ones that contain the most information. Contrary to the usually employed random sampling, this approach can be viewed as informative or intelligent sampling. Figure 3 conceptually shows the difference between NPPS and random sampling in selecting a subset of the training data. NPPS selects the patterns in the region around the decision boundary, whereas random sampling selects patterns from the whole input space. Obviously, no one knows how close a pattern is to the decision boundary until a classifier is built. However, we can infer the proximity ahead of training by utilizing neighborhood properties. The first neighborhood property is that "a pattern located near the decision boundary tends to have more heterogeneous neighbors in its class membership." The well-known entropy concept can be utilized to measure the heterogeneity of class labels among the k nearest neighbors, and this measure will lead us to estimate the proximity accordingly. Here, we define a measure
Figure 3: NPPS (a) and random sampling (b) select different subsets. Open circles and squares are the patterns belonging to class 1 and class 2, respectively. Filled circles and squares are the selected patterns.
by using (negative) entropy:

$$\mathrm{Neighbors\_Entropy}(x, k) = \sum_{j=1}^{J} P_j \cdot \log_J \frac{1}{P_j},$$
where j indicates a particular class out of J classes, and P_j is defined as k_j/k, where k_j is the number of the neighbors belonging to class j. In most cases, a pattern with a positive value of Neighbors_Entropy(x, k) is close to the decision boundary and is thus selected. Those patterns are likely to be SVs, corresponding to the margin SVs of Figure 2b or the error SVs of Figure 2c. Among the patterns having a positive value of Neighbors_Entropy(x, k), noisy patterns are also present. Here, let us define an overlap region as a hypothetical region in feature space shared by both classes, and overlap patterns as the patterns in that region. (We defer the finer definitions of overlap region and overlap patterns until section 4.1.) A set of overlap patterns contains patterns located not only on the right side of the decision boundary but also on the other side of it. The latter are noisy patterns, which should be identified and removed as much as possible, because they are more likely to be error SVs that would be misclassified. To remove these noisy patterns, we use the second neighborhood property: an overlap pattern or an outlier tends to belong to a different class from its neighbors. If a pattern's own label differs from the majority label of its neighbors, it is likely to be incorrectly labeled. The measure Neighbors_Match(x, k) is defined as the ratio of neighbors whose label matches that of x:

$$\mathrm{Neighbors\_Match}(x, k) = \frac{\left|\{x' \mid \mathrm{label}(x') = \mathrm{label}(x),\ x' \in k\mathrm{NN}(x)\}\right|}{k},$$
where kNN(x) is the set of k nearest neighbors of x. The patterns with a small Neighbors_Match(x, k) value are likely to be the ones incorrectly labeled. From that point of view, Neighbors_Match can be interpreted as a confidence for k-nearest-neighbor classification. Only the patterns satisfying the two conditions, Neighbors_Entropy(x, k) > 0 and Neighbors_Match(x, k) ≥ 1/J, are selected; a minimal sketch of both measures is given below.
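The following sketch assumes that each pattern's k nearest-neighbor labels have already been collected (the k-NN search itself is omitted); the function names are ours, introduced for illustration.

import numpy as np
from collections import Counter

def neighbors_entropy(neighbor_labels, J):
    # Sum of P_j * log_J(1 / P_j) over the classes present among the neighbors.
    k = len(neighbor_labels)
    return sum((c / k) * (np.log(k / c) / np.log(J))
               for c in Counter(neighbor_labels).values())

def neighbors_match(own_label, neighbor_labels):
    # Fraction of neighbors whose label matches the pattern's own label.
    return sum(l == own_label for l in neighbor_labels) / len(neighbor_labels)

def is_selected(own_label, neighbor_labels, J):
    # Selection rule: Neighbors_Entropy > 0 and Neighbors_Match >= 1/J.
    return (neighbors_entropy(neighbor_labels, J) > 0 and
            neighbors_match(own_label, neighbor_labels) >= 1.0 / J)

# For x_3 in the toy example below, own label 2 with neighbor labels
# (1, 1, 2, 2, 3, 3): neighbors_entropy = 1 and neighbors_match = 1/3,
# so x_3 is selected.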
Figure 4: A toy example. The numbers scattered in the figure (1, 2, and 3) stand for the class labels of the patterns. The parameters are assumed to be J = 3 and k = 6. For representational simplicity, we consider only 6 patterns (out of 29), marked by dotted circles. Table 1 details the values of P_j, Neighbors_Entropy, and Neighbors_Match for these patterns and shows which of them are selected by the proposed measures.
Figure 4 shows a toy example of how patterns are selected by the proposed measures. The numbers scattered in the figure (1, 2, and 3) stand for the class labels of the patterns. We assume J = 3 and k = 6. For representational simplicity, we consider only the six patterns marked by dotted circles. Table 1 presents the values of P_j, Neighbors_Entropy, and Neighbors_Match for these marked patterns. Let us consider x_1 first. x_1 is remote from the decision boundary, belongs to class 1, and is thus surrounded by neighbors belonging to class 1. x_1 is not selected, since it does not satisfy the Neighbors_Entropy condition. Meanwhile, x_2 resides in a deep region of class 2, but it is a noisy pattern labeled as class 1. Since the composition of the class membership of its neighbors is homogeneous (all of them belong to class 2), it also has a zero value of Neighbors_Entropy. Therefore, it is excluded from selection. However, the case of pattern x_2 differs from that of pattern x_1: for x_2, all the neighbors belong to a different class from that of x_2, while for x_1, all the neighbors belong to the same class as x_1. We can differentiate the two cases by Neighbors_Match, which is 1 for x_1 and 0 for x_2. In either case, the patterns remote from the decision boundary are screened out by the proposed measures. On the other hand, pattern x_3 is close to the decision boundary, and the composition of its neighbors shows heterogeneous class membership: two of them belong to class 1, another two belong to class 2, and the rest belong to class 3. This leads to Neighbors_Entropy = 1 and Neighbors_Match = 1/3, and consequently x_3 satisfies the conditions. Similarly, the patterns x_4 and x_5 are chosen. From the example, we can see that the selected patterns are distributed near the decision boundary and are almost correctly labeled. However, NPPS takes O(M²) time to evaluate the kNNs of M patterns, so the pattern selection process itself can be time-consuming. To accelerate the pattern selection procedure, we consider the third neighborhood property: the neighbors of a pattern located near the decision boundary tend to be located near the decision boundary as well.
Table 1: A Toy Example: Patterns Selected by Neighbors_Entropy and Neighbors_Match.

         Class Labels of Neighbors
     j*   j1  j2  j3  j4  j5  j6    P1   P2   P3   Neighbors_Entropy  Neighbors_Match  Selected
x_1   1    1   1   1   1   1   1   6/6  0/6  0/6   0                  6/6
x_2   1    2   2   2   2   2   2   0/6  6/6  0/6   0                  0/6
x_3   2    1   1   2   2   3   3   2/6  2/6  2/6   1                  2/6              X
x_4   3    3   3   2   2   3   1   1/6  2/6  3/6   0.9227             3/6              X
x_5   3    3   3   3   3   3   2   0/6  1/6  5/6   0.4100             5/6              X
x_6   2    3   3   3   3   3   2   0/6  1/6  5/6   0.4100             1/6

Notes: j* is the class label of pattern x_i, and j_k is the label of its kth nearest neighbor. Refer to Figure 4.
NPPS(D, k) {
  [0] Initialize D_e^0 with randomly chosen patterns from D. Constants k and J are given.
      Initialize i and the various sets as follows: i ← 0, S_o^0 ← ∅, S_x^0 ← ∅, S^0 ← ∅.
  while D_e^i ≠ ∅ do
    [1] Choose x satisfying the expanding criterion:
        D_o^i ← {x | Neighbors_Entropy(x, k) > 0, x ∈ D_e^i};  D_x^i ← D_e^i − D_o^i.
    [2] Select x satisfying the selecting criterion:
        D_s^i ← {x | Neighbors_Match(x, k) ≥ 1/J, x ∈ D_o^i}.
    [3] Update the pattern sets:
        S_o^{i+1} ← S_o^i ∪ D_o^i (the expanded),  S_x^{i+1} ← S_x^i ∪ D_x^i (the nonexpanded),
        S^{i+1} ← S^i ∪ D_s^i (the selected).
    [4] Compute the next evaluation set:
        D_e^{i+1} ← ∪_{x ∈ D_o^i} kNN(x) − (S_o^{i+1} ∪ S_x^{i+1}).
    [5] i ← i + 1.
  end
  return S^i
}

Figure 5: NPPS algorithm.
Table 2: Notation.

Symbol    Meaning
D         The original training set, whose cardinality is M.
D_e^i     The evaluation set at the ith step.
D_o^i     A subset of D_e^i, the set of patterns to be expanded from D_e^i; each element computes its k nearest neighbors to constitute the next evaluation set, D_e^{i+1}.
D_x^i     A subset of D_e^i, the set of patterns not to be expanded from D_e^i, or D_x^i = D_e^i − D_o^i.
D_s^i     The set of "selected" patterns from D_o^i at the ith step.
S_o^i     The accumulated set of expanded patterns, ∪_{j=0}^{i−1} D_o^j.
S_x^i     The accumulated set of nonexpanded patterns, ∪_{j=0}^{i−1} D_x^j.
S^i       The accumulated set of selected patterns, ∪_{j=0}^{i−1} D_s^j, the last of which, S^N, is the reduced training pattern set.
kNN(x)    The set of k nearest neighbors of x.
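A minimal Python sketch of the loop in Figure 5 follows. It uses brute-force k-NN search and illustrative names (npps, n_seed for |D_e^0|, X as an (M, d) float array), so it should be read as an assumption-laden outline rather than the authors' implementation.

import numpy as np
from collections import Counter

def npps(X, y, k, J, n_seed=100, seed=0):
    rng = np.random.default_rng(seed)
    M = len(X)
    evaluate = set(rng.choice(M, size=min(n_seed, M), replace=False).tolist())  # D_e^0
    expanded, nonexpanded, selected = set(), set(), set()  # S_o, S_x, S
    while evaluate:
        next_eval = set()
        for i in evaluate:
            d = np.sum((X - X[i]) ** 2, axis=1)
            d[i] = np.inf
            nn = np.argpartition(d, k)[:k]               # k nearest neighbors of x_i
            labels = y[nn]
            if len(Counter(labels.tolist())) > 1:        # Neighbors_Entropy > 0
                expanded.add(i)                          # x_i joins D_o
                if np.mean(labels == y[i]) >= 1.0 / J:   # Neighbors_Match >= 1/J
                    selected.add(i)
                next_eval.update(int(n) for n in nn)     # expand only near the boundary
            else:
                nonexpanded.add(i)                       # x_i joins D_x
        evaluate = next_eval - expanded - nonexpanded    # D_e^{i+1}
    return selected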
Assuming this property, one may narrow the scope of computation from all the training patterns to the patterns near the decision boundary. Only the neighbors of a pattern satisfying Neighbors_Entropy(x, k) > 0 are evaluated in the next step. This lazy evaluation can reduce the practical time complexity from O(M²) to O(vM), where v is the number of patterns in the overlap region. In most practical problems, v < M holds. In addition, any algorithm for efficient nearest-neighbor searching can be incorporated into the proposed method to further reduce the time complexity. There is a considerable amount of literature on efficient searching for nearest neighbors. Some of it attempts to save distance computation time (Grother, Candela, & Blue, 1997; Short & Fukunaga, 1981), and the rest attempts to avoid redundant searching time (Bentley, 1975; Friedman, Bentley, & Finkel, 1977; Guttman, 1984; Masuyama, Kudo, Toyama, & Shimbo, 1999). Apart from these, there are many sophisticated NN classification algorithms, such as approximated (Arya, Mount, Netanyahu, Silverman, & Wu, 1998; Indyk, 1998), condensed (Hart, 1968), and reduced (Gates, 1972) NN. (See also Berchtold, Böhm, Keim, & Kriegel, 1997; Borodin, Ostrovsky, & Rabani, 1999; Ferri, Albert, & Vidal, 1999; Johnson, Ladner, & Riskin, 2000; Kleinberg, 1997; Tsaparas, 1999.) The time complexity analysis for the fast NPPS can be found online at http://www.kyb.mpg.de/publication.html?user=shin and in Shin and Cho (2003a), and a brief empirical result appears in appendix A. The algorithm and related notation are shown in Figure 5 and Table 2.

4 How to Determine the Number of Neighbors

In this section, we introduce a heuristic for determining the value of k. Too large a value of k results in too many patterns being selected.
Consequently, we would achieve little effect of pattern selection. Too small a value of k leads to few patterns being selected, but it may degrade SVM accuracy, since there are more chances of missing important patterns, the would-be support vectors. The dilemma about k is not symmetrical, because a serious loss of accuracy is not likely to be well compensated for by the benefit of a downsized training set. This point relates to our idea: the selected pattern set should be large enough to contain at least the patterns in the overlap region. Therefore, we must first estimate how many training patterns reside in the overlap region. Based on that estimate, the next step is to enlarge the value of k until the size of the corresponding set covers the estimate. Under this condition, the minimum value of k will be regarded as optimal unless SVM accuracy degrades. In the following sections, we first identify the overlap region R and the overlap set V. Second, we estimate the lower bound on the cardinality of V. Third, we define the Neighbors_Entropy set B_k with respect to the value of k, along with some related properties. Finally, we provide a systematic procedure for determining the value of k.

4.1 Overlap Region and Overlap Set. Consider a two-class classification problem (see Figure 6),

$$x \to C_1 \ \text{ if } f(x) > 0, \qquad x \to C_2 \ \text{ if } f(x) < 0, \tag{4.1}$$
where f(x) is a classifier and f(x) = 0 is the decision boundary. Let us define noisy overlap patterns (NOPs) as the patterns that are located on the wrong side of the decision boundary. They are shown in Figure 6 as squares located above f(x) = 0 and circles located below f(x) = 0. Let R denote the overlap region, a hypothetical region where NOPs reside: the convex hull of NOPs, enclosed by the dotted lines in Figure 6. Similarly, let us define correct overlap patterns (COPs) as the patterns that are enveloped by the convex hull yet lie on the right class side. Note that R is defined by NOPs, but it contains not only NOPs but also COPs. Let V denote the overlap set, defined as the intersection of D and R (that is, the subset of D that comprises NOPs and COPs). There are six NOPs and six COPs in Figure 6. The cardinality of V is denoted by v.

4.2 Size Estimation of Overlap Set. Now we estimate v. Let P_R(x) denote the probability that a pattern x falls in the region R. Then we can calculate the expected value of v from the given training set D as

$$v = M P_R(x), \tag{4.2}$$
Figure 6: Two-class classification problem where the circles belong to class 1 and the squares belong to class 2. The area enclosed by the dotted lines is defined as overlap region R. The area comprises the overlap set V of NOPs and COPs.
where M is the number of training patterns, M = |D|. The probability P_R(x) can be dissected class-wise:

$$P_R(x) = \sum_{j=1}^{2} P(x \in R, C_j), \tag{4.3}$$
where P(x ∈ R, C_j) is the joint probability of x belonging to class C_j and lying in R. Note that P_R(x) can also be interpreted as the probability of x being a COP or an NOP. Further, if the region R is divided into

$$R_1 = \{x \in R \mid f(x) \ge 0\} \quad \text{and} \quad R_2 = \{x \in R \mid f(x) < 0\}, \tag{4.4}$$
equation 4.3 can be rewritten as

$$\begin{aligned} P_R(x) &= P(x \in R, C_1) + P(x \in R, C_2) \\ &= P(x \in R_1 \cup R_2, C_1) + P(x \in R_1 \cup R_2, C_2) \\ &= \underbrace{P(x \in R_1, C_2) + P(x \in R_2, C_1)}_{(a)} + \underbrace{P(x \in R_1, C_1) + P(x \in R_2, C_2)}_{(b)}. \end{aligned} \tag{4.5}$$
In the last row, (a) and (b) denote the probabilities of the patterns located in R that are incorrectly and correctly classified (NOPs and COPs, respectively). Since all NOPs are incorrectly classified patterns, (a) can be estimated from the misclassification error rate P_error of the classifier f(x). On the contrary, it is not easy to estimate (b).² Therefore, what we can do in practice is infer a lower bound on it. Generally, the following inequality holds:
j = i.
(4.6)
Based on that, equation 4.5 can be simplified as PR (x ) ≥ 2Perror ,
(4.7)
and the lower bound of v becomes v ≥ v L OW = 2MPerror
(4.8)
from equation 4.2. 4.3 Pattern Set with Positive Neighbors Entropy. Let us define Bk given a specified value of k, Bk = {x | Neighbors Entropy(x , k) > 0, x ∈ D}. Note that Bk is the union of Dio ’s, Bk =
N
(4.9)
Do where N is the total number
i=0
of iterations in the algorithm NPPS (see Figure 5). The following property of Bk leads to a simple procedure. Lemma 1.
A Neighbors Entropy set Bk is a subset of Bk+1 :
Bk ⊆ Bk+1
2 ≤ k ≤ M − 2.
(4.10)
Proof. Denote by P_j^k the probability that k_j out of the k nearest neighbors belong to class C_j. If x ∈ B_k, then Neighbors_Entropy(x, k) > 0. A positive Neighbors_Entropy is always accompanied by P_j^k = k_j/k < 1, ∀j. Therefore, k_j < k, ∀j.
2 Of course, if R_1 and R_2 contain roughly the same numbers of correct and incorrect patterns, the probabilities (a) and (b) become similar. But this assumption works only when both class distributions are uniform.
Adding 1 to both sides yields k_j + 1 < k + 1, ∀j. Suppose the (k + 1)th nearest neighbor of x belongs to C_{j*}. Then k_{j*} + 1 < k + 1 holds for j*, while k_j < k + 1 holds for j ≠ j*. The inequalities lead to P_{j*}^{k+1} < 1 and P_j^{k+1} < 1 for j ≠ j*, respectively. As a consequence, we have a positive Neighbors_Entropy of x in the case of k + 1 nearest neighbors as well. Therefore, Neighbors_Entropy(x, k + 1) > 0, which indicates x ∈ B_{k+1}.

From lemma 1, it follows that b_k, the cardinality of B_k, is an increasing function of k.

4.4 Procedure for Determining the Number of Neighbors. A B_k larger than V merely increases the SVM training time by introducing redundant training patterns. In contrast, a B_k smaller than V could degrade the SVM accuracy. Therefore, our objective is to find the smallest B_k that covers V. Let us define k_min using the lower bound of v from equation 4.8: k_min = min{k | b_k ≥ v_LOW, k ≥ 2}. From lemma 1, we know that b_k is an increasing function of k. Therefore, it is not necessary to evaluate values of k less than k_min. Instead, we check the stabilization of the SVM training error rate P_error^svm only for the k's larger than k_min, increasing the value little by little. The optimal value of k is then chosen as

$$k^* = \arg\min_k \left\{\, \left|P^{\rm svm}_{\rm error}(k) - P^{\rm svm}_{\rm error}(k+1)\right| \le \epsilon,\ \ k \ge k_{\rm min} \,\right\}, \tag{4.11}$$

where ε is set to a trivial value. Equation 4.11 requires several dry runs of SVM training. However, these runs are not likely to impose a heavy computational burden, since the strong inequality b_k ≪ M holds for the first smallest k's, and hence the size of the training set (the selected pattern set) is small. The procedure is summarized in Figure 7.³ Its main contribution is to limit the search space of k by means of the lower-bound estimate of the size of the overlap set. More details can be found in Shin and Cho (2003b), with some empirical results in appendix B.
3 There are several benefits to using the 1-NN rule as a P_error estimator in step 1: it is a local learning algorithm, simple, and computationally efficient. One can skip step 1 if an approximate P_error for the problem is given a priori.
[1] Estimate P_error: P̂_error is calculated from the training error rate of the 1-NN rule.
[2] Calculate the lower bound of v according to equation 4.8: v ≥ v_LOW = 2M P̂_error.
[3] Find k* according to equation 4.11: k* = argmin_k { |P^svm_error(k) − P^svm_error(k+1)| ≤ ε, k ≥ k_min }, where k_min = min{k | b_k ≥ v_LOW, k ≥ 2} and ε is a trivial value.
Figure 7: Procedure to determine the value of k.
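The procedure of Figure 7 can be outlined in code as well. In this sketch, the SVM training error for a given k is supplied by a user-provided callable svm_error (an assumption standing in for an actual training run on the NPPS-selected set), while P_error is estimated with the leave-one-out 1-NN training error of step 1.

import numpy as np

def p_error_1nn(X, y):
    # Step 1: training error rate of the 1-NN rule (leave-one-out).
    M, wrong = len(X), 0
    for i in range(M):
        d = np.sum((X - X[i]) ** 2, axis=1)
        d[i] = np.inf
        wrong += y[int(np.argmin(d))] != y[i]
    return wrong / M

def b_k(X, y, k):
    # Cardinality of B_k = {x : Neighbors_Entropy(x, k) > 0}, brute force.
    count = 0
    for i in range(len(X)):
        d = np.sum((X - X[i]) ** 2, axis=1)
        d[i] = np.inf
        nn = np.argpartition(d, k)[:k]
        count += len(set(y[nn].tolist())) > 1
    return count

def choose_k(X, y, svm_error, eps=1e-3, k_max=50):
    v_low = 2.0 * len(X) * p_error_1nn(X, y)      # step 2, equation 4.8
    k = 2
    while b_k(X, y, k) < v_low and k < k_max:     # k_min: smallest k with b_k >= v_LOW
        k += 1
    while k < k_max and abs(svm_error(k) - svm_error(k + 1)) > eps:
        k += 1                                    # step 3, equation 4.11
    return k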
5 Experiments

We applied NPPS to various kinds of data sets: artificial data sets, benchmarking data sets, and a real-world marketing data set. The following three sections present the experimental results in that order.

5.1 Artificial Data Sets. The first problem, the continuous XOR problem, was drawn from four gaussian densities. A total of 600 training patterns, 300 from each class, were generated. The classes, C_1 and C_2, were defined as

$$C_1 = \left\{ x \,\middle|\, x \in N_{1A} \cup N_{1B},\ \begin{pmatrix} -3 \\ -3 \end{pmatrix} \le x \le \begin{pmatrix} 3 \\ 3 \end{pmatrix} \right\}, \qquad C_2 = \left\{ x \,\middle|\, x \in N_{2A} \cup N_{2B},\ \begin{pmatrix} -3 \\ -3 \end{pmatrix} \le x \le \begin{pmatrix} 3 \\ 3 \end{pmatrix} \right\},$$

where

$$N_{1A} = \left\{ x \,\middle|\, N\!\left( \begin{pmatrix} 1 \\ 1 \end{pmatrix}, \begin{pmatrix} 0.5^2 & 0 \\ 0 & 0.5^2 \end{pmatrix} \right) \right\}, \qquad N_{1B} = \left\{ x \,\middle|\, N\!\left( \begin{pmatrix} -1 \\ -1 \end{pmatrix}, \begin{pmatrix} 0.5^2 & 0 \\ 0 & 0.5^2 \end{pmatrix} \right) \right\},$$

$$N_{2A} = \left\{ x \,\middle|\, N\!\left( \begin{pmatrix} -1 \\ 1 \end{pmatrix}, \begin{pmatrix} 0.5^2 & 0 \\ 0 & 0.5^2 \end{pmatrix} \right) \right\}, \qquad N_{2B} = \left\{ x \,\middle|\, N\!\left( \begin{pmatrix} 1 \\ -1 \end{pmatrix}, \begin{pmatrix} 0.5^2 & 0 \\ 0 & 0.5^2 \end{pmatrix} \right) \right\}.$$
See also Figure 11.1a. The second problem is the sine function problem. The input patterns were generated from a two-dimensional uniform distribution, and the class labels were then determined by whether the pattern was located above or below a sine decision function:

$$C_1 = \left\{ x \,\middle|\, x_2 > \sin(3x_1 + 0.8)^2,\ \begin{pmatrix} 0 \\ -2.5 \end{pmatrix} \le \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \le \begin{pmatrix} 1 \\ 2.5 \end{pmatrix} \right\}, \qquad C_2 = \left\{ x \,\middle|\, x_2 \le \sin(3x_1 + 0.8)^2,\ \begin{pmatrix} 0 \\ -2.5 \end{pmatrix} \le \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \le \begin{pmatrix} 1 \\ 2.5 \end{pmatrix} \right\}.$$
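Both artificial problems can be generated with a few lines of NumPy. This is a sketch under stated assumptions, not the authors' exact sampling code: the clipping of XOR samples to the box is done by rejection, the grouping sin(3x1 + 0.8)**2 is transcribed as printed, and the extra gaussian noise along the boundary described next is omitted.

import numpy as np

rng = np.random.default_rng(0)

def continuous_xor(n_per_class=300):
    cov = 0.5 ** 2 * np.eye(2)
    half = n_per_class // 2
    c1 = np.vstack([rng.multivariate_normal([1, 1], cov, half),
                    rng.multivariate_normal([-1, -1], cov, half)])
    c2 = np.vstack([rng.multivariate_normal([-1, 1], cov, half),
                    rng.multivariate_normal([1, -1], cov, half)])
    X = np.vstack([c1, c2])
    y = np.r_[np.ones(n_per_class), -np.ones(n_per_class)]
    inside = np.all(np.abs(X) <= 3, axis=1)   # keep patterns within [-3, 3]^2
    return X[inside], y[inside]

def sine_function(n=500):
    X = np.column_stack([rng.uniform(0, 1, n), rng.uniform(-2.5, 2.5, n)])
    y = np.where(X[:, 1] > np.sin(3 * X[:, 0] + 0.8) ** 2, 1.0, -1.0)
    return X, y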
To make the density near the decision boundary thicker, four different gaussian noise components were added along the decision boundary: N(µ, s²I), where µ is an arbitrary point on the decision boundary and s is a gaussian width parameter (s = 0.1, 0.3, 0.8, 1.0). A total of 500 training patterns were generated, including the noise. Figure 11.1b shows the problem. For both problems, Figure 8 presents the relationship between a pattern's distance from the decision boundary and its value of Neighbors_Entropy. The closer to the decision boundary a pattern is, the higher the value of Neighbors_Entropy it has. Figures 8a and 8b thus confirm that Neighbors_Entropy is pertinent for estimating proximity to the decision boundary. Among the patterns with positive Neighbors_Entropy values, the ones that meet the Neighbors_Match condition are selected. They are depicted as filled circles against open ones. Note that the filled circles are distributed nearer the decision boundary. The distribution of the selected patterns is shown in Figures 11.3a and 11.3b.

NPPS changes the size as well as the distribution of the training set. Such changes may change the optimal values of the hyperparameters that bring the best result. Thus, we first observed the effect of the change of the training set. The SVM performance was measured under various combinations of hyperparameters, (C, σ) ∈ {0.1, 1.0, 10, 100, 1000} × {0.25, 0.5, 1, 2, 3}, where C and σ indicate the misclassification tolerance and the gaussian kernel width, respectively. For each of the artificial problems, 1000 test patterns were generated from distributions statistically identical to those of the original training set. We compared the test error rates (%) of SVMs trained with all patterns (ALL), with random patterns (RAN), and with the selected patterns (SEL). Figure 9 depicts the test error rate over the hyperparameter variation for the continuous XOR problem. The most pronounced feature is the higher sensitivity of SEL to hyperparameter variation than that of ALL. It may be caused by the fact that the patterns selected by NPPS are mostly distributed in a narrow region along the decision boundary. In Figure 9a, for instance, SEL shows a sudden rise in the test error rate for σ larger than a certain value when C is fixed and, similarly, a sharp descent after a certain value of C when σ is fixed. An interesting point is that SEL can always reach
Figure 8: The relationship between distance from the decision boundary and Neighbors_Entropy. The closer to the decision boundary a pattern is, the higher the value of Neighbors_Entropy it has. The selected patterns are depicted as filled circles against open ones.
Figure 9: Performance comparison over hyperparameter variation: Continuous XOR problem.
a performance comparable to that of ALL by adjusting the values of the hyperparameters. Now we compare SEL and RAN in Figure 9b. We used the same number of random samples as patterns selected by NPPS; therefore, there is no difference in the size of the training set, only a difference in data distribution. When compared with SEL, RAN is less
Table 3: Best Result Comparison: ALL vs. RAN vs. SEL.

                                    Continuous XOR                      Sine Function
                              ALL        RAN        SEL          ALL        RAN          SEL
(C, σ)                     (10, 0.5)  (100, 1)  (100, 0.25)   (10, 0.5)  (0.1, 0.25)  (10, 0.5)
Execution time (sec)         454.83      3.02       3.85        267.76       8.97         8.79
Number of training patterns     600       180        180           500        264          264
Number of support vectors       167        68         84           250        129          136
Test error (%)                 9.67     12.33       9.67         13.33      16.34        12.67
McNemar's test (p-value)          –      0.08       1.00             –       0.12         0.89
sensitive to hyperparameter variation, because its patterns are distributed over the whole region. However, RAN never performs as well as ALL, since random sampling inevitably misses many patterns of importance to SVM training (see the best result of RAN in Table 3). Figure 10 shows similar results for the sine function problem.

Table 3 summarizes the best results of ALL, SEL, and RAN for both problems. First, compared with the training time of ALL, that of SEL is little more than trivial because of the reduced size of the training set. For both artificial problems, a standard QP solver, Gunn's SVM MATLAB toolbox, was used. Considering that model selection always involves multiple trials, each individual reduction in training time can amount to a huge savings overall. Second, compared with RAN, SEL achieved an accuracy on a level similar to the best model of ALL, while RAN could not. To show that there was no significant difference between ALL and SEL, we conducted McNemar's test (Dietterich, 1998). The p-values between ALL and RAN, and between ALL and SEL, are presented in the last row of Table 3. In principle, McNemar's test determines whether classifier A is better than classifier B. A p-value of zero indicates a significant difference between A and B, and a value of one indicates no significant difference. Although the p-value between ALL and RAN is not less than 5% in each problem, one can still compare its degree of difference with the p-value between ALL and SEL in statistical terms. Figure 11 visualizes the experimental results of Table 3 on both problems. Subfigures 11.1 to 11.3 show the results of ALL, RAN, and SEL, in that order. The decision boundary is depicted as a solid line and the margin as a dotted line. Support vectors are outlined. The decision boundary of ALL looks more like that of SEL than that of RAN, which explains why SEL produced results similar to ALL.
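For completeness, the McNemar comparison used above can be sketched as follows, with a_correct and b_correct marking which test patterns each classifier got right. This is an exact two-sided binomial version; Dietterich (1998) also discusses a chi-square variant, and the names here are ours.

from math import comb

def mcnemar_p(a_correct, b_correct):
    # Count patterns on which exactly one of the two classifiers is correct.
    n01 = sum(1 for a, b in zip(a_correct, b_correct) if not a and b)
    n10 = sum(1 for a, b in zip(a_correct, b_correct) if a and not b)
    n = n01 + n10
    if n == 0:
        return 1.0              # the classifiers never disagree
    k = min(n01, n10)
    # Two-sided exact binomial tail under the null (disagreements split 50/50).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)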
Figure 10: Performance comparison over hyperparameter variation: sine function problem.
5.2 Benchmarking Data Sets. Table 4 presents a summary of the comparison results on various real-world benchmarking data sets (MNIST,
Figure 11: Patterns and SVM decision boundaries for the continuous XOR problem (a, left column) and the sine function problem (b, right column): (1) ALL, (2) RAN, (3) SEL. A decision boundary is depicted as a solid line and the margins as dotted lines. Support vectors are outlined. The decision boundary of ALL looks more like that of SEL than that of RAN, which explains why SEL performed similarly to ALL.
Table 4: Empirical Results for Benchmarking Data Sets.

                                       Number of  Number of  Preprocessing  Training      Test Error  McNemar's
                                       Training   Support    Time (Ratio)   Time (Ratio)  Rate (%)    p-Value
                                       Patterns   Vectors
Continuous XOR: 1000 test patterns
  ALL     C = 10, σ = 0.5                 600        167           –            1.00         9.67        –
  RAN     C = 100, σ = 1                  180         68           –            0.10        12.33       0.08
  SEL     C = 100, σ = 0.25, k = 5        180         84         0.20           0.12         9.67       1.00
  SVM-KM  C = 1, σ = 0.25, k = 60         308        149         0.78           0.16        10.00       0.69
Sine function: 1000 test patterns
  ALL     C = 10, σ = 0.5                 500        250           –            1.00        13.33        –
  RAN     C = 0.1, σ = 0.25               264        129           –            0.31        16.34       0.12
  SEL     C = 10, σ = 0.5, k = 5          264        136         0.26           0.29        12.67       0.89
  SVM-KM  C = 50, σ = 0.25, k = 50        335        237         2.04           0.27        14.00       0.81
4 × 4 checkerboard: 10,000 test patterns
  ALL     C = 20, σ = 0.25               1000        172           –            1.00         4.03        –
  RAN     C = 50, σ = 0.5                 275         75           –            0.42         8.44       0.00
  SEL     C = 50, σ = 0.25, k = 4         275        148         0.08           0.40         4.66       0.76
  SVM-KM  C = 20, σ = 0.25, k = 100       492        159        49.13           0.33         5.20       0.00
Pima Indian Diabetes: 153 test patterns (5-cv out of 768 patterns)
  ALL     C = 100, p = 2                  615        330           –            1.00        29.90        –
  RAN     C = 1, p = 1                    311        175           –            0.47        31.11       0.61
  SEL     C = 100, p = 2, k = 4           311        216         0.13           0.58        30.30       0.87
  SVM-KM  C = 100, p = 2, k = 62          561        352         0.48           0.98        28.75       0.72
Wisconsin Breast Cancer: 136 test patterns (5-cv out of 683 patterns)
  ALL     C = 5, p = 3                    546         87           –            1.00         6.80        –
  RAN     C = 1, p = 1                     96         27           –            0.05        12.33       0.60
  SEL     C = 10, p = 3, k = 6             96         41         0.01           0.06         6.70       0.94
  SVM-KM  C = 10, p = 3, k = 55           418         85        17.08           0.62        11.02       0.53
MNIST 3–8: 1984 test patterns
  ALL     C = 10, p = 5                11,982       1253           –            1.00         0.50        –
  RAN     C = 10, p = 5                  4089       1024           –            0.15         0.86       0.19
  SEL     C = 50, p = 5, k = 50          4089       1024         0.21           0.10         0.45       0.42
  SVM-KM  k = 1198                         NA         NA           NA             NA           NA        NA
MNIST 6–8: 1932 test patterns
  ALL     C = 10, p = 5                11,769        594           –            1.00         0.25        –
  RAN     C = 10, p = 5                  1135        185           –            0.03         0.83       0.01
  SEL     C = 20, p = 5, k = 50          1135        421         0.20           0.07         0.25       0.67
  SVM-KM  k = 1177                         NA         NA           NA             NA           NA        NA
MNIST 9–8: 1983 test patterns
  ALL     C = 10, p = 5                11,800        823           –            1.00         0.41        –
  RAN     C = 10, p = 5                  1997        323           –            0.07         0.99       0.01
  SEL     C = 10, p = 5, k = 40          1997        631         0.19           0.09         0.43       0.57
  SVM-KM  k = 1180                         NA         NA           NA             NA           NA        NA
1998; UCI, 1995), including the artificial data sets from the previous section. Again, SEL is compared with ALL and RAN, and another method is added as a competitor: SVM-KM (Almeida et al., 2000), a preprocessing method that reduces the training set based on the k-means clustering algorithm. The hyperparameters of SVM were chosen based on the best results using validation: C, σ, and p indicate the misclassification tolerance, the width of the RBF kernel, and the order of the polynomial kernel, respectively. The parameter of SEL (NPPS), the number of neighbors k, was determined by the procedure of section 4 and appendix B (see also Shin & Cho, 2003b). The parameter of SVM-KM, the number of clusters k, was set to 10% of the number of training patterns, as recommended in Almeida et al. (2000). From the MNIST (1998) data set, we chose three binary classification problems that are known to be among the most difficult because of the similar shapes of the digits; for instance, the digit 3 looks similar to the digit 8. Most of the experiments were run on a Pentium 4 with 1.9 GHz and 512 MB RAM, whereas the MNIST problems were run on a Pentium 4 with 3.0 GHz and 1024 MB RAM. A standard QP solver lacks adequate memory capacity to handle the real-world data sets, so we chose an iterative SVM solver, the OSU SVM Classifier Toolbox (Kernel Machines Organization, n.d.), after a brief scalability comparison with another candidate, RSVM (Lee & Mangasarian, 2001). (Refer to appendix C for the comparison of the OSU SVM Classifier with RSVM.)

In Table 4, N/A denotes results that are not available. For ease of comparison, the computational time, whether preprocessing or SVM training, is shown as a ratio to the SVM training time of ALL. The two best results per problem are shown in boldface in the test error rate column; similarly, the result statistically most similar to ALL is shown in boldface in the McNemar's p-value column. The results show that preprocessing by either SEL or SVM-KM is of great benefit to SVM training in reducing the training time. In particular, if multiple runs of SVM are required, one reaps the benefit as many times as required. However, SVM-KM took longer than SEL and, worse, ran out of memory on the large-sized problems. Also note that the pairwise test shows that SEL is more similar to ALL than any other method.

5.3 Real-World Marketing Data Set. The proposed algorithm was also applied to a marketing data set from the Direct Marketing Association (n.d.). The data set, DMEF4, has been used in other research (Ha, Cho, & MacLachlan, 2005; Malthouse, 2001, 2002). It concerns an upscale gift business that mails general or specialized catalogs to customers. The task is to predict whether a customer will respond to the offering during the test period, September 1992 to December 1992. A customer who received the catalog and bought the product is labeled +1; otherwise, he or she is labeled −1. The training set is based on the period December 1971 to June 1992. There are 101,532 patterns in the data set, each representing the purchase history of a customer. We derived 17 input variables out of the 91 original ones, just as in Ha et al. (2005) and Malthouse (2001).
To show the effectiveness of SEL, we compared it with seven RANs, because random sampling has been the most commonly employed approach when researchers in this field attempt to reduce the size of the training set. Table 5 shows the models: RAN* denotes an SVM trained with random samples, where * indicates the percentage of random samples drawn without replacement. Each model was trained and evaluated using fivefold cross-validation. The hyperparameters of SVM were determined from (C, σ) ∈ {0.1, 1, 10, 100, 1000} × {0.25, 0.5, 1, 2, 3}. The number of neighbors (k) for SEL was set to 4. The OSU SVM Classifier was used as the SVM solver (Kernel Machines Organization, n.d.). Typically, a DMEF data set (Direct Marketing Association, n.d.) has a severe class imbalance problem, because the response rate to the retailer's offer is very low: 9.4% for DMEF4. In that case, an ordinary accuracy measure tends to mislead by giving more weight to the heavily represented class. Thus, we used another accuracy measure, the Balanced Correct-classification Rate (BCR), defined as

$$\mathrm{BCR} = \frac{m_{11}}{m_1} \cdot \frac{m_{22}}{m_2},$$
where m_i denotes the size of class i and m_ii is the number of patterns correctly classified into class i. The performance measure is now balanced between the two class sizes; a small computational sketch is given below. Figure 12 shows the BCRs of the eight SVM models, and Table 6 presents the details. More patterns result in higher BCR among the RAN*'s. However, the training time also increases proportionally to the number of training patterns, peaking at 4820 sec for RAN100. SEL takes only 68 sec, or 129 sec if the NPPS running time is included. Note that one has to perform SVM training several times to find a set of optimal hyperparameters, but NPPS runs only once, ahead of the whole training procedure. In the last column of the table, the p-value of McNemar's test is listed for SEL compared with each individual RAN*. There is a statistically significant difference in accuracy between SEL and RAN up to RAN60, but no difference between SEL and RAN80 or RAN100. Overall, SEL achieves almost the same accuracy as RAN80 or RAN100 with a number of training patterns comparable to RAN10 or RAN20.
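As a small sketch, BCR follows directly from the binary confusion counts; the names m11, m12, m21, m22 are ours, with m_ij counting class-i patterns classified as class j (a direct reading of the definition above).

def balanced_correct_rate(m11, m12, m21, m22):
    m1, m2 = m11 + m12, m21 + m22    # class sizes
    return (m11 / m1) * (m22 / m2)   # per-class correct rates, multiplied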
Table 5: SVM Models.

Model    (C, σ)       Number of patterns
RAN05    (100, 1)       4060 (5%)
RAN10    (100, 1)       8121 (10%)
RAN20    (100, 0.5)    16,244 (20%)
RAN40    (100, 0.5)    32,490 (40%)
RAN60    (10, 1)       48,734 (60%)
RAN80    (10, 0.5)     64,980 (80%)
RAN100   (10, 0.5)     81,226 (100%)
SEL      (10, 0.5)      8871, average

Notes: RAN* denotes an SVM trained with random samples, where * indicates the percentage of random sampling. Note that RAN100 corresponds to ALL. The number of patterns of SEL varies slightly with the given set of each fold of 5-CV and is therefore reported as an average.

Figure 12: Balanced Correct-classification Rate (BCR): each RAN* is depicted as a filled circle, and SEL is represented as a dotted reference line.

Table 6: Empirical Results for a Real-World Marketing Data Set, DMEF4.

          Number of   Number of   Training     Correct Rate,  McNemar's Test
          Training    Support     Time (sec)   BCR (%)        (p-value)
          Patterns    Vectors
RAN05       4060        1975         13.72        49.49          0.00
RAN10       8121        4194         56.67        56.17          0.00
RAN20      16,244       7463        149.42        58.05          0.00
RAN40      32,490      14,967       652.11        61.02          0.04
RAN60      48,734      22,193      1622.06        63.86          0.08
RAN80      64,980      28,968      2906.97        64.92          0.64
RAN100     81,226      35,529      4820.06        65.17          0.87
SEL         8871        6624         68.29        64.92           –
6 Conclusions and Discussion

In this letter, we introduced an informative sampling method for the SVM classification task. By preselecting the patterns near the decision boundary, one can relieve the computational burden during SVM training. In the experiments, we compared the performance of the pattern set selected by NPPS (SEL) with that of a random sample set (RAN) and that of the original training set (ALL). We also compared the proposed method with a competing method, SVM-KM (Almeida et al., 2000). Through comparisons on synthetic and real-world problems, we empirically validated the efficiency of the proposed algorithm. SEL achieved an accuracy similar to that of ALL, while its computational cost was similar to that of RAN with 10% or 20% of the samples. When compared with SVM-KM, SEL achieved better computational efficiency.

Here, we would like to address some future work. First, SVM solves a multiclass problem by a divide-and-combine strategy, which divides the multiclass problem into several binary subproblems (e.g., one-versus-others or one-versus-one) and then combines the outputs. This has confined our application of NPPS to binary class problems. However, NPPS can readily be extended to multiclass problems without major correction. Second, NPPS
can also be utilized to reduce the lengthy training time of neural network classifiers. In that case, it is necessary to add extra correct patterns to the selected pattern set to enhance the overlap region near the decision boundary (Choi & Rockett, 2002; Hara & Nakayama, 2000). The rationale is that overlap patterns located on the wrong side of the decision boundary would lengthen MLP training: the derivatives of the backpropagated errors are very small when evaluated at those patterns, since they are grouped in a narrow region on either side of the decision boundary. By adding extra correct patterns, however, the network training converges faster. In the meantime, the idea of adding some randomly selected patterns to SEL may also relieve the higher sensitivity of SEL to hyperparameter variation (see Figure 9). If the high sensitivity is caused by the narrow distribution of the selected patterns, it can be relaxed by some random patterns from the overall input space. But the mixing ratio of the patterns from SEL and from RAN requires further study. Third, the current version of NPPS works for classification problems only and thus is not applicable to regression problems. A straightforward idea for regression would be to use the mean (µ) and variance (σ²) of the k nearest neighbors' outputs. A pattern having a small value of σ² can be replaced by the µ of its neighbors, including itself; that is, k + 1 patterns can be replaced by one pattern. On the contrary, a pattern having a large value of σ² can be eliminated entirely, since in a regression problem the patterns located away from the major group, such as outliers, are less important to learning. But its neighbors should still be used to explore the next pattern. Similar research based on ensembles of neural networks was conducted by Shin and Cho (2001). Fourth, one interesting future direction is the effort to extend NPPS to data with concept drift (Aha, Kibler, & Albert, 1991; Bartlett, Ben-David, & Kulkarni, 2000; Helmbold & Long, 1994; Klinkenberg, 2004; Pratt & Tschapek, 2003; Stanley, 2003; Widmer & Kubat, 1996). In concept drift, the decision boundary changes as data arrive in the form of a stream. A naive approach would be to employ a moving window over the data stream and then run NPPS repeatedly. For better results, NPPS should be substantially modified and rigorously tested.

Appendix A: Empirical Complexity Analysis

We empirically show that NPPS runs in approximately O(vM) time, where M is the number of training patterns and v is the number of overlap patterns (a full theoretical analysis is available in Shin & Cho, 2003a). A total of M patterns, half from each class, were randomly generated from a pair of two-dimensional uniform distributions:
$$C_1 = \left\{ x \,\middle|\, U\!\left( \left(-1,\; 0 - \tfrac{1}{2}\tfrac{v}{M}\right) < x < \left(1 - \tfrac{1}{2}\tfrac{v}{M},\; 1\right) \right) \right\},$$

$$C_2 = \left\{ x \,\middle|\, U\!\left( \left(-1 + \tfrac{1}{2}\tfrac{v}{M},\; -1\right) < x < \left(1,\; 0 + \tfrac{1}{2}\tfrac{v}{M}\right) \right) \right\}.$$
Figure 13: Overlap of two uniform distributions. The dark gray area is the overlap region that contains v patterns. The number of overlap patterns, v, is set to every decile of training set size, M. (a, b, c) v = 0.3M, v = 0.5M, and v = 0.7M, respectively.
We set v to every decile of M: v = 0, 0.1M, 0.2M, . . . , 0.9M, M. Figure 13 shows the distributions for v = 0.3M, v = 0.5M, and v = 0.7M. Larger values of v correspond to more overlap patterns. We set out to see how the computation time of NPPS changes with the value of v—in particular, whether the computation time is linearly proportional to v, as mentioned.
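As an illustration, the following is a minimal sketch of this synthetic data generation, assuming the C1 and C2 bounds given above; the function name and the use of NumPy are our own choices, not part of the original experiments.

```python
import numpy as np

def generate_overlap_data(M, v, rng=np.random.default_rng(0)):
    """Draw M/2 patterns per class from the uniform rectangles C1 and C2.

    With the bounds above, the two rectangles overlap in a central band
    whose size grows with v, so roughly v patterns fall in the overlap.
    """
    a = 0.5 * v / M                                  # the quantity v/(2M)
    half = M // 2
    # C1: lower-left corner (-1, -a), upper-right corner (1 - a, 1).
    x1 = rng.uniform(low=[-1.0, -a], high=[1.0 - a, 1.0], size=(half, 2))
    # C2: lower-left corner (-1 + a, -1), upper-right corner (1, a).
    x2 = rng.uniform(low=[-1.0 + a, -1.0], high=[1.0, a], size=(half, 2))
    X = np.vstack([x1, x2])
    y = np.array([1] * half + [2] * half)            # class labels 1 and 2
    return X, y
```

Timing NPPS on batches generated this way for v = 0, 0.1M, . . . , M would reproduce the kind of measurement reported in Figure 14.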
Figure 14: Empirical complexity analysis: NPPS with increasing number of overlap patterns v. (a) The number of selected patterns (%). (b) Computation time.
Figure 15: SVM test error rates. (a) Pima Indians Diabetes (k* = 4). (b) Wisconsin Breast Cancer (k* = 6). The error rate is stabilized at about 30.0% for Pima Indians Diabetes when k ≥ 4 and at about 6.7% for Wisconsin Breast Cancer when k ≥ 6.
Table 7: Empirical Comparison of Different SVM Solvers.

                                           Number of          Time Ratio:   Test Error   McNemar's
                                           Training Patterns  Training      Rate (%)     p-Value
Continuous XOR: 1000 test patterns
  OSU-SVM (C = 10, σ = 0.5)                600                1.00          9.67         –
  RSVM (C = 10, σ = 0.5)                   600                1.20          9.97         0.79
Sine function: 1000 test patterns
  OSU-SVM (C = 10, σ = 0.50)               500                1.00          13.33        –
  RSVM (C = 100, σ = 0.25)                 500                0.31          13.46        0.85
4 × 4 checkerboard: 10,000 test patterns
  OSU-SVM (C = 20, σ = 0.25)               1000               1.00          4.03         –
  RSVM (C = 100, σ = 0.25)                 1000               1.87          3.88         0.00
Pima Indians Diabetes: 153 test patterns (5-cv out of 768 patterns)
  OSU-SVM (C = 100, p = 2)                 615                1.00          29.90        –
  RSVM (C = 10, p = 1)                     615                0.10          29.64        0.72
Wisconsin Breast Cancer: 136 test patterns (5-cv out of 683 patterns)
  OSU-SVM (C = 5, p = 3)                   546                1.00          6.80         –
  RSVM (C = 10, p = 3)                     546                1.65          8.02         0.45
MNIST 3–8: 1984 test patterns
  OSU-SVM (C = 10, p = 5)                  11982              1.00          0.50         –
  RSVM                                     11982              N/A           N/A          N/A
MNIST 6–8: 1932 test patterns
  OSU-SVM (C = 10, p = 5)                  11769              1.00          0.25         –
  RSVM                                     11769              N/A           N/A          N/A
MNIST 9–8: 1983 test patterns
  OSU-SVM (C = 10, p = 5)                  11800              1.00          0.41         –
  RSVM                                     11800              N/A           N/A          N/A
Figure 14a shows the number of selected patterns as v grows from 0 up to M = 10,000. In Figure 14b, one can clearly see that the computation time is proportional to v.

Appendix B: Experiments on the Number of Neighbors

Based on the procedure presented in Figure 7, we briefly present experiments on Pima Indians Diabetes and Wisconsin Breast Cancer. According to the procedure, the lower bounds of v were estimated as 393 and 108 from the training error rates, P_error = 32.0% and P_error = 9.9%, respectively. These led to k* = 4 in Pima Indians Diabetes and k* = 6 in Wisconsin Breast Cancer. (The values of k_min were also 4 and 6, respectively.) Figure 15
shows that the test error rate stabilizes at about 30.0% for Pima Indians Diabetes when k ≥ 4 and at about 6.7% for Wisconsin Breast Cancer when k ≥ 6.

Appendix C: Empirical Comparison of SVM Solvers

We provide a short comparison between OSU-SVM (OSU SVM Classifier, Kernel Machines Organization) and RSVM (Lee & Mangasarian, 2001), which are among the fastest known SVM learning algorithms. All of the experimental settings were as identified in section 5.2. The parameter of RSVM, the random sampling ratio, was set to 5% to 10% of the training patterns, following Lee and Mangasarian (2001). In Table 7, the computational time of RSVM is shown as a ratio to the SVM training time of OSU-SVM. The results show that the two algorithms achieve similar accuracy. In training time as well, it is hard to tell which is superior to the other. However, RSVM was not available for the MNIST (1998) problems: it failed to load the kernel matrix into memory (note that RSVM is not an iterative solver). Therefore, we chose to use OSU-SVM as the base learner in our experiments because it is relatively more scalable and also parameter free.

Acknowledgments

H.S. was partially supported by the grant for Post Brain Korea 21. S.C. was supported by grant no. R01-2005-000-103900-0 from the Basic Research Program of the Korea Science and Engineering Foundation, Brain Korea 21, and the Engineering Research Institute of SNU.

References

Aha, D., Kibler, D., & Albert, M. K. (1991). Instance-based learning algorithms. Machine Learning, 6(1), 37–66.
Almeida, M. B., Braga, A., & Braga, J. P. (2000). SVM-KM: Speeding SVMs learning with a priori cluster selection and k-means. In Proc. of the 6th Brazilian Symposium on Neural Networks (pp. 162–167). Washington, DC: IEEE Computer Society.
Arya, S., Mount, D. M., Netanyahu, N. S., Silverman, R., & Wu, A. Y. (1998). An optimal algorithm for approximate nearest neighbor searching in fixed dimensions. Journal of the ACM, 45(6), 891–923.
Bartlett, P., Ben-David, S., & Kulkarni, S. (2000). Learning changing concepts by exploiting the structure of change. Machine Learning, 41, 153–174.
Bentley, J. L. (1975). Multidimensional binary search trees used for associative searching. Communications of the ACM, 18, 509–517.
Berchtold, S., Böhm, C., Keim, D. A., & Kriegel, H.-P. (1997). A cost model for nearest neighbor search in high-dimensional data space. In Proc. of Annual ACM Symposium on Principles of Database Systems (pp. 78–86). New York: ACM.
Borodin, A., Ostrovsky, R., & Rabani, Y. (1999). Lower bound for high dimensional nearest neighbor search and related problems. In Proc. of the 31st ACM Symposium on Theory of Computing (pp. 312–321). New York: ACM.
Brinker, K. (2003). Incorporating diversity in active learning with support vector machines. In Proc. of the 20th International Conference on Machine Learning (ICML) (pp. 59–66). Menlo Park, CA: AAAI Press.
Byun, H., & Lee, S. (2002). Applications of support vector machines for pattern recognition: A survey. Lecture Notes in Computer Science, 2388, 213–236.
Campbell, C., Cristianini, N., & Smola, A. J. (2000). Query learning with large margin classifiers. In Proc. of the 17th International Conference on Machine Learning (ICML) (pp. 111–118). San Francisco: Morgan Kaufmann.
Cauwenberghs, G., & Poggio, T. (2001). Incremental and decremental support vector machine learning. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 409–415). Cambridge, MA: MIT Press.
Choi, S. H., & Rockett, P. (2002). The training of neural classifiers with condensed dataset. IEEE Transactions on Systems, Man, and Cybernetics–Part B: Cybernetics, 32(2), 202–207.
Cristianini, N., & Shawe-Taylor, J. (2000). An introduction to support vector machines and other kernel-based learning methods. Cambridge: Cambridge University Press.
Dietterich, T. G. (1998). Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10, 1895–1923.
Direct Marketing Association. (N.d.). DMEF data set library. Available online at http://www.the-dma.org/dmef/dmefdset.shtml.
Dumais, S. (1998). Using SVMs for text categorization. IEEE Intelligent Systems, 13(4), 21–23.
Ferri, F. J., Albert, J. V., & Vidal, E. (1999). Considerations about sample-size sensitivity of a family of edited nearest-neighbor rules. IEEE Transactions on Systems, Man, and Cybernetics–Part B: Cybernetics, 29(4), 667–672.
Friedman, J. H., Bentley, J. L., & Finkel, R. A. (1977). An algorithm for finding best matches in logarithmic expected time. ACM Transactions on Mathematical Software, 3(3), 209–226.
Gates, G. W. (1972). The reduced nearest neighbor rule. IEEE Transactions on Information Theory, IT-18(3), 431–433.
Grother, P. J., Candela, G. T., & Blue, J. L. (1997). Fast implementations of nearest neighbor classifiers. Pattern Recognition, 30(3), 459–465.
Guttman, A. (1984). R-trees: A dynamic index structure for spatial searching. In Proc. of the ACM SIGMOD International Conference on Management of Data (pp. 47–57). New York: ACM.
Ha, K., Cho, S., & MacLachlan, D. (2005). Response models based on bagging neural networks. Journal of Interactive Marketing, 19, 17–30.
Hara, K., & Nakayama, K. (2000). A training method with small computation for classification. In Proc. of the IEEE-INNS-ENNS International Joint Conference (Vol. 3, pp. 543–548). Washington, DC: IEEE Computer Society.
Hart, P. E. (1968). The condensed nearest neighbor rule. IEEE Transactions on Information Theory, IT-14(3), 515–516.
Hearst, M. A., Schölkopf, B., Dumais, S., Osuna, E., & Platt, J. (1997). Trends and controversies—support vector machines. IEEE Intelligent Systems, 13, 18–28.
Heisele, B., Poggio, T., & Pontil, M. (2000). Face detection in still gray images (Tech. Rep. No. 1687). Cambridge, MA: MIT AI Lab.
Helmbold, D., & Long, P. (1994). Tracking drifting concepts by minimizing disagreements. Machine Learning, 14(1), 27–46.
Hush, D., Kelly, P., Scovel, C., & Steinwart, I. (2006). QP algorithms with guaranteed accuracy and run time for support vector machines. Journal of Machine Learning Research, 7, 733–769.
Indyk, P. (1998). On approximate nearest neighbors in non-Euclidean spaces. In Proc. of IEEE Symposium on Foundations of Computer Science (pp. 148–155). Washington, DC: IEEE Computer Society.
Joachims, T. (2002). Learning to classify text using support vector machines: Methods, theory and algorithms. Norwell, MA: Kluwer.
Johnson, M. H., Ladner, R. E., & Riskin, E. A. (2000). Fast nearest neighbor search of entropy-constrained vector quantization. IEEE Transactions on Image Processing, 9(9), 1435–1437.
Kernel Machines Organization. (N.d.). Available online at http://www.kernel-machines.org/.
Kleinberg, J. M. (1997). Two algorithms for nearest-neighbor search in high dimensions. In Proc. of the 29th ACM Symposium on Theory of Computing (pp. 599–608). New York: ACM.
Klinkenberg, R. (2004). Learning drifting concepts: Example selection vs. example weighting. Intelligent Data Analysis, 8(3), 281–300.
Koggalage, R., & Halgamuge, S. (2004). Reducing the number of training samples for fast support vector machine classification. Neural Information Processing—Letters and Reviews, 2(3), 57–65.
Laskov, P. (2002). Feasible direction decomposition algorithms for training support vector machines. Machine Learning, 46, 315–349.
Lee, Y., & Mangasarian, O. L. (2001). RSVM: Reduced support vector machines. In Proc. of the SIAM International Conference on Data Mining. Available online at http://www.cs.wisc.edu/∼olvi/olvi.html.
Liu, C. L., & Nakagawa, M. (2001). Evaluation of prototype learning algorithms for nearest-neighbor classifier in application to handwritten character recognition. Pattern Recognition, 34, 601–615.
Lyhyaoui, A., Martinez, M., Mora, I., Vazquez, M., Sancho, J., & Figueiras-Vidal, A. R. (1999). Sample selection via clustering to construct support vector-like classifiers. IEEE Transactions on Neural Networks, 10(6), 1474–1481.
Malthouse, E. C. (2001). Assessing the performance of direct marketing models. Journal of Interactive Marketing, 15(1), 49–62.
Malthouse, E. C. (2002). Performance-based variable selection for scoring models. Journal of Interactive Marketing, 16(4), 10–23.
Masuyama, N., Kudo, M., Toyama, J., & Shimbo, M. (1999). Termination conditions for a fast k-nearest neighbor method. In Proc. of the 3rd International Conference on Knowledge-Based Intelligent Information Engineering Systems (pp. 443–446). New York: IEEE Computer Society.
MNIST. (1998). The MNIST database of handwritten digits. Available online at http://yann.lecun.com/exdb/mnist/.
Moghaddam, B., & Yang, M. H. (2000). Gender classification with support vector machines. In Proc. of International Conference on Pattern Recognition (pp. 306–311). Washington, DC: IEEE Computer Society.
Ong, C. S., Smola, A. J., & Williamson, R. C. (2005). Learning the kernel with hyperkernels. Journal of Machine Learning Research, 6, 1045–1071.
Osuna, E., Freund, R., & Girosi, F. (1997). Training support vector machines: An application to face detection. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition (pp. 130–136). New York: IEEE Computer Society.
Platt, J. C. (1999). Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods: Support vector machines (pp. 185–208). Cambridge, MA: MIT Press.
Pontil, M., & Verri, A. (1998). Properties of support vector machines. Neural Computation, 10, 955–974.
Pratt, K. B., & Tschapek, G. (2003). Visualizing concept drift. In Proc. of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 735–740). New York: ACM.
Schohn, G., & Cohn, D. (2000). Less is more: Active learning with support vector machines. In Proc. of the 17th International Conference on Machine Learning (ICML) (pp. 839–846). San Francisco: Morgan Kaufmann.
Schölkopf, B., Burges, C. J. C., & Vapnik, V. (1995). Extracting support data for a given task. In Proc. of the 1st International Conference on Knowledge Discovery and Data Mining (pp. 252–257). Menlo Park, CA: AAAI Press.
Schölkopf, B., Burges, C. J. C., & Smola, A. J. (1999). Advances in kernel methods: Support vector machines. Cambridge, MA: MIT Press.
Schölkopf, B., & Smola, A. J. (2002). Learning with kernels: Support vector machines, regularization, optimization, and beyond. Cambridge, MA: MIT Press.
Shin, H. J., & Cho, S. (2001). Pattern selection using the bias and variance of ensemble. Journal of the Korean Institute of Industrial Engineers, 28(1), 112–127.
Shin, H. J., & Cho, S. (2003a). Fast pattern selection algorithm for support vector classifiers: Time complexity analysis (pp. 1008–1015). Available online at http://www.kyb.mpg.de/publication.html?user=shin.
Shin, H. J., & Cho, S. (2003b). How many neighbors to consider in pattern preselection for support vector classifiers? In Proc. of the International Joint Conference on Neural Networks (IJCNN) (pp. 565–570). Washington, DC: IEEE Computer Society.
Short, R. D., & Fukunaga, K. (1981). The optimal distance measure for nearest neighbor classification. IEEE Transactions on Information Theory, IT-27(5), 622–627.
Sohn, S., & Dagli, C. H. (2001). Advantages of using fuzzy class memberships in self-organizing map and support vector machines. In Proc. of International Joint Conference on Neural Networks (IJCNN).
Sonnenburg, S., Rätsch, G., Schäfer, S., & Schölkopf, B. (2006). Large scale multiple kernel learning. Journal of Machine Learning Research, 7, 1531–1565.
Stanley, K. O. (2003). Learning concept drift with a committee of decision trees (Tech. Rep. No. UT-AI-TR-03-302). Austin: Department of Computer Sciences, University of Texas at Austin.
Tsaparas, P. (1999). Nearest neighbor search in multidimensional spaces (Tech. Rep. No. 319-02). Toronto: Department of Computer Science, University of Toronto.
UCI. (1995). UCI repository of machine learning databases and domain theories. Available online at http://www.ics.uci.edu/∼mlearn/.
Vapnik, V. (1999). The nature of statistical learning theory (2nd ed.). Berlin: Springer-Verlag.
Widmer, G., & Kubat, M. (1996). Learning in the presence of concept drift and hidden contexts. Machine Learning, 23(1), 69–101.
Zheng, S. F., Lu, X. F., Zheng, N. N., & Xu, W. P. (2003). Unsupervised clustering based reduced support vector machines. In Proc. of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) (Vol. 2, pp. 821–824). Washington, DC: IEEE Computer Society.
Received January 6, 2006; accepted July 24, 2006.
LETTER
Communicated by Marco Saerens
Transductive Methods for the Distributed Ensemble Classification Problem

David J. Miller
[email protected]
Siddharth Pal
[email protected]
Department of Electrical Engineering, Pennsylvania State University, University Park, PA 16802-2701
We consider ensemble classification for the case where there is no common labeled training data for jointly designing the individual classifiers and the function that aggregates their decisions. This problem, which we call distributed ensemble classification, applies when individual classifiers operate (perhaps remotely) on different sensing modalities and when combining proprietary or legacy classifiers. The conventional wisdom in this case is to apply fixed rules of combination such as voting methods or rules for aggregating probabilities. Alternatively, we take a transductive approach, optimizing the combining rule for an objective function measured on the unlabeled batch of test data. We propose maximum likelihood (ML) objectives that are shown to yield well-known forms of probabilistic aggregation, albeit with iterative, expectation-maximization-based adjustment to account for mismatch between class priors used by individual classifiers and those reflected in the new data batch. These methods are extensions, for the ensemble case, of the work of Saerens, Latinne, and Decaestecker (2002). We also propose an information-theoretic method that generally outperforms the ML methods, better handles classifier redundancies, and addresses some scenarios where the ML methods are not applicable. This method also handles well the case of missing classes in the test batch. On UC Irvine benchmark data, all our methods give improvements in classification accuracy over the use of fixed rules when there is prior mismatch.

1 Introduction

Ensemble classification systems form ultimate decisions by aggregating the (hard or soft) decisions made by classifiers in a collection or ensemble. This paradigm is often motivated by considerations of accuracy, taking into account factors that affect (stand-alone) classifier performance such as limited training data, local optima of the training objective, and a lack of prior knowledge of the class-discriminating features and their underlying
distributions. Diversity of classifiers in an ensemble—in the feature spaces they use, their structures, their parameter initializations, and their training data—can all help to reduce the inductive bias associated with a single classifier. Ensembles have been justified, for example, under the assumption of statistically independent decision making (Hansen & Salamon, 1990) and from the standpoint of variance reduction (Tumer & Ghosh, 1996). We can broadly categorize ensemble techniques according to their method of training. For jointly optimized ensembles, the base classifiers and the aggregation function that combines their decisions are jointly designed, using a common set of training (and possibly validation) data. Such methods include boosting (Freund & Schapire, 1996), mixture of experts (Jacobs, Jordan, Nowlan, & Hinton, 1991), and other techniques (e.g., Dietterich & Kong, 1995). A second category focuses on training the aggregation rule, given fixed classifiers. An example is Kang, Kim, and Kim (1997). Finally, there are techniques that do not involve any training of the aggregation rule. These methods apply fixed mathematical rules to combine decisions. Such techniques include hard decision voting (Hansen & Salamon, 1990), applications of Bayes rule (Xu, Krzyzak, & Suen, 1992), arithmetic averaging (Breiman, 1996), variants of averaging (Miller & Yan, 1999), robust averaging (Tumer & Ghosh, 2002), and Dempster-Shafer theory (Chen, Wang, & Chi, 1997). Whereas the first two categories of techniques are usually chosen at the designer's discretion, this latter category, without training optimization of the combiner, is often applied out of necessity. There are two primary scenarios for which such training appears to be precluded:

1. Legacy or proprietary classifiers. Different organizations may have built classifiers, each using "in-house" data. Even if organizations are willing to share data, each classifier may use a different feature space.1 In this case, from company A's data (which was collected with a particular feature space in mind), it may not be possible to extract instances of the input features used by company B's classifier. Thus, in general, it will not be possible to form a common pool of labeled data, as required for supervised combiner learning.

2. Distributed multisensor/multimedia classification. Spatially remote sensors may collect measurements at different times, using either the same or different sensing modalities (Landgrebe, 2005). Each sensor may form its own local training set, without any known correspondence (e.g., between labeled examples in sensor A's database and those in sensor B's). In fact, the two sensors, while taking measurements of objects from the same set of classes, may not have sampled any common examples. Even if there are common examples, solving the correspondence problem to identify these examples appears daunting unless the data come with time stamps. A related application is vowel recognition based on speech and mouth images, with separate training sets.

1 There may be some features in common. Even in this case, we would say the classifiers use different feature spaces.

In the above examples, there are no common labeled data across the ensemble, needed for supervised learning of the combining rule. In this letter, we focus on this case, which we refer to as distributed ensemble classification. This type of problem has been recognized before, in its general form (Basak & Kothari, 2004; Miller & Yan, 1999) and in the context of classification over sensor networks (D'Costa, Ramachandran, & Sayeed, 2004). Basak and Kothari (2004) derive a fixed combining rule by applying Bayes theorem and taking proper account of redundancies in the feature spaces used by the local classifiers. This approach requires communication among all pairs of local classifiers to identify the features in common. D'Costa et al. (2004) derive a fixed combining rule for the case where the local classifiers' feature vectors are jointly gaussian. We emphasize that Basak and Kothari (2004) and D'Costa et al. (2004) apply fixed combining rules. The conventional wisdom is that without common labeled training data, one must apply a fixed combining strategy. Alternatively, we will show it is both possible and in general desirable to learn the combining rule. We propose a transductive learning strategy (Joachims, 1999; Miller & Uyar, 1997; Saerens, Latinne, & Decaestecker, 2002), wherein an objective measured over the test data batch is optimized with respect to the batch of class decisions.2 Several such objective functions will be proposed. There are two main reasons that we expect transductive learning may outperform fixed combining. First, consider the very practical case where the class priors seen (in training data) by individual classifiers do not reflect the true class priors, that is, those seen in the test batch. There are a number of possible reasons for such mismatch:
• For the distributed multisensor domain, the class distribution may be spatially varying; for example, each local region (and hence each local classifier) may "see" examples from only a subset of the classes.
• Rare classes may not be represented proportionately in a training data set.
• Class priors may be environment dependent.
• Groups of classes that are highly confusable are difficult to hand-label. Thus, ironically, although it is important to have sufficient examples from them, these classes may be underrepresented in a labeled set.

2 Batch transductive learning jointly yields decisions for all samples in the test batch. Alternatively, to avoid sample (but not processing) delays in decision making, we can instead apply transductive learning sample by sample. In this case, the decision on the current sample $x_t$ is made (at least in principle) by transductive learning using the entire batch of test samples seen until now, that is, $\{x_1, x_2, \dots, x_{t-1}\}$. While this learning will yield decisions on all samples in the current batch, only the decision on $x_t$ is retained, since decisions on past samples have already been made in this (sample-by-sample) case.
Saerens et al. (2002) developed a transductive maximum likelihood (ML) method to correct class priors for the case of a single classifier. We extend their approach to address distributed ensembles and demonstrate performance gains over fixed combining when there is mismatch between the true priors and those used by individual classifiers. Moreover, we observe that for the likelihood objectives considered in this work and in Saerens et al. (2002), the EM learning is globally convergent, that is, it finds the (unique) maximum likelihood estimate. This was not recognized in Saerens et al. (2002). A second motivation for a transductive approach is to avoid unnecessary statistical assumptions implicitly made by using a particular combining rule. We will show that arithmetic averaging and naive Bayes rules emerge as the transductive ML-optimal rules when particular assumptions—conditional independence, in the naive Bayes case—are made about the underlying stochastic model that generated the data. As will be seen in our results, conditional independence in particular is a poor assumption when there is statistical redundancy between the decisions produced by individual classifiers. To avoid making unnecessary assumptions, we also propose an information-theoretic transductive technique for aggregation of distributed ensembles. We have found that this method generally outperforms the ML techniques and, in particular, much better accounts for redundancy between classifiers than the ML methods. This method is also applicable for some scenarios where transductive ML techniques cannot be used. In section 2, we describe the distributed ensemble problem more fully, stating our assumptions and introducing notation. In section 3, we develop two transductive ML techniques. Section 4 develops our information-theoretic method. Section 5 gives experimental results. The letter concludes with a summary and future work.

2 Assumptions for Distributed Ensemble Classification

Consider an ensemble with $N_e$ classifiers and an aggregation function for combining their decisions. The transductive ML techniques we develop assume each classifier estimates a posteriori probabilities $P_j[C = c \,|\, x^{(j)}]$, $c = 1, \dots, N_c$, $j = 1, \dots, N_e$, with $N_c$ the number of classes and $x^{(j)}$ the feature vector for the jth classifier. Each classifier has its own labeled training set for building an a posteriori model. The priors reflected in this training set will in general be different for each local classifier and, of course, may differ from the true priors reflected in the test data batch, which are unknown.
Consistent with the proprietary scenario, for example, we assume there is no communication between the individual classifiers for discerning common features. As emphasized before, there is no common labeled training data that could be used for supervised learning of the aggregation function. However, during the operational (use) phase, common data are observed (in a synchronized fashion) across the ensemble; that is, for each test object, a corresponding feature vector is measured by each classifier. Thus, the input to the ensemble system is the concatenated vector $x = (x^{(1)}, x^{(2)}, \dots, x^{(N_e)})$, but with classifier $j$ seeing only $x^{(j)}$. We suppose it is not feasible to directly convey the feature vectors $x^{(j)}$ to the aggregation function; only the classifier decisions (hard or soft) are conveyed. This is consistent with the proprietary scenario and also with the case of high feature dimensionality (and hence high-bandwidth transmission cost). Each classifier may communicate either soft decisions $\{P_j[C = c \,|\, x^{(j)}],\ c = 1, \dots, N_c\}$ or maximum a posteriori (MAP) hard decisions. A key aspect of our approach is that in making decisions, we perform transductive learning on a batch of test samples. As mentioned, one possibility is to make joint decisions on all samples in a given test set batch, $X_{test} = \{x_i,\ i = 1, \dots, N_{test}\}$. This will entail collecting batches of soft (or hard) decisions from each of the individual classifiers, that is, $\{\{P_j[c \,|\, x_i^{(j)}]\ \forall c, \forall j\},\ i = 1, \dots, N_{test}\}$. There will be delays in decision making associated with both decision collection and processing by the aggregation function to produce the ensemble decision batch $\{\{P_e[c \,|\, x_i]\},\ i = 1, \dots, N_{test}\}$. For our methods, these delays will grow linearly with $N_{test}$. Alternatively, in the sample-by-sample case (see note 2), there is no delay associated with data (i.e., decision) collection. Finally, we note an important distinction between the ML methods and the information-theoretic method to be developed, for the case where individual classifiers forward hard decisions. For the ML methods, each local classifier must explicitly produce a posteriori probabilities $\{P_j[c \,|\, x^{(j)}]\}$, even though only a hard MAP decision will be conveyed. This is required because local MAP decisions cannot be explicitly corrected to account for new estimated priors unless there is access, at least at the local classifiers, if not at the aggregation function, to local a posteriori probabilities. By contrast, as will be seen, the information-theoretic method can be applied and can effectively correct priors even if each local classifier is an inaccessible black box that produces only hard decisions.
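To fix a concrete picture of what the aggregation function receives, the quantities just defined can be held in a small set of arrays. This is only an illustrative convention of ours for the sketches that follow, not part of the letter's formulation; the sizes and values below are hypothetical.

```python
import numpy as np

# Hypothetical sizes: Ne classifiers, Nc classes, Ntest test samples.
Ne, Nc, Ntest = 5, 3, 1000

# posteriors[j, i, c] = P_j[C = c | x_i^(j)], the soft decision batch
# conveyed by local classifier j (here initialized to uniform pmfs).
posteriors = np.full((Ne, Ntest, Nc), 1.0 / Nc)

# priors[j, c] = P_j[C = c], measured on classifier j's own training set
# and also conveyed to the aggregation function for the ML methods.
priors = np.full((Ne, Nc), 1.0 / Nc)

# If only hard MAP decisions are conveyed, they are the argmax over c.
hard_decisions = posteriors.argmax(axis=2)       # shape (Ne, Ntest)
```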
3 Transductive Maximum Likelihood Ensembles

Saerens et al. (2002) considered the case of a single a posteriori model $P[c|x]$. They effectively assumed this model is reasonably accurate except possibly in the class priors implicitly used. They proposed to reestimate class priors so as to maximize the data likelihood measured on the test
set batch. Within the expectation-maximization (EM) learning framework (Dempster, Laird, & Rubin, 1977; McLachlan & Peel, 2000), the revised a posteriori class probabilities, based on the current estimates of the class priors, are computed in the E-step, with the class priors reestimated in the M-step. Here, we extend their approach to address distributed ensembles. We consider two underlying stochastic models, each of which will be shown to be consistent with a particular well-known form of ensemble combining.

3.1 Class-Conditional Feature Independence. We denote the class-conditional density (or probability mass function, pmf, in the discrete case) for the random feature vector $X^{(j)}$, given $C = c$, by $p_j[x^{(j)} \,|\, C = c]$. Since each local classifier is required only to produce an a posteriori model $P_j[c \,|\, x^{(j)}]$, the density function will in general be unspecified.3 Moreover, even if each classifier is explicitly using a class-conditional density for its feature vector, we are not proposing, as in, for example, Miller and Uyar (1997), to reestimate the parameters of these densities within the transductive learning framework. Regardless, the class-conditional densities do appear in our maximum likelihood formulation. In this section, we assume the local feature vectors $\{X^{(j)},\ j = 1, \dots, N_e\}$ are conditionally independent, given the class. This is a standard naive Bayes assumption that is often applied in distributed classification and detection contexts (Varshney, 1997). The associated stochastic data generation for each test sample $x_i$, assumed to be independently generated, is to first randomly select a class based on $\{P_e[C = k]\}$ and then to randomly generate each local feature vector, given the chosen class $k'$, via the density $p_j[x_i^{(j)} \,|\, C = k']\ \forall j$. This stochastic generation defines a special mixture model for generating $x$, with mixture component $k$ representing class $k$, $k = 1, \dots, N_c$. Thus, $P_e[C = k]$ is the prior for component (class) $k$, with $\{p_j[x_i^{(j)} \,|\, C = k],\ j = 1, \dots, N_e\}$ the component model for component (class) $k$. For this mixture model, applying Bayes rule, the transductive incomplete data log likelihood for the unlabeled batch test set, $X_{test}$, is

$$\log L(X_{test}) = \log L(x_1, \dots, x_{N_{test}}) = \sum_{i=1}^{N_{test}} \log \sum_{k=1}^{N_c} P_e[C = k] \prod_{j=1}^{N_e} p_j\!\left(x_i^{(j)} \,\middle|\, C = k\right) \tag{3.1}$$

$$= \sum_{i=1}^{N_{test}} \log \sum_{k=1}^{N_c} P_e[C = k] \prod_{j=1}^{N_e} P_j\!\left[C = k \,\middle|\, x_i^{(j)}\right] \cdot \frac{p_j\!\left(x_i^{(j)}\right)}{P_j[C = k]} \tag{3.2}$$

$$= \sum_{i=1}^{N_{test}} \log\!\left( \prod_{j=1}^{N_e} p_j\!\left(x_i^{(j)}\right) \right) + \sum_{i=1}^{N_{test}} \log \sum_{k=1}^{N_c} P_e[C = k] \prod_{j=1}^{N_e} \frac{P_j\!\left[C = k \,\middle|\, x_i^{(j)}\right]}{P_j[C = k]}, \tag{3.3}$$

3 It will be known if the a posteriori model is derived from an underlying stochastic model for the feature vector.
with the final form obtained using the fact that $\prod_{j=1}^{N_e} p_j[x_i^{(j)}]$ has no dependence on $k$. This is to be maximized over the ensemble class priors $\{P_e[C = k]\}$. Here, $P_j[C = k \,|\, x_i^{(j)}]$ is local classifier $j$'s posterior, $P_j[C = k]$ is its prior (measured based on its training set), and $p_j[x_i^{(j)}]$ is its feature vector density. Note that $p_j[x_i^{(j)}]$ is in general unknown since each classifier is assumed only to produce class posteriors. However, this is not problematic since $\sum_{i=1}^{N_{test}} \log(\prod_{j=1}^{N_e} p_j[x_i^{(j)}])$ has no dependence on the parameters to be optimized. The optimization is accomplished using an EM algorithm whose defining formulation details are given in the appendix. The algorithm amounts to a sequence of iterations, each consisting of (1) reestimation of the ensemble class posteriors on the test data, given fixed ensemble priors (E-step), followed by (2) reestimation of the ensemble class priors, given fixed posteriors (M-step), applied until a convergence condition is met. Each step in the optimization is guaranteed to ascend in $\log L$. The E-step at iteration $t$ is
$$P_e^{(t)}[C = k \,|\, x_i] = \frac{P_e^{(t-1)}[C = k] \, \prod_{j=1}^{N_e} \frac{P_j[C = k \,|\, x_i^{(j)}]}{P_j[C = k]}}{\sum_{l=1}^{N_c} P_e^{(t-1)}[C = l] \, \prod_{j=1}^{N_e} \frac{P_j[C = l \,|\, x_i^{(j)}]}{P_j[C = l]}}, \tag{3.4}$$
and the M-step is:
$$P_e^{(t)}[C = k] = \frac{1}{N_{test}} \sum_{i=1}^{N_{test}} P_e^{(t)}[C = k \,|\, x_i], \tag{3.5}$$

with the initial prior probabilities given as $\{P_e^{(0)}[C = k]\}$. We emphasize that in the E-step, $P_j[C = k]$ and $P_j[C = k \,|\, x_i^{(j)}]$, the prior and posterior for the $j$th classifier, remain unchanged throughout the optimization. Note that the ratio $P_e^{(t-1)}[C = k]/P_j[C = k]$ is effecting correction of the posterior $P_j[C = k \,|\, x_i^{(j)}]$ to reflect the current estimates of the ensemble class priors measured on the test batch. Note also that the ratio $P_j[C = k \,|\, x_i^{(j)}]/P_j[C = k]$ is proportional to the class-conditional density $p_j[x_i^{(j)} \,|\, C = k]$. Thus, as it must,
equation 3.4 takes the naive Bayes posterior form, that is, the form of a model that assumes class-conditional independence of the feature vectors $\{X^{(j)},\ j = 1, \dots, N_e\}$, with class priors given by the transductive ML-estimated ones. At convergence, the E-step is used to compute the ensemble class a posteriori probabilities. Note that the difference between our approach and standard naive Bayes aggregation is that our method learns and, at convergence, uses corrected priors in the naive Bayes rule, as given in equation 3.4. In fact, the solution after the first E-step (without any prior correction) is the standard naive Bayes solution, which we will refer to as the fixed product rule and compare against in our experimental results. It is important to observe that the objective function, equation 3.1, as well as the objective in Saerens et al. (2002), is concave in the parameters $\{P_e[C = k]\}$, with, thus, a unique maximum. Moreover, convergence for this particular problem—ML estimation for mixture proportions, with the parameters of the component densities held fixed—is addressed in Redner and Walker (1984). There it is shown that under mild assumptions (nonzero initialization of mixing proportions for all components, positivity of the mixture density for each sample, linearly independent mixture components with positive support on the sample domain), the EM algorithm is globally convergent, to the unique maximum likelihood estimate. Only local optimality was recognized in Saerens et al. (2002). Our method requires each local classifier to convey its test-batch posterior pmfs to the aggregation function. Moreover, each classifier must also convey its class priors $\{P_j[C = k]\}$.

3.2 Censored Feature Vectors. A well-known drawback of the product-based rule, equation 3.4, is its sensitivity to unreliable experts: a single classifier that gives low probability to the true class may cause an ensemble error. Arithmetic averaging is generally believed to be more robust to unreliable "experts." In deriving rule 3.4 within the transductive ML framework, we modeled stochastic generation of $x$ in a standard way, using conditional independence. In this section, we show that an arithmetic averaging rule emerges when a somewhat unconventional stochastic model is assumed. Imagine that for any given test set object, only one of the individual classifiers receives measurements, with the other feature vectors essentially censored. Each classifier is assumed equally likely to be uncensored. More precisely, the stochastic data generation is as follows. For each test sample, one first randomly selects a class based on $\{P_e[C = k]\}$, then the uncensored classifier (based on a uniform pmf over the $N_e$ classifiers), and finally the sample, given the chosen class ($k'$) and classifier ($j'$), that is, via $p_{j'}[x_i^{(j')} \,|\, C = k']$. For this model, the transductive censored incomplete data
log likelihood, assuming independent test set samples, is
$$\log L_{cens}(X_{test}) = \sum_{i=1}^{N_{test}} \log \sum_{k=1}^{N_c} P_e[C = k] \, \frac{1}{N_e} \sum_{j=1}^{N_e} p_j\!\left(x_i^{(j)} \,\middle|\, C = k\right). \tag{3.6}$$
The appendix casts this ML problem within the EM framework. The ensemble a posteriori class probabilities are again reestimated in the E-step, which now takes the arithmetic averaging form:
$$P_e^{(t)}[C = k \,|\, x_i] = \frac{P_e^{(t-1)}[C = k] \, \frac{1}{N_e} \sum_{j=1}^{N_e} \frac{P_j[C = k \,|\, x_i^{(j)}]}{P_j[C = k]}}{\sum_{l=1}^{N_c} P_e^{(t-1)}[C = l] \, \frac{1}{N_e} \sum_{j=1}^{N_e} \frac{P_j[C = l \,|\, x_i^{(j)}]}{P_j[C = l]}}. \tag{3.7}$$
The M-step again computes the new class priors using equation 3.5. Again, the ratio $P_e^{(t-1)}[C = k]/P_j[C = k]$ is effecting class prior correction. At convergence, equation 3.7 computes the ensemble a posteriori probabilities. The solution after the first E-step (before prior correction) in this case is the fixed arithmetic averaging rule, which we will compare against in our results. As in the previous section, this EM algorithm is globally convergent. It is important to emphasize that we are not especially advocating the censored model as a good statistical description that captures particular distributed ensemble scenarios.4 We have simply identified the statistical modeling assumption (random censoring) needed for transductive ML to yield the (popular) arithmetic averaging form of combining. In section 5, we experimentally evaluate this transductive technique, along with the product rule method and an alternative method, developed next.

4 In particular, in our experiments in section 5, there is no actual censoring of feature vectors.
4 An Information-Theoretic Transductive Ensemble

The aggregation technique developed in this section, which we dub the information-theoretic (IT) method, differs in important respects from the ML methods. First, the IT method does not require that individual classifiers produce a posteriori probabilities. Correction of class priors is effectively achieved even if each local classifier only produces hard decisions. Second, we saw in the previous section that well-known aggregation rules emerge when specific statistical assumptions are made. On the one hand, these rules generalize, providing an ensemble decision given any set of local posteriors $\{\{P_j[c \,|\, x^{(j)}]\},\ j = 1, \dots, N_e\}$, associated with a new $x$. That is, suppose there are two test batches, $X_1$ and $X_2$. If transductive ML learning is applied to $X_1$, assuming the class priors in $X_2$ are the same as in $X_1$, decisions can be made on the examples in $X_2$ without performing additional learning.5 On the other hand, the statistical assumptions may not be warranted—in particular, conditional independence, when there is significant redundancy between classifiers. As an extreme example, consider an ensemble with $N_e - 1 \gg 1$ identical weak classifiers (i.e., ones that are perfect copies of each other) and one strong classifier, statistically independent of the rest. Clearly, the naive Bayes rule in equation 3.4 will bias in favor of the weak classifiers and give poor accuracy in this case. What is instead desired is that the aggregation function properly account for redundancies. For this example, it should ideally treat the ensemble as effectively consisting of two (independent) classifiers, one strong and one weak. The IT method does not produce a parametric aggregation function; given a test batch, it simply yields a table of class a posteriori pmf values, with each row specifying the pmf for one sample in the test batch. On the one hand, this means there is no automatic rule of generalization for new samples; decisions on new samples can be made only using a new round of transductive learning. On the other hand, this approach does not make the statistical assumptions required for the ML techniques. The IT method can be applied for the case of a single classifier, as considered in Saerens et al. (2002), and in this case, it will not require the statistical assumptions made there. However, IT is especially advantageous for the ensemble case focused on here, where there is a need to account for classifier redundancies. We will demonstrate that IT properly treats redundancy for the stark type of example described above and more generally outperforms the ML methods. Finally, we note that there is no explicit prior correction in the IT method because it optimizes directly over test batch decisions rather than over prior parameters. Regardless, IT is experimentally observed to achieve good decision accuracy in cases of large prior mismatch; that is, it achieves effective prior correction.

The IT method is inspired by frameworks such as maximum entropy (ME) statistical inference (Jaynes, 1989), which treats learning of probability distributions as a problem of encoding equality constraints on a set of judiciously chosen statistics. ME has been justified on statistical and axiomatic grounds, and the solution consistent with given satisfiable linear constraints is unique. Moreover, there are simple ME parameter learning algorithms with guaranteed convergence, such as iterative scaling techniques (e.g., Della Pietra, Della Pietra, & Lafferty, 1997). In other work (Miller & Pal, 2005) we have explicitly cast standard ensemble combining (where there is common training data across the ensemble) as an ME problem with an iterative scaling solution. Unfortunately, for the transductive setting addressed here, the "right" constraints are not guaranteed to be satisfied. Accordingly, we will take a heuristic optimization approach, albeit for a principled goal, choosing ensemble posteriors on the test batch to agree as closely as possible with constraints measured by the local classifiers, as reflected by an IT measure of distance between probability distributions. Casting learning as a problem of constraint encoding is especially advantageous for achieving robustness to classifier redundancy, as can be understood from the following simple argument. Suppose we have already optimized the test set decisions so as to precisely meet the constraints measured by an ensemble of $N_e$ local classifiers. If we increase the ensemble size by one, adding a classifier that is an identical copy of an existing one in the ensemble, this classifier's constraints are perfectly redundant and already met by the current solution. Thus, this new classifier will have no effect on the solution. As will be demonstrated, the IT method handles classifier redundancy in a fashion quite consistent with this description, achieving accuracy nearly invariant to explicit redundancy in the ensemble.

5 In principle, $X_1 \cup X_2$ should be used for transductive learning, or, if the class priors are different in the two batches, separate transductive learning should be performed on each. However, learning complexity may preclude this.

4.1 Choice of Constraints. In general, one would like to encode as much (accurate) statistical constraint information as possible, to learn the most accurate probability model (Jaynes, 1989), that is, joint probabilities on collections of features (e.g., Della Pietra et al., 1997; Miller & Yan, 2000). In our ensemble setting, the local decision $\hat{C}_j$ is a discrete-valued feature. Thus, in principle, we might try encoding joint statistics such as fourth-order pmfs $P[C, \hat{C}_j, \hat{C}_k, \hat{C}_l]$. However, constraint probabilities are usually estimated, by training set frequency counts, with accuracy that decreases with increasing order of the probabilities. This limits the orders used in practice. Moreover, in our distributed setting, with no common labeled data, it is not possible to measure joint statistics involving two or more classifiers. Thus, we are limited to encoding pairwise statistics involving $C$ and individual decisions, $\hat{C}_j$. In prior work (Miller & Yan, 2000), pairwise pmfs involving the class label and individual discrete features were encoded into the learning of ME classifiers, and this was found to achieve good accuracy. In our ensemble setting, using its local training set, each classifier $j$ can measure the pairwise pmf $P_{con}^{(j)}[C, \hat{C}_j]$,6 with "con" indicating this pmf is a constraint. Thus, at first glance, it appears reasonable to loosely cast our transductive learning problem as choosing the ensemble posteriors $\{P_e[C \,|\, x_i],\ i = 1, \dots, N_{test}\}$ to satisfy the pairwise pmf constraints contributed by each of the local
classifiers, that is, $\{P_{con}^{(j)}[C, \hat{C}_j],\ j = 1, \dots, N_e\}$. However, note that the pairwise pmf $P_{con}^{(j)}[C, \hat{C}_j]$ determines the marginal pmfs $P_{con}^{(j)}[C]$ and $P_{con}^{(j)}[\hat{C}_j]$. That is, by seeking to agree with the constraint $P_{con}^{(j)}[C, \hat{C}_j]$, one is also trying to force the ensemble test batch posteriors $\{P_e[C \,|\, x_i]\}$ to agree with the (perhaps erroneous, as well as inconsistent) class priors and class decision priors measured by each local classifier, based on its respective training set.7 Thus, rather than encoding the joint pmfs $\{P_{con}^{(j)}[C, \hat{C}_j]\}$, we instead suggest encoding the conditional pmfs $\{P_{con}^{(j)}[\hat{C}_j \,|\, C = c]\ \forall c\}$. These constraint pmfs do not specify the class priors, and they specify the pairwise pmf except for the prior. They are thus as close as we can get to encoding pairwise pmfs.8 Note also that the pmfs $\{P_{con}^{(j)}[\hat{C}_j \,|\, C = c]\ \forall c\}$ define the confusion matrix for classifier $j$ (measured based on its training set). We believe it is quite reasonable to expect these statistics should be preserved as measured on the test batch. In Saerens et al. (2002) and previous work cited there, the confusion matrix was used in a procedure to reestimate class priors for the test batch. These priors were then used to correct posteriors in essentially the same way they are corrected within our ML algorithms. Although it is possible we could also use $\{P_{con}^{(j)}[\hat{C}_j \,|\, C = c]\ \forall c\}$ in this way, such an approach would be similar to the methods suggested in the previous section and would address only prior correction. We instead use the confusion matrix information from each local classifier to define a set of constraints for transductive learning. This approach will be shown experimentally to be quite effective when there is prior mismatch. Beyond this, though, this method does not make unnecessary statistical assumptions; our (constraint-based) method will be demonstrated to handle classifier redundancy much better than the ML methods. The constraint probabilities for classifier $j$ are estimated as9

$$P_{con}^{(j)}[\hat{C}_j = \hat{c} \,|\, C = c] = \frac{\sum_{i=1:\,c_i = c}^{N_j} P_j\!\left[\hat{C}_j = \hat{c} \,\middle|\, x_i^{(j)}\right]}{\sum_{i=1:\,c_i = c}^{N_j} 1}. \tag{4.1}$$

6 The superscript indicates that the constraint is measured by classifier $j$.
7 If $P_{con}^{(j)}[C]$ is mismatched to the true class prior, it is likely that $P_{con}^{(j)}[\hat{C}_j]$ is also mismatched to the true class decision prior (reflected in the test data batch).
8 One might imagine that it is worthwhile to also encode $P_{con}^{(j)}[C \,|\, \hat{C}_j = \hat{c}]$. However, note that $P_{con}^{(j)}[C = c \,|\, \hat{C}_j = \hat{c}] = P_{con}^{(j)}[\hat{C}_j = \hat{c} \,|\, C = c] \, P_{con}^{(j)}[C = c] \,/\, P_{con}^{(j)}[\hat{C}_j = \hat{c}]$; that is, the only information in $P_{con}^{(j)}[C = c \,|\, \hat{C}_j = \hat{c}]$ relative to $P_{con}^{(j)}[\hat{C}_j = \hat{c} \,|\, C = c]$ is a function of the (mismatched) quantities $P_{con}^{(j)}[C = c]$ and $P_{con}^{(j)}[\hat{C}_j = \hat{c}]$. Moreover, experimentally we have found that encoding $P_{con}^{(j)}[\hat{C}_j = \hat{c} \,|\, C = c]$ gives much better accuracy than encoding $P_{con}^{(j)}[C = c \,|\, \hat{C}_j = \hat{c}]$.
9 It is also possible to improve this estimate via a cross-validation procedure.
Here, $X_j \equiv \{(x_i^{(j)}, c_i),\ i = 1, \dots, N_j\}$ is the labeled training set for classifier $j$. The transductive estimate of the constraint 4.1, produced by choosing the ensemble a posteriori pmfs on the test batch $X_{test}$, is

$$P_e[\hat{C}_j = \hat{c} \,|\, C = c] = \frac{\sum_{i=1}^{N_{test}} P_e[C = c \,|\, x_i] \, P_j\!\left[\hat{C}_j = \hat{c} \,\middle|\, x_i^{(j)}\right]}{\sum_{i=1}^{N_{test}} P_e[C = c \,|\, x_i]}. \tag{4.2}$$
4.2 Nonsatisfiability of Constraints. Consider the following two-class example with two classifiers: suppose each classifier makes hard decisions, and let one classifier achieve perfect training accuracy:

$$P_{con}^{(1)}[\hat{C} = 1 \,|\, C = 2] = 0, \qquad P_{con}^{(1)}[\hat{C} = 2 \,|\, C = 1] = 0. \tag{4.3}$$

For the second classifier, let

$$P_{con}^{(2)}[\hat{C} = 1 \,|\, C = 1] = 1, \qquad P_{con}^{(2)}[\hat{C} = 2 \,|\, C = 2] = 1/2. \tag{4.4}$$
Suppose the same training set is seen by each local classifier. Let the test batch be the same as the training set, except that it contains additional samples from class 2 that are all correctly classified by both classifiers. In this case, it is easy to see it is not possible to choose $\{\{P[C = c \,|\, x_i]\},\ i = 1, \dots, N_{test}\}$ to satisfy all the specified conditional constraints. Thus, as one expects, one cannot in general jointly satisfy distributed constraints, gleaned from a collection of local classifiers, by transductive learning.

4.3 Information-Theoretic Learning Approach. Although the given constraints cannot be precisely met in general, in practice we have found they typically can be quite well met. Thus, here we develop a learning approach that seeks to agree with them as much as possible. Since our constraints are on pmfs, we quantify the discrepancy between constraint pmfs and ensemble estimates by the information-theoretic measure of relative entropy (Cover & Thomas, 1991), that is, $D(P[X] \,\|\, Q[X]) = \sum_{x \in S} P[x] \log \frac{P[x]}{Q[x]}$, with $S$ the (common) support set, $P[\cdot]$ playing the role of "posterior," and $Q[\cdot]$ the "prior." In our case, the ensemble's estimate is the "posterior" and the constraint pmfs are the "priors." From the previous section, we would ideally like to achieve
P[x] tropy (Cover & Thomas, 1991), that is, D(P[X]||Q(X)) = x∈S P[x] log Q[x] , S the (common) support set, with P[·] playing the role of “posterior” and Q[·] the “prior.” In our case, the ensemble’s estimate is the “posterior" and the constraint pmfs are the “priors.” From the previous section, we would ideally like to achieve ( j) Pe [Cˆ j |C = c] = ˙ Pcon [Cˆ j |C = c], ∀ j, c.
(4.5)
It thus appears natural to choose the ensemble posteriors to minimize the sum of cross-entropies:

$$\tilde{R} = \sum_{j=1}^{N_e} \sum_{c=1}^{N_c} D\!\left(P_e[\hat{C}_j \,|\, C = c] \,\middle\|\, P_{con}^{(j)}[\hat{C}_j \,|\, C = c]\right). \tag{4.6}$$
However, consider the case where the true prior is zero for some class $\tilde{c}$, that is, where this class does not occur in the test batch.10 In this case, the posterior pmfs should be chosen such that $\sum_{i=1}^{N_{test}} P_e[C = \tilde{c} \,|\, x_i] = 0$. However, this choice would lead to an improper estimate of $P_e[\hat{C}_j \,|\, C = \tilde{c}]$ in equation 4.2. Choosing the test set posteriors to minimize equation 4.6 forces the choice of a proper $P_e[\hat{C}_j \,|\, C = \tilde{c}]\ \forall j$, which is achieved only if $\sum_{i=1}^{N_{test}} P_e[C = \tilde{c} \,|\, x_i] > 0$. Thus, for the case of a missing class, minimizing $\tilde{R}$ must yield soft decision inaccuracy (as reflected by $\sum_{i=1}^{N_{test}} P_e[C = \tilde{c} \,|\, x_i] > 0$). The problem here is that equation 4.5 (and $\tilde{R}$) includes constraints (involving $\tilde{c}$) that should not be imposed. What we desire is a simple way to avoid encoding constraints $\{P_{con}^{(j)}[\hat{C}_j \,|\, C = \tilde{c}],\ \forall j\}$ associated with classes $\tilde{c}$ that do not occur in the test batch (even as we do not a priori know whether there are such missing classes). This is effectively achieved by multiplying equation 4.5, both sides, by $P_e[C = c] = \frac{1}{N_{test}} \sum_{i=1}^{N_{test}} P_e[C = c \,|\, x_i]$, that is, by expressing the constraints

$$P_e[\hat{C}_j \,|\, C = c] \cdot P_e[C = c] \doteq P_{con}^{(j)}[\hat{C}_j \,|\, C = c] \cdot P_e[C = c], \quad \forall j, c. \tag{4.7}$$
It may look as though equation 4.7 is imposing the (undesirable) pairwise pmf constraints discussed in section 4.1, since the equation equates two pairwise pmfs. However, equation 4.7 does not impose the local classifier's class prior, $P_{con}^{(j)}[C = c]$. It imposes only the conditional pmf $P_{con}^{(j)}[\hat{C}_j \,|\, C = c]$. In particular, note that equation 4.7 is equivalent to the conditional pmf constraints, equation 4.5, for $c$ such that $P_e[C = c] > 0$ (since in this case $P_e[C = c]$ can be canceled on both sides), but with no constraint imposed when $P_e[C = c] = 0$; that is, equation 4.7 encodes conditional pmf constraints for all classes estimated to have nonzero priors in the test batch. The effect of multiplying both sides of equation 4.5 by $P_e[C = c]$ is thus to avoid encoding constraints for classes estimated not to be present in the test batch. We encode the objectives 4.7 by choosing the ensemble posterior pmfs on $X_{test}$ to minimize the sum of relative entropies over all local classifiers:

$$R = \sum_{j=1}^{N_e} \sum_{c=1}^{N_c} \sum_{\hat{c}=1}^{N_c} P_e[\hat{C}_j = \hat{c} \,|\, C = c] \, P_e[C = c] \, \log \frac{P_e[\hat{C}_j = \hat{c} \,|\, C = c] \, P_e[C = c]}{P_{con}^{(j)}[\hat{C}_j = \hat{c} \,|\, C = c] \, P_e[C = c]}$$

$$= \sum_{j=1}^{N_e} \sum_{c=1}^{N_c} P_e[C = c] \, D\!\left(P_e[\hat{C}_j \,|\, C = c] \,\middle\|\, P_{con}^{(j)}[\hat{C}_j \,|\, C = c]\right). \tag{4.8}$$

10 Both the particular class and the fact that a class is missing from the test batch are, of course, unknown.
The latter form clearly shows the relationship between $R$ and $\tilde{R}$. Experimentally, we have found that minimizing $\tilde{R}$ gives very similar classification accuracy to minimizing $R$ when there are no missing classes in the test batch. However, minimizing $\tilde{R}$ fails when there are missing classes, as will be shown in section 5. We perform the minimization of $R$ by gradient descent. The partial derivative with respect to a given posterior probability can be shown to be

$$\frac{\partial R}{\partial P_e[C = k \,|\, x_m]} = \frac{1}{N_{test}} \sum_{j=1}^{N_e} \sum_{\hat{c}_j=1}^{N_c} P_j\!\left[\hat{C}_j = \hat{c}_j \,\middle|\, x_m^{(j)}\right] \log \frac{P_e[\hat{C}_j = \hat{c}_j \,|\, C = k]}{P_{con}^{(j)}[\hat{C}_j = \hat{c}_j \,|\, C = k]} \quad \forall k, m. \tag{4.9}$$
Note that one way of achieving zero gradient (and an optimal solution) is if the constraints are precisely met. In practice, without loss of generality, but to ensure $P_e[C = k \,|\, x_m]$ is preserved as a pmf throughout the optimization, we parameterize it using a softmax function,

$$P_e[C = k \,|\, x_m] = \frac{e^{\gamma_{k,m}}}{\sum_{k'} e^{\gamma_{k',m}}}, \tag{4.10}$$

based on the real-valued parameters $\{\gamma_{k,m}\}$. Applying the chain rule, we have

$$\frac{\partial R}{\partial \gamma_{m,k}} = \sum_{k'=1}^{N_c} \frac{\partial R}{\partial P_e[C = k' \,|\, x_m]} \cdot \frac{\partial P_e[C = k' \,|\, x_m]}{\partial \gamma_{m,k}}, \tag{4.11}$$
where

$$\frac{\partial P_e[C = k' \,|\, x_m]}{\partial \gamma_{m,k}} = \begin{cases} P_e[C = k' \,|\, x_m] \left(1 - P_e[C = k' \,|\, x_m]\right), & \text{for } k' = k, \\ -P_e[C = k' \,|\, x_m] \, P_e[C = k \,|\, x_m], & \text{for } k' \neq k. \end{cases} \tag{4.12}$$
We have not determined whether equation 4.8, based on the parameterization, equation 4.10, is convex; there may be multiple local minima. Thus, unlike section 3, where the learning was globally convergent, gradient descent may find only solutions locally optimal with respect to equation 4.8. Finally, note that implementation of this technique requires that each classifier $j$, in addition to conveying its test batch decisions, also conveys the constraint pmfs $\{P_{con}^{(j)}[\hat{C}_j \,|\, C = c]\ \forall c\}$ to the aggregation function.

5 Experimental Results

We have performed a number of experiments involving several types of comparisons:

1. Transductive ML learning versus fixed combining (sum and product)
2. Transductive IT learning versus fixed combining
3. Transductive ML versus transductive IT learning
4. Specialized tests to study the effect of explicit classifier redundancy, the performance with strong classifiers (SVMs), the case of missing classes, and the effect of batch size

The data sets, taken from the UC Irvine machine learning repository, are described in Table 1. To simulate a distributed classification environment, we used five local classifiers, each (initially) a naive Bayes classifier working on a randomly selected subset of features.11 Continuous features were modeled by a gaussian distribution and discrete features by a multinomial. For example, Vehicle has 4 classes and 18 continuous features. The first local classifier used 15 randomly selected features, while the second classifier used 14. We performed five replications of twofold cross validation for all data sets except Satellite. For each replication, the data set was divided into two equal-sized sets. In the first fold, one set was used as training and the other as test, with these roles reversed in the second fold.

5.1 Prior Mismatch. We evaluated classification accuracy for varying test set priors. A new test set with given priors was obtained by sampling

11 Thus, for the conditional independence model of section 3.1, both the local classifiers and the combining rule are based on naive Bayes.
Table 1: UC Irvine ML Data Sets That Were Evaluated.

Data Set        Type of Features  Number of Classes  Number of Features  Feature Subspace Size
Satellite       Continuous        5                  33                  [28, 29, 30, 31, 32]
Vehicle         Continuous        4                  18                  [15, 14, 15, 16, 17]
Segment         Continuous        7                  19                  [15, 16, 15, 17, 18]
Liver Disorder  Continuous        2                  6                   [5, 4, 5, 4, 5]
Diabetes        Continuous        2                  7                   [6, 3, 4, 5, 6]
Mushroom        Discrete          2                  22                  [13, 15, 17, 19, 21]
Sonar           Continuous        2                  60                  [52, 54, 56, 58, 59]
with replacement from the original test set. For example, for the first fold of the first replication, denote the training set by S_a and the test set by S_b. Suppose we want to change the test set priors to {m_1, ..., m_{N_c}}, where \sum_{j=1}^{N_c} m_j = 1. Then the new test set (Ŝ_b) is obtained by sampling with replacement from S_b so as to achieve (or approximately achieve) {m_1, ..., m_{N_c}}. The size of the new test set is chosen to be the same as that of the original set S_b. Satellite has separate training and test sets. For this set, rather than performing cross validation, we averaged the test error rate over 10 trials, based on different resamplings of the test data for the given mismatched priors. We quantify the mismatch between test set and local training set priors by the sum of cross-entropies:

M = \sum_{j=1}^{N_e} \sum_{k=1}^{N_c} P_j[C = k] \log \frac{P_j[C = k]}{P_{test}[C = k]}.    (5.1)
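The mismatch index is straightforward to compute; a minimal sketch (the function name is ours, and strictly positive priors are assumed):

```python
import numpy as np

def mismatch_index(local_priors, test_priors):
    """Equation 5.1: sum over local classifiers of the cross-entropy-style
    divergence between each local training prior P_j[C] and the test prior."""
    local_priors = np.asarray(local_priors, dtype=float)  # shape (N_e, N_c)
    test_priors = np.asarray(test_priors, dtype=float)    # shape (N_c,)
    return float(np.sum(local_priors * np.log(local_priors / test_priors)))

# For instance, five identical local priors against a skewed test prior:
M = mismatch_index([[0.5, 0.5]] * 5, [0.2, 0.8])
```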
We evaluated performance varying the value of this mismatch index.

5.2 Initialization. The transductive ML methods used initial priors estimated from frequency counts taken over all the local training sets. These priors were also used as (fixed) priors by the fixed rule methods. The fixed rule methods are thus equivalent to the solutions obtained by the (corresponding) EM algorithms (see sections 3.1 and 3.2) after the first E-step, before any prior correction is made. The IT method started from a uniform posterior pmf initialization for all test batch samples, via γ_{k,m} = 0 ∀ k, m.

5.3 Stopping Criteria. In practice, one must use a stopping rule that is either fixed or automatically determined for each new data set. The desire is that learning should cease without substantial over- or underfitting. In this work, we have applied fixed stopping rules empirically chosen
Figure 1: Learning objective and the test error rate as learning proceeds for transductive methods on Diabetes with prior mismatch value of 2.55.
for each method. Figure 1 gives example curves illustrating that test set performance is not very sensitive to the choice of stopping rule. We observed little overfitting for any of the methods. For the ML methods, we stop learning when the relative change in log likelihood drops below 10^-3. Since most of the improvement in test error was observed to occur in the first few iterations, this was found to generally stop near the best test set performance. For IT, we used gradient descent starting from a nominal step size (with value 10), taking steps of this size so long as the learning objective decreases; when it does not, the step size is halved. Thus, equation 4.8 monotonically decreases with each step taken. We have found that the general trend for the test error rate is (also) nonincreasing with an increasing number of steps. For IT, we stop learning when the relative change in equation 4.5 is less than 10^-4, or after 3000 learning steps. This ensures that the algorithm does a sufficient amount of learning. For all the methods, there are cases where generalization error does increase during learning. However, we have observed these increases to be modest.

5.4 Transductive ML Learning Versus Fixed Combining (Sum and Product). A comparison of generalization error rates for transductive ML and fixed combining is given in Table 2. The second column gives the mismatch between training and test set priors. It can be seen that as the mismatch increases, the ML methods outperform the fixed rules, with the
Table 2: Generalization Error Rates Based on Five Replications of Twofold Cross Validation, Transductive ML and Fixed Methods. The last four columns give the 5 × 2 cv F test statistic for ML (sum) versus fixed (sum), ML (sum) versus fixed (product), ML (product) versus fixed (sum), and ML (product) versus fixed (product); Satellite, which uses resampled trials rather than cross validation, has no F values.

Data Set   M       ML(Sum)  Fixed(Sum)  ML(Prod)  Fixed(Prod)  F:MLS/FS  MLS/FP  MLP/FS  MLP/FP
Satellite  0.31    22.56    22.51       22.65     22.56        -         -       -       -
           0.53    21.00    21.02       20.97     21.10        -         -       -       -
           0.97    21.16    21.43       21.11     21.53        -         -       -       -
           1.68    24.87    25.32       24.97     25.28        -         -       -       -
Segment    3.2e-2  20.88    20.94       20.96     20.96        0.56      0.60    0.70    0.77
           1.30    33.13    33.85       33.00     34.11        0.78      0.98    0.94    1.21
           1.73    25.78    27.04       26.10     27.16        2.35      2.87    2.71    3.81
           2.57    41.44    43.38       41.00     43.48        6.12      4.78    6.27    4.81
Liver D    5.1e-2  46.13    44.10       47.96     44.05        1.65      1.59    1.91    1.36
           0.55    36.24    38.03       35.20     38.38        1.31      1.26    1.57    1.55
           0.83    35.49    40.98       34.97     40.92        4.49      3.61    4.74    4.28
           2.43    21.04    31.21       17.92     31.27        5.89      5.13    4.12    4.12
Diabetes   3.1e-2  23.20    23.27       23.05     23.20        0.73      1.09    0.68    0.65
           0.74    27.29    28.91       26.58     28.83        1.93      1.95    2.82    2.67
           1.42    26.99    33.08       24.85     32.93        4.73      4.58    4.76    4.90
           2.58    23.50    33.46       21.77     33.08        8.15      7.87    7.18    7.50
Mushroom   1.2e-3  0.65     0.65        0.70      0.69         0.78      1.16    1.2     1.0
           0.36    0.49     0.93        0.49      0.87         6.04      3.9     4.5     3.9
           0.99    6.6e-2   0.44        5.7e-2    0.42         5.35      4.8     6.4     5.86
           2.36    0.078    0.16        0.12      0.25         0.66      1.37    0.552   0.96
Sonar      4.2e-2  31.83    32.12       32.02     32.02        0.76      0.84    0.82    1.0
           0.60    26.54    26.83       26.06     26.83        0.79      0.79    0.68    0.68
           1.35    25.58    26.54       25.67     26.92        0.86      1.34    0.79    1.23
           1.94    18.85    20.96       19.42     20.77        5.00      5.83    5.16    5.40
Table 3: Generalization Error Rates Based on Five Replications of Twofold Cross Validation, Transductive IT and Fixed Combining Methods. The last two columns give the 5 × 2 cv F test statistic for IT versus fixed (sum) and IT versus fixed (product).

Data Set        M       IT     Fixed(Sum)  Fixed(Prod)  F(Sum)  F(Prod)
Satellite       0.96    20.88  21.46       21.40        -       -
                1.68    25.28  25.60       25.51        -       -
Vehicle         1.27    41.35  45.60       45.56        6.45    5.6
                2.73    34.53  41.18       41.11        1.26    1.25
Segment         3.4e-2  16.51  21.32       21.40        5.13    5.26
                1.31    27.75  37.23       37.57        34.01   26.34
                1.83    11.85  13.59       13.89        9.75    8.63
                2.57    32.12  43.66       43.71        9.98    9.65
Liver Disorder  5.6e-2  44.57  45.20       45.09        0.71    0.68
                0.55    43.24  44.51       44.68        0.77    0.74
                1.74    32.54  35.66       35.72        0.80    0.80
Diabetes        2.3e-2  23.08  23.72       23.91        1.30    1.45
                0.74    26.99  33.76       33.16        2.68    1.99
                1.43    24.36  30.68       30.90        15.77   15.45
                2.57    17.82  31.84       31.69        8.31    7.66
Mushroom        1.6e-3  0.24   0.33        0.57         4.0     8.11
                0.36    0.073  0.13        0.17         2.42    4.52
                0.99    0.12   0.83        0.81         4.86    5.0
                2.36    0.12   0.52        0.96         2.49    7.11
Sonar           0.32    31.83  35.29       35.38        2.56    2.58
                0.91    27.21  35.19       35.29        2.72    2.78
                1.37    23.08  37.50       37.79        7.74    8.36
difference generally growing with mismatch. For example, for Diabetes when the mismatch is 0.74, the ML (sum/product) error rates are 27.29/26.58, less than the fixed combining values of 28.91/28.83. Saerens et al. (2002) used a likelihood ratio test to decide whether the a priori probabilities have significantly changed from training to test set; the solutions obtained by their EM learning were retained only if the likelihood ratios exceeded a confidence threshold. We have found this type of test largely unnecessary for our method. Only in one case of small mismatch value, for Liver Disorder, did we find that use of transductive ML gave a noticeably higher test set error rate. That is, use of our approach was typically found to either improve performance or have little effect (in the small-mismatch case).

5.5 Transductive IT Learning Versus Fixed Combining (Sum and Product). Table 3 compares error rates between IT and fixed combining. The IT method outperformed the fixed rules for all data sets, with gains generally increasing with the prior mismatch. As one example of IT's prior
Table 4: Generalization Error Rates Based on Five Replications of Twofold Cross Validation, Transductive IT and ML Methods. The last two columns give the 5 × 2 cv F test statistic for IT versus ML (sum) and IT versus ML (product).

Data Set        M       IT     ML(Sum)  ML(Prod)  F(Sum)  F(Prod)
Satellite       0.97    20.85  20.94    20.85     -       -
                1.68    25.38  25.54    25.47     -       -
Vehicle         1.27    44.59  46.15    47.42     0.86    1.49
                2.73    34.45  34.86    34.83     0.52    0.51
Segment         3.3e-2  16.13  21.58    21.56     4.37    4.33
                1.30    25.34  34.98    35.42     3.06    3.01
                1.83    10.82  12.79    13.22     1.78    2.88
                2.57    31.93  43.85    43.20     1.65    1.75
Liver Disorder  4.2e-2  44.45  45.55    48.5      0.69    1.45
                2.44    21.33  33.87    37.23     2.04    2.12
                3.54    19.42  20.58    21.51     0.54    0.72
Diabetes        1.6e-2  23.05  23.83    24.06     0.69    1.09
                0.74    23.65  24.54    24.36     1.53    1.26
                1.43    23.91  26.84    26.39     2.48    1.56
                2.57    18.31  22.78    21.47     1.23    1.58
Mushroom        1.6e-3  0.24   0.34     0.56      10.4    8.6
                0.36    0.073  0.064    0.17      1       5.8
                0.99    0.12   0.25     0.30      1.07    1.98
                2.36    0.12   0.044    0.11      4.4     0.67
Sonar           2.5e-2  29.9   31.83    31.92     1.72    1.47
                0.18    30.29  31.35    31.35     1.29    1.39
                0.61    27.60  27.79    27.69     2.67    2.43
                1.34    25.96  27.02    26.73     0.68    0.66
correction, on Diabetes with M = 2.58, the true priors were [0.2, 0.8], and the transductive estimates were [Pe[C = 0], Pe[C = 1]] = [0.29, 0.71].

5.6 Transductive ML Versus Transductive IT Learning. Table 4 compares all the transductive methods. For all data sets, IT performed better than the ML methods, except on Mushroom for the mismatch values 0.36 and 2.36. The difference in error rates in both cases is ≤ 0.1%. We attribute much of the gain of the IT method to the nonparametric nature of its combining. In encoding the given constraint information, IT does not make the statistical assumptions (presumably unwarranted) of the ML methods. We will shed further light on this point shortly, when we consider specialized tests involving explicit classifier redundancy.

5.7 Statistical Tests. To assess the statistical significance of the comparisons, we used the 5 × 2 cv F test (Alpaydin, 1999). Here, five replications of twofold cross validation are performed. The difference
between the generalization errors of the two classifiers (p_i^j) is computed, where i is the replication index and j the fold index. The estimated variance (s_i^2) of this difference is computed, with the average difference on each replication given by p̄_i = 0.5 (p_i^1 + p_i^2). The 5 × 2 cv F test statistic,

f = \frac{\sum_{i=1}^{5} \sum_{j=1}^{2} \left( p_i^j \right)^2}{2 \sum_{i=1}^{5} s_i^2},    (5.2)

is approximately F distributed (Alpaydin, 1999) with 10 and 5 degrees of freedom. We can reject the hypothesis that two algorithms have the same error rate with 95% confidence if f is greater than 4.74.
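A minimal sketch of this statistic (the helper name and array layout are our own choices):

```python
import numpy as np

def f_statistic(p):
    """Equation 5.2: the 5x2 cv F statistic. p has shape (5, 2), holding the
    difference in error rates between the two classifiers on replication i,
    fold j."""
    p = np.asarray(p, dtype=float)
    # s_i^2 = sum over the two folds of squared deviations from the fold mean
    s2 = np.var(p, axis=1, ddof=0) * 2
    return (p ** 2).sum() / (2.0 * s2.sum())

# Reject "same error rate" at 95% confidence if f_statistic(p) > 4.74,
# per the F distribution with 10 and 5 degrees of freedom.
```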
In Table 2, columns 7 and 8 give values of f for comparisons between ML (sum) and the fixed rules (sum/product). Columns 9 and 10 compare ML (product) and the fixed rules (sum/product). Note that as the prior mismatch increases, the value of f generally increases. For Sonar when the mismatch is 1.94, f ≥ 4.74 for all ML methods compared with fixed rules. In Table 3, columns 6 and 7 compare IT with the fixed combining rules. Except for Liver Disorder, all other data sets have at least one case where f ≥ 4.74. Moreover, if we compare Tables 2 and 3, we see that IT gives statistically significant improvement over the fixed methods both in more cases and for smaller mismatch values than the ML methods. IT gives significant improvement for M < 1.5 on Vehicle, Segment, Diabetes, and Mushroom. In Table 4, we compare IT with the ML methods. Although IT achieves smaller error rates, f ≤ 4.74 except on Mushroom.

5.8 Redundant Classifiers. Table 5 gives results (for a single training/test split) comparing methods in the presence of explicit classifier redundancy. We selected the second weakest classifier from the five in the ensemble (in terms of test error) and duplicated this classifier k times, for k = 1, 2, 5, and 20. The column labeled "5" gives the test error for the original ensemble (with N_e = 5), and the column "5 + k" gives the test error for the redundant ensemble (with N_e = 5 + k). Note that IT is almost perfectly robust to redundant classifiers, whereas for both ML methods the accuracy degrades as k increases. This experiment exhibits the weakness of the ML methods, and the strength of IT, in handling redundancy.

5.9 Strong Classifiers. In a distributed sensor context, each classifier may be inherently weak (e.g., with computation limited by battery power). However, when there is freedom to choose the local classifier, one might choose a strong one such as a support vector machine (SVM), which is also believed to be less sensitive to class prior mismatch than other classifiers. It is thus interesting to assess the value of transductive learning for this choice.
Table 5: Effect of Explicit Classifier Redundancy on Transductive IT and ML Test Set Classification Accuracy. For each redundancy level k, the pair of columns gives the original ensemble (5) and the redundant ensemble (5 + k).

Diabetes     5       5+1     5       5+2     5       5+5     5       5+20
Mismatch     0.7331  0.8861  0.7363  1.0308  0.7278  1.4556  0.7374  2.94
IT           21.19   21.21   24.17   24.40   21.81   22.42   21.12   22.76
Tml(sum)     30.55   34.03   36.81   43.78   29.26   39.72   32.17   48.93
Tml(pro)     28.88   33.50   36.45   46.36   25.04   41.71   27.55   56.14

Ionosphere   5       5+1     5       5+2     5       5+5     5       5+20
Mismatch     0.2660  0.3192  0.2812  0.3937  0.2678  0.5356  0.2654  1.0617
IT           14.12   14.07   14.57   14.43   12.44   12.86   15.65   16.48
Tml(sum)     15.64   15.84   16.44   18.40   15.17   18.29   19.43   24.17
Tml(pro)     14.90   15.43   16.12   16.89   14.70   16.28   16.91   21.74
Table 6: Generalization Error Based on Five Replications of Twofold Cross Validation for IT and Fixed Combining of Linear, Two-Class SVMs.

Data Set    Mismatch  SVM Error (Test)  IT     Fixed (Sum)  5 × 2 cv F Test
Sonar       0.0563    25.08             19.78  20.52        0.69
            1.3801    27.20             23.10  23.89        0.82
Ionosphere  0.0165    17.20             14.18  15.42        0.968
            2.305     31.28             19.30  26.32        6.148
Diabetes    0.0538    27.59             25.04  29.06        1.968
            1.4301    41.43             28.80  45.02        8.148
Table 6 compares the IT method (the only transductive method that works with hard classifier decisions) with fixed averaging (sum) for linear SVMs on two-class tasks. The column "SVM Error (Test)" gives the average test error of the individual SVM classifiers. For Sonar, SVM performance is only weakly sensitive to increases in the mismatch value, with the error increasing from 25.1% to 27.2%, while for the other two data sets, SVM performance is quite sensitive to mismatch. From the table, the advantage of IT over the fixed rules appears to depend on the degree of this sensitivity. The advantage of IT on Sonar is small (especially compared with the results for naive Bayes classifiers in Table 3), while the advantage in the high-mismatch case is statistically significant on both Ionosphere and Diabetes.

5.10 Missing Classes in the Test Batch. We performed two experiments to validate our claim in section 4.3 that the IT method (minimizing R in equation 4.8) handles missing classes in the test batch much better than minimizing R̃ in equation 4.6. On the two-class Diabetes, excluding class 1 from the test set, IT estimated Pe[C = 1] = 0.05 and achieved a test error rate of 2.9%. By contrast, minimizing R̃ gave Pe[C = 1] = 0.51 and a test error rate of 39.62%. On the seven-class Segment, excluding class 1, IT estimated Pe[C = 1] = 0.01, with a test error rate of 14.87%. Minimizing R̃ yielded Pe[C = 1] = 0.08 and a test error rate of 30.47%.

5.11 Batch Size. Finally, we assessed how batch size affects performance. Table 7 shows the generalization error for varying test batch size, for IT compared with ML combining (shown in parentheses). While IT improves with increasing batch size, IT performance is comparable to or better than ML when the batch size is at least 30% of the test set. Note also that ML performance seems largely invariant to batch size.
Table 7: Effect of Batch Size on Generalization Error for Transductive IT Learning, Compared with ML Sum/Product Combining (Shown in Parentheses). Columns give the batch size as a percentage of the test set.

Data Set        M       10%                  30%                  50%                  80%                  100%
Liver Disorder  4.6e-2  47.54 (46.89/47.64)  44.20 (46.34/47.10)  41.21 (45.26/46.88)  42.64 (46.40/46.15)  43.00 (46.56/46.05)
Diabetes        2.2e-2  25.12 (22.56/22.10)  22.78 (23.44/23.48)  23.28 (22.12/22.52)  23.10 (23.52/23.40)  21.58 (22.10/21.78)
6 Conclusion

We identified the distributed ensemble classification problem, discussed applications in which it arises, and proposed several transductive methods for its solution. All our methods were found to improve on fixed combining rules, with relatively large gains in accuracy achieved in the case of large class prior mismatch. The IT method generally outperformed the ML methods and is the only method suitable when local classifiers solely produce hard decisions. This method gave performance benefits when individual classifiers are both relatively weak (naive Bayes) and strong (SVMs). This approach was also demonstrated to be much more robust to classifier redundancy than the ML methods. IT also handled well the case where classes are missing from the test batch.

There are several directions to pursue in the future. First, we did not address any practical communications issues, such as quantization of soft decisions and variable delays from different classifiers, which could disrupt the synchronization needed to aggregate decisions. We also did not consider possible information exchange between local classifiers. Second, a theoretical study of conditions under which the ensemble class priors converge to the true ones could be undertaken. Third, our approaches could be improved by conveying information beyond local decisions. In Miller and Pal (2005), we proposed a method based on mixture modeling for encoding novel constraints involving continuous features, within a maximum entropy framework. This approach could be applied within our distributed ensemble setting, with information based on a local sensor's continuous features (i.e., a posteriori probabilities on mixture component indices) conveyed along with local decisions, and with associated constraints encoded. Finally, other aggregation rules could be explored; for example, it is straightforward to extend the ML methods to produce the alpha-trimmed mean in Tumer and Ghosh (2002). It would be interesting to determine whether other combining rules also arise as ML learning solutions.

Appendix: EM Framework for Transductive ML Techniques

For both transductive ML techniques, the data treated as missing within the EM framework are the class labels for the new batch samples. Let Z_ik = 1 if sample x_i (exclusively) belongs to class k, and Z_ik = 0 otherwise. Then, for the independence case of section 3.1, the complete data log likelihood is
\log L_c(X_{test}) = \sum_{i=1}^{N_{test}} \sum_{k=1}^{N_c} Z_{ik} \left[ \log P_e[C = k] + \sum_{j=1}^{N_e} \log p_j\!\left( x_i^{(j)} \mid C = k; \theta_{jk} \right) \right].    (A.1)
Here, the parameters that specify the class-conditional models, θ_{jk}, even though in general unknown, are introduced for completeness. In the E-step, we compute the expected value of log L_c(X_{test}) given the current parameter estimates—that is, given Θ ≡ {{P_e[C = k]}, {θ_{jk}}}. This is

E[\log L_c(X_{test}) \mid \Theta] = \sum_{i=1}^{N_{test}} \sum_{k=1}^{N_c} E[Z_{ik} \mid \Theta] \left[ \log P_e[C = k] + \sum_{j=1}^{N_e} \log p_j\!\left( x_i^{(j)} \mid C = k; \theta_{jk} \right) \right].    (A.2)
The expectations are just the ensemble class a posteriori probabilities, which can be derived under the feature vector independence assumption:

E[Z_{ik} \mid \Theta] \equiv P_e[C = k \mid x_i] = \frac{1}{Z} \, P_e[C = k] \prod_{j=1}^{N_e} p_j\!\left( x_i^{(j)} \mid C = k \right)
= \frac{1}{Z'} \, P_e[C = k] \prod_{j=1}^{N_e} \frac{P_j\!\left[ C = k \mid x_i^{(j)} \right]}{P_j[C = k]},    (A.3)

where Z and Z' are the proper normalizations, ensuring a valid probability mass function. The E-step is thus specified by equation 3.4. In the M-step, we maximize equation A.2 over the parameters to be adjusted, in this case {P_e[C = k]}, with the E-step probabilities held fixed. This gives the update in equation 3.5. Likewise, for the ML problem in section 3.2, we can derive the expected complete data-censored log likelihood:
E[\log L_{c\text{-}cens}(X_{test}) \mid \Theta] = \sum_{i=1}^{N_{test}} \sum_{k=1}^{N_c} E[Z_{ik} \mid \Theta] \left[ \log P_e[C = k] + \log\!\left( \frac{1}{N_e} \sum_{j=1}^{N_e} p_j\!\left( x_i^{(j)} \mid C = k; \theta_{jk} \right) \right) \right].    (A.4)
In this case, the E-step probabilities are derived using

E[Z_{ik} \mid \Theta] = P_e[C = k \mid x_i] = \sum_{j=1}^{N_e} P_e[C = k \mid x_i, j \text{ uncensored}] \, \frac{1}{N_e}
= \frac{1}{Z} \sum_{j=1}^{N_e} p_j\!\left( x_i^{(j)} \mid C = k \right) P_e[C = k]
= \frac{1}{Z'} \sum_{j=1}^{N_e} \frac{P_j\!\left[ C = k \mid x_i^{(j)} \right]}{P_j[C = k]} \, P_e[C = k],    (A.5)

with the factors 1/N_e absorbed into the normalizations Z and Z'.
The E-step is thus specified by equation 3.7. Maximizing equation A.4 with respect to {Pe [C = k]} given fixed E-step probabilities again gives equation 3.5.
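Operationally, the EM iteration for the independence case is compact; the following sketch gives the E-step of equation A.3 and the corresponding prior update (array names are ours; the M-step shown is simply the mean of the E-step posteriors, which follows from maximizing equation A.2 over the priors under the constraint that they sum to one):

```python
import numpy as np

def e_step_independent(prior, local_post, local_prior, eps=1e-12):
    """E-step of equation A.3: ensemble posterior under the feature-vector
    independence assumption.
      local_post[j, i, k]  = P_j[C = k | x_i^(j)]
      local_prior[j, k]    = P_j[C = k]
      prior[k]             = current estimate of Pe[C = k]."""
    ratio = local_post / (local_prior[:, None, :] + eps)   # (J, N, K)
    post = prior[None, :] * np.prod(ratio, axis=0)          # (N, K)
    return post / post.sum(axis=1, keepdims=True)           # normalization Z'

def m_step(post):
    """M-step: maximizing equation A.2 over {Pe[C = k]} with the E-step
    probabilities fixed yields the average of the posteriors."""
    return post.mean(axis=0)
```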
References

Alpaydin, E. (1999). Combined 5 × 2 cv F test for comparing supervised classification learning algorithms. Neural Comp., 11(8), 1885–1892.
Basak, J., & Kothari, R. (2004). A classification paradigm for distributed vertically partitioned data. Neural Comp., 16(7), 1525–1544.
Breiman, L. (1996). Bagging predictors. Mach. Learn., 26, 123–140.
Chen, K., Wang, L., & Chi, H. (1997). Methods of combining multiple classifiers with different features and their applications to text independent speaker identification. Intl. J. of Patt. Rec. and Art. Intell., 11(3), 417–445.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
D'Costa, A., Ramachandran, V., & Sayeed, A. (2004). Distributed classification of gaussian space-time sources in wireless sensor networks. IEEE J. Sel. Areas in Comm., 22, 1026–1036.
Della Pietra, S., Della Pietra, V. J., & Lafferty, J. D. (1997). Inducing features of random fields. IEEE Trans. Patt. Anal. Mach. Intell., 19(4), 380–393.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum-likelihood from incomplete data via the EM algorithm. J. Royal Stat. Soc., Ser. B, 39, 1–38.
Dietterich, T. G., & Kong, E. B. (1995). Error-correcting output coding corrects bias and variance. In Proceedings of the 12th International Conference on Machine Learning (pp. 313–321). San Francisco: Morgan Kaufmann.
Freund, Y., & Schapire, R. (1996). Experiments with a new boosting algorithm. In Proceedings of the 13th International Conference on Machine Learning (pp. 148–156). San Francisco: Morgan Kaufmann.
Hansen, L. K., & Salamon, P. (1990). Neural network ensembles. IEEE Trans. Patt. Anal. Mach. Intell., 12(10), 993–1001.
Jacobs, R. A., Jordan, M. I., Nowlan, S. J., & Hinton, G. E. (1991). Adaptive mixtures of local experts. Neural Comp., 3(1), 79–87.
Jaynes, E. T. (1989). Papers on probability, statistics, and statistical physics (2nd ed.). Norwell, MA: Kluwer.
Joachims, T. (1999). Transductive inference for text classification using support vector machines. In Proceedings of the 16th International Conference on Machine Learning (pp. 200–209). San Francisco: Morgan Kaufmann.
Kang, H. J., Kim, K., & Kim, J. H. (1997). Optimal approximation of discrete probability distribution with kth-order dependency and its application to combining multiple classifiers. Patt. Rec. Lett., 18, 515–523.
Landgrebe, D. A. (2005). Multispectral land sensing: Where from, where to? IEEE Trans. Geo. Rem. Sens., 43(3), 414–421.
McLachlan, G., & Peel, D. (2000). Finite mixture models. New York: Wiley.
Miller, D. J., & Pal, S. (2005). An extension of iterative scaling for joint decision-level and feature-level fusion in ensemble classification. In 2005 International Workshop on Machine Learning for Signal Processing. New York: IEEE.
Miller, D. J., & Uyar, H. S. (1997). A mixture of experts classifier with learning based on both labeled and unlabeled data. Neural Info. Proc. Sys., 9, 571–577.
Miller, D. J., & Yan, L. (1999). Critic-driven ensemble classification. IEEE Trans. Sig. Proc., 47(10), 2833–2844.
Miller, D. J., & Yan, L. (2000). Approximate maximum entropy joint feature inference consistent with arbitrary lower-order probability constraints: Application to statistical classification. Neural Comp., 12(9), 2175–2207.
Redner, R. A., & Walker, H. F. (1984). Mixture densities, maximum likelihood, and the EM algorithm. SIAM Review, 26, 195–239.
Saerens, M., Latinne, P., & Decaestecker, C. (2002). Adjusting the outputs of a classifier to new a priori probabilities: A simple procedure. Neural Comp., 14(1), 21–41.
Tumer, K., & Ghosh, J. (1996). Error correlation and error reduction in ensemble classifiers. Connection Science, 8(3/4), 385–404.
Tumer, K., & Ghosh, J. (2002). Robust combining of disparate classifiers through order statistics. Patt. Anal. & Apps., 5(2), 189–200.
Varshney, P. (1997). Distributed detection and data fusion. New York: Springer.
Xu, L., Krzyzak, A., & Suen, C. Y. (1992). Methods of combining multiple classifiers and their applications to handwriting recognition. IEEE Trans. Sys., Man, Cyb., 22, 418–435.
Received February 19, 2006; accepted July 11, 2006.
ARTICLE
Communicated by Erkki Oja
Synergies Between Intrinsic and Synaptic Plasticity Mechanisms

Jochen Triesch
[email protected]
Frankfurt Institute for Advanced Studies, Johann Wolfgang Goethe University, 60438 Frankfurt am Main, Germany, and Department of Cognitive Science, University of California, San Diego, La Jolla, CA 92093-0515, U.S.A.
We propose a model of intrinsic plasticity for a continuous activation model neuron based on information theory. We then show how intrinsic and synaptic plasticity mechanisms interact and allow the neuron to discover heavy-tailed directions in the input. We also demonstrate that intrinsic plasticity may be an alternative explanation for the sliding threshold postulated in the BCM theory of synaptic plasticity. We present a theoretical analysis of the interaction of intrinsic plasticity with different Hebbian learning rules for the case of clustered inputs. Finally, we perform experiments on the "bars" problem, a popular nonlinear independent component analysis problem.

1 Introduction

Synaptic learning mechanisms have always been a central area of research in neural computation, but there has been relatively little work on intrinsic plasticity (IP) mechanisms, which change a neuron's intrinsic excitability. Such changes have been observed in many species and brain areas (Desai, Rutherford, & Turrigiano, 1999; Zhang & Linden, 2003; Daoudal & Debanne, 2003; Zhang, Hung, Zhu, Xie, & Wang, 2004; Cudmore & Turrigiano, 2004; van Welie, van Hooft, & Wadman, 2004, 2006), and they can be measured as changes to a neuron's frequency-current (f-I) function, such as alterations to the spike threshold or the initial slope of the f-I function. Interestingly, such effects seem to occur on different timescales, with some changes occurring over days (e.g., Desai et al., 1999) and others occurring much faster (Cudmore & Turrigiano, 2004; van Welie et al., 2004, 2006). Despite recent progress, the biophysical implementation of these mechanisms remains somewhat unclear, and the exact computational functions of these forms of plasticity are not well established either.

Changes to the intrinsic excitability of neurons may help to shape the dynamics of neural circuits (Marder, Abbott, Turrigiano, Liu, & Golowasch, 1996) in desired ways. It is also reasonable to assume that IP contributes to neural homeostasis by keeping the level of activity of individual neurons and neural

Neural Computation 19, 885–909 (2007)
© 2007 Massachusetts Institute of Technology
circuits in a desirable regime. It is less clear, however, what exactly such a desirable regime may be. It has been proposed that efficient sensory coding can be obtained by adjusting neural responses to match the statistics of signals from the environment in a way that maximizes information transmission. In particular, if a neuron uses all of its possible response levels equally often, then its entropy is maximized (Laughlin, 1981). For visual cortex of cats and macaques, Baddeley, Abbott, Booth, Sengpiel, and Freeman (1998) have observed that individual neurons exhibit an approximately exponential distribution of firing rates in response to stimulation with natural videos. They have argued that this may be motivated through the fact that the exponential distribution maximizes entropy for a fixed mean. Thus, neurons may assume exponential firing-rate distributions in order to maximize the information they convey to downstream targets given a fixed average activity level, which corresponds to a certain energy budget.

Based on this idea, Stemmler and Koch (1999) developed a model of IP for a Hodgkin-Huxley-style model neuron. They derived a learning rule for the adaptation of a number of voltage-gated channels to match the firing-rate distribution of the unit to a desired distribution. Shin, Koch, and Douglas (1999) adjusted two conductances in electronic circuit analogs of neurons to adapt to the mean and variance of the time-varying somatic input current.

In this work, we describe an IP model for a simple continuous-activation model neuron with only two parameters. We then study the interaction of intrinsic and synaptic learning processes and show how a model neuron with IP at the soma and Hebbian learning at the synapses can discover heavy-tailed directions in its input. We analyze the interaction of IP with different Hebb rules, including a BCM rule (Bienenstock, Cooper, & Munro, 1982). We also show how IP may function as an alternative to the sliding threshold mechanism assumed to balance long-term potentiation (LTP) and long-term depression (LTD) in BCM theory. Finally, we demonstrate how a single unit with IP and Hebbian learning discovers an independent component in the "bars" problem, a standard unsupervised learning problem.

2 A Model of Intrinsic Plasticity

Since the long-term goal of our research is to study the effects of IP at the level of large networks, we choose to use a simple rate-coding model neuron. We model IP as changes to the neuron's nonlinear activation function. We consider a model neuron with a parametric sigmoid nonlinearity of the form

y = g_{a,b}(x) = \frac{1}{1 + \exp(-(a x + b))}.    (2.1)
Here, x denotes the total synaptic current arriving at the soma, and y is the neuron’s firing rate. The parameters a and b change the slope and offset
of the sigmoid nonlinearity. In Triesch (2005b), we proposed an update rule for the parameters of the sigmoid based on the idea of moment matching. This rule took the form of two proportional control laws coupling the offset and slope of the sigmoid to estimates of the first and second moment of the neuron's firing-rate distribution.1 More recently, we derived a gradient rule that aims to directly minimize the Kullback-Leibler divergence between the neuron's current firing-rate distribution and the optimal exponential distribution (Triesch, 2005a). As derived in the appendix, this ansatz leads to the following discrete learning rule: a ← a + Δa and b ← b + Δb, with

\Delta a = \eta_{IP} \left( \frac{1}{a} + x - \left( 2 + \frac{1}{\mu} \right) x y + \frac{1}{\mu} x y^2 \right)    (2.2)
\Delta b = \eta_{IP} \left( 1 - \left( 2 + \frac{1}{\mu} \right) y + \frac{1}{\mu} y^2 \right),    (2.3)
where µ is the neuron's desired mean activity and η_IP is a small learning rate. This rule is very similar to the one derived by Bell and Sejnowski (1995) for a single sigmoid neuron maximizing its entropy. However, our rule has additional terms stemming from the objective of keeping the mean firing rate low. Note that this rule is strictly local: the only quantities used to update the neuron's nonlinear transfer function are x, the total synaptic current arriving at the soma, and the firing rate y. Furthermore, it relies on only the neuron's current activity y and its squared activity y^2, which might be estimated from the cell's internal calcium concentration.

2.1 Experiments with the Intrinsic Plasticity Model

2.1.1 Behavior for Different Distributions of Total Synaptic Current. To illustrate the behavior of the learning rule, we consider a single model neuron with a fixed distribution of synaptic current f_x(x). Figure 1 illustrates the result of adapting the neuron's nonlinear transfer function to three different input distributions: gaussian, uniform, and exponential. We set the unit's desired mean activity to µ = 0.1, a tenth of its maximum activity, to reflect the low firing rates observed in cortex. The following results do not critically depend on the precise value of µ as long as µ is small (µ ≪ 1). In each case, the sigmoid nonlinearity moves close to the optimal nonlinearity that would result in an exactly exponential distribution of the firing rate, resulting in a sparse distribution of firing rates. Note that since the sigmoid nonlinearity bounds activity to be less than one (y < 1), the resulting distribution p_y(y) necessarily has a truncated tail.
1 In this work the sigmoid nonlinearity was parameterized slightly differently, with a representing the inverse slope of the sigmoid.
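A minimal sketch of the rule, equations 2.1 to 2.3, applied online to a stream of synaptic currents (the variable names and the gaussian input are illustrative choices):

```python
import numpy as np

def ip_step(a, b, x, mu=0.1, eta=0.001):
    """One update of the intrinsic plasticity rule (equations 2.1-2.3)
    for a single input sample x."""
    y = 1.0 / (1.0 + np.exp(-(a * x + b)))                          # eq. 2.1
    da = eta * (1.0 / a + x - (2.0 + 1.0 / mu) * x * y + x * y * y / mu)
    db = eta * (1.0 - (2.0 + 1.0 / mu) * y + y * y / mu)
    return a + da, b + db

# Driving the unit with gaussian synaptic current (as in Figure 1a-c):
rng = np.random.default_rng(0)
a, b = 1.0, 0.0
for x in rng.standard_normal(200_000):
    a, b = ip_step(a, b, x)
# The firing rates y now follow an approximately exponential
# distribution with mean close to mu.
```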
Figure 1: Dynamics of the IP rule for various input distributions. (a–c) Gaussian input distribution. (a) The phase plane diagram. Arrows indicate the flow field of the system. Dotted lines indicate approximate locations of the nullclines (found numerically). Two example trajectories are exhibited (solid lines), which converge to the stationary point (marked with a circle). (b) The resulting activity distribution (solid curve), which approximates that of an exponential distribution (dotted). (c) The theoretically optimal transfer function (dotted) and the learned sigmoidal transfer function (solid). The gaussian input distribution (dashed, not drawn to scale) is also shown. (d, e) Same as c but for uniform and exponential input distribution. Parameters were µ = 0.1, ηIP = 0.001.
Also, since the sigmoid has only two degrees of freedom, the match is usually not perfect. As can be seen in the figure, large deviations from the optimal transfer function can sometimes be observed where the probability density of the input distribution is low, and this leads to deviations from the desired exponential activity distribution. A closer fit could be obtained with a different nonlinearity g with more free parameters. Since it is unclear, however, what additional degrees of freedom biological neurons have in this respect, we chose to stick with equation 2.1.

When a constant input x_0 is presented to the unit—f_x(x) is described by a δ-function—the slope parameter a will diverge as the unit attempts to adjust its nonlinearity to match its deterministic output g_{a,b}(x_0) to the desired exponential distribution. Due to the simultaneous adjustment of the threshold parameter b, however, the unit's activity will remain close to µ.

The results obtained with equations 2.2 and 2.3 are very similar to the ones from our previous study, which used a set of two simple control laws
Figure 2: Response to sudden deprivation from afferent input. (a) The unit’s firing rate y as a function of the number of presented inputs (time t). We plot the response only to every 10th input. After 104 time steps, the standard deviation of the input signal is reduced by a factor of five. In response, the neuron adjusts its intrinsic excitability to restore its desired exponential firing-rate distribution. (b) The nonlinear activation function before the deprivation and after adaptation to deprivation. (c) The distribution of firing rates before deprivation, immediately after deprivation, and after adaptation to deprivation.
that updated the sigmoid nonlinearity based on estimates of the first and second moment of the neuron's firing-rate distribution (Triesch, 2005b). The current rule seems to lead to faster convergence, however.

2.1.2 Response to Sudden Sensory Deprivation. The proposed form of plasticity of a neuron's intrinsic excitability may help neurons maintain stable firing-rate distributions in the presence of systematic changes to their afferent input. An interesting special case of such a change is that of sensory deprivation. Here, we model sensory deprivation as a fivefold reduction of the standard deviation σ of a gaussian-distributed total synaptic current x arriving at the soma. Figure 2 shows how the neuron learns to adjust to the sudden change of the distribution of total synaptic current occurring after 10^4 time steps. After a transient phase of low activity, the neuron gradually reestablishes its desired firing-rate level (see Figure 2a) by making its activation function steeper; that is, the neuron becomes more excitable. This is illustrated in Figure 2b, which shows the neuron's activation function
before the reduction of afferent input and after adjusting to the new input distribution. Figure 2c shows how the distribution of firing rates is affected by the sudden reduction in afferent input and the subsequent adjustment of the neuron's excitability. After adjustment, the neuron regains its original, approximately exponential firing-rate distribution.

2.2 Discussion of the IP Model. It is interesting to note that equation 2.1 can be viewed as linearly transforming the input x before passing it through a standard sigmoid function. Such a linear rescaling could also be obtained by modifying all synaptic weights (including a "bias" weight) entering the neuron. Thus, there is a redundancy between the neuron's intrinsic and synaptic degrees of freedom. This "redundancy" of neural degrees of freedom in our model is specific to our choice of the nonlinearity and its parameterization. In fact, the derivation of the learning rule can be generalized to other nonlinearities with different adjustable parameters. The only requirements are that the nonlinearity g should be strictly monotonically increasing and differentiable with respect to x, and the partial derivatives of log ∂y/∂x with respect to the parameters of g (in the above case, a and b) must exist.

Nevertheless, there is ample evidence that homeostatic synaptic mechanisms also contribute to keeping a neuron's activity in a desirable regime (Turrigiano & Nelson, 2000, 2004), especially if the afferent input to the neuron has been reduced (Maffei, Nelson, & Turrigiano, 2004). From a self-organization viewpoint, however, it may be easiest and most effective to implement mechanisms that change the spiking statistics of the neuron directly at the soma—as locally as possible. But since such mechanisms may operate over only a limited range, they may rely on other processes such as synaptic scaling to keep the distribution of currents arriving at the soma in a desired range. In fact, most recent evidence suggests that such synaptic scaling is sensitive to glutamate levels in the extracellular medium (Stellwagen & Malenka, 2006), which is more directly related to the synaptic input x to the neuron rather than its firing activity. Thus, the brain may use synaptic scaling to keep the level of input to a neuron in a desired regime while using intrinsic mechanisms to map this input to its own firing activity in a nearly optimal way.

Changes in the slope and threshold of a neuron's f-I curve, as well as combinations of both, have been observed experimentally. Such changes occur through the modification of voltage-gated channels including, for example, Na+ channels (Aizenman, Akerman, Jensen, & Hollis, 2003), K+ channels (van Welie et al., 2006), and I_h conductances (van Welie et al., 2004). Many of such modifications seem to depend on the cell's internal Ca2+ concentration, which is often regarded as an indicator of the neuron's activity level. Modeling studies have also revealed that concerted changes of I_A and I_h could implement the kind of gain changes associated with modification of the a-parameter in our parametric sigmoid (Burdakov, 2005). In sum,
it appears plausible that changes to the f-I function as predicted by our model could be implemented through the modification of voltage-gated channels.

The most important prediction of our IP model(s) for experiments is that changes to the intrinsic excitability of a neuron should not be driven only by the average spiking activity of the neuron—the first moment of the neuron's activity—but should also depend on higher moments. In particular, our gradient rule and our earlier rule based on moment matching (Triesch, 2005b) rely on the second moment of the neuron's activity. This implies that experimental manipulations that leave a neuron's average activity constant but change the second moment of its activity will lead to systematic changes in the neuron's f-I curve. The required estimate of the second moment of the neuron's firing rate may be implemented by an agent A that binds two Ca2+ ions: A + 2Ca2+ → A*. The concentration of A* could then approximate the square of the current firing rate. Furthermore, we hypothesize that the functional role of homeostatic synaptic plasticity is to maintain the synaptic current arriving at the soma in a desired regime, while the role of IP is to map this distribution of synaptic current onto the desired distribution of spiking activity. In principle, this hypothesis can be tested by any experimental technique that allows independently controlling a neuron's synaptic input and its firing activity.

3 Interaction of IP with Hebbian Learning

Since IP appears to be a ubiquitous phenomenon in the brain, it is important to study how it interacts with other forms of plasticity, in particular with associative synaptic learning mechanisms. We begin by considering a standard Hebbian learning rule,

\Delta w = \eta_{Hebb} \, u \, y(u) = \eta_{Hebb} \, u \, g_{a,b}(w^T u),    (3.1)
where u is the vector of synaptic inputs to the neuron and η_Hebb is a small learning rate. Note that since y is positive, Δw_i will always be positive if u_i is positive. Thus, for positive inputs u_i, the strength of weight w_i can only increase. Such uncontrolled weight growth is undesirable, even if the IP counteracts this effect and stabilizes the neuron's activity. In order to avoid unlimited weight growth, we keep the weight vector fixed at unit length through the multiplicative weight normalization w ← w/‖w‖. Future work could also incorporate a model of synaptic scaling to keep the input to the neuron in a desired regime (see the discussion in section 2). Since synaptic scaling also seems to function in a multiplicative fashion (e.g., Turrigiano, Leslie, Desai, Rutherford, & Nelson, 1998), we expect that this would lead to similar results.
3.1 Behavior in the Limit of Fast IP. It is instructive to consider the limiting case of fast IP: η_IP ≫ η_Hebb. In this case, the IP will have generated an approximately exponential distribution of y(u) before the weight vector w can change much. Note that our discussion is independent of the exact IP mechanism (e.g., the specific learning rule introduced above) as long as the IP mechanism is effective in producing approximately exponential firing-rate distributions. In the case of fast IP, it can be seen from equation 3.1 that the expected value of the weight update, given by E[Δw] = η_Hebb E[u y(u)], is an exponentially weighted sum of all the inputs u, with the firing rates y(u) functioning as the exponentially distributed weighting factors. A small number of inputs that produce the highest firing rates will exert a strong pull on the weight vector, while the majority of inputs lead to a small firing rate and change the weight vector only marginally. This mechanism can lead to the discovery of heavy-tailed directions in the input, as illustrated by the following experiment. A careful analysis for clustered inputs is given below.

We consider a model neuron with just two inputs. The input distribution is such that there is one heavy-tailed direction and an orthogonal direction without heavy tails. At the same time, the distribution is white—that is, the covariance matrix of the inputs is the 2 × 2 identity matrix. For concreteness we consider the Laplace band whose pdf is given by

p(u_1, u_2) = p(u_1) \, p(u_2) =
\begin{cases}
\frac{1}{2\sqrt{6}} \exp\!\left( -\sqrt{2} \, |u_1| \right) & \text{for } |u_2| \le \sqrt{3} \\
0 & \text{for } |u_2| > \sqrt{3}.
\end{cases}    (3.2)
Along the u1 -direction, this distribution has the shape of a Laplacian with exponential tails, while along the u2 -direction, this distribution is uniform with no tails at all (see Figure 3a). Note that since this distribution has the identity matrix as its covariance matrix, standard Hebbian learning in a linear unit would perform a random walk in the weight space; it cannot discover the heavy-tailed direction. In combination with IP, however, the weight vector is drawn toward the heavy-tailed direction (see Figures 3b and 3c; parameters were µ = 0.1, ηIP = 0.01, ηHebb = 0.001). This process can be understood as follows. Consider the weight vector in an oblique orientation (the thin arrow in Figure 3b). For this weight vector, we plot a sample of inputs (circles) with their sizes representing how strongly they will activate the unit. The inputs leading to the highest firing rates (biggest circles) are more likely to appear to the right of the weight vector, at the far right of the input distribution. This imbalance will cause the weight vector to turn clockwise in the direction of the heavy tail. This process ends when
Figure 3: Discovery of heavy-tailed direction in a model neuron with just two inputs. See the text for details.
the weight vector has aligned with the heavy-tailed u_1-direction, which corresponds to an orientation of zero degrees in Figure 3c. We have also experimented with a number of other input distributions and found that the property of seeking out the more heavy-tailed directions is a robust effect. Figure 3d demonstrates this for the case of a white input distribution that is the product of a gaussian along the u_2 direction and a Laplacian along the u_1 direction. The weight vector aligns with the more heavy-tailed u_1 direction (parameters as above). Note that while we have assumed IP being much faster than Hebbian learning, the same behavior is also exhibited if IP is slower than Hebbian learning. This is a robust effect that holds true in this example and for all experiments discussed in this article (an example is given in Figure 8).

This mechanism for finding heavy-tailed directions is closely related to a number of approaches for independent component analysis (ICA) that use nonlinear Hebbian terms in their learning rules. A prominent example is the FastICA algorithm (Hyvärinen & Oja, 1997; Hyvärinen, Karhunen, & Oja, 2001). Since FastICA works with a fixed nonlinearity, it is not surprising that our method finds heavy-tailed directions even when IP is slow. However, one of the benefits of IP is that it drives the neuron toward using an energy-efficient code.
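The following sketch reproduces the spirit of this experiment: IP (equations 2.2 and 2.3) combined with the Hebbian rule of equation 3.1 and multiplicative normalization, applied to samples from the Laplace band of equation 3.2. The initial conditions and iteration count are illustrative choices, not taken from the paper:

```python
import numpy as np
rng = np.random.default_rng(0)

mu, eta_ip, eta_hebb = 0.1, 0.01, 0.001        # parameters from the text
a, b = 1.0, 0.0
w = np.array([np.cos(1.0), np.sin(1.0)])       # oblique initial orientation

for _ in range(1_000_000):
    # One sample from the Laplace band (equation 3.2): unit-variance
    # Laplacian along u1, unit-variance uniform along u2.
    u = np.array([rng.laplace(0.0, 1.0 / np.sqrt(2.0)),
                  rng.uniform(-np.sqrt(3.0), np.sqrt(3.0))])
    x = w @ u
    y = 1.0 / (1.0 + np.exp(-(a * x + b)))
    # IP step (equations 2.2-2.3) followed by a Hebbian step (equation 3.1)
    a += eta_ip * (1.0 / a + x - (2.0 + 1.0 / mu) * x * y + x * y * y / mu)
    b += eta_ip * (1.0 - (2.0 + 1.0 / mu) * y + y * y / mu)
    w += eta_hebb * u * y
    w /= np.linalg.norm(w)                     # multiplicative normalization

# Orientation tends toward the heavy-tailed u1 axis (0 degrees, up to sign).
print(np.degrees(np.arctan2(w[1], w[0])))
```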
Figure 4: Illustration of alternative Hebbian learning rules, plotting the functions Ω(y) for three different learning rules. For the covariance and BCM rules, we plot Ω for the thresholds that balance LTP and LTD. For the BCM rule, Ω has been scaled by a factor of five for better visibility. The dotted horizontal line marks the boundary between LTP (Ω(y) > 0) and LTD (Ω(y) < 0). The dashed line shows the relative probability of firing rate y (not drawn to scale).
3.2 Alternative Hebbian Learning Rules. We now discuss the interaction of IP with alternative Hebbian learning rules. We focus on generalizations of the Hebbian learning rule, equation 3.1, of the kind

\Delta w = \eta_{Hebb} \, u \, \Omega(y(u)),    (3.3)

where Ω is a generally nonlinear function. The choice Ω_id(y) = y corresponds to the standard Hebbian learning rule from above. Other choices lead to different learning rules, as illustrated in Figure 4. The choice Ω_cov(y) = y − θ_cov is a so-called covariance rule, since it can lead to the discovery of the principal eigenvector of the input covariance matrix in a linear unit (e.g., Dayan & Abbott, 2001). To this end, θ_cov is set to be an online estimate of the unit's mean activity. The choice Ω_BCM(y) = (y − θ_BCM) y is a so-called quadratic BCM rule (Cooper, Intrator, Blais, & Shouval, 2004), which automatically stabilizes weight growth. Here, θ_BCM is usually an online estimate of the unit's mean squared firing rate. In our analysis, we will keep the thresholds θ_cov and θ_BCM fixed, since the IP drives the unit to a fixed firing-rate distribution automatically.

Both the covariance and the BCM rule differ from the standard rule by introducing long-term depression (LTD) in addition to long-term potentiation (LTP). In connection with IP as introduced above, it is easy to relate the amount of LTP versus LTD to the thresholds used in the learning rules.
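Written out as code, the three choices of Ω are simply (a trivial sketch; the names are ours):

```python
def omega_id(y):
    """Standard Hebb rule: Omega_id(y) = y (equation 3.1)."""
    return y

def omega_cov(y, theta_cov):
    """Covariance rule: Omega_cov(y) = y - theta_cov."""
    return y - theta_cov

def omega_bcm(y, theta_bcm):
    """Quadratic BCM rule: Omega_BCM(y) = (y - theta_bcm) * y."""
    return (y - theta_bcm) * y

# Generic update of equation 3.3, e.g. with the BCM rule:
#   w += eta_hebb * u * omega_bcm(y, theta_bcm)
```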
A natural choice for the thresholds is one that balances LTP and LTD. The specific condition for this balance is that Ω(y) is on average zero:2

\int_0^\infty \frac{1}{\mu} \exp\!\left( -\frac{y}{\mu} \right) \Omega(y) \, dy = 0.    (3.4)
In the case of the covariance rule, this condition leads to θ_cov = µ; that is, LTD is exhibited for inputs that lead to below-average firing rates, and LTP is exhibited for above-average firing rates (see Figure 4). For the BCM rule, equation 3.4 leads to θ_BCM = 2µ (using the facts that E[y] = µ and E[y^2] = 2µ^2 for an exponential distribution with mean µ). Thus, in this balanced LTP-LTD regime, the BCM rule will exhibit maximal LTD for inputs that correspond to the mean firing rate, and LTP occurs only for inputs that lead to firing rates greater than twice the average firing rate (compare Figure 4). Firing rates in the tail of the exponential distribution have a very strong influence on the weight vector due to the quadratic increase of Ω(y).

Empirically, we found that such alternative Hebbian learning rules are also effective in finding heavy-tailed directions when combined with our IP mechanism. Based on the arguments on the relation of nonlinear Hebbian learning rules to ICA algorithms discussed above, this is not surprising. In particular, it has been argued that BCM-type learning rules can discover sparse directions in the absence of any IP (Cooper et al., 2004).

3.3 IP as an Alternative to the Sliding Threshold in BCM Theory. A hallmark of the BCM theory of synaptic plasticity is the idea of a sliding threshold that separates postsynaptic firing rates producing LTP versus LTD (Cooper et al., 2004). LTP occurs only when the postsynaptic activity is above the threshold θ_BCM, but θ_BCM is assumed to change as a function of the cell's activity history. The evidence for such a sliding LTP-LTD threshold seems to be only indirect, however, and IP may present a plausible alternative to the sliding threshold proposed in BCM learning.

One of the pieces of evidence put forward as support for the idea of a sliding threshold is data from Kirkwood and colleagues demonstrating that for dark-reared cats, weaker inputs to a neuron are sufficient to produce LTP when compared to normally reared cats (Kirkwood, Rioult, & Bear, 1996). The threshold stimulation frequency that separates LTD and LTP is smaller for the dark-reared cats, which is consistent with the idea of a sliding threshold on the postsynaptic activity level.
2 This may not be the only sensible choice, however. Another form of balance would result from requiring that half of the inputs induce LTP while the other half induce LTD; that is, we require Ω(µ ln 2) = 0, where µ ln 2 is the median firing activity of the unit. This choice leads to θ_cov = θ_BCM = µ ln 2.
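The two balanced threshold values can be checked numerically against equation 3.4; a small quadrature sketch (our own verification, assuming SciPy is available):

```python
import numpy as np
from scipy.integrate import quad

mu = 0.1
exp_pdf = lambda y: np.exp(-y / mu) / mu  # exponential firing-rate density

# E[Omega] under the exponential distribution, per equation 3.4:
cov_balance, _ = quad(lambda y: exp_pdf(y) * (y - mu), 0, np.inf)
bcm_balance, _ = quad(lambda y: exp_pdf(y) * (y - 2 * mu) * y, 0, np.inf)
print(cov_balance, bcm_balance)  # both ~0: theta_cov = mu, theta_bcm = 2*mu
```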
Figure 5: Intrinsic plasticity may provide an alternative explanation to the idea of a sliding threshold as introduced in the BCM theory of Hebbian plasticity. The change of the weight Δw is plotted as a function of the strength of the input driving the neuron. For a neuron that experiences four times lower variance inputs, the plasticity curve shifts to the left because IP makes the neuron more excitable. Due to this, the unit in the low-variance input environment will exhibit a stronger response and thus a stronger change in weight for the same input.
But an alternative explanation of these data is that neurons in the dark-reared cats are more excitable, such that the same stimulation frequency leads to higher firing levels, and this is why weaker inputs can already induce LTP, even for a fixed modification threshold.

Figure 5 illustrates this effect. It shows a plasticity curve, where the change of the weight Δw, corresponding to the amount of LTP or LTD, is plotted as a function of the strength of the input to the neuron. We plot this for two neurons that have received different levels of stimulation in the past and have adjusted their activation functions accordingly. Both units receive white zero-mean gaussian input of different variance. The unit receiving input with four times lower variance corresponds to the situation of the dark-reared cats. Due to the low-variance input, it has assumed a steeper activation function that produces higher outputs for the same input, as shown in Figure 2. As a consequence, its plasticity function is shifted to the left. Thus, shifts in the plasticity function can in principle be explained by IP and do not require the assumption of a sliding threshold. This mechanism may work in conjunction with homeostatic synaptic scaling mechanisms, as mentioned earlier.

3.4 Analysis for Clustered Inputs. A more formal analysis of the interaction between IP and Hebbian learning rules can be performed when we assume that the input distribution is a mixture of N well-separated clusters in a high-dimensional input space. Let the cluster centers be at locations c_i, i = 1, ..., N. We treat only the case where the clusters have equal
Figure 6: Behavior of three Hebbian learning rules for clustered input. The contribution that each input cluster makes to the weight vector is plotted as a function of the cluster number.
prior probabilities. The extension to the general case is straightforward. As shown in the appendix, for the case of the standard Hebbian learning rule, the stationary solutions for the weight vector w will be a weighted sum of the cluster centers c_i,

w = \sum_{i=1}^{N} f_i^{Hebb} c_i,    (3.5)

where the weighting factors f_i^{Hebb} are given by

f_i^{Hebb} \propto 1 + \log(N) - i \log(i) + (i - 1) \log(i - 1),    (3.6)
and we define 0 log(0) ≡ 0. This relation is plotted in Figure 6. Here, the clusters have been ordered such that cluster 1 is the one closest to the final weight vector and cluster N is the one most distant from it. There can be at most N! resulting weight vectors, owing to the number of possible assignments of the f_i^{Hebb} to the clusters, but in general not all of them are stationary points of the learning dynamics. An interesting result of this analysis is that the final weight vector does not depend on the desired mean activity level µ. In the special case of the cluster centers lying at random locations in a very high-dimensional space, all of the possible stationary weight vectors correspond to heavy-tailed directions. More precisely, the linear projection of the inputs onto the weight vector will have an approximately exponential distribution; that is, its distribution will have a heavier tail than a gaussian (Triesch, 2005b).
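Equation 3.6 is easy to evaluate; a small sketch (unnormalized, with the 0 log(0) convention handled explicitly; the function name is ours):

```python
import numpy as np

def hebb_weights(N):
    """Equation 3.6 (up to normalization), with 0*log(0) taken as 0."""
    i = np.arange(1, N + 1)
    f = 1.0 + np.log(N) - i * np.log(i)
    f[1:] += (i[1:] - 1) * np.log(i[1:] - 1)  # (i-1)log(i-1); zero at i = 1
    return f

print(hebb_weights(50)[:5])  # the closest clusters dominate, as in Figure 6
```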
This formal result complements the informal analysis from above, which suggested that the combination of IP with Hebbian learning leads to the discovery of heavy-tailed directions in the input.

We have performed the same analysis for the case of the alternative Hebb rules introduced above (see the appendix). For the covariance rule, the contributions of the individual clusters to the weight vector are very similar to the contributions in the simple Hebbian case, as shown in Figure 6, although a few of the weights will actually become slightly negative. For the BCM rule, however, most of the input clusters will contribute to the weight vector with a small negative weight, and only a small fraction of the clusters contributes to the final weight vector with a strong positive weight. The exact size of this fraction depends on the choice of the threshold θ_BCM. Figure 6 shows the result for the "balanced" choice θ_BCM = 2µ.

4 The Bars Problem

The bars problem is a standard problem for unsupervised learning architectures (Földiák, 1990). Formally, it is a nonlinear independent component analysis problem. The input domain consists of an N-by-N retina. On this retina, all horizontal and vertical bars (2N in total) can be displayed. The presence or absence of each bar is determined independently, with every bar occurring with the same probability p (in our case, p = 1/N). If a horizontal and a vertical bar overlap, the pixel at the intersection point will be just as bright as any other pixel on the bars rather than twice as bright. This makes the problem a nonlinear ICA problem. Example stimuli from the bars data set are shown in Figure 7a. Note that we normalize input vectors to unit length. The goal of learning in the bars problem is to find the independent sources of the images: the individual bars. Thus, the neural learning system should develop filters that represent the individual bars.

We have trained an individual sigmoidal model neuron on the bars input domain. We used the simple Hebbian rule, equation 3.1, together with the multiplicative weight normalization that keeps the weight vector at unit length. The theoretical analysis above assumed that intrinsic plasticity is much faster than synaptic plasticity. Here, we set the timescale of IP to be identical to the one for synaptic plasticity (η_IP = 0.01, η_Hebb = 0.01). As illustrated in Figure 7b, the unit's weight vector aligns with one of the individual bars as soon as the intrinsic plasticity has pushed the model neuron into a regime where its responses are sparse: the unit has discovered one of the independent sources of the input domain. This result is robust if the desired mean activity µ of the unit is changed over a wide range. If µ is reduced from its default value (1/2N = 0.05) over several orders of magnitude (we tried down to 10^-5), the result remains unchanged. However, if µ is increased above about 0.15, the unit will fail to represent an individual bar but will learn a mixture of two or more bars, with different bars being represented with different strengths.
Figure 7: Single model neuron discovering a "bar" in the bars problem. (a) Examples of bars stimuli. (b) Activity of the neuron during the learning process and weight vector at five times during learning. (c) Histogram of weight strengths of the final weight vector. Parameters were η_IP = 0.01, η_Hebb = 0.01.
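To make the input domain concrete, the following minimal sketch generates bars stimuli under the conventions stated above (each of the 2N bars drawn independently with probability p = 1/N, nonadditive overlap at intersections, unit-length normalization). The function name and the use of NumPy are choices of this sketch, not of the original implementation.

```python
import numpy as np

def bars_stimulus(N=10, p=None, rng=None):
    # One sample from the bars data set: an N-by-N retina on which each of
    # the 2N horizontal and vertical bars appears independently with
    # probability p (default 1/N); overlapping pixels are not brighter.
    rng = rng if rng is not None else np.random.default_rng()
    p = 1.0 / N if p is None else p
    img = np.zeros((N, N))
    img[rng.random(N) < p, :] = 1.0   # horizontal bars
    img[:, rng.random(N) < p] = 1.0   # vertical bars; intersections stay at 1
    x = img.ravel()
    n = np.linalg.norm(x)
    return x / n if n > 0 else x      # inputs are normalized to unit length
```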
Thus, in this example, in contrast to the theoretical result above, the desired mean activity µ does influence the weight vector that is being learned. The reason is that the IP only imperfectly adjusts the output distribution to the desired exponential shape. As long as µ is small, however, discovery of a single bar is a very robust effect. For example, it still occurs if all the input patterns presented to the system have at least, say, four bars in them (i.e., the unit never sees a single bar in isolation).

The discovery of a single bar also does not critically depend on the relative timescales of IP and Hebbian learning. Figure 8 shows results for different relative speeds of IP and Hebbian learning. When IP is faster than Hebbian learning (left graph, η_IP = 0.01, η_Hebb = 0.001), a single bar is discovered as soon as the unit's activity is driven into a heavy-tailed regime. When IP is slower than Hebbian learning (center graph, η_IP = 0.001, η_Hebb = 0.01), the same behavior is observed, although convergence is slower. Importantly, this behavior is not observed when IP is absent, as shown in Figure 8 (right). In this experiment, we used a fixed sigmoid nonlinearity with a = 5.0, b = −1.15. These values correspond to the final sigmoid after convergence in Figure 7 (see footnote 3). With this fixed nonlinearity, the unit is unable to discover a bar. We varied the learning rate η_Hebb over two orders of magnitude and also ran longer simulations, but the unit never discovered a single bar. This demonstrates that IP contributes more to the learning process than just setting the sigmoid nonlinearity to its final shape. Instead, it shows that the interaction of IP with Hebbian learning over an extended time contributes to the final result. This extended interaction leads to large transient deviations of the sigmoid from its final shape.
3 More precisely, in the "converged state," these parameters still fluctuate: a stays mostly in the interval [4.5, 5.5], and b fluctuates in the interval [−1.0, −1.3].
Figure 8: Learning is robust to changes in the relative timescale of IP versus Hebbian learning. Faster IP (left plot, ηIP = 0.01, ηHebb = 0.001) and slower IP (center plot, ηIP = 0.001, ηHebb = 0.01) both lead to the discovery of a single bar (see final weight vectors shown as small insets). Without any IP (right plot, ηHebb = 0.001), but with a constant nonlinearity that corresponds to the final nonlinearity in Figure 7, no bar is discovered.
During this transient deformation, the unit's activity first enters into a sparse regime and discovers a bar. This weight pattern is maintained as the unit keeps adjusting its nonlinearity to the final shape. It should be noted, however, that setting a and b to constant values that are assumed during the transient phase can also lead to successful discovery of a bar. Thus, IP is not strictly necessary for successful learning, but it automatically adjusts the nonlinearity in the desired way and ensures an energy-efficient, approximately exponential activity distribution throughout the learning process.

5 Discussion

We have presented a model of IP rooted in information theory and have shown that the model is effective in maintaining homeostasis of the neuron's firing-rate distribution by making it approximately exponential. The model predicts that changes of the neuron's intrinsic excitability should depend not only on the neuron's average firing rate (first moment or mean) but also on its average squared firing rate (second moment). This mechanism may complement other homeostasis mechanisms such as synaptic scaling (Turrigiano & Nelson, 2000, 2004; Maffei et al., 2004). Recent evidence suggests that homeostatic synaptic scaling may occur in response to changing levels of afferent input to the cell rather than to changes of the cell's firing activity (Stellwagen & Malenka, 2006). We propose that synaptic scaling tries to keep the synaptic current arriving at the soma in a certain regime, while intrinsic plasticity adjusts the neuron's nonlinear properties to map this current onto an energy-efficient, sparse firing pattern.

5.1 Connection to Posttraumatic Epileptogenesis. We demonstrated that the proposed IP mechanism contributes to homeostasis by increasing
a neuron's excitability after reduction of its afferent input. This effect may play an important role in the genesis of posttraumatic epilepsy. Physiological studies have found increased excitability of pyramidal neurons after trauma (Prince & Tseng, 1993), and modeling studies have shown that such increased excitability together with increased NMDA conductance may explain posttraumatic epileptogenesis (Bush, Prince, & Miller, 1999). Our IP model suggests a causal connection between the trauma and the increased excitability. Due to the trauma, neurons in the neighboring tissue lose a fraction of their afferent input, which triggers an increased excitability due to IP. This in turn makes the local circuit more prone to generating epileptiform events. This mechanism may be complementary to the putative role of homeostatic synaptic plasticity in posttraumatic epileptogenesis (Houweling, Bazhenov, Timofeev, Steriade, & Sejnowski, 2005). More work is needed to understand the interaction of these mechanisms in detail, with potential implications for clinical applications.

5.2 Role for Learning Efficient Sensory Representations. We have demonstrated that there may be a synergistic relationship between Hebbian synaptic learning and IP mechanisms that change the intrinsic excitability of neurons. The two processes may work together to find heavy-tailed directions in the input space. While our theoretical arguments explained this behavior only in the limiting case of IP being much faster than synaptic plasticity, we found empirically that the effect persists if intrinsic plasticity is as slow as or even much slower than synaptic plasticity. In addition, we have demonstrated how IP may mimic the effect of a sliding threshold for synaptic plasticity as postulated by the BCM theory. New experiments are needed to distinguish between the alternative explanations of a sliding plasticity threshold or a change in excitability with a fixed threshold.

Our work is closely related to other work on finding efficient representations of sensory stimuli in general (Olshausen & Field, 1996; Bell & Sejnowski, 1997) and on doing so with biologically plausible plasticity mechanisms in particular. For example, some authors have used modified versions of standard Hebbian learning rules in order to discover independent components in the input (Blais, Intrator, Shouval, & Cooper, 1998; Hyvärinen & Oja, 1998), and certain independent component analysis techniques also rely on terms that multiply the input signal with a nonlinearly transformed output signal (Hyvärinen & Oja, 1997; Hyvärinen et al., 2001). Thus, our work is closely related to these approaches, but it introduces IP as a mechanism to adapt the nonlinearity online in a biologically viable fashion. To our knowledge, this was ignored in previous work on the topic. Our result from Figure 8 (right) suggests that, at least in certain instances, this online adaptation aids learning. In addition, the IP mechanism ensures an energy-efficient encoding throughout the learning process.
5.3 Coupling with Reward Mechanisms. Good sensory representations must efficiently encode the aspects of the input domain that are relevant for the organism's behavior, which implies not wasting neural resources on representing irrelevant information. While in the experiments on the bars problem a model neuron learned to represent a single bar at random, it is easy to envision how a simple gating mechanism can lead to learning of relevant heavy-tailed directions in a biologically plausible fashion (e.g., Montague, Dayan, Person, & Sejnowski, 1995; Roelfsema & van Ooyen, 2005). For example, the following modification of the synaptic learning rule allows the discovery of a specific bar that is associated with rewards or punishments:

$$\Delta w = \eta\, y\, x\, R. \qquad \text{(5.1)}$$

In this equation, R represents a scalar relevance signal identifying situations that may be worth learning about. For example, we may define R ≡ |δ|, where δ represents a temporal difference error (Sutton & Barto, 1998), which has been associated with the activity of dopaminergic neurons. The use of |δ| reflects that both positive and negative events may be worth learning about.
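As a minimal sketch, the gated rule could be implemented as follows, assuming R ≡ |δ| as above and the multiplicative renormalization to unit length used in the bars experiments; the function name and interface are illustrative.

```python
import numpy as np

def gated_hebb_step(w, x, y, delta, eta=0.01):
    # Relevance-gated Hebbian update (equation 5.1) with R = |delta|,
    # followed by multiplicative normalization back to unit length.
    R = abs(delta)
    w = w + eta * y * x * R
    return w / np.linalg.norm(w)
```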
If, for example, an unexpected positive or negative reward is given (δ = ±1) whenever a specific bar appears (irrespective of the presence or absence of any other bars), the modified learning rule will discover this specific bar, because the gating signal effectively restricts learning to those stimuli that contain this bar. Such a relevance feedback could be the effect of dopamine, as suggested above, but it could also be associated with the action of acetylcholine, which is known to globally influence synaptic plasticity (Pennartz, 1995). It would be interesting to incorporate the ideas presented in this article into an actor-critic framework to study how the emergence of efficient sensory representations may be influenced by task demands.

5.4 Future Work. As a next step, it is important to study the effect of combined intrinsic and Hebbian synaptic plasticity at the network level. While we have shown that intrinsic plasticity can effectively ensure the lifetime sparseness of individual neurons (Willmore & Tolhurst, 2001) in the sense of exponential firing-rate distributions, additional lateral inhibitory interactions in a network are likely necessary to decorrelate units in a population in order to also ensure population sparseness, as is observed in the cortex (Vinje & Gallant, 2000). We speculate that the cerebral cortex utilizes a synergistic interaction of intrinsic and synaptic plasticity mechanisms within a sophisticated system of local excitation and inhibition to find sensory codes that exhibit both lifetime and population sparseness (Lehky, Sejnowski, & Desimone, 2005). Preliminary work has already established that topographic arrangements of simple cell-like receptive fields emerge in a two-dimensional network of interconnected units with simultaneous intrinsic and synaptic learning (Butko & Triesch, 2006).
It will also be interesting to study the effects of combining intrinsic and synaptic learning mechanisms in recurrent network models. Recently we have also started to investigate the interaction of IP models for spiking neurons with spike-timing-dependent plasticity (STDP) in recurrent networks and found that the combination of STDP and IP can produce very different network dynamics from those resulting from STDP without IP. We speculate that similar combinations of plasticity mechanisms may be effective in driving recurrent networks into a regime of great computational power (Bertschinger & Natschläger, 2004) for a fixed energy expenditure.

In conclusion, since IP appears to be a ubiquitous phenomenon in the brain, many aspects of neural functioning and plasticity could be profoundly influenced by it. A careful analysis of the interaction of various forms of plasticity in the nervous system is arguably vital for our understanding of the brain's ability to adapt to ever-changing demands during normal development and learning and in response to various disturbances such as trauma or infection. Our analysis of the interaction of Hebbian learning with the proposed IP model is a small step in this direction. In a similar vein, Janowitz and van Rossum (2006) have explored the role of a fast nonhomeostatic IP mechanism for trace conditioning. So far, however, we have seen only a small glimpse of this terra incognita.

Appendix A: Gradient Rule for IP

We describe the distribution of the total synaptic current arriving at the soma, x, by the probability distribution f_x(x). Assuming the neuron's nonnegative firing rate is given by y = g(x), where g is strictly monotonically increasing, the distribution of y is given by
$$f_y(y) = \frac{f_x(x)}{\partial y / \partial x}. \qquad \text{(A.1)}$$
We are looking for ways to adjust g so that f_y(y) is approximately exponential, that is, we would like f_y(y) to be "close" to f_exp(y) = (1/µ) exp(−y/µ). A natural measure is to consider the Kullback-Leibler divergence (KL divergence) between f_y and f_exp:
$$D \equiv d(f_y \,\|\, f_{\exp}) = \int f_y(y) \log \frac{f_y(y)}{\frac{1}{\mu} \exp\left(\frac{-y}{\mu}\right)}\, dy \qquad \text{(A.2)}$$
$$= \int f_y(y) \log\big(f_y(y)\big)\, dy - \int f_y(y) \left(-\frac{y}{\mu} - \log \mu\right) dy \qquad \text{(A.3)}$$
$$= -H(y) + \frac{1}{\mu} E(y) + \log \mu, \qquad \text{(A.4)}$$
where H(y) is the entropy of y and E(y) is its expected value. Thus, in order to minimize d(f_y ∥ f_exp), we need to maximize the entropy H(y) while minimizing the expected value E(y). The term log µ does not depend on g and is irrelevant for this minimization. To derive a stochastic gradient-descent rule for IP that strives to minimize equation A.4, we need to take into account the specific form of the parameterized nonlinearity, equation 2.1. To construct the gradient, we calculate the partial derivatives of D with respect to a and b:
$$\frac{\partial D}{\partial a} = \frac{\partial d(f_y \,\|\, f_{\exp})}{\partial a} = -\frac{\partial H}{\partial a} + \frac{1}{\mu} \frac{\partial E(y)}{\partial a} \qquad \text{(A.5)}$$
$$= E\left[-\frac{\partial}{\partial a} \log \frac{\partial y}{\partial x} + \frac{1}{\mu} \frac{\partial y}{\partial a}\right] \qquad \text{(A.6)}$$
$$= -\frac{1}{a} + E\left[-x + \left(2 + \frac{1}{\mu}\right) x y - \frac{1}{\mu} x y^2\right], \qquad \text{(A.7)}$$
where we have used equation A.1 and moved the differentiation inside the expected value operation to obtain equation A.6, and exploited
$$\log \frac{\partial y}{\partial x} = \log a + \log y + \log(1 - y) \qquad \text{(A.8)}$$
in the last step. Similarly, for the partial derivative with respect to b, we find:
$$\frac{\partial D}{\partial b} = \frac{\partial d(f_y \,\|\, f_{\exp})}{\partial b} = -\frac{\partial H}{\partial b} + \frac{1}{\mu} \frac{\partial E(y)}{\partial b} \qquad \text{(A.9)}$$
$$= -1 + E\left[\left(2 + \frac{1}{\mu}\right) y - \frac{1}{\mu} y^2\right]. \qquad \text{(A.10)}$$
The resulting stochastic gradient-descent rule is given by a ← a + Δa and b ← b + Δb, with
$$\Delta a = \eta_{IP} \left(\frac{1}{a} + x - \left(2 + \frac{1}{\mu}\right) x y + \frac{1}{\mu} x y^2\right) \qquad \text{(A.11)}$$
$$\Delta b = \eta_{IP} \left(1 - \left(2 + \frac{1}{\mu}\right) y + \frac{1}{\mu} y^2\right). \qquad \text{(A.12)}$$
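This update rule translates directly into code. The following sketch assumes the fermi sigmoid y = 1/(1 + exp(−(ax + b))) of equation 2.1, which is the form behind equations A.7 and A.10; the gaussian input in the demonstration loop is purely illustrative.

```python
import numpy as np

def sigmoid(x, a, b):
    return 1.0 / (1.0 + np.exp(-(a * x + b)))

def ip_update(x, y, a, b, mu, eta_ip=0.001):
    # One stochastic gradient step on equations A.11 and A.12, decreasing the
    # KL divergence between the firing-rate distribution and an exponential
    # distribution with mean mu.
    da = eta_ip * (1.0 / a + x - (2.0 + 1.0 / mu) * x * y + (x * y * y) / mu)
    db = eta_ip * (1.0 - (2.0 + 1.0 / mu) * y + (y * y) / mu)
    return a + da, b + db

# Illustrative run with gaussian input currents.
rng = np.random.default_rng(0)
a, b, mu = 1.0, 0.0, 0.1
rates = []
for _ in range(200_000):
    x = rng.normal()
    y = sigmoid(x, a, b)
    a, b = ip_update(x, y, a, b, mu)
    rates.append(y)
print(np.mean(rates[-50_000:]))  # should settle near the target mean rate mu
```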
Appendix B: Analysis of Learning with Clustered Inputs

We consider the limiting case of fast IP, and we assume that IP is perfect, that is, it leads to an exactly exponential distribution of activities (Triesch, 2005b). We will first treat the case of only two input clusters and then generalize to N clusters. Let us assume two clusters of inputs at locations c1 and c2. Let us also assume that both clusters account for exactly half of the inputs. If the weight vector is slightly closer to one of the two clusters, inputs from this cluster will activate the unit more strongly. Let m = µ ln(2) denote the median of the exponential firing-rate distribution with mean µ. Then inputs from the closer cluster, say c1, will be responsible for all activities above m, while the inputs from the other cluster will be responsible for all activities below m. Hence, for the standard Hebb rule, the expected value of the weight update Δw will be given by:
$$\Delta w \approx \alpha c_1 \int_m^{\infty} \frac{y}{\mu} \exp\left(-\frac{y}{\mu}\right) dy + \alpha c_2 \int_0^{m} \frac{y}{\mu} \exp\left(-\frac{y}{\mu}\right) dy \qquad \text{(B.1)}$$
$$= \frac{\alpha \mu}{2} \left((1 + \ln 2)\, c_1 + (1 - \ln 2)\, c_2\right). \qquad \text{(B.2)}$$
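The two integrals leading to equation B.2 can be verified symbolically; a minimal check, assuming SymPy is available:

```python
import sympy as sp

y, mu = sp.symbols('y mu', positive=True)
m = mu * sp.log(2)  # median of the exponential distribution with mean mu

upper = sp.integrate((y / mu) * sp.exp(-y / mu), (y, m, sp.oo))
lower = sp.integrate((y / mu) * sp.exp(-y / mu), (y, 0, m))

print(sp.simplify(upper - mu * (1 + sp.log(2)) / 2))  # prints 0
print(sp.simplify(lower - mu * (1 - sp.log(2)) / 2))  # prints 0
```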
Taking the multiplicative weight normalization into account, we see that stationary points of the weight vector require that Δw is parallel to w. From this, it follows that the weight vector must converge to either of the following two stationary states:
$$w = \frac{(1 \pm \ln 2)\, c_1 + (1 \mp \ln 2)\, c_2}{\left\|(1 \pm \ln 2)\, c_1 + (1 \mp \ln 2)\, c_2\right\|}. \qquad \text{(B.3)}$$
The weight vector moves close to one of the two clusters but does not fully commit to it. In the case of N clusters, the situation is similar. In order to calculate the contribution of the ith closest cluster to the weight vector, we have to consider the inverse of the cumulative distribution function (percent point function) of the exponential firing-rate distribution, which is given by
$$F_y^{-1}(p) = -\mu \log(1 - p). \qquad \text{(B.4)}$$
(Recall that the median of a random variable y is given by F_y^{-1}(1/2).) Assuming that all clusters contribute the same number of inputs, the cluster associated with the ith highest interval of firing rates is "responsible" for the interval [1 − i/N, 1 − (i−1)/N] of the cumulative distribution function. It
follows that the average amount of weight update associated with inputs from cluster i under the standard Hebb rule is given by
$$\int_{F_y^{-1}(1 - i/N)}^{F_y^{-1}(1 - (i-1)/N)} y\, \frac{1}{\mu} \exp\left(-\frac{y}{\mu}\right) dy, \qquad \text{(B.5)}$$
which reduces to
$$\mu \left[\frac{i}{N} \left(1 - \log \frac{i}{N}\right) - \frac{i-1}{N} \left(1 - \log \frac{i-1}{N}\right)\right]. \qquad \text{(B.6)}$$
This expression determines the relative contribution of cluster i to the weight vector in the steady state. After simplification, we can see that possible stable weight vectors are of the form
$$w = \sum_{i=1}^{N} f_i^{Hebb}\, c_i, \qquad \text{(B.7)}$$
where
$$f_i^{Hebb} \propto 1 + \log(N) - i \log(i) + (i - 1) \log(i - 1) \qquad \text{(B.8)}$$
is the relative contribution of the ith cluster to the weight vector. Note that for the case of N = 2, this matches the result above (see equation B.3). It is straightforward to generalize this analysis to the alternative Hebbian learning rules introduced in the main text. Writing the postsynaptic factor of a learning rule as ψ(y), in order to calculate the expected weight update due to cluster i, we generalize equation B.5 to
$$\int_{F_y^{-1}(1 - i/N)}^{F_y^{-1}(1 - (i-1)/N)} \psi(y)\, \frac{1}{\mu} \exp\left(-\frac{y}{\mu}\right) dy. \qquad \text{(B.9)}$$
For the balanced covariance rule, ψ(y) = y − µ, we obtain
$$f_i^{cov} \propto f_i^{Hebb} - \frac{\mu}{N}. \qquad \text{(B.10)}$$
For the balanced BCM rule, ψ(y) = y(y − 2µ), we obtain:
$$f_i^{BCM} \propto i \left(\log \frac{i}{N}\right)^2 - (i - 1) \left(\log \frac{i-1}{N}\right)^2. \qquad \text{(B.11)}$$
In Figure 6 we plot the normalized f_i such that Σ_i f_i² = 1.
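For reference, the normalized contributions plotted in Figure 6 can be computed as follows. This is a sketch: the helper names are illustrative, and for the covariance rule the µ/N term of equation B.10 is expressed in the same units as the bracket of equation B.6 (the common factor µ/N divided out); the final normalization removes the overall scale in any case.

```python
import numpy as np

def _xlogx(v):
    # v * log(v) with the convention 0 * log(0) = 0
    v = np.asarray(v, dtype=float)
    out = np.zeros_like(v)
    out[v > 0] = v[v > 0] * np.log(v[v > 0])
    return out

def _xlog2x(v):
    # v * log(v)**2 with the convention 0 * log(0)**2 = 0
    v = np.asarray(v, dtype=float)
    out = np.zeros_like(v)
    out[v > 0] = v[v > 0] * np.log(v[v > 0]) ** 2
    return out

def cluster_contributions(N, rule="hebb"):
    i = np.arange(1, N + 1, dtype=float)
    if rule == "hebb":    # equation B.8
        f = 1.0 + np.log(N) - _xlogx(i) + _xlogx(i - 1)
    elif rule == "cov":   # equation B.10: the Hebb term minus a constant
        f = np.log(N) - _xlogx(i) + _xlogx(i - 1)
    elif rule == "bcm":   # equation B.11, up to a constant factor 1/N
        f = _xlog2x(i / N) - _xlog2x((i - 1) / N)
    else:
        raise ValueError(rule)
    return f / np.linalg.norm(f)  # normalize so that sum_i f_i**2 = 1
```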
Acknowledgments

This work was supported in part by the National Science Foundation under grant NSF IIS-0208451 and the Hertie Foundation. I thank Nicholas Butko, Emanuel Todorov, and Erik Murphy-Chutorian for discussions and Cornelius Weber and two anonymous reviewers for helpful comments on earlier drafts.

References

Aizenman, C., Akerman, C., Jensen, K., & Hollis, T. (2003). Visually driven regulation of intrinsic neuronal excitability improves stimulus detection in vivo. Neuron, 39, 831–842.
Baddeley, R., Abbott, L. F., Booth, M., Sengpiel, F., & Freeman, T. (1998). Responses of neurons in primary and inferior temporal visual cortices to natural scenes. Proc. R. Soc. London, Ser. B, 264, 1775–1783.
Bell, A. J., & Sejnowski, T. J. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7, 1129–1159.
Bell, A. J., & Sejnowski, T. J. (1997). The "independent components" of natural scenes are edge filters. Vision Research, 37, 3327–3338.
Bertschinger, N., & Natschläger, T. (2004). Real-time computation at the edge of chaos in recurrent neural networks. Neural Computation, 16, 1413–1436.
Bienenstock, E., Cooper, L., & Munro, P. (1982). Theory for the development of neuron selectivity: Orientation specificity and binocular interaction in visual cortex. Journal of Neuroscience, 2, 32–48.
Blais, B. S., Intrator, N., Shouval, H., & Cooper, L. N. (1998). Receptive field formation in natural scene environments. Neural Computation, 10, 1797–1813.
Burdakov, D. (2005). Gain control by concerted changes in I_A and I_h conductances. Neural Computation, 17, 991–995.
Bush, P., Prince, D., & Miller, K. (1999). Increased pyramidal excitability and NMDA conductance can explain posttraumatic epileptogenesis without disinhibition: A model. Journal of Neurophysiology, 82(4), 1748–1758.
Butko, N. J., & Triesch, J. (2006). Exploring the role of intrinsic plasticity for the learning of sensory representations. In M. Verleysen (Ed.), European Symposium on Artificial Neural Networks (ESANN) (pp. 467–472). Bruges, Belgium: d-side.
Cooper, L., Intrator, N., Blais, B., & Shouval, H. (2004). Theory of cortical plasticity. River Edge, NJ: World Scientific Publishing.
Cudmore, R., & Turrigiano, G. (2004). Long-term potentiation of intrinsic excitability in LV visual cortical neurons. J. Neurophysiol., 92, 341–348.
Daoudal, G., & Debanne, D. (2003). Long-term plasticity of intrinsic excitability: Learning rules and mechanisms. Learning and Memory, 10, 456–465.
Dayan, P., & Abbott, L. F. (2001). Theoretical neuroscience. Cambridge, MA: MIT Press.
Desai, N. S., Rutherford, L. C., & Turrigiano, G. G. (1999). Plasticity in the intrinsic excitability of cortical pyramidal neurons. Nature Neuroscience, 2(6), 515–520.
Földiák, P. (1990). Forming sparse representations by local anti-Hebbian learning. Biological Cybernetics, 64, 165–170.
Houweling, A., Bazhenov, M., Timofeev, I., Steriade, M., & Sejnowski, T. (2005). Homeostatic synaptic plasticity can explain post-traumatic epileptogenesis in chronically isolated neocortex. Cerebral Cortex, 15, 834–845.
Hyvärinen, A., Karhunen, J., & Oja, E. (2001). Independent component analysis. New York: Wiley.
Hyvärinen, A., & Oja, E. (1997). A fast fixed-point algorithm for independent component analysis. Neural Computation, 9, 1483–1492.
Hyvärinen, A., & Oja, E. (1998). Independent component analysis by general nonlinear Hebbian-like learning rules. Signal Processing, 64(3), 301–313.
Janowitz, M., & van Rossum, M. (2006). Excitability changes that complement Hebbian learning. Network: Computation in Neural Systems, 17(1), 31–41.
Kirkwood, A., Rioult, M., & Bear, M. (1996). Experience-dependent modification of synaptic plasticity in visual cortex. Nature, 381, 526–528.
Laughlin, S. (1981). A simple coding procedure enhances a neuron's information capacity. Z. Naturforsch., 36c, 910–912.
Lehky, S., Sejnowski, T., & Desimone, R. (2005). Selectivity and sparseness in the responses of striate complex cells. Vision Research, 45, 57–73.
Maffei, A., Nelson, S., & Turrigiano, G. (2004). Selective reconfiguration of layer 4 visual cortical circuitry by visual deprivation. Nature Neuroscience, 7(12), 1353–1359.
Marder, E., Abbott, L. F., Turrigiano, G. G., Liu, Z., & Golowasch, J. (1996). Memory from the dynamics of intrinsic membrane currents. Proc. Natl. Acad. Sci. USA, 93, 13481–13486.
Montague, P., Dayan, P., Person, C., & Sejnowski, T. (1995). Bee foraging in uncertain environments using predictive Hebbian learning. Nature, 377, 725–728.
Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381, 607–609.
Pennartz, C. (1995). The ascending neuromodulatory systems in learning by reinforcement: Comparing computational conjectures with experimental findings. Brain Res. Rev., 21, 219–245.
Prince, D., & Tseng, G.-F. (1993). Epileptogenesis in chronically injured cortex: In vitro studies. Journal of Neurophysiology, 69, 1276–1291.
Roelfsema, P. R., & van Ooyen, A. (2005). Attention-gated reinforcement learning of internal representations for classification. Neural Computation, 17, 2176–2214.
Shin, J., Koch, C., & Douglas, R. (1999). Adaptive neural coding dependent on the time-varying statistics of the somatic input current. Neural Computation, 11, 1893–1913.
Stellwagen, D., & Malenka, R. C. (2006). Synaptic scaling mediated by glial TNF-α. Nature, 440(7087), 1054–1059.
Stemmler, M., & Koch, C. (1999). How voltage-dependent conductances can adapt to maximize the information encoded by neuronal firing rate. Nature Neuroscience, 2(6), 521–527.
Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. Cambridge, MA: MIT Press.
Triesch, J. (2005a). A gradient rule for the plasticity of a neuron's intrinsic excitability. In W. Duch, J. Kacprzyk, E. Oja, & S. Zadrozny (Eds.), Proc. Int. Conf. on Artificial Neural Networks (ICANN) (pp. 65–70). Berlin: Springer.
Triesch, J. (2005b). Synergies between intrinsic and synaptic plasticity in individual model neurons. In L. K. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing systems, 17. Cambridge, MA: MIT Press.
Turrigiano, G., Leslie, K., Desai, N., Rutherford, L., & Nelson, S. (1998). Activity-dependent scaling of quantal amplitude in neocortical neurons. Nature, 391, 892–896.
Turrigiano, G., & Nelson, S. (2000). Hebb and homeostasis in neuronal plasticity. Curr. Opin. Neurobiol., 10(3), 358–364.
Turrigiano, G., & Nelson, S. (2004). Homeostatic plasticity in the developing nervous system. Nature Reviews Neuroscience, 5, 97–107.
van Welie, I., van Hooft, J. A., & Wadman, W. J. (2004). Homeostatic scaling of neuronal excitability by synaptic modulation of somatic hyperpolarization-activated Ih channels. Proc. Nat. Acad. Sci., 101(14), 5123–5128.
van Welie, I., van Hooft, J. A., & Wadman, W. J. (2006). Background activity regulates excitability of rat hippocampal CA1 pyramidal neurons by adaptation of a K+ conductance. J. Neurophysiology, 95, 2007–2012.
Vinje, W. E., & Gallant, J. L. (2000). Sparse coding and decorrelation in primary visual cortex during natural vision. Science, 287, 1273–1276.
Willmore, B., & Tolhurst, D. (2001). Characterizing the sparseness of neural codes. Network: Computation in Neural Systems, 12, 255–270.
Zhang, M., Hung, F., Zhu, Y., Xie, Z., & Wang, J.-H. (2004). Calcium signal-dependent plasticity of neuronal excitability developed postnatally. J. Neurobiol., 61(2), 277–287.
Zhang, W., & Linden, D. J. (2003). The other side of the engram: Experience-driven changes in neuronal intrinsic excitability. Nature Reviews Neuroscience, 4, 885–900.
Received November 11, 2005; accepted April 27, 2006.
LETTER
Communicated by Michael Hasselmo
Distinguishing Causal Interactions in Neural Populations

Anil K. Seth
[email protected]
Gerald M. Edelman
[email protected]
The Neurosciences Institute, San Diego, CA 92121, U.S.A.

Neural Computation 19, 910–933 (2007)
C 2007 Massachusetts Institute of Technology
We describe a theoretical network analysis that can distinguish statistically causal interactions in population neural activity leading to a specific output. We introduce the concept of a causal core to refer to the set of neuronal interactions that are causally significant for the output, as assessed by Granger causality. Because our approach requires extensive knowledge of neuronal connectivity and dynamics, an illustrative example is provided by analysis of Darwin X, a brain-based device that allows precise recording of the activity of neuronal units during behavior. In Darwin X, a simulated neuronal model of the hippocampus and surrounding cortical areas supports learning of a spatial navigation task in a real environment. Analysis of Darwin X reveals that large repertoires of neuronal interactions contain comparatively small causal cores and that these causal cores become smaller during learning, a finding that may reflect the selection of specific causal pathways from diverse neuronal repertoires.

1 Introduction

How do neuronal interactions in complex nervous systems cause specific outputs, and how are these causal interactions modulated during learning? Addressing these issues requires a theoretical analysis of the principles governing causal interactions in large and richly interconnected neuronal populations during behavior. Here, we describe a novel network analysis that distinguishes causal interactions in the neural activity in relation to a given neural reference. We use the term neural reference (NR) to refer to the activity of a particular neuron (e.g., a motor neuron) at a particular time (e.g., when an output is generated). As part of this analysis, we introduce the concept of a causal core to designate those neural interactions that are causally significant for a given NR. This concept provides a novel means of linking neuronal interactions in large populations to specific outputs. Our analysis derives from the perspective that the brain is a selectional system and that neural interactions reflect processes of selection operating on diverse repertoires of neuronal groups (Edelman, 1978, 1987, 1993;
C 2007 Massachusetts Institute of Technology
Distinguishing Causal Interactions in Neural Populations
911
Izhikevich, Gally, & Edelman, 2004; Seth & Baars, 2005). Our approach here elaborates on three features of this view in particular. First, variability in population activity may be a necessary substrate for the formation of diverse and dynamic repertoires of neuronal interactions and may reflect not only neuronal noise. Second, population activity patterns are interpreted as dynamic, causally effective processes giving rise to specific outputs, and not as information-bearing representations underlying computations. Finally, synaptic plasticity is suggested to contribute to behavioral learning by shaping the selection of causal pathways within populations and not by reinforcing connections among neural representations of sensory and motor variables. We emphasize that selectionist theory provides an interpretive framework for this analysis and is not itself tested by our analysis. The general framework for the analysis is illustrated in Figure 1. First, an NR is selected from among many possible neuronal events—in this example, by virtue of its relationship to a specific behavioral output (see Figure 1A; yellow neuron at time t). Second, a network of neuronal interactions is identified, incorporating only those interactions that could have potentially contributed to the occurrence of the NR (Krichmar, Nitz, Gally, & Edelman, 2005). This network corresponds to the “context network” of the NR (see Figure 1A; network of green neurons). In order to sample the most salient components for a particular NR, these networks generally reflect a relatively short time period. Finally, a subnetwork, which we define as the causal core, is selected from the context network. The causal core comprises those interactions that are causally significant as assessed by Granger causality (see Figure 1B and below; red arrows show causally significant connections), and which form part of a causal chain leading to the NR (see Figure 1C; red arrows indicate the causal core, and the blue arrow indicates a causally significant interaction that is excluded from the core because it is a causal dead-end). The concept of Granger causality is based on prediction: if a signal A causes a signal B, then past values of A should help predict B with greater accuracy than past values of B alone (Granger, 1969). Because the statistical nature of Granger causality requires large sample sizes to make robust inferences, it is assessed over a time period considerably longer than that used for identifying the context network. In practice, the above analysis requires extensive knowledge of anatomical and functional connectivity during behavior. Unfortunately, such data are not currently available in vivo. Therefore, to provide an illustrative example, we analyze a brain-based device (BBD), a physical device whose behavior in a real-world environment is controlled by a simulated nervous system based on vertebrate neuroanatomy and neurophysiology (Krichmar & Edelman, 2002; Seth, McKinstry, Edelman, & Krichmar, 2004b, 2004a; Krichmar, Seth, Nitz, Fleischer, & Edelman, 2005). Unlike animal models, use of a BBD enables precise recording during behavior of the state of every neuronal unit and every synaptic connection. The BBD analyzed here, Darwin X, incorporates a detailed model of the hippocampus and
Figure 1: Distinguishing causal interactions in neuronal populations. (A) Select a neural reference (NR), that is, the activity of a particular neuron (yellow) at a particular time (t). The context network of the NR corresponds to the network of all coactive and connected precursors, assessed over a short time period (e.g., t, t − 1, . . . , t − 6). (B) Assess the Granger causality significance of each interaction in the context network, based on extended time series of the activities of the corresponding neurons (0 − T; this time period is considerably longer than that used to identify the context network itself). Red arrows indicate causally significant interactions. (C) The causal core of the NR (red arrows) is defined as that subset of the context network that is causally significant for the activity of the corresponding neuron (excluding both noncausal interactions—black arrows—and dead-end causal interactions such as 5→4, indicated in blue). The concept of a causal core provides a novel means of linking neuronal interactions in large populations to specific outputs.
surrounding cortical areas. By integrating visual and self-motion cues, it learns to navigate to an arbitrarily chosen location in a room (Krichmar, Nitz et al., 2005; Krichmar, Seth et al., 2005). To analyze Darwin X, we identified NRs at various stages of the learning process, selecting those corresponding to hippocampal control of motor
output. By recording in detail the responses of Darwin X’s nervous system during behavior, we were able to construct context networks and identify causal cores for the chosen NRs. We found that causal cores generally comprised small subsets of the corresponding context networks. Moreover, causal cores were not composed solely of strong synaptic interactions and were not random samples of a context network. We also found that causal cores became smaller with experience, a process we refer to here as refinement. Although BBDs are necessarily gross simplifications of biological systems, their analysis can provide potentially useful heuristics for the interpretation of neurobiological data (Krichmar & Edelman, 2002). Accordingly, as with previous analyses of Darwin X (Krichmar, Nitz, et al., 2005; Krichmar, Seth, et al., 2005), the results presented here can be related to mammalian hippocampal function. However, our primary purpose here is to elaborate several heuristics for analyzing and interpreting activity patterns within neuronal populations. The timeliness of developing such heuristics is emphasized by recent methodological advances that promise detailed new data revealing anatomical and functional architectures of neuronal populations at microscopic scales (Kelly & Strick, 2004; Konkle & Bielajew, 2004; Raffi & Siegel, 2005; Cossart, Ikegaya, & Yuste, 2005; Ohki, Chung, Kara, & Reid, 2005).
Figure 2: Darwin X in its environment with visual landmarks hung from the walls and with the hidden platform indicated. Darwin X has a CCD camera, which responds to landmarks that appear within its field of view. While Darwin X could not see the hidden platform, it could detect an encounter with the platform by means of a downward-facing IR sensor. This figure has been adapted from Krichmar, Seth, et al. (2005).
Figure 3: Selected components of Darwin X's simulated nervous system. Each clear ellipse represents a neural area. There were two visual input streams responding to the color (inferotemporal area IT) and width (parietal area Pr) of visual landmarks on the walls, as well as one odometric input signaling Darwin X's heading (anterior thalamic area ATN). These inputs were reciprocally connected with the hippocampus, which included entorhinal cortical areas ECin and ECout, dentate gyrus DG, and the CA3 and CA1 hippocampal subfields. Plastic connections were updated according to a modified BCM learning rule (Bienenstock et al., 1982). When the value system (area S) was activated by an encounter with the hidden platform, synaptic strengths in value-dependent pathways were further modulated. The number of simulated neuronal units in each area is indicated in gray (e.g., 900 units in IT). Neuronal activities were computed using a mean-firing-rate model. This figure has been adapted from Krichmar, Seth, et al. (2005).
We emphasize that this study differs fundamentally from previous analyses of Darwin X (Krichmar, Nitz, et al., 2005; Krichmar, Seth, et al., 2005). One previous study introduced the recursive analysis of anatomical and functional connectivity that is used in this study to identify context networks (Krichmar, Nitz, et al., 2005). However, this previous study did not analyze causal interactions or examine changes in neural dynamics during learning. A second study applied a different Granger causality analysis to small circuits of 8 to 10 neuronal units each (Krichmar, Seth, et al., 2005). However, these circuits were selected without reference to anatomy, and the small size of the selected circuits precluded analysis of population-level neuronal dynamics.
2 Materials and Methods

2.1 Darwin X. Darwin X is the most recent of a series of autonomous brain-based devices (BBDs) that have been developed over the past 12 years (Edelman et al., 1992; Krichmar & Edelman, 2002; Seth et al., 2004a, 2004b; Krichmar, Nitz, et al., 2005; Krichmar, Seth, et al., 2005). Details of the construction and behavior of Darwin X are available elsewhere (Krichmar et al., 2005a, 2005b), and only those necessary for comprehension will be repeated here. Figure 2 shows the device in its environment, which consisted of an enclosed arena within which paper stripes of different widths and colors were hung on different walls. Darwin X was equipped with a CCD camera, enabling it to detect these stimuli. Figure 3 provides a schematic diagram of the simulated nervous system, in which different sensory streams converged onto a hippocampal region in which there were several levels of signal processing before the processed signals projected divergently back to sensory cortex (Lavenex & Amaral, 2000). The sensory streams included visual "where" signals that respond to stripe width (parietal area Pr), visual "what" signals that respond to stripe color (inferotemporal area IT), and head-direction signals (anterior thalamic area ATN). These areas were reciprocally connected to entorhinal area EC (subdivided into ECin and ECout). Both trisynaptic pathways (EC→DG→CA3→CA1) and perforant shortcuts (e.g., EC→CA1) were included in the simulated hippocampal anatomy (Witter, Naber, & van Haeften, 2000). The full nervous system of Darwin X contained 50 neural areas, approximately 90,000 mean-firing-rate neuronal units, and approximately 1.4 million synaptic connections.

Darwin X was trained on a dry version of the Morris water maze task (Morris, 1984). A hidden platform was placed in one quadrant of the arena (see Figure 2). This platform could be detected by Darwin X only when the device was directly overhead. Each encounter with the platform stimulated a value system that modulated synaptic plasticity at various loci in the simulated nervous system (see Figure 3). Synaptic plasticity in Darwin X was based on a modified BCM learning rule (Bienenstock, Cooper, & Munro, 1982; Krichmar, Seth, et al., 2005) in which synapses between neuronal units with strongly correlated firing rates are potentiated and synapses between neuronal units with weakly correlated firing rates are depressed. As a result of the above features, Darwin X learned to navigate to the platform from arbitrary starting locations. Darwin X was trained over 16 separate trials, each beginning from one of four different starting locations. As reported in detail in Krichmar, Seth, et al. (2005), initially Darwin X searched randomly for the platform, but after about 10 trials, the device reliably took a comparatively direct route to the platform from any starting location.

2.2 Context Networks. To identify context networks in Darwin X's neural activity, we selected 15 neuronal units in area CA1 of the simulated
hippocampus, whose activity directly affected motor output. We refer to these units as reference units. For each reference unit, we then identified various time steps, drawn from a variety of experimental trials, at which the reference unit was active and at which Darwin X changed its direction. This combination of reference units and time steps provided 96 neural references (NRs; see Figure 1A) that spanned the learning process. It is important to emphasize here the following distinctions:

- Neuronal units, which make up all parts of Darwin X's simulated nervous system.
- Reference unit, a neuronal unit in CA1 chosen for its influence on behavior.
- Neural reference, the activity of a particular reference unit at a particular time. In this analysis, each reference unit was associated with multiple NRs. While it may be natural to select NRs that correspond to behavioral outputs, NRs can in principle correspond to the activity of any neuronal unit at any time.
We then recursively examined the activity of all neuronal units (throughout the simulated nervous system) that led to each NR (see Figure 1A). This procedure has previously been referred to as a backtrace (Krichmar, Nitz, et al., 2005). For each NR, we identified the list of neuronal units that were both anatomically connected to the corresponding reference unit and were active, above a threshold, at the previous time step. The threshold sets the minimum value of the product of presynaptic activity and synaptic strength for a synaptic interaction to be included; it was set to be very low (0.01). A second iteration was carried out by repeating this procedure for each neuronal unit in this list. In total, six successive iterations were carried out for each NR, reflecting six simulation cycles of neural activity, which corresponded to approximately 1.2 s of real time. This limit was chosen in order to isolate the most salient neural interactions for a particular NR while avoiding a combinatorial explosion in the size of successive iterations. The resulting context networks consisted of multiple paths of functional interactions leading, through time, to the corresponding NRs (see Figure 1A). We emphasize that the term context network is used here to refer specifically to the networks described and does not connote alternative uses of the term context pertaining to, for example, invariances in environmental, emotional, or cognitive conditions (Mackintosh, 1983; Bouton, 2004; Fanselow, 2000; Baars, 1988). In principle, this procedure could be extended to NRs consisting of multiple coactivated neuronal units, in which case the first iteration of the backtrace would be carried out using a list of neuronal units rather than a single neuronal unit. However, the corresponding context networks would likely increase in size very rapidly, leading to combinatorial problems.
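The backtrace can be sketched as follows; all names and data structures (per-unit activity traces and lists of anatomical afferents) are hypothetical stand-ins for Darwin X's actual simulation state.

```python
def backtrace(nr_unit, t, activity, synapses, depth=6, theta=0.01):
    # Build the context network of an NR: activity[u][s] is unit u's activity
    # at time step s, and synapses[post] lists (pre, weight) pairs for the
    # anatomical afferents of post. An edge (pre -> post) at time tau is
    # included when presynaptic activity times synaptic strength exceeds theta.
    edges = set()
    frontier = {nr_unit}
    for step in range(depth):
        tau = t - step
        next_frontier = set()
        for post in frontier:
            for pre, w in synapses[post]:
                if activity[pre][tau - 1] * w > theta:
                    edges.add((pre, post, tau))
                    next_frontier.add(pre)
        frontier = next_frontier
    return edges
```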
2.3 Causal Cores. To identify the causal core of each NR, we applied a Granger causality analysis to its context network (Granger, 1969; Seth, 2005). The concept of Granger causality is based on prediction, and its application usually involves linear regression modeling. To illustrate Granger causality, suppose that the temporal dynamics of two time series, X_1(t) and X_2(t) (both of length T), can be described by a bivariate autoregressive model:
$$X_1(t) = \sum_{j=1}^{p} A_{11,j} X_1(t-j) + \sum_{j=1}^{p} A_{12,j} X_2(t-j) + E_1(t)$$
$$X_2(t) = \sum_{j=1}^{p} A_{21,j} X_1(t-j) + \sum_{j=1}^{p} A_{22,j} X_2(t-j) + E_2(t),$$
where p is the maximum number of lagged observations included in the model (the model order, p < T), A contains the coefficients of the model (i.e., the contributions of each lagged observation to the predicted values of X_1(t) and X_2(t)), and E_1, E_2 are residuals (prediction errors) for each time series. If the variance of E_1 (or E_2) is reduced by the inclusion of the X_2 (or X_1) terms in the first (or second) equation, then it is said that X_2 (or X_1) Granger-causes X_1 (or X_2). In other words, X_2 Granger-causes X_1 if the coefficients in A_12 are jointly significantly different from zero. This can be tested by performing an F-test of the null hypothesis that A_12 = 0, given assumptions of covariance stationarity on X_1 and X_2. The magnitude of a Granger causality interaction can be estimated by the logarithm of the corresponding F-statistic (Geweke, 1982).

We used Granger causality to assess the causal significance of every connection in each context network (see Figure 1B). For each connection, X_1 and X_2 corresponded to the activity time series of the presynaptic and postsynaptic neuronal units, respectively. To ensure a robust estimation of causal significance, X_1 and X_2 contained neuronal unit activities for the entire trial in which the corresponding NR was located. The length of these time series ranged from 450 to 4992 time steps (in contrast to the 6 time steps incorporated into each backtrace; see above), and each contained many separate periods of neuronal activity and inactivity. To achieve covariance stationarity, first-order differencing was applied to X_1 and X_2. The analyzed time series therefore represented changes in neural activity rather than absolute activity levels. After differencing, only 3 of 96 context networks contained more than 1% of time series that were not covariance stationary (Dickey-Fuller test, P < 0.01). These networks were subsequently excluded from the analysis. For each connection in each of the 93 remaining context networks, we constructed a bivariate autoregressive model of order p = 4. This model order was chosen according to the Bayesian information criterion (BIC) (Schwartz, 1978); the mean minimum BIC, computed from all bivariate models, was 3.46.
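A minimal sketch of this per-connection test is given below, assuming equal-length activity traces for the two units; it includes the first-order differencing described above but omits the Dickey-Fuller screening and the Bonferroni correction applied in the full analysis.

```python
import numpy as np
from scipy.stats import f as f_dist

def lag_matrix(x, p):
    # rows correspond to t = p..T-1, columns to x[t-1], ..., x[t-p]
    return np.column_stack([x[p - j: len(x) - j] for j in range(1, p + 1)])

def granger_f_test(x1, x2, p=4):
    # Does the past of x2 improve the prediction of x1 beyond x1's own past?
    # Returns (log F, p-value): the magnitude and significance of the
    # Granger causality interaction x2 -> x1.
    x1, x2 = np.diff(x1), np.diff(x2)    # first-order differencing
    y = x1[p:]
    Xr = np.column_stack([np.ones(len(y)), lag_matrix(x1, p)])  # restricted
    Xf = np.column_stack([Xr, lag_matrix(x2, p)])               # full model
    rss_r = np.sum((y - Xr @ np.linalg.lstsq(Xr, y, rcond=None)[0]) ** 2)
    rss_f = np.sum((y - Xf @ np.linalg.lstsq(Xf, y, rcond=None)[0]) ** 2)
    dof = len(y) - Xf.shape[1]
    F = ((rss_r - rss_f) / p) / (rss_f / dof)
    return np.log(F), 1.0 - f_dist.cdf(F, p, dof)
```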
To verify that the order-4 models sufficiently described the data, we noted that the mean adjusted residual sum of squares RSS_adj was sufficiently high (0.35 ± 0.168). From each bivariate model, we calculated the statistical significance of a Granger causality interaction from the presynaptic unit (X_1) to the postsynaptic unit (X_2). To take account of the large number of connections in each context network, a Bonferroni correction was applied to the significance thresholds (corrected P < 0.01). The resulting networks of significant Granger causality interactions are referred to here as Granger networks. The causal core of each NR was identified by extracting the subset of the corresponding Granger network consisting of all causally significant connections that led, via other causally significant connections, to the NR (see Figure 1C). We emphasize the importance of this step. In order to be causally relevant to an NR, a causally significant interaction must be part of a chain of causally significant interactions leading to the NR.

To assess the robustness of our results, we repeated the Granger causality analysis of each context network using a range of Bonferroni-corrected significance thresholds (P < 0.02 to P < 0.05). The resulting Granger networks and causal cores were very similar in all cases to those obtained using the default threshold (P < 0.01). Robustness to model order p was assessed by repeating the analysis using order-8 models, as indicated by the Akaike information criterion, an alternative to the BIC (Akaike, 1974). The resulting Granger causalities were identical to those identified by the order-4 models.

2.4 Network Measures. Throughout this letter, we use the term network to provide a useful and commensurable representation, in terms of nodes and edges, of both real physical interactions (as in context networks) and virtual statistical interactions (as in Granger networks and causal cores). In both cases, nodes correspond to neuronal units. In context networks, edges correspond to synaptic interactions, weighted by the product of synaptic strength and presynaptic activity. In Granger networks and causal cores, edges correspond to statistically significant causal interactions, weighted by the estimated strength of the causal interaction. In order to measure the similarity between two networks A and B, we define the following network similarity index,
$$\eta_{AB} = \frac{2 f(A \cap B)}{f(A) + f(B)},$$
where A ∩ B is the network consisting of the edges jointly present in networks A and B, and the function f(X) returns the size (number of edges) of the network X. This index is equal to 1 if A and B are identical and is equal to 0 if A and B are nonoverlapping.
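For networks represented as sets of directed edges, the index is immediate to compute; a minimal sketch (the convention for two empty networks is an assumption, as the text does not specify it):

```python
def network_similarity(edges_a, edges_b):
    # The index eta defined above, for networks given as sets of edges
    # (e.g., (pre, post) tuples).
    a, b = set(edges_a), set(edges_b)
    if not a and not b:
        return 1.0  # convention: two empty networks count as identical
    return 2.0 * len(a & b) / (len(a) + len(b))
```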
It is important to emphasize that while topologies can be compared among different network types using the η index, the specific weight assigned to an edge in a context network cannot be directly compared to the corresponding value in a Granger network or causal core. Networks of each type were visualized using either an energy-minimization method to emphasize network structure or a standardized circular arrangement to aid comparison among networks. In the former case, we used the Pajek program (http://vlado.fmf.uni-lj.si/pub/networks/pajek/), which implements the Kamada-Kawai energy minimization algorithm, with the result that nodes that are strongly interconnected tend to be bunched together. Because this algorithm was applied separately for each visualized network, spatial arrangements cannot be compared between networks.

3 Results

3.1 Causal Cores in Darwin X. Figure 4 shows the context network for an example NR (reference unit 6 during trial 13). The thickness of each line and size of each arrowhead is determined by the product of synaptic strength and presynaptic activity. Figure 5 shows the corresponding Granger network: the subset of interactions of the context network in Figure 4 calculated to be causally significant. Here, the thickness of each line reflects the strength of the corresponding Granger causality interaction. Figure 6 shows the corresponding causal core: the subnetwork of Figure 5 following removal of all dead-end edges that did not participate in causal chains leading to the NR. As in the Granger network (see Figure 5), the thickness of each line in the causal core reflects the strength of the corresponding Granger causality interaction.

Comparing the networks in Figures 4, 5, and 6, it is apparent that the context network is much larger than the Granger network, which is itself much larger than the causal core. For this example NR, the causal core has several trisynaptic pathways in which signals from sensory cortex causally influence entorhinal cortex, which in turn causally influences dentate gyrus and CA3 before reaching the NR. One example of such a pathway is shown by the gray shading in Figure 6. In the context network (see Figure 4), there is a large central cluster of entorhinal interactions interspersed with entorhinal-CA1 interactions. The Granger network (see Figure 5) has a comparatively sparse architecture, with considerably fewer intra-entorhinal interactions. Intra-entorhinal interactions are almost totally absent in the corresponding causal core (see Figure 6). Figure 7 shows the average composition and size of all 93 context networks, Granger networks, and causal cores. Consistent with the above observations, context networks were significantly larger than Granger networks, which were themselves significantly larger than causal cores.
Intra-entorhinal interactions dominated context networks, were less prominent in Granger networks, and were comparatively rare in causal cores. The expanded segments of each chart show the components of the trisynaptic and perforant pathways. These components were much more prominent in causal cores than in context networks or Granger networks, suggesting that these pathways had greater causal influence, in relation to the NRs, than entorhinal interactions.

How does the size of a causal core depend on the amount of time that is integrated by a context network? As noted previously (in section 2.2), each context network was constructed by iterating six time steps back from a given NR (a backtrace). Figure 8 shows the ratio of causal core size to the corresponding context network size, for each NR separately, as the backtrace procedure is iterated. As additional time steps were incorporated in the analysis (i.e., as the context network reached further back in time from the NR), the majority of causal cores constituted a progressively smaller fraction of the corresponding context networks. This raises the possibility that in Darwin X, even if the construction of a context network were to be iterated through many more time steps than is feasible in practice, the corresponding causal core may not change substantially.

Can causal cores in Darwin X be identified on the basis of synaptic strengths, without recourse to a statistical causality analysis? To address this question, we examined the relation between synaptic strength and Granger causality for each NR separately, using the network similarity index η defined in section 2.4. From each context network, we selected 100 subnetworks, the jth obtained by removing the edges corresponding to the j% weakest synaptic weights, for all j ∈ [1, 100]. From each subnetwork, we also removed those dead-end edges that did not lead to the NR. We then calculated the network similarity η between each subnetwork and the corresponding causal core. The resulting values represent, for a particular NR, the similarity between the causal core and a series of subnetworks identified by narrowing the corresponding context network on the basis of synaptic strength. Figure 9A shows the mean network similarity η across all NRs as a function of j. For low j, η is low, as expected, because the context networks are still large. η is also low for high j, which indicates that causal cores do not consist solely of the strongest synaptic weights. Figure 9B shows the same analysis repeated using synaptic interaction strength (i.e., the product of synaptic weight and presynaptic activity) instead of synaptic weight. The results are very similar. The maximum η averaged across all NRs was approximately 0.3 for both analysis methods, and we found the maximum value of η = 1 for only 1/93 NRs, in the second analysis only. These observations indicate that causal cores cannot in general be identified on the basis of synaptic weights or synaptic interaction strengths. In relation to the above point, it bears emphasizing that whereas a Granger-causality analysis offers a natural threshold, based on
Figure 4: Example context network from Darwin X, obtained from CA1 reference unit 6 during trial 13. Each circle represents a single neuronal unit; different colors represent distinct brain areas, as defined in section 2.1 and Figure 3. The thickness of each line reflects the product of synaptic strength and presynaptic activity.
Figure 5: The Granger network corresponding to the context network shown in Figure 4. This network comprises those connections in Figure 4 that are causally significant. The thickness of each line reflects the strength of the Granger causality interaction (the logarithm of the corresponding F-statistic; see section 2.3).
Figure 6: The causal core corresponding to the context network in Figure 4 and the Granger network in Figure 5. This network was derived by removing from the corresponding Granger network all dead-end edges that did not participate in causal chains leading to the NR. The gray shading shows one of the several trisynaptic pathways in the causal core. As in Figure 5, the thickness of each line reflects the strength of the Granger causality interaction.
[Figure 7 consists of three pie charts, one each for context networks (mean size 723 edges), Granger networks (mean size 223 edges), and causal cores (mean size 29 edges), with segments for the connection types ATN→EC, IT→EC, Pr→EC, EC→EC, EC→DG, EC→CA3, EC→CA1, DG→CA3, CA3→CA3, and CA3→CA1.]
Figure 7: Network composition and average network size (number of edges) for context networks, Granger networks, and causal cores. Different colors represent different connection types (see section 2.1 and Figure 3). Expanded segments show components of trisynaptic pathways (EC→DG, DG→CA3, and CA3→CA1) and perforant pathways (EC→CA1). Context networks had a much higher proportion of intra-entorhinal connections than causal cores.
Figure 8: The ratio of causal core size to context network size, in terms of connections (K ) and neuronal units (N), for all NRs. This ratio is shown as a function of the iteration depth of the backtrace procedure (see section 2.2).
Figure 9: Network similarity η between causal cores and a series of subnetworks extracted from the corresponding context networks. For each NR, 100 subnetworks were extracted by removing j% of the weakest edges in the corresponding context networks, as well as dead-end edges (see the text for details). (A) Mean η across all 93 NRs for subnetwork extraction based on synaptic weight. (B) Mean η based on synaptic interaction strength. Shaded areas represent standard errors.
In relation to the above point, it bears emphasizing that whereas a Granger causality analysis offers a natural threshold, based on statistical significance, for selecting a subnetwork, thresholds based on synaptic weight or synaptic interaction strength are necessarily arbitrary. Comparisons of size and composition would therefore be more difficult to interpret in these latter cases.

3.2 Modulation of Causal Interactions During Learning. We turn now to the modulation of causal interactions during learning of the hidden-platform task by Darwin X.
Figure 10: Causal cores for reference units 5 (top) and 10 (bottom) from a selection of trials spanning the learning period. To facilitate comparison between networks, they are arranged using a circular template in which units from the same brain area are placed next to each other. The causal cores diminished in size over time.
Figure 10 shows causal cores from two reference units for several NRs spanning the learning process. To facilitate comparison between networks, and in contrast to Figures 4 to 6, these networks are arranged on a circular template in which neuronal units from the same brain area are placed next to each other (see section 2.4 and Figure 3). Inspection of Figure 10 shows that the causal cores diminished in size over time, mainly due to the pruning of intra-entorhinal interactions. We refer to this reduction in size during learning as refinement.

Causal core refinement could arise in three different ways: (1) as a correlate of a reduction in size of the corresponding context networks; (2) as a result of a general reduction in the likelihood with which interactions are causally significant (in which case Granger networks, but not context networks, would be similarly refined); and (3) as the result of selection of particular causal pathways from a diverse and dynamic repertoire of neural interactions (in which case only causal cores would show refinement). To distinguish these possibilities, Figure 11 shows network sizes as a function of experimental trial for each type of network and separately for each of the 15 reference units. In agreement with the third account, only causal cores showed significant refinement (see the inset in Figure 11C).
Figure 11: (A) Size of context networks as a function of trial number during learning, in terms of number of edges (K), for all 15 reference units. Each line represents a different reference unit. (B) Sizes of the corresponding Granger networks. (C) Sizes of the corresponding causal cores. Insets of panels B and C show the same data on a magnified scale.
Further, causal core size appeared to reach an asymptote after about 10 trials, which corresponds to the number of trials required for Darwin X to learn to move more or less directly to the hidden platform from any starting point (Krichmar, Seth, et al., 2005). These observations raise the possibility that causal core refinement may be a correlate of learning at the level of neuronal populations. Since causal cores in Darwin X were not composed solely of the strongest synaptic interactions (see Figure 9) and since neither context networks nor Granger networks showed similar reductions in size during learning (see Figure 11), causal core refinement in Darwin X may best be understood as resulting from the modulation, via synaptic plasticity, of causal pathways within neuronal populations.

4 Discussion

We have developed a theoretical network analysis that can distinguish statistically causal interactions (according to Granger causality) from noncausal interactions in the neural activity leading up to the activity of a particular neuronal unit at a particular time (a neural reference, NR). According to our analysis, activity within neuronal populations may involve the selection of causal interactions from diverse and dynamic repertoires, and learning may involve synaptic changes that modulate selection at the population level.

An illustrative application of our analysis to a brain-based device provided an opportunity to explore patterns of statistically causal interactions in large neuronal repertoires during behavior. Causal interactions in Darwin X arose from network dynamics generated by interactions among anatomy, the specific embodiment of Darwin X, and behavior in a structured environment, and therefore could not be inferred directly from design specifications. Our analysis of Darwin X revealed several novel aspects of such
interactions, notably the small size of causal cores as compared to context networks (see Figures 4–7) and the reduction in size (refinement) of causal cores during learning (see Figures 10 and 11). We verified that causal cores did not comprise only the strongest synaptic weights or synaptic interactions and could not in general be identified on the basis of these features (see Figure 9). Because brain-based devices such as Darwin X reflect vertebrate neuroanatomy and neurophysiology, their analysis can provide potentially useful heuristics for the interpretation of neurobiological data (Krichmar & Edelman, 2002; Seth, Sporns, & Krichmar, 2005). Below, we discuss some implications of these results for the interpretation and analysis of population neural activity, for the relation between synaptic plasticity and behavioral learning, and for the dynamic mechanisms underlying hippocampal function. We also discuss limitations, constraints, and possible extensions of the approach.

4.1 Population Activity and Neuronal Variability. A common view of neural population activity is that activity patterns encode information about sensory, motor, or internal state variables (Pouget, Dayan, & Zemel, 2000; Georgopoulos, Schwartz, & Kettner, 1986; Lee, Rohrer, & Sparks, 1988), and that population activity patterns constitute representations that provide substrates for computational operations (Averbeck, Latham, & Pouget, 2006; Harris, 2005). According to this view, mechanisms are required for encoding afferent signals into activity patterns as well as for decoding these activity patterns in order to generate output (Pouget et al., 2000; Seung & Sompolinsky, 1993). An alternative perspective, which accords with the theory that the brain is a selectional system (Edelman, 1978, 1987, 1993), is that population activity provides a dynamic repertoire from which specific causal pathways are selected during behavior and learning. The analysis here provides a means of identifying these pathways in the form of causal cores. According to this analysis, causal cores are not information-bearing representations and do not require explicit encoding and decoding. Instead, they are dynamic, causally effective processes that give rise to specific outputs.

These views offer contrasting interpretations of neuronal variability. Experimental observations of the brain during behavior indicate large variations in neural activity even for performance of the same task or perception of the same stimulus (Edelman, 1987; Softky & Koch, 1993; Schwartz, 1994; Grobstein, 1994). Often such variability has been treated as an inconvenience and minimized or removed by averaging in order to reveal the underlying activity patterns. From a population coding perspective, attempts have been made to identify whether variation is unavoidable neural noise or, alternatively, is part of a signal transmitted to other neurons (Stein, Gossen, & Jones, 2005; Knoblauch & Palm, 2005; Averbeck et al., 2006) via a neural code (deCharms & Zador, 2000). In
contrast, according to the view presented here, variability in the neural activity leading to a given event is expected and indicates the existence of diverse and dynamic repertoires underlying selection.

4.2 Synaptic Plasticity and Behavioral Learning. It is often assumed that learned behavior depends on the cumulative effects of long-term potentiation (LTP) or depression (LTD) of synapses. However, synaptic plasticity and behavioral learning operate over vastly different temporal (Drew & Abbott, 2006) and spatial scales. Although experimental evidence suggests that behavioral learning can induce LTP and LTD in certain cases (Malenka & Bear, 2004; Dityatev & Bolshakov, 2005), the precise functional contributions of synaptic plasticity to learned behavior have remained unclear. The results here raise the possibility that synaptic plasticity may underlie learning via modulations of causal interactions at the population level, specifically, by giving rise to the refinement of causal cores. Accordingly, the comparatively large size of causal cores early in learning may reflect a situation in which selection had not yet acted to consolidate a causal pathway within a large space of possible pathways. Recent experimental evidence is consistent with this selectionist view. For example, Barnes and colleagues sampled populations of sensorimotor striatal projection neurons while rats learned a conditional T-maze task (Barnes, Kubota, Hu, & Graybiel, 2005). They found that variability in task-related spiking activity (as measured by entropy) diminished during training and overtraining. Interestingly, by including extinction and reacquisition phases in the learning paradigm, they showed that this reduction in variability can be reversed.

4.3 Causal Interactions in the Hippocampus. The close correspondence of Darwin X with the neuroanatomy and neurophysiology of the mammalian hippocampus suggests one comparatively specific hypothesis regarding hippocampal function. Intra-entorhinal interactions, which dominated most context networks, were comparatively rare in Granger networks, and especially in causal cores. This raises the possibility that mammalian entorhinal cortex may play a modulatory role, as opposed to a causally driving role, with respect to neuronal responses in CA1 during learning. By the same token, the prevalence of trisynaptic and perforant pathways within causal cores suggests that these pathways may causally drive CA1 responses. Although these possibilities are difficult to test explicitly in vivo, the notion of causally efficacious trisynaptic and perforant pathways is supported by both anatomy (see Figure 3) and recent experimental and modeling results that suggest a functionally significant perforant pathway (Brun, Otnass, & Molden, 2002; Hasselmo, Bodelon, & Wyble, 2002). In addition, evidence is accumulating that entorhinal cortex can modulate behavioral expression, for example, by driving the formation of associations
for long-term memory (Frank & Brown, 2003) and by influencing the expression of conditioned fear in rats (Burwell, Bucci, Sanborn, & Jutras, 2004; Hebert & Dash, 2004). Future work will address in detail further implications of this analysis for theories of hippocampal function.

4.4 Role of Neuronal Inactivity. The operation of any neural system is likely to involve neurons for which momentary inactivity is functionally significant. Our analysis accounts for this aspect of neural dynamics in the identification of causal cores from context networks, but not in the identification of context networks themselves. The reason is that momentary inactivity can be functionally significant only as part of an extended time series. As described in section 2.2, context networks comprise those synaptic interactions that could potentially have influenced a neural event at a particular time. Accordingly, context networks incorporate neural interactions corresponding to active synapses only. By contrast, the assessment of causal significance by Granger causality is based on the relative predictive ability of time series, which requires each time series to exhibit periods of inactivity as well as periods of activity. Therefore, the identification of a causal core from a context network naturally accommodates momentary neuronal inactivity because such inactivity contributes to predictive ability.

4.5 Validating Causal Hypotheses Using Lesions. A useful means of testing causal relations in general is to lesion or perturb elements of the studied system (Keinan, Sandbank, Hilgetag, Meilijson, & Ruppin, 2004). We have shown in previous work that a Granger causality analysis can successfully predict the behavioral consequences of single-neuron lesions within a highly simplified neural network model and is superior in this regard to predictions based on synaptic weights or synaptic interaction strengths (Seth, 2005). In the analysis here, lesion studies of particular causal cores would be difficult to interpret because the ongoing behavior of Darwin X depends only minimally on the activity of any single neuronal unit. Moreover, Darwin X is a degenerate system (Edelman & Gally, 2001) in which many different causal circuits are likely to give rise to the same behavior. Having said this, it may be informative for future studies to compare lesions to entire brain regions, including, for example, entorhinal areas and trisynaptic/perforant areas.

It bears emphasizing that the present analysis provides causal descriptions in the absence of lesions or exogenous perturbations. The resulting causal cores therefore correspond to the dynamics of the intact system during natural behavior. By contrast, the interpretation of causal inferences based on external interventions is complicated by the fact that the studied system is either no longer intact (for lesions) or may display different behavior (for exogenous perturbations).
4.6 Covariance Stationarity. A requirement for the present analysis is that time series are covariance stationary; that is, means and variances do not change over time. However, biological time series often exhibit strong autocorrelations that are inconsistent with covariance stationarity. Absence of covariance stationarity can be addressed in several ways. This study employed first-order differencing, which can be effective in removing autocorrelative structure but implies that the resulting time series reflect changes in activity rather than absolute activity levels. An alternative approach is prewhitening (Box, Jenkins, & Reinsel, 1994; Langheim, Leuthold, & Georgopoulos, 2006), which involves fitting a separate univariate regression model to each time series and performing any subsequent multivariate analysis on the residuals only. Finally, since short time series are more likely to be covariance stationary than long time series, a third approach would be to break each time series into several comparatively short segments. Although this approach has the advantage of allowing analysis of time-varying causal influences (Hesse, Möller, Arnold, & Schack, 2003), the reduced number of samples in each time series means that statistical inferences of causality may be weaker.

The present analysis is highly conservative with respect to covariance stationarity because we included in each time series the maximum number of data points possible. We assessed the robustness of our results to this factor by repeating our analysis after splitting each time series into two equal parts. We found that the causal cores corresponding to each time series segment were nearly identical to the causal cores corresponding to the full series, suggesting that our approach was not excessively conservative.
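The two preprocessing ideas above are simple to state in code. The sketch below (numpy only, and not the exact criteria used in this study) applies first-order differencing and then performs a crude split-half comparison of means and variances as a rough indicator of covariance stationarity; formal unit-root tests would be preferred in practice.

```python
import numpy as np

def first_difference(x):
    """First-order differencing: the resulting series reflects changes
    in activity rather than absolute activity levels."""
    return np.diff(np.asarray(x, dtype=float))

def roughly_stationary(x, tol=0.1):
    """Crude split-half check: means and variances of the two halves
    should agree to within a relative tolerance. A rough screen only,
    not a formal stationarity test."""
    x = np.asarray(x, dtype=float)
    first, second = np.array_split(x, 2)
    scale = x.std() + 1e-12
    mean_ok = abs(first.mean() - second.mean()) <= tol * scale
    var_ok = abs(first.var() - second.var()) <= tol * (x.var() + 1e-12)
    return bool(mean_ok and var_ok)
```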
4.7 Application to Other Neural Simulations. Although our results were generated by analyzing a subset of CA1 neuronal units in a particular brain-based device (Darwin X), it should be emphasized that our analysis can be applied to a broad range of neural simulations. Such applications could usefully assess the generality of the results described here. Some of these results—for example, the particular constitution of context networks and causal cores—may be expected to vary according to the details of the analyzed system. Other results may be more general—for example, the correspondence of causal core refinement with behavioral learning and the small size of causal cores as compared to context networks. Darwin X was based on a mean-firing rate neuronal model, in which the activity of each neuronal unit represented the combined activity of approximately 100 real neurons. Identifying causal cores in real neural systems will require extending our analysis to inferences of causality based on spiking neuronal dynamics, for example by combining point-process models with maximum likelihood models (Okatan, Wilson, & Brown, 2005). Such an extension could usefully be tested on future brain-based devices having spiking neuronal dynamics.
4.8 Application to Empirical Data. More generally, application of our analysis requires the identification of all (or a large fraction) of the connected and coactive precursors to a given NR. At present, this is fully achievable only in a computational model, and only in a brain-based device can an NR be meaningfully associated with real-world behavior. However, recent methodological advances suggest that investigators soon may be able to characterize both anatomical and functional architectures in biological systems at microscopic scales. For example, retrograde transneuronal transport of rabies virus has been used to characterize the microanatomy of connectivity between the basal ganglia and cortex (Kelly & Strick, 2004). Functional microarchitectures have been mapped using metabolic markers for neuronal activity (Konkle & Bielajew, 2004), high-resolution optical imaging (Raffi & Siegel, 2005), the genetic manipulation of neurons to express fluorescent voltage sensors (Ahrens et al., 2004), and the bulk loading of calcium-sensitive indicators (Cossart et al., 2005; Ohki et al., 2005). Importantly, several of these techniques can be applied to behaving primates (Raffi & Siegel, 2005). These advances reflect a changing experimental emphasis from the blind recording of neurons toward the detailed characterization of specific neurons as their dynamics evolve in time in behaving animals. New analysis methods will be required to make sense of the data generated by these new techniques. We suggest that our analysis provides one such method and that the insights that could potentially follow may best be exposed by prior application of the analysis to brain-based devices. Application of the novel experimental methods described above will also permit testing of several predictions of the model, in particular, the small size of causal cores as compared to context networks and the correspondence between learning and causal core refinement.

Acknowledgments

We are grateful to Jeffrey Krichmar, Joseph Gally, and Eugene Izhikevich for many constructive discussions. We also benefited from the comments of two anonymous referees. We thank Dr. Krichmar in particular for his contribution to the design and implementation of Darwin X. This research was supported by grant N00014-03-1-0980 from the Office of Naval Research, by DARPA, and by the Neuroscience Research Foundation.

References

Ahrens, K., Hollender, L., Heider, B., Lee, H., Dean, C., Callaway, E., Isacoff, E., & Siegel, R. (2004). Imaging flare, a genetically encoded fluorescent voltage sensor, with two-photon microscopy in live mammalian cortex. Abstr. Soc. Neurosci. Akaike, H. (1974). A new look at the statistical model identification. IEEE Trans. Autom. Control, 19, 716–723.
Averbeck, B., Latham, P., & Pouget, A. (2006). Neural correlations, population coding and computation. Nature Reviews Neuroscience, 7(5), 358–367. Baars, B. (1988). A cognitive theory of consciousness. Cambridge: Cambridge University Press. Barnes, T., Kubota, Y., Hu, D., & Graybiel, A. (2005). Activity of striatal neurons reflects dynamic encoding and recoding of procedural memories. Nature, 437, 1158–1161. Bienenstock, E., Cooper, L., & Munro, P. (1982). Theory for the development of neuron selectivity: Orientation specificity and binocular interaction in the visual cortex. Journal of Neuroscience, 2(1), 32–48. Bouton, M. (2004). Context and behavioral processes in extinction. Learning and Memory, 11(5), 485–494. Box, G., Jenkins, G., & Reinsel, G. (1994). Time series analysis: Forecasting and control. (3rd ed.). Englewood Cliffs, NJ: Prentice Hall. Brun, V., Otnass, M., & Molden, S. (2002). Place cells and place recognition maintained by direct entorhinal-hippocampal circuitry. Science, 296, 2243–2246. Burwell, R., Bucci, D., Sanborn, M., & Jutras, M. (2004). Perirhinal and postrhinal contributions to remote memory for context. J. Neurosci., 24(49), 11023–11028. Cossart, R., Ikegaya, Y., & Yuste, R. (2005). Calcium imaging of cortical network dynamics. Cell Calcium, 37, 451–457. deCharms, R., & Zador, A. (2000). Neural representation and the cortical code. Annu. Rev. Neurosci., 23, 613–647. Dityatev, A., & Bolshakov, V. (2005). Amygdala, long-term potentiation, and fear conditioning. Neuroscientist, 11, 75–88. Drew, P. J., & Abbott, L. F. (2006). Extending the effects of spike-timing-dependent plasticity to behavioral timescales. Proceedings of the National Academy of Sciences, USA, 103(23), 8876–8881. Edelman, G. (1978). Group selection and phasic re-entrant signalling: A theory of higher brain function. In V. Mountcastle (Ed.), The mindful brain. Cambridge, MA: MIT Press. Edelman, G. (1987). Neural Darwinism. New York: Basic Books. Edelman, G. (1993). Selection and reentrant signaling in higher brain function. Neuron, 10, 115–125. Edelman, G., & Gally, J. (2001). Degeneracy and complexity in biological systems. Proceedings of the National Academy of Sciences, USA, 98(24), 13763– 13768. Edelman, G., Reeke, G., Gall, W., Tononi, G., Williams, D., & Sporns, O. (1992). Synthetic neural modeling applied to a real-world artifact. Proc. Nat. Acad. Sci. USA, 89(15), 7267–7271. Fanselow, M. (2000). Contextual fear, gestalt memories, and the hippocampus. Behavioral and Brain Research, 110(1–2), 73–81. Frank, L. M., & Brown, E. (2003). Persistent activity and memory in the entorhinal cortex. Trends Neurosci., 26(8), 400–401. Georgopoulos, A., Schwartz, A., & Kettner, R. (1986). Neuronal population coding of movement direction. Science, 233, 1416–1419. Geweke, J. (1982). Measurement of linear dependence and feedback between multiple time series. Journal of the American Statistical Association, 77, 304–313.
Granger, C. (1969). Investigating causal relations by econometric models and cross-spectral methods. Econometrica, 37, 424–438. Grobstein, P. (1994). Variability in brain function and behavior. In V. Ramachandran (Ed.), The encyclopaedia of human behavior (pp. 447–458). Orlando, FL: Academic Press. Harris, K. (2005). Neural signatures of cell assembly organization. Nature Reviews Neuroscience, 6, 399–407. Hasselmo, M., Bodelon, C., & Wyble, B. (2002). A proposed function for hippocampal theta rhythm: Separate phases of encoding and retrieval enhance reversal of prior learning. Neural Comput., 14, 793–817. Hebert, A., & Dash, P. (2004). Plasticity in the entorhinal cortex suppresses memory for contextual fear. J. Neurosci., 24(45), 10111–10116. Hesse, W., Möller, E., Arnold, M., & Schack, B. (2003). The use of time-variant EEG Granger causality for inspecting directed interdependencies of neural assemblies. Journal of Neuroscience Methods, 124, 27–44. Izhikevich, E., Gally, J., & Edelman, G. (2004). Spike-timing dynamics of neuronal groups. Cerebral Cortex, 14, 933–944. Keinan, A., Sandbank, B., Hilgetag, C., Meilijson, I., & Ruppin, E. (2004). Fair attribution of functional contribution in artificial and biological networks. Neural Computation, 16, 1887–1915. Kelly, R., & Strick, P. (2004). Macro-architecture of basal ganglia loops with the cerebral cortex: Use of rabies virus to reveal multisynaptic circuits. Prog. Brain Res., 143, 449–459. Knoblauch, A., & Palm, G. (2005). What is signal and what is noise in the brain? Biosystems, 79(1–3), 83–90. Konkle, A., & Bielajew, C. (2004). Tracing the neuroanatomical profiles of reward pathways with markers of neuronal activation. Rev. Neurosci., 15(6), 383–414. Krichmar, J., & Edelman, G. (2002). Machine psychology: Autonomous behavior, perceptual categorization and conditioning in a brain-based device. Cerebral Cortex, 12(8), 818–830. Krichmar, J., Nitz, D., Gally, J., & Edelman, G. (2005). Characterizing functional hippocampal pathways in a brain-based device as it solves a spatial memory task. Proc. Natl. Acad. Sci. USA, 102(6), 2111–2116. Krichmar, J., Seth, A., Nitz, D., Fleischer, J., & Edelman, G. (2005). Spatial navigation and causal analysis in a brain-based device modeling cortical-hippocampal interactions. Neuroinformatics, 3(3), 197–222. Langheim, F., Leuthold, A., & Georgopoulos, A. (2006). Synchronous dynamic brain networks revealed by magnetoencephalography. Proceedings of the National Academy of Sciences, USA, 103(2), 455–459. Lavenex, P., & Amaral, D. (2000). Hippocampal-neocortical interaction: A hierarchy of associativity. Hippocampus, 10, 420–430. Lee, C., Rohrer, W., & Sparks, D. (1988). Population coding of saccadic eye movements by neurons in the superior colliculus. Nature, 332, 357–360. Mackintosh, N. (1983). Conditioning and associative learning. New York: Oxford University Press. Malenka, R., & Bear, M. (2004). LTP and LTD: An embarrassment of riches. Neuron, 44, 5–21.
Morris, R. (1984). Developments of a water-maze procedure for studying spatial learning in the rat. Journal of Neuroscience Methods, 11, 47–60. Ohki, K., Chung, S., Ch'ng, Y. H., Kara, P., & Reid, R. C. (2005). Functional imaging with cellular resolution reveals precise microarchitecture in visual cortex. Nature, 433, 597–603. Okatan, M., Wilson, M., & Brown, E. (2005). Analyzing functional connectivity using a network likelihood model of ensemble neural spiking activity. Neural Computation, 17(9), 1927–1961. Pouget, A., Dayan, P., & Zemel, R. (2000). Information processing with population codes. Nature Reviews Neuroscience, 1(2), 125–132. Raffi, M., & Siegel, R. (2005). Functional architecture of spatial attention in the parietal cortex of the behaving monkey. Journal of Neuroscience, 25, 5171–5186. Schwartz, A. (1994). Direct cortical representation of drawing. Science, 265, 540–542. Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6(2), 461–464. Seth, A. (2005). Causal connectivity of evolved neural networks during behavior. Network: Computation in Neural Systems, 16, 35–54. Seth, A., & Baars, B. (2005). Neural Darwinism and consciousness. Consciousness and Cognition, 14, 140–168. Seth, A., McKinstry, J., Edelman, G., & Krichmar, J. (2004a). Active sensing of visual and tactile stimuli by brain-based devices. International Journal of Robotics and Automation, 19, 222–238. Seth, A., McKinstry, J., Edelman, G., & Krichmar, J. (2004b). Visual binding through reentrant connectivity and dynamic synchronization in a brain-based device. Cerebral Cortex, 14, 1185–1189. Seth, A., Sporns, O., & Krichmar, J. (2005). Neurorobotic models in neuroscience and neuroinformatics. Neuroinformatics, 3(3), 167–170. Seung, H., & Sompolinsky, H. (1993). Simple models for reading neuronal population codes. Proceedings of the National Academy of Sciences, USA, 90(22), 10749–10753. Softky, W., & Koch, C. (1993). The highly irregular firing of cortical cells is inconsistent with temporal integration of random EPSPs. J. Neurosci., 13(1), 334–350. Stein, R. B., Gossen, E., & Jones, K. (2005). Neuronal variability: Noise or part of the signal? Nat. Rev. Neurosci., 6(5), 389–397. Witter, M., Naber, P., & van Haeften, T. (2000). Anatomical organization of the parahippocampal-hippocampal network. Ann. N.Y. Acad. Sci., 911, 1–24.
Received December 8, 2005; accepted July 12, 2006.
LETTER
Communicated by Lucas Parra
Model Selection for Convolutive ICA with an Application to Spatiotemporal Analysis of EEG Mads Dyrholm [email protected] Intelligent Signal Processing Group, Informatics and Mathematical Modelling, Technical University of Denmark, 2800 Lyngby, Denmark
Scott Makeig [email protected] Swartz Center for Computational Neuroscience, Institute for Neural Computation University of California San Diego, La Jolla CA 92093–0961, U.S.A.
Lars Kai Hansen [email protected] Intelligent Signal Processing Group, Informatics and Mathematical Modelling, Technical University of Denmark, 2800 Lyngby, Denmark
We present a new algorithm for maximum likelihood convolutive independent component analysis (ICA) in which components are unmixed using stable autoregressive filters determined implicitly by estimating a convolutive model of the mixing process. By introducing a convolutive mixing model for the components, we show how the order of the filters in the model can be correctly detected using Bayesian model selection. We demonstrate a framework for deconvolving a subspace of independent components in electroencephalography (EEG). Initial results suggest that in some cases, convolutive mixing may be a more realistic model for EEG signals than the instantaneous ICA model.

1 Introduction

Motivated by the EEG signal's complex temporal dynamics, we are interested in convolutive independent component analysis (cICA), which in its most basic form concerns reconstruction of L + 1 mixing matrices A_\tau and N component signal vectors (innovations), s_t, of dimension K, combining to form an observed D-dimensional linear convolutive mixture
x_t = \sum_{\tau=0}^{L} A_\tau s_{t-\tau}. \qquad (1.1)
Neural Computation 19, 934–955 (2007)
C 2007 Massachusetts Institute of Technology
That is, cICA models the observed data x as produced by K processes whose time courses are first convolved with fixed, finite-length time filters and then summed in the D sensors. This allows a single component to be expressed in the different sensors with variable delays and frequency characteristics. One common application for this model is the acoustic blind source separation problem, in which sound sources are mixed in a reverberant environment. Simple ICA methods that do not take signal delays into account fail to produce satisfactory results for this problem, which has thus far been the focus of much cICA research (e.g., Lee, Bell, & Orglmeister, 1997; Parra, Spence, & Vries, 1998; Sun & Douglas, 2001; Mitianoudis & Davies, 2003; Anemüller & Kollmeier, 2003).

For analysis of human electroencephalographic (EEG) signals recorded from the scalp, ICA has already proven to be a valuable tool for detecting and enhancing relevant source subspace brain signals while suppressing irrelevant noise and artifacts such as those produced by muscle activity and eye blinks (Makeig, Bell, Jung, & Sejnowski, 1996; Jung et al., 2000; Delorme & Makeig, 2004). In conventional ICA, each independent component (IC) is represented as a spatially static projection of cortical activity to the sensors. Results of static ICA decomposition are generally compatible with a view of EEG signals as originating in spatially static cortical domains within which local field potential fluctuations are partially synchronized (Makeig, Enghoff, Jung, & Sejnowski, 2000; Jung et al., 2001; Delorme, Makeig, & Sejnowski, 2002; Makeig, Debener, Onton, & Delorme, 2004; Onton, Delorme, & Makeig, 2005). Modeling EEG data as consisting of convolutive as well as static independent processes allows a richer palette for source modeling, possibly leading to more complete signal independence (Anemüller, Sejnowski, & Makeig, 2003).

In this letter, we present a new cICA decomposition method that, unlike most previous work in the area, operates entirely in the time domain. In wavelet or DFT (discrete Fourier transform) domain approaches, it is common practice to window and taper the data; hence, the choice of the optimal window size is a problem in practice. In an acoustic application of a DFT-based method, for instance, the window size can be determined based on a listening test; however, an equally elegant solution for application to nonaudible data such as EEG has yet to be derived. In our new method, the optimal order of the mixing filters has to be determined, and we provide a principled scheme for doing so based on generalization error. The new scheme also makes no assumptions about the nonstationarity of the components, a key assumption in several successful cICA methods (see, e.g., Parra & Spence, 2000; Rahbar, Reilly, & Manton, 2002) whose relevance to EEG is unclear.
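To make the generative model concrete before turning to estimation, the following sketch simulates equation 1.1. This is an illustration only, assuming numpy; the filter values are arbitrary, and the Laplacian innovations anticipate the synthetic experiment of section 3.

```python
import numpy as np

def convolutive_mix(A, s):
    """Equation 1.1: x_t = sum_{tau=0..L} A[tau] @ s_{t-tau}.
    A: (L+1, D, K) mixing filters; s: (N, K) innovations.
    Samples before t = 0 are treated as zero."""
    L_plus_1, D, K = A.shape
    N = s.shape[0]
    x = np.zeros((N, D))
    for tau in range(L_plus_1):
        x[tau:] += s[:N - tau] @ A[tau].T
    return x

rng = np.random.default_rng(0)
N, D, K, L = 30_000, 3, 2, 30
s = rng.laplace(size=(N, K))                   # i.i.d. Laplacian innovations
A = rng.normal(size=(L + 1, D, K)) / (L + 1)   # arbitrary mixing filters
x = convolutive_mix(A, s)                      # observed D-channel mixture
```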
Previous time domain and DFT domain methods have formulated the problem as one of finding a finite impulse response (FIR) filter that unmixes as in equation 1.2 below (Belouchrani, Meraim, Cardoso, & Moulines, 1997; Choi & Cichocki, 1997; Moulines, Cardoso, & Gassiat, 1997; Lee, Bell, & Lambert, 1997; Attias & Schreiner, 1998; Parra, Spence, & Vries, 1998; Deligne & Gopinath, 2002; Douglas, Cichocki, & Amari, 1999; Comon, Moreau, & Rota, 2001; Sun & Douglas, 2001; Rahbar & Reilly, 2001; Rahbar et al., 2002; Baumann, Köhler, & Orglmeister, 2001; Anemüller & Kollmeier, 2003):

\hat{s}_t = \sum_{\lambda} W_\lambda x_{t-\lambda}. \qquad (1.2)
However, the inverse of the mixing FIR filter modeled in equation 1.2 is, in general, an infinite impulse response (IIR) filter. We thus expect that FIR-based unmixing will require estimation of extended or potentially infinite-length unmixing filters. Our method, by contrast, finds such an unmixing IIR filter implicitly in terms of the mixing model parameters, that is, the A_\tau's in equation 1.1, isolating s_t in equation 1.1 as

\hat{s}_t = A_0^{\#}\left(x_t - \sum_{\tau=1}^{L} A_\tau \hat{s}_{t-\tau}\right), \qquad (1.3)
where A_0^{\#} denotes the Moore–Penrose inverse of A_0. Another advantage of this parameterization is that the A_\tau's allow a separated component to be easily backprojected into the original sensor domain. Other authors have proposed the use of IIR filters for separating convolutive mixtures using the maximum likelihood principle. The unmixing IIR filter, equation 1.3, generalizes that of Torkkola (1996) to allow separation of more than two components. Furthermore, it bears interesting resemblance to that of Choi and Cichocki (1997) and Choi, Amari, Cichocki, and Liu (1999). Though put in different analytical terms, the inverses used there are equivalent to the unmixing IIR filter, equation 1.3. However, the unique expression, equation 1.3, and its remarkable analytical simplicity, are the key to learning the parameters of the mixing model, equation 1.1, directly.

2 Learning the Mixing Model Parameters

Statistically motivated maximum likelihood approaches for cICA have been proposed (Torkkola, 1996; Pearlmutter & Parra, 1997; Parra, Spence, & Vries, 1997; Moulines et al., 1997; Attias & Schreiner, 1998; Deligne & Gopinath, 2002; Choi et al., 1999; Dyrholm & Hansen, 2004) and are attractive for a number of reasons. First, they force a declaration of statistical assumptions, in particular, the assumed distribution of the component signals. Second, a maximum likelihood solution is asymptotically optimal given the assumed observation model and the prior choices for the hidden variables.
Assuming independent and identically distributed (i.i.d.) component signals and noise-free mixing, the likelihood of the parameters in equation 1.1 given the data is

p(X|\{A_\tau\}) = \int \cdots \int \prod_{t=1}^{N} \delta(e_t)\, p(s_t)\, ds_t, \qquad (2.1)
where

e_t = x_t - \sum_{\tau=0}^{L} A_\tau s_{t-\tau} \qquad (2.2)
and \delta(e_t) is the Dirac delta function. In the following derivation, we assume that the number of convolutive processes K does not exceed the dimension D of the data. First, we note that only the Nth term under the product operator in equation 2.1 is a function of s_N. Hence, the s_N-integral may be evaluated first, yielding
p(X|\{A_\tau\}) = \left|A_0^T A_0\right|^{-1/2} \int \cdots \int p(\hat{s}_N) \prod_{t=1}^{N-1} \delta(e_t)\, p(s_t)\, ds_t, \qquad (2.3)
where integration is over all components at all times except s_N, and

\hat{s}_t = A_0^{\#}\left(x_t - \sum_{\tau=1}^{L} A_\tau u_{t-\tau}\right), \quad u_n \equiv \begin{cases} s_n & \text{for } n < N \\ \hat{s}_n & \text{for } n \geq N. \end{cases} \qquad (2.4)
Now, as before, only one of the factors under the product operator in equation 2.3 is a function of s_{N-1}. Hence, the s_{N-1}-integral can now be evaluated, yielding

p(X|\{A_\tau\}) = \left|A_0^T A_0\right|^{-1} \int \cdots \int p(\hat{s}_N)\, p(\hat{s}_{N-1}) \prod_{t=1}^{N-2} \delta(e_t)\, p(s_t)\, ds_t, \qquad (2.5)
where integration is over all components at all times except s_N and s_{N-1}, and

\hat{s}_t = A_0^{\#}\left(x_t - \sum_{\tau=1}^{L} A_\tau u_{t-\tau}\right), \quad u_n \equiv \begin{cases} s_n & \text{for } n < N-1 \\ \hat{s}_n & \text{for } n \geq N-1. \end{cases} \qquad (2.6)
By induction and assuming s_n is zero for n < 1, we get

p(X|\{A_\tau\}) = \left|A_0^T A_0\right|^{-N/2} \prod_{t=1}^{N} p(\hat{s}_t), \qquad (2.7)
where

\hat{s}_t = A_0^{\#}\left(x_t - \sum_{\tau=1}^{L} A_\tau \hat{s}_{t-\tau}\right). \qquad (2.8)
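In code, the recursion of equation 2.8 and the likelihood of equation 2.7 can be sketched as follows. This is a minimal illustration, not the authors' released implementation: `A` stacks A_0, ..., A_L as an (L+1, D, K) array as in the earlier mixing sketch, and the source density p(s) = sech(s)/π suggested below for EEG is assumed.

```python
import numpy as np

def unmix_iir(A, x):
    """Equation 2.8: s_hat_t = A0^# (x_t - sum_{tau=1..L} A[tau] s_hat_{t-tau})."""
    L_plus_1, D, K = A.shape
    A0_pinv = np.linalg.pinv(A[0])
    N = x.shape[0]
    s_hat = np.zeros((N, K))
    for t in range(N):
        resid = x[t].copy()
        for tau in range(1, min(L_plus_1 - 1, t) + 1):
            resid -= A[tau] @ s_hat[t - tau]
        s_hat[t] = A0_pinv @ resid
    return s_hat

def neg_log_likelihood(A, x):
    """Negative log of equation 2.7 (see also equation 2.10 below)
    under p(s) = sech(s)/pi: log sech(s) = log 2 - log(e^s + e^{-s})."""
    s_hat = unmix_iir(A, x)
    N = x.shape[0]
    sign, logdet = np.linalg.slogdet(A[0].T @ A[0])
    log_p = np.log(2.0) - np.logaddexp(s_hat, -s_hat) - np.log(np.pi)
    return 0.5 * N * logdet - log_p.sum()
```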
Thus, the likelihood is calculated by first unmixing the components using equation 2.8 and then measuring equation 2.7. It is clear that the algorithm reduces to standard Infomax ICA (Bell & Sejnowski, 1995) when the length of the convolutional filters L is set to zero and D = K; in that case, equation 2.7 can be estimated using \hat{s}_t = A_0^{-1} x_t.

2.1 Model Component Declaration Ensures Stable Unmixing. Because of inherent instability concerns, the use of IIR filters for unmixing has often been discouraged (Lee, Bell, & Lambert, 1997). Using FIR unmixing filters could certainly ensure stability but would not solve the fundamental problem of inverting a linear system in cases in which it is not invertible. Invertibility of a linear system is related to the phase characteristic of the system transfer function. A SISO (single input/single output) system is invertible if and only if the complex zeros of its transfer function are all situated within the unit circle. Such a system is characterized as minimum phase. If the system is not minimum phase, only an approximate, regularized inverse can be sought. (See Hansen, 2002, on techniques for regularizing a system with known coefficients.) For MIMO (multiple input/multiple output) systems, the matter is more involved. The stability of equation 2.8, and hence the invertibility of equation 1.1, is related to the eigenvalues \lambda_m of the matrix
\tilde{A} = \begin{pmatrix} -A_0^{\#}A_1 & -A_0^{\#}A_2 & \cdots & -A_0^{\#}A_{L-1} & -A_0^{\#}A_L \\ I & & & & 0 \\ & \ddots & & & \vdots \\ & & & I & 0 \end{pmatrix}. \qquad (2.9)
For K = D, a necessary and sufficient condition is that all eigenvalues \lambda_m of \tilde{A} are situated within the unit circle, |\lambda_m| < 1 (Neumaier & Schneider, 2001). We can generalize the minimum phase concept to MIMO systems if we think of the \lambda_m's as quasi poles of the inverse MIMO transfer function. A SISO system being minimum phase implies that no system with the same frequency response can have a smaller phase shift and system delay.
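In code, this stability test amounts to forming the block companion matrix of equation 2.9 and checking its spectral radius. A minimal sketch (numpy only; `A` is the (L+1, D, K) filter stack used in the earlier sketches):

```python
import numpy as np

def has_stable_inverse(A):
    """Build A-tilde from equation 2.9 and check |lambda_m| < 1."""
    L_plus_1, D, K = A.shape
    L = L_plus_1 - 1
    if L == 0:
        return True  # instantaneous mixing: no recursion to destabilize
    A0_pinv = np.linalg.pinv(A[0])
    top = np.hstack([-A0_pinv @ A[tau] for tau in range(1, L_plus_1)])
    shift = np.eye(K * (L - 1), K * L)  # identity blocks feeding the delay line
    A_tilde = np.vstack([top, shift])
    return float(np.abs(np.linalg.eigvals(A_tilde)).max()) < 1.0
```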
Generalizing that concept to MIMO systems, we can get a sense of what a quasi minimum phase MIMO system must look like. In particular, most energy must occur at the beginning of each filter and less toward the end. However, not all SISO component-to-sensor paths in the MIMO system need be minimum phase for the MIMO system as a whole to be quasi minimum phase.

Certainly, unmixing data using FIR filters is regularized in the sense that their joint impulse response is of finite duration, whereas IIR filter impulse responses may potentially become unstable. Fortunately, the maximum likelihood approach has a built-in regularization that avoids this problem (Dyrholm & Hansen, 2004). This can be seen in the likelihood equation, equation 2.7, by noting that although an unstable IIR filter will lead to a divergent component estimate \hat{s}_t, such large-amplitude signals are exponentially penalized under most reasonable probability density functions (pdf's), for example, for EEG data p(s) = sech(s)/\pi, ensuring that unstable solutions are avoided in the evolved solution. Thus, it may prove safe to use an unconstrained iterative learning scheme to unmix EEG data. Once the unmixing process has been stably initialized (set, e.g., A_0 to a random matrix and the remaining A_\tau to zero), each learning step will produce model refinements that are stable in the sense of equation 2.8. Even if the system in equation 1.1 we are trying to unmix is not invertible, meaning no exact stable inverse exists, the maximum likelihood approach will give a regularized and stable quasi minimum phase solution.

2.2 Gradients and Optimization. The cost function of the algorithm is the negative log likelihood,
L({Aτ }) =
(2.10)
t=1
The gradient of the cost function is presented here in two steps. Step 1 reveals the partial derivatives of the component estimates, and step 2 uses the step 1 results in a chain rule to compute the gradient of the cost function (see also Dyrholm & Hansen, 2004):
Step 1: Partial derivatives of the unmixed component estimates.

\frac{\partial(\hat{s}_t)_k}{\partial(A_0^{\#})_{ij}} = \delta(i-k)\left(x_t - \sum_{\tau=1}^{L} A_\tau \hat{s}_{t-\tau}\right)_j - \left(A_0^{\#} \sum_{\tau=1}^{L} A_\tau \frac{\partial \hat{s}_{t-\tau}}{\partial(A_0^{\#})_{ij}}\right)_k \qquad (2.11)

\frac{\partial(\hat{s}_t)_k}{\partial(A_\tau)_{ij}} = -(A_0^{\#})_{ki}(\hat{s}_{t-\tau})_j - \left(A_0^{\#} \sum_{\tau'=1}^{L} A_{\tau'} \frac{\partial \hat{s}_{t-\tau'}}{\partial(A_\tau)_{ij}}\right)_k, \qquad (2.12)

where (\psi_t)_k = p'((\hat{s}_t)_k)/p((\hat{s}_t)_k).

Step 2: Gradient of the cost function. The gradient of the cost function with respect to A_0^{\#} is given by

\frac{\partial \mathcal{L}(\{A_\tau\})}{\partial(A_0^{\#})_{ij}} = -N\left(A_0^T\right)_{ij} - \sum_{t=1}^{N} \psi_t^T \frac{\partial \hat{s}_t}{\partial(A_0^{\#})_{ij}}, \qquad (2.13)

and the gradient with respect to the other mixing matrices is

\frac{\partial \mathcal{L}(\{A_\tau\})}{\partial(A_\tau)_{ij}} = -\sum_{t=1}^{N} \psi_t^T \frac{\partial \hat{s}_t}{\partial(A_\tau)_{ij}}. \qquad (2.14)
These expressions allow the use of general gradient optimization methods, a stable starting point being A_\tau = 0 (for \tau \neq 0) with arbitrary A_0. In the experiments reported below, we have used a Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm for optimization. (See Cardoso & Pham, 2004, for a relevant discussion and Nielsen, 2000, for a reference to the precise implementation we used.)

We are happy to release the complete source code for the algorithm on request through email correspondence to either of the Denmark-based authors: [email protected] and [email protected]. Supplementary material for the derivations is available online through http://www.imm.dtu.dk/~mad/papers/madrix.pdf.

3 Three Approaches to Overdetermined cICA

Current EEG experiments typically involve simultaneous recording from 30 to 100 or more electrodes, forming a high-dimensional (D) signal. After signal separation, we hope to find a relatively small number (K) of independent components. Hence, we are interested in studying the so-called overdetermined problem (K < D). There are at least three approaches to performing overdetermined cICA:

1. (Rectangular) Perform the decomposition with D > K.
2. (Augmented) Perform the decomposition with K set to D, attempting to estimate some extra components.
3. (Diminished) Perform the decomposition with D equal to K, that is, on a K-dimensional subspace projection of the data.
Figure 1: Synthetic MIMO mixing system. Here, two components were convolutively mixed at three sensors; each panel shows the filter from one source to one sensor as a function of filter lag. The poles of the mixture (as defined in section 2.1) are all situated within the unit circle; hence, an exact and stable inverse exists in the sense of equation 2.8.
We compared the performance of these three approaches experimentally as a function of signal-to-noise ratio (SNR). First, we created a synthetic mixture: two i.i.d. signals s_1(t) and s_2(t) (with 1 \leq t \leq N and N = 30,000) generated from a Laplacian distribution, s_k(t) \sim p(x) = \frac{1}{2}\exp(-|x|), with variance Var\{s_k(t)\} = 2. These signals were mixed using the filters of length L = 30 shown in Figure 1, producing an overdetermined 3D mixture (D = 3, K = 2). A 3D i.i.d. gaussian noise signal n_t was added to the mixture, x_t = \sigma n_t + \sum_{\tau=0}^{L} A_\tau s_{t-\tau}, with a controlled variance \sigma^2. Next, we investigated how well the three analysis approaches estimated the two components by measuring the correlations between each true component innovation, s_k(t), and the best-correlated estimated component innovation, \hat{s}_k(t).

3.1 Approach 1 (Rectangular). Here, all three data channels were decomposed and the two true components estimated. Figure 2 shows how well the components were estimated at different SNR levels. The quality of the estimation dropped dramatically as SNR decreased.
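The scoring used in this comparison can be sketched as follows (numpy only; `convolutive_mix` is the hypothetical helper from the sketch in section 1). Noise is added at a prescribed SNR, and each true innovation is scored by its best-correlated estimate, since the sign and ordering of recovered components are arbitrary:

```python
import numpy as np

def add_noise(x, snr, rng):
    """Scale i.i.d. gaussian noise so that Var(signal)/Var(noise) = snr."""
    sigma = np.sqrt(x.var() / snr)
    return x + sigma * rng.normal(size=x.shape)

def recovery_correlations(s_true, s_est):
    """Absolute correlation between each true innovation and its
    best-matching estimated innovation."""
    K = s_true.shape[1]
    c = np.corrcoef(s_true.T, s_est.T)[:K, K:]
    return np.abs(c).max(axis=1)
```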
Figure 2: Comparison of separation of the system in Figure 1 using the three cICA approaches (Rectangular, Augmented, Diminished); panels plot the correlation coefficient against SNR. (A) Estimates of true component activity: correlations with the best-estimated component. (B) Similar correlations for the less-well-estimated component.
Although our derivation (see section 2) is valid for the overdetermined case (D > K), the validity of the zero-noise assumption proves vital in this case. The explanation for this can be seen in the definitions of the likelihood, equation 2.7, and unmixing filter, equation 2.8. In equation 2.7, any rotation on the columns of A_0 will not influence the determinant term of the likelihood. From equation 2.8, we note that the estimated component vectors \hat{s}_t are found by linear mapping through A_0^{\#}: \mathbb{R}^D \to \mathbb{R}^K. Hence, the prior term in equation 2.7 alone will be responsible for determining a rotation of A_0 that hides as much variance as possible in the null space (\mathbb{R}^{D-K}) of A_0^{\#} in equation 2.8. In an unconstrained optimization scheme, this side effect will be untamed and consequently will hide variance in the null space of A_0^{\#} and achieve an artificially high likelihood while relaxing the effort to make the components independent.

3.2 Approach 2 (Augmented). One solution to the problem with the Rectangular approach above could be to parameterize the null space of A_0^{\#} or, equivalently, the orthogonal complement space of A_0. This can be seen as a special case of the algorithm in which A_0 is D-by-D and A_\tau is D-by-K. With the D - K additional columns of A_0 denoted by B, the model can be written
x_t = B v_t + \sum_{\tau=0}^{L} A_\tau s_{t-\tau}, \qquad (3.1)
where v_t and B constitute a low-rank approximation to the noise. Hence, we declare a gaussian prior pdf on v_t. Note that equation 3.1 is a special case of the convolutive model, equation 1.1. In this case, we attempt to estimate the third (noise) component in addition to the two convolutive components. Figure 2 shows how well the components are estimated using this approach for different SNR levels. For the best-estimated component (see Figure 2A), the Augmented approach gave better estimates than the Rectangular or Diminished approaches. This was also the case for the second component (see Figure 2B) at low SNR, but not at high SNR, since in this case the true B was near zero and became improbable under the likelihood model.

3.3 Approach 3 (Diminished). Finally, we investigated the possibility of extracting the two components from a two-dimensional projection of the data. For example, in the EEG experiment later in this letter, we propose projecting the data onto a subspace defined by K instantaneous ICA components. In other situations, PCA might be a more principled approach to dimensionality reduction. However, for the current synthetic experiment, we picked a somewhat arbitrary projection by excluding the third sensor from the decomposition. Figure 2 shows that even in the presence of considerable noise, the separation achieved was not as good as in the Augmented approach. However, the Diminished approach used the lowest number of parameters and hence had the lowest computational complexity. Furthermore, it lacked the peculiarities of the Augmented approach at high SNR. Finally, we note that once the Diminished model has been learned, an estimate of the Rectangular model can be obtained by solving

\left\langle x_t s_{t-\lambda}^T \right\rangle = \sum_{\tau} A_\tau \left\langle s_{t-\tau} s_{t-\lambda}^T \right\rangle \qquad (3.2)

for A_\tau by regular matrix inversion, using the estimated components and \langle \cdot \rangle = \frac{1}{N}\sum_{t=1}^{N} (a numerical sketch of this step follows the summary below).

3.4 Summary of the Three Approaches. In the presence of considerable noise, the best separation was obtained by augmenting the model and extracting, from the D-dimensional mixture, K components as well as a (rank D - K) approximation of the noise. However, the Diminished approach had the advantage of lower computational complexity, while the separation it achieved was close to that of the Augmented approach. At very high SNR, the Diminished approach was even slightly better than the Augmented approach. The Rectangular approach had difficulties and should not be considered for use in practice, as the presence of some channel noise may be assumed.
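The estimation step of equation 3.2 can be read as a linear regression of the observations on lagged component innovations; solving the lagged-correlation system and the corresponding least-squares problem are equivalent at the optimum. A minimal sketch under that reading (numpy only; names are hypothetical):

```python
import numpy as np

def rectangular_from_diminished(x, s_hat, L):
    """Recover A_0..A_L from equation 3.2 by regressing x_t on the
    stacked lagged innovations [s_t; s_{t-1}; ...; s_{t-L}].
    x: (N, D); s_hat: (N, K); returns A of shape (L+1, D, K)."""
    N, K = s_hat.shape
    D = x.shape[1]
    lagged = np.zeros((N, K * (L + 1)))
    for tau in range(L + 1):
        lagged[tau:, tau * K:(tau + 1) * K] = s_hat[:N - tau]
    coef, *_ = np.linalg.lstsq(lagged, x, rcond=None)   # (K*(L+1), D)
    return coef.T.reshape(D, L + 1, K).transpose(1, 0, 2)
```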
4 Detecting a Convolutive Mixture

Model selection is a fundamental issue of interest; in particular, detecting the order of L can tell us whether the convolutive mixing model is a better model than the simpler instantaneous mixing model of standard ICA methods. In the framework of Bayesian model selection, models that are too complex are penalized by the Occam factor and will therefore be chosen only if there is a need for their complexity. However, this compelling feature can be disrupted if fundamental assumptions are violated. One such assumption was involved in our derivation of the likelihood, in which we assumed that the components are i.i.d., that is, not autocorrelated. The problem with this assumption is that the likelihood will favor models based not only on achieved independence but on component whiteness as well. A model selection scheme for L that does not take the component autocorrelations into account will therefore be biased upward, because models with a larger value of L can absorb more component autocorrelation than models with lower L values. To address this problem, we introduce a model for each of the sources,

s_k(t) = \sum_{\lambda=0}^{M} h_k(\lambda) z_k(t - \lambda), \qquad (4.1)
where z_k(t) represents an i.i.d. signal, a whitened version of the component signal. Introducing the K component filters of order M allows us to reduce the value of L, that is, lowering the number of parameters in the model while achieving uniformly better learning for limited data (Dyrholm, Makeig, & Hansen, 2006). We note that some authors of FIR unmixing methods have also used component models (e.g., Pearlmutter & Parra, 1997; Parra et al., 1997; Attias & Schreiner, 1998).

4.1 Learning Component Autocorrelation. The negative log likelihood for the model combining equations 1.1 and 4.1 is given by

\mathcal{L} = N \log|\det A_0| + N \sum_k \log|h_k(0)| - \sum_{t=1}^{N} \log p(\hat{z}_t), \qquad (4.2)
where \hat{z}_t is a vector of whitened component signal estimates at time t obtained using an operator that represents the inverse of equation 4.1, and we assume A_0 to be square, as in the Diminished and Augmented approaches above. We can, without loss of generality, set h_k(0) = 1; then

\mathcal{L} = N \log|\det A_0| - \sum_{t=1}^{N} \log p(\hat{z}_t). \qquad (4.3)
For notational convenience, we introduce the following matrix notation instead of equation 4.1, bundling all components in one matrix equation,

s_t = \sum_{\lambda=0}^{M} H_\lambda z_{t-\lambda}, \qquad (4.4)
where the Hλ ’s are diagonal matrices defined by (Hλ )ii = h i (λ). To derive an algorithm for learning the component autocorrelations in addition to the mixing model, we modify the equations found in section 2.2 by inserting a third Component model step (see below) between the two steps found there, that is, substituting zˆ t for sˆ t in step 2. 4.1.1 Component Model Step. is given by zˆ t = sˆ t −
M
The inverse component coloring operator
Hλ zˆ t−λ ,
(4.5)
λ=1
and the partial derivatives, which we shall use in step 2, are given by

\frac{\partial(\hat{z}_t)_k}{\partial(A_0^{-1})_{ij}} = \frac{\partial(\hat{s}_t)_k}{\partial(A_0^{-1})_{ij}} - \sum_{\lambda=1}^{M} \left(H_\lambda \frac{\partial \hat{z}_{t-\lambda}}{\partial(A_0^{-1})_{ij}}\right)_k \qquad (4.6)

\frac{\partial(\hat{z}_t)_k}{\partial(A_\tau)_{ij}} = \frac{\partial(\hat{s}_t)_k}{\partial(A_\tau)_{ij}} - \sum_{\lambda=1}^{M} \left(H_\lambda \frac{\partial \hat{z}_{t-\lambda}}{\partial(A_\tau)_{ij}}\right)_k \qquad (4.7)

\frac{\partial(\hat{z}_t)_k}{\partial(H_\lambda)_{ii}} = -\delta(k-i)(\hat{z}_{t-\lambda})_i - \sum_{\lambda'=1}^{M} \left(H_{\lambda'} \frac{\partial \hat{z}_{t-\lambda'}}{\partial(H_\lambda)_{ii}}\right)_k. \qquad (4.8)
4.1.2 Step 2 Modified: Gradient of the Cost Function. The gradient of the cost function with respect to A_0^{-1} with the component model invoked is given by

\frac{\partial \mathcal{L}(\{A_\tau\})}{\partial(A_0^{-1})_{ij}} = -N\left(A_0^T\right)_{ij} - \sum_{t=1}^{N} \psi_t^T \frac{\partial \hat{z}_t}{\partial(A_0^{-1})_{ij}}, \qquad (4.9)

and the gradient with respect to the other mixing matrices is

\frac{\partial \mathcal{L}(\{A_\tau\})}{\partial(A_\tau)_{ij}} = -\sum_{t=1}^{N} \psi_t^T \frac{\partial \hat{z}_t}{\partial(A_\tau)_{ij}}. \qquad (4.10)
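The component model step itself is a short per-component recursion. A minimal sketch of the inverse coloring operator of equation 4.5 (numpy only; `h` is a hypothetical (M, K) array holding the taps h_k(1), ..., h_k(M), with h_k(0) fixed to 1 as above):

```python
import numpy as np

def whiten_components(s_hat, h):
    """Equation 4.5: z_hat_t = s_hat_t - sum_{lam=1..M} H_lam z_hat_{t-lam}.
    Because the H_lam are diagonal, the recursion runs independently
    per component; s_hat: (N, K), h: (M, K)."""
    N, K = s_hat.shape
    M = h.shape[0]
    z_hat = np.zeros_like(s_hat)
    for t in range(N):
        z_hat[t] = s_hat[t]
        for lam in range(1, min(M, t) + 1):
            z_hat[t] -= h[lam - 1] * z_hat[t - lam]
    return z_hat
```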
Figure 3: The filters used to produce autocorrelated components (M = 15); each panel shows one source's filter as a function of filter lag.
4.2 Protocol for Detecting L. We propose a simple protocol for determining the dimensions (L, M) of the convolutional and component filters. First, expand the convolution without an autofilter (M = 0). This will model the total temporal dependency structure of the system, L_max. The optimal dimension is found by monitoring the Bayes information criterion (BIC) (Schwarz, 1978),

\log p(\mathcal{M}|X) \approx \log p(X|\theta_0, \mathcal{M}) - \frac{\dim \theta}{2} \log N, \qquad (4.11)

where \mathcal{M} represents a specific choice of model structure (L, M), \theta represents the parameters in the model, \theta_0 are the maximum likelihood parameters, and N is the size of the data set (number of samples). Next, keep the temporal dependency constant, (L + M) = L_max, while expanding the length of the component autofilters M, again monitoring the BIC to determine the optimal choice of L = L_max - M.

4.3 Example: Correctly Rejecting cICA of an Instantaneous Mixture. We now illustrate the importance of the component model and the validity of the protocol for detecting L when dealing with the following fundamental question: Do we learn anything by using convolutive ICA instead of instantaneous ICA? Or, put another way, should L be larger than zero? To produce an instantaneous mixture, we generate two random signals from a Laplace distribution, filter them through the filters of order 15 shown in Figure 3, and mix the two filtered components using an arbitrary square mixing matrix. Figure 4A shows the result of using Bayesian model selection for this mixture without allowing for a filter (M = 0). This corresponds to model selection in a conventional convolutive model. Since the signals are nonwhite, L is detected, and the model BIC simply increases as a function of L up to the maximum, here stopped at L = 15. Next (see Figure 4B), we fix L + M = 15. Models with a larger L have at least the same capability as models with lower L, though models with lower L are preferable because they have fewer parameters.
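The two-stage protocol of section 4.2 reduces to a pair of BIC sweeps. In the sketch below, `fit_neg_log_lik(x, L, M)` is a hypothetical stand-in for fitting the model of equations 1.1 and 4.1 by maximum likelihood and returning the optimized negative log likelihood together with the parameter count; equation 4.11 then scores each candidate (L, M):

```python
import numpy as np

def bic(neg_log_lik, n_params, n_samples):
    """Equation 4.11: log p(M|X) ~ log p(X|theta_0, M) - (dim theta / 2) log N."""
    return -neg_log_lik - 0.5 * n_params * np.log(n_samples)

def select_orders(x, L_cap, fit_neg_log_lik):
    """Stage 1: sweep L with M = 0 and take the BIC-optimal order as
    L_max (the total temporal dependency). Stage 2: hold L + M = L_max
    and sweep M, returning the final (L, M)."""
    N = x.shape[0]
    stage1 = [(bic(*fit_neg_log_lik(x, L, 0), N), L) for L in range(L_cap + 1)]
    L_max = max(stage1)[1]
    stage2 = [(bic(*fit_neg_log_lik(x, L_max - M, M), N), M)
              for M in range(L_max + 1)]
    M_best = max(stage2)[1]
    return L_max - M_best, M_best
```

Taking the argmax in stage 1 is a simplification: as the example in section 4.3 shows, without the autofilter the BIC can keep increasing up to the sweep cap, so in practice L_max is the largest order swept.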
Figure 4: Bayesian model selection for the instantaneous mixture; panels plot the Bayes information criterion (in units of 1000) against the number of lags L. (A) The result of using Bayesian model selection without allowing for an autofilter (M = 0). Since the signals are nonwhite, the validity of L is unquestioned even at 15 lags (L = 15). (B) We fix L + M = 15 and now get the correct answer: the BIC is largest for L = 0, meaning there is no evidence of convolutive mixing.
By adding the component model, we get the correct answer in this case: these data contain no evidence of convolutive mixing.

5 Deconvolving an EEG ICA Subspace

We now show by example how cICA can be used to separate the delayed influences of statically defined ICA components on each other, thereby achieving a larger degree of independence in the convolutive component time courses. The procedure described here can be seen as a Diminished approach in which we extract K convolutive components from the D-dimensional data by deconvolving a K-dimensional subspace projection of the data. In Dyrholm, Hansen, Wang, Arendt-Nielsen, and Chen (2004) we used a subspace from principal component analysis (PCA), but as our experiment will show, using ICA for that projection has the benefit that the subspace can be chosen, for example, for physiological interest.

As a first test of this approach, we applied convolutive decomposition to 20 minutes of a 71-channel human EEG recording (20 epochs of 1 minute duration), downsampled for numeric convenience to a 50 Hz sampling rate after filtering between 1 and 25 Hz with phase-indifferent FIR filters. First, the recorded (channels-by-times) data matrix X was decomposed using extended Infomax ICA (Bell & Sejnowski, 1995; Makeig et al., 1996; Jung et al., 1998; Lee, Girolami, & Sejnowski, 1999; Jung et al., 2001) into 71 maximally independent components whose ("activation") time series were contained
in the (components-by-times) matrix S_ICA and whose ("scalp map") projections to the sensors were specified in the (channels-by-components) mixing matrix A_ICA, assuming instantaneous linear mixing X = A_ICA S_ICA. Five of the resulting independent components (ICs) were selected for further analysis on the basis of event-related coherence results that showed a transient partial collapse of component independence following the subject button presses (Makeig, Delorme, et al., 2004). More specifically, all five components displayed amplitude activity that was time-locked to the subject button presses, which was evident from looking at ERP images of stimulus-aligned epochs sorted by button-press response time (Makeig et al., 2002). The scalp maps of the five components, that is, the relevant five columns of A_ICA, are shown on the left margin of Figure 7. Next, cICA decomposition was applied to the five component activation time series (the relevant five rows of S_ICA), assuming the model

$$ s_t^{\mathrm{ICA}} = \sum_{\tau=0}^{L} A_\tau\, s_{t-\tau}^{\mathrm{cICA}}. \tag{5.1} $$
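For concreteness, here is a small NumPy sketch of the forward model in equation 5.1 (our own illustration, with hypothetical variable names): given the L + 1 learned K × K kernels and the CC innovation time series, it synthesizes the IC activation time series.

```python
import numpy as np

def apply_convolutive_model(A, s_cc):
    """Equation 5.1: s_ica[:, t] = sum_tau A[tau] @ s_cc[:, t - tau].

    A    : array of shape (L + 1, K, K), the convolutive mixing kernels
    s_cc : array of shape (K, T), the convolutive component innovations
    """
    L1, K, _ = A.shape
    T = s_cc.shape[1]
    s_ica = np.zeros((K, T))
    for tau in range(L1):
        # Delayed contribution of the innovations, zero-padded at the start.
        s_ica[:, tau:] += A[tau] @ s_cc[:, :T - tau]
    return s_ica
```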
As a qualified guess of the order L, we applied the approach to estimating L outlined in section 4.2 above to the EEG subspace data. First, we increased the order of the convolutive model L (keeping M = 0) while monitoring the BIC. To produce error bars, we used jackknife resampling (Efron & Tibshirani, 1993). For each value of L, 20 runs of the algorithm were performed, one for each jackknifed epoch; thus, the data in each run consisted of the 19 remaining epochs. Figure 5A shows the mean jackknifed BIC. Clearly, without an autofilter included, the optimal order was at least L_max = 40, since some correlations in the data extended to at least 800 ms. Next, we swept the range of possible component model filters M, keeping L + M = 40. Figure 5B shows that L = 10, corresponding to a filter length of 200 ms, proved optimal.

Figure 6 shows the 5 × 5 matrix of learned convolutive kernels. Before plotting, we arranged the order of the five output CCs so that the diagonal (CC_i → IC_i) kernels, shown in one-third scale in Figure 6, were dominant. Figure 7 shows the resulting percentage of variance of the contributions from each of the CC innovations to each of the IC activations. As the large diagonal contributions in Figure 7 show, each convolutive CC_j dominated one spatially static IC (IC_j). However, there were clearly significant off-diagonal contributions as well, indicating that spatiotemporal relationships between the static ICA components were captured by the cICA model. To explore the robustness of this result further, we tested for the presence of delayed correlations, first between the static IC activations (s_k^ICA(t)) and then between the learned CC innovations (s_k^cICA(t)).
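The leave-one-epoch-out jackknife used above for the error bars of Figure 5 can be sketched as follows (our own illustration); `fit_and_bic` is a hypothetical stand-in for fitting the cICA model on the concatenated training epochs and returning its BIC.

```python
import numpy as np

def jackknife_bic(epochs, L, M, fit_and_bic):
    """Leave-one-epoch-out BIC (Efron & Tibshirani, 1993): one fit per held-out epoch."""
    vals = []
    for i in range(len(epochs)):
        train = epochs[:i] + epochs[i + 1:]  # all remaining epochs
        vals.append(fit_and_bic(np.concatenate(train, axis=1), L, M))
    vals = np.asarray(vals)
    n = len(vals)
    # Jackknife standard error of the mean BIC.
    se = np.sqrt((n - 1) / n * np.sum((vals - vals.mean()) ** 2))
    return vals.mean(), se
```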
Figure 5: Using the protocol for detecting the order of L for EEG. (A) There are correlations over at least 40 lags in the data. This corresponds to 800 ms. (B) By introducing the component model, it turns out that L should be only on the order of 10, corresponding to 200 ms.
Figure 6: Kernels of the five derived convolutive ICA components (CCs), arranged (in columns) in order of their respective contributions to the five static ICA components (ICs) (rows). Each CC made a dominant contribution to one IC; these were ordered so as to appear on the diagonal. Scaling of the diagonal kernels is one-third that of the off-diagonal kernels.
Figure 7: Percentage variance of five static ICA components (ICs) accounted for by the five derived convolutive components (CCs). The IC scalp maps on the left are shown for interest. Contributions arranged on the diagonal are dominant. Squares represent the (rounded) percentage variance of the IC activation time series accounted for by each CC. Significant off-diagonal elements indicate the presence of significant delayed spatiotemporal interactions between the static IC activations.
Figure 8: Predictability of the most predictable ICA component activation (IC3) and cICA component innovation (CC5) from the most predictive other IC and CC component, respectively (IC1, CC3).
Figure 8 shows, for the most predictable IC and CC, the percentage of their time course variances that was accounted for by linear prediction from the past history (of order r) of the largest contributing remaining ICs or CCs, respectively. As expected from the cICA results, as the prediction order r increased, the predictability of the static ICA component activation also increased. For the ICA component activation, 9% of the variance could be explained by linear prediction from the previous 10 time points (200 ms) of another ICA component. The static ICA component time courses were nearly independent only in the sense of zero-order prediction (r = 0), as expected from their derivation. Their lack of independence at other lags is compatible with the cICA results. For the CC innovation, however, the predictability in Figure 8 remained low as r increased, indicating that cICA in fact deconvolved delayed correlations present in the EEG subspace data.

Figure 9 shows the power spectral densities of each of the IC activations (bold traces) along with the two CCs (thin traces) that contributed the most to the respective IC (cf. Figure 7). Note that the broad alpha-band spectral peak around 10 Hz in IC1 (the uppermost panel in Figure 9) has been split between CC1 and CC3. In the middle panel, note the distinct spectral contributions of CC1 and CC3 to the double alpha peak in the IC3 spectrum. As expected, the CCs made different spectral contributions to the IC time courses. For example, CC1 made different power spectral density contributions to IC1, IC3, and IC4.
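The predictability measure plotted in Figure 8 reduces to ordinary least squares. A sketch under our own naming conventions, assuming roughly zero-mean component time series:

```python
import numpy as np

def predictability(target, predictor, r):
    """Percent of var(target[t]) explained by linear prediction from
    predictor[t], predictor[t-1], ..., predictor[t-r] (r = 0: instantaneous)."""
    T = len(target)
    # Design matrix of current and past predictor samples (zero-mean assumed).
    X = np.column_stack([predictor[r - k: T - k] for k in range(r + 1)])
    y = target[r:]
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return 100.0 * (1.0 - resid.var() / y.var())
```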
6 Discussion

In general, the usefulness of any blind decomposition method applied to biological time-series data most likely depends on the fit between the assumptions of the algorithm and the underlying physiology and biophysics. Therefore, it is important to consider the physiological basis of the delayed interactions between statically defined independent component time courses we observed here, and the possible physiological significance of the derived convolutive component filters and time courses.

Applied to these EEG data, static ICA gave 15 to 20 components that were of physiological interest according to their spatial projections or activation time series, although we were not able to practically deconvolve more than five components here because of numerical complexity. Open questions therefore are how to identify independent component subspaces of interest for cICA decomposition and how to explore the efficiency of performing cICA on larger computer clusters. In the future, convolutive ICA might also be applied usefully to other types of biomedical time-series data that involve stereotyped source movements, which present problems for static ICA decomposition. These might include electrocardiographic (ECG) and brain hemodynamic measures such as diffusion tensor imaging (DTI) (Anemüller et al., 2004).
Figure 9: Power spectrum of the most powerful CC contributions to the five ICs.
References

Anemüller, J., Duann, J.-R., Sejnowski, T. J., & Makeig, S. (2004). Unraveling spatiotemporal dynamics in fMRI recordings using complex ICA. In C. G. Puntonet & A. Prieto (Eds.), Independent component analysis and blind signal separation (pp. 1103–1110). Berlin: Springer-Verlag.
Anemüller, J., & Kollmeier, B. (2003). Adaptive separation of acoustic sources for anechoic conditions: A constrained frequency domain approach. Speech Communication, 39(1–2), 79–95.
Anemüller, J., Sejnowski, T., & Makeig, S. (2003). Complex independent component analysis of frequency-domain EEG data. Neural Networks, 16, 1313–1325.
Attias, H., & Schreiner, C. E. (1998). Blind source separation and deconvolution: The dynamic component analysis algorithm. Neural Computation, 10(6), 1373–1424.
Baumann, W., Kohler, B.-U., Kolossa, D., & Orglmeister, R. (2001). Real time separation of convolutive mixtures. In T.-W. Lee, T.-P. Jung, S. Makeig, & T. J. Sejnowski (Eds.), 3rd International Conference on Independent Component Analysis and Blind Signal Separation (pp. 65–69). San Diego, CA.
Bell, A., & Sejnowski, T. J. (1995). An information maximization approach to blind separation and blind deconvolution. Neural Computation, 7, 1129–1159.
Belouchrani, A., Meraim, K. A., Cardoso, J.-F., & Moulines, E. (1997). A blind source separation technique based on second order statistics. IEEE Transactions on Signal Processing, 42, 434–444.
Cardoso, J.-F., & Pham, D.-T. (2004). Optimization issues in noisy gaussian ICA. In C. G. Puntonet & A. Prieto (Eds.), Independent component analysis and blind signal separation (pp. 41–48). Berlin: Springer-Verlag.
Choi, S., & Cichocki, A. (1997). Blind signal deconvolution by spatio-temporal decorrelation and demixing. In J. Principe, L. Gile, N. Morgan, & E. Wilson (Eds.), Neural networks for signal processing (pp. 426–435). Amelia Island, FL.
Choi, S., Amari, S.-I., Cichocki, A., & Liu, R.-W. (1999). Natural gradient learning with a nonholonomic constraint for blind deconvolution of multiple channels. In Independent Component Analysis and Blind Signal Separation (pp. 371–376). Aussois, France.
Comon, P., Moreau, E., & Rota, L. (2001). Blind separation of convolutive mixtures: A contrast based joint diagonalization approach. In T.-W. Lee, T.-P. Jung, S. Makeig, & T. J. Sejnowski (Eds.), Independent Component Analysis and Blind Source Separation (pp. 686–691). San Diego, CA.
Deligne, S., & Gopinath, R. (2002). An EM algorithm for convolutive independent component analysis. Neurocomputing, 49, 187–211.
Delorme, A., & Makeig, S. (2004). EEGLAB: An open source toolbox for analysis of single-trial EEG dynamics. Journal of Neuroscience Methods, 134, 9–21.
Delorme, A., Makeig, S., & Sejnowski, T. J. (2002). From single-trial EEG to brain area dynamics. Neurocomputing, 44–46, 1057–1064.
Douglas, S. C., Cichocki, A., & Amari, S. (1999). Self-whitening algorithms for adaptive equalization and deconvolution. IEEE Transactions on Signal Processing, 47, 1161–1165.
Dyrholm, M., & Hansen, L. K. (2004). CICAAR: Convolutive ICA with an autoregressive inverse model. In C. G. Puntonet & A. Prieto (Eds.), Independent Component Analysis and Blind Signal Separation (pp. 594–601). Berlin: Springer-Verlag.
Dyrholm, M., Hansen, L. K., Wang, L., Arendt-Nielsen, L., & Chen, A. C. (2004). Convolutive ICA (c-ICA) captures complex spatio-temporal EEG activity. http://www.imm.dtu.dk/~mad/papers/cICA eeg hbm2004.pdf.
Dyrholm, M., Makeig, S., & Hansen, L. K. (2006). Model structure selection in convolutive mixtures. In J. Rosca, D. Erdogmus, J. C. Principe, & S. Haykin (Eds.),
Independent Component Analysis and Blind Signal Separation (pp. 74–81). Berlin: Springer-Verlag.
Efron, B., & Tibshirani, R. (1993). An introduction to the bootstrap. London: Chapman & Hall.
Hansen, P. C. (2002). Deconvolution and regularization with Toeplitz matrices. Numerical Algorithms, 29, 323–378.
Jung, T.-P., Humphries, C., Lee, T.-W., Makeig, S., McKeown, M. J., Iragui, V., & Sejnowski, T. J. (1998). Extended ICA removes artifacts from electroencephalographic recordings. In M. I. Jordan, M. J. Kearns, & S. A. Solla (Eds.), Advances in neural information processing systems, 10. Cambridge, MA: MIT Press.
Jung, T.-P., Makeig, S., Humphries, C., Lee, T.-W., McKeown, M. J., Iragui, V., & Sejnowski, T. J. (2000). Removing electroencephalographic artifacts by blind source separation. Psychophysiology, 37, 163–178.
Jung, T.-P., Makeig, S., McKeown, M. J., Bell, A., Lee, T.-W., & Sejnowski, T. J. (2001). Imaging brain dynamics using independent component analysis. Proceedings of the IEEE, 89(7), 1107–1122.
Lee, T.-W., Bell, A. J., & Lambert, R. H. (1997). Blind separation of delayed and convolved sources. In M. C. Mozer, M. I. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9 (pp. 758–764). Cambridge, MA: MIT Press.
Lee, T.-W., Bell, A. J., & Orglmeister, R. (1997). Blind source separation of real world signals. In International Conference on Neural Networks (pp. 2129–2135). Houston, TX.
Lee, T.-W., Girolami, M., & Sejnowski, T. J. (1999). Independent component analysis using an extended infomax algorithm for mixed sub-gaussian and super-gaussian sources. Neural Computation, 11(2), 417–441.
Makeig, S., Bell, A. J., Jung, T.-P., & Sejnowski, T. J. (1996). Independent component analysis of electroencephalographic data. In D. Touretzky, M. Mozer, & M. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 145–151). Cambridge, MA: MIT Press.
Makeig, S., Debener, S., Onton, J., & Delorme, A. (2004). Mining event-related brain dynamics. Trends in Cognitive Science, 8(5), 204–210.
Makeig, S., Delorme, A., Westerfield, M., Townsend, J., Courchesne, E., & Sejnowski, T. (2004). Electroencephalographic brain dynamics following visual targets requiring manual responses. PLoS Biology, 2, e176.
Makeig, S., Enghoff, S., Jung, T.-P., & Sejnowski, T. J. (2000). A natural basis for efficient brain-actuated control. IEEE Trans. Rehab. Eng., 8, 208–211.
Makeig, S., Westerfield, M., Jung, T.-P., Enghoff, S., Townsend, J., Courchesne, E., & Sejnowski, T. J. (2002). Dynamic brain sources of visual evoked responses. Science, 295(5555), 690–694.
Mitianoudis, N., & Davies, M. (2003). Audio source separation of convolutive mixtures. IEEE Transactions on Speech and Audio Processing, 11(5), 489–497.
Moulines, E., Cardoso, J.-F., & Gassiat, E. (1997). Maximum likelihood for blind separation and deconvolution of noisy signals using mixture models. In International Conference on Acoustics, Speech, and Signal Processing (pp. 3617–3620). Los Alamitos, CA: IEEE Computer Society Press.
Neumaier, A., & Schneider, T. (2001). Estimation of parameters and eigenmodes of multivariate autoregressive models. ACM Transactions on Mathematical Software, 27(1), 27–57.
Nielsen, H. B. (2000). UCMINF—an algorithm for unconstrained, nonlinear optimization (Tech. Rep. IMM-REP-2000-19). Lyngby: Department of Mathematical Modeling, Technical University of Denmark.
Onton, J., Delorme, A., & Makeig, S. (2005). Frontal midline EEG dynamics during working memory. NeuroImage, 27, 342–356.
Parra, L., & Spence, C. (2000). Convolutive blind source separation of nonstationary sources. IEEE Transactions on Speech and Audio Processing, 8, 320–327.
Parra, L., Spence, C., & Vries, B. (1997). Convolutive source separation and signal modeling with ML. In International Symposium on Intelligent Systems, Reggio Calabria, Italy.
Parra, L., Spence, C., & Vries, B. D. (1998). Convolutive blind source separation based on multiple decorrelation. In Neural Networks for Signal Processing Proceedings (pp. 23–32). Cambridge, UK.
Pearlmutter, B. A., & Parra, L. C. (1997). Maximum likelihood blind source separation: A context-sensitive generalization of ICA. In M. C. Mozer, M. I. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9 (pp. 613–619). Cambridge, MA: MIT Press.
Rahbar, K., & Reilly, J. (2001). Blind source separation of convolved sources by joint approximate diagonalization of cross-spectral density matrices. In International Conference on Acoustics, Speech, and Signal Processing (pp. 2745–2748). Los Alamitos, CA: IEEE Computer Society Press.
Rahbar, K., Reilly, J. P., & Manton, J. H. (2002). A frequency domain approach to blind identification of MIMO FIR systems driven by quasi-stationary signals. In International Conference on Acoustics, Speech, and Signal Processing (pp. 1717–1720). Los Alamitos, CA: IEEE Computer Society Press.
Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461–464.
Sun, X., & Douglas, S. (2001). A natural gradient convolutive blind source separation algorithm for speech mixtures. In T.-W. Lee, T.-P. Jung, S. Makeig, & T. J. Sejnowski (Eds.), Independent Component Analysis and Blind Source Separation (pp. 59–64). San Diego, CA.
Torkkola, K. (1996). Blind separation of convolved sources based on information maximization. In Neural Networks for Signal Processing Proceedings (pp. 423–432). Kyoto, Japan.
Received August 2, 2005; accepted July 26, 2006.
LETTER
Communicated by Daniel Amit
Information and Topology in Attractor Neural Networks

D. Dominguez
[email protected]
K. Koroutchev
[email protected]

E. Serrano
[email protected]

F. B. Rodríguez
[email protected]
EPS, Universidad Autónoma de Madrid, 28049 Madrid, Spain
A wide range of networks, including those with small-world topology, can be modeled by the connectivity ratio and randomness of the links. Both learning and attractor abilities of a neural network can be measured by the mutual information (MI) as a function of the load and the overlap between patterns and retrieval states. In this letter, we use MI to search for the optimal topology with regard to the storage and attractor properties of the network in an Amari-Hopfield model. We find that while optimal storage implies an extremely diluted topology, a large basin of attraction leads to moderate levels of connectivity. This optimal topology is related to the clustering and path length of the network. We also build a diagram for the dynamical phases with random or local initial overlap and show that very diluted networks lose their attractor ability.

1 Introduction

Interest in attractor Amari-Hopfield neural networks, originally dealing with fully connected architectures (Amari, 1972; Hopfield, 1982), has been renewed with the study of more realistic topologies (Masuda & Aihara, 2004; Torres, Muñoz, Marro, & Garrido, 2004). Among them, relatively simple small-world (SW) graphs (Watts & Strogatz, 1998) can capture most features of a wide range of realistic social, artificial, or biological networks (Albert & Barabasi, 2002; Rolls & Treves, 2004). The SW connectivity properties interpolate between a regular and a completely random network. SW graphs are modeled by only two parameters: the average ratio of links per node, γ, and the ratio of long-range random links, ω.

The Amari-Hopfield network, as an information processing system, stores information through specific configurations of the network activity (i.e., patterns that can be recovered in response to stimuli). Of interest is the load α, the ratio of patterns per link that the network stores.

Neural Computation 19, 956–973 (2007)
© 2007 Massachusetts Institute of Technology
Figure 1: The overlap m (top panels) and the information rate i (bottom panels) versus α for fully connected (γ_FC = 1.0, left) and random extremely diluted (γ_ED = 10^{-4}, right) networks. Theory (solid lines) and simulation (circles).
The most common measure of the retrieval ability of these networks is the overlap, m, between neuron states and the memorized patterns (Amit, Gutfreund, & Sompolinsky, 1985). This measure is plotted as a function of the load α in the top panels of Figure 1 for the classical Amari-Hopfield fully connected (FC, γ = 1) and random extremely diluted (RED, γ ≪ 1, ω = 1) networks. The left panel of Figure 1 shows that the FC network has a critical capacity at α_c^FC ∼ 0.145: the overlap goes sharply to 0 for loads larger than α_c^FC, defining a well-pronounced phase transition with a small amount of error in the retrieval of patterns, m_c ∼ 0.97. This result was discussed long ago by Amit et al. (1985) and Crisanti, Amit, and Gutfreund (1986). For random extremely diluted networks, however, the transition of m to zero occurs at α_c^RED ∼ 0.64 (Canning & Gardner, 1988), but it is a smooth transition: m_c = 0 (see the right panel of Figure 1). Therefore, the characterization of the system exclusively in terms of m and α_c poses some problems.

Less attention has been paid to another parameter of the network: the mutual information between the stored patterns and the neural states, or a quantity derived from it, the information ratio (Okada, 1996; Dominguez & Bolle, 1998). The bottom panels of Figure 1 display the information ratio i(α), evaluated from the conditional probability of the neuron states given the patterns. When
the network approaches its saturation limit α_c, the neuron states cannot remain close to the patterns, because the overlap is usually small. Thus, while the number of patterns increases, the information per stored pattern decreases. Therefore, the information i(α) for all topologies (except the FC) is a nonmonotonic function of the load, which reaches its maximum i_max ≡ i(α_max) at some value α_max ≤ α_c of the load. For the FC case, α_max and α_c coincide, and the critical information is about i_c^FC ∼ 0.13, as can be seen in the left panel of Figure 1. However, for other long-range topologies, such as RED, this does not hold (see the right panel of Figure 1). The RED network has null information at α_c. If one looks for the maximum, one finds i_max^RED ∼ 0.22 for α_max^RED ∼ 0.32. In conclusion, the parameters i_max, α_max are well defined and are more appropriate than m_c, α_c to evaluate the Hopfield network's performance on various topologies.

In this letter, we address the problem of searching for the topology that maximizes the information capacity of the network. Using the graph framework, we build networks with a connectivity ratio ranging from fully connected (γ = 1) to extremely diluted (γ → 0) and randomness ranging from purely local (ω = 0) to totally random (ω = 1) connectivity. We show that i increases (decreases) with γ for local (random) networks and remains about the same for SW topologies. A question arises about the optimal topology: if the randomness ω is fixed (by physical constraints), which is the best connectivity γ? To our knowledge, no one has previously answered this question. We approach this problem from two scenarios: the stability of the memorized patterns (related to the number of errors induced by crosstalk) and the domain of the attractors (related to the network's ability to associate a corrupted input pattern with a stored pattern).

The structure of the letter is as follows. In section 2, we define the neural dynamics model and review the information measures used in the calculations. The results are shown in section 3, where we study the stationary states by simulation and theory. In section 4, the attractors' sensitivity to different initial conditions is discussed. We present a diagram for the phases with local and random initial conditions and show the relation between topology and mutual information. The conclusions are drawn in the final section.
2 The Model

2.1 Topology and Dynamics. The network state at a given time t is defined by a set of N binary neurons, σ^t = {σ_i^t ∈ {±1}, i = 1, …, N}. The network learns a set of independent patterns {ξ^µ, µ = 1, …, P}. Accordingly, each pattern ξ^µ = {ξ_i^µ ∈ {±1}, i = 1, …, N} is a set of site-independent random variables, binary and uniformly distributed: p(ξ_i^µ = ±1) = 1/2. The synaptic couplings are J_ij ≡ C_ij W_ij, where C denotes the topology matrix and W are the learning weights.
The topology, defined by the matrix C = {C_ij ∈ {0, 1}}, with C_ii = 0, splits into local and random links. The local links connect each neuron to its K_l nearest neighbors in a closed ring. The random links connect each neuron to K_r others, uniformly distributed along the network. Hence, the network degree is K = K_l + K_r. The network topology is then characterized by two parameters, the connectivity ratio and the randomness ratio, defined respectively by

$$ \gamma = K/N, \qquad \omega = K_r/K, \tag{2.1} $$
where ω plays the role of a rewiring probability in the small-world (SW) model (Watts & Strogatz, 1998). The model presented here was proposed by Newman and Watts (1999) and has the advantage of avoiding disconnecting the graph. We use "small world" to name the procedure for building the connectivity graph rather than the properties of the resulting network (large clustering and small path length). Note that the topology can be implemented by an adjacency list connecting neighbors, so the storage cost of this network is given by the cardinality |J| = N × K. Only long-range networks, K ∝ N, N → ∞, will be considered in this work, whatever the value of ω.

The learning algorithm updates W according to the Hebb rule,

$$ W_{ij}^{\mu} = W_{ij}^{\mu-1} + \xi_i^{\mu} \xi_j^{\mu}. \tag{2.2} $$
The network starts at W_ij^0 = 0, and after µ = P learning steps, it reaches the value W_ij = Σ_µ ξ_i^µ ξ_j^µ. The learning stage has a slow dynamics, which is stationary on the timescale of the much faster retrieval stage. The neural states, σ_i^t ∈ {±1}, are updated according to the dynamics

$$ \sigma_i^{t+1} = \mathrm{sign}(h_i^t), \qquad h_i^t \equiv \sum_j J_{ij} \sigma_j^t, \qquad i = 1, \ldots, N, \tag{2.3} $$
where h_i^t is the local field of neuron i at time t. Note that in the case of symmetric synaptic couplings, J_ij = J_ji, an energy function can be defined whose minima are the stable states of the dynamics, equation 2.3.

2.2 Information Measures. The task of the neural channel is to retrieve a pattern (say, ξ ≡ ξ^µ) starting from a neuron state σ^0 that is inside its attractor basin. This is achieved through the network dynamics, equation 2.3. The relevant order parameter is the overlap between the neural states and the pattern at time step t,

$$ m_N^t \equiv \frac{1}{N} \sum_i \xi_i \sigma_i^t. \tag{2.4} $$
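The following toy sketch (ours, with far smaller sizes than the simulations reported below) ties equations 2.1 to 2.4 together: it builds a Newman–Watts-style ring with K_l local and K_r random links per neuron, stores P patterns with the Hebb rule, runs the zero-temperature dynamics, and measures the overlap. For brevity it uses synchronous updates rather than the sequential updates of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, omega, P = 1000, 20, 0.2, 2      # toy sizes; load alpha = P/K = 0.1
Kl = round(K * (1 - omega))            # local links per neuron
Kr = K - Kl                            # random links per neuron

# Topology C (eq. 2.1): Kl nearest ring neighbors plus Kr random links.
C = np.zeros((N, N), dtype=bool)
idx = np.arange(N)
for d in range(1, Kl // 2 + 1):
    C[idx, (idx + d) % N] = True
    C[idx, (idx - d) % N] = True
for i in range(N):
    C[i, rng.choice(np.delete(idx, i), size=Kr, replace=False)] = True
C = C | C.T                            # symmetric topology; C_ii = 0 by construction

xi = rng.choice([-1, 1], size=(P, N))  # binary random patterns
W = xi.T @ xi                          # Hebb rule (eq. 2.2), summed over mu
J = np.where(C, W, 0)                  # couplings J_ij = C_ij W_ij

sigma = xi[0].copy()                   # start at a stored pattern (m0 = 1)
for t in range(50):
    sigma = np.where(J @ sigma >= 0, 1, -1)   # eq. 2.3 (synchronous for brevity)
print("overlap m =", (xi[0] * sigma).mean())  # eq. 2.4; ~1 at this low load
```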
We consider a mean-field network, and the distribution of the states is assumed to be site independent (Newman, Moore, & Watts, 2000). Therefore, according to the law of large numbers, the overlap can be written, for K, N → ∞, as m^t = ⟨σ^t ξ⟩_{σ,ξ}. The brackets represent an average over the joint distribution p(σ, ξ) for a single neuron, understood as an ensemble distribution for the neuron states {σ_i} and pattern {ξ_i} (Dominguez & Bolle, 1998).

We calculate the mutual information MI[p(σ; ξ)], a quantity used to measure the prediction that an observer at the output, σ, can make about the input, ξ (we drop the time index t). We consider solely non-spatially-modulated states; hence, MI factorizes over sites. It reads MI = S[σ] − S[σ|ξ], where the output entropy, S[σ], and the conditional entropy, S[σ|ξ], depend on the distribution p(σ, ξ). This distribution factorizes into the conditional probability p(σ|ξ) = (1 + mσξ)/2 and the input probability p(ξ). In p(σ|ξ), all types of noise in the retrieval process are enclosed (from both the environment and the dynamical process itself). With the above expressions and p(σ) ≡ δ(σ² − 1), the entropies read (Bolle, Dominguez, & Amari, 2000):

$$ S[\sigma|\xi] = -\frac{1+m}{2}\log_2\frac{1+m}{2} - \frac{1-m}{2}\log_2\frac{1-m}{2}, \qquad S[\sigma] = 1\ \mathrm{[bit]}. \tag{2.5} $$
Together with the overlap, one needs a measure of the load, which is the rate of pattern bits per synapse used to store them. Since the synapses and patterns are independent, the load is given by α = |{ξ^µ}|/|J| ≡ P/K. We define the information ratio as

$$ i(\alpha, m) \equiv MI[\sigma; \{\xi^{\mu}\}]/|J| = \alpha\, MI[\sigma; \xi], \tag{2.6} $$

since, for independent neurons and patterns, MI[σ; {ξ^µ}] ≡ Σ_{iµ} MI[σ_i; ξ_i^µ].
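Equations 2.5 and 2.6 translate directly into code; a minimal sketch (ours):

```python
import numpy as np

def information_ratio(m, alpha):
    """i(alpha, m) = alpha * (1 - S[sigma|xi]) bits per synapse (eqs. 2.5-2.6)."""
    # Conditional probability of agreement with the pattern (symmetric in +-m).
    p = np.clip((1.0 + np.abs(m)) / 2.0, 1e-12, 1 - 1e-12)
    cond_entropy = -p * np.log2(p) - (1 - p) * np.log2(1 - p)   # S[sigma|xi]
    return alpha * (1.0 - cond_entropy)                         # S[sigma] = 1 bit

# Sanity check against the FC values quoted above: m_c ~ 0.97 at alpha_c ~ 0.145.
print(information_ratio(m=0.97, alpha=0.145))  # ~0.13, close to i_c^FC
```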
3 Stationary States

3.1 Simulation. We deal with different architectures—three regimes of dilution: the fully connected (FC, γ = 1), moderately connected (MD, 0 < γ < 1), and extremely diluted (ED, γ → 0) symmetric networks. The level of randomness also runs from purely local (ω = 0) to completely random (ω = 1) networks. We simulate the dynamics with a fixed total number of connections, |J| = K · N = 3.6 · 10^7, changing the size of the network from N = 6 · 10^3 = K (γ = 1) up to N = 6 · 10^5, K = 60 (γ = 10^{-4}). McGraw and Menzinger (2003) use K = 50, N = 5 · 10^3 (γ = 10^{-2}), with |J| = 25 · 10^4, which is far from the asymptotic limit. All runs use a smoothing average over a uniform window along the load axis, P = α · N.
Figure 2: Overlap m(α) (top panels) and information rate i(α) (bottom panels) for initial m^0 = 1.0, γ = 1, 10^{-1}, 10^{-2}, 10^{-3}, 10^{-4} (from left to right) and ω = 0.0, 0.2, 1.0 (circles, plus signs, diamonds).
We study the stability properties of the retrieval states, starting the network from a given pattern, with initial overlap m^0 = 1, and letting the neurons update until relaxation. The upper panels of Figure 2 plot the overlap after it converges to a fixed point (m*), for each learned pattern. There is a transition from a retrieval phase (R, m* > 0) to a zero phase (Z, m* = 0) at a critical α_c. For the FC network, α_c^FC ∼ 0.14 (Amit et al., 1985; Crisanti et al., 1986). The load α_c for diluted networks depends on ω. For the RED network, α_c^RED ∼ 0.64 (Canning & Gardner, 1988); for the local ED network, α_c^LED ∼ 0.22, as seen in the rightmost panel of Figure 2. By increasing the dilution (decreasing γ), the transition with α becomes smoother, but α_c increases. On the other hand, we observe that the overlap is always larger for random networks than for local networks, increasing with ω for all dilution degrees.

Since there is a competition between the critical load α_c and the sharpness of the transition in m, we plot in the bottom panels of Figure 2 the information ratio i(α, m) for the three architectures. It can be seen that i(α, m) is a nonmonotonic function of α that reaches a maximum, i_max(γ, ω) ≡ i(α_max, γ, ω), for a load α_max ≤ α_c, with α_max = α_c only for the discontinuous transition
of the FC network. The i_max increases from the left to the right panel of Figure 2 (γ → 0, ED networks) and from local (ω = 0) to random connections (ω = 1). Overall, the optimal topology with respect to the stability of the retrieval states is the RED network, with i_max ∼ 0.23, at α_max ∼ 0.34 and m_max ∼ 0.87.

3.2 Replica Theory. We consider a symmetric Hebb network with topology described by its connectivity matrix C, with parameters according to equation 2.1. The static thermodynamics of a symmetric Hebb neural network in the thermodynamic limit (N → ∞) is exactly solvable for arbitrary topology, provided that each neuron is equally connected to the others and that a site-independent solution for the fixed point exists. This is the case for the SW topology (Watts & Strogatz, 1998), which we study in detail here. The energy of the symmetric network in the state σ ≡ {σ_i} is

$$ H = -\frac{1}{2N\gamma} \sum_{i,j=1}^{N} C_{ij} W_{ij} \sigma_i \sigma_j, \tag{3.1} $$
with W_ij given by the Hebb rule, equation 2.2. H completely defines the equilibrium thermodynamics of the system. We look only for spatially independent solutions; therefore, the order parameters are independent of the site indexes. Following the literature (Amit et al., 1985; Canning & Gardner, 1988), we define the order parameters as

$$ m^{\rho} = \bigl\langle \overline{\xi_i \langle \sigma_i^{\rho} \rangle} \bigr\rangle_{\xi}, \tag{3.2} $$
$$ q^{\rho\tau} = \bigl\langle \overline{\langle \sigma_i^{\rho} \sigma_i^{\tau} \rangle} \bigr\rangle_{\xi}, \qquad \tau \neq \rho, \tag{3.3} $$
$$ r^{\rho\tau} = (\gamma\alpha)^{-1} \sum_{\nu} m_{\nu}^{\rho} m_{\nu}^{\tau}, \qquad \tau \neq \rho, \tag{3.4} $$
where the overline (·) ≡ N^{-1} Σ_i (·) denotes a spatial average, ⟨·⟩_ξ the pattern average, and ⟨·⟩ the thermal average. Here ξ is the only pattern with macroscopic overlap; the index ν runs over the rest of the patterns; q and r are the configuration correlations between the replicas of σ and of m_ν, respectively. Further, we calculate the free energy and obtain an equation analogous to equation 3.2 in Canning and Gardner (1988), the extremum in m, r, q of

$$ f = \frac{\gamma\alpha}{2} + \frac{\gamma m^2}{2}\,\mathrm{Tr}\,C^{-1} + \frac{\gamma\alpha\beta r}{2}(1-q) + \frac{\alpha}{2\beta}\Bigl[ \mathrm{Tr}\ln\bigl(\mathbb{1} - \beta(1-q)C/N\bigr) - \mathrm{Tr}\,\beta q\,\bigl(\mathbb{1} - \beta(1-q)C/N\bigr)^{-1} C/N \Bigr] - \frac{1}{\gamma\beta}\Bigl\langle \int Dz \ln 2\cosh\bigl[\gamma\beta\bigl(\sqrt{\alpha r}\, z + m\xi\bigr)\bigr] \Bigr\rangle_{\xi}, \tag{3.5} $$
where β = 1/T is the inverse temperature, 𝟙 is the N × N unit matrix, and Dz denotes a gaussian integral. We denote the trace of a matrix as Tr M ≡ Σ_i M_ii, not the trace over the configurations of the system. All terms of the free energy scale well as N → ∞. The only problem is the term Tr C^{-1}, for which the sum needs to provide a 1/γ factor. Let us note that for the SW connectivity considered here, the degree of each node is γN + O(√N), the major eigenvalue of C is also γN + O(√N), and each component of the corresponding eigenvector is v_i^0 = 1/√N + O(1/N). The term in question can be calculated by decomposing C^{-1} into its eigenvalues,

$$ C^{-1} = \sum_k v^k \lambda_k^{-1} v^{k\dagger}, $$

where λ_k is the kth largest eigenvalue and v^{k†} its corresponding transposed eigenvector. Note that λ_0 = γN + O(√N). The approximation terms vanish in the limit N → ∞. By changing the SW rewiring scheme, one could achieve results without approximation terms, but to match the computer simulations, we do not do that. We have

$$ \sum_{ij} \bigl(C^{-1}\bigr)_{ij} = N v^{0\dagger} \cdot C^{-1} \cdot v^0 = N v^{0\dagger} \cdot \Bigl(\sum_k v^k \lambda_k^{-1} v^{k\dagger}\Bigr) \cdot v^0 = 1/\gamma. $$
From this point on, the derivation of the equations for the order parameters is straightforward, and in the thermodynamic limit (N → ∞), one obtains the same result as Canning and Gardner (1988):
$$ m = \Bigl\langle \int Dz\, \xi \tanh\bigl[\beta\gamma\bigl(\sqrt{\alpha r}\, z + m\xi\bigr)\bigr] \Bigr\rangle_{\xi}, \tag{3.6} $$
$$ q = \Bigl\langle \int Dz \tanh^2\bigl[\beta\gamma\bigl(\sqrt{\alpha r}\, z + m\xi\bigr)\bigr] \Bigr\rangle_{\xi}, \tag{3.7} $$
$$ r = \frac{q}{\gamma}\, \mathrm{Tr}\Bigl[ \bigl(\mathbb{1} - \beta(1-q)\,C/N\bigr)^{-2} \cdot (C/N)^2 \Bigr]. \tag{3.8} $$
Given the spatially dependent connectivity, it is unclear whether site fluctuations might give rise to solutions with a local macroscopic dependence of
m, for example, spatially dependent solutions. However, when the neurons are unbiased, σ ∈ {±1}, with zero threshold, only spatially symmetric states can be observed in the thermodynamic limit in the case of finite γ > 0, as can be seen in Koroutchev and Korutcheva (2005).

At T → 0, keeping the quantity G ≡ γβ(1 − q) finite, the tanh(·) in equation 3.6 converges to sign(·), the tanh²(·) in equation 3.7 behaves as 1 − δ(·), and the expression in equation 3.8 can be expanded in a series in G, giving

$$ m = \mathrm{erf}\bigl(m/\sqrt{2 r \alpha}\bigr), \tag{3.9} $$
$$ G = \sqrt{2/(\pi r \alpha)}\; e^{-m^2/(2 r \alpha)}, \tag{3.10} $$
$$ r = \sum_{k=0}^{\infty} (k+1)\, a_k\, G^k, \tag{3.11} $$
where a_k ≡ γ Tr[(C/K)^{k+2}]. Note that a_k is the probability of the existence of (possibly self-crossing) cycles of length k + 2 in the connectivity graph. By definition, a_0 = 1, because C is symmetric, and a_1 is the probability of having a 3-cycle, that is, the clusterization index of the graph. The series converges because G ∈ [0, 1) and a_k ∈ [0, 1]. For random extremely diluted and fully connected networks, one recovers the known results r^RED = 1 and r^FC = 1/(1 − G)², respectively (Bolle, Jongen, & Shim, 1999). The only equation explicitly dependent on the topology of the system is equation 3.11, and the only topology features important for the network equilibrium are the {a_k}. As k → ∞, a_k tends to γ. A cyclic evaluation of the Ising model was studied by Parisi (1988). It was observed that, compared with finite-N simulations, equations 3.9 to 3.11 overestimate the role of long loops. In order to close the system of these equations, one must estimate the {a_k}, which are topology dependent. This is done in the appendix.

We plot with lines in Figure 3 the i_max as a function of the dilution γ for different values of the randomness ω, obtained from the theoretical equations 3.9 to 3.11. The results from the simulation for m^{t=0} = 1 are represented in Figure 3 with symbols. A comparison of the theoretical predictions and the simulations yields, in the experimental range we study, no significant deviation from the replica symmetric solution. Therefore, no replica symmetry breaking is expected. Figure 3 shows that theory and simulation agree for most ω > 0, with some fluctuations for ω ∼ 0.
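As an illustration (ours, not the authors' code), the T → 0 system 3.9 to 3.11 can be solved by direct fixed-point iteration once the cycle probabilities a_k are supplied, for example, from the Monte Carlo estimator of the appendix. With a = [1, 0, 0, …] one recovers the RED limit r = 1; with a_k = 1 for all k the sum gives the FC result r = 1/(1 − G)².

```python
import numpy as np
from math import erf, exp, pi, sqrt

def solve_T0(alpha, a, m0=1.0, iters=2000):
    """Fixed-point iteration of eqs. 3.9-3.11; `a` is the list of a_k values."""
    m, r = m0, 1.0
    k = np.arange(len(a))
    for _ in range(iters):
        m = erf(m / sqrt(2 * r * alpha))                          # eq. 3.9
        G = sqrt(2 / (pi * r * alpha)) * exp(-m * m / (2 * r * alpha))  # eq. 3.10
        r = float(np.sum((k + 1) * np.asarray(a) * G ** k))       # eq. 3.11
        if m < 1e-8:        # collapsed to the zero (Z) solution
            break
    return m, r

# RED limit: a_0 = 1, a_k = 0 for k > 0, so r = 1 and m = erf(m/sqrt(2 alpha)).
print(solve_T0(0.3, [1.0] + [0.0] * 20))
```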
Figure 3: Maximal information i_max = i(α_max) versus γ. Theory for the stationary states (solid lines) and simulations (symbols) for m^0 = 1.0 and several values of the randomness ω.
4 Attractors

4.1 Sensitivity to Initial Overlap. The theoretical equations for the stationary states, equations 3.9 to 3.11, account only for the existence of the retrieval (R) solution, m > 0. They say nothing about its stability. The zero states (Z), m = 0, are also a solution of these equations, so both R and Z may coexist in some region of the topological parameters γ, ω. In order to study the stability of the attractors, we simulate equation 2.3 and see how the network behaves under different initial conditions. It is worth noting that no theory for the dynamics of our topological model is yet known (Bolle et al., 1999).

To analyze the attractor properties of the retrieval, we let the neuron states start at a configuration σ^0 weakly correlated with a learned pattern, ξ ≡ ξ^µ. If they are inside its basin of attraction, σ^0 ∈ B(ξ), the pattern will be recovered, σ^t → ξ. First, we choose an initial configuration given by a random correlation with the learned pattern, p(σ^0 = ±ξ|ξ) = (1 ± m^0)/2, for all neurons (so we avoid a bias between local and random neighbors). We call this the m_R initial overlap. The retrieval dynamics starts with an overlap m^0 = 0.1 and stops after it converges to a fixed point m*. Usually t_f = 80 sequential updates are enough for retrieval. The information i(α, m; γ, ω) is calculated according to equations 2.5 and 2.6 for m*.

The results are depicted in the bottom panels of Figure 4, where i(α) is plotted for different values of γ, ω. Unlike the theoretical results for the storage capacity, there are MD topologies that perform best whenever the attractor properties are considered.
Figure 4: Mutual information versus α for the local (upper panels) and random (lower panels) initial overlap m^0 = 0.1, with ω = 0.0, 0.1, 0.2.
We define the optimal topology, for a fixed ω, as the value of γ that yields the most information. Starting with m^0 = 0.1 yields optima i(γ_opt, ω) at moderate dilutions; for instance, with ω = 0.1, it holds that γ_opt ∼ 10^{-2}.

Next, we compare m_R with another type of initial distribution. The neurons start with local correlations: σ_i^0 = ξ_i for i = 1, …, Nm^0, and random σ_i^0 = ±1 otherwise. We call this the m_L initial overlap. The results are shown in the top panels of Figure 4. The first observation is that the maximum information i_max(γ; ω) has the same nonmonotonic behavior with γ in the region of ω considered; there is still a moderate γ_opt for which the information i(γ_opt) is optimized. However, the i_max for the m_L condition are a little smaller than for the m_R overlap, and the optima shift a bit toward the left panels, where the less diluted topologies are. For instance, with ω = 0.1, it now holds that γ_opt ∼ 10^{-1}.

The comparison between the upper (m_L) and lower (m_R) panels of Figure 4 shows that the initial condition m_R allows for an easier retrieval for any ω. Local topologies (ω = 0) are especially sensitive to the type of initial overlap, losing their retrieval abilities for m_L if the connectivity is γ ≤ 10^{-2}. This sensitivity to the initial conditions can be understood in terms of the basins of attraction.
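For reference, here is a sketch (ours) of the two initial conditions: m_R flips each site of the pattern independently, while m_L copies the first N·m^0 sites of the pattern and randomizes the rest. Both give an expected initial overlap of m^0.

```python
import numpy as np

rng = np.random.default_rng(1)

def init_mR(xi, m0):
    """Random overlap: p(sigma0 = +-xi | xi) = (1 +- m0)/2 at every site."""
    flip = rng.random(xi.size) > (1 + m0) / 2
    return np.where(flip, -xi, xi)

def init_mL(xi, m0):
    """Local overlap: the first N*m0 neurons copy the pattern; the rest are random."""
    sigma = rng.choice([-1, 1], size=xi.size)
    n_copy = int(round(m0 * xi.size))
    sigma[:n_copy] = xi[:n_copy]
    return sigma
```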
Figure 5: Diagram (ω × γ) of the phases R and L, for initial overlap m^0 = 0.2.
Random topologies have very deep attractors, especially if the network is diluted enough, while regular topologies almost lose their retrieval abilities with dilution. However, since the basins become rougher with dilution, the network takes longer to reach the attractor and can be trapped in metastable states. Hence, the competition between depth and roughness is won by the more robust MD networks.

The retrieval capability of the network starting at condition m_R or m_L is plotted in Figure 5. We denote by R the phase where the retrieval reaches the information i_max ≥ 0.05, starting from m^0 = 0.2 with the m_R condition. The phase L is the same, but starting with m_L. Efficient retrieval is not possible with the m_L condition for very connected or very local topologies. On the other hand, only local diluted topologies exclude retrieval with the m_R initial overlap, shown below the dashed curve in Figure 5.

Next we study the role of symmetry constraints on the synaptic weights. By asymmetric topology, we mean that no condition C_ij = C_ji is assumed, and the local connections are chosen in only one direction along the network ring. The maximum of information against the dilution, i_max(γ), is plotted in Figure 6 for several degrees of randomness, 0 ≤ ω ≤ 1. The initial overlap is the random condition m_R, with m^0 = 0.1. The left panel shows the results for the symmetric network and the right panel the results for the asymmetric network. We observe that the nonmonotonic behavior of i_max(γ) still holds for the asymmetric network, but only for ω ≤ 0.2. The asymmetric topology has more robust attractors than the symmetric topology at strong dilution. The symmetric topology keeps its optimum at moderate dilution, 0 < γ < 1, up to ω ∼ 0.4. Both types of symmetry have the same asymptotic behavior at extreme dilution for random topologies, saturating the information at i_max(γ → 0) ∼ 0.22.
Figure 6: Maximal information i_max = i(α_max) versus γ, with initial condition m_R^0 = 0.1. (Left) Symmetric network. (Right) Asymmetric network.
Nevertheless, the nonmonotonic behavior is qualitatively similar in spite of the rather different topologies.

4.2 Clustering and Mean-Length Path. We describe here the topological features of the network as a function of its parameters: the clustering coefficient c (the average fraction of a node's pairs of neighbors that are themselves neighbors) and the mean path length l (the average distance between nodes, measured as the minimal path length between them). When γ is large, the network has large c, c ∼ 1, and small l, l = O(1), whatever the value of ω. When γ is small, then if ω ∼ 0, the network is clustered and has long paths: c = O(γ), l ∼ N/K. If ω ∼ 1, the network becomes random: c ≪ 1 and l ∼ ln N. However, if the randomness is around ω ∼ 0.1, then c = O(γ) but with l ∼ ln N, and the network behaves as an SW: clustered but with short paths.

The dependence of c, l, and i on the randomness ω is plotted in Figure 7. In all panels, for connectivity γ = 10^{-1}, 10^{-2}, 10^{-3}, we see a decrease of c and l and an increase of i with ω.
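Both measures can be computed directly from the topology matrix C. A brute-force sketch (ours; adequate for toy sizes, not for the multimillion-connection simulations):

```python
import numpy as np
from collections import deque

def clustering(C):
    """Average fraction of a node's neighbor pairs that are themselves linked."""
    cs = []
    for i in range(C.shape[0]):
        nb = np.flatnonzero(C[i])
        if len(nb) < 2:
            continue
        links = C[np.ix_(nb, nb)].sum() / 2          # links among the neighbors
        cs.append(links / (len(nb) * (len(nb) - 1) / 2))
    return float(np.mean(cs))

def mean_path_length(C):
    """Average shortest-path length over all reachable node pairs (BFS per node)."""
    N = C.shape[0]
    total, pairs = 0, 0
    for s in range(N):
        dist = np.full(N, -1)
        dist[s] = 0
        q = deque([s])
        while q:
            u = q.popleft()
            for v in np.flatnonzero(C[u]):
                if dist[v] < 0:
                    dist[v] = dist[u] + 1
                    q.append(v)
        total += dist[dist > 0].sum()
        pairs += int((dist > 0).sum())
    return total / pairs
```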
Figure 7: The maximal information i(α_max) (bottom panels), clustering coefficient c, and mean path length l (top panels) versus ω, for γ = 10^{-1}, 10^{-2}, 10^{-3} (from left to right). Simulations with N · K = 10M, m^0 = 0.1.
For these ranges of ω, the path length l has already decreased to small values, and the networks have entered the SW region; in the right panel, however, there is still a slow decrease of l. The clustering c decreases quickly around ω = 0.2, after which the network is random-like. The comparison of the behavior of c (top panels) with i (bottom panels) is in agreement with Kim (2004), which allows us to conclude that the performance of the Hopfield model on various networks can be enhanced by decreasing the clustering coefficient. The region 0.001 ≤ ω ≤ 0.2 is the SW regime for the network we study. Looking at the bottom panels, we see that toward the end of the SW region there is a fast increase of i, between 0.05 ≤ ω ≤ 0.20. We conjecture that beyond the SW region, a further increase of the randomness ω is not worth its wiring cost for the little extra information i gained.

5 Conclusions

We have discussed the information capacity of an attractor neural network as a function of its topology. We calculated the mutual information for an Amari-Hopfield model with Hebbian learning, varying the connectivity (γ) and randomness (ω) parameters, and obtained the maximum with respect to α,
i_max(γ, ω) ≡ i(α_max; γ, ω). The information i_max always increases with ω, but for a fixed ω, an optimal topology γ_opt, in the sense of the information, i_opt ≡ i_max(γ_opt, ω), can be found.

We studied the stability and attractor properties. From the point of view of stability, the calculations show that the optimal topology regarding storage is the random extremely diluted (RED) network. Indeed, if no pattern completion is required, the patterns can be stored statically, which is achieved with a vanishing connectivity. As far as the dynamics is concerned (starting from a noisy pattern, m^0 < 1), however, the latter is no longer true: we found there is a moderate γ_opt that optimizes the retrieval performance whenever 0 ≤ ω < 0.3 (see Figure 6). Moreover, the memory behavior of local diluted networks is even more damaged if they start with a local overlap than if the initial overlap is random. This can be understood by considering the shape of the attractors. The ED network waits much longer for retrieval than more connected networks do, so the neurons can eventually be trapped in spurious states with vanishing information.

We found a relation between the fast increase of information with ω and the SW region of the topology. This implies that it is worth increasing the wiring length just to the end of the SW zone. In both natural and technological approaches to neural devices, dynamics is an essential issue for information processing, so an optimized topology matters for any practical purpose, even if no attention is paid to wiring or other costs of the random links (Morelli, Abramson, & Kuperman, 2004). We conjecture that the reason for the intermediate optimal γ_opt is a competition between the broadness (larger storage capacity) and roughness (slower retrieval speed) of the attraction basins. We believe that the maximization of the information with respect to the topology could be a biological criterion (where nonequilibrium phenomena are relevant) for building real neural networks. We expect that the same dependence can be found for more structured networks and learning rules. More complex initial conditions may also play a role in the retrieval and deserve dedicated study.

Appendix: The Probability of Cycles

The small-world (SW) topology can be defined by the probability of having nodes i and j connected,

$$ P(C_{ij} = 1) \equiv \omega\gamma + (1-\omega)\,\Theta\!\Bigl( \frac{\gamma N}{2} - \Bigl| \bigl(i - j + \tfrac{N}{2} + N\bigr) \bmod N - \tfrac{N}{2} \Bigr| \Bigr), \tag{A.1} $$

where ω is the SW parameter defined as the rewiring probability in Watts and Strogatz (1998) and Θ is the unit step function. One can estimate a_k by generating random SW graphs of a given size, performing k + 1 random walk steps starting from some arbitrarily chosen origin node, and then counting the trail as successful if there is a link from
that node and as failing otherwise. Using this Monte Carlo procedure, we have a Bernoulli process for estimating a_k. However, this estimate suffers finite-size effects that are difficult to quantify, especially in the case of small γ. Therefore, one can first take the limit N → ∞ in equation A.1 and then estimate the probabilities a_k. This continuous topology estimation does not suffer finite-size effects, with the exception of a larger error in estimating a_k when γ is small. However, the topology in the continuous case is not exactly the same as in the discrete case (the probability of repeating a movement along any edge of the graph is exactly zero, and the symmetry is loosely defined in the continuous approximation), so the question of whether the discrete estimation with N → ∞ and the continuous estimation coincide is legitimate. We have checked the discrepancies with simulations, and when γN > 30, there is a very good match between the continuous and the discrete case. Similar questions are discussed by Newman et al. (2000).

The conclusion is that the coefficients a_k, k > 0, can be calculated using a modification of the Monte Carlo method originally proposed by Canning and Gardner (1988). Let k be the number of iterations, and regard a particle that at each iteration changes its position on the network connectivity graph with probabilities defined by the topology of the SW graph (see Figure 8). If the particle at step k is in node j(k), we say that the coordinate of the particle is x(k) ≡ j(k)/N. When N → ∞, x(k) is a continuous variable. If y is a uniformly distributed random variable between −1/2 and 1/2, then in step k + 1 the coordinate of the particle changes according to the rule
$$ x(k+1) = [x(k) + \gamma y] \bmod 1 \quad \text{with probability } 1 - \omega, $$
$$ x(k+1) = [x(k) + y] \bmod 1 \quad \text{with probability } \omega. \tag{A.2} $$
If the particle starts at moment k = 0 from x(0) = 0 and is located at position x(k) at step k, then the probability of having a loop of length k + 1 according to the SW topology is

$$ a_{k-1} = (1-\omega)\,\theta\bigl(\gamma/2 - |x(k)|\bigr) + \omega\gamma, \tag{A.3} $$
where it has been assumed that x(k) ∈ [−1/2, 1/2]. From this Bernoulli process one can estimate a_k. (See the estimate of the precision in Koroutchev, Dominguez, Serrano, & Rodriguez, 2004.)
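A sketch (ours) of this estimator: it draws many independent (k + 1)-step walks of the form A.2 and averages the closure probability A.3.

```python
import numpy as np

rng = np.random.default_rng(2)

def estimate_a(k, gamma, omega, n_walks=100_000):
    """Monte Carlo estimate of a_k via the walk of eq. A.2 and eq. A.3."""
    # k+1 steps; each step is local (span gamma) w.p. 1-omega, else global (span 1).
    y = rng.uniform(-0.5, 0.5, size=(n_walks, k + 1))
    local = rng.random((n_walks, k + 1)) < (1 - omega)
    steps = np.where(local, gamma * y, y)
    x = steps.sum(axis=1) % 1.0              # position after k+1 steps, in [0, 1)
    x = np.where(x > 0.5, x - 1.0, x)        # map back to [-1/2, 1/2]
    p_close = np.where(np.abs(x) < gamma / 2, 1 - omega, 0.0) + omega * gamma
    return float(p_close.mean())

print(estimate_a(1, gamma=0.01, omega=0.1))  # a_1: the clusterization index
```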
Figure 8: Calculation of a_k by a random walk in the SW topology. The network with its connections is represented by the gray circle. The height of the surrounding curve represents the probability of connection to the neuron at position 0. Movements to distances shorter (longer) than γ/2 are represented by bold (thin) lines. After five steps, the probability of the existence of a cycle of length 6 (a_4) is given by the probability of a connection between x_5 and x_0 (dotted line).
As an alternative to the Bernoulli process, one can write an explicit expression for a_k, k > 0. Suppose the graph has unit length; then the probability of having a loop is

$$ a_k = \gamma \sum_{R=-k}^{k} \int \prod_{i=1}^{k+2} dx_i\, \frac{p(x_i)}{\gamma}\; \delta\Bigl( \sum_i x_i - R \Bigr), \tag{A.4} $$
where x_i represents the movement in step i, p(x) is the probability of a link at displacement x, x ∈ [−1/2, 1/2), and R is the number of revolutions performed by the loop. Although this integral can be solved analytically for small k, the calculation of a_k using the Monte Carlo method seems more efficient. The calculation of a_k = a_k(C) closes the system of equations 3.9 to 3.11, making it possible to solve them for (G, m, r) at any values of (ω, γ, α).

Acknowledgments

This work was supported by grants TIC01-572, TIN2004-07676-C01-01, BFI2003-07276, and TIN2004-04363-C03-03 from MCyT, Spain. D.D. acknowledges a Ramón y Cajal grant from MCyT.

References

Albert, R., & Barabasi, A. (2002). Statistical mechanics of complex networks. Rev. Mod. Phys., 74, 47–97.
Amari, S. I. (1972). Learning patterns and pattern sequences by self-organizing nets of threshold elements. IEEE Trans. Comp., C-21, 1197–1206.
Amit, D., Gutfreund, H., & Sompolinsky, H. (1985). Statistical mechanics of neural networks near saturation. Annals of Physics, 173, 30–67.
Bolle, D., Dominguez, D., & Amari, S. (2000). Mutual information of sparsely coded associative memory with self-control and ternary neurons. Neural Networks, 13, 445–462.
Bolle, D., Jongen, G., & Shim, G. (1999). Parallel dynamics of extremely diluted symmetric q-Ising neural networks. Journal of Statistical Physics, 96(3–4), 861–882.
Canning, A., & Gardner, E. (1988). Partially connected models of neural networks. J. Phys. A, 21, 3275–3284.
Crisanti, A., Amit, D., & Gutfreund, H. (1986). Saturation level of the Hopfield model for neural network. Europhysics Lett., 2, 337–341.
Dominguez, D., & Bolle, D. (1998). Self-control in sparsely coded networks. Phys. Rev. Lett., 80, 2961–2964.
Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci. U.S.A., 79, 2554–2558.
Kim, B. (2004). Performance of networks of artificial neurons: The role of clustering. Phys. Rev. E, 69, 045101.
Koroutchev, K., Dominguez, D., Serrano, E., & Rodriguez, F. (2004). Mutual information and topology 2: Symmetric network. In F. Yin, J. Wang, & C. Guo (Eds.), Advances in neural networks (pp. 20–25). Berlin: Springer.
Koroutchev, K., & Korutcheva, E. (2005). Conditions for the emergence of spatial asymmetric states in attractor neural network. CEJP, 3, 409–419.
Masuda, N., & Aihara, K. (2004). Global and local synchrony of coupled neurons in small-world networks. Biol. Cybernetics, 90, 302–309.
McGraw, P. N., & Menzinger, M. (2003). Topology and computational performance of attractor neural networks. Physical Review E, 68, 047102.
Morelli, L. G., Abramson, G., & Kuperman, M. (2004). Associative memory on a small-world neural network. European Physical Journal B, 38, 495–500.
Newman, M. E. J., Moore, C., & Watts, D. J. (2000). Mean-field solution of the small-world network model. Physical Review Letters, 84, 3201–3204.
Newman, M. E. J., & Watts, D. J. (1999). Scaling and percolation in the small-world network model. Phys. Rev. E, 60, 7332–7342.
Okada, M. (1996). Notions of associative memory and sparse coding. Neural Networks, 9, 1429–1458.
Parisi, G. (1988). Statistical field theory. Redwood City, CA: Addison-Wesley.
Rolls, E., & Treves, A. (2004). Neural network and brain function. New York: Oxford University Press.
Torres, J., Muñoz, M., Marro, J., & Garrido, P. (2004). Influence of topology on the performance of a neural network. Neurocomputing, 58–60, 229–234.
Watts, D. J., & Strogatz, S. H. (1998). Collective dynamics of "small-world" networks. Nature, 393, 440–442.
Received February 3, 2006; accepted June 22, 2006.
LETTER
Communicated by Anders Lansner
Connection Topology Selection in Central Pattern Generators by Maximizing the Gain of Information

Gregory R. Stiesberg
[email protected]
Institute for Nonlinear Science, University of California San Diego, La Jolla 92093-0402, U.S.A.
Marcelo Bussotti Reyes
[email protected]
Instituto de Física, Universidade de São Paulo, São Paulo 05508-090, Brazil
Pablo Varona
[email protected]
Institute for Nonlinear Science, University of California San Diego, La Jolla 92093-0402, U.S.A., and Departamento de Ingeniería Informática, Universidad Autónoma de Madrid, Madrid, Spain
Reynaldo D. Pinto
[email protected]
Instituto de Física, Universidade de São Paulo, São Paulo 05508-090, Brazil, and Institute for Nonlinear Science, University of California San Diego, La Jolla 92093-0402, U.S.A.
Ramón Huerta
[email protected]
Institute for Nonlinear Science, University of California San Diego, La Jolla 92093-0402, U.S.A.
A study of a general central pattern generator (CPG) is carried out by means of a measure of the gain of information between the number of available topology configurations and the output rhythmic activity. The neurons of the CPG are chaotic Hindmarsh-Rose models that cooperate dynamically to generate either chaotic or regular spatiotemporal patterns. These model neurons are implemented by computer simulations and electronic circuits. Out of a random pool of input configurations, a small subset of them maximizes the gain of information. Two important characteristics of this subset are emphasized: (1) the most regular output activities are chosen, and (2) none of the selected input configurations are networks with open topology.
Neural Computation 19, 974–993 (2007)
© 2007 Massachusetts Institute of Technology
These two principles are observed in living CPGs as well as in model CPGs that are the most efficient in controlling mechanical tasks, and they are evidence that information-theoretical analysis can be an invaluable tool in searching for general properties of CPGs.

1 Introduction

A central pattern generator (CPG) is a group of connected neurons that controls the activity of a mechanical device by activating muscle groups. CPGs are responsible for activities like chewing, walking, and swimming (Selverston, 1999a, 1999b; Selverston, Elson, Rabinovich, Huerta, & Abarbanel, 1998; Arshavsky, Deliagina, & Orlovsky, 1997). The intrinsic properties of every neuron in the CPG, together with the connection topology between them, determine the phase relationships in the electrical activity. A group of neurons can generate many different spatiotemporal patterns of activity that can control different types of mechanical devices. Only a subset of the possible solutions of activity of the CPG will work for a given mechanical device (see, e.g., Huerta, Sánchez-Montañés, Corbacho, & Sigüenza, 2000). Among the large number of possible spatiotemporal patterns is a subset that efficiently performs the required mechanical functions of the animal.

In this work, we intend to determine some generic constraints of a simplified working CPG. Our motivation is previous work (Huerta, Varona, Rabinovich, & Abarbanel, 2001), where it was shown that the best efficiency of a mechanical pump, based on the pyloric chamber and controlled by the pyloric CPG of the lobster, is achieved by combinations of synaptic connections corresponding to CPGs that oscillate in a regular fashion, all of which have nonopen connection topologies. Here, a nonopen topology is defined as a configuration in which every neuron receives synaptic input from at least one other neuron in the CPG (see, e.g., Selverston & Moulins, 1987; Arshavsky, Beloozerova, Orlovsky, Panchin, & Pavlova, 1985; Getting, 1989). In this letter, we test the use of Shannon and Weaver information (Shannon & Weaver, 1949) as a global measure of the efficiency of the CPG activity.

One of the best-known CPGs is the pyloric CPG of the lobster stomatogastric ganglion (Selverston & Moulins, 1987), where the rhythm is achieved by a combination of the intrinsic properties of the neurons connected by many inhibitory and a few electrical synapses. Most neurons isolated from this CPG have been shown to behave in a chaotic way (Rabinovich, Abarbanel, Huerta, Elson, & Selverston, 1997; Selverston et al., 2000), and many ideas for the role of chaos in the neurons of the CPG have been proposed (Abarbanel et al., 1996; Rabinovich et al., 1997). Here we study the behavior of simple CPGs made of either three or four chaotic neurons connected by inhibitory synapses. For simulating the chaotic neuron elements, we use one of the simplest mathematical models that is able
to produce chaotic behavior: the Hindmarsh-Rose (HR) model of a bursting neuron (Hindmarsh & Rose, 1984; Pinto et al., 2000). The synapses are simulated by a first-order kinetic model of neurotransmitter release (Destexhe, Mainen, & Sejnowski, 1994; Sharp, Skinner, & Marder, 1996; Pinto et al., 2001). We also include results from the analysis of CPGs made of three or four electronic neurons (Pinto et al., 2000) based on the HR model (Hindmarsh & Rose, 1984).

Our motivation for using analog electronic neurons (ENs) is to include in our experiments more of the elements that a real living CPG is exposed to. Unlike a mathematical model, ENs are made of real components whose values fluctuate (because of changes in temperature, for example), in the same way that neurons experience fluctuations in their intrinsic properties due to changes in the temperature or in the ionic concentrations of the saline. ENs are also subject to electronic noise, which can be compared to the noise that living neurons receive when embedded in living tissue. Results obtained with ENs therefore have to be robust to small variations such as those that occur in a living CPG, and they can be used to validate results obtained in the idealized systems created by computer simulations.

Information theory has been used before to study neural systems (e.g., Deco & Schurmann, 1997; Strong, Koberle, de Ruyter van Steveninck, & Bialek, 1998). However, our approach is different because we consider the input to the CPG to be a set of synaptic and conductance configurations. We disregard the effects of neuromodulation on the endogenous properties of the neurons as proposed by Stemmler and Koch (1999). Basically, we want to address the generation of information by the CPG due to a given set of synaptic connections.

In order to evaluate the mutual information between the input and the output, three definitions are needed: what the input is, what the output is, and what information is being conveyed. First, let us state the definition of the mutual information,

MI(I; O) = H(O) − H(O|I),

where I and O are the input and the output, respectively, and H represents the entropy, a measure of uncertainty or variability. Roughly, the entropy of a system measures how much information we need in order to predict the system's behavior. A low entropy means that the system is very regular and its behavior can be easily predicted. Conversely, a high entropy means that the system behaves in an irregular way, which makes prediction difficult. Our goal is to maximize the mutual information; in the optimal case, this means that the uncertainty of the output H(O) is maximal (the system has a wide range of possible responses) and the uncertainty of the output given the input H(O|I) is minimal, meaning 0. Stated explicitly, we want the response of the system to be as rich as
Figure 1: What is the input and the output? The concentration of the neuromodulators determines the type of connectivity between the neurons (in real CPGs, they also modify the dynamics of every neuron). The connection topology established by the neuromodulators selects a specific phase of bursting electrical activity between the members of the CPG.
possible, and at the same time we want the response of the system to a specific input to be fully determined by that input.

We still need to be more precise about what we call "the input." In most CPGs, the rhythm changes due to neuromodulatory inputs that modify both the synaptic connections between neurons (the connection topology) and the kinetics of the ionic channels in the neurons' membranes (Harris-Warrick et al., 1998; Brodfuehrer, Debski, O'Gara, & Friesen, 1995; Katz, 1998). To reduce the complexity of the problem, in this letter we consider only variations in the synaptic connections. As shown schematically in Figure 1, instead of using the neuromodulator concentrations as the input configuration, we choose the connection topology, under the assumption that there is a function relating the neuromodulator concentration configuration to the connection topology configuration. Therefore, the input we consider is a specific synaptic configuration, given by

I = { g_ij : 0 ≤ g_ij < g_max, i ≠ j; i, j = 1, …, N },

where g_max is the maximum synaptic conductance and N is the number of neurons in the CPG.

The output is based on the observation that the electrical activity produced by a neuron is mainly transmitted to the muscles when the burst is on. The spikes are carried through the axons to the muscles, which integrate them to produce an action. Also, there is strong evidence that burst duration can encode information in very different kinds of networks (Kepecs & Lisman, 2003; Lisman, 1997). Thus, for our purposes, we disregard the frequency of the spike firing, considering that the information is basically carried by the burst start and end times.

A random set of input configurations is used to integrate the CPG so that the conditional distributions and entropies can be calculated. Then the
estimation of the maximum of the mutual information leads us to a subset of configurations in I. In the next sections, with an approach independent of the one used by Huerta et al. (2001), we show that a subset consisting of 15% of the input configurations is sufficient to obtain the maximum of the mutual information, that all of these configurations lead to regular rhythmic activity, and that all of them are nonopen topology configurations. These properties are widely observed in living CPGs.

2 Mutual Information Estimation

Let us define the variable E(R, t), which can assume the Boolean values 0 (no event) or 1 (presence of an event) at time t. The value of R represents the type of event and ranges from 1 to 2·N, where N is the number of neurons in the CPG. Initially, all E(R, t) are reset to 0. When neuron i starts a burst at time t*, we set E(R = i, t = t*) = 1, and when neuron i ends a burst at time t*, we set E(R = N + i, t = t*) = 1.

To calculate the joint probability, we integrate the model CPG and obtain E(R, t) for a given input: a specific synaptic configuration G_j ∈ I, with j = 1, …, M, where M is the cardinality of the set I. Setting t' = 0 when neuron 1 (used as a time reference) started its last burst, we define p(R, t'|G_j), or p_{G_j}(R, t'), as the conditional probability of observing an event of type R at time t' for a given input G_j.

The conditional probabilities are calculated as follows. We create a counter array named C_{G_j}(R, t'), where R has the meaning explained above and t' runs from 0 to a sufficiently large value T. Starting from t = 0, we collect the events in E(R, t) for increasing t:

1. Neuron number 1 is set as the time reference for the rest of the neurons. Every time E(R = 1, t = τ) = 1, the reference time τ is taken.
2. When an event of type i is found at time t, the counter C_{G_j}(R = i, t' = t − τ) is increased by 1.
3. At the end, C_{G_j}(R, t') is normalized to 1 to yield p_{G_j}(R, t').

Examples of some typical p_{G_j}(R, t') obtained for two different input configurations G_j (corresponding to qualitatively different bursting CPGs made of four neurons) are shown in Figure 2. Note that in both cases, the time probabilities for all events (R = 2 to 8) are superimposed. For the regular bursting CPG, p_{G_j}(R, t') presents well-defined peaks that correspond to the times when each event is expected, while for the irregular bursting CPG, p_{G_j}(R, t') is spread over a large extent of time.
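The counting procedure above is straightforward to implement. The following sketch (hypothetical Python; burst start and end times are assumed to be available as arrays per neuron, which is not a detail fixed by the text) builds the normalized conditional probability p_{G_j}(R, t') for one configuration:

```python
import numpy as np

def conditional_probability(burst_starts, burst_ends, N, T, dt=2.0):
    """Estimate p_{G_j}(R, t') from the burst events of one configuration G_j.

    burst_starts, burst_ends: lists (length N) of arrays with the burst
    start/end times of each neuron; event types are R = i for a start of
    neuron i and R = N + i for an end of neuron i (1-based, as in the text).
    """
    n_bins = int(T / dt) + 1
    counts = np.zeros((2 * N, n_bins))    # counter array C_{G_j}(R, t')
    refs = np.asarray(burst_starts[0])    # neuron 1 bursts: time references

    def accumulate(times, r_index):
        for t in times:
            # reference tau = last burst start of neuron 1 before event at t
            earlier = refs[refs <= t]
            if earlier.size == 0:
                continue
            b = int((t - earlier[-1]) / dt)
            if b < n_bins:
                counts[r_index, b] += 1

    for i in range(N):
        accumulate(np.asarray(burst_starts[i]), i)        # R = i (starts)
        accumulate(np.asarray(burst_ends[i]), N + i)      # R = N + i (ends)

    total = counts.sum()
    return counts / total if total > 0 else counts        # p_{G_j}(R, t')
```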
Figure 2: Examples of some typical conditional probabilities obtained in experiments with four analog electronic neuron CPGs. (Top) p_{G_251}(R, t') for an irregular bursting CPG. (Bottom) p_{G_517}(R, t') for a regular bursting CPG. In each plot, the probabilities of the different events (R = 2–8) are superimposed.
Once we have calculated the conditional probabilities, we determine the conditional entropies (the variability of the output for a given G_j) as

H_{G_j} = − Σ_{t'=0}^{T} Σ_{R=1}^{2·N} p_{G_j}(R, t') log2 p_{G_j}(R, t'),

where the joint probability is p(R, t', G_j) = p_{G_j}(R, t') p(G_j), and p(G_j) is the probability of the configuration G_j in the set I. Since M random synaptic configurations G_j are chosen during the simulations or experiments, p(G_j) = 1/M. The marginal probability is

p(R, t') = Σ_{j=1}^{M} p_{G_j}(R, t') p(G_j),

which is used to calculate H(O) as

H(O) = − Σ_{t'=0}^{T} Σ_{R=1}^{2·N} p(R, t') log2 p(R, t'),
and the conditional entropy as

H(O|I) = Σ_{j=1}^{M} p(G_j) H_{G_j}.
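For concreteness, here is a minimal sketch of these entropy and mutual information computations, assuming the conditional probabilities p_{G_j}(R, t') for all M configurations are stacked in a single array (the array layout is our assumption):

```python
import numpy as np

def mutual_information(p_cond, p_G):
    """MI(I;O) = H(O) - H(O|I) for conditional distributions
    p_cond[j] = p_{G_j}(R, t') (shape (M, 2N, n_bins)) and input
    probabilities p_G (shape (M,))."""
    eps = 1e-300  # avoid log2(0); terms with p = 0 contribute nothing

    # conditional entropies H_{G_j}
    H_G = -np.sum(p_cond * np.log2(p_cond + eps), axis=(1, 2))

    # marginal p(R, t') and output entropy H(O)
    p_marg = np.tensordot(p_G, p_cond, axes=1)
    H_O = -np.sum(p_marg * np.log2(p_marg + eps))

    H_O_given_I = np.dot(p_G, H_G)
    return H_O - H_O_given_I, H_G
```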
3 The Problem: Maximizing the Mutual Information

We have now explained all the elements needed to calculate the mutual information between the input and the output and are therefore ready to restate the question we want to address in this work: Which subset of input configurations belonging to I maximizes the mutual information? More specifically, what are the values p(G_j) such that MI(I; O) is maximized?

In order to maximize the mutual information as a function of p(G_j) (we rename p(G_j) ≡ x_j and H_{G_j} ≡ h_j for convenience), we calculate the gradient of MI(I; O) in the x_j space with the restrictions Σ_j x_j = 1 and x_j ≥ 0. The approach followed to calculate the gradient consists of reducing the space of search from M to M − 1 dimensions by substituting x_M = 1 − Σ_{j=1}^{M−1} x_j into the MI(I; O) formula. The gradient expression used is

∂MI(I; O)/∂x_µ = h_M − h_µ + Σ_{t'=0}^{T} Σ_{R=1}^{2·N} [p_{x_M}(R, t') − p_{x_µ}(R, t')] log2 p(R, t'),

with µ = 1, …, M, h_M = Σ_{i≠µ} h_i, and p_{x_M}(R, t') = Σ_{i≠µ} p_{x_i}(R, t'). This gradient helps to maximize

ΔMI(I; O) = Σ_{µ=1}^{M−1} [∂MI(I; O)/∂x_µ] Δx_µ.

It is sufficient to change the values x_µ by using

Δx_µ = α ∂MI(I; O)/∂x_µ,

with α > 0. Another restriction that must be imposed is that if the change of x_µ would result in x_µ < 0 or x_µ > 1, then the modification of the values does not take place. The idea of using the gradient is to find the direction of x_µ that corresponds to larger values of the mutual information; we then change x_µ in that direction. The process is repeated several times, and for all x_µ, until the gradient approaches zero, close to a maximum. At this point we have the set of x_µ that maximizes the mutual information.
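A compact sketch of this constrained gradient ascent follows. It assumes the conventions above (x_j = p(G_j), with the last configuration as the eliminated coordinate x_M) and takes h_M and p_{x_M} to be the entropy and conditional distribution of that eliminated configuration; this reading of the garbled gradient formula, and the step-halving used to keep x on the probability simplex, are our assumptions:

```python
import numpy as np

def maximize_mi(p_cond, alpha=0.01, tol=1e-4, max_iter=10000, rng=None):
    """Gradient ascent on MI(I;O) over the input probabilities x_j = p(G_j).

    p_cond: array (M, 2N, n_bins) of conditional probabilities p_{G_j}(R, t').
    Returns the optimized probability vector x. The random restarts described
    in the text would call this repeatedly with different initial conditions.
    """
    rng = rng or np.random.default_rng()
    M = p_cond.shape[0]
    eps = 1e-300
    h = -np.sum(p_cond * np.log2(p_cond + eps), axis=(1, 2))  # h_j = H_{G_j}

    x = rng.uniform(0.0, 1.0, size=M)
    x /= x.sum()

    for _ in range(max_iter):
        p_marg = np.tensordot(x, p_cond, axes=1)              # p(R, t')
        log_p = np.log2(p_marg + eps)
        # dMI/dx_mu = h_M - h_mu + sum (p_M - p_mu) log2 p, index M eliminated
        grad = (h[-1] - h[:-1]
                + np.sum((p_cond[-1] - p_cond[:-1]) * log_p, axis=(1, 2)))
        x_new = x.copy()
        x_new[:-1] += alpha * grad
        x_new[-1] = 1.0 - x_new[:-1].sum()
        # reject moves that leave the probability simplex
        if np.any(x_new < 0) or np.any(x_new > 1):
            alpha *= 0.5
            continue
        if np.max(np.abs(x_new - x)) < tol:
            return x_new
        x = x_new
    return x
```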
Since there are multiple local maxima, the search has to be combined with random restarts. The set of p(G_j) is first set at random (we replace the common value p(G_j) = 1/M from the simulations or experiments by choosing each p(G_j) from a uniform distribution between 0 and 1). The random set of p(G_j) is normalized to 1, and the process of finding new p(G_j)s that maximize the MI, described above, is repeated until an approximate maximum of the MI is achieved (changes in p(G_j) become smaller than 1%). The small subset of G_j in I found with maximal p(G_j) after maximizing the MI is stored, and new random p(G_j)s are set; the procedure is then repeated. After enough repetitions of this process (about the number of different configurations in I), we can define S as the total set of input configurations found to maximize the mutual information.

We then define the subset s(H_G−, H_G+) ⊂ S of chosen configurations that generate a conditional entropy H_{G_j} with a value between H_G− and H_G+. We also define the subset i(H_G−, H_G+) ⊂ I of input configurations that produce an entropy value in (H_G−, H_G+). Finally, we define ζ as

ζ(H_G−, H_G+) = #s(H_G−, H_G+) / #i(H_G−, H_G+),

which represents the percentage of input configurations generating entropy in a given range that are found to maximize the MI.

4 The Model Neurons and Their Synaptic Connections

The single-neuron model is a modified version of the HR model, which is known to generate chaotic behavior (Hindmarsh & Rose, 1984). The additional coefficients in the equations were introduced for a more convenient implementation of the analog circuit described in the following sections. The model has three dynamical variables comprising a fast subset, x and y, and a slower variable z:

dx(t)/dt = 4·y(t) + 1.5·x²(t) − 0.25·x³(t) − 2·z(t) + 2·e + I_syn,  (4.1)
dy(t)/dt = b/2 − 0.625·x²(t) − y(t),  (4.2)
(1/µ) dz(t)/dt = −z(t) + 2·[x(t) + 2·d],  (4.3)

where e represents a constant injected (DC) current and µ is the parameter that controls the time constant of the slow variable. The parameters are
chosen to set the isolated neurons in the chaotic spiking-bursting regime (e = 3.281, µ = 0.0021, b = 1, d = 1.6). I_syn represents the postsynaptic current evoked after the onset of a chemical graded synapse. In this letter, we consider only inhibitory synapses, which are the main type of connection present in the pyloric CPG of the California spiny lobster (Panulirus interruptus) and which dominate in most invertebrate CPGs.

The synaptic current has been simulated in a fashion similar to that used in the dynamic clamp technique (Sharp et al., 1996), with minor modifications:

I_syn = −g · r(x_pre) · ϑ(x_post),  (4.4)

where g is the maximal synaptic conductance and r(x_pre) is the synaptic activation variable, determined from the presynaptic activity by

dr/dt = [r_∞(x_pre) − r]/τ_r,  (4.5)
r_∞(x_pre) = [1 + tanh((x_pre + 1.2)/0.9)]/2.  (4.6)

Here τ_r is the characteristic time constant of the synapse (τ_r ∼ 100), and ϑ(x_post) is a nonlinear function of the membrane potential of the postsynaptic neuron, x_post:

ϑ(x_post) = 1 + tanh((x_post + a)/b).  (4.7)

The parameters for ϑ(x_post) (a = 2.807514 and b = 0.4) were chosen so that this function remains approximately linear for the slow oscillations (the subthreshold region for the fast spikes), that is,

dϑ(x)/dx |_{a,b} ∼ 1,  (4.8)
ϑ(x) |_{a,b} ∼ 0.  (4.9)
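The model of this section can be summarized in a short simulation sketch. The following hypothetical Python code implements equations 4.1 to 4.7 for a network of N neurons, a fourth-order Runge-Kutta step with the fixed step size of 0.01 quoted below, and a generator for the random synaptic configurations described next; the convention that g[i, j] is the conductance from neuron j to neuron i is our assumption:

```python
import numpy as np

# Modified Hindmarsh-Rose neurons (equations 4.1-4.3) coupled by graded
# inhibitory synapses (equations 4.4-4.7), with parameter values from the text.
E, MU, B, D = 3.281, 0.0021, 1.0, 1.6       # chaotic spiking-bursting regime
A_THETA, B_THETA, TAU_R = 2.807514, 0.4, 100.0

def derivatives(state, g):
    """state: array (N, 4) with columns (x, y, z, r); g: (N, N) conductances."""
    x, y, z, r = state.T
    theta = 1.0 + np.tanh((x + A_THETA) / B_THETA)
    I_syn = -(g @ r) * theta                 # sum of -g_ij * r_j * theta_i
    dx = 4 * y + 1.5 * x**2 - 0.25 * x**3 - 2 * z + 2 * E + I_syn
    dy = B / 2 - 0.625 * x**2 - y
    dz = MU * (-z + 2 * (x + 2 * D))
    r_inf = (1.0 + np.tanh((x + 1.2) / 0.9)) / 2.0
    dr = (r_inf - r) / TAU_R
    return np.column_stack([dx, dy, dz, dr])

def rk4_step(state, g, h=0.01):
    """One fourth-order Runge-Kutta step with the fixed step of 0.01."""
    k1 = derivatives(state, g)
    k2 = derivatives(state + 0.5 * h * k1, g)
    k3 = derivatives(state + 0.5 * h * k2, g)
    k4 = derivatives(state + h * k3, g)
    return state + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)

def random_configuration(N, g_max=1.0, p_zero=0.2, rng=None):
    """Random synaptic configuration: each g_ij is zero with probability 20%
    and otherwise uniform in [0, g_max), as described in the text."""
    rng = rng or np.random.default_rng()
    g = rng.uniform(0.0, g_max, size=(N, N))
    g[rng.uniform(size=(N, N)) < p_zero] = 0.0
    np.fill_diagonal(g, 0.0)                 # no self-connections (i != j)
    return g
```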
The rule used to set the connectivity pattern in the set I is that, with probability 80%, the connection g_ij is drawn uniformly from 0.0 to 1.0, and with probability 20% it is set to 0. The connectivity rule is chosen in this way because there is no CPG in any animal in which every neuron is connected to all the rest.

Figure 3: Block diagram of an analog electronic neuron.

After choosing the connectivity of the CPG, we perform the simulation by numerically integrating equations 4.1 to 4.3, plus those for all the active synapses, described by equation 4.5, using a fourth-order Runge-Kutta
algorithm with a fixed step of 0.01 units of time. The bursts are sampled every two units of time to build the conditional probabilities p_{G_j}(R, t').

5 The Electronic Neuron

We also worked with analog electronic circuits developed to reproduce the observed membrane voltage oscillations of isolated biological neurons from the stomatogastric ganglion of the California spiny lobster. These electronic neurons (ENs) have been successfully used to replace some biological neurons in the stomatogastric ganglion, reestablishing the regular pattern of the oscillations of the lobster pyloric CPG (Szücs et al., 2000).

The block diagram of the simple analog circuit we use to implement the three-dimensional dynamical system of the EN is shown in Figure 3. This circuit integrates equations 4.1 to 4.3. The circuit uses three integrators, one adder, one inverter, two analog multipliers, and one nonlinear amplifier. We used off-the-shelf general-purpose operational amplifiers, National Instruments Model TL082, to build the integrators, adders, inverters, and nonlinear amplifier; and we used Analog Devices Model AD633 as analog multipliers. More details about the analog implementation of the ENs, as well as a four-dimensional EN that is able to reproduce in even more detail the chaotic behavior of the isolated biological neurons, can be found in Pinto et al. (2000).

The ENs were also set a priori in chaotic spiking-bursting regimes as described in section 4 but with different values of the control parameters
(e ≈ 1.75, b = 0, µ ≈ 0.0022, d ≈ 0.63). The components used in the ENs have a 10% tolerance, and the parameter values indicated here are averages of the values set in the ENs. In the experiments, we usually fine-tune e, µ, and d around the indicated values to obtain similar chaotic spiking-bursting patterns among the ENs. There is also an integration time constant, introduced by means of a capacitor, that makes the bursts and spikes of the EN have approximately the same average duration as those in time series of the membrane potential of living neurons from the lobster CPG. This time constant relates the real time of the EN to the time units of the simulations: t_real = t/450.

Figure 4: Dynamic clamp using the DynClamp4 program. (a) Between living biological neurons (BNs). (b) Between ENs. Hybrid circuits can be formed by mixing both kinds of channel configurations (CH0–CH3).

5.1 Implementation of the Analog CPGs Using DynClamp4, a Dynamic Clamp Program. Our DynClamp4 dynamic clamp program (Pinto et al., 2001) runs on a Windows NT 4.0 platform on a Pentium III computer with an Axon Digidata 1200A ADC/DAC board. DynClamp4 is an implementation of the dynamic clamp protocol (Sharp et al., 1996) that uses an auxiliary analog demultiplexing circuit, which we also developed, to generate all the possible combinations of chemical and electrical synapses among up to four living or electronic neurons, as well as several Hodgkin-Huxley voltage-dependent conductances. DynClamp4 can be used to connect living neurons, electronic neurons, and hybrid circuits, as shown in Figure 4.

The program digitizes the membrane potential of the neurons as it appears at the output of the microelectrode amplifiers; it then calculates the current to be injected into each neuron due to the combination of all synapses and conductances. The voltage signals proportional to these currents are then generated via digital-to-analog converters. This cycle is repeated with
a frequency of more than 5 kHz, so that the neurons are subjected to a current signal that is continuous compared to their intrinsic timescales. In this work, we used a modified version of the DynClamp4 software for the automatic implementation of the random synapses between the ENs. To facilitate this, we needed to tune the nonlinear amplifiers at the output of the ENs to reshape, rescale, and offset the membrane potentials x_i(t) in order to make them resemble the overall and relative amplitudes between slow oscillations and spikes in the signals we have from the biological neurons.

The analog ENs were connected by means of DynClamp4, so we could repeat with these analog computers the digital computer simulations performed earlier. We used only the chemical synapses of the DynClamp4 program. The model implemented for these synapses is similar to the one described in section 4 for the computer simulations, but it is a closer approximation to the biological graded chemical synapses present in the lobster CPG. The synaptic current is described by the first-order kinetics (Destexhe et al., 1994; Sharp et al., 1996):

I_syn = g_syn S (V_syn − V_post),
τ_s (1 − S_∞) dS/dt = (S_∞ − S),

where

S_∞(V_pre) = tanh[(V_pre − V_thres)/V_slope] if V_pre > V_thres, and S_∞(V_pre) = 0 otherwise. In these equations,
gsyn is the maximal synaptic conductance, S is the instantaneous synaptic activation, S∞ is the steady-state synaptic activation, Vsyn is the synaptic reversal potential, Vpre and Vpost are the presynaptic and postsynaptic voltages, respectively, τs is the time constant for the synaptic decay, Vthres is the synaptic threshold voltage, and Vslope is the synaptic slope voltage. The common parameters we used for all synapses were Vsyn = −80 mV (inhibitory synapses), τs = 100 ms, Vthres = −40 mV, and Vslope = 10 mV. For every trial of synaptic configuration, we chose at random 20% of the synapses to have gsyn = 0, and the other 80% of the synapses were chosen to have gsyn ranging uniformly from 0 to 500 nS. This is similar to what we did in the computer simulations. The output currents generated by the DynClamp4 are in nA, and in the living neurons the microelectrode amplifiers usually use a conversion factor of 50 to 100 mV/nA in their current input. For our ENs, a conversion factor of 200 mV/nA was appropriate to generate the voltage signals to be presented as inputs (Isyn ). For each synaptic configuration, the dynamic clamp cycle was started; then transients were allowed to dissipate for 60 s before acquiring data.
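As an illustration, a single cycle of the dynamic clamp loop for one graded synapse might look as follows. This is a sketch under the parameter values above; the per-cycle Euler update and the guard against S_∞ = 1 are our assumptions, not details given in the text:

```python
import numpy as np

# Graded chemical synapse of the dynamic clamp experiments. Units follow the
# text: conductance in nS, voltages in mV, time constant in ms. One call
# mimics one iteration of the >5 kHz dynamic clamp cycle.
V_SYN, TAU_S, V_THRES, V_SLOPE = -80.0, 100.0, -40.0, 10.0

def synapse_current(S, V_pre, V_post, g_syn, dt):
    """Return (I_syn, updated S) for one clamp cycle of duration dt ms."""
    S_inf = np.tanh((V_pre - V_THRES) / V_SLOPE) if V_pre > V_THRES else 0.0
    # tau_s * (1 - S_inf) * dS/dt = S_inf - S
    dS = (S_inf - S) / (TAU_S * (1.0 - S_inf)) if S_inf < 1.0 else 0.0
    S = S + dt * dS
    I_syn = g_syn * S * (V_SYN - V_post)
    return I_syn, S
```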
Figure 5: (Left) Distribution of inputs with specific values of the conditional entropy for ε = 0.225 in a three-neuron CPG. (Right) Input distribution in a four-neuron CPG for ε = 0.58.
The bursts of each EN were sampled every 10 ms and detected when the reshaped membrane potential crossed a threshold of −40 mV. The bursts were recorded for 300 s, and the conditional probabilities p_{G_j}(R, t') for each synaptic configuration were calculated in the same way as described in section 2.

6 Results of the Computer Simulations

We performed simulations of CPGs of three and four chaotic neurons where the connections between them were chosen at random according to the rules and methods explained in section 4. For each CPG, a pool of 850 synaptic configurations was tested (#I = 850). For each configuration, 30 random initial conditions were used, and a transient time of 10,000 units was discarded before the probability distributions were calculated over 40,000 units of time. The bursts were sampled every 2 units of time, the conditional probabilities and entropies were calculated, and the maximization of MI(I; O) as a function of p(G_j) was carried out. Once all the numerical calculations were finished, the conditional entropies and probabilities were stored. The maximization of the mutual information was performed using the rules explained in the previous sections.

In Figure 5 (left), one can see the distribution of the conditional entropy values for a three-neuron CPG and a random set I, where all the elements were chosen from a uniform distribution. Most of the input configurations are situated in the range of entropy values between 5.5 and 7.5 bits. Those entropy values represent irregular activity (see a time series in Figure 6), whereas values closer to 2 bits represent regular rhythms. In Figure 5 (right), we show the distribution of the conditional entropy for a four-neuron CPG. In this case, most of the input configurations have conditional entropy between 6 and 9 bits.
Figure 6: (Left) Time series for a configuration g12 = 0.207668, g13 = 0.396304, g21 = 0.058504, g23 = 0, g31 = 0.782209, g32 = 0.452369 that generates a conditional entropy of 6.64 bits. (Right) Time series for the configuration g12 = 0.110004, g13 = 0.253859, g21 = 0.700028, g23 = 0.152461, g31 = 0.772312, g32 = 0 that generates a conditional entropy of 2.58 bits.
Figure 7: Distribution of the selected input configurations that maximize the information between the input and the rhythmic activity for a three-neuron CPG (dotted) and a four-neuron CPG (solid).
In Figure 7, we can see the ratio ζ between the number of configurations found to maximize the MI and the number of input configurations for CPGs of three and four neurons. Most of the chosen configurations lie in the lower range of the conditional entropy in both cases (ζ vanishes for values about 1 or 2 bits higher than the minimum entropies found), which indicates
that a small subset of input configurations is chosen and that most of them behave in a regular manner.

The main results are summarized as follows:

1. A subset of 100 configurations of I accounts for 99% of the configurations that maximize MI(I; O), that is, Σ_{µ∈subset} p(G_µ) ≥ 0.99.
2. The preferred connectivity patterns are the ones with entropy values between 2 and 3 bits. These configurations produce the most regular CPG activity, as can be seen in Figure 6.
3. All of these configurations are nonopen topologies.

Most of the known invertebrate CPGs have nonopen topology connections. We have defined nonopen topologies as those in which each neuron receives synaptic input from at least one other neuron. The dominance of such topologies is likely explained by the fact that the model neurons are chaotic: any chaotic neuron that does not receive input from any other neuron remains chaotic. The configurations chosen in our experiments are typically nonopen topologies, which produce regular spatiotemporal patterns, while the open topology configurations almost always produce a chaotic pattern.

7 Results of the Analog Computations: ENs + Dynamic Clamp

In the analog computations, we first performed our experiments on a CPG composed of three chaotic neurons, with connections between them chosen at random, as explained in the previous sections. A pool of 500 random synaptic configurations was tested (#I = 500). The methods described at the end of section 5 were used to obtain the conditional probabilities and the entropies for each synaptic configuration. The maximization of MI(I; O) as a function of p(G_j) was carried out in the same way as for the computer simulations (see section 3). The same procedure was repeated for the CPGs composed of four neurons, but with a pool of 630 synaptic configurations (#I = 630).

Figure 8 shows the distribution of the conditional entropy values for a random set I, where all the elements were chosen from a uniform distribution, for both three- and four-neuron CPGs. Here also, most of the input configurations are situated in a small but slightly different range of entropy values (between 5.0 and 6.5 bits for three-neuron CPGs and between 5.5 and 7 bits for four-neuron CPGs) as compared to Figure 5. The ranges are shifted by about +1 bit in the simulations because in the analog experiments the burst sampling rate is about half the sampling rate used in the simulations (when one converts the units of time). These values of entropy still represent irregular activity (as in the time series in Figure 9, for example), and values closer to 3 bits represent regular rhythms. Another difference between the results from the simulations and the analog experiments is the minimum entropy,
Figure 8: (Left) Distribution of inputs with specific values of the conditional entropy for ε = 0.225 in a three-neuron CPG. (Right) Input distribution in a four-neuron CPG for ε = 0.58.
Figure 9: (Left) Time series for a configuration g12 = 239 nS, g13 = 374 nS, g21 = 0 nS, g23 = 277 nS, g31 = 217 nS, g32 = 0 nS that generates a conditional entropy with a value of 6.44 bits. (Right) Time series for a configuration g12 = 277 nS, g13 = 0 nS, g21 = 260 nS, g23 = 257 nS, g31 = 0 nS, g32 = 443 nS that generates a conditional entropy with a value of 3.37 bits.
which is smaller in the simulations (from 1 to 2 bits) than in the experiments (from 3 to 4 bits), probably because the fluctuations present in the electronic circuits are much larger than the small numerical noise introduced in the simulations.

In order to show that the analog computations generate the same results pointed out in section 6 for the digital computer simulations, we also calculated the ratio ζ between the number of configurations found to maximize the MI and the number of input configurations for the analog computations of the three- and four-neuron CPGs. As can be seen in Figure 10, in both cases most of the chosen configurations have a conditional entropy smaller than 5 bits (the limit is again just about 1 or 2 bits
Figure 10: Distribution of the selected input configurations that maximize the information between the input and the rhythmic activity for a three-neuron CPG (dotted line) and four-neuron CPG (solid line).
higher than the minimum entropies found), which indicates again that the chosen subset of configurations behaves in a more regular way.

8 Discussion

One of the main problems faced in the study of CPGs in animals is establishing a relationship between the electrical activity of the CPG and the mechanical device that performs a specific function. In the experiments performed on some animals, the CPG is located in a dish without connections to the mechanical device, and hence a measure of the function of the CPG cannot be established. If instead the motor activity is recorded, there are at present no means of introducing electrodes into the neurons of the CPG in the living animal, because it moves and breathes. The question we address is, Can we find a general measure that helps us to bound the potential function of a CPG?

In this work, we show how, using computer and electronic models, the maximization of information gives an approximate answer that bounds the possible solutions one can find in the CPG. It is remarkable that similar results are obtained regardless of the differences between the computer and electronic implementations. This hints at the generality of the phenomenon for simple CPGs made of chaotic neurons.
We must acknowledge that the simplifications imposed on our model CPGs, disregarding the intrinsic properties of neurons and including only inhibitory connections, are quite drastic and limit the generality of the results, especially when more complex CPGs, such as those found in vertebrates, are considered. However, one should also consider that if even our simple model CPGs are able to show some properties found in living invertebrate CPGs, it could mean that the models capture important features. So although these results cannot be claimed to validate the models, they do serve as evidence of the generality of the properties found, at least in small invertebrate CPGs. Several other invertebrate CPGs that have excitatory synapses also present nonopen topologies; for a few examples, see the CPGs for feeding in Planorbis (Arshavsky et al., 1985), swimming in Tritonia (Getting, 1989), and swimming in Clione (Arshavsky et al., 1998).

From an evolutionary perspective, our results suggest that CPGs may develop following some general rules, with the final adjustments performed depending on the specific function to be carried out. From this point of view, the application of information theory to CPG selection would be an invaluable tool for a first filtering of the large number of possible configurations. A set of possible experiments to test whether nature uses the information-maximization principle would be to let a set of nonspecific neurons develop connections in a culture dish, check the configuration obtained once the connections have developed, and repeat the experiment as many times as needed to obtain relevant statistics.

The observation that the selected connection topologies are nonopen is also supported by previous work (Huerta et al., 2001), where open topology solutions in the CPG do not provide an efficient function of the pyloric chamber of the lobster. It is interesting to note that information maximization yields a bounded region of configurations within which the specificity of a function is included. Finally, we note that electronic model neurons can be a good tool to check the results obtained from numerical simulations when no better experiments are available, since the results found have to be robust to the several kinds of fluctuations the ENs are exposed to.
Acknowledgments

Partial support for this work came from the U.S. Department of Energy, Office of Science, and the U.S. Office of Naval Research. R.D.P. and M.B.R. were supported by the Brazilian agency Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP). We are also grateful to A. I. Selverston, M. I. Rabinovich, and H. D. I. Abarbanel for their suggestions.
References

Abarbanel, H. D. I., Huerta, R., Rabinovich, M. I., Rulkov, N., Rowat, P. F., & Selverston, A. I. (1996). Synchronized action of synaptically coupled chaotic model neurons. Neural Computation, 8, 50–65.

Arshavsky, Y. I., Beloozerova, I. N., Orlovsky, G. N., Panchin, Y. V., & Pavlova, G. A. (1985). Control of locomotion in marine mollusc Clione limacina. Experimental Brain Research, 58(2), 255–293.

Arshavsky, Y. I., Deliagina, T. G., & Orlovsky, G. N. (1997). Pattern generation. Current Opinion in Neurobiology, 7(6), 781–789.

Arshavsky, Y. I., Deliagina, T. G., Orlovsky, G. N., Panchin, Y. V., Popova, L. B., & Sadreyev, R. I. (1998). Analysis of the central pattern generator for swimming in the mollusk Clione. Annals of the New York Academy of Sciences, 860, 51–69.

Brodfuehrer, P. D., Debski, E. A., O'Gara, B. A., & Friesen, W. O. (1995). Neuronal control of leech swimming. Journal of Neurobiology, 27(3), 403–418.

Deco, G., & Schurmann, B. (1997). Information transmission and temporal code in central spiking neurons. Physical Review Letters, 79(23), 4697–4700.

Destexhe, A., Mainen, Z. F., & Sejnowski, T. J. (1994). An efficient method for computing synaptic conductances based on a kinetic model of receptor binding. Neural Computation, 6, 14–18.

Getting, P. A. (1989). Emerging principles governing the operation of neural networks. Annual Review of Neuroscience, 12, 185–204.

Harris-Warrick, R. M., Johnson, B. R., Peck, J. H., Kloppenburg, P., Ayali, A., & Skarbinski, J. (1998). Distributed effects of dopamine modulation in the crustacean pyloric network. Annals of the New York Academy of Sciences, 860, 155–167.

Hindmarsh, J. L., & Rose, R. M. (1984). A model of neuronal bursting using three coupled first order differential equations. Proceedings of the Royal Society of London B, 221, 87–102.

Huerta, R., Sanchez-Montanes, M. A., Corbacho, F., & Siguenza, J. A. (2000). A central pattern generator to control a pyloric-based system. Biological Cybernetics, 82, 85–94.

Huerta, R., Varona, P., Rabinovich, M. I., & Abarbanel, H. D. I. (2001). Topology selection by chaotic neurons of a pyloric central pattern generator. Biological Cybernetics, 84(1), L1–L8.

Katz, P. S. (1998). Neuromodulation intrinsic to the central pattern generator for escape swimming in Tritonia. Annals of the New York Academy of Sciences, 860, 181–188.

Kepecs, A., & Lisman, J. (2003). Information encoding and computation with spikes and bursts. Network: Computation in Neural Systems, 14(1), 103–118.

Lisman, J. E. (1997). Bursts as a unit of neural information: Making unreliable synapses reliable. Trends in Neurosciences, 20(1), 38–43.

Pinto, R. D., Elson, R. C., Szücs, A., Rabinovich, M. I., Selverston, A. I., & Abarbanel, H. D. I. (2001). Extended dynamic clamp: Controlling up to four neurons using a single desktop computer and interface. Journal of Neuroscience Methods, 108(1), 39–48.

Pinto, R. D., Varona, P., Volkovskii, A. R., Szücs, A., Abarbanel, H. D. I., & Rabinovich, M. I. (2000). Synchronous behavior of two coupled electronic neurons. Physical Review E, 62, 2644–2656.
Rabinovich, M. I., Abarbanel, H. D. I., Huerta, R., Elson, R. E., & Selverston, A. I. (1997). Self-regularization of chaos in neural systems: Experimental and theoretical results. IEEE Transactions on Circuits and Systems I, 44, 997–1005.

Selverston, A. (1999a). What invertebrate circuits have taught us about the brain. Brain Research Bulletin, 50(5–6), 439–440.

Selverston, A. (1999b). General principles of rhythmic motor pattern generation derived from invertebrate CPGs. Progress in Brain Research, 123, 247–257.

Selverston, A., Elson, R., Rabinovich, M., Huerta, R., & Abarbanel, H. (1998). Basic principles for generating motor output in the stomatogastric ganglion. Annals of the New York Academy of Sciences, 860, 35–50.

Selverston, A. I., & Moulins, M. (Eds.). (1987). The crustacean stomatogastric system. Berlin: Springer.

Selverston, A. I., Rabinovich, M. I., Abarbanel, H. D. I., Elson, R. C., Szücs, A., Pinto, R. D., Huerta, R., & Varona, P. (2000). Reliable circuits from irregular neurons: A dynamical approach to understanding central pattern generators. Journal of Physiology (Paris), 94, 357–374.

Shannon, C. E., & Weaver, W. (1949). The mathematical theory of communication. Urbana: University of Illinois Press.

Sharp, A. A., Skinner, F. K., & Marder, E. (1996). Mechanisms of oscillation in dynamic clamp constructed two-cell half-center circuits. Journal of Neurophysiology, 76, 867–883.

Stemmler, M., & Koch, C. (1999). How voltage-dependent conductances can adapt to maximize the information encoded by neuronal firing rate. Nature Neuroscience, 2(6), 521–527.

Strong, S. P., Koberle, R., de Ruyter van Steveninck, R. R., & Bialek, W. (1998). Entropy and information in neural spike trains. Physical Review Letters, 80(1), 197–200.

Szücs, A., Varona, P., Volkovskii, A. R., Abarbanel, H. D. I., Rabinovich, M. I., & Selverston, A. I. (2000). Interacting biological and electronic neurons generate realistic oscillatory rhythms. NeuroReport, 11, 563–569.
Received September 29, 2005; accepted July 11, 2006.
LETTER
Communicated by Barak Pearlmutter
Independent Slow Feature Analysis and Nonlinear Blind Source Separation

Tobias Blaschke
[email protected]

Tiziano Zito
[email protected]

Laurenz Wiskott
[email protected]
Institute for Theoretical Biology, Humboldt University Berlin, Invalidenstraße 43, D-10115 Berlin, Germany
In the linear case, statistical independence is a sufficient criterion for performing blind source separation. In the nonlinear case, however, it leaves an ambiguity in the solutions that has to be resolved by additional criteria. Here we argue that temporal slowness complements statistical independence well and that a combination of the two leads to unique solutions of the nonlinear blind source separation problem. The algorithm we present is a combination of second-order independent component analysis and slow feature analysis and is referred to as independent slow feature analysis. Its performance is demonstrated on nonlinearly mixed music data. We conclude that slowness is indeed a useful complement to statistical independence but that time-delayed second-order moments are only a weak measure of statistical independence.
1 Introduction

In signal processing, one often has to deal with multivariate data such as a vectorial signal x(t) = [x1(t), …, xM(t)]^T. To facilitate the interpretation of such a signal, a useful representation of the data in terms of a linear or nonlinear transformation has to be found; prominent linear examples are the Fourier transform, principal component analysis, and Fisher discriminant analysis. In this letter, we concentrate on blind source separation (BSS), which recovers the signal components (sources) that originally generated an observed mixture. While the linear BSS problem can be solved by resorting to independent component analysis (ICA), a method based on the assumption of mutual independence between the mixed source signal components, this is not possible in the nonlinear case. Some algorithms have been proposed to address this problem, and we mention them below. The objective of this letter is to show that the nonlinear BSS problem can
be solved by combining ICA and slow feature analysis (SFA), a method for finding a representation in which signal components vary slowly.

After a short introduction to linear BSS and ICA in section 2.1, we present the nonlinear BSS problem and some of the available algorithms in section 2.2. SFA is explained in section 3. We introduce independent slow feature analysis (ISFA) in section 4, a combination of second-order ICA and SFA that can perform nonlinear BSS. In section 5 the algorithm is tested on random and surrogate correlation matrices and then applied to nonlinearly mixed audio data. An analysis of the results reveals that nonlinear BSS can be solved by combining the objectives of statistical independence and slowness, but that time-delayed second-order moments are not a sufficient measure of statistical independence in our case. We conclude with a discussion in section 6.

2 Blind Source Separation and Independent Component Analysis

2.1 Linear BSS and ICA. Let x(t) = [x1(t), …, xN(t)]^T be a linear mixture of a source signal s(t) = [s1(t), …, sN(t)]^T, defined by

x(t) = As(t),
(2.1)
with an invertible N × N mixing matrix A. The goal of blind source separation (BSS) is to recover the unknown source signal s(t) from the observable x(t) without any prior information. The only assumption is that the source signal components are statistically independent. Given only the observed signal x(t), we want to find a matrix R such that the components of u(t) = Qy(t) = QWx(t) = Rx(t)
(2.2)
are mutually statistically independent. Here we have divided R into two parts. First, a whitening transformation y(t) = Wx(t) with whitening matrix W is applied, resulting in uncorrelated signal components yi(t) with unit variance and zero mean, where we have assumed x(t) and also s(t) to have zero mean. Second, a transformation u(t) = Qy(t) with orthogonal Q (Comon, 1994) results in statistically independent components ui(t). The method of finding a representation of the observed data such that the components are mutually statistically independent is called independent component analysis (ICA). It has been proven that ICA solves the linear BSS problem, apart from the fact that the source signal components can be recovered only up to scaling and permutation (Comon, 1994).

A variety of algorithms perform linear ICA and therefore linear BSS. They can be divided into two classes (Cardoso, 2001): (1) independence is achieved by optimizing a criterion that requires higher-order statistics, or (2) the optimization criterion requires autocorrelations or nonstationarity
of the source signal components. For the second class of BSS algorithms, second-order statistics is sufficient (see, e.g., Tong, Liu, Soon, & Huang, 1991). Here we focus on class (2) and use a method introduced by Molgedey and Schuster (1994) based on only second-order statistics. It is based on the minimization of an objective function that can be written as

Ψ_ICA^τ(Q) := Σ_{i,j=1; i≠j}^N [C_ij^(u)(τ)]² = Σ_{i,j=1; i≠j}^N [Σ_{k,l=1}^N Q_ik Q_jl C_kl^(y)(τ)]²,  (2.3)

operating on the already whitened signal y(t). C_ij^(u)(τ) is an entry of the symmetrized time-delayed correlation matrix

C^(u)(τ) := (1/2) ⟨u(t)u(t + τ)^T + u(t + τ)u(t)^T⟩,  (2.4)
C_ij^(u)(τ) := (1/2) ⟨u_i(t)u_j(t + τ) + u_i(t + τ)u_j(t)⟩,  (2.5)

and C^(y)(τ) is defined correspondingly. Minimization of Ψ_ICA^τ can be understood intuitively as finding an orthogonal matrix Q that diagonalizes the correlation matrix with time delay τ. Since, because of the whitening, the instantaneous correlation matrix (which is simply the covariance matrix) is already diagonal, this results in signal components that are decorrelated both instantaneously and at the given time delay τ. This can be sufficient to achieve statistical independence (Tong et al., 1991). Extending this method to several time delays is straightforward and provides greater robustness (see, e.g., Belouchrani, Abed Meraim, Cardoso, & Moulines, 1997; Ziehe & Müller, 1998; and section 5.1).
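To make the procedure concrete, here is a minimal sketch of this class of algorithm for a single time delay: whitening followed by an eigendecomposition of the symmetrized time-delayed correlation matrix, which diagonalizes C^(y)(τ) and thus minimizes objective 2.3. A robust implementation such as TDSEP or SOBI would jointly diagonalize matrices for several delays instead of performing only the last step:

```python
import numpy as np

def second_order_ica(x, tau=1):
    """Separate sources from a linear mixture x (shape (N, T)) by whitening
    plus diagonalization of one symmetrized time-delayed correlation matrix."""
    x = x - x.mean(axis=1, keepdims=True)

    # whitening: y = W x with identity covariance
    cov = np.cov(x)
    eigval, eigvec = np.linalg.eigh(cov)
    W = eigvec @ np.diag(1.0 / np.sqrt(eigval)) @ eigvec.T
    y = W @ x

    # symmetrized time-delayed correlation matrix C^(y)(tau), cf. equation 2.4
    y0, yt = y[:, :-tau], y[:, tau:]
    C = 0.5 * (y0 @ yt.T + yt @ y0.T) / y0.shape[1]

    # an orthogonal matrix diagonalizing the symmetric C minimizes the
    # off-diagonal terms of objective 2.3 for this single delay
    _, Q = np.linalg.eigh(C)
    return Q.T @ y, Q.T @ W   # estimated sources u(t) and unmixing matrix R
```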
2.2 Nonlinear BSS and ICA. An obvious extension of the linear mixing model 2.1 has the form

x(t) = F(s(t)),
(2.6)
with a nonlinear function F: ℝ^N → ℝ^M that maps N-dimensional source vectors s(t) onto M-dimensional signal vectors x(t). The components xi(t) of the observable are a nonlinear mixture of the sources and, as in the linear case, the source signal components si(t) are assumed to be mutually statistically independent. Extracting the source signal is possible only if F is an invertible function on the range of s(t), which we will assume from now on.
The equivalence of BSS and ICA in the linear case does not hold in general for a nonlinear function F (Hyvärinen & Pajunen, 1999; Jutten & Karhunen, 2003). For example, given statistically independent components u1(t) and u2(t), any nonlinear functions h1(u1) and h2(u2) also lead to components that are statistically independent. Also, a nonlinear mixture of u1(t) and u2(t) can still have statistically independent components (Jutten & Karhunen, 2003). Thus, in the nonlinear BSS problem, independence is not sufficient to recover the original source signal, and additional assumptions about the mapping F or the source signal are needed to sufficiently constrain the optimization problem. We list some of the known methods:
- Constraints on the mapping F:
  - F is a smooth mapping (Hyvärinen & Pajunen, 1999; Almeida, 2004).
  - F is a postnonlinear (PNL) mapping (Taleb & Jutten, 1997, 1999; Yang, Amari, & Cichocki, 1998; Taleb, 2002; Ziehe, Kawanabe, Harmeling, & Müller, 2003).
- Prior information about the source signal components:
  - Source signal components are bounded (Babaie-Zadeh, Jutten, & Nayebi, 2002).
  - Source signal components have time-delayed autocorrelations (referred to as temporal correlations) (Hosseini & Jutten, 2003).
  - Source signal components are those that exhibit a characteristic time structure (power spectra are pairwise different) (Harmeling, Ziehe, Kawanabe, & Müller, 2003).
2.3 A New Approach. In our approach, we do not make any specific assumption about the mapping F, although the function space available for unmixing will be finite-dimensional in the algorithm, which imposes some limitations on F. Since we employ an ICA method based on time-delayed cross-correlations, we make the implicit assumption that the sources have significantly different temporal structure (pairwise different power spectra) (cf. Harmeling et al., 2003). We also assume that the sampling rate is high enough that the input signal can be treated as if it were continuous and the time derivative is well approximated by the difference of two successive time points.

We have seen that, in the nonlinear case, statistical independence alone is not a sufficient criterion for BSS: there are infinitely many nonlinearly distorted versions of one source that are all statistically independent of another source. We propose slowness as a means to resolve this ambiguity and to select a good representative from the different versions of a source, because nonlinearly distorted versions of a source usually vary more quickly than the source itself. Consider a sinusoidal signal component xi(t) = sin(t) and a second component that is the square of the first,
x_j(t) = x_i(t)² = 0.5·(1 − cos(2t)). The second component is more quickly varying due to the frequency doubling induced by the squaring. We believe this argument can be made more formal: it can be proven that, given the set of a one-dimensional signal and all its nonlinearly and continuously transformed versions, the slowest signal of the set is either the signal itself or an invertibly transformed version of it (Sprekeler, Zito, & Wiskott, 2007). Considering this, we propose, in order to perform nonlinear BSS, to complement the independence objective of pure ICA with a slowness objective. In the next section, we give a short introduction to slow feature analysis, an algorithm built on the basis of this slowness objective.

3 Slow Feature Analysis

Slow feature analysis (SFA) is a method that extracts slowly varying signals from a given observed signal (Wiskott & Sejnowski, 2002). This section gives a short description of the method as well as a link between SFA and second-order ICA (Blaschke, Berkes, & Wiskott, 2006), which provides the means to find a simple objective function for our nonlinear BSS method.

Consider a vectorial input signal x(t) = [x1(t), …, xM(t)]^T. The objective of SFA is to find a nonlinear input-output function g(x) = [g1(x), …, gL(x)]^T such that the components of u(t) = g(x(t)) vary as slowly as possible. As a measure of slowness, we use the variance of the first derivative, so that a slow signal has on average a small slope. The optimization problem then is as follows. Minimize the objective function

Δ(u_i) := ⟨u̇_i²(t)⟩
(3.1)
successively for each u_i(t) under the constraints

⟨u_i(t)⟩ = 0,  (zero mean)  (3.2)
⟨u_i(t)²⟩ = 1,  (unit variance)  (3.3)
⟨u_i(t) u_j(t)⟩ = 0 ∀ j < i,  (decorrelation and order)  (3.4)

where ⟨·⟩ denotes averaging over time. Constraints 3.2 and 3.3 ensure that the solution will not be the trivial one, u_i(t) = const. Constraint 3.4 provides uncorrelated output signal components and thus guarantees that different components carry different information. To make the optimization problem easier to solve, we consider the components g_i of the input-output function to be linear combinations of a finite set of nonlinear functions. We can then split the optimization procedure into two parts: (1) nonlinear expansion of the input signal x(t) into a high-dimensional feature space and (2) solving the optimization problem linearly in this feature space.
3.1 Nonlinear Expansion. A common method to make nonlinear problems solvable in a linear fashion is nonlinear expansion. The observed signal components x_i(t) are mapped into a high-dimensional feature space according to

z(t) = h(x(t)).
(3.5)
The dimension L of z(t) is typically much larger than that of the original signal. For instance, if we want to expand into the space of second-degree polynomials, we can apply the mapping

h(x) = [x1, …, xM, x1x1, x1x2, …, xMxM]^T − h0.
(3.6)
The dimensionality of this feature space is L = M + M(M + 1)/2. The constant vector h0 is needed to make the expanded signal mean free.

3.2 Solution of the Linear Optimization Problem. Given the nonlinear expansion, the nonlinear input-output function g(x) can be written as

g(x) = Rh(x) = Rz,
(3.7)
where R is an L × L matrix, which is subject to optimization. To simplify the optimization procedure, we (1) choose the nonlinearities h(·) such that z(t) is mean free and (2) first find a transformation y(t) = Wz(t) to obtain mutually decorrelated components y_i(t) with zero mean. The matrix W is a whitening matrix as in normal ICA:

u(t) = Qy(t) = QWz(t) = Rz(t) = g(x(t)),
(3.8)
where y(t) is the nonlinearly expanded and whitened signal. It can be shown (Wiskott & Sejnowski, 2002) that constraints 3.2 to 3.4 are fulfilled trivially if the transformation Q, subject to learning, is an orthogonal matrix. To solve the optimization problem, we rewrite the slowness objective, equation 3.1, as

Δ(u_i) = ⟨u̇_i(t)²⟩ = q_i^T ⟨ẏ(t) ẏ(t)^T⟩ q_i =: q_i^T E q_i,  (3.9)

where q_i = [Q_i1, Q_i2, …, Q_iL]^T is the ith row of Q and E is the matrix ⟨ẏ(t) ẏ(t)^T⟩. For this optimization problem, there exists a unique solution: for i = 1, the optimal weight vector is the normalized eigenvector that corresponds to the smallest eigenvalue of E. The eigenvectors of the next higher eigenvalues produce the next slow components u2(t), u3(t), and so forth. Typically, only the first several of all L possible output components are of interest and selected.
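A compact sketch of SFA along these lines, with second-degree polynomial expansion, whitening, and the eigendecomposition of E, where the time derivative is approximated by successive differences (the small-eigenvalue guard in the whitening step is our addition):

```python
import numpy as np

def sfa_quadratic(x, n_out=2):
    """Slow feature analysis on x (shape (M, T)) with second-degree
    polynomial expansion; returns the n_out slowest output components."""
    M, T = x.shape

    # nonlinear expansion h(x): monomials of degree 1 and 2, made mean-free,
    # giving L = M + M(M+1)/2 components
    iu = np.triu_indices(M)
    z = np.vstack([x, x[iu[0]] * x[iu[1]]])
    z -= z.mean(axis=1, keepdims=True)

    # whitening y = W z
    cov = np.cov(z)
    eigval, eigvec = np.linalg.eigh(cov)
    keep = eigval > 1e-10                    # guard against singular directions
    W = np.diag(1.0 / np.sqrt(eigval[keep])) @ eigvec[:, keep].T
    y = W @ z

    # E = <ydot ydot^T> with ydot(t) ~ y(t + 1) - y(t)
    ydot = np.diff(y, axis=1)
    E = ydot @ ydot.T / ydot.shape[1]

    # the slowest directions are the eigenvectors of E with the smallest
    # eigenvalues (np.linalg.eigh returns them in ascending order)
    _, Q = np.linalg.eigh(E)
    return Q[:, :n_out].T @ y
```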
Finding the eigenvectors is equivalent to finding the transformation Q such that Q^T E Q is diagonal. As described in detail in Blaschke et al. (2006), this leads to an objective function for SFA, subject to maximization,

Ψ_SFA^τ(Q) := Σ_{i=1}^L [C_ii^(u)(τ)]² = Σ_{i=1}^L [Σ_{k,l=1}^L Q_ik Q_il C_kl^(y)(τ)]²,  (3.10)
where τ is a time delay that arises from an approximation of the time derivative. We set τ = 1 because we make the approximation ẏ(t) ≈ y(t + 1) − y(t). To understand equation 3.10 intuitively, note that slowly varying signal components are easier to predict and should therefore have strong correlations in time. Thus, maximizing the time-delayed autocorrelation produces slowly varying signal components. Since the trace of C^(y)(τ) is preserved under a rotation Q, maximizing the sum over the squared autocorrelations tends to produce a set of most slowly varying signal components at the expense of the other components, which become most quickly varying and are usually discarded. Note the formal similarity between equations 2.3 and 3.10.

4 Independent Slow Feature Analysis

The nonlinear BSS method proposed in this section combines the principle of independence known from linear second-order BSS methods with the principle of slowness described above. Because of this combination of ICA and SFA, we refer to the method as independent slow feature analysis (ISFA). Second-order ICA tends to make the output components independent, and SFA tends to make them slow.

Since we are dealing with a nonlinear mixture, we first compute a nonlinearly expanded signal z(t) = h(x(t)), with h: ℝ^M → ℝ^L some nonlinear function chosen such that z(t) has zero mean. In a second step, z(t) is whitened to obtain y(t) = Wz(t). Finally, we apply linear ICA combined with linear SFA on y(t) in order to find the output signal u(t), the first R components of which are the estimated source signals, where R is usually much smaller than L, the dimension of the expanded signal. Because of the whitening, we know that ISFA, like ICA and SFA, is solved by finding an orthogonal L × L matrix Q. We write the output signal u(t) as

u(t) = Qy(t) = QWz(t) = QWh(x(t)).
(4.1)
While the u1 (t), . . . , u R (t) are statistically independent and slowly varying, the components u R+1 (t), . . . , u L (t) are more quickly varying and may be statistically dependent on each other as well as on the estimated sources.
The last L − R components of the output signal u(t) are irrelevant for the final result but important during the optimization procedure (see below). To summarize, we have an M-dimensional input x(t), an L-dimensional nonlinearly expanded and whitened y(t), and an L-dimensional output signal u(t). ISFA finds an orthogonal matrix Q such that the first R components of the output signal u(t) are mutually independent and slowly varying. These are the estimated sources.

4.1 Objective Function. To recover R source signal components u_i(t), i = 1, …, R, from an L-dimensional expanded and whitened signal y(t), the objective for ISFA with one single time delay τ reads

Ψ_ISFA^τ(u_1, …, u_R) := b_ICA Ψ_ICA^τ(u_1, …, u_R) − b_SFA Ψ_SFA^τ(u_1, …, u_R)
= b_ICA Σ_{i,j=1; i≠j}^R [C_ij^(u)(τ)]² − b_SFA Σ_{i=1}^R [C_ii^(u)(τ)]²,  (4.2)
where we simply combine the ICA objective, equation 2.3, and the SFA objective, equation 3.10, for the first R components, weighted by the factors b_ICA and b_SFA, respectively. Note that the ICA and the SFA objective are usually applied to all components and that in the linear case (and for one time delay τ = 1), they are equivalent (Blaschke et al., 2006). Here, they are applied to an R-dimensional subspace in the L-dimensional expanded space, which makes them different from each other. Ψ_ISFA^τ is to be minimized, which is the reason why the SFA part has a negative sign.

In the linear case, it is standard practice to use multiple time delays to stabilize the ICA solution (see, e.g., the kTDSEP algorithm by Harmeling et al., 2003). We will see in sections 5.1 and 5.2 that in our case, multiple time delays are essential to get meaningful solutions. The general expression for the objective of ISFA then reads

Ψ_ISFA(u_1, ..., u_R) := b_ICA Σ_{τ∈T_ICA} κ_ICA^τ Ψ_ICA^τ − b_SFA Σ_{τ∈T_SFA} κ_SFA^τ Ψ_SFA^τ
= b_ICA Σ_{τ∈T_ICA} κ_ICA^τ Σ_{i,j=1, i≠j}^R (C_ij^(u)(τ))² − b_SFA Σ_{τ∈T_SFA} κ_SFA^τ Σ_{i=1}^R (C_ii^(u)(τ))²,   (4.3)
where T_ICA and T_SFA are the sets of time delays for the ICA and SFA objectives, respectively, whereas κ_ICA^τ and κ_SFA^τ are weighting factors for the
corresponding correlation matrices. For simplicity, we will continue the description with only one time delay based on equation 4.2 and only later provide the full formulation with multiple time delays based on equation 4.3.

4.2 Optimization Procedure. From equation 4.1, we know that C^(u)(τ) in equation 4.2 depends on the orthogonal matrix Q. There are several ways to find the orthogonal matrix that minimizes the objective function. Here we apply successive Givens rotations to obtain Q. A Givens rotation is a rotation around the origin within the plane of two selected components µ and ν and has the matrix form
Q_ij^µν :=
  cos(φ)    for (i, j) ∈ {(µ, µ), (ν, ν)}
  −sin(φ)   for (i, j) ∈ {(µ, ν)}
  sin(φ)    for (i, j) ∈ {(ν, µ)}
  δ_ij      otherwise,   (4.4)

with Kronecker symbol δ_ij and rotation angle φ.
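In code, constructing this matrix is straightforward (a minimal sketch; the function name is ours):

```python
import numpy as np

def givens_rotation(L, mu, nu, phi):
    """Givens rotation matrix Q^{mu,nu} of equation 4.4: the identity,
    except for a rotation by angle phi within the mu-nu-plane."""
    Q = np.eye(L)
    Q[mu, mu] = Q[nu, nu] = np.cos(phi)
    Q[mu, nu] = -np.sin(phi)
    Q[nu, mu] = np.sin(phi)
    return Q
```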
Any orthogonal L × L matrix such as Q can be written as a product of L(L − 1)/2 (or more) Givens rotation matrices Q^µν (for the rotation part) and a diagonal matrix with diagonal elements ±1 (for the reflection part). Since reflections do not matter in our case, we consider only the Givens rotations, as is often done in second-order ICA algorithms (e.g., Cardoso & Souloumiac, 1996) (but note that here ICA is applied to a subspace). Objective 4.2 as a function of a Givens rotation Q^µν reads
Ψ_ISFA^{τ,µν}(Q^µν) = b_ICA Σ_{i,j=1, i≠j}^R ( Σ_{k,l=1}^L Q_ik^µν Q_jl^µν C_kl^(u′)(τ) )²
− b_SFA Σ_{i=1}^R ( Σ_{k,l=1}^L Q_ik^µν Q_il^µν C_kl^(u′)(τ) )²,   (4.5)
where u′ is some intermediate signal during the optimization procedure. For each Givens rotation, there exists an angle φ_min with minimal Ψ_ISFA^{τ,µν}. Successive application of Givens rotations Q^µν with the corresponding rotation angle φ_min leads to the final rotation matrix Q, yielding

C^(u)(τ) = Q^T C^(y)(τ) Q.
(4.6)
In the ideal case, the upper left R × R submatrix of C^(u)(τ) is diagonal with a large trace Σ_{i=1}^R C_ii^(u)(τ).
Applying a Givens rotation Q^µν in the µν-plane changes all auto- and cross-correlations C_ij^(u′)(τ) with at least one of the indices equal to µ or ν. There exist two invariances under such a transformation:

(C_µi^(u′)(τ))² + (C_νi^(u′)(τ))² = const   ∀ i ∉ {µ, ν},   (4.7)

(C_µµ^(u′)(τ))² + (C_µν^(u′)(τ))² + (C_νµ^(u′)(τ))² + (C_νν^(u′)(τ))² = const.   (4.8)
Assume we want to minimize Ψ_ISFA^τ for a given R, where R denotes the number of signal components we want to extract. Applying a Givens rotation Q^µν, we have to distinguish three cases:
• Case 1. Both axes, µ and ν, lie inside the subspace spanned by the first R axes (µ, ν ≤ R) (see Figure 1a). The sum over all squared cross-correlations of all signal components that lie outside the R-dimensional subspace is constant, as well as that of all signal components inside the subspace. The former holds because of the first invariance, equation 4.7, and the latter because of the first and second invariances, equations 4.7 and 4.8. There is no interaction between inside and outside; in fact, the objective function is exactly the objective for an ICA algorithm based on second-order statistics, such as TDSEP or SOBI (Ziehe & Müller, 1998; Belouchrani et al., 1997). Blaschke et al. (2006) have shown that this is equivalent to SFA in the case of a single time delay of τ = 1.

• Case 2. Only one axis, without loss of generality µ, lies inside the subspace; the other, ν, lies outside (µ ≤ R < ν) (see Figure 1b). Since one axis of the rotation plane lies outside the subspace, u_µ in the objective function can be optimized at the expense of the u_ν outside the subspace. A rotation of π/2, for example, would simply exchange components u_µ and u_ν. For instance, according to equation 4.7, (C_µi^(u′))² can be optimized at the expense of (C_νi^(u′))² with i ∈ {1, ..., R}; according to equation 4.8, (C_µµ^(u′))² can be optimized at the expense of (C_µν^(u′))², (C_νµ^(u′))², and (C_νν^(u′))². This gives the possibility of finding the slowest and most independent components in the whole space spanned by all L axes, in contrast to case 1, where the minimum is searched within the subspace spanned by the first R axes considered in the objective function.

• Case 3. Both axes lie outside the subspace (R < µ, ν) (see Figure 1c). A Givens rotation with the two rotation axes outside the relevant subspace does not affect the objective function and can therefore be disregarded.
Figure 1: Each square represents a squared cross- or autocorrelation (C_ij^(u′))², where index i (j) denotes the row (column) of the square. Dark squares indicate all entries that are changed by a rotation in the µ-ν-plane. L is the dimensionality of the expanded signal u and R the number of signal components u_i(t) subject to optimization. The entries incorporated in the objective function are located in the upper-left corner, as indicated by the dashed line. (a) Case 1. (b) Case 2. (c) Case 3.
To optimize the objective function of ISFA, equation 4.2, we need to calculate the explicit form of the objective function Ψ_ISFA^{τ,µν} in equation 4.5. By inserting the Givens rotation matrix 4.4 into the objective function 4.5, and considering the case with multiple time delays, we can write the objective
as a function of the rotation angle φ,

Ψ_ISFA^µν(φ) = b_ICA ( e_c + Σ_{β=0}^2 e_β cos^{4−β}(φ) sin^β(φ) ) − b_SFA ( d_c + Σ_{α=0}^4 d_α cos^{4−α}(φ) sin^α(φ) ),   (4.9)

with constants e and d that depend only on the C_kl^(u′) before rotation. Further simplification (cf. Blaschke & Wiskott, 2004) leads to

Case 1: Ψ_ISFA^µν(φ) = A_0 + A_4 cos(4φ + φ_4),   (4.10)
Case 2: Ψ_ISFA^µν(φ) = A_0 + A_2 cos(2φ + φ_2) + A_4 cos(4φ + φ_4),   (4.11)
with a single minimum (if, without loss of generality, φ ∈ (−π/2, π/2]), which can be calculated easily. The derivation of equations 4.10 and 4.11 involves various trigonometric identities and, because of its length, is documented in the appendix. The iterative optimization procedure with successive Givens rotations can now be described as follows:
1. Initialize Q′ = I and compute C^(u′)(τ) = C^(y)(τ) ∀ τ ∈ T_ICA ∪ T_SFA with equation 2.4 and Ψ_ISFA with equation 4.3.
2. Choose a random permutation of the set of axis pairs: P = σ({(µ, ν) with µ ≤ R and µ < ν ≤ L}).
3. Go systematically through all axis pairs in P. For each axis pair:
   (a) Determine the optimal rotation angle φ_min for the selected axes with equation 4.10 or 4.11.
   (b) Compute the Givens rotation matrix Q^µν(φ_min) defined by equation 4.4.
   (c) Update C^(u′)(τ) using C^(u′)(τ) → (Q^µν)^T C^(u′)(τ) Q^µν.
   (d) Update Q′ according to Q′ → Q^µν Q′.
   (e) Back up the previous objective-function value Ψ′_ISFA = Ψ_ISFA.
   (f) Calculate the new objective-function value Ψ_ISFA with equation 4.3 using the updated C^(u′)(τ) from step 3c.
   (g) Store the relative decrease of the objective-function value, |(Ψ′_ISFA − Ψ_ISFA)/Ψ_ISFA|.
4. Go to 2 until the relative decrease of the objective function is smaller than a threshold ε for all axis pairs in P.
5. Set Q = Q′ and u(t) = Qy(t).
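A sketch of this loop in Python follows (our illustration; `optimal_angle` stands in for the closed-form minimization of equations 4.10 and 4.11, and `objective` for equation 4.3, both assumed helpers rather than code from the paper):

```python
import numpy as np

def isfa_optimize(C, R, optimal_angle, objective, eps=1e-7, rng=None):
    """C maps each time delay tau to its L x L correlation matrix
    C^(u')(tau); returns the accumulated rotation Q (steps 1-5)."""
    rng = np.random.default_rng() if rng is None else rng
    L = next(iter(C.values())).shape[0]
    Q = np.eye(L)                                    # step 1
    psi = objective(C)
    while True:
        pairs = [(mu, nu) for mu in range(R) for nu in range(mu + 1, L)]
        rng.shuffle(pairs)                           # step 2
        decreases = []
        for mu, nu in pairs:                         # step 3
            phi = optimal_angle(C, mu, nu)           # (a)
            G = np.eye(L)                            # (b) Givens matrix, eq. 4.4
            G[mu, mu] = G[nu, nu] = np.cos(phi)
            G[mu, nu], G[nu, mu] = -np.sin(phi), np.sin(phi)
            for tau in C:                            # (c) update correlations
                C[tau] = G.T @ C[tau] @ G
            Q = G @ Q                                # (d) accumulate rotation
            psi_old, psi = psi, objective(C)         # (e), (f)
            decreases.append(abs((psi_old - psi) / psi))  # (g)
        if max(decreases) < eps:                     # step 4: convergence
            return Q                                 # step 5: u(t) = Q y(t)
```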
In step 2 it is important to note that the rotation planes of the Givens rotations are selected from the whole L-dimensional space (although we avoid the irrelevant case 3 by requiring µ ≤ R; see Figure 1), whereas the objective function uses information of correlations only among the first R signal components u_i. Since Q^µν is very sparse, the Givens rotation in step 3c does not require a full matrix multiplication but can be computed more efficiently. Note that the algorithm works on the intermediate correlation matrices C^(u′)(τ), not on the signals themselves; the input signal y(t) is used only in the initialization (step 1) and at the end (step 5) when the output signal u(t) is computed. To circumvent the problem of getting stuck in local optima of the objective function, a random rotation of the outer space (ν > µ > R) can be performed after convergence in step 4, and the algorithm can be restarted at step 2.

5 Results

To evaluate the performance of ISFA, we tested the algorithm first on random matrices to check how many matrices are needed to get meaningful results, then on surrogate matrices to check that the algorithm reliably converges to the global optimum under these ideal conditions, and then on a difficult although low-dimensional mixture of audio data to show how it performs on real data.

In order to reduce the problem of local optima, we use SFA as a preprocessing step. That choice follows from the empirical observation that SFA is always able to extract the first source signal. To stabilize the ISFA solutions even further, we typically run the optimization routine once with the first axis fixed and then once more following the procedure described in section 4.2. Throughout this letter, the SFA time-delay set and the weighting factors were as follows:

T_SFA = {1},   (5.1)
κ_SFA^τ = 1 for τ = 1,   (5.2)
κ_ICA^τ = 1 ∀ τ ∈ T_ICA.   (5.3)
This particular choice makes it easy to interpret the ISFA objective function 4.3. The SFA part is the plain SFA objective function of equation 3.10; the ICA part is the plain ICA objective function of equation 2.3 extended to several time delays. If we chose more than one time delay for the SFA part, the interpretation in terms of slowness would become less clear (see Blaschke et al., 2006). T_ICA depends on the experiment (see below).

5.1 Tests with Random Matrices. First, consider only the ICA part of the objective function 4.3. Its purpose is to guarantee statistical independence
Table 1: Critical Number of Time Delays, T_crit, for Different Values of L and R.

               R:   2    3    4    5    6    7    8    9   10   >10
L = 9,  T_crit:    18    8    5    4    3    2    2    2    –    –
L = 20, T_crit:    36   19   13    9    7    6    5    4    3    2
of the estimated sources by simultaneously diagonalizing the R × R upper left submatrices of T time-delayed L × L correlation matrices, where T is the number of elements in T_ICA. However, for the ICA term to be useful, we have to take sufficiently many matrices into account so that simultaneous submatrix diagonalization is not trivial. For instance, a single symmetric matrix can always be fully diagonalized by the orthonormal set of its eigenvectors. Thus, for R = L and T = 1, one has to take at least two matrices to avoid this spurious solution, which would be found even if there are no underlying statistically independent sources.

To estimate the minimum number of matrices needed, we ran ISFA with b_SFA = 0 on randomly generated symmetric matrices A_τ, τ = 1, ..., T, for different values of L, R, and T. The subdiagonalization was considered successful if E := √⟨A_ij²⟩_{τ, i>j}, that is, the square root of the averaged squared nondiagonal terms, was below a threshold E_crit := 10⁻³. For fixed L and R < L, we typically observe that a high degree of subdiagonalization is possible for T = 2. For T > 2 the subdiagonalization is still possible but at a lower degree with increasing T, until a critical T_crit is reached, at which the degree of subdiagonalization displays a sharp transition where E crosses the threshold E_crit and remains stable after that. The estimated critical numbers of time delays T_crit for L ∈ {9, 20} and different values of R are given in Table 1. In the simulations that follow, we have M = R = 2 and use ISFA3 and ISFA5 (ISFAn refers to ISFA with polynomials of degree n), resulting in L = 9 and L = 20, respectively. From the table, we see that with T = 50, we are well above T_crit in both cases.

5.2 Tests with Surrogate Matrices. To test the performance of ISFA (now including the SFA part) in the absence of noise, finite-size effects, or any other kind of perturbation, we carried out an experiment with T > 1 surrogate matrices, prepared such that they have a unique exact solution (except for permutations). The first matrix, with τ = 1, is fully diagonal, with the diagonal elements ordered by decreasing absolute value, with the exception of the second and last element, which are swapped. All other T − 1 matrices are random symmetric matrices with a diagonal (R + R_ICA) × (R + R_ICA) upper submatrix. SFA alone sees only the first matrix (cf. equations 5.1 and 5.2) and would favor a solution in which the last component is swapped back into the R × R subspace in place of the
small second component. ICA alone would favor any permutation of the first (R + R_ICA) components equally well, because for any of these permutations, the R × R upper submatrices are all diagonal. In this example, ICA should prevent SFA from swapping the last component into the R × R subspace, and SFA should disambiguate the many equally valid ICA solutions by selecting the largest diagonal elements, that is, the slowest components, in the first matrix.

This set of matrices constitutes a fixed point for the ISFA algorithm. If we run ISFA directly on these matrices, we get Q = I. If we now apply a random rotation matrix Q_rand to the set of matrices, we would expect ISFA to find a matrix Q that inverts this rotation and returns the R original first components, but in any arbitrary order. Thus, the R × R submatrix of the product P := QQ_rand should be a permutation matrix for perfect unmixing.

We performed 10,000 independent tests with R = 2, R_ICA = 2, L = 9, and T = 50, somewhat imitating the case of two nonlinearly mixed independent sources and an expansion space of all polynomials of degree three. The estimated critical number of matrices T_crit is 18. Using 50 matrices, we rule out any spurious solution as discussed in section 5.1. As a measure of performance, we used the reconstruction error measure first introduced by Amari, Cichocki, & Yang (1995) in the formulation given in Blaschke and Wiskott (2004):

E = (1/R²) [ Σ_{i=1}^R ( Σ_{j=1}^R |P_ij| / max_k |P_ik| − 1 ) + Σ_{j=1}^R ( Σ_{i=1}^R |P_ij| / max_k |P_kj| − 1 ) ].   (5.4)

An experiment is considered successful if the unmixing error is smaller than 10⁻⁵. We found that ISFA always recovered the original components and that this 100% success rate was largely independent of the scaling factors b_ICA and b_SFA, which we therefore set to b_ICA = b_SFA = 1 for this experiment.
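This error measure is simple to compute (a sketch; E vanishes exactly when P is a scaled permutation matrix):

```python
import numpy as np

def unmixing_error(P):
    """Reconstruction error of equation 5.4 for the R x R submatrix P
    of Q @ Q_rand."""
    P = np.abs(np.asarray(P, dtype=float))
    R = P.shape[0]
    rows = (P / P.max(axis=1, keepdims=True)).sum(axis=1) - 1.0
    cols = (P / P.max(axis=0, keepdims=True)).sum(axis=0) - 1.0
    return (rows.sum() + cols.sum()) / R**2
```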
5.3 Tests with Twisted Audio Data. In the third experiment, we tested the algorithm on 171 pairs of 19 nonlinearly mixed music excerpts. The sample values of the 19 excerpts were in the range [−1, +1); the mean had an average value of (−10 ± 110) × 10⁻⁶ (mean ± std); the standard deviation had an average value of 0.16 ± 0.07, with minimum and maximum values of 0.02 and 0.27, respectively. One additional music excerpt was discarded because it had extreme peaks, which led to a strong nonlinear distortion due to the SFA part and low correlations with the source even though it was in principle extracted correctly. All audio signals were 2²¹ = 2,097,152 samples long and had a CD-quality sampling frequency of 44,100 Hz. We used the nonlinear mixture introduced by Harmeling et al. (2003), defined by

x_1(t) = (s_2(t) + 3s_1(t) + 6) cos(1.5π s_1(t)),   (5.5)
x_2(t) = (s_2(t) + 3s_1(t) + 6) sin(1.5π s_1(t)).   (5.6)
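The mixture is easily reproduced (a sketch; `s1` and `s2` are arrays holding the two source signals):

```python
import numpy as np

def twisted_mixture(s1, s2):
    """Spiral mixture of equations 5.5 and 5.6 (Harmeling et al., 2003)."""
    r = s2 + 3.0 * s1 + 6.0
    return r * np.cos(1.5 * np.pi * s1), r * np.sin(1.5 * np.pi * s1)
```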
This is quite an extreme nonlinearity, and the unmixing performance depends strongly on the standard deviation of the sources. For the ICA part of the objective in equation 4.3, we used 50 time delays evenly spaced between 1 and 44,100, corresponding to a timescale of up to 1 second. The number of time delays is greater than the critical number T_crit, which is 18 for an expansion with polynomials of degree three and 36 for polynomials of degree five.

In order to evaluate the performance of the algorithm fairly, we used linear regression to check whether the nonlinear mixture was indeed invertible within the available space. Two orthogonal directions were fit within the whitened expanded space to maximize the correlation with the original sources. Within the space of polynomials of degree three, there were a number of cases (51 examples, 30% of the total) where the two sources were not found by linear regression, which means the nonlinear mixture was not invertible within the available expanded space. This is the main reason for failures in ISFA3. Within the space of polynomials of degree five, the mixture was always invertible.

The scaling factor b_SFA was kept constant and equal to 1, while b_ICA was manually tuned for each example in order to maximize the correlation between estimated and original sources. For polynomials of degree three, we tested different values of b_ICA equidistant on a logarithmic scale between 0 and 10,000. The number of tested values varied between 5 and 40, depending on how clear and robust the optimum was. For polynomials of degree one and five, we largely adopted the values found for polynomials of degree three; only if the algorithm failed with these values did we retune b_ICA with 20 equidistant values. This tuning resulted in values between 0 and 1000.

A source signal is considered to be recovered if the correlation with the estimated source is greater than 0.9. Scatter plots of a successful example are shown in Figure 2, and a summary of the results is given in Table 2. ISFA is able to separate the nonlinearly mixed sources in about 70% of the cases in which unmixing was possible at all. This is remarkable given the extreme nonlinearity of the mixture and a chance level of unmixing of less than 0.01%, as we have tested by numerical simulations. However, there remains a failure rate of about 30%, which is puzzling given the perfect performance on the surrogate matrices (see section 5.2). We investigate this in the next section.

5.4 Analysis of Failure Cases. Why did ISFA fail in about 30% of the cases where a good solution was available by linear regression? The values of the objective function Ψ_ISFA, equation 4.3, and its two parts, Ψ_ICA and Ψ_SFA,
Figure 2: Scatter plot of two sources, their nonlinear mixture, and the estimated sources. (a) Sources. (b) Mixture. (c) Sources estimated by ISFA5 . (d) First source versus estimated first source. (e) Second source versus estimated second source. Correlation coefficients of estimated sources and original sources were 0.996 and 0.998.
give us some information about possible reasons. Consider the following four different cases:

1. In 1 out of the 35 true failures for ISFA3, and never for ISFA5, the Ψ_ISFA of the sources estimated by ISFA is greater than the Ψ_ISFA of the sources estimated by linear regression. In this case, the algorithm obviously got stuck in a local optimum.

2. In 15 and 26 out of the 35 and 49 true failures for ISFA3 and ISFA5, respectively, the Ψ_ISFA of the sources estimated by ISFA is smaller than the Ψ_ISFA of the sources estimated by linear regression, but either Ψ_ICA or Ψ_SFA is greater than the corresponding linear-regression value. This indicates that the tuning of the weighting factors b_SFA and b_ICA might not have been fine enough. However, it could also be that there is an abrupt transition between solutions where Ψ_ICA is greater and solutions where Ψ_SFA is greater than the corresponding linear-regression value.

3. In 6 and 3 out of the 35 and 49 true failures for ISFA3 and ISFA5, respectively, Ψ_ICA and Ψ_SFA of the sources estimated by ISFA are both smaller than those of the linear-regression estimate and greater than those of the original sources. Neither a local optimum nor the weighting factors are a plausible cause for the failure in these
Table 2: Reconstruction Performance of Linear Regression and ISFA.

Number of Reconstructed Sources    REG1        REG3         REG5
2                                  5% (8)      70% (120)    100% (171)
1                                  54% (93)    30% (51)     0% (0)
0                                  41% (70)    0% (0)       0% (0)

Number of Reconstructed Sources    ISFA1       ISFA3        ISFA5
2                                  5% (8)      50% (85)     71% (122)
1                                  50% (85)    34% (59)     18% (30)
0                                  45% (78)    16% (27)     11% (19)
% correct                          100% (8/8)  71% (85/120) 71% (122/171)

Notes: The upper part shows percentages of cases where both, one, or none of the two sources were recovered by linear regression (supervised) in the original space (REG1) or in the expanded space with polynomials of degree three (REG3) or five (REG5). The lower part shows the same for ISFA (unsupervised except for the tuning of b_ICA). Each entry indicates the percentage (and number) of pairs with respect to the total of 171 pairs. The last line presents the percentage of both sources recovered correctly with respect to the number of mixtures invertible within the available expanded space by linear regression. In the case of two recovered sources, chance level is always smaller than 0.01%.
cases. It might be that the expansion was too low-dimensional and that a higher-dimensional expansion would have yielded the correct solution.

4. In 13 and 20 out of the 35 and 49 true failures for ISFA3 and ISFA5, respectively, Ψ_ICA and Ψ_SFA of the sources estimated by ISFA are both smaller than those of the original sources. In this case, the solution found is even better than the original sources in terms of the objective function, which indicates that there is something wrong with the objective function.

It might be possible to eliminate the failures of the first three cases by refining the algorithm, for example, by tuning the weighting factors better or by going to higher polynomials, but case 4 is more fundamental and requires reconsidering the objective function itself. In this latter case, the signals extracted by ISFA appear to be both slower and more mutually independent than the original sources. However, scatter plots of the estimated sources reveal that they are not statistically independent at all, but often one is largely a function of the other (see Figures 3 and 4). Thus, the ICA part of the objective function is not strong enough to ensure statistical independence of the estimated sources. The cross-correlation functions shown in
Figure 3: Scatter plot of two sources, their nonlinear mixture, and the sources estimated by ISFA in a failure case. (a) Sources. (b) Mixture. (c) Sources estimated by ISFA3 . (d) First source versus estimated first source (correlation coefficient 0.9771). (e) First source versus estimated second source (correlation coefficient 0.0377). (f) Second source versus estimated first source (correlation coefficient 0.0197). (g) Second source versus estimated second source (correlation coefficient 0.1301).
Figure 5 indicate that this problem is not due to the specific choice of the time delays, because the time-delayed cross-correlations of the estimated sources (mean ± std = 0 ± 0.0028) are overall smaller than those of the original sources (0 ± 0.0066). Even using different or more time delays, such a data set would have been processed incorrectly. We conclude that any measure of independence based on time-delayed correlations would be insufficient in our context.

Figure 5 suggested to us that sources with a large standard deviation of their cross-correlation function might be particularly difficult to separate with our ISFA algorithm. We tested this hypothesis but did not find a significant correlation with the failure cases. For an expansion with polynomials of degree three, even linear regression fails if the standard deviation of the first signal, which goes along the spiral, is large. For polynomials of degree five, linear regression always worked in our examples, but we suspected that separation might still be more difficult for sources with large standard deviation. But again, we did not find a significant correlation with the failure cases.

We argue here that the failures must be attributed to the weakness of the ICA term in the objective function. If the SFA term were too weak, it could happen that all output signal components are truly statistically
Figure 4: Scatter plots of the sources estimated by ISFA for some failure cases. It is clear that in these cases, the signal components are not statistically independent, even though the ICA term indicates so.
independent, but at least some of them are too quickly varying, so that they are not correlated to the sources but to some nonlinearly distorted version of the sources, something we did not observe. Also, the success in detecting the failure cases based on higher-order cumulants (see next section) indicates that the failures are due to the ICA term.

5.5 Unsupervised Detection of Failure Cases. A failure rate of about 30% (or even up to 50% for ISFA3 if one also counts the cases in which even linear regression was not able to recover the sources) is obviously not acceptable, unless one can detect the failure cases in an unsupervised manner. We use the weighted sum over the third- and fourth-order cross-cumulants,

Ψ_34(u) := (1/3!) Σ_{ijk ≠ iii} (C_ijk^(u))² + (1/4!) Σ_{ijkl ≠ iiii} (C_ijkl^(u))²,   (5.7)

as an independent measure of statistical independence to indicate with high values those cases in which the second-order ICA term has failed to yield independent output signal components. The factors 1/3! and 1/4! arise from an expansion of the Kullback-Leibler divergence in u, which provides a rigorous derivation of this criterion (Comon, 1994; McCullagh, 1987).
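For zero-mean signals, this measure can be evaluated by brute force (a sketch using the standard third- and fourth-order cumulant formulas; practical only for moderate signal lengths):

```python
import numpy as np

def psi_34(u):
    """Weighted sum of squared third- and fourth-order cross-cumulants
    (equation 5.7); u has shape (R, T) and is assumed zero mean."""
    R, T = u.shape
    cov = u @ u.T / T                                  # E[u_i u_j]
    c3 = np.einsum('it,jt,kt->ijk', u, u, u) / T       # third-order cumulants
    m4 = np.einsum('it,jt,kt,lt->ijkl', u, u, u, u) / T
    c4 = (m4 - np.einsum('ij,kl->ijkl', cov, cov)      # fourth-order cumulants
             - np.einsum('ik,jl->ijkl', cov, cov)
             - np.einsum('il,jk->ijkl', cov, cov))
    for i in range(R):                                 # drop the auto-cumulants
        c3[i, i, i] = 0.0
        c4[i, i, i, i] = 0.0
    return (c3**2).sum() / 6.0 + (c4**2).sum() / 24.0
```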
Figure 5: Cross-correlation functions of a failure case. (a) Cross-correlation function of the original sources. (b) Cross-correlation function of the estimated sources. Same data set as in Figure 3.
The receiver operating characteristic (ROC) curves in Figure 6 show that Ψ_34(u) is a good measure of success. These tests also included the cases where linear regression was not able to recover the sources. The area under the ROC curves is 0.952 and 0.988 for ISFA3 and ISFA5, respectively.

6 Conclusion

In the work presented here, we have addressed the problem of nonlinear BSS. It is known that in contrast to the linear case, statistical independence
Figure 6: ROC curves for the test of successful source separation based on Ψ_34(u), the weighted sum of third- and fourth-order cross-cumulants. The area under the curves is 0.952 and 0.988 for ISFA3 (dashed line) and ISFA5 (solid line), respectively.
alone is not a sufficient criterion for separating sources from a nonlinear mixture; additional criteria are needed to solve the problem of selecting the true sources (or good representatives thereof) from the many possible output signal components that would be statistically independent of other components. We claim here that for source signals with significant autocorrelations for time delay one, temporal slowness is a good criterion to solve this selection problem, because the slow components are those most likely related to the true sources by an invertible transformation; noninvertible transformations would typically lead to more quickly varying components.

Based on this assumption, we have derived an objective function that combines a term from second-order independent component analysis (ICA) with a term derived from slow feature analysis (SFA). Optimization of the new objective function is achieved by successive Givens rotations, a method often used in the context of ICA. We refer to the resulting algorithm as independent slow feature analysis (ISFA) to indicate the combination of ICA and SFA.

The algorithm is somewhat unusual in that only a small submatrix of large time-delayed correlation matrices is being diagonalized by the Givens rotations (usually the full matrices are being diagonalized). This opens the question of the uniqueness of the solution. Using randomly generated pseudocorrelation matrices, we have found that a minimum number of time delays is needed to obtain unique and meaningful solutions. For instance, if the upper left 2 × 2 submatrix of 9 × 9 matrices has to be diagonalized, at least 18 such matrices are needed to obtain a meaningful solution, which
would be unlikely to emerge by accident. With 17 matrices, on the other hand, good diagonalization can be achieved reliably even for random symmetric matrices. With (sufficiently many) surrogate matrices, structured such that they have a unique solution, we have subsequently verified that the algorithm reliably converges to the correct solution.

With tests on quite an extreme nonlinear mixture of two audio signals, we have shown that ISFA is indeed able to perform nonlinear BSS, often with high precision. However, in about 30% of the cases in which the true sources could have been extracted with the nonlinearity used (as verified by regression), ISFA failed to extract them. In many of these cases, the extracted signals were actually better than the original sources in both the SFA term and the ICA term of the objective function. This was a surprising finding for us, since it seems to contradict our basic assumption that a combination of slowness and statistical independence should permit reliable nonlinear BSS. Closer inspection, however, has revealed that the extracted output signal components only appear to be statistically independent in terms of the time-delayed second-order moments but that they are often highly related, as can be seen by visual inspection (see Figure 4) and automatically detected with a measure Ψ_34(u) based on higher-order cumulants. This is not a consequence of the particular choice of time delays we have used but would be expected for any general set of time delays, as can be seen from the cross-correlation functions (in Figure 5).

We believe that two important conclusions can be drawn from these results. Firstly, the success cases indicate that combining slowness and statistical independence is a promising approach to nonlinear BSS. Secondly, any measure of statistical independence based on (time-delayed) second-order moments is too weak to guarantee statistical independence in our context; it might even be too weak in any context where the dimensionality of the space in which the signal components are searched for is significantly larger than the number of components.

For a possible theoretical account of the failure of second-order ICA in our context, consider the following example. Given a symmetrically distributed source s_1, the correlation between, for instance, s_1 and s_1² vanishes (Harmeling et al., 2003, sec. 4.1). To the extent that this also holds for time-shifted versions s_1(t) and s_1²(t + τ) (cf. Harmeling et al., 2003, sec. 5.4), the statistical dependence between s_1 and s_1² does not manifest itself in the time-delayed correlations. Thus, second-order ICA cannot be expected to prevent extraction of s_1 and s_1² as the estimated sources, which can easily lead to a failure case, say, if s_1² is more slowly varying than s_2.

A failure rate of 30% would render the algorithm useless if it were not possible to detect the failure cases. We have shown that the measure Ψ_34(u), which is based on higher-order cumulants, permits failure detection with high reliability; the area under the ROC curve is greater than 0.95, resulting in true positive rates of 90% and 94% at false-positive rates of 5% and 10% for ISFA3 and ISFA5, respectively.
It might be possible to use Ψ_34(u) not only to detect the failure cases but also to automatically tune the weight b_ICA given b_SFA = 1 and to determine the number of sources. For the former, one could start with a small value of b_ICA, so that only the SFA term is effective and the extracted components might not be independent, and then increase b_ICA, so that the ICA term becomes increasingly effective, until the value of Ψ_34(u) drops below a certain threshold. Similarly, for determining the number of sources, one could start by running the algorithm with only two output components to be extracted and successively increase the number of components. One would then stop if adding another component would increase Ψ_34(u) significantly (which can obviously be detected only a posteriori).

More interesting, however, would be to use higher-order cumulants more directly to improve the algorithm. For instance, one could define a new objective function that is a combination of the SFA term used here and an ICA term like Ψ_34(u). Given the high reliability with which Ψ_34(u) can detect failure cases, we expect better performance with such a new objective function. However, higher-order cumulants are expensive to compute, especially for high-dimensional and long signals, so there is probably a trade-off between reliability and computational complexity. Exploring these possibilities will be the subject of our future research.

Appendix

The definitions of the constants d_n and e_n for the expression of the objective function, equation 4.9, follow directly from the multilinearity of C_...^(u)(τ). They are given in Table 3. Using trigonometry, we can derive simpler objective functions of the form
Case 1: Ψ_ISFA^µν(φ) = a_20 + c_24 cos(4φ) + s_24 sin(4φ),   (A.1)
Case 2: Ψ_ISFA^µν(φ) = a_20 + c_22 cos(2φ) + s_22 sin(2φ) + c_24 cos(4φ) + s_24 sin(4φ),   (A.2)

with constants defined in Table 4. In the next step, these objective functions are further simplified by combining the sine term and cosine term in a single cosine term. This results in

Case 1: Ψ_ISFA^µν(φ) = A_0 + A_4 cos(4φ + φ_4),   (A.3)
Case 2: Ψ_ISFA^µν(φ) = A_0 + A_2 cos(2φ + φ_2) + A_4 cos(4φ + φ_4),   (A.4)

with constants defined in Table 5.
Table 3: Constants in Equation 4.9. [For each of case 1 and case 2 separately, the table defines the constants d_c and d_0, ..., d_4 as sums over τ ∈ T_SFA of κ_SFA^τ-weighted quadratic products of the correlations C^(u′)(τ), and the constants e_c and e_0, e_1, e_2 as the corresponding κ_ICA^τ-weighted sums over τ ∈ T_ICA; in case 1, d_3 = d_4 = 0.]
It is easy to see why it is possible to write both objective functions A.3 and A.4 in such a simple form. First, the terms in equation 4.9 are products of at most four sin(φ) and cos(φ) functions, which allows at most a frequency of 4. Second, in case 1, Ψ_ISFA^µν(φ) has a periodicity of π/2 because rotations by multiples of π/2 correspond to a permutation (possibly plus a sign change) of the two components. Since both components are inside the subspace, permutations do not change the objective function, and the objective function has a π/2 periodicity. Thus, we conclude that only frequencies of 0 and 4 can be present in equation A.3. In case 2, since one component lies outside the subspace, an exchange of components will change the objective function, equation A.4. A rotation by multiples of π, however, which results only in a possible sign change, will leave the objective function unchanged, resulting in an objective function with π-periodicity and therefore frequencies of 0, 2, and 4.
Table 4: Constants in Equations A.1 and A.2 in Terms of the Constants of Table 3.

a_20: Case 1: (b_ICA/4)(4e_c + e_2 + 3e_0) − (b_SFA/4)(4d_c + d_2 + 3d_0); Case 2: (b_ICA/2)(2e_c + e_0 + e_2) − (b_SFA/8)(8d_c + 3d_0 + d_2 + 3d_4)
c_22: Case 1: –; Case 2: (b_ICA/2)(e_0 − e_2) − (b_SFA/2)(d_0 − d_4)
s_22: Case 1: –; Case 2: (b_ICA/2)e_1 − (b_SFA/4)(d_1 + d_3)
c_24: Case 1: (b_ICA/4)(e_0 − e_2) − (b_SFA/4)(d_0 − d_2); Case 2: −(b_SFA/8)(d_0 − d_2 + d_4)
s_24: Case 1: (b_ICA/4)e_1 − (b_SFA/4)d_1; Case 2: −(b_SFA/8)(d_1 − d_3)
Table 5: Constants in Equations A.3 and A.4 in Terms of the Constants of Table 4.

            Case 1                Case 2
A_0         a_20                  a_20
A_2         –                     √(c_22² + s_22²)
A_4         √(c_24² + s_24²)      √(c_24² + s_24²)
tan(φ_2)    –                     −s_22/c_22
tan(φ_4)    −s_24/c_24            −s_24/c_24
Acknowledgments

This work has been supported by the Volkswagen Foundation through a grant to L.W. for a junior research group. We thank Pietro Berkes for carefully going through all constants and Barak Pearlmutter for helpful comments. All examples have been implemented in Python using the Modular Toolkit for Data Processing (freely available online at http://mdp-toolkit.sourceforge.net).

References

Almeida, L. (2004). Linear and nonlinear ICA based on mutual information: The MISEP method. Signal Processing, 84(2), 231–245.
Amari, S., Cichocki, A., & Yang, H. (1995). Recurrent neural networks for blind separation of sources. In Proc. of the Int. Symposium on Nonlinear Theory and Its Applications (NOLTA-95) (pp. 37–42). Las Vegas. Available online at http://hawk.ise.chuo-u.ac.jp/NOLTA/95/ad-pro.html.
Babaie-Zadeh, M., Jutten, C., & Nayebi, K. (2002). A geometric approach for separating post-nonlinear mixtures. In Proc. of the XI European
Signal Processing Conference (EUSIPCO 2002) (pp. 11–14). Available online at http://www.eurasip.org/content/Eusipco/2002/articles/paper.701/html.
Belouchrani, A., Abed Meraim, K., Cardoso, J.-F., & Moulines, E. (1997). A blind source separation technique based on second order statistics. IEEE Transactions on Signal Processing, 45(2), 434–444.
Blaschke, T., Berkes, P., & Wiskott, L. (2006). What is the relation between independent component analysis and slow feature analysis? Neural Computation, 18(10), 2495–2508.
Blaschke, T., & Wiskott, L. (2004). CuBICA: Independent component analysis by simultaneous third- and fourth-order cumulant diagonalization. IEEE Transactions on Signal Processing, 52(5), 1250–1256.
Cardoso, J.-F. (2001). The three easy routes to independent component analysis: Contrasts and geometry. In Proc. of the 3rd Int. Conference on Independent Component Analysis and Blind Source Separation (ICA 2001). San Diego, CA: Institute for Neural Computation.
Cardoso, J.-F., & Souloumiac, A. (1996). Jacobi angles for simultaneous diagonalization. SIAM J. Mat. Anal. Appl., 17(1), 161–164.
Comon, P. (1994). Independent component analysis: A new concept? Signal Processing, 36(3), 287–314.
Harmeling, S., Ziehe, A., Kawanabe, M., & Müller, K.-R. (2003). Kernel-based nonlinear blind source separation. Neural Computation, 15, 1089–1124.
Hosseini, S., & Jutten, C. (2003). On the separability of nonlinear mixtures of temporally correlated sources. IEEE Signal Processing Letters, 10(2), 43–46.
Hyvärinen, A., & Pajunen, P. (1999). Nonlinear independent component analysis: Existence and uniqueness results. Neural Networks, 12(3), 429–439.
Jutten, C., & Karhunen, J. (2003). Advances in nonlinear blind source separation. In Proc. of the 4th Int. Symposium on Independent Component Analysis and Blind Signal Separation (pp. 245–256). Nara, Japan.
McCullagh, P. (1987). Tensor methods in statistics. London: Chapman and Hall.
Molgedey, L., & Schuster, G. (1994). Separation of a mixture of independent signals using time delayed correlations. Physical Review Letters, 72(23), 3634–3637.
Sprekeler, H., Zito, T., & Wiskott, L. (2007). Understanding slow feature analysis: A mathematical framework. Humboldt University. Manuscript in preparation.
Taleb, A. (2002). A generic framework for blind source separation in structured nonlinear models. IEEE Transactions on Signal Processing, 50(8), 1819–1830.
Taleb, A., & Jutten, C. (1997). Nonlinear source separation: The postnonlinear mixtures. In Proc. European Symposium on Artificial Neural Networks (pp. 279–284). Bruges, Belgium. Available online at http://www.dice.ucl.ac.be/esann/proceedings/papers.php?ann-1997.
Taleb, A., & Jutten, C. (1999). Source separation in post non linear mixtures. IEEE Transactions on Signal Processing, 47(10), 2807–2820.
Tong, L., Liu, R., Soon, V. C., & Huang, Y.-F. (1991). Indeterminacy and identifiability of blind identification. IEEE Transactions on Circuits and Systems, 38(5), 499–509.
Wiskott, L., & Sejnowski, T. (2002). Slow feature analysis: Unsupervised learning of invariances. Neural Computation, 14(4), 715–770.
Yang, H.-H., Amari, S., & Cichocki, A. (1998). Information-theoretic approach to blind separation of sources in non-linear mixture. Signal Processing, 64(3), 291–300.
Ziehe, A., Kawanabe, M., Harmeling, S., & Müller, K.-R. (2003). Blind separation of post-nonlinear mixtures using linearizing transformations and temporal decorrelation. Journal of Machine Learning Research, 4, 1319–1338.
Ziehe, A., & Müller, K.-R. (1998). TDSEP: An efficient algorithm for blind separation using time structure. In Proc. of the 8th Int. Conference on Artificial Neural Networks (ICANN'98) (pp. 675–680). Berlin: Springer-Verlag.
Received September 17, 2004; accepted July 21, 2006.
LETTER
Communicated by Laurenz Wiskott
A Maximum-Likelihood Interpretation for Slow Feature Analysis

Richard Turner
[email protected]
Maneesh Sahani
[email protected]
Gatsby Computational Neuroscience Unit, University College London, London, WC1N 3AR, U.K.
The brain extracts useful features from a maelstrom of sensory information, and a fundamental goal of theoretical neuroscience is to work out how it does so. One proposed feature extraction strategy is motivated by the observation that the meaning of sensory data, such as the identity of a moving visual object, is often more persistent than the activation of any single sensory receptor. This notion is embodied in the slow feature analysis (SFA) algorithm, which uses “slowness” as a heuristic by which to extract semantic information from multidimensional time series. Here, we develop a probabilistic interpretation of this algorithm, showing that inference and learning in the limiting case of a suitable probabilistic model yield exactly the results of SFA. Similar equivalences have proved useful in interpreting and extending comparable algorithms such as independent component analysis. For SFA, we use the equivalent probabilistic model as a conceptual springboard with which to motivate several novel extensions to the algorithm.

1 Introduction

The meaning of sensory information often varies more slowly than the activity of low-level sensory receptors. For example, the output of photoreceptors directed toward a swaying tree on a bright windy day may flicker, but the identity and percept of the tree remain invariant. Observations of this sort motivate temporal constancy, or slowness, as a useful learning principle to extract meaningful higher-level descriptions of sensory data (Hinton, 1989; Földiák, 1991; Mitchison, 1991; Stone, 1996).

The slowness learning principle is at the core of the slow feature analysis (SFA) algorithm (Wiskott & Sejnowski, 2002). SFA linearly extracts slowly varying, uncorrelated projections of multidimensional time-series data, ordered by their slowness. When SFA is trained on a nonlinear expansion of a video of natural scene patches, the filter outputs are found to resemble the receptive fields of complex cells (Berkes & Wiskott, 2005). Slowness

Neural Computation 19, 1022–1038 (2007)
C 2007 Massachusetts Institute of Technology
and decorrelation of features (in SFA and similar algorithms, e.g., Kayser, Körding, & König, 2003; Körding, Kayser, Einhäuser, & König, 2004) thus provide an interesting alternative heuristic to sparseness and independence for recovering receptive fields similar to those observed in the visual system (e.g., Olshausen & Field, 1996; Bell & Sejnowski, 1997).

This letter provides a probabilistic interpretation for SFA. Such a perspective has been useful for both principal component analysis (PCA; see Tipping & Bishop, 1999) and, in particular, independent component analysis (ICA; see MacKay, 1999; Pearlmutter & Parra, 1997), motivating many new learning algorithms (covariant, variational; see Miskin, 2000) and generalizations to the model (independent factor analysis (Attias, 1999)) and gaussian scale mixture priors (Karklin & Lewicki, 2005). More generally, probabilistic models have several desirable features. For instance, they force the tacit assumptions of algorithms to be made explicit, allowing them to be criticized and improved more easily. Furthermore, a number of general tools have been developed for learning and inference in such models, for example, methods to handle missing data (Dempster, Laird, & Rubin, 1977).

The probabilistic framework is a powerful one. Fortunately, it is intuitive too, and heuristics such as sparseness and slowness can naturally be translated into probabilistic priors. Indeed, previous work has illustrated how to combine the two approaches into a common probabilistic model (Hyvärinen, Hurri, & Väyrynen, 2003; Hurri & Hyvärinen, 2003). The advantage of the probabilistic approach is that it softens the heuristics and allows them to trade off against one another. One of the contributions of this letter is to place SFA into a proper context within this common framework.

In the following, we introduce SFA and provide a geometrical picture of the algorithm. Next, we develop intuition for a class of models in which maximum-likelihood learning has a similar flavor to SFA, before proving exact equivalence under certain conditions. In the final section, we use this maximum-likelihood interpretation to motivate a number of interesting probabilistic extensions to the SFA algorithm.

2 Slow Feature Analysis

Formally, the SFA algorithm can be defined as follows. Given an M-dimensional time series x_t, t = 1, ..., T, find M weight vectors {w_m}_{m=1}^M such that the average squared temporal difference of the output signals ŷ_{m,t} = w_m^T x_t is minimized:

ŵ_m = arg min_{w_m} ⟨(ŷ_{m,t} − ŷ_{m,t−1})²⟩,   (2.1)

given the constraint

⟨ŷ_{m,t} ŷ_{n,t}⟩ − ⟨ŷ_{m,t}⟩⟨ŷ_{n,t}⟩ = δ_mn.   (2.2)
Without loss of generality, we can assume the time series has zero mean, hence ⟨ŷ_{m,t}⟩ = 0. The constraint is important in that it is necessary to prevent the trivial solutions ŷ_{m,t} = 0 and to ensure that the output signals report different aspects of the stimulus: ŷ_{m,t} ≠ ŷ_{n,t}. Furthermore, it also imposes an ordering on the solutions. We shall see below that the constraint plays as important a role in shaping the results of SFA as does the minimization itself. In the following, we use ẏ to represent the time differences of the features, ŷ_t − ŷ_{t−1}, and similarly ẋ for the data time differences.

With SFA defined in this way, it can be shown that the optimal weights satisfy the generalized eigenvalue equation

WA = Ω̄² WB,   (2.3)

where we have defined ŷ_t = Wx_t (that is, the matrix W collects the weight vectors w_m), A_mn = ⟨ẋ_m ẋ_n⟩, B_mn = ⟨x_m x_n⟩, and Ω̄²_mn = ω̄_m² δ_mn. The slow features are ordered from slowest to fastest by the eigenvalues ω̄_m². As there are efficient methods for solving the generalized eigenvalue equation, learning as well as inference is very fast.
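Equation 2.3 translates directly into a few lines of code (a sketch of ours in Python, using SciPy's generalized symmetric eigensolver; the B-orthonormal eigenvectors automatically satisfy the unit-variance constraint of equation 2.2):

```python
import numpy as np
from scipy.linalg import eigh

def sfa(x):
    """Linear SFA: x has shape (T, M) with rows as time steps; returns
    the weight matrix W (slowest feature first) and the outputs y."""
    x = x - x.mean(axis=0)
    dx = np.diff(x, axis=0)
    A = dx.T @ dx / (len(x) - 1)    # covariance of temporal differences
    B = x.T @ x / len(x)            # covariance of the data
    evals, evecs = eigh(A, B)       # solves A w = omega^2 B w (equation 2.3)
    W = evecs.T                     # eigenvalues ascending: slowest first
    return W, x @ W.T
```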
(3.1)
Geometrically, as illustrated in Figure 1, the rows of W are constrained to lie on a hyper-ellipsoid, but we are free to choose their rotation R (Särelä & Valpola, 2005). PCA makes one particular choice, R = I, and thus the rows of the principal loading matrix (analogous to W) are parallel to the eigenvectors of B. They may be ordered by the eigenvalues of B, which give the variances of the data in each principal direction. Various ICA algorithms choose a nontrivial rotation R so as to maximize kurtosis or some other statistic of the projected data. In this case, a natural ordering is given by the value of the projected statistic. By contrast, SFA chooses a rotation based on dynamical information, using the normalized eigenvectors of the covariance matrix of the temporal
Figure 1: The geometrical interpretation of PCA and SFA in two dimensions. The left column illustrates PCA and the right column SFA. (A) The data x_{1,t}, x_{2,t} (black dots) and the temporal differences ẋ_{1,t}, ẋ_{2,t} (gray dots). (B) Data space, left panel: the covariance of the data illustrated schematically by a black ellipse, x^T B x = 1; the principal axes are shown by the black lines. Right: the covariance of the data, and the covariance of the time derivatives (gray dotted ellipse), x^T A x = 1. (C) Sphered (PCA) space resulting from the linear transform B^{−T/2} (a rotation determined by the eigenvectors of B and a rescaling by the square root of the corresponding eigenvalues). Left panel: the sphered covariance matrix (black circle), the principal axes of which (black lines) are now of equal length. Right panel: the sphered covariance and the transformed covariance of the time derivatives (gray dotted ellipse), the principal axes of which (gray dotted lines) are not aligned with the axes of the sphered space. (D) SFA space: an additional rotation R is made to align the axes of the space with the transformed time-difference covariance. (E) PCA and SFA weights W shown in weight space. Other choices for the rotation matrix correspond to points on the ellipsoid chosen to be orthogonal in the metric of the ellipse.
differences of the sphered data.¹ This can be seen by substituting equation 3.1 into equation 2.3, leading to RB^{−1/2}AB^{−1/2} = Ω̄²R. Geometrically, SFA corresponds to transforming into the sphered (PCA) space and then rotating again to align the axes of the transformed data with the principal axes of the transformed time differences.

The natural ordering is then given by the eigenvalues of the temporal-difference covariance matrix. If each projection ŷ_t varies sinusoidally in time, the eigenvalues are the squares of the corresponding frequencies. More generally, the eigenvalues give the average squared frequency of the extracted features, where the average is taken over the normalized spectral power density, that is,

ω̄_n² = Σ_ω ω² |ỹ_n(ω)|² / Σ_{ω′} |ỹ_n(ω′)|²,   (3.2)
where ỹ_n(ω) is the ω-component of the discrete Fourier transform of the nth feature.²

The geometric perspective makes clear one important difference between PCA and SFA: PCA is dependent on the relative scales of the measurements, but SFA is not. That is, rescaling a subset of dimensions of the observations x_{n,1:T} alters the eigenvectors of the covariance matrix, and consequently the extracted time series y_{1:T} that PCA recovers is also changed. Critically, SFA finds an interesting rotation after sphering the data, and so the extracted time series y_{1:T} that it recovers is insensitive to the rescaling. For this reason, SFA can be used meaningfully with an expanded set of pseudo-observations (whose scale is arbitrary), while PCA cannot. This is a property that SFA shares with other algorithms such as factor analysis (Roweis & Ghahramani, 1999).

4 Toward a Probabilistic Interpretation

The very notion of extracting features based on temporal slowness is captured naturally in the generative modeling framework, and this suggests that SFA might be amenable to such an interpretation. To show the correspondence between the SFA algorithm and maximum-likelihood learning of a particular probabilistic latent variable model (perhaps in a particular limit), we interpret SFA's weights as recognition weights derived from the
¹ However, some variants of ICA are closely related; see Blaschke, Berkes, and Wiskott (2006).
² To prove this result, note that equation 2.3 gives w_n^T A w_n = ⟨ẏ_n²⟩ = ω̄_n² ⟨ŷ_n²⟩. By Parseval's theorem, Σ_t |ŷ_{n,t}|² = Σ_ω |ỹ_n(ω)|² and (using the Fourier transform of the derivative operator) Σ_t |ẏ_{n,t}|² = Σ_ω |ω ỹ_n(ω)|². Substituting these in for the two averages gives equation 3.2.
statistical inverse of a generative model. The output signals then have a natural interpretation as the mean or mode of the posterior distribution over latent variables, given the data. Our goal is to unpack the assumptions implicit in SFA's cost function and constraints and intuitively build corresponding factors into a simple candidate form of generative model. This will then be formally analyzed to rederive the SFA algorithm under certain conditions.

One benefit of this new perspective relates to the hard constraints that had to be introduced to prevent SFA from recovering trivial solutions. In a probabilistic model, softened versions of these constraints might be expected to arise naturally. For example, in SFA, the scale of the output signals must be set by hand to prevent them from being shrunk away by the smoothness objective function. The temporal difference cost function depends on only the recovered signals, and there is no penalty for eliminating information about the measured data. In a probabilistic setting, a soft volume penalty arising from a combination of the normalizing constant of the prior and the information-preserving likelihood term will prevent shrinking in an automatic manner. However, one of the consequences of such a soft constraint is that the scale of each of the latents may not be identical. Fortunately, this ambiguity does not affect the ordering of the solutions. Similarly, the other constraint of SFA, that the output signals be decorrelated, emerges by choosing a decorrelated prior. Indeed, as with probabilistic PCA and ICA, we choose a fully factored prior distribution on the latent sources. Our ultimate choice will be gaussian, so that decorrelation and independence are the same at each time step, although the factored form also implies that the processes are decorrelated at different time steps.

Thus, consideration of the SFA constraints suggests a factored prior over the latent variables. The cost function, which penalizes the sum of squared differences between adjacent time steps, further specifies the form of the prior on each latent time series. The squared cost is consistent with an underlying gaussian distribution, while the penalty on adjacent differences suggests a Markov chain structure. A reasonable candidate is thus a one-time-step linear-gaussian dynamical system (or AR(1) autoregressive model):

p(y_t | y_{t−1}, λ_{1:N}, σ²_{1:N}) = Π_{n=1}^N p(y_{n,t} | y_{n,t−1}, λ_n, σ_n²)
p(y_{n,t} | y_{n,t−1}, λ_n, σ_n²) = Norm(λ_n y_{n,t−1}, σ_n²)
p(y_{n,1} | σ²_{n,1}) = Norm(0, σ²_{n,1}).   (4.1)
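Intuition for this prior can be gained by sampling from it (a sketch of ours; the innovation variances are set to σ_n² = 1 − λ_n², which, as discussed below, gives each series unit stationary variance, and the initial state is drawn from the stationary distribution of equation 4.3):

```python
import numpy as np

def sample_ar1_prior(lambdas, T, rng=None):
    """Draw each latent series from the AR(1) prior of equation 4.1,
    with sigma_n^2 = 1 - lambda_n^2 so the stationary variance is one."""
    rng = np.random.default_rng() if rng is None else rng
    lambdas = np.asarray(lambdas)
    sigma = np.sqrt(1.0 - lambdas**2)
    y = np.empty((len(lambdas), T))
    y[:, 0] = rng.standard_normal(len(lambdas))   # stationary initial state
    for t in range(1, T):
        y[:, t] = lambdas * y[:, t-1] + sigma * rng.standard_normal(len(lambdas))
    return y
```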
Intuitively, the λn control the strength of the correlations between the latent variables at different time points and therefore their slowness. If λn = 0, then successive variables are uncorrelated, and the processes vary
rapidly. As λ_n → 1, the processes become progressively more correlated and hence slower. More generally, if |λ_n| < 1, then after a long time (t → ∞), the prior distribution of each latent series settles into a stationary state with temporal covariance, or autocorrelation, given by

⟨y_{n,t} y_{n,t−τ}⟩ → (σ_n² / (1 − λ_n²)) λ_n^{|τ|}.   (4.2)
The “effective memory” of this prior (defined as the time difference over which correlations fall by 1/e) is thus τ_eff = −1/log λ_n. At first glance, the choice of |λ_n| < 1 would seem to suggest that each latent process shrinks toward 0 with time, but this is clearly at odds with the stationary distribution derived above. This apparent conflict is resolved by noting that while the mean of the one-step conditional distribution is indeed smaller than the preceding value, ⟨y_{n,t} | y_{n,t−1}⟩ = λ_n y_{n,t−1}, the second moment includes the effects of the innovations process: ⟨y²_{n,t} | y_{n,t−1}⟩ = λ_n² y²_{n,t−1} + σ_n². Thus, if y²_{n,t−1} < σ_n²/(1 − λ_n²), the conditional expected square is larger than y²_{n,t−1}. Indeed, it is clear that the only way to achieve a stationary AR(1) process with nonzero innovations noise is to choose |λ_n| < 1.

Two useful corollaries follow from equation 4.2. First, it is possible to express the prior expectation that each latent process is stationary from the outset (that is, without an initial transient) by choosing its initial distribution to match the long-run variance:

σ²_{n,1} = σ_n² / (1 − λ_n²).   (4.3)
This form will be assumed in the following, but it is not essential to the derivation. Second, by choosing \sigma_n^2 = 1 - \lambda_n^2, we can set the stationary variance of the prior to one, a fact that we will make use of later. It is important to note here that we have so far discussed the properties of the prior p(\mathbf{y}_{1:T}), and that these may generally be quite different from the properties of the posterior p(\mathbf{y}_{1:T} \mid \mathbf{x}_{1:T}). In addition to the prior on the latent variables, the complete specification of a generative model requires a (potentially probabilistic) mapping from the latents to the observations. This is constrained by noting that inference in SFA is instantaneous: a feature at time t is derived through a linear combination of only the current observations, without reference to earlier or later observed data. In the general Markov dynamical model, accurate inference requires information from multiple time points. For example, adding a linear-gaussian output mapping to the latent model (see equation 4.1) leads to inference through the well-known Kalman smoothing recursions in time. The only way for instantaneous linear recognition to be optimal is for the mapping from latents to observations to be deterministic and
linear; the matrix of generative weights is then the inverse of the recognition matrix W. It is useful to view this deterministic process as the limit of a probabilistic mapping (Tipping & Bishop, 1999), and a natural choice is

p(\mathbf{x}_t \mid \mathbf{y}_t, W, \sigma_x) = \mathrm{Norm}(W^{-1} \mathbf{y}_t, \sigma_x^2 I),   (4.4)
where the deterministic mapping is recovered as \sigma_x^2 \to 0. This completes the specification of the model, which is a member of the linear gaussian state-space models (Roweis & Ghahramani, 1999), in the limit of zero observation noise. We have described why it might have the same flavor as SFA, but it is certainly not clear a priori under what restrictions this equivalence will hold. For instance, we might worry that peculiar settings of the transition dynamics and state noise might be required. Fortunately, these conditions can be derived analytically and are found to be surprisingly unrestrictive. Denoting the parameters \theta = \{\sigma^2_{1:N}, \lambda_{1:N}, \sigma_x^2, W\}, we first form the likelihood function:

p(\mathbf{x}_{1:T} \mid \theta) = \int d\mathbf{y}_{1:T} \prod_{t=1}^{T} p(\mathbf{x}_t \mid \mathbf{y}_t, W, \sigma_x^2) \prod_{n=1}^{N} \left[ p(y_{n,1} \mid \sigma^2_{n,1}) \prod_{t=2}^{T} p(y_{n,t} \mid y_{n,t-1}, \lambda_n, \sigma_n^2) \right]   (4.5)

= \int d\mathbf{y}_{1:T} \prod_{t=1}^{T} \delta(\mathbf{x}_t - W^{-1}\mathbf{y}_t) \times \prod_{n=1}^{N} \frac{1}{Z} \exp\left( -\left[ \frac{1}{2\sigma^2_{n,1}} y_{n,1}^2 + \frac{1}{2\sigma_n^2} \sum_{t=2}^{T} (y_{n,t} - \lambda_n y_{n,t-1})^2 \right] \right),   (4.6)
where we have taken the limit \sigma_x^2 \to 0 in the likelihood term. Completing the integrals, the log likelihood is

L(\theta) = \log p(\mathbf{x}_{1:T} \mid \theta)   (4.7)

= c + T \log|\det W| - \sum_{n=1}^{N} \left[ \frac{1}{2\sigma^2_{n,1}} (\mathbf{w}_n^T \mathbf{x}_1)^2 + \frac{1}{2\sigma_n^2} \sum_{t=2}^{T} \left( \mathbf{w}_n^T \mathbf{x}_t - \lambda_n \mathbf{w}_n^T \mathbf{x}_{t-1} \right)^2 \right],   (4.8)
where the constant c is independent of the weights. Expanding the square yields

L(\theta) = c + T \log|\det W| - \sum_{n=1}^{N} \frac{1}{2\sigma_n^2} \left[ \frac{\sigma_n^2}{\sigma^2_{n,1}} \mathbf{w}_n^T \mathbf{x}_1 \mathbf{x}_1^T \mathbf{w}_n + \sum_{t=2}^{T} \mathbf{w}_n^T \mathbf{x}_t \mathbf{x}_t^T \mathbf{w}_n + \lambda_n^2 \sum_{t=1}^{T-1} \mathbf{w}_n^T \mathbf{x}_t \mathbf{x}_t^T \mathbf{w}_n - \lambda_n \sum_{t=2}^{T} \mathbf{w}_n^T \left( \mathbf{x}_t \mathbf{x}_{t-1}^T + \mathbf{x}_{t-1} \mathbf{x}_t^T \right) \mathbf{w}_n \right].   (4.9)

The following identities are then useful:

\sum_{t=2}^{T} \mathbf{x}_t \mathbf{x}_t^T = T B - \mathbf{x}_1 \mathbf{x}_1^T   (4.10)

\sum_{t=1}^{T-1} \mathbf{x}_t \mathbf{x}_t^T = T B - \mathbf{x}_T \mathbf{x}_T^T   (4.11)

\sum_{t=2}^{T} \mathbf{x}_t \mathbf{x}_{t-1}^T + \sum_{t=2}^{T} \mathbf{x}_{t-1} \mathbf{x}_t^T = 2 T B - (T-1) A - \mathbf{x}_1 \mathbf{x}_1^T - \mathbf{x}_T \mathbf{x}_T^T,   (4.12)
where we remind the reader, 1 (xm,t+1 − xm,t )(xn,t+1 − xn,t ) T −1 T−1
Amn = xm xn =
t=1
Bmn = xm xn =
T 1 xm,t xn,t . T t=1
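In code, A and B are just sample second moments. A minimal NumPy sketch (x is assumed to be an M x T array of observations, taken to be zero-mean for simplicity); the derivation then continues below:

    import numpy as np

    def sfa_moments(x):
        """x: (M, T) array of zero-mean observations. Returns (A, B)."""
        T = x.shape[1]
        dx = np.diff(x, axis=1)       # temporal differences x_{t+1} - x_t
        A = dx @ dx.T / (T - 1)       # covariance of the differences
        B = x @ x.T / T               # covariance of the data
        return A, B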
Substituting and collecting terms yields

L(\theta) = c + T \log|\det W| - \sum_{n=1}^{N} \frac{T}{2\sigma_n^2} \mathbf{w}_n^T \left[ B(1-\lambda_n)^2 + \frac{T-1}{T} A \lambda_n + \frac{\lambda_n(1-\lambda_n)}{T} \left( \mathbf{x}_1\mathbf{x}_1^T + \mathbf{x}_T\mathbf{x}_T^T \right) \right] \mathbf{w}_n.   (4.13)

As the number of observations increases, T \to \infty, the relative contribution from edge effects reduces: \frac{\lambda_n(1-\lambda_n)}{T} \left( \mathbf{x}_1\mathbf{x}_1^T + \mathbf{x}_T\mathbf{x}_T^T \right) \to 0 and \frac{T-1}{T} \to 1.
Therefore, assuming a large number of data points:

L(\theta) \approx c + T \log|\det W| - \sum_{n} \frac{T}{2\sigma_n^2} \mathbf{w}_n^T \left[ B(1-\lambda_n)^2 + A\lambda_n \right] \mathbf{w}_n   (4.14)

= c + T \log|\det W| - \frac{T}{2} \mathrm{tr}\left[ W B W^T \Lambda^{(2)} + W A W^T \Lambda^{(1)} \right],   (4.15)
where \Lambda^{(1)}_{mn} = \delta_{mn} \frac{\lambda_n}{\sigma_n^2} and \Lambda^{(2)}_{mn} = \delta_{mn} \frac{(1-\lambda_n)^2}{\sigma_n^2}. Differentiating this expression with respect to the recognition weights (assuming the determinant is not zero) recovers the following condition:

\frac{dL(\theta)}{dW} \propto W^{-T} - \left( \Lambda^{(2)} W B + \Lambda^{(1)} W A \right).   (4.16)
Setting this to 0 and rearranging, we obtain a condition on the maximum:

\langle \hat{y}_{m,t} \hat{y}_{n,t} \rangle = \frac{\sigma_n^2}{(1-\lambda_n)^2} \delta_{mn} - \frac{\lambda_n}{(1-\lambda_n)^2} \langle \dot{\hat{y}}_{m,t} \dot{\hat{y}}_{n,t} \rangle,   (4.17)
where, as defined earlier, \hat{y}_{m,t} = \mathbf{w}_m^T \mathbf{x}_t, and \dot{\hat{y}}_{m,t} denotes the temporal difference of the recovered signal. Now, as the first term on the right-hand side of equation 4.17 is diagonal, symmetry of the left-hand side requires that, for off-diagonal terms,

\frac{\lambda_n}{(1-\lambda_n)^2} \langle \dot{\hat{y}}_{m,t} \dot{\hat{y}}_{n,t} \rangle = \frac{\lambda_m}{(1-\lambda_m)^2} \langle \dot{\hat{y}}_{n,t} \dot{\hat{y}}_{m,t} \rangle.   (4.18)
But if we choose \lambda_m \neq \lambda_n for m \neq n, this condition can be met only when the covariance matrix of the latent temporal differences is diagonal, which makes the covariance of the latents themselves diagonal by equation 4.17. This implies that we have recovered the SFA directions, but not necessarily the correct scaling. In other words, the most probable output signals are decorrelated but not of equal power; further rescaling is necessary to achieve sphering. This is a consequence of the soft volume constraints and might be a desirable result for reasons discussed earlier. Surprisingly, the maximum-likelihood weights do not depend on the exact setting of the \lambda_n, so long as they are all different. If 0 < \lambda_n < 1 \; \forall n, then larger values of \lambda_n correspond to slower latents. This corresponds directly to the ordering of the solutions from SFA. To recover exact equivalence to SFA, another limit is required that corrects the scales. There are several choices, but a natural one is to let \sigma_n^2 = 1 - \lambda_n^2 (which fixes the prior covariance of the latent chains to be one, as discussed earlier) and then take the limit \lambda_1 \leq \lambda_2 \leq \cdots \leq \lambda_N \to 0,
which implies

\langle \hat{y}_{m,t} \hat{y}_{n,t} \rangle \to (1 + 2\lambda_n)\delta_{mn} - \lambda_n \langle \dot{\hat{y}}_{m,t} \dot{\hat{y}}_{n,t} \rangle \to \delta_{mn}.   (4.19)
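The directions characterized by this analysis are exactly those computed by the standard SFA eigendecomposition: the recognition weights are generalized eigenvectors of the pair (A, B), ordered by slowness. A sketch, reusing sfa_moments from the earlier fragment (SciPy's eigh handles the generalized symmetric problem):

    import numpy as np
    from scipy.linalg import eigh

    def sfa_weights(x):
        A, B = sfa_moments(x)
        # Generalized symmetric eigenproblem A w = mu B w; eigenvalues are
        # returned in ascending order, so the slowest direction comes first.
        mu, W = eigh(A, B)
        # Columns of W satisfy w^T B w = 1, that is, unit-power (sphered) outputs;
        # the maximum-likelihood solution recovers the same directions up to scale.
        return W, mu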
Using the geometric intuitions of section 3, at the limit the weights sphere the inputs and therefore lie somewhere on the shell of the hyper-ellipsoid defined by the data covariance. However, as the limit is taken, the weights must be parallel to those of SFA (as we have argued above). Provided the behavior is smooth, as the perturbation vanishes, the weights must drop onto the hyper-ellipsoid at precisely the point specified by SFA. The analytical results presented here have been verified using computer simulations, for instance, using the expectation-maximization (EM; Dempster et al., 1977) algorithm and annealing the output noise and transition matrices appropriately so as to take the approximate limit (see Figure 2, top). However, learning by EM is very slow when the variance of the output noise \sigma_x^2 is small; this is alleviated somewhat by the annealing process, but other likelihood optimization schemes might be more suitable. In the above analysis, we thought of the inputs \mathbf{x}_{1:T} as observations from some temporal process. However, the argument we have presented is agnostic to the actual source of the inputs. This means, for example, that the recognition model and learning as presented above are formally equivalent to SFA even when the inputs \mathbf{x}_{1:T} themselves are formed from nonlinear combinations of observations (as is usually the case in applications of the SFA algorithm). However, the generative model is not likely to correspond to a particularly sensible selection of assumptions in this case, as will be discussed in the next section.

5 Extensions

The probabilistic modeling perspective suggests we could use a linear gaussian state-space model with a factorial prior in place of SFA and learn the state and observation noise, that is, the dynamics, as well as the weights (for which SFA would provide an intelligent initialization). This probabilistic version improves over standard SFA when there is substantial observation noise and if there are missing data, even providing a natural measure of confidence under these conditions (see Figure 2). More radical variants are also interesting.

5.1 Bilinear Extensions. Loosely speaking, there are at least two broad types of slow features in images: object identity (what, or content information) and object location and pose (where, or style information), and there is evidence that there are parallel pathways for the processing of each sort in the brain. Bilinear learning algorithms have recently been developed in which the effects of the two are segregated (Tenenbaum & Freeman, 2000;
Figure 2: Comparisons between slow feature analysis and probabilistic SFA on three data sets drawn from the forward model. Left column: The three-dimensional observations. Right column: The slowest latent variable/extracted feature. The dashed line is the actual value of the slowest latent, the gray line is the slowest feature extracted by slow feature analysis, and the gray scale indicates the posterior distribution from the probabilistic model. (Row A) Low observation noise. Both SFA and probabilistic SFA recover the true latents accurately. For each algorithm, we calculate the root mean square (RMS) error between the reconstructed and true value of the latent process. The difference between these errors, divided by the RMS amplitude of the latent process, gives the proportional difference in errors, D. In this case, D = 0.03, indicating that the methods perform similarly at low observation noise levels. (Row B) Higher observation noise. SFA is adversely affected, but the probabilistic algorithm makes reasonable predictions. The difference in the RMS errors is now comparable to the amplitude of the latent signal (D = 1.10). (Row C) As for B, but with missing data. The probabilistic model can predict the intermediate latents in a principled manner, but SFA cannot. Using an ad hoc straight line interpolation of the normal SFA solution, the RMS error over the region of missing data is substantially greater (D = 1.4), indicating the superiority of the probabilistic interpolation.
Grimes & Rao, 2005). SFA, by contrast, confounds these two types of signal because it extracts them both into one type of latent variable. A natural extension to the SFA model might therefore augment the continuous “where” latent variables (yn,t ) with a new set of binary “what” latent variables (sn,t ),
forming independent discrete Markov chains with transition matrices T^{(n)}, and combining the two bilinearly:

p(\mathbf{s}_t \mid \mathbf{s}_{t-1}, \{T^{(n)}\}_{n=1}^{N}) = \prod_{n=1}^{N} p(s_{n,t} \mid s_{n,t-1}, T^{(n)})   (5.1)

p(s_{n,t} = a \mid s_{n,t-1} = b, T^{(n)}) = T^{(n)}_{ab}   (5.2)

p(\mathbf{x}_t \mid \mathbf{y}_t, \mathbf{s}_t, \{g_{mn}\}_{n=1,m=1}^{N,M}, \sigma_x) = \mathrm{Norm}\left( \sum_{mn} g_{mn} y_{n,t} s_{m,t}, \sigma_x^2 I \right).   (5.3)
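To make the construction concrete, here is a sketch of ancestral sampling from the bilinear model; the dimensions, transition probabilities, and mixing weights g_mn (taken here to be D-dimensional output vectors) are illustrative choices, not values from the letter:

    import numpy as np

    rng = np.random.default_rng(1)
    N, M, D, T = 3, 3, 5, 500          # "where" latents, "what" switches, outputs, steps
    lam = np.array([0.99, 0.9, 0.5])   # slowness of the continuous latents
    G = rng.normal(size=(D, M, N))     # bilinear weights g_mn, one D-vector per (m, n)
    p_stay = 0.95                      # symmetric two-state transition matrices T^(n)

    s = np.zeros((M, T), dtype=int)    # binary "what" chains
    y = np.zeros((N, T))               # AR(1) "where" latents
    x = np.zeros((D, T))               # observations
    for t in range(1, T):
        stay = rng.random(M) < p_stay
        s[:, t] = np.where(stay, s[:, t - 1], 1 - s[:, t - 1])
        y[:, t] = lam * y[:, t - 1] + np.sqrt(1 - lam**2) * rng.normal(size=N)
        # Bilinear combination sum_{mn} g_mn y_{n,t} s_{m,t}, plus output noise.
        x[:, t] = np.einsum('dmn,m,n->d', G, s[:, t], y[:, t]) + 0.01 * rng.normal(size=D)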
Exact inference in such a model would be intractable, but the maximum likelihood parameters could be approximated, for example, by using variational EM (Neal & Hinton, 1998; Jordan, Ghahramani, Jaakkola, & Saul, 1999).

5.2 Generalized Bilinear Extensions. Traditionally, SFA has been made into a nonlinear algorithm by expanding the observations through a large set of nonlinearities and using these as new inputs. Probabilistic models are not well suited to such an approach for two reasons: (1) they are much slower to train when the dimensionality of the inputs is large, and (2) the expansion introduces complex dependencies between the new observations that a standard probabilistic model cannot capture. We consider the second of these points in more detail next, first noting that an interesting alternative that sidesteps these two issues is to learn a nonlinear mapping from the latents to the observations directly. One straightforward approach would be to use a generalized linear (or neural network) mapping,
p(\mathbf{x}_t \mid \mathbf{y}_t, \mathbf{s}_t, \{g_{mn}\}_{n=1,m=1}^{N,M}, \sigma_x) = \mathrm{Norm}\left( f\left( \sum_{mn} g_{mn} y_{n,t} s_{m,t} \right), \sigma_x^2 I \right),   (5.4)
for a nonlinear link function f(\cdot). This form would also easily accommodate types of observation for which a normal output model would be inappropriate—binary data, for instance. Again, approximations would be required for learning.

5.3 Nonlinear Extensions. The above is a fairly conventional extension to time-series models (e.g., Ghahramani & Roweis, 1999). Further consideration of the second point above motivates an alternative that is more in the spirit of traditional SFA. In the expanded observation space, points corresponding to realizable data lie on low-dimensional manifolds (see Figure 3 for an example). Therefore, a good generative model should assign probability only to these manifolds and not “waste” it on the unrealizable volumes.
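A minimal sketch of one draw from the generalized bilinear output mapping in equation 5.4, with tanh standing in for the generic link function f (the choice of link here is an illustrative assumption):

    import numpy as np

    def sample_observation(G, y_t, s_t, sigma_x, rng, f=np.tanh):
        """One draw from equation 5.4: x_t ~ Norm(f(sum_mn g_mn y_n s_m), sigma_x^2 I)."""
        mean = f(np.einsum('dmn,m,n->d', G, s_t, y_t))
        return mean + sigma_x * rng.normal(size=mean.shape)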
[Figure 3 appears here: the allowable manifold in the expanded observation space, defined by \phi_1(\tilde{x}_t) = \tilde{x}_t and \phi_2(\tilde{x}_t) = \tilde{x}_t^2, together with the generative model's energy contours and the projected process at times t and t+1.]
Figure 3: A schematic of the product-of-experts model in expanded observation space. The observations are one-dimensional and are expanded to two dimensions: φ1 (x˜ t ) = x˜ t and φ2 (x˜ t ) = x˜ t2 . The generative model assigns nonzero probability off the manifold, as shown by its energy contours (shown at two successive time steps to indicate a slow direction and a fast direction). The local experts have high energy only around a localized section of the realizable manifold. The product of the two types of expert leads to a projected process with slow latents that generates around the manifold. The width around the manifold is shown for illustration purposes: in the limit β → 0, the projected process lies precisely on the manifold.
The probabilistic model developed thus far is free to generate points anywhere in the expanded observation space; it cannot capture the structure introduced by the expansion. Fortunately, taking inspiration from product-of-experts models (Hinton, 2002; Osindero, Welling, & Hinton, 2006), we can use our old probabilistic model to produce a new, more sensible model. The idea is to treat the old generative distribution as a global expert that models temporal correlations in the data and then form a new distribution by multiplying its density by K local experts, which constrain the expanded observations to lie on the realizable manifold, and renormalizing. It is important to note that unlike in the usual product-of-experts model, the local experts will not correspond to normalizable distributions in their own right. Generally product models can be succinctly parameterized via an energy such that P(w) = \frac{1}{Z} \exp(-E(w)). A sensible energy function can be devised by introducing two new types of latent variables: \tilde{x}_{m,t}, which can usefully
be thought of as perturbations from the observations, and \phi_{k,t}, which are perturbations in the expanded observation space:

E = \frac{1}{2} \sum_t \left[ \frac{1}{\sigma_x^2} \sum_m (x_{m,t} - \tilde{x}_{m,t})^2 + \beta \sum_k \left[ \phi_{k,t} - \phi_k(\tilde{\mathbf{x}}_t) \right]^2 \right] - \log P(Y, \Phi).   (5.5)

The first two terms in this energy correspond to the local experts, and the final term is the global, dynamical model expert (alternatively, the old generative model). Intuitively, the global expert assigns lower energy to solutions with slow latent variables Y and therefore favors them. The local experts are more complicated, but essentially they constrain the observations of the old generative model, \Phi, to lie on the realizable manifold as described below (see also Figure 3). The \tilde{x}_{m,t} tend to be close to the observations x_{m,t} (as this lowers the energy, by an amount depending on the value of \sigma_x^2) and can therefore be thought of as perturbations. The \phi_k(\tilde{\mathbf{x}}_t) correspond to the K nonlinearities applied in the expansion (e.g., polynomials). Due to the perturbations, they are constrained to lie on a localized region of the manifold centered on \phi_k(\mathbf{x}_t) in the expanded observation space. The latents \phi_{k,t} tend to be close to the \phi_k(\tilde{\mathbf{x}}_t) and therefore lie around the manifold (with a fuzziness determined by \beta, different from the output noise of the dynamical model). Taken together, these terms act as a local expert, vetoing any output from the global expert that does not lie near the realizable manifold. The parameters of this model can be learned using contrastive divergence (Hinton, 2002). What is more, the function \phi_k(\tilde{\mathbf{x}}_t) can be parameterized, by a neural network, for example, and the parameters can also be learned in the same way.

6 Conclusion

We have shown the equivalence between the SFA algorithm and maximum-likelihood learning in a linear gaussian state-space model, with an independent Markovian prior. This perspective inspired a number of variants on the traditional slow feature analysis algorithm. First, a fully probabilistic version of slow feature analysis was shown to be capable of dealing with both output noise and missing data naturally, using the Bayesian framework. Second, we suggested more speculative probabilistic extensions to further improve this new approach. For example, it is known that the traditional SFA algorithm makes no distinction between the “what” information in natural scenes and the “where” information. We suggest augmenting the continuous latent space with binary gating variables to produce a model capable of representing these two types of information distinctly. Finally, we noted that probabilistic models are not compatible with the standard
kernel method of expanding the observations through a large family of nonlinearities. However, an interesting alternative is to learn the nonlinearities in the model directly. We suggest both a traditional method using neural networks to parameterize the mapping from latent variable to observation and a nontraditional method that learns the inverse mapping and is therefore more in the spirit of SFA. The power of these new models will be explored in future work.

Acknowledgments

We thank Pietro Berkes for helpful discussions and for helping us to clarify the presentation. This work was supported by the Gatsby Charitable Foundation.

References

Attias, H. (1999). Independent factor analysis. Neural Comp., 11(4), 803–851.
Bell, A. J., & Sejnowski, T. J. (1997). The “independent components” of natural scenes are edge filters. Vision Research, 37(23), 3327–3338.
Berkes, P., & Wiskott, L. (2005). Slow feature analysis yields a rich repertoire of complex cell properties. Journal of Vision, 5(6), 579–602.
Blaschke, T., Berkes, P., & Wiskott, L. (2006). What is the relation between slow feature analysis and independent component analysis? Neural Computation, 18(10), 2495–2508.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical Society Series B (Methodological), 39, 1–38.
Földiák, P. (1991). Learning invariance from transformation sequences. Neural Computation, 3(2), 194–200.
Ghahramani, Z., & Roweis, S. T. (1999). Learning nonlinear dynamical systems using an EM algorithm. In M. S. Kearns, S. A. Solla, & D. A. Cohn (Eds.), Advances in neural information processing systems, 11 (pp. 599–605). Cambridge, MA: MIT Press.
Grimes, D. B., & Rao, R. P. N. (2005). Bilinear sparse coding for invariant vision. Neural Comput., 17(1), 47–73.
Hinton, G. E. (1989). Connectionist learning procedures. Artificial Intelligence, 40(1–3), 185–234.
Hinton, G. E. (2002). Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8), 1771–1800.
Hurri, J., & Hyvärinen, A. (2003). Temporal and spatiotemporal coherence in simple-cell responses: A generative model of natural image sequences. Network, 14(3), 527–551.
Hyvärinen, A., Hurri, J., & Väyrynen, J. (2003). Bubbles: A unifying framework for low-level statistical properties of natural image sequences. Journal of the Optical Society of America. A, Optics, Image Science, and Vision, 20(7), 1237–1252.
Jordan, M. I., Ghahramani, Z., Jaakkola, T., & Saul, L. K. (1999). An introduction to variational methods for graphical models. Machine Learning, 37(2), 183–233.
Karklin, Y., & Lewicki, M. S. (2005). A hierarchical Bayesian model for learning nonlinear statistical regularities in nonstationary natural signals. Neural Computation, 17(2), 397–423.
Kayser, C., Körding, K. P., & König, P. (2003). Learning the nonlinearity of neurons from natural visual stimuli. Neural Computation, 15(8), 1751–1759.
Körding, K. P., Kayser, C., Einhäuser, W., & König, P. (2004). How are complex cell properties adapted to the statistics of natural stimuli? Journal of Neurophysiology, 91(1), 206–212.
MacKay, D. J. C. (1999). Maximum likelihood and covariant algorithms for independent component analysis. Unpublished manuscript. Available online at ftp://wol.ra.phy.cam.ac.uk/pub/mackay/ica.ps.gz.
Miskin, J. (2000). Ensemble learning for independent components analysis. Unpublished doctoral dissertation, Cambridge University.
Mitchison, G. (1991). Removing time variation with the anti-Hebbian differential synapse. Neural Computation, 3, 312–320.
Neal, R. M., & Hinton, G. E. (1998). A view of the EM algorithm that justifies incremental, sparse, and other variants. In M. I. Jordan (Ed.), Learning in graphical models (pp. 355–370). Norwell, MA: Kluwer Academic Press.
Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6583), 607–609.
Osindero, S., Welling, M., & Hinton, G. E. (2006). Topographic product models applied to natural scene statistics. Neural Computation, 18(2), 381–414.
Pearlmutter, B., & Parra, L. (1997). Maximum likelihood blind source separation: A context-sensitive generalization of ICA. In M. C. Mozer, M. I. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9 (pp. 613–619). Cambridge, MA: MIT Press.
Roweis, S., & Ghahramani, Z. (1999). A unifying review of linear gaussian models. Neural Computation, 11(2), 305–345.
Särelä, J., & Valpola, H. (2005). Denoising source separation. J. Machine Learning Res., 6, 233–272.
Stone, J. V. (1996). Learning perceptually salient visual parameters using spatiotemporal smoothness constraints. Neural Computation, 8(7), 1463–1492.
Tenenbaum, J. B., & Freeman, W. T. (2000). Separating style and content with bilinear models. Neural Computation, 12(6), 1247–1283.
Tipping, M. E., & Bishop, C. M. (1999). Probabilistic principal components analysis. Journal of the Royal Statistical Society Series B (Methodological), 61(3), 611–622.
Wiskott, L., & Sejnowski, T. J. (2002). Slow feature analysis: Unsupervised learning of invariances. Neural Computation, 14(4), 715–770.
Received May 10, 2006; accepted July 30, 2006.
LETTER
Communicated by Simon Haykin
An Augmented Extended Kalman Filter Algorithm for Complex-Valued Recurrent Neural Networks

Su Lee Goh
[email protected]

Danilo P. Mandic
[email protected]
Department of Electrical and Electronic Engineering, Imperial College London, London SW7 2AZ, U.K.
An augmented complex-valued extended Kalman filter (ACEKF) algorithm for the class of nonlinear adaptive filters realized as fully connected recurrent neural networks is introduced. This is achieved based on some recent developments in the so-called augmented complex statistics and the use of general fully complex nonlinear activation functions within the neurons. This makes the ACEKF suitable for processing general complex-valued nonlinear and nonstationary signals and also bivariate signals with strong component correlations. Simulations on benchmark and real-world complex-valued signals support the approach.

1 Introduction

Recent progress in biomedicine, wireless and mobile communications, seismics, and sonar and radar signal processing has brought to light new problems, where data models are often complex valued or have a higher-dimensional compact representation (Mandic & Chambers, 2001; Haykin, 1994). To process such signals, research has largely been directed toward extending the results from real-valued adaptive filters to those operating in the complex plane, C. One such algorithm is the complex least mean square (CLMS) algorithm for linear finite impulse response (FIR) adaptive filters, introduced in 1975 (Widrow, McCool, & Ball, 1975). More recently, complex-valued learning algorithms have been introduced for training of neural networks (NNs) (Kim & Adali, 2001; Leung & Haykin, 1991; Treichler, Johnson, & Larimore, 1987; Goh & Mandic, 2004). Fully connected recurrent neural networks (FCRNNs) have been employed as nonlinear adaptive filters1 (Mandic & Chambers, 2001; Medsker
1 Nonlinear autoregressive (NAR) processes can be modeled using feedforward networks, whereas nonlinear autoregressive moving-average (NARMA) processes can be represented using RNNs.

Neural Computation 19, 1039–1055 (2007)
C 2007 Massachusetts Institute of Technology
& Jain, 2000), where for real-time applications, the real-time recurrent learning (RTRL) algorithm (Williams & Zipser, 1989) has been widely used to train FCRNNs. In the complex domain, a recently proposed fully complex real-time recurrent learning (CRTRL)2 algorithm (Goh & Mandic, 2004; Goh, Popovic, & Mandic, 2004) has been applied to forecasting of the complex-valued wind field. These initial results were promising, but it was also realized that gradient-based learning may experience problems when processing signals with rich dynamics (for instance, intermittent signals). Following the established approaches from the real domain R, a possible solution to this problem may be based on Kalman filters (Puskorius & Feldkamp, 1994), which have been shown to exhibit superior performance in several applications, including state estimation for road navigation (Obradovic, Lenz, & Schupfner, 2004), parameter estimation for time-series modeling, and neural network training (Puskorius & Feldkamp, 1991; Julier & Uhlmann, 2004; Haykin, 2001). Kalman filtering (Kalman, 1960) is known to give the optimal solution to the linear gaussian sequential state estimation problem in the domain of second-order statistics. However, for a nonlinear or nongaussian state tracking problem with possibly an infinite number of states, no such general optimal algorithm exists. Extensions of the class of Kalman filter algorithms that cater to this case are termed the extended Kalman filter (EKF). The EKF (Feldkamp & Puskorius, 1998; Grewal & Andrews, 2001) is based on a truncated Taylor series expansion, which linearizes the system model locally, and the subsequent use of a linear Kalman filter. The EKF algorithms have been used to train temporal neural networks (Obradovic, 1996; Baltersee & Chambers, 1998; Mandic, Baltersee, & Chambers, 1998), with promising results. Recently, EKF training of NNs has been extended to the complex domain (Huang & Chen, 2000). Notice that in order to design an algorithm suitable for the complex domain, we need a precise mathematical model that describes the evolution of system parameters. Hence, extensions of learning algorithms to the complex domain are not trivial and often involve some constraints, for instance, a simplified model of both complex statistics and complex nonlinearities within neurons. This might be suboptimal for classes of signals with significant correlation between the real and imaginary parts, which affects both the choice of the complex activation function and complex statistics. To tackle this problem, we first provide mathematical foundations for complex-valued second-order statistics and highlight the need to consider
2 The CRTRL algorithm uses a fully complex activation function (AF) that is analytical and bounded almost everywhere in C. In a split-complex AF, the real and imaginary components of the input signal x are separated and fed through the real-valued activation function f_R(x) = f_I(x), x \in R. A split-complex activation function is therefore given as f(x) = f_R(\mathrm{Re}(x)) + j f_I(\mathrm{Im}(x)), for example, f(x) = \frac{1}{1+e^{-\beta \mathrm{Re}(x)}} + j \frac{1}{1+e^{-\beta \mathrm{Im}(x)}}.
the so-called augmented statistics in the derivation of such learning algorithms. Next, both the augmented complex-valued Kalman filter (ACKF) and the augmented extended complex-valued Kalman filter (ACEKF) algorithm are derived, whereby the recently developed CRTRL algorithm is used to compute the Jacobian matrix within the ACEKF. For rigor, this is achieved for a fully complex nonlinear activation function of a neuron. The potential of such an ACEKF for training of FCRNNs is analyzed, and a comparison with the CRTRL is provided. The analysis is comprehensive and is supported by simulation examples on benchmark complex-valued nonlinear and colored signals, together with simulations on complex-valued real-world radar and environmental measurements. Finally, possible directions for future research are highlighted.

2 Complex-Valued Second-Order Statistics

Before deriving the ACEKF algorithm for complex FCRNNs, we introduce some important notations from complex second-order statistics. The covariance matrix for two real-valued RVs x and y is defined as (Anton, 2003)

P_{xy} = \mathrm{cov}[x, y] = E\left[ (x - E[x])(y - E[y])^T \right],   (2.1)
where E(·) represents the statistical expectation operator. Similarly, the four covariance matrices of two complex-valued RVs, x = x^r + jx^i and y = y^r + jy^i, are given by (Neeser & Massey, 1992)

P_{x^r y^r} = \mathrm{cov}[x^r, y^r] \qquad P_{x^r y^i} = \mathrm{cov}[x^r, y^i]
P_{x^i y^r} = \mathrm{cov}[x^i, y^r] \qquad P_{x^i y^i} = \mathrm{cov}[x^i, y^i],   (2.2)

where j = \sqrt{-1}, (\cdot)^T denotes the vector transpose operator, and superscripts (\cdot)^r and (\cdot)^i denote, respectively, the real and imaginary part of a complex number or complex vector. These four real-valued matrices are equivalent to the following two complex-valued matrices, given by (Neeser & Massey, 1992),

P_{xy} = E\left[ (x - E[x])(y - E[y])^H \right]
P^{\xi}_{xy} = E\left[ (x - E[x])(y - E[y])^T \right],   (2.3)

where

P_{xy} = P_{x^r y^r} + P_{x^i y^i} + j(P_{x^i y^r} - P_{x^r y^i})
P^{\xi}_{xy} = P_{x^r y^r} - P_{x^i y^i} + j(P_{x^i y^r} + P_{x^r y^i}),   (2.4)
and the symbol (\cdot)^H denotes the Hermitian transpose operator. We can solve for P_{x^r y^r}, P_{x^i y^i}, P_{x^i y^r}, and P_{x^r y^i} to obtain

P_{x^r y^r} = \frac{1}{2} \mathcal{R}\{P_{xy} + P^{\xi}_{xy}\}
P_{x^i y^i} = \frac{1}{2} \mathcal{R}\{P_{xy} - P^{\xi}_{xy}\}
P_{x^i y^r} = \frac{1}{2} \mathcal{I}\{P_{xy} + P^{\xi}_{xy}\}
P_{x^r y^i} = \frac{1}{2} \mathcal{I}\{-P_{xy} + P^{\xi}_{xy}\},   (2.5)
where symbols \mathcal{R} and \mathcal{I} denote, respectively, the real and imaginary part of a complex quantity. It is clear that the four real-valued covariance matrices, equation 2.5, are in a one-to-one relationship with the two complex-valued covariance matrices in equation 2.4. In the literature, nearly always only P_{xy} is considered and is referred to as the covariance matrix, whereas P^{\xi}_{xy} is termed the pseudocovariance matrix.

2.1 Augmented Covariance Matrix. It is often assumed that the theory of complex-valued random vectors (RVs) is no different from that of the real RVs, as long as the definition of the covariance matrix of an RV x is changed from E[xx^T] in the real case to E[xx^H] in the complex case (Schreier & Scharf, 2003; Picinbono, 1996). This assumption, however, is not justified, since the covariance matrix E[xx^H] will not completely describe the second-order statistical behavior of x. For complex-valued gaussian random variables, we therefore need to consider both the variable x and its complex conjugate x^* in order to make full use of the available statistical information. This additional information is contained in the cross-moments, and to design mathematically well-founded complex learning algorithms, a study of these moments and their implications on learning is required (Neeser & Massey, 1992). We therefore set out to derive a complex-valued Kalman filter that takes into account the augmented complex statistics and thus provides a general framework for nonlinear adaptive filtering in C. To achieve this, instead of a complex RV x, we consider an “augmented” (2n × 1)-dimensional vector x^a = [x, x^*]. In addition, for “improper” (see appendix A for more detail) vectors, it is the augmented (2n × 2n)-dimensional covariance matrix P_{x^a x^a} = E[x^a (x^a)^H] (rather than the (n × n)-dimensional matrix P_{xx} = E[xx^H]) that contains the complete second-order statistical information. Such an augmented covariance matrix is given by (Schreier & Scharf,
2003):

P_{x^a x^a} = E\left[ \begin{pmatrix} x \\ x^* \end{pmatrix} \begin{pmatrix} x^H & x^T \end{pmatrix} \right] = \begin{pmatrix} P_{xx} & P^{\xi}_{xx} \\ P^{\xi *}_{xx} & P^{*}_{xx} \end{pmatrix}.   (2.6)
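A small NumPy sketch (illustrative, not from the letter) that estimates these blocks from samples and exhibits the vanishing pseudocovariance of a proper complex gaussian vector:

    import numpy as np

    def augmented_covariance(X):
        """X: (n, K) array of K samples of a zero-mean complex random vector."""
        K = X.shape[1]
        P = X @ X.conj().T / K      # covariance        P_xx    = E[x x^H]
        Pxi = X @ X.T / K           # pseudocovariance  P^xi_xx = E[x x^T]
        return np.block([[P, Pxi], [Pxi.conj(), P.conj()]])   # equation 2.6

    rng = np.random.default_rng(0)
    z = rng.normal(size=(2, 50_000)) + 1j * rng.normal(size=(2, 50_000))  # proper
    print(np.round(augmented_covariance(z), 2))   # off-diagonal blocks near zero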
Notice that the covariance matrix P_{x^a x^a} is invertible and thus positive definite. Besides P_{xx} being positive semidefinite (PSD) and P^{\xi}_{xx} being symmetric, the Schur complement P_{xx} - P^{\xi}_{xx} P^{*-1}_{xx} P^{\xi *}_{xx} must be PSD to ensure that P_{x^a x^a} is PSD and thus a valid covariance matrix for x^a (Neeser & Massey, 1992).

3 The Augmented Complex-Valued Kalman Filter Algorithm

Based on the analysis of augmented complex statistics, we first derive a Kalman filter learning algorithm for complex-valued inputs using the augmented states and augmented covariance matrix. Consider a general state-space model given by (Haykin, 2001),

x_{k+1} = F_{k+1} x_k + \omega_k
y_k = H_k x_k + \nu_k,   (3.1)
where \omega_k and \nu_k are independent, zero-mean, complex-valued gaussian processes of covariance matrices Q_k and R_k, respectively, and F and H are the transition and measurement matrices. From equation 3.1, the augmented state-space model is obtained as

x^a_{k+1} = F^a_{k+1} x^a_k + \omega^a_k
y^a_k = H^a_k x^a_k + \nu^a_k,   (3.2)

where x^a_k = [x_k, x^*_k], y^a_k = [y_k, y^*_k], F^a_k = [F_k, F^*_k], H^a_k = [H_k, H^*_k], \omega^a_k = [\omega_k, \omega^*_k], and \nu^a_k = [\nu_k, \nu^*_k]. The augmented covariance matrices of the zero-mean complex-valued gaussian noise processes are denoted respectively by Q^a_k and R^a_k. In the initialization of the algorithm (k = 0), we set

\hat{x}^a_0 = E[x^a_0],
P_0 = E\left[ (x^a_0 - E[x^a_0])(x^a_0 - E[x^a_0])^T \right].   (3.3)
The computation steps for k = 1, \ldots, L are given below (for clarity, we use notation similar to that of Haykin, 2001):

State estimate propagation:
\hat{x}^{a-}_k = F^a_{k,k-1} \hat{x}^a_{k-1}   (3.4)

Error covariance propagation:
P^-_k = F^a_{k,k-1} P_{k-1} (F^a_{k,k-1})^H + Q^a_{k-1}   (3.5)

Kalman gain matrix:
G_k = P^-_k (H^a_k)^H \left[ H^a_k P^-_k (H^a_k)^H + R^a_k \right]^{-1}   (3.6)

State estimate update:
\hat{x}^a_k = \hat{x}^{a-}_k + G_k \left( y^a_k - H^a_k \hat{x}^{a-}_k \right)   (3.7)

Error covariance update:
P_k = \left( I - G_k H^a_k \right) P^-_k   (3.8)
This completes the description of the augmented complex-valued Kalman filter.

4 FCRNN Trained with Augmented Extended Complex-Valued Kalman Filter

4.1 FCRNN Architecture. Figure 1 shows an FCRNN, which consists of N neurons with p external inputs and N feedback connections. The network has two distinct layers: the external input-feedback layer and a layer of processing elements. Let y_{l,k} denote the complex-valued output of a neuron, l = 1, \ldots, N, at time index k, and s_k the (1 × p) external complex-valued input vector. The overall input to the network u_k then represents a concatenation of the vectors y_k, s_k, and the bias input (1 + j), and is given by

u_k = [s_{k-1}, \ldots, s_{k-p}, 1+j, y_{1,k-1}, \ldots, y_{N,k-1}]^T
u_{n,k} \in u_k, \quad u_{n,k} = u^r_{n,k} + j u^i_{n,k}, \quad n = 1, \ldots, p+N+1.   (4.1)
For the lth neuron, its weights form a ((p + N + 1) × 1)-dimensional weight vector w_l^T = [w_{l,1}, \ldots, w_{l,p+N+1}], l = 1, \ldots, N, which are encompassed in the complex-valued weight matrix of the network W = [w_1, \ldots, w_N]. The output of every neuron can be expressed as

y_{l,k} = \Phi(\mathrm{net}_{l,k}), \quad l = 1, \ldots, N,   (4.2)
[Figure 1 appears here: the network comprises an I/O layer of external inputs s(k−1), ..., s(k−p), the bias input 1+j, and delayed (z^{-1}) feedback inputs, followed by a feedforward processing layer of hidden and output neurons producing the outputs y.]
Figure 1: A fully connected recurrent neural network for prediction.
where \Phi is a complex nonlinear activation function of a neuron and

\mathrm{net}_{l,k} = \sum_{n=1}^{p+N+1} w_{l,n} u_{n,k}   (4.3)
is the net input to the lth node at time index k. In order to establish a mathematical framework for Kalman filter-based training of FCRNNs, the dynamical behavior of the FCRNN is described by the state-space model

w^a_{k+1} = w^a_k + \omega^a_k   (4.4)
y^a_k = h(w^a_k, u_k) + \nu^a_k,   (4.5)
where h is a nonlinear operator associated with observations, w^a_k is the augmented weight vector, and y^a_k is the augmented output of the network. The underlying idea here is to linearize the state-space model for every time instant k. Once such a local linear model is obtained, the standard ACKF approach can be applied. In practice, the process noise covariance Q^a_k and measurement noise covariance R^a_k matrices might be time varying, but here we assume they are constant. Notice that the Jacobian H^a_k is defined as a set of partial derivatives of the outputs with respect to the weights, w^a_k, and needs to be updated at every time instant during learning.

4.2 Derivation of the Augmented Complex-Valued Extended Kalman Filter Algorithm. When operating in the complex domain, similar to the real EKF, the nonlinear state and measurement equations ought to be linearized about the current state in order to subsequently employ a standard Kalman filter. Since this linearization is dynamical in its nature, the performance of such an EKF cannot be assessed beforehand. Indeed, there is a chance that the updated trajectory estimate will be poorer than the nominal one, which leads to inaccuracy in the estimates, causing further error accumulation (Brown & Hwang, 1997). The derivation of the required complex-valued Jacobian matrix is nontrivial, and the linearization can lead to a highly unstable performance (Julier & Uhlmann, 2004). Therefore, careful parameter selection is required when using the ACEKF algorithm. With this in mind, we proceed with the derivation of the ACEKF as an extension of the ACKF presented in the previous section. The following expressions summarize the proposed ACEKF:

G_k = P^-_k (H^a_k)^H \left[ H^a_k P^-_k (H^a_k)^H + R^a_k \right]^{-1}   (4.6)
\hat{w}^a_k = \hat{w}^{a-}_k + G_k \left( y^a_k - h(\hat{w}^{a-}_k, u_k) \right)   (4.7)
P_k = \left( I - G_k H^a_k \right) P^-_k + Q^a_k.   (4.8)
The ACEKF is initialized by

\hat{w}^a_0 = E[w^a_0],
P_0 = E\left[ (w^a_0 - E[w^a_0])(w^a_0 - E[w^a_0])^T \right].   (4.9)
For generality, the augmented Jacobian matrix H^a_k is computed using the augmented CRTRL algorithm3 for recurrent networks (Goh & Mandic, 2004) (using fully complex nonlinearities within the network; see appendix B). The vector \hat{w}^a_k denotes the estimate of the augmented state of

3 Matrix H^a_k denotes the matrix of partial derivatives of the network’s augmented output y^a_k with respect to the weight parameters (see appendix B).
the system update at step k. The Kalman gain matrix G_k is a function of the estimated error covariance matrix P_k, the Jacobian matrix H^a_k, and a global scaling matrix H^a_k P^-_k (H^a_k)^H + R^a_k.

5 Simulations

For the experiments, the nonlinearity at the neuron was chosen to be the complex tanh function,

\Phi(x) = \frac{e^{\beta x} - e^{-\beta x}}{e^{\beta x} + e^{-\beta x}},   (5.1)
where x \in C. The value of the slope of \Phi(x) was \beta = 1. The architecture of the FCRNN (see Figure 1) consisted of N = 3 neurons with a tap input length of p = 5. To support the analysis, we tested the ACEKF on a wide range of signals, including complex linear and complex nonlinear signals. To further illustrate the approach and verify the advantage of using the ACEKF over the CEKF and the CRTRL, additional single-trial experiments were performed on real-world complex-valued wind4 and radar5 data. In the first experiment, the input signal was a stable complex linear AR(4) process given by

r(k) = 1.79 r(k-1) - 1.85 r(k-2) + 1.27 r(k-3) - 0.41 r(k-4) + n(k),   (5.2)

with complex white gaussian noise (CWGN) n(k) \sim \mathcal{N}(0, 1) as the driving input. The CWGN can be expressed as n(k) = n_r(k) + j n_i(k). The real and imaginary components of CWGN are mutually independent sequences having equal variances, so that \sigma_n^2 = \sigma_{n_r}^2 + \sigma_{n_i}^2. For the second experiment, we used a complex benchmark nonlinear signal (Narendra & Parthasarathy, 1990),

z(k) = \frac{z^2(k-1)\left( z(k-1) + 2.5 \right)}{1 + z(k-1) + z^2(k-2)} + r(k-1).   (5.3)

4 Publicly available online from http://mesonet.agron.iastate.edu/. The wind vector can be expressed in the complex domain C as v(t)e^{j\theta(t)} = v_E(t) + v_N(t). Here, the two wind components, the speed v and direction \theta, which are of different natures, are modeled as a single quantity in a complex representation space (Mandic & Goh, 2005).
5 Publicly available online from http://soma.ece.mcmaster.ca/ipix/.
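For concreteness, a sketch generating the two benchmark inputs follows; the letter does not specify signal scaling, so the drive of the nonlinear recursion is scaled down here (an assumption) to keep it in a bounded operating range:

    import numpy as np

    rng = np.random.default_rng(0)
    K = 5000
    # CWGN with unit variance split equally between real and imaginary parts.
    n = (rng.normal(size=K) + 1j * rng.normal(size=K)) / np.sqrt(2)

    r = np.zeros(K, dtype=complex)        # colored AR(4) signal, equation 5.2
    for k in range(4, K):
        r[k] = (1.79 * r[k-1] - 1.85 * r[k-2]
                + 1.27 * r[k-3] - 0.41 * r[k-4] + n[k])

    u = 0.05 * n                          # small drive (assumed scaling)
    z = np.zeros(K, dtype=complex)        # nonlinear benchmark, equation 5.3
    for k in range(2, K):
        z[k] = (z[k-1]**2 * (z[k-1] + 2.5)
                / (1.0 + z[k-1] + z[k-2]**2) + u[k-1])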
Table 1: Comparison of Prediction Gains Rp and SNR for the Various Classes of Signals.

Signal                     Nonlinear   AR4     Wind    Radar
R_p [dB] (ACEKF)           6.24        4.77    12.65   10.58
R_p [dB] (standard CEKF)   5.55        3.98    10.24   9.91
R_p [dB] (CRTRL)           3.76        3.54    6.12    7.22
SNR [dB]                   5           5       3       3
To assess the performance, the standard prediction gain R_p was employed, given by (Haykin & Li, 1995)

R_p = 10 \log_{10} \left( \frac{\sigma_x^2}{\hat{\sigma}_e^2} \right) \; [\mathrm{dB}],   (5.4)

where \sigma_x^2 denotes the variance of the input signal x(k), and \hat{\sigma}_e^2 denotes the estimated variance of the forward prediction error e(k).
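The figure of merit in equation 5.4 is straightforward to compute from the signal and the one-step-ahead prediction error; a minimal sketch:

    import numpy as np

    def prediction_gain(x, e):
        """R_p = 10 log10(var(x) / var(e)) in dB, equation 5.4."""
        return 10.0 * np.log10(np.var(x) / np.var(e))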
Table 1 shows a comparison of the prediction gains R_p [dB] between the ACEKF, the standard complex-valued EKF (CEKF) (without considering the augmented states), and the CRTRL for various classes of signals, and also the signal-to-noise ratio (SNR) for the various experiments. In all the cases, there was a significant improvement in the performance when the ACEKF was employed over that of the CRTRL and CEKF. Figures 2 and 3 show a subsegment of the predictions generated by the ACEKF and CRTRL for complex-valued colored (see equation 5.2) and nonlinear (see equation 5.3) signals. The ACEKF outperformed the CRTRL for both cases. To further illustrate the advantage of the ACEKF over the CRTRL, we compared the performances of FCRNNs trained with these algorithms in experiments on real-world radar and wind data. Figure 4 shows the prediction performance of the ACEKF applied to the complex-valued real-world (velocity and angle components) wind signal. Simulation results for radar data are shown in Figure 5. In both cases, the ACEKF outperformed the other algorithms considered (for a quantitative performance comparison, see Table 1).

6 Discussion and Conclusion

The augmented complex-valued extended Kalman filter (ACEKF) has been introduced for nonlinear adaptive filtering in the complex domain. For generality, this has been achieved for nonlinear adaptive filters realized as fully connected recurrent neural networks (FCRNNs), and the ACEKF has been derived using the augmented complex-valued statistics, whereby
[Figure 2 appears here: three panels plotting the absolute nonlinear signal against iterations k (approximately 1600 to 1900): (a) ACEKF algorithm (noisy, predicted, and actual signals), (b) CRTRL algorithm (predicted and actual signals), and (c) mean square error of the CRTRL and ACEKF.]
Figure 2: Comparison of one-step-ahead prediction performance between the ACEKF and CRTRL for nonlinear signal (see equation 5.3).
[Figure 3 appears here: three panels plotting the absolute colored signal against iterations k (approximately 2200 to 2500): (a) ACEKF algorithm (noisy, predicted, and actual signals), (b) CRTRL algorithm (predicted and actual signals), and (c) mean square error of the CRTRL and ACEKF.]
Figure 3: Comparison of one-step-ahead prediction performance between the ACEKF and CRTRL for colored signal (see equation 5.2).
[Figure 4 appears here: absolute wind signal (noisy, predicted, and actual) plotted against iterations k, approximately 2200 to 2500.]
Figure 4: Prediction performance for complex wind signal using ACEKF. 1 noisy signal predicted signal actual signal
Absolute Radar Signal
0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 2700
2750
2800
2850
2900
2950
3000
Iterations, k
Figure 5: Prediction performance for complex radar signal using ACEKF.
the complete second-order statistics of complex-valued quantities are taken into account. The performance of the ACEKF has been evaluated on benchmark complex-valued nonlinear and colored signals and also on real-life complex-valued wind and radar signals. Experimental results have justified the potential of the ACEKF in nonlinear complex-valued neural adaptive filtering applications. It should be noted that the computational cost of the ACEKF is considerably higher than that of the CRTRL. To mitigate this problem, a decoupled extended Kalman filter (DEKF) algorithm (Puskorius & Feldkamp, 1991) may be used; however, the reduction in computational complexity may lead to poorer performance of the algorithm in the domain of augmented complex statistics. This is due to the fact that the DEKF is derived from the EKF by approximating the second-order information between
weights so that they belong only to mutually exclusive groups; correlations between weight estimates can be neglected, which leads to a sparser error covariance matrix P_k. The decoupling methods for processes described by augmented complex statistics are an area for future research, especially if the main concern is the computational complexity of the augmented EKF.

Appendix A: Proper and Improper Complex Random Vectors

A complex RV x is called proper if its pseudocovariance P^{\xi}_{xx} vanishes (Neeser & Massey, 1992; Schreier & Scharf, 2003). For a proper complex-valued random vector x, we have

P_{x^r x^r} = P_{x^i x^i}, \qquad P_{x^i x^r} = -P^T_{x^i x^r}.   (A.1)
This means that the real and imaginary parts of x are uncorrelated. The vanishing of P^{\xi}_{xx} does not imply that the real part of x_m \in x and the imaginary part of x_n \in x are uncorrelated for m \neq n. When RVs are proper, the probability density function (PDF) of a gaussian RV takes a form similar to that of the real case. Let x \in C^n be a proper complex-valued gaussian random variable with mean \mu and nonsingular covariance matrix P_{xx}. Then the PDF of x is given by (Picinbono, 1998)

f(x) = \frac{1}{\det(\pi P_{xx})} e^{-(x-\mu)^H P_{xx}^{-1} (x-\mu)},   (A.2)

where P_{xx}^{-1} is a Hermitian and positive-definite matrix. For convenience, in many applications, complex-valued RVs are treated as proper. However, P^{\xi}_{xx} may not necessarily be zero, and the results obtained this way are suboptimal.

Appendix B: An Augmented CRTRL Algorithm for the FCRNN

The augmented CRTRL algorithm for training the FCRNN is briefly described. The cost function of the recurrent network is given by J_k = \frac{1}{2} e_k e_k^H. The instantaneous output error is defined as

e_k = d^a_k - y^a_k,   (B.1)
where d^a_k denotes the augmented desired output vector. The augmented weight matrix update becomes

\Delta w^a_k = -\eta \frac{\partial J_k}{\partial w^a_k} = -\eta e_k \frac{\partial y^a_k}{\partial w^a_k}.   (B.2)
We introduce three new matrices—the 2N × (2N + p + 1) matrix \Lambda^a_{l,k}, the 2N × (2N + p + 1) matrix U_{l,k}, and the 2N × 2N diagonal matrix S_k—as (Haykin, 1994)

\Lambda^a_{l,k} = \frac{\partial y^a_k}{\partial w^a_{l,k}}, \qquad y^a_k = \left[ y_{1,k}, \ldots, y_{N,k}, y^*_{1,k}, \ldots, y^*_{N,k} \right], \qquad l = 1, 2, \ldots, N   (B.3)

U_{l,k} = \left[ 0, \ldots, 0, u_k, 0, \ldots, 0 \right]^T \; (u_k \text{ in the } l\text{th row}), \qquad l = 1, \ldots, N   (B.4)

S_k = \mathrm{diag}\left( \Phi'(u_k^T w^a_{1,k}), \ldots, \Phi'(u_k^T w^a_{N,k}) \right)   (B.5)

\Lambda^a_{l,k+1} = S^*_k \left( U_{l,k} + (W^a_{f,k})^H \Lambda^a_{l,k} \right),   (B.6)
where W_f denotes the set of those entries in W that correspond to the feedback connections.

Acknowledgments

We thank Dragan Obradovic from Siemens Corporate Research for his critical comments and lengthy discussions.

References

Anton, H. (2003). Contemporary linear algebra. New York: Wiley.
Baltersee, J., & Chambers, J. A. (1998). Nonlinear adaptive prediction of speech signals using a pipelined recurrent network. IEEE Transactions on Signal Processing, 46(8), 2207–2216.
Brown, R. B., & Hwang, P. Y. C. (1997). Introduction to random signals and applied Kalman filtering. New York: Wiley.
Feldkamp, L. A., & Puskorius, G. V. (1998). A signal processing framework based on dynamical neural networks with application to problems in adaptation, filtering and classification. Proceedings of the IEEE, 86, 2259–2277.
Goh, S. L., & Mandic, D. P. (2004). A complex-valued RTRL algorithm for recurrent neural networks. Neural Computation, 16(12), 2699–2713.
Goh, S. L., Popovic, D. H., & Mandic, D. P. (2004). Complex-valued estimation of wind profile and wind power. In Proceedings of the 12th IEEE Mediterranean Electrotechnical Conference, MELECON 2004 (Vol. 3, pp. 1037–1040). Piscataway, NJ: IEEE Press.
Grewal, M. S., & Andrews, A. P. (2001). Kalman filtering: Theory and practice. New York: Wiley.
Haykin, S. (1994). Neural networks: A comprehensive foundation. Upper Saddle River, NJ: Prentice Hall.
Haykin, S. (2001). Kalman filtering and neural networks. New York: Wiley.
Haykin, S., & Li, L. (1995). Nonlinear adaptive prediction of nonstationary signals. IEEE Transactions on Signal Processing, 43(2), 526–535.
Huang, R. C., & Chen, M. S. (2000). Adaptive equalization using complex-valued multilayered neural network based on the extended Kalman filter. In Proceedings of Signal Processing, International Conference on WCCC-ICSP (Vol. 1, pp. 519–524). Piscataway, NJ: IEEE Press.
Julier, S. J., & Uhlmann, J. K. (2004). Unscented filtering and nonlinear estimation. Proceedings of the IEEE, 92(3), 401–422.
Kalman, R. E. (1960). A new approach to linear filtering and prediction problems. Transactions of the ASME–Journal of Basic Engineering, 82, 35–45.
Kim, T., & Adali, T. (2001). Approximation by fully complex MLP using elementary transcendental activation functions. In Proceedings of XI IEEE Workshop on Neural Networks for Signal Processing (pp. 203–212). Piscataway, NJ: IEEE Press.
Leung, H., & Haykin, S. (1991). The complex backpropagation algorithm. IEEE Transactions on Signal Processing, 39(9), 2101–2104.
Mandic, D., Baltersee, J., & Chambers, J. (1998). Non-linear adaptive prediction of speech with a pipelined recurrent neural network and advanced learning algorithms. In A. Prochazka, J. Uhlir, P. W. Rayner, & N. G. Kingsbury (Eds.), Signal analysis and prediction. Boston: Birkhauser.
Mandic, D. P., & Chambers, J. A. (2001). Recurrent neural networks for prediction: Learning algorithms, architectures and stability. New York: Wiley.
Mandic, D. P., & Goh, S. L. (2005). Sequential data fusion via vector spaces: Complex modular neural network approach. In Proceedings of the IEEE Workshop on Machine Learning for Signal Processing (pp. 147–151). Piscataway, NJ: IEEE Press.
Medsker, L. R., & Jain, L. (2000). Recurrent neural networks: Design and applications. Boca Raton, FL: CRC Press.
Narendra, K. S., & Parthasarathy, K. (1990). Identification and control of dynamical systems using neural networks. IEEE Transactions on Neural Networks, 1(1), 4–27.
Neeser, F., & Massey, J. (1992). Proper complex random processes with applications to information theory. IEEE Transactions on Information Theory, 39(4), 1293–1302.
Obradovic, D. (1996). On-line training of recurrent neural networks with continuous topology adaptation. IEEE Transactions on Neural Networks, 7(1), 222–228.
Obradovic, D., Lenz, H., & Schupfner, M. (2004). Sensor fusion in Siemens car navigation system. In Proceedings of the IEEE Signal Processing Society Workshop on Machine Learning for Signal Processing (pp. 655–664). Piscataway, NJ: IEEE Press.
Picinbono, B. (1996). Second-order complex random vectors and normal distributions. IEEE Transactions on Signal Processing, 44(10), 2637–2640.
Picinbono, B. (1998). On circularity. IEEE Transactions on Signal Processing, 42(12), 3473–3482.
Puskorius, G. V., & Feldkamp, L. A. (1991). Decoupled extended Kalman filter training of feedforward layered networks. In IJCNN-91-Seattle International Joint Conference on Neural Networks (Vol. 1, pp. 771–777). Piscataway, NJ: IEEE Press.
Puskorius, G. V., & Feldkamp, L. A. (1994). Neurocontrol of nonlinear dynamical systems with Kalman filter trained recurrent networks. IEEE Transactions on Neural Networks, 5(2), 279–297.
Schreier, P. J., & Scharf, L. L. (2003). Second-order analysis of improper complex random vectors and processes. IEEE Transactions on Signal Processing, 51(3), 714–725.
Treichler, J. R., Johnson, J. C. R., & Larimore, M. (1987). Theory and design of adaptive filters. New York: Wiley.
Widrow, B., McCool, J., & Ball, M. (1975). The complex LMS algorithm. Proceedings of the IEEE, 63, 712–720.
Williams, R. J., & Zipser, D. A. (1989). A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1(2), 270–280.
Received August 22, 2005; accepted July 9, 2006.
LETTER
Communicated by Terence Kwok
Equilibria of Iterative Softmax and Critical Temperatures for Intermittent Search in Self-Organizing Neural Networks

Peter Tiňo
[email protected]
School of Computer Science, University of Birmingham, Birmingham B15 2TT, U.K.
Kwok and Smith (2005) recently proposed a new kind of optimization dynamics using self-organizing neural networks (SONN) driven by softmax weight renormalization. Such dynamics is capable of powerful intermittent search for high-quality solutions in difficult assignment optimization problems. However, the search is sensitive to the temperature setting in the softmax renormalization step. It has been hypothesized that the optimal temperature setting corresponds to the symmetry-breaking bifurcation of equilibria of the renormalization step, when viewed as an autonomous dynamical system called iterative softmax (ISM). We rigorously analyze equilibria of ISM by determining their number, position, and stability types. It is shown that most fixed points exist in the neighborhood of the maximum entropy equilibrium w = (N^{-1}, N^{-1}, \ldots, N^{-1}), where N is the ISM dimensionality. We calculate the exact rate of decrease in the number of ISM equilibria as one moves away from w. Bounds on temperatures guaranteeing different stability types of ISM equilibria are also derived. Moreover, we offer analytical approximations to the critical symmetry-breaking bifurcation temperatures that are in good agreement with those found by numerical investigations. So far, the critical temperatures have been determined only by trial-and-error numerical simulations. On a set of N-queens problems for a wide range of problem sizes N, the analytically determined critical temperatures predict the optimal working temperatures for SONN intermittent search very well. It is also shown that no intermittent search can exist in SONN for temperatures greater than one-half.

1 Introduction

Since the pioneering work of Hopfield (1982), adaptation of neural computation techniques to solving difficult combinatorial optimization problems has proved useful in numerous application domains (Smith, 1999). In particular, a self-organizing neural network (SONN) was proposed as a general methodology for solving 0-1 assignment problems (Smith, 1995). The methodology has been successfully applied in a wide variety of applications, from assembly line sequencing to frequency assignment in

Neural Computation 19, 1056–1081 (2007)
C 2007 Massachusetts Institute of Technology
mobile communications (see, e.g., Smith, Palaniswami, & Krishnamoorthy, 1998). The key difference between traditional self-organizing maps (SOM) (Kohonen, 1982) and SONN lies in the notion of neighborhood of the winning unit (the unit most responsive to the current input). In the Kohonen-type SOM, neural units play the role of codebook vectors in a constrained vector quantization of the input space. Close units on the map represent close regions of the input space. As the map develops, the notion of closeness on the map changes only in scale, while the topology of neighborhood relations remains unchanged. In contrast, SONNs adopt a more flexible notion of the winning unit neighborhood. Neural units represent partial solutions to an assignment optimization problem. The criterion for closeness of two units is their relative performance in solving the optimization task. In this view, the neighborhood of the winning unit contains only relatively successful partial assignment solutions to the optimization task being solved. There is no predetermined neighborhood topology on the neural units. However, both SOM and SONN share the philosophy of neighborhood-based updating of the parameters (weights). The winner unit, as well as units from its neighborhood, are modified to respond better to the current input stimulus. Searching for 0-1 solutions in general assignment optimization problems can be made more effective when performed in a continuous domain, with values in the interval (0, 1) representing partial (soft) assignments (e.g., Kosowsky & Yuille, 1994). Typically the softmax function is employed to ensure that elements within a set of positive parameters sum up to one. The softmax function was incorporated into SONN in Guerrero, Lozano, Smith, Canca, and Kwok (2002) and was used in other optimization frameworks (e.g., in Gold & Rangarajan, 1996; Rangarajan, 2000). When endowed with a physics-based Boltzmann distribution interpretation, the softmax function contains a free parameter: temperature T (or inverse temperature T^{-1}). As the system cools, the assignments become increasingly crisp. Recently interesting observations have been made regarding the appropriate values of the temperature parameter when solving assignment problems with SONN endowed with softmax renormalization of the weight parameters (Kwok & Smith, 2002, 2004). There is a critical temperature T_* at which the SONN achieves superior performance in terms of both quality and efficiency. It has been suggested that the critical temperature may be closely related to the symmetry-breaking bifurcation of equilibria in the autonomous softmax dynamics (Kwok & Smith, 2003). Of even greater interest is the finding that at T_*, when the neighborhood size and learning rate are kept fixed at appropriate values, SONN is capable of powerful intermittent search through a multitude of high-quality solutions represented as meta-stable states of the SONN dynamics (Kwok & Smith, 2005). Kwok and Smith (2005) numerically studied global dynamical properties of SONN in the intermittent search mode and argued that such models display
ˇ P. Tino
1058
characteristics of systems at the edge of chaos (Langton, 1990; Crutchfield & Young, 1990). In this contribution, we seek to shed more light on the phenomenon of critical temperatures and intermittent search in SONN. In particular, since the critical temperature is closely related to bifurcations of equilibria in autonomous iterative softmax systems (ISM), we rigorously analyze the number, position, and stability types of fixed points of ISM. Moreover, we offer analytical approximations to the critical temperature as a function of the ISM dimensionality. So far, the critical temperatures have been determined only by trial-and-error numerical investigations. This letter is organized as follows. After a brief introduction to SONN in section 2, we formally define ISM in section 3. The numbers and positions of ISM equilibria are studied in section 4, and their stability is investigated in section 5. Analytical approximations to the critical temperature for SONN intermittent search are derived and verified in section 6. Section 7 summarizes the key findings. 2 Self-Organizing Neural Network with SoftMax Weight Renormalization In this section we briefly introduce self-organizing neural networks (SONN) endowed with weight renormalization for solving assignment optimization problems (see, e.g., Kwok & Smith, 2005). Consider a finite set of inputs j ∈ J = {1, 2, . . . , M} that need to be assigned to outputs i ∈ I = {1, 2, . . . , N}, so that a global cost (potential) Q(A) of an assignment A : J → I is minimized. Partial cost of assigning j ∈ J to i ∈ I is denoted by V(i, j). Both the input and output elements j ∈ J and i ∈ I, respectively, are represented through one-hot-encoding, that is, there is one input and one output unit exclusively devoted to each input and output element, respectively. The strength of assigning j to i is represented by the weight wi, j ∈ [0, 1]. The SONN algorithm consists of the following steps: 1) Initialize connection weights wi, j , j ∈ J , i ∈ I, to random values around 0.5. 2) Choose at random an input item jc ∈ J , and calculate partial costs V(i, jc ), i ∈ I, of all possible assignments of jc . 3) The winner output node i( jc ) ∈ I is the one that minimizes V(i, jc ), that is, i( jc ) = argmini∈I V(i, jc ). The neighborhood B L (i( jc )) of size L of the winner node i( jc ) consists of L nodes i = i( jc ), which yield the smallest partial costs V(i, jc ). 4) Weights of nodes i ∈ B L (i( jc )) get strengthened, and those outside B L (i( jc )) are left unchanged: wi, jc ← wi, jc + η(i)(1 − wi, jc ), i ∈ B L (i( jc )),
Critical Temperatures for Self-Organizing Neural Networks
1059
where |V(i( jc ), jc ) − V(i, jc )| η(i) = β exp − , |V(k( jc ), jc ) − V(i, jc )| and k( jc ) = argmaxi∈I V(i, jc ). 5) Weights w j = (w1, j , w2, j , . . . , w N, j ) for each input node j ∈ J are normalized using softmax (here, denotes the transpose operator): exp
wi, j ← N
wi, j wk, j .
T
k=1 exp
T
6) Repeat from step 2 until all inputs jc ∈ J have been selected (one epoch). Although the soft assignments wi, j evolve in continuous space, when needed, a 0-1 assignment solution can be produced by imposing A( j) = i if and only if i = argmaxk∈I wk, j . A frequently studied (NP-hard) assignment optimization problem in case of SONN is the N-queen problem (e.g., Kwok & Smith, 2004, 2005). Place N queens on an N × N chessboard without attacking each other. In this case, J = {1, 2, . . . , N} and I = {1, 2, . . . , N} index the columns and rows, respectively, of the chessboard. Partial cost V(i, j) evaluates the diagonal and column contributions (in the sense of directions on the chessboard) to the global cost Q of placing a queen on column j of row i. (For more details, see, Kwok & Smith, 2004, 2005.) Many neural-based approaches and heuristic techniques have been proposed to solve the N-queen problem (e.g., Minton, Johnston, Philips, & Laird, 1992; Tagliarini & Page, 1987). Previous SONN studies of the N-queen problem used the problem as a convenient test-bed for illustrating optimization capabilities of SONN, without intending to devise the most efficient algorithm. The N-queen problem is used in this study in a similar manner. Kwok and Smith (2003, 2005) argue that step 5 of the SONN algorithm is crucial for intermittent search by SONN for globally optimal assignment solutions. In particular, they note that temperatures at which symmetrybreaking bifurcation of equilibria of the renormalization procedure in step 5 occur correspond to temperatures at which optimal (in terms of both quality and quantity of found solutions) intermittent search takes place. In the following sections, we rigorously analyze equilibria of softmax viewed as an autonomous dynamical system. Our findings will later allow us to formulate analytical approximations to the critical temperatures at which symmetry-breaking bifurcations of the softmax renormalization dynamics take place.
ˇ P. Tino
1060
3 Iterative Softmax Denote the (N − 1)-dimensional simplex in R N by SN−1 , that is, SN−1 = w = (w1 , w2 , . . . , w N ) ∈ R N | wi ≥ 0,
i = 1, 2, . . . , N, and
N
wi = 1 .
(3.1)
i=1 0 The interior of SN−1 is denoted by SN−1 :
0 SN−1
= w ∈ R | wi > 0, i = 1, 2, . . . , N, and N
N
wi = 1 .
(3.2)
i=1
Given a parameter T > 0 (the temperature), the softmax maps R N into 0 SN−1 : w → F(w; T) = (F1 (w; T), F2 (w; T), . . . , F N (w; T)) ,
(3.3)
where exp( wTi ) Fi (w; T) = N , i = 1, 2, . . . , N. wk k=1 exp( T )
(3.4)
We will denote the common normalization factor of Fi ’s by Z(w; T), that is, Z(w; T) =
N k=1
exp
w
k
T
.
(3.5)
0 Linearization of F around w ∈ SN−1 is given by the Jacobian J (w; T), the (i, j)th element of which reads,
J(w; T)i, j =
1 [δi, j Fi (w; T) − Fi (w; T)F j (w; T)], T
(3.6)
where δi, j = 1 iff i = j and δi, j = 0 otherwise. The Jacobian is a symmetric N × N matrix. 0 The softmax map F induces on SN−1 a discrete-time dynamics, w(t + 1) = F(w(t); T),
(3.7)
Critical Temperatures for Self-Organizing Neural Networks
1061
sometimes referred to as iterative softmax (ISM). In the next section, we study the relationship between the number and position of equilibria of ISM and the temperature parameter T. We study systems for N ≥ 2. 4 Fixed Points of Iterative Softmax 0 Recall that w ∈ SN−1 is a fixed point (equilibrium) of the ISM dynamics driven by F, if w = F(w). It is easy to see that for any temperature setting,1 there is always at least one fixed point of ISM (see equation 3.7). 0 Lemma 1. The maximum entropy point w = (N−1 , N−1 , . . . , N−1 ) ∈ SN −1 is a fixed point of ISM, equation 3.7, for any T ∈ R.
Proof. The result directly follows from w ¯1 = w ¯2 = ··· = w ¯ N. There is a strong structure in the fixed points of ISM: coordinates of any fixed point of ISM can take on only up to two distinct values. Theorem 1. Except for the maximum entropy fixed point w = (N−1 , . . . , N−1 ) , for all the other fixed points w = (w1 , w2 , . . . , w N ) of ISM, equation 3.7, it holds: wi ∈ {γ1 (w; T), γ2 (w; T)}, i = 1, 2, . . . , N, where γ1 (w; T) > N−1 and γ2 (w; T) < N−1 are the two solutions of Z(w; T) · x = exp( Tx ). Proof. We have wi = Fi (w; T) =
exp( wTi ) , i = 1, 2, . . . , N. Z(w; T)
That means Z(w; T) · wi = exp
w
i
T
,
and so all the coordinates of w must lie on the intersection of the line Z(w; T) · x and the exponential function exp( Tx ). Since the exponential is a convex function, any line can intersect with it in at most two points. Because w = w, there must be a coordinate wi of w such that wi = N−1 . 0 If wi < N−1 , then (since w ∈ SN−1 ) there exists another coordinate w j of −1 w, such that w j > N . By the argument above, all of the other coordinates of w must be equal to either wi or w j . In addition, wi , w j are solutions of Z(w; T) · x = exp( Tx ). The case wi > N−1 can be treated analogously.
1
In ISM, T is usually positive, but this claim is true for any T ∈ R.
ˇ P. Tino
1062
We will often write the larger of the two fixed-point coordinates as γ1 (w; T) = α N−1 , α ∈ (1, N).
(4.1)
Theorem 2. Fix α ∈ (1, N), and write γ1 = α N−1 . Let min be the smallest natural number greater than (α − 1)/γ1 . Then, for ∈ {min , min + 1, . . . , N − 1}, at temperature
α − 1 −1 , Te (γ1 ; N, ) = (α − 1) − · ln 1 − γ1
(4.2)
there exist N distinct fixed points of ISM (3.7) with (N − ) coordinates having value γ1 and coordinates equal to γ2 =
1 − γ1 (N − ) .
(4.3)
Proof. First, we verify that min can never be greater than N − 1. It is sufficient to show that α−1 < N − 1. γ1
(4.4)
Equation 4.4 is equivalent to stating N > α, which is automatically fulfilled since α ∈ (1, N). 0 Now consider ∈ {min , min + 1, . . . , N − 1}, and a fixed point w ∈ SN−1 of ISM having N − coordinates equal to γ1 . Then w has coordinates of 0 , we have value γ2 < N−1 . Because w ∈ SN−1 (N − )γ1 + γ2 = 1. It follows that γ2 =
1 − γ1 (N − ) .
(4.5)
Note that γ2 must be positive, and so γ1 < (N − )−1 . This is indeed the case, 0 since otherwise w would not lie in SN−1 . Normalizing constant Z(w; T) is equal to Z(w; T) = (N − ) · exp
γ
1
T
+ · exp
γ
2
T
.
Critical Temperatures for Self-Organizing Neural Networks
1063
Because w is a fixed point of ISM, we have exp γT1 γ1 = . Z(w; T) Hence,
γ2 − γ1 = 1. γ1 · (N − ) + exp T
Consequently, γ1 · 1 +
1−α = (N − )−1 . exp N− T
Equation 4.6 can be reformulated to
α−1 α−1 exp − =1− . T γ1
(4.6)
(4.7)
Since ≥ min > (α − 1)/γ1 , the right-hand side of equation 4.7 is always positive. Given the number of coordinates with value γ2 , we can solve for the temperature Te (γ1 ; N, ) satisfying equation 4.7:
α − 1 −1 Te (γ1 ; N, ) = (α − 1) − · ln 1 − . (4.8) γ1 To conclude the proof, we need to show that γ2 in equation 4.5 satisfies the fixed-point condition exp γT2 γ2 = . (4.9) Z(w; T) From equation 4.9, we have
−1 γ1 − γ2 γ2 = (N − ) exp . + T Denote (γ1 − γ2 )/T by ω. Then γ1 = [N − + exp(−ω)]−1 γ2 = [(N − ) exp(ω) + ]−1 . By equation 4.5, 1 N− 1− , γ2 = N − + exp(−ω) which reduces to equation 4.10.
(4.10)
ˇ P. Tino
1064
1 0.8 0.6 0.4
h(x;T)
0.2 0
g(x)
−0.2 0
0.05
0.1
0.15
0.2 x
0.25
0.3
0.35
0.4
Figure 1: A geometric interpretation of the temperature Te (γ1 ; N, ). The decreasing exponential function h(x; T) is intersected by the decreasing line g(x). Given a particular number , temperature T can be set to Te (γ1 ; N, ), so that h(x; Te (γ1 ; N, )) intersects g(x) in x0 = −1 . In this case, N = 10, α = 1.5, γ1 = 0.15, min = 4, = 5, Te (γ1 ; N, ) = 0.0910.
By symmetry of the argument, there are N possibilities for equilibria of ISM with N − and coordinates equal to γ1 and γ2 , respectively. For a geometric interpretation of the temperature Te (γ1 ; N, ), consider functions
α−1 h(x; T) = exp − x and T g(x) = 1 −
α−1 x, γ1
(4.11) (4.12)
operating on (0, ∞) and plotted in Figure 1. The decreasing exponential h(x; T) is intersected by the decreasing line g(x) at x0 . Given a particular number of coordinates of value γ2 , we can set T to T = Te (γ1 ; N, ), so that the exponential h(x; Te (γ1 ; N, )) intersects g(x) in x0 = −1 . Corollary 1. Consider the ISM (3.7). For ∈ {0, 1, . . . , N − 1}, define α =
N . N−
Critical Temperatures for Self-Organizing Neural Networks
1065
Fix α ∈ (1, N) ∪ [α−1 , α ), ∈ {1, . . . , N − 1}. Then, by varying the temperature parameter T, the ISM can have Q N (α) =
N−1 N k=
k
distinct fixed points. Proof. By theorem 2, for α ∈ (1, N) ∪ [α−1 , α ), there are temperature set0 tings (see equation 4.2) so that points w ∈ SN−1 with N − k and k coordinates −1 equal to γ1 = α N and γ2 (see equation 4.3), respectively, are equilibria of the ISM, k = , + 1, . . . , N − 1. For this value of α, no other fixed points are possible. This is easily observed, since for α ∈ (1, N) ∪ [α−1 , α ), the lower bound min on the number of smaller coordinates is equal to . Since α0 = 1 < α1 < α2 < · · · < α N−1 = N, corollary 1 tells us that the richest structure of equilibria of ISM can be found in a small neighborhood of the maximum entropy fixed point w. The fewest fixed points N exist in the = N fixed corner areas of SN−1 (α N−2 = N/2 ≤ α < α N−1 = N): only N−1 points are possible—points at the vertexes of SN−1 . In addition, we can analytically describe a decreasing trend in the number of possible equilibria as one diverges from w. Whenincreasing α crosses α , the number of pos sible fixed points decreases by N . As an example, we show in Figure 2 the number of possible fixed points as a function of the larger coordinate γ1 in a 10-dimensional ISM system.
5 Fixed-Point Stability in Iterative Softmax 0 Lemma 2. Let w ∈ SN− 1 , w = w be a fixed point of ISM with larger and smaller coordinates equal to γ1 and γ2 , respectively. Then γ1 is never further from 1 /2 than γ2 is, that is,
γ1 −
1 ≤ γ2 − 2
1 . 2
(5.1)
Proof. Assume first that γ1 ∈ (0, 12 ]. Then since γ1 > γ2 , equation 5.1 is automatically satisfied. 0 Assume now that γ1 ∈ ( 12 , 1). Since w ∈ SN−1 , there can be only one co1 ordinate of w with value γ1 . Also, γ2 < 2 . The coordinates must sum up to one, and so γ1 = 1 − (N − 1)γ2 .
ˇ P. Tino
1066 1200
# possible fixed points
1000 800 600 400 200 0 0.1
0.2
0.3
0.4
0.5 0.6 gamma1
0.7
0.8
0.9
1
Figure 2: Number of possible fixed points as a function of the larger coordinate γ1 in ISM with N = 10. The farther we move from the maximum entropy equilibrium w = (0.1, 0.1, . . . , 0.1) , the fewer fixed points are possible.
For N ≥ 2 we have2 γ1 − 1 = γ1 − 1 = 1 − (N − 1)γ2 ≤ 1 − γ2 = γ2 − 2 2 2 2
1 . 2
0 Theorem 3. Consider a fixed point w ∈ SN− 1 of ISM (3.7) with one of its coor−1 dinates equal to N ≤ γ1 < 1. Then if
T > Ts,1 (γ1 ) = 2 γ1 (1 − γ1 ),
(5.2)
T > Ts,2 (γ1 ) = γ1 ,
(5.3)
or if
the fixed point w is stable. Proof. Jacobian J (w; T) of the ISM map, equations 2.3 and 2.4, is given by equation 3.6. The Jacobian is a symmetric matrix, and so the column and row sum norms coincide: N |J (w; T)i, j | . (5.4)
J (w; T) ∞ = max i=1,2,...,N j=1
2
Recall that we study systems for N ≥ 2.
Critical Temperatures for Self-Organizing Neural Networks
1067
The sum of absolute values in the ith row of the Jacobian is equal to Si =
Fi (w; T) (1 − Fi (w; T)) + T
wi (1 − wi ) + wj = T j=i
F j (w; T)
(5.5)
j=i
wi [(1 − wi ) + (1 − wi )] T 2 = wi (1 − wi ). T
=
(5.6) (5.7) (5.8)
Note that equation 5.6 follows form 5.5 because w is a fixed point of F, that 0 is, w = F(w), and equation 5.7 follows form 5.6 because w ∈ SN−1 . We have
J (w; T) ∞ = max {Si } i=1,2,...,N
=
2 T
max
i=1,2,...,N
q (wi ) ,
(5.9) (5.10)
where q : [0, 1] → [0, 1/4] is a unimodal function q (x) = x(1 − x) with maximum at x = 1/2. Point x = 1/2 is also the point of symmetry. Hence,
J (w; T) ∞ =
2 q (w), ˜ T
where w ˜ is the coordinate value of w closest to 1/2. By theorem 1, there can be at most two distinct values taken on by the coordinates of w. If w = w, all the coordinates are the same, and w ˜ = N−1 . 3 −1 If w = w, by lemma 2, it is the bigger value of the two, γ1 > N , that is close to 1/2. Hence, w ˜ = γ1 . It follows that if one of the coordinates of w is equal to N−1 ≤ γ1 < 1, then J (w; T) ∞ = 2q (γ1 )/T. Now, if J (w; T) ∞ < 1, the fixed point w is contractive. In other words, if T > Ts,1 (γ1 ) = 2γ1 (1 − γ1 ), w is a stable fixed point of ISM (see equation 3.7). Note that all eigenvalues of Jacobian J (w; 1) (ISM operating at temperature 1) are upper-bounded by max1≤k≤N Fk (w; 1) (Elfadel & Wyatt, 1994). It follows that all eigenvalues of J (w; T) are upper-bounded by
3
Note that the smaller value γ2 is automatically less than N−1 .
ˇ P. Tino
1068
max1≤k≤N T −1 Fk (w; T). For each fixed point w of ISM operating at temperature T, w = F(w; T), and so by theorem 1, all eigenvalues of J (w; T) are upper-bounded by γ1 T −1 . Hence, the spectral radius ρ(J (w; T)) of J (w; T) (the maximal absolute value of eigenvalues of J (w; T)) is upper-bounded by γ1 T −1 . If ρ(J (w; T)) < 1, that is if T > Ts,2 (γ1 ) = γ1 , w is a stable equilibrium of ISM, equation 3.7. Since Ts,2 (γ1 ) ≤ Ts,1 (γ1 ) on (N−1 , 1/2] and Ts,2 (γ1 ) > Ts,1 (γ1 ) on (1/2, 1), it is useful to combine the two bounds into a single one. 0 Corollary 2. Consider a fixed point w ∈ SN− 1 of ISM, equation 3.7, with one of its coordinates equal to N−1 ≤ γ1 < 1. Define
Ts (γ1 ) =
Ts,2 (γ1 ), Ts,1 (γ1 ),
if if
γ1 ∈ [N−1 , 1/2) γ1 ∈ [1/2, 1).
(5.11)
Then if T > Ts (γ1 ), the fixed point w is stable. Lemma 1 tells us that the maximum entropy point w = (N−1 , N , . . . , N−1 ) is an equilibrium of ISM, equation 3.7, for any temperature setting T. By theorem 3, for T > Ts (N−1 ) = N−1 , w is guaranteed to be stable. The next theorem claims that for high temperature settings, no fixed points other than w exist, that is, w is the unique equilibrium of ISM. −1
w=
Theorem 4. For T > 1/2, the maximum entropy point (N−1 , N−1 , . . . , N−1 ) is the unique equilibrium of ISM, equation 3.7. 0 , Proof. For any w ∈ SN−1
J (w; T) ∞ = max
i=1,2,...,N
= max
i=1,2,...,N
≤
Fi (w; T) F j (w; T) (1 − Fi (w; T)) + T j=i 2 [Fi (w; T)(1 − Fi (w; T))] T
1 21 = . T4 2T
(5.12)
For T > 1/2, the ISM, equation 3.7, is a global contraction. By the Banach fixed point theorem, there exists a unique attractive fixed point of the ISM. But w is always a fixed point of the ISM. It follows that for T > 1/2, the only fixed point of ISM, equation 3.7, is w. In addition, w is attractive. Theorem 4 sharpens the statement by Elfadel and Wyatt (1994) that points of SN−1 converge under ISM, equation 3.7, to w as T → ∞.
Critical Temperatures for Self-Organizing Neural Networks
1069
We now turn our attention to conditions under which fixed points of ISM are not stable. 0 Theorem 5. Consider a fixed point w ∈ SN− 1 of ISM, equation 3.7, with one of its coordinates equal to N−1 ≤ γ1 < 1. Let N − be the number of coordinates of value γ1 . Then if
T < Tu (γ1 ; N, ) = γ1 (2 − Nγ1 )
N− 1 1 + − , N N N
(5.13)
w is not stable. Proof. Let ρ(J (w; T)) be the spectral radius of the Jacobian J (w; T). If for a fixed point w of ISM, equation 3.7, ρ(J (w; T)) > 1, the fixed point is not stable. Note that N N 1 1 |trace(J (w; T))| ρ(J (w; T)) ≥ |λi | ≥ λi = , N N N i=1
i=1
0 where λi are eigenvalues of the Jacobian J (w; T). By theorem 2, w ∈ SN−1 has to have coordinates equal to
γ2 =
1 − γ1 (N − ) .
We have 1 [(N − )γ1 (1 − γ1 ) + γ2 (1 − γ2 )] T 1 = [γ1 (2 − Nγ1 )(N − ) + − 1] . T
|trace(J (w; T))| =
The result follows by imposing |trace(J (w; T))| > 1. N Note that by theorem 5, for temperatures T < Tu (N−1 ; N, N) = N−1 − N−2 = Ts (N−1 ) − N−2 , the maximum entropy equilibrium w of ISM, equation 3.7, cannot be attractive. This bound can be improved by realizing that the Jacobian J (w; T),
ˇ P. Tino
1070 stability and existence of ISM fixed points (N=10) 0.5 stable equilibria
stable equilibria 0.4 potentially saddle−type equilibria
T
0.3
0.2 N1=2 N1=1 N1=3
0.1
N1=4
unstable equilibria
0
0.1
0.2
0.3
0.4
0.5 0.6 gamma1
0.7
0.8
0.9
1
Figure 3: Stability regions for equilibria w = w of ISM with N = 10 units as a function of the larger coordinate γ1 and temperature T. The number of coordinates with value γ1 is denoted by N1 . Temperatures Ts (γ1 ) (see equation 5.11) above which equilibria with a larger coordinate equal to γ1 are guaranteed to be stable are shown by a solid bold line. For N1 = N − ∈ {1, 2, 3, 4}, we show the temperatures Tu (γ1 ; N, ) (see equation 5.13) below which equilibria with larger coordinate equal to γ1 are guaranteed to be unstable with solid normal lines. Also plotted are temperatures Te (γ1 ; N, ) (see equation 4.2) at which equilibria with the given number N1 of coordinates of value γ1 exist (dashed lines).
equation 3.6, of ISM, equation 3.7, at w has exactly one zero eigenvalue (corresponding to the eigenvector (1, 1, . . . , 1) ) and N − 1 eigenvalues equal to (NT)−1 . Hence, the spectral radius is equal to ρ(J (w; T)) =
1 . NT
(5.14)
It follows that the maximum entropy equilibrium w is attractive and nonattractive if T > N−1 and T < N−1 , respectively. An illustrative summary of the previous results for equilibria w = w is provided in Figure 3. The ISM has N = 10 units. Coordinates of such fixed points can take on only two possible values, the larger of which we denote by γ1 . Temperatures Ts (γ1 ) (see equation 5.11), above which
Critical Temperatures for Self-Organizing Neural Networks
1071
equilibria with larger coordinate equal to γ1 are guaranteed to be stable, are shown as the solid bold line. Denote the number of coordinates with value γ1 by N1 : N1 = N − . For N1 = 1, 2, 3, 4, we show temperatures Tu (γ1 ; N, ), equation 5.13, below which equilibria with larger coordinate equal to γ1 are guaranteed to be unstable with solid normal lines.4 For a given value of γ1 , temperatures between Tu (γ1 ; N, ) and Ts (γ1 ) potentially correspond to saddle-type equilibria. We also plot temperatures Te (γ1 ; N, ), equation 4.2, at which equilibria with the given number N1 of coordinates of value γ1 exist (dashed lines). The figure suggests that most of ISM equilibria with more than one larger coordinate γ1 , that is, N1 ≥ 2, are of saddle type. Only fixed points with extreme values of γ1 close to N1−1 are guaranteed to be unstable. These findings were numerically confirmed by eigenvalue decomposition of Jacobians of a large number of fixed points corresponding to the dashed lines. Note that apart from the maximum entropy equilibrium w, only the extreme equilibria close to the vertexes of simplex SN−1 can be guaranteed by our theory to be stable. This happens when the dashed line for N1 = 1 crosses the solid bold line. We now show that the character of the partitioning of the (γ1 , T)-space with respect to stability regimes of ISM equilibria remains unchanged for other values of N (dimensionality of ISM). Lemma 3. Consider equilibria of an N-dimensional (N > 2) ISM, equation 3.7, with at least two larger coordinates of value γ1 : ≤ N − 2 and γ1 ∈ (N−1 , (N − )−1 ). Then, for the temperature function, equation 4.2,
γ1 T,N (γ1 ) = Te (γ1 ; N, ) = (Nγ1 − 1) · ln γ1 ( − N) + 1
−1
,
(5.15)
guaranteeing the existence of such equilibria, it holds: 1. T,N (N−1 ) = Ts (N−1 ). 2. T,N is concave. (N−1 ) = 1 − N/(2). 3. T,N
Proof. 1. At γ1 = 1/N, both the numerator and denominator of equation 5.15 are zero. By L’Hospital rule5 lim T,N (γ1 ) = lim
γ1 →N−1
4 5
γ1 →N−1
Nγ1 [γ1 ( − N) + 1] 1 = = Ts (N−1 ). (5.16) N
0 Note that since w ∈ SN−1 , γ1 must be smaller than 1/N1 . We slightly abuse mathematical notation by a considering γ1 a continuous variable.
ˇ P. Tino
1072
2. The first derivative of T,N is calculated as T,N (γ1 ) =
ln
N γ1 γ1 (−N)+1
−
ln
2
Nγ1 − 1
. γ12 ( − N) + γ1
γ1 γ1 (−N)+1
(5.17)
γ1 . We write To ease the presentation, ln(·) will stand for ln γ1 (−N)+1
T,N (γ1 ) = −1 (A(γ1 ) − B(γ1 )), where
A(γ1 ) =
N ln(·)
and B(γ1 ) =
ln(·)2 γ
Nγ1 − 1 . 1 [γ1 ( − N) + 1]
The derivative of A(γ1 ), A (γ1 ) = −
N , ln(·)2 γ1 [γ1 ( − N) + 1]
is negative, as for γ1 ∈ (N−1 , (N − )−1 ), we have γ1 ( − N) + 1 > 0. Denote ln(·)2 γ1 [γ1 ( − N) + 1] by C. Then, B (γ1 ) = C −2 (D(γ1 ) ln(·)2 + 2 ln(·)), where D(γ1 ) = N(N − )γ12 − 2(N − )γ1 + 1. Now, D(γ1 ) is a convex function with minimum at N−1 and D(N−1 ) = /N > 0. Also note that since γ1 > N−1 , we have γ1 > 1, γ1 ( − N) + 1 and so ln(·) > 0. (γ1 ) < 0 and T,N is From A (γ1 ) < 0 and B (γ1 ) > 0, it follows that T,N concave. 3. The result follows from evaluation of limγ1 →N−1 T,N (γ1 ) based on equation 5.17. After applying the L’Hospital rule twice, we get
γ1
lim T,N (γ1 ) = 1 − →N−1
N . 2
We are now ready to prove that for equilibria of ISM, equation 3.7, with at least two larger coordinates γ1 > N−1 , the temperatures at which they exist, Te (γ1 ; N, ), are always below the bound Ts (γ1 ), above which their stability is guaranteed.
Critical Temperatures for Self-Organizing Neural Networks
1073
Theorem 6. Let N > 2 and ≤ N − 2. Then, for γ1 ∈ (N−1 , (N − )−1 ), it holds: Te (γ1 ; N, ) < Ts (γ1 ). Proof. By lemma 3 (point 2 of lemma 3), T,N (γ1 ) = Te (γ1 ; N, ) is a concave function and hence can be upper-bounded on (N−1 , (N − )−1 ) by the tangent line κ(γ1 ) to T,N (γ1 ) at N−1 : T,N (γ1 ) ≤ κ(γ1 ), where, by equation 5.6 and lemma 3 (points 1, 3), κ(γ1 ) =
1 1 N + 1− γ1 − . N 2 N
Now, Ts (γ1 ) is a line,
1 1 Ts (γ1 ) = + γ1 − . N N Since N/(2) > 0, we have that on (N−1 , (N − )−1 ), Te (γ1 ; N, ) < Ts (γ1 ). Hence, the tangent κ(γ1 ) upper-bounding Ts (γ1 ) never crosses Ts (γ1 ) and always stays below it. As an illustrative example, we present an analog of Figure 3 for a 17-dimensional ISM in Figure 4. The number of coordinates with value γ1 is denoted by N1 . Temperatures Ts (γ1 ) (see equation 5.11) above which equilibria with larger coordinates equal to γ1 are guaranteed to be stable are shown with a solid bold line. For N1 = N − ∈ {1, 2, 3, 4}, we show the temperatures Tu (γ1 ; N, ) (see equation 5.13) below which equilibria with a larger coordinate equal to γ1 are guaranteed to be unstable with solid normal lines. Temperatures Te (γ1 ; N, ) (see equation 4.2) at which equilibria with the given number N1 of coordinates of value γ1 exist (dashed lines) are also marked according to the stability type of the corresponding fixed points. The stability types were determined by eigenanalysis of Jacobians (see equation 3.6) at the fixed points. Stable and unstable equilibria existing at temperature Te (γ1 ; N, ) are shown as stars and circles, respectively. All unstable equilibria are of saddle type. The horizontal dashed line shows a numerically determined temperature by Kwok and Smith (2005) at which attractive equilibria of 17-dimensional ISM lose stability and the maximum entropy point w remains the only stable fixed point. The position where Ts (γ1 ) crosses Te (γ1 ; N, ) for N1 = 1 is marked by a bold circle. Note that no equilibrium with more than one coordinate greater than N−1 is stable.
ˇ P. Tino
1074 stability types of ISM fixed points (N=17) 0.5 0.45 0.4 0.35 N1=1
T
0.3 0.25 0.2 0.15
N1=2
0.1 N1=3
0.05 N1=4
0 0.1
0.2
0.3
0.4
0.5 0.6 gamma1
0.7
0.8
0.9
1
Figure 4: Stability types for equilibria w = w of ISM with N = 17 units as a function of the larger coordinate γ1 and temperature T. The number of coordinates with value γ1 is denoted by N1 . Temperatures Ts (γ1 ) (see equation 5.11) above which equilibria with larger coordinates equal to γ1 are guaranteed to be stable are shown with a solid bold line. For N1 = N − ∈ {1, 2, 3, 4}, we show the temperatures Tu (γ1 ; N, ) (see equation 5.13) below which equilibria with larger coordinate equal to γ1 are guaranteed to be unstable with solid normal lines. Temperatures Te (γ1 ; N, ) (see equation 4.2) at which equilibria with the given number N1 of coordinates of value γ1 exist (dashed lines) are also marked according to the stability type of the corresponding fixed points. Stable and unstable equilibria existing at temperature Te (γ1 ; N, ) are shown as stars and circles, respectively. The horizontal dashed line shows a numerically determined temperature by Kwok and Smith (2005) at which attractive equilibria of 17-dimensional ISM lose stability and the maximum entropy point w remains the only stable fixed point. The bold circle marks the position where Ts (γ1 ) crosses Te (γ1 ; N, ) for N1 = 1.
6 Critical Temperature for Intermittent Search in SONN with Softmax Weight Renormalization It has been hypothesized that ISM provides an underlying driving force behind intermittent search in SONN with softmax weight renormalization (Kwok & Smith, 2003, 2005) (see section 2). Kwok and Smith (2005) argue that the critical temperature at which the intermittent search takes place
Critical Temperatures for Self-Organizing Neural Networks
1075
0.5 0.45 0.4 0.35
T
0.3 0.25 0.2 0.15 0.1
N=15 N=40
0 0.5
N=5
N=10
0.05
0.55
N=30 0.6
0.65
0.7
0.75 gamma1
0.8
0.85
0.9
0.95
1
Figure 5: Approximations of the bifurcation point based on equation 4.2 and bound, equation 5.11, for increasing ISM dimensionalities N.
corresponds to the bifurcation point of the autonomous ISM dynamics when the existing equilibria lose stability and only the maximum entropy point w survives as the sole stable equilibrium. The authors numerically determined such bifurcation points for several ISM dimensionalities N. It was reported that bifurcation temperatures decreased with increasing N. Based on the analysis in section 5, the bifurcation points correspond to the case when equilibria near corners of the simplex SN−1 (equilibria with only one coordinate with large value γ1 ) lose stability. Based on the bound Ts (γ1 ) (see equation 5.11) and temperatures Te (γ1 ; N, N − 1) (see equation 4.2) at which equilibria of ISM exist, such a bifurcation point can be approximated by the bold circle in Figure 4. Figure 5 shows the evolution of the approximation of the bifurcation point (based on Te (γ1 ; N, N − 1) (see equation 4.2) and bound Ts (γ1 ) (see equation 5.11) for increasing ISM dimensionalities N. The bound, equation 5.11, is independent of N. In accordance with the empirical findings of Kwok and Smith, the bifurcation temperature is decreasing with increasing N. We present two approximations to the critical temperature T∗ (N) at which the bifurcation occurs. The first 1 expands Te (γ1 ; N, N − 1) (see equation 4.2) (2) around γ10 as a second-order polynomial TN−1 (γ1 ). Based on Figure 5, a good 0 0 choice for γ1 is γ1 = 0.9. Approximation to the critical temperature is then
ˇ P. Tino
1076
obtained by solving (2)
TN−1 (γ1 ) = Ts (γ1 ) (2)
for γ1 , and then plugging the solution γ1 (2) calculating Ts (γ1 ). We have
back to bound 5.11, that is,
Te (γ1 ; N, N − 1) =
Nγ1 − 1
, 1 (N − 1) ln (N−1)γ 1−γ1
Te (γ1 ; N, N − 1) =
N Nγ1 − 1
−
(N−1)γ1 1 (N − 1) ln 1−γ1 (N − 1)γ1 (1 − γ1 ) ln2 (N−1)γ 1−γ1
(5.18)
(5.19) and
1 Nγ12 − 2γ1 + 1 − 2 ln−1 (N−1)γ 1−γ 1
. Te (γ1 ; N, N − 1) = 2 (N−1)γ1 2 2 (N − 1)γ1 (1 − γ1 ) ln 1−γ1
(5.20)
By solving Aγ12 + Bγ1 + C = 0,
(5.21)
where A= 1 /2 Te γ10 ; N, N − 1 + 2, B = Te γ10 ; N, N − 1 − γ10 Te γ10 ; N, N − 1 − 2, 1 2 C = Te γ10 ; N, N − 1 − γ10 Te γ10 ; N, N − 1 + γ10 Te γ10 ; N, N − 1 2 (2)
and retaining the solution γ1 compatible with the requirement that (2) w ∈ SN−1 , we obtain an analytical approximation T∗ (N) to the critical temperature T∗ (N): (2)
T∗(2) (N) = 2γ1
(2)
1 − γ1
.
(5.22)
A cheaper approximation to T∗ (N) can be obtained by realizing that both functions in Figure 5 are almost linear in the neighborhood of their
Critical Temperatures for Self-Organizing Neural Networks
1077
intersection. First, we approximate the bound Ts (γ1 ), equation 5.11 (the solid bold line) by the tangent line at (1,0): v(γ1 ) = 2(1 − γ1 ).
(5.23)
Imposing v(γ1 ) = Te (γ1 ; N, N − 1) leads to y = 2(N − 1) ln(y) − 1,
(5.24)
where y=
(N − 1)γ1 . 1 − γ1
(5.25)
We expand the right-hand side of equation 5.24 around y0 (corresponding to γ10 ) as 2(N − 1) y + 2(N − 1)(1 + ln(y0 )) − 1 y0 and so the approximate solution to equation 5.24 reads y(1) =
2(N − 1)(1 + ln(y0 )) − 1 1−
2(N−1) y0
.
(5.26)
From equation 5.25, the corresponding value of γ1 is (1)
γ1 =
y(1) , N − 1 + y(1)
(5.27) (1)
yielding an approximation T∗ (N) to the critical temperature T∗ (N): (1)
T∗(1) (N) = 2γ1
(1)
1 − γ1
.
(5.28) (1)
(2)
To illustrate the approximations T∗ (N) and T∗ (N) of the critical temperature TB = T∗ (N), numerically found bifurcation temperatures for ISM systems with dimensions between 8 and 30 are shown in Figure 6 as circles. (2) The approximations based on quadratic and linear expansions, T∗ (N) and (1) T∗ (N), are plotted with bold solid and dashed lines, respectively. Also shown are the temperatures 1/N above which the maximum entropy equilibrium w is stable (solid normal line). At bifurcation temperature, w is already stable, and equilibria at vertexes of simplex SN−1 lose stability. The
ˇ P. Tino
1078 bifurcation temperatures 0.35
TB (qdr. approx) TB (lin. approx)
0.3
TB T (N−queens) *
0.25
T
0.2
0.15 max. entropy point stable 0.1
0.05
0 5
max. entropy point unstable
10
15
20 N
25
30
35
Figure 6: Analytical approximations of the bifurcation temperature TB = T∗ (N) for increasing ISM dimensionalities N. The approximations based on quadratic (2) (1) and linear expansions, T∗ (N) and T∗ (N), are plotted with bold solid and dashed lines, respectively. Numerically found bifurcation temperatures are shown as circles. The best-performing temperatures for intermittent search in the N-queens problems are shown as stars. Also shown are the temperatures 1/N above which the maximum entropy equilibrium w is stable. As Kwok and Smith suggested, there is a marked correspondence between the bifurcation temperatures and the best-performing temperatures in intermittent search by (2) (1) SONN. The analytical solutions T∗ (N) and T∗ (N) appear to approximate the bifurcation temperatures well. (2)
(1)
analytical solutions T∗ (N) and T∗ (N) appear to approximate the bifurcation temperatures well. We also numerically determined optimal temperature settings for intermittent search by SONN in the N-queens problems. Following Kwok and Smith (2005), we set the SONN parameter β to β = 0.8.6 The optimal neighborhood size increased with the problem size N from L = 2 (for N = 8), through L = 3 (N = 10, 13), L = 4 (N = 15, 17, 20), to L = 5
6 Our experiments confirmed findings by Kwok and Smith that the choice of β is in general not sensitive to N.
Critical Temperatures for Self-Organizing Neural Networks
1079
(N = 25, 30).7 Based on our extensive experimentation, the best-performing temperatures for intermittent search in the N-queens problems are shown as stars. Clearly, as suggested by Kwok and Smith, there is a marked correspondence between the bifurcation temperatures of ISM equilibria and the best-performing temperatures in intermittent search by SONN. The only equilibria that can be guaranteed to be stable by our theory are the maximum entropy point w and “one-hot” solutions at the vertexes of the N − 1 dimensional simplex SN−1 . Each of such one-hot solutions corresponds to one particular assignment of SONN inputs to SONN outputs. At the critical point of losing their stability, the one-hot solutions do not exhibit a strong attractive force in the ISM state space, and SONN weight updates can easily jump from one assignment to another, occasionally being pulled by the neutralizing power of the stable maximum entropy equilibrium w. This mechanism enables rapid intermittent search for good assignment solutions in SONN. Obviously much more work is needed to rigorously analyze the intricate influence of the neighborhood size, SONN weight updates, and softmax renormalization in a unified framework. This is a matter for our future work. It is highly encouraging, however, that the analytically obtained approximations of critical temperatures predict the optimal working temperatures for SONN intermittent search very well. So far, the critical temperatures have been determined only by extensive trial-and-error numerical investigations. Note that as a direct consequence of theorem 4, no intermittent search can exist in SONN for temperatures T > 1/2. 7 Conclusions A recently proposed new kind of optimization dynamics using selforganizing neural networks driven by softmax weight renormalization is capable of powerful intermittent search for high-quality solutions in difficult assignment optimization problems (Kwok & Smith, 2005). However, such dynamics is sensitive to temperature setting in the softmax renormalization step. Kwok and Smith (2005) have hypothesized that the optimal temperature setting corresponds to symmetry-breaking bifurcation of equilibria of the renormalization step, when viewed as an autonomous dynamical system. Following Kwok and Smith (2003, 2005), we call such dynamical systems iterative softmax (ISM). We have rigorously analyzed equilibria of ISM. In particular, we determined their number, position, and stability types. Most fixed points can be found in the neighborhood of the maximum entropy equilibrium point w = (N−1 , N−1 , . . . , N−1 ) . We have calculated the exact rate of decrease in
7 The increase of optimal neighborhood size L with increasing problem dimension N is in accordance with findings in Kwok and Smith (2005).
1080
ˇ P. Tino
the number of ISM equilibria as one moves away from w. We have derived bounds on temperatures guaranteeing different stability types of ISM equilibria. It appears that most ISM fixed points are of a saddle type. This hypothesis is supported by our extensive numerical experiments (most are not reported here). The only equilibria that can be guaranteed to be stable by our theory are the maximum entropy point w and one-hot solutions at the vertexes of the N − 1 dimensional simplex. We argued that close to the critical bifurcation temperature at which such one-hot equilibria lose stability, the most powerful intermittent search can take place in SONN. Based on temperature bounds guaranteeing stability of one-hot equilibria and temperatures guaranteeing their existence, we were able to derive analytical approximations to the critical bifurcation temperatures that are in good agreement with those found by numerical investigations. So far, the critical temperatures have been determined only by extensive trial-and-error numerical investigations. Moreover, the analytically obtained critical temperatures predict the optimal working temperatures for SONN intermittent search well. We have also shown that no intermittent search can exist in SONN for temperatures greater than one-half.
References Crutchfield, J. P., & Young, K. (1990). Computation at the onset of chaos. In W. H. Zurek (Ed.), Complexity, entropy, and the physics of information (Vol. 8, pp. 223–269). Reading, MA: Addison-Wesley. Elfadel, I. M., & Wyatt, J. L. (1994). The softmax nonlinearity: Derivation using statistical mechanics and useful properties as a multiterminal analog circuit element. In J. Cowan, G. Tesauro, & C. L. Giles (Eds.), Advances in neural information processing systems, 6 (pp. 882–887). San Mateo, CA: Morgan Kaufmann. Gold, S., & Rangarajan, A. (1996). Softmax to softassign: Neural network algorithms for combinatorial optimization. Journal of Artificial Neural Networks, 2, 381–399. Guerrero, F., Lozano, S., Smith, K. A., Canca, D., & Kwok, T. (2002). Manufacturing cell formation using a new self-organizing neural network. Computers and Industrial Engineering, 42, 377–382. Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Science USA, 79, 2554–2558. Kohonen, T. (1982). Self–organizing formation of topologically correct feature maps. Biological Cybernetics, 43, 59–69. Kosowsky, J. J. & Yuille, A. L. (1994). The invisible hand algorithm: Solving the assignment problem with statistical physics. Neural Networks, 7(3), 477–490. Kwok, T., & Smith, K. A. (2002). Characteristic updating-normalisation dynamics of a self-organising neural network for enhanced combinatorial optimisation. In Proceedings of the 9th International Conference on Neural Information Processing (ICONIP’2002), (Vol. 3, pp. 1146–1152). Piscataway, NJ: IEEE.
Critical Temperatures for Self-Organizing Neural Networks
1081
Kwok, T., & Smith, K. A. (2003). Performance-enhancing bifurcations in a selforganising neural network. In Computational Methods in Neural Modeling: Proceedings of the 7th International Work-Conference on Artificial and Natural Neural Networks (IWANN’2003) (pp. 390–397). Berlin: Springer-Verlag. Kwok, T., & Smith, K. A. (2004). A noisy self-organizing neural network with bifurcation dynamics for combinatorial optimization. IEEE Transactions on Neural Networks, 15(1), 84–88. Kwok, T., & Smith, K. A. (2005). Optimization via intermittency with a selforganizing neural network. Neural Computation, 17, 2454–2481. Langton, C. G. (1990). Computation at the edge of chaos: Phase transitions and emergent computation. Physica D, 42, 12–37. Minton, S., Johnston, M. D., Philips, A. B., & Laird, P. (1992). Minimizing conflicts: A heuristic repair method for constraint satisfaction and scheduling problems. Artificial Intelligence, 58(1–3), 161–205. Rangarajan, A. (2000). Self-annealing and self-annihilation: Unifying deterministic annealing and relaxation labeling. Pattern Recognition, 33(4), 635–649. Smith, K. A. (1995). Solving the generalized quadratic assignment problem using a self-organizing process. In Proceedings of the IEEE Int. Conf. on Neural Networks (Vol. 4, pp. 1876–1879). Piscataway, NJ: IEEE. Smith, K. A. (1999). Neural networks for combinatorial optimization: A review of more than a decade of research. INFORMS Journal on Computing, 11(1), 15–34. Smith, K. A., Palaniswami, M., & Krishnamoorthy, M. (1998). Neural techniques for combinatorial optimization with applications. IEEE Transactions on Neural Networks, 9(6), 1301–1318. Tagliarini, G. A., & Page, E. W. (1987). Solving constraint satisfaction problems with neural networks. In The First IEEE International Conference on Neural Networks (Vol. 3, pp. 741–747). Piscataway, NJ: IEEE.
Received May 3, 2006; accepted July 19, 2006.
LETTER
Communicated by Oliver Chapelle
Recursive Finite Newton Algorithm for Support Vector Regression in the Primal Liefeng Bo [email protected]
Ling Wang [email protected]
Licheng Jiao [email protected] Institute of Intelligent Information Processing, Xidian University, Xi’an 710071, China
Some algorithms in the primal have been recently proposed for training support vector machines. This letter follows those studies and develops a recursive finite Newton algorithm (IHLF-SVR-RFN) for training nonlinear support vector regression. The insensitive Huber loss function and the computation of the Newton step are discussed in detail. Comparisons with LIBSVM 2.82 show that the proposed algorithm gives promising results. 1 Introduction Support vector machines (SVMs) (Burges, 1998; Vapnik, 1998; Smola & ¨ Scholkopf, 2004) are powerful tools for classification and regression. In the past few years, fast algorithms for SVMs have been an active research direction. Traditionally, SVMs are trained by using decomposition techniques such as SVMlight (Joachims, 1999) and sequential minimal optimization (Platt, 1999; Shevade, Keerthi, Bhattacharyya, & Murthy, 2000; Keerthi, Shevade, Bhattacharyya, & Murthy, 2001), which solve the dual problem by optimizing a small subset of the variables each iteration. Recently, some algorithms in the primal have been presented for training SVMs. (Mangasarian, 2002; Fung & Mangasarian, 2003) proposed the finite Newton algorithm and showed that it is rather powerful for linear SVMs. Keerthi and DeCoste (2005) introduced some appropriate techniques to speed up the finite Newton algorithm and named the resulting algorithm the modified finite Newton algorithm. Chapelle (2006) proposed the recursive finite Newton algorithm and showed it to be as efficient as the dual domain method for nonlinear support vector classification. This letter follows these studies and develops a recursive finite Newton algorithm (IHLF-SVR-RFN) for nonlinear SVR. One of its main contributions is to introduce an insensitive Huber loss function that includes several Neural Computation 19, 1082–1096 (2007)
C 2007 Massachusetts Institute of Technology
Recursive Finite Newton Algorithm for SVR
1083
popular loss functions as its special cases. Comparisons with LIBSVM 2.82 (Fan, Chen, & Lin, 2005) suggest that IHLF-SVR-RFN is as efficient as the dual domain method for nonlinear SVR. The letter is organized as follows. In section 2, SVR in the primal is introduced. The insensitive Huber loss function and its corresponding optimization problem are proposed in section 3. The recursive finite Newton algorithm is discussed in section 4. Comparisons with LIBSVM 2.82 are reported in section 5. Section 6 presents the conclusions. 2 Support Vector Regression in the Primal n Consider a regression problem with training samples {xi , yi }i=1 where xi is the input sample and yi is the corresponding target. To obtain a linear predictor, SVR solves the following optimization problem:
p w2 p min ξi + ξ¯i +C w,b 2 n
i=1
s.t. w · xi + b − yi ≤ ε + ξi yi − w · xi + b ≤ ε + ξ¯i ξi , ξ¯i ≥ 0, i = 1, 2, · · · , n.
(2.1)
n Eliminating the slack variables ξi , ξ¯i i=1 and dividing equation 2.1 by the factor C, we get the unconstrained optimization problem, min L ε (w, b) = w,b
n
lε (w · xi + b − yi ) + λ w
2
,
(2.2)
i=1
1 and lε (r ) = max (|r | − ε, 0) p . The most popular selections for where λ = 2C p are 1 and 2. For convenience of expression, the loss function with p = 1 is referred to as insensitive linear loss function (ILLF) and that with p = 2 insensitive quadratic loss function (IQLF). Nonlinear SVR can be obtained by using a kernel function and an associated reproducing kernel Hilbert space H. The resulting optimization is
min L ε ( f ) = f
n
lε ( f (xi ) − yi ) + λ
f 2H
,
(2.3)
i=1
where we have dropped b for simplicity. Our experience shows that the generalization performance of SVR is not affected by this drop. According to the representer theory (Kimeldorf & Wahba, 1970), the optimal function
1084
L. Bo, L. Wang, and L. Jiao
for equation 2.3 can be expressed as a linear combination of the kernel functions centered in the training samples,
f (x) =
n
βi k (x, xi ).
(2.4)
i=1
Substituting equation 2.4 into 2.3, we have min L ε (β) = β
n
lε
i=1
n
βi k xi , x j − yi + λ
n
j=1
βi β j k xi , x j .
(2.5)
i=1
Introducing the kernel matrix K with Ki j = k xi , x j and Ki the ith row of K, equation 2.5 can be rewritten as min L ε (β) = β
n
lε (Ki β − yi ) + λβ Kβ . T
(2.6)
i=1
As long as lε (·) is differentiable, we can optimize equation 2.6 by a gradient descent algorithm. 3 Finite Newton Algorithm for Insensitive Huber Loss Function The finite Newton algorithm is a Newton algorithm that can be proved to converge in a finite number of steps, and it has been demonstrated to be very efficient for support vector classification (Mangasarian, 2002; Keerthi & DeCoste, 2005; Chapelle, 2006). Here, we focus on the regression case. The finite Newton algorithm is straightforward for the insensitive quadratic loss function by defining the generalized Hessian matrix (Clarke, 1983; Hiriart-Urruty, Strodiot, & Nguyen, 1984; Keerthi & DeCoste, 2005); however, it is not applicable to the insensitive linear loss function since it is not differentiable. Inspired by the Huber loss function (Huber, 1981), we propose an insensitive Huber loss function (IHLF), 0 lε, (z) = (|z| − ε)2 ( − ε) (2 |z| − − ε) whose shape is shown in Figure 1.
if if if
|z| ≤ ε ε < |z| < , |z| ≥
(3.1)
Recursive Finite Newton Algorithm for SVR
1085
Figure 1: Insensitive Huber loss function with ε = 0.5 and = 1.5 and insensitive quadratic loss function with ε = 0.5.
We emphasize that is strictly greater than ε, ensuring that IHLF is differentiable. Its first-order derivative can be written as 0 ∂lε, (z) = 2sign (z) (|z| − ε) ∂z 2sign (z) ( − ε)
if if if
|z| ≤ ε ε < |z| < , |z| ≥
(3.2)
where sign (z) is 1 if z ≥ 0; otherwise, sign (z) is −1. The properties of IHLF are controlled by two parameters: ε and . At certain ε and values, we can obtain some familiar loss functions: (1) for ε = 0 and an appropriate , IHLF becomes the Huber loss function (HLF); (2) for ε = 0 and = ∞, IHLF becomes the quadratic (gaussian) loss function (QLF); (3) for ε = 0 and → ε, IHLF approaches the linear (Laplace) loss function (LLF); (4) for 0 < ε < ∞ and = ∞, IHLF becomes the insensitive quadratic loss function (IQLF); and (5) for 0 < ε < ∞ and → ε, IHLF approaches the insensitive linear loss function (ILLF). The profiles of the loss functions derived from equation 3.1 are shown in Figure 2. Introducing IHLF into the optimization problem, equation 2.6, we have the following nonlinear SVR problem: min L ε, (β) = β
n i=1
lε, (Ki β − yi ) + λβ Kβ . T
(3.3)
1086
L. Bo, L. Wang, and L. Jiao
Figure 2: Loss functions derived from equation 3.1.
Let the residual vector be
r (β) = Kβ − y . ri (β) = Ki β − yi
(3.4)
Defining the sign vector s (β) = [s1 (β) , · · · , sn (β)]T by 1 si (β) = −1 0
if ε < ri (β) < if − < ri (β) < −ε , otherwise
(3.5)
the sign vector s¯ (β) = [¯s1 (β) , · · · , s¯n (β)]T by if ri (β) ≥ 1 s¯i (β) = −1 if ri (β) ≤ − , 0 otherwise
(3.6)
Recursive Finite Newton Algorithm for SVR
1087
and the active matrix W (β) = diag w1 (β) , · · · , wn (β)
(3.7)
by wi (β) = si2 (β), we can rewrite equation 3.3 as min L ε, (β) = β
n
wi (β)(|ri (β)| − ε)2
i=1
+
n
s¯i2
T (β) ( − ε) 2 ri (β) − − ε + λβ Kβ . (3.8)
i=1
Rearranging equation 3.8, we can obtain a more compact expression: min β
L ε, (β) = r (β)T W (β) r (β) − 2εr (β)T s (β) + 2 ( − ε) r (β)T s¯ (β) +λβ T Kβ + ε T W (β) ε − 2 − ε 2 s¯ (β)T s¯ (β)
.
(3.9) Let us consider some basic properties of L ε, (β). First, L ε, (β) is a piecewise quadratic function. Second, L ε, (β) is a convex function, so it has a unique minimizer. Finally, L ε, (β) is continuously differentiable with respect to β. Although L ε, (β) is not twice differentiable, one still can use the finite Newton algorithm by defining the generalized Hessian matrix. The flowchart of the finite Newton algorithm is as follows: Algorithm 3.1: IHLF-SVR-FN 1. Choose a suitable starting point β 0 and set k = 0; 2. Check whether β k is the optimal solution of equation 3.9. If so, stop; 3. Compute the Newton step h; 4. Choose the step size t by the exact line search. Set β k+1 = β k + th and k = k + 1. 4 Practical Implementation 4.1 Computing the Newton Step. For any given value of β, wesay that a point xi is support vector if ri (β) > ε. Let sν1 = i|si β k = 0 denote the index set ofthe support vectors lying in the quadratic part of the loss function, sν2 = i|¯si β k = 0 the index set of the support vectors lying in the linear part of the loss function, and nsν the index set of the nonsupport vectors. Chapelle (2006) has shown that in support the vector classification, Newton step can be computed at a cost of O n3sν1 rather than O n3 , where
1088
L. Bo, L. Wang, and L. Jiao
nsν1 denotes the number of the support vectors lying in the quadratic part of the loss function. In the following, we will show that a similar conclusion holds for SVR. The gradient of L ε, (β) with respect to β is ∇ L ε, (β) = 2KT W (β) r (β) − 2εKT s (β) + 2 ( − ε) KT s¯ (β) + 2λKβ.
(4.1)
Define the set A by A = β ∈ Rn |∃i, ri (β) = ε or ri (β) = .
(4.2)
The Hessian exists for β ∈ / A. For β ∈ A, it is (arbitrarily) defined to one of its limits. Thus, we have the generalized Hessian, ∇ 2 L ε, (β) = 2KT W (β) K + 2λK.
(4.3)
The Newton step at the kth iteration is given by −1 h = − ∇ 2 L ε, β k ∇ L ε, β k −1 k W β y + εs β k − ( − ε) s¯ β k − β k , (4.4) = W β k K + λI where I is the identity matrix. For convenience of expression, we reorder the training samples such that the first nsν1 training samples are the support vectors lying in the quadratic part of the loss function and the training samples from nsν1 + 1 to nsν1 + nsν2 are the support vectors lying in the linear part of the loss function. Thus, equation 4.4 can be rewritten as
−1 hsν1 Ksν1 ,sν1 + λIsν1 ,sν1 Ksν1 ,sν2 Ksν1 ,nsν hsν2 = 0 λIsν2 ,sν2 0 0 0 λInsν,nsν hnsν k ysν1 + ε s β k sν1 β sν1 k × − ( − ε) s¯ β sν − β ksν2 . 2 β knsν 0
(4.5)
Recursive Finite Newton Algorithm for SVR
1089
Together with the matrix inversion result from Zhang (2004), we can simplify equation 4.5 as
k + λI + ε s β y (K ) sν1 ,sν1 sν1 sν1 ,sν1 sν1 k ¯ − ε) K s β ( sν ,sν 1 2 hsν1 sν2 k + − β sν1 . hsν2 = λ hnsν − ( − ε) s¯ β k sν2 k − β sν2 λ −1
(4.6)
−β knsν Equation 4.6 indicates that we can compute the Newton step at a cost of O n3sν1 . Instead of inverting the matrix (Ksν1 ,sν1 + λIsν1 ,sν1 ), we can compute the first part of the Newton step by solving the positive definite system: (Ksν1 ,sν1 + λIsν1 ,sν1 ) hsν1 = ysν1
( − ε) Ksν1 ,sν2 s¯ β k sν2 k + ε s β sν1 + . λ (4.7)
Many methods, such as gaussian elimination, conjugate gradient, and factorization, can be used to find the solution to equation 4.7. In the current implementation, we solve equation 4.7 using Matlab command “\” that employs Cholesky factorization. 4.2 Exact Line Search. The minimizer of a piecewise-smooth, convex quadratic function can be found by identifying the points at which the second derivative jumps, sorting these points, and successively searching over those points until a point where the slope changes sign is located. (For more details, refer to Madsen & Nielsen, 1990, and Keerthi & DeCoste, 2005.) The complexity of this line search is O(m log(m)). Since the Newton step is much more expensive, the line search does not add to the complexity of the algorithm. 4.3 Initialization. The initialization of IHLF-SVR-FN is an important problem. If we use a randomly generated β as a starting point, we would have to invert an n × n matrix at the first Newton step. But if we can identify the support vectors ahead of time, we need to invert only an nsν1 × nsν1 matrix at the first Newton step. According to Chapelle’s suggestion (Chapelle, 2006), we can identify the support vectors by a recursive procedure. We start from a small number of
1090
L. Bo, L. Wang, and L. Jiao
training samples, train, double the number of training samples, retrain, and so on. In this way, the support vectors are rather well identified, avoiding the inversion of the n × n matrix. To distinguish this algorithm from the original finite Newton algorithm, we call it a recursive finite Newton algorithm (IHLF-SVR-RFN). 4.4 Checking Convergence. Once the support vectors remain unchanged, IHLF-SVR-FN has found the optimal solution. We use this property to check the convergence of IHLF-SVR-FN. 4.5 Finite Convergence and Computational Complexity Theorem 1: IHLF-SVR-FN terminates at the global minimum solution of equation 3.9 after a finite number of iterations. The proof of theorem 1 is very much along the lines of the proof of theorem 1 in Keerthi and DeCoste (2005), which is easily extended to a piecewise quadratic loss function, although it considers only a squared hinge loss function. In IHLF-SVR-RFN, the most time-consuming step is computing the Newton step, whose complexity is O(n(nsν1 + nsν2 ) + n3sν1 ), where nsν2 denotes the number of the support vectors lying in the linear part of the loss function. The first term is the cost of finding the support vectors and the second term the cost of solving the linear system, equation 4.7. The number of iterations, that is, the loops of steps 2 to 4, is usually small, say, 5 to 30. 5 Experiments In order to verify the effectiveness of IHLF-SVR-RFN and IQLF-SVR-RFN (the insensitive quadratic loss function is used), we compare them with LIBSVM 2.82 on some benchmark data sets. LIBSVM 2.82 is the fastest version of LIBSVM, where the second-order information is used to select the working set. Gaussian kernel k(xi , x j ) = exp(−γ xi − x j 22 ) is used to construct nonlinear SVR. All the experiments are run on a personal computer with 2.4 GHz processors, 1 GB memory, and Windows XP operation systems. (Matlab code for IHLF-SVR-RFN is available online at http://see.xidian.edu.cn/graduate/lfbo/.) We use the data sets Abalone, Computer Activity, and House16H (Rasmussen et al., 1996) for our comparisons. The Abalone data set consists of 4177 samples with 8 attributes. The task is to predict the age of abalone from all eight physical measurements. The Computer Activity data set was collected from a Sun Sparcstation 20/712 with 128 Mb of memory running in a multiuser university department. It consists of 8192 samples with 21 attributes. The task is to predict the portion of time that CPUs run in user mode from all 21 attributes. The House16H data set consists of 22,784
Recursive Finite Newton Algorithm for SVR
1091
Table 1: Training Time on the Abalone, Computer Activity, and House16H Data Sets.
Abalone Computer House16H
IHLF-SVR-RFN
IQLF-SVR-RFN
LIBSVM 2.82
ε
2.48 11.97 1640.81
4.30 18.43 /
5.99 22.48 653.05
0.100 0.050 0.050
0.110 0.055 0.060
samples with 16 attributes. The task is to predict the median price of the houses in a small survey region. 5.1 Comparison with LIBSVM 2.82. The Abalone data set is randomly split into 3000 training samples and 1177 test samples, the Computer Activity data set into 5000 training samples and 4192 test samples, and the House16H data set into 20,000 training samples and 2784 test samples. For each training-test pair, the training samples are scaled into the interval [−1, 1], and the test samples are adjusted using the same linear transformation. For the Abalone and Computer Activity data sets, we precompute the entire kernel matrix for all three algorithms. Since the value of the parameter pair (γ , C) that optimizes the generalization performance is usually not known prior, the value that is ultimately used to design SVR is usually determined by trying many different values of (γ , C). Thus, it is important to understand how different values, optimal and otherwise, affect training time. Four experiments are performed on the Abalone and Computer Activity data sets. In the first experiment, we fix the values of ε and shown in Table 1 and search the optimal pair(γ ∗ , C ∗ ) from the set −4 −3parameter −3 3 4 −2 7 8 2 /d, 2 /d, · · · , 2 /d, 2 /d × 2 , 2 , · · · , 2 , 2 , which gives the best generalization performance on the test samples, where d is the number of the attributes of samples. Therefore, for each problem, we try 9 × 12 = 108 combinations. The mean training time of the three algorithms on all parameter pairs is reported in Table 1. In the second experiment, we fix the value of the kernel parameter and investigate the robustness of the three algorithms against the variation of the regularization parameter C. In the third experiment, we fix the value of the regularization parameter and investigate the robustness of the three algorithms against the variation of the kernel parameter γ . In the fourth experiment, we investigate the influence of training samples size on the training time of the three algorithms. The values of the kernel parameter and the regularization parameter are set to (γ ∗ , C ∗ ). From Table 1, we can see that IHLF-SVR-RFN outperforms IQLF-SVRRFN and LIBSVM 2.82 in terms of the training time. From Figure 3, we observe that the training time of LIBSVM 2.82 has a sharp increase for large
Figure 3: Variation of the training time with the regularization parameter on the Abalone (left) and Computer Activity (right) data sets. γ is fixed to 2^2/d.
Figure 4: Variation of the training time with the kernel parameter on the Abalone (left) and Computer Activity (right) data sets. For the Abalone data set, C is fixed to 2^3, and for the Computer Activity data set, C is fixed to 2^4.
C values, and hence IHLF-SVR-RFN and IQLF-SVR-RFN are much more robust against variation of C than LIBSVM 2.82. From Figure 4, we observe that the training time of IHLF-SVR-RFN and LIBSVM 2.82 increases with an increasing kernel parameter. From Figure 5, we observe that, in terms of the training time, IHLF-SVR-RFN and IQLF-SVR-RFN are superior to LIBSVM 2.82 on the Abalone data set and inferior to LIBSVM 2.82 on the Computer Activity data set using the optimal parameter pair. For the House16H data set, it is impossible to store the entire kernel matrix because of insufficient memory, so we need to compute the kernel matrix in each Newton step. Since it is computationally prohibitive to search the set of grid values using all the training samples, we use a random subset of the training samples of size 5000 to choose (γ*, C*). The training time of IHLF-SVR-RFN and LIBSVM 2.82 on (γ*, C*) is given in Table 1. Note that
Figure 5: Variation of the training time with the training set size on the Abalone (left) and Computer Activity (right) data sets. For the Abalone data set, (γ, C) is fixed to (2^2/d, 2^3), and for the Computer Activity data set, (γ, C) is fixed to (2^2/d, 2^4).
IQLF-SVR-RFN cannot run on this data set because of insufficient memory. From Table 1, we can see that LIBSVM 2.82 is faster than IHLF-SVR-RFN on this data set. The main reason is that IHLF-SVR-RFN needs to reevaluate the n × (n_{sv1} + n_{sv2}) kernel matrix in each Newton step. However, we believe that IHLF-SVR-RFN has the advantage for large-scale optimization because the computation of the kernel matrix can be easily parallelized, which could improve the training time significantly.

5.2 Influence of Δ. The purpose of the experiments in this section is to study the influence of Δ on the test error and the training time. The partition of the training samples and test samples is the same as in section 5.1. The values of the kernel parameter and the regularization parameter are set to (γ*, C*), and the value of ε is the same as in Table 1. The final errors are averaged over 10 random splits of the full data sets. Figures 6 and 7 show the variation of the test error and the training time with Δ on the Abalone and Computer Activity data sets, respectively.

From Figure 6, we observe that the performance of IHLF-SVR-RFN is close to that of LIBSVM 2.82 (the insensitive linear loss function is used) when Δ approaches ε, and the performance of IHLF-SVR-RFN is close to that of IQLF-SVR-RFN when Δ is significantly greater than ε. This is consistent with the discussion in section 3. From Figure 7, we observe that the training time of IHLF-SVR-RFN fluctuates with the variation of the Δ value. The main reason is that the training time is affected by the number of support vectors and the Newton step. The former often increases with an increasing Δ, and hence some training time is added for large Δ values; however, the latter often decreases with an increasing Δ, and hence some training time is subtracted for large Δ values.
Figure 6: Influence of Δ on the test error on the Abalone (left) and Computer Activity (right) data sets.
Figure 7: Influence of Δ on the training time on the Abalone (left) and Computer Activity (right) data sets.
6 Conclusion and Discussion

In this letter, we present a recursive finite Newton algorithm for training nonlinear SVR. First, we propose the insensitive Huber loss function and show that several popular loss functions are special cases of it. Then we discuss in detail the implementation of IHLF-SVR-RFN. Finally, we verify the effectiveness of the proposed algorithm on three benchmark regression data sets.

The computational complexity of IHLF-SVR-RFN can be decreased by some approximation strategies. For example, if the kernel matrix is sparse, the Newton step can be solved more efficiently. We can also approximate the objective function, equation 3.9, by a matching pursuit approach. Keerthi, Chapelle, and DeCoste (2006) implemented this in the classification case, and it is straightforward to generalize it to the regression case.
Acknowledgments

We thank the two reviewers for their helpful comments, which greatly improved this letter, and Rainer Stollhoff and Bruce Rogers for their help in proofreading the manuscript. This work was supported by the National Natural Science Foundation of China under grant 60372050 and the National Defense Preresearch Foundation of China under grant A1420060172.

References

Burges, C. J. C. (1998). A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2), 121–167.
Chapelle, O. (2006). Training a support vector machine in the primal (MPI Tech. Rep. No. 147). Tübingen: Max Planck Institute for Biological Cybernetics.
Clarke, F. H. (1983). Optimization and nonsmooth analysis. New York: Wiley.
Fan, R. E., Chen, P. H., & Lin, C. J. (2005). Working set selection using second order information for training support vector machines. Journal of Machine Learning Research, 6, 1889–1918.
Fung, G., & Mangasarian, O. L. (2003). Finite Newton method for Lagrangian support vector machine classification. Neurocomputing, 55(1–2), 39–55.
Hiriart-Urruty, J. B., Strodiot, J. J., & Nguyen, V. H. (1984). Generalized Hessian matrix and second-order optimality conditions for problems with C^{1,1} data. Applied Mathematics and Optimization, 11, 43–56.
Huber, P. (1981). Robust statistics. New York: Wiley.
Joachims, T. (1999). Making large-scale SVM learning practical. In B. Schölkopf, C. Burges, & A. J. Smola (Eds.), Advances in kernel methods—Support vector learning. Cambridge, MA: MIT Press.
Keerthi, S. S., Chapelle, O., & DeCoste, D. (2006). Building support vector machines with reduced classifier complexity. Journal of Machine Learning Research, 7, 1493–1515.
Keerthi, S. S., & DeCoste, D. M. (2005). A modified finite Newton method for fast solution of large scale linear SVMs. Journal of Machine Learning Research, 6, 341–361.
Keerthi, S. S., Shevade, S. K., Bhattacharyya, C., & Murthy, K. R. K. (2001). Improvements to Platt's SMO algorithm for SVM classifier design. Neural Computation, 13(3), 637–649.
Kimeldorf, G. S., & Wahba, G. (1970). A correspondence between Bayesian estimation on stochastic processes and smoothing by splines. Annals of Mathematical Statistics, 41, 495–502.
Madsen, K., & Nielsen, H. B. (1990). Finite algorithms for robust linear regression. BIT, 30(4), 682–699.
Mangasarian, O. L. (2002). A finite Newton method for classification. Optimization Methods and Software, 17(5), 913–929.
Platt, J. (1999). Sequential minimal optimization: A fast algorithm for training support vector machines. In B. Schölkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods—Support vector learning. Cambridge, MA: MIT Press.
Rasmussen, C., Neal, R., Hinton, G., van Camp, D., Ghahramani, Z., Kustra, R., & Tibshirani, R. (1996). The DELVE manual. Available online at http://www.cs.toronto.edu/~delve.
Shevade, S. K., Keerthi, S. S., Bhattacharyya, C., & Murthy, K. R. K. (2000). Improvements to the SMO algorithm for SVM regression. IEEE Transactions on Neural Networks, 11(5), 1188–1193.
Smola, A. J., & Schölkopf, B. (2004). A tutorial on support vector regression. Statistics and Computing, 14(3), 199–222.
Vapnik, V. (1998). Statistical learning theory. New York: Wiley.
Zhang, X. D. (2004). Matrix analysis and application. New York: Springer.
Received January 22, 2006; accepted July 19, 2006.
LETTER
Communicated by Antti Honkela
State-Space Models: From the EM Algorithm to a Gradient Approach Rasmus Kongsgaard Olsson [email protected]
Kaare Brandt Petersen [email protected]
Tue Lehn-Schiøler [email protected] Technical University of Denmark, 2800 Kongens Lyngby, Denmark
Slow convergence is observed in the EM algorithm for linear state-space models. We propose to circumvent the problem by applying any off-the-shelf quasi-Newton-type optimizer, which operates on the gradient of the log-likelihood function. Such an algorithm is a practical alternative due to the fact that the exact gradient of the log-likelihood function can be computed by recycling components of the expectation-maximization (EM) algorithm. We demonstrate the efficiency of the proposed method in three relevant instances of the linear state-space model. In high signal-to-noise ratios, where EM is particularly prone to converge slowly, we show that gradient-based learning results in a sizable reduction of computation time.

1 Introduction

State-space models are widely applied in cases where the data are generated by some underlying dynamics. Control engineering and speech enhancement are typical examples of applications of state-space models, where the state-space has a clear physical interpretation. Black box modeling constitutes a different type of application: the state-space dynamics have no direct physical interpretation; only the generalization ability of the model matters, that is, the prediction error on unseen data. A fairly general formulation of linear state-space models (without deterministic input) is:

s_t = F s_{t−1} + v_t, (1.1)

x_t = A s_t + n_t, (1.2)
where equations 1.1 and 1.2 describe the state and observation spaces, respectively. State and observation vectors, s_t and x_t, are random processes
driven by independent and identically distributed (i.i.d.) zero-mean gaussian inputs v_t and n_t with covariances Q and R, respectively. Optimization in state-space models by maximizing the log likelihood, L(θ), with respect to the parameters, θ ≡ {Q, R, F, A}, falls in two main categories based on either gradients (scoring) or expectation maximization (EM).

The principal approach to maximum likelihood in state-space models, and more generally in complex models, is to iteratively search the space of θ for the local maximum of L(θ) by taking steps in the direction of the gradient, ∇_θ L(θ). A basic ascent algorithm can be improved by supplying curvature information, such as second-order derivatives or line search. Often, numerical methods are used to compute the gradient and the Hessian, due to the complexity associated with the computation of these quantities. Gupta and Mehra (1974) and Sandell and Yared (1978) give fairly complex recipes for the computation of the analytical gradient in the linear state-space model.

The EM algorithm (Dempster, Laird, & Rubin, 1977) is an important alternative to gradient-based maximum likelihood, partly due to its simplicity and convergence guarantees. It was first applied to the optimization of linear state-space models by Shumway and Stoffer (1982) and Digalakis, Rohlicek, and Ostendorf (1993). A general class of linear gaussian (state-space) models was treated in Roweis and Ghahramani (1999), in which the EM algorithm was the main engine of estimation. In the context of independent component analysis (ICA), the EM algorithm has been applied in, among others, Moulines, Cardoso, and Cassiat (1997) and Højen-Sørensen, Winther, and Hansen (2002). In Olsson and Hansen (2004, 2005), the EM algorithm was applied to the convolutive ICA problem.

A number of authors have reported the slow convergence of the EM algorithm. In Redner and Walker (1984), impractically slow convergence in two-component gaussian mixture models is documented. This critique is, however, moderated by Xu and Jordan (1996). Modifications have been suggested to speed up the basic EM algorithm (see, e.g., McLachlan & Krishnan, 1997). Jamshidian and Jennrich (1997) employ the EM update, g̃_n ≡ θ_{n+1} − θ_n, as a so-called generalized gradient in variations of the quasi-Newton algorithm. The approach taken in Meilijson (1989) differs in that the gradient of the log likelihood is derived from the expectation step (E-step), from a theorem originally shown by Fisher. Subsequently, a Newton-type step can be devised to replace the maximization step (M-step).

The main contribution of this letter is to demonstrate that specialized gradient-based optimization software can replace the EM algorithm, at little analytical cost, reusing the components of the EM algorithm itself. This procedure is termed the easy gradient recipe. Furthermore, empirical evidence, supporting the results in Bermond and Cardoso (1999), is presented to demonstrate that the signal-to-noise ratio (SNR) has a dramatic effect on the convergence speed of the EM algorithm. Under certain circumstances,
such as in high SNR settings, the EM algorithm fails to converge in reasonable time. Three applications of state-space models are investigated: (1) sensor fusion for black box modeling of the speech-to-face mapping problem, (2) mean field independent component analysis (mean field ICA) for estimating a number of hidden independent sources that have been linearly and instantaneously mixed, and (3) convolutive independent component analysis for convolutive mixtures.

In section 2, an introduction to EM and the easy gradient recipe is given. More specifically, the relation between the various acceleration schemes is reviewed in section 2.3. In section 3, the models are stated, and in section 4 simulation results are presented.

2 Theory

Assume a model with observed variables x, state-space variables s, and parameters θ. The calculation of the log likelihood involves an integral over the state-space variables:

L(θ) = ln p(x|θ) = ln ∫ p(x|s, θ) p(s|θ) ds. (2.1)
The marginalization in equation 2.1 is intractable for most choices of distributions; hence, direct optimization is rarely an option, even in the gaussian case. Therefore, a lower bound, B, is introduced on the log likelihood, which is valid for any choice of probability density function, q(s|φ):

B(θ, φ) = ∫ q(s|φ) ln [ p(s, x|θ) / q(s|φ) ] ds ≤ ln p(x|θ). (2.2)
At this point, the problem seems to have been made more complicated, but the lower bound B has a number of appealing properties, which make the original task of finding the parameters easier. One important fact about B becomes clear when we rewrite it using Bayes' theorem,

B(θ, φ) = ln p(x|θ) − KL[q(s|φ) || p(s|x, θ)], (2.3)

where KL denotes the Kullback-Leibler divergence between the two distributions. In the case that the variational distribution, q, is chosen to be exactly the posterior of the hidden variables, p(s|x, θ), the bound, B, is equal to the log likelihood. For this reason, one often tries to choose the variational distribution flexible enough to include the true posterior and yet simple enough to make the necessary calculations as easy as possible. However, when computational efficiency is the main priority, a simplistic
q is chosen that does not necessarily include p(s|x, θ), and the optimum of B may differ from that of L. Examples of applications involving both types of q are described in sections 3 and 4. The approach is to maximize B with respect to φ in order to make the lower bound as close as possible to the log likelihood and then maximize the bound with respect to the parameters θ. This stepwise maximization can be achieved via the EM algorithm or by applying the easy gradient recipe (see section 2.2).

2.1 The EM Update. The EM algorithm, as formulated in Neal and Hinton (1998), works in a straightforward scheme that is initiated with random values and iterated until suitable convergence is reached:

E: Maximize B(θ, φ) w.r.t. φ, keeping θ fixed.
M: Maximize B(θ, φ) w.r.t. θ, keeping φ fixed.

It is guaranteed that the lower-bound function does not decrease on any combined E- and M-step. Figure 1 illustrates the EM algorithm. The convergence is often slow; for example, the curvature of the bound function, B, might be much higher than that of L, resulting in very conservative parameter updates. As mentioned in section 1, this is particularly a problem in latent variable models with low-power additive noise. Bermond and Cardoso (1999) and Petersen and Winther (2005) demonstrate that the EM update of the parameter A in equation 1.2 scales with the observation noise level, R. That is, as the signal-to-noise ratio increases, the M-step change in A decreases, and more iterations are required to converge.

2.2 The Easy Gradient Recipe. The key idea is to regard the bound, B, as a function of θ only, as opposed to a function of both the parameters θ and the variational parameters φ. As a result, the lower bound can be applied to reformulate the log likelihood,

L(θ) = B(θ, φ*), (2.4)
where φ* = φ*(θ) satisfies the constraint q(s|φ*) = p(s|x, θ). Comparing with equation 2.3, it is easy to see that φ* maximizes the bound. Since φ* exactly minimizes the KL divergence, the partial derivative of the bound with respect to φ, evaluated in the point φ*, is equal to zero. Therefore, the derivative of B(θ, φ*) is equal to the partial derivative,

dB(θ, φ*)/dθ = ∂B(θ, φ*)/∂θ + (∂B(θ, φ)/∂φ)|_{φ*} (∂φ/∂θ)|_{φ*} = ∂B(θ, φ*)/∂θ, (2.5)
Figure 1: Schematic illustration of lower-bound optimization for a one-dimensional estimation problem, where θ_n and θ_{n+1} are iterates of the standard EM algorithm. The log-likelihood function, L(θ), is bounded from below by the function B(θ, φ*). The bound attains equality to L in θ_n due to the choice of variational distribution: q(s|φ*) = p(s|x, θ_n). Furthermore, in θ_n, the derivatives of the bound and the log likelihood are identical. In many situations, the curvature of B(θ, φ*) is much higher than that of L(θ), leading to small changes in the parameter, θ_{n+1} − θ_n.
and due to the choice of φ*, the derivative of the log likelihood is the partial derivative of the bound,

dL(θ)/dθ = dB(θ, φ*)/dθ = ∂B(θ, φ*)/∂θ,

which can be realized by combining equations 2.4 and 2.5. In this way, exact values and gradients of the true log likelihood can be obtained using the lower bound. This observation is not new; it is essentially the same as that used in Salakhutdinov, Roweis, and Ghahramani (2003) to construct the so-called expected conjugate gradient algorithm (ECG). The novelty of the recipe is, rather, the practical recycling of low-complexity computations carried out in connection with the EM algorithm for a much more efficient optimization using any gradient-based optimizer. This can be expressed in Matlab-style pseudocode, where a function loglikelihood receives as argument the parameter θ and returns L and its gradient dL/dθ:
function [L, dL/dθ] = loglikelihood(θ)
  1. Find φ* such that ∂B/∂φ |_{φ*} = 0
  2. Calculate L = B(θ, φ*)
  3. Calculate dL/dθ = ∂B/∂θ (θ, φ*)
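To make the recipe concrete, the following minimal Python sketch applies it to a toy latent-variable model, s_i ~ N(θ, 1) with observation x_i ~ N(s_i, 1), where the E-step posterior is available in closed form. This toy example and its variable names are ours, not part of the letter; the point is that the exact value and gradient of L(θ) are recycled from E-step quantities and handed to an off-the-shelf quasi-Newton optimizer.

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(0)
    x = rng.normal(1.5, np.sqrt(2.0), size=200)   # marginally x ~ N(theta, 2)

    def neg_loglik_and_grad(theta_vec):
        theta = theta_vec[0]
        # Step 1 (E-step): q(s_i) = p(s_i | x_i, theta) = N(m_i, v)
        m, v = 0.5 * (x + theta), 0.5
        # Step 2: L = B(theta, phi*) = <ln p(s, x | theta)>_q + entropy of q
        B = np.sum(-np.log(2 * np.pi)
                   - 0.5 * (((m - theta) ** 2 + v) + ((x - m) ** 2 + v))) \
            + x.size * 0.5 * np.log(2 * np.pi * np.e * v)
        # Step 3: dL/dtheta equals the partial dB/dtheta with phi* held fixed
        dB = np.sum(m - theta)
        return -B, np.array([-dB])

    res = minimize(neg_loglik_and_grad, x0=[0.0], jac=True, method="BFGS")
    print(res.x[0])   # close to the sample mean of x, the exact ML estimate

Because the pair (L, dL/dθ) is exact, any gradient-based optimizer can be plugged in without further derivation; here scipy's BFGS implementation plays the role of the optimizer of choice.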
Step 1, and to some extent step 2, are obtained by performing an E-step, while step 3 requires only a little programming, which implements the gradients used to solve for the M-step. Compared to the EM algorithm, the main advantage is that the function value and gradient can be fed to any gradient-based optimizer, which in most cases substantially improves the convergence properties. In that sense, it is possible to benefit from the speed-ups of advanced gradient-based optimization. In cases when q does not contain p(s|x, θ), the easy gradient recipe will converge to generalized EM solutions.

The advantage of formulating the log likelihood using the bound function, B, depends on the task at hand. In the linear state-space model, equations 1.1 and 1.2, a naive computation of dL/dθ involves the somewhat complicated derivatives of the Kalman filter equations with respect to each of the parameters in θ; this is explained in, for example, Gupta and Mehra (1974). Consequently, it leads to a cost of dim[θ] times the cost of one Kalman filter sweep.¹ When using the easy gradient recipe, dL/dθ is derived via the gradient of the bound, ∂B/∂θ (see the appendix for a derivation example). These derivatives are often available in connection with the derivation of the M-step. In addition, the computational cost is dominated by steps 1 and 2, which require only a Kalman smoothing, scaling as two Kalman filter sweeps, hence constituting an important reduction of computation time as compared to the naive gradient computation. Sandell and Yared (1978) noted in their investigation of linear state-space models that a reformulation of the problem resulted in a similar reduction of the computational costs.

2.3 Relation to Other Speed-Up Methods. In this section, a series of extensions to the EM algorithm is discussed. They have in common the utilization of conjugate gradient or quasi-Newton steps; that is, the inverse Hessian is approximated, for example, using the gradient. Step-lengthening methods, which are often simpler in terms of analytical cost, have been explored in Salakhutdinov and Roweis (2003) and Honkela, Valpola, and Karhunen (2003). They are, however, out of the scope of this presentation.
¹ The computational complexity of the Kalman filter is O[N(d_s)³], where N is the data length and d_s is the state dimension.
Table 1: Analytical Costs Associated with a Selection of EM Speed-Ups Based on Newton-Like Updates.

    Method                               M    B′   B′′   L
    EM (Dempster et al., 1977)           x    -    -     -
    QN1 (Jamshidian & Jennrich, 1997)    x    -    -     -
    QN2 (Jamshidian & Jennrich, 1997)    x    x    -     x
    CG (Jamshidian & Jennrich, 1993)     x    x    -     x
    LANGE (Lange, 1995)                  x    x    x     x
    EGR                                  -    x    -     x

Notes: The x's indicate whether the quantity is required by the algorithm. It should be noted that B′, required by the easy gradient recipe (EGR) as well as QN2, CG, and LANGE, often is a by-product of the derivation of M.
In order to conveniently describe the methods, additional notation along the lines of Jamshidian and Jennrich (1997) is introduced: the combined EM operator, which maps the current estimate of the parameters, θ, to the new one, is denoted θ̂ = M(θ). The change in θ due to an EM update then is g̃(θ) = M(θ) − θ. In a sense, g̃(θ) can be regarded as a generalized gradient of an underlying cost function, which has zeros identical to those of g(θ) = dL(θ)/dθ. The Newton-Raphson update can then be devised as

Δθ = −J(θ)⁻¹ g̃(θ),

where the Jacobian is defined as J(θ) = M′(θ) − I. A number of methods approximate J(θ)⁻¹ or M′(θ) in various ways. For instance, the quasi-Newton method QN1 of Jamshidian and Jennrich (1997) approximates J⁻¹(θ) by employing the Broyden (1965) update. The QN2 algorithm of the same publication performs quasi-Newton optimization on L(θ), but its update direction can be written as an additive correction to the EM step: d = g̃(θ) − g(θ). The correction term is updated by means of the Broyden-Fletcher-Goldfarb-Shanno (BFGS) update (see, e.g., Bishop, 1995). A search is performed in the direction of d so as to maximize L(θ + λd), where λ is the step length. QN2 has the disadvantage compared to QN1 that L(θ) and g(θ) have to be computed in addition to g̃(θ). This is also the case for the conjugate gradient (CG) accelerator of Jamshidian and Jennrich (1993). Similarly, in Lange (1995), the log likelihood, L(θ), is optimized directly through quasi-Newton steps, that is, in the direction of d = −J_L⁻¹(θ) g(θ). However, the iterative adaptation to the Hessian, J_L(θ), involves the computation of B′′(θ, φ*), which may require considerable human effort in many applications. In Table 1, the analytical costs associated with the discussed algorithms are summarized.
The easy gradient recipe borrows from QN2 in that quasi-Newton optimization on L(θ) is carried out. The key point here is that the specific quasi-Newton update of QN2 can be replaced by any software package of choice that performs gradient-based optimization. This has the advantage that various highly sophisticated algorithms can be tried for the particular problem at hand. Furthermore, we emphasize, as does Meilijson (1989), that the gradient can be computed from the derivatives involved in solving for the E-step. This means that the advantages of gradient-based optimization are obtained conveniently at little cost. In this letter, a quasi-Newton gradient-based optimizer has been chosen. The implementation of the BFGS algorithm is due to Nielsen (2000) and has built-in line search and trust region monitoring.

3 Models

The EM algorithm and the easy gradient recipe were applied to three models that can all be fitted into the linear state-space framework.

3.1 Kalman Filter-Based Sensor Fusion. The state-space model of equations 1.1 and 1.2 can be used to describe systems where two different types of signals are measured. The signals could be, for example, sound and images, as in Lehn-Schiøler, Hansen, and Larsen (2005), where speech and lip movements were the observables. In this case, the observation equation, 1.2, can be split into two parts,

x_t = [x_t^1; x_t^2] = [A_1; A_2] s_t + [n_t^1; n_t^2],

where [·; ·] denotes vertical stacking,
and n_t^1 ∼ N(0, R_1) and n_t^2 ∼ N(0, R_2). The innovation noise in the state-space equation, 1.1, is defined as v_t ∼ N(0, Q). In the training phase, the parameters of the system, F, A_1, A_2, R_1, R_2, Q, are estimated by maximum likelihood using either EM or a gradient-based method. When the parameters have been learned, the state-space variable s, which represents unknown hidden causes, can be deduced from one of the observations (x^1 or x^2), and the missing observation can be estimated by mapping from the state-space.

3.2 Mean Field ICA. In independent component analysis (ICA), one tries to separate linearly mixed sources using the assumed statistical independence of the sources. In many cases, elaborate source priors are necessary, which calls for more advanced separation techniques such as mean field ICA. The method, which was introduced in Højen-Sørensen et al. (2002), can handle complicated source priors in an efficient approximative manner.
The model in equation 1.2 is identical to an instantaneous ICA model provided that F = 0 and that p(v_t) is reinterpreted as the (nongaussian) source prior. The basic generative model of instantaneous ICA is

x_t = A s_t + n_t, (3.1)
where n_t is assumed i.i.d. gaussian and s_t = v_t is assumed distributed by a factorized prior ∏_i p(v_{it}), which is independent in both time and dimension. The mean field ICA is only approximately compatible with the easy gradient recipe, since the variational distribution q(s|φ) is not guaranteed to contain the posterior p(s|x, θ). However, acceptable solutions (generalized EM) are retrieved when q is chosen sufficiently flexible.

3.3 Convolutive ICA. Acoustic mixture scenarios are characterized by sound waves emitted by a number of sound sources propagating through the air and arriving at the sensors in delayed and attenuated versions. The instantaneous mixture model of standard ICA, equation 3.1, is clearly insufficient to describe this situation. In convolutive ICA, the signal path (delay and attenuation) is modeled by an FIR filter, that is, a convolution of the source by the impulse responses of the signal path,

x_t = Σ_k C_k s_{t−k} + n_t, (3.2)
where C_k is the mixing filter matrix. Equation 3.2 and the source independence assumption can be fitted into the state-space formulation of equations 1.1 and 1.2 (see Olsson & Hansen, 2004, 2005) by making the following model choices: (1) the noise inputs v_t and n_t are i.i.d. gaussian; (2) the state vector is augmented to contain time-lagged values, that is, s̄_t ≡ [s_{1,t}, s_{1,t−1}, . . . , s_{2,t}, s_{2,t−1}, . . . , s_{d_s,t}, s_{d_s,t−1}, . . .]^T; and (3) the state-space parameter matrices (e.g., F) are constrained to a special format (certain elements are fixed to 0's and 1's) in order to ensure the independence of the sources mentioned above, as sketched below.
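As an illustration, here is one way (our sketch, with hypothetical array shapes, not code from the letter) to realize the constrained format of item 3: each source contributes a block of lagged states that F simply shifts, while the observation matrix lays out the filter taps C_0, . . . , C_{K−1}.

    import numpy as np

    def convolutive_ica_state_space(C):
        """Embed x_t = sum_k C[k] s_{t-k} + n_t in the form of eqs. 1.1 and 1.2.
        C has shape (K, N, ds): K filter taps, N sensors, ds sources."""
        K, N, ds = C.shape
        F = np.zeros((ds * K, ds * K))
        A = np.zeros((N, ds * K))
        for i in range(ds):
            blk = slice(i * K, (i + 1) * K)
            # shift the lagged samples of source i one step; the top entry of
            # the block is refreshed by the white innovation v_{i,t}
            F[blk, blk] = np.eye(K, k=-1)
            for k in range(K):
                A[:, i * K + k] = C[k, :, i]   # tap k of source i
        return F, A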
4 Results

The simulations that are documented in this section serve to illustrate the well-known advantage of advanced gradient-based learning over standard EM. Before advancing to the more involved applications described above, the advantage of gradient-based methods over EM will be explored for a one-dimensional linear state-space model: an ARMA(1,1) process.
Figure 2: Convergence of EM (dashed) and a gradient-based method (solid) in the ARMA(1,1) model. (a) EM has faster initial convergence than the gradient-based method, but the final part is slow for EM. (b) Zoom-in on the log-likelihood axis. Even after 50 iterations, EM has not reached the same level as the gradient-based method. (c) Convergence of the parameter estimates in terms of the squared error relative to the generative parameters.
In this case, F and A are scalars, as are the observation variance R and the transition variance Q. Q is fixed to unity to resolve the inherent scale ambiguity of the model. As a consequence, the model has only three parameters. The BFGS optimizer mentioned in section 2 was used. Figure 2 shows the convergence of both the EM algorithm and the gradient-based method. Initially, EM is fast; it rapidly approaches the maximum log likelihood but slows down as it gets closer to the optimum. The large dynamic range of the log likelihood makes it difficult to ascertain the final increase in the log likelihood; hence, Figure 2b provides a close-up of the log-likelihood scale.
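For reference, a compact numpy sketch of this ARMA(1,1)-as-state-space likelihood follows. It is our illustration: the generative values (F, A, R) = (0.5, 0.3, 0.01) follow Table 2, but the simulation, seeding, and use of finite-difference gradients in place of the exact recipe are ours.

    import numpy as np
    from scipy.optimize import minimize

    def simulate(F, A, R, N, rng):
        s, x = 0.0, np.empty(N)
        for t in range(N):
            s = F * s + rng.normal()                 # state noise variance Q = 1
            x[t] = A * s + np.sqrt(R) * rng.normal()
        return x

    def neg_loglik(params, x):
        F, A, logR = params
        R, Q = np.exp(logR), 1.0                     # Q fixed to resolve the scale
        m, P = 0.0, Q / max(1.0 - F**2, 1e-6)        # stationary initial state
        ll = 0.0
        for xt in x:
            m, P = F * m, F**2 * P + Q               # predict
            S, e = A**2 * P + R, xt - A * m          # innovation
            ll += -0.5 * (np.log(2 * np.pi * S) + e**2 / S)
            K = P * A / S                            # Kalman gain
            m, P = m + K * e, (1.0 - K * A) * P      # update
        return -ll

    x = simulate(0.5, 0.3, 0.01, 500, np.random.default_rng(1))
    res = minimize(neg_loglik, x0=[0.2, 0.5, np.log(0.1)], args=(x,), method="BFGS")
    print(res.x[:2], np.exp(res.x[2]))               # estimates of F, A, R

Note that BFGS here falls back on finite-difference gradients; the point of the letter is precisely that the exact gradient comes for free from the E-step, as sketched in section 2.2.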
Table 2: Estimation in the ARMA(1,1) Model.

                     Generative    Gradient      EM 50      EM ∞
    Iterations            -            43           50       1800
    Log likelihood        -       −24.3882     −24.5131   −24.3883
    F                  0.5000       0.4834       0.5626     0.4859
    A                  0.3000       0.2953       0.2545     0.2940
    R                  0.0100       0.0097       0.0282     0.0103

Notes: The convergence of EM is slow compared to the gradient-based method. Note that after 50 EM iterations, the log likelihood is relatively close to the value achieved at convergence, but the parameter values are far from the generative values.
Table 2 gives an indication of the importance of the final increase. After 50 iterations, EM has reached a log-likelihood value of −24.5131, but the parameter values are still far off. After convergence, the log likelihood has increased to −24.3883, which is still slightly worse than that obtained by the gradient-based method, but the parameters are now near the generative values. The BFGS algorithm used 43 iterations and 52 function evaluations. For large-scale state-space models, the computation time is all but dominated by the E-step computation. Hence, a function evaluation costs approximately one E-step.

Similar results are obtained when comparing the learning algorithms on the Kalman filter-based sensor fusion, mean field ICA, and convolutive ICA problems. As argued in section 2, it is demonstrated that the number of iterations required by the EM algorithm to converge in state-space-type models critically depends on the SNR. Figure 3 shows the performance of the two methods on the three problems. The relevant comparison measure is computation time, which in the examples is all but dominated by the E-step for EM and by function evaluations (whose cost is also dominated by an E-step) for the gradient-based optimizer. The line search of the latter may require more function evaluations per iteration, but that was most often not the case for the BFGS algorithm of choice. The plots indicate that in the low-noise case, the EM algorithm requires relatively more time to converge, whereas the gradient-based method performs equally well for all noise levels.

5 Conclusion

In applying the EM algorithm to maximum likelihood estimation in state-space models, we find, as have many before us, that it has poor convergence properties in the low noise limit. Often a value "close" to the maximum likelihood is reached in the first few iterations, while the final increase, which is crucial to the accurate estimation of the parameters, requires an excessive number of iterations.

More important, we provide a simple scheme for efficient gradient-based optimization achieved by transformation from the EM formulation; the
Figure 3: Number of function evaluations for EM (dashed) and gradient-based optimization (solid) to reach convergence as a function of signal-to-noise ratio for the three problems. (a) Kalman filter-based sensor fusion. (b) Mean field ICA. (c) Convolutive ICA. The level of convergence was defined as a relative change in log likelihood below 10⁻⁵, at which point the parameters no longer changed significantly. In the case of the EM algorithm, this sometimes occurred in plateau regions of the parameter space.
simple math and programming of the EM algorithm are preserved. Following this recipe, one can get the optimization benefits associated with any advanced gradient-based method. In this way, the tedious, problem-specific analysis of the cost-function topology can be replaced with an off-the-shelf approach. Although the analysis provided in this letter is limited to a set of linear mixture models, it is in fact applicable to any model subject to the EM algorithm, hence constituting a strong and general tool for the part of the neural community that uses the EM algorithm.
Appendix: Gradient Derivation

Here we demonstrate how to obtain a partial derivative of the constrained bound function, B(θ, φ*). The derivation follows Shumway and Stoffer (1982). At q(s|φ*) = p(s|x, θ), we can reformulate equation 2.2 as

B(θ, φ*) = ⟨ln [ p(s, x|θ) / q(s|φ*) ]⟩ = ⟨ln p(s, x|θ)⟩ − ⟨ln q(s|φ*)⟩,

where the expectation ⟨·⟩ is over the posterior q(s|φ*) = p(s|x, θ). Only the
first term, J(θ) = ⟨ln p(s, x|θ)⟩, depends on θ. The joint variable distribution factors due to the Markov property of the state-space model,

p(s, x|θ) = p(s_1|θ) ∏_{k=2}^{N} p(s_k|s_{k−1}, θ) ∏_{k=1}^{N} p(x_k|s_k, θ),
where N is the length of the time series and the initial state, s_1, is normally distributed with mean µ_1 and covariance Σ_1. Using the fact that all variables are gaussian, the expected log distributions can be written out as:

J(θ) = −(1/2) [ dim(s) ln 2π + ln|Σ_1| + ⟨(s_1 − µ_1)^T Σ_1^{−1} (s_1 − µ_1)⟩
       + (N − 1)(dim(s) ln 2π + ln|Q|) + Σ_{k=2}^{N} ⟨(s_k − F s_{k−1})^T Q^{−1} (s_k − F s_{k−1})⟩
       + N(dim(x) ln 2π + ln|R|) + Σ_{k=1}^{N} ⟨(x_k − A s_k)^T R^{−1} (x_k − A s_k)⟩ ].
The gradient with respect to A is:

∂J(θ)/∂A = −(1/2) Σ_{k=1}^{N} ( 2 R^{−1} A ⟨s_k s_k^T⟩ − 2 R^{−1} x_k ⟨s_k⟩^T )
         = −R^{−1} A Σ_{k=1}^{N} ⟨s_k s_k^T⟩ + R^{−1} Σ_{k=1}^{N} x_k ⟨s_k⟩^T.
It is seen that the gradient depends on the marginal posterior source moments ⟨s_k s_k^T⟩ and ⟨s_k⟩, which are provided by the Kalman smoother. The gradients with respect to F and Q furthermore require the marginal moment ⟨s_k s_{k−1}^T⟩. The derivation of the gradients for the covariances R and Q must respect the symmetry of these matrices. Furthermore, steps must be taken to ensure positive definiteness following a gradient step. This can be achieved, for example, by adapting Q_0 instead of Q, where Q = Q_0 Q_0^T.
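As a small illustration, the gradient above maps directly onto smoother output; the following numpy helper (ours, with hypothetical array shapes) evaluates ∂J/∂A from the marginal posterior moments:

    import numpy as np

    def grad_A(A, R, x, s_mean, s_cov):
        """dJ/dA from Kalman smoother output.
        x: (N, dx) observations; s_mean: (N, ds) posterior means <s_k>;
        s_cov: (N, ds, ds) posterior covariances, so
        <s_k s_k^T> = s_cov[k] + outer(s_mean[k], s_mean[k])."""
        Ess = s_cov.sum(axis=0) + s_mean.T @ s_mean   # sum_k <s_k s_k^T>
        Exs = x.T @ s_mean                            # sum_k x_k <s_k>^T
        Rinv = np.linalg.inv(R)
        return -Rinv @ A @ Ess + Rinv @ Exs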
Acknowledgments

We thank Danish Oticon Fonden for supporting in part this research project. Inspiring discussions with Lars Kai Hansen and Ole Winther initiated this work.

References

Bermond, O., & Cardoso, J.-F. (1999). Approximate likelihood for noisy mixtures. In Proceedings of the ICA Conference (pp. 325–330). Aussois, France.
Bishop, C. (1995). Neural networks for pattern recognition. New York: Oxford University Press.
Broyden, C. G. (1965). A class of methods for solving nonlinear simultaneous equations. Math. Comput., 19, 577–593.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1–38.
Digalakis, V., Rohlicek, J., & Ostendorf, M. (1993). ML estimation of a stochastic linear system with the EM algorithm and its application to speech recognition. IEEE Transactions on Speech and Audio Processing, 1(4), 431–442.
Gupta, N. K., & Mehra, R. K. (1974). Computational aspects of maximum likelihood estimation and reduction in sensitivity calculations. IEEE Transactions on Automatic Control, AC-19(1), 774–783.
Højen-Sørensen, P. A., Winther, O., & Hansen, L. K. (2002). Mean field approaches to independent component analysis. Neural Computation, 14, 889–918.
Honkela, A., Valpola, H., & Karhunen, J. (2003). Accelerating cyclic update algorithms for parameter estimation by pattern searches. Neural Processing Letters, 17(2), 191–203.
Jamshidian, M., & Jennrich, R. I. (1993). Conjugate gradient acceleration of the EM algorithm. Journal of the American Statistical Association, 88, 221–228.
Jamshidian, M., & Jennrich, R. I. (1997). Acceleration of the EM algorithm using quasi-Newton methods. Journal of the Royal Statistical Society, B, 59(3), 569–587.
Lange, K. (1995). A quasi-Newton acceleration of the EM algorithm. Statistica Sinica, 5, 1–18.
McLachlan, G. J., & Krishnan, T. (1997). The EM algorithm and extensions. New York: Wiley.
Lehn-Schiøler, T., Hansen, L. K., & Larsen, J. (2005). Mapping from speech to images using continuous state space models. In Lecture Notes in Computer Science (Vol. 3361, pp. 136–145). Berlin: Springer.
Meilijson, I. (1989). A fast improvement to the EM algorithm on its own terms. Journal of the Royal Statistical Society, B, 51, 127–138.
Moulines, E., Cardoso, J., & Cassiat, E. (1997). Maximum likelihood for blind separation and deconvolution of noisy signals using mixture models. In Proc. ICASSP (Vol. 5, pp. 3617–3620).
Neal, R. M., & Hinton, G. E. (1998). A view of the EM algorithm that justifies incremental, sparse, and other variants. In M. I. Jordan (Ed.), Learning in graphical models. Dordrecht: Kluwer.
Nielsen, H. B. (2000). UCMINF—an algorithm for unconstrained nonlinear optimization (Tech. Rep. IMM-Rep-2000-19). Lyngby: Technical University of Denmark.
Olsson, R. K., & Hansen, L. K. (2004). Probabilistic blind deconvolution of nonstationary sources. In 12th European Signal Processing Conference (pp. 1697–1700). Vienna, Austria.
Olsson, R. K., & Hansen, L. K. (2005). A harmonic excitation state-space approach to blind separation of speech. In L. K. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing systems (Vol. 17, pp. 993–1000). Cambridge, MA: MIT Press.
Petersen, K. B., & Winther, O. (2005). The EM algorithm in independent component analysis. In International Conference on Acoustics, Speech, and Signal Processing. Piscataway, NJ: IEEE.
Redner, R. A., & Walker, H. F. (1984). Mixture densities, maximum likelihood and the EM algorithm. SIAM Review, 2(26), 195–239.
Roweis, S., & Ghahramani, Z. (1999). A unifying review of linear gaussian models. Neural Computation, 11, 305–345.
Salakhutdinov, R., & Roweis, S. (2003). Adaptive overrelaxed bound optimization methods. In International Conference on Machine Learning (pp. 664–671). New York: AAAI Press.
Salakhutdinov, R., Roweis, S. T., & Ghahramani, Z. (2003). Optimization with EM and expectation-conjugate-gradient. In International Conference on Machine Learning (Vol. 20, pp. 672–679). New York: AAAI Press.
Sandell, N. R., & Yared, K. I. (1978). Maximum likelihood identification of state space models for linear dynamic systems (Tech. Rep. ESL-R-814). Cambridge, MA: Electronic Systems Laboratory, MIT.
Shumway, R., & Stoffer, D. (1982). An approach to time series smoothing and forecasting using the EM algorithm. J. Time Series Anal., 3(4), 253–264.
Xu, L., & Jordan, M. I. (1996). On convergence properties of the EM algorithm for gaussian mixtures. Neural Computation, 8, 129–151.
Received February 25, 2005; accepted July 20, 2006.
LETTER
Communicated by Shun-ichi Amari
Variational Bayes Solution of Linear Neural Networks and Its Generalization Performance Shinichi Nakajima [email protected] Nikon Corporation, Kumagaya, Saitama, 360–8559, Japan
Sumio Watanabe [email protected] Tokyo Institute of Technology, Midori-ku, Yokohama, Kanagawa, 226-8503, Japan
It is well known that in unidentifiable models, the Bayes estimation provides much better generalization performance than the maximum likelihood (ML) estimation. However, its accurate approximation by Markov chain Monte Carlo methods requires huge computational costs. As an alternative, a tractable approximation method, called the variational Bayes (VB) approach, has recently been proposed and has been attracting attention. Its advantage over the expectation maximization (EM) algorithm, often used for realizing the ML estimation, has been experimentally shown in many applications; nevertheless, it has not yet been theoretically shown. In this letter, through analysis of the simplest unidentifiable models, we theoretically show some properties of the VB approach. We first prove that in three-layer linear neural networks, the VB approach is asymptotically equivalent to a positive-part James-Stein type shrinkage estimation. Then we theoretically clarify its free energy, generalization error, and training error. Comparing them with those of the ML estimation and the Bayes estimation, we discuss the advantage of the VB approach. We also show that unlike in the Bayes estimation, the free energy and the generalization error are less simply related to each other and that in typical cases, the VB free energy well approximates the Bayes one, while the VB generalization error significantly differs from the Bayes one.

1 Introduction

The conventional learning theory (Cramér, 1949), which provides the mathematical foundation of information criteria, such as Akaike's information criterion (AIC) (Akaike, 1974), the Bayesian information criterion (BIC) (Schwarz, 1978), the minimum description length criterion (MDL) (Rissanen, 1986), and others, describes the asymptotic behavior of the Bayes free energy, the generalization error, the training error, and so on in the
regular statistical models. However, many models used in the field of machine learning, such as neural networks, Bayesian networks, mixture models, and hidden Markov models, are actually classified as unidentifiable models, which have singularities in their parameter spaces. Around the singularities, on which the Fisher information matrix is singular, the log likelihood cannot be approximated by any quadratic form of the parameter. Therefore, neither the distribution of the maximum likelihood estimator nor the Bayes posterior distribution asymptotically converges to the normal distribution, which prevents the conventional learning theory from holding (Hartigan, 1985; Watanabe, 1995; Amari, Park, & Ozeki, 2006; Hagiwara, 2002). Accordingly, the information criteria have no theoretical foundation in unidentifiable models.

Some properties of learning in unidentifiable models have been theoretically clarified. In the maximum likelihood (ML) estimation, which is asymptotically equivalent to the maximum a posteriori (MAP) estimation, the asymptotic behavior of the log-likelihood ratio in some unidentifiable models was analyzed (Hartigan, 1985; Bickel & Chernoff, 1993; Takemura & Kuriki, 1997; Kuriki & Takemura, 2001; Fukumizu, 2003), facilitated by the idea of the locally conic parameterization (Dacunha-Castelle & Gassiat, 1997). It has thus been known that the ML estimation in general provides poor generalization performance, and in the worst cases, the ML estimator diverges. In linear neural networks, on which we focus in this letter, the generalization error was clarified and proved to be greater than that of a regular model with the same parameter space dimension when the model is redundant for learning the true distribution (Fukumizu, 1999), although the ML estimator stays in a finite region (Baldi & Hornik, 1995). On the other hand, for the analysis of the Bayes estimation in unidentifiable models, an algebraic geometrical method was developed (Watanabe, 2001a), by which the generalization errors, or their upper bounds, in some unidentifiable models were clarified and proved to be less than that of the regular models (Watanabe, 2001a; Yamazaki & Watanabe, 2002; Rusakov & Geiger, 2002; Yamazaki & Watanabe, 2003a, 2003b; Aoyagi & Watanabe, 2004); moreover, the generalization error of any model having singularities was proved to be less than that of the regular models when we use a prior distribution having positive values on the singularities (Watanabe, 2001b).

According to the previous work noted, it can be said that in unidentifiable models, the Bayes estimation has the advantage of generalization performance over the ML estimation. However, the Bayes posterior distribution can seldom be exactly realized. Furthermore, Markov chain Monte Carlo (MCMC) methods, often used for approximation of the posterior distribution, require huge computational costs. As an alternative, the variational Bayes (VB) approach, which provides computational tractability, was proposed (Hinton & van Camp, 1993; MacKay, 1995; Attias, 1999; Jaakkola &
Jordan, 2000; Ghahramani & Beal, 2001). In models with latent variables, an iterative algorithm like the expectation maximization (EM) algorithm, which provides the ML estimator (Dempster, Laird, & Rubin, 1977), was derived in the framework of the VB approach (Attias, 1999; Ghahramani & Beal, 2001). Some properties similar to the EM algorithm have been proved (Wang & Titterington, 2004), and it has been shown that the VB approach can be considered as natural gradient descent with respect to the VB free energy (Sato, 2001). However, its advantage of generalization performance over the EM algorithm has not been theoretically shown yet, although it has been experimentally shown in many applications. In addition, although the VB free energies in some models have been clarified (Watanabe & Watanabe, 2006; Hosino, Watanabe, & Watanabe, 2005; Nakano & Watanabe, 2005), they, unfortunately, provide little information on the corresponding generalization errors, unlike in the Bayes estimation. The reason is that the simple relation between the free energy and the generalization error that holds in the Bayes estimation does not hold in the VB approach. (This relation will appear as equation 3.15 in section 3.) To date, the VB generalization error has not been theoretically clarified in any unidentifiable model.

Meanwhile, it was surprisingly proved that even in the regular models, the usual ML estimator does not provide optimum generalization performance (Stein, 1956). The James-Stein (JS) shrinkage estimator, which dominates the ML estimator, was subsequently introduced (James & Stein, 1961). (The definition of dominate is given in section 3.3.) Some work has discussed the relation between the JS estimator and Bayesian learning methods as follows: it was shown that the JS estimator can be derived as the solution of an empirical Bayes (EB) approach where the prior distribution of a regular model has a hyperparameter as its variance to be estimated based on the marginal likelihood (Efron & Morris, 1973); the similarity between the asymptotic behavior of the generalization error of the Bayes estimation in a simple class of unidentifiable models and that of the JS estimation in the regular models was pointed out (Watanabe & Amari, 2003); and it was proved that in three-layer linear neural networks, a subspace Bayes (SB) approach, an extension of the EB approach where part of the parameters is regarded as a hyperparameter, is asymptotically equivalent to a positive-part JS-type shrinkage estimation, a subspecies of the JS estimation (Nakajima & Watanabe, 2005, 2006). Interestingly, the VB solution derived in this letter is similar to the SB solution.

In this letter, we consider the theoretical reason that the VB approach provides good generalization performance through the analysis of the simplest unidentifiable models. We first derive the VB solution of three-layer linear neural networks, which reveals an interesting relation between the VB approach, a rising method in the field of machine learning, and the JS shrinkage estimation, a classic in statistics. Then we clarify its free energy,
generalization error, and training error. The major results of this letter are as follows:
• The VB approach is asymptotically equivalent to a positive-part JS-type shrinkage estimation. Hence, the VB estimator of a necessary component to realize the true distribution is asymptotically equivalent to the ML estimator, while the VB estimator of a redundant component is not equivalent to the ML estimator, even in the asymptotic limit.

• In the VB approach, the asymptotic behavior of the free energy and that of the generalization error are less simply related to each other, unlike in the Bayes estimation. In typical cases, the VB free energy well approximates the Bayes one, while the VB generalization error significantly differs from the Bayes one. Moreover, their dependencies on the redundant parameter dimension are different from each other, which can throw doubt on model selection by minimizing the VB free energy to obtain better generalization performance.

• Although the VB free energy is, by definition, never less than the Bayes one, the VB generalization error can be much less than the Bayes one. This, however, does not mean the domination of the VB approach over the Bayes estimation and is consistent with the proved superiority of the Bayes estimation.

• The more different the parameter space dimensions of the different layers are, the smaller the generalization error is. This implies that the advantage of the VB approach over the EM algorithm can be enhanced in mixture models, where the dimension of the upper-layer parameters, which corresponds to the number of the mixing coefficients, is one per component, and the dimension of the lower-layer parameters, which corresponds to the number of the parameters that each component has, is usually more.

• The VB approach has not only a property similar to the Bayes estimation but also another property similar to the ML estimation.
In section 2, neural networks and linear neural networks are described. The Bayes estimation, the VB approach, and the JS-type shrinkage estimator are introduced in section 3. Then the effects of singularities on generalization performance and the significance of consideration of delicate situations are explained in sections 4 and 5, respectively. The solution of the VB predictive distribution is derived in section 6, and its free energy, generalization error, and training error are clarified in section 7. In section 8, the theoretical values of the generalization properties of the VB approach are compared with those of the ML estimation and of Bayes estimation. Discussion and conclusions follow in sections 9 and 10, respectively.
2 Linear Neural Networks

Let x ∈ R^M be an input (column) vector, y ∈ R^N an output vector, and w a parameter vector. A neural network model can be described as a parametric family of maps {f(·; w) : R^M → R^N}. A three-layer neural network with H hidden units is defined by

f(x; w) = Σ_{h=1}^{H} b_h ψ(a_h^t x), (2.1)
where w = {(a_h, b_h) ∈ R^M × R^N; h = 1, . . . , H} summarizes all the parameters; ψ(·) is an activation function, which is usually a bounded, nondecreasing, antisymmetric, nonlinear function like tanh(·); and t denotes the transpose of a matrix or vector. We denote by N_d(µ, Σ) the d-dimensional normal distribution with average vector µ and covariance matrix Σ, and by N_d(·; µ, Σ) its density function. Assume that the output is observed with a noise subject to N_N(0, Σ). Then the conditional distribution is given by

p(y|x, w) = N_N(y; f(x; w), Σ). (2.2)
In this letter, we focus on linear neural networks, whose activation functions are linear, as the simplest unidentifiable models. A linear neural network model (LNN) is defined by

f(x; A, B) = BAx, (2.3)
where A = (a_1, . . . , a_H)^t is an H × M input parameter matrix and B = (b_1, . . . , b_H) is an N × H output parameter matrix. Because the transform (A, B) → (TA, BT^{−1}) does not change the map for any nonsingular H × H matrix T, the parameterization in equation 2.3 has trivial redundancy. Accordingly, the essential dimension of the parameter space is given by

K = H(M + N) − H² = H(L + l) − H², (2.4)

where

L = max(M, N), (2.5)
l = min(M, N). (2.6)
We assume that H ≤ l throughout this letter. An LNN, also known as a reduced-rank regression model, is not a toy but a useful model in many applications (Reinsel & Velu, 1998). It is used for multivariate factor analysis when the relevant dimension of the factors is unknown and classified as an essentially singular model when 0 < H < l because no transform makes
it regular. Although LNNs are simple, we expect that some phenomena caused by the singularities, revealed in this letter, would be observed in other unidentifiable models as well.

3 Learning Methods

3.1 Bayes Estimation. Let X^n = {x_1, . . . , x_n} and Y^n = {y_1, . . . , y_n} be arbitrary n training samples independently and identically taken from the true distribution q(x, y) = q(x)q(y|x). The marginal conditional likelihood of a model p(y|x, w) is given by

Z(Y^n|X^n) = ∫ φ(w) ∏_{i=1}^{n} p(y_i|x_i, w) dw, (3.1)
where φ(w) is a prior distribution. The posterior distribution is given by

p(w|X^n, Y^n) = φ(w) ∏_{i=1}^{n} p(y_i|x_i, w) / Z(Y^n|X^n). (3.2)
The predictive distribution is defined as the average of the model over the posterior distribution:

p(y|x, X^n, Y^n) = ∫ p(y|x, w) p(w|X^n, Y^n) dw. (3.3)
The free energy, evidence, or stochastic complexity, which is used for model selection or hyperparameter estimation (Efron & Morris, 1973; Akaike, 1980; MacKay, 1992), is defined by

F(Y^n|X^n) = − log Z(Y^n|X^n). (3.4)
Commonly, the normalized free energy, defined by

F(n) = ⟨F(Y^n|X^n) + log q(Y^n|X^n)⟩_{q(X^n,Y^n)}, (3.5)
where ⟨·⟩_{q(X^n,Y^n)} denotes the expectation value over all sets of n training samples, can be asymptotically expanded as follows:

F(n) = λ log n + o(log n), (3.6)
where λ is called the free energy coefficient in this letter. The generalization error, a criterion of generalization performance, and the training error are
defined by

G(n) = ⟨G(X^n, Y^n)⟩_{q(X^n,Y^n)}, (3.7)
T(n) = ⟨T(X^n, Y^n)⟩_{q(X^n,Y^n)}, (3.8)
respectively, where

G(X^n, Y^n) = ∫ q(x) q(y|x) log [ q(y|x) / p(y|x, X^n, Y^n) ] dx dy (3.9)
is the Kullback-Leibler (KL) divergence of the predictive distribution from the true distribution, and

T(X^n, Y^n) = n^{−1} Σ_{i=1}^{n} log [ q(y_i|x_i) / p(y_i|x_i, X^n, Y^n) ] (3.10)
is the empirical KL divergence. Commonly, the generalization error and the training error can be asymptotically expanded as follows:

G(n) = λ′ n^{−1} + o(n^{−1}), (3.11)
T(n) = ν n^{−1} + o(n^{−1}), (3.12)
where the coefficients of the leading terms, λ′ and ν, are called the generalization coefficient and the training coefficient, respectively, in this letter. In the regular models, the free energy, generalization, and training coefficients are simply related to the parameter dimension K in the Bayes estimation, as well as in the ML estimation, as follows:

2λ = 2λ′ = −2ν = K. (3.13)
The penalty factor of AIC (Akaike, 1974) is based on the fact that (λ′ − ν) = K, and the penalty factor of BIC (Schwarz, 1978), as well as that of the MDL (Rissanen, 1986), is based on the fact that 2λ = K. However, in unidentifiable models, equation 3.13 does not generally hold, and the coefficients can depend not only on the parameter dimension of the learning machine but also on the true distribution. So it seems to be difficult to propose any information criterion in unidentifiable models; nevertheless, an information criterion, called a singular information criterion, has recently been proposed by using the dependence on the true distribution (Yamazaki, Nagata, & Watanabe, 2005). For this kind of work, clarifying the coefficients is important.
In the (rigorous) Bayes estimation, we can prove that the relation

λ′ = λ (3.14)

still holds even in unidentifiable models, by using the following well-known relation (Levin, Tishby, & Solla, 1990):

G(n) = F(n + 1) − F(n). (3.15)
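Heuristically, the step from equation 3.15 to equation 3.14 can be spelled out by substituting the expansion 3.6 (this gloss is ours; the letter's claim is the rigorous version):

G(n) = F(n + 1) − F(n)
     = λ [log(n + 1) − log n] + (lower-order terms)
     = λ log(1 + 1/n) + · · · = λ n^{−1} + o(n^{−1}),

so that comparison with equation 3.11 gives λ′ = λ.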
Therefore, clarifying the Bayes free energy immediately informs us of the asymptotic behavior of the Bayes generalization error.

3.2 Variational Bayes Approach. For an arbitrary trial distribution r(w), Jensen's inequality leads to the following inequality:

F(Y^n|X^n) ≤ ⟨ log [ r(w) / (p(Y^n|X^n, w) φ(w)) ] ⟩_{r(w)} = F̄(Y^n|X^n), (3.16)
where · p denotes the expectation value over a distribution p, and F¯ (Yn |Xn ) is called the generalized free energy in this letter. In the VB approach, restricting the set of possible distributions for r (w), we minimize the generalized free energy, a functional of r (w). The optimum of the trial distribution, which we denote by rˆ (w), is called the VB posterior distribution and substituted for p(w|Xn , Yn ) in equation 3.3. In addition, the minimum value of the generalized free energy, which we denote by F˜ (Yn |Xn ), is called the VB free energy, and the expectation value over the VB posterior distribution is called the VB estimator. In the original VB approach, proposed for learning of neural networks, the posterior distribution is restricted to the normal distribution (Hinton & van Camp, 1993). Recently, an iterative algorithm with good tractability has been proposed for models with latent variables, such as mixture models and graphical models, by using an appropriate class of prior distributions and restricting the posterior distribution such that the parameters and the latent variables are independent of each other; it has been attracting attention (Attias, 1999; Ghahramani & Beal, 2001). In LNNs, we can deduce a similar algorithm by restricting the posterior distribution such that the parameters of different layers, as well as those of different components, are independent of each other, as shown below.1
1 Note that this restriction is similar to the one applied in models with hidden variables in Attias (1999). Therefore, the analysis in this letter also provides insight into the properties of the VB approach in more general unidentifiable models, which is discussed in section 9.4.
1120
S. Nakajima and S. Watanabe
ah
bh
Figure 1: VB posterior distribution (right), which is the independent distribution with respect to different layer parameters to be substituted for the Bayes posterior distribution (left).
Assume that the VB posterior distribution factorizes as r (w) = r (A, B) = r (A)r (B).
(3.17)
Then the generalized free energy is given by F¯ (Yn |Xn ) =
r (A)r (B) log
r (A)r (B) d Ad B, p(Yn |Xn , A, B)φ(A, B)
(3.18)
where d V denotes the integral with respect to all the elements of the matrix V. Using the variational method, we can easily show that the VB posterior distribution satisfies the following relations if we use a prior distribution that factorizes as φ(w) = φ(A, B) = φ(A)φ(B): r (A) ∝ φ(A) explog p(Yn |Xn , A, B)r (B) ,
(3.19)
r (B) ∝ φ(B) explog p(Yn |Xn , A, B)r (A) .
(3.20)
We find from equations 3.19 and 3.20 that the VB posterior distributions are the normal if the prior distributions, φ(A) and φ(B), are the normal, because the log likelihood of an LNN, log p(Yn |Xn , A, B), is a biquadratic function of A and B. Figure 1 (right) illustrates the VB posterior distribution. The VB posterior distribution, which is the independent distribution with respect to different layer parameters minimizing the generalized free energy, equation 3.18, is substituted for the Bayes posterior distribution (see Figure 1, left). We then apply another restriction, the independence of the parameters
Variational Bayes Solution and Generalization
1121
of different components, to the posterior distribution: r (A, B) =
H
h=1 r (a h )r (b h ).
(3.21)
This restriction makes the relations, equations 3.19 and 3.20, decompose into those for each component as follows, if we use a prior distribution that H factorizes as φ(A, B) = h=1 φ(a h )φ(b h ): r (a h ) ∝ φ(a h ) explog p(Yn |Xn , A, B)r (A,B)/r (a h ) ,
(3.22)
r (b h ) ∝ φ(b h ) explog p(Y |X , A, B)r (A,B)/r (bh ) ,
(3.23)
n
n
where ·r (A,B)/r (a h ) , as well as ·r (A,B)/r (bh ) , denotes the expectation value over the trial distribution of the parameters except a h , as well as that except bh . In the same way as the Bayes estimation, the predictive distribution, the normalized free energy, the generalization error, and the training error of the VB approach are given by p˜ (y|x, Xn , Yn ) =
p(y|x, w)ˆr (w)dw,
(3.24)
F˜ (n) = F˜ (Yn |Xn ) + log q (Yn |Xn )q (Xn ,Yn ) = λ˜ log n + o(log n), (3.25) −1
˜ ˜ ˜ G(n) = G(X , Y )q (Xn ,Yn ) = λn
−1
+ o(n ),
(3.26)
˜ ˜ n , Yn )q (Xn ,Yn ) = ν˜ n−1 + o(n−1 ), T(n) = T(X
(3.27)
n
n
respectively, where ˜ n , Yn ) = G(X
q (x)q (y|x) log
˜ n , Yn ) = n−1 T(X
n i=1
log
q (y|x) d xd y, p˜ (y|x, Xn , Yn )
q (y|x) . p˜ (y|x, Xn , Yn )
(3.28) (3.29)
˜ because the relation corresponding to equation 3.15 does In general, λ˜ = λ, not hold in the VB approach. Accordingly, the VB free energy and the VB generalization error are less simply related to each other, as will be shown in section 8. 3.3 James-Stein Type Shrinkage Estimator. In this section, considering a regular (i.e., identifiable) model, we introduce the James-Stein (JS)–type shrinkage estimator. Suppose that Zn = {z1 , . . . , zn } are arbitrary n training samples independently and identically taken from Nd (µ, Id ), where µ
1122
S. Nakajima and S. Watanabe
is the mean parameter to be estimated and Id denotes the d × d identity matrix. We denote by ∗ the true value of a parameter and by a hat an estimator of a parameter throughout this letter. We define the risk as the mean squared error: gµˆ (µ∗ ) = µ ˆ − µ∗ 2 q (Zn ) .2 We say that µ ˆ α dominates µ ˆ β if gµˆ α (µ∗ ) ≤ gµˆ β (µ∗ ) for arbitrary µ∗ and gµˆ α (µ∗ ) < gµˆ β (µ∗ ) for a certain µ∗ . n The usual ML estimator, µ ˆ MLE = n−1 i=1 zi , is an efficient estimator that is never dominated by any unbiased estimator. However, the ML estimator has surprisingly been proved to be inadmissible when d ≥ 3; that is, there exists at least one biased estimator that dominates the ML estimator (Stein, 1956). The JS shrinkage estimator, defined as follows, was subsequently introduced (James & Stein, 1961): µ ˆ JS = (1 − χ/n µ ˆ MLE 2 )µ ˆ MLE ,
(3.30)
where χ = (d − 2) is called the degree of shrinkage in this letter. The estimators expressed by equation 3.30 with arbitrary χ > 0 are called the JS-type shrinkage estimators. We can easily find that a positive-part JS-type shrinkage estimator, defined by µ ˆ PJS = θ (n µ ˆ MLE , ˆ MLE 2 > χ)(1 − χ/n µ ˆ MLE 2 )µ ˆ MLE = 1 − χχ −1 µ (3.31) dominates the JS-type shrinkage estimator, equation 3.30, with the same degree of shrinkage, χ. Here, θ (·) is the indicator function of an event; it is equal to one if the event is true and to zero otherwise, and χ = max(χ, n µ ˆ MLE 2 ). Note that even in the large sample limit, the shrinkage estimators are not equivalent to the ML estimator if µ∗ = 0, although they are equivalent to the ML estimator otherwise. Also note that the indicator function in equation 3.31 works like a model selection method, which eliminates insignificant parameters, and the shrinkage factor works like a regularizer, which pulls the estimator near the model with smaller degree of freedom. 4 Unidentifiability and Singularities We say that a parametric model is unidentifiable if the map from the parameter to the probability distribution is not one-to-one. A neural network model, equation 2.1, is unidentifiable because the model is invariant for any a h if b h = 0 or vice versa. The continuous points denoting the same distribution are called the singularities, because the Fisher information matrix
2 The risk function coincides with the twice of the generalization error, equation 3.7, when p(z|µ) ˆ = Nd (z; µ, ˆ Id ) is regarded as the predictive distribution.
Variational Bayes Solution and Generalization
1123
bh
ah
Figure 2: Singularities of a neural network model.
on them is singular. The shadowed locations in Figure 2 indicate the singularities. We can see in Figure 2 that the model denoted by the singularities has more neighborhoods and state density than any other model denoted by only one point each. When the true model is not on the singularities, they asymptotically do not affect prediction, and therefore the conventional learning theory of the regular models holds. When the true model is on the singularities, they significantly affect generalization performance as follows:
r
r
ML-type effect: In the ML estimation, as well as the MAP estimation, an increase of the neighborhoods of the true distribution leads to an increase of the flexibility of imitating noises and therefore accelerates overfitting. Bayesian effect: In the Bayesian learning methods, the large state density of the true distribution increases its weight and therefore suppresses overfitting.
According to the effects above, we can expect the Bayesian learning methods to provide better generalization performance than the ML estimation.
5 Significance of Delicate Situation Analysis Suppression of overfitting accompanies insensitivity to the true components with small amplitude. There is a trade-off that would be ignored in asymptotic analysis if we considered only situations when the true model is distinctly on the singularities or not. Consider, for example, the positive-part JS-type estimator, equation 3.31, with a very large but finite degree of shrinkage, 1 χ < ∞. That estimator provides very small generalization error when µ∗ = 0 because of its strong shrinkage and the same
1124
S. Nakajima and S. Watanabe
generalization error as that of the ML estimator when µ∗ = O(1) in the asymptotic limit because µ ˆ PJS = µ ˆ MLE + O p (n−1 ). Therefore, in ordinary asymptotic analysis, it could seem to be a good learning method. However, such an estimator will not perform well because it can provide extremely worse generalization performance in real situations where the number of training samples, n, is finite. In other words, the ordinary asymptotic analysis ignores the insensitivity to the true small components. To consider the case where n is sufficiently large but finite, with the benefit of the asymptotic approximation, we should √ analyze the delicate situations (Watanabe & Amari, 2003) when 0 < n µ∗ < ∞. We have to balance between the generalization performance in the region where √ µ∗ = 0 and that in the region where 0 < n µ∗ < ∞, when we select a learning method. The delicate situations in unidentifiable models should be defined as the ones where the KL divergence of the true distribution from the singularities is comparable to n−1 . This kind of analysis is important in model selection problems and in statistical tests with finite number of samples for the following reasons: first, there naturally exist a few true components with amplitude comparable to n−1/2 when neither the smallest nor the largest model is selected; and second, whether the selected model involves such components essentially affects generalization performance. In this letter, we discuss the generalization error and the training error in the delicate situations, in addition to the ordinary asymptotic analysis. 6 Theoretical Analysis 6.1 Variational Condition. Assume that the covariance matrix of a noise is known and equal to I N . Then the conditional distribution of an LNN is given by p(y|x, A, B) = N N (y; B Ax, I N ).
(6.1)
We use the prior distribution that factorizes as φ(A, B) = φ(A)φ(B), where φ(A) = N H M (vec(At ); 0, c a2 I H M ),
(6.2)
φ(B) = N NH (vec(B); 0, c b2 I NH ).
(6.3)
Here 0 < c a2 , c b2 < ∞ are constants corresponding to the variances, and vec(·) denotes the vector created from a matrix by stacking the column vectors below one another. Assume that the true conditional distribution is p(y|x, A∗ , B ∗ ), where B ∗ A∗ is the true map with rank H ∗ ≤ H. For simplicity, we assume that the input vector is orthonormalized so
Variational Bayes Solution and Generalization
1125
that xx t q (x)d x = I M . Consequently, the central limit theorem leads to the following two equations with respect to the sufficient statistics: Q(Xn ) = n−1 R(X , Y n
n
n i=1
n ) = n−1 i=1
xi xit = I M + O p (n−1/2 ), yi xit
∗
∗
−1/2
= B A + O p (n
(6.4) ),
(6.5)
where Q(Xn ) is an M × M symmetric matrix and R(Xn , Yn ) is an N × M matrix. Hereafter, we abbreviate Q(Xn ) as Q and R(Xn , Yn ) as R. Let γh be the hth largest singular value of the matrix RQ−1/2 , ωa h the corresponding right singular vector, and ωbh the corresponding left singular vector, where 1 ≤ h ≤ H. We find from equation 6.5 that in the asymptotic limit, the singular values corresponding to the necessary components to realize the true distribution converge to finite values, while the others corresponding to the redundant components converge to zero. Therefore, with probability 1, the largest H ∗ singular values correspond to the necessary components, and the others correspond to the redundant components. Combining that with equation 6.4, we have ωbh RQρ = ωbh R + O p (n−1 )
for
H ∗ < h ≤ H,
(6.6)
where −∞ < ρ < ∞ is an arbitrary constant. Substituting equations 6.1 to 6.3 into equations 3.22 and 3.23, we find that the posterior distribution can be written as follows: r (A, B) =
H
h=1 r (a h )r (b h ),
(6.7)
where r (a h ) = N M (a h ; µa h , a h ),
(6.8)
r (b h ) = N N (b h ; µbh , bh ).
(6.9)
Given an arbitrary map B A, we can have A with its orthogonal row vectors and B with its orthogonal column vectors by using the singular value decomposition. In just that case, both of the prior probabilities, equations 6.2 and 6.3, are maximized. Accordingly, {µa h ; h = 1, . . . , H}, as well as {µbh ; h = 1, . . . , H}, of the optimum distribution is a set of vectors orthogonal to each other. Now we can easily obtain the following condition, called the variational condition, by substituting equations 6.7 to 6.9 into equations 3.22 and 3.23: µa h = na h Rt µbh , a h = (n( µbh + tr(bh ))Q + 2
(6.10) c a−2 I M )−1 ,
(6.11)
1126
S. Nakajima and S. Watanabe
µbh = nbh Rµa h , −1 bh = n µat h Qµa h + tr a1/2 Qa1/2 + c b−2 I N , h h
(6.12) (6.13)
where tr(·) denotes the trace of a matrix. Similarly, by substituting equations 6.7 to 6.9 into equation 3.18, we also obtain the following form of the generalized free energy: 2 F¯ (Yn |Xn ) =
H
h=1
− log σa2M σb2N − 2nµtbh Rµa h h h
µbh 2 + tr(bh ) + n µat h Qµa h + tr a1/2 Qa1/2 h h +
n i=1
yi 2 + O p (n−1/2 ) + const.,
(6.14)
where σa2h and σb2h are defined by a h = σa2h (I M + O p (n−1/2 )),
(6.15)
bh = σb2h (1 + O p (n−1/2 ))I N ,
(6.16)
respectively. 6.2 Variational Bayes Solution. The variational condition, equations 6.10 to 6.13, can be analytically solved, which leads to the following theorem: Theorem 1. is given by Bˆ Aˆ =
Let L h = max(L , nγh2 ). The VB estimator of the map of an LNN H h=1
1 1 − L L − ωbh ωbt h RQ−1 + O p (n−1 ). h
(6.17)
The proof is given in the appendix. Moreover, we obtain the following lemma, expanding the predictive distribution divided by the true distribution, p˜ (y|x, Xn , Yn )/q (y|x): Lemma 1. The predictive distribution of an LNN in the VB approach can be written as follows: ˆ Bˆ Ax, ˆ V) ˆ + O p (n–3/2 ), p˜ (y|x, Xn , Yn ) = N N (y; V ˆ = I N + O p (n−1 ). where V
(6.18)
Variational Bayes Solution and Generalization
1127
Comparing equation 6.17 with the ML estimator, Bˆ Aˆ MLE =
H
ωbh ωbt h RQ−1
(6.19)
h=1
(Baldi & Hornik, 1995), we find that the VB estimator of each component is asymptotically equivalent to a positive-part JS-type shrinkage estimator, defined by equation 3.31. However, even if the VB estimator is asymptotically equivalent to the shrinkage estimator, the VB approach can differ from the shrinkage estimation because of the extent of the VB posterior distribution. Lemma 1 states that the VB posterior distribution is sufficiently localized that we can substitute the model at the VB estimator for the VB predictive distribution with asymptotically insignificant impact on generalization performance. Therefore, we conclude that the VB approach is asymptotically equivalent to the shrinkage estimation. Note that the variances, c a2 and c b2 , of the prior distributions, equations 6.2 and 6.3, asymptotically have no effect on prediction and hence on generalization performance, as far as they are positive and finite constants. 7 Generalization Properties 7.1 Free Energy. We can obtain the VB free energy by substituting the VB posterior distribution, derived in the appendix, into equation 6.14, where only the order of σˆ a2h and that of σˆ b2h signify because the leading term of the normalized free energy should be the first one in the curly braces of equation 6.14. Theorem 2. The normalized free energy of an LNN in the VB approach can be asymptotically expanded as F˜ (n) = λ˜ log n + O(1), where the free energy coefficient is given by 2λ˜ = H ∗ (L + l) + (H − H ∗ )l.
(7.1)
Proof. We will separately consider the necessary components, which imitate the true ones with positive singular values, and the redundant ones. For a necessary component, h ≤ H ∗ , equation A.7 implies that both µ ˆ a h and µ ˆ bh are of order O p (1). Then we find from equations A.4 and A.6 that both a h and bh are of order O p (n−1 ). Substituting the fact above into equation 6.14 leads to the first term of equation 7.1. For a redundant component, h > H ∗ , we find from the VB solution, equations A.51 to A.62, that the order of the estimators is as follows:
1128
S. Nakajima and S. Watanabe
1. When M > N, µ ˆ a h = O p (1), σˆ a2h = O p (1), µ ˆ bh = O p (n−1/2 ), σˆ b2h = O p (n−1 ). (7.2) 2. When M = N, ˆ bh = O p (n−1/4 ), σˆ b2h = O p (n−1/2 ). µ ˆ a h = O p (n−1/4 ), σˆ a2h = O p (n−1/2 ), µ (7.3) 3. When M < N, ˆ bh = O p (1), σˆ b2h = O p (1). (7.4) µ ˆ a h = O p (n−1/2 ), σˆ a2h = O p (n−1 ), µ Substituting equations 7.2 to 7.4 into equation 6.14 leads to the second term of equation 7.1. Note that the contribution of the necessary components involves the trivial redundancy (compare the first term of equation 7.1 with equation 2.4) because the independence between A and B prevents the VB posterior distribution from extending along the trivial degeneracy, which is a peculiarity of the LNNs. 7.2 Generalization Error. The generalization error, as well as the training error, of the VB approach can be derived in the same way as that of the SB approach, which results in a similar solution (Nakajima & Watanabe, 2005; 2006). (See section 9.1.) The existence of the singular value decomposition of any B ∗ A∗ allows us to assume without loss of generality that A∗ and B ∗ consist of its orthogonal row vectors and of its orthogonal column vectors, respectively. Then we find from lemma 1 that the KL divergence, equation 3.28, with a set of n training samples is given by
ˆ 2 (B ∗ A∗ − Bˆ A)x + O p (n−3/2 ) 2 q (x) H ˜ h (Xn , Yn ) + O p (n−3/2 ), = G
˜ n , Yn ) = G(X
h=1
(7.5)
where 1 ˜ h (Xn , Yn ) = tr (b h∗ a h∗t − bˆ h aˆ ht )t (b h∗ a h∗t − bˆ h aˆ ht ) G 2
(7.6)
is the contribution of the hth component. We denote by Wd (m, , ) the d-dimensional Wishart distribution with m degrees of freedom, scale matrix , and noncentrality matrix , and abbreviate as Wd (m, ) the central Wishart distribution. Remember that θ (·) is the indicator function, introduced in section 3.3.
Variational Bayes Solution and Generalization
1129
Theorem 3. The generalization error of an LNN in the VB approach can be asymptotically expanded as ˜ ˜ −1 + O(n-3/2 ), G(n) = λn where the generalization coefficient is given by 2λ˜ = (H ∗ (L + l) − H ∗2 ) +
H−H ∗ h=1
θ (γh2
L > L) 1 − 2 γh
2
γh2
.
(7.7)
q ({γh2 })
Here, γh2 is the hth largest eigenvalue of a random matrix subject to Wl−H ∗ (L − H ∗ , Il−H ∗ ), over which ·q ({γh2 }) denotes the expectation value. Proof. According to theorem 1, the difference between the VB and the ML estimators of a true component with a positive singular value is of the order O p (n−1 ). Furthermore, the generalization error of the ML estimator of the component is the same as that of the regular models because of its identifiability. Hence, from equation 2.4, we obtain the first term of equation 7.7 as the contribution of the first H ∗ components. On the other hand, we find from equation 6.6 and theorem 1 that for a redundant component, identifying RQ−1/2 with R affects the VB estimator only of order O p (n−1 ), which, hence, does not affect the generalization coefficient. We say that U is the general diagonalized matrix of an N × M matrix T if T has the following singular value decomposition: T = b Ua , where a and b are an M × M and an N × N orthogonal matrices, respectively. Let D be the general diagonalized matrix of R and D the (N − H ∗ ) × (M − H ∗ ) matrix created by removing the first H ∗ columns and rows from D. Then the first H ∗ diagonal elements of D correspond to the positive true singular value components, and D consists only of noises. Therefore, D is the general diagonalized matrix of n−1/2 R , where R is an (N − H ∗ ) × (M − H ∗ ) random matrix whose elements are independently subject to N1 (0, 1), so that R Rt is subject to W N−H ∗ (M − H ∗ , I N−H ∗ ). The redundant components imitate n−1/2 R . Hence, using theorem 1 and equation 7.6 and noting that the distribution of the (l − H ∗ ) largest eigenvalues, which are not trivially equal to zero, of a random matrix subject to W L−H ∗ (l − H ∗ , I L−H ∗ ) are equal to that of the eigenvalues of another random matrix subject to Wl−H ∗ (L − H ∗ , Il−H ∗ ), we obtain the second term of equation 7.7 as the contribution of the last (H − H ∗ ) components. Thus, we complete the proof of theorem 3. Although the second term of equation 7.7 is not a simple function, it can relatively easily be numerically calculated by creating samples subject to the Wishart distribution. Furthermore, the simpler function approximating the term can be derived in the large-scale limit when M, N, H, and H ∗ go
1130
S. Nakajima and S. Watanabe
to infinity in the same order, in a similar fashion to the analysis of the ML estimation (Fukumizu, 1999). We define the following scalars: α = (l − H ∗ )/(L − H ∗ ), ∗
∗
β = (H − H )/(l − H ), ∗
κ = L/(L − H ).
(7.8) (7.9) (7.10)
Let W be a random matrix subject to Wl−H ∗ (L − H ∗ , Il−H ∗ ), and {u1 , . . . , ul−H ∗ } the eigenvalues of (L − H ∗ )−1 W. The measure of the empirical distribution of the eigenvalues is defined by δ P = (l − H ∗ )−1 δ(u1 ) + δ(u2 ) + · · · + δ(ul−H ∗ ) , (7.11) where δ(u) denotes the Dirac measure at u. In the large-scale limit, the measure, equation 7.11, converges almost everywhere to (u − um )(u M − u) θ (um < u < u M )du, (7.12) p(u)du = 2παu √ √ where um = ( α − 1)2 and u M = ( α + 1)2 (Watcher, 1978). Let −1
(2πα)
J (ut ; k) =
∞
uk p(u)du
(7.13)
ut
be the kth order moment of the distribution, equation 7.12, where ut is the lower bound of the integration range. The second term of equation 7.7 consists of the terms proportional to the minus first, the zero, and the firstorder moments of the eigenvalues. Because only the eigenvalues greater than L among the largest (H − H ∗ ) eigenvalues contribute to the generalization error, the moments with the lower bound ut = max(κ, uβ ) should be calculated, where uβ is the β-percentile point of p(u), that is,
∞
β=
uβ
p(u)du = (2πα)−1 J (uβ ; 0).
√ Using the transform s = (u − (um + u M )/2) /(2 α), we can calculate the moments and thus obtain the following theorem: Theorem 4. is given by
The VB generalization coefficient of an LNN in the large-scale limit
2λ˜ ∼ (H ∗ (L + l) − H ∗2 ) +
(L − H ∗ )(l − H ∗ ) J (st ; 1) − 2κ J (st ; 0) + κ 2 J (st ; −1) , 2πα
(7.14)
Variational Bayes Solution and Generalization
1131
where J (s; 1) = 2α − s 1 − s 2 + cos–1 s , √ J (s; 0) = −2 α 1 − s 2 + (1 + α) cos–1 s − (1 − α) cos–1
√
α(1 + α)s + 2α , √ 2αs + α(1 + α)
J (s; −1) √ √ √ 1+α 1 − s2 α(1 + α)s + 2α −1 −1 cos − cos s + (0 < α < 1) √ 2 α √ 1−α 2 αs + 1 + α 2αs + α(1 + α) = , 2 1 − s − cos−1 s (α = 1) 1+s √ and st = max((κ − (1 + α))/2 α, J -1 (2παβ; 0)). Here J -1 (·; k) denotes the inverse function of J (s; k). In ordinary asymptotic analysis, one considers only situations when the amplitude of each component of the true model is zero or distinctly positive. Also theorem 3 holds only in such situations. However, as mentioned in section 5, it is important to consider the delicate situations when ∗ ∗ the true √ ∗map B A has tiny but nonnegligible singular values such that 0 < nγh < ∞. Theorem 1 still holds in such situations by replacing the second term of equation 6.17 with o p (n−1/2 ). We regard H ∗ as√the number of distinctly positive true singular values such that γh∗ −1 = o( n). Without loss of generality, we assume that B ∗ A∗ is a nonnegative, general diagonal matrix with its diagonal elements arranged in nonincreasing order. Let R∗ be the true submatrix created by removing the first H ∗ columns and rows from B ∗ A∗ . Then D , defined in the proof of theorem 3, is the general diagonalized matrix of n−1/2 R , where R is a random matrix such that R Rt is subject to W N−H ∗ (M − H ∗ , I N−H ∗ , nR∗ R∗t ). Therefore, we obtain the following theorem: Theorem 5. The VB generalization coefficient of an LNN in the general situations the true map B ∗ A∗ may have delicate singular values such that √ when ∗ 0 < nγh < ∞ is given by 2λ˜ = (H ∗ (L + l) − H ∗2 ) +
+
H−H ∗
H
nγh∗2
h=H ∗ +1
θ (γh2
h=1
−2 1 −
L γh2
> L)
L 1 − 2 γh
√ γh ωbth nR∗ ωah
2
γh2
, q (R )
(7.15)
1132
S. Nakajima and S. Watanabe
where γh , ωah , and ωbh are the hth largest singular value of R , the corresponding right singular vector, and the corresponding left singular vector, respectively, of which ·q (R ) denotes the expectation value over the distribution. 7.3 Training Error. Lemma 1 implies that the empirical KL divergence, equation 3.29, with a set of n training samples is given by 1 ∗ ∗ tr (B A − Bˆ Aˆ MLE )t (B ∗ A∗ − Bˆ Aˆ MLE ) 2 − ( Bˆ Aˆ − Bˆ Aˆ MLE )t ( Bˆ Aˆ − Bˆ Aˆ MLE ) + O p (n−3/2 ).
˜ n , Yn ) = − T(X
(7.16)
In the same way as the analysis of the generalization error, we obtain the following theorems. Theorem 6. The training error of an LNN in the VB approach can be asymptotically expanded as ˜ T(n) = ν˜ n−1 + O(n−3/2 ), where the training coefficient is given by 2˜ν = −(H ∗ (L + l) − H ∗2 ) H−H ∗
L L 2 1 + 2 γh2 θ (γh > L) 1 − 2 − γ γh h h=1
.
(7.17)
q ({γh2 })
Theorem 7. given by
The VB training coefficient of an LNN in the large-scale limit is
2ν˜ ∼ −(H ∗ (L + l) − H ∗2 ) −
(L − H ∗ )(l − H ∗ ) J (st ; 1) − κ 2 J (st ; −1) 2πα
(7.18)
Theorem 8. The VB training coefficient of an LNN in the general situations √ when the true map B ∗ A∗ may have delicate singular values such that 0 < nγh∗ < ∞ is given by 2ν˜ = −(H ∗ (L + l) − H ∗2 ) +
−
H−H ∗ h=1
H
nγh∗2
h=H ∗ +1
θ (γh2
L > L) 1 − 2 γh
L 1 + 2 γh
γh2
. q (R )
(7.19)
Variational Bayes Solution and Generalization
1133
2
VB Bayes Regular
2 lambda / K
1.5
1
0.5
0 0
5
10
15
20
25
30
35
40
H Figure 3: Free energy: L = 50, l = 30, H = 1, . . . , 30, and H ∗ = 0. The VB and the Bayes free energies almost coincide with each other.
8 Theoretical Values 8.1 Free Energy. Figure 3 shows the free energy coefficients of the LNNs where L = 50 and l = 30 on the assumption that the true rank is equal to zero, H ∗ = 0. The horizontal axis indicates the rank of the learner, H = 1, . . . , 30. The vertical axis indicates the coefficients normalized by the half of the essential parameter dimension K , given by equation 2.4. The lines correspond to the free energy coefficient of the VB approach, given by theorem 2, to that of the Bayes estimation, clarified in Aoyagi & Watanabe (2004), and to that of the regular models, respectively.3 Note that the lines of the VB approach and of the Bayes estimation almost coincide with each other in Figure 3. Figure 4, in which the VB free energy also well approximates the Bayes one, similarly shows the coefficients of LNNs where L = 80, l = 1, . . . , 80, indicated by the horizontal axis, and H = 1 on the assumption that H ∗ = 0. The following two figures show cases where the VB free energy does not well approximate. Figure 5 shows the coefficients of LNNs where L = l = 80, and H = 1, . . . , 80, indicated by the horizontal axis, on the assumption that H ∗ = 0, and Figure 6 shows the true rank, H ∗ = 1, . . . , 20, dependence of the coefficients of an LNN with L = 50, l = 30, and H = 20 units. Figures 7 to 9 show cases similar to Figure 4 but with H = 10, 20, 40, respectively. We conclude that the VB free energy behaves similarly to the 3 By “regular models,” we denote not LNNs but the regular models with the same parameter dimension, K , as the LNN specified by the horizontal axis.
1134
S. Nakajima and S. Watanabe
2
VB Bayes Regular
2 lambda / K
1.5
1
0.5
0 0
20
40
60 l
80
100
Figure 4: Free energy: L = 80, l = 1, . . . , 80, H = 1, and H ∗ = 0. The VB and the Bayes free energies coincide with each other, as well as in Figure 3.
2
VB Bayes Regular
2 lambda / K
1.5
1
0.5
0 0
20
40
60 H
80
100
Figure 5: Free energy: L = l = 80, H = 1, . . . , 80, and H ∗ = 0.
Bayes one in general and well approximates when l and L are not almost equal to each other or H l. In Figure 6, the VB free energy behaves strangely and poorly approximates the Bayes one when H ∗ is large, which can, however, be regarded as a peculiarity of the LNNs, which have trivial redundancy, rather than a property of the VB approach in general models. (See the last paragraph of section 7.1.)
Variational Bayes Solution and Generalization
1135
2
VB Bayes Regular
2 lambda / K
1.5
1
0.5
0 0
5
10
15 H*
20
25
30
Figure 6: Free energy: L = 50, l = 30, H = 20, and H ∗ = 1, . . . , 20.
2
VB Bayes Regular
2 lambda / K
1.5
1
0.5
0 0
20
40
60 l
80
100
Figure 7: Free energy: L = 80, l = 1, . . . , 80, H = 10, and H ∗ = 0.
8.2 Generalization Error and Training Error. Figures 10 to 13 show the generalization and the training coefficients on the identical conditions to Figures 3 to 6, respectively. The lines in the positive region correspond to the generalization coefficient of the VB approach, clarified in this letter, to that of the ML estimation, clarified in (Fukumizu, 1999), to that of the Bayes estimation, clarified in Aoyagi and Watanabe (2004)4 and to that of 4 The Bayes generalization coefficient is identical to the Bayes free energy coefficient, noted in the last paragraph of section 3.1.
1136
S. Nakajima and S. Watanabe
2
VB Bayes Regular
2 lambda / K
1.5
1
0.5
0 0
20
40
60 l
80
100
Figure 8: Free energy: L = 80, l = 1, . . . , 80, H = 20, and H ∗ = 0.
2
VB Bayes Regular
2 lambda / K
1.5
1
0.5
0 0
20
40
60 l
80
100
Figure 9: Free energy: L = 80, l = 1, . . . , 80, H = 40, and H ∗ = 0.
the regular models, respectively; the lines in the negative region correspond to the training coefficient of the VB approach, to that of the ML estimation, and to that of the regular models, respectively.5 Unfortunately the Bayes training error has not been clarified yet. The results in Figures 10 to 13 have been calculated in the large-scale approximation, that is, by using theorems
5 By “the regular models” we mean both “of the ML estimation and of the Bayes estimation in the regular models.”
Variational Bayes Solution and Generalization
1137
2 lambda / K
2
VB ML Bayes Regular
1.5 1 0.5 0
2 nu / K
−0.5 −1
−1.5 −2
0
5
10
15
20
25
30
35
40
H Figure 10: Generalization error (in the positive region) and training error (in the negative region): L = 50, l = 30, H = 1, . . . , 30, and H ∗ = 0.
2 lambda / K
2
VB ML Bayes Regular
1.5 1 0.5
2 nu / K
0 −0.5 −1 −1.5 −2
0
20
40
60 l
80
100
Figure 11: Generalization error and training error: L = 80, l = 1, . . . , 80, H = 1, and H ∗ = 0.
4 and 7. We have also numerically calculated them by using theorems 3 and 6 and thus found that both results almost coincide with each other so that we can hardly distinguish them. We see in Figures 10 to 13 that the VB approach provides good generalization performance comparable to, or in some cases better than, the Bayes estimation. We also see that the VB generalization coefficient significantly differs from the Bayes one even in the cases where the VB free energy well
1138
S. Nakajima and S. Watanabe
2 lambda / K
2
VB ML Bayes Regular
1.5 1 0.5 0
2 nu / K
−0.5 −1
−1.5 −2
0
20
40
60 H
80
100
Figure 12: Generalization error and training error: L = l = 80, H = 1, . . . , 80, and H ∗ = 0.
2 lambda / K
2
VB ML Bayes Regular
1.5 1 0.5 0
2 nu / K
− 0.5 −1
− 1.5 −2
0
5
10
15 H*
20
25
30
Figure 13: Generalization error and training error: L = 50, l = 30, H = 20, and H ∗ = 1, . . . , 20.
approximates the Bayes one. (Compare, for example, Figures 3 and 4 with the positive regions of Figures 10 and 11.) Furthermore, we see in Figures 5 and 12 that when H l, the VB approach provides much worse generalization performance than the Bayes estimation, while the VB free energy well approximates the Bayes one, and that when H ∼ l, the VB approach provides much better generalization performance, while the VB free energy is significantly larger than the Bayes one. This can throw doubt on model
Variational Bayes Solution and Generalization
1139
selection by minimizing the VB free energy to obtain better generalization performance. (See section 9.2.) We see in Figures 10 and 12 that the VB generalization coefficient depends on H, similar to the ML one. Moreover, we see in Figure 12 that when H = 1, the VB generalization coefficient exceeds that of the regular models, which the Bayes one never exceeds (Watanabe, 2001b). The reason is that because of the asymptotic equivalence to the shrinkage estimation, the VB approach would approximate the ML estimation and hence provide poor generalization performance in models where the ML estimator of a redundant component would √ go out of the effective range of shrinkage, that is, be much greater than L/n. When (l − H ∗ ) (H − H ∗ ), the (H − H ∗ ) largest eigenvalues of a random matrix subject to Wl−H ∗ (L − H ∗ , Il−H ∗ ) are much greater than L because of selection from a large number of random variables subject to the noncompact support distribution. Therefore, the eigenvalues {γh2 } in theorem 3 go out of the effective range of shrinkage. It can be said that the VB approach has not only a property similar to the Bayes estimation (i.e., suppression of overfitting by the JS-type shrinkage), but also another property similar to the ML estimation (acceleration of overfitting by selection of the largest singular values of a random matrix). Figure 13 shows that in this LNN, the VB approach provides no worse generalization performance than the Bayes estimation regardless of H ∗ . This, however, does not mean the domination of the VB approach over the Bayes estimation. The consideration of delicate situations in the following denies the domination. Using theorems 5 and 8, we can numerically calculate the VB, as well as the ML,6 generalization and training coefficients in the delicate situations when the true distribution is near the singularities. Figure 14 shows the coefficients of an LNN where L = 50, l = 30, and H = 20 on the assumption that the true map consists of H ∗ = 5 distinctly positive component, the 10 delicate components whose singular values are identical to each other, and the other five null components. The hor√ izontal axis indicates nγ ∗ , where γh∗ = γ ∗ for h = 6, . . . , 15. The Bayes generalization error in the delicate situations was clarified in Watanabe and Amari (2003), but unfortunately, only in single-output (SO) LNNs, that is, N = H = 1.7 Figure 15 shows the coefficients of an SOLNN with M = 5 input units on the assumption that H ∗ = 0 and the one true singular value, indicated by the horizontal axis, is delicate. We see that in some delicate
6 All of the theorems in this letter except theorem 2 can be modified for the ML estimation by letting L = 0. 7 An SOLNN is regarded as a regular model from the viewpoint of the ML estimation because the transform b 1 a 1 → w, where w ∈ R M , makes the model identifiable, and hence the ML generalization coefficient is identical to that of the regular models, as shown in Figure 15. Nevertheless, Figure 15 shows that an SOLNN has a property of unidentifiable models from the viewpoint of the Bayesian learning methods.
1140
S. Nakajima and S. Watanabe
2 lambda / K
2
VB ML Regular
1.5 1 0.5
2 nu / K
0 −0.5 −1 −1.5 −2
0
2
4
6
8 10 12 sqrt(n) gamma*
14
16
18
16
18
Figure 14: Delicate situations.
2 lambda / K
2
VB ML(=Regular) Bayes
1.5 1 0.5
2 nu / K
0 −0.5 −1 −1.5 −2
0
2
4
6
8 10 12 sqrt(n) gamma*
14
Figure 15: Delicate situations in a single-output LNN.
situations, the VB approach provides worse generalization performance than the Bayes estimation, although this is one of the cases in which the VB approach seemed to dominate the Bayes estimation, without consideration of delicate situations. Note that the ordinary asymptotic cases H ∗ = 5 √ and H ∗ =√15 (in Figure 13) correspond to the case that nγ ∗ = 0 and the case that nγ ∗ → ∞ in Figure 14, respectively. It is expected that, also in Figure 14, there must be a region where the VB approach provides greater generalization error than the Bayes estimation, because the Bayes estimation is admissible. (See section 9.3.) In other words, Figure 13 shows the
Variational Bayes Solution and Generalization
1141
generalization error variation when the number, H ∗ , of positive true components increases one by one, and, in that case, the VB approach never provides worse generalization performance for any H ∗ than the Bayes estimation; while Figure 14 shows the generalization error variation when the amplitudes of 10 true components simultaneously increase, and, in that case, the VB approach is expected to provide worse generalization performance in some region. 9 Discussion 9.1 Relation to Subspace Bayes Approach. Interestingly, the solution of the VB approach is similar to that of the subspace Bayes (SB) approach (Nakajima & Watanabe, 2005b, 2006). The VB solution is actually asymptotically equivalent to that of the MIP version of SB approach, the empirical Bayes (EB) approach where the output parameter matrix is regarded as a hyperparameter, when M ≥ N, and the MOP version, the EB approach where the input parameter matrix is regarded as a hyperparameter, when M ≤ N. The form of the posterior distribution of the parameters corresponding to the redundant components implies the reason of the equivalence. The SB posterior distribution extends with its variance of order O p (1) in the space of the parameters, that is, the ones not regarded as hyperparameters; while the VB posterior distribution extends with its variance of order O p (1) in the larger-dimension parameter space, either the input one or the output one, as we can find in the proof of theorem 2. So the similarity between the solutions is natural, and it can be said that the VB approach automatically selects the larger-dimension space as the marginalized space and that the dimension of the space in which the singularities allow the posterior distribution to extend is essentially related to the degree of shrinkage. Figure 16 illustrates the posterior distribution of the MIP version of the SB approach, and that of the VB approach of a redundant component when M > N.8 The relation between the EB approach and the JS estimation was discussed in Efron and Morris (1973), as mentioned in section 1. In this letter, we have discovered the relation among the VB approach, the EB approach, and the JS-type shrinkage estimation. 9.2 Model Selection in VB Approach. In the framework of the Bayes estimation, the marginal likelihood, equation 3.1 can be regarded as the likelihood of the ensemble of the models. Following the concept of likelihood in statistics, we can accept that the ensemble of models with the maximum marginal likelihood is the most likely. Thus, the model selection method minimizing the free energy, equation 3.4, was proposed (Efron & Morris,
8 Note that the SB posterior distribution is the delta function with respect to b , so it h has infinite values in reality.
1142
S. Nakajima and S. Watanabe
O p (1)
0
ah
O p (1)
(
O p n −1/ 2
)
bh
Figure 16: SB (MIP) posterior (left) and VB posterior (right) for H ∗ < h ≤ H when M > N.
1973; Schwarz, 1978; Akaike, 1980; MacKay, 1992). In the VB approach, the method where the model with the minimum VB free energy is selected was proposed (Attias, 1999). Although it is known that the model with the minimum free energy does not necessarily provide the minimum generalization error even in the Bayes estimation and even in regular models, we usually expect that the selected model would also provide a small generalization error. However, we have shown in this letter that the free energy and the generalization error are less simply related with each other in the VB approach, unlike in the Bayes estimation. This can imply that the increase of the free energy caused by overfitting with redundant components less accurately indicates that of the generalization error in the VB approach than in the Bayes estimation, and throw a doubt on this model selection method. 9.3 Consistency with Superiority of Bayes Estimation. The Bayes estimation is said to be superior to any other learning method in a certain meaning. Consider the case in which we apply a learning method to many applications, where the true parameter w ∗ is different in each application and subject to q (w ∗ ). Note that the true distribution in each application is written as q (y|x) = p(y|x, w ∗ ). Although we have been abbreviating it (except in section 3.3), the generalization error, equation 3.7, naturally depends on the true distribution, so we here denote the dependence explicitly as follows: G(n; w ∗ ). Let ¯ G(n) = G(n; w ∗ )q (w∗ ) be the average generalization error over q (w ∗ ).
(9.1)
Variational Bayes Solution and Generalization
1143
Proposition 1. When we know and use the true prior distribution, q (w ∗ ), the Bayes estimation minimizes the average generalization error over q (w ∗ ), that is, ¯ ¯ Other (n), G(n) ≤G
(9.2)
¯ Other (n) denotes the average generalization error of an arbitrary learning where G method. However, we consider the situations where we need model selection, that is, we do not know even the true dimension of the parameter space. Accordingly, it can happen that an approximation method of the Bayes estimation provides better generalization performance than the Bayes estimation, as shown in this letter. On the other hand, proposition 1 naturally leads to the admissibility of the Bayes estimation: there is no learning method dominating the Bayes estimation. The analysis of the delicate situation has consistently denied the domination of the VB approach over the Bayes estimation in the last paragraph of section 8.2. 9.4 Features of VB Approach. To summarize, we can say that the VB approach has the two aspects caused by the ML-type effect and the Bayesian effect of the singularities, itemized in section 4. In LNNs, the ML-type effect is observed as the acceleration of overfitting by selection of the largest singular values of a random matrix, while the Bayesian effect is observed as the JS-type shrinkage. We have also shown in Figures 11 and 12 that one of the most preferable properties of the Bayes estimation, 2λ ≤ K , does not hold in the VB approach, that is, 2λ˜ can exceed the parameter dimension, K —hence the generalization coefficient of the regular models. We conjecture that the two aspects are essentially caused by the localization of the posterior distribution, which we can see in Figure 1. Because of the independence between the parameters of different layers, the posterior distribution cannot extend so as to fill uniformly the narrow low-energy (high-likelihood) region around the singularities, which changes the strength of the Bayesian effect, sometimes increasing and sometimes decreasing. The restriction on the VB posterior distribution affects the generalization performance. However, the variation of the restriction is limited if we want to obtain a simple iterative algorithm. The independence between the parameters of different layers is especially indispensable for simplifying the calculation. Actually, the restriction applied in models with hidden variables, such as mixture models and hidden Markov models, is very similar to the one applied in LNNs in this letter. In Attias (1999), the VB posterior distribution is restricted such that the parameters and the hidden variables are independent of each other. That restriction results in the independence between the parameters of different layers and consequently leads to the localization of the posterior distribution. In fact, the
1144
S. Nakajima and S. Watanabe
order of the extent of the VB posterior distribution in mixture models when we use the prior distribution having positive values on the singularities has recently been derived as similar to that in LNNs (Watanabe & Watanabe, 2005). So we expect that the two aspects of the VB approach would hold in more general unidentifiable models. In addition, in models with hidden variables, the other restriction that we applied in LNNs (the independence between the parameters of different components) is also guaranteed when we introduce the hidden variables and substitute the likelihood of the complete data for the original marginal likelihood, in the same way as the derivation of the EM algorithm (Dempster et al., 1977; Attias, 1999). Theorem 1 says that increasing the smaller dimension, l, does not increase the degree of shrinkage unless L = l, but theorem 3 says that it increases the number of the random variables (i.e., the number of the eigenvalues of a random matrix subject to Wl−H ∗ (L − H ∗ , Il−H ∗) among which the redundant components of the VB estimator choose. Therefore, we can say that the VB generalization error is very small when the difference between the parameter space dimensions of the different layers is very large, that is, l/L 1. So we conjecture that the advantage of the VB approach over the EM algorithm would be enhanced in mixture models, where the dimension of the upper-layer parameters (the number of the mixing coefficients) is one per component and that of the lower-layer parameters (the number of the parameters that each component has) is usually more. However, in general, nonlinearity increases the ML-type effect of singularities. In (general) neural networks, equation 2.1, for example, we expect that the nonlinearity of the activation function, ψ(·), would extend the range of basis selection and hence increase the generalization error. Because an LNN can be embedded in a finite-dimensional regular model, the ML-type effect of the singularities is relatively soft, and its ML estimator is in a finite region. On the other hand, the ML estimator can diverge in some unidentifiable models. An important question to be solved is how the VB approach behaves in models where the ML estimator diverges because of the MLtype effect, that is, whether the Bayesian shrinkage effect leashes the VB estimator in a finite region. As mentioned in section 1, the VB free energies in some unidentifiable models have been clarified (Watanabe & Watanabe, 2005; Hosino et al., 2005; Nakano & Watanabe, 2005) and shown to behave similarly to the VB free energy in LNNs, except for the peculiarity of LNNs caused by the trivial redundancy—the strange H ∗ dependence discussed in the last paragraph of section 8.1. The fact shown in this letter that in LNNs, the VB free energy and the VB generalization error are less simply related to each other, implies that, also in other models, we cannot expect the VB approach to provide similar generalization performance to the Bayes estimation, even if it provides similar free energy.
Variational Bayes Solution and Generalization
1145
10 Conclusions and Future Work We have proved that in three-layer linear neural networks, the variational Bayes (VB) approach is asymptotically equivalent to a positive-part JamesStein–type shrinkage estimation, and theoretically clarified its free energy, generalization error, and training error. We have thus concluded that the VB approach has not only a property similar to the Bayes estimation but also another property similar to the maximum likelihood estimation. We have also shown that unlike in the Bayes estimation, the free energy and the generalization error are less simply related to each other, and we discussed the relation between the VB approach and the subspace Bayes (SB) approach. Analysis of the VB generalization performance of other unidentifiable models is future work. Consideration of the relation between the VB approach and the SB approach in other models is also future work. We think that any similarity between the approaches may hold not only in LNNs. Appendix A: Proof of Theorem 1 Both tr(a h ) and tr(bh ) are of no greater order than 1 because of the prior distributions, equations 6.2 and 6.3. Both are also not equal to zero; otherwise, the generalized free energy, equation 6.14, diverges to infinity with finite n. Consequently, the generalized free energy diverges to infinity when either µa h or µbh goes to infinity. Hence, the optimum values of µa h , µbh , a h , and bh necessarily satisfy the variational condition, equations 6.10 to 6.13. Combining equations 6.10 and 6.12, we have ˆ a h Rt ˆ bh Rµ µ ˆ a h = n2 ˆ ah ,
(A.1)
ˆ bh R ˆ a h Rt µ ˆ bh , µ ˆ bh = n
(A.2)
2
ˆ bh are an eigenvector of Rt R and an eigenand thus find that µ ˆ a h and µ t ˆ a h R , respectively, or µ vector of R ˆ a h = µ ˆ bh = 0. Hereafter, separately considering the necessary components, which imitate the true ones with positive singular values, and the redundant components, we will find the solution of the variational condition that minimizes the generalized free energy, equation 6.14. For a necessary component, h ≤ H ∗ , the (observed) singular value of R is of order O p (1). Hence, the free energy, equation 6.14, can be minimized only when both µ ˆ a h and µ ˆ bh are of order O p (1). Therefore, the variational condition, equations 6.10 to 6.13, is approximated as follows: µ ˆ a h = µ ˆ bh −2 Q−1 Rt µ ˆ bh + O p (n−1 ), −1
ˆ a h = n µ ˆ bh
−2
−1
Q
−2
+ O p (n ),
(A.3) (A.4)
1146
S. Nakajima and S. Watanabe
µ ˆ bh = (µ ˆ at h Qµ ˆ a h )−1 Rµ ˆ a h + O p (n−1 ),
(A.5)
ˆ bh = n−1 (µ ˆ at h Qµ ˆ a h )−1 I N + O p (n−2 ).
(A.6)
Thus, we obtain the VB estimator of the necessary component: µ ˆ bh µ ˆ at h = ωbh ωbt h RQ−1 + O p (n−1 ).
(A.7)
On the other hand, for a redundant component, h > H ∗ , equations 6.6, 6.15, and 6.16 allow us to approximate the variational condition, equations 6.10 to 6.13, as follows: µ ˆ a h = nσˆ a2h Rt µ ˆ bh (1 + O p (n−1/2 )),
(A.8)
σˆ a2h = (n( µ ˆ bh 2 + Nσˆ b2h ) + c a−2 )−1 ,
(A.9)
µ ˆ bh = nσˆ b2h Rµ ˆ a h (1 + O p (n−1/2 )),
(A.10)
σˆ b2h = (n( µ ˆ a h 2 + Mσˆ a2h ) + c b−2 )−1 .
(A.11)
Combining equations A.9 and A.11, we have
σˆ a2h =
σˆ b2h =
−(nˆηh2 − (M − N)) +
(nˆηh2 + M + N)2 − 4MN
2nM( µ ˆ bh 2 + n−1 c a−2 ) −(nˆηh2 + (M − N)) + (nˆηh2 + M + N)2 − 4MN 2nN( µ ˆ a h 2 + n−1 c b−2 )
,
(A.12)
,
(A.13)
where
ηˆ h2
c −2 = µ ˆ ah + b n 2
µ ˆ b h 2 +
c a−2 n
.
(A.14)
We consider two possible solutions: ˆ bh = 0: Obviously, the following equations sat1. Such that µ ˆ a h = µ isfy equations A.8 and A.10: µ ˆ a h = 0,
(A.15)
µ ˆ bh = 0.
(A.16)
Substituting equations A.15 and A.16 into equations A.12 and A.13, we easily find the following solution:
Variational Bayes Solution and Generalization
(a) When M > N: M− N σˆ a2h = + O p (n−1 ), Mc a−2 σˆ b2h =
c a−2 + O p (n−2 ). n(M − N)
(b) When M = N: ca + O p (n−1 ), σˆ a2h = √ c b nM cb + O p (n−1 ). σˆ b2h = √ c a nM
1147
(A.17) (A.18)
(A.19) (A.20)
(c) When M < N: σˆ a2h = σˆ b2h =
c b−2 + O p (n−2 ), n(N − M) N−M Nc b−2
+ O p (n−1 ).
(A.21) (A.22)
(2) Such that µ ˆ a h , µ ˆ bh > 0. Combining equations A.8 and A.10, we find that µ ˆ a h and µ ˆ bh are the right and the corresponding left singular vectors of R, respectively. Let hth largest singular value component of R correspond to the hth component of the estimator. Then we have µ ˆ a h = µ ˆ a h ωa h ,
(A.23)
µ ˆ b h = µ ˆ bh ωbh .
(A.24)
Substituting equations A.12, A.13, A.23, and A.24 into equations A.8 and A.10, we have 2MN 2 = nˆ η + M + N − (nˆηh2 + M + N)2 − 4MN + O p (n−1/2 ), h nγh2 (A.25) (Mc a−2 δˆ h − Nc b−2 δˆ h−1 )(1 + O p (n−1/2 )) = n(M − N)(γh − γˆh ),
(A.26)
where γˆh = µ ˆ a h µ ˆ bh ,
(A.27)
µ ˆ ah . µ ˆ bh
(A.28)
δˆ h =
Equation A.25 implies that ξ = (nγh2 )−1 is a solution of the following equation: MNξ 2 − (nˆηh2 + M + N)ξ + 1 = O p (n−1/2 ). (A.29)
1148
S. Nakajima and S. Watanabe
Thus, we find that equation A.25 has the solution below when and only when nγh2 ≥ L:
N M 1 − γh2 + O p (n−3/2 ). (A.30) ηˆ h2 = 1 − nγh2 nγh2 By using equations A.27 and A.28, the definition of ηˆ h , equation A.14, can be written as follows: c b−2 c a−2 2 (A.31) ηˆ h = 1 + 1+ γˆh2 , nγˆh δˆ h nγˆh δˆ h−1 Hereafter, we will find the simultaneous solution of equations A.26, A.30, and A.31 in order to obtain the solution that exists only when nγh2 ≥ L. Equations A.30 and A.31 imply that γh2 > ηˆ h2 > γˆh2 and that (γh2 − γˆh2 ) is of order O p (n−1 ). We then find that (γh − γˆh ) is of order O p (n−1/2 ) because γh is of order O p (n−1/2 ). (a) When M > N: In this case, equation A.26 can be approximated as follows: n(M − N)(γh − γˆh ) δˆ h = + O p (n−1/2 ). (A.32) Mc a−2 Substituting equations A.30 and A.32 into equation A.31, we have
N (M − N)(γh − γˆh ) M 1 − γh2 = O p (n−2 ), γˆh2 + γˆh − 1 − M nγh2 nγh2 (A.33) and hence
M M N + 1 − γ γ ˆ γh = O p (n−2 ). γˆh − 1 − h h N nγh2 nγh2 (A.34) Therefore, we obtain the following solution with respect to γˆh and δˆ h :
M γˆh = 1 − γh + O p (n−3/2 ), (A.35) nγh2 (M − N)
+ O p (n−1/2 ). c a−2 γh We thus obtain the following solution:
M M − N 1/2 1− ωa h + O p (n−1 ), µ ˆ ah = nγh2 c a−2 δˆ h =
σˆ a2h =
M− N c a−2 nγh2
+ O p (n−1 ),
(A.36)
(A.37) (A.38)
Variational Bayes Solution and Generalization
µ ˆ bh = σˆ b2h =
1−
M nγh2
c a−2 M− N
1/2
1149
γh ωbh + O p (n−1 ),
c a−2 + O p (n−2 ). n(M − N)
(A.39) (A.40)
(b) When M = N: In this case, we find from equation A.26 that ca δˆ h = . (A.41) cb Substituting equation A.41 and then equation A.30 into equation A.31, we have γˆh = ηˆ h + O p (n−1 )
M γh + O p (n−1 ). = 1− nγh2 We thus obtain the following solution:
1/2 ca M ωa h + O p (n−1 ), µ ˆ ah = 1− γh cb nγh2 ca + O p (n−1 ), σˆ a2h = c b nγh
1/2 cb M ωbh + O p (n−1 ), µ ˆ bh = 1− γh ca nγh2 cb + O p (n−1 ). σˆ b2h = c a nγh
(A.42)
(A.43) (A.44) (A.45) (A.46)
(c) When M < N: In exactly the same way as the case when M > N, we obtain the following solution: 1/2 c b−2 N 1− γh ωa h + O p (n−1 ), (A.47) µ ˆ ah = nγh2 N − M c b−2 + O p (n−2 ), n(N − M) 1/2 N N−M 1− ωbh + O p (n−1 ), µ ˆ bh = nγh2 c b−2 σˆ a2h =
σˆ b2h =
N−M c b−2 nγh2
+ O p (n−1 ).
(A.48)
(A.49) (A.50)
ˆ bh > 0, equaWe can find that, when the solution such that µ ˆ a h , µ tions A.37 to A.40 and A.43 to A.50, exists, it makes the VB free energy,
1150
S. Nakajima and S. Watanabe
equation 6.14, smaller than the solution such that µ ˆ a h = µ ˆ bh = 0, equations A.15 to A.22. Hence, we arrive at the following solution: 1. When M > N,
L L − l 1/2 µ ˆ ah = 1− ωa h + O p (n−1 ), Lh c a−2 L −l + O p (n−1 ), c a−2 L h
−2 1/2 ca L µ ˆ bh = 1− γh ωbh + O p (n−1 ), Lh L − l σˆ a2h =
σˆ b2h =
c a−2 + O p (n−2 ). n(L − l)
µ ˆ ah =
L 1− Lh
c b−2 L −l
σˆ b2h =
L −l c b−2 L h
+ O p (n−1 ).
(A.53)
(A.55) (A.56)
(A.57) (A.58)
1/2 γh ωa h + O p (n−1 ),
c b−2 + O p (n−2 ), n(L − l) 1/2 L L −l µ ˆ bh = 1− ωbh + O p (n−1 ), Lh c b−2 σˆ a2h =
(A.52)
(A.54)
2. When M = N,
1/2 ca L ωa h + O p (n−1 ), 1 − γh µ ˆ ah = cb Lh ca σˆ a2h = + O p (n−1 ), c b nL h 1/2
L cb 1 − γh ωbh + O p (n−1 ), µ ˆ bh = ca Lh cb σˆ b2h = + O p (n−1 ). c a nL h 3. When M < N,
(A.51)
(A.59)
(A.60)
(A.61) (A.62)
Selecting the largest singular value components minimizes the free energy, equation 6.14. Hence, combining equation A.7 with the fact that L L −1 = O p (n−1 ) for the necessary components, and the solution above h with equation 6.6, we obtain the VB estimator in theorem 1.
Variational Bayes Solution and Generalization
1151
Acknowledgments We thank the anonymous reviewers for the meaningful comments that polished this letter. We also thank Kazuo Ushida, Masahiro Nei, and Nobutaka Magome of Nikon Corporation for encouragement to research this subject. References Akaike, H. (1974). A new look at statistical model. IEEE Trans. on Automatic Control, 19(6), 716–723. Akaike, H. (1980). Likelihood and Bayes procedure. In J. M. Bernald (Ed.), Bayesian statistics (pp. 143–166). Valencia, Spain: University Press. Amari, S., Park, H., & Ozeki, T. (2006). Singularities affect dynamics of learning in neuromanifolds. Neural Computation, 18, 1007–1065. Aoyagi, M., & Watanabe, S. (2004). Stochastic complerities of reduced rank regression in Bayesian estimation. In Neural Networks, 18, 924–933. Attias, H. (1999). Inferring parameters and structure of latent variable models by variational Bayes. In Proc. of the Conference on Uncertainty in Artificial Intelligence. San Francisco: Morgan Kaufmann. Baldi, P. F., & Hornik, K. (1995). Learning in linear neural networks: A survey. IEEE Trans. on Neural Networks, 6(4), 837–858. Bickel, P., & Chernoff, H. (1993). Asymptotic distribution of the likelihood ratio statistic in a prototypical non regular problem. In J. K. Ghosh, S. K. Mitra, K. R. Parthasarathy, & B. L. S. Prakasa Rao (Eds.), Statistics and Probability: A Raghu Raj Bahadur Festschrift (pp. 83–96). New York: Wiley. Cramer, H. (1949). Mathematical methods of statistics. Princeton, NJ: Princeton University Press. Dacunha-Castelle, D., & Gassiat, E. (1997). Testing in locally conic models, and application to mixture models. Probability and Statistics, 1, 285–317. Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood for incomplete data via the EM algorithm. J. R. Statistical Society, 39B, 1–38. Efron, B., & Morris, C. (1973). Stein’s estimation rule and its competitors—an empirical Bayes approach. J. Am. Stat. Assoc., 68, 117–130. Fukumizu, K. (1999). Generalization error of linear neural networks in unidentifiable cases. In Proceedings of the Tenth International Conference, Algorithmic Learning Theory. (pp. 51–62). New York: Springer. Fukumizu, K. (2003). Likelihood ratio of unidentifiable models and multilayer neural networks. Annals of Statistics, 31(3), 833–851. Ghahramani, Z., & Beal, M. J. (2001). Graphical models and variational methods. In M. Opper & D. Saad (Eds.), Advanced mean field methods (pp. 161–177). Cambridge, MA: MIT Press. Hagiwara, K. (2002). On the problem in model selection of neural network regression in overrealizable scenario. Neural Computation, 14, 1979–2002. Hartigan, J. A. (1985). A failure of likelihood ratio asymptotics for normal mixtures. In Proc. of the Berkeley Conference in Honor of J. Neyman and J. Kiefer (pp. 807–810). Monterey, CA: Wadsworth.
Received October 7, 2005; accepted July 27, 2006.
LETTER
Communicated by Léon Bottou
Training a Support Vector Machine in the Primal

Olivier Chapelle∗
[email protected]
Max Planck Institute for Biological Cybernetics, 72076 Tuebingen, Germany
Most literature on support vector machines (SVMs) concentrates on the dual optimization problem. In this letter, we point out that the primal problem can also be solved efficiently for both linear and nonlinear SVMs and that there is no reason for ignoring this possibility. On the contrary, from the primal point of view, new families of algorithms for large-scale SVM training can be investigated.

1 Introduction

The vast majority of textbooks and articles introducing support vector machines (SVMs) first state the primal optimization problem, and then go directly to the dual formulation (Vapnik, 1998; Burges, 1998; Cristianini & Shawe-Taylor, 2000; Schölkopf & Smola, 2002). A reader could easily obtain the impression that this is the only possible way to train an SVM. In this letter, we reveal this as being a misconception and show that someone unaware of duality theory could train an SVM. Primal optimizations of linear SVMs have already been studied by Keerthi and DeCoste (2005) and Mangasarian (2002). One of the main contributions of this letter is to complement those studies to include the nonlinear case.¹ Our goal is not to claim that primal optimization is better than dual, but merely to show that they are two equivalent ways of reaching the same result. Also, we will show that when the goal is to find an approximate solution, primal optimization is superior.

Given a training set {(x_i, y_i)}_{1≤i≤n}, x_i ∈ R^d, y_i ∈ {+1, −1}, recall that the primal SVM optimization problem is usually written as

    min_{w,b} ||w||² + C Σ_{i=1}^n ξ_i^p   under constraints   y_i(w · x_i + b) ≥ 1 − ξ_i,  ξ_i ≥ 0,    (1.1)
∗ Now at Yahoo! Research. Email: [email protected]
¹ Primal optimization of nonlinear SVMs has also been proposed in Lee and Mangasarian (2001b, section 4), but with a different regularizer.
Neural Computation 19, 1155–1178 (2007)
C 2007 Massachusetts Institute of Technology
where p is either 1 (hinge loss) or 2 (quadratic loss). At this point, in the literature there are usually two main reasons mentioned for solving this problem in the dual:

1. The duality theory provides a convenient way to deal with the constraints.
2. The dual optimization problem can be written in terms of dot products, thereby making it possible to use kernel functions.

We will demonstrate in section 3 that those two reasons are not a limitation for solving the problem in the primal, mainly by writing the optimization problem as an unconstrained one and by using the representer theorem. In section 4, we will see that performing a Newton optimization in the primal yields exactly the same computational complexity as optimizing the dual; that will be validated experimentally in section 5. Finally, possible advantages of a primal optimization are presented in section 6. We will start with some general discussion about primal and dual optimization.

2 Links Between Primal and Dual Optimization

Primal and dual optimization have strong connections, and we illustrate some of them through the example of regularized least squares (RLS). Given a matrix X ∈ R^{n×d} representing the coordinates of n points in d dimensions and a target vector y ∈ R^n, the primal RLS problem can be written as

    min_{w∈R^d} λ w^T w + ||Xw − y||²,    (2.1)

where λ is the regularization parameter. This objective function is minimized for w = (X^T X + λI)^{-1} X^T y, and its minimum is

    y^T y − y^T X (X^T X + λI)^{-1} X^T y.    (2.2)

Introducing slack variables ξ = Xw − y, the dual optimization problem is

    max_{α∈R^n} 2 α^T y − (1/λ) α^T (X X^T + λI) α.    (2.3)

The dual is maximized for α = λ (X X^T + λI)^{-1} y, and its maximum is

    λ y^T (X X^T + λI)^{-1} y.    (2.4)
The primal solution is then given by the KKT condition,

    w = (1/λ) X^T α.    (2.5)

Now we relate the inverses of X X^T + λI and X^T X + λI thanks to the Woodbury formula (Golub & Van Loan, 1996),

    λ (X X^T + λI)^{-1} = I − X (λI + X^T X)^{-1} X^T.    (2.6)
With this equality, we recover that the primal, equation 2.2, and dual, equation 2.4, optimal values are the same: the duality gap is zero.

Let us now analyze the computational complexity of primal and dual optimization. The primal requires the computation and inversion of the matrix (X^T X + λI), which is O(nd² + d³). The dual deals with the matrix (X X^T + λI), which requires O(dn² + n³) operations to compute and invert. It is often argued that one should solve either the primal or the dual optimization problem depending on whether n is larger or smaller than d, resulting in an O(max(n, d) min(n, d)²) complexity. But this argument does not really hold because one can always use equation 2.6 in case the matrix to invert is too big. So for both primal and dual optimization, the complexity is O(max(n, d) min(n, d)²).

The difference between primal and dual optimization comes when computing approximate solutions. Let us optimize both the primal, equation 2.1, and dual, equation 2.3, objective functions by conjugate gradient and see how the primal objective function decreases as a function of the number of conjugate gradient steps. For the dual optimization, an approximate dual solution is converted to an approximate primal one by using the KKT condition, equation 2.5. Intuitively, the primal optimization should be superior because it directly minimizes the quantity we are interested in. Figure 1 confirms this intuition. In some cases, there is no difference between primal and dual optimization (left), but in other cases, the dual optimization can be much slower to converge (right). In appendix A, we try to analyze this phenomenon by looking at the primal objective value after one conjugate gradient step. We show that the primal optimization always yields a lower value than the dual optimization, and we quantify the difference.

The conclusion from this analysis is that even though primal and dual optimization are equivalent, in terms of both the solution and time complexity, when it comes to approximate solutions, primal optimization is superior because it is more focused on minimizing what we are interested in: the primal objective function.
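As a concrete illustration of this comparison, the following sketch (ours, in Python with NumPy rather than the letter's Matlab; the data generation loosely follows the caption of Figure 1) solves RLS exactly in both forms, checks that the Woodbury identity, equation 2.6, makes the two optima coincide, and reports the primal suboptimality after a few conjugate gradient steps on each problem.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 100, 10, 1.0                      # like the right panel of Figure 1
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# Exact optima: primal (2.2) and dual (2.4) agree, so the duality gap is zero.
p_opt = y @ y - y @ X @ np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
d_opt = lam * y @ np.linalg.solve(X @ X.T + lam * np.eye(n), y)
assert np.isclose(p_opt, d_opt)

def primal(w):
    return lam * w @ w + np.sum((X @ w - y) ** 2)

def cg(A, b, steps):
    # Plain conjugate gradient for the SPD system A x = b, started at x = 0.
    x, r = np.zeros_like(b), b.copy()
    p = r.copy()
    for _ in range(steps):
        Ap = A @ p
        a = (r @ r) / (p @ Ap)
        x, r_new = x + a * p, r - a * Ap
        p = r_new + ((r_new @ r_new) / (r @ r)) * p
        r = r_new
    return x

for steps in (1, 3, 5):
    w_p = cg(X.T @ X + lam * np.eye(d), X.T @ y, steps)       # primal CG on (2.1)
    alpha = cg(X @ X.T + lam * np.eye(n), lam * y, steps)     # dual CG on (2.3)
    w_d = X.T @ alpha / lam                                   # KKT condition (2.5)
    print(steps, primal(w_p) - p_opt, primal(w_d) - p_opt)
```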
Figure 1: Plots of the primal suboptimality, equations 2.1 and 2.2, for primal and dual optimization by conjugate gradient (pcg in Matlab). n points are drawn randomly from a spherical gaussian distribution in d dimensions. The targets are also randomly generated according to a gaussian distribution. λ is fixed to 1. (Left) n = 10 and d = 100. (Right) n = 100 and d = 10.
Figure 2: SVM loss function, L(y, t) = max(0, 1 − yt)^p for p = 1 and 2.
3 Primal Objective Function

Coming back to support vector machines, let us rewrite equation 1.1 as an unconstrained optimization problem:

    ||w||² + C Σ_{i=1}^n L(y_i, w · x_i + b),    (3.1)

with L(y, t) = max(0, 1 − yt)^p (see Figure 2). More generally, L could be any loss function.
Let us now consider nonlinear SVMs with a kernel function k and an associated reproducing kernel Hilbert space H. The optimization problem, equation 3.1, becomes

    min_{f∈H} λ ||f||²_H + Σ_{i=1}^n L(y_i, f(x_i)),    (3.2)
where we have made a change of variable by introducing the regularization parameter λ = 1/C. We have also dropped the offset b for simplicity. However, all the algebra presented below can be extended easily to take it into account (see appendix B). Suppose now that the loss function L is differentiable with respect to its second argument. Using the reproducing property f(x_i) = ⟨f, k(x_i, ·)⟩_H, we can differentiate equation 3.2 with respect to f, and at the optimal solution f*, the gradient vanishes, yielding

    2λ f* + Σ_{i=1}^n (∂L/∂t)(y_i, f*(x_i)) k(x_i, ·) = 0,    (3.3)
where ∂L/∂t is the partial derivative of L(y, t) with respect to its second argument. This implies that the optimal function can be written as a linear combination of kernel functions evaluated at the training samples. This result is also known as the representer theorem (Kimeldorf & Wahba, 1970). Thus, we seek a solution of the form

    f(x) = Σ_{i=1}^n β_i k(x_i, x).    (3.4)
We denote those coefficients β_i and not α_i, as in the standard SVM literature, to stress that they should not be interpreted as Lagrange multipliers. Let us express equation 3.2 in terms of the β_i:

    λ Σ_{i,j=1}^n β_i β_j k(x_i, x_j) + Σ_{i=1}^n L(y_i, Σ_{j=1}^n k(x_i, x_j) β_j),    (3.5)

where we used the kernel reproducing property in ||f||²_H = Σ_{i,j=1}^n β_i β_j ⟨k(x_i, ·), k(x_j, ·)⟩_H = Σ_{i,j=1}^n β_i β_j k(x_i, x_j). Introducing the kernel matrix K with K_{ij} = k(x_i, x_j) and K_i the ith column of K, equation 3.5 can be rewritten as

    λ β^T K β + Σ_{i=1}^n L(y_i, K_i^T β).    (3.6)
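For concreteness, the objective 3.6 takes only a few lines to evaluate; a minimal sketch (ours, in Python, assuming the loss L(y, t) = max(0, 1 − yt)^p used below):

```python
import numpy as np

def primal_objective(K, Y, beta, lam, p=2):
    # Equation 3.6: lam * beta' K beta + sum_i L(y_i, K_i' beta),
    # here with L(y, t) = max(0, 1 - y t)^p.
    out = K @ beta                      # outputs f(x_i) on the training set
    return lam * beta @ (K @ beta) + np.sum(np.maximum(0.0, 1.0 - Y * out) ** p)
```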
As long as L is differentiable, we can optimize equation 3.6 by gradient descent. Note that this is an unconstrained optimization problem.

4 Newton Optimization

The unconstrained objective function, equation 3.6, can be minimized using a variety of optimization techniques such as conjugate gradient. Here we will consider only Newton optimization, as the similarities with dual optimization will then appear clearly. We will focus on two loss functions: the quadratic penalization of the training errors (see Figure 2) and a differentiable approximation to the linear penalization, the Huber loss.

4.1 Quadratic Loss. Let us start with the easier case: the L₂ penalization of the training errors, L(y_i, f(x_i)) = max(0, 1 − y_i f(x_i))². For a given value of the vector β, we say that a point x_i is a support vector if y_i f(x_i) < 1, that is, if the loss on this point is nonzero. Note that this definition of support vector is different from β_i ≠ 0.² Let us reorder the training points such that the first n_sv points are support vectors. Finally, let I⁰ be the n × n diagonal matrix with the first n_sv entries being 1 and the others 0:

    I⁰ ≡ diag(1, . . . , 1, 0, . . . , 0).

The gradient of equation 3.6 with respect to β is

    ∇ = 2λKβ + Σ_{i=1}^{n_sv} K_i (∂L/∂t)(y_i, K_i^T β)
      = 2λKβ + 2 Σ_{i=1}^{n_sv} K_i y_i (y_i K_i^T β − 1)
      = 2(λKβ + K I⁰(Kβ − Y)),    (4.1)
² From equation 3.3, it turns out that at the optimal solution, the sets {i, β_i ≠ 0} and {i, y_i f(x_i) < 1} will be the same. To avoid confusion, we could have defined the latter as the set of error vectors.
and the Hessian is

    H = 2(λK + K I⁰ K).    (4.2)

Each Newton step consists of the following update,

    β ← β − γ H^{-1} ∇,

where γ is the step size found by line search or backtracking (Boyd & Vandenberghe, 2004). In our experiments, we noticed that the default value of γ = 1 did not result in any convergence problem, and in the rest of this section we consider only this value. However, to enjoy the theoretical properties concerning the convergence of this algorithm, backtracking is necessary.

Combining equations 4.1 and 4.2 as ∇ = Hβ − 2K I⁰ Y, we find that after the update,

    β = (λK + K I⁰ K)^{-1} K I⁰ Y = (λI_n + I⁰ K)^{-1} I⁰ Y.    (4.3)

Note that we have assumed that K (and thus the Hessian) is invertible. If K is not invertible, then the expansion is not unique (even though the solution is), and equation 4.3 will produce one of the possible expansions of the solution. To avoid these problems, let us simply assume that an infinitesimally small ridge has been added to K.

Let I_p denote the identity matrix of size p × p and K_sv the first n_sv columns and rows of K, that is, the submatrix corresponding to the support vectors. Using the fact that the lower-left block of λI_n + I⁰K is 0, the inverse of this matrix can be easily computed, and finally, the update, equation 4.3, turns out to be

    β = [ (λI_{n_sv} + K_sv)^{-1}  0 ; 0  0 ] Y = ( (λI_{n_sv} + K_sv)^{-1} Y_sv ; 0 ).    (4.4)
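The two forms of the update, equations 4.3 and 4.4, are easy to check numerically; a sketch (ours; the tiny ridge on K follows the remark above, and the 0/1 "support" pattern here is arbitrary since only the linear algebra is being verified):

```python
import numpy as np

rng = np.random.default_rng(1)
n, lam = 8, 0.1
A = rng.standard_normal((n, n))
K = A @ A.T + 1e-8 * np.eye(n)                     # PSD kernel matrix plus a small ridge
Y = rng.choice([-1.0, 1.0], size=n)
I0 = np.diag((rng.random(n) < 0.5).astype(float))  # arbitrary support-vector pattern

beta1 = np.linalg.solve(lam * K + K @ I0 @ K, K @ I0 @ Y)   # first form of (4.3)
beta2 = np.linalg.solve(lam * np.eye(n) + I0 @ K, I0 @ Y)   # second form of (4.3)
assert np.allclose(beta1, beta2)
assert np.allclose(beta2[np.diag(I0) == 0], 0.0)            # block structure of (4.4)
```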
4.1.1 Link with Dual Optimization. Update rule 4.4 is not surprising if one has a look at the SVM dual optimization problem:

    max_α Σ_{i=1}^n α_i − (1/2) Σ_{i,j=1}^n α_i α_j y_i y_j (K_{ij} + λδ_{ij}),   under constraints α_i ≥ 0.
Consider the optimal solution: the gradient with respect to all α_i > 0 (the support vectors) must be 0,

    1 − diag(Y_sv)(K_sv + λI_{n_sv}) diag(Y_sv) α = 0,

where diag(Y) stands for the diagonal matrix with the diagonal being the vector Y. Thus, up to a sign difference, the solutions found by minimizing the primal and maximizing the dual are the same: β_i = y_i α_i.

4.1.2 Complexity Analysis. Only a couple of iterations are usually necessary to reach the solution (rarely more than five), and this number seems independent of n. The overall complexity is thus the complexity of one Newton step, which is O(n n_sv + n³_sv). Indeed, the first term corresponds to finding the support vectors (i.e., the points for which y_i f(x_i) < 1), and the second term is the cost of inverting the matrix K_sv + λI_{n_sv}. It turns out that this is the same complexity as in standard SVM learning (dual maximization) since those two steps are also necessary.

Of course, n_sv is not known in advance, and this complexity analysis is an a posteriori one. In the worst case, the complexity is O(n³). It is important to note that in general, this time complexity is also a lower bound (for the exact computation of the SVM solution). Chunking and decomposition methods, for instance (Joachims, 1999; Osuna, Freund, & Girosi, 1997), do not help since there is fundamentally a linear system of size n_sv to be solved.³ Chunking is useful only when the K_sv matrix cannot fit in memory.

³ We considered here that solving a linear system (in either the primal or the dual) takes cubic time. This time complexity can, however, be improved.

4.2 Huber and Hinge Loss. The hinge loss used in SVMs is not differentiable. We propose to use a differentiable approximation of it, inspired by the Huber loss (see Figure 3):
    L(y, t) = { 0                    if yt > 1 + h,
              { (1 + h − yt)²/(4h)   if |1 − yt| ≤ h,
              { 1 − yt               if yt < 1 − h,    (4.5)
where h is a parameter to choose, typically between 0.01 and 0.5. Note that we are not minimizing the hinge loss, but this does not matter, since from a machine learning point of view, there is no reason to prefer the hinge loss anyway. If one really wants to approach the hinge loss solution, one can make h → 0 (similar to Lee & Mangasarian, 2001b).
Figure 3: The Huber loss is a differentiable approximation of the hinge loss. The plot is equation 4.5 with h = 0.5.
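Implementing the loss 4.5 is straightforward; a sketch (ours), with a check that the pieces join continuously at yt = 1 ± h:

```python
import numpy as np

def huber_hinge(yt, h):
    # Loss 4.5: quadratic for |1 - yt| <= h, linear below, zero above.
    return np.where(yt > 1 + h, 0.0,
           np.where(yt < 1 - h, 1.0 - yt,
                    (1 + h - yt) ** 2 / (4 * h)))

h = 0.5
assert np.isclose(huber_hinge(np.array(1 - h), h), h)     # linear part gives 1 - yt = h
assert np.isclose(huber_hinge(np.array(1 + h), h), 0.0)   # zero part
```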
The derivation of the Newton step follows the same line as for the L₂ loss, and we will not go into the details. The algebra is just a bit more complicated because there are three different parts in the loss (and thus three different categories of points):

- n_sv of them are in the quadratic part of the loss.
- n_sv¯ are in the linear part of the loss. We will call this category of points the support vectors at bound, in reference to dual optimization, where the Lagrange multipliers associated with those points are at the upper bound C.
- The rest of the points have zero loss.

We reorder the training set in such a way that the points are grouped in these three categories. Let I¹ be the diagonal matrix with the first n_sv elements being 0, followed by n_sv¯ elements being 1 (and 0 for the rest). The gradient is

    ∇ = 2λKβ + K I⁰(Kβ − (1 + h)Y)/(2h) − K I¹ Y,

and the Hessian is

    H = 2λK + K I⁰ K/(2h).

Thus,

    ∇ = Hβ − K((1 + h)/(2h) I⁰ + I¹) Y,
and the new β is

    β = (2λI_n + I⁰K/(2h))^{-1} ((1 + h)/(2h) I⁰ + I¹) Y
      ≡ (β_sv; β_sv¯; 0)
      = ((4hλ I_{n_sv} + K_sv)^{-1}((1 + h)Y_sv − K_{sv,sv¯} Y_sv¯/(2λ)); Y_sv¯/(2λ); 0).    (4.6)

Again, one can see the link with the dual optimization: letting h → 0, the primal and the dual solution are the same, β_i = y_i α_i. This is obvious for the points in the linear part of the loss (with C = 1/(2λ)). For the points that are right on the margin, their output is equal to their label,

    K_sv β_sv + K_{sv,sv¯} β_sv¯ = Y_sv.

But since β_sv¯ = Y_sv¯/(2λ),

    β_sv = K_sv^{-1}(Y_sv − K_{sv,sv¯} Y_sv¯/(2λ)),

which is the same equation as the first block of equation 4.6 when h → 0.

4.2.1 Complexity Analysis. Similar to the quadratic loss, the complexity is O(n³_sv + n(n_sv + n_sv¯)). The n_sv + n_sv¯ factor is the complexity for computing the output of one training point (the number of nonzero elements in the vector β). Again, the complexity for dual optimization is the same since both steps (solving a linear system of size n_sv and computing the outputs of all the points) are required.

4.3 Other Losses. Some other losses have been proposed to approximate the SVM hinge loss (Lee & Mangasarian, 2001b; Zhang, Jin, Yang, & Hauptmann, 2003; Zhu & Hastie, 2005). However, none of them has a linear part, and the overall complexity is O(n³), which can be much larger than the complexity of standard SVM training. More generally, the size of the linear system to solve is equal to n_sv, the number of training points for which ∂²L/∂t²(y_i, f(x_i)) ≠ 0. If there are some large linear parts in the loss function, this number might be much smaller than n, resulting in significant speed-up compared to the standard O(n³) cost.

5 Experiments

The experiments in this section aim at checking that primal and dual optimization of a nonlinear SVM have similar time complexities. However,
for linear SVMs, the primal optimization is definitely superior (Keerthi & DeCoste, 2005), as illustrated below. Some Matlab code for the quadratic penalization of the errors and taking into account the bias b is available online at http://www.kyb.tuebingen.mpg.de/bs/people/chapelle/primal.

5.1 Linear SVM. In the case of quadratic penalization of the training errors, the gradient of the objective function, equation 3.1, is

    ∇ = 2w + 2C Σ_{i∈sv} (w · x_i − y_i) x_i,

and the Hessian is

    H = I_d + C Σ_{i∈sv} x_i x_i^T.

The computation of the Hessian is in O(d² n_sv) and its inversion in O(d³). When the number of dimensions is relatively small compared to the number of training samples, it is advantageous to optimize directly on w rather than on the expansion coefficients. In the case where d is large but the data are sparse, the Hessian should not be built explicitly. Instead, the linear system H^{-1}∇ can be solved efficiently by conjugate gradient (Keerthi & DeCoste, 2005).

Training time comparison on the Adult data set (Platt, 1998) is presented in Figure 4. As expected, the training time is linear for our primal implementation, but the scaling exponent is 2.2 for the dual implementation of LIBSVM (comparable to the 1.9 reported in Platt, 1998). This exponent can be explained as follows: n_sv is very small (Platt, 1998, Table 12.3), and n_sv¯ grows linearly with n (the misclassified training points). So for this data set, the complexity of O(n³_sv + n(n_sv + n_sv¯)) turns out to be about O(n²). It is noteworthy that for this experiment, the number of Newton steps required to reach the exact solution was seven. More generally, this algorithm is usually extremely fast for linear SVMs.

5.2 L₂ Loss. We now compare primal and dual optimization for nonlinear SVMs. To avoid problems of memory management and kernel caching and to make time comparison as straightforward as possible, we decided to precompute the entire kernel matrix. For this reason, the Adult data set used in the previous section is not suitable because it would be difficult to fit the kernel matrix in memory (about 8 GB). Instead, we used the USPS data set consisting of 7291 training examples. The problem was made binary by classifying digits 0 to 4 versus 5 to 9. An RBF kernel with σ = 8 was chosen. We consider in this section the hard margin SVM by fixing λ to a very small value: 10^{-8}.
Figure 4: Time comparison of LIBSVM (an implementation of SMO; Platt, 1998) and direct Newton optimization on the normal vector w.
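Before moving to the nonlinear case, here is a sketch of the linear-case Newton iteration of section 5.1 (ours, in Python; the offset b is omitted, we keep the factor of 2 in the Hessian, and a fixed iteration count stands in for a convergence test):

```python
import numpy as np

def primal_linear_svm(X, y, C, iters=10):
    # Newton on w for objective 3.1 with the quadratic loss (no offset).
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        out = X @ w
        sv = y * out < 1                            # current support vectors
        grad = 2 * w + 2 * C * X[sv].T @ (out[sv] - y[sv])
        hess = 2 * np.eye(d) + 2 * C * X[sv].T @ X[sv]
        w = w - np.linalg.solve(hess, grad)         # full Newton step (gamma = 1)
    return w
```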
The training for the primal optimization is performed as follows (see algorithm 1): we start from a small number of training samples, train, double the number of samples, retrain, and so on. In this way, the set of support vectors is rather well identified (otherwise, we would have to invert an n × n matrix in the first Newton step).

Algorithm 1: SVM Primal Training by Newton Optimization
    Function: β = PrimalSVM(K, Y, λ)
    n ← length(Y)                    {number of training points}
    if n > 1000 then
        n₂ ← n/2                     {train first on a subset to estimate the decision boundary}
        β ← PrimalSVM(K_{1..n₂,1..n₂}, Y_{1..n₂}, λ)
        sv ← nonzero components of β
    else
        sv ← {1, . . . , n}
    end if
    repeat
        β_sv ← (K_sv + λI_{n_sv})^{-1} Y_sv
        other components of β ← 0
        sv ← indices i such that y_i [Kβ]_i < 1
    until sv has not changed
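A direct Python transcription of algorithm 1 could read as follows (a sketch; the recursion threshold of 1000 is from the pseudocode, while the handling of the index sets is our choice):

```python
import numpy as np

def primal_svm(K, Y, lam):
    # Algorithm 1: Newton training of the L2-loss SVM in the primal.
    n = len(Y)
    beta = np.zeros(n)
    if n > 1000:
        n2 = n // 2   # train first on a subset to estimate the decision boundary
        beta[:n2] = primal_svm(K[:n2, :n2], Y[:n2], lam)
        sv = np.flatnonzero(beta)
    else:
        sv = np.arange(n)
    while True:
        beta = np.zeros(n)
        Ksv = K[np.ix_(sv, sv)]
        beta[sv] = np.linalg.solve(Ksv + lam * np.eye(len(sv)), Y[sv])
        new_sv = np.flatnonzero(Y * (K @ beta) < 1)
        if np.array_equal(new_sv, sv):
            return beta
        sv = new_sv
```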
The time comparison is plotted in Figure 5: the running times for primal and dual training are almost the same. Moreover, they are directly proportional to n³_sv, which turns out to be the dominating term in the O(n n_sv + n³_sv) time complexity. In this problem, n_sv grows approximately like √n.
Figure 5: With the L₂ penalization of the slacks, the parallel between dual optimization and primal Newton optimization is striking: the training times are almost the same (and scale in O(n³_sv)). Note that both solutions are exactly the same.
This seems to be in contradiction with the result of Steinwart (2003), which states that the number of support vectors grows linearly with the training set size. However, this result holds only for noisy problems, and the USPS data set has a very small noise level.

5.3 Huber Loss. We perform the same experiments as in the previous section but introduce noise in the labels: for a randomly chosen 10% of the points, the labels have been flipped. In this kind of situation, the L₂ loss is not well suited, because it penalizes the noisy examples too much. However, if the noise were, for instance, gaussian in the inputs, the L₂ loss would have been very appropriate.

We will study the time complexity and the test error when we vary the parameter h. Note that the solution will not usually be exactly the same as for a standard hinge loss SVM (it will be the case only in the limit h → 0). The regularization parameter λ was set to 1/8, which corresponds to the best test performance.

In the experiments described below, a line search was performed in order to make Newton converge more quickly. This means that instead of using equation 4.6 for updating β, the following step was made,

    β ← β − γ H^{-1} ∇,

where γ ∈ [0, 1] is found by 1D minimization. This additional line search does not increase the complexity since it is just O(n). For the L₂ loss described in the previous section, this line search was not necessary, and full Newton steps were taken (γ = 1).
Figure 6: Influence of h on the test error (left, n = 500) and the training time (right, n = 1000).
Figure 7: When h increases, more points are in the quadratic part of the loss (n_sv increases and n_sv¯ decreases). The dashed lines are the LIBSVM solution. For this plot, n = 1000.
5.3.1 Influence of h. As expected, the left-hand side of Figure 6 shows that the test error is relatively unaffected by the value of h as long as it is not too large. For h = 1, the loss looks more like the L₂ loss, which is inappropriate for the kind of noise we generated. Concerning the time complexity (right-hand side of Figure 6), there seems to be an optimal range for h. When h is too small, the problem is highly non-quadratic (because most of the loss function is linear), and a lot of Newton steps are necessary. On the other hand, when h is large, n_sv increases, and since the complexity is mainly in O(n³_sv), the training time increases (see Figure 7).
Figure 8: Time comparison between LIBSVM and Newton optimization. Here n_sv has been computed from LIBSVM (note that the solutions are not exactly the same). For this plot, h = 2^{-5}.
5.3.2 Time Comparison with LIBSVM. Figure 8 presents a time comparison of both optimization methods for different training set sizes. As for the quadratic loss, the time complexity is O(n3sv ). However, unlike Figure 5, the constant for LIBSVM training time is better. This is probably the case because the loss function is far from quadratic, and the Newton optimization requires more steps to converge (on the order of 30). But we believe that this factor can be improved by not inverting the Hessian from scratch in each iteration or by using a more direct optimizer such as conjugate gradient descent.
6 Advantages of Primal Optimization

As explained throughout this letter, primal and dual optimization are very similar, and it is not surprising that they lead to the same computational complexity, O(n n_sv + n³_sv). So is there a reason to use one rather than the other?

We believe that primal optimization might have advantages for large-scale optimization. Indeed, when the number of training points is large, the number of support vectors is also typically large, and it becomes intractable to compute the exact solution. For this reason, one has to resort to approximations (Bordes, Ertekin, Weston, & Bottou, 2005; Bakır, Bottou, & Weston, 2005; Collobert, Bengio, & Bengio, 2002; Tsang, Kwok, & Cheung, 2005). But introducing approximations in the dual may not be wise. There is indeed no guarantee that an approximate dual solution yields a good approximate primal solution. Since what we are eventually interested in is a good primal
objective function value, it is more straightforward to directly minimize it (see the discussion at the end of section 2). Below are some examples of approximation strategies for primal minimization. One can probably come up with many more, but our goal is just to give a flavor of what can be done in the primal.

6.1 Conjugate Gradient. One could directly minimize equation 3.6 by conjugate gradient descent. For squared loss without a regularizer, this approach has been investigated in Ong (2005). The hope is that on a lot of problems, a reasonable solution can be obtained with only a couple of gradient steps. In the dual, this strategy is hazardous: there is no guarantee that an approximate dual solution corresponds to a reasonable primal solution. We have indeed shown in section 2 and appendix A that for a given number of conjugate gradient steps, the primal objective function was lower when optimizing the primal than when optimizing the dual.

However, one needs to keep in mind that the performance of a conjugate gradient optimization strongly depends on the parameterization of the problem. In section 2, we analyzed an optimization in terms of the primal variable w, whereas in the rest of the letter, we used the reparameterization, equation 3.4, with the vector β. The convergence rate of conjugate gradient depends on the condition number of the Hessian (Shewchuk, 1994). For an optimization on β, it is roughly equal to the condition number of K² (see equation 4.2), while for an optimization on w, this is the condition number of K. So optimizing on β could be much slower. There is fortunately an easy fix to this problem: preconditioning by K. In general, preconditioning by a matrix M requires being able to compute efficiently M^{-1}∇, where ∇ is the gradient (Shewchuk, 1994). But in our case, it turns out that one can factorize K in the expression of the gradient, equation 4.1, and the computation of K^{-1}∇ becomes trivial. With this preconditioning (which comes at no extra computational cost), the convergence rate is now the same as for the optimization on w. In fact, in the case of regularized least squares of section 2, one can show that the conjugate gradient steps for the optimization on w and for the optimization on β with preconditioning are identical.

Pseudocode for optimizing equation 3.6 using the Fletcher-Reeves update and this preconditioning is given in algorithm 2. Note that in this pseudocode, ∇ is exactly the gradient given in equation 4.1 but "divided" by K.

Finally, we would like to point out that algorithm 2 has another interesting interpretation: it is indeed equivalent to performing a conjugate gradient minimization on w (see equation 3.1), while maintaining the solution in terms of β, such that w = Σ β_i x_i. This is possible because at each step of the algorithm, the gradient (with respect to w) is always in the span of the training points. More precisely, we have that the gradient of equation 3.1 with respect to w is Σ_i (∇_new)_i x_i.
Figure 9: Optimization of the objective function, equation 3.6, by conjugate gradient for different values of the kernel width.
Algorithm 2: Optimization of Equation 3.6 (with the L₂ Loss) by Preconditioned Conjugate Gradients
    Let β = 0 and d = ∇_old = −Y
    repeat
        Let t* be the minimizer of equation 3.6 on the line β + td
        β ← β + t* d
        Let o = Kβ − Y and sv = {i, o_i y_i < 1}; update I⁰
        ∇_new ← 2λβ + I⁰ o
        d ← −∇_new + ((∇_new^T K ∇_new)/(∇_old^T K ∇_old)) d
        ∇_old ← ∇_new
    until ||∇_new|| ≤ ε
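A Python rendering of algorithm 2 might look as below (a sketch, ours). Two simplifications relative to the pseudocode: we initialize d as a descent direction, and instead of an exact 1D minimizer for t* we freeze the current support-vector set, in which case the objective is quadratic along the line and a single Newton step in t gives the line minimizer.

```python
import numpy as np

def primal_svm_cg(K, Y, lam, tol=1e-6, max_iter=1000):
    # Algorithm 2: preconditioned Fletcher-Reeves CG on objective 3.6 (L2 loss).
    Y = np.asarray(Y, dtype=float)
    beta = np.zeros_like(Y)
    grad_old = -Y                  # preconditioned gradient K^{-1}(4.1)/2 at beta = 0
    d = Y.copy()                   # -grad_old, a descent direction
    for _ in range(max_iter):
        # Line search: with the sv set frozen, the objective is quadratic in t.
        o = K @ beta - Y
        sv = (o * Y < 1).astype(float)
        Kd = K @ d
        t = -(lam * (beta @ Kd) + (sv * o) @ Kd) / (lam * (d @ Kd) + (sv * Kd) @ Kd)
        beta = beta + t * d
        o = K @ beta - Y
        sv = (o * Y < 1).astype(float)
        grad_new = 2 * lam * beta + sv * o      # the gradient 4.1 "divided" by K
        if np.linalg.norm(grad_new) <= tol:
            break
        d = -grad_new + ((grad_new @ (K @ grad_new)) / (grad_old @ (K @ grad_old))) * d
        grad_old = grad_new
    return beta
```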
Let us now have an empirical study of conjugate gradient behavior. As in the previous section, we considered the 7291 training examples of the USPS data set. We monitored the test error as a function of the number of conjugate gradient iterations. It can be seen in Figure 9 that:

- A relatively small number of iterations (between 10 and 100) is enough to reach a good solution. Note that the test errors at the right of the figure (corresponding to 128 iterations) are the same as for a fully trained SVM: the objective values have almost converged at this point.
- The convergence rate depends, through the condition number, on the bandwidth σ. For σ = 1, K is very similar to the identity matrix, and one step is enough to be close to the optimal solution. However, the test error is not so good for this value of σ, and one should set it to 2 or 4.
It is noteworthy that the preconditioning discussed above is really helpful. For instance, without it, the test error is still 1.84% after 256 iterations with σ = 4.

Finally, note that each conjugate gradient step requires the computation of Kβ (see equation 4.1), which takes O(n²) operations. If one wants to avoid recomputing the kernel matrix at each step, the memory requirement is also O(n²). Both time and memory requirements could probably be improved to O(n²_sv) per conjugate gradient iteration by directly working on the linear system, equation 4.6. But this complexity is probably still too high for large-scale problems. Here are some cases where the matrix-vector multiplication can be done more efficiently:

- Sparse kernel. If a compactly supported RBF kernel (Schaback, 1995; Fasshauer, 2005) is used, the kernel matrix K is sparse. The time and memory complexities are then proportional to the number of nonzero elements in K. Also, when the bandwidth of the gaussian RBF kernel is small, the kernel matrix can be well approximated by a sparse matrix.
- Low rank. Whenever the kernel matrix is (approximately) low rank, one can write K ≈ AA^T, where A ∈ R^{n×p} can be found through an incomplete Cholesky decomposition in O(np²) operations. The complexity of each conjugate gradient iteration is then O(np). This idea has been used in Fine and Scheinberg (2001) in the context of SVM training, but the authors considered only dual optimization. Note that the kernel matrix is usually low rank when the bandwidth of the gaussian RBF kernel is large.
- Fast multipole methods. Generalizing both cases above, fast multipole methods and KD-trees provide an efficient way of computing the multiplication of an RBF kernel matrix with a vector (Greengard & Rokhlin, 1987; Gray & Moore, 2000; Yang, Duraiswami, & Davis, 2004; de Freitas, Wang, Mahdaviani, & Lang, 2005; Shen, Ng, & Seeger, 2006). These methods have been successfully applied to kernel ridge regression and gaussian processes, but they do not seem to be able to handle high-dimensional data. (See Lang, Klaas, & de Freitas, 2005, for an empirical study of the time and memory requirements of these methods.)
6.2 Reduced Expansion. Instead of optimizing on a vector β of length n, one can choose a small subset of the training points to expand the solution on and optimize only those weights. More precisely, the same objective function, 3.2, is considered, but unlike equation 3.4, it is optimized over the subset of functions expressed as

    f(x) = Σ_{i∈S} β_i k(x_i, x),    (6.1)

where S is a subset of the training set.
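A sketch of Newton optimization over the reduced expansion 6.1 with the quadratic loss (ours; K_nS holds the kernel evaluations between all n training points and the subset S, and the small ridge guards against a singular reduced Hessian):

```python
import numpy as np

def reduced_primal_svm(K_nS, K_SS, Y, lam, iters=20):
    # Newton over the reduced expansion 6.1 with the quadratic loss.
    # K_nS: kernel between all n points and the subset S (n x k);
    # K_SS: kernel within the subset S (k x k).
    k = K_SS.shape[0]
    beta = np.zeros(k)
    for _ in range(iters):
        out = K_nS @ beta
        sv = Y * out < 1
        grad = 2 * lam * K_SS @ beta + 2 * K_nS[sv].T @ (out[sv] - Y[sv])
        hess = 2 * lam * K_SS + 2 * K_nS[sv].T @ K_nS[sv]
        beta = beta - np.linalg.solve(hess + 1e-10 * np.eye(k), grad)
    return beta
```

Forming the reduced Hessian costs O(nk²) per step, in line with the complexity quoted below.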
This approach has been pursued in Keerthi, Chapelle, and DeCoste (2006), where the set S is greedily constructed, and in Lee and Mangasarian (2001a), where S is selected randomly. If S contains k elements, these methods have a complexity of O(nk²) and a memory requirement of O(nk).

6.3 Model Selection. Another advantage of primal optimization is when some hyperparameters are optimized on the training cost function (Chapelle, Vapnik, Bousquet, & Mukherjee, 2002; Grandvalet & Canu, 2002). If θ is a set of hyperparameters and α the dual variables, the standard way of learning θ is to solve a min-max problem (remember that the maximum of the dual is equal to the minimum of the primal):

    min_θ max_α Dual(α, θ),
by alternating between minimization on θ and maximization on α (see, e.g., Grandvalet & Canu, 2002, for the special case of learning scaling factors). But if the primal is minimized, a joint optimization on β and θ can be carried out, which is likely to be much faster. Finally, to compute an approximate leave-one-out error, the matrix K_sv + λI_{n_sv} needs to be inverted (Chapelle et al., 2002), but after a Newton optimization, this inverse is already available in equation 4.4.

7 Conclusion

In this letter, we have studied the primal optimization of nonlinear SVMs and derived the update rules for a Newton optimization. From these formulas, it appears clear that there are strong similarities between primal and dual optimization. Also, the corresponding implementation is very simple and does not require any optimization libraries.

The historical reasons for which most of the research in the past decade has been about dual optimization are unclear. We believe that it is because SVMs were introduced in their hard margin formulation (Boser, Guyon, & Vapnik, 1992), for which a dual optimization (because of the constraints) seems more natural. In general, however, soft margin SVMs should be preferred, even if the training data are separable: the decision boundary is more robust because more training points are taken into account (Chapelle, Weston, Bottou, & Vapnik, 2000).

We do not pretend that primal optimization is better in general; our main motivation was to point out that primal and dual are two sides of the same coin and that there is no reason to look always at the same side. And by looking at the primal side, some new algorithms for finding approximate solutions emerge naturally. We believe that an approximate primal solution
is in general superior to a dual one, since an approximate dual solution can yield a primal one that is arbitrarily bad. In addition to all the possibilities for approximate solutions mentioned in this letter, the primal optimization also offers the advantage of tuning the hyperparameters simultaneously by performing a conjoint optimization on parameters and hyperparameters.

Appendix A: Primal Suboptimality

Let us define the following quantities:

    A = y^T y,   B = y^T X X^T y,   C = y^T X X^T X X^T y.

After one gradient step with exact line search on the primal objective function, we have w = B/(C + λB) X^T y, and the primal value, equation 2.1, is

    (1/2) A − (1/2) B²/(C + λB).

For the dual optimization, after one gradient step, α = λA/(B + λA) y, and by equation 2.5, w = A/(B + λA) X^T y. The primal value is then

    (1/2) A + (1/2) (A/(B + λA))² (C + λB) − AB/(B + λA).

The difference between these two quantities is

    (1/2) (A/(B + λA))² (C + λB) − AB/(B + λA) + (1/2) B²/(C + λB) = (1/2) (B² − AC)²/((B + λA)²(C + λB)) ≥ 0.
This proves that if one does only one gradient step, one should do it on the primal instead of the dual, because one will get a lower primal value this way. Now note that by the Cauchy-Schwarz inequality, B² ≤ AC, with equality only if X X^T y and y are aligned. In that case, the above expression is zero: the primal and dual steps are equally efficient. That is what happens on the left side of Figure 1: when n ≪ d, since X has been generated according to a gaussian distribution, X X^T ≈ dI, and the vectors X X^T y and y are almost aligned.
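The closed-form difference above is easy to confirm numerically (a sketch, ours; random gaussian data as in Figure 1):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 100, 10, 1.0
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

A = y @ y
B = y @ X @ (X.T @ y)
C = (X.T @ y) @ (X.T @ X) @ (X.T @ y)    # y' X X' X X' y

after_primal_step = A / 2 - B ** 2 / (2 * (C + lam * B))
after_dual_step = (A / 2
                   + (A / (B + lam * A)) ** 2 * (C + lam * B) / 2
                   - A * B / (B + lam * A))
gap = after_dual_step - after_primal_step
closed_form = (B ** 2 - A * C) ** 2 / (2 * (B + lam * A) ** 2 * (C + lam * B))
assert gap >= 0 and np.isclose(gap, closed_form)
```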
Appendix B: Optimization with an Offset

We now consider a joint optimization on (b, β) of the function

    f(x) = Σ_{i=1}^n β_i k(x_i, x) + b.

The augmented Hessian (see equation 4.2) is

    H = 2 [ 1^T I⁰ 1   1^T I⁰ K ; K I⁰ 1   λK + K I⁰ K ],

where 1 should be understood as a vector of all 1s. This can be decomposed as

    H = 2 [ −λ   1^T ; 0   K ] [ 0   1^T ; I⁰ 1   λI + I⁰ K ].

Now the gradient is

    ∇ = H (b; β) − 2 [ 1^T ; K ] I⁰ Y,

and the equivalent of the update, equation 4.3, is

    (b; β) = [ 0   1^T ; I⁰ 1   λI + I⁰ K ]^{-1} [ −λ   1^T ; 0   K ]^{-1} [ 1^T ; K ] I⁰ Y
           = [ 0   1^T ; I⁰ 1   λI + I⁰ K ]^{-1} (0; I⁰ Y).

So instead of solving equation 4.4, one solves

    (b; β_sv) = [ 0   1^T ; 1   λI_{n_sv} + K_sv ]^{-1} (0; Y_sv).
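In code, the offset amounts to one extra (bordered) row and column in the linear system of equation 4.4; a sketch (ours):

```python
import numpy as np

def solve_with_offset(K_sv, Y_sv, lam):
    # Bordered system of appendix B: returns (b, beta_sv).
    m = len(Y_sv)
    A = np.zeros((m + 1, m + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = lam * np.eye(m) + K_sv
    sol = np.linalg.solve(A, np.concatenate(([0.0], Y_sv)))
    return sol[0], sol[1:]
```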
Acknowledgments

We are grateful to Adam Kowalczyk and Sathiya Keerthi for helpful comments.
References

Bakır, G., Bottou, L., & Weston, J. (2005). Breaking SVM complexity with cross training. In L. K. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing systems, 17 (pp. 81–88). Cambridge, MA: MIT Press.
Bordes, A., Ertekin, S., Weston, J., & Bottou, L. (2005). Fast kernel classifiers with online and active learning (Tech. Rep.). Princeton, NJ: NEC Research Institute. Available online at http://leon.bottou.com/publications/pdf/huller3.pdf.
Boser, B., Guyon, I., & Vapnik, V. (1992). A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory (pp. 144–152). New York: ACM.
Boyd, S., & Vandenberghe, L. (2004). Convex optimization. Cambridge: Cambridge University Press.
Burges, C. (1998). A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2), 121–167.
Chapelle, O., Vapnik, V., Bousquet, O., & Mukherjee, S. (2002). Choosing multiple parameters for support vector machines. Machine Learning, 46, 131–159.
Chapelle, O., Weston, J., Bottou, L., & Vapnik, V. (2000). Vicinal risk minimization. In S. A. Solla, T. K. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems, 12. Cambridge, MA: MIT Press.
Collobert, R., Bengio, S., & Bengio, Y. (2002). A parallel mixture of SVMs for very large scale problems. Neural Computation, 14(5), 1105–1114.
Cristianini, N., & Shawe-Taylor, J. (2000). An introduction to support vector machines. Cambridge: Cambridge University Press.
de Freitas, N., Wang, Y., Mahdaviani, M., & Lang, D. (2005). Fast Krylov methods for N-body learning. In Y. Weiss, B. Schölkopf, & J. Platt (Eds.), Advances in neural information processing systems, 18 (pp. 251–258). Cambridge, MA: MIT Press.
Fasshauer, G. (2005). Meshfree methods. In M. Rieth & W. Schommers (Eds.), Handbook of theoretical and computational nanotechnology. Valencia, CA: American Scientific Publishers.
Fine, S., & Scheinberg, K. (2001). Efficient SVM training using low-rank kernel representations. Journal of Machine Learning Research, 2, 243–264.
Golub, G., & Van Loan, C. (1996). Matrix computations (3rd ed.). Baltimore: Johns Hopkins University Press.
Grandvalet, Y., & Canu, S. (2002). Adaptive scaling for feature selection in SVMs. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Neural information processing systems, 15. Cambridge, MA: MIT Press.
Gray, A., & Moore, A. (2000). "N-body" problems in statistical learning. In T. K. Leen, T. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 521–527). Cambridge, MA: MIT Press.
Greengard, L., & Rokhlin, V. (1987). A fast algorithm for particle simulations. Journal of Computational Physics, 73, 325–348.
Joachims, T. (1999). Making large-scale SVM learning practical. In B. Schölkopf, C. Burges, & A. Smola (Eds.), Advances in kernel methods–Support vector learning. Cambridge, MA: MIT Press.
Keerthi, S., Chapelle, O., & DeCoste, D. (2006). Building support vector machines with reduced classifier complexity. Journal of Machine Learning Research, 7, 1493–1515.
Keerthi, S. S., & DeCoste, D. M. (2005). A modified finite Newton method for fast solution of large scale linear SVMs. Journal of Machine Learning Research, 6, 341–361.
Kimeldorf, G. S., & Wahba, G. (1970). A correspondence between Bayesian estimation on stochastic processes and smoothing by splines. Annals of Mathematical Statistics, 41, 495–502.
Lang, D., Klaas, M., & de Freitas, N. (2005). Empirical testing of fast kernel density estimation algorithms (Tech. Rep. UBC TR-2005-03). Vancouver: University of British Columbia.
Lee, Y. J., & Mangasarian, O. L. (2001a). RSVM: Reduced support vector machines. In Proceedings of the SIAM International Conference on Data Mining. Philadelphia: SIAM.
Lee, Y. J., & Mangasarian, O. L. (2001b). SSVM: A smooth support vector machine for classification. Computational Optimization and Applications, 20(1), 5–22.
Mangasarian, O. L. (2002). A finite Newton method for classification. Optimization Methods and Software, 17, 913–929.
Ong, C. S. (2005). Kernels: Regularization and optimization. Unpublished doctoral dissertation, Australian National University.
Osuna, E., Freund, R., & Girosi, F. (1997). Support vector machines: Training and applications (Tech. Rep. AIM-1602). Cambridge, MA: MIT Artificial Intelligence Laboratory.
Platt, J. (1998). Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. Burges, & A. Smola (Eds.), Advances in kernel methods–Support vector learning. Cambridge, MA: MIT Press.
Schaback, R. (1995). Creating surfaces from scattered data using radial basis functions. In M. Daehlen, T. Lyche, & L. Schumaker (Eds.), Mathematical methods for curves and surfaces (pp. 477–496). Nashville, TN: Vanderbilt University Press.
Schölkopf, B., & Smola, A. (2002). Learning with kernels. Cambridge, MA: MIT Press.
Shen, Y., Ng, A., & Seeger, M. (2006). Fast gaussian process regression using KD-trees. In Y. Weiss, B. Schölkopf, & J. Platt (Eds.), Advances in neural information processing systems, 18. Cambridge, MA: MIT Press.
Shewchuk, J. R. (1994). An introduction to the conjugate gradient method without the agonizing pain (Tech. Rep. CMU-CS-94-125). Pittsburgh, PA: School of Computer Science, Carnegie Mellon University.
Steinwart, I. (2003). Sparseness of support vector machines. Journal of Machine Learning Research, 4, 1071–1105.
Tsang, I. W., Kwok, J. T., & Cheung, P. M. (2005). Core vector machines: Fast SVM training on very large datasets. Journal of Machine Learning Research, 6, 363–392.
Vapnik, V. (1998). Statistical learning theory. New York: Wiley.
Yang, C., Duraiswami, R., & Davis, L. (2004). Efficient kernel machines using the improved fast Gauss transform. In S. Thrun, L. K. Saul, & B. Schölkopf (Eds.), Advances in neural information processing systems, 16. Cambridge, MA: MIT Press.
Zhang, J., Jin, R., Yang, Y., & Hauptmann, A. G. (2003). Modified logistic regression: An approximation to SVM and its applications in large-scale text categorization.
In Proceedings of the 20th International Conference on Machine Learning. Menlo Park, CA: AAAI Press.
Zhu, J., & Hastie, T. (2005). Kernel logistic regression and the import vector machine. Journal of Computational and Graphical Statistics, 14, 185–205.
Received April 11, 2006; accepted August 23, 2006.
LETTER
Communicated by Michael DeWeese
Dynamics of Nonlinear Feedback Control

H. P. Snippe
[email protected]
Department of Neurobiophysics, University of Groningen, Groningen, The Netherlands
J. H. van Hateren [email protected] Department of Neurobiophysics, University of Groningen, and Netherlands Institute for Neuroscience, KNAW-NIN, Amsterdam, The Netherlands
Neural Computation 19, 1179–1214 (2007)
C 2007 Massachusetts Institute of Technology

Feedback control in neural systems is ubiquitous. Here we study the mathematics of nonlinear feedback control. We compare models in which the input is multiplied by a dynamic gain (multiplicative control) with models in which the input is divided by a dynamic attenuation (divisive control). The gain signal (resp. the attenuation signal) is obtained through a concatenation of an instantaneous nonlinearity and a linear low-pass filter operating on the output of the feedback loop. For input steps, the dynamics of gain and attenuation can be very different, depending on the mathematical form of the nonlinearity and the ordering of the nonlinearity and the filtering in the feedback loop. Further, the dynamics of feedback control can be strongly asymmetrical for increment versus decrement steps of the input. Nevertheless, for each of the models studied, the nonlinearity in the feedback loop can be chosen such that immediately after an input step, the dynamics of feedback control is symmetric with respect to increments versus decrements. Finally, we study the dynamics of the output of the control loops and find conditions under which overshoots and undershoots of the output relative to the steady-state output occur when the models are stimulated with low-pass filtered steps. For small steps at the input, overshoots and undershoots of the output do not occur when the filtering in the control path is faster than the low-pass filtering at the input. For large steps at the input, however, results depend on the model, and for some of the models, multiple overshoots and undershoots can occur even with a fast control path.

1 Introduction

In natural environments, sensory inputs typically have a large dynamic range. For instance, the luminance input to the visual system varies over multiple log units not only between day and night, but also on much shorter
timescales (van Hateren, 1997). Likewise, the range of auditory intensities is very large, with changes in intensity often occurring on short timescales (Escabí, Miller, Read, & Schreiner, 2003). This strong dynamic nature of the input poses a problem to sensory systems because both the sensory transducers and the subsequent neurons have a limited dynamic range. Hence, to fit the input into these processing units, some form of dynamic compression is required.

In this letter, we study systems in which a compression of the dynamic range is attained through an operation of gain control. Thus, the models for gain control we study operate such that, at least in their steady state, they yield a reduction in the range of their output relative to their input range. We study two types of gain control. The first type is multiplicative gain control, in which an input I(t) is multiplied by a dynamic gain g(t), which yields an output

    O(t) = g(t) I(t).    (1.1)
When the gain g decreases for increasing values of the input I, the dynamic range of the output O is compressed relative to the dynamic range of the input I. The second type of control that we study is divisive gain control, in which an input I(t) is divided by an attenuation signal a(t), which yields an output

    O(t) = I(t)/a(t).    (1.2)
When the attenuation a increases with increasing values of the input I , the dynamic range of O is compressed relative to the dynamic range of I . Obviously, multiplicative and divisive gain control are closely related. When the attenuation signal a (t) and the gain signal g(t) are related as a (t) = 1/g(t), the outputs O(t) of equations 1.1 and 1.2 are identical. However, as we show in section 2, for many control systems, even when a = 1/g in steady state, the dynamic behavior of a (t) and g(t) is not related through a simple inversion a (t) = 1/g(t). As a consequence, the outputs of multiplicative and divisive gain control can be qualitatively different. Gain control in neural systems is ubiquitous (Shapley & Enroth-Cugell, 1984; Schwartz & Simoncelli, 2001); thus, it is important to understand its dynamics. For instance, what is the speed of adaptation for systems with gain control? Can we expect asymmetries in adaptation after stimulation with increments and decrements of the input (DeWeese & Zador, 1998; Snippe, Poot, & van Hateren, 2004; Hosoya, Baccus, & Meister, 2005)? Do systems with gain control have monotonic outputs when stimulated with simple monotonic inputs, or should we expect overshoots, undershoots, or oscillations in their output? By focusing on simple models for gain control,
we can start to answer these questions. Since gain control occurs at many levels in the neural system (Abbott, Varela, Sen, & Nelson, 1997; Salinas & Thier, 2000), the inputs I (t) in equations 1.1 and 1.2 are not necessarily sensory inputs; they could also represent the internal input to a process of gain control anywhere within the neural system. The control signals g(t) in equation 1.1 and a (t) in equation 1.2 could be generated from either the input signal I (t) (feedforward control) or the output O(t) of the control loop (feedback control). In this letter, we concentrate on feedback control, since this appears to be the most common type of control in neural systems (Marmarelis, 1991; Torre, Ashmore, Lamb, & Menini, 1995; Calvert, Ho, Lefebvre, & Arshavsky, 1998; Crevier & Meister, 1998). A second reason to concentrate on feedback control is that the dynamics of a feedback loop is often less immediately obvious than the dynamics of a feedforward loop, especially when the control loop is nonlinear. Although the dynamics of linear, subtractive feedback control is well known (e.g., van de Vegte, 1990; Franklin, Powell & Emami-Naeini, 1994), less work has been done on the dynamics of multiplicative and divisive feedback control. Here we provide a mathematical analysis of such nonlinear feedback control. The letter is organized as follows. In section 2, we define four feedback models—two with multiplicative control and the other two with divisive control. The structure of these models is simple—about as simple as possible for nonlinear feedback. This has been done on purpose, since it allows a mathematical analysis of the dynamics of these models rather than relying on numerical simulations. The results of this analysis should help to understand the behavior of more involved models that are too complex for an exact mathematical analysis. In section 3, we study the stability of the feedback models for a variety of inputs. In section 4, we study the dynamics of the control signals when the models are stimulated with step inputs, and in section 5, we show that these control signals can have strong asymmetries with respect to increment versus decrement steps in the input. In section 6, we focus on the output O(t) of the models rather than on the control signals g(t) and a (t). In particular, we investigate under what circumstances overshoots or oscillations in the output can occur when the models are stimulated with simple, monotonic inputs. Finally, in section 7 we restate the most important conclusions. To avoid disturbing the flow of the arguments, two appendixes contain longer mathematical derivations of the results. 2 Models We analyze the dynamics of the four models for feedback control shown in Figure 1. Two of the models (MNL and MLN ; see Figure 1, top row) have a multiplicative gain control; the other two models (DNL and DLN ; see Figure 1, bottom row) have a divisive gain control. For each of the models,
Figure 1: Models for nonlinear feedback control. Models (A) MNL and (B) MLN have multiplicative control in which the output O(t) equals the input I (t) multiplied by a gain signal g(t). Models (C) DNL and (D) DLN have divisive control in which the output O(t) equals the input I (t) divided by an attenuation signal a (t). The feedback paths consist of a concatenation of a first-order low-pass filtering with time constant τ and an instantaneous nonlinearity ( f and h). Note that in models MNL and DNL , the nonlinearity precedes the low-pass filtering in the feedback path, whereas in models MLN and DLN , the low-pass filtering precedes the nonlinearity.
the control loop consists of a concatenation of an instantaneous nonlinearity (functions f and h) and a linear low-pass filter LPτ of first order with time constant τ. For the two models on the left in Figure 1 (MNL and DNL), the control loop consists of a nonlinearity followed by a low-pass filter LPτ, whereas for the two models on the right in Figure 1 (MLN and DLN), the control loop consists of a low-pass filter LPτ followed by a nonlinearity. Models of the general structure shown in Figure 1 have been shown to explain various aspects of processing in the visual system. For instance, Crevier and Meister (1998) analyzed a multiplicative feedback model of the type of Figure 1B to account for period doubling in the electroretinogram response to periodic flashes. A spatiotemporal extension of this model (Berry, Brivanlou, Jordan, & Meister, 1999) explained retinal responses of salamanders and rabbits to flashed and moving bars (see also Wilke et al., 2001). Snippe et al. (2004) showed that divisive feedback models of a type similar to Figure 1C can explain the asymmetric dynamics of gain control after increments and decrements of contrast. Divisive feedback has also been used to model the dynamics of adaptation in the auditory system (Dau, Püschel, & Kohlrausch, 1996). For simplicity, in the analysis of the models in
Figure 1, we assume that the inputs I (t) are nonnegative: I (t) ≥ 0. This is reasonable, for example, for models of light adaptation in which the input signal (luminance) is nonnegative. To describe contrast-gain control (e.g., Berry et al., 1999), the input signal I (t) = C(t)s(t) can be written as the product of a carrier signal s(t) and a contrast envelope C(t), with C(t) ≥ 0. Because the carrier signal s(t) can be negative, the input I (t) does not satisfy I (t) ≥ 0 in this case. However, for contrast-gain control, the effects of the carrier s(t) are expected to be largely demodulated in the control loop (Snippe, Poot, & van Hateren, 2000). Thus, the dynamics of the control loop is governed mainly by the dynamics of the contrast envelope C(t), which does satisfy nonnegativity: C(t) ≥ 0. Hence, also for contrast-gain control, the assumption that the input is nonnegative is not a severe restriction. Due to the first-order filtering in the models of Figure 1, the dynamics of these models are described by first-order differential equations. (See van Hateren, 2005, for a discussion on low-pass filtering in biological systems, and examples of biochemical processes that can be described as low-pass filters.) In general, the output y(t) of a first-order low-pass filter with time constant τ is related to the input x(t) to the filter through a first-order differential equation τ dy/dt = x − y (Oppenheim, Willsky, & Young, 1983). Thus, the dynamics of the gain signal g(t) in model MNL (see Figure 1A) is described by the first-order differential equation,
\tau \frac{dg}{dt} = f(O) - g = f(gI) - g, \qquad (2.1)
since the input to the low-pass filter in the control loop of model MNL is f (O), and the argument O of the nonlinearity f equals g I , due to the multiplicative nature of the gain control of model MNL . Note that due to this multiplicative operation, equation 2.1 is a nonlinear differential equation even when the function f is linear. The dynamics of the gain signal g(t) for a given input I (t) is obtained by solving equation 2.1 (either analytically or numerically). From the solution g(t) for the gain, the output O(t) can be obtained through O(t) = g(t)I (t). When the input I is constant, I = Is , the system attains its steady state, g = gs and O = Os . The input-output relation for the steady state can be obtained by setting the time derivative dg/dt in equation 2.1 equal to zero, yielding gs = f (Os ). With Os = gs Is , hence gs = Os /Is , this yields the steady-state relation Os /Is = f (Os ), hence Is = Os / f (Os ). To obtain a system in which the dynamic range of Os is reduced relative to the dynamic range of Is , we require that the steady-state gain gs = f (Os ) is a decreasing function of its argument, that is, the function f (x) must satisfy d f (x)/d x < 0. Note that this requirement is sufficient to guarantee a monotonic steady-state relation Is = Os / f (Os ): with increasing Os , f (Os ) decreases, hence Is increases.
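For concreteness, the relaxation of equation 2.1 toward this steady state can be integrated numerically. The sketch below is our illustration, not part of the original analysis; the Euler scheme, the linear choice of f, and the step input are assumptions made for the example.

```python
import numpy as np

def simulate_MNL(I, f, tau=1.0, dt=1e-3, g0=1.0):
    # Euler integration of tau * dg/dt = f(g*I) - g (equation 2.1).
    # I is an array of input samples, one per time step dt.
    g = np.empty_like(I)
    g[0] = g0
    for n in range(1, len(I)):
        g[n] = g[n - 1] + (dt / tau) * (f(g[n - 1] * I[n - 1]) - g[n - 1])
    return g, g * I  # gain g(t) and output O(t) = g(t) * I(t)

# Linear feedback nonlinearity f(x) = 1 - x, which satisfies df/dx < 0
# as required for range compression.
t = np.arange(0.0, 4.0, 1e-3)
I = np.where(t < 2.0, 0.25, 4.0)  # increment step from 1/4 to 4 at t = 2
g, O = simulate_MNL(I, lambda x: 1.0 - x)
```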
For model DNL (see Figure 1C), the dynamics of the attenuation signal a(t) is described by the differential equation

\tau \frac{da}{dt} = h(O) - a = h(I/a) - a. \qquad (2.2)
From the solution a(t) of equation 2.2 for a given input I(t), the output O(t) of model DNL is obtained as O(t) = I(t)/a(t). In steady state (da/dt = 0), equation 2.2 yields a_s = h(O_s). With O_s = I_s/a_s, hence a_s = I_s/O_s, this gives the steady-state relation I_s = O_s h(O_s). To obtain a reduction of the dynamic range of O_s relative to the dynamic range of I_s, we require that the steady-state attenuation a_s = h(O_s) is an increasing function of its argument, that is, dh(x)/dx > 0. This guarantees a monotonic steady-state relation I_s = O_s h(O_s). Note that this steady-state relation becomes identical to that of the multiplicative model MNL when the nonlinearities f(O) of model MNL and h(O) of model DNL are related as h(O) = 1/f(O). A differential equation that describes the dynamics of model MLN (see Figure 1B) can be obtained for the output γ(t) of the low-pass filter LPτ in the feedback loop,

\tau \frac{d\gamma}{dt} = O - \gamma = gI - \gamma = f(\gamma)I - \gamma, \qquad (2.3)
because the gain g is related to γ as g = f(γ). From the solution γ(t) of equation 2.3 for a given input I(t), one obtains the dynamic gain g(t) as g(t) = f(γ(t)), and thus the output O(t) = g(t)I(t). In steady state (dγ/dt = 0), equation 2.3 yields γ_s = O_s = f(O_s)I_s, hence I_s = O_s/f(O_s), which is identical to the steady-state behavior of model MNL. As in model MNL, for model MLN we require that f be a decreasing function of its argument. Finally, the dynamics of model DLN (see Figure 1D) follows from the differential equation for the output α(t) of the low-pass filter in model DLN:

\tau \frac{d\alpha}{dt} = O - \alpha = \frac{I}{a} - \alpha = \frac{I}{h(\alpha)} - \alpha. \qquad (2.4)
From the solution α(t) of this differential equation, one obtains the attenuation a (t) = h(α(t)), and thus the output O(t) = I (t)/a (t). In steady state (dα/dt = 0), equation 2.4 produces the steady-state input-output relation Is = Os h(Os ), which is identical to the steady-state equation obtained for model DNL . As in model DNL , we require that h be an increasing function of its argument.
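Collected in one place, the four first-order dynamics differ only in their right-hand sides. The following minimal sketch (ours, for illustration; the function names are our own) writes each of equations 2.1 to 2.4 as a function suitable for any first-order ODE integrator:

```python
# Right-hand sides of equations 2.1-2.4; each model obeys tau * dx/dt = rhs.
# f must be a decreasing function (multiplicative models M_NL, M_LN);
# h must be an increasing function (divisive models D_NL, D_LN).
def rhs_MNL(g, I, f):      return f(g * I) - g            # eq. 2.1, state g
def rhs_DNL(a, I, h):      return h(I / a) - a            # eq. 2.2, state a
def rhs_MLN(gamma, I, f):  return f(gamma) * I - gamma    # eq. 2.3, state gamma
def rhs_DLN(alpha, I, h):  return I / h(alpha) - alpha    # eq. 2.4, state alpha
```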
2.1 The Nonlinearities f(x) and h(x) Need Not Be Positive for All x.

Although for simplicity we assumed that the input I is nonnegative, I ≥ 0, a similar assumption for the nonlinearities f and h in the feedback loops of the models in Figure 1 would be too restrictive, since this would leave out viable and interesting models for gain control. Thus, we allow the functions f(x) in multiplicative control and h(x) in divisive control to be negative for certain ranges of their argument x. Since f(x) is assumed to be decreasing, this means that we allow functions f(x) that are positive in a range 0 < x < x0, become zero at x = x0, and are negative for x > x0, with df/dx < 0 over the entire range 0 < x < ∞. An example of such a function f would be a linearly decreasing function f(x) = k(x0 − x), with k and x0 positive constants. For a linear function f(x), the ordering of f and the low-pass filtering in Figure 1 is arbitrary, and models MNL and MLN become identical. Solving the steady-state relation I_s = O_s/f(O_s) with respect to O_s for this f yields O_s = x0 k I_s/(1 + k I_s), a Naka-Rushton relationship (Heeger, 1992). Note that x0 represents the largest steady-state output O_s that can be obtained in this case and that f(O_s) ≥ 0 for all steady inputs I_s. In a dynamic situation, however, after sudden increases in the input I(t), the output O(t) can become larger than x0, and f(O) is temporarily negative. This negative value of the function f helps to quickly reduce the gain g (through equation 2.1), which reduces the output O back to values below x0. A comparison of the dynamics of adaptation of such a multiplicative feedback system (with k = x0 = 1) after increments and decrements of the input is shown in Figure 2A. Note that despite the linearity of f, the dynamics is asymmetric for increments versus decrements. Allowing negative values for f(x) in multiplicative control can thus be beneficial in that it speeds up adaptation after increases in I(t). Nevertheless, in comparing multiplicative and divisive control, some care is required when we allow f(x) in multiplicative control to be zero for x = x0 and negative for x > x0. This means that the corresponding h(x) = 1/f(x) in divisive control is monotonically increasing up to x = x0, where it becomes infinite. For x > x0, h(x) becomes negative, and thus h(x) is nonmonotonic in the range 0 < x < ∞. For model DLN (see Figure 1D), this is actually not a problem, since for this model, x < x0 also in dynamic situations. An example of adaptation for such a model DLN (again with a Naka-Rushton steady state) is shown in Figure 2B. On the other hand, for model DNL (see Figure 1C), the discontinuity in the function h(x) could become a problem for strong increments at the input. The solution to this problem is to assume that the output O(t) for a divisive control system with such an h(x) in the feedback path remains smaller than x0 also in dynamic situations, so h(x) remains positive at all t. It can be shown that this is indeed what happens when the input I(t) to the system is a continuous function of time, as will always be the case in practice. An example of adaptation in such a model DNL (again with a Naka-Rushton steady state) is shown in Figure 2C. Note the strong asymmetry of the dynamics of this model for increments versus decrements of the input.
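As a quick numerical sanity check (ours, not from the original text) of this closed form, one can verify that O_s = x0 k I_s/(1 + k I_s) satisfies the steady-state relation I_s = O_s/f(O_s) for f(x) = k(x0 − x):

```python
import numpy as np

k, x0 = 1.0, 1.0
Is = np.linspace(0.01, 10.0, 100)
Os = x0 * k * Is / (1.0 + k * Is)   # claimed Naka-Rushton steady state
# Steady-state relation of model M_NL: I_s = O_s / f(O_s), f(x) = k*(x0 - x).
assert np.allclose(Os / (k * (x0 - Os)), Is)
```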
Figure 2: (A) Dynamics of the gain g(t) in a multiplicative feedback control (with time constant τ = 1) with a linear function f(x) = 1 − x in the feedback path, which yields a Naka-Rushton relation O_s = I_s/(1 + I_s) for the steady state. The solid line is g(t) in response to an increment step of I (from 1/4 to 4). The dashed line is g(t) after a decrement step of I (from 4 to 1/4). For linear f(x), models MNL and MLN are identical. (B) Dynamics of the attenuation a(t) of model DLN (see Figure 1D; τ = 1) when h(α) = 1/(1 − α), which yields a Naka-Rushton steady state O_s = I_s/(1 + I_s). Input steps are identical to A. (C) Dynamics of a(t) of model DNL (see Figure 1C; τ = 1) when h(O) = 1/(1 − O). Steady state of this model is identical to A and B, but the dynamics is very different. (D) Dynamics of a(t) of model DNL (see Figure 1C; τ = 1) when h(O) = 1 − (1/O), which yields O_s = 1 + I_s in steady state.
Likewise, in divisive control we allow increasing functions h (x) that are negative for x < x0 and positive for x > x0 . For such systems, steady-state output Os is never smaller than x0 ; hence, h(Os ) ≥ 0. When Is = 0, Os = x0 . Thus, for such systems, x0 can be thought of as a spontaneous activity. After large decrements of I (t), the output O(t) can temporarily be smaller than this spontaneous activity x0 of the system; hence, h(O) < 0, which speeds up adaptation according to equation 2.2. An example is shown in Figure 2D. For the corresponding multiplicative control, with f (x) = 1/h(x), f (O) becomes very large when the output O approaches the spontaneous
activity x0, and for continuous inputs I(t), this effect is sufficiently strong to keep O(t) > x0 at all times t.

2.2 Identity of Models MLN and DLN.

Equations 2.3 and 2.4 yield identical dynamics when the nonlinearities f in equation 2.3 and h in equation 2.4 are related as h = 1/f. This simply says that multiplying an input I with a gain g to obtain an output O = gI is identical to dividing I by the inverse gain (attenuation) 1/g: O = I/[1/g]. Thus, when the nonlinear functions f and h are related as h = 1/f, model MLN and model DLN in Figure 1 yield identical responses if they are stimulated with the same inputs. Figures 2A and 2B provide an example: the dynamic attenuation a(t) of model DLN in Figure 2B equals the inverse of the dynamic gain g(t) of model MLN in Figure 2A, that is, a(t) = 1/g(t).

2.3 Duality of Models MNL and DNL.

Likewise, models MNL and DNL are closely related, but in this case, the relationship amounts to a duality rather than an identity. For the duality, the nonlinearities f in equation 2.1 and h in equation 2.2 have to be related through an inversion of their argument: h(x) = f(1/x). In that case, when model DNL is stimulated with an input I_D(t) = 1/I_M(t) that is the inverse of the input I_M(t) to model MNL, the dynamics 2.2 for the attenuation a(t) becomes

\tau \frac{da}{dt} = h(I_D/a) - a = f(a/I_D) - a = f(I_M a) - a, \qquad (2.5)
which is identical to the dynamics of g(t) in equation 2.1. Thus, under these conditions, the dynamic attenuation a(t) of model DNL equals the dynamic gain g(t) of model MNL: a(t) = g(t). Then the output O_D(t) of model DNL is related to the output O_M(t) of model MNL through an inversion: O_D = I_D/a = 1/[I_M g] = 1/O_M. Figures 2A and 2D provide an example of this duality of models DNL and MNL. The attenuation a(t) in Figure 2D after an increment step of the input I (from 1/4 to 4) equals the gain g(t) in Figure 2A after a decrement step of I (from 4 to 1/4). Likewise, a(t) in Figure 2D after a decrement step of I equals g(t) after an increment step in Figure 2A. Thus, when the nonlinearities f and h are related through an inversion of their argument, h(x) = f(1/x), the models MNL and DNL are dual in the following sense: when model MNL transforms an input I(t) into an output O(t), model DNL will transform the input 1/I(t) into the output 1/O(t). Of course, the duality also works in the other direction, from model DNL to model MNL. Apart from the theoretical interest of this duality between multiplicative and divisive control, the duality can be useful in applications. When the mathematics or the statistics of 1/I(t) are simpler than those of the input I(t) itself, it makes sense to do the calculations in the dual model. Stimulating the dual model with an input 1/I(t) and inverting the resulting output of the dual model yields the output O(t) of the original model.
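The duality is straightforward to confirm numerically. The sketch below is ours; the particular choices f(x) = 1/x², the step input, and the Euler integrator are assumptions made for the example. Both models are started in their common steady state g_s = a_s = I^(−2/3):

```python
import numpy as np

def euler(rhs, x0, I, dt=1e-3, tau=1.0):
    # Generic Euler integrator for tau * dx/dt = rhs(x, I).
    x = np.empty_like(I)
    x[0] = x0
    for n in range(1, len(I)):
        x[n] = x[n - 1] + (dt / tau) * rhs(x[n - 1], I[n - 1])
    return x

f = lambda x: 1.0 / x**2              # decreasing f for model M_NL
h = lambda x: x**2                    # dual nonlinearity h(x) = f(1/x)

t = np.arange(0.0, 10.0, 1e-3)
I_M = np.where(t < 5.0, 0.25, 4.0)    # step input to M_NL
I_D = 1.0 / I_M                       # inverted input to D_NL

x0 = I_M[0] ** (-2.0 / 3.0)           # common steady state g_s = a_s
g = euler(lambda g, I: f(g * I) - g, x0, I_M)    # eq. 2.1
a = euler(lambda a, I: h(I / a) - a, x0, I_D)    # eq. 2.2

assert np.allclose(a, g)              # a(t) = g(t), hence O_D(t) = 1/O_M(t)
```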
3 Stability

As is well known, feedback can lead to instabilities (Bechhoefer, 2005). Here we show that this is not the case for the feedback systems studied in this letter. The basic reason for this is that the low-pass filters in the feedback loops of the schemes in Figure 1 are first order, that is, they are described by the first-order differential equations 2.1 to 2.4. A first-order low-pass filter has an exponential pulse response that starts to operate as soon as a disturbance occurs, that is, without the kind of delay that could yield instabilities (Bechhoefer, 2005).

3.1 The Models Are Stable for Constant Input.

To see more formally that the feedback schemes studied in this letter are stable, first assume that the input signal I(t) is constant, I(t) = I_s. For model MNL, the corresponding steady-state gain g_s is obtained by setting the time derivative dg/dt in equation 2.1 to zero; hence, f(g_s I_s) − g_s = 0. Now assume that at some moment, the actual gain g is larger than the steady-state gain g_s, g > g_s. Since the function f is a decreasing function of its argument, we have f(g I_s) < f(g_s I_s), and therefore f(g I_s) − g < f(g_s I_s) − g_s = 0. Hence, it follows from equation 2.1 that dg/dt < 0. Thus, g will decrease from its initial value (which is larger than g_s) toward the steady-state value g_s. Likewise, when g is initially smaller than g_s, g will increase toward the steady-state value g_s. Hence, for a constant input I, any deviations from steady state will always decrease over time; that is, for constant input I, the model MNL is stable. Similar reasoning shows that for constant input, the other models in Figure 1 are also stable.

3.2 The Models Are Stable for Arbitrary, Bounded Input.

Now assume that the input I(t) is not necessarily constant, but that it is bounded, say, between a lower bound I− and an upper bound I+, that is, at all times t: I− ≤ I(t) ≤ I+. For such bounded inputs, the systems studied here cannot become unstable; their control signals (and hence their outputs) will remain finite. For instance, consider model DNL, in which the dynamics of the control (attenuation) signal a is described by equation 2.2. Now take any moment t0 in time. At that moment, the input to the system is I0 = I(t0), to which corresponds a steady-state value a_s for which (from equation 2.2) h(I0/a_s) − a_s = 0. Assume that at the moment t0, the actual control signal a(t0) is larger than this steady-state value a_s. Then, since the function h(x) is an increasing function of its argument x, and hence h(I/a) decreases with increasing a, we can be sure that when a > a_s, h(I0/a) − a < h(I0/a_s) − a_s = 0. Thus, from equation 2.2, da/dt < 0, that is, an attenuation signal that is too large (relative to the instantaneous steady state) is guaranteed to decrease. Similarly, if at any moment the attenuation signal a is too small (relative to the instantaneous steady state), it will increase. In particular, if at some initial time t0 the attenuation signal a would be larger than the
Figure 3: Stability of feedback control. Different initial values for the control signal (dashed lines) converge to identical solutions. Shown is an example of model DNL, with h(O) = O² and τ = 3. Input I(t) = Ī(1 + sin ωt), with Ī = ω = 1. For constant input I(t) = I_s, steady-state attenuation a_s = I_s^(2/3) in this model. The continuous line is a plot of the instantaneous steady state a_s(t) ≡ (I(t))^(2/3), which can be compared with the dynamics of the actual attenuation a(t). Note that da/dt < 0 when a is above the instantaneous steady state, da/dt > 0 when a is below steady state, and da/dt = 0 whenever a(t) crosses steady state.
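The convergence shown in Figure 3 is easy to reproduce; the following sketch (ours) integrates equation 2.2 for the same setup, h(O) = O², τ = 3, and a deliberately mismatched initial condition:

```python
import numpy as np

tau, dt = 3.0, 1e-3
t = np.arange(0.0, 10.0, dt)
I = 1.0 + np.sin(t)                    # I(t) = 1 + sin(t), as in Figure 3
a = np.empty_like(t)
a[0] = 2.0                             # start far above the steady state
for n in range(1, len(t)):             # eq. 2.2 with h(O) = O**2
    a[n] = a[n - 1] + (dt / tau) * ((I[n - 1] / a[n - 1])**2 - a[n - 1])
a_s = I ** (2.0 / 3.0)                 # instantaneous steady state a_s(t)
# a(t) relaxes toward a_s(t) and remains bounded, as argued in section 3.2.
```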
steady-state value a_s+ corresponding to the upper bound I+, a(t) is guaranteed to decrease monotonically until it becomes smaller than a_s+. Likewise, if at time t0 the attenuation signal a is smaller than the steady-state value a_s− corresponding to the lower bound I−, a(t) will increase monotonically until it becomes larger than a_s−. And if initially a_s− < a(t0) < a_s+, then a_s− < a(t) < a_s+ for all times t > t0. Figure 3 illustrates this behavior for a specific choice of the nonlinearity, h(x) = x².

3.3 Monotonic Input Yields Monotonic Control Signals.

The reasoning we used above to demonstrate the stability of the feedback schemes of Figure 1 can also be used to derive general qualitative properties of the control signals in these schemes. For instance, assume that the input I(t) is a monotonically increasing function of time for times t > t0 and that at time t0, the attenuation signal a(t0) of model DNL is not larger than the steady-state level a_s corresponding to the input I(t0), a(t0) ≤ a_s. Then we can be sure that the divisive control signal a(t) is also monotonically increasing, without any overshoots or oscillations. If a(t) were nonmonotonic, there would be times t1 > t0 for which da/dt < 0. According to equation 2.2, this means that at that time t1, the attenuation signal has to be larger than the steady state corresponding to the input I(t1) at time t1. But this means that a(t) must have crossed its steady state at some moment t between t0 and t1. At that hypothetical moment, according to equation 2.2, da/dt = 0. Thus, a crossing of a(t) with its instantaneous steady-state value a_s(t) would be possible
Figure 4: Attenuation a(t) (dashed line) of the model of Figure 3, when I(t) is monotonic (a series of steps in this example). The attenuation cannot overshoot its steady-state solution a_s = I^(2/3) (continuous line), since da/dt becomes 0 when a approaches steady state. Hence, when a(t) is initially below this line, it cannot escape from the region below the continuous line. Since da/dt > 0 when a < a_s, a monotonic input I(t) yields a monotonic control signal a(t).
only if dI/dt < 0 at that moment (as can be observed in Figure 3), contrary to the assumption of a monotonically increasing I(t). Figure 4 illustrates the behavior of a(t) for a specific choice of I(t) and the nonlinearity h, but the conclusion is general. For the systems in Figure 1, monotonic inputs I(t) yield control signals a(t), respectively g(t), that are monotonic; they do not overshoot or oscillate.

4 Speed of Adaptation After Steps in the Input

Here we look at the speed of adaptation of the models in Figure 1 immediately after a step in the input I(t). Steps in the input occur in natural stimuli (e.g., Griffin, Lillholm, & Nielsen, 2004) and are often used to study the dynamics of adaptation (e.g., Crawford, 1947; Thorson & Biederman-Thorson, 1974; Snippe et al., 2000; Fairhall, Lewen, Bialek, & de Ruyter van Steveninck, 2001; Rieke, 2001). Without loss of generality, the moment of the step in the input is chosen as t = 0. For an increment step, the input at t = 0 switches from a value I− to a higher value I+ > I−. We assume that for t < 0, the input has been at I− for a sufficiently long time such that at time t = 0, the system is in the steady state corresponding to input I−, that is, the system is fully adapted to I−. When the input step at t = 0 occurs, the system starts to adapt to the new input I+. Likewise, for a decrement step I+ → I− at time t = 0, we assume that at t = 0, the system is fully adapted to the input I+. We compare the initial speed of adaptation for the models in Figure 1 immediately after increment steps I− → I+ and decrement steps I+ → I−. Speed of adaptation is quantified using equations 2.1 to 2.4, which
give the speed with which the control loops in Figure 1 adjust after the step in the input.

4.1 Comparison of Multiplicative and Divisive Control

4.1.1 Models MLN and DLN Have Identical Dynamics.

As noted in section 2.2, the multiplicative model MLN in Figure 1B and the divisive model DLN in Figure 1D become identical when the nonlinearities f and h in the control paths are related as h = 1/f. When f and h are related in this way, the attenuation signal a(t) in model DLN is identical to the inverse 1/g(t) of the gain signal g(t) in model MLN. Thus, the dynamics of a(t) and 1/g(t) are identical: da/dt = d[1/g]/dt, which means that the speed of adaptation in models MLN and DLN is identical.

4.1.2 Large Differences in Dynamics for Models MNL and DNL.

We find large differences when we compare adaptation in models MNL (see Figure 1A) and DNL (see Figure 1C) in which the nonlinearities f, respectively h, precede the low-pass filtering in the feedback loop. It is still the case that we can equate the steady-state behavior of models MNL and DNL by choosing h = 1/f, which yields a_s = 1/g_s. However, even in this case, a(t) and 1/g(t) after a step in the input I(t) are different. For model DNL, the dynamics of a(t) is given by equation 2.2. For model MNL, the dynamics of g(t) is given by equation 2.1, which yields for the inverse gain 1/g(t) immediately after the step in I(t):

\tau \frac{d[1/g]}{dt} = -\frac{\tau}{(g_s)^2}\frac{dg}{dt} = \frac{1}{g_s} - \frac{f(g_s I)}{(g_s)^2} = a_s - \frac{(a_s)^2}{h(I/a_s)}, \qquad (4.1)
in which we used the relation f = 1/h, and the fact that immediately after the step in I(t), the gain g and the attenuation a still have their steady-state values, and hence are related as g_s = 1/a_s. Comparing result 4.1 with the result τ da/dt = h(I/a_s) − a_s of equation 2.2 immediately after a step in the input I, we find

\tau \frac{da}{dt} - \tau \frac{d[1/g]}{dt} = h(I/a_s) - 2a_s + \frac{(a_s)^2}{h(I/a_s)} = \frac{[h(I/a_s) - a_s]^2}{h(I/a_s)}. \qquad (4.2)
After an increment step in I (t) (when da /dt, d[1/g]/dt and h(I /a s ) are all positive), it follows from equation 4.2 that da /dt > d[1/g]/dt. Thus, immediately after an increment step of the input, adaptation in the divisive feedback model DNL is faster than adaptation in the multiplicative feedback model MNL . In fact, this is true not only immediately after the increment step, but also at later times t, in the sense that for all t > 0, a (t) > 1/g(t).
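Numerically, the inequality a(t) > 1/g(t) after an increment step can be checked directly. The sketch below is ours and uses the quadratic example of Figure 5, h(O) = O² and f(O) = 1/O², with τ = 3 and a step from 1 to 8; the Euler scheme and step sizes are assumptions of the example:

```python
import numpy as np

tau, dt = 3.0, 1e-3
t = np.arange(0.0, 6.0, dt)
I = np.full_like(t, 8.0)               # increment step from 1 to 8 at t = 0
a = np.empty_like(t); g = np.empty_like(t)
a[0] = 1.0                             # adapted to I = 1: a_s = 1**(2/3) = 1
g[0] = 1.0                             # and g_s = 1/a_s = 1
for n in range(1, len(t)):
    a[n] = a[n-1] + (dt/tau) * ((I[n-1] / a[n-1])**2 - a[n-1])       # eq. 2.2
    g[n] = g[n-1] + (dt/tau) * (1.0 / (g[n-1] * I[n-1])**2 - g[n-1]) # eq. 2.1
# Divisive control adapts faster: a(t) > 1/g(t) for all t > 0.
assert np.all(a[1:] >= 1.0 / g[1:])
```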
Figure 5: Dynamics of adaptation for the divisive model DNL (dashed line) and the multiplicative model MNL (dotted line), when h(O) = 1/f(O) = O² and τ = 3. Input I(t) is an increment step from 1 to 8 at time t = 0 (A) and a decrement step from 8 to 1 (B). Both models have identical steady state, a_s = 1/g_s = I^(2/3) (continuous line), but strongly different dynamics.
Thus, after an increment step, the multiplicative gain control never catches up with the divisive gain control. The proof of this assertion is easy: suppose that at some time t = t*, the multiplicative control would have just caught up with the divisive control, that is, suppose that a(t*) = 1/g(t*), with a(t) > 1/g(t) for 0 < t < t*. Then it follows from equation 4.2 evaluated at t = t* that da/dt > d[1/g]/dt immediately before t = t*; thus, a < 1/g immediately before t = t*. This contradicts the original assertion that a(t) > 1/g(t) for 0 < t < t*; hence, a(t) > 1/g(t) for all t > 0. Thus, after an increment step in I(t), adaptation in the divisive control system DNL is consistently faster than adaptation in the multiplicative control system MNL. On the other hand, after a decrement step in I(t) (when both da/dt and d[1/g]/dt are negative), equation 4.2 yields |d[1/g]/dt| > |da/dt|, assuming that h(I/a_s) > 0 after the step. In fact, the assumption h(I/a_s) > 0 is not really needed in this case, since we argued in section 2.1 that when h(x) becomes negative in a divisive control, the gain g in the corresponding multiplicative control with f(x) = 1/h(x) changes very rapidly in order to keep the output O(t) in its required range; hence, |d[1/g]/dt| > |da/dt| also when h(I/a_s) < 0. Thus, after a decrement step, adaptation in the multiplicative model MNL is faster than adaptation in the divisive model DNL. An example is given in Figure 5 using a quadratic nonlinearity h(O) = O² for the divisive control and hence f(O) = 1/h(O) = 1/O² for the multiplicative control, with identical time constant τ of the low-pass filtering in the control loops. The example shows that the differences in speed of adaptation between multiplicative and divisive control can be large.

4.2 The Order of the Nonlinearity and the Low-Pass Filtering in the Feedback Can Have Strong Effects.

In the previous section, we showed
that the dynamics of divisive and multiplicative control can be very different, even under conditions where the steady-state behaviors of these systems are identical. Here we show that also for a specific choice of control, either divisive or multiplicative, the order of the two operations, nonlinearity and low-pass filtering, in the feedback loops of Figure 1 can have a strong effect on the dynamics of the control system. First, we look at divisive gain control, and compare model DNL (Figure 1C) in which the nonlinearity h precedes the low-pass filtering in the feedback loop with model DLN (Figure 1D) in which the nonlinearity h occurs after the low-pass filtering. If the function h(x) is linear, h(x) = c + kx, models DNL and DLN are identical, since the order of two linear operations (here, low-pass filtering and h(x) = c + kx) has no effect on the outcome. Thus, for linear h(x), models DNL and DLN have identical dynamics. This, however, is not the case when h(x) is nonlinear. As will be argued below, in that case, the speed of adaptation of model DNL relative to model DLN is determined by the curvature of h(x), that is, by its second-order derivative d²h(x)/dx². If h(x) is an expansive function of x (i.e., if it has a positive second-order derivative everywhere), adaptation after an increment step I− → I+ is faster in model DNL than in model DLN. For model DNL, speed of adaptation is governed by equation 2.2; immediately after the increment step,

\tau \frac{da}{dt} = h(I_+/a_-) - a_- = h(I_+/a_-) - h(I_-/a_-) \qquad (\text{Model DNL}), \qquad (4.3)
where a_− is the steady-state attenuation corresponding to input I−, and we have used the fact that in steady state, a_− = h(I−/a_−). On the other hand, for model DLN, adaptation is governed by equation 2.4, which for the attenuation signal a = h(α) yields

\tau \frac{da}{dt} = \left.\frac{dh(x)}{dx}\right|_{x=I_-/a_-} \cdot \tau\frac{d\alpha}{dt} = \left.\frac{dh(x)}{dx}\right|_{x=I_-/a_-} \cdot \left[\frac{I_+}{a_-} - \alpha_-\right] = \left.\frac{dh(x)}{dx}\right|_{x=I_-/a_-} \cdot \left[\frac{I_+}{a_-} - \frac{I_-}{a_-}\right] \qquad (\text{Model DLN}), \qquad (4.4)
where in the last step, we used the fact that in steady state, α_− = I−/a_−. Because of the factor dh/dx in equation 4.4, adaptation in model DLN is governed by a linear extrapolation of the function h(x). As illustrated in Figure 6, for an expansive function h(x), that is, when d²h(x)/dx² > 0, the outcome of equation 4.3 is larger than the outcome of equation 4.4. Thus, immediately after an increment step, adaptation in model DNL is faster than in model DLN. The intuition for this difference in behavior of the two models is that in model DNL, the overshoot in the output O(t) immediately after the increment is accentuated by an expansive function h(O), which
Figure 6: Comparison of adaptation in models DNL and DLN immediately after an increment step I− → I+ of the input. For model DNL , the dynamics of adaptation immediately after the step is governed by the difference of the values of the nonlinearity h(O) at the outputs before and after the input step. For model DLN , the dynamics of adaptation immediately after the step is governed by a linear extrapolation of h (dotted line). When, as in this example, the function h(O) has positive curvature, adaptation after an increment step is faster in model DNL than in model DLN .
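The comparison in Figure 6 can be evaluated directly from equations 4.3 and 4.4; the following sketch (ours) uses the expansive choice h(x) = x² and an increment step from 1 to 8, both assumptions of the example:

```python
# Initial adaptation speed after an increment step I- -> I+ (tau = 1),
# for an expansive nonlinearity h(x) = x**2 (so dh/dx = 2x, d2h/dx2 > 0).
I_minus, I_plus = 1.0, 8.0
a_minus = I_minus ** (2.0 / 3.0)      # common steady state of both models at I-
x = I_minus / a_minus                 # operating point before the step
rate_DNL = (I_plus / a_minus) ** 2 - (I_minus / a_minus) ** 2   # eq. 4.3
rate_DLN = 2.0 * x * (I_plus / a_minus - I_minus / a_minus)     # eq. 4.4
assert rate_DNL > rate_DLN  # expansive h: D_NL adapts faster after an increment
```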
provides a powerful drive to the low-pass filter, and hence fast adaptation. As in section 4.1, it is easy to prove that the difference in adaptation persists for all t > 0; after an increment step in the input, adaptation in model DNL is consistently faster than adaptation in model DLN when h(x) is expansive. On the other hand, when h(x) is compressive, that is, when d²h(x)/dx² < 0, the linear extrapolation of equation 4.4 exceeds the result of equation 4.3, and adaptation in model DLN after an increment step is faster than adaptation in model DNL. After a decrement step I+ → I−, it works just the other way around: when h(x) is expansive, adaptation in model DLN is faster than in model DNL, and when h(x) is compressive, adaptation in model DNL is faster than in model DLN. For multiplicative gain control, similar results hold when comparing adaptation in model MNL (see Figure 1A) with adaptation in model MLN (see Figure 1B). Models MNL and MLN are identical in their dynamics when f(x) is a linear function f(x) = c − kx (with k > 0 to obtain range compression). When f(x) is a nonlinear function, the dynamics of adaptation in models MNL and MLN differ, and again the sign of the difference in speed of adaptation is governed by the curvature of f(x), that is, by its second-order derivative d²f(x)/dx². If d²f(x)/dx² > 0, adaptation after an increment step I− → I+ is faster in model MLN than in model MNL, whereas after a decrement step I+ → I−, adaptation is faster in model MNL than in model MLN. Figure 7 gives an example and shows that the difference in speed of adaptation can be substantial. If d²f(x)/dx² < 0, the situation
Figure 7: Comparison of adaptation in model MNL (dashed lines) and model MLN (dotted lines) when the nonlinearity in the feedback loop f(x) = 1/x² and τ = 3. The input I(t) is an increment step from 1 to 8 (A) and a decrement step from 8 to 1 (B). Note that models MNL and MLN have identical steady state g_s = I^(−2/3) (continuous lines) but strongly different dynamics.
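The difference shown in Figure 7 follows from integrating equations 2.1 and 2.3 with the same nonlinearity. A sketch of the increment-step case (ours; the Euler scheme and the parameters are taken from the figure caption):

```python
import numpy as np

f = lambda x: 1.0 / x**2               # d2f/dx2 > 0 (positive curvature)
tau, dt = 3.0, 1e-3
t = np.arange(0.0, 6.0, dt)
I = np.full_like(t, 8.0)               # increment step from 1 to 8 at t = 0
g_nl = np.empty_like(t); gam = np.empty_like(t)
g_nl[0] = 1.0                          # M_NL: g_s = f(g_s * 1) -> g_s = 1
gam[0] = 1.0                           # M_LN: gamma_s = f(gamma_s) * 1 -> 1
for n in range(1, len(t)):
    g_nl[n] = g_nl[n-1] + (dt/tau) * (f(g_nl[n-1] * I[n-1]) - g_nl[n-1])  # eq. 2.1
    gam[n]  = gam[n-1]  + (dt/tau) * (f(gam[n-1]) * I[n-1]  - gam[n-1])   # eq. 2.3
g_ln = f(gam)                          # gain of M_LN is g = f(gamma)
# Positive curvature of f: after an increment step, M_LN adapts faster (Fig. 7A).
```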
reverses, and adaptation after an increment step is faster in model MNL than in model MLN, whereas adaptation after a decrement step is faster in model MLN than in model MNL.

5 Asymmetries of Adaptation After Increments and Decrements of the Input

In a linear system, responses after increments and decrements of the input are equally fast. The systems we study here, however, are fundamentally nonlinear, and the speed with which they adapt to increment steps can be strongly different from the speed with which they adapt to decrement steps. For the models of Figure 1, such strong asymmetries of adaptation can be observed in Figures 5 and 7 by comparing the different panels. For contrast gain control, adaptation for an ideal observer has been predicted to be faster after increments of contrast than after decrements of contrast (DeWeese & Zador, 1998). This prediction has been confirmed for human subjects (Snippe & van Hateren, 2003; Snippe et al., 2004). This type of asymmetry can be observed in Figure 5 for model DNL and in Figure 7 for model MLN. Similarly, faster adaptation after contrast increments than after decrements has been found in ganglion cells of salamanders and rabbits (Smirnakis, Berry, Warland, Bialek, & Meister, 1997), in salamander bipolar cells (Rieke, 2001), and in motion-sensitive cells in the visual system of the fly (Fairhall et al., 2001). Here we study these increment and decrement asymmetries for the models in Figure 1. For each of the models of Figure 1, the model will adapt faster to increment stimuli for some nonlinearities in the feedback path, whereas for other nonlinearities, the model will adapt faster to decrement stimuli. However, despite their nonlinear nature, for
each of the models in Figure 1, there is a class of nonlinearities in the feedback path such that adaptation is initially equally fast after an increment step I− → I+ and after a decrement step I+ → I− of the input.

5.1 Increment and Decrement Symmetry for Model DLN Is Obtained When h(x) = c exp(kx).

For model DLN (see Figure 1D), the speed of adaptation immediately after an increment step I− → I+ of the input follows from equation 4.4; thus,

\tau \frac{da}{dt} = \tau \frac{dh(\alpha)}{dt} = \frac{dh(\alpha)}{d\alpha}\,\tau\frac{d\alpha}{dt} = \frac{dh(\alpha)}{d\alpha}\,[O - \alpha] = \frac{dh(\alpha)}{d\alpha}\left[\frac{I_+}{h(\alpha)} - \frac{I_-}{h(\alpha)}\right] = \frac{d\ln h(\alpha)}{d\alpha}\,[I_+ - I_-]. \qquad (5.1)
In the last step of equation 5.1 we have used the equality (1/h(x))dh(x)/d x = d ln h(x)/d x, and this derivative is evaluated at the steady-state value of α corresponding to the input I− before the increment step. After a decrement step I+ → I− , a similar derivation as in equation 5.1 shows that the speed of adaptation immediately after the step follows from τ da /dt = d ln h(α)/dα·[I− − I+ ], where the derivative d ln h(α)/dα is now evaluated at the steady-state value of α corresponding to the input I+ before the decrement step. Thus, in model DLN , the speed of adaptation da /dt immediately after increment and decrement steps will be identical (except for the sign, [I+ − I− ] versus [I− − I+ ]) if the derivative d ln h(α)/dα is identical at the points α corresponding to steady-state at input I− , respectively, I+ . This will be the case for any pair of values I+ , I− of the input I if the nonlinearity h(x) is such that d ln h(x)/d x = k, a constant (which must be chosen positive to obtain gain control), hence ln h(x) = kx + ln c, with c another positive constant. Thus, for model DLN , an exponential nonlinearity h(x) = c exp(kx) will yield equally fast adaptation immediately after increment and decrement steps. Due to the nonlinear nature of the model, this equality will not hold for adaptation at later times after the steps, but still it is remarkable that for this nonlinear model, there is a unique shape of the nonlinearity h(x) (i.e., an exponential) that yields symmetric increment and decrement adaptation immediately after the steps for any pair I+ , I− of input levels. Moreover, the structure of equation 5.1 for adaptation immediately after steps in the input leads to a simple rule for the sign of any increment or decrement asymmetries of adaptation in model DLN when h(x) is not an exponential function: the critical feature of h(x) is the behavior of d ln h(x)/d x as a function of x. If d ln h(x)/d x increases with x (i.e., when ln h(x) is an expansive function, d 2 ln h(x)/d x 2 > 0), adaptation immediately after a decrement step will be faster than adaptation after an increment step. This is because when ln h(x) is expansive, d ln h(x)/d x will be larger (hence, adaptation faster, according to equation 5.1) at the high
steady-state level x+ that is present immediately before a decrement step than at the low steady-state level x− that is present immediately before an increment step.

5.2 Increment and Decrement Symmetry for Model MLN Is Obtained When f(x) = √(c − kx).

For model MLN (see Figure 1B), the nonlinearity f(x) governs (a)symmetry of adaptation after increments versus decrements. Using equation 2.3, speed of adaptation of the gain signal g = f(γ) immediately after an increment step I− → I+ in the input is

\tau \frac{dg}{dt} = \tau \frac{df(\gamma)}{dt} = \frac{df(\gamma)}{d\gamma}\,\tau\frac{d\gamma}{dt} = \frac{df(\gamma)}{d\gamma}\,[O - \gamma] = \frac{df(\gamma)}{d\gamma}\,[f(\gamma)I_+ - f(\gamma)I_-] = \frac{1}{2}\frac{d[f(\gamma)]^2}{d\gamma}\,[I_+ - I_-], \qquad (5.2)
with the derivative d[f(γ)]²/dγ evaluated at the steady-state value of γ that corresponds to the input I− immediately before the input step. After a decrement step I+ → I−, the speed of adaptation immediately after the step is minus the result 5.2, but with the derivative now evaluated at the steady state corresponding to I+. Thus, the speed of adaptation immediately after increment and decrement steps is identical when d[f(x)]²/dx = −k, a constant (the minus sign indicates that k should be chosen positive to obtain multiplicative gain control). From this follows [f(x)]² = c − kx (with c another positive constant), hence f(x) = √(c − kx). For other choices of f(x), asymmetric adaptation of g(t) occurs after increments versus decrements, and the sign of the asymmetry is governed by the curvature (i.e., the second-order derivative) of [f(x)]². If [f(x)]² has positive curvature (i.e., when d²[f(x)]²/dx² > 0), adaptation of g(t) is faster after increment steps than after decrement steps, whereas the reverse holds when [f(x)]² has negative curvature. Due to the identity of models MLN and DLN noted in section 2.2, the results derived above for model DLN can be immediately translated into results for model MLN, and vice versa. For instance, symmetry of initial adaptation in the divisive model DLN was obtained for a nonlinearity h(x) = c exp(kx), which corresponds to f(x) = 1/h(x) = (1/c) exp(−kx) in the equivalent multiplicative model MLN. Since da/dt = d(1/g)/dt = −(1/g²) dg/dt, for this choice of f(x), the speed of adaptation of the gain g divided by the square of this gain is symmetric after increment and decrement steps of the input. Likewise, a choice h(x) = 1/√(c − kx) leads to a divisive model DLN in which (1/a²) da/dt is symmetric after increment and decrement steps. Finally, we ask if there is a nonlinearity f(x) = 1/h(x) such that there is symmetry of adaptation after increments and decrements of d ln g/dt = (1/g) dg/dt = −(1/a) da/dt = −d ln a/dt, that is, for the gain (or, equivalently, attenuation) signal plotted on a logarithmic axis. In fact,
a linear function f(x) = c − kx attains this, as can be immediately proved from equation 5.2. As was seen in section 2.1, a linear function f(x) = c − kx (or, equivalently, h(x) = 1/(c − kx) in model DLN) leads to a Naka-Rushton relation O_s(I_s) = c I_s/(1 + k I_s) for the steady-state output (Heeger, 1992).

5.3 Increment and Decrement Symmetry for Model MNL Is Obtained When f(x) = c ln(k/x), and for Model DNL When h(x) = c ln(kx).

Also for the models MNL (see Figure 1A) and DNL (see Figure 1C) in which the nonlinearities in the feedback loops precede the low-pass filtering, choices exist for these nonlinearities such that adaptation immediately after a step input is symmetric for increment versus decrement steps. Contrary to equations 5.1 and 5.2 for models DLN, respectively MLN, that have a differential expression in their right-hand side (which can immediately be solved), symmetry for models MNL and DNL leads to functional equations, which can be much harder to solve (Aczel & Dhombres, 1989). In appendix A we show how to derive these functional equations and how to solve them by transforming the functional equations into differential equations. Table 1 includes the results for the functions f(x) = c ln(k/x) of model MNL and h(x) = c ln(kx) of model DNL that yield increment and decrement symmetry of adaptation immediately after the input steps. When plotted on axes such that these resulting functions yield straight lines, the resulting asymmetries of adaptation for other functions f and h can be judged by their curvature.

6 Dynamics of the Output

So far we have concentrated on the dynamics of the control signals, a(t) and g(t), of the models in Figure 1. This is a sensible approach for applications in psychophysics where the control signal a(t), or g(t), determines the detectability of a test signal that is superimposed on a background I(t) (Snippe et al., 2004). Also in a neurophysiological context, control signals such as a(t) and g(t) will be important determinants of neural detectability, signal-to-noise ratio, and information transmission. Nevertheless, in physiological experiments, one often directly observes the output signal O(t) but not the control signal a(t) or g(t). Here we study the dynamical behavior of the output O(t) of the control loops in Figure 1. In particular, we are interested in whether, for monotonic input I(t), the output O(t) will also be monotonic, or whether O(t) can have overshoots or oscillations when I(t) is monotonic. Obviously, when I(t) increases very fast (e.g., step-wise), O(t) will have an overshoot, because the control signals in Figure 1 cannot completely follow these rapid changes in I(t) due to the low-pass filtering in the feedback loops. Likewise, when I(t) decreases very fast, O(t) will have an undershoot. Hence, to obtain simple (monotonic) behavior of O(t), restrictions on dI(t)/dt are needed. A reasonable choice to obtain such a restriction is to suppose, as illustrated in Figure 8, that the input I(t) to the gain control
Table 1: Feedback Nonlinearities Yielding Symmetry.

Model | dg/dt      | (1/g) dg/dt | (1/g²) dg/dt
MNL   | c ln(k/x)  | c − kx      | c − kx²
MLN   | √(c − kx)  | c − kx      | (1/c) exp(−kx)

Model | da/dt      | (1/a) da/dt | (1/a²) da/dt
DNL   | c ln(kx)   | c − (k/x)   | c − (k/x²)
DLN   | c exp(kx)  | 1/(c − kx)  | 1/√(c − kx)

Notes: Mathematical form of the nonlinearities in the feedback paths of the models in Figure 1 such that adaptation is equally fast (i.e., symmetric) immediately after increment steps I− → I+ and decrement steps I+ → I− of the input. The second column shows the nonlinearities necessary to obtain increment or decrement symmetry of the gain g(t) (for models MNL and MLN), respectively the attenuation a(t) (for models DNL and DLN). The third column shows the nonlinearities for which one obtains symmetry on a logarithmic scale for the gain, respectively the attenuation. Note that since d ln g/dt = (1/g) dg/dt (and similarly for a(t)), this corresponds to the case where the speed of adaptation of g, respectively a, relative to its initial value is symmetric. The fourth column shows the nonlinearities that yield symmetry when the control signal g(t) in models MNL and MLN is plotted as an attenuation a(t) = 1/g(t), and likewise when a(t) in models DNL and DLN is plotted as a gain g(t) = 1/a(t). Since d(1/g)/dt = −(1/g²) dg/dt (and similarly for a(t)), this corresponds to the case where the speed of adaptation of g, respectively a, relative to the square of its initial value is symmetric. The parameters c and k should be chosen positive.
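As a numerical illustration (ours) of the first entry for model DLN: with h(x) = c exp(kx), the initial speeds given by equation 5.1 after an increment step I− → I+ and a decrement step I+ → I− differ only in sign, because d ln h(x)/dx = k is the same at both operating points. The fixed-point solver below is an assumption of the sketch:

```python
import math

c, k = 1.0, 0.5
h = lambda x: c * math.exp(k * x)

def dlogh(x, eps=1e-6):
    # Numerical derivative of ln h at x; equals k exactly for exponential h.
    return (math.log(h(x + eps)) - math.log(h(x - eps))) / (2 * eps)

def alpha_steady(I):
    # Solve the D_LN steady state alpha = I / h(alpha) by damped iteration.
    a = 1.0
    for _ in range(200):
        a = 0.5 * a + 0.5 * I / h(a)
    return a

I_lo, I_hi = 1.0, 4.0
# Equation 5.1 (tau = 1): initial speed = d(ln h)/d(alpha) * (I_after - I_before).
rate_up   = dlogh(alpha_steady(I_lo)) * (I_hi - I_lo)   # after an increment step
rate_down = dlogh(alpha_steady(I_hi)) * (I_lo - I_hi)   # after a decrement step
assert math.isclose(rate_up, -rate_down, rel_tol=1e-5)  # symmetric magnitudes
```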
Figure 8: In section 6, the input I (t) to the gain control (here illustrated for model DNL ) is a low-pass filtered version (time constant τ I ) of a signal J (t) that steps from J = J 0 to J = J 1 at time t = 0.
is a low-pass filtered version of a signal J (t). Since all neural systems incorporate low-pass filtering, for instance, associated with membrane time constants, this is a reasonable choice. In this section, we assume that J (t) is a step function J 0 → J 1 at time t = 0. In fact, it can be shown that when the output O(t) of the gain control is monotonic for such a step J (t), it will certainly be monotonic when J (t) would be a smoother function than a step (i.e., when there would be more low-pass filtering before the gain control). The actual input I (t) to the gain control loop is assumed to be the result of applying a first-order low-pass filter with time constant τ I to the step
function J(t), that is, for t > 0, we assume

\tau_I \frac{dI}{dt} = J_1 - I. \qquad (6.1)

The resulting input I(t) to the gain controls of Figure 1 can be written explicitly as

I(t) = J_1 + (J_0 - J_1)\exp(-t/\tau_I). \qquad (6.2)
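In simulations this input is generated directly from equation 6.2; a minimal sketch (ours, with arbitrary parameter values):

```python
import numpy as np

def filtered_step(t, J0, J1, tau_I):
    # Equation 6.2: first-order low-pass filtered step from J0 to J1 at t = 0.
    return J1 + (J0 - J1) * np.exp(-t / tau_I)

t = np.arange(0.0, 5.0, 1e-3)
I = filtered_step(t, J0=1.0, J1=4.0, tau_I=1.0)
```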
The question we pose here is, What can we expect for the dynamics of the output O(t) when the models of Figure 1 are stimulated with the input of equation 6.2? It turns out that the qualitative nature of O(t) depends critically on the value of the time constant τ_I of the filtering of the input in equation 6.1 relative to the time constant τ of the low-pass filtering in the feedback controls of Figure 1. In the next section, we study the case where the input steps are small (i.e., J_1 − J_0 ≪ J_0), in which case the models of Figure 1 can be linearized. In sections 6.2 to 6.4, we study the full nonlinear case.

6.1 Linearized Models Have Monotonic Output When τ ≤ τ_I.

For explicitness we apply linearization to model DNL (see Figure 1C), but the other three models in Figure 1 can be linearized in the same way, and in fact the linearized forms of the four models in Figure 1 are identical. To apply linearization, we assume

I(t) = [1 + i(t)] I_s, \qquad (6.3)

O(t) = [1 + o(t)] O_s, \qquad (6.4)

and

a(t) = [1 + \eta(t)] a_s, \qquad (6.5)

with i(t), o(t), and η(t) all small relative to 1. Note that i(t), o(t), and η(t) are dimensionless and will be referred to below as input modulation, output modulation, and control modulation, respectively. The constant steady-state values I_s, O_s, and a_s are related through O_s = I_s/a_s and a_s = h(O_s). The dynamic signals i(t), o(t), and η(t) are related through 1 + o(t) = (1 + i(t))/(1 + η(t)). Using the fact that these signals are small relative to 1, 1 + o = (1 + i)/(1 + η) ≈ (1 + i)(1 − η) = 1 + i − η − iη ≈ 1 + i − η, and thus

o(t) = i(t) - \eta(t), \qquad (6.6)
that is, the linearized system behaves as a subtractive feedback in which the control modulation η(t) is subtracted from the input modulation i(t) to obtain the output modulation o(t). In appendix B, we prove that when the input modulation i(t) is a low-pass filtered step function, as in equation 6.2, the output modulation o(t) has an overshoot when the time constant τ of the low-pass filter in the control path is larger than the time constant τ_I of the filtering at the input. In this case, the control modulation η(t) cannot follow the increase in the input modulation i(t) fast enough, and the resulting output modulation o(t) has an overshoot relative to its final steady state. On the other hand, when τ ≤ τ_I, the filter in the feedback path is fast enough to follow the input modulation, and the output modulation o(t) will approach its new steady state in a monotonic way, without an overshoot.

6.2 Output of Model DNL Resembles the Output of the Linearized Model.

As shown in appendix B, when the linearized models are stimulated with low-pass filtered steps, the qualitative dynamics of the output of these models is quite simple: the sum of a constant and two exponentials. Depending on the time constants of the low-pass filters, the output is either monotonic, or it has a single overshoot (respectively, undershoot for a decrement step). In this section we show that, remarkably, for model DNL (see Figure 1C), this qualitative dynamics persists also for large input steps, independent of the precise form of the nonlinearity h in the feedback path. For the other models in Figure 1, however, qualitatively new features in the output can occur, depending on the size and sign of the input steps and on the form of the nonlinearities in the feedback loops. When stimulated with an input I(t), the output O(t) of model DNL can be written as O(t) = I(t)/a(t). Differentiating this expression with respect to time yields

\frac{dO}{dt} = \frac{1}{a^2}\left[a\frac{dI}{dt} - I\frac{da}{dt}\right] = \frac{1}{a^2}\left[a\,\frac{J_1 - I}{\tau_I} - I\,\frac{h(O) - a}{\tau}\right] = \frac{1}{a}\left[\frac{J_1 - I}{\tau_I} - \frac{Oh(O) - I}{\tau}\right], \qquad (6.7)

where we used equations 6.1 and 2.2. Now if O(t) has an overshoot, dO/dt = 0 at some finite time t where the overshoot reaches its maximum value. Setting dO/dt = 0 in equation 6.7 yields

Oh(O) = \frac{\tau}{\tau_I}\,J_1 + \left(1 - \frac{\tau}{\tau_I}\right) I. \qquad (6.8)
When plotted in an input-output diagram with the input I on the horizontal axis and Oh(O) on the vertical axis (such that steady-state, Is = Os h(Os ), is a straight line through the origin), equation 6.8 yields for any value of τ/τ I
Figure 9: (A) Dynamics of the output O of model DNL (with h(O) = O²) stimulated with a low-pass filtered step (time constant τ_I) when τ < τ_I (τ = 0.5τ_I in this example). The horizontal axis is the input I from equation 6.2 (with J_0 = 1 and J_1 = 4). The vertical axis is a nonlinear transformation Oh(O) = O³ of the output O of the model, such that steady state O_s³ = I_s (dotted line) is a straight line through the origin. According to equation 6.8, the line dO/dt = 0 (dashed line) also is a straight line in these coordinates, with positive slope when τ < τ_I. Hence, the actual (nonlinearly transformed) output O³(I) (continuous line) cannot cross this line and must remain in the region below the dashed line, where dO/dt > 0; hence O(t) is monotonic. (B) As A, but now with τ = 2τ_I, such that equation 6.8, dO/dt = 0 (dashed line), has negative slope and therefore can be crossed by the actual output O³(I) (continuous curve labeled “large step”). In fact, a crossing always occurs since the output cannot cross the output of the linearized model valid for small steps (continuous curve labeled “small step”), which has an overshoot when τ > τ_I, as derived in appendix B. Thus, O(t) has a single overshoot, and it has a single undershoot (not shown in the figure) after a decrement step in the input.
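The two regimes of Figure 9 can be reproduced by integrating model DNL with the filtered-step input of equation 6.2. The sketch below is ours; h(O) = O², the step from 1 to 4, and the tolerances are assumptions of the example:

```python
import numpy as np

def output_DNL(tau, tau_I, J0=1.0, J1=4.0, dt=1e-4, T=20.0):
    # Integrate eq. 2.2 for h(O) = O**2 with the filtered-step input of eq. 6.2.
    t = np.arange(0.0, T, dt)
    I = J1 + (J0 - J1) * np.exp(-t / tau_I)
    a = np.empty_like(t)
    a[0] = J0 ** (2.0 / 3.0)           # fully adapted to J0 at t = 0
    for n in range(1, len(t)):
        a[n] = a[n-1] + (dt / tau) * ((I[n-1] / a[n-1])**2 - a[n-1])
    return I / a                        # output O(t)

O_fast = output_DNL(tau=0.5, tau_I=1.0)   # tau < tau_I: monotonic approach
O_slow = output_DNL(tau=2.0, tau_I=1.0)   # tau > tau_I: single overshoot
O_final = 4.0 ** (1.0 / 3.0)              # steady state: O_s**3 = I_s = J1
assert O_fast.max() <= O_final + 1e-3     # no overshoot when tau < tau_I
assert O_slow.max() >  O_final + 1e-3     # overshoot when tau > tau_I
```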
a straight line (see Figure 9). This line separates the region above the line where d O/dt < 0 from the region below it where d O/dt > 0. Note that the line where d O/dt = 0 (given by equation 6.8) crosses the steady-state line, Is = Os h(Os ), when I = J 1 . Now consider the case that τ/τ I <1: thus, the low-pass filter in the feedback control is faster than the filtering at the input. Then the line given by equation 6.8 has a positive slope 1 − τ/τ I (smaller than 1), as illustrated in Figure 9A. When the step J 0 → J 1 at t = 0 is an increment (i.e., when J 0 < J 1 ), the system will start at t = 0 from a point I = Oh(O) = J 0 < J 1 on the steady-state line, and according to equation 6.7, d O/dt > 0. Thus, for t > 0 the output O(t) initially increases. And in fact it will continue to increase (toward its final steady-state level where Oh(O) = I = J 1 ) for all t > 0. This is because the output O(t) cannot cross the line where d O/dt = 0, since on this line, d O/dt = 0, but for I < J 1 , according to equation 6.1, d I /dt > 0. Hence, if O(t) approaches the line d O/dt = 0, it will be pushed back in the region where d O/dt > 0. Thus, when τ < τ I , O(t)
approaches its final steady state in a monotonic way, without an overshoot (illustrated in Figure 9A for the case h(O) = O²). Likewise, when the step J_0 → J_1 at the input is a decrement step (i.e., when J_0 > J_1), the output O(t) will decrease monotonically and reach its final steady state without an undershoot. The situation is very different, however, when τ/τ_I > 1, that is, when the low-pass filter in the feedback loop is slower than the filtering at the input. In that case, the straight line of equation 6.8 has a negative slope, and now the output O(t) can cross the line dO/dt = 0, because now if O(t) approaches the line dO/dt = 0, it will be pulled into the region with dO/dt < 0 (see Figure 9B). In fact, from the small-signal analysis performed in appendix B, it follows that such a crossing must occur. In appendix B we showed that when τ/τ_I > 1, for small steps at the input, the resulting output attains its final steady state with an overshoot (the curve labeled “small step” in Figure 9B). But the output for a large step (the curve labeled “large step” in Figure 9B) cannot cross the output for the small step (since both outputs are described by the same first-order differential equation 6.7, implying that their trajectories must be identical if they have a point in common). The result, as illustrated in Figure 9B, is that also for a large step, O(t) will have exactly one overshoot before it reaches steady state for t → ∞. Likewise, after a decrement step (i.e., when 0 < J_1 < J_0 in equation 6.2), O(t) will have exactly one undershoot (without any subsequent overshoot) when τ/τ_I > 1. Thus, remarkably, the qualitative dynamics of the output O(t) of the nonlinear divisive model DNL is identical to the behavior of its linearized version analyzed in appendix B.

6.3 The Output of Model MNL Can Have Overshoots Also When τ < τ_I.

The qualitative dynamics of the output O(t) of the multiplicative model MNL (see Figure 1A), however, can differ from the linear analysis of appendix B. When stimulated with input I(t), this model yields an output O(t) = I(t)g(t), hence using equations 6.1 and 2.1,

\frac{dO}{dt} = g\frac{dI}{dt} + I\frac{dg}{dt} = g\,\frac{J_1 - I}{\tau_I} + I\,\frac{f(O) - g}{\tau} = g\left[\frac{J_1 - I}{\tau_I} - \frac{I - I^2 f(O)/O}{\tau}\right]. \qquad (6.9)
Setting dO/dt = 0 in equation 6.9 yields

\frac{O}{f(O)} = \frac{I^2}{I(\tau/\tau_I + 1) - J_1\,\tau/\tau_I}, \qquad (6.10)
which, contrary to the corresponding equation 6.8 for model DNL , is a nonlinear equation in I . The behavior of equation 6.10 is illustrated for a number
Figure 10: Dynamics of the output O of model MNL when stimulated with the low-pass filtered step I (t) of equation 6.2, with τ I = 1. Vertical axis O/ f (O) is a (monotonic) nonlinear transformation of the output such that steady state (the dotted line) Os / f (Os ) = Is is a straight line through the origin. Dashed lines are plots, for a number of values of τ , of equation 6.10, where d O/dt = 0. Note that, contrary to Figure 9 for model DNL , these lines are not straight, and have a sharp upward turn at some value I < J 1 . The response O/ f (O) (continuous line, obtained with f (O) = 1/O, J 0 = 0.04, J 1 = 1, and τ = 0.5) weaves its way through the line d O/dt = 0 for τ = 0.5. As a consequence, O(t) has an overshoot and a subsequent undershoot before it reaches its final steady state.
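The behavior shown in Figure 10 can be reproduced numerically; the sketch below (ours) uses the caption's parameters f(O) = 1/O, J_0 = 0.04, J_1 = 1, τ_I = 1, and τ = 0.5:

```python
import numpy as np

# Model M_NL with f(O) = 1/O, driven by the filtered step of equation 6.2.
tau, tau_I, dt = 0.5, 1.0, 1e-4
t = np.arange(0.0, 20.0, dt)
I = 1.0 + (0.04 - 1.0) * np.exp(-t / tau_I)
g = np.empty_like(t)
g[0] = 0.04 ** (-0.5)        # adapted to J0: g_s = 1/(g_s*I) -> g_s = I**(-1/2)
for n in range(1, len(t)):   # eq. 2.1 with f(x) = 1/x
    g[n] = g[n-1] + (dt / tau) * (1.0 / (g[n-1] * I[n-1]) - g[n-1])
O = g * I                    # final steady state O_s = sqrt(I_s) = 1
# O(t) first overshoots 1, then undershoots it, before settling (the "nose").
```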
of values of τ/τ_I by the dashed curves in Figure 10. When τ > τ_I, that is, when the filter in the feedback loop is slower than the filtering of the input, the structure of the lines in Figure 10 is such that the qualitative behavior of O(t) is identical to the behavior for τ > τ_I of the output of model DNL. After an increment step at the input, O(t) has a single overshoot, and after a decrement step, O(t) has a single undershoot. For an increment step, this result follows using the reasoning for model DNL in Figure 9B, since in both Figure 9B and Figure 10, the lines dO/dt = 0 are monotonically decreasing when I < J_1 and τ > τ_I. For I > J_1 the lines dO/dt = 0 are nonmonotonic when τ > τ_I; they have a minimum at a finite value of I and then increase with increasing I. However, this does not affect the qualitative behavior of O(t), since O(t) cannot cross this increasing part of the curve dO/dt = 0. Thus, when τ > τ_I the output O(t) has a single undershoot after a decrement step at the input. Now consider the case that τ < τ_I, the case that the filter in the feedback path is faster than the filter at the input. After a decrement step of the input, the output O(t) is qualitatively identical to the output of model DNL and will reach steady state without an undershoot. This follows from the fact that the curves dO/dt = 0 in Figure 10 are monotonically increasing when I > J_1 and τ < τ_I; hence, the qualitative dynamics for O(t) in this case is identical to that obtained for model DNL. After an increment step, however, O(t) can now be qualitatively different from the output of model DNL, because the curves dO/dt = 0 in Figure 10 have a more complicated structure than in Figure 9A, with a sharp upturn for small values of I.
Figure 11: Output O(t) of model DLN with τ = 1 and a nonlinearity h(x) = 1 + x + 0.8 sin(πx)/π, which is monotonic but has multiple changes in the sign of its second-order derivative. Input I(t) is a step from 0 to 110, low-pass filtered with a time constant τ_I = 1. For this model, a simple input (a saturating exponential) generates a complicated output with multiple oscillations.
As a consequence, whereas model DNL reaches steady state in a monotonic fashion when τ < τ_I, the output O(t) of model MNL develops a "nose" for sufficiently large steps at the input. That is, after large steps of the input I(t), the output O(t) of model MNL has an initial overshoot, followed by a subsequent undershoot, as shown by the continuous curve in Figure 10.

6.4 Simple Input Can Yield a Complicated Output in Models DLN and MLN. Model DLN (or, equivalently, model MLN) can have an output O(t) that is still more complicated than the output of model MNL shown in Figure 10. When the nonlinearity h(x) (though monotonic) has sign changes in its second-order derivative d²h/dx², the output O(t) of model DLN can have oscillatory features when the model is stimulated with the simple low-pass filtered step of equation 6.2. An example is shown in Figure 11.
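The oscillatory response of Figure 11 can be reproduced with a few lines of numerical integration. The sketch below assumes the loop order implied by the model's name, with the low-pass filter acting on O(t) before the static nonlinearity, so that a(t) = h(b(t)) with τ db/dt = O − b and O = I/a; this reading of the block diagram, and all function names, are our own assumptions rather than code from the paper.

```python
import numpy as np

def h(x):
    # Nonlinearity of Figure 11: monotonic (h'(x) = 1 + 0.8*cos(pi*x) > 0),
    # but with sign changes in its second derivative.
    return 1.0 + x + 0.8 * np.sin(np.pi * x) / np.pi

def simulate_dln(J1=110.0, tau=1.0, tau_I=1.0, t_end=2.0, dt=1e-4):
    """Model DLN (assumed structure): low-pass filter, then nonlinearity,
    in the divisive feedback path.  O = I/h(b), tau*db/dt = O - b."""
    n = int(t_end / dt)
    I, b = 0.0, 0.0
    out = np.empty(n)
    for k in range(n):
        O = I / h(b)
        out[k] = O
        b += dt / tau * (O - b)
        I += dt / tau_I * (J1 - I)      # step 0 -> 110, low-pass filtered
    return out

O = simulate_dln()
# Count extrema of O(t); more than two indicates the multiple oscillations
# seen in Figure 11.
extrema = int((np.diff(np.sign(np.diff(O))) != 0).sum())
print("extrema in O(t):", extrema)
```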
7 Discussion

In this letter, we studied the dynamics of the four feedback models for gain control shown in Figure 1. Two of the models (MNL and MLN) have a multiplicative gain control, whereas the other two models (DNL and DLN) have a divisive gain control. The structure of all four models is very similar in that the feedback loop for each consists of a concatenation of an instantaneous nonlinearity and a first-order low-pass filter. As shown in sections 2.2 and 2.3, two of these models (MLN and DLN) are actually identical, and the other two models (MNL and DNL) are connected by a duality. Despite these relationships, however, the dynamics of models MNL, DNL, and MLN (equivalently, DLN) can be very different for a given input I(t). For instance, in section 4, we showed that the dynamics of adaptation of the models differs after steps in the input (see Figures 5 and 7), and in section 5 we showed that the models differ in their asymmetries of adaptation after increments and decrements of the input. Finally, in section 6, we showed that for simple band-limited inputs I(t), the outputs O(t) of the different models can be qualitatively different. Thus, subtle differences in the control structure of feedback models for gain control can have strong effects on their dynamics.

As shown in section 6, the input-output behavior of model DNL is qualitatively the simplest, especially when τ < τ_I (see Figure 9A), because in that case, a monotonic input I(t) yields a monotonic output O(t). Thus, when a nonmonotonic output O(t) occurs in model DNL, the corresponding input I(t) must also have been nonmonotonic. On the other hand, even when τ < τ_I, model MNL can have a nonmonotonic output O(t) when the input I(t) was monotonic, as is shown in Figure 10. Finally, model DLN (and model MLN) can give complicated outputs even when the input is simple. Hence, for this model, it can be difficult to make a qualitative estimate of the input when observing the output. For instance, it would be hard to discriminate the output seen for this model in Figure 11 from an output generated by this model when stimulated by an input that genuinely oscillates. It appears therefore that model DNL is usually a safer design than the models MNL and (especially) MLN or DLN, since as a general principle in information processing, one should try to avoid introducing spurious structure that is not present at the input (Koenderink, 1984).

The structural complexity of the models studied in this letter has been kept to a minimum. The great advantage of this approach is that we can obtain general results using an exact mathematical analysis. It would be very hard to derive conclusions of similar generality if one had to resort to numerical simulations, which might be necessary for more complex models. Nonetheless, the results derived for the simple models studied here can aid in the analysis of more complex models. For instance, the results derived here can speed the search for realistic complex models by restricting the huge search space of the many possible complex models to more manageable proportions. A realistic strategy for this would be to first investigate the steady-state behavior of the system under study, which determines the mathematical form of the nonlinearity in the system (Chichilnisky, 2001; Snippe et al., 2004). Given the steady state, one can then study the dynamics of the system (e.g., increment and decrement asymmetry and overshoots of the output) to determine which type of model (e.g., divisive or multiplicative) is appropriate to explain the experimental data.
Figure 12: An example of nonlinear (divisive) control (A) where the nonlinearity and the filtering shape not only the control signal a(t) but also directly the output O(t). The model is equivalent to the model shown at the right (B). Hence, the dynamics of a(t) = O(t) of this model can be studied with the methods developed for model DNL in Figure 1.
There are several simple extensions and alternatives of the models studied here that are amenable to analysis with the methods we have developed. For instance, the filtering and nonlinearities in the feedback path in Figure 1 could also be placed at a point before the feedback loop branches off from the output. As shown in Figure 12, this in fact does not affect the dynamics of the gain signal, but it changes the dynamics of the output, which now becomes identical to the gain. Finally, the filtering operations could be generalized. For instance, one could consider higher-order low-pass filters, time delays in the filters, or high-pass filtering of the output (Berry et al., 1999). It is well known that with increasing values of the input (e.g., luminance or contrast), neural systems not only decrease their gain but also adapt their temporal filtering (e.g., Sperling & Sondhi, 1968; Victor, 1987; Benardete, Kaplan, & Knight, 1992; Carandini & Heeger, 1994; van Hateren, 2005). It will be interesting to see to what extent the tools developed in this letter can also be used to analyze such more general systems.

Appendix A: Increment and Decrement Symmetry for Models DNL and MNL

Adaptation in model DNL (see Figure 1C) is described by
$$\tau\frac{da}{dt} = h(O) - a = h\!\left(\frac{I}{a}\right) - a. \tag{A.1}$$
For a constant input I_s, steady state is attained when da/dt = 0, thus when h(I_s/a_s) = a_s. Hence, in steady state I_s = a_s h^{-1}(a_s) = O_s h(O_s), where h^{-1} is the inverse of the function h, and we have used the fact that in steady state a_s = h(O_s). Now consider an increment step I_− → I_+ of the input. For the input I_− the steady-state equation is I_− = a_− h^{-1}(a_−) = O_− h(O_−), and for the input I_+ it is I_+ = a_+ h^{-1}(a_+) = O_+ h(O_+). Immediately after the increment step, when the input is I_+ but the attenuation a is still at its old
adaptation level a_−,

$$\tau\frac{da_{\rm inc}}{dt} = h\!\left(\frac{I_+}{a_-}\right) - a_- = h\!\left(\frac{I_+}{a_-}\right) - h\!\left(\frac{I_-}{a_-}\right) = h\!\left(\frac{O_+\,h(O_+)}{h(O_-)}\right) - h(O_-). \tag{A.2}$$
Likewise, immediately after a decrement step I_+ → I_−,

$$\tau\frac{da_{\rm dec}}{dt} = h\!\left(\frac{O_-\,h(O_-)}{h(O_+)}\right) - h(O_+). \tag{A.3}$$
For symmetry of adaptation, we equate da_inc/dt and −da_dec/dt (taking into account that a decreases after a decrement step). Writing O_+ = y and O_− = x, this yields

$$h\!\left(\frac{y\,h(y)}{h(x)}\right) + h\!\left(\frac{x\,h(x)}{h(y)}\right) = h(y) + h(x), \tag{A.4}$$
a functional equation for h in terms of two variables x and y (Aczel & Dhombres, 1989). It is easy to verify that equation A.4 is solved by the function h(x) = c + k ln x (with c and k constants; k > 0 for gain control), but how does one obtain such a solution? A method to solve A.4 is to set y = x + ε and make a Taylor expansion of the resulting expression. Since, as shown below, terms up to the first order in the expansion cancel, it is necessary to expand to second order in ε. An identical result can be obtained by differentiating equation A.4 twice with respect to y, and putting y = x after differentiating. Taylor-expanding the right-hand side of equation A.4 yields

$$h(x + \varepsilon) + h(x) = 2h(x) + \varepsilon\,\frac{dh(x)}{dx} + \frac{1}{2}\varepsilon^2\,\frac{d^2 h(x)}{dx^2} = 2h + \varepsilon h' + \frac{1}{2}\varepsilon^2 h'', \tag{A.5}$$
where for conciseness, we write dh/dx = h′ and d²h/dx² = h″, and drop the argument x everywhere. Likewise, the left-hand side of equation A.4 yields
$$h\!\left(\frac{(x+\varepsilon)\bigl(h + \varepsilon h' + \varepsilon^2 h''/2\bigr)}{h}\right) + h\!\left(\frac{x\,h}{h + \varepsilon h' + \varepsilon^2 h''/2}\right).$$

Expanding both arguments around x, using h(x + δ) = h + δh′ + δ²h″/2 + O(δ³), the zeroth- and first-order contributions reduce to 2h + εh′, so that

$$2h + \varepsilon h' + \varepsilon^2\left[\frac{x h' h''}{h} + \frac{1}{2}h'' + \frac{(h')^2}{h} + \left(\frac{x h'}{h}\right)^2 h'' + \frac{x\,(h')^3}{h^2}\right] + O(\varepsilon^3). \tag{A.6}$$

Equating A.6 and A.5, only terms in ε² remain:

$$\varepsilon^2\left[\frac{x h' h''}{h} + \frac{1}{2}h'' + \frac{(h')^2}{h} + \left(\frac{x h'}{h}\right)^2 h'' + \frac{x\,(h')^3}{h^2}\right] = \frac{1}{2}\varepsilon^2 h''. \tag{A.7}$$
Cancelling the right-hand side of equation A.7 against the second term on its left-hand side, and multiplying by h²/(ε²h′), yields

$$h h' + x h h'' + x^2 h' h'' + x\,(h')^2 = 0, \tag{A.8}$$
which can be simplified to

$$\bigl(h + x h'\bigr)\bigl(h' + x h''\bigr) = [xh]'\,[xh']' = 0. \tag{A.9}$$
The first factor in equation A.9, [xh]′ = 0, has a solution h(x) = k/x that decreases with increasing x, contrary to the assumptions in section 2. However, the second factor in equation A.9, [xh′]′ = 0, does yield a result for divisive gain control. Because [xh′]′ = 0 means xh′ = c with c a constant (positive for divisive gain control), it follows that h′ = c/x, thus h(x) = c ln(kx), with k another positive constant. Note that when x < 1/k, h(x) = c ln(kx) becomes negative, which speeds up adaptation after large decrements of the input. However, in steady state, I_s = O_s h(O_s) = O_s c ln(k O_s) > 0, hence h(O_s) = c ln(k O_s) > 0. The method used here to derive a functional equation and transform it into a differential equation can also be used to find nonlinearities f(x) that yield symmetric adaptation immediately after increment and decrement steps of the input in the multiplicative model MNL. Likewise,
nonlinearities can be found for which (1/a)da/dt = d ln a/dt (or, equivalently, (1/g)dg/dt = d ln g/dt) is symmetric immediately after increment and decrement steps. Results are collected in Table 1.
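As a sanity check, the logarithmic solution can be verified symbolically. The short sketch below (our own verification aid, not part of the original analysis) substitutes h(x) = c + k ln x into the functional equation A.4 and confirms that the two sides agree identically for positive x and y:

```python
import sympy as sp

x, y, c, k = sp.symbols('x y c k', positive=True)

def h(z):
    # Candidate solution of equation A.4: h(z) = c + k*ln(z)
    return c + k * sp.log(z)

lhs = h(y * h(y) / h(x)) + h(x * h(x) / h(y))
rhs = h(y) + h(x)

# Force-expanding the logarithms shows that the log h(x) and log h(y)
# terms inside the arguments cancel pairwise, leaving exactly h(x) + h(y).
diff = sp.expand_log(lhs - rhs, force=True)
print(sp.simplify(diff))   # -> 0
```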
Appendix B: Dynamics of the Output for the Linearized Models

To obtain an explicit expression for the output modulation o(t) (see equation 6.6), we linearize the dynamics of the feedback loop of model DNL, equation 2.2, which yields

$$\tau\frac{da(t)}{dt} = h(O(t)) - a(t) = h(O_s) + O_s h'(O_s)\,o(t) - a_s - a_s\,\eta(t) = O_s h'(O_s)\,o(t) - h(O_s)\,\eta(t). \tag{B.1}$$
In the first step of equation B.1, we used the steady-state relation a_s = h(O_s), and defined h′(O_s) = dh(O_s)/dO_s. Using equation 6.5 and the steady-state relation a_s = h(O_s), the left-hand side of equation B.1 can be rewritten as τ h(O_s) dη(t)/dt. Using equation 6.6, we can rewrite equation B.1 as
$$\tau\frac{d\eta(t)}{dt} = Q\,i(t) - (1 + Q)\,\eta(t), \tag{B.2}$$
where we define

$$Q \equiv \frac{O_s\,h'(O_s)}{h(O_s)}. \tag{B.3}$$
Note that in general, the parameter Q in equation B.3 depends on the steady-state output O_s, Q = Q(O_s). For conciseness, we have suppressed this dependence in our notation. Dividing equation B.2 by 1 + Q yields
$$\tau_Q\frac{d\eta(t)}{dt} = \frac{Q}{1+Q}\,i(t) - \eta(t), \tag{B.4}$$
where we defined τ_Q = τ/(1 + Q). Equation B.4 says that for the linearized model, the subtractive control modulation η(t) is a low-pass filtered version of the input modulation i(t), in which the filter has a gain Q/(1 + Q) and a time constant τ_Q = τ/(1 + Q) (Bechhoefer, 2005). Equation B.4 can be solved explicitly when the input modulation i(t) is a low-pass filtered step of size s, that is, when for t ≥ 0,

$$i(t) = s\,[1 - \exp(-t/\tau_I)], \tag{B.5}$$
and for t < 0, i(t) = 0. The solution of the linear equation B.4 (representing a first-order low-pass filter) can be written as the convolution of i(t) with a
low-pass filter with time constant τ_Q and gain Q/(1 + Q):

$$\eta(t) = \frac{Q}{(1+Q)\,\tau_Q}\int_0^t i(t')\,\exp\!\bigl(-(t-t')/\tau_Q\bigr)\,dt', \tag{B.6}$$
where the fact that i(t′) = 0 for t′ < 0 has been used for the lower bound of the integral. Substituting equation B.5 into equation B.6, the integral is elementary and yields
$$\eta(t) = \frac{sQ}{1+Q}\left[1 + \frac{\tau_Q}{\tau_I - \tau_Q}\exp(-t/\tau_Q) - \frac{\tau_I}{\tau_I - \tau_Q}\exp(-t/\tau_I)\right] \quad\text{when } \tau_I \neq \tau_Q, \tag{B.7}$$
and

$$\eta(t) = \frac{sQ}{1+Q}\left[1 - \left(1 + \frac{t}{\tau_Q}\right)\exp(-t/\tau_Q)\right] \quad\text{when } \tau_I = \tau_Q. \tag{B.8}$$
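The closed forms B.7 and B.8 are easy to validate against a direct numerical integration of equation B.4; the short sketch below does this for the generic case (the values of Q, τ, τ_I, and s are illustrative choices of ours, not taken from the paper).

```python
import numpy as np

Q, tau, tau_I, s = 1.5, 2.0, 1.0, 0.1        # illustrative values
tau_Q = tau / (1.0 + Q)                       # here 0.8, so tau_I != tau_Q

t = np.linspace(0.0, 10.0, 100001)
dt = t[1] - t[0]
i = s * (1.0 - np.exp(-t / tau_I))            # input modulation, eq. B.5

# Closed-form solution, eq. B.7 (valid while tau_I != tau_Q)
eta_closed = (s * Q / (1 + Q)) * (1.0
              + tau_Q / (tau_I - tau_Q) * np.exp(-t / tau_Q)
              - tau_I / (tau_I - tau_Q) * np.exp(-t / tau_I))

# Euler integration of eq. B.4: tau_Q * d(eta)/dt = Q/(1+Q)*i - eta
eta_num = np.zeros_like(t)
for k in range(len(t) - 1):
    eta_num[k + 1] = eta_num[k] + dt / tau_Q * (Q / (1 + Q) * i[k] - eta_num[k])

print("max |closed form - numerical| =", np.abs(eta_closed - eta_num).max())
# The discrepancy shrinks with dt, confirming B.7; the output modulation
# then follows from o(t) = i(t) - eta(t) (equation 6.6).
```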
First, we study the case τ_I ≠ τ_Q. For this case, using equations 6.6, B.5, and B.7, the output modulation o(t) of the gain control is

$$o(t) = i(t) - \eta(t) = \frac{s}{1+Q} - \frac{s\,(\tau_I - \tau)}{(1+Q)(\tau_I - \tau_Q)}\exp(-t/\tau_I) - \frac{s\,Q\,\tau_Q}{(1+Q)(\tau_I - \tau_Q)}\exp(-t/\tau_Q), \tag{B.9}$$
where in the second term of equation B.9, we used the relation τ = (1 + Q)τ_Q. Now when τ < τ_I (i.e., when the low-pass filter in the control loop is faster than the filter at the input), the prefactors of the exponential functions in equation B.9 are both positive, and hence o(t) is a monotonically increasing function. Thus, when τ < τ_I, the output modulation o(t) reaches its final value s/(1 + Q) in a monotonic way, without any overshoots or oscillations. When τ > τ_I (i.e., when the low-pass filter in the control loop is slower than the filter at the input), two situations can occur, depending on the value of the parameter Q in equation B.3: either τ_Q = τ/(1 + Q) < τ_I, or τ_Q > τ_I. If τ_Q > τ_I, then at large times t, the exponential function exp(−t/τ_Q) in equation B.9 will dominate the exponential function exp(−t/τ_I). However, τ_Q > τ_I means that τ_I − τ_Q < 0; hence, in this case, the contribution of exp(−t/τ_Q) to the result B.9 will be positive, which means that for large t, the output modulation o(t) in equation B.9 approaches its final value s/(1 + Q) from above. This, however, can be the case only when o(t) has
an overshoot. Likewise, when τ_Q < τ_I, the exponential exp(−t/τ_I) dominates exp(−t/τ_Q) for large t. However, in this case, the contribution of exp(−t/τ_I) to o(t) is positive, as can be easily verified. Hence, when τ > τ_I, the output modulation o(t) always has an overshoot, independent of the precise value of τ_Q relative to τ_I. One can easily check that o(t) also has a single overshoot when η(t) is described by equation B.8, that is, when τ_I = τ_Q exactly (and hence τ = (1 + Q)τ_Q > τ_I).

Acknowledgments

This research is supported by the Netherlands Organization for Scientific Research (NWO/ALW).

References

Abbott, L. F., Varela, J. A., Sen, K., & Nelson, S. B. (1997). Synaptic depression and cortical gain control. Science, 275, 220–224.
Aczel, J., & Dhombres, J. (1989). Functional equations in several variables. Cambridge: Cambridge University Press.
Bechhoefer, J. (2005). Feedback for physicists: A tutorial essay on control. Reviews of Modern Physics, 77, 783–836.
Benardete, E. A., Kaplan, E., & Knight, B. W. (1992). Contrast gain control in the primate retina: P cells are not X-like, some M cells are. Visual Neurosci., 8, 483–486.
Berry, M. J., Brivanlou, I. H., Jordan, T. A., & Meister, M. (1999). Anticipation of moving stimuli by the retina. Nature, 398, 334–338.
Calvert, P. D., Ho, T. W., Lefebvre, Y. M., & Arshavsky, V. Y. (1998). Onset of feedback reactions underlying vertebrate rod photoreceptor light adaptation. J. Gen. Physiol., 111, 39–51.
Carandini, M., & Heeger, D. J. (1994). Summation and division by neurons in primate visual cortex. Science, 264, 1333–1336.
Chichilnisky, E. J. (2001). A simple white noise analysis of neuronal light responses. Network, 12, 199–213.
Crawford, B. H. (1947). Visual adaptation in relation to brief conditioning stimuli. Proceedings of the Royal Society B, 134, 283–302.
Crevier, D. W., & Meister, M. (1998). Synchronous period-doubling in flicker vision of salamander and man. J. Neurophys., 79, 1869–1878.
Dau, T., Püschel, D., & Kohlrausch, A. (1996). A quantitative model of the "effective" signal processing in the auditory system. I. Model structure. J. Acoust. Soc. Am., 99, 3615–3622.
DeWeese, M., & Zador, A. (1998). Asymmetric dynamics in optimal variance adaptation. Neural Computation, 10, 1179–1202.
Escabí, M. A., Miller, L. M., Read, H. L., & Schreiner, C. E. (2003). Naturalistic auditory contrast improves spectrotemporal coding in the cat inferior colliculus. J. Neurosci., 23, 11489–11504.
Fairhall, A. L., Lewen, G. D., Bialek, W., & de Ruyter van Steveninck, R. R. (2001). Efficiency and ambiguity in an adaptive neural code. Nature, 412, 787–792.
Franklin, G. F., Powell, J. D., & Emami-Naeini, A. (1994). Feedback control of dynamical systems (3rd ed.). Reading, MA: Addison-Wesley.
Griffin, L. D., Lillholm, M., & Nielsen, M. (2004). Natural image profiles are most likely to be step edges. Vision Res., 44, 407–421.
Heeger, D. J. (1992). Normalization of cell responses in cat striate cortex. Visual Neurosci., 9, 181–197.
Hosoya, T., Baccus, S. A., & Meister, M. (2005). Dynamic predictive coding by the retina. Nature, 436, 71–77.
Koenderink, J. J. (1984). The structure of images. Biological Cybernetics, 50, 363–370.
Marmarelis, V. Z. (1991). Wiener analysis of nonlinear feedback in sensory systems. Annals of Biomedical Engineering, 19, 345–382.
Oppenheim, A. V., Willsky, A. S., & Young, I. T. (1983). Signals and systems. Englewood Cliffs, NJ: Prentice Hall.
Rieke, F. (2001). Temporal contrast adaptation in salamander bipolar cells. J. Neurosci., 21, 9445–9454.
Salinas, E., & Thier, P. (2000). Gain modulation: A major computational principle of the central nervous system. Neuron, 27, 15–21.
Schwartz, O., & Simoncelli, E. P. (2001). Natural signal statistics and sensory gain control. Nat. Neurosci., 4, 819–825.
Shapley, R., & Enroth-Cugell, C. (1984). Visual adaptation and retinal gain controls. Progr. Retinal Res., 3, 263–346.
Smirnakis, S. M., Berry, M. J., Warland, D. K., Bialek, W., & Meister, M. (1997). Adaptation of retinal processing to image contrast and spatial scale. Nature, 386, 69–73.
Snippe, H. P., Poot, L., & van Hateren, J. H. (2000). A temporal model for early vision that explains detection thresholds for light pulses on flickering backgrounds. Visual Neurosci., 17, 449–462.
Snippe, H. P., Poot, L., & van Hateren, J. H. (2004). Asymmetric dynamics of adaptation after onset and offset of flicker. J. Vision, 4, 1–12.
Snippe, H. P., & van Hateren, J. H. (2003). Recovery from contrast adaptation matches ideal-observer predictions. J. Opt. Soc. Am. A, 20, 1321–1330.
Sperling, G., & Sondhi, M. M. (1968). Model for luminance discrimination and flicker detection. J. Opt. Soc. Am., 58, 1133–1145.
Thorson, J., & Biederman-Thorson, M. (1974). Distributed relaxation processes in sensory adaptation. Science, 183, 161–172.
Torre, V., Ashmore, J. F., Lamb, T. D., & Menini, A. (1995). Transduction and adaptation in sensory receptor cells. J. Neurosci., 15, 7757–7768.
van de Vegte, J. (1990). Feedback control systems (2nd ed.). Englewood Cliffs, NJ: Prentice Hall.
van Hateren, J. H. (1997). Processing of natural time series of intensities by the visual system of the blowfly. Vision Res., 37, 3407–3416.
van Hateren, J. H. (2005). A cellular and molecular model of response kinetics and adaptation in primate cones and horizontal cells. J. Vision, 5, 331–347.
Victor, J. D. (1987). The dynamics of the cat retinal X cell centre. J. Physiol., 386, 219–246.
Wilke, S. D., Thiel, A., Eurich, C. W., Greschner, M., Bongard, M., Ammermüller, J., & Schwegler, H. (2001). Population coding of motion patterns in the early visual system. J. Comp. Physiol. A, 187, 549–558.
Received August 12, 2005; accepted August 1, 2006.
LETTER
Communicated by Bard Ermentrout
Interspike Interval Statistics in the Stochastic Hodgkin-Huxley Model: Coexistence of Gamma Frequency Bursts and Highly Irregular Firing

Peter Rowat
[email protected]
Institute for Neural Computation, University of California at San Diego, La Jolla, CA 92093, U.S.A.

Neural Computation 19, 1215–1250 (2007)
© 2007 Massachusetts Institute of Technology
When the classical Hodgkin-Huxley equations are simulated with Na- and K-channel noise and constant applied current, the distribution of interspike intervals is bimodal: one part is an exponential tail, as often assumed, while the other is a narrow gaussian peak centered at a short interspike interval value. The gaussian arises from bursts of spikes in the gamma-frequency range, the tail from the interburst intervals, giving overall an extraordinarily high coefficient of variation—up to 2.5 for 180,000 Na channels when I ≈ 7 µA/cm2. Since neurons with a bimodal ISI distribution are common, it may be a useful model for any neuron with class 2 firing. The underlying mechanism is due to a subcritical Hopf bifurcation, together with a switching region in phase-space where a fixed point is very close to a system limit cycle. This mechanism may be present in many different classes of neurons and may contribute to widely observed highly irregular neural spiking.

1 Introduction

The variability of spike times in response to repeated identical inputs is a prominent characteristic of the nervous system that many investigators have observed (Werner & Mountcastle, 1963; Pfeiffer & Kiang, 1965; Nakahama, Suzuki, Yamamoto, & Aikawa, 1968; Noda & Adey, 1970; Lamarre, Filion, & Cordeau, 1971; Bassant, 1976; Burns & Webb, 1976; Whitsel, Schreiner, & Essick, 1977; Dean, 1981; Softky & Koch, 1993; Holt, Softky, Koch, & Douglas, 1996; Shadlen & Newsome, 1998) and analyzed from many different approaches (Stein, 1965; Wilbur & Rinzel, 1983; Softky & Koch, 1993; Mainen & Sejnowski, 1995; Troyer & Miller, 1997; Gutkin & Ermentrout, 1998; Stevens & Zador, 1998; Destexhe, Rudolph, Fellous, & Sejnowski, 2001; Mann-Metzer & Yarom, 2002; Rudolph & Destexhe, 2002; Brette, 2003; Stiefel, Englitz, & Sejnowski, 2006). Although synaptic noise may account for some of this variability, another significant source is noise from stochastic ion channel activity, which can affect neuronal threshold and spike time reliability (Sigworth, 1980; White, Rubinstein, & Kay,
2000). Indeed, in certain small neurons, a single channel opening can initiate an action potential (Lynch & Barry, 1989). Channel noise has been shown to be significant or essential in other neural systems (White, Klink, Alonso, & Kay, 1998; Dorval & White, 2005; Kole, Hallerman, & Stuart, 2006), including bursting neurons, where Rowat and Elson (2004) showed that the addition of channel noise to a model cell qualitatively reproduced certain burst statistics of an identified biological neuron.

In this letter, we show that an isolated model neuron, described by the stochastic Hodgkin-Huxley (HH) equations, generates intrinsic spike-time variability with a characteristic bimodal distribution of interspike intervals (ISIs): a gaussian peak at the shortest intervals in addition to an exponential tail. This bimodal distribution has been observed in the firing of a variety of biological neurons (Gerstein & Kiang, 1960; Nakahama et al., 1968; Lamarre et al., 1971; Ruggero, 1973; Armstrong-James, 1975; Bassant, 1976; Burns & Webb, 1976; Siebler, Koller, Stichel, Muller, & Freund, 1993; de Ruyter van Steveninck, Lewen, Strong, Koberle, & Bialek, 1997; Duchamp-Viret, Kostal, Chaput, Lansky, & Rospars, 2005). Moreover, in the stochastic HH model as in biology, the location of the initial peak lies in the gamma-frequency range (40–100 Hz). Put another way, gamma-frequency firing is a favored behavior of the stochastic HH model.

In their deterministic form, the HH equations (Hodgkin & Huxley, 1952), originally introduced to describe action potentials in the squid giant axon, remain a foundation stone of modern neuroscience, and the formalism introduced by this model—also known as conductance based—is widely used today (Meunier & Segev, 2002). The basic outlines of the HH equations' dynamics in the physiological range are well known (e.g., Hassard, 1978; Troy, 1978; Best, 1979; Rinzel & Miller, 1980; Labouriau, 1985; Guckenheimer & Oliva, 2002), and much of what remains unknown in the deterministic case has been illuminated by Guckenheimer and Labouriau (1993). Many neurons, including several cortical types (Markram, Toledo-Rodriguez, Wang, Gupta, Silberberg, & Caizhi, 2004; Tateno, Harsch, & Robinson, 2004), can be classified as Hodgkin's class II (Hodgkin, 1948)—a type of neuron that, as applied current is increased in steps, begins spiking with positive (nonzero) frequency. Class II behavior arises in most cases from the presence of a subcritical Hopf bifurcation.

The HH model is a four-dimensional system, and several reductions to a two-dimensional system have been introduced. Of the latter, the Fitzhugh-Nagumo model is perhaps the best known (Fitzhugh, 1961) and has been heavily investigated as a prototype excitable system in biological, chemical, and physical contexts (Lindner, Garcia-Ojalvo, Neiman, & Schimansky-Geier, 2004). While not an HH model reduction, the Morris-Lecar neuron model is a conductance-based model of voltage oscillations in the barnacle giant muscle fiber (Morris & Lecar, 1981). It has been analyzed extensively (Rinzel & Ermentrout, 1998) and is often used as a good, qualitatively quite accurate two-dimensional model of neuronal spiking.
Table 1: Hodgkin-Huxley Parameters Used in Simulations

C        Membrane capacitance              1 µF/cm2
V_L      Leak reversal potential           −54.4 mV
g_L      Leak conductance                  0.3 mS/cm2
V_K      Potassium reversal potential      −77 mV
ḡ_K      Maximal potassium conductance     36 mS/cm2
ρ_K      Potassium channel density         18 channels/µm2
N_K      Number of potassium channels      ρ_K × area
V_Na     Sodium reversal potential         50 mV
ḡ_Na     Maximal sodium conductance        120 mS/cm2
ρ_Na     Sodium channel density            60 channels/µm2
N_Na     Number of sodium channels         ρ_Na × area
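For use in the sketches below, the Table 1 values can be collected in a small Python dictionary (the variable names are our own); the channel counts follow N = ρ × area (equation 2.8 below), reproducing the numbers quoted for Figure 1 (area = 400 µm2, N_Na = 24,000, N_K = 7,200):

```python
# Hodgkin-Huxley parameters of Table 1 (variable names are ours).
HH_PARAMS = {
    "C":       1.0,    # membrane capacitance, uF/cm^2
    "V_L":   -54.4,    # leak reversal potential, mV
    "g_L":     0.3,    # leak conductance, mS/cm^2
    "V_K":   -77.0,    # potassium reversal potential, mV
    "g_K":    36.0,    # maximal potassium conductance, mS/cm^2
    "rho_K":  18.0,    # potassium channel density, channels/um^2
    "V_Na":   50.0,    # sodium reversal potential, mV
    "g_Na":  120.0,    # maximal sodium conductance, mS/cm^2
    "rho_Na": 60.0,    # sodium channel density, channels/um^2
}

def channel_counts(area_um2, p=HH_PARAMS):
    """Total channel numbers for a given membrane area (equation 2.8)."""
    return p["rho_Na"] * area_um2, p["rho_K"] * area_um2

N_Na, N_K = channel_counts(400.0)   # -> 24000.0, 7200.0
```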
Using a stochastic version of this reduced two-dimensional model, we propose a simple dynamical explanation for the characteristic bimodal distribution of ISIs observed in the four-dimensional stochastic HH model. These dynamics are likely to apply to the spontaneous spiking of most class II neurons, even those with complex arrays of ionic currents. Our findings indicate that a combination of irregular firing and gamma-frequency bursts can be attributed to the effects of channel noise intrinsic to a neuron.

2 Methods

2.1 The Stochastic Hodgkin-Huxley Model and Simulation. The current conservation equation for the HH model is

$$C\frac{dV}{dt} = I - \bigl[g_{Na}(V - V_{Na}) + g_K(V - V_K) + g_L(V - V_L)\bigr], \tag{2.1}$$
where C is the membrane capacitance, V is the membrane potential, I is the applied current, g_Na is the sodium conductance, g_K is the potassium conductance, g_L is the leak conductance, V_Na and V_K are the sodium and potassium reversal potentials, and V_L is the leak reversal potential. The parameter values used are given in Table 1. The Na and K conductances are written as

$$g_{Na} = m^3 h\,\bar{g}_{Na}, \qquad g_K = n^4\,\bar{g}_K, \tag{2.2}$$
where the gating variables m, h, and n satisfy

$$\frac{dx}{dt} = \alpha_x(V)(1 - x) - \beta_x(V)\,x, \qquad x = m, h, n. \tag{2.3}$$
The rate functions α_x, β_x, for x = m, h, n, are as follows:

$$\alpha_m(V) = \frac{0.1\,(V + 40)}{1 - e^{-(V+40)/10}}, \qquad \beta_m(V) = 4\,e^{-(V+65)/18}, \tag{2.4}$$

$$\alpha_h(V) = 0.07\,e^{-(V+65)/20}, \qquad \beta_h(V) = \frac{1}{1 + e^{-(V+35)/10}}, \tag{2.5}$$

$$\alpha_n(V) = \frac{0.01\,(V + 55)}{1 - e^{-(V+55)/10}}, \qquad \beta_n(V) = 0.125\,e^{-(V+65)/80}. \tag{2.6}$$
In the stochastic HH model, the gating equations 2.3 are not used, and we proceed as follows. The voltage-dependent conductances for the sodium and potassium currents are given by

$$g_{Na} = \frac{Q_{Na}}{N_{Na}}\,\bar{g}_{Na}, \qquad g_K = \frac{Q_K}{N_K}\,\bar{g}_K, \tag{2.7}$$
where N_Na is the total number of sodium channels, N_K the total number of potassium channels, Q_Na and Q_K are the numbers of sodium and potassium channels in the open state, and ḡ_Na and ḡ_K are the maximum sodium and potassium conductances. N_Na and N_K are given by

$$N_{Na} = \rho_{Na} \times \text{area}, \qquad N_K = \rho_K \times \text{area}, \tag{2.8}$$
where ρ_Na and ρ_K are the sodium and potassium channel densities. In this letter, ρ_Na and ρ_K are fixed. Thus, when the membrane area is changed, the total number of channels, N_Na and N_K, is changed. Each potassium channel has four identical gates. Each gate is either open or closed, and the channel conducts potassium current only when all four gates are open. Each potassium gate is a Markov process with voltage-dependent opening and closing rates given by the classical Hodgkin-Huxley functions α_n(V) and β_n(V), and the kinetic scheme for the operation of a single potassium channel is the following (state diagram reconstructed from the description in the text; n_i denotes a channel with i open gates):

$$n0 \;\underset{\beta_n}{\overset{4\alpha_n}{\rightleftharpoons}}\; n1 \;\underset{2\beta_n}{\overset{3\alpha_n}{\rightleftharpoons}}\; n2 \;\underset{3\beta_n}{\overset{2\alpha_n}{\rightleftharpoons}}\; n3 \;\underset{4\beta_n}{\overset{\alpha_n}{\rightleftharpoons}}\; n4 \tag{2.9}$$
Here, n4 is the state in which all four gates are open. For each state transition, for example, n0 → n1, the function labeling it (here, 4αn ) is the transition rate. A sodium channel has three activation (m-) gates and one inactivation (h-) gate, and conducts sodium current only when all four gates are open. With the opening and closing rates αm (V), βm (V), αh (V), and βh (V) for the
m-gates and the h-gate, the associated Markov kinetic scheme is then (an eight-state scheme, reconstructed from the description in the text, with horizontal m-gate transitions and vertical h-gate transitions):

$$\begin{array}{ccccccc}
m0h1 & \underset{\beta_m}{\overset{3\alpha_m}{\rightleftharpoons}} & m1h1 & \underset{2\beta_m}{\overset{2\alpha_m}{\rightleftharpoons}} & m2h1 & \underset{3\beta_m}{\overset{\alpha_m}{\rightleftharpoons}} & m3h1 \\
\alpha_h\!\uparrow\!\downarrow\!\beta_h & & \alpha_h\!\uparrow\!\downarrow\!\beta_h & & \alpha_h\!\uparrow\!\downarrow\!\beta_h & & \alpha_h\!\uparrow\!\downarrow\!\beta_h \\
m0h0 & \underset{\beta_m}{\overset{3\alpha_m}{\rightleftharpoons}} & m1h0 & \underset{2\beta_m}{\overset{2\alpha_m}{\rightleftharpoons}} & m2h0 & \underset{3\beta_m}{\overset{\alpha_m}{\rightleftharpoons}} & m3h0
\end{array} \tag{2.10}$$
A channel is in state m_i h_j if i m-gates and j h-gates are open, 0 ≤ i ≤ 3, j = 0, 1. A sodium channel is open only when it is in state m3h1. The stochastic Hodgkin-Huxley model consists of equations 2.1, 2.4, 2.5, 2.6, and 2.7; N_Na copies of equation 2.10; and N_K copies of equation 2.9. The brute-force algorithm keeps track of the states of all the sodium and potassium channels, computationally a hugely inefficient scheme. The first time one of the (N_Na + N_K) channels changes state, the membrane equation, 2.1, is integrated up to that instant in time, and the process is repeated. An exact, and much better, algorithm (Gillespie, 1977; Chow & White, 1996) does not track the state of every channel but instead tracks the number of channels in each state, as follows. In schemes 2.9 and 2.10, there are 13 different channel states x, for x = n0, n1, ..., n4, m0h0, ..., m3h1, and 28 transitions y with rates {R_y(V) : y = n0 → n1, ..., n3 → n4, m0h0 → m1h0, ..., m2h1 → m1h1, ..., m3h0 → m3h1}; for example, R_{n1→n2} = 3α_n. For each x, N_x is the number of channels in state x. The algorithm tracks the global state (N_n0, N_n1, ..., N_m3h1), where N_n0 + N_n1 + N_n2 + N_n3 + N_n4 = N_K, N_m0h0 + ··· + N_m3h1 = N_Na, Q_K = N_n4, and Q_Na = N_m3h1. The algorithm has five steps:

Step 1: Select a transition time t_tr from the current global state (N_n0, N_n1, ..., N_m3h1) to its immediate successor.
Step 1a: Update the transition rates {R_y(V)}.
Step 1b: For each z = 1, 2, ..., 28, compute R_z^tot = N_source(y) R_y(V), where z = ord(y), z = 1, 2, ..., 28, is a fixed ordering of the transitions y = n0 → n1, ..., m3h0 → m3h1.
Step 1c: Let λ = Σ_{z=1}^{28} R_z^tot, the global transition rate out of the current global state. The waiting time to the next transition has the probability density function

$$\lambda\,e^{-\lambda(t - t_0)}. \tag{2.11}$$
Step 1d: Let t_tr = ln(r_1^{-1})/λ, with r_1 a pseudorandom number in [0,1]; t_tr is a sample from the PDF (see equation 2.11).
Step 2: Integrate equation 2.1 from t_0 to t = t_0 + t_tr to get a new value of V.
Step 3: Select the transition taken. Using r_2, a pseudorandom number in [0,1], compute µ ∈ [1, 28] such that Σ_{z=1}^{µ−1} R_z^tot < λ r_2 ≤ Σ_{z=1}^{µ} R_z^tot.
Step 4: Update the state as determined by the transition µ. For example, if R_µ^tot corresponds to transition m2h1 → m1h1, then add 1 to N_m1h1 and subtract 1 from N_m2h1.
Step 5: Set t_0 = t and repeat steps 1 to 4.

2.2 The Stochastic Hodgkin-Huxley Model in Voltage-Clamp Mode. This is an experimental technique physiologists use; we use this mode to collect statistics on channel noise. Fix V = V_0, and use the algorithm just described, but omit step 2. Also, step 1a can be taken out of the loop and executed once on initialization. In all HH voltage-clamp runs, the first 20 ms were discarded to avoid transients.
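In voltage-clamp mode the rates are constant, which makes the algorithm particularly compact. The sketch below is our own illustration (invented function and variable names) of steps 1 to 5 for the potassium scheme 2.9 alone; the sodium scheme 2.10 is handled identically with its 20 transitions.

```python
import numpy as np

def alpha_n(V):
    return 0.01 * (V + 55.0) / (1.0 - np.exp(-(V + 55.0) / 10.0))

def beta_n(V):
    return 0.125 * np.exp(-(V + 65.0) / 80.0)

def k_channels_voltage_clamp(V0=-40.0, N_K=7200, t_end=100.0, seed=1):
    """Gillespie simulation of the K+ scheme 2.9 at fixed V (step 2 omitted)."""
    rng = np.random.default_rng(seed)
    a, b = alpha_n(V0), beta_n(V0)               # step 1a, once (voltage clamp)
    N = np.array([N_K, 0, 0, 0, 0], dtype=float)  # channels in states n0..n4
    t = 0.0
    times, Q_K = [0.0], [0.0]
    while t < t_end:
        # Step 1b: total rate of each of the 8 transitions.
        # Openings n_i -> n_{i+1} at rate (4-i)*alpha_n per channel;
        # closings n_i -> n_{i-1} at rate i*beta_n per channel.
        R = np.concatenate([(4 - np.arange(4)) * a * N[:4],
                            np.arange(1, 5) * b * N[1:]])
        lam = R.sum()                             # step 1c
        t += rng.exponential(1.0 / lam)           # step 1d
        mu = np.searchsorted(np.cumsum(R), lam * rng.random())  # step 3
        if mu < 4:                                # step 4: opening n_mu -> n_{mu+1}
            N[mu] -= 1.0
            N[mu + 1] += 1.0
        else:                                     # step 4: closing n_{mu-3} -> n_{mu-4}
            j = mu - 3
            N[j] -= 1.0
            N[j - 1] += 1.0
        times.append(t)
        Q_K.append(N[4])                          # Q_K = N_n4 open channels
    return np.array(times), np.array(Q_K)

# Per section 2.2, discard the first 20 ms as transient before computing
# channel-noise statistics.
t, q = k_channels_voltage_clamp()
print("mean open K channels after 20 ms:", q[t > 20.0].mean())
```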
2.3 The Morris-Lecar Model. The deterministic Morris-Lecar model is

$$C\frac{dV}{dt} = -\bar{g}_{Ca}\,m_\infty(V)(V - V_{Ca}) - \bar{g}_K\,w\,(V - V_K) - g_L(V - V_L) + I, \tag{2.12}$$

$$\frac{dw}{dt} = \phi\,\frac{w_\infty(V) - w}{\tau_w(V)}, \tag{2.13}$$
where

$$m_\infty(V) = 0.5\,\bigl[1 + \tanh\bigl((V - V_1)/V_2\bigr)\bigr], \qquad w_\infty(V) = 0.5\,\bigl[1 + \tanh\bigl((V - V_3)/V_4\bigr)\bigr], \tag{2.14}$$

and

$$\tau_w(V) = 1/\cosh\bigl((V - V_3)/(2 V_4)\bigr). \tag{2.15}$$
Parameter values were V_1 = −1.2, V_2 = 18, V_3 = 2, V_4 = 30, ḡ_Ca = 4.4, ḡ_K = 8.0, g_L = 2, V_K = −84, V_L = −60, V_Ca = 120, C = 20 µF/cm2, and φ = 0.04. The injected current I was varied. The stochastic Morris-Lecar model is derived in the same manner as the stochastic HH model derives from the deterministic HH model. The K+ gating variable w is replaced by a population of N_K channels, each having a single gate with two states, open and closed. We introduce the membrane area with channel density ρ_K, for which

$$N_K = \rho_K \times \text{area} \tag{2.16}$$
holds. The opening and closing transition probabilities are given by functions α_w(V) and β_w(V). These are obtained by rewriting equation 2.13 in the form

$$\frac{dw}{dt} = \alpha_w(V)(1 - w) - \beta_w(V)\,w, \tag{2.17}$$
where, after a little manipulation and using equations 2.14 and 2.15,

$$\alpha_w(V) = 0.5\,\phi\,\cosh\bigl((V - V_3)/(2V_4)\bigr)\,\bigl[1 + \tanh\bigl((V - V_3)/V_4\bigr)\bigr],$$
$$\beta_w(V) = 0.5\,\phi\,\cosh\bigl((V - V_3)/(2V_4)\bigr)\,\bigl[1 - \tanh\bigl((V - V_3)/V_4\bigr)\bigr]. \tag{2.18}$$

(With these rates, α_w + β_w = φ/τ_w and α_w/(α_w + β_w) = w_∞, so equation 2.17 is exactly equivalent to equation 2.13.) Now we retain only equation 2.12, and replace equation 2.13 (or its alternate form, 2.17) by the kinetic scheme

$$\text{closed} \;\underset{\beta_w}{\overset{\alpha_w}{\rightleftharpoons}}\; \text{open}, \tag{2.19}$$
where α_w(V) and β_w(V) are given by equations 2.18. We used ρ_K = 20 channels/µm2, typically with area = 50 µm2, so the number of potassium channels N_K = 1000. The stochastic Morris-Lecar model in voltage-clamp mode is obtained as for the Hodgkin-Huxley model. Six different random number generators were tested and the resulting histograms compared for I = 0. Since no significant differences were found, all subsequent simulations were done with the Mersenne Twister (Matsumoto & Nishimura, 1998). Most runs were done using double precision (64 bits), but with the larger areas of over 1000 µm2 (hence, larger channel numbers and very small times between Markov state changes), we changed to long double precision (128 bits).
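A quick numerical check of the decomposition in equation 2.18 (as reconstructed here) confirms that the open fraction and the relaxation rate of the two-state gate match w_∞ and φ/τ_w; this snippet is our own, included only to validate the algebra:

```python
import numpy as np

V1, V2, V3, V4, phi = -1.2, 18.0, 2.0, 30.0, 0.04

def w_inf(V):  return 0.5 * (1.0 + np.tanh((V - V3) / V4))
def tau_w(V):  return 1.0 / np.cosh((V - V3) / (2.0 * V4))

def alpha_w(V):
    return 0.5 * phi * np.cosh((V - V3) / (2 * V4)) * (1 + np.tanh((V - V3) / V4))

def beta_w(V):
    return 0.5 * phi * np.cosh((V - V3) / (2 * V4)) * (1 - np.tanh((V - V3) / V4))

V = np.linspace(-80.0, 40.0, 7)
assert np.allclose(alpha_w(V) + beta_w(V), phi / tau_w(V))            # total rate
assert np.allclose(alpha_w(V) / (alpha_w(V) + beta_w(V)), w_inf(V))   # open fraction
print("equation 2.18 is consistent with equations 2.13-2.15")
```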
2.4 Hodgkin-Huxley Phase Space and Q Space. In the deterministic HH model, phase space is the set of 4-tuples (V, m, h, n). In the stochastic HH model, the natural space to work with is the space of 3-tuples (V, Q_Na, Q_K), where Q_Na = N_m3h1 is the number of open Na channels and Q_K = N_n4 is the number of open K channels. Knowing N_Na and N_K, we can move from (V, m, h, n) space to (V, Q_Na, Q_K) space via

$$Q_{Na} = m^3 h\,N_{Na}, \qquad Q_K = n^4 N_K, \tag{2.20}$$

by equation 2.7. There is no single 1-1 map from (V, Q_Na, Q_K) back to (V, m, h, n) space. One choice is the following, assuming the channel state numbers N_m0h1, N_m1h1, N_m2h1, and N_m3h0 are known. For the single-gate open probability n, assuming symmetry among the gates, use

$$n = \left(\frac{Q_K}{N_K}\right)^{1/4}.$$

For sodium channels, by diagram 2.10, we use

$$h = \frac{N_{m0h1} + N_{m1h1} + N_{m2h1} + Q_{Na}}{N_{Na}}$$

and

$$m = \left(\frac{Q_{Na} + N_{m3h0}}{N_{Na}}\right)^{1/3}.$$
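These maps are one-liners in code; the sketch below (our own helper functions) implements equation 2.20 and the inverse choice just described.

```python
import numpy as np

def q_from_gates(m, h, n, N_Na, N_K):
    """Equation 2.20: gate variables -> open-channel counts."""
    return m**3 * h * N_Na, n**4 * N_K

def gates_from_q(Q_Na, Q_K, N_m0h1, N_m1h1, N_m2h1, N_m3h0, N_Na, N_K):
    """One (non-unique) inverse map, assuming symmetry among the gates."""
    n = (Q_K / N_K) ** 0.25
    h = (N_m0h1 + N_m1h1 + N_m2h1 + Q_Na) / N_Na
    m = ((Q_Na + N_m3h0) / N_Na) ** (1.0 / 3.0)
    return m, h, n

# Forward map at approximate HH resting values (m, h, n ~ 0.053, 0.596, 0.317):
Q_Na, Q_K = q_from_gates(0.053, 0.596, 0.317, N_Na=24000, N_K=7200)
print(Q_Na, Q_K)   # a handful of open Na channels, roughly 70 open K channels
```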
3 Results

When the ISI time series used for Figure 1A is converted to hertz and the frequency histogram plotted, two significant peaks appear (see Figure 1B); the inset shows the original case with I = 0 and A = 100 µm2 (Chow & White, 1996), with a barely noticeable second peak. In both histograms, the second peak is in the gamma-frequency range (40–100 Hz). This suggested plotting the frequency histograms for a wide range of input currents (see Figure 2), which revealed that the right-hand peak frequency increased linearly with input current. Next, we plotted the right-hand peak frequency against input current for two areas, A = 100 and A = 400 µm2 (see Figure 3, top two traces). On the same graph, the frequency of the deterministic Hodgkin-Huxley equations (Rinzel & Miller, 1980) is plotted, showing that for I ≥ 7 µA/cm2, the right-hand peaks in the frequency histogram closely track, at a slightly higher frequency, the deterministic frequency.
Figure 1: A prominent initial peak in ISI histograms generates a peak in the frequency histogram in the 40–100 Hz range. (A) Smoothed sliding histogram of ISIs generated by the stochastic HH model with I = 3 µA/cm2 and area A = 400 µm2 (corresponding channel numbers N_Na = 24,000, N_K = 7,200). The histogram was generated by sliding a bin of width 2 ms along the data in steps of 0.2 ms and then smoothing it slightly. For most histograms, n = 30,000 data points were used, except when the mean ISI was higher than a few seconds, when n ranged as low as 3,000. Inset: Area = 100 µm2, I = 0, the case considered by Chow and White (1996). All histogram bins are normalized so each curve is an approximate probability density function. (B) Smoothed sliding histograms of the frequencies obtained from the ISI data sets used in A, where frequency = 1000/ISI.
Figure 2: Frequency histograms for input current I = 3, 4, ..., 10 µA/cm2, A = 400 µm2. The numerals indicate the current corresponding to each histogram curve.
Figure 3: Comparison of right-hand peaks of the frequency histograms in Figure 2 with the frequency of the deterministic Hodgkin-Huxley model. Squares: right-hand peaks plotted against input current I for area = 400 µm2. Circles: right-hand peaks for area A = 100 µm2. The solid curve is the deterministic frequency. The left and right vertical dotted lines are where the deterministic frequency drops to 0 as I decreases (left line) or jumps up from 0 as I increases (right line). For the meaning of Iν and I1, see the text and Figure 6.
Figure 4: Sample traces for different input currents I. Area = 400 µm2, except for the bottom trace, which has area = 2000 µm2. Numerals at the left give the input current I (µA/cm2).
Figure 4 shows membrane potential traces underlying the ISIs used for the histograms of Figures 1 to 3, and Figure 5 shows basic statistics (mean, coefficient of variation) of the ISIs. The grouping of spikes into gamma-frequency bursts suggested the unlikely possibility that the stochastic HH model was in fact bursty. We disproved this by taking a 30,000-long ISI time series and examining the distribution of gamma-frequency bursts in 10,000 shuffles. The numbers of pairs, triples, 4-tuples, and 5-tuples of short ISIs in the original time series was not significantly different from their overall distributions in the 10,000 shuffles. This was done twice. We conclude that the distribution of short and longer ISIs within any one ISI time series is random. This has also been found in experimental data (Stiefel et al., 2006).
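The shuffle test is straightforward to reproduce. In the sketch below (our own implementation, with an assumed short-ISI cutoff of 25 ms, i.e., instantaneous rates above 40 Hz), the observed count of k-tuples of consecutive short ISIs is compared with its distribution over random permutations of the same series:

```python
import numpy as np

def k_tuple_count(short, k):
    # Number of length-k windows consisting entirely of short ISIs.
    w = np.lib.stride_tricks.sliding_window_view(short.astype(int), k)
    return int((w.sum(axis=1) == k).sum())

def shuffle_test(isi_ms, thresh_ms=25.0, k=2, n_shuffles=10000, seed=0):
    """Compare the observed k-tuple count of short ISIs with shuffled surrogates."""
    rng = np.random.default_rng(seed)
    short = isi_ms < thresh_ms
    observed = k_tuple_count(short, k)
    null = np.array([k_tuple_count(rng.permutation(short), k)
                     for _ in range(n_shuffles)])
    # One-sided p-value: fraction of shuffles with at least as many k-tuples.
    return observed, null.mean(), (null >= observed).mean()

# If the ISI sequence is random, 'observed' falls well inside the null
# distribution for k = 2, ..., 5, as reported in the text.
```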
Figure 5: ISI statistics as a function of input current. (A) Mean ISI for each I. (B) Coefficient of variation (CV). Note the log scale for means. Circles, A = 100 µm2; squares, A = 400 µm2; diamonds, A = 1000 µm2; triangles, A = 2000 µm2; inverted triangles, A = 3000 µm2. The vertical dashed lines at I = Iν and I1 are important values in the subcritical Hopf bifurcation (see Figures 6 and 3). For the vertical arrow, see Figure 6.
4 Mechanism

The frequency plots in Figure 3 strongly suggest that the right-hand peaks in the frequency histograms in Figure 2—and hence the initial narrow gaussian peak in the ISI histogram—are linked to the spiking frequency of the deterministic HH equations. If this is indeed the case, what mechanism underlies the linear continuation of the frequency plot below Iν for the stochastic HH model? Here we describe mechanisms that underlie the tracking of the deterministic frequencies above Iν and the linear extension of the peak noise-induced frequencies to lower input currents, I < Iν, where the deterministic HH model is quiescent. From Figure 1A, an obvious proposal is that there are two processes present—one generating the gaussian initial peak and another the exponential tail, together with a mechanism for switching between them. The gaussian peak could be generated by consecutive noisy traversals of a limit cycle (LC). Commonly, an exponential distribution is generated by a random walk with absorbing barrier; thus, the second process requires a spike to be followed by a random walk to an absorbing barrier followed by another spike. Thus, the second process could be generated by a system whose state was suddenly transported to a fixed point and then undergoes a random walk to an unstable limit cycle (ULC) that is followed by another spike, that is, another orbit of a limit cycle. In a dynamical system, both processes
would be present if the phase portrait has a stable limit cycle (SLC), generating the first process, around a stable fixed point (SFP), with an unstable limit cycle between them—as required in two dimensions by the Bendixson theorem (see Figure 7A)—generating the second process. We will often use vicinity to mean “basin of attraction.” What mechanism could act to switch between these processes? Noise could perturb a trajectory from the vicinity of the SLC across the ULC and into the vicinity of the SFP, an inward switch. When the random walk outward from the SLC crosses the ULC into the vicinity of the SLC, an outward switch has occurred. When the first spike is emitted on entering the vicinity of the SLC, an ISI is generated that marks the end of a spiraling random walk and hence contributes to the exponential distribution. Subsequent traversals of the SLC without crossing the ULC result in short ISIs that contribute to the initial gaussian peak. This description is strictly true only in a two-dimensional system, since the “inside” or “outside” of a limit cycle is not defined in higher dimensions. However, it can be generalized to higher dimensions if we replace “unstable limit cycle” by the “attractor basin boundary” (ABB), a lower-dimensional manifold lying between the basins of attraction of the stable fixed point and the stable limit cycle. It is well known that the HH equations undergo a subcritical Hopf bifurcation as input current is increased (Hassard, 1978; Troy, 1978; Rinzel & Miller, 1980). The bifurcation diagram in Figure 6 (Guckenheimer & Oliva, 2002) shows how the voltage coordinate of the fixed point, and the maximum and minimum voltages reached by the stable and unstable limit cycles, change as input current I is increased. The subcritical Hopf bifurcation occurs at input current I1 , while Iν , Iν < I1 , is the current where the stable and unstable limit cycles merge. No limit cycle is present when I < Iν . For I between Iν and I1 (case A in Figure 6), there are a stable LC, an unstable LC, and a stable fixed point, except for a small interval J of length at most 0.12 µA/cm2 and lying strictly between Iν and I1 , which has two extra unstable LCs. Thus, for values of I in the middle interval of the subcritical Hopf bifurcation (Iν < I < I1 ), outside of the small subinterval J, the phase portrait has exactly the structure shown in Figure 7A for an ISI histogram of the form in Figure 1A. Could the transitions from fixed point to stable limit cycle, and vice versa, ever occur as a result of channel noise? A cursory glance at Figure 6A suggests that the distance between SFP and ULC might be too large compared to the noise amplitude. However, on remembering that the ULC and SLC are four-dimensional orbits around the SFP and that the voltage ranges of both the unstable and stable limit cycles include the fixed-point voltage, the question becomes: Are there portions of the SLC where the minimal distance between SFP and SLC is small enough that channel-noise-induced perturbations can plausibly cause the state to switch from one attractor basin to the other? There are two requirements. First, the minimal distance
Figure 6: Bifurcation diagram of the deterministic HH model, for parameter I. Solid line, stable fixed point (SFP). Long-dashed line, unstable fixed point (UFP). Dot-dashed lines, extreme values of V on the stable limit cycle (SLC). Short-dashed lines, extreme V values on the unstable limit cycle (ULC). The S-bend in the ULC curve above the vertical arrow implies that three ULCs exist over a small range of input current. The vertical dashed lines at I = I1, Iν divide the range of I into three cases: A, B, and C. The subcritical Hopf bifurcation occurs at I = I1, and the stable and unstable limit cycles merge at I = Iν. Adapted, with permission, from Guckenheimer and Oliva (2002).
δ1 from the SFP to the ABB must be small enough to permit a noise-induced "out" transition out of the SFP vicinity across the ABB into the SLC vicinity, with nonzero probability. Second, the minimal distance δ2 from SLC to the ABB must be small enough for a noise-induced "in" transition into the SFP basin. It is not necessary for the two minima to occur at the same point or region of the ABB. Sketches in Figures 7B and 7C illustrate this point in two dimensions. A detailed description of this mechanism will not be given for the four-dimensional HH model. Instead the mechanisms are illustrated in a simpler, two-dimensional case: the class II Morris-Lecar model (Rinzel & Ermentrout, 1998), which also undergoes a subcritical Hopf bifurcation. Here the values for Iν and I1 are approximately 88.3 and 93.85 µA/cm2. We constructed a stochastic version of the Morris-Lecar equations, 2.12 to 2.15, and obtained ISI data distributions similar to those we obtained for the stochastic HH model.
Figure 7: Phase plane sketches in region A (Iν < I < I1; see Figure 6). Dot, stable fixed point (SFP); continuous circle, stable limit cycle (SLC); dashed circle, unstable limit cycle (ULC). (A) Minimal phase portrait implied by the ISI histogram in Figure 1A. The distances between SFP and ULC and between ULC and SLC are large relative to noise amplitude. (B) There is a single switching region where the distances between SFP, SLC, and ULC are of the same order as the noise. Thus, noise can easily perturb a trajectory from the vicinity of the fixed point over the ULC and "out" to the vicinity of the SLC and, vice versa, perturb a trajectory near the SLC "in" to the vicinity of the SFP. (C) The switching region for an "in" movement of a trajectory could be different from the switching region for an "out" movement.
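The two-process picture can be illustrated with a toy calculation: first-passage times of a noisy walk to an absorbing barrier are close to exponentially distributed when escape is rare, which is the proposed origin of the ISI tail. The sketch below is entirely our own toy model (not the stochastic HH or Morris-Lecar system) and measures the CV of such escape times:

```python
import numpy as np

def escape_times(n_trials=500, barrier=0.6, sigma=0.3, dt=0.01, seed=0):
    """First-passage times of an Ornstein-Uhlenbeck walk (relaxing toward a
    'fixed point' at x = 0) to an absorbing barrier, mimicking noise-driven
    escape from the SFP vicinity across the basin boundary."""
    rng = np.random.default_rng(seed)
    x = np.zeros(n_trials)
    t = np.zeros(n_trials)
    alive = np.ones(n_trials, dtype=bool)
    while alive.any():
        n = alive.sum()
        x[alive] += dt * (-x[alive]) + sigma * np.sqrt(dt) * rng.standard_normal(n)
        t[alive] += dt
        alive &= (x < barrier)
    return t

T = escape_times()
# For rare escapes the first-passage distribution is nearly exponential,
# so the CV approaches 1; this is the proposed origin of the ISI tail.
print("CV of escape times:", T.std() / T.mean())
```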
Compare Figure 8A with Figure 1A, Figure 8B with Figure 2, and Figure 8C with Figure 3. Qualitatively, the match is quite good. It might be objected that the Morris-Lecar model does not have a J-interval with more than two limit cycles. We do not think this an important objection because in our use of the HH model, we have not seen any indication of unusual ISI statistics when the injected current I is within the J-interval in the HH model. We conclude that the global dynamics of the Morris-Lecar model should serve as a good, or at least valuable, guide to the dynamics of the stochastic HH model that underlie the ISI distributions of Figure 1.

Case A: Iν < I < I1. In Figure 9A(1), the small dotted rectangle shows that the stable fixed point and the stable and unstable limit cycles are very close together for a short section of the ULC. Therefore, in this region of phase space, noise with amplitude of the same order as the distance between the SFP and the SLC will frequently cause the trajectory to switch from the SLC vicinity to the stable fixed-point vicinity and vice versa. A region with this property will be referred to as a switching region. Figure 9A(2) shows an example of noisy switching and Figure 9A(3) the corresponding voltage trace. Once the trajectory switches to the inside of the ULC, in the absence of further noise, it spirals in toward the SFP. Consider a fixed radius from the SFP through the closest point on the ULC and extending out beyond
Figure 8: ISI data from the stochastic Morris-Lecar model has similar qualitative properties to ISI data from the stochastic HH model. (A) ISI histogram for I = 82.0, area = 50 µm2 (1000 K+ channels). The long exponential tail, cut off at 500 ms, extends out to 10,000 ms. (B) Superimposed frequency histograms for I = 82, 84, ..., 98, area = 50 µm2. (C) Plot of the right-hand peaks in B (circles) compared with the frequency of the deterministic Morris-Lecar model. Iν ≈ 88.29 and I1 ≈ 93.86.
the ULC. Due to noise, successive intersections of the spiraling trajectory with this radius generate a one-dimensional random walk that terminates when the next intersection with the radius lies outside the ULC, in the SLC attractor basin. Now, in the absence of another perturbation back inside the ULC, the trajectory rejoins the SLC and emits another spike.

Case B: I1 < I. Here there is no ULC, only an SLC and a UFP (see Figure 9B(1)). However, the fixed point is only weakly unstable, so if the trajectory is in its vicinity, it remains close for a significant duration, while the trajectory describes a slowly expanding spiral (see Figure 9B(1), inset). See, for example, the initial segment of the voltage trace in Figure 9B(3) before the first spike. Thus, noise on w keeps the state close to this fixed point for prolonged periods until eventually the outward-expanding spiral "wins" (see Figure 9B(2))—the random walk is terminated—and the trajectory returns to the SLC, whose next spike generates an ISI contributing to the negative exponential tail in the ISI PDF. Subsequent noisy ISIs contribute to the gamma-frequency peak in the frequency PDF.

Case C: I < Iν. Now there is no limit cycle, just a single SFP. There are, however, remnants of the limit cycle, termed "ruts," due to the way that trajectories starting from a wide range of initial conditions converge onto a rut. Figure 9C(1) shows two ruts in the deterministic Morris-Lecar phase space and several trajectories converging onto each one. In a rut, the vector field is strongly contracting transverse to the trajectory. In the region of phase space to the right of the fixed point, there is no transverse contraction, so initially separate trajectories remain parallel and separate (see Figures 9C(1) and 9C(2)). In a similar case (I = 88.2 instead of 82.0), Tateno and Pakdaman (2004) show the distribution of the ML dynamics in phase space: ruts appear as small ridges in approximately the same position, while in the nonrut (noncontraction) areas, the ridges disappear due to the distribution being spread out. Any trajectory starting more than about 0.03 vertically below the fixed point emits a spike and enters the rut that constitutes the downstroke of the MLE spike. The arrow in the inset graph shows the location of the "threshold" between trajectories that do and do not generate spikes.

4.1 Where Is the Switching Region in the Hodgkin-Huxley Model? The mechanisms just described for the two-dimensional (2D) stochastic Morris-Lecar model generalize, approximately, to the 4D stochastic HH model. An exact description is beyond the scope of this letter, in part because 4D objects are difficult to visualize, but also because in 4D, the boundary between two attractor basins (e.g., between an SFP and an SLC) is a 3D manifold, which is very likely to be fractal (McDonald, Grebogi, Ott, & Yorke, 1985; Guckenheimer & Oliva, 2002) in some range of I. Another difficulty in case A of the HH subcritical Hopf bifurcation (see Figure 6) is this: the ULC is a 1D object embedded in the 3D boundary between the basins of attraction of the SFP and the SLC, but a trajectory switching from
one basin to the other might not pass close to the ULC, despite appearances in Figure 6. Corresponding to case C in the Morris-Lecar model (see Figure 9C), Figure 10 shows the projection of a noise-free HH trajectory (I = 6 µA/cm2) onto the (m, n)-plane—one of six possible 2D projections—that emits one spike and then spirals into the SFP. The projection of the switching region where small perturbations can generate a spike is enclosed in the ellipse in the inset. There is no ABB since there is only one attractor. However, the dashed line segment indicates something similar: the (2D projection of the) boundary between trajectories that emit a full-blown spike and those continuing with small-amplitude spirals into the SFP. The dashed line segment in Figure 10 is like a threshold. However, the concept of a threshold is problematic for a system with more than one variable because the hidden variables may not have their expected values (see Guckenheimer & Oliva, 2002, for a rigorous definition). If the system is in the vicinity of the SFP with coordinates (V0, m0, h0, n0) but previous perturbations have caused (m, h, n) to momentarily have values different from (m0, h0, n0), then the effective V threshold to generate a spike may now be very different. This is seen experimentally: in cortical neurons, the previous voltage behavior can affect spike threshold by as much as 10 mV (Azouz & Gray, 2000). Also, in a very small I-interval within the S-turns in the ULC amplitude plot (see the thick arrow in Figure 6), there is a difficulty of a different kind: with I ≈ 7.9219, Guckenheimer and Oliva (2002) found a chaotic invariant set and conjectured that there are regions of phase space where every neighborhood, however small, contains two kinds of states:
Figure 9: Switching regions in Morris-Lecar phase portraits. The parameters are as in Rinzel and Ermentrout (1998), subcritical Hopf bifurcation. For one instance of each case A, B, C (see Figure 6), we show (1) the phase portrait of the deterministic (noise-free) system, with the switching region shown by a small dotted rectangle and expanded as an inset graph; (2) a noisy trajectory overlaid on the noise-free phase portrait; (3) a noisy voltage trace corresponding to the noisy trajectory in graph 2. In each case, graphs 1 and 2 have identical scales. Scale bars in graph 1: horizontal bars, main graph 10 mV, inset graph 1 mV; vertical bars, main graph 0.1, inset graph 0.01. Case A: I = 90 µA/cm2, which has an SFP, an SLC, and a ULC. Case B: I = 98 µA/cm2, a UFP and an SLC. The solid line shows a trajectory spiraling out from near the UFP to join the SLC; the latter is shown by dots at equal time intervals. Case C: I = 82 µA/cm2. Here there is only a stable fixed point and no LC, but "ruts" in phase space enable noise-induced spikes to occur. Segments of nine trajectories are shown, starting from points very close to the SFP. Six fall into the upper rut and emit a spike, while the remaining three immediately spiral into the SFP. The small arrow in the inset graph indicates the threshold between spiking and nonspiking initial points.
Figure 10: Hopf case C in the Hodgkin-Huxley model with I = 6 µA/cm2, so there is no LC. (A) Projection of the deterministic trajectory onto the (m, n) phase plane. (B) Expansion of the region around the SFP. The dashed curve indicates the approximate location of the attractor basin boundary or "threshold"; the dotted ellipse indicates the switching region in this projection.
those leading to action potentials and those leading to the stable fixed point. If true, then, mathematically, no threshold can be defined for this value of the injected current. To truly locate a switching region for case A of the Hopf bifurcation, one must compute the ABB between the SFP and the SLC and then measure two minimal distances: δ1 from the SFP out to the ABB and δ2 inward from the SLC to the ABB. The endings of δ1 and δ2 in the ABB may be close, but with current knowledge, they could be at different locations (see the sketch in Figure 7C). If the latter, then switching in and out of the vicinity of the SFP will occur at different phases of the spiking cycle. Cases B and C require different definitions of the switching region, which we will not attempt here. We avoid these difficulties by taking a simpler approach: we computed the minimal distance between the fixed point and the stable limit cycle in (V, m, h, n)-space. It is also useful to look at this in three-dimensional (V, Q_Na, Q_K)-, or Q-, space (see Figure 11). If these distances are small enough, then basin switching can be induced by small-amplitude noise. Since no LC exists in case C, only ruts, these plots do not go below Iν. See the location of Iν in Figures 3 and 6.
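For concreteness, this distance computation can be sketched in a few lines of Python. The following is a minimal illustration, not the code used for Figures 11 and 12: it assumes the standard squid-axon HH parameterization (the exact parameters used in this letter may differ), takes I = 7 µA/cm2 as in Figure 12, and uses the channel counts N_Na = 24,000 and N_K = 7200 quoted for area = 400 µm2 in Figure 13.

```python
# A minimal sketch (not the paper's code): integrate the deterministic
# HH equations, discard the transient, and measure the closest approach
# of the stable limit cycle to the fixed point in (V, m, h, n)-space and
# in Q-space. Standard squid-axon HH parameters are assumed.
import numpy as np
from scipy.integrate import solve_ivp
from scipy.optimize import fsolve

C, gNa, gK, gL = 1.0, 120.0, 36.0, 0.3      # uF/cm^2, mS/cm^2
ENa, EK, EL = 50.0, -77.0, -54.4            # mV
I = 7.0                                     # uA/cm^2
NNa, NK = 24000, 7200                       # channel counts, 400 um^2

def hh(t, y):
    V, m, h, n = y
    am = 0.1 * (V + 40.0) / (1.0 - np.exp(-(V + 40.0) / 10.0))
    bm = 4.0 * np.exp(-(V + 65.0) / 18.0)
    ah = 0.07 * np.exp(-(V + 65.0) / 20.0)
    bh = 1.0 / (1.0 + np.exp(-(V + 35.0) / 10.0))
    an = 0.01 * (V + 55.0) / (1.0 - np.exp(-(V + 55.0) / 10.0))
    bn = 0.125 * np.exp(-(V + 65.0) / 80.0)
    dV = (I - gNa * m**3 * h * (V - ENa) - gK * n**4 * (V - EK)
          - gL * (V - EL)) / C
    return [dV, am*(1-m) - bm*m, ah*(1-h) - bh*h, an*(1-n) - bn*n]

# Fixed point: a zero of the vector field near rest (stable or not).
fp = fsolve(lambda y: hh(0.0, y), [-65.0, 0.05, 0.6, 0.32])

# Stable limit cycle: integrate past the transient and keep the tail.
sol = solve_ivp(hh, (0.0, 300.0), [20.0, 0.1, 0.5, 0.4], max_step=0.01)
V, m, h, n = sol.y[:, sol.t > 250.0]

# Distance in the full four-dimensional phase space.
cycle4 = np.vstack([V, m, h, n])
d4 = np.linalg.norm(cycle4 - fp[:, None], axis=0)

# Distance in three-dimensional Q-space: QNa = m^3 h NNa, QK = n^4 NK.
def qspace(V, m, h, n):
    return np.vstack([V, m**3 * h * NNa, n**4 * NK])

dQ = np.linalg.norm(qspace(V, m, h, n) - qspace(*fp)[:, 0:1], axis=0)
print("min distance, 4D: %.3f   Q-space: %.1f" % (d4.min(), dQ.min()))
```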
Figure 11: These two curves track the closest approach of the HH stable limit cycle to the fixed point for 6.5 ≤ I ≤ 15, in two deterministic phase spaces. The lower trace uses the standard four-dimensional space (V, m, h, n), while the upper uses the three-dimensional space (V, Q_Na, Q_K), where Q_Na = m^3 h × N_Na and Q_K = n^4 × N_K. Area = 400 µm2.
As input current increases from Iν to 15, the minimal distance in (V, m, h, n)-space from the FP to the SLC increases from 0.04 to nearly 0.07, while the minimal distance in Q-space increases from under 7 to over 12. In Q-space, "minimal distance X" means X channel openings or closings in the direction of the fixed point or SLC, depending on whether the trajectory is going "in" or "out." If one supposes that the ABB or ULC lies approximately halfway (see footnote 1), then only X/2 channel changes are needed. Thus, the expected probability of basin switching goes down as I increases, in agreement with the sequence of traces in Figure 4. More exactly, the probability of switching "in" from the SLC vicinity to the FP vicinity becomes lower, while the probability of an "out" switch decreases less, due to the weakening of the FP stability. With I = 7 µA/cm2, the distance from the SFP to points on the SLC during one spiking cycle is shown in Figures 12B and 12C. The distance in (V, m, h, n)-space is dominated by the voltage V, since 0 ≤ m, h, n ≤ 1, whereas the range of V is over 100 mV. Thus, the minimal distance occurs at two downward spikes created when V crosses the SFP rest potential
1 Here I assume that basic dynamical features such as limit cycles in (V, m, h, n)-space map into similar features in Q-space.
Figure 12: (A) The voltage trace for the deterministic HH equations with I = 7 µA/cm2 (thick line). The dotted line is a trace from the stochastic HH model for the same I and with area = 400 µm2, but shifted up for clarity. (B) The Euclidean distance in (V, m, h, n)-space from the fixed point (FP) to the stable limit cycle as two spikes occur. (C) The thick line is the distance to the stable limit cycle in the three-dimensional Q-space (V, Q_Na, Q_K), where Q_Na = m^3 h N_Na and Q_K = n^4 N_K. The dotted line is the distance in Q-space from the FP to the Q-space trajectory whose V-component is shown dotted in A. This trace has been shifted up. The arrow-headed line indicates an interval of over 6 ms where the distance in Q-space is under 20. The origin for time coincides with the peak of the first spike. The minimum distance in Q-space from the FP to the dotted trajectory is 3.24.
V_0. In Q-space, the range of V is negligible relative to the Q_Na and Q_K ranges (see Figure 13), and quite different locations of the minimal distance are found. The minimum distance is 6.8 at 9 ms into the 18 ms spike cycle, with another minimum of 8.5 channel steps at 14 ms; moreover, the Q-distance remains below 20 between these minima. Overall this gives a 6 ms interval with Q-distance always under 20. Most switches between the SLC and the SFP occur in this interval; consequently, the approximate location of the switching region is known.
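Under the same assumptions, curves in the style of Figures 12B and 12C can be traced by plotting the two distance arrays from the earlier sketch over the kept portion of the trajectory (the names sol, d4, and dQ carry over from that sketch):

```python
# Continuing the earlier sketch: plot the two distances over the kept
# tail of the trajectory, as in Figures 12B and 12C. The interval where
# the Q-distance stays below 20 channel steps brackets the switching
# region described in the text.
import numpy as np
import matplotlib.pyplot as plt

t = sol.t[sol.t > 250.0]                    # same mask as for d4, dQ
fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True)
ax1.semilogy(t, d4)
ax1.set_ylabel("4D distance")
ax2.semilogy(t, dQ)
ax2.axhline(20.0, linestyle=":")            # the 20-channel-step level
ax2.set_ylabel("Q-space distance")
ax2.set_xlabel("time, ms")
plt.show()
print("time with Q-distance < 20: %.1f ms"
      % (np.diff(t).mean() * (dQ < 20).sum()))
```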
Figure 13: Comparison of deterministic and stochastic trajectories in Q-space. The thick gray line shows the deterministic LC in (V, Q_Na, Q_K) phase space when I = 7 µA/cm2, and the thin black line is a path from the stochastic HH model. The area was 400 µm2, so N_Na = 24,000 and N_K = 7200. The fixed point, shown by a circled cross, is visible in the expanded views D, E, and F. In the (V, Q_Na) plane, D, the fixed point appears to be on the limit cycle, not inside, but this is merely due to the orthogonal projection being used. The stochastic trajectory used here is an extension of the one shown in Figure 12.
Figures 12A and 12C show the voltage and the distance to the SFP, respectively, for two spikes generated by the stochastic HH model. Most of the jitter occurs in the arrow-headed interval, as expected, and the trajectory
comes within 3.24 of the SFP, half the minimum distance of the deterministic trajectory. It is not trapped in the SFP vicinity but instead continues to cycle.

4.2 ISI and Channel Noise Statistics. Figure 5 shows how the overall mean and CV of the ISIs vary with current and membrane area. Since there is no geometry associated with the model, an increase in area simply means an increase in the number of channels, which in turn means that the channel noise has a lower standard deviation and a higher frequency (see Figure 14). For I > I1, the noise-free dynamics is continuous spiking, so smaller-amplitude noise means a smaller probability of leaving the SLC or of spending any significant time near the weakly unstable FP. Since the intervals between continuous spiking (ICSs) become fewer and shorter, the CV quickly goes to zero as area increases. In contrast, as I decreases below I1, the ICSs can become arbitrarily long, as can be seen when the area is high (A = 3000), resulting in very high CV. For I < Iν, the noise-free dynamics rests at the SFP, so the CV curves reflect the introduction of single spikes into a primarily nonspiking system. For case A, Iν < I < I1, the two opposing tendencies result in CV curves whose maxima increase with increased area (reduced noise amplitude/increased noise frequency).

From Figure 5B, it is tempting to speculate that as the noise amplitude tends to zero, the CV curve tends toward a delta function centered in the region of the bifurcation diagram where chaos and an undefinable threshold occur. This is not the case. The probability density

ρ(t) = α δ(t) + (1 − α) λ e^{−λt}
(4.1)
with λ > 0 and 0 < α < 1, where δ(t) is the Dirac delta, has a coefficient of variation given by [(1 + α)/(1 − α)]^{1/2} (its mean is (1 − α)/λ and its second moment is 2(1 − α)/λ^2), which is always larger than one and becomes arbitrarily large as α approaches 1 (Wilbur & Rinzel, 1983). If we replace the initial gaussian peaks in the ISI histograms by tall Dirac delta functions, the probability density of equation 4.1 is a rough match to the histograms. So it is to be expected that for some combinations of area and current, the ISI histograms will have arbitrarily large CV.

4.3 Characteristics of Channel Noise. It is difficult to give a concise description of channel noise occurring in a spiking neuron. First, the time between changes in the number of open channels varies by nearly four orders of magnitude when area = 400 µm2, from a minimum of less than 10^{−6} ms at the action potential peak to a maximum of about 0.06 ms, 3 to 5 ms after the deepest hyperpolarization. Second, the size of the Na-current passing when a single channel opens or closes varies according to the distance from the Na+ reversal potential. Thus, the magnitude of a single Na-channel current varies by a factor of six. It is smallest at the
Figure 14: Channel noise statistics when the stochastic HH model is held in voltage clamp. In A and B, the voltage was fixed at −60.78 mV (the steady-state value when I = 7 µA/cm2), and the CV and frequency of the Na- and K-activation are plotted as functions of area. The data in A are fitted with curves of the form a/√(Area − b). In C, D, E, and F, the area is 400 µm2, while the potential varies from −75 to +31 mV. (C) Frequency. (D) Mean activation. (E) Standard deviation. (F) CV. In D, the mean Na-activation is magnified ×100. In E, the standard deviation of the Na-activation is magnified ×10.
peak and largest at the lowest potentials. Similar considerations apply to the K-current. We sidestepped the issue of characterizing channel noise in a spike cycle by adopting an experimental technique: the voltage clamp. For each membrane area, we clamped the neuron at −60.78 mV, the steady-state potential for I = 7 µA/cm2, and then recorded the Na- and K-activations and computed their coefficient of variation (CV) and frequency. Figures 14A and 14B show the CV and frequency as a function of area, and hence of N_Na or N_K. As expected, the CV decreases as the inverse square root of the area, while the frequency increases linearly. In Figures 14C to 14F, the area is fixed, and the potential ranges from −75 to +31 mV. Figure 14C shows the frequency, while Figure 14D plots the mean activations, which exactly match the deterministic activations: m^3 h for the Na-current and n^4 for the K-current. The curves in Figures 14C, 14D, and 14F depend only on the rate functions {α_x, β_x : x = m, h, n} in equations 2.4 to 2.6, but the relationship remains to be elucidated.

4.4 Oscillatory Histograms. In Figure 1A, there is a small, subsidiary peak after the initial peak, which is seen in all ISI histograms we have examined. In some cases, a third, poorly defined peak can be discerned, and when large numbers of channels are used, as in Figure 15, at least three subsidiary peaks can be seen (note the logarithmic scale in Figure 15B used to make these very small peaks visible). On a gross scale, these subsidiary peaks are simply part of the exponential tail. On a finer scale, they stand out and can be understood in terms of our proposed mechanism. The initial peak arises from bursts of consecutive traversals of the SLC. The second peak arises when the trajectory leaves the SLC vicinity after one traversal and spirals round in the vicinity of the FP for one cycle; as the next cycle starts, it switches back to the SLC vicinity, thus creating an ISI roughly twice as long as the period of the SLC. Compared to the first two peaks, the probability of a second spike occurring at time T = 1.5 × (SLC period) is small; this results in the significant dip between the first two peaks. That this dip does not reach zero is due to noise-induced phase randomization, which presumably increases with time, causing the subsidiary peaks to be smoothed out as the time since the last spike increases. The Na channel numbers in Figure 15 are 60,000, 120,000, and 180,000. The CV for A = 3000 is 2.8.

4.5 Gaussian Noise Has the Same Effect as Channel Noise. We integrated the deterministic HH equations with a gaussian noise term in equation 2.1. In Figure 16, the histogram obtained with gaussian noise and the histogram obtained with channel noise in Figure 1, both with I = 3 µA/cm2, are plotted together. This shows that the effects of channel noise and gaussian noise are essentially indistinguishable.
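As a rough check of this equivalence, the deterministic equations can be driven with additive gaussian noise using a fixed-step Euler scheme. The sketch below reuses hh() and the globals from the earlier sketch, sets I = 3 µA/cm2 to match Figure 16, and reads the Figure 16 caption's "standard deviation 0.1" as a per-step gaussian perturbation of V (other readings, e.g., scaling with √dt, are possible); the spike detector and seed are illustrative choices, not part of the letter.

```python
# Euler integration of the HH equations with additive gaussian noise on
# V, reusing hh() from the earlier sketch. hh() reads the module-level
# I, so rebinding it here switches the drive to 3 uA/cm^2.
import numpy as np

I = 3.0                                     # uA/cm^2 (matches Figure 16)
rng = np.random.default_rng(0)
dt, T_total, sigma = 0.01, 5000.0, 0.1      # ms, ms, per-step noise s.d.
y = np.array([-65.0, 0.05, 0.6, 0.32])
spikes, armed = [], True
for k in range(int(T_total / dt)):
    y = y + dt * np.asarray(hh(k * dt, y))  # deterministic Euler step
    y[0] += sigma * rng.standard_normal()   # additive voltage noise
    if armed and y[0] > 0.0:                # upward crossing of 0 mV
        spikes.append(k * dt)
        armed = False
    elif y[0] < -40.0:                      # re-arm after repolarization
        armed = True
isis = np.diff(spikes)
if isis.size > 1:
    print("CV of ISIs:", isis.std() / isis.mean())
```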
Figure 15: ISI histograms for large areas (1000, 2000, and 3000 µm2), shown on a linear scale in A and a logarithmic scale in B.
Figure 16: An ISI histogram nearly identical to the histogram of Figure 1A is obtained when gaussian noise is added to the HH current balance equation, 2.1. Noise with standard deviation 0.1 was used, with integration time step 0.01 ms. This histogram curve, in gray, is overlaid on the curve from Figure 1A, in black. The inset graph shows the gray curve again on a log scale.
4.6 On the Existence of Ruts. Every SLC has a neighborhood (an annulus in a 2D system or a tube in a 3D system) where the vector field transverse to the SLC is contracting everywhere, with different strengths at different phases. A ULC has a neighborhood where the vector field transverse to the ULC is expanding everywhere. When a bifurcation occurs and the SLC no longer exists, this means that somewhere along the SLC, the width of the neighborhood of contraction has shrunk to zero. In the geographical view of a dynamical system that depends continuously on parameters (in which an SFP is a hollow and a UFP a rounded peak), there is a "valley" at every point of an SLC; as a parameter I passes through a bifurcation value, at some point of the SLC the valley opens out and becomes flat. Except in a completely symmetric system, which is unlikely to occur in applications, there will still be regions close to where the SLC used to be with transverse contracting neighborhoods still in existence. These are the ruts. In this argument, we appeal only to the continuous dependence of the vector field on the parameter I and to the disappearance of an SLC, not to the width of the interval [Iν, I1]. The sudden disappearance of an SLC as I decreases is sufficient to ensure the existence of ruts, independent of the width of the region with two or more attractors.

5 Discussion

We have shown that the stochastic HH model generates a bimodal ISI distribution and have described in qualitative terms the underlying dynamical mechanism. Although the description is based on a two-dimensional model, our extensive modeling data suggest that the same dynamical mechanism will be present in more complex models, provided a subcritical Hopf bifurcation and a switching region are present. More formally, we conjecture that a bimodal ISI distribution is generated in any noisy system that passes through a subcritical Hopf bifurcation with a switching region. The noise does not need to be channel noise; gaussian noise will suffice (see Figure 16). Our simulations have been done with the stochastic HH equations, whose parameters are based on squid, a cold-blooded creature. The dynamical mechanisms described are quite general and can be expected to apply to any class II neuron, not just the HH model or the squid axon. Thus, any such neuron will have a bimodal ISI distribution in the presence of channel or synaptic noise. In particular, this should apply to all class II cortical cells.

Experimentally, bimodal ISI histograms are common. They have been observed in many mammalian neurons, either in spontaneous activity or in the presence of stimuli: spontaneous activity of a ventro-basal neuron during sleep (Nakahama et al., 1968), ventrolateral thalamic neurons in slow-wave and fast-wave sleep (Lamarre et al., 1971), auditory nerve fiber responses to white noise (Ruggero, 1973), adult rat primary somatosensory cortex (Armstrong-James, 1975), pyramidal cells in rabbit (Bassant, 1976), cat
auditory cortex awake in light (Burns & Webb, 1976) and either spontaneous or in response to clicks (Gerstein & Kiang, 1960; Kiang, 1965), cultured hippocampal cells (Siebler et al., 1993), and rat olfactory receptor neurons (Duchamp-Viret et al., 2005). Recordings that look very similar to the traces in Figure 4, but without accompanying ISI histograms, have been observed in cortical fast-spiking cells (Cauli, Audinat, Lambolez, Angulo, Ropert, Tsuzuki, Hestrin, & Rossier, 1997), stellate cells in medial entorhinal cortex (Klink & Alonso, 1993), and stuttering and irregular-spiking inhibitory cortical interneurons (Markram et al., 2004). In all these observations, the initial gaussian peak is, or would be if the histogram were constructed, in the range of 10 to 30 ms, roughly the gamma-frequency range. A periodic histogram similar to Figure 15 was observed in thalamic somatic sensory neurons (Poggio & Viernstein, 1964). The spontaneous activity of frog olfactory neurons generates similar bimodal histograms on a much slower timescale (3–5 Hz) (Rospars, Lansky, Vaillant, Duchamp-Viret, & Duchamp, 1994), while a motion-sensitive neuron in the fly generates a bimodal histogram on a much faster timescale (150 Hz) (de Ruyter van Steveninck et al., 1997).

Guttman and Barnhill (1970), using space-clamped squid axon treated with low calcium, observed "skip runs," which look very similar to our model traces (see Figure 4), but were unable to explain them by modeling. Guttman, Lewis, and Rinzel (1980), again with squid axon bathed in low calcium, showed the coexistence of two states in the squid axon (repetitive firing and subthreshold oscillations) and used small, well-chosen perturbations to annihilate repetitive firing in both the axon and the model. This confirms the region of bistability in the HH bifurcation diagram (see Figure 6). Gerstein and Mandelbrot (1964) used a random walk model to generate ISI histograms with a nonsimple-exponential tail that fits data from several types of neurons, but not from those under investigation here, which generate bimodal ISI histograms.

Figure 17 shows an ISI histogram that closely matches the form of the ISI histogram in Figure 1A. It was recorded by Kiang (1965) from an auditory nerve fiber in an anesthetized cat and has a timescale approximately eight times faster than the squid axon's. Note that the characteristic dip in the histogram between the initial peak and the exponential tail is clearly visible. It appears from the literature, on the one hand, that bimodal ISI histograms occur in many places in the nervous system; on the other hand, we have shown that with the introduction of a source of stochasticity, the HH equations generate bimodal ISI histograms of the same form, that is, a high initial peak followed by an exponential tail. It seems natural, then, to suggest that the stochastic HH model could be used to model the firing properties of the spike initiation zone of other classes of neurons as well as the squid axon. Obviously, the HH
Figure 17: An ISI histogram recorded from spontaneous firing of an auditory nerve fiber in cat. (Reproduced, with permission, from Kiang, 1965.) Qualitatively, this has the same form as the histograms generated by the HH model (see Figure 1). However, the frequency associated with the initial peak is much higher, approximately 300 to 500 Hz.
parameters would need considerable changes to generate spikes on the timescales of the frog olfactory neurons or the motion-sensitive neurons in the fly. In related work, Yu and Lewis (1989) showed that the response of the HH model could be linearized by the presence of noise: using high-amplitude noise, they showed that the simulation data in Figure 3 extend linearly to near zero frequency. Tateno and Pakdaman (2004) investigated the noisy class II Morris-Lecar model and described the phase-space distribution in the same regimes we investigated; however, ISI distributions were not studied and are not easily deduced from the phase-space distributions. Lee, Neiman, and Kim (1998) studied coherence resonance in the HH model with noisy synaptic inputs and computed a coherence measure from the gaussian peak of the ISI histogram. Tiesinga, Jose, and Sejnowski (2000) investigated, in the deterministic HH model, the effect of different kinds of input noise on the CV and other measures of the output spike train. The work presented here, on the other hand, investigates the effect of the intrinsic neuronal channel noise on the distribution of the output ISIs. Analysis of stochastic dynamical systems (Berglund & Gentz, 2003; Su, Rubin, & Terman, 2004) can give strict estimates of, for example, the time spent in the neighborhood of the unstable fixed point before switching back to the SLC, as in our Hopf case B, but the statement of such an estimate would take us too far afield. In an article to which this one could be considered an extension, Schneidman, Freedman, and Segev (1998) showed that a stochastic HH model qualitatively reproduced the different reliability and precision characteristics of
spike firing in response to DC and fluctuating input in neocortical neurons found by Mainen and Sejnowski (1995). ISI histograms were not examined. In an approximate analysis, Chow and White (1996) derived an exponential distribution for the ISIs generated by channel noise in the subthreshold HH equations. Their result bolstered the common practice in modeling studies of assuming an exponential distribution for spontaneous ISIs. Our work was initiated when we repeated their simulations and, using much longer ISI time series, found the initial peak shown in the inset of Figure 1A (the case they studied). It is hoped that modelers of neural networks will take into account the often bimodal character of the ISI distribution.

Are the channel noise levels used in this study realistic in any sense? Typically, the conductance γ_Na of single biological Na-channels lies between 2 and 20 pS (Hille, 2001). The population conductance of our model neuron, g_Na, was 120 mS/cm2, implying a channel density of 10^9 to 10^10 /cm2. Assuming that a spike initiation zone might occupy 10^3 to 10^4 µm2 (10^{−5} to 10^{−4} cm2), one comes to a projected Na channel number N_Na ≈ 10^4 to 10^6, whereas the maximum number used in our simulations was 180,000 = 1.8 × 10^5. These are, of course, crude approximations. They serve only to argue that our noise levels are not necessarily unreasonable. In many biological neurons, the actual number of Na and K channels occupying the site of spike initiation remains unknown.

6 Summary

We have shown by simulation that the stochastic HH model generates bimodal ISI histograms with a prominent initial peak in the gamma-frequency range, often with a high (greater than 1) coefficient of variation in physiological ranges. We introduced a stochastic version of the Morris-Lecar model that uses perhaps the simplest possible kind of channel noise, which may prove useful in the analysis of channel noise and its effects. We used this Morris-Lecar model to give a simple description of the dynamical mechanisms underlying the generation of bimodal ISI histograms and gave numerical evidence that the dynamical mechanism is the same in the Morris-Lecar and HH stochastic models. We conjecture that the same mechanism arises in any neuron model, however complex, provided only that it undergoes a subcritical Hopf bifurcation (i.e., is class II) and has a switching region. We pointed out that class II neurons and neurons with bimodal ISI histograms are widespread in the nervous system, in particular in cortex and sensory systems, and suggested that the same underlying dynamics may be present. The high CV often associated with bimodal histograms may contribute to the highly irregular firing of cortical cells (Softky & Koch, 1993), in particular to the firing of the stuttering and irregular firing classes
of inhibitory cortical interneurons (Markram et al., 2004). Although we worked here with channel noise, the same properties of the ISI distribution arise with gaussian noise. We conclude that the properties of the stochastic HH model make it a prime candidate for models of mammalian neural networks.

Acknowledgments

I thank Robert Elson for advice at an early stage, Javier Movellan for the use of his cluster of Mac G5s, Jochen Triesch for a crucial early comment, Terry Sejnowski for encouragement, and two referees for constructive criticism. Above all, I thank my partner, Nona, for her unflagging support and keeping it all together.

References

Armstrong-James, M. (1975). Functional status and columnar organization of single cells responding to cutaneous stimulation in neonatal rat somatosensory cortex S1. J. Physiol., 246, 501–538.
Azouz, R., & Gray, C. (2000). Dynamic spike threshold reveals a mechanism for synaptic coincidence detection in cortical neurons in vivo. Proc. Natl. Acad. Sci., 97(14), 8110–8115.
Bassant, M. (1976). Analyse statistique de l'activité des cellules pyramidales de l'hippocampe dorsal du lapin. Electroencephal. Clinical Neurophysiology, 40, 585–603.
Berglund, N., & Gentz, B. (2003). Geometric singular perturbation theory for stochastic differential equations. J. Differential Equations, 191, 1–54.
Best, E. (1979). Null space in the Hodgkin-Huxley equations. Biophysical J., 27, 87–104.
Brette, R. (2003). Reliability of spike timing is a general property of spiking model neurons. Neural Computation, 15, 279–308.
Burns, B., & Webb, A. (1976). The spontaneous activity of neurones in the cat's cerebral cortex. Proc. R. Soc. Lond. B, 194, 211–223.
Cauli, B., Audinat, E., Lambolez, B., Angulo, M. C., Ropert, N., Tsuzuki, K., Hestrin, S., & Rossier, J. (1997). Molecular and physiological diversity of cortical nonpyramidal cells. J. Neuroscience, 17(10), 3894–3906.
Chow, C. C., & White, J. A. (1996). Spontaneous action potentials due to channel fluctuations. Biophysical J., 71, 3013–3021.
de Ruyter van Steveninck, R., Lewen, G., Strong, S. P., Koberle, R., & Bialek, W. (1997). Reproducibility and variability in neural spike trains. Science, 275, 1805–1808.
Dean, A. (1981). The variability of discharge of simple cells in the cat striate cortex. Experimental Brain Research, 44, 437–440.
Destexhe, A., Rudolph, M., Fellous, J., & Sejnowski, T. J. (2001). Fluctuating synaptic conductances recreate in vivo–like activity in neocortical neurons. Neuroscience, 107(1), 13–24.
Dorval, A., & White, J. (2005). Channel noise is essential for perithreshold oscillations in entorhinal stellate neurons. J. Neurosci., 25(43), 10025–10028.
Duchamp-Viret, P., Kostal, L., Chaput, M., Lansky, P., & Rospars, J.-P. (2005). Patterns of spontaneous activity in single rat olfactory receptor neurons are different in normally breathing and tracheotomized animals. J. Neurobiology, 65(2), 97–114.
Fitzhugh, R. (1961). Impulses and physiological states in models of nerve membrane. Biophys. J., 1, 445–466.
Gerstein, G., & Kiang, N. Y.-S. (1960). Approach to the quantitative analysis of electrophysiological data from single neurons. Biophys. J., 6, 15–28.
Gerstein, G., & Mandelbrot, B. (1964). Random walk models for the spike activity of a single neuron. Biophys. J., 4, 41–68.
Gillespie, D. T. (1977). Exact stochastic simulation of coupled chemical reactions. J. Physical Chemistry, 81(25), 2340–2361.
Guckenheimer, J., & Labouriau, I. S. (1993). Bifurcation of the Hodgkin-Huxley equations: A new twist. Bull. Math. Biol., 55, 937–952.
Guckenheimer, J., & Oliva, R. A. (2002). Chaos in the Hodgkin-Huxley model. SIAM J. Applied Dynamical Systems, 1(1), 105–114.
Gutkin, B. S., & Ermentrout, G. B. (1998). Dynamics of membrane excitability determine interspike interval variability: A link between spike generation mechanisms and cortical spike train statistics. Neural Computation, 10(5), 1047–1065.
Guttman, R., & Barnhill, R. (1970). Oscillation and repetitive firing in squid axons. Journal of General Physiology, 55, 104–118.
Guttman, R., Lewis, S., & Rinzel, J. (1980). Control of repetitive firing in squid axon membrane as a model for a neuroneoscillator. J. Physiology, 305, 377–395.
Hassard, B. (1978). Bifurcation of periodic solutions of the Hodgkin-Huxley model for the squid giant axon. J. Theor. Biology, 71, 401–420.
Hille, B. (2001). Ion channels of excitable membranes (3rd ed.). Sunderland, MA: Sinauer.
Hodgkin, A. (1948). The local electrical changes associated with repetitive action in a non-medullated axon. J. Physiol., 107, 165–181.
Hodgkin, A., & Huxley, A. (1952). A quantitative description of membrane current and its application to conduction and excitation in nerve. J. Physiol., 117, 500–544.
Holt, G., Softky, W., Koch, C., & Douglas, R. J. (1996). Comparison of discharge variability in vitro and in vivo in cat visual cortex neurons. J. Neurophysiol., 75(5), 1806–1814.
Kiang, N. Y.-S. (1965). Discharge patterns of single fibers in the cat's auditory nerve. Cambridge, MA: MIT Press.
Klink, R., & Alonso, A. (1993). Ionic mechanisms for the subthreshold oscillations and differential electroresponsiveness of medial entorhinal cortex layer II neurons. J. Neurophysiol., 70(1), 144–157.
Kole, M., Hallermann, S., & Stuart, G. (2006). Single Ih channels in pyramidal neuron dendrites: Properties, distribution, and impact on action potential output. J. Neurosci., 26(6), 1677–1687.
Labouriau, I. S. (1985). Degenerate Hopf bifurcation and nerve impulse. SIAM J. Math. Anal., 16(6), 1121–1133.
Lamarre, Y., Filion, M., & Cordeau, J. P. (1971). Neuronal discharges of the ventral nucleus of the thalamus during sleep and wakefulness in the cat. I. Spontaneous activity. Experimental Brain Research, 12, 480–498.
Lee, S.-G., Neiman, A., & Kim, S. (1998). Coherence resonance in a Hodgkin-Huxley neuron. Physical Review E, 57(3), 3292–3297.
Lindner, B., Garcia-Ojalvo, J., Neiman, A., & Schimansky-Geier, L. (2004). Effects of noise in excitable systems. Physics Reports, 392(6), 321–424.
Lynch, J., & Barry, P. (1989). Action potentials initiated by single channels opening in a small neuron (rat olfactory receptor). Biophys. J., 55, 755–768.
Mainen, Z., & Sejnowski, T. J. (1995). Reliability of spike timing in neocortical neurons. Science, 268, 1503–1506.
Mann-Metzer, P., & Yarom, Y. (2002). Jittery trains induced by synaptic-like currents in cerebellar inhibitory interneurons. J. Neurophysiol., 87, 149–156.
Markram, H., Toledo-Rodriguez, M., Wang, Y., Gupta, A., Silberberg, G., & Caizhi, W. (2004). Interneurons of the neocortical inhibitory system. Nature Reviews Neuroscience, 5, 793–807.
Matsumoto, M., & Nishimura, T. (1998). Mersenne Twister: A 623-dimensionally equidistributed uniform pseudorandom number generator. ACM Trans. on Modeling and Computer Simulation, 8(1), 3–30.
McDonald, S. W., Grebogi, C., Ott, E., & Yorke, J. A. (1985). Fractal basin boundaries. Physica D, 17, 125–153.
Meunier, C., & Segev, I. (2002). Playing the devil's advocate: Is the Hodgkin-Huxley model useful? TINS, 25(11), 558–563.
Morris, C., & Lecar, H. (1981). Voltage oscillations in the barnacle giant muscle fiber. Biophys. J., 35, 193–213.
Nakahama, H., Suzuki, H., Yamamoto, M., & Aikawa, S. (1968). A statistical analysis of spontaneous activity of central single neurons. Physiology and Behavior, 3(5), 745–752.
Noda, H., & Adey, R. (1970). Firing variability in cat association cortex during sleep and wakefulness. Brain Research, 18, 513–526.
Pfeiffer, R., & Kiang, N. Y.-S. (1965). Spike discharge patterns of spontaneous and continuously stimulated activity in the cochlear nucleus of anaesthetized cats. Biophysical Journal, 5, 301–316.
Poggio, G., & Viernstein, J. (1964). Time series analysis of impulse sequences of thalamic somatic sensory neurons. J. Neurophysiol., 27, 517–545.
Rinzel, J., & Ermentrout, G. B. (1998). Analysis of neural excitability and oscillations. In C. Koch & I. Segev (Eds.), Methods in neuronal modeling (pp. 251–291). Cambridge, MA: MIT Press.
Rinzel, J., & Miller, R. (1980). Numerical calculation of stable and unstable periodic solutions to the Hodgkin-Huxley equations. Mathematical Biosciences, 49, 27–59.
Rospars, J.-P., Lansky, P., Vaillant, J., Duchamp-Viret, P., & Duchamp, A. (1994). Spontaneous activity of first- and second-order neurons in the frog olfactory system. Brain Research, 662, 31–44.
Rowat, P., & Elson, R. (2004). State-dependent effects of Na-channel noise on neuronal burst generation. J. Comput. Neuroscience, 16, 87–112.
Rudolph, M., & Destexhe, A. (2002). Point-conductance models of cortical neurons with high discharge variability. Neurocomputing, 44–46, 147–152.
Ruggero, M. (1973). Response to noise of auditory nerve fibers in the squirrel monkey. J. Neurophysiol., 36, 569–587.
Schneidman, E., Freedman, B., & Segev, I. (1998). Ion channel stochasticity may be critical in determining the reliability and precision of spike timing. Neural Computation, 10(7), 1679–1703.
Shadlen, M., & Newsome, W. (1998). The variable discharge of cortical neurons: Implications for connectivity, computation, and information coding. J. Neuroscience, 18(10), 3870–3896.
Siebler, M., Koller, H., Stichel, C. C., Muller, H. W., & Freund, H. J. (1993). Spontaneous activity and recurrent inhibition in cultured hippocampal networks. Synapse, 14, 206–213.
Sigworth, F. (1980). The variance of sodium current fluctuation at the node of Ranvier. J. Physiol., 307, 97–129.
Softky, W., & Koch, C. (1993). The highly irregular firing of cortical cells is inconsistent with temporal integration of random EPSPs. J. Neuroscience, 13(1), 334–350.
Stein, R. (1965). A theoretical analysis of neuronal variability. Biophys. J., 5, 173–194.
Stevens, C. F., & Zador, A. M. (1998). Input synchrony and the irregular firing of cortical neurons. Nature Neuroscience, 1(3), 210–217.
Stiefel, K. M., Englitz, B., & Sejnowski, T. J. (2006). The irregular firing of cortical interneurons in vitro is due to fast K+ kinetics. Manuscript in preparation.
Su, J., Rubin, J., & Terman, D. (2004). Effects of noise on elliptic bursters. Nonlinearity, 17, 133–157.
Tateno, T., Harsch, A., & Robinson, H. P. C. (2004). Threshold firing frequency-current relationships of neurons in rat somatosensory cortex: Type 1 and type 2 dynamics. J. Neurophysiol., 92, 2283–2294.
Tateno, T., & Pakdaman, K. (2004). Random dynamics of the Morris-Lecar model. Chaos, 14(3), 511–530.
Tiesinga, P., Jose, J., & Sejnowski, T. J. (2000). Comparison of current-driven and conductance-driven neocortical model neurons with Hodgkin-Huxley voltage-gated channels. Phys. Review E, 62(6), 8413–8419.
Troy, W. C. (1978). The bifurcation of periodic solutions in the Hodgkin-Huxley equations. Quarterly Journal of Applied Mathematics, 36, 73–83.
Troyer, T., & Miller, K. (1997). Physiological gain leads to high ISI variability in a simple model of a cortical regular spiking cell. Neural Computation, 9(5), 971–983.
Werner, G., & Mountcastle, V. (1963). The variability of central neural activity in a sensory system, and its implications for the central reflection of sensory events. J. Neurophysiol., 26, 958–977.
White, J. A., Klink, R., Alonso, A., & Kay, A. P. (1998). Noise from voltage-gated ion channels may influence neuronal dynamics in the entorhinal cortex. J. Neurophysiol., 80, 262–269.
White, J. A., Rubinstein, J. T., & Kay, A. P. (2000). Channel noise in neurons. Trends in Neurosciences, 23(3), 131–137.
Whitsel, B., Schreiner, R., & Essick, G. K. (1977). An analysis of variability in somatosensory cortical neuron discharge. J. Neurophysiol., 40(3), 589–607.
Wilbur, W., & Rinzel, J. (1983). A theoretical basis for large coefficient of variation and bimodality in neuronal interspike interval distribution. J. Theoretical Biology, 105, 345–368.
Yu, X., & Lewis, E. R. (1989). Studies with spike initiators: Linearization by noise allows continuous signal modulation in neural networks. IEEE Trans. Biomedical Eng., 36(1), 36–43.
Received February 1, 2006; accepted July 24, 2006.
LETTER
Communicated by Christoph Börgers
The Firing of an Excitable Neuron in the Presence of Stochastic Trains of Strong Synaptic Inputs Jonathan Rubin [email protected] Department of Mathematics, University of Pittsburgh, Pittsburgh, PA 15260, U.S.A.
Krešimir Josić [email protected] Department of Mathematics, University of Houston, Houston, TX 77204-3008, U.S.A.
We consider a fast-slow excitable system subject to a stochastic excitatory input train and show that under general conditions, its long-term behavior is captured by an irreducible Markov chain with a limiting distribution. This limiting distribution allows for the analytical calculation of the system’s probability of firing in response to each input, the expected number of response failures between firings, and the distribution of slow variable values between firings. Moreover, using this approach, it is possible to understand why the system will not have a stationary distribution and why Monte Carlo simulations do not converge under certain conditions. The analytical calculations involved can be performed whenever the distribution of interexcitation intervals and the recovery dynamics of the slow variable are known. The method can be extended to other models that feature a single variable that builds up to a threshold where an instantaneous spike and reset occur. We also discuss how the Markov chain analysis generalizes to any pair of input trains, excitatory or inhibitory and synaptic or not, such that the frequencies of the two trains are sufficiently different from each other. We illustrate this analysis on a model thalamocortical (TC) cell subject to two example distributions of excitatory synaptic inputs in the cases of constant and rhythmic inhibition. The analysis shows a drastic drop in the likelihood of firing just after inhibitory onset in the case of rhythmic inhibition, relative even to the case of elevated but constant inhibition. This observation provides support for a possible mechanism for the induction of motor symptoms in Parkinson’s disease and for their relief by deep brain stimulation, analyzed in Rubin and Terman (2004).
Neural Computation 19, 1251–1294 (2007)
C 2007 Massachusetts Institute of Technology
1 Introduction

There has been substantial discussion of the roles of excitatory and inhibitory synaptic inputs in driving or modulating neuronal firing. Computational analysis of this issue generally considers a neuron awash in a sea of synaptic bombardment (Somers et al., 1998; van Vreeswijk & Sompolinsky, 1998; De Schutter, 1999; Tiesinga, Jose, & Sejnowski, 2000; Tiesinga, 2005; Chance, Abbott, & Reyes, 2002; Tiesinga & Sejnowski, 2004; Huertas, Groff, & Smith, 2005). In this work, we also investigate the impact of synaptic inputs on the firing of a neuron, but with a focus on the effects of single inputs within stochastic trains. This investigation is motivated by consideration of thalamocortical relay (TC) cells, under the hypothesis that such cells are configured to reliably relay individual excitatory inputs, arising from either strong, isolated synaptic signals or tightly synchronized sets of synaptic signals, during states of attentive wakefulness, yet are also modulated by inhibitory input streams (Smith & Sherman, 2002). This viewpoint leads to the question of how different patterns of inhibitory modulation affect the relationship between the activity of a neuron and a stochastic excitatory input train that it receives. The main goal of this letter is to introduce and illustrate a mathematical approach to the analysis of this relationship.

Our approach applies to general excitable systems with separation of timescales, including a single slow variable and fast onset and offset of inputs. These ideas directly generalize to other neuronal models featuring a slow buildup of potential interrupted by instantaneous spikes and resets. We harness these features to reduce system dynamics to a one-dimensional map on the slow variable. Each iteration of the map corresponds to the time interval from the arrival of one excitatory input to the arrival of the next excitatory input (Othmer & Watanabe, 1994; Xie, Othmer, & Watanabe, 1996; Ichinose, Aihara, & Judd, 1998; Othmer & Xie, 1999; Coombes & Osbaldestin, 2000). From this map, under the assumption of a bounded excitatory input rate, we derive a transition process based on the evolution of the slow variable between inputs. In general, such a process would be non-Markovian, because the transition probabilities would depend on the arrival times of all inputs the cell had received since it last spiked. However, we derive an irreducible Markov chain by defining states that are indexed by both slow variable values and numbers of inputs received since last firing.

In the case of a constant inhibition level, we prove the key result that under rather general conditions, this Markov chain is aperiodic and hence has a limiting distribution. This limiting distribution can be computed from the distribution of input arrival times. Once obtained, it can be used to deduce much about the firing statistics of the driven cell, including the probability that the cell will fire in response to a given excitatory input, the expected number of response failures that the cell will experience between firings, and the distribution of slow variable values
attained after any fixed number of unsuccessful inputs arriving between firings. We emphasize that, a priori, it is not certain that such a limiting distribution exists. Therefore, Monte Carlo methods, which rely on the assumed existence of a limit, may fail to converge and hence may give misleading results about firing statistics and the distribution of slow variable values at input arrival times. In addition to providing conditions that guarantee the existence of a limiting distribution for the Markov chain, our analysis shows how convergence failure can arise in the case of a stochastic train of excitatory inputs that is close to periodic. When our existence conditions are satisfied, our approach yields an analytical calculation of the limiting distribution, eliminating the need for simulations as well as the complications of transients and slow convergence, and the eigenvalues of the transition matrix used to compute the limiting distribution give information about convergence rate. Moreover, this approach provides a framework for the analysis of how changes in the statistics of the inputs affect the output of a cell. Finally, the Markov chain analysis allows for the identification of bifurcation events in which variation of model parameters can lead to abrupt changes that affect long-term statistics, although we do not pursue this in detail in this work (see Doi, Inoue, & Kumagai, 1998; Tateno & Jimbo, 2000, which we comment on in section 9).

We discuss the Markov chain approach in the particular case of constant inhibition, which may be zero or nonzero, and extend it to the case of inhibition that undergoes abrupt switches between two different levels, at a lower frequency than that of the excitatory signal. These choices are motivated by the analysis of TC cell relay reliability in the face of variations in inhibitory basal ganglia outputs that arise in Parkinson's disease (PD) and under deep brain stimulation (DBS), applied to combat the motor symptoms of PD. An important point emerging from experimental results is that parkinsonian changes induce rhythmicity in inhibitory basal ganglia outputs (Nini, Feingold, Slovin, & Bergman, 1995; Magnin, Morel, & Jeanmonod, 2000; Raz, Vaadia, & Bergman, 2000; Brown et al., 2001), while DBS regularizes these outputs, albeit at higher-than-normal levels (Anderson, Postupna, & Ruffo, 2003; Hashimoto, Elder, Okun, Patrick, & Vitek, 2003). In recent work, Rubin and Terman (2004) provided computational and analytical support for the hypothesis that, given a TC cell that can respond reliably to excitatory inputs under normal conditions, the parkinsonian introduction of repetitive, fairly abrupt switches between high and low levels of inhibition to the TC cell will disrupt reliable relay. On the other hand, regularized basal ganglia activity, even if it leads to unusually high levels of inhibition, can restore reliable TC responses. While their computational results focused on finite time simulations of two TC models with particular choices of excitatory and inhibitory input strengths, frequencies, and durations, the general framework presented here can be used to analyze how a model TC cell responds to any train of fast excitatory inputs in
the presence of inhibition that stochastically makes abrupt jumps between two levels. Furthermore, the results of this letter yield more detailed statistical information about the firing of the driven TC cells, particularly just after the onset of inhibition.

The letter is organized as follows. In section 2, we review the way in which the dynamics of a fast-slow excitable system under pulsatile drive can be reduced to a map. In section 3, we use these ideas to construct a Markov chain that captures the relevant aspects of the dynamics of the system. We consider the existence of limiting densities for the constructed Markov chain and their interpretation in section 4. Further, we discuss the application of these ideas to responses of a population of cells, as well as their extension to populations with heterogeneities and to related models, in section 5. The specific example of determining the reliability of a reduced model TC cell under various types of inputs is addressed, as an application of the theory, in sections 6 and 7, while an explicit connection to PD and DBS is made in section 8. Some details of the calculations underlying the results of these sections are contained in the appendixes. Section 9 provides a summary of our results, a discussion of their relation to past work, and some ideas on possible generalization and future directions.

In the two examples that we present, one with a uniform and one with a normal distribution of excitatory inputs, a significant decrease in the responsiveness of the model TC cell and a significant increase in its likelihood of being found far from firing threshold result after the onset of the inhibitory phase of a time-varying inhibitory input, relative to the cases of high or low but constant inhibition. These findings are in agreement with Rubin and Terman (2004) and, based on the generality of our approach, appear to represent general characteristics of the TC model with a single slow variable.

2 Reduction of the Dynamics to Maps of the Interval

In this section, we introduce a general relaxation oscillator subject to synaptic input. Under certain assumptions on the form of the input, the dynamics of the oscillator is accurately captured by a map of an interval.

2.1 A General Relaxation Oscillator. Consider a model system of the form

v′ = f(v, w) + I(v, t)
w′ = ε g(v, w),
(2.1)
where 0 < ε ≪ 1. In the neuronal context, the fast variable v models the voltage, while the slow variable w typically models different conductances (Rubin & Terman, 2002). The input to the oscillator is modeled by the term
Figure 1: The nullclines of system 2.1 under the assumptions discussed in the text. The upper and lower v-nullclines correspond to s_exc = 0 (excitation off) and s_exc = 1 (excitation on), respectively.
I(v, t). We will assume that if I(v, t) = C for a constant C in a range of interest, then the v-nullcline, given implicitly by f(v, w) = −C, has a three-branched, or N-like, shape (see Figure 1). We will refer to the different branches of this nullcline as the left, middle, and right branch, respectively. Under assumptions on the form of f and g that typically hold in practice (Rubin & Terman, 2002), it is well known that if the w-nullcline, given implicitly by g(v, w) = 0, intersects the v-nullcline at the middle branch, then equation 2.1 has oscillatory solutions, known as relaxation oscillations. If the w-nullcline meets the v-nullcline at the left branch, then there are no oscillatory solutions, but the system is excitable. In this case, a kick in the positive v direction can trigger an extended excursion (a spike). In the following, we consider the case when the two nullclines intersect on the left branch of the v-nullcline in the absence of input. In the work presented here, equation 2.1 will be used as a model of a neuronal cell, and I(v, t) will model synaptic input, so that

I(v, t) = −g_exc s_exc(t)(v − v_exc) − g_inh s_inh(t)(v − v_inh),
(2.2)
where v − v_exc < 0 and v − v_inh > 0 over the relevant range of v-values. The two terms in the sum represent effects of excitatory and inhibitory currents, respectively. We will assume that the synaptic variables s_exc(t) and s_inh(t) switch between values of 0 (off) and 1 (on) instantaneously and independently (Somers & Kopell, 1993; Othmer & Watanabe, 1994). This is a reasonable approximation in a situation where the input cells fire action potentials of stereotypical duration, sufficiently widely separated in time to allow for synaptic decay to a small level between inputs, and where the
synaptic onset and offset rates are rapid. We need not consider other aspects of the dynamics of the neurons supplying inputs to the model cell, as only the input timing will be relevant. If the magnitude of I is not too large, then each of the four possible choices for (s_exc, s_inh) yields a different v-nullcline of system 2.1, each of which has an N-like shape. For simplicity, we first consider the case s_inh = 0, so that the cell receives only excitatory input. We label the resulting nullclines as N^i, with i ∈ {E, 0} corresponding to the values of s_exc = 1 and s_exc = 0, respectively. The case when s_inh ≠ 0 can be treated similarly, resulting in two additional nullclines, N^{I+E} and N^I. We refer to the left (right) branch of each v-nullcline N^i as the silent (active) phase. Each left branch terminates where it coalesces with the middle branch in a saddle node bifurcation of equilibria for the v-equation. We denote these bifurcation points, or left knees, by (v^i_LK, w^i_LK), and we denote the analogous right knees as (v^i_RK, w^i_RK), with i ∈ {E, 0} as above. Since v − v_exc < 0, increasing s_exc lowers the v-nullcline in the (v, w) phase plane. See Figure 1 for an example of the arrangement of the nullclines and the different knees. We have assumed that system 2.1 is excitable when excitation is off, with a stable critical point (v^0_FP, w^0_FP) on the left branch of N^0. All solutions of equation 2.1 will therefore approach (v^0_FP, w^0_FP) in the absence of input. We assume that the critical point (v^E_FP, w^E_FP) lies on the middle branch of N^E, so that the system is oscillatory in the presence of a constant excitatory input. This second assumption is not essential, and the analysis is easily generalized. Note that in the singular limit, if w ∈ (w^E_LK, w^0_FP], then an excitatory input results in a large excursion of v, corresponding to a spike. If excitatory inputs are brief, then any interesting dynamics is the consequence of synaptic inputs that result in such spikes. In remark 4, we comment further on the necessity of the assumptions made here.

2.2 Timing of the Synaptic Inputs. We denote the times at which the excitatory synaptic inputs occur by t_i, and we assume that each input has the same duration, which we call t*. We will assume that t_{i+1} − t_i > t*, so that the inputs do not overlap, and that the inputs are of equal strength, such that
s_exc(t) = 1 if t_i < t < t_i + t*, and s_exc(t) = 0 otherwise.
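For illustration, one realization of such an input train and its gate s_exc(t) can be generated directly from the IEI density. The sketch below is hypothetical scaffolding (the letter specifies no implementation); the uniform IEI density and all numerical values are placeholders, with Tmin > t* guaranteeing nonoverlapping inputs.

```python
# A small sketch (not from the letter) of one realization of the input
# train: draw i.i.d. interexcitation intervals from a density rho, here
# uniform on [Tmin, Tmax], and evaluate the on/off gate s_exc(t).
import numpy as np

rng = np.random.default_rng(1)
t_star, Tmin, Tmax = 2.0, 10.0, 30.0        # ms; illustrative values only
ieis = rng.uniform(Tmin, Tmax, size=200)    # T_i = t_{i+1} - t_i
onsets = np.cumsum(ieis)                    # input arrival times t_i

def s_exc(t):
    """1 if some input arrived within the last t_star ms, else 0."""
    k = np.searchsorted(onsets, t) - 1      # index of most recent onset
    return 1.0 if k >= 0 and t < onsets[k] + t_star else 0.0

print([s_exc(t) for t in (5.0, onsets[0] + 1.0, onsets[0] + 5.0)])
```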
We comment on trains of inputs with variable durations and amplitudes in remarks 4 and 5. Of fundamental interest in the following analysis will be the timing between subsequent excitatory inputs, which gives rise to the distribution of interexcitatory intervals T_i = t_{i+1} − t_i. We will assume that the T_i are
Figure 2: Schematic representation of a typical trajectory starting on the left branch of N^0 with w = w_0, where w_0 is in the interval j = (w^E_LK, w^0_FP), and ending on the same branch after time T (left). M_T(w_0) is defined as the w coordinate of the end point of the trajectory (right). Note that the trajectory from w̃^E_LK itself fails to reach the active phase, such that w̃^E_LK · t* = w^E_LK and M_T(w̃^E_LK) > w^E_LK.
independent and identically distributed random variables, with corresponding density ρ(t). Excitation that is T-periodic in time is a special case, corresponding to the singular density ρ(t) = δ(t − T).

2.3 Reduction of the Dynamics to a Map. For the purposes of analysis, we consider that the active phase of the cell is very fast compared to the silent phase. More specifically, we treat the neuron as a three-timescale system. The first timescale, on the order of tenths of milliseconds, governs the evolution of the fast variable (the voltage) during spike onset and offset. The second, intermediate timescale, on the order of milliseconds, governs the evolution of the slow variable on the right-hand branch, while the third, slow timescale, on the order of tens of milliseconds, governs its evolution on the left-hand branch. We assume that the duration of excitation is greater than or equal to the duration of the active phase (Stuart & Häusser, 2001). Correspondingly, we measure input duration on the slowest timescale. Using ideas of singular perturbation theory, these simplifications allow us to reduce the response of an excitable system to a one-dimensional map on the slow variable w in the singular limit. Similar maps have been introduced previously (Othmer & Watanabe, 1994), in the setting of two rather than three timescales. Since we measure the duration of excitation on the slow timescale, cells that spike are reinjected at w^E_RK, and the only accessible points on the left branch of N^0 lie in the interval I = [w^E_RK, w^0_FP) (see Figure 2). A common dynamical systems convention is to denote the solution obtained by integrating a differential equation for time t, from initial condition x_0, by x_0 · t. Using this convention, we denote the w coordinate at time t, of a point on the trajectory with initial condition (v_0, w_0), by w_0 · t. We now define a map M_T : I → I by M_T(w_0) = w_0 · T, where it is assumed that an excitatory input is received at time t = 0, at the start of the map cycle, and
no other excitatory inputs are received between times 0 and T. We claim that in the singular limit, there is no ambiguity in the definition of the map M_T. Indeed, in the singular limit, the initial point (v_0, w_0) can be assumed to lie on the left branch of the nullcline N^0. Since the evolution on the right branch of N^E occurs on the intermediate timescale and the time T exceeds the duration of excitation t*, the trajectory starting at (v_0, w_0) again ends on the left branch of N^0 at time T. We emphasize that when we repeatedly iterate a map M_T or compose a sequence of maps M_{T_i}, we assume for consistency that an input arrives at the start of each map cycle and that no other inputs arrive during each cycle.

Figure 2 (left) shows a schematic depiction of part of a typical trajectory, starting on N^0 with w = w_0, during an interval of length T following an excitatory input. The trajectory jumps to the right nullcline after excitation is received. The jumps between the branches occur on the fast timescale (triple arrows), while the evolution on the right and left nullclines is intermediate and slow, respectively (double and single arrows). A possible form of the resulting map M_T is shown on the right. Note that this map has three branches. Let w̃^E_LK ≤ w^E_LK denote the infimum of the set of w-values from which a cell will reach the active phase when given an excitatory input of duration t*. Orbits with initial conditions in the interval j = (w^E_LK, w^0_FP) will jump to the active phase as soon as excitation is received. Meanwhile, orbits with initial conditions in n = [w^E_RK, w̃^E_LK] will be blocked from reaching the active phase by the left branch of the N^E nullcline. Intermediate initial conditions, in g = (w̃^E_LK, w^E_LK], will yield trajectories that jump to N^E and then make a second jump, to the active phase, after a time 0 ≤ t < t*. In the singular limit, the region j is compressed to a single point under M_T, so that M_T(j) = M_T(w^E_RK) = w^E_RK · T. Since the times of entry into the active phase vary continuously across initial conditions in g, M_T(g) has a nonzero slope. Finally, note that on n, the map M_T has slope less than one, corresponding to the fact that w′ decreases as w approaches the fixed point w^0_FP along the left branch of N^0.

Remark 1. The definition of the map M_T(w) can be extended to the case when the system is excitable rather than oscillatory under constant excitation, that is, when the point (v^E_FP, w^E_FP) is on the left branch of N^E. In this case, M_T(w) < w may occur for w near w^E_FP if T is close to t*, due to the stability of the fixed point (v^E_FP, w^E_FP) on N^E.

2.4 Periodic Excitation. Although the case of periodic excitation has been analyzed in much detail elsewhere, we review it here since the analysis shares a common framework with the developments in the subsequent sections. First, consider the limit as the duration of excitation goes to zero, with respect to the slow timescale, such that w̃^E_LK ↑ w^E_LK and the middle branch
Figure 3: Examples of the map $M_T$. (Left) An example with $t^* = 0$, for which, after a finite number of iterations, all initial conditions will be mapped to a periodic orbit of period N = 4. (Right) An example with $t^* > 0$, with $M_T^1(\tilde{w}_{LK}^{E+})$ defined as $\lim_{w \downarrow \tilde{w}_{LK}^E} M_T^1(w)$. This example exhibits contraction of the interval g, since $M_T^5(g) \subset M_T^1(g)$ and $|M_T^5(w_{LK}^E) - M_T^5(\tilde{w}_{LK}^{E+})| < |M_T^1(w_{LK}^E) - M_T^1(\tilde{w}_{LK}^{E+})|$. Note that in both cases, $M_T^5(w_{LK}^E) = M_T^1(w_{LK}^E)$.
First, consider the limit as the duration of excitation goes to zero, with respect to the slow timescale, such that $\tilde{w}_{LK}^E \uparrow w_{LK}^E$ and the middle branch of $M_T$ is eliminated. As shown in the left panel of Figure 3, the fact that $j = (w_{LK}^E, w_{FP}^0)$ is contracted to a single point under $M_T$ implies that all points $w_0 \in I$ get mapped to a periodic orbit of $M_T$ in a finite number of iterations. A simple analysis shows that there exists a natural number N such that a population of cells with initial conditions distributed on I will form N synchronous clusters under N applications of the map $M_T$. The periodic orbit is obtained by applying N iterations of $M_T$ to $M_T(w_{RK}^E)$, and it consists of the points $\{M_T^i(w_{RK}^E)\}_{i=1}^N$, where $M_T^N(w_{RK}^E) \in j$ (see Figure 3). Every trajectory is absorbed into this orbit, possibly after an initial transient. This clustered state persists away from the singular limit. More precisely, consider the following intervals, or bins,

$$j_i = \left( M_T^{-i}(w_{LK}^E),\; M_T^{-(i-1)}(w_{LK}^E) \right), \qquad i = 1, 2, \ldots \tag{2.3}$$
Note that the ith iterate of $j_i$ under $M_T$ is contained in j. Therefore, since $M_T^i(j_i) \subset j$, it follows that $M_T^{i+1}(j_i) = M_T(w_{RK}^E)$ or, more generally, $M_T^l(j_i) = M_T^{l-i}(w_{RK}^E)$ for $l > i$. We can interpret this as follows. Consider a collection of identical oscillators subject to identical input, under the assumption that all oscillators have initial conditions in $j_i$. This collection will get mapped to the interval j just prior to the ith input. It will then respond to the ith input by firing in unison and will form a synchronous cluster after i + 1 excitatory inputs, since the interval j collapses to a single point after the cells fire. If the bin $j_i$ contains a fraction q of the initial conditions, then this fraction of cells will fire at the ith input, as well as on every Nth input after that.
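The clustering just described is easy to observe numerically. The following minimal sketch (in Python) iterates a caricature of the singular map $M_T$ with $t^* = 0$; the exponential silent-phase flow and all parameter values are illustrative assumptions, not quantities derived from the model.

```python
import numpy as np

# Assumed illustrative parameters: the slow variable relaxes toward w_FP on the
# left branch; a cell fires iff w > w_LK when an input arrives (t* = 0), and
# firing resets w to w_RK.
w_FP, w_LK, w_RK, tau, T = 1.0, 0.80, 0.05, 100.0, 40.0

def evolve(w, t):
    """The flow w . t on the left branch: exponential approach to w_FP."""
    return w_FP + (w - w_FP) * np.exp(-t / tau)

def M_T(w):
    """Singular map M_T with t* = 0: fire and reset if w > w_LK, then evolve for T."""
    return evolve(w_RK if w > w_LK else w, T)

# Iterate a population with initial conditions spread over I = [w_RK, w_FP).
w = np.random.uniform(w_RK, w_FP - 1e-3, 1000)
for _ in range(50):
    w = np.array([M_T(x) for x in w])

# After a transient, the population collapses onto N synchronous clusters
# (N = 4 for these parameter values).
print(np.unique(np.round(w, 6)))
```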
Therefore, without knowing the distribution of initial conditions, it is not possible to know what fraction of the cell population will respond to a given input. Consider the two extreme examples: if all cells have initial conditions lying in one bin, the population of cells will respond only to every Nth input. On the other hand, if initially every bin $j_i$ contains some fraction of cells, then each input will induce a response by some fraction of the population.

Now consider the effect of $t^* > 0$, corresponding to a nonzero duration of excitation. As long as $t^*$ is sufficiently small, or the slope of the middle branch of $M_T$ is sufficiently shallow, $M_T$ will be contracting on g, as shown in the right panel of Figure 3. In this case, a similar clustered state arises away from the $t^* = 0$ limit as well. Moreover, the definitions and clustering phenomenon described here carry over identically to the case of periodic excitation with inhibition held on at any constant level. In that case, the system will evolve on the nullclines $N^I$ and $N^{I+E}$ in the singular limit. The bins are defined in terms of the points $w_{LK}^{I+E}$ and $w_{RK}^{I+E}$, rather than $w_{LK}^E$ and $w_{RK}^E$ as above.

We will show in the next section that under general conditions, the situation can be quite different when the times between onsets of successive excitatory inputs, which we will call interexcitation intervals (IEIs), are random and the possibility of sufficiently long IEIs exists.

Remark 2. The map $M_T$ resembles a time T map of the voltage obtained from an integrate-and-fire system. In our case, however, the map is defined on the slow conductance w. If an excitatory input fails to elicit a spike and g(v, w) depends weakly on v, then the input may have little effect on the slow conductance. This is unlike the typical integrate-and-fire model, in which excitatory inputs move the cell closer to threshold.

3 The Construction of a Markov Chain

We next analyze the long-term statistical behavior of a cell that receives excitatory input that is not periodic in time. Our goals are to determine the probability that the cell fires in response to each subsequent input since it last fired and the number of inputs the cell is expected to receive between firings or, equivalently, the average number of failures before a spike occurs. Our results can be interpreted in the context of a population of cells as well, and this is discussed in section 5.

A key point in our approach is that, as noted in section 2, all firing events lead to reinjection at $w_{RK}^E$. Therefore, the system has only limited memory, since after a cell fires, all information about its prior behavior is erased. This allows us to describe the evolution of the variable w through the interval $[w_{RK}^E, w_{FP}^0)$ as a Markov process with a finite number of states. The steps in this process will be demarcated by the arrival times of excitatory inputs.
Figure 4: Bins $I_j$ are defined by intervals of values of the slow variable w. Note that larger subscripts label intervals of larger w values. Iterations of $w_{RK}^E$ for integer multiples of the minimal IEI time S are used to define most bins, as in equation 3.1, whereas bins $I_{N-1}$ and $I_N$ are defined using $w_{LK}^E$ and $w_{FP}^0$, as in equation 3.2.
Each IEI is defined as the time from the onset of one excitatory input to the onset of the next. The length of this interval includes the duration of the input that occurs at the start of the IEI. We assume that the lengths of the IEIs, which we denote by T, are independent and identically distributed random variables with density ρ. Fundamental in our analysis is the assumption that the support of ρ takes the form [S, U], where 0 < S < U < ∞. As long as the frequency of the cells providing excitatory inputs is bounded, this assumption is not unreasonable. If ρ satisfies this assumption, then the long-term behavior of a population of cells is accurately captured by the asymptotics of a corresponding Markov chain with finitely many states. We show that under certain conditions, this Markov chain is aperiodic and irreducible and thus has a limiting distribution. This distribution can be used to describe the firing statistics of the cell.

3.1 The States of the Markov Chain. We start by again assuming that the cell receives only excitatory input. To simplify the exposition, we now assume that the input is instantaneous on the slow timescale, so that it causes a cell to spike if and only if $w > w_{LK}^E$. In the singular limit, such an input has no effect on the slow variable w of a cell unless a spike is evoked. This assumption will be relaxed subsequently.

In the case of periodic excitation, we considered bins defined by backward iterations of $w_{LK}^E$, as in equation 2.3. For notational reasons, it is now more convenient to consider forward iterations of $w_{RK}^E$ to form bins used in the definition of the Markov chain (see Figure 4). Let N be the smallest number such that $w_{RK}^E \cdot NS > w_{LK}^E$, where S > 0 is the lower bound of
the support of ρ mentioned above. Therefore, N is the maximal number of excitatory inputs that a cell starting at any point $w_0 \in [w_{RK}^E, w_{FP}^0]$ can receive before firing. Set

$$I_k = \left[ w_{RK}^E \cdot kS,\; w_{RK}^E \cdot (k+1)S \right) \quad \text{if} \quad 1 \le k \le N-2, \tag{3.1}$$

and define two additional bins:

$$I_{N-1} = \left[ w_{RK}^E \cdot (N-1)S,\; w_{LK}^E \right], \qquad I_N = \left( w_{LK}^E,\; w_{FP}^0 \right) = j. \tag{3.2}$$
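The bins of equations 3.1 and 3.2 are computable by the same construction. The sketch below reuses the assumed exponential flow from the previous example; the flow and the value of S are placeholders for quantities that would come from simulating one passage through the silent phase.

```python
import numpy as np

# Assumed illustrative values, as in the previous sketch.
w_FP, w_LK, w_RK, tau, S = 1.0, 0.80, 0.05, 100.0, 30.0

def evolve(w, t):
    return w_FP + (w - w_FP) * np.exp(-t / tau)

# N is the smallest number with w_RK . (N S) > w_LK: the maximal number of
# inputs a cell can receive before firing.
N = 1
while evolve(w_RK, N * S) <= w_LK:
    N += 1

# Bins I_1, ..., I_{N-2} from equation 3.1; I_{N-1} and I_N from equation 3.2.
bins = [(evolve(w_RK, k * S), evolve(w_RK, (k + 1) * S)) for k in range(1, N - 1)]
bins.append((evolve(w_RK, (N - 1) * S), w_LK))  # I_{N-1}
bins.append((w_LK, w_FP))                       # I_N = j
print(N, bins)
```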
If T is a random variable, then the map $M_T$ transfers ensembles of cells between bins. This process is non-Markovian, in that transition probabilities depend on the full set of IEIs that have occurred since a cell last spiked, and not just on current bin membership. That is, a cell that enters a bin after fewer inputs will be more likely to lie in the lower part of the bin than a cell that enters the same bin after more inputs. Hence, the number of inputs that a cell has received since its last spike can significantly affect its transition probabilities between bins. Therefore, to form a Markov chain from $M_T$, a further refinement is needed. To that effect, we define Markov chain states $(I_k, l)$ as follows. A cell is in state $(I_k, l)$ if $w \in I_k$ and l complete IEIs have passed since the cell last fired. Note that by construction, the first of these IEIs (l = 1) actually corresponds to the duration of the input that made the cell fire, together with the pause between that input and the next, since the actual firing and reset occur instantaneously on the slow timescale.

As in the case of periodic excitation, the analysis carries over directly to the case of constant inhibition. In this case, the nullclines $N^I$ and $N^{I+E}$ are used instead of $N^0$ and $N^E$ above. The remainder of the above construction is identical, with the superscripts I and I + E replacing the superscripts 0 and E, respectively.

Remark 3. For the remainder of this section, we continue to denote the states of the Markov chain by $(I_k, l)$, to emphasize that the first index refers to an interval of slow variable values. For simplicity, we use (k, l) instead of $(I_k, l)$ in the examples of the analysis that appear later in the letter.

Remark 4. Up to this point, we have made several assumptions that can in fact be weakened in the following ways:
- For ε = 0, the actual jump-up threshold lies below $w_{LK}^E$ and is given by the minimal level of w for which an input current pushes the point (v, w) on the left branch of $N^0$ into the active phase. As long as all inputs are the same (otherwise, see the next point and remark 5), this value is uniquely defined and can be used to replace $w_{LK}^E$ in the definitions of $I_{N-1}$ and $I_N$ in equation 3.2.
- If the excitatory input is not instantaneous in time, then the bins have to be adjusted. For instance, the top bin will include all cells that fire by jumping to the left branch of $N^E$ and then reaching $w_{LK}^E$ with the input still on. Each IEI will consist of the duration of the excitatory input beyond reset (see the next point), plus the time until the arrival of the next input. The distribution of IEIs must be adjusted accordingly. These steps are incorporated in the examples presented in section 6.
- As stated in section 2, when inputs are not instantaneous, input duration is measured on the slowest timescale. It is possible, however, that the input duration, in milliseconds, is only slightly longer than the active phase, which transpires on our intermediate timescale. What we refer to here as the input duration is therefore really the duration of the part of the input that extends beyond the cell's return to the silent phase, which we measure on the slow timescale.
- If the input turns on and off gradually, then the analysis becomes more complicated, although additional simplifying assumptions could reduce this complexity.
3.2 Transition Probabilities. To complete the definition of the Markov chain, we need to compute the transition probabilities between the different states. The transitions occur at times at which the cell receives excitatory inputs. One way to think about these probabilities is to imagine a large pool of noninteracting cells, each of which receives a different train of inputs with IEIs chosen from the same distribution. To start, assume that each cell is in bin $I_N$ and receives its next excitatory input at slow time 0. By the definition of $I_N$, these inputs cause the cells to spike, and they get reinjected at $w_{RK}^E$, still at slow time 0. The assumption that all cells are reset to the same point is essential in the definition of the Markov chain.

Recall that we have defined the length T of an IEI as the time between the onsets of two subsequent inputs and that these times are independent and drawn from the distribution ρ. Focus on a particular cell, and denote the length of its next IEI by $T_1$. Just before the cell's next input arrives, it will be at $w_{RK}^E \cdot T_1$. Similarly, if we check the locations of all other cells after their respective IEIs, we will find that the population of cells is distributed in some interval starting at $w_{RK}^E \cdot S$, as shown in Figure 5. The transition probability from the state $(I_N, k)$ to the state $(I_j, 1)$ equals the fraction of the population of cells that are in bin $I_j$ at the ends of their respective IEIs. This fraction is independent of the number of inputs k a cell received before firing, by our assumption that all information about prior dynamics is erased once a cell has fired.

Let us return to the cell at $w_{RK}^E \cdot T_1$, and let $T_2$ denote its second IEI since time 0, also chosen from the distribution ρ. After this IEI, the example cell will be at $w_{RK}^E \cdot (T_1 + T_2)$.
Figure 5: An example of the state of the population of cells that start in $I_N$ and immediately receive inputs and fire. After one subsequent IEI, the cells are at $w_{RK}^E \cdot T$, where T is distributed according to ρ. Therefore, all cells lie in an interval bounded below by $M_S^1(w_{RK}^E) = w_{RK}^E \cdot S$, as shown in the left-most part of the figure. For this example, we have assumed that supp(ρ) ⊂ [S, 2S), such that all cells are below $M_S^2(w_{RK}^E) = w_{RK}^E \cdot 2S$, as shown. After a second IEI, each cell is at $w_{RK}^E \cdot (T_1 + T_2)$, where both $T_1$ and $T_2$ are chosen from the distribution ρ. After a third IEI, some cells will lie above $w_{LK}^E$, such that they fire to their next inputs, while others will not. The distributions of cells after the third and fourth IEIs are the right-most two distributions shown above.
The fraction of the population lying in bin $I_j$ after one IEI since time 0 and in bin $I_k$ after two IEIs since time 0 is the transition probability between the states $(I_j, 1)$ and $(I_k, 2)$. This process can be continued to compute all the transition probabilities. In the example shown in Figure 5, we have assumed that supp(ρ) ⊂ [S, 2S) and that there are five accessible states, which we order as $(I_1, 1)$, $(I_2, 2)$, $(I_3, 3)$, $(I_4, 3)$, and $(I_4, 4)$. Hence, N = 4, and the matrix of transition probabilities has the form

$$A = \begin{pmatrix} 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & * & * & 0 \\ 0 & 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 \end{pmatrix}, \tag{3.3}$$
where $A_{ij}$ gives the transition probability from the ith state in the list to the jth state in the list. If initial conditions are selected randomly, then some cells may initially lie in transient states that cannot be reached subsequent to reset from $w_{RK}^E$. We neglect such states in the Markov chain, since including them does not affect the statistics we consider.

We now are ready to compute the transition probabilities for the general Markov chain that we have defined. We first compute transition
probabilities between the states $(I_j, l)$ and $(I_k, l+1)$ for $l \le j < k \le N$. Consider independent and identically distributed random variables $T_i$, each with probability density ρ. Let $\sigma_l = \sum_{i=1}^{l} T_i$ denote the sum of these random variables. The transition probability between the states $(I_j, l)$ and $(I_k, l+1)$ is given by

$$p_{(j,l)\to(k,l+1)} = P[(I_k, l+1)\,|\,(I_j, l)] = P\left[ \sigma_{l+1} \in [kS, (k+1)S) \,\middle|\, \sigma_l \in [jS, (j+1)S) \right]. \tag{3.4}$$

Since the probability density $\rho^{(l)}$ of the sum $\sigma_l$ can be computed recursively by the convolution formula,

$$\rho^{(l)}(t) = \int \rho^{(l-1)}(u)\, \rho(t-u)\, du,$$

the conditional probabilities in the expression above can be evaluated. In particular, since

$$P\left[ \sigma_{l+1} \in [z, z+\Delta z] \;\&\; \sigma_l \in [w, w+\Delta w] \right] \approx \rho^{(l)}(w)\, \rho(z-w)\, \Delta w\, \Delta z,$$

it follows that

$$p_{(j,l)\to(k,l+1)} = \frac{ \int_{kS}^{(k+1)S} \int_{jS}^{(j+1)S} \rho^{(l)}(w)\, \rho(z-w)\, dw\, dz }{ \int_{jS}^{(j+1)S} \rho^{(l)}(z)\, dz }. \tag{3.5}$$
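In practice, equation 3.5 can be evaluated by discretizing ρ on a grid and building the densities $\rho^{(l)}$ by repeated numerical convolution. A minimal sketch, assuming a uniform IEI density on [S, U]:

```python
import numpy as np

S, U, dt = 30.0, 70.0, 0.1          # assumed IEI support (msec) and grid step
t = np.arange(0.0, 8 * U, dt)       # grid long enough to hold sums of IEIs
rho = np.where((t >= S) & (t <= U), 1.0 / (U - S), 0.0)

def convolve_density(p, q):
    """Discretized density of the sum of independent variables with densities p, q."""
    return np.convolve(p, q)[: len(t)] * dt

def transition_prob(j, k, l):
    """p_{(j,l) -> (k,l+1)} of equation 3.5, for bins [jS,(j+1)S) and [kS,(k+1)S)."""
    rho_l = rho.copy()
    for _ in range(l - 1):          # build rho^(l), the density of sigma_l
        rho_l = convolve_density(rho_l, rho)
    in_j = (t >= j * S) & (t < (j + 1) * S)
    den = np.sum(rho_l[in_j]) * dt  # P[sigma_l in [jS, (j+1)S)]
    num = 0.0                       # P[sigma_{l+1} in bin k and sigma_l in bin j]
    for i in np.nonzero(in_j)[0]:
        mask = (t >= k * S - t[i]) & (t < (k + 1) * S - t[i])
        num += rho_l[i] * np.sum(rho[mask]) * dt * dt
    return num / den if den > 0 else 0.0

print(transition_prob(j=1, k=2, l=1))  # e.g., the (I_1, 1) -> (I_2, 2) entry
```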
Next, we define the transition probabilities from the states $(I_N, l)$. By our assumption, a cell in one of these states fires when it receives an excitatory input. Therefore, the next state must be of the form $(I_j, 1)$. As discussed above, once a cell fires, it has no memory of the number of excitatory inputs it received prior to this event. Therefore, the transition probability from $(I_N, l)$ to $(I_j, 1)$ is the same for all l. This transition probability can be obtained as

$$p_{(N,l)\to(j,1)} = P[(I_j, 1)\,|\,(I_N, l)] = \int_{jS}^{(j+1)S} \rho(t)\, dt.$$
Since no transitions other than the ones described above are possible, this completes the definition of the Markov chain. As a final step, let $I_{k_i}$ denote the highest bin, with respect to values of the slow variable, that can be reached through i IEIs after reset. To form the transition matrix A for the Markov chain, we simply order the states of the Markov chain in the list

$$(I_1, 1), (I_2, 1), \ldots, (I_{k_1}, 1), (I_2, 2), (I_3, 2), \ldots, (I_{k_2}, 2), \ldots, (I_N, N).$$
The transition from the mth state to the nth state in the list is taken to be the (m, n) element of the matrix A, as was done in matrix 3.3 in the above example.

Remark 5. As mentioned in remark 4, the threshold $w_{LK}^E$ in equation 3.2 depends on input amplitude and duration. If input amplitudes are not constant, then there will no longer be a single value N such that a cell fires to its next input if and only if it is in a state of the form $(I_N, j)$, no matter how $I_N$ is defined. In such a case, we can, for example, use the threshold defined for the maximal relevant input amplitude to define $I_N$. Then the probability distribution of input amplitudes can be used to compute probabilities $p_{(N,l)\to(N,l+1)}$ and $p_{(N,l)\to(j,1)}$. For some range of l values, $p_{(N,l)\to(N,l+1)}$ will be nonzero, and there will be a maximal value l such that no cell requires more than l inputs to fire, and hence $(I_N, l)$ is the last state in the chain.

4 Limiting Distributions of the Markov Chain and Their Interpretation

We next consider the long-term behavior of the Markov chains defined above and interpret this behavior in terms of the original fast-slow system. For a finite-state Markov chain with M states and transition matrix A, the probability distribution $\pi = (\pi_1, \ldots, \pi_M)$ is a stationary distribution if $\sum_i \pi_i A_{i,j} = \pi_j$ for all j. The stationary distribution π is a limiting distribution if

$$\lim_{n\to\infty} \{A^n\}_{i,j} = \pi_j.$$
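Both objects are easy to compute numerically: a stationary distribution is a left eigenvector of A with eigenvalue 1, and the limiting property can be checked by raising A to a high power. A sketch with a small hypothetical chain (the matrix entries are illustrative only):

```python
import numpy as np

# A hypothetical transition matrix; rows sum to 1.
A = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.5, 0.5],
              [1.0, 0.0, 0.0]])

# Stationary distribution: left eigenvector of A for eigenvalue 1, that is,
# the unit-dominant eigenvector of A^T, normalized to sum to 1.
vals, vecs = np.linalg.eig(A.T)
pi = np.real(vecs[:, np.argmax(np.real(vals))])
pi /= pi.sum()
print(pi)                           # here (0.25, 0.5, 0.25)

# When pi is also a limiting distribution (see theorem 1 below), every row
# of A^n approaches pi as n grows.
print(np.linalg.matrix_power(A, 200))
```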
A Markov chain is irreducible if for any two states i and j, there exists a finite n such that $\{A^n\}_{i,j} > 0$. In other words, there is a nonzero probability of transition between any two states in a finite number of steps. An irreducible finite-state Markov chain has a unique stationary distribution (Hoel, Port, & Stone, 1972). The period $d_i$ of the state i in a Markov chain is the greatest common divisor of the set $\{n \mid \{A^n\}_{i,i} > 0\}$. For an irreducible chain, all states have the same period d. Such a chain is called aperiodic if d = 1 and periodic if d > 1. A key point is made in the following theorem:

Theorem 1 (Hoel et al., 1972, p. 73). For an aperiodic, irreducible Markov chain, the stationary distribution is a limiting distribution.

4.1 Conditions for the Existence of a Limiting Distribution. We next show that under very general conditions, the Markov chain constructed in the previous section is irreducible and aperiodic and therefore has a limiting distribution. First, recall that the chain was constructed to include precisely those states that can be reached with nonzero probability in a finite number of steps after a cell is reset. Since a cell in any state will fire and
be reset after a finite number of steps, this implies that there is a nonzero probability of transition from any state to any other state in the chain in a finite number of steps. Therefore, the Markov chain under consideration is always irreducible.

We next consider conditions under which the Markov chain is aperiodic. It is sufficient to start with a continuum ensemble of cells at $w_{RK}^E$. If these cells are subject to different realizations of the input, and a nonzero fraction of cells occupies every state after a finite number of inputs, then the transition matrix is aperiodic.

To start, note that if ρ is supported on a single point T, so that the input is periodic, then the Markov chain will be periodic. Each point $M_T^j(w_{RK}^E)$ is contained in bin $I_j$. Therefore the states of the Markov chain are $\{(I_i, i)\}_{i=1}^N$. In the case depicted in Figure 3, for example, the transition matrix has the form

$$A = \begin{pmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 \end{pmatrix}. \tag{4.1}$$

Consider next what happens as the support of ρ is enlarged to [T, T + δ] in this example. For small δ, there is no change in the structure of the transition matrix. For some $\delta_0$ sufficiently large, however, we have $M_{3(T+\delta_0)}(w_{RK}^E) = w_{LK}^E$, and when $\delta > \delta_0$, as shown in Figure 5, a fraction of the cells will be in state $(I_3, 3)$ after three IEIs, while another fraction will be in state $(I_4, 3)$. When their next excitatory input arrives, the cells in state $(I_4, 3)$ will fire and transition to $(I_1, 1)$, whereas the cells in $(I_3, 3)$ will require two more inputs to fire. Therefore, in addition to the states $(I_1, 1), (I_2, 2), (I_3, 3), (I_4, 4)$ that were relevant in the periodic case, the new state $(I_4, 3)$ becomes part of the Markov chain. The transition matrix between these five states for $\delta > \delta_0$ has the form

$$A = \begin{pmatrix} 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1-\varepsilon(\delta) & \varepsilon(\delta) & 0 \\ 0 & 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 \end{pmatrix}, \tag{4.2}$$
where $\varepsilon(\delta_0) = 0$, ε increases with δ, and the states are ordered as $(I_1, 1)$, $(I_2, 2)$, $(I_3, 3)$, $(I_4, 3)$, and $(I_4, 4)$. It can be checked directly that this transition matrix is aperiodic. This is also a consequence of the following general theorem:

Theorem 2. Suppose that the support of the distribution ρ of IEIs (including input durations) is the interval [S, U]. The transition matrix defined in section 3
is aperiodic if and only if $M_{iU}(w_{RK}^E) > w_{LK}^E$ for an integer 0 < i < N, where N is defined by $M_{(N-1)S}(w_{RK}^E) < w_{LK}^E < M_{NS}(w_{RK}^E)$, as in section 3.
Corollary 1. The Markov chain defined above has a limiting distribution if and only if $M_{iU}(w_{RK}^E) > w_{LK}^E$ for an integer 0 < i < N, where N is described in theorem 2.

Remark 6. In the example discussed above, as the support of ρ widens, the periodic transition matrix in equation 4.1 is replaced by the aperiodic matrix, equation 4.2. Since the limiting behavior of periodic and aperiodic Markov chains is very different, the system can be thought of as undergoing a bifurcation as δ is increased past $\delta_0$.

Proof. If $M_{iU}(w_{RK}^E) \le w_{LK}^E$ for all integers 0 < i < N, then it can be checked directly that the only states that are achievable from an initial condition $M_S(w_{RK}^E)$ have the form $(I_i, i)$ for $0 < i \le N$. Therefore, the transition matrix is of size N × N and has the form of matrix 4.1, with ones on the superdiagonal in all rows except the last, which has a one in its first column. Thus, the Markov chain is periodic.

Assume instead that $M_{iU}(w_{RK}^E) > w_{LK}^E$ for an integer 0 < i < N. Consider a continuum of cells that have just spiked and been reset to $w_{RK}^E$ and are receiving independent realizations of the input. First, note that after i IEIs following the spike, there will be a nonzero fraction of cells in all states $(I_k, i)$, $k \ge i$, that are part of the Markov chain. The condition $M_{iU}(w_{RK}^E) > w_{LK}^E$ and the fact that the support of ρ is an interval imply that some cells will fire on the ith, (i + 1)st, . . . , and Nth inputs after the one that causes their initial reset. Correspondingly, there will be cells in all states $(I_k, 1)$, $k \ge 1$, in the chain after the ith input, cells in all states $(I_k, 2)$, $k \ge 2$, and $(I_k, 1)$, $k \ge 1$, in the chain after the (i + 1)st input, and so on, until all states of the form $(I_k, j)$ with $k \ge j$ and $j \in \{1, \ldots, N - i + 1\}$ are nonempty after N inputs. Similarly, some cells will fire again to input number 2i, with some other cells firing to each input from 2i + 1 up to 2N, and after 2N inputs, all states $(I_k, j)$ with $k \ge j$ and $j \in \{1, \ldots, 2N - 2i + 1\}$ in the chain will be nonempty. Continuing this argument inductively shows that all states are necessarily occupied just before the arrival of the (bN)th input after reset (i.e., after bN IEIs), where $b = \lceil (N-1)/(N-i) \rceil$ is the least integer greater than or equal to (N − 1)/(N − i); thus bN is bounded above by (N − 1)N, since i ≤ N − 1. This establishes theorem 2.

Finally, since we established previously that the Markov chain is irreducible, theorems 1 and 2 imply that it has a limiting distribution whenever $M_{iU}(w_{RK}^E) > w_{LK}^E$ for an integer 0 < i < N. If this condition fails, then the Markov chain is periodic and therefore has no limiting distribution. Hence, corollary 1 holds as well.
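The bifurcation described in remark 6 can be checked directly on matrices 4.1 and 4.2: powers of the periodic matrix cycle forever, while powers of the aperiodic matrix converge for any ε > 0. In the sketch below, only the value of ε is an arbitrary choice.

```python
import numpy as np

def A_periodic():
    """Matrix 4.1: the chain for strictly T-periodic input (period d = 4)."""
    return np.array([[0, 1, 0, 0],
                     [0, 0, 1, 0],
                     [0, 0, 0, 1],
                     [1, 0, 0, 0]], dtype=float)

def A_aperiodic(eps):
    """Matrix 4.2: the support of rho is widened so that a fraction eps of
    cells lands in the new state (I_4, 3)."""
    return np.array([[0, 1, 0,       0,   0],
                     [0, 0, 1 - eps, eps, 0],
                     [0, 0, 0,       0,   1],
                     [1, 0, 0,       0,   0],
                     [1, 0, 0,       0,   0]], dtype=float)

# The periodic matrix satisfies A^4 = I, so its powers never converge.
print(np.allclose(np.linalg.matrix_power(A_periodic(), 4), np.eye(4)))  # True

# For eps > 0, the chain has cycles of lengths 3 and 4 (gcd 1), so it is
# aperiodic and A^n converges to a rank-one matrix whose rows give the
# limiting distribution.
P = np.linalg.matrix_power(A_aperiodic(0.2), 400)
print(P[0])
```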
4.2 Interpretation of the Invariant Density. The limiting distribution Q of the derived Markov chain has two interpretations (Taylor & Karlin, 1998). Assume that a cell is subjected to excitatory input for a long time. If w is recorded after each IEI, just prior to the moment when the cell receives an excitatory input, then $Q[(I_j, l)]$ is the fraction of recordings observed to fall inside $(I_j, l)$. Consequently, the total mass of the distribution Q that lies in the states $(I_N, j)$, namely, $\sum_{j=1}^{N} Q[(I_N, j)]$, is the probability that a cell will respond to the next excitatory input by firing. Similarly, we can think of $Q[(I_j, l)] / \sum_{k=l}^{N} Q[(I_k, l)]$ as the probability that a cell will be found to have $w \in I_j$ just prior to receiving its lth excitatory input since it fired last.

Note that the firing probability $\sum_{j=1}^{N} Q[(I_N, j)]$ is not the reciprocal of the average number of failures before a spike. However, the average number of failures, call it $E_f$, can be computed once the $Q[(I_j, l)]$ are known. Clearly,

$$E_f = \sum_{j=1}^{N-1} j\, F_j, \tag{4.3}$$
where each $F_j$ denotes the probability that exactly j failures occur. Note that j failures occur precisely when j + 1 conditions are met. That is, for each $i = 1, \ldots, j$, just prior to the ith input since the last firing, the trajectory falls into a state of the form $(I_{k_i}, i)$, where $i \le k_i < N$. Further, the trajectory ends in state $(I_N, j+1)$ before the (j + 1)st input, to which the cell responds by firing. The quantity

$$\frac{\sum_{k=i}^{N-1} Q(I_k, i)}{\sum_{k=i}^{N} Q(I_k, i)}$$

equals the probability that just before the ith input arrives, a cell will lie in a state $(I_k, i)$ with k < N. Thus, the failure probabilities are given by

$$F_j = \frac{\sum_{k=1}^{N-1} Q(I_k, 1)}{\sum_{k=1}^{N} Q(I_k, 1)} \cdot \frac{\sum_{k=2}^{N-1} Q(I_k, 2)}{\sum_{k=2}^{N} Q(I_k, 2)} \cdots \frac{\sum_{k=j}^{N-1} Q(I_k, j)}{\sum_{k=j}^{N} Q(I_k, j)} \cdot \frac{Q(I_N, j+1)}{\sum_{k=j+1}^{N} Q(I_k, j+1)}. \tag{4.4}$$
Of course, while some of the $Q(I_k, j)$ may be zero, the exclusion of transient states in our Markov chain construction ensures that all of the sums in the denominators of equation 4.4 will be positive for every population of cells.
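Once Q is known, equations 4.3 and 4.4 reduce to a few lines of arithmetic. As an illustration, the sketch below applies them to the limiting distribution $v^0$ reported in equation 6.2 of section 6.1, over the states (1, 1), (2, 1), (2, 2), (3, 2), (3, 3) with N = 3; it recovers the firing probability of roughly 45% and the value $E_f \approx 1.20$ quoted there.

```python
# Limiting distribution from equation 6.2, indexed by states (k, l), with N = 3.
N = 3
Q = {(1, 1): .3404, (2, 1): .1135, (2, 2): .0922, (3, 2): .3617, (3, 3): .0922}

def q(k, l):
    return Q.get((k, l), 0.0)

# Probability of firing to the next input: total mass in the states (I_N, j).
p_fire = sum(q(N, j) for j in range(1, N + 1))

def F(j):
    """F_j of equation 4.4: survive the first j inputs, then fire on input j+1."""
    prob = 1.0
    for i in range(1, j + 1):  # not yet in I_N just before the ith input
        prob *= sum(q(k, i) for k in range(i, N)) / sum(q(k, i) for k in range(i, N + 1))
    return prob * q(N, j + 1) / sum(q(k, j + 1) for k in range(j + 1, N + 1))

E_f = sum(j * F(j) for j in range(1, N))  # equation 4.3
print(p_fire, E_f)                        # approximately 0.454 and 1.20
```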
Finally, we note that the eigenvalues of the transition matrix can be used to quantify the rate at which the system will converge to its limiting distribution, when one exists. This information may be useful even if Monte Carlo simulations will be used.

5 Extensions of the Markov Chain Approach

The ideas introduced in the previous section are applicable in many other settings. Here we outline extensions to heterogeneous populations of excitable cells subject to identical periodic input and to general excitable systems and related models.

5.1 Populations and Heterogeneity. The limiting distribution computed from a Markov chain cannot be used directly to determine what fraction of a population of identical cells subject to identical stochastic trains of excitatory inputs will respond to a given input. As noted earlier, it is necessary to specify the distribution of initial conditions to compute this fraction. On the other hand, the limiting distribution contains information about a population of identical cells subject to different realizations of the stochastic train of excitatory inputs (Taylor & Karlin, 1998). In this case, all of the quantities discussed in the previous section can be reinterpreted with respect to the firing statistics of a population of cells. For example, the fraction of cells responding with a spike to a given excitatory input equals the probability of a single cell firing to a particular excitatory input, that is, $\sum_{j=1}^{N} Q[(I_N, j)]$.

We can also apply similar ideas to the case of a heterogeneous population of excitable cells all subject to the same periodic input, say, of period T. Heterogeneity in intrinsic dynamics may lead to different rates of evolution for different cells in the silent phase, but this disparity can be eliminated by rescaling time separately for each cell, so that all cells evolve at the same unit speed in the silent phase. As a result of this rescaling, the heterogeneity in the population will become a heterogeneity in firing thresholds. Denote the resulting distribution of thresholds by φ, such that for any t > 0, $\int_0^t \phi(\tau)\, d\tau$ gives the fraction of cells with thresholds below t. There will exist a maximal nonnegative integer m and a minimal positive integer n > m such that the support of the distribution φ is contained in the interval [mT, nT]. Thus, for $i \ge 1$, $\delta_i = \int_{(i-1)T}^{iT} \phi(\tau)\, d\tau$ gives the fraction of cells that will fire in response to the ith input, which is nonzero for $i = m+1, \ldots, n$.

We can define Markov chain states $I_j$, for $j = 1, \ldots, n$, by stating that a cell is in state $I_j$ if it has received j − 1 inputs since it last fired. The transition probability P(j, j + 1) from state $I_j$ to state $I_{j+1}$ is 1 if $j \le m$; if j = m + 1, then $P(m+1, m+2) = 1 - \delta_{m+1}$, while the probability of transition from state $I_{m+1}$ to state $I_1$ is $P(m+1, 1) = \delta_{m+1}$; and if m + 1 < j ≤ n − 1, then P(j, j + 1) is given by the proportion of those cells that have not fired to
the first j − 1 inputs that also do not fire to the jth input, namely,

$$P(j, j+1) = 1 - \gamma_j := \frac{1 - \delta_j - \delta_{j-1} - \cdots - \delta_{m+1}}{1 - \delta_{j-1} - \delta_{j-2} - \cdots - \delta_{m+1}} = 1 - \frac{\delta_j}{1 - \delta_{j-1} - \delta_{j-2} - \cdots - \delta_{m+1}},$$

while $P(j, 1) = \gamma_j = \delta_j / (1 - \delta_{j-1} - \cdots - \delta_{m+1})$. Finally, the transition probability from state $I_n$ to state $I_1$ is 1, and all other transitions have zero probability. If we set $\gamma_{m+1} = \delta_{m+1}$, then the transition matrix takes the form

$$A = \begin{pmatrix}
0 & 1 & 0 & \cdots & 0 & \cdots & 0 \\
0 & 0 & 1 & \cdots & 0 & \cdots & 0 \\
\vdots & & & \ddots & & & \vdots \\
\gamma_{m+1} & 0 & 0 & \cdots & 1-\gamma_{m+1} & \cdots & 0 \\
\vdots & & & & & \ddots & \vdots \\
\gamma_{n-1} & 0 & 0 & \cdots & 0 & \cdots & 1-\gamma_{n-1} \\
1 & 0 & 0 & \cdots & 0 & \cdots & 0
\end{pmatrix}.$$
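This transition matrix is mechanical to assemble from the fractions $\delta_i$. The sketch below does so for hypothetical values of m, n, and $\delta_i$, and verifies the limiting distribution in the aperiodic case.

```python
import numpy as np

# Hypothetical fractions delta_i of cells firing to the ith input, nonzero for
# i = m+1, ..., n (here m = 1, n = 4); they must sum to 1.
m, n = 1, 4
delta = {2: 0.3, 3: 0.5, 4: 0.2}

A = np.zeros((n, n))
for j in range(1, n):                     # states I_1, ..., I_{n-1}
    if j <= m:
        A[j - 1, j] = 1.0                 # no cell can fire yet
    else:
        surv = 1.0 - sum(delta[i] for i in range(m + 1, j))
        gamma = delta[j] / surv           # gamma_j: fraction of survivors firing now
        A[j - 1, 0] = gamma               # fire: return to I_1
        A[j - 1, j] = 1.0 - gamma         # fail: advance to I_{j+1}
A[n - 1, 0] = 1.0                         # all remaining cells fire to the nth input

print(A)
print(np.linalg.matrix_power(A, 200)[0])  # limiting distribution (m < n - 1 here)
```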
As long as m < n − 1, such that the distribution φ has support on an interval of length greater than T, not all cells will fire together. Under this condition, there exists i < n such that $\delta_i \neq 0$ and, correspondingly, $\gamma_i \neq 0$. In this case, the proof of theorem 2 immediately generalizes to show that there exists a sufficiently large number of iterations N such that $\{A^r\}_{i,i} > 0$ for all r > N, which implies that A is aperiodic and the Markov chain has a limiting distribution. This result can be interpreted as follows. If the heterogeneity is weak, then cells starting with the same initial condition will always respond to the same inputs in a train, giving an unreliable population response. With a stronger degree of heterogeneity, the population response will disperse, eventually every input will evoke a response from some nonempty subset of the cells, and the statistics of the population response will be given by the limiting distribution of the Markov chain.

5.2 General Excitable Systems and Related Models. An excitable system can be characterized by the existence of a stable rest state and a threshold. In such a system, when a perturbation pushes a trajectory from the rest state across the threshold, the trajectory makes a significant excursion before returning to a small neighborhood of the rest state. Suppose that an n-dimensional excitable system receives transient inputs of a stereotyped
form. If we select a time T, then we can define a map $M_T(u)$ having as its domain a set of starting states u for the system, each of which is an n-dimensional vector. In theory, bins in n-dimensional space could be defined to implement the Markov chain approach for an n-dimensional map. Indexing these bins efficiently and computing transition probabilities between them could become problematic when n is not small, however.

On the other hand, the Markov chain approach discussed in the previous section can be applied directly to systems for which the points $M_T(u) = u(T)$ lie (approximately) on an interval in the phase space of the system, and which are reset to a fixed value $u_{reset}$ after crossing threshold. The assorted versions of the integrate-and-fire (IF) model (leaky IF, quadratic IF, exponential IF, and so on) in the subthreshold regime satisfy both of these conditions, as do the lighthouse model (Haken, 2002; Chow & Coombes, 2006) and the Kuramoto (1984) model on $S^1$. There is an important difference between the excitable systems considered in detail here and the IF model, however. When the results derived for excitable systems are applied to neuronal models, the Markov chain will be defined using a slowly changing ionic conductance, or an associated activation or inactivation variable, while in the case of the IF model, it would be defined in terms of the voltage. As a consequence, for the IF case, one would need to take into account the jumps in voltage due to synaptic inputs when defining the states of the Markov chain and computing the transition probabilities. Moreover, an additional difficulty arises from the fact that voltage will decrease after the application of an input that pulls the model above its intrinsic rest state but fails to cause a threshold crossing. This nonmonotonicity will complicate bin definitions. A similar issue will arise even in an excitable system with one slow variable w if the system remains excitable while its input is on. In this case, w converges toward $w_{FP}^E$ whenever $w \in (w_{FP}^E, w_{LK}^E)$ on $N^E$.

6 An Example: The TC Model

A prototypical representative of the class of models to which the analysis outlined in the preceding sections is applicable is a model for a thalamocortical relay (TC) cell relevant for the study of Parkinson's disease (PD) and deep brain stimulation (DBS). The TC model that we consider takes a form similar to the reduced TC model in Rubin and Terman (2004):

$$\begin{aligned} C_m v' &= -I_L - I_T - g_{exc}\, s_{exc}\, (v - v_{exc}) - g_{inh}\, s_{inh}\, (v - v_{inh}) \\ w' &= \phi\, (w_\infty(v) - w) / \tau(v). \end{aligned} \tag{6.1}$$
Here, the leak current $I_L = g_L (v - v_L)$ and the T-type or low-threshold calcium current $I_T = g_T\, m_\infty(v)\, w\, (v - v_{Ca})$, with parameter values $C_m = 1\ \mu\mathrm{F/cm^2}$, $g_L = 1.5\ \mathrm{mS/cm^2}$, $v_L = -68\ \mathrm{mV}$, $g_T = 5\ \mathrm{mS/cm^2}$, $v_{Ca} = 90\ \mathrm{mV}$,
$g_{exc} = 0.08\ \mathrm{mS/cm^2}$, $v_{exc} = 0\ \mathrm{mV}$, $g_{inh} = 0.12\ \mathrm{mS/cm^2}$, $v_{inh} = -85\ \mathrm{mV}$, $\phi = 3.5$, and functions $m_\infty(v) = (1 + \exp(-(v + 35)/7.4))^{-1}$, $w_\infty(v) = (1 + \exp((v + 61)/9))^{-1}$, and $\tau(v) = 10 + 400/(1 + \exp((v + 50)/3))$. The values of $s_{exc}$ and $s_{inh}$ will be determined by stochastic processes discussed below, and we made an additional modification to the model to make it oscillatory in the presence of excitation, which is discussed in remark 8. The assumptions that we make regarding the presence of three timescales in the dynamics are not unreasonable in this model, as shown in Rubin and Terman (2004) and Stone (2004).

In this section, we present the analytical results of the Markov chain approach for this model for two particular IEI distributions, one uniform and one normal, both with inhibition held off for all time and with inhibition held on for all time (further details appear in appendixes A and B). We compare the stationary distributions found analytically with those found by numerical simulations and find good agreement. Indeed, the Markov chain analysis can be used to check under what conditions a limiting distribution exists, so that numerical simulations will converge, and how long they need to be run to yield accurate results.

6.1 Uniform IEI Distribution. Consider system 6.1 with constant inhibition, which may be on or off. We take the IEIs to be distributed uniformly on [30,70] msec, except for the first such interval after each reset, which is chosen from a uniform distribution on [20,60] msec (see remark 7 below). Using the bin definitions 3.1 and 3.2, with this minor adjustment to $I_1$, we have $I_1 = [w_{RK}^E \cdot 20, w_{RK}^E \cdot 50)$, $I_2 = [w_{RK}^E \cdot 50, w_{RK}^E \cdot 80)$, . . . , $I_{N-1} = [w_{RK}^E \cdot (20 + (N-2)30), w_{LK}^E]$, $I_N = (w_{LK}^E, w_{FP}^0)$. For the default parameters of system 6.1, inhibition held at $s_{inh} = 0$, and the excitation characteristics described, simulation of the TC model, equation 6.1, for one passage through the silent phase shows that N = 3, while with inhibition at $s_{inh} = 1$, we find N = 5.

With $s_{inh} = 0$, the states of the Markov chain are (1, 1), (2, 1), (2, 2), (3, 2), (3, 3), with bin boundaries defined from the one-time simulation. The transition matrix for this case, call it $P^0$, is computed analytically in appendix A and appears in equation A.1. The unit-dominant eigenvector of $(P^0)^T$ gives the limiting distribution,

$$v^0 = [.3404\;\; .1135\;\; .0922\;\; .3617\;\; .0922]^T. \tag{6.2}$$
As discussed in section 4, the values of $v^0$ represent the likelihood that a cell is found in a given state, if state membership is recorded just prior to the onset of an excitatory input. For comparison, we simulated a single cell, modeled by equation 6.1, with the technical modification mentioned in remark 8 below, over 70 sec, after an initial transient of 10 sec. This
simulation yielded the vector

$$v_{num}^0 = [.3276\;\; .1289\;\; .0878\;\; .3629\;\; .0878]^T$$
of proportions of inputs during which the cell belonged to each relevant state, which agree nicely with the analytically computed expectations $v^0$ in equation 6.2. Over the entire simulation, the cell never failed to fire to three consecutive inputs.

Grouping our analytical values by bins indicates that just before the onset of excitation, on 34.04% of the observations, we expect $w \in [w_{RK}^E \cdot 20, w_{RK}^E \cdot 50)$; on 20.57% of the observations, we expect $w \in [w_{RK}^E \cdot 50, w_{RK}^E \cdot 75.5)$; and on 45.39% of the observations, we expect $w > w_{RK}^E \cdot 75.5$. In particular, this implies that a cell will fire to roughly 45% of its inputs for these choices of parameters. Further, from equations 4.3 and 4.4 with Q values given by entries of $v^0$, it is expected that a successful input will be followed by 1.20 inputs that fail to elicit a spike, before the next successful input occurs; this is in good agreement with direct simulation results showing a mean of 1.19 unsuccessful inputs between successful ones. As discussed in section 5, these results can also be interpreted in terms of a population of identical cells as long as the cells receive independent realizations of the input.

With nonzero constant inhibition, given by $s_{inh} = 1$, the states of the Markov chain are (1,1), (2,1), (2,2), (3,2), (3,3), (4,2), (4,3), (4,4), (5,2), (5,3), (5,4), and (5,5). The transition matrix, $P^I$, is computed analytically in appendix B and appears in equation B.1. In this case, $(P^I)^T$ has the unit-dominant eigenvector $v^I$, which is shown below together with the distribution of states listed in vector $v_{num}^I$. The latter are obtained from direct numerical simulation for 80 sec, with a transient consisting of the first 10 sec removed from consideration:
$$v^I = [.2290\;\; .0763\;\; .0859\;\; .1622\;\; .0215\;\; .0569\;\; .0624\;\; .0005\;\; .0004\;\; .2211\;\; .0833\;\; .0005]^T$$
$$v_{num}^I = [.2272\;\; .0784\;\; .0726\;\; .1927\;\; .0194\;\; .0388\;\; .0669\;\; .0022\;\; .0007\;\; .2171\;\; .0820\;\; .0002]^T.$$
Grouping the states $(I_5, j)$ reveals that a cell will fire in response to about 30% of its inputs. This failure rate is higher than in the case of $s_{inh} = 0$, which is to be expected because inhibition changes the locations of $w_{LK}$ and $w_{RK}$, so that the time to evolve from $w_{RK}$ to $w_{LK}$ is longer with inhibition on than with inhibition off. Similarly, based on $v^I$, the case of $s_{inh} = 1$ has a substantially higher expected number of failed inputs between spike-inducing inputs, namely, $E_f = 1(.0013) + 2(.7240) + 3(.2731) + 4(.0016) = 2.28$ from equations 4.3 and 4.4, compared to the 1.20 expected in the case of $s_{inh} = 0$. Direct numerical simulation gives a similar estimate of $E_f = 2.35$ in this case.
Remark 7. In the above calculation, each time interval from the offset of one input to the onset of the next was chosen from a uniform distribution on [20,60] msec. Suppose that we were to fix the total duration of each excitatory input at 10 msec. Since $\varepsilon \neq 0$ in simulations, even with this constant input duration, the part of the input duration remaining after the cell is reset from the active phase would vary after different spiking events. This variation can be handled easily, as discussed in remark 4, and results in a distribution of IEIs T with support on [20 + x, 70] msec, where x is the minimal duration of input remaining after reset. However, to maintain a uniform IEI distribution, we chose to adjust the simulations by simply turning off the input each time a firing cell is reset or, equivalently, setting the input duration equal to 0 on the slow timescale. As a result, the distribution for the first IEI after reset was supported on [20,60] msec rather than on [30,70] msec.

Remark 8. For consistency with section 3, we modified model 6.1 to make the cell oscillatory in the presence of excitation. To do this, we simplified the dynamics to the form $w' = (w_\infty - w)/\tau_w$ whenever v < −55, for constants $w_\infty = .61$ and $\tau_w = 407$, which were chosen to give nice bin boundary values. We also made a related technical adjustment, connected with the nonzero duration of excitation, to simplify the presentation of these example results. If an excitatory input arrives but fails to make a cell fire, then the cell jumps from the left branch of $N^0$ to the left branch of $N^E$, and it evolves on this branch of $N^E$ while the input is on. The rates of flow on the left branches of $N^0$ and $N^E$ may differ, however. Thus, the time a cell takes to travel from one value of w to another in the silent phase may depend on how many inputs the cell receives during the passage. This can easily be taken into account in the above calculations by making appropriate adjustments to bin boundaries, but doing so would clutter the exposition. Hence, we instead adjusted the flow in our simulations of equation 6.1 in this example so that the w dynamics was identical on the left branches of $N^0$ and $N^E$. In theory, the way that we did this introduces the possibility that solutions may escape from the left branch of $N^0$ in the vicinity of $w_{LK}^0$ and fire without receiving input, since $w_{LK}^0 < w_\infty$. However, in practice, this does not occur because $w_{LK}^0$ is sufficiently large, relative to the IEIs, that it is never reached.

6.2 Normal, or Other, IEI Distributions. If IEIs are taken from a nonuniform distribution, we can still use equation 3.5 to obtain the transition matrix elements analytically, albeit with numerical evaluation of the integrals that arise. These integrals are analogous to those given in appendixes A and B for the uniform case. Once the transition matrix is obtained, the limiting distribution for the Markov chain is computed as the unit-dominant eigenvector of its transpose, as previously.

For example, consider IEIs of the form $T = t_{ref} + X$, where $t_{ref}$ is a fixed constant and X is selected from a truncated normal distribution. In
particular, suppose that $t_{ref} = 20$ msec, that $\tilde{X}$ is selected from a normal distribution with mean 20 msec and standard deviation 10 msec, and that

$$X = \begin{cases} 0, & \tilde{X} < 0, \\ \tilde{X}/M_X, & 0 \le \tilde{X} \le 40, \\ 40, & 40 < \tilde{X}, \end{cases} \tag{6.3}$$
where $M_X$ is a constant correction factor chosen so that the distribution of X is properly normalized, with total probability 1 on $\mathbb{R}$. This is a reasonable choice that keeps IEIs within the bounds present in the example in the previous section. For this example, with other simulation parameters fixed as in the previous section, including $s_{inh} = 0$ and an input duration of 10 msec, and the same adjustment mentioned in remark 8, we obtained the transition matrix $P_n^0$ given in equation A.2, corresponding to states (1, 1), (2, 1), (2, 2), (3, 2), (3, 3). The dominant eigenvector $v_n^0$ of $(P_n^0)^T$ matches closely with a vector $(v_n^0)_{num}$ obtained from direct counting of state membership in the last 90 sec of a 100 sec numerical simulation:

$$v_n^0 = [.4073\;\; .0670\;\; .0594\;\; .4108\;\; .0594]^T$$
$$(v_n^0)_{num} = [.3883\;\; .0812\;\; .0625\;\; .4070\;\; .0610]^T.$$
Here, the subscript n simply indicates the use of a normal distribution of IEIs. Based on these results, we expect that a cell will respond to approximately 47% of the inputs, with an average of 1.15 failures between successful spikes from equations 4.3 and 4.4. This agrees nicely with the estimate of 1.13 that we obtained from direct numerical simulations.

For completeness, we conclude with the results of an analogous calculation done with $s_{inh} = 1$. This yields the transition matrix $P_n^I$ given in equation B.2, corresponding to states (1, 1), (2, 1), (2, 2), (3, 2), (3, 3), (4, 2), (4, 3), (5, 2), (5, 3), (5, 4). The corresponding dominant eigenvector $v_n^I$, and an example vector $(v_n^I)_{num}$ of state occupancy probabilities from direct simulations, are

$$v_n^I = [.2638\;\; .0438\;\; .0672\;\; .2234\;\; .0072\;\; .0169\;\; .0701\;\; .2133\;\; .0942]^T$$
$$(v_n^I)_{num} = [.2587\;\; .0505\;\; .0697\;\; .2132\;\; .0121\;\; .0263\;\; .0637\;\; .2329\;\; .0728]^T,$$
where we have omitted the (5,2) entry, since the probability of occupancy there is less than $0.5 \times 10^{-5}$. These results imply that a cell with $s_{inh} = 1$ will respond to approximately 31% of excitatory inputs and is thus less reliable than a cell with $s_{inh} = 0$. The average number of failures between spikes is 2.27, from equations 4.3 and 4.4, which is similar to the 2.24 obtained from direct simulations and exceeds that found with $s_{inh} = 0$.
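The truncated normal IEIs of equation 6.3 are also straightforward to sample, which provides an independent Monte Carlo check on any quantity derived from the chain. The sketch below estimates a transition probability of equation 3.5 by direct sampling; the normalization factor $M_X$ is omitted for simplicity, and the bin width S = 20 msec (the lower bound of the IEI support) is an illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(0)
t_ref, S = 20.0, 20.0                 # fixed offset and assumed bin width (msec)

def sample_IEI(size):
    """T = t_ref + X, with X a normal variate clipped to [0, 40] as in
    equation 6.3 (the correction factor M_X is omitted in this sketch)."""
    return t_ref + np.clip(rng.normal(20.0, 10.0, size), 0.0, 40.0)

def transition_prob_mc(j, k, l, trials=200_000):
    """Monte Carlo estimate of p_{(j,l) -> (k,l+1)}: sample the partial sums
    sigma_l and sigma_{l+1} directly and count bin memberships."""
    T = sample_IEI((trials, l + 1))
    sigma_l = T[:, :l].sum(axis=1)
    sigma_l1 = T[:, : l + 1].sum(axis=1)
    in_j = (sigma_l >= j * S) & (sigma_l < (j + 1) * S)
    in_k = (sigma_l1 >= k * S) & (sigma_l1 < (k + 1) * S)
    return np.count_nonzero(in_j & in_k) / max(np.count_nonzero(in_j), 1)

print(transition_prob_mc(j=1, k=2, l=1))
```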
7 Inhibitory and Excitatory Inputs

We next consider the more complex case in which the postsynaptic cell receives a stochastic excitatory input train while subjected to modulation by an inhibitory input. The corresponding analysis illustrates how switches between any pair of epochs with different state transition characteristics can be handled naturally within the Markov chain framework through products of transformation matrices. In particular, the same type of analysis could be done if both input trains were excitatory, with different statistics. Moreover, in the next section of the letter, the example that we present will be discussed in connection with a possible mechanism for the therapeutic effectiveness of DBS.
7.1 Transition Matrices. In this section, we assume that for a system of the form 2.1 with input given by equation 2.2, the function $s_{inh}(t)$ turns on and off abruptly, with a relatively long period between transitions. In the previous section, we discussed how to derive the transition matrices $P^I$ and $P^0$ for the case of constant inhibition of any level. Our next goal is to derive transition matrices $P^{0\to I}$ and $P^{I\to 0}$, encoding the probabilities of passing between various states during the onset and offset of inhibition, respectively. Using these matrices, we can form a transition matrix for the time from one inhibitory offset (or onset) to the next, by matrix multiplication.

One complication in this derivation is that the bin boundaries differ between the cases of zero and nonzero inhibition. Even with the assumption that the rate of evolution in the silent phase is independent of input level, differences in bin boundaries remain due to differences in right knee positions, leading to different starting points in the silent phase. Similarly, differences in left knee positions lead to different cutoffs for firing. To make this explicit, in the following analysis, let $\{S_j^0\}_{j=1}^{N_0}$ denote the ordered states for the Markov chain formed when inhibition is off, which we call the inhibition-off Markov chain, and let $\{S_j^I\}_{j=1}^{N_I}$ denote the states for the Markov chain defined when inhibition is on, which we call the inhibition-on Markov chain. We use the notation $(I_k^0, l)$ or $(I_k^I, l)$ for the states in these collections.

A second difficulty is that due to the stochasticity of the input trains, a change in inhibition level may occur at any time relative to the start of the particular IEI during which it happens, or even at a time when excitation is on. An example of how the resulting bin membership depends on the inhibition onset time is illustrated schematically in Figure 6.

To handle both of these issues, we define the transition matrix $P^{0\to I}$ for the onset of inhibition in the following way. First, note that there will be a final iteration of the inhibition-off Markov chain, before inhibition arrives, after which the cell will lie in a state $S_u^0$ for some u. During the next IEI, inhibition arrives. At the end of this IEI, the cell will lie in a state $S_v^I$ for some v.
Figure 6: A schematic depiction of the two difficulties encountered when defining the matrix $P^{0\to I}$. (Left) An ensemble of cells A in state $(I_2^0, l)$ is mapped to two bins, $I_2^I$ and $I_3^I$, when inhibition turns on. (Right) The same ensemble gets mapped only to the bin $I_3^I$ if inhibition turns on a time T later than in the left panel.
For $i = 1, \ldots, N_0$ and $j = 1, \ldots, N_I$, let the (i, j) entry of $P^{0\to I}$ denote the probability that u = i and v = j. Recall that the states of the inhibition-off Markov chain, by definition, take the form $(I_k^0, l)$, where l gives a count of IEIs since last reset; thus, many elements of $P^{0\to I}$ will be zero.

There are several important advantages to basing the cycle length on only excitatory input times rather than the time when inhibition turns on. This definition of $P^{0\to I}$ renders irrelevant the bins into which an ensemble is mapped by the onset of inhibition itself (e.g., Figure 6), since bin membership is not checked when inhibition is turned on but is instead checked at the end of the IEI during which the inhibitory onset occurs. Moreover, in this formulation, it does not matter whether inhibition turns on while the excitation is still on or after it is off, if we continue to assume that the rate of silent phase evolution is input independent (assumed in remark 8). Whether the last excitatory input that arrives before the onset of inhibition causes the cell to fire, or fails to do so, is determined by the knee positions without inhibition, and if a firing occurs, instantaneous reset to $w_{RK}^E$ follows. We neglect the probability zero case of excitation and inhibition turning on at precisely the same moment, and therefore if a cell fires, then inhibition will always turn on after the cell is reset.

As previously, let $w_{LK}^{I+E}$, $w_{RK}^{I+E}$ denote the knees of the nullcline $N^{I+E}$ corresponding to the case when both excitation and inhibition are on. One additional complication may arise due to the fact that $w_{RK}^{I+E} < w_{RK}^E$. That is, at the onset of inhibition, w may lie below the lower boundary of the inhibition-on partition. To account for this possibility requires the inclusion of additional bins in this partition, with a corresponding adjustment of the matrix $P^I$ to allow for matrix multiplication.
We define the transition matrix $P^{I\to 0}$, for the offset of inhibition, analogously to the case of $P^{0\to I}$. That is, there will be a final iteration of the inhibition-on Markov chain, before inhibition turns off, after which the cell will lie in a state $S_u^I$. During the next IEI, inhibition turns off. At the end of this IEI, the cell will lie in a state $S_v^0$. The (i, j) entry of $P^{I\to 0}$ denotes the probability that u = i and v = j. The complication in this case is that postinhibitory rebound (PIR) may occur if, when the offset of inhibition occurs, either excitation is off and $w > w_{LK}^0$ or excitation is still on and $w > w_{LK}^E$. Rebound leads to reinjection into the silent phase followed by evolution there. However, the duration of this evolution, from the time of reinjection to the time that the next excitatory input arrives, depends on when the inhibition turned off, relative to the time of the previous input, which complicates calculations.

Assuming that $P^{0\to I}$, $P^{I\to 0}$ can be computed, the appropriate transition matrix for the case of excitatory and inhibitory input trains is obtained by multiplication of the transposes of the separately computed transition matrices $P^0$, $P^I$, $P^{0\to I}$, and $P^{I\to 0}$. Specifically, let $[M]^T$ denote the transpose of matrix M, and let $(M)^n$ denote M to the nth power. Suppose that n IEIs occur in the absence of inhibition, that inhibition arrives during the next IEI, that m additional IEIs occur with inhibition on, and that inhibition turns off again on the next IEI. In this case, the (j, i) entry of the matrix

$$[P^{I\to 0}]^T ([P^I]^T)^m [P^{0\to I}]^T ([P^0]^T)^n$$

gives the probability that a cell that starts in state $S_i^0$ of the inhibition-off partition ends up in state $S_j^0$ after the sequence of n + m + 2 IEIs described above. Of course, when the IEIs and the durations of inhibitory on and off periods are selected randomly from distributions, the probabilities that different numbers (m, n) of excitatory inputs arrive during these periods are also random; in appendix C, these are calculated for the example of uniform distributions. In the general discussion given here, let us assume that $m, n \ge 0$ are bounded above by $M, N < \infty$, respectively.

Suppose that over a long time period, we check the cell's state membership at the first onset of excitation following each inhibitory offset. We would like to claim that in the asymptotic limit, the proportion of these trials for which a cell will belong to each inhibition-off state is given by the entries of the unit-dominant eigenvector of the matrix

$$\sum_{m=0}^{M} \sum_{n=0}^{N} c_{m,n}\, [P^{I\to 0}]^T ([P^I]^T)^m [P^{0\to I}]^T ([P^0]^T)^n, \tag{7.1}$$

where each coefficient $c_{m,n}$ denotes the probability of occurrence of the corresponding exponent pair (m, n), some of which may be 0.
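Assembling the matrix in equation 7.1 is simple bookkeeping once the four transition matrices and the coefficients $c_{m,n}$ are available. In the sketch below, the matrices and coefficients are randomly generated placeholders standing in for the objects constructed in the preceding sections; only the order of the factors carries content.

```python
import numpy as np

rng = np.random.default_rng(1)

def random_stochastic(rows, cols):
    """Placeholder row-stochastic matrix, standing in for a real transition matrix."""
    M = rng.random((rows, cols))
    return M / M.sum(axis=1, keepdims=True)

n0, nI = 5, 12                    # numbers of inhibition-off / inhibition-on states
P0 = random_stochastic(n0, n0)    # inhibition-off chain
PI = random_stochastic(nI, nI)    # inhibition-on chain
P0I = random_stochastic(n0, nI)   # onset of inhibition
PI0 = random_stochastic(nI, n0)   # offset of inhibition

# Placeholder coefficients c_{m,n}; they must sum to 1.
c = {(2, 2): 0.25, (2, 3): 0.25, (3, 2): 0.25, (3, 3): 0.25}

# Equation 7.1: sum over (m, n) of c_{m,n} [P^{I->0}]^T ([P^I]^T)^m [P^{0->I}]^T ([P^0]^T)^n.
B = sum(
    w * PI0.T @ np.linalg.matrix_power(PI.T, m) @ P0I.T @ np.linalg.matrix_power(P0.T, n)
    for (m, n), w in c.items()
)

# Unit-dominant eigenvector: expected occupancy of the inhibition-off states.
vals, vecs = np.linalg.eig(B)
v = np.real(vecs[:, np.argmax(np.real(vals))])
print(v / v.sum())
```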
Substantiating this claim necessitates justifying whether this eigenvector exists and truly represents a limiting distribution. This will be true if the matrix in equation 7.1 is irreducible and aperiodic, as discussed in section 4. Recall that the proof of theorem 2 gives an upper bound b for the number of inputs after which occupancy of all states is guaranteed, under the assumptions of the theorem, for constant inhibition. Now, let $b^I$, $b^0$ denote the respective upper bounds for matrices $P^I$, $P^0$, based on the IEI distribution. A sufficient pair of conditions to guarantee the existence of a limiting distribution for equation 7.1 is that if the probability p(m) > 0, then $m \ge b^I$, while if p(n) > 0, then $n \ge b^0$. If these conditions are violated, the states of the system may still tend to some limiting distribution, particularly if p(m), p(n) are substantial for some values above the respective bounds, but this must be verified numerically.

Similarly, the unit-dominant eigenvector of the matrix

$$\sum_{m=1}^{M} \sum_{n=1}^{N} c_{m,n}\, [P^{0\to I}]^T ([P^0]^T)^n [P^{I\to 0}]^T ([P^I]^T)^m, \tag{7.2}$$
if it exists and represents a limiting distribution, gives the expected proportions of trials for which a cell will belong to each inhibition-on state $(I_k^I, l)$ at the moment of arrival of the first excitatory input after inhibition turns on. The dominant eigenvectors of the matrices

$$\sum_{m=1}^{M} \sum_{n=1}^{N} c_{m,n}\, ([P^0]^T)^n [P^{I\to 0}]^T ([P^I]^T)^m [P^{0\to I}]^T$$

$$\sum_{m=1}^{M} \sum_{n=1}^{N} c_{m,n}\, ([P^I]^T)^m [P^{0\to I}]^T ([P^0]^T)^n [P^{I\to 0}]^T$$
have similar interpretations.

Remark 9. In fact, justifying the existence of the limiting distribution for the matrix in equation 7.2 requires an additional technical adjustment, relative to that for matrix 7.1. The added complication arises because the fixed point of $N^I$ is higher than that of $N^0$, which may lead to a violation of irreducibility. This is simply a technical point and can be handled by replacing $([P^I]^T)^m$ by the product of a sequence of nonidentical matrices that incorporate successively larger numbers of states, assuming m is not too small.

7.2 TC Cell Example Revisited: Uniform Distributions. Consider again the TC model, equation 6.1. We focused on inhibitory onset and computed the transition matrices from equation 7.2 analytically, using
equation 3.5 as in the constant inhibition cases discussed in appendixes A and B, under the assumptions that the intervals from excitatory offsets to onsets are selected from a uniform distribution on [20,60] msec, that the excitatory duration is fixed at 10 msec, and that the durations of both the inhibitory inputs and the time intervals between inhibitory inputs are selected from a uniform distribution on [125,175] msec. In this example, the coefficients c_{m,n} = p(m)p(n) in equation 7.2 can also be computed analytically, and we discuss some details of this calculation in appendix C. After the p(m) are computed, the dominant eigenvector v^{0→I} of matrix 7.2 can be obtained. We compare v^{0→I} to its counterpart, v^{0→I}_num, generated numerically from 301 inhibitory onsets during the last 90 sec of a 100 sec simulation:

$$v^{0\to I} = [.3530\ \ .0801\ \ .1598\ \ .2558\ \ .0333\ \ .0677\ \ .0363\ \ 0\ \ 0\ \ .0140\ \ 0\ \ 0]^T$$
$$v^{0\to I}_{num} = [.3089\ \ .1030\ \ .0897\ \ .2724\ \ .0199\ \ .0432\ \ .0698\ \ 0\ \ 0\ \ .0664\ \ .0066\ \ 0]^T.$$
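The computation just described reduces to forming the matrix in equation 7.2 and extracting its dominant eigenvector. The following minimal sketch shows the recipe; it assumes NumPy, and the arrays P0, PI, P0I, PI0 and the coefficient table c are hypothetical stand-ins for the analytically computed transition matrices and the c_{m,n} = p(m)p(n) values (the state ordering for this example is listed in the next paragraph).

```python
import numpy as np

def limiting_distribution(P0, PI, P0I, PI0, c, M, N):
    """Form sum_{m,n} c[m, n] [P^{0->I}]^T ([P^0]^T)^n [P^{I->0}]^T ([P^I]^T)^m,
    the matrix in equation 7.2, and return its dominant eigenvector,
    normalized to sum to 1 so it can be read as a distribution over states."""
    K = P0.shape[0]
    A = np.zeros((K, K))
    for m in range(1, M + 1):
        for n in range(1, N + 1):
            A += c[m, n] * (P0I.T @ np.linalg.matrix_power(P0.T, n)
                            @ PI0.T @ np.linalg.matrix_power(PI.T, m))
    vals, vecs = np.linalg.eig(A)
    v = np.real(vecs[:, np.argmax(np.real(vals))])  # dominant (Perron) eigenvector
    return v / v.sum()
```

With the twelve-state matrices of this example in place of the stand-ins, the returned vector corresponds to v^{0→I} above.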
The states here are exactly those listed in section 6 for the uniform IEI distribution with s_inh = 1: (1,1), (2,1), (2,2), (3,2), (3,3), (4,2), (4,3), (4,4), (5,2), (5,3), (5,4), and (5,5). Note that the zero entries here signify values less than 0.5 × 10^{-5}. These states are included in the chain because they are reached after inhibition stays on sufficiently long, since inhibition shifts the v-nullcline to higher w-values. However, the probability of membership in these states just after inhibition turns on is negligible. The agreement between v^{0→I} and v^{0→I}_num is fairly good, although not as good as in the case of constant inhibition in section 6. This is presumably due to errors introduced by some minor simplifying assumptions that we made, which are discussed in remark 10. The distribution v^{0→I} indicates that a cell will be expected to fire well under 10% of the time to the first input that arrives just after the onset of inhibition. This low number fits the prediction of phase plane and bifurcation analysis, which implies that the onset of inhibition interferes with responsiveness to the subsequent excitatory input (Rubin & Terman, 2004). Moreover, the full distribution v^{0→I} characterizes the expected behavior of the cell after this first excitatory input. In particular, note that the cell will be in the (1,1) state, corresponding to the lowest slow variable values, on the arrival of over 30% of such inputs, which is a significant increase over the likelihood of membership in this state in the s_inh = 1 case. Hence, the compromise of responsiveness by rhythmic inhibition will endure beyond the first excitation after inhibitory onset.

Remark 10. Recall that PIR, or rebound, refers to the firing of a spike immediately on the removal of inhibition. Although we focus on equation 7.2, and hence on the limiting distribution v^{0→I} corresponding to the onset of inhibition, in the above example, we still must consider PIR in the calculation of the elements of the matrix P^{I→0}, which appears in equation 7.2. Since the possibility of rebounding in general, and in particular rebounding
and reaching I_2 before the arrival of the next excitation, is relatively small for our parameter values, we make the simplifying assumption that after PIR, a cell can reach only I_1 before the next excitation arrives. To maintain tractability, we also neglect the fact that w^{I+E}_{RK} > w^E_{RK}, based on the fact that the right knees are fairly close for g_inh = 0.12. These approximations do introduce some error into our results.

7.3 TC Cell Example Revisited: Normal Distributions. For comparison to direct numerical simulation, we also calculated the transition matrices and dominant eigenvector when IEIs were selected from the truncated normal distribution described earlier (see equation 6.3), with inhibition on and off durations selected from a similar truncated normal distribution. The distribution for on and off intervals of inhibition was supported on [125,175] msec and had mean 150 msec, as in the uniform distribution example. To compute the relevant dominant eigenvector, as done above in the uniform distribution case, we need coefficients c_{m,n}, which we computed as c_{m,n} = p(m)p(n) from the numerically obtained probabilities p(1) = .0014, p(2) = .1625, p(3) = .6763, p(4) = .1625, p(5) = .0014, with p(j) = 0 for j ≥ 6. In this case, we used transition probabilities obtained from long-time simulations to compute the transition matrices P^{I→0}_n, P^{0→I}_n, which we do not display here, although these could have been computed from equation 3.5 as well. The dominant eigenvector v^{0→I}_n, corresponding to state occupancy at the arrival time of the first excitatory input after inhibitory onset, and an example of the distribution of states (v^{0→I}_n)_num obtained from the last 90 sec of a 100 sec simulation are

$$v^{0\to I}_n = [.3354\ \ .0463\ \ .1233\ \ .3685\ \ .0083\ \ .0394\ \ .0350\ \ .0439\ \ 0]^T$$
$$(v^{0\to I}_n)_{num} = [.3984\ \ .0549\ \ .1071\ \ .3132\ \ .0082\ \ .0302\ \ .0384\ \ .0467\ \ .0027]^T,$$

for states (1, 1), (2, 1), (2, 2), (3, 2), (3, 3), (4, 2), (4, 3), (5, 3), (5, 4), with the subscript n for normal as previously. These results show that a cell can be expected to fire reliably to the first excitatory input after the onset of inhibition less than 5% of the time, and it will lie in the (1,1) state on the arrival of over 30% of such inputs, as also seen in the uniform distribution example.

8 Explicit Connection to Parkinsonian Reliability Failure

In PD, the inhibitory output from the basal ganglia may become rhythmic. DBS eliminates this rhythmicity, leading to inhibition from the basal ganglia that is elevated but, when summed over all inputs to a cell, is roughly constant, possibly with fast oscillations around a high constant level. A possible mechanism for the induction of motor symptoms in PD and for their amelioration by DBS, analyzed in Rubin and Terman (2004), is that
rhythmic basal ganglia inhibitory outputs periodically compromise TC response reliability, while the regularization of these outputs by DBS restores responsiveness. The drastic drop in the number of TC cells expected to fire and the accumulation of cells in states far from firing threshold found just after inhibitory onset in the case of rhythmic inhibition in our examples, relative even to the case of elevated but constant inhibition, offers a strong demonstration of the feasibility of this idea. The approach that we have taken to obtain these results is based on limiting distributions, and hence eliminates the possibility of spurious outcomes due to transient effects in simulations. Suppose that we adjust parameters to an extreme case, so that TC cells fire in response to every excitatory input when s_inh = 0 and when s_inh = 1. In terms of the slow variable, these conditions become

$$w^{E}_{RK} \cdot S > w^{E}_{LK} \tag{8.1}$$

and

$$w^{I+E}_{RK} \cdot S > w^{I+E}_{LK}.$$
Even with these strong conditions, we still find some failures when inhibition turns on and off rhythmically. In this case, we can read off from v^{0→I} the probabilities with which a cell will experience each possible number of failures due to inhibitory onset, as seen in the examples discussed above. In particular, even if no failures occur when inhibition is held at a constant level, there may be multiple failures following inhibitory onset if the difference w^{I+E}_{RK} − w^E_{RK} is large. Inhibitory offset may lead to response failure as well, through PIR. Cells that rebound when inhibition turns off are reset to w^E_{RK}. The assumption of reliability in the inhibition-off case implies that inequality 8.1 holds. However, the time from the offset of inhibition until the arrival of the next excitatory input may be less than S, since inhibition may turn off at any moment in the IEI. Therefore, the time t* from PIR to the next arrival of excitatory input may be less than S, and a response failure can follow PIR. Finally, after such a failure, relay will proceed reliably under assumption 8.1, since a cell will lie at w^E_{RK} · t* > w^E_{RK} after its first failure. In summary, even under the assumption of perfect TC relay reliability in the constant inhibition case, multiple relay failures can occur following inhibitory onsets, while inhibitory offsets can induce PIR, possibly followed by a single relay failure.

9 Discussion

We have considered a fast-slow excitable system subject to a stochastic excitatory input train, and we have shown how to derive an irreducible
Markov chain that can be used analytically to compute the system's firing probability to each input, expected number of response failures between firings, and distribution of slow variable values between firings, in the infinite-time limit. We have illustrated this analysis on a model TC cell subject to a uniform or a truncated normal distribution of excitatory synaptic inputs, in the cases of constant inhibition and of inhibition that switches abruptly between two levels. The analysis generalizes to any pair of input trains, excitatory or inhibitory and synaptic or not, that switch, with distinct frequencies, between discrete on and off states. In such cases, an appropriate transition matrix, analogous to equation 7.1, can always be derived. This approach also can be extended to other models, such as the Haken (2002) or Kuramoto (1984) models, featuring a single variable that builds up to a threshold, mediates an instantaneous spike, and experiences a reset, possibly followed by a refractory period. In this vein, we expect that extension to integrate-and-fire type models should be possible, but the details of handling nonmonotone changes in voltage would need to be worked out. In the TC cell case, our results generalize earlier findings suggesting how the modulation of inhibitory outputs of the basal ganglia can compromise TC responsiveness to excitatory inputs, with possible relevance to PD and DBS (Rubin & Terman, 2004). The method used here goes beyond the direct counting of failed responses to a sequence of inputs that was done previously, by deriving information about the complete limiting distribution of states in the TC cell Markov chain. More generally, basal ganglia output areas in rat show abrupt firing rate fluctuations on the timescale of seconds to minutes even in non-parkinsonian states (Ruskin et al., 1999), and the ideas we introduced could be used to consider how different fluctuation characteristics affect TC reliability. TC cells are also inhibited by thalamic reticular (RE) cells, which are targets of excitatory corticothalamic inputs. Cortical oscillations, particularly abrupt transitions between cortical up and down states (Steriade, Nunez, & Amzica, 1993a; Cowan & Wilson, 1994), could naturally lead to jumps in the levels of inhibition from RE cells to TC cells, providing an alternative source for the type of inhibition that we consider. In this context, our results provide a way to quantify the expected extent of the transient loss of thalamic relay reliability during the transitions between up and down states of different depths, as well as the likelihood that transitions from down to up will be signaled by PIR thalamic bursts (Steriade, Nunez, & Amzica, 1993b). Huertas et al. (2005) have also considered the relay properties of TC cells, specifically those in the dorsal lateral geniculate nucleus, in the presence of RE inhibition, using simulations of an integrate-and-fire-or-burst model with an oscillatory driving input based on retinal ganglion cell activity. In their simulations, as found here and in previous work such as Rubin and Terman (2004), rhythmic inhibition to TC cells led to TC bursts and a failure of TC cells to respond to excitatory signals, although TC-RE interactions in their model gave rhythmic TC bursting
phase-locked to the driving stimulus, which would not be present for the types of synaptic drive we consider. We have proved that under general conditions, the Markov chain we derived is irreducible and aperiodic and hence has a limiting distribution, which can be computed analytically using equation 3.5, using a one-time simulation for establishment of bin boundaries. This finding contrasts with Monte Carlo simulations, in which convergence properties cannot be forecast and transient effects may be a concern. Our analysis also explains why a limiting distribution will not exist when these conditions break down, and hence these results, even without the full calculations, can be used to decide whether Monte Carlo simulations are a reasonable option. It is interesting to consider the minimal requirements for the application of our Markov chain ideas. Application of this approach will be possible whenever a driven system can be characterized as having a single (or single dominant) slow recovery process or other variable that builds up to a firing threshold in the silent phase, the transition probabilities between states in the silent phase can be estimated from the behavior of this variable, and the statistics of the input to the cell are known. Consideration of the minimal requirements for the experimental characterization of such transition probabilities for a neuron, possibly along the lines of phase response curve estimation, remains an interesting avenue for future consideration. Our work is related to two earlier studies in which a Markov operator was derived to track transitions between oscillator phases, relative to sinusoidal forcing, at successive jump-up (Doi et al., 1998) or threshold-crossing (Tateno & Jimbo, 2000) times, in the presence of noise. Neither of these works, however, used a Markov chain to track transitions linked to successive input arrivals but not to firing events. Moreover, neither arrived at analytically computable transition probabilities or proved the existence of a limiting distribution, as we have done. A nice feature of the previous letters was the numerical tracking of subdominant eigenvalues to identify bifurcations, defined in a stochastic sense, relating to changes in mode locking. The approach that we have presented could also be used to study bifurcations. For example, as mentioned in remark 6, a change in the range of IEIs can cause the transition probability between a pair of states to switch from zero to nonzero, which may abruptly change the existence, or the nature, of the corresponding limiting distribution. As a related alternative approach, one could try to analyze the long-term behavior of a cell by studying the random dynamical system w_{n+1} = M_T(w_n), where the interinput interval T is a random variable with a known density (Lasota & Mackey, 1994). If it could be found, then the limiting density for the variable w could be used to compute the statistics of interest for the cell. However, finding this limiting density is in general quite difficult. The Markov chain approach that we have presented can in fact be viewed as a discretization of this procedure. As a result, we recover a coarsely grained
version of the full limiting density for w, which is sufficient to determine the statistics of interest. In a series of letters (Othmer & Watanabe, 1994; Xie et al., 1996; Othmer & Xie, 1999), Othmer and collaborators studied the application of step function forcing to a piecewise linear excitable system similar to those that we consider. In their work, which focused on the existence of mode-locked (harmonic or subharmonic) solutions and on chaos, they used a map and defined states as we do, but their states were based on positions of the knees of nullclines and their projections to other nullclines rather than the flow along branches of nullclines, and they did not consider a Markov chain for transitions between states. Moreover, their analysis was restricted to periodic forcing, whereas ours accommodates, and indeed is particularly well suited for, stochasticity in input timing. LoFaro and Kopell (1999) also used 1D maps to study a forced excitable system, but in their work, the excitable system was a neuron mutually coupled via inhibitory synapses to an oscillatory cell and the map was a singular Poincaré map, with each iteration corresponding to the time between jumps to the active phase. Similarly, Keener, Hoppensteadt, and Rinzel (1981) and subsequent authors have used firing time maps to study mode locking in integrate-and-fire and related models with periodic stimuli. Alternatively, other works have used 1D maps based on interinput intervals to study mode-locked, quasi-periodic, and chaotic responses of excitable systems to periodic, purely excitatory inputs (Coombes & Osbaldestin, 2000; Ichinose et al., 1998). Clearly, the transformation of some combination of excitatory and inhibitory synaptic inputs into postsynaptic neuronal responses is a fundamental operation present within any nontrivial nervous system. As a result, various manifestations of this transformation have been studied, computationally and experimentally, by many researchers. In our work, as in Smith and Sherman (2002), we consider the excitatory input stream as a drive to the postsynaptic cell and the inhibitory input as a modulation that alters the cell's responsiveness to its drive. We neglect such intriguing effects as long-term synaptic scaling (Desai, Cudmore, Nelson, & Turrigiano, 2002), changes in intrinsic excitability (Aizenman, Akerman, Jensen, & Cline, 2003), and changes in the balance of excitation and inhibition (Somers et al., 1998), which could become relevant in the asymptotic limit, as well as the effect of attention (Tiesinga, 2005). Moreover, we assume that successful thalamic relay consists of single-spike responses to an input train, neglecting the idea that by virtue of their stereotyped form and reliability, thalamic bursts may serve a relay function (Person & Perkel, 2005; Babadi, 2005). Other authors have considered how variations in intrinsic properties of postsynaptic cells determine the input characteristics that induce these cells to spike most reliably (Fellous et al., 2001; Schreiber, Fellous, Tiesinga, & Sejnowski, 2004) and how neuronal processing varies under more general changes in input spike patterns than the ones that
we have considered here (e.g., Hunter & Milton, 2003; Tiesinga & Toups, 2005). While these issues are beyond the scope of this work, our approach can accommodate input trains that vary stochastically in a variety of ways (e.g., see remarks 3 and 4), and hence it may be well suited for the future exploration of such issues.

Appendix A: TC Cell with s_inh = 0

With s_inh = 0, a cell that has just spiked is guaranteed to spike again after at most N = 3 subsequent inputs. Although the intervals I_k are defined in terms of intervals of the slow variable w, clearly there is also a particular time interval associated with each I_k (see remark 11 below). In the particular model, equation 6.1, with s_inh = 0, these time intervals are [20, 50), [50, 75.5), [75.5, ∞), where w^E_{RK} · 75.5 ≈ w^E_{LK}. In practice, we used a simulation protocol, rather than determination of the left knee of N^E directly from equation 6.1, to compute the value 75.5. That is, we found 75.5 to be the minimum value of t such that, given an initial condition with w = w^E_{RK} · t and with v at the corresponding position on the left branch of N^0, the model cell would fire in response to an excitatory input of duration 10 msec. This adjustment represents the modification to w^E_{LK} discussed in remark 4 and is therefore consistent with the rest of the analysis that we present. Using these time intervals allows us to compute the transition probabilities between states (I_k, j) for j, k = 1, . . . , 3, with j ≤ k. Let p_{(k1,j1)→(k2,j2)} = P[(I_{k2}, j2) | (I_{k1}, j1)]. To compute these probabilities, we start with the fate of a cell just after firing, computing p_{(3,j)→(k,1)} for each relevant (j, k). First, note that since all cells fire from a state of the form (I_3, j) for some j ≤ 3 and all firing cells are reset together to w^E_{RK} regardless of where they fired from, p_{(3,j)→(k,1)} is independent of j. In this example, with g_inh = 0.12, we have p_{(3,j)→(1,1)} = P[T_1 ∈ [20, 50)], where T_1 denotes the time from reset to the onset of the next excitatory input. Further, P[T_1 ∈ [20, 50)] = 3/4, since T_1 is selected from a uniform distribution on [20,60]. Similarly, p_{(3,j)→(2,1)} = 1/4 and p_{(3,j)→(k,1)} = 0 for all k > 2. Next, we seek values for p_{(1,1)→(k,2)} for each k ≥ 2 and p_{(2,1)→(k,2)} for each k ≥ 3. We have

$$p_{(1,1)\to(2,2)} = P[T_1 + T_2 + 10 \in [50, 75.5) \mid T_1 \in [20, 50)],$$

where T_2 + 10 denotes the second IEI, with the input duration of 10 msec written out explicitly. This can be computed as the ratio of two areas, either geometrically or by integration, leading to the result that p_{(1,1)→(2,2)} = 2601/9600. Since N = 3 for s_inh = 0, it follows that p_{(1,1)→(3,2)} = 6999/9600,
and that p_{(2,1)→(3,2)} = p_{(2,2)→(3,3)} = 1. This completes the calculation for s_inh = 0. The corresponding transition matrix for s_inh = 0 is thus

$$P = \begin{pmatrix}
0 & 0 & 2601/9600 & 6999/9600 & 0 \\
0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 1 \\
3/4 & 1/4 & 0 & 0 & 0 \\
3/4 & 1/4 & 0 & 0 & 0
\end{pmatrix}, \tag{A.1}$$

where the states of the Markov chain are (1,1), (2,1), (2,2), (3,2), (3,3).
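As a quick sanity check on equation A.1, the entry p_{(1,1)→(2,2)} = 2601/9600 ≈ .2709 can be estimated by Monte Carlo sampling of the uniform interval distributions described above. The sketch below is a minimal illustration assuming NumPy; the variable names are ours, not the letter's.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# T1, T2: silent intervals between inputs, uniform on [20, 60] msec.
T1 = rng.uniform(20, 60, n)
T2 = rng.uniform(20, 60, n)

# Condition on the first input arriving while the cell is in I1, i.e. T1 in [20, 50).
in_I1 = T1 < 50
# Second input (IEI = T2 + 10 msec of input duration) arrives while the cell is in I2.
second = T1[in_I1] + T2[in_I1] + 10
hits = (second >= 50) & (second < 75.5)

print(hits.mean())   # approximately 0.2709
print(2601 / 9600)   # 0.2709375, the analytical value
```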
Similar calculations yield the transition matrix

$$P_n = \begin{pmatrix}
0 & 0 & .1474 & .8526 & 0 \\
0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 1 \\
.8576 & .1424 & 0 & 0 & 0 \\
.8576 & .1424 & 0 & 0 & 0
\end{pmatrix} \tag{A.2}$$

for states (1,1), (2,1), (2,2), (3,2), (3,3) in the case of normally distributed IEIs, also discussed in section 6.

Remark 11. The time intervals associated with each I_k are independent of the number of inputs received since the last spike, and hence uniquely defined, under the assumption of input-independent evolution of w (see remark 8). Without this assumption, to each bin, we could associate different time intervals based on the number of inputs received since the last spike. Once this is done, the calculations proceed as discussed here, with appropriate adjustment of the limits of integration in equation 3.5.

Appendix B: TC Cell with s_inh = 1

The case s_inh = 1 requires more analytical computations than the case s_inh = 0, since N = 5 for s_inh = 1. In this case, the p_{(5,j)→(k,1)} are identical to the p_{(3,j)→(k,1)} computed for s_inh = 0 above, and are nonzero only for k = 1, 2. The states for s_inh = 1 correspond to times [20,50), [50,80), [80,110), [110,128), [128, ∞) (see remark 11). Similar to the previous case, with the input duration of 10 msec again included explicitly,

$$p_{(1,1)\to(2,2)} = P[T_1 + T_2 + 10 \in [50, 80) \mid T_1 \in [20, 50)] = 3/8,$$
with p_{(1,1)→(3,2)} = 1/2 and p_{(1,1)→(4,2)} = 1/8 by analogous calculations. Along the same lines,

$$p_{(2,1)\to(3,2)} = P[T_1 + T_2 + 10 \in [80, 110) \mid T_1 \in [50, 60)] = 5/8,$$

while p_{(2,1)→(4,2)} = 37/100 and p_{(2,1)→(5,2)} = 1/200. As we proceed further, certain probability calculations become more involved, because membership in a state may be achieved by more than one path. For example, we see already that a cell may reach state (I_3, 2) from (I_1, 1) or from (I_2, 1). Thus, continuing to use T_i to denote the times between the offset of one input and the onset of the next, we have

$$\begin{aligned}
p_{(3,2)\to(4,3)} ={}& P[T_1 + T_2 + T_3 + 20 \in [110, 128) \mid T_1 \in [20, 50) \text{ and } T_1 + T_2 + 10 \in [80, 110)] \\
&+ P[T_1 + T_2 + T_3 + 20 \in [110, 128) \mid T_1 \in [50, 60) \text{ and } T_1 + T_2 + 10 \in [80, 110)] \\
={}& 2123/14{,}250.
\end{aligned}$$

The full set of calculations reveals that the transition matrix P^I for s_inh = 1 has the form

$$P^I = \begin{pmatrix}
0 & 0 & \tfrac{3}{8} & \tfrac{1}{2} & 0 & \tfrac{1}{8} & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & \tfrac{5}{8} & 0 & \tfrac{37}{100} & 0 & 0 & \tfrac{1}{200} & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & \tfrac{1}{4} & 0 & \tfrac{6011}{13500} & 0 & 0 & \tfrac{2057}{6750} & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & \tfrac{2123}{14250} & 0 & 0 & \tfrac{12127}{14250} & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & \tfrac{243}{10000} & 0 & 0 & \tfrac{9757}{10000} & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \\
\tfrac{3}{4} & \tfrac{1}{4} & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
\tfrac{3}{4} & \tfrac{1}{4} & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
\tfrac{3}{4} & \tfrac{1}{4} & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
\tfrac{3}{4} & \tfrac{1}{4} & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0
\end{pmatrix} \tag{B.1}$$
where the states of the Markov chain are (1,1), (2,1), (2,2), (3,2), (3,3), (4,2), (4,3), (4,4), (5,2), (5,3), (5,4), and (5,5).
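A chain like this is easy to interrogate numerically once assembled. The sketch below (a minimal illustration assuming NumPy, and assuming, as in appendix A, that the cell fires exactly on the inputs it receives while in the (5, j) states) recovers the stationary distribution of P^I by power iteration and reads off the long-run firing probability per input.

```python
import numpy as np

# Rows/columns ordered as the states (1,1), (2,1), (2,2), (3,2), (3,3),
# (4,2), (4,3), (4,4), (5,2), (5,3), (5,4), (5,5); entries from (B.1).
P_I = np.zeros((12, 12))
P_I[0, [2, 3, 5]] = [3/8, 1/2, 1/8]
P_I[1, [3, 5, 8]] = [5/8, 37/100, 1/200]
P_I[2, [4, 6, 9]] = [1/4, 6011/13500, 2057/6750]
P_I[3, [6, 9]] = [2123/14250, 12127/14250]
P_I[4, [7, 10]] = [243/10000, 9757/10000]
P_I[5, 9] = P_I[6, 10] = P_I[7, 11] = 1.0
P_I[8:, 0], P_I[8:, 1] = 3/4, 1/4          # firing states reset toward (1,1), (2,1)

assert np.allclose(P_I.sum(axis=1), 1.0)   # every row is a probability distribution

# Stationary distribution by power iteration on the transpose.
pi = np.full(12, 1 / 12)
for _ in range(2000):
    pi = P_I.T @ pi
pi /= pi.sum()

firing_prob = pi[8:].sum()                 # long-run mass in the (5, j) states
print(firing_prob, 1 / firing_prob)        # firing probability per input; inputs per spike
```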
Similar calculations yield the transition matrix
$$P^I_n = \begin{pmatrix}
0 & 0 & .2548 & .7249 & 0 & .0203 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & .7352 & 0 & .2642 & 0 & .0006 & 0 & 0 \\
0 & 0 & 0 & 0 & .1074 & 0 & .5612 & 0 & .3314 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & .1448 & 0 & .8552 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \\
.8576 & .1424 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
.8576 & .1424 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
.8576 & .1424 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0
\end{pmatrix} \tag{B.2}$$
for states (1, 1), (2, 1), (2, 2), (3, 2), (3, 3), (4, 2), (4, 3), (5, 2), (5, 3), (5, 4) in the case of normally distributed IEIs, also discussed in section 6.

Appendix C: Coefficients for Matrix 7.1 in the Case of Time-Dependent Inhibition and Excitation

To calculate the coefficients c_{m,n} = p(m)p(n) analytically, we compute p(1), which is the probability that exactly one excitatory input arrives during an epoch of constant inhibition, and then we compute p(m) for each m = 2, . . . , 6 as the probability of at most m inputs arriving minus the probability of at most m − 1 inputs arriving. We stop at m = 6, since at most six excitatory inputs can arrive during 175 msec, given the IEIs and excitation duration that we consider. Let t = 0 denote the time of inhibitory offset and let T_I ∈ [125, 175] denote the duration of the ensuing time interval on which inhibition remains off. Define T_{-1} as the time from the last excitatory onset before t = 0 to the first excitatory onset after t = 0. Let T_1 denote the time of this first excitatory onset. Given that each excitation endures for 10 msec, T_{-1} ∈ [30, 70], while T_1 ∈ [0, T_{-1}] (see Figure 7). Finally, let T_2 denote the IEI following the end of the excitatory input that occurs at time T_1. Since T_1 < T_I must hold, the value of p(1) is the probability that T_1 + T_2 + 10 > T_I. Thus,

$$p(1) = \frac{\displaystyle\int_{125}^{175}\int_{30}^{70}\int_{0}^{T_{-1}}\int_{\min(T_I - T_1 - 10,\,60)}^{60} dT_2\, dT_1\, dT_{-1}\, dT_I}{\displaystyle\int_{125}^{175}\int_{30}^{70}\int_{0}^{T_{-1}}\int_{20}^{60} dT_2\, dT_1\, dT_{-1}\, dT_I} = \frac{27}{51{,}200}.$$
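The same value can be checked by Monte Carlo integration. In the sketch below (a minimal illustration assuming NumPy; the variable names are ours), the pair (T_{-1}, T_1) is sampled uniformly over the region {30 ≤ T_{-1} ≤ 70, 0 ≤ T_1 ≤ T_{-1}} by rejection, which matches the ratio of integrals above, while T_I and T_2 are drawn from their uniform ranges.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000_000

# Sample (T_minus1, T1) uniformly over 30 <= T_minus1 <= 70, 0 <= T1 <= T_minus1,
# by rejection from the enclosing rectangle.
T_minus1 = rng.uniform(30, 70, n)
T1 = rng.uniform(0, 70, n)
keep = T1 <= T_minus1
T1 = T1[keep]

T_I = rng.uniform(125, 175, T1.size)   # duration of the inhibition-off epoch
T2 = rng.uniform(20, 60, T1.size)      # next inter-excitation interval

# p(1): exactly one input arrives during the epoch, i.e. the second excitatory
# onset (at T1 + T2 + 10, including the 10 msec input duration) misses T_I.
p1 = np.mean(T1 + T2 + 10 > T_I)
print(p1, 27 / 51200)                  # both approximately 5.3e-4
```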
For m > 1, each quantity p(m) is given as the ratio of two multiple integrals as well, with one additional nested integral in each, relative to those used to calculate p(m − 1). Since these integrals can be evaluated exactly, we obtain exact fractional representations for each, but to save
Figure 7: Notation for the calculation of the probabilities p(m): the intervals T_I, T_{-1}, T_1, T_2, T_3 around the inhibitory offset at t = 0. Note that the actual number of excitatory inputs arriving during the interval of constant inhibition will be between one and six.
space, we simply give decimal approximations here: p(2) ≈ .1985, p(3) ≈ .6075, p(4) ≈ .1875, p(5) ≈ .0060, p(6) ≈ 4.731 × 10^{-6}.

Acknowledgments

We received partial support for this work from National Science Foundation grants DMS-0414023 to J.R., and DMS-0244529, ATM-0417867 and a GEAR grant from the University of Houston to K.J. We thank the anonymous reviewers for comments that helped improve the clarity of the letter.
References

Aizenman, C. D., Akerman, C. J., Jensen, K. R., & Cline, H. T. (2003). Visually driven regulation of intrinsic neuronal excitability improves stimulus detection in vivo. Neuron, 39(5), 831–842.
Anderson, M., Postpuna, N., & Ruffo, M. (2003). Effects of high-frequency stimulation in the internal globus pallidus on the activity of thalamic neurons in the awake monkey. J. Neurophysiol., 89, 1150–1160.
Babadi, B. (2005). Bursting as an effective relay mode in a minimal thalamic model. J. Comput. Neurosci., 18, 229–243.
Brown, P., Oliviero, A., Mazzone, P., Insola, A., Tonali, P., & Lazzaro, V. D. (2001). Dopamine dependency of oscillations between subthalamic nucleus and pallidum in Parkinson's disease. J. Neurosci., 21, 1033–1038.
Chance, F., Abbott, L., & Reyes, A. (2002). Gain modulation from background synaptic input. Neuron, 35, 773–782.
Chow, C., & Coombes, S. (2006). Existence and wandering of bumps in a spiking neural network model. SIAM J. Dyn. Syst., 5, 552–574.
Coombes, S., & Osbaldestin, A. (2000). Period-adding bifurcations and chaos in a periodically stimulated excitable neural relaxation oscillator. Phys. Rev. E, 62, 4057–4066.
Cowan, R. L., & Wilson, C. J. (1994). Spontaneous firing patterns and axonal projections of single corticostriatal neurons in the rat medial agranular cortex. J. Neurophysiol., 71(1), 17–32.
De Schutter, E. (1999). Using realistic models to study synaptic integration in cerebellar Purkinje cells. Rev. Neurosci., 10, 233–245.
Desai, N. S., Cudmore, R. H., Nelson, S. B., & Turrigiano, G. G. (2002). Critical periods for experience-dependent synaptic scaling in visual cortex. Nat. Neurosci., 5(8), 783–789.
Doi, S., Inoue, J., & Kumagai, S. (1998). Spectral analysis of stochastic phase lockings and stochastic bifurcations in the sinusoidally forced van der Pol oscillator with additive noise. J. Stat. Phys., 90(5/6), 1107–1127.
Fellous, J. M., Houweling, A. R., Modi, R. H., Rao, R. P., Tiesinga, P. H., & Sejnowski, T. J. (2001). Frequency dependence of spike timing reliability in cortical pyramidal cells and interneurons. J. Neurophysiol., 85(4), 1782–1787.
Haken, H. (2002). Brain dynamics: Synchronization and activity patterns in pulse-coupled neural nets with delays and noise. Berlin: Springer-Verlag.
Hashimoto, T., Elder, C., Okun, M., Patrick, S., & Vitek, J. (2003). Stimulation of the subthalamic nucleus changes the firing pattern of pallidal neurons. J. Neurosci., 23, 1916–1923.
Hoel, P. G., Port, S. C., & Stone, C. J. (1972). Introduction to stochastic processes. Boston: Houghton Mifflin.
Huertas, M., Groff, J., & Smith, G. (2005). Feedback inhibition and throughput properties of an integrate-and-fire-or-burst network model of retinogeniculate transmission. J. Comput. Neurosci., 19, 147–180.
Hunter, J. D., & Milton, J. G. (2003). Amplitude and frequency dependence of spike timing: Implications for dynamic regulation. J. Neurophysiol., 90(1), 387–394.
Ichinose, N., Aihara, K., & Judd, K. (1998). Extending the concept of isochrons from oscillatory to excitable systems for modeling excitable neurons. Int. J. Bif. Chaos, 8(12), 2375–2385.
Keener, J. P., Hoppensteadt, F. C., & Rinzel, J. (1981). Integrate-and-fire models of nerve membrane response to oscillatory input. SIAM J. Appl. Math., 41(3), 503–517.
Kuramoto, Y. (1984). Chemical oscillations, waves, and turbulence. Berlin: Springer-Verlag.
Lasota, A., & Mackey, M. C. (1994). Chaos, fractals, and noise: Stochastic aspects of dynamics (2nd ed.). New York: Springer-Verlag.
LoFaro, T., & Kopell, N. (1999). Timing regulation in a network reduced from voltage-gated equations to a one-dimensional map. J. Math. Biol., 38(6), 479–533.
Magnin, M., Morel, A., & Jeanmonod, D. (2000). Single-unit analysis of the pallidum, thalamus, and subthalamic nucleus in parkinsonian patients. Neuroscience, 96, 549–564.
Nini, A., Feingold, A., Slovin, H., & Bergman, H. (1995). Neurons in the globus pallidus do not show correlated activity in the normal monkey, but phase-locked
oscillations appear in the MPTP model of parkinsonism. J. Neurophysiol., 74, 1800–1805.
Othmer, H., & Watanabe, M. (1994). Resonance in excitable systems under step-function forcing. I. Harmonic solutions. Adv. Math. Sci. Appl., 4, 399–441.
Othmer, H., & Xie, M. (1999). Subharmonic resonance and chaos in forced excitable systems. J. Math. Biol., 39, 139–171.
Person, A. L., & Perkel, D. J. (2005). Unitary IPSPs drive precise thalamic spiking in a circuit required for learning. Neuron, 46(1), 129–140.
Raz, A., Vaadia, E., & Bergman, H. (2000). Firing patterns and correlations of spontaneous discharge of pallidal neurons in the normal and tremulous 1-methyl-4-phenyl-1,2,3,6 tetrahydropyridine vervet model of parkinsonism. J. Neurosci., 20, 8559–8571.
Rubin, J. E., & Terman, D. (2002). Geometric singular perturbation analysis of neuronal dynamics. In B. Fiedler (Ed.), Handbook of dynamical systems (Vol. 2, pp. 93–146). Amsterdam: North-Holland.
Rubin, J. E., & Terman, D. (2004). High frequency stimulation of the subthalamic nucleus eliminates pathological thalamic rhythmicity in a computational model. J. Comput. Neurosci., 16, 211–235.
Ruskin, D. N., Bergstrom, D. A., Kaneoke, Y., Patel, B. N., Twery, M. J., & Walters, J. R. (1999). Multisecond oscillations in firing rate in the basal ganglia: Robust modulation by dopamine receptor activation and anesthesia. J. Neurophysiol., 81(5), 2046–2055.
Schreiber, S., Fellous, J.-M., Tiesinga, P., & Sejnowski, T. J. (2004). Influence of ionic conductances on spike timing reliability of cortical neurons for suprathreshold rhythmic inputs. J. Neurophysiol., 91(1), 194–205.
Smith, G., & Sherman, S. (2002). Detectability of excitatory versus inhibitory drive in an integrate-and-fire-or-burst thalamocortical relay neuron model. J. Neurosci., 22, 10242–10250.
Somers, D., & Kopell, N. (1993). Rapid synchronization through fast threshold modulation. Biol. Cybern., 68, 393–407.
Somers, D., Todorov, E., Siapas, A., Toth, L., Kim, D., & Sur, M. (1998). A local circuit approach to understanding integration of long-range inputs in primary visual cortex. Cerebral Cortex, 8, 204–217.
Steriade, M., Nunez, A., & Amzica, F. (1993a). A novel slow (< 1 Hz) oscillation of neocortical neurons in vivo: Depolarizing and hyperpolarizing components. J. Neurosci., 13(8), 3252–3265.
Steriade, M., Nunez, A., & Amzica, F. (1993b). Intracellular analysis of relations between the slow (< 1 Hz) neocortical oscillation and other sleep rhythms of the electroencephalogram. J. Neurosci., 13(8), 3266–3283.
Stone, M. (2004). Varied stimulation of a relaxation oscillator. Unpublished master's thesis, University of Houston.
Stuart, G., & Häusser, M. (2001). Dendritic coincidence detection of EPSPs and action potentials. Nat. Neurosci., 4, 63–71.
Tateno, T., & Jimbo, Y. (2000). Stochastic mode-locking for a noisy integrate-and-fire oscillator. Phys. Lett. A, 271, 227–236.
Taylor, H. M., & Karlin, S. (1998). An introduction to stochastic modeling (3rd ed.). San Diego, CA: Academic Press.
Tiesinga, P. (2005). Stimulus competition by inhibitory interference. Neural Comput., 17, 2421–2453.
Tiesinga, P., Jose, J., & Sejnowski, T. (2000). Comparison of current-driven and conductance-driven neocortical model neurons with Hodgkin-Huxley voltage-gated channels. Phys. Rev. E, 62, 8413–8419.
Tiesinga, P., & Sejnowski, T. (2004). Rapid temporal modulation of synchrony by competition in cortical interneuron networks. Neural Comput., 16, 251–275.
Tiesinga, P. H. E., & Toups, J. V. (2005). The possible role of spike patterns in cortical information processing. J. Comput. Neurosci., 18(3), 275–286.
van Vreeswijk, C., & Sompolinsky, H. (1998). Chaotic balanced state in a model of cortical circuits. Neural Comput., 10, 1321–1371.
Xie, M., Othmer, H., & Watanabe, M. (1996). Resonance in excitable systems under step-function forcing. II. Subharmonic solutions and persistence. Physica D, 98, 75–110.
Received November 9, 2005; accepted August 18, 2006.
LETTER
Communicated by Emilio Salinas
Optimal Coding Predicts Attentional Modulation of Activity in Neural Systems Santiago Jaramillo [email protected]
Barak A. Pearlmutter [email protected] Hamilton Institute, NUI Maynooth, County Kildare, Ireland
Neuronal activity in response to a fixed stimulus has been shown to change as a function of attentional state, implying that the neural code also changes with attention. We propose an information-theoretic account of such modulation: that the nervous system adapts to optimally encode sensory stimuli while taking into account the changing relevance of different features. We show using computer simulation that such modulation emerges in a coding system informed about the uneven relevance of the input features. We present a simple feedforward model that learns a covert attention mechanism, given input patterns and coding fidelity requirements. After optimization, the system gains the ability to reorganize its computational resources (and coding strategy) depending on the incoming attentional signal, without the need of multiplicative interaction or explicit gating mechanisms between units. The modulation of activity for different attentional states matches that observed in a variety of selective attention experiments. This model predicts that the shape of the attentional modulation function can be strongly stimulus dependent. The general principle presented here accounts for attentional modulation of neural activity without relying on special-purpose architectural mechanisms dedicated to attention. This principle applies to different attentional goals, and its implications are relevant for all modalities in which attentional phenomena are observed.

Neural Computation 19, 1295–1312 (2007)

C 2007 Massachusetts Institute of Technology

1 Introduction

Adaptation in the nervous system occurs at several timescales, from the fast changes in spiking patterns to the long-term effects of development and learning. One of the most intriguing types of adaptation is the modification of neural coding strategies related to selective attention. The fact that attention can be covertly shifted to different aspects of a stimulus to improve their detection and discrimination (Downing, 1988) indicates that the nervous system is not just filtering out features in a fixed manner, but
is adapting its function according to some objective. This adaptation is expressed as an attentional-state dependent modulation of neural activity. Attentional modulation of activity has been characterized at different levels, from whole brain (Heinze et al., 1994; Hopfinger, Buonocore, & Mangun, 2000; Corbetta & Shulman, 2002) to single cell (Moran & Desimone, 1985; Luck, Chelazzi, Hillyard, & Desimone, 1997; Treue & Maunsell, 1999; McAdams & Maunsell, 1999). In contrast, theoretical accounts of these phenomena are still a matter of debate and would greatly benefit from models that relate these effects to general coding principles. Traditional accounts of receptive field formation assume equal relevance of all features of a stimulus. Attempts to include the adaptability necessary for dealing with uneven relevance (e.g. for attention-related phenomena) usually include mechanisms tailored to the precise phenomenology, such as gating or shifting circuitry (Olshausen, Anderson, & Van Essen, 1993; Deco & Zihl, 2001; Heinke & Humphreys, 2003). A different approach is to find high-level principles that give rise to the phenomena observed during changing attentional state. The literature provides a few examples of this approach, namely, models in which various types of uncertainty are included in a framework of optimal inference and learning (Dayan & Zemel, 1999; Dayan, Kakade, & Montague, 2000; Yu & Dayan, 2005; Rao, 2005). Our proposal here is in some sense less radical than the assumption that attention subserves optimal inference in that we account for the sensory codes in question using the optimal coding framework, a framework that has been successfully applied to a variety of nonattentional sensory coding phenomena (Atick, 1992). We hypothesize that the nervous system uses an optimal code for representing sensory signals and that fidelity requirements of the code change based on top-down information. These conditions imply shifts in the neural code and changes in response properties of single neurons that together can be regarded as a global resource allocation. Using computer simulation, we explore the efficient representation of sensory input under the assumption of limited capacity and changing nonuniform fidelity requirements. These simulations show that (1) the modulation of activity of the simulated units matches that observed in animals during selective attention tasks; (2) the magnitude of this modulation depends on the capacity of the neural circuit; (3) the behavior of a single neuron cannot be well characterized by measurements of attentional modulation of only a single sensory stimulus; (4) modulation of coding strategies does not require special mechanisms—it is possible to obtain dramatic modulation even when signals informing the system about fidelity requirements enter the system in a fashion indistinguishable from sensory signals; and (5) even a simple feedforward network can perform sophisticated and dramatic reallocation of processing resources.
Figure 1: Reallocation of resources was observed when attentional signal changed. (A) Example of reconstruction of a single input pattern and four different attentional states as indicated by the dashed circles. Error is lower for attended locations. (B) Structure of the feedforward network used in the simulations. For each unit, attentional inputs (after training) are indistinguishable from sensory inputs. (C) Mean over patterns of position-specific error for one attentional state, compared to a system with no attentional signal. The white region in the image on the right indicates lower error when the system makes use of the attentional signal.
2 Methods

2.1 Network Structure. An autoassociative network was constructed consisting of five layers connected in a feedforward fashion (see Figure 1B), following Jaramillo and Pearlmutter (2004). The number of units in each layer was 256–20–10–20–256, respectively. Each unit received inputs from all units in the previous layer, in addition to two attentional signals, displayed as a single arrow per layer in Figure 1B. This attentional input was the same for all layers, and its role depended on the cost function to be minimized, as explained below. There were no lateral connections within units in a layer. The activity in the input and output layers was represented as an image of 16×16 pixels.

2.2 Unit Model. A firing rate model was used in which the output of each unit was calculated as the weighted sum of the inputs passed through a saturating nonlinearity, as follows:

$$r_i = s\Big(\sum_j w_{ij} r_j + b_i\Big) = s\Big(\sum_{j'} w_{ij'} r_{j'} + \sum_{j''} w_{ij''} r_{j''} + b_i\Big), \tag{2.1}$$

where the sums are over all units from the previous layer (indexed by j′) and the attentional inputs (indexed by j″). The saturating function was s(x) = a tanh(bx), with a = 1.716, b = 0.667. The activity of unit i is denoted r_i, and the parameters w_{ij} correspond to the strength of the connection
from unit j to unit i. Note that each unit i also included a bias term b_i. The connection strength w_{ij} from unit j to unit i was real valued and unbounded.

2.3 Stimulus Statistics. The set of patterns used during optimization consisted of 20,000 monochromatic 16×16 pixel images. Pixel intensity values had zero mean and standard deviation σ = 1/3. The images were created by convolving (filtering) white gaussian noise images with a rotationally symmetric 2D gaussian with σ_filter = 2. Edge effects were avoided by extracting only the 16×16 center of the resulting image. These images were then scaled to have the desired variance.

2.4 Attentional Input. The attentional signal consisted of a two-element vector with elements in the range [−1, 1]. For each optimization step, this input was randomly drawn from a uniform distribution over the possible range. After optimization, this signal enters each unit in the same fashion as signals from other layers. It is only through the optimization process that its role is defined.

2.5 Optimization. The optimization process consisted of finding the set of weights and bias parameters that minimize the cost function E = ⟨E_p⟩, where the error for one input pattern and attentional state is

$$E_p = \sum_k c_k(p)\,\big(y_k(p) - d_k(p)\big)^2, \tag{2.2}$$
in which k indexes locations in the 16×16 grids holding the stimulus and its reconstruction, c_k(p) defines the importance of that particular location (analogous to the intensity of an attentional spotlight), y_k(p) is the output of the network, d_k(p) is the desired output, which in our case is the same as the input, and p represents the complete information coming into the system at one point in time—the input image as well as the top-down attentional signal. The expectation ⟨·⟩ is taken over p. The gradient was calculated using backpropagation (Rumelhart, Hinton, & Williams, 1986) of the weighted error defined in equation 2.2, and optimization used online gradient descent with a weight decay term of 10^{-6} and a learning rate η = 0.005. All weights were plastic during learning, and the attention coefficients in the penalty function formed a simple soft mask,

$$c_k(p) = \frac{1}{1 + \|k - a(p)\|^2 / m^2}, \tag{2.3}$$

with a(p) being the attentional input (in our case, a two-dimensional vector representing the center of attention) and k being a location in the plane. The
with a (p) being the attentional input (in our case, a two-dimensional vector representing the center of attention) and k being a location in the plane. The
Optimal Coding Predicts Attentional Modulation
1299
width of the attentional spotlight was set by m, which was held constant at m = 12 in our simulations. Noise was injected into the bottleneck units (before the nonlinearity) during optimization in order to limit the capacity of the system. The noise level was σNB = 0.1, except in Figures 7C and 7D, in which additional values in the range σNB ∈ [0.0125, 1.6] were also used. 2.6 Testing Performance. The performance of the system at encoding and decoding the input was measured using independently generated random patterns. For performance comparison (see Figure 1C), a network that used a flat penalty function c k (p) = 1 was also trained. The error in Figures 1 and 7 was the absolute difference between input and output pixel values. 2.7 Finding Preferred Stimuli. To calculate the stimuli that maximally drive each unit in the bottleneck layer for a particular attentional state, we first generated 106 random white gaussian images and found the activation of the unit of interest for each of these patterns (keeping the attentional state fixed). The preferred stimulus was defined as the average of these random patterns weighted by the activity produced in the unit of interest. The antipreferred stimulus was defined as the negative of the preferred stimulus. 3 Results The task of the network of Figure 1B was to encode its input (a monochromatic image) into a compressed representation and decode that representation, with minimal error. An additional input (a two-element vector denoted in Figure 1B as A) informed the system about the current attentional state. Attentional states differed in that selected regions were preferentially reconstructed with higher quality than the rest, but these regions were not explicitly represented by any signal after optimization. Each unit computed its output as the weighted sum of its inputs passed through a saturating nonlinearity. The results presented in the following sections correspond to measurements on the system after the optimization procedure has found the appropriate connection strengths. Any modulation is due only to changes in activity, not to changes in the structure or connection strengths of the network. 3.1 Reallocation of Resources Emerged Naturally. First, it was verified that some regions of the input image were in fact reconstructed better than others depending on the attentional state. An example of this uneven performance is presented in Figure 1A. Here, the input image remained fixed as the attentional input changed. The dashed circles indicate those regions for which preferential reconstruction was requested. We should keep in mind that the attentional input consisted of only two additional values (which
Figure 2: Preferred stimuli were dependent on attentional state. Gray-scale images represent the preferred stimulus for each unit in the bottleneck layer (columns) for three attentional states (rows), as indicated by the dashed circles. Higher contrast was observed on attended regions.
the network will interpret as the center of attention) and that these signals entered each layer the same way as feedforward (sensory) inputs. The regions represented by the dashed circles were not explicitly represented by any signal in the network during this simulation (only during optimization, through the coefficients c_k in equation 2.2). As expected, lower error was achieved for regions closer to the center of attention. After confirming that the system's processing was dependent on attentional state, we asked how these results compared to those from a classical model in which there is no informing signal. Figure 1C compares the mean reconstruction performance of a system that ignores the informing signal with a system attending to the bottom-left corner of the input image. The difference shows that reconstruction was better for attended regions but worse for unattended ones.

3.2 Preferred Stimuli of Bottleneck Units Were Dependent on Attentional State. Results presented in this section concern the units from the bottleneck layer. Activity from these units represents a lossy encoding of the input image. Figure 2 shows the stimulus that maximally drove each of these units at different attentional states, as indicated by the dashed circles. Results are clearly dependent on attentional state, with preferred stimuli containing sharper edges in regions closer to the center of attention.

3.3 Activity of Bottleneck Units Was Modulated by Nonsensory Signals. How was the activity in the bottleneck neurons affected by the attentional signals? Figure 3 shows maps of activity created by fixing the stimulus and moving the center of attention around the image. The intensity at each point in the map represents the activity of one unit in the bottleneck, normalized in the range of activities observed for that particular stimulus. The bars between the two rows of maps represent the range of activities achieved in each condition. The pattern representing activity modulation
Figure 3: Activity was modulated by the attentional signal. Gray-scale images represent the activity of each bottleneck unit for a fixed stimulus as the center of attention is directed to different regions of the input space. Two different stimuli (top and bottom rows) are used for comparison. Maps are scaled according to the range of activities observed for that particular stimulus. The bars between the two rows display the range of activity for each of the two conditions with respect to the absolute limits of the activity of the units.
was clearly different for each unit. Furthermore, the modulation of a unit’s activity was dependent on the stimulus. For example, for unit U1, higher changes in activity occurred when attention shifted from left to right or from top to bottom, depending on the stimulus. This effect occurred even when, in the case of U1, the activity ranges were very similar for the two conditions.
3.4 The Model Reproduced Measurements from Visual Cortex. We related these findings to experimentally observed modulation of neural activity during selective attention tasks. First, we replicated the analysis presented in Treue and Maunsell (1999), in which combinations of preferred and antipreferred stimuli were presented inside the receptive field (RF) of a cell and activity was measured as attention changed between these two regions. Figure 4A shows the response from a neuron in the medial superior temporal (MST) area. The preferred stimulus for this neuron was a pattern of dots moving in one direction, indicated by the upward-pointing arrow. The attended stimulus is indicated by the dashed ellipse. The histograms in gray show the firing rate of the cell as a function of time, and mean responses are indicated by solid horizontal lines for each condition. Overlaid, we show the response of one neuron from our model (dotted lines) under similar conditions. The stimuli consisted of combinations of the left and right halves from the preferred and antipreferred stimuli, calculated with attention directed to the center of the input space. The maximal firing rate for the model neuron was set to 50 spikes per second to obtain comparable magnitudes. The modulation of activity of the model neuron matched that of the visual neuron. In particular, when attention was shifted from preferred to nonpreferred features of the same stimulus, activity decreased dramatically.
Figure 4: Activity modulation matched experimental measurements from area MST. (A) Neuronal response to two stimuli inside the receptive field of the cell—one preferred (arrow up) and one antipreferred (arrow down). Attentional focus is indicated by a dashed ellipse. The histograms in gray show the firing rate of the cell as a function of time. Mean responses are indicated by solid horizontal lines (MST cell) and dotted lines (model unit) for each condition. (B) Scatter plot showing the response to antipreferred stimuli versus the response to preferred stimuli for each unit. Points above the diagonal indicate higher activity when attending to the preferred stimulus. (Reproduced from Treue & Maunsell, 1999.)
The scatter plot in Figure 4B shows the firing rates for neurons in areas MT and MST when attention was directed toward the preferred stimulus (y-axis) versus the firing rates obtained while attending the antipreferred stimulus (x-axis). Points above the diagonal indicate higher activity when attending the preferred stimulus. Values for all model units, indicated by stars, fall above the diagonal and within the range of experimentally observed values. For this plot, the range of the s(·) function was scaled and shifted to 0 to 100 spikes per second. Further exploration of these effects is shown in Figure 5. Responses to the four combinations of preferred and antipreferred half-stimuli are presented, first for a single unit and then for all units in the bottleneck layer. Figure 5A shows the stimuli and corresponding attention maps for unit U2. The stimuli consisted of combinations of the left and right halves from the preferred and antipreferred stimuli. Attention maps showed a clear change in the activity of the unit as attention was shifted from right to left. The simulation also showed that changes in features far from the attended region have a smaller effect on activity than changes presented at the attended location (compare the dark bars of Figure 5B). As shown in Figure 4, when attention was shifted from preferred to nonpreferred features of the same stimulus, activity decreased dramatically. In contrast, the effect of attention when both halves were preferred or antipreferred was very small. These observations were consistent for all units (see Figure 5C).
Figure 5: All model units showed similar activity modulation. (A) Combination of preferred (+) and antipreferred (−) stimuli for unit U2 (top) and attentional modulation of activity in U2 for these inputs (bottom). (B) Comparison of activity in U2 for two attentional states, as indicated by the dashed ellipses. (C) Changes in activity for each unit in the bottleneck as attention is shifted from right to left. Bars show means across units.
The model was also compared to experiments in which the responses of cells in area V4 were measured for four attentional conditions, while a bar of fixed orientation was displaced inside the receptive field of the cell (Connor, Gallant, Preddie, & Van Essen, 1996; Connor, Preddie, Gallant, &
Figure 6: Activity modulation matched experimental measurements from area V4. (A) Response of one V4 neuron to a bar stimulus placed at five different positions inside the receptive field, as indicated by a dashed circle. Attention was directed to one of the four circles outside the receptive field. (Reproduced from Connor et al., 1996). (B) Response of one model neuron as a region of a nonpreferred stimulus is replaced by the preferred stimulus in five different locations. Attention is directed to the border of the input space as indicated by the circles.
Van Essen, 1997). Figure 6A shows the response of one V4 cell. These plots contain features that are common in attentional modulation:
- The response of the cell for a fixed stimulus depends on the attended location.
- The stimulus that elicits the strongest response depends on attention.
- The cell's response when attention is fixed depends on the stimulus position.
- This dependence on stimulus position differs between attentional states: for instance, the left and right panels in Figure 6A display different trends as the stimulus position is changed.
Figure 6B shows the results from a model neuron under similar conditions. In this case, the input image is composed of a nonpreferred stimulus with a small region belonging to the preferred stimulus at different positions, indicated by the numbers 1 to 5. Attention is directed to the borders of the image, as indicated by the circles. The model exhibited all features observed in the experimental data.

Two shift indexes were also calculated for each neuron, after Connor et al. (1997). The fractional shift measures the proportion of total response that shifted from one side of the RF to the other when attention is shifted. This index is bounded between −1 and 1, with positive values indicating shifts in the direction of attention. Connor et al. (1997) report mean values of 0.16 or 0.26, depending on whether five or seven bar positions were used. All our model neurons had a positive fractional shift, with a mean value of 0.22. The second index, the peak shift, measures the distance between the positions generating maximum responses. Mean reported experimental values were 10% or 25% of the RF size, depending on whether five or seven bar positions were used. Our model neurons had nonnegative peak shifts, with a mean value of 25% of the total position variation.
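For concreteness, a minimal sketch of how such a fractional shift index could be computed from a pair of position tuning curves is given below. The formalization (half the change in the right-left response asymmetry) is one plausible reading of the verbal definition above, not necessarily the exact formula of Connor et al. (1997).

```python
import numpy as np

def fractional_shift(resp_attend_left, resp_attend_right):
    """Plausible formalization of the fractional shift index: the change in
    right-left response asymmetry (each normalized by total response) when
    attention moves from left to right, halved so the index is bounded
    between -1 and 1. Positive values indicate response mass shifting in
    the direction of attention."""
    def asymmetry(r):
        r = np.asarray(r, dtype=float)
        half = len(r) // 2            # middle position ignored if len is odd
        return (r[-half:].sum() - r[:half].sum()) / r.sum()
    return 0.5 * (asymmetry(resp_attend_right) - asymmetry(resp_attend_left))

# Example with made-up responses at five bar positions (left to right):
print(fractional_shift([10, 8, 6, 4, 2], [2, 4, 6, 8, 10]))  # 0.4
```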
Figure 7: Magnitude of attentional modulation depends on system capacity. (A) Modulation of activity in response to random stimuli as attention was shifted from left to right, for networks with different numbers of units in the bottleneck. Open gray circles represent the mean modulation for each unit, and solid circles show the mean over all units, with bars indicating the standard error. (B) Mean reconstruction error for the attended and unattended sides. (C, D) Same as A and B but using 10 bottleneck units and instead varying the amount of noise injected into the bottleneck units during training.
3.5 Magnitude of Attentional Modulation Depended on Capacity. The mean modulation of activity of bottleneck units in the model as attention was shifted from left to right is shown in Figure 7, with Figure 7A showing how this varies with the number of bottleneck units and Figure 7C showing how it varies with the amount of injected noise. Gray open circles correspond to the absolute value of the modulation
averaged over 1000 random test patterns, for each unit in the bottleneck. The solid circles show the mean across units, with error bars indicating the standard error. Figures 7B and 7D show the corresponding reconstruction error for each set of parameters, plotting the mean errors for the attended versus unattended halves of the sensory input. Note that noise was added to the bottleneck units only during optimization, and for this reason changes in the modulation magnitude cannot be attributed to noise in the measurements.

The magnitude of the modulation decreased as the number of bottleneck units increased. As expected, errors in the ignored region were higher than those in the attended region, but decreased as the number of bottleneck units was increased, gradually closing the gap. As the noise level in the bottleneck units was increased, the magnitude of both the modulation and the reconstruction error increased. The error for the attended region increased faster than for the unattended region. These two plots display an abrupt transition as the noise level becomes very high: the trend of the activity modulation, as well as the error values, changes, with the error values for the attended and unattended regions becoming equal.

4 Discussion

The results show that when a neural system is optimized to encode its inputs with changing fidelity requirements, the code developed exhibits modulatory phenomena of the sort that has been observed in the visual systems of animals engaged in selective attention tasks. These modulatory phenomena, which constitute a reallocation of resources, emerge even when the encoder is a very simple homogeneous feedforward neural model.

4.1 Limitations of the Simulation. The simulations incorporated a number of simplifications, most of which were made for ease of exposition or computational efficiency:

1. During optimization, the penalty function was set to a single attentional spotlight. This is not a requirement of the general model, and other functions could be used to define the performance demands. For example, nonspatial goals could be incorporated by requiring higher-fidelity reconstruction of some particular feature regardless of its spatial location.

2. To allow convenient visual display, simulated stimuli were visual patterns. Efficient representations of stimuli and attentional modulation phenomena are present in many (if not all) modalities, and the model explored here could be applied equally well to nonvisual and nonspatial modalities.

3. The simulation allowed synapses from a single neuron to be both excitatory and inhibitory. This is not a common feature in biological neural
systems, where a combination of excitatory and inhibitory neurons could achieve the same functionality.

4. The simulation is limited to explorations inside the receptive field of a cell. Furthermore, as a consequence of the simulation's simplicity, it cannot make detailed predictions about the form of the RFs. Extended and more complicated models would be necessary to predict attentional modulation of realistic RFs. For instance, incorporating sparsity constraints into the optimization procedure should give rise to more realistic localized RFs (Olshausen & Field, 1996), which would allow the prediction of attentional modulation of RF shape and location.

5. The capacity in the bottleneck was restricted by limiting the number of neurons, with each neuron's capacity limited by an intrinsic noise term. Given that the primary sensory cortex typically has greater than 10³× more neurons than the sensory sheet it represents (e.g., in humans, 10⁶ retinal ganglion cells versus over 10⁹ V1 pyramidal neurons), one is led to question the biological relevance of this capacity restriction. It is important to note that this is not an issue with the model of attention but rather with the optimal coding framework it builds on. One potential resolution of this issue is that energetic constraints may drive the nervous system to use efficient codes even in areas where there is an abundance of neurons, and even where their principal role is computation rather than coding.

6. The data displayed were collected after the optimization procedure had been run and the connection strengths fixed at their optimal values. In nature, we might expect this type of adaptation to be continuous rather than being confined to a period of training.

7. To allow standard effective learning algorithms to be applied, the simulation was at the level of firing rates rather than spikes, and short-term plasticity was not included.

8. The goal of the network in the simulations was to find an efficient representation for the stimulus. Biological networks likely transform stimuli not to merely compressed representations but to representations that serve to inform future actions.

The limitations presented above are mostly specific to the current simulation rather than to the general model proposed in this letter or to the predictions it makes. Stimulus-driven attentional phenomena, and the undefined origin of the attentional signal, are some of the limitations that elaborations of the model might address. Despite these simplifications, the simulation exhibited many phenomena observed experimentally and makes novel concrete, testable predictions about the modulation of neural activity by attentional processes. The fact that the model, even with these simplifications, accounts for a broad range of previous measurements and makes novel robust predictions is, to our mind, a strength rather than a weakness.
4.2 Consistency with Electrophysiological Data. The predictions of this model are consistent with observations from electrophysiological recordings in which the output of neurons in the visual system is modulated by the attentional state of the subject. This modulation suggests that the characterization of neural response should go beyond the traditional stimulus-response dependency, usually represented as a receptive field or tuning curve.

Treue and Maunsell (1999) showed that when combining preferred and nonpreferred moving stimuli inside the receptive field of a cell in monkey area MST, the activity of the cell was higher when attention was directed to the region containing the preferred direction of motion. These results are exhibited by the model (see Figures 4 and 5).

Results from the model are also consistent with those of Connor et al. (1997). Experimental measurements of the modulation of neuronal response for different sets of stimuli and attentional conditions qualitatively matched those from model neurons. Furthermore, when setting the maximum firing rate parameter to biologically plausible values, the model produced modulations that lie in the ranges observed experimentally. We would make one observation regarding the shift indices of Connor et al. (1997): it may be misleading to interpret the partial shift values merely as a tendency to have increased activity when a bar is presented at the attended edge of the RF, as compared to the activity when a bar is presented at the opposite edge of the RF, a property exhibited by the model of Rao (2005). For instance, even when all units show positive indices, some may show higher activity for a bar in position 1 compared to position 5 in both (left and right) attentional conditions. What the index measures is the proportion of activity that shifts when attention changes.

Implicit in the measurements of Treue and Maunsell (1999) and Connor et al. (1997) is the fact that the attentional modulation is dependent on the stimulus; for example, attention moving from left to right inside the RF of a cell will have an increasing or decreasing effect on the neuron's firing rate depending on the relative location of preferred and nonpreferred features. Examples of this dependency in our simulation are shown in Figure 3. This phenomenon implies that attentional modulation cannot be fully characterized using a single stimulus.

Treue and Maunsell (1999) also observed that directing attention toward the RF of a cell can both enhance and reduce responses; for example, when the animal was attending to the antipreferred of two directions in the RF, the response was below the one evoked by the same stimulation when attention was directed outside the RF. While this phenomenon could not be tested directly using our simulation, our results suggest a clear reduction in activity when attention is directed to the antipreferred region compared to other regions (see Figure 5). This is not obvious from models in which attention simply amplifies activity for attended locations. Furthermore, the model predicts that stimulus changes in the attended location generally
produce a higher change in activity than stimulus changes in unattended locations (see Figure 5B).

Results from our model are consistent with other experimental results not analyzed in detail in this letter. For instance, the response of cells in areas V2 and V4 is attenuated when a nonpreferred feature is placed inside the receptive field, when compared to a preferred feature alone. If attention is directed to the preferred feature, activity is restored (Reynolds, Chelazzi, & Desimone, 1999). Our model exhibits the same phenomenon, as suggested by Figure 5. Another example relates to the type of modulation that attentional signals produce on neural activity. Preliminary results suggest that this model can display apparent multiplicative effects without explicit multiplicative interactions. A more complex model that exhibits localized RFs, as discussed above, would be necessary to obtain conditions comparable to those of the electrophysiological recordings discussed in McAdams and Maunsell (1999).

4.3 Comparison to Other Modeling Approaches. Selective attention researchers have suggested a wide variety of models with different predictive power, most of them presenting accounts of behavioral phenomena rather than explicit predictions on the modulation of neural activity (Mozer & Sitton, 1998; Deco & Zihl, 2001; Heinke & Humphreys, 2003).

An influential early modeling study that made concrete predictions regarding attentional changes of neural activity was developed in Olshausen et al. (1993). In that study, control neurons dynamically modified the synaptic strengths of the connections in a model network of the ventral visual pathway. The network selectively routed information into higher cortical areas, producing invariant representations of visual objects. This model predicted changes in the position and size of receptive fields as attention was shifted or rescaled. These phenomena are partially supported by results from Connor et al. (1997). Their model also qualitatively matches modulation effects observed by Moran and Desimone (1985) with stimuli inside and outside the classical receptive field of V4 neurons. In comparison to our model, in which attentional modulation emerges from the nonlinearity of the units and the general objective of the network, their model obtained modulatory effects by explicitly modulating the synaptic strengths of the connections. Their model also used a spatially localized connectivity pattern, which gave rise to localized RFs, thus allowing for comparison of attention directed inside versus outside the RF.

More recent studies incorporate principles of statistical inference into models of attention. For instance, Yu and Dayan (2005) and Rao (2005) present networks that implement Bayesian integration of sensory inputs and priors and replicate behavioral as well as electrophysiological measurements. In these studies, spatial attention is equated to prior information on the location of the features of interest. The Bayesian inference approach to modeling attention should be regarded as complementary to that taken
here. The transformations performed by the model units in this work are defined by the solution to an optimal coding problem, and under certain conditions, these computations would be equivalent to those in inference-based networks. In fact, coding, statistical modeling of distributions, and inference from partial data are, mathematically speaking, very closely related.

4.4 Mechanisms That Subserve Attentional Modulation. Local gain modulation, a common tool in mechanistic models of attentional modulation, is not necessarily in opposition to our approach. What we suggest is that events traditionally described as local gain modulation subserve a global reallocation of resources, which is the strategy the nervous system has evolved for approaching optimal performance given its constraints.

The mechanisms of attentional modulation have traditionally been posited to be changes in synaptic efficacy or modulation of presynaptic terminals (Olshausen et al., 1993). Here we show that the overall effects of this modulation can appear in a network of saturating units without any changes in synaptic strength, being controlled only by the activity itself. This is consistent with the notion that a network optimized for efficient coding under shifting fidelity requirements will integrate whatever information is available about the current requirements by pressing into service any available mechanism. In other words, it is possible that there is no way to anatomically distinguish neuronal mechanisms that support coding from mechanisms that support shifts in coding. While it could be argued that evidence of multiple sites of integration in cortical neurons (Larkum, Zhu, & Sakmann, 1999) is inconsistent with this idea, the results above show that attentional modulation of the neural code is not sufficient to explain the functional role of inputs at different cortical layers, instead suggesting that these mechanisms may be more important for learning or other processes.

4.5 Main Predictions. The fact that our simulation shows modulation effects consistent with physiological recordings suggests that we should not necessarily expect explicit gating circuitry in neural systems responsible for attentional phenomena. Furthermore, the informing signals do not have to explicitly represent the attentional space; that is, a spatial attention effect is not necessarily mediated by a topographic input.

Our model strongly predicts that a neuron's preferred stimulus will depend on attentional state. Moreover, the behavior of a single neuron in this model cannot be well characterized by measurements of attentional modulation of only a single sensory stimulus. This prediction is consistent with experimental results discussed above, but it should be possible to test it more explicitly using currently available experimental techniques.
The model also suggests that stronger modulations are expected when the complexity of the input grows relative to the capacity of the system.

5 Conclusion

The model presented here accounts for attentional modulation of neural response in a framework that includes both attention and receptive field formation, and as a consequence of an underlying normative principle (optimal coding) rather than by tuning a complex special-purpose architecture. The model shows that reallocation of resources can emerge even in a simple feedforward network and challenges the traditional characterization of neural activity. These results are consistent with the notion that attentional modulation is not, at its root, due to specific local architectural features but is rather a ubiquitous phenomenon to be expected in any system with shifting fidelity requirements.

Acknowledgments

This work was supported by U.S. NSF PCASE 97-02-311, the MIND Institute, an equipment grant from Intel, a gift from the NEC Research Institute, Science Foundation Ireland grant 00/PI.1/C067, and the Higher Education Authority of Ireland (An tÚdarás Um Ard-Oideachas).

References

Atick, J. J. (1992). Could information theory provide an ecological theory of sensory processing? Network: Computation in Neural Systems, 3(2), 213–251.

Connor, C. E., Gallant, J. L., Preddie, D. C., & Van Essen, D. C. (1996). Responses in area V4 depend on the spatial relationship between stimulus and attention. J. Neurophysiol., 75(3), 1306–1308.

Connor, C. E., Preddie, D. C., Gallant, J. L., & Van Essen, D. C. (1997). Spatial attention effects in macaque area V4. J. Neurosci., 17(9), 3201–3214.

Corbetta, M., & Shulman, G. L. (2002). Control of goal-directed and stimulus-driven attention in the brain. Nat. Rev. Neurosci., 3(3), 201–215.

Dayan, P., Kakade, S., & Montague, P. R. (2000). Learning and selective attention. Nature Neuroscience, 3, 1218–1223.

Dayan, P., & Zemel, R. (1999). Statistical models and sensory attention. In D. Willshaw & A. Murray (Eds.), Proceedings of the Ninth International Conference on Artificial Neural Networks (Vol. 2, pp. 1017–1022). Piscataway, NJ: IEEE.

Deco, G., & Zihl, J. (2001). A neurodynamical model of visual attention: Feedback enhancement of spatial resolution in a hierarchical system. Computational Neuroscience, 10(3), 231–253.

Downing, C. (1988). Expectancy and visual-spatial attention: Effects on perceptual quality. J. Exp. Psychol. Hum. Percept. Perform., 14(2), 188–202.
Heinke, D., & Humphreys, G. W. (2003). Attention, spatial representation, and visual neglect: Simulating emergent attention and spatial memory in the selective attention for identification model (SAIM). Psychol. Rev., 110(1), 29–87.

Heinze, H. J., Luck, S. J., Münte, T. F., Gös, A., Mangun, G. R., & Hillyard, S. A. (1994). Attention to adjacent and separate positions in space: An electrophysiological analysis. Percept. Psychophys., 56(1), 42–52.

Hopfinger, J. B., Buonocore, M. H., & Mangun, G. R. (2000). The neural mechanisms of top-down attentional control. Nature Neuroscience, 3(3), 284–291.

Jaramillo, S., & Pearlmutter, B. A. (2004). A normative model of attention: Receptive field modulation. Neurocomputing, 58–60, 613–618.

Larkum, M. E., Zhu, J. J., & Sakmann, B. (1999). A new cellular mechanism for coupling inputs arriving at different cortical layers. Nature, 398(6725), 338–341.

Luck, S. J., Chelazzi, L., Hillyard, S. A., & Desimone, R. (1997). Neural mechanisms of spatial selective attention in areas V1, V2, and V4 of macaque visual cortex. J. Neurophysiol., 77(1), 24–42.

McAdams, C. J., & Maunsell, J. H. R. (1999). Effects of attention on orientation-tuning functions of single neurons in macaque cortical area V4. J. Neurosci., 19(1), 431–441.

Moran, J., & Desimone, R. (1985). Selective attention gates visual processing in the extrastriate cortex. Science, 229(4715), 782–784.

Mozer, M. C., & Sitton, M. (1998). Computational modeling of spatial attention. In H. Pashler (Ed.), Attention (pp. 341–393). New York: Psychology Press.

Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381, 607–609.

Olshausen, B. A., Anderson, C. H., & Van Essen, D. C. (1993). A neurobiological model of visual attention and invariant pattern recognition based on dynamic routing of information. J. Neurosci., 13(11), 4700–4719.

Rao, R. (2005). Bayesian inference and attentional modulation in the visual cortex. Neuroreport, 16(16), 1843–1848.

Reynolds, J. H., Chelazzi, L., & Desimone, R. (1999). Competitive mechanisms subserve attention in macaque areas V2 and V4. J. Neurosci., 19(5), 1736–1753.

Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323, 533–536.

Treue, S., & Maunsell, J. H. R. (1999). Effects of attention on the processing of motion in macaque middle temporal and medial superior temporal visual cortical areas. J. Neurosci., 19(17), 7591–7602.

Yu, A., & Dayan, P. (2005). Uncertainty, neuromodulation, and attention. Neuron, 46(4), 681–692.
Received December 7, 2005; accepted November 27, 2006.
LETTER
Communicated by Raj Rao
Efficient Computation Based on Stochastic Spikes

Udo Ernst
[email protected]

David Rotermund
[email protected]

Klaus Pawelzik
[email protected]
Institute for Theoretical Neurophysics, Otto-Hahn-Allee, 28359 Bremen, Germany
The speed and reliability of mammalian perception indicate that cortical computations can rely on very few action potentials per involved neuron. Together with the stochasticity of single-spike events in cortex, this appears to imply that large populations of redundant neurons are needed for rapid computations with action potentials. Here we demonstrate that very fast and precise computations can be realized also in small networks of stochastically spiking neurons. We present a generative network model for which we derive biologically plausible algorithms that perform spike-by-spike updates of the neuron's internal states and adaptation of its synaptic weights from maximizing the likelihood of the observed spike patterns. Paradigmatic computational tasks demonstrate the online performance and learning efficiency of our framework. The potential relevance of our approach as a model for cortical computation is discussed.

1 Introduction

Experimental work has demonstrated that perception can be a very fast process. As an example, it has been shown that human brains take only 150 ms to decide if a natural image contains an animal (Thorpe, Fize, & Marlot, 1996). From typical firing frequencies of cortical neurons (about 15–45 Hz) and from the number of different processing stages involved in this task (about 10), it was concluded that about one to three spikes per neuron and per processing stage are sufficient to robustly achieve this performance. Another example of the exceptional speed of our brain has recently been found in contour integration. In these experiments, macaque monkeys were trained to correctly detect aligned edge elements within a field of randomly oriented distracter edges from presentations lasting 30 to 50 ms before a mask appeared (Mandon & Kreiter, 2005). Even if contour integration is performed locally within the primary visual cortex, this computation can rely on only very few spikes per neuron. In masking
experiments, it has been shown that grouping processes involved in Gestalt perception are capable of binding features that were presented for only 20 to 30 ms (Herzog & Fahle, 2002). In all these examples, the nervous system depends on spikes as signals that appear to be emitted stochastically and with a high degree of variability (Shadlen & Newsome, 1998). These observations raise the question of which neuronal algorithms might underlie the computational processes that enable fast perception and recognition.

To explain the high speed of categorization in humans, a rank-order code was proposed (Thorpe, Delorme, & van Rullen, 2001; van Rullen & Thorpe, 2001), which in principle allows for a coding of natural images using only one spike per neuron. This approach assumes that the latency of spikes from the retinal ganglion cells is inversely proportional to the luminance of an image patch (Berry, Warland, & Meister, 1997). While this method displays robustness against static pixel noise (Delorme & Thorpe, 2001), it may fail when the spiking process is not strictly deterministic. In addition, real-life situations require the brain to cope with nonstationary inputs that have no well-defined starting and ending time, which calls into question whether a rank-order code is available in the brain.¹

¹ For the early visual system, saccades or input from the reticular formation could provide a reference signal for latency coding. However, it is doubtful whether such a signal can be found in higher brain areas.

Alternative frameworks for explaining the speed of cortical computation explicitly use stochastic spikes for transmitting information. The corresponding investigations identified several coding principles allowing for a high velocity of signal transmission. Examples include massive population rate codes with and without a balance of excitation and inhibition (Panzeri, Treves, Schultz, & Rolls, 1999; Gerstner & Kistler, 2002; Fourcaud & Brunel, 2002) or codes where the signal is coded into the variance of inputs onto a large population of neurons (Silberberg, Bethge, Markram, Pawelzik, & Tsodyks, 2004). In addition, the shape of neuronal tuning curves might be optimized to facilitate fast information coding. Depending on the decoding time available, either binary tuning curves (for short times) or gradual tuning curves (for long times) were found to be optimal (Bethge, Rotermund, & Pawelzik, 2003a, 2003b).

These approaches provide no conclusive explanation of fast perception, for two reasons. First, the work on population coding largely ignores the problem of processing information with respect to a certain task. Second, it is doubtful whether the massive numbers of neurons required for accurate information transmission are available in the brain. Therefore, we focused our research on a plausible neuronal online algorithm for small networks, which might explain fast and precise computations despite a high degree of stochasticity in the spiking process.

Generative models represent a framework that is well suited for describing probabilistic computation. In technical contexts, generative models
have been used successfully, for example, for deconvolution of noisy, blurred images (Richardson, 1972; Lucy, 1974). Since information processing in the brain, too, seems to be stochastic in nature, generative models are also promising candidates for modeling neural computation. In many practical applications and, because firing rates are positive, possibly also in the brain, generative models are subject to nonnegativity constraints on their parameters and variables. The subcategory of update algorithms for these constrained models was termed nonnegative matrix factorization (NNMF) by Lee and Seung (1999). The objective of NNMF is to iteratively factorize scenes (images) V_µ into a linear additive superposition H_µ of elementary features W such that V_µ ≈ W H_µ. For this well-defined problem, multiplicative update and learning algorithms have been derived and analyzed for different noise models and various further constraints on the parameters W and the internal representation H_µ (Lee & Seung, 2000).

Generative models can be seen as algorithmic realizations of a principle first stated by Helmholtz, which hypothesizes that the brain attains internal states that are maximally predictive of the sensory inputs. While this idea has often been used in phenomenological models to explain psychophysical and neurobiological evidence, it is tempting to apply it also to neural subsystems, in the sense that a network should evolve into a state that is the best explanation of all its inputs. This perspective on neural systems might appear rather philosophical, but it turns out to be a very useful framework for understanding neuronal computation. Generative models also cover deterministic computations by being able to predict the missing part of a typical but incomplete input pattern from its internal representation (for a schematic example, see Figure 1). In the general nondeterministic case, generative models are obviously related to Bayesian estimation. If input signals originate from different sensory modalities, a generative model can perform optimal multimodal integration, and the internal states will then represent optimal Bayesian estimates of underlying features (Deneve, Latham, & Pouget, 2001; see Figure 1).

Recent experimental evidence demonstrates that the brain can indeed adopt a Bayesian strategy when integrating two sources of information. In one example, tactile and visual information about the width of a bar is integrated in an optimal manner (Ernst & Banks, 2002). In another example, uncertain information about the position of a virtual hand is evaluated, incorporating prior knowledge about a random shift in the hand position (Körding & Wolpert, 2004). In addition, motion processing in area MT has been successfully described by a Bayesian approach, replicating the neurons' firing properties for various moving random dot patterns (Koechlin, Anton, & Burnod, 1999). Taken together, generative models provide a very general and plausible framework for investigations of neural computation.
Figure 1: Example schema of the general framework investigated in this letter. A scene V composed as a superposition of different features or objects is coded into stochastic spike trains S by the sensory system. In this example, the scene comprises a house, a tree, and a horse, together with their spoken labels. With each incoming spike at one input channel s, the model updates the internal variables h(i) of the hidden nodes i. Normally the generative network uses the likelihoods p(s|i) to evolve its internal representation H toward the best explanation, which maximizes the likelihood of the observation S. In our example, the network seeks the optimal, combined explanation of the spikes from the audiovisual input stream (black arrows). An alternative application of a generative model is to infer functional dependencies in its input. In every natural environment, certain features have a high likelihood of co-occurrence because they describe two different aspects of a single object. In this example, the impinging signals can be divided into two parts, S_V and S_A, which represent the input and output arguments of a function S_A = f(S_V). In our example, the function f describes meaningful combinations of visual and auditory input. Presenting only the visual part S_V to an appropriate generative network will lead to an explanation H* (gray arrows to the right). This internal representation H* in turn can generate a sharp distribution over the function values, centered at the correct auditory counterpart S_A to the input S_V. In this sense, the model computes the function f, mapping the correct spoken label to each of the visual objects (gray arrows to the left).
Nevertheless, it is an open question whether it is possible to implement a biologically plausible generative network able to cope with the high stochasticity of spike trains while at the same time being as fast as possible in integrating the incoming information into a suitable internal representation of the external world.

In this letter, we investigate a generative network designed to solve a typical factorization problem with nonnegativity constraints. In contrast to previous work using noise-free analog input patterns (Lee & Seung, 1999), our network receives a stochastic sequence of input spikes from a scene V_µ. We derive a novel algorithm that updates the internal representation using only a single spike per iteration step. It turns out to be surprisingly simple and efficient and can be complemented by an online learning
algorithm for the model parameters that can be interpreted as a Hebbian rule with weight decay. Furthermore, we modify the basic framework of a generative model for implementing deterministic computations of arbitrary functions, which we show to be closely linked to optimal multimodal integration. The dynamics and performance of this approach are demonstrated by applying it to two illustrative cases: a network trained with Boolean functions demonstrates the learnability and errorless performance for arbitrarily complex functions, and a small network for handwritten digit recognition demonstrates that less than one stochastic spike per input neuron can suffice to achieve impressive performance in a visual task, exceeding the nearest-neighbor classifier (explained in appendix D). We further use the Boolean parity function to demonstrate that suitable combinations of our basic model can perform efficient spike-based computation in hierarchical networks. Taken together, our framework provides a novel approach for explaining fast perception with stochastic spikes in a generative network model of the brain.

This letter is organized as follows. In section 2, we introduce our model, together with a definition of the dynamics of its stochastic sensory input, and derive and discuss the algorithms for rapid update of internal state variables and model parameters. In section 3, we present the results from our simulations of the two paradigmatic cases and the extension of the framework to hierarchical networks. This contribution concludes (section 4) with a discussion of different probabilistic interpretations of the model and an evaluation of the biological plausibility of our model within the context of realistic neuronal dynamics and synaptic learning. Mathematical details are described in the appendixes.

2 A Spike-Based Generative Model

Every natural environment composed of various objects leads to an abundance of spikes that networks in the brain receive from the sensors or other brain regions. A generative network should infer (or recognize) typical objects or features that underlie (i.e., generate or predict) an observed spike train (see Figure 1). Hereby, the brain must combine prior knowledge about typical features and their frequency of occurrence, as well as previously observed signals and input from other networks. This prior knowledge has to be acquired during earlier learning phases, where typical correlations in the stream of information from all sources should be extracted in order to increase recognition speed and performance in a later task.

2.1 Basic Model. We consider a structurally simple realization of a generative model as a network consisting of a layer of input nodes indexed by s = 1, …, S connected via weights w_{s,i} to a layer of hidden nodes indexed by i = 1, …, H. The hidden nodes hold the model's internal states h_{µ,i} (see Figure 1). The connection weights represent information about the input
statistics when typical features i are part of a scene µ. With the framework chosen in Lee and Seung (1999), the model is supposed to explain an observed scene V_µ² as a linear, nonnegative superposition of the features W via

v_{\mu,s} \approx \sum_{i=1}^{H} w_{s,i}\, h_{\mu,i}.   (2.1)

² The following notation will be used throughout this letter. Vectors or matrices X are in bold, and their single components x_i are in italics. Through braces, components may be combined to a vector or matrix, as X = \{x_i\}_{i=1,\dots,N} = \{x_i\}. Examples are W = \{w_{s,i}\}_{s=1,\dots,S;\, i=1,\dots,H} = \{w_{s,i}\} and H = \{h_i\}_{i=1,\dots,H} = \{h_i\}. Brackets select individual entries from a vector or matrix via [X]_i = x_i.
At first glance, this approach is limited because of its linearity constraint on the decomposition of an observation. Nevertheless, such a framework has been successfully applied to many problems in natural image processing with analog input patterns (Hoyer, 2004; Olshausen & Field, 1996; Lee & Seung, 1999), and it allows comparing our novel spike-based stochastic model directly to known analog deterministic algorithms. We demonstrate that this approach is sufficiently complex to allow computations of arbitrary deterministic functions with few stochastic neurons in the brain. Furthermore, the mathematical derivation of spike-based information processing can also be applied to nonlinear generative networks using similar techniques.

2.2 From Poisson to Bernoulli Processes. Because of the high degree of stochasticity of spike events, networks in the brain receive information about a scene µ through stochastic spike trains impinging on the input nodes s. In our model, we assume that spike trains are generated with rates R_µ ∝ V_µ from independent Poissonian point processes. In this case, the numbers of spikes counted in each channel within a given time interval are sufficient statistics: they provide the maximal amount of information about V_µ available from spike observation. For setting up the model, we reformulate the rates r_{µ,s} in terms of the probability \tilde{p}_\mu(s) of receiving the respective next spike in channel s. A straightforward calculation yields

\tilde{p}_\mu(s) = r_{\mu,s} \Big/ \sum_{s'=1}^{S} r_{\mu,s'} = v_{\mu,s} \Big/ \sum_{s'=1}^{S} v_{\mu,s'}.   (2.2)
The representation of the scene µ through these event probabilities has the convenient properties that it is invariant with respect to the total rate
and ignores the true timing of the spikes. It changes the description of the spike statistics by Poissonian point processes to the more tractable Bernoulli processes and identifies each spike event with the index of the channel in which the spike occurred. These properties will strongly simplify the subsequent construction of our model. Instead of real time, we now use the running number of spike events, which we here denote by t. Note that this spike-by-spike clocking implies that real time is on average proportional to the mean of t / \sum_{s=1}^{S} r_{\mu,s}. Thus, global knowledge about the total input rate allows us to estimate the real time elapsed during t events but is not necessary for deriving the following algorithms.

2.3 From Deterministic to Probabilistic Decomposition. Similar to the transformation of v_{µ,s} into \tilde{p}_\mu(s), the weights w_{s,i} are proportional to the probability p(s|i) of an input spike in channel s given that the scene µ is composed of the single feature i:

p(s|i) = w_{s,i} \Big/ \sum_{s'=1}^{S} w_{s',i}.   (2.3)
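To make the reduction of section 2.2 concrete, the following minimal sketch (Python/NumPy, with made-up intensities) draws a spike sequence directly from the normalized event probabilities of equation 2.2; the total rate has dropped out, and each event is just a channel index.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical scene: nonnegative intensities v_mu for S = 4 input channels.
v_mu = np.array([3.0, 1.0, 0.5, 0.5])

# Eq. 2.2: event probabilities are the normalized intensities.
p_tilde = v_mu / v_mu.sum()

# Bernoulli (categorical) process: event t is the index s of the channel
# in which the t-th spike occurred; real spike timing is discarded.
T = 10
spikes = rng.choice(len(v_mu), size=T, p=p_tilde)
print(spikes)  # e.g., a sequence dominated by channel 0
```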
Reformulating the input scene V_µ in terms of the (normalized) firing probabilities \tilde{p}_\mu(s) as described in section 2.2 transforms the factorization problem, equation 2.1, into the expression

\tilde{p}_\mu(s) \approx \frac{\sum_{i=1}^{H} p(s|i)\, h_{\mu,i}}{\sum_{s'=1}^{S} \sum_{i'=1}^{H} p(s'|i')\, h_{\mu,i'}} = \sum_{i=1}^{H} p(s|i)\, \frac{h_{\mu,i}}{\sum_{i'=1}^{H} h_{\mu,i'}}.

It is obvious that the simple variable transformation

h_\mu(i) = h_{\mu,i} \Big/ \sum_{i'=1}^{H} h_{\mu,i'}   (2.4)

reduces this problem to

\tilde{p}_\mu(s) \approx \sum_{i=1}^{H} p(s|i)\, h_\mu(i) = p_\mu(s),   (2.5)
with the nonnegativity and normalization constraints

p(s|i) > 0, \quad h_\mu(i) > 0, \quad \sum_{s=1}^{S} p(s|i) = 1, \quad \sum_{i=1}^{H} h_\mu(i) = 1.   (2.6)
The new variables h_\mu(i), denoting the normalized superposition coefficients, have a corresponding stochastic interpretation. To explain the next action
potential arriving at one input node s, the generative model assumes a doubly stochastic process underlying spike generation. First, a specific feature or cause i is drawn from the probability distribution h_µ(i). In a second step, a spike is drawn from the corresponding conditional probability distribution p(s|i) and observed in channel s. While at first glance this interpretation seems rather academic, it is a natural description of the physical process by which a visual scene composed of superimposed features creates photons impinging on retinal ganglion cells. While we interpret the conditional probabilities p(s|i) to be related to synaptic weights connecting input neuron s with hidden neuron i, the assumption of a continuous internal state variable being present in each neuron goes somewhat beyond usual network models, in which only inputs and activities that are equivalent to output signals are allowed to represent the network's state. Since real neurons are spatially extended objects with large numbers of internal state variables, we believe it is well justified to hypothesize that some of them parameterize neuronal excitability rather independently of the actual activity. The characteristic dynamics of internal state variables derived in the next section may then serve as specific predictions of our model (see also section 4).

2.4 Estimation and Learning Spike by Spike. Bayesian estimation implies that prior information about the internal state should be multiplied with the conditional probability of observing the actual external input, given the actual internal state. Therefore, generative models with iterative Bayesian algorithms appear to be suitable candidates for a comprehensive, abstract model of the function and dynamics of networks in the brain (Mumford, 2002; Lee & Mumford, 2003; Rao, 2004). In the case of perception subjected to time constraints, prior knowledge about the causes underlying the previous sensory input has to be integrated online as efficiently as possible while the sequence of observations of the spikes arrives at the different input channels. As we already pointed out, previous models largely neglect this realistic condition by using analog, noise-free input patterns in each iteration step instead of only a single spike.

We begin to derive update algorithms for the internal states and synaptic weights by considering stochastic ensembles of exactly T incoming spikes for a scene µ, which can be written as temporally ordered sequences of the input channels at which these spikes arrive:

S_\mu^T = \{ s_\mu^t \}_{t=1,\dots,T}.   (2.7)
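Such sequences can be generated directly from the doubly stochastic process described in section 2.3. A minimal sketch (Python; the shapes of h and p are illustrative assumptions):

```python
import numpy as np

def sample_spike_sequence(h, p, T, rng=None):
    """Draw S^T = (s^1, ..., s^T) from the generative model: for each event,
    first a cause i ~ h(i), then a spike channel s ~ p(s|i).

    h : (H,) internal state, nonnegative, sums to one
    p : (S, H) conditional probabilities p(s|i); each column sums to one
    """
    rng = rng or np.random.default_rng()
    causes = rng.choice(len(h), size=T, p=h)          # i^t ~ h(i)
    return np.array([rng.choice(p.shape[0], p=p[:, i]) for i in causes])
```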
We now seek an iterative algorithm that with every spike encountered tends to maximize the likelihood

P(\{S_\mu^T\} \mid \{h_\mu(i)\}, \{p(s|i)\})   (2.8)
of observing the ensemble of sequences \{S_\mu^T\}_{\mu=1,\dots,M} over the space of all internal states \{h_\mu(i)\} and model parameters \{p(s|i)\}. With

\hat{p}_\mu(s) = \frac{1}{T} \sum_{t=1}^{T} \delta_{s, s_\mu^t}   (2.9)
denoting the relative number of spikes in channel s in one observation sequence µ, this likelihood is given by

P = \prod_{\mu=1}^{M} \frac{T!}{\prod_{s=1}^{S} \left( T \hat{p}_\mu(s) \right)!} \prod_{s=1}^{S} p_\mu(s)^{T \hat{p}_\mu(s)}.   (2.10)
p_\mu(s) is defined as in equation 2.5.³ The standard approach to this optimization problem is to minimize the negative logarithm of the likelihood P,

L = -\frac{1}{T} \log P = C - \sum_{\mu=1}^{M} \sum_{s=1}^{S} \hat{p}_\mu(s) \log p_\mu(s),   (2.11)
under the constraints mentioned in equation 2.6. C is a constant that does not depend on the optimization variables. If there is no prior knowledge about the typical features of which the scenes µ are composed, the generative network will have to do both: estimate the h_µ(i) and learn the likeliest set of features p(s|i) in order to explain the ensemble of observed spike sequences. As soon as suitable p(s|i) have been found and can be held constant, minimizing L turns into a convex optimization problem for the h_µ(i) only.

³ Note that p(s), \hat{p}(s), and \tilde{p}(s) all denote different variables. While \tilde{p}(s) describes the real but unknown input statistics of a scene, \hat{p}(s) denotes a particular stochastic realization of this scene, measured as the relative spike count at the input nodes s. The approximation of \hat{p}(s) by a linear superposition of elementary features is then denoted by p(s).

Unfortunately, serial algorithms updating their representations with each incoming spike from one input pattern cannot be deduced directly from known update equations that work on all µ = 1, …, M full observation sequences S_µ^T in parallel (Lee & Seung, 1999). The reasons for this will become clear during the following derivations, in which we seek a rapid update rule for the h_µ(i)'s and p(s|i)'s minimizing L, based on a single spike observation s_µ^t in each iteration step t. In particular, we aim at a multiplicative update rule because, in many cases (Lantéri, Roche, & Aime, 2002), multiplicative algorithms are known to converge very fast compared to simple additive gradient methods. Unfortunately, we were not able to find the fastest algorithm possible by
deriving it directly from L, and we believe that it cannot be done for the general case. For devising a multiplicative rule by means of a gradient descent, we first compute the derivatives -\nabla L of the log likelihood,

-\frac{\partial L}{\partial h_\mu(i)} = \sum_{s=1}^{S} p(s|i)\, \frac{\hat{p}_\mu(s)}{p_\mu(s)}   (2.12)

-\frac{\partial L}{\partial p(s|i)} = \sum_{\mu=1}^{M} h_\mu(i) \sum_{s'=1}^{S} \frac{\hat{p}_\mu(s')}{p_\mu(s')}\, \delta_{s',s} = \sum_{\mu=1}^{M} h_\mu(i)\, \frac{\hat{p}_\mu(s)}{p_\mu(s)}.   (2.13)
For unconstrained problems, D \nabla L is a valid descent direction if D is a positive definite matrix (Bertsekas, 1995). With z denoting the iteration step, the gradient descent in its most general form reads

\tilde{h}_\mu^{z+1}(i) = h_\mu^z(i) - \gamma_h \left[ D_h^z \nabla_{h_\mu} L^z \right]_i   (2.14)

\tilde{p}^{z+1}(s|i) = p^z(s|i) - \gamma_p \left[ D_p^z \nabla_p L^z \right]_{si}.   (2.15)

In our case, we choose D_h^z and D_p^z as diagonal matrices with entries [D_h^z]_{i,i} = h^z(i) and [D_p^z]_{si,si} = p^z(s|i); both explicitly depend on the update step z. We further introduced the update constants \gamma_h (estimation rate) and \gamma_p (learning rate). Since the update is always positive, we have to satisfy the normalization constraints only, which is done by a postupdate normalization scheme:

h_\mu^{z+1}(i) = \frac{\tilde{h}_\mu^{z+1}(i)}{\sum_{j=1}^{H} \tilde{h}_\mu^{z+1}(j)}   (2.16)

p^{z+1}(s|i) = \frac{\tilde{p}^{z+1}(s|i)}{\sum_{s'=1}^{S} \tilde{p}^{z+1}(s'|i)}.   (2.17)
S pˆ µ (s) (1 − ) + p(s|i) p z (s) s=1 µ
M pˆ µ (s) h (i) p z (s|i) 1 + γ µ=1 µ z pµ (s) . p z+1 (s|i) = M pˆ (s ) 1 + γ µ=1 h µ (i) sS =1 pµz (s ) p z (s |i) h µz+1 (i) = h µz (i)
µ
(2.18)
(2.19)
These equations still use the full sequences S_\mu^T for all µ = 1, …, M patterns to do one update step from z to z + 1. If a given collection of T spikes is considered a fixed input vector, equations 2.18 and 2.19 have the same fixed points as the corresponding NNMF algorithms (Lee & Seung, 2000), which can be derived using expectation-maximization (EM), and for \epsilon \to 0 and \gamma \to \infty they become equivalent. However, at one instant in real time, only one scene µ is presented, and only one or very few spikes are observed. In order to convert equations 2.18 and 2.19 to an online update rule, we restrict the algorithms to one pattern µ at a time. Furthermore, we assume that from this pattern, only one spike, observed at time t, enters the update equations for h or p. This procedure is equivalent to dropping the summations over µ, replacing z by t, and inserting \delta_{s,s^t} instead of the full pattern \hat{p}_\mu(s). The corresponding update equations (online estimation and online learning) resulting from equations 2.18 and 2.19 then read

h^{t+1}(i) = h^t(i) \left[ (1 - \epsilon) + \epsilon\, \frac{p(s^t|i)}{p^t(s^t)} \right]   (2.20)

p^{t+1}(s|i) = p^t(s|i) \left[ 1 + \frac{\gamma\, h(i)}{p^t(s^t) + \gamma\, h(i)\, p^t(s^t|i)} \left( \delta_{s,s^t} - p^t(s^t|i) \right) \right].   (2.21)

The algorithms, equation 2.20 (spike-by-spike online estimation) and equation 2.21 (spike-by-spike online learning), are our main results. Equations 2.20 and 2.21 are online algorithms, in contrast to equations 2.18 and 2.19 and the NNMF learning rules. Their update relies on the extreme case of only one spike, but it is straightforward to construct mixed versions that take into account several spikes for each iteration step (this is done by choosing T in the definition, equation 2.9, of \hat{p}_\mu equal to the number of spikes that should be considered in one step). Our online versions have finite memory, and for stationary input patterns, fluctuations of the hidden states will in general remain finite due to the stochastic drive. Therefore, both algorithms can at most achieve approximate maximization of the likelihood for the full input spike sequence. In the next section, we demonstrate by means of simple examples that with suitably chosen update parameters \epsilon \in [0, 1] and \gamma \in [0, \infty], this approximate convergence is sufficient to achieve high computational performance.
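A direct transcription of equations 2.20 and 2.21 into code reads as follows (an illustrative sketch, not the authors' implementation; computing both updates from the pre-update h^t and p^t is our reading of the equations):

```python
import numpy as np

def sbs_online_step(h, p, s_t, eps=0.1, gamma=0.0):
    """One spike-by-spike update of the internal state (eq. 2.20) and,
    optionally, of the weights (eq. 2.21).

    h     : (H,) internal state h^t(i), nonnegative, sums to one
    p     : (S, H) conditional probabilities p^t(s|i); columns sum to one
    s_t   : channel index of the spike observed at step t
    eps   : estimation rate epsilon in [0, 1]
    gamma : learning rate; gamma = 0 leaves the weights untouched
    """
    S, H = p.shape
    p_st = p[s_t] @ h                       # p^t(s^t) = sum_i p(s^t|i) h^t(i)
    # Eq. 2.20: pull h toward hidden nodes that predict the observed spike.
    h_new = h * ((1.0 - eps) + eps * p[s_t] / p_st)
    # Eq. 2.21: multiplicative weight update toward the observed channel;
    # by construction the columns of p keep summing to one.
    if gamma > 0.0:
        delta = np.zeros(S)
        delta[s_t] = 1.0
        gain = gamma * h / (p_st + gamma * h * p[s_t])   # one factor per i
        p = p * (1.0 + gain[None, :] * (delta[:, None] - p[s_t][None, :]))
    return h_new, p
```

With eps close to one the representation becomes sparse (as discussed for the simple example in section 3.1), while small eps averages over many spikes.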
2.5 Simplified Algorithm with Batch Learning. The online learning equation can be transformed to reduce computational complexity. In this case, learning of the p(s|i) is assumed to take place on a much larger timescale than the estimation process of the internal state h. In reality, learning is a slow process extending over the (repeated) presentation of many training scenes µ. Equation 2.21 hereby imposes a high computational load during online learning. This load can be significantly reduced by assuming that, on a fast timescale, the hidden representations h_\mu^t for each µ, µ = 1, …, M, are computed with the online estimation algorithm, equation 2.20, for T spikes and time steps each. After these computations, now on a much slower timescale, the averaged hidden activities

\langle h \rangle_\mu = \frac{1}{\tilde{T}} \sum_{t=T-\tilde{T}+1}^{T} h_\mu^t

(with \tilde{T} denoting the number of final estimation steps included in the average) are used in parallel for a single update step of the conditional probabilities p(s|i) according to equation 2.19. A suitable multiplicative algorithm for this update has been proposed by Lee and Seung (1999); it reads

\tilde{p}^z(s|i) = p^z(s|i) \sum_{\mu=1}^{M} \hat{p}_\mu(s)\, \frac{\langle h \rangle_\mu(i)}{\sum_{i'=1}^{H} p^z(s|i')\, \langle h \rangle_\mu(i')}   (2.22)

p^{z+1}(s|i) = \tilde{p}^z(s|i) \Big/ \sum_{s'=1}^{S} \tilde{p}^z(s'|i).   (2.23)
The parameter z = 1, …, Z counts the learning steps and is identical to the update step used in equations 2.18 and 2.19. With M different scenes (stimuli) in total, one update of the p(s|i) is made every T·M spikes. Equations 2.22 and 2.23, together with equation 2.20, define the spike-by-spike batch learning algorithm (SbS-batch). Note that equations 2.22 and 2.23 can be derived directly from equation 2.19 by substituting the average ⟨h⟩ for h and by choosing the maximum update constant γ → ∞.
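Assuming the matrix shapes below, one slow-timescale step of the SbS-batch weight update (eqs. 2.22 and 2.23) can be sketched as:

```python
import numpy as np

def sbs_batch_weight_update(p, p_hat, h_avg):
    """p     : (S, H) current conditional probabilities p(s|i)
    p_hat : (M, S) relative spike counts p_hat_mu(s) of the M scenes
    h_avg : (M, H) averaged hidden activities <h>_mu from eq. 2.20
    """
    recon = h_avg @ p.T                         # (M, S): sum_i' p(s|i') <h>_mu(i')
    p_tilde = p * ((p_hat / recon).T @ h_avg)   # eq. 2.22
    return p_tilde / p_tilde.sum(axis=0)        # eq. 2.23: renormalize columns
```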
3 Results

In this section, we demonstrate that the SbS-online and SbS-batch algorithms optimize suitable representations of scene ensembles by predicting the spikes arriving at the input nodes. For learning Boolean functions, our scene ensemble will consist of all input and output bit patterns of the respective function. For learning handwritten digits, the scene ensemble will be represented by the pixel patterns of the digit images, together with their correct classifications from a training data set. The performance of the learned representation will be evaluated in a subsequent computation or classification task on a test data set. Results will be compared to a simple nearest-neighbor classifier (abbreviated NN classifier; for details, see appendix D) and to the NNMF algorithm of Lee and Seung (1999).

3.1 A Simple Example. First, we illustrate characteristic properties of the algorithm for a spike-by-spike update of internal states and the resulting coding in our network using a minimal model. For this purpose, we consider a highly overcomplete situation where S = 2 input neurons project to H = 100 hidden neurons. The optimization problem is therefore underconstrained, and accordingly, the Seung model converges to different internal states H for different initial conditions. It appears that the algorithm prefers mixtures of almost all available features W, as shown in Figure 4A. In contrast, the sparse solution H found by our online update algorithm turns out to be unique for almost all initial conditions (see Figure 4B). The degree of sparseness is tightly linked to the parameter ε: the larger ε, the sparser is the internal representation. With ε = 1, only one hidden state will be active, the one with the maximum likelihood to explain the input sequence. In the example shown in Figure 4B, the high value of ε causes the representation to concentrate on the feature vector that, in this case, is the most natural explanation of the input spike statistics.

3.2 Preprocessing, Training, and Classification. The simulations are carried out in two stages: a training and a test run. Before we present results on learning and performance for specific computations, we briefly describe constructing, preprocessing, and partitioning the data into training and test sets.

3.2.1 Training. For learning a classification or computation task, each of the M_tr training scenes V_µ^tr comprises a pattern or image, together with its correct classification into one of S_c classes. First, the average is removed from each pattern (or image). Since firing rates are always positive and because this transformation leads to negative pattern values, it is necessary to distribute the corresponding values to twice the number of channels (see the code sketch at the end of this section): positive values are directly assigned to the odd channels, while the negative values are rectified and assigned to the even channels. Second, the classification is represented by a vector with S_c entries with only one nonzero entry, at the position corresponding to the correct class. Finally, this vector and the processed pattern are weighted and combined to form a single, normalized vector V_µ^tr from which the input spikes are drawn. The mathematical details of this procedure are described in appendixes A and B. On V_µ^tr, the network now learns a suitable representation, including the correct match between patterns and classes, in an unsupervised manner. This procedure is schematically explained in Figure 1 and employs the network structure sketched in Figure 2.

3.2.2 Test. For testing the trained generative network, the M_ts test scenes V_µ^ts consist of only the transformed, positive input patterns without the class information (details are described in appendix C; see also Figure 3). The correct class c_µ^ts is compared to the prediction ĉ_µ inferred from the internal states of the generative model. This method is described in Figure 1 as an inference procedure with partial sensory input and employs the network structure sketched in Figure 3.
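The preprocessing described in section 3.2.1 can be sketched as follows (Python; the equal weighting of pattern and class channels is an illustrative assumption, since the actual weighting is specified only in the appendixes):

```python
import numpy as np

def make_training_scene(pattern, label, n_classes, class_weight=0.5):
    """Build a normalized intensity vector V_mu^tr from which training
    spikes are drawn: mean-removed pattern split into rectified positive
    (odd) and negative (even) channels, plus a one-hot class vector."""
    x = np.asarray(pattern, dtype=float)
    x = x - x.mean()                        # remove the average
    channels = np.empty(2 * len(x))
    channels[0::2] = np.maximum(x, 0.0)     # positive values -> odd channels
    channels[1::2] = np.maximum(-x, 0.0)    # rectified negatives -> even channels
    cls = np.zeros(n_classes)
    cls[label] = 1.0                        # correct class as one nonzero entry
    v = np.concatenate([(1.0 - class_weight) * channels / channels.sum(),
                        class_weight * cls])
    return v / v.sum()                      # normalized scene
```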
Figure 2: Spike-by-spike network for training a classification task. The training patterns with S_p components, together with their correct classification into one of S_c classes, are presented as randomly drawn spike trains to S = S_p + S_c input nodes during learning. The network thereby finds a suitable representation p(s|i) of the input ensemble and estimates an internal state h(i) for each input pattern according to either the SbS online or the SbS batch algorithm.
Figure 3: Spike-by-spike network for classification of patterns. The test patterns are presented to the Sp input nodes, and an appropriate internal state h(i) explaining the actual pattern is estimated by the SbS algorithm. From the internal state, a classification q (c) is inferred and compared to the correct classification.
Boolean functions in networks that receive only stochastic input is particularly challenging: the spike noise must be cancelled perfectly, because Boolean functions are nonsmooth in the sense that flipping a single input bit can completely change the output value. If a network is capable of performing the required mapping, this has interesting consequences for multimodal integration and attention, which are discussed in section 4. In any case, efficient computation of Boolean functions would demonstrate that, in principle, every possible computation, including the well-known XOR problem, can be performed rapidly in small networks with stochastic spikes. For our simulations, we randomly selected Boolean functions of N = 5 input bits mapped to one output bit, leading to M_tr = M_ts = 2^N = 32
input-output patterns for one specific function.4 Consequently, H = 2^N hidden units should suffice to represent one Boolean function of N bits. In this case, each weight vector p(s|i), comprising the conditional probabilities for a fixed i, matches one of the M_ts test patterns, and for each input pattern only one of the hidden units h should be active (winner-takes-all network). Alternatively, fewer than 2^N hidden nodes may suffice to represent a specific Boolean function perfectly; in this case, each input pattern can be represented by a mixture of active hidden nodes (soft competition network). Since we considered only one output bit, there are only S_c = 2 different classes to which an input pattern can be assigned. For details of the simulation, see appendix E.

3.3.1 Performance of the Network. Figure 5 shows the mean classification error of the different algorithms as a function of the number of hidden units H. The spike-by-spike algorithms (with online learning) perform considerably better than the NNMF algorithm. The NNMF algorithm tends to explain the input as a very broad mixture of many basic features, as in the example shown in Figure 4, while our algorithm concentrates on the most predictive features. With SbS, the error drops to approximately zero as soon as more than H = 25 hidden units are present in the network. In these cases, the input patterns are reconstructed using a mixture of two or more features represented by the activated hidden nodes. Figure 6 shows how the classification error decreases with the number of incoming spikes. Here, SbS reaches a classification error of zero after only about three spikes per input node. In comparison, the NN classifier is also astonishingly fast and closely approaches the error curve of the SbS algorithm. In this specific example, the complexity of the codebooks of the SbS algorithm and the NN classifier is comparable.

3.4 Handwritten Digits. As a third example, we selected the problem of handwritten digit recognition, in order to demonstrate that real-world estimation problems can be solved efficiently and very rapidly by estimation with single spikes. We used the U.S. Postal Service data set, which serves as a benchmark for classification algorithms and allows comparison with human performance. For the parameters of our simulations, see appendix F.

3.4.1 Performance of the Network. Figure 7 shows the classification error of the network versus the mean number of spikes per input node. It turns out
4 We used the same set of patterns for both training and classification. Note, however, that the realization of the patterns was different in both runs because the patterns are represented by a finite number of input spikes, which are randomly drawn in each repetition of a presentation.
Figure 4: (A) Dynamics of the internal representation H for the NNMF model of Lee and Seung (1999), as a function of the iteration step z, and (B) for the spike-by-spike network, as a function of the number t of incoming spikes. The networks comprise S = 2 input nodes and H = 100 hidden nodes, with the weights chosen as p(1|i) = (i + 1)/H and p(2|i) = 1 − p(1|i). The initial state was random but identical for both models, and the input was given by the vector V = {0.42, 0.58}. While the NNMF model converges to a broad mixture of all feature vectors p(s|i), the SbS network converges to a sparse state in which the internal representation peaks around the feature vector p(s|42) = {0.42, 0.58} and accurately reflects the input rate distribution. The update constant was ε = 0.5. In addition, the right plot of B shows that the Kullback-Leibler (KL) divergence, averaged over 1000 runs with different V, decreases with iteration time, demonstrating convergence of our algorithm.
Figure 5: Mean classification error for Boolean functions of five input bits and one output bit, as a function of the number of nodes H in the hidden layer of the network. SbS online learning (solid line) clearly outperforms the noiseless NNMF algorithm (dashed-dotted line), whose error barely drops to 20%.
Figure 6: Error rate e of the SbS online algorithm (solid line) for the task of Figure 5, in comparison to the error of an NN classifier (dotted line). The SbS online algorithm is as fast and as accurate as the NN classifier, which in this case represents optimal lookup of the function values. Note that a mean of three spikes per input neuron is sufficient to achieve perfect classification.
that the NN classifier is remarkably good, reaching 6.1% error on the test set after a mean of about one spike per input node. This performance comes, however, at the price of using the full set of M_tr = 7291 training patterns for classification. In contrast, a network with only H = 500 nodes, optimized on the training set with the batch algorithm, achieves superior performance with one spike per input neuron. A further increase in classification speed and quality can be obtained with an annealing strategy, adapting (cooling) ε during learning and estimation of the internal states: the corresponding curve (the dotted line in Figure 7) shows that with only H = 100 hidden nodes, classification is nearly as fast and as accurate as with the NN classifier and its 7291 stored patterns. Figure 8 shows the complete weight set for SbS with H = 500 and ε = 0.1. About half of the weights are templates, which nevertheless become combined with particular feature weights during classification runs and are required to achieve the classification performance shown. With the same parameters, we computed the classification error of the network as a function of H, shown in Figure 9. H = 25 hidden nodes already suffice to obtain an error only twice as high as that of the NN classifier using the full set of 7291 pattern templates. In addition, the error curve shows no signature of overfitting: it decreases monotonically even for very large networks with up to H = 7290 nodes.

3.4.2 Robustness Against Noise. While our algorithm is by construction robust against noise in the spiking process, the question is whether this
Figure 7: Mean classification error e for the USPS database, shown for different numbers of neurons in the hidden layer as a function of the number of input spikes (parameters as in Figure 8). The dashed line shows the error of an NN classifier for comparison. The SbS algorithm with 500 hidden neurons (solid line) is considerably slower but exceeds the performance of the NN classifier. If learning and classification are performed with an adaptive estimation constant ε ∝ t^{−1/3} (dotted line), the performance with only 100 hidden neurons approaches the speed and accuracy of the NN classifier. The weights in this example were trained with the batch learning rule.
remains true for other types of noise. To address this issue, we subjected the digit patterns to two types of perturbation. First, we occluded an increasing number of vertical or horizontal lines in the digit patterns (starting with the two center lines), and second, we added an increasing amount of noise to each pixel of a pattern. Figure 10A shows the classification performance as a function of the number of covered lines: for up to six occluded lines, the classification error stays below 20%. Figure 10B demonstrates that the pixel noise must rise to a considerable level of η = 0.35 before the error exceeds 20%.

3.5 Hierarchical Networks. Finally, we demonstrate that our basic networks can be combined by using the hidden variables h(i) as probabilities for sending out the next spikes, that is, by turning them into input neurons of subsequent networks. As a simple example, we present a hierarchical network that solves the parity problem, the Boolean function that requires the output to be one whenever an odd number of input bits are active and zero otherwise. While this problem can be solved trivially with H = 2^N hidden neurons, a hierarchical topology of simple SbS
Figure 8: Conditional probabilities (weight vectors) p*(s|i) for the SbS batch learning algorithm using H = 500 hidden nodes. For each i, the vectors for even and odd input nodes s are combined and individually rescaled to a gray value g_i(s) ∈ [−1, 1], and then displayed in a 16 × 16 pixel raster. The g_i(s) were computed as g_i(s) = g̃_i(s)/max_{s'}{|g̃_i(s')|} with g̃_i(s) = p*(2s − 1|i) − p*(2s|i). Parameters for batch learning were chosen as described in appendix F.
Figure 9: Mean classification error e for the USPS database as a function of the number of hidden nodes, after 10 spikes per input node. The weights were obtained with the batch learning rule; the other parameters are as in Figure 8.
Figure 10: Error rate of the SbS algorithm for (A) digit patterns partially occluded by gray bars and (B) digit patterns subjected to a varying amount η of pixel noise (weights and parameters as in Figure 8). In A, gray bars indicate the error rate under horizontal occlusion of pixel rows, and black bars the error under vertical occlusion of pixel columns, as a function of the number of occluded rows or columns (for details, see appendix F). For fewer than four occluded pixel rows or columns, and for a noise level below η = 0.25, the recognition error remains below 10%.
networks can drastically reduce the network size. This network proves that our approach can be generalized in a self-consistent manner, using the output of one simple module as a meaningful input to another simple module; SbS networks can therefore be combined for iterative computations. Together with our results on Boolean functions, this shows that spike-by-spike networks are able to compute arbitrarily complex functions with any required precision in a finite amount of time.

3.5.1 Parity Function Network. The binary parity function is 1 if the number of input bits equal to 1 is odd and 0 otherwise. For an arbitrary number of input bits, the parity function is easily computed by a hierarchical tree of XOR modules. For our simulations, we chose N = 16 input bits, giving S_p = 2N = 32 on-off input channels and two output classification nodes (see Figure 11). The full network consists of 15 spike-by-spike XOR modules whose weights were set up manually (a value of 0.5 was assigned to all connections drawn in Figure 11). In every module, input-output nodes are shown as white discs and hidden nodes as gray discs. The hidden node activities h(i) and output variables p̂(k, n) of one XOR module are updated by equations 2.20 and 2.32 with each incoming spike. Spikes are elicited either by the external input or by the activities p̂(k, n) of an internal output node n, which is at the same time the input node of the next XOR module. With probability p_I = 0.05, a spike is drawn according to the probability distribution of the actual input bit pattern and sent to one of the 8 XOR modules in the input layer. With probability p_H = p̂(k, n)(1 − p_I)/14, a spike is drawn from one
Figure 11: Structure of a hierarchical network constructed by hand for solving the parity function with 16 input bits. The network consists of 15 XOR modules linked together. Gray discs denote the hidden-layer nodes of the single modules, and open circles denote the input-output nodes. Connections between the nodes are marked with solid lines (all connections set to a weight of 0.5). The arrow indicates the output nodes that compute the parity subfunction for the first eight input bits (from left).
of the output nodes of the 14 XOR modules in the lower layers of the hierarchy and passed on to the next XOR module. If the network performs the intended computation perfectly, each pair of input-output nodes should by construction display a well-defined activity for each input pattern. As an example, consider the two output nodes marked by an arrow in Figure 11. If the number of 1-bits within the first 8 bits of the input pattern is even, the output variable of the left node should take a lower value (ideally 0) than that of the right node (ideally 1), thus computing the parity function for the first 8 input bits. Figure 12 displays the corresponding errors of the network, for the output layer and for the input-output nodes in the hidden layers, as a function of the number of spikes arriving in the input layer. All errors go down to zero, but on different timescales: the errors in the lower layers of the hierarchy clearly decrease earlier than those in the higher layers. This behavior is to be expected from the structure of the network: only when the input from the (m − 1)th layer is correct can the output error of the mth layer decrease to 0.

4 Summary and Discussion

We presented a generative framework for neural computation that relies on spike signals given by Poissonian point processes. In our framework, probabilities for eliciting spikes capture neural activities, and synaptic weights are correspondingly related to conditional probabilities of observing spikes given a putative elementary cause. While our scheme is equivalent to
Figure 12: Classification error of the hierarchical SbS network of Figure 11 as a function of the mean number of spikes per input node (solid line). In a hierarchical network, the accuracy of the output of a higher layer relies on the accuracy of the output of the lower layers. This dependence is demonstrated by the error curves for the first hidden layer (dashed line), the second hidden layer (dashed-dotted line), and the third hidden layer (dotted line). Correspondingly, the curves show that the error in the nth layer falls below 10% about 1 to 2 spikes per input node later than in the (n − 1)th layer.
nonnegative matrix factorization (NNMF) when used in the asymptotic case where mean rates represent the signals, the spike-by-spike online update for single action potentials leads to fundamentally different dynamics of the hidden states. In particular, for overcomplete representations, our fast algorithm favors sparse hidden-layer representations, in contrast to the basic NNMF algorithms derived in Lee and Seung (1999). While ad hoc sparseness conditions for NNMF have been discussed in Hoyer (2004), in our framework they emerge from the requirement of rapid convergence. A different idea linking sparseness to fast visual processing was put forward by Perrinet, Samuelides, and Thorpe (2004): with each spike, an orthogonal projection of the best-matching feature is subtracted from the input scene (matching pursuit). This method has been shown to need very few spikes to achieve good reconstruction of natural scenes. We demonstrated with simple paradigmatic examples that very small networks using our dynamics can rapidly and accurately perform complex computations, requiring only a few stochastic spikes per input neuron to achieve maximal performance. For a proof of concept, we considered learning and computation of random Boolean functions. Boolean functions
are particularly challenging because a different input in one channel (bit) may switch an arbitrary function of the remaining bits to an arbitrarily different function. In a neural environment, this dynamical property allows changing the computation performed in a certain brain area depending on a specific context. This context can be interpreted, for example, as attentional priming, additional sensory cues, or information from other brain areas. For Boolean functions with N input bits, 2^N hidden nodes are in general required for a complete representation. However, we found that optimal performance is in most cases possible with fewer than these 2^N hidden nodes. This reduction is due to the capability of the algorithm to find and represent redundancies in an effective manner. The error vanished completely, in dramatic contrast to the NNMF algorithm of Lee and Seung (1999), which failed at this task. It is possible that the stochasticity of the spikes effectively implements a Monte Carlo sampling over the posterior, which helps to find a better solution than NNMF does; this interpretation was put forward by Hoyer and Hyvärinen (2003). The ability of our network to learn and compute Boolean functions shows that highly nonlinear, nonsmooth functions can be computed efficiently on the basis of very few spikes, which underlines the universality of our framework. Applications of our model with H ≈ 500 hidden nodes to handwritten digits demonstrate that, on average, fewer than one spike per input neuron leads to recognition rates exceeding those of a nearest-neighbor classifier using the full training set of more than 7000 patterns. This high speed of recognition appears to be a general property of our framework and depends little on the input stimulus ensemble. We also studied our model with natural images, confirming these results (data not shown). Training the network on natural images with the batch learning rule leads to weight vectors p(s|i) whose appearance depends strongly on the value of ε. For very small ε, the weight vectors approximate orthogonal pixel basis functions, specializing in explaining the spikes of a few neighboring input channels; the representation of a natural image is then a linear, nonsparse superposition of these basis functions. In contrast, values of ε near one lead to weight vectors approximating templates of image patches; the representation then becomes maximally sparse, activating only the one hidden node whose weight vector most closely matches the presented input pattern. In between these extreme cases, the weight vectors show strong similarities to the receptive fields shown in Olshausen and Field (1996). Our interpretation of the weight vectors learned with a fixed ε, however, is that of a representation optimized to explain any input pattern with an average number of spikes proportional to 1/ε. In summary, our results challenge the notion that the high speed of visual perception found in human and monkey psychophysics (Thorpe et al., 1996; Mandon & Kreiter, 2005) can be explained only by assuming
that the exact timing of individual spikes conveys the information about the stimulus (van Rullen & Thorpe, 2001). The recognition of handwritten digits was also used to show that the framework is very robust to noise and data corruption. By construction, our framework is invariant to scalings of the overall intensity. This property has been found in primary visual cortex, where the tuning of neurons is invariant to stimulus contrast (Skottun, Bradley, Sclar, Ohzawa, & Freeman, 1987), hinting at the potential relevance of our approach for explaining cortical computation. At first sight, the equations for update and online learning, equations 2.20 and 2.21, might appear biologically unrealistic. However, we expect that the algorithms can be implemented using only known properties of real neural circuits. This implementation is the subject of our ongoing studies, but we outline the underlying ideas in the following paragraph. First, we assume that the hidden variable h_t(i) is represented in the excitability of a neuron or, more realistically, in the excitability of a neuronal group or population (Wilson & Cowan, 1972), which would be equal to the average over all membrane potentials. Second, we observe that the update term in equation 2.20 is proportional to h_t(i) p(s_t|i). One could interpret this term as a spike delivered over a synapse with efficacy p(s_t|i), multiplied by the excitability h_t(i). In fact, such multiplicative interactions have been observed in real neurons (Chance, Abbott, & Reyes, 2002). Spikes in the populations representing the h_t(i) would be elicited with a probability proportional to h_t(i). One inhibitory population would suffice to receive and average those spikes and feed them back to the other populations as a divisive normalization, a well-known neural operation that has been studied extensively in, for example, early visual cortical areas (Heeger, 1992; Carandini & Heeger, 1994). Taken together, this outline demonstrates that a biophysically plausible implementation of our algorithm is within reach; in its final form, it will be described by coupled differential equations defined in real time instead of event-driven discrete update algorithms. A different approach to spike-by-spike estimation was studied by Denève (2005), who derived a differential equation updating an estimate of the log-likelihood ratio for the presence or absence of a stimulus in the receptive field of a Bayesian neuron. This study is very interesting because it takes the temporal dynamics of the stimulus into account, although in its current state only for the presence or absence of a single cause, not for mixtures of causes. Denève also demonstrated that, under various approximations, the corresponding differential equation is similar to that of an integrate-and-fire neuron. For the biological plausibility of the online learning rule, equation 2.21, we observe that the weight change is essentially given by two terms: one that is Hebbian and one that describes weight decay proportional to
postsynaptic activity:

$$\Delta p(s|i) \propto h(i)\, p(s|i)\left(\delta_{s,s^t} - p(s^t|i)\right). \tag{4.1}$$
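As a compact illustration, equation 4.1 can be applied per observed spike as follows; note that the update already sums to zero over s for each i, so the final renormalization (our own addition, not part of equation 4.1) only guards against rounding drift and keeps the weights positive.

```python
import numpy as np

def weight_update(p, h, s_t, gamma):
    """One application of the learning rule of equation 4.1.

    p[s, i] : conditional probabilities p(s|i) (each column sums to 1)
    h[i]    : current internal state
    s_t     : index of the observed input spike
    """
    delta = np.zeros(p.shape[0])
    delta[s_t] = 1.0
    # Hebbian term minus decay proportional to postsynaptic activity:
    # dp(s|i) ~ h(i) p(s|i) (delta_{s,s_t} - p(s_t|i))
    dp = h[None, :] * p * (delta[:, None] - p[s_t][None, :])
    p_new = np.clip(p + gamma * dp, 1e-12, None)
    return p_new / p_new.sum(axis=0, keepdims=True)
```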
In previous approaches, simple-cell responses were explained from ecological requirements (Barlow, 1960), such as statistical independence of the hidden node activations (Bell & Sejnowski, 1997). For natural image data, it was argued that this requirement is equivalent to sparseness constraints on the activities (Olshausen & Field, 1996). A different justification for sparseness may be found in the principle of minimal energy consumption (Levy & Baxter, 1996). In frameworks like ours, however, in which the input data are modeled by a linear superposition of basis vectors, imposing sparseness conditions on the coefficients from the outset appears rather ad hoc. In contrast, our work points toward a new explanation of sparseness as a consequence of convergence time constraints. In our algorithm 2.20, residual noise in the hidden variable estimates, caused by nonvanishing values of the update speed parameter ε, induces sparseness of the hidden node activations. The relation between speed and sparseness becomes clear when considering the update algorithm at the extreme value ε = 1. In this case, asymptotically only the hidden neuron receiving the largest input remains active; sparseness becomes maximal, and we effectively have a winner-takes-all network. With ε < 1, sparseness is still enforced, but the sparseness of the hidden representation is dominated by the need to explain the data. These dependencies may also be interpreted differently: the parameter ε introduces a timescale that effectively determines how strongly the observation of a single input spike changes the internal representation. Small ε increases the memory for previous spikes; the internal representation can then be adjusted more accurately to an input pattern, at the cost of a longer observation time. A more accurate representation in general needs more basic features to explain the input data. In contrast, with a large ε, a sparse representation of a few features suffices to explain an input pattern whose potential variability is high due to the small number of observed spikes. Taking all these observations together, our framework might provide a novel basis for understanding neural computation. Coming from a rather technical background, where similar algorithms were introduced to achieve blind source separation and blind deconvolution (often referred to as the cocktail party problem), we have shown that our approach is quite universal and can be used to perform general computations efficiently. In particular, it can implement multimodal neural integration by taking spike sequences from different sensory modalities or other brain regions into account (Denève et al., 2001). It is also straightforward to extend our approach to include statistical expectations, in particular those induced by tasks and attention. We believe
that future investigations along these lines will allow formulating specific predictions of our framework that can be tested in neurophysiological and psychophysical experiments.

Appendix A: Pattern Preprocessing

To avoid negative firing rates, the raw image or bit patterns B_µ, with N components each, were preprocessed to form the positive input patterns B⁺_µ finally applied to the network. In a first step, the components b_{µ,s}, s = 1, . . . , N, were normalized by subtracting the individual mean value:

$$\langle b_{\mu} \rangle = \frac{1}{N} \sum_{s=1}^{N} b_{\mu,s}, \qquad \tilde{b}_{\mu,s} = b_{\mu,s} - \langle b_{\mu} \rangle. \tag{A.1}$$
Then the N components were duplicated and assigned pairwise to the odd and even input node pairs, yielding S = 2N nonnegative components b⁺_{µ,s} according to the expressions

$$b^{+}_{\mu,2s-1} = \begin{cases} +\tilde{b}_{\mu,s} & \text{for } \tilde{b}_{\mu,s} > 0 \\ 0 & \text{otherwise} \end{cases} \tag{A.2}$$

$$b^{+}_{\mu,2s} = \begin{cases} 0 & \text{for } \tilde{b}_{\mu,s} > 0 \\ -\tilde{b}_{\mu,s} & \text{otherwise.} \end{cases} \tag{A.3}$$
This preprocessing is motivated by the properties of early visual processing in the brain: splitting the mean-corrected input into negative and positive values resembles the analysis of visual stimuli by on- and off-cells in the lateral geniculate nucleus (LGN).
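A minimal sketch of this preprocessing (equations A.1 to A.3), assuming NumPy; the function name is ours:

```python
import numpy as np

def preprocess(b):
    """Mean-correct a pattern and split it into on/off channels
    (equations A.1-A.3): positive parts go to the odd input nodes,
    rectified negative parts to the even input nodes."""
    b_tilde = b - b.mean()                 # equation A.1
    out = np.empty(2 * b.size)
    out[0::2] = np.maximum(b_tilde, 0.0)   # b+_{mu,2s-1}, equation A.2
    out[1::2] = np.maximum(-b_tilde, 0.0)  # b+_{mu,2s},   equation A.3
    return out
```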
Appendix B: Training Procedures

For the training procedure (see Figure 2), the S input nodes are split into two sets of sizes S_p (indexed by s = 1, . . . , S_p) and S_c (indexed by s = S_p + 1, . . . , S = S_p + S_c). Correspondingly, there are also two sets of conditional probabilities p(s|i). Each item in the training set comprises a nonnegative input pattern B^{tr+}_µ together with its correct classification c^tr_µ. The pattern B^{tr+}_µ is applied to the first set of S_p input nodes, while the correct classification activates the c^tr_µ-th node of the second set of input nodes. Pattern and classification inputs are in addition weighted by a factor λ, which regulates the ratio of the mean number of spikes caused by the two types of input; λ thus allows balancing the input and output arguments of a function to be learned by the network. Using these assignments, the final training scenes V^tr_µ are given by the expressions

$$v^{tr}_{\mu,s} = \lambda\, \frac{b^{tr}_{\mu,s}}{\sum_{s'=1}^{S_p} b^{tr}_{\mu,s'}} \quad \text{for } s \in [1, S_p] \tag{B.1}$$

and

$$v^{tr}_{\mu,s} = (1 - \lambda)\, \delta_{s-S_p,\, c^{tr}_\mu} \quad \text{for } s \in [S_p + 1, S_p + S_c]. \tag{B.2}$$
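In code, the construction of one training scene reads as follows (a sketch assuming NumPy; names are ours):

```python
import numpy as np

def training_scene(b_plus, c, S_c, lam):
    """Combine a preprocessed pattern b_plus with its class c (1..S_c)
    into one normalized scene V (equations B.1 and B.2)."""
    v_pattern = lam * b_plus / b_plus.sum()      # equation B.1
    v_class = np.zeros(S_c)
    v_class[c - 1] = 1.0 - lam                   # equation B.2
    return np.concatenate([v_pattern, v_class])  # entries sum to 1
```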
During training, spikes are drawn randomly from V^tr_µ according to equation 2.2, while both h(i) and p(s|i) are updated with the SbS online or SbS batch algorithm.

Appendix C: Classification and Computation Procedures

For the classification run, input scenes V^ts_µ are composed solely of the preprocessed patterns B^{ts+}_µ via v^ts_{µ,s} = b^{ts+}_{µ,s} / Σ_{s'=1}^{S_p} b^{ts+}_{µ,s'} (see Figure 3). The first set of conditional probabilities p(s|i) from the learning procedure, with s = 1, . . . , S_p, is renormalized, yielding the pattern weights

$$p^{*}(s|i) = p(s|i) \Big/ \sum_{s'=1}^{S_p} p(s'|i). \tag{C.1}$$
For each presented pattern or scene µ, now only the internal states h_µ(i), not the weights p(s|i), are updated using equation 2.20. The remaining S_c weight vectors are transposed and normalized, yielding the classification weights

$$\hat{p}(i|c) = \frac{p(c + S_p|i)}{\sum_{c'=1}^{S_c} p(c' + S_p|i)} \quad \text{for } c = 1, \ldots, S_c. \tag{C.2}$$
From h(i) and p̂(i|c), the quantities q_µ(c) = Σ_{i=1}^{H} p̂(i|c) h_µ(i) are computed. These q_µ(c) denote the probabilities for each of the M_ts test patterns V^ts_µ to belong to class c. From the q_µ(c), a prediction for the classification ĉ^t_µ is computed and updated within each time step t,

$$\hat{c}^{\,t}_\mu = \arg\max_c\, q^{\,t}_\mu(c). \tag{C.3}$$
The mean classification error e^t over all M_ts test patterns is finally computed as

$$e^{t} = \frac{1}{M_{ts}} \sum_{\mu=1}^{M_{ts}} \left(1 - \delta_{c^{ts}_\mu,\, \hat{c}^{\,t}_\mu}\right). \tag{C.4}$$
Appendix D: Definition of the Nearest-Neighbor Classifier

To compare our learning and reconstruction algorithms with standard approaches, we use a nearest-neighbor classifier. This classifier is easy to implement and yields relatively good performance, at the price of using the whole training data set as codebook vectors. In our simulations, a test pattern V^ts_µ is classified by searching for the training pattern V^tr_{µ'} with the smallest quadratic distance ‖V^tr_{µ'} − V^ts_µ‖² to the test pattern. The classification is then done by assuming that both patterns belong to the same class c^tr_{µ'}.
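A sketch of this classifier, assuming patterns stored as rows of a NumPy array:

```python
import numpy as np

def nn_classify(v_test, V_train, c_train):
    """Nearest-neighbor classifier of appendix D: return the class of
    the training pattern with the smallest quadratic distance."""
    d2 = ((V_train - v_test) ** 2).sum(axis=1)  # ||V_tr - V_ts||^2
    return c_train[int(d2.argmin())]
```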
Appendix E: Details and Parameters for the Computation of Boolean Functions

From all possible 2^32 Boolean functions of five input bits and one output bit, F_o = 100 Boolean functions were chosen randomly for the SbS network (with online learning). By splitting the input arguments into on-off channels as described in appendix A, we obtain S = S_p + S_c = 12 input nodes. During SbS with online learning, the 32 patterns were presented sequentially with 4620 spikes per pattern and with λ = 5/6. This procedure was repeated Z = 20 times. Before each learning step and for each different pattern µ, the h's were reset to a constant value of h⁰_µ = 1/H. Before the first learning step, the p_z(s|i) were uniformly initialized with p⁰(s|i) = 1/S. After learning, the h's were again reset to h⁰_µ, and each input pattern was presented for T = 1000 spikes to test the classification performance. Learning constants were chosen as ε = 1000/1001 and γ = 0.01, unless stated otherwise.

Appendix F: Details and Parameters for the Classification of Handwritten Digits

The database of handwritten digits from the U.S. Postal Service contains M_tr = 7291 training and M_ts = 2007 test patterns, each consisting of N = 16 × 16 gray-scale values (pixels) ranging from −1 (black) to 1 (white). With the data preprocessing described above, the network comprises S_p = 512 pattern input channels and S_c = 10 classification input channels for training. Using the SbS algorithm with batch learning, one learning step included the parallel presentation of all training patterns with λ = 0.5, ε = 0.1, T = 4620, = 4620, and Z = 20, similar to the procedure employed for the Boolean functions. During classification, each digit pattern was presented for T = 10,000 spikes. To test the robustness of the algorithms, the recognition was made more difficult in two different ways. The first challenge was
introduced by occluding a variable number of rows or columns. Starting from the center columns or rows of the digit pattern, the pixel values of whole rows or columns were set to the intermediate value 0. For example, a vertical occlusion of six columns on a 16 × 16 pattern was realized by setting the pixel values of columns 6 to 11 in all rows 1 to 16; a horizontal occlusion of four rows was realized by setting all pixel values of rows 7 to 10 in all columns 1 to 16 to 0. The second challenge was introduced by superimposing random noise on the digit pattern. In detail, for each digit to be noisified, random pixel values b^{rnd}_{µ,s} were drawn from a uniform distribution between −1 and 1. By means of a parameter η ∈ [0, 1], the original pattern was combined with the noise pattern into a new input pattern b^{new}_{µ,s} according to the rule

$$b^{new}_{\mu,s} = (1 - \eta)\, b_{\mu,s} + \eta\, b^{rnd}_{\mu,s}. \tag{F.1}$$

The parameter η regulates the amount of noise on the original pattern: η = 0 represents the noise-free original pattern, and η = 1 the case in which the original pattern has been fully replaced by the noise pattern.
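Both perturbations are straightforward to reproduce; a sketch assuming NumPy and square (16 × 16) images, with function names ours:

```python
import numpy as np

def add_pixel_noise(b, eta, rng):
    """Mix a pattern with uniform noise from [-1, 1] (equation F.1)."""
    b_rnd = rng.uniform(-1.0, 1.0, size=b.shape)
    return (1.0 - eta) * b + eta * b_rnd

def occlude(img, n_lines, vertical=True):
    """Set the n_lines center columns (vertical=True) or rows of a
    square digit image to the intermediate gray value 0."""
    img = img.copy()
    start = (img.shape[0] - n_lines) // 2
    if vertical:
        img[:, start:start + n_lines] = 0.0  # e.g., columns 6-11 for n=6
    else:
        img[start:start + n_lines, :] = 0.0  # e.g., rows 7-10 for n=4
    return img
```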
Acknowledgments

Discussions with Matthias Bethge and Misha Tsodyks were very helpful for clarifying basic notions of our framework. This work was supported by the German Science Foundation (DFG, project B5 in SFB 517) and the Ministry of Research and Education (BMBF, DIP METACOMP).

References

Barlow, H. (1960). The coding of sensory messages. Cambridge: Cambridge University Press.
Bell, A., & Sejnowski, T. (1997). The “independent components” of natural scenes are edge filters. Vision Research, 37(23), 3327–3338.
Berry, M., Warland, D., & Meister, M. (1997). The structure and precision of retinal spike trains. Proc. Natl. Acad. Sci., 94, 5411–5416.
Bertsekas, D. (1995). Non-linear programming. Belmont, MA: Athena Scientific.
Bethge, M., Rotermund, D., & Pawelzik, K. (2003a). Optimal neural rate coding leads to bimodal firing rate distributions. Network: Comput. Neural Syst., 14, 303–319.
Bethge, M., Rotermund, D., & Pawelzik, K. (2003b). A second order phase transition in neural rate coding: Binary encoding is optimal for rapid signal transmission. Phys. Rev. Lett., 90(8), 088104.
Carandini, M., & Heeger, D. (1994). Summation and division by neurons in primate visual cortex. Science, 264, 1333–1336.
Chance, F. S., Abbott, L. F., & Reyes, A. D. (2002). Gain modulation from background synaptic input. Neuron, 35, 773–782.
Delorme, A., & Thorpe, S. J. (2001). Face identification using one spike per neuron: Resistance to image degradations. Neural Networks, 14, 795–803.
Denève, S. (2005). Bayesian inference in spiking neurons. In L. K. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing systems, 17 (pp. 353–360). Cambridge, MA: MIT Press.
Denève, S., Latham, P. E., & Pouget, A. (2001). Efficient computation and cue integration with noisy population codes. Nature Neuroscience, 4(8), 826–831.
Ernst, M. O., & Banks, M. S. (2002). Humans integrate visual and haptic information in a statistically optimal fashion. Nature, 415, 429–433.
Fourcaud, N., & Brunel, N. (2002). Dynamics of the firing probability of noisy integrate-and-fire neurons. Neural Computation, 14, 2057–2110.
Gerstner, W., & Kistler, W. (2002). Spiking neuron models. Cambridge: Cambridge University Press.
Heeger, D. (1992). Normalization of cell responses in cat striate cortex. Visual Neuroscience, 9, 181–198.
Herzog, M. H., & Fahle, M. (2002). Effects of grouping in contextual modulation. Nature, 415, 433–436.
Hoyer, P. O. (2004). Non-negative matrix factorization with sparseness constraints. Journal of Machine Learning Research, 5, 1457–1469.
Hoyer, P. O., & Hyvärinen, A. (2003). Interpreting neural response variability as Monte Carlo sampling of the posterior. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15 (pp. 277–284). Cambridge, MA: MIT Press.
Koechlin, E., Anton, J. L., & Burnod, Y. (1999). Bayesian inference in populations of cortical neurons: A model of motion integration and segmentation in area MT. Biol. Cyb., 80, 25–44.
Körding, K. P., & Wolpert, D. M. (2004). Bayesian integration in sensorimotor learning. Nature, 427, 244–247.
Lantéri, H., Roche, M., & Aime, C. (2002). Penalized maximum likelihood image restoration with positivity constraints: Multiplicative algorithms. Inverse Problems, 18, 1397–1419.
Lee, D. D., & Seung, H. S. (1999). Learning the parts of objects by non-negative matrix factorization. Nature, 401, 788–791.
Lee, D. D., & Seung, H. S. (2000). Algorithms for non-negative matrix factorization. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 556–562). Cambridge, MA: MIT Press.
Lee, T., & Mumford, D. (2003). Hierarchical Bayesian inference in the visual cortex. J. Opt. Soc. Am., 20(7), 1434–1448.
Levy, W., & Baxter, R. (1996). Energy-efficient neural codes. Neur. Comp., 8, 531–543.
Lucy, L. (1974). An iterative technique for the rectification of observed distributions. Astron. J., 79, 745–754.
Mandon, S., & Kreiter, A. K. (2005). Rapid contour integration in macaque monkeys. Vision Research, 45, 291–300.
Mumford, D. (2002). Pattern theory: The mathematics of perception. In L. Tatsien (Ed.), International Congress of Mathematicians, Beijing, China (Vol. III, pp. 401–422). Singapore: World Scientific.
Olshausen, B., & Field, D. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381, 607–609.
Panzeri, S., Treves, A., Schultz, S., & Rolls, E. (1999). On decoding the responses of a population of neurons from short time epochs. Neural Computation, 11, 1553–1577.
Perrinet, L., Samuelides, M., & Thorpe, S. (2004). Coding static natural images using spiking event times: Do neurons cooperate? IEEE Trans. Neur. Networks, 15, 1164–1175.
Rao, R. P. (2004). Bayesian computation in recurrent neural circuits. Neural Computation, 16, 1–38.
Richardson, W. (1972). Bayesian-based iterative method of image restoration. J. Opt. Soc. Am., 62, 55–59.
Shadlen, M., & Newsome, W. (1998). The variable discharge of cortical neurons: Implications for connectivity, computation and information coding. J. Neurosci., 18(10), 3870–3896.
Silberberg, G., Bethge, M., Markram, H., Pawelzik, K., & Tsodyks, M. (2004). Dynamics of population rate codes in ensembles of neocortical neurons. Journal of Neurophysiology, 91, 704–709.
Skottun, B., Bradley, A., Sclar, G., Ohzawa, I., & Freeman, R. (1987). The effects of contrast on visual orientation and spatial frequency discrimination: A comparison of single cells and behaviour. J. Neurophysiol., 57, 773–786.
Thorpe, S., Delorme, A., & van Rullen, R. (2001). Spike-based strategies for rapid processing. Neural Networks, 14, 715–725.
Thorpe, S. J., Fize, D., & Marlot, C. (1996). Speed of processing in the human visual system. Nature, 381, 520–522.
van Rullen, R., & Thorpe, S. J. (2001). Rate coding versus temporal order coding: What the retinal ganglion cell tells the visual cortex. Neural Computation, 13, 1255–1283.
Wilson, H. R., & Cowan, J. D. (1972). Excitatory and inhibitory interactions in localized populations of model neurons. Kybernetik, 12, 1–24.
Received February 11, 2005; accepted September 1, 2006.
LETTER
Communicated by Raj Rao
Exact Inferences in a Neural Implementation of a Hidden Markov Model

Jeffrey M. Beck
[email protected]
Alexandre Pouget
[email protected]
Department of Brain and Cognitive Sciences, University of Rochester, Rochester, NY 14627, U.S.A.
From first principles, we derive a quadratic nonlinear, first-order dynamical system capable of performing exact Bayes-Markov inferences for a wide class of biologically plausible stimulus-dependent patterns of activity while simultaneously providing an online estimate of model performance. This is accomplished by constructing a dynamical system that has solutions proportional to the probability distribution over the stimulus space, but with a constant of proportionality adjusted to provide a local estimate of the probability of the recent observations of stimulus-dependent activity given the model parameters. Next, we transform this exact equation to generate nonlinear equations for the exact evolution of the log likelihood and the log-likelihood ratios and show that when the input has low amplitude, linear rate models for both the likelihood and the log-likelihood functions follow naturally from these equations. We use these four explicit representations of the probability distribution to argue that, in contrast to the arguments of previous work, the dynamical system for the exact evolution of the likelihood (as opposed to the log likelihood or log-likelihood ratios) not only can be mapped onto a biologically plausible network but is also more consistent with physiological observations.

1 Introduction

It is becoming increasingly clear that some of the computations performed by the nervous system are akin to Bayesian inferences, whether in the context of sensory perception, reasoning, or motor control (Carpenter & Williams, 1995; Gold & Shadlen, 2001; Knill & Pouget, 2004). However, the neural mechanisms of these inferences remain unclear. Here, we explore one particular neural implementation of Bayesian inferences for hidden Markov models (HMMs). HMMs are particularly appealing because they can be used to model a wide variety of problems involving dynamic stimuli, such as cue integration, decision making, and motor control. The goal is to understand how a neural network can infer Pr(θ(t)|v(t), . . . , v(0)), where

Neural Computation 19, 1344–1361 (2007)
C 2007 Massachusetts Institute of Technology
θ(t) is a time-varying stimulus and {v(t), . . . , v(0)} is a set of observations or measurements related to θ(t); for example, θ(t) could be the position of a moving object inferred from a sequence of images {v(t), . . . , v(0)}. Consistent with the work of others, we limit ourselves to recurrent networks for which the firing rate u_i(t) of neuron i is related to the probability that the stimulus takes on some value θ_i through some monotonically increasing function,

$$u_i(t) = f\big(\Pr(\theta(t) = \theta_i \mid v(s)\ \forall s < t)\big), \tag{1.1}$$
where f(x) is typically the identity, a convolution (Anderson, 1994), or a log transform (Rao, 2004; Yu & Dayan, 2005; Denève, 2005). Exact solutions to this problem have recently been derived for a single neuron with a binary variable θ (Denève, 2005) and for multiple neurons with a static variable θ (Koechlin, Anton, & Burnod, 1999). However, for nontrivial, dynamic θ(t) and a population of neurons, the only available solution relies on the assumption that the log of a sum can be approximated by the sum of logs (Rao, 2004). In particular, that study showed that when f(x) = log(x), a linear dynamical system of the form

$$\frac{d\vec{u}}{dt} = -\vec{u} + W\vec{u} + M\vec{v}, \tag{1.2}$$
where W is the recurrent connectivity matrix and M is the feedforward connectivity matrix, approximates Bayesian inferences for such an HMM once the weight matrix W has been adjusted by a numerical fit. Here we present a simple, rigorous, and systematic derivation of a quadratic, nonlinear first-order dynamical system whose parameters may be linked directly to Bayes' rule and whose solutions give the exact probability that a Markov stimulus currently takes on a particular value given the entire history of the observed pattern of activity. Furthermore, it will be shown that such a system follows solely from the assumptions that the stimulus is Markov and that the stimulus-dependent pattern of activity consists of a collection of independent Poisson processes with stimulus-dependent (and nonzero) rates. A companion equation for the evolution of the log likelihood of the observed stimulus-dependent activity given the model parameters will also be derived. We argue that this equation provides the proper objective function for a Bayes-linked learning rule and indicates that the goal of such a network is, in fact, the mean prediction of the input patterns. Moreover, we will demonstrate that the rate of increase of the probability of the data given the model parameters can be incorporated into the amplitude of the output probability distribution in a manner that allows for an online estimate of model performance. For the purposes of comparison with previous work, we then generate the associated nonlinear evolution equations for the log-likelihood and
log-likelihood ratios. The assumption of low-amplitude input is then applied to the log-likelihood equations to demonstrate that the recurrent weights of the linearized model may be calculated directly, as opposed to by a numerical fit as in Rao (2004). We then show that the assumption of low-amplitude input may also be applied to the nonlinear equation for the evolution of the likelihood function to obtain an equivalent linear dynamical system with an associated Hebbian learning rule. Finally, we discuss the biological plausibility of the various dynamical systems for Bayesian inference that we have generated and argue the case for a neural implementation that directly represents the amplitude-adjusted probability density function, as opposed to a representation in the log-likelihood or log-likelihood-ratio domain.

2 Model

To demonstrate the distinction between our approach and those of Rao (2004) and Denève (2005), we begin by replicating their derivation of a discrete dynamical system that performs exact Bayesian inferences. We start with a formulation of Bayes' rule that calculates the probability that at time step n a particular stimulus (θ_i) is being presented, given the total history of all previous stimulus-dependent activity (v^n, v^{n−1}, . . . , v^0). Thus, if we denote the event θ^n = θ_i as θ^n_i, we may write the conditional probability that the stimulus takes on some value given the history of the activity pattern as

$$\Pr\left(\theta^n_i \mid v^n, v^{n-1}, \ldots, v^0\right) = \frac{\Pr\left(v^n \mid \theta^n_i, v^{n-1}, \ldots, v^0\right)}{\Pr\left(v^n \mid v^{n-1}, \ldots, v^0\right)} \times \sum_j \Pr\left(\theta^n_i \mid \theta^{n-1}_j, v^{n-1}, \ldots, v^0\right) \Pr\left(\theta^{n-1}_j \mid v^{n-1}, \ldots, v^0\right). \tag{2.1}$$
The assumptions of a first-order Markov process for the stimulus and conditional independence of v^n given θ^n_i (see Figure 1) imply that a discrete dynamical system for the quantity

$$u^n_i = \Pr\left(\theta^n_i \mid v^n, v^{n-1}, \ldots, v^0\right) \tag{2.2}$$
is given by

$$u^n_i = \frac{\Pr\left(v^n \mid \theta^n_i\right)}{\Pr\left(v^n \mid v^{n-1}, \ldots, v^0\right)} \sum_j \Pr\left(\theta^n_i \mid \theta^{n-1}_j\right) u^{n-1}_j. \tag{2.3}$$
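Equation 2.3 is the standard forward (filtering) recursion for an HMM; a minimal sketch, with variable names ours:

```python
import numpy as np

def filter_step(u_prev, trans, lik):
    """One step of equation 2.3.

    u_prev[j] : posterior u^{n-1}_j from the previous step
    trans[i,j]: Pr(theta^n_i | theta^{n-1}_j)
    lik[i]    : Pr(v^n | theta^n_i)
    """
    prior = trans @ u_prev    # sum_j Pr(theta_i | theta_j) u^{n-1}_j
    post = lik * prior        # reweight by the likelihood of v^n
    return post / post.sum()  # divide by evidence Pr(v^n | v^{n-1}..v^0)
```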
At this point, Rao (2004) takes the natural log of this equation so that the dependence on the noisy input (v^n) comes in as an additive driving force
Figure 1: Graphical representation of a hidden Markov model. Arrows indicate conditional independence.
rather than a multiplicative one. Unfortunately, this causes the transition probabilities to enter the equation through a term that is the natural log of a sum. This nonlinearity is eliminated through a numerical parameter fit. It is then assumed that both the additive driving term and the resulting linear approximation to the transition probability term are of order Δt, so that a first-order linear differential equation may be obtained. In contrast, here we start with the goal of obtaining a temporally continuous first-order dynamical system and explicitly delineate the assumptions necessary to reach this conclusion. In particular, we begin by noting that for a discrete system of the form

$$\vec{u}^{\,n} = \vec{h}\left(\vec{u}^{\,n-1}, \vec{v}^{\,n}, \Delta t\right), \tag{2.4}$$

a first-order dynamical system can be obtained when

$$\vec{h}\left(\vec{u}^{\,n-1}, \vec{v}^{\,n}, \Delta t\right) = \vec{u}^{\,n-1} + \Delta t\, \vec{h}_1\left(\vec{u}^{\,n-1}, \vec{v}^{\,n}\right) + O(\Delta t^2). \tag{2.5}$$
It can be shown that to put equation 2.3 into this form, we need only make the reasonable assumptions that (1) the probability of transitioning from state j to state i ≠ j is proportional to the discrete time interval Δt and (2) the ability of a particular pattern of activity to demonstrate a preference for a given stimulus is also proportional to Δt, the time interval over which the pattern of activity is observed. Written in terms of the relevant probability distributions, assumption 1 is given by

$$\Pr\left(\theta^n_i \mid \theta^{n-1}_j\right) = \delta_{ij} + \Delta t\, w_{ij}, \tag{2.6}$$

where δ_ij is the Kronecker delta, Σ_i w_ij = 0, and w_ij > 0 for all i ≠ j. Similarly, assumption 2 is given by

$$\frac{\Pr\left(\theta^n_i \mid v^n\right)}{\Pr\left(\theta^n_i\right)} = 1 + \Delta t\, d^n_i + O(\Delta t^2), \tag{2.7}$$
where the d^n_i = d_i(v^n) are subject to the condition d⃗^n · b⃗ = 0 for all n, where b_i = Pr(θ^n_i) gives the prior or bias in the stimulus. Note that d^n_i can be thought of as the growth rate of the conditional preference for θ^n_i given v^n. Clearly, assumption 1 is just a restatement of the assumption that the stimulus is governed by a continuous-time Markov process and thus has Poisson transition times. Assumption 2 is slightly more restrictive. It is easy to show that any set of discrete-time HMMs that converges to a continuous HMM as Δt goes to zero has the property that the left-hand side of equation 2.7 has a Taylor expansion in Δt. However, it is not generally the case that the first term of this expansion is equal to one, as is required to put equation 2.3 into the form of equation 2.5. Fortunately, a wide class of biologically plausible stimulus-dependent patterns of activity has this property. Specifically, a layer of cortex modeled by a set of independent Poisson neurons with stimulus-dependent rates has this property, as does a layer of cortex modeled by a covariant gaussian distribution associated with a dense set of tuning curves. (See appendix A for an example of the Poisson case.) Regardless, equations 2.6 and 2.7 can be used with equation 2.3 after another application of Bayes' rule,

$$\Pr\left(v^n \mid \theta^n_i\right) = \frac{\Pr\left(\theta^n_i \mid v^n\right) \Pr\left(v^n\right)}{\Pr\left(\theta^n_i\right)}, \tag{2.8}$$

to yield

$$\Pr\left(v^n \mid v^{n-1} \ldots v^0\right) \vec{u}^{\,n} = \Pr\left(v^n\right) \left(I + \Delta t\, D^n\right)\left(I + \Delta t\, W\right) \vec{u}^{\,n-1} + O(\Delta t^2), \tag{2.9}$$

where D^n is a diagonal matrix with diagonal entries given by the d^n_i defined by equation 2.7, W is the input-independent, recurrent weight matrix with elements given by the transition probabilities, and I is the identity matrix. To obtain an expression for Pr(v^n | v^{n−1} . . . v^0), we simply take the dot product of equation 2.9 with 1⃗, the vector composed entirely of ones, while recalling that 1⃗ · u⃗^n = 1⃗ · u⃗^{n−1} = 1 since both u⃗^n and u⃗^{n−1} represent probability distributions. This yields

$$\Pr\left(v^n \mid v^{n-1} \ldots v^0\right) = \Pr\left(v^n\right)\left(1 + \Delta t\, \vec{1} \cdot \left(D^n + W\right) \vec{u}^{\,n-1}\right) + O(\Delta t^2). \tag{2.10}$$
Then, taking the ratio of equations 2.9 and 2.10, we obtain

$$\vec{u}^{\,n} = \frac{\vec{u}^{\,n-1} + \Delta t\left\{\left(D^n + W\right)\vec{u}^{\,n-1}\right\} + O(\Delta t^2)}{1 + \Delta t\left\{\vec{1} \cdot \left(D^n + W\right)\vec{u}^{\,n-1}\right\} + O(\Delta t^2)}, \tag{2.11}$$
where the denominator corresponds to a form of divisive normalization (Heeger, 1992; Nelson, 1994), which, in this particular context, is required to ensure that u⃗(t) is a probability distribution. Expanding this equation to first order in Δt, we obtain

$$\vec{u}^{\,n} = \vec{u}^{\,n-1} + \Delta t\left\{\left(D^n + W\right)\vec{u}^{\,n-1} - \left[\vec{1} \cdot \left(D^n + W\right)\vec{u}^{\,n-1}\right] \vec{u}^{\,n-1}\right\} + O(\Delta t^2). \tag{2.12}$$

Taking the limit as Δt goes to zero, we find that

$$\frac{d\vec{u}}{dt} = D(t)\,\vec{u} + W\vec{u} - \left(\vec{d}(t) \cdot \vec{u}\right)\vec{u}, \tag{2.13}$$
where we have used the definition of the matrix D(t) and the fact that 1⃗ · W = 0⃗^T to simplify the uniform inhibitory term. Recalling that D(t) is a function of the time-dependent stimulus v(t), we see that we have obtained a first-order quadratic nonlinear recurrent network for the exact computation of Bayesian inferences. Additionally, we note that the quadratic inhibitory term in equation 2.13 is a result of the divisive normalization in equation 2.11 and has the same effect in both equations. For example, when D(t) is constant, the solution to equation 2.13 is

$$\vec{u}(t) = \frac{\exp\left[\left(D + W\right)t\right] \vec{u}_0}{\left(1 - \vec{1} \cdot \vec{u}_0\right) + \vec{1} \cdot \exp\left[\left(D + W\right)t\right] \vec{u}_0}. \tag{2.14}$$
This expression is similar to the divisive normalization reported in cortical circuits (Heeger, 1992; Nelson, 1994).
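Equations 2.13 and 2.14 can be checked numerically; a sketch assuming NumPy and SciPy, with a generator W whose columns sum to zero and a constant drive d orthogonal to the uniform prior:

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(1)
K, dt, T = 4, 1e-4, 2.0
W = rng.uniform(0.1, 1.0, (K, K))
np.fill_diagonal(W, 0.0)
W -= np.diag(W.sum(axis=0))   # columns sum to zero: 1.W = 0^T
d = rng.normal(size=K)
d -= d.mean()                 # constant drive with d.b = 0 (uniform b)

u = np.full(K, 1.0 / K)
for _ in range(int(T / dt)):  # Euler integration of equation 2.13
    u = u + dt * (d * u + W @ u - (d @ u) * u)

u0 = np.full(K, 1.0 / K)      # closed form of equation 2.14
e = expm((np.diag(d) + W) * T) @ u0
u_exact = e / ((1.0 - u0.sum()) + e.sum())
print(np.abs(u - u_exact).max(), u.sum())  # small error; sum stays near 1
```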
3 Estimating Model Performance

Solutions to equation 2.13 have the property that 1⃗ · u⃗(t) = 1 for all time. As noted in previous work (Sahani & Dayan, 2003), this property, while useful in the context of the above derivation, is not a necessary feature of the activity of a recurrent network that directly represents a probability distribution; the amplitude may be used to represent some other variable. In this context, the natural choice is to use the amplitude to represent an estimate of model performance. The performance of the model can be assessed at any given time by computing the likelihood of the data under the current model parameters M. A procedure similar to that used above may be used to generate an expression for the evolution of this quantity, leading to

$$L^n = \Pr\left(v^n, v^{n-1} \ldots v^0 \mid M\right) = \Pr\left(v^n \mid v^{n-1} \ldots v^0, M\right) L^{n-1}. \tag{3.1}$$
Using equation 2.9 to evaluate the conditional dependence of the current activity pattern on the history of the activity pattern, we see that

$$L^n = L^{n-1}\, \Pr\left(v^n \mid M\right) \left(1 + \Delta t\, \vec{d}^{\,n} \cdot \vec{u}^{\,n-1}\right). \tag{3.2}$$
Taking the natural log, expanding to first order in Δt, and eliminating terms that are independent of the model parameters yields

$$L^*(t) - L^*(0) = \int_0^t \vec{d}(s) \cdot \vec{u}(s)\, ds, \tag{3.3}$$
where L*(t) is a function optimized by the same model parameters that optimize L(t) when the model parameters accurately represent the average stimulus-independent and time-independent behavior of the input patterns, that is, when terms of the form Pr(v^n|M) may be neglected. (See appendixes A and B for details.) Thus, the expected value of d⃗ · u⃗ provides the proper objective function for evaluating the performance of the model, while a local average of d⃗ · u⃗ provides a local estimate of model performance. Our goal is to find a differential equation over a new variable ũ⃗ that has a solution proportional to the solution of equation 2.13, ũ⃗(t) = α(t) u⃗(t), and for which α(t) can be related to a local average of d⃗ · u⃗. This can be achieved by suitably adjusting the term responsible for the divisive normalization. In particular, a dynamical system of the form

$$\frac{d\tilde{\vec{u}}}{dt} = \left(D(t) + W\right)\tilde{\vec{u}} - f(\alpha)\,\tilde{\vec{u}} \tag{3.4}$$
has a solution given by ũ⃗(t) = α(t) u⃗(t), where α(t) obeys

$$\frac{d \log(\alpha)}{dt} = \vec{d} \cdot \vec{u} - f(\alpha). \tag{3.5}$$
Thus, if f(α) is a monotonically increasing function, this quantity is attracted to a local average of d⃗ · u⃗, as desired. In particular, if f(α) = β log(α), then β log(α) is a local average of d⃗ · u⃗. Moreover, since the effective time constant of equation 3.5 is given by the inverse of the parameter β, we may adjust the effective time window over which the "local" average of model performance is taken by simply manipulating this parameter. Of course, our choice of f(α), the coefficient of this divisive normalization term in equation 3.4, is fairly arbitrary. Indeed, if we wish to restrict
ourselves to dynamical systems with quadratic nonlinearities, then we should choose f(α) to be linear, obtaining

$$\frac{d\tilde{\vec{u}}}{dt} = \left(D(t) + W + \beta I\right)\tilde{\vec{u}} - \gamma\left(\vec{1} \cdot \tilde{\vec{u}}\right)\tilde{\vec{u}}, \tag{3.6}$$
which has solutions given by ũ⃗ = α(t) u⃗, but with α(t) obeying the equation

$$\frac{d\alpha}{dt} = \alpha\left(\vec{d} \cdot \vec{u} + \beta - \gamma\alpha\right), \tag{3.7}$$
where β and γ may be used to adjust the effective time constant of integration for the evolution of the amplitude.

4 Special Case: Log Probability

The equations we have derived so far are general in the sense that they make no assumption about the mapping between the activity of the neurons and the encoded distributions (i.e., the function f(x) in equation 1.1). We now consider special cases that have been investigated in other studies and show that our exact equation may be reduced to yield the results of previous work when identical assumptions are made. For instance, Denève (2005) has recently obtained a set of equations for exact Bayesian inferences for binary Markov random variables using neurons encoding the log-likelihood ratio: f(x) = log(x) − log(1 − x). To determine the multidimensional analog, we first assume that f(x) = log(x) and transform equation 3.6 to obtain

$$\frac{d\tilde{\phi}_i}{dt} = d_i(t) + \exp\left(-\tilde{\phi}_i\right) \sum_j w_{ij} \exp\left(\tilde{\phi}_j\right) - \sum_j \exp\left(\tilde{\phi}_j\right), \tag{4.1}$$
where φ̃_i = log(ũ_i) = log(α) + log(u_i). This equation can be further simplified by considering the complete set of likelihood ratios defined as φ_ij = log(u_i) − log(u_j). This yields the multidimensional equivalent of the binary case derived by Denève (2005):

$$\frac{d\phi_{ij}}{dt} = d_i - d_j + \sum_k \left[ w_{ik} \exp\left(\phi_{ki}\right) - w_{jk} \exp\left(\phi_{kj}\right) \right]. \tag{4.2}$$
A simple linear transformation allows an equation of this form to be mapped onto a biologically plausible network that functions by linearly summing inputs under the assumption that the action of a dendrite is to exponentiate the rate of the presynaptic neuron.
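The consistency of the probability-domain and log-likelihood-ratio-domain dynamics can be verified numerically; a sketch assuming NumPy, integrating equations 2.13 and 4.2 side by side (the comparison is loose because both integrations use simple Euler steps):

```python
import numpy as np

rng = np.random.default_rng(2)
K, dt, T = 3, 1e-4, 1.0
W = rng.uniform(0.1, 1.0, (K, K))
np.fill_diagonal(W, 0.0)
W -= np.diag(W.sum(axis=0))   # valid generator: 1.W = 0^T
d = rng.normal(size=K)
d -= d.mean()

u = np.full(K, 1.0 / K)       # probability domain, equation 2.13
phi = np.zeros((K, K))        # phi_ij = log(u_i) - log(u_j), equation 4.2

for _ in range(int(T / dt)):
    u = u + dt * (d * u + W @ u - (d @ u) * u)
    e = np.exp(phi)                   # e[k, i] = exp(phi_ki)
    s = np.einsum('ik,ki->i', W, e)   # sum_k w_ik exp(phi_ki)
    phi = phi + dt * (d[:, None] - d[None, :] + s[:, None] - s[None, :])

ratios = np.log(u)[:, None] - np.log(u)[None, :]
print(np.abs(phi - ratios).max())  # small, and shrinks further with dt
```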
1352
J. Beck and A. Pouget
5 Linearized Equations A related study (Rao, 2004) considered a similar log probability encoding ( f (x) = log(x)). However, since many models of neural activity assume that a linear dynamical system can be used to model neural activity (Dayan & Abbott, 2001), Rao generated a linear differential equation for Bayesian inference by numerically approximating the log of a sum as a sum of logs. Unfortunately, as equation 4.1 indicates, Bayes-Markov inference cannot always be reduced to a linear dynamical system under a log-encoding scheme, and while this approach worked well enough for several cases of interest, that study did not make clear the conditions under which the approximation is likely to hold. Here, we address this issue by considering the assumptions necessary to linearize equations 2.13, 4.1, or 4.2. As shown in appendix C, an equivalent linearization follows naturally from the assumption that the input has a low amplitude, that is, |d(t)| 1. For equation 4.1, this implies that |φ˜ i − φ˜ j − (log(b i ) − log(b j ))| 1 and, similar to Rao (2004), this leads to a linear dynamical system of the form d φ˜ i1 = di (t) + wij∗ φ˜ 1j − log(b j ) − γ b j φ˜ 1j − log(b j ) , dt j j
(5.1)
but with recurrent weights given by wij∗ = wij
bj bi
(5.2)
rather than by a numerical fit. Note also that the parameter γ > 0 does not affect the output probability distribution but is necessary to ensure the linear stability of solutions. Alternatively, we note that the same assumption necessary to linearize the equation in the log domain can be applied to equation 2.13, leading to a compellingly simple equation for an approximate probability distribution, dui1 wij u1j . = b i di (t) + dt j
(5.3)
This implies that the approximation used in Rao is valid in the same regime as the simple linear model in the probability domain given by equation 5.3. Furthermore, when the driving function di (t) is obtained by linearly summing input rates, this equation takes the form of a typical rate model
Exact Inferences in a Neural Implementation of an HMM
1353
for neural activity (Dayan & Abbott, 2001), d u˜ 1 − 1 · u˜ 1 ), = BMv (t) + Wu˜ 1 + γ b(1 dt
(5.4)
where B is the diagonal matrix given by the bias vector, so that the matrix BM has the property that 1 · BM = 0 T . Once again, γ > 0 is an adjustable parameter that does not affect the output probability distribution, but is necessary to ensure that the solution is asymptotically stable and, in the absence of inputs, equal to the bias vector. Moreover, equation 3.3 implies that when the stimulus is unbiased, maximum likelihood parameter estimation for this equation has an objective function given by d · u = u1 · Mv ,
(5.5)
which we recognize as the quantity maximized by a constrained Hebbian learning rule associated with such a simple linear model of neural activity (Dayan & Abbott, 2001). Once again, note that as in the log-likelihood ratio case, linearizing eliminates the possibility of using the amplitude to encode model performance. 6 Discussion We have presented a systematic technique for the generation of a system of equations that track the evolution of a dynamical neural network for the case of hidden (and hierarchical hidden) Markov models. This technique was shown to benefit from being exact, requiring no parameter fit, and retaining a simple interpretation of recurrent weights—that is, excitatory recurrent weights are transition probabilities. Moreover, we have demonstrated that since the amplitude of a population that directly encodes a probability distribution is arbitrary, we can use it to incorporate an online estimate of model performance. For the purposes of comparison with previous work, we then generated equations for the evolution of the log likelihood and loglikelihood ratios, as well as linear approximations to the dynamical systems that calculate the likelihood and log-likelihood functions. We now consider the biologically plausibility of Bayes inferences in the likelihood and log-likelihood domains. One advantage of the likelihood implementation we have proposed is that exact Bayesian inference for an HMM requires only divisive normalization and dendritic multiplication, both of which have been observed in neurons (Heeger, 1992; Pena & Konishi, 2001; Gabbiani, Krapp, Koch, & Laurent, 2002). In contrast, much previous work has argued for the implementation of Bayesian inferences in the log probability domain, typically by arguing that such implementations do not require that neurons perform any complicated actions such as
1354
J. Beck and A. Pouget
multiplication. Unfortunately, this is not the case. While it is true that the log transformation eliminates the multiplicative terms associated with the feedforward terms, equation 4.1 clearly demonstrates that for tasks more complicated than binary decision making, this transformation can result in a dynamical system that has an even more complicated nonlinear interaction associated with the recurrent terms. Rao (2004), resolved this issue through the generation of a linear dynamical system that approximates the full nonlinear system. However, the derivation of equation 5.1 demonstrates that the assumption necessary to make this type of approximation valid in the log-likelihood domain could just as easily have been applied in the likelihood domain to obtain an equally plausible linear dynamical system. Additionally, an implementation of Bayes inferences for an HMM capable of incorporating prior knowledge on short timescales would require a network capable of dynamically adjusting the recurrent connection strengths. In both the likelihood and log-likelihood domains, the recurrent connection strengths are multiplied by the synaptic input, so that if priors are to be included, then even the linear equations may ultimately require neurons capable of performing a multiplicative operation. Thus, it seems that there is both little distinction between and little utility of linearization in the likelihood and log-likelihood domains. This may be why the log of the sum approximation has been dropped in a more recent work (Rao, 2005), which identifies the attracting state of the membrane potential of a neuron with the log of the equation for Bayesian inference (see equation 2.3). The firing rate of such a neuron is then obtained by exponentiating this quantity so that the firing rate is approximately proportional to the likelihood function. The log of sums is implemented by a dendritic filtering function that applies a log transform to a linear combination of the input rates. In the work presented here, we are agnostic as to the interpretation and evolution of membrane potential and simply remark that, if neuronal dynamics can implement the model described in Rao (2005) for inputs generated by an HMM (as opposed to the static stimulus which is analyzed in that work), then the resulting rate dynamics should approximate the exact evolution equation for the likelihood function, equation 2.13, derived above. Moreover, we note that the utility of this work lies in the demonstration that the consideration of a dynamic as opposed to static stimulus actually simplifies network dynamics required to implement Bayesian inference. Finally, we argue that simply representing the likelihood function is inefficient and suggest that the ability of equation 3.6 to represent model performance through mean activity of the network can be used to explain one of the experimental discrepancies of the Bayesian information integration model used by Shadlen and Newsome (2001) to model the behavior of LIP neurons. In that work, a weakly tuned binary stimulus (left or right motion) was shown, and neurons in LIP were observed to initially exhibit a linear increase or decrease in activity depending on their direction selectivity.
Exact Inferences in a Neural Implementation of an HMM
1355
Figure 2: Predictions for the four explicit representations of the probability distribution for a binary decision task. Solid lines represent right selective neurons, while dashed lines represent left selective neurons. In the likelihood and loglikelihood ratio cases, left selective neurons decrease in activity at the same rate as right selective neurons. In the log-likelihood case, the decrease is larger than the increase. Only when the amplitude is used to encode model performance does the increase in activity of the right selective neuron exceed the decrease in the left selective neurons, as is observed in Shadlen and Newsome (2001).
However, the linear decrease was always observed to be of lesser absolute slope than the linear increase, indicating that mean activity increases. As noted in that work, this is incompatible with the interpretation of the activity of left and right selective neurons as representative of log-likelihood ratios. In particular, if these LIP neurons are tracking log-likelihood ratios or potential functions or any linearized quantity that is a function of the probability, then the mean activity should not change (see Figure 2c). This is also the case when these LIP neurons are assumed to represent the likelihood function directly (see Figure 2a). Similarly, when the mean activity represents the log likelihood, the concavity of the log function implies that the mean activity should actually decrease (see Figure 2b). In contrast, if the amplitude of the population (or in this case the average activity of the
1356
J. Beck and A. Pouget
pair of neurons) also encodes the likelihood of the recent data given the model parameters, then we would expect an increase in mean activity to accompany the presentation of a familiar stimulus (see Figure 2d). To our knowledge, this is the only model that explicitly represents the probability distribution and accounts for this feature of the relevant experimental data in LIP. Thus, we conclude that the implementation of exact Bayes-Markov inferences for biologically plausible stimulus-dependent patterns of activity in the likelihood domain may be amplitude modulated to provide an online representation of model performance in a manner consistent with both biological constraints and physiological measurements. Appendix A: Conditional Preference The only departure from a standard hidden Markov that was required to yield an exact implementation of Bayesian inference through a first-order differential equation was the requirement that the conditional preference have a Taylor expansion in t, given by Pr v n |θin Pr θin |v n n = 1 + tdin + O t 2 . = n Pr (v ) Pr θi
(A.1)
By considering a series of discrete-time hidden Markov models that converge to a continuous-time HMM, it is possible to show that a Taylor expansion in t necessarily exists, but that it is not generally the case that the first term of the expansion is one. This additional restriction holds only when a single snapshot of the stimulus-dependent pattern of activity is noninformative, that is, when Pr(θin |v , t = 0) = Pr(θin ). Fortunately, when the stimulus-dependent pattern of activity is composed of independent asynchronous events, such an event occurs with zero probability in a time window of zero width. This implies that zero-width time windows are noninformative. As an example, consider a stimulus-dependent pattern of activity that is a collection of independent Poisson processes with stimulus-dependent (and nonzero) rates,
Pr(v n |θ, t) =
(t f i (θ ))tvin exp (−t f i (θ )) , tvin ! i
(A.2)
where vin is now thought of as a discrete estimate of the rate of neuron i obtained by summing Poisson spikes in a time window of length t and f i (θ ) is the tuning curve of that neuron. For such a distribution
Exact Inferences in a Neural Implementation of an HMM
1357
equations 2.7 and A.1 take the form Pr(v n |θ, t) (t f i (θ ))tvi exp (−t f i (θ )) = tvin . Pr(v n |t) t ¯f i exp −t ¯f i i n
(A.3)
The log of this equation yields
Pr(v n |θ, t) log Pr(v n |t)
= t
vin
log
i
f i (θ ) ¯f i
− f i (θ ) + ¯f i ,
(A.4)
which is proportional to t since vin is an estimate of rate (as opposed to spike count) and may be considered an order one quantity. Exponentiation of this equation then leads to the desired expansion in t provided that f i (θ ) is nonzero for all values of the stimulus. Appendix B: Estimating Model Performance To generate an expression for the evolution of the probability of the input pattern of activity given the model parameters M, we recall that L n = Pr(v n , v n−1 , . . . , v 0 |M) = Pr(v n |v n−1 , . . . , v 0 , M)L n−1 .
(B.1)
Using equation 2.9 to evaluate the conditional dependence of the current activity pattern on the history of the activity pattern, we see that L n = L n−1 Pr(v n |M)(1 + t d n · u n−1 ),
(B.2)
Taking the natural log, expand = 0. where we have used the fact that WT n ing to order t yields n−1 log(L n ) = log(L n−1 ) + log(Pr(v n |M)) + t d n · u
(B.3)
or, equivalently, log(L N ) = log(L 0 ) +
N j=1
log(Pr(v j |M)) +
N
t d j · u j−1 .
(B.4)
j=1
Written in this way, we note that the first sum in equation B.4 represents the probability of the stimulus-dependent activity under the assumption that each interval is independent. However, for a hidden Markov model, each stimulus-dependent pattern of activity is independent only when conditioned on the stimulus; therefore, this term represents the log likelihood
1358
J. Beck and A. Pouget
of the stimulus-dependent pattern of activity under the assumption that the pattern is independent of the time course of the stimulus. This implies that if the model parameters accurately characterize the average behavior of the stimulus-dependent pattern of activity, but not necessarily the actual stimulus dependence, then this term may be neglected, giving rise to the expression in equation 2.17. Once again a consideration of the independent Poisson case is helpful. In this case, equation A.3 implies that the first sum in equation B.4 has terms of the form log (Pr(v n |M, t)) = t
vin log ¯f i − ¯f i
i
+
tvin
tvin
log (ti ) −
i
log ( j)
(B.5)
j=1
However, only terms that contain ¯f i are affected by model parameters so that a maximum likelihood estimate may ignore the second line of equation B.5. Thus maximizing L is equivalent to maximizing L ∗ if L ∗ (T) is given by L ∗ (T) = L ∗ (0) +
T
·u d(s) (s)ds +
0
0
T
vi (s) log ¯f i − ¯f i ds.
(B.6)
Finally, if the model parameters do not affect the stimulus-independent mean rates of the input neurons ¯f i , then we may conclude that d L∗ = d · u , dt
(B.7)
as desired. Appendix C: Linearizations To generate a linear dynamical system that approximates a Bayes-Markov network we could linearize any of the exact equations 2.13, 3.6, 4.1, or 4.2. To demonstrate that such linearizations follow naturally from the assumption of weakly tuned input, we now assume that D(t) = εD1 (t) and ui (t) = b i + εui1 (t) + O(ε 2 ). Inserting this ansatz into equation 2.14 yields dui1 = b i di1 (t) + wij u1j + O (ε) . dt j
(C.1)
Exact Inferences in a Neural Implementation of an HMM
1359
Equation 5.4 is obtained by noting that the recurrent weight matrix is sin respectively. gular with the left and right eigenvectors given by 1 and b, Since the dynamical system is linear, an addition of a recurrent term proportional to (1 · u )b and a driving term proportional to b will have the effect of adding a vector proportional to b to the solution, while enhancing the linear stability of the solution. For the purposes of comparison with Rao (2004), we also choose to linearize equation 4.1. First, we note the uniform inhibitory term in equation 4.1 can be replaced by an arbitrary function without affecting the resulting probability distribution since exp(φi (t) + χ(t)) exp(φi (t)) ui (t) = = exp(φi (t) + χ(t)) exp(φi (t)) j
(C.2)
j
for any χ(t). Thus, a linear dynamical system that approximates a BayesMarkov network is concerned only with the accuracy of the linear approximation to the nonlinearity, which contributes the excitatory term, that is, the term given by
wij exp φ˜ j − φ˜ i .
(C.3)
j
When D is of order ε,
φ˜ j − φ˜ i = φ j − φi ≈ log(b j ) − log(b i ) + ε
u1j bj
−ε
ui1 , bi
(C.4)
which implies that
wij exp φ˜ j − φ˜ i ≈
j
j
bj wij bi
u1j
u1 1+ε −ε i bj bi
=ε
j
wij
u1j bi
, (C.5)
where we have used the fact that j
wij b j = 0.
(C.6)
1360
J. Beck and A. Pouget
Defining φ˜ 1j so that ε
u1j bj
= φ˜ 1j − log(b j ) − log(α(t)),
(C.7)
we may combine equation 5.1 with the approximation given by equation C.5 to obtain the linearized equation d φ˜ i1 wij∗ φ˜ 1j − wij∗ log(b j ), = di (t) + β + dt j j
(C.8)
with wij∗ = wij
bj . bi
(C.9)
Recalling that any uniform term may added to an equation that tracks the log likelihood, we obtain equation 5.1 by simply adding an arbitrary uniform term given from a sum of the φ˜ j ’s. Acknowledgments J.B. was supported by NIH training grant T32-MH19942, and A.P. was supported by NSF grant 0346785. References Anderson, C. (1994). Neurobiological computational systems. In J. M. Zurada, R. J. Marks, & C. J. Robinson (Eds.), Computational intelligence imitating life. New York: IEEE Press. Carpenter, R. H., & Williams, M. L. (1995). Neural computation of log likelihood in control of saccadic eye movements. Nature, 377(6544), 59–62. Dayan, P., & Abbott, L. F. (2001). Theoretical neuroscience. Cambridge, MA: MIT Press. Den`eve, S. (2005). Bayesian inferences in spiking neurons. In L. K. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing systems, 17 (pp. 353–360). Cambridge, MA: MIT Press. Gabbiani, F., Krapp, H. G., Koch, C., & Laurent, G. (2002). Multiplicative computation in a visual neuron sensitive to looming. Nature, 420(6913), 320–324. Gold, J. I., & Shadlen, M. N. (2001). Neural computations that underlie decisions about sensory stimuli. Trends in Cognitive Sciences, 5, 10–16. Heeger, D. J. (1992). Normalization of cell responses in cat striate cortex. Visual Neuroscience, 9, 181–197. Knill, D. C., & Pouget, A. (2004). The Bayesian brain: The role of uncertainty in neural coding and computation. Trends Neurosci., 27(12), 712–719.
Exact Inferences in a Neural Implementation of an HMM
1361
Koechlin, E., Anton, J. L., & Burnod, Y. (1999). Bayesian inference in populations of cortical neurons: A model of motion integration and segmentation in area MT. Biol. Cybern., 80(1), 25–44. Nelson, M. E. (1994). A mechanism for neuronal gain control by descending pathways. Neural Computation, 6, 242–254. Pena, J. L., & Konishi, M. (2001). Auditory spatial receptive fields created by multiplication. Science, 292(5515), 249–252. Rao, R. P. (2004). Bayesian computation in recurrent neural circuits. Neural Comput., 16(1), 1–38. Rao, R. P. (2005). Bayesian inference and attention modulation in the visual cortex. Neuroreport, 16(16), 1843–1848. Sahani, M., & Dayan, P. (2003). Doubly distributional population codes: Simultaneous representation of uncertainty and multiplicity. Neural Comput., 15(10), 2255– 2279. Shadlen, M. N., & Newsome, W. T. (2001). Neural basis of a perceptual decision in the parietal cortex (area LIP) of the rhesus monkey. J. Neurophysiol., 86(4), 1916–1936. Yu, A. J., & Dayan, P. (2005). Inference, attention, and decision in a Bayesian neural architecture. In L. K. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing systems, 17. Cambridge, MA: MIT Press.
Received June 10, 2005; accepted August 18, 2006.
LETTER
Communicated by Walter Senn
Multispike Interactions in a Stochastic Model of Spike-Timing-Dependent Plasticity Peter A. Appleby [email protected] Institute for Theoretical Biology, Humboldt-Universit¨at zu Berlin, Invalidenstraße 43, D-10115 Berlin, Germany
Terry Elliott [email protected] Department of Electronics and Computer Science, University of Southampton, Highfield, Southampton, SO17 1BJ U.K.
Recently we presented a stochastic, ensemble-based model of spiketiming-dependent plasticity. In this model, single synapses do not exhibit plasticity depending on the exact timing of pre- and postsynaptic spikes, but spike-timing-dependent plasticity emerges only at the temporal or synaptic ensemble level. We showed that such a model reproduces a variety of experimental results in a natural way, without the introduction of various, ad hoc nonlinearities characteristic of some alternative models. Our previous study was restricted to an examination, analytically, of twospike interactions, while higher-order, multispike interactions were only briefly examined numerically. Here we derive exact, analytical results for the general n-spike interaction functions in our model. Our results form the basis for a detailed examination, performed elsewhere, of the significant differences between these functions and the implications these differences have for the presence, or otherwise, of stable, competitive dynamics in our model. 1 Introduction The dependence of at least some forms of synaptic plasticity on the relative timing of pre- and postsynaptic spiking activity is now a well-established experimental result (see, e.g., Bi & Poo, 1998; Zhang, Tao, Holt, Harris, & Poo, 1998; Froemke & Dan, 2002). Spike-timing-dependent plasticity (STDP) has mostly been examined from the perspective of the interaction between exactly two spikes, one presynaptic and the other postsynaptic, although spike triplet and spike quadruplet interactions have begun to be studied too (Froemke & Dan, 2002). For two-spike interactions, it is usually found that a presynaptic spike followed by a postsynaptic spike leads to a potentiation of synaptic efficacy, while reversing the order of the two spikes leads to Neural Computation 19, 1362–1399 (2007)
C 2007 Massachusetts Institute of Technology
Multispike Interactions in a Stochastic Model of STDP
1363
a depression of synaptic efficacy (but see Bell, Han, Sugawara, & Grant, 1997). The extent of potentiation or depression depends on the interspike interval, with shorter intervals generally leading to greater changes. Beyond an interspike interval of several tens of milliseconds, no change is observed. Despite the fact that the experimental protocols tend to study multiple spike pairings in order to observe a statistically significant change in synaptic efficacy and that changes are usually measured over the entire synaptic connection between a pair of coupled neurons (that is, not at single synapses), the theoretical literature has assumed that the double exponential-like rule characteristic of STDP is obeyed by every individual synapse for every single spike pairing. The STDP rule has been studied directly to examine its computational consequences (Song, Miller, & Abbott, 2000; van Rossum, Bi, & Turrigiano, 2000), and biophysical models of the dynamics of single synapses have been built that seek to understand the mechanisms that could give rise to the STDP rule (Castellani, Quinlan, Cooper, & Shauval, 2001; Senn, Markram, & Tsodyks, 2001; Karmarker & Buonomano, 2002; Shouval, Bear, & Cooper, 2002). Much attention has also focused on understanding the conditions under which a BienenstockCooper-Munro (BCM)–like (Bienenstock, Cooper, & Munro, 1982) rule can emerge from STDP (Senn et al., 2001; Izhikevich & Desai, 2003; Burkitt, Meffin, & Grayden, 2004), and how to extend the two-spike STDP rule to accommodate the data from spike triplet and quadruplet interactions (Froemke & Dan, 2002). Previously we presented a stochastic, ensemble-based view of STDP in which STDP is an emergent property of neurons at the temporal or synaptic ensemble level (Appleby & Elliott, 2005). We showed that it is possible to build a model in which single synapses respond to spike pairs by adjusting their synaptic efficacies in a fixed, all-or-none fashion (cf. Petersen, Malenka, Nicoll, & Hopfield, 1998; O’Conner, Wittenberg, & Wang, 2005). If the second spike arrives within a time window that is stochastically determined, then a fixed level of potentiation or depression will occur, but if the second spike arrives too late, no change in efficacy will take place. Critically, any changes that do take place do not depend on the time of the second spike relative to the first. Because the window size is stochastic, different synapses can respond to the same spike pairing differently, and the same synapse can respond to multiple presentations of the same pairing differently. Only at this temporal or synaptic ensemble level does the overall synaptic connection between a pair of neurons exhibit STDP. Due to the overall structure of the proposed model, we found that the experimental data on spike triplet interactions (Froemke & Dan, 2002) were automatically reproduced, and we also showed that a BCM-like learning rule can emerge from the two-spike interactions. Here, we derive exact, analytical results for the general multispike or n-spike interaction functions, for any n ≥ 2, averaged over all possible sequences of n spikes in our model. This is achieved by extending our previous
1364
P. Appleby and T. Elliott
model slightly and then finally taking a limit that reduces it to its original form. Although we derive exact results for the n-spike interaction function, we show that the asymptotic, large n form of this function is of particular interest, not least because the finite n form rapidly converges to the asymptotic limit, so that the limit is achieved within just a few tens of spikes. These results are required for an analysis, performed elsewhere (Appleby & Elliott, 2006), of the differences between the two-spike and the general n-spike, n > 2, interaction functions, this analysis showing that, at least in our model, the two-spike interaction function’s form is entirely different from the general form observed for all other n-spike interaction functions.
2 Summary of Model and Its Extension We first provide a brief summary of our earlier model, the derivation of the general two-spike interaction function from it (Appleby & Elliott, 2005), and its extension to a slightly more general form that will facilitate the derivation of general multispike interaction functions. We proposed that all synapses that constitute the overall connection between a pair of synaptically coupled neurons can each exist, independent of the other synapses, in one of three functional states. The resting state of a synapse is labeled the OFF state. Following a presynaptic spike, the synapse undergoes a transition from the OFF state to another state, the POT state, while following a postsynaptic spike, the synapse undergoes a transition from the OFF state to the DEP state. Once in the POT state, further presynaptic spiking has no effect on the synapse’s state, and similarly, once in the DEP state, further postsynaptic spiking has no effect. (We shall modify this assumption in order to generalize the model somewhat.) In the absence of postsynaptic (presynaptic) spiking, the synapse stochastically returns to the OFF state from the POT (DEP) state. These stochastic transitions are governed by separate probability density functions (pdfs), f + (t) for the POT → OFF transition, and f − (t) for the DEP → OFF transition; here, the variable t denotes time. However, if the synapse is in the POT (DEP) state and a postsynaptic (presynaptic) spike occurs, then the synapse immediately returns to the OFF state but also triggers potentiation (depression) by a fixed amount A+ (A− ). Thus, the timing of the second, pre- or postsynaptic spike relative to the first spike does not affect the degree of change in synaptic strength: the strength changes in an all-or-none fashion, with no change occurring if the second spike arrives too late, after the synapse has stochastically returned to the OFF state. Because the transitions are stochastic, however, different synapses respond to the same spike pairing differently, and the same synapse responds to multiple presentations of the same spike pairing differently. Only at this synaptic or temporal ensemble level does the overall connection strength between the two neurons respect a spike-timing-dependent form of synaptic plasticity, while
Multispike Interactions in a Stochastic Model of STDP
1365
individual synapses obey a much simpler, double step-function response when presented with a single spike pair. To calculate the average response to a spike pair, we assume that preand postsynaptic spiking are independent Poisson processes of rates λπ and λ p , respectively. Because they are independent, the combined spike train, ignoring the distinction between pre- and postsynaptic events, is also Poisson, with combined rate β = λπ + λ p . Defining λπ = λπ /β and λp = λ p /β, the probability that any particular spike from this combined sequence is presynaptic (postsynaptic) is then just λπ (λp ). The interspike interval between spikes is governed by an exponential probability distribution, with pdf f T (t) = β exp(−βt). We shall label presynaptic (postsynaptic) spikes by the symbol π ( p) for notational convenience and also refer to π spikes and p spikes throughout. Consider the sequence π p. The π spike moves the synapse from OFF to POT. The probability that the synapse is still in the POT state when the p spike arrives is P + (t) =
∞ t
dt f + (t ),
(2.1)
where t is the interspike interval. The average change in synaptic strength triggered by the π p pairing is then just Sπ p (t) = +A+ P + (t).
(2.2)
Similarly, for a pπ pairing, we have Spπ (t) = −A− P − (t),
(2.3)
where P − (t) is identical to P + (t) except that f + is replaced by f − in the integral. These two results are conditional expectation values, conditional on a given spike pairing of a given interspike interval. The average response to any isolated spike pair of any interspike interval is obtained by weighting the conditional expectation values by the probability of each particular spike pair and integrating out the time of the first spike and the interspike interval according to their common pdf, f T (t). Hence, S2 =
i, j∈{π, p}
λi λj
∞
∞
dt0 f T (t0 ) 0
dt1 f T (t1 )Si j (t1 ),
(2.4)
0
where, of course, Sπ π (t) = Spp (t) = 0, and S2 denotes the expected response to an average two-spike pairing. In the remainder of this letter, we shall be principally concerned with the derivation of an expression for Sn , the expected response to an average n-spike sequence.
1366
P. Appleby and T. Elliott
We previously took the pdfs f ± (t) to be governed by integer-order gamma distributions, so that f ± (t) =
(t/τ± )ν± −1 1 exp(−t/τ± ), (ν± − 1)! τ±
(2.5)
where τ± are the characteristic timescales for the stochastic processes and ν± are their orders. Notice that if ν± = 1, then these are just simple exponential distributions. It is then easy to see that P ± (t) = e ν± (t/τ± ) exp(−t/τ± ), where e ν (x) =
ν−1 i=0
K 1± (λ) = 1 −
(2.6)
xi /i! is the truncated exponential series. Defining
1 , (1 + τ± λ)ν±
(2.7)
we then have S2 = λπ λp [A+ K 1+ (β) − A− K 1− (β)].
(2.8)
We now generalize the above model slightly in order to facilitate the calculation of Sn . We assumed that if a synapse is in the POT (DEP) state, then further presynaptic (postsynaptic) spiking has no effect. We now propose instead that after r additional presynaptic (postsynaptic) spikes, the stochastic processes governing the return of the synapse from the POT (DEP) state to the OFF state are reset, provided that the synapse has not returned to the OFF state at any time after the first spike moved it into the POT (DEP) state. The overall result of this is to increase the likelihood that potentiation or depression will occur by essentially discarding the spike history of the synapse after r + 1 identical spikes and starting afresh. For ν± = 1, however, the exponential distribution is memoryless, so such a system should exhibit dynamics independent of the value of r . Strictly, we should allow differing numbers of spikes to reset the POT and DEP states, but it is easy to generalize our results to this case. Furthermore, we are particularly interested in the properties of the r = 1 and r → ∞ models, the latter being equivalent to the original form of our model. Arguably, these r = 1 and r → ∞ models are more biologically plausible than models with other values of r : either a spike of the appropriate sort resets the stochastic process or it does not, and other values of r would require us to postulate some form of spike counting machinery at the synapse. We therefore assume that r additional spikes of the appropriate kind reset both stochastic processes for DEP → OFF and POT → OFF, and we refer to this generalized model as the r -reset model. We call the 1-reset model
Multispike Interactions in a Stochastic Model of STDP
1367
the resetting model and the ∞-reset model the nonresetting model. Our strategy in much of what follows is to extract n-spike results for the small r models, because r limits the possible extent of spike interactions in a long sequence of spikes. The behavior for large r will then be clear, and the results will fall out directly. 3 Preliminaries It is convenient before deriving general results for Sn to derive some preliminary results that will be of use throughout the remainder of the letter. Consider first a sequence of n identical spikes, say, π spikes, followed by a different spike, say, a p spike, giving a total of n + 1 spikes, the first spike occurring at time t0 , and let the subsequent interspike intervals be t1 , . . . , tn , so that ti , i > 0, is the time between the ith spike and (i + 1)th spike. The time of the ith spike is then i−1 j=0 t j , for any i. We assume that the synapse is initially in the OFF state, so that the first π spike at time t0 moves the + synapse into the POT state. Let the multiargument function PON (t1 , . . . , tn ) − be the probability that the synapse is in the POT state (PON for the DEP n ± state and p spikes) at time i=0 ti , when the final spike arrives; clearly PON do not depend on the time of the first spike, t0 , as this merely moves the synapse from the OFF state to the POT or DEP states. For the moment, we do not consider resetting. If, when the second spike arrives, the synapse is OFF due to a stochastic decay, then this second spike moves the synapse back to POT, and the probability, n given OFF+ at time t0 + t1 , that the synapse is in the POT state at time i=0 ti is just PON (t2 , . . . , tn ). The probability of being in the OFF state at time t0 + t1 is just 1 − P + (t1 ), from the previous section. The probability of being in the POT state at time t0 + t1 is obviously P + (t1 ), so we can write + + (t1 , . . . , tn ) = [1 − P + (t1 )]PON (t2 , . . . , tn ) PON + (t2 , . . . , tn |POT at t0 + t1 ), + P + (t1 )PON
(3.1)
+ (t2 , . . . , tn |POT at t0 + t1 ) is the conditional probability that the where PON n synapse is in the POT state at time i=0 ti given that it was still in the POT state when the second spike arrived at time t0 + t1 . We now repeat the above argument, conditioning on whether the synapse is still in the POT state when the third spike arrives at time t0 + t1 + t2 given that it was still in the POT state when the second spike arrived at time t0 + t1 . Suppose that the synapse is in the OFF state at time t0 + t1 + t2 given this sequence. Then the third spike sends the synapse back to POT. The probability that the synapse is in the OFF state at time t0 + t1 + t2 given that it was in the POT state at time t0 + t1 is 1 − P + (t1 + t2 |POT at t0 + t1 ), and the probability that it is still in the POT state is P + (t1 + t2 |POT at t0 + t1 ).
1368
P. Appleby and T. Elliott
Hence, we can now write + + PON (t1 , . . . , tn ) = [1 − P + (t1 )]PON (t2 , . . . , tn ) + + [P + (t1 ) − P + (t1 + t2 )]PON (t3 , . . . , tn ) + + P + (t1 + t2 )PON (t3 , . . . , tn |POT at t0 + t1 + t2 ),
(3.2)
where we have used the definition of the conditional probability, P + (t1 + t2 |POT at t0 + t1 ) = =
P + (POT at t0 + t1 + t2 & POT at t0 + t1 ) P + (POT at t0 + t1 ) P + (t1 + t2 ) . P + (t1 )
(3.3)
Thus, repeating the above argument for all subsequent spikes, we see that + + (t1 , . . . , tn ) = [1 − P + (t1 )]PON (t2 , . . . , tn ) PON + + [P + (t1 ) − P + (t1 + t2 )]PON (t3 , . . . , tn )
+ ··· + + [P + (t1 + · · · + ti−1 ) − P + (t1 + · · · + ti )]PON (ti+1 , . . . , tn )
+ ··· + + [P + (t1 + · · · + tn−2 ) − P + (t1 + · · · + tn−1 )]PON (tn )
+ P + (t1 + · · · + tn ),
(3.4)
+ where clearly PON (t) ≡ P + (t). The conditional expectation value for the extent of potentiation or depression induced by this sequence of n identical spikes followed by a differ+ − ent spike is just +A+ PON (t1 , . . . , tn ) or −A− PON (t1 , . . . , tn ), respectively. As with the two-spike calculation, we convert these into unconditional expectations by integrating out the interspike intervals according to their pdfs and weighting by the spike sequences’ probabilities. We do this for the gen+ − eral distribution PON , not specifying whether PON or PON , and hence we do not at this point weight by the spike sequence probabilities, as these differ for the two possible cases. Because the combined pre- and postsynaptic spike sequence is Poisson with rate β = λπ + λ p , fortunately the interspike pdf is always the same, although in general, for non-Poisson processes, this will not be true. We take equation 3.4 and integrate t1 , . . . , t n over their entire ranges, ti ∈ [0, ∞), having weighted by β n exp(−β nj=1 t j ). Notice that the
Multispike Interactions in a Stochastic Model of STDP
1369
integration over the first spike time, t0 , always produces a result of unity, by the definition of the pdf. Defining
∞
Tl (β) = β l
∞
dt1 · · ·
0
dtl exp −β
0
l
t j PON (t1 , . . . , tl )
(3.5)
j=1
and K l (β) = β l
∞
dt1 · · ·
0
= βl
∞
dtl exp −β
0 ∞
dt 0
l
t j P(t1 + · · · + tl )
j=1
tl−1 −βt e P(t) (l − 1)!
(3.6)
both for l ≥ 1, and defining K 0 (β) ≡ 1, we then have Tn = K n +
n−1
Tn−i (K i−1 − K i ),
(3.7)
i=1
where, unless explicitly indicated otherwise in what follows, the argument of the Ti ’s or K i ’s is always β. Using the gamma pdfs for P(t) above, it is straightforward although tedious to show that K l± (β)
=
τ± β 1 + τ± β
l ν ± −1 i +l −1 i=0
l −1
1 . (1 + τ± β)i
(3.8)
The K i ’s will play an important role throughout. We may now consider how the possibility of spikes resetting the stochastic processes affects the above. In the r -reset model, if r + 1 < n, then after the (r + 1)th identical spike, the process, if it has never deactivated during these identical spike events, is reset. This has the effect of making subsequent probabilities independent of any of the events that occurred before the (r + 1)th identical spike arrived and causes subsequent conditional probabilities to cease to depend on times earlier than t0 + · · · + tr . For concreteness, consider r = 1. In moving between equations 3.1 and + 3.2, we had to expand PON (t2 , . . . , tn |P OT at t0 + t1 ) by conditioning on the state of the synapse when the third spike arrived. However, for r = 1, the second π spike at time t0 + t1 resets the stochastic process because it is given that the synapse has stayed in the POT state since the first π spike at time t0 moved the synapse into the POT state. Thus, subsequent changes in synaptic strength are independent of the fact that the synapse did not move
1370
P. Appleby and T. Elliott
to the OFF state between the first and second π spikes, because the second π spike restarted the stochastic process afresh. Thus, + + PON (t2 , . . . , tn |P OT at t0 + t1 ) = PON (t2 , . . . , tn ) only for r = 1.
(3.9)
+ + (t1 , . . . , tn ) = PON (t2 , . . . , tn ), and of course Hence, equation 3.1 becomes PON + + we finally arrive at PON (t1 , . . . , tn ) = · · · = PON (tn ) ≡ P + (tn ). This equation means that for r = 1, the final spike in a long sequence of identical spikes is the only spike that counts. Hence, for r = 1, we have that Tl = T1 = K 1 , any l, by the definition of the Ti ’s in equation 3.5. Putting Tl = K 1 into equation 3.7, which we now regard as an equation defining the K i ’s in terms of the Ti ’s, it is easy to see that for this r = 1 case, K l = K 1l . Thus, we extract the one-reset model from the nonresetting model by replacing K l by K 1l in any equations in which it occurs. This reflects the fact that the one-reset model replaces P(t1 + · · · + tl ) by li=1 P(ti ). Similar considerations show that the r -reset model replaces K rl+s , 0 ≤ s < r , by K rl K s , so any results that we obtain for the nonresetting model can be converted into those for the r -reset model by the substitution K rl+s → K rl K s . In general, the finite r models are useful precisely because they limit multispike interactions to at most r + 1 spikes, and this enables us, in the next section, to set up a recurrence relation in Sn .
4 Multispike Recurrence Relations We now derive some recurrence relations for Sn for the one-, two- and three-reset models, and then, having developed intuitions based on these small r cases, we can proceed to derive the results for the arbitrary r -reset model. 4.1 1-Reset Model. Let n (t0 , t1 , . . . , tn−1 ) denote an arbitrary sequence of n spikes with interspike intervals ti , i > 0, and t0 the time of the first spike, and let the function Wn (t0 , . . . , tn−1 ) denote the pdf and spike probability weighting for this sequence that is required to transform conditional expectation values into unconditional expectation values. If n contains m π spikes and n − m p spikes, then n−m Wn (t0 , . . . , tn−1 ) = λm exp[−β(t0 + · · · + tn−1 )]. π λp
(4.1)
Let Sn (t1 , . . . , tn−1 ) be the conditional expectation value for the change in synaptic strength induced by the sequence n . This is not a function of the first spike time, t0 , but only of the interspike intervals. Consider a sequence of n + 2 spikes, the first two of which are either both π spikes or both p spikes. For r = 1, the first spike is essentially irrelevant because the second, identical spike either reactivates the same process if
Multispike Interactions in a Stochastic Model of STDP
1371
the synapse is in the OFF state when it arrives at time t0 + t1 or resets the process if the synapse is still in the POT state. We therefore see that Sπ π n (t1 , . . . , tn+1 ) = Sπ n (t2 , . . . , tn+1 ),
(4.2)
Sppn (t1 , . . . , tn+1 ) = Spn (t2 , . . . , tn+1 ).
(4.3)
Now consider the spike sequence π pn (t0 , . . . , tn+1 ). If the synapse has returned to the OFF state before the second, p spike arrives, then the first spike is irrelevant. If the synapse is still in the POT state when the p spike arrives, then there will be a potentiation by an amount A+ , and the remainder of the spike sequence n (t2 , . . . , tn+1 ) contributes to Sπ pn (t1 , . . . , tn+1 ) independent of these first two spikes. Hence, we have that Sπ pn (t1 , . . . , tn+1 ) = [1 − P + (t1 )]Spn (t2 , . . . , tn+1 ) + P + (t1 )[Sn (t3 , . . . , tn+1 ) + A+ ].
(4.4)
Similarly, we have Spπ n (t1 , . . . , tn+1 ) = [1 − P − (t1 )]Sπ n (t2 , . . . , tn+1 ) + P − (t1 )[Sn (t3 , . . . , tn+1 ) − A− ].
(4.5)
Defining Sn =
∞
dt0 · · ·
0
0
∞
dtn−1 Wn (t0 , . . . , tn−1 )Sn (t1 , . . . , tn−1 ),
(4.6)
where the absence of temporal arguments on Sn indicates that interspike intervals have been integrated out according to their pdfs and spike probabilities included, Equations 4.2 and 4.3 become Sπ π n = λπ Sπ n ,
(4.7)
Sppn = λp Spn ,
(4.8)
where the factors of λπ and λp arise because the dependence of Wπ π n or Wppn on the first spike factors out and the remainder of Wπ n or Wpn is absorbed into the definition of Sπ n or Spn . Similarly, Sπ pn = λπ (1 − K 1+ )Spn + λπ λp K 1+ (Sn + A+ Fn ),
(4.9)
Spπ n = λp (1 − K 1− )Sπ n + λπ λp K 1− (Sn − A− Fn ),
(4.10)
1372
P. Appleby and T. Elliott
where
F n =
∞
dt2 · · ·
0
0
∞
dtn+1 Wn (t2 , . . . , tn+1 ) = λπ λp m
n−m
(4.11)
if the spike sequence n contains mπ spikes and n − mp spikes. We nowsum over all possible spike sequences n , of which there π n are 2 . Let denote this sum, and define S = n { } {n } Sn , Sn+1 = n ππ S , S = S , and so on, in an obvious notation. Then π n π π n {n } {n } n+2 we have that ππ π Sn+2 = λπ Sn+1 ,
(4.12)
πp Sn+2
= λπ (1 −
p K 1+ )Sn+1
pπ Sn+2
= λp (1
π K 1− )Sn+1
−
+ λπ λp K 1+ (Sn + A+ ), +
λπ λp K 1− (Sn
(4.13)
− A− ),
(4.14)
Sn+2 = λp Sn+1 , pp
p
(4.15)
π ≡ where we have used the binomial theorem, {n } Fn ≡ 1. Since Sn+2 πp p pπ pp ππ Sn+2 + Sn+2 and Sn+2 ≡ Sn+2 + Sn+2 (by summing over all possip ble second spikes), and Sn ≡ Snπ + Sn , we finally obtain the coupled recurrence relations: π Sn+2 = +λπ λp A+ K 1+ + λπ λp K 1+ Sn + λπ Sn+1 − λπ K 1+ Sn+1 p
p Sn+2
π = −λπ λp A− K 1− + λπ λp K 1− Sn + λp Sn+1 − λp K 1− Sn+1 .
(4.16) (4.17)
These recurrence relations represent our final form for the 1-reset model. 4.2 2-Reset Model. For the one-reset model, we had to look at specific spike patterns extending out to two spikes. For the 2-reset model, we must go out to three spikes, because only the third spike in a sequence of three identical spikes possesses the capacity to reset the stochastic processes. Consider Sπ π π n (t1 , . . . , tn+2 ). As usual, we condition on the state of the synapse as the second and third π spikes arrive. If the synapse is in the OFF state when the second π spike arrives, we obtain a subsequent change governed by Sπ π n (t2 , . . . , tn+2 ). If the synapse is still in the POT state when the second spike arrives, then we look at the state when the third spike arrives. If the synapse is in the OFF state at this time, we obtain a subsequent change governed by Sπ n (t3 , . . . , tn+2 ). However, if the synapse is still in the POT state, then the third spike resets the process, and subsequent changes are still governed by Sπ n (t3 , . . . , tn+2 ). Hence, two conditioning steps lead to Sπ π π n (t1 , . . . , tn+2 ) = P + (t1 )Sπ π n (t2 , . . . , tn+2 |P OT at t0 + t1 ) + [1 − P + (t1 )]Sπ π n (t2 , . . . , tn+2 ),
(4.18)
Multispike Interactions in a Stochastic Model of STDP
1373
where the conditional change Sπ π n (t2 , . . . , tn+2 |POT at t0 + t1 ) = P + (t1 + t2 |POT at t0 + t1 )Sπ n (t3 , . . . , tn+2 ) + [1 − P + (t1 + t2 |POT at t0 + t1 )]Sπ n (t3 , . . . , tn+2 ) = Sπ n (t3 , . . . , tn+2 ),
(4.19)
so that Sπ π π n (t1 , . . . , tn+2 ) = P + (t1 )Sπ n (t3 , . . . , tn+2 ) + [1 − P + (t1 )]Sπ π n (t2 , . . . , tn+2 ).
(4.20)
This cancellation between the conditional probabilities in the final conditioning step is characteristic of all the r -reset models and is due precisely to the stochastic process resetting. Similarly, we obtain + (t1 , t2 )[Sn (t4 , . . . , tn+2 ) + A+ ] Sπ π pn (t1 , . . . , tn+2 ) = PON + + [1 − PON (t1 , t2 )]Spn (t3 , . . . , tn+2 )
(4.21)
+
Sπ pπ n (t1 , . . . , tn+2 ) = P (t1 )[Sπ n (t3 , . . . , tn+2 ) + A+ ] + [1 − P + (t1 )]Spπ n (t2 , . . . , tn+2 )
(4.22)
+
Sπ ppn (t1 , . . . , tn+2 ) = P (t1 )[Spn (t3 , . . . , tn+2 ) + A+ ] + [1 − P + (t1 )]Sppn (t2 , . . . , tn+2 ),
(4.23)
and the remaining four expressions with a leading p spike follow by symmetry. Integrating out the spike times and summing over all spike sequences n as for the 1-reset model, we have πππ ππ π = λπ (1 − K 1+ )Sn+2 + λπ K 1+ Sn+1 Sn+3 2
ππp Sn+3
2 = λπ (1
π pπ Sn+3
= λπ (1
(4.25)
λπ
A+ )
(4.26)
Sn+3 = λπ (1 − T1+ )Sn+2 + λπ λp T1+ (Sn+1 + λp A+ ).
(4.27)
−
pπ T1+ )Sn+2
+
(4.24)
2 λπ λp T2+ (Sn
+ A+ )
π pp
−
p T2+ )Sn+1
+
π λπ λp T1+ (Sn+1
pp
+
p
ππp
πp
π pπ
π pp
ππ πππ = Sn+3 + Sn+3 and Sn+3 = Sn+3 + Sn+3 , and so on, we Since Sn+3 obtain ππ = λπ (1 − T2+ )Sn+1 + λπ λp T2+ (Sn + A+ ) Sn+3 2
p
2
ππ π + λπ K 1+ Sn+1 + λπ (1 − K 1+ )Sn+2 2
(4.28)
1374
P. Appleby and T. Elliott πp
Sn+3 = λπ (1 − T1+ )Sn+2 + λπ λp T1+ (Sn+1 + A+ ). p
(4.29)
ππ = The trick to simplify these equations further is to observe that Sn+2 πp πp π Sn+2 − Sn+2 and use equation 4.29 to rewrite Sn+2 . Since this equation has no terms in Sn , this trick does not increase the order of the resulting recurrence relations, but does reduce them from a coupled four-dimensional system to a more manageable, coupled two-dimensional system. Hence, π Sn+3 = λπ λp [T2+ − T1+ (1 − K 1+ )](Sn + A+ ) 2
π + λπ [1 − T2+ − (1 − T1+ )(1 − K 1+ )]Sn+1 + λπ K 1+ Sn+1 2
2
p
+ λπ λp T1+ (Sn+1 + A+ ) + λπ (1 − T1+ )Sn+2 .
(4.30)
p
A symmetrical version of this equation holds for Sn+3 . We observe that T2+ − T1+ (1 − K 1+ ) = K 2+ and 1 − T2+ − (1 − T1+ )(1 − K 1+ ) = K 1+ − K 2+ , and the overall multiplier of A+ is λp (λπ K 1+ + λπ 2 K 2+ ). So, defining Q+ 2 = − 2 − = λ K + λ K , we obtain the coupled system λπ K 1+ + λπ 2 K 2+ and Q− p 1 p 2 2 + + π Sn+3 = +λp A+ Q+ 2 + λπ λ p K 2 Sn + λπ K 1 Sn+1 2
− λπ K 2+ Sn+1 + λπ (1 − K 1+ )Sn+2 2
p
(4.31)
− − Sn+3 = −λπ A− Q− 2 + λπ λ p K 2 Sn + λ p K 1 Sn+1 2
p
π − λp K 2− Sn+1 + λp (1 − K 1− )Sn+2 2
(4.32)
These coupled recurrence relations represent our final form for the 2-reset model. 4.3 3-Reset Model. Now we go out to the first four spikes in the sequence because the fourth identical spike possesses the capacity to reset the stochastic processes. Three conditioning steps and the usual weighting by W give ππππ πππ ππ π Sn+4 = λπ (1 − K 1+ )Sn+3 + λπ (K 1+ − K 2+ )Sn+2 + λπ K 2+ Sn+1 . 2
3
(4.33) π pπ π
π pπ p
π ppπ
π ppp
The four equations for Sn+4 , Sn+4 , Sn+4 , and Sn+4 all sum to give
πp p Sn+4 = λπ λp T1+ (Sn+2 + A+ ) + λπ 1 − T1+ Sn+3 ,
(4.34)
Multispike Interactions in a Stochastic Model of STDP π π pπ
1375
π π pp
while those for Sn+4 and Sn+4 sum to give ππp
Sn+4 = λπ λp T2+ (Sn+1 + A+ ) + λπ (1 − T2+ )Sn+2 , 2
2
p
(4.35)
and we also have πππp
Sn+4 = λπ λp T3+ (Sn + A+ ) + λπ (1 − T3+ )Sn+1 . 3
3
p
(4.36)
In these four equations, for reasons that will become clear in the r -reset model, we maintain a rigorous distinction between T1+ and K 1+ even though they are identical. We see that these last three equations are, in fact, instances of the generic equation Smπ
i
p
= λπ λp Ti+ (Sm−i−1 + A+ ) + λπ (1 − Ti+ )Sm−i , i
i
p
(4.37)
for m > i, where π i p means i π spikes followed by a p spike. This equation follows easily, in fact, from conditioning on whether the synapse is in the POT state when the first p spike arrives, leading to the Ti+ term, or in the OFF state when the first p spike arrives, leading to the 1 − Ti+ term, and then, as usual, integrating out interspike intervals and summing over all subsequent spikes. By observing that Smπ = Smπ + i
i−1
Smπ p , j
(4.38)
j=1
we can use this equation to eliminate all occurrences of Smπ terms, i > 1, from equation 4.33 and then use equation 4.37 to eliminate all occurrences πi p p of Sm terms, i > 0, leaving terms in Smπ and Sm only. Defining Q+ 3 = + 2 + 3 + λπ K 1 + λπ K 2 + λπ K 3 , after some algebra, we finally obtain i
π Sn+4 = λπ (1 − K 1+ )Sn+3 + (λπ K 1+ − λπ K 2+ )Sn+2 + λπ K 2+ Sn+1 2
2
− λπ K 3+ Sn+1 + λp λπ K 3+ Sn + λp A+ Q+ 3, 3
p
3
(4.39)
p with a similar equation for Sn+4 . It is convenient to define K˜ i+ = λπ i K i+ and K˜ i− = λp i K i− . Then Qi± = ij=1 K˜ ± j , and we may finally write π ˜+ ˜+ ˜+ = +λp A+ Q+ Sn+4 3 + λ p K 3 Sn + K 2 Sn+1 − K 3 Sn+1 p
+ ( K˜ 1+ − K˜ 2+ )Sn+2 + (λπ − K˜ 1+ )Sn+3
(4.40)
1376
P. Appleby and T. Elliott ˜− ˜− ˜− π Sn+4 = −λπ A− Q− 3 + λπ K 3 Sn + K 2 Sn+1 − K 3 Sn+1 p
+ ( K˜ 1− − K˜ 2− )Sn+2 + (λp − K˜ 1− )Sn+3 .
(4.41)
These coupled recurrence relations represent our final form for the 3-reset model. 4.4 r -Reset Model. Having built some intuitions for small r , we can now proceed to the general r case. The standard conditioning argument gives r +1
π Sn+r +1 =
r −1 + π r −i ˜+ π (λπ K˜ i+ − K˜ i+1 )Sn+r −i + K r Sn+1 ,
(4.42)
i=0
where K˜ 0± ≡ 1, and we still have the generic results in equations 4.37 and 4.38. It is, of course, equation 4.42 that always ensures that the resulting p recurrence relation in Smπ and Sm is of order r + 1 and hence a finite system of coupled equations. Pulling out the i = r − 1 term from the sum and then writing j = i + 1, we have the slightly more convenient form r +1
π Sn+r +1 =
r −1 π r +1− j ˜+ π ˜+ (λπ K˜ + j−1 − K j )Sn+r +1− j + λπ K r−1 Sn+1 .
(4.43)
j=1
Using equation 4.38 to eliminate terms in Smπ , i > 1, and then equation πi p 4.37 to eliminate terms in Sm , i > 0, we obtain the apparently rather unpromising equation, i
π Sn+r +1 = X1 + λ p A+ X2 + λ p X3 + X4 ,
(4.44)
where π + X1 = λπ K˜ r+−1 Sn+1
r −1 π ˜+ (λπ K˜ + j−1 − K j )Sn+r +1− j ,
(4.45)
j=1
X2 =
r
λπ Tj+ − j
j=1
X3 =
r j=1
r− j r −1
λπ
i+ j
+ K+ Ti+ , j−1 − K j
(4.46)
j=1 i=1
λπ Tj+ Sn+r − j − j
r− j r −1
λπ
i+ j
+ K+ Ti+ Sn+r −i− j , − K j j−1
j=1 i=1
(4.47)
Multispike Interactions in a Stochastic Model of STDP
X4 =
r
λπ
j
1377
p 1 − Tj+ Sn+r +1− j
j=1
−
r− j r −1
λπ
i+ j
+ K+ j−1 − K j
p 1 − Ti+ Sn+r +1−i− j .
(4.48)
j=1 i=1
However, the quantities X2 , X3 and X4 simplify dramatically. Consider, for example, X3 :
X3 =
r
λπ Tj+ Sn+r − j − j
j=1
=
r
r− j r −1
λπ
i+ j
+ K+ Ti+ Sn+r −i− j j−1 − K j
j=1 i=1
λπ Tj+ Sn+r − j − j
j=1
r
λπ Sn+r −k k
k=2
= λπ T1+ Sn+r −1
+
r
j λπ Sn+r − j
r
+ + − K l+ Tk−l K l−1
l=1
Tj+
j=2
≡ λπ T1+ Sn+r −1 +
k−1
−
j−1
+ K i−1
−
K i+
+ Tj−i
i=1
λπ Sn+r − j K + j j
j=2
=
r
K˜ + j Sn+r − j ,
(4.49)
j=1
where the penultimate line follows from the recurrence relation defining Tj+ in equation 3.7. This was the reason for rigorously distinguishing between T1 and K 1 , even though they are identical, so that we could immediately use equation 3.7 without any further manipulation. The key step in this simplification is to rewrite the double sum by defining i + j to be k and summing over its range. Identical manipulations reduce X2 and X4 , giving
X2 =
r
+ K˜ + j ≡ Qr ,
(4.50)
j=1
X4 =
r j=1
˜ + S p λπ K˜ + n+r +1− j . j−1 − K j
(4.51)
1378
P. Appleby and T. Elliott
Thus, finally we have π + ˜+ π Sn+r +1 = λ p A+ Qr + λπ K r −1 Sn+1 +
r −1 π ˜+ λπ K˜ + Sn+r − K +1− j j j−1 j=1
+ λp
r
K˜ + j Sn+r − j +
r
j=1
˜ + S p λπ K˜ + n+r +1− j . j−1 − K j
(4.52)
j=1
This expression is valid for all r ≥ 1. A similar expression follows by symp metry for Sn+r +1 . After a little further manipulation, we obtain, for r ≥ 2, π + ˜+ ˜+ ˜+ Sn+r +1 = +λ p A+ Qr + λ p K r Sn + K r −1 Sn+1 − K r Sn+1 p
+
r −2
˜+ ˜+ K˜ + j − K j+1 Sn+r − j + λπ − K 1 Sn+r
(4.53)
j=1 p π Sn+r +1 = −λπ A− Qr− + λπ K˜ r− Sn + K˜ r−−1 Sn+1 − K˜ r− Sn+1
+
r −2
˜− ˜− K˜ − j − K j+1 Sn+r − j + λ p − K 1 Sn+r ,
(4.54)
j=1
−2 ≡ 0 for r = 2. where, of course, rj=1 p π Notice that the equations for Sn+r +1 and Sn+r +1 involve terms in Sm p π only, except for the single terms Sn+1 and Sn+1 . These two terms thus p π prevent us from simply adding the equations for Sn+r +1 and Sn+r +1 and hence obtaining a single, uncoupled (r + 1)th-order recurrence relation for Sm . The above coupled recurrence relations represent our final form, and we now turn to obtaining the exact solution for r = 1 and the asymptotic large n solution for r > 1. 5 Solution of Recurrence Relations We now solve the one-reset model exactly for any n, then determine the leading-order, asymptotic solutions for large n for the r -reset, r > 1, model. 5.1 Exact Solution of the 1-Reset Model. The 1-reset model has the second-order coupled recurrence relations given by equations 4.16 and 4.17. p We are not interested in how the separate components Snπ and Sn evolve p π but only in their sum, Sn = Sn + Sn . We have that − ˜+ ˜− Sn+2 = (λp A+ Q+ 1 − λπ A− Q1 ) + (λ p K 1 + λπ K 1 )Sn p π +Sn+1 − ( K˜ 1+ Sn+1 + K˜ 1− Sn+1 )
(5.1)
Multispike Interactions in a Stochastic Model of STDP
1379
and also p π ˜+ − ˜+˜− K˜ 1+ Sn+2 + K˜ 1− Sn+2 = (λp A+ K˜ 1− Q+ 1 − λπ A− K 1 Q1 ) + K 1 K 1 Sn
+(λp K˜ 1+ + λπ K˜ 1− )Sn+1 − K˜ 1+ K˜ 1− Sn+1 .
(5.2)
At the cost of increasing the order of the recurrence relation, we can thus obtain a relation purely in Sm , to give (Sn+3 − Sn+2 ) − K˜ 1+ K˜ 1− (Sn+1 − Sn ) = A˜ 1 ,
(5.3)
where we have defined the inhomogeneous part of this equation, A˜ 1 , by ˜+ − A˜ 1 = λp A+ (1 − K˜ 1− )Q+ 1 − λπ A− (1 − K 1 )Q1 .
(5.4)
The characteristic equation for the homogeneous part of the equation has the three roots +1 and ± K˜ + K˜ − . Since 0 ≤ K ± < 1 for finite β and 0 ≤ 1
1
1
λπ , λp ≤ 1, these last two roots are real and have moduli less than unity. The general solution of this equation is then Sn = n
n/2 A˜ 1 + K˜ 1+ K˜ 1− [D1 + (−1)n D2 ] + D3 , 1 − K˜ + K˜ − 1
(5.5)
1
where the constants D1 , D2 , and D3 can be determined by requiring that S0 = 0, S1 = 0, and S2 = λp A+ K˜ 1+ − λπ A− K˜ 1− . Because K˜ 1+ K˜ 1− < 1, the asymptotic large n form of this solution is then just 1 A˜ 1 Sn ∼ . n 1 − K˜ 1+ K˜ 1−
(5.6)
The solution for large enough $n$ thus essentially scales linearly with $n$, and this is exactly what we should expect: a typical train of $2n$ spikes should induce, on average, twice as much change in synaptic strength as a typical train of $n$ spikes, for $n$ sufficiently large. Of course, real neurons would not be expected to scale in this manner because synaptic strengths are presumably constrained and thus saturate at upper or lower limits.

5.2 Asymptotic Solution of the r-Reset Model. As for the 1-reset model, we can write
$$
S_{n+r+1} = (\lambda_p A_+ Q_r^+ - \lambda_\pi A_- Q_r^-) + (\lambda_p \tilde{K}_r^+ + \lambda_\pi \tilde{K}_r^-) S_n + (\tilde{K}_{r-1}^+ + \tilde{K}_{r-1}^-) S_{n+1} + (1 - \tilde{K}_1^+ - \tilde{K}_1^-) S_{n+r} + \sum_{j=1}^{r-2} \left( \tilde{K}_j^+ + \tilde{K}_j^- - \tilde{K}_{j+1}^+ - \tilde{K}_{j+1}^- \right) S_{n+r-j} - \left( \tilde{K}_r^+ S^p_{n+1} + \tilde{K}_r^- S^\pi_{n+1} \right) \qquad (5.7)
$$
and
$$
\tilde{K}_r^+ S^p_{n+r+1} + \tilde{K}_r^- S^\pi_{n+r+1} = (\lambda_p A_+ \tilde{K}_r^- Q_r^+ - \lambda_\pi A_- \tilde{K}_r^+ Q_r^-) + \tilde{K}_r^+ \tilde{K}_r^- S_n + (\tilde{K}_{r-1}^+ \tilde{K}_r^- + \tilde{K}_{r-1}^- \tilde{K}_r^+ - \tilde{K}_r^+ \tilde{K}_r^-) S_{n+1} + \sum_{j=1}^{r-2} \left[ \tilde{K}_r^- (\tilde{K}_j^+ - \tilde{K}_{j+1}^+) + \tilde{K}_r^+ (\tilde{K}_j^- - \tilde{K}_{j+1}^-) \right] S_{n+r-j} + (\lambda_p \tilde{K}_r^+ + \lambda_\pi \tilde{K}_r^-) S_{n+r} - (\tilde{K}_1^+ \tilde{K}_r^- + \tilde{K}_1^- \tilde{K}_r^+) S_{n+r}. \qquad (5.8)
$$
Hence, defining
$$
\tilde{A}_r = \lambda_p A_+ (1 - \tilde{K}_r^-) Q_r^+ - \lambda_\pi A_- (1 - \tilde{K}_r^+) Q_r^-, \qquad (5.9)
$$
after some algebra we have
$$
S_{n+2r+1} - S_{n+2r} = \tilde{A}_r + \tilde{K}_r^+ \tilde{K}_r^- (S_{n+1} - S_n) - \sum_{j=1}^{r-1} (\tilde{K}_j^+ + \tilde{K}_j^-)(S_{n+2r+1-j} - S_{n+2r-j}) + \sum_{j=1}^{r-1} (\tilde{K}_j^+ \tilde{K}_r^- + \tilde{K}_j^- \tilde{K}_r^+)(S_{n+r+1-j} - S_{n+r-j}), \qquad (5.10)
$$
where we have written this equation so that it is clear that there is a $+1$ solution of the characteristic equation for the homogeneous form of the recurrence relation. With $\sum_{j=1}^{r-1} \equiv 0$ for $r = 1$, this equation is valid for all $r \ge 1$.

We define $U_n = S_{n+1} - S_n$, with $S_1 \equiv 0$, so that $S_{n+1} = \sum_{i=1}^{n} U_i$. Equation 5.10 then implies that the $U_i$'s satisfy the recurrence relation
$$
\sum_{j=0}^{r} \Lambda_j U_{n+2r-j} - \sum_{j=0}^{r} \left( \tilde{K}_j^+ \tilde{K}_r^- + \tilde{K}_j^- \tilde{K}_r^+ \right) U_{n+r-j} = \tilde{A}_r - \tilde{K}_r^+ \tilde{K}_r^- U_n, \qquad (5.11)
$$
where $\Lambda_j = \tilde{K}_j^+ + \tilde{K}_j^-$, $j \ge 1$, and we define $\Lambda_0 \equiv 1$. These $U_i$'s will play an important role later in the construction of exact solutions for $S_n$ in the nonresetting model. The characteristic equation for the homogeneous part of equation 5.11 is, after a little rearrangement,
$$
\theta^{2r} + \sum_{j=1}^{r-1} \Lambda_j \theta^{2r-j} - \sum_{j=1}^{r-1} \left( \tilde{K}_j^+ \tilde{K}_r^- + \tilde{K}_j^- \tilde{K}_r^+ \right) \theta^{r-j} - \tilde{K}_r^+ \tilde{K}_r^- = 0. \qquad (5.12)
$$
This equation has $2r$ roots, which we denote by $\theta_i$, $i = 1, \ldots, 2r$. To obtain the asymptotic behavior of $S_n$ for general $r$, we need to bound $|\theta_i|$, all $i$. We can prove that the roots $\theta_i$ have moduli not exceeding unity by showing that $S_n$ grows no faster than linearly in $n$. Consider an arbitrary sequence $\sigma_n(t_0, \ldots, t_{n-1})$ of $n$ spikes. Any $\pi$ spike could induce a change of $-A_-$ in synaptic efficacy if it triggers a DEP → OFF transition; otherwise, it induces no change. Similarly, any $p$ spike could induce a change of $+A_+$ if it triggers a POT → OFF transition; otherwise, it induces no change. There are at most $n$ $\pi$ spikes and at most $n$ $p$ spikes in the sequence. Hence, we must have $-nA_- \le S_n(t_0, \ldots, t_{n-1}) \le +nA_+$ for any sequence $\sigma_n$ with any interspike intervals $t_0, \ldots, t_{n-1}$. The averaged function $S_n$ must therefore also respect these bounds, so that
$$
-A_- \le \frac{1}{n} S_n \le +A_+. \qquad (5.13)
$$
Thus, $S_n$ grows no faster than linearly in $n$, so there are no roots $\theta_i$ having moduli exceeding unity. Furthermore, no unit modulus root can be repeated three or more times, as such roots would induce at least quadratic terms in $n$. We also exclude twice-repeated unit modulus roots not equal to unity because such roots would contribute linearly growing, oscillatory terms to $S_n$, which are incompatible with our model's dynamics. Finally, it is easy to see that equation 5.12 has precisely one positive (real) root, and this root is strictly less than one. In summary, all the roots $\theta_i$ satisfy $|\theta_i| \le 1$, any that satisfy $|\theta_i| = 1$ are unrepeated, and there is no $\theta_i \equiv 1$ root. Hence, the asymptotic form of $S_n$ is dominated by its particular solution, since the $+1$ root of the characteristic equation of equation 5.10 in the presence of the constant inhomogeneous term induces a particular solution that grows linearly with $n$. Thus, we finally obtain the asymptotic large $n$ solution
$$
\frac{1}{n} S_n \sim \frac{\lambda_p A_+ (1 - \tilde{K}_r^-) Q_r^+ - \lambda_\pi A_- (1 - \tilde{K}_r^+) Q_r^-}{(1 - \tilde{K}_r^+)(1 - \tilde{K}_r^-) + (1 - \tilde{K}_r^-) Q_r^+ + (1 - \tilde{K}_r^+) Q_r^-}, \qquad (5.14)
$$
valid for any $r$. We shall see later that the asymptotic solution is good even for $n \approx 10$.

The formal limit $r \to \infty$ is beset by various technical difficulties associated with the limit and the resulting sum over a potentially infinite number of polynomial roots. However, in the usual spirit of applied mathematics, we shall ignore these possible difficulties and just assume that the limit of the finite $r$ asymptotic solution is the asymptotic solution of the nonresetting model (and confirm this numerically). Now, $\tilde{K}_r^\pm \to 0$ as $r \to \infty$, and we require $Q^\pm = \lim_{r \to \infty} Q_r^\pm$. We have, for example,
$$
Q^+ = \sum_{l=1}^{\infty} \lambda_\pi^l K_l^+(\beta) = \sum_{l=1}^{\infty} \lambda_\pi^l \int_0^\infty dt\, \frac{t^{l-1}}{(l-1)!}\, e^{-\beta t} P^+(t) = \lambda_\pi \int_0^\infty dt\, e^{-\beta t} P^+(t) \sum_{l=0}^{\infty} \frac{(\lambda_\pi t)^l}{l!} = \lambda_\pi \int_0^\infty dt\, e^{-\lambda_p t} P^+(t) = \frac{\lambda_\pi}{\lambda_p} K_1^+(\lambda_p). \qquad (5.15)
$$
Notice that this derivation makes no use of our particular choice for $f^+$ and therefore is valid for any pdf $f^+$. Similarly, $Q^- = \frac{\lambda_p}{\lambda_\pi} K_1^-(\lambda_\pi)$ (independent of $f^-$), so for the nonresetting or $r \to \infty$ model, we obtain the particularly simple expression
$$
\frac{1}{n} S_n \sim \frac{\lambda_\pi A_+ K_1^+(\lambda_p) - \lambda_p A_- K_1^-(\lambda_\pi)}{1 + \frac{\lambda_p}{\lambda_\pi} K_1^-(\lambda_\pi) + \frac{\lambda_\pi}{\lambda_p} K_1^+(\lambda_p)}. \qquad (5.16)
$$
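To make equation 5.16 concrete, one needs a form for the no-return probabilities $P^\pm$. The sketch below assumes, purely for illustration, exponential survival functions $P^\pm(t) = e^{-t/\tau_\pm}$, for which $K_1^\pm(\lambda) = \lambda\tau_\pm/(1 + \lambda\tau_\pm)$; the model's actual $f^\pm$ are different, so this is not the paper's rule, only an indication of how such an asymptotic rate-based rule can be evaluated over the $\lambda_\pi$–$\lambda_p$ plane:

```python
import numpy as np

def K1(lam, tau):
    # K_1(lam) = lam * Integral_0^inf e^{-lam t} P(t) dt, with P(t) = e^{-t/tau}.
    return lam * tau / (1.0 + lam * tau)

def rate_rule(lam_pi, lam_p, A_plus=1.0, A_minus=0.95,
              tau_plus=0.0133, tau_minus=0.020):
    """(1/n) S_n of eq. 5.16 under the assumed exponential P^+/P^-."""
    num = lam_pi * A_plus * K1(lam_p, tau_plus) \
        - lam_p * A_minus * K1(lam_pi, tau_minus)
    den = 1.0 + (lam_p / lam_pi) * K1(lam_pi, tau_minus) \
              + (lam_pi / lam_p) * K1(lam_p, tau_plus)
    return num / den

# Evaluate over a grid of pre- and postsynaptic rates, e.g. for a contour plot.
rates = np.linspace(1.0, 200.0, 200)
surface = np.array([[rate_rule(lpi, lp) for lp in rates] for lpi in rates])
```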
Notice that the limits $\lambda_\pi \to 0$ or $\lambda_p \to 0$ are well defined since $K_1^\pm(\lambda) \to 0$ as $\lambda \to 0$ at least as fast as $\lambda$.

6 1-Transition Processes

We now seek to provide a deeper understanding of the asymptotic solutions derived above. This will permit us to construct an exact solution for $S_n$ in the nonresetting model in the next section.
We rewrite the general asymptotic solution in the slightly more transparent form,
$$
\frac{1}{n} S_n \sim \frac{T_1}{N_1}, \qquad (6.1)
$$
where
$$
T_1 = \lambda_p A_+ \frac{Q_r^+}{1 - \tilde{K}_r^+} - \lambda_\pi A_- \frac{Q_r^-}{1 - \tilde{K}_r^-}, \qquad (6.2)
$$
$$
N_1 = 1 + \frac{Q_r^+}{1 - \tilde{K}_r^+} + \frac{Q_r^-}{1 - \tilde{K}_r^-}, \qquad (6.3)
$$
and we work henceforth, for convenience, only with the nonresetting, $r \to \infty$ model, so that
$$
T_1 = \lambda_p A_+ Q^+ - \lambda_\pi A_- Q^-, \qquad (6.4)
$$
$$
N_1 = 1 + Q^+ + Q^-. \qquad (6.5)
$$
We specifically do not sum the terms in $Q^\pm$, obtaining, for example, $Q^+ = \frac{\lambda_\pi}{\lambda_p} K_1^+(\lambda_p)$, as this obscures the reduction of this model to the $r$-reset model and also renders opaque a natural interpretation of the terms in $T_1$ and $N_1$. Notice that for the reduction to the $r$-reset model, we have, for example,
$$
Q^+ = \sum_{i=1}^{\infty} \tilde{K}_i^+ \to Q_r^+ \left[ 1 + \tilde{K}_r^+ + (\tilde{K}_r^+)^2 + \cdots \right] = \frac{Q_r^+}{1 - \tilde{K}_r^+}, \qquad (6.6)
$$
so we see immediately how terms like $Q_r^+/(1 - \tilde{K}_r^+)$ arise from $Q^+$ and therefore no longer need to consider the general $r$-reset model.

We define $T_1^+ = +\lambda_p A_+ Q^+$ and $T_1^- = -\lambda_\pi A_- Q^-$, so that $T_1 = T_1^+ + T_1^-$. What processes give rise to the terms that occur, for example, in $T_1^+$? We have that
$$
T_1^+ = \lambda_p A_+ \sum_{i=1}^{\infty} \lambda_\pi^i K_i^+. \qquad (6.7)
$$
We know that a $K_i^+$ term arises from a sequence of $i$ $\pi$ spikes in which the synapse never returns to the OFF state once placed in the POT state by the first $\pi$ spike. That is, $K_i^+$ arises from the probability that the synapse never returns to the OFF state during the whole time interval $[t_0, \sum_{j=0}^{i} t_j)$, which is $P^+(t_1 + \cdots + t_i)$. If the synapse is in the POT state when a final $p$ spike arrives at time $\sum_{j=0}^{i} t_j$, then a change in synaptic strength by an amount $A_+$ will occur. Thus, the term $\lambda_p A_+ \tilde{K}_i^+$ is obtained from the spike sequence $\pi^i p$ in which no stochastic transitions back to OFF ever occur and
in which the only transition back to OFF is induced by the final $p$ spike. An identical argument holds for the term $-\lambda_\pi A_- \tilde{K}_i^-$ for the $p^i \pi$ sequence. $T_1^+$ is thus obtained by summing over all spike sequences $\pi p, \pi^2 p, \pi^3 p, \ldots$, in each of which there is only one transition back to the OFF state, inducing potentiation by an amount $A_+$. This is similarly so for $T_1^-$. We refer to such processes as 1-transition processes. $T_1$ is therefore the expected change in synaptic strength induced by all 1-transition processes.

Another 1-transition process is of the form $\pi^i \gamma$, where the $\gamma$ represents the first stochastic transition back to the OFF state. Of course, such a process induces no change in synaptic strength and so does not contribute to $T_1$.

What is the expected number of spikes in all possible 1-transition processes, $\pi^i p$, $p^i \pi$, $\pi^i \gamma$, and $p^i \gamma$, any $i > 0$? The expected number of spikes in all the sequences $\pi^i p$ is just
$$
N_p^+ = \lambda_p \sum_{i=1}^{\infty} (i+1) \tilde{K}_i^+, \qquad (6.8)
$$
and similarly for $p^i \pi$ sequences, we have
$$
N_\pi^- = \lambda_\pi \sum_{i=1}^{\infty} (i+1) \tilde{K}_i^-. \qquad (6.9)
$$
For the sequence $\pi^i \gamma$, the probability that the first stochastic transition back to the OFF state occurs after the $i$th $\pi$ spike is $P^+(t_1 + \cdots + t_{i-1}) - P^+(t_1 + \cdots + t_i)$, $i > 1$, leading to $K_{i-1}^+ - K_i^+$, with $K_0^+ \equiv 1$, $i > 0$. Hence, the expected number of spikes in all the sequences $\pi^i \gamma$ is
$$
N_\gamma^+ = \sum_{i=1}^{\infty} i \left( \lambda_\pi \tilde{K}_{i-1}^+ - \tilde{K}_i^+ \right), \qquad (6.10)
$$
and in all $p^i \gamma$ sequences, we have
$$
N_\gamma^- = \sum_{i=1}^{\infty} i \left( \lambda_p \tilde{K}_{i-1}^- - \tilde{K}_i^- \right). \qquad (6.11)
$$
Now,
$$
N_p^+ + N_\gamma^+ = \sum_{i=1}^{\infty} \left[ i \lambda_\pi \tilde{K}_{i-1}^+ - i \tilde{K}_i^+ + \lambda_p (i+1) \tilde{K}_i^+ \right] = \lambda_\pi + \sum_{i=1}^{\infty} \left[ (i+1) \tilde{K}_i^+ - i \tilde{K}_i^+ \right] = \lambda_\pi + Q^+. \qquad (6.12)
$$
Similarly, $N_\pi^- + N_\gamma^- = \lambda_p + Q^-$, and so the expected number of spikes in all 1-transitions is just $1 + Q^+ + Q^-$, which is precisely $N_1$, the denominator in the asymptotic expression for $S_n$. $T_1$ is therefore the expected change in synaptic strength in all 1-transition processes, and $N_1$ is the expected number of spikes in all 1-transition processes.

Consider some general sequence of $n$ spikes. This sequence will decompose, by definition, into a chain of successive 1-transition processes defined by the synapse returning to the OFF state at the end of each 1-transition process. How many 1-transition processes do we expect, on average, to occur in a sequence of $n$ spikes? Because the average length of a 1-transition process is $N_1$ spikes, obviously there are $n/N_1$ 1-transitions in a chain of $n$ spikes on average. Each such 1-transition process induces, on average, a change in synaptic strength by an amount $T_1$. The total change induced by $n$ spikes is thus expected to be $T_1 \times (n/N_1)$, but this should just be approximately $S_n$, the expected change induced by $n$ spikes. When $n$ is sufficiently large, the statistical error should become negligible, so we should have
$$
S_n \sim n \frac{T_1}{N_1}, \qquad (6.13)
$$
which is, of course, equation 6.1. It is a remarkable fact that the asymptotic solutions that were derived somewhat laboriously from the recurrence relations nevertheless can be extracted so quickly by carving up an arbitrary sequence of spikes into its natural articulation points, these being the times at which the synapse has returned to the OFF state. This is possible because of the underlying Markovian nature of 1-transition processes: once the synapse has returned to the OFF state, its subsequent evolution is independent of its history at that point. It is therefore natural to think in terms of a chain of 1-transition processes, the end point of each of which is the OFF state.

7 Exact Solutions for $S_n$ in the Nonresetting Model

We consider again the nonresetting model, $r \to \infty$. Inspired by the above 1-transition analysis, we explicitly calculate $S_n$ for small $n$ by decomposing $S_n$ into terms representing the final potentiating or depressing 1-transition process in the sequence of $n$ spikes. Notice that the final 1-transition process need not terminate on the final spike of the sequence; it could terminate before, with the remaining spikes not forming a potentiating or depressing 1-transition process. Lengthy and tedious calculation then reveals that the $A_+$-dependent parts of $S_n$, which we denote by $S_n^+$, are given by
$$
S_2^+ = \lambda_p A_+ F_0 \tilde{K}_1^+,
$$
$$
S_3^+ = \lambda_p A_+ \left[ F_0 \tilde{K}_2^+ + (F_0 + F_1) \tilde{K}_1^+ \right],
$$
$$
S_4^+ = \lambda_p A_+ \left[ F_0 \tilde{K}_3^+ + (F_0 + F_1) \tilde{K}_2^+ + (F_0 + F_1 + F_2) \tilde{K}_1^+ \right],
$$
where $F_0 = 1$, $F_1 = F_0 - \Lambda_1$, and $F_2 = F_1 - (\Lambda_2 - \Lambda_1^2)$. The expressions for the $A_-$-dependent parts are given by the usual transformations of the $A_+$-dependent parts: $\lambda_\pi \leftrightarrow \lambda_p$ and $+ \leftrightarrow -$. Notice that $F_0$, $F_1$, and $F_2$ are invariant under these transformations, and so the $S_n^-$ have identical $F_i$ coefficients to those for the $S_n^+$.

In general, by this final 1-transition process decomposition, we know that we must be able to write
$$
S_{n+1} = \sum_{i=1}^{n} G_{n-i} \left( \lambda_p A_+ \tilde{K}_i^+ - \lambda_\pi A_- \tilde{K}_i^- \right), \qquad (7.1)
$$
where the $\lambda_p A_+ \tilde{K}_i^+$ term represents the final 1-transition process contribution from a spike sequence $\pi^i p$, and $-\lambda_\pi A_- \tilde{K}_i^-$ that from a $p^i \pi$ sequence. They must have the same coefficient, $G_{n-i}$, because $G_{n-i}$ essentially counts all the possible ways in which we arrive at the OFF state before the final 1-transition process occurs, and this must be independent of whether that final process is $\pi^i p$ or $p^i \pi$, by the Markovian property of 1-transitions. Moreover, although $G_i$ will be a function of the $\tilde{K}_j^\pm$, it must be a symmetric function of them, so that $G_i \to G_i$ under $\lambda_\pi \leftrightarrow \lambda_p$ and $+ \leftrightarrow -$, because for any sequence $\sigma_i$ that arrives at the OFF state after $i$ spikes, there is a complementary sequence $\bar{\sigma}_i$ that also arrives at the OFF state after $i$ spikes, where $\bar{\sigma}_i$ is obtained from $\sigma_i$ by $\pi \leftrightarrow p$, that is, replace every $\pi$ spike by a $p$ spike and every $p$ spike by a $\pi$ spike. So the expansion in equation 7.1 is generic, exploiting the importance of the 1-transition processes.

It remains to determine the coefficients $G_i$. Since $U_n = S_{n+1} - S_n$, we have the expansion
$$
U_n = \sum_{i=1}^{n} F_{n-i} \left( \lambda_p A_+ \tilde{K}_i^+ - \lambda_\pi A_- \tilde{K}_i^- \right), \qquad (7.2)
$$
where $F_0 = G_0 = 1$ and $F_i = G_i - G_{i-1}$, $i > 0$, or $G_i = \sum_{j=0}^{i} F_j$. So, equivalently, we can determine instead the coefficients $F_i$ rather than $G_i$.

Although we are now interested only in the nonresetting model, because we know how to extract the $r$-reset model from it, a determination of the coefficients $F_i$ depends critically on the recurrence relations for the $r$-reset models. In fact, we can explicitly calculate the sequence of functions $S_n$, $n = 2, 3, 4, \ldots$, for the nonresetting model in an incremental manner from the 1-, 2-, 3-, $\ldots$, reset models in the following way. Consider $S_2$. The sequences that could possibly contribute to the function are $\pi\pi$, $\pi p$, $p\pi$, and $pp$. The two sequences $\pi p$ and $p\pi$ never probe any resetting properties because we need a chain of at least two identical spikes for this. Although
the sequences $\pi\pi$ and $pp$ do probe 1-resetting properties, they give exactly zero contribution to the function. Hence, the function $S_2$ is independent of $r$ for $r \ge 1$, and we can therefore use the $r = 1$ recurrence relations in equations 4.16 and 4.17, with $n = 0$ and $S_0 = 0$ and $S_1 = 0$, to calculate it. Now consider $S_3$. Eight sequences could possibly contribute to it. Only $\pi\pi\pi$ and $ppp$ could probe 2-resetting properties, but again they give zero contribution. All the other sequences probe at most one-resetting properties. Thus, $S_3$ is independent of $r$ for $r \ge 2$. We can therefore use the $r = 2$ form of the recurrence relations in equations 4.53 and 4.54, with $n = 0$ and the initial values $S_0 = 0$, $S_1 = 0$, and $S_2$ calculated from the $r = 1$ recurrence relations, to calculate $S_3$. Similarly, $S_4$ is independent of $r$ for $r \ge 3$, and we can use the $r = 3$ form of equations 4.53 and 4.54 with $n = 0$ to calculate it, using the forms of $S_i$, $i \le 3$, already established. In general, therefore, we see that $S_{m+1}$ is independent of $r$ for $r \ge m$ because $m + 1$ spikes cannot probe the $m$-resetting properties with nonzero contributions, so we can use the $r = m$ form of equations 4.53 and 4.54 with $n = 0$ and all the previously established forms of $S_i$, $i \le m$, to calculate $S_{m+1}$.

The important thing to notice about this procedure is that we always set $S_0 = 0$ and $S_1 = 0$. This means that the corresponding terms in equations 4.53 and 4.54 with $n = 0$ do not contribute to the calculation of $S_{m+1}$ for $r \ge m$. But it was precisely the $S_1^\pi$ and $S_1^p$ terms (the $S_{n+1}^\pi$ and $S_{n+1}^p$ terms for nonzero $n$) in particular that prevented us before from simply adding equations 4.53 and 4.54 to obtain an $(r+1)$th-order recurrence relation entirely in $S_m$. So, now with $S_0 = 0$ and $S_1 = 0$ (and, in particular, $S_1^\pi = 0$ and $S_1^p = 0$), we are free to add the equations, obtaining
$$
S_{m+1} = \left( \lambda_p A_+ Q_m^+ - \lambda_\pi A_- Q_m^- \right) + \sum_{i=0}^{m-2} \left( \Lambda_i - \Lambda_{i+1} \right) S_{m-i}, \qquad (7.3)
$$
valid for $m \ge 1$ with $\sum_{i=0}^{m-2} \equiv 0$ for $m = 1$, and we have used $\Lambda_0 \equiv 1$. After a little rearrangement, we have
$$
\sum_{i=0}^{m-1} \Lambda_i U_{m-i} = \lambda_p A_+ Q_m^+ - \lambda_\pi A_- Q_m^-, \qquad (7.4)
$$
where we have used the definition $U_i = S_{i+1} - S_i$. This recurrence relation succinctly encodes the procedure described above and thus determines the form of $S_{m+1}$ for an $r$-reset model in which $r \ge m$. Taking the limit $r \to \infty$ to give the nonresetting model, equation 7.4 is valid for any $m \ge 1$ and represents a simple recurrence relation to calculate the form of $U_m$, any $m$, in the nonresetting model.
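Equation 7.4 lends itself directly to numerical iteration. The sketch below computes $U_1, U_2, \ldots$ by forward substitution (using $\Lambda_0 = 1$) and accumulates $S_{n+1} = \sum_i U_i$; the geometrically decaying $\tilde{K}_i^\pm$ sequences are placeholders chosen only so that the code runs, since the true sequences follow from $f^\pm$ and the firing rates:

```python
import numpy as np

M = 50                                  # number of spikes to evaluate
lam_pi, lam_p, A_plus, A_minus = 0.5, 0.5, 1.0, 0.95

# Placeholder kernel sequences K~_i^+/- (i = 1..M); geometric decay here
# purely for illustration -- the real sequences come from f^+/- and beta.
i = np.arange(1, M + 1)
Ktp = 0.6 ** i
Ktm = 0.5 ** i

Lam = np.concatenate(([1.0], Ktp + Ktm))      # Lambda_0 = 1, Lambda_i = K~+_i + K~-_i
Qp = np.concatenate(([0.0], np.cumsum(Ktp)))  # Q_m^+ = sum_{i<=m} K~+_i
Qm = np.concatenate(([0.0], np.cumsum(Ktm)))
rhs = lam_p * A_plus * Qp - lam_pi * A_minus * Qm

# Forward substitution for U_m (eq. 7.4), using Lambda_0 = 1.
U = np.zeros(M + 1)
for m in range(1, M + 1):
    U[m] = rhs[m] - sum(Lam[j] * U[m - j] for j in range(1, m))
S = np.concatenate(([0.0, 0.0], np.cumsum(U[1:])))  # S_0 = S_1 = 0, S_{n+1} = sum U_i
```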
Substituting the expansion in equation 7.2 into the recurrence relation in equation 7.4 and looking at, say, only the $A_+$-dependent piece, we have
$$
\sum_{i=0}^{m-1} \Lambda_i \sum_{j=1}^{m-i} F_{m-i-j} \tilde{K}_j^+ = Q_m^+, \qquad (7.5)
$$
and we can rearrange the double sum to give
$$
\sum_{i=1}^{m} \tilde{K}_i^+ \sum_{j=0}^{m-i} \Lambda_j F_{m-i-j} = Q_m^+. \qquad (7.6)
$$
Since $Q_m^+ = \sum_{i=1}^{m} \tilde{K}_i^+$, we see that equation 7.2 is a solution of equation 7.4 provided that the $F_i$ satisfy the recurrence relation
$$
\sum_{i=0}^{n} F_i \Lambda_{n-i} = 1, \qquad (7.7)
$$
for any $n$. With $F_0 = \Lambda_0 = 1$, we can compute the first few $F_i$'s, giving agreement with those calculated explicitly above. We now write $F_i = \sum_{j=0}^{i} C_j$, $C_0 \equiv 1$, so that $F_{i+1} = F_i + C_{i+1}$. By subtracting $\sum_{i=0}^{n+1} F_i \Lambda_{n+1-i}$ from $\sum_{i=0}^{n} F_i \Lambda_{n-i}$, we see that the $C_i$ must satisfy the recurrence relation
$$
\sum_{i=0}^{n} C_i \Lambda_{n-i} = 0, \qquad (7.8)
$$
for $n \ge 1$, with $C_0 \equiv 1$. It is easy to see that the $C_i$ are generated by the function
$$
\sum_{i=0}^{\infty} C_i x^i = \frac{1}{\sum_{i=0}^{\infty} \Lambda_i x^i}.
$$
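Continuing the numerical sketch introduced after equation 7.4 (so `M`, `Lam`, and numpy are assumed to be in scope), the $C_i$ follow from equation 7.8 by standard power-series inversion, and the $F_i$ and $G_i$ by partial summation:

```python
# Invert the power series sum_i Lam[i] x^i to generate the C_i (eq. 7.8),
# then accumulate F_i = sum_{j<=i} C_j and G_i = sum_{j<=i} F_j.
C = np.zeros(M + 1)
C[0] = 1.0                                   # C_0 = 1 (since Lambda_0 = 1)
for n in range(1, M + 1):
    C[n] = -sum(Lam[j] * C[n - j] for j in range(1, n + 1))
F = np.cumsum(C)
G = np.cumsum(F)
# Check the defining relation, eq. 7.7: sum_i F_i Lambda_{n-i} = 1 for all n.
assert np.allclose([sum(F[i] * Lam[n - i] for i in range(n + 1))
                    for n in range(M + 1)], 1.0)
```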
Summarizing these various results, we have that
$$
U_n = \sum_{i=1}^{n} F_{n-i} \left( \lambda_p A_+ \tilde{K}_i^+ - \lambda_\pi A_- \tilde{K}_i^- \right), \qquad S_{n+1} = \sum_{i=1}^{n} G_{n-i} \left( \lambda_p A_+ \tilde{K}_i^+ - \lambda_\pi A_- \tilde{K}_i^- \right), \qquad (7.9)
$$
where $G_i = \sum_{j=0}^{i} F_j$ and $F_i = \sum_{j=0}^{i} C_j$, and the $C_i$'s are generated by the function $1 / \sum_{i=0}^{\infty} \Lambda_i x^i$. By simple rearrangements of the sums in these two representations of $U_n$ and $S_{n+1}$, we also have
$$
U_n = \sum_{i=1}^{n} C_{n-i} \left( \lambda_p A_+ Q_i^+ - \lambda_\pi A_- Q_i^- \right), \qquad S_{n+1} = \sum_{i=1}^{n} F_{n-i} \left( \lambda_p A_+ Q_i^+ - \lambda_\pi A_- Q_i^- \right).
$$
These equivalent expressions represent the exact solution of the nonresetting model for any $n$. Of course, the $r$-reset model is obtained by the standard replacement $K_{lr+s} \to K_r^l K_s$ throughout.

8 Analytical and Simulated Results

In this section, we turn to a comparison between the analytical results derived above and numerical results obtained by the computational simulation of the $r$-reset model. We are particularly concerned with establishing how rapidly $\frac{1}{n} S_n$ converges to the analytical asymptotic results; establishing that the nonresetting model's asymptotic result matches a large $n$ simulation of the same model, confirming indeed that the $r \to \infty$ limit of the $r$-reset model's asymptotic result really is the asymptotic result of the nonresetting model; and establishing how rapidly the $r$-reset model tends to the nonresetting model as $r \to \infty$.

We run simulations of the stochastic model as set out in earlier work (Appleby & Elliott, 2005), with the standard set of parameters used there:
$$
A_+ = 1.00, \qquad A_- = 0.95, \qquad \tau_+ = 13.3~\mathrm{ms}, \qquad \tau_- = 20.0~\mathrm{ms}, \qquad \nu_+ = 3, \qquad \nu_- = 3.
$$
These parameters were chosen because they reproduced a large variety of the experimental results available in the literature, including the replication of the spike triplet results (Froemke & Dan, 2002). These parameters also guarantee the presence of both a depressing and a potentiating regime in the function $S_2$, so that depression and potentiation can both occur at the ensemble level. In most of the simulations below, each simulation is run over exactly $10^7$ spikes. Thus, in obtaining the numerical form of $\frac{1}{n} S_n$, we perform $10^7/n$ separate runs over $n$ spikes and average over all these runs.
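For readers wishing to reproduce simulations of this kind, the sketch below implements the basic event loop of the switch model. It is a simplification, not the simulation code used here: the stochastic POT → OFF and DEP → OFF dwell times are drawn from exponential distributions with means $\tau_\pm$, whereas the model's actual $f^\pm$ (parameterized above by $\tau_\pm$ and $\nu_\pm$) are more structured; the transition logic otherwise follows the description in the text:

```python
import numpy as np

def simulate_Sn(lam_pi, lam_p, n, trials=1000, A_plus=1.0, A_minus=0.95,
                tau_plus=0.0133, tau_minus=0.020, reset=False, seed=0):
    """Monte Carlo estimate of (1/n) S_n for a simplified switch model.

    pi (pre) and p (post) spikes form independent Poisson trains with
    rates lam_pi, lam_p (Hz).  A pi spike arms POT from OFF; a p spike
    arriving while the switch is in POT triggers a POT -> OFF transition
    and potentiation by A_plus, and symmetrically for DEP/depression.
    Dwell times for the stochastic return to OFF are exponential here,
    a stand-in for the model's f^+/f^-.  reset=True redraws the dwell
    time on each repeated like-spike (1-reset model); reset=False is
    the nonresetting model.
    """
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(trials):
        t, state, deadline, S = 0.0, "OFF", np.inf, 0.0
        for _ in range(n):
            t += rng.exponential(1.0 / (lam_pi + lam_p))
            is_pi = rng.random() < lam_pi / (lam_pi + lam_p)
            if t >= deadline:                 # stochastic return to OFF
                state, deadline = "OFF", np.inf
            if state == "OFF":
                state = "POT" if is_pi else "DEP"
                tau = tau_plus if is_pi else tau_minus
                deadline = t + rng.exponential(tau)
            elif state == "POT":
                if is_pi:                     # repeated like-spike
                    if reset:
                        deadline = t + rng.exponential(tau_plus)
                else:                         # p spike: potentiation
                    S += A_plus
                    state, deadline = "OFF", np.inf
            else:                             # state == "DEP"
                if not is_pi:
                    if reset:
                        deadline = t + rng.exponential(tau_minus)
                else:                         # pi spike: depression
                    S -= A_minus
                    state, deadline = "OFF", np.inf
        total += S
    return total / (trials * n)
```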
Figure 1: Analytical and numerical values of $\frac{1}{n} S_n$ plotted as a function of $\lambda_\pi$ for the one-reset model for various values of $n$. The solid lines show the numerical results for $n = 2, 4, 8, 16, 32$, and 64 spikes, while the corresponding dotted lines show the exact, analytical result. The $n = 2$ pair of lines corresponds to the bottom pair, $n = 4$ the next pair up, and so on, the function $\frac{1}{n} S_n$ increasing as a function of $n$ for fixed $\lambda_\pi$. The dashed line shows the exact asymptotic limit, $n \to \infty$.
Of course, as $n$ increases, there are fewer runs over which to average, but there is more averaging within each sequence of $n$ spikes. This means that the statistical error should be roughly equivalent for any value of $n$, this being the reason for keeping the total number of spikes fixed at $10^7$.

$S_n$ is, of course, a function of both $\lambda_\pi$ and $\lambda_p$. Previously, we set
$$
\lambda_p = \begin{cases} \lambda_\pi - \Delta & \text{for } \lambda_\pi \ge \Delta \\ 0 & \text{otherwise} \end{cases}, \qquad (8.1)
$$
as a simple model of postsynaptic firing with a threshold set at $\Delta = 5$ Hz. This gives a cross section through the $\lambda_\pi$–$\lambda_p$ plane. We will also show contour plots of $\frac{1}{n} S_n$ in the $\lambda_\pi$–$\lambda_p$ plane.

In Figure 1, we investigate how rapidly $\frac{1}{n} S_n$ approaches the asymptotic limit derived above. We use the $r = 1$ reset model for convenience, since we also have very simple, exact solutions against which to compare the numerical results. We see very rapid convergence to the asymptotic limit, so that even $n = 8$ gives the limit to within a few percent. This means that after $n \approx 10$ spikes, we expect to see the scaling property observed earlier for the one-reset model, so that doubling the spike train length merely doubles
Figure 2: Numerical values of $\frac{1}{n} S_n$ plotted as a function of $\lambda_\pi$ for the nonresetting model for various values of $n$. The solid lines again show the numerical results for $n = 2, 4, 8, 16, 32$, and 64 spikes, with the bottom line corresponding to $n = 2$, the next line up $n = 4$, and so on, $\frac{1}{n} S_n$ exhibiting the same monotonicity as a function of $n$ for fixed $\lambda_\pi$ as in the 1-reset model. The dashed line shows the exact, analytical result for the asymptotic limit of the nonresetting model.
the expected change in synaptic efficacy, adding no new qualitative features to the dynamics.

The results for the nonresetting model are shown in Figure 2. Here we do not plot the corresponding finite $n$ analytical results, but we show the analytical asymptotic limit. This figure demonstrates that the $r \to \infty$ limit of the asymptotic result for the $r$-reset model is indeed the asymptotic result for the nonresetting model.

We address the question of how rapidly the $r$-reset model converges to the nonresetting model as $r \to \infty$ in Figure 3. We see that even the 4-reset model is extremely close to the nonresetting model, and the 8-reset model is virtually indistinguishable from it. For $\lambda_\pi \approx \lambda_p$, we have that $\lambda_\pi \approx 0.5$ and $\lambda_p \approx 0.5$. Thus, the probability of a long sequence of identical spikes in this case is maximally suppressed, and the contributions of such sequences to $S_n$ are weighted by these small probabilities. However, to resolve the difference between the $r$-reset model and the $(r+1)$-reset model for large $r$, we require precisely such long sequences of identical spikes. Hence, we should not be too surprised by this rapid convergence of the $r$-reset model to the nonresetting model.

In Figures 4, 5, and 6 we show contour plots for the analytical form of $\frac{1}{n} S_n$ for the one-reset model for $n = 2$, $n = 3$, and the asymptotic limit $n \to \infty$, respectively. Contour plots for other values of $n$ are all very similar
Figure 3: The analytical asymptotic limit of $\frac{1}{n} S_n$ for the $r$-reset model plotted as a function of $\lambda_\pi$ for various values of $r$. Shown are results for $r = 1, 2, 4, 8, 16$, and the nonresetting model. The top curve corresponds to $r = 1$, the next one down $r = 2$, and so on, $\frac{1}{n} S_n$ monotonically decreasing as a function of $r$ for fixed $\lambda_\pi$. The curves for the 8-, 16-, and nonresetting models are virtually identical, almost superimposed on top of each other.
to those for $n = 3$ and $n \to \infty$. These trends are also observed, for example, for the nonresetting model. We see a very significant difference between the $n = 2$ landscape and that for all other values of $n$. This shows that although there tends to be an emphasis in the literature on the form of the two-spike rule, the multispike rules for $n > 2$, at least in our model, exhibit qualitatively very different dynamics. We explore these differences elsewhere, showing that they have important consequences for the presence and nature of competitive dynamics in these models (Appleby & Elliott, 2006).

9 Discussion

In this letter, we have extended our earlier ensemble-based, stochastic model of STDP in order to derive exact, analytical results for general $n$-spike interactions for $n > 2$. By extending the model to include stochastic process resetting, we essentially constructed a mathematical ladder that we could ascend in order to arrive at exact results for the nonresetting model. Once there, knowing how to extract the general $r$-reset model from the nonresetting model, we could then, to paraphrase Wittgenstein, kick away the ladder of resetting. As suggested earlier, the most biologically plausible forms of these various resetting models are, arguably, the 1-reset model and the nonresetting model. Other finite $r$, $r \ne 1$, forms require us to postulate
Figure 4: A contour plot of the function $\frac{1}{n} S_n$ for the 1-reset model, with $n = 2$, in the $\lambda_\pi$–$\lambda_p$ plane. Because the form of $S_2$ is independent of $r$, this is a contour plot of equation 2.8. Black areas represent minimum values and white areas maximum values, and shades of gray interpolate between these extremes. The minimum value on this partial plane is $-0.0122$, and the maximum value is $+0.0059$. The zero contour intersects both axes at values $\lambda_\pi, \lambda_p \approx 92$ Hz.
some kind of spike-counting machinery at individual synapses. Yet one of the original motivations for considering an ensemble-based model of STDP was an attempt to remove some of the computational burdens imposed on single synapses required by the assumption that they can, at some appropriate level, compute and represent complicated, temporally dependent quantities and adjust their strengths accordingly.

Although the general results for $\frac{1}{n} S_n$ for all but the 1-reset model are somewhat cumbersome, in fact we find rapid convergence of $\frac{1}{n} S_n$ to its asymptotic, large $n$ form, so that even by approximately 10 spikes, $\frac{1}{n} S_n$ is within just a few percent of the limiting result. Furthermore, in terms of the general form of the function $S_n$ in the $\lambda_\pi$–$\lambda_p$ plane, we find no qualitative changes in the structure of the landscape for any $n \ge 3$, at least for the parameter set used here. These two facts lead us to suppose that any properties exhibited by the asymptotic form of the $n$-spike function will also be exhibited by all the $n$-spike functions for $n \ge 3$. Of course, we expect quantitative differences, but the overall dynamics, as determined by the overall structure of the function, should be similar.
Figure 5: A contour plot of the function $\frac{1}{n} S_n$ for the 1-reset model, with $n = 3$, in the $\lambda_\pi$–$\lambda_p$ plane. The minimum value on this partial plane is $-0.0284$, and the maximum value is $+0.0342$. The zero contour intersects the $\lambda_\pi$ axis at $\lambda_\pi \approx 14$ Hz.
In our earlier work (Appleby & Elliott, 2005), we considered purely numerical simulations of $n$-spike interactions for $n > 2$. There, we looked specifically at the cross section of the $\lambda_\pi$–$\lambda_p$ landscape corresponding to equation 8.1 and found little quantitative difference between the 2-spike and the general $n$-spike results, all forms exhibiting a BCM-like profile with depression at low presynaptic rates switching over to potentiation for higher presynaptic rates. This observation reflects the highly specific cut through the $\lambda_\pi$–$\lambda_p$ plane. Other cuts reveal profound differences between the 2-spike and the general $n$-spike results, as seen in Figures 4, 5, and 6. The particular model of postsynaptic firing in equation 8.1 might be expected to be fairly reasonable (modulo issues of saturation, etc.) for a target cell innervated by precisely one afferent, so that the postsynaptic response follows closely the presynaptic input. However, for a target cell innervated by multiple afferents, it is entirely possible for there to be large differences between some of the presynaptic rates and the postsynaptic rate, because the postsynaptic rate will reflect all the inputs. We are then led to examine the whole $\lambda_\pi$–$\lambda_p$ plane, which reveals these differences.

Indeed, it is these differences between the 2-spike and the general $n$-spike interaction functions, observed initially numerically, that motivated our derivation of the above analytical results for the $n$-spike function. As we
Figure 6: A contour plot of the function $\frac{1}{n} S_n$ for the 1-reset model, in the asymptotic limit $n \to \infty$, in the $\lambda_\pi$–$\lambda_p$ plane. The minimum value on this partial plane is $-0.1071$, and the maximum value is $+0.1148$. The zero contour intersects the $\lambda_\pi$ axis at $\lambda_\pi \approx 8$ Hz.
discuss elsewhere (Appleby & Elliott, 2006), the 2-spike interaction function $S_2$ is, in fact, pathological, and without additional constraints or modifications, it is noncompetitive in nature. However, the general $n$-spike interaction function in our model for $n > 2$ leads to stable, competitive dynamics all by itself, without any additional constraints or modifications. This is a remarkable difference, all the more so because much of the experimental and theoretical literature on STDP is devoted to a study of the interaction between only two spikes. We thus wanted to understand this difference mathematically, and so needed to derive general, analytical results for the $n$-spike interaction function. The focus of this letter has been this rather lengthy derivation.

In deriving the results above, we have assumed independent pre- and postsynaptic Poisson firing at fixed rates $\lambda_\pi$ and $\lambda_p$. Two major issues arise from this assumption. First, synaptic plasticity will typically induce changes in the rate of postsynaptic firing, for fixed presynaptic rate, contrary to our assumption of fixed postsynaptic rate. Our assumption of fixed pre- and postsynaptic rates corresponds, however, to the standard experimental protocol in which the pre- and postsynaptic neurons are clamped and induced to fire at fixed rates, regardless of any associated changes in synaptic strength. Our results therefore represent exact predictions about the
multispike rules that may be observed under such experimental protocols. Furthermore, provided that the induced changes in synaptic strength in the unclamped situation are small compared to the overall synaptic strength, or, equivalently, provided that the change in the postsynaptic firing rate is small, then any discrepancy at second order between our fixed-rate rules and the rules allowing for variable rates will be negligible. Hence, our fixed-rate rules may be employed over sufficiently short epochs of firing, even though those epochs may result in small changes in rates, as we show elsewhere (Appleby & Elliott, 2006). Although we derive above asymptotic limits for our rules that entail a consideration of an infinite number of spikes (and thus infinite time), we have shown that the asymptotic limit is achieved very rapidly, within approximately 10 spikes. Presumably, 10 spikes are typically not enough to induce a large change in the postsynaptic firing rate.

Second, of course, postsynaptic spiking events are not really independent of presynaptic spiking events, contrary to our assumption. However, for a sufficiently large number of synaptic inputs from multiple afferents, the dependence of a postsynaptic spiking event on any single presynaptic spiking event will be negligible, and, to a first approximation, we can regard them as independent (although, of course, the rates will not be independent). Moreover, in replacing the postsynaptic Poisson neuron in our study with a full numerical implementation of an integrate-and-fire neuron, we find that the synaptic dynamics exhibited by the multispike rules in the independent Poisson case are qualitatively unchanged in the integrate-and-fire case (unpublished observations). Hence, the assumption of independence does not have major consequences for our results.

Having derived explicit expressions for the $n$-spike interaction functions, we may directly compare our switch model to competing models of STDP and the underlying experimental data. A variety of experiments have demonstrated that a biphasic, STDP-like curve apparently governs spike-pair interactions (Bell et al., 1997; Bi & Poo, 1998; Zhang et al., 1998). Reproduction of this 2-spike result has been adopted by the community as the basic test of the validity of a model of STDP. In keeping with this view, models of STDP, our own included, are often initially formulated as descriptions of 2-spike interactions, and we therefore expect to find that all such models are consistent with these results. Accordingly, we find that our switch model produces predictions comparable to other models of STDP when we consider the interaction of precisely two spikes (Song et al., 2000; van Rossum et al., 2000; Castellani et al., 2001; Senn et al., 2001; Karmarkar & Buonomano, 2002; Shouval et al., 2002). In addition, we find a requirement under our model that depression dominates potentiation overall, which has empirically been shown to be a requirement in other models of STDP (Song et al., 2000). Although various qualitative differences exist between the various STDP curves produced under different models, the quality of experimental data, in particular the level of noise around the
transition from potentiation to depression as the temporal ordering of pre- and postsynaptic spiking is reversed, makes it difficult to distinguish one model from another on the basis of 2-spike interactions alone.

We are thus led to examine the predictions of the different models of STDP when we consider more than two spikes. It is in this multispike domain that we find significant differences between the predictions of our model of STDP and those of competing models. In other models of STDP, a multispike train is (often implicitly) considered to be broken down into a series of spike pairs that are then summed in a linear manner according to the underlying STDP rule. This is not the case in our switch model. Instead, we find that a multispike train carries with it higher-order spike interactions that are not present if the train is broken down into a linear sum of spike pairs. These higher-order spike interactions, which are not explicitly built into the model but arise naturally from its intrinsic structure, bestow the 3-spike function with a very different character compared to that of the 2-spike function. We find that these higher-order interactions are sufficient to explain the spike-triplet data of Froemke and Dan (2002) (Appleby & Elliott, 2005), a result that other models of STDP can achieve only by the introduction of additional nonlinearities such as spike suppression (Froemke & Dan, 2002).

Even in retrospect, it is astonishing that many of our results can be derived exactly. This is partly due to the use of a stochastic model with intrinsic Markovian properties and partly because the model is actually quite simple in its overall structure. Simplicity and analytical tractability are generally powerful virtues of mathematical models, even when those models are known to be only very crude approximations of the underlying dynamics being studied. It is their capacity for explicit, mathematical understanding that makes such models so useful and permits the development of general intuitions about the manner in which the models behave. In contrast, more complicated models are often ill understood, and resort to purely numerical simulation is frequently necessary even at the very first step. Here we have a trade-off between understanding and the accuracy of any predictions resulting from the two types of model. In a mature subject such as physics, the simple models have been built, studied, and used to develop intuitions, and more complicated models can then be constructed that permit highly detailed (if poorly understood) predictions to be made about fundamental processes. In neurobiology, however, our general belief is that theoretical studies are still in their infancy, and therefore models still need to be simple and, above all, mathematically tractable in order to seek to develop the necessary conceptual framework with which more complicated systems can then be studied.

The $n$-spike interaction functions derived above are unconditional expectation values that represent, in essence, rate-based learning rules in which the underlying spiking processes have been integrated out and summed over. Such rate-based rules are, as we described them before (Appleby &
Elliott, 2005), doubly emergent, because they depend on the emergence of STDP at the synaptic or temporal ensemble level in our model, and then this STDP is itself averaged over time and over many spike patterns, giving rise to the rate-based behavior. Of course, the spike-based behavior is more fundamental, and in particular it more accurately describes the behavior of the neuron or synapse over short time courses and exhibits the transition to rate-based-like behavior over longer time courses. Nevertheless, if we are interested in studying, for example, specifically long-term plasticity phenomena (such as the development of cortical structures like ocular dominance columns), it is, we would argue, more appropriate to analyze the behavior in terms of the emergent, rate-based rules.

There is a temptation to believe that a model is inadequate if it does not capture every last detail of all known processes. However, finding the appropriate level of approximation for studying a system is often a much more powerful strategy, reducing complexity and increasing understanding. For example, it would be pointless to argue that fluids are not really incompressible and are actually made of molecules, atoms, and, ultimately, quarks and leptons, and seek to replace the Navier-Stokes equations with equations in terms of quantum fields. Such a fine-grained analysis is unnecessary precisely because of emergence. We therefore wonder whether the emergent rate-based rules in our model are not also a more appropriate level of description for understanding certain forms of synaptic plasticity.

In conclusion, we have modified our earlier model of STDP in order to derive exact, analytical results for the $n$-spike interaction function. Elsewhere (Appleby & Elliott, 2006), we focus on the profound differences between the 2-spike interaction function and higher-order interaction functions and the consequences for the presence of stable, competitive dynamics in our model.

Acknowledgments

P.A.A. thanks the University of Southampton for the support of a studentship. T.E. thanks the Royal Society for the support of a University Research Fellowship during the initial stages of this work.

References

Appleby, P. A., & Elliott, T. (2005). Synaptic and temporal ensemble interpretation of spike timing dependent plasticity. Neural Comput., 17, 2316–2336.

Appleby, P. A., & Elliott, T. (2006). Stable competitive dynamics emerge from multispike interactions in a stochastic model of spike-timing-dependent plasticity. Neural Comput., 18, 2414–2464.

Bell, C. C., Han, V. Z., Sugawara, Y., & Grant, K. (1997). Synaptic plasticity in a cerebellum-like structure depends on temporal order. Nature, 387, 278–281.
Bi, G. Q., & Poo, M. M. (1998). Synaptic modifications in cultured hippocampal neurons: Dependence on spike timing, synaptic strength, and postsynaptic cell type. J. Neurosci., 18, 10464–10472.

Bienenstock, E. L., Cooper, L. N., & Munro, P. W. (1982). Theory for the development of neuron selectivity: Orientation specificity and binocular interaction in visual cortex. J. Neurosci., 2, 32–48.

Burkitt, A. N., Meffin, H., & Grayden, D. B. (2004). Spike-timing-dependent plasticity: The relationship to rate-based learning for models with weight dynamics determined by a stable fixed point. Neural Comput., 16, 885–940.

Castellani, G. C., Quinlan, E. M., Cooper, L. N., & Shouval, H. Z. (2001). A biophysical model of bidirectional synaptic plasticity: Dependence on AMPA and NMDA receptors. Proc. Natl. Acad. Sci. USA, 98, 12772–12777.

Froemke, R. C., & Dan, Y. (2002). Spike-timing-dependent synaptic modification induced by natural spike trains. Nature, 416, 433–438.

Izhikevich, E. M., & Desai, N. S. (2003). Relating STDP to BCM. Neural Comput., 15, 1511–1523.

Karmarkar, U. R., & Buonomano, D. V. (2002). A model of spike-timing dependent plasticity: One or two coincidence detectors? J. Neurophysiol., 88, 507–513.

O'Connor, D. H., Wittenberg, G. M., & Wang, S. S.-H. (2005). Graded bidirectional synaptic plasticity is composed of switch-like unitary events. Proc. Natl. Acad. Sci. USA, 102, 9679–9684.

Petersen, C. C. H., Malenka, R. C., Nicoll, R. A., & Hopfield, J. J. (1998). All-or-none potentiation at CA3-CA1 synapses. Proc. Natl. Acad. Sci. USA, 95, 4732–4737.

Senn, W., Markram, H., & Tsodyks, M. (2001). An algorithm for modifying neurotransmitter release probability based on pre- and postsynaptic spike timing. Neural Comput., 13, 35–67.

Shouval, H. Z., Bear, M. F., & Cooper, L. N. (2002). A unified model of NMDA receptor-dependent bidirectional synaptic plasticity. Proc. Natl. Acad. Sci. USA, 99, 10831–10836.

Song, S., Miller, K., & Abbott, L. F. (2000). Competitive Hebbian learning through spike-timing-dependent synaptic plasticity. Nat. Neurosci., 3, 919–926.

van Rossum, M. C. W., Bi, G. Q., & Turrigiano, G. G. (2000). Stable Hebbian learning from spike timing-dependent plasticity. J. Neurosci., 20, 8812–8821.

Zhang, L. I., Tao, H. W., Holt, C. E., Harris, W. A., & Poo, M. M. (1998). A critical window for cooperation and competition among developing retinotectal synapses. Nature, 395, 37–44.
Received September 7, 2005; accepted August 2, 2006.
LETTER
Communicated by Yaser Yacoob
3D Periodic Human Motion Reconstruction from 2D Motion Sequences
Zonghua Zhang
[email protected]
Nikolaus F. Troje
[email protected]
BioMotionLab, Department of Psychology, Queen's University, Kingston, Ontario K7M 3N6, Canada
We present and evaluate a method of reconstructing three-dimensional (3D) periodic human motion from two-dimensional (2D) motion sequences. Using Fourier decomposition, we construct a compact representation for periodic human motion. A low-dimensional linear motion model is learned from a training set of 3D Fourier representations by means of principal components analysis. Two-dimensional test data are projected onto this model with two approaches: least-square minimization and calculation of a maximum a posteriori probability using Bayes' rule. We present two different experiments in which both approaches are applied to 2D data obtained from 3D walking sequences projected onto a plane. In the first experiment, we assume the viewpoint is known. In the second experiment, the horizontal viewpoint is unknown and is recovered from the 2D motion data. The results demonstrate that by using the linear model, not only can missing motion data be reconstructed, but unknown view angles for 2D test data can also be retrieved.

Neural Computation 19, 1400–1421 (2007)
C 2007 Massachusetts Institute of Technology

1 Introduction

Human motion contains a wealth of information about the actions, intentions, emotions, and personality traits of a person, and human motion analysis has widespread applications in surveillance, computer games, sports, rehabilitation, and biomechanics. Since the human body is a complex articulated geometry overlaid with deformable tissues and skin, motion analysis is a challenging problem for artificial vision systems. General surveys of the many studies on human motion analysis can be found in recent review articles (Gavrila, 1999; Aggarwal & Cai, 1999; Buxton, 2003; Wang, Hu, & Tan, 2003; Moeslund & Granum, 2001; Aggarwal, 2003; Dariush, 2003). The existing approaches to human motion analysis can be roughly divided into two categories: model-based methods and model-free methods. In the model-based methods, an a priori human model is used to represent the observed
subjects; in model-free methods, the motion information is derived directly from a sequence of images. The main drawback of model-free methods is that they are usually designed to work with images taken from a known viewpoint. Model-based approaches support viewpoint-independent processing and have the potential to generalize across multiple viewpoints (Moeslund & Granum, 2001; Cunado, Nixon, & Carter, 2003; Wang, Tan, Ning, & Hu, 2003; Jepson, Fleet, & El-Maraghi, 2003; Ning, Tan, Wang, & Hu, 2004).

Most of the existing research (Gavrila, 1999; Aggarwal & Cai, 1999; Buxton, 2003; Wang, Hu, & Tan, 2003; Moeslund & Granum, 2001; Aggarwal, 2003; Dariush, 2003; Cunado et al., 2003; Wang, Tan et al., 2003; Jepson et al., 2003; Ning et al., 2004) has been focused on the problem of tracking and recognizing human activities through motion sequences. In this context, the problem of reconstructing 3D human motion from 2D motion sequences has received increasing attention (Rosales, Siddiqui, Alon, & Sclaroff, 2001; Sminchisescu & Triggs, 2003; Urtasun & Fua, 2004a, 2004b; Bowden, 2000; Bowden, Mitchell, & Sarhadi, 2000; Ong & Gong, 2002; Yacoob & Black, 1999). The importance of 3D motion reconstruction stems from applications such as surveillance and monitoring, human body animation, and 3D human-computer interaction. Unlike 2D motion, which is highly view dependent, 3D human motion can provide robust recognition and identification. However, existing systems need multiple cameras to get 3D motion information. If 3D motion can be reconstructed from a single camera viewpoint, there are many potential applications. For instance, using 3D motion reconstruction, one could create a virtual actor from archival footage of a movie star, a difficult task for even the most skilled modelers and animators. 3D motion reconstruction could be used to track human body activities in real time (Arikan & Forsyth, 2002; Grochow, Martin, Hertzmann, & Popović, 2004; Chai & Hodgins, 2005; Agarwal & Triggs, 2004; Yacoob & Black, 1999). Such a system may be used as a new and effective human-computer interface for virtual reality applications.

In the model of Kakadiaris and Metaxas (2000), motion estimation of human movement was obtained from multiple cameras. Bowden and colleagues (Bowden, 2000; Bowden et al., 2000) used a statistical model to reconstruct 3D postures from monocular image sequences. Just as a 3D face can be reconstructed from a single image using a morphable model (Blanz & Vetter, 2003), they reconstructed the 3D structure of a subject from a single view of its outline. Ong and Gong (2002) discussed three main issues in the linear combination method: choosing the examples to build a model, learning the spatiotemporal constraints on the coefficients, and estimating the coefficients. They applied their method to track moving 3D skeletons of humans. These models (Bowden, 2000; Bowden et al., 2000; Ong & Gong, 2002) are based on separate poses and do not use the temporal
information that connects them. The result is a series of reconstructed 3D human postures. Using principal components analysis (PCA), Yacoob and Black (1999) built a parameterized model for image sequences to model and recognize activity. Urtasun and Fua (2004a) presented a motion model to track the human body and then (Urtasun & Fua, 2004b) extended it to characterize and recognize people by their activities. Leventon and Freeman (1998) and Howe, Leventon, and Freeman (1999) studied reconstruction of human motion from image sequences using Bayes’ rule, which is in many respects similar to our approach. However, they did not present a quantitative evaluation of the 3D motion reconstructions corresponding to the missing dimension. For viewpoint reconstruction from motion data, some researchers (Giese & Poggio, 2000; Agarwal & Triggs, 2004; Ren, Shakhnarovich, Hodgins, Pfister, & Viola, 2005) quantitatively evaluated the performance for their motion model. Incorporating temporal data into model-based methods requires correspondence-based representations, which separate the overall information into range-specific information and domain-specific information (Ramsay & Silverman, 1997; Troje, 2002a). In the case of biological motion data, range-specific information refers to the state of the actor at a given time in terms of the location of a number of feature points. Domainspecific information refers to when a given position occurs. Yacoob and Black (1999) addressed temporal correspondence in their walking data by assuming that all examples had been temporally aligned. Urtasun and Fua (2004a, 2004b) chose one walking cycle with the same number of samples. Giese and Poggio (2000) presented a learning-based approach for the representation of complex motion patterns based on linear combinations of prototypical motion sequences. Troje (2002a) developed a framework that transformed biological motion data into a linear representation using PCA. This representation was used to construct a sex classifier with a reasonable classification performance. These authors (Troje, 2002a; Giese & Poggio, 2000) pointed out that finding spatiotemporal correspondences between motion sequences is the key issue for the development of efficient models that perform well on reconstruction, recognition, and classification tasks. Establishing spatiotemporal correspondences and using them to register the data to a common prototype is a prerequisite for designing a generative linear motion model. Our input data consist of the motion trajectories of discrete marker points, that is, spatial correspondence is basically solved. Since we are working with periodic movements—normal walking—temporal correspondence can be established by a simple linear time warp, which is defined in terms of the frequency and the phase of the walking data. This simple linear warping function can get much more complex when dealing with nonperiodic movement, and suggestions on how to expand our approach to other movements are discussed below.
In this letter, human walking is chosen as an example to study 3D periodic motion reconstruction from 2D motion sequences. On the one hand, locomotion patterns such as walking and running are periodic and highly stereotyped. On the other hand, walking contains information about the individual actor. Keeping the body upright and balanced during locomotion takes a high level of interaction of the central nervous system, sensory system (including proprioceptive, vestibular, and visual systems), and motor control systems. The solutions to the problem of generating a stable gait depend on the masses and dimensions of the particular body and its parts and are therefore highly individualized. Therefore, human walking is characterized not only by abundant similarities but also by stylistic variations. Given a set of walking data represented in terms of a motion model, the similarities are represented by the average motion pattern, while the variations are expressed in terms of the covariance matrix. Principal component analysis can be used to find a low-dimensional, orthonormal basis system that would efficiently span a motion space. Individual walking patterns are approximated in terms of this orthonormal basis system. Here, we use PCA to create a representation that is able to capture the redundancy in gait patterns in an efficient and compact way, and we test the resulting model’s ability to reconstruct missing parts of the full representation. The vision problem is not the primary purpose of this letter, and we make two assumptions to stay focused on the problem of 3D reconstruction. First, we assume that the problem of tracking feature points in the 2D image plane is solved. We represent human motion in terms of trajectories of a series of discrete markers located at the main joints of the body. Figure 1 illustrates the locations of these markers. Second, 2D motion sequences are assumed to be orthographic projections of 3D motion. Both assumptions are not critical to the general idea outlined here. In particular, the tracking problem can be treated completely independently, and there are many researchers working on this problem (Jepson et al., 2003; Ning et al., 2004; Urtasun & Fua, 2004a, 2004b; Karaulova, Hall, & Marshall, 2002). We systematically develop a theory for recovering 3D walking data from 2D motion sequences and evaluate it using a cross-validation procedure. Single 2D test samples are generated by orthographically projecting a walker from our 3D database. We then construct a model based on the remaining walkers and project the test set onto the model. The resulting 3D reconstruction is then evaluated by comparing it to the original 3D data of this walker. This procedure is iterated through the whole data set. Section 2 briefly describes data acquisition with a marker-based motion capture system. The details of the linear motion model and 3D motion reconstruction are given in section 3. We run our algorithm and evaluate the reconstructions in section 4. Finally, conclusions and possible future extensions are discussed in section 5.
Figure 1: Human motion representation by joints. The 15 virtual markers are located at the major joints of the body (shoulders, elbows, wrists, hips, knees, ankles), the sternum, the center of the pelvis, and the center of head.
2 Walking Data Acquisition

Eighty participants (32 males and 48 females) served as subjects to acquire walking data. Their ages ranged from 13 to 59 years, and the average age was 26 years. Participants wore swimming suits, and a set of 41 retroreflective markers was attached to their bodies. Participants were requested to walk on a treadmill, and they adjusted the speed of the belt to the rate that felt most comfortable. The 3D trajectories of the markers were recorded using an optical motion capture system (Vicon; Oxford Metrics, Oxford, UK) equipped with nine CCD high-speed cameras. The system tracked the markers with a submillimeter spatial resolution and a sampling rate of 120 Hz. From the trajectories of these markers, we computed the location of 15 virtual markers according to a standard biomechanical model (BodyBuilder, Oxford Metrics), as illustrated in Figure 1. The virtual markers were located at the major joints of the body (shoulders, elbows, wrists, hips, knees, ankles), the sternum, the center of the pelvis, and the center of head (for more details, see Troje, 2002a).
3 Motion Reconstruction

In our 3D motion reconstruction method, a linear motion model is first constructed from Fourier representations of human walking examples by PCA, and the missing motion data are then reconstructed from the linear motion model using two approaches: least-squares minimization via the pseudoinverse, and calculation of a maximum a posteriori (MAP) estimate using Bayes' rule.

3.1 Linear Motion Model. The collected walking data can be regarded as a set of time series of postures p_r(t), t = 1, 2, ..., T_r, r = 1, 2, ..., R, represented by the major joints, where R is the number of walkers and T_r is the number of sampled postures for walker r. Because each joint has three coordinates, the representation of a posture p_r(t) consisting of 15 joints is a 45-dimensional vector. Human walking can be efficiently described in terms of low-order Fourier series (Unuma, Anjyo, & Takeuchi, 1995; Troje, 2002b). A compact representation of a particular walker consists of the average posture (p_0), the characteristic postures of the fundamental frequency (p_1, q_1) and the second harmonic (p_2, q_2) of a discrete Fourier expansion, and the fundamental frequency (\omega):

p(t) = p_0 + p_1 \sin(\omega t) + q_1 \cos(\omega t) + p_2 \sin(2\omega t) + q_2 \cos(2\omega t) + \mathrm{err}.   (3.1)

The power carried by the residual term err is less than 3% of the overall power of the input data, and we discard it from further computations. Since the average posture and each of the characteristic postures are 45-dimensional vectors, the dimensionality of a 3D Fourier representation at this stage is 45 × 5 = 225. For each specific walking pattern p_r(t), we thus obtain a 3D Fourier representation w_r:

w_r = (p_{0,r}, p_{1,r}, q_{1,r}, p_{2,r}, q_{2,r}).   (3.2)
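To make this construction concrete, the following sketch (in Python with NumPy; the function and variable names are illustrative and not taken from the authors' implementation) computes the five Fourier components of equations 3.1 and 3.2 for one walker, assuming the posture sequence covers an approximately integer number of gait cycles.

```python
import numpy as np

def fourier_representation(postures, omega, dt=1.0 / 120.0):
    """Build the 225-dim Fourier representation of one walker.

    postures: array of shape (T, 45), one 45-dim posture per frame
    omega:    fundamental angular frequency of the gait (rad/s)
    dt:       sampling interval (120 Hz capture assumed)
    Returns (p0, p1, q1, p2, q2) stacked into one vector, cf. eq. 3.2.
    """
    T = postures.shape[0]
    t = np.arange(T) * dt
    p0 = postures.mean(axis=0)                        # average posture
    # Discrete Fourier coefficients of the first two harmonics (eq. 3.1);
    # the 2/T normalization is exact for whole numbers of cycles:
    p1 = 2.0 / T * (postures.T @ np.sin(omega * t))
    q1 = 2.0 / T * (postures.T @ np.cos(omega * t))
    p2 = 2.0 / T * (postures.T @ np.sin(2 * omega * t))
    q2 = 2.0 / T * (postures.T @ np.cos(2 * omega * t))
    return np.concatenate([p0, p1, q1, p2, q2])       # 5 * 45 = 225 entries
```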
The advantage of Fourier representation is that it allows us to apply linear operations and easily determine temporal correspondence. Temporal information is included in the frequency and phase. After computing the frequency and phase of a walking series by Fourier analysis, we can represent the walking series with zero phase and frequency-independent characteristic postures, as shown in equation 3.2. Every 3D Fourier representation of a walker can be treated as a point in a 225-dimensional linear space. PCA is applied to all the Fourier representations in order to learn the principal motion variations and reduce
dimensionality further. To do so, all of the Fourier representations are concatenated into a matrix W, with each column containing one walker w_r, comprising the parameters p_{0,r}, p_{1,r}, q_{1,r}, p_{2,r}, and q_{2,r} stacked vertically. Computing PCA on the matrix W results in a decomposition of each parameter set w of a walker into an average walker \bar{w} and a series of orthogonal eigenwalkers e_1, ..., e_N:

w = \bar{w} + \sum_{n=1}^{N} k_n e_n,   (3.3)

where \bar{w} = (1/R) \sum_r w_r denotes the average value of all the R columns, and k_n is the projection onto eigenwalker e_n.
A linear motion model is spanned by the first N eigenwalkers e_1, ..., e_N, which represent the principal variations. The exact dimensionality N of the model depends on the required accuracy of the approximation but will generally be much smaller than R. The Fourier representation of a walker can now be represented as a linear combination of eigenwalkers by using the obtained coefficients k_1, ..., k_N.

3.2 Reconstruction. We denote a 2D motion sequence as \hat{p}(t), t = 1, 2, ..., T, which is represented in terms of its discrete Fourier components:

\hat{w} = (\hat{p}_0, \hat{p}_1, \hat{q}_1, \hat{p}_2, \hat{q}_2).   (3.4)
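The model construction of section 3.1 can be summarized compactly in code. The sketch below (illustrative names, NumPy assumed) uses a singular value decomposition of the centered data matrix and also returns the eigenvalues λ_i that serve as prior variances in approach II below.

```python
import numpy as np

def build_motion_model(W, N=30):
    """PCA on the 225 x R matrix of Fourier representations (eq. 3.3).

    W: array of shape (225, R), one walker per column
    Returns (w_mean, E, lam): mean walker (225,), eigenwalkers E (225, N),
    and eigenvalues lam (N,), i.e., the variances along each eigenwalker.
    """
    R = W.shape[1]
    w_mean = W.mean(axis=1)
    X = W - w_mean[:, None]                    # centered data matrix
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    E = U[:, :N]                               # orthonormal eigenwalkers
    lam = (s[:N] ** 2) / R                     # eigenvalues of the covariance
    return w_mean, E, lam
```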
The average posture \hat{p}_0 and the characteristic postures \hat{p}_1, \hat{q}_1, \hat{p}_2, and \hat{q}_2 contain only 2D joint position information. These postures are concatenated into a column vector \hat{w} with 30 × 5 = 150 entries. We call this a 2D Fourier representation. Supposing a projection matrix C relates the 2D Fourier representation to its 3D Fourier representation, reconstructing the full 3D motion means finding the right solution w to the equation

\hat{w} = C w,   (3.5)

where C is the 150 × 225 projection matrix mapping the 225-dimensional space onto the 150-dimensional one. At this point we assume C is known; later we consider C a function of an unknown horizontal viewpoint. Equation 3.5 is an underdetermined equation system and can be solved only if additional constraints are formulated. We constrain the possible solution to be a linear combination of eigenwalkers in the motion model outlined in the previous section, so the problem is how to calculate a set of coefficients k = [k_1, ..., k_N]. The solution w is then the linear combination of the eigenwalkers e_1, ..., e_N with the obtained coefficients as the corresponding weights. Two calculation approaches are explored to find these coefficients.
3.2.1 Approach I. Since a 3D Fourier representation of the walker w can be represented as a linear combination of eigenwalkers with a set of coefficients k_n, substituting equation 3.3 into equation 3.5 gives

\hat{w} = C \left( \bar{w} + \sum_{n=1}^{N} k_n e_n \right).   (3.6)

Denoting \hat{\bar{w}} = C\bar{w} and \hat{e}_n = C e_n, equation 3.6 can be rewritten as

\hat{w} - \hat{\bar{w}} = \sum_{n=1}^{N} k_n \hat{e}_n,   (3.7)

or, using matrix notation, as

\hat{w} - \hat{\bar{w}} = \hat{E} k.   (3.8)

Equation 3.8 contains 150 equations with N unknown coefficients k = [k_1, ..., k_N]. N is smaller than R, the number of walkers; for our calculations, we used a value of N = 30. Equation 3.8 is therefore an overdetermined linear equation system. We approximate a solution according to a least-squares criterion using the pseudoinverse:

k = (\hat{E}^T \hat{E})^{-1} \hat{E}^T (\hat{w} - \hat{\bar{w}}).   (3.9)
3.2.2 Approach II. In actual situations, due to measurement noise in the 2D motion sequences and the incompleteness of the training examples, the coefficients obtained from least-squares minimization may push the 3D reconstruction far beyond the range of the training data. In order to avoid this overfitting, Bayes' rule is used to trade off quality of match against prior probability. According to Bayes' rule, the posterior probability is

p(k | \hat{w}) \propto p(k) \, p(\hat{w} | k).   (3.10)

The prior probability p(k) reflects prior knowledge of the possible values of the coefficients. It can be calculated from the linear motion model above. Assuming a normal distribution along each of the eigenvectors, the prior probability is

p(k) \propto \prod_{i=1}^{N} \exp\left( -k_i^2 / (2\lambda_i) \right) = \exp\left( -\sum_{i=1}^{N} k_i^2 / (2\lambda_i) \right),   (3.11)
where \lambda_i, i = 1, ..., N, is the eigenvalue corresponding to the eigenwalker e_i. Because the variations along the eigenvectors are uncorrelated within the set of walkers, the product of the individual densities can be used. Our goal is to determine the set of coefficients k = [k_1, ..., k_N]^T for the 2D Fourier representation \hat{w} with maximum probability in the 3D linear space. We assume that each dimension of the 2D Fourier representation \hat{w} is subject to uncorrelated gaussian noise with variance \sigma^2. Then the likelihood of measuring \hat{w} is given by

p(\hat{w} | k) \propto \exp\left( -\| \hat{w} - \hat{\bar{w}} - \hat{E} k \|^2 / (2\sigma^2) \right),   (3.12)

where \hat{\bar{w}} = C\bar{w} is the projection of the average motion and \hat{E} = CE denotes the projection of the eigenwalkers. According to equation 3.10, the posterior probability is

p(k | \hat{w}) \propto \exp\left( -\| \hat{w} - \hat{\bar{w}} - \hat{E} k \|^2 / (2\sigma^2) - \sum_{i=1}^{N} k_i^2 / (2\lambda_i) \right).   (3.13)

The maximum of equation 3.13, corresponding to the maximum a posteriori (MAP) estimate, can be found by minimizing the negative log posterior:

k_{MAP} = \arg\min_k \left( \| \hat{w} - \hat{\bar{w}} - \hat{E} k \|^2 / (2\sigma^2) + \sum_{i=1}^{N} k_i^2 / (2\lambda_i) \right).   (3.14)

The optimal estimate k_{opt} is then calculated (see appendix A for computational details) as

k_{opt} = k_{MAP} = \left( \mathrm{diag}(\sigma^2 / \lambda) + \hat{E}^T \hat{E} \right)^{-1} \hat{E}^T (\hat{w} - \hat{\bar{w}}),   (3.15)

where diag(x) is an operation that produces a square matrix whose diagonal elements are the elements of the vector x. When the eigenvalue \lambda_i is large, the effect of the variance \sigma^2 on the coefficient k_i in the optimal estimate is smaller than for coefficients with small eigenvalues. After the coefficients k have been estimated using one of the two proposed approaches, the missing information \tilde{w} = (\tilde{p}_0, \tilde{p}_1, \tilde{q}_1, \tilde{p}_2, \tilde{q}_2) of the Fourier representation of a 2D walker is synthesized as the respective linear combination of the 3D eigenwalkers e_1, ..., e_N. The reconstructed Fourier representation w is the combination of the 2D Fourier representation \hat{w} and the reconstructed \tilde{w}:

w = ([\hat{w}, \tilde{w}]) = ([\hat{p}_0, \tilde{p}_0], [\hat{p}_1, \tilde{p}_1], [\hat{q}_1, \tilde{q}_1], [\hat{p}_2, \tilde{p}_2], [\hat{q}_2, \tilde{q}_2]).   (3.16)
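Approach II differs from approach I only by the regularizing diagonal term in equation 3.15. A minimal sketch under the same assumptions as above (the default σ² = 70 mm² is the optimal value reported in section 4.2; names are illustrative):

```python
import numpy as np

def reconstruct_approach2(w_hat, w_mean, E, C, lam, sigma2=70.0):
    """MAP coefficients of eq. 3.15 (sigma2 in mm^2, cf. section 4.2)."""
    E_hat = C @ E                               # projected eigenwalkers
    rhs = E_hat.T @ (w_hat - C @ w_mean)
    A = np.diag(sigma2 / lam) + E_hat.T @ E_hat # regularized normal matrix
    k_opt = np.linalg.solve(A, rhs)             # eq. 3.15
    return w_mean + E @ k_opt                   # full 225-dim reconstruction
```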
3D periodic human motion is obtained from the reconstructed Fourier representation w by using equation 3.1.

4 Experiments

Using walking data acquired with a marker-based motion capture system, as described in section 2, we conducted two experiments. 2D test sequences were created by orthographic projection of the 3D walkers onto a vertical plane. We used viewpoints that ranged from the left profile view to the frontal view and from there to the right profile view in 1 degree steps. In the first experiment, we tried to reconstruct the missing third dimension, assuming that the view angle of the 2D test walker was known. In the second experiment, we assumed the horizontal viewpoint to be unknown and tested whether we could retrieve it together with the 3D motion. In both experiments, we used a complete leave-one-out cross-validation procedure: one walker is set aside for testing when creating the linear motion model and is later projected onto it and evaluated. This is then repeated for every walker in the data set; a sketch of this loop follows below.

4.1 Model Construction. Fourier representations were created separately for each walker. On average, the fundamental frequency accounted for 91.9% of the total variance, and the second harmonic accounted for another 6.0%, so the first two harmonics explained 97.9% of the overall postural variance of a walker. The set of all the Fourier representations was then submitted to a PCA. The first 30 principal components accounted for 95.5% of the variance of all the Fourier representations, and they were chosen as eigenwalkers in the linear motion model (see Figure 2).
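The evaluation loop described above can be written as a straightforward leave-one-out iteration. This sketch reuses the hypothetical helper functions build_motion_model and reconstruct_approach2 from the earlier sketches and, for brevity, scores the full representation rather than only the missing dimension.

```python
import numpy as np

def leave_one_out(W, C, N=30, sigma2=70.0):
    """Leave-one-out cross-validation over the walker set (section 4).

    W: 225 x R matrix of 3D Fourier representations, one walker per column.
    C: 150 x 225 orthographic projection matrix for the chosen viewpoint.
    Returns the mean squared reconstruction error per walker.
    """
    R = W.shape[1]
    errors = np.empty(R)
    for r in range(R):
        train = np.delete(W, r, axis=1)              # model without walker r
        w_mean, E, lam = build_motion_model(train, N)
        w_hat = C @ W[:, r]                          # 2D test sample
        w_rec = reconstruct_approach2(w_hat, w_mean, E, C, lam, sigma2)
        errors[r] = np.mean((w_rec - W[:, r]) ** 2)
    return errors
```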
4.2 Motion Reconstruction. For approach II, in order to determine the MAP for a given 2D motion sequence, we calculated the optimal variance \sigma^2 (see equation 3.15). Since the 2D motion sequence was the orthographic projection of a 3D walker onto a vertical plane, the actual motion in the missing dimension was known and could be compared directly to the reconstructed data. We define an absolute error for each joint, comparing the reconstructed data with the original data, by

E_{abs}(j) = \frac{1}{T} \sum_{t=1}^{T} \left( p^j(t) - \tilde{p}^j(t) \right)^2,   (4.1)

where p^j(t) and \tilde{p}^j(t) are the original and reconstructed data for the jth (j = 1, 2, ..., 15) joint in the missing dimension. The absolute reconstruction error of one walker is the average value of the absolute errors for the 15 joints:

E_a = \frac{1}{15} \sum_{j=1}^{15} E_{abs}(j).   (4.2)
Figure 2: The cumulative variance covered by the principal components computed across 80 walkers.
We calculated the average reconstruction errors for the 80 walkers for variances (\sigma^2 in equation 3.12) ranging from 1 to 625 mm². Figure 3 illustrates the results. The optimal variance for all the walkers was about \sigma^2 = 70 mm². This value was used for approach II in all subsequent experiments. To calculate a 3D Fourier representation, we projected the 2D Fourier representation onto the linear motion model and obtained a set of coefficients using the two proposed approaches. In order to evaluate the reconstruction qualitatively, we displayed all the subjects reconstructed from three viewing angles: 0, 30, and 90 degrees. For each demonstration, the original and reconstructed motion data were superimposed and displayed from a direction orthogonal to the direction along which motion data were missing. Figure 4 illustrates the original and the reconstructed motion sequences for one subject from the three viewing angles. The figure shows five equally spaced postures in one motion cycle. The original and the reconstructed walking sequences appear very similar in the corresponding postures, and there are no obvious differences between the two calculation approaches.
Figure 3: The effect of the variance \sigma^2 on the average reconstruction error.
The same results were inspected visually in all 240 demonstrations; no obvious differences could be seen. In order to provide a more quantitative evaluation, equation 4.2 was used to calculate the absolute reconstruction error. From different viewpoints, the 2D projections have different variances, and therefore a relative reconstruction error is also defined for each joint:

E_{rel}(j) = \frac{1}{T\sigma^2} \sum_{t=1}^{T} \left( p^j(t) - \tilde{p}^j(t) \right)^2,   (4.3)
where \sigma^2 is the average overall variance of the 15 joints in the missing dimension. The relative reconstruction error of one walker is the average value of the relative errors for the 15 joints:

E_r = \frac{1}{15} \sum_{j=1}^{15} E_{rel}(j).   (4.4)
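The error measures of equations 4.1 to 4.4 reduce to a few array operations. The sketch below (illustrative names, NumPy assumed) computes both the absolute and the relative error of one walker.

```python
import numpy as np

def reconstruction_errors(p, p_rec, var_missing):
    """Absolute and relative reconstruction errors, eqs. 4.1 to 4.4.

    p, p_rec:    arrays of shape (T, 15), original and reconstructed
                 coordinates of the 15 joints in the missing dimension
    var_missing: average overall variance of the 15 joints
                 (sigma^2 in eq. 4.3)
    """
    per_joint = np.mean((p - p_rec) ** 2, axis=0)   # E_abs(j), eq. 4.1
    E_a = per_joint.mean()                          # eq. 4.2
    E_r = (per_joint / var_missing).mean()          # eqs. 4.3 and 4.4
    return E_a, E_r
```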
The absolute and relative reconstruction errors were calculated for all walkers and for different viewpoints using the two approaches. 3D walking data were reconstructed at 1 degree intervals from the left profile view to the right profile view. Figure 5 shows absolute average errors, relative average errors, and the overall variance of the missing dimension as a function of horizontal viewpoint (−90 to 90 degrees).
Figure 4: Original and reconstructed walking data from three viewing angles (0, 30, and 90 degrees, corresponding to the top, middle, and bottom rows, respectively). The dashed lines and the solid lines are the original and the reconstructed postures in the motion sequence, respectively. (a) Approach I. (b) Approach II.
The solid and dashed curves in Figures 5a and 5b correspond to approach I and approach II, respectively. When the prior probability is included in the calculation, the error of the 3D motion reconstruction is about 30% smaller than the error obtained with approach I. As can be seen from Figure 5a, the absolute reconstruction errors differ between the negative and positive oblique viewpoints, which implies that human walking is slightly asymmetric. The small asymmetry of the variances in Figure 5c confirms this point.
Figure 5: Illustration of the average reconstruction errors from different viewpoints of −90 to 90 degrees for 80 walkers. (a) Absolute errors by two approaches. (b) Relative errors by two approaches. (c) Variance.
Figure 5b illustrates that reconstruction based on frontal and oblique viewpoints produces small relative reconstruction errors, while reconstruction from the profile view generates large relative reconstruction errors. The main reason is that the overall variance that needs to be recovered from the profile view is relatively small to begin with (see Figure 5c), so a reasonable absolute reconstruction error becomes a large relative error. The poor reconstruction in profile views and the good reconstruction in frontal views fit with data on human performance in biological motion recognition tasks (Troje, 2002a; Mather & Murdoch, 1994).

4.3 Viewpoint Reconstruction. While we assumed the projection matrix C (see equation 3.5) to be known in the first experiment, in the second experiment we include the horizontal viewpoint as an unknown variable that is itself subject to reconstruction. Assuming that a 2D walking sequence has a constant walking direction and that the viewpoint is rotated only about the vertical axis (that is, it is a horizontal viewpoint), we can estimate the view angle using the rotated average posture \hat{\bar{w}}(\alpha) and the rotated eigenwalkers \hat{E}(\alpha) and then retrieve the missing motion data. Equations 3.8 and 3.14 can be rewritten as

(\alpha_{opt}, k_{opt}) = \arg\min_{\alpha,k} \| \hat{w} - \hat{\bar{w}}(\alpha) - \hat{E}(\alpha) k \|,   (4.5)
(\alpha_{opt}, k_{opt}) = \arg\min_{\alpha,k} \left( \| \hat{w} - \hat{\bar{w}}(\alpha) - \hat{E}(\alpha) k \|^2 / (2\sigma^2) + \sum_{i=1}^{N} k_i^2 / (2\lambda_i) \right).   (4.6)

The optimal solution of this minimization problem over the coefficients k and the view angle \alpha can be found by solving a nonlinear overdetermined system; details are given in appendix B. The missing data are then the linear combination of eigenwalkers with the optimal coefficients k_{opt}. As in the first experiment, a leave-one-out cross-validation procedure was applied to all the walkers. The average reconstructed angles from different viewpoints obtained by the two approaches are illustrated in Figure 6. The experimental results show that we can precisely recover the view angles from motion sequences with unknown viewpoints. Here, Bayesian inference does not provide an advantage. Using the obtained view angle and the recovered missing data, it becomes possible to identify and recognize human gait from 2D image sequences taken from different viewpoints.

5 Conclusions and Future Work

We have investigated and evaluated the problem of reconstructing 3D periodic human motion from 2D motion sequences. A linear motion model, based on a linear combination of eigenwalkers, was constructed from Fourier representations of walking examples using PCA. The Fourier representation of a particular walker is a compact description of human walking: not only does it allow us to find temporal correspondence by adjusting the frequency and phase, but it also reduces computational expenditure and storage requirements. Projecting 2D motion data onto this model yields a set of coefficients and, for test data with unknown viewpoints, the view angle. Two calculation approaches were explored to determine the coefficients: one based on least-squares minimization using the pseudoinverse, the other using the prior probability of the training examples to calculate a MAP estimate by means of Bayes' rule. Experiments and evaluations were made on walking data. The results and the quantitative error analysis showed that the proposed method can reasonably reconstruct 3D human walking data from 2D motion sequences. We also verified that the linear motion model can estimate parameters such as the view angle from 2D motion sequences with an unknown constant walking direction. Applying Bayes' rule to 3D motion reconstruction improved the reconstruction performance.

We developed and tested the proposed framework on a relatively simple data set: discrete marker trajectories obtained from motion capture data of walking human subjects. We also made a series of assumptions that simplified algorithmic and computational demands.
Figure 6: Illustration of the average reconstruction view angles from different viewpoints of −90 to 90 degrees for 80 walkers. The solid line is the average reconstructed angles, and the dashed lines are the average standard deviations. (a) Approach I. (b) Approach II.
In particular, we assumed that the projection matrix was either completely known or that only a single parameter, the horizontal viewpoint, had to be recovered. Using the proposed framework to process real-world video data would require a number of additional steps and extensions of the model. The computer vision problem of tracking joint locations in video footage remains challenging. Markerless tracking, however, has become a huge research field, and promising advances have been made along several lines. A discussion of the area is beyond the scope of this letter, but we are confident that solutions to the problem will be available in the near future.

Another constraint that could be relaxed in future versions of this work is the knowledge of the projection matrix. The fact that the single parameter we considered here, the horizontal viewing angle, was recovered effortlessly and with high accuracy shows that the redundancy in the input data is still very high. Larger numbers of unknown parameters would require more sophisticated optimization procedures, but we consider it very likely that the system would remain overdetermined, providing enough constraints to find a solution. Recently, Chai and Hodgins (2005) simultaneously extracted the root position and orientation from vision data captured with two cameras.

Another challenge arises when we switch from periodic actions like walking and running to nonperiodic movements. Establishing temporal correspondence across our data set was easy: mapping one sequence onto another simply required scaling and translation in time. Establishing correspondence between nonperiodic movements generally requires nonlinear time warping. A number of different methods are available to achieve this task, ranging from simple landmark registration to dynamic programming and the employment of hidden Markov models (see, e.g., Ramsay & Silverman, 1997), time-warping algorithms (Bruderlin & Williams, 1995), spatiotemporal correspondence (Giese & Poggio, 2000; Mezger, Ilg, & Giese, 2005), and registration curves (Kovar & Gleicher, 2003).

The main purpose of this work was to emphasize the richness of the redundancy inherent in motion data and to suggest ways to employ this redundancy to recover missing data. In our example, the missing data were the third dimension, which is lost when projecting the 3D motion data onto a 2D image plane. The same approach, however, could be used to recover markers that become occluded by other parts of the body or even to generate new marker trajectories at locations where a real marker cannot be placed (for instance, inside the body).

Appendix A: Solving an Overdetermined System for Coefficients

In order to calculate the optimal estimate k, we consider the following objective function:

E = \| \hat{w} - \hat{\bar{w}} - \hat{E} k \|^2 / (2\sigma^2) + \sum_{i=1}^{N} k_i^2 / (2\lambda_i).   (A.1)
It can be written as

E = \left( \| \hat{w} - \hat{\bar{w}} \|^2 - 2 \langle k, \hat{E}^T (\hat{w} - \hat{\bar{w}}) \rangle + \langle k, \hat{E}^T \hat{E} k \rangle \right) / (2\sigma^2) + \sum_{i=1}^{N} k_i^2 / (2\lambda_i).   (A.2)

At the optimum, the gradient vanishes:

0 = \nabla E = \left( -2 \hat{E}^T (\hat{w} - \hat{\bar{w}}) + 2 \hat{E}^T \hat{E} k \right) / (2\sigma^2) + \left( k_i / \lambda_i \right)_{i=1}^{N},   (A.3)
so that

\hat{E}^T \hat{E} k + \sigma^2 \, \mathrm{diag}(1/\lambda) \, k = \hat{E}^T (\hat{w} - \hat{\bar{w}}).   (A.4)

The solution is

k = \left( \mathrm{diag}(\sigma^2 / \lambda) + \hat{E}^T \hat{E} \right)^{-1} \hat{E}^T (\hat{w} - \hat{\bar{w}}).   (A.5)
Appendix B: Solving an Overdetermined System for Coefficients and Angle

For solving equation 4.5, we define an objective function whose return value is a 150-dimensional vector. Every element of this vector is

diff(i) = \hat{w}(i) - \hat{\bar{w}}(i)(\alpha) - \hat{E}(i)(\alpha) \cdot k, \quad i = 1, ..., 150,   (B.1)

where \hat{E}(i)(\alpha) is a row vector whose length equals the number of eigenwalkers used. The first 75 elements of this vector can be calculated by

diff(i) = \hat{w}(i) - \cos\alpha \, (\bar{w}_x + E_x \cdot k) - \sin\alpha \, (\bar{w}_y + E_y \cdot k),   (B.2)

where i = 1, ..., 75, and \bar{w}_x, E_x and \bar{w}_y, E_y are the corresponding x- and y-coordinates of the average walker and the eigenwalkers, respectively. The remaining 75 elements can be calculated by

diff(i) = \hat{w}(i) - \bar{w}_z - E_z \cdot k,   (B.3)

where i = 76, ..., 150, and \bar{w}_z and E_z are the corresponding z-coordinates of the average walker and the eigenwalkers, respectively.
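In code, the residual of equations B.1 to B.3 can be written as follows. This is a hedged sketch: it assumes the 225 entries are stored as three stacked blocks of 75 x-, y-, and z-coordinates, and it uses SciPy's least_squares in the role that lsqnonlin plays in the text; all names are illustrative.

```python
import numpy as np
from scipy.optimize import least_squares

def residual(params, w_hat, w_mean, E):
    """150-dim residual vector of eqs. B.1 to B.3.

    params = [alpha, k_1, ..., k_N], with alpha the view angle in radians.
    Assumes coordinates are stored as stacked x, y, z blocks of 75 each.
    """
    alpha, k = params[0], params[1:]
    wx, wy, wz = np.split(w_mean, 3)
    Ex, Ey, Ez = np.split(E, 3, axis=0)
    d_xy = w_hat[:75] - (np.cos(alpha) * (wx + Ex @ k)
                         + np.sin(alpha) * (wy + Ey @ k))   # eq. B.2
    d_z = w_hat[75:] - (wz + Ez @ k)                        # eq. B.3
    return np.concatenate([d_xy, d_z])

# least_squares plays the role of Matlab's lsqnonlin, e.g.:
# x0 = np.zeros(1 + E.shape[1])
# res = least_squares(residual, x0, args=(w_hat, w_mean, E))
```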
Finally, the obtained 150-dimensional vector is passed to the subroutine lsqnonlin in Matlab 6.5, using the large-scale trust-region reflective Newton optimization algorithm. The returned values are the optimal coefficients and the viewing angle. For equation 4.6, we calculate the coefficients for viewing angles from 0 to 180 degrees in 1 degree steps by using equation 3.15 and then substitute them into equation 4.6. We determine the optimal coefficients and view angle by comparing the 181 obtained values.

Acknowledgments

We gratefully acknowledge the generous financial support of the German Volkswagen Foundation and the Canada Foundation for Innovation. We thank Tobias Otto for his help in collecting the motion data and Daniel Saunders for many discussions and his tremendous help in the final editing of the manuscript. The helpful comments from the anonymous reviewers are gratefully acknowledged; they greatly helped us improve this letter.

References

Aggarwal, A., & Triggs, B. (2004). Learning to track 3D human motion from silhouettes. In Proceedings of the 21st International Conference on Machine Learning. Madison, WI: Omni Press.
Aggarwal, J. (2003). Problems, ongoing research and future directions in motion research. Machine Vision and Applications, 14, 199–201.
Aggarwal, J., & Cai, Q. (1999). Human motion analysis: A review. Computer Vision and Image Understanding, 73, 428–440.
Arikan, O., & Forsyth, D. (2002). Interactive motion generation from examples. ACM Transactions on Graphics, 21(3), 483–490.
Blanz, V., & Vetter, T. (2003). Face recognition based on fitting a 3D morphable model. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25, 1063–1074.
Bowden, R. (2000). Learning statistical models of human motion. Paper presented at the IEEE Workshop on Human Modeling, Analysis and Synthesis, CVPR 2000, South Carolina.
Bowden, R., Mitchell, T., & Sarhadi, M. (2000). Non-linear statistical models for the 3D reconstruction of human pose and motion from monocular image sequences. Image and Vision Computing, 18, 729–737.
Bruderlin, A., & Williams, L. (1995). Motion signal processing. Proceedings of SIGGRAPH, 95, 97–104.
Buxton, H. (2003). Learning and understanding dynamic scene activity: A review. Image and Vision Computing, 21, 125–136.
Chai, J., & Hodgins, J. (2005). Performance animation from low-dimensional control signals. ACM Transactions on Graphics, 24(3), 686–696.
Cunado, D., Nixon, M., & Carter, J. (2003). Automatic extraction and description of human gait models for recognition purposes. Computer Vision and Image Understanding, 90, 1–41.
Dariush, B. (2003). Human motion analysis for biomechanics and biomedicine. Machine Vision and Applications, 14, 202–205.
Gavrila, D. (1999). The visual analysis of human motion: A survey. Computer Vision and Image Understanding, 73, 82–98.
Giese, M., & Poggio, T. (2000). Morphable models for the analysis and synthesis of complex motion patterns. International Journal of Computer Vision, 38, 59–73.
Grochow, K., Martin, S., Hertzmann, A., & Popović, Z. (2004). Style-based inverse kinematics. ACM Transactions on Graphics, 23(3), 522–531.
Howe, N., Leventon, M., & Freeman, W. (1999). Bayesian reconstruction of 3D human motion from single-camera video. In S. Solla, T. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems, 12 (pp. 820–826). Cambridge, MA: MIT Press.
Jepson, A., Fleet, D., & El-Maraghi, T. (2003). Robust online appearance models for visual tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25, 1296–1311.
Kakadiaris, I., & Metaxas, D. (2000). Model-based estimation of 3D human motion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22, 1453–1459.
Karaulova, I., Hall, P., & Marshall, A. (2002). Tracking people in three dimensions using a hierarchical model of dynamics. Image and Vision Computing, 20, 691–700.
Kovar, L., & Gleicher, M. (2003). Flexible automatic motion blending with registration curves. In Proceedings of Symposium on Computer Animation (pp. 214–224). Aire-la-Ville, Switzerland: Eurographics Association.
Leventon, M., & Freeman, W. (1998). Bayesian estimation of 3D human motion from an image sequence (Tech. Rep. No. 98-06). Cambridge, MA: Mitsubishi Electric Research Laboratory.
Mather, G., & Murdoch, L. (1994). Gender discrimination in biological motion displays based on dynamic cues. Proceedings of the Royal Society of London, Series B: Biological Sciences, 258, 273–279.
Mezger, J., Ilg, W., & Giese, M. (2005). Trajectory synthesis by hierarchical spatio-temporal correspondence: Comparison of different methods. In Proceedings of the ACM Symposium on Applied Perception in Graphics and Visualization (pp. 25–32). New York: ACM Press.
Moeslund, T., & Granum, E. (2001). A survey of computer vision-based human motion capture. Computer Vision and Image Understanding, 81, 231–268.
Ning, H. Z., Tan, T. N., Wang, L., & Hu, W. M. (2004). Kinematics-based tracking of human walking in monocular video sequences. Image and Vision Computing, 22, 429–441.
Ong, E., & Gong, S. (2002). The dynamics of linear combinations: Tracking 3D skeletons of human subjects. Image and Vision Computing, 20, 397–414.
Ramsay, J., & Silverman, B. (1997). Functional data analysis. New York: Springer.
Ren, L., Shakhnarovich, R., Hodgins, J., Pfister, H., & Viola, P. (2005). Learning silhouette features for control of human motion. ACM Transactions on Graphics, 24, 1303–1331.
Rosales, R., Siddiqui, M., Alon, J., & Sclaroff, S. (2001). Estimating 3D body pose using uncalibrated cameras. In Proceedings of the Computer Vision and Pattern Recognition Conference (Vol. 1, pp. 821–827). Los Alamitos, CA: IEEE Computer Society Press.
Sminchisescu, C., & Triggs, B. (2003). Kinematic jump processes for monocular 3D human tracking. In Proceedings of the Computer Vision and Pattern Recognition Conference (Vol. 1, pp. 69–76). Los Alamitos, CA: IEEE Computer Society Press.
Troje, N. F. (2002a). Decomposing biological motion: A framework for analysis and synthesis of human gait patterns. Journal of Vision, 2, 371–387.
Troje, N. F. (2002b). The little difference: Fourier based gender classification from biological motion. In R. P. Würtz & M. Lappe (Eds.), Dynamic perception (pp. 115–120). Berlin: Aka Press.
Unuma, M., Anjyo, K., & Takeuchi, R. (1995). Fourier principles for emotion-based human figure animation. Proceedings of SIGGRAPH, 95, 91–96.
Urtasun, R., & Fua, P. (2004a). 3D human body tracking using deterministic temporal motion models. In Proceedings of the Eighth European Conference on Computer Vision. Berlin: Springer-Verlag.
Urtasun, R., & Fua, P. (2004b). 3D tracking for gait characterization and recognition. In Proceedings of the Sixth International Conference on Automatic Face and Gesture Recognition (pp. 17–22). Los Alamitos, CA: IEEE Computer Society Press.
Wang, L., Hu, W. M., & Tan, T. N. (2003). Recent developments in human motion analysis. Pattern Recognition, 36, 585–601.
Wang, L., Tan, T. N., Ning, H. Z., & Hu, W. M. (2003). Silhouette analysis-based gait recognition for human identification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25, 1505–1518.
Yacoob, Y., & Black, M. (1999). Parameterized modeling and recognition of activities. Computer Vision and Image Understanding, 73, 232–247.
Received April 25, 2005; accepted August 16, 2006.
LETTER
Communicated by Thomas Vogl
A New Backpropagation Learning Algorithm for Layered Neural Networks with Nondifferentiable Units Takahumi Oohori [email protected] Department of Information Design, Hokkaido Institute of Technology, Sapporo 006-8585, Japan
Hidenori Naganuma [email protected] Division of Electrical Engineering, Graduate School of Engineering, Hokkaido Institute of Technology, Sapporo 006-8585, Japan
Kazuhisa Watanabe [email protected] Department of Information Network Engineering, Hokkaido Institute of Technology, Sapporo 006-8585, Japan
We propose a digital version of the backpropagation algorithm (DBP) for three-layered neural networks with nondifferentiable binary units. This approach feeds teacher signals to both the middle and output layers, whereas with a simple perceptron, they are given only to the output layer. The additional teacher signals enable the DBP to update the coupling weights not only between the middle and output layers but also between the input and middle layers. A neural network based on DBP learning is fast and easy to implement in hardware. Simulation results for several linearly nonseparable problems such as XOR demonstrate that the DBP performs favorably when compared to the conventional approaches. Furthermore, in large-scale networks, simulation results indicate that the DBP provides high performance.

1 Introduction

The three-layered perceptron proposed by Rosenblatt (referred to as a simple perceptron, SP) has provided research results that have become the foundation for understanding the learning possibilities of neural networks (Rosenblatt, 1961; Minsky & Papert, 1969). It has been theoretically proven that the SP can classify categories in a finite number of learning iterations if the output of the middle layer is linearly separable. Conversely, this means that the network cannot learn if at least one category among the input pattern set is linearly nonseparable.

Neural Computation 19, 1422–1435 (2007)
© 2007 Massachusetts Institute of Technology
Any linearly nonseparable pattern set becomes separable through random transformations between the input and middle layers if the number of middle-layer units is sufficiently large (Amari, 1978). In general, however, a large number of units are needed in the middle layer for the output of the middle layer to become linearly separable. To make learning technologically feasible, it is necessary to restrict the number of middle-layer units. However, because the SP cannot update the coupling weights between the input and middle layers, it is unrealistic to expect a great improvement of its learning performance (Minsky & Papert, 1969).

Rumelhart, Hinton, and Williams (1986) proposed an analog backpropagation algorithm (ABP), which replaces the binary units in the middle and output layers with analog, differentiable units. Since the output error propagates backward (from the output layer to the input layer), the ABP can update the coupling weights between the input and middle layers. However, this approach retains a few problems. In the case of a linearly nonseparable pattern set, the ABP converges to a local minimum and stops learning for several initial coupling weights. Even when it converges to the global minimum, the ABP requires a large number of learning iterations (Jacobs, 1988; Vogl, Mangis, Rigler, Zink, & Alkon, 1988). Moreover, this algorithm cannot directly solve digital problems where the middle-layer output is intrinsically binary, such as the internal state inference of the automaton and vector quantization (Zeng, Goodman, & Smyth, 1993; Watanabe, Furukawa, Mitani, & Oohori, 1997).

This letter proposes a digital version of the backpropagation algorithm (called digital BP, DBP) for three-layered neural networks with nondifferentiable binary units. In the DBP, teacher signals are given to both the middle and output layers, whereas in the SP they are given only to the output layer. The DBP is therefore able to update the coupling weights not only between the middle and output layers but also between the input and middle layers. A neural network based on the DBP learning model is fast and easy to implement in hardware. We have applied the proposed method to several linearly nonseparable problems and compared the DBP's learning performance with that of the ABP. Furthermore, we have applied the algorithm to a large-scale network and determined its learning performance.

2 Digital Backpropagation

This letter proposes a new method of backpropagation for three-layered neural networks with nondifferentiable binary units in the middle and output layers (Oohori, Kamada, Kobayashi, Konishi, & Watanabe, 2000). In digital neural networks, it is not possible to propagate the error from the output layer, through the middle layer, to the input layer, because of the nondifferentiable units in the middle layer. Thus, the coupling weights between the input and middle layers cannot be updated, and the SP has difficulty learning linearly nonseparable problems.
Figure 1: Learning of digital backpropagation. (A) In case of a single output. (B) In case of multiple outputs.
The proposed DBP is able to update the coupling weights not only between the middle and output layers but also between the input and middle layers by giving teacher signals to the middle layer. This serves to decrease the error of the output layer. The following sections describe the DBP, focusing on how to set the teacher signals to the middle layer for both single and multiple outputs.

2.1 In Case of Single Output. In digital neural networks (see Figure 1A), for each input pattern p, the output and the error in the output layer are calculated. The coupling weights between the middle and output layers are updated based on the delta rule in the same manner as with the SP:

W_{kj} = W_{kj} - \alpha \left( O_k^{(p)} - T_k^{(p)} \right) O_j^{(p)}.   (2.1)

Similarly, if the teacher signal T_j^{(p)} can be set for the middle layer, the coupling weights between the input and middle layers can be updated by the following delta rule:

W_{ji} = W_{ji} - \alpha \left( O_j^{(p)} - T_j^{(p)} \right) O_i^{(p)}.   (2.2)

Next, we need to calculate the teacher signal T_j^{(p)} for the middle-layer units. If the coupling weights W_{kj} between the middle and output layers are fixed temporarily after W_{kj} is updated, it is preferable to set the
middle-layer unit output O_j^{(p)} to a value that decreases the output-layer error. Since the activation of the kth output unit is

U_k^{(p)} = \sum_j W_{kj} O_j^{(p)},   (2.3)

if W_{kj} is positive, the activation increases when O_j^{(p)} is changed from 0 to 1 and decreases when O_j^{(p)} is changed from 1 to 0. If W_{kj} is negative, the activation decreases and increases contrary to the change in O_j^{(p)}. Therefore, to activate the output-layer unit when the output O_k^{(p)} = 0 and the teacher signal T_k^{(p)} = 1 (O_k^{(p)} - T_k^{(p)} = -1), the teacher signal T_j^{(p)} of the middle layer is set as follows:

if W_{kj} > 0 then T_j^{(p)} = 1,   (2.4)
if W_{kj} < 0 then T_j^{(p)} = 0.   (2.5)

In order to inactivate the output-layer unit when the output O_k^{(p)} = 1 and the teacher signal T_k^{(p)} = 0 (O_k^{(p)} - T_k^{(p)} = 1), T_j^{(p)} is set as follows:

if W_{kj} > 0 then T_j^{(p)} = 0,   (2.6)
if W_{kj} < 0 then T_j^{(p)} = 1.   (2.7)

Combining equations 2.5 and 2.6 and equations 2.4 and 2.7, the following equations are obtained, respectively:

if (O_k^{(p)} - T_k^{(p)}) W_{kj} > 0 then T_j^{(p)} = 0,   (2.8)
if (O_k^{(p)} - T_k^{(p)}) W_{kj} < 0 then T_j^{(p)} = 1.   (2.9)

When O_k^{(p)} = T_k^{(p)} and/or W_{kj} = 0, the coupling weights between the input and middle layers need not be updated, as shown in equations 2.1 and 2.3. Therefore, the teacher signal is set to the existing output of the middle-layer unit to prevent updating of the coupling weights between the input and middle layers, namely,

if (O_k^{(p)} - T_k^{(p)}) W_{kj} = 0 then T_j^{(p)} = O_j^{(p)}.   (2.10)

Based on these principles, the teacher signals given to the middle layer are summarized in Table 1 (in case of a single output). Since S_j^{(p)} = S_{jk}^{(p)} = (O_k^{(p)} - T_k^{(p)}) W_{kj} is the quantity on which the teacher signal of the middle-layer units is based, we refer to it as the decision factor for the teacher signal.
Table 1: Calculating the Teacher Signal for the Middle Layer (Single Output).

O_k^{(p)}   T_k^{(p)}   W_{kj}     S_j^{(p)} = S_{jk}^{(p)} = (O_k^{(p)} - T_k^{(p)}) W_{kj}   T_j^{(p)}
0           1           Positive   Negative                                                      1
0           1           Negative   Positive                                                      0
1           0           Positive   Positive                                                      0
1           0           Negative   Negative                                                      1
0           0           -          0                                                             O_j^{(p)}
1           1           -          0                                                             O_j^{(p)}
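The update logic of equations 2.1 to 2.10 and Table 1 fits in a few lines. The sketch below (Python/NumPy; names are illustrative, not from the original implementation) performs one DBP step for a single output unit; note that the decision factor is computed with the freshly updated weights W_kj, as the text prescribes.

```python
import numpy as np

def dbp_step_single_output(O_in, O_mid, O_out, T_out, W_kj, W_ji, alpha):
    """One DBP update for a single output unit (eqs. 2.1-2.10, Table 1).

    O_in, O_mid:  binary input- and middle-layer outputs (vectors)
    O_out, T_out: binary output and teacher signal (scalars)
    W_kj: weights middle -> output (vector)
    W_ji: weights input -> middle (matrix, shape (n_mid, n_in))
    """
    err = O_out - T_out
    W_kj = W_kj - alpha * err * O_mid                    # eq. 2.1
    S_j = err * W_kj                                     # decision factor
    T_mid = np.where(S_j > 0, 0.0,                       # eq. 2.8
            np.where(S_j < 0, 1.0, O_mid))               # eqs. 2.9, 2.10
    W_ji = W_ji - alpha * np.outer(O_mid - T_mid, O_in)  # eq. 2.2
    return W_kj, W_ji
```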
2.2 In Case of Multiple Outputs. When dealing with multiple outputs (see Figure 1B), the coupling weights between the middle and output layers are updated by the following delta rule, in a manner similar to that used by the SP:

W_{kj} = W_{kj} - \alpha \left( O_k^{(p)} - T_k^{(p)} \right) O_j^{(p)}.   (2.11)

To update the coupling weights between the input and middle layers, the decision factor of the teacher signal is calculated as

S_{jk}^{(p)} = \left( O_k^{(p)} - T_k^{(p)} \right) W_{kj}.   (2.12)

The teacher signal T_j^{(p)} for the middle layer may be set by the sign of S_{jk}^{(p)}, and the coupling weights between the input and middle layers may be updated by the following delta rule:

W_{ji} = W_{ji} - \alpha \left( O_j^{(p)} - T_j^{(p)} \right) O_i^{(p)}.   (2.13)

However, because there are two or more units in the output layer, a mutual competition occurs when the sign of S_{jk}^{(p)} differs between units. To resolve this, we define the sum S_j^{(p)} over k of the S_{jk}^{(p)} as the decision factor for the teacher signal when there are multiple outputs:

S_j^{(p)} = \sum_k \left( O_k^{(p)} - T_k^{(p)} \right) W_{kj}.   (2.14)
Table 2: Calculating Teacher Signals for the Middle Layer (Two Outputs).

S_{j1}^{(p)}   S_{j2}^{(p)}   S_j^{(p)} = S_{j1}^{(p)} + S_{j2}^{(p)}   T_j^{(p)}
Positive       Negative       Positive                                   0
                              Negative                                   1
                              0                                          O_j^{(p)}
Negative       Positive       Positive                                   0
                              Negative                                   1
                              0                                          O_j^{(p)}
Positive       Positive       Positive                                   0
Negative       Negative       Negative                                   1
Now the teacher signal of the middle layer is set by the following equations:

if S_j^{(p)} > 0 then T_j^{(p)} = 0,   (2.15)
if S_j^{(p)} < 0 then T_j^{(p)} = 1.   (2.16)

When the S_{jk}^{(p)} are balanced and S_j^{(p)} = 0, the coupling weights between the input and middle layers need not be updated, that is,

if S_j^{(p)} = 0 then T_j^{(p)} = O_j^{(p)}.   (2.17)
The teacher signals given to the middle-layer units, based on this discussion, are summarized in Table 2 for the case of two outputs.

2.3 Improvement of the Algorithm. Figure 2 shows a program analysis diagram (PAD) for the DBP algorithm. In this figure, the decision factor S_j^{(p)} and the middle-layer teacher signal T_j^{(p)} are calculated using the current O_k^{(p)} and the newly updated W_{kj}. If the middle-layer teacher signals T_j^{(p)} are instead calculated using a newly recalculated O_k^{(p)}, an acceleration in convergence can be expected. The recalculation of O_k^{(p)} using the newly updated W_{kj} is shown in the dotted box in Figure 2.
Figure 2: PAD for DBP.
3 Comparison with Previous Learning Methods

3.1 Comparison with the ABP. Rumelhart et al. (1986) proposed the backpropagation algorithm (analog BP, or ABP) with analog and differentiable units. Equations 3.1 and 2.11 show the updating of the coupling weights between the middle and output layers using the ABP and the DBP, respectively, for comparison:

(ABP)  W_{kj} = W_{kj} - \alpha \left( O_k^{(p)} - T_k^{(p)} \right) f'(U_k^{(p)}) O_j^{(p)},   (3.1)
(DBP)  W_{kj} = W_{kj} - \alpha \left( O_k^{(p)} - T_k^{(p)} \right) O_j^{(p)},   (2.11)

where f is the input-output function of the units in the middle and output layers, for which the sigmoid function is used. In the case of the ABP, reverse error propagation from the output layer through f introduces the term f'(U_k^{(p)}), as shown in equation 3.1; for the DBP, this term is absent. f'(U_k^{(p)}) is small in the neighborhood of both the error maximum (O_k^{(p)} \approx 1 - T_k^{(p)}) and the error minimum (O_k^{(p)} \approx T_k^{(p)}). With the DBP, whenever an error occurs (O_k^{(p)} \neq T_k^{(p)}), the weight is updated by a constant amount \alpha, independent of the activation U_k^{(p)}. Next, the equations for updating the coupling weights between the input and middle layers in the ABP and the DBP are simplified by using the
teacher-signal decision factors \bar{S}_j^{(p)} and S_j^{(p)}, respectively:

ABP:
\bar{S}_j^{(p)} = \sum_k \left( O_k^{(p)} - T_k^{(p)} \right) W_{kj} f'(U_k^{(p)}),   (3.2)
W_{ji} = W_{ji} - \alpha \bar{S}_j^{(p)} f'(U_j^{(p)}) O_i^{(p)}.   (3.3)

DBP:
S_j^{(p)} = \sum_k \left( O_k^{(p)} - T_k^{(p)} \right) W_{kj},   (3.4)
W_{ji} = W_{ji} - \alpha \left( S_j^{(p)} / |S_j^{(p)}| \right) O_i^{(p)}.   (3.5)

In equation 3.3 of the ABP, the factor f'(U_j^{(p)}) for reverse propagation from the middle layer and the decision factor \bar{S}_j^{(p)} of the teacher signal, which is calculated in the output layer, both appear. As with the output layer, when the output O_j^{(p)} of the middle layer is close to 0 or 1, f'(U_j^{(p)}) becomes almost 0, and W_{ji} is updated slowly. With the DBP, however, as shown in equation 3.5, W_{ji} is updated by a constant amount \alpha, and learning is fast.

3.2 Comparison with the MADALINE. The MADALINE, proposed by Widrow, Winter, and Baxter (1988), is a three-layered neural network with nondifferentiable binary units in the middle and output layers. It has the same structure as both the SP and the DBP proposed in this letter. During learning iterations with the MADALINE, coupling weights between the middle and output layers are updated by the delta rule until the error does not decrease any further for all input patterns. Afterward, for each input pattern, the output of the middle-layer unit whose activation is closest to 0 is reversed. If the output error decreases, the coupling weights between the input and middle layers are updated by the delta rule. Otherwise, the reversed output is reset, and the output of the unit whose activation is next closest to 0 is reversed. After single units have been tried, combinations of two or more units are reversed, one after another, up to a predetermined number. The MADALINE's learning algorithm is based on trial and error. When the number of middle-layer units increases, the algorithm must choose, from a large number of combinations, which units to reverse in order to decrease the output error. Therefore, due to combinatorial explosion, the computational time increases tremendously and the approach is not practical. On the other hand, when using the DBP, the teacher signals for all middle-layer units are set to decrease the output error. Because the coupling weights
between the input and middle layers are updated concurrently under the control of the teacher signals, this combinatorial explosion does not occur. Although the DBP's convergence has not yet been proven theoretically, a large number of simulation experiments have demonstrated good convergence.

3.3 Comparison with the Dual Learning Method. Application of the dual learning method proposed by Yamada, Kuroyanagi, and Iwata (2004) to the digital three-layered perceptron uses the following delta rules to update the coupling weights not only between the middle and output layers but also between the input and middle layers:

W_{kj} = W_{kj} - \alpha \left( O_k^{(p)} - T_k^{(p)} \right) O_j^{(p)},   (3.6)
W_{ji} = W_{ji} - \alpha \left( O_j^{(p)} - T_j^{(p)} \right) O_i^{(p)},   (3.7)
where the middle-layer teacher signal is calculated as an analog value:

T_j^{(p)} = O_j^{(p)} - \alpha \sum_k \left( O_k^{(p)} - T_k^{(p)} \right) W_{kj}.   (3.8)
Yamada et al. (2004) proposed that the teacher signal T_j^{(p)} be clipped to 0 or 1 when it exceeds the range [0, 1], that is,

T_j^{(p)} = \begin{cases} 0 & T_j^{(p)} < 0 \\ T_j^{(p)} & 0 \le T_j^{(p)} \le 1 \\ 1 & T_j^{(p)} > 1 \end{cases}   (3.9)
Numerical experiments demonstrate that the performance of dual learning using equations 3.8 and 3.9 is comparable to that of both the ABP and the DBP. However, in the dual learning method, the middle-layer teacher signals are not digital but analog, whereas in the DBP the teacher signals for the middle layer are digital, as shown in equations 2.14 to 2.17. Thus, the proposed DBP is a completely digital type of learning, including the teacher signals.

4 Numerical Experiments

4.1 Preliminary Experiments with the DBP. We tested the DBP on the XOR problem, using a network consisting of two input units, two middle units, and one output unit.
Figure 3: Percentage of trials where the network converges successfully for XOR, plotted against the learning coefficient for DBP1, DBP2, and ABP.
The learning coefficient \alpha of the DBP and the ABP was varied from 10^{-2} to 10^{2}. A hundred different initializations of the coupling weights were generated using a uniform distribution between -1.0 and 1.0. The maximum number of learning iterations was set to 10^{2}/\alpha in the DBP and to 10^{5}/\alpha in the ABP, inversely proportional to the learning coefficient \alpha. Learning was considered complete in the DBP when the error between the output and the teacher reached 0 or the number of iterations exceeded the maximum, and in the ABP when the mean square error between the output and the teacher signal fell below 10^{-3} or the number of iterations exceeded the maximum. In the experiment, the learning coefficient \alpha was divided by the number of patterns (P_max) in order to suppress the influence of P_max in each problem. Figures 3 and 4 show the percentage of trials in which the network converged successfully and the average number of iterations required to converge, using both the DBP and the ABP. In these figures, DBP1 and DBP2 denote the DBP without and with output recalculation, respectively. Figure 3 shows a slight improvement in the learning performance of the DBP when the output is recalculated. The performance decreases greatly in all methods when the learning coefficient \alpha exceeds a certain limit. Figure 4 shows that learning accelerates as \alpha increases. Therefore, it is preferable to set the learning coefficient \alpha to as large a value as possible within the range where the percentage of successful trials does not decrease. A similar tendency is observed in other problems; therefore, in the following simulations, we have used \alpha = 0.002 (for the DBP) and \alpha = 2 (for the ABP) and included the output recalculation in the DBP.
Figure 4: Average number of iterations required to converge for XOR, plotted against the learning coefficient for DBP1, DBP2, and ABP.
4.2 Experiment for Comparison with the Conventional BP (ABP). The next experiment presents some linearly nonseparable problems to both the DBP and the ABP. The learning coefficient \alpha was 0.002 for the DBP and 2 for the ABP. The maximum number of learning iterations was set to 50,000 for both methods. The network consisted of two units in the middle layer for two-input problems and three units for three-input problems. The other parameters were the same as in the preliminary experiments. The problems used in this experiment were the XOR (one output), the half adder (two outputs), the parity check (one output), the counting problem (three or four outputs), and the symmetry judgment (one output). Tables 3 and 4 show the percentage of trials where the network converged successfully and the average number of iterations required to converge, respectively. Table 3 shows that the percentage of successful trials for the DBP was higher than for the ABP in all two-input problems and for the three-input counting problem. The DBP performed better than the ABP for multiple-output problems, such as the half adder and the counting problems, though the comparison varied considerably depending on the nature of each problem. Table 4 shows that the average number of iterations required by the DBP to converge was smaller than for the ABP, except for the symmetry judgment problem. Moreover, the average number of iterations required for convergence tends to decrease when the percentage of successful trials is high, except for the parity check problem.

4.3 Experiments on More Realistic Problems. In order to observe the learning characteristics for a large-scale network, we presented a character recognition problem to the DBP.
Table 3: Comparison of Percentages of Successful Trials.

                                     ABP    DBP
Two inputs     XOR                    79     83
               Half adder             86    100
               Counting               97    100
Three inputs   Parity check          100     91
               Symmetry judgment      99     91
               Counting               68     98
Table 4: Comparison of Average Iterations to Converge.

                                     ABP      DBP
Two inputs     XOR                  3153     1881
               Half adder           5069     3259
               Counting             6047     2966
Three inputs   Parity check         4254     3762
               Symmetry judgment    2629     3169
               Counting           10,883     7208
Figure 5: Binary input data for character recognition (the 26 characters A through Z).
The network consisted of 25 units in the input layer and 26 units in the output layer, one for each character. The input data consisted of the 26 characters of the alphabet, as shown in Figure 5. The learning coefficient was set to 0.02, and the maximum number of learning iterations was set to 50,000. A hundred different initializations of the coupling weights were generated in the same fashion as described in sections 4.1 and 4.2. The number of units in the middle layer was varied from 4 to 10.
Table 5: Result of the Character Recognition Problem with 25 Inputs.

Number of Units     Percentage of         Average Iterations
in Middle Layer     Successful Trials     to Converge
4                   0                     NA
5                   24                    2881
6                   100                   2233
7                   100                   2212
8                   100                   2378
9                   100                   2387
10                  100                   2392
Table 5 shows the percentage of trials where the network converged successfully and the average number of iterations required to converge. With 4 middle-layer units, the network cannot learn all 26 characters, because the maximum number of learnable patterns for 4 binary units is 16. When the number of units is increased to 5, the percentage of successful trials remains small. This is because some similar characters, such as O and G or E and F, are mapped to the same pattern in the middle layer. With 6 or more middle-layer units, we see an effective classification in the middle layer. We also see that the average number of iterations does not depend on the number of units in the middle layer.

5 Conclusion

This letter proposed a digital version of the backpropagation algorithm for three-layered neural networks with nondifferentiable binary units. In the DBP, teacher signals are given to both the middle and the output layers, while they are given only to the output layer of a simple perceptron. The DBP is thus able to update the coupling weights not only between the middle and output layers but also between the input and middle layers. Simulations for several problems produced the following results:
- The learning performance of the proposed DBP compares favorably to that of the ABP, though it depends on the particular linearly nonseparable problem.
- Hardware implementations are simplified, because the network consists of only digital units and the learned network can be replaced by logic units.
- When the proposed DBP is applied to a large-scale network, it can learn fast with high accuracy.
The following problems need to be resolved in the future:
- A formal proof of convergence of the algorithm and a clarification of the convergence conditions
- Extension to larger-scale networks, with more layers and more units in the middle layer
- Application of the algorithm to intrinsically binary problems such as the internal state inference of the automaton and vector quantization
Acknowledgments

We thank the reviewers of this letter for their insightful comments.

References

Amari, S. (1978). Mathematical theory of nerve nets. Tokyo: Sangyotosho. (In Japanese.)
Jacobs, R. (1988). Increased rates of convergence through learning rate adaptation. Neural Networks, 1, 295–307.
Minsky, M., & Papert, S. (1969). Perceptrons. Cambridge, MA: MIT Press.
Oohori, T., Kamada, S., Kobayashi, Y., Konishi, Y., & Watanabe, K. (2000). A new error back propagation learning algorithm for a layered neural network with nondifferentiable units (Tech. Rep. NC99-85). Tokyo: Institute of Electronics, Information, and Communication Engineers.
Rosenblatt, F. (1961). Principles of neurodynamics. Washington, DC: Spartan Books.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323, 533–536.
Vogl, T. P., Mangis, J. K., Rigler, A. K., Zink, W. T., & Alkon, D. L. (1988). Accelerating the convergence of the backpropagation method. Biological Cybernetics, 59, 257–263.
Watanabe, K., Furukawa, T., Mitani, M., & Oohori, T. (1997). Fluctuation-driven learning for feedforward neural networks. IEICE Transactions on Information and Systems, J80-D-II, 1249–1256.
Widrow, B., Winter, R. G., & Baxter, R. A. (1988). Layered neural nets for pattern recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP-36, 1109–1118.
Yamada, K., Kuroyanagi, S., & Iwata, A. (2004). A supervised learning method using duality in the artificial neuron model. IEICE Transactions on Information and Systems, J87-D-II, 399–406.
Zeng, Z., Goodman, R. M., & Smyth, P. (1993). Learning finite state machines with self-clustering networks. Neural Computation, 5, 976–990.
Received August 31, 2005; accepted August 10, 2006.
LETTER
Communicated by Walter Senn
Spike-Timing-Dependent Plasticity in Balanced Random Networks Abigail Morrison [email protected] Computational Neuroscience Group, RIKEN Brain Science Institute, Wako City, Saitama 351-0198, Japan
Ad Aertsen [email protected] Neurobiology and Biophysics, Institute of Biology III, Albert-Ludwigs-University, 79104 Freiburg, Germany
Markus Diesmann [email protected] Computational Neuroscience Group, RIKEN Brain Science Institute, Wako City, Saitama 351-0198, Japan
The balanced random network model attracts considerable interest because it explains the irregular spiking activity at low rates and large membrane potential fluctuations exhibited by cortical neurons in vivo. In this article, we investigate to what extent this model is also compatible with the experimentally observed phenomenon of spike-timing-dependent plasticity (STDP). Confronted with the plethora of theoretical models for STDP available, we reexamine the experimental data. On this basis, we propose a novel STDP update rule, with a multiplicative dependence on the synaptic weight for depression, and a power law dependence for potentiation. We show that this rule, when implemented in large, balanced networks of realistic connectivity and sparseness, is compatible with the asynchronous irregular activity regime. The resultant equilibrium weight distribution is unimodal with fluctuating individual weight trajectories and does not exhibit development of structure. We investigate the robustness of our results with respect to the relative strength of depression. We introduce synchronous stimulation to a group of neurons and demonstrate that the decoupling of this group from the rest of the network is so severe that it cannot effectively control the spiking of other neurons, even those with the highest convergence from this group.
Neural Computation 19, 1437–1467 (2007)
© 2007 Massachusetts Institute of Technology
1 Introduction

In order to understand cognitive processing in the cortex, it is necessary to have a solid understanding of its dynamics. The balanced random network model has been able to provide some insight into this matter, as it can account for the large fluctuations in membrane potential and the irregular firing at low rates observed in vivo for cortical neurons. Many aspects of this model have been investigated and are now well understood (van Vreeswijk & Sompolinsky, 1996, 1998; Brunel & Hakim, 1999; Brunel, 2000). Until now, studies have mainly focused on the activity dynamics of such networks under the assumption that the strength of a synapse remains constant. However, recent research indicates that the strength of a synapse varies with respect to the relative timing of pre- and postsynaptic spikes (Markram, Lübke, Frotscher, & Sakmann, 1997; Zhang, Tao, Holt, Harris, & Poo, 1998; Bi & Poo, 1998; Debanne, Gähwiler, & Thompson, 1998; Sjöström, Turrigiano, & Nelson, 2001; Froemke & Dan, 2002; Wang, Gerkin, Nauen, & Bi, 2005), a phenomenon known as spike-timing-dependent plasticity (STDP). This has prompted intense theoretical interest (see, e.g., Kempter, Gerstner, & van Hemmen, 1999; Song, Miller, & Abbott, 2000; van Rossum, Bi, & Turrigiano, 2000; Kistler & van Hemmen, 2000; Rubin, Lee, & Sompolinsky, 2001; Kempter, Gerstner, & van Hemmen, 2001; Gütig, Aharonov, Rotter, & Sompolinsky, 2003; Izhikevich & Desai, 2003; Burkitt, Meffin, & Grayden, 2004; Guyonneau, VanRullen, & Thorpe, 2005), but these studies have thus far tended to focus on the development of the synaptic weights of a single neuron in the absence of the activity dynamics of a recurrent network. Such network studies as there are (Hertz & Prügel-Bennett, 1996; Levy, Horn, Meilijson, & Ruppin, 2001; Izhikevich, Gally, & Edelman, 2004; Iglesias, Eriksson, Grize, Tomassini, & Villa, 2005) have shown divergent results, such as the development of strongly connected neuronal groups in Izhikevich et al. (2004) and systematic synaptic pruning in Iglesias et al. (2005). Additionally, these studies all assume a level of connectivity orders of magnitude below that seen in cortical networks (Braitenberg & Schüz, 1998).

The issue of connectivity is significant, as STDP is sensitive to correlations in neuronal activity. Reducing the number of incoming synapses per neuron and scaling up the synaptic strength to compensate will inevitably influence these correlations. To ensure that the observed development of synaptic weights is not an artifact of low connectivity, it is necessary to examine the behavior of high-connectivity networks. In this article, we investigate the activity dynamics and the development of the synaptic weight distribution for recurrent networks with biologically realistic levels of connectivity and sparseness. This necessitates the use of distributed computing techniques for simulation, previously developed in Morrison, Mehring, Geisel, Aertsen, and Diesmann (2005). As a starting position, we use the well-understood balanced random network model (van Vreeswijk
& Sompolinsky, 1996, 1998; Brunel & Hakim, 1999; Brunel, 2000), and extend it to incorporate STDP in its excitatory-excitatory synapses. The model of STDP implemented is based on a reexamination of the experimental data of Bi and Poo (1998), resulting in a novel power law update rule. The network is investigated both for the case that no structured external stimulus is applied and for the case that a group of neurons is subjected to an irregular synchronous stimulus.

In section 2 we reexamine the experimental data for STDP and propose a well-constrained power law model for the synaptic weight update. A consideration of rate perturbation further constrains the model to an all-to-all, rather than nearest-neighbor, spike pairing scheme. The neuron and network models are introduced in section 3, and the activity dynamics for a static balanced random network is reviewed. The network is extended to incorporate STDP in section 4. In section 4.1, we demonstrate the mutual equilibrium of weight and activity dynamics in a plastic network with no structured external stimulus and show that no structure is developed. This finding turns out to be robust across changes to the network connectivity, spike pairing scheme, and formulation of the STDP update rule. We examine the behavior of the network if the STDP model exhibits depression that over- or undercompensates for the correlation in network activity. In section 4.2, we apply a synchronous stimulus to a group of neurons and investigate its potential to induce development of structure. We show that a strong stimulus causes the network to enter a pathological state, whereas the structure-enhancing effects of a weaker stimulus are counteracted by the decoupling of the stimulated group from the rest of the network. These results are discussed in section 5, as are ways in which the investigated models might be adapted to enable stimulus-driven development of structure. An algorithmic template for implementing STDP in network simulations that is compatible with distributed computing is given in the appendix.

2 A Novel Power Law Model of STDP

There are many theoretical models of STDP in circulation. The two main sources of disparity are the weight dependence of the weight changes—for example, additive (Song et al., 2000), multiplicative (Rubin et al., 2001), or somewhere in between (Gütig et al., 2003)—and which pairs of spikes contribute to plasticity—for example, all-to-all (Gerstner, Kempter, van Hemmen, & Wagner, 1996) and nearest neighbor (Izhikevich et al., 2004). Other variants include some degree of activity dependence, for example, suppression (Froemke & Dan, 2002) or postsynaptic activity-dependent homeostasis (van Rossum et al., 2000). It has already been demonstrated that variations in the formulation of the STDP update rules can lead to qualitatively different results (Rubin et al., 2001; Gütig et al., 2003; Izhikevich & Desai, 2003; Burkitt et al., 2004), but there is as yet no consensus. Therefore, finding a formulation of STDP describing the experimental data as well as possible
Figure 1: Weight and time dependency of change for the Bi and Poo (1998) protocol. (A) Absolute change in excitatory postsynaptic current (EPSC) amplitude as a function of initial EPSC amplitude in log-log representation for potentiation, where the presynaptic spike precedes the postsynaptic spike by 2.3 to 8.3 ms (plus signs), and depression, where the postsynaptic spike preceded the presynaptic spike by 3.4 to 23.6 ms (circles). The lower black line is a linear fit to the depression data: slope = −1, offset = 2. The upper black line is a linear fit to the potentiation data assuming a slope = 0.4, resulting in an offset = 1.7. The pale gray line shows an additive STDP prediction for the data, and the dark gray curve shows a multiplicative STDP prediction, with w_max = 3000 pA. (B) Percentage change in EPSC amplitude as a function of spike timing. Curves show the power law prediction for the data with µ = 0.4, τ = 20 ms, λ = 0.1, and α = 0.11 for different initial EPSC amplitudes: 17 pA, pale gray curve; 50 pA, medium gray curve; 100 pA, dark gray curve. All data from Bi and Poo (1998).
is still a relevant issue. Here we neglect the activity dependence and investigate how the weight dependence of potentiation and depression can best be characterized and what is an appropriate spike pairing scheme to apply.

2.1 Weight Dependence of STDP. Figure 1A shows the dependence of synaptic weight change on initial synapse strength for the experimental data from Bi and Poo (1998). The difference between this plot and the classical plot (Figure 5 in Bi and Poo) is that here, the absolute, rather than relative, change in the synaptic strength is plotted, and a double logarithmic representation is used. The exponent of the dependence can be obtained from the slope of a linear fit to the data. In the case of depression, neglecting the largest outlier results in an exponent of −1. We will therefore assume a purely multiplicative update rule: Δw_− ∝ w. In the case of potentiation, depending on the treatment of the largest outlier, an exponent in the range 0.3 to 0.5 results. This suggests a power law rule of the form Δw_+ ∝ w^µ, for which we shall assume µ = 0.4. This novel rule fits the data better than either additive (Δw_+ = c) or multiplicative (Δw_+ ∝ w_max − w) rules, shown
in Figure 1A for comparison. Although all the rules can be fitted to give similar results for synapses in the 30 to 100 pA range, the greatest difference in prediction is for the behavior of very weak synapses. A multiplicative rule predicts a greater absolute potentiation for synapses close to 0 pA than that obtained for synapses in the 30 to 100 pA range, and an additive rule predicts the same absolute potentiation regardless of synaptic strength. In contrast to these rules, the power law rule predicts very small absolute increases when very weak synapses are potentiated.

2.2 Constraining the Model Parameters. Having established the weight dependence for the change in synaptic strength over the entire 60-pair protocol, we now constrain the model parameters. The update rule underlying the synaptic modifications over the course of the protocol has a weight and a time component; specifically, we assume it takes the form

\[
\begin{aligned}
\Delta w_+ &= \lambda_{60}\, w_0^{1-\mu}\, w^{\mu}\, e^{-|\Delta t|/\tau} &\quad &\text{if } \Delta t > 0\\
\Delta w_- &= -\lambda_{60}\, \alpha\, w\, e^{-|\Delta t|/\tau} &\quad &\text{if } \Delta t < 0,
\end{aligned}
\tag{2.1}
\]
where τ is the time constant, λ_60 is the learning rate over the whole protocol, w_0 is a reference weight, and α scales the strength of depressing increments with respect to the strength of potentiating increments. We take Δt to be the difference between the postsynaptic and presynaptic spikes arriving at the synapse, that is, after the delays due to axonal propagation and backpropagation, as suggested by Debanne et al. (1998). If the spikes arrive exactly synchronously, Δt = 0, no alteration is made to the weight.

We start by fitting the potentiation data from Figure 1A to the theoretical form of equation 2.1. We have:

\[
\log(\Delta w_+) = \mu \log(w) + (1-\mu)\log(w_0) + \log\!\left(\lambda_{60}\, e^{-|\Delta t|/\tau}\right).
\tag{2.2}
\]
Without loss of generality, we set w_0 = 1 pA, and substitute for Δt the mean time interval ⟨Δt⟩ = 6.3 ms of the potentiation data. The value of log(λ_60 e^{−⟨Δt⟩/τ}) is thus the offset of the linear fit to the potentiation data. Assuming τ = 20 ms in line with other theoretical work on STDP (e.g., Song et al., 2000; van Rossum et al., 2000), this results in the value λ_60 = 7.5. However, to implement STDP in a simulation, we need to know the learning rate for just one pair, λ, instead of the learning rate over 60 pairs, λ_60. This can be done numerically: by substituting Δw_+ = w in equation 2.2 and solving for w, the value of the weight that would experience 100% potentiation in this protocol can be calculated as

\[
w(0) = w_0\, e^{\frac{\log(\lambda_{60}) - \langle\Delta t\rangle/\tau}{1-\mu}} = 17\ \mathrm{pA}.
\]

We start at
this initial weight and apply the single-pair update rule 60 times:

\[
\begin{aligned}
w(1) &= w(0) + \lambda\, w_0^{1-\mu}\, w(0)^{\mu}\, e^{-|\Delta t|/\tau}\\
w(2) &= w(1) + \lambda\, w_0^{1-\mu}\, w(1)^{\mu}\, e^{-|\Delta t|/\tau}\\
&\ \ \vdots\\
w(60) &= w(59) + \lambda\, w_0^{1-\mu}\, w(59)^{\mu}\, e^{-|\Delta t|/\tau}.
\end{aligned}
\]
A value of λ = 0.1 results in w(60) = 2w(0) = 34 pA, as required. As the range of time intervals used to produce the depression data is too large to have much confidence in the offset of the linear fit, we determined α in an analogous fashion to give 40% depression for Δt = 6.3 ms. For the rest of this article, unless otherwise stated, we will use the following update rule for a pair of spikes:

\[
\begin{aligned}
\Delta w_+ &= \lambda\, w_0^{1-\mu}\, w^{\mu}\, e^{-|\Delta t|/\tau} &\quad &\text{if } \Delta t > 0\\
\Delta w_- &= -\lambda\, \alpha\, w\, e^{-|\Delta t|/\tau} &\quad &\text{if } \Delta t < 0,
\end{aligned}
\tag{2.3}
\]
with µ = 0.4, τ = 20 ms, w_0 = 1 pA, λ = 0.1, and α = 0.11. The fit of this update rule to the time-dependence data of Bi and Poo (1998) is depicted in Figure 1B for three different initial EPSC amplitudes. This produces a window function similar to that reported in studies on cortical rather than hippocampal plasticity (e.g., Froemke & Dan, 2002), where a maximum potentiation of 103% and a maximum depression of 51% were reported. Although the above calculation assumes that the time constants for potentiation and depression are equal, τ_+ = τ_− = τ, it is trivial to adjust the parameters to accommodate τ_+ < τ_−, as reported in several experimental studies (e.g., Feldman, 2000; Froemke & Dan, 2002). An appropriate, efficient algorithmic implementation of STDP suitable for distributed computing is given in the appendix.

2.3 Spike Pairing Scheme. A self-consistent rate is a necessary condition for stability in the balanced random network model. It is known that different spike pairing schemes result in qualitatively different weight dynamics as a function of postsynaptic rate (Kempter et al., 2001; Izhikevich & Desai, 2003; Burkitt et al., 2004). Therefore, to establish which pairs of spikes should be considered for the synaptic weight update, we consider the effects of different spike pairing schemes when a self-consistent rate is perturbed. We implemented a nearest-neighbor scheme, whereby a presynaptic spike is paired with only the last postsynaptic spike to effect depression, and a postsynaptic spike is paired with only the last presynaptic spike to effect
potentiation. Conversely, in an all-to-all scheme, a presynaptic spike is paired with all previous postsynaptic spikes to effect depression, and a postsynaptic spike is paired with all previous presynaptic spikes to effect potentiation.

For this investigation, 1000 neurons were provided with input designed to match the network model described in section 3, namely, 9000 independent excitatory and 2250 independent inhibitory Poisson processes at 7.7 Hz to model input from the local network, and a further 9000 independent excitatory Poisson processes at 2.32 Hz to model external excitation. (For all synaptic and neuronal parameters, refer to section 3.) Of the excitatory inputs modeling local network input, 1000 were mediated by synapses implementing the power law STDP update rules described above; the rest were static. The 1 × 10^6 plastic synapses were initialized from a gaussian distribution with a mean of 45.61 pA and a standard deviation of 4.0 pA. In the case of the all-to-all pairing scheme, we set α = 0.1021 and λ = 0.0973, and in the case of the nearest-neighbor scheme, α = 0.0976 and λ = 0.116. This choice of parameters results in a firing rate of 7.7 Hz and a stable, unimodal synaptic weight distribution, measured across all the plastic synapses, with a mean of 45.5 pA and a standard deviation of 4.0 pA. A unimodal distribution was expected for the power law formulation of STDP, as bimodal distributions have been reported only for additive or near-additive rules (see Gütig et al., 2003), with very weak or no weight dependence, whereas the power law formulation has a strong dependence on the initial synaptic strength.

To perturb the self-consistent rate, a current was injected into the neurons from 50 s onward. A current of 30 pA increased the firing rate by approximately 2 Hz; a current of −35 pA decreased it by the same amount. For all configurations of spike pairing scheme and injected current, the unimodal nature of the distribution was conserved. As can be seen in Figure 2, the choice of pairing scheme has a significant effect on the consequent weight development. If the postsynaptic rate is increased, an all-to-all pairing scheme results in a net decrease of synaptic weights (see Figure 2B), whereas a nearest-neighbor scheme results in a net increase (see Figure 2A). Conversely, if the postsynaptic rate is decreased with respect to the input rate, an all-to-all scheme increases the mean synaptic weight (see Figure 2B), whereas a nearest-neighbor scheme decreases it (see Figure 2A). In other words, the all-to-all scheme acts like a restoring force, counteracting the rate disparity, whereas the nearest-neighbor scheme acts in such a direction as to magnify any rate disparity. These results do not depend on the time constants for depression and potentiation being equal; results obtained for τ_− = 34 ms and τ_+ = 14 ms, as reported by Froemke and Dan (2002), were qualitatively the same (all-to-all scheme: α = 0.042, λ = 0.0973; nearest-neighbor scheme: α = 0.0458, λ = 0.116; data not shown).

In the absence of some sliding threshold mechanism (e.g., Bienenstock, Cooper, & Munro, 1982) or postsynaptic homeostasis (see Turrigiano, Leslie,
Figure 2: Development of mean synaptic weight for (A) nearest-neighbor and (B) all-to-all spike pairing schemes, averaged over 1 × 10^6 synapses. From 50 s onward, a current is injected into the postsynaptic neurons to increase their rate (upward triangles) or decrease it (downward triangles) by 2 Hz. Dashed curves show mean weight development when no current is injected. See the text for further details of the simulation.
Desai, Rutherford, & Nelson, 1998), an all-to-all spike pairing scheme is clearly better suited to situations where a self-consistent rate is required and shall be used for the remainder of this letter.

3 Network Model

We consider networks of current-based integrate-and-fire neurons. For each neuron, the dynamics of the membrane potential V is

\[
\dot V = -\frac{V}{\tau_m} + \frac{I}{C},
\]
where τ_m is the membrane time constant, C is the capacitance of the membrane, and I is the input current to the neuron. The current arises as a superposition of the synaptic currents and any external current. The synaptic current I_s due to one incoming spike is represented as an α-function:

\[
I_s(t) = w\, \frac{e\, t}{\tau_\alpha}\, e^{-t/\tau_\alpha},
\]
where w is the peak value of the current and τ_α is the rise time. When the membrane potential reaches a given threshold value Θ, the membrane potential is clamped to zero for an absolute refractory period τ_r. The values for these parameters used in this article, unless otherwise stated, are: τ_m = 10 ms, C = 250 pF, Θ = 20 mV, τ_r = 0.5 ms, w = 45.61 pA, and τ_α = 0.33 ms. The parameters are chosen such that the generated postsynaptic potentials (PSPs) have a rise time of 1.7 ms and a half-width of 8.5 ms (e.g., Fetz, Toyama, & Smith, 1991). To avoid transients in the dynamics at the beginning of the simulation, the initial membrane potentials are drawn from a gaussian distribution with a mean of 5.7 mV and a standard deviation of 7.2 mV.

The network model was derived from Brunel (2000), where the larger number of excitatory neurons is balanced by increasing the strength of the inhibitory connections. In order to realize biologically realistic values for both the number of connections per neuron (≈ 10^4) and the connection probability (≈ 0.1) in the same network, it is necessary to consider networks on the order of 10^5 neurons, approximately the number of neurons in a cubic millimeter of cortex (Braitenberg & Schüz, 1998). Specifically, we simulated networks with 90,000 excitatory and 22,500 inhibitory neurons. In the static network, each neuron receives 9000 connections from excitatory and 2250 connections from inhibitory neurons, whereby the peak current of an inhibitory synapse is larger than that of an excitatory synapse by a factor of −5. A neuron is not permitted to have a connection to itself. The total number of synapses in the network is 1.27 × 10^9. Additionally, each neuron receives an independent Poissonian stimulus corresponding to 9000 excitatory spike trains, each at 2.32 Hz. The strength of these external excitatory connections is the same as that of the local excitatory connections. The propagation delay between neurons is 1.5 ms, and the networks are simulated with a computational resolution of 0.1 ms.

The activity of the network described above, if all synapses are static, is in the asynchronous irregular (AI) regime (Brunel, 2000), whereby each neuron fires irregularly at a low rate (coefficient of variation of the interspike interval CV_ISI = 0.91, rate ν = 7.8 Hz), and the network activity is not characterized by synchronous events. However, even in the AI regime, the network is not devoid of synchronous activity. In general, both slow and fast oscillations can be excited (Brunel & Hakim, 1999; Brunel, 2000; Tetzlaff, Morrison, Timme, & Diesmann, 2005). The slow oscillations are eradicated by the use of a noisy external stimulus and by drawing the initial membrane potentials from a distribution. The fast oscillations are somewhat more tenacious and are clearly visible in the cross-correlogram in Figure 3. They cause a high variability in the spike count, which can be quantified by the Fano factor, calculated by binning the spike times and dividing the variance of the spike count by the mean. In this article, all Fano factors are calculated by binning the spike trains from 1000 neurons (unless otherwise stated), using a bin size of 3 ms, which resulted in a mean spike count of at least 10 spikes per bin for all the data sets investigated.
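As a concrete illustration of the spike-count statistic just described, the following sketch computes the Fano factor from pooled spike times. It is our own minimal rendering of the procedure in the text (3 ms bins, variance of the count divided by its mean); the function name and array conventions are assumptions, not taken from the article.

    import numpy as np

    def fano_factor(spike_times, t_start, t_stop, bin_size=3e-3):
        # Bin the pooled spike times of the recorded neurons (bin size 3 ms
        # in the article) and divide the variance of the count by its mean.
        edges = np.arange(t_start, t_stop + bin_size, bin_size)
        counts, _ = np.histogram(spike_times, bins=edges)
        return counts.var() / counts.mean()

    # Hypothetical usage: 'spikes' pools the spike times (in seconds)
    # of 1000 recorded neurons over a 50 s interval.
    # f_sc = fano_factor(spikes, t_start=0.0, t_stop=50.0)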
Figure 3: Cross-correlogram of the spiking activity in the static balanced random network, averaged over 500 pairs of neurons and 50 s, using a bin size of 0.1 ms. This is calculated as the cross-covariance normalized by the square root of the product of the variances. As can be seen in the inset, the cross-correlogram is shifted to the right by the synaptic propagation delay of 1.5 ms, as it is recorded from the perspective of the synapse, and the delay is entirely dendritic (see the appendix).
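The normalization stated in the caption (cross-covariance divided by the square root of the product of the variances) can be sketched as follows for two binned spike trains. This is our own illustrative implementation, ignoring edge-bin corrections; the names are assumptions rather than the authors' code.

    import numpy as np

    def cross_correlogram(x, y, max_lag_bins):
        # x, y: spike counts of two neurons in 0.1 ms bins.
        x = x - x.mean()
        y = y - y.mean()
        norm = np.sqrt(x.var() * y.var()) * len(x)
        lags = np.arange(-max_lag_bins, max_lag_bins + 1)
        cc = []
        for k in lags:
            # overlap of x(t) and y(t + k) for lag k
            xs = x[max(0, -k):len(x) - max(0, k)]
            ys = y[max(0, k):len(y) - max(0, -k)]
            cc.append(np.dot(xs, ys) / norm)
        return lags, np.array(cc)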
The above network has a Fano factor for the spike count of F_SC = 7.6, significantly higher than the value of 1 that is characteristic of Poissonian statistics.

In the case of an idealized network with pure Poissonian firing statistics, where the excitatory synapses exhibit STDP as described in section 2, the fixed point of the resulting synaptic weight distribution can easily be determined:

\[
w^* = w_0\, \alpha_p^{1/(\mu-1)}.
\]

Setting w^* = 45.61 pA in line with the static network described above, this determines α_p = 0.1. However, as will be demonstrated in section 4, to retain the activity dynamics of the static network, a slightly larger α > α_p is required to compensate for the structured raw cross-correlation induced by the fast oscillations.
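The fixed point follows from equating the expected potentiation and depression per spike pair under equation 2.3 with τ_+ = τ_−: λ w_0^{1−µ} (w*)^µ = λ α_p w*, which rearranges to the expression above. A few lines of Python confirm the quoted numbers (a sketch under these assumptions; variable names are ours):

    # Check of the Poisson fixed point w* = w0 * alpha_p**(1/(mu - 1)).
    mu, w0 = 0.4, 1.0            # w0 in pA
    w_star = 45.61               # pA, mean weight of the static network
    alpha_p = (w_star / w0) ** (mu - 1.0)   # invert the fixed-point relation
    print(alpha_p)               # ~0.101, i.e., alpha_p = 0.1 as in the text
    # At w*, the weight-dependent factors balance exactly:
    drift = w0 ** (1 - mu) * w_star ** mu - alpha_p * w_star
    print(abs(drift) < 1e-9)     # True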
4 Dynamics of Network Activity and Weight Distribution

4.1 Unstimulated Network. Replacing the excitatory-excitatory connections in the static network with synapses following the power law
Figure 4: Synaptic weight distribution and correlation in the plastic balanced recurrent network with α = 1.057α_p. (A) The percentage area difference of the weight distribution from the final distribution (at 2000 s) as a function of time. The data were recorded from the outgoing excitatory-excitatory synapses of 900 neurons: 8.1 × 10^6 synapses. (B) Histogram of the equilibrium synaptic weight distribution, averaged over 10 samples in the final 500 s. The black curve indicates the gaussian distribution with the mean and standard deviation obtained from the histogram (µ_w = 45.65, σ_w = 3.99). (C) Cross-correlogram of the spiking activity, averaged over 500 pairs of neurons and 50 s, using a bin size of 0.1 ms. The recording was made after the weight distribution had stabilized (400 to 450 s). (D) Difference between the cross-correlograms of the plastic network and a static network using synaptic strengths drawn from the equilibrium distribution shown in (B). Cross-correlograms calculated as in Figure 3.
STDP update rules given by equation 2.3, a stable network configuration is achieved for α = 1.057α_p. The resultant weight distribution settles within 200 s to an approximately gaussian form (µ_w = 45.65 pA, σ_w = 3.99 pA; see Figures 4A and 4B). The activity of the plastic network is close to that of the static network described in section 3: a slightly higher rate of 8.8 Hz is observed. The other spike statistics are also comparable (CV_ISI = 0.88, F_SC = 8.5). Some
difference is to be expected, as in the static network all excitatory-excitatory synapses have the same weight, whereas in the plastic network, the distribution of weights contributes to the fluctuations of the membrane potential and so increases the rate. Indeed, if the weights in the static network are chosen from a gaussian distribution with the same mean and standard deviation as the equilibrium distribution, the activity statistics are very similar (ν = 8.9 Hz, CV_ISI = 0.9, and F_SC = 8.5). For the rest of this letter, the static network using this distribution of synaptic weights will be considered a control case. The similarity of the dynamics is also demonstrated by the cross-correlogram in Figure 4C. The cross-correlogram has qualitatively the same shape as that of the static network, and the difference of the two (see Figure 4D) shows no structure whatsoever. Therefore, we can conclude that the two networks exhibit the same activity dynamics and that the power law STDP update rule described in section 2 is compatible with balanced recurrent networks in the AI regime. However, we do not claim that this property is unique to this STDP model. It seems likely that any STDP formulation that results in a unimodal equilibrium distribution of synaptic weights would also be compatible, at least for some parameter range. This is confirmed for a formulation of the Gütig et al. (2003) update rules in the next section.

4.1.1 Survival of Strong Synapses. One of the main sources of interest in STDP is its apparent ability to create neuronal assemblies (e.g., Izhikevich et al., 2004). However, searching for a stable structure in a network with 8.1 × 10^8 excitatory-excitatory synapses presents a significant data analysis problem in itself. To discover whether such assemblies are self-organizing in the balanced recurrent network, we consider the persistence of the strongest synapses in the network. If structure were developing, strong synapses should stay strong or become even stronger. During the course of the simulation, the outgoing plastic synapses of 900 neurons were monitored every 5 s, and those above a threshold of µ + 1.3σ (50.8 pA) of the equilibrium distribution shown in Figure 4 were recorded (approximately the strongest 10%). The persistence of strong synapses for various initial times is shown in Figure 5A. For a group of synapses that are strong at a given time, the number that remain strong decays exponentially with a time constant of approximately 60 s, and almost none of an initial group remains strong after 800 s. No synapses remained strong for longer than 1000 s, although the simulation ran for 2000 s. This behavior was not systematically dependent on the time at which the initial population was defined, as is demonstrated by the similar courses of the curves in Figure 5A. These results suggest that structure is not developing in the balanced recurrent network.

The survival statistics are qualitatively different for a low-connectivity network, as can be seen in Figure 5B. Here, the number of neurons in the
Figure 5: Persistence of strong synapses. (A) High-connectivity network. From a total population of 8.1 × 10^6 synapses (see Figure 4) and for a given initial time, approximately the 10% strongest synapses are recorded. The number of these synapses that remain strong is plotted as a function of survival time. The shading of the lines indicates the initial times: 100 s (palest gray curve) in steps of 200 s until 1500 s (black curve). (B) Low-connectivity network. As in A, but with a total population of 8.1 × 10^4 synapses and an earliest initial time of 500 s (palest gray curve).
network was reduced to 900 excitatory and 225 inhibitory neurons while retaining a connection probability of 0.1. To compensate for the lower number of synaptic inputs, the rate of the external input was increased to 34.66 Hz, the peak current of the static excitatory synapses was increased by a factor of 4 to 182.44 pA, and the peak current of the inhibitory synapses was a factor of −18 larger than that of the excitatory synapses. In the case of the plastic excitatory synapses, the scaling factor of 4 was applied postsynaptically, so for the purposes of the STDP update, the synaptic weights stayed in the range of approximately 30 to 70 pA. This permitted values of α = 1.109α_p and λ = 0.1 to be used, resulting in an STDP window function as close as possible to that used for the full-sized network, and much the same activity statistics (ν = 7.9 Hz, CV_ISI = 0.91, F_SC = 8.6; 900 neurons). The equilibrium weight distribution was unimodal, with a mean of 45.52 pA and a standard deviation of 6.52 pA. In this case, strong synapses are much more stable, with some remaining strong for thousands of seconds. The decay of the number of strong synapses does not vary systematically with the time at which the initial population was defined, as was the case for the full-sized network. However, the decay is no longer well described by an exponential function, but by a power law with an exponent of −1.02. Despite the longer endurance of strong synapses, it still could not be said that structure is spontaneously
developing in the network: although the number of strong synapses did not quite decay to zero over the 5000 s measuring period, the remainder comprises less than 0.2% of the total number of plastic synapses. Similar results are obtained when using the nearest-neighbor spike pairing scheme described in section 2.3, resulting in a unimodal distribution with a mean of 45.6 pA, a standard deviation of 6.62 pA, and power law survival statistics with an exponent of −0.93 (data not shown). The results are also robust with respect to the formulation of the STDP update rules: power law STDP with asymmetric time constants (τ_− = 34 ms, τ_+ = 14 ms, α = 0.048) and Gütig et al. (2003) STDP (w_max = 100 pA, λ = 0.005, µ = 0.4, and α = 1.188) produce unimodal distributions, with means of 45.7 pA and 45.57 pA and standard deviations of 7.13 pA and 5.17 pA, respectively, and power law survival statistics with exponents of −0.62 and −1.3, respectively (data not shown).

These results show that the connectivity of a network has a significant effect on the stability of individual weights, as strong synapses in the low-connectivity networks persist for much longer periods than in the network with biologically realistic connectivity. Further, they demonstrate that the nondevelopment of structure is quite a robust observation. It is not restricted to high-connectivity networks and does not depend critically on either our specific formulation of the STDP update rules or the spike pairing scheme.

4.1.2 Sensitivity to Scaling of Depression. The value of α = 1.057α_p was determined through a tuning process to give a mean synaptic weight close to the weight of excitatory synapses in the static network, and thus a similar self-consistent firing rate. This raises the question of what the effect on the network activity dynamics would be if α were chosen too high, leading to a net depression of the synapses, or too low, leading to a net potentiation. Clearly, any net change of the synaptic weights will lead to a change in the correlation structure, which will affect the weights, and so on. A network can be stable only if its weight distribution and correlation structure are compatible, as in the network described above. How easy is this to accomplish?

We found that if α is chosen 2% higher—α = 1.078α_p—the network does indeed settle to a new equilibrium, with a near-gaussian weight distribution at a slightly lower mean (µ_w = 45.33 pA) and a slightly increased standard deviation (σ_w = 4.1 pA). The network firing rate is, of course, lower, ν = 3.4 Hz, and the activity is still in the AI regime (CV_ISI = 0.92, F_SC = 3.9). Interestingly, the network is in a more asynchronous state than either the heterogeneous static network or the plastic network with α = 1.057α_p (F_SC = 8.5), as reflected in its cross-correlogram (see Figure 6). The central and side peaks are smaller in amplitude and less well defined, and the difference between the cross-correlograms of this network and the static network (see Figure 6, inset) shows clear structure around τ = 0. This lower degree of synchrony is presumably due to the increased dominance of inhibition in the local recurrent network as the excitatory weights decrease.
Figure 6: Cross-correlogram of the spiking activity in the plastic balanced random network with α = 1.078α_p, averaged over 500 pairs of neurons and 50 s, using a bin size of 0.1 ms. The recording was made after the weight distribution had stabilized (950–1000 s). The inset shows the difference in the cross-correlograms for the plastic and the heterogeneous static networks. Cross-correlograms calculated as in Figure 3.
If α is chosen 2% lower—α = 1.035α_p—a very different picture emerges. Up to 24 s, the mean of the synaptic weight distribution slowly increases to 45.83 pA, and its standard deviation decreases to 3.85 pA, retaining its near-gaussian form. Over this period, the network firing rate increases smoothly from 8 Hz to about 27 Hz. Between 24 s and 26 s, there is a sudden transition. The network firing rate increases rapidly to 200 Hz (see Figure 7, inset), and the synaptic weight distribution splinters into clusters (see Figure 7, main panel). Simultaneously, the network activity dynamics undergoes a transition. The global activity changes from asynchronous irregular firing (see Figure 8A) to strong peaks of activity interspersed with periods of silence (see Figure 8B). Within each peak, a highly regular pattern of activity builds up within a few milliseconds and then decays abruptly, to be replaced with a different pattern of activity (see Figure 8, lower panels).

These results show that α is a critical parameter for the configuration of network and plasticity model investigated here and that the value of α = 1.057α_p is quite close to the bifurcation point. For values of α greater than this, the network is stable, albeit with weaker synaptic weights and consequently lower firing rates. Note that even extremely weak weights will
Figure 7: Weight distribution and rate development in the plastic balanced recurrent network with α = 1.035α_p. The gray histogram shows the distribution of weights for the outgoing excitatory-excitatory synapses of 1000 neurons (i.e., 9 × 10^6 synapses) after 24 s. The black histogram shows the weight distribution for the same group of synapses after 30 s. The inset shows the average rate of these neurons as a function of time.
not extinguish the network activity, due to the external input. For values of α lower than the bifurcation point, the network is unstable and leaves the asynchronous irregular regime.

4.2 Stimulated Network. In section 4.1.1 it was demonstrated that if an appropriate scaling of the plasticity window is chosen such that a self-consistent rate ensues, no structure develops spontaneously. In this section, we investigate whether structure can be induced in the network by generating systematic positive correlation in the network. The most obvious way to accomplish this is to induce a group of neurons to fire synchronously. The greater the number of inputs a neuron has from this group, the more likely it is to fire shortly after the synchronous event.

4.2.1 Stimulus Protocol. A stimulus is applied to a group of neurons (N = 500) at random intervals. At each stimulation event, a rectangular pulse current is injected into each neuron, whereby the injection time for each neuron is drawn from a gaussian distribution (σ = 0.5 ms). In this way, there is no systematic relationship between the firing times of the stimulated group during a stimulus event beyond that given by the gaussian. Due to the random connectivity of the network, the number of connections K_synch
Figure 8: Activity dynamics in the plastic balanced recurrent network with α = 1.035α_p. (A) Instantaneous rate as a function of time between 22 s and 23 s for a sample of 1000 neurons with a bin size of 0.1 ms. (B) As in A, but between 29 s and 30 s. Arrows indicate time periods treated in lower panels. (C) Instantaneous rate as a function of time between 29.1 s and 29.106 s. (D) Corresponding raster plot for 100 neurons. (E) As in C, but for the time period 29.7 s to 29.706 s. (F) Corresponding raster plot; same neurons as in D.
each neuron receives from the stimulated group is binomially distributed with a mean of 50. The higher the convergence a neuron has from this group, the more likely it is to be affected by the stimulation. We therefore define a high-convergence group with respect to the stimulated group, such
Figure 9: Effect of stimulation on the network. Raster plot for 100 arbitrarily indexed neurons from the following populations: (A) stimulated group (N = 500), (B) high-convergence group (N = 568, K_synch ≥ 69), (C) random-convergence group (N = 1000). The stimulus consists of a rectangular pulse current of amplitude 6840 pA and duration 0.1 ms at irregular intervals with a frequency of 3 Hz. For each stimulation event, the injection times for each neuron are drawn from a gaussian distribution with a standard deviation of 0.5 ms.
that K_synch ≥ 69. This value was chosen as it results in a high-convergence group of approximately the same size as the stimulated group (N = 568). The spiking activity for the stimulated group (see Figure 9A), the high-convergence group (see Figure 9B), and a group of neurons (N = 1000) with random convergences (see Figure 9C) shows that although the stimulus has a strong effect within the stimulated group, its initial effect on the high-convergence group is only moderate, and its effect on the random-convergence group is weaker still.

4.2.2 Activity Development of Stimulus-Driven Network. Initially, it seems as if the stimulus has a catastrophic effect on the stability of the network. In Figure 10A, a sudden rate transition can be seen at 224 s. All three recorded neuron populations reach rates of over 200 Hz. A closer inspection of the spiking activity at the transition point (see Figures 10C and 10D) yields some insight. A strong burst in the stimulated group triggers an oscillation
Figure 10: Activity development in the stimulated plastic network. (A) Average rate as a function of time (bin size 1 s) for the stimulated group, pale gray curve; high-convergence group, dark gray curve; random-convergence group, solid black curve. The dashed black line indicates the equilibrium rate for the unstimulated plastic network. (B) As in A, but with no connections permitted within the stimulated group. (C) Raster plot for the stimulated group during the activity state transition, arbitrary neuron indices. (D) Raster plot for the random-convergence group during the activity state transition, arbitrary neuron indices.
in the random-convergence group (which is representative of the activity in the whole network). This synfire explosion was first reported by Mehring, Hehl, Kubo, Diesmann, & Aertsen (2003). Here, however, the effect is magnified rather than dying away, as the network correlation and the synaptic strengths have a positive feedback relationship through the STDP rule. The emergent activity is characterized by bursts of strongly patterned activity separated by periods of no activity, as was also observed for the unstimulated network with α = 1.035α_p (see Figure 8).

Quite apart from its consequences for the network activity, the triggering of the synfire explosion is in itself an interesting effect, given that the
stimulated group initially had only a weak effect on the network (compare Figures 9A and 9C). This is partly due to the implementation of synaptic delays. As neurons were connected randomly, the neurons in the stimulated group also receive input from other neurons within that group. Since the synaptic delay was implemented as dendritic rather than axonal, when two neurons fire simultaneously, the presynaptic spike arrives at the synapse before the postsynaptic spike. This causes a systematic increase in the weights of the intragroup synapses when a synchronous stimulus is repeatedly applied. Although in general the response to the stimulus within the group is weakened, as will be discussed below, this increase of weights can occasionally cause an amplification of the response. In this case, more of the group's neurons fire within a few milliseconds, which naturally has a stronger effect on the rest of the network. If the network is connected such that there are no connections between neurons belonging to the stimulated group, the amplification does not take place, and no synfire explosion is triggered for a stimulated group of this size (see Figure 10B). However, increasing the group size triggers a synfire explosion without the amplifying effects of the intragroup synapses: for a group size of 600 stimulated neurons, this occurs after 249 s; for a group size of 700, after 127 s; and for a group size of 800, after only 7 s (data not shown).

Irrespective of whether a synfire explosion was triggered, a notable feature of the network behavior is the significant reduction in the rate of the stimulated group. This change in rate can be explained with reference to the changes in the synaptic weight distributions (see Figure 11A). The positive correlation between the stimulated group and the high-convergence and random-convergence groups corresponds to a systematic negative correlation from the perspective of the incoming synapses of the stimulated group. As a consequence, the means of the distributions of synaptic weights from the random-convergence and high-convergence groups to the stimulated group decrease from the mean weight determined in section 4.1 by 13.3% and 21.1%, respectively. This represents a significant reduction in the input received by the neurons of the stimulated group and causes them to fire much less frequently. Note that this effect generalizes to any stimulus that creates a systematic positive correlation between a stimulated group and another group of neurons receiving input from it.

This decoupling of the stimulated group from the rest of the network counteracts the development of structure observed in the weights of the outgoing synapses of the stimulated group. The means of the distributions from the stimulated group to the random-convergence and high-convergence groups increase from the mean weight value determined in section 4.1 by 13.8% and 25.5%, respectively. Over the same period, the mean of the synaptic weights unconnected with the stimulated group increased by just 0.2%. Moreover, a clear relationship can be seen between the degree of convergence a given neuron has from the stimulated group and the mean weight of those synapses (see Figure 11B). Up to about the
Figure 11: Development of synaptic weights in the stimulated plastic network, with no connections within the stimulated group. (A) Distribution of synaptic weights between different neuron populations after 400 s: stimulated group to high-convergence group, solid black curve; stimulated group to random-convergence group, dashed black curve; high-convergence group to stimulated group, solid dark gray curve; random-convergence group to stimulated group, dashed dark gray curve. The pale gray curve indicates the distribution of synaptic weights for synapses unconnected to the stimulated group. (B) Average synaptic weight after 400 s as a function of convergence from the stimulated group (dots) and the high-convergence group (crosses). The vertical black line indicates the cut-off point for membership in the high-convergence group (K_synch ≥ 69), and the gray line is a linear fit to the data below this point.
convergence chosen as the cut-off point for membership in the high-convergence group, the dependence of the average synaptic weight on the degree of convergence is well described by a linear relationship. Interestingly, beyond this point, the average synaptic weight increases much faster than the linear prediction.

Although this suggests that a development of functional structure is occurring, such that the volley of near-synchronous spikes is reliably transferred from the stimulated group to the high-convergence group, the development of the weights of the outgoing synapses of the high-convergence group does not support this. Even after 400 s of repeated stimulation, no increase of the average synaptic weight as a function of the convergence from the high-convergence group can be observed. Clearly, the high-convergence group does not echo the stimulus sufficiently reliably to have a corresponding effect on its own downstream high-convergence group, even though the synaptic weights between the stimulated group and the high-convergence group have increased so significantly. The failure of the stimulated group to drive the high-convergence group effectively, despite the increased synaptic weights, can also be seen in the cross-correlogram of their spiking activity in Figure 12. Ideally, we would
Figure 12: Expected and actual correlation development. (A) Initial correlation: cross-correlation in the static network between the stimulated group and an external group with the same convergence with respect to the stimulated group and the rest of the network as the high-convergence group; synaptic weights drawn from the distribution shown in Figure 4B. Averaged over 500 pairs of neurons and 50 s, using a bin size of 0.1 ms. (B) Expected final correlation: as in A, but synaptic weights for the external group are drawn from the appropriate distributions shown in Figure 11A. (C) Actual final correlation: cross-correlation between the stimulated group and the high-convergence group in the plastic network after stabilization of rates. Averaged over 500 pairs of neurons between 300 and 400 s with a bin size of 0.1 ms. Cross-correlograms calculated as in Figure 3.
like to compare the cross-correlation between the stimulated group and the high-convergence group at the beginning and end of the simulation. However, the rates are changing too quickly at the beginning for a cross-correlation to be meaningful. In order to gain some insight into the initial cross-correlation between the stimulated and the high-convergence groups, the static network with heterogeneous weights was used to provide input to an external group of 500 neurons. This group was connected such that it had the same convergence with respect to the stimulated group and the
rest of the network as the high-convergence group, with synaptic weights drawn from the equilibrium distribution shown in Figure 4B. The static network was then stimulated as described in section 4.2.1 to give a reasonable idea of the cross-correlation structure between the stimulated and high-convergence groups in the plastic network at the beginning of the simulation. This is depicted in Figure 12A. To gain insight into how the cross-correlation should have changed as a result of the increased synaptic weights between the stimulated group and the high-convergence group, the experiment was repeated with synaptic weights for the external group drawn from the appropriate distributions shown in Figure 11A. This gives a reasonable idea of the expected cross-correlation at the end of the 400 s simulation, under the assumption that the rate of the stimulated group retains its initial value. The expected cross-correlation is shown in Figure 12B, and a significant increase from the initial cross-correlation is observable. However, the actual cross-correlation between the stimulated and high-convergence groups at the end of the plastic network simulation, shown in Figure 12C, instead indicates a significant reduction in correlation from the initial condition. This suggests that although the initial strong correlation between the stimulated and high-convergence groups causes a systematic increase in the synaptic weights, which should increase the reliability of transferral of the near-synchronous volley of spikes and thus induce development in the outgoing synapses of the high-convergence group, this effect is entirely counteracted by the decoupling of the stimulated group from the rest of the network, which lowers the mean membrane potential of the stimulated group and thereby its responsiveness to the stimulus.

5 Discussion

5.1 Compatibility of STDP with the Balanced Random Network Model. We have shown that in a network with biologically realistic connectivity and sparseness, the weight dynamics of excitatory STDP synapses and the network dynamics can reach a mutual equilibrium. The equilibrium weight distribution is unimodal, and the network dynamics is that of a balanced random network in the asynchronous irregular regime. Specifically, the dynamics is almost identical to that of a static network with weights drawn from the equilibrium distribution of the plastic network. No spontaneous development of structure can be observed: although the weight distribution is stable, the weights of individual synapses are not. A strong synapse does not stay strong indefinitely, but decays with a characteristic time constant and is replaced by other, previously weak, synapses. In other words, the number of strong synapses remains essentially constant, but the membership of this group is permanently in flux. Smaller networks with lower connectivity did not exhibit spontaneous development of structure either, although strong synapses persisted much longer, with a power law rather than exponential decay. The qualitatively different survival
statistics demonstrate the importance of investigating STDP in networks with biologically realistic connectivity.

These results are in contrast to those of Izhikevich et al. (2004), where the development of neuronal groups was observed, but also in contrast to those of Iglesias et al. (2005), where a principal feature was the systematic pruning of the majority of synapses. It is difficult to determine exactly why the results are so divergent, as there are so many differences in the network, neuron, and plasticity models used. However, we have demonstrated the robustness of our findings. Neither a change in the connectivity of the network, nor a change in the spike pairing scheme, nor even a change in the specific formulation of STDP altered the essential finding that structure does not spontaneously develop in the balanced random networks studied.

In order to achieve a stable weight distribution and network activity with the desired mean weight and activity statistics, it was necessary to adjust the strength of depressing increments, while keeping the strength of potentiating increments constant, to compensate for the nonvanishing cross-correlation brought about by network oscillations. This tuning amounts to a 6% increase in the strength of depressing increments, compared to that required to maintain the desired mean weight with purely Poissonian firing statistics. Interestingly, this adjustment brings the strength of a depressing increment closer to the value of α = 0.11 determined from the experimental data in section 2.2: the analytical value for Poissonian statistics is 8% lower, the adjusted value only 3% lower than the experimental value. If the strength of depressing increments is increased further in the direction of the experimental value, we show that a new stable state is found at a lower mean weight and with a lower degree of synchrony. In contrast, if the strength is decreased, no new stable state is reached within the asynchronous irregular regime. After an initial smooth rate increase, a sudden transition can be observed in the activity dynamics and the weight distribution. The activity is characterized by bursts of strongly patterned activity at high rates, interspersed with network silence. The weight distribution, which remains unimodal until the transition, splits into several distinct clusters.

5.2 Effects of Induced Synchronous Activity. We investigated the effects of irregular but synchronous stimuli on a group of neurons within the stable plastic network. If the stimulus is too strong, due to amplification within the group or simply because the group size is too large, a synfire explosion is triggered. Unlike the behavior observed for a static network (Mehring et al., 2003), this explosion does not die out but drives the network into a pathological state. If the stimulus is not strong enough to trigger a synfire explosion, the weights of the outgoing synapses of the stimulated group develop as expected: a net increase is seen, particularly for synapses where the postsynaptic neuron has a high degree of convergence from the stimulated group. However, the causal relationship between the activity
of the stimulated group and the rest of the network is an acausal relationship when viewed from the perspective of the incoming synapses of the stimulated group. These synapses are therefore systematically weakened, which leads to a dramatic fall in the firing rate of the stimulated group. This loss of background input reduces their response to the stimulation, with the end result that the correlation between the stimulated group and the high-convergence group is, in fact, less after 400 s than at the beginning of the simulation, despite the increase in weights. No functional structure is developed under these conditions: the high-convergence group does not echo the stimulus strongly enough to strengthen the synapses of its own downstream high-convergence group.

5.3 Power Law Formulation of STDP. The general framework for normalized STDP update rules established by Gütig et al. (2003) has four free parameters: the maximum weight w_max, the asymmetry parameter α, the step size λ, and the exponent of the weight dependence µ. The time intensity of the simulation study is such that a systematic investigation of all these parameters was unfeasible; therefore, a reexamination of the experimental data of Bi and Poo (1998) was undertaken in order to constrain this framework as tightly as possible. However, when the absolute rather than relative changes in EPSC are considered as a function of the initial EPSC amplitude, a power law relationship emerges that cannot be expressed within the Gütig et al. framework. This novel update rule is a better fit to the data than either additive (Song et al., 2000; van Rossum et al., 2000; Izhikevich et al., 2004) or multiplicative rules (Rubin et al., 2001), or anything in between (Gütig et al., 2003). A property of our formulation of STDP is that the absolute potentiation of very weak synapses is small, rather than maximal (for multiplicative rules) or equal to the absolute potentiation of a stronger synapse (for additive rules). This is a clear prediction that can be tested experimentally to distinguish between the available models more conclusively. We combined this formulation of the STDP update rules with an all-to-all spike pairing scheme, which we showed has a tendency to correct rate disparities, in contrast with a nearest-neighbor scheme, which, in the absence of additional stabilization mechanisms, tends to amplify any such disparities.

5.4 Limitations and Perspectives. Inevitably, this study was limited by the long duration of the simulations. The weight dynamics involve timescales up to three orders of magnitude longer than the activity dynamics in a static network: the transient of the weight distribution investigated in section 4.1 lasted about 200 s, whereas the transient in activity dynamics before a stable asynchronous irregular state is established in a static network is more like 200 ms. This is the main problem with such simulations, rather than the increased computational complexity of implementing plastic synapses; this slows our network simulations by a factor of less than
10. Currently, a 1000 s simulation of the unstimulated network at 8.8 Hz requires about 60 hours of computational time on our PC cluster (20 × 2 processors, AMD Opteron 2.4 GHz, Dolphin/Scali interconnect). Simulating networks with reduced connectivity and scaled synapses ameliorated the problem, but at the cost of changing the survival statistics. It has yet to be established to what extent the network can be reduced from its biologically realistic scale while maintaining qualitatively the same behavior. Regardless of whether this can be achieved, there is a clear need for analytical approaches that simultaneously account for plasticity and activity dynamics.

Given these limitations, there are nonetheless clear indications of how this work can be extended to investigate STDP in a balanced recurrent network more systematically. The discovery that the implementation of delays as purely dendritic can trigger a synfire explosion by strengthening the synapses within a stimulated group suggests that the distribution of the propagation delay between the axon and the dendrite is potentially a crucial parameter for the network dynamics. Future work will need to determine to what extent the pathological state exhibited by the network as a result of a synfire explosion or insufficiently strong depression is affected by an increased proportion of axonal delay. Moreover, the possibility that the patterns observed in these states are at least partially artifacts of discrete time step simulation cannot be ruled out. We are currently investigating efficient methods of incorporating continuous spike times within a discrete time simulation scheme (Morrison, Straube, Plesser, & Diesmann, 2007), which should allow clarification of this issue.

Another major area for development is the relative simplicity of the plasticity model used. It is, of course, necessary to start with simple models; otherwise, when confronted with complex behavior, it is impossible to say which aspect of the model is causing what. However, recent research reveals a rich variety of plasticity expression, such as suppression (Froemke & Dan, 2002), activity-dependent postsynaptic homeostasis (Turrigiano et al., 1998), and nonlinear interactions between potentiation and depression (Wang et al., 2005). These effects may turn out to have fundamental consequences for network behavior. The model used also incorporates no upper bound on synaptic strength, which seems unbiological. However, as no upper bound could be extracted from the experimental data of Bi and Poo (1998), further investigation is required to establish a reasonable saturation behavior for the power law update rule proposed here.

If structure is to develop in a balanced random network as a response to synchronous stimuli, the stimulated group must continue to receive enough background input so that its response to the stimulus does not diminish over time. To achieve this, one or both of the model components in this study must be adjusted. A natural adjustment to the plasticity model would be to incorporate postsynaptic activity-dependent homeostasis as used in van Rossum et al. (2000). This would increase the weights of the incoming synapses of the stimulated group as a response to its falling rate.
Alternatively, the connectivity in the network model could be adjusted. First, in the model used here, all propagation delays are identical, and each neuron has the same number of incoming synapses. Introducing heterogeneities in these quantities has been shown to reduce network oscillations (Brunel, 2000; Tetzlaff et al., 2005), which might prevent the network from entering a pathological state under stronger stimulation than was possible to use here. Second, a connection pattern resulting in a long-tailed rather than binomial convergence distribution for the stimulated group would lessen the acausal correlation between the embedding network and the stimulated group, thus reducing the drop in the mean weight of its incoming synapses.

Appendix: Algorithmic Implementation of STDP

The following implementation assumes that a synapse that connects the axon of neuron i to the dendrite of neuron j is stored in a list belonging to i. In the case of a distributed application, this list is assumed to be physically on the machine of the postsynaptic neuron, j (see Morrison et al., 2005). Each list of postsynaptic targets maintains two variables: t_old, the time of the last presynaptic spike, and K+, which describes the current time-dependent weighting of the STDP update rule for potentiation. Each synapse maintains its current weight w_ji and its synaptic delay d_i = d_i^A + d_i^D, which is composed of its axonal delay d_i^A and its dendritic delay d_i^D, such that d_i^D ≥ d_i^A. Note that these propagation delays are taken into account when calculating the difference in time between presynaptic and postsynaptic spikes: spike timing is calculated from the perspective of the synapse and not the soma (see Debanne et al., 1998). Additionally, the synapse is equipped with two functions F+(w) and F−(w), which perform the weight-dependent update for potentiation and depression, respectively. For example, in the case of power law STDP, F−(w) = λαw.

The target list performs its weight updates and sends spikes to its postsynaptic neurons when its spike method is invoked for a presynaptic spike at time t:

spike(t):
    for each postsynaptic neuron j:
        history ← j.get_history(t_old + d_j^A − d_j^D, t + d_j^A − d_j^D)
        for each spike t_j in history:
            dt ← (t_j + d_j^D) − (t_old + d_j^A)
            if dt ≠ 0:
                w_j ← w_j + F+(w_j) · K+ · exp(−dt/τ)
        K− ← j.get_K_value(t + d_j^A − d_j^D)
        w_j ← w_j − F−(w_j) · K−
        send spike (w_j, d_j) to neuron j
    K+ ← K+ · exp(−(t − t_old)/τ) + 1
    t_old ← t
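To illustrate the shape of such a rule, the two weight-dependence functions might look as follows in Python. This is a minimal sketch with parameter defaults of our own choosing; the specific power law form of F+ (with a reference weight w0) is an assumption for illustration, since only F−(w) = λαw is given explicitly above:

    def f_plus(w, lam=0.1, mu=0.4, w0=1.0):
        """Potentiation step F+(w): a power law in w (illustrative assumption)."""
        return lam * (w0 ** (1.0 - mu)) * (w ** mu)

    def f_minus(w, lam=0.1, alpha=0.05):
        """Depression step F-(w) = lambda * alpha * w, as given in the text."""
        return lam * alpha * w

Note that with this assumed form, µ = 0 reduces to an additive rule and µ = 1 to a multiplicative one, matching the interpolation between the two described by Gütig et al. (2003).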
The postsynaptic neuron j maintains the variables t_old, containing the time of its last spike, K−, which describes the current time-dependent weighting of the STDP update rule for depression, and N_syn, the number of incoming STDP synapses. In addition, a dynamic data structure spike_register is required, which stores spike information in the form of tuples (t_sp, K_sp, counter_sp), where t_sp is the spike time, K_sp is the value of K− at time t_sp, and counter_sp counts how many times the spike information has been accessed by synapses. When neuron j spikes at time t, the spike register has to be updated:

update_register(t):
    K− ← K− · exp(−(t − t_old)/τ) + 1
    while spike_register is not empty and the first element has counter_sp ≥ N_syn:
        remove first element
    store (t, K−, 0) as last element
    t_old ← t
To enable the synapses to access this information, two other methods have to be available: get_history(t1, t2), which returns the spike times recorded in the spike register in the range (t1, t2], and get_K_value(t), which returns the value of K− for the time t. They may be implemented as follows:

get_history(t1, t2):
    history ← empty list
    starting at the beginning of spike_register:
    while t_sp ≤ t1:
        move to next element
    while t_sp ≤ t2:
        append t_sp to history
        counter_sp ← counter_sp + 1
        move to next element
    return history
get_K_value(t):
    starting at the end of spike_register:
    while t_sp ≥ t:
        move to previous element
    return K_sp · exp(−(t − t_sp)/τ)
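The register bookkeeping above translates directly into executable form. The following is a minimal Python sketch (our class and variable names, not the authors' code):

    from collections import deque
    import math

    class SpikeRegister:
        """Per-neuron spike register for the STDP algorithm sketched above."""

        def __init__(self, n_syn, tau):
            self.n_syn = n_syn       # number of incoming STDP synapses (N_syn)
            self.tau = tau           # time constant of the depression trace K-
            self.k_minus = 0.0       # current value of K-
            self.t_old = 0.0         # time of the last postsynaptic spike
            self.register = deque()  # entries are lists [t_sp, K_sp, counter_sp]

        def update_register(self, t):
            """Record a postsynaptic spike at time t."""
            self.k_minus = self.k_minus * math.exp(-(t - self.t_old) / self.tau) + 1.0
            # discard leading spikes already accessed by all incoming synapses
            while self.register and self.register[0][2] >= self.n_syn:
                self.register.popleft()
            self.register.append([t, self.k_minus, 0])
            self.t_old = t

        def get_history(self, t1, t2):
            """Spike times in (t1, t2]; each access increments the spike's counter."""
            history = []
            for entry in self.register:  # entries are sorted by spike time
                if entry[0] <= t1:
                    continue
                if entry[0] > t2:
                    break
                history.append(entry[0])
                entry[2] += 1
            return history

        def get_K_value(self, t):
            """K- at time t, decayed from the last spike before t (0 if none)."""
            for entry in reversed(self.register):
                if entry[0] < t:
                    return entry[1] * math.exp(-(t - entry[0]) / self.tau)
            return 0.0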
This implementation permits a wide class of STDP models to be simulated in a distributed application. Maintaining a spike register means that a neuron does not require knowledge of the locations of its incoming
synapses in order to update their weights when it spikes; synaptic weights are updated only when a presynaptic spike occurs. Using the K± variables to implement the sums of exponential functions and tracking counter_sp, the number of times synapses have accessed a particular spike, allows spike information that is no longer relevant to be discarded, thus keeping the number of retained spikes to a minimum. This means that the memory required does not increase for longer simulations, and the list operations in get_history remain fast.

This template algorithm can be trivially extended to models that require clipping of the weights at upper and lower boundaries or where the presynaptic and postsynaptic time constants differ. Different spike pairing schemes can be investigated by changing the way the K± variables are updated. With more modifications, it can also be extended to more complex models such as the suppression model put forward by Froemke and Dan (2002).

The biggest limitation of the algorithm is the requirement that d_i^D ≥ d_i^A. This is necessary to ensure that causality is not violated: the postsynaptic neuron must not be required to give information on its future state. In this article, for simplicity, the assumption was made that the delay is purely dendritic: d_i^D = d_i, d_i^A = 0. Using the above algorithm, other distributions of the delay between the dendrite and the axon can be investigated up to the limit of d_i^D = d_i^A = d_i/2. This limitation can be circumvented, but at the cost of more complicated techniques, essentially queueing the event until the postsynaptic neuron has reached a state where the required information is available.

Acknowledgments

This work was carried out when A.M. and M.D. were based at the Bernstein Center for Computational Neuroscience, Albert-Ludwigs-University, Freiburg. We are particularly grateful to Guo-Qiang Bi and Mu-ming Poo for providing us with their original data. We acknowledge constructive discussions with the members of the NEST initiative (www.nest-initiative.org). This work would not have been possible without a dedicated grant for a high-performance computer cluster, HBFG Chapter 1423 Title 812 59, and the exertions of Martin Walter and Ulrich Gehring of the Freiburg University Computer Center. This work was partially funded by DAAD 313-PPPN4-lk, DIP F1.2, and BMBF Grant 01GQ0420 to the Bernstein Center for Computational Neuroscience Freiburg.

References

Bi, G.-q., & Poo, M.-m. (1998). Synaptic modifications in cultured hippocampal neurons: Dependence on spike timing, synaptic strength, and postsynaptic cell type. J. Neurosci., 18, 10464–10472.
Bienenstock, E. L., Cooper, L. N., & Munro, P. W. (1982). Theory for the development of neuron selectivity: Orientation specificity and binocular interaction in visual cortex. J. Neurosci., 2(1), 32–48.
Braitenberg, V., & Schüz, A. (1998). Cortex: Statistics and geometry of neuronal connectivity (2nd ed.). Berlin: Springer-Verlag.
Brunel, N. (2000). Dynamics of sparsely connected networks of excitatory and inhibitory spiking neurons. J. Comput. Neurosci., 8(3), 183–208.
Brunel, N., & Hakim, V. (1999). Fast global oscillations in networks of integrate-and-fire neurons with low firing rates. Neural Comput., 11(7), 1621–1671.
Burkitt, A. N., Meffin, H., & Grayden, D. B. (2004). Spike-timing-dependent plasticity: The relationship to rate-based learning for models with weight dynamics determined by a stable fixed point. Neural Comput., 16, 885–940.
Debanne, D., Gähwiler, B. H., & Thompson, S. M. (1998). Long-term synaptic plasticity between pairs of individual CA3 pyramidal cells in rat hippocampal slice cultures. J. Physiol. (Lond.), 507, 237–247.
Feldman, D. E. (2000). Timing-based LTP and LTD at vertical inputs to layer II/III pyramidal cells in rat barrel cortex. Neuron, 27, 45–56.
Fetz, E., Toyama, K., & Smith, W. (1991). Synaptic interactions between cortical neurons. In A. Peters (Ed.), Cerebral cortex (Vol. 9, pp. 1–47). New York: Plenum.
Froemke, R. C., & Dan, Y. (2002). Spike-timing-dependent synaptic modification induced by natural spike trains. Nature, 416(6879), 433–438.
Gerstner, W., Kempter, R., van Hemmen, J. L., & Wagner, H. (1996). A neuronal learning rule for sub-millisecond temporal coding. Nature, 383, 76–78.
Gütig, R., Aharonov, R., Rotter, S., & Sompolinsky, H. (2003). Learning input correlations through nonlinear temporally asymmetric Hebbian plasticity. J. Neurosci., 23(9), 3697–3714.
Guyonneau, R., VanRullen, R., & Thorpe, S. J. (2005). Neurons tune to the earliest spikes through STDP. Neural Comput., 17, 859–879.
Hertz, J., & Prügel-Bennet, A. (1996). Learning synfire-chains by self-organization. Network: Comput. Neural Systems, 7(2), 357–363.
Iglesias, J., Eriksson, J., Grize, F., Tomassini, M., & Villa, A. (2005). Dynamics of pruning in simulated large-scale spiking neural networks. Biosystems, 79, 11–20.
Izhikevich, E. M., & Desai, N. S. (2003). Relating STDP to BCM. Neural Comput., 15, 1511–1523.
Izhikevich, E. M., Gally, J. A., & Edelman, G. M. (2004). Spike-timing dynamics of neuronal groups. Cereb. Cortex, 14, 933–944.
Kempter, R., Gerstner, W., & van Hemmen, J. L. (1999). Hebbian learning and spiking neurons. Phys. Rev. E, 59, 4498–4514.
Kempter, R., Gerstner, W., & van Hemmen, J. L. (2001). Intrinsic stabilization of output rates by spike-based Hebbian learning. Neural Comput., 12, 2709–2742.
Kistler, W. M., & van Hemmen, J. L. (2000). Modeling synaptic plasticity in conjunction with the timing of pre- and postsynaptic action potentials. Neural Comput., 12, 385–405.
Levy, N., Horn, D., Meilijson, I., & Ruppin, E. (2001). Distributed synchrony in a cell assembly of spiking neurons. Neural Networks, 14, 815–824.
Markram, H., Lübke, J., Frotscher, M., & Sakmann, B. (1997). Regulation of synaptic efficacy by coincidence of postsynaptic APs and EPSPs. Science, 275, 213–215.
Mehring, C., Hehl, U., Kubo, M., Diesmann, M., & Aertsen, A. (2003). Activity dynamics and propagation of synchronous spiking in locally connected random networks. Biol. Cybern., 88(5), 395–408.
Morrison, A., Mehring, C., Geisel, T., Aertsen, A., & Diesmann, M. (2005). Advancing the boundaries of high connectivity network simulation with distributed computing. Neural Comput., 17(8), 1776–1801.
Morrison, A., Straube, S., Plesser, H. E., & Diesmann, M. (2007). Exact subthreshold integration with continuous spike times in discrete time neural network simulations. Neural Comput., 19(1), 47–79.
Rubin, J., Lee, D., & Sompolinsky, H. (2001). Equilibrium properties of temporally asymmetric Hebbian plasticity. Phys. Rev. Lett., 86, 364–367.
Sjöström, P., Turrigiano, G., & Nelson, S. (2001). Rate, timing, and cooperativity jointly determine cortical synaptic plasticity. Neuron, 32, 1149–1164.
Song, S., Miller, K. D., & Abbott, L. F. (2000). Competitive Hebbian learning through spike-timing-dependent synaptic plasticity. Nat. Neurosci., 3(9), 919–926.
Tetzlaff, T., Morrison, A., Timme, M., & Diesmann, M. (2005). Heterogeneity breaks global synchrony in large networks. In Proceedings of the 30th Göttingen Neurobiology Conference. Neuroforum (Supp. 1), 206b.
Turrigiano, G. G., Leslie, K. R., Desai, N. S., Rutherford, L. C., & Nelson, S. B. (1998). Activity-dependent scaling of quantal amplitude in neocortical neurons. Nature, 391, 892–896.
van Rossum, M. C. W., Bi, G.-Q., & Turrigiano, G. G. (2000). Stable Hebbian learning from spike timing-dependent plasticity. J. Neurosci., 20(23), 8812–8821.
van Vreeswijk, C., & Sompolinsky, H. (1996). Chaos in neuronal networks with balanced excitatory and inhibitory activity. Science, 274, 1724–1726.
van Vreeswijk, C., & Sompolinsky, H. (1998). Chaotic balanced state in a model of cortical circuits. Neural Comput., 10, 1321–1371.
Wang, H.-X., Gerkin, R. C., Nauen, D. W., & Bi, G.-Q. (2005). Coactivation and timing-dependent integration of synaptic potentiation and depression. Nat. Neurosci., 8(2), 187–193.
Zhang, L. I., Tao, H. W., Holt, C. E., Harris, W. A., & Poo, M. M. (1998). A critical window for cooperation and competition among developing retinotectal synapses. Nature, 395, 37–44.
Received December 18, 2005; accepted September 1, 2006.
LETTER
Communicated by Gal Chechik
Reinforcement Learning Through Modulation of Spike-Timing-Dependent Synaptic Plasticity

Răzvan V. Florian
[email protected]
Center for Cognitive and Neural Studies (Coneural), 400504 Cluj-Napoca, Romania, and Babeş-Bolyai University, Institute for Interdisciplinary Experimental Research, 400271 Cluj-Napoca, Romania
The persistent modification of synaptic efficacy as a function of the relative timing of pre- and postsynaptic spikes is a phenomenon known as spike-timing-dependent plasticity (STDP). Here we show that the modulation of STDP by a global reward signal leads to reinforcement learning. We first derive analytically learning rules involving reward-modulated spike-timing-dependent synaptic and intrinsic plasticity, by applying a reinforcement learning algorithm to the stochastic spike response model of spiking neurons. These rules have several features common to plasticity mechanisms experimentally found in the brain. We then demonstrate in simulations of networks of integrate-and-fire neurons the efficacy of two simple learning rules involving modulated STDP. One rule is a direct extension of the standard STDP model (modulated STDP), and the other one involves an eligibility trace stored at each synapse that keeps a decaying memory of the relationships between recent pairs of pre- and postsynaptic spikes (modulated STDP with eligibility trace). This latter rule permits learning even if the reward signal is delayed. The proposed rules are able to solve the XOR problem with both rate-coded and temporally coded input and to learn a target output firing-rate pattern. These learning rules are biologically plausible, may be used for training generic artificial spiking neural networks, regardless of the neural model used, and suggest the experimental investigation in animals of the existence of reward-modulated STDP.

1 Introduction

The dependence of synaptic changes on the relative timing of pre- and postsynaptic action potentials has been experimentally observed in biological neural systems (Markram, Lübke, Frotscher, & Sakmann, 1997; Bi & Poo, 1998; Dan & Poo, 2004). A typical example of spike-timing-dependent plasticity (STDP) is given by the potentiation of a synapse when the postsynaptic spike follows the presynaptic spike within a time window of a few tens of milliseconds and the depression of the synapse when the order

Neural Computation 19, 1468–1502 (2007)
C 2007 Massachusetts Institute of Technology
of the spikes is reversed. This type of STDP is sometimes called Hebbian because it is consistent with the original postulate of Hebb that predicted the strengthening of a synapse when the presynaptic neuron causes the postsynaptic neuron to fire. It is also antisymmetric, because the sign of synaptic changes varies with the sign of the relative spike timing. Experiments have also found synapses with anti-Hebbian STDP (where the sign of the changes is reversed, in comparison to Hebbian STDP), as well as synapses with symmetric STDP (Dan & Poo, 1992; Bell, Han, Sugawara, & Grant, 1997; Egger, Feldmeyer, & Sakmann, 1999; Roberts & Bell, 2002).

Theoretical studies have mostly focused on the computational properties of Hebbian STDP and have shown its function in neural homeostasis and unsupervised and supervised learning. This mechanism can regulate both the rate and the variability of postsynaptic firing and may induce competition between afferent synapses (Kempter, Gerstner, & van Hemmen, 1999, 2001; Song, Miller, & Abbott, 2000). Hebbian STDP can also lead to unsupervised learning and prediction of sequences (Roberts, 1999; Rao & Sejnowski, 2001). Plasticity rules similar to Hebbian STDP were derived theoretically by optimizing the mutual information between the presynaptic input and the activity of the postsynaptic neuron (Toyoizumi, Pfister, Aihara, & Gerstner, 2005; Bell & Parra, 2005; Chechik, 2003), minimizing the postsynaptic neuron's variability to a given input (Bohte & Mozer, 2005), optimizing the likelihood of postsynaptic firing at one or several desired firing times (Pfister, Toyoizumi, Barber, & Gerstner, 2006), or self-repairing a classifier network (Hopfield & Brody, 2004). It was also shown that by clamping the postsynaptic neuron to a target signal, Hebbian STDP can lead, under certain conditions, to learning a particular spike pattern (Legenstein, Naeger, & Maass, 2005). Anti-Hebbian STDP is, at first glance, not as interesting as the Hebbian mechanism, as it leads, by itself, to an overall depression of the synapses toward zero efficacy (Abbott & Gerstner, 2005).

Hebbian STDP has a particular sensitivity to causality: if a presynaptic neuron contributes to the firing of the postsynaptic neuron, the plasticity mechanism will strengthen the synapse, and thus the presynaptic neuron will become more effective in causing the postsynaptic neuron to fire. This mechanism can lead a network to associate a stable output with a particular input. But let us imagine that the causal relationships are reinforced only when this leads to something good (e.g., if the agent controlled by the neural network receives a positive reward), while the causal relationships that lead to failure are weakened, to avoid erroneous behavior. The synapse should feature Hebbian STDP when the reward is positive and anti-Hebbian STDP when the reward is negative. In this case, the neural network may learn to associate a particular input not with an arbitrary output determined, for example, by the initial state of the network, but with a desirable output, as determined by the reward.

Alternatively, we may consider the theoretical result that under certain conditions, Hebbian STDP minimizes the postsynaptic neuron's variability
to a given presynaptic input (Bohte & Mozer, 2005). An analysis analogous to the one performed in that study can show that anti-Hebbian STDP maximizes variability. This kind of influence of Hebbian/anti-Hebbian STDP on variability has also been observed at a network level in simulations (Daucé, Soula, & Beslon, 2005; Soula, Alwan, & Beslon, 2005). By having Hebbian STDP when the network receives positive reward, the variability of the output is reduced, and the network could exploit the particular configuration that led to positive reward. By having anti-Hebbian STDP when the network receives negative reward, the variability of the network's behavior is increased, and it could thus explore various strategies until it finds one that leads to positive reward.

It is thus tempting to verify whether the modulation of STDP with a reward signal can indeed lead to reinforcement learning. The exploration of this mechanism is the subject of this letter. This hypothesis is supported by a series of studies on reinforcement learning in nonspiking artificial neural networks, where the learning mechanisms are qualitatively similar to reward-modulated STDP, strengthening synapses when reward correlates with both presynaptic and postsynaptic activity. These studies have focused on networks composed of binary stochastic elements (Barto, 1985; Barto & Anandan, 1985; Barto & Anderson, 1985; Barto & Jordan, 1987; Mazzoni, Andersen, & Jordan, 1991; Pouget, Deffayet, & Sejnowski, 1995; Williams, 1992; Bartlett & Baxter, 1999b, 2000b) or threshold gates (Alstrøm & Stassinopoulos, 1995; Stassinopoulos & Bak, 1995, 1996). These previous results are promising and inspiring, but they do not investigate biologically plausible reward-modulated STDP, and they use memoryless neurons and work in discrete time.

Another study showed that reinforcement learning can be obtained by correlating fluctuations in irregular spiking with a reward signal in networks composed of neurons firing Poisson spike trains (Xie & Seung, 2004). The resulting learning rule qualitatively resembles reward-modulated STDP as well. However, the results of the study depend strongly on the Poisson characteristic of the neurons. Also, this learning model presumes that neurons respond instantaneously to their input by modulating their firing rate. This partly ignores the memory of the neural membrane potential, an important characteristic of spiking neural models. Reinforcement learning has also been achieved in spiking neural networks by reinforcement of stochastic synaptic transmission (Seung, 2003). This biologically plausible learning mechanism has several features in common with our proposed mechanism, as we will show later, but it is not directly related to STDP. Another existing reinforcement learning algorithm for spiking neural networks requires a particular feedforward network architecture with a fixed number of layers and is likewise not related to STDP (Takita, Osana, & Hagiwara, 2001; Takita & Hagiwara, 2002, 2005).
Two other previous studies seem to consider STDP as a reinforcement learning mechanism, but in fact they do not. Strösslin and Gerstner (2003) developed a model for spatial learning and navigation based on reinforcement learning, inspired by the hypothesis that eligibility traces are implemented using dopamine-modulated STDP. However, the model does not use spiking neurons and represents just the continuous firing rates of the neurons. The learning mechanism resembles modulated STDP qualitatively by strengthening synapses when reward correlates with both presynaptic and postsynaptic activity, as in some other studies already mentioned.

Rao and Sejnowski (2001) have shown that a temporal difference (TD) rule used in conjunction with dendritic backpropagating action potentials reproduces Hebbian STDP. TD learning is often associated with reinforcement learning, where it is used to predict the value function (Sutton & Barto, 1998). But in general, TD methods are used for prediction problems and thus implement a form of supervised learning (Sutton, 1998). In Rao and Sejnowski's study (2001), the neuron learns to predict its own membrane potential; hence, the rule can be interpreted as a form of unsupervised learning, since the neuron provides its own teaching signal. In any case, that work does not study STDP as a reinforcement learning mechanism, as there is no external reward signal.

In a previous preliminary study, we analytically derived learning rules involving modulated STDP for networks of probabilistic integrate-and-fire neurons and tested them, and some generalizations of them, in simulations, in a biologically inspired context (Florian, 2005). We have also previously studied the effects of Hebbian and anti-Hebbian STDP in oscillatory neural networks (Florian & Mureşan, 2006). Here we study in more depth reward-modulated STDP and its efficacy for reinforcement learning. We first derive analytically, in section 2, learning rules involving reward-modulated spike-timing-dependent synaptic and intrinsic plasticity by applying a reinforcement learning algorithm to the stochastic spike response model of spiking neurons, and we discuss the relationship of these rules with other reinforcement learning algorithms for spiking neural networks. We then introduce, in section 3, two simpler learning rules based on modulated STDP and study them in simulations, in section 4, to test their efficacy for solving some benchmark problems and to explore their properties. The results are discussed in section 5.

In the course of our study, we found that two other groups have independently observed, through simulations, the reinforcement learning properties of modulated STDP. Soula, Alwan, and Beslon (2004, 2005) have trained a robot controlled by a spiking neural network to avoid obstacles by applying STDP when the robot moves forward and anti-Hebbian STDP when the robot hits an obstacle. Farries and Fairhall (2005a, 2005b) have studied a two-layer feedforward network where the input units are treated as independent inhomogeneous Poisson processes and output units are single-compartment, conductance-based models. The network learned to
have a particular pattern of output activity through modulation of STDP by a scalar evaluation of performance. This latter study has been published in abstract form only, and thus the details of the work are not known. In contrast to these studies, in this letter we also present an analytical justification for introducing reward-modulated STDP as a reinforcement learning mechanism, and we also study a form of modulated STDP that includes an eligibility trace that permits learning even with delayed reward.

2 Analytical Derivation of a Reinforcement Learning Algorithm for Spiking Neural Networks

In order to motivate the introduction of reward-modulated STDP as a reinforcement learning mechanism, we derive analytically a reinforcement learning algorithm for spiking neural networks.

2.1 Derivation of the Basic Learning Rule. The algorithm is derived as an application of the OLPOMDP reinforcement learning algorithm (Baxter, Bartlett, & Weaver, 2001; Baxter, Weaver, & Bartlett, 1999), an online variant of the GPOMDP algorithm (Bartlett & Baxter, 1999a; Baxter & Bartlett, 2001). GPOMDP assumes that the interaction of an agent with an environment is a partially observable Markov decision process (POMDP) and that the agent chooses actions according to a probabilistic policy µ that depends on a vector w of several real parameters. GPOMDP was derived analytically by considering an approximation to the gradient of the long-term average of the external reward received by the agent with respect to the parameters w. Results related to the convergence of OLPOMDP to local maxima have been obtained (Bartlett & Baxter, 2000a; Marbach & Tsitsiklis, 1999, 2000). It was shown that applying the algorithm to a system of interacting agents that seek to maximize the same reward signal r(t) is equivalent to applying the algorithm independently to each agent i (Bartlett & Baxter, 1999b, 2000b). OLPOMDP suggests that the parameters w_ij (components of the vector w_i of agent i) should evolve according to

w_{ij}(t + \delta t) = w_{ij}(t) + \gamma' \, r(t + \delta t) \, z'_{ij}(t + \delta t) \qquad (2.1)

z'_{ij}(t + \delta t) = \beta \, z'_{ij}(t) + \zeta_{ij}(t) \qquad (2.2)

\zeta_{ij}(t) = \frac{1}{\mu^i_{f_i}(t)} \frac{\partial \mu^i_{f_i}(t)}{\partial w_{ij}}, \qquad (2.3)
where δt is the duration of a time step (the system evolves in discrete time), the learning rate γ′ is a small, constant parameter, z′ is an eligibility trace (Sutton & Barto, 1998), ζ is a notation for the change of z′ resulting from
the activity in the last time step, µ^i_{f_i}(t) is the policy-determined probability that agent i chooses action f_i(t), and f_i(t) is the action actually chosen at time t. The discount factor β is a parameter that can take values between 0 and 1. To accommodate future developments, we made the following changes to standard OLPOMDP notation:

\beta = \exp(-\delta t / \tau_z) \qquad (2.4)

\gamma' = \gamma \, \delta t / \tau_z \qquad (2.5)

z'_{ij}(t) = z_{ij}(t) \, \tau_z, \qquad (2.6)

with τ_z being the time constant for the exponential decay of z. We thus have

w_{ij}(t + \delta t) = w_{ij}(t) + \gamma \, \delta t \, r(t + \delta t) \, z_{ij}(t + \delta t) \qquad (2.7)

z_{ij}(t + \delta t) = \beta \, z_{ij}(t) + \zeta_{ij}(t) / \tau_z. \qquad (2.8)
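Writing out the substitution once (our step, for clarity): using equations 2.4 to 2.6 in equations 2.1 and 2.2,

w_{ij}(t + \delta t) = w_{ij}(t) + \gamma' \, r \, z'_{ij} = w_{ij}(t) + \frac{\gamma \, \delta t}{\tau_z} \, r \, z_{ij} \, \tau_z = w_{ij}(t) + \gamma \, \delta t \, r \, z_{ij},

z'_{ij}(t + \delta t) = \beta \, z'_{ij}(t) + \zeta_{ij}(t) \;\Longleftrightarrow\; z_{ij}(t + \delta t) = \beta \, z_{ij}(t) + \zeta_{ij}(t) / \tau_z,

which are equations 2.7 and 2.8.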
The parameters β and γ′ should be chosen such that δt/(1 − β) and δt/γ′ are large compared to the mixing time τ_m of the system (Baxter et al., 2001; Baxter & Bartlett, 2001; Bartlett & Baxter, 2000a, 2000c). With our notation and for δt ≪ τ_z, these conditions are met if τ_z is large compared to τ_m and if 0 < γ < 1. The mixing time can be defined rigorously for a Markov process and can be thought of as the time from the occurrence of an action until the effects of that action have died away.

We apply this algorithm to the case of a neural network. We consider that at each time step t, a neuron i either fires (f_i(t) = 1) with probability p_i(t) or does not fire (f_i(t) = 0) with probability 1 − p_i(t). The neurons are connected through plastic synapses with efficacies w_ij(t), where i is the index of the postsynaptic neuron. The efficacies w_ij can be positive or negative (corresponding to excitatory and inhibitory synapses, respectively). The global reward signal r(t) is broadcast to all synapses. By considering each neuron i as an independent agent and the firing and nonfiring probabilities of the neuron as determining the policy µ^i of the corresponding agent (µ^i_1 = p_i, µ^i_0 = 1 − p_i), we get the following form of ζ_ij:
\zeta_{ij}(t) =
\begin{cases}
\dfrac{1}{p_i(t)} \dfrac{\partial p_i(t)}{\partial w_{ij}}, & \text{if } f_i(t) = 1 \\[4pt]
-\dfrac{1}{1 - p_i(t)} \dfrac{\partial p_i(t)}{\partial w_{ij}}, & \text{if } f_i(t) = 0.
\end{cases}
\qquad (2.9)
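Before specializing to spiking neurons, the discrete-time rule of equations 2.7 to 2.9 can be illustrated on a toy memoryless unit. The following Python sketch is our own construction, not from the letter; it assumes a logistic firing probability and a purely illustrative reward, and shows how the eligibility trace accumulates the likelihood gradient while the reward gates the weight change:

    import numpy as np

    rng = np.random.default_rng(0)
    dt, tau_z, gamma = 1.0, 25.0, 0.1
    beta = np.exp(-dt / tau_z)
    w = rng.normal(0.0, 0.1, size=5)   # parameters of the policy
    z = np.zeros_like(w)               # eligibility trace

    for step in range(1000):
        x = rng.integers(0, 2, size=5).astype(float)  # presynaptic activity
        p = 1.0 / (1.0 + np.exp(-(w @ x)))            # firing probability
        f = float(rng.random() < p)                   # sampled action: spike or not
        # eq. 2.9, using dp/dw = p (1 - p) x for the logistic unit
        zeta = (1.0 - p) * x if f else -p * x
        z = beta * z + zeta / tau_z                   # eq. 2.8
        r = 1.0 if f == x[0] else -1.0                # toy reward (illustrative only)
        w = w + gamma * dt * r * z                    # eq. 2.7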
Equation 2.9, together with equations 2.7 and 2.8, establishes a plasticity rule that updates the synapses so as to optimize the long-term average of the reward received by the network. Thus far, we have followed a derivation also performed by Bartlett and Baxter (1999b, 2000b) for networks of memoryless binary stochastic units, although they called them spiking neurons. Unlike these studies, we consider here spiking neurons where the membrane potential keeps a decaying memory of past inputs, as in real neurons. In particular, we consider the spike response model (SRM) for neurons, which reproduces with high accuracy the dynamics of the complex Hodgkin-Huxley neural model while being amenable to analytical treatment (Gerstner, 2001; Gerstner & Kistler, 2002). According to the SRM, each neuron is characterized by its membrane potential u, defined as

u_i(t) = \eta_i(t - \hat{t}_i) + \sum_j w_{ij} \sum_{t_j^f \in \mathcal{F}_j^t} \varepsilon_{ij}\big(t - \hat{t}_i, \, t - t_j^f\big), \qquad (2.10)
where t̂_i is the time of the last spike of neuron i, η_i is the refractory response due to this last spike, t_j^f are the moments of the spikes of presynaptic neuron j emitted prior to t, and w_ij ε_ij(t − t̂_i, t − t_j^f) is the postsynaptic potential induced in neuron i due to an input spike from neuron j at time t_j^f. The first sum runs over all presynaptic neurons, and the last sum runs over all spikes of neuron j prior to t (represented by the set F_j^t). We consider that the neuron has a noisy threshold and that it fires stochastically, according to the escape noise model (Gerstner & Kistler, 2002). The neuron fires in the interval δt with probability p_i(t) = ρ_i(u_i(t) − θ_i) δt, where ρ_i is a probability density, also called firing intensity, and θ_i is the firing threshold of the neuron. We note

\rho'_i = \frac{\partial \rho_i}{\partial (u_i - \theta_i)}, \qquad (2.11)

and we have

\frac{\partial p_i(t)}{\partial w_{ij}} = \rho'_i(t) \, \frac{\partial u_i(t)}{\partial w_{ij}} \, \delta t. \qquad (2.12)

From equation 2.10 we get

\frac{\partial u_i(t)}{\partial w_{ij}} = \sum_{t_j^f \in \mathcal{F}_j^t} \varepsilon_{ij}\big(t - \hat{t}_i, \, t - t_j^f\big); \qquad (2.13)
hence, equation 2.9 can be rewritten as

\zeta_{ij}(t) =
\begin{cases}
\dfrac{\rho'_i(t)}{\rho_i(t)} \displaystyle\sum_{\mathcal{F}_j^t} \varepsilon_{ij}\big(t - \hat{t}_i, \, t - t_j^f\big), & \text{if } f_i(t) = 1 \\[4pt]
-\dfrac{\rho'_i(t) \, \delta t}{1 - \rho_i(t) \, \delta t} \displaystyle\sum_{\mathcal{F}_j^t} \varepsilon_{ij}\big(t - \hat{t}_i, \, t - t_j^f\big), & \text{if } f_i(t) = 0.
\end{cases}
\qquad (2.14)
Together with equations 2.7 and 2.8, this defines the plasticity rule for reinforcement learning. We consider that δt ≪ τ_z, and we rewrite equation 2.8 as

\tau_z \, \frac{z_{ij}(t + \delta t) - z_{ij}(t)}{\delta t} = -z_{ij}(t) + \xi_{ij}(t) + O\!\left(\frac{\delta t}{\tau_z}\right) z_{ij}(t), \qquad (2.15)

where we noted

\xi_{ij}(t) = \frac{\zeta_{ij}(t)}{\delta t}. \qquad (2.16)
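This is just the small-δt expansion of β (our step, spelled out):

\beta = e^{-\delta t/\tau_z} = 1 - \frac{\delta t}{\tau_z} + O\!\big((\delta t/\tau_z)^2\big), \quad\text{so}\quad z_{ij}(t + \delta t) - z_{ij}(t) = -\frac{\delta t}{\tau_z} z_{ij}(t) + \frac{\zeta_{ij}(t)}{\tau_z} + O\!\big((\delta t/\tau_z)^2\big) z_{ij}(t);

multiplying by τ_z/δt and using ξ_ij = ζ_ij/δt gives equation 2.15.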
By taking the limit δt → 0 and using equations 2.7, 2.14, 2.15, and 2.16, we finally get the reinforcement learning rule for continuous time:

\frac{dw_{ij}(t)}{dt} = \gamma \, r(t) \, z_{ij}(t) \qquad (2.17)

\tau_z \, \frac{dz_{ij}(t)}{dt} = -z_{ij}(t) + \xi_{ij}(t) \qquad (2.18)

\xi_{ij}(t) = \left[\frac{\Upsilon_i(t)}{\rho_i(t)} - 1\right] \rho'_i(t) \sum_{\mathcal{F}_j^t} \varepsilon_{ij}\big(t - \hat{t}_i, \, t - t_j^f\big), \qquad (2.19)
where Υ_i(t) = Σ_{F_i} δ(t − t_i^f) represents the spike train of the postsynaptic neuron as a sum of Dirac functions.

We see that when a postsynaptic spike emitted at t_i^f follows a presynaptic spike emitted at t_j^f, z undergoes a sudden growth at t_i^f with

\Delta z = \frac{1}{\tau_z} \, \frac{\rho'_i(t_i^f)}{\rho_i(t_i^f)} \, \varepsilon_{ij}\big(t_i^f - \hat{t}_i, \, t_i^f - t_j^f\big), \qquad (2.20)
which will decay exponentially with a time constant τ_z. Over the long term, after the complete decay, this event leads to a total synaptic change

\Delta w = \int_{t_i^f}^{\infty} \gamma \, r(t) \, \Delta z \, \exp\!\left(-\frac{t - t_i^f}{\tau_z}\right) dt. \qquad (2.21)
If r is constant, this is Δw = γ r Δz τ_z. Hence, if r is positive, the algorithm suggests that the synapse should undergo a spike-timing-dependent potentiation, as in experimentally observed Hebbian STDP. We always have ρ ≥ 0 since it is a probability density, and ρ′ ≥ 0 since the firing probability increases with a higher membrane potential. The spike-timing dependence of the potentiation depends on ε and is approximately exponential, as in experimentally observed STDP, since ε models the exponential decay of the postsynaptic potential after a presynaptic spike (for details, see Gerstner & Kistler, 2002). The amplitude of the potentiation is also modulated by ρ′_i(t_i^f)/ρ_i(t_i^f), in contrast to standard STDP models, but consistent with the experimentally observed dependence of synaptic modification not only on relative spike timings but also on interspike intervals (Froemke & Dan, 2002), on which ρ_i depends indirectly.

Unlike in Hebbian STDP, the depression of z suggested by the algorithm (and the ensuing synaptic depression, if r is positive) is nonassociative, as each presynaptic spike decreases z continuously. In general, the algorithm suggests that synaptic changes are modulated by the reinforcement signal r(t). For example, if the reinforcement is negative, a potentiation of z due to a postsynaptic spike following presynaptic spikes will lead to a depression of the synaptic efficacy w. In the case of constant negative reinforcement, the algorithm implies associative spike-timing-dependent depression and nonassociative potentiation of the synaptic efficacy. This type of plasticity has been experimentally discovered in the neural system of the electric fish (Han, Grant, & Bell, 2000).

2.2 A Neural Model for Bidirectional Associative Plasticity for Reinforcement Learning. Since typical experimentally observed STDP involves only associative plasticity, it is interesting to investigate in which case a reinforcement learning algorithm based on OLPOMDP would lead to associative changes determined by both pre-after-post and post-after-pre spike pairs. It turns out that this happens when presynaptic neurons homeostatically adapt their firing thresholds to keep the average activity of the postsynaptic neurons constant.

To see this, let us consider that not only p_i, but also the firing probability p_j of presynaptic neuron j, depends on the efficacy w_ij of the synapse. We apply the OLPOMDP reinforcement learning algorithm to an agent
formed by neuron i and all presynaptic neurons j that connect to i. As in the previous case, the policy is parameterized by the synaptic efficacies w_ij. However, the policy should now establish the firing probabilities for all presynaptic neurons, as well as for the postsynaptic neuron. Since the firing probabilities depend only on neuronal potentials and thresholds, each neuron decides independently if it fires. Hence, the probability of having a given firing pattern {f_i, {f_j}} factorizes as µ_{f_i,{f_j}} = µ^i_{f_i} Π_j µ^j_{f_j}, where µ^k_{f_k} is the probability of neuron k to choose action f_k ∈ {0, 1}. Only µ^i_{f_i} and µ^j_{f_j} depend on w_ij, and thus

\zeta_{ij}(t) = \frac{1}{\mu_{f_i,\{f_j\}}} \frac{\partial \mu_{f_i,\{f_j\}}}{\partial w_{ij}} = \frac{1}{\mu^i_{f_i} \prod_j \mu^j_{f_j}} \frac{\partial \big(\mu^i_{f_i} \prod_j \mu^j_{f_j}\big)}{\partial w_{ij}} = \frac{1}{\mu^i_{f_i} \mu^j_{f_j}} \frac{\partial \big(\mu^i_{f_i} \mu^j_{f_j}\big)}{\partial w_{ij}}. \qquad (2.22)
The particular form of ζ_ij depends on the various possibilities for the actions taken by neurons i and j:

\mu^i_{f_i} \mu^j_{f_j} =
\begin{cases}
(1 - p_i)(1 - p_j), & \text{if } f_i = f_j = 0 \\
p_i (1 - p_j), & \text{if } f_i = 1 \text{ and } f_j = 0 \\
(1 - p_i) \, p_j, & \text{if } f_i = 0 \text{ and } f_j = 1.
\end{cases}
\qquad (2.23)
We do not consider the case f_i = f_j = 1 since there is a negligibly small probability of having simultaneous spiking in a very small interval δt. After performing the calculation of ζ_ij for the various cases and considering that δt → 0, by analogy with equation 2.9 and the calculation from the previous section, we get

\xi_{ij}(t) = \left[\frac{\Upsilon_i(t)}{\rho_i(t)} - 1\right] \frac{\partial \rho_i(t)}{\partial w_{ij}} + \left[\frac{\Upsilon_j(t)}{\rho_j(t)} - 1\right] \frac{\partial \rho_j(t)}{\partial w_{ij}}. \qquad (2.24)
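For example, in the case f_i = 1 and f_j = 0 (our step, spelled out),

\zeta_{ij} = \frac{\partial}{\partial w_{ij}} \ln\!\big(p_i (1 - p_j)\big) = \frac{1}{p_i} \frac{\partial p_i}{\partial w_{ij}} - \frac{1}{1 - p_j} \frac{\partial p_j}{\partial w_{ij}};

with p = ρ δt and ξ = ζ/δt, the spiking cases contribute the Υ/ρ terms at spike times, while the δt → 0 limit of the nonspiking cases contributes the smooth −∂ρ_i/∂w_ij and −∂ρ_j/∂w_ij terms, together yielding equation 2.24.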
Let us consider that the firing threshold of neuron j has the following dependence:

\theta_j(t) = \tilde{\theta}_j\!\left( \sum_k w_{kj} \sum_{\mathcal{F}_k^t} \chi_j\big(t - t_k^f\big) \right), \qquad (2.25)
where θ̃_j is an arbitrary continuous increasing function, the first sum runs over all postsynaptic neurons to which neuron j projects, and the second
sum runs over all spikes prior to t of postsynaptic neuron k. χ_j is a decaying kernel; for example, χ_j(t − t_k^f) = exp(−(t − t_k^f)/τ_{θj}), which means that θ_j depends on a weighted estimate of the firing rate of postsynaptic neurons, computed using an exponential kernel with time constant τ_{θj}. The proposed dependence of presynaptic firing thresholds means that when the activity of postsynaptic neurons is high, neuron j increases its firing threshold in order to contribute less to their firing and decrease their activity. The activity of each neuron k is weighted with the synaptic efficacy w_kj, since an increase in the firing threshold and a subsequent decrease of the activity of neuron j will have a bigger effect on postsynaptic neurons to which synaptic connections are strong. This is biologically plausible, since there are many neural mechanisms that ensure homeostasis (Turrigiano & Nelson, 2004) and plasticity of intrinsic excitability (Daoudal & Debanne, 2003; Zhang & Linden, 2003), including mechanisms that regulate the excitability of presynaptic neurons as a function of postsynaptic activity (Nick & Ribera, 2000; Ganguly, Kiss, & Poo, 2000; Li, Lu, Wu, Duan, & Poo, 2004).

The firing intensity of neuron j is ρ_j(t) = ρ_j(u_j(t) − θ_j(t)). We have

\frac{\partial \rho_j(t)}{\partial w_{ij}} = -\rho'_j(t) \, \frac{\partial \theta_j(t)}{\partial w_{ij}} = -\rho'_j(t) \, \tilde{\theta}'_j(t) \sum_{\mathcal{F}_i^t} \chi_j\big(t - t_i^f\big). \qquad (2.26)
We finally have

\xi_{ij}(t) = \left[\frac{\Upsilon_i(t)}{\rho_i(t)} - 1\right] \rho'_i(t) \sum_{\mathcal{F}_j^t} \varepsilon_{ij}\big(t - \hat{t}_i, \, t - t_j^f\big) + \left[-\frac{\Upsilon_j(t)}{\rho_j(t)} + 1\right] \rho'_j(t) \, \tilde{\theta}'_j(t) \sum_{\mathcal{F}_i^t} \chi_j\big(t - t_i^f\big), \qquad (2.27)
which together with equations 2.17 and 2.18 defines the plasticity rule in this case. We have the same associative spike-timing-dependent potentiation of z as in the previous case, but now we also have an associative spike-timing-dependent depression of z. The depression corresponding to a presynaptic spike at t_j^f following a postsynaptic spike at t_i^f is

\Delta z = -\frac{1}{\tau_z} \, \frac{\rho'_j(t_j^f)}{\rho_j(t_j^f)} \, \tilde{\theta}'_j(t_j^f) \, \chi_j\big(t_j^f - t_i^f\big), \qquad (2.28)
and the dependence on the relative spike timing is given by χ_j. The variation of z is negative because ρ′_j and θ̃′_j are positive since they are derivatives of
increasing functions, ρ_j is positive since it is a probability density, and χ_j is positive by definition.

We have thus shown that a bidirectional associative spike-timing-dependent plasticity mechanism can result from a reinforcement learning algorithm applied to a spiking neural model with homeostatic control of the average postsynaptic activity. As previously, the synaptic changes are modulated by the global reinforcement signal r(t). As in the previous case, we also have nonassociative changes of z: each presynaptic spike depresses z continuously, while each postsynaptic spike potentiates z continuously. However, depending on the parameters, the nonassociative changes may be much smaller than the spike-timing-dependent ones. For example, the magnitude of the nonassociative depression induced by presynaptic spikes is modulated by the derivative (with respect to the difference between the membrane potential and the firing threshold) of the postsynaptic firing intensity, ρ′_i. This usually has a nonnegligible value when the membrane potential is high (see the forms of the ρ function in Gerstner & Kistler, 2002; Bohte, 2004), and thus a postsynaptic spike is probable to follow. When this happens, z is potentiated and the effect of the depression may be overcome.

2.3 Reinforcement Learning and Spike-Timing-Dependent Intrinsic Plasticity. In this section, we consider that the firing threshold θ_i of the postsynaptic neuron is also an adaptable parameter that contributes to learning, like the w_ij parameters. There is ample evidence that this is the case in the brain (Daoudal & Debanne, 2003; Zhang & Linden, 2003). The firing threshold will change according to
\frac{d\theta_i(t)}{dt} = \gamma_\theta \, r(t) \, z_i^\theta(t) \qquad (2.29)

\tau_{z\theta} \, \frac{dz_i^\theta(t)}{dt} = -z_i^\theta(t) + \xi_i^\theta(t) \qquad (2.30)

\xi_i^\theta(t) = \lim_{\delta t \to 0}
\begin{cases}
\dfrac{1}{\delta t} \dfrac{1}{p_i(t)} \dfrac{\partial p_i(t)}{\partial \theta_i}, & \text{if } f_i(t) = 1 \\[4pt]
-\dfrac{1}{\delta t} \dfrac{1}{1 - p_i(t)} \dfrac{\partial p_i(t)}{\partial \theta_i}, & \text{if } f_i(t) = 0,
\end{cases}
\qquad (2.31)
by analogy with equations 2.17, 2.18, 2.9, and 2.19. Since ∂p_i(t)/∂θ_i = −ρ′_i(t) δt, we have

\xi_i^\theta(t) = \left[-\frac{\Upsilon_i(t)}{\rho_i(t)} + 1\right] \rho'_i(t). \qquad (2.32)
Hence, for positive reward, spikes emitted by a neuron lead to a decrease of the firing threshold of the neuron (an increase of excitability). This is consistent with experimental results showing that neural excitability increases after bursts (Aizenman & Linden, 2000; Cudmore & Turrigiano, 2004; Daoudal & Debanne, 2003). Between spikes, the firing threshold increases continuously (but very slowly when the firing intensity, and thus also its derivative with respect to the difference between the membrane potential and the firing threshold, is low). This reinforcement learning algorithm using spike-timing-dependent intrinsic plasticity is complementary to the two other algorithms previously described and can work simultaneously with either of them.

2.4 Relationship to Other Reinforcement Learning Algorithms for Spiking Neural Networks. It can be shown that the algorithms proposed here share a common analytical background with the other two existing reinforcement learning algorithms for generic spiking neural networks (Seung, 2003; Xie & Seung, 2004). Seung (2003) applies OLPOMDP by considering that the agent is a synapse instead of a neuron, as we did. The action of the agent is the release of a neurotransmitter vesicle, instead of the spiking of the neuron, and the parameter that is optimized is one that controls the release of the vesicle instead of the synaptic connections to the neuron. The result is a learning algorithm that is biologically plausible but for which there exists no experimental evidence yet.

Xie and Seung (2004) do not model in detail the integrative characteristics of the neural membrane potential and consider that neurons respond instantaneously to inputs by changing the firing rate of their Poisson spike train. Their study derives an episodic algorithm that is similar to GPOMDP and extends it to an online algorithm similar to OLPOMDP without any justification. The equations determining online learning (equations 16 and 17 in Xie & Seung, 2004) are identical, after adapting notation, to our equations 2.17, 2.18, and 2.19. By reinterpreting the current-discharge function f in Xie and Seung (2004) as the firing intensity ρ and the synaptic current h_ij as the postsynaptic kernel ε_ij, we can see that the algorithm of Xie and Seung is mathematically equivalent to the algorithm derived, more accurately, here. However, our different interpretation and implementation of the mathematical framework permit making better connections to experimentally observed STDP, as well as a straightforward generalization and application to neural models commonly used in simulations, which the Xie and Seung algorithm does not permit because of its dependence on the memoryless Poisson neural model. The common analytical background of all these algorithms suggests that their learning performance should be similar.

Pfister et al. (2006) developed a theory of supervised learning for spiking neurons that leads to STDP and can be reinterpreted in the context
of reinforcement learning. They studied in detail only particular episodic learning scenarios, not the general online case approached in our letter. However, the analytical background is again the same one, since in their work, synaptic changes depend on a quantity (equation 7 in their article) that is the integral over the learning episode of our ξ_ij(t), equation 2.19. It is thus not surprising that some of their results are similar to ours: they find a negative offset of the STDP function that corresponds to the nonassociative depression suggested by our algorithm for positive reward, and they derive associative spike-timing-dependent depression as the consequence of a homeostatic mechanism, as we did in a different framework.

3 Modulation of STDP by Reward

The derivations in the previous sections show that reinforcement learning algorithms that involve reward-modulated STDP can be justified analytically. We have also previously tested in simulation, in a biologically inspired experiment, one of the derived algorithms, to demonstrate practically its efficacy (Florian, 2005). However, these algorithms involve both associative spike-timing-dependent synaptic changes and nonassociative ones, unlike standard STDP (Bi & Poo, 1998; Song et al., 2000), but in agreement with certain forms of STDP observed experimentally (Han et al., 2000). Moreover, the particular dynamics of the synapses depends on the form of the firing intensity ρ, which is a function for which there exists no experimentally justified model. For these reasons, we chose to study in simulations some simplified forms of the analytically derived algorithms, which extend the standard STDP model in a straightforward way. The learning rules that we propose can be applied to any type of spiking neural model; their applicability is not restricted to probabilistic models or to the SRM that we previously used in the analytical derivation. The proposed rules are just inspired by the analytically derived and experimentally observed ones, and do not follow directly from them.

In what follows, we drop the dependence of plasticity on the firing intensity (which means that the learning mechanism can also be applied to deterministic neural models commonly used in simulations), we consider that both potentiation and depression of z are associative and spike timing dependent (without considering the homeostatic mechanism introduced in section 2.2), we use an exponential dependence of plasticity on the relative spike timings, and we consider that the effect of different spike pairs is additive, as in previous studies (Song et al., 2000; Abbott & Nelson, 2000). The resulting learning rule is determined by equations 2.17 and 2.18, which we repeat below for convenience, and an equation for the dynamics of ξ that takes these simplifications into account:

\frac{dw_{ij}(t)}{dt} = \gamma \, r(t) \, z_{ij}(t) \qquad (3.1)
\tau_z \, \frac{dz_{ij}(t)}{dt} = -z_{ij}(t) + \xi_{ij}(t) \qquad (3.2)

\xi_{ij}(t) = \Upsilon_i(t) \, A_+ \sum_{\mathcal{F}_j^t} \exp\!\left(-\frac{t - t_j^f}{\tau_+}\right) + \Upsilon_j(t) \, A_- \sum_{\mathcal{F}_i^t} \exp\!\left(-\frac{t - t_i^f}{\tau_-}\right), \qquad (3.3)
where τ± and A± are constant parameters. The time constants τ± determine the ranges of interspike intervals over which synaptic changes occur. According to the standard antisymmetric STDP model, A+ is positive and A− is negative. We call the learning rule determined by the above equations modulated STDP with eligibility trace (MSTDPET).

The learning rule can be simplified further by dropping the eligibility trace. In this case, we have a simple modulation by the reward r(t) of standard STDP:

\frac{dw_{ij}(t)}{dt} = \gamma \, r(t) \, \xi_{ij}(t), \qquad (3.4)
where ξ_ij(t) is given by equation 3.3. We call this learning rule modulated STDP (MSTDP). We recall that with our notation, the standard STDP model is given by

\frac{dw_{ij}(t)}{dt} = \gamma \, \xi_{ij}(t). \qquad (3.5)
For all these learning rules, it is useful to introduce a variable P^+_ij that tracks the influence of presynaptic spikes and a variable P^−_ij that tracks the influence of postsynaptic spikes, instead of keeping in memory, in computer simulations, all past spikes and summing their effects repeatedly (Song et al., 2000). These variables may have biochemical counterparts in biological neurons. The dynamics of ξ and P± is then given by

\xi_{ij}(t) = P^+_{ij} \, \Upsilon_i(t) + P^-_{ij} \, \Upsilon_j(t) \qquad (3.6)

dP^+_{ij}/dt = -P^+_{ij}/\tau_+ + A_+ \, \Upsilon_j(t) \qquad (3.7)

dP^-_{ij}/dt = -P^-_{ij}/\tau_- + A_- \, \Upsilon_i(t). \qquad (3.8)
If τ± and A± are identical for all synapses, as is the case in our simulations, then P^+_ij does not depend on i, and P^−_ij does not depend on j, but we will keep the two indices for clarity.
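To make the bookkeeping concrete, the following is a minimal Python sketch (our code and variable names, not the authors') of one synapse updated under MSTDP or MSTDPET; it anticipates the discrete-time equations 3.9 to 3.12 given below and uses the parameter values of section 4.1:

    import math

    def synapse_step(w, p_plus, p_minus, z, f_pre, f_post, r,
                     dt=1.0, tau_plus=20.0, tau_minus=20.0, tau_z=25.0,
                     a_plus=1.0, a_minus=-1.0, gamma=0.1, use_trace=False):
        """One time step for a single synapse; f_pre and f_post are 0/1 spike
        indicators for this step, and r is the global reward signal."""
        # decay the spike traces and add the current spikes (eqs. 3.11, 3.12)
        p_plus = p_plus * math.exp(-dt / tau_plus) + a_plus * f_pre
        p_minus = p_minus * math.exp(-dt / tau_minus) + a_minus * f_post
        # spike-timing-dependent term (eq. 3.10)
        zeta = p_plus * f_post + p_minus * f_pre
        if use_trace:
            # MSTDPET: reward gates a decaying eligibility trace (eqs. 2.7, 2.8)
            z = z * math.exp(-dt / tau_z) + zeta / tau_z
            w = w + gamma * dt * r * z
        else:
            # MSTDP: reward gates the STDP term directly (eq. 3.9)
            w = w + gamma * r * zeta
        return w, p_plus, p_minus, z

In a simulation, the returned weight would additionally be clipped to the synaptic bounds, as noted below.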
If we simulate the network in discrete time with a time step δt, the dynamics of the synapses is defined, for MSTDPET, by equations 2.7 and 2.8, or, for MSTDP, by

w_{ij}(t + \delta t) = w_{ij}(t) + \gamma \, r(t + \delta t) \, \zeta_{ij}(t), \qquad (3.9)

and by

\zeta_{ij}(t) = P^+_{ij}(t) \, f_i(t) + P^-_{ij}(t) \, f_j(t) \qquad (3.10)

P^+_{ij}(t) = P^+_{ij}(t - \delta t) \exp(-\delta t/\tau_+) + A_+ \, f_j(t) \qquad (3.11)

P^-_{ij}(t) = P^-_{ij}(t - \delta t) \exp(-\delta t/\tau_-) + A_- \, f_i(t), \qquad (3.12)
where f_i(t) is 1 if neuron i has fired at time step t and 0 otherwise. The dynamics of the various variables is illustrated in Figure 1. As for the standard STDP model, we also apply bounds to the synapses, in addition to the dynamics specified by the learning rules.

4 Simulations

4.1 Methods. In the following simulations, we used networks composed of integrate-and-fire neurons with resting potential u_r = −70 mV, firing threshold θ = −54 mV, reset potential equal to the resting potential, and decay time constant τ = 20 ms (parameters from Gütig, Aharonov, Rotter, & Sompolinsky, 2003, and similar to those in Song et al., 2000). The network's dynamics was simulated in discrete time with a time step δt = 1 ms. The synaptic weights w_ij represented the total increase of the postsynaptic membrane potential caused by a single presynaptic spike; this increase was considered to take place during the time step following the spike. Thus, the dynamics of the neurons' membrane potential was given by

u_i(t) = u_r + [u_i(t - \delta t) - u_r] \exp(-\delta t/\tau) + \sum_j w_{ij} \, f_j(t - \delta t). \qquad (4.1)
If the membrane potential surpassed the firing threshold θ, it was reset to u_r. Axons propagated spikes with no delays. More biologically plausible simulations of reward-modulated STDP were presented elsewhere (Florian, 2005); here, we aim to present the effects of the proposed learning rules using minimalist setups. We used τ_+ = τ_− = 20 ms; the value of γ varied from experiment to experiment. Except where specified, we used τ_z = 25 ms, A_+ = 1, and A_− = −1.
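As an illustration of these methods, one network time step might look as follows in Python; this is a minimal sketch with our own names, not the authors' code, implementing equation 4.1 together with the threshold crossing and reset:

    import numpy as np

    def network_step(u, w, f_prev, dt=1.0, tau=20.0, u_r=-70.0, theta=-54.0):
        """Advance all integrate-and-fire neurons by one time step (eq. 4.1).
        u: membrane potentials (n,); w: weight matrix (n, n), w[i, j] being the
        synapse from j to i (in mV); f_prev: 0/1 spike vector of the last step."""
        u = u_r + (u - u_r) * np.exp(-dt / tau) + w @ f_prev
        f = (u > theta).astype(float)  # spikes where the threshold is surpassed
        u = np.where(f > 0.0, u_r, u)  # reset to the resting potential
        return u, f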
Figure 1: Illustration of the dynamics of the variables used by MSTDP and MSTDPET and the effects of these rules and of STDP on the synaptic strength for sample spike trains and reward (see equations 2.7, 2.8, and 3.9–3.12). We used γ = 0.2. The values of the other parameters are listed in section 4.1.
4.2 Solving the XOR Problem: Rate-Coded Input. We first show the efficacy of the proposed rules for learning XOR computation. Although this is not a biologically relevant task, it is a classic benchmark problem for artificial neural network training. This problem was also approached using other reinforcement learning methods for generic spiking neural networks (Seung, 2003; Xie & Seung, 2004). XOR computation consists of performing the following mapping between two binary inputs and one binary output: {0, 0} → 0; {0, 1} → 1; {1, 0} → 1; {1, 1} → 0.

We considered a setup similar to the one in Seung (2003). We simulated a feedforward neural network with 60 input neurons, 60 hidden neurons, and 1 output neuron. Each layer was fully connected to the next layer. Binary inputs and outputs were coded by the firing rates of the corresponding neurons. The first half of the input neurons represented the first input and the rest the second input. The input 1 was represented by a Poisson spike train at 40 Hz, and the input 0 was represented by no spiking. The training was accomplished by presenting the inputs and then delivering the appropriate reward or punishment to the synapses,
according to the activity of the output neuron. In each learning epoch, the four input patterns were presented for 500 ms each, in random order. During the presentation of each input pattern, if the correct output was 1, the network received a reward r = 1 for each output spike emitted and 0 otherwise. This rewarded the network for having a high output firing rate. If the correct output was 0, the network received a negative reward (punishment) r = −1 for each output spike and 0 otherwise. The corresponding reward was delivered during the time step following the output spike. Fifty percent of the input neurons coding for either input were inhibitory; the rest were excitatory. The synaptic weights w_ij were hard-bounded between 0 and 5 mV (for excitatory synapses) or between −5 mV and 0 (for inhibitory synapses). Initial synaptic weights were generated randomly within the specified bounds. We considered that the network learned the XOR function if, at the end of an experiment, the output firing rate for the input pattern {1, 1} was lower than the output rates for the patterns {0, 1} or {1, 0} (the output firing rate for the input {0, 0} was obviously always 0). Each experiment had 200 learning epochs (presentations of the four input patterns), corresponding to 400 s of simulated time.

We investigated learning with both MSTDP (using γ = 0.1 mV) and MSTDPET (using γ = 0.625 mV). We found that both rules were efficient for learning the XOR function (see Figure 2). The synapses changed so as to increase the reward received by the network, by minimizing the firing rate of the output neuron for the input pattern {1, 1} and maximizing the same rate for {0, 1} and {1, 0}. With MSTDP, the network learned the task in 99.1% of 1000 experiments, and with MSTDPET, learning was achieved in 98.2% of the experiments. The performance in XOR computation remained constant, after learning, if the reward signal was removed and the synapses were fixed.

4.3 Solving the XOR Problem: Temporally Coded Input. Since in the proposed learning rules synaptic changes depend on the precise timing of spikes, we investigated whether the XOR problem can also be learned if the input is coded temporally. We used a fully connected feedforward network with 2 input neurons, 20 hidden neurons, and 1 output neuron, a setup similar to the one in Xie and Seung (2004). The input signals 0 and 1 were coded by two distinct spike trains of 500 ms in length, randomly generated at the beginning of each experiment by distributing uniformly 50 spikes within this interval. Thus, to underline the distinction with rate coding, all inputs had the same average firing rate of 100 Hz. Since the number of neurons was low, we did not divide them into excitatory and inhibitory ones and allowed the weights of the synapses between the input and the hidden layer to take both signs. These weights were bounded between −10 mV and 10 mV. The weights of the synapses between the hidden layer and the output neuron were bounded between 0 and 10 mV. As before, each experiment consisted of 200 learning epochs, corresponding
Figure 2: Learning XOR computation with rate-coded input with MSTDP and MSTDPET. (a) Evolution during learning of the total reward received in a learning epoch: average over experiments and a random sample experiment. (b) Average firing rate of the output neuron after learning (total number of spikes emitted during the presentation of each pattern in the last learning epoch divided by the duration of the pattern): average over experiments and standard deviation.
to 400 s of simulated time. We used γ = 0.01 mV for experiments with MSTDP and γ = 0.25 mV for MSTDPET. With MSTDP, the network learned the XOR function in 89.7% of 1000 experiments, while with MSTDPET, learning was achieved in 99.5% of the experiments (see Figure 3). The network learned to solve the task by responding to the synchrony of the two input neurons. We verified that after learning, the network solves the task not only for the input patterns used during learning but for any pair of similarly generated random input signals.

4.4 Learning a Target Firing-Rate Pattern. We have also verified the capacity of the proposed learning rules to allow a network to learn a given output pattern, coded by the individual firing rates of the output neurons. During learning, the network received a reward r = 1 when the distance d between the target pattern and the actual one decreased and a punishment r = −1 when the distance increased: r(t + δt) = sign(d(t − δt) − d(t)).
Figure 3: Learning XOR computation with temporally coded input with MSTDP and MSTDPET. (a) Evolution during learning of the total reward received in a learning epoch: average over experiments and a random sample experiment. (b) Average firing rate of the output neuron after learning (total number of spikes emitted during the presentation of each pattern in the last learning epoch divided by the duration of the pattern): average over experiments and standard deviation.
We used a feedforward network with 100 input neurons directly connected to 100 output neurons. Each output neuron received axons from all input neurons. Synaptic weights were bounded between 0 and w_max = 1.25 mV. The input neurons fired Poisson spike trains with constant, random firing rates, generated uniformly, for each neuron, between 0 and 50 Hz. The target firing rates ν⁰ᵢ of the individual output neurons were also generated randomly at the beginning of each experiment, uniformly between 20 and 100 Hz. The actual firing rates νᵢ of the output neurons were measured using a leaky integrator with time constant τ_ν = 2 s:

νᵢ(t) = νᵢ(t − δt) exp(−δt/τ_ν) + (1/τ_ν) fᵢ(t),   (4.2)

where fᵢ(t) = 1 if neuron i fired at time step t and 0 otherwise.
This is equivalent to performing a weighted estimate of the firing rate with an exponential kernel.
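As a minimal sketch of this estimator (assuming a 1 ms time step, as elsewhere in the paper; the function name is ours), equation 4.2 amounts to a single decay-and-accumulate update per step:

```python
import numpy as np

def update_rate_estimate(nu_prev, fired, tau_nu=2.0, dt=0.001):
    # Leaky-integrator rate estimate (equation 4.2): decay the previous
    # estimate and add 1/tau_nu for each spike.
    return nu_prev * np.exp(-dt / tau_nu) + fired / tau_nu

# Example: a neuron firing Poisson spikes at 40 Hz; the estimate converges
# toward 40 Hz on the timescale tau_nu.
rng = np.random.default_rng(0)
nu = 0.0
for _ in range(10000):  # 10 s of simulated time
    nu = update_rate_estimate(nu, float(rng.random() < 40 * 0.001))
print(nu)  # roughly 40
```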
Figure 4: Learning a target firing-rate pattern with MSTDP (sample experiment). (a) Evolution during learning of the distance between the target pattern and the actual output pattern. (b) Evolution during learning of the firing rates of two sample output neurons (solid lines) toward their target output rates (dotted lines).
We measured the total distance between the current output firing pattern and the target one as d(t) = √(Σᵢ [νᵢ(t) − ν⁰ᵢ]²). Learning began after the output rates had stabilized following the initial transients. We performed experiments with both MSTDP (γ = 0.001 mV) and MSTDPET (γ = 0.05 mV). Both learning rules allowed the network to learn the given output pattern, with very similar results. During learning, the distance between the target output pattern and the actual one decreased rapidly (in about 30 s) and then remained close to 0. Figure 4 illustrates learning with MSTDP.

4.5 Exploring the Learning Mechanism. In order to study in more detail the properties of the proposed learning rules, we repeated the learning of a target firing-rate pattern experiment, described above, while varying the various parameters defining the rules. To characterize learning speed, we considered that convergence was achieved at time t_c if the distance d(t) between the actual output and the target output did not become lower than d(t_c) for t between t_c and t_c + τ_c, where τ_c was either 2 s, 10 s, or 20 s, depending on the experiment. We measured learning efficacy as e(t) = 1 − d(t)/d(t₀), where d(t₀) is the distance at the beginning of learning; the learning efficacy at convergence is e_c = e(t_c). The best possible learning efficacy is 1, and a value of 0 indicates no learning. A negative value indicates that the distance between the actual firing-rate pattern and the target increased relative to the case in which the synapses have random values.

4.5.1 Learning Rate Influence. Increasing the learning rate γ results in faster convergence, but as γ increases, its effect on learning speed becomes less and less important.
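The three quantities just defined translate into a few lines of code. This is a minimal sketch; the function names are ours, and the sign convention for the reward follows the definition in section 4.4:

```python
import numpy as np

def distance(nu, nu_target):
    # d(t): Euclidean distance between the actual and target rate patterns.
    return np.sqrt(np.sum((nu - nu_target) ** 2))

def reward(d_prev, d_now):
    # r(t + dt) = sign(d(t - dt) - d(t)): +1 when the distance decreased,
    # -1 when it increased (0 when unchanged).
    return np.sign(d_prev - d_now)

def learning_efficacy(d_now, d_start):
    # e(t) = 1 - d(t)/d(t0): 1 is perfect learning, 0 is no learning,
    # negative values mean the output moved away from the target.
    return 1.0 - d_now / d_start
```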
Figure 5: Influence of learning rate γ on learning a target firing-rate pattern with delayed reward with MSTDP. (a) Convergence time tc as a function of γ . (b) Learning efficacy at convergence e c as a function of γ .
The learning efficacy at convergence decreases linearly with higher learning rates. Figure 5 illustrates these effects for MSTDP (with τ_c = 10 s), but the behavior is similar for MSTDPET.

4.5.2 Nonantisymmetric STDP. We investigated the effects on learning performance of changing the form of the STDP window (function) used by the learning rules by varying the values of the A± parameters. This permitted assessing the contributions to learning of the two parts of the STDP window (corresponding to positive and negative delays, respectively, between pre- and postsynaptic spikes). It also permitted exploring the properties of reward-modulated symmetric STDP. We repeated the learning of a target firing pattern experiment with various values of the A± parameters. Figure 6 presents the learning efficacy after 25 s of learning. It can be observed that for MSTDP, reinforcement learning is possible in all cases where a postsynaptic spike following a presynaptic spike strengthens the synapse when the reward is positive and depresses it when the reward is negative, thus reinforcing causality. When A+ = 1, MSTDP led to reinforcement learning regardless of the value of A−. For MSTDPET, only modulation of standard antisymmetric Hebbian STDP (A+ = 1, A− = −1) led to learning. Many other values of the A± parameters maximized the distance between the output and the target pattern (see Figure 6) instead of minimizing it, which was the task to be learned. These results suggest that the causal nature of STDP is an important factor for the efficacy of the proposed rules.

4.5.3 Reward Baseline. In the previous experiments, the reward was either 1 or −1, thus having a zero baseline. We repeated the learning of a target firing pattern experiment while varying the reward baseline r₀, giving the reward r(t + δt) = r₀ + sign(d(t − δt) − d(t)).
Figure 6: Learning a target firing-rate pattern with delayed reward with modified MSTDP and MSTDPET. Learning efficacy e at 25 s after the beginning of learning as a function of A+ and A− .
Experiments showed that a zero baseline is most efficient, as expected. The efficiency of MSTDP did not change much over a relatively large interval of reward baselines around 0. The efficiency of MSTDPET proved more sensitive to having a zero baseline, but any positive baseline still led to learning. This suggests that for both MSTDP and MSTDPET, reinforcement learning can benefit from the unsupervised learning properties of standard STDP, obtained when r is positive. A positive baseline that ensures that r is always positive is also biologically relevant, since changes in the sign of STDP have not yet been observed experimentally and may not be possible in the brain. Modified MSTDP (with A− being 0 or 1), which has been shown to allow learning for a zero baseline, is the most sensitive to departures from this baseline, with slight deviations driving the learning efficacy to negative values (i.e., the distance is maximized instead of minimized; see Figure 7). The poor performance of modified MSTDP at positive r₀ is caused by the tendency of all synapses to take the maximum allowed values. At negative r₀, the distribution of synapses is biased more toward small values for modified MSTDP than for normal MSTDP. The relatively poor performance of MSTDPET at negative r₀ is due to a bias of the distribution of synapses toward large values. For the cases where learning is efficient, the synaptic distribution after learning is almost uniform, with MSTDP and modified MSTDP also having additional small peaks at the extremities of the allowed interval.

4.5.4 Delayed Reward. For the tasks presented thus far, MSTDPET had a performance comparable to or poorer than MSTDP. The advantage of using an eligibility trace appears when learning a task where the reward is delivered
Figure 7: Learning a target firing-rate pattern with MSTDP, MSTDPET, and modified MSTDP. Learning efficacy e at 25 s after the beginning of learning as a function of reward baseline r0 .
to the network with a certain delay σ with respect to the state or the output of the network that caused it. To properly investigate the effects of reward delay on MSTDPET learning performance for various values of τ_z, we had to factor out the effects of varying τ_z itself. A higher τ_z leads to smaller synaptic changes: since z decays more slowly, if the reward varies from time step to time step and the probabilities of positive and negative reward are approximately equal, their effects on the synapse, having opposite signs, almost cancel out. Hence, varying τ_z changes the learning efficacy and the convergence time, much like varying the learning rate, as described previously. To compensate for this effect, we varied the learning rate γ as a function of τ_z such that the convergence time of MSTDPET was constant at reward delay σ = 0 and approximately equal to the convergence time of MSTDP for γ = 0.001, also at σ = 0. We obtained the corresponding values of γ by numerically solving the equation t_c^MSTDPET(γ) = t_c^MSTDP using Ridder's method (Press, Teukolsky, Vetterling, & Flannery, 1992). For determining convergence, we used τ_c = 2 s. The dependence of the obtained values of γ on τ_z is illustrated in Figure 8c. MSTDPET continued to allow the network to learn, while MSTDP and modified MSTDP (A− ∈ {−1, 0, 1}) failed in this case to solve the task, even for small delays. MSTDP and modified MSTDP led to learning in our setup only if the reward was not delayed. In this case, changes in the strength of the ij synapse were determined by the product r(t + δt) ζ_ij(t) of the reward r(t + δt) that corresponded to the network's activity (postsynaptic spikes) at time t and of the variable ζ_ij(t) that was nonzero only if there
Figure 8: Learning a target firing-rate pattern with delayed reward with MSTDP and MSTDPET. (a, b) Learning efficacy e at 25 s after the beginning of learning, as a function of reward delay σ and eligibility trace decay time constant τz . (a) Dependence on σ (standard deviation is not shown for all data sets, for clarity). (b) Dependence on τz (MSTDPET only). (c) The values of the learning rate γ , used in the MSTDPET experiments, which compensate at 0 delay for the variation of τz .
were pre- or postsynaptic spikes during the same time step t. Even a small extra reward delay σ = δt impeded learning. Figure 8a illustrates the performance of both learning rules as a function of the reward delay σ for learning a target firing pattern. Figure 8b illustrates the performance of MSTDPET as a function of τ_z for several values of the reward delay σ. The
value of τ_z that yields the best performance for a particular delay increases monotonically and very rapidly with the delay. For zero delay, or for values of τ_z larger than optimal, performance decreases with higher τ_z. Performance was estimated by measuring the average (over 100 experiments) of the learning efficacy e at 25 s after the beginning of learning. The eligibility trace keeps a decaying memory of the relationships between the most recent pairs of pre- and postsynaptic spikes. It is thus understandable that MSTDPET permits learning even with delayed reward, as long as this delay is smaller than τ_z. It can be seen from Figure 8 that, at least for the studied setup, learning efficacy degrades significantly for large delays, even when they are much smaller than τ_z, and there is no significant learning for delays larger than about 7 ms. The maximum delay for which learning is possible may be larger for other setups.

4.5.5 Continuous Reward. In the previous experiments, the reward varied from time step to time step, at a timescale of δt = 1 ms. In animals, the internal reward signal (e.g., a neuromodulator) varies more slowly. To investigate learning performance in such a case, we used a continuous reward varying on a timescale τ_r, with dynamics given by
r(t + δt) = r(t) exp(−δt/τ_r) + (1/τ_r) sign(d(t − δt − σ) − d(t − σ)).   (4.3)
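Equation 4.3 is a low-pass filter applied to the sign of the (possibly delayed) distance change. The following is a minimal sketch of one update; the callable interface for the distance history is an illustrative assumption:

```python
import numpy as np

def continuous_reward_step(r_prev, d, t, sigma=0.0, tau_r=0.025, dt=0.001):
    """One update of the low-pass-filtered reward of equation 4.3.

    d is a callable returning the distance at a given time; sigma is the
    reward delay.
    """
    drive = np.sign(d(t - dt - sigma) - d(t - sigma))
    return r_prev * np.exp(-dt / tau_r) + drive / tau_r
```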
Both MSTDP and MSTDPET continue to work when the reward is continuous and not delayed. Learning efficacy decreases monotonically with higher τ_r (see Figures 9a and 9b). As in the case of a discrete reward signal, MSTDPET allows learning even with small reward delays σ (see Figure 9c), while MSTDP does not lead to learning when the reward is delayed.

4.5.6 Scaling of Learning with Network Size. To explore how the performance of the proposed learning mechanisms scales with network size, we varied systematically the numbers of input and output neurons (N_i and N_o, respectively). The maximum synaptic efficacies were set to w_max = 1.25 · (N_i/100) mV in order to keep the average input per postsynaptic neuron constant across setups with different numbers of input neurons. When N_o varies with N_i constant, the learning efficacy at convergence e_c diminishes and the convergence time t_c increases with higher N_o, because the difficulty of the task increases. When N_i varies with N_o constant, the convergence time decreases with higher N_i: the difficulty of the task remains constant, but the number of ways in which the task can be solved increases due to the higher number of incoming synapses per output neuron. The learning efficacy nevertheless diminishes with higher N_i when N_o is constant.
Figure 9: Learning driven by a continuous reward signal with a timescale τr . (a, b) Learning performance as a function of τr . (a) MSTDP. (b) MSTDPET. (c) Learning performance of MSTDPET as a function of reward delay σ (τz = τr = 25 ms). All experiments used γ = 0.05 mV.
When both N_i and N_o vary at the same time (N_i = N_o = N), the convergence time decreases slightly for larger network sizes (see Figure 10a). Again, the learning efficacy diminishes with higher N, which means that one cannot train networks of arbitrary size with this method (see Figure 10b). The behavior is similar for both MSTDP and MSTDPET, but for MSTDPET, the decrease of learning efficacy for larger networks is more pronounced. Assuming that the dependence of e_c on N is linear for N ≥ 600 and extrapolating from the available data, learning is not possible (e_c ≤ 0) for networks with N > N_max, where N_max is about 4690 for MSTDP and 1560 for MSTDPET, for the setup and the parameters considered here. For a given configuration of parameters, learning efficacy can be improved at the expense of learning time by decreasing the learning rate γ (as discussed above; see Figure 5).
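The extrapolation just described is a simple linear fit and zero crossing. A minimal sketch follows; the data points below are placeholders chosen for illustration, not values from the paper:

```python
import numpy as np

# Fit a line to e_c as a function of N for N >= 600 and extrapolate to the
# zero crossing, where learning is no longer possible.
N = np.array([600.0, 1000.0, 1400.0, 1800.0])   # placeholder network sizes
e_c = np.array([0.55, 0.50, 0.44, 0.39])        # placeholder efficacies
slope, intercept = np.polyfit(N, e_c, 1)
N_max = -intercept / slope  # network size at which e_c extrapolates to 0
print(round(N_max))
```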
Figure 10: Scaling with network size of learning a target firing-rate pattern, with MSTDP and MSTDPET. (a) Convergence time tc as a function of network size Ni = No = N. (b) Learning efficacy at convergence e c as a function of N.
4.5.7 Reward-Modulated Classical Hebbian Learning. To further investigate the importance of causal STDP as a component of the proposed learning rules, we also studied the properties of classical (rate-dependent) Hebbian plasticity modulated by reward, while still using spiking neural networks. Under certain conditions (Xie & Seung, 2000), classical Hebbian plasticity can be approximated by modified STDP when the plasticity window is symmetric, A+ = A− = 1. In this case, according to the notation in Xie and Seung (2000), we have β₀ > 0 and β₁ = 0, and thus plasticity depends (approximately) on the product of the pre- and postsynaptic firing rates, not on their temporal derivatives. Reward-modulated modified STDP with A+ = A− = 1 was studied above. A more direct approximation of modulated rate-dependent Hebbian learning with spiking neurons is the following learning rule,

dw_ij(t)/dt = γ r(t) P⁺_ij(t) P⁻_ij(t),   (4.4)
where P±_ij are computed as for MSTDP and MSTDPET, with A+ = A− = 1. This rule is a reward-modulated version of the classical Hebbian rule, since P⁺_ij is an estimate of the firing rate of the presynaptic neuron j, and P⁻_ij is an estimate of the firing rate of the postsynaptic neuron i, both using an exponential kernel. The reward can take both signs, so this rule allows both potentiation and depression of synapses; unlike classic Hebbian learning, no extra depression mechanism is needed to prevent runaway synaptic growth. We repeated the experiments presented above using this learning rule, but could not achieve learning with it for any value of γ between 10⁻⁶ and 1 mV.
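For comparison with the MSTDP sketch above, here is a minimal Euler-step rendering of equation 4.4; the function and variable names are ours:

```python
import numpy as np

def hebbian_reward_step(w, P_plus, P_minus, pre, post, r, gamma,
                        tau=0.02, dt=0.001):
    # Euler step of equation 4.4: dw_ij/dt = gamma * r * P+_ij * P-_ij.
    # With A+ = A- = 1, the traces are exponential-kernel estimates of the
    # pre- and postsynaptic firing rates (up to a constant factor).
    decay = np.exp(-dt / tau)
    P_plus = P_plus * decay + pre     # presynaptic rate estimate (neuron j)
    P_minus = P_minus * decay + post  # postsynaptic rate estimate (neuron i)
    w = w + gamma * r * np.outer(P_minus, P_plus) * dt
    return w, P_plus, P_minus
```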
4.5.8 Sensitivity to Parameters. The results of the simulations presented above are quite robust with respect to the parameters used for the learning rules. The only parameter that we had to tune from experiment to experiment was the learning rate γ, which determined the magnitude of synaptic changes. A learning rate that is too small does not change the synapses rapidly enough for the effects of learning to be observed within a given experiment duration, while a learning rate that is too large pushes the synapses toward their bounds, thus limiting the network's behavior. Tuning the learning rate is needed for any kind of learning rule involving synaptic plasticity. The magnitude of the time constant τ_z of the decay of the eligibility trace was not essential in experiments without delayed reward, although large values degrade learning performance (see Figure 8b). For the relative magnitudes of positive and negative rewards, we made the natural choice of setting them equal and did not have to tune them, as in Xie and Seung (2004), to achieve learning. The absolute magnitudes of the reward and of A± are not relevant, as their effect on the synapses is scaled by γ.

5 Discussion

We have derived several versions of a reinforcement learning algorithm for spiking neural networks by applying an abstract reinforcement learning algorithm to the spike response model of spiking neurons. The resulting learning rules are similar to spike-timing-dependent synaptic and intrinsic plasticity mechanisms experimentally observed in animals, but involve an extra modulation of the synaptic and excitability changes by the reward. We have also studied in computer simulations the properties of two learning rules, MSTDP and MSTDPET, that preserve the main features of the rules derived analytically while being simpler, and that are also extensions of the standard model of STDP. We have shown that the modulation of STDP by a global reward signal leads to reinforcement learning. An eligibility trace that keeps a decaying memory of the effects of recent spike pairings allows learning when the reward is delayed. The causal nature of the STDP window seems to be an important factor for the learning performance of the proposed rules. The continuity between the proposed reinforcement learning mechanism and experimentally observed STDP makes it biologically plausible and also establishes a continuity between reinforcement learning and the unsupervised learning capabilities of STDP. The introduction of the eligibility trace z does not contradict what is currently known about biological STDP, as it simply implies that synaptic changes are not instantaneous but are implemented through the generation of a set of biochemical substances that decay exponentially after generation. A new feature is the modulatory effect of the reward signal r. This may be implemented in the brain by a neuromodulator. For example, dopamine carries a short-latency
reward signal indicating the difference between actual and predicted rewards (Schultz, 2002), which fits well with our learning model based on continuous reward-modulated plasticity. It is known that dopamine and acetylcholine modulate classical (firing-rate-dependent) long-term potentiation and depression of synapses (Seamans & Yang, 2004; Huang, Simpson, Kellendonk, & Kandel, 2004; Thiel, Friston, & Dolan, 2002). The modulation of STDP by neuromodulators is supported by the discovery of the amplification of spike-timing-dependent potentiation in hippocampal CA1 pyramidal neurons by the activation of beta-adrenergic receptors (Lin, Min, Chiu, & Yang, 2003, Figure 6F). As speculated before (Xie & Seung, 2004), other studies may have failed to detect the influence of neuromodulators on STDP because the reward circuitry may not have been active during the experiments and the reward signal may have been fixed at a given value. In this case, according to the proposed learning rules, for excitatory synapses, a reward frozen at a positive value during the experiment leads to Hebbian STDP, and a reward frozen at a negative value leads to anti-Hebbian STDP. The studied learning rules may be used in applications for training generic artificial spiking neural networks, and they suggest the experimental investigation in animals of the existence of reward-modulated STDP.

Acknowledgments

We thank the reviewers, Raul C. Mureşan, Walter Senn, and Sergiu Paşca, for useful advice. This work was supported by Arxia SRL and by a grant of the Romanian government (MEdC-ANCS).

References

Abbott, L. F., & Gerstner, W. (2005). Homeostasis and learning through spike-timing dependent plasticity. In B. Gutkin, D. Hansel, C. Meunier, J. Dalibard, & C. Chow (Eds.), Methods and models in neurophysics: Proceedings of the Les Houches Summer School 2003. Amsterdam: Elsevier Science.
Abbott, L. F., & Nelson, S. B. (2000). Synaptic plasticity: Taming the beast. Nature Neuroscience, 3, 1178–1183.
Aizenman, C. D., & Linden, D. J. (2000). Rapid, synaptically driven increases in the intrinsic excitability of cerebellar deep nuclear neurons. Nature Neuroscience, 3(2), 109–111.
Alstrøm, P., & Stassinopoulos, D. (1995). Versatility and adaptive performance. Physical Review E, 51(5), 5027–5032.
Bartlett, P. L., & Baxter, J. (1999a). Direct gradient-based reinforcement learning: I. Gradient estimation algorithms (Tech. Rep.). Canberra: Australian National University, Research School of Information Sciences and Engineering.
Bartlett, P. L., & Baxter, J. (1999b). Hebbian synaptic modifications in spiking neurons that learn (Tech. Rep.). Canberra: Australian National University, Research School of Information Sciences and Engineering.
Bartlett, P., & Baxter, J. (2000a). Stochastic optimization of controlled partially observable Markov decision processes. In Proceedings of the 39th IEEE Conference on Decision and Control. Piscataway, NJ: IEEE.
Bartlett, P. L., & Baxter, J. (2000b). A biologically plausible and locally optimal learning algorithm for spiking neurons. Available online at http://arp.anu.edu.au/ftp/papers/jon/brains.pdf.gz.
Bartlett, P. L., & Baxter, J. (2000c). Estimation and approximation bounds for gradient-based reinforcement learning. In Proceedings of the Thirteenth Annual Conference on Computational Learning Theory (pp. 133–141). San Francisco: Morgan Kaufmann.
Barto, A. G. (1985). Learning by statistical cooperation of self-interested neuron-like computing elements. Human Neurobiology, 4, 229–256.
Barto, A., & Anandan, P. (1985). Pattern-recognizing stochastic learning automata. IEEE Transactions on Systems, Man, and Cybernetics, 15(3), 360–375.
Barto, A. G., & Anderson, C. W. (1985). Structural learning in connectionist systems. In Proceedings of the Seventh Annual Conference of the Cognitive Science Society. Mahwah, NJ: Erlbaum.
Barto, A. G., & Jordan, M. I. (1987). Gradient following without back-propagation in layered networks. In M. Caudill & C. Butler (Eds.), Proceedings of the First IEEE Annual Conference on Neural Networks (Vol. 2, pp. 629–636). Piscataway, NJ: IEEE.
Baxter, J., & Bartlett, P. L. (2001). Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research, 15, 319–350.
Baxter, J., Bartlett, P. L., & Weaver, L. (2001). Experiments with infinite-horizon, policy-gradient estimation. Journal of Artificial Intelligence Research, 15, 351–381.
Baxter, J., Weaver, L., & Bartlett, P. L. (1999). Direct gradient-based reinforcement learning: II. Gradient ascent algorithms and experiments (Tech. Rep.). Canberra: Australian National University, Research School of Information Sciences and Engineering.
Bell, A. J., & Parra, L. C. (2005). Maximising sensitivity in a spiking network. In L. K. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing systems, 17. Cambridge, MA: MIT Press.
Bell, C. C., Han, V. Z., Sugawara, Y., & Grant, K. (1997). Synaptic plasticity in a cerebellum-like structure depends on temporal order. Nature, 387(6630), 278–281.
Bi, G.-Q., & Poo, M.-M. (1998). Synaptic modifications in cultured hippocampal neurons: Dependence on spike timing, synaptic strength, and postsynaptic cell type. Journal of Neuroscience, 18(24), 10464–10472.
Bohte, S. M. (2004). The evidence for neural information processing with precise spike-times: A survey. Natural Computing, 3(2), 195–206.
Bohte, S. M., & Mozer, M. C. (2005). Reducing spike train variability: A computational theory of spike-timing dependent plasticity. In L. K. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing systems, 17. Cambridge, MA: MIT Press.
Chechik, G. (2003). Spike time dependent plasticity and information maximization. Neural Computation, 15, 1481–1510.
Cudmore, R. H., & Turrigiano, G. G. (2004). Long-term potentiation of intrinsic excitability in LV visual cortical neurons. Journal of Neurophysiology, 92, 341–348.
Dan, Y., & Poo, M.-M. (1992). Hebbian depression of isolated neuromuscular synapses in vitro. Science, 256(5063), 1570–1573.
Dan, Y., & Poo, M.-M. (2004). Spike timing-dependent plasticity of neural circuits. Neuron, 44, 23–30.
Daoudal, G., & Debanne, D. (2003). Long-term plasticity of intrinsic excitability: Learning rules and mechanisms. Learning and Memory, 10, 456–465.
Daucé, E., Soula, H., & Beslon, G. (2005). Learning methods for dynamic neural networks. In Proceedings of the 2005 International Symposium on Nonlinear Theory and Its Applications (NOLTA 2005). Bruges, Belgium.
Egger, V., Feldmeyer, D., & Sakmann, B. (1999). Coincidence detection and changes of synaptic efficacy in spiny stellate neurons in rat barrel cortex. Nature Neuroscience, 2(12), 1098–1105.
Farries, M. A., & Fairhall, A. L. (2005a). Reinforcement learning with modulated spike timing-dependent plasticity. Poster presented at the Computational and Systems Neuroscience Conference (COSYNE 2005). Available online at http://www.cosyne.org/climages/d/dy/COSYNE05 Abstracts.pdf.
Farries, M. A., & Fairhall, A. L. (2005b). Reinforcement learning with modulated spike timing-dependent plasticity. Program No. 384.3. 2005 Abstract Viewer/Itinerary Planner. Washington, DC: Society for Neuroscience. Available online at http://sfn.scholarone.com/itin2005/main.html?new page id=126&abstract id=5526&p num=384.3&is tec.
Florian, R. V. (2005). A reinforcement learning algorithm for spiking neural networks. In D. Zaharie, D. Petcu, V. Negru, T. Jebelean, G. Ciobanu, A. Cicortaş, A. Abraham, & M. Paprzycki (Eds.), Proceedings of the Seventh International Symposium on Symbolic and Numeric Algorithms for Scientific Computing (pp. 299–306). Los Alamitos, CA: IEEE Computer Society.
Florian, R. V., & Mureşan, R. C. (2006). Phase precession and recession with STDP and anti-STDP. In S. Kollias, A. Stafylopatis, W. Duch, & E. Oja (Eds.), Artificial Neural Networks–ICANN 2006. 16th International Conference, Athens, Greece, September 10–14, 2006. Proceedings, Part I. Berlin: Springer.
Froemke, R. C., & Dan, Y. (2002). Spike-timing-dependent synaptic modification induced by natural spike trains. Nature, 416, 433–438.
Ganguly, K., Kiss, L., & Poo, M. (2000). Enhancement of presynaptic neuronal excitability by correlated presynaptic and postsynaptic spiking. Nature Neuroscience, 3(10), 1018–1026.
Gerstner, W. (2001). A framework for spiking neuron models: The spike response model. In F. Moss & S. Gielen (Eds.), The handbook of biological physics (Vol. 4, pp. 469–516). Amsterdam: Elsevier Science.
Gerstner, W., & Kistler, W. M. (2002). Spiking neuron models. Cambridge: Cambridge University Press.
Gütig, R., Aharonov, R., Rotter, S., & Sompolinsky, H. (2003). Learning input correlations through nonlinear temporally asymmetric Hebbian plasticity. Journal of Neuroscience, 23(9), 3697–3714.
Han, V. Z., Grant, K., & Bell, C. C. (2000). Reversible associative depression and nonassociative potentiation at a parallel fiber synapse. Neuron, 27, 611–622.
Hopfield, J. J., & Brody, C. D. (2004). Learning rules and network repair in spike-timing-based computation networks. Proceedings of the National Academy of Sciences, 101(1), 337–342.
Huang, Y.-Y., Simpson, E., Kellendonk, C., & Kandel, E. R. (2004). Genetic evidence for the bidirectional modulation of synaptic plasticity in the prefrontal cortex by D1 receptors. Proceedings of the National Academy of Sciences, 101(9), 3236–3241.
Kempter, R., Gerstner, W., & van Hemmen, J. L. (1999). Hebbian learning and spiking neurons. Physical Review E, 59(4), 4498–4514.
Kempter, R., Gerstner, W., & van Hemmen, J. L. (2001). Intrinsic stabilization of output rates by spike-based Hebbian learning. Neural Computation, 13, 2709–2742.
Legenstein, R., Naeger, C., & Maass, W. (2005). What can a neuron learn with spike-timing-dependent plasticity? Neural Computation, 17, 2337–2382.
Li, C., Lu, J., Wu, C., Duan, S., & Poo, M. (2004). Bidirectional modification of presynaptic neuronal excitability accompanying spike timing-dependent synaptic plasticity. Neuron, 41, 257–268.
Lin, Y., Min, M., Chiu, T., & Yang, H. (2003). Enhancement of associative long-term potentiation by activation of beta-adrenergic receptors at CA1 synapses in rat hippocampal slices. Journal of Neuroscience, 23(10), 4173–4181.
Marbach, P., & Tsitsiklis, J. N. (1999). Simulation-based optimization of Markov reward processes: Implementation issues. In Proceedings of the 38th Conference on Decision and Control. Piscataway, NJ: IEEE.
Marbach, P., & Tsitsiklis, J. N. (2000). Approximate gradient methods in policy-space optimization of Markov reward processes. Discrete Event Dynamic Systems: Theory and Applications, 13(1–2), 111–148.
Markram, H., Lübke, J., Frotscher, M., & Sakmann, B. (1997). Regulation of synaptic efficacy by coincidence of postsynaptic APs and EPSPs. Science, 275(5297), 213–215.
Mazzoni, P., Andersen, R. A., & Jordan, M. I. (1991). A more biologically plausible learning rule for neural networks. Proceedings of the National Academy of Sciences of the USA, 88, 4433–4437.
Nick, T. A., & Ribera, A. B. (2000). Synaptic activity modulates presynaptic excitability. Nature Neuroscience, 3(2), 142–149.
Pfister, J.-P., Toyoizumi, T., Barber, D., & Gerstner, W. (2006). Optimal spike-timing-dependent plasticity for precise action potential firing in supervised learning. Neural Computation, 18(6), 1318–1348.
Pouget, A., Deffayet, C., & Sejnowski, T. J. (1995). Reinforcement learning predicts the site of plasticity for auditory remapping in the barn owl. In B. Fritzke, G. Tesauro, D. S. Touretzky, & T. K. Leen (Eds.), Advances in neural information processing systems, 7. Cambridge, MA: MIT Press.
Press, W. H., Teukolsky, S. A., Vetterling, W. T., & Flannery, B. P. (1992). Numerical recipes in C: The art of scientific computing (2nd ed.). Cambridge: Cambridge University Press.
Rao, R. P. N., & Sejnowski, T. J. (2001). Spike-timing-dependent Hebbian plasticity as temporal difference learning. Neural Computation, 13(10), 2221–2237.
Roberts, P. (1999). Computational consequences of temporally asymmetric learning rules: I. Differential Hebbian learning. Journal of Computational Neuroscience, 7, 235–246.
Roberts, P. D., & Bell, C. C. (2002). Spike timing dependent synaptic plasticity in biological systems. Biological Cybernetics, 87(5–6), 392–403.
Schultz, W. (2002). Getting formal with dopamine and reward. Neuron, 36, 241–263.
Seamans, J. K., & Yang, C. R. (2004). The principal features and mechanisms of dopamine modulation in the prefrontal cortex. Progress in Neurobiology, 74, 1–57.
Seung, H. S. (2003). Learning in spiking neural networks by reinforcement of stochastic synaptic transmission. Neuron, 40(6), 1063–1073.
Song, S., Miller, K. D., & Abbott, L. F. (2000). Competitive Hebbian learning through spike-timing-dependent synaptic plasticity. Nature Neuroscience, 3, 919–926.
Soula, H., Alwan, A., & Beslon, G. (2004). Obstacle avoidance learning in a spiking neural network. Poster presented at Last Minute Results of Simulation of Adaptive Behavior, Los Angeles, CA. Available online at http://www.koredump.org/hed/abstract sab2004.pdf.
Soula, H., Alwan, A., & Beslon, G. (2005). Learning at the edge of chaos: Temporal coupling of spiking neuron controller of autonomous robotic. In Proceedings of AAAI Spring Symposia on Developmental Robotics. Menlo Park, CA: AAAI Press.
Stassinopoulos, D., & Bak, P. (1995). Democratic reinforcement: A principle for brain function. Physical Review E, 51, 5033–5039.
Stassinopoulos, D., & Bak, P. (1996). Democratic reinforcement: Learning via self-organization. Available online at http://arxiv.org/abs/cond-mat/9601113.
Strösslin, T., & Gerstner, W. (2003). Reinforcement learning in continuous state and action space. Available online at http://lenpe7.epfl.ch/~stroessl/publications/StrosslinGe03.pdf.
Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3, 9–44.
Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. Cambridge, MA: MIT Press.
Takita, K., & Hagiwara, M. (2002). A pulse neural network learning algorithm for POMDP environment. In Proceedings of the 2002 International Joint Conference on Neural Networks (IJCNN 2002) (pp. 1643–1648). Piscataway, NJ: IEEE.
Takita, K., & Hagiwara, M. (2005). A pulse neural network reinforcement learning algorithm for partially observable Markov decision processes. Systems and Computers in Japan, 36(3), 42–52.
Takita, K., Osana, Y., & Hagiwara, M. (2001). Reinforcement learning algorithm with network extension for pulse neural network. Transactions of the Institute of Electrical Engineers of Japan, 121-C(10), 1634–1640.
Thiel, C. M., Friston, K. J., & Dolan, R. J. (2002). Cholinergic modulation of experience-dependent plasticity in human auditory cortex. Neuron, 35, 567–574.
Toyoizumi, T., Pfister, J.-P., Aihara, K., & Gerstner, W. (2005). Spike-timing dependent plasticity and mutual information maximization for a spiking neuron model. In L. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing systems, 17 (pp. 1409–1416). Cambridge, MA: MIT Press.
Turrigiano, G. G., & Nelson, S. B. (2004). Homeostatic plasticity in the developing nervous system. Nature Reviews Neuroscience, 5, 97–107.
Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8, 229–256.
Xie, X., & Seung, H. S. (2000). Spike-based learning rules and stabilization of persistent neural activity. In S. A. Solla, T. K. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems, 12. Cambridge, MA: MIT Press.
Xie, X., & Seung, H. S. (2004). Learning in neural networks by reinforcement of irregular spiking. Physical Review E, 69, 041909. Zhang, W., & Linden, D. J. (2003). The other side of the engram: Experience-driven changes in neuronal intrinsic excitability. Nature Reviews Neuroscience, 4, 885–900.
Received June 14, 2005; accepted September 6, 2006.
LETTER
Communicated by Robert Kass
A Method for Selecting the Bin Size of a Time Histogram

Hideaki Shimazaki
[email protected]

Shigeru Shinomoto
[email protected]
Department of Physics, Kyoto University, Kyoto 606-8502, Japan
The time histogram method is the most basic tool for capturing a time-dependent rate of neuronal spikes. Generally in the neurophysiological literature, the bin size that critically determines the goodness of the fit of the time histogram to the underlying spike rate has been subjectively selected by individual researchers. Here, we propose a method for objectively selecting the bin size from the spike count statistics alone, so that the resulting bar or line graph time histogram best represents the unknown underlying spike rate. For a small number of spike sequences generated from a modestly fluctuating rate, the optimal bin size may diverge, indicating that any time histogram is likely to capture a spurious rate. Given a paucity of data, the method presented here can nevertheless suggest how many experimental trials should be added in order to obtain a meaningful time-dependent histogram with the required accuracy.

1 Introduction

Neurophysiological studies are based on the idea that information is transmitted between cortical neurons by spikes (Johnson, 1996; Dayan & Abbott, 2001). A number of filtering algorithms have been proposed for estimating the instantaneous activity of an individual neuron or the joint activity of multiple neurons (DiMatteo, Genovese, & Kass, 2001; Wiener & Richmond, 2002; Sanger, 2002; Kass, Ventura, & Cai, 2003; Brockwell, Rojas, & Kass, 2004; Kass, Ventura, & Brown, 2005; Brown, Kass, & Mitra, 2004). The most basic and frequently used tool for spike rate estimation is the time histogram method. For instance, one aligns spike sequences at the onset of stimuli repeatedly applied to an animal and describes the response of a single neuron with a peristimulus time histogram (PSTH) or the responses of multiple neurons with a joint PSTH (Adrian, 1928; Gerstein & Kiang, 1960; Gerstein & Perkel, 1969; Abeles, 1982). The shape of a PSTH is largely dependent on the choice of the bin size. With a bin size that is too large, one cannot represent the time-dependent spike rate. And with a bin size that is too small, the time histogram
fluctuates greatly, and one cannot discern the underlying spike rate. There is an appropriate bin size for each set of spike sequences, which is based on the goodness of the fit of the PSTH to the underlying spike rate. For most previously published PSTHs, however, the bin size has been subjectively selected by the authors.

For data points distributed compactly, there are classical theories about how the optimal bin size scales with the total number of data points n. It was proven that the optimal bin size scales as n^(−1/3) with regard to the bar-graph density estimator (Révész, 1968; Scott, 1979). It was recently found that for two types of infinitely long spike sequences, whose rates fluctuate either smoothly or jaggedly, the optimal bin sizes exhibit different scaling relations with respect to the number of sequences, timescale, and amplitude of rate modulation (Koyama & Shinomoto, 2004). Though interesting, the scaling relations are valid only for a large amount of data and are of limited use in selecting a bin size.

We devised a method of selecting the bin size of a time histogram from the spike data. In the course of our study, we realized that a theory on the empirical choice of the histogram bin size for a probability density function was presented by Rudemo (1982). Although applicable to a Poisson point process, this theory appears to have rarely been used by neurophysiologists in analyses of PSTHs. In the actual procedure of neurophysiological experiments, the number of trials (spike sequences) plays an important role in determining the resolution of a PSTH and thus in designing experiments. Therefore, it is preferable to have a theory that accords with the common protocol of neurophysiological experiments in which a stimulus is repeated to extract a signal from a neuron. Given a set of experimental data, we wish not only to determine the optimal bin size, but also to estimate how many more experimental trials should be performed in order to obtain a resolution we deem sufficient. For a small number of spike sequences derived from a modestly fluctuating rate, the estimated optimal bin size may diverge, implying that by constructing a PSTH, one is likely to obtain spurious results for the spike rate estimation (Koyama & Shinomoto, 2004). Because a shortage of data underlies this divergence, one can carry out more experiments to obtain a reliable rate estimation. Our method can suggest how many sequences should be added in order to obtain a meaningful time histogram with the required accuracy. As an application of this method, we also show that the scaling relations of the optimal bin size that appear for a large number of spike sequences can be examined from a relatively small amount of data. The degree of smoothness of an underlying rate process can be estimated by this method. In addition to a bar graph (piecewise constant) time histogram, we also designed a method for creating a line graph (piecewise linear) time histogram, which is superior to a bar graph in the goodness of the fit to the underlying spike rate and in comparing multiple responses to different stimulus conditions.
These empirical methods for the bin size selection for a bar and a line graph histogram, the estimation of the number of sequences required for the histogram, and the estimation of the scaling exponents of the optimal bin size were corroborated by theoretical analysis derived for a generic stochastic rate process. In the next section, we develop the bar graph (peristimulus) time histogram (Bar-PSTH) method, which is the most frequently used PSTH. In the appendix, we develop the line graph (peristimulus) time histogram (Line-PSTH) method.

2 Optimization of the Bar Graph Time Histogram

We consider sequences of spikes repeatedly recorded from a single neuron under identical experimental conditions. A recent analysis revealed that in vivo spike trains are not simply random, but possess interspike interval distributions intrinsic and specific to individual neurons (Shinomoto, Shima, & Tanji, 2003; Shinomoto, Miyazaki, Tamura, & Fujita, 2005). However, spikes accumulated from a large number of spike trains are in the majority mutually independent and can be regarded as being derived from a time-dependent Poisson point process (Snyder, 1975; Daley & Vere-Jones, 1988; Kass et al., 2005). It would be natural to assess the goodness of the fit of the estimator λ̂_t to the underlying spike rate λ_t over the total observation period T by the mean integrated squared error (MISE),

MISE ≡ (1/T) ∫₀ᵀ E[(λ̂_t − λ_t)²] dt,   (2.1)
where E refers to the expectation over different realizations of point events, given λ_t. We begin with a bar graph time histogram as λ̂_t and explore a method to select the bin size that minimizes the MISE. The difficulty of the problem comes from the fact that the underlying spike rate λ_t is not known. A bar graph time histogram is constructed simply by counting the number of spikes that belong to each bin of width Δ. For an observation period T, we obtain N = T/Δ intervals. The number of spikes accumulated from all n sequences in the ith interval is counted as kᵢ. The bar height at the ith bin is given as kᵢ/(nΔ). Figure 1 shows the schematic diagram for the construction of a bar graph time histogram. Given a bin of width Δ, the expected height of a bar graph for t ∈ [0, Δ] is the time-averaged rate,

θ = (1/Δ) ∫₀^Δ λ_t dt.   (2.2)
Figure 1: Bar-PSTH. (A) An underlying spike rate, λ_t. The horizontal bars indicate the time-averaged rates θ for individual bins of width Δ. (B) Sequences of spikes derived from the underlying rate. (C) A time histogram for the sample sequences of spikes. The estimated rate θ̂ is the total number of spikes k that entered each bin, divided by the number of sequences n and the bin size Δ.
The total number of spikes k from n spike sequences that enter this bin of width Δ obeys the Poisson distribution,

p(k | nθΔ) = ((nθΔ)ᵏ / k!) e^(−nθΔ),   (2.3)

whose expectation is nθΔ. The unbiased estimator for θ is denoted as θ̂ = k/(nΔ), which is the empirical height of the bar graph for t ∈ [0, Δ].

2.1 Selection of the Bin Size. By segmenting the total observation period T into N intervals of size Δ, the MISE defined in equation 2.1 can be rewritten as

MISE = (1/Δ) ∫₀^Δ (1/N) Σᵢ₌₁ᴺ E[(θ̂ᵢ − λ_{t+(i−1)Δ})²] dt,   (2.4)
where θ̂ᵢ ≡ kᵢ/(nΔ). Hereafter, we denote the average over the segmented rate λ_{t+(i−1)Δ} as an average ⟨·⟩ over an ensemble of (segmented) rate functions {λ_t} defined on an interval of t ∈ [0, Δ], as

MISE = (1/Δ) ∫₀^Δ ⟨E[(θ̂ − λ_t)²]⟩ dt.   (2.5)
The expectation E refers to the average over the spike count, or θ̂ = k/(nΔ), given a rate function λ_t, or its mean value θ. The MISE can be decomposed into two parts:

MISE = ⟨E[(θ̂ − θ)²]⟩ + (1/Δ) ∫₀^Δ ⟨(λ_t − θ)²⟩ dt.   (2.6)
The first and second terms are, respectively, the stochastic fluctuation of the estimator θ̂ around the expected mean rate θ and the averaged temporal fluctuation of λ_t around its mean θ over an interval of length Δ. The second term of equation 2.6 can be decomposed further into two parts:

(1/Δ) ∫₀^Δ ⟨(λ_t − ⟨θ⟩ + ⟨θ⟩ − θ)²⟩ dt = (1/Δ) ∫₀^Δ ⟨(λ_t − ⟨θ⟩)²⟩ dt − ⟨(θ − ⟨θ⟩)²⟩.   (2.7)

The first term in equation 2.7 represents the mean squared fluctuation of the underlying rate λ_t from the mean rate ⟨θ⟩ and is independent of the bin size Δ, because

(1/Δ) ∫₀^Δ ⟨(λ_t − ⟨θ⟩)²⟩ dt = (1/T) ∫₀ᵀ (λ_t − ⟨θ⟩)² dt.   (2.8)
We define a cost function by subtracting this term from the original MISE,

C_n(Δ) ≡ MISE − (1/T) ∫₀ᵀ (λ_t − ⟨θ⟩)² dt
       = ⟨E[(θ̂ − θ)²]⟩ − ⟨(θ − ⟨θ⟩)²⟩.   (2.9)

This cost function corresponds to the risk function in Rudemo (1982, equation 2.3), obtained by direct decomposition of the MISE. The second term in equation 2.9 represents the temporal fluctuation of the expected mean rate θ over individual intervals of length Δ. As the expected mean rate θ is not an observable quantity, we have to replace the fluctuation of the expected mean rate with that of the observable estimator θ̂. Using the decomposition
rule for an unbiased estimator (E θ̂ = θ),

E[(θ̂ − ⟨E θ̂⟩)²] = ⟨E[(θ̂ − θ)²]⟩ + ⟨(θ − ⟨θ⟩)²⟩,   (2.10)

the cost function is transformed into

C_n(Δ) = 2⟨E[(θ̂ − θ)²]⟩ − E[(θ̂ − ⟨E θ̂⟩)²].   (2.11)

Due to the assumed Poisson nature of the point process, the number of spikes k counted in each bin obeys a Poisson distribution: the variance of k is equal to the mean. For the estimated rate defined as θ̂ = k/(nΔ), this variance-mean relation corresponds to

E[(θ̂ − θ)²] = (1/(nΔ)) E θ̂.   (2.12)

By incorporating equation 2.12 into equation 2.11, the cost function is given as a function of the estimator θ̂,

C_n(Δ) = (2/(nΔ)) ⟨E θ̂⟩ − ⟨E[(θ̂ − ⟨E θ̂⟩)²]⟩.   (2.13)

The optimal bin size is obtained by minimizing the cost function C_n(Δ), as

Δ* ≡ arg min_Δ C_n(Δ).   (2.14)

By replacing the expectation with the sample spike counts, the cost function (see equation 2.13) is converted into this useful recipe:

Algorithm 1: A Method for Bin Size Selection for a Bar-PSTH
i. Divide the observation period T into N bins of width Δ, and count the number of spikes kᵢ from all n sequences that enter the ith bin.
ii. Construct the mean and variance of the numbers of spikes {kᵢ} as
   k̄ ≡ (1/N) Σᵢ₌₁ᴺ kᵢ,  and  v ≡ (1/N) Σᵢ₌₁ᴺ (kᵢ − k̄)².
iii. Compute the cost function
   C_n(Δ) = (2k̄ − v) / (nΔ)².
iv. Repeat i through iii while changing the bin size Δ, to search for the Δ* that minimizes C_n(Δ).
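Algorithm 1 translates directly into a few lines of code. The sketch below is a minimal Python rendering of steps i through iv for spike times pooled across trials; the function names, the synthetic example, and the candidate grid are our own choices, not part of the method:

```python
import numpy as np

def bar_psth_cost(spike_times, n_trials, T, delta):
    """Cost function C_n(delta) of algorithm 1 (assumes T is a multiple of delta)."""
    edges = np.arange(0.0, T + delta / 2, delta)
    k, _ = np.histogram(spike_times, bins=edges)  # pooled counts per bin (step i)
    k_mean = k.mean()                             # step ii: sample mean
    v = k.var()                                   # step ii: biased sample variance
    return (2.0 * k_mean - v) / (n_trials * delta) ** 2   # step iii

def optimal_bin_size(spike_times, n_trials, T, candidates):
    costs = [bar_psth_cost(spike_times, n_trials, T, d) for d in candidates]
    return candidates[int(np.argmin(costs))]      # step iv

# Example with synthetic data: spikes pooled over 20 trials of 10 s each.
rng = np.random.default_rng(1)
spikes = rng.uniform(0.0, 10.0, size=3000)
best = optimal_bin_size(spikes, 20, 10.0, np.linspace(0.01, 1.0, 100))
```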
2.2 Extrapolation of the Cost Function. With the method developed in the preceding section, we can determine the optimal bin size for a given set of experimental data. In this section, we develop a method to estimate how the optimal bin size decreases when more experimental trials are added to the data set: given n sequences, the method provides the cost function for m (≠ n) sequences. Assume that we are in possession of n spike sequences. The fluctuation of the expected mean rate ⟨(θ − ⟨θ⟩)²⟩ in equation 2.10 is replaced with the empirical fluctuation of the time histogram θ̂_n, using the decomposition rule for the unbiased estimator θ̂_n satisfying E θ̂_n = θ,

E[(θ̂_n − ⟨E θ̂_n⟩)²] = ⟨E[(θ̂_n − θ)²]⟩ + ⟨(θ − ⟨θ⟩)²⟩.   (2.15)

The expected cost function for m sequences can thus be obtained by substituting the above equation into equation 2.9, yielding

C_m(Δ|n) = ⟨E[(θ̂_m − θ)²]⟩ + ⟨E[(θ̂_n − θ)²]⟩ − E[(θ̂_n − ⟨E θ̂_n⟩)²].   (2.16)

Using the variance-mean relation for a Poisson distribution, equation 2.12, and

⟨E[(θ̂_m − θ)²]⟩ = (1/(mΔ)) ⟨E θ̂_m⟩ = (1/(mΔ)) ⟨E θ̂_n⟩,   (2.17)

we obtain

C_m(Δ|n) = (1/m − 1/n) (1/Δ) ⟨E θ̂_n⟩ + C_n(Δ),   (2.18)
where C_n(Δ) is the original cost function (see equation 2.13) computed using the estimators θ̂_n. By replacing the expectations with sample spike count averages, the cost function for m sequences can be extrapolated with this formula, using the sample mean k̄ and variance v of the numbers of spikes, given n sequences and the bin size Δ. The extrapolation method is summarized as algorithm 2:

Algorithm 2: A Method for Extrapolating the Cost Function for a Bar-PSTH
A. Construct the extrapolated cost function,
   C_m(Δ|n) = (1/m − 1/n) k̄/(nΔ²) + C_n(Δ),
   using the sample mean k̄ and variance v of the numbers of spikes obtained from n sequences of spikes. C_n(Δ) is the cost function computed for n sequences of spikes with algorithm 1.
B. Search for the Δ*_m that minimizes C_m(Δ|n).
C. Repeat A and B while changing m, and plot 1/Δ*_m versus 1/m to search for the critical value 1/m = 1/n̂_c above which 1/Δ*_m practically vanishes.

It may come to pass that the original cost function C_n(Δ) computed for n spike sequences does not have a minimum, or has a minimum at a bin size comparable to the observation period T. Because of the paucity of data, one may consider carrying out more experiments to obtain a reliable rate estimation. The critical number of sequences n_c above which the cost function has a finite optimal bin size Δ* may be estimated in the following manner. With a large Δ, the cost function can be expanded as

C_n(Δ) ∼ μ (1/n − 1/n_c)(1/Δ) + u (1/Δ²),   (2.19)
where we have introduced n_c and u, which are independent of n. The optimal bin size undergoes a phase transition from a vanishing 1/Δ* for n < n_c to a finite 1/Δ* for n > n_c. The inverse optimal bin size is expanded in the vicinity of n_c as 1/Δ* ∝ (1/n − 1/n_c). We can estimate the critical value n̂_c (see Figure 3) by applying this asymptotic relation to the set of Δ*_m estimated from C_m(Δ|n) for various values of m:

1/Δ*_m ∝ (1/m − 1/n̂_c).   (2.20)

The minimum number of sequences required for the construction of a Bar-PSTH is estimated from n̂_c. It should be noted that there are cases in which the optimal bin size exhibits a discontinuous divergence from a finite value. Even in such cases, the plot of {1/m, 1/Δ*} can be useful in exploring the discontinuous transition from nonvanishing values of 1/Δ* to practically vanishing values. In the opposite extreme, with a sufficiently large number of spike sequences, our method selects a small bin size. It is known that the optimal bin size exhibits a power law scaling with respect to the number of sequences n. The exponent of the scaling relation depends on the smoothness of the underlying rate (Koyama & Shinomoto, 2004). Given a large number of spike sequences, the method for extrapolating the cost function can also be used to estimate the optimal bin size for a larger number of spike sequences m > n, and further to estimate the scaling exponent representing the smoothness of the underlying rate.
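As a minimal sketch of algorithm 2 and of the critical-number estimate of equation 2.20 (function names and interfaces are our own; k_mean and cost_n are the sample mean and the algorithm-1 cost computed from the available n sequences):

```python
import numpy as np

def extrapolated_cost(m, n, delta, k_mean, cost_n):
    # Step A: C_m(delta | n) = (1/m - 1/n) * kbar / (n * delta**2) + C_n(delta).
    return (1.0 / m - 1.0 / n) * k_mean / (n * delta ** 2) + cost_n

def critical_number(ms, inv_opt_bins):
    """Step C: fit 1/delta*_m against 1/m where the optimal bin size is finite
    and locate the zero crossing, which gives 1/n_c (equation 2.20)."""
    inv_m = 1.0 / np.asarray(ms, dtype=float)
    y = np.asarray(inv_opt_bins, dtype=float)
    mask = y > 0
    slope, intercept = np.polyfit(inv_m[mask], y[mask], 1)
    return -slope / intercept  # n_c estimate: 1 / (zero crossing in 1/m)
```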
2.3 Theoretical Analysis on the Optimal Bin Size. To verify the empirical methods, we obtain the theoretical cost function of a Bar-PSTH directly from a process with a known underlying rate. Note that this theoretical cost function is not available in real experimental conditions, in which the underlying rate is not known. The estimator θ̂ ≡ k/(nΔ) is a uniformly minimum variance unbiased estimator (UMVUE) of θ, which achieves the lower bound of the Cramér-Rao inequality (Blahut, 1987; Cover & Thomas, 1991),

E[(θ̂ − θ)²] = −[ Σ_{k=0}^∞ p(k | nθΔ) ∂²log p(k | nθΔ)/∂θ² ]⁻¹ = θ/(nΔ).   (2.21)
Inserting this into equation 2.9, the cost function is represented as

C_n(Δ) = ⟨θ⟩/(nΔ) − ⟨(θ − ⟨θ⟩)²⟩
       = μ/(nΔ) − (1/Δ²) ∫₀^Δ ∫₀^Δ φ(t₁ − t₂) dt₁ dt₂,   (2.22)

where μ = ⟨θ⟩ is the mean rate, and φ(t₁ − t₂) = ⟨(λ_{t₁} − μ)(λ_{t₂} − μ)⟩ is the autocorrelation function of the rate fluctuation, λ_t − μ. To obtain the last equation, we used

⟨(θ − ⟨θ⟩)²⟩ = ⟨[(1/Δ) ∫₀^Δ (λ_t − μ) dt]²⟩
            = (1/Δ²) ∫₀^Δ ∫₀^Δ ⟨(λ_{t₁} − μ)(λ_{t₂} − μ)⟩ dt₁ dt₂.   (2.23)
The cost function with a large bin size can be rewritten as

C_n(Δ) = μ/(nΔ) − (1/Δ²) ∫₋Δ^Δ (Δ − |t|) φ(t) dt
       ∼ (1/Δ) [ μ/n − ∫₋∞^∞ φ(t) dt ] + (1/Δ²) ∫₋∞^∞ |t| φ(t) dt,   (2.24)

based on the symmetry φ(t) = φ(−t) for a stationary process. Equation 2.24 can be identified with equation 2.19 with parameters

n_c = μ / ∫₋∞^∞ φ(t) dt   (2.25)

u = ∫₋∞^∞ |t| φ(t) dt.   (2.26)
For the process giving a correlation of the form φ(t) = σ²e^(−|t|/τ) and the process giving a gaussian correlation φ(t) = σ²e^(−t²/τ²), used in the simulations in the next section, the critical numbers are obtained as n_c = μ/(2σ²τ) and n_c = μ/(σ²τ√π), respectively. Based on the same theoretical cost function, equation 2.22, we can also derive scaling relations of the optimal bin size, which are reached with a large number of spike sequences. With a small Δ, the correlation of the rate fluctuation φ(t) can be expanded as φ(t) = φ(0) + φ′(0⁺)|t| + (1/2)φ″(0)t² + O(|t|³), and we obtain an expansion of equation 2.22 with respect to Δ,

C_n(Δ) = μ/(nΔ) − φ(0) − (1/3)φ′(0⁺)Δ − (1/12)φ″(0)Δ² + O(Δ³).   (2.27)
6µ ∼ − φ (0) n ∗
1/3 .
(2.28)
For a rate that fluctuates in a zigzag pattern, in which the correlation of rate fluctuation has a cusp at t = 0 (φ (0+ ) < 0), we obtain the scaling relation ∗ ∼ −
3µ φ (0+ )n
1/2 ,
(2.29)
by ignoring the second-order term. Examples that exhibit this type of scaling are the Ornstein-Uhlenbeck process and the random telegraph process (Kubo, Toda, & Hashitsume, 1985; Gardiner, 1985). The scaling relations, equation 2.28 and equation 2.29, are the generalization of the ones found in Koyama and Shinomoto (2004) for two specific time-dependent Poisson processes. 3 Application of the Method to Spike Data In this section, we apply the methods for the bar graph (peristimulus) time histogram (Bar-PSTH) and the line graph (peristimulus) time histogram (Line-PSTH) to (1) spike sequences (point events) generated by the simulations of time-dependent Poisson processes and (2) spike sequences recorded
The Bin Size of a Time Histogram
1513
from a neuron in area MT. The methods for a Line-PSTH are summarized in the appendix. 3.1 Application to Simulation Data. We applied the method for estimating the optimal bin size of a Bar-PSTH to a set of spike sequences derived from a time-dependent Poisson process. Figure 2A shows the empirical cost function Cn () computed from a set of n spike sequences. The empirical cost function is in good agreement with the theoretical cost function computed from the mean spike rate µ and the correlation φ(t) of the rate fluctuation, according to equation 2.22. In Figure 2D, a time histogram with the optimal bin size is compared with those with nonoptimal bin sizes, demonstrating the effectiveness of optimizing the bin size. Algorithm 2 provides an estimate of how many sequences should be added for the construction of a Bar-PSTH with the resolution we deem sufficient. In this method, the cost function of m sequences is obtained by modifying the original cost function, Cn (), computed from the spike count statistics of n sequences of spikes. In subsequent applications of this extrapolation method, the original cost function, Cn (), was obtained by averaging over the initial partitioning positions. We applied this extrapolation method for a Bar-PSTH to a set of spike sequences derived from the smoothly regulated Poisson process. Figure 3A depicts the extrapolated cost function Cm (|n) for several values of m, computed from a given set of n(= 30) spike sequences. Figure 3B represents the dependence of an inverse optimal bin size 1/∗m on 1/m. The inverse optimal bin size, 1/∗m , stays near 0 for 1/m > 1/nˆ c and departs from 0 linearly with m for 1/m ≤ 1/ nˆ c . By fitting the linear function, we estimated the critical number of sequences nˆ c . In Figure 3C, the critical number of sequences nˆ c estimated from a smaller or larger n is compared to the theoretical value of nc computed from equation 2.25. It is noteworthy that the estimated nˆ c approximates the theoretical nc well even from a fairly small number of spike sequences, with which the estimated optimal bin size diverges. We also constructed a method for selecting the bin size of a Line-PSTH, which is summarized as algorithm 3 in the appendix. Figure 4 compares the optimal Bar-PSTH and the optimal Line-PSTH obtained from the same set of spike sequences, demonstrating the superiority of the Line-PSTH to the Bar-PSTH in the sense of the MISE. In addition, the Line-PSTH is suitable for comparing multiple time-dependent rates, as is the case for filtering methods. The extrapolation method for a Line-PSTH is summarized as algorithm 4 in the appendix. With the extrapolation method for a Bar-PSTH (see algorithm 2), one can also estimate how much the optimal bin size decreases (i.e., the resolution increases) with the number of spike sequences. Figure 5A shows log-log plots of the optimal bin sizes ∗m versus m with respect to two rate processes that fluctuate either smoothly according to the stochastic process characterized by the mean spike rate µ and the correlation of the rate
Figure 2: Construction of the optimal Bar-PSTH from spike sequences. (A) Dots: An empirical cost function, $C_n(\Delta)$, computed from spike data according to algorithm 1. Here, 50 sequences of spikes for an interval of 20 [s] are derived from a time-dependent Poisson process characterized by the mean rate $\mu$ and the correlation of the rate fluctuation $\phi(t) = \sigma^2 e^{-t^2/\tau^2}$, with $\mu = 30$ [spikes/s], $\sigma = 10$ [spikes/s], and $\tau = 0.1$ [s]. Solid line: The theoretical cost function computed directly from the underlying fluctuating rate using equation 2.22. (B) The underlying fluctuating rate $\lambda_t$. (C) Spike sequences derived from the rate. (D) Time histograms made using three types of bin sizes: too small, optimal, and too large.
With the extrapolation method for a Bar-PSTH (see algorithm 2), one can also estimate how much the optimal bin size decreases (i.e., the resolution increases) with the number of spike sequences. Figure 5A shows log-log plots of the optimal bin sizes $\Delta^*_m$ versus $m$ for two rate processes that fluctuate either smoothly, according to the stochastic process characterized by the mean spike rate $\mu$ and the correlation of the rate process $\phi(t) = \sigma^2 e^{-t^2/\tau^2}$, or in a zigzag pattern, according to the Ornstein-Uhlenbeck process characterized by the mean rate $\mu$ and the correlation of the rate process $\phi(t) = \sigma^2 e^{-|t|/\tau}$. These plots exhibit power law scaling relations with distinctly different scaling exponents. The estimated exponents ($-0.34 \pm 0.04$ for the smooth rate process and $-0.56 \pm 0.04$ for the zigzag rate process) are close to the exponents of $m^{-1/3}$ and $m^{-1/2}$ that were obtained analytically as in equations 2.28 and 2.29 (see also Koyama & Shinomoto, 2004). In this way, we can estimate the degree of smoothness of the underlying rate from a reasonable amount of spike data. With the extrapolation method for a Line-PSTH (see algorithm 4 in the appendix), the scaling relations for a Line-PSTH can be examined in a similar manner.
Figure 3: Extrapolation method. (A) Extrapolated cost functions $C_m(\Delta|n)$ of a Bar-PSTH for several values of $m$, computed from 30 sequences of spikes derived from a Poisson process characterized by the mean rate $\mu$ and the correlation of the rate fluctuation $\phi(t) = \sigma^2 e^{-t^2/\tau^2}$, with $\mu = 30$, $\sigma = 2$, and $\tau = 0.1$. The modestly fluctuating rate with $\sigma = 2$ is used, as compared with the rate with $\sigma = 10$ used in Figure 1. (B) The inverse optimal bin size $1/\Delta^*_m$ plotted against $1/m$. The solid line is a linear function fitted to the data. (C) The critical number of sequences $\hat{n}_c$ extrapolated from a smaller or larger number of spike sequences. The horizontal axis represents the number of spike sequences $n$ used to obtain the extrapolated cost function $C_m(\Delta|n)$. The vertical axis represents the extrapolated critical number $\hat{n}_c$. The dashed lines represent the theoretical value $n_c$ computed using equation 2.25 and a diagonal.
Figure 5B represents the optimal bin sizes computed for two rate processes that fluctuate either smoothly or in a zigzag pattern. The estimated exponents are $-0.24 \pm 0.04$ for the smooth rate process and $-0.50 \pm 0.05$ for the zigzag rate process. The exponents obtained by the extrapolation method are similar to the analytically obtained exponents, $m^{-1/5}$ and $m^{-1/2}$, respectively (see equations A.12 and A.13). Note that for the smoothly fluctuating rate process, the scaling relation for the Line-PSTH is $m^{-1/5}$, whereas the scaling relation for a Bar-PSTH is $m^{-1/3}$. In contrast, for the rate process that fluctuates jaggedly, the exponents of the scaling relations for both a Bar-PSTH and a Line-PSTH are $m^{-1/2}$.
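For readers who wish to reproduce this diagnosis of smoothness, a minimal sketch of the exponent fit (variable names are ours):

```python
import numpy as np

def scaling_exponent(ms, optimal_deltas):
    """Slope of log(Delta*_m) versus log(m) by least squares.

    A slope near -1/3 (Bar-PSTH) or -1/5 (Line-PSTH) suggests a smooth
    underlying rate; a slope near -1/2 suggests a jagged (zigzag) one.
    """
    slope, _intercept = np.polyfit(np.log(ms), np.log(optimal_deltas), 1)
    return slope
```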
Figure 4: Comparison of the optimal Bar-PSTH and the optimal Line-PSTH. The spike data used are the same as the data used in Figure 2. (A) The optimal Bar-PSTH constructed from the spike sequences derived from an underlying fluctuating rate process (dashed line). (B) The optimal Line-PSTH constructed from the same set of spike sequences.
Figure 5: Scaling relations for the Bar-PSTH and the Line-PSTH. The optimal bin sizes $\Delta^*_m$ are estimated from the extrapolated cost function $C_m(\Delta|n)$, computed from 100 sequences of spikes. The optimal bin size $\Delta^*_m$ versus $m$ is shown in log-log scale. Two types of rate processes were examined: one that fluctuates smoothly, characterized by the correlation of the rate fluctuation $\phi(t) = \sigma^2 e^{-t^2/\tau^2}$, and the other that fluctuates jaggedly, characterized by the correlation of the rate fluctuation $\phi(t) = \sigma^2 e^{-|t|/\tau}$. The parameter values are the same as the ones used in Figure 2. (A) Log-log plots of the optimal bin size with respect to $m$ for a Bar-PSTH. Lines are fitted to the data in an interval of $m \in [50, 500]$; their regression coefficients, $-0.34 \pm 0.04$ and $-0.56 \pm 0.04$, correspond to the scaling exponents for a Bar-PSTH. (B) Log-log plots of the optimal bin size with respect to $m$ for a Line-PSTH. Lines are fitted to the data in an interval of $m \in [50, 500]$; their regression coefficients, $-0.24 \pm 0.04$ and $-0.50 \pm 0.05$, correspond to the scaling exponents for a Line-PSTH.
3.2 Application to Experimental Data. We also applied our method for optimizing a Bar-PSTH to publicly available neuronal spike data (Britten, Shadlen, Newsome, & Movshon, 2004). The details of the experimental methods are described in Newsome, Britten, and Movshon (1989) and Britten, Shadlen, Newsome, and Movshon (1992). It was reported that neurons in area MT exhibit highly reproducible temporal responses to identical visual stimuli (Bair & Koch, 1996). The estimation of a time-dependent rate by a PSTH is, however, sensitive to the choice of the bin size. Therefore, it would be preferable for such an appraisal of reproducibility to be tested with our objective method of optimizing the bin size. To make a reliable estimate of the optimal bin size from spike sequences of a short observation period, we took an average of the cost functions computed under different partitioning positions.

We examine here the data recorded from a neuron under the repeated application of a random dot visual stimulus with 3.2% coherent motion (w052, nsa2004.1, Britten et al., 2004). The method for a Bar-PSTH was applied to the data, with close attention paid to how the optimal bin size changes with the number of spike sequences sampled. Figure 6 depicts the results for the first $n = 5$, the first $n = 20$, and the total $n = 50$ sequences of spikes. The optimal bin size practically diverges ($\Delta^* = 1000$ [ms]) for the $n = 5$ sequences, implying that with this small amount of data, a discussion about the time-dependent response does not make sense, as long as we rely on the histogram method. Even at this stage, it is possible to extrapolate the cost function by means of algorithm 2. The critical number of trials, above which the optimal bin size is finite, was estimated as $\hat{n}_c \approx 12$. By increasing the number of sequences to $n = 20$, we obtained the optimal bin size $\Delta^* = 33$ [ms], implying that a discussion about the neuronal response is possible based on this amount of data. The critical number of trials estimated from this data set ($n = 20$) is $\hat{n}_c \approx 10$. The optimal bin size for the total 50 sequences is $\Delta^* = 23$ [ms]. The critical number of sequences estimated from the total 50 sequences is $\hat{n}_c \approx 12$. Algorithm 1 confirms that the temporal rate modulation can be discussed with a sufficiently large number of sequences, such as $n = 20$ or $n = 50$. With algorithm 2, three sets of sequences with $n = 5$, 20, and 50 suggest the critical number of sequences to be $\hat{n}_c \approx 10$–12.

4 Discussion

We have developed a method for selecting the bin size, so that the Bar-PSTH (see section 2) or the Line-PSTH (see the appendix) best represents the (unknown) underlying spike rate. The suitability of this method was demonstrated by applying it not only to model spike sequences generated by time-dependent Poisson processes, but also to real spike sequences recorded from cortical area MT of a monkey.
Figure 6: The Bar-PSTHs for the spike sequences recorded from an MT neuron (w052 in nsa2004.1; Britten et al., 2004). Left: The cost function (averaged over different partitioning positions). Right: The spike sequences sampled and the optimal Bar-PSTH for the first 1 [s] of the 2 [s] recording period. (A) The first 5 spike sequences. (B) The first 20 spike sequences, which include the 5 sequences of A. (C) The complete 50 spike sequences, which include the 20 sequences of B.
For a small number of spike sequences derived from a modestly fluctuating rate, the cost function does not have a minimum ($C_n(T) < C_n(\Delta)$ for any $\Delta < T$), or has a minimum at a bin size comparable to the observation period $T$, implying that the time-dependent rate cannot be captured by means of the time histogram method. Our method can nevertheless extrapolate the cost function for any number of spike sequences and suggest how many sequences should be added in order to obtain a meaningful time histogram with the resolution we require. The model data and real data illustrated that the optimal bin size may diverge ($\Delta^* = O(T)$), and that even under such conditions, our extrapolation method works reasonably well.
In this study, we adopted the mean integrated squared error (MISE) as the measure of the goodness of fit of a PSTH to the underlying spike rate. There have been studies of density estimation based on other criteria, such as the Kullback-Leibler divergence (Hall, 1990) and the Hellinger distance (Kanazawa, 1993). To our knowledge, however, there are as yet no practical algorithms based on these criteria that optimize the bin size solely from raw data. It would be interesting to explore practical methods based on other criteria for estimating the time-dependent rate and to compare them with the MISE criterion.

Use of a regular (equal) bin size is a constraint particular to our method. The regular histogram is suitable for rates whose fluctuation is statistically stationary. In practical applications, however, the rate does not necessarily fluctuate in a stationary fashion but can change abruptly during a localized period of time. In such a situation, a histogram with variable bin width can fit the underlying rate better than a regular histogram can. In order to make a better estimate of the underlying rate in the sense of the MISE, it is desirable to develop an adaptive method that adjusts the bin size over time. It is also desirable to develop the optimization method for the Bar-PSTH and the Line-PSTH into a method for higher-order spline fittings that can be compared with filtering methods. Nevertheless, as long as Bar-PSTHs or Line-PSTHs are used as conventional rate-estimation tools, our method for selecting the bin size should be used for their construction.

Appendix: Optimization of the Line Graph Time Histogram

A line graph can be constructed by simply connecting the top centers of adjacent bar graphs of height $k_i/(n\Delta)$. Figure 7 schematically shows the construction of a line graph time histogram. For the same set of spike sequences, the optimal bin size (window size) of a Line-PSTH is, however, different from that of a Bar-PSTH. We develop here a method for selecting the bin size for a Line-PSTH.

The expected heights of adjacent bar graphs for intervals of $[-\Delta, 0]$ and $[0, \Delta]$ are

$$\theta^- \equiv \frac{1}{\Delta} \int_{-\Delta}^{0} \lambda_t \, dt, \qquad \theta^+ \equiv \frac{1}{\Delta} \int_{0}^{\Delta} \lambda_t \, dt.$$

The expected line graph $L_t$ in an interval of $[-\Delta/2, \Delta/2]$ is a line connecting the top centers of these bar graphs,

$$L_t = \frac{\theta^+ + \theta^-}{2} + \frac{\theta^+ - \theta^-}{\Delta} t. \tag{A.1}$$
Figure 7: Line-PSTH. (A) An underlying spike rate, $\lambda_t$. The horizontal bars indicate the original bar graphs $\theta^-$ and $\theta^+$ for adjacent intervals $[-\Delta, 0]$ and $[0, \Delta]$. The expected line graph $L_t$ is a line connecting the two top centers of adjacent expected bar graphs, $(-\Delta/2, \theta^-)$ and $(\Delta/2, \theta^+)$. (B) Sequences of spikes derived from the underlying rate. (C) A time histogram for the sample sequences of spikes. The empirical line graph $\hat{L}_t$ is a line connecting the two top centers of adjacent empirical bar graphs, $(-\Delta/2, \hat{\theta}^-)$ and $(\Delta/2, \hat{\theta}^+)$.
The unbiased estimators of the original bar graphs are $\hat{\theta}^- = k^-/(n\Delta)$ and $\hat{\theta}^+ = k^+/(n\Delta)$, where $k^-$ and $k^+$ are the numbers of spikes from the $n$ spike sequences that enter the intervals $[-\Delta, 0]$ and $[0, \Delta]$, respectively. The empirical line graph $\hat{L}_t$ in an interval of $[-\Delta/2, \Delta/2]$ is a line connecting the two top centers of adjacent empirical bar graphs,

$$\hat{L}_t = \frac{\hat{\theta}^+ + \hat{\theta}^-}{2} + \frac{\hat{\theta}^+ - \hat{\theta}^-}{\Delta} t. \tag{A.2}$$
A.1 Selection of the Bin Size. The time average of the MISE (see equation 2.1) is rewritten as the average over the $N$ segmented rates,

$$\mathrm{MISE} = \frac{1}{\Delta} \int_{-\Delta/2}^{\Delta/2} E(\hat{L}_t - \lambda_t)^2 \, dt. \tag{A.3}$$
The MISE can be decomposed into two parts:

$$\mathrm{MISE} = \frac{1}{\Delta} \int_{-\Delta/2}^{\Delta/2} E(\hat{L}_t - L_t)^2 \, dt + \frac{1}{\Delta} \int_{-\Delta/2}^{\Delta/2} (\lambda_t - L_t)^2 \, dt. \tag{A.4}$$
The first term of equation A.4 is the stochastic fluctuation of the empirical linear estimator $\hat{L}_t$ due to the stochastic point events, which can be computed as

$$\frac{1}{\Delta} \int_{-\Delta/2}^{\Delta/2} E(\hat{L}_t - L_t)^2 \, dt = \frac{2}{3} E(\hat{\theta}^+ - \theta^+)^2. \tag{A.5}$$

Here we have used the relations $E(\hat{\theta}^+ - \theta^+)(\hat{\theta}^- - \theta^-) = 0$ and $E(\hat{\theta}^+ - \theta^+)^2 = E(\hat{\theta}^- - \theta^-)^2$. The second term of equation A.4 is the temporal fluctuation of $\lambda_t$ around the expected linear estimator $L_t$. In what follows, an overline denotes the average over the $N$ segments, so that $\overline{\theta^-} = \overline{\theta^+} = \mu$. We expand the second term by inserting $\mu$ and obtain
$$\frac{1}{\Delta} \overline{\int_{-\Delta/2}^{\Delta/2} (\lambda_t - L_t)^2 \, dt} = \frac{1}{\Delta} \overline{\int_{-\Delta/2}^{\Delta/2} (\lambda_t - \mu)^2 \, dt} - \frac{2}{\Delta} \overline{\int_{-\Delta/2}^{\Delta/2} (\lambda_t - \mu)(L_t - \mu) \, dt} + \frac{1}{\Delta} \overline{\int_{-\Delta/2}^{\Delta/2} (L_t - \mu)^2 \, dt}$$

$$= \frac{1}{\Delta} \overline{\int_{-\Delta/2}^{\Delta/2} (\lambda_t - \mu)^2 \, dt} - 2\, \overline{(\theta^+ - \mu)(\theta^0 - \mu)} - 2\, \overline{(\theta^+ - \mu)\theta^*} + \frac{2}{3} \overline{(\theta^+ - \mu)^2} + \frac{1}{3} \overline{(\theta^+ - \mu)(\theta^- - \mu)}, \tag{A.6}$$

where

$$\theta^0 \equiv \frac{1}{\Delta} \int_{-\Delta/2}^{\Delta/2} \lambda_t \, dt, \qquad \theta^* \equiv \frac{2}{\Delta^2} \int_{-\Delta/2}^{\Delta/2} t \lambda_t \, dt.$$
To obtain the above equation, we have used the relations $\overline{\theta^-} = \overline{\theta^+} = \overline{\theta^0} = \mu$, $\overline{\theta^*} = 0$, $\overline{(\theta^+ - \mu)^2} = \overline{(\theta^- - \mu)^2}$, $\overline{(\theta^+ - \mu)(\theta^0 - \mu)} = \overline{(\theta^- - \mu)(\theta^0 - \mu)}$, and $\overline{(\theta^+ - \mu)\theta^*} = -\overline{(\theta^- - \mu)\theta^*}$. The first term of equation A.6 represents the mean squared fluctuation of the underlying rate and is independent of the choice of the bin size (see equation 2.8). We introduce the cost function by subtracting the variance of the underlying rate from the original MISE:

$$C_n(\Delta) \equiv \mathrm{MISE} - \frac{1}{T} \int_0^T (\lambda_t - \mu)^2 \, dt = \frac{2}{3} E(\hat{\theta}^+ - \theta^+)^2 - 2\, \overline{(\theta^+ - \mu)(\theta^0 - \mu)} - 2\, \overline{(\theta^+ - \mu)\theta^*} + \frac{2}{3} \overline{(\theta^+ - \mu)^2} + \frac{1}{3} \overline{(\theta^+ - \mu)(\theta^- - \mu)}. \tag{A.7}$$

The first term of equation A.7 can be estimated from the data, using the variance-mean relation, equation 2.12. The last four terms of equation A.7 are the covariances of the expected rates $\theta^-$, $\theta^+$, $\theta^0$, and $\theta^*$, which are not observables. We can estimate them by using the covariance decomposition rule for unbiased estimators:

$$\overline{(\theta^+ - \overline{\theta^+})(\theta^p - \overline{\theta^p})} = \overline{E[(\hat{\theta}^+ - E\hat{\theta}^+)(\hat{\theta}^p - E\hat{\theta}^p)]} - \overline{E[(\hat{\theta}^+ - \theta^+)(\hat{\theta}^p - \theta^p)]}, \tag{A.8}$$
where $p$ denotes $-$, $+$, $0$, or $*$. The first term of equation A.8 can be computed directly from the data, using the relations $E\hat{\theta}^- = E\hat{\theta}^+ = E\hat{\theta}^0 = \mu$ and $E\hat{\theta}^* = 0$. Unlike the Bar-PSTH, however, the mean-variance relation for the Poisson statistics is not directly applicable to the second term. We suggest estimating these covariance terms from multiple sample sequences, as in algorithm 3:

Algorithm 3: A Method for Bin Size Selection for a Line-PSTH

i. Divide the observation period $T$ into $N + 1$ bins of width $\Delta$. Count the number of spikes $k_i(j)$ that enter the $i$th bin from the $j$th sequence. Define $k_i^{(-)}(j) \equiv k_i(j)$ and $k_i^{(+)}(j) \equiv k_{i+1}(j)$ ($i = 1, 2, \ldots, N$). Divide the period $[\Delta/2, T - \Delta/2]$ into $N$ bins of width $\Delta$. Count the number of spikes $k_i^{(0)}(j)$ that enter the $i$th bin from the $j$th sequence. In each bin, compute $k_i^{(*)}(j) \equiv \sum_\ell 2 t_i^\ell(j)/\Delta$, where $t_i^\ell(j)$ is the time of the $\ell$th spike that enters the $i$th bin, measured from the center of each bin.

ii. Sum up $k_i^{(p)}(j)$ over all sequences: $k_i^{(p)} \equiv \sum_{j=1}^n k_i^{(p)}(j)$, where $p = \{-, +, 0, *\}$.
Average these spike counts with respect to all the bins: $\bar{k}^{(p)} \equiv \frac{1}{N}\sum_{i=1}^N k_i^{(p)}$.

Compute the covariances of $k_i^{(+)}$ and $k_i^{(p)}$ across bins:

$$c^{(+,p)} \equiv \frac{1}{N} \sum_{i=1}^N \left(k_i^{(+)} - \bar{k}^{(+)}\right)\left(k_i^{(p)} - \bar{k}^{(p)}\right).$$

Compute the covariances of $k_i^{(+)}(j)$ and $k_i^{(p)}(j)$ with respect to sequences and average over time:

$$\bar{c}^{(+,p)} \equiv \frac{1}{N} \sum_{i=1}^N \frac{1}{n-1} \sum_{j=1}^n \left(k_i^{(+)}(j) - \frac{k_i^{(+)}}{n}\right)\left(k_i^{(p)}(j) - \frac{k_i^{(p)}}{n}\right).$$

Finally, compute

$$\sigma^{(+,p)} \equiv \frac{c^{(+,p)}}{(n\Delta)^2} - \frac{\bar{c}^{(+,p)}}{n\Delta^2}.$$

iii. Compute the cost function:

$$C_n(\Delta) = \frac{2}{3} \frac{\bar{k}^{(+)}}{(n\Delta)^2} - 2\sigma^{(+,0)} - 2\sigma^{(+,*)} + \frac{2}{3}\sigma^{(+,+)} + \frac{1}{3}\sigma^{(+,-)}.$$

iv. Repeat i through iii while changing $\Delta$ to search for $\Delta^*$ that minimizes $C_n(\Delta)$.
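The following is a compact sketch of algorithm 3 as we read it (function and variable names are ours; the averaging over initial partitioning positions is again omitted):

```python
import numpy as np

def line_psth_cost(spike_trains, T, delta):
    """Cost C_n(D) of a Line-PSTH (a sketch of algorithm 3).

    spike_trains: list of n arrays of spike times in [0, T].
    """
    n = len(spike_trains)
    N = int(T // delta) - 1              # N + 1 bins of width D fit in [0, T]
    if N < 1 or n < 2:
        return np.inf

    # Step i: per-sequence counts k_i(j) for the N + 1 bins of width D.
    edges = delta * np.arange(N + 2)
    k = np.array([np.histogram(s, bins=edges)[0] for s in spike_trains])
    k_minus, k_plus = k[:, :N], k[:, 1:]                 # k^(-), k^(+)

    # Counts k^(0) and first moments k^(*) on bins shifted by D/2.
    edges0 = delta / 2 + delta * np.arange(N + 1)
    k0 = np.array([np.histogram(s, bins=edges0)[0] for s in spike_trains])
    centers = edges0[:-1] + delta / 2
    kstar = np.zeros((n, N))
    for j, s in enumerate(spike_trains):
        idx = np.digitize(s, edges0) - 1                 # shifted-bin index
        for i in range(N):
            kstar[j, i] = 2.0 * np.sum(s[idx == i] - centers[i]) / delta

    def sigma(kp):
        """Bias-corrected covariance sigma^(+,p) of steps ii and iii."""
        Kp, Kq = k_plus.sum(0), kp.sum(0)                # summed over sequences
        c = np.mean((Kp - Kp.mean()) * (Kq - Kq.mean()))
        cbar = np.mean(np.sum((k_plus - Kp / n) * (kp - Kq / n), axis=0)
                       / (n - 1))
        return c / (n * delta) ** 2 - cbar / (n * delta ** 2)

    kbar_plus = k_plus.sum(0).mean()
    return ((2.0 / 3.0) * kbar_plus / (n * delta) ** 2
            - 2.0 * sigma(k0) - 2.0 * sigma(kstar)
            + (2.0 / 3.0) * sigma(k_plus) + (1.0 / 3.0) * sigma(k_minus))
```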
A.2 Extrapolation of the Cost Function. As in the case of the Bar-PSTH, the cost function for any $m$ sequences of spike trains can be extrapolated using the variance-mean relation for the Poisson statistics,

$$C_m(\Delta|n) = \frac{2}{3}\left(\frac{1}{m} - \frac{1}{n}\right) \frac{E\hat{\theta}_n}{\Delta} + C_n(\Delta), \tag{A.9}$$
where $C_n(\Delta)$ is the original cost function (see equation A.7). With the original cost function $C_n(\Delta)$, we can easily estimate a cost function for $m$ sequences with algorithm 4:
Algorithm 4: A Method for Extrapolating the Cost Function for a Line-PSTH

A. Construct the extrapolated cost function,

$$C_m(\Delta|n) = \frac{2}{3}\left(\frac{1}{m} - \frac{1}{n}\right) \frac{\bar{k}^{(+)}}{n\Delta^2} + C_n(\Delta),$$

where $C_n(\Delta)$ is the cost function for the line graph time histogram computed for $n$ sequences of spikes with algorithm 3.

B. Search for $\Delta^*_m$ that minimizes $C_m(\Delta|n)$.

C. Repeat A and B while changing $m$, and plot $1/\Delta^*_m$ versus $1/m$ to search for the critical value $1/m = 1/\hat{n}_c$ above which $1/\Delta^*_m$ practically vanishes.

A.3 Scaling Relations of the Optimal Bin Sizes. We can obtain a theoretical cost function of a Line-PSTH directly from the mean spike rate $\mu$ and the correlation $\phi(t)$ of the rate fluctuation. According to the mean-variance relation based on the Cramér-Rao (in)equality (see equation 2.21), the cost function, equation A.7, is given by

$$C_n(\Delta) = \frac{2\mu}{3n\Delta} - \frac{2}{\Delta^2} \int_0^\Delta \int_{-\Delta/2}^{\Delta/2} \left(1 + \frac{2t_2}{\Delta}\right) \phi(t_1 - t_2)\, dt_2\, dt_1 + \frac{2}{3\Delta^2} \int_0^\Delta \int_0^\Delta \phi(t_1 - t_2)\, dt_1\, dt_2 + \frac{1}{3\Delta^2} \int_0^\Delta \int_{-\Delta}^0 \phi(t_1 - t_2)\, dt_2\, dt_1. \tag{A.10}$$
By expanding the correlation of the rate fluctuation,

$$\phi(t) = \phi(0) + \phi'(0^+)|t| + \frac{1}{2}\phi''(0)t^2 + \frac{1}{6}\phi'''(0^+)|t|^3 + \frac{1}{24}\phi''''(0)t^4 + O(|t|^5),$$

we obtain

$$C_n(\Delta) = \frac{2\mu}{3n\Delta} - \phi(0) - \frac{37}{144}\phi'(0^+)\Delta + \frac{181}{5760}\phi'''(0^+)\Delta^3 + \frac{49}{2880}\phi''''(0)\Delta^4 + O(\Delta^5). \tag{A.11}$$
Unlike the Bar-PSTH, the line graph successfully approximates the original rate to first order in $\Delta$, and therefore the $O(\Delta^2)$ term in the cost function vanishes for a Line-PSTH.
The optimal bin size $\Delta^*$ is obtained from $dC_n(\Delta)/d\Delta = 0$. For a rate process that fluctuates smoothly in time, the correlation function is a smooth function, resulting in $\phi'(0^+) = 0$ and $\phi'''(0^+) = 0$ due to the symmetry $\phi(t) = \phi(-t)$, and we obtain the scaling relation

$$\Delta^* \sim \left(\frac{1280\,\mu}{181\,\phi''''(0)\, n}\right)^{1/5}. \tag{A.12}$$
The exponent $-1/5$ for a Line-PSTH is different from the exponent $-1/3$ for a Bar-PSTH (see equation 2.28). If the correlation of the rate fluctuation has a cusp at $t = 0$ ($\phi'(0^+) < 0$), we obtain the scaling relation

$$\Delta^* \sim \left(-\frac{96\,\mu}{37\,\phi'(0^+)\, n}\right)^{1/2}, \tag{A.13}$$

by ignoring the higher-order terms. The exponent $-1/2$ for a Line-PSTH is the same as the exponent for a Bar-PSTH (see equation 2.29).
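As a quick worked example of equation A.13 (our arithmetic, using the zigzag-rate parameters of Figure 2): for the Ornstein-Uhlenbeck correlation $\phi(t) = \sigma^2 e^{-|t|/\tau}$, one has $\phi'(0^+) = -\sigma^2/\tau$, so

$$\Delta^* \sim \left(\frac{96\,\mu\,\tau}{37\,\sigma^2 n}\right)^{1/2} = \left(\frac{96 \times 30 \times 0.1}{37 \times 10^2 \times 50}\right)^{1/2} \approx 0.04\ \mathrm{[s]}$$

for $\mu = 30$ [spikes/s], $\sigma = 10$ [spikes/s], $\tau = 0.1$ [s], and $n = 50$ sequences.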
Acknowledgments

We thank Shinsuke Koyama, Takeaki Shimokawa, and Kensuke Arai for helpful discussions. We also acknowledge K. H. Britten, M. N. Shadlen, W. T. Newsome, and J. A. Movshon, who have made their data available to the public. This study is supported in part by Grants-in-Aid for Scientific Research to S.S. from the Ministry of Education, Culture, Sports, Science and Technology of Japan (16300068, 18020015) and the 21st Century COE Center for Diversity and Universality in Physics. H.S. is supported by the Research Fellowship of the Japan Society for the Promotion of Science for Young Scientists.

References

Abeles, M. (1982). Quantification, smoothing, and confidence limits for single-units' histograms. Journal of Neuroscience Methods, 5(4), 317–325.
Adrian, E. (1928). The basis of sensation: The action of the sense organs. New York: Norton.
Bair, W., & Koch, C. (1996). Temporal precision of spike trains in extrastriate cortex of the behaving macaque monkey. Neural Computation, 8(6), 1185–1202.
Blahut, R. (1987). Principles and practice of information theory. Reading, MA: Addison-Wesley.
Britten, K. H., Shadlen, M. N., Newsome, W. T., & Movshon, J. A. (1992). The analysis of visual motion: A comparison of neuronal and psychophysical performance. Journal of Neuroscience, 12(12), 4745–4765.
Britten, K. H., Shadlen, M. N., Newsome, W. T., & Movshon, J. A. (2004). Responses of single neurons in macaque MT/V5 as a function of motion coherence in stochastic dot stimuli. Neural Signal Archive, nsa2004.1. Available online at http://www.neuralsignal.org.
Brockwell, A. E., Rojas, A. L., & Kass, R. E. (2004). Recursive Bayesian decoding of motor cortical signals by particle filtering. Journal of Neurophysiology, 91(4), 1899–1907.
Brown, E. N., Kass, R. E., & Mitra, P. P. (2004). Multiple neural spike train data analysis: State-of-the-art and future challenges. Nature Neuroscience, 7(5), 456–461.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Daley, D., & Vere-Jones, D. (1988). An introduction to the theory of point processes. New York: Springer-Verlag.
Dayan, P., & Abbott, L. (2001). Theoretical neuroscience: Computational and mathematical modeling of neural systems. Cambridge, MA: MIT Press.
DiMatteo, I., Genovese, C. R., & Kass, R. E. (2001). Bayesian curve-fitting with free-knot splines. Biometrika, 88(4), 1055–1071.
Gardiner, C. (1985). Handbook of stochastic methods: For physics, chemistry and the natural sciences. Berlin: Springer-Verlag.
Gerstein, G. L., & Kiang, N. Y. S. (1960). An approach to the quantitative analysis of electrophysiological data from single neurons. Biophysical Journal, 1(1), 15–28.
Gerstein, G. L., & Perkel, D. H. (1969). Simultaneously recorded trains of action potentials: Analysis and functional interpretation. Science, 164(3881), 828–830.
Hall, P. (1990). Akaike's information criterion and Kullback-Leibler loss for histogram density estimation. Probability Theory and Related Fields, 85(4), 449–467.
Johnson, D. H. (1996). Point process models of single-neuron discharges. Journal of Computational Neuroscience, 3(4), 275–299.
Kanazawa, Y. (1993). Hellinger distance and Akaike information criterion for the histogram. Statistics and Probability Letters, 17(4), 293–298.
Kass, R. E., Ventura, V., & Brown, E. N. (2005). Statistical issues in the analysis of neuronal data. Journal of Neurophysiology, 94(1), 8–25.
Kass, R. E., Ventura, V., & Cai, C. (2003). Statistical smoothing of neuronal data. Network: Computation in Neural Systems, 14(1), 5–15.
Koyama, S., & Shinomoto, S. (2004). Histogram bin width selection for time-dependent Poisson processes. Journal of Physics A: Mathematical and General, 37(29), 7255–7265.
Kubo, R., Toda, M., & Hashitsume, N. (1985). Nonequilibrium statistical mechanics. Berlin: Springer-Verlag.
Newsome, W. T., Britten, K. H., & Movshon, J. A. (1989). Neuronal correlates of a perceptual decision. Nature, 341(6237), 52–54.
Révész, P. (1968). The laws of large numbers. New York: Academic Press.
Rudemo, M. (1982). Empirical choice of histograms and kernel density estimators. Scandinavian Journal of Statistics, 9(2), 65–78.
Sanger, T. D. (2002). Decoding neural spike trains: Calculating the probability that a spike train and an external signal are related. Journal of Neurophysiology, 87(3), 1659–1663.
Scott, D. W. (1979). Optimal and data-based histograms. Biometrika, 66(3), 605–610.
Shinomoto, S., Miyazaki, Y., Tamura, H., & Fujita, I. (2005). Regional and laminar differences in in vivo firing patterns of primate cortical neurons. Journal of Neurophysiology, 94(1), 567–575.
Shinomoto, S., Shima, K., & Tanji, J. (2003). Differences in spiking patterns among cortical neurons. Neural Computation, 15(12), 2823–2842.
Snyder, D. (1975). Random point processes. New York: Wiley.
Wiener, M. C., & Richmond, B. J. (2002). Model based decoding of spike trains. Biosystems, 67(1–3), 295–300.
Received November 17, 2005; accepted September 27, 2006.
LETTER
Communicated by Kiri Wagstaff
Penalized Probabilistic Clustering

Zhengdong Lu
[email protected]
Todd K. Leen
[email protected]
Department of Computer Science and Engineering, OGI School of Science and Engineering, Oregon Health and Science University, Beaverton, OR 97006, U.S.A.
While clustering is usually an unsupervised operation, there are circumstances in which we believe (with varying degrees of certainty) that items A and B should be assigned to the same cluster, while items A and C should not. We would like such pairwise relations to influence cluster assignments of out-of-sample data in a manner consistent with the prior knowledge expressed in the training set. Our starting point is probabilistic clustering based on gaussian mixture models (GMM) of the data distribution. We express clustering preferences in a prior distribution over assignments of data points to clusters. This prior penalizes cluster assignments according to the degree to which they violate the preferences. The model parameters are fit with the expectation-maximization (EM) algorithm. Our model provides a flexible framework that encompasses several other semisupervised clustering models as special cases. Experiments on artificial and real-world problems show that our model can consistently improve clustering results when pairwise relations are incorporated. The experiments also demonstrate the superiority of our model to other semisupervised clustering methods in handling noisy pairwise relations.

1 Introduction

While clustering is usually executed completely unsupervised, there are circumstances in which we have prior belief (with varying degrees of certainty) that pairs of samples should (or should not) be assigned to the same cluster. These pairwise relations are less informative than direct labeling of the samples, but are often considerably easier to obtain. Indeed, there are many occasions when the pairwise relations can be derived directly from expert knowledge or common sense.

Our interest in such problems was kindled when we tried to manually segment a satellite image by grouping small image clips from the image. It is often hard to assign the image clips to different "groups" since we do not know clearly the characteristics of each group or even how many classes
we should have. In contrast, it is much easier to compare two image clips and decide how much they look alike, and thus how likely it is that they belong in one cluster. Another interesting example is word sense disambiguation. Ambiguous words like plant tend to exhibit only one meaning within one discourse (Yarowsky, 1995). In other words, two occurrences of plant in the same discourse should probably be assigned to the same sense class. This fact is very useful in training unsupervised word sense disambiguation models. A third example is in information retrieval. Cohn, Caruana, and McCallum (2003) suggested that in creating a document taxonomy, the expert critique is often of the form "these two documents shouldn't be in the same cluster." A last example is continuity: neighboring pairs of samples in a time series or in an image are likely to belong to the same class of object, which is also a source of clustering preferences (Theiler & Gisler, 1997; Ambroise, Dang, & Govaert, 1997). We would like these preferences to be incorporated into the cluster structure so that the assignment of out-of-sample data to clusters captures the concepts that give rise to the preferences expressed in the training data.

Some work has been done on adapting traditional clustering methods, such as K-means, to incorporate pairwise relations (Wagstaff, Cardie, Rogers, & Schroedl, 2001; Basu, Bannerjee, & Mooney, 2002; Klein, Kamvar, & Manning, 2002). These models are based on hard clustering, and the clustering preferences are expressed as hard pairwise constraints that must be satisfied. Wagstaff (2002) and Basu et al. (2004) extended their models to deal with soft pairwise constraints, where each constraint is assigned a weight. The performance of these constrained K-means algorithms is often not satisfactory, largely due to the inability of K-means to model a nonspherical data distribution in each class. Shental, Bar-Hillel, Hertz, and Weinshall (2003) proposed a gaussian mixture model (GMM) for clustering that incorporates hard pairwise constraints. However, the model cannot be naturally generalized to soft constraints, which are appropriate when our knowledge amounts only to clustering preferences or carries significant uncertainty. Motivated in part to remedy this deficiency, Law, Topchy, and Jain (2004, 2005) proposed another GMM-based model to incorporate soft constraints. In their model, virtual groups are created for samples that are supposed to be in one class. The uncertainty information in pairwise relations is then expressed as the soft membership of samples in the virtual group. This modeling strategy is cumbersome when samples are shared by different virtual groups. Moreover, it cannot handle the prior knowledge that two samples are in different clusters. Other efforts to make use of pairwise relations include changing the metric in feature space in favor of the specified relations (Cohn et al., 2003; Xing, Ng, Jordan, & Russe, 2003) or combining metric learning with constrained clustering (Bilenko, Basu, & Mooney, 2004).

This letter is a detailed exposition and extension of our previously reported work (Lu & Leen, 2005). We propose a soft clustering algorithm based
on GMM that expresses clustering preferences (in the form of pairwise relations) in the prior probability on assignments of data points to clusters. Our algorithm naturally accommodates both hard constraints and soft preferences. In our framework, the preferences are expressed as a Bayesian prior probability that pairs of points should (or should not) be assigned to the same cluster. After training with the expectation-maximization (EM) algorithm, the information expressed as a prior on the cluster assignment of the training data is successfully encoded in the means, covariances, and cluster priors of the GMM. Hence, the model generalizes in a way consistent with the prior knowledge. We call the algorithm penalized probabilistic clustering (PPC). Experiments on artificial and real-world data sets demonstrate that PPC can consistently improve the clustering result by incorporating reliable prior knowledge.

The letter is organized as follows. In section 2, we introduce the model and give an artificial example for illustration. In section 3, we discuss the computational complexity and propose two approximation methods for cases where the computation is intractable. Section 4 gives a detailed analysis of the connection of PPC to several other semisupervised clustering models. In section 5, we present the experiments of PPC on a variety of problems, as well as the comparison of PPC to several other semisupervised clustering methods. Section 6 summarizes the letter and points out future research.

2 Model

Penalized probabilistic clustering (PPC) begins with a standard $M$-component GMM,

$$P(x|\Theta) = \sum_{k=1}^M \pi_k P(x|\theta_k),$$
with the parameter vector $\Theta = (\pi_1, \ldots, \pi_M, \theta_1, \ldots, \theta_M)$. Here, $\pi_k$ and $\theta_k$ are the prior probability and parameters of the $k$th gaussian component, respectively. We augment the data set $X = \{x_i\}, i = 1, \ldots, N$ with latent cluster assignments $Z = \{z(x_i)\}, i = 1, \ldots, N$ to form the familiar complete data $(X, Z)$. The complete data likelihood is

$$P(X, Z|\Theta) = P(X|Z, \Theta)P(Z|\Theta), \tag{2.1}$$

where $P(X|Z, \Theta)$ is the probability of $X$ conditioned on $Z$:

$$P(X|Z, \Theta) = \prod_{i=1}^N P(x_i|\theta_{z_i}). \tag{2.2}$$
2.1 Prior Distribution on Cluster Assignments. We incorporate our clustering preferences by manipulating the prior probability $P(Z|\Theta)$. In the standard gaussian mixture model, the prior distribution on cluster assignments $Z$ is trivial:

$$P(Z|\Theta) = \prod_{i=1}^N \pi_{z_i}. \tag{2.3}$$
We incorporate our clustering preferences through a weighting function $g(Z)$ that has large values when the assignment of data points to clusters $Z$ conforms to our preferences, and low values when $Z$ conflicts with our preferences. Hence, we write

$$P(Z|\Theta, G) \equiv \frac{\left(\prod_i \pi_{z_i}\right) g(Z)}{\sum_{Z'} \left(\prod_j \pi_{z'_j}\right) g(Z')} = \frac{1}{\Omega} \prod_i \pi_{z_i}\, g(Z), \tag{2.4}$$

where $\Omega = \sum_Z \left(\prod_j \pi_{z_j}\right) g(Z)$ is the normalization constant. The likelihood of the data, given a specific cluster assignment $Z$, is independent of the cluster assignment preferences, so the complete data likelihood is

$$P(X, Z|\Theta, G) = P(X|Z, \Theta)P(Z|\Theta, G). \tag{2.5}$$

From equations 2.1 through 2.5, the complete data likelihood is

$$P(X, Z|\Theta, G) = P(X|Z, \Theta)\, \frac{1}{\Omega} \prod_i \pi_{z_i}\, g(Z) = \frac{1}{\Omega} P(X, Z|\Theta)\, g(Z), \tag{2.6}$$
where $P(X, Z|\Theta)$ is the complete data likelihood for a standard GMM. To distinguish the penalized likelihood from the standard likelihood, we introduce the notation $P_s(\cdot)$ for the standard likelihood and $P_p(\cdot)$ for the penalized likelihood. The data likelihood is the sum of the complete data likelihood over all possible $Z$, that is, $L(X|\Theta) = P_p(X|\Theta, G) = \sum_Z P_p(X, Z|\Theta, G)$, which can be maximized with the EM algorithm. Once the model parameters are fit, we do soft clustering according to the posterior probabilities $P_s(k|x, \Theta)$ for new data. (Note that cluster assignment preferences are not expressed for the new data, only for the training data.)

2.2 Pairwise Relations. Pairwise relations provide a special case of the framework discussed above. We specify two types of pairwise relations:

- Link: Two samples should be assigned to the same cluster.
- Do-not-link: Two samples should be assigned to different clusters.
The weighting factor given to the cluster assignment configuration $Z$ is

$$g(Z) = \exp\left(\sum_{i \neq j} W^p_{ij}\, \delta(z_i, z_j)\right), \tag{2.7}$$

where $\delta$ is the Kronecker delta function and $W^p_{ij}$ is the weight associated with the sample pair $(x_i, x_j)$. This weight satisfies

$$W^p_{ij} \in (-\infty, \infty), \qquad W^p_{ij} = W^p_{ji}.$$

The weight $W^p_{ij}$ reflects our preference for assigning $x_i$ and $x_j$ to one cluster. We use a positive $W^p_{ij}$ when we prefer to assign $x_i$ and $x_j$ to one cluster (link) and a negative $W^p_{ij}$ when we prefer to assign them to different clusters (do-not-link). The absolute value $|W^p_{ij}|$ reflects the strength of the preference. The prior probability with the pairwise relations is

$$P(Z|\Theta, G) = \frac{1}{\Omega} \prod_i \pi_{z_i} \exp\left(\sum_{i \neq j} W^p_{ij}\, \delta(z_i, z_j)\right). \tag{2.8}$$
Note that $g(Z)$ changes asymmetrically under links and do-not-links: when a link is respected, $g(Z)$ increases; when a do-not-link is violated, $g(Z)$ decreases. Nevertheless, the prior probability given by equation 2.8 decreases under both types of violations.

The PPC model is clearly connected to the standard GMM and to the constrained clustering model proposed by Shental et al. (2003). We shall show that both models can be viewed as special cases of PPC with particular $W^p$. The connection between PPC and other semisupervised clustering models is less straightforward and will be discussed in section 4. If $W^p_{ij} = 0$, we have no prior knowledge on the assignment relevancy of $x_i$ and $x_j$. When $W^p_{ij} = 0$ for all pairs $(i, j)$, we have $g(Z) = 1$; hence, the complete likelihood reduces to the standard one:

$$P_p(X, Z|\Theta, G) = \frac{1}{\Omega} P_s(X, Z|\Theta)\, g(Z) = P_s(X, Z|\Theta). \tag{2.9}$$
In the other extreme with |Wi j | → ∞, assignments Z that violate the pairwise relations between xi and x j have zero prior probability, since for those assignments, k
Pp (Z|, G) = Z
πzk
l
πzl
i= j
p
exp(Wi j δ(zi , z j ))
m=n
p
exp(Wmn δ(zm , zn ))
→ 0.
Then the relations become hard constraints, while relations with $|W^p_{ij}| < \infty$ are called soft preferences. When all the specified pairwise relations are hard constraints, the data likelihood becomes

$$P_p(X, Z|\Theta, G) = \frac{1}{\Omega} \prod_{(i,j) \in \mathcal{L}} \delta(z_i, z_j) \prod_{(i,j) \in \mathcal{N}} (1 - \delta(z_i, z_j)) \prod_{i=1}^N \pi_{z_i} P_s(x_i|\theta_{z_i}), \tag{2.10}$$

where $\mathcal{L}$ is the set of linked sample pairs and $\mathcal{N}$ is the set of do-not-link sample pairs. It is straightforward to verify that equation 2.10 is essentially the same as the complete data likelihood given by Shental et al. (2003). Therefore, the model proposed by Shental et al. (2003) is equivalent to PPC with hard constraints. In appendix A, we give a detailed derivation of equation 2.10 and the equivalence of the two models. When only hard constraints are available, we simply implement PPC based on equation 2.10. In the remainder of this letter, we use $W^p$ to denote the prior knowledge on pairwise relations, that is,

$$P_p(X, Z|\Theta, G) \equiv P_p(X, Z|\Theta, W^p) = \frac{1}{\Omega} \exp\left(\sum_{i \neq j} W^p_{ij}\, \delta(z_i, z_j)\right) P_s(X, Z|\Theta). \tag{2.11}$$
2.3 Model Fitting. We use the EM algorithm (Dempster, Laird, & Rubin, 1977) to fit the model parameters $\Theta$:

$$\Theta^* = \arg\max_\Theta L(X|\Theta, W^p).$$

The expectation step (E-step) and maximization step (M-step) are

E-step: $Q(\Theta, \Theta^{(t-1)}) = E_{Z|X}\left(\log P_p(X, Z|\Theta, W^p) \,\middle|\, X, \Theta^{(t-1)}, W^p\right)$

M-step: $\Theta^{(t)} = \arg\max_\Theta Q(\Theta, \Theta^{(t-1)})$.
In the M-step, the optimal mean and covariance matrix of each component are

$$\mu_k = \frac{\sum_{j=1}^N x_j\, P_p(k|x_j, \Theta^{(t-1)}, W^p)}{\sum_{j=1}^N P_p(k|x_j, \Theta^{(t-1)}, W^p)},$$

$$\Sigma_k = \frac{\sum_{j=1}^N P_p(k|x_j, \Theta^{(t-1)}, W^p)(x_j - \mu_k)(x_j - \mu_k)^T}{\sum_{j=1}^N P_p(k|x_j, \Theta^{(t-1)}, W^p)}.$$
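In code, given the penalized posteriors of section 3, these updates are the familiar weighted GMM statistics (a minimal numpy sketch; names are ours):

```python
import numpy as np

def m_step(X, R):
    """M-step means and covariances given penalized posteriors.

    X: (N, d) data matrix; R: (N, M) responsibilities with
    R[j, k] = P_p(k | x_j, Theta^(t-1), W^p), computed clique-wise
    as described in section 3.
    """
    Nk = R.sum(axis=0)                       # effective counts per component
    mu = (R.T @ X) / Nk[:, None]             # weighted means
    covs = []
    for k in range(R.shape[1]):
        D = X - mu[k]
        covs.append((R[:, k, None] * D).T @ D / Nk[k])
    return mu, np.stack(covs)
```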
The update of the prior probability of each component is more difficult due to the normalizing constant in the data likelihood,

$$\Omega(\pi) = \sum_Z \prod_{k=1}^N \pi_{z_k} \exp\left(\sum_{i \neq j} W^p_{ij}\, \delta(z_i, z_j)\right). \tag{2.12}$$
We need to find

$$\pi \equiv \{\pi_1, \ldots, \pi_M\} = \arg\max_\pi \left[\sum_{l=1}^M \sum_{i=1}^N \log \pi_l \, P_p(l|x_i, \Theta^{(t-1)}, W^p) - \log \Omega(\pi)\right], \tag{2.13}$$

which, unfortunately, does not have a closed-form solution in general.¹ In this letter, we use a rather crude approximation of the optimal $\pi$ instead. First, we estimate the values of $\log \Omega(\pi)$ on a grid $H = \{\hat{\pi}^n\}$ on the simplex defined by

$$\sum_{k=1}^M \pi_k = 1, \qquad \pi_k \geq 0.$$
Then, in each M-step, we calculate the value of $\sum_{l=1}^M \sum_{i=1}^N \log \hat{\pi}^n_l \, P_p(l|x_i, \Theta^{(t-1)}, W^p)$ for each node $\hat{\pi}^n \in H$ and find the node $\hat{\pi}^*$ that maximizes the function defined in equation 2.13:

$$\hat{\pi}^* = \arg\max_{\hat{\pi}^n \in H} \left[\sum_{l=1}^M \sum_{i=1}^N \log \hat{\pi}^n_l \, P_p(l|x_i, \Theta^{(t-1)}, W^p) - \log \Omega(\hat{\pi}^n)\right]. \tag{2.14}$$
We use $\hat{\pi}^*$ as the approximate solution of equation 2.13. In this letter, the resolution of the grid is set to 0.01. Although this works very well for all the experiments in this letter, we note that the grid search will be fairly slow for $M > 5$. Shental, Bar-Hillel, Hertz, and Weinshall (2004) proposed finding the optimal $\pi$ using gradient descent and approximating $\Omega(\pi)$ by pretending all specified relations are nonoverlapping (see section 3.1). Although this method was originally designed for hard constraints, it can easily be adapted for PPC. This will not be covered in this letter.
¹ Shental et al. (2003) pointed out that with a different sampling assumption, a closed-form solution for equation 2.13 exists when only hard links are available. See section 4.
Figure 1: The influence of constraint weight on model fitting. (a) Artificial data set. (b) Links (solid lines) and do-not-links (dashed line). (c, d) The probability density contour of two possible fitted models.
It is critically important to note that with a nontrivial $W^p$, the assignment independence is broken,

$$P_p(z_i, z_j|x_i, x_j, \Theta, W^p) \neq P_p(z_i|x_i, \Theta, W^p)\, P_p(z_j|x_j, \Theta, W^p),$$

which means that the posterior estimation of each sample cannot be done separately. This fact brings an extra computational problem and will be discussed in section 3.

2.4 Selecting the Constraint Weights

2.4.1 Example: How the Weight $W_{ij}$ Affects Clustering. The weight matrix $W^p$ is crucial to the performance of PPC. Here we give an example demonstrating how the weight of pairwise relations affects the clustering process. Figure 1a shows two-dimensional data from two classes, as indicated by the symbols. Besides the data set, we also have 20 pairs correctly labeled as links and do-not-links, as shown in Figure 1b. We try to fit the data set with a two-component GMM. Figures 1c and 1d give the density contours of the two possible models on the data. Without any pairwise relations specified, we have essentially an equal chance of getting each. After incorporating pairwise relations, the EM optimization process is biased to
Figure 2: Contours of the probability density fit to the data with different weights given to the pairwise relations. Top row: $w = 0$; middle row: $w = 1.3$; bottom row: $w = 3$.
the correct one. The weights of the pairwise relations are given as follows,

$$W^p_{ij} = \begin{cases} w & \text{if } (x_i, x_j) \text{ is linked} \\ -w & \text{if } (x_i, x_j) \text{ is do-not-linked} \\ 0 & \text{otherwise,} \end{cases}$$

where $w \geq 0$ measures the certainty of all specified pairwise relations. In Figure 2, we give three runs with the same initial model parameters but different weights given to the specified pairwise relations. For each run, we give snapshots of the model after 1, 3, 5, and 20 EM iterations. The first row is the run with $w = 0$ (standard GMM). The search ends up with a model that violates our prior knowledge of class membership. The middle row is the run with $w$ set to 1.3. With the same poor initial condition, the model fitting process still goes to the wrong model, although at a slower pace. In the bottom row, we increase $w$ to 3. This time, the model converges to the one we intend.

2.4.2 Choosing the Weight $W^p$ Based on Prior Knowledge. On some occasions, we can translate our prior belief about the relations into the weight
$W^p$. Here we assume that the pairwise relations are labeled by an oracle but contaminated by flipping noise before they are delivered to us. For each labeled pair $(x_i, x_j)$, there is thus a certainty value $0.5 \leq \gamma_{ij} \leq 1$ equal to the probability that the pairwise relation is not flipped, that is, that the label is correct.² Our prior knowledge then comprises the specified pairwise relations and their certainty values $\Gamma = \{\gamma_{ij}\}$. This prior knowledge can be approximately encoded into the weight $W^p$ by letting

$$W^p_{ij} = \begin{cases} \frac{1}{2} \log\left(\frac{\gamma_{ij}}{1 - \gamma_{ij}}\right) & (x_i, x_j) \text{ is specified as linked} \\ -\frac{1}{2} \log\left(\frac{\gamma_{ij}}{1 - \gamma_{ij}}\right) & (x_i, x_j) \text{ is specified as do-not-linked} \\ 0 & \text{otherwise.} \end{cases} \tag{2.15}$$
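In code, equation 2.15 is a one-line mapping (a sketch; the function name is ours):

```python
import numpy as np

def weight_from_certainty(gamma, linked=True):
    """Equation 2.15: map a certainty gamma in (0.5, 1] to a weight W^p_ij.

    gamma -> 1 gives an infinite weight, i.e., a hard constraint.
    """
    w = 0.5 * np.log(gamma / (1.0 - gamma))
    return w if linked else -w
```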
The details of the derivation are in appendix B. It is obvious from equation 2.15 that for a specified pairwise relation $(x_i, x_j)$, the greater the certainty value $\gamma_{ij}$, the greater the absolute value of the weight $W^p_{ij}$. Note that the weight designed this way is not necessarily optimal in terms of classification accuracy, as will be demonstrated by experiment in section 5.1. The reason is twofold. First, equation 2.15 is derived based on a (possibly crude) approximation. Second, gaussian mixture models as classifiers are often considerably biased from the true class distribution of the data. As a result, even if the PPC prior $P(Z|\Theta, W^p)$ faithfully reflects the truth, it does not necessarily lead to the best classification accuracy. Nevertheless, equation 2.15 gives good initial guidance for choosing the weight. Our experiments in section 5.1 show that this design often yields classification accuracy superior to simply using hard constraints or ignoring the pairwise relations (standard GMM).

One use for this weighting scheme is when pairwise relations are labeled by domain experts and the certainty values are given at the same time. We might also estimate the flipping noise parameters from historical data or available statistics. For example, we can derive soft pairwise relations based on spatial or temporal continuity among samples. That is, we add soft links to all adjacent pairs of samples, with the flipping noise accounting for the adjacent pairs that are actually not in one class. We further assume that the flipping noise of each pair follows the same distribution. Accordingly, we assign a uniform weight $w > 0$ to all adjacent pairs. Let $q$ denote the probability that the label on an adjacent pair is flipped. We might be able to estimate $q$ from labeled instances of a similar problem, for example, segmented images or time series. The maximum likelihood (ML) estimate
² We consider only certainty values > 0.5, because a pairwise relation with certainty $\gamma_{ij} < 0.5$ can be equivalently treated as its opposite relation with certainty $1 - \gamma_{ij}$.
of $q$ is given by simple statistics:

$$\tilde{q} = \frac{\text{Number of adjacent pairs that are not in the same class}}{\text{Number of all adjacent pairs}}.$$
We give an application of this idea in section 5.2.

3 Computing the Cluster Posterior

The M-step requires the cluster membership posterior. Computing this posterior is simple for the standard GMM, since each data point $x_i$ can be assigned to a cluster independent of the other data points and we have the familiar cluster origin posterior $P_s(z_i = k|x_i, \Theta)$.

The pairwise constraints bring extra relevancy in assignment among the samples involved. From equation 2.11, if $W^p_{ij} \neq 0$, then $P_p(z_i, z_j|x_i, x_j, \Theta, W^p) \neq P_p(z_i|x_i, \Theta, W^p)\, P_p(z_j|x_j, \Theta, W^p)$. Consequently, the posterior probabilities of $x_i$ and $x_j$ cannot be estimated separately. This relevancy in the assignment can be formalized as follows:
Definition. If $W^p_{ij} \neq 0$, we say there is direct assignment relevancy between $x_i$ and $x_j$, denoted by $x_i R_d x_j$. If $P_p(z_i, z_j|x_i, x_j, \Theta, W^p) \neq P_p(z_i|x_i, \Theta, W^p)\, P_p(z_j|x_j, \Theta, W^p)$, we say there is assignment relevancy between $x_i$ and $x_j$, denoted by $x_i R_a x_j$.

It is clear that $R_a$ is reflexive, symmetric, and transitive. Hence, $R_a$ is an equivalence relation. It can be shown that $R_a$ is the transitive closure of $R_d$. In other words, two samples have the assignment relevancy relation $R_a$ if they can be connected by a path consisting of $R_d$ relations, as illustrated in Figure 3. We call each equivalence class associated with $R_a$ a clique. It is clear that cliques are the smallest sets of samples whose posterior probabilities can be calculated independently. When calculating posterior probabilities, all samples within a clique need to be considered together. In a clique $T$ with size $|T|$, the posterior probability of a given sample $x_i \in T$ is calculated by marginalizing the posterior over the entire clique,

$$P_p(z_i = k|X, \Theta, W^p) = \sum_{Z_T : z_i = k} P_p(Z_T|X_T, \Theta, W^p),$$

with the posterior on the clique given by

$$P_p(Z_T|X_T, \Theta, W^p) = \frac{P_p(Z_T, X_T|\Theta, W^p)}{P_p(X_T|\Theta, W^p)} = \frac{P_p(Z_T, X_T|\Theta, W^p)}{\sum_{Z_T} P_p(Z_T, X_T|\Theta, W^p)}.$$
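For a small clique, this marginalization can be written directly as a brute-force enumeration over the $M^{|T|}$ assignments (a sketch; for robustness one would work in log space throughout):

```python
import itertools
import numpy as np

def clique_posteriors(logp, W):
    """Exact marginals P_p(z_i = k | X_T, Theta, W^p) for one clique.

    logp: (|T|, M) array of log(pi_k * P_s(x_i | theta_k)) for clique
    members; W: (|T|, |T|) symmetric block of W^p within the clique.
    Cost is O(M^|T|), so this is only viable for small cliques.
    """
    Tsz, M = logp.shape
    marginals = np.zeros((Tsz, M))
    total = 0.0
    for Z in itertools.product(range(M), repeat=Tsz):
        logw = sum(logp[i, Z[i]] for i in range(Tsz))
        logw += sum(W[i, j] for i in range(Tsz) for j in range(Tsz)
                    if i != j and Z[i] == Z[j])
        w = np.exp(logw)                     # skip log-sum-exp for brevity
        total += w
        for i in range(Tsz):
            marginals[i, Z[i]] += w
    return marginals / total
```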
Figure 3: (a) Links (solid lines) and do-not-links (dotted lines) among six samples. (b) Direct assignment relevancy $R_d$ (solid lines) translated from the links in a. (c) Equivalence classes defined by assignment relevancy $R_a$, denoted by shading.
Figure 4: (a) Overlapping pairwise relations, with links (solid lines) and do-not-links (dotted lines). (b) Nonoverlapping pairwise relations. (c) Only hard links.
Exact calculation of the posterior probability of a sample in clique $T$ requires time complexity $O(M^{|T|})$, where $M$ is the number of components in the mixture model. This calculation can get prohibitively expensive if $|T|$ is large (e.g., 50) for any model size $M \geq 2$. Hence, small cliques are required to make the marginalization computationally reasonable.

3.1 Two Special Cases with Easy Inference. Clearly, the inference is easy when we limit ourselves to small cliques. Specifically, when $|T| \leq 2$, the pairwise relations are nonoverlapping, as illustrated in Figures 4a and 4b. With nonoverlapping constraints, the posterior probability for the whole data set can be given in closed form with $O(N)$ time complexity. Moreover, the evaluation of the normalization factor $\Omega(\pi)$ is simple:

$$\Omega(\pi) = \left(\sum_{k=1}^M \pi_k^2\right)^{N_L} \left(1 - \sum_{k=1}^M \pi_k^2\right)^{N_N},$$
where $N_L$ and $N_N$ are, respectively, the numbers of links and do-not-links. The optimization of $\pi$ in the M-step can thus be achieved with little cost. Sometimes nonoverlapping relations are a natural choice: they can be generated by picking sample pairs from the sample set and labeling the relations without replacement. More generally, we can avoid the expensive
computation in posterior inference by breaking large cliques into small ones. To do this, we need to deliberately ignore some links or do-not-links. In section 5.2, experiment 3 is an application of this idea.

The second simplifying situation is that in which we have only hard links ($W^p_{ij} = +\infty$ or 0), as illustrated in Figure 4c. In this case, the posterior probability for each sample must be exactly the same as for the others in the same clique, so a clique can be treated as a single sample. That is, assume $x_i$ is in clique $T$. We then have

$$P_p(z_i = k|x_i, \Theta, W^p) = P_p(Z_T = k|x_T, \Theta, W^p) = \frac{P_p(x_T, Z_T = k|\Theta, W^p)}{\sum_{k'} P_p(x_T, Z_T = k'|\Theta, W^p)} = \frac{P_s(x_T, Z_T = k|\Theta)}{\sum_{k'} P_s(x_T, Z_T = k'|\Theta)} = \frac{\pi_k \prod_{j \in T} P_s(x_j|\theta_k)}{\sum_{k'} \left(\pi_{k'} \prod_{j \in T} P_s(x_j|\theta_{k'})\right)}.$$

Similar ideas have been proposed independently by Wagstaff et al. (2001), Shental et al. (2003), and Bilenko et al. (2004). This case is useful when we are sure that a group of samples comes from one source (Shental et al., 2003).

For more general cases, where exact inference is computationally prohibitive, we propose to use Gibbs sampling (Neal, 1993) and the mean field approximation (Jaakkola, 2004) to estimate the posterior probability. This will be discussed in sections 3.2 and 3.3.

3.2 Estimation with Gibbs Sampling. For fixed $\Theta$, finding $P_p(Z|\Theta, W^p)$ is a typical inference problem for graphical models. Techniques for approximate inference developed for graphical models can also be used here. In this section, we use Gibbs sampling to estimate the posterior probability in each EM iteration. In Gibbs sampling, we estimate $P_p(z_i|X, \Theta, W^p)$ as a sample mean,
$$P_p(z_i = k|X, \Theta, W^p) = E(\delta(z_i, k)|X, \Theta, W^p) \approx \frac{1}{S} \sum_{t=1}^S \delta\left(z_i^{(t)}, k\right),$$

where the sum is over a sequence of $S$ samples from $P(Z|X, \Theta, G)$ generated by the Gibbs Markov chain. The $t$th sample in the sequence is generated by the usual Gibbs sampling technique:
- Pick $z_1^{(t)}$ from the distribution $P_p(z_1|z_2^{(t-1)}, z_3^{(t-1)}, \ldots, z_N^{(t-1)}, X, W^p, \Theta)$.
- Pick $z_2^{(t)}$ from the distribution $P_p(z_2|z_1^{(t)}, z_3^{(t-1)}, \ldots, z_N^{(t-1)}, X, W^p, \Theta)$.
- $\cdots$
- Pick $z_N^{(t)}$ from the distribution $P_p(z_N|z_1^{(t)}, z_2^{(t)}, \ldots, z_{N-1}^{(t)}, X, W^p, \Theta)$.
p exp(2Wi j δ zi , z j ) .
(3.1)
j∈U(i)
The time complexity of each Gibbs sampling pass is O(NnM), where n is the maximum number of pairwise relations a sample can be involved in. When W p is sparse, the size of U(i) is small. Thus, calculating Pp (zi |Z−i , X, , W p ) is fairly cheap, and Gibbs sampling can effectively estimate the posterior probability. 3.3 Estimation with Mean Field Approximation. Another approach to posterior estimation is to use mean field theory (Jaakkola, 2004; Lange, Law, Jain, & Buhmann, 2005). Instead of directly evaluating the intractable Pp (Z|X, , W), we try to find a tractable mean field approximation Q(Z). To find a Q(Z) close to the true posterior probability Pp (Z|X, , W), we minimize the Kullback-Leibler divergence between them, min KL(Q(Z)|Pp (Z|X, , W p )), Q
3.3 Estimation with Mean Field Approximation. Another approach to posterior estimation is to use mean field theory (Jaakkola, 2004; Lange, Law, Jain, & Buhmann, 2005). Instead of directly evaluating the intractable $P_p(Z|X, \Theta, W^p)$, we try to find a tractable mean field approximation $Q(Z)$. To find a $Q(Z)$ close to the true posterior probability $P_p(Z|X, \Theta, W^p)$, we minimize the Kullback-Leibler divergence between them,

$$\min_Q \mathrm{KL}\left(Q(Z)\,\|\,P_p(Z|X, \Theta, W^p)\right), \tag{3.2}$$

which can be recast as

$$\max_Q \left[H(Q) + E_Q\{\log P_p(Z|X, \Theta, W^p)\}\right], \tag{3.3}$$
where $E_Q\{\cdot\}$ denotes the expectation with respect to $Q$. The simplest family of variational distributions is one where all the latent variables $\{z_i\}$ are independent of each other:

$$Q(Z) = \prod_{i=1}^N Q_i(z_i). \tag{3.4}$$
With this Q(Z), the optimization problem in equation 3.3 does not have a closed-form solution and is not a convex problem. Instead, a locally optimal
$Q$ can be found iteratively with the following update equations,

$$Q_i(z_i) \leftarrow \frac{1}{\Omega_i} \exp\left(E_Q\{\log P_p(Z|X, \Theta, W^p)\,|\,z_i\}\right), \tag{3.5}$$

for all $i$ and $z_i \in \{1, 2, \ldots, M\}$. Here $\Omega_i = \sum_{z_i} \exp(E_Q\{\log P_p(Z|X, \Theta, W^p)|z_i\})$ is the local normalization constant. For the PPC model, we have

$$\exp\left(E_Q\{\log P_p(Z|X, \Theta, W^p)\,|\,z_i\}\right) = P_s(z_i|x_i, \Theta) \exp\left(\sum_{j \neq i} W^p_{ij}\, Q_j(z_i)\right).$$
Equation 3.5, taken collectively for all $i$, is the mean field equation. Evaluation of the mean field equations requires at most $O(NnM)$ time complexity, which is the same as the time complexity of one Gibbs sampling pass. Successive updates of equation 3.5 converge to a local optimum of equation 3.3. In our experiments, convergence usually occurs after about 20 iterations, which is many fewer than the number of passes required for Gibbs sampling.
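A corresponding sketch of the mean field iteration (dense $W^p$ with zero diagonal; names ours; we follow the mean field expression above, which carries no factor 2 on $W^p_{ij}$):

```python
import numpy as np

def mean_field(post, W, n_iter=20):
    """Iterate the mean field update 3.5 to a local optimum of 3.3.

    post: (N, M) standard posteriors P_s(z_i | x_i, Theta);
    W: (N, N) symmetric weight matrix W^p with zero diagonal.
    """
    Q = post.copy()
    logpost = np.log(post + 1e-300)
    for _ in range(n_iter):                  # ~20 iterations usually suffice
        for i in range(post.shape[0]):
            logits = logpost[i] + W[i] @ Q   # sum_j W_ij Q_j(z_i)
            Q[i] = np.exp(logits - logits.max())
            Q[i] /= Q[i].sum()
    return Q
```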
4 Related Models

Prior to our work, several authors proposed constrained clustering models based on K-means, including the seminal work by Wagstaff and colleagues (Wagstaff et al., 2001; Wagstaff, 2002) and its successors (Basu et al., 2002; Basu, Bilenko, & Mooney, 2004; Bilenko et al., 2004). These models generally fall into two classes. The first class of algorithms (Wagstaff et al., 2001; Basu et al., 2002) keeps the original K-means cost function (reconstruction error) while confining the cluster assignments to be consistent with the specified pairwise relations. The problem can be cast as the following constrained optimization problem,

$$\min_{Z, \mu} \sum_{i=1}^N \|x_i - \mu_{z_i}\|^2 \quad \text{subject to} \quad z_i = z_j \ \text{if } (x_i, x_j) \in \mathcal{L}, \qquad z_i \neq z_j \ \text{if } (x_i, x_j) \in \mathcal{N},$$

where $\mu = \{\mu_1, \ldots, \mu_M\}$ are the cluster centers. In the second class of algorithms, cluster assignments that violate the pairwise relations are allowed
but are penalized. These algorithms employ a modified cost function (Basu et al., 2004):

$$J(\mu, Z) = \frac{1}{2} \sum_{i=1}^N \|x_i - \mu_{z_i}\|^2 + \sum_{(i,j) \in \mathcal{L}} a_{ij}\, \mathbb{1}(z_i \neq z_j) + \sum_{(i,j) \in \mathcal{N}} b_{ij}\, \mathbb{1}(z_i = z_j), \tag{4.1}$$

where $a_{ij}$ is the penalty for violating the link between $(x_i, x_j)$, $b_{ij}$ is the penalty when the violated pairwise relation is a do-not-link, and $\mathbb{1}(\cdot)$ is the indicator function. It can be shown that both classes of algorithms are encompassed by PPC as special cases: the particular PPC model has spherical gaussian components whose radius shrinks to zero while the weight matrix $W^p$ grows appropriately. The details are in appendix C.

One weakness shared by the semisupervised K-means algorithms is the limited capability of K-means to model complex data distributions. If the data in one class are far from spherical, it may take a great number of pairwise relations to achieve reasonable classification accuracy (Wagstaff, 2002). Another serious problem lies in the optimization strategy employed by those algorithms to find the optimal assignment within each EM iteration. Due to the extra dependency brought by the pairwise relations, finding the optimal assignment of samples to clusters is not trivial. Evaluating every potential assignment requires $O(M^{|T|})$ time complexity, where $|T|$ denotes the size of the biggest clique, which is prohibitively expensive when $|T|$ is big. The greedy search used by these algorithms can return only local optima (Basu et al., 2002, 2004), and the sequential assignment strategy employed by Wagstaff et al. (2001) may lead to a situation where one cannot assign a sample to any cluster without conflicting with some already assigned samples.

To remedy the limited capability of constrained K-means, several authors proposed probabilistic models based on gaussian mixture models. The models proposed by Shental et al. (2003, 2004) address the situation where the pairwise relations are hard constraints. The authors partition the whole data set into a number of (maximal) "chunklets" consisting of samples that are (hard) linked to each other.³ Shental et al. (2003, 2004) discuss two sampling assumptions:
- Assumption 1: Chunklet $X_i$ is generated identically and independently from component $k$ with prior $\pi_k$ (Shental et al., 2004), and the complete data likelihood is

$$P(X, Z|\Theta, E) = \frac{1}{\Omega} \prod_{(i,j) \in \mathcal{N}} (1 - \delta(z_i, z_j)) \cdot \prod_{l=1}^L \left\{\pi_{z_l} \prod_{x_i \in X_l} P_s(x_i|\theta_{z_l})\right\}, \tag{4.2}$$

³ If a sample is not linked to any other samples, it comprises a chunklet by itself.
where E denotes the specified constraints. Assumption 2: Chunklet Xi generated from component k with prior |X | ∝ πk i , where |Xi | is the number of samples in Xi (Shental et al., 2004). The complete data likelihood is: P(X, Z|, E ) =
L 1 l| (1 − δ(zi , z j )) · {πz|X Ps (xi |θzl )} l i j∈N x ∈X l=1
i
l
(4.3) =
N 1 δ(zi , z j ) (1 − δ(zi , z j )) πzi Ps (xi | θzi ). i j∈L i j∈N i=1
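As a concrete illustration, the following minimal Python sketch (our own, not from Shental et al.) evaluates the complete-data log-likelihood of equation 4.4 up to the constant $\log(1/\mathcal{Z})$: an assignment that violates any hard link or do-not-link receives probability zero (log-likelihood $-\infty$); otherwise the standard GMM complete-data log-likelihood is returned. The function and argument names are illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal

def complete_loglik_hard(X, z, pi, mus, covs, links, notlinks):
    """Log of equation 4.4 up to the constant log(1/Z)."""
    if any(z[i] != z[j] for i, j in links):       # violated link
        return -np.inf
    if any(z[i] == z[j] for i, j in notlinks):    # violated do-not-link
        return -np.inf
    return sum(np.log(pi[z[n]]) +
               multivariate_normal.logpdf(X[n], mus[z[n]], covs[z[n]])
               for n in range(len(X)))
```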
In appendix A we show that under assumption 2, this model (as expressed in equations 4.3 and 4.4) is equivalent to PPC with only hard constraints (as expressed in equation 2.10). Shental et al. (2004) suggested that assumption 1 might be appropriate, for example, when chunklets are generated from temporal continuity; when pairwise relations are generated by labeling sample pairs picked from a data set, assumption 2 might be more reasonable. Assumption 1 allows a closed-form solution in the M-step (including the solution for π) of each EM iteration (Shental et al., 2004). An empirical comparison of the two sampling assumptions is given in section 5. To incorporate the uncertainty associated with pairwise relations, Law et al. (2004, 2005) proposed soft group constraints. To model a link between a sample pair $(x_i, x_j)$, they create a group l and express the strength of the link through the membership of $x_i$ and $x_j$ in group l. This strategy works well in some simple situations, for example, when the pairwise relations are nonoverlapping (as defined in section 3.1). However, it is awkward when samples are shared by multiple groups, which is unavoidable when samples are commonly involved in multiple relations. Another serious drawback of the group constraints model is its inability to model do-not-links. Due to these limitations, we omit an empirical comparison of this model to PPC in the following section.

5 Experiments

The experiments section consists of two parts. In section 5.1, we examine the way the number of constraints affects the clustering results. For each clustering task in this section, we generate artificial pairwise relations based on class labels. In section 5.2, we address real-world problems, where the constraints are derived from our prior knowledge. Also in this section,
we demonstrate approaches to reduce the computational complexity, as described in section 3. We use the following abbreviations: soft-PPC is PPC with soft constraints; hard-PPC is PPC with hard constraints (implemented from equation 2.10); soft-CKmeans is K-means with soft constraints (Basu et al., 2004); and hard-CKmeans is K-means with hard constraints (Wagstaff et al., 2001). The gaussian mixture model with hard constraints (Shental et al., 2003, 2004) is referred to as constrained-EM.

5.1 Artificial Constraints. In this section, we examine the influence of pairwise relations on PPC's clustering and compare the results to other semisupervised clustering models. This section presents two experiments. In experiment 1, we consider only correct pairwise relations, as an example of authentic knowledge; accordingly, we use hard constraints in clustering. In experiment 2, we consider the situation where the pairwise relations contain significant error. We evaluate the performance of soft-PPC and test the weight design strategy described in section 2.4, comparing the results to hard-PPC and other semisupervised clustering models.

5.1.1 Constraint Selection. To avoid the computational burden, we limit experiments 1 and 2 to nonoverlapping pairwise relations. As discussed in section 3.1, nonoverlapping pairwise relations, hard or soft, allow a fast solution of the maximization step in each EM iteration. The pairwise relations are generated as follows: we repeatedly pick two samples at random from the training set without replacement; if the two have the same class label, we add a link constraint between them, and otherwise we add a do-not-link constraint. Note that the application of PPC is not limited to the nonoverlapping case. In section 5.2, we discuss more complicated real-world problems where overlapping constraints are necessary, and we also present approaches to solve the resulting computational problems.

5.1.2 Performance Evaluation. We run PPC (with the number of components equal to the number of classes) with various numbers of pairwise relations. For each clustering result, a confusion matrix is built to compare it to the true labeling. The classification accuracy is calculated as the ratio of the sum of the diagonal elements to the total number of samples.

5.1.3 Experiment 1: Artificial Hard Constraints. This experiment is designed to answer two questions: how the number of pairwise relations affects the clustering result, and whether the information in the relations has been successfully encoded into the trained model. To answer the second question, we examine the out-of-sample classification of the gaussian mixture model fit with the aid of the pairwise relations. Toward this end, we divide each data set into a training set (90% of the data) and a held-out test set (10% of the data).
Figure 5: The artificial data sets. (a) Data set 1. (b) Data set 2. (c) Data set 3.
Pairwise relations are generated among samples in the training set. After the density model is fit on the training set and the pairwise relations, it is applied to the test set. Since the classification on the test set is decided solely by the fitted gaussian mixture model, it reflects the influence of the pairwise relations on the trained model. For comparison, we also give results for two other constrained clustering methods: (1) hard-CKmeans (Wagstaff et al., 2001), for which the accuracy on the test set is given by nearest-neighbor classification with the cluster centers fit on the training set, and (2) constrained-EM (Shental et al., 2004). Since we show in appendix A that constrained-EM with sampling assumption 2 (see section 4) is equivalent to hard-PPC, we need only consider constrained-EM with sampling assumption 1. The reported classification accuracy is averaged over 100 different realizations of the pairwise relations. The three two-dimensional artificial data sets shown in Figure 5 are designed to highlight PPC's superior modeling flexibility over constrained K-means.⁴ In each example, there are 200 samples in each class. It is clear from Figure 5 that for all three problems, the data in each class are nongaussian, so not surprisingly, standard K-means and GMM do not return satisfactory clustering results. Figure 6 compares the clustering results of hard-PPC and hard-CKmeans with various numbers of pairwise relations. As shown in Figure 6, the accuracy of hard-PPC improves significantly when pairwise relations are incorporated; after enough pairwise relations are added, we reach close to 100% accuracy on the training data. On the test set, although no pairwise relations are available, we observe significantly improved accuracy as well. For hard-CKmeans, we do not observe any substantial accuracy improvement on either the training set or the test set. The classification task of data set 1 is relatively easy for the gaussian mixture and difficult for K-means.
⁴ Some authors (Xing et al., 2003; Cohn et al., 2003; Bilenko et al., 2004) combined standard or constrained K-means with metric learning based on pairwise relations and reported improvements in classification accuracy. This is not discussed in this letter.
Figure 6: The performance of hard-PPC and hard-CKmeans with various numbers of relations. trn ppc: accuracy on the training set with hard-PPC; tst ppc: accuracy on the test set from the GMM trained by hard-PPC; trn cop: accuracy on the training set with hard-CKmeans; tst cop: accuracy of K-means on the test set. (a) On data set 1. (b) On data set 2. (c) On data set 3.
The classification accuracy of hard-PPC climbs from 60% to close to 100% (on both the training and test sets) after 70 pairwise relations, whereas the accuracy of hard-CKmeans remains below 60% even with 100 relations. On data set 2, the hard-PPC accuracy improves from 75% to close to 95% on the training set and stops at around 90% on the test set. This divergence happens because the two classes in data set 2 overlap and thus defy a perfect GMM classifier. Data set 3 is the most difficult since it is highly nongaussian: it takes over 100 pairs for hard-PPC to reach 95% accuracy, whereas hard-CKmeans never reaches 55%. The comparison of PPC and constrained-EM is presented in a way that highlights the difference between the classification accuracies of the two methods. We record the classifications from PPC and constrained-EM with the same pairwise relations and initial conditions and then calculate ∆Accuracy = (classification accuracy of PPC) − (classification accuracy of constrained-EM) on both the training and test sets. In Figure 7, we report the mean and standard deviation of ∆Accuracy estimated over 100 different realizations of the pairwise relations. From Figure 7, the difference between PPC and constrained-EM is indistinguishable when the number of relations is small, while PPC is slightly but consistently better than constrained-EM when relations are abundant. We perform the same experiments on three UCI data sets: the Iris data set has 150 samples and three classes, with 50 samples in each class; the Waveform data set has 5000 samples and three classes, with around 1700 samples in each class; and the Pendigits data set has four classes (digits 0, 6, 8, 9), each with 750 samples.
Figure 7: Comparison of PPC and constrained EM on artificial data. With each number of pairwise relations, we show the mean of ∆Accuracy ± standard deviation estimated over 100 random realizations of pairwise relations; training-set and test-set results are shown for (a) data set 1, (b) data set 2, and (c) data set 3.
Figure 8: Performance of PPC on UCI data sets with various numbers of relations. (a) Iris data. (b) Waveform data. (c) Pendigits data.
Figure 9: Comparison of PPC and constrained EM on UCI data sets. With each number of pairwise relations, we show the mean of ∆Accuracy ± standard deviation estimated over 100 random realizations of pairwise relations. (a) Iris. (b) Waveform. (c) Pendigits.
The results are summarized in Figures 8 and 9. As indicated by Figure 8, hard-PPC consistently improves its clustering accuracy on the training set as more pairwise constraints are added, and the effect of the constraints generalizes to the test set.
In contrast, as in the artificial data set case, the increase in accuracy from hard-CKmeans is much less salient than that of hard-PPC. Figure 9 shows that hard-PPC is slightly better than constrained-EM, especially when the number of constraints is large.

5.1.4 Experiment 2: Artificial Soft Constraints. In this experiment, we evaluate the performance of soft-PPC when the specified pairwise relations contain substantial error. The results are compared to hard-PPC, soft-CKmeans, and hard-CKmeans. The artificial constraints are generated in the same way as in the previous experiment. The flipping noise is realized by randomly flipping each pairwise relation with a certain probability q ≤ 0.5. For the soft-PPC model, the weight $W^p_{ij}$ of each specified pairwise relation is given as follows:

$$W^p_{ij} = \begin{cases} \frac{1}{2}\log\frac{1-q}{q} & (z_i, z_j) \text{ specified as link} \\ -\frac{1}{2}\log\frac{1-q}{q} & (z_i, z_j) \text{ specified as do-not-link.} \end{cases} \tag{5.1}$$
We use w to denote the absolute value of the weight for nontrivial pairs; for soft-PPC, $w = \frac{1}{2}\log(\frac{1-q}{q})$. For soft-CKmeans, we give equal weights to all specified constraints. Because there is no guiding rule in the literature on how to choose the weight for the soft-CKmeans model, we simply use the weight that yields the highest classification accuracy. We present results on the three artificial data sets and the three UCI data sets used in experiment 1. Unlike experiment 1, we use all available data in clustering. On each data set, we randomly generate enough nonoverlapping pairwise relations to involve 50% of the data. In this experiment, we try two different noise levels, with q set to 0.15 and 0.3. Figure 10 compares the classification accuracies given by the maximum likelihood (ML) solutions of the different models.⁵ The accuracy for each model is averaged over 20 random realizations of the pairwise relations. On all data sets except artificial data set 3, soft-PPC with the designed weight gives higher accuracy than hard-PPC (w = ∞) and standard GMM (w = 0) at both noise levels. On artificial data set 3, when q = 0.3, hard-PPC gives the best classification accuracy.⁶ Soft-PPC gives clearly superior classification accuracy to the K-means models on all six data sets, even though the weight of soft-CKmeans is optimized.
⁵ We choose the solution with the highest data likelihood among 100 runs with different random initializations. For the K-means models, including soft-CKmeans and hard-CKmeans, we use the solution with the smallest value of the cost function.
⁶ Further experiments show that on these data, soft-PPC with the optimal w (larger than the value suggested by equation 5.1) is still slightly better than hard-PPC.
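The constraint-generation and weight-design procedure just described can be summarized in a short sketch. This is our own hedged illustration under the stated setup (nonoverlapping pairs, flipping noise with probability q, weights from equation 5.1); the function name is hypothetical.

```python
import numpy as np

def noisy_relations(labels, n_pairs, q, rng):
    """Sample nonoverlapping pairs without replacement, set link/do-not-link
    from the class labels, flip each relation with probability q, and attach
    the weight of equation 5.1 (+w for links, -w for do-not-links)."""
    idx = rng.permutation(len(labels))[:2 * n_pairs]
    w = 0.5 * np.log((1.0 - q) / q)    # |W^p_ij| for every nontrivial pair
    out = []
    for i, j in zip(idx[0::2], idx[1::2]):
        is_link = labels[i] == labels[j]
        if rng.random() < q:           # flipping noise
            is_link = not is_link
        out.append((int(i), int(j), w if is_link else -w))
    return out

rng = np.random.default_rng(0)
print(noisy_relations(np.array([0, 0, 1, 1, 2, 2]), 2, q=0.15, rng=rng))
```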
Figure 10: Classification accuracy with noisy pairwise relations. We use all the data in clustering. (A) Standard GMM. (B) Soft-PPC. (C) Hard-PPC. (D) Standard K-means. (E) Soft-CKmeans with optimal weight. (F) Hard-CKmeans.
Figure 10 also shows that it can be harmful to use hard constraints when the pairwise relations are noisy, especially when the noise is significant. Indeed, as shown by Figures 10d and 10f, hard-PPC can yield accuracy even worse than standard GMM.

5.2 Real-World Problems. In this section, we present two examples where the pairwise constraints come from domain experts or common sense. Both examples concern image segmentation based on gaussian mixture models. In the first problem (experiment 3), hard pairwise relations are derived from image labeling done by a domain expert. In the second (experiment 4), soft pairwise relations are generated from spatial continuity.

5.2.1 Experiment 3: Hard Do-Not-Links from Partial Class Information. The experiment in this section shows the application of pairwise constraints derived from partial class information. For example, consider a problem with six classes A, B, ..., F. The classes are grouped into several class sets: C1 = {A, B, C}, C2 = {D, E}, C3 = {F}. The samples are partially labeled in the sense that we are told which class set a sample is from but not which specific class. We can logically derive a do-not-link constraint between any pair of samples known to belong to different class sets, while no link constraint can be derived if a class set has more than one class in it.
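For instance, the derivation of do-not-links from class-set labels can be written as the following sketch (our illustration; the function name is hypothetical). It enumerates every cross-set pair; in experiment 3 below, only a subsample of such pairs is used.

```python
from itertools import product

def donotlinks_from_class_sets(set_ids):
    """set_ids[i] is the class-set label of sample i (e.g., 'snow'/'nonsnow').
    Every pair of samples from different class sets yields a do-not-link;
    no links can be derived when a class set contains several classes."""
    groups = {}
    for i, s in enumerate(set_ids):
        groups.setdefault(s, []).append(i)
    keys = sorted(groups)
    pairs = []
    for a in range(len(keys)):
        for b in range(a + 1, len(keys)):
            pairs.extend(product(groups[keys[a]], groups[keys[b]]))
    return pairs

print(donotlinks_from_class_sets(["C1", "C2", "C1", "C3"]))
# [(0, 1), (2, 1), (0, 3), (2, 3), (1, 3)]
```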
Figure 11: (a) Gray-scale image from the first spectral channel. (b) Partial labeling given by experts: black pixels denote the nonsnow area, and white pixels denote the snow area. (c) Clustering result of standard GMM. (d) Clustering result of PPC. Panels c and d are colored according to the image blocks' assignments.
Figure 11a is a 120 × 400 region of the Greenland ice sheet from the NASA Langley DAAC (Srivastava, Oza, & Stroeve, 2005).⁷ Each pixel has intensities from seven spectral bands. The region is labeled into a snow area and a nonsnow area, as indicated in Figure 11b. The snow area may contain samples from several classes of interest (ice, melting snow, and dry snow), while the nonsnow area can be bare land, water, or cloud. The labeling from the expert thus contains incomplete but useful information for further segmentation of the image. To segment the image, we first divide it into 5 × 5 × 7 blocks (175-dimensional vectors). We use the first 50 principal components as feature vectors. Our goal is then to segment the image into (typically more than two) areas by clustering those feature vectors. With PPC, we can encode the partial class information into do-not-links.
⁷ We use the first seven moderate resolution imaging spectroradiometer (MODIS) channels, with bandwidths as follows (in nm): channel 1: 620–670; channel 2: 841–876; channel 3: 459–479; channel 4: 545–565; channel 5: 1230–1250; channel 6: 1628–1652; channel 7: 2105–2155.
For hard-PPC, we use half of the data samples for training and the rest for testing. Hard do-not-link constraints (on the training set only) are generated as follows: for each block in the nonsnow area, we randomly choose (without replacement) six blocks from the snow area with which to build do-not-link constraints. By doing this, we obtain cliques of size seven (1 nonsnow block + 6 snow blocks). As in section 5.1, we apply the model fit with hard-PPC to the test set and combine the clustering results on both data sets into a complete picture. Clearly, the clustering task is nontrivial for any M > 2. Typical clustering results of a three-component standard GMM and three-component PPC are shown in Figures 11c and 11d, respectively. Standard GMM gives a clustering that is clearly in disagreement with the human labeling in Figure 11b. The hard-PPC segmentation makes far fewer misassignments of snow areas (tagged white and gray) to nonsnow (black) than does the GMM, and it properly labels almost all of the nonsnow regions as nonsnow. Furthermore, the segmentation of the snow areas into the two classes (not labeled) tagged white and gray in Figure 11d reflects subtle differences in the snow regions captured by the gray-scale image from spectral channel 1, shown in Figure 11a.

5.2.2 Experiment 4: Soft Links from Continuity. In this section, we present an example where soft constraints come from continuity. As in the previous experiment, we do image segmentation based on clustering: the image is divided into blocks that are rearranged into feature vectors, and we use a GMM to model those feature vectors, with each gaussian component representing one texture. However, standard GMM often fails to give good segmentations because it cannot make use of the spatial continuity of the image, which is essential in many image segmentation models, such as random fields (Bouman & Shapiro, 1994). In our algorithm, the spatial continuity is incorporated as soft link preferences with uniform weight between each block and its neighbors. As described in section 2.4, the weight w of the soft links can be given as
$$w = \frac{1}{2}\log\frac{1-q}{q}, \tag{5.2}$$
where q is the ratio of softly linked adjacent pairs that are not in the same class. Usually q is given by an expert or estimated from segmentation results of similar images. In this experiment, we assume we already know the ratio q, which is calculated from the labeling of the image. The complete data likelihood is

$$P_p(X, Z|\Theta, W^p) = \frac{1}{\mathcal{Z}}\, P_s(X, Z|\Theta) \prod_i \prod_{j\in U(i)} \exp(w\,\delta(z_i, z_j)), \tag{5.3}$$
Figure 12: (a) Texture combination. (b) Clustering result of standard GMM. (c) Clustering result of soft-PPC with Gibbs sampling. (d) Clustering result of soft-PPC with the mean field approximation. Panels b to d are shaded according to the blocks' assignments to clusters.
where U(i) denotes the neighbors of the ith block. The EM algorithm can be roughly interpreted as iterating two steps: (1) estimating the texture description (the parameters of the mixture model) based on the segmentation and (2) segmenting the image based on the texture description given by step 1. Since exact calculation of the posterior probability is intractable due to the large clique containing all samples, we have to resort to approximation methods. In this experiment, both Gibbs sampling (see section 3.2) and the mean field approximation (see section 3.3) are used for posterior estimation. For Gibbs sampling, equation 3.1 reduces to

$$P_p(z_i|Z_{-i}, X, \Theta, W^p) \propto P_s(x_i, z_i|\Theta) \prod_{j\in U(i)} \exp(2w\,\delta(z_i, z_j)).$$

The mean field equation, 3.5, reduces to

$$Q_i(z_i) \leftarrow \frac{1}{\mathcal{Z}_i}\, P_s(x_i, z_i|\Theta) \prod_{j\in U(i)} \exp(2w\,Q_j(z_i)).$$
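A minimal sketch of the resulting mean field E-step is given below (our own illustration, assuming a uniform weight w and precomputed joint log-densities; the function name is ours). It implements the reduced update above, $Q_i(k) \propto P_s(x_i, k|\Theta)\exp(2w\sum_{j\in U(i)} Q_j(k))$.

```python
import numpy as np

def mean_field_posteriors(logp, neighbors, w, n_iter=20):
    """logp[i, k] = log P_s(x_i, z_i = k | Theta); neighbors[i] lists U(i).
    Successive updates typically converge in about 20 iterations."""
    N, M = logp.shape
    Q = np.full((N, M), 1.0 / M)
    for _ in range(n_iter):
        for i in range(N):
            s = sum(Q[j] for j in neighbors[i])     # sum over U(i) of Q_j(.)
            logq = logp[i] + 2.0 * w * s
            logq -= logq.max()                      # numerical stability
            Q[i] = np.exp(logq) / np.exp(logq).sum()
    return Q
```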
The image shown in Figure 12a is built from four Brodatz textures.⁸ It is divided into 7 × 7 blocks that are then rearranged into 49-dimensional vectors, and we use the vectors' first five principal components as the associated feature vectors. A typical clustering result of a four-component standard GMM is shown in Figure 12b. For soft-PPC, soft links with the weight w calculated from equation 5.2 are added between each block and its four neighbors. Figures 12c and 12d show the clustering results of four-component soft-PPC with Gibbs sampling and with the mean field approximation, respectively.
⁸ Downloaded from http://sipi.usc.edu/services/database/Database.html, April 2004.
One run with Gibbs sampling takes around 160 minutes on a PC with a 2.0 GHz Pentium 4 processor, whereas the algorithm using the mean field approximation takes only 3.1 minutes. Although the mean field approximation is about 50 times faster than Gibbs sampling, the clustering results are comparable, as Figure 12 shows. Compared to the result given by standard GMM, soft-PPC with either approximation method achieves significantly better segmentation after incorporating spatial continuity.
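For comparison, one sweep of the Gibbs sampler for the same model can be sketched as follows (again our own illustration with hypothetical names); each conditional draw follows the reduced form given above, $P_p(z_i|Z_{-i}, X, \Theta, W^p) \propto P_s(x_i, z_i|\Theta)\exp(2w\,\#\{j\in U(i): z_j = z_i\})$.

```python
import numpy as np

def gibbs_sweep(z, logp, neighbors, w, rng):
    """One Gibbs pass over all block assignments z (modified in place)."""
    N, M = logp.shape
    for i in range(N):
        counts = np.bincount([z[j] for j in neighbors[i]], minlength=M)
        logq = logp[i] + 2.0 * w * counts
        p = np.exp(logq - logq.max())
        z[i] = rng.choice(M, p=p / p.sum())
    return z
```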
6 Conclusion and Future Work

We have proposed a probabilistic clustering model that incorporates prior knowledge in the form of pairwise relations between samples. Unlike previous work in semisupervised clustering, our model formulates clustering preferences as a Bayesian prior over the assignment of data points to clusters and so naturally accommodates both hard constraints and soft preferences. Unlike many semisupervised learning methods addressing labeled subsets (Szummer & Jaakkola, 2001; Zhou & Schölkopf, 2004; Zhu, Lafferty, & Ghahramani, 2003), PPC returns a fitted parametric density model and thus can deal with unseen data. Experiments on different data sets have shown that pairwise relations can consistently improve the performance of the clustering process. Despite its success, PPC has limitations. First, PPC often needs a substantial proportion of the samples to be involved in pairwise relations to give good results; indeed, if we fix the number of relations and keep adding samples without any new relations, the algorithm eventually degenerates into unsupervised learning (clustering). To overcome this, one can instead build a semisupervised model based on discriminative models such as a neural network or a gaussian process classifier and use the pairwise relations in the form of hints (Sill & Abu-Mostafa, 1996) or observations (Zhu et al., 2003). Second, since PPC is based on the gaussian mixture model, it works well when the data in each class can be approximated by a gaussian distribution; when this condition is not satisfied, PPC can give poor results. One way to alleviate this is to use multiple clusters to model one class, an interesting direction for future exploration. Third, although our design of the weight matrix $W^p$ works well on some data sets, it is not clear how to set the weights in more general situations. In this letter, we implement hard constraints using equation 2.10. Alternatively, we can approximate hard constraints by using a large $|W^p_{ij}|$ for every constrained pair $(x_i, x_j)$: from equations 2.5 to 2.8, when a constraint with a large weight is violated by an assignment Z, the prior probability $P(Z|\Theta, W^p)$ is close to zero and can be made arbitrarily small by increasing the corresponding weight. This is convenient when we want to model soft and hard relations at the same time, a situation not covered in this letter that remains an interesting direction for future exploration.
To address the computational difficulty caused by large cliques, we proposed two approximation methods: Gibbs sampling and the mean field approximation. We observe that Gibbs sampling can be fairly slow for large cliques. One way to address this problem is to use fewer sampling passes (and thus a cruder approximate inference) in the early phase of EM training and to gradually increase the number of sampling passes (a finer approximation) as EM approaches convergence. By doing this, we may be able to achieve a much faster algorithm without sacrificing too much precision. For the mean field approximation, the bias introduced by the independence assumption among the $Q_i(\cdot)$ could be severe for some problems. We can ameliorate this, as suggested by Jaakkola (2004), by retaining more of the substructure of the original graphical model (for PPC, it is expressed in $W^p$) while still keeping the computation tractable.

Appendix A: Hard Constraints Limit

In this appendix, we prove that when $|W^p_{ij}| \to \infty$ for each specified pair $(x_i, x_j)$, the complete likelihood of PPC can be written as in equation 2.10 and is thus equivalent to the model proposed by Shental et al. (2003). In the model proposed by Shental et al. (2003), the complete likelihood is written as

$$P(X, Z|\Theta, E) = \frac{1}{\mathcal{Z}} \prod_{c_i} \delta_{y_{c_i}} \prod_{a_i^1 \neq a_i^2} \left(1 - \delta(y_{a_i^1}, y_{a_i^2})\right) \prod_{i=1}^{N} P_s(z_i|\Theta)\, P_s(x_i|z_i, \Theta)$$

$$= \frac{1}{\mathcal{Z}} \prod_{c_i} \delta_{y_{c_i}} \prod_{a_i^1 \neq a_i^2} \left(1 - \delta(y_{a_i^1}, y_{a_i^2})\right) P_s(X, Z|\Theta),$$
where E stands for the pairwise constraints, $\delta_{y_{c_i}}$ is 1 iff all the points in the chunklet (the clique of samples connected by hard links only) $c_i$ have the same label, and $(a_i^1, a_i^2)$ indexes a sample pair with a hard do-not-link between them. This is equivalent to

$$P(X, Z|\Theta, E) = \begin{cases} \frac{1}{\mathcal{Z}}\, P_s(X, Z|\Theta) & Z \text{ satisfies all the constraints} \\ 0 & \text{otherwise.} \end{cases} \tag{A.1}$$

In the corresponding PPC model with hard constraints, we have

$$W^p_{ij} = \begin{cases} +\infty & i \text{ and } j \text{ are linked} \\ -\infty & i \text{ and } j \text{ are do-not-linked} \\ 0 & \text{no relation.} \end{cases} \tag{A.2}$$
According to equations 2.5 and A.1, to prove $P(X, Z|\Theta, E) = P_p(X, Z|\Theta, W^p)$, we need only prove that $P_p(Z|\Theta, W^p) = 0$ for all Z that violate the constraints, that is,

$$P_p(Z|\Theta, W^p) = \frac{\prod_k \pi_{z_k} \prod_{i\neq j} \exp\left(W^p_{ij}\,\delta(z_i, z_j)\right)}{\sum_{Z'} \prod_l \pi_{z'_l} \prod_{m\neq n} \exp\left(W^p_{mn}\,\delta(z'_m, z'_n)\right)} = 0.$$

First, assume Z violates one link between the pair $(\alpha, \beta)$, with $W^p_{\alpha\beta} = +\infty$. We have

$$z_\alpha \neq z_\beta \;\Rightarrow\; \delta(z_\alpha, z_\beta) = 0 \;\Rightarrow\; \exp\left(W^p_{\alpha\beta}\,\delta(z_\alpha, z_\beta)\right) = 1.$$

We assume the constraints are consistent; in other words, there is at least one Z that satisfies all the constraints, and we denote one such assignment by Z*. We also assume each component has a positive prior probability. It is straightforward to show that $P_p(Z^*|\Theta, W^p) > 0$. Then

$$P_p(Z|\Theta, W^p) = \frac{\prod_k \pi_{z_k} \prod_{i\neq j} \exp\left(W^p_{ij}\,\delta(z_i, z_j)\right)}{\sum_{Z'} \prod_l \pi_{z'_l} \prod_{m\neq n} \exp\left(W^p_{mn}\,\delta(z'_m, z'_n)\right)} \le \frac{\prod_k \pi_{z_k} \prod_{i\neq j} \exp\left(W^p_{ij}\,\delta(z_i, z_j)\right)}{\prod_k \pi_{z^*_k} \prod_{i\neq j} \exp\left(W^p_{ij}\,\delta(z^*_i, z^*_j)\right)}$$

$$= \frac{\prod_k \pi_{z_k}}{\prod_k \pi_{z^*_k}} \cdot \frac{\prod_{(i,j)\neq(\alpha,\beta)} \exp\left(W^p_{ij}\,\delta(z_i, z_j)\right)}{\prod_{(i,j)\neq(\alpha,\beta)} \exp\left(W^p_{ij}\,\delta(z^*_i, z^*_j)\right)} \cdot \frac{\exp\left(2W^p_{\alpha\beta}\,\delta(z_\alpha, z_\beta)\right)}{\exp\left(2W^p_{\alpha\beta}\,\delta(z^*_\alpha, z^*_\beta)\right)}$$

$$= \frac{\prod_k \pi_{z_k}}{\prod_k \pi_{z^*_k}} \cdot \prod_{(i,j)\neq(\alpha,\beta)} \frac{\exp\left(W^p_{ij}\,\delta(z_i, z_j)\right)}{\exp\left(W^p_{ij}\,\delta(z^*_i, z^*_j)\right)} \cdot \frac{1}{\exp\left(2W^p_{\alpha\beta}\,\delta(z^*_\alpha, z^*_\beta)\right)}.$$
Since Z* satisfies all the constraints, we must have

$$\prod_{(i,j)\neq(\alpha,\beta)} \frac{\exp\left(W^p_{ij}\,\delta(z_i, z_j)\right)}{\exp\left(W^p_{ij}\,\delta(z^*_i, z^*_j)\right)} \le 1.$$

So we have

$$P_p(Z|\Theta, W^p) \le \frac{\prod_k \pi_{z_k}}{\prod_k \pi_{z^*_k}} \cdot \frac{1}{\exp\left(2W^p_{\alpha\beta}\,\delta(z^*_\alpha, z^*_\beta)\right)}.$$

When $W^p_{\alpha\beta} \to +\infty$, we have

$$\frac{1}{\exp\left(2W^p_{\alpha\beta}\,\delta(z^*_\alpha, z^*_\beta)\right)} \to 0,$$

and then

$$P_p(Z|\Theta, W^p) \le \frac{\prod_k \pi_{z_k}}{\prod_k \pi_{z^*_k}} \cdot \frac{1}{\exp\left(2W^p_{\alpha\beta}\,\delta(z^*_\alpha, z^*_\beta)\right)} \to 0. \tag{A.3}$$
The do-not-link case can be proved in a similar way.

Appendix B: Prior for Noisy Pairwise Relations

In this appendix, we show how to derive the weights $W^p$ from the certainty values $\gamma_{ij}$ of each pair $(x_i, x_j)$. Let E denote the original (noise-free) labeled pairwise relations and $\tilde{E}$ the noisy version delivered to us. If we knew the original pairwise relations E, we would only have to consider the cluster assignments that are consistent with E and neglect the others; that is, the prior probability of Z would be
$$P(Z|\Theta, E) = \begin{cases} \frac{1}{\mathcal{Z}_E}\, P_s(Z|\Theta) & Z \text{ is consistent with } E \\ 0 & \text{otherwise,} \end{cases}$$

where $\mathcal{Z}_E$ is the normalization constant for E: $\mathcal{Z}_E = \sum_{Z:\,\text{consistent with } E} P_s(Z|\Theta)$. Since we know $\tilde{E}$ and the associated certainty values $\Gamma = \{\gamma_{ij}\}$,
we know

$$P(Z|\Theta, \tilde{E}, \Gamma) = \sum_E P(Z|\Theta, E, \tilde{E}, \Gamma)\, P(E|\tilde{E}, \Gamma) \tag{B.1}$$

$$= \sum_E P(Z|\Theta, E)\, P(E|\tilde{E}, \Gamma). \tag{B.2}$$

Let E(Z) denote the unique E that is consistent with Z. From equation B.2, we know

$$P(Z|\Theta, \tilde{E}, \Gamma) = P(Z|\Theta, E(Z))\, P(E(Z)|\tilde{E}, \Gamma) = \frac{1}{\mathcal{Z}_{E(Z)}}\, P(E(Z)|\tilde{E}, \Gamma)\, P_s(Z|\Theta).$$
If we ignore the variation of $\mathcal{Z}_E$ over E, we obtain an approximation of $P(Z|\Theta, \tilde{E}, \Gamma)$, denoted $P_a(Z|\Theta, \tilde{E}, \Gamma)$:

$$P_a(Z|\Theta, \tilde{E}, \Gamma) = \frac{1}{\mathcal{Z}_a}\, P_s(Z|\Theta)\, P(E(Z)|\tilde{E}, \Gamma) = \frac{1}{\mathcal{Z}_a}\, P_s(Z|\Theta) \prod_{i<j} \gamma_{ij}^{H_{ij}(\tilde{E}, z_i, z_j)} (1-\gamma_{ij})^{1-H_{ij}(\tilde{E}, z_i, z_j)},$$

where $\mathcal{Z}_a$ is the new normalization constant, $\mathcal{Z}_a = \sum_Z P_s(Z|\Theta)\, P(E(Z)|\tilde{E}, \Gamma)$, and

$$H_{ij}(\tilde{E}, z_i, z_j) = \begin{cases} 1 & (z_i, z_j) \text{ is consistent with } \tilde{E} \\ 0 & \text{otherwise.} \end{cases}$$
We argue that $P_a(Z|\Theta, \tilde{E}, \Gamma)$ is equal to the PPC prior probability $P_p(Z|\Theta, W^p)$ with

$$W^p_{ij} = \begin{cases} \frac{1}{2}\log\frac{\gamma_{ij}}{1-\gamma_{ij}} & (z_i, z_j) \text{ is specified as must-linked in } \tilde{E} \\ -\frac{1}{2}\log\frac{\gamma_{ij}}{1-\gamma_{ij}} & (z_i, z_j) \text{ is specified as cannot-linked in } \tilde{E} \\ 0 & \text{otherwise.} \end{cases} \tag{B.3}$$

This can be easily proven by verifying that

$$\frac{P_p(Z|\Theta, W^p)}{P_a(Z|\Theta, \tilde{E}, \Gamma)} = \frac{\mathcal{Z}_a}{\mathcal{Z}_w} \prod_{i<j,\,W^p_{ij}\neq 0} \gamma_{ij}^{\operatorname{sign}(W^p_{ij})-1} (1-\gamma_{ij})^{-\operatorname{sign}(W^p_{ij})} = \text{constant}.$$
Since both $P_a(Z|\Theta, \tilde{E}, \Gamma)$ and $P_p(Z|\Theta, W^p)$ are normalized, we conclude that $P_a(Z|\Theta, \tilde{E}, \Gamma) = P_p(Z|\Theta, W^p)$.

Appendix C: From PPC to Constrained K-Means

In this appendix, we show how to derive K-means models with soft and hard constraints from PPC.

C.1 From PPC to K-Means with Soft Constraints. The adopted cost function for K-means with soft constraints is

$$J(\mu, Z) = \frac{1}{2}\sum_{i=1}^{N} \|x_i - \mu_{z_i}\|^2 + \sum_{(i,j)\in\mathcal{L}} a_{ij}\,\mathbb{1}[z_i \neq z_j] + \sum_{(i,j)\in\mathcal{N}} b_{ij}\,\mathbb{1}[z_i = z_j], \tag{C.1}$$
where $\mu_k$ is the center of the kth cluster. Equation 4.1 can be rewritten as

$$J(\mu, Z) = \frac{1}{2}\sum_{i=1}^{N} \|x_i - \mu_{z_i}\|^2 - \sum_{ij} W^p_{ij}\,\delta(z_i, z_j) + C, \tag{C.2}$$

with the constant $C = -\sum_{(i,j)\in\mathcal{L}} a_{ij}$ and

$$W^p_{ij} = \begin{cases} a_{ij} & (i,j)\in\mathcal{L} \\ -b_{ij} & (i,j)\in\mathcal{N} \\ 0 & \text{otherwise.} \end{cases} \tag{C.3}$$
The clustering process minimizes the cost function J(µ, Z) over both the model parameters $\mu = \{\mu_1, \mu_2, \ldots, \mu_M\}$ and the cluster assignment $Z = \{z_1, z_2, \ldots, z_N\}$. The optimization is usually done iteratively with a modified Linde-Buzo-Gray (LBG) algorithm.
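For concreteness, the quantity being minimized can be sketched as follows (our illustration with a hypothetical function name; in practice, the assignment step uses greedy or iterated conditional updates rather than exhaustive search).

```python
import numpy as np

def soft_ckmeans_cost(X, z, mu, links, notlinks):
    """Equation C.1: reconstruction error plus a_ij for each violated link
    and b_ij for each violated do-not-link."""
    J = 0.5 * np.sum((X - mu[z]) ** 2)
    J += sum(a for i, j, a in links if z[i] != z[j])
    J += sum(b for i, j, b in notlinks if z[i] == z[j])
    return J
```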
Assume we have a PPC model with the matrix $W^p$ the same as in equation C.2, and further constrain each gaussian component to be spherical with radius σ. The complete data likelihood for this PPC model is

$$P(X, Z|\Theta, W^p) = \frac{1}{\mathcal{Z}} \prod_{i=1}^{N} \pi_{z_i} \prod_{i=1}^{N} \exp\left(-\frac{\|x_i - \mu_{z_i}\|^2}{2\sigma^2}\right) \prod_{mn} \exp\left(W^p_{mn}\,\delta(z_m, z_n)\right), \tag{C.4}$$
where $\mathcal{Z}$ is the normalizing constant and $\mu_k$ is the mean of the kth gaussian component. To build its connection to the cost function in equation C.2, we consider the following scaling:

$$\sigma \to \alpha\sigma, \qquad W^p_{ij} \to W^p_{ij}/\alpha^2. \tag{C.5}$$
The complete data likelihood with the scaled parameters is

$$P_\alpha(X, Z|\Theta, W^p) = \frac{1}{\mathcal{Z}(\alpha)} \prod_{i=1}^{N} \pi_{z_i} \prod_{i=1}^{N} \exp\left(-\frac{\|x_i - \mu_{z_i}\|^2}{2\alpha^2\sigma^2}\right) \prod_{mn} \exp\left(\frac{W^p_{mn}}{\alpha^2}\,\delta(z_m, z_n)\right). \tag{C.6}$$

It can be shown that when α → 0, the maximum data likelihood dominates the total data likelihood:

$$\lim_{\alpha\to 0} \frac{\max_Z P_\alpha(X, Z|\Theta, W^p)}{\sum_Z P_\alpha(X, Z|\Theta, W^p)} = 1. \tag{C.7}$$
To prove equation C.7, we first show that when α is small enough,

$$\arg\max_Z P_\alpha(X, Z|\Theta, W^p) = Z^* \equiv \arg\min_Z \left[\sum_{i=1}^{N} \frac{\|x_i - \mu_{z_i}\|^2}{2} - \sum_{mn} W^p_{mn}\,\delta(z_m, z_n)\right]. \tag{C.8}$$
Proof of Equation C.8. Assume Z′ is any cluster assignment different from Z*. We only need to show that when α is small enough,

$$P_\alpha(X, Z^*|\Theta, W^p) > P_\alpha(X, Z'|\Theta, W^p). \tag{C.9}$$

To prove equation C.9, we note that

$$\log P_\alpha(X, Z^*|\Theta, W^p) - \log P_\alpha(X, Z'|\Theta, W^p) = \sum_{i=1}^{N} \left(\log\pi_{z^*_i} - \log\pi_{z'_i}\right) + \frac{1}{\alpha^2}\left[\sum_{i=1}^{N}\left(\frac{\|x_i - \mu_{z'_i}\|^2}{2} - \frac{\|x_i - \mu_{z^*_i}\|^2}{2}\right) + \sum_{mn} W^p_{mn}\left(\delta(z^*_m, z^*_n) - \delta(z'_m, z'_n)\right)\right]. \tag{C.10}$$
Since $Z^* = \arg\min_Z \left[\sum_{i=1}^{N} \frac{\|x_i - \mu_{z_i}\|^2}{2} - \sum_{mn} W^p_{mn}\,\delta(z_m, z_n)\right]$, we have

$$\sum_{i=1}^{N}\left(\frac{\|x_i - \mu_{z'_i}\|^2}{2} - \frac{\|x_i - \mu_{z^*_i}\|^2}{2}\right) + \sum_{mn} W^p_{mn}\left(\delta(z^*_m, z^*_n) - \delta(z'_m, z'_n)\right) > 0. \tag{C.11}$$

Let $\varepsilon = \sum_{i=1}^{N}\left(\frac{\|x_i - \mu_{z'_i}\|^2}{2} - \frac{\|x_i - \mu_{z^*_i}\|^2}{2}\right) + \sum_{mn} W^p_{mn}\left(\delta(z^*_m, z^*_n) - \delta(z'_m, z'_n)\right)$. We can see that when α is small enough,

$$\log P_\alpha(X, Z^*|\Theta, W^p) - \log P_\alpha(X, Z'|\Theta, W^p) = \sum_{i=1}^{N}\left(\log\pi_{z^*_i} - \log\pi_{z'_i}\right) + \frac{\varepsilon}{\alpha^2} > 0. \tag{C.12}$$
It is obvious from equation C.12 that for any Z′ different from Z*,

$$\lim_{\alpha\to 0}\left[\log P_\alpha(X, Z^*|\Theta, W^p) - \log P_\alpha(X, Z'|\Theta, W^p)\right] = \lim_{\alpha\to 0}\left[\sum_{i=1}^{N}\left(\log\pi_{z^*_i} - \log\pi_{z'_i}\right) + \frac{\varepsilon}{\alpha^2}\right] = +\infty,$$

or equivalently,

$$\lim_{\alpha\to 0} \frac{P_\alpha(X, Z'|\Theta, W^p)}{P_\alpha(X, Z^*|\Theta, W^p)} = 0, \tag{C.13}$$
which proves equation C.7. As a result of equation C.7, when optimizing the model parameters Θ, we can equivalently maximize $\max_Z P_\alpha(X, Z|\Theta, W^p)$ over Θ. It is then the joint optimization problem

$$\max_{\Theta, Z} P_\alpha(X, Z|\Theta, W^p).$$

Following the same reasoning, the soft posterior probability of each sample (as in a conventional mixture model) becomes a hard membership (as in K-means). This can be proved simply as follows. The posterior probability of sample $x_i$ belonging to component k is

$$P_\alpha(z_i = k|X, \Theta, W^p) = \frac{\sum_{Z|z_i=k} P_\alpha(X, Z|\Theta, W^p)}{\sum_Z P_\alpha(X, Z|\Theta, W^p)}.$$
From equation C.7, it is easy to see that

$$\lim_{\alpha\to 0} P_\alpha(z_i = k|X, \Theta, W^p) = \begin{cases} 1 & z^*_i = k \\ 0 & \text{otherwise.} \end{cases} \tag{C.14}$$
The negative logarithm of the complete likelihood $P_\alpha$ is then

$$J_\alpha(\Theta, Z) = -\log P_\alpha(X, Z|\Theta, W^p) = -\sum_{i=1}^{N}\log\pi_{z_i} + \sum_{i=1}^{N}\frac{\|x_i - \mu_{z_i}\|^2}{2\alpha^2} - \sum_{mn}\frac{W^p_{mn}}{\alpha^2}\,\delta(z_m, z_n) + \log\mathcal{Z}(\alpha)$$

$$= -\sum_{i=1}^{N}\log\pi_{z_i} + \frac{1}{\alpha^2}\left[\sum_{i=1}^{N}\frac{\|x_i - \mu_{z_i}\|^2}{2} - \sum_{mn}W^p_{mn}\,\delta(z_m, z_n)\right] + C,$$

where $C = \log\mathcal{Z}(\alpha)$ is a constant. When α → 0, we can neglect the term $-\sum_{i=1}^{N}\log\pi_{z_i}$; hence, the only model parameters left to adjust are the gaussian means µ, and we only have to consider the new cost function

$$\tilde{J}_\alpha(\mu, Z) = \frac{1}{\alpha^2}\left[\sum_{i=1}^{N}\frac{\|x_i - \mu_{z_i}\|^2}{2} - \sum_{mn}W^p_{mn}\,\delta(z_m, z_n)\right], \tag{C.15}$$
the optimization of which is obviously equivalent to equation C.1. So we can conclude that as α → 0 in equation C.5, the PPC model shown in equation C.4 becomes a K-means model with soft constraints.

C.2 From PPC to K-Means with Hard Constraints (COP-KMeans). COP-KMeans is a hard clustering algorithm with hard constraints. The goal is to find a set of cluster centers µ and a clustering result Z that minimize the cost function

$$\sum_{i=1}^{N} \|x_i - \mu_{z_i}\|^2 \tag{C.16}$$
subject to the constraints

$$z_i = z_j \quad \text{if } (x_i, x_j) \in \mathcal{L} \tag{C.17}$$

$$z_i \neq z_j \quad \text{if } (x_i, x_j) \in \mathcal{N}. \tag{C.18}$$
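The assignment step of COP-KMeans can be sketched as follows (our illustration of the sequential strategy of Wagstaff et al., 2001; the helper name is hypothetical). It also exhibits the failure mode noted in section 4: the step returns None when every cluster conflicts with some already assigned sample.

```python
def cop_assign(i, z, dists, links, notlinks):
    """Assign sample i to the nearest center violating no hard constraint.
    z maps already-assigned samples to clusters; dists[k] is the distance
    of sample i to center k; links[i]/notlinks[i] list i's partners."""
    for k in sorted(range(len(dists)), key=dists.__getitem__):
        if any(j in z and z[j] != k for j in links.get(i, [])):
            continue                       # would break a link
        if any(j in z and z[j] == k for j in notlinks.get(i, [])):
            continue                       # would break a do-not-link
        return k
    return None                            # no feasible cluster exists
```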
Assume we have a PPC model with soft relations represented by the matrix $W^p$ such that

$$W^p_{ij} = \begin{cases} w & (x_i, x_j)\in\mathcal{L} \\ -w & (x_i, x_j)\in\mathcal{N} \\ 0 & \text{otherwise,} \end{cases} \tag{C.19}$$

where w > 0. We further constrain each gaussian component to be spherical with radius σ. The complete data likelihood for this PPC model is

$$P(X, Z|\Theta, W^p) = \frac{1}{\mathcal{Z}} \prod_{i=1}^{N}\pi_{z_i} \prod_{i=1}^{N}\exp\left(-\frac{\|x_i - \mu_{z_i}\|^2}{2\sigma^2}\right) \prod_{(m,n)\in\mathcal{L}}\exp\left(w\,\delta(z_m, z_n)\right) \prod_{(m',n')\in\mathcal{N}}\exp\left(-w\,\delta(z_{m'}, z_{n'})\right), \tag{C.20}$$
where $\mu_k$ is the mean of the kth gaussian component. There are infinitely many ways to obtain equations C.16 to C.18 from equation C.20, but we consider the following scaling with factor β:

$$\sigma \to \beta\sigma, \qquad W^p_{ij} \to W^p_{ij}/\beta^3. \tag{C.21}$$
The complete data likelihood with the scaled parameters is

$$P_\beta(X, Z|\Theta, W^p) = \frac{1}{\mathcal{Z}(\beta)} \prod_{i=1}^{N}\pi_{z_i} \prod_{i=1}^{N}\exp\left(-\frac{\|x_i - \mu_{z_i}\|^2}{2\beta^2\sigma^2}\right) \prod_{(m,n)\in\mathcal{L}}\exp\left(\frac{w}{\beta^3}\,\delta(z_m, z_n)\right) \prod_{(m',n')\in\mathcal{N}}\exp\left(-\frac{w}{\beta^3}\,\delta(z_{m'}, z_{n'})\right). \tag{C.22}$$

As established in the previous section, when β → 0, the maximum data likelihood dominates the total data likelihood,

$$\lim_{\beta\to 0}\frac{\max_Z P_\beta(X, Z|\Theta, W^p)}{\sum_Z P_\beta(X, Z|\Theta, W^p)} = 1.$$

As a result, when optimizing the model parameters Θ, we can equivalently maximize $\max_Z P_\beta(X, Z|\Theta, W^p)$, and the soft posterior probabilities (as in the conventional mixture model) become hard memberships (as in K-means).
The negative logarithm of the complete likelihood $P_\beta$ is then

$$J_\beta(\Theta, Z) = -\sum_{i=1}^{N}\log\pi_{z_i} + C + \frac{1}{\beta^2}\left[\sum_{i=1}^{N}\frac{\|x_i - \mu_{z_i}\|^2}{2} + \frac{1}{\beta}\left(\sum_{(m',n')\in\mathcal{N}} w\,\delta(z_{m'}, z_{n'}) - \sum_{(m,n)\in\mathcal{L}} w\,\delta(z_m, z_n)\right)\right], \tag{C.23}$$

where $C = \log\mathcal{Z}(\beta)$ is a constant. When β → 0, we can neglect the term $-\sum_{i=1}^{N}\log\pi_{z_i}$; hence, we only have to consider the new cost function

$$\tilde{J}_\beta(\mu, Z) = \frac{1}{\beta^2}\left[\sum_{i=1}^{N}\frac{\|x_i - \mu_{z_i}\|^2}{2} + \frac{1}{\beta}\left(\sum_{(m',n')\in\mathcal{N}} w\,\delta(z_{m'}, z_{n'}) - \sum_{(m,n)\in\mathcal{L}} w\,\delta(z_m, z_n)\right)\right], \tag{C.24}$$
the minimization of which is equivalent (after dropping the constant factor $1/\beta^2$) to minimizing

$$\tilde{J}'_\beta(\mu, Z) = \sum_{i=1}^{N}\frac{\|x_i - \mu_{z_i}\|^2}{2} + \frac{w}{\beta}\,J_c(Z), \tag{C.25}$$

where $J_c(Z) = \sum_{(m',n')\in\mathcal{N}}\delta(z_{m'}, z_{n'}) - \sum_{(m,n)\in\mathcal{L}}\delta(z_m, z_n)$ is the cost term from the pairwise constraints. Let $S_Z = \{Z \mid z_i = z_j \text{ if } W^p_{ij} > 0;\ z_i \neq z_j \text{ if } W^p_{ij} < 0\}$. We assume the pairwise relations are consistent, that is, $S_Z \neq \emptyset$. Obviously, all Z in $S_Z$ achieve the same minimum value of the term $J_c(Z)$; that is,

$$J_c(Z) = J_c(Z') \quad \forall\, Z, Z' \in S_Z, \qquad J_c(Z) < J_c(Z'') \quad \forall\, Z \in S_Z,\ Z'' \notin S_Z.$$
It is obvious that when β → 0, any Z that minimizes $\tilde{J}'_\beta(\mu, Z)$ must be in $S_Z$. So the minimization of equation C.25 can finally be cast into the form

$$\min_{Z,\mu} \sum_{i=1}^{N}\|x_i - \mu_{z_i}\|^2 \quad \text{subject to } Z \in S_Z,$$
which is equivalent to equations C.16 to C.18. So we can conclude that as β → 0 in equation C.21, the PPC model shown in equation C.20 becomes a K-means model with hard constraints.

Acknowledgments

We thank Ashok Srivastava of NASA Ames Research Center for providing the satellite image data and the reviewers for comments leading to a stronger letter. This work was funded by NASA Collaborative Agreement NCC 2-1264 and NSF grant OCI-0121475.
References

Ambroise, C., Dang, M., & Govaert, G. (1997). Clustering of spatial data by the EM algorithm. In A. Soares, J. Gómez-Hernández, & R. Froidevaux (Eds.), Geostatistics for environmental applications (Vol. 3, pp. 493–504). Norwell, MA: Kluwer.
Basu, S., Banerjee, A., & Mooney, R. (2002). Semi-supervised clustering by seeding. In C. Sammut & A. Hoffmann (Eds.), Proceedings of the Nineteenth International Conference on Machine Learning (pp. 19–26). San Francisco: Morgan Kaufmann.
Basu, S., Bilenko, M., & Mooney, R. J. (2004). A probabilistic framework for semi-supervised clustering. In Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 59–68). New York: ACM.
Bilenko, M., Basu, S., & Mooney, R. (2004). Integrating constraints and metric learning in semi-supervised clustering. In C. Brodley (Ed.), Proceedings of the Twenty-First International Conference on Machine Learning (pp. 11–18). New York: ACM.
Bouman, C., & Shapiro, M. (1994). A multiscale random field model for Bayesian image segmentation. IEEE Transactions on Image Processing, 3, 162–177.
Cohn, D., Caruana, R., & McCallum, A. (2003). Semi-supervised clustering with user feedback (Tech. Rep. TR2003-1892). Ithaca, NY: Cornell University.
Dempster, A., Laird, N., & Rubin, D. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1–38.
Jaakkola, T. (2004). Tutorial on variational approximation methods. In D. Saad & M. Opper (Eds.), Advanced mean field methods: Theory and practice (pp. 129–160). Cambridge, MA: MIT Press.
Klein, D., Kamvar, S., & Manning, C. (2002). From instance level to space-level constraints: Making the most of prior knowledge in data clustering. In C. Sammut & A. Hoffmann (Eds.), Proceedings of the Nineteenth International Conference on Machine Learning (pp. 307–313). San Francisco: Morgan Kaufmann.
Lange, T., Law, M., Jain, A., & Buhmann, J. (2005). Learning with constrained and unlabelled data. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (pp. 730–737). Los Alamitos, CA: IEEE Computer Society Press.
Law, M., Topchy, A., & Jain, A. (2004). Clustering with soft and group constraints. In A. Fred, T. Caelli, R. Duin, A. Campilho, & D. de Ridder (Eds.), Joint IAPR International Workshops on Syntactical and Structural Pattern Recognition and Statistical Pattern Recognition (pp. 662–670). Berlin: Springer-Verlag.
Law, M., Topchy, A., & Jain, A. (2005). Model-based clustering with probabilistic constraints. In Proceedings of SIAM Data Mining (pp. 641–645). Philadelphia: SIAM.
Lu, Z., & Leen, T. (2005). Semi-supervised learning with penalized probabilistic clustering. In L. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing systems, 17 (pp. 849–856). Cambridge, MA: MIT Press.
Neal, R. (1993). Probabilistic inference using Markov chain Monte Carlo methods (Tech. Rep. CRG-TR-93-1). Toronto: Department of Computer Science, University of Toronto.
Shental, N., Bar-Hillel, A., Hertz, T., & Weinshall, D. (2003). Computing gaussian mixture models with EM using side-information (Tech. Rep. 2003-43). Jerusalem: Leibniz Center for Research in Computer Science.
Shental, N., Bar-Hillel, A., Hertz, T., & Weinshall, D. (2004). Computing gaussian mixture models with EM using equivalence constraints. In L. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing systems, 16 (pp. 505–512). Cambridge, MA: MIT Press.
Sill, J., & Abu-Mostafa, Y. (1996). Monotonicity hints. In D. Touretzky, M. Mozer, & M. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 634–640). Cambridge, MA: MIT Press.
Srivastava, A., Oza, N. C., & Stroeve, J. (2005). Virtual sensors: Using data mining techniques to efficiently estimate remote sensing spectra. IEEE Transactions on Geoscience and Remote Sensing, 43(3), 590–599.
Szummer, M., & Jaakkola, T. (2001). Partially labeled classification with Markov random walks. In T. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14. Cambridge, MA: MIT Press.
Theiler, J., & Gisler, G. (1997). A contiguity-enhanced K-means clustering algorithm for unsupervised multispectral image segmentation. In B. Javidi & D. Psaltis (Eds.), Proceedings of SPIE (Vol. 3159, pp. 108–118). Bellingham, WA: SPIE.
Wagstaff, K. (2002). Intelligent clustering with instance-level constraints. Unpublished doctoral dissertation, Cornell University.
Wagstaff, K., Cardie, C., Rogers, S., & Schroedl, S. (2001). Constrained K-means clustering with background knowledge. In C. Brodley & A. Danyluk (Eds.), Proceedings of the Eighteenth International Conference on Machine Learning (pp. 577–584). San Francisco: Morgan Kaufmann.
Xing, E., Ng, A., Jordan, M., & Russell, S. (2003). Distance metric learning with applications to clustering with side information. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15 (pp. 505–512). Cambridge, MA: MIT Press.
Yarowsky, D. (1995). Unsupervised word sense disambiguation rivaling supervised methods. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics (pp. 189–196). Morristown, NJ: Association for Computational Linguistics.
Zhou, D., & Schölkopf, B. (2004). Learning from labeled and unlabeled data using random walks. In C. Rasmussen, H. Bülthoff, M. Giese, & B. Schölkopf (Eds.), 26th Annual Meeting of the German Association for Pattern Recognition (DAGM 2004). Berlin: Springer.
Zhu, X., Lafferty, J., & Ghahramani, Z. (2003). Semi-supervised learning: From Gaussian fields to Gaussian processes (Tech. Rep. CMU-CS-03-175). Pittsburgh, PA: Computer Science Department, Carnegie Mellon University.
Received October 21, 2005; accepted July 27, 2006.
LETTER
Communicated by Jan Karbowski
Robustness of Connectionist Swimming Controllers Against Random Variation in Neural Connections

Jimmy Or
[email protected]
RIKEN Brain Science Institute, 2-1 Hirosawa, Wako, Saitama, Japan 351-0198
The ability to achieve high swimming speed and efficiency is very important to both the real lamprey and its robotic implementation. In previous studies, we used evolutionary algorithms to evolve biologically plausible connectionist swimming controllers for a simulated lamprey. This letter investigates the robustness and optimality of the best-evolved controllers as well as the biological controller hand-crafted by Ekeberg. Comparing each controller under random variation in its intrasegmental or intersegmental weights allows estimates of robustness to be made. We test each controller's robustness at the excitation level corresponding to either its maximum swimming speed or its maximum efficiency, randomly varying the segmental connection weights (and on some occasions also the intersegmental couplings) over a range of noise levels. Interestingly, although the swimming performance (in terms of maximum speed and efficiency) of the Ekeberg biological controller is not as good as that of the artificially evolved controllers, it is relatively robust against noise in the neural networks. This suggests that natural evolution has produced a swimming controller that is good enough to survive in the real world. Our findings could inspire neurobiologists to conduct physiological experiments to gain a better understanding of how neural connectivity affects behavior. The results can also be applied to control an artificial lamprey in simulation and possibly also a robotic one.

1 Introduction

Advances have been made in understanding animal motor control in recent years due to better physiological measurement techniques, higher-density microelectrodes, and faster computers for simulating the neural mechanisms that underlie behavior. However, due to the complexity of the nervous system, we are still far from completely understanding the neural control of higher vertebrates such as humans. The lamprey has been chosen for study by several neurobiologists because it is relatively easy to analyze:

Neural Computation 19, 1568–1588 (2007)
© 2007 Massachusetts Institute of Technology
first, because while it has a brain stem and spinal cord with all the basic vertebrate features, the number of neurons in each category is an order of magnitude fewer than in other vertebrates; and second, because it has a simple swimming gait. Findings in relation to this prototypical vertebrate can thus provide a better understanding of vertebrate motor control in general. According to Ferrar, Williams, and Bowtell (1993), robustness of neural networks against stochastic variation is an important property of biological systems. Over the past 14 years, several researchers have studied different aspects of the robustness of model lamprey central pattern generators (CPGs), some at the segmental level (Hellgren, Grillner, & Lansner, 1992; Ferrar et al., 1993) and others at the intersegmental level (Wadden, Hellgren, Lansner, & Grillner, 1997; Ijspeert, 1998; Or & Hallam, 2000; Or, 2002). Hellgren et al. (1992) used a detailed model of the lamprey segmental network presented in Ekeberg, Wallén, Brodin, and Lansner (1991) and Wallén et al. (1992) to study how variation in cell and synaptic parameters affects the burst pattern. According to Hellgren et al. (1992), this model consists of a population of interneurons whose cellular properties, such as size and membrane conductance (including voltage-dependent ion channels), are taken into consideration. The model is detailed enough to take into account all experimentally established important neural properties, yet simple enough to accommodate the simulation of several neurons. Hellgren et al. found that the robustness of the burst patterns, as well as the working frequency range, can be increased by incorporating random variation in certain cells and in the synaptic parameters. Furthermore, the population model they used is capable of producing stable burst activity over a large range of oscillation frequencies even without the lateral inhibitory interneurons (LINs). Ferrar et al. (1993) later studied the robustness of the lamprey segmental network proposed by Buchanan and Grillner (1987). Unlike the model investigated by Hellgren et al. (1992), the neurons in this model are less detailed, and each represents an entire population of neurons of the same type. Ferrar and his colleagues observed how the neural characteristics of the network (such as cycle duration, burst proportion, and left-right phase lag) change when it is subjected to two different noise conditions: (1) long-timescale noise at the synaptic connections from the driver cells (the brain stem in Ekeberg's model) to all the other cells and (2) short-timescale noise at all synaptic connections. In the first test, the synaptic strengths were varied from 50% to 100% only at the beginning of the simulations; in the second, they were varied from 0% to 200% at each time step. Ferrar and his colleagues found that the network was less affected by noise on the short timescale because the effect of the noise tended to average out over a cycle. They also found that increasing the number of neurons in the network increased its robustness despite the influence of noise. A controller that achieves the highest swimming speed or efficiency but cannot tolerate random noise in its networks is undesirable in the real world.
We therefore investigate how these two variables are affected when controllers are subjected to random variations in both segmental and intersegmental neural connections. The controllers under investigation are Ekeberg's biological model and three other biologically plausible connectionist models evolved by genetic algorithms (GAs). We conduct four experiments that examine how varying the segmental or intersegmental connections affects the maximum swimming speed and efficiency.

2 Background

2.1 Neural Model of the Segmental Oscillator. We implemented the lamprey segmental oscillator based on physiological data presented in Ekeberg (1993). In the model, each segmental oscillator consists of eight neurons of four types. Each neuron is modeled as a leaky integrator with a saturating transfer function. The output u (∈ [0, 1]) is the mean firing frequency of the population that the unit represents. It is calculated using the following set of equations:

$$\dot{\xi}_+ = \frac{1}{\tau_D}\left(\sum_{i\in\Psi_+} u_i w_i - \xi_+\right) \tag{2.1}$$

$$\dot{\xi}_- = \frac{1}{\tau_D}\left(\sum_{i\in\Psi_-} u_i w_i - \xi_-\right) \tag{2.2}$$

$$\dot{\vartheta} = \frac{1}{\tau_A}(u - \vartheta) \tag{2.3}$$

$$u = \begin{cases} 1 - \exp\{(\Theta - \xi_+)\Gamma\} - \xi_- - \mu\vartheta & (u > 0) \\ 0 & (u \le 0), \end{cases} \tag{2.4}$$
where the $w_i$ are the synaptic weights and $\Psi_+$ and $\Psi_-$ are the groups of presynaptic excitatory and inhibitory neurons, respectively. $\xi_+$ and $\xi_-$ can be considered the dendritic delays (or mean membrane potentials) for excitatory and inhibitory inputs, respectively, and $\vartheta$ represents the frequency adaptation observed in real neurons. The values of the neural timing parameters ($\tau_D$, $\tau_A$) and the gain/threshold parameters ($\Gamma$, $\Theta$, $\mu$), as well as the connection weights, are set up so that the simulation results from the model agree with physiological observations. As observed in the real lamprey, 100 copies of interconnected segmental oscillators are connected to form the entire swimming CPG. The brain stem stimulates all neurons within the neural networks by the same amount of excitation (here called global excitation); sufficient stimulation results in oscillation of each individual segment at a frequency that depends on the strength of this global excitation signal. Neurons located at the most rostral segments of the CPG receive an additional amount of so-called extra excitation. The effect of this, interacting with the intersegmental coupling, is to induce a roughly equal relative phase lag between successive segments along the CPG, so that caudally traveling waves of neural activity appear.¹ The global excitation controls the amplitude (mean firing frequency) of the motoneuron outputs as well as the frequency of oscillation of the CPG. The extra excitation alters the intersegmental phase lag largely independently of the global excitation. (For details of the neuron parameters and the Ekeberg model, refer to Ekeberg, 1993.) Note that our implementation of the model does not contain mechanosensory feedback.

¹ In the case of the real lamprey, asymmetrical connections from the CIN neurons to neurons in the caudal direction cause the neural wave to travel from head to tail.
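As an illustration of the unit model in equations 2.1 to 2.4, one forward-Euler integration step can be sketched as follows (our own code; the parameter values are placeholders, not Ekeberg's fitted constants, which differ per neuron type).

```python
import numpy as np

def step_neuron(state, exc_in, inh_in, dt,
                tau_d=0.03, tau_a=0.2, theta=-0.2, gamma=1.8, mu=0.3):
    """One Euler step of equations 2.1 to 2.4. exc_in/inh_in are the summed
    weighted outputs of the presynaptic excitatory/inhibitory populations."""
    xi_p, xi_m, nu = state
    xi_p += (dt / tau_d) * (exc_in - xi_p)              # equation 2.1
    xi_m += (dt / tau_d) * (inh_in - xi_m)              # equation 2.2
    u = 1.0 - np.exp((theta - xi_p) * gamma) - xi_m - mu * nu
    u = max(u, 0.0)                                     # equation 2.4
    nu += (dt / tau_a) * (u - nu)                       # equation 2.3
    return (xi_p, xi_m, nu), u
```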
2.2 A Brief Description of the Controllers Under Investigation. In this letter, we investigate the behavior of four connectionist swimming controllers created in previous studies:

1. The biological controller, hand-crafted by Ekeberg (1993) and based on physiological data
2. Controller 2, evolved by Ijspeert, Hallam, and Willshaw (1999) using Ekeberg's segmental oscillator, with the intersegmental couplings evolved by a GA
3. The efficient controller, the run9 controller obtained through evolution using the bigmin approach (Or, Hallam, Willshaw, & Ijspeert, 2002)
4. The hybrid robust controller, obtained through an evolution that takes robustness of swimming speed against variation in body scale into consideration (Or, 2002)

We implemented the biological controller based on the data presented in Ekeberg (1993); for the other controllers, we used evolutionary algorithms to evolve the neural couplings. Note that although the connections between the neurons within these controllers are different (see Table 1), the neurons that underlie the segmental oscillator are the same, and all four controllers are used to drive the same mechanical lamprey model. Given that the details of how the swimming controllers were evolved have been presented in previous studies, only a brief description is provided here. The three evolved controllers are created using a real-number GA with mutation and crossover operators; chromosomes with genes of real numbers (between 0 and 1) encode the connection weights. The fitness functions used to evolve these controllers share four basic fitness factors that reward controllers able to produce the following features:

1. Generate stable and regular oscillations across the entire CPG.
2. Increase the frequency of oscillation as the global excitation from the brain stem increases.
Table 1: Connection Weight Matrix for the Biological and the Three Evolved Controllers.

From | To   | Biological | Controller 2 | Efficient
EINl | MNl  | 1 [5, 5]   | 1 [4, 1]     | 1 [4, 9]
EINl | EINl | 0.4 [2, 2] | 0.4 [5, 3]   | 0.4 [3, 3]
EINl | LINl | 13 [5, 5]  | 13 [4, 1]    | 13.0 [4, 2]
EINl | CINl | 3 [2, 2]   | 3 [11, 2]    | 3 [11, 1]
LINl | CINl | −1 [5, 5]  | −1 [3, 9]    | −1 [4, 7]
CINl | MNr  | −2 [5, 5]  | −2 [11, 11]  | −2 [4, 4]
CINl | EINr | −2 [1, 10] | −2 [1, 0]    | −2 [5, 5]
CINl | LINr | −1 [1, 10] | −1 [9, 4]    | −1 [11, 5]
CINl | CINr | −2 [1, 10] | −2 [0, 5]    | −2 [5, 7]

Hybrid robust controller (its evolved intrasegmental connectivity does not follow the shared pattern above): 14.6 [12, 11], −4.2 [8, 4], −4.8 [7, 0], −0.6 [4, 6], −2.5 [0, 5], −2.5 [8, 10], −1.3 [1, 5].

Notes: Positive numbers indicate excitatory connections, and negative numbers indicate inhibitory connections. Left and right neurons are indicated by l and r; because of the left-right symmetry, the connection weights of the neurons on the right are not shown. The numbers of extensions to the neighboring segments are shown in brackets: the first entry is the extension in the rostral direction, and the second is the extension in the caudal direction.
3. Increase the phase lag as the extra excitation to the neurons near the head segments increases.
4. Modulate a wide range of swimming speeds by varying either the oscillation frequency or the wavelength of the body undulation.

Ijspeert et al. (1999) used only these criteria to evolve controller 2. When we evolved the efficient controller, we added a fitness factor that encourages high swimming efficiency: during the evolutions, we record the minimum and maximum achievable swimming efficiencies of each controller across a range of global and extra excitation inputs from the brain stem, and controllers that achieve a large minimum swimming efficiency while exhibiting the four features above are rewarded more. The reason for evolving controllers based on a large minimum (rather than maximum) efficiency is that this approach implicitly forces all the measured efficiencies of a controller to be good, as the GA tries to pull up the worst efficiency each controller can achieve.
controllers under this bigmin approach are able to achieve higher swimming efficiency.

In terms of the hybrid robust controller, we use a two-stage evolution. In the first stage, we evolve the intrasegmental connections between the neurons within the same segment. We then select an oscillator that is able to generate a wide range of oscillation frequencies and motoneuron outputs as the prototype segment. In the second stage, 100 copies of this segmental oscillator are created, and we use a GA to evolve the intersegmental couplings for the entire swimming controller. Our fitness function is built on top of the one presented in Ijspeert et al. (1999). It has an additional fitness factor that rewards controllers with the least change in swimming speed at different body scales (normal, normal ±20%). The best evolved controller is called the hybrid robust controller.

2.3 Mechanical Model. We developed a 2D mechanical simulator to study how the muscular activity induced by the model neural networks affects swimming performance. The mechanical model of the lamprey is made to approximate the size and shape of the real animal. The entire length of the model lamprey is 0.3 m. It has an elliptical cross section of constant height (30 mm) and variable width ranging from 1.7 mm to 20 mm. In order to describe the position and the shape of the body over time, a chain of links forming the midline is used. Just like the Ekeberg model, our model lamprey consists of 10 rigid body links connected by nine joints of one degree of freedom each. Each link is represented by its center-of-mass coordinates and the angle it makes with the horizontal x-axis. On each side of the body, muscles connect each link to its immediate neighbors. The muscles are modeled as a combination of springs and dampers, and the outputs from the motoneurons control the spring constants of the corresponding muscles. Since there are 100 neural segments in the entire swimming CPG, each mechanical link corresponds to 10 neural segments. The torque at each joint is controlled by the averaged motoneuron outputs of 10 consecutive neural segments along the CPG, from segments 6 to 95. The reason for skipping the 5 segments at each end of the CPG is that neurons at the head and the tail receive fewer inputs. When extra excitation is given to neurons near the head segments, the asymmetrical connections of some of the neurons in the caudal direction result in neural waves traveling along the body from head to tail. The successive contraction of muscles creates a mechanical wave, which in turn generates inertial forces from the surrounding water that propel the lamprey forward. (For more details on our mechanical model, refer to Or et al., 2002.)

2.4 Calculating Swimming Efficiency. To calculate the swimming efficiency, we used the ratio of the forward swimming speed (U) to the backward mechanical wave speed (V) (Williams, 1986; Carling, Williams, & Bowtell, 1998; Sfakiotakis, Lane, & Davis, 1999). The efficiency and mechanical wave
speed are defined as follows:

e = U/V,    (2.5)
V = λ/T,    (2.6)
where U, V, λ, and T are the swimming speed (m/s), backward mechanical wave speed (m/s), mechanical wavelength (m), and mechanical period (s), respectively. Since neither the neural nor the mechanical simulation is stable at the beginning, the lamprey can end up swimming straight at any angle. The first step in computing the mechanical wave parameters is therefore to rotate the original coordinate system so that the fish is swimming parallel to the x-axis. Following the transformation, we calculated the mechanical periods of body links 2 to 5. (The mechanical period of each body link can be calculated from the difference between the time instants at which two consecutive wave crests pass through the same link.) If the periods of three or more of the mechanical links are defined, the mechanical wavelength can be calculated using the method described by Videler (1993). Note that due to strange head and tail movements caused by the reduction in neural connections at the ends of the swimming CPG, the mechanical period of the second link, T2, is used in equation 2.6 to calculate the mechanical wave speed. Equation 2.5 can then be used to compute the swimming efficiency. (For more details on the algorithms to calculate the swimming efficiency and forward speed, refer to Or, 2002.)

3 Investigation of the Effect of Varying Neural Connectivity on the Maximum Swimming Speed and Efficiency

3.1 Description of Experiments. To test the robustness of the controllers under random variation in connection weights, we conducted the following experiments on each controller and then compared their performance in terms of either maximum achievable speed or efficiency. For a controller to be robust, its maximum achievable speed or efficiency should be similar to the speed or efficiency obtained when noise is not present:2
• Experiment 1: Robustness of swimming speed against random variation in segmental connection weights. For each controller, the excitation combination that corresponds to the maximum swimming speed is chosen as the brain stem input. The connection weights between neurons at a prototype segment are subjected to random variation (with uniform distribution) in weight from ±5% to ±25% in steps of 5%. One hundred copies of this segmental oscillator are then combined to form the swimming CPG (with the number of extensions to neighboring segments unchanged). Note that at each amount of variation, 100 runs (each with a different seed for the random number generator) of neural and mechanical simulations are performed to obtain the average maximum speed over all 100 runs. Thus, for each controller, five averaged speeds are computed across the amounts of variation under investigation.

• Experiment 2: Robustness of swimming speed against random variation in segmental connection weights and intersegmental couplings. This is the same as experiment 1, but in addition, the numbers of rostral and caudal extensions are subjected to random variation. Each extension can be varied by up to three segments.

• Experiment 3: Robustness of swimming efficiency against random variation in segmental connection weights. This is the same as experiment 1, but the excitation combination that corresponds to the maximum swimming efficiency is chosen as the brain stem input instead of the one for maximum swimming speed. At each amount of variation, 100 runs of neural and mechanical simulations are performed to obtain an average of all 100 maximum efficiency values. Thus, for each controller, five averaged efficiency values are computed across the noise range under investigation.

• Experiment 4: Robustness of swimming efficiency against random variation in segmental connection weights and intersegmental couplings. This is the same as experiment 3, but in addition, the numbers of rostral and caudal extensions are subjected to random variation.

2 Although the term noise generally refers to random variation over time, here we use it to mean random variation in neural connections only at the beginning of the simulations.
Note that as each calibrated connection strength is based on the total number of connections to neurons of neighboring segments, the same controllers are subjected to more variation in experiments 2 and 4 than in experiments 1 and 3.

3.2 Methods. Each of the experiments was conducted in the following stages:
• Stage 1: Random variation of connection weights. We use chromosomes with genes of real numbers to represent neural connections. After the chromosomes that encode the segmental and intersegmental connections for a controller are loaded, the connection matrices (one for segmental connection weights and the others for rostral and caudal extensions) are formed. Depending on the experiment, either the former matrix alone or all three matrices are subjected to random variation. Note that random variation is applied only at the beginning of each simulation. To avoid possible turning when the lamprey swims, left-right symmetry of neural connections is maintained.

• Stage 2: Neural simulations of the swimming CPG. The excitation combination corresponding to maximum swimming speed or efficiency is chosen as the brain stem input. Neural simulations are performed. The neural parameters (amplitude, frequency, and phase lag) of the outputs from the simulated swimming CPG are recorded.

• Stage 3: Mechanical simulation of the lamprey swimming in a 2D environment. At this stage, mechanical simulations are performed with the neural waves obtained in stage 2. The swimming speeds and efficiencies are recorded.

• Stage 4: Calculation of percentage change in speed or efficiency. Once the averaged speed or efficiency at each amount of variation is calculated for each controller, the corresponding percentage change in speed or efficiency, with respect to the original values (without noise), is calculated.

• Stage 5: Comparative analysis of swimming controllers within the same experiment.
As mentioned, random noise is added to the neural networks after the connection matrices are formed. Each connection weight is modified using the following equation:

new weight = original weight + original weight · rand · noise range,    (3.1)

where rand is a random number evenly distributed within [−0.5, 0.5] and the noise range can be any value between 0.1 and 0.5 in steps of 0.1. This corresponds to randomly varying each connection weight by ±5% to ±25% (in steps of 5%). Using a similar equation, each intersegmental extension can be varied by up to ±3 segments.3 Table 2 shows how the noise range is related to the maximum amount of variation in connection weights and intersegmental extensions. Note that as 100 copies of the same segmental oscillator are used to form the entire swimming CPG, all segments are subjected to the same amount of noise. In other words, the extensions from each source neuron to the corresponding target neuron in its rostral or caudal neighbors are the same regardless of its location along the CPG.
3 Since the weights are in real numbers, each calculated number of extensions is converted into an integer value by truncating the digits after the decimal point. The result is then assigned to either the rostral or the caudal extension matrix.
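To make the perturbation procedure concrete, here is a minimal Python sketch of equation 3.1 and the 100-run averaging protocol described above. The helper names (perturb_weights, perturb_extensions, averaged_speed) and the simulate_speed stand-in for the combined neural and mechanical simulation are assumptions for illustration, not part of the original implementation.

```python
import random

def perturb_weights(weights, noise_range, rng):
    # Eq. 3.1: new = original + original * rand * noise_range,
    # with rand uniform in [-0.5, 0.5].
    return [w + w * rng.uniform(-0.5, 0.5) * noise_range for w in weights]

def perturb_extensions(extensions, noise_range, rng):
    # Same rule applied to the rostral/caudal extensions; the result
    # is truncated to an integer (footnote 3).
    return [int(e + e * rng.uniform(-0.5, 0.5) * noise_range) for e in extensions]

def averaged_speed(weights, noise_range, simulate_speed, n_runs=100):
    # One data point of experiment 1: average maximum speed over
    # 100 runs, each with a different random seed.
    speeds = []
    for seed in range(n_runs):
        rng = random.Random(seed)
        speeds.append(simulate_speed(perturb_weights(weights, noise_range, rng)))
    return sum(speeds) / n_runs
```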
Table 2: Conversion Table from the Noise Range to the Maximum Ranges of Possible Variation in Connection Weights and Intersegmental Extensions (in Brackets).
Noise Range (%)    Maximum Range of Possible Variation in Connection Weights (%) [Intersegmental Extensions]
10                 ±5 [0]
20                 ±10 [±1]
30                 ±15 [±1]
40                 ±20 [±2]
50                 ±25 [±3]
Table 3: Maximum Speed and Efficiency Achieved by the Different Controllers Without Noise in the CPGs.
Controller                  Speed (m/s) at Excitation Combination    Efficiency at Excitation Combination
Biological controller       0.45 (0.6, 200%)                         0.58 (0.3, 150%)
Controller 2                0.49 (0.9, 40%)                          0.59 (0.4, 50%)
Hybrid robust controller    0.49 (0.9, 0%)                           0.69 (0.6, 80%)
Efficient controller        0.51 (0.9, 40%)                          0.76 (0.4, 100%)
Notes: The excitation combinations at which the maximum speeds and efficiencies occur are given in parentheses. The first number represents global excitation, and the second number represents extra excitation from the brain stem.
3.3 Results and Discussion. Figure 1 shows how the speeds of the four controllers are affected at each given amount of random variation (noise range), while Figure 2 shows how the efficiencies of the controllers are affected. For comparison purposes, the maximum speed and efficiency values achieved by the controllers when there is no noise are shown in Table 3. Note that as we are interested in how much damage or gain a particular controller receives when random changes are made relative to its original form, we measure the relative change in swimming performance; hence, percentages rather than absolute values are used. A bigger percentage drop in speed or efficiency means that the controller's original neural configuration is better optimized for high speed or efficiency (i.e., closer to the peak of a hill in the fitness landscape commonly referred to in the GA literature) than that of the other controllers. Note that if the absolute speed or efficiency of such a controller is still better than that of the rest of the controllers, its peak is relatively higher.

3.3.1 Experiment 1: Robustness of Swimming Speed Against Random Variation in Segmental Connection Weights. The top graph in Figure 1 shows
[Figure 1 appears here. Panels: (A) experiment 1, noise in segmental connections; (B) experiment 2, noise in both segmental and intersegmental connections. Axes: percentage change in speed versus magnitude of the maximum allowable variation in connection weight (%); curves for the biological controller, controller 2, the hybrid robust controller, and the efficient controller.]
how varying the segmental connection weights affects the swimming speed. For all controllers, increasing the variation in weight tends to decrease the swimming speed. This tendency implies that the neural configurations of the four controllers are already nearly optimized for high-speed swimming. The speed of controller 2 changes the least: with ±25% variation in segmental connection weights, its speed drops by only 13.3%. As for the biological controller and the efficient controller, the maximum drops in their speeds are 34.7% and 26.3%, respectively. The hybrid robust controller performs the worst. Even with only ±5% variation in weight, its swimming speed drops by 22.8% (which is worse than the worst performance of controller 2). As the amount of variation increases, the speed of the hybrid robust controller continues to drop. At ±25% variation in segmental connection weights, its speed drops by more than half. The percentage change in the speed curve for this controller indicates that noise in the segmental connection weights alone has a big negative impact on swimming speed.

To summarize, in terms of robustness in speed against noise in segmental connections, controller 2 is the most robust. It is followed by the efficient controller, the biological controller, and then the hybrid robust controller. Although the biological controller and the hybrid robust controller are the least robust, their neural configurations are more optimized for high-speed swimming than those of the other two controllers. The percentage speed change of each controller, with respect to the maximum speed obtained when there is no noise in the segmental connection weights (as reported in Table 3), is shown in Table 4.

3.3.2 Experiment 2: Robustness of Swimming Speed Against Random Variation in Segmental Connection Weights and Intersegmental Couplings. The bottom graph in Figure 1 shows how varying the segmental connection weights and intersegmental couplings affects the swimming speed. For all controllers, swimming speed tends to decrease in the presence of noise. Note that at ±5% noise in both segmental and intersegmental connections, there is no change in speed for the biological controller. As the variation continues to increase, the speed of this controller drops monotonically. The swimming speed achieved by the rest of the controllers drops immediately in the presence of noise. Controller 2 has its speed drop by less than one-fifth in the worst case, which is reasonably good. One interesting point to note is that
Figure 1: Percentage change in the averaged speed with respect to maximum speed obtained with the same CPGs but without noise. (Top) Effect of noise in segmental connection weight only. (Bottom) Effect of noise in both segmental connection weights and intersegmental extensions. Each averaged speed is the average of 100 runs with different random seeds. The error bars correspond to standard errors.
[Figure 2 appears here. Panels: (A) experiment 3, noise in segmental connections; (B) experiment 4, noise in both segmental and intersegmental connections. Axes: percentage change in efficiency versus magnitude of the maximum allowable variation in connection weight (%); curves for the biological controller, controller 2, the hybrid robust controller, and the efficient controller.]
Table 4: Percentage Change in Speed for All Controllers in Experiment 1.

                            Range of Variation in Segmental Connection Weights
Controller                  ±5%      ±10%     ±15%     ±20%     ±25%
Biological controller       −4.2     −13.6    −17.8    −21.4    −34.7
Controller 2                −0.2     −2.7     −6.5     −9.6     −13.3
Hybrid robust controller    −22.8    −39.6    −41.5    −52.6    −51.2
Efficient controller        −2.4     −7.5     −17.4    −15.4    −26.3
Note: Each percentage represents the maximum range of possible variation in segmental connection weights.
Table 5: Percentage Change in Speed for All Controllers in Experiment 2.

                            Range of Variation in Both Segmental and Intersegmental Connections
Controller                  ±5%      ±10%     ±15%     ±20%     ±25%
Biological controller       0.0      −9.1     −14.7    −26.5    −34.7
Controller 2                −10.4    −10.6    −14.9    −16.9    −19.8
Hybrid robust controller    −39.2    −41.5    −45.7    −45.7    −48.6
Efficient controller        −6.9     −13.5    −15.2    −19.2    −30.3
Note: Each percentage represents the maximum range of possible variation in segmental connection weights and intersegmental couplings.
as in experiment 1, the speed of this controller stays fairly constant across different noise ranges. Hence, controller 2 is the most robust in swimming speed against stochastic variation in any kind of neuronal connection. This may be due to its unique neural organization, which allows more flexibility than the other controllers.

As for the hybrid robust controller, it has the greatest reduction in speed. Even with ±5% variation, its swimming speed drops by about 40%, which is worse than the worst case of any other controller. Except for the 30% drop in speed at ±25%, the efficient controller behaves similarly to controller 2. The percentage speed change of each controller, with respect to the maximum speeds reported in Table 3 (obtained when there is no noise in the CPGs), is shown in Table 5.
Figure 2: Percentage change in the averaged efficiency, with respect to maximum efficiency obtained with the same CPGs without noise. (Top) Effect of noise in segmental connection weights alone. (Bottom) Effect of noise in both segmental connection weights and intersegmental extensions. Each averaged efficiency is the average of 100 runs with different random seeds. The error bars correspond to standard errors.
Table 6: Percentage Change in Efficiency for All Controllers in Experiment 3.

                            Range of Variation in Segmental Connection Weights
Controller                  ±5%      ±10%     ±15%     ±20%     ±25%
Biological controller       −0.7     −2.7     −7.6     −10.1    −11.7
Controller 2                −0.7     −1.5     −2.7     −5.0     −10.2
Hybrid robust controller    −36.9    −36.8    −32.5    −28.7    −35.8
Efficient controller        −22.2    −26.8    −29.9    −33.9    −31.2
Note: Each percentage represents the maximum range of possible variation in segmental connection weights.
Table 7: Percentage Change in Efficiency for All Controllers in Experiment 4.

                            Range of Variation in Both Segmental and Intersegmental Connection Weights
Controller                  ±5%      ±10%     ±15%     ±20%     ±25%
Biological controller       −6.7     −8.2     −10.7    −19.1    −20.8
Controller 2                −4.4     −6.4     −7.9     −10.1    −14.4
Hybrid robust controller    −46.6    −43.6    −41.6    −42.4    −45.3
Efficient controller        −33.9    −30.5    −31.6    −22.1    −31.8
Note: Each percentage represents the maximum range of possible variation in segmental connection weights and intersegmental couplings.
3.3.3 Experiment 3: Robustness of Swimming Efficiency Against Random Variation in Segmental Connection Weights. The top graph in Figure 2 shows that the efficiencies of the biological controller and especially controller 2 are extremely stable against variation in segmental connection weights. Although the percentage drop in efficiency increases with the amount of variation in weight, the absolute efficiency at each noise range for these two controllers is similar to that of the corresponding originals. Specifically, the efficiency of the biological controller drops by at most about 12% in the worst case, while that of controller 2 drops by about 10% at most. The extremely good performance of these two controllers indicates that their efficiencies are less affected than the others by noise in the segmental connections. This is good for the lamprey because robustness in efficiency is more important than robustness in swimming speed. However, the hybrid robust controller and the efficient controller have a relatively poor tolerance for noise. With only ±5% variation in segmental connection weights, their efficiencies drop by about 37% and 22%, respectively. This is worse than the
Figure 3: Changes in neural characteristics and speed caused by noise in experiments 1 (left) and 2 (right). The error bars correspond to standard errors.
[Figure 3 appears here. Eight panels (A–H) show amplitude, frequency (Hz), phase lag (%), and speed (m/s) at different noise levels versus the magnitude of the maximum allowable variation in connection weight (%), for experiment 1 (left column) and experiment 2 (right column), with curves for the biological controller, controller 2, the hybrid robust controller, and the efficient controller.]
worst performance of the other two controllers. However, as the amount of noise continues to increase, the percentage change in efficiency of these two controllers remains fairly stable (with about a 10% change at the extremes of the noise range). Note that the percentage drop in efficiency for the hybrid robust controller generally decreases as the amount of variation increases. From the results of experiment 1, we know that the speed of this controller drops as the amount of variation in the segmental connections increases. Since efficiency is defined as e = U/V, this implies that the mechanical wave speed V drops faster than the swimming speed U. Given that V = λ/T and that the mechanical period T is fixed as long as the neural wave remains the same, the mechanical wavelength λ = VT must be dropping rapidly with the increase in noise. The percentage efficiency change of each controller with respect to the maximum efficiency obtained when there is no noise in the segmental connection weights is shown in Table 6.

3.3.4 Experiment 4: Robustness of Swimming Efficiency Against Random Variation in Segmental Connection Weights and Intersegmental Couplings. The bottom graph in Figure 2 shows that both the biological controller and controller 2 are fairly robust in efficiency against the presence of noise in both segmental and intersegmental connections. However, their absolute efficiencies are worse than they were in experiment 3. As the amount of variation increases, controller 2 becomes more robust than the biological controller. The percentage drops in efficiency for controller 2 and the biological controller at a maximum possible change of ±25% in segmental connection weights and intersegmental couplings are at most 15% and 21%, respectively. Given that these percentage drops are relatively low, they are acceptable.

As in experiment 3, the hybrid robust controller and the efficient controller behave similarly to each other. When compared with the biological controller and controller 2, their absolute performances are much worse. Even at only ±5% variation, their efficiencies drop by 46.6% and 33.9%, respectively. The performance of the hybrid robust controller is the worst overall: its efficiency drops by almost half at every noise range. Note that after the initial drop at ±5% variation in any neural connection, the efficiencies of both the hybrid robust controller and the efficient controller stay fairly constant. The variation in efficiency is less than 5% for the hybrid robust controller and about 3% for the efficient controller (except at ±20% variation in connections).
Figure 4: Changes in neural characteristics and efficiency caused by noise in experiments 3 (left) and 4 (right). The error bars correspond to standard errors.
[Figure 4 appears here. Eight panels (A–H) show amplitude, frequency (Hz), phase lag (%), and efficiency at different noise levels versus the magnitude of the maximum allowable variation in connection weight (%), for experiment 3 (left column) and experiment 4 (right column), with curves for the biological controller, controller 2, the hybrid robust controller, and the efficient controller.]
The percentage efficiency change of each controller with respect to the maximum efficiency obtained when there is no noise in the CPGs is shown in Table 7.

4 General Discussion

Except for the efficient controller, each controller performs similarly across the different experiments. In general, the four controllers under investigation fall into three categories. In the first class, the absolute speed or efficiency of the biological controller and controller 2 at each amount of variation (noise range) is close to that of the originals. This means that these controllers sit on top of a plateau in the fitness landscape. The hybrid robust controller, however, forms a class of its own, as its absolute speed or efficiency drops significantly even when there is only a small variation of ±5% in any kind of neural connection. In other words, this controller sits on a sharp, narrow peak in the fitness landscape. The third case is the efficient controller, which behaves differently depending on the experiment. In terms of robustness in swimming speed (experiments 1 and 2), it behaves similarly to both the biological controller and controller 2, but its behavior in terms of robustness in swimming efficiency (experiments 3 and 4) is somewhere between the two other classes of controllers.

Controller 2 performs best in all the experiments. Its swimming speed and efficiency stay relatively constant across the different amounts of variation under investigation. It is thus the most robust in both speed and efficiency against noise in the networks. As for the hybrid robust controller, its performance is the worst in all the experiments described. In other words, this controller is the most unstable when noise is present in the network system.

Having considered the neural characteristics (amplitude, frequency, and phase lag) and swimming performance of each controller across different experiments and different amounts of variation (see Figures 3 and 4), we have observed that the amplitude of the motoneuron outputs from each controller in each experiment tends to be the quantity least affected by the random seed (i.e., each set of 100 runs gives similar amplitudes). It is followed by the frequency of oscillation, the phase lag, and the swimming speed or efficiency. It could be that since the measurements of these parameters depend on an increasing number of other factors, the variations add up. Hence, even with the same amount of variation, the amplitude is less affected than the others. Also, as the amount of variation in connection weights increases, the standard error in both the neural characteristics and the swimming performance tends to increase as well.

5 Conclusion

In this letter, we investigated how changes in neural connectivity affect the swimming speed and efficiency. We observed that controller 2 is the
most robust across the different experiments. It is followed by either the biological controller or the efficient controller (depending on the experiment). The performance of the evolved hybrid robust controller is the worst. Most important, although the absolute swimming performance of the biological controller in terms of maximum speed and efficiency is not as good as that of the artificially evolved controllers (it is actually quite good already), it is relatively robust against noise in the system. The robustness of the model CPG might reflect the robustness that natural evolution has conferred on the real biological circuit. Through interactions with neurobiologists, we could gain a better understanding of how neural connectivity affects behavior.

Acknowledgments

I thank my doctoral thesis committee at the University of Edinburgh for their comments on earlier versions of this letter. Thanks to Shun-ichi Amari at the RIKEN Brain Science Institute for his kind support.

References

Buchanan, J. T., & Grillner, S. (1987). Newly identified "glutamate interneurons" and their role in locomotion in the lamprey spinal cord. Science, 236, 312–314.
Carling, J., Williams, T. L., & Bowtell, G. (1998). Self-propelled anguilliform swimming: Simultaneous solution of the two-dimensional Navier-Stokes equations and Newton's laws of motion. Journal of Experimental Biology, 201, 3143–3166.
Ekeberg, Ö. (1993). A combined neuronal and mechanical model of fish swimming. Biological Cybernetics, 69, 363–374.
Ekeberg, Ö., Wallén, P., Brodin, L., & Lansner, A. (1991). A computer-based model for realistic simulations of neural networks I: The single neuron and synaptic interaction. Biological Cybernetics, 65, 81–90.
Ferrar, C. H., Williams, T. L., & Bowtell, G. (1993). The effect of cell duplication and noise in a pattern generating network. Neural Computation, 5, 587–596.
Hellgren, J., Grillner, S., & Lansner, A. (1992). Computer simulation of the segmental neural network generating locomotion in lamprey by using populations of network interneurons. Biological Cybernetics, 68, 1–13.
Ijspeert, A. J. (1998). Design of artificial neural oscillatory circuits for the control of lamprey- and salamander-like locomotion using evolutionary algorithms. Unpublished doctoral dissertation, University of Edinburgh.
Ijspeert, A. J., Hallam, J., & Willshaw, D. (1999). Evolving swimming controllers for a simulated lamprey with inspiration from neurobiology. Adaptive Behavior, 7(2), 151–172.
Or, J. (2002). An investigation of artificially-evolved robust and efficient connectionist swimming controllers for a simulated lamprey. Unpublished doctoral dissertation, University of Edinburgh.
Or, J., & Hallam, J. (2000). A study on the robustness of the lamprey swimming controllers. In J. A. Meyer, A. Berthoz, D. Floreano, H. L. Roitblat, & S. W. Wilson (Eds.), SAB2000 Proceedings Supplement, Sixth International Conference on Simulation of Adaptive Behaviour (pp. 68–76). Cambridge, MA: MIT Press.
Or, J., Hallam, J., Willshaw, D., & Ijspeert, A. (2002). Evolution of efficient swimming controllers for a simulated lamprey. In B. Hallam, D. Floreano, J. Hallam, G. Hayes, & J. A. Meyer (Eds.), From animals to animats: Proceedings of the Seventh International Conference on the Simulation of Adaptive Behavior (SAB 2002) (pp. 323–332). Cambridge, MA: MIT Press.
Sfakiotakis, M., Lane, D., & Davis, J. (1999). Review of fish swimming modes for aquatic locomotion. IEEE Journal of Oceanic Engineering, 24(2), 237–252.
Videler, J. J. (1993). Fish swimming. London: Chapman and Hall.
Wadden, T., Hellgren, J., Lansner, A., & Grillner, S. (1997). Intersegmental coordination in the lamprey: Simulations using a network model without segmental boundaries. Biological Cybernetics, 76, 1–9.
Wallén, P., Ekeberg, Ö., Lansner, A., Brodin, L., Travén, H., & Grillner, S. (1992). A computer-based model for realistic simulations of neural networks II: The segmental network generating locomotor rhythmicity in the lamprey. Journal of Neurophysiology, 68, 1939–1950.
Williams, T. L. (1986). Mechanical and neural patterns underlying swimming by lateral undulations: Review of studies on fish, amphibia and lamprey. In S. Grillner, P. S. G. Stein, D. Stuart, H. Forssberg, & R. Herman (Eds.), Neurobiology of vertebrate locomotion (pp. 141–155). New York: Macmillan.
Received January 16, 2006; accepted August 3, 2006.
LETTER
Communicated by Hu Sanqing
A Measurement Fusion Method for Nonlinear System Identification Using a Cooperative Learning Algorithm Youshen Xia [email protected] College of Mathematics and Computer Science, Fuzhou University, 350002, Fuzhou, China
Mohamed S. Kamel [email protected] Department of Electrical and Computer Engineering, University of Waterloo, Waterloo N2L 3G1, Canada
Identification of a general nonlinear noisy system, viewed as the estimation of a predictor function, is studied in this article. A measurement fusion method for the predictor function estimate is proposed. In the proposed scheme, observed data are first fused by using an optimal fusion technique, and the optimal fused data are then incorporated in a nonlinear function estimator based on a robust least squares support vector machine (LS-SVM). A cooperative learning algorithm is proposed to implement the proposed measurement fusion method. Compared with related identification methods, the proposed method can minimize both the approximation error and the noise error. The performance analysis shows that the proposed optimal measurement fusion function estimate has a smaller mean square error than the LS-SVM function estimate. Moreover, the proposed cooperative learning algorithm can converge globally to the optimal measurement fusion function estimate. Finally, the proposed measurement fusion method is applied to ARMA signal and spatial temporal signal modeling. Experimental results show that the proposed measurement fusion method can provide a more accurate model.

1 Introduction

Identification of linear and nonlinear systems has received considerable attention and has been widely applied in engineering applications, including signal and image processing, communications, and control (Tekalp, Kaufman, & Woods, 1985; Abed-Meraim, Qui, & Hua, 1997; Ljung, 1999; Leung, Hennessey, & Drosopoulos, 2000). The objective is to find a good estimate of the unknown system model using measurements of the system's input and output. The problem of system identification can be described in different ways: input-output form (Lo, 1994), state-space equations (Chen &
Narendra, 2004), and predictor forms (Roll, Nazin, & Ljung, 2005). In this letter, we focus on identifying nonlinear systems in the predictor form.

Many estimation and prediction techniques for modeling nonlinear systems have been developed from noise-free input and noise-free output measurements (Ljung, 1999; Sjoberg et al., 1995). The well-known kernel method is a basic function approximation technique (Härdle, 1990). Because the kernel method has the asymptotic approximation property, it has become a foundation of many approximation approaches. For example, polynomial approaches, including local approximation techniques, are known to approximate, arbitrarily well, any continuous input-output mapping under relatively mild conditions (Lee & Mathews, 1993; Fan & Gijbels, 1996; Koukoulas & Kalouptsidis, 2000). In addition to their universal approximation capability, neural networks, such as multilayer perceptron (MLP) and radial basis function (RBF) networks, can provide parallel approximation techniques (Hornik, Stinchcombe, & White, 1989; Chen & Billings, 1992; Kadirkamanathan & Niranjan, 1993; Mhaskar & Hahm, 1997).

In order to satisfy the requirement of fast identification in practical applications (Schoukens, Nemeth, Crama, Rolain, & Pintelon, 2003), nonasymptotic methods, where the function estimate minimizes a cost function, have been studied. The popular support vector machine (SVM) method has been applied to regression approximation (Müller et al., 1997; Mukherjee, Osuna, & Girosi, 1997; Smola & Schölkopf, 1998; Rojo-Alvarez, Martinez-Ramon, Prado-Complido, Artes-Rodriguez, & Figueiras-Vidal, 2004). This method makes use of an ε-insensitive cost function so that the parameter learning problem is equivalent to a quadratic convex programming problem. As a good alternative, the least squares support vector machine (LS-SVM) for nonlinear system modeling employs a quadratic cost function, so that the parameter learning problem can be easily handled by solving a set of linear equations (Suykens, Gestel, Brabanter, Moor, & Vandewalle, 2002). Recently, an interesting approach with a unified framework, called direct weight optimization (DWO), for nonlinear system identification at a given data point was presented (Roll et al., 2005). The parameter learning problem of the DWO method is a nonlinear constrained optimization problem. From the computational complexity viewpoint, the LS-SVM method is the best among the above approaches. However, the SVM function estimator minimizes a regularized risk function, while the LS-SVM function estimate minimizes a least squares regularized function. The DWO method minimizes an upper bound of the mean square error. Thus, the DWO method may achieve better accuracy since it minimizes an approximate mean square cost function. In addition, the cross-validation method can obtain an optimal regularization parameter to enhance the estimation accuracy of the SVM and LS-SVM methods, but its computation is very complex, and the resulting computational time is greatly increased.

In practical circumstances, signals may be contaminated by noise, and the presence of measurement noise is very common. Thus, the signal
data that the system receives are not perfect. As a result, the accuracy of the function estimate is affected not only by the approximation error but also by the noise error. The identification of nonlinear systems with measurement noise has been considered, but existing identification methods mainly try to reduce the approximation error without directly considering the effect of the noise error on the accuracy of the function estimate, since noisy data are usually included in their function estimates.

The purpose of this letter is to develop a new estimation method for identifying a general nonlinear noisy system by minimizing both the approximation error and the measurement noise error. Based on an optimal fusion technique, we first construct optimal fused data from the measurement data so that the measurement noise error is minimized. We then pass the optimal fused data to a nonlinear function estimator, based on a robust LS-SVM, that minimizes the approximation error. Moreover, we show that the proposed optimal measurement fusion function estimator has a smaller mean square error than the LS-SVM function estimator. Another contribution of this article is a novel cooperative learning algorithm to implement the proposed measurement fusion method. In addition to having low computational complexity, the proposed algorithm converges globally to the optimal measurement fusion function estimate. Finally, we apply the proposed measurement fusion method to ARMA signal and spatial temporal signal modeling. Computed results show that the proposed cooperative learning algorithm indeed gives a more accurate solution than the related identification algorithms.

This letter is organized as follows. In section 2, the problem of nonlinear system identification is discussed. In section 3, a measurement fusion method for nonlinear noisy system identification is proposed by using an optimal fusion technique and a robust LS-SVM technique. In section 4, a cooperative learning algorithm is presented. In section 5, the improved performance of the proposed method and the global convergence of the proposed cooperative learning algorithm are analyzed. In section 6, the proposed measurement fusion method is applied to the modeling of both ARMA signals and spatial temporal signals. In section 7, several illustrative examples are provided. Concluding remarks are given in section 8.
2 Problem and Model Estimates

We are concerned with the following discrete-time noisy system:

x(t) = g0(ϕx(t)) + w(t),
y(t) = x(t) + n(t),    (2.1)
where g0(·) is the unknown function, ϕx(t) is the regression vector defined by ϕx(t) = [x(t − 1), . . . , x(t − p), u(t − 1), . . . , u(t − q)]ᵀ, u(t) and x(t) are, respectively, the signal input and output of the system, p and q are the orders of the system, y(t) is the observed data, w(t) is driving white noise with zero mean, and n(t) is measurement white noise with zero mean. In the special case that p and q denote the corresponding maximum delays, the noisy system, equation 2.1, becomes a nonlinear autoregressive moving average (NARMA) model (Ljung, 1999). When g0 is linear and q = 0, the noisy system, equation 2.1, is an AR signal system in the presence of noise.

The problem of system identification is as follows: based on received data from the noisy system, equation 2.1, find a smooth function estimate g of g0 such that the mean square error (MSE) criterion,

E[(g(ϕy(t)) − g0(ϕx(t)))²],    (2.2)

is minimized as much as possible, where ϕy(t) = [y(t − 1), . . . , y(t − p), u(t − 1), . . . , u(t − q)]ᵀ. Now

g(ϕy(t)) − g0(ϕx(t)) = g(ϕy(t)) − g(ϕx(t)) + g(ϕx(t)) − g0(ϕx(t)).

Note that the noise n(t) is usually assumed to be independent of the signal processes x(t) and u(t). Then

E[(g(ϕy(t)) − g0(ϕx(t)))²] = E[(g(ϕy(t)) − g(ϕx(t)))²] + E[(g(ϕx(t)) − g0(ϕx(t)))²].    (2.3)

It can be seen that the second term of the MSE mainly comes from the approximation error. Note that g(ϕy(t)) − g(ϕx(t)) ≈ ∇g(ϕx(t))ᵀ(ϕy(t) − ϕx(t)) = O(‖ϕn(t)‖), where ∇g(ϕx(t)) denotes the gradient of g and ϕn(t) = [n(t − 1), . . . , n(t − p), 0, . . . , 0]ᵀ. So the first term of the MSE mainly comes from the measurement noise error. Let

V1 = E[(g(ϕy(t)) − g(ϕx(t)))²],   V2 = E[(g(ϕx(t)) − g0(ϕx(t)))²].

Then minimizing E[(g(ϕy(t)) − g0(ϕx(t)))²] can be converted into minimizing V1 and V2 separately. Since the true regression vector ϕx(t) and g0 are unknown, V2 is not computable. A useful alternative in practice is to minimize E[(g(ϕy(t)) − y(t))²]. Similarly, since V1 is not computable like
V2, we will convert the minimization of V1 into the minimization of the measurement noise error. Most existing identification methods mainly make efforts to minimize the approximation error without directly considering the effect of the noise error on the accuracy of the function estimate. For example, the neural network and local polynomial approaches can, in theory, reduce the approximation error to zero as the number of measurements grows. For nonasymptotic approaches, the support vector machine (SVM) method minimizes the following regularized risk function,

E1(α) = (1/N) Σ_{t=1}^N L(g(ϕy(t)) − y(t)) + β‖α‖₂²,    (2.4)

and its function estimate is given by

g(ϕy(t)) = α0 + Σ_{k=1}^N αk fk(ϕy(t)),    (2.5)

where N is the number of measurements, α = [α1, . . . , αN]ᵀ, ‖·‖₂ denotes the l2 norm, L(g(ϕy(t)) − y(t)) is an ε-insensitive cost function, fk(ϕy(t)) is a kernel function, and ϕy(t) is a regression vector including noise. The weight coefficients {αk} and α0 are obtained by solving a quadratic convex program. The LS-SVM method minimizes another regularized quadratic function,

E2(α) = (1/N) Σ_{t=1}^N (g(ϕy(t)) − y(t))² + β‖α‖₂²,    (2.6)

where the function estimate g(ϕy(t)) has the same structure as equation 2.5 but the weight coefficients are obtained by solving linear equations. Directly based on the observed data y(t), a direct weight optimization (DWO) method was presented for the function estimate

ĝ(ϕ*) = a0 + Σ_{t=1}^N at y(t)    (2.7)

at a given point ϕ*. The function estimate of the DWO method minimizes an upper bound of the MSE, and its weight parameters {at} optimize a nonlinear objective function:

E3(a) = (1/4)(Σ_{t=1}^N |at| ‖ϕy(t) − ϕ*‖₂²)² + σv² Σ_{t=1}^N at².    (2.8)
The identification of nonlinear systems with measurement noise has been considered in these works, but the methods include the observed data in their function estimates. As a result, the measurement noise included in the observed data may cause a large error in the function estimate, since the noise error is not minimized. Different from the above identification methods, we propose a novel measurement fusion method for a good function estimate of g0 that minimizes E[(g(ϕy(t)) − y(t))²] and the measurement noise error separately. More exactly, based on an optimal data fusion technique, in the next section we reconstruct optimal fused data {z*(t)} from a set of multisensor observed data {yi(t)} such that

E[(g(ϕz(t)) − g(ϕx(t)))²],    (2.9)

which may be viewed as an alternative to V1, is minimized as much as possible, where ϕz(t) = [z(t − 1), . . . , z(t − p), u(t − 1), . . . , u(t − q)]ᵀ. We then pass the optimal fused data {z*(t)} to a robust LS-SVM function estimator, so that the measurement fusion function estimator is given by

g(ϕz*(t)) = α0 + Σ_{k=1}^N αk fk(ϕz*(t)),

where ϕz*(t) = [z*(t − 1), . . . , z*(t − p), u(t − 1), . . . , u(t − q)]ᵀ. There are three main advantages of our method. First, the optimal fused data {z*(t)} are shown to have a smaller MSE than the observed data {yi(t)}. Second, the proposed measurement fusion method is shown to have an improved MSE performance over the LS-SVM method. Third, the proposed measurement fusion method can be effectively implemented by a new cooperative learning algorithm.

3 Measurement Fusion Method

In this section, we first study optimal fused data and a robust LS-SVM and then propose a measurement fusion method for the optimal measurement fusion function estimate.

3.1 Optimal Fused Data. With the availability of multiple sensory devices, data fusion has received considerable attention in recent years (Hall & Llinas, 1997). Data fusion is the process of integrating complementary information from multisensor data; its main potential advantage is that the uncertainty of the fused information is minimized (Varshney, 1997).
Consider multisensor observed data with τ (τ ≥ 2) sensors,

yi(t) = x(t) + ni(t)   (i = 1, . . . , τ; t = 1, . . . , N),    (3.1)
where N is the number of sensor measurements, x(t) = g0(ϕx(t)) + w(t), yi(t) is the ith sensor measurement, and ni(t) represents the measurement gaussian noise of the ith sensor, which is assumed to have zero mean and to be uncorrelated with x(t). Such multisensor observed data could be collected by multiple independent observations or by multiple sensory devices with synchronized measurements. Let the fused data be a linear combination of the τ sensor measurements,

z(t) = Σ_{i=1}^τ wi yi(t),

where {wi} are fusion coefficients with Σ_{i=1}^τ wi = 1. In order to minimize the measurement noise error, we find the optimal fused data {z(t)}_{t=1}^N satisfying

minimize   E[(g(ϕz(t)) − g(ϕx(t)))²]
subject to  Σ_{i=1}^τ wi = 1,    (3.2)
where ϕz(t) = [z(t − 1), . . . , z(t − p), u(t − 1), . . . , u(t − q)]ᵀ. Note that E[(g(ϕz(t)) − g(ϕx(t)))²] ≈ δ0 E[‖ϕz(t) − ϕx(t)‖₂²], where δ0 > 0 is a constant. Then finding the optimal fused data can be converted into finding an optimal fusion solution of the following:

minimize   E[‖ϕz(t) − ϕx(t)‖₂²]
subject to  Σ_{i=1}^τ wi = 1.    (3.3)
Since E[‖ϕz(t) − ϕx(t)‖₂²] = E[‖ϕz(t)‖₂²] + E[‖ϕx(t)‖₂²] − 2E[ϕz(t)ᵀϕx(t)] and x(t) is uncorrelated with ni(t),

E[ϕz(t)ᵀϕx(t)] = E[(Σ_{i=1}^τ wi ϕ_{yi}(t))ᵀ ϕx(t)] = E[‖ϕx(t)‖₂²].

It follows that E[‖ϕz(t) − ϕx(t)‖₂²] = E[‖ϕz(t)‖₂²] − E[‖ϕx(t)‖₂²]. Thus, equation 3.3 can be rewritten as

minimize   E[‖ϕz(t)‖₂²]
subject to  Σ_{i=1}^τ wi = 1.    (3.4)
Note that

E[‖ϕz(t)‖₂²] = Σ_{k=1}^p E[z²(t − k)] + Σ_{k=1}^q E[u²(t − k)].

Then, when {yi(t)} is wide-sense stationary (all of its first- and second-order moments are finite and independent of the sample index), solving equation 3.4 is equivalent to solving the following simple optimization problem:

minimize   E[z²(t)]
subject to  Σ_{i=1}^τ wi = 1.    (3.5)
So our optimal fused data are given by z*(t) = Σ_{i=1}^τ wi* yi(t), where {wi*} is the optimal solution of equation 3.5. We will show in section 5 that the optimal fused data are unbiased and have a smaller mean square error than the observed data.

3.2 Robust Least Squares Support Vector Machine. Support vector machines (SVMs), motivated by statistical learning theory, have received considerable attention in the fields of machine learning and engineering applications, including regression approximation (Mukherjee et al., 1997; Smola & Schölkopf, 1998; Rojo-Alvarez et al., 2004). In order to speed up convergence, an interesting alternative, the LS-SVM, was recently developed by Suykens, Gestel et al. (2002). The LS-SVM approach has an advantage over the SVM in that it can easily be implemented by an online algorithm. Consider the problem of approximating the set of data

{(yi, xi)}_{i=1}^r,  xi ∈ R^n, yi ∈ R,    (3.6)
with a regression function of the form

g(x) = Σ_{i=1}^M ŵi φi(x) + b,    (3.7)

where {φi(x)}_{i=1}^M are feature functions defined in a high-dimensional space R^M (possibly an infinite-dimensional space), and {ŵi} and b are weight coefficients of the regression function g(x). The LS-SVM function estimate minimizes the following optimization problem in the primal weight space:

minimize   J(ŵ, e) = (1/2) Σ_{k=1}^M ŵk² + γ Σ_{k=1}^r ek²
subject to  yk = g(xk) + ek   (k = 1, . . . , r).    (3.8)
Define the Lagrangian function of equation 3.8 as

L(αk, b, ek, ŵk) = J(ŵ, e) − Σ_{k=1}^r αk {ŵᵀφ(xk) + b + ek − yk},
where ŵ = [ŵ1, . . . , ŵM]ᵀ and φ(xk) = [φ1(xk), . . . , φM(xk)]ᵀ. The conditions for optimality (Bazaraa, Sherali, & Shetty, 1993) imply that

ŵ = Σ_{k=1}^r αk φ(xk),   Σ_{k=1}^r αk = 0,   αk = γ ek,   ŵᵀφ(xk) + b + ek = yk   (k = 1, . . . , r).

Substituting ŵ = Σ_{k=1}^r αk φ(xk) into equation 3.7, we have the LS-SVM function estimate

g(x) = Σ_{i=1}^r αi φ(x)ᵀφ(xi) + b,
where the weight coefficients αk and b satisfy the following linear equations,

(Q + (1/γ)I₁)α + b er − y = 0,
Σ_{k=1}^r αk = 0,    (3.9)

where α = [α1, . . . , αr]ᵀ, er = [1, . . . , 1]ᵀ ∈ R^r, y = [y1, . . . , yr]ᵀ, I₁ ∈ R^{r×r} is the unit matrix, and Q = (φ(xi)ᵀφ(xj)) ∈ R^{r×r}. According to Mercer's
theorem, one can further choose a kernel function K(x, z) satisfying K(x, z) = φ(x)ᵀφ(z), so that the LS-SVM function estimate can be represented as

g(x) = Σ_{i=1}^r αi K(xi, x) + b.    (3.10)
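As a concrete illustration of equations 3.9 and 3.10, the following Python sketch solves the linear equations as a single (r + 1) × (r + 1) system and evaluates the resulting estimate. The RBF kernel choice and the helper names (rbf_kernel, lssvm_fit, lssvm_predict, sigma) are assumptions for the example, not part of the letter.

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    # K(x, z) = exp(-||x - z||^2 / (2 sigma^2)), a valid Mercer kernel.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def lssvm_fit(X, y, gamma=10.0, sigma=1.0):
    # Stack eq. 3.9, (Q + I/gamma) alpha + b*e - y = 0 and e^T alpha = 0,
    # into one linear system in (alpha, b).
    r = X.shape[0]
    Q = rbf_kernel(X, X, sigma)
    A = np.zeros((r + 1, r + 1))
    A[:r, :r] = Q + np.eye(r) / gamma
    A[:r, r] = 1.0          # b * e_r term
    A[r, :r] = 1.0          # constraint sum_k alpha_k = 0
    sol = np.linalg.solve(A, np.append(y, 0.0))
    return sol[:r], sol[r]  # alpha, b

def lssvm_predict(X_train, alpha, b, X_new, sigma=1.0):
    # Eq. 3.10: g(x) = sum_i alpha_i K(x_i, x) + b.
    return rbf_kernel(X_new, X_train, sigma) @ alpha + b
```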
It is well known that γ is a regularization parameter. If γ is made small, then ‖ŵ‖₂² becomes the dominant term, and thus the error terms ek could be large. Hence, γ should be larger for a small approximation error. In this case, the least squares (LS) cost function becomes dominant, and its estimation is not robust. To overcome this drawback concerning robustness, Suykens, Brabanter, Lukas, and Vandewalle (2002) presented a modification that makes the LS-SVM function estimate robust by weighting the error variables, such that the robust estimates of αk and b are solutions of the following linear equations,
(Q + Dγ)α + b er − y = 0,
Σ_{k=1}^r αk = 0,    (3.11)

where Dγ = diag(1/(λ1γ), . . . , 1/(λrγ)) and {λk} are weighting factors based on the error variables {ek}. Here we propose an alternative modification of the robust LS-SVM function estimate. From γek = αk, we see that the Lagrange multipliers {αk} are directly proportional to the error terms ek, where ek indicates the difference between yk and g(xk). So, to reduce the impact of outliers on the estimates, a bound on the error ek should be imposed, that is, |ek| ≤ ε. As a result, |αk| ≤ γε (k = 1, . . . , r). Thus, the proposed robust estimates of αk and b are solutions of the following piecewise equations,
f(α − (Q + (1/γ)I₁)α − b er + y) − α = 0,
Σ_{k=1}^r αk = 0,    (3.12)

where f(u) = [f(u1), . . . , f(ur)]ᵀ and, for i = 1, . . . , r,

f(ui) = −γε if ui < −γε,   f(ui) = ui if −γε ≤ ui ≤ γε,   f(ui) = γε if ui > γε.
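Since f simply saturates each component of its argument, the proposed estimate can be checked numerically in a few lines of Python. This is a minimal sketch under the notation above; the helper names are hypothetical.

```python
import numpy as np

def f_clip(u, gamma, eps):
    # The piecewise-linear map of eq. 3.12: each component of u is
    # saturated at -gamma*eps and +gamma*eps.
    return np.clip(u, -gamma * eps, gamma * eps)

def residual_312(alpha, b, Q, y, gamma, eps):
    # Residual of the piecewise equations 3.12; a root (alpha, b)
    # gives the proposed robust LS-SVM estimate.
    r = len(alpha)
    u = alpha - (Q + np.eye(r) / gamma) @ alpha - b * np.ones(r) + y
    return np.concatenate([f_clip(u, gamma, eps) - alpha, [alpha.sum()]])
```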
It is easy to see that the solution of the piecewise equations 3.12 satisfies |αi| ≤ γε, and thus |ek| ≤ ε. Compared with the robust estimate of Suykens, Brabanter et al. (2002), the proposed robust estimate approach needs only the two parameters γ and ε and does not need information about the unknown error variables {ek}. Moreover, when γε = +∞, the piecewise equations 3.12 become the linear equations 3.9. So the present modification can be viewed as a significant generalization of the LS-SVM method.

3.3 Measurement Fusion Method. We now propose a measurement fusion method as follows:

Step 1: Compute the optimal weight w* by solving equation 3.5, and let the optimal fused data be z*(t) = Σ_{i=1}^τ wi* yi(t).
Step 2: Compute the optimal weight coefficients of the robust LS-SVM function estimate by solving equation 3.12.
Step 3: Incorporate the optimal fused data into the robust LS-SVM function estimate.

Remark 1. According to the ergodic theorem, we see that

E[z²(t)] = lim_{N→+∞} (1/N) Σ_{t=1}^N z²(t) = lim_{N→+∞} (1/N) Σ_{t=1}^N (Σ_{i=1}^τ wi yi(t))².
So, in practice, finding the optimal solution of equation 3.5 can be replaced by finding the following optimization problem:
minimize
2 τ N 1 wi yi (t) N t=1
subject to
τ
t=1
wi = 1.
(3.13)
i=1
N When sample variance matrix B = N1 t=1 yt ytT is nonsingular, the above optimization problem has a closed-form solution, w∗ = B −1 e/(eT B −1 e), where e = [1, . . . , 1]T ∈ Rτ . However, B may be singular or almost singular due to the dependence of sensor data. The closed-form solution thus gives a poor estimate. To overcome the singularity, we introduce another equivalent formulation. By the well-known projection theorem (Kinderlehrer & Stampcchia, 1980), we see that w∗ = [w1∗ , . . . , wτ∗ ]T is the optimal solution
1600
Y. Xia and M. Kamel
of equation 3.13 if and only if w∗ satisfies w∗ =
I−
1 T ee (I − B)w∗ + e/τ, τ
where ∈ Rτ , I ∈ Rτ ×τ is a unit matrix, and yt = [y1 (t), . . . , yτ (t)]T . Furthermore, the above equation can be written as (τ B + eeT (I − β B))w∗ = (1 − β)eeT Bw∗ + e, ∀β ∈ R. When 0 < β < 1, matrix (τ B + eeT (I − β B)) is nonsingular, we have w∗ = (1 − β)(τ B + eeT (I − β B))−1 eeT Bw∗ + (τ B + eeT (I − β B))−1 e.
(3.14)
Remark 2. In addition to multiple-sensor observed data available, we can create suboptimal fused data directly from single-sensor observed data {y(t)}. Given a sensor number τ ≥ 1, we can combine a set of adjacent data of y(t) such that the suboptimal fused data z∗ (t) = τk=−τ wk∗ y(t + δk) satisfy 1 N N
minimize
t=1
subject to
τ
τ
2 wk∗ y(t
+ δk)
k=−τ
wi = 1,
(3.15)
i=1
where 0 < δ ≤ 1. The suboptimal fused data may still reduce the noise error because when δτ is small enough, we have y(t + δk) ≈ x(t) + nk (t) (k = −τ, . . . , τ ). Remark 3. Xia (1996), developed a continuous-time recurrent neural network for solving linear piecewise equations 3.12 as follows, d dt
α(t) b(t)
=
ˆ 0 + I1 )( f (u(t)) − α(t)) − er erT α(t) (Q erT f (u(t)) − 2erT α(t)
,
(3.16)
ˆ 0 = Q + 1 I1 . According to the where u = α − (Q + γ1 I1 )α − ber + y and Q γ well-known Euler method, a discrete-time iterative algorithm for solving
Cooperative Learning Algorithm for Nonlinear System Identification
1601
equation 3.12 can be the following,
α (k+1)
=
b (k+1)
α (k)
b (k)
− hk
ˆ 0 + I1 ) f u(k) − α (k) − e1 eT α (k) (Q 1 e1T f (u(k) ) − 2e1T α (k)
,
where h k > 0 is a learning step length. 4 Cooperative Learning Algorithm In this section, we present a cooperative learning algorithm to implement the proposed measurement fusion method. Using remarks 1 and 3, we can derive learning algorithms A and B. Learning Algorithm A for Optimal Fused Data Given the sensor number τ , the measurement number N, the initial weight vector w(0) , 0 < β < 1, and δ1 > 0, 0. Compute Aτ = (1 − β)qτ eT B and qτ = Pτ e, where e is defined in equation 3.14 and 1 yt ytT , Pτ = (τ B + eeT (I − β B))−1 , yt = [y1 (t), . . . , yτ (t)]T . N N
B=
t=1
i. Compute w(k) = Aτ w(k−1) + qτ .
(4.1)
ii. If w(k) − w(k−1) 2 ≤ δ1 , then go to iii. Otherwise, let k = k + 1 and go to i. iii. Compute optimal fused data: z(k) (t) =
τ
(k)
wi yi (t), (t = 1, . . . , N)
i=1
Learning Algorithm B for Robust LS-SVM Function Estimate A kernel function K (·, ·), simulation parameters γ and , the initial weight coefficients (α (0) , b (0) ), and δ2 > 0 are given, and data samples {y(t), u(t)}r1 ⊂ N {yi (t), u(t)}t=1 .
1602
Y. Xia and M. Kamel
i. Compute ϕ y (t) = [y(t − 1), y(t − 2), . . . , y(t − p), u(t − 1), . . . , u(t − ˆ = (Q + I1 /γ )/s, yˆ = y/s, and e1 = er /s, where q )]T , and compute Q Q, y, and er are defined in equation 3.12 and Q + I1 /γ er . s= −erT 0 2 ˆ (k) − yˆ + b (k) e1 , and ii. Compute u(k) = ((I − Q)α (k)
f (u(k) ) = [ f (u1 ), . . . , f (ur(k) )]T , where f (ui ) is defined in equation 3.12. iii. Compute
α (k+1) b (k+1)
=
α (k) b (k)
−h
ˆ + I1 ) f u(k) − α (k) − e1 eT α (k) (Q 1 e1T f (u(k) ) − 2e1T α (k)
,
(4.2) where h > 0 is a fixed step length. iv. Compute r (α (k) , b (k) ) = f (u(k) ) − α (k) 22 + (e1T α (k) )2 . If r (α (k) , b (k) ) ≤ δ2 , then output (α (k) , b (k) ). Otherwise, let k = k + 1 and go to step (ii). Cooperative Learning Algorithm Step 1: Find optimal fused data {z∗ (t)} by learning algorithm A. Step 2: Find optimal weights {αi∗ } and b ∗ by learning algorithm B. Step 3: Compute the optimal measurement fusion function estimate given by g(ϕz∗ (t)) =
r
αi∗ K (ϕ y (i), ϕz∗ (t)) + b ∗ , (t = 1, . . . , N)
(4.3)
i=1
where ϕz∗ (t) = [z∗ (t − 1), z∗ (t − 2), . . . , z∗ (t − p), u(t − 1), . . . , u(t − q )]T . Although learning algorithms A and B may be mutually independent, the cooperative learning algorithm can perform the two learning algorithms
Cooperative Learning Algorithm for Nonlinear System Identification
Learning Algorithm (A)
1603
z ∗ (t)
Observed
g(ϕz∗ (t))
Data
Learning
α ∗k , b ∗
Algorithm (B)
Figure 1: Cooperative learning algorithm for nonlinear system identification.
in parallel, and can combine their input results with a feedforward network, which implements equation 4.3. Figure 1 displays a block diagram of the proposed cooperative learning algorithm. Compared with conventional individual learning algorithms for system identification, the proposed cooperative learning algorithm can combine adaptively two individual learning algorithms to have an improved performance in the MSE criterion. Compared with the LS-SVM function estimate, the proposed optimal measurement fusion function estimate has a smaller MSE. Remark 4. In general, the RBF kernel function is a reasonable first choice since it can handle the nonlinear case. (For a detailed explanation, see Hsu, Chang, & Lin, 2003.) 5 Performance Analysis In this section, we prove that the proposed optimal measurement fusion function estimate has an improved performance and that the proposed cooperative learning algorithm converges globally to the optimal measurement fusion function estimate. Theorem 1. Assume that {ni (t)} is a stationary gaussian process with zero mean. Then the optimal fused data are unbiased and have a smaller MSE than observed data {yi (t)}.
1604
Y. Xia and M. Kamel
Proof. Let the optimal fused data be z∗ (t) =
τ
wi∗ yi (t),
i=1
where {wi∗ } are the optimal fusion solution of the following 2 τ wi yi (t) minimize E i=1
subject to
τ
wi = 1.
(5.1)
i=1
Then E[z∗ (t)] =
τ
wi∗ E[yi (t)] =
i=1
τ
wi∗ E[x(t)] = E[x(t)].
i=1
Thus the optimal fused data are unbiased. We now show that z(t) has a smaller MSE than yi (t). On one side, E[(z∗ (t) − x(t))2 ] = E[z∗ (t)2 ] + E[x 2 (t)] − 2E[z∗ (t)x(t)]. Since E[yi (t)x(t)] = E[x(t)x(t)] + E[ni (t)x(t)] = E[x 2 (t)], E[z∗ (t)x(t)] =
τ
wi∗ E[yi (t)x(t)] =
i=1
τ
wi∗ E[x 2 (t)] = E[x 2 (t)].
i=1
Then E[(z∗ (t) − x(t))2 ] = E[z∗ (t)2 ] − E[x 2 (t)] and E[z∗ (t)2 ] = E[(ytT w∗ )2 ] = (w∗ )T E[yt ytT ]w∗ , where yt = [y1 (t), . . . , yτ (t)]T . Since w∗ is the optimal fusion solution, (w∗ )T E[yt ytT ]w∗ =
1 eT R−1 e
,
where e = [1, . . . , 1]T ∈ RT , R = E[yt ytT ] and w∗ = [w1∗ , . . . , wτ∗ ]T . Note that eT R−1 e ≥ eT eλmin (R−1 ) =
eT e . λmax (R)
Cooperative Learning Algorithm for Nonlinear System Identification
1605
It follows that 1 eT R−1 e
≤
λmax (R) λmax (R) = . T e e τ
Let {λi }τ1 be the eigenvalues of R. Then when τ ≥ 2, λmax (R) < R is symmetric and positive definite. Thus,
τ i=1
λi since
τ τ 1 λmax (R) 1 1 ≤ < λ = E[yi2 (t)]. i eT R−1 e τ τ τ i=1
i=1
Hence, E[(z∗ (t) − x(t))2 ] <
τ 1 E[yi2 (t)] − E[x 2 (t)]. τ i=1
On the other side, E[(yi (t) − x(t))2 ] = E[ni2 (t)]. Since E[yi (t)2 ] = E[(x(t) + ni (t))2 ] = E[(x(t)2 ] + E[ni2 (t)], E[(z∗ (t) − x(t))2 ] <
τ 1 (E[x 2 (t)] + E[ni2 (t)]) − E[x 2 (t)] = E[ni2 (t)] τ i=1
since {ni (t)} is a stationary gaussian process. It follows that E[(z∗ (t) − x(t))2 ] < E[(yi (t) − x(t))2 ] (i = 1, . . . , τ ).
(5.2)
Theorem 2. The optimal measurement fusion function estimate has an improved performance in the MSE criterion. Proof. For convenience, we rewrite the robust LS-SVM function estimate, g(ϕ y (t)) =
r
αi K (ϕ y (i), ϕ y (t)) + b ∗ , (k)
(5.3)
i=1
and the optimal measurement fusion function estimate, g(ϕz∗ (t)) =
r i=1
αi K (ϕ y (i), ϕz∗ (t)) + b ∗ . (k)
(5.4)
1606
Y. Xia and M. Kamel
Note that ϕ y (t) − ϕx (t) = [y(t − 1) − x(t − 1), . . . , y(t − p) − x(t − p), 0, . . . , 0]T and ϕz∗ (t) − ϕx (t) = [z∗ (t − 1) − x(t − 1), . . . , z∗ (t − p) − x(t − p), 0, . . . , 0]T . Then we know that the observed regression vector, ϕ y (t) = ϕx (t) + v(t), and the optimal fused regression vector, ϕz∗ (t) = ϕx (t) + v∗ (t), where v j (t) = y(t − j) − x(t − j), v ∗j (t) = z∗ (t − j) − x(t − j), and v(t) = [v1 (t), . . . , v p (t), 0, . . . , 0]T ∈ R p+q , v∗ (t) = [v1∗ (t), . . . , v ∗p (t), 0, . . . , 0]T ∈ R p+q . Assume the noise variance is small. By Taylor formulation, we have g(ϕ y (t)) ≈ g(ϕx (t)) + ∇g(ϕx (t))T v(t) and g(ϕz∗ (t)) ≈ g(ϕx (t)) + ∇g(ϕx (t))T v∗ (t). Then E[(g(ϕ y (t)) − g(ϕx (t)))2 ] ≈ E[(∇g(ϕx (t))T v(t))2 ] and E[(g(ϕz∗ (t)) − g(ϕx (t)))2 ] ≈ E[(∇g(ϕx (t))T v∗ (t))2 ]. Let v p (t) = [v1 (t), . . . , v p (t)]T , v∗p (t) = [v1∗ (t), . . . , v ∗p (t)]T ,
Cooperative Learning Algorithm for Nonlinear System Identification
1607
and denote vector q = [q 1 , . . . , q p ]T , whose elements consist of the previous p elements of ∇g(ϕx (t)). Then 2 p E[(∇g(ϕx (t))T v(t))2 ] = E q i vi (t) i=1
and E[(∇g(ϕx (t))T v∗ (t))2 ] = E
p
2 q i vi∗ (t)
.
i=1
Note that vi (t) and vi∗ (t) are white noise. Then 2 p p q i vi (t) = q i2 E[vi (t)2 ] E[(∇g(ϕx (t))T v(t))2 ] = E i=1
i=1
and E[(∇g(ϕx (t))T v∗ (t))2 ] = E
p
2 q i vi∗ (t)
=
i=1
p
q i2 E[vi∗ (t)2 ].
i=1
By theorem 1, we see that E[vi∗ (t)2 ] < E[vi (t)2 ] for i = 1, . . . , τ . It follows that E[(∇g(ϕx (t))T v∗ (t))2 ] < E[((∇g(ϕx (t))T v(t)))2 ]. Thus, E[(g(ϕz∗ (t)) − g(ϕx (t)))2 ] < E[(g(ϕ y (t)) − g0 (ϕx (t)))2 ]. Note that E[(g(ϕz∗ (t)) − g(ϕx (t)))2 ] = E[(g(ϕz∗ (t)) − g(ϕx (t)))2 ] + E[(g(ϕx (t)) − g0 (ϕx (t)))2 ] and E[(g(ϕ y (t)) − g(ϕx (t)))2 ] = E[(g(ϕ y (t)) − g(ϕx (t)))2 ] + E[(g(ϕx (t)) − g0 (ϕx (t)))2 ].
1608
Y. Xia and M. Kamel
Then E[(g(ϕz∗ (t)) − g(ϕx (t)))2 ] < E[(g(ϕ y (t)) − g(ϕx (t)))2 ].
(5.5)
Theorem 3. Learning algorithm A can converge globally exponentially to the optimal fused data. That is, there exist positive constants η > 0 and µ > 0 such that |z(k) (t) − z∗ (t)| ≤ ηe −µk , ∀k ≥ 1. Proof. The proof is given in the appendix. Theorem 4. Let 0 < h < 1. Learning algorithm B can converge globally to the optimal weights of the robust LS-SVM function estimate. Proof. The proof is given in the appendix. With theorems 2, 3, and 4, we conclude that the proposed cooperative learning algorithm can converge globally to an optimal measurement fusion function estimate with a smaller MSE than the LS-SVM function estimate. Theorem 5. Finally, it should be pointed out that the proposed measurement fusion method can be applied to other function estimates such as the DWO function estimate to reduce the MSE. 6 Two Applications In this section, we apply the proposed measurement fusion method to ARMA signal modeling and spatial temporal signal modeling. 6.1 Application to ARMA Modeling. ARMA models are widely used in many fields such as signal and image processing and digital communications (Tekalp et al., 1985; Tekalp, 1995). Consider the following discrete time linear system, x(t) =
p i=1
a i x(t − i) +
q
b j u(t − j) + w(t),
(6.1)
j=1
where {a i } and {b i } are unknown AR and MA parameters of the system, p and q are known model order, the signals x(t) and u(t) are the output and input of the system, and w(t) is driving white noise with zero mean. Assume that the sensor system is employed and {yi (t)} is a set of noise corrupted measurements of x(t): yi (t) = x(t) + ni (t), (i = 1, . . . , τ ),
Cooperative Learning Algorithm for Nonlinear System Identification
1609
where ni (t) is the additive gaussian white noise with zero mean and τ is a sensor number. Our task is to estimate ARMA modeling including the unknown AR and MA parameter vectors, based on the noisy measurements {yi (t)} and {u(t)}. Note that ARMA model equation 6.1 can be rewritten in the form of equation 2.1 as x(t) = θ T ϕx (t) + w(t), where ϕx (t) = [x(t − 1), . . . , x(t − p), u(t − 1), . . . , u(t − q )]T , θ = [a, b]T , a = [a 1 , . . . , a p ]T , and b = [b 1 , . . . , b q ]T . Thus, we take the linear kernel function in the robust LS-SVM function estimate. We choose a set of data points {y1 (t), ϕ y (t)}r1 as the training set, where ϕ y (t) = [y1 (t − 1), . . . , y1 (t − p), u(t − 1), . . . , u(t − q )]T . By learning algorithm B, we find optimal weight coefficients {αi∗ }, where b is set as zero since there is no constant term in the ARMA model. The robust LS-SVM function estimate is then expressed as g(ϕ y (t)) =
r
αk∗ ϕ y (k)T ϕ y (t),
(6.2)
k=1
and the parameters to be estimated are θ ∗ = [a∗ , b∗ ]T =
r
αk∗ ϕ y (k).
(6.3)
k=1
We next use the fusion technique for improving ARMA modeling. Let N yt = [y1 (t), . . . , yτ (t)]T and B = t=1 yt ytT . By learning algorithm A, we can get the optimal fused data as follows z∗ (t) =
τ
wk∗ yk (t), (t = 1, . . . , N).
k=1
Incorporating the optimal fused data {z∗ (t)} into equation 6.2, we have the optimal measurement fusion function estimate, g(ϕz∗ (t)) =
r k=1
T αk∗ ϕ y (k)
ϕz∗ (t),
(6.4)
1610
Y. Xia and M. Kamel
where ϕz∗ (t) = [z∗ (t − 1), . . . , z∗ (t − p), u(t − 1), . . . , u(t − q )]T . As an immediate result of theorems 1 and 2, we have the following corollary. Corollary 1. The optimal measurement fusion function estimate in equation 6.4 has a smaller MSE than equation 6.2. For robust ARMA system identification, a support vector machine method was developed by Rojo-Alvarez et al. (2004). This approach uses an -insensitive cost function so that the parameter learning problem can be equivalent to a quadratic convex programming problem. By comparison, the present parameter learning problem here is the piecewise equation, which generalizes the parameter learning problem of LS-SVM. The function estimate can be shown to have a smaller MSE than existing SVM and LS-SVM function estimates. 6.2 Application to Spatial Temporal Modeling. The problem of spatial temporal modeling has become of special interest. It arises in many subjects such as radio communications, radio astronomy, radar surveillance, wireless communications, and other fields (Tekalp, 1995; Cressie & Majure, 1997). For instance, radar backscatter from a sea surface is basically a spatial temporal phenomenon (Haykin & Puthusserypady, 1997). The scattered signals come from a moving surface, and thus spatial effects in nearby areas cannot be ignored for accurate radar detection. A spatial temporal dynamical model can be described by the following equations:
x(t, i) = h(x(t − 1, i)) c(t, i) = x(t, i) + n(t, i),
(6.5)
where h is an unknown nonlinear function, x(t, i) is the state trajectory of the dynamical system, c(t, i) is the observation, n(t, i) is the measurement noise with zero mean, t is the discrete-time step, and i is the lattice point or the spatial position. x(t − 1, i) is a state vector for a temporal process. Our goal is to obtain an underlying nonlinear spatial temporal predictor from the noisy measurement c(t, i). To apply the proposed measurement fusion method to spatial temporal modeling, we first reconstruct the state vector as the time-delayed vector at
Cooperative Learning Algorithm for Nonlinear System Identification
1611
the ith cell, xˆ (t − 1, i) = [x(t − 1, i), . . . , x(t − p, i)]T , where p is the embedding dimension in time. Then spatial temporal dynamic modeling is reformulated as the identification problem defined in equation 2.1. We next take sensor number τ ≥ 1 such that {c(t, i + j), j = −τ, . . . , τ } are adjacent observed signals of c(t, i) in the spatial domain. According to the result given by Xia, Leung, and Chan (2006), we see that c(t, i + j) ≈ c(t, i) + vi+ j (t),
(6.6)
where vi+ j (t) is a zero mean process. We combine these adjacent observed signals by using learning algorithm A, where yt = [c(t, i − τ ), . . . , c(t, i), . . . , c(t, i + τ )]T . Let optimal fused data be z∗ (t, i) =
τ
w ∗j c(t, i + j),
j=−τ
where {wk∗ }τk=−τ is the optimal weight coefficients. We then choose a set of data points in a temporal process, {c(t, i), c(t, i)}rt=1 , c(t, i) = [c(t − 1, i), . . . , c(t − p, i)]T , as the training set. By learning algorithm B, we can obtain the optimal weight coefficients, and the robust LS-SVM predictor can be represented by g(c(t, i)) =
τ
αk∗ K (c(k, i), c(t, i)) + b ∗ .
(6.7)
k=1
Finally, after passing the optimal fused data to the robust LS-SVM predictor, we have the optimal measurement fusion predictor, g(z∗ (t, i)) =
τ
αk∗ K (c(k, i), z∗ (t, i)) + b ∗ ,
(6.8)
k=1
where z∗ (t, i) = τj=−τ w ∗j c(t, i + j) and c(t, i + j) = [c(t − 1, i + j), . . . , c(t − p, i + j)]T . As an immediate result of theorems 1 and 2, we have the following corollary:
1612
Y. Xia and M. Kamel
Corollary 2. The optimal measurement fusion predictor in equation 6.8 has a smaller MSE than equation 6.7. Recently, a RBF-CML predictor for modeling of sea clutter was developed (Leung et al., 2000). For spatial temporal modeling, two prediction fusion methods were developed by using high-order cumulants (Xia et al., 2006; Xia & Leung, 2006). Compared with the RBF-CML predictor, the proposed measurement fusion predictor extends the RBF-CML predictor for general spatial temporal modeling. Moreover, unlike learning parameters in the RBF-CML predictor, our learning parameters {w ∗j } and {αk∗ } are optimal solutions of the mean square cost function and the quadratic cost function, respectively. Moreover, the proposed measurement fusion predictor can be shown to have an improved performance in the MSE error. Compared with the two prediction fusion methods, the proposed measurement fusion method can be simply implemented by the proposed cooperative learning algorithm with a low computational complexity. Finally, it should be noted that one earlier spatial temporal method, called kriged Kalman filter modeling, was developed by Mardia, Goodall, Redfern, and Alsonso (1998). The optimal partial prediction is built by the well-known method of kriging, and the temporal effects are analyzed by using the models underlying the Kalman filtering method (Sahu and Mardia, 1999). The space-time Kalman filtering approach by Wikle and Cressie (1999) was then proposed to reduce the numbers of dimensions within the kriged Kalman filter model framework. Kriging is a common interpolation technique used for making spatial inferences at locations, and thus the kriged Kalman filter method is different from the presented method, which is based on the function approximation technique. 7 Illustrative Examples In this section, we give several illustrative examples to demonstrate the effectiveness of the proposed algorithms. The simulation is conducted in Matlab. 7.1 Example 1. Consider the following model system given in Lu, Ju, and Chon (2001),
x(t) = 0.35x(t − 1) + 0.32x(t − 2) + 0.1x(t − 3) 0.54u(t) + 0.34u(t − 1) + 0.23u(t − 2) + 0.21u(t − 3),
(7.1)
where u(t) is a white gaussian process with variance 0.5. Assume that {yi (t)} is a set of noise-corrupted measurements of x(t), yi (t) = x(t) + ni (t), (i = 1, . . . , τ ),
Cooperative Learning Algorithm for Nonlinear System Identification
1613
where ni (t) is measurement white gaussian noise with zero mean and variance 0.1. By using u(t), we collect 500 data samples, which were divided into a training set of 200 data samples and a testing set of 300 data samples. To perform learning algorithm B, we let the regression vector be ϕ y (t) = [y(t − 1), . . . , y(t − 3), u(t), . . . , u(t − 3)]T , and take simulation parameters r = 200, γ = 100, and = 0.5. In view of equation 6.3, we have the following parameter estimation, θˆ = [0.3525, 0.3179, 0.0959, −0.5396, 0.3392, 0.2289, 0.2076]T ,
(7.2)
with the l2 norm error 0.0064. We compare the proposed method with the affine geometry method (Lu et al., 2001) and improved least squares method (Zheng, 1999). The affine geometry method gives one computed result of parameter estimation where the l2 norm error is 0.08, and the improved least squares method has another computed result of parameter estimation where the l2 norm error is 0.016. Thus, the parameter estimation here has a smaller norm error. We next consider system modeling. From equation 6.2, we see that the robust LS-SVM function estimate is given by g(ϕ y (t)) = θˆ T ϕ y (t). Let the initial weight be zero and perform learning algorithm A for the optimal fused data, where τ = 8. The optimal measurement fusion function estimate is thus g(ϕz∗ (t)) = θˆ T ϕz∗ (t), where ϕz∗ (t) = [z∗ (t − 1), . . . , z∗ (t − 3), u(t), . . . , u(t − 3)]T . Figure 2 shows the variation of the predictive signal from 200 to 500 using the proposed function estimate. By comparison, Figure 3 shows the variation of the predictive signal from 200 to 500 using the LS-SVM function estimate with same simulation parameters. Figure 4 displays the absolute error between the original signal and the predictive signal given by two function estimates. From Figure 4 we can see that the proposed function estimate has much smaller errors than the LS-SVM function estimate. 7.2 Example 2. Consider the nonlinear ARMA model system,
x(t) = 0.32x(t − 1) + 0.3x(t − 2) − 0.18x(t − 1)u(t − 1) −0.2u(t − 2)2 + 0.8u(t) − 0.23u(t − 2) + 0.53u(t − 3),
(7.3)
1614
Y. Xia and M. Kamel 15
10
x(t)
5
0
−5
−10
−15 200
250
300
350 t
400
450
500
Figure 2: The variation of two signals versus the number of measurements in example 1. The solid line corresponds to the true signal and the dot line to the signal by the measurement fusion predictor. 15
10
x(t)
5
0
−5
−10
−15 200
250
300
350 t
400
450
500
Figure 3: The variation of two signals versus the number of measurements in example 1. The solid line corresponds to the true signal and the dot line to the signal by the LS-SVM predictor.
Cooperative Learning Algorithm for Nonlinear System Identification
1615
10
9
8
predictive signal error
7
6
5
4
3
2
1 0 200
250
300
350 t
400
450
500
Figure 4: The absolute error variation of two-signal prediction in example 1. The solid line corresponds to the measurement fusion predictor and the dotted line to the LS-SVM predictor.
where u(t) is a white gaussian process with variance 0.4. Assume that {yi (t)} is a set of noise-corrupted measurements of x(t): yi (t) = x(t) + ni (t), (i = 1, . . . , τ ), where ni (t) is measurement white gaussian noise with zero mean and variance 0.5. We collect 250 data samples by using u(t), divided into a training set of 100 data samples and a testing set of 150 data samples. To perform the proposed cooperative learning algorithm, we take the regression vector: ϕ y (t) = [y(t − 1), y(t − 2), u(t), u(t − 1), u(t − 2), u(t − 3)]T . We choose the gaussian RBF function as a kernel function and let r = 200, γ = 200, and = 0.5 in learning algorithm B. We take τ = 8 in learning algorithm A with the zero initial weight. Figure 5 shows the variation of the predictive signal from 100 to 250 using the proposed function estimate. By comparison, Figure 6 shows the variation of the predictive signal from 100 to 250 using the LS-SVM function estimate with same simulation parameters. Figure 7 displays the absolute error between the original signal and the predictive signal given by two function estimates. It can be seen from
1616
Y. Xia and M. Kamel 1.5
1
0.5
x(t)
0
−0.5
−1
−1.5
−2 100
150
200
250
t
Figure 5: The variation of two signals versus the number of measurements in example 2. The solid line corresponds to the true signal and the dotted line to the signal by the measurement fusion predictor. 2
1.5
1
0.5
x(t)
0
−0.5
−1
−1.5
−2
−2.5 100
150
200
250
t
Figure 6: The variation of two signals versus the number of measurements in example 2. The solid line corresponds to the true and the dotted line to the signal by the LS-SVM predictor.
Cooperative Learning Algorithm for Nonlinear System Identification
1617
1.4
1.2
Predictive signal error
1
0.8
0.6
0.4
0.2
0 100
150
200
250
t
Figure 7: The absolute error variation of two-signal prediction in example 2. The solid line corresponds to the measurement fusion predictor and the dotted line to the LS-SVM predictor.
Figure 7 that the function estimate has much smaller errors than the LS-SVM function estimate. Furthermore, to evaluate the fit between the simulated output and the measured output, we use the measure percentage given in Roll et al. (2005). The computed result was obtained by averaging 100 independent Monte Carlo simulations. The LS-SVM method has a fit of 51.8% (±1.8%), while the proposed method gives a better fit of 55.4% (±1.6%). 7.3 Example 3. Consider a nonlinear benchmark system given by Narendra and Li (1996). The system is defined as in state-space form by x1 (t) x1 (t + 1) = sin(x2 (t))( 1+x 2 + 1), 1 (t) 2 2 x2 (t + 1) = x2 (t) cos(x2 (t)) + x1 (t)e −(x1 (t) +x2 (t) )/8 + 1 (t+1) 2 (t+1) + 1+0.5xsin(x + n(t), y(t + 1) = 1+0.5xsin(x 2 (t+1)) 1 (t+1))
u(t)3 , (1+u(t)2 +0.5 cos(x1 (t)+x2 (t))
(7.4) where n(t) is a white gaussian noise with variance 0.1 and the states are assumed not to be measurable. We collect 300 data samples by using a uniformly distributed random input u(t) ∈ [−2.5, 2.5]. Also, to validate the
1618
Y. Xia and M. Kamel 3
2
1
x(t)
0
−1
−2
−3
−4
0
20
40
60
80
100 t
120
140
160
180
200
Figure 8: The variation of two signals versus the number of measurements in example 3. The solid line corresponds to the true signal and the dotted line to the signal by the measurement fusion predictor.
learning model, the input signal u(t) = sin
2πt 2πt + sin , t = 1, . . . , 200 10 25
was used. We take the regression vector as follows: ϕ y (t) = [y(t − 1), y(t − 2), y(t − 3), u(t − 1), u(t − 2), u(t − 3)]T .
To perform learning algorithm A for the optimal fused data, we consider 4 z(t) = i=−4 wi y(t + 0.2i), and let the initial weight be zero. To perform learning algorithm B, we take r = 200, γ = 100, and = 1, and choose the gaussian RBF function as a kernel function. Figure 8 shows the variation of the predictive signal using the proposed function estimate. By comparison, Figure 9 shows the variation of the predictive signal using the LS-SVM function estimate with same simulation parameters. Figure 10 displays the absolute error between the original signal and the predictive signal given by two function estimates. It is easy to see that the function estimate here has smaller errors than the LS-SVM function estimate. Moreover, the LS-SVM method gives the computed result of fit 45.3% (±2.3%), while the proposed
Cooperative Learning Algorithm for Nonlinear System Identification
1619
3
2
1
x(t)
0
−1
−2
−3
−4
0
20
40
60
80
100 t
120
140
160
180
200
Figure 9: The variation of two signals versus the number of measurements in example 3. The solid line corresponds to the true signal and the dotted line to the signal by the LS-SVM predictor. 4
3.5
Predictive signal error
3
2.5
2
1.5
1
0.5
0
0
20
40
60
80
100 t
120
140
160
180
200
Figure 10: The absolute error variation of two-signal prediction in example 3. The solid line corresponds to the proposed measurement fusion predictor and the dotted line to the LS-SVM predictor.
1620
Y. Xia and M. Kamel
Table 1: Computed Results of the Measure Percentage by Two Methods at Different Regularization Parameters in Example 3. γ
2−2
2
24
27
210
213
LS-SVM Proposed
0.57 0.61
0.61 0.66
0.55 0.59
0.50 0.55
0.47 0.52
0.46 0.50
method gives a better fit of 51.7% (±2.1%). Furthermore, the computed result by the proposed method can be compared with the ones obtained by the DWO method and the neural network with 20 hidden sigmoidal given in Roll et al. (2005), where the DWO method gave 49.7% fit, and the neural network with 20 hidden sigmoidal then reaches 47.1% fit. Finally, we compare results of the measure percentage of both the LSSVM method and the proposed method, by choosing exponentially growing sequences of γ , which may be viewed as an approximate cross-validation procedure to identify good regularization parameters (Hsu et al., 2003). Computed results of the measure percentage of two methods are listed in Table 1. It can be seen that the proposed method gives better results than the LS-SVM method. 7.4 Example 4. Consider real-life sea clutter data (Leung et al., 2000). The radar used for data collection is a coherent dual-polarized X-band radar. The radar site was located at 44◦ 36.72’N, 63◦ 25.41’W on a cliff at Osborne Head, Nova Scotia, Canada, facing the Atlantic Ocean about 30 m above mean sea level and an open ocean view of about 130 degrees. The database contains a variety of targets, including small boats and beachball targets. Two data sets are used in this study to investigate the spatial temporal dynamics of sea clutter (Drosopoulos, 1994). They are all staring data (ST1 and ST2), where the antenna is aimed at a single azimuth at all times. Their original radar images are plotted in Figures 11a and 12a, respectively. Here, the y-axis denotes the range, and the x-axis represents the time. When visible, the targets appear in the staring images as a dark horizontal line. Since the target was not anchored, it fluctuates with the waves, causing it to be easier to detect at some times than at others. The target, which is a boat, appears as a solid dark line in the image without any breaking. To perform the proposed cooperative learning algorithm, we choose the simulation parameters, defined in equation 6.8, of p = 10, r = 100, γ = 100, and = 1, and we take the gaussian RBF function as a kernel function. For the measurement fusion, we fuse five neighboring cells in the spatial domain. The resulting images are plotted in Figures 11b and 12b, respectively. We now perform a comparison among the proposed measurement fusion predictor, the robust LS-SVM predictor, and the LS-SVM predictor. Figure 13 displays the mean square prediction errors (MSE) of ST1, where the solid line corresponds to the robust LS-SVM predictor and the
Cooperative Learning Algorithm for Nonlinear System Identification
1621
(a)
(b)
Figure 11: (a) Radar image of the staring data set 1(ST1). (b) Prediction radar image of ST1.
dashed line to the proposed measurement fusion predictor, respectively. Figure 15 displays the mean square prediction errors (MSE) of ST2, where the solid line corresponds to the robust LS-SVM predictor, and the dashed line to the proposed measurement fusion predictor, respectively. Figure 14 displays the MSE of ST1, where the solid line corresponds to the robust LS-SVM predictor with = 1, the dot-dashed line corresponds to the robust LS-SVM predictor with = 16, and the dashed line to the LS-SVM predictor ( = +∞), respectively. Figure 16 displays the MSE of ST2, where
1622
Y. Xia and M. Kamel
(a)
(b)
Figure 12: (a) Radar image of the staring data set 2(ST2). (b) Prediction radar image of ST2.
the solid line corresponds to the robust LS-SVM predictor with = 1, the dot-dashed line corresponds to the robust LS-SVM predictor with = 16, and the dashed line to the LS-SVM predictor, respectively. Based on obtained results, we see that the proposed measurement fusion predictor produces a better model than the robust LS-SVM predictor and LS-SVM predictor. It is because the measurement fusion approach can successfully reduce the MSE by incorporating the extra spatial information in the model. This observation matches our theoretical analysis.
Cooperative Learning Algorithm for Nonlinear System Identification
1623
0
MSE
10
−1
10
−2
10 2000
2100
2200
2300
2400 Range
2500
2600
2700
2800
Figure 13: Prediction error of ST1 data. The solid line corresponds to the robust LS-SVM method with parameter = 1 and the dotted line to the measurement fusion method. 0
MSE
10
−1
10
−2
10 2000
2100
2200
2300
2400
2500
2600
2700
2800
Range
Figure 14: Prediction error of ST1 data. The dotted line corresponds to the LSSVM method, the dot-dashed line corresponds to the robust LS-SVM method with parameter = 16, and the solid line to the robust LS-SVM method with parameter = 1.
1624
Y. Xia and M. Kamel 0
10
−1
MSE
10
−2
10
−3
10 2200
2300
2400
2500
2600
2700 Range
2800
2900
3000
3100
3200
Figure 15: Prediction error of ST2 data. The solid line corresponds to the robust LS-SVM method with parameter = 1 and the dotted line to the measurement fusion method. 0
10
−1
MSE
10
−2
10
−3
10 2200
2300
2400
2500
2600
2700 Range
2800
2900
3000
3100
3200
Figure 16: Prediction error of ST2 data. The dotted line corresponds to the LSSVM method, the dot-dashed line corresponds to the robust LS-SVM method with parameter = 16, and the solid line to the robust LS-SVM method with parameter = 1.
Cooperative Learning Algorithm for Nonlinear System Identification
1625
8 Conclusion In this letter, we developed a novel measurement fusion method for identifying a general nonlinear noisy system. Our method has three main features. First, we created optimal fused data with a smaller MSE than observed data. Second, a measurement fusion function estimate combined the optimal fused data and a robust LS-SVM function estimate together. Theoretically, the proposed optimal measurement fusion estimator has an improved performance with smaller mean square error than existing function estimators without optimal measurement fusion. Third, the proposed measurement fusion method is effectively implemented by a new cooperative learning algorithm, which consists of two parallel learning algorithms and a feedforward network. We showed that the proposed cooperative learning algorithm can converge globally to the optimal measurement fusion estimate. Two applications to ARMA signal modeling and spatial temporal signal modeling were illustrated. Experimental results confirmed that the proposed measurement fusion method can provide a more accurate model than several existing identification methods. Appendix: Proofs of Theorems 3 and 4 A.1 Proof of Theorem 3. First, matrix (τ B + eeT (I − β B)) is nonsingular when 0 < β < 1. Next, let G = (1 − β)Pτ eeT B, where Pτ = (τ B + eeT (I − β B))−1 Then ρ(G) ≤ G2 ≤ |1 − β|ρ(eT B Pτ e) ≤ |1 − β||eT B Pτ e|. Now, −1 −1
B Pτ = (τ I + ee (I − β B)(B) ) T
1 = τ
1 = τ
−1 1 T −1 I + ee (B − β I ) . τ
1 I + eeT (I − β B)(B)−1 τ
−1
1626
Y. Xia and M. Kamel
Using the inverse matrix formulation, we have
I+
−1 1 1 T −1 = I − eeT (B −1 − β I )/(1 + θ ), ee (B − β I ) τ τ
where θ=
1 1 T −1 e (B − β I )e = eT B −1 e − β. τ τ
Note that eT e = τ . It follows that eT B Pτ e = 1 −
θ 1 = . 1+θ 1+θ
Then ρ(G) ≤ G2 ≤ |1 − β|
|1 − β| 1 = <1 |1 + θ | |1 − β + τ1 eT B −1 e|
since eT B −1 e > 0 and 0 < β < 1. So the sequence {wk } converges globally to a point w∗ . We next show that w∗ is an optimal fusion solution of equation 3.13. We easily see that w∗ satisfies w∗ = (1 − β)Pτ eeT Bw∗ + Pτ e. Then Pτ−1 w∗ = (1 − β)eeT Bw∗ + e. That is, (τ B + eeT (I − β B))w∗ = (1 − β)eeT Bw∗ + e. Thus, (τ B + eeT )w∗ = eeT Bw∗ + e. It follows that ∗
w =
1 T I − ee = (I − B)w∗ + e/τ τ
Cooperative Learning Algorithm for Nonlinear System Identification
1627
and
∗
e w =e T
T
1 T ∗ I − ee (I − B)w + e/τ = 1. τ
By the well-known projection theorem (Kinderlehrer & Stampcchia, 1980) we see that w∗ is an optimal fusion solution of equation 3.13. Finally, we have the following: w(k) − w∗ = G(w(k − 1) − w∗ ) = G k (w(0) − w∗ ). Let µ = −ln(η), where η=
|1 − β| |1 − β + τ1 eT B −1 e|
< 1.
Then µ > 0. From the fact that e −µ = η, it follows that w(k) − w∗ 2 ≤ Gk2 w(0) − w∗ 2 ≤ w(0) − w∗ 2 e −µk . It follows that τ τ (k) ∗ wi yi (t) − wi yi (t) ≤ max{Y(t)2 }w(k) − w∗ 2 |z (t) − z (t)| = t (k)
∗
i=1
i=1
≤ max{Y(t)2 }w(0) − w∗ 2 e −µk , t
where Y(t) = [y1 (t), . . . , yτ (t)]T . Thus, proposed learning algorithm A can converge exponentially to the optimal fused data. A.2 Proof of Theorem 4. It is enough to prove that proposed learning algorithm B can converge globally to the optimal weights of the robust LSSVM function estimate. Let z∗ = (α ∗ , b ∗ ) denote the optimal weights of the robust LS-SVM function estimate, and let
F (α, b) =
ˆ e1 Q −e1T 0
α y − , b 0
ˆ and e1 are defined in equation 4.2. According to the projection where Q theorem (Kinderlehrer & Stampcchia, 1980), we see that for any z ∈ Rr +1 , (z − π(z))T (π(z) − z∗ ) ≥ 0, where π(z) = [ f (z1 ), . . . , f (zr ), zr +1 ]T and f (zi ) is defined in equation 3.12.
1628
Y. Xia and M. Kamel
Let z = z(k) − F (α (k) , b (k) ), where z(k) = [α (k) , b (k) ]T is generated by learning algorithm B. Then
T z(k) − F (α (k) , b (k) ) − f (z(k) − F (α (k) , b (k) )) f (z(k) − F (α (k) , b (k) )) − z∗ ≥ 0.
Thus
T z(k) − f (z(k) − F (α (k) , b (k) )) − (F (α (k) , b (k) ) − F (α ∗ , b ∗ )) f (z(k) − F (α (k) , b (k) )) − z∗ ≥ 0.
It follows that
T z(k) − f (z(k) − F (α (k) , b (k) )) F (α (k) , b (k) ) − F (α ∗ , b ∗ ) + z(k) − z∗ T ≥ z(k) − f z(k) − F (α (k) , b (k) ) 22 + z(k) − z∗ (F (α (k) , b (k) ) − F (α ∗ , b ∗ )).
It can be seen that ∗
(k)
ˆ e1 Q
∗
F (α , b ) − F (α , b ) = (k)
α (k) − α ∗
−e1T 0
b (k) − b ∗
.
Then (z(k) − f (z(k) − F (α (k) , b (k) )))T ≥ z
(k)
− f (z
(k)
− F (α , b (k)
(k)
ˆ + I 1 e1 Q −e1T 1 ))22
+ (z
(k)
(z(k) − z∗ ) ∗ T
−z )
ˆ e1 Q −e1T 0
(z(k) − z∗ )
ˆ (k) − α ∗ ) = z(k) − f (z(k) − F (α (k) , b (k) ))22 + (α (k) − α ∗ )T Q(α ≥ z(k) − f (z(k) − F (α (k) , b (k) ))22 + ηα (k) − α ∗ 22 ,
(A.1)
where η > 0. Now, ˆ + I1 )( f (u(k) ) − α (k) ) − e1 eT α (k) (Q 1 e1T f (u(k) ) − 2e1T α (k)
ˆ + I 1 e1 Q = z(k) − h (z(k) − f (z(k) − F (α (k) , b (k) ))). −e1T 1
z(k+1) = z(k) − h
(A.2)
Then, using equation A.1, we have z
(k+1)
−
z∗ 22
= z
(k)
−
z∗ 22
ˆ + I 1 e1 Q (z(k) − f (z(k) − F (α (k) , b (k) )))22 +h −e1T 1
2
Cooperative Learning Algorithm for Nonlinear System Identification
1629
ˆ + I 1 e1 Q −2h(z − z ) (z(k) − f (z(k) − F (α (k) , b (k) ))) −e1T 1
ˆ + I 1 e1 Q (z(k) − f (z(k) − F (α (k) , b (k) )))22 = z(k) − z∗ 22 + h 2 −e1T 1 (k)
∗ T
− 2hz(k) − f (z(k) − F (α (k) , b (k) )))22 − 2hηα (k) − α ∗ 22 . Note that Q ˆ + I1 e1 2 −eT 1 ≤ 2. 1 2 It follows that z(k+1) − z∗ 22 ≤ z(k) − z∗ 22 + 2h 2 z(k) − f (z(k) − F (α (k) , b (k) )))22 − 2hz(k) − f (z(k) − F (α (k) , b (k) )))22 − 2hηα (k) − α ∗ 22 ≤ z(k) − z∗ 22 − 2h(1 − h)z(k) − f (z(k) − F (α (k) , b (k) )))22 − 2hηα (k) − α ∗ 22 , where h(1 − h) > 0. Similar to analysis given in Xia, Leung, and Boss´e (2002), ˆ T . From equation we see that the sequence {z(k) } converges a point zˆ = [α, ˆ b] A.2, we see that
zˆ = zˆ − h
ˆ + I 1 e1 Q −e1T 1
ˆ (ˆz − f (ˆz − F (α, ˆ b))).
That is, ˆ zˆ = f (ˆz − F (α, ˆ b)). Thus, zˆ is a solution of the piecewise equations, 3.12, and hence is the optimal weight coefficients of the robust LS-SVM function estimate. Acknowledgments We thank the editor and reviewers for their encouragement and valued comments, which helped in improving the quality of the letter. References Abed-Meraim, K., Qui, W. Z., & Hua, Y. B. (1997). Blind system identification, Proceedings of the IEEE, 85, 1310–1322.
1630
Y. Xia and M. Kamel
Bazaraa, M. S., Sherali, H. D., & Shetty, C. M. (1993). Nonlinear Programming: Theory and algorithms, Hoboken, NJ: Wiley. Chen, L. J., & Narendra, K. S. (2004). Identification and control of a nonlinear discretetime system based on its linearization: A unified framework, IEEE Transactions on Neural Networks, 15, 663–673. Chen, S., & Billings, S. A. (1992). Neural networks for nonlinear dynamic system modelling and identification. International Journal of Control, 56, 319– 346. Cressie, N., & Majure, J. J. (1997). Spatial-temporal statistical modeling of live-stock waste in streams. Journal of Agricultural, Biological, and Environmental Statistics, 2, 24–47. Drosopoulos, A. (1994). Description of the OHGR database (Tech. Note 94-14). Ottawa, Ont.: Defence Research Establishment. Fan, J., & Gijbels, I. (1996). Local polynomial modelling and its application, London: Chapman and Hall. Hall, D. L., & Llinas, J. (1997). An introduction to multisensor data fusion. Proceedings of the IEEE, 85, 6–23. H¨ardle, W. (1990). Applied nonparametric regression. Cambridge: Cambridge University Press. Haykin, S., & Puthusserypady, S. (1997). Chaotic dynamics of sea clutter. Chaos, 7, 777–802. Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2, 359–366. Hsu, C.-W., Chang, C. C., & Lin, C. J. (2003). A practical guide to support vector classification (Tech. Rep.). Taiwan: Department of Computer Science and Information Engineering, National Taiwan University. Kadirkamanathan, V., & Niranjan, M. (1993). A function estimation. Neural Computation, 5, 954–975. Kinderlehrer, D., & Stampcchia, G. (1980). An introduction to variational inequalities and their applications. New York: Academic Press. Koukoulas, P., & Kalouptsidis, N. (2000). Second-order Volterra system identification. IEEE Transactions on Signal Processing, 48, 3574–3577. Lee, J., & Mathews, V. J. (1993). A fast recursive least squares adaptive second-order Volterra filter and its performance analysis. IEEE Trans. Signal Processing, 41, 1087–1097. Leung, H., Hennessey, G., & Drosopoulos, A. (2000). Signal detection using the radial basis function coupled map lattice. IEEE Transactions on Neural Networks, 11, 1133–1151. Ljung, L. (1999). System identification: Theory for the user (2nd ed.), Upper Saddle River, NJ: Prentice-Hall. Lo, J. T. (1994). Synthetic approach to optimal filtering. IEEE Transactions on Neural Networks, 5, 803–811. Lu, S., Ju, K. H., & Chon, K. H. (2001). A new algorithm for linear and nonlinear ARMA model parameter estimation using affine geometry. IEEE Transactions on Biomedical Engineering, 48, 1116–1122. Mardia, K. V., Goodall, C., Redfern, E. J., & Alsonso, F. J. (1998). The kriged Kalman filter. Test, 7, 217–285.
Cooperative Learning Algorithm for Nonlinear System Identification
1631
Mhaskar, H. N., & Hahm, N. (1997). Neural networks for functional approximation and system identification. Neural Computation, 9, 143–159. Mukherjee, S., Osuna, E., & Girosi, F. (1997). Nonlinear prediction of chaotic time series using support vector machines. Proc. of IEEE NNSP (pp. 24–26). Piscataway, NJ: IEEE Press. ¨ Muller, K. R., Smola, A. J., Ratsch, G., Scholkopf, B. S., Kohlmorgen, J., & Vapnik, V. (1997). Predicting time series with support vector machines: Proceedings of ICANN. Berlin: Springer. Narendra, K. S., & Li, S. M. (1996). Neural networks in control systems. In P. Smolensky, M. C. Mozer, & D. E. Rumelhart (Eds.), Mathematical perspectives on neural networks, (pp. 374–394). Mahwah, NJ: Erlbaum. Rojo-Alvarez, J. L., Martinez-Ramon, M., Prado-Cumplido, M. de, Artes-Rodriguez, A., & Figueiras-Vidal, A. R. (2004). Support vector method for robust ARMA system identification. IEEE Transactions on Signal Processing, 52, 155– 164. Roll, J., Nazin A., & Ljung, L. (2005). Nonlinear system identification via direct weight optimization. Automatica, 41, 475–490. Sahu, S. K., & Mardia, K. V. (1999). A Bayesian kriged Kalman model for short-term forecasting of air pollution levels. Applied Statistics, 54, 223–244. Schoukens, J., Nemeth, J. G., Crama, P., Rolain, Y., & Pintelon, R. (2003). Fast approximate identification of nonlinear systems. Automatica, 39, 1267–1274. Sjoberg, J., Zhang, Q. H., Ljung, L., Benveniste, A., Delyon, B., Glorennec, P. Y., Hjalmarsson, H., & Juditsky, A. (1995). Nonlinear black-box modeling in system identification: A unified overview. Automatica, 31, 1691–1724. ¨ Smola, A. J., & Scholkopf, B. S. (1998). On a Kernel-based method for pattern recognition, regression, approximation, and operator inversion. Algorithmica, 22, 211– 231. Suykens, J. A. K., Gestel, T. V., Brabanter, J. D., Moor, B. D., & Vandewalle, J. (2002). Least squares support vector machines. Singapore: World Scientific. Suykens, J. A. K., Brabanter, J., Lukas, L., & Vandewalle, J. (2002). Weighted least squares support vector machines: Robustness and sparse approximation. Neurocomputing, 48, 85–105. Tekalp, A. M. (1995). Digital video processing. Upper Saddle River, NJ: Prentice Hall. Tekalp, A. M., Kaufman, H., Woods, J. W. (1985). Fast recursive estimation of the parameters of a space-varying autoregressive image model. IEEE Transactions on Acoustics, Speech, and Signal Processing, 33, 469–472. Varshney, P. K. (1997). Multisensor data fusion. Journal of Electronics and Communication Engineering, 9, 245–253. Wikle, C. K., & Cressie, N. (1999). A dimension-reduced approach to space-time Kalman filtering. Biometrika, 86, 815–829. Xia, Y. S. (1996). A new neural network for solving linear and quadratic programming problems. IEEE Transactions on Neural Networks, 7, 1544–1547. Xia, Y. S., & Leung, H. (2006). Spatial temporal prediction based on optimal fusion. IEEE Transactions on Neural Networks, 17, 975–988. Xia, Y. S., Leung, H., & Boss´e, E. (2002). Neural data fusion algorithms based on a linearly constrained least square method. IEEE Transactions on Neural Networks, 13, 320–329.
1632
Y. Xia and M. Kamel
Xia, Y. S., Leung, H., & Chan, H. (2006). A prediction fusion method for reconstructing spatial temporal dynamics using support vector machines. IEEE Transactions on Circuits and Systems—Part II Express Briefs, 53, 62–66. Zheng, W. X. (1999). A least-squares based method for autoregressive signals in the presence of noise. IEEE Transactions on Circuits and Systems II-Analog and Digital Signal Processing, 46, 81–85.
Received December 16, 2005; accepted August 23, 2006.
LETTER
Communicated by John Platt
Efficient Computation and Model Selection for the Support Vector Regression Lacey Gunter [email protected]
Ji Zhu [email protected] Department of Statistics, University of Michigan, Ann Arbor, MI 48109, U.S.A.
In this letter, we derive an algorithm that computes the entire solution path of the support vector regression (SVR). We also propose an unbiased estimate for the degrees of freedom of the SVR model, which allows convenient selection of the regularization parameter. 1 Introduction The support vector regression (SVR) is a popular tool for function estimation problems, and it has been widely used in many applications in the past decade, such as signal processing (Vapnik, Golowich, & Smola, 1996), time ¨ series prediction (Muller et al., 1997), and neural decoding (Shpigelman, Crammer, Paz, Vaadia, & Singer, 2004). In this letter, we focus on the regularization parameter of the SVR and propose an efficient algorithm that computes the entire regularized solution path. We also propose an unbiased estimate for the degrees of freedom of the SVR, which allows convenient selection of the regularization parameter. We briefly introduce the SVR and refer readers to Vapnik (1995) and ¨ Smola and Scholkopf (2004) for a detailed tutorial. Suppose we have a set of training data (x 1 , y1 ), . . . , (x n , yn ), where the input x i ∈ R p is a vector with p predictor variables (attributes), and the output yi ∈ R denotes the response (target value). The standard criterion for fitting a linear -SVR is: 1 β2 + C (ξi + δi ), 2 n
min β0 ,β
subject to
(1.1)
i=1
yi − β0 − β T x i ≤ + ξi , β0 + β T x i − yi ≤ + δi , ξi , δi ≥ 0, i = 1, . . . n.
The idea is to disregard errors as long as they are less than , and the ξi , δi are nonnegative slack variables that allow for deviations larger than . C is Neural Computation 19, 1633–1655 (2007)
C 2007 Massachusetts Institute of Technology
L. Gunter and J. Zhu
2.5
1634
Right
1.0
Loss
1.5
2.0
Left
0.5
Center Elbow R
0.0
Elbow L
−3
−2
−1
0
1
2
3
y−f
Figure 1: The -insensitive loss function. Depending on the values of (yi − f i ), the data points can be divided into five sets: left, right, center, left elbow, and Right elbow.
a cost parameter that controls the trade-off between the flatness of the fitted model and the amount up to which deviations larger than are tolerated. Alternatively, we can formulate the problem using a loss + penalty cri¨ terion (Vapnik, 1995; Smola & Scholkopf, 2004): min β0 ,β
n yi − β0 − β T x i + λ β T β, 2
(1.2)
i=1
where |r | is the so-called -insensitive loss function: |r | =
0,
if |r | ≤ ,
|r | − ,
otherwise.
Figure 1 plots the loss function. Notice that it has two nondifferentiable points at ±. The penalty is the L 2 -norm of the coefficient vector, the same as that used in a ridge regression (Hoerl & Kennard, 1970). The regularization parameter λ in equation 1.2 corresponds to 1/C, with C in equation 1.1, and it controls the trade-off between the -insensitive loss and the complexity of the fitted model.
Efficient Computation and Model Selection for SVR
1635
In practice, one often maps x onto a high- (often infinite) dimensional reproducing kernel Hilbert space (RKHS), and fits a nonlinear kernel SVR ¨ model (Vapnik, 1995; Smola & Scholkopf, 2004): n yi − f (x i ) + λ f 2 , HK f ∈H K 2
min
(1.3)
i=1
where H K is a structured RKHS generated by a positive definite kernel K (x, x ). This includes the entire family of smoothing splines and additive and interaction spline models (Wahba, 1990). Some other popular choices of K (·, ·) in practice are dth degree polynomial : K (x, x ) = (1 + x, x )d , Radial basis : K (x, x ) = exp(−x − x 2 /2σ 2 ), where d and σ are prespecified parameters. Using the representer theorem (Kimeldorf & Wahba, 1971), the solution to equation 1.3 has a finite form: 1 θi K (x, x i ). λ n
f (x) = β0 +
(1.4)
i=1
Notice that we write f (x) in a way that involves λ explicitly, and we will see later that θi ∈ [−1, 1]. Given the format of the solution 1.4, we can rewrite equation 1.3 in a finite form: n n n n 1 1 θi K (x i , x i ) + θi θi K (x i , x i ), min yi − β0 − β0 ,θ λ 2λ i=1
i =1
(1.5)
i=1 i =1
which we will focus on for the rest of the letter. Both equations 1.2 and 1.5 can be transformed into a quadratic programming problem; hence, most commercially available packages can be used to solve the SVR. In the past, many specific algorithms for the SVR have been developed, for example, interior point algorithms (Vanderbei, ¨ 1994; Smola & Scholkopf, 2004), subset selection algorithms (Osuna, Freund, & Girosi, 1997; Joachims, 1999), and sequential minimal optimization (Platt, 1999; Keerthi, Shevade, Bhattacharyya, & Murthy, 1999; Smola & ¨ Scholkopf, 2004). All these algorithms solve the SVR for a prefixed regularization parameter λ (or equivalently C). As in any other smoothing problem, to get a good fitted model that performs well on future data (i.e., small generalization error), choice of the regularization parameter λ is critical.
1636
L. Gunter and J. Zhu
In practice, people usually prespecify a grid of values for λ that cover a wide range and then use either cross-validation (Stone, 1974) or an upper bound of the generalization error (Chang & Lin, 2005) to select a value for λ. In this letter, we make two main contributions:
r
r
We show that the solution θ (λ) is piecewise linear as a function of λ, and we derive an efficient algorithm that computes the exact entire solution path {θ (λ), 0 ≤ λ ≤ ∞}, ranging from the least regularized model to the most regularized model. Using the framework of Stein’s unbiased risk estimation (SURE) theory (Stein, 1981), we propose an unbiased estimate for the degrees of freedom of the SVR model. Specifically, we consider the number of data points that are exactly on the two elbows (see Figure 1): yi − f i = ±. This estimate allows convenient selection of the regularization parameter λ.
We acknowledge that some of these results are inspired by one of the author’s earlier work in the support vector machine setting (Hastie, Rosset, Tibshirani, & Zhu, 2004). Before delving into the technical details, we illustrate the concept of piecewise linearity of the solution path with a simple example. We generate 10 training observations using the famous sinc(·) function:
y=
sin(π x) + e, πx
(1.6)
where x is distributed as Uniform(−2, 2), and e is distributed as Normal(0, 0.22 ). We use the SVR with a one-dimensional spline kernel (Wahba, 1990), K (x, x ) = 1 + k1 (x)k1 (x ) + k2 (x)k2 (x ) − k4 (|x − x |),
(1.7)
where k1 (·) = · − 1/2, k2 = (k12 − 1/12)/2, k4 = (k14 − k12 /2 + 7/240)/24. Figure 2 shows a subset of the piecewise linear solution path θ(λ) as a function of λ. The rest of the letter is organized as follows. In section 2, we derive the algorithm that computes the entire solution path of the SVR. In section 3, we propose an unbiased estimate for the degrees of freedom of the SVR, which can be used to select the regularization parameter λ. In section 4, we present numerical results on both simulation data and real-world data. We conclude with a discussion section.
1637
−1.0
−0.5
θ
0.0
0.5
1.0
Efficient Computation and Model Selection for SVR
1
2
3
λ
4
5
Figure 2: A subset of the solution path θ(λ) as a function of λ. Different line types correspond to θi (λ) of different data points, and all paths are piecewise linear.
2 Algorithm 2.1 Problem Setup. Criterion 1.5 can be rewritten in an equivalent way: min β0 ,θ
subject to
n 1 T (ξi + δi ) + θ Kθ , 2λ i=1
−(δi + ) ≤ yi − f (x i ) ≤ (ξi + ), ξi , δi ≥ 0, i = 1, . . . , n,
where 1 θi K (x i , x i ), i = 1, . . . , n λ i =1 K (x 1 , x 1 ) · · · K (x 1 , x n ) .. .. .. . K= . . . K (x n , x 1 ) · · · K (x n , x n ) n×n n
f (x i ) = β0 +
1638
L. Gunter and J. Zhu
For the rest of the letter, we assume the kernel matrix K is positive definite. Then the above setting gives us the Lagrangian primal function: LP :
n n 1 T (ξi + δi ) + αi (yi − f (x i ) − ξi − ) θ Kθ + 2λ i=1
−
i=1
n
γi (yi − f (x i ) + δi + ) −
i=1
n
ρi ξi −
i=1
n
τi δi .
i=1
Setting the derivatives to zero, we arrive at: ∂ : ∂θ ∂ : ∂β0
θi = αi − γi n
αi =
i=1
n
(2.1) γi
(2.2)
i=1
∂ : ∂ξi
αi = 1 − ρi
(2.3)
∂ : ∂δi
γi = 1 − τi ,
(2.4)
where the Karush-Kuhn-Tucker conditions are αi (yi − f (x i ) − ξi − ) = 0
(2.5)
γi (yi − f (x i ) + δi + ) = 0
(2.6)
ρi ξi = 0
(2.7)
τi δi = 0.
(2.8)
Since the Lagrange multipliers must be nonnegative, we can conclude from equations 2.3 and 2.4 that both 0 ≤ αi ≤ 1 and 0 ≤ γi ≤ 1. We also see from equations 2.5 and 2.6 that if αi is positive, then γi must be zero, and vice versa. These lead to the following relationships: yi yi yi yi yi
− − − − −
f (x i ) > f (x i ) < − f (x i ) ∈ (−, ) f (x i ) = f (x i ) = −
⇒ ⇒ ⇒ ⇒ ⇒
αi αi αi αi αi
= 1, = 0, = 0, ∈ [0, 1], = 0,
ξi ξi ξi ξi ξi
> 0, = 0, = 0, = 0, = 0,
γi γi γi γi γi
= 0, = 1, = 0, = 0, ∈ [0, 1],
δi δi δi δi δi
=0 >0 =0 =0 = 0.
Notice from equation 2.1 that for every λ, θi is equal to (αi − γi ). Hence, using these relationships, we can define the following sets that will be used later when we calculate the regularization path of the SVR:
Efficient Computation and Model Selection for SVR
r r r r r
1639
R = {i : yi − f (x i ) > , θi = 1} (right of the elbows) ER = {i : yi − f (x i ) = , 0 ≤ θi ≤ 1} (right elbow) C = {i : − < yi − f (x i ) < , θi = 0} (center) EL = {i : yi − f (x i ) = −, −1 ≤ θi ≤ 0} (left elbow) L = {i : yi − f (x i ) < −, θi = −1} (left of the elbows).
For points in R, L, and C, the values of θi are known; therefore, the algorithm will focus on points resting at the two elbows ER and EL . 2.2 Initialization. Initially, when λ = ∞, we can see from equation 1.4 that f (x) = β0 . We can determine the value of β0 via a simple onedimensional optimization. For simplicity, we focus on the case that all the values of yi are distinct, and furthermore, the initial sets ER and EL have at most one point combined (which is the usual situation). In this case, β0 will not be unique, and each of the θi will be either −1 or 1. This is so because we can easily shift β0 a little to the left or the right such that our θi do not change and we will still get the same error. Since β0 is not unique, we can focus on one particular solution path, for example, by always setting β0 equal to one of its boundary values (thus keeping one point at an elbow). As λ decreases, the range of β0 shrinks, and at the same time, one or more points come toward the elbow. When a second point reaches an elbow, the solution of β0 becomes unique, and the algorithm proceeds from there. We note that this initialization is different from that of classification. In classification, quadratic programming is needed to solve for the initial solution, which can be computationally expensive (Hastie et al., 2004). However, this is not the case for regression. In the case of regression, the output y is continuous; hence, it is almost certain that no two yi ’s will be in ER and EL simultaneously when λ = ∞. 2.3 The Path. The algorithm focuses on the sets of points ER and EL . These points have either f (x i ) = yi − with θi ∈ [0, 1] or f (x i ) = yi + with θi ∈ [−1, 0]. As we follow the path, we will examine these sets until one or both of them change, at which time we will say an event has occurred. Thus, events can be categorized as: 1. The initial event, for which two points must enter the elbow(s). 2. A point from R has just entered ER , with θi initially 1. 3. A point from L has just entered EL , with θi initially −1. 4. A point from C has just entered ER , with θi initially 0. 5. A point from C has just entered EL , with θi initially 0. 6. One or more points in ER and/or EL have just left the elbow(s) to join either R, L, or C, with θi initially 1, −1, or 0, respectively.
Until another event occurs, all sets remain the same. As a point passes through ER or EL, its respective θ_i must change from 1 → 0 or −1 → 0 or vice versa. Relying on the fact that f(x_i) = y_i − ε or f(x_i) = y_i + ε for all points in ER or EL, respectively, we can calculate θ_i for these points. We use the index ℓ to denote the sets above immediately after the ℓth event has occurred, and let θ_i^ℓ, β_0^ℓ, and λ_ℓ be the parameter values immediately after the ℓth event. Also let f^ℓ be the function at this time. We define for convenience β_{0,λ} = λ·β_0 and hence β_{0,λ}^ℓ = λ_ℓ·β_0^ℓ. Then, since

f(x) = (1/λ) [ β_{0,λ} + Σ_{i=1}^n θ_i K(x, x_i) ],

for λ_{ℓ+1} < λ < λ_ℓ, we can write

f(x) = [ f(x) − (λ_ℓ/λ) f^ℓ(x) ] + (λ_ℓ/λ) f^ℓ(x)
     = (1/λ) [ β_{0,λ} − β_{0,λ}^ℓ + Σ_{i∈(ER^ℓ ∪ EL^ℓ)} (θ_i − θ_i^ℓ) K(x, x_i) + λ_ℓ f^ℓ(x) ],
where the reduction occurs in the second line since the θ_i's are fixed for all points in R^ℓ, L^ℓ, and C^ℓ, and all points remain in their respective sets. Define |ER^ℓ| = nER and |EL^ℓ| = nEL, so for the nER + nEL points staying at the elbows, we have

y_k − ε = (1/λ) [ β_{0,λ} − β_{0,λ}^ℓ + Σ_{i∈(ER^ℓ ∪ EL^ℓ)} (θ_i − θ_i^ℓ) K(x_k, x_i) + λ_ℓ f^ℓ(x_k) ], ∀k ∈ ER^ℓ,
y_m + ε = (1/λ) [ β_{0,λ} − β_{0,λ}^ℓ + Σ_{i∈(ER^ℓ ∪ EL^ℓ)} (θ_i − θ_i^ℓ) K(x_m, x_i) + λ_ℓ f^ℓ(x_m) ], ∀m ∈ EL^ℓ.

To simplify, let ν_i = θ_i − θ_i^ℓ, i ∈ (ER^ℓ ∪ EL^ℓ), and ν_0 = β_{0,λ} − β_{0,λ}^ℓ. Then

ν_0 + Σ_{i∈(ER^ℓ ∪ EL^ℓ)} ν_i K(x_k, x_i) = (λ − λ_ℓ)(y_k − ε), ∀k ∈ ER^ℓ,
ν_0 + Σ_{i∈(ER^ℓ ∪ EL^ℓ)} ν_i K(x_m, x_i) = (λ − λ_ℓ)(y_m + ε), ∀m ∈ EL^ℓ.
Also, by condition 2.2, we have

Σ_{i∈(ER^ℓ ∪ EL^ℓ)} ν_i = 0.
This gives us nER + nEL + 1 linear equations that we can use to solve for the nER + nEL + 1 unknown variables ν_i and ν_0. Now define K^ℓ to be a (nER + nEL) × (nER + nEL) matrix whose first nER rows contain the entries K(x_k, x_i), k ∈ ER^ℓ and i ∈ (ER^ℓ ∪ EL^ℓ), and whose remaining nEL rows contain the entries K(x_m, x_i), m ∈ EL^ℓ and i ∈ (ER^ℓ ∪ EL^ℓ). Let ν denote the vector with first nER components ν_k, k ∈ ER^ℓ, and remaining nEL components ν_m, m ∈ EL^ℓ. Finally, let y be the (nER + nEL) vector with first nER entries (y_k − ε) and last nEL entries (y_m + ε). Using these, we have the following two equations:

ν_0 1 + K^ℓ ν = (λ − λ_ℓ) y,  (2.9)
ν^T 1 = 0.  (2.10)

Simplifying further, if we let

A^ℓ = ( 0   1^T ),   ν̃ = ( ν_0 ),   y_0 = ( 0 ),
      ( 1   K^ℓ )         ( ν   )          ( y )

then equations 2.9 and 2.10 can be combined as A^ℓ ν̃ = (λ − λ_ℓ) y_0. Then, if A^ℓ has full rank, we can define

b^ℓ = (A^ℓ)^{−1} y_0,  (2.11)

to give us

θ_i = θ_i^ℓ + (λ − λ_ℓ) b_i^ℓ, ∀i ∈ (ER^ℓ ∪ EL^ℓ),  (2.12)
β_{0,λ} = β_{0,λ}^ℓ + (λ − λ_ℓ) b_0^ℓ.  (2.13)
Thus, for λ_{ℓ+1} < λ < λ_ℓ, the θ_i and β_{0,λ} proceed linearly in λ. Also,

f(x) = (λ_ℓ/λ) [ f^ℓ(x) − g^ℓ(x) ] + g^ℓ(x),  (2.14)
where

g^ℓ(x) = b_0^ℓ + Σ_{i∈(ER^ℓ ∪ EL^ℓ)} b_i^ℓ K(x, x_i).

Given λ_ℓ, equations 2.12 and 2.14 allow us to compute λ_{ℓ+1}, the λ at which the next event will occur. This is the largest λ less than λ_ℓ such that either θ_k for k ∈ ER^ℓ reaches 0 or 1, or θ_m for m ∈ EL^ℓ reaches 0 or −1, or one of the points in R^ℓ, L^ℓ, or C^ℓ reaches an elbow. The latter event occurs for a point j when

λ = λ_ℓ (f^ℓ(x_j) − g^ℓ(x_j)) / (y_j − g^ℓ(x_j) − ε), ∀j ∈ (R^ℓ ∪ C^ℓ),

and

λ = λ_ℓ (f^ℓ(x_j) − g^ℓ(x_j)) / (y_j − g^ℓ(x_j) + ε), ∀j ∈ (L^ℓ ∪ C^ℓ).
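The step from λ_ℓ to λ_{ℓ+1} therefore amounts to solving one small bordered linear system and scanning a handful of ratios. The sketch below, with our own function and variable names, assumes a precomputed kernel matrix and the current solution at λ_ℓ; it is an illustration of equations 2.11 to 2.14, not the authors' implementation (which also maintains the inverse incrementally by updating and downdating):

```python
import numpy as np

def path_step(K, y, eps, theta, f_lam, lam_l, ER, EL, RLC):
    """One path step: solve the bordered system (2.11) for b, then scan
    equations 2.12 and 2.14 for the largest lambda below lam_l at which
    an elbow theta hits a bound or an outside point reaches an elbow.
    K: (n, n) kernel matrix; y, theta, f_lam: numpy arrays holding the
    responses, current thetas, and f at lam_l; ER, EL, RLC: index lists.
    Returns the candidate next event point lambda (a sketch)."""
    E = list(ER) + list(EL)
    nE = len(E)
    # Bordered matrix A = [[0, 1^T], [1, K_E]] over the elbow points.
    A = np.zeros((nE + 1, nE + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = K[np.ix_(E, E)]
    y0 = np.concatenate(([0.0], y[ER] - eps, y[EL] + eps))
    b = np.linalg.solve(A, y0)                  # b^l of equation 2.11
    b0, bE = b[0], b[1:]
    g = b0 + K[:, E] @ bE                       # g^l(x_j) for all j
    cand = []
    # Elbow points: theta_i + (lam - lam_l) * b_i hits 0 or +/-1.
    for idx, i in enumerate(E):
        if abs(bE[idx]) > 1e-12:
            for bound in (0.0, 1.0 if i in ER else -1.0):
                cand.append(lam_l + (bound - theta[i]) / bE[idx])
    # Points in R, L, C: f(x_j) reaches y_j - eps or y_j + eps.
    for j in RLC:
        for s in (eps, -eps):
            denom = y[j] - g[j] - s
            if abs(denom) > 1e-12:
                cand.append(lam_l * (f_lam[j] - g[j]) / denom)
    valid = [lam for lam in cand if 0.0 < lam < lam_l - 1e-12]
    return max(valid) if valid else 0.0
```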
We terminate the algorithm either when the sets R and L become empty or when λ has become sufficiently close to zero. In the latter case, we must have f^ℓ − g^ℓ sufficiently small as well.

2.4 Computational Cost. The major computational cost of updating the solutions at any event involves two things: solving the system of (nER + nEL + 1) linear equations and computing g^ℓ(x). The former takes O((nER + nEL)²) calculations by using inverse updating and downdating, since the elbow sets usually differ by only one point between consecutive events, and the latter requires O(n(nER + nEL)) computations. In our experience, the total number of steps taken by the algorithm is on average some small multiple, say c, of n. Letting m be the average size of (ER ∪ EL), the approximate computational cost of the algorithm is O(cn²m + nm²). We note that errors can accumulate in inverse updating and downdating as the number of steps increases. To get around this problem, we may directly compute the inverse using equation 2.11 after every several steps. As we will see in section 4, for a fixed underlying true function, the average size of (ER ∪ EL) does not seem to increase with n, so the direct computation of equation 2.11 does not seem to add much extra computational cost.

3 Degrees of Freedom

It is well known that an appropriate value of λ is crucial for the performance of the fitted model in any smoothing problem. One advantage of computing
the entire solution path is to facilitate the selection of the regularization parameter. Degrees of freedom is an informative measure of the complexity of a fitted model. In this section, we propose an unbiased estimate for the degrees of freedom of the SVR, which allows convenient selection of the regularization parameter λ. Since the usual goal of regression analysis is to minimize the predicted squared-error loss, we study the degrees of freedom using Stein's unbiased risk estimation (SURE) theory (Stein, 1981). Given x, we assume y is generated according to a homoskedastic model, that is, the errors have a common variance: y ∼ (µ(x), σ²), where µ is the true mean and σ² is the common variance. Notice that there is no restriction on the distributional form of y. Then the degrees of freedom of a fitted model f(x) can be defined as

df(f) = Σ_{i=1}^n cov(f(x_i), y_i)/σ².

Stein showed that under mild conditions, the quantity

Σ_{i=1}^n ∂f(x_i)/∂y_i  (3.1)

is an unbiased estimate of df(f). Later, Efron (1986) proposed the concept of expected optimism based on equation 3.1, and Ye (1998) developed Monte Carlo methods to estimate equation 3.1 for general modeling procedures. Meyer and Woodroofe (2000) discussed equation 3.1 in shape-restricted regression and also argued that it provides a measure of the effective dimension. (For detailed discussion and complete references, see Efron, 2004.) Notice that equation 3.1 measures the sum of the sensitivity of each fitted value with respect to the corresponding observed value. It turns out that in the case of the SVR, for every fixed λ, Σ_{i=1}^n ∂f_i/∂y_i has an extremely simple formula:

df̂ ≡ Σ_{i=1}^n ∂f_i/∂y_i = |ER| + |EL|.  (3.2)
Therefore, |ER | + |EL | is a convenient unbiased estimate for the degrees of freedom of f (x).
In applying equation 3.2 to select the regularization parameter λ, we plug it into the generalized cross-validation (GCV) criterion (Craven & Wahba, 1979) for model selection:

Σ_{i=1}^n (y_i − f(x_i))² / (1 − df̂/n)².  (3.3)
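In code, the criterion is a one-liner once the elbow sizes are tracked along the path; this small helper (our own naming, a sketch rather than the authors' implementation) makes the dependence explicit:

```python
import numpy as np

def gcv(y, f_hat, n_ER, n_EL):
    """GCV criterion (3.3) with df estimated by the elbow sizes (3.2).
    y and f_hat are the targets and fitted values at one lambda;
    n_ER and n_EL are |E_R| and |E_L| at that lambda."""
    n = len(y)
    df = n_ER + n_EL
    rss = np.sum((np.asarray(y) - np.asarray(f_hat)) ** 2)
    return rss / (1.0 - df / n) ** 2

# Along a computed path, one would evaluate gcv(...) at each event point
# and keep the lambda with the smallest value.
```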
The GCV criterion is an approximation to the leave-one-out cross-validation criterion

Σ_{i=1}^n (y_i − f^{−i}(x_i))²,
where f^{−i}(x_i) is the fit at x_i with the ith point removed. The GCV can be mathematically justified in that, asymptotically, it minimizes the mean squared error for the estimation of f. We shall not provide details but instead refer readers to O'Sullivan (1985). The advantages of this criterion are that it does not assume a known σ² and that it avoids cross validation, which is computationally more intensive. In practice, one can first use the path algorithm in section 2 to compute the entire solution path and then identify the value of λ that minimizes the GCV criterion.

We outline the proof of equation 3.2 in this section; the details are in the appendix. We note that the theory in this section considers the scenario in which λ and x_1, ..., x_n are fixed, while y_1, ..., y_n can change; the proof relies closely on our algorithm in section 2 and follows the spirit of Zou, Hastie, and Tibshirani (2005). As we have seen in section 2, for a fixed response vector y = (y_1, ..., y_n)^T, there is a sequence of λ's, ∞ = λ_0 > λ_1 > λ_2 > ··· > λ_L = 0, such that in the interior of any interval (λ_{ℓ+1}, λ_ℓ), the sets R, L, C, ER, and EL are constant with respect to λ. These sets change only at each λ_ℓ. We thus call these λ_ℓ's event points.

Lemma 1. For any fixed λ > 0, the set of y = (y_1, ..., y_n)^T such that λ is an event point is a finite collection of hyperplanes in R^n. Denote this set as N_λ. Then for any y ∈ R^n \ N_λ, λ is not an event point. Notice that N_λ is a null set, and R^n \ N_λ is of full measure.

Lemma 2. For any fixed λ > 0, θ_λ is a continuous function of y.

Lemma 3. For any fixed λ > 0 and any y ∈ R^n \ N_λ, the sets R, L, C, ER, and EL are locally constant with respect to y.
Theorem 1. For any fixed λ > 0 and any y ∈ R^n \ N_λ, we have the divergence formula

Σ_{i=1}^n ∂f_i/∂y_i = |ER| + |EL|.
We note that the theory applies when λ is not an event point. Specifically, in practice, for a given y, the theory applies when λ ≠ λ_0, λ_1, ..., λ_L; when λ = λ_ℓ, we may continue to use |ER| + |EL|, or df̂_{λ−}, as an estimate of df_λ.

4 Numerical Results

In this section, we demonstrate our path algorithm and the selection of λ using the GCV criterion on both simulated data and real-world data.

4.1 Computational Cost. We first compare the computational cost of the path algorithm with that of the LIBSVM (Chang & Lin, 2001). We used the sinc(·) function, equation 1.6, from section 1. Both algorithms have been implemented in the R programming language, and the comparison was done on a Linux server with two AMD Opteron 244 processors running at 1.8 GHz with a 1 MB cache each and 2 GB of memory. The setup of the function and the error distribution are similar to those in section 1. We used the one-dimensional spline kernel, equation 1.7, and we generated n = 50, 100, 200, 400, 800 training observations from the sinc(·) function. For each simulated data set, we first ran our path algorithm to compute the entire solution path and retrieved the sequence of event points, λ_0 > λ_1 > λ_2 > ··· > λ_L; then for each λ_ℓ, we ran the LIBSVM to get the corresponding solution. Elapsed computing times (in seconds) were recorded carefully using the system.time() function in R. We repeated this 30 times and computed the average elapsed times and their corresponding standard errors. The results are summarized in Table 1. The number of steps along the path (i.e., the number of event points L) and the average size of the elbow |ER ∪ EL| were also recorded and summarized in Table 1.

As we can see, in terms of the elapsed time for computing the entire solution path, our path algorithm dominates the LIBSVM. We can also see that the number of steps increases linearly with the size of the training data and, interestingly, that the average elbow size does not seem to change much as the size of the training data increases. We note that these timings should be treated with great caution, as they can be sensitive to details of implementation (e.g., different overheads on I/O). We also note that the LIBSVM was implemented using the cold start scheme: for every value of λ_ℓ, the training was done from scratch.
Table 1: Computational Cost.

n     Path            LIBSVM           Number of Steps   |ER ∪ EL|
50    0.172 (0.030)   4.517 (4.044)    104.3 (16.6)      8.8 (0.8)
100   0.398 (0.057)   20.13 (29.49)    209.6 (28.1)      9.6 (0.6)
200   1.018 (0.114)   54.00 (30.62)    430.1 (46.2)      9.8 (0.4)
400   2.205 (0.180)   219.3 (239.8)    830.8 (44.0)      10.0 (0.6)
800   6.072 (0.594)   1123.1 (917.9)   1681.4 (122.7)    10.4 (0.4)

Notes: The first column, n, is the size of the training data; the second column, Path, is the elapsed time (in seconds) for computing the entire solution path using our path algorithm; the third column, LIBSVM, is the total elapsed time (in seconds) for computing all the solutions at the event points along the solution path (using the cold start scheme); the fourth column, Number of Steps, is the number of event points; the fifth column, |ER ∪ EL|, is the average elbow size at the event points. All results are averages of 30 independent simulations. The numbers in parentheses are the corresponding standard errors.
Table 2: Average Elapsed Time of the LIBSVM for a Single Value of λ_ℓ.

n     LIBSVM
50    0.042 (0.036)
100   0.090 (0.110)
200   0.123 (0.063)
400   0.261 (0.269)
800   0.656 (0.510)
In recent years, interest has grown in warm start algorithms that can save computing time by starting from an appropriate initial solution (Gondzio & Grothey, 2001; Ma, Theiler, & Perkins, 2003). To further study the computational cost of our path algorithm, we also computed the average elapsed time of the LIBSVM for a single value of λ_ℓ, defined as

Average elapsed time for a single value of λ_ℓ = Total elapsed time / Number of steps.

The results are summarized in Table 2. It takes our path algorithm about 4 to 10 times as long to compute the entire solution path as it takes the LIBSVM to compute a single solution. We note that these timings should be treated with some caution, as the computing time of the LIBSVM varies significantly for different values of λ. It is very likely that in the total elapsed time for the LIBSVM, much of the computing time was wasted on useless (too big or too small) λ values. Finally, we have no intention of arguing that our path algorithm is the ultimate SVR computing tool or that it dominates the LIBSVM; in fact, according to Table 2, it is very likely that even in terms of total elapsed time, the LIBSVM with a warm start scheme can be comparable to or even faster than the path algorithm. The advantage of our path algorithm is that it gives
a full presentation of the solution path without knowing the locations of the event points a priori. We hope our numerical results make our path algorithm an interesting and useful addition to the toolbox for computing the SVR.

4.2 Simulated Data. Next we demonstrate the selection of λ using the GCV criterion. For simulated data, we consider both additive and multiplicative kernels built from the one-dimensional spline kernel, equation 1.7, which are, respectively,

K(x, x′) = Σ_{j=1}^p K(x_j, x′_j)   and   K(x, x′) = Π_{j=1}^p K(x_j, x′_j).
Simulations were based on the following four functions found in Friedman (1991):

1. One-dimensional sinc(·) function:

   f(x) = sin(πx)/(πx) + e_1,

   where x is distributed as Uniform(−2, 2).

2. Five-dimensional additive model:

   f(x) = 0.1 e^{4x_1} + 4/(1 + e^{−20(x_2 − 0.5)}) + 3x_3 + 2x_4 + x_5 + e_2,

   where x_1, ..., x_5 are distributed as Uniform(0, 1).

3. Alternating current series circuit:

   a. Impedance: f(R, ω, L, C) = [R² + (ωL − 1/(ωC))²]^{1/2} + e_3;

   b. Phase angle: f(R, ω, L, C) = tan^{−1}[(ωL − 1/(ωC))/R] + e_4;

   where (R, ω, L, C) are distributed uniformly in (0, 100) × (40π, 560π) × (0, 1) × (1, 11).

The errors e_i are distributed as N(0, σ_i²), where σ_1 = 0.19, σ_2 = 1, σ_3 = 218.5, and σ_4 = 0.18. We generated 300 training observations from each function, along with 10,000 validation observations and 10,000 test observations. For the first two simulations, we used the additive one-dimensional spline kernel, and for the second two simulations, the multiplicative one-dimensional spline kernel. We then found the λ that minimized the GCV criterion, equation 3.3.
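For reference, the four generators are easy to reproduce; the following sketch uses our own seed and variable names and draws n = 300 training points per function, as in the text:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 300

# 1. One-dimensional sinc function on (-2, 2).
x = rng.uniform(-2, 2, n)
y1 = np.sinc(x) + rng.normal(0, 0.19, n)   # np.sinc(x) = sin(pi x)/(pi x)

# 2. Five-dimensional additive model on (0, 1)^5.
X = rng.uniform(0, 1, (n, 5))
y2 = (0.1 * np.exp(4 * X[:, 0])
      + 4 / (1 + np.exp(-20 * (X[:, 1] - 0.5)))
      + 3 * X[:, 2] + 2 * X[:, 3] + X[:, 4]
      + rng.normal(0, 1.0, n))

# 3. Alternating current series circuit: impedance (3a) and phase (3b).
R = rng.uniform(0, 100, n)
w = rng.uniform(40 * np.pi, 560 * np.pi, n)
L = rng.uniform(0, 1, n)
C = rng.uniform(1, 11, n)
y3a = np.sqrt(R**2 + (w * L - 1 / (w * C))**2) + rng.normal(0, 218.5, n)
y3b = np.arctan((w * L - 1 / (w * C)) / R) + rng.normal(0, 0.18, n)
```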
Table 3: Simulation Results on Four Different Functions.

           Prediction Error                       Degrees of Freedom
Function   Gold Standard     GCV                  Gold Standard   GCV
1          0.0390 (0.0012)   0.0394 (0.0015)      21.9 (4.0)      17.2 (3.7)
2          1.116 (0.030)     1.135 (0.037)        24.9 (4.5)      22.2 (5.1)
3a         50779 (1885)      51143 (1941)         16.6 (3.3)      13.4 (3.2)
3b         0.0464 (0.0019)   0.0474 (0.0024)      41.9 (8.0)      40.8 (13.4)
The validation set was used to select the gold standard λ, which minimized the prediction error. Using these λ's, we calculated the prediction error on the test data for each criterion. Suppose the fitted function is f(x); the prediction error is defined as

Prediction error = (1/10,000) Σ_{i=1}^{10,000} (y_i − f_i)².
We repeated this 30 times and computed the average prediction errors and their corresponding standard errors. We also compared the degrees of freedom selected by the two methods. The results are summarized in Table 3. As we can see, in terms of prediction accuracy, the GCV criterion performs close to the gold standard, and the GCV tends to select a slightly simpler model than the gold standard.

4.3 Real Data. For demonstration on real-world data, we consider the eight benchmark data sets used in Chang and Lin (2005) (available online at http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). Table 4 contains a summary of these data sets. To be consistent with their analysis, we also used the radial basis kernel, with the same values for σ² and ε as specified in Chang and Lin (2005). For each data set, we randomly split the data into training and test sets, with the training set comprising 80% of the data. We calculated the path and used the GCV to select the best λ. This λ was used to get a prediction error on the test set. We also computed the prediction error on the test set for each λ and found the minimum prediction error. Notice that in this case, the test set serves more like a separate validation set for selecting a value of λ, and unlike in the simulation study, where there is another test set, this minimum prediction error is overly optimistic. We repeated this process 30 times and computed the average prediction errors and their corresponding standard errors. We also computed the degrees of freedom selected by the two different methods. The results are summarized in Table 5. In terms of prediction accuracy, the GCV performs close to the overly optimistic minimum, and once again, we observe that the GCV tends to select a slightly simpler model than the validation method.
Table 4: Summary of the Eight Benchmark Data Sets.

Data Set    n      p    ln(σ²)   ln(ε)
pyrim       74     27   3.4      −6.6
triazines   186    60   4.0      −4.7
mpg         392    7    0.4      −1.7
housing     566    13   1.4      −1.7
add10       1000   10   2.9      −1.2
cpusmall    1000   12   1.7      −2.1
spacega     1000   6    1.1      −4.6
abalone     1000   8    1.2      −2.1
Table 5: Real Data Results.

            Prediction Error                       Degrees of Freedom
            "Minimum"         GCV                  "Minimum"       GCV
pyrim       0.0037 (0.0034)   0.0067 (0.0071)      31.2 (11.3)     40.1 (9.0)
triazines   0.0185 (0.0073)   0.0247 (0.0083)      46.9 (37.7)     32.3 (36.2)
mpg         6.85 (1.92)       7.37 (2.38)          94.7 (20.7)     74.8 (10.0)
housing     9.87 (3.12)       10.84 (3.69)         226.3 (51.7)    186.2 (30.1)
add10       1.77 (0.20)       1.82 (0.20)          348.1 (35.1)    317.3 (38.8)
cpusmall    26.65 (10.22)     27.60 (10.28)        305.4 (53.9)    277.4 (29.8)
spacega     0.0119 (0.0015)   0.0124 (0.0015)      138.0 (35.2)    121.8 (27.5)
abalone     4.26 (1.04)       4.35 (1.05)          72.1 (20.4)     57.4 (21.7)
5 Discussion

In this letter, we have proposed an efficient algorithm that computes the entire regularization path of the SVR. The general technique employed is known as parametric programming via active sets in the convex optimization literature (Allgower & Georg, 1993). The closest work we have seen in the literature is the employment of similar techniques in incremental learning for the SVM (DeCoste & Wagstaff, 2000) and the SVR (Ma et al., 2003). These authors, however, do not construct exact paths as we do, but rather focus on updating and downdating the solutions as more (or fewer) data arise. We have also proposed an unbiased estimate for the degrees of freedom of the SVR model, which allows convenient selection of the regularization parameter λ. Using the solution path algorithm and the proposed degrees of freedom, the GCV criterion seems to work sufficiently well on both simulated data and real-world data.

Due to the difficulty of also selecting the best ε for the SVR, an alternative algorithm exists that automatically adjusts the value of ε, called the ν-SVR (Schölkopf, Smola, Williamson, & Bartlett, 2000; Chang & Lin, 2002). The
linear ν-SVR can be written as

min_{β_0, β, ε}  (1/2)‖β‖² + C ( Σ_{i=1}^n (ξ_i + δ_i) + νε )

subject to  y_i − β_0 − β^T x_i ≤ ε + ξ_i,
            β_0 + β^T x_i − y_i ≤ ε + δ_i,
            ξ_i, δ_i ≥ 0,  i = 1, ..., n,

where ν > 0 is prespecified. In this scenario, ε is treated as another free parameter. There are at least two interesting directions in which our work can be extended: (1) using arguments similar to those for β_0 in our algorithm, one can show that ε is piecewise linear in 1/λ and that its path can be calculated similarly; (2) as Schölkopf et al. (2000) point out, ν is an upper bound on the fraction of errors and a lower bound on the fraction of support vectors, so it is worthwhile to consider the solution path as a function of ν and to develop a corresponding efficient algorithm.

Appendix: Proofs

A.1 Proof of Lemma 1. For any fixed λ > 0, suppose R, L, C, ER, and EL are given. Then we have

(1/λ) [ β_{0,λ} + Σ_{i∈(ER∪EL)} θ_i K(x_k, x_i) − Σ_{i∈L} K(x_k, x_i) + Σ_{i∈R} K(x_k, x_i) ] = y_k − ε, ∀k ∈ ER;  (A.1)

(1/λ) [ β_{0,λ} + Σ_{i∈(ER∪EL)} θ_i K(x_m, x_i) − Σ_{i∈L} K(x_m, x_i) + Σ_{i∈R} K(x_m, x_i) ] = y_m + ε, ∀m ∈ EL;  (A.2)

Σ_{i∈(ER∪EL)} θ_i − n_L + n_R = 0.  (A.3)
These can be reexpressed as

( 0   1^T ) ( β_{0,λ} )   ( b         )
( 1   K_E ) ( θ_E     ) = ( λ y_E − a ),

where K_E is a (nER + nEL) × (nER + nEL) matrix with entries equal to K(x_i, x_{i′}), i, i′ ∈ (ER ∪ EL); θ_E and y_E are vectors of length (nER + nEL), with elements equal to θ_i and y_i, i ∈ (ER ∪ EL), respectively; a is also a vector of length (nER + nEL), with elements equal to −Σ_{m∈L} K(x_i, x_m) + Σ_{k∈R} K(x_i, x_k) ± λε, i ∈ (ER ∪ EL); and b is the scalar b = n_L − n_R. Notice that once λ, R, L, C, ER, and EL are fixed, K_E, a, and b are also fixed. Then β_{0,λ} and θ_E can be expressed as

( β_{0,λ} )       ( b         )
( θ_E     ) = K̃ ( λ y_E − a ),

where

K̃ = ( 0   1^T )^{−1}
     ( 1   K_E )    .
Notice that β_{0,λ} and θ_E are affine in y_E. Now, corresponding to the six events listed at the beginning of section 2.3, if λ is an event point, one of the following conditions has to be satisfied:

1. θ_i = 0, ∃i ∈ (ER ∪ EL)
2. θ_k = 1, ∃k ∈ ER
3. θ_m = −1, ∃m ∈ EL
4. y_j = (1/λ) [ β_{0,λ} + Σ_{i∈(ER∪EL)} θ_i K(x_j, x_i) − Σ_{i∈L} K(x_j, x_i) + Σ_{i∈R} K(x_j, x_i) ] + ε, ∃j ∈ (R ∪ C)
5. y_j = (1/λ) [ β_{0,λ} + Σ_{i∈(ER∪EL)} θ_i K(x_j, x_i) − Σ_{i∈L} K(x_j, x_i) + Σ_{i∈R} K(x_j, x_i) ] − ε, ∃j ∈ (L ∪ C)

For any fixed λ, R, L, C, ER, and EL, each of the above conditions defines a hyperplane of y in R^n. Taking into account all possible combinations of R, L, C, ER, and EL, the set of y such that λ is an event point is a collection of a finite number of hyperplanes.

A.2 Proof of Lemma 2. For any fixed λ > 0 and any fixed y_0 ∈ R^n, we wish to show that if a sequence y_m converges to y_0, then θ(y_m) converges to θ(y_0). Since the θ(y_m) are bounded, it is equivalent to show that every converging subsequence, say θ(y_{m_k}), converges to θ(y_0). Suppose θ(y_{m_k}) converges to θ_∞; we will show θ_∞ = θ(y_0). Denote the objective function in equation 1.5 as h(θ(y), y), and let

Δh(θ(y), y, y′) = h(θ(y), y) − h(θ(y), y′).
Then we have

h(θ(y_0), y_0) = h(θ(y_0), y_{m_k}) + Δh(θ(y_0), y_0, y_{m_k})
              ≥ h(θ(y_{m_k}), y_{m_k}) + Δh(θ(y_0), y_0, y_{m_k})
              = h(θ(y_{m_k}), y_0) + Δh(θ(y_{m_k}), y_{m_k}, y_0) + Δh(θ(y_0), y_0, y_{m_k}).  (A.4)

Using the fact that |a| − |b| ≤ |a − b| and y_{m_k} → y_0, it is easy to show that for large enough m_k, we have

Δh(θ(y_{m_k}), y_{m_k}, y_0) + Δh(θ(y_0), y_0, y_{m_k}) ≤ c ‖y_0 − y_{m_k}‖_1,

where c > 0 is a constant. Furthermore, using y_{m_k} → y_0 and θ(y_{m_k}) → θ_∞, we reduce equation A.4 to h(θ(y_0), y_0) ≥ h(θ_∞, y_0). Since θ(y_0) is the unique minimizer of h(θ, y_0), we have θ_∞ = θ(y_0).

A.3 Proof of Lemma 3. For any fixed λ > 0 and any fixed y_0 ∈ R^n \ N_λ, since R^n \ N_λ is an open set, we can always find a small enough η > 0 such that Ball(y_0, η) ⊂ R^n \ N_λ. So λ is not an event point for any y ∈ Ball(y_0, η). We claim that if η is small enough, the sets R, L, C, ER, and EL stay the same for all y ∈ Ball(y_0, η). Consider y and y_0. Let R^y, L^y, C^y, ER^y, EL^y, R^0, L^0, C^0, ER^0, EL^0 denote the corresponding sets, and θ^y, f^y, θ^0, f^0 the corresponding fits. For any i ∈ ER^0, since λ is not an event point, we have 0 < θ_i^0 < 1. Therefore, by continuity, we also have 0 < θ_i^y < 1, i ∈ ER^0, for y close enough to y_0; or equivalently, ER^0 ⊆ ER^y, ∀y ∈ Ball(y_0, η) for small enough η. The same applies to EL^0 and EL^y as well. Similarly, for any i ∈ R^0, since y_i^0 − f^0(x_i) > ε, again by continuity, we have y_i − f^y(x_i) > ε for y close enough to y_0; or equivalently, R^0 ⊆ R^y, ∀y ∈ Ball(y_0, η) for small enough η. The same applies to L^0, L^y and C^0, C^y as well. Overall, we must then have ER^0 = ER^y, EL^0 = EL^y, R^0 = R^y, L^0 = L^y, and C^0 = C^y for all y ∈ Ball(y_0, η) when η is small enough.

A.4 Proof of Theorem 1. Using lemma 3, we know that there exists η > 0 such that for all y′ ∈ Ball(y, η), the sets R, L, C, ER, and EL stay the same. This implies that for points in (ER ∪ EL), we have f_i = y_i + ε or f_i = y_i − ε when y′ ∈ Ball(y, η); hence,

∂f_i/∂y_i = 1,  i ∈ (ER ∪ EL).
Furthermore, from equations A.1 to A.3, we can see that for points in R, L, and C, their θ_i's are fixed at either 1, −1, or 0, and the other θ_i's are determined by y_E. Hence,

∂f_i/∂y_i = 0,  i ∈ (R ∪ L ∪ C).

Overall, we have

Σ_{i=1}^n ∂f_i/∂y_i = |ER| + |EL|.
Acknowledgments

We thank the two referees for their thoughtful and useful comments, which have helped us improve the letter greatly. We also thank Trevor Hastie, Saharon Rosset, Rob Tibshirani, and Hui Zou for their encouragement. L. G. and J. Z. are partially supported by grant DMS-0505432 from the National Science Foundation.

References

Allgower, E., & Georg, K. (1993). Continuation and path following. Acta Numerica, 2, 1–64.
Chang, C. C., & Lin, C. J. (2001). LIBSVM: A library for support vector machines. Taipei: Department of Computer Science and Information Engineering, National Taiwan University. Available online at http://www.csie.ntu.edu.tw/~cjlin/libsvm/.
Chang, C. C., & Lin, C. J. (2002). Training ν-support vector regression: Theory and algorithms. Neural Computation, 14(8), 1959–1977.
Chang, M. W., & Lin, C. J. (2005). Leave-one-out bounds for support vector regression model selection. Neural Computation, 17, 1188–1222.
Craven, P., & Wahba, G. (1979). Smoothing noisy data with spline functions. Numerische Mathematik, 31, 377–403.
DeCoste, D., & Wagstaff, K. (2000). Alpha seeding for support vector machines. In Proceedings of the Sixth ACM SIGKDD. New York: ACM Press.
Efron, B. (1986). How biased is the apparent error rate of a prediction rule? Journal of the American Statistical Association, 81, 461–470.
Efron, B. (2004). The estimation of prediction error: Covariance penalties and cross-validation. Journal of the American Statistical Association, 99, 619–632.
Friedman, J. (1991). Multivariate adaptive regression splines. Annals of Statistics, 19, 1–67.
Gondzio, J., & Grothey, A. (2001). Reoptimization with the primal-dual interior point method. SIAM Journal on Optimization, 13, 842–864.
Hastie, T., Rosset, S., Tibshirani, R., & Zhu, J. (2004). The entire regularization path for the support vector machine. Journal of Machine Learning Research, 5, 1391–1415.
Hoerl, A., & Kennard, R. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(3), 55–67.
Joachims, T. (1999). Making large-scale SVM learning practical. In B. Schölkopf, C. Burges, & A. Smola (Eds.), Advances in kernel methods: Support vector learning (pp. 169–184). Cambridge, MA: MIT Press.
Keerthi, S., Shevade, S., Bhattacharyya, C., & Murthy, K. (1999). Improvements to Platt's SMO algorithm for SVM classifier design (Tech. Rep. Mechanical and Production Engineering, CD-99-14). Singapore: National University of Singapore.
Kimeldorf, G., & Wahba, G. (1971). Some results on Tchebycheffian spline functions. Journal of Mathematical Analysis and Applications, 33, 82–95.
Ma, J., Theiler, J., & Perkins, S. (2003). Accurate on-line support vector regression. Neural Computation, 15, 2683–2703.
Meyer, M., & Woodroofe, M. (2000). On the degrees of freedom in shape-restricted regression. Annals of Statistics, 28, 1083–1104.
Müller, K., Smola, A., Rätsch, G., Schölkopf, B., Kohlmorgen, J., & Vapnik, V. (1997). Predicting time series with support vector machines. In Proceedings of the Seventh International Conference on Artificial Neural Networks (pp. 999–1004). Berlin: Springer-Verlag.
O'Sullivan, F. (1985). Discussion of "Some aspects of the spline smoothing approach to nonparametric curve fitting" by B. W. Silverman. Journal of the Royal Statistical Society, Series B, 36, 111–147.
Osuna, E., Freund, R., & Girosi, F. (1997). An improved training algorithm for support vector machines. In Proceedings of the IEEE Neural Networks for Signal Processing VII Workshop (pp. 276–285). Piscataway, NJ: IEEE Press.
Platt, J. (1999). Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. Burges, & A. Smola (Eds.), Advances in kernel methods: Support vector learning (pp. 185–208). Cambridge, MA: MIT Press.
Schölkopf, B., Smola, A., Williamson, R., & Bartlett, P. (2000). New support vector algorithms. Neural Computation, 12, 1207–1245.
Shpigelman, L., Crammer, K., Paz, R., Vaadia, E., & Singer, Y. (2004). A temporal kernel-based model for tracking hand movements from neural activities. In L. K. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing systems, 17. Cambridge, MA: MIT Press.
Smola, A., & Schölkopf, B. (2004). A tutorial on support vector regression. Statistics and Computing, 14, 199–222.
Stein, C. (1981). Estimation of the mean of a multivariate normal distribution. Annals of Statistics, 9(6), 1135–1151.
Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions (with discussion). Journal of the Royal Statistical Society, Series B, 36, 111–147.
Vanderbei, R. (1994). LOQO: An interior point code for quadratic programming (Tech. Rep., Statistics and Operations Research SOR-94-15). Princeton, NJ: Princeton University.
Vapnik, V. (1995). The nature of statistical learning theory. New York: Springer-Verlag.
Vapnik, V., Golowich, S., & Smola, A. (1996). Support vector method for function approximation, regression estimation, and signal processing. In M. Mozer, M. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9. Cambridge, MA: MIT Press.
Wahba, G. (1990). Spline models for observational data. Philadelphia: SIAM.
Ye, J. (1998). On measuring and correcting the effects of data mining and model selection. Journal of the American Statistical Association, 93(441), 120–131.
Zou, H., Hastie, T., & Tibshirani, R. (2005). On the degrees of freedom of the lasso (Tech. Rep.). Palo Alto, CA: Department of Statistics, Stanford University.
Received October 17, 2005; accepted August 1, 2006.
LETTER
Communicated by Sushmita Mitra
A Novel Generic Hebbian Ordering-Based Fuzzy Rule Base Reduction Approach to Mamdani Neuro-Fuzzy System Feng Liu [email protected]
Chai Quek [email protected]
Geok See Ng [email protected] Center for Computational Intelligence, Nanyang Technological University, School of Computer Engineering, Singapore 639798
There are two important issues in neuro-fuzzy modeling: (1) interpretability, the ability to describe the behavior of the system in an interpretable way, and (2) accuracy, the ability to approximate the outcome of the system accurately. As these two objectives usually exert contradictory requirements on the neuro-fuzzy model, a certain compromise has to be made. This letter proposes a novel rule reduction algorithm, namely Hebb rule reduction, and an iterative tuning process to balance interpretability and accuracy. The Hebb rule reduction algorithm uses Hebbian ordering, which represents the degree of coverage of the samples by a rule, as an importance measure of each rule in order to merge the membership functions and hence reduce the number of rules. Similar membership functions (MFs) are merged by a specified similarity measure in order of Hebbian importance, and the resultant equivalent rules are deleted from the rule base; among a set of equivalent rules, the one with the higher Hebbian importance is retained. The MFs are tuned through the least mean square (LMS) algorithm to reduce the modeling error. The tuning of the MFs and the reduction of the rules proceed iteratively to achieve a balance between interpretability and accuracy. Three published data sets by Nakanishi (Nakanishi, Turksen, & Sugeno, 1993), the Pat synthetic data set (Pal, Mitra, & Mitra, 2003), and a traffic flow density prediction data set are used as benchmarks to demonstrate the effectiveness of the proposed method. Good interpretability, as well as high modeling accuracy, is derivable simultaneously and is suitably benchmarked against other well-established neuro-fuzzy models.

Neural Computation 19, 1656–1680 (2007)
© 2007 Massachusetts Institute of Technology

1 Introduction

The hybridization of neural networks and fuzzy systems, called neuro-fuzzy systems, has become a popular research focus (Lin, 1996). It takes
advantage of the low-level learning ability of neural networks and the high-level reasoning ability of fuzzy systems. As a universal approximator (Castro, 1995), this approach resolves the modeling problem using a linguistic model consisting of a set of if-then fuzzy rules instead of a complex mathematical description. The identification of interpretable fuzzy rules is one of the most important issues in the design of a neuro-fuzzy system. Neuro-fuzzy modeling usually comes with two contradictory requirements: interpretability and accuracy. Fuzzy modeling research is divided into two main areas: linguistic fuzzy modeling (LFM) and precise fuzzy modeling (PFM) (Casillas, Cordon, Herrera, & Magdalena, 2003). The former focuses mainly on the interpretability issue and is usually implemented by a Mamdani model (Mamdani & Assilian, 1975), while the latter aims to devise a fuzzy model with good accuracy and is often realized by a Takagi-Sugeno-Kang (TSK) model (Takagi & Sugeno, 1985; Sugeno & Kang, 1988). This letter focuses on a delicate balance between interpretability and accuracy in linguistic fuzzy modeling. Interpretability is usually improved in the following ways:
- Input variable selection and reduction. As the number of input variables increases, the homogeneous partitioning of the input and output spaces causes the size of the rule base to grow exponentially. This has been much addressed in the literature (Silipo & Berthold, 2000; Gonzalez & Perez, 2001; Casillas, Cordon, Del Jesus, & Herrera, 2001; Chakraborty & Pal, 2004).
- Fuzzy rule selection, reduction, and merger. To derive a consistent and effective rule base, redundant, erroneous, unimportant, and conflicting rules should be removed. Various methods have been proposed in the literature (Setnes, Babuska, Kaymak, & Van Nauta Lemke, 1998; Yen & Wang, 1999; Cordon & Herrera, 2000; Krone & Taeger, 2001; Tung & Quek, 2002).
- Use of other rule expressions. Disjunctive normal form rules (Gonzalez & Perez, 1998), exception rules (Krone & Kiendl, 1994), union-rule configuration (Combs & Andrews, 1998), and others have been used to derive a compact rule base.
Yen and Wang (1999) used the singular value decomposition (SVD) and QR-with-column-pivoting methods to determine the order of importance of the fuzzy rules in the TSK model. The rules corresponding to small singular values of the firing matrix are considered less important or irrelevant and may be removed from the rule base. However, Setnes and Babuska (2001) showed that the SVD does not produce an importance ordering of the fuzzy rules and that only the QR process supports such an ordering. The rule reduction in Setnes et al. (1998) is realized by the merger of fuzzy sets. When two fuzzy sets in one input dimension are similar under certain discrete measures of similarity, they are merged into one fuzzy set,
which may result in equivalent rules in the rule base. If a set of rules is equivalent, with the same conditions and consequences, only one of them is preserved. This method clearly separates the fuzzy sets and reduces the number of rules simultaneously, which seems very promising. However, it considers all the rules to be of equal importance, which may result in the deletion of some important rules and may cause considerable degradation in accuracy. Furthermore, in this method, the merger and reduction process occurs only once, at the end of the formulation of the rule base, and the accuracy requirement may limit its degree of reduction.

Ang and Quek (2005) used rough set theory, based on the POP learning algorithm (Ang, Quek, & Pasquier, 2003; Quek & Zhou, 1999), to improve the interpretability of the neuro-fuzzy system. It performs feature selection through the reduction of attributes and extends the reduction to rules without redundant attributes. Optimization methods such as the genetic algorithm (GA) have also been used to deal with the interpretability issue of fuzzy modeling (Casillas, Cordon, Del Jesus, & Herrera, 2005). With multiple objectives, accuracy and interpretability are traded off against each other through the evolutionary process. Banerjee, Mitra, and Pal (1998), Mitra, Mitra, and Pal (2001), and Pal, Mitra, and Mitra (2003) used rough set theory for extracting dependency rules from a real-valued attribute table consisting of fuzzy membership values. At the same time, they employ a GA to tune the fuzzification parameters and the network weights and structure simultaneously. Besides the methods mentioned above, there are many other techniques for neuro-fuzzy rule generation in the soft computing framework (Mitra & Hayashi, 2000).

This letter proposes an iterative process that reduces the rule base to derive good interpretability while maintaining a high level of modeling accuracy. It is based on the Hebbian ordering of the rules and is called Hebb rule reduction. The proposed Hebb rule reduction process reduces the rule base by merging the fuzzy sets in each dimension, with the merging order determined by the Hebbian ordering. Hebbian ordering is used to represent the importance of the rules, where a higher Hebbian ordering indicates a larger coverage of the training points by a given rule. Rules with higher importance are more likely to be preserved. The least mean square (LMS) algorithm is used to tune the membership functions of the rule base after each step of the reduction. Interpretability and accuracy are balanced using this iterative process.

This letter is organized as follows. Section 2 describes the details of the proposed Hebb rule reduction process, section 3 presents five experiments that demonstrate the use of the proposed method, and section 4 presents the conclusion.

2 Hebbian-Based Rule Reduction Algorithm

In a fuzzy system, the crisp inputs are first fuzzified into fuzzy inputs and subsequently transformed into fuzzy outputs through a set of fuzzy rules,
Figure 1: Structure of five-layer fuzzy neural network.
which has the following form (Mamdani type):
R_l: if x_1 is A and x_2 is B, then y_1 is C.  (2.1)

The fuzzy outputs are finally defuzzified into crisp outputs. When integrating with a neural network to use its low-level learning ability, a five-layer neural network is deployed, as shown in Figure 1. It consists of the input layer, the condition layer (which performs the fuzzification), the rule node layer (each node of which denotes a fuzzy rule), the consequence layer (which performs the defuzzification), and the output layer. The crisp input and output vectors are represented as X^T = [x_1, x_2, ..., x_i, ..., x_{n_1}] and Y^T = [y_1, y_2, ..., y_i, ..., y_{n_5}], respectively. The terms n_1, n_2, n_3, n_4, n_5 denote the numbers of neurons of the input, condition, rule node, consequence, and output layers, respectively; n_2 = Σ_{i=1}^{n_1} L_i and n_4 = Σ_{m=1}^{n_5} T_m, where L_i and T_m are the numbers of input and output linguistic labels for each dimension. In this letter, the gaussian membership function is used in the condition and consequence layers to perform the fuzzification and defuzzification. The centroids and widths of the MFs are denoted as (c^{II}_{i,j}, δ^{II}_{i,j}) (the ith input dimension and the jth MF) and (c^{IV}_{l,m}, δ^{IV}_{l,m}) (the mth output dimension and the lth MF) for the two layers. Denoting by iL_k(i) the input label in the ith input dimension of the kth rule, and by oL_k(m) the output label in the mth output dimension of the kth rule, the final output of the mth output dimension, denoted o^V_m, can be expressed as

o^V_m = [ Σ_{k=1}^{n_3} (c^{IV}_{oL_k(m),m}/(δ^{IV}_{oL_k(m),m})²) f^{III}_k ] / [ Σ_{k=1}^{n_3} f^{III}_k/(δ^{IV}_{oL_k(m),m})² ],  (2.2)
where

f^{III}_k = Π_{i=1}^{n_1} exp( −(x_i − c^{II}_{i,iL_k(i)})² / (δ^{II}_{i,iL_k(i)})² )

is called the firing strength of the input point X^T.
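A minimal forward pass corresponding to equation 2.2 and the firing strength above can be written as follows; the rule data layout and function name are our assumptions, and we adopt the precision-weighted reading of equation 2.2 used in the reconstruction above:

```python
import numpy as np

def forward(x, rules):
    """Forward pass of the five-layer network for a single output
    dimension. `rules` is a list of (c_in, d_in, c_out, d_out) tuples:
    the gaussian centroids and widths of the condition and consequence
    MFs linked by each rule. A sketch with our own data layout, not the
    authors' implementation."""
    num, den = 0.0, 0.0
    for c_in, d_in, c_out, d_out in rules:
        # Layer III: firing strength of the rule for input x.
        f_k = np.exp(-np.sum((x - c_in) ** 2 / d_in ** 2))
        num += c_out * f_k / d_out ** 2
        den += f_k / d_out ** 2
    return num / den
```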
There are two important issues in neuro-fuzzy modeling: interpretability and accuracy. Interpretability refers to the capability of the fuzzy model to express the behavior of the system in an understandable way, and accuracy refers to the capability of the fuzzy model to represent the system faithfully (Casillas, Cordon, Herrera, et al., 2003). Interpretability and accuracy are usually contradictory goals and can drift further apart as the system complexity increases. When tuning the membership functions of the rules to diminish the modeling error, the interpretability of the rules may be degraded: the fuzzy sets can drift closer to each other and may end up overlapping one another (Setnes et al., 1998). Thus, a balance between interpretability and accuracy is desirable. The purpose of this letter is to identify and devise a way to obtain low modeling error while maintaining high interpretability.

The proposed method consists of three phases: the initial rule generation (P1), the iterative rule reduction and refinement (P2), and the membership function tuning (P3), as shown in Figure 2.

Figure 2: Flowchart of the proposed rule generation, reduction, and tuning process.

In the P1 phase, the fuzzy rules are formulated to cover all the training samples. The number of input labels, as well as the number of output labels, in the neuro-fuzzy model is equal to the number of fuzzy rules. There is an excessive set of adjustable parameters, and the resultant fuzzy sets have large areas of overlap. These will be reduced in the P2 phase.
In the P2 phase, an iterative tuning process is employed. It consists of two subphases: rule reduction (P2.1) and membership function (MF) refinement (P2.2). The objective of this phase is to improve the interpretability of the model while simultaneously maintaining high modeling accuracy. Membership functions are merged according to the degree of their overlap, and redundant and conflicting rules are deleted at the same time in the rule reduction subphase. A merging scheme based on the Hebbian ordering of rules is proposed. There are three procedures in the P2.1 phase: rank the rules (P2.1.1), merge the MFs (P2.1.2), and reduce the redundant and conflicting rules (P2.1.3). The system first ranks all the rules according to the importance of each rule, which is realized by the Hebbian ordering in this letter. Then the MFs of each variable are merged in accordance with a set of criteria, which will be defined later. Finally, if there are any equivalent rules that have the same condition and consequence parts as others, only one of them is preserved, and if there are any conflicting rules that have the same condition but a different consequence part from others, the ones with lower importance are removed. If there is only one MF for an input feature variable, this feature is removed. In the P2.2 phase, the LMS algorithm is employed to tune the centers and widths of the membership functions to reduce the modeling error. As there may still be unsatisfactory overlaps between MFs, the process iterates back to the P2.1 phase to reduce the MFs and the rules. The stopping criterion for the iteration is specified as follows: denote the training error after the ith pass of rule reduction (P2.1) and MF refinement (P2.2) as E^i_train. If i exceeds the maximum number of iterations i_max, or E^{i+1}_train > ηE^i_train, then restore the rule set obtained just after the ith iteration and go to the next phase. Here η is a user-defined parameter that controls the reduction of the system complexity, based on the fact that if the number of rules becomes fewer than necessary, the training error will become much larger than before.

The third phase is the finer tuning of the MFs to achieve a high level of accuracy. There is no further reduction of the MFs and rules, so there are more updating epochs and a smaller learning rate than in the MF refinement (P2.2) of the second phase.
2.1 Initial Rule Base Formulation. During the rule initialization phase, whenever a new data sample is presented, if there are no rules in the rule base or if the largest firing strength among the existing rules is below a specified threshold θ, a new rule node is created. This threshold controls the coverage of the data space by the rules and affects the number of initial rules: the larger the threshold, the more rules are generated, and vice versa. When a new rule node is created, a neuron is inserted into the condition and consequence layers for each input and output dimension. The centroids of the newly generated MFs are set to the data sample, and the widths of the MFs are set to a predefined value in proportion to the scale of each dimension. The flowchart of the procedure is shown in Figure 3.
Figure 3: Flowchart of the rule initialization algorithm.
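The initialization loop can be sketched in a few lines; everything below (names, data layout, and the handling of the widths) is illustrative rather than the authors' code:

```python
import numpy as np

def init_rules(X, Y, theta, w_in, w_out):
    """Phase P1: present each sample; if no existing rule fires above the
    threshold theta, create a new rule whose MF centroids are the sample
    itself and whose widths take predefined values. A sketch; the rule
    layout (c_in, d_in, c_out, d_out) matches the earlier snippets."""
    rules = []
    for x, y in zip(X, Y):
        strengths = [np.exp(-np.sum((x - c) ** 2 / d ** 2))
                     for c, d, _, _ in rules]
        if not rules or max(strengths) < theta:
            rules.append((np.array(x, float), np.array(w_in, float),
                          np.array(y, float), np.array(w_out, float)))
    return rules
```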
2.2 Hebbian-Based MF and Rule Reduction. Excessive MFs and rules are generated during the rule initialization phase, and they should be reduced so as to provide a clear linguistic meaning for each fuzzy set and to decrease the complexity of the model. In fuzzy modeling, each fuzzy rule corresponds to a subregion of the decision space. Some rules lying in a well-populated region may represent many samples and have a great deal of influence on the final result, while other rules may occasionally be generated by noise and become redundant in the rule base. Thus, the rules are not all of the same importance. This section introduces the Hebbian importance of fuzzy rules and explains how it is used to merge the fuzzy sets.

Once the membership functions of a rule are determined, each training sample (X_i^T, Y_i^T) is fed into the input and output layers simultaneously. The input X_i^T is used to produce the firing strength of the kth rule node, while Y_i^T is fed into the consequence layer to derive the membership values at the output label nodes. If the input-output samples repeatedly fire a rule, through the product of their firing strength and the membership values, and the accumulated strength surpasses that of other rules, this indicates the existence of such a rule. Viewed another way, the accumulated strength exerted by the sample pairs reflects the degree to which they are covered by the rule. A rule that covers the samples to a higher degree has a greater influence on the modeling accuracy: merging the MFs of such a rule, or deleting the rule itself, results in a larger change in the fuzzy inference results and significantly affects the modeling accuracy. Thus, such a rule is of greater importance. To derive a good balance between interpretability and accuracy when MFs and rules are reduced, it is desirable to preserve all such efficient and effective rules.
Let us define the degree of coverage of the ith data point (X_i^T, Y_i^T) by the kth rule as the product of its firing strength and the consequence membership values:

c_{k,i} = f^{III}_k Π_{m=1}^{n_5} µ^{IV}_{oL_k(m),m}(y_{i,m}).  (2.3)

The Hebbian importance of the kth rule is defined as the sum of the degrees of coverage over all the training samples:

I_k = Σ_{i=1}^n c_{k,i}.  (2.4)
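Computing the Hebbian importance is a single accumulation pass over the training set. The sketch below reuses the hypothetical rule layout of the earlier snippets and is our illustration of equations 2.3 and 2.4:

```python
import numpy as np

def hebbian_importance(X, Y, rules):
    """Hebbian importance I_k of each rule (equations 2.3 and 2.4):
    accumulate, over all training pairs, the product of the rule's
    firing strength and its consequence membership values. With gaussian
    MFs, the product over output dimensions collapses to one exp()."""
    I = np.zeros(len(rules))
    for x, y in zip(X, Y):
        for k, (c_in, d_in, c_out, d_out) in enumerate(rules):
            f_k = np.exp(-np.sum((x - c_in) ** 2 / d_in ** 2))
            mu = np.exp(-np.sum((y - c_out) ** 2 / d_out ** 2))
            I[k] += f_k * mu           # coverage c_{k,i}, equation 2.3
    return I

# Hebbian ordering: present rules in decreasing importance.
# order = np.argsort(-hebbian_importance(X, Y, rules))
```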
The fuzzy rules can then be efficiently sorted by their importance, defined by equation 2.4, in decreasing order. This is called the Hebbian ordering of the rules. At the beginning of rule reduction, the original rule set, denoted by R, contains all the sorted rules, and the reduced rule set, denoted by R′, is null. The rules in R are presented successively according to their Hebbian order. If there is no rule in the reduced rule set, the rule in R is added directly into R′; otherwise, the fuzzy set in each dimension of the rule is added to, or merged into, the fuzzy sets in R′ according to a set of criteria defined below. All of the newly added or merged fuzzy sets are then linked together to formulate a new rule in the reduced rule set R′.

During the merging process, some fuzzy sets may be shared by several rules. Changing a fuzzy set shared by many rules is equivalent to modifying these rules simultaneously, which may have a great influence on the performance of the system, while a less shared fuzzy set affects the performance only locally. Thus, a highly shared fuzzy set is of higher importance than one that is less shared. Denote the importance of a fuzzy set F as Î_F; Î_F changes during the merging process. Each time the kth rule in the original rule set R is presented, its fuzzy sets A_i and B_j (i.e., for the kth rule, "if x_i = A_i, then y_j = B_j") have the same importance as their associated rule:

M1: Î_{A_i} = Î_{B_j} = I_k  (2.5)
for i = 1, ..., n_1 and j = 1, ..., n_5. For each input dimension i, among the fuzzy sets of this dimension in the reduced rule set R′, the fuzzy set A′_i with the maximum degree of overlap with A_i is selected (the degree of overlap is defined in equation 2.9). If the maximum degree of overlap satisfies a specified criterion, the two fuzzy sets
will be merged into a new fuzzy set A″_i; otherwise, the fuzzy set A_i is added directly into the reduced rule set of this dimension. The centroids and widths of A_i and A′_i, (c_{A_i}, δ_{A_i}) and (c_{A′_i}, δ_{A′_i}), respectively, are merged into (c_{A″_i}, δ_{A″_i}) using

M2:  c_{A″_i} = (Î_{A_i} c_{A_i} + Î_{A′_i} c_{A′_i}) / (Î_{A_i} + Î_{A′_i}),  (2.6)

     δ_{A″_i} = (Î_{A_i} δ_{A_i} + Î_{A′_i} δ_{A′_i}) / (Î_{A_i} + Î_{A′_i}).  (2.7)

The importance of the new fuzzy set A″_i is given by

M3:  Î_{A″_i} = Î_{A_i} + Î_{A′_i}.  (2.8)
In other words, the centroid and width of the resultant fuzzy set are the weighted averages of those of the two fuzzy sets in accordance with their importance, and the importance of the final fuzzy set is the sum of the importance of the two. The newly generated fuzzy set A″_i then replaces the previous fuzzy set A′_i. For each output dimension, the fuzzy set B_j is either added directly or merged into the fuzzy sets of this output dimension in the reduced rule set, using the same process as for the input dimensions. Finally, the newly added or merged fuzzy sets in all the dimensions are linked together to formulate a fuzzy rule in the reduced rule set R′.

Given fuzzy sets A and B, the degree of overlap of A by B is defined as

S_A(B) = |A ∩ B| / |A|.  (2.9)

As the gaussian membership function is used in the proposed method, the overlap measure can be derived from the centers and widths of the MFs. For fuzzy sets A and B with membership functions µ_A(x) = exp{−(x − c_A)²/δ_A²} and µ_B(x) = exp{−(x − c_B)²/δ_B²}, respectively, and assuming that c_A ≥ c_B, |A| and |A ∩ B| can be expressed as (Juang & Lin, 1998)

|A| = √π δ_A,  (2.10)

|A ∩ B| = (1/2) [ h²(c_B − c_A + √π(δ_A + δ_B)) / (√π(δ_A + δ_B))
        + h²(c_B − c_A + √π(δ_A − δ_B)) / (√π(δ_B − δ_A))
        + h²(c_B − c_A − √π(δ_A − δ_B)) / (√π(δ_A − δ_B)) ],  (2.11)

where h(x) = max{0, x}.
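The overlap measure and the importance-weighted merge are straightforward to implement; in the sketch below, the guard for equal widths (where the closed form has a 0/0 limit equal to zero) is our addition, and all names are illustrative:

```python
import numpy as np

SQPI = np.sqrt(np.pi)

def _term(num, den):
    # One ratio h^2(.)/(.) of equation 2.11; when the denominator
    # vanishes the numerator does too, and the limit is zero.
    return 0.0 if abs(den) < 1e-12 else max(0.0, num) ** 2 / den

def intersection(cA, dA, cB, dB):
    """|A intersect B| for gaussian fuzzy sets with c_A >= c_B (2.11)."""
    t1 = _term(cB - cA + SQPI * (dA + dB), SQPI * (dA + dB))
    t2 = _term(cB - cA + SQPI * (dA - dB), SQPI * (dB - dA))
    t3 = _term(cB - cA - SQPI * (dA - dB), SQPI * (dA - dB))
    return 0.5 * (t1 + t2 + t3)

def degree_of_overlap(cA, dA, cB, dB):
    """S_A(B) = |A intersect B| / |A| (equations 2.9 and 2.10)."""
    if cA >= cB:
        inter = intersection(cA, dA, cB, dB)
    else:                  # the closed form assumes c_A >= c_B; swap
        inter = intersection(cB, dB, cA, dA)
    return inter / (SQPI * dA)

def merge(cA, dA, IA, cB, dB, IB):
    """Importance-weighted merge of two gaussian MFs (eqs. 2.6-2.8)."""
    I = IA + IB
    return (IA * cA + IB * cB) / I, (IA * dA + IB * dB) / I, I
```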
The merging criterion is that if S_A(B) > λ or S_B(A) > λ, the two sets are merged, where λ is a specified threshold that determines the maximum degree of overlap between fuzzy sets that the system can tolerate. A higher λ may increase the accuracy but degrades the interpretability, while a lower λ causes a larger number of rules to be reduced, with the risk that there will not be enough rules left to maintain high modeling accuracy. We choose λ = 0.5 in this letter to balance accuracy and interpretability.

After all the rules have been presented, the following steps are executed to remove features and to delete redundant and conflicting rules in the reduced rule set R′ (a small sketch of the S2/S3 bookkeeping follows Figure 4 below):

S1: If there is only one membership function within one dimension, this dimension (feature) is removed.
S2: If a rule has the same conditions and consequences as another, it is removed.
S3: If there are conflicting rules that have equivalent conditions but different consequences, the one with the higher degree of importance is preserved, and the others are deleted.

Finally, the original rule set is replaced with the reduced rule set. The flowchart of the algorithm is shown in Figure 4. An example follows.

At the start, there are four rules in the original rule set R, shown in Figure 5. After rule 1 is presented, the first rule is added into the reduced rule set R′ (see Figure 6). After the second rule is presented, the two fuzzy sets in the second input dimension, A_2^{(2)} and A_2^{(1)}, are so similar that they are merged into a new A_2^{(1)}. In the first input dimension and the output dimension, the fuzzy sets are well separated and thus are retained (see Figure 7). After the third rule is presented, A_1^{(3)} and A_1^{(1)} are merged into a new A_1^{(1)}, A_2^{(3)} and A_2^{(1)} are merged into a new A_2^{(1)}, and B_1^{(3)} and B_1^{(2)} are merged into a new B_1^{(2)}, as shown in Figure 8. After the fourth rule is presented, A_1^{(4)} and A_1^{(1)} are merged into a new A_1^{(1)}, A_2^{(4)} and A_2^{(1)} are merged into a new A_2^{(1)}, and B_1^{(4)} and B_1^{(1)} are merged into a new B_1^{(1)}, as shown in Figure 9.

After all the rules have been presented, we can see, first, that there is only one fuzzy set left in the second input dimension; this input feature is therefore removed (S1). Second, rules 1 and 4 have the same conditions and consequence; they are equivalent, and one of them is deleted (S2). In this example, rule 1 is retained because it has the higher importance. Finally, rules 1 and 3 have the same condition but different consequences, so they conflict, and one of them has to be removed to improve the interpretability of the system. As I_1 > I_3, rule 1 is preserved and rule 3 is deleted (S3). In the end, the four rules are reduced to two rules, and the input feature x_2 is removed.
Figure 4: Flowchart of the rule reduction algorithm.
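The S2/S3 steps amount to keying rules by their condition labels and keeping the most important rule per key; S1 is a final scan over the surviving fuzzy-set ids. The rule representation below, tuples of label ids after merging, is our illustrative assumption:

```python
def prune_rules(rules, importance):
    """Steps S2 and S3: drop duplicate rules and resolve conflicts in
    favor of the more important rule; then flag single-MF features (S1).
    `rules` is a list of (condition_labels, consequence_labels) tuples
    of fuzzy-set ids; a sketch of the bookkeeping only."""
    keep = {}
    for k, (cond, cons) in enumerate(rules):
        cond, cons = tuple(cond), tuple(cons)
        if cond not in keep or importance[k] > keep[cond][1]:
            # Identical rule (S2) or conflicting consequences (S3):
            # either way the higher-importance rule survives.
            keep[cond] = (cons, importance[k])
    pruned = [(list(c), list(q)) for c, (q, _) in keep.items()]
    # S1: a feature whose MFs all merged into one set can be removed.
    removable = ([i for i in range(len(pruned[0][0]))
                  if len({r[0][i] for r in pruned}) == 1]
                 if pruned else [])
    return pruned, removable
```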
Figure 5: Original rule set R.

Figure 6: Reduced rule set R′ after the presentation of rule 1.

Figure 7: Reduced rule set R′ after the presentation of rule 2.

Figure 8: Reduced rule set R′ after the presentation of rule 3.

Figure 9: Reduced rule set R′ after the presentation of rule 4.

2.3 Parameter Tuning. The LMS algorithm is used to tune the parameters of the neuro-fuzzy system. The error function used is

E = (1/2) Σ_{m=1}^{n_5} (o^V_m − y_m)² = (1/2) Σ_{m=1}^{n_5} e_m²,  (2.12)

where e_m = o^V_m − y_m. Each parameter in the fuzzy neural network is then adjusted according to the partial derivative of E with respect to that parameter:

Δc^{IV}_{oL_k(m),m} = −∂E/∂c^{IV}_{oL_k(m),m} = −(∂E/∂o^V_m)(∂o^V_m/∂c^{IV}_{oL_k(m),m})
  = −e_m [ f^{III}_k/(δ^{IV}_{oL_k(m),m})² ] / [ Σ_{k′} f^{III}_{k′}/(δ^{IV}_{oL_{k′}(m),m})² ],  (2.13)

Δδ^{IV}_{oL_k(m),m} = −∂E/∂δ^{IV}_{oL_k(m),m} = −(∂E/∂o^V_m)(∂o^V_m/∂δ^{IV}_{oL_k(m),m})
  = −2e_m f^{III}_k [ Σ_{k′} c^{IV}_{oL_{k′}(m),m} f^{III}_{k′}/(δ^{IV}_{oL_{k′}(m),m})² − c^{IV}_{oL_k(m),m} Σ_{k′} f^{III}_{k′}/(δ^{IV}_{oL_{k′}(m),m})² ]
    / [ (δ^{IV}_{oL_k(m),m})³ ( Σ_{k′} f^{III}_{k′}/(δ^{IV}_{oL_{k′}(m),m})² )² ],  (2.14)

Δc^{II}_{i,iL_k(i)} = −∂E/∂c^{II}_{i,iL_k(i)} = −(∂E/∂o^V_m)(∂o^V_m/∂µ^{II}_{i,iL_k(i)})(∂µ^{II}_{i,iL_k(i)}/∂c^{II}_{i,iL_k(i)})
  = −2e_m f^{III}_k [ c^{IV}_{oL_k(m),m} Σ_{k′} f^{III}_{k′}/(δ^{IV}_{oL_{k′}(m),m})² − Σ_{k′} c^{IV}_{oL_{k′}(m),m} f^{III}_{k′}/(δ^{IV}_{oL_{k′}(m),m})² ]
    × (x_i − c^{II}_{i,iL_k(i)}) / [ (δ^{II}_{i,iL_k(i)})² (δ^{IV}_{oL_k(m),m})² ( Σ_{k′} f^{III}_{k′}/(δ^{IV}_{oL_{k′}(m),m})² )² ],  (2.15)

Δδ^{II}_{i,iL_k(i)} = −∂E/∂δ^{II}_{i,iL_k(i)} = −(∂E/∂o^V_m)(∂o^V_m/∂µ^{II}_{i,iL_k(i)})(∂µ^{II}_{i,iL_k(i)}/∂δ^{II}_{i,iL_k(i)})
  = −2e_m f^{III}_k [ c^{IV}_{oL_k(m),m} Σ_{k′} f^{III}_{k′}/(δ^{IV}_{oL_{k′}(m),m})² − Σ_{k′} c^{IV}_{oL_{k′}(m),m} f^{III}_{k′}/(δ^{IV}_{oL_{k′}(m),m})² ]
    × (x_i − c^{II}_{i,iL_k(i)})² / [ (δ^{II}_{i,iL_k(i)})³ (δ^{IV}_{oL_k(m),m})² ( Σ_{k′} f^{III}_{k′}/(δ^{IV}_{oL_{k′}(m),m})² )² ],
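where all sums over k′ run from 1 to n_3. As a concrete instance, one LMS update of the consequence centroids for a single-output network looks as follows; this follows our reconstruction of equations 2.2 and 2.13 and is a sketch rather than the authors' code (the width and condition-layer updates of equations 2.14 to 2.16 follow the same pattern):

```python
import numpy as np

def lms_step(x, y, rules, lr):
    """One LMS update of the consequence centroids, equation 2.13, for a
    single output dimension. Each rule is (c_in, d_in, c_out, d_out)
    with scalar consequence parameters; names are illustrative."""
    f = np.array([np.exp(-np.sum((x - c_in) ** 2 / d_in ** 2))
                  for c_in, d_in, _, _ in rules])
    c_out = np.array([float(r[2]) for r in rules])
    d_out = np.array([float(r[3]) for r in rules])
    w = f / d_out ** 2                    # firing strength over width^2
    o = np.dot(c_out, w) / np.sum(w)      # network output, equation 2.2
    e = o - y                             # error e_m = o_m^V - y_m
    delta_c = -e * w / np.sum(w)          # equation 2.13
    return c_out + lr * delta_c, e
```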
3 Experimental Results

Three sets of experiments were performed to evaluate the proposed Hebb rule reduction algorithm: (1) the three Nakanishi data sets (Nakanishi et al., 1993; Sugeno & Yasukawa, 1993), (2) the Pat synthetic pattern classification data set (Pal et al., 2003), and (3) highway traffic flow density prediction.

3.1 Nakanishi Data Set. In this section, the proposed learning method for a neuro-fuzzy system is evaluated using benchmark experiments on three data sets: the human operation of a chemical plant, a nonlinear system, and the daily price of a stock in a stock market. The complete data sets have been published in Nakanishi et al. (1993). They were used to evaluate the following six reasoning methods: Mamdani's version of Zadeh's compositional rule of inference (CRI), Turksen's version of interval-valued CRI, Turksen's versions of the point-valued and interval-valued analogical approximate reasoning scheme (AARS), and Sugeno's position-type and gradient-type reasoning (Nakanishi et al., 1993). These data sets were also used to evaluate qualitative modeling based on fuzzy logic (Sugeno & Yasukawa, 1993). Each of the three data sets is split into three groups: A, B, and C (Nakanishi et al., 1993). In the following experiments, the data sets of groups A and B are used as the training sets, and group C is used as the testing set. The experimental results are compared against (1) the pseudo outer-product-based fuzzy neural network (POPFNN) (Ang et al., 2003); (2) its modified version, the rough-set-based pseudo outer-product fuzzy rule identification algorithm (RSPOP) (Ang & Quek, 2005); (3) the adaptive-network-based fuzzy inference system (ANFIS) (Jang, 1993); (4) evolving fuzzy neural networks (EFuNN) (Kasabov, 2001); and (5) the dynamic evolving neural fuzzy inference system (DENFIS) (Kasabov & Song, 2002). To evaluate the performance of these systems, both the mean squared error (MSE) and the correlation coefficient (R) are used in the experiments. The comparative analysis is presented in section 3.1.4.

3.1.1 A Nonlinear System. The nonlinear system is described as

$$y = \left(1 + x_1^{-2} + x_2^{-1.5}\right)^2, \qquad 1 \le x_1, x_2 \le 5. \qquad (3.1)$$
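To fix ideas, data for this benchmark can be generated as in the sketch below; drawing the two irrelevant inputs x3 and x4 uniformly from the same range is an assumption (the published groups A, B, and C are fixed splits):

```python
import numpy as np

def nonlinear_system_data(n_samples, seed=0):
    # Equation 3.1 with two irrelevant inputs (x3, x4) appended.
    rng = np.random.default_rng(seed)
    x = rng.uniform(1.0, 5.0, size=(n_samples, 4))
    y = (1.0 + x[:, 0] ** -2.0 + x[:, 1] ** -1.5) ** 2.0
    return x, y
```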
Figure 10: Membership functions for the nonlinear system modeling problem: (a) membership function u(x1) over input x1, (b) membership function u(x2) over input x2, and (c) membership function u(y1) over output y1, each with three fuzzy sets labeled T(1) Low, T(2) Medium, and T(3) High.
The data set consists of four input variables and one output variable; only x1 and x2 are useful in the modeling, while x3 and x4 are irrelevant variables. Figure 10 and Table 1 show the shapes, the centroids, and the widths of the derived gaussian membership functions. For this data set, the feature selection method in Nakanishi et al. (1993) discards the input features x3 and x4, while RSPOP (Ang & Quek, 2005) discards x3 and partially discards x4. In our proposed method, the two irrelevant variables x3 and x4 are also discarded. There are three fuzzy sets for each of the input features x1 and x2 and for the output variable y1 (see Figure 10). In total, seven rules are generated, as shown in Table 2. To illustrate the interpretability, take the first two rules as an example. The first rule indicates that if x1 takes a low value and x2 takes a medium value, then y1 will be medium; the second rule indicates that if x1 is high and x2 is medium, then y1 will be low. From equation 3.1, we can see that for x1, a higher value decreases y and a lower value increases y, while for x2, the higher the value, the higher the value of y, and vice versa. From these two rules, it can be observed that when x2 takes the same fuzzy label, the output y1 is inversely correlated with x1.
Table 1: Centroid and Width of the Membership Functions for the Nonlinear System Modeling Problem.

Label         Low     Medium   High
x1 = c^II_1   1.219   3.020    4.605
δ^II_1        1.900   1.155    0.428
x2 = c^II_2   1.061   3.803    4.962
δ^II_2        0.747   1.953    0.379
y = c^IV      1.337   3.321    5.050
δ^IV          0.816   1.875    1.873
Table 2: Fuzzy Rules for the Nonlinear System Modeling Problem.

Rule   x1       x2       y
1      Low      Medium   Medium
2      High     Medium   Low
3      Low      High     High
4      Medium   Medium   Low
5      Medium   High     Medium
6      Medium   Low      Low
7      Low      Low      Low
Table 3: Variables in the Data Set of the Human Operation of a Chemical Plant.

x1   Monomer concentration
x2   Change of monomer concentration
x3   Monomer flow rate
x4   Local temperatures inside the plant
x5   Local temperatures inside the plant
y    Set point for monomer flow rate
3.1.2 Human Operation of a Chemical Plant. The problem is to model the human operation of a chemical plant. There are five input variables and one output variable, as shown in Table 3. The feature selection method in Nakanishi et al. (1993) discards the inputs x2, x4, and x5, whereas RSPOP discards the inputs x1, x2, and x5. In our proposed method, among the five input dimensions x1 to x5, only x3 is preserved; the others are discarded. There are three fuzzy sets in both the input x3 and the output y. Three rules are generated by the system.

3.1.3 Daily Price of a Stock in the Stock Market. This problem is to predict the stock price using various features of a stock in a stock market.
Table 4: Variables in the Data Set of Daily Stock Price Predictions.

x1    Past change of moving average over a middle period
x2    Present change of moving average over a middle period
x3    Past separation ratio with respect to moving average over a middle period
x4    Present separation ratio with respect to moving average over a middle period
x5    Present change of moving average over a short period
x6    Past change of price, for instance, change on one day before
x7    Present change of price
x8    Past separation ratio with respect to moving average over a short period
x9    Present change of moving average over a long period
x10   Present separation ratio with respect to moving average over a short period
y     Prediction of stock price
Table 5: Consolidated Experimental Results on Nakanishi Data Sets.

                      Chemical Plant        Nonlinear System     Stock Prediction
Model                 MSE          R        MSE      R           MSE       R
Hebb rule reduction   2.423×10⁴    0.998    0.185    0.911       15.138    0.947
POPFNN                5.630×10⁵    0.946    0.270    0.877       76.221    0.733
RSPOP                 2.124×10⁵    0.983    0.383    0.856       24.859    0.922
Sugeno (P-G)          1.931×10⁶    0.990    0.467    0.845       168.90    0.700
Mamdani               6.580×10⁵    0.937    0.862    0.490       40.84     0.865
Turksen (IVCRI)       2.581×10⁵    0.993    0.706    0.609       93.02     0.661
ANFIS                 2.968×10⁶    0.780    0.286    0.853       38.062    0.875
EFuNN                 7.247×10⁵    0.946    0.566    0.720       72.542    0.756
DENFIS                5.240×10⁴    0.995    0.411    0.805       69.824    0.810
There are 10 input variables and 1 output variable in the data set, summarized in Table 4. The proposed method discards the input variables x2 and x5; the feature selection method in Nakanishi et al. (1993) discards the inputs x1, x2, x3, x6, x7, x9, and x10; and RSPOP discards the inputs x1, x2, x3, x6, and x10. Twenty fuzzy rules are identified by the proposed method.

3.1.4 Discussion. Table 5 shows the consolidated experimental results on the Nakanishi data sets in terms of MSE and Pearson's correlation coefficient (R). On the problem of modeling the human operation of a chemical plant, although only one variable is retained and only three fuzzy rules are generated by the proposed method, its modeling error (MSE) and accuracy (R) are the best among the nine models; good interpretability and accuracy are achieved simultaneously (see Tables 5 and 6). On the modeling of a nonlinear system, the proposed method correctly identifies the useful variables and outperforms the others on both measures. On the prediction of stock price, as described before, the inputs x1, x2, x3, x6, x7, x9, and x10 are discarded by the feature selection method in Nakanishi et al. (1993), and the
Table 6: Benchmark Against RSPOP.

                                              Chemical Plant   Nonlinear System   Stock Prediction
Hebb rule reduction
  Number of rules before reduction            30               25                 50
  Number of rules after reduction             3                7                  20
  Rule reduction (%)                          90.0             64.0               60.0
  MSE                                         2.423×10⁴        0.185              15.138
  R                                           0.998            0.911              0.947
RSPOP
  Number of rules before reduction            24               22                 50
  Number of rules after reduction             14               17                 29
  Rule reduction (%)                          41.7             22.7               14.0
  MSE                                         2.124×10⁵        0.383              24.859
  R                                           0.983            0.856              0.922
Improvement (Hebb rule reduction vs. RSPOP)
  Rule reduction (%)                          48.3             41.3               46.0
  MSE (%)                                     88.6             51.7               39.1
  R (%)                                       1.5              6.0                2.6
inputs x1, x2, x3, x6, and x10 are discarded by RSPOP. In the proposed model, x2 and x5 are discarded. Input x2 is the only one that all three methods discard, and neither of the other two considered input x5 a redundant feature. From the results, although the proposed model uses a larger set of attributes, it achieves superior performance compared to the other models. This indicates that some variables discarded by the other models are still useful in the prediction and cannot be discarded.

As both the RSPOP model and the proposed Hebb rule reduction model employ a Hebbian-like learning method and reduce the rule set and attribute set simultaneously, with the aim of pursuing a balance between accuracy and interpretability, they are contrasted in Table 6. The proposed model begins with a rule set that is larger than RSPOP's for the first two data sets and of equal size for the third. After tuning and pruning, Hebb rule reduction ends up with a much smaller rule set for all three data sets than that produced by RSPOP, and the fractional decrease in the number of rules by the proposed method is larger than that by RSPOP. For the MSE, Hebb rule reduction outperforms RSPOP with improvements of 88.6%, 51.7%, and 39.1% on the three data sets, respectively. For R, the improvements by the proposed method are 1.5%, 6.0%, and 2.6%, respectively. It is safe to conclude that the proposed Hebb rule reduction achieves better performance than RSPOP on both interpretability and accuracy.

3.2 Pat Synthetic Pattern Classification. The Pat synthetic data set consists of 880 two-dimensional pattern points, depicted in Figure 11. There are three linearly nonseparable classes. The figure is marked with classes 1 (c1) and 2 (c2), while class 3 (c3) corresponds to the background region. In this
Figure 11: Synthetic data set Pat, plotted in the two-dimensional feature space (F1, F2), with classes c1 and c2 marked and class c3 as the background region.

Table 7: Comparison of Performance of the Different Models.

Model                     Accuracy (%)   Kappa (%)   Number of Rules   Confusion
Hebb rule reduction       84.24          73.12       28                0.67
Modular rough-fuzzy MLP   70.31          71.80       8                 1.1
EFuNN                     82.25          70.22       88                1.0
RSPOP                     55.86          22.44       16                1.3
experiment, 10% of the data are used as the training set, and the remaining data are used as the testing set. The modular rough-fuzzy MLP (Mitra et al., 2001; Pal et al., 2003), EFuNN, and RSPOP are compared to the proposed Hebb rule reduction system. Quantitative measures are used to evaluate the performance of the rules (Pal et al., 2003): (1) accuracy, the correct classification percentage on the testing set; (2) kappa, the coefficient of agreement that measures the relationship of beyond-chance agreement to expected disagreement; (3) the number of rules; and (4) confusion, a measure quantifying the goal that confusion should be restricted to a minimum number of classes. The performance of the different models is shown in Table 7. The proposed Hebb rule reduction outperforms the other benchmarked models. Its accuracy and kappa are the highest among the models, which indicates its good capability for the classification task. Its confusion measure is also the lowest, which means that it has the fewest classes between which confusion occurs. However, to achieve such superior performance, the proposed Hebb rule reduction needs more rules than the rough-fuzzy MLP and RSPOP.
Figure 12: Plotting of the traffic flow density data of three lanes and the division of the cross-validation groups.
3.3 Highway Traffic Flow Density Prediction. This experiment evaluates the performance of the proposed Hebb rule reduction learning method on the highway traffic flow density data set. The raw traffic flow data for the benchmark test were obtained from Tan (1997). The raw data were collected at site 29, located at exit 15 along the eastbound Pan Island Expressway (PIE) in Singapore, using loop detectors embedded beneath the road surface. There are five lanes at the site: two exit lanes and three straight lanes for the main traffic. In this experiment, only the traffic flow data for the three straight lanes are employed. There are four input attributes in this data set: the time and the traffic density of the three lanes. The traffic flow trend at the site is modeled by the proposed Hebb rule reduction neuro-fuzzy model, and the trained network is used to predict the traffic density at time t + τ, where τ = 5, 10, 30, 45, and 60 minutes. Figure 12 plots the traffic flow density data for the three straight lanes spanning a period of six days from September 5 to September 10, 1996. The data set is divided into three cross-validation groups of training and testing sets, designated CV1, CV2, and CV3, the same as in Tung and Quek (2002) and Ang and Quek (2005). The Pearson correlation coefficient (R) and the MSE are used to evaluate the performance. The proposed method is compared to RSPOP, POPFNN, GenSoFNN, EFuNN, and DENFIS. The consolidated experimental results are shown in Figure 13 (the proposed Hebb rule reduction method is abbreviated as Hebb-R-R). Figures 13a, 13c, and 13e show the average R of the three cross-validation groups, whereas Figures 13b, 13d, and 13f show the average MSE. The proposed Hebb rule reduction method shows superior performance to the other models: it achieves the highest average R and the lowest average MSE for lanes 1 to 3 over all intervals τ = 5, 10, 30, 45, and 60 minutes.
Figure 13: Average R (a, c, e) and average MSE (b, d, f) of the prediction in the three cross-validation groups, plotted against the prediction time interval for Hebb-R-R, RSPOP, POPFNN, GenSoFNN, EFuNN, and DENFIS.
Table 8: Average MSE, R, and Number of Rules.

Neuro-Fuzzy Model     Average MSE   Average R   Average Number of Rules
Hebb rule reduction   0.114         0.864       8.1
RSPOP                 0.146         0.834       14.4
POPFNN                0.173         0.814       40.0
GenSoFNN              0.164         0.813       50.0
EFuNN                 0.189         0.798       234.5
DENFIS                0.153         0.831       9.7
Table 8 shows the average R, MSE, and number of rules identified over all the time intervals of the three lanes. The proposed Hebb rule reduction method derives on average only 8.1 fuzzy rules, the fewest among these models, while its modeling accuracy is superior. This indicates that the proposed method yields a good balance of accuracy and interpretability.

4 Conclusion

This letter proposes a novel method for balancing interpretability against the demands of modeling accuracy in a neuro-fuzzy system. Interpretability is improved through a judicious reduction of the membership functions, fuzzy rules, and attributes, and accuracy is boosted through the least mean square (LMS) learning algorithm, which tunes the rules and MFs to the most appropriate locations in the feature space. The Hebbian ordering is proposed to represent the importance of each rule: a rule with a higher Hebbian ordering has a greater degree of coverage of the sample points, contributes more toward the modeling of the data, and is more likely to be preserved. Membership functions are merged according to the Hebbian importance of their associated rule. The resulting equivalent rules are deleted, and attributes with only one MF are removed. The problem of rule conflict is resolved by retaining the rules of higher importance. The proposed membership function merger process not only reduces similar membership functions but also preserves the more informative rules (those with high Hebbian ordering), enabling the subsequent LMS learning process to achieve superior accuracy. The proposed method has these key strengths:
• An intuitive, consistent, and compact rule base can be extracted from the trained neuro-fuzzy model, with a reduced set of attributes, clearly separated membership functions in each dimension, and, most important, a small number of judiciously derived rules.

• The neuro-fuzzy model does not need a clustering method such as LVQ (Kohonen, 1982) to specify the number and positions of the membership functions in each dimension prior to the identification of rules, unlike those in Tung and Quek (2002), Ang et al. (2003), and Ang and Quek (2005).

• A good balance of interpretability and accuracy can be obtained through the proposed iterative processing by Hebb rule reduction and the LMS learning algorithm.
The performance of the proposed Hebb rule reduction method is evaluated using five data sets: (1) human operation of a chemical plant, (2) a nonlinear system, (3) daily price of a stock in a stock market, (4) Pat synthetic pattern classification, and (5) highway traffic flow density prediction. The experimental results show consistently good performance by the proposed method when benchmarked against other established neuro-fuzzy models.

References

Ang, K. K., & Quek, C. (2005). RSPOP: Rough set-based pseudo outer-product fuzzy rule identification algorithm. Neural Computation, 17(1), 205–243.
Ang, K. K., Quek, C., & Pasquier, M. (2003). POPFNN-CRI(S): Pseudo outer product-based fuzzy neural network using the compositional rule of inference and singleton fuzzifier. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 33(6), 838–849.
Banerjee, M., Mitra, S., & Pal, S. K. (1998). Rough fuzzy MLP: Knowledge encoding and classification. IEEE Transactions on Neural Networks, 9(6), 1203–1216.
Casillas, J., Cordon, O., Del Jesus, M. J., & Herrera, F. (2001). Genetic feature selection in a fuzzy rule-based classification system learning process for high-dimensional problems. Information Sciences, 136(1–4), 169–191.
Casillas, J., Cordon, O., Del Jesus, M. J., & Herrera, F. (2005). Genetic tuning of fuzzy rule deep structures preserving interpretability and its interaction with fuzzy rule set reduction. IEEE Transactions on Fuzzy Systems, 13, 13–29.
Casillas, J., Cordon, O., Herrera, F., & Magdalena, L. (2003). Interpretability improvements to find the balance interpretability-accuracy in fuzzy modeling: An overview. In J. Casillas, O. Cordon, F. Herrera, & L. Magdalena (Eds.), Interpretability issues in fuzzy modeling (pp. 3–22). New York: Springer.
Castro, J. L. (1995). Fuzzy logic controllers are universal approximators. IEEE Transactions on Systems, Man, and Cybernetics, 25(4), 629–635.
Chakraborty, D., & Pal, R. N. (2004). A neuro-fuzzy scheme for simultaneous feature selection and fuzzy rule-based classification. IEEE Transactions on Neural Networks, 15(1), 110–123.
Combs, W. E., & Andrews, J. E. (1998). Combinatorial rule explosion eliminated by a fuzzy rule configuration. IEEE Transactions on Fuzzy Systems, 6(1), 1–11.
Cordon, O., & Herrera, F. (2000). A proposal for improving the accuracy of linguistic modeling. IEEE Transactions on Fuzzy Systems, 8(3), 335–344.
Gonzalez, A., & Perez, R. (1998). Completeness and consistency conditions for learning fuzzy rules. Fuzzy Sets and Systems, 96(1), 37–51.
Gonzalez, A., & Perez, R. (2001). Selection of relevant features in a fuzzy genetic learning algorithm. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 31(3), 417–425.
Jang, J.-S. R. (1993). ANFIS: Adaptive-network-based fuzzy inference system. IEEE Transactions on Systems, Man, and Cybernetics, 23(3), 665–685.
Juang, C. F., & Lin, C. T. (1998). An online self-constructing neural fuzzy inference network and its applications. IEEE Transactions on Fuzzy Systems, 6(1), 12–32.
Kasabov, N. K. (2001). Evolving fuzzy neural networks for supervised/unsupervised online knowledge-based learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 31(6), 902–918.
Kasabov, N. K., & Song, Q. (2002). DENFIS: Dynamic evolving neural-fuzzy inference system and its application for time-series prediction. IEEE Transactions on Fuzzy Systems, 10(2), 144–154.
Kohonen, T. K. (1982). Self-organized formation of topologically correct feature maps. Biological Cybernetics, 43(1), 59–69.
Krone, A., & Kiendl, H. (1994). Automatic generation of positive and negative rules for two-way fuzzy controllers. In Proceedings of the Second European Congress on Intelligent Techniques and Soft Computing (pp. 438–447). Aachen, Germany: Verlag Mainz.
Krone, A., & Taeger, H. (2001). Data-based fuzzy rule test for fuzzy modeling. Fuzzy Sets and Systems, 123(3), 343–358.
Lin, C. T. (1996). Neural fuzzy systems: A neuro-fuzzy synergism to intelligent systems. Upper Saddle River, NJ: Prentice Hall.
Mamdani, E. H., & Assilian, S. (1975). An experiment in linguistic synthesis with a fuzzy logic controller. International Journal of Man-Machine Studies, 7(1), 1–13.
Mitra, S., & Hayashi, Y. (2000). Neuro-fuzzy rule generation: Survey in soft computing framework. IEEE Transactions on Neural Networks, 11(3), 748–768.
Mitra, S., Mitra, P., & Pal, S. K. (2001). Evolutionary modular design of rough knowledge-based network using fuzzy attributes. Neurocomputing, 36(1), 45–66.
Nakanishi, H., Turksen, I. B., & Sugeno, M. (1993). A review and comparison of six reasoning methods. Fuzzy Sets and Systems, 57(3), 257–294.
Pal, S. K., Mitra, S., & Mitra, P. (2003). Rough fuzzy MLP: Modular evolution, rule generation and evaluation. IEEE Transactions on Knowledge and Data Engineering, 15(1), 14–25.
Quek, C., & Zhou, R. W. (1999). POPFNN-AAR(S): A pseudo outer-product based fuzzy neural network. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 29, 859–870.
Setnes, M., & Babuska, R. (2001). Rule base reduction: Some comments on the use of orthogonal transforms. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 31(2), 199–206.
Setnes, M., Babuska, R., Kaymak, U., & Van Nauta Lemke, H. R. (1998). Similarity measures in fuzzy rule base simplification. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 28(3), 376–386.
Silipo, R., & Berthold, M. (2000). Input features impact on fuzzy decision processes. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 30(6), 821–834.
Sugeno, M., & Kang, G. T. (1988). Structure identification of fuzzy model. Fuzzy Sets and Systems, 28(1), 15–33.
Sugeno, M., & Yasukawa, T. (1993). A fuzzy-logic-based approach to qualitative modeling. IEEE Transactions on Fuzzy Systems, 1(1), 7–31.
Takagi, T., & Sugeno, M. (1985). Fuzzy identification of systems and its applications to modeling and control. IEEE Transactions on Systems, Man, and Cybernetics, 15(1), 116–132.
Tan, G. K. (1997). Feasibility of predicting congestion states with neural networks (Tech. Rep.). Singapore: Nanyang Technological University.
Tung, W. L., & Quek, C. (2002). GenSoFNN: A generic self-organizing fuzzy neural network. IEEE Transactions on Neural Networks, 13(5), 1075–1086.
Yen, J., & Wang, L. X. (1999). Simplifying fuzzy rule-based models using orthogonal transformation methods. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 29(1), 13–24.
Received January 11, 2006; accepted August 2, 2006.
Erratum

Kalman Filter Control Embedded into the Reinforcement Learning Framework

István Szita and András Lőrincz

Neural Computation, 16, 491–499 (2004)

The theorem on p. 495 should read as follows: If the system (F, H) is observable, 0 ≥ ∗ (or 0 ≥ ∗ in the case of Sarsa), and there exists an L such that $\|F + GL\| \le 1/\sqrt{1-p}$, then there exists a series of learning rates $\alpha_t$ such that $0 < \alpha_t \le 1/\|x_t\|^4$, $\sum_t \alpha_t = \infty$, $\sum_t \alpha_t^2 < \infty$, and it can be computed online. For all sequences of learning rates satisfying these requirements, the combined TD-KF methods converge to the optimal policy with probability 1, both for state-value and for action-value estimations.

Technical report http://arxiv.org/abs/cs/0306120 has been updated accordingly. It contains the details of the proof.

Acknowledgment

We would like to thank László Gerencsér for calling our attention to a mistake in the previous version of the convergence proof.
ARTICLE
Communicated by Jonathan Victor
Estimating Information Rates with Confidence Intervals in Neural Spike Trains Jonathon Shlens [email protected] Salk Institute, La Jolla, CA 92037, and Institute for Nonlinear Science, University of California, San Diego, La Jolla, CA 92093, U.S.A.
Matthew B. Kennel [email protected] Institute for Nonlinear Science, University of California, San Diego, La Jolla, CA 92093, U.S.A.
Henry D. I. Abarbanel [email protected] Institute for Nonlinear Science, University of California, San Diego, La Jolla, CA 92093, U.S.A., and Department of Physics and Marine Physical Laboratory, Scripps Institution of Oceanography, University of California, San Diego, La Jolla, CA 92093, U.S.A.
E. J. Chichilnisky [email protected] Salk Institute, La Jolla, CA 92037, U.S.A.
Information theory provides a natural set of statistics to quantify the amount of knowledge a neuron conveys about a stimulus. A related work (Kennel, Shlens, Abarbanel, & Chichilnisky, 2005) demonstrated how to reliably estimate, with a Bayesian confidence interval, the entropy rate from a discrete, observed time series. We extend this method to measure the rate of novel information that a neural spike train encodes about a stimulus—the average and specific mutual information rates. Our estimator makes few assumptions about the underlying neural dynamics, shows excellent performance in experimentally relevant regimes, and uniquely provides confidence intervals bounding the range of information rates compatible with the observed spike train. We validate this estimator with simulations of spike trains and highlight how stimulus parameters affect its convergence in bias and variance. Finally, we apply these ideas to a recording from a guinea pig retinal ganglion cell and compare results to a simple linear decoder.
Neural Computation 19, 1683–1719 (2007)
© 2007 Massachusetts Institute of Technology
1 Introduction

How neurons encode and process information about the sensory environment is an essential focus of systems neuroscience. While much progress has been made (Rieke, Warland, de Ruyter van Steveninck, & Bialek, 1997; Borst & Theunissen, 1999), many fundamental questions remain. Which features of the sensory world do neurons extract (de Ruyter van Steveninck & Bialek, 1988; Bialek, Rieke, de Ruyter van Steveninck, & Warland, 1991; Dimitrov, Miller, Gedeon, Aldworth, & Parker, 2003; Paninski, 2003b; Aguera y Arcas & Fairhall, 2003; Sharpee, Rust, & Bialek, 2004)? How do correlations within a spike train (Mainen & Sejnowski, 1995; Berry, Warland, & Meister, 1997; Reinagel & Reid, 2002; Fellous, Tiesinga, Thomas, & Sejnowski, 2004; Abarbanel & Talathi, 2006) or between spike trains of different cells (Mastronarde, 1989; Meister, Lagnado, & Baylor, 1995; Usrey & Reid, 1999) convey information about the stimulus (Gawne & Richmond, 1993; Dimitrov et al., 2003; Rieke et al., 1997; Brenner, Strong, Koberle, Bialek, & de Ruyter van Steveninck, 2000; Warland, Reinagel, & Meister, 1997; Stanley, Li, & Dan, 1999; Gat & Tishby, 1999; Reich, Mechler, & Victor, 2001; Schnitzer & Meister, 2003; Nirenberg & Latham, 2003; Schneidman, Bialek, & Berry, 2003)? Are the tuning properties of sensory neurons optimized for their natural environment (Rieke, Bodnar, & Bialek, 1995; Lewen, Bialek, & de Ruyter van Steveninck, 2001; Simoncelli & Olshausen, 2001)? How does adaptation optimize information flow in sensory neurons (Brenner, Bialek, & de Ruyter van Steveninck, 2000; Fairhall, Lewen, Bialek, & de Ruyter van Steveninck, 2001; Meister & Berry, 1999)? Addressing such questions requires a reliable measure of how much information about the sensory world is contained in a spike train.

The average mutual information rate is a general measure of nonlinear correlation between dynamic sensory stimuli and neural responses (Shannon, 1948; Cover & Thomas, 1991). It does not assume a specific model for interpreting a spike train, yet it bounds the amount of information available to any method for reading or decoding a spike train. This quantity allows one to measure in a meaningful way how efficiently neural circuits encode the sensory stimulus (Strong, Koberle, de Ruyter van Steveninck, & Bialek, 1998) or to evaluate on an absolute scale the performance of any decoding mechanism, whether biological or artificial (Buracas, Zador, DeWeese, & Albright, 1998; Warland et al., 1997).

A major obstacle to the broad application of mutual information measurements to neural coding problems is that estimating mutual information from experimental data sets is difficult. First, the large data requirements of naive estimation techniques restrict the use of information-theoretic analyses to preparations with long-duration recordings. Judging the convergence and bias of conventional estimators of entropy and mutual information is difficult (Treves & Panzeri, 1995; Paninski, 2003a), making the application
of these tools particularly problematic in limited data sets. Second, common methods for estimating information rates require a heuristic judgment of the duration over which the stimulus and response histories affect the response (Theunissen & Miller, 1995). This subjective assessment can largely dominate the final estimate (Kennel, Shlens, Abarbanel, & Chichilnisky, 2005). Third, applying this heuristic precludes an objective estimate of a confidence interval for the information rate.

In this article, we address all of these shortcomings. We introduce a new estimator of information rates for the analysis of neural spike trains that (1) converges quickly to the true value in finite data sets, (2) requires no heuristic judgments, and (3) alone among all other methods provides accurate confidence intervals bounding the range of information rates from sources that most likely generated the observed data. The last issue is important because it permits one to compare information rates in a statistically meaningful manner.

We begin by discussing how to represent a spike train as a time series and examine the relevant quantities for characterizing the information in this representation. We introduce an estimator of information rate, derived from Kennel et al. (2005), and briefly review major issues in estimating these quantities and their associated confidence intervals in neural data. We validate this estimator in simulation and discuss how stimulus parameters in an experiment affect bias and variance in the estimate. Finally, we test this approach on spike data from a guinea pig retinal ganglion cell, calculating the information rate and coding efficiency with confidence intervals and comparing the derived bounds to the performance of a simple linear decoder.

2 Quantifying Information in a Spike Train

We begin with several assumptions about how a spike train represents information. These assumptions ought to be as encompassing as possible, balanced by computational tractability, in order to handle two essential features of neural systems:

1. Spike trains are lists of stereotyped action potentials occurring in continuous time.
2. Spike trains are generated by a dynamical system with temporal dependencies.

In this setting, only the spike times and number of action potentials in a response can convey information about the stimulus. We represent the spike train as the output of a symbolic dynamical system evolving in time (Lind & Marcus, 1996; Ott, 2002; Gilmore & Lefranc, 2002). By symbolic, we mean that the representation of the spike train is a sequence of integers with a finite alphabet. We select a short time
window δt, termed the bin size, and count the number of spikes in each time bin (MacKay & McCulloch, 1952; Strong et al., 1998; see also Rapp et al., 1994). This is indicated in Figures 1d and 1g. The resulting time course of the observed response is represented by a symbol stream of integers $R = \{r_1, r_2, \ldots, r_N\}$, where each integer $r_i$ is the number of spikes that occur between times $(i-1)\delta t$ and $i\delta t$. Typically, δt is chosen such that each $r_i$ is binary. The bin size is the effective temporal resolution of the spike train (Theunissen & Miller, 1995; Reinagel & Reid, 2000). This coarse graining of the spike train assumes that the relevant timescale of neural dynamics is larger than the bin size. As a realization of the underlying symbol source, the symbol stream is now amenable to tools from symbolic dynamics and data compression (Willems, Shtarkov, & Tjalkens, 1995; Kennel et al., 2005). Information theory provides a framework for quantifying the transmission of information in any communications channel (Shannon, 1948; Cover & Thomas, 1991; MacKay, 2003; Fano, 1961). In neural coding, the sensory stimulus S and neural response R are considered the respective input to and output of a lossy communications channel, represented probabilistically by P(R|S) (for reviews, see Rieke et al., 1997, and Borst & Theunissen, 1999).¹
Figure 1: Block diagram of the entire estimation procedure divided into two larger tasks of estimating the total entropy rate (upper-right box) (see section 5.1). Simulate spike trains and bin at particular temporal resolution in order to determine appropriate stimulus parameters (a–b). Perform experiment and bin spike train at same temporal resolution (c–d). (e) Generate NMC samples from the continuous distribution P(h(R) | data) (see section 3.1). Estimate the noise entropy rate by recording multiple repetitions of a stimulus (f) and binning the resulting spike train (g) (see section 3.2). (h) Calculate P(h(R|t) | data) using the adapted algorithm to also condition on phase time t. The result is NS sampled distributions where t and i label the time and sample index in h i (t). (i) Selecting L to be several times the autocorrelation width suffices, although more sophisticated algorithms exist (see appendix A). (j) Generate a stationary bootstrap B to capture the autocorrelation in h(R|t) by acting as a surrogate for the integer time indices t (see appendix A). (k) Draw multiple integers J from a uniform probability distribution ∈ 1, . . . , NMC (see section 3.3). (l) Select a set of samples {h i (t)} using B and J for the time and sample indices and average. Repeat this procedure to generate NMC samples from P(h(R|S) | data), each time generating new sets B and J of time and sample indices (see equation 3.4). (m) Subtract off bias discovered in the bootstrap resampling (see equation 3.5). (n) Subtract individual samples of entropy rates to generate samples of P(I (S; R) | data). The mean and quantiles of P(I (S; R) | data) provide the final estimate and confidence interval bounding the range of likely average mutual information rates (see section 3.5).
In our framework, the stimulus could be a continuous or discrete signal, but the response will always be a stream of discrete integers. The goal is to quantify the nonlinear correlation or statistical dependence between the stimulus and the spike train response.
¹ P(X) denotes the probability distribution of the random variable X, while P(X = x) denotes the probability that X = x. For convenience, we sometimes abbreviate the latter as P(x), where X is implied.
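To make the discretization described above concrete, here is a minimal sketch of the binning step; the function name and the 4 ms default are illustrative, and spike times are assumed to be in seconds:

```python
import numpy as np

def bin_spike_train(spike_times, duration, dt=0.004):
    # Count spikes in consecutive bins of width dt to form the
    # symbol stream R = {r_1, ..., r_N}.
    n_bins = int(duration / dt)
    counts, _ = np.histogram(spike_times, bins=n_bins, range=(0.0, n_bins * dt))
    return counts
```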
The classical functional for measuring statistical dependence between two distributions is the mutual information $\log_2 \frac{P(s,r)}{P(s)P(r)}$ (Fano, 1961), where P(s) and P(r) are the probabilities of observing stimulus $s \in S$ and response $r \in R$, and P(s, r) is the joint probability for these events. One usually deals with the average of this statistic over the joint distribution (Cover & Thomas, 1991),

$$I(S;R) = \sum_{s\in S,\, r\in R} P(s,r)\,\log_2 \frac{P(s,r)}{P(s)P(r)}.$$
I(S; R) is the average mutual information and answers the question, How much in bits, on average, do we learn about the observation r when we observe s?² The average mutual information can also be expressed as a difference in the entropy H[·] of the distributions, $I(S;R) = H(S) - H(S|R) = H(R) - H(R|S)$, where the entropy is defined as

$$H(S) = -\sum_{s\in S} P(s)\,\log_2 P(s),$$

and similarly for H(R). The average conditional entropy is $H(S|R) = \sum_{r\in R} P(r)\, H(S\,|\,R=r)$, with

$$H(S\,|\,R=r) = -\sum_{s\in S} \frac{P(s,r)}{P(r)}\,\log_2 \frac{P(s,r)}{P(r)}.$$
The benefit of this formulation is that it is easy to generalize this quantity to time series with temporal dependencies. In a non-Poisson spike train R, successive symbols $r_{t-1}, r_t, \ldots$ are not statistically independent. The stationary distribution associated with a symbol stream is the conditional probability distribution $P(r_t\,|\,r_{t-1}, \ldots, r_{t-D})$, in which the symbol $r_t$ observed at time t depends on the D previous symbols. D is called the conditioning depth or the order of a Markov process. In a dynamical system, the conditioning depth is potentially infinite (Hilborn, 2000; Ott, 2002), though in practice, measurements depend on a finite number of previous symbols.
² A second interpretation is to recognize that I(S; R) measures the Kullback-Leibler divergence between the joint distribution and the product of the marginals and thus measures how inefficient the product of the marginal distributions is at encoding the joint distribution (Cover & Thomas, 1991).
In the limiting case, we define the entropy rate,

$$h(R) = \frac{1}{\delta t}\,\lim_{D\to\infty} H\left[P(r_t\,|\,r_{t-1},\ldots,r_{t-D})\right], \qquad (2.1)$$

with

$$H\left[P(r_t\,|\,r_{t-1},\ldots,r_{t-D})\right] = -\sum_{r_t,\ldots,r_{t-D}} P(r_t,\ldots,r_{t-D})\,\log_2 P(r_t\,|\,r_{t-1},\ldots,r_{t-D}).$$
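A naive finite-depth plug-in version of this definition, shown only to fix ideas (it is the kind of biased estimator whose pitfalls Figure 2 later illustrates; names are illustrative), might read:

```python
import numpy as np
from collections import Counter

def entropy_rate_plugin(symbols, depth, dt):
    # Plug-in estimate of H[P(r_t | r_{t-1}, ..., r_{t-depth})] / dt.
    joint = Counter(tuple(symbols[i - depth:i + 1])
                    for i in range(depth, len(symbols)))
    n = len(symbols) - depth
    context = Counter()
    for word, c in joint.items():
        context[word[:-1]] += c          # counts of the depth-length contexts
    h = 0.0
    for word, c in joint.items():
        h -= (c / n) * np.log2(c / context[word[:-1]])
    return h / dt                        # bits per second
```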
h(R) has units of bits per second. The entropy rate quantifies the uncertainty of the next symbol given knowledge of the infinite prior history of the underlying symbol source (Cover & Thomas, 1991; Kennel et al., 2005). In a neuroscience context, the stimulus S and neural response R are two dynamical time series. The generalization of the average mutual information is the average mutual information rate, expressed as the difference of entropy rates,

$$I(S;R) = h(R) - h(R|S) = h(S) - h(S|R). \qquad (2.2)$$
This quantity expresses the average rate of novel information (bits per second) that newly observed responses provide about the stimulus time series. It is symmetric in the response and stimulus. The average information rate can mask the contribution of specific stimuli or responses that are particularly informative because it is an average over all pairs of stimuli and responses. It is of interest, then, to quantify the information attributed to a single stimulus or neural response (DeWeese & Meister, 1999; Brenner, Strong et al., 2000; Butts, 2003; Bezzi, Samengo, Leutgeb, & Mizumori, 2002),

$$I_{sp}(s) \equiv h(R) - h(R\,|\,S=s) \qquad (2.3)$$
$$I_{sp}(r) \equiv h(S) - h(S\,|\,R=r).$$
These specific information rates quantify the amount of information in a single momentary stimulus or neural response. They are natural decompositions of the average mutual information rate,

$$I(S;R) = \sum_{r\in R} P(r)\, I_{sp}(r) = \sum_{s\in S} P(s)\, I_{sp}(s).$$
The specific information rates need not be positive, although the average mutual information rate is always nonnegative (Fano, 1961; DeWeese & Meister, 1999). We now discuss how to estimate these quantities from an
experimental data set by extending estimators of entropy rate from previous work.

3 Estimating Information in a Spike Train

The definition of the average mutual information rate, equation 2.2, provides two alternatives for calculating the information rate in a neural spike train. One can estimate the information rate by examining the reduction in uncertainty from conditioning in stimulus space (Bialek et al., 1991) or response space (Strong et al., 1998). Stimulus space can be high dimensional and continuous, and thus difficult to sample experimentally. Therefore, we select the alternate formulation, $I(S;R) = h(R) - h(R|S)$, because we can estimate the information rate by measuring solely the neural responses in a suitably designed experiment. In this framework, h(R) is termed the total entropy rate, $h_{total}$, and h(R|S) the noise entropy rate, $h_{noise}$ (Strong et al., 1998). We estimate $h_{total}$ and $h_{noise}$ from separate data sets recorded under different stimulus presentations and combine these results to estimate the information rate $I(S;R) = h_{total} - h_{noise}$. Each step of this entire procedure is outlined in Figure 1.

3.1 Estimating Total Entropy Rates. The total entropy rate $h_{total}$ is the average rate of uncertainty observed in a neural spike train response over a (stationary) stimulus distribution. Experimentally, this means that we present a stimulus to a neuron to sample all possible spiking patterns over a long duration Nδt. Typically, Nδt ranges from several hundred to several thousand seconds (Borst & Theunissen, 1999). The recorded spike train is discretized at a small bin size δt, producing a list of integer spike counts $R = \{r_1, r_2, \ldots, r_N\}$, where the subscript indexes time. These two steps are diagrammed in Figures 1c and 1d. We have discussed in detail how to estimate the entropy rate of a finite-length symbol stream in a related work (Kennel et al., 2005), but we review these ideas briefly here. A paramount goal of entropy rate estimation is to determine the appropriate conditioning depth D of the underlying symbol source.³ We must balance the selection of D, as diagrammed in Figure 2, between two opposing biases on the estimate:
• Unresolved temporal dependencies due to finite D
• Undersampled probability distributions due to finite N
³ The word length (Strong et al., 1998; Borst & Theunissen, 1999; Reinagel & Reid, 2000) is roughly equivalent to the conditioning depth D, but see Kennel et al. (2005).
Figure 2: Selecting the conditioning depth D can dominate entropy rate estimation. This pedagogical figure, reproduced from Kennel et al. (2005), demonstrates how temporal complexity and small data sizes affect entropy rate estimation in a discrete Markov process (D = 10) and a symbolized logistic map (D = ∞). Both symbolic systems use binary alphabets and have a maximum possible entropy rate of 1 bit per symbol. For O(10⁶) and O(10⁵) simulated symbols (left and right columns, respectively), the entropy rate was estimated as a function of conditioning depth, $\hat h(D)$, using a naive entropy estimator (Miller & Madow, 1954). The dashed line is the true entropy rate calculated analytically. A common heuristic is to identify the final estimate of entropy rate as the plateau on this graph (Strong et al., 1998; Schurmann & Grassberger, 1996). This qualitative assessment is not reliable and can fail in the limit of small data sets (right column) and complex temporal dynamics (top row) (Kennel et al., 2005).
This requires selecting the appropriate probabilistic model complexity for a finite data set, a classic issue in statistical inference and machine learning (Duda, Hart, & Stork, 2001). We resolve this problem by using a model weighting technique termed context tree weighting, originally developed for lossless data compression (Willems et al., 1995; Kennel & Mees, 2002; London, Schreibman, Hausser, Larkum, & Segev, 2002; Kennel et al., 2005), following the spirit of the
minimum description length principle (Rissanen, 1989). This weighting, along with local Bayesian entropy estimators (Wolpert & Wolf, 1995; Nemenman, Shafee, & Bialek, 2002), yields a direct estimator of the entropy rate of R, $\hat h_{total}$. We consider the range of plausible entropy rates compatible with the finite data set, not just a single point estimate. We take the Bayesian perspective and view the estimate $\hat h_{total}$ as the expectation of a posterior distribution of entropy rates most consistent with the observed data,

$$\hat h_{total} = \int h_{total}\, P(h_{total}\,|\,\text{data})\, dh_{total}.$$
The width of $P(h_{total}\,|\,\text{data})$ is a Bayesian confidence interval about its estimate. A wide (narrow) distribution implies large (small) uncertainty about the estimate. The numerical algorithm presented in Kennel et al. (2005) yields $N_{MC}$ numerical samples $\{h^*_{total}\}$ of the entropy rate drawn from $P(h_{total}\,|\,\text{data})$ using Monte Carlo techniques (Metropolis, Rosenbluth, Rosenbluth, Teller, & Teller, 1953; Hastings, 1970; Gamerman, 1997). This step is diagrammed by Figure 1e. The mean of these $N_{MC}$ samples is the estimate $\hat h_{total}$, and the 5% and 95% quantiles of this empirical distribution give a 90% confidence interval.
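In code, summarizing the Monte Carlo draws is straightforward; in this sketch, `samples` stands for the draws $\{h^*_{total}\}$:

```python
import numpy as np

def summarize_samples(samples, level=0.90):
    # Mean of the Monte Carlo draws is the point estimate; quantiles
    # give the confidence interval (5% and 95% for level = 0.90).
    a = (1.0 - level) / 2.0
    lo, hi = np.quantile(samples, [a, 1.0 - a])
    return float(np.mean(samples)), (float(lo), float(hi))
```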
3.2 Estimating Noise Entropy Rates. The noise entropy rate $h_{noise}$ measures the reliability of a neuron in response to a stimulus. This term quantifies the amount of uncertainty in the response that is not attributable to the stimulus. This entropy rate can also be difficult to estimate in a finite data set. In this section, we discuss how to extend the algorithm presented in Kennel et al. (2005) for estimating h(R) to provide an estimator for h(R|S). The quantity $h_{noise}$ is a conditional entropy rate and thus a weighted average over all conditional entropies,

$$h(R|S) = \frac{1}{\delta t} \sum_{r_{hist},\, s_{hist}} P(r_{hist}, s_{hist})\, H\left[P(r_t\,|\,r_{hist}, s_{hist})\right] = \left\langle h(R\,|\,S = s_{hist}) \right\rangle_S, \qquad (3.1)$$
where hist denotes all previous values up to but not including time t. Thus, $r_{hist}$ and $s_{hist}$ denote conditioning histories in response and stimulus space, respectively. To calculate this quantity, we must average over all possible stimuli (i.e., stimulus histories $s_{hist}$). This can be experimentally infeasible because stimulus space might be continuous and high dimensional. A clever trick to circumvent this problem is to design an experiment in which one replays the same time-varying stimulus segment over multiple
trials (Mainen & Sejnowski, 1995; Bair & Koch, 1996; de Ruyter van Steveninck, Lewen, Strong, Koberle, & Bialek, 1997; Strong et al., 1998). Typically, a stimulus of duration 5 to 30 seconds is replayed for dozens to hundreds of trials. The notion of phase time as used here refers to the time elapsed since the beginning of the trial. We index the phase time (measured in units of the time bin size δt) by an integer $t \in [1, \ldots, N_S]$. With $N_R$ repeated stimulus presentations, this generates a raster plot (see Figure 1f or Figure 8), in which a neuron has received practically the same stimulus history $s_{hist}$ at each phase time t. In other words, we equate each moment in phase time t with a particular stimulus history $s_{hist}$. We additionally assume that each repetition of the stimulus is sufficiently long that each trial is reasonably ergodic. This means

$$h(R|S) = \left\langle h(R\,|\,S = s_{hist}) \right\rangle_S = \lim_{N_S\to\infty} \left\langle h(R\,|\,t) \right\rangle_{t=1}^{N_S}. \qquad (3.2)$$
We emphasize that this assumption might not hold when $N_S$ is small or when the stimulus contains temporal correlations, as commonly observed in natural stimuli (Simoncelli & Olshausen, 2001). In practice, for each moment in phase time t, we have observed $N_R$ draws from the unknown conditional distribution $P(r_t\,|\,r_{hist}, s_{hist})$. In other words, each presentation has conditioned the response on the same (unspecified) stimulus history $s_{hist}$. The stimulus-conditioned response is the very distribution in equation 3.1 whose entropy we need to calculate. The entropy of this distribution, averaged over stimulus space (or phase time), is the noise entropy.⁴ We use the same algorithm as previously used for $\hat h_{total}$ as a component, this time forming a new context-tree-based estimate at each phase time t from the $N_R$ histories. The observations or draws of this conditional distribution are read off going "downward" in a raster plot, as diagrammed in Figure 3, but the conditioning history is back in phase time as usual. Each of the $N_S$ trees is processed in Figure 1h, following Kennel et al. (2005), providing $N_{MC}$ samples from $P(h_{noise}(t)\,|\,\text{data})$, the Bayesian posterior for the instantaneous noise entropy rate.
⁴ We note that the definition of the noise entropy rate differs from the method commonly used in practice (Strong et al., 1998). The average noise entropy rate is by definition the average over stimulus space of all individual noise entropy rates h(R|S = s). The direct method calculates the average entropy across stimulus space for varying conditioning depths; subsequently, it derives an entropy rate estimate from the average entropies. This out-of-order procedure in practice avoids many complicated extrapolations (Kennel et al., 2005).
Figure 3: A diagram demonstrating how to estimate $P(r_t\,|\,r_{hist}, s_{hist})$ and, correspondingly, the noise entropy rate $\hat h_{noise}(t)$, conditioned on a particular stimulus. Each row is a discretized spike train response to a repeated stimulus. The set of $N_R$ time series ending prior to some moment in phase time t shows the conditioning histories $r_{hist}$, explicitly, and $s_{hist}$, implicitly. With $N_R$ samples of histories, we use the context tree to predict each $r_t$ from its past history and estimate $\hat h_{noise}(t) = H[P(r_t\,|\,r_{hist}, s_{hist})]$, the instantaneous noise entropy rate.
We denote the jth numerical sample of $P(h_{noise}(t)\,|\,\text{data})$ as $h^*_j(t)$. Averaging over those samples yields

$$\hat h_{noise}(t) = \frac{1}{N_{MC}} \sum_{j=1}^{N_{MC}} h^*_j(t).$$

The estimated average noise entropy rate is then the average over all phase times,

$$\hat h_{noise} = \frac{1}{N_S} \sum_{t=1}^{N_S} \hat h_{noise}(t). \qquad (3.3)$$
3.3 Estimating the Distribution of Noise Entropy Rates. In addition to the estimate $\hat h_{noise}$, we also wish to estimate the distribution of likely values for the noise entropy, $P(h_{noise}\,|\,\text{data})$. Percentiles of this quantity will provide confidence intervals for $\hat h_{noise}$. Thus far, we have only computed samples from $P(h_{noise}(t)\,|\,\text{data})$ for every point in phase time. How do we appropriately combine these $N_S$ sampled distributions into an overall distribution to provide an error bar on $\hat h_{noise}$? At this point, one must make a philosophical decision about which types of variability are intended to be represented by the confidence interval. The proper procedure depends on
the intended use and interpretation of the estimated value. We identify two sources of variability that conceivably ought to be represented in a useful confidence interval:

1. Variability due to estimation error pointwise in phase time, because only a finite number of repeats were recorded
2. Variability across the stimulus process itself, because only a finite-duration stimulus was presented

The first source of variability quantifies uncertainty in the noise entropy rate pointwise in phase time due to finite samples ($N_R$). Within typical data sets ($N_R$ ranges from 50 to 1000 repetitions), the width of this distribution, $P(\hat h_{noise}(t)\,|\,\text{data})$, tends to be narrow, and thus the variability attributable to pointwise estimation error is small (see, for instance, Figure 8, bottom).

The second source of variability is more subtle and often not explicitly recognized. Most often, we want to estimate h(R|S) from the data, where S is the stimulus process, not the individual stimulus time series actually used in the experiment. In principle, to estimate this variability, we would repeat our experiment with many stimulus time series, each with many repeats; the variability across the ensemble of rasters (where each raster corresponds to a unique stimulus time series) would reflect the variability in the stimulus process. In practice, long-duration recordings are difficult, and thus only a single raster plot is measured in response to a single instance of the stimulus process. The goal, then, is to estimate the variability across an ensemble of raster plots from just a single raster plot. We employ a bootstrap procedure, detailed below, to achieve this goal.

Because the underlying stimulus process is (presumed to be) unobserved, the variability in the stimulus process is instead reflected in the range of spike patterns observed across phase time in a raster plot (see Figure 4a, top panel). In practice, we find this source of variability quite large because a neuron can fluctuate from nearly deterministic silence ($h_{noise} \approx 0$) to bursting activity ($h_{noise} \gg 0$) (see Figure 4a, top panel, at t = 200 and 400 ms, respectively) depending on the stimulus. To estimate the effect of variation across the stimulus process, we use a generic time-series bootstrap of the observed effect of stimulus variation on $\hat h_{noise}(t)$ to estimate the expected consequences for $\hat h_{noise}$. We find ensembles of noise entropy rates based on resampled rasters, as if there were a new stimulus composed of randomized segments of the original stimulus. A bootstrapped sample is
$$h^*_{boot} = \frac{1}{N_S} \sum_{t=1}^{N_S} h^*_{J_t}(B_t), \qquad (3.4)$$
Figure 4: Effects of correlation and spike rate on the bootstrap procedure for a short repeated stimulus segment from the experimental data set (δt = 4 ms; see section 5). (a) Top: original raster plot. Middle: raster with phase time indices B resampled uniformly with replacement. Bottom: raster with phase time indices B resampled using the stationary bootstrap procedure. (b) Conditioning on a constant spike rate. Left: noise entropy samples $h^*_{boot}$ versus firing rate $r^*$ over bootstrapped samples, demonstrating a typical linear dependence (dashed line). Right: noise entropy samples corrected using the linear Ansatz. The width of the distribution of $h^*_{boot}$ (vertical dispersion) is substantially lowered using this procedure.
where, for each sample $h^*_{boot}$, $B_t$ is a draw of time indices in $[1, \ldots, N_S]$ (details discussed below) and $J_t$ is a uniform random draw from the sample indices $[1, N_{MC}]$ with replacement. Procedurally, we draw a random bootstrap time series $B = \{B_1, \ldots, B_{N_S}\}$ of time indices and, at each of those moments, draw one of the $N_{MC}$ samples from the pointwise estimated posterior, then average them to give one bootstrap sample, $h^*_{boot}$.⁵
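A sketch of this resampling step (equation 3.4), using the Politis-Romano stationary bootstrap described below to draw the phase time indices B; the mean block length is an assumed tuning parameter (see appendix A), and `rng` is a NumPy generator such as `np.random.default_rng(0)`:

```python
import numpy as np

def stationary_bootstrap_indices(n_s, mean_block_len, rng):
    # Stationary bootstrap: blocks of consecutive phase times with
    # geometric lengths (mean mean_block_len) and random start points.
    p = 1.0 / mean_block_len
    b = np.empty(n_s, dtype=int)
    t = int(rng.integers(n_s))
    for i in range(n_s):
        b[i] = t
        t = int(rng.integers(n_s)) if rng.random() < p else (t + 1) % n_s
    return b

def bootstrap_noise_sample(h_samples, mean_block_len, rng):
    # h_samples[t, j] = j-th Monte Carlo sample of h_noise at phase time t.
    n_s, n_mc = h_samples.shape
    b = stationary_bootstrap_indices(n_s, mean_block_len, rng)  # indices B_t
    j = rng.integers(n_mc, size=n_s)                            # indices J_t
    return float(h_samples[b, j].mean())                        # equation 3.4
```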
We employ the basic bootstrap interval (Davison et al., 1997) to remove bias discovered during the resampling procedure. The basic bootstrap interval presumes that the distribution of $\hat h_{noise} - h_{noise}$ (where $h_{noise}$ is unknown) may be identified with the distribution of $h^*_{boot} - \hat h_{noise}$ (where $h^*_{boot}$ has been sampled by the bootstrap). Samples of our estimate for the distribution of the noise entropy rate $h^*_{noise}$ are computed using equations 3.3 and 3.4:

$$h^*_{noise} = \hat h_{noise} - (h^*_{boot} - \hat h_{noise}) = 2\hat h_{noise} - h^*_{boot}. \qquad (3.5)$$
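In code, the bias-corrected samples of equation 3.5 and their quantiles might be obtained as in this sketch:

```python
import numpy as np

def noise_entropy_interval(h_hat, h_boot, level=0.90):
    # Basic bootstrap interval: h*_noise = 2*h_hat - h*_boot (equation 3.5).
    h_star = 2.0 * h_hat - np.asarray(h_boot)
    a = (1.0 - level) / 2.0
    lo, hi = np.quantile(h_star, [a, 1.0 - a])
    return h_star, (float(lo), float(hi))
```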
These samples are a hybrid Bayesian and bootstrap estimate of the distribution of $h_{noise}$, and quantiles of $h^*_{noise}$ provide confidence intervals for the noise entropy.

The appropriate procedure for choosing B is subtle because the neural response contains significant temporal correlations across phase time. These correlations may be due to refractory neural dynamics as well as to the effect of nearly overlapping stimuli within small shifts of phase time. These correlations in phase time create correlations between $\hat h_{noise}(t)$ and $\hat h_{noise}(t+1)$ and are reflected in the raster plot by the thickness of each burst of activity (or silence) in Figure 4a, top panel. A naive bootstrapping procedure for B is to draw random time indices uniformly (with replacement) from $[1, N_S]$. This standard procedure would result in the raster plot in Figure 4a, middle panel. Clearly, this procedure fails to capture the autocorrelation structure evident in Figure 4a, top panel. Correspondingly, when significant response autocorrelation exists, a naive bootstrap that presumes pointwise independence in phase time will profoundly underestimate the true variance of $\hat h_{noise}$ across the underlying stimulus process.

To account for this significant correlation, we propose drawing bootstrapped phase times using a procedure developed by Politis and Romano for correlated time series (Politis & Romano, 1994; Politis & White, 2004). The Politis-Romano (PR) bootstrap procedure creates surrogate time indices B consisting of varying-length blocks of consecutive phase times (each block starting at a random phase time), whose purpose is to capture most of the autocorrelation structure in $\hat h_{noise}(t)$ (Figures 1i and 1j; see appendix A for details). This correlation structure can be quite large at small bin sizes or with stimuli exhibiting long temporal correlations (e.g., natural stimuli). Returning to Figure 4a, the phase time indices B are now resampled using the PR bootstrap. This bootstrap procedure captures the autocorrelation structure evident in the response (Figure 4a, bottom panel). With this bootstrap, the confidence interval calculated as quantiles on equation 3.5 better captures the variability due to the underlying stimulus process. The consequences of this source of variability are discussed further in the application of this estimator.
⁵ Note that if one did not want to include the stimulus variation in the confidence interval, one would select $B_t = t$.
captures the variability due to the underlying stimulus process. The consequences of this source of variability are discussed further in the application of this estimator.

3.4 Conditioning Noise Entropy Rate on a Fixed Spike Rate

Entropy rates are significantly correlated with the spike rate. For example, periods of silence in a raster plot correspond to ĥ_noise(t) = 0, reflecting practically deterministic activity (Kennel et al., 2005). Thus, a sample h*_boot will contain, by random chance, phase time indices B with fewer or greater periods of absolute silence, significantly affecting its value. Figure 4b (left panel) demonstrates this empirically linear, systematic effect. This source of variability is artificially introduced by the resampling procedure: samples drawn with larger (smaller) proportions of phase times B with spiking activity exhibit higher (lower) entropy rates.

We may null out this source of variability by adapting the resampling procedure. Conceptually, we are now estimating h(R|S, rate = r_0), where r_0 is the empirically estimated average spike rate. For each moment in phase time t, we estimate the instantaneous spike rate r(t). For each bootstrap sample h*_boot, there is a corresponding set of phase time indices B and a sample spike rate r* = (1/N_S) Σ_t r(B_t). We fit a linear Ansatz between the sample spike rate and entropy rate, h*_boot ≈ α + β(r* − r_0), where {α, β} are determined by ordinary least squares (see Figure 4b, left panel, dashed line). Each entropy rate sample is then corrected as if it had spike rate r_0:

    h**_boot = h*_boot + β(r_0 − r*).                                        (3.6)
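A sketch of this correction (assuming the linear Ansatz above; the function and variable names are ours):

```python
import numpy as np

def rate_condition(h_boot, r_star, r0):
    """Correct bootstrap entropy samples to a fixed spike rate (equation 3.6).

    h_boot : bootstrap noise entropy samples h*_boot.
    r_star : per-sample mean spike rates r* = (1/N_S) sum_t r(B_t).
    r0     : empirically estimated average spike rate.
    """
    h_boot, r_star = np.asarray(h_boot), np.asarray(r_star)
    # ordinary least squares fit of h*_boot ~ alpha + beta * (r* - r0)
    beta, alpha = np.polyfit(r_star - r0, h_boot, 1)
    return h_boot + beta * (r0 - r_star)              # equation 3.6
```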
These corrected samples can be plugged into equation 3.5, replacing h*_boot. This procedure corrects for the variability introduced by the bootstrap procedure, which manifests primarily through the altered spike rate; removing this effect effectively shrinks the confidence interval of the noise entropy distribution.

3.5 Estimating Mutual Information Rates

We now have all of the elements necessary to estimate the specific and average information rates. We remind the reader that the ergodic assumption in section 3.2 equated a specific stimulus history with a particular moment in phase time (i.e., s_hist ∼ t). Because the total and noise entropy rate distributions are estimated from separately recorded experimental data sets, we can simulate samples of the likely information rates by drawing independent samples from the posterior distributions and taking their difference:

    I* = h*_total − h*_noise,                                                (3.7)
    I*_sp(s) = h*_total − h*_noise(t).                                       (3.8)
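The differencing of independent posterior draws can be sketched as follows (a minimal illustration under the independence assumption stated above; names hypothetical):

```python
import numpy as np

def information_rate_samples(h_total, h_noise, rng=None):
    """Samples of I* = h*_total - h*_noise (equation 3.7), obtained by
    pairing independent draws from the two posterior sample sets."""
    rng = np.random.default_rng(rng)
    h_total, h_noise = np.asarray(h_total), np.asarray(h_noise)
    n = min(len(h_total), len(h_noise))
    i_star = rng.permutation(h_total)[:n] - rng.permutation(h_noise)[:n]
    # mean gives the estimated rate; central quantiles give the interval
    return i_star.mean(), np.quantile(i_star, [0.025, 0.975])
```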
This step is diagrammed in Figure 1n. The mean of these samples gives the estimated information rate, and the central portion gives a mixed Bayesian and bootstrap confidence interval on the range of information rates consistent with the data. We now test these ideas in simulation and highlight results on an experimental neural data set.

4 Testing the Estimator with Simulation

In previous work (Kennel et al., 2005), we validated and compared the performance of our estimator for the total entropy rate using simulations of nonlinear dynamical systems. In summary, we found that our estimate was asymptotically unbiased and converged reliably and rapidly compared to other estimators (Strong et al., 1998; Lempel & Ziv, 1976; Amigo, Szczepanski, Wajnryb, & Sanchez-Vives, 2004; Kennel & Mees, 2002; London et al., 2002; Kontoyiannis, Algoet, Suhov, & Wyner, 1998). Furthermore, the size of the confidence interval provided by the entropy rate estimator well matched the variation in entropy rate under sample fluctuations. We now examine the convergence of our estimators and the confidence intervals of ĥ_total and ĥ_noise using a simulation of an inhomogeneous Poisson neuron with an absolute and relative refractory period (Nemenman, Bialek, & de Ruyter van Steveninck, 2004; Victor, 2002). The parameters of this model were chosen to approximate the statistics of a real neuron binned at δt = 4 ms; see appendix B for details. The goals of these simulations are to (1) validate our estimates of ĥ_total and ĥ_noise and their associated confidence intervals and (2) judge the amount of data required for a good estimate on real neural data.

4.1 Testing Convergence

We examined the convergence of ĥ_total using an inhomogeneous refractory Poisson simulation whose parameters are fit to model the temporal dynamics of a recorded neuron (see appendix B). In previous work we determined that this estimator is asymptotically unbiased (Kennel et al., 2005). We judge the convergence and consistency of the estimator about an approximate truth (Nδt = 4000 s, N_R = 3000 trials, N_S δt = 32 s; see also Nemenman et al., 2004) by examining the estimator bias and variance. For each duration Nδt, we generated 100 independent data sets and computed ĥ_total for each data set. In Figure 5a, the estimator bias is the difference between approximate truth and the mean of these 100 independent estimates. The estimator variance is the square of the accompanying error bar, and it measures the intrinsic fluctuations in this random process. We emphasize that this error bar is not the estimated confidence interval, but rather the actual posterior distribution of entropy rates, which we will estimate later. Over increasing Nδt, the estimator variance decreases, and within 125 s of data, the bias and variance (normalized by ĥ_total) are, respectively, less than 0.2% and 1.3%, suggesting a well-converged estimate.
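The convergence experiment just described can be summarized in a small harness like the following (a sketch only; `simulate` and `estimate` stand for a data generator and an entropy rate estimator, which are not specified here):

```python
import numpy as np

def convergence_curve(simulate, estimate, durations, truth, n_sets=100):
    """For each data duration, generate n_sets independent data sets,
    estimate the entropy rate on each, and report the ensemble bias
    (mean minus approximate truth) and variance."""
    results = []
    for T in durations:
        est = np.array([estimate(simulate(T)) for _ in range(n_sets)])
        results.append((T, est.mean() - truth, est.var()))
    return results
```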
Figure 5: Convergence of entropy rate estimates on a simulated refractory Poisson neuron. Shown are ensembles over 100 different data sets; error bars give sample standard deviations over the data set ensembles. The dashed line is approximate truth (Nδt = 4000 s, N_R = 3000 trials, N_S δt = 32 s). (a) Convergence of ĥ_total in N. (b) Convergence of ĥ_noise in N_S for a fixed number of trials N_R = 400. (c) Convergence of ĥ_noise in N_R for a fixed trial duration N_S δt = 8 s.
Examining the convergence of ĥ_noise is more subtle because the estimator contains two parameters, N_R and N_S. Because the convergence is difficult to visualize in multiple dimensions, we instead examine the convergence along each parameter while holding the other fixed at a reasonable value.
In Figure 5b, we again generated 100 independent data sets. This time each data set is an entire raster plot with N_S varied and N_R = 400 trials fixed. Each data set contains a new instantiation of the random process, as well as a new driving firing rate (λ2) over N_S δt seconds (see appendix B). The bias is the difference between approximate truth and the average of these 100 estimates. The estimator variance is again the width of the posterior distribution of entropy rates, although the posterior distribution of ĥ_noise is calculated by applying the stationary bootstrap (see section 3.3) to account for correlations in time. The estimator bias remains small (< 1% of ĥ_noise), while the estimator variance shrinks as N_S increases.6

Finally, in Figure 5c, we judge the convergence of ĥ_noise over N_R while holding N_S δt = 8 s fixed. In this panel, each data set is a new instantiation of the random process but retains the same firing rate (λ1). Unlike Figure 5b, we do not select a new firing rate time series, because we are interested in isolating the significant effect of N_R on bias; drawing new firing rate time series would partially obscure this effect by submerging it in additional variance. The estimator bias is profound at small N_R, dominating the magnitude of the variance (Paninski, 2003a); however, by N_R = 400 trials, the bias is well contained within the variance of the estimate. Thus, N_R dominates the bias in ĥ_noise, while N_S dominates the variance, almost independently. Accordingly, variations across phase time (or stimulus histories s_hist), but not the error bars for individual ĥ_noise(t), explain most of the variance in ĥ_noise. We return to this point in section 6 in discussing how to select experimental parameters for estimating these quantities.

4.2 Testing the Confidence Interval

We now examine how well the confidence interval, which can be estimated from a single data set, replicates the actual variation of the underlying estimate seen in a large ensemble of new data sets. To test the confidence intervals for ĥ_total, we again generate 100 independent data sets with a fixed Nδt. For each data set, we calculate ĥ_total and the associated single-trial confidence interval. We plot these estimates as circles and error bars sorted by increasing ĥ_total in Figure 6a (Nδt = 125 s). First, we note that approximate truth (dashed line) is contained within 90% of the individual confidence intervals, indicating that the estimated confidence interval is well calibrated. To summarize these results, we consider the true distribution of this statistic (or posterior distribution of ĥ_total) to be the distribution of the ĥ_total values (circles).

6 As an aside, notice that we are combining a Bayesian and a frequentist bootstrap method for the estimator and then testing it in a purely frequentist fashion (many draws from the source). The size and location of a Bayesian error bar need not match exactly the result from a frequentist-style experiment, but the general compatibility of the results here gives reassurance that there are no peculiar anomalies.
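The calibration test described above can be summarized numerically; the following is a minimal sketch (ours; the input names are hypothetical):

```python
import numpy as np

def calibration_summary(estimates, intervals, truth, q=0.90):
    """Fraction of single-data-set confidence intervals containing the
    approximate truth, the average interval limits, and the central
    q-quantile of the estimates (the 'actual posterior' of Figure 6a)."""
    lo, hi = np.asarray(intervals).T
    coverage = np.mean((lo <= truth) & (truth <= hi))
    posterior = np.quantile(estimates, [(1 - q) / 2, (1 + q) / 2])
    return coverage, (lo.mean(), hi.mean()), posterior
```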
Figure 6: Testing the confidence interval of ĥ_total. (a) Individual estimates and associated 90% confidence intervals for 100 independent data sets of length Nδt = 125 s (circles). The triangle marks the average of the individually estimated confidence intervals; the square marks the central 90% quantile of the distribution of ĥ_total across data sets, which measures the actual posterior of h_total. The horizontal dashed line is approximate truth (Nδt = 4000 s). (b) Comparison between average lower and upper confidence interval limits (triangles) and the posterior distribution (squares) across Nδt. Across varying data set lengths, confidence intervals are well calibrated and engulf approximate truth.
The central 90% quantile of this posterior distribution is denoted by the left-most square (see Figure 6a). We compare this error bar to the average lower and upper limits of the 100 individual confidence intervals (see Figure 6a, triangle). In Figure 6a, the average upper and lower limits contain approximate truth and match the estimated posterior distribution. We use these two statistics as a summary of the calibration and compare them across varying Nδt (see Figure 6b). Across varying Nδt, the average confidence interval matches the true posterior distribution and engulfs approximate truth.

We repeat this same procedure to examine the confidence interval of ĥ_noise across N_S and N_R (see Figure 7).
Figure 7: Testing the confidence interval of ĥ_noise across variations in N_S and N_R. (a) This figure follows Figure 6. Each circle is an independent data set (raster) with N_S δt = 8 s and N_R = 400 trials. (b) Comparison between average lower and upper confidence interval limits (triangle) and posterior distribution (square) across N_S δt with N_R = 400 fixed. (c) Comparison between average lower and upper confidence interval limits (triangle) and posterior distribution (square) across N_R with N_S δt = 8 s fixed. Across varying N_S and N_R, confidence intervals are well calibrated and engulf approximate truth (dashed line, N_R = 3000 trials, N_S δt = 32 s).
In Figure 7a, we again draw a new instantiation of the random process and the firing rate (λ2) for every sample, and we make use of the stationary bootstrap to account for correlations within ĥ_noise(t). In Figure 7c, we repeat the same procedure, this time varying N_R while keeping N_S δt = 8 s. This is in contrast to Figure 5, where we did not draw new firing rate time series. The reason is that previously, the purpose was to demonstrate only that bias depends primarily on N_R; the additional variability induced by draws of new time series from the firing rate process would have been irrelevant and would have obscured the effect we intended to demonstrate. Now, however, the magnitude of the fluctuations, and whether our estimation technology can account for them, is the central object of interest, and so it is appropriate to account for this source of variability in the data-generating process. In summary, we find that the hybrid Bayesian-frequentist confidence interval is well calibrated for ĥ_noise across a range of N_R and N_S in our example.

5 Results

We now bring these tools to bear on neural data from a guinea pig retinal ganglion cell (RGC). These data were recorded extracellularly with a multielectrode array (for experimental details, see Chichilnisky & Kalmar, 2003); full-field binary monitor flicker at 120 Hz was presented for a long duration (Nδt = 1000 s) and over repeated trials (N_S δt = 9.5 s, N_R = 660 trials) to sample the total and noise entropy rates, respectively. We isolated spike times of individual RGCs using a simple voltage threshold procedure and associated distinct clusters in a spike waveform feature space with individual neurons (Frechette et al., 2005). We present this analysis in full on a single RGC in Figures 8 to 11.

We selected the spike times of one well-isolated ON-type RGC (contamination rate < 0.02; Uzzell & Chichilnisky, 2004), binned at 4 ms, which appeared relatively stationary over the duration of the experiment (see Figure 8). We qualitatively judged stationarity by measuring fluctuations in the firing rate (7.9 ± 0.9 Hz) and the spike-triggered average (data not shown). In spite of this selective screening, it should be emphasized that nonstationarity exists to varying degrees in these recordings, as in most experimental recordings. This nonstationarity could result from either changes in experimental conditions (e.g., temperature, cell death) or adaptation over longer timescales (Fairhall et al., 2001; Baccus & Meister, 2002). Regardless, any of these variations could result in an increased noise entropy rate, a lack of convergence in our estimates, or an underestimation of the confidence interval. Although we attempted to mitigate these effects by careful control of experimental conditions, this issue is of potential concern because nonstationarity violates a central assumption in the definition of entropy rate.

5.1 Entropy Rates

Figure 9 examines the convergence of the entropy rate across increasing durations of data. Only a single data set is available;
Figure 8: ON-type guinea pig retinal ganglion cell. A small portion of the raster from repeated trials of a stimulus. The conservation of spike times across trials suggests that the neuron responds reliably and precisely to repeated presentations of the stimulus. The lower plot shows the specific information rate of the retinal ganglion cell. The black dot is the estimate, and the gray shadow is the 95% confidence interval. The average mutual information rate is 15.2 bits per second.
thus, a complete examination of convergence and confidence intervals of the estimator across ensembles of data sets is not possible. As a rough approximation, we instead artificially divide our single long data set into K independent, smaller data sets. In Figure 9, this analysis is performed on the total and noise entropy rates for K ≤ 8 fractions of the original data set. The final estimate using the complete data set (K = 1) is displayed with the horizontal dashed line. Individual subsampled estimates (circles) converge to the final estimate as the amount of data (N and N_R, respectively) increases. Single-trial 95% confidence intervals engulf the final estimate a vast majority of the time.
Figure 9: Convergence of entropy rate estimates (circles; error bar is 95% confidence interval). Each group of data points is an estimate from K ≤ 8 independent data sets subsampled from the complete data set. Horizontal lines are entropy rate estimates from our estimator (dashed) and from the direct method (dotted) (Strong et al., 1998). (a) Total entropy rate estimates across subsamples of N. (b) Noise entropy rate estimates across subsamples of N_R. Note that single-trial confidence intervals from subsampled data sets engulf the final estimate using the complete data set.
This indicates that the confidence interval provides a reasonable bound on the range of potential estimates, assuming more data were available. For the noise entropy rate (see Figure 9b), the confidence interval is calculated using the stationary bootstrap procedure with the firing rate correction. The reason for selecting this type of confidence interval is explored in Figure 10 by examining estimates from data sets subsampled across N_S. The ergodicity assumption (equation 3.2) requires that each trial be long enough to explore a reasonable range of response space, providing a good estimate of P(R|S) and consequently of ĥ_noise. This assumption should be reflected not only in the estimate (circle) but also in the confidence interval, such that different repeated presentations (i.e., subsampled N_S) do not substantially differ in their values.
Figure 10: Confidence intervals of noise entropy rate estimation (symbols follow Figure 9). Estimates use three different methods for calculating 95% confidence intervals. Each panel uses varying fractions K = 1, 2, 3, 4 of the complete data set to generate independent estimates. (a) Stationary bootstrap procedure. (b) Standard bootstrap procedure for phase times B. (c) Stationary bootstrap procedure correcting for spike rate. Note expanded axes in b.
Indeed, for the standard bootstrap procedure (see Figure 10b), the confidence interval would indicate otherwise, suggesting that for short stimulus presentations the ergodicity assumption is incorrect. In contrast, the correlated bootstrap procedure (see Figure 10a) properly accounts for this uncertainty: the error bars indicate that these independent samples are not statistically different. Furthermore, this result holds even when removing the systematic fluctuations in the firing rate introduced by the bootstrap procedure (see Figure 10c).

We suggest that several features give encouragement that the entropy rate estimates are fairly well converged. First, diagnostics based on complexity measures introduced in Kennel et al. (2005) are well converged, providing a reasonable first check (data not shown). The convergence of these diagnostics is necessary but not sufficient for a well-converged entropy rate estimate. The simulations in the previous section are the strongest evidence for convergence: a model neuron with similar temporal dynamics, binned at the same temporal resolution, provides well-converged ĥ_noise and ĥ_total estimates within the total recording time of this neuron.

5.2 Information Rates of a Retinal Ganglion Cell

For this stimulus distribution, we estimate the average information rate I(S; R) to be 15.2 ± [1.1, 1.1] bits per second with 95% confidence.7 The total entropy rate of the response, h(R), is 46.5 ± [0.8, 0.8] bits per second, providing an average response efficiency I(S; R)/h(R) = 32.8 ± [2.8, 3.0]%. The response efficiency characterizes how much of the response capacity is exploited by the neuron. Similar response efficiencies have been reported in neurons across varying bin sizes (Borst & Theunissen, 1999).

The stimulus distribution has an entropy rate h(S) = 120 bits per second, because there is an equal chance of a black or white screen (1 bit) selected randomly at 120 Hz. That h(S) > h(R) implies that even if the neuron were noise free (i.e., h(R|S) = 0), at a resolution of 4 ms the neuron would not contain a large enough repertoire of spike patterns to losslessly transmit the stimulus. The stimulus efficiency, I(S; R)/h(S) = 12.7 ± [0.9, 0.9]%, characterizes how much of the stimulus the spike train transmits. The low stimulus efficiency suggests two possibilities: the neuron failed to transmit all of the information about the stimulus, or the bin size of the spike train is too large to extract all of the information about the stimulus. The absolute refractory period of the neuron sets a rough guide to the relevant timescale for the bin size. Recent work has suggested that I(S; R) must be calculated for bin sizes smaller than the refractory period in order to extract all information about a stimulus within a spike train (Strong et al., 1998; Reinagel & Reid, 2000; Liu, Tzonev, Rebrik, & Miller, 2001).

7 Because our error bars can be asymmetric, we abbreviate the lower and upper portions of the error bar as the first and second items within the brackets.
Figure 11: Reconstruction of a binary stimulus. In the left panel, a short segment of the displayed light intensity waveform of the full-field stimulus S (black line) and the corresponding RGC spike train (below). We calculate the reconstructed stimulus Sˆ (black dots) by convolving the discretized spike train (δt = 4 ms) with a filter (right panel) and thresholding the resulting value to a high or low stimulus intensity. The filter is the reverse correlation of the spike train with the stimulus divided by the autocorrelation of the response (Rieke et al., 1997). This simple decoding procedure recovers at least 3.3 bits per second, or 21.9 ± [1.5, 1.7]% of the available information.
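A rough sketch of the decoding procedure shown in Figure 11 follows (our illustration, not the authors' code: the filter is estimated in the frequency domain as the stimulus-response cross-spectrum divided by the response power spectrum; causality and the exact threshold are glossed over, and thresholding at the median is an assumption):

```python
import numpy as np

def linear_reconstruction(spikes, stimulus, n_taps=50, eps=1e-9):
    """Estimate a linear filter from spike train to stimulus and threshold
    the filtered spike train to a binary stimulus estimate."""
    spikes = np.asarray(spikes, dtype=float)
    S, R = np.fft.rfft(stimulus), np.fft.rfft(spikes)
    W = S * np.conj(R) / (np.abs(R) ** 2 + eps)   # cross-spectrum / power
    filt = np.fft.irfft(W, n=len(spikes))[:n_taps]
    decoded = np.convolve(spikes, filt, mode="same")
    return decoded > np.median(decoded)           # binary high/low estimate
```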
However, estimating information rates at smaller bin sizes requires larger effective conditioning depths D and thus entails larger data requirements (Kennel et al., 2005). Most important, the response kinetics of ON-type guinea pig RGCs are slow compared to the refresh interval of the stimulus (see Figure 11). Thus, we hypothesize that the stimulus is probably flickering too rapidly for the RGC to respond accurately.

Average information rates mask the contributions of individual stimuli and responses. Recent work has investigated several extensions of information rates to measure how well particular stimuli are encoded (DeWeese & Meister, 1999; Brenner, Strong et al., 2000; Butts, 2003). A trivial extension of our estimator is to estimate the specific information rate I_sp(s), or the contribution of individual stimuli. For a small segment of the raster in Figure 8, we plotted I_sp(s) for comparison. The gray shadow is the 95% confidence interval and is largely dominated by the uncertainty in ĥ_noise. Stimuli eliciting silence provide the largest amount of information about the response because the neural response is reliably silent (Butts, 2003). During periods of silence, h(R|t) = ĥ_noise(t) = 0, implying that the specific information rate plateaus at I_sp(s) = h(R) − 0 = h_total.
During periods of bursting activity, ĥ_noise > ĥ_total, giving a negative specific information, I_sp(s) < 0. Negative specific information is not only possible but necessary to maintain additivity, a central assumption of information theory (Shannon, 1948; Brillouin, 1956; Fano, 1961; DeWeese & Meister, 1999; Abarbanel, Masuda, Rabinovich, & Tumer, 2001). The sign and distribution of specific information rates reflect the distribution of stimuli selected in the experiment.

5.3 Comparison to a Decoder

A principal feature of the average mutual information rate I(S; R) is that it involves no reconstruction of the stimulus and requires no specific model for interpreting a spike train. However, I(S; R) does bound the performance of any hypothesized decoding algorithm Ŝ = F(R), whether biological or artificial; that is, it measures the performance an optimal decoder could achieve at a temporal resolution δt. We present a comparison of I(S; R) to a simple decoding procedure based on linear reconstruction to illustrate this point.

This decoding procedure reconstructs from the spike train R an estimate of the stimulus waveform Ŝ = {ŝ_1, ŝ_2, . . .}, where the subscript indexes time. The performance of this artificial decoder can be judged on an absolute scale relative to the available information, I(S; R) (Buracas et al., 1998; Warland et al., 1997; Abarbanel & Talathi, 2006). Because the original stimulus waveform is binary, we constructed a simple decoder F by convolving the spike train with a filter and thresholding the resulting value to a high or low stimulus value. Figure 11 shows a segment of the resulting reconstruction. We calculated the filter using the first half of the data (Nδt = 500 s); all subsequent calculations used the second half of the data set. The filter (Figure 11, right panel) is the optimal causal linear estimate of the stimulus given the spike train, calculated by reverse-correlating the spike train with the stimulus and dividing by the autocorrelation of the response (Rieke et al., 1997).

As a preliminary comparison, we can measure the ability of the decoder F to reconstruct the stimulus against the best possible predictive power dictated by the information rate. We define the error as a binary random variable E ≡ δ(S = Ŝ), where 1 and 0 indicate a correct and an incorrect prediction, respectively. Fano's inequality (Cover & Thomas, 1991) guarantees that for a binary stimulus distribution, the probability of an error, P(E), is implicitly bounded from below:

    H(E) ≥ δt h(S|R) = δt [h(S) − I(S; R)].                                  (5.1)
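To make the bound concrete, here is a sketch of the numerical inversion (ours; note that the resulting number depends on the δt convention adopted for the per-bin error variable, which we treat as an assumption):

```python
import numpy as np
from scipy.optimize import brentq

def fano_lower_bound(h_S, I_rate, dt):
    """Invert equation 5.1 for a binary stimulus: solve H2(p) = dt * (h_S - I)
    on the lower branch p in (0, 1/2].  Rates in bits/s, dt in seconds."""
    h2 = lambda p: -p * np.log2(p) - (1 - p) * np.log2(1 - p)
    target = dt * (h_S - I_rate)
    if target >= 1.0:              # one bit per bin is the maximum entropy
        return 0.5
    return brentq(lambda p: h2(p) - target, 1e-12, 0.5)
```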
The entropy rate of the stimulus, h(S), is 120 bits per second by experimental design. The difference between h(S) and I(S; R), the latter estimated from data, gives a lower bound on the entropy rate of the error. Solving equation 5.1 for P(E) with our results gives a lower bound P(E) ≥ 37.8%. Empirically, we find our simple decoder failed to predict the stimulus 40.5% of the time.

We also characterize the performance of the decoder in absolute terms by examining the reduction in stimulus space to calculate a lower bound on the information attributed to the decoder, I(S; F(R)) = h(S) − h(S|F(R)). We can overestimate h(S|F(R)) by assuming the stimulus estimate Ŝ is independently and identically distributed (i.i.d.). This assumption ensures that our estimate of I(S; F(R)) is a conservative lower bound:

    I(S; F(R)) = h(S) − (1/δt) H(s_t | s_hist, ŝ_t, ŝ_hist)
               = h(S) − (1/δt) H(s_t | ŝ_t, ŝ_hist)
               ≥ h(S) − (1/δt) H(s_t | ŝ_t)
               = h(S) − (1/δt) H(S | Ŝ),                                     (5.2)
where we have recognized that S is i.i.d., and the subscript hist denotes all previous values in time up to but not including time t. By the data processing inequality, we know that I(S; F(R)) ≤ I(S; R), with equality if and only if F captures all of the information contained in R (that is, if F(R) is a sufficient statistic of R). For this simple decoding procedure, we recover at least 3.3 bits per second, or 21.9 ± [1.5, 1.7]%, of the available information. Building a better decoding procedure through dimensional reduction of stimulus space or other biological priors could recover a greater percentage of this available information and decrease the probability of reconstruction error (Bialek et al., 1991; Aguera y Arcas & Fairhall, 2003; Simoncelli, Paninski, Pillow, & Schwartz, 2004; Abarbanel & Talathi, 2006).

6 Discussion

We have discussed how to estimate information rates in neural spike trains with a confidence interval by extending an estimator introduced in a related
work (Kennel et al., 2005).8 We apply a time series bootstrap procedure to estimate the uncertainty in the noise entropy under the (usually unrecognized) assumption that the presented stimulus is but a single example from some underlying distribution. We combined these ideas to calculate a point-wise specific information rate I_sp(s) and an average information rate I(S; R) in an experimental data set. We calculated the stimulus and response efficiencies to characterize how well the neuron conveys information about the stimulus and how well the neuron exploits its capacity to send information. Finally, we compared I(S; R) to the performance of a simple decoder based on linear reconstruction.

Testing these ideas in simulation, we developed an empirical means of judging convergence in real spike trains. Importantly, we showed in simulation for noise entropy estimation that bias is dominated by the number of trials N_R, while variance is independently dominated by the duration of each trial N_S. This bias-variance trade-off suggests a systematic strategy for determining experimental parameters for estimating the noise entropy rate. First, determine in simulation the minimal number of trials for convergence at a selected bin size (see Figures 1a and 1b). Second, given a fixed experiment duration, maximize the duration of each stimulus presentation in order to minimize the estimator variance. Ideally, the duration of each presentation should be far larger than the correlation length of the neural response to ensure ergodicity (see equation 3.2). Finally, on real spike trains with inherent nonstationarity, our estimator appears convergent, and the confidence interval captures the range of reasonable variation assuming the source were stationary. Mitigating the effects of nonstationarity in the estimate is thus the focus of our experiment and potential future analysis.

Our estimator is a methodological advance because it (1) empirically converges quickly to the entropy rate (Kennel et al., 2005), (2) removes heuristic judgments of conditioning depth D (Schurmann & Grassberger, 1996; Strong et al., 1998), and (3) provides confidence intervals about its estimate. The first point is imperative for any estimator because it minimizes data requirements, thereby increasing confidence in its accuracy. The second point is a special case of judging the appropriate model complexity for a neural spike train. This point removes heuristics and subjective assessments often relied on in practice in the application of the direct method (Reinagel & Reid, 2000; Lewen et al., 2001). These heuristics either ignore the model selection problem or reject an estimate post hoc based on inappropriate convergence behavior.
8 Source code as well as a Matlab interface for both Bayesian entropy rate and information rate estimation is available online at http://www.snl.salk.edu/∼shlens/infotheory.html.
For example, in the application of the direct method (Strong et al., 1998), physical arguments suggest that a linear extrapolation within a selected scaling region should provide a good estimate of the entropy rate, provided enough data samples exist. Thus, a common post hoc criterion is to reject this extrapolation if deviations from linearity become too large. One difficulty with this approach (aside from the subjective assessment of a scaling region and a linearity criterion) is that if this post hoc criterion is not met, that is, if a linear extrapolation is poor, no estimate of the entropy rate is provided at all. A potential example of this failure can be observed in the upper-right panel of Figure 2, where deviations from a linear extrapolation would be large. We view this shortcoming as fundamental because it skirts the larger problem of selecting the appropriate model complexity. Selecting the appropriate probabilistic model complexity for a spike train allows one to generalize or sample the range of plausible entropy rates associated with the spike train, thus fulfilling the third point above.

The computation of confidence intervals for information rates is unique to our estimator. A confidence interval provides a means for comparing information rates in a statistically significant manner. The lack of a reliable, heuristic-free estimator with an associated error bar has complicated the application of information theory in empirical data analysis.

Immediate applications of this estimator include several avenues of research where the lack of a reliable estimator has made analysis difficult. This technique can be applied generally to estimate the information rate in any situation with a symbolic observable and the opportunity to repeat stimuli. We now discuss immediate applications in neuroscience. One important application is to measure the temporal precision of spike times by comparing information rates across high temporal resolutions (Strong et al., 1998; Reinagel & Reid, 2000; Liu et al., 2001). The selection of the optimal bin size (or any discretization; Rapp et al., 1994) to represent a spike train remains an open question, potentially resolved through model selection criteria (Rissanen, 1989). A second avenue is to explore the correlation structure between neurons by comparing information rates across multiple neurons (Puchalla, Schneidman, Harris, & Berry, 2005; Dan, Alonso, Usrey, & Reid, 1998; Gat & Tishby, 1999; Reich et al., 2001; Schneidman et al., 2003). Furthermore, the extension of this estimator to calculate specific information quantities (Bezzi et al., 2002; DeWeese & Meister, 1999; Brenner, Strong et al., 2000; Butts, 2003) suggests the possibility of actively searching through stimulus space for features that elicit high temporal precision in single neurons or unique correlational structure between multiple neurons.

The benefits of this new estimator arise from having a justified theory to select the appropriate model complexity for the estimated probabilistic spiking response P(R|S). Although our model of spiking activity P(R|S) is statistical and not biophysical, it does highlight how dynamics restrict the production of uncertainty in any neuron and, with further calculation, provides an upper bound to judge the quality of any decoding mechanism, whether experimentally derived or downstream in sensory processing.
Appendix A: The Politis-Romano (PR) Bootstrap

The PR bootstrap is a numerical procedure used to generate correlated bootstrap samples from a time series (Politis & Romano, 1994; Davison et al., 1997). In this appendix, Q(t) is a time series discretely sampled at t = [1, . . . , t_max]. The bootstrap random variable B = (B_1, B_2, . . .) is a list of time indices that preserves some of the autocorrelation in Q(t). The first value, B_1, is chosen at random from the integer time indices. Given an already selected time index B_i, with probability p = 1/L (or with p = 1 if B_i = t_max), we choose B_{i+1} randomly and uniformly; otherwise, B_{i+1} = B_i + 1. This procedure generates a bootstrap draw B composed of varying-length blocks of successive time indices. The bootstrap resample of the time series is thus Q(B_1), Q(B_2), . . . The lengths of these blocks are exponentially distributed with mean length L.

L is a free parameter chosen to reflect the duration of the correlation structure in the time series Q(t). In the statistical literature, the PR method is often called the stationary bootstrap because the bootstrap process generating B is statistically stationary, as opposed to the fixed-L block bootstrap methods previously used, which are only cyclically stationary. Several algorithms select L automatically given an observed time series (Politis & White, 2004). In practice, a simple but common heuristic is to select L to be a few times the width of the autocorrelation peak of Q(t). In the raster plots, L is set to twice the width of the autocorrelation, across phase time, of the spike rate r(t) or ĥ_noise(t). Empirically, our results do not seem to depend significantly on the specific choice of L within a wide range of reasonable values. We have also used the software in Politis and White (2004) to estimate L, which results in no appreciable change in the results presented here.
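A minimal implementation of this procedure (ours; 1-based indices to match the notation above):

```python
import numpy as np

def stationary_bootstrap_indices(t_max, L, rng=None):
    """Politis-Romano stationary bootstrap: indices in [1, t_max] drawn in
    blocks of consecutive values whose lengths are geometric with mean L."""
    rng = np.random.default_rng(rng)
    B = np.empty(t_max, dtype=int)
    B[0] = rng.integers(1, t_max + 1)            # B1 uniform on [1, t_max]
    for i in range(1, t_max):
        if B[i - 1] == t_max or rng.random() < 1.0 / L:
            B[i] = rng.integers(1, t_max + 1)    # start a new block
        else:
            B[i] = B[i - 1] + 1                  # continue the current block
    return B
```

A bootstrap resample of a 0-based array Q is then simply Q[B - 1].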
Appendix B: Simulation of a Refractory Poisson Neuron

In section 4 we use a single model of a neuron to judge the convergence and quality of our estimate of entropy rates. Our model neuron is an inhomogeneous Poisson process with an absolute and relative refractory period. The goal of this model is solely to provide an approximation of the temporal dynamics of a neuron (but see Berry & Meister, 1998; Keat, Reinagel, Reid, & Meister, 2001; Pillow & Simoncelli, 2003; Paninski, Pillow, & Simoncelli, 2004). The parameters that must be specified are the relative (and absolute) refractory periods and the firing rate over time. We matched the absolute refractory period of a real neuron (4 ms) by identifying the smallest interspike interval in the autocorrelation of the spike train. We used a sigmoidal recovery function fit to the firing statistics of the spike train to match the relative refractory period of the neuron (Uzzell & Chichilnisky, 2004).
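One way such a model can be written down is sketched below (ours; the sigmoidal recovery function and its parameters are placeholders, not the fitted values used in the paper):

```python
import numpy as np

def refractory_poisson(rate, abs_ref_bins=1, recovery=None, rng=None):
    """Inhomogeneous Poisson spike train with absolute and relative
    refractoriness.  rate[t] is the per-bin firing probability; after a
    spike, firing is suppressed for abs_ref_bins bins and then scaled by
    a recovery function of the number of bins since the last spike."""
    rng = np.random.default_rng(rng)
    if recovery is None:                       # placeholder sigmoid
        recovery = lambda k: 1.0 / (1.0 + np.exp(-(k - 2.0)))
    spikes = np.zeros(len(rate), dtype=int)
    since = 10 ** 6                            # bins since the last spike
    for t, p in enumerate(rate):
        gain = 0.0 if since < abs_ref_bins else recovery(since)
        if rng.random() < p * gain:
            spikes[t], since = 1, 0
        else:
            since += 1
    return spikes
```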
We approximated the firing rate using two separate methods. In the first method, we measured the probability of firing in the PSTH of a cell in response to 385 repetitions of an identical stimulus 8 seconds in duration. We termed this firing rate time series λ1. The firing rate λ1 is effectively a probability of a spike within the bin size δt. As can be seen from a raster plot, this quantity varies significantly with the stimulus history. Because λ1 has a limited duration, we generated a second, artificial firing rate λ2 of arbitrary length. We generated λ2 by applying the correlated bootstrap method (see appendix A) to λ1, preserving its correlation structure. We used λ2 to test the estimator convergence over long durations of time.

Acknowledgments

This work was supported by NSF IGERT training grant DGE-0333451 and La Jolla Interfaces in the Sciences (J.S.) and a Sloan Research Fellowship (E.J.C.). We thank J. Victor and P. Reinagel for valuable discussions; our reviewers for substantive comments and feedback; and colleagues at the Institute for Nonlinear Science, Santa Cruz Institute for Particle Physics, and the Systems Neurobiology Laboratory (Salk Institute) for tremendous feedback and technical assistance.
References

Abarbanel, H. D. I., Masuda, N., Rabinovich, M. I., & Tumer, E. (2001). Distribution of mutual information. Physics Letters A, 281, 368–373.
Abarbanel, H. D. I., & Talathi, S. S. (2006). Neural circuitry for recognizing interspike interval sequences. Physical Review Letters, 96, 148104.
Aguera y Arcas, B., & Fairhall, A. (2003). Computation in a single neuron: Hodgkin and Huxley revisited. Neural Computation, 15, 1715–1749.
Amigo, J. M., Szczepanski, J., Wajnryb, E., & Sanchez-Vives, M. (2004). Estimating the entropy rate of spike trains via Lempel-Ziv complexity. Neural Computation, 16, 717–736.
Baccus, S. A., & Meister, M. (2002). Fast and slow contrast adaptation in retinal circuitry. Neuron, 36, 909–919.
Bair, W., & Koch, C. (1996). Temporal precision of spike trains in extrastriate cortex of the behaving macaque monkey. Neural Computation, 8, 1185–1202.
Berry, M. J., & Meister, M. (1998). Refractoriness and neural precision. Journal of Neuroscience, 18, 2200–2211.
Berry, M. J., Warland, D. K., & Meister, M. (1997). The structure and precision of retinal spike trains. Proceedings of the National Academy of Sciences, 94, 5411–5416.
Bezzi, M., Samengo, I., Leutgeb, S., & Mizumori, S. J. Y. (2002). Measuring information spatial densities. Neural Computation, 14, 405–420.
Bialek, W., Rieke, F., de Ruyter van Steveninck, R., & Warland, D. (1991). Reading a neural code. Science, 252, 1854–1857.
Borst, A., & Theunissen, F. (1999). Information theory and neural coding. Nature Neuroscience, 2, 947–957.
Brenner, N., Bialek, W., & de Ruyter van Steveninck, R. (2000). Adaptive rescaling maximizes information transmission. Neuron, 26, 695–702.
Brenner, N., Strong, S. P., Koberle, R., Bialek, W., & de Ruyter van Steveninck, R. (2000). Synergy in a neural code. Neural Computation, 12, 1531–1552.
Brillouin, L. (1956). Science and information theory. Orlando, FL: Academic Press.
Buracas, G., Zador, A., DeWeese, M., & Albright, T. (1998). Efficient discrimination of temporal patterns by motion-sensitive neurons in primate visual cortex. Neuron, 20, 959–969.
Butts, D. (2003). How much information is associated with a particular stimulus? Network, 14, 177–187.
Chichilnisky, E. J., & Kalmar, R. (2003). Temporal resolution of ensemble visual motion signals in primate retina. Journal of Neuroscience, 23, 6681–6689.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Dan, Y., Alonso, J. M., Usrey, W. M., & Reid, R. C. (1998). Coding of visual information by precisely correlated spikes in the lateral geniculate nucleus. Nature Neuroscience, 1, 501–507.
Davison, A., Hinkley, D., Gill, R., Ripley, B., Ross, S., Stein, M., & Williams, D. (1997). Bootstrap methods and their application. Cambridge: Cambridge University Press.
de Ruyter van Steveninck, R., & Bialek, W. (1988). Real-time performance of a movement sensitive neuron in the blowfly visual system: Coding and information transfer in short spike sequences. Proceedings of the Royal Society of London, Series B, 234, 379–414.
de Ruyter van Steveninck, R., Lewen, G. D., Strong, S. P., Koberle, R., & Bialek, W. (1997). Reproducibility and variability in neural spike trains. Science, 275, 1805–1808.
DeWeese, M., & Meister, M. (1999). How to measure the information gained from one symbol. Network, 10, 325–340.
Dimitrov, A. G., Miller, J. P., Gedeon, T., Aldworth, Z., & Parker, A. E. (2003). Analysis of neural coding through quantization with an information-based distortion measure. Network: Computation in Neural Systems, 14, 151–176.
Duda, R. O., Hart, P. E., & Stork, D. G. (2001). Pattern classification. New York: Wiley.
Fairhall, A., Lewen, G., Bialek, W., & de Ruyter van Steveninck, R. (2001). Efficiency and ambiguity in an adaptive neural code. Nature, 412, 787–792.
Fano, R. M. (1961). Transmission of information: A statistical theory of communications. Cambridge, MA: MIT Press.
Fellous, J. M., Tiesinga, P. H. E., Thomas, P. J., & Sejnowski, T. J. (2004). Discovering spike patterns in neuronal responses. Journal of Neuroscience, 24, 2989–3001.
Frechette, E. S., Sher, A., Grivich, M. I., Petrusca, D., Litke, A. M., & Chichilnisky, E. J. (2005). Fidelity of the ensemble code for visual motion in primate retina. Journal of Neurophysiology, 94, 119–135.
Gamerman, D. (1997). Markov chain Monte Carlo: Stochastic simulation of Bayesian inference. New York: CRC Press.
Gat, I., & Tishby, N. (1999). Synergy and redundancy among brain cells of behaving monkeys. In M. J. Kearns, S. Solla, & D. Cohn (Eds.), Advances in neural information processing systems, 11 (pp. 465–471). Cambridge, MA: MIT Press.
Gawne, T. J., & Richmond, B. J. (1993). How independent are the messages carried by adjacent inferior temporal cortical neurons? Journal of Neuroscience, 13, 2758–2771.
Gilmore, R., & Lefranc, M. (2002). The topology of chaos: Alice in stretch and squeeze land. New York: Wiley.
Hastings, W. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57, 97–109.
Hilborn, R. (2000). Chaos and nonlinear dynamics: An introduction for scientists and engineers. New York: Oxford University Press.
Keat, J., Reinagel, P., Reid, R. C., & Meister, M. (2001). Predicting every spike: A model for the responses of visual neurons. Neuron, 30, 803–817.
Kennel, M. B., & Mees, A. I. (2002). Context-tree modeling of observed symbolic dynamics. Physical Review E, 66, 056209.
Kennel, M., Shlens, J., Abarbanel, H. D. I., & Chichilnisky, E. J. (2005). Estimating entropy rates with Bayesian confidence intervals. Neural Computation, 17, 1531–1576.
Kontoyiannis, I., Algoet, P., Suhov, Y., & Wyner, A. (1998). Nonparametric entropy estimation for stationary processes and random fields with applications to English text. IEEE Transactions on Information Theory, 44, 1319–1327.
Lempel, A., & Ziv, J. (1976). On the complexity of finite sequences. IEEE Transactions on Information Theory, 22, 75–81.
Lewen, G. D., Bialek, W., & de Ruyter van Steveninck, R. R. (2001). Neural coding of naturalistic motion stimuli. Network: Computation in Neural Systems, 12, 317–329.
Lind, D., & Marcus, B. (1996). Symbolic dynamics and coding. Cambridge: Cambridge University Press.
Liu, R. C., Tzonev, S., Rebrik, S., & Miller, K. D. (2001). Variability and information in a neural code of the cat lateral geniculate nucleus. Journal of Neurophysiology, 86, 2789–2806.
London, M., Schreibman, A., Hausser, M., Larkum, M., & Segev, I. (2002). The information efficacy of a synapse. Nature Neuroscience, 5, 332–340.
MacKay, D. (2003). Information theory, inference and learning algorithms. Cambridge: Cambridge University Press.
MacKay, D., & McCulloch, W. (1952). The limiting information capacity of a neuronal link. Bulletin of Mathematical Biophysics, 14, 127–135.
Mainen, Z., & Sejnowski, T. (1995). Reliability of spike timing in neocortical neurons. Science, 268, 1503–1506.
Mastronarde, D. N. (1989). Correlated firing of retinal ganglion cells. Trends in Neuroscience, 12, 75–80.
Meister, M., & Berry, M. J. (1999). The neural code of the retina. Neuron, 22, 435–450.
Meister, M., Lagnado, L., & Baylor, D. A. (1995). Concerted signaling by retinal ganglion cells. Science, 270, 1207–1210.
Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., & Teller, E. (1953). Equation of state calculations by fast computing machines. Journal of Chemical Physics, 21, 1087–1092.
Miller, G., & Madow, W. (1954). On the maximum likelihood estimate of the Shannon-Wiener measure of information. Air Force Cambridge Research Center Technical Report, 75, 54.
Nemenman, I., Bialek, W., & de Ruyter van Steveninck, R. (2004). Entropy and information in neural spike trains: Progress on the sampling problem. Physical Review E, 69, 056111.
Nemenman, I., Shafee, F., & Bialek, W. (2002). Entropy and inference, revisited. In T. G. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14. Cambridge, MA: MIT Press.
Nirenberg, S., & Latham, P. (2003). Decoding neuronal spike trains: How important are correlations? Proceedings of the National Academy of Sciences, 100, 7348–7353.
Ott, E. (2002). Chaos in dynamical systems. Cambridge: Cambridge University Press.
Paninski, L. (2003a). Estimation of entropy and mutual information. Neural Computation, 15, 1191–1254.
Paninski, L. (2003b). Convergence properties of three spike-triggered analysis techniques. Network: Computation in Neural Systems, 14, 437–464.
Paninski, L., Pillow, J., & Simoncelli, E. (2004). Maximum likelihood estimation of a stochastic integrate-and-fire neural model. Neural Computation, 16, 2533–2561.
Pillow, J., & Simoncelli, E. (2003). Biases in white noise analysis due to non-Poisson spike generation. Neurocomputing, 52, 109–115.
Politis, D. N., & Romano, J. P. (1994). The stationary bootstrap. Journal of the American Statistical Association, 89, 1303–1313.
Politis, D. N., & White, H. (2004). Automatic block-length selection for the dependent bootstrap. Econometric Reviews, 23, 53–70.
Puchalla, J., Schneidman, E., Harris, R., & Berry, M. J., II (2005). Redundancy in the population code of the retina. Neuron, 46, 493–504.
Rapp, P. E., Zimmerman, I. D., Vining, E. P., Cohen, N., Albano, A. M., & Jiménez-Montaño, M. A. (1994). The algorithmic complexity of neural spike trains increases during focal seizures. Journal of Neuroscience, 14, 4731–4739.
Reich, D. S., Mechler, F., & Victor, J. (2001). Information in nearby cortical neurons. Science, 294, 2566–2568.
Reinagel, P., & Reid, R. C. (2000). Temporal coding of visual information in the thalamus. Journal of Neuroscience, 20, 5392–5400.
Reinagel, P., & Reid, R. C. (2002). Precise firing events are conserved across neurons. Journal of Neuroscience, 22, 6837–6841.
Rieke, F., Bodnar, D., & Bialek, W. (1995). Naturalistic stimuli increase the rate and efficiency of information transmission by primary auditory afferents. Proceedings of the Royal Society of London B: Biological Sciences, 262, 259–265.
Rieke, F., Warland, D., de Ruyter van Steveninck, R., & Bialek, W. (1997). Spikes: Exploring the neural code. Cambridge, MA: MIT Press.
Rissanen, J. (1989). Stochastic complexity in statistical inquiry. Singapore: World Scientific.
Schneidman, E., Bialek, W., & Berry, M. J. (2003). Synergy, redundancy, and independence in population codes. Journal of Neuroscience, 23, 11539–11553.
Schnitzer, M. J., & Meister, M. (2003). Multineuronal firing patterns in the signal from eye to brain. Neuron, 37, 499–511.
Schurmann, T., & Grassberger, P. (1996). Entropy estimation of symbol sequences. Chaos, 6, 414–427.
Shannon, C. (1948). A mathematical theory of communication. Bell System Technical Journal, 27, 379–423.
Sharpee, T., Rust, N., & Bialek, W. (2004). Analyzing neural responses to natural signals: Maximally informative dimensions. Neural Computation, 16, 223–250.
Simoncelli, E. P., & Olshausen, B. A. (2001). Natural image statistics and neural representation. Annual Review of Neuroscience, 24, 1193–1215.
Simoncelli, E. P., Paninski, L., Pillow, J., & Schwartz, O. (2004). Characterization of neural responses with stochastic stimuli. In M. Gazzaniga (Ed.), The cognitive neurosciences (3rd ed.). Cambridge, MA: MIT Press.
Stanley, G. B., Li, F. F., & Dan, Y. (1999). Reconstruction of natural scenes from ensemble responses in the lateral geniculate nucleus. Journal of Neuroscience, 19, 8036–8042.
Strong, S. P., Koberle, R., de Ruyter van Steveninck, R., & Bialek, W. (1998). Entropy and information in neural spike trains. Physical Review Letters, 80, 197–200.
Theunissen, F., & Miller, J. (1995). Temporal encoding in the nervous system: A rigorous definition. Journal of Computational Neuroscience, 2, 149–162.
Treves, A., & Panzeri, S. (1995). The upward bias in measures of information derived from limited data samples. Neural Computation, 7, 399–407.
Usrey, W. M., & Reid, R. C. (1999). Synchronous activity in the visual system. Annual Review of Physiology, 61, 435–456.
Uzzell, V. J., & Chichilnisky, E. J. (2004). Precision of spike trains in primate retinal ganglion cells. Journal of Neurophysiology, 92, 780–789.
Victor, J. D. (2002). Binless strategies for estimation of information from neural data. Physical Review E, 66, 051903.
Warland, D., Reinagel, P., & Meister, M. (1997). Decoding visual information from a population of retinal ganglion cells. Journal of Neurophysiology, 78, 2336–2350.
Willems, F. M. J., Shtarkov, Y. M., & Tjalkens, T. (1995). The context tree weighting method: Basic properties. IEEE Transactions on Information Theory, IT-41, 653–664.
Wolpert, D., & Wolf, D. (1995). Estimating functions of probability distributions from a finite set of samples. Physical Review E, 52, 6841–6854.
Received April 12, 2006; accepted October 4, 2006.
LETTER
Communicated by Jonathan Victor
Generation of Synthetic Spike Trains with Defined Pairwise Correlations

Ernst Niebur
[email protected]
Krieger Mind/Brain Institute and Department of Neuroscience, Johns Hopkins University, Baltimore, MD 21218, U.S.A.
Recent technological advances as well as progress in theoretical understanding of neural systems have created a need for synthetic spike trains with controlled mean rate and pairwise cross-correlation. This report introduces and analyzes a novel algorithm for the generation of discretized spike trains with arbitrary mean rates and controlled cross correlation. Pairs of spike trains with any pairwise correlation can be generated, and higher-order correlations are compatible with common synaptic input. Relations between allowable mean rates and correlations within a population are discussed. The algorithm is highly efficient, its complexity increasing linearly with the number of spike trains generated and therefore inversely with the number of cross-correlated pairs.
1 Introduction

Understanding any natural or artificial system is considerably furthered by the possibility of not only being able to observe it but also to manipulate it, that is, by modifying the system's state and observing the resulting changes in behavior. In neuroscience, recent technological advances have substantially increased the extent to which the microstate of nervous systems can be controlled. While electrical microstimulation has been one of the standard tools of neurophysiology for some time (for a recent review, see Cohen & Newsome, 2004), new technology allows the stimulation of substantially larger numbers of sites (see, e.g., Normann, Maynard, Rousche, & Warren, 1999; Sekirnjak et al., 2006). In addition, newly developed optical methods based on uncaging of neurotransmitters allow the activation of up to tens of thousands of locations per second with high spatial and temporal resolution (Callaway & Katz, 1993; Shepherd, Pologruto, & Svoboda, 2003; Boucsein, Nawrot, Rotter, Aertsen, & Heck, 2005; Shoham, O'Connor, Sarkisov, & Wang, 2005; Miesenbock & Kevrekidis, 2005; Thompson et al., 2005). Realizing the full potential of these techniques will require the injection of sets of artificial spike trains with controlled rates and correlations between them. There is thus a need for efficient and systematic methods
that allow generating not just individual spike trains but large numbers of them, with the possibility of varying their rates and correlation structure.

A similar need arises in theoretical neuroscience. While much of the earlier work in the field was based on the mean-rate coding assumption originating in the peripheral nervous system (Adrian & Zotterman, 1926), that is, the hypothesis that neural processing uses only the information contained in the mean number of spikes averaged over a few hundred milliseconds, this concept is being questioned in increasingly stronger terms (Abeles, 1982; Singer, 1999). A substantial number of current models explore the consequences of the temporal coding hypothesis, that is, the hypothesis that the nervous system does not discard all information in neural spike trains other than the mean activity level. It has been proposed that neural information may be represented by controlled changes in the autocorrelation function, as in the large number of models of oscillation coding (e.g., Gray & Singer, 1989; Niebur, Koch, & Rosin, 1993), or in cross-correlation functions, as in models studying synchrony (e.g., Niebur & Koch, 1994; Tiesinga & Sejnowski, 2004), or a combination of both. Thus, both the development of quantitative neural models that rely on correlated activity and that of experimental techniques that allow selectively exciting neural tissue at multiple sites require the availability of test data. The subject of this letter is the generation of sets of synthetic spike trains with controlled rates and cross-correlations.

A finite pairwise correlation within a population of N model neurons can be obtained by generating a set of N_2 random spike trains with a suitable distribution (e.g., Poisson) and assigning them randomly to the N output spike trains; a sketch of this scheme follows this paragraph. If N_2 < N, some states are necessarily identical, which introduces correlations between spike trains (Destexhe & Pare, 1999). While this method is efficient, the mean pairwise correlation is not parametrically controlled (only indirectly, via the ratio N_2/N), and it is the same for all pairs of neurons. Other methods manipulate stochastic processes by either subtracting events (Ogata, 1981) or adding events (Stroeve & Gielen, 2001). Different from these earlier methods, the algorithm developed here generates spike trains whose mean rates, as well as the cross-correlations between each pair of spike trains, are free parameters that can be selected independently, subject to constraints discussed below. The algorithm developed here is a generalization of our earlier work (Mikula & Niebur, 2003a, 2003b, 2004, 2005) in that it allows differing rates for the spike trains rather than requiring that all rates be identical. Furthermore, the cross-correlation between any two of these spike trains can be selected anywhere between the case of independent spike trains (minimal correlation), on one hand, and that of identical spike trains (maximal correlation), on the other, subject to limitations discussed below.
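A sketch of that shared-pool scheme (our illustration; the function name and arguments are hypothetical):

```python
import numpy as np

def shared_pool_trains(n_trains, n_pool, n_bins, p_spike, rng=None):
    """Draw n_pool independent Bernoulli spike trains and randomly assign
    one pool member to each of the n_trains outputs; if n_pool < n_trains,
    some outputs coincide, which induces pairwise correlations."""
    rng = np.random.default_rng(rng)
    pool = (rng.random((n_pool, n_bins)) < p_spike).astype(int)
    assignment = rng.integers(0, n_pool, size=n_trains)
    return pool[assignment]
```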
of spike trains. Different from the earlier case of identical firing rates in all spike trains, not all combinations of firing rates and cross-correlations are attainable; it is impossible to have strong correlations in a set of spike trains with vastly different firing rates. The resulting constraints are studied in section 6.

2 Generating Two Spike Trains

Our goal is to provide a constructive method for generating two stochastic processes (we use the terms spike train and stochastic point process interchangeably), x and y, each consisting of a series of binary events (0, or no spike in a bin, and 1, or one spike in a bin). We assume that spike trains are memoryless, that is, all their time bins are independent, and we can therefore analyze them separately (this assumption can be relaxed; see section 7). These are Bernoulli processes, with 1 occurring in process x with probability¹ r_x = ⟨x⟩, with 0 ≤ r_x ≤ 1 (the expectation value of obtaining a 1; see equation 3.2), and 0 occurring with probability 1 − ⟨x⟩. Likewise, in process y, 1 occurs with probability r_y = ⟨y⟩, with 0 ≤ r_y ≤ 1, and 0 with probability 1 − ⟨y⟩. We note that for most practical purposes, the firing probability in any given bin is low, r_x ≪ 1, in which case the Bernoulli process can be approximated by a Poisson process. The pairwise correlation of the two processes is C_xy, the expectation value of the probability of obtaining coincident spikes with rate correction (see equation 3.4).

In order to construct the two processes x and y, we follow the appendix in Mikula and Niebur (2003b). We start out with three independent Bernoulli processes that have either a 1 or a 0 in each bin and are binomially distributed. For one of the spike trains, referred to as the reference spike train r in the following (because it is left unchanged), the probability of having a 1 in any given bin is p; thus, a 0 is found with probability (1 − p). The other two spike trains, x and y, are constructed as follows. For spike train x, we start with a memoryless two-state stochastic process in each bin of which a 1 is found with probability p_x and a 0 with probability (1 − p_x). We then switch, with probability √q, the state in x to that in r. Note that in general (in all cases other than for p = p_x or q = 0), the mean rate of the resulting spike train will be neither p nor p_x but an intermediate value to be computed later, in equation 3.2. The same procedure is applied to spike train y, with p_x replaced everywhere by p_y.

Table 1 shows the complete probability table for all combinations of a reference spike train (r, first column) and two spike trains x and y (second and third columns). Determining the probabilities P_{rxy} (fourth column) is straightforward.
¹ We use the terms firing probability and mean firing rate interchangeably since they are proportional, the factor of proportionality being the bin width, a constant. See also section 7.
Table 1: Probability Table for All States of the Reference Spike Train (s_r) and Two Spike Trains s_x and s_y.

s_r  s_x  s_y  P_{rxy}
0    0    0    (1 − p)[(1 − p_x) + p_x√q][(1 − p_y) + p_y√q]
0    1    0    (1 − p) p_x(1 − √q)[(1 − p_y) + p_y√q]
0    0    1    (1 − p)[(1 − p_x) + p_x√q] p_y(1 − √q)
0    1    1    (1 − p) p_x p_y(1 − √q)²
1    0    0    p(1 − p_x)(1 − p_y)(1 − √q)²
1    1    0    p[p_x + (1 − p_x)√q](1 − p_y)(1 − √q)
1    0    1    p(1 − p_x)(1 − √q)[p_y + (1 − p_y)√q]
1    1    1    p[p_x + (1 − p_x)√q][p_y + (1 − p_y)√q]

Notes: The reference spike train has firing probability p, and the original spike trains x and y have firing probabilities p_x and p_y, respectively. In columns 1–3, 1 indicates a spike in the respective spike train, and 0 stands for no spike.
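To make the construction concrete, the following is a minimal Python/NumPy sketch of the switching procedure just described (the function and argument names are illustrative, not part of the original method description):

```python
import numpy as np

def two_correlated_trains(p, p_x, p_y, q, n_bins, rng=None):
    """Switching construction: draw a reference Bernoulli train r with
    rate p and two independent trains with rates p_x and p_y, then copy
    the state of r into each of them, independently in every bin, with
    probability sqrt(q)."""
    rng = np.random.default_rng() if rng is None else rng
    r = rng.random(n_bins) < p                    # reference spike train
    x = rng.random(n_bins) < p_x                  # original train x
    y = rng.random(n_bins) < p_y                  # original train y
    sq = np.sqrt(q)
    x = np.where(rng.random(n_bins) < sq, r, x)   # switch x to r's state
    y = np.where(rng.random(n_bins) < sq, r, y)   # switch y to r's state
    return x.astype(int), y.astype(int)
```

The mean rates and the covariance of the two trains returned by such a routine are then given by equations 3.2 to 3.4 below.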
As an illustration, consider the computation of the first probability, P_{000}. It is obtained by observing that the probability of having no spike in the reference spike train is (1 − p), since the probability of obtaining a spike is p. This gives rise to the first factor, (1 − p). Given the absence of a spike in the reference spike train, there are two possibilities for spike train x also not containing a spike in this bin. The first is that it starts out with no spike. This happens with probability (1 − p_x), the first term in the square bracket. This is independent of the state of the reference spike train since, whether a switch occurs or not, this spike train will always have no spike (switching would not change the state since, by assumption, the reference spike train does not contain a spike). The second possibility occurs if the state in spike train x is switched to that in the reference spike train (0) even though originally (before switching) this state was 1. Since the latter occurs with probability p_x and the switching with probability √q, this additional probability is p_x√q, which completes the derivation of the first square bracket in P_{000} (note that these two events, having a spike occur in spike train x and switching the state to that of the reference spike train, are independent; therefore, their probabilities multiply to obtain the probability of the compound event). Finally, the same considerations apply to the second spike train (y), which therefore contributes a factor equal to that obtained for x but with all p_x replaced by p_y. The total probability is the product of those for the states of r, x, and y. The other cases in Table 1 are derived analogously.

3 Properties of Two Spike Trains

Having obtained Table 1, it is straightforward (although somewhat tedious) to verify the normalization condition:

Σ_{i=0,1} Σ_{j=0,1} Σ_{k=0,1} P_{ijk} = 1.    (3.1)
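The normalization can also be checked numerically. A short sketch, with arbitrary test values for p, p_x, p_y, and q, sums the eight entries of Table 1:

```python
import numpy as np

p, p_x, p_y, q = 0.3, 0.45, 0.6, 0.2   # arbitrary test values
sq = np.sqrt(q)

def cond(pi, s_r, s_i):
    """Probability of state s_i in a constructed train with original
    rate pi, given state s_r of the reference train (cf. Table 1)."""
    if s_r == 0:
        return (1 - pi) + pi * sq if s_i == 0 else pi * (1 - sq)
    return (1 - pi) * (1 - sq) if s_i == 0 else pi + (1 - pi) * sq

total = sum(((1 - p), p)[s_r] * cond(p_x, s_r, s_x) * cond(p_y, s_r, s_y)
            for s_r in (0, 1) for s_x in (0, 1) for s_y in (0, 1))
print(total)   # 1.0 up to floating-point rounding
```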
We now proceed to compute the mean firing rate of the spike trains. Direct substitution yields

r_x := ⟨x⟩ := Σ_{i=0,1} Σ_{k=0,1} P_{i1k} = p_x + √q (p − p_x).    (3.2)
In this equation, the first and second equalities are the definition of the mean rate, and the last one is the result of direct substitution of terms from Table 1 and straightforward simplification. As anticipated, the mean rate ⟨x⟩ is influenced by p and p_x. Of course, for q = 0, we obtain ⟨x⟩ = p_x, and for q = 1, we have ⟨x⟩ = p. The analogous result is obtained for y:

r_y := ⟨y⟩ := Σ_{i=0,1} Σ_{k=0,1} P_{ik1} = p_y + √q (p − p_y).    (3.3)
The rate-corrected cross-correlation (or covariance) between spike trains x and y is

C_xy := ⟨xy⟩ − ⟨x⟩⟨y⟩ := Σ_{i=0,1} P_{i11} − (Σ_{i=0,1} Σ_{k=0,1} P_{i1k})(Σ_{i=0,1} Σ_{k=0,1} P_{ik1}) = (p − p²) q,    (3.4)
where the first and second equalities are definitions, and the third one is obtained after substitution from Table 1 and simplification. We observe that C_xy is independent of p_x and p_y. Furthermore, C_xy attains a maximum value of q/4 at p = 1/2 and decreases monotonically for larger and smaller p, reaching zero for both p = 0 and p = 1. Of course, C_xy is identically zero for all p if q = 0. For q = 1, all three spike trains are identical and have spike rate p and correlation C_xy = p − p², which is the autocorrelation of a binomial process with probability p. The latter can also be computed directly from Table 1 as the variance of the reference spike train,

var_r := C_rr := ⟨(r − ⟨r⟩)²⟩ = p − p²,    (3.5)

where the same notation as that introduced in equation 3.4 was used.
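Equations 3.2 to 3.4 are easy to confirm by simulation; a brief sketch (parameter values arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
p, p_x, p_y, q, n = 0.3, 0.45, 0.6, 0.2, 500_000
sq = np.sqrt(q)
r = rng.random(n) < p
x = np.where(rng.random(n) < sq, r, rng.random(n) < p_x).astype(float)
y = np.where(rng.random(n) < sq, r, rng.random(n) < p_y).astype(float)
print(x.mean(), p_x + sq * (p - p_x))          # equation 3.2
print(y.mean(), p_y + sq * (p - p_y))          # equation 3.3
print((x * y).mean() - x.mean() * y.mean(),    # empirical covariance vs.
      (p - p * p) * q)                         # equation 3.4
```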
where the same notation as that introduced in equation 3.4 was used. In practice, rather than specifying the rates px and p y of the uncorrelated √ stochastic processes and the correlation q with the reference spike train, we are interested in generating spike trains with specified mean rates and cross-correlation. That is, x, y, and C xy are given quantities, and we have to determine parameters px , p y , and q given these values. By solving equations 3.2 to 3.4 for px , p y , and q , we find that a pair of spike trains with means x, y and cross-correlation C xy is obtained by choosing px , p y ,
and q as

p_x = (⟨x⟩ − p√q) / (1 − √q),    (3.6)
p_y = (⟨y⟩ − p√q) / (1 − √q),    (3.7)
q = C_xy / (p − p²),    (3.8)
where p, the rate of the reference spike train, is formally a free parameter. There are limitations on the choice of this parameter (e.g., p − p² ≥ C_xy to ensure q ≤ 1 in equation 3.8), which will be discussed in the more general case of arbitrary numbers of spike trains in section 6. Note that these limitations do not apply if all firing rates are identical (⟨x⟩ = ⟨y⟩ = p), since p_x and p_y do not depend on q in this case (Mikula & Niebur, 2003a).

4 Arbitrary Number of Spike Trains

For N spike trains, there are 2^N possible states in each bin. For nonzero q, all these states are entangled because each randomly chosen pair has to have the same mean covariance. Despite the apparent complexity of the problem, we can compute the probability of each of the states as the sum of two terms, as follows. The trick is that we specify only the correlation with the reference spike train r and that the pairwise correlation is then obtained from expressions like equation 3.4. For instance, Table 1 shows that the state without a spike in either x or y has a total probability of

{(1 − p)[(1 − p_x) + p_x√q][(1 − p_y) + p_y√q]} + {p(1 − p_x)(1 − p_y)(1 − √q)²},

where the terms in the first and second braces correspond to the cases without and with a spike in r, respectively.

The algorithm to generate N correlated spike trains is thus as follows. We start by generating N uncorrelated, memoryless stochastic point processes (i.e., they have independent time bins) with binomial firing statistics in each bin. As before, each is a Bernoulli process, and process i, with i ∈ {1, . . . , N}, has a mean firing rate (probability of having state 1) of p_i, 0 ≤ p_i ≤ 1. One additional process with the same statistics and with mean rate p is generated and referred to, as before, as the reference spike train r. To generate the set of N correlated spike trains, we switch, with probability √q, the state of uncorrelated spike train i to that of spike train r. Using the notation from the previous paragraph, let (s_1, s_2, . . . , s_N) be the state of the N spike trains
Table 2: Probability of Obtaining State s_i in Process i Depends on the State of the Reference Process.

s_r  s_i  P^{(i)}_{s_r s_i}        P^{(i)}(s_r, s_i)
0    0    (1 − p_i) + p_i√q        (1 − p)[(1 − p_i) + p_i√q]
1    0    (1 − p_i)(1 − √q)        p(1 − p_i)(1 − √q)
0    1    p_i(1 − √q)              (1 − p) p_i(1 − √q)
1    1    p_i + (1 − p_i)√q        p[p_i + (1 − p_i)√q]

Notes: Column 1: state of the reference process. Column 2: state of process i. The last column shows the probability of obtaining the combination of state s_r in the reference process and state s_i in process i, and the third column shows only that component of this probability that is independent of p. The first factor in P^{(i)}(s_r, s_i) stems from the reference spike train r; the second factor comes from spike train i and depends on both its own state and that of r. Note that in order to avoid proliferation of symbols, we distinguish the symbols for the probabilities in column 4 from the expressions in column 3 (which are not probabilities but terms used in equation 4.1 and beyond) by using a subscript notation in column 3 and parentheses in column 4 (top row).
in a given time bin. The probability of finding the system in this state is then

P(s_1, s_2, . . . , s_N) = (1 − p) Π_{i=1}^{N} P^{(i)}_{0 s_i} + p Π_{i=1}^{N} P^{(i)}_{1 s_i},    (4.1)
where P^{(i)}_{0 s_i} and P^{(i)}_{1 s_i} are the terms from the third column of Table 2.

We will show now that equations 3.6 to 3.8 generalize immediately to the case of arbitrary N. The mean rate of process i is defined as the expectation value of this process having state 1, that is,

r_i := P(s_i = 1) = (1 − p)(P^{(1)}_{00} + P^{(1)}_{01})(P^{(2)}_{00} + P^{(2)}_{01}) · · · P^{(i)}_{01} · · · (P^{(N)}_{00} + P^{(N)}_{01})
                  + p (P^{(1)}_{10} + P^{(1)}_{11})(P^{(2)}_{10} + P^{(2)}_{11}) · · · P^{(i)}_{11} · · · (P^{(N)}_{10} + P^{(N)}_{11}).

The sums within each parenthesis in this equation are again unity, and we obtain

r_i = (1 − p) P^{(i)}_{01} + p P^{(i)}_{11} = p_i + √q (p − p_i),    (4.2)
where the second equality is obtained using column 3 in Table 2. Equation 4.2 is the generalization of equation 3.2 for spike train i.
The correlation function is computed analogously. By definition, the correlation function of processes i and k is the expectation value of both processes having state 1, corrected by the product of their rates:

C_ik := P(s_i = 1, s_k = 1) − r_i r_k
     = (1 − p)(P^{(1)}_{00} + P^{(1)}_{01}) · · · P^{(i)}_{01} · · · P^{(k)}_{01} · · · (P^{(N)}_{00} + P^{(N)}_{01})
       + p (P^{(1)}_{10} + P^{(1)}_{11}) · · · P^{(i)}_{11} · · · P^{(k)}_{11} · · · (P^{(N)}_{10} + P^{(N)}_{11}) − r_i r_k.    (4.3)
The sums in the parentheses being unity, we are left with

C_ik = (1 − p) P^{(i)}_{01} P^{(k)}_{01} + p P^{(i)}_{11} P^{(k)}_{11} − r_i r_k.    (4.4)
Using again P^{(i)}_{01} and P^{(i)}_{11} from Table 2 and equation 4.2, we obtain, after simplification,

C_ik = (p − p²) q,    (4.5)

which is the generalization of equation 3.4.

In practice, the mean rates r_i for i ∈ {1, . . . , N} of the correlated processes and their pairwise correlation C are given, and we need to determine the parameters p_i and q of the uncorrelated stochastic processes. We therefore solve equations 4.2 and 4.5 for these parameters to obtain
p_i = (r_i − p√q) / (1 − √q),    (4.6)
q = C / (p − p²),    (4.7)

which are the generalizations of equations 3.6 and 3.8. Section 6 discusses constraints on the choice of these parameters.
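Putting the pieces together, a sketch of the complete recipe might look as follows: pick a reference rate p (subject to the constraints of section 6), obtain q and the p_i from the target rates and correlation via equations 4.7 and 4.6, and run the switching construction. Names and interface are illustrative:

```python
import numpy as np

def correlated_spike_trains(rates, C, p, n_bins, rng=None):
    """Generate N spike trains with target mean rates `rates` and
    pairwise covariance C, using a reference train with rate p.  The
    cost is O(N * n_bins): a single pass over all bins of all trains."""
    rng = np.random.default_rng() if rng is None else rng
    rates = np.asarray(rates, dtype=float)
    q = C / (p - p * p)                      # equation 4.7
    sq = np.sqrt(q)
    p_i = (rates - p * sq) / (1 - sq)        # equation 4.6
    assert 0 <= q <= 1 and np.all((p_i >= 0) & (p_i <= 1)), "p not admissible"
    ref = rng.random(n_bins) < p                           # reference train
    own = rng.random((len(rates), n_bins)) < p_i[:, None]  # independent trains
    copy = rng.random((len(rates), n_bins)) < sq           # switching mask
    return np.where(copy, ref[None, :], own).astype(int)
```

For example, `correlated_spike_trains(np.linspace(0.3, 0.4, 20), C=0.02, p=0.58, n_bins=1000)` should reproduce the setting used for the lower left panel of Figure 2.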
5 Arbitrary Cross-Correlations

So far, we have assumed that all cross-correlations are identical. To generalize to arbitrary covariances, rather than changing the state in all stochastic processes with the same probability √q, the state in process i is changed to that of the reference process with probability √q_i, which can be different for each i. The table of probabilities is then just as in Table 2, but with √q replaced by √q_i everywhere. Straightforward algebra along the lines of equations 4.1 to 4.5 shows that the mean firing rate of process i is unchanged and given as in equation 4.2, and that the mean cross-correlation function
between processes i and k, as defined in equation 4.3, is

C_ik = (p − p²) √(q_i q_k).    (5.1)
Of course, in the special case of q_i = q_k, this reduces to equation 4.5. The mean rates are obtained as in equation 4.6, with √q replaced by √q_i in the numerator and denominator. Likewise, given a set of cross-correlations C_ij = C_ji (i, j = 1, . . . , N), the corresponding parameters q_i are computed from equation 5.1. In the most general case, when all C_ij are different, half of the N parameters q_i can be chosen freely (subject to constraints similar to those discussed in section 6), and the other N/2 are determined by equation 5.1. We note in passing that another way to add structure to the correlation at the population level is to add additional reference spike trains.

6 Constraints on Rates and Correlations

While arbitrary values for q ∈ [0, 1] and p ∈ [0, 1] can be chosen if the mean rates of all spike trains are identical (Mikula & Niebur, 2003a), a set of conditions between p and q needs to be satisfied when mean rates differ (for simplicity, we discuss the case of equal pairwise cross-correlations only; the generalization discussed in section 5 does not pose any particular difficulties but complicates the notation). Intuitively, too large a spread of firing rates does not allow enforcing strong pairwise correlation on all pairs. Formally, the constraint manifests itself in the requirement that q in equation 4.7 and p_i in equation 4.6 need to be well-defined probabilities with 0 ≤ q, p_i ≤ 1. Equation 4.7 requires²

C ≤ p − p².    (6.1)
We solve for p to obtain

1/2 − √(1/4 − C) ≤ p ≤ 1/2 + √(1/4 − C).    (6.2)
Equation 4.6 yields

r_i − p√q ≥ 0,    (6.3)
r_i − p√q ≤ 1 − √q.    (6.4)
² As noted at the end of section 3, these constraints do not apply if all mean rates are identical, p_i ≡ p for all i = 1, . . . , N, since in that case p_i = r_i = p in equation 4.6, and the equation becomes an identity. The same is true for the trivial cases of q = 0 or q = 1.
Inequalities 6.3 and 6.4 can be combined as

1 − (1 − r_i)/√q ≤ p ≤ r_i/√q.    (6.5)
For a given p and q, the right inequality in 6.5 determines the smallest possible spike train frequency r_i as

r_min = p√q,    (6.6)
and the left inequality determines the largest possible frequency as

r_max = 1 − (1 − p)√q.    (6.7)
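These bounds can be turned into a simple numerical feasibility test that anticipates the reverse question treated next: scan candidate values of p, compute q from equation 4.7, and keep those p for which all target rates fall inside the window [r_min, r_max] of equations 6.6 and 6.7. A sketch (names illustrative):

```python
import numpy as np

def admissible_p(rates, C, n_grid=100_000):
    """Return grid values of p for which all target rates can be
    realized with pairwise covariance C (equations 4.7, 6.6, 6.7)."""
    rates = np.asarray(rates, dtype=float)
    p = np.linspace(1e-6, 1 - 1e-6, n_grid)
    q = C / (p - p * p)
    sq = np.sqrt(q)
    ok = ((q <= 1) & (p * sq <= rates.min())
          & (1 - (1 - p) * sq >= rates.max()))
    return p[ok]

print(admissible_p([0.3, 0.4], 0.02))   # nonempty; contains p = 0.58
print(admissible_p([0.4, 0.6], 0.2))    # empty: no admissible p exists
```

The two example calls are consistent with the discussion of Figure 1 below.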
In the following, we consider the practically more important reverse question: For a given set of r_i, can we find (at least) one p such that the algorithm can be applied, and if so, what are the allowable values for p? For any given r_i and q < 1, a solution of inequality 6.5 exists because the left-hand side can be transformed into r_i/√q − (1/√q − 1), which is smaller than the right-hand side since, by assumption, q < 1 (the case q = 1 is trivial, leading to N identical spike trains with rate p). The condition does, however, establish a constraint on the set of mean rates r_i that can be generated with this procedure, because if their spread becomes too large, it is not possible to find a p that satisfies both inequalities in 6.5 for all i.
From the second of the inequalities 6.5 we have p ≤ r_i/√q and thus p² ≤ r_i²/q, bearing in mind that the positive solution of this equation will be the correct one since p is a probability. Replacing q in this inequality by using equation 3.8, we obtain p² ≤ r_i²(p − p²)/C, and after division by p (≠ 0),

p ≤ r_i² / (C + r_i²).    (6.8)
Thus, p must be smaller than the expression on the right for all r_i. For C = 0, the condition is fulfilled for all probabilities p, but for increasing C, it becomes more and more stringent. Using again equation 3.8, we rewrite the first inequality in 6.5 as

1 − ((1 − r_i)/√C) √(p − p²) ≤ p,    (6.9)
or

p + A √(p − p²) − 1 ≥ 0,    (6.10)

where A := (1 − r_i)/√C is a positive constant. At equality, this is a nonlinear equation of the type

p = 1 − A √(p − p²),    (6.11)
whose solution depends, through A, on r_i and C. In the relevant interval p ∈ ]0, 1[, the right-hand side of this equation is a U-shaped function with exactly one crossing with the identity. These fixed points are given by p = C[(1 − r_i)² + C]⁻¹. Although such a p can always be found for any given pair of values (r_i, C), it is possible that the intervals given by conditions 6.8 and 6.9 are nonoverlapping for a given C and a range of values [r_i,min, r_i,max]. No solution with the specified set of mean rates and pairwise correlation C between all pairs exists in this case. Only if at least one p can be found that simultaneously fulfills conditions 6.8 and 6.9 can the algorithm be applied.

Figure 1 shows the solutions of inequalities 6.5 as a function of r_i. Only those values of p that are between the two lines corresponding to the left inequality (solid line) and the right inequality (dashed line) fulfill both inequalities and are therefore acceptable solutions. If the pairwise correlation is relatively small, as in Figure 1a with C = 0.02, a suitable p value can be chosen for all but the most extreme values of r_i, and in this case there is usually a large range of suitable p values. For large correlation, for example, C = 0.2 as shown in Figure 1b (note that the maximum of completely correlated, i.e., identical, spike trains is attained for C = 0.25), the constraints become more stringent, and the spread of allowable values for p is zero for a substantial subset of r_i (those below and above the crossing points). For such rates, it is impossible to generate sets of correlated spike trains with unequal mean rates; only sets with r_i = p for all i are allowed. The shaded part of the abscissa (in Figures 1a and 1b) is an example of a range of firing rates [r_i,min, r_i,max]. The projection onto the ordinate shows that for small correlation (see Figure 1a, C = 0.02), a large range of p values can be chosen (shaded part of the ordinate), while for strong correlation (C = 0.2, in Figure 1b), the range of possible p values is much smaller. For an even larger range, for example, [r_i,min, r_i,max] = [0.4, 0.6], it is impossible to generate spike trains with cross-correlation C = 0.2 between each pair of the set.

Figure 2 shows that spike trains computed with the described algorithm have the desired properties. The upper panels of Figure 2 show examples of 300 bins each of 100 spike trains, with, respectively, low correlation
Figure 1: Solutions of the inequalities 6.5 as a function of r_i for C = 0.02 (a) and C = 0.2 (b). The solid lines show the limits of the solution (i.e., at equality) for the left inequality; the dashed lines are the equivalent for the right inequality. Acceptable p values are all those between the two curves, that is, above the solid lines and below the dashed lines. For a chosen range of mean rates (shaded on the abscissa), the range of acceptable values (shaded on the ordinate) is significantly smaller for C = 0.2 than for C = 0.02. The spread of admissible mean rates is zero for rates r_i outside the range where the curves cross.
(C = 0.02) in the upper left panel and high correlation (C = 0.2) in the upper right. Even at C = 0.02, spike trains are clearly correlated and, as expected, correlation is very strong at C = 0.2. The lower left panel of Figure 2 demonstrates that the actual firing rates, computed as the expectation value of the number of events in equation 3.2 and plotted on the ordinate, scatter closely around the selected values (on the abscissa). For illustration, we generated 20 spike trains with 1000 bins each, with firing rates varying in equidistant steps, with p_x ∈ [0.3, 0.4] for C = 0.02 and p_x ∈ [0.5, 0.6] for C = 0.2 (these intervals were chosen simply so we could show both data sets in one plot). In both cases, p = 0.58 was chosen, which is within the allowable range in Figure 1. The lower right panel of Figure 2 is a plot of the cross-correlation values between all these 20 spike trains (that is, the expectation value computed in equation 3.4) as a function of the parameter C. Again, it is seen that the cross-correlation of the simulated spike trains scatters around the chosen value.

7 Discussion

A thorough understanding of biological neural systems requires the development of quantitative models, both experimental and theoretical. This is of particular importance for understanding neural microcircuitry, one of the most complex and at the same time practically and philosophically important structures known. Both theoretical (computational) and experimental models of neural tissue need a principled way of stimulating the model
Figure 2: Properties of simulated spike trains. (Upper left) Raster plot of 100 spike trains with p = 0.2 and C = 0.02. (Upper right) Same, but C = 0.2. (Lower left) Measured rates of 40 simulated spike trains plotted against nominal rates. For illustration, rates are shown for 20 spike trains with equidistant rates in the range [0.3, 0.4] for C = 0.02 (+ signs) and from the range [0.5, 0.6] for C = 0.2 (x signs). (Lower right) Average cross-correlation between spike trains plotted as a function of C. It can be seen that the observed cross-correlation, computed from equation 3.4, scatters around the nominal value.
substrate, in vitro or in silico. Recent reports demonstrate substantial progress in overcoming the technical difficulties of delivering stimuli with high spatial and temporal resolution into the tissue. While the simplest stimulation patterns might be independently distributed random processes, the next stages of investigation will surely demand more structured patterns.

This letter introduces an efficient and straightforward algorithm for constructing excitation patterns consisting of stochastic processes with controlled mean activity and cross-correlations between all pairs of processes. The algorithm is highly efficient and requires on the order of N × B operations, where N is the number of spike trains and B the number of time bins per spike train. Remarkably, its complexity is thus linear in N, even though the number of cross-correlations generated increases as O(N²). Furthermore, the algorithm lends itself to parallelization up to the finest grain (individual spike trains and individual bins).
An attractive alternative approach for generating correlated sets of model spike trains is based on maximum entropy principles. The basic idea (perhaps most clearly described by Jaynes, 1957, but going back to Gibbs and Laplace) is that the most rational course of action is to avoid making assumptions in the absence of information. Applied to the question of spike train statistics, it has been argued that spike trains with controlled second-order correlations should be generated with maximum entropy in all higher orders. This means that all but second-order correlations must be minimal, since higher-order correlations would introduce structure and therefore lower-than-maximum entropy. Algorithms for spike train generation and analysis based on maximum entropy principles (e.g., Bohte, Spekreijse, & Roelfsema, 2000; Nakahara & Amari, 2002) are powerful methods for characterizing ensembles of spike trains if higher-order correlations are known to vanish, if the structure of the higher-order correlations is known (and can thus be taken into account), or if absolutely no information about them is available.

However, none of these three cases is the situation considered here, namely, when we want to generate spike trains with correlational structure similar to that found in biological neural systems. Although we assume that only second-order correlations are explicitly known, as is usually the case (some exceptions in which higher-order correlations are observed and analyzed are Gerstein & Clark, 1964; Abeles & Goldstein, 1977; Abeles & Gerstein, 1988; Abeles, 1991; Martignon, von Hasseln, Grün, Aertsen, & Palm, 1995; Riehle, Grün, Diesmann, & Aertsen, 1997), and that we have no detailed information about higher-order correlations, we do know that the spike trains are generated in a highly connected neural network. For this reason, it is unlikely that no correlations exist between them at any order higher than second, which is the assumption made by maximum entropy methods. Applying maximum entropy methods with vanishing higher-order terms in this situation would then violate the basic rationale for using these methods. Paraphrasing Jaynes (1957), the goal is to generate a probability distribution that correctly represents our state of knowledge about the system, or to find a probability assignment that avoids bias while agreeing with whatever information is given.

It is always possible to use a maximum entropy approach, enforcing vanishing higher-order correlations, in a case in which insufficient information about higher-order correlations is available. In this study, we adopt a different strategy, with the explicit goal of taking into account all available information, including the likely presence of higher-order correlations. Although we have no detailed knowledge about those correlations in the absence of sufficient experimental data, we at least know that such correlations are likely to exist in a hierarchically connected complex network. The solution adopted in this study is to generate spike trains whose higher-order correlations correspond to those found in neural systems
whose correlations are due to common input. Although this is by no means the only possible source of correlations (examples of others being direct and indirect interactions between neurons, as was observed more than a generation ago; Moore, Segundo, Perkel, & Gerstein, 1970), common input is widely recognized to be an important source of correlation in spike trains. The spike trains generated by the methods developed here are thus particularly suited as models of neural activity where neurons are influenced by spiking input from a common source.

While we have focused on pairwise cross-correlations in this work, it is possible to make the mean rates of all processes time variant (by adjusting the parameters p_i) and thus also to vary their autocorrelations. This allows, for instance, modeling absolute and relative refractory periods by temporarily lowering the spike rate of a process after a spike has occurred in this process. Figure 3 shows an example of correlation functions for a process with an absolute refractory period of two bins. Relative refractory periods are implemented in the same way but with a finite, lowered probability of firing after each spike. The formulas for the computation of rates and correlations, equations 4.6 and 4.7, remain unchanged, with N reduced by the number of neurons that are in the absolute refractory period, and with r_i in equation 4.6 replaced by a lowered rate (subject to the constraints in section 6) for neurons in the relative refractory period. Figure 3 shows, as expected, vanishing entries in the autocorrelation function (squares) during the absolute refractory period, which is reflected in a smaller cross-correlation (crosses) because of the finite correlation between the spike trains. The figure also shows that refractoriness decreases the mean firing rate (from 0.2 to about 0.14 in this case) and leads to an increase of the autocorrelation function immediately following the refractory period, with a corresponding peak in the power spectrum (Bair, Koch, Newsome, & Britten, 1994; Franklin & Bair, 1995). This peak, more noticeable for higher firing rates, where higher-order peaks can also be observed (data not shown), is again reflected in attenuated form in the cross-correlation function. We note that the occurrence of these structures is a matter of principle (Bair et al., 1994; Franklin & Bair, 1995) and not due to the specifics of our implementation.

The possibility of varying the firing rates independently (rather than for all spike trains in unison) and the resulting requirement that bins are independent alleviate limitations of our older algorithm (Mikula & Niebur, 2003a). Experimentally observed cross-correlograms frequently have broad maxima rather than a sharp peak only in the center bin (as in, for example, Mikula & Niebur, 2003a; Destexhe & Pare, 1999). Spike jitter can be introduced by allowing transitions into a given bin of a spike train not only from the corresponding bin in the reference spike train but also (with smaller probability) from neighboring bins. The probabilities for the states are computed as in equation 4.1, and, for a stationary situation (i.e., p_i and jitter distribution independent of time), the rates (see equation 4.2) are unchanged. The width of the jitter distribution will determine the width of the
Figure 3: Correlation functions with refractory period. Shown are the average autocorrelation functions (squares) of 20 spike trains and the average cross-correlation functions (crosses) between them, with an absolute refractory period of 2 bins. All processes have mean rates of p = 0.2, and the correlation between spike trains is C = 0.1. The refractory period introduces a depression not only in the autocorrelation but, because of the correlation between spike trains, also in the cross-correlation. Also note the increase in correlation immediately following the refractory period in both functions.
resulting correlation function. Some kind of integral over the peak width is usually employed as a measure of synchrony, and the one-bin definition of synchrony (see equation 4.3) is replaced by a more complex expression that involves this measure.

As a general observation, we established in section 6 that there are combinations of rates and pairwise cross-correlations that cannot be achieved, namely, high cross-correlations and, simultaneously, a large spread in the rates of the individual spike trains. Within the limits described in sections 5 and 6, pairwise cross-correlations and mean firing rates can be chosen freely. A free parameter in the procedure is the bin width, which determines the relation between the probability of the 1 event in the stochastic process and the mean firing rate of the simulated spike train: firing probability = bin width × mean firing rate (see also note 1). The only formal requirement in the choice of the bin width is that there cannot be more than one event in any given bin, and the actual choice will depend on constraints provided by the system under study. For computational models, a suitable bin size could be the inverse of the maximal rate at which the neurons are known to fire;
for the generation of artificial spike trains to be used in microstimulation, it could be the inverse of the maximal stimulation frequency.

The spike trains introduced here can be used in a variety of situations that we have explored previously with spike trains with identical rates, for example, to study the effect of spike-timing-dependent plasticity (Mikula & Niebur, 2003b), the interplay of excitation and inhibition (Mikula & Niebur, 2004), and in feedforward networks of coincidence detectors (Mikula & Niebur, 2005). The ease of generating large numbers of synthetic spike trains with a highly efficient algorithm should prove to be of use in a large number of applications. On request, the author will make available sets of spike trains with the described statistics.

Acknowledgments

I thank Emery Brown for discussions. The work was supported by NIH grants NS43188-01A1, 5R01EY016281-02, and R01-NS40596.
References

Abeles, M. (1982). Local cortical circuits. New York: Springer-Verlag.
Abeles, M. (1991). Corticonics—Neural circuits of the cerebral cortex. Cambridge: Cambridge University Press.
Abeles, M., & Gerstein, G. (1988). Detecting spatiotemporal firing patterns among simultaneously recorded single neurons. J. Neurophysiol., 60, 909–924.
Abeles, M., & Goldstein, M. (1977). Multispike train analysis. Proc. IEEE, 65, 762–773.
Adrian, E. D., & Zotterman, Y. (1926). The impulses produced by sensory nerve endings. Part 2: The response of a single end organ. J. Physiol., 61, 151–171.
Bair, W., Koch, C., Newsome, W., & Britten, K. (1994). Power spectrum analysis of bursting cells in area MT in the behaving monkey. J. Neuroscience, 14(5), 2870–2892.
Bohte, S. M., Spekreijse, H., & Roelfsema, P. R. (2000). The effects of pairwise and higher-order correlations on the firing rate of a postsynaptic neuron. Neural Computation, 12, 153–179.
Boucsein, C., Nawrot, M., Rotter, S., Aertsen, A., & Heck, D. (2005). Controlling synaptic input patterns in vitro by dynamic photo stimulation. J. Neurophysiology, 94(4), 2948–2958.
Callaway, E. M., & Katz, L. C. (1993). Photostimulation using caged glutamate reveals functional circuitry in living brain slice. Proc. Nat. Acad. Sci., USA, 90, 7661–7665.
Cohen, M. R., & Newsome, W. T. (2004). What electrical microstimulation has revealed about the neural basis of cognition. Curr. Opin. Neurobiol., 14(2), 169–177.
Destexhe, A., & Pare, D. (1999). Impact of network activity on the integrative properties of neocortical pyramidal neurons in vivo. J. Neurophysiology, 81, 1531–1547.
Franklin, J., & Bair, W. (1995). The effect of a refractory period on the power spectrum of neuronal discharge. SIAM J. Appl. Math., 55(4), 1074–1093.
Gerstein, G., & Clark, W. (1964). Simultaneous studies of firing patterns in several neurons. Science, 143, 1325–1327.
Gray, C., & Singer, W. (1989). Stimulus-specific neuronal oscillations in orientation columns of cat visual cortex. Proc. Nat. Acad. Sci., USA, 86, 1698–1702.
Jaynes, E. T. (1957). Information theory and statistical mechanics. Physical Review, 106(4), 620–630.
Martignon, L., von Hasseln, H., Grün, S., Aertsen, A., & Palm, G. (1995). Detecting higher-order interactions among the spiking events in a group of neurons. Biological Cybernetics, 73(1), 69–81.
Miesenbock, G., & Kevrekidis, I. G. (2005). Optical imaging and control of genetically designated neurons in functioning circuits. Annu. Rev. Neurosci., 28, 533–563.
Mikula, S., & Niebur, E. (2003a). The effects of input rate and synchrony on a coincidence detector: Analytical solution. Neural Computation, 15, 539–547.
Mikula, S., & Niebur, E. (2003b). Synaptic depression leads to nonmonotonic frequency dependence in the coincidence detector. Neural Computation, 15(10), 2339–2358.
Mikula, S., & Niebur, E. (2004). Correlated inhibitory and excitatory inputs to the coincidence detector: Analytical solution. IEEE Transactions on Neural Networks, 15(5), 957–962.
Mikula, S., & Niebur, E. (2005). Rate and synchrony in feedforward networks of coincidence detectors: Analytical solution. Neural Computation, 17(4), 881–902.
Moore, G., Segundo, J., Perkel, D., & Gerstein, G. (1970). Statistical signs of synaptic interactions in neurons. Biophysical Journal, 10, 876–900.
Nakahara, H., & Amari, S. (2002). Information-geometric measure for neural spikes. Neural Comput., 14(10), 2269–2316.
Niebur, E., & Koch, C. (1994). A model for the neuronal implementation of selective visual attention based on temporal correlation among neurons. Journal of Computational Neuroscience, 1(1), 141–158.
Niebur, E., Koch, C., & Rosin, C. (1993). An oscillation-based model for the neural basis of attention. Vision Research, 33, 2789–2802.
Normann, R. A., Maynard, E. M., Rousche, P. J., & Warren, D. J. (1999). A neural interface for a cortical vision prosthesis. Vision Research, 39(15), 2577–2587.
Ogata, Y. (1981). On Lewis' simulation method for point processes. IEEE Transactions on Information Theory, 27, 23–31.
Riehle, A., Grün, S., Diesmann, M., & Aertsen, A. (1997). Spike synchronization and rate modulation differentially involved in motor cortical function. Science, 278, 1950–1953.
Sekirnjak, C., Hottowy, P., Sher, A., Dabrowski, W., Litke, A. M., & Chichilnisky, E. J. (2006). Electrical stimulation of mammalian retinal cells with multielectrode arrays. J. Neurophysiology, 95(6), 3311–3327.
Shepherd, G. M., Pologruto, T. A., & Svoboda, K. (2003). Circuit analysis of experience-dependent plasticity in the developing rat barrel cortex. Neuron, 38(2), 277–290.
Shoham, S., O'Connor, D. H., Sarkisov, D. V., & Wang, S. S. (2005). Rapid neurotransmitter uncaging in spatially defined patterns. Nature Methods, 2(11), 837–843.
Singer, W. (1999). Neuronal synchrony: A versatile code for the definition of relations? Neuron, 24, 49–65.
Stroeve, S., & Gielen, S. (2001). Correlation between uncoupled conductance-based integrate-and-fire neurons due to common and synchronous presynaptic firing. Neural Computation, 13, 2005–2029.
Thompson, S. M., Kao, J. P., Kramer, R. H., Poskanzer, K. E., Silver, R. A., Digregorio, D., & Wang, S. S. (2005). Flashy science: Controlling neural function with light. J. Neurosci., 25(45), 10358–10365.
Tiesinga, P. H. E., & Sejnowski, T. J. (2004). Rapid temporal modulation of synchrony by competition in cortical interneuron networks. Neural Computation, 16, 251–275.
Received February 2, 2006; accepted November 29, 2006.
LETTER
Communicated by Michelle Rudolph
Input-Driven Oscillations in Networks with Excitatory and Inhibitory Neurons with Dynamic Synapses

Daniele Marinazzo
[email protected]
Department of Biophysics, Radboud University of Nijmegen, 6525 EZ Nijmegen, The Netherlands; TIRES—Center of Innovative Technologies for Signal Detection and Processing and Dipartimento Interateneo di Fisica, Università di Bari, 70125 Bari, Italy; and Istituto Nazionale di Fisica Nucleare, Sezione di Bari, 70125 Bari, Italy
Hilbert J. Kappen [email protected]
Stan C. A. M. Gielen [email protected] Department of Physics, Radboud University of Nijmegen, 6525 EZ Nijmegen, The Netherlands
Previous work has shown that networks of neurons with two coupled layers of excitatory and inhibitory neurons can reveal oscillatory activity. For example, Börgers and Kopell (2003) have shown that oscillations occur when the excitatory neurons receive a sufficiently large input; a constant drive to the excitatory neurons is sufficient for oscillatory activity. Other studies (Doiron, Chacron, Maler, Longtin, & Bastian, 2003; Doiron, Lindner, Longtin, Maler, & Bastian, 2004) have shown that networks of neurons with two coupled layers of excitatory and inhibitory neurons reveal oscillatory activity only if the excitatory neurons receive correlated input, regardless of the amount of excitatory input. In this study, we show that these apparently contradictory results can be explained by the behavior of a single model operating in different regimes of parameter space. Moreover, we show that adding dynamic synapses in the inhibitory feedback loop provides robust network behavior over a broad range of stimulus intensities, contrary to that of previous models. A remarkable property of the introduction of dynamic synapses is that the activity of the network reveals synchronized oscillatory components in the case of correlated input but also reflects the temporal behavior of the input signal to the excitatory neurons. This allows the network to encode both the temporal characteristics of the input and the presence of spatial correlations in the input simultaneously.
Neural Computation 19, 1739–1765 (2007)
C 2007 Massachusetts Institute of Technology
1 Introduction

The central nervous system can display a wide spectrum of spatially synchronized, rhythmic oscillatory patterns of activity, with frequencies ranging from 0.5 Hz (δ rhythm) and 20 Hz (β rhythm) to 40 to 80 Hz (γ rhythm) and even higher, up to 200 Hz (for a review, see Gray, 1994). In the past two decades, evidence has been presented that synchronized activity and temporal correlation are fundamental tools for encoding and exchanging information in neuronal information processing in the brain (Singer & Gray, 1995; Singer, 1999; Reynolds & Desimone, 1999). In particular, it has been suggested that clusters of cells organize spontaneously into flexible groups of neurons with similar firing rates but with a different temporal correlation structure. However, despite the fact that synchronization of groups of neurons has been the subject of intense research efforts in many studies, the functional role of synchronized activity is still a topic of debate (Gray, 1994; Fries, 2005).

A longstanding, fundamental question is when synchrony can emerge and how cell properties, synaptic interactions, and network architecture interact to determine the nature of synchronous states in a large neural network. Most theoretical studies have investigated synchronization of neuronal activity in fully connected or sparsely connected networks under assumptions of weak or strong coupling (see, e.g., Ariaratnam & Strogatz, 2001; Mirollo & Strogatz, 1990; Ernst, Pawelzik, & Geisel, 1995; Hansel & Mato, 1993; Hansel, Mato, & Meunier, 1995; van Vreeswijk, 2000). Several experimental studies have shown that the grouping of neurons with synchronized firing depends on stimulus properties and on instructions to the subject or on attention (Reynolds, Chelazzi, & Desimone, 1999; Schoffelen, Oostenveld, & Fries, 2005). Therefore, network properties as well as stimulus-driven feedforward (bottom-up) and feedback (top-down) mechanisms are involved and should be incorporated in an explanatory model. These concepts are compatible with neurophysiological studies, which have suggested that synchronous oscillatory activity may be entrained by synchronous rhythmic inhibition originating from fast-spiking inhibitory interneurons (Buzsáki, Leung, & Vanderwolf, 1983; Lytton & Sejnowski, 1991), which has been an important reason to study networks with excitatory and inhibitory neurons.

Based on this insight, two different models have been proposed in the literature to explain the emergence of synchronized neuronal activity. Doiron, Lindner, Longtin, Maler, and Bastian (2004) presented a network with a layer of excitatory neurons with stochastic input and delayed inhibitory feedback of the summed activity of the excitatory neurons to the excitatory neurons. This model revealed the remarkable property that oscillations were observed in the excitatory neurons in response to spatially correlated stimuli but not to uncorrelated input. It was the spatial correlation in the stimuli, not the total power of the stimulus, that was essential for the network to display oscillatory activity. The frequency of oscillation was determined by
the effective delay in the feedback loop and was not related to the temporal properties of the spatially correlated input.

Although the architecture with interacting populations of excitatory and inhibitory neurons in the models by Doiron, Chacron, Maler, Longtin, and Bastian (2003) and Doiron et al. (2004) was similar to the architecture proposed by Börgers and Kopell (2003), the behavior of the latter model, which was proposed to explain oscillations in the gamma range (25–80 Hz), was different from that described by Doiron et al. The Börgers and Kopell model consists of two interacting layers of excitatory and inhibitory neurons without a pure time delay in the feedback loop. This model assumes a considerably longer time constant for the decay of inhibition than for the decay of excitation. It reveals synchronous rhythmic spiking due to a constant excitatory drive to the excitatory neurons. The excitatory neurons provide diverging input to the inhibitory cells, which inhibit (by divergent feedback to the excitatory neurons, where each inhibitory cell projects to multiple excitatory cells) and thereby synchronize the excitatory cells. With the proper parameters, this network starts to oscillate as soon as the power of the external input to the excitatory neurons is sufficiently large, regardless of spatial correlations in the input. This result may seem at odds with the results by Doiron et al. (2003), since Doiron et al. describe that oscillations occur only for spatially correlated input, regardless of the intensity of the external input. In our study, we present a more general model and demonstrate that the models by Doiron et al. (2003, 2004) and by Börgers and Kopell (2003) are special cases of this more general model, which reveals qualitatively different behavior in different regimes of parameter space.

In order to understand the reasons for the apparent discrepancy, we used the same architecture as Doiron et al. and Börgers and Kopell, with a layer of excitatory neurons and a layer of inhibitory neurons. External input is provided to the excitatory neurons, which feed their output to the inhibitory neurons. In addition to the external input, the excitatory neurons receive feedback from the inhibitory neurons. Regarding the type of neuron, Börgers and Kopell used the theta model (Ermentrout & Kopell, 1986), whereas Doiron et al. (2003, 2004) used the leaky integrate-and-fire (LIF) model. The LIF model is an approximation of the more physiological Hodgkin-Huxley model, but the standard Hodgkin-Huxley neuron can also be well approximated by the theta model (see, e.g., Ermentrout & Kopell, 1986; Hoppensteadt & Izhikevich, 1997; Gutkin & Ermentrout, 1998). In this study, we choose the LIF neuron because Lindner et al. (Lindner & Schimansky-Geier, 2001; Lindner, Schimansky-Geier, & Longtin, 2002; Lindner, Doiron, & Longtin, 2005) provided an analytical approximation for the dynamics of the LIF neuron. This allowed us to derive analytical expressions, which could be compared with the results of numerical analyses to explain the oscillatory behavior of the network. We also did some simulations using the standard Hodgkin-Huxley neuron, which gave the same results.
In this study, we add dynamic synapses (Tsodyks & Markram, 1997) to the model. Experimental studies have shown that the efficacy with which a synapse can transmit an action potential depends on the recent history of the synapse itself. The model by Tsodyks, Pawelzik, and Markram (1998) for dynamic synapses incorporates the property that presynaptic activity gives rise to depletion of neurotransmitter. This makes the synapse less effective when the firing rate of presynaptic activity arriving at the synapse increases. This dynamical behavior, called short-term depression, has been explained and modeled in detail (Tsodyks et al., 1998). Tsodyks, Uziel, and Markram (2000) and Pantic, Torres, and Kappen (2003) have explained how temporal coding, synchronization, and coincidence detection are greatly affected by a time-varying synaptic strength. Based on these results in the literature, we hypothesized that incorporating dynamic synapses in the network should make the network behavior robust for a relatively large range of input characteristics.

Incorporating dynamic synapses also gave another surprising result. The model by Doiron et al. (2004) produced a resonance when the input was spatially correlated only when the feedback gain was relatively small. With dynamic synapses, this behavior became robust for a large range of feedback gains. Moreover, the model with dynamic synapses both shows a resonance peak to indicate spatial correlation in the input and preserves the temporal characteristics of the spatially correlated input. This is a novel finding, which is important for modeling sensory (e.g., visual) information processing in combination with rhythmic activity (such as θ-, β-, and γ-rhythms) in cortex.
2 Model Description

The basic architecture of our model is very similar to that of Doiron et al. (2004) and is described in Figure 1: a population of leaky integrate-and-fire (LIF) neurons labeled 1, 2, . . . , N_E, with external inputs s_1, . . . , s_{N_E}. In this study, N_E was equal to 100. The output of these neurons provides excitatory input to another LIF neuron, which provides inhibitory feedback with gain g to the excitatory neurons. In some of our simulations, the single inhibitory LIF neuron was replaced by a set of 20 LIF neurons, which project by inhibitory synapses with strength g/20 to each of the excitatory neurons. One of the differences between our model and that by Doiron et al. (2003, 2004) is that the inhibitory feedback in the model by Doiron et al. (2004) was obtained by convolution of a linear summation of action potentials of the excitatory neurons, whereas we used the spike output of an LIF neuron (or of 20 LIF neurons) as inhibitory feedback to the excitatory neurons. As we will show later (see Figures 2 and 3), this affects the responses of the model only to a minor extent, as long as the membrane time constant of the leaky integrate-and-fire neuron is relatively long.
Figure 1: Model architecture describing the external input s_i to the excitatory neurons. The output of the excitatory neurons projects to the inhibitory neuron with an effective input current I_EI. The output of the inhibitory neuron is fed back to the excitatory neurons by the inhibitory current I_IE.
The dynamics of the membrane potential V of the LIF neurons satisfies

dV(t)/dt = −(V(t) − V_r) + I(t),    (2.1)
where time is measured in units of the membrane time constant τ_m and the membrane resistance is normalized to one. Every time the potential of the jth neuron reaches the threshold value V_th, a spike is fired. This resets the potential to the rest potential V_r, where it remains clamped for an absolute refractory period τ_Ref. As in Doiron et al. (2003, 2004), we have set V_r = 0, V_th = 1, and τ_Ref = 3 ms. Each excitatory neuron j (j ∈ {1, . . . , N_E}) receives an input I_j(t),

I_j(t) = μ + η_j(t) + σ (√(1 − c) ξ_j(t) + √c ξ_G(t)),    (2.2)
which consists of a constant base current μ, internal gaussian white noise η_j(t) with intensity D, and a stimulus s_j(t) = σ(√(1 − c) ξ_j(t) + √c ξ_G(t)), where ξ_j(t) and ξ_G(t) are both gaussian white noise with zero mean and unit power. Varying c increases or decreases the degree of spatial correlation of the external stimuli, while the total input power to each neuron remains constant. In addition to this input, the excitatory neurons receive
inhibitory feedback (see Figure 1), which will be defined in equation 2.4 and the text following that equation.

The series of spikes s_j^E(t) of neuron j in this layer of input neurons provides excitatory input to the inhibitory LIF neuron, which has a constant base current μ. The input current I_EI to the inhibitory neurons due to input from the excitatory neurons is given by the convolution of the sum of the spike trains of the excitatory neurons and a standard α-function:

I_EI(t) = (K_τE ∗ ST_E)(t) = ∫_0^∞ ST_E(t − τ) α²τ e^{−ατ} dτ,    (2.3)
where ST_E(t) = Σ_{j=1}^{N_E} s_j^E(t) is the sum of the spike trains of all excitatory neurons. The base current to the neurons is too small to reach threshold; therefore, neurons fire only in response to excitatory spike input. Just like the excitatory neurons, the inhibitory LIF neuron has a constant base current μ and internal noise η(t) with the same variance as that for the excitatory neurons. In addition, it receives the input I_EI(t), as defined by equation 2.3.

The action potentials s_I(t) of the inhibitory neuron provide an inhibitory synaptic current to the excitatory neurons after a time delay τ_D. The shape of the postsynaptic potential in the excitatory neurons due to input from the inhibitory neuron will be represented in further equations by K_τI, which will be defined after the introduction of equation 2.4. For the case of dynamic synapses, the effective synaptic strength is governed by three variables obeying the following equations (Tsodyks & Markram, 1997):

dx/dt = z/τ_rec − U x s_I(t − τ_D),
dy/dt = −y/τ_in + U x s_I(t − τ_D),    (2.4)
dz/dt = y/τ_in − z/τ_rec,
where x, y, and z are the fractions of synaptic resources in the recovered, active, and inactive state, respectively. Without any spike input, all neurotransmitter is recovered, and the fraction of available neurotransmitter is one: x(t) = 1. After each spike arriving at the synapse, a fraction U of the available (recovered) neurotransmitter is released. Note that the spike input from the inhibitory neuron s_I(t) is delayed by τ_D. The fraction y of active neurotransmitter is then inactivated into the inactive state z. τ_in is the time constant of the inactivation process, and τ_rec is the recovery time constant for conversion of the inactive to the recovered state.
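A discrete-time sketch of equation 2.4 may clarify the bookkeeping. Treating each delayed presynaptic spike as an instantaneous release of a fraction U of the recovered resources is one common discretization, and the value of U is not given in this section, so both are assumptions here:

```python
def synapse_step(x, y, z, spike, U=0.5, tau_in=3.0, tau_rec=800.0, dt=0.05):
    """One Euler step (times in ms) of the dynamic synapse of equation
    2.4.  `spike` is True if a presynaptic spike (already delayed by
    tau_D by the caller) arrives in this step.  x + y + z stays 1."""
    if spike:                    # instantaneous release of a fraction U
        rel = U * x              # of the recovered resources
        x, y = x - rel, y + rel
    inact = y * dt / tau_in      # active -> inactive
    recov = z * dt / tau_rec     # inactive -> recovered
    return x + recov, y - inact, z + inact - recov
```

The inhibitory feedback current in each step is then proportional to the active fraction y, as described next.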
With dynamic synapses, the total postsynaptic current I_IE(t) is proportional to the fraction of neurotransmitter in the active state, that is, I_IE(t) = g y(t) (g < 0 because of the inhibitory feedback). The total input to excitatory neuron j is defined by I_j(t) + I_IE(t), where I_j(t) is defined in equation 2.2. With this model, we can easily change to a situation with static synapses by setting x(t) = 1 for all t (or, equivalently, τ_rec → 0). Note that equation 2.4 includes an exponential decay of the postsynaptic potential with time constant τ_in in response to each presynaptic action potential. This is true for the case of both dynamic and static synapses.

We used the following parameter values in the model simulations: N_E = 100; τ_m = 6 ms for all neurons except when explicitly mentioned otherwise; base current μ = 0.5; intensity of the internal gaussian white noise D = 0.08; σ = 0.4; α = 18 ms⁻¹; τ_in = 3 ms; τ_rec = 800 ms; and time delay in the feedback loop τ_D = 6 ms. All simulations were integrated using an Euler integration scheme with a time step of 0.05 ms.

Along similar lines as outlined in Doiron et al. (2004; see the appendix), we obtain the following expression for the spike train power spectral density of the excitatory neurons when feedback is provided by an LIF neuron:

⟨s_j^E(ω) s_j^{E∗}(ω)⟩ = ⟨s_{0,j}^E(ω) s_{0,j}^{E∗}(ω)⟩ + σ² |A^E(ω)|²
    + c σ² |A^E(ω)|² × [2 ℜ(A^E(ω) g e^{−iωτ_D} K_τI(ω) A^I(ω) K_τE(ω)) − |A^E(ω) g K_τI(ω) A^I(ω) K_τE(ω)|²] / |1 − A^E(ω) g e^{−iωτ_D} K_τI(ω) A^I(ω) K_τE(ω)|².    (2.5)
Here, s_j^E(ω) and s_{0,j}^E(ω) represent the spike train in the presence and absence, respectively, of the external stimulus and feedback. K_τE(ω) and K_τI(ω) represent the postsynaptic dynamics as defined in equations 2.3 and 2.4, respectively, in the frequency domain, and A^E(ω) and A^I(ω) represent the susceptibility in the frequency domain with respect to the input of the excitatory neurons and the inhibitory neurons, respectively (for details see Doiron et al., 2004; Lindner & Schimansky-Geier, 2001; Lindner et al., 2002). ℜ(C) represents the real part of the complex number C. This result was obtained with the assumptions that ⟨s_{0,j}^E(ω) s_k^{E∗}(ω)⟩ = ⟨s_{0,j}^E(ω) I_j^∗(ω)⟩ = ⟨s_j^E(ω) ξ_k^∗(ω)⟩ = 0 for j ≠ k and for N → ∞, so as to neglect terms of order 1/N and higher. A derivation of equation 2.5 is given in the appendix.

Resonance is obtained when the denominator of the last term in equation 2.5 is minimal, which depends mainly on the effective time delay (i.e., the sum of the pure time delay τ_D and the frequency-dependent phase shifts due to the dynamics A^E(ω) and A^I(ω) of the LIF neurons and the synaptic functions K_τE(ω) and K_τI(ω)) in the feedback loop.
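To illustrate the input statistics entering this result, the sketch below generates the currents of equation 2.2 for all excitatory neurons at a given correlation c. The 1/√dt scaling of the discretized white noise is a common convention and an assumption here, as are the function and argument names:

```python
import numpy as np

def input_currents(n_e=100, n_steps=20_000, mu=0.5, sigma=0.4, c=1.0,
                   D=0.08, dt=0.05, rng=None):
    """Equation 2.2: I_j = mu + eta_j + sigma * (sqrt(1-c) * xi_j
    + sqrt(c) * xi_G).  Varying c trades private stimulus noise for a
    shared component at constant total stimulus power."""
    rng = np.random.default_rng() if rng is None else rng
    eta = np.sqrt(2 * D / dt) * rng.standard_normal((n_e, n_steps))
    xi_j = rng.standard_normal((n_e, n_steps)) / np.sqrt(dt)   # private
    xi_g = rng.standard_normal(n_steps) / np.sqrt(dt)          # shared
    return mu + eta + sigma * (np.sqrt(1 - c) * xi_j + np.sqrt(c) * xi_g)
```

Feeding these currents through the LIF dynamics of equation 2.1 and the feedback loop of Figure 1, and then estimating the spike train power spectra of the resulting trains, allows a direct numerical comparison with equation 2.5.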
Figure 2: Spike train power spectra (defined by equation 2.5) of an excitatory neuron for spatially correlated (c = 1) and uncorrelated (c = 0) input when inhibition is provided by an LIF neuron (filled symbols) and for correlated input when inhibition is provided by a linear unit (LIN, open symbols), as in Doiron et al. (2003) for g = −1.2. All other parameters as described in the text.
3 Results

We start by analyzing the model by Doiron et al. (2003, 2004) and demonstrate that replacing the linear summation in the model by an LIF neuron affects the behavior of this model only to a minor extent. We then explore the behavior of the model for various gain values of the inhibitory feedback loop and show that we obtain the behavior of the model by Börgers and Kopell (2003) for large feedback gains. We discuss the effect of feedback gain for both static and dynamic synapses.

Figure 2 shows the results of computer simulations for the spike train power spectral density of an excitatory neuron from the network with delayed feedback described in Figure 1 in the case of fully spatially correlated (c = 1) and fully uncorrelated (c = 0) input for a feedback gain g = −1.2. Since the statistics of action potential firing are the same for all excitatory neurons, it suffices to show the spectra for one neuron. For uncorrelated input (c = 0), the last term in the expression of equation 2.5 is zero, and we obtain a spectrum that increases slightly in the range from zero to 120 Hz. The input signal is gaussian noise with a flat spectrum, and the slight increase in spike train power spectral density reflects the dynamics of the
Figure 3: Spike train power spectra of an excitatory neuron for spatially correlated input when inhibition is provided by a single LIF neuron (dashed line) or a population of 20 inhibitory LIF neurons (solid line). All other parameters as in Figure 2.
excitatory LIF neurons and the effect of feedback. When the gaussian noise to the input units is fully correlated (c = 1), the results show a clear resonance near 40 Hz; the resonance is absent without spatial correlation. For reference, the power spectrum of an excitatory neuron from the model with linear summation, as in Doiron et al. (2004), is shown for the case of fully correlated input. The peak in the power spectrum falls at a slightly lower frequency for the model with the inhibitory LIF neuron (filled circles, our model), which is due to the additional delay stemming from the dynamics of the inhibitory LIF neuron. Otherwise, extensive simulations show that replacing the linear summation in the model by Doiron et al. (2004) with an LIF neuron gives very similar results, as long as the time constant of the inhibitory LIF neuron is sufficiently large (further explained in Figure 8). The results in Figure 2 were obtained with a single inhibitory neuron, as in the study by Doiron et al. (2004). Adding more inhibitory neurons did not affect the results, as shown in Figure 3. The reason is that when the model starts to oscillate (as pointed out by Börgers and Kopell, 2003, who did their simulations with a population of inhibitory neurons), the activity of the
inhibitory neurons synchronizes. In that case, a mean-field approximation is allowed, which explains why the population of inhibitory neurons can be replaced by a single inhibitory neuron. In further simulations, we use one inhibitory neuron. The behavior of the network depends on the membrane time constant of the inhibitory neuron and the feedback gain. Varying these two quantities results in similar effects: a smaller membrane time constant gives rise to a higher firing rate of the inhibitory neuron, leading to a higher effective gain of the feedback. There is another effect related to the value of the membrane time constant $\tau_{m,I}$ of the inhibitory LIF neuron: when it is small, the inhibitory neuron behaves as a coincidence detector, whereas for large values of $\tau_{m,I}$, the neuron becomes an integrator (in agreement with Rudolph & Destexhe, 2003). In the former case, the neurons spike only when there is a sufficiently large number of input spikes within a time window that is small with respect to the membrane time constant. If the interval between input spikes is large relative to the membrane time constant, any effect of the single-spike inputs decays, and the neuron does not spike. The effects of the feedback gain and the membrane time constant will be discussed separately in the following. The fact that the network oscillations in Figure 2 do not depend on the power of the stimulus differs from the results of Börgers and Kopell (2003), who report that a sufficiently large input power is required to cause oscillations. This apparent discrepancy can be understood if the feedback gain g is increased (more negative values). Large values of |g| lead to network oscillations in the case of uncorrelated input (see Figure 4, left panels). Each spike of the inhibitory neuron (lower left panel) inhibits the excitatory neurons after a delay of about 15 ms in the feedback loop. The explanation for the apparent discrepancies with the results of Doiron et al. (2003, 2004) in the case of uncorrelated input is that equation 2.5 was obtained using linear response theory for the LIF neuron, as outlined by Doiron et al. (2003) and Lindner, Doiron et al. (2005). For large, synchronized inputs and when the inhibitory neuron acts like a coincidence detector, the linear response approximation for the LIF is no longer valid, and the behavior changes from that illustrated in Figure 2 into the behavior reported by Börgers and Kopell (2003). Therefore, the model in Figure 1 can reproduce the results of both Doiron et al. (2004) and Börgers and Kopell (2003) for different values of the feedback gain g. It is worth stressing that the feedback gain values are not excessive or unrealistic, since the value of |g| is always within the limits of the self-consistent equation for the effective base current,

$$\mu_{\text{eff}} = \mu - g\, r_0(\mu_{\text{eff}}), \qquad r_0(\mu_{\text{eff}}) = \left( \tau_r + \sqrt{\pi} \int_{(\mu_{\text{eff}} - v_T)/\sqrt{D}}^{(\mu_{\text{eff}} - v_R)/\sqrt{D}} e^{x^2} \operatorname{erfc}(x)\, dx \right)^{-1}, \tag{3.1}$$

where erfc is the complementary error function.
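A minimal numerical sketch of this self-consistency condition follows. We take times in units of the membrane time constant (the formula as printed is dimensionless), assume a refractory time $\tau_r = 0$ since no value is quoted, and read the $g$ in equation 3.1 as the magnitude $|g|$ of the inhibitory gain, so that the feedback reduces the effective base current, consistent with the remark about $|g|$ above.

import numpy as np
from scipy.integrate import quad
from scipy.special import erfcx   # erfcx(x) = exp(x**2) * erfc(x)

# Self-consistent effective base current of equation 3.1, solved by damped
# fixed-point iteration. Times are in units of the membrane time constant
# (the printed formula is dimensionless). Assumptions: tau_r = 0 (no value is
# quoted), and g_mag = |g|, so that the inhibitory feedback lowers mu_eff.
mu, D, v_R, v_T = 0.5, 0.08, 0.0, 1.0
tau_r, g_mag = 0.0, 1.2

def r0(mu_eff):
    """Stationary LIF rate from equation 3.1 (in spikes per tau_m)."""
    lo = (mu_eff - v_T) / np.sqrt(D)
    hi = (mu_eff - v_R) / np.sqrt(D)
    integral, _ = quad(erfcx, lo, hi)      # integral of exp(x^2) erfc(x)
    return 1.0 / (tau_r + np.sqrt(np.pi) * integral)

mu_eff = mu
for _ in range(100):                       # damped fixed-point iteration
    mu_eff = 0.5 * mu_eff + 0.5 * (mu - g_mag * r0(mu_eff))

print("mu_eff = %.3f, r0 = %.3f spikes per tau_m" % (mu_eff, r0(mu_eff)))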
Figure 4: Superposition of spikes of all excitatory neurons (top panels) and of the inhibitory neuron (bottom panels) in case of uncorrelated gaussian white noise input with static (left) and dynamic synapses (right) for g = −3.6. Note that with static synapses, each action potential of the inhibitory neuron is followed by complete inhibition of the excitatory neurons after a time delay of about 15 ms.
(For more detailed information about the relevance of the effective base current and the derivation of $r_0(\mu_{\text{eff}})$, see Lindner, Doiron et al., 2005.) At this point it is useful to consider the effect of dynamic synapses. As is evident from Figure 4 (right panels), the presence of dynamic inhibitory synapses prevents the complete silencing of the excitatory neurons that is observed for static synapses in the case of a high feedback gain. We thus want to evaluate the effect of the feedback gain on the one hand and of the presence of dynamic synapses on the other, and to explore how these two parameters change the capacity of the system to detect spatially correlated input. To do so, we measure the power of the spike train around the resonance frequency. The behavior of the model is explained in more detail in Figure 5, which shows the power spectra of the spiking of the excitatory neurons for the model with static synapses (left panels) and dynamic synapses (right panels), for uncorrelated gaussian white noise input (c = 0, upper panels) and correlated input (c = 1, lower panels), for two values of the feedback
Figure 5: Spectra of the excitatory neurons for uncorrelated input (c = 0; upper panels) and for correlated input (c = 1; lower panels) for the model with static synapses (left panels) and dynamic synapses (right panels) for two different gains in the inhibitory feedback loop: g = −2 (solid line), g = −4 (dashed line). The input signal was gaussian white noise with a flat frequency spectrum up to 120 Hz. The units along the vertical axis are in spikes²/s. However, in order to compare the shape of the spectra for static and dynamic synapses and for correlated and uncorrelated input, the spectra were normalized.
gain. Variations in feedback gain cause changes in the firing rates of the excitatory and inhibitory neurons. Therefore, the power spectral density of the responses of the excitatory neurons (in spikes²/s, as in Figure 2) differs for different feedback gains. In Figures 5 and 8, we normalized the power spectral density to allow a better comparison of the shape of the spectra. The variations in firing rate are discussed later (see Figure 11). When the input is correlated (lower panels), a resonance peak appears near 40 Hz for both the static and dynamic synapses for g = −2 and g = −4. For the uncorrelated noise with g = −2, the model does not show a significant resonance peak for either static or dynamic synapses. Therefore, the model is able to detect spatially correlated input for g = −2 with both static and dynamic synapses. However, when the feedback gain is increased to g = −4, the network with static synapses resonates also in the absence of
Figure 6: Power spectra of the excitatory neurons for the model with dynamic synapses for different values of the input correlation c. Gain in the inhibitory feedback loop: g = −2. The input signal was gaussian white noise with a flat frequency spectrum up to 120 Hz. In order to compare the shape of the spectra, the units along the vertical axis were normalized.
spatially correlated input (upper left panel). These results demonstrate that we can reproduce the results of Doiron et al. (2003, 2004) for static synapses as long as the feedback gain is relatively small and obtain the results of Börgers and Kopell (2003) for relatively high feedback gains. When dynamic synapses are included in the model, the model behavior does not depend on feedback gain, since there is no resonance for physiological feedback gains when the input is uncorrelated (upper-right panel). If the input is correlated, there is a peak near 40 Hz as soon as there is feedback in the model. For simplicity, we have considered only the extreme cases of fully correlated (c = 1) or fully uncorrelated (c = 0) input in Figure 5. Values of c in the range between 0 and 1 yield results that are intermediate between the two extremes described above. This is shown in Figure 6, which clearly illustrates that the resonance peak decreases with decreasing values of c. Notice that for c = 1 we find a peak near 40 Hz, but a decrease in the power
spectrum in the range between 50 and 80 Hz. This can be understood by the fact that the term $g e^{-i\omega\tau_D} K_\tau^E(\omega) K_\tau^I(\omega) A^E(\omega) A^I(\omega)$ in the denominator of the last term in equation 2.5 contains a factor $e^{-i\omega\tau_D}$ related to the time delay $\tau_D$ in the feedback loop. When this term is maximal, it causes a peak in the spectrum near 40 Hz. For higher frequencies, the term $e^{-i\omega\tau_D}$ changes sign, which, together with the frequency dependence of the susceptibility $A$ and the synaptic transfer function $K$, changes the sign of $g e^{-i\omega\tau_D} K_\tau^E(\omega) K_\tau^I(\omega) A^E(\omega) A^I(\omega)$, causing a decrease in the spectrum. For our choice of time constants (membrane time constant 6 ms for excitatory and inhibitory neurons), a resonance such as the peak in Figure 2 may or may not occur, depending on the strength of the feedback coupling, the characteristics of the input signal, and the dynamics of the synapse.

The amplitude of the resonance can be quantified by introducing the statistic $\Delta_{f_1,f_2} = \int_{f_1}^{f_2} S(f)\, df$, where $S(f)$ is the power spectrum of the spike train of the excitatory neurons. As in Doiron et al. (2003, 2004), we quantify the resonance by $\Delta_{FB} = \Delta_{30,50} - \Delta_{2,22}$, comparing the value of $\Delta$ near the resonance frequency to the value of $\Delta$ in a low-frequency region. Figure 7 (upper panel) shows that for relatively small values of the feedback gain, the value $\Delta_{FB}$ increases for small values of $g$ ($-2 < g < 0$) for fully correlated input to the excitatory neurons for both the static and dynamic synapses, indicating increasing peak values of oscillations in the presence of fully correlated input. However, $\Delta_{FB}$ remains small for uncorrelated input as long as $-2 < g < 0$. For $g < -2$, the network also starts to oscillate for uncorrelated input ($c = 0$) in the case of static synapses. The spike train power near the resonance frequency rapidly reaches the same value as that for the correlated input case. This is not the case for the model with dynamic synapses, which shows very robust behavior for both uncorrelated and correlated input over the whole range of feedback gains studied. To quantify this situation even better, we introduce a discrimination index,

$$\rho_{0-1} = \frac{\Delta_{FB}(c = 1) - \Delta_{FB}(c = 0)}{\Delta_{FB}(c = 1) + \Delta_{FB}(c = 0)}. \tag{3.2}$$
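A sketch of how these quantities can be estimated from simulated spike trains follows. The choice of spectral estimator (Welch's method on a binned spike train) is ours; the text does not specify one.

import numpy as np
from scipy.signal import welch

# Band power Delta_{f1,f2} = integral of S(f) over [f1, f2] and the
# discrimination index rho_{0-1} of equation 3.2. The spectral estimator
# (Welch's method on a binned spike train) is our own choice.
def band_power(binned_train, dt, f1, f2):
    """binned_train: spike counts per bin of width dt (seconds)."""
    f, S = welch(binned_train / dt, fs=1.0 / dt, nperseg=8192)
    mask = (f >= f1) & (f <= f2)
    return S[mask].sum() * (f[1] - f[0])   # Riemann sum over the band

def delta_FB(binned_train, dt):
    return (band_power(binned_train, dt, 30.0, 50.0)
            - band_power(binned_train, dt, 2.0, 22.0))

def rho_01(train_c1, train_c0, dt):
    """Equation 3.2 from one trial with c = 1 and one with c = 0."""
    d1, d0 = delta_FB(train_c1, dt), delta_FB(train_c0, dt)
    return (d1 - d0) / (d1 + d0)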
This index indicates to what extent the network is able to distinguish between a fully spatially correlated and a fully uncorrelated input. The lower panel of Figure 7 shows that dynamic synapses (filled symbols) guarantee a clear and constant discrimination for all feedback gain values, whereas the model with static synapses (open symbols) detects the correlated input only at intermediate feedback gains (from −1 to −2) but cannot distinguish correlated and uncorrelated inputs for feedback gains g < −2. It has been shown (Whittington, Traub, & Jefferys, 1995; Traub, Whittington, Colling, Buzsáki, & Jefferys, 1996) that gamma oscillations in hippocampal interneurons are sensitive to the decay time constant of the
Figure 7: (Top) Power $\Delta_{FB} = \Delta_{30,50} - \Delta_{2,22}$ near the resonance peak (30–50 Hz) relative to that in the frequency range between 2 and 22 Hz in the presence of correlated (c = 1; circles) and uncorrelated (c = 0; triangles) input, with static synapses (SS; open symbols) and dynamic synapses (DS; filled symbols). (Bottom) Ability to discriminate correlated and uncorrelated input, $\rho_{0-1}$ (see equation 3.2), for static (SS, open symbols) and dynamic (DS, closed symbols) synapses as a function of the inhibitory gain.
GABA$_A$ synapse, in the sense that the oscillation frequency decreases for increasing values of that time constant. This is compatible with the fact that the oscillation frequency in the models by Börgers and Kopell (2003) and by Doiron et al. (2003, 2004) depends on the effective time delay in the feedback loop, which includes the time delay (absent in the Börgers and Kopell model), the dynamics of the synapses, and the dynamics of the inhibitory LIF neurons. Since changing the time constant of the inhibitory LIF neuron changes both the effective delay in the feedback loop and the behavioral characteristics of the LIF neuron (integrator versus coincidence detector), we have analyzed the behavior of the model for various time constants $\tau_{m,I}$ of the inhibitory LIF neuron. As mentioned before, the neurons in the network start to oscillate with synchronized firing of the excitatory neurons when the membrane time constant of the inhibitory neuron is small. This synchronization is independent of the nature of the external stimulus as long as the intensity is
Figure 8: Ability to discriminate correlated and uncorrelated input ρ0−1 (see equation 3.2) for static (SS, open symbols) and dynamic (DS, filled symbols) synapses versus the membrane time constant of the inhibitory neuron, with g = −1.2.
sufficiently large and when the time constant $\tau_{m,I}$ is small (see Figure 8). For time constants larger than 15 ms, the linear approximation of the inhibitory LIF neuron is valid, and the network behaves as in Doiron et al. (2003, 2004). When dynamic synapses are included in the model, the separation between correlated and uncorrelated input is also preserved for small values of $\tau_{m,I}$; the index $\rho_{0-1}$, which represents the ability to discriminate between correlated and uncorrelated input, is constant for all values of the membrane time constant $\tau_{m,I}$. Another novel aspect of the model with dynamic synapses relates to the ability to preserve the frequency content of the input signal. In the study by Börgers and Kopell (2003), the oscillation frequency depends only on the effective time delay in the feedback loop between excitatory and inhibitory neurons, and the oscillation is present as long as the power of the input signal is sufficiently large. In the model by Doiron et al. (2003), the presence of oscillations depends on the presence of spatially correlated input, and the oscillation frequency depends on the effective delay in the feedback loop, not on the temporal properties of the input. With dynamic synapses, the system also preserves the temporal information contained in the input. This is illustrated in Figure 9, which shows the spectra of the spike responses of the excitatory neurons in response to bandpass filtered gaussian noise with a center frequency at 60 Hz, a frequency well above the resonance peak near 40 Hz. The bandpass filtered noise is obtained by filtering gaussian white noise with a flat spectrum up to 150 Hz with a second-order Butterworth bandpass filter with a passband between 53 and 67 Hz. After filtering in the forward direction, the filtered sequence is reversed to ensure a zero-phase shift.
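For reference, the stimulus construction just described can be reproduced in a few lines; scipy's filtfilt applies the filter forward and backward, which is exactly the zero-phase procedure described. The sampling rate below follows the 0.05 ms integration step, though any rate well above the passband would do, and the 2 s duration is arbitrary.

import numpy as np
from scipy.signal import butter, filtfilt

# Band-pass filtered gaussian noise stimulus: a second-order Butterworth
# band-pass (53-67 Hz) applied forward and backward (filtfilt), which is the
# zero-phase procedure described in the text. The power is then normalized
# so that all conditions have the same input power.
fs = 1.0 / 0.05e-3                     # sampling rate for a 0.05 ms time step
rng = np.random.default_rng(1)
white = rng.standard_normal(int(2.0 * fs))   # gaussian white noise, 2 s

b, a = butter(2, [53.0, 67.0], btype="bandpass", fs=fs)
stim = filtfilt(b, a, white)           # zero-phase band-pass filtering

sigma = 0.4                            # input amplitude from the parameter list
stim *= sigma / stim.std()             # equalize power across conditions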
Figure 9: Spectra of the excitatory neurons for uncorrelated input (c = 0; upper panels) and for correlated input (c = 1; lower panels) for the model with static synapses (left panels) and dynamic synapses (right panels) for three different gains in the inhibitory feedback loop: g = 0 (solid line), g = −2 (dashed line), and g = −4 (dotted line). The input signal was bandpass filtered gaussian white noise, centered at 60 Hz. The units along the vertical axis are in spikes²/s. In order to compare the shape of the spectra, the spectra were normalized.
The power of the bandpass filtered signal is normalized such that the power of the input signal is the same in all conditions. Without feedback (g = 0), the solid lines show the spectra of the responses to the bandpass filtered noise with a clear peak near 60 Hz. For the static synapses (left-hand panels), the amplitude of the peak near 60 Hz decreases with increasing feedback gain. If the input is spatially correlated, a resonance peak appears near 40 Hz, consistent with the results of Doiron et al. (2003, 2004) and with the results in Figures 5 and 7, which show that a resonance peak is found for intermediate feedback gains only if the input is correlated. If the bandpass filtered noise at the input is spatially uncorrelated (c = 0; upper left panel), a peak near 40 Hz also appears for
the high feedback gain of −4, in agreement with our previous results (see Figure 5). If the bandpass filtered noise for the excitatory input units is spatially correlated (c = 1; lower left panel), the peak near 60 Hz disappears and a peak near 40 Hz appears for a feedback gain of −4. The spectra of the spike output of the excitatory neurons for the model with dynamic synapses differ from those for the static synapses in two aspects. The first difference is that the spectra for the model with dynamic synapses are almost identical for the uncorrelated bandpass filtered noise for all feedback gain values (c = 0, upper right panel). The almost identical shape of the spectra reflects the robustness of the model behavior with dynamic synapses. If the bandpass filtered noise is correlated (c = 1, lower right panel), a peak appears near 40 Hz, but in addition, the input peak near 60 Hz is preserved. This finding reflects that the model with the dynamic synapses has two important advantages over the model with static synapses: the responses of the excitatory neurons are robust to changes in feedback gain, and the spectra preserve the spectrum of the input, in addition to the peak in the spectrum indicating correlated input. In order to explore the effect of the feedback gain on the spike responses near the input frequency, we have explored the ratio $\Delta_{52.5,67.5}/\Delta_{72.5,87.5}$, which relates the power of the excitatory neurons near the bandpass filtered input at 60 Hz to the value of the power near 80 Hz, which is in a region where the power spectrum is not affected by feedback or resonance. A high value of this ratio reflects large response amplitudes near the input frequency of 60 Hz, whereas small values indicate that the input signal is not present in the spike responses. With feedback gain equal to zero, there is no difference between static and dynamic synapses (see Figure 10). As illustrated in Figures 5, 7, and 9, the ability to discriminate between spatially correlated and uncorrelated input is about the same for small feedback gains for dynamic and static synapses. For the range of feedback values between −1 and −2, we reproduce the behavior reported by Doiron et al. (2004) for static synapses; there is a resonance peak for spatially correlated input (see Figure 9, lower left panel), but the characteristics of the input signal are almost lost. This explains why for the static synapses, the ratio $\Delta_{52.5,67.5}/\Delta_{72.5,87.5}$ is smaller in the range $-2 < g < -1$ than for $g = 0$. For values of $g < -2$, the system starts to oscillate at 40 Hz for the static synapses regardless of the spatial correlation of the input (see Figures 5, 7, and 9), which explains the decrease of $\Delta_{52.5,67.5}/\Delta_{72.5,87.5}$ to values near 1 for static synapses. Quite remarkably, the ratio $\Delta_{52.5,67.5}/\Delta_{72.5,87.5}$ remains approximately constant for the dynamic synapses for all values of $g$, indicating a preservation of the input characteristics in the firing rate of the excitatory neurons. The robustness of the model responses with the dynamic synapses is reflected not only in the spectra (see Figures 8, 9, and 10) but also in the firing rates of the excitatory and inhibitory neurons in the model. This is illustrated in Figure 11, which shows the firing rate of the excitatory (E) and inhibitory
Figure 10: Discrimination index $\rho_{IF}$ for discriminating the input frequency near 60 Hz in the responses of the excitatory neurons, defined as the power around 60 Hz divided by the power around 80 Hz ($\Delta_{52.5,67.5}/\Delta_{72.5,87.5}$), for static (open symbols, SS) and dynamic (filled symbols, DS) synapses.
Figure 11: Firing rate of excitatory (E) and inhibitory (I) neurons for the models with static and dynamic synapses. Data at left (right) correspond to the model with an inhibitory LIF neuron with a leak time constant of 6 (18) ms. Except for the time constant of the inhibitory neuron, static versus dynamic synapses, and the feedback gain, all other parameters were the same for all conditions.
(I) neurons for the models with static and dynamic synapses for the various feedback gains and for two different time constants of the inhibitory neurons (6 and 18 ms). For the static synapses, the firing rate of the excitatory neurons decreases with increasing feedback gain. The difference in firing rate for g = 0 and g = −4 is typically about 30%. A similar decrease in firing
rate is observed for the inhibitory neurons. However, for the model with dynamic synapses, the firing rates of the excitatory and inhibitory neurons depend only weakly on the feedback gain. Note that this phenomenon is independent of the time constant of the inhibitory neurons.

4 Discussion

The main results of this study can be summarized in two conclusions. The first is that the different results reported by Börgers and Kopell (2003) and by Doiron et al. (2003, 2004), who used models with the same architecture but with qualitatively different behavior, can be explained by the responses of a single model with different feedback gains. The simulations presented here show that the same architecture with pools of excitatory and inhibitory neurons can display different behavior when the parameter values are in a different range. For small and intermediate feedback gains, the model starts to oscillate if the input to the excitatory input units is spatially correlated. Without correlation, the resonance peak in the spike responses is absent. The resonance in the model, as reported by Börgers and Kopell (2003), which appears as soon as the input to the excitatory neurons is sufficiently large (regardless of correlation between the inputs to the neurons), is obtained if the feedback gain is sufficiently large, in agreement with the conditions for resonance in their study. Another difference between the models by Börgers and Kopell (2003) and Doiron et al. (2003, 2004) is that the latter has only one inhibitory unit, whereas Börgers and Kopell have a population of inhibitory neurons. In this study we have shown that the behavior of the model by Doiron et al. (2003, 2004) does not change if the linear unit, which sums the spikes of the excitatory neurons and feeds the feedback loop, is replaced by an LIF neuron. Figure 3 shows that the results are not affected either when several inhibitory neurons are added in parallel. Börgers and Kopell (2005) also report that the qualitative behavior of the model is the same if several inhibitory neurons are added. In their study, suppression of excitatory cells can occur for asynchronously firing inhibitory cells. In our study, the inhibitory cells fire in synchrony for spatially correlated input (c = 1) and for high feedback gains. For c = 0 and for small feedback gains, the inhibitory neurons in our model do not fire in synchrony, and neither do the excitatory cells. However, if there are several inhibitory neurons, the network model starts to oscillate at slightly higher feedback gains than for a single inhibitory neuron. In summary, the behavior of the model is qualitatively the same for a single inhibitory neuron and for the case of several inhibitory neurons. The second conclusion of our study is that replacing synapses with a constant efficacy by dynamic synapses has two major implications: the model behavior becomes robust to changes in feedback gain and in the time constant of the inhibitory LIF neuron, and the resonance peak appears only for correlated input stimuli. Moreover, the model with dynamic synapses preserves
the characteristics of the input in the spike responses. From a biological perspective, this is highly relevant, since it allows transmission of the temporal properties of the input signal while the resonance provides a label to indicate whether the input is spatially correlated. In visual cortex cells, responses reflect the temporal properties of the stimulus in the receptive field but in addition can reveal gamma-frequency components (see, e.g., Reynolds & Desimone, 1999; Roelfsema, Lamme, & Spekreijse, 2004). In addition, several studies (e.g., Baker, Pinches, & Lemon, 2003; Schoffelen et al., 2005) have shown that multiple EEG rhythms can occur simultaneously, which is in contradiction with the behavior of some models in the literature (see, e.g., Börgers & Kopell, 2003; Doiron et al., 2004), which reveal oscillations only at one particular frequency. The finding that the activity of the network can reflect both the temporal characteristics of the input signal (the frequency of input modulations), which drives the network, and synchronous oscillations at one particular frequency is new. Previous studies modeling neurons as coupled oscillators have shown that when the width of the distribution of the intrinsic frequencies is small compared to the coupling strength, all oscillators become phase-locked (Ariaratnam & Strogatz, 2001). When the distribution of intrinsic frequencies becomes larger, the network can reveal a broad spectrum of dynamics, such as frequency locking at a single frequency, incoherent firing with a broad range of frequencies, oscillator death, and hybrid solutions that combine two or more of these states, depending on the coupling strength. Tsodyks, Mitkov, and Sompolinsky (1993) studied a system of globally coupled oscillators with pulse interactions (mimicking coupling by presynaptic action potentials). In such a system, the completely phase-locked state is unstable to weak disorder. For a small inhomogeneity, the oscillators are divided into two populations, one of which exhibits phase-locked periodic behavior, whereas the other consists of unlocked oscillators, which exhibit aperiodic patterns that are only partially synchronized. In this context it is important to stress that the spectra shown in Figures 5, 6, and 9 are representative of all excitatory neurons. That is, the oscillations and the input characteristics can be found in each neuron and do not reflect a partitioning within the pool of excitatory neurons. The novel finding that a group of neurons can code both the correlation between parallel neuronal input channels and the temporal characteristics of input signals is relevant for several reasons. As explained in section 1, the phenomenon of oscillations in neuronal activity in groups of neurons has long been known. However, the functional significance is still a subject of speculation (see, e.g., Fries, 2005). Moreover, the neuronal mechanisms responsible for initiating oscillatory activity are barely known. It is well known that cells in visual cortex can fire in synchrony at frequencies near 40 to 60 Hz and at the same time can reproduce the temporal dynamics of changes in the light in the receptive field. For example, Fries, Reynolds, Rorie, and Desimone (2001) report typical changes in firing
rate at a latency of about 25 ms in cat primary visual cortex, related to the onset and offset of a visual stimulus in the receptive field of these neurons. At the same time, they find correlated oscillatory activity at frequencies in the range between 40 and 70 Hz. This clearly illustrates that the activity of cells in cat primary visual cortex simultaneously reflects temporal changes in the light falling in the receptive field as well as correlated oscillatory activity at 40 to 70 Hz. Similarly, cells in visual areas V2 and V4 reflect the dynamics of stimuli in the receptive field but at the same time reveal correlated activity in the γ-range (Reynolds & Desimone, 1999). As far as we know, our study is the first to show that the temporal dynamics of activity in a population of neurons can reflect both changes in the stimulus that drives the cells and oscillatory activity. van Vreeswijk and Hansel (2001) studied the emergence of synchronized burst activity in networks of neurons with spike adaptation. They showed that networks of tonically firing adapting excitatory neurons can evolve to a state where the neurons burst in a synchronized manner. These authors showed that synchronized bursting is robust against inhomogeneities, sparseness of the connectivity, and noise. In networks of two populations, one excitatory and one inhibitory, they found that decreasing the inhibitory feedback can cause the network to switch from a tonically active, asynchronous state to the synchronized bursting state. In our study, we used dynamic synapses instead of the spike adaptation used by van Vreeswijk and Hansel (2001). Another difference is that the latter study did not consider external input but studied tonically active neurons instead. Our study is therefore a further extension of the work of van Vreeswijk and Hansel, showing the effect of external input on the stable states of a network with excitatory and inhibitory neurons coupled by dynamic synapses. In the model shown in Figure 1, the dynamic synapses were implemented in the feedback (i.e., the I → E) path. If the firing rate of the inhibitory neuron increases, the efficacy of the dynamic synapses decreases, which acts like an automatic gain control. This partly explains why the responses of the model with dynamic synapses hardly depend on the feedback loop gain. This finding is in agreement with Pantic et al. (2003), who demonstrated that dynamic synapses can perform good coincidence detection over a large range of frequencies for one suitably chosen threshold value, whereas static synapses require an adaptive, frequency-dependent threshold value for good detection. In fact, the adaptation is done by the synapses themselves. Dynamic synapses in the feedforward loop have also been discussed in several studies. The common message of these studies (see, e.g., Natschläger, Maass, & Zador, 2001; Senn, Segev, & Tsodyks, 1998; Pantic et al., 2003) is that dynamic synapses in the E → I path are very effective in allowing correct coincidence detection for a wide range of firing rates of the excitatory neurons without changing the base current to the inhibitory neurons.
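The gain-control property follows directly from equation 2.4: for a steady presynaptic rate $r$ (and $\tau_{in} \ll \tau_{rec}$, so that $z \approx 1 - x$), the recovered fraction settles at $x_{ss} = 1/(1 + U r \tau_{rec})$, and the mean active fraction $y_{ss} = \tau_{in} U x_{ss} r$ saturates at $\tau_{in}/\tau_{rec}$. The short check below illustrates this saturation; the value of U is an assumption, since the text does not list it.

import numpy as np

# Steady-state gain control implied by equation 2.4: for a steady presynaptic
# rate r (with tau_in << tau_rec, so z ~ 1 - x), the recovered fraction is
# x_ss = 1 / (1 + U * r * tau_rec), and the mean active fraction is
# y_ss = tau_in * U * x_ss * r, which saturates at tau_in / tau_rec.
# U is an assumed value; the text does not quote it.
tau_in, tau_rec, U = 3e-3, 0.8, 0.5          # s, s, release fraction

r = np.array([1.0, 5.0, 10.0, 20.0, 40.0])   # presynaptic rate (Hz)
x_ss = 1.0 / (1.0 + U * r * tau_rec)         # steady-state recovered fraction
y_ss = tau_in * U * x_ss * r                 # mean active transmitter fraction

for ri, yi in zip(r, y_ss):
    print("r = %4.1f Hz -> mean drive %.4f (cap %.4f)" % (ri, yi, tau_in / tau_rec))

The mean inhibitory drive thus grows ever more slowly with the inhibitory firing rate, which is the automatic gain control invoked above.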
Initially it came as a surprise that the model shown in Figure 1 revealed oscillations also for uncorrelated noise (c = 0) for large feedback gains: the last term in equation 2.5 is zero for c = 0. The explanation for the unexpected resonance is that equation 2.5 was derived using linear-response theory (see Doiron et al., 2004, and Lindner, Chacron, & Longtin, 2005, where the assumptions and mathematical analyses are explained in detail). However, the linear approximation of an LIF neuron is valid only for relatively long time constants of the neuron, where it operates mainly as an integrator. This approximation is no longer valid when the time constant of the LIF neuron is short, transforming the neuron into a coincidence detector. Similarly, a high feedback gain gives rise to full inhibition of the excitatory neurons for each action potential of the inhibitory neuron (see Figure 4, left panel), followed by a release of excitatory output. The large inhibitory input to the excitatory neurons synchronizes the excitatory neurons (precisely as in Börgers & Kopell, 2003), leading to a burst of activity to the inhibitory neuron. For such volleys of synchronized input, the linear approximation for the LIF neuron also fails.

Appendix

This appendix gives a derivation of equation 2.5. It mainly follows the lines set out in Doiron et al. (2004) and in Lindner, Chacron et al. (2005) and Lindner, Doiron et al. (2005). The input to the excitatory neurons is split into two parts. The first part consists of the base current $\mu$, the internal neuron-specific noise $\eta_i(t)$, and the time-independent mean of the feedback $g S_T^I(t)$ (with $g < 0$) from the inhibitory neuron to the excitatory neurons. The second part consists of the external input $\sigma\sqrt{1-c}\,\xi_j(t) + \sigma\sqrt{c}\,\xi_G(t)$ and the time-dependent part of the feedback from the inhibitory neuron. Using a linear response approximation for a leaky integrate-and-fire neuron with susceptibilities $A^E(\omega)$ and $A^I(\omega)$ (see Lindner & Schimansky-Geier, 2001; Lindner, Schimansky-Geier, & Longtin, 2002) for the excitatory neurons and the inhibitory neuron, respectively, the spectrum of the spike train of the excitatory neuron is given by
$$s_j^E(\omega) = s_{0,j}^E(\omega) + A^E(\omega) \left( I_j(\omega) + \frac{g}{N}\, e^{-i\omega\tau_D} K_\tau^I(\omega)\, s^I(\omega) \right), \tag{A.1}$$
where $s_j^E(\omega)$ and $s_{0,j}^E(\omega)$ represent the spike train of the $j$th excitatory neuron in the presence and absence, respectively, of the external stimulus and feedback. $K_\tau^I(\omega)$ represents the postsynaptic dynamics for the synapses from the inhibitory to the excitatory neurons as defined in equation 2.4 in the frequency domain, and $A^E(\omega)$ and $A^I(\omega)$ represent the susceptibility in the frequency domain with respect to the input of the excitatory neurons and the inhibitory neurons, respectively (for details, see Doiron et al., 2004;
Lindner & Schimansky-Geier, 2001; Lindner et al., 2002). The term e −iωτ D is the Fourier representation of the time delay in the inhibitory feedback loop. For the activity of the inhibitory neuron in the frequency domain, we obtain
s I (ω) = AI (ω)KτE (ω)
NI
s Ej (ω).
(A.2)
j=1
Inserting equation A.2 in equation A.1, and with $I_j(\omega)$ as the Fourier transform of $I_j(t) = \mu + \eta_j(t) + \sigma\sqrt{1-c}\,\xi_j(t) + \sigma\sqrt{c}\,\xi_G(t)$, gives, for the power spectrum of the spike train of the $j$th excitatory neuron,
$$
\begin{aligned}
\left\langle \tilde{s}_j^E \tilde{s}_j^{E*} \right\rangle
&= \left\langle \tilde{s}_{0,j}^E \tilde{s}_{0,j}^{E*} \right\rangle
+ \tilde{A}^E \left\langle \tilde{I}_j \tilde{s}_{0,j}^{E*} \right\rangle
+ \tilde{A}^{E*} \left\langle \tilde{I}_j^* \tilde{s}_{0,j}^E \right\rangle \\
&\quad + \tilde{A}^{E*} \frac{g}{N}\, e^{i\omega\tau_D} \tilde{K}_\tau^{I*} \tilde{A}^{I*} \tilde{K}_\tau^{E*} \left\langle \tilde{s}_{0,j}^E \sum_{k=1}^{N_E} \tilde{s}_k^{E*} \right\rangle
+ \tilde{A}^E \frac{g}{N}\, e^{-i\omega\tau_D} \tilde{K}_\tau^I \tilde{A}^I \tilde{K}_\tau^E \left\langle \tilde{s}_{0,j}^{E*} \sum_{k=1}^{N_E} \tilde{s}_k^E \right\rangle \\
&\quad + \left| \tilde{A}^E \right|^2 \left\langle \big| \tilde{I}_j \big|^2 \right\rangle
+ \left| \tilde{A}^E \right|^2 \frac{g}{N}\, e^{i\omega\tau_D} \tilde{K}_\tau^{I*} \tilde{A}^{I*} \tilde{K}_\tau^{E*} \left\langle \tilde{I}_j \sum_{k=1}^{N_E} \tilde{s}_k^{E*} \right\rangle \\
&\quad + \left| \tilde{A}^E \right|^2 \frac{g}{N}\, e^{-i\omega\tau_D} \tilde{K}_\tau^I \tilde{A}^I \tilde{K}_\tau^E \left\langle \tilde{I}_j^* \sum_{k=1}^{N_E} \tilde{s}_k^E \right\rangle
+ \left| \tilde{A}^E \right|^2 \frac{g^2}{N^2} \left| \tilde{K}_\tau^I \right|^2 \left| \tilde{A}^I \right|^2 \left| \tilde{K}_\tau^E \right|^2 \sum_{k,l}^{N_E} \left\langle \tilde{s}_k^E \tilde{s}_l^{E*} \right\rangle,
\end{aligned} \tag{A.3}
$$
where $\tilde{x}$ represents the Fourier transform of $x$. Using that $\langle \tilde{s}_{0,j}^E \tilde{s}_k^{E*} \rangle = \langle \tilde{s}_{0,j}^E \tilde{I}_j^* \rangle = \langle \tilde{s}_j^E \tilde{\xi}_k^* \rangle = 0$ for $j \neq k$ explains why the second and third terms on the right-hand side of the equal sign are equal to zero. Substitution of equation A.2 in A.1 and solving for $\sum_{j=1}^{N_E} \tilde{s}_j^E(\omega)$ gives

$$\sum_{k=1}^{N_E} \tilde{s}_k^E = \frac{\displaystyle \sum_{j=1}^{N_E} \tilde{s}_{0,j}^E + \tilde{A}^E \sum_{j=1}^{N_E} \tilde{I}_j}{1 - \tilde{A}^E g\, e^{-i\omega\tau_D} \tilde{K}_\tau^I \tilde{A}^I \tilde{K}_\tau^E}. \tag{A.4}$$
Substitution of equation A.4 in A.3, using that $\langle \tilde{s}_{0,j}^E \tilde{s}_k^{E*} \rangle = \langle \tilde{s}_{0,j}^E \tilde{I}_j^* \rangle = \langle \tilde{s}_j^E \tilde{\xi}_k^* \rangle = \langle \tilde{\xi}_G \tilde{\xi}_k^* \rangle = 0$ for $j \neq k$, and taking $N \to \infty$ so as to neglect terms of order $1/N$ and higher, we obtain for the power spectrum of the spike train of the excitatory neurons

$$\left\langle s_j^E(\omega) s_j^{E*}(\omega) \right\rangle = \left\langle s_{0,j}^E(\omega) s_{0,j}^{E*}(\omega) \right\rangle + \sigma^2 \left| A^E(\omega) \right|^2 + c\,\sigma^2 \left| A^E(\omega) \right|^2 \, \frac{2\,\Re\!\left( \tilde{A}^E g\, e^{-i\omega\tau_D} \tilde{K}_\tau^I \tilde{A}^I \tilde{K}_\tau^E \right) - \left| \tilde{A}^E g \tilde{K}_\tau^I \tilde{A}^I \tilde{K}_\tau^E \right|^2}{\left| 1 - \tilde{A}^E g\, e^{-i\omega\tau_D} \tilde{K}_\tau^I \tilde{A}^I \tilde{K}_\tau^E \right|^2}.$$
Acknowledgments

We acknowledge support from the Lorentz Center (http://www.lc.leidenuniv.nl/) for organizing a workshop, which allowed us to discuss the results of this study with André Longtin and Nancy Kopell. We thank Magteld Zeitler for helpful discussions.

References

Ariaratnam, J. T., & Strogatz, S. H. (2001). Phase diagram for the Winfree model of coupled nonlinear oscillators. Phys. Rev. Lett., 86, 4278–4281.
Baker, S. N., Pinches, E. M., & Lemon, R. N. (2003). Synchronization in monkey motor cortex during a precision grip task. II: Effect of oscillatory activity on corticospinal output. J. Neurophysiol., 89, 1941–1953.
Börgers, C., & Kopell, N. (2003). Synchronization in networks of excitatory and inhibitory neurons with sparse, random connectivity. Neural Comp., 15, 509–538.
Buzsáki, G., Leung, L. W. S., & Vanderwolf, C. H. (1983). Cellular bases of hippocampal EEG in the behaving rat. Brain Res. Rev., 6, 139–171.
Doiron, B., Chacron, M. J., Maler, L., Longtin, A., & Bastian, J. (2003). Inhibitory feedback required for network oscillatory responses to communication but not prey stimuli. Nature, 421, 539–543.
Doiron, B., Lindner, B., Longtin, A., Maler, L., & Bastian, J. (2004). Oscillatory activity in electrosensory neurons increases with the spatial correlation of the stochastic input stimulus. Phys. Rev. Lett., 93, 048101–048104.
Ermentrout, G. B., & Kopell, N. (1986). Parabolic bursting in an excitable system coupled with a slow oscillation. SIAM J. Appl. Math., 46, 233–253.
Ernst, U., Pawelzik, K., & Geisel, T. (1995). Synchronization induced by temporal delays in pulse-coupled oscillators. Phys. Rev. Lett., 74, 1570–1573.
Fries, P. (2005). A mechanism for cognitive dynamics: Neuronal communication through neuronal coherence. Trends in Cognitive Sciences, 9, 474–480.
Fries, P., Reynolds, J. H., Rorie, A. E., & Desimone, R. (2001). Modulation of oscillatory neuronal synchronization by selective visual attention. Science, 291, 1560–1563.
Gray, C. M. (1994). Synchronous oscillations in neuronal systems: Mechanisms and functions. J. Comp. Neurosci., 1, 11–38.
Gutkin, B. S., & Ermentrout, G. B. (1998). Dynamics of membrane excitability determine interspike interval variability: A link between spike generation mechanisms and cortical spike train statistics. Neural Comp., 10, 1047–1065.
Hansel, D., & Mato, G. (1993). Patterns of synchrony in a heterogeneous Hodgkin-Huxley neural network with weak coupling. Physica A, 200, 662–669.
Hansel, D., Mato, G., & Meunier, C. (1995). Synchrony in excitatory neural networks. Neural Comp., 7, 307–337.
Hoppensteadt, F. C., & Izhikevich, E. M. (1997). Weakly connected neural networks. New York: Springer-Verlag.
Lindner, B., Chacron, M. J., & Longtin, A. (2005). Integrate-and-fire neurons with threshold noise: A tractable model of how interspike interval correlations affect neuronal signal transmission. Phys. Rev. E, 72, 021911.
Lindner, B., Doiron, B., & Longtin, A. (2005). Theory of oscillatory firing induced by spatially correlated noise and delayed inhibitory feedback. Phys. Rev. E, 72, 061919.
Lindner, B., & Schimansky-Geier, L. (2001). Transmission of noise coded versus additive signals through a neuronal ensemble. Phys. Rev. Lett., 86, 2934–2937.
Lindner, B., Schimansky-Geier, L., & Longtin, A. (2002). Maximizing spike train coherence or incoherence in the leaky integrate-and-fire model. Phys. Rev. E, 66, 031916.
Lytton, W. W., & Sejnowski, T. J. (1991). Simulations of cortical pyramidal neurons synchronized by inhibitory interneurons. J. Neurophysiol., 66, 1059–1079.
Mirollo, R. E., & Strogatz, S. H. (1990). Synchronization of pulse-coupled biological oscillators. SIAM J. Appl. Math., 50, 1645–1662.
Natschläger, T., Maass, W., & Zador, A. (2001). Efficient temporal processing with biologically realistic dynamic synapses. Network: Computation in Neural Systems, 12, 75–87.
Pantic, L., Torres, J. J., & Kappen, H. J. (2003). Coincidence detection with dynamic synapses. Network: Computation in Neural Systems, 14(1), 17–33.
Reynolds, J. H., Chelazzi, L., & Desimone, R. (1999). Competitive mechanisms subserve attention in macaque areas V2 and V4. J. Neurosci., 19, 1736–1753.
Reynolds, J. H., & Desimone, R. (1999). The role of neural mechanisms of attention in solving the binding problem. Neuron, 24, 19–29.
Roelfsema, P., Lamme, V. A. F., & Spekreijse, H. (2004). Synchrony and covariation of firing rates in the primary visual cortex during contour grouping. Nature Neurosci., 7, 982–991.
Rudolph, M., & Destexhe, A. (2003). Tuning neocortical pyramidal neurons between integrators and coincidence detectors. J. Comp. Neurosci., 14, 239–251.
Schoffelen, J. M., Oostenveld, R., & Fries, P. (2005). Neuronal coherence as a mechanism of effective corticospinal interaction. Science, 308, 111–113.
Senn, W., Segev, I., & Tsodyks, M. (1998). Reading neuronal synchrony with depressing synapses. Neural Comp., 10, 815–819.
Singer, W. (1999). Neuronal synchrony: A versatile code for the definition of relations? Neuron, 24(1), 49–65.
Singer, W., & Gray, C. M. (1995). Visual feature integration and the temporal correlation hypothesis. Annu. Rev. Neurosci., 18, 555–586.
Traub, R. D., Whittington, M. A., Colling, S. B., Buzsáki, G., & Jefferys, J. G. R. (1996). Analysis of gamma rhythms in the rat hippocampus in vitro and in vivo. J. Physiol. (London), 493, 471–484.
Tsodyks, M. V., & Markram, H. (1997). The neural code between neocortical pyramidal neurons depends on neurotransmitter release probability. Proc. Natl. Acad. Sci. USA, 94, 719–723.
Tsodyks, M., Mitkov, I., & Sompolinsky, H. (1993). Pattern of synchrony in inhomogeneous networks of oscillators with pulse interactions. Phys. Rev. Lett., 71, 1280–1283.
Tsodyks, M., Pawelzik, K., & Markram, H. (1998). Neural networks with dynamic synapses. Neural Comp., 10, 821–835.
Tsodyks, M., Uziel, A., & Markram, H. (2000). Synchrony generation in recurrent networks with frequency-dependent synapses. J. Neurosci., 20(RC50), 1–5.
van Vreeswijk, C. (2000). Analysis of the asynchronous state in networks of strongly coupled oscillators. Phys. Rev. Lett., 84, 5110–5113.
van Vreeswijk, C., & Hansel, D. (2001). Patterns of synchrony in neural networks with spike adaptation. Neural Comp., 13, 959–992.
Whittington, M. A., Traub, R. D., & Jefferys, J. G. R. (1995). Synchronized oscillations in interneuron networks driven by metabotropic glutamate receptor activation. Nature, 373, 612–615.
Received January 30, 2006; accepted November 8, 2006.
LETTER
Communicated by Andreas Nieder
Extracting Number-Selective Responses from Coherent Oscillations in a Computer Model

Jeremy A. Miller
[email protected]
Garrett T. Kenyon
[email protected]
Los Alamos National Laboratory, Los Alamos, New Mexico 87544, U.S.A.
Cortical neurons selective for numerosity may underlie an innate number sense in both animals and humans. We hypothesize that the number-selective responses of cortical neurons may in part be extracted from coherent, object-specific oscillations. Here, indirect evidence for this hypothesis is obtained by analyzing the numerosity information encoded by coherent oscillations in artificially generated spike trains. Several experiments report that gamma-band oscillations evoked by the same object remain coherent, whereas oscillations evoked by separate objects are uncorrelated. Because the oscillations arising from separate objects would add in random phase to the total power summed across all stimulated neurons, we postulated that the total gamma activity, normalized by the number of spikes, should fall roughly as the square root of the number of objects in the scene, thereby implicitly encoding numerosity. To test the hypothesis, we examined the normalized gamma activity in multiunit spike trains, 50 to 1000 msec in duration, produced by a model feedback circuit previously shown to generate realistic coherent oscillations. In response to images containing different numbers of objects, regardless of their shape, size, or shading, the normalized gamma activity followed a square-root-of-n rule as long as the separation between objects was sufficiently large and their relative size and contrast differences were not too great. Arrays of winner-take-all numerosity detectors, each responding to normalized gamma activity within a particular band, exhibited tuning curves consistent with behavioral data. We conclude that coherent oscillations in principle could contribute to the number-selective responses of cortical neurons, although many critical issues await experimental resolution.

1 Introduction

Both animals and humans possess an innate number sense that has been characterized in a variety of behavioral experiments (Dehaene, 1997, 2001; Nieder, 2005). Monkeys, for example, can distinguish between images

Neural Computation 19, 1766–1797 (2007)
© 2007 Massachusetts Institute of Technology
containing different numbers of objects regardless of their shape, size, or distribution, demonstrating an ability to utilize an abstract sense of number independent of nonnumerical factors, such as the total area of illumination or total integrated contour length (Brannon & Terrace, 1998; Nieder & Miller, 2004a). Likewise, preverbal human infants respond specifically to changes in numerosity (Feigenson, Dehaene, & Spelke, 2004), and the ability to perform numerosity discriminations under controlled conditions has been documented across a wide range of animal taxa, including birds, rats, dolphins, cats, and primates (Davis & Perusse, 1988; Nieder, 2005). Adult humans also possess an innate number sense, distinct from arithmetical computation, that can be revealed by measuring performance on numerosity discrimination tasks in which explicit counting is prevented, either by appropriate experimental protocols (Cordes, Gelman, Gallistel, & Whalen, 2001; Whalen, Gallistel, & Gelman, 1999) or by studying cultures that lack numerically precise words for all but the first few integers (Pica, Lemer, Izard, & Dehaene, 2004). The evolution of an innate number sense is clearly adaptive, as it provides, for example, a basis for making optimal foraging, fight-or-flight, and mating decisions (e.g., “Which branch has more fruit?” or “Which troupe has more individuals?”). In both animals and humans, performance on numerosity discrimination tasks typically depends on the ratio of the quantities being compared, so that 2 versus 4, 4 versus 8, 8 versus 16, and so on are all distinguished with approximately equal levels of accuracy. Similarly, behavioral experiments exhibit both a prominent size effect (e.g., 3 versus 4 can be distinguished more reliably than 4 versus 5) as well as a prominent distance effect (e.g., 3 versus 5 can be distinguished more reliably than 3 versus 4). Combined with data showing that the latency of such judgments depends only weakly on numerosity, the above results indicate that animals and humans, especially preverbal human infants, utilize an innate, qualitative sense of number in which individual items are processed in parallel without resort to explicit counting or reliance on the formal mathematical concept of an integer. Recordings from single neurons in the monkey prefrontal and posterior parietal cortex have identified cells that respond selectively to a preferred number of visual items, usually between one and five, independent of nonnumerical factors such as shape, total area, or boundary length, or the distribution of the individual objects in the scene (Nieder, Freedman, & Miller, 2002; Nieder & Miller, 2003, 2004b). Similar to behavioral measures, number-selective neurons exhibit both a size effect—tuning curves become broader as their preferred numerosity increases (e.g., the cells respond to a wider range of numerosities)—and a distance effect—the overlap between tuning curves falls off with increasing numerical separation. The response properties of number-selective cortical neurons make them attractive candidates for mediating the innate number sense documented in related behavioral studies.
Here, we explore the hypothesis that coherent, object-specific oscillations implicitly encode numerosity in a manner that might contribute to the number-selective responses of cortical neurons. Neurons in the early visual system often exhibit coherent oscillations at frequencies within the gamma- or upper-gamma band, here defined broadly as between 40 and 120 Hz (Castelo-Branco, Neuenschwander, & Singer, 1998; Gray, König, Engel, & Singer, 1989; Gray & Singer, 1989; Kreiter & Singer, 1996; Neuenschwander, Castelo-Branco, & Singer, 1999; Neuenschwander & Singer, 1996; Singer & Gray, 1995). Typically, gamma-band oscillations recorded from visual neurons responding to the same object are strongly correlated, even though their common phase drifts randomly over time, as revealed by the absence of periodic structure in stimulus-locked measures, such as the peristimulus time histogram (PSTH), as well as by the declining amplitude of successive periodic side bands in autocorrelograms that otherwise exhibit strong oscillations, an indication that phase coherence is lost after several cycles. Moreover, the oscillatory responses evoked by different objects do not remain phase-locked, so that the cross-correlogram between two recording sites may be flat even though the autocorrelograms recorded at each site show strong periodic modulations at roughly equal frequencies. It follows from the object-specific nature of coherent oscillations that if only a single item is present in a scene, all of the stimulated cells will be modulated in phase, and the average spectral amplitude in the gamma band, computed from the summed activity across all stimulated neurons, will be maximal when normalized by the total number of spikes. If the scene instead contains two objects of similar size and luminance, there will be two sources of oscillatory activity whose relative phase will vary randomly over time. Thus, in the case of two objects, the total normalized power in the gamma band, assuming the sum includes all stimulated neurons, will be reduced by approximately the square root of 2. Likewise, for n distinct objects of similar size, the normalized gamma power will be reduced by approximately the square root of n (i.e., the average of n complex numbers possessing equal magnitudes and random phases). If it is assumed that all distinct visual items, regardless of their shape, orientation, shading, or distribution, evoke coherent, object-specific oscillations, the resulting estimate of numerosity will be automatically independent of such nonnumerical factors.
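A toy Monte Carlo check of this square-root-of-n argument follows: n sources of equal amplitude and independent random phases add to a resultant whose expected magnitude grows like the square root of n, so the amplitude normalized by the total activity (proportional to n) falls roughly like the inverse square root of n.

import numpy as np

# Toy check of the square-root-of-n rule: n coherent sources of equal
# amplitude and independent random phases add to a resultant whose expected
# magnitude grows like sqrt(n), so the amplitude normalized by the total
# activity (proportional to n) falls roughly like 1/sqrt(n).
rng = np.random.default_rng(2)
trials = 10000

for n in [1, 2, 4, 8, 16]:
    phases = rng.uniform(0.0, 2.0 * np.pi, size=(trials, n))
    resultant = np.abs(np.exp(1j * phases).sum(axis=1)).mean()
    print("n = %2d: |sum|/n = %.3f, 1/sqrt(n) = %.3f"
          % (n, resultant / n, 1.0 / np.sqrt(n)))

The proportionality constant differs slightly from one (for large n it approaches $\sqrt{\pi}/2 \approx 0.886$), but the scaling with n is the one the text relies on.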
Although the status of gamma activity in normal visual processing remains hotly debated (Sejnowski & Paulsen, 2006; Shadlen & Movshon, 1999; Singer & Gray, 1995), we may nonetheless ask to what extent numerosity can be encoded by coherent, object-specific oscillations. To investigate the numerosity encoded by coherent oscillations, we analyzed the normalized gamma power from multiunit spike trains obtained from a circuit model previously shown to generate realistic spatiotemporal correlations (Kenyon, Harvey, Stephens, & Theiler, 2004; Kenyon et al., 2003; Kenyon, Theiler, George, Travis, & Marshak, 2004; Kenyon, Travis et al., 2004). To obtain similar measurements from an experimental preparation would have required recording simultaneously from large numbers of oscillatory neurons, all of the same type and distributed across the visual field. Our theoretical study used computer-generated spike trains to investigate the feasibility of the hypothesis that gamma-band oscillations can, in principle, encode numerosity, as a prelude to a more rigorous, yet technically difficult, experimental study. Our results suggest that the normalized gamma activity, summed over all stimulated neurons and divided by the total spike count, permits single-trial discriminations to be made between images containing different numbers of objects. Because the circuit model was uniform and isotropic and produced object-specific, coherent oscillations in response to all sufficiently large, contiguous features, the resulting numerosity information was encoded in a manner that was inherently independent of the shape, orientation, shading, and distribution of the individual items in the scene. Previous attempts to compute numerosity using artificial neural networks implicitly utilized separate nodes for every enumerable object at each image location, either by starting with an explicit location map or by limiting the implementation to a 1D retina in which all possible lengths could be accounted for by appropriate winner-take-all filters (Dehaene & Changeux, 1993; Verguts & Fias, 2004). Our approach is fundamentally different. By utilizing the numerosity information encoded by stimulus-specific, coherent oscillations, it was possible to achieve invariance with respect to nonnumerical factors, such as size, shape, shading, and distribution, in a fully 2D model without introducing a combinatorially large number of additional elements. Finally, to compare our results to experimental measurements, we constructed an array of winner-take-all numerosity detectors that used only the normalized gamma activity present on single trials. The resulting theoretical tuning curves were consistent with behavioral data, supporting the hypothesis that gamma-band oscillations can, in principle, provide an implicit representation of numerosity that might contribute to the innate number sense shared by animals and humans.

2 Methods

2.1 Retinal Model. A negative feedback circuit, originally developed to account for high-frequency oscillations between retinal ganglion cells (Kenyon et al., 2003), was used to generate artificial spike trains with realistic spatiotemporal correlations in a manner consistent with known anatomy and physiology. Coherent oscillations generated by the model were of similar amplitude, frequency, and duration to those measured experimentally (Kenyon et al., 2003). Moreover, the gamma-band oscillations between model neurons were strongly size dependent and remained coherent only within contiguously stimulated regions (Kenyon, Harvey et al., 2004; Kenyon, Travis et al., 2004), thereby capturing important aspects of the object-specific, spatiotemporal correlations between neurons in the cat
retina (Neuenschwander et al., 1999). The circuit model thus provided a reasonable basis for estimating the numerosity information encoded by coherent oscillations in the central nervous system. Because details of the circuit model have been presented previously, only an abbreviated description is provided below (see also Figure 1). Input to the model was conveyed by an array of external currents proportional to the pixel-by-pixel gray-scale value of a two-dimensional image. These external currents directly stimulated the model bipolar cells and approximated their light-modulated synaptic input from cone photoreceptors. The bipolar cells produced excitatory postsynaptic potentials in both ganglion cells and amacrine cells according to a random process (Freed, 2000). The axon-bearing amacrine cells were electrically coupled to neighboring ganglion cells and to each other, and made strong inhibitory connections onto the surrounding ganglion cells and axon-bearing amacrine cells. This feedback circuit produced robust, physiologically realistic oscillations in response to large stimuli. When several ganglion cells were activated by a stimulus, they in turn activated neighboring axon-bearing amacrine cells via gap junctions (Dacey & Brace, 1992; Jacoby, Stafford, Kouyama, & Marshak, 1996; Vaney, 1994). The stimulated cells were then hyperpolarized by the ensuing wave of axon-mediated inhibition, thus setting up the next cycle of the oscillation. Spike generation in ganglion cells was modeled as a leaky integrate-and-fire process with a membrane time constant of 5 msec, consistent with published physiological data (O’Brien, Isayama, Richardson, & Berson, 2002). The model also contained local nonspiking amacrine cells that generated randomly distributed inhibitory postsynaptic potentials that helped make spontaneous firing asynchronous and acted to increase
Figure 1: Schematic. (a, b) Circuit diagram. The model contained five distinct cell types—bipolar (BP), small amacrine (SA), large amacrine (LA), polyaxonal amacrine (PA), and ganglion cells (GC)—whose implementation was based on the anatomy and physiology of the motion processing pathway in the cat inner retina. (a) Excitation, feedforward, and feedback inhibition. BPs (relay neurons) were activated directly by the stimulus and in turn excited the other cell types (open triangles). Both the SAs and LAs provided feedforward inhibition to the GCs (closed circles), while all three amacrine cell types (retinal interneurons) made inhibitory feedback synapses onto the BPs. (b) Serial inhibition. Interactions between retinal interneurons were arranged as a negative feedback loop. (c) Oscillatory circuit. Local excitation, via gap junctions, elicited long-range, axon-mediated inhibition, producing realistic coherent oscillations. (d) Winner-take-all numerosity detectors. Normalized gamma activity, a measure of the coherent oscillations across all stimulated cells, is postulated to decline as the inverse square root of the number of objects. Bands representing two and four objects are illustrated. The normalized gamma activity present on any given trial was guaranteed to fall within the range of one and only one numerosity detector.
the overall dynamic range, but were not otherwise critical for generating oscillatory spike trains. The behavior of the model in response to a variety of stimuli, the robustness of the model with respect to both numerical and physiological parameters, and the connection of the model to anatomical and electrophysiological data are described in detail elsewhere (Kenyon, Harvey et al., 2004; Kenyon & Marshak, 1998; Kenyon et al., 2003; Kenyon, Theiler et al., 2004; Kenyon, Travis et al., 2004). Some model parameters were modified from previously published values. Specifically, all axon-mediated feedback interactions were assigned a fixed synaptic delay of 2 msec, and a finite transmission delay, corresponding to an action potential conduction velocity of 4 ganglion cells per time step (∼1 mm/msec), was incorporated. The time constant of the axon-bearing amacrine cells was also increased slightly, from 5 to 7.5 msec. By increasing the overall delay of the axon-mediated negative feedback loop, the above changes tended to strengthen the coherent oscillations in the retinal model, but this effect was compensated for by reducing the strength of the feedback inhibition onto the model ganglion cells by a factor of 0.44. Overall, the above changes made no qualitative difference in the behavior of the coherent oscillations in the dynamic feedback circuit but were implemented to make the model more realistic and provided a better fit to the high-frequency resonance in the responses of cat Y ganglion cells to rapidly modulated stimuli (Frishman, Freeman, Troy, Schweitzer-Tong, & Enroth-Cugell, 1987; Miller, Denning, George, Marshak, & Kenyon, 2006).
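To make the oscillatory mechanism concrete, the sketch below implements a deliberately reduced caricature of this feedback loop in Python: a population of leaky integrate-and-fire cells receives tonic drive and is inhibited by a delayed copy of its own population spike count. Only the 5 msec membrane time constant and the 2 msec feedback delay come from the text; the drive, coupling, and noise values are illustrative assumptions, and the gap junctions, distinct amacrine cell types, and spatial structure of the full model are all omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
dt = 0.1                        # integration step (msec)
tau_m = 5.0                     # ganglion cell membrane time constant (msec)
delay = int(2.0 / dt)           # axon-mediated feedback delay, in steps
n_cells, n_steps = 64, 10000    # 64 cells, 1000 msec of simulated time
drive, g_inh, noise = 0.35, 0.08, 0.6   # assumed drive, inhibition, noise

v = rng.uniform(0.0, 1.0, n_cells)      # membrane potentials (threshold = 1)
pop_spikes = np.zeros(n_steps)          # population spike count per step

for t in range(n_steps):
    # Delayed inhibition proportional to the earlier population spike count.
    inh = g_inh * pop_spikes[t - delay] if t >= delay else 0.0
    v += (drive - inh - v) * dt / tau_m \
         + noise * np.sqrt(dt / tau_m) * rng.standard_normal(n_cells)
    fired = v >= 1.0
    pop_spikes[t] = fired.sum()
    v[fired] = 0.0                      # reset after a spike

# The delayed negative feedback produces a spectral peak in the population
# rate; its exact location depends on the assumed loop delay and coupling.
rate = pop_spikes.reshape(-1, int(1.0 / dt)).sum(axis=1)   # 1 msec bins
spec = np.abs(np.fft.rfft(rate - rate.mean()))
freqs = np.fft.rfftfreq(rate.size, d=1e-3)                 # Hz
print("spectral peak near %.0f Hz" % freqs[spec.argmax()])
```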
2.2 Common Input Model. To control for the possibility that the dynamics of our retinal circuit model, which depended on long-range inhibitory feedback from spiking amacrine cells, could have introduced spurious higher-order correlations that might have unintentionally augmented the numerosity information encoded by coherent oscillations, we generated a second set of artificial spike trains in which the model ganglion cells were simply passive recipients of a common oscillatory input that uniformly modulated the firing rates of all neurons responding to the same contiguous stimulus. Starting with a frequency spectrum approximated by a single Lorentzian function, a best fit was obtained to the normalized Fourier amplitudes produced by the circuit model in response to an 8 × 8 stimulus. The specific form of the Lorentzian function used was (c/π)(a/2)/((f − b)² + (a/2)²), where f is the frequency and the three parameters a, b, and c, corresponding roughly to the width, center frequency, and amplitude of the spectral peak, were determined empirically to be 12.9844 Hz, 73.4968 Hz, and 6.4584, respectively. To obtain a time-dependent oscillatory signal with realistic temporal correlations, we performed an inverse Fourier transform by assigning random phases, chosen uniformly between 0 and 2π, to the individual frequency components. Parallel sets of artificial spike trains with realistic spatiotemporal correlations were then constructed by applying the same time-dependent oscillatory signal to an 8 × 8 array of Poisson spike generators that simultaneously modulated their instantaneous firing rates.
A constant offset was added to the oscillatory signal to fix the baseline firing rate of the Poisson spike generators at the same value exhibited by ganglion cells in the retinal model, here equal to 31 Hz, well within the range of observed resting firing rates for Y ganglion cells in the cat retina (Passaglia, Enroth-Cugell, & Troy, 2001). Each distinct object was represented by a separate 8 × 8 array of Poisson spike generators, obtained by repeating the above process using an independent distribution of random phase assignments, ensuring that the gamma-band oscillations arising from separate stimuli were uncorrelated. Additional mathematical details of a related procedure for generating artificial spike trains via common input modulation are described elsewhere (Kenyon, Theiler et al., 2004).
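The following sketch reproduces this construction: the fitted Lorentzian amplitude spectrum, uniform random phases, an inverse Fourier transform, and a constant offset that pins the baseline rate of an 8 × 8 array of Poisson generators at 31 Hz. The gain converting the dimensionless oscillatory signal into a firing-rate modulation is not specified in the text and is an assumption here; an independent draw of the random phases would model a second, uncorrelated object.

```python
import numpy as np

rng = np.random.default_rng(1)
T, dt = 1.0, 1e-3                     # 1000 msec at 1 msec resolution
n = int(T / dt)
freqs = np.fft.rfftfreq(n, d=dt)      # frequency axis (Hz)

# Lorentzian amplitude spectrum with the empirically fitted parameters.
a, b, c = 12.9844, 73.4968, 6.4584    # width (Hz), center (Hz), amplitude
amp = (c / np.pi) * (0.5 * a) / ((freqs - b) ** 2 + (0.5 * a) ** 2)

# Uniform random phases give a signal with realistic temporal correlations.
phases = rng.uniform(0.0, 2.0 * np.pi, freqs.size)
signal = np.fft.irfft(amp * np.exp(1j * phases), n=n)

baseline = 31.0                       # Hz, resting rate of model ganglion cells
gain = 20.0 / signal.std()            # assumed modulation depth (Hz)
rate = np.clip(baseline + gain * signal, 0.0, None)

# An 8 x 8 array of Poisson generators shares the same modulation.
spikes = rng.random((64, n)) < rate * dt   # boolean raster, 64 cells x n bins
print("mean firing rate: %.1f Hz" % (spikes.sum() / (64 * T)))
```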
2.3 Stimulation and Data Analysis. Both natural and computer-generated images containing various numbers of distinct objects, encompassing a range of sizes, shapes, shadings, and separations, were presented to the circuit model. Multiunit spike trains were constructed by combining the output of all stimulated output neurons into a single time series. For natural images, a binary mask was used to determine which neurons to include in the multiunit spike train. Nearly identical results to those presented here were obtained when an activity-dependent mask, which selected only those cells that fired a suprathreshold number of spikes over a brief time interval, was used instead of a fixed mask, but the fixed mask allowed more controlled comparisons across stimulus conditions. In most experiments, stimuli were presented for 200 msec prior to collecting data, so that only steady-state activity during the plateau portion of the response was considered. Multiunit spike trains obtained during the plateau portion of the response were typically 1000 msec, but shorter durations, corresponding to 125, 250, and 500 msec, were examined as well. In one set of experiments, we also considered the numerosity encoded by gamma activity during the transient portion of the response, using multiunit spike trains that included only the first 50 msec following stimulus onset. In all cases, single-trial frequency spectra were obtained as follows: the autocorrelation function was first computed directly from the multiunit spike train and then Fourier transformed, yielding a conventional power spectrum. The result was then divided by the total number of spikes (power in the DC band) to control for variations in the combined area of stimulation. Finally, the square root was taken of each frequency component, so that the final values reflected normalized spectral amplitudes. For each stimulus condition, corresponding to different numbers of objects in the input image, means and standard deviations of the normalized frequency spectra were computed by analyzing the results from 25 independent trials. The peak amplitude in the upper gamma band typically fell between the discrete steps along the frequency axis. To compensate, the maximum lag value of the autocorrelation function was adjusted for each stimulus condition so that the spectral peak was maximal. This latter procedure had no qualitative impact on our results or conclusions but was conducted to ameliorate aliasing effects due to the use of discrete frequency components. Finally, to obtain a scalar estimate of the normalized gamma activity, the normalized spectral amplitudes were averaged from 65 to 85 Hz, a range that approximately enclosed the peak oscillatory response. The predicted gamma activity for multiple objects was determined by dividing the normalized gamma activity produced in response to a single object by the square root of the number of objects in the corresponding input image.
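A compact version of this analysis chain is sketched below, following the steps in the order just given: autocorrelation, Fourier transform, normalization by the power in the DC band (which the text identifies with the total spike count), square root, and averaging over 65 to 85 Hz. The per-condition adjustment of the maximum autocorrelation lag is omitted for brevity.

```python
import numpy as np

def normalized_gamma_activity(spike_times, duration, bin_size=1e-3,
                              band=(65.0, 85.0)):
    # Bin the multiunit spike train (spike_times in seconds).
    n_bins = int(round(duration / bin_size))
    counts, _ = np.histogram(spike_times, bins=n_bins, range=(0.0, duration))

    # Autocorrelation of the binned train; its Fourier transform is a
    # conventional power spectrum (Wiener-Khinchin theorem).
    acf = np.correlate(counts, counts, mode="full")
    power = np.abs(np.fft.rfft(acf))
    freqs = np.fft.rfftfreq(acf.size, d=bin_size)

    # Divide by the power in the DC band to control for the stimulated
    # area, then take the square root to get normalized amplitudes.
    amplitude = np.sqrt(power / power[0])

    # Scalar estimate: average amplitude over the upper gamma band.
    in_band = (freqs >= band[0]) & (freqs <= band[1])
    return amplitude[in_band].mean()
```

With this normalization, a single coherent source yields an amplitude that is independent of the number of contributing cells, while n independent sources yield amplitudes that fall off as the inverse square root of n.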
Comparisons to behavioral data were facilitated by constructing artificial winner-take-all numerosity detectors that responded to single-trial estimates of normalized gamma activity falling in a given span (see Figure 1d). Input ranges were nonoverlapping, such that one and only one detector responded to each stimulus presentation, corresponding to the node whose target normalized gamma activity was closest to the actual normalized gamma activity present on that trial. The probability of discharge for each detector in response to images containing various numbers of objects was computed from the mean and standard deviation of the normalized gamma activity under each stimulus condition, under the assumption that all measured quantities were gaussian distributed. The target normalized gamma activity for each numerosity detector was determined separately in each experiment by the mean value produced by images containing the preferred number of items.
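Under this gaussian assumption, the discharge probabilities can be computed in closed form: the winning detector is the one whose target is nearest the single-trial value, so the decision boundaries are simply the midpoints between neighboring targets. A sketch follows; the target values and noise level in the example are hypothetical.

```python
import numpy as np
from scipy.stats import norm

def detector_probabilities(targets, mu, sigma):
    # Sort targets so that decision boundaries are the midpoints between
    # neighboring target values; the nearest target wins each trial.
    order = np.argsort(targets)
    t = np.asarray(targets, dtype=float)[order]
    edges = np.concatenate(([-np.inf], 0.5 * (t[:-1] + t[1:]), [np.inf]))

    # Probability that a gaussian trial value N(mu, sigma^2) lands in each
    # detector's interval, mapped back to the original detector order.
    p_sorted = norm.cdf(edges[1:], mu, sigma) - norm.cdf(edges[:-1], mu, sigma)
    p = np.empty_like(p_sorted)
    p[order] = p_sorted
    return p

# Hypothetical example: targets following a square-root-of-n rule for
# preferred numerosities 1-4, evaluated for a two-object stimulus.
targets = 0.2 / np.sqrt(np.arange(1, 5))
print(detector_probabilities(targets, mu=targets[1], sigma=0.02))
```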
3 Results

To determine whether coherent oscillations could, in principle, encode numerosity, we stimulated the model circuit with computer-generated images containing different numbers of identical square spots, with each spot covering an array of 8 × 8 output cells (see Figure 2a). Normalized frequency
Figure 2: Numerical discriminations mediated by coherent oscillations. (a) A circuit model for producing realistic, object-specific oscillations was stimulated by square spots of equal intensity, each covering an array of 8 × 8 output neurons. (b) Frequency spectra computed from multiunit spike trains, 1000 msec in duration and which included the plateau responses of all stimulated neurons, exhibited maximal gamma activity (after normalizing by the total number of spikes) when only one object was present and declined steadily as the number of objects increased. (c) Black bars: Normalized gamma activity computed from the average spectral amplitude between 65 and 85 Hz on single stimulus trials. Error bars denote one standard deviation of the corresponding distribution. Gray bars: Prediction of the square-root-of-n rule. (d) Solid lines: Percentage activation of winner-take-all numerosity detectors. Only one detector, whose target level was closest to the normalized gamma activity, was active on any given trial. Dashed lines: Tuning curves measured behaviorally from monkeys performing a numerical discrimination task (Nieder & Miller, 2004a). Model and behavioral tuning curves exhibited similar features.
spectra, computed from multiunit spike trains 1000 msec in duration obtained during the plateau portion of the response and consisting of all stimulated cells, exhibited different amounts of gamma activity depending on the number of objects in the image, being maximal when only one object was present and declining as the number of objects increased (see Figure 2b). To obtain a scalar estimate of gamma activity, the spectral amplitude was averaged over the response peak, 65 to 85 Hz. The mean normalized gamma activity declined as the number of objects increased in a manner well predicted by the square-root-of-n rule (see Figure 2c). Error bars denote one standard deviation of the corresponding distribution, not the standard deviation of the mean. There was good discrimination between images containing either one or two objects, as the corresponding means were well outside the associated error bars, but the overlap between neighboring distributions grew steadily as more objects were added, consistent with the size effect observed in a variety of animal and human studies (Nieder, 2005).
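The square-root-of-n prediction can be checked with a few lines of simulation: if the cells within one object share a common gamma modulation while separate objects carry independent random phases, the gamma power of the combined train grows linearly with n, the total spike count also grows linearly, and the normalized amplitude computed as in section 2.3 falls off as one over the square root of n. The sketch below uses a sinusoidal 75 Hz modulation of assumed depth in place of the model's Lorentzian-band oscillations.

```python
import numpy as np

rng = np.random.default_rng(2)
dt, T = 1e-3, 1.0
t = np.arange(0.0, T, dt)
base, mod, f0 = 31.0, 20.0, 75.0      # Hz; modulation depth is an assumption

def norm_gamma(n_objects, cells_per_object=64):
    # Cells within one object share a common 75 Hz modulation; separate
    # objects get independent random phases, so their oscillations add
    # with random phase while their spike counts add linearly.
    counts = np.zeros(t.size)
    for _ in range(n_objects):
        phase = rng.uniform(0.0, 2.0 * np.pi)
        rate = np.clip(base + mod * np.cos(2 * np.pi * f0 * t + phase), 0, None)
        counts += rng.poisson(rate * dt * cells_per_object)
    power = np.abs(np.fft.rfft(counts)) ** 2
    freqs = np.fft.rfftfreq(t.size, d=dt)
    band = (freqs >= 65.0) & (freqs <= 85.0)
    # Divide by the DC power (the squared total spike count) and take the
    # square root, as in the analysis of the main text.
    return np.sqrt(power[band].mean() / power[0])

for n in (1, 2, 4, 8, 16):
    vals = [norm_gamma(n) for _ in range(25)]
    print(n, "%.4f" % np.mean(vals), "(expect ~ 1/sqrt(n))")
```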
To quantify the numerosity information encoded by coherent oscillations, we constructed an array of winner-take-all numerosity detectors that responded to the normalized gamma activity present on single stimulus trials. One and only one detector was activated by each stimulus presentation, corresponding to the element whose target activity, defined as the mean response to images containing the preferred number of objects, was closest to the normalized gamma activity present on that trial. Using the means and standard deviations of the normalized gamma activity for each stimulus condition, percentage activation rates were determined for each detector in response to images containing from one to four objects (see Figure 2d, solid lines). When only one object was present in the input image, the 1-object detector was always activated, corresponding to 0% false negatives. In response to images containing two, three, or four objects, the activation rates for the 2-, 3-, and 4-object detectors were 84%, 67%, and 45%, respectively, indicating an increasing percentage of false negatives for detectors preferring larger numerosities. Likewise, the percentage of false positives, given by the integrated response over all detectors other than the correct node, increased with preferred numerosity as well. The theoretical tuning curves computed from the single-trial normalized gamma activity agreed quite well with previously published behavioral data (see Figure 2d, dashed lines; Nieder & Miller, 2004a). Monkeys were trained to report whether the number of items in a test stimulus, given by the value along the abscissa, equaled the number of items in the target image, with each curve representing a different preferred numerosity. The behavioral and model data both exhibited strong size and distance effects. Tuning curves became broader as the preferred numerosity increased, and the overlap between pairs of tuning curves tended to zero as the absolute difference in their preferred numerosities became larger. Because our theoretical analysis involved no fine-tuning to fit the experimental results—the
Number-Selective Responses
1777
parameters of the circuit model having been established previously on the basis of electrophysiological recordings—the close correspondence between the model and behavioral data suggests that the encoding of numerosity may be a robust property of coherent, stimulus-specific oscillations. As an additional test of robustness, we obtained qualitatively similar results after repeating the above experiment using the same basic circuit but with slightly different, previously published parameter values (see section 2), confirming that the encoding of numerosity by coherent oscillations did not require precise tuning of the model. The innate number sense, as characterized both behaviorally and electrophysiologically, depends only on numerosity and is independent of the specific properties of the individual objects in the scene, such as their shape or location. We asked whether normalized gamma activity obtained from the model circuit encoded numerosity in a manner that was likewise robust against nonnumerical factors (see Figure 3). The first set of input images contained various numbers of objects encompassing different shapes, sizes, and orientations, such that the total stimulated area remained invariant (see Figure 3a). This control was necessary to ensure that the normalized gamma activity was in fact specific for numerosity, and not simply a decreasing function of the total stimulated area. As a second, complementary control, we used images in which object size increased in proportion to the number of items present, so that the total stimulated area grew supralinearly (see Figure 3b). Third, to ensure that high-contrast edges were not required for normalized gamma activity to encode numerosity, we repeated the basic experimental protocol using objects that stimulated the same number of cells but possessed continuously shaded boundaries, falling off over one-quarter cycle of a cosine function so that stimulus intensity was described by a series of square, isoluminant contours with the same total integrated strength as the standard 8 × 8 spot (see Figure 3c). For both the binary and gray-scale images, the normalized gamma activity encoded numerosity in a manner independent of object-specific properties (see Figure 3d), consistent with the innate number sense characterized experimentally. In addition, the decline in normalized gamma activity as a function of the number of objects was well predicted by a square-root-of-n rule. In the above experiments, only stimulated cells contributed to the normalized gamma activity. The restriction to stimulated cells was necessary because when all cells were included in the calculation, coherent oscillations no longer provided a useful measure of numerosity, probably because the oscillatory signals arising from the subset of stimulated neurons were overwhelmed by the baseline of ongoing activity. Background firing rates in the model were rather high, approximately 31 Hz, consistent with experimentally measured values (Passaglia et al., 2001). However, oscillatory activity in the retina can propagate to the visual cortex (Castelo-Branco et al., 1998), where the background firing rates of projection neurons may be much lower. Thus, the extraction of numerosity from coherent oscillations
Figure 3: Invariance of numerosity encoding. (a) Constant area. Individual object sizes were decreased so as to keep the total stimulated area invariant. (b) Supralinear growth. Object sizes increased in proportion to the number of objects in the image. (c) Shaded boundaries. Sharp, high-contrast edges were replaced by smooth gradations. (d) Normalized gamma activity encoded numerosity in a manner consistent with a square-root-of-n rule regardless of object size, shape, or shading.
at points further along the visual processing pathway may not require the elimination of unstimulated cells from the analysis. Alternatively, it may be possible to identify stimulated cells from the same data used to estimate numerosity. As a preliminary test of this possibility, we reanalyzed the responses to the gray-scale objects (see Figure 3c) using an activity-dependent mask, whose threshold was chosen to best discriminate between stimulated and unstimulated cells based on 100 msec spike trains taken from the middle of each trial, to determine which individual units to include in the analysis. In this simple test, we obtained virtually identical results to those shown above, since the activity-dependent mask rarely differed by more than one or two cells from the fixed binary mask specified a priori. Although a detailed account of how numerosity might be extracted physiologically is beyond the scope of this study, an activity-dependent mask might be implemented physiologically by facilitating synapses (Beierlein & Connors, 2002) or dendritic nonlinearities (Polsky, Mel, & Schiller, 2004).
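The activity-dependent mask itself is straightforward to state in code: keep any unit whose spike count in a brief window exceeds a threshold. A minimal sketch is given below; the threshold in the example is hypothetical, whereas the text chose it to best separate stimulated from unstimulated cells.

```python
import numpy as np

def activity_mask(spike_counts, threshold):
    # Boolean mask selecting units deemed "stimulated" from their spike
    # counts in a brief window (the text uses 100 msec taken from the
    # middle of each trial).
    return np.asarray(spike_counts) >= threshold

# Hypothetical example: at a 31 Hz resting rate, about 3 spikes are expected
# in 100 msec, so a threshold of 6 passes only strongly driven cells.
counts = np.array([2, 3, 8, 11, 1, 9])
print(activity_mask(counts, threshold=6))  # [False False True True False True]
```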
Behavioral experiments indicate that numerical discriminations do not depend on the absolute difference in the number of items but rather on their ratio (Nieder, 2005; Nieder & Miller, 2003). To determine whether coherent oscillations encode numerosity in a manner consistent with the ratio law observed experimentally, we examined the discrimination mediated by normalized gamma activity in response to computer-generated images containing 1, 2, 4, 8, and 16 identical square spots, each covering an array of 8 × 8 model ganglion cells (see Figure 4). As in the above experiments, frequency spectra were computed from mass spike trains that included all stimulated cells. Spectral amplitudes in the upper gamma band, normalized by the total number of spikes, continued to decrease as the number of objects was raised, up to and including the maximum numerosity tested (see Figure 4a). As a function of the number of objects, the mean normalized spectral amplitude between 65 and 85 Hz (gamma activity) adhered quite closely to a square-root-of-n rule, even for images containing up to 16 items (see Figure 4b). Any discrimination procedure based on the percentage change in normalized gamma activity will therefore depend only on the ratio of the number of items being compared. Plotted on a logarithmic scale, model numerosity detectors exhibited symmetrical tuning curves whose half-width at half-maximum (HWHM) was approximately constant (see Figure 4c). The total area under each tuning curve was always unity; the apparent differences were linear distortions due to the logarithmic scale. To the extent that the HWHM remains linearly proportional to the number of items, numerosity discrimination would be expected to obey a ratio law. We also asked whether the retinal model could encode the numerosity of more realistic gray-scale images, using a photograph that contained five airplanes on a tarmac (see Figure 5a). A somewhat artificial image was employed for these experiments to achieve good object specificity. To simulate both single- and multiple-plane images, we summed the spikes from either one or five distinct groups of selected ganglion cells, indicated by the corresponding binary masks (see Figures 5b and 5c). As with computer-generated images, the normalized frequency spectra computed from the multiunit spike trains showed a clear effect of numerosity, decreasing substantially when the data consisted of responses to five as opposed to a single airplane (see Figure 5d). In addition, the normalized gamma activity computed from the multiunit spike train containing responses to all five planes fell well within the error bars of the value predicted from a simple square-root-of-n rule (see Figure 5e). To ensure that the numerosity encoded in the multiunit spike train corresponding to a single plane was not confounded by the presence of additional items, we repeated the experiment using a new image in which the other four planes were digitally removed,
Figure 4: Ratio law. The circuit model was stimulated with images containing 1, 2, 4, 8, or 16 identical square spots, each covering an 8 × 8 array of output neurons. (a) Normalized frequency spectra. Gamma-band activity declined monotonically for numerosities up to 16. (b) Normalized gamma activity (black bars) closely adhered to a square-root-of-n rule (gray bars). (c) Percentage activation of winner-take-all numerosity detectors. On a logarithmic scale, tuning curves separated by powers of 2 were all approximately of the same width and had similar overlap with their neighbors, consistent with the ratio law observed experimentally.
Figure 5: Naturalistic stimuli. (a) Input image containing five airplanes on a tarmac. It was necessary to use a somewhat artificial image for this experiment to achieve adequate segmentation. (b) Binary mask used to select a multiunit spike train corresponding to a single plane. (c) Mask used to select a multiunit spike train that included responses to five separate planes. (d) Normalized frequency spectra showed clear effects of numerosity. (e) Normalized gamma activity provided good discrimination between multiunit spike trains containing one plane versus five planes, consistent with the prediction of a square-root-of-n rule.
obtaining equivalent results (not shown). These findings suggest that coherent oscillations may encode numerosity even in response to nonidealized gray-scale images. The above experiments all used an analysis window of 1000 msec during the plateau portion of the response. To be of maximum behavioral significance, however, the numerosity information encoded by coherent oscillations should be available on shorter timescales as well. We therefore used the circuit model to investigate how numerosity information encoded by gamma activity during the plateau portion of the response might degrade
as the size of the analysis window or, equivalently, the length of the multiunit spike train record was reduced (see Figure 6). The mean normalized gamma activity was roughly constant for analysis window sizes down to approximately 200 msec and then began rising steadily as the spike train duration was decreased still further, reflecting an overall increase in broadband noise (see Figure 6a). The standard deviation in the distribution of single-trial normalized gamma activities also increased steadily as the spike train duration was reduced (see Figure 6a, error bars). The degradation of numerosity encoding with smaller window sizes was reflected in the corresponding tuning curves, which became broader as the analysis time was decreased (see Figure 6b). However, considerable numerosity information was still present even at the shortest analysis times tested. In particular, reasonable numerical discrimination during the plateau portion of the response was still possible even for analysis times of only 125 msec. So far we have considered only the sustained portion of the multiunit spike train recorded after the initial response transients have settled to a stable plateau level. One advantage of focusing on the oscillations present during the asymptotic portion of the response is that correlations due to stimulus coordination become negligible after approximately 100 msec, as revealed by a nearly flat shift predictor once the plateau period has been reached (Kenyon et al., 2003). When instead we analyzed multiunit spike trains recorded during the transient portion of the response, the predicted performance on numerical discrimination tasks was greatly impaired (data not shown). Although the model predicted large-amplitude oscillations during the response transient, there was very little specificity between objects due to strong stimulus coordination. However, the model consisted of an unrealistically homogeneous population whose onset latencies were all identical, potentially exacerbating the problem of stimulus coordination during the response transient. When a gaussian spread of onset latencies, with width equal to ±5 msec, was imposed on the model input cells, reasonable performance on the numerical discrimination task was obtained (see Figure 7). Normalized spectra showed a clear trend toward lower amplitudes as the number of objects was increased (see Figure 7a). The peaks became broader in part due to the relatively low resolution (20 Hz) at which individual frequency components could be resolved using 50 msec analysis windows. Normalized gamma activity also showed a clear decline with increasing numerosity (see Figure 7b), although the widths of the corresponding distributions, indicated by the attached error bars, were rather large due to the reduced signal-to-noise in numerical estimates based on 50 msec multiunit spike records. Similarly, the theoretical tuning curves predicted by a bank of winner-take-all numerosity detectors were much broader than in the standard case (see Figure 7c), although some level of numerosity discrimination was still clearly evident. Our results suggest that as long as there is a sufficient spread of onset latencies to support adequate object specificity during the transient portion of the response, coherent
Figure 6: Effect of analysis window size. (a) Normalized gamma activity versus spike train duration as a function of numerosity. Regardless of the analysis window size, normalized gamma activity declined with increasing numerosity. Distribution widths (error bars) increased for shorter spike trains. Baseline gamma activity also increased as the analysis window was reduced below approximately 200 msec. (b) Percentage activation of artificial numerosity detectors. Tuning curves became broader as the analysis time was decreased, although considerable numerosity information was still present even for analysis windows of 250 msec.
Figure 7: Rapid encoding of numerosity during the response transient. Multiunit spike trains were recorded in response to images containing one to four identical square spots but limited in duration to the first 50 msec following stimulus onset. To reduce stimulus coordination, onsets were jittered by an average of ±5 msec independently for each input cell. (a) Normalized frequency spectra. Amplitudes declined with increasing numerosity. (b) Normalized gamma activity declined with numerosity in a manner roughly consistent with a square-root-of-n rule. Distribution widths (error bars) were very large, reflecting reduced signal-to-noise. (c) Percentage activation of winner-take-all numerosity detectors. Tuning curves, though broad, nonetheless indicated that numerosity discriminations could be supported by coherent oscillations during the response transient.
oscillations could begin to encode numerosity very shortly after stimulus onset, consistent with the measured onset latencies of number-selective cortical neurons, which are on the order of 100 msec (Nieder & Miller, 2004b). Previous analysis of the circuit model suggests that coherent oscillations between separate objects will become phase locked as their spacing is reduced (Kenyon, Travis et al., 2004). The resulting breakdown in object specificity will necessarily confound the numerosity information encoded by coherent oscillations. To quantify this effect, we examined how the normalized gamma activity was affected as a group of four objects, each covering an 8 × 8 array of model ganglion cells, was moved closer together (see Figure 8a). As expected, at high packing densities, the normalized gamma activity became strongly dependent on the interobject spacing, with the transition between object-specific and nonspecific activity occurring at separations approximately three to five times the distance between neighboring output neurons (see Figure 8b). However, even at the highest densities examined, the four objects still could be distinguished from a single object on the basis of normalized gamma activity alone, although the exact numerosity was no longer represented unambiguously. Thus, even though strict specificity broke down as the objects were packed closer than approximately five interneuron spacings, significant numerosity information was still retained even for very small separations. Moreover, as long as the objects in the image were sufficiently well separated, the numerosity encoded by the normalized gamma activity in the circuit model was independent of their spatial distribution. The above experiments all employed objects possessing the same maximum stimulus intensity. Would coherent oscillations continue to encode numerosity at stimulus intensities corresponding to different levels of luminance contrast? Computer-generated images containing from one to four identical objects, with each again covering an array of 8 × 8 output neurons, were presented to the circuit model with the maximum stimulus intensity set to 1/2 or 3/2 times the standard value (see Figure 9, columns 1 and 2, respectively). For a single object, the averaged plateau firing rates were 50, 62, and 87 Hz, corresponding to stimulus intensities of 1/2, 1, and 3/2, respectively, relative to a baseline firing rate of 31 Hz. Despite dividing by the total number of events in the multiunit spike train, the normalized spectral amplitudes were not contrast invariant (see Figure 9a; compare columns 1 and 2). Coherent oscillations in the circuit model are more sensitive to contrast than are the firing rates (Kenyon, Theiler et al., 2004), possibly contributing to the lack of contrast invariance. Nonetheless, for any given contrast, the normalized spectral amplitudes still declined steadily with increasing object number. Likewise, the normalized gamma activity followed a square-root-of-n rule over a threefold range of stimulus intensities (see Figure 9b). In addition, the numerosity discrimination exhibited by the winner-take-all detectors was also roughly invariant with respect to stimulus intensity, as revealed by the similarity of the corresponding tuning
Figure 8: Effects of object spacing. (a) The spacing between four identical objects, each covering an 8 × 8 array of output neurons, was reduced from 8 to 0, with 1 corresponding to the distance between two neighboring output neurons. (b) Normalized gamma activity increased as the spacing between objects was reduced. The oscillations evoked by separate objects became partially phase-locked as they were moved closer together, obscuring numerosity information, although discrimination between one and four objects was still possible even at the smallest nonzero separations tested.
curves (see Figure 9c). These results imply that coherent, object-specific oscillations can encode the numerosity of distinct visual items at various luminance contrasts, as long as the differences in the relative luminance contrast between objects in the same image are not too large. Otherwise, some mechanism for contrast normalization, or for making the image representation more binary, would be required to apply the model as currently formulated. We are unaware of any behavioral experiments that explicitly address the issue of how relative luminance contrast affects innate numerical comparisons. It was also useful to consider how numerosity estimates based on normalized gamma activity became degraded as the disparity in the relative sizes of the different objects in the image was made progressively larger. The model circuit was presented with sets of images containing various numbers of objects with equal intensity but whose relative areas differed by a factor of up to 16 (see Figure 10). We expected that the strong size dependence of the coherent oscillations, coupled with surround inhibition, would act to partially preserve numerosity information in the normalized gamma activity even when the image contained objects of different sizes. For input images consisting of one relatively large spot, covering a 16 × 16 array of model ganglion cells, in combination with either one, two, or three much smaller spots, each covering only a 4 × 4 array of model ganglion cells, the numerosity information conveyed by the normalized gamma activity was seriously confounded (see Figure 10a). Some numerosity information was still preserved, as it was possible to partially distinguish between one versus four objects (see Figure 10a, right column), but overall such large discrepancies in relative size had a major impact on the accuracy of numerical comparisons. The predicted performance on numerosity discrimination tasks improved dramatically as the difference in the relative sizes of the individual objects was reduced. For example, the numerosity encoded by gamma activity in response to images containing objects whose relative size differed by a factor of 4 supported discriminations comparable to that obtained when relative size differed by only a factor of 2 (see Figure 10b versus Figure 10c). Although performance on numerosity discrimination tasks, when based solely on the normalized gamma activity, clearly degraded as the size disparity between the individual objects became very large, coherent oscillations continued to encode potentially useful numerosity information over an approximately fourfold range of relative object sizes. We are unaware of any behavioral experiments that explicitly examine the issue of how numerical discriminations are affected by large differences in relative object size, but our results clearly predict a significant deterioration in task performance as such disparities become exceedingly large. Finally, to increase the generality of our conclusions, we considered the possibility that the numerosity encoded by normalized gamma activity in the retinal model was influenced by simulation artifacts. Although the
pairwise correlations between model neurons have been shown to be consistent with those measured experimentally, our simulations, based as they were on complex, nonlinear feedback dynamics, might have introduced spurious higher-order correlations that unintentionally favored the extraction of numerosity from multiunit spike trains on single stimulus trials. As a control, we constructed a new set of multiunit spike trains based on a much simpler dynamical process: a common oscillatory modulation applied to an array of identical Poisson spike generators. To model stimuli containing n objects, we constructed n separate 8 × 8 arrays of Poisson spike generators, with each 8 × 8 array driven by an independent source of common modulatory input. This procedure yielded a set of n multiunit spike trains with realistic gamma-band oscillations that were mutually uncorrelated. As with the data generated by the more complex circuit model, the multiunit spike trains produced by common modulatory input supported reasonable numerosity discrimination (see Figure 11). Indeed, the diminished normalized spectral amplitudes with increasing numerosity (see Figure 11a), the corresponding decline in normalized gamma activity (see Figure 11b, black bars), the general adherence to a square-root-of-n rule (see Figure 11b, shaded bars), and the average responses of the numerosity detectors to single-stimulus presentations (see Figure 11c) were all nearly identical to the corresponding measures obtained from the dynamic feedback model. However, because the correlations between the separate Poisson spike generators were due entirely to a common modulatory input, the possibility of higher-order artifacts being present in these experiments was greatly reduced.

4 Discussion

To test the hypothesis that coherent oscillations can implicitly encode numerosity, images containing different numbers of objects, encompassing various shapes, sizes, and shadings, as well as different packing densities, intensities, and areas of illumination, were used to stimulate a model feedback circuit that generated gamma-band activity consistent with experimental data. As a function of the number of objects in the input image,
Figure 9: Numerosity discrimination at different stimulus intensities. Left column: Stimulus intensity = 1/2 (fraction of standard value). Right column: Stimulus intensity = 3/2. (a) Normalized frequency spectra. Although Fourier amplitudes were not contrast invariant, they still declined monotonically with increasing numerosity regardless of the stimulus intensity employed. (b) Normalized gamma activity followed a square-root-of-n rule over a threefold range of stimulus intensities. (c) Percentage activation of winner-take-all numerosity detectors. Tuning curves were roughly independent of stimulus intensity.
Figure 10: Numerosity discriminations involving stimuli of different relative size. All images contained one large object, covering a 16 × 16 array of output neurons, plus zero to three smaller objects. Relative sizes are illustrated by the insets at the end of each row. (a) Top row: Ratio of large to small object size, 16:1. (b) Middle row: Ratio reduced to 4:1. (c) Bottom row: Ratio reduced to 2:1. Left column: Normalized gamma activity. The square-root-of-n rule was strongly violated at very large ratios of relative object size but was in reasonable accord with the data for relative size ratios less than approximately 4:1. Right column: Percentage activation of winner-take-all numerosity detectors. Tuning curves became reasonably distinct for relative size ratios less than approximately 4:1.
the normalized gamma activity measured during the plateau portion of the response followed a square-root-of-n rule rather closely, consistent with the expectation that the gamma-band oscillations arising from within a single object add in phase, whereas the oscillations arising from separate objects add with random phase. The square-root-of-n rule was found to be generally robust to the shape, size, shading, contrast, and distribution of the individual objects with several notable exceptions. Decreasing the separation between objects confounded numerosity estimates at high packing densities, as the oscillations evoked by separate objects became partially coherent (Kenyon, Travis et al., 2004). The inability of the feedback circuit to completely segment cluttered natural scenes, especially those containing large, overlapping objects, makes it unlikely that the model could have been successfully applied to more realistic images. Likewise, numerosity comparisons involving sets of objects encompassing very different relative sizes or very different relative contrasts (here modeled as stimulus intensity) were also confounded, as the brightest and largest objects could, if the relative disparity was sufficiently great, dominate the normalized gamma activity. Finally, using shorter analysis windows reduced the signal-to-noise ratio and thus led to degraded numerosity discrimination. Nonetheless, except in the most limiting cases, it was usually possible to discriminate between one and several objects based on the normalized gamma activity alone, virtually regardless of nonnumerical factors. To quantitatively assess the numerosity information encoded by coherent, object-specific oscillations, we constructed winner-take-all numerosity detectors based on the normalized gamma activity integrated across all stimulated neurons. An individual numerosity detector was activated whenever the normalized gamma activity on that trial fell nearest to its target value, given by the mean gamma activity evoked by images containing its preferred number of objects. The accuracy of these detectors followed the same general trends for numerosity discrimination observed behaviorally (Nieder & Miller, 2003). As the number of objects increased, the tuning curves broadened, so that it was progressively more difficult to distinguish sets of objects differing by a fixed number of items. Likewise, sets of objects became more distinguishable as their numerical separation increased. These two effects, known as the size and distance effects, respectively, were both clearly evident in our results and suggest that coherent, object-specific oscillations can encode numerosity in a manner consistent with behavioral data. Previous attempts to account for an innate number sense have used one-dimensional neural networks based on arrays of position and size filters (Dehaene & Changeux, 1993). However, to implement an analogous two-dimensional model, large numbers of additional shape and orientation filters, representing all possible 2D objects, would presumably be required. Coherent, object-specific oscillations, on the other hand, can in principle represent numerosity regardless of the shape, size, or locations of the
individual visual items. Our proposed mechanism for encoding numerosity, based on normalized gamma activity, thus eliminates the need to allocate large numbers of additional neurons in order to represent every possible object to be enumerated. Other modeling efforts have shown how number-selective responses might be learned de novo by employing an appropriate
plasticity rule (Verguts & Fias, 2004), but these experiments assumed as input a “location map” in which objects were represented a priori as single points, leaving open the question of how numerosity detectors might be constructed from low-level, retinotopic inputs. A completely different approach to an innate number sense is represented by accumulator models, which posit that objects are enumerated sequentially, with the tally being summed into a noisy short-term buffer (Gallistel & Gelman, 2000). Both behavioral and electrophysiological experiments, however, have cast doubt on accumulator models (Nieder, 2005). The response latencies of number-selective cortical neurons, as well as those measured behaviorally, suggest that numerosity is processed in parallel rather than sequentially. Moreover, the response curves recorded from number-selective cortical neurons suggest that numerosity is represented by a population code distributed across multiple broadly tuned elements. Our findings suggest that coherent oscillations encode numerosity in a form that can be extracted from multiunit spike trains over physiologically meaningful timescales. One advantage of using normalized gamma activity for encoding numerosity is that the complexity of the model does not increase as a function of either the number of objects to be detected or the number of possible combinations of object size, shape, or distribution. By employing numerosity detectors based purely on the normalized gamma activity, we could naturally take advantage of the translational and rotational invariance of low-level visual circuitry, thus automatically generalizing over a wide range of nonnumerical properties. The issue of how numerosity information encoded by coherent oscillations might be extracted by downstream elements was not addressed in this study, although it has been suggested that increased gamma activity or synchrony could differentially activate target structures (Alonso, Usrey, & Reid, 1996; Kenyon, Fetz, & Puff, 1990; Usrey, Alonso, & Reid, 2000). In addition, known physiological mechanisms such as facilitating synapses (Beierlein &
Figure 11: Common input model. To control for the possibility that our simulations introduced spurious higher-order correlations that inadvertently enhanced numerosity encoding, control sets of artificial multiunit spike trains were constructed by applying a common oscillatory modulation to separate 8 × 8 arrays of identical Poisson spike generators. Separate, uncorrelated modulatory signals, produced by randomizing the phases of the individual Fourier components, were applied to each 8 × 8 array, corresponding to distinct visual objects. (a) Normalized frequency spectra. (b) Normalized gamma activity. (c) Percentage activation of winner-take-all numerosity detectors. Coherent, object-specific, gamma-band oscillations produced by the common input model encoded numerosity in a manner that was very similar to the dynamic feedback circuit model.
Connors, 2002), nonlinear dendritic processing (Polsky et al., 2004), and dynamical resonances (Miller et al., 2006; Frishman et al., 1987; Rager & Singer, 1998; Vigh, Solessio, Morgans, & Lasater, 2003) might confer a differential sensitivity to oscillatory input. Innate numerical competency extends well beyond the few examples considered here. Rather than being limited to simultaneously presented visual items, the number sense characterized extensively in animals and humans encompasses multiple sensory and motor modalities (Dehaene, 1997), such as the ability to respond to a particular number of tones (Meck & Church, 1983) or to generate a specified number of bar presses (Mechner & Guevrekian, 1962). Coherent oscillations arising early in the visual processing pathway can at best account for only a narrow aspect of this much wider phenomenon. Nonetheless, we may ask to what extent the mechanisms illustrated by our restricted model might be expanded to include a broader range of behaviors. Although our results were obtained using a model of the inner retina, gamma-band oscillations of cortical origin (Castelo-Branco et al., 1998; Gray et al., 1989; Gray & Singer, 1989; Kreiter & Singer, 1996; Singer & Gray, 1995) could likely encode similar, and indeed much higher-quality, numerosity information. The superior processing resources available to the cerebral cortex could presumably generate coherent oscillations that were much more object specific. In particular, by utilizing additional segmentation cues, such as shape, texture, color, depth, and motion, cortically generated coherent oscillations might encode numerosity much more reliably than was true of our comparatively simple feedback circuit. From this point of view, the much higher frequencies of the coherent oscillations arising in the retina versus those arising in the cerebral cortex (Castelo-Branco et al., 1998) suggest a possible strategy for distinguishing between numerosity information encoded at these two very distinct processing stages. Although their status remains speculative, theoretical models have also been proposed to explain how coherent oscillations could be used to represent temporal sequences (Hopfield & Brody, 2001), suggesting a possible mechanism for extending these findings to numerosity discrimination tasks involving nonsimultaneous stimuli. Oscillatory responses have been observed in a variety of brain regions, including auditory (Brosch, Budinger, & Scheich, 2002), motor (Murthy & Fetz, 1996a, 1996b), olfactory (Eeckman & Freeman, 1990), and somatosensory (Jones & Barth, 1997) cortical areas. The principles hypothesized to underlie the responses of number-selective neurons to visual inputs may therefore apply to the encoding of numerosity in other sensory modalities as well. Acknowledgments We acknowledge technical assistance from Mark Flynn, Chris Wood, James Theiler, David Marshak, John George, Reid Porter, and Bryan Travis. This
work was supported in part by the Department of Energy Office of Nonproliferation Research and Engineering.

References

Alonso, J. M., Usrey, W. M., & Reid, R. C. (1996). Precisely correlated firing in cells of the lateral geniculate nucleus. Nature, 383, 815–819.
Beierlein, M., & Connors, B. W. (2002). Short-term dynamics of thalamocortical and intracortical synapses onto layer 6 neurons in neocortex. Journal of Neurophysiology, 88, 1924–1932.
Brannon, E. M., & Terrace, H. S. (1998). Ordering of the numerosities 1 to 9 by monkeys. Science, 282, 746–749.
Brosch, M., Budinger, E., & Scheich, H. (2002). Stimulus-related gamma oscillations in primate auditory cortex. Journal of Neurophysiology, 87, 2715–2725.
Castelo-Branco, M., Neuenschwander, S., & Singer, W. (1998). Synchronization of visual responses between the cortex, lateral geniculate nucleus, and retina in the anesthetized cat. Journal of Neuroscience, 18, 6395–6410.
Cordes, S., Gelman, R., Gallistel, C. R., & Whalen, J. (2001). Variability signatures distinguish verbal from nonverbal counting for both large and small numbers. Psychonomic Bulletin and Review, 8, 698–707.
Dacey, D. M., & Brace, S. (1992). A coupled network for parasol but not midget ganglion cells in the primate retina. Visual Neuroscience, 9, 279–290.
Davis, H., & Perusse, R. (1988). Numerical competence in animals: Definitional issues, current evidence and a new research agenda. Behavioral and Brain Sciences, 11, 561–615.
Dehaene, S. (1997). The number sense. New York: Oxford University Press.
Dehaene, S. (2001). Precis of the number sense. Mind and Language, 16, 16–36.
Dehaene, S., & Changeux, J. P. (1993). Development of elementary numerical abilities—A neuronal model. Journal of Cognitive Neuroscience, 5, 390–407.
Eeckman, F. H., & Freeman, W. J. (1990). Correlations between unit firing and EEG in the rat olfactory system. Brain Research, 528, 238–244.
Feigenson, L., Dehaene, S., & Spelke, E. (2004). Core systems of number. Trends in Cognitive Sciences, 8, 307–314.
Freed, M. A. (2000). Rate of quantal excitation to a retinal ganglion cell evoked by sensory input. Journal of Neurophysiology, 83, 2956–2966.
Frishman, L. J., Freeman, A. W., Troy, J. B., Schweitzer-Tong, D. E., & Enroth-Cugell, C. (1987). Spatiotemporal frequency responses of cat retinal ganglion cells. Journal of General Physiology, 89, 599–628.
Gallistel, C. R., & Gelman, I. I. (2000). Non-verbal numerical cognition: From reals to integers. Trends in Cognitive Sciences, 4, 59–65.
Gray, C. M., König, P., Engel, A. K., & Singer, W. (1989). Oscillatory responses in cat visual cortex exhibit inter-columnar synchronization which reflects global stimulus properties. Nature, 338, 334–337.
Gray, C. M., & Singer, W. (1989). Stimulus-specific neuronal oscillations in orientation columns of cat visual cortex. Proceedings of the National Academy of Sciences USA, 86, 1698–1702.
Hopfield, J. J., & Brody, C. D. (2001). What is a moment? Transient synchrony as a collective mechanism for spatiotemporal integration. Proceedings of the National Academy of Sciences USA, 98, 1282–1287.
Jacoby, R., Stafford, D., Kouyama, N., & Marshak, D. (1996). Synaptic inputs to ON parasol ganglion cells in the primate retina. Journal of Neuroscience, 16, 8041–8056.
Jones, M. S., & Barth, D. S. (1997). Sensory-evoked high-frequency (gamma-band) oscillating potentials in somatosensory cortex of the unanesthetized rat. Brain Research, 768, 167–176.
Kenyon, G. T., Fetz, E. E., & Puff, R. D. (1990). Effects of firing synchrony on signal propagation in layered networks. In D. S. Touretzky (Ed.), Advances in neural information processing systems, 2 (pp. 141–148). San Mateo, CA: Morgan Kaufmann.
Kenyon, G. T., Harvey, N. R., Stephens, G. J., & Theiler, J. (2004). Dynamic segmentation of gray-scale images in a computer model of the mammalian retina. In A. G. Tescher (Ed.), Proc. SPIE: Applications of Digital Image Processing XXVII (pp. 1–12). Bellingham, WA: SPIE.
Kenyon, G. T., & Marshak, D. W. (1998). Gap junctions with amacrine cells provide a feedback pathway for ganglion cells within the retina. Proceedings of the Royal Society B: Biological Sciences, 265, 919–925.
Kenyon, G. T., Moore, B., Jeffs, J., Denning, K. S., Stephens, G. J., Travis, B. J., George, J. S., Theiler, J., & Marshak, D. W. (2003). A model of high-frequency oscillatory potentials in retinal ganglion cells. Visual Neuroscience, 20, 465–480.
Kenyon, G. T., Theiler, J., George, J. S., Travis, B. J., & Marshak, D. W. (2004). Correlated firing improves stimulus discrimination in a retinal model. Neural Computation, 16, 2261–2291.
Kenyon, G. T., Travis, B. J., Theiler, J., George, J. S., Stephens, G. J., & Marshak, D. W. (2004). Stimulus-specific oscillations in a retinal model. IEEE Transactions on Neural Networks, 15, 1083–1091.
Kreiter, A. K., & Singer, W. (1996). Stimulus-dependent synchronization of neuronal responses in the visual cortex of the awake macaque monkey. Journal of Neuroscience, 16, 2381–2396.
Mechner, F., & Guevrekian, L. (1962). Effects of deprivation upon counting and timing in rats. Journal of the Experimental Analysis of Behavior, 5, 463–466.
Meck, W. H., & Church, R. M. (1983). A mode control model of counting and timing processes. Journal of Experimental Psychology: Animal Behavior Processes, 9, 320–334.
Miller, J. A., Denning, K. S., George, J. S., Marshak, D. W., & Kenyon, G. T. (2006). A high frequency resonance in the responses of retinal ganglion cells to rapidly modulated stimuli: A computer model. Visual Neuroscience, 23, 779–794.
Murthy, V. N., & Fetz, E. E. (1996a). Oscillatory activity in sensorimotor cortex of awake monkeys: Synchronization of local field potentials and relation to behavior. Journal of Neurophysiology, 76, 3949–3967.
Murthy, V. N., & Fetz, E. E. (1996b). Synchronization of neurons during local field potential oscillations in sensorimotor cortex of awake monkeys. Journal of Neurophysiology, 76, 3968–3982.
Neuenschwander, S., Castelo-Branco, M., & Singer, W. (1999). Synchronous oscillations in the cat retina. Vision Research, 39, 2485–2497.
Neuenschwander, S., & Singer, W. (1996). Long-range synchronization of oscillatory light responses in the cat retina and lateral geniculate nucleus. Nature, 379, 728–732.
Number-Selective Responses
1797
Nieder, A. (2005). Counting on neurons: The neurobiology of numerical competence. Nature Reviews Neuroscience, 6, 177–190. Nieder, A., Freedman, D. J., & Miller, E. K. (2002). Representation of the quantity of visual items in the primate prefrontal cortex. Science, 297, 1708–1711. Nieder, A., & Miller, E. K. (2003). Coding of cognitive magnitude: Compressed scaling of numerical information in the primate prefrontal cortex. Neuron, 37, 149–157. Nieder, A., & Miller, E. K. (2004a). Analog numerical representations in rhesus monkeys: Evidence for parallel processing. Journal of Cognitive Neuroscience, 16, 889– 901. Nieder, A., & Miller, E. K. (2004b). A Parieto-frontal network for visual numerical information in the monkey. Proceedings of the National Academy of Sciences USA, 101, 7457–7462. O’Brien, B. J., Isayama, T., Richardson, R., & Berson, D. M. (2002). Intrinsic physiological properties of cat retinal ganglion cells. Journal of Physiology, 538, 787–802. Passaglia, C. L., Enroth-Cugell, C., & Troy, J. B. (2001). Effects of remote stimulation on the mean firing rate of cat retinal ganglion cells. Journal of Neuroscience, 21, 5794–5803. Pica, P., Lemer, C., Izard, V., & Dehaene, S. (2004). Exact and approximate arithmetic in an Amazonian indigene group. Science, 306, 499–503. Polsky, A., Mel, B. W., & Schiller, J. (2004). Computational subunits in thin dendrites of pyramidal cells. Nature Neuroscience, 7, 621–627. Rager, G., & Singer, W. (1998). The response of cat visual cortex to flicker stimuli of variable frequency. European Journal of Neuroscience, 10, 1856–1877. Sejnowski, T. J., & Paulsen, O. (2006). Network oscillations: Emerging computational principles. Journal of Neuroscience, 26, 1673–1676. Shadlen, M. N., & Movshon, J. A. (1999). Synchrony unbound: A critical evaluation of the temporal binding hypothesis. Neuron, 24, 67–77. Singer, W., & Gray, C. M. (1995). Visual feature integration and the temporal correlation hypothesis. Annual Review of Neuroscience, 18, 555–586. Usrey, W. M., Alonso, J. M., & Reid, R. C. (2000). Synaptic interactions between thalamic inputs to simple cells in cat visual cortex. Journal of Neuroscience, 20, 5461–5467. Vaney, D. I. (1994). Patterns of neuronal coupling in the retina. Progress in Retinal and Eye Research, 13, 301–355. Verguts, T., & Fias, W. (2004). Representation of number in animals and humans: A neural model. Journal of Cognitive Neuroscience, 16, 1493–1504. Vigh, J., Solessio, E., Morgans, C. W., & Lasater, E. M. (2003). Ionic mechanisms mediating oscillatory membrane potentials in wide-field retinal amacrine cells. Journal of Neurophysiology, 90, 431–443. Whalen, J., Gallistel, C. R., & Gelman, R. (1999). Nonverbal counting in humans: The psychophysics of number representation. Psychological Science, 10, 130–137.
Received August 12, 2005; accepted September 15, 2006.
LETTER
Communicated by Bartlett Mel
Selectivity and Stability via Dendritic Nonlinearity Kenji Morita [email protected] RIKEN Brain Science Institute, Wako, Saitama 351-0198, Japan, and Institute of Industrial Science, University of Tokyo, Tokyo 153-8505, Japan
Masato Okada [email protected] Department of Complexity Science and Engineering, Graduate School of Frontier Sciences, University of Tokyo, Kashiwa 277-8561, Japan
Kazuyuki Aihara [email protected] Institute of Industrial Science, University of Tokyo, Tokyo 153-8505, Japan, and Aihara Modelling Project, ERATO, JST, Tokyo 151-0064, Japan
Inspired by recent studies regarding dendritic computation, we constructed a recurrent neural network model incorporating dendritic lateral inhibition. Our model consists of an input layer and a neuron layer that includes excitatory cells and an inhibitory cell; this inhibitory cell is activated by the pooled activities of all the excitatory cells, and it in turn inhibits each dendritic branch of the excitatory cells that receive excitations from the input layer. Dendritic nonlinear operation, consisting of branch-specifically rectified inhibition and saturation, is described by imposing nonlinear transfer functions before summation over the branches. In this model with sufficiently strong recurrent excitation, on transiently presenting a stimulus that has a high correlation with the feedforward connections of one of the excitatory cells, the corresponding cell becomes highly active, and the activity is sustained after the stimulus is turned off, whereas all the other excitatory cells continue to have low activities. But on transiently presenting a stimulus that does not have high correlations with the feedforward connections of any of the excitatory cells, all the excitatory cells continue to have low activities. Interestingly, such a stimulus-selective sustained response is preserved for a wide range of stimulus intensity. We derive an analytical formulation of the model in the limit where individual excitatory cells have an infinite number of dendritic branches and prove the existence of an equilibrium point corresponding to such a balanced low-level activity state as observed in the simulations, whose stability depends solely on the signal-to-noise ratio of the stimulus. We propose this model as a model of stimulus selectivity equipped with self-sustainability and intensity invariance simultaneously, which has been difficult to achieve in conventional competitive neural networks with a similar degree of complexity in their network architecture. We discuss the biological relevance of the model in a general framework of computational neuroscience.
Neural Computation 19, 1798–1853 (2007)
© 2007 Massachusetts Institute of Technology
1 Introduction

Competitive neural networks with recurrent inhibition, which were originally devised to mimic the anatomical and physiological structure of the cortex (Wilson & Cowan, 1973; Amari, 1977), have been intensely studied both to model real biological neural networks in the brain and to create artificial neuromorphic systems. One biological research stream has been referred to as recurrent or feedback models of sensory processing (Ben-Yishai, Bar-Or, & Sompolinsky, 1995; Somers, Nelson, & Sur, 1995; Hansel & Sompolinsky, 1996). These models assume relatively weak recurrent excitation and exhibit activities only during stimulus presentation. One of their major achievements is that they realize good intensity invariance. Another research stream on biological competitive neural networks models stimulus-selective short-term memory, or decision making coupled with short-term memory (Amit & Brunel, 1997; Wang, 2002; Brunel, 2003; Machens, Romo, & Brody, 2005; Wong & Wang, 2006). Such models assume strong recurrent excitation, and consequently the activity of the winning cell is sustained even after the extinction of the stimulus; however, they usually do not show an intensity-invariant response.

In most models of the above-mentioned types, the neuronal unit is assumed to linearly summate all the excitatory and inhibitory inputs and then impose only a single nonlinear transformation on the sum. However, such a unit cannot describe any local nonlinear operation that would occur at each dendritic branch of real cortical neurons (Koch, Poggio, & Torre, 1983). It is well known that several types of GABAergic interneurons, including fast-spiking (FS) cells, preferentially or exclusively target perisomatic regions such as the soma, perisomatic dendrites, or axonal initial segments of pyramidal cells but do not target their distal dendrites. Therefore, regarding the inhibition from these perisoma-targeting GABAergic cells, the point-neuron description in the conventional competitive neural network models may be partly justified. However, recent studies have revealed that a certain class of GABAergic cells, referred to as non-FS cells, preferentially form synapses on the distal dendrites of pyramidal cells (Buhl et al., 1997; Tamas, Buhl, & Somogyi, 1997; Bacci, Rudolph, Huguenard, & Prince, 2003; Beierlein, Gibson, & Connors, 2003; Gao, Wang, & Goldman-Rakic, 2003). Therefore, localized interactions between the excitatory and inhibitory synaptic inputs on each pyramidal dendritic branch would occur in a network that contains pyramidal cells
and these distal dendrite-targeting non-FS GABAergic cells; these interactions cannot be described by conventional competitive neural networks using the point-neuron description. Recently, Spratling and Johnson (2001, 2002) proposed an incremental discrete-time preintegration lateral inhibition model wherein the dendritic local inhibition is described as a localized threshold-linear transformation on each dendritic branch of the neural unit, showing that their model was superior to the conventional lateral inhibition models with regard to a certain type of unsupervised learning (Spratling & Johnson, 2002).

In this study, we propose and examine a dendritic lateral inhibition (DLI) neural network model (Morita & Aihara, 2005), which is similar to that of Spratling and Johnson with regard to the description of dendritic local inhibition but differs from their model in certain aspects. First, we consider dendritic saturation. Second, we do not presume specific constraints regarding connection strengths. Third, we consider continuous-time dynamics rather than incremental discrete-time dynamics to facilitate mathematical analysis. Specifically, our model consists of an input layer and a neuron layer that includes excitatory pyramidal cells and an inhibitory non-FS cell. Pyramidal cells receive excitation from the input layer, self-excitation from themselves, and inhibition from the non-FS cell at each dendritic branch; the non-FS inhibitory cell is activated by excitation from the pyramidal cells. In each dendritic branch of a pyramidal cell, excitation from the input layer, self-excitation, and inhibition are initially summed linearly, and a nonlinear transfer function consisting of a threshold (rectification) and saturation is then imposed.

After mathematically defining the model in section 2 and the task imposed on it in section 3, we examine the dynamical behaviors of the model by simulation in section 4 and show that our model achieves a stimulus-selective, sustained, winner-take-all (WTA) response. Moreover, we show that this property is preserved for a considerably wider range of stimulus intensity in our model than in the conventional model. We provide a qualitative explanation of these aspects in section 5. In section 6, we derive an analytical formulation of the model in the limit where individual excitatory cells have an infinite number of dendritic branches. We perform local stability analysis using this analytical formula in section 7. The results are in excellent agreement with the simulation results. Specifically, we prove that there exists an equilibrium point corresponding to a state where all the cells have balanced low-level activities, and that its stability depends not on the stimulus intensity but on the signal-to-noise ratio. In section 8, we perform a reduction of the model using a type of mean-field approximation. Finally, in section 9, we compare the properties and performance of our model with those of other models. In addition, we discuss how dendritic inhibition actually operates in real neurons and how our model could be related to biological dendritic computation from the viewpoint of current computational neuroscience.
2 Model

We propose a neural network model, called the dendritic lateral inhibition (DLI) model (Morita & Aihara, 2005), describing recurrent interactions between excitatory pyramidal cells and an inhibitory GABAergic cell of the non-FS type, which makes synapses on the distal dendrites of the target pyramidal cells, in a relatively small region of the cerebral cortex, possibly a cortical column (see Figure 1b). The DLI model consists of an input layer, which represents an excitatory input source originating from an external stimulus, and an output neuron layer, including pyramidal cells and a non-FS cell (see Figure 1c). We consider four types of connection: (1) excitatory convergent-and-divergent connections from the input layer to the pyramidal cells in the output neuron layer, (2) excitatory connections from the pyramidal cells' axonal collaterals to the non-FS cell, which is activated by the pooled activities of all the pyramidal cells, (3) inhibitory connections from the non-FS cell to each dendritic branch (or, equivalently, each afferent connection) of every pyramidal cell, and (4) excitatory connections from the pyramidal cells' axonal collaterals to each dendritic branch of themselves. As to connections 1, we assume that all the synapses from the same external input source converge to single dendritic branches of individual pyramidal cells, because it has been shown that synaptic inputs from different input sources are preferentially mapped onto different specific regions of the dendritic tree (Shepherd, 2003). Connections 2 and 3 constitute the dendritic lateral inhibitions between the pyramidal cells via the pooled non-FS cell, whereas connections 4 form self-excitations. All the strengths of these connections (except connections 1) are assumed to be uniform. The dynamics of the activities of n pyramidal cells and a pooled non-FS cell can be described by the following differential equations (see Figure 1c):
$$\frac{dx_j}{dt} = \frac{1}{\tau_p}\left[-x_j + \frac{1}{m}\sum_{i=1}^{m}\sigma\left(w_{ji}I_i + \alpha x_j - \beta y\right)\right] \quad \text{for } j = 1, \dots, n \tag{2.1}$$

$$\frac{dy}{dt} = \frac{1}{\tau_G}\left[-y + \frac{\gamma}{n}\sum_{j=1}^{n}x_j\right]. \tag{2.2}$$
Here $m$ is the number of input sources; $I_i$ is the intensity of the ith component of the input sources (defining the m-dimensional stimulus vector as $\mathbf{I} \equiv (I_1 \cdots I_m)$); $x_j$ is the activity of the jth pyramidal cell; $y$ is the activity of the non-FS cell; $w_{ji}/m$ ($\geq 0$) is the weight of feedforward excitation from the ith input to the jth pyramidal cell; $\alpha/m$ ($\geq 0$) is the weight of excitation from every pyramidal cell to each dendritic branch of itself (self-excitation); $\beta/m$ ($\geq 0$) is the weight of inhibition from the non-FS cell to each dendritic branch of every pyramidal cell; $\gamma/n$ ($\geq 0$) is the weight of excitation from every pyramidal
cell to the non-FS cell; and $\tau_p$ and $\tau_G$ are the time constants of the dynamics of the pyramidal cells and the non-FS cell, respectively. $\sigma(z)$ denotes the transfer function consisting of the threshold and the upper bound:

$$\sigma(z) = \begin{cases} 0 & (z \leq \theta) \\ z - \theta & (\theta < z \leq \eta + \theta) \\ \eta & (\eta + \theta < z). \end{cases} \tag{2.3}$$
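For concreteness, here is a minimal NumPy sketch of equations 2.1 to 2.3; it is our own illustration rather than the authors' code, and the names (`sigma`, `dli_step`) and the forward-Euler scheme are our choices.

```python
import numpy as np

def sigma(z, theta=0.0, eta=2.0):
    """Dendritic transfer function of equation 2.3: zero below the
    threshold theta, linear in between, saturating at eta."""
    return np.clip(z - theta, 0.0, eta)

def dli_step(x, y, W, I, alpha, beta, gamma,
             tau_p=1.0, tau_G=1.0, dt=0.01, theta=0.0, eta=2.0):
    """One forward-Euler step of equations 2.1 and 2.2.
    x: activities of the n pyramidal cells; y: activity of the non-FS cell;
    W: n-by-m feedforward weight matrix; I: m-dimensional stimulus."""
    # Each branch receives w_ji*I_i + alpha*x_j - beta*y and is rectified
    # and saturated branch by branch *before* averaging over the m branches.
    branch = sigma(W * I[np.newaxis, :] + alpha * x[:, np.newaxis] - beta * y,
                   theta=theta, eta=eta)
    dx = (-x + branch.mean(axis=1)) / tau_p
    dy = (-y + gamma * x.mean()) / tau_G
    return x + dt * dx, y + dt * dy
```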
Figure 1: (a) Schematic diagram of the spatial restrictiveness of dendritic inhibition. A GABAA synaptic input on dendritic branch A (represented by the gray circle) effectively shunts the glutamatergic synaptic input (represented by the black circle) on the same branch A but has little effect on the glutamatergic inputs on the other dendritic branches, such as B or D. Note that although there is a GABAA synaptic input on branch C, it has no effect because there is no glutamatergic input on the same branch. (b) Schematic diagram of a neuronal network consisting of pyramidal cells (represented by black triangles) and a GABAergic non-FS cell (represented by a gray ellipse). There are four types of connections: (1) (represented by square outlines) excitatory feedforward (afferent) connections from external input sources such as the thalamus, other subcortical regions, or other parts of the cortex to the pyramidal cells, perhaps via excitatory interneurons (not explicitly represented); (2) (represented by the black hexagon) excitatory connections from the pyramidal cells' axonal collaterals to the non-FS cell, which is activated by the pooled activities of all the pyramidal cells; (3) (represented by black filled circles) inhibitory connections from the non-FS cell to each dendritic branch of every pyramidal cell; and (4) (represented by outlined circles) excitatory connections from the pyramidal cells' axonal collaterals to each dendritic branch of themselves. (c) Network architecture of the dendritic lateral inhibition (DLI) model (see equations 2.1 and 2.2). The symbols are the same as those used in b. (d) Network architecture of the prototypical conventional lateral inhibition model (see equations 2.4 and 2.5) that we used for comparison with the DLI model. (e) Schematic diagram of a stimulus. If the stimulus $\mathbf{I} = (I_1, \dots, I_m)^t$ is correlated with (i.e., parallel to) one of the pyramidal cells' (the kth cell's) feedforward (afferent) connections $\mathbf{w}_k^t = (w_{k1}, \dots, w_{km})^t$, the corresponding (kth) pyramidal cell is likely to receive the largest total input among all the pyramidal cells. We call this stimulus $\mathbf{I}\,(\parallel \mathbf{w}_k^t)$ a preferred stimulus for the kth pyramidal cell.

This transfer function represents the nonlinear input summation on each dendritic branch in the following sense. First, excitation from the input layer, self-excitation, and inhibition from the non-FS cell are assumed to be summed linearly for simplicity (but in section 9 we show that divisive inhibition also works). Second, the threshold $\theta$, which is assumed to be $\theta = 0$ in our simulations and mathematical analysis (although we will show that the model is rather robust to changes in $\theta$ in section 9.1), is imposed to represent
the spatial restrictiveness of each dendritic branch, such that GABAA receptor channel opening on a certain branch effectively shunts excitatory postsynaptic potentials (EPSPs) or prevents dendritic spike generation and propagation on the same branch, but does not affect other branches (see Figure 1a) (Miles, Toth, Gulyas, Hajos, & Freund, 1996; Larkum, Zhu, & Sakmann, 1999; Spruston, Stuart, & Häusser, 1999). Third, the upper bound $\eta$ represents saturation on each branch. In this way, the nonlinear transfer function is imposed on each dendritic branch before summing over all the branches. On the other hand, in the conventional models consisting of self-excitation and the usual lateral inhibition, only one nonlinear function is imposed after linear summation of all the inputs (Amari & Arbib, 1977; Feng & Hadeler, 1996; Hahnloser, 1998; Xie, Hahnloser, & Seung, 2002). To compare the dynamical properties of our model with those of the conventional models, we define a prototypical conventional model as follows (see Figure 1d):

$$\frac{dx_j}{dt} = \frac{1}{\tau_p}\left[-x_j + \sigma\left(\frac{1}{m}\sum_{i=1}^{m}w_{ji}I_i + \alpha x_j - \beta y\right)\right] \quad \text{for } j = 1, \dots, n \tag{2.4}$$

$$\frac{dy}{dt} = \frac{1}{\tau_G}\left[-y + \frac{\gamma}{n}\sum_{j=1}^{n}x_j\right]. \tag{2.5}$$
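A matching sketch of this conventional update, under the same conventions as the DLI sketch above (reusing `sigma` and, again, with our own naming); the only substantive change is that the single nonlinearity is applied after the average over branches, here with the positive threshold θ = 0.2 that section 3 assigns to this model:

```python
def conventional_step(x, y, W, I, alpha, beta, gamma,
                      theta=0.2, eta=2.0, tau_p=1.0, tau_G=1.0, dt=0.01):
    """One forward-Euler step of equations 2.4 and 2.5: all inputs are
    summed linearly first, then one transfer function is applied per cell."""
    total = (W @ I) / W.shape[1] + alpha * x - beta * y
    dx = (-x + sigma(total, theta=theta, eta=eta)) / tau_p
    dy = (-y + gamma * x.mean()) / tau_G
    return x + dt * dx, y + dt * dy
```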
Note that the only difference between this conventional model, equations 2.4 and 2.5, and our new DLI model, equations 2.1 and 2.2, is the order of the transfer function ($\sigma$) and the summation ($\sum$).

3 Tasks and Conditions

In this section, we explain the tasks imposed on the models. First, we describe the assumption about the stimulus presented to the network, defining the terms preferred stimulus, signal, noise, and intensity of the stimulus. Then we describe two types of tasks: persistent and transient presentations of the stimulus. We define the notion of a stimulus-selective sustained winner-take-all response. Finally, we discuss the parameter values. To examine the general dynamical properties of the DLI model (see equations 2.1 and 2.2), as well as the conventional lateral inhibition model (see equations 2.4 and 2.5), we assume a random feedforward weight matrix $W = (w_{ji})$ ($j = 1, 2, \dots, n$; $i = 1, 2, \dots, m$) (see Figure 1e), where each $w_{ji}$ is independently distributed according to a certain probability distribution:

$$P(w_{ji} = x) = p_{\text{aff}}(x). \tag{3.1}$$
If the stimulus vector $\mathbf{I} = (I_1 \cdots I_m)$ is parallel to one of the feedforward connections of a certain pyramidal cell, that is, to one (e.g., the kth) of the row vectors of $W = (w_{ji})$, $\mathbf{w}_k = (w_{k1} \cdots w_{km})$, the total feedforward synaptic excitation of the kth pyramidal cell is expected to be the largest among all the pyramidal cells, because $W$ is a random matrix and $\mathbf{I}$ should have no correlation with the feedforward connections (row vectors of $W$) other than $\mathbf{w}_k$. In this sense, we refer to this stimulus vector $\mathbf{I}$ parallel to $\mathbf{w}_k$ as the stimulus corresponding to the kth pyramidal cell, or the preferred stimulus for the kth pyramidal cell. Each pyramidal cell has its own preferred stimulus in the m-dimensional input space, and thus there are in total $n$ preferred stimuli in the whole network. We generally assume that the stimulus $\mathbf{I}$ consists of two vectors: a signal vector that is parallel to one of the pyramidal cells' (e.g., the kth cell's) feedforward connections $\mathbf{w}_k$, and a noise vector $\xi = (\xi_1, \dots, \xi_m)$, where $\xi_i$ is independently distributed according to a certain probability distribution:

$$P(\xi_i = x) = p_{\text{noise}}(x). \tag{3.2}$$
Specifically, the stimulus vector is represented as

$$\mathbf{I} = I_0(s\mathbf{w}_k + (1 - s)\xi), \tag{3.3}$$
where $s$ ($0 \leq s \leq 1$) is the signal ratio representing the proportion of the signal, and $I_0$ represents the intensity of the stimulus. A stimulus with $s = 0$ represents a completely random stimulus, whereas one with $s = 1$ represents a noiseless preferred stimulus for the kth pyramidal cell. Figure 2a illustrates how a noiseless stimulus becomes distorted as the noise increases or, equivalently, as the signal ratio $s$ decreases (see the caption and the text below for details). In order to proceed, we must assume concrete probability distributions. In the following, we assume that both $p_{\text{aff}}(x)$, the distribution of the weights of the feedforward connections $\mathbf{w}_j = (w_{j1} \cdots w_{jm})$, and $p_{\text{noise}}(x)$, the distribution of the components of the noise vector $\xi = (\xi_1, \dots, \xi_m)$, are uniform distributions on $[0, 1]$. Under these assumptions, the feedforward input on each single dendritic branch (i.e., $w_{ji}I_i$) is upper-bounded by $w_{ji}I_i \leq 1 \times I_0(s \times 1 + (1 - s) \times 1) = I_0$. In order to ensure that each dendritic branch, onto which the transfer function (see equation 2.3) is imposed in the DLI model, is not saturated by feedforward excitation alone, we assume

$$I_0 < \eta. \tag{3.4}$$
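The stimulus ensemble of equations 3.1 to 3.3 with these uniform distributions is easy to generate; the following sketch (our construction, with hypothetical names such as `make_stimulus`) also checks the per-branch bound that motivates assumption 3.4:

```python
rng = np.random.default_rng(0)
m, n = 900, 100
W = rng.uniform(0.0, 1.0, size=(n, m))    # feedforward weights, p_aff = U[0, 1]

def make_stimulus(W, k, s, I0):
    """Equation 3.3: I = I0 * (s * w_k + (1 - s) * xi), with noise
    components xi_i drawn i.i.d. from U[0, 1] (p_noise)."""
    xi = rng.uniform(0.0, 1.0, size=W.shape[1])
    return I0 * (s * W[k] + (1.0 - s) * xi)

I0 = 1.0
I = make_stimulus(W, k=49, s=0.4, I0=I0)  # preferred stimulus of cell 50
assert (W * I).max() <= I0                # per-branch bound: w_ji * I_i <= I0
```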
This assumption about the stimulus intensity seems plausible because the stimulus I is actually an output of neurons in another part of the neural system, such as a sensory organ or sensory cortex, whose intensity should be upper-bounded.

Figure 2: Responses of the conventional lateral inhibition model to the persistent presentation of the stimulus. (a) Schematic images of a series of stimuli with various signal ratios. The top 30 × 30 matrix image indicates the components of the stimulus vector $\mathbf{I} = I_0\mathbf{w}_{50}$, which is the preferred stimulus for the 50th pyramidal cell without noise (i.e., $s = 1$), permuted in ascending order (appearing as a continuous change in tone). The following matrix images show the components of a series of stimuli $\mathbf{I} = I_0(s\mathbf{w}_{50} + (1 - s)\xi)$, which are the 50th cell's preferred stimuli with various signal ratios ($s = 0.8, 0.6, 0.4, 0.2, 0$), permuted in the same order as in the $s = 1$ case (the top matrix). The gradual distortion of these matrix images indicates how the specific stimulus is distorted as the signal ratio $s$ decreases. (b–g) Steady-state activities of the pyramidal cells (black bars) and the non-FS cell (the rightmost light gray bar) in the conventional lateral inhibition model ($\alpha = 3$, $\beta = 1$, $\gamma = 20$, and $\theta = 0.2$) under the presence of the preferred stimulus for the 50th pyramidal cell with the various signal ratios indicated in a. The stimulus intensity ($I_0$) is systematically varied so that $I_0 = 1, 0.72, 0.68, 0.64, 0.6, 0.5$ from the left column (b) to the right one (g). The initial activities of the pyramidal cells were randomly chosen from the uniform distribution on $[0, 0.1]$, while the initial activity of the non-FS cell was set to zero in all the simulations.

We examine the responses of the DLI model and the conventional model in two different paradigms. One is persistent presentation of the stimulus, in which $\mathbf{I} \equiv (I_1, \dots, I_m)$ is time invariant. The other is transient presentation of the stimulus, in which $\mathbf{I} \equiv (I_1, \dots, I_m)$ is set to be positive only for a while and then reset to $\mathbf{I} = 0$. In both tasks, our interest is in what states the network converges to and how they depend on the signal ratio of the stimulus. First, we name three steady states that are of special interest: the state where only one of the pyramidal cells has a saturated activity and all the other pyramidal cells have no activity is called a winner-take-all (WTA) state; the state where multiple pyramidal cells have saturated activities while the other pyramidal cells have no activity is called a winners-share-all (WSA) state; and the state where all the pyramidal cells have no activity is called a quiescent state. Next, let us consider the transient presentation task. If the network satisfies the following two conditions, it is said to show a stimulus-selective sustained winner-take-all response: (1) when a stimulus with a large signal ratio is presented, the network converges to the WTA state in which the very pyramidal cell corresponding to the presented stimulus is the only winner, both in the presence of the stimulus and after its extinction; and (2) when a stimulus with a small or zero signal ratio is presented, the network converges to the quiescent state after the extinction of the stimulus. Such a response seems desirable if we consider the network as a kind of pattern discriminator. We will show that the DLI model, but not the conventional model, can show such a response for a wide range of stimulus intensity.

Here we discuss the specific values for the connection strengths and other parameters of the model. Considering that both models, equations 2.1 and 2.2 and equations 2.4 and 2.5, include self-inhibition terms, the net strength of self-excitation for the conventional model is equal to $\alpha - \beta\gamma/n$, and the maximum net strength of self-excitation for the DLI model, obtained when all the dendritic transfer functions (see equation 2.3) are in their linear ranges, is also equal to $\alpha - \beta\gamma/n$. Since we are especially interested in the sustained winner-take-all response, we assume
$$\alpha - \frac{\beta\gamma}{n} > 1 \tag{3.5}$$
because otherwise, the activity cannot be sustained after the extinction of the stimulus for both models. For the strength of lateral inhibition, it has been shown for the conventional lateral inhibition model that insufficient inhibition fails to achieve the WTA behavior (Amari & Arbib, 1977; Xie et al., 2002). Therefore, we assume that the effective strength of lateral inhibition, which is a product of the excitation from the pyramidal cells to
the non-FS cell ($\gamma$) and the inhibition from the non-FS cell to the pyramidal cells' dendrites ($\beta$), is larger than the strength of self-excitation:

$$\alpha < \beta\gamma. \tag{3.6}$$
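With the values used throughout the simulations (α = 3, β = 1, γ = 20, n = 100), both conditions indeed hold; a two-line check in the notation of the sketches above:

```python
alpha, beta, gamma = 3.0, 1.0, 20.0
assert alpha - beta * gamma / n > 1.0   # condition 3.5: 3 - 0.2 = 2.8 > 1
assert alpha < beta * gamma             # condition 3.6: 3 < 20
```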
We examine several parameter sets for the connection strengths ($\alpha$, $\beta$, $\gamma$) in the simulations and analyses, and we discuss the dependence of the model behavior on them in section 9.1. For the threshold of the transfer function, we set $\theta = 0$ for the DLI model and $\theta = 0.2$ for the conventional model, because a positive threshold is necessary for the latter, but not for the former, to realize stimulus selectivity (see below). We examine the dependence of the model behavior on $\theta$ in section 9.1. For the other parameters, we use $\eta = 2$, $m = 900$, $n = 100$, and $\tau_p = \tau_G = 1$. In section 6, we perform mathematical analysis in the limit of $m \to \infty$ and show that the number of output neurons ($n$) and the time constants ($\tau_p$, $\tau_G$) have little (for $n$) or no (for $\tau_p$, $\tau_G$) effect on the local stability.

4 Simulation Results

We begin by examining the properties of the conventional lateral inhibition model, equations 2.4 and 2.5, for the sake of later comparison. Consider the threshold $\theta$ of the transfer function $\sigma(z)$ (see equation 2.3). If $\theta$ is positive and the total feedforward input $\frac{1}{m}\mathbf{w}_j^t\mathbf{I}$ is less than $\theta$ for every pyramidal cell, it is obvious that the quiescent state is locally stable in the presence of the stimulus and thus that the conventional model converges to the quiescent state under the presence of such a stimulus if the initial activities are sufficiently small. Thereby, if we tune the intensity ($I_0$) of a series of stimuli with various signal ratios, $\{\mathbf{I} = I_0(s\mathbf{w}_k + (1 - s)\xi)\,|\,0 \leq s \leq 1\}$ (for a fixed $k$), to be sufficiently small so that $\frac{1}{m}\mathbf{w}_j^t\mathbf{I} = \frac{1}{m}I_0\mathbf{w}_j^t(s\mathbf{w}_k + (1 - s)\xi)$ is less than $\theta$ for every $j$ regardless of the signal ratio $s$, the conventional model always converges to the quiescent state under the presence of the stimulus regardless of its signal ratio, as shown in Figure 2g. When such a stimulus is presented transiently, the conventional model will remain at the quiescent state after the extinction of the stimulus, because the quiescent state is locally stable in the absence of the stimulus as well as in its presence (not shown). Note that in this case, even when the presented stimulus is a noiseless preferred stimulus for one of the pyramidal cells, the corresponding cell never shows sustained activity.

On the other hand, if the total feedforward input $\frac{1}{m}\mathbf{w}_j^t\mathbf{I}$ is greater than $\theta$ for at least one pyramidal cell, the conventional lateral inhibition model with sufficiently strong self-excitation and lateral inhibition (see equations 3.5 and 3.6) always converges to either a WTA state or a WSA state, as has been well studied (Amari & Arbib, 1977; Feng & Hadeler, 1996; Hahnloser, 1998; Xie et al., 2002). Thereby, if we tune the intensity ($I_0$) of a series of stimuli with various signal ratios,
$\{\mathbf{I} = I_0(s\mathbf{w}_k + (1 - s)\xi)\,|\,0 \leq s \leq 1\}$ (for a fixed $k$), to be sufficiently large so that $\frac{1}{m}\mathbf{w}_j^t\mathbf{I} = \frac{1}{m}I_0\mathbf{w}_j^t(s\mathbf{w}_k + (1 - s)\xi)$ is greater than $\theta$ for at least one $j$ regardless of the signal ratio $s$, the conventional model always converges to either a WTA state or a WSA state under the presence of the stimulus regardless of its signal ratio, as shown in Figure 2b. When such a stimulus is presented transiently, the conventional model will remain at the WTA state or the WSA state after the extinction of the stimulus, because these states are locally stable even in the absence of the stimulus. Note that in this case, at least one pyramidal cell inevitably shows sustained activity even when the presented stimulus is completely random and therefore never a preferred stimulus for that cell.

In this way, when the intensity ($I_0$) of the presented stimulus is too large or too small, the conventional model does not show the stimulus-selective sustained WTA response in the sense defined in section 3. In order for the conventional model to realize such a response, we should tune the intensity ($I_0$) of a series of stimuli $\{\mathbf{I} = I_0(s\mathbf{w}_k + (1 - s)\xi)\,|\,0 \leq s \leq 1\}$ (for a fixed $k$) to an appropriate level so that the maximum over cells of the total feedforward input, $\max_j \frac{1}{m}\mathbf{w}_j^t\mathbf{I}$, exceeds the threshold $\theta$ when and only when $s$ is sufficiently large. Figures 2c to 2f show such cases. Note that the critical value of the signal ratio $s$, above which the sustained WTA response appears, depends on the stimulus intensity ($I_0$): the greater the intensity, the smaller the critical signal ratio.

Figure 3a summarizes the results of a hundred simulation trials showing what response the conventional model gives to the transiently presented stimulus with various intensities ($I_0 = 1, 0.76, 0.72, 0.68, 0.64, 0.6, 0.56, 0.1, 0.01$ from the top to the bottom) depending on its signal ratio ($s$) (the horizontal axes). The white regions indicate that the model shows a correct WTA response, in which only the pyramidal cell corresponding to the stimulus shows sustained activity after the extinction of the stimulus; the gray regions indicate that the model shows a "wrong" WTA response or a WSA response, in which at least one pyramidal cell that does not correspond to the stimulus shows sustained activity; and the black regions indicate that the model returns to the quiescent state after the extinction of the stimulus. This figure clearly shows that the conventional model exhibits the stimulus-selective sustained WTA response, that is, a correct WTA response for large $s$ together with a quiescent response for small $s$, only when the stimulus intensity is in an appropriate range ($0.6 < I_0 < 0.72$ for this parameter set). Indeed, the conventional model cannot tolerate even a doubling or halving of the stimulus intensity while preserving such stimulus selectivity. (The conventional lateral inhibition model would realize stimulus selectivity for a wider range of stimulus intensity if some assumptions or structures beyond equations 2.4 and 2.5 were added, as discussed in section 9.)

Next, we examine our dendritic lateral inhibition (DLI) model described by equations 2.1 and 2.2. Figure 4e shows the steady-state activities of the
pyramidal cells (the black bars) and the non-FS cell (the rightmost light gray bar) of the DLI model under the persistent presentation of stimuli with a fixed intensity ($I_0 = 1$) and various signal ratios ($s = 1, 1/2, 1/4, 1/8, 0$). When there is no noise (the top row of Figure 4e), that is, when the presented stimulus is the noiseless preferred stimulus for the 50th pyramidal cell, only the corresponding 50th pyramidal cell has a saturated activity in the steady state. This steady state looks like a WTA state. However, on closer inspection, we can see that it is not a WTA state, because the other pyramidal cells have very small but positive activities. We refer to this novel type of steady state as a WTA-like state. To be exact, whether this presumable steady state is really a stable equilibrium point of the system needs mathematical proof, which will be given in section 7 in the limit of $m \to \infty$. The DLI model converges to the 50th cell's WTA-like state as long as the noise is not too large, or the signal ratio $s$ is not too small (the second and third rows of Figure 4e). However, when the noise increases (the fourth and bottom rows of Figure 4e), the DLI model no longer converges to the 50th cell's WTA-like state. Instead, the model converges to another state, in which all the pyramidal cells have very small but positive activities that are far from the saturation value. We refer to this type of steady state as a balanced low-level activity (BLA) state. Again, whether this state is really a stable equilibrium point needs mathematical justification, which will be given in section 7. Repeating the simulation for the DLI model while varying the signal ratio ($s$) in small steps suggests that the DLI model converges to either a WTA-like state or a BLA state according to the signal ratio. Specifically, there seems to be a certain critical value of $s$ that separates the two behaviors. In the trial shown in Figure 4e, this critical signal ratio lies between $s = 1/8$ and $s = 1/4$. Let us examine how these behaviors of the DLI model depend on the stimulus intensity ($I_0$).

Figure 3: Dependence of the stimulus-selective responses on the stimulus intensity in the conventional model (a) and the DLI model (b). Each panel shows what type of response the model gives to the stimulus with various signal ratios. The horizontal axis indicates the signal ratio ($s = 0$ to $1$) of the stimulus. The vertical axis indicates the fraction of simulation trials (unit: %); 100 trials were run for each parameter set. White: the model shows a correct winner-take-all response, in which only the pyramidal cell corresponding to the stimulus shows sustained activity after the extinction of the stimulus. Gray: the model shows a wrong winner-take-all response or a winners-share-all response, in which at least one pyramidal cell that does not correspond to the stimulus shows sustained activity. Black: the model shows a quiescent response in which all pyramidal cells return to being inactive. The stimulus intensity is systematically varied ($I_0 = 1, 0.76, 0.72, 0.68, 0.64, 0.6, 0.56, 0.1, 0.01$ from the top to the bottom).

Figures 4b, 4c, and 4d show the steady-state
activities of the cells in the DLI model under the presence of the stimulus with various signal ratios, whose intensity is one-thousandth ($I_0 = 1/1000$), one-hundredth ($I_0 = 1/100$), or one-tenth ($I_0 = 1/10$) of that in Figure 4e ($I_0 = 1$), respectively. Changing the stimulus intensity ($I_0$) over a thousandfold range barely changes the model's behavior: it converges to a WTA-like state when the signal ratio is large, and it converges to a BLA state when the signal ratio is small. Indeed, under the assumption that no dendritic branch is saturated by feedforward excitation alone, that is, $I_0 < \eta$ as assumed in equation 3.4, an arbitrary change in the stimulus intensity does not change the convergence property. We refer to this property of the DLI model as intensity-semi-invariant stimulus selectivity, in the sense that the DLI model retains its stimulus selectivity even for a stimulus with infinitesimal intensity. The likely mechanism of this intensity semi-invariance is that since no dendritic branch is saturated in the BLA state, the dynamics around the BLA state is intensity invariant due to the threshold linearity of the dendritic transfer function, unless the intensity is so large that dendritic saturation can no longer be ignored. This conjecture will be mathematically validated in the limit of $m \to \infty$ in sections 6 and 7.

Figure 4: Responses of the dendritic lateral inhibition (DLI) model to the persistent (top row: a–f) or transient (bottom row: g–l) presentation of the stimulus. (a) Schematic images of a series of stimuli $\mathbf{I} = I_0(s\mathbf{w}_{50} + (1 - s)\xi)$, which are the preferred stimuli for the 50th pyramidal cell with various signal ratios ($s = 1, 1/2, 1/4, 1/8, 0$ from the top to the bottom). (b–f) Steady-state activities of the pyramidal cells (black bars) and the non-FS cell (the rightmost light gray bar) in the DLI model ($\alpha = 3$, $\beta = 1$, and $\gamma = 20$) under the presence of the preferred stimulus for the 50th pyramidal cell with the various signal ratios indicated in a. The stimulus intensity ($I_0$) is systematically varied so that $I_0 = 1/1000, 1/100, 1/10, 1, 10$ from the left column (b) to the right one (f). The activity of the non-FS cell in f is around 5 and thus surpasses the upper limit (2) of the figure. (g–l) Temporal evolutions of the pyramidal cells' activities in the DLI model under transient presentation of the series of stimuli indicated in g, having different intensities $I_0 = 1/1000, 1/100, 1/10, 1, 10$ in h, i, j, k, and l, respectively. The horizontal axis indicates time. The stimuli are presented only from time $t = 10$ to time $t = 30$, as indicated by the horizontal gray bars. The black solid line indicates the activity evolution of the 50th pyramidal cell, which is the cell corresponding to the stimulus. Although the mean activity evolutions of all the other pyramidal cells are also indicated by light gray lines with black error bars representing variation over the population, they are too small (just above the horizontal axis lines) to be visible, except in l.

Note that if the stimulus intensity increases so much that assumption 3.4 is not satisfied, for example, $I_0 = 10$, the behavior is altered, as shown in Figure 4f (this is why we use the term semi-invariance): the corresponding 50th cell's activity no longer reaches the saturation value, although it is still the largest among all
the pyramidal cells. This change in behavior is due to the fact that the dendritic transfer function $\sigma(z)$, equation 2.3, has a finite upper bound.

Keeping this intensity-semi-invariant convergence property of the DLI model under the presence of the stimulus in mind, let us now examine how the DLI model responds to transient presentation of the stimulus. Figure 4k shows the temporal evolutions of the activities of the pyramidal cells in response to transient presentation (from $t = 10$ to $t = 30$, indicated by horizontal gray bars) of the stimulus with intensity $I_0 = 1$ and various signal ratios ($s = 1, 1/2, 1/4, 1/8, 0$ in each row): thick black lines indicate the activities of the pyramidal cell corresponding to the presented stimulus, while the activities of all the other pyramidal cells are indicated by gray lines, although they are too small (just near the horizontal axis line) to be visible. When the signal ratio $s$ is large (from the top row to the third row of Figure 4k), the DLI model converges to a WTA-like state during the stimulus and then eventually converges to a genuine WTA state after the extinction of the stimulus. On the other hand, when $s$ is smaller (the bottom two panels of Figure 4k), the DLI model converges to a BLA state during the stimulus and then eventually converges to the quiescent state after the extinction of the stimulus. This means that the DLI model achieves the stimulus-selective sustained WTA response, in the sense defined in section 3, for this series of stimuli. Furthermore, as shown in Figures 4h to 4j, which show the results of the same transient presentation task with stimulus intensities one-thousandth (see Figure 4h), one-hundredth (see Figure 4i), or one-tenth (see Figure 4j) of those in Figure 4k, this stimulus-selective sustained WTA response of the DLI model is unaffected by changes in the stimulus intensity, as with the convergence properties under persistent presentation described above. Note again that if the stimulus intensity is increased so that assumption 3.4 is not satisfied, the activity of the corresponding 50th neuron becomes the largest during the presentation of the stimulus, as described above, but eventually diminishes after the stimulus is turned off, failing to show a sustained WTA response, as shown in Figure 4l. Note, however, that even in such a case as Figure 4l, the model still shows correct stimulus selectivity during the presentation of the stimulus and, contrary to the conventional models, never converges to "wrong" WTA states.

Figure 3b summarizes the results of a hundred simulation trials showing what response the DLI model gives to the transiently presented stimulus with various intensities ($I_0 = 1, 0.76, 0.72, 0.68, 0.64, 0.6, 0.56, 0.1, 0.01$ from the top to the bottom) depending on its signal ratio ($s$) (the horizontal axes). This figure clearly shows that the stimulus-selective sustained WTA response of the DLI model and, moreover, the critical value of $s$ that separates the sustained WTA response from the quiescent response do not change with the stimulus intensity, even across a thousandfold range. The critical signal ratio seems to be around $s \approx 0.15$ to $0.2$, although there is some spread due to the randomness of the stimulus and of the initial values of the pyramidal cells' activities, which are changed on every
trial. In section 7, we show that in the limit m → ∞, there really is a critical value of s that can be characterized as a saddle-node bifurcation point associated with the appearance or disappearance of a stable equilibrium representing the BLA state.
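The transient-presentation protocol of Figures 4g to 4l is straightforward to reproduce with the sketches above; the driver below is our own construction (the winner criterion `x > 0.5 * eta` is an ad hoc choice, not taken from the paper), and its return values mirror the white/gray/black coding of Figure 3.

```python
def run_transient(W, k, s, I0, alpha=3.0, beta=1.0, gamma=20.0, eta=2.0,
                  dt=0.01, t_on=10.0, t_off=30.0, t_end=100.0):
    """Present the stimulus only for t in [t_on, t_off) (cf. Figures 4g-l)
    and classify which pyramidal cells remain active afterward."""
    n, m = W.shape
    x = rng.uniform(0.0, 0.1, size=n)          # initial conditions as in Figure 2
    y = 0.0
    I_stim = make_stimulus(W, k, s, I0)
    I_zero = np.zeros(m)
    for step in range(int(t_end / dt)):
        t = step * dt
        I = I_stim if t_on <= t < t_off else I_zero
        x, y = dli_step(x, y, W, I, alpha, beta, gamma, dt=dt, eta=eta)
    winners = np.flatnonzero(x > 0.5 * eta)    # ad hoc "sustained" criterion
    if winners.size == 0:
        return "quiescent"
    return "correct WTA" if np.array_equal(winners, [k]) else "wrong WTA/WSA"
```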
5 Qualitative Explanation of the Dynamics

When the signal ratio $s$ is small, the DLI model converges to a BLA state during stimulus presentation and eventually converges to the quiescent state after the extinction of the stimulus. Notably, no state like the BLA state appears in the conventional model. Therefore, convergence to a BLA state in the presence of a heavily noise-distorted stimulus seems to be the key feature of the DLI model that underlies the stimulus-selective sustained WTA response. In this section, before getting into the mathematical analysis, we give a qualitative explanation of how the BLA state can stably exist in the DLI model and how it depends on the signal ratio of the stimulus.

For comparison, let us begin by describing what happens in the conventional lateral inhibition model, equations 2.4 and 2.5, in which the transfer function is imposed only after the linear summation of all the inputs, including self-excitation. Assume that there is a steady state where the jth pyramidal cell, say, has a positive but nonsaturated activity. Note that this requires that the transfer function, equation 2.3, is neither below the threshold nor saturated but is in the linear range: $\sigma(z) = z - \theta$. If the activity of the jth cell is perturbed in the positive direction by $\Delta x_j$ ($> 0$) while the other cells' activities remain the same, this perturbation generates an additional self-excitation of $(\alpha - \beta\gamma/n)\,\Delta x_j$, which dominates the additional decay, equal to $|-\Delta x_j|$, because of the assumption of strong self-excitation (equation 3.5: $\alpha - \beta\gamma/n > 1$). Consequently, positive feedback would set in, and the state would move away from the original (hypothetical) steady state. Therefore, any steady state in which at least one pyramidal cell has a positive but nonsaturated activity would be, even if it existed, locally unstable in the conventional lateral inhibition model with strong self-excitation, equation 3.5. In other words, at any locally stable steady state of the conventional lateral inhibition model, the activity of each pyramidal cell is either zero or saturated. Indeed, all the stable steady states of the conventional model observed in the simulation (WTA states, WSA states, and the quiescent state) are of this type.

Figure 5: Schematic diagrams showing whether each dendritic branch of a single pyramidal cell is below or above the threshold. In the conventional lateral inhibition model, since there is only one threshold, there are only two cases: (a) the whole cell, including the soma and all the dendritic branches (though they are not explicitly represented in the conventional models), is below the threshold (indicated by black) or (b) above the threshold (indicated by white). But in the DLI model, there is a third case: (c) some, but not all, of the dendritic branches are below the threshold (black) while the rest are above the threshold (white). As the signal ratio $s$ increases, the number of dendritic branches of the corresponding pyramidal cell that are above the threshold increases (from c to d).

Next, let us consider the DLI model, equations 2.1 and 2.2, in which self-excitation and lateral inhibition are added to each dendritic branch, and the transfer function is also locally imposed on each dendritic branch. Assume that there is a steady state where the jth pyramidal cell has a positive but nonsaturated activity. In order for such a steady state to exist, some of the dendritic branches of the jth cell should be below the threshold (see Figure 5c). Specifically, only a fraction of the branches should exceed the threshold so that their dendritic transfer functions are in the linear range;
otherwise, the jth cell's activity would saturate due to the strong self-excitation assumed in equation 3.5. This condition can be met if dendritic lateral inhibition is sufficiently strong compared with self-excitation, for example, under conditions such as equation 3.6. Now assume that the activity of the jth cell is perturbed in the positive direction by $\Delta x_j$ ($> 0$) while the other cells' activities remain the same. This perturbation generates an additional self-excitation of $\frac{p}{m}(\alpha - \beta\gamma/n)\,\Delta x_j$, where $p$ represents the number of dendritic branches, out of the total $m$ branches, that exceed the threshold in that (hypothetical) steady state. If $p$ is small enough that the additional self-excitation does not dominate the additional decay, equal to $|-\Delta x_j|$, even under the strong self-excitation assumed in equation 3.5, the small perturbation will not induce positive feedback, contrary to the conventional model case. Therefore, a steady state in which at least one pyramidal cell has a positive but nonsaturated activity can be locally stable in the DLI model even with strong self-excitation (see equation 3.5). Note that both the WTA-like states and the BLA states, which appear in the DLI model but not in the conventional model, belong to this type of steady state.

Now let us assume that the DLI model stays around the BLA state when the signal ratio $s$ is zero, corresponding to a completely random stimulus. What happens if $s$ increases a bit, so that the stimulus becomes slightly correlated with the kth pyramidal cell's feedforward connections? Such a slight increase in $s$ is expected to increase the number of dendritic branches of the kth cell that exceed the threshold, say, from $p$ branches to $q$ ($> p$) branches (from Figure 5c to Figure 5d). This change would merely produce a slight increase in the kth cell's activity in a steady state of the BLA type if $q$ is only a bit larger than $p$. However, if we increase $s$ further, so that $q$ is large enough for the additional self-excitation $\frac{q}{m}(\alpha - \beta\gamma/n)\,\Delta x_j$ resulting from a perturbation $\Delta x_j$ to dominate the additional decay ($|-\Delta x_j|$), positive feedback occurs, and the DLI model no longer remains around the BLA state. The critical value of $s$ at which the additional self-excitation balances the additional decay corresponds to the critical signal ratio observed in the simulation. (As a rough illustration, with the parameter values of section 4, $\alpha = 3$, $\beta = 1$, $\gamma = 20$, and $n = 100$, the balance point is $q/m = 1/(\alpha - \beta\gamma/n) \approx 0.36$, so roughly a third of the branches must exceed the threshold before positive feedback takes over.)

6 Analytical Formulation of the Model

To obtain a mathematical basis for the dynamical properties of the DLI model discussed in the previous section, here we derive an analytical formulation of the model. Specifically, we introduce an idealized DLI model in which every pyramidal cell has an infinite number of dendritic branches. Since real neocortical pyramidal cells usually have hundreds of dendritic branches, such an approximation seems plausible. As described in section 3, we assume that the stimulus $\mathbf{I} = (I_1, \dots, I_m)$ is represented by

$$\mathbf{I} = I_0(s\mathbf{w}_k + (1 - s)\xi), \tag{6.1}$$
where $\mathbf{w}_k = (w_{k1}, \dots, w_{km})$ is the vector of feedforward connections of the kth pyramidal cell, that is, the noiseless preferred stimulus for that cell and hence the signal component of the stimulus; $\xi = (\xi_1, \dots, \xi_m)$ is a random noise vector; $s$ ($0 \leq s \leq 1$) is the signal ratio; and $I_0$ represents the stimulus intensity. Also as described in section 3, each component $w_{ji}$ of the feedforward weight matrix $W$ is independently distributed according to a distribution

$$P(w_{ji} = x) = p_{\text{aff}}(x), \tag{6.2}$$

and each component $\xi_i$ of the noise vector $\xi$ is also independently distributed according to a distribution:

$$P(\xi_i = x) = p_{\text{noise}}(x). \tag{6.3}$$
Now let us define the distribution of the value $w_{ji}I_i$, which represents the excitatory input from the ith component of the stimulus to the jth pyramidal cell, as follows:

$$P(w_{ji}I_i = I_0 x) = P\left(w_{ji}\frac{I_i}{I_0} = x\right) \equiv \begin{cases} p_{\text{win}}(x; s) & (j = k) \\ p_{\text{los}}(x; s) & (j \neq k). \end{cases} \tag{6.4}$$
Here $p_{\text{win}}(x; s)$ represents the probability density function for the kth pyramidal cell, the cell corresponding to the stimulus, whereas $p_{\text{los}}(x; s)$ represents that for the other, noncorresponding cells. Let us define the distribution functions of these densities as follows:

$$f_{\text{win}}(x; s) \equiv \int_0^x p_{\text{win}}(x'; s)\,dx' \quad \text{and} \quad f_{\text{los}}(x; s) \equiv \int_0^x p_{\text{los}}(x'; s)\,dx'. \tag{6.5}$$
We assume that these distributions are nonnegative and bounded above:

$$f_{\text{win}}(0; s) = f_{\text{los}}(0; s) = 0 \quad \text{and} \quad f_{\text{win}}(\phi; s) = f_{\text{los}}(\phi; s) = 1, \tag{6.6}$$

where $\phi$ denotes the upper bound. The total input to the jth pyramidal cell, including feedforward excitation, self-excitation, and dendritic lateral inhibition, is

$$J_j \equiv \frac{1}{m}\sum_{i=1}^{m}\sigma(w_{ji}I_i + \alpha x_j - \beta y), \tag{6.7}$$
where $\sigma$ denotes the transfer function (see equation 2.3) with the threshold equal to zero:

$$\sigma(z) = \begin{cases} 0 & (z \leq 0) \\ z & (0 < z \leq \eta) \\ \eta & (\eta < z). \end{cases} \tag{6.8}$$
Here, let us take the limit $m \to \infty$ so as to replace the summation over the $m$ branches by an expectation value with respect to the distribution of $w_{ji}I_i$, that is, $p_{\text{win}}(x; s)$ ($j = k$) or $p_{\text{los}}(x; s)$ ($j \neq k$). In order to correctly evaluate the function $\sigma$ in equation 6.8, what matters is the relationship between the range of the distribution of $w_{ji}I_i + \alpha x_j - \beta y$, that is, $[\alpha x_j - \beta y,\ I_0\phi + \alpha x_j - \beta y]$, and the range $[0, \eta]$ in which $\sigma$ is linear. In general, there are six possible relationships between these two ranges, as shown in Figure 6. In the previous section, we conjectured that in the BLA state, for an arbitrary pyramidal cell, some of the dendritic branches are in the linear regime, that is, their transfer functions are above the threshold but not saturated, with the rest below the threshold (see Figure 5c). This condition corresponds to the case of Figure 6d and is represented as

$$\alpha x_j - \beta y \leq 0 \ \text{ and } \ 0 \leq I_0\phi + \alpha x_j - \beta y \leq \eta \ \Leftrightarrow \ -I_0\phi \leq \alpha x_j - \beta y \leq \min(0,\ \eta - I_0\phi). \tag{6.9}$$
On the other hand, in the WTA or the WSA state, it would be expected that all the dendritic branches of the winning pyramidal cell(s) are saturated because of the strong self-excitation (see Figure 5b). This condition corresponds to the case of Figure 6c and is represented as

$$\alpha x_j - \beta y \geq \eta. \tag{6.10}$$
Figure 6: Six possible relationships between the range of the distribution of $w_{ji}I_i/I_0 + \alpha x_j - \beta y$, indicated by black bars, and the range $[0, \eta]$ where the function $\sigma$ is linear, indicated by gray bars.

Since we are interested in the dynamics around the BLA state or the WTA-like state, we will consider only the above two cases, equations 6.9 and 6.10, out of the total six cases. Now, if equation 6.9 is satisfied for the jth pyramidal cell, the total input to the jth cell in the limit $m \to \infty$ is

$$\begin{aligned}
J_j &= \lim_{m \to \infty} \frac{1}{m}\sum_{i=1}^{m}\sigma\bigl(w_{ji}I_i - (-\alpha x_j + \beta y)\bigr) \\
&= I_0 \lim_{m \to \infty} \frac{1}{m}\sum_{i=1}^{m}\sigma\left(w_{ji}\frac{I_i}{I_0} - \frac{-\alpha x_j + \beta y}{I_0}\right) \\
&= I_0 \int_{(-\alpha x_j + \beta y)/I_0}^{\phi}\left(x - \frac{-\alpha x_j + \beta y}{I_0}\right) p_{\text{input}}(x; s)\,dx \\
&= I_0 \int_{(-\alpha x_j + \beta y)/I_0}^{\phi}\left(x - \frac{-\alpha x_j + \beta y}{I_0}\right) \frac{d f_{\text{input}}(x; s)}{dx}\,dx \\
&= I_0\left[\phi - \frac{-\alpha x_j + \beta y}{I_0} - \int_{(-\alpha x_j + \beta y)/I_0}^{\phi} f_{\text{input}}(x; s)\,dx\right],
\end{aligned} \tag{6.11}$$
where

$$p_{\text{input}}(x; s) \equiv \begin{cases} p_{\text{win}}(x; s) & (j = k) \\ p_{\text{los}}(x; s) & (j \neq k) \end{cases} \quad \text{and} \quad f_{\text{input}}(x; s) \equiv \begin{cases} f_{\text{win}}(x; s) & (j = k) \\ f_{\text{los}}(x; s) & (j \neq k). \end{cases} \tag{6.12}$$
Thereby, the dynamics of the jth pyramidal cell (see equation 2.1) is

$$\frac{dx_j}{dt} = \frac{1}{\tau_p}\left[-x_j + I_0\left(\phi - \frac{-\alpha x_j + \beta y}{I_0} - \int_{(-\alpha x_j + \beta y)/I_0}^{\phi} f_{\text{input}}(x; s)\,dx\right)\right]. \tag{6.13}$$
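This mean-field drive is easy to check numerically against a finite-m average. The sketch below (our construction, reusing `rng` and NumPy from the earlier sketches) uses the closed form of $f_{\text{input}}$ for a completely random stimulus, $x - x\log x$, derived as equation 6.20 below, and compares the inner term of equation 6.13 with an average of $\sigma$ over $m = 900$ sampled branches:

```python
from scipy.integrate import quad

def f_input_s0(x):
    """Distribution function for a completely random stimulus (equation 6.20)."""
    return x - x * np.log(x) if x > 0 else 0.0

def mean_field_drive(x_j, y, I0, alpha=3.0, beta=1.0, phi=1.0):
    """The m -> infinity limit of J_j at s = 0 (inner term of equation 6.13),
    valid while condition 6.9 holds for (x_j, y)."""
    a = (-alpha * x_j + beta * y) / I0      # scaled effective threshold
    integral, _ = quad(f_input_s0, a, phi)
    return I0 * (phi - a - integral)

# Finite-m check: with an s = 0 stimulus, w_ji * I_i = I0 * w_ji * xi_i, so the
# branch average of sigma should approach the integral expression above.
x_j, y, I0 = 0.01, 0.05, 1.0                # satisfies alpha*x_j - beta*y <= 0
w_xi = rng.uniform(size=900) * rng.uniform(size=900)
empirical = np.mean(np.maximum(I0 * w_xi + 3.0 * x_j - 1.0 * y, 0.0))
print(empirical, mean_field_drive(x_j, y, I0))   # the two should roughly agree
```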
xj I0
and
y ≡
y , I0
(6.14)
then equation 6.13 can be rewritten as d x j
1 = dt τp
−x j
+ φ−
(−αx j
+ βy ) −
φ −αx j +βy
f input (x; s) dx
, (6.15)
which does not include the stimulus intensity I0 . Note that this scaling results from the assumption in equation 6.9 that every branch is either in the linear range or below the threshold but never saturated, and also that this scaling underlies the intensity-semi-invariant stimulus selectivity that the DLI model shows in the numerical simulations in section 4, as will be explained in section 7. Alternatively, if equation 6.10 is satisfied for the jth pyramidal cell, the total input of the jth cell is m 1 σ wji Ii − (−αx j + βy) m→∞ m
J j = lim
i=1
1 η m→∞ m m
= lim
i=1
=η
(6.16)
in the limit of m → ∞. Thereby the dynamics of the jth pyramidal cell is dxj 1 = (−x j + η). dt τp
(6.17)
Thus far, we have derived a general analytical formulation of the DLI model without assuming specific formulas of the distributions. For the comparison with the simulation results, hereafter we calculate a specific DLI model for the same distributions as used in the simulations in section 4: let us assume that both the distribution of the feedforward connection
1822
K. Morita, M. Okada, and K. Aihara
strength (wji ) (density: paff (x)) and the distribution of each component (ξi ) of the noise vector (density: pnoise (x)) are the uniform distributions on [ 0 1]. Our goal is to obtain analytical formulas for the distribution of wji II0i , that is, pinput (x) ( pwin (x) and plos (x)) and f input (x; s) ( f win (x; s) and f los (x; s)). Here, note that the distribution of wji II0i is bounded above by
φ =1×
I0 = 1. I0
(6.18)
First, let us consider the case in which the stimulus is completely random: the signal ratio s = 0. In this case, wji II0i = wji ξi ( j = 1, . . . , n) is a product of two independent uniform random numbers, and thus its density and distribution function are calculated as follows: pinput (x; 0) (= pwin (x; 0) = plos (x; 0)) = P wji Ii = I0 x Ii = P wji = x = − log x(0 ≤ x ≤ 1) I0 x f input (x; 0) = f win (x; 0) = f los (x; 0) = pinput (x; 0) dx
(6.19)
0
= x − x log x(0 ≤ x ≤ 1).
(6.20)
Figure 7a shows this analytically obtained pinput (x; 0) (equation 6.19, indicated by the black solid line), in comparison with the density calculated from a single simulation trial in which every pyramidal cell has 900 dendritic branches (i.e., m = 900) (indicated by the gray bars) as in our simulation in section 4. In the general case in which the signal ratio is positive (s > 0), analytical expressions of f win (x; s) and f los (x; s) can also be obtained (see appendix A). Figures 7b to 7d show the probability densities pwin (x; s) = ddx f win (x; s) (the thick solid line in Figure 7b) and plos (x; s) = ddx f los (x; s) (the thick dotted line in Figure 7b) obtained by differentiating the analytically formulated distribution functions (see appendix A) when the signal ratio is s = 0.4, in comparison with the densities calculated from single simulation trials with m = 900 (the gray bars in Figures 7c and 7d, respectively). As shown in Figure 7b, pwin (x; s) is larger than plos (x; s) when x is large, while plos (x; s) is larger than pwin (x; s) when x is small, indicating that the dendritic branches of the pyramidal cell corresponding to the stimulus tend to receive larger feedforward excitations than branches belonging to noncorresponding pyramidal cells.
Selectivity and Stability via Dendritic Nonlinearity
(a)
0
(b)
1
(c)
0
1823
0
1
(d)
1
0
1
Figure 7: The probability densities of the feedforward excitation on each dendritic branch (wji II0i ) of the DLI model. (a) The solid line is the analytically obtained probability density on each dendritic branch of every pyramidal cell ( pinput (x; 0): equation 6.19) when the signal ratio is s = 0. The gray bars indicate the corresponding probability density calculated from a single simulation trial in which every pyramidal cell has 900 dendritic branches (i.e., m = 900). (b) The thick solid line and the thick dotted line, respectively, indicate the analytically obtained probability densities on each dendritic branch of either the pyramidal cell that is corresponding to the stimulus (thick solid line: pwin (x; s)) or the cell that is not corresponding to it (thick dotted line: plos (x; s)) when the signal ratio is s = 0.4. The thin dashed line indicates the density when s = 0 ( pinput (x; 0)) for reference. (c) The solid line indicates the analytically obtained pwin (x; s), and the gray bars indicate the corresponding probability density calculated from a single simulation trial in which every pyramidal cell has 900 dendritic branches (i.e., m = 900). (d) The solid line indicates the analytically obtained plos (x; s), and the gray bars indicate the corresponding probability density calculated from a single simulation trial in which every pyramidal cell has 900 dendritic branches (i.e., m = 900).
1824
K. Morita, M. Okada, and K. Aihara
7 Analysis of the Local Dynamics In this section, we examine whether the analytically formulated DLI model in the limit of infinite number of dendritic branches per every cell (m → ∞), obtained in the previous section, has stable equilibrium points corresponding to the BLA state and the WTA-like state observed in the stimulations in section 4. We consider the DLI model under the presence of the stimulus I = I0 (swk + (1 − s)ξ ), which represents either a random stimulus (in the s = 0 case) or a noise-distorted preferred stimulus for the kth pyramidal cell with a signal ratio s(> 0). Let us assume that there is an equilibrium point corresponding to the BLA state of the form xk = I0 xw ,
x j = I0 xl ( j = 1, . . . , n, j = k),
y = I0
γ 1 xw + 1 − γ xl , n n
where xw , xl > 0,
(7.1)
and another equilibrium point corresponding to the WTA-like state of the form xk = η,
x j = I0 xh ( j = 1, . . . , n, j = k),
where xh > 0.
y=
γ 1 η + I0 1 − γ xh , n n (7.2)
Here all the activities of the pyramidal cells except for the kth cell are assumed to be equal in these equilibriums due to symmetry. We also assume that equilibrium activities of the pyramidal cells, except for the saturated case, should be proportional to the stimulus intensity I0 . Note that the equilibrium activity of the non-FS cell (y) is determined so as to satisfy the steady-state equation of the DLI model for the non-FS cell, equation 2.2. For the hypothetical BLA equilibrium, equation 7.1, it can be proven in the limit of n → ∞ (i.e., an infinite number of the pyramidal cells; see appendix B) that the condition equation 6.9, saying that some dendritic branches are in the linear regime (above the threshold but not saturated) while the rest are below the threshold, is satisfied for an arbitrary pyramidal cell including the kth cell around that equilibrium point under the assumption 0 < xl ≤
1 βγ − α
and 1 ≤
βγ xw < , xl α
(7.3)
which intuitively means that xl is sufficiently small and the xw /xl ratio is not too large. Therefore, the dynamics of an arbitrary pyramidal cell’s activity around the hypothetical BLA equilibrium point, equation 7.1, is
Selectivity and Stability via Dendritic Nonlinearity
1825
described as equation 6.13, and thus its scaled activity is described by equation 6.15: d x j
1 = dt τp
−x j
+ 1−
(−αx j
for j = 1, . . . , n, where f input (x; s) ≡
+ βy ) −
1
−αx j +βy
f win (x; s) ( j = k) . f los (x; s) ( j = k)
f input (x; s) dx
(7.4)
Specifically, when the stimulus is completely random (s = 0), the above equation can further be calculated by substituting f input (x; 0) = x − x log x, equation 6.20, into equation 6.15: 1 1 (x − x log x) dx −x j + 1 − (−αx j + βy ) − dt τp −αx j +βy for j = 1, . . . , n, 1 1 = −x j + 1 − 4ζ j + 3ζ j2 − 2ζ j2 log ζ j τp 4
d x j
=
(7.5)
where ζ j ≡ −αx j + βy .
(7.6)
On the other hand, for the hypothetical WTA-like equilibrium, equation 7.2, it can be proven in the limit of n → ∞ (see appendix C) that the condition 6.10, saying that all the dendritic branches are saturated, is satisfied for the kth cell while condition 6.9, saying that some dendritic branches are in the linear regime (above the threshold but not saturated) while the rest are below the threshold, is satisfied for all the other pyramidal cells around that equilibrium point under the assumption 1 α−1 0 < xh < min , , βγ − α βγ
(7.7)
which intuitively means that xh is sufficiently small. Therefore, the dynamics of the kth pyramidal cell’s activity is described as equation 6.17, whereas the one of the other cells’ activities are described as equation 6.13 around the hypothetical WTA-like equilibrium point, equation 7.2: 1 d xk = (−xk + η) , dt τp
(7.8)
1826
K. Morita, M. Okada, and K. Aihara
dxj 1 = dt τp
−x j + I0
−αx j + βy − 1− I0
1
(−αx j +βy)/I0
f los (x; s) dx
( j = 1, . . . , n, j = k).
(7.9)
Now let us consider steady-state equations obtained by setting time derivatives of the above equations to 0. At first, when the stimulus is completely random, s = 0, the steady-state equation for the hypothetical BLA state can be obtained by setting the time derivatives to 0 in equation 7.5: 1 F B L A(x∗ ) ≡ −x∗ + (1 − 4(βγ − α)x∗ + 3 ((βγ − α)x∗ )2 4 − 2 ((βγ − α)x∗ )2 log ((βγ − α)x∗ )) = 0, here x∗ (= xw = xl ).
(7.10)
Under the assumption α < βγ , equation 3.6, since lim F B L A(z) =
z→0+0
1 > 0, 4
FB L A
1 βγ − α
=−
1 < 0, and βγ − α
(7.11)
d F B L A(z) = −1 − (βγ − α) + (βγ − α)2 z − (βγ − α)2 z log ((βγ − α)z) dz 1 (7.12) <0 0 < z < βγ − α are satisfied, equation 7.10 has a unique solution in the range of equation 7.3. Therefore, it is proven that the DLI model with m → ∞ receiving a random (s = 0) stimulus indeed has an equilibrium point corresponding to the BLA state observed in the simulation. Hereafter, we call this equilibrium point the BLA equilibrium point. The activity of the pyramidal cells at the BLA equilibrium point, that is, I0 x∗ calculated by numerically solving equation 7.10, shows close agreement with the simulation results (see Figure 8a). In general cases where the signal ratio s is positive, the steady-state equations for hypothetical BLA states and WTA-like states can be solved using numerical integration and Newton’s method. Figure 9 shows such numerically calculated values of xw and xl at the BLA equilibrium point, indicated, respectively, by black and gray solid lines in the upper halves of the panels, and xh at the WTA-like equilibrium point, indicated by the black dashed line also in the upper halves of the panels, for three different parameter sets. As shown in this figure, as the signal ratio s increases from 0, xw gradually increases and xl and xh gradually decrease. Then, at a certain signal ratio, numerical calculation for the BLA equilibrium point (xw and xl ) does not converge, implying that this equilibrium point disappears at this signal ratio (this will be confirmed by local stability analysis, described below).
Selectivity and Stability via Dendritic Nonlinearity
1827
On the other hand, numerical calculations suggest that the WTA-like equilibrium point persists regardless of the signal ratio, even when s = 0. Error bars in the upper halves of the panels in Figure 9 indicate the simulation results: average values of xw , xl , and xh with standard variations that were obtained in the same manner as that described in section 4. As shown in the figure, the values calculated numerically from the analytical formulation of the DLI model show close agreement with the simulation results. Next, let us consider the local stability of the BLA- and WTA-like equilibrium points. Similar to the locations of the equilibrium points, the characteristic multipliers at these points can also be calculated numerically (see appendixes D and E). Figure 8b shows the largest real parts of the multipliers at the BLA equilibrium point (solid line) and the WTA-like equilibrium point (dashed line) when the stimulus is completely random—s = 0 for different values of the strength of dendritic lateral inhibition (βγ ; x-axis in Figure 8b) at a fixed value of α = 3; this value has been used for the simulations mentioned in section 4. As shown in the figure, when βγ is sufficiently strong, the largest real parts of the multipliers at both equilibrium points are negative; therefore, these equilibrium points are locally asymptotically stable. On the other hand, the local stabilities are lost when the inhibition weakens. Figure 8c shows the parameter region in which the BLA- and WTA-like equilibrium points are locally asymptotically stable when the stimulus is completely random: s = 0. As shown in the figure, the region in which the BLA equilibrium point is locally stable (above the solid line in Figure 8c) is contained within the region in which the WTA-like equilibrium point is locally stable (above the dashed line in Figure 8c). In cases in which both equilibrium points are locally stable, whether the model actually converges to the BLA state appears to depend on the degree or basin of attraction. The simulation results of the DLI model, which contained 100 pyramidal cells, with respect to its convergence from the neighbor of the quiescent state to the BLA state are shown in Figure 8c, where small circles and crosses show convergence and no convergence, respectively. As shown in the figure, although the local stability of the BLA equilibrium point does not necessarily ensure that the model converges to the BLA state, when the inhibition is sufficiently strong, the model in fact converges to the BLA state. Here, a noteworthy point is that as long as assumption equation 3.4 is satisfied, these local stabilities of the BLA- and WTA-like equilibrium points are independent of the stimulus intensity I0 . For the BLA equilibrium point, the local stability is attributable to the scale invariance of the dynamics around it, as discussed in equations 6.14, 6.15, and 7.1. For the WTA-like equilibrium point, the dynamics around it is not scale invariant, as shown in equations 7.2, 7.8, and 7.9. Thus, the steady-state activities of the loser pyramidal neurons and the non-FS cell at the WTA-like state are not scale invariant. However, since the difference in the dynamics is only a constant, its time derivative is scale invariant. Therefore, the local stability of the WTA-like equilibrium point is also scale invariant. These
1828
K. Morita, M. Okada, and K. Aihara
scale invariances of the local stabilities of the BLA and WTA-like equilibrium points are considered to underlie the intensity semi-invariance of the stimulus-selective sustained WTA response of the DLI model. Let us examine how the local stabilities of the equilibrium points change with the signal ratio s for the model with fixed parameters. Dotted and dashed lines in the lower halves of the panels in Figure 9 indicate the largest real parts of the multipliers at the BLA- and WTA-like equilibrium points, respectively, for three different sets of parameters. As shown in the figure, the largest real part of the multipliers at the WTA-like equilibrium point is always shown as negative regardless of the value of the signal ratio s (dashed lines in the lower halves of the panels in Figure 9), indicating that the WTA-like equilibrium point is always locally stable. Regarding the BLA equilibrium point, when the signal ratio s is small, the largest real part of the characteristic multipliers at the BLA equilibrium point (dotted lines in the lower halves of the panels in Figure 9) is negative, indicating that the BLA equilibrium point is asymptotically stable. However, as the signal ratio s increases the largest value of the real parts of the multipliers gradually increases, and finally reaches zero at a certain value of s, as shown in Figure 9. This indicates that a saddle-node bifurcation occurs at this point, and the stable BLA equilibrium point coalesces with the saddle point. The value of s at this bifurcation, referred to as the bifurcation parameter, is calculated to be s = 0.185, . . . for (α, β, γ ) = (3, 1, 20) (see Figure 9a). Notably, this value correlates with the critical signal ratio observed during the simulation (see Figure 3b), suggesting that the transition of the DLI model from the BLA state to the WTA-like state occurs due to this bifurcation.
Figure 8: Locations and local stabilities of the BLA and the WTA-like equilibrium points in the DLI model under the presentation of a random stimulus (s = 0). (a) The equilibrium activities of the pyramidal cells at the BLA equilibrium point. The horizontal axis indicates the stimulus intensity (I0 ). Error bars indicate the means and standard variations of the simulations, and solid lines indicate the value obtained by numerical calculation of equation 7.10. (b) The largest value among all the multipliers’ real parts of the Jacobi matrix at the BLA equilibrium point (solid line) and the WTA-like equilibrium point (dashed line). The strength of self-excitation is fixed to α = 3, whereas the strength of dendritic lateral inhibition (βγ ) is varied (horizontal axis). (c) The parameter region in which the BLA or the WTA-like equilibrium point is locally stable. The horizontal axis indicates the strength of self-excitation (α), while the vertical axis indicates the strength of dendritic lateral inhibition (βγ ). The BLA or the WTA-like equilibrium point is locally stable above the solid or dashed line, respectively. Small circles and crosses show the simulation results that the DLI model, including 100 pyramidal cells with the corresponding condition (α and βγ ) starting from a neighbor of the quiescent state, converged to the BLA state or not, respectively.
Selectivity and Stability via Dendritic Nonlinearity
1829
(a) Activities of the pyramidal cells at the BLA state in the s=0 case
0.03
0.02
0.01
0
0
0.5
1
Magnitude of the stimulus (I0)
Largest real part of the multipliers at the BLA state
(b)
0
5
10
15
20
25
(c) 30 25 20 15 10 5 1
2
3
4
1830
K. Morita, M. Okada, and K. Aihara
8 Reduction to Low-Dimensional System To further clarify dynamical natures of the DLI model, in this section we try to perform reduction of the system’s dimension by mean-field approximation. Let us consider again the DLI model under the presence of the stimulus corresponding to the kth pyramidal cell with a signal ratio s (I = I0 (swk + (1 − s)ξ )). In the limit of an infinite number of dendritic branches in every cell (m → ∞), the dynamics of the pyramidal cells’ scaled activities, that is, the activities divided by the stimulus intensity I0 , around the BLA state are described as (see equation 6.15) d x j
1 = dt τp
−x j
+ φ−
(−αx j
for j = 1, . . . , n,
where f input (x; s) ≡
+ βy ) −
φ
−αx j +βy
f input (x; s) dx
f win (x; s) ( j = k) . f los (x; s) ( j = k)
(8.1)
Setting j = k, we obtain the equation that describes the dynamics of the scaled activity of the kth pyramidal cell that corresponds to the stimulus and thus is called the potential winner: d xk 1 = dt τp
−xk
+ φ−
(−αxk
+ βy ) −
φ −αxk +βy
f win (x; s) dx
,
(8.2)
Figure 9: Comparison of the analytical formulation of the DLI model in the limit of m → ∞ with the simulation results. The horizontal axis indicates the signal ratio s. (Upper halves of the panels) Black and gray solid lines indicate numerically calculated values of xw (activity of the kth pyramidal cell corresponding to the stimulus) and xl (activity of the other pyramidal cells) at the BLA equilibrium point from the analytical formula, respectively; black dashed lines indicate xl (activity of the pyramidal cells other than the kth cell) at the WTA-like equilibrium point, for three different parameter sets: (α, β, γ ) = (3, 1, 20) in a , (α, β, γ ) = (3, 2, 20) in b, and (α, β, γ ) = (1.5, 0.5, 10) in c. Error bars indicate means and standard variations of the corresponding values from the simulation. The assumptions, equations 7.20 and 7.22, are calculated to be (xl < 171 , xxwl < 203 , xl < 171 ) for a , (xl < 371 , xxwl < 403 , xl < 371 ) for b, and (xl < 27 , xxwl < 103 , xl < 101 ) for c. Note that all of these assumptions are satisfied for these three parameter sets regardless of the value of the signal ratio s. (Lower halves of the panels) Dotted lines and dashed lines indicate the largest real part of the characteristic multipliers at the BLA equilibrium point and the WTA-like equilibrium point, respectively.
Selectivity and Stability via Dendritic Nonlinearity
1831
(a)
0.1 0.05 0
0
0.2
0.4 0.6 Signal ratio
0.8
1
0.2
0.4 0.6 Signal ratio
0.8
1
0.2
0.4 0.6 Signal ratio
0.8
1
(b)
0.1 0.05 0
0
(c)
0.1 0.05 0
0
1832
K. Morita, M. Okada, and K. Aihara
To obtain a reduced description of the other cells’ dynamics, we introduce a variable representing the mean scaled activity, that is, the mean of the activities divided by the stimulus intensity I0 of the pyramidal cells whose afferents have no correlation with the stimulus, which we call potential losers: x¯ =
n 1 x j . n−1
(8.3)
j=1, j =k
Averaging (see equation 8.1) for j = 1, . . . , n ( j = k) is then described as φ n 1 d x¯ 1 f los (x; s) dx . −x¯ + φ − (−α x¯ + βy) − = dt τp n−1 −αx j +βy j=1, j =k
(8.4) If here we take a mean-field approximation, φ n 1 f los (x; s) dx n−1 −αx j +βy ≈
j=1, j =k
φ 1 n−1
n
j=1, j =k
(−αxj +βy )
f los (x; s) dx =
φ
f los (x; s) dx,
−α x¯ +βy
(8.5)
equation 8.4 eventually becomes 1 d x¯ = dt τp
−x¯ + φ − (−α x¯ + βy ) −
φ −α x¯ +βy
f los (x; s) dx
,
(8.6)
which describes the dynamics of the mean activity of the potential losers. As the number of the pyramidal cells increases, that is, n → ∞, the remaining equation, 2.2, that describes the dynamics of the non-FS cell asymptotically becomes n dy 1 γ −y + = xj ’ dt τG n j=1
(n→∞)
−→
1 dy = −y + γ x¯ . dt τG
(8.7)
Therefore, equations 8.2, 8.6, and 8.7 constitute a three-dimensional autonomous dynamical system describing the scaled activity of the potential winner (xk ), the mean scaled activity of the potential losers (x¯ ), and the scaled activity of the non-FS cell (y ) in the limit of a large number of pyramidal cells. Furthermore, as indicated in equations 8.6 and 8.7, the
Selectivity and Stability via Dendritic Nonlinearity
1833
dynamics of x¯ (potential losers) and y (non-FS cell) does not depend on xk (potential winner), and thus constitutes a two-dimensional autonomous system. Figure 10a shows the trajectories on the x¯ − y phase planes with fixed connection strengths and various signal ratios. As shown in the figure, this system shows damping oscillation regardless of the signal ratio and the location of the stable equilibrium point, which we denote (x¯ , y ) = (xl , γ xl ),
(8.8)
does not largely shift along with the change of the signal ratio. Therefore, the dynamics of the DLI model can be essentially understood by analyzing the single equation, 8.2, for the potential winner. Substituting equation 8.8 into 8.2 results in dxk 1 = dt τp ⇔
−xk
+ φ−
(−αxk
+ βγ xl ) −
φ
−αxk +βγ xl
f win (x; s) dx
dxk 1 −xk + F (xk ; s, xl ) , = dt τp
(8.9)
where F (xk ; s, xl )
=
αxk
− βγ xl + φ −
φ
(−αxk +βγ xl )
f win (x; s) dx
(8.10)
represents the total input, consisting of the self-excitation, dendritic lateral inhibition, and the input from the stimulus, to the potential winner kth pyramidal cell. Figure 10b shows the dependence of this total input F (xk ; s, xl ) on xk , the scaled activity of the very kth cell, with three different values of the signal ratio. When the signal ratio is s = 0 where the stimulus is completely random (the top panel of Figure 10b), the thick solid line representing the total input F (xk ; s, xl ) significantly convexes downward and crosses the diagonal (the dashed line with a unit slope) at two different points. The left crossing point appears to be stable from this figure and thus corresponds to the BLA equilibrium point. The convexity of F (xk ; s, xl ) can be qualitatively explained as follows: when the activity of the kth cell is small, most of its dendritic branches are inactivated (i.e., below the threshold; see Figure 5c) so that positive feedback by self-excitation does not occur, which appears as a fact that the slope of F (xk ; s, xl ) is smaller than 1 when xk is small in the top panel of Figure 10b. As the activity of the kth cell increases, more dendritic branches escape inactivation (see Figure 5d), and thus the slope of F (xk ; s, xl ) increases so as to be convex downward. As the signal ratio increases, the total input to the kth cell increases so that F (xk ; s, xl ) shifts upward. And at the critical value of the signal
1834
K. Morita, M. Okada, and K. Aihara
Figure 10: Dynamics of the reduced three-dimensional system of the DLI model. (a) The example trajectories of the phase point. The horizontal axis (x¯ ) indicates the mean scaled activity of the potential losers, that is, the pyramidal cells except for the kth cell, which corresponds to the stimulus. The vertical axis (y ) indicates the scaled activity of the non-FS cell. The connection strengths are set to (α, β, γ ) = (3, 1, 20). The signal ratio is s = 0 (top panel), s = 0.17 (middle panel), and s = 0.25 (bottom panel). (b) Thick solid lines indicate the dependence of the total input to the potential winner kth pyramidal cell on the scaled activity of the very kth cell (xk : the horizontal axis). The parameters are the same as in a .
ratio (the second panel of Figure 10b), that is, s ≈ 0.17 for the connection strengths used in this figure, F (xk ; s, xl ) tangentially contacts with the diagonal (the dashed line) so that a saddle-node bifurcation occurs. If the signal ratio is greater than this critical value, F (xk ; s, xl ) does not cross with the
Selectivity and Stability via Dendritic Nonlinearity
1835
diagonal (the bottom panel of Figure 10b), indicating that there is no stable equilibrium point corresponding to the BLA state. Note that this value s ≈ 0.17 calculated from the reduced model matches the value s = 0.185 . . . calculated from the full model in section 7. The difference between these two values is considered to come from the mean-field approximation (see equation 8.5) and the limit operation n → ∞. 9 Discussion Inspired by recent neurobiological findings, we have constructed a network model of pyramidal cells and a non-FS cell that mediates dendritic recurrent inhibition. Numerical simulations have shown that the dendritic lateral inhibition (DLI) model shows a stimulus-selective sustained winner-take-all (WTA) response in an intensity-semi-invariant manner (i.e., for an arbitrary stimulus intensity below a certain upper limit even up to an infinitesimal level). Interestingly, the mechanism of stimulus selectivity of the DLI model appears to be distinct from that of the conventional model in that the DLI model converges to a novel type of steady state when a nonspecific stimulus is presented; this state is called the balanced low-level activity (BLA) state wherein all the pyramidal cells share small but nonzero activities. When the signal-to-noise ratio of the stimulus is low, inhibition surpasses excitation in most branches, and the output of such a branch is zero due to rectification. Therefore, small perturbation to such a branch does not have any effect. Consequently, positive feedback does not occur, and the system is stably maintained in the BLA state. A portion of the dendritic branches of the corresponding cell that has escaped from rectification increases with the increase in the signal-to-noise ratio. Finally, positive feedback starts at a certain critical signal ratio, and the system converges to the WTA-like state instead of the BLA state. Mathematical analysis of the model in the limit of an infinite number of dendritic branches per pyramidal cell has confirmed the existence and stability of the equilibrium points corresponding to the BLA and WTA-like states. Moreover, we have proven that the stabilities of those equilibrium points depend on the signal-to-noise ratio but not on the stimulus intensity. The functional novelty of the DLI model is that its stimulus selectivity has two properties simultaneously: (1) it is sustainable after the extinction of the stimulus, and (2) it is preserved for a wide range of stimulus intensity. Achieving self-sustainability and intensity invariance simultaneously, as in the DLI model, appears to be difficult in the conventional competitive neural networks with a similar degree of complexity in their network architecture (see below). Thus, it can be referred to as an emergent property of the network that includes the dendritic nonlinear operation. In the following sections, we first examine the dependence of the DLI model behavior on the values of the parameters, particularly the strength of dendritic lateral inhibition, self-excitation, and the threshold of the dendritic transfer function (see section 9.1). Next, we compare the DLI
1836
K. Morita, M. Okada, and K. Aihara
model with other existing conventional lateral inhibition models, particularly with a model that includes feedforward inhibition that normalizes the stimulus intensity (see section 9.2). Following this, we discuss the similarity and differences between the DLI model and the existing neural network models implementing dendritic local computations (see section 9.3). Finally, the possible relationships between the formal dendritic nonlinearity that we implemented in the DLI model and the real biological dendritic computations are discussed (see section 9.4). 9.1 Dependence of the DLI Model Behavior on the Values of the Parameters. As shown in the simulations in section 4 and by the analysis in section 7, the stimulus-selective sustained WTA response of the DLI model is intensity invariant when the stimulus intensity is decreased, even up to the infinitesimal level. When the intensity is increased, we observed
Figure 11: Dependence of the DLI model behavior on the values of the parameters and comparison of the DLI model with other models. (a) Maximum permitted stimulus intensity for the DLI model when the strength of self-excitation (α) and dendritic inhibition (γ ) are varied. Other values of the parameters are the same as those used in Figures 3 and 4. (b) Permitted range of the stimulus intensity and critical signal ratio for the DLI model when the threshold (θ) is varied from the original value (θ = 0). The y-axis represents the stimulus intensity (I0 ), while the x-axis represents the threshold (θ ) for each model. Circles or crosses indicate that the model succeeded (in more than 95% of 50 trials) or failed, respectively, to show appropriate stimulus-selective sustained WTA responses for the corresponding conditions. The contrast (brightness) of the circles indicates the critical signal ratios at which the model converged to the appropriate WTA state in half of the trials and converged to the quiescent state in the other half of the trials after the extinction of the stimulus (see the gray graduation bar for reference). The values of the parameters are the same as those used in Figures 3 and 4, except for the threshold (θ ) that was tested at θ = 0, 0.05, . . . , 0.25 (x-axis). (c) Permitted range of the stimulus intensity and the critical signal ratio for the conventional model with feedforward inhibition. The details of the model are described in appendix F. The values of the parameters are the same as that used in Figures 2 and 3, except for threshold (θ ) that was tested at θ = 0.05, . . . , 0.5 (x-axis) and the weight of inhibition from the feedforward inhibitory neuron to the pyramidal cells (δ = 0.5), which was not included in the previous model. The initial activity of the feedforward inhibitory neuron was set at zero. (d) Permitted range of the stimulus intensity and the critical signal ratio for the divisive version of the DLI model. The details of the model are described in appendix G. The values of the parameters are the same as those used in the original DLI model, except for the values of the strength of divisive inhibition by the pooled non-FS cell on each dendritic branch of every pyramidal cell (ε = 1) and the threshold (θ) that was tested at θ = 0.15, 0.2, . . . , 0.35 (x-axis).
Selectivity and Stability via Dendritic Nonlinearity (b) 50
10
5
0 0
20 10 5
Stimulus intensity (I0)
Maximum permitted intensity (I0)
(a)
1837
2 1 0.5 0.2 0.1
0.01 10
20
30
40
50
0
60
0.1
0
0.2
0.2
0.4
0.6
0.8
1
Critical signal ratio (s)
(c)
Conventional model with feed-forward inhibition
(d) 50
20 10 5
Stimulus intensity (I0)
Stimulus intensity (I0)
50
2 1 0.5 0.2 0.1
0.01 0
0.1
0.2
0.3
0.4
0.5
Divisive version of the DLI model
20 10 5 2 1 0.5 0.2 0.1
0.01 0
0.1
0.2
0.3
0.4
that the DLI model does not behave appropriately if the stimulus intensity surpasses a certain upper limit, as shown in Figures 4f and 4l. Such a failure for a stimulus that is too strong appears to stem from the fact that the WTAlike equilibrium point disappears as the stimulus intensity increases, while the upper bound of the transfer function (η) is fixed. Indeed, the maximum permitted stimulus intensity is almost linearly correlated with η, for example, if η is doubled, the maximum intensity is also approximately doubled (data not shown). Figure 11a shows how the maximum permitted stimulus intensity (y-axis in Figure 11a) varies according to the strength of dendritic lateral inhibition (γ : x-axis in Figure 11a) and self-excitation (α: three lines). As shown in this figure, the maximum permitted intensity increases with the increase in self-excitation or the decrease in lateral inhibition. Next, let us consider the threshold of the dendritic transfer function (θ ). Thus far, we have set this value at θ = 0 for the DLI model (see equation 2.3). Figure 11b shows how the DLI model behavior changes when the threshold θ is changed. Circles or crosses indicate the success or failure, respectively, of the model in showing appropriate stimulus-selective sustained WTA
1838
K. Morita, M. Okada, and K. Aihara
responses for the corresponding conditions. The contrast (brightness) of the circles indicates the critical signal ratio at which the model converged to the appropriate WTA state in half of the trials and converged to the quiescent state in the other half of the trials after the extinction of the stimulus (see the gray graduation bar for reference). As shown in Figure 11b, although the exact intensity semi-invariance of stimulus selectivity (i.e., up to the infinitesimal level) could be achieved only when the threshold was zero, the DLI model could show stimulus selectivity for a considerably wide range of stimulus intensity when the threshold value was greater than zero. 9.2 Comparison of the DLI Model with Other Competitive Neural Network Models of Stimulus Selectivity. Since competitive neural networks have been proposed as a model of cortical circuits (Wilson & Cowan, 1973; Amari, 1977), they have been intensely studied for modeling real biological neural circuits as well as for creating artificial algorithms or devices. One of the representative examples of biological studies is the so-called feedback model of sensory processing in the sensory cortex, particularly for visual feature selectivity in the primary visual cortex (V1) (Ben-Yishai et al., 1995; Somers et al., 1995; Hansel & Sompolinsky, 1996). Such models have succeeded in realizing contrast invariant selectively, which is a well-known property of visual processing in V1. Two features distinguish our DLI model from these feedback visual models. First, we analyzed the DLI model in a regime where the stimulus is applied only transiently, and the system sustains activities after the extinction of the stimulus by strong self-excitation (see section 3). That is, the stimulus-selective sustained responses in the DLI model are based on the multistability of the model. On the other hand, feedback visual models are normally considered only during the stimulus presentation condition; therefore, feature selectivity is not maintained solely by the system’s multistability. Amari (1977) studied a model that shows stimulus-selective multistability; however, a state similar to our BLA state was not observed. Second, our DLI model considers discrete neural elements, whereas the feedback visual models and Amari’s model deal with a continuous neural field. Examining the manner in which our DLI model can be extended to a neural field would be an interesting future theme. Another major stream of research in the biological competitive neural networks is models of stimulus-evoked short-term memory or decision making coupled with the short-term memory (Amari, 1977; Amit & Brunel, 1997; Wang, 2002; Brunel, 2003; Machens et al., 2005; Wong & Wang, 2006). In contrast to the DLI model, these models generally assume multiple populations, those specific to one of the stimuli and the nonspecific population, with the connections within each specific population strengthened by Hebbian plasticity. Considering the effect of plasticity on the DLI model would also be an interesting future theme (cf. Spratling & Johnson, 2002, and see section 9.3). Unlike the DLI model, these short-term-memory models generally do not show a wide range of intensity invariance.
Selectivity and Stability via Dendritic Nonlinearity
1839
The conventional model that shows sufficient structural similarity with our DLI model, facilitating direct comparison, is the WTA neural network model introduced in section 2 and examined in section 4. Such WTA lateral inhibition models with threshold linear transfer functions have been extensively studied (Hahnloser, 1998; Hahnloser, Sarpeshkar, Mahowald, Douglas, & Seung, 2000; Wersing, Steil, & Ritter, 2001; Wersing, Beyn, & Ritter, 2001; Xie et al., 2002; Hahnloser, Seung, & Slotine, 2003). When such a model is considered in a regime where recurrent excitation is not sufficiently strong to maintain activity after the extinction of the stimulus, it can show stimulus selectivity for a wide range of intensity. On the other hand, in a regime where recurrent excitation is sufficiently strong to sustain the activity by the system’s multistability after the extinction of the stimulus, which is the regime that we have assumed for the DLI model, the conventional lateral inhibition network with zero threshold (θ = 0) always converges to a certain WTA state (Xie et al., 2002); this state is sustained after the stimulus is turned off, regardless of the stimulus even if it is only a noise, thereby failing to always achieve correct stimulus selectivity. Given a positive threshold, the conventional model can show stimulus-selective sustained WTA response, as shown in section 4, but only for a narrow range of stimulus intensity (see Figure 3a). To enable the conventional lateral inhibition networks to acquire a stimulus-selective sustained response for a wider range of stimulus intensity, some additional mechanism for normalization is required. One possibility would be using a variable or adaptive threshold, that is, dynamically tuning the threshold (θ ) of the transfer function (see equation 2.3) according to the stimulus intensity. Although it might be difficult to biologically implement such an adaptive threshold in its naive form, there is an alternative and biologically more plausible method: the feedforward inhibition can be added for effectively normalizing the net feedforward input to the pyramidal cells. It has been shown that feedforward inhibition, assumed to be divisive, could indeed normalize the input intensity and achieve contrast invariance in the model of visual feature selectivity (Heeger, 1992; Carandini & Heeger, 1994). However, such divisive feedforward inhibition cannot be applied to the regime where the activity of the winner neuron is sustained due to strong self-excitation after extinction of the stimulus (cf. equation 3.5), which we consider in this study in section 3. Now assume that feedforward inhibition has a divisive effect on all the inputs to dendritic branches: both the self-excitation and the signals from the input layer. Then, as the stimulus intensity increases, the feedforward inhibition becomes stronger, and the effective coefficient of self-excitation decreases. Hence, the model loses its ability to show sustained activity after the extinction of the stimulus. In this manner, the intensity invariance by the divisive feedforward inhibition and self-sustainability of the winner neuron’s activity cannot hold true simultaneously in the conventional model. If feedforward inhibition has an effect only on the signals from the input layer
1840
K. Morita, M. Okada, and K. Aihara
but not on self-excitation, although this would be biologically implausible, the conventional model could acquire a wide range of intensity invariance. However, in that case, the threshold (θ ) would need to be finely tuned, contrary to that in our DLI model, where fine tuning of θ was not required, as shown in Figure 11b. Contrary to the divisive feedforward inhibition, subtractive feedforward inhibition does not affect the strength (coefficient) of the self-excitation, and thus can be added to the conventional lateral inhibition model within the regime, where activity can sustain after the extinction of the stimulus. Figure 11c shows how a conventional model with such a subtractive feedforward inhibition (see appendix F for details of the model) behaves depending on the threshold (x-axis) and stimulus intensity (y-axis). As shown in this figure, the range of permitted stimulus intensity for this model is wider than that for the original model without feedforward inhibition (see Figure 3a for comparison), but is narrower than that for the DLI model (see Figure 11b). Moreover, the critical signal ratio (represented by the contrast of the circles in Figures 11b and 11c) is more sensitive to the stimulus intensity (y-axis) in the conventional model with feedforward inhibition (see Figure 11c) than in the DLI model (see Figure 11b); in Figure 11b, the contrast (brightness) of the circles is mostly similar, whereas the contrast (brightness) varies systematically with the change in the stimulus intensity in Figure 11c. In other words, stimulus selectivity of the DLI model is more robust in a sense that not only the stimulus selective response occurs but also the value of the critical signal ratio is maintained for a wider range of stimulus intensity than that in the conventional model with feedforward inhibition. While the range of permitted stimulus intensity would depend on the values of the parameter, the difference between the two models in the sensitivity of the critical signal ratio to stimulus intensity appears to hold true in general. 9.3 Related Network Models Implementing Dendritic Local Computations. To date, there have been a number of studies on dendritic computation (Mel, 1993, 1994; Rall & Agmon-Snir, 1998; Koch, 1998; Stuart, Spruston, & H¨ausser, 1999; H¨ausser & Mel, 2003). Among these studies, the functional relevance of dendritic inhibition and the effects of having multiple dendritic branches have been the major topics. It was proposed that single cortical pyramidal cells with branched dendrites implement two stages of nonlinear transfer functions: one on each dendritic branch and the other on somata. Thus, they can be regarded as a two-layer artificial “neural network” (Poirazi, Brannon, & Mel, 2003). In another study, it was shown that having multiple bistable dendritic branches could make a single neuron multistable, realizing robust graded responses (Goldman, Levine, Major, Tank, & Seung, 2003). Given that individual neurons have multiple dendritic branches and local dendritic inhibition operates on each of them, it appears possible that if they constitute a network, further rich
Selectivity and Stability via Dendritic Nonlinearity
1841
or specialized computations would emerge. However, such network-level effects or functions of dendrites remain largely unknown, although there have been several pioneering studies, as noted below. Network-level effects of dendritic computation were originally studied by Spratling and Johnson (2001, 2002) in a discrete-time neural network model. They have shown that if the Hebbian learning rule is applied to both feedforward excitatory and dendritic inhibitory connections, their network model that implements multiple dendritic branches and local dendritic inhibition achieves better performance for a certain type of pattern recognition than that achieved by the conventional models. Related modeling studies have suggested possible roles of dendritic local inhibition in attention (Spratling & Johnson, 2004) and feature binding (Domijan, 2003, 2004). Our DLI model is similar to that of Spratling and Johnson (2001, 2002) in its manner of describing dendritic local inhibition; however, it differs from their models in that we did not presume specific constraints regarding connection strengths, and we considered dendritic saturation. Furthermore, we considered continuous-time dynamics rather than incremental discretetime dynamics in order to perform mathematical analysis. Although we did not consider learning and plasticity in this study, the weights of the feedforward connections of the DLI model might be expected to be self-organized by a Hebbian rule. On the other hand, regarding the inhibitory connections, there has been little evidence on Hebbian learning at the GABAergic synapses. Instead, anti-Hebbian plasticity has been reported in some cases (Woodin, Ganguly, & Poo, 2003). We have assumed uniform strengths on the GABAergic synapses in the DLI model, which does not violate the biological plausibility. Our DLI model and the models by Spratling and Johnson (2001, 2002) include only the non-FS or dendritic-targeting type GABAergic cells, although the FS or soma/proximal dendrite-targeting cells actually coexist in the cerebral cortex. Nevertheless, this problem may be resolved as follows. First, the FS cells and the non-FS cells can be included in different circuitries at least in some parts of the neocortex. In the rat visual cortex, axons of the non-FS cells are mainly distributed vertically, while those of the FS cells are primarily confined to layer V, suggesting that the non-FS cells mainly mediate intracolumnar inhibition, and the FS cells primarily mediate intralaminar inhibition (Xiang, Huguenard, & Prince, 1998). If this is the case, the DLI model could be considered a feasible model of the cortical column, and, Second, the FS cells and the non-FS cells can be differentially regulated by neural modulators. It has been found that both acetylcholine (Xiang et al., 1998) and dopamine (Gao et al., 2003) modulate the FS-mediated somatic and non-FS-mediated dendritic inhibitions in opposite manners, that is, they depress the former and promote the latter inhibitions, suggesting that the neurotransmitters switch the two modes of inhibition. Therefore, under certain conditions, either of the two—the FS cells or the non-FS cells—will dominate the other.
1842
K. Morita, M. Okada, and K. Aihara
9.4 Computation Implemented by Dendritic Inhibition. In the DLI model, we have implemented the effects of the non-FS cell-mediated dendritic inhibition rather formally: these effects are represented by subtraction with rectification on each dendritic branch in equation 2.1. Here, let us discuss the validity of this assumption. It is known that GABAA receptormediated shunting inhibition basically has a divisive effect rather than a subtractive effect through the increase in conductance in the case of subthreshold membrane potential (Koch et al., 1983). If dendrites are not so active, and thus the membrane potential is mostly subthreshold, shunting inhibition to distal dendrites would be expected to have a divisive effect. This is different from what we have implemented in the DLI model—subtraction with rectification and saturation. In fact, even if dendritic subtraction in our original DLI model is replaced by dendritic division, the very feature of our model, that is, stimulus-selective sustained WTA responses for a wide range of stimulus intensity, can still hold true, at least to some extent, as shown in Figure 11d (for details of the model, see appendix G). The factor that is important to realize the BLA state is apparently that most branches, excluding only a small number, are below the threshold (i.e., rectified) to escape the involvement in the positive feedback mechanism, which otherwise occurs in order to cause a WTA-like state (cf. Figure 5). Contrary to the subtraction effect, the divisive effect cannot make the dendritic branch activity negative. Therefore, if we set the dendritic threshold (θ ) to exactly zero in the divisive version of the DLI model (see appendix G), the BLA state cannot be realized, and the model can therefore not show stimulus-selective sustained WTA responses (θ = 0: on the y-axis in Figure 11d). However, if we assume some small positive value for the dendritic threshold (θ ), branches whose activities are considerably decreased by the divisive effect will be subthreshold, thereby realizing a convergence to the BLA state as shown in Figure 11d. Dendritic lateral inhibition could realize stimulus-selective sustained WTA responses for a wide range of stimulus intensity even if it has a divisive effect, proving the robustness of the theory regarding the DLI model reported until this point. However, whether dendritic inhibition has a subtractive or divisive effect might depend on situations. A study by Holt and Koch (1997) showed that when a neuron keeps firing, somatic shunting inhibition actually has a subtractive effect rather than the divisive effect on the firing rate. It may suggest that if dendrites are quite active, with plenty of forward-propagating as well as backpropagating dendritic spikes, the effect of dendritic shunting inhibition on the generation rate of such dendritic spikes could also have a subtractive rather than a divisive effect. A recent study (Mehaffey, Doiron, Maler, & Turner, 2005) has shown using fish pyramidal cell that proximal dendritic inhibition reduces the neuronal gain by inhibiting backpropagating action potential and thereafter preventing depolarizing afterpotential, proving that proximal dendritic shunting inhibition would actually have a divisive effect on the somatic firing rate
Selectivity and Stability via Dendritic Nonlinearity
1843
when the dendrites are active. However, to our knowledge, no study has examined the effects of distal dendritic inhibition on the relationship between distal excitatory inputs and the frequency of dendritic or somatic spikes. We consider that there remains a possibility that given that the distal dendrites are sufficiently active to generate forward-propagating dendritic spikes in response to (distal dendritic) excitations, distal dendritic shunting inhibition could have a subtractive effect on the frequency of such forward-propagating dendritic spikes, in the manner in which somatic shunting inhibition has an effect on the somatic spiking frequency shown by Holt and Koch (1997). Although the relationship between the frequency of forward-propagating dendritic spikes on each branch and the frequency of the somatic action potentials is unclear, there appears to remain a possibility that the mechanisms described above could implement similar effects to our assumptions in the original DLI model. Appendix A: Formulas for the Distribution Functions in the Case with a Positive Signal Ratio Under the assumption that both the distribution of the feedforward connection strength (wji ) and the distribution of each component (ξi ) of the noise vector are the uniform distributions on [0 1], the distribution of the excitatory input from the ith component of the stimulus to the jth pyramidal cell (wji Ii ), that is, f win (x; s) for the pyramidal cell corresponding to the stimulus or f los (x; s) for other pyramidal cells, can be calculated, for the case in which the signal ratio is in 0 < s < 12 , as follows: f win (x; s) = x x s x x c − + log + c2 − log c 2(1 − s) 1 − s s 2(1 − s) 1−s s x s c − + c2 − log c 2(1 − s) 2(1 − s) 1−s
where c ≡
s−1+
(0 ≤ x ≤ s) (s < x ≤ 1)
,
(A.1) (1 − s)2 + 4sx , 2s
x2 x log(1 − s) x log s − − − 2s(1 − s) s 1−s s x log(1 − s) x log x f los (x; s) = − − − 2(1 − s) s 1−s 2 − 1 x x log x 1 + − 2s(1 − s) s(1 − s)
(A.2)
(0 ≤ x ≤ s) (s < x ≤ 1 − s). (A.3) (1 − s < x ≤ 1)
1844
K. Morita, M. Okada, and K. Aihara
Appendix B: Evaluation of the Value αx j − βy at the BLA Equilibrium Point In section 7, we used the fact that the condition (see equation 6.9): αx j − βy ≤ 0
and 0 ≤ I0 φ + αx j − βy ≤ η
⇔ −I0 φ ≤ αx j − βy ≤ min(0, η − I0 φ),
(B.1)
saying that some of dendritic branches are in the linear regime (above the threshold but not saturated) while the rest are below the threshold, is satisfied in the limit of n → ∞ (i.e., an infinite number of the pyramidal cells) for an arbitrary pyramidal cell, including the kth cell corresponding to the stimulus, around the BLA equilibrium point described as (see equation 7.1) xk = I0 xw ,
x j = I0 xl ( j = 1, . . . , n, j = k), y = I0
γ 1 xw + 1 − γ xl , n n
where xw , xl > 0
(B.2)
under the assumption (see equation 7.3): 0 < xl ≤ 1≤
1 βγ − α
(B.3)
βγ xw < , xl α
(B.4)
without a proof. We describe the proof below. Substituting equation 3.4 that says I0 < η and equation 6.16 that says φ = 1 into equation B.1 results in −I0 ≤ αx j − βy ≤ 0 ( ... min(0, η − I0 ) = 0)
for j = 1, . . . , n.
(B.5)
First, let us consider the kth cell. Substituting equation B.2 into αxk − βy yields γ 1 αxk − βy = I0 αxw − β xw + 1 − γ xl n n βγ (xw − xl ) . = I0 −(βγ xl − αxw ) − n xw βγ − αxl (n → ∞) → −I0 α xl
(B.6)
Therefore, the right inequality of equation B.5 asymptotically holds under the assumption B.4 in the limit of n → ∞. Moreover, since equation B.6 can
Selectivity and Stability via Dendritic Nonlinearity
1845
be evaluated under equation B.4, αxk − βy → −I0
βγ xw − α xl
αxl (n → ∞)
≥ −I0 (βγ − α) xl ,
(B.7)
the left inequality of equation B.5 asymptotically holds under the assumption of equation B.3 in the limit of n → ∞. Next, we consider the other cells. Substituting equation B.2 into αx j − βy ( j = k) yields γ 1 γ xl αx j − βy = I0 αxl − β xw + 1 − n n βγ (xw − xl ) = I0 −(βγ − α)xl − n → −I0 (βγ − α)xl (n → ∞).
(B.8)
Therefore, the right inequality of equation B.5 always holds, while the left inequality holds under the assumption of equation B.3, asymptotically in the limit of n → ∞. Therefore, the proof is complete. Appendix C: Evaluation of the Value αx j − βy at the WTA-like Equilibrium Point In section 7, we used the fact that condition equation 6.10, αx j − βy ≥ η,
(C.1)
saying that all the dendritic branches are saturated, is satisfied for the kth cell while the condition equation 6.9 αx j − βy ≤ 0 and 0 ≤ I0 φ + αx j − βy ≤ η ⇔ −I0 φ ≤ αx j − βy ≤ min(0, η − I0 φ),
(C.2)
saying that some dendritic branches are in the linear regime (above the threshold but not saturated), while the rest are below the threshold, is satisfied for all the other pyramidal cells in the limit of n → ∞ around the WTA-like equilibrium point (see equation 7.2): x j = I0 xl ( j = 1, . . . , n, j = k), 1 γ γ xl , y = η + I0 1 − n n
xk = η,
where xl > 0
(C.3)
1846
K. Morita, M. Okada, and K. Aihara
under the assumption (see equation 7.7)
0 < xl < min
1 α−1 , βγ − α βγ
(C.4)
without a proof. We describe the proof below. First, let us consider the kth cell. Substituting equation C.3 into αxk − βy yields γ η 1 η αxk − βy = I0 α − β + 1− · γ xl I0 n I0 n βγ ( Iη0 − xl ) η α − βγ xl − = I0 I0 n η → I0 α − βγ xl (n → ∞) I0
= αη − I0 βγ xl = η + (α − 1)η − I0 βγ xl η = η + I0 (α − 1) − βγ xl I0 α−1 − xl , > η + I0 βγ βγ
(C.5)
where the last inequality comes from the assumption I0 < η (see equation 4.4). Therefore, the inequality of equation C.1 asymptotically holds under assumption C.4 in the limit of n → ∞. Next we consider the other cells. Substituting equation C.3 into αx j − βy ( j = k) yields γ η 1 γ xl + 1− · αx j − βy = I0 αxm − β n I0 n βγ ( Iη0 − xl ) = I0 −(βγ − α)xl − n
→ −I0 (βγ − α)xl ’ (n → ∞).
(C.6)
Therefore, the right inequality of equation C.2 always holds, while the left inequality holds under the assumption of equation C.4, asymptotically in the limit of n → ∞. Therefore, the proof is complete.
Selectivity and Stability via Dendritic Nonlinearity
1847
Appendix D: Characteristic Multipliers at the BLA Equilibrium Point The Jacobi matrix of the DLI model at the BLA equilibrium point (see equation 7.4) is calculated to be
d2 d3 d4 .. , .. J∗ = 0 . . d3 d 4 d5 · · · · · · d 5 d6 d1
0
(D.1)
where d1 =
1 τp
d2 =
1 τp
d3 =
1 τp
1 βγ xw + 1 − βγ xl ; s n n βγ 1 xw + 1 − βγ xl ; s −β − β f win −α + n n βγ 1 xw + −α + 1 − βγ xl ; s (α − 1) − α f los n n βγ 1 xw + −α + 1 − βγ xl ; s −β − β f los n n
(α − 1) − α f win
−α +
1 τp γ d5 = nτG d4 =
d6 = −
1 . τG
(D.2) (D.3) (D.4) (D.5) (D.6) (D.7)
Thereby the characteristic equation is: det (λIn+1 − J ∗ ) = 0 ⇔ (λ − d3 )n−2 ((λ − d1 )(λ − d3 )(λ − d6 ) − (n − 1)(λ − d1 )d4 d5 − (λ − d3 )d2 d5 ) = 0.
(D.8)
Therefore, the multipliers are the solutions of the above equation, including d3 ((n − 2) times degenerated). Specifically, when the stimulus is completely random, that is, s = 0, the Jacobi matrix of the DLI model at the BLA equilibrium point
1848
K. Morita, M. Okada, and K. Aihara
I0 (x∗ , . . . , x∗ , γ x∗ )T is 0 c2 .. .. . . J∗ = , 0 c1 c2 c3 · · · c3 c4
c1
(D.9)
where c1 =
1 −1 + α + α(βγ − α)x∗ log ((βγ − α)x∗ ) − 1 τp
(D.10)
1 −β + β(βγ − α)x∗ log ((βγ − α)x∗ ) − 1 τp γ c3 = nτG c2 =
c4 = −
(D.11) (D.12)
1 . τG
(D.13)
Thereby the characteristic equation of the system at the BLA equilibrium point I0 (x∗ , . . . , x∗ , γ x∗ )T is det (λIn+1 − J ∗ ) = 0 ⇔ (λ − c 1 )n−1 ((λ − c 1 )(λ − c 4 ) − nc 2 c 3 ) = 0,
(D.14)
and so the characteristic multipliers are obtained: λ1 = c 1 =
1 −1 + α + α(βγ − α)x∗ log ((βγ − α)x∗ ) − 1 τp
((n − 1) times degenerated) c 1 + c 4 ± (c 1 − c 4 )2 + 4nc 2 c 3 λ2,3 = 2 =
c1 + c4 ±
(c 1 − c 4 )2 +
−βγ +βγ (βγ −α)x∗ (log((βγ −α)x∗ )−1) τ p τG
2
(D.15)
.
(D.16)
Notably, these multipliers depend on only α (strength of self-excitation) and βγ (strength of dendritic lateral inhibition) and do not depend on n (number of the pyramidal cells).
Selectivity and Stability via Dendritic Nonlinearity
1849
Appendix E: Characteristic Multipliers at the WTA-Like Equilibrium Point The Jacobi matrix of the DLI model at the WTA-like equilibrium point (see equations 7.23 and 7.24) is calculated to be
d2 d3 d4 .. , .. J∗ = 0 . . d3 d 4 d5 · · · · · · d 5 d6 d1
0
(E.1)
where d1 = −
1 τp
d2 = 0
(E.2)
βγ 1 η + −α + 1 − βγ xh ; s n n βγ 1 1 d4 = η + −α + 1 − βγ xh ; s −β − β f los τp n n γ d5 = nτG d3 =
1 τp
d6 = −
(α − 1) − α f los
1 . τG
(E.3) (E.4) (E.5) (E.6) (E.7)
Thereby the characteristic equation is: det (λIn+1 − J ∗ ) = 0 ⇔ (λ − d3 )
n−2
(λ − d1 )(λ − d3 )(λ − d6 ) − (n − 1)(λ − d1 )d4 d5
−(λ − d3 )d2 d5
= 0.
(E.8)
Therefore, the multipliers are the solution of the above equation, including d3 ((n − 2) times degenerated).
Appendix F: The Conventional Lateral Inhibition Model with Subtractive Feedforward Inhibition

The conventional lateral inhibition model with subtractive feedforward inhibition is described as follows:

$$\frac{dx_j}{dt} = \frac{1}{\tau_p}\left[-x_j + \sigma\left(\frac{1}{m}\sum_{i=1}^m w_{ji} I_i + \alpha x_j - \beta y - \delta z\right)\right] \quad (j = 1, \ldots, n) \tag{F.1}$$
$$\frac{dy}{dt} = \frac{1}{\tau_G}\left[-y + \frac{\gamma}{n}\sum_{j=1}^n x_j\right] \tag{F.2}$$
$$\frac{dz}{dt} = \frac{1}{\tau_G}\left[-z + \frac{\gamma}{m}\sum_{i=1}^m I_i\right]. \tag{F.3}$$
Here z is the firing rate of the inhibitory neuron that receives feedforward inputs from the input layer and gives inhibition to the pyramidal cells, and δ is the weight of inhibition from the feedforward inhibitory neuron to the pyramidal cells. Except for the existence of the feedforward inhibitory neuron (z), equations F.1 and F.2 are the same as the equations of the original conventional model, equations 2.4 and 2.5.

Appendix G: The Divisive Version of the DLI Model

The divisive version of the DLI model, in which the dendritic inhibition is represented by division rather than subtraction, is described as follows:

$$\frac{dx_j}{dt} = \frac{1}{\tau_p}\left[-x_j + \sigma\left(\frac{\frac{1}{m}\sum_{i=1}^m w_{ji} I_i + \alpha x_j}{1 + \varepsilon y}\right)\right] \quad (j = 1, \ldots, n) \tag{G.1}$$
$$\frac{dy}{dt} = \frac{1}{\tau_G}\left[-y + \frac{\gamma}{n}\sum_{j=1}^n x_j\right]. \tag{G.2}$$
Here ε is the strength of divisive inhibition from the pooled non-FS cell to each dendritic branch of every pyramidal cell. Except for using divisive inhibition represented by 1/(1 + εy) instead of subtractive inhibition represented by −βy, equations G.1 and G.2 are the same as the equations of the original DLI model: equations 2.1 and 2.2.
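The divisive model is straightforward to integrate numerically. The sketch below applies forward Euler to equations G.1 and G.2. Since the letter's transfer function σ and parameter values are defined in its main text (not reproduced here), a threshold-logarithmic σ and all numbers below are placeholder assumptions, not the authors' settings:

```python
import numpy as np

# All values below are placeholder assumptions, not parameters from the letter.
n, m = 8, 8
tau_p, tau_G = 10.0, 5.0
alpha, gamma, eps = 1.5, 2.5, 0.5
sigma = lambda u: np.log1p(np.maximum(u, 0.0))  # assumed threshold-log transfer
rng = np.random.default_rng(1)
w = rng.random((n, m))                          # placeholder weights w_ji
I = rng.random(m)                               # placeholder stimulus

x, y, dt = np.zeros(n), 0.0, 0.01
for _ in range(int(500 / dt)):
    drive = (w @ I) / m + alpha * x             # (1/m) sum_i w_ji I_i + alpha*x_j
    dx = (-x + sigma(drive / (1.0 + eps * y))) / tau_p   # eq. G.1
    dy = (-y + gamma * x.mean()) / tau_G                  # eq. G.2
    x, y = x + dt * dx, y + dt * dy
print(np.round(x, 3), round(y, 3))  # steady state under divisive inhibition
```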
Acknowledgments

This research is partially supported by a Grant-in-Aid for Scientific Research on Priority Areas (System Study on Higher-Order Brain Functions) from the Ministry of Education, Culture, Sports, Science and Technology of Japan (17022012). K. M. is supported by the Japan Society for the Promotion of Science Research Fellowships for Young Scientists (10834). We thank the anonymous reviewers for their valuable comments. K. M. thanks Dr. N. Masuda for constructive comments on an early version of this letter.
Received November 3, 2005; accepted October 4, 2006.
LETTER
Communicated by André Longtin
Filtering of Spatial Bias and Noise Inputs by Spatially Structured Neural Networks Naoki Masuda [email protected] Laboratory for Mathematical Neuroscience, RIKEN Brain Science Institute, Wako, Japan, and ERATO Aihara Complexity Modelling Project, Japan Science and Technology Agency, Tokyo, Japan
Masato Okada [email protected] Department of Complexity Science and Engineering, Graduate School of Frontier Sciences, University of Tokyo, Tokyo, Japan, and Intelligent Cooperation and Control, PRESTO, JST, Japan
Kazuyuki Aihara [email protected] Institute of Industrial Science, University of Tokyo, Tokyo, Japan, and ERATO Aihara Complexity Modelling Project, Japan Science and Technology Agency, Tokyo, Japan
With spatially organized neural networks, we examined how bias and noise inputs with spatial structure result in different network states such as bumps, localized oscillations, global oscillations, and localized synchronous firing that may be relevant to, for example, orientation selectivity. To this end, we used networks of McCulloch-Pitts neurons, which allow theoretical predictions, and verified the obtained results with numerical simulations. Spatial inputs, no matter whether they are bias inputs or shared noise inputs, affect only firing activities with resonant spatial frequency. The component of noise that is independent for different neurons increases the linearity of the neural system and gives rise to less spatial mode mixing and less bistability of population activities.

1 Introduction

Neurons are embedded in Euclidean space. One neuron is connected to many others, but only up to a certain distance (Sholl, 1956; Hellwig, 2000; Destexhe & Sejnowski, 2003). Because the spatial specificity of connectivity is usually difficult to determine empirically, modeling studies have preceded experimental work. Stationary bumps (localized sustained firing), traveling waves, and homogeneous stationary or oscillatory firing are characteristic

Neural Computation 19, 1854–1870 (2007)
C 2007 Massachusetts Institute of Technology
of networks with spatially limited connectivity composed of dynamical analog neurons (Wilson & Cowan, 1973; Amari, 1977; Ermentrout & Cowan, 1979a, 1979b; Ben-Yishai, Bar-Or, & Sompolinsky, 1995; Redish, Elga, & Touretzky, 1996; Hansel & Sompolinsky, 1998; Pinto & Ermentrout, 2001; Miyawaki & Okada, 2003), spiking neurons (Bressloff, Bressloff, & Cowan, 2000; Compte, Brunel, Goldman-Rakic, & Wang, 2000; Gutkin, Laing, Colby, Chow, & Ermentrout, 2001; Laing & Chow, 2001; Rubin, Terman, & Chow, 2001), and reaction-diffusion elements (Koga & Kuramoto, 1980; Nishiura & Mimura, 1989).

External inputs to neural circuits, such as visual and tactile stimuli, can also be spatially limited. This is important because the cerebral cortex and sensory pathways often preserve topological information about stimuli, represented by "-topy" relations such as retinotopy and somatotopy. Given localized inputs, localized network activities, particularly bump states, have functional meanings. A good example is self-organizing orientation and direction selectivity (Ben-Yishai et al., 1995; Hansel & Sompolinsky, 1998; Bressloff et al., 2000; Ernst, Pawelzik, Sahar-Pikielny, & Tsodyks, 2001; Seriès, Latham, & Pouget, 2004), where spatially organized coupling enhances the sensitivity to localized features of stimuli. Other examples include working memory, in which a bump is switched on and off in bistable neural networks (Wilson & Cowan, 1973; Compte et al., 2000; Gutkin et al., 2001), perceptual masking (Wilson & Cowan, 1973), visual hallucination (Ermentrout & Cowan, 1979a), virtual rotation realized by traveling waves in response to moving stimuli (Ben-Yishai et al., 1995; Hansel & Sompolinsky, 1998; Bressloff et al., 2000), the head direction system (Redish et al., 1996; Rubin et al., 2001), perceptual suppression by transcranial magnetic stimulation (Miyawaki & Okada, 2003), and localized synchronous firing (Hamaguchi, Okada, Yamana, & Aihara, 2005).

Physiological inputs may take the form of noise rather than biases. In particular, different neurons may be driven by a common localized noise source. Noise inputs with different spatial extents can carry different types of information. For instance, weakly electric fish use global common noise for communication between conspecifics and local noise for hunting (Doiron, Chacron, Maler, Longtin, & Bastian, 2003). Inputs whose spatial extents are in resonance with the spatial extents of coupling are enhanced in networks with local excitation and lateral inhibition (Ben-Yishai et al., 1995; Hansel & Sompolinsky, 1998; Shamma, 1998; Dayan & Abbott, 2001). However, spatially structured networks with nonresonant input and the effects of common noise are largely unexplored. In Hansel and Sompolinsky (1998), inputs are given in the form of spatial Fourier modes, and the spatial Fourier modes of firing rates are designated as the relevant order parameters. We therefore investigate the dependence of such order parameters on connectivity and on input whose spatial scale does not necessarily coincide with that of firing.
We analyze McCulloch-Pitts neurons with spatially organized coupling receiving spatially structured biases and independent and common noise inputs, which are introduced in section 2. This choice accommodates the analytical calculations in section 3, which examines how activity profiles depend on input and coupling. Networks with bias input and spatial common noise input are investigated in sections 4 and 5, respectively. We discuss our results in section 6.

2 Model

We use spatially structured networks of McCulloch-Pitts neurons receiving spatial bias and noise inputs. On a one-dimensional ring, N neurons are aligned, with the ith neuron (1 ≤ i ≤ N) located at angle θ = 2πi/N. We distinguish two types of noise. The independent and common noise inputs to the neuron at location θ are denoted by ξ_θ^t and ξ̂_θ^t, respectively, where t = 0, 1, 2, . . . is the discrete time. If ξ̂_θ^t were composed of N or more independently generated noise components, the common noise would be indistinguishable from the independent noise. Therefore, we assume that ξ̂_θ^t originates from a small number of noise generators. The state of the ith neuron at time t is denoted by x_θ^t ∈ {0, 1} (θ = 2πi/N), with 0 (or 1) representing the resting (or firing) state. The dynamics of x_θ^t are described by

$$x_\theta^{t+1} = H\left(\frac{1}{N}\sum_{\theta'} J_{\theta\theta'} x_{\theta'}^t + I_\theta + \hat{\xi}_\theta^t + \xi_\theta^t\right), \tag{2.1}$$

where

$$H(x) = \begin{cases} 1, & (x \ge 0) \\ 0, & (x < 0) \end{cases} \tag{2.2}$$

I_θ is a static bias input, and J_{θθ'} is the coupling strength from the neuron at θ' to the neuron at θ. The summation in equation 2.1 is over θ' ∈ {2π/N, 4π/N, . . . , 2π}. Without loss of generality, the threshold for firing is set to 0. In the limit of N → ∞, equation 2.1 is approximated by

$$x_\theta^{t+1} = H\left(\int_0^{2\pi} d\theta'\, J_{\theta\theta'} x_{\theta'}^t + I_\theta + \hat{\xi}_\theta^t + \xi_\theta^t\right) \quad (0 < \theta \le 2\pi). \tag{2.3}$$
We assume that ξ_θ^t is white gaussian noise with mean 0 and variance Δ², independent for different θ and different t. Let us restrict ourselves to symmetric networks invariant under spatial translations, as assumed in most of the studies mentioned in section 1.
Accordingly, the coupling strength depends only on the distance between two neurons, and we write

$$J_{\theta\theta'} = \sum_{m=0}^{\infty} J_m \cos[m(\theta-\theta')]. \tag{2.4}$$
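As a concrete illustration of the model defined by equations 2.1, 2.2, and 2.4, the following sketch simulates the ring network with the coupling truncated at m = 1; the specific parameter values (J_0, J_1, I_0, Δ, and the common-noise amplitude) are placeholders, not values prescribed by this letter:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 800
theta = 2 * np.pi * np.arange(1, N + 1) / N

# Coupling of eq. 2.4 truncated at m = 1 (illustrative J0, J1).
J0, J1 = 0.5, 2.0
J = J0 + J1 * np.cos(theta[:, None] - theta[None, :])

I = -0.2 * np.ones(N)   # static bias I_theta (placeholder value)
Delta = 0.3             # independent-noise strength (placeholder value)

x = rng.integers(0, 2, N).astype(float)
for t in range(100):
    common = 0.1 * rng.normal() * np.cos(theta)   # e.g., delta1 * xi1 * cos(theta)
    indep = Delta * rng.normal(size=N)            # xi_theta^t, variance Delta**2
    # Eq. 2.1: Heaviside threshold of recurrent input + bias + noise.
    x = (J @ x / N + I + common + indep >= 0).astype(float)
print(x.mean())   # fraction of firing neurons
```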
3 General Theory

We derive the dynamics of firing patterns in terms of the spatial Fourier modes. The expected value of x_θ^{t+1} in equation 2.3 with respect to the independent noise becomes

$$E\left[x_\theta^{t+1}\right] = \operatorname{erfc}\left(z_\theta^t(\hat{\xi}_\theta^t)\right), \tag{3.1}$$

where

$$z_\theta^t(\hat{\xi}_\theta^t) = \frac{1}{\Delta}\left[\int_0^{2\pi} d\theta'\, J_{\theta\theta'} x_{\theta'}^t + I_\theta + \hat{\xi}_\theta^t\right], \tag{3.2}$$

and

$$\operatorname{erfc}(x) = \frac{1}{\sqrt{2\pi}} \int_x^{\infty} \exp\left(\frac{-x'^2}{2}\right) dx'. \tag{3.3}$$
The quantity E[x_θ^{t+1}] is the instantaneous firing rate averaged over the independent noise at a given position θ. The input-output relation is modulated by Δ, the strength of the independent noise; a larger Δ yields a smaller gain. The entire spatial profile of instantaneous firing rates is equivalent to the set of Fourier coefficients of E_t[x_θ^t] (0 < θ ≤ 2π), where E_t is the expectation with respect to all the independent noise sources up to time t. We obtain

$$r_0^{t+1} \equiv \frac{1}{2\pi}\int_0^{2\pi} d\theta\, E_{t+1}\left[x_\theta^{t+1}\right] = \frac{1}{2\pi}\int_0^{2\pi} d\theta\, E_t\left[\operatorname{erfc}\left(z_\theta^t(\hat{\xi}_\theta^t)\right)\right] = \frac{1}{2\pi}\int_0^{2\pi} d\theta\, \operatorname{erfc}\left(\hat{z}_\theta^t(\hat{\xi}_\theta^t)\right) \tag{3.4}$$
$$r_{mc}^{t+1} \equiv \frac{1}{2\pi}\int_0^{2\pi} d\theta \cos(m\theta)\, E_{t+1}\left[x_\theta^{t+1}\right] = \frac{1}{2\pi}\int_0^{2\pi} d\theta \cos(m\theta)\, \operatorname{erfc}\left(\hat{z}_\theta^t(\hat{\xi}_\theta^t)\right) \tag{3.5}$$
$$r_{ms}^{t+1} \equiv \frac{1}{2\pi}\int_0^{2\pi} d\theta \sin(m\theta)\, E_{t+1}\left[x_\theta^{t+1}\right] = \frac{1}{2\pi}\int_0^{2\pi} d\theta \sin(m\theta)\, \operatorname{erfc}\left(\hat{z}_\theta^t(\hat{\xi}_\theta^t)\right), \tag{3.6}$$
where

$$\hat{z}_\theta^t(\hat{\xi}_\theta^t) = E_t\left[z_\theta^t(\hat{\xi}_\theta^t)\right] = \frac{1}{\Delta}\left[J_0 r_0^t + \sum_{m=1}^{\infty} J_m\left(r_{mc}^t\cos m\theta + r_{ms}^t\sin m\theta\right) + I_\theta + \hat{\xi}_\theta^t\right], \tag{3.7}$$

and m ≥ 1. With common noise, much information about the dynamics is contained in the probability distribution functions (PDFs) of the order parameters. If the PDF of r_0^t is broad, the population firing rate has relatively large fluctuations around the mean. Under ergodicity, this implies that the population firing rate fluctuates in time; a broad PDF therefore corresponds to synchronous firing. Similarly, the spread of the PDF of each order parameter represents the degree of synchronous firing on the respective spatial scale.

4 Response to Spatial Bias Inputs

In this section and section 5, we analyze the interplay of the spatial modes of the recurrent coupling, the bias inputs, and the common noise inputs, which determines the neural activities r_0^t, r_{mc}^t, and r_{ms}^t (m ≥ 1). In this section, we focus only on bias inputs by setting ξ̂_θ^t = 0 for all t and θ.

4.1 Global Connectivity. Suppose that the neurons are randomly (or all-to-all) connected. With J_0 ≠ 0 and J_m = 0 (m ≥ 1), equation 3.7 reduces to

$$\hat{z}_\theta^t = \frac{1}{\Delta}\left(J_0 r_0^t + I_\theta\right). \tag{4.1}$$
Equation 3.4 is closed and describes the dynamics of r_0^t. Although r_{mc}^t and r_{ms}^t (m ≥ 1) do not vanish, they are irrelevant. We represent the bias inputs in the following form (Ben-Yishai et al., 1995; Hansel & Sompolinsky, 1998; Shamma, 1998; Bressloff et al., 2000; Dayan & Abbott, 2001):

$$I_\theta = I_0 + \sum_{m=1}^{\infty}\left[I_{mc}\cos(m\theta) + I_{ms}\sin(m\theta)\right]. \tag{4.2}$$

Because of rotational symmetry, we ignore I_{ms} sin(mθ) and look at how I_0 and I_m ≡ I_{mc} drive the dynamics. When the nonlinearity of the set of equations 3.4 and 4.1 is strong, mode mixing is manifest. In other words, input components with high (low) spatial frequency influence order parameters with low (high) spatial frequency. When the nonlinearity is marginal,
each input mode boosts only the corresponding order parameter (Shamma, 1998).

Let us consider I_θ composed of I_0, I_1, and I_2. The numerically built bifurcation diagrams for r_0 ≡ lim_{t→∞} r_0^t and r_1 ≡ lim_{t→∞} √((r_{1c}^t)² + (r_{1s}^t)²) are compared with direct numerical simulations of networks of N = 800 McCulloch-Pitts neurons. We choose J_0 and J_1 as bifurcation parameters because we are interested in how the coupling structure affects the population dynamics. With a larger m, J_m is the strength of a more spatially limited component of the coupling. Similarly, I_m with a larger m is the strength of a more spatially limited input. The bifurcation diagrams are shown as the crossings of the J_1 = 0 line and the solid lines in Figures 1 to 4 for (I_1, I_2) = (0, 0), (0, 1), (1, 0), and (1, 1), respectively. On the line J_1 = 0 in Figures 1 to 4, r_0 naturally increases with J_0. The region SA corresponds to the saturation regime, defined by r_0 > 0.999 and r_1 ≈ 0. In this region, all the neurons fire at high frequency owing to strong external and recurrent inputs. Below a negative value of J_0, a stable global oscillation with period 2 occurs as a result of a flip (also called period-doubling) bifurcation (region GO). In this region, all the neurons repeat alternation of firing and resting in unison. In symmetrically coupled McCulloch-Pitts neurons like ours, stable solutions have period 1 or 2 (Marcus & Westervelt, 1989b). Although period-2 oscillations are sometimes artifacts of time discretization, the oscillations observed here have natural continuous-time counterparts, namely, network oscillations via a Hopf bifurcation as negative feedback or feedback delay time is enhanced (Marcus & Westervelt, 1989a; Gerstner, 2000; Doiron et al., 2003; Masuda, Doiron, Longtin, & Aihara, 2005). As shown in Figures 3 and 4, turning on I_1 makes r_1^t ≠ 0, which contrasts with Figures 1 and 2, in which I_1 = 0. However, r_1^t is irrelevant to the system dynamics when J_m = 0 (m ≥ 1). The global activity r_0^t is insensitive to I_1 or I_2, even though |I_1| and |I_2| (= 1 when they are turned on) are large relative to the other system parameters. In each of Figures 1 to 4, a large negative J_0 yields global oscillations, and a larger J_0 yields larger firing rates (see panels A and C in each figure).
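The closed map of equations 3.4 and 4.1 can be iterated directly to locate the fixed points and period-2 orbits that underlie these diagrams. A minimal sketch follows; we write the single-neuron firing probability as the gaussian CDF of the drive, which equals the erfc form of equation 3.1 up to its sign convention, and the J_0 sweep range is an illustrative choice:

```python
import numpy as np
from scipy.stats import norm

Delta, I0 = 0.3, -0.2   # parameter values quoted for Figure 1

def step(r0, J0):
    """One iteration of eq. 3.4 with eq. 4.1 and I_theta = I0:
    the probability that J0*r0 + I0 + xi >= 0 for xi ~ N(0, Delta**2)."""
    return norm.cdf((J0 * r0 + I0) / Delta)

for J0 in np.linspace(-3.0, 2.0, 11):   # illustrative sweep along the J1 = 0 line
    r0 = 0.5
    for _ in range(2000):               # relax onto the attractor
        r0 = step(r0, J0)
    # If the attractor has period 2 (region GO), r0 and step(r0, J0) differ.
    print(f"J0 = {J0:+.1f}: r0 -> {r0:.4f}, next -> {step(r0, J0):.4f}")
```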
4.2 Spatially Limited Connectivity. Let us now introduce spatially organized coupling by setting J_1 ≠ 0 and J_m = 0 (m ≥ 2). Then equation 3.7 becomes

$$\hat{z}_\theta^t = \frac{1}{\Delta}\left[J_0 r_0^t + J_1\left(r_{1c}^t\cos\theta + r_{1s}^t\sin\theta\right) + I_\theta\right]. \tag{4.3}$$

The set of dynamical equations 3.4 to 3.6 and 3.7 is closed with respect to r_0^t, r_{1c}^t, and r_{1s}^t. For the reasons given in section 4.1, population activities
Figure 1: Bifurcation diagrams in the J_0-J_1 space. We set J_m = 0 (m ≥ 2), I_{mc} = 0 (m ≥ 3), I_0 = −0.2, (I_1, I_2) = (0, 0), and Δ = 0.3. No common noise is assumed. Solid lines are analytically obtained bifurcation curves. In each panel, these theoretical curves are overlapped with long-term averages of (A) r_0^t, (B) r_1^t, (C) r_0^{t+1} − r_0^t, and (D) r_1^{t+1} − r_1^t obtained numerically with N = 800 McCulloch-Pitts neurons. A brighter point indicates a larger value of the observable. The brightness is scaled so that different phases can be visually distinguished. M: monostable global activity; B: bump; GO: global oscillation; LO: localized oscillation; SA: saturation.
with higher spatial resolution than that of J 1 , such as r2c , r2s , and r3c , do not essentially affect the dynamics. Figures 1 to 4 indicate that small |J 1 | yields little qualitative discrepancy from the homogeneous case (J 1 = 0). Global oscillations emerge for sufficiently negative coupling, and localized activities do not exist. A slight difference from the homogeneous case is that regions of bistability near the emergence of global oscillations are enlarged as J 1 increases. In the bistable regime, there is a combination of a subcritical flip bifurcation and a saddle node bifurcation of a pair of periodic cycles with period 2. This bistability has a continuous-time counterpart in which an oscillation builds up via a
Figure 2: Bifurcation diagrams when (I_1, I_2) = (0, 1). See the caption of Figure 1 for details and the other parameter values. The gray-scale code is the same as that used in Figure 1.
subcritical Hopf bifurcation in combination with a saddle node bifurcation of a pair of limit cycles.

When I_1 = 0, increases in J_1 make the spatially homogeneous state (r_1 = 0) unstable in a broad parameter region (see Figures 1B, 1D, 2B, and 2D). This accompanies the emergence of a bump via pitchfork bifurcations (region B). In this parameter region, the firing rate is stationary for all the neurons. The neurons that belong to the bump have larger firing rates than the background neurons. In contrast, as J_1 tends toward large negative values, localized oscillations appear via flip bifurcations (region LO). This state is defined by oscillations of r_{1c}^t and r_{1s}^t with period 2. In this case, all the neurons fire with period 2. In contrast to GO, the neurons inside the bump and those outside the bump fire alternately. Introduction of I_2 ≠ 0 does not modify the results much, whereas I_1 ≠ 0 weakens or abolishes localized oscillations in favor of bumps, due to the bumpy inputs (see Figures 3 and 4).
Figure 3: Bifurcation diagrams when (I_1, I_2) = (1, 0). See the caption of Figure 1 for details and the other parameter values.
When J_0 ≈ 1.2 and |J_1| ≥ 2.5 in Figures 2B and 2D, and when J_0 ≈ 1.2 and J_1 ≤ −2.5 in Figures 4B and 4D, the analytical and numerical results for r_1^t disagree. The I_2 cos(2θ) term present in these cases generates two humps in the input. With J_1 fixed, increasing J_0 would homogenize the network activity toward global saturation. However, noise allows the activity around one of the two humps (one at θ ≈ π/2, the other at θ ≈ 3π/2) to saturate earlier than the other, which causes asymmetrical firing profiles and nontrivial behavior of r_1^t.

4.3 Monostability and Bistability. As in the global oscillations mentioned in section 4.1, the onset of a bump (or localized oscillation) accompanies monostability with a supercritical pitchfork (or flip) bifurcation, or bistability with a subcritical pitchfork (or flip) bifurcation combined with a saddle node bifurcation of a pair of stable points (or periodic orbits). Bistability comes along with a nonlinear f-I curve (Hansel & Sompolinsky, 1998). When Δ = 0.3, transitions are more or less monostable.
Figure 4: Bifurcation diagrams when (I_1, I_2) = (1, 1). See the caption of Figure 1 for details and the other parameter values.
A decrease in Δ will cause more nonlinear f-I relations and enlarged parameter regions for bistability. To show this, we vary Δ in the bump regime. The bifurcation diagram in the J_0-Δ parameter space is shown in Figure 5. A smaller Δ indeed yields more pronounced bistability. There is a bifurcation curve on which a pitchfork bifurcation occurs with the primary eigenvalue 1. The criticality of the pitchfork bifurcation changes at a certain point on the curve. A branch of saddle node bifurcations emerges there, which yields bistability. Enhanced bistability occurs also for localized and global oscillations (results not shown).

5 Response to Common Noise Inputs

We argued in section 3 that independent noise averages out as N → ∞. However, this is not the case for common noise. The results in section 4 can be extended to the case of common noise. The stochastic process {(r_0^t, r_{mc}^t, r_{ms}^t) : m ∈ ℕ} (t = 0, 1, 2, . . .) is a continuous-state Markov chain. It is straightforward to write down the stochastic transition rules
corresponding to equations 3.4 to 3.6. We set I_θ = 0 and restrict our analysis to simple situations.

Figure 5: Bifurcation curves in the J_0-Δ space with J_1 = 3, J_m = 0 (m ≥ 2), I_0 = −0.2, and I_m = 0 (m ≥ 1). The branches for supercritical flip (f+), saddle node (sn), supercritical pitchfork (p+), and subcritical pitchfork (p−) bifurcations are shown.

5.1 Global Connectivity. When the neurons are globally connected, with J_m = 0 (m ≥ 1), only the PDF of r_0^t, denoted by P[r_0^t], is relevant, as shown in section 4.1. We assume

$$\hat{\xi}_\theta^t = \delta_0 \hat{\xi}_0^t + \delta_1 \hat{\xi}_1^t \cos\theta, \tag{5.1}$$

where ξ̂_0^t and ξ̂_1^t are independent standard gaussian noise; δ_1 will turn out not to influence P[r_0^t] qualitatively. When δ_1 = 0, the transition rule for r_0^t is given by
$$P\left[r_0^{t+1}\right] = \int_0^1 dr_0^t\, P\left[r_0^t\right] K\left(r_0^{t+1}, r_0^t\right), \tag{5.2}$$
where

$$\begin{aligned}
K\left(r_0^{t+1}, r_0^t\right) &= \int d\hat{\xi}_0\, \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{\hat{\xi}_0^2}{2}\right) \delta\left(r_0^{t+1} - \operatorname{erfc}\left(\frac{J_0 r_0^t + \delta_0\hat{\xi}_0}{\Delta}\right)\right) \\
&= \exp\left(\frac{-\xi_0^2 + \left(\frac{J_0 r_0^t + \delta_0\xi_0}{\Delta}\right)^2}{2}\right),
\end{aligned} \tag{5.3}$$

δ is the delta function, and ξ_0 is defined so that

$$r_0^{t+1} - \operatorname{erfc}\left(\frac{J_0 r_0^t + \delta_0 \xi_0}{\Delta}\right) = 0. \tag{5.4}$$

Figure 6: PDFs of (A) r_0 and (B) r_1 when common noise is applied with δ_1 = 0 (solid lines) and δ_1 = 0.1 (dotted lines). We set J_m = 0 (m ≥ 0), I_0 = −0.2, I_{mc} = 0 (m ≥ 1), Δ = 0.3, δ_0 = 0.1, and δ_m = 0 (m ≥ 2).
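Equations 5.2 to 5.4 can be iterated numerically. The sketch below implements the transition kernel through the change of variables ξ̂_0 → r_0^{t+1} implied by equation 5.4; the factor Δ/(δ_0 φ(z)) is the Jacobian that this change of variables requires (equation 5.3 as printed omits normalization, so treat that factor as our addition). J_0 = 0.5 is an illustrative choice, while Δ and δ_0 follow Figure 6:

```python
import numpy as np
from scipy.stats import norm

Delta, delta0 = 0.3, 0.1   # values used for Figure 6
J0 = 0.5                   # illustrative coupling (Figure 6 itself has J0 = 0)

# erfc in this letter is the gaussian tail integral, i.e., norm.sf;
# its inverse is norm.isf.
r = np.linspace(1e-4, 1 - 1e-4, 400)   # grid over r0, endpoints excluded
dr = r[1] - r[0]

def kernel(r_next, r_now):
    """K(r_next, r_now) of eqs. 5.3-5.4: the density of
    r_next = erfc((J0*r_now + delta0*xi)/Delta) for xi ~ N(0, 1)."""
    z = norm.isf(r_next)                        # erfc(z) = r_next
    xi0 = (Delta * z - J0 * r_now) / delta0     # root of eq. 5.4
    jac = Delta / (delta0 * norm.pdf(z))        # |d xi / d r_next|
    return norm.pdf(xi0) * jac

K = kernel(r[:, None], r[None, :])              # K[i, j] = K(r_i, r_j)
P = np.ones_like(r)                             # flat initial PDF
for _ in range(300):                            # iterate eq. 5.2 on the grid
    P = (K @ P) * dr
    P /= P.sum() * dr                           # guard against grid error
```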
When δ_1 ≠ 0, we obtain the following equation, setting δ_0 = 0 for simplicity:

$$\begin{aligned}
K\left(r_0^{t+1}, r_0^t\right) &= \frac{1}{2\pi}\int_0^{2\pi} d\theta \int d\hat{\xi}_1\, \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{\hat{\xi}_1^2}{2}\right) \delta\left(r_0^{t+1} - \operatorname{erfc}\left(\frac{J_0 r_0^t + \delta_1\hat{\xi}_1\cos\theta}{\Delta}\right)\right) \\
&= \frac{1}{2\pi}\int_0^{2\pi} d\theta \exp\left(\frac{-\xi_1^2(\theta) + \left(\frac{J_0 r_0^t + \delta_1\xi_1(\theta)\cos\theta}{\Delta}\right)^2}{2}\right),
\end{aligned} \tag{5.5}$$

where

$$r_0^{t+1} - \operatorname{erfc}\left(\frac{J_0 r_0^t + \delta_1\xi_1(\theta)\cos\theta}{\Delta}\right) = 0, \tag{5.6}$$

and we interpret ξ_1(π/2) = ξ_1(3π/2) = ∞. Equations 5.3 and 5.5 are formally different, and we cannot incorporate the effects of δ_1 by redefining a closed form equivalent to equation 5.3 or by rescaling the parameters. Therefore, we resort to numerical simulations. Although we could solve for P[r_0^t] using equations 5.3 and 5.4, we choose to construct P[r_0^t] from direct numerical simulations of N = 800 neurons. When I_0 = −0.2, I_m = 0 (m ≥ 1), and J_m = 0 (m ≥ 0), Figure 1 implies that there are no oscillations or bump states. We plot P[r_0] ≡ (1/9900) Σ_{t=101}^{10000} δ(r_0 − r_0^t) and P[r_1] ≡ (1/9900) Σ_{t=101}^{10000} δ(r_1 − r_1^t) in Figure 6. As expected, δ_0 > 0 disperses P[r_0] (solid line in Figure 6A), which results in enhanced global synchrony (see the end of section 3 for the interpretation of the PDFs), but it does not disperse P[r_1] (solid line in Figure 6B), implying the absence of localized synchrony. In contrast, δ_1 > 0 selectively broadens P[r_1] (dotted lines in Figures 6A and 6B) and enhances localized synchrony only.
Table 1: Cross-Correlation Coefficients of PDFs When Common Noise Is Present.

Two Configurations (δ1, δ2)      J1 = 1              J1 = 2.4            J1 = −2.4
                                 P[r0]     P[r1]     P[r0]     P[r1]     P[r0]     P[r1]
(0, 0), (0, 0.1)                 0.997424  0.999797  0.997153  0.985725  0.996089  0.983379
(0, 0), (0.1, 0)                 0.995109  0.649504  0.999028  0.982197  0.997314  0.983783
(0, 0), (0.1, 0.1)               0.989271  0.661350  0.996469  0.974676  0.995728  0.978812
(0, 0.1), (0.1, 0)               0.998870  0.645986  0.996065  0.962443  0.996093  0.969407
(0, 0.1), (0.1, 0.1)             0.996474  0.658068  0.997976  0.980145  0.997467  0.979670
(0.1, 0), (0.1, 0.1)             0.997912  0.998667  0.997283  0.985094  0.997237  0.990309
5.2 Spatially Limited Connectivity. We calculate the linear cross-correlation coefficients between two PDFs (e.g., P[r_0]) produced by a pair of different noisy inputs chosen from (δ_1, δ_2) = (0, 0), (0, 0.1), (0.1, 0), (0.1, 0.1). This quantifies how different inputs may lead to similar or different population activities. We set J_0 = 0 and J_m = 0 (m ≥ 2), and the other parameters are equal to those used in Figure 6. We then calculate the PDFs in temporally and spatially homogeneous states (J_1 = 1.0), bump states (J_1 = 2.4), and localized oscillation states (J_1 = −2.4). The common noise profile is varied with δ_1 ∈ {0, 0.1} and δ_2 ∈ {0, 0.1}, where an additional common noise term is given in the form δ_2 ξ̂_2^t cos(2θ), and ξ̂_2^t is a standard gaussian variable. Table 1 summarizes the results. The effects of spatial common noise on the relevant PDFs can be predicted from the results for bias inputs in section 4.2. First, δ_2 affects P[r_0] and P[r_1] very little, resulting in large cross-correlation values between the pairs of parameter sets {(δ_1, δ_2) = (0, 0), (δ_1, δ_2) = (0, 0.1)} and {(δ_1, δ_2) = (0.1, 0), (δ_1, δ_2) = (0.1, 0.1)}. Second, δ_1 does not affect P[r_0]. Third, δ_1 affects P[r_1]. Accordingly, the pairs of P[r_1] for the other four combinations are dissimilar when J_1 = 1. We remark that in the bump state (J_1 = 2.4) and the localized oscillation state (J_1 = −2.4), P[r_1] is concentrated near the possible extreme values (r_1 ≈ 1/π or −1/π). Because |J_1| = 2.4 is large enough to confine r_1 around these extreme values, P[r_1] is robust against δ_1 and is therefore not affected by δ_1 in these particular cases.
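A minimal sketch of the direct-simulation procedure behind Figure 6 and Table 1, using the homogeneous-state parameters (J_1 = 1); the bin count, random seed, and the particular pair of noise configurations compared are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, burn = 800, 10000, 100
J0, J1, Delta, I0 = 0.0, 1.0, 0.3, -0.2   # homogeneous-state case of Table 1
theta = 2 * np.pi * np.arange(1, N + 1) / N

def run(delta1, delta2):
    """Simulate eq. 2.1 with coupling J0 + J1*cos(theta - theta') and common
    noise delta1*cos(theta) + delta2*cos(2*theta); return post-burn-in r0, r1."""
    x = rng.integers(0, 2, N).astype(float)
    out0, out1 = [], []
    for t in range(T):
        r0 = x.mean()
        c1 = np.mean(x * np.cos(theta))   # r_1c, cf. eq. 3.5
        s1 = np.mean(x * np.sin(theta))   # r_1s, cf. eq. 3.6
        drive = J0 * r0 + J1 * (c1 * np.cos(theta) + s1 * np.sin(theta)) + I0
        common = delta1 * rng.normal() * np.cos(theta) \
               + delta2 * rng.normal() * np.cos(2 * theta)
        x = (drive + common + Delta * rng.normal(size=N) >= 0).astype(float)
        if t >= burn:
            out0.append(r0)
            out1.append(np.hypot(c1, s1))
    return np.array(out0), np.array(out1)

bins = np.linspace(0, 0.5, 101)
_, r1_a = run(0.0, 0.0)
_, r1_b = run(0.1, 0.0)
Pa, _ = np.histogram(r1_a, bins, density=True)
Pb, _ = np.histogram(r1_b, bins, density=True)
print("cross-correlation of P[r1]:", np.corrcoef(Pa, Pb)[0, 1])   # cf. Table 1
```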
6 Discussion

We have shown how the population activities of neural networks with spatial structure reflect the spatial frequency components of bias and common noise inputs. Although we have assumed J_m = 0 (m ≥ 2), our results extend to situations in which higher-order J_m are present. If J_m ≠ 0 up to m = m_0, then r_0, r_{mc}, r_{ms}, I_0, I_m, δ_0, and δ_m with m ≤ m_0 are relevant. Each Fourier mode of the input essentially controls only the activities of the corresponding spatial frequency, leading to bumps, localized oscillations, or localized synchrony of the corresponding size. Our results extend previous studies (see the first paragraph of section 1 for references) in that we have systematically determined the interdependence of the parameters specifying the strength and the spatial organization of the coupling, the bias input, and the common noise input. Our particular emphases are on the bifurcation structure around monostability-bistability transitions, the different roles of independent and common noise, and localized oscillations.

6.1 Mode Mixing and Spatial Filtering. Although a spatial mode of input principally drives the corresponding mode of the firing rates, the nonlinearity in the f-I curve yields mode mixing. If the error function in equation 3.3 is approximated as linear, I_{mc} (or I_0) affects only the corresponding r_{mc} (or r_0). For instance, equation 3.4 with equations 4.1 and 4.2 would turn into

$$r_0^{t+1} = \frac{1}{2\pi}\int_0^{2\pi} d\theta\, \left(J_0 r_0^t + I_\theta\right), \tag{6.1}$$
which does not contain I_{mc} (m ≥ 1). However, the error function and other ordinary gain functions are nonlinear, with thresholding and saturation. In the case of the error function, substituting

$$\operatorname{erfc}(x) = \frac{1}{2} + \frac{1}{\sqrt{2\pi}}x - \frac{1}{\sqrt{2\pi}}x^3 + O(x^5), \tag{6.2}$$

equation 3.4, given equations 4.1 and 4.2 with I_{ms} = 0 (m ≥ 1), becomes

$$r_0^{t+1} = \frac{1}{2} + \frac{1}{\sqrt{2\pi}}\left[\left(J_0 r_0^t + I_0\right) + \left(J_0 r_0^t + I_0\right)^3 + \frac{3}{2}\left(J_0 r_0^t + I_0\right)\sum_{m=1}^{\infty} I_{mc}^2 + \frac{3}{4}\sum_{m,m'\ge 1} I_{(m+m')c}\, I_{mc}\, I_{m'c}\right] + \text{higher-order terms}, \tag{6.3}$$
implying mode mixing. An input with a high spatial frequency affects low-spatial-frequency components of the neural activities and vice versa. The degree of nonlinearity can be controlled by the level of independent noise Δ. A small Δ results in a more nonlinear f-I curve and more mode mixing. At the same time, a small Δ makes the neurons operate in the bistable regime. Actually, f-I curves are not so nonlinear in the
operating range relevant in most of our simulations and in many biological situations. In such cases, network-based spatial filtering is in operation, as was briefly addressed in the pioneering work of Wilson and Cowan (1973) and was suggested by theoretical and experimental results on feature detection (Ben-Yishai et al., 1995; Hansel & Sompolinsky, 1998; Shamma, 1998; Bressloff et al., 2000; Seriès et al., 2004). Note that a network with strong localized inhibition yields a combination of spatial filtering and temporal filtering caused by oscillations. Although networks of McCulloch-Pitts neurons allow only periods 1 and 2, more realistic continuous-time neural networks accommodate general oscillation frequencies and flexible temporal filtering. Spatial common noise inputs are filtered in the same manner as bias inputs; spatial common noise widens only the PDF of the corresponding order parameter and desynchronizes this mode.

6.2 Oscillations of Spatial Modes. Many sensory inputs and reverberating recurrent inputs are spatially limited. Accordingly, oscillations often occur only in a limited part of a neural circuit. However, compared with bumps or global oscillations, localized oscillations have drawn much less attention in terms of phenomenology and functionality. Both global and localized oscillations occur via flip bifurcations, which are discrete-time versions of the Hopf bifurcations of continuous-time systems. Hansel and Sompolinsky (1998) already suggested the possibility of localized oscillations in their analog-neuron model. In another model, a Hopf bifurcation of the localized activity was proved, but numerical results suggested that it is subcritical and unobservable (Pinto & Ermentrout, 2001). Examples of stable oscillatory bumps are found in reaction-diffusion systems of FitzHugh-Nagumo type (Koga & Kuramoto, 1980; Nishiura & Mimura, 1989), networks of integrate-and-fire neurons (Laing & Chow, 2001), and thalamic network models in which neurons are capable of postinhibitory rebound firing (Rubin et al., 2001). Because the neurons used in these studies are intrinsic oscillators, it is natural that their networks exhibit oscillations. Many pyramidal neurons in a noise-driven regime lack inherent oscillatory properties, although their networks often show oscillations. Delayed inhibitory feedback loops are a candidate mechanism for such oscillations in thalamocortical (Destexhe & Sejnowski, 2003) and electrosensory (Doiron et al., 2003) systems. We have treated the mixture of global and localized oscillations with different types of noise.

Acknowledgments

We thank K. Hamaguchi, G. Tanaka, and N. Konno for helpful discussions. This work is supported by the Special Postdoctoral Researchers Program of RIKEN and by a Grant-in-Aid for Scientific Research on Priority Areas
14084212, 17022012 and for Scientific Research (C) 16500093 from the Ministry of Education, Culture, Sports, Science, and Technology of Japan.

References

Amari, S. (1977). Dynamics of pattern formation in lateral-inhibition type neural fields. Biol. Cybern., 27, 77–87.
Ben-Yishai, R., Bar-Or, R. L., & Sompolinsky, H. (1995). Theory of orientation tuning in visual cortex. Proc. Natl. Acad. Sci. USA, 92, 3844–3848.
Bressloff, P. C., Bressloff, N. W., & Cowan, J. D. (2000). Dynamical mechanism for sharp orientation tuning in an integrate-and-fire model of a cortical hypercolumn. Neural Comput., 12, 2473–2511.
Compte, A., Brunel, N., Goldman-Rakic, P. S., & Wang, X.-J. (2000). Synaptic mechanisms and network dynamics underlying spatial working memory in a cortical network model. Cereb. Cortex, 10, 910–923.
Dayan, P., & Abbott, L. F. (2001). Theoretical neuroscience. Cambridge, MA: MIT Press.
Destexhe, A., & Sejnowski, T. J. (2003). Interactions between membrane conductances underlying thalamocortical slow-wave oscillations. Physiol. Rev., 83, 1401–1453.
Doiron, B., Chacron, M. J., Maler, L., Longtin, A., & Bastian, J. (2003). Inhibitory feedback required for network oscillatory responses to communication but not prey stimuli. Nature, 421, 539–543.
Ermentrout, G. B., & Cowan, J. D. (1979a). A mathematical theory of visual hallucination patterns. Biol. Cybern., 34, 137–150.
Ermentrout, G. B., & Cowan, J. D. (1979b). Temporal oscillations in neuronal nets. J. Math. Biol., 7, 265–280.
Ernst, U. A., Pawelzik, K. R., Sahar-Pikielny, C., & Tsodyks, M. V. (2001). Intracortical origin of visual maps. Nat. Neurosci., 4(4), 431–436.
Gerstner, W. (2000). Population dynamics of spiking neurons: Fast transients, asynchronous states, and locking. Neural Comput., 12, 43–89.
Gutkin, B. S., Laing, C. R., Colby, C. L., Chow, C. C., & Ermentrout, B. (2001). Turning on and off with excitation: The role of spike-timing asynchrony and synchrony in sustained neural activity. J. Comput. Neurosci., 11, 121–134.
Hamaguchi, K., Okada, M., Yamana, M., & Aihara, K. (2005). Correlated firing in a feedforward network with Mexican-hat-type connectivity. Neural Comput., 17, 2034–2059.
Hansel, D., & Sompolinsky, H. (1998). Modeling feature selectivity in local cortical circuits. In C. Koch & I. Segev (Eds.), Methods in neuronal modeling: From ions to networks (2nd ed., pp. 251–291). Cambridge, MA: MIT Press.
Hellwig, B. (2000). A quantitative analysis of the local connectivity between pyramidal neurons in layer 2/3 of the rat visual cortex. Biol. Cybern., 82, 111–121.
Koga, S., & Kuramoto, Y. (1980). Localized patterns in reaction-diffusion systems. Prog. Theor. Phys., 63(1), 106–121.
Laing, C. R., & Chow, C. C. (2001). Stationary bumps in networks of spiking neurons. Neural Comput., 13, 1473–1494.
Marcus, C. M., & Westervelt, R. M. (1989a). Stability of analog neural networks with delay. Phys. Rev. A, 39(1), 347–359.
Marcus, C. M., & Westervelt, R. M. (1989b). Dynamics of iterated-map neural networks. Phys. Rev. A, 40(1), 501–504.
Masuda, N., Doiron, B., Longtin, A., & Aihara, K. (2005). Coding of temporally varying signals in networks of spiking neurons with global delayed feedback. Neural Comput., 17, 2139–2175.
Miyawaki, Y., & Okada, M. (2003). A network model of perceptual suppression induced by transcranial magnetic stimulation. Neural Comput., 16, 309–331.
Nishiura, Y., & Mimura, M. (1989). Layer oscillations in reaction-diffusion systems. SIAM J. Appl. Math., 49(2), 481–514.
Pinto, D. J., & Ermentrout, G. B. (2001). Spatially structured activity in synaptically coupled neuronal networks: II. Lateral inhibition and standing pulses. SIAM J. Appl. Math., 62(1), 226–243.
Redish, A. D., Elga, A. N., & Touretzky, D. S. (1996). A coupled attractor model of the rodent head direction system. Network, 7, 671–685.
Rubin, J., Terman, D., & Chow, C. C. (2001). Localized bumps of activity sustained by inhibition in a two-layer thalamic network. J. Comput. Neurosci., 10, 313–331.
Seriès, P., Latham, P. E., & Pouget, A. (2004). Tuning curve sharpening for orientation selectivity: Coding efficiency and the impact of correlations. Nat. Neurosci., 7(10), 1129–1135.
Shamma, S. (1998). Spatial and temporal processing in central auditory networks. In C. Koch & I. Segev (Eds.), Methods in neuronal modeling: From ions to networks (2nd ed., pp. 411–460). Cambridge, MA: MIT Press.
Sholl, D. A. (1956). The organization of the cerebral cortex. New York: Wiley.
Wilson, H. R., & Cowan, J. D. (1973). A mathematical theory of the functional dynamics of cortical and thalamic nervous tissue. Kybernetik, 13, 55–80.
Received March 28, 2005; accepted December 20, 2006.
LETTER
Communicated by Alessandro Treves
Imposing Biological Constraints onto an Abstract Neocortical Attractor Network Model Christopher Johansson [email protected] School of Computer Science and Communication, Royal Institute of Technology, SE-100 44 Stockholm, Sweden
Anders Lansner [email protected] School of Computer Science and Communication, Royal Institute of Technology, and Stockholm University, SE-100 44 Stockholm, Sweden
In this letter, we study an abstract model of neocortex based on its modularization into mini- and hypercolumns. We discuss a full-scale instance of this model and connect its network properties to the underlying biological properties of neurons in cortex. In particular, we discuss how the biological constraints put on the network determine the network’s performance in terms of storage capacity. We show that a network instantiating the model scales well given the biologically constrained parameters on activity and connectivity, which makes this network interesting also as an engineered system. In this model, the minicolumns are grouped into hypercolumns that can be active or quiescent, and the model predicts that only a few percent of the hypercolumns should be active at any one time. With this model, we show that at least 20 to 30 pyramidal neurons should be aggregated into a minicolumn and at least 50 to 60 minicolumns should be grouped into a hypercolumn in order to achieve high storage capacity.
1 Introduction

Over the last couple of decades, it has been speculated that global patterns of activity exist in the neocortex that, to a first, very rough approximation, can be modeled by a diluted attractor network operating with sparse random patterns of activity (Fransén & Lansner, 1998; Lansner, Fransén, & Sandberg, 2003; Little & Shaw, 1975; Palm, 1980, 1981; Rolls & Treves, 1998). Here we investigate the properties of a full-scale instance of such an abstract cortical model where as many aspects of the model as possible are inferred from and constrained by quantitative data on neocortex. The purpose of this study is to explore the domain in parameter space in which this type of model functions well as an autoassociator. From an engineering perspective, it is

Neural Computation 19, 1871–1896 (2007)
C 2007 Massachusetts Institute of Technology
interesting to investigate the capabilities of this type of neocortex-inspired system and identify possible bottlenecks of the design. A critique of the approach of modeling neocortex with a diluted modular attractor network was presented by O’Kane and Treves (1992a, 1992b), who found that the number of unique and stable states (i.e., memories) in such a network trained with uniformly distributed random patterns scales not with the number of units but with the number of connections per unit. From small to large cortices, the number of connections per neuron is around 10,000, and the number of memories that can be stored would thus be constant and limited to the order of thousands (Treves, 2005). More specifically, O’Kane and Treves studied multimodular networks where subgroups (i.e., modules) of neurons instantiate small attractor networks, which in turn are connected by a fixed number of long-range connections to form an attractor network that spans over a substantial part of neocortex. In a subsequent study by Fulvi Mari and Treves (1998), two modifications of these global attractor networks were proposed; first, the random patterns are drawn from a probability distribution matched to the connectivity, and second, only a fraction of the modules are allowed to be simultaneously active. This greatly increased the storage capacity and made it scale with the number of units and connections (Fulvi Mari & Treves, 1998; Fulvi Mari, 2004). In order to enable simulation studies of the networks discussed by Fulvi Mari and Treves, an abstract version based on a Potts network has been proposed (Kropff & Treves, 2005; Treves, 2005). In a Potts network, each unit can have two or more different states, and the active state is determined by a winner-take-all competition among the possible states (Kanter, 1988; Bolle, Cools, Dupont, & Huyghebaert, 1993). In this abstract version of the O’Kane and Treves model, each state of a Potts unit represents a certain attractor state in one of the smaller subnetworks. Our cortical model, presented in this letter, is also implemented with a Potts type of network. In this type of modular attractor networks, the storage capacity scales with the connectivity between modules, and it is therefore important that each functional state has a high connectivity to the rest of the network (Kropff & Treves, 2005). This is achieved by letting the attractor states, instead of single neurons in the subnetworks, represent the functional states. Each of these attractor states is composed of several neurons and therefore can have a high connectivity to the rest of the network. Anatomical studies of neocortex clearly show that its neurons are not randomly wired together, although they are extensively recurrently connected (Scannell, Blakemore, & Young, 1995). Instead, cortical neurons are typically organized into local clusters of different sizes ranging from minicolumns and hypercolumns up to cortical areas. Cortical areas are in turn organized in an approximately hierarchical fashion where areas receiving sensory input and sending motor outputs are at the bottom level, and associative areas such as prefrontal and inferior temporal cortex are at the top level.
In our model, the minicolumn, an aggregation of about 100 neurons, is the smallest functional unit, ensuring that we have a high connectivity in the network. The minicolumns are grouped into modules called hypercolumns that normalize the minicolumns’ activities (Hubel & Wiesel, 1977; Amirikian & Georgopoulos, 2003). In a hypercolumn module, the functional states, that is, the attractor states, are represented and spatially localized in minicolumns. Precisely as in the networks studied by Treves and others, we allow a fraction of the modules (i.e., the hypercolumns) in our network to enter a quiescent state, which alleviates the scaling problem of a modular attractor network with fixed-size modules. The input to the network consists of uniformly distributed random sparse patterns. The connectivity of the network is incrementally set up to match the input by selectively creating synapses at a restricted number of potential sites, which ensures that all synapses are used efficiently. This also fulfills the first requirement for having a high storage capacity in a modular attractor network with a fixed number of connections per unit, matching the connectivity to the patterns of activity. Clearly, a model of neocortex should further consider the fact that cortical connectivity is spatially structured. Studies of attractor networks with dense local connectivity and sparse global connectivity have recently been conducted by a number of authors (Roudi & Treves, 2004, 2006; Bohland & Minai, 2001; Masuda & Aihara, 2004; McGraw & Menzinger, 2003; Morelli, Abramson, & Kuperman, 2004). There are several advantages of having spatially organized connectivity in cortex and hence also in a model of it. It reduces the variance in the signal term of synaptic support and thus enhances the storage capacity (Johansson, Rehn, & Lansner, 2006). Also, having an extensive local connectivity and only a small amount of global connectivity in a network reduces the length of wiring, which saves both space and energy (Bohland & Minai, 2001; Morelli et al., 2004). These topics are, however, not further investigated in this letter. 2 Neocortex The mammalian neocortex is a homogeneous structure that is very similar across different species regardless of large differences in size; for example, the human cortex is about three orders of magnitude larger than that of mouse (Braitenberg & Schuz, 1998). The human neocortex contains approximately n = 2 · 1010 neurons, that of a macaque monkey about 2 · 109 neurons, that of a cat around 4 · 108 neurons, and that of a mouse roughly 2 · 107 neurons (Johansson & Lansner, 2007). Neocortex is modularized into horizontal layers and vertical columns. There are smaller columns with a diameter less than 100 µm containing about 100 neurons called minicolumns (Buxhoeveden & Casanova, 2002b; Amirikian & Georgopoulos, 2003) and larger columns with a diameter less than 1000 µm called hypercolumns (Hubel & Wiesel, 1977; Mountcastle, 1997; Amirikian
& Georgopoulos, 2003). In mouse, the minicolumn diameter is between 20 and 40 µm, and in cat, monkey, and human, the minicolumn diameter is somewhat larger: 30 to 60 µm (Buxhoeveden & Casanova, 2002a). For rats, cats, and monkeys, there are physiological data supporting the idea that minicolumns are a functional unit in neocortex, but for mice there is no such support (Buxhoeveden & Casanova, 2002a). The hypercolumns vary a great deal in size between different cortical areas and different species, but they typically have a diameter of 300 to 600 µm and contain about 100 minicolumns (Buxhoeveden & Casanova, 2002b; Mountcastle, 1997). In the visual cortex, neurons that represent, for example, line orientations have been found to compete for activation within a hypercolumn (Hubel & Wiesel, 1977). During evolution, the size of neocortex has primarily been increased by increasing the number of minicolumns rather than their size; at the same time as the number of minicolumns is increased by orders of magnitude, the thickness of cortex is only marginally increased (Rakic, 1995). This is a strong argument for the minicolumn as being a generic cortical building block, and it is likely that it also has a functional importance (Buxhoeveden & Casanova, 2002a). The average number of synapses per neuron is roughly constant at 8000 in cortices ranging in size from mouse to human (Braitenberg & Schuz, 1998; Beaulieu & Colonnier, 1989; Pakkenberg et al., 2003). In layer III and also in layer V, there are a few large pyramidal neurons that have about 20,000 to 30,000 synapses. A characteristic feature of neocortex is that it has a patchy layout of the connectivity (Bosking, Zhang, Schofield, & Fitzpatrick, 1997; Gilbert and Wiesel, 1989; DeFelipe, Conley, & Jones, 1986; Goldman & Nauta, 1977). A pyramidal neuron gives off axon collaterals that terminate in a number of clusters, and one such terminal cluster covers an area roughly corresponding to that of a hypercolumn. A pyramidal neuron’s intracortical axon collaterals typically give off close to 10 terminal clusters within a radius of a few millimeters, and an unknown number of distantly located terminal clusters are given off by the axon collaterals entering the white matter. 3 An Abstract Model of Neocortex In this section, we present our abstract cortical model, focusing on the cortex’s recurrent connectivity and autoassociative function (Johansson & Lansner, 2004, 2007; Lansner & Holst, 1996; Lansner et al., 2003; Johansson et al., 2006). In this model, the minicolumn is the smallest functional unit, and minicolumns are bundled into hypercolumns. In a minicolumn, there are about 100 neurons that are extensively connected, and here we approximate this local network as one of mutually excitatory pyramidal neurons. If one or a few pyramidal neurons are activated in a minicolumn, all other pyramidal neurons in the minicolumn are likely to become activated, and
this means that from a global network perspective, we can approximate their function as an atomic unit. The total number of neurons in a minicolumn is denoted by V, and the number of pyramidal neurons that are devoted to receiving long-range communication is denoted by Vrecv . The total number of synapses counted for all neurons in a minicolumn is around 8 · 105 , and we have previously estimated that K = 6 · 105 of these are dedicated to receiving inputs from nonlocal neurons—those located in other hypercolumns (Johansson & Lansner, 2007). The minicolumns are grouped into hypercolumns, and inside these hypercolumns, the minicolumns, are reciprocally connected to a small number of basket neurons (Johansson & Lansner, 2007). This construct serves to regulate the activity in the minicolumns by implementing a competitive mechanism. In the model, this competitive mechanism of the units’ activities is paired with a threshold mechanism that is used to set a hypercolumn quiescent; if none of the units in a hypercolumn receives excitation that exceeds the threshold, none of the units become active, and the hypercolumn is quiescent. In the other case, when one or several units receive strong excitatory input, a soft winner-takes-all (WTA) takes place, enhancing the response of the unit with the strongest input. The peakiness of the soft WTA function is controlled by a gain parameter, and for large values of this parameter, the soft WTA is turned into a strict WTA function, and that is what we use in this letter. Hypercolumns with one active unit are called active, and A denotes their number. Based on the neuron counts in the cortices of different mammals, we can compute how large, equivalently sized instances of our abstract model should be. Given that a minicolumn is composed of about V = 100 neurons and a hypercolumn is composed of U = 100 minicolumns, a human-sized instance of our model has N = HU = 2 · 108 units and H = 2 · 106 hypercolumns, a macaque-sized instance has N = 2 · 107 and H = 2 · 105 , and a cat-sized instance has N = 4 · 106 and H = 4 · 104 . 3.1 Connectivity in the Model. As in neocortex, the connections extending out from each unit in the model are confined to patches extending over all units in a hypercolumn. More precisely, we assume that a pyramidal neuron gives off axon collaterals that terminate in randomly located terminal clusters, and in each such cluster the collaterals uniformly spread over an area corresponding to a hypercolumn. In a minicolumn, there are V neurons, each sending out axons that terminate in B different hypercolumns. A connection supported by S synapses can be formed between the presynaptic minicolumn and one of the postsynaptic minicolumns in any of the B innervated hypercolumns. A mounting number of experiments show that the formation of new synaptic contacts is an important mechanism underlying the storage of long-term memories in the brain (Chklovskii, Mel, & Svoboda, 2004).
Synapses can be formed between two neurons if their axons and dendrites are located close enough to each other. For each synapse present in neocortex, it has been computed, based on a geometrical analysis of anatomical data, that there are between three and nine times as many locations where synapses could potentially be formed (Stepanyants & Chklovskii, 2005; Stepanyants, Hof, & Chklovskii, 2002). In the following, we assume that synapses are formed at locations determined by the input patterns and that each synapse is formed at one of T possible locations. Given that an axon of a pyramidal neuron on average makes K/V distantly located synapses and T times as many potential synapses dedicated to global communication, and that it has B/V terminal clusters, we can compute the average number of potential synapses made between the neurons in a pre- and postsynaptic minicolumn: S = KT/(BU). This means that when a presynaptic minicolumn is active simultaneously with a postsynaptic minicolumn, an average number of S synapses can be formed and potentiated between them.

The resolution of the synaptic efficacy for a single synaptic contact is limited to a few discrete levels (Montgomery & Madison, 2004). If the average neuron has 50 terminal clusters and T = 3–9, the average minicolumn-to-minicolumn connection is represented by S = 3.6–10.8 synapses. This suggests that each such connection should be modeled with a weight value whose precision is limited to a few bits, probably less than 10 bits. In this letter, we limit ourselves to implementing and analyzing an instance of our model that has binary connection weights because it is simple and instructive.

3.2 Amount of Synaptic Input Required to Activate a Minicolumn. In order for a minicolumn to enter an active state, its pyramidal neurons need a fair amount of synaptic input from the global network, that is, a sufficiently large summed excitatory postsynaptic potential (EPSP). Here we estimate that this requirement is met if at least one pyramidal neuron in a minicolumn gets at least 20 synaptic inputs from the global network within a short period of time (M. Lundqvist, personal communication, 2006), based on knowledge acquired through simulations done using a compartmental model with Hodgkin-Huxley current dynamics (Lundqvist, Rehn, Djurfeldt, & Lansner, 2006; Lundqvist, Rehn, & Lansner, 2006). Given that the potential contact sites where synapses can be made are uniformly and randomly distributed, we can compute the number of expected synaptic inputs to a pyramidal neuron as X ∈ Bin(Y, 1/Vrecv), where Y is the number of synapses made with neurons located in active presynaptic minicolumns. The cumulative distribution function for this synaptic input is F(k) = Pr(X ≤ k), and the distribution function for the maximum input to any of the neurons in the minicolumn is F^Vrecv. Using these distribution functions, we can compute the expected mean number of inputs to each of the neurons and also the mean number of inputs to the neuron with the largest input for a population of minicolumns.
If Vrecv = V = 100, we can compute, using the criterion that the mode of the maximum-input distribution implied by F^Vrecv equals 20, that a total number of Y = 1100 synaptic inputs to the minicolumn is needed in order to get at least one neuron in it that receives 20 synaptic inputs with high probability. For this value of Y, the average neuron in the postsynaptic minicolumn has about 11 synaptic inputs.

If we assume a more realistic and detailed model of the minicolumn and its neural structure that accounts for the existence of large pyramidal neurons in layers III and V, which have large dendritic trees stretching up into layer I, we get a different value of Y. Given that we have Vrecv = 20 large pyramidal neurons, each with K/Vrecv = 3 · 10^4 synapses devoted to receiving global intercortical communication, the total number of synaptic inputs needed for each minicolumn is reduced to Y = 270. We still have the same number K of synapses dedicated to global communication and V = 100 neurons sending out axons globally in cortex. For these values of Vrecv and Y, the average neuron receiving input has about 13 synaptic inputs. This means that the number of connections, and hence also the storage capacity, is potentially increased by up to a factor of 5 compared with the more simplistic model where all neurons in a minicolumn take equal part in receiving global communication. Thus, according to these calculations, there is a big advantage to having a few large neurons handle the long-range communication.
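The values of Y quoted above can be checked numerically. The following short sketch is our illustration (not part of the original model code); it uses scipy and the binomial machinery of section 3.2 to locate the mode of the maximum-input distribution:

```python
import numpy as np
from scipy.stats import binom

def max_input_mode(Y, V_recv):
    """Mode of the largest synaptic input over the V_recv receiving neurons
    of a minicolumn when Y inputs land uniformly at random on them:
    X ~ Bin(Y, 1/V_recv), and the maximum has cdf F(k)**V_recv."""
    k = np.arange(Y + 1)
    F = binom.cdf(k, Y, 1.0 / V_recv)                # cdf of one neuron's input
    F_max = F ** V_recv                              # cdf of the maximum input
    p_max = np.diff(np.concatenate(([0.0], F_max)))  # pmf of the maximum input
    return int(np.argmax(p_max))

print(max_input_mode(1100, 100))  # ~20 inputs for the Vrecv = 100 case
print(max_input_mode(270, 20))    # ~20 inputs for the Vrecv = 20 case
```

The mean inputs quoted in the text also follow directly from these parameters: Y/Vrecv = 1100/100 = 11 and 270/20 = 13.5.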
3.3 Activity Level in the Neocortical Model. Is it possible to estimate the fraction of active hypercolumns, a = A/H, in the model based on metabolic constraints? Considering the metabolic costs of neural activity, it is clear that only a small fraction of all cortical neurons can be active at any particular moment. Lennie (2003) has computed, based on these metabolic costs, that a cortical neuron in humans can fire on average 0.16 spikes per second. For freely moving rats, the mean firing rates of cortical neurons range between 0.15 and 16 Hz (Attwell & Laughlin, 2001).

Now we fit these data onto the model. In an active hypercolumn, we assume that the V neurons of its single active minicolumn (a fraction 1/U of its VU neurons) fire at 16 Hz, the rest being completely silent due to inhibition generated by the active neurons. This means that 16V spikes per second are generated. For a quiescent hypercolumn, we assume that all of its VU neurons fire at a baseline level of 0.16 Hz, which means that 0.16VU = 16V spikes per second are produced. Thus, both active and quiescent hypercolumns generate spikes at an equal rate, and an arbitrary number of hypercolumns can therefore be active. Even if the neurons in the minicolumns located in active hypercolumns fire at 100 Hz instead of 16 Hz, we can have on the order of 10% active hypercolumns, given that we decrease the baseline level of activity for neurons in quiescent hypercolumns from 0.16 Hz to 0.12 Hz. Thus, cortical metabolism only loosely constrains the level of activity in the cortical model.
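The budget argument in this paragraph amounts to one line of arithmetic; the following sketch (ours, with V and U as in the text) makes it explicit:

```python
import math

V, U = 100, 100                      # neurons per minicolumn, minicolumns
                                     # per hypercolumn, as in the text

# Active hypercolumn: the V neurons of its single winning minicolumn fire
# at 16 Hz; all other neurons are silenced by inhibition.
spikes_active = 16 * V               # 1600 spikes per second

# Quiescent hypercolumn: all V*U neurons fire at the 0.16 Hz baseline.
spikes_quiescent = 0.16 * V * U      # 1600 spikes per second

# Both kinds of hypercolumn generate spikes at the same rate, so the
# metabolic budget does not limit the number of active hypercolumns.
assert math.isclose(spikes_active, spikes_quiescent)
```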
Intrinsic signal imaging of the inferior temporal cortex shows that patches covering about 1% to 5% of the area studied exhibit increased neural activity (Tsunoda, Yamane, Nishizaki, & Tanifuji, 2001). This suggests that only a few percent of the hypercolumns are active in a cortical area when it processes information. Furthermore, it is reasonable to assume that only a fraction of all cortical areas are engaged in the processing of information at any one moment, and this leads to the conclusion that about 1% or less of all hypercolumns are active in neocortex.

In conclusion, an arbitrary number of hypercolumns can be active given the metabolic costs of neural activity (Lennie, 2003), the typical firing frequencies of cortical neurons (Attwell & Laughlin, 2001), and the model parameter U = 100. But experimental data suggest that at most a few percent of the hypercolumns, and hence only a fraction of about 0.01% of the minicolumns and neurons, are active in a functional sense at any one moment in cortex (Tsunoda et al., 2001).

3.4 An Attractor Network Instantiation of the Cortical Model. The model instantiates an attractor network similar to a Potts type of network, where each Potts unit is represented by a hypercolumn and the different states of a Potts unit are represented by minicolumns. In this section, we present an implementation of the model based on a Willshaw type of network that has binary weights and a Hebbian learning rule (Willshaw, Buneman, & Longuet-Higgins, 1969). This type of network is easier and more instructive to analyze than one with real-valued weights such as the Hopfield type of network (Buhmann, Divko, & Schulten, 1989; Tsodyks & Feigelman, 1988). At the same time, the former type of network has only a slightly smaller optimal storage capacity than the latter: 0.69 instead of 0.72 bits per unique synapse (Willshaw et al., 1969; Buhmann et al., 1989; Tsodyks & Feigelman, 1988).

Each hypercolumn in the network has Uimpl = U + 1 units, and there are H hypercolumns. The network has a total number of Nimpl = HUimpl units (minicolumns), and N of these units have Z connections (see Figure 1). The network is trained with L unary coded random patterns, ξ^µ. In these patterns, the activity of one unit in each hypercolumn is set to 1 and the activity of the others to 0. The retrieval procedure is based on a local competition for activation between the units of each hypercolumn. First, the support, s_j, is computed for all units in a hypercolumn with equation 3.1, and then a WTA function is used to set one unit active. The hypercolumns are asynchronously updated in this way. Here, w_ij is a connection weight, β_j is a unit's bias, and o_i is the activity of a unit:

s_j = β_j + Σ_{i=1}^{N_impl} w_ij o_i.    (3.1)
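To make the retrieval step concrete, here is a minimal sketch in Python (our illustration, not the authors' implementation) of one asynchronous hypercolumn update using equation 3.1 and the strict WTA:

```python
import numpy as np

def update_hypercolumn(o, w, beta, q):
    """One asynchronous retrieval update of a single hypercolumn.
    o    -- binary activity vector over all N_impl units (modified in place)
    w    -- weight matrix, w[i, j] is the connection from unit i to unit j
    beta -- bias vector (0 for the quiescence unit, -M otherwise, eq. 3.3)
    q    -- index array of the units belonging to this hypercolumn
    """
    s = beta[q] + o @ w[:, q]        # support values, equation 3.1
    o[q] = 0
    o[q[np.argmax(s)]] = 1           # strict winner-takes-all
    return o
```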
Figure 1: A network as it is implemented with three hypercolumns, H = 3, each with three units, Uimpl = 3, per hypercolumn. The last unit, representing the quiescent state of a hypercolumn, does not have incoming or outgoing connections. This network has four blocks of connections, and the hypercolumn in the center has two blocks of connections: B = 2. The blocks of potential connections are given by the masking matrix y_ij, and the weight matrix w_ij represents only the potentiated weights.
The Willshaw type of learning rule is presented in equations 3.2 and 3.3. In these equations, h is the index of a particular hypercolumn, and Q_h is the set of indices of all units belonging to hypercolumn h. Here, y_ij is a masking matrix controlling which connections are possible, and M is a threshold that controls how likely the hypercolumn is to be set quiescent. We also have that Q_h ∩ Q_k = Ø, ∀h, k : h ≠ k:

w_ij = 1 if ξ_i^µ ∧ ξ_j^µ ∧ y_ij for some µ ∈ {1, . . . , L}, and w_ij = 0 otherwise, ∀i, j    (3.2)

β_i = 0 if i = max(Q_h), and β_i = −M otherwise, ∀i : i ∈ Q_h, for each h = {1, . . . , H}    (3.3)

All weights between units in the same hypercolumn are set to zero:

y_ij = 0 ∀i, j : i ∈ Q_h ∧ j ∈ Q_h, for each h = {1, . . . , H}.    (3.4)
Further, all weights onto and from the last unit in each hypercolumn are set to zero, precisely as in Kropff and Treves (2005):

y_ij = 0 ∀i, j, h, k : i = max(Q_h) ∨ j = max(Q_k).    (3.5)
Equation 3.5 means that the threshold is implemented with one dedicated unit (the last one) in each hypercolumn, which is represented only by a bias value. If the last unit wins the competition for activation in a hypercolumn, this hypercolumn and its units do not contribute to the synaptic input of any other unit in the network, and the hypercolumn is thus referred to as quiescent. The unit representing the quiescent state of a hypercolumn is in fact not needed if the threshold is implemented for each and every one of the N units.

The patterns we use, ξ^µ, are vectors built up by concatenating H smaller vectors that each represents the activity in one of the hypercolumns. In exactly H − A of these vectors, the last unit is set active, representing the quiescent state. Taken over all patterns, the last unit in a hypercolumn is set active with a frequency 1 − a = (H − A)/H and the other units with a frequency a = A/H. The entropy (information content) of one of these patterns is approximated by I_pattern = −Ha log2(a/U) − H(1 − a) log2(1 − a).

Because the connectivity Z/((H − 1)U) is very low and connections are created where they are needed, the primary factor limiting the storage capacity is the number of connections. This means that when the maximal storage capacity is attained, the average unit has Z connections potentiated, and only units with correlated activity will have potentiated connections between them. In this case, the weight matrix w_ij contains ZHU = HBU²/T nonzero elements, which is a fraction 1/T of the total number of nonzero elements in the masking matrix y_ij.

3.5 Evaluating the Performance of the Model. Attractor networks are typically evaluated based on their storage capacity, and therefore one objective way to benchmark the performance of the cortical model is by its storage capacity. The number of patterns that can be stored in a modular attractor network of the type that our cortical model instantiates has been found to scale as

L_stored = ZN/A,    (3.6)
where Z is the number of real-valued connections per unit and A is the number of active hypercolumns (Kropff & Treves, 2005). The storage capacity is also commonly given as the number of bits per connection that can be stored by the network. From equation 3.6 it follows, given that Z is a constant, that it is critical to minimize the number of active hypercolumns A in order to maximize L_stored.
Further, if both Z and a = A/N are fixed, the storage capacity L_stored is constant and independent of the network size. However, the information content of each stored pattern increases with increased network size N.

4 Results

Here we investigate the performance of the abstract cortical model in terms of storage capacity, given the biological parameters and constraints presented in sections 3.1 and 3.2 on the connectivity and activity levels. The implementation of this model, presented in section 3.4, imposes additional constraints on the performance that are also considered. The former constraints are quantified in sections 4.2 and 4.3 and the latter constraint in section 4.1.

4.1 A Probabilistic Stability Analysis. A lower bound on the activity level can be established by considering the signal-to-noise ratio of a unit's support in the case of a fully loaded memory, that is, when the maximal number of patterns that it can hold has been stored. In this limit, all of the ZHU connections in the network have been potentiated. For this situation, we want a good separation between the strongest noise support and the weakest signal support a unit can receive, and here we compute these values as a function of the activity A.

The noise input to a unit in the case of maximal storage is X_noise ∈ Bin(A, Z/((H − 1)U)), where all variables but A are fixed. The cumulative distribution function for this noise term is F_noise(k) = Pr(X_noise ≤ k), 0 ≤ k ≤ A, and that of the maximum noise input to any one of the N units is F_noise max(k) = F_noise(k)^N. Using this distribution, we can compute a support value, X_max, that is very unlikely to be exceeded by any of the units, as X_max : (1 − F_noise max(X_max)) = 0.01.

The signal term to an active unit in the network is X_signal ∈ Bin(A, B/H), where we again have only A as a free parameter. The cumulative distribution function for this signal term is F_signal(k) = Pr(X_signal ≤ k), 0 ≤ k ≤ A, and the corresponding probability distribution function is f_signal(k). The minimum input to any one of the A units participating in the active pattern is distributed as F_signal min(k) = 1 − (1 − F_signal(k))^A. For this minimal signal term, we want the following condition satisfied: F_signal min(X_max) ≤ 0.01.

Next, we minimize A given the two constraints above. When these two constraints are fulfilled, the probability that a given pattern is stable is (1 − 0.01)² ≈ 0.98. The noise term X_noise ∈ Bin(A, Z/((H − 1)U)) ≈ Bin(A, Z/(HU)) can be rewritten, using the relation Z = K/S = BU/T, as X_noise ∈ Bin(A, B/(TH)). From this expression, it is evident that by increasing T, we get a smaller noise term. Less evident is the fact that an increase in B also gives a better signal-to-noise ratio. This factor is present also in the signal term, and
increasing it makes the distributions of the signal and noise terms move apart.

4.2 The Synaptic Input Constraint. For a minicolumn to become active, it needs Y synaptic inputs from presynaptic neurons, which for the implemented network means that a unit requires at least Y/S inputs from active presynaptic units. Here we compute what values of A guarantee this level of synaptic input. The average number of synaptic inputs to a minicolumn is Y = SAB/H, and we can numerically compute the mean input, mean_signal min, to the one minicolumn of the A active ones that has the smallest amount of input, using F_signal min. The number of synaptic inputs to this minicolumn is S · mean_signal min, and if the requirement on synaptic input is to be met for this unit, we must have mean_signal min ≥ Y/S = YBU/(KT). This constraint is met by adjusting A. We conclude that decreasing Y makes it possible to decrease A, which in turn increases the storage capacity.

4.3 Optimizing the Storage Capacity with Respect to Z. From the scaling of the storage capacity, equation 3.6, it is evident that increasing Z improves the network's storage capacity. Here, S = KT/(BU) is the number of synapses devoted to each minicolumn-to-minicolumn connection, and the number of connections per minicolumn is

Z = K/S = BU/T : S ≥ 1.    (4.1)
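The two constraints of sections 4.1 and 4.2 can be evaluated numerically. The following sketch is our illustration using scipy; it follows the binomial approximations of the text and scans A linearly for clarity (for cortex-sized H, a coarser search would be used in practice):

```python
import numpy as np
from scipy.stats import binom

def min_active_hypercolumns(H, U, B, T, S, Y, eps=0.01):
    """Smallest number A of active hypercolumns satisfying both the
    signal-to-noise criterion (section 4.1) and the synaptic-input
    criterion (section 4.2), under the binomial models of the text."""
    N = H * U                       # number of units (minicolumns)
    Z = B * U / T                   # connections per unit, equation 4.1
    for A in range(1, H + 1):
        k = np.arange(A + 1)
        # Noise: X_noise ~ Bin(A, Z/((H-1)U)); find an X_max that is very
        # unlikely to be exceeded by any of the N units.
        F_noise = binom.cdf(k, A, Z / ((H - 1) * U))
        F_noise_max = F_noise ** N
        i = np.searchsorted(F_noise_max, 1.0 - eps)
        if i >= len(k):
            continue                # no usable X_max for this A
        X_max = k[i]
        # Signal: X_signal ~ Bin(A, B/H); the weakest of the A active units.
        F_sig = binom.cdf(k, A, B / H)
        F_sig_min = 1.0 - (1.0 - F_sig) ** A
        snr_ok = F_sig_min[X_max] <= eps
        # Synaptic input: the weakest unit's mean input, times S synapses
        # per connection, must reach the Y inputs needed for activation.
        p_sig_min = np.diff(np.concatenate(([0.0], F_sig_min)))
        input_ok = S * (k * p_sig_min).sum() >= Y
        if snr_ok and input_ok:
            return A
    return None
```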
This means that if the number of terminal clusters per neuron or the hypercolumn size is increased, the storage capacity is increased. The equation also suggests that decreasing T would improve the storage capacity, but this is generally not the case, since a small T also gives a poor signal-to-noise ratio that prevents the network from increasing its storage capacity. Obviously, S must be at least 1, meaning that at least one synapse is used to represent each connection in the abstract model. Here we will assume that about 5 to 10 synapses are typically used. In a previous implementation of this abstract model, we assumed that S = 5, and we then had Z = 1.2 · 10^5 connections per unit (Johansson & Lansner, 2007).

4.4 Optimizing the Storage Capacity with Respect to A. Here we numerically find the lower bound on the fraction of active hypercolumns, a = A/H, that fits with the requirements on stability (see section 4.1) and synaptic input (see section 4.2). In Figure 2 we have plotted the lower bounds on the activity level as a function of the number of terminal clusters per neuron, given these two requirements, for a human-sized network with two different values of the parameter T: T = 3 and T = 9, and for two different values of Y: Y = 1100 when Vrecv = 100 and Y = 270 when Vrecv = 20.
Figure 2: The curves represent the lower bound on the fraction of active hypercolumns, a, as a function of the number of terminal clusters per neuron, B/V, for a human-sized network. The dashed line is the constraint derived in section 4.1, the dotted line is the constraint computed in section 4.2, and the solid line shows the combination of these two constraints, which represents the lower bound on the activity level.
For the parameters T = 9 and Y = 270, we get the lowest values of activity, and here we also find that the two constraints are relatively well matched to each other; both are close to their common lower bound on the activity level. In Figure 3 we have plotted the lower bound on the activity level for cat-, macaque-, and human-sized instances of the model with the parameters T = 9 and Y = 270. This plot shows that it is possible to have an activity level as low as about a = 1% in all three differently sized instances of the model, and given the scaling of the storage capacity, equation 3.6, this means that these differently sized networks all have a similar storage capacity. From the results presented in Figures 2 and 3, we conclude that the lowest possible activity levels in the model agree with the activity levels of 1% to 5% active hypercolumns suggested by the experimental intrinsic signal measurements discussed in section 3.3.
Figure 3: Lower bound on the fraction of active hypercolumns, a, as a function of the number of terminal clusters per neuron, B/V, plotted for three differently sized networks with the parameters T = 9 and Y = 270 (Vrecv = 20).
4.5 Estimating the Storage Capacity of Cortex-Sized Networks. In section 3.5 we presented the scaling of the storage capacity, and in this section we give a rough estimate of it and verify this estimate by simulations. Given that the activity level is set to the smallest permissible value, as defined by the two criteria on the signal-to-noise ratio between active and inactive units and on the EPSP strength discussed in sections 4.1 and 4.2, we can estimate the storage capacity of the network used here to instantiate the cortical model. This is done by computing the number of patterns that can be embedded before all of the available synapses have been used up. In this calculation, we assume that each connection takes part in giving input in response to only one of the stored patterns, an approximation that becomes better as T increases. This type of analysis is a crude oversimplification of the process of storing and retrieving patterns in an attractor network, but as seen in Figure 4, it approximates the storage capacity fairly well. In the simulation, we stop increasing the number of embedded patterns when all units on average have Z connections potentiated.

On embedding a pattern, A units are activated, and on average AB/H connections are potentiated on each of these active units. Thus, embedding a single pattern requires on average A²B/H connections. The total number of connections in the network is ZHU, where Z = BU/T, and the total number of patterns that can be stored is therefore ZH²U/(A²B) = H²U²/(TA²) = N²/(TA²) = 1/(Ta²) (with a = A/N, the fraction of active units).
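This estimate is easy to reproduce. The sketch below (ours) evaluates the expression for the human-sized parameter values quoted later in this section:

```python
def stored_patterns(H, U, T, a_hyper):
    """Rough capacity estimate of section 4.5: each stored pattern uses up
    about A**2 * B / H of the Z*H*U available connections, which gives
    L = N**2 / (T * A**2), with N = H*U units and A = a_hyper*H active
    hypercolumns. Our sketch of the estimate in the text."""
    N = H * U
    A = a_hyper * H
    return N ** 2 / (T * A ** 2)

# Human-sized instance: H = 2e6, U = 100, T = 9, a = 3.1% hypercolumns
print(f"{stored_patterns(2e6, 100, 9, 0.031):.2e}")  # ~1.16e6 patterns
```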
Figure 4: Simulated and estimated storage capacity as a function of the connectivity plotted for a network with H = 1000, Uimpl = 10, and T = 10. The discontinuities in the dashed line showing the estimated values of the storage capacity are due to integer truncation of the threshold M.
Here it is important to point out that the storage capacity does not in general scale like this; the estimate is valid only when the condition on the signal-to-noise ratio is fulfilled. When running the network, it is important to set the threshold M appropriately in order for it to function well. Using the distribution of the signal support, we can compute a value for the threshold as M ← F_signal(M) = 0.001, for which there is a probability of only 0.1% that an active hypercolumn accidentally becomes quiescent. The probability that an erroneous unit becomes activated in an active hypercolumn is in this case very small.

In Figure 4 we have plotted the storage capacity of a network with H = 1000, Uimpl = 10, and T = 10 for different values of B—the number of hypercolumns contacted by each unit—together with estimated values of the storage capacity.

A human-sized instance of the model with the parameters H = 2 · 10^6, U = 100, B = 50, Vrecv = 100, and T = 9 has a storage capacity of 1.1 · 10^6 patterns, each holding I_pattern = 790 Kbits of information, given that the network is run with the smallest possible fraction of a = 3.1% hypercolumns active. A macaque-sized instance of this network has a storage capacity of 1.3 · 10^6 patterns, each holding I_pattern = 75 Kbits of information, given that the network is run with a = 2.9% hypercolumns active. A cat-sized instance of this network has a storage capacity of 1.5 · 10^6 patterns, each holding
I_pattern = 14 Kbits of information, given that the network is run with a = 2.8% hypercolumns active. If, for the human-sized instance, we reduce the number of neurons that receive corticocortical input to Vrecv = 20, the storage capacity becomes 6.6 · 10^6 patterns, each holding I_pattern = 364 Kbits of information, given that the network is run with a fraction of a = 1.3% hypercolumns active. Now, if we either increase the size of the minicolumns to V = 200, or increase the number of minicolumns in a hypercolumn to U = 200 and reduce the number of hypercolumns to H = 10^6, 1.5 · 10^7 patterns can be stored, each holding I_pattern = 124 Kbits of information. For both of these parameter settings, the network is run with the smallest possible fraction of a = 0.85% hypercolumns active. We conclude that the storage capacity of the cortical model is on the order of millions, given that only a few percent of the hypercolumns and a few permille of the neurons are active at any one moment.

4.6 Storage Capacity Dependence on Minicolumn Size. The fraction of minicolumns (units) that must be active in each pattern is determined by two constraints: the signal-to-noise ratio constraint on the separation between the strengths of input to active and inactive minicolumns, and the absolute strength of the synaptic input to an active minicolumn. These two constraints also determine the storage capacity, as it is strongly dependent on the fraction of active units. Here we show that aggregating neurons into minicolumns, that is, increasing the size V of the minicolumns, increases the storage capacity in terms of patterns by allowing a smaller fraction of minicolumns to be active. This effect is very pronounced for networks with small minicolumns composed of only a few neurons, in which the signal-to-noise ratio constraint limits the storage capacity. When V is increased, the signal-to-noise ratio is improved, but the synaptic input is decreased, and for larger values of V, the number and strength of the synaptic inputs are the constraining factors.

Here we investigate the storage capacity for values of V ranging between 1 and 150. This is done for the fixed values of U = {150, 100, 70, 50, 30} and n = 2 · 10^10. For small values of V, the number of hypercolumns, H, and units, N, is large, and vice versa. When V = 1, the network is made up of single neurons aggregated into hypercolumns containing U neurons. We assume that a neuron has 50 terminal clusters, and when V increases, the arborization in each cluster increases slightly. If V is larger in large cortices, the increased arborization would contribute to a slightly thicker cortex, and here we implicitly assume this.

In Figure 5, the storage capacity for a human-sized network is plotted as a function of the minicolumn size, V, for networks with differently sized hypercolumns. It is important that V is kept relatively small, as the number of units, N, and also the information content of the patterns stored in the network decrease when V is increased. For values of V up to around 30, the signal-to-noise ratio constraint limits the storage capacity, and the storage capacity increases rapidly as V is increased.
Figure 5: Maximum storage capacity as a function of the minicolumn size V. Dotted lines mark the storage capacity when the signal-to-noise constraint is the limiting factor, and dashed lines are used to mark it when the strength of the EPSP is the limiting factor. Increasing V means that both the number of units N and the information content of each stored pattern decrease.
For larger values of V, the strength of the synaptic input limits the storage capacity. In this case, increasing V still gives a small increase in the number of stored patterns, but it also reduces the information stored in each pattern. For smaller-sized cortices, the scenario is much the same as for the large human-sized network, with the difference that the value of V where the rapid increase of the storage capacity ends is somewhat smaller. This applies to networks down to the size of the mouse cortex, which has about 2 · 10^7 neurons.

4.7 Storage Capacity Dependence on Hypercolumn Size. The fraction of active minicolumns (units) can be decreased by increasing the size, U, of the hypercolumns. Given that we have a constant n and V, decreasing the size of U increases H, but the total number of units, N, is constant. This means that the information content of the patterns stored in the network is only marginally affected by changes made to the size of U. In Figure 6, we have plotted the storage capacity of a human-sized network as a function of U for different values of V. For small values of U, the storage capacity is limited by the signal-to-noise ratio constraint, and for larger values of U, the strength of the synaptic input limits the storage capacity. Having hypercolumns that are larger than U = 50–60 does not increase the storage capacity significantly for human-sized networks, and for smaller networks, this breaking point occurs for slightly smaller values of U.
Figure 6: Maximum storage capacity as a function of the hypercolumn size U. Dotted lines mark the storage capacity when the signal-to-noise constraint is the limiting factor, and dashed lines are used to mark it when the strength of the EPSP is the limiting factor. Changing the size of U does not affect the number of units N in a network, and it only marginally affects the information content of each stored pattern.
4.8 Inferring the Amount of Synaptic Input to a Neuron. In this section, we use the cortical model to infer the number of synaptic inputs a cortical neuron receives during a short period of time, which we here refer to as the time step. We define a neuron as active if it fires action potentials at 16 Hz, and the time step is therefore set to 1/16 s. In the model, the synaptic input that produces a neuron's EPSP is composed of two components: a locally generated input from within the hypercolumn and an input coming via corticocortical afferents from outside the hypercolumn. Up to this point, we have considered only the latter.

The number of inputs coming to a neuron from outside the hypercolumn via corticocortical afferents, for a human-sized instance of the model with the parameters H = 2 · 10^6, U = 100, K = 6 · 10^5, T = 9, Vrecv = 100, and a = 3.1%, is BaS/Vrecv = aKT/(UVrecv) = 17 per time step. This amount of input is barely enough to generate an action potential in a neuron with some certainty (Destexhe & Paré, 1999). If we instead assume that there are only Vrecv = 20 pyramidal neurons with large dendritic trees receiving the major bulk of the corticocortical afferents
in each minicolumn, we find that these 20 neurons receive 84 corticocortical synaptic inputs each, given the same level of activity. This amount of synaptic input is enough for action potentials to be generated with a high probability (Destexhe & Paré, 1999).

Within a minicolumn, the pyramidal neurons are tightly connected, and when some of them start to fire action potentials due to the global input, an avalanche of synaptic input to these neurons occurs. If we assume that 80% of all neurons in cortex are of the pyramidal type (Braitenberg & Schuz, 1998), that the connectivity between pyramidal neurons within a minicolumn is about 50% (Hellwig, 2000; Holmgren, Harkany, Svennenfors, & Zilberter, 2003), and that these neurons are on average connected by three synapses (Hellwig, 2000), we find that a pyramidal neuron in an active minicolumn receives 120 local synaptic inputs per time step in addition to the global input.

Relative to a pyramidal neuron in an active minicolumn, one in an inactive minicolumn (in both active and quiescent hypercolumns) on average receives a fraction 1/T of the global synaptic inputs from corticocortical afferents. In addition to the strong locally and globally originating excitatory input to the neurons of the active minicolumn in an active hypercolumn, the inactive minicolumns in this active hypercolumn are subject to strong lateral inhibition. In a quiescent hypercolumn, however, the average activity in all minicolumns is around 0.16 Hz, whereby the local excitatory synaptic input to a neuron is on average 1/100 of that in an active minicolumn. In fact, this difference in total synaptic input to neurons in active and quiescent hypercolumns, respectively, may be the physiological correlate of the low intrinsic signal from the quiescent hypercolumn.
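These input counts follow directly from the parameter values quoted above; the sketch below (ours) reproduces them:

```python
# Corticocortical input per receiving neuron and time step (1/16 s),
# B*a*S/V_recv = a*K*T/(U*V_recv), for the human-sized parameters above.
a, K, T, U = 0.031, 6e5, 9, 100

for V_recv in (100, 20):
    n_inputs = a * K * T / (U * V_recv)
    print(f"V_recv = {V_recv:3d}: ~{n_inputs:.0f} global inputs per time step")
# -> ~17 inputs for V_recv = 100 and ~84 inputs for V_recv = 20.

# Local input inside an active minicolumn: ~80% of the V = 100 neurons are
# pyramidal, ~50% pairwise connectivity, ~3 synapses per connected pair.
local_inputs = 0.8 * 100 * 0.5 * 3   # ~120 local inputs per time step
```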
5 Discussion

In this letter, we have elaborated on the analogy between neocortex and a large monolithic attractor memory network. Although this is a very rough approximation of the neocortex, the approach gave rise to several interesting matches between the model and anatomical and physiological data. The model predicted a functional activity of 1% to 5% for the hypercolumns, and this agreed with experimental intrinsic signal imaging data. In a qualitative way, this result also fits with fMRI data, which show that only a small fraction of cortex is activated, in a patchy manner, at any particular moment (Cheng, Waggoner, & Tanaka, 2001).

The model predicts that the hypercolumns should contain at least around U = 50 minicolumns and that increasing U further gives only a marginal increase of the storage capacity. Similarly, we found that aggregating more than V = 30 neurons into a minicolumn gives only a limited increase of the number of stored patterns, and depending on the particular value of U, it may decrease the total amount of information stored by the network.
We found that having a few large pyramidal neurons receiving the corticocortical afferents ensures that these neurons get a large synaptic input if they participate in the active pattern, and this in turn ensures that the other neurons in the minicolumn start firing.

The proposed model is very abstract and captures only a small fraction of the complexity of neocortex. It does not at all reflect the neurochemistry taking place in cortical neurons, and it also overlooks the details of their exact wiring. Therefore, as the model now stands, it will be able to provide only vague predictions of cortical anatomy and physiology. In order to get more detailed predictions from the model, it must be extended to include a more detailed model of the mini- and hypercolumn at the level of individual neurons. Such a detailed model would include a more authentic connectivity pattern and could therefore not be run with random patterns as input, which means that the preprocessing stages of cortical input must also be included. We foresee steady progress toward this type of more detailed model in the coming years (Lundqvist, Rehn, Djurfeldt, et al., 2006; Markram, 2006).

5.1 Minicolumns as the Functional Units. The information content of the patterns stored in the network is proportional to the activity level: a high activity level gives patterns with a large information content, and vice versa for a network with a low activity level. When the network is run with a low activity level, the connectivity must be fairly high so that each unit in the network receives a synaptic input with a good signal-to-noise ratio. Having minicolumns as the functional units instead of single neurons gives a higher connection density and enables low levels of activity, and this in turn gives a high storage capacity in terms of patterns. If we instead have neurons as the functional units in the network, only a small number of patterns, each with a large information content, can be stored, because the connectivity is low and the activity level must therefore be high. There is thus a trade-off between storing many patterns, each with a low information content, and storing a small number of patterns with a high information content, and this trade-off is expressed in terms of the size of the minicolumns. Assuming that nature has favored the former of the two scenarios (storing many unique patterns), there is a clear advantage to using minicolumns as the functional units.

Anatomical data show that minicolumns are a generic construct in the mammalian neocortex, but physiological data indicate that they have a functional importance only for large cortices, although functional minicolumns would also increase the storage capacity in smaller cortices. This leads to the question of whether there are advantages to using single neurons as the functional units in smaller cortices. It might be that it is not necessary to have a storage capacity on the order of millions in these smaller cortices and that the capacity to store a few thousand patterns is enough. Rather than having a large storage capacity in terms of patterns, it might be necessary in the smaller cortices to have the capability to represent
memory patterns containing a certain, large enough amount of information, which would not be possible in a minicolumnar network.

5.2 Scaling of the Model. Why design a network that scales with an increasing number of fixed-sized modules instead of a network that scales by increasing the size of the modules? Generally, building a system by duplicating a building block with a fixed structure is easier than building a system where the structure of the building blocks depends on the overall design of the entire system. That the growth of neocortex adheres to this design philosophy is evident from studies of ontogenetic columns (Rakic, 1995). For both the neocortex and a silicon implementation of the hypercolumns, their limited size makes it possible to implement the local competition, or any other local operation, using fast and cheap interconnections between the units. In the case of neocortex, unmyelinated axons can be used, which occupy a much smaller volume than myelinated ones. An implementation in silicon benefits from a local structure in that less space is used by transmission lines, and a high clock frequency can be used in the case of a digital design.

When the model is scaled up, the number of synaptic inputs per unit remains constant, precisely as in the mammalian brain. As a consequence, the connection density decreases rapidly when the network is scaled up, and for cortex-sized networks, the connectivity is very low. This applies to networks of both single neurons and minicolumns, although for finite-sized networks, the connectivity is considerably higher in a network of minicolumns than in one based on neurons.

Conceptually, the idea of quiescent hypercolumns is appealing because it brings many computational advantages. First, it enables a modular attractor network to scale well in terms of stored patterns. Second, it allows a more flexible representation where different memory patterns can contain different amounts of information. Third, it makes implementations more efficient by reducing the communication needs between hypercolumns. Fourth, in both the brain and silicon implementations, it serves to reduce the energy needed for computation and communication.

5.3 Storage Capacity: How Many Memories Are Enough? Given that we model neocortex, or a fraction of it, with a fixed-point attractor memory, it is relevant to consider the number of patterns that such a memory can hold (O'Kane & Treves, 1992b; Treves, 2005). Treves (2005) states that being able to store only a few thousand memories is probably not good enough, and he suggests that a cortex-scale attractor memory should be able to store at least on the order of millions of memories. To put this figure in perspective, we can consider the number of memories that a human can possibly come across during a life span. If a person perceives one new input each minute throughout a life span of 75 years, the total number of cues perceived amounts to 4 · 10^7. This is only a factor of 10 larger than
the storage capacity computed for a full-scale instance of the model. The point we want to make is that considering a system with a much higher storage capacity than this is most likely irrelevant from a biological point of view.

6 Conclusions

In this letter, we have investigated the scaling of an abstract cortical model with respect to biologically imposed constraints. We found that the cortical model performed very well in terms of storing patterns if the neurons were aggregated into minicolumns. Aggregating neurons into minicolumns ensured that neurons in active minicolumns received a synaptic input strong enough to trigger the firing of action potentials, and from a functional perspective, the minicolumns ensured a good separation between the strengths of synaptic input to neurons in active and quiescent minicolumns. We also showed that grouping minicolumns into hypercolumns and enabling some of these to enter a quiescent state results in good scaling of the storage capacity, gives representational advantages, and complies with experimental data on cortical patterns of activity. Further, we computed that the columnar-based model, although spanning the entire cortex, could function as a fixed-point attractor memory while complying with biological constraints on connectivity, activity, and synaptic input.

In all, this validates the approach taken to modeling cortex using an abstract attractor network model, and the model proposed here can help in understanding the proposed role of the minicolumn as a functional unit in the cortex. It can also help clarify the functional role of the hypercolumn. By refining this type of abstract large-scale model, we can extract both quantitative and qualitative aspects of, for example, neural wiring and activity patterns that will help in the quest of figuring out in detail the main principles underlying cortical function.

Appendix A: List of Symbols

Symbol — Meaning (Location of First Use)
n — Number of neurons (section 2)
V — Number of neurons per minicolumn (section 3)
Vrecv — Number of neurons receiving input from corticocortical afferents in a minicolumn (section 3)
K — Number of synapses receiving corticocortical inputs, summed over all neurons in a minicolumn (section 3)
H — Number of hypercolumns (section 3)
U — Number of minicolumns in a hypercolumn (section 3)
N — Number of minicolumns (section 3)
A — Number of active hypercolumns (section 3)
a — Fraction of active hypercolumns (section 3.3)
B — Number of terminal clusters made by one minicolumn (section 3.1)
S — Number of synapses per minicolumn-to-minicolumn connection (section 3.1)
T — Number of potential contact sites for each synapse (section 3.1)
Y — Number of synaptic inputs to a minicolumn (section 3.2)
Z — Number of connections per unit (section 3.4)
M — Activation threshold for a unit (section 3.4)
I_pattern — Information content of a pattern (section 3.4)
L_stored — Number of stored patterns (section 3.5)
Acknowledgments

This work was partly supported by a grant from the Swedish Science Council (grant no. 621-2001-2548) and by the European Union (FACETS project, FP6-2004-IST-FETPI-015879).
References

Amirikian, B., & Georgopoulos, A. P. (2003). Modular organization of directionally tuned cells in the motor cortex: Is there a short-range order? Proc. Natl. Acad. Sci., 100, 12474–12479.
Attwell, D., & Laughlin, S. B. (2001). An energy budget for signaling in the grey matter of the brain. J. Cereb. Blood Flow Metab., 21, 1133–1145.
Beaulieu, C., & Colonnier, M. (1989). Number and size of neurons and synapses in the motor cortex of cats raised in different environmental complexities. J. Comp. Neurol., 289, 178–187.
Bohland, J. W., & Minai, A. A. (2001). Efficient associative memory using small-world architecture. Neurocomputing, 38–40, 489–496.
Bolle, D., Cools, R., Dupont, P., & Huyghebaert, J. (1993). Mean-field theory for the Q-state Potts-glass neural network with biased patterns. J. Phys. A: Math. Gen., 26, 549–562.
Bosking, W. H., Zhang, Y., Schofield, B., & Fitzpatrick, D. (1997). Orientation selectivity and the arrangement of horizontal connections in tree shrew striate cortex. J. Neurosci., 17, 2112–2127.
Braitenberg, V., & Schuz, A. (1998). Cortex: Statistics and geometry of neuronal connectivity. New York: Springer-Verlag.
Buhmann, J., Divko, R., & Schulten, K. (1989). Associative memory with high information content. Phys. Rev. A, 39, 2689–2692.
Buxhoeveden, D. P., & Casanova, M. F. (2002a). The minicolumn and evolution of the brain. Brain, Behavior and Evolution, 60, 125–151.
Buxhoeveden, D. P., & Casanova, M. F. (2002b). The minicolumn hypothesis in neuroscience. Brain, 125, 935–951.
Cheng, K., Waggoner, R. A., & Tanaka, K. (2001). Human ocular dominance columns as revealed by high-field functional magnetic resonance imaging. Neuron, 32, 359–374.
Chklovskii, D. B., Mel, B. W., & Svoboda, K. (2004). Cortical rewiring and information storage. Nature, 431, 782–788.
DeFelipe, J., Conley, M., & Jones, E. G. (1986). Long-range focal collateralization of axons arising from corticocortical cells in monkey sensory-motor cortex. J. Neurosci., 6, 3749–3766.
Destexhe, A., & Paré, D. (1999). Impact of network activity on the integrative properties of neocortical pyramidal neurons in vivo. J. Neurophysiol., 81, 1531–1547.
Fransén, E., & Lansner, A. (1998). A model of cortical associative memory based on a horizontal network of connected columns. Network: Comput. in Neural Systems, 9, 235–264.
Fulvi Mari, C. (2004). Extremely dilute modular neuronal networks: Neocortical memory retrieval dynamics. J. Comp. Neurosci., 17, 57–79.
Fulvi Mari, C., & Treves, A. (1998). Modeling neocortical areas with a modular neural network. BioSystems, 48, 47–55.
Gilbert, C. D., & Wiesel, T. N. (1989). Columnar specificity of intrinsic horizontal and corticocortical connections in cat visual cortex. J. Neurosci., 9, 2432–2442.
Goldman, P. S., & Nauta, W. J. H. (1977). Columnar distribution of cortico-cortical fibers in the frontal association, limbic, and motor cortex of the developing rhesus monkey. Brain Res., 122, 393–413.
Hellwig, B. (2000). A quantitative analysis of the local connectivity between pyramidal neurons in layers 2/3 of the rat visual cortex. Biol. Cybern., 82, 111–121.
Holmgren, C., Harkany, T., Svennenfors, B., & Zilberter, Y. (2003). Pyramidal cell communication within local networks in layer 2/3 of rat neocortex. Journal of Physiology, 551, 139–153.
Hubel, D. H., & Wiesel, T. N. (1977). Functional architecture of macaque monkey visual cortex. Proc. R. Soc. Lond. B, 198, 1–59.
Johansson, C., & Lansner, A. (2004). Towards cortex sized artificial nervous systems. In M. G. Negoita, R. J. Howlett, & L. C. Jain (Eds.), Proc. of Knowledge-Based Intelligent Information and Engineering Systems (8th ed.). New York: Springer.
Johansson, C., & Lansner, A. (2007). Towards cortex sized artificial neural systems. Neural Networks, 20, 48–61.
Johansson, C., Rehn, M., & Lansner, A. (2006). Attractor neural networks with patchy connectivity. Neurocomputing, 69, 627–633.
Kanter, I. (1988). Potts-glass models of neural networks. Phys. Rev. A, 37, 2739–2742.
Kropff, E., & Treves, A. (2005). The storage capacity of Potts models for semantic memory retrieval. JSTAT, P08010.
Lansner, A., Fransén, E., & Sandberg, A. (2003). Cell assembly dynamics in detailed and abstract attractor models of cortical associative memory. Theory in Biosci., 122, 19–36.
Lansner, A., & Holst, A. (1996). A higher order Bayesian neural network with spiking units. Int. J. Neural Systems, 7, 115–128.
Lennie, P. (2003). The cost of cortical computation. Current Biology, 13, 493–497.
Little, W. A., & Shaw, G. L. (1975). A statistical theory of short and long term memory. Behavioral Biology, 14, 115–133.
Lundqvist, M., Rehn, M., Djurfeldt, M., & Lansner, A. (2006). Attractor dynamics in a modular network model of the neocortex. Network: Comput. in Neural Systems, 17, 253–276.
Lundqvist, M., Rehn, M., & Lansner, A. (2006). Attractor dynamics in a modular network model of the cerebral cortex. Neurocomputing, 69, 1155–1159.
Markram, H. (2006). The Blue Brain Project. Nature Rev. Neurosci., 7, 153–160.
Masuda, N., & Aihara, K. (2004). Global and local synchrony of coupled neurons in small-world networks. Biol. Cybern., 90, 302–309.
McGraw, P. N., & Menzinger, M. (2003). Topology and computational performance of attractor neural networks. Phys. Rev. E, 68, 047102.
Montgomery, J. M., & Madison, D. V. (2004). Discrete synaptic states define a major mechanism of synapse plasticity. Trends in Neurosci., 27, 744–750.
Morelli, L. G., Abramson, G., & Kuperman, M. N. (2004). Associative memory on a small-world neural network. Eur. Phys. J. B, 38, 495–500.
Mountcastle, V. B. (1997). The columnar organization of the neocortex. Brain, 120, 701–722.
O'Kane, D., & Treves, A. (1992a). Short- and long-range connections in autoassociative memory. Journal of Physics A, 25, 5055–5069.
O'Kane, D., & Treves, A. (1992b). Why the simplest notion of neocortex as an autoassociative memory would not work. Network: Comput. in Neural Systems, 3, 379–384.
Pakkenberg, B., Pelvig, D., Marner, L., Bundgaard, M. J., Gundersen, H. J., Nyengaard, J. R., & Regeur, L. (2003). Aging and the human neocortex. Experimental Gerontology, 38, 95–99.
Palm, G. (1980). On associative memory. Biol. Cybern., 36, 19–31.
Palm, G. (1981). Towards a theory of cell assemblies. Biol. Cybern., 39, 181–194.
Rakic, P. (1995). A small step for the cell, a giant leap for mankind: A hypothesis of neocortical expansion during evolution. Trends in Neurosci., 18, 383–388.
Rolls, E. T., & Treves, A. (1998). Neural networks and brain function. New York: Oxford University Press.
Roudi, Y., & Treves, A. (2004). An associative network with spatially organized connectivity. JSTAT, P07010.
Roudi, Y., & Treves, A. (2006). Localised activity profiles and storage capacity of rate-based autoassociative networks. Phys. Rev. E, 73, 061904.
Scannell, J. W., Blakemore, C., & Young, M. P. (1995). Analysis of connectivity in the cat cerebral cortex. Journal of Neuroscience, 15, 1463–1483.
Stepanyants, A., & Chklovskii, D. B. (2005). Neurogeometry and potential synaptic connectivity. Trends in Neurosci., 28, 387–394.
Stepanyants, A., Hof, P. R., & Chklovskii, D. B. (2002). Geometry and structural plasticity of synaptic connectivity. Neuron, 34, 275–288.
Treves, A. (2005). Frontal latching networks: A possible neural basis for infinite recursion. Cognitive Neuropsychology, 22, 276–291.
Tsodyks, M. V., & Feigelman, M. V. (1988). The enhanced storage capacity in neural networks with low activity level. Europhysics Letters, 6, 101–105.
Tsunoda, K., Yamane, Y., Nishizaki, M., & Tanifuji, M. (2001). Complex objects are represented in macaque inferotemporal cortex by the combination of feature columns. Nature Neurosci., 4, 832–838.
Willshaw, D. J., Buneman, O. P., & Longuet-Higgins, H. C. (1969). Non-holographic associative memory. Nature, 222, 960–962.
Received February 17, 2006; accepted August 10, 2006.
LETTER
Communicated by Odelia Schwartz
Combining Reconstruction and Discrimination with Class-Specific Sparse Coding

Stephan Hasler
[email protected]
Heiko Wersing
[email protected]
Edgar Körner
[email protected]
Honda Research Institute Europe GmbH, 63073 Offenbach/Main, Germany
Sparse coding is an important approach for the unsupervised learning of sensory features. In this contribution, we present two new methods that extend the traditional sparse coding approach with supervised components. Our goal is to increase the suitability of the learned features for classification tasks while keeping most of their general representation capability. We analyze the effect of the new methods using visualization on artificial data and discuss the results on two object test sets with regard to the properties of the found feature representation.
1 Introduction

Most approaches to object recognition employ two kinds of methods: methods that learn features and methods that learn object representations. While the second group of methods can be used directly for classification by comparing a test image with the learned representation, the first group has a supporting function in finding subspaces of the data in which more robust object representations can be obtained. For the object representation learning methods, there is a further distinction between probabilistic generative and discriminative approaches, depending on whether they model the distribution of samples in the data space (Ulusoy & Bishop, 2005). In recent years, a stronger interest has arisen in combining the advantages of both approaches (Raina, Shen, Ng, & McCallum, 2003; Ng & Jordan, 2002). Following Ulusoy and Bishop (2005), discriminative approaches are faster and more reliable in predicting class labels, since they are trained to do so rather than to model the joint distribution of input vectors and classes. Because of this specialization in a certain classification task, these approaches suffer the drawback that they have to be retrained whenever the scenario is changed, for example, by adding a new class.
Probabilistic generative methods, such as gaussian mixture models (GMMs; McLachlan & Peel, 2000), learn independent models for each class. Therefore, a new class simply adds a new model but does not influence the existing ones. Also, they are able to deal with missing information and unlabeled data. The disadvantage of generative methods is that they model details of a data distribution that may be irrelevant or even disturbing in classification tasks.

Also for the feature learning methods, a distinction exists between generative and discriminative approaches, depending on whether the learned feature subspace supports reconstruction of the data or classification. Usually both approaches train one global feature basis for the whole data distribution. The discriminative approaches are trained in a supervised and the generative approaches normally in an unsupervised manner. Although the term generative is often used in this context, it is misleading, because the feature learning methods do not specify how new data could be generated from a learned basis by means of learned priors on how to combine the features. The probabilistic generative models do so explicitly.

There is a group of generative feature learning methods called linear generative methods. These methods search for subspaces that allow for a good reconstruction of the data vectors in terms of linear combinations of the basis functions. This means each data vector is associated with a set of coefficients that determines how the features (basis functions, weights) have to be used to yield the best reconstruction. The linear generative models differ in the constraints on how to reconstruct the data. Principal component analysis (PCA; Duda, Hart, & Stork, 2000) finds dimensions of highest variance in the data, which allows for a reconstruction with minimal information loss when using fewer features than dimensions in the data. Nonnegative matrix factorization (NMF; Lee & Seung, 1999) employs purely positive weights and coefficients and was shown to learn localized patterns that often have a direct interpretation as object parts. Sparse coding (Olshausen & Field, 1996) puts constraints on the coefficients, enforcing an efficient use of the basis functions. The principle of efficient coding resembles receptive field properties in primary visual cortex when applied to small patches of natural scenes.

As mentioned in the beginning, linear generative methods are often used to facilitate classification. So, for example, PCA was successfully applied to face recognition (Turk & Pentland, 1991), and sparse coding features were used as an intermediate layer in a feature hierarchy related to the ventral visual pathway (Wersing & Körner, 2003) that yields robust classification performance for different recognition problems. However, linear generative methods for feature learning suffer the same drawback as the probabilistic generative methods: they spend resources on modeling certain dimensions in the data that might be irrelevant or disturbing for classification tasks. On the other hand, the discriminative feature learning methods concentrate on dimensions in the data that are relevant for classification but do not offer the
possibility of learning features that have an interpretation as object parts. Instead, they generate very holistic, noisy-looking features. An example of those methods is the Fisher linear discriminant (Duda et al., 2000), which finds subspaces in the data where the classes are separated best in terms of Euclidean distance. This subspace is very specific for the trained scenario, which may decrease the ability to generalize to new scenarios.

The mixture of advantages and disadvantages suggests a combination of discriminative and generative properties as an attractive approach for feature learning. We decided to use the nonnegative sparse coding approach (Hoyer, 2002) as the basis for our investigations. Nonnegative sparse coding is a linear generative method that adopts the positivity constraints from the NMF. As outlined above, this property facilitates the learning of features that have an interpretation as object parts. But because the linear generative methods are mainly based on the principle of reconstructing the data, the obtained features might not be useful for building a classifier. So the nonnegative sparse coding will focus its resources on reconstructing the common parts of the classes first and will not concentrate on discriminative ones. By adding class-specific, supervised components to the cost function, we hope to prevent this behavior and to learn qualitatively different features that are more discriminative while keeping their interpretation as object parts.

After reviewing related work in section 2, we introduce the new methods in section 3 and analyze them using a visualization of their representation properties dependent on the cost function parameters. In section 4, we analyze the obtained feature representations for two object test sets, and we give our conclusions in section 5.

2 Related Work

The standard approach to sparse coding (Olshausen & Field, 1996) is formulated as a linear code representing the data. Its target is to combine efficient reconstruction with a sparse usage of the representing basis, resulting in the following cost function:

$$E_S = \frac{1}{2}\sum_i \Big\| \mathbf{x}_i - \sum_p c_{ip}\,\mathbf{w}_p \Big\|_2^2 + \gamma \sum_{i,p} \Phi(c_{ip}), \tag{2.1}$$
where the samples x_i = (x_i1, x_i2, ..., x_iK)^T, i = 1, ..., I, and the weights w_p = (w_p1, w_p2, ..., w_pK)^T, p = 1, ..., P, have the same dimension K. In the left reconstruction term, each x_i is approximated by a linear combination r_i = Σ_p c_ip w_p, where r_i is referred to as the reconstruction of the corresponding x_i. The coefficients c_ip specify how much the pth weight is involved in the reconstruction of the ith data vector. The squared Euclidean
norm ||·||₂² of the difference vector between an x_i and its reconstruction r_i contributes to the cost. The right sparsity term sums up the Φ(c_ip). The nonlinear function Φ (e.g., Φ(·) = |·|) increases the cost the more the activation is spread over different c_ip, and so many of them become zero while a few are highly activated. The influence of the sparsity term is scaled with the positive constant γ. The sparsity forces the weights to align more directly to the data and to reconstruct an x_i most sparsely if multiple possibilities exist. This enables sparse coding to handle an overcomplete representation.

PCA (Duda et al., 2000) and NMF (Lee & Seung, 1999) are based on the same measure of reconstruction as the sparse coding model described by equation 2.1 but do not put sparsity constraints on the coefficients. Using fewer weights than dimensions in the data (P < K), PCA forces the weights to align to the directions with the biggest variance in data space. Therefore, PCA is often used to reduce the dimensionality of data with a minimal loss of information. NMF differs from PCA in that it puts positivity constraints on both the weights and the coefficients. Therefore, the contribution of each weight to a certain reconstruction is purely positive and cannot be canceled out by the contribution of another weight. This limitation makes it economical to reconstruct an image with nonoverlapping weights, where each single weight is already a good reconstruction of an image part. This is often referred to as a parts-based representation.

Later, Hoyer (2004) added to the NMF an option to directly control the sparseness of the weights and the coefficients. He discovered that for achieving a parts-based representation with standard NMF, the same parts have to occur at the same position in the training samples. By adding sparseness constraints on the weights, a parts-based representation can be more reliably produced. In other cases, the NMF produces extremely parts-based weights; that is, they contain only single pixels or small blobs and do not reveal any meaningful statistical structure of the data. Adding sparseness constraints on the coefficients instead forces the weights to reveal more holistic dependencies.

Nonnegative sparse coding (Hoyer, 2002) adopts the idea of a parts-based representation for sparse coding by also putting positivity constraints on the weights and the coefficients. It differs from NMF in the fact that the sparsity of the coefficients is enforced explicitly, and so nonnegative sparse coding is similar to NMF with sparseness constraints. The remaining difference between the two approaches is that sparse coding methods often use simple gradient descent for the optimization of the cost function, whereas NMF methods apply multiplicative update rules that do not require the definition of a learning rate.
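To make the roles of the two terms concrete, here is a minimal sketch of how the cost in equation 2.1 can be evaluated. The array shapes and the L1 choice Φ(c) = |c| follow the notation above; the function name is ours, and this is not code from the original work.

```python
import numpy as np

def sparse_coding_cost(X, W, C, gamma):
    """E_S of equation 2.1 for data X (I x K), weights W (P x K), and
    coefficients C (I x P), with the L1 sparsity function Phi(c) = |c|."""
    R = C @ W                                    # reconstructions r_i = sum_p c_ip w_p
    reconstruction = 0.5 * np.sum((X - R) ** 2)  # squared Euclidean error term
    sparsity = gamma * np.sum(np.abs(C))         # sparsity term
    return reconstruction + sparsity
```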
In the algorithms described above, each weight contributes only once to the reconstruction of a certain image, and the position of activation in the weight directly determines the position of activation in the reconstruction. Therefore, a single weight cannot represent or learn a part that occurs in different images at different locations (or in other transformations). This leads to redundancies in the coding scheme by representing transformations of one part with different weights. To overcome this limitation, Grimes and Rao (2005) proposed an extension of sparse coding that factors an image into object features and transformations using a bilinear function. In this way, the weights can contribute several times to the same reconstruction, each time undergoing another transformation beforehand (e.g., shifts to different locations). Currently the approach handles only translation, but in general it is able to deal with arbitrary transformations, such as rotation, scaling, and view changes. The concept of bilinearity imposes that for a certain image, all features use the same transformations. This contradicts the notion of features as independent parts. Therefore, further extensions in Grimes and Rao (2005) go in the direction of allowing independent transformations of features per image. This is similar to the translation-invariant adaptation of the nonnegative sparse coding proposed in Wersing and Körner (2003) and the translation-invariant adaptation of the NMF introduced in Eggert, Wersing, and Körner (2004).

The unsupervised methods mentioned above produce features with reconstructive qualities. The features are not specialized in solving a certain task and therefore could be transferred to scenarios other than those used for training. The drawback is that the extraction of statistically significant parts in high-dimensional data with unsupervised methods requires large training sets. Also, there is no guarantee that the obtained parts are useful in object recognition tasks.

Another group of methods concentrates on only the discriminative properties of the features. One example of such an approach is the Fisher linear discriminant (Duda et al., 2000). It searches for a low-dimensional representation of the data that, unlike PCA, does not favor the directions of biggest variance, but the directions allowing the best separation of the classes in the data. This is done by generating a transformation matrix that minimizes the ratio of within-class scatter to between-class scatter. Thus, in some sense, this projection maximizes the signal-to-noise ratio. When Q is the number of classes in the data, the feature space has dimension Q − 1. This allows a linear separation of the classes only if each has a very peaked, unimodal gaussian distribution in feature space. Discriminative features are very efficient in solving the task they are trained for but normally lack the property of being reusable in adapted scenarios.

Methods combining the advantages of unsupervised and supervised methods are rare. One is the maximum representation and discrimination feature (MRDF) approach (Talukder & Casasent, 1998). It combines PCA and an adaptation of the Fisher linear discriminant, which can also handle multimodal distributions, and introduces a parameter that determines to which degree reconstruction or discrimination is desired. Since the method has no positivity constraints, the generated features are holistic and do not have a direct interpretation as object parts.
Figure 1: Example views of the three-class problem (cups with handle, cups without handle, and closed containers).
We propose two new methods to combine unsupervised and supervised feature learning on the basis of a nonnegative, parts-based representation.

3 Class-Specific Sparse Coding

Class specificity should denote the property of a feature to give a strong clue about the class membership of an image in which the feature is detected. Following this definition, in the three-class problem shown in Figure 1, the handles show a high specificity for the cups-with-handle class, because whenever you recognize a handle, you can be sure that you see a view of this class. In the same way, the white caps are specific for the closed-container class. The standard sparse coding model in equation 2.1 does not care about the existence of different classes and produces features that are useful for general image reconstruction but lack the property of being class specific.

Our two new approaches extend nonnegative sparse coding with supervised components. Suppose that the data samples are split into Q subsets (classes) X_q, with q = 1, ..., Q. Each subset has n_q elements labeled as class q. In the first approach, this class information has a direct effect on the coefficients c_ip, and it will therefore be referred to as coefficient coding:

$$E_C = \frac{1}{2}\sum_i \Big\| \mathbf{x}_i - \sum_p c_{ip}\,\mathbf{w}_p \Big\|_2^2 + \gamma \sum_{i,p} \Phi(c_{ip}) + \frac{\alpha}{2} \sum_{\substack{i,\tilde\imath \\ q(i) \neq q(\tilde\imath)}} \sum_p \frac{c_{ip}\, c_{\tilde\imath p}}{n_{q(i)}\, n_{q(\tilde\imath)}}. \tag{3.1}$$
We assume c_ip ≥ 0 and w_pk ≥ 0. For the sparsity term, we used the function Φ(c_ip) = c_ip, which corresponds in the nonnegative case to the absolute value. The right coefficient term causes cost if coefficients belonging to the same weight w_p are active for differently labeled samples x_i and x_ĩ, where q(i) is the label of x_i and n_q(i) is the number of samples in the class of x_i. The n_q(i) are used to normalize the effect of classes with different cardinality. The influence of the coefficient term is scaled with the positive constant α.
In the second approach, the class information has a more direct effect on the weights, and it will therefore be referred to as weight coding:

$$E_W = \frac{1}{2}\sum_i \Big\| \mathbf{x}_i - \sum_p c_{ip}\,\mathbf{w}_p \Big\|_2^2 + \gamma \sum_{i,p} \Phi(c_{ip}) + \frac{\beta}{2} \sum_{\substack{i,\tilde\imath \\ q(i) \neq q(\tilde\imath)}} \sum_p \frac{\big(\mathbf{w}_p^T \mathbf{x}_i\big)\big(\mathbf{w}_p^T \mathbf{x}_{\tilde\imath}\big)}{n_{q(i)}\, n_{q(\tilde\imath)}}. \tag{3.2}$$
The right weight term causes cost if a w_p has a large inner product with differently labeled samples x_i and x_ĩ. Again, q(i) denotes the label of x_i, and n_q(i) is the number of samples in the class of x_i. The influence of the weight term is scaled with the positive constant β.

The minimization of the cost functions of coefficient and weight coding is done by alternately applying coefficient and weight steps as described in Wersing and Körner (2003). In the coefficient step, the cost function is minimized with respect to the c_ip using an asynchronous fixed-point search, while keeping the w_p constant. The weight step is a single gradient step with a fixed step size in the w_p, keeping the c_ip constant. A more detailed description of the optimization procedure is given in the appendixes; a sketch of how the two class-specific penalty terms can be evaluated follows below.

To give an instructive visualization of sparse coding and show the qualitatively different behavior of the new approaches, we performed optimizations for an artificial two-dimensional setting. The conclusions we draw from this toy setting hold in principle also for more complex and higher-dimensional problems. The artificial setting contains 10 samples that were distributed on the positive part of the unit circle and then assigned to two classes (see Figure 2a). These samples are reconstructed using two normalized weights. The actual visualization shows the resulting weights w_p and reconstructions r_i for different values of a control parameter of the cost function, for example, the influence of the sparsity term γ (see Figure 2b). The relative position of r_i and w_p allows conclusions about the sparsity of the reconstructions, whereas the course of the w_p is a direct indicator of their discriminative properties. The optimally discriminating weights are those that maximize the gradient of their dot product with samples near the border between the classes. This corresponds to the property of a linear separator. In Figure 2a, the best discriminating features would point in the direction of the coordinate axes. Hence, in the visualization of the angles (see Figure 2b), these weights lie at 0 and 90 degrees.

In Figure 3, the visualization is used to compare nonnegative sparse coding, coefficient coding, and weight coding. Figure 3a shows the typical behavior of the nonnegative sparse coding for an increasing influence factor of the sparsity term γ. Each reconstruction lies between the weights or on one of them due to the nonnegativity constraints. For γ → 0, the reconstruction is perfect (when using at least as many weights as dimensions in the data), and the weights are aligned with the outermost x_i.
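As a concrete illustration of the two supervised penalties, the following sketch evaluates the coefficient term of equation 3.1 and the weight term of equation 3.2. It exploits the fact that the normalized pairwise sums reduce to sums over class means (the same identity used to rewrite E_W later in this section); the shapes and function names are our own, not taken from the original implementation.

```python
import numpy as np

def class_penalty_terms(X, W, C, labels, alpha, beta):
    """Coefficient term of eq. 3.1 and weight term of eq. 3.2.
    X: (I, K) data, W: (P, K) weights, C: (I, P) coefficients,
    labels: (I,) integer class labels q(i) in {0, ..., Q-1}."""
    Q = int(labels.max()) + 1
    Cbar = np.stack([C[labels == q].mean(axis=0) for q in range(Q)])  # (Q, P)
    Xbar = np.stack([X[labels == q].mean(axis=0) for q in range(Q)])  # (Q, K)
    G = Xbar @ W.T                                                    # (Q, P): w_p^T xbar_q
    # a sum over ordered pairs q != q~ equals (column sum)^2 minus the sum of squares
    coeff_term = 0.5 * alpha * np.sum(Cbar.sum(0) ** 2 - (Cbar ** 2).sum(0))
    weight_term = 0.5 * beta * np.sum(G.sum(0) ** 2 - (G ** 2).sum(0))
    return coeff_term, weight_term
```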
Figure 2: (a) Artificial setting used in our visualization. The artificial setting is two-dimensional and contains 10 data samples x_i. The x_i are randomly distributed on the unit circle and assigned to two classes (symbolized with a star and a diamond). The optimization employs two normalized two-dimensional weights w_p. For a certain parameter setting, the optimization results in the shown position of weights and reconstructions r_i. (b) Schematic description of the visualization. The actual visualization shows for each symbol in (a) the angle between a ray from the origin to that symbol and the x-axis (see how θ in (a) is represented in (b)). The angles of weights and reconstructions are shown at the x-coordinate of 0.2 because the result in (a) was produced with a cost function parameter of 0.2. Visualizing r_i and w_p for different values of the same parameter will reveal its qualitative effect on the cost function.
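For readers who want to reproduce the qualitative behavior in Figures 2 and 3, the toy setting can be generated along the following lines. The random seed and the rule assigning samples to the two classes are assumptions of ours, since the text does not specify them.

```python
import numpy as np

rng = np.random.default_rng(0)
# 10 samples on the positive part of the unit circle (angles in [0, 90] degrees)
theta = np.sort(rng.uniform(0.0, np.pi / 2, size=10))
X = np.stack([np.cos(theta), np.sin(theta)], axis=1)   # (10, 2), unit norm
labels = (theta > np.pi / 4).astype(int)               # assumed class split by angle
# two normalized two-dimensional weights as the initial basis
W = rng.uniform(0.1, 1.0, size=(2, 2))
W /= np.linalg.norm(W, axis=1, keepdims=True)
```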
If an r_i lies on top of a weight symbol, this reconstruction is very sparse because it does not use the other weight at all. With increasing γ, each r_i gives up the use of the less suitable weight, and therefore the r_i unite into two main paths. At the same time, each w_p aligns to the center of the r_i assigned to it. For high values of γ, the result is therefore comparable to that of a cluster approach.

The coefficient term restricts the use of features by different classes. When its influence factor is increased for the coefficient coding (see Figure 3b), the reconstructions of each class are forced to use the same distinct weight basis (here only a single weight). So the reconstruction of the lowermost sample of the star class aligns with increasing α to the upper weight due to its class membership, while for the nonnegative sparse coding (see Figure 3a), the same r_i aligns to the lower weight with increasing γ due to a better sparseness. Note that the outermost two reconstructions at both sides are equal from the beginning. For high values of α, each feature is dedicated to a single class and changes its direction independent of other classes. Therefore, an increase in discriminative quality is impossible, because this would require a strong influence of different classes on the same weight.
Figure 3: Influence of certain parameters on different cost functions. (a) Nonnegative sparse coding (SC). The results of optimization are plotted for 15 different influence factors of the sparsity term γ . (b) Coefficient coding (CC). The influence factor of the coefficient term α is varied, while γ is set to 0.05. (c) Weight coding (WC). The influence factor of the sparsity term γ is varied, while the influence factor of the weight term β is set to 0.3. (d) Accumulated results. The angular positions of the weights wp are visualized for typical parameter settings of the different approaches. A detailed description is given in the text.
For the weight coding (see Figure 3c), there is a complex interplay between the sparsity term and the weight term. When the weight term dominates, as for very small values of γ, it removes activation from the lower weight that it shares with members of the upper class and vice versa. So one weight moves to the top and the other to the bottom. This means each w_p aligns to the direction that is most specific for the class it is representing. In the nonnegative case, this can be referred to as a gain in discriminative power. The weight term also dominates for very high values of γ. In this case, the reconstruction cost is near its maximum, and the algorithm tries to minimize at least the weight cost. Only for a certain range of parameters is there a meaningful combination of discriminative and reconstructive properties.

Figure 3d shows some typical weights obtained with the different approaches. The features of nonnegative sparse coding lie relatively close to the borders of the data distribution, offering a good compromise between reconstruction and sparseness. For the coefficient coding, it is expensive to reconstruct the samples near the class border using both weights strongly. Therefore, the features move closer to the class centers. Only the weight coding finds out that activation in y is specific or diagnostic for the star class and activation in x for the diamond class.

The results on the toy setting showed that nonnegative sparse coding limits the use of features globally, so each data sample is reconstructed using a small subset of features. To reduce the reconstruction cost, the features model parts that occur most frequently among the samples. But those parts are usually not specific for certain classes. In the extreme case, sparse coding works like a cluster approach, using a single weight per sample. The coefficient coding penalizes the use of a feature for different classes, so each class tends to use a distinct feature subset to reconstruct its samples. In this way, coefficient coding cannot avoid that parts that occur frequently in different classes attract weights. These weights are therefore not discriminative or class specific. Moreover, the approach may waste resources by modeling those parts for each class independently. The weight coding directly penalizes a feature that reflects a part occurring in different classes. As a result, the presence of a certain feature in an image gives a good indication of one or only a few classes.

By enforcing the use of a distinct set of features for reconstructing each class, the coefficient coding mimics the behavior of probabilistic generative methods like a GMM. A GMM models a data distribution with the help of a finite number of gaussians. When the GMM framework is used to build a classifier, it also trains a separate GMM for each class. Therefore, GMM and coefficient coding represent or reconstruct details of the data distribution that may be irrelevant for determining the class label (Ulusoy & Bishop, 2005). The weight coding instead punishes directly if a weight contains activation that is shared among different classes and is therefore irrelevant for determining the class label. The cost function of the weight coding can be rewritten as
$$E_W = E_S + \frac{\beta}{2} \sum_p \sum_{\substack{q,\tilde q \\ q \neq \tilde q}} \big(\mathbf{w}_p^T \bar{\mathbf{x}}_q\big)\big(\mathbf{w}_p^T \bar{\mathbf{x}}_{\tilde q}\big) \quad\text{with}\quad \bar{\mathbf{x}}_q = \frac{1}{n_q} \sum_{\mathbf{x} \in X_q} \mathbf{x}, \tag{3.2}$$
where x̄_q is the K-dimensional mean of the samples with label q. In this form, the weight coding shows some relation to the Fisher linear discriminant, when neglecting that the weight coding is nonnegative in all elements while the Fisher linear discriminant is not. In the two-class case, the Fisher linear discriminant minimizes the following cost function with respect to w:

$$E_F = \frac{\sum_{\mathbf{x} \in X_1} \big(\mathbf{w}^T \mathbf{x} - \mathbf{w}^T \bar{\mathbf{x}}_1\big)^2 + \sum_{\mathbf{x} \in X_2} \big(\mathbf{w}^T \mathbf{x} - \mathbf{w}^T \bar{\mathbf{x}}_2\big)^2}{\big(\mathbf{w}^T (\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)\big)^2}. \tag{3.3}$$

The numerator prefers directions where the variance within each class is minimal, and the denominator tries to separate the means of the classes as far as possible from each other. By multiplying out the denominator, you get

$$\big(\mathbf{w}^T (\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)\big)^2 = \big(\mathbf{w}^T \bar{\mathbf{x}}_1\big)^2 - 2\big(\mathbf{w}^T \bar{\mathbf{x}}_1\big)\big(\mathbf{w}^T \bar{\mathbf{x}}_2\big) + \big(\mathbf{w}^T \bar{\mathbf{x}}_2\big)^2. \tag{3.4}$$
The second term is equivalent to the weight term. The first and the third terms force the weight to align to the input pattern; in the weight coding, this is done by the interplay of the reconstruction term and the sparsity term. Assuming a unimodal, peaked distribution for each class in data space, the numerator of the Fisher linear discriminant plays no significant role, and hence both approaches put comparable forces on the features.

The weight coding is also similar to the MRDF approach (Talukder & Casasent, 1998), which combines supervised and unsupervised feature learning by combining an adaptation of the Fisher linear discriminant with PCA. The advantage of the weight coding is that it can produce a parts-based, overcomplete representation, while the number of features in the Fisher linear discriminant is limited by the number of classes and in the MRDF by the number of dimensions in the data. The discriminative component of the MRDF approach tries to increase the distance of the individual members of different classes in the feature space. In contrast, the weight term handles only the means of the class members. This limits its suitability to classes with unimodal data distributions. Another disadvantage of the weight coding is that the two parameters have to be chosen carefully. When the influence of the sparsity term is too weak, the weight term can force the features to point to meaningless dimensions, that is, those where no class has activation.
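For comparison with the weight term, the two-class Fisher cost of equation 3.3 can be written down directly. This is a generic sketch of equation 3.3 under our own naming, not code from the Fisher or MRDF implementations referenced above.

```python
import numpy as np

def fisher_cost(w, X1, X2):
    """Two-class Fisher cost of eq. 3.3: within-class scatter of the
    projections divided by the squared distance of the projected means."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    within = np.sum((X1 @ w - w @ m1) ** 2) + np.sum((X2 @ w - w @ m2) ** 2)
    between = (w @ (m1 - m2)) ** 2
    return within / between
```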
Figure 4: Features trained on the first scenario with the different approaches (nonnegative sparse coding, coefficient coding, weight coding, and NMF). The features for each approach are arranged from top left to bottom right by decreasing mutual information. The features of the nonnegative sparse coding and the coefficient coding are very similar to each other. The features of the weight coding are less view specific but much more parts based and class specific. In some of the features, there is a focus on the handles of the cups. In these cases, the part of the cup opposite the handle is also pronounced, since the presence of the handle at one side shifts the cup to the other side of the image frame, where otherwise no activation is present. In the cup-related features, the opening is not highlighted, since activation in this part of the image is more typical for the containers with the white caps. Therefore, there are features containing caps but no cup-related features like a handle. The NMF features are also parts based but not class specific and so lie visually somewhere in between.
4 Results on Two Scenarios

To further analyze the qualitative and quantitative differences between coefficient coding and weight coding, both approaches have been applied to two scenarios. The first scenario is the three-class problem shown in Figure 1. Cups with a visible handle, cups with no or an occluded handle, and some round containers with caps from the COIL-100 database (Nayar, Nene, & Murase, 1996) were combined into three classes, each containing 140 views. The gray-scale images were resized to a resolution of 32 × 32 pixels in advance. Forty features were trained using the same influence γ = 0.1 of the sparsity term and relatively high values for the coefficient term, α = 4.0, and for the weight term, β = 0.1. Figure 4 shows the resulting features sorted by their individual mutual information, the calculation of which is described later.
Table 1: Values of the Terms of the Cost Functions for the First Scenario.

       Reconstruction   Sparsity         Coefficient   Weight
SC     7.718 · 10^2     6.039 · 10^3     (19.98)       (5.357 · 10^4)
CC     8.146 · 10^2     5.909 · 10^3     9.505         (5.607 · 10^4)
WC     8.138 · 10^2     8.952 · 10^3     (65.95)       2.215 · 10^4
NMF    6.866 · 10^2     (7.489 · 10^3)   (46.76)       (3.464 · 10^4)

Notes: The terms do not include their influences (γ, α, and β) on the total cost. Values in parentheses were not used for optimization but are shown to highlight qualitative differences between features.
The features of nonnegative sparse coding and coefficient coding are very holistic and view specific. The features of weight coding and NMF are both sparser and more parts based, but the weight coding clearly emphasizes class-specific parts. So there is a group of features highlighting handles, while in all NMF features that contain handles, the whole cup is recognizable. The first NMF feature represents the white cap of a container, while the following features show the opening and the rim of a cup. Both feature types have a strong overlap at the top of the image frame, and therefore the cap features will also respond to cups and the rim-opening features to the containers. The weight coding does not tolerate this and makes the features sparser to better work out the differences between cups and containers.

Table 1 lists the values of the terms of the cost functions after optimization. These values are useful for interpreting the effect of our two new approaches compared to the nonnegative sparse coding. The coefficient term puts a penalty on the use of features across different classes, which leads to a reduced feature basis for reconstructing each class. As a result, there is an increase in the reconstruction cost and a decrease in the sparsity cost. The demand for sparsity of the coefficients in the nonnegative sparse coding has an opposite effect on the weights, forcing them to become very view specific and leading to a higher reconstruction cost. In the weight coding, the weight term removes activation from the features. They become less view specific, which causes an increase of the sparsity cost.

To evaluate the discriminative power of the trained features, we chose to calculate the mutual information between the features and the classes. The mutual information is a measure of the dependency between two or more random variables. In our case, these variables are the detection of the different features and the class label, both varying over the set of samples. The mutual information tells how much the detection of certain features in a sample restricts the possibility of different class hypotheses. Therefore, a high mutual information is a measure of discriminative power and a desired feature property. Unfortunately, the direct optimization of the mutual information conveyed by a set of features about a class demands the continuous PDF of the data. For low-dimensional data, the PDF can be estimated from the training samples using the Parzen window technique (Kwak & Choi, 2002). However, in high-dimensional data, the approach is computationally too expensive and requires a huge set of samples.
Table 2: Mutual Information for the First Scenario.

                      SC       CC       WC       NMF
Mutual information    1.7831   1.7314   3.4068   2.5708
Mean error rate       0.2839   0.2950   0.2045   0.2695
Standard deviation    0.0809   0.0814   0.0840   0.0820

Notes: The table lists the mutual information conveyed by the pool of features about the three classes (see Ullman & Bart, 2004). Also, some error rates are given performing 100 nearest-neighbor classifications per approach.
For the results in Table 2, we used a calculation similar to the method applied in Ullman and Bart (2004) to select informative image fragments. First, for each feature, an optimal threshold is determined. For this purpose, the dot product with each sample is calculated. By applying a threshold to the results of this calculation, a binary detection variable over the set of samples is generated. The class label is also a discrete class variable over the set of samples. The optimal threshold is the one that maximizes the mutual information between the class variable and the detection variable.

There are different ways to calculate the information between the classes and the set of features. Simply taking the sum of the individual mutual information the features convey would totally neglect their dependencies. Another way is to join the single values of the detection variables into a binary feature vector per sample and then calculate the mutual information between this vector and the classes. This approach is the mathematically correct one, but a perfect result tells only that no binary feature vector is used in different classes, not how well the information is distributed over the set of features. Therefore, we adopted the iterative process proposed by Ullman and Bart (2004) to calculate the values in Table 2. First, the feature conveying the most mutual information is chosen, and later the features with the most additional mutual information. Because the calculation of the additional mutual information given a set of already selected features is impractical, we also adopted the heuristic of Ullman and Bart (2004). In this heuristic, the additional mutual information of one candidate feature is calculated with respect to each single selected feature. The minimum of these values is assigned to the candidate feature. The candidate feature with the highest assigned value is selected. This heuristic guarantees that the selected feature is informative and differs from the features already selected. The reported value is the sum of the mutual information of the first feature and the additional information of the other features. Because of the heuristic approach, this value can be higher than the entropy of the class distribution. A sketch of this selection procedure is given below.

The weight coding has the highest value, and the sparse coding and the coefficient coding the lowest ones. The NMF has an intermediate value.
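The following sketch is one plausible reading of this procedure. The encoding of the joint detection variable and the exhaustive threshold search are our own choices; the original fragment-selection code of Ullman and Bart (2004) may differ in detail.

```python
import numpy as np

def mutual_info(a, b):
    """MI (in bits) between two discrete integer-valued variables."""
    joint = np.zeros((a.max() + 1, b.max() + 1))
    np.add.at(joint, (a, b), 1.0)
    p = joint / joint.sum()
    pa = p.sum(axis=1, keepdims=True)
    pb = p.sum(axis=0, keepdims=True)
    nz = p > 0
    return float(np.sum(p[nz] * np.log2(p[nz] / (pa * pb)[nz])))

def best_threshold_detections(r, labels):
    """Binary detections for one feature's responses r (dot products with
    the samples), using the threshold maximizing MI with the labels."""
    best_d, best_mi = None, -1.0
    for t in np.unique(r):
        d = (r >= t).astype(int)
        mi = mutual_info(d, labels)
        if mi > best_mi:
            best_d, best_mi = d, mi
    return best_d
    # usage: D[:, p] = best_threshold_detections(X @ W[p], labels)

def greedy_selection(D, labels, m):
    """Ullman-Bart-style greedy selection on an integer detection matrix
    D (I x P): first the feature with the highest MI, then repeatedly the
    candidate whose minimum additional MI over the selected set is largest."""
    P = D.shape[1]
    selected = [int(np.argmax([mutual_info(D[:, p], labels) for p in range(P)]))]
    while len(selected) < m:
        def min_additional(p):
            # additional MI of candidate p w.r.t. one selected feature s:
            # MI of the joint variable (p, s) with the class, minus MI of s alone
            return min(mutual_info(2 * D[:, p] + D[:, s], labels)
                       - mutual_info(D[:, s], labels) for s in selected)
        candidates = [p for p in range(P) if p not in selected]
        selected.append(max(candidates, key=min_additional))
    return selected
```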
Furthermore, in the sparse coding and the coefficient coding, the twenty-sixth selected feature of the 40 existing ones already has an additional mutual information of less than 0.0005. In the NMF it is the thirty-fourth, and in the weight coding the thirty-ninth. So in some sense, the information is best distributed in the weight coding approach.

Table 2 also gives some error rates from performing nearest-neighbor classifications on the three-class problem; a sketch of this protocol follows below. In 100 runs per approach, three representatives per class were chosen randomly out of the 140 views. These representatives were transformed into feature space by calculating the dot product with the trained features. Each of the remaining views was then assigned to the closest representative in feature space. The error rates were calculated on the basis of wrong class assignments. The results strongly depend on the chosen representatives, causing a high standard deviation. Nevertheless, we confirmed with a t-test that the error rate of the weight coding is significantly lower (with p = 0.001) than that of the other approaches. This supports the claim of an increased discriminative component of the weight coding features. The coefficient coding shows on average the worst performance because forcing each class to use a distinct set of features prevents the development of discriminative properties. Note that projecting the image views onto a feature space that is simply a complete rotation of the original orthogonal basis system would not influence the result of a nearest-neighbor classifier. The reason for the differences shown is that the used algorithms produce a nonorthogonal basis with a reduced dimension.

To show that the better performance of the weight coding is not simply caused by the higher degree of sparseness of its features, an additional test was performed. The features of the NMF were trained again, putting additional sparsity constraints on them following the method proposed in Hoyer (2004). Each NMF feature was thus ensured to have the same L2 norm (1.0) as each weight coding feature and the average L1 norm of the weight coding features. In this way, the error rate of the NMF decreased from 0.2695 to 0.2457 but is still 4% higher than that of the weight coding. The mutual information increased from 2.5708 to 2.9853, compared to 3.4068 for the weight coding. Although these results show that the sparsity of the weights has indeed some influence on the performance, the main difference is caused by the supervised component of the weight coding.

For a second scenario, we acquired the HRI-10 database, which consists of 10 classes. A single class contains 9 similar objects, each made up of 100 views taken during a rotation around the vertical axis (see Figure 5). Five objects are used for the training of the features and the remaining four objects for testing. We trained 80 features per approach on the gray-scale images, which were scaled to a resolution of 40 × 40 pixels. When training on the full rotation, we observed that NMF and weight coding produced very sparse, blob-like features that showed identical classification performance, while the weight coding features had slightly higher mutual information.
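A sketch of the nearest-neighbor protocol used for Tables 2 and 3 follows. The Euclidean metric in feature space and the averaging over runs match the description above, while the function and variable names are ours.

```python
import numpy as np

def nn_error_rate(F, labels, reps_per_class, rng):
    """One evaluation run: F (I x P) holds the feature-space projections
    (dot products of the views with the trained features)."""
    classes = np.unique(labels)
    rep_idx = np.concatenate([
        rng.choice(np.flatnonzero(labels == q), reps_per_class, replace=False)
        for q in classes])
    test_idx = np.setdiff1d(np.arange(len(labels)), rep_idx)
    # assign each remaining view to the closest representative in feature space
    d = np.linalg.norm(F[test_idx, None, :] - F[None, rep_idx, :], axis=2)
    pred = labels[rep_idx][d.argmin(axis=1)]
    return float(np.mean(pred != labels[test_idx]))

# e.g., mean and standard deviation over 100 random draws of representatives:
# errs = [nn_error_rate(F, labels, 3, np.random.default_rng(s)) for s in range(100)]
```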
Figure 5: HRI-10 database. The database consists of 10 classes (animal, bottle, box, brush, can, car, cup, duck, phone, and tool), each containing 9 similar objects. The first 5 objects per class are used for training and the remaining 4 objects for testing. Each object is represented by 100 views covering a full rotation around the vertical axis.
Only with such sparse features were the approaches able to reconstruct the wide variety of images. Because of this problem, we reduced the complexity of the task by using only views taken from −35 to +35 degrees from the first side view. Alternatively, we could have increased the number of features, but this would heavily increase the computation time and allow the standard NMF to produce even sparser features, while the sparsity constraint on the coefficients could prevent this development for the weight coding.

The features trained on the simplified database are shown in Figure 6. The influence of the sparsity term was set to γ = 0.05, the influence of the coefficient term was α = 5, and the influence of the weight term was β = 0.4.
Figure 6: Features trained on the second scenario with the different approaches (nonnegative sparse coding, coefficient coding, weight coding, and NMF). The features for each approach are arranged from top left to bottom right by decreasing mutual information. At the bottom, some selected features of weight coding and NMF are shown to highlight qualitative differences between the two approaches. As for the first scenario, the features of nonnegative sparse coding and coefficient coding are very holistic and view specific, and little difference between them can be revealed. Weight coding and NMF again produce parts-based features, but this time the difference between the two approaches is not as obvious as for the first scenario. A careful look shows some qualitative differences, which are discussed in the text.
Again, the features of the nonnegative sparse coding and the coefficient coding are holistic and similar to each other. Both NMF and weight coding produce sparser features. Although the difference between NMF and weight coding is not that obvious this time, some qualitative differences can be found. The first selected weight coding feature again shows a separated handle that cannot be found among the NMF features. The first selected NMF feature is a car feature that looks very boxlike.
Table 3: Mutual Information for the Second Scenario.

                      NNSC     CC       WC        NMF
Mutual information    6.5459   6.2302   10.7695   8.8732
Mean error rate       0.2465   0.2537   0.1896    0.2074
Standard deviation    0.0330   0.0338   0.0325    0.0316

Notes: The table lists the mutual information conveyed by the pool of features about the 10 classes (see Ullman & Bart, 2004). Also, some error rates are given performing 100 nearest-neighbor classifications per approach.
The weight coding does not tolerate this and therefore makes a strong distinction between car and box features. That the weight coding features are not always sparser than the NMF features is shown by the example of the can opener (fifth selected feature). Also interesting is the selected NMF cup feature that seems to be split into two features by the weight coding. A reason may be that the upper part of the rim is a very specific pattern for all cups, while the bright opening occurs in only a certain cup and is represented because this cup would otherwise cause a very high reconstruction cost.

The evaluation of mutual information and classification rate was performed on the four test objects per class, keeping the procedure of the first scenario. The results are shown in Table 3. This time, five representatives per class were chosen randomly out of the 240 views that covered the limited rotation of the four objects as described above. The weight coding has a higher mutual information than the NMF, and both approaches are superior to nonnegative sparse coding and coefficient coding. The error rate in the nearest-neighbor experiment is for the weight coding 2% lower than that of NMF and 6% lower than that of coefficient coding and nonnegative sparse coding. The high standard deviation is again caused by the random selection of representatives, but the improved mean error rate of the weight coding features is significant applying a t-test with p = 0.001. When adjusting the sparsity of each NMF feature to the average sparsity of the weight coding features, the error rate decreased from 0.2074 to 0.2032. This is still significantly higher than the 0.1896 of the weight coding. The mutual information increased from 8.8732 to 9.6584, compared to the 10.7695 of the weight coding. So again, the performance is improved to some degree by the sparsity but does not reach the weight coding results. Also, the arrangement of the features in Figure 6 indicates that very sparse features normally provide lower information gain. This was also observed by Ullman and Bart (2004), who discovered the superiority of features with intermediate complexity.
Despite the simplicity of using segmented views of objects, the two classification problems are suitable to show that the features of the weight coding are more class specific and diagnostic than the object templates produced by sparse coding. More complex scenarios would have increased the computational cost drastically (e.g., by requiring the approaches to be invariant to position and size of the objects), while we would expect the same qualitative differences.

5 Conclusion

In this letter, two new class-specific extensions of the nonnegative sparse coding were introduced. Normally, unsupervised generative feature learning methods spend resources modeling details of the data that are irrelevant for classification tasks. The goal of extending the cost function of the nonnegative sparse coding with discriminative components was to shift the focus of some resources from frequently occurring parts to diagnostic ones, in this way increasing the suitability of the trained features for classification tasks. It was shown that the coefficient coding does not increase the discriminative quality of the features because it prevents multiple classes from influencing the same weight. This is due to the fact that the coefficient coding restricts the use of features by different classes, whereas the weight coding directly penalizes the suitability of features for different classes and so successfully combines reconstruction and discrimination. The weight coding is related to the Fisher linear discriminant and the MRDF but does not reduce the intraclass variance. Its advantage is that it learns localized, parts-based features.

We used an artificial two-dimensional setting to introduce sparse coding to newcomers to the field and to visualize the different behavior of the new approaches. Furthermore, we showed for two object scenarios that the weight coding results in qualitatively different features from those produced by NMF or nonnegative sparse coding. This goes along with higher mutual information of the features and increased classification performance.

To test the difficulty of the scenarios, we used our features with a nearest-neighbor classifier (NNC) and compared the performance to that of a single-layer perceptron (SLP). In this way, we always get a lower classification rate for our approaches. When using some categories as clutter for testing and evaluating the false-positive rate by means of a receiver operating characteristic, weight coding performs slightly better. But those results are more a reflection of the different nature of SLP and NNC than of the quality of the weight coding features. Mutual information is an unbiased measure of the discriminative quality of a feature learning method, whereas the classification rate is influenced by the auxiliary method used to calculate it, for example, the NNC. A comparison of different features is only fair when sticking to the same classifier, because otherwise it is not clear which component caused the difference. For the same reason, the NNC results are not comparable with those of superior, standard classification schemes such as an SLP or a GMM.
However, the question may arise as to how, in principle, a discriminative method like an SLP could benefit from the combination with a reconstructive component. Purely discriminative approaches suffer from the drawback that they may overspecialize on the training scenario; they perfectly learn in which way a class differs from the negative examples present during training. If these examples do not cover well the variations expected during testing, the overspecialization impairs the ability of the classifier to reject unseen clutter images, because they may not differ from the class in the learned features. Keeping reconstructive information means keeping information on what the class is, regardless of which other classes existed during training. This gives the classifier the chance to reject a test image based on the inability to reconstruct it.

Recent studies suggest that generative methods perform better when training data are limited (Raina et al., 2003) because they converge much faster. As more training data become available, discriminative models take the lead by reaching a lower asymptotic error (Ng & Jordan, 2002). There is also biological evidence for this process, as outlined in Logothetis and Sheinberg (1996). When a new object is being learned, holistic snapshots are stored, keeping as much information as possible. With increasing familiarity of the stimulus, prototypes are generated keeping only meaningful, discriminative parts, enabling generalization over nonmeaningful parts. In relation to this, weight coding features are useful for building a representation of objects that are somewhere between novel and familiar, by moving away from a full reconstruction of the stimulus to a prototypical representation focusing more on the diagnostic object parts. Weight coding can provide a basis for building an efficient object representation, a prerequisite for robust and fast object recognition.

Appendix A: Nonnegative Sparse Coding

The minimization of the cost function in equation 2.1 is done by alternately applying coefficient and weight steps as described in Wersing and Körner (2003). In the coefficient step, the cost function is minimized with respect to the c_ip using an asynchronous fixed-point search, keeping the w_p constant. To do that, the derivative of E_S with respect to a certain c_ip is set to zero, leading to the update rule

$$c_{ip} := \sigma\!\left(\Big(\mathbf{w}_p^T \mathbf{x}_i - \sum_{\tilde p \neq p} c_{i\tilde p}\, \mathbf{w}_{\tilde p}^T \mathbf{w}_p - \gamma\Big)\big(\mathbf{w}_p^T \mathbf{w}_p\big)^{-1}\right), \tag{A.1}$$
where σ(·) = max(0, ·) ensures the positivity of the coefficients. This update rule is applied to randomly chosen c_ip until convergence.
The weight step is a single gradient step with a fixed step size η in the w_p, keeping the c_ip constant:

$$\mathbf{w}_p := \sigma\!\left(\mathbf{w}_p - \eta \sum_i \Big(\Big(\sum_{\tilde p} c_{i\tilde p}\, \mathbf{w}_{\tilde p}\Big) c_{ip} - \mathbf{x}_i c_{ip}\Big)\right). \tag{A.2}$$
The weight step is executed for each w_p at the same time, and σ(·) is applied component-wise. Before the next coefficient step, the weights are normalized using

$$\mathbf{w}_p := \frac{\mathbf{w}_p}{\|\mathbf{w}_p\|_2}. \tag{A.3}$$
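Putting A.1 to A.3 together, the alternating optimization can be sketched as follows. The learning rate, iteration counts, and initialization are assumptions of ours, since the appendix leaves them unspecified.

```python
import numpy as np

def nn_sparse_coding(X, P, gamma, eta=0.01, outer=200, inner=50, seed=0):
    """Nonnegative sparse coding by alternating the coefficient step (A.1),
    the projected gradient weight step (A.2), and renormalization (A.3)."""
    rng = np.random.default_rng(seed)
    I, K = X.shape
    W = rng.uniform(0.0, 1.0, (P, K))
    W /= np.linalg.norm(W, axis=1, keepdims=True)
    C = np.zeros((I, P))
    for _ in range(outer):
        G = W @ W.T                      # Gram matrix w_p~^T w_p
        B = X @ W.T                      # w_p^T x_i for all i, p
        for _ in range(inner):           # asynchronous fixed-point search (A.1)
            for p in rng.permutation(P):
                resid = B[:, p] - C @ G[:, p] + C[:, p] * G[p, p]
                C[:, p] = np.maximum(0.0, (resid - gamma) / G[p, p])
        R = C @ W                        # weight step (A.2): grad_p = sum_i (r_i - x_i) c_ip
        W = np.maximum(0.0, W - eta * (C.T @ (R - X)))
        W /= np.maximum(np.linalg.norm(W, axis=1, keepdims=True), 1e-12)  # (A.3)
    return W, C
```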
Appendix B: Coefficient Coding

The optimization of the cost function, equation 3.1, is nearly the same as for the nonnegative sparse coding. Only the update rule for the coefficients changes into

$$c_{ip} := \sigma\!\left(\Big(\mathbf{w}_p^T \mathbf{x}_i - \sum_{\tilde p \neq p} c_{i\tilde p}\, \mathbf{w}_{\tilde p}^T \mathbf{w}_p - \gamma - \frac{\alpha}{n_{q(i)}} \sum_{\tilde\imath:\, q(\tilde\imath) \neq q(i)} \frac{c_{\tilde\imath p}}{n_{q(\tilde\imath)}}\Big)\big(\mathbf{w}_p^T \mathbf{w}_p\big)^{-1}\right), \tag{B.1}$$
while the update of the weights follows exactly equations A.2 and A.3.

Appendix C: Weight Coding

The weight term of the cost function, equation 3.2, has an effect only on the weight step, and so the update rule for the coefficients remains equation A.1, while the gradient step in the weights becomes

$$\mathbf{w}_p := \sigma\!\left(\mathbf{w}_p - \eta \left(\sum_i \Big(\sum_{\tilde p} c_{i\tilde p}\, \mathbf{w}_{\tilde p}\Big) c_{ip} - \sum_i \mathbf{x}_i c_{ip} + \beta \sum_{\substack{i,\tilde\imath \\ q(i) \neq q(\tilde\imath)}} \frac{\mathbf{x}_i \big(\mathbf{w}_p^T \mathbf{x}_{\tilde\imath}\big)}{n_{q(i)}\, n_{q(\tilde\imath)}}\right)\right), \tag{C.1}$$
followed by the normalization of the weights, equation A.3.
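The extra gradient contribution in equation C.1 again reduces to class means. A sketch of the modified weight step, under the same assumptions as the previous listing:

```python
import numpy as np

def weight_step_wc(X, W, C, labels, eta, beta):
    """Projected gradient step of eq. C.1: the base gradient of A.2 plus the
    beta term, written via the class means xbar_q (see the rewritten E_W)."""
    Q = int(labels.max()) + 1
    Xbar = np.stack([X[labels == q].mean(axis=0) for q in range(Q)])  # (Q, K)
    G = Xbar @ W.T                                                    # (Q, P): w_p^T xbar_q
    S = G.sum(axis=0)                                                 # (P,) sum over classes
    # beta * sum over q != q~ of xbar_q (w_p^T xbar_q~), for every weight p
    extra = beta * (np.outer(S, Xbar.sum(axis=0)) - G.T @ Xbar)       # (P, K)
    W_new = np.maximum(0.0, W - eta * (C.T @ (C @ W - X) + extra))
    return W_new / np.maximum(np.linalg.norm(W_new, axis=1, keepdims=True), 1e-12)
```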
References

Duda, R. O., Hart, P. E., & Stork, D. G. (2000). Pattern classification (2nd ed.). New York: Wiley.
Eggert, J., Wersing, H., & Körner, E. (2004). Transformation-invariant representation and NMF. In Proc. IEEE International Joint Conference on Neural Networks (pp. 2535–2539). Piscataway, NJ: IEEE.
Grimes, D. B., & Rao, R. P. N. (2005). Bilinear sparse coding for invariant vision. Neural Computation, 17(1), 47–73.
Hoyer, P. (2002). Non-negative sparse coding. In Proc. IEEE Workshop on Neural Networks for Signal Processing (pp. 557–565). Piscataway, NJ: IEEE.
Hoyer, P. (2004). Non-negative matrix factorization with sparseness constraints. Journal of Machine Learning Research, 5, 1457–1469.
Kwak, N., & Choi, C.-H. (2002). Input feature selection by mutual information based on Parzen window. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(12), 1667–1671.
Lee, D. D., & Seung, H. S. (1999). Learning the parts of objects by non-negative matrix factorization. Nature, 401, 788–791.
Logothetis, N. K., & Sheinberg, D. L. (1996). Visual object recognition. Annual Review of Neuroscience, 19, 577–621.
McLachlan, G., & Peel, D. (2000). Finite mixture models. New York: Wiley.
Nayar, S. K., Nene, S. A., & Murase, H. (1996). Real-time 100 object recognition system. In Proc. IEEE Conference on Robotics and Automation (Vol. 3, pp. 2321–2325). Piscataway, NJ: IEEE.
Ng, A. Y., & Jordan, M. I. (2002). On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. In T. G. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14. Cambridge, MA: MIT Press.
Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381, 607–609.
Raina, R., Shen, Y., Ng, A. Y., & McCallum, A. (2003). Classification with hybrid generative/discriminative models. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15. Cambridge, MA: MIT Press.
Talukder, A., & Casasent, D. (1998). Classification and pose estimation of objects using nonlinear features. In Proc. SPIE Applications and Science of Computational Intelligence (Vol. 3390, pp. 12–23). Bellingham, WA: SPIE.
Turk, M., & Pentland, A. (1991). Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3(1), 71–86.
Ullman, S., & Bart, E. (2004). Recognition invariance obtained by extended and invariant features. Neural Networks, 17(1), 833–848.
Ulusoy, I., & Bishop, C. M. (2005). Generative versus discriminative methods for object recognition. In Proc. IEEE International Conference on Computer Vision and Pattern Recognition (pp. 258–265). Piscataway, NJ: IEEE.
Wersing, H., & Körner, E. (2003). Learning optimized features for hierarchical models of invariant object recognition. Neural Computation, 15(7), 1559–1588.
Received October 14, 2005; accepted October 7, 2006.
Communicated by Klaus-Robert Müller
LETTER
SVDD-Based Pattern Denoising Jooyoung Park [email protected]
Daesung Kang [email protected] Department of Control and Instrumentation Engineering, Korea University, Jochiwon, Chungnam, 339-700, Korea
Jongho Kim [email protected] Mechatronics and Manufacturing Technology Center, Samsung Electronics Co., Ltd., Suwon, Gyeonggi, 443-742, Korea
James T. Kwok [email protected]
Ivor W. Tsang [email protected] Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong
The support vector data description (SVDD) is one of the best-known one-class support vector learning methods, in which one tries the strategy of using balls defined on the feature space in order to distinguish a set of normal data from all other possible abnormal objects. The major concern of this letter is to extend the main idea of SVDD to pattern denoising. Combining the geodesic projection to the spherical decision boundary resulting from the SVDD together with solving the preimage problem, we propose a new method for pattern denoising. We first solve SVDD for the training data and then, for each noisy test pattern, obtain its denoised feature by moving its feature vector along the geodesic on the manifold to the nearest decision boundary of the SVDD ball. Finally, we find the location of the denoised pattern by obtaining the preimage of the denoised feature. The applicability of the proposed method is illustrated by a number of toy and real-world data sets.

1 Introduction

Recently, the support vector learning method has become a viable tool in the area of intelligent systems (Cristianini & Shawe-Taylor, 2000; Schölkopf & Smola, 2002).

Neural Computation 19, 1919–1938 (2007)
C 2007 Massachusetts Institute of Technology
Among the important application areas for support vector learning, we have the one-class classification problems (Campbell & Bennett, 2001; Crammer & Chechik, 2004; Lanckriet, El Ghaoui, & Jordan, 2003; Laskov, Schäfer, & Kotenko, 2004; Müller, Mika, Rätsch, Tsuda, & Schölkopf, 2001; Pekalska, Tax, & Duin, 2003; Rätsch, Mika, Schölkopf, & Müller, 2002; Schölkopf, Platt, & Smola, 2000; Schölkopf, Platt, Shawe-Taylor, Smola, & Williamson, 2001; Schölkopf & Smola, 2002; Tax, 2001; Tax & Duin, 1999). In one-class classification problems, we are given only the training data for the normal class, and after the training phase is finished, we are required to decide whether each test vector belongs to the normal or the abnormal class. One-class classification problems are often called outlier detection problems or novelty detection problems. Obvious examples of this class include fault detection for machines and intrusion detection systems for computers (Schölkopf & Smola, 2002).

One of the best-known support vector learning methods for the one-class problems is the SVDD (support vector data description; Tax, 2001; Tax & Duin, 1999). In the SVDD, balls are used for expressing the region for the normal class. Among the methods having the same purpose as the SVDD are the so-called one-class SVM of Schölkopf and others (Rätsch et al., 2002; Schölkopf et al., 2001; Schölkopf et al., 2000), the linear programming method of Campbell and Bennett (2001), the information-bottleneck-principle-based optimization approach of Crammer and Chechik (2004), and the single-class minimax probability machine of Lanckriet et al. (2003). Since balls on the input domain can express only a limited class of regions, the SVDD in general enhances its expressive power by utilizing balls on the feature space instead of balls on the input domain.

In this letter, we extend the main idea of the SVDD toward use for the problem of pattern denoising (Kwok & Tsang, 2004; Mika et al., 1999; Schölkopf et al., 1999). Combining the movement to the spherical decision boundary resulting from the SVDD together with a solver for the preimage problem, we propose a new method for pattern denoising that consists of the following steps. First, we solve the SVDD for the training data consisting of the prototype patterns. Second, for each noisy test pattern, we obtain its denoised feature by moving its feature vector along the geodesic to the spherical decision boundary of the SVDD ball on the feature space. Finally, in the third step, we recover the location of the denoised pattern by obtaining the preimage of the denoised feature following the strategy of Kwok and Tsang (2004).

The remaining parts of this letter are organized as follows. In section 2, preliminaries are provided regarding the SVDD. Our main results on pattern denoising based on the SVDD are presented in section 3. In section 4, the applicability of the proposed method is illustrated by a number of toy and real-world data sets. Finally, in section 5, concluding remarks are given.
2 Preliminaries

The SVDD method, which approximates the support of objects belonging to the normal class, is derived as follows (Tax, 2001; Tax & Duin, 1999). Consider a ball B with center a ∈ R^d and radius R, and the training data set D consisting of objects x_i ∈ R^d, i = 1, ..., N. The main idea of SVDD is to find a ball that can achieve two conflicting goals (it should be as small as possible and contain as many training data as possible) simultaneously by solving

$$\min\; L_0(R^2, \mathbf{a}, \xi) = R^2 + C \sum_{i=1}^N \xi_i \quad \text{s.t.}\quad \|\mathbf{x}_i - \mathbf{a}\|^2 \leq R^2 + \xi_i,\; \xi_i \geq 0,\; i = 1, \ldots, N. \tag{2.1}$$
Here, the slack variable ξ_i represents the penalty associated with the deviation of the ith training pattern outside the ball, and C is a trade-off constant controlling the relative importance of each term. The dual problem of equation 2.1 is

$$\max_{\alpha}\; \sum_{i=1}^N \alpha_i \langle \mathbf{x}_i, \mathbf{x}_i \rangle - \sum_{i=1}^N \sum_{j=1}^N \alpha_i \alpha_j \langle \mathbf{x}_i, \mathbf{x}_j \rangle \quad \text{s.t.}\quad \sum_{i=1}^N \alpha_i = 1,\; \alpha_i \in [0, C],\; i = 1, \ldots, N. \tag{2.2}$$
In order to express more complex decision regions in R^d, one can use the so-called feature map φ: R^d → F and balls defined on the feature space F. Proceeding similarly to the above and utilizing the kernel trick ⟨φ(x), φ(z)⟩ = k(x, z), one can find the corresponding feature-space SVDD ball B_F in F. Moreover, from the Kuhn-Tucker condition, its center can be expressed as

$$\mathbf{a}_F = \sum_{i=1}^N \alpha_i\, \phi(\mathbf{x}_i), \tag{2.3}$$
and its radius R_F can be computed by utilizing the distance between a_F and any support vector x on the ball boundary:

$$R_F^2 = k(\mathbf{x}, \mathbf{x}) - 2 \sum_{i=1}^N \alpha_i\, k(\mathbf{x}_i, \mathbf{x}) + \sum_{i=1}^N \sum_{j=1}^N \alpha_i \alpha_j\, k(\mathbf{x}_i, \mathbf{x}_j). \tag{2.4}$$
In this letter, we always use the gaussian kernel k(x, z) = exp(−||x − z||²/s²), and so k(x, x) = 1 for each x ∈ R^d. Finally, note that in this case, the SVDD
formulation is equivalent to

$$\min_{\alpha}\; \sum_{i=1}^N \sum_{j=1}^N \alpha_i \alpha_j\, k(\mathbf{x}_i, \mathbf{x}_j) \quad \text{s.t.}\quad \sum_{i=1}^N \alpha_i = 1,\; \alpha_i \in [0, C],\; i = 1, \ldots, N, \tag{2.5}$$
and the resulting criterion for normality is

$$f_F(\mathbf{x}) = R_F^2 - \|\phi(\mathbf{x}) - \mathbf{a}_F\|^2 \geq 0. \tag{2.6}$$
3 Main Results

In SVDD, the objective is to find the support of the normal objects; anything outside the support is viewed as abnormal. In the feature space, the support is expressed by a reasonably small ball containing a reasonably large portion of the φ(x_i)'s. The main idea of this letter is to utilize the ball-shaped support on the feature space for correcting test inputs distorted by noise. More precisely, with the trade-off constant C set appropriately,¹ we can find a region where the normal objects without noise generally reside. When an object (which was originally normal) is given as a test input x in a distorted form, the network resulting from the SVDD is supposed to judge that the distorted object x does not belong to the normal class. The role of the SVDD has been conventional up to this point, and the problem of curing the distortion might be thought of as beyond the scope of the SVDD.

In this letter, we go one step further and move the feature vector φ(x) of the distorted test input x to the point Qφ(x) lying on the surface of the SVDD ball B_F so that it can be tailored enough to be normal (see Figure 1). Given that all the points in the input space are mapped to a manifold in the kernel-induced feature space (Burges, 1999), the movement is along the geodesic on this manifold, and so the point Qφ(x) can be considered the geodesic projection of φ(x) onto the SVDD ball. Of course, since the movement starts from the distorted feature φ(x), there are plenty of reasons to believe that the tailored feature Qφ(x) still contains essential information about the original pattern. We claim that the tailored feature Qφ(x) is the denoised version of the feature vector φ(x). Pertinent to this claim is the discussion of Ben-Hur, Horn, Siegelmann, and Vapnik (2001) on support vector clustering.

¹ In our experiments for noisy handwritten digits, C = 1/(N × 0.2) was used for the purpose of denoising.
Figure 1: Basic idea for finding the denoised feature vector Qφ(x) by moving along the geodesic.
the input space, can separate into several components, each enclosing a separate cluster of normal data points, and can generate cluster boundaries of arbitrary shapes. These arguments, together with an additional step for finding the preimage of Qφ(x), comprise our proposal for a new denoising strategy. In the following, we present the proposed method more precisely, with mathematical details.

The proposed method consists of three steps. First, we solve the SVDD,
equation 2.5, for the given prototype patterns D = {x_i ∈ R^d | i = 1, …, N}. As a result, we find the optimal α_i's along with a_F and R_F² obtained via equations 2.3 and 2.4. Second, we consider each test pattern x. When the decision function f_F of equation 2.6 yields a nonnegative value for x, the test input is accepted as normal, and the denoising process is bypassed, with Qφ(x) set equal to φ(x). Otherwise, the test input x is considered to be abnormal and distorted by noise. To recover the denoised pattern, we move its feature vector φ(x) along the geodesic defined on the manifold in the feature space, toward the SVDD ball B_F, up to the point where it touches the ball. In principle, any kernel can be used here. However, as we will show, closed-form solutions can be obtained when stationary kernels (such as the gaussian kernel) are used; a stationary kernel k(x, x′) is one that depends only on x − x′. In this case, all the points are mapped onto the surface of a ball in the feature space, and we can see from Figure 2 that the point Qφ(x) is the ultimate destination of this movement. For readers' convenience, we also include a three-dimensional drawing (see Figure 3) to clarify Figure 2. In the following, the proposed method is presented only for the gaussian kernel, where all the points are mapped to the unit ball in the feature space. Extension to other stationary kernels is straightforward.

In order to find Qφ(x), it is necessary to solve the following series of subproblems:
Figure 2: Proposed denoising procedure when a stationary kernel is used.
• To find the separating hyperplane H_F. From Figure 2, it is clear that for SVDD problems utilizing stationary kernels, the center a_F of the SVDD ball has the same direction as the weight vector of the separating hyperplane H_F. In particular, when the gaussian kernel is used, the hyperplane H_F can be represented by

$$2\langle a_F, \phi(x) \rangle = 1 + \|a_F\|^2 - R_F^2. \tag{3.1}$$

Further information needed for identifying the location of Qφ(x) includes the vectors βa_F and δa_F and the distance γ shown in Figure 2.

• To find the vector βa_F. As shown in Figure 2, the vector βa_F lies on the hyperplane H_F. Thus, it should satisfy equation 3.1, that is,

$$2\langle a_F, \beta a_F \rangle = 1 + \|a_F\|^2 - R_F^2. \tag{3.2}$$

Therefore, we have

$$\beta = \frac{1 + \|a_F\|^2 - R_F^2}{2\|a_F\|^2} = \frac{1 + \alpha^T K \alpha - R_F^2}{2\alpha^T K \alpha}, \tag{3.3}$$
Figure 3: Denoised feature vector Qφ(x) shown in a (hypothetical) three-dimensional feature space. Here, since the chosen kernel is stationary, the projected feature vector Qφ(x), as well as the feature vector φ(x), should lie on a ball centered at the origin of the feature space. Also note that the location of Qφ(x) should be at the boundary of the intersection of the ball surface and the SVDD ball, which is colored black.
where α = [α₁ ⋯ α_N]^T and K is the kernel matrix with entries K_ij = k(x_i, x_j).

• To find the distance γ. Since Qφ(x) is on the surface of the unit ball, we have ‖Qφ(x)‖² = 1. Also, from the Pythagorean theorem, ‖βa_F‖² + γ² = ‖Qφ(x)‖² holds. Hence, we have

$$\gamma = \sqrt{1 - \beta^2 \|a_F\|^2} = \sqrt{1 - \beta^2 \alpha^T K \alpha}. \tag{3.4}$$

• To find the vector δa_F. Since Pφ(x) = φ(x) + δa_F should lie on the hyperplane H_F, it should satisfy equation 3.1. Thus, the following holds:

$$2\langle a_F, \phi(x) + \delta a_F \rangle = 1 + \|a_F\|^2 - R_F^2. \tag{3.5}$$

Hence, we have

$$\delta = \frac{1 + \|a_F\|^2 - R_F^2 - 2\langle a_F, \phi(x) \rangle}{2\|a_F\|^2} = \frac{1 + \alpha^T K \alpha - R_F^2 - 2 k_x^T \alpha}{2\alpha^T K \alpha}, \tag{3.6}$$

where k_x = [k(x, x₁), …, k(x, x_N)]^T.
• To find the denoised feature vector Qφ(x). From Figure 2, we see that

$$Q\phi(x) = \beta a_F + \frac{\gamma}{\|\phi(x) + (\delta - \beta) a_F\|} \left( \phi(x) + (\delta - \beta) a_F \right). \tag{3.7}$$

Note that with

$$\lambda_1 = \frac{\gamma}{\|\phi(x) + (\delta - \beta) a_F\|} \tag{3.8}$$

and

$$\lambda_2 = \beta + \frac{\gamma (\delta - \beta)}{\|\phi(x) + (\delta - \beta) a_F\|}, \tag{3.9}$$

the above expression for Qφ(x) can be further simplified into

$$Q\phi(x) = \lambda_1 \phi(x) + \lambda_2 a_F, \tag{3.10}$$

where λ₁ and λ₂ can be computed from γ:

$$\lambda_1 = \frac{\gamma}{\sqrt{1 + 2(\delta - \beta) k_x^T \alpha + (\delta - \beta)^2 \alpha^T K \alpha}}, \tag{3.11}$$

$$\lambda_2 = \beta + \frac{\gamma (\delta - \beta)}{\sqrt{1 + 2(\delta - \beta) k_x^T \alpha + (\delta - \beta)^2 \alpha^T K \alpha}}. \tag{3.12}$$
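Since every quantity above reduces to kernel evaluations, the projection coefficients can be computed directly from α, K, and k_x. The following is a sketch transcribing equations 3.3, 3.4, 3.6, 3.11, and 3.12 (function names are ours):

```python
import numpy as np

def denoise_coefficients(kx, alpha, K, R2):
    """Coefficients (lambda1, lambda2) of equation 3.10, via equations 3.3-3.12.

    kx    : vector [k(x, x_1), ..., k(x, x_N)] for the noisy test input x
    alpha : SVDD dual solution; K : kernel matrix; R2 : squared radius.
    A gaussian kernel is assumed, so that ||phi(x)||^2 = k(x, x) = 1.
    """
    aKa = alpha @ K @ alpha                                    # ||a_F||^2
    beta = (1.0 + aKa - R2) / (2.0 * aKa)                      # equation 3.3
    gamma = np.sqrt(1.0 - beta ** 2 * aKa)                     # equation 3.4
    delta = (1.0 + aKa - R2 - 2.0 * kx @ alpha) / (2.0 * aKa)  # equation 3.6
    # ||phi(x) + (delta - beta) a_F||, expanded with the kernel trick
    norm = np.sqrt(1.0 + 2.0 * (delta - beta) * (kx @ alpha)
                   + (delta - beta) ** 2 * aKa)
    lam1 = gamma / norm                                        # equation 3.11
    lam2 = beta + gamma * (delta - beta) / norm                # equation 3.12
    return lam1, lam2
```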
Obviously, the movement from the feature φ(x) to Qφ(x) is along the geodesic toward the noise-free normal class and can thus be interpreted as performing denoising in the feature space. With this interpretation in mind, the feature vector Qφ(x) will be called the denoised feature of x in this letter.

In the third and final step, we try to find the preimage of the denoised feature Qφ(x). If the inverse map φ⁻¹ : F → R^d were well defined and available, this final step of obtaining the denoised pattern via x̂ = φ⁻¹(Qφ(x)) would be trivial. However, the exact preimage typically does not exist (Mika et al., 1999). Thus, we need to seek an approximate solution instead. For this, we follow the strategy of Kwok and Tsang (2004), which uses a simple relationship between feature-space distance and input-space distance (Williams, 2002) together with MDS (multidimensional scaling) (Cox & Cox, 2001). Using the kernel trick and the simple relation in equation 3.10, we see that ⟨Qφ(x), φ(x_i)⟩ can be easily computed as follows:

$$\langle Q\phi(x), \phi(x_i) \rangle = \lambda_1 k(x_i, x) + \lambda_2 \sum_{j=1}^{N} \alpha_j k(x_i, x_j). \tag{3.13}$$
Thus, the feature-space distance between Qφ(x) and φ(x_i) can be obtained by plugging equation 3.13 into

$$\tilde{d}^2(Q\phi(x), \phi(x_i)) = \|Q\phi(x) - \phi(x_i)\|^2 = 2 - 2\langle Q\phi(x), \phi(x_i) \rangle. \tag{3.14}$$
Now, note that for the gaussian kernel, the following simple relationship holds between d(x_i, x_j) = ‖x_i − x_j‖ and d̃(φ(x_i), φ(x_j)) = ‖φ(x_i) − φ(x_j)‖ (Williams, 2002):

$$\tilde{d}^2(\phi(x_i), \phi(x_j)) = \|\phi(x_i) - \phi(x_j)\|^2 = 2 - 2k(x_i, x_j) = 2 - 2\exp(-\|x_i - x_j\|^2/s^2) = 2 - 2\exp(-d^2(x_i, x_j)/s^2). \tag{3.15}$$
Since the feature-space distance d̃²(Qφ(x), φ(x_i)) is now available from equation 3.14 for each training pattern x_i, we can easily obtain the corresponding input-space distance between the desired approximate preimage x̂ of Qφ(x) and each x_i. Generally, the distances to neighbors are the most important in determining the location of any point. Hence, we here consider only the squared input-space distances between Qφ(x) and its n nearest neighbors {φ(x_(1)), …, φ(x_(n))} ⊂ D_F, and define

$$d^2 = [d_1^2, d_2^2, \dots, d_n^2]^T, \tag{3.16}$$
where d_i is the input-space distance between the desired preimage of Qφ(x) and x_(i). In MDS (Cox & Cox, 2001), one attempts to find a representation of the objects that preserves the dissimilarities between each pair of them. Thus, we can use the MDS idea to embed Qφ(x) back into the input space. For this, we first take the average of the training data {x_(1), …, x_(n)} ⊂ D to get their centroid x̄ = (1/n) Σ_{i=1}^{n} x_(i), and construct the d × n matrix

$$X = [x_{(1)}, x_{(2)}, \dots, x_{(n)}]. \tag{3.17}$$
Here, we note that by defining the n × n centering matrix H = I_n − (1/n)1_n 1_n^T, where I_n = diag[1, …, 1] ∈ R^{n×n} and 1_n = [1, …, 1]^T ∈ R^{n×1}, the matrix XH centers the x_(i)'s at their centroid:

$$XH = [x_{(1)} - \bar{x}, \dots, x_{(n)} - \bar{x}]. \tag{3.18}$$
The next step is to define a coordinate system in the column space of XH. When XH is of rank q, we can obtain the SVD (singular value decomposition) (Moon & Stirling, 2000) of the d × n matrix XH as

$$XH = [U_1 \; U_2] \begin{bmatrix} \Lambda_1 & 0 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} V_1^T \\ V_2^T \end{bmatrix} = U_1 \Lambda_1 V_1^T = U_1 Z, \tag{3.19}$$

where U_1 = [e_1, …, e_q] is the d × q matrix with orthonormal columns e_i, and Z = Λ_1 V_1^T = [z_1, …, z_n] is a q × n matrix whose columns z_i are the projections of x_(i) − x̄ onto the e_j's. Note that

$$\|x_{(i)} - \bar{x}\|^2 = \|z_i\|^2, \quad i = 1, \dots, n, \tag{3.20}$$
and collect these into an n-dimensional vector:

$$d_0^2 = [\|z_1\|^2, \dots, \|z_n\|^2]^T. \tag{3.21}$$
The location of the preimage x̂ is obtained by requiring d²(x̂, x_(i)), i = 1, …, n, to be as close as possible to the values in equation 3.16; thus, we need to solve the LS (least squares) problem

$$d^2(\hat{x}, x_{(i)}) \approx d_i^2, \quad i = 1, \dots, n. \tag{3.22}$$
Now, following the steps of Kwok and Tsang (2004) and Gower (1968), ẑ ∈ R^{q×1} defined by x̂ − x̄ = U_1 ẑ can be shown to satisfy

$$\hat{z} = -\frac{1}{2} \Lambda_1^{-1} V_1^T \left( d^2 - d_0^2 \right). \tag{3.23}$$
Therefore, by transforming equation 3.23 back to the original coordinate system in the input space, the location of the recovered denoised pattern turns out to be

$$\hat{x} = U_1 \hat{z} + \bar{x}. \tag{3.24}$$
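Putting equations 3.13 through 3.24 together, a sketch of the preimage step might look as follows; it assumes the λ₁ and λ₂ of the projection step and a gaussian kernel of width s (all names are ours):

```python
import numpy as np

def preimage(x, X, alpha, lam1, lam2, s, n=10):
    """Approximate preimage of Q.phi(x), following equations 3.13 to 3.24."""
    k = lambda A, B: np.exp(-((A[:, None, :] - B[None, :, :]) ** 2).sum(-1) / s ** 2)
    K, kx = k(X, X), k(x[None, :], X)[0]
    # Feature-space squared distances to training points, (3.13)-(3.14).
    dtilde2 = 2.0 - 2.0 * (lam1 * kx + lam2 * (K @ alpha))
    # Invert (3.15): d^2 = -s^2 log(1 - d~^2 / 2).
    d2_all = -s ** 2 * np.log(np.clip(1.0 - dtilde2 / 2.0, 1e-12, None))
    nbr = np.argsort(d2_all)[:n]                 # the n nearest neighbors, (3.16)
    d2 = d2_all[nbr]
    Xn = X[nbr].T                                # d x n matrix X, (3.17)
    xbar = Xn.mean(axis=1, keepdims=True)
    XH = Xn - xbar                               # centered columns, (3.18)
    U, sig, Vt = np.linalg.svd(XH, full_matrices=False)   # SVD, (3.19)
    q = int((sig > 1e-10).sum())                 # numerical rank of XH
    U1, sig1, V1t = U[:, :q], sig[:q], Vt[:q]
    Z = sig1[:, None] * V1t                      # projections z_i, (3.19)
    d02 = (Z ** 2).sum(axis=0)                   # ||z_i||^2, (3.21)
    zhat = -0.5 * (V1t @ (d2 - d02)) / sig1      # least-squares solution, (3.23)
    return U1 @ zhat + xbar[:, 0]                # recovered pattern, (3.24)
```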
4 Experiments

In this section, we compare the performance of the proposed method with other denoising methods on toy and real-world data sets. For simplicity, we denote the proposed method by SVDD.
4.1 Toy Data Set. We first use a toy example to illustrate the proposed method and compare its reconstruction performance with PCA. The setup is similar to that in Mika et al. (1999). Eleven clusters of samples are generated by first choosing 11 independent sources randomly in [−1, 1]^10 and then drawing samples uniformly from translations of [−σ₀, σ₀]^10 centered at each source. For each source, 30 points are generated to form the training data and 5 points to form the clean test data. Normally distributed noise, with variance σ₀² in each component, is then added to each clean test data point to form the corrupted test data.

We carried out SVDD (with C = 1/(N × 0.6)) and PCA for the training set, and then performed reconstructions of each corrupted test point using both the proposed SVDD-based method (with neighborhood size n = 10) and the standard PCA method (the corresponding Matlab program is posted online at http://cie.korea.ac.kr/ ac lab/pro01.html). The procedure was repeated for different numbers of principal components in PCA and for different values of σ₀. For the width s of the gaussian kernel, we used s² = 2 × 10 × σ₀², as in Mika et al. (1999). From the simulations, we found that when the input space dimensionality d is low (as in this example, where d = 10), applying the proposed method iteratively (i.e., recursively applying the denoising to the previous denoised results) can improve the performance. We compared the results of our method (with 100 iterations) to those of the PCA-based method using the mean squared error (MSE), which is defined as
$$\text{MSE} = \frac{1}{M} \sum_{k=1}^{M} \|t_k - \hat{t}_k\|^2, \tag{4.1}$$
where M is the number of test patterns, t_k is the kth clean test pattern, and t̂_k is the denoised result for the kth noisy test pattern. Table 1 shows the ratio MSE_PCA/MSE_SVDD; ratios larger than one indicate that the proposed SVDD-based method performs better than the other one. Simulations were also performed for a two-dimensional version of the toy example (see Figure 4a), and the denoised results are shown in Figures 4b and 4c. For PCA, we used only one eigenvector (if two eigenvectors were used, the result would just be a change of basis and thus not useful). The observed MSE values for the reconstructions using the proposed and PCA-based methods were 0.0192 and 0.1902, respectively.

From Table 1 and Figures 4b and 4c, one can see that in the considered examples, the proposed method yielded better performance than the PCA-based method. The reason seems to be that these examples basically deal
Table 1: Comparison of MSE Ratios After Reconstructing the Corrupted Test Points in R^10.

             #EV=1    #EV=3    #EV=5    #EV=7    #EV=9
σ0 = 0.05    193.6    91.90    37.37    12.44    2.528
σ0 = 0.10    49.36    23.85    10.51    4.667    2.421
σ0 = 0.15    22.71    11.34    5.613    3.241    2.419
σ0 = 0.20    13.13    6.853    3.854    2.705    2.392

Note: Performance ratios MSE_PCA/MSE_SVDD larger than one indicate how much better SVDD did compared to PCA, for different choices of σ0 and different numbers of principal components (#EV) used in the PCA reconstruction.
Figure 4: A two-dimensional version of the toy example (with σ0 = 0.15) and its denoised results. Lines join each corrupted point (denoted +) with its reconstruction (denoted o). For SVDD, s² = 2 × 2 × σ0² and C = 1/(N × 0.6). (a) Training data (denoted •) and corrupted test data (denoted +). (b) Reconstruction using the proposed method (with 100 iterations), along with the resultant SVDD balls. (c) Reconstruction using the PCA-based method, where one principal component was used.
Figure 5: Sample USPS digit images. (a) Clean. (b) With gaussian noise (σ² = 0.3). (c) With gaussian noise (σ² = 0.6). (d) With salt-and-pepper noise (p = 0.3). (e) With salt-and-pepper noise (p = 0.6).
with clustering-type tasks, so any reconstruction method directly utilizing projection onto low-dimensional linear manifolds would be inefficient.

4.2 Handwritten Digit Data. In this section, we report the denoising results on the USPS digit database (http://www.kernel-machines.org), which consists of 16 × 16 handwritten images of the digits 0 to 9. We first normalized each feature value to the range [0, 1]. For each digit, we randomly chose 60 examples to form the training set and 100 examples as the test set (see Figure 5). Two types of additive noise were added to the test set. The first is gaussian noise N(0, σ²) with variance σ², and the second is the so-called salt-and-pepper noise with noise level p, where p/2 is the probability that a pixel flips to black or white. Denoising was applied to each digit separately. The width s of the gaussian kernel is set to

$$s_0^2 = \frac{1}{N(N-1)} \sum_{i=1}^{N} \sum_{j=1}^{N} \|x_i - x_j\|^2, \tag{4.2}$$

the average squared distance between training patterns. The value of C was set such that the support for the normal class resulting from the SVDD covers approximately 80% (= 100% − 20%) of the training data. Finally, in the third step, we used n = 10 neighbors to recover the denoised pattern x̂ by solving the preimage problem.
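A sketch of these two data-driven choices (the function name is ours; ν = 0.2 corresponds to the roughly 80% coverage just described):

```python
import numpy as np

def default_svdd_params(X, nu=0.2):
    """Kernel width s0^2 of equation 4.2 and the trade-off constant C.

    nu is the target fraction of training data left outside the support;
    C = 1/(N * nu) then covers roughly 100(1 - nu)% of the data.
    """
    N = len(X)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    s0_sq = d2.sum() / (N * (N - 1))   # average squared pairwise distance
    return s0_sq, 1.0 / (N * nu)
```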
Figure 6: SNRs of the denoised USPS images. (Top) Gaussian noise with variance σ². (a) σ² = 0.6. (b) σ² = 0.5. (c) σ² = 0.4. (Bottom) Salt-and-pepper noise with noise level p. (d) p = 0.6. (e) p = 0.5. (f) p = 0.4.
The proposed approach is compared with the following standard methods:

• Kernel PCA denoising, using the preimage finding method in Mika et al. (1999)
• Kernel PCA denoising, using the preimage finding method in Kwok and Tsang (2004)
• Standard (linear) PCA
• Wavelet denoising (using the Wavelet Toolbox in Matlab)
For wavelet denoising, the image is first decomposed into wavelet coefficients using the discrete wavelet transform (Mallat, 1999). These wavelet coefficients are then compared with a given threshold value, and those that are close to zero are shrunk so as to remove the effect of noise in the data. The denoised image is then reconstructed from the shrunken wavelet coefficients by using the inverse discrete wavelet transform. The choice of the threshold value can be important to denoising performance. In the experiments, we use two standard methods to determine the threshold: VisuShrink (Donoho, 1995) and SureShrink (Donoho & Johnstone, 1995). Moreover, the Symlet6 wavelet basis, with two levels of decomposition, is used. The methods of Mika et al. (1999) and Kwok and Tsang (2004) are both based on kernel PCA and require the number of eigenvectors as a
Figure 7: Sample denoised USPS images. (Top two rows) Gaussian noise (σ² = 0.6). (a) SVDD. (b) KPCA (Mika et al., 1999). (c) KPCA (Kwok & Tsang, 2004). (d) PCA. (e) Wavelet (VisuShrink). (f) Wavelet (SureShrink). (Bottom two rows) Salt-and-pepper noise (p = 0.6). (g) SVDD. (h) KPCA (Mika et al., 1999). (i) KPCA (Kwok & Tsang, 2004). (j) PCA. (k) Wavelet (VisuShrink). (l) Wavelet (SureShrink).
predetermined parameter. In the experiments, the number of principal components is varied from 5 to 60 (the maximum number of PCA components that can be obtained on this data set). For SVDD, we set C = 1/(Nν) with ν set to 0.2. For denoising using MDS and SVDD, 10 nearest neighbors are used to perform preimaging.
Figure 8: SNRs of the denoised USPS images when 300 samples are chosen from each digit for training. (Top) Gaussian noise with variance σ². (a) σ² = 0.6. (b) σ² = 0.5. (c) σ² = 0.4. (Bottom) Salt-and-pepper noise with noise level p. (d) p = 0.6. (e) p = 0.5. (f) p = 0.4.
To quantitatively evaluate the denoising performance, we used the average signal-to-noise ratio (SNR) over the test set images, where the SNR is defined as

$$\text{SNR} = 10 \log_{10} \frac{\operatorname{var}(\text{clean image})}{\operatorname{var}(\text{clean image} - \text{new image})}$$

in decibels (dB). Figure 6 shows the (average) SNR values obtained for the various methods. SVDD always achieves the best performance. When more PCs are used, the performance of denoising using kernel PCA increases, while the performance of PCA first increases and then decreases as some noisy PCs are included, which corrupts the resultant images. Note that one advantage of the wavelet denoising methods is that they do not require training; a corresponding disadvantage is that they cannot utilize the training set, and so neither performs well here. Samples of the denoised images that correspond to the best setting of each method are shown in Figure 7.
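In code, the per-image SNR is a one-liner (a sketch; the function name is ours, and the reported numbers average this quantity over the test set):

```python
import numpy as np

def snr_db(clean, denoised):
    """SNR of one image pair: 10 log10(var(clean) / var(clean - denoised))."""
    return 10.0 * np.log10(clean.var() / (clean - denoised).var())
```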
Figure 9: SNR results of the proposed method when varying each of ν, the width of the gaussian kernel (s), and the neighborhood size for MDS (n). (Top) Gaussian noise with different σ²'s. (a) Varying ν. (b) Varying s as a factor of s0 in equation 4.2. (c) Varying n. (Bottom) Salt-and-pepper noise with different p's. (d) Varying ν. (e) Varying s as a factor of s0 in equation 4.2. (f) Varying n.
As the performance of denoising using kernel PCA appears to improve with the number of PCs, we also experimented with a larger data set so that even more PCs could be used. Here, we followed the same experimental setup except that 300 (instead of 60) examples were randomly chosen from each digit to form the training set. Figure 8 shows the SNR values for the various methods. On this larger data set, denoising using kernel PCA does perform better than the others when a suitable number of PCs is chosen. This demonstrates that the proposed denoising procedure is comparatively more effective on small training sets.

In order to investigate the robustness of the proposed method, we also performed experiments using the 60-example training set for a wide range of ν, the width of the gaussian kernel (s), and the neighborhood size for MDS (n). Results are reported in Figure 9. The proposed method shows robust performance around the range of parameters used.

In the previous experiments, denoising was applied to each digit separately, which means one must know what the digit is before applying denoising. To investigate how well the proposed method denoises when the true digit is unknown, we follow the same setup but combine all the digits (with a total of 600 digits) for training. Results are shown in Figures 10 and 11. From visual inspection, one can see that performance is slightly inferior to that of the separate-digit case. Again, SVDD is still the best, though kernel PCA using the preimage method in Kwok and Tsang (2004) sometimes achieves better results as more PCs are included.
Figure 10: SNRs of the denoised USPS images. Here, the 10 digits are combined during training. (Top) Gaussian noise with variance σ². (a) σ² = 0.6. (b) σ² = 0.5. (c) σ² = 0.4. (Bottom) Salt-and-pepper noise with noise level p. (d) p = 0.6. (e) p = 0.5. (f) p = 0.4.
Figure 11: Sample denoised USPS images. The 10 digits are combined during training. Recall that wavelet denoising does not use the training set, so its denoising results are the same as those in Figure 7 and are not shown here. (Top) Gaussian noise (σ² = 0.6). (a) SVDD. (b) KPCA (Mika et al., 1999). (c) KPCA (Kwok & Tsang, 2004). (d) PCA. (Bottom) Salt-and-pepper noise (p = 0.6). (e) SVDD. (f) KPCA (Mika et al., 1999). (g) KPCA (Kwok & Tsang, 2004). (h) PCA.
5 Conclusion

We have addressed the problem of pattern denoising based on the SVDD. Along with a brief review of the SVDD, we presented a new denoising method that uses the SVDD, the geodesic projection of the noisy point to the surface of the SVDD ball in the feature space, and a method for finding the preimage of the denoised feature vectors. Work yet to be done includes more extensive comparative studies, which will reveal the strengths and weaknesses of the proposed method, and refinement of the method for better denoising.

References

Ben-Hur, A., Horn, D., Siegelmann, H. T., & Vapnik, V. (2001). Support vector clustering. Journal of Machine Learning Research, 2, 125–137.
Burges, C. J. C. (1999). Geometry and invariance in kernel based methods. In A. J. Smola, P. L. Bartlett, B. Schölkopf, & D. Schuurmans (Eds.), Advances in kernel methods—Support vector learning. Cambridge, MA: MIT Press.
Campbell, C., & Bennett, K. P. (2001). A linear programming approach to novelty detection. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 395–401). Cambridge, MA: MIT Press.
Cox, T. F., & Cox, M. A. A. (2001). Multidimensional scaling (2nd ed.). London: Chapman & Hall.
Crammer, K., & Chechik, G. (2004). A needle in a haystack: Local one-class optimization. In Proceedings of the Twenty-First International Conference on Machine Learning. Banff, Alberta, Canada.
Cristianini, N., & Shawe-Taylor, J. (2000). An introduction to support vector machines and other kernel-based learning methods. Cambridge: Cambridge University Press.
Donoho, D. L. (1995). De-noising by soft-thresholding. IEEE Transactions on Information Theory, 41(3), 613–627.
Donoho, D. L., & Johnstone, I. M. (1995). Adapting to unknown smoothness via wavelet shrinkage. Journal of the American Statistical Association, 90(432), 1200–1224.
Gower, J. C. (1968). Adding a point to vector diagrams in multivariate analysis. Biometrika, 55(3), 582–585.
Kwok, J. T., & Tsang, I. W. (2004). The pre-image problem in kernel methods. IEEE Transactions on Neural Networks, 15(6), 1517–1525.
Lanckriet, G. R. G., El Ghaoui, L., & Jordan, M. I. (2003). Robust novelty detection with single-class MPM. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15 (pp. 905–912). Cambridge, MA: MIT Press.
Laskov, P., Schäfer, C., & Kotenko, I. (2004). Intrusion detection in unlabeled data with quarter-sphere support vector machines. In Proceedings of Detection of Intrusions and Malware and Vulnerability Assessment (DIMVA) 2004 (pp. 71–82). Dortmund, Germany.
Mallat, S. G. (1999). A wavelet tour of signal processing (2nd ed.). San Diego, CA: Academic Press.
Mika, S., Schölkopf, B., Smola, A. J., Müller, K.-R., Scholz, M., & Rätsch, G. (1999). Kernel PCA and de-noising in feature space. In M. S. Kearns, S. Solla, & D. Cohn (Eds.), Advances in neural information processing systems, 11 (pp. 536–542). Cambridge, MA: MIT Press.
Moon, T. K., & Stirling, W. C. (2000). Mathematical methods and algorithms for signal processing. Upper Saddle River, NJ: Prentice Hall.
Müller, K.-R., Mika, S., Rätsch, G., Tsuda, K., & Schölkopf, B. (2001). An introduction to kernel-based learning algorithms. IEEE Transactions on Neural Networks, 12(2), 181–201.
Pekalska, E., Tax, D. M. J., & Duin, R. P. W. (2003). One-class LP classifiers for dissimilarity representations. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15 (pp. 761–768). Cambridge, MA: MIT Press.
Rätsch, G., Mika, S., Schölkopf, B., & Müller, K.-R. (2002). Constructing boosting algorithms from SVMs: An application to one-class classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(9), 1184–1199.
Schölkopf, B., Mika, S., Burges, C., Knirsch, P., Müller, K.-R., Rätsch, G., & Smola, A. J. (1999). Input space vs. feature space in kernel-based methods. IEEE Transactions on Neural Networks, 10(5), 1000–1017.
Schölkopf, B., Platt, J. C., Shawe-Taylor, J., Smola, A. J., & Williamson, R. C. (2001). Estimating the support of a high-dimensional distribution. Neural Computation, 13(7), 1443–1471.
Schölkopf, B., Platt, J. C., & Smola, A. J. (2000). Kernel method for percentile feature extraction (Tech. Rep. MSR-TR-2000-22). Redmond, WA: Microsoft Research.
Schölkopf, B., & Smola, A. J. (2002). Learning with kernels. Cambridge, MA: MIT Press.
Tax, D. (2001). One-class classification. Unpublished doctoral dissertation, Delft University of Technology.
Tax, D., & Duin, R. (1999). Support vector data description. Pattern Recognition Letters, 20(11–13), 1191–1199.
Williams, C. K. I. (2002). On a connection between kernel PCA and metric multidimensional scaling. Machine Learning, 46(1–3), 11–19.
Received August 31, 2005; accepted October 30, 2006.
LETTER
Communicated by James Theiler
Feature Selection via Coalitional Game Theory

Shay Cohen
[email protected]
School of Computer Sciences, Tel-Aviv University, Tel-Aviv, Israel

Gideon Dror
[email protected]
Department of Computer Sciences, Academic College of Tel-Aviv-Yaffo, Tel-Aviv, Israel

Eytan Ruppin
[email protected]
School of Computer Sciences, Tel-Aviv University, Tel-Aviv, Israel, and Department of Computer Sciences, Academic College of Tel-Aviv-Yaffo, Tel-Aviv, Israel
We present and study the contribution-selection algorithm (CSA), a novel algorithm for feature selection. The algorithm is based on the multiperturbation Shapley analysis (MSA), a framework that relies on game theory to estimate feature usefulness. The algorithm iteratively estimates the usefulness of features and selects them accordingly, using either forward selection or backward elimination. It can optimize various performance measures over unseen data, such as accuracy, balanced error rate, and area under the receiver-operator-characteristic curve. Empirical comparison with several other existing feature selection methods shows that the backward elimination variant of CSA leads to the most accurate classification results on an array of data sets.

Neural Computation 19, 1939–1961 (2007)
© 2007 Massachusetts Institute of Technology

1 Introduction

Feature selection refers to the problem of selecting input variables, otherwise called features, that are relevant to predicting a target value for each instance in a data set. Feature selection can be used to rank all potentially relevant input variables or to build a good classifier, and each task may lead to a different methodological approach (Blum & Langley, 1997; Kohavi & John, 1997). Feature selection has several potential benefits: defying the curse of dimensionality to enhance prediction performance, reducing measurement and storage requirements, reducing training and prediction times, providing a better understanding of the process that generated the data, and allowing data visualization. This letter focuses on the first issue: selecting input variables in an attempt to maximize the performance of a classifier on previously unseen data. Clearly, this would not necessarily produce the most compact set of features.

Feature selection is a search problem, where each state in the search space corresponds to a subset of features. Exhaustive search is usually intractable, and methods to explore the search space efficiently must be employed. These methods are often divided into two main categories: filter methods and subset selection methods. The algorithms in the first category rank each feature according to some measure of association with the target, such as mutual information, Pearson correlation, or the χ² statistic, and features with high-ranking values are selected. A major disadvantage of filter methods is that they are performed independently of the classifier, and it is not guaranteed that the same set of features is optimal for all classifiers. In addition, most filter methods disregard the dependencies between features, as each feature is considered in isolation.

The second category, the subset selection methods, has two types of algorithms. Embedded algorithms select the features through the process of generating the classifier, such as regularization methods like grafting (Perkins, Lacker, & Theiler, 2003), Gram-Schmidt methods (e.g., Stoppiglia, Dreyfus, Dubois, & Oussar, 2003; Rivals & Personnaz, 2003), or methods specific to support vector machines (e.g., Weston et al., 2000; Guyon, Weston, Barnhill, & Vapnik, 2002). Wrapper algorithms treat the induction algorithm as a black box and interact with it in order to search for an appropriate feature set using search algorithms such as genetic algorithms or hill climbing (Kohavi & John, 1997). Although wrapper methods are successful in feature selection, they may be computationally expensive, because they require repeatedly retraining a classifier on data with a large number of features. (For a survey of current methods in feature selection, see Guyon & Elisseeff, 2003.)

In this letter, we recast the problem of feature selection in the context of coalitional games, a notion from game theory. This perspective yields an iterative algorithm for feature selection, the contribution-selection algorithm (CSA), intent on optimizing the performance of the classifier on unseen data. The algorithm combines the filter and wrapper approaches, where the features are reranked on each step by using the classifier as a black box. The ranking is based on the Shapley value (Shapley, 1953), a well-known concept from game theory, to estimate the importance of each feature for the task at hand, specifically taking into account interactions between features. Due to combinatorial constraints, the Shapley value cannot be calculated precisely and is estimated by the multiperturbation Shapley analysis (MSA) (Keinan, Sandbank, Hilgetag, Meilijson, & Ruppin, 2004, 2006). Furthermore, since the classifier is trained and tested extensively, the classifier used by CSA must be fast in both the training and testing phases. This requirement can be moderated by parallel processing.

Throughout the letter, we use the following notation. Three disjoint sets containing independently and identically distributed (i.i.d.) sampled
instances of the form (x_k, y_k) are denoted by T, V, and S, representing the training set, validation set, and test set, respectively, where x_k ∈ R^n denotes the kth instance and y_k is the target class value associated with it. Given an induction algorithm and a set of features S ⊆ {1, …, n}, f_S(x) stands for a classifier constructed from the training set using the induction algorithm, after its input variables were narrowed down to the ones in S. Namely, f_S(x) labels each instance of the form (x_{i_1}, …, x_{i_|S|}), i_j ∈ S, 1 ≤ j ≤ |S|, with an appropriate class value. The task of feature selection is to choose a subset S of the input variables that maximizes the performance of the classifier on the test set. In what follows, we focus on optimizing classifier accuracy, although we could as easily optimize other performance measures such as the area under the receiver operator characteristic or the balanced error rate.

The rest of this letter is organized as follows. Section 2 introduces the necessary background from game theory and justifies the use of game theory concepts for the task of feature selection. It also provides a detailed description of the CSA algorithm. Section 3 provides an empirical comparison of CSA with several other feature selection methods on artificial and real-world data sets, accompanied by an analysis of the results, showing that the backward elimination version of CSA is significantly superior to the other feature selection methods considered. Section 4 discusses the empirical results and provides further insights into the success and failure of the backward elimination version of the CSA algorithm. Section 5 summarizes the results.

2 Classification as a Coalitional Game

Cooperative game theory introduces the concept of coalitional games, in which a set of players is associated with a real function that denotes the payoff achieved by different subcoalitions in a game. Formally, a coalitional game is defined by a pair (N, v), where N = {1, …, n} is the set of all players and v(S), for every S ⊆ N, is a real number associating a worth with the coalition S. Game theory further pursues the question of representing the contribution of each player to the game by constructing a value function, which assigns a real value to each player. The values correspond to the contribution of the players in achieving a high payoff. The contribution value calculation is based on the Shapley value (Shapley, 1953).

An intuitive example of the potential use of the Shapley value is a scenario of a production machine in a factory composed of numerous components. During its operation, the machine undergoes various malfunctions from time to time. In each such malfunction, the normal activity of a subset of its components may be shut down, resulting in a reduction in the machine's productivity and output. Based on annual data of these multicomponent failures and their associated production drops, the Shapley value provides a fair and efficient way to distribute the responsibility for the machine's failure among its individual components, identifying the
ones needing the most attention and maintenance, while considering their possibly intricate functional interactions. In reference to feature selection, the machine is analogous to the predictor and its components to the classification features. The task of feature selection then involves identifying the features contributing most to the classification at hand.

The Shapley value is defined as follows. Let the marginal importance of player i to a coalition S, with i ∉ S, be

$$\Delta_i(S) = v(S \cup \{i\}) - v(S). \tag{2.1}$$

Then the Shapley value is defined by the payoff

$$\Phi_i(v) = \frac{1}{n!} \sum_{\pi \in \Pi} \Delta_i(S_i(\pi)), \tag{2.2}$$
where Π is the set of permutations over N and S_i(π) is the set of players appearing before the ith player in permutation π. The Shapley value of a player is thus a weighted mean of its marginal value, averaged over all possible subsets of players.

Transforming these game theory concepts into the arena of feature selection, in which one attempts to estimate the contribution of each feature in generating a classifier, the players are mapped to the features of a data set, and the payoff is represented by a real-valued function v(S), which measures the performance of a classifier generated using the set of features S. The use of the Shapley value for feature selection may be justified by its axiomatic qualities:

Axiom 1 (normalization or Pareto optimality). For any game (N, v), it holds that Σ_{i∈N} Φ_i(v) = v(N).
In the context of feature selection, this axiom implies that the performance on the data set is divided fully between the different features.

Axiom 2 (permutation invariance or symmetry). For any (N, v) and permutation π on N, it holds that Φ_i(v) = Φ_{π(i)}(πv).
This axiom implies that the value is not altered by arbitrarily renaming or reordering the features.

Axiom 3 (preservation of carrier or dummy property). For any game (N, v) such that v(S ∪ {i}) = v(S) for every S ⊆ N, it holds that Φ_i(v) = 0.
This axiom implies that a dummy feature that does not influence the classifier's performance indeed receives a contribution value of 0.

Axiom 4 (additivity or aggregation). For any two games (N, v) and (N, w), it holds that Φ_i(v + w) = Φ_i(v) + Φ_i(w), where (v + w)(S) = v(S) + w(S).

This axiom applies to a combination of two different payoffs based on the same set of features. For a classification task, these may be, for example, accuracy and area under the receiver operator characteristic curve, or false-positive rate and false-negative rate. In this case, the Shapley value of a feature that measures its contribution to the combined performance measure is the sum of the corresponding Shapley values. The linearity of the Shapley value is a consequence of this property: if the payoff function v is multiplied by a real number α, then all Shapley values are scaled by α, that is, Φ_i(αv) = αΦ_i(v). In other words, multiplying the performance measure by a constant does not change the ranking of the features, a vital property for any scheme that ranks features by their importance.

Since it was introduced, the Shapley value has been successfully applied in many fields. One of the most important applications is cost allocation, where the cost of providing a service should be shared among the different receivers of that service (Shubik, 1962; Roth, 1979; Billera, Heath, & Raanan, 1978). This use of the Shapley value has received recent attention in the context of sharing the cost of multicast routing (Feigenbaum, Papadimitriou, & Shenker, 2001). In epidemiology, the Shapley value has been used as a means to quantify the population impact of exposure factors on a disease load (Gefeller, Land, & Eide, 1998). Other fields where the Shapley value has been used include politics (starting from the strategic voting framework introduced by Shapley and Shubik, 1954), international environmental problems, and economic theory (see Shubik, 1985, for discussion and additional references).

2.1 Estimating Feature Contributions Using MSA. The calculation of the Shapley value requires summing over all possible subsets of players, which is impractical in typical feature selection problems. Keinan et al. (2005) have presented an unbiased estimator for the Shapley value by uniformly sampling permutations from Π (that estimator was introduced in a different context, the analysis of functional contributions in artificial and biological networks, but the method of estimating the Shapley value is valid for our purpose as well). Still, the estimator considers both large and small feature sets to calculate the contribution values. In our feature selection algorithm, we use the Shapley value heuristically to estimate the contribution value of a feature for the task of feature selection. Since in most realistic cases we assume that the size d of significant interactions between features is much smaller than the number of features, n, we will limit
ourselves to calculating the contribution value from permutations sampled from the whole set of players, with d being a bound on the permutation size:

$$\varphi_i(v) = \frac{1}{|\Pi_d|} \sum_{\pi \in \Pi_d} \Delta_i(S_i(\pi)), \tag{2.3}$$
where Π_d is the set of sampled permutations on subsets of size d. Limiting the extent of interactions taken into account is not uncommon in feature selection methods. Most filter methods are equivalent to using d = 1, where no feature interactions are taken into account. Explicit restriction on the level of interactions also characterizes several ensemble methods, for example, Random Forests (Breiman, 2001), where d ≈ √n is usually suggested. The use of bounded sets, coupled with the method for Shapley value estimation, yields an efficient and robust way to estimate the contribution of a feature to the task of classification. (For a detailed discussion of the MSA framework and its theoretical background, see Keinan et al., 2004, 2005. The MSA toolbox can be downloaded online from http://www.cns.tau.ac.il/ msa/.)

2.2 The Contribution-Selection Algorithm. The CSA is iterative in nature and can adopt a forward selection or a backward elimination approach. Its backward elimination version, which overall yields better prediction accuracy (see section 3.2), is described in detail in Figure 1. In the backward elimination version, using the subroutine contribution, the CSA ranks each feature according to its contribution value and then eliminates e features with the lowest contribution values (using the subroutine elimination). It repeats the phases of calculating the contribution values of the currently remaining features and eliminating further features until the contribution values of all candidate features exceed a contribution threshold ε. Forward selection CSA works in a similar manner, selecting on each iteration the s features with the highest contribution values, as long as their contribution values exceed some threshold.

The algorithm, without further specification of the contribution subroutine, is a known generalization of filter methods. However, the main idea of the algorithm is that the contribution subroutine, unlike common filter methods, returns a contribution value for each feature according to its role in improving the classifier's performance, which is generated using a specific induction algorithm and in conjunction with other features.
Figure 1: The contribution-selection algorithm in its backward elimination version. F is the input set of features. t, d, ε, and e are hyperparameters: t = |Π_d| is the number of permutations sampled (see equation 2.3), d is the maximal permutation size for calculating the contribution values, ε is a contribution value threshold, and e is the number of features eliminated in each phase. The variable c represents the set of candidate features for selection. The contribution subroutine calculates the contribution value of feature f according to equation 2.3, where f corresponds to the ith player. The elimination subroutine eliminates at most e features with the lowest contribution values that do not exceed ε. In the forward selection version, the elimination subroutine is replaced with a selection subroutine, which selects s features in each phase, and the halting criterion is changed accordingly.
Using the notation of section 2 and assuming that one maximizes the accuracy level of the classifier, the contribution subroutine for backward selection calculates the contribution values φ_i by equations 2.1 and 2.3, where the payoff function v(S) is simply the validation accuracy of the base classifier f_S(x) trained on the training set T:

$$v(S) = \frac{|\{x \mid f_S(x) = y, \; (x, y) \in V\}|}{|V|}.$$
The case S = ∅ is handled by returning the fraction of majority-class instances.

The maximal permutation size d has an important role in deciding the contribution values of the different features and should be selected in a way that ensures that different combinations of features that interact together are inspected. Its impact is demonstrated in section 3. The number of eliminated features e for the elimination subroutine controls the redundancy of the eliminated features; the higher e is, the more likely it is that correlated features with redundant contributions will be eliminated.
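The contribution and elimination subroutines can be made concrete in a few lines. The following is a minimal sketch, not the authors' implementation: payoff stands for any callable computing v(S) (for example, training the base classifier on T restricted to S and scoring it on V), and all names and tie-breaking details are our assumptions.

```python
import random

def estimate_contributions(features, payoff, d, t):
    """Estimate the contribution values of equation 2.3 by sampling t
    permutations over random subsets of size d."""
    phi = {f: 0.0 for f in features}
    count = {f: 0 for f in features}
    for _ in range(t):
        perm = random.sample(features, min(d, len(features)))
        prefix, v_prev = [], payoff([])
        for f in perm:
            v_next = payoff(prefix + [f])
            phi[f] += v_next - v_prev          # Delta_f(prefix), equation 2.1
            count[f] += 1
            prefix.append(f)
            v_prev = v_next
    return {f: phi[f] / count[f] for f in features if count[f] > 0}

def csa_backward(features, payoff, d, t, e, eps=0.0):
    """Backward elimination CSA (Figure 1): in each phase, eliminate at most
    e features with the lowest contributions not exceeding eps; halt when
    every estimated contribution exceeds eps."""
    c = list(features)
    while c:
        phi = estimate_contributions(c, payoff, d, t)
        low = sorted((f for f in c if f in phi and phi[f] <= eps),
                     key=lambda f: phi[f])[:e]
        if not low:
            break
        c = [f for f in c if f not in low]
    return c
```

With e > 1, several low-contribution features are dropped per phase, trading some redundancy control for faster convergence, mirroring the discussion above.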
Although e = 1 minimizes the redundancy dependencies of the features, increasing e accelerates the algorithm's convergence and provides some regularization, as has been verified experimentally. The algorithm's halting criterion depends on ε, which designates a trade-off between the number of selected features and the performance of the classifier on the validation set. With the backward elimination version, choosing ε = 0 means that CSA eliminates features as long as there exist features that are unlikely to improve the classifier's performance. Increasing ε has the opposite effect on the size of the final set of features. The naive halting criterion that terminates feature elimination when no further performance gain is expected, that is, when the exclusion of no feature enhances the performance, is entirely different. For example, during CSA backward elimination, there are occasionally features with negative contribution values: once they are eliminated, there is no performance improvement. Still, the removal of such features tends to increase the generalization of the classifier by the mere reduction of its complexity. Indeed, testing this naive halting criterion has verified that it leads to considerably inferior performance levels.

3 Results

3.1 Experiments with Artificial Data. In order to demonstrate the algorithm's behavior, we generated a data set that consists of nine features. The first three features are binary. The target labels are taken as the parity function of these three features. The other six features are correlated with the target by setting them to the target values and adding a random value taken from the normal distribution with expectancy 0 and standard deviation σ = 1. A simple calculation shows that the correlation between each of these six features and the target is (1 + σ²)^{−1/2} ≈ 0.71, and the correlation between the sum of these features and the target is (1 + σ²/6)^{−1/2} ≈ 0.92. The mean accuracy of a classifier that outputs the sign of the sum of these six features is 0.84, which will be considered a baseline accuracy (the accuracy of a C4.5 classifier that uses all nine features is 0.82). The data set consists of 200 training examples and 100 test examples.

Using this data set, we inspected the features selected and the performance of several feature selection methods. We used Random Forests (Breiman, 2001), mutual information, Pearson correlation, the regular backward wrapper method (eliminating one feature at a time), the regular forward wrapper method (selecting one feature at a time), and CSA in both backward elimination and forward selection, using C4.5 as the base learner. CSA was run with s = 1, e = 1, d = 3, and t = 20.² No optimization was performed on these hyperparameters.
²While CSA evaluated 540 permutations, this toy example can be solved by an exhaustive search requiring only 2⁹ − 1 = 511 evaluations.
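For concreteness, the following is a sketch of this artificial data set; the exact encoding of the binary features and of the ±1 target is our assumption:

```python
import numpy as np

def parity_toy_data(m, sigma=1.0, seed=0):
    """Nine-feature toy set of section 3.1: three binary parity features and
    six noisy copies of the target (target value + N(0, sigma^2) noise)."""
    rng = np.random.default_rng(seed)
    bits = rng.integers(0, 2, size=(m, 3))
    y = 2 * (bits.sum(axis=1) % 2) - 1          # parity target in {-1, +1}
    noisy = y[:, None] + rng.normal(0.0, sigma, size=(m, 6))
    return np.hstack([bits, noisy]), y
```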
Figure 2: Behavior of backward selection CSA on the artificial data set. (A) The accuracy of backward selection CSA as a function of n, the number of training set instances. (B) The average number of features selected as a function of n.
In all ranking methods (Random Forests, mutual information, and Pearson correlation), the first three features were ranked as the least informative. As expected, the accuracy obtained using these methods did not exceed the baseline accuracy of 0.84. Forward selection CSA, together with the backward wrapper and the forward wrapper, did not select any of the first three (most relevant) features. Yet backward elimination CSA chose the three features for classification and achieved 100% accuracy. The reason for this behavior is that the other six features were too confusing for the classifier under most algorithms. With decision trees, the first three features will always be used in deep nodes of the tree, due to the greedy criterion used by the decision tree algorithm to select a feature for node splitting. Therefore, the inclusion or removal of any one of these features will have less influence on the classifier's accuracy than any of the six features linearly correlated with the target.

We further tested the behavior of CSA with respect to the number of training instances, n. To this end, we produced training sets for different values of n and ran the algorithm on each sample. For each value of n, five different training sets were sampled and scored, and the average accuracy and average number of features selected were computed. The results for the backward selection CSA are described in Figure 2. As can be seen, for n ≥ 140, many instances of the backward elimination CSA identify the three salient features and achieve perfect categorization of the test examples.
3.2 Experiments with Real-World Data. To test CSA empirically, we ran a number of experiments on seven real-world data sets, with the number of features ranging from 278 to 20,000 (see Table 1):
Table 1: Description of Data Sets Used.

Name          Number of Classes   Number of Features   Training Set Size   Testing Set Size
Reuters1      3                   1579                 145                 145
Reuters2      3                   1587                 164                 164
Arrhythmia    2                   278                  280                 140
Internet Ads  2                   1558                 2200                800
Dexter        2                   20,000               300                 300
Arcene        2                   10,000               100                 100
I2000         2                   2000                 40                  22
• Following Koller and Sahami (1996), we constructed the Reuters1 data set from the Reuters-21578 document collection (Reuters, 1997). The data set consists of articles from the categories coffee (99 documents), iron-steel (137 documents), and livestock (54 documents). These topics do not have many overlapping words, making the task of classification easier. As a preprocessing step, we removed all words that appeared fewer than three times. Each article was then encoded into a binary vector, where each element designates whether the word appeared in the document.

• The Reuters2 data set was constructed, similarly to the Reuters1 data set, from the categories gold (68 documents), gross national product (124 documents), and reserves (136 documents). These topics are more similar and contain many overlapping words, making the task of classification harder. For both of the Reuters data sets, our splits are not identical to the ones in Koller and Sahami (1996) and contain fewer documents, because we could not obtain the exact same data set.

• The Arrhythmia database from the UCI repository (Blake & Merz, 1998). The task for this database is to distinguish between normal and abnormal heartbeat. We used a version of the data that was slightly modified by Perkins et al. (2003): features that were missing in most of the instances were removed. The data set contains 237 positive instances and 183 negative instances. It can be found online at http://nis-www.lanl.gov/∼simes/data/jmlr03.

• The Internet Advertisements database from the UCI repository (Blake & Merz, 1998) was collected for research on identifying advertisements in web pages (Kushmerick, 1999). The features in the database describe different attributes of a web page, such as the domain it was downloaded from, the domain that referred to it, and its size. There are two classes: each instance is either an advertisement or not an
advertisement. The data set contains 2581 positive instances and 419 negative instances.

• The Dexter data set from the NIPS 2003 workshop on feature selection (Guyon, 2003). The Dexter data set is a two-class text categorization data set constructed from a subset of the Reuters data set, using documents from the corporate acquisitions category. The data set is abundant with irrelevant features. (For exact details on how the data set was constructed, see Guyon, 2003.) The data set contains 300 positive instances and 300 negative instances. As a preprocessing step, we binarized each of the instances, which originally contained the word frequency in each document, and removed words that appeared fewer than three times. The validation set of the data served in the following experiments as T, the test set.

• The Arcene data set from the NIPS 2003 workshop on feature selection (Guyon, 2003) is a two-class categorization data set, describing mass spectrometry analysis of the blood serum of patients with and without a certain kind of cancer. It is affluent with features and poor with data instances. (For details on how the data set was constructed, see Guyon, 2003.) The data set contains 88 positive instances and 112 negative instances. The validation set of the data served in the following experiments as T, the test set.

• I2000, a microarray colon cancer data set by Alon et al. (1999), is a two-class categorization data set for discriminating between healthy and ill tissues in colon cancer. The data contain the expression of the 2000 genes with highest minimal intensity across 62 tissues. The data set contains 31 positive instances and 31 negative instances. It has a very high features-to-instances ratio, making the task of feature selection harder.
In principle, CSA can work with any induction algorithm L. However, due to computational constraints, we focused on fast induction algorithms, or algorithms that may be efficiently combined with CSA. We experimented with Naive Bayes, C4.5, and 1NN. For each of the data sets, we measured the training set accuracy of each classifier using tenfold cross-validation on the whole set of features. For each data set, all subsequent work used the induction algorithm L that gave the highest cross-validation accuracy, as detailed in Table 2. Nine different classification algorithms were then compared on the data sets described above:
• Classification using the induction algorithm L without performing any feature selection.
Table 2: Parameters and Classifier Used with the CSA Algorithm for Each Data Set.

Data Set      Induction Algorithm (L)   s (Forward)   e (Backward)   d    t
Reuters1      Naive Bayes               1             100            20   1500
Reuters2      Naive Bayes               1             100            20   1800
Arrhythmia    C4.5                      1             50             20   500
Internet Ads  1NN                       1             100            20   1500
Dexter        C4.5                      50            50             12   3500
Arcene        C4.5                      100           100            5    10,000
I2000         C4.5                      100           100            3    2000
Notes: s is the number of features selected in forward selection in each phase, e is the number of features eliminated in backward elimination in each phase, d is the permutation size, and t is the number of permutations sampled to estimate the contribution values. For an explanation of how hyperparameters were chosen, see the description of the backward CSA algorithm.
• Classification using soft margin linear SVM with the SVMlight package (Joachims, 1999). Data sets that had more than two classes were decomposed into several one-versus-all binary classification problems.
• Classification using L after performing feature selection by estimation of the Pearson correlation coefficient. The number of features was selected by tenfold cross-validation on the training set, each time adding the features with the highest correlation values to the current feature set and averaging the results; the set that obtained the best result was selected.
• Classification using L after performing feature selection by estimation of mutual information. For data sets with continuous domains (Arrhythmia, Internet Ads, Arcene, and I2000), we used binning to estimate the mutual information. The number of features selected was optimized as with the Pearson correlation coefficient.
• Classification using L after performing feature selection with Random Forests (Breiman, 2001). We used the randomForest library implementation for the R environment (Bengtsson, 2003). The number of features selected was optimized as with the Pearson correlation coefficient.
• Classification using L after performing feature selection with backward elimination CSA with d = 1. The number of permutations was selected to be large enough that each feature is sampled with high probability. This is equivalent to the regular wrapper technique, in which backward elimination removes the features that most degrade the accuracy of the classifier. This algorithm was chosen to check whether it is sufficient to examine each feature separately when performing feature selection on the data set.
• Classification using L after performing feature selection with forward selection CSA with d = 1. The number of permutations was selected to be large enough that each feature is sampled with high probability. This is equivalent to the regular wrapper technique, in which forward selection adds the features that most improve the accuracy of the classifier. This algorithm is analogous to the backward wrapper above.
• Classification using L after performing feature selection with backward elimination CSA and parameters as described in Table 2. The parameters d and t were chosen such that the expected number of times each feature is sampled is higher than five. This number was chosen according to error-analysis considerations for MSA (Keinan et al., 2004) and following preliminary experimentation with an artificial data set. When computation times allowed, the number of permutations sampled was much larger than this minimal value. The contribution value threshold for stopping elimination was set to 0. No hyperparameter selection was performed on d, t, or the threshold. (A sketch of the underlying sampling scheme follows this list.)
• Classification using L after performing feature selection with forward selection CSA and parameters as described in Table 2. The parameters d and t were chosen such that the expected number of times each feature is sampled is higher than five. The termination of feature selection was fixed by setting the contribution value threshold to 0. No hyperparameter selection was performed on d, t, or the threshold.
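To make the sampling scheme concrete, the following is a minimal sketch of how the contribution values might be estimated. It is our illustration, not the authors' code: the names contribution_values and make_cv_payoff are invented, scikit-learn is our choice of toolkit, and the payoff used here is the tenfold cross-validation accuracy described in the next paragraph.

```python
import random
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB

def make_cv_payoff(X, y, estimator=None, folds=10):
    """Payoff v(S): tenfold cross-validation accuracy of a classifier
    trained on the feature subset S (empty set -> majority-class rate)."""
    est = estimator if estimator is not None else BernoulliNB()
    baseline = max(np.mean(y == c) for c in np.unique(y))

    def payoff(subset):
        if not subset:
            return baseline
        return cross_val_score(est, X[:, list(subset)], y,
                               cv=folds, scoring="accuracy").mean()
    return payoff

def contribution_values(features, payoff, d, t):
    """Estimate each feature's Shapley-style contribution value from t
    sampled permutations of size d: the average marginal gain in payoff
    when the feature joins the features preceding it in a permutation."""
    total = {f: 0.0 for f in features}
    count = {f: 0 for f in features}
    for _ in range(t):
        perm = random.sample(list(features), min(d, len(features)))
        for i, f in enumerate(perm):
            total[f] += payoff(perm[:i + 1]) - payoff(perm[:i])
            count[f] += 1
    return {f: total[f] / count[f] for f in features if count[f]}
```

In a forward selection phase, the s features with the highest estimated contributions would be moved into the selected set (symmetrically, backward elimination would drop the e lowest), and the payoff would then be recomputed relative to the updated set in the next phase.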
CSA is prone to overfitting on the validation set. When the classifier's performance is always evaluated on a possibly small validation set, the curse of dimensionality appears, and irrelevant features are selected even if the classifier itself is trained using techniques that avoid overfitting. It might seem at first that evaluating the classifier each time on a different training-validation split would solve the problem. However, this leads to another problem: the classifier's performance then depends on the split, so the marginal contributions of the different features do not reflect their real value. To avoid both problems, we used tenfold cross-validation: the training set was split into ten parts, and the payoff function was evaluated by averaging the classifier's performance over the held-out parts, so that the whole training set contributes to the evaluation.

3.3 Feature Selection and Classification Results. Table 3 summarizes the classifiers' performance on the test set and the number of features selected in each of the experiments. The accuracy levels are the fraction of correctly classified test set instances.
Table 3: Accuracy Levels and Number of Features Selected in the Different Data Sets.

Data Set       No FS    SVM     Corr           MI             RF
Reuters1       84.1%    94.4%   90.3% (20)     94.4% (20)     96.3% (6)
Reuters2       81.1%    91.4%   88.4% (20)     90.2% (5)      87.2% (21)
Arrhythmia     76.4%    80%     71.4% (20)     70% (20)       80% (40)
Internet Ads   94.7%    93.5%   94.2% (15)     95.75% (70)    95.6% (10)
Dexter         92.6%    92.6%   92.6% (1240)   94% (230)      93.3% (800)
Arcene         83%      83%     83% (6600)     81% (5600)     82% (6000)
I2000          86.3%    72.7%   81.8% (260)    90.9% (1060)   86.3% (100)

Data Set       Wrapper Bwd    Wrapper Fwd   CSA Bwd        CSA Fwd
Reuters1       94.4% (35)     92.4% (7)     98.6% (51)     96.5% (10)
Reuters2       95.7% (53)     91.4% (5)     93.2% (109)    90.1% (14)
Arrhythmia     77.8% (17)     70% (5)       84.2% (21)     74.2% (28)
Internet Ads   95% (62)       -             96.1% (158)    95.6% (8)
Dexter         92.6% (653)    80% (10)      93.3% (717)    92.6% (100)
Arcene         82% (6800)     58% (7)       86% (7200)     81% (600)
I2000          86.3% (1600)   86.3% (550)   90.9% (1100)   86.3% (500)

Notes: Upper table: No FS: no feature selection; SVM: linear soft margin SVM without feature selection; Corr: feature selection using Pearson correlation; MI: feature selection using mutual information; RF: feature selection using Random Forests. Bottom table: Wrapper Bwd and Wrapper Fwd: wrapper with backward and forward selection, respectively; CSA Bwd and CSA Fwd: CSA with backward elimination and forward selection, respectively, with parameters from Table 2. Accuracy levels are calculated by counting the number of misclassified test set instances; the number of features selected is given in parentheses. A dash indicates that the algorithm selected no features (see section 3.3.4). Notice that the accuracies obtained by our algorithms on the Dexter and Arcene data sets are inferior to those of the winners of the NIPS 2003 feature selection competition.
3.3.1 Reuters1 Data Set. Feature selection using CSA with backward elimination did best, yielding an accuracy level of 98.6% with 51 features. Koller and Sahami (1996), for example, report that the Markov Blanket algorithm yields approximately 600 selected features with accuracy levels of 95% to 96% on this data set.

3.3.2 Reuters2 Data Set. Wrapper with backward elimination did best, yielding an accuracy level of 95.7% with 53 features. For comparison, Koller and Sahami (1996) report that the Markov Blanket algorithm yields approximately 600 selected features with accuracy levels of 89% to 93% on this data set.3

3. The data sets used in Koller and Sahami (1996) are not identical to the data sets we used. We were unable to obtain the same data sets and had to reconstruct them from the original Reuters-21578 collection.
3.3.3 Arrhythmia Data Set. This data set is considered a difficult one. CSA with backward elimination did best, yielding an accuracy level of 84.2% with 21 features. Forward selection with a higher depth value (d = 20) did better than with d = 1, implying that many features must be considered concomitantly to perform good feature selection on this data set. For comparison, the grafting algorithm (Perkins et al., 2003) yields an accuracy level of approximately 75% on this data set.

3.3.4 Internet Ads Data Set. All the algorithms did approximately the same, with accuracy levels between 94% and 96% and CSA slightly outperforming the others. Interestingly, with d = 1, the algorithm did not select any feature: in the first phase, for each feature checked, the 1NN algorithm had equidistant neighbors from both classes, leading to an arbitrary choice of class, so the classifier's performance was constant throughout the phase, yielding zero contribution values. With higher depth values, however, the simple 1NN algorithm was boosted to outperform classifiers such as SVM.

3.3.5 Dexter Data Set. For the Dexter data set, we used the algorithm L (C4.5 decision trees) only for the process of feature selection and linear SVM to perform the actual prediction on the features selected. This was done because C4.5 did not give satisfactory accuracy levels for any of the feature selection algorithms, and it is impractical to use SVM with CSA for large data sets. To bridge the difference between the classifier performing feature selection and the classifier used for the actual classification, we added an optimization phase for the forward selection algorithm after it stopped; in this phase, a tenfold cross-validation is performed on the data set, similar to the one used to optimize the filter methods. The simple mutual information feature selection performed best, followed closely by CSA in its backward elimination version and by Random Forests. This implies that in Dexter, the contribution of single features significantly outweighs the contribution of feature combinations for the task of classification. The forward selection algorithm did as well as linear SVM without feature selection, but with a significantly lower number of features.

3.3.6 Arcene Data Set. Here, just as in the case of Dexter, we used C4.5 for the process of feature selection and linear SVM to perform the actual prediction on the features selected. CSA with backward elimination obtained better performance than the rest of the algorithms.

3.3.7 I2000 Data Set. CSA with backward elimination, together with feature selection using mutual information, yielded the best results.
Table 4: Wilcoxon Signed-Rank Test p-Values.

         CFwd.   WBwd.   WFwd.   RF      MI      No FS   SVM     Corr.
CBwd.    0.015   0.078   0.031   0.093   0.015   0.015   0.015   0.015
CFwd.    -       0.672   0.219   0.625   0.218   0.625   0.687   0.156
WBwd.            -       0.016   0.031   0.094   0.016   0.016   0.016
WFwd.                    -       0.156   0.046   0.625   0.437   0.937
RF                               -       0.562   0.156   0.156   0.156
MI                                       -       0.078   0.437   0.109
No FS                                            -       0.812   0.812
SVM                                                      -       0.109

Notes: This table specifies the Wilcoxon signed-rank test p-values related to the results in Table 3. The entry in row i and column j specifies the p-value for testing whether method i is superior to method j. CBwd stands for CSA with backward elimination, and WBwd stands for wrapper with backward elimination; CFwd and WFwd follow similar naming. p-values were calculated using the exact distribution for n = 7 tests, which can easily be calculated by enumeration. It can be seen that the backward elimination CSA is better than the other methods tried at significance level 0.05, except for Random Forests and wrapper backward elimination, where only marginal significance is achieved. No other feature selection method was found to be significantly better than the majority of the remaining methods.
The poor performance of CSA with forward selection can be explained by the poverty of data compared to the number of features: in the first phases, the algorithm selected features that explain the training data well by coincidence and avoided features that truly contribute to the task of classification. This phenomenon is explained in section 3.5.2.

3.4 Significance of the Results. The McNemar test (Gillick & Cox, 1989) on the results summarized in Table 3 identified no significant superiority of any feature selection method on any single data set; the accuracies are too close to one another given the size of the test sets. However, in five of the seven data sets, CSA with backward elimination achieved the highest accuracy, and in the remaining cases it achieved the second-best accuracy. So although the results are not significant for each data set separately, the overall picture may suggest otherwise. To test whether the backward elimination version of CSA is indeed superior to the other feature selection algorithms, we performed a one-sided Wilcoxon signed-rank test (Kanji, 1994). This test takes into account the ranking of the feature selection methods across all data sets and tests whether the set of rankings deviates significantly from the H0 distribution, which assumes that all methods are equal. Table 4 lists the p-values of these tests. As can be seen, the backward elimination version of CSA has significantly higher performance than most of the other methods tried.
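For readers who want to reproduce this kind of comparison, here is a minimal sketch using SciPy; it is our illustration (the letter cites Kanji, 1994, not any particular software), and the accuracy vectors are copied from Table 3.

```python
from scipy.stats import wilcoxon

# Test-set accuracies on the seven data sets, in the order Reuters1,
# Reuters2, Arrhythmia, Internet Ads, Dexter, Arcene, I2000 (Table 3).
csa_bwd  = [98.6, 93.2, 84.2, 96.1, 93.3, 86.0, 90.9]
wrap_bwd = [94.4, 95.7, 77.8, 95.0, 92.6, 82.0, 86.3]

# One-sided test of whether CSA backward elimination outperforms the
# backward wrapper; with n = 7 paired differences and no ties, SciPy
# uses the exact null distribution, as in Table 4.
stat, p = wilcoxon(csa_bwd, wrap_bwd, alternative="greater")
print(f"signed-rank statistic = {stat}, one-sided p = {p:.3f}")
```

The p-value computed this way may differ slightly from the corresponding Table 4 entry, depending on which tail and statistic conventions are used.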
Figure 3: Dependency of CSA performance on its hyperparameters. (A) The accuracy of the backward elimination version of CSA for various values of d, the size of sampled subsets. As expected, for small values of d there is a substantial improvement in performance as d is increased, but beyond a certain point (here, around d = 10) the accuracy stays stable, around 83%. (B) The accuracy of backward elimination CSA for various values of t, the number of permutations sampled. The overall picture is compatible with the fact that the estimate of the contribution value becomes more robust as more samples are taken.
3.5 A Closer Inspection of the Results

3.5.1 Behavior of the Algorithm with Different Parameters. In order to examine the effect of different parameter values on the algorithm, we ran CSA on the Arrhythmia data set with different values of d (the size of the subsets analyzed) and of t (the number of permutations examined in each phase of eliminating features). The results, averaged over five experiments for each value of d and t, are shown in Figure 3. Figure 3A implies that there are optimal values of d for which the performance achieved is highest. For small values of d, not enough interactions between the different features are considered; as d increases, performance increases until d reaches a critical value, beyond which it stays stable around the same level. Figure 3B implies that the algorithm is rather robust to the number of permutations analyzed in each phase: for very small values of t, the algorithm's performance is limited, but performance grows quickly with t and then remains rather stable.

3.5.2 The Distribution of the Contribution Values. The MSA, designed to capture correctly the contribution of elements to a task, enables us to examine the distribution of the contribution values of the features. Figure 4 depicts a
log-log plot of the distribution of the contribution values in the first phase for Arrhythmia and Dexter, prior to any feature selection. This distribution follows a scale-free power law, implying that large contribution values (in absolute value) are very rare while small ones are quite common, quantitatively justifying the need for feature selection. The other data sets were also observed to possess a similar power law characteristic.

Figure 4: Power law distribution of contribution values. This log-log plot of the distribution of the contribution values (absolute value) in the first phase for Arrhythmia and Dexter, prior to any feature selection, demonstrates power law behavior. Natural logarithms are used on both axes. The p-values for the regression were 0.0047 (Arrhythmia) and 0.0032 (Dexter). The corresponding plots for the other data sets show power law characteristics with different slopes and were omitted for clarity.

The behavior of the algorithm through the process of feature selection and feature elimination is displayed in Figure 5. After the forward selection algorithm identifies the significant features in the first few phases, there is a sharp decrease in the contribution values of the features selected in the following phases, while with backward elimination there is a gradual and rather stable increase in the contribution values of the noneliminated features. The peaks in the graph of the contribution values in Figure 5A demonstrate that the contribution values change as CSA iterates: in this case, the selection of a single feature considerably increased the contribution value of another feature, pointing to intricate dependencies between features. Figures 4 and 5 also help explain why backward elimination usually outperforms several feature selection methods, including forward selection. Due to the high dimensionality of the data sets, a feature that assists in prediction merely by coincidence may be selected on account of other
truly informative features. Forward selection is penalized severely in such a case: among the few significant features, some will not be chosen. Backward elimination, however, always maintains the significant features in the noneliminated set; a feature that truly enhances the classifier's generalization will do so for the validation set as well and will not be eliminated. This leads to a more stable generalization behavior for backward elimination on the test set as the algorithm progresses (see Figure 5).

Figure 5: Prediction accuracy and feature contribution during forward selection (A) and backward elimination (B) for the Arrhythmia data set. Both panels show how the performance of the C4.5 classifier improves on the validation set as the algorithm selects (eliminates) features, while the contribution values of the selected features decrease (increase). Backward elimination generalizes better on the test set as the algorithm progresses. The behavior for the other data sets is similar.

4 Discussion

In each phase, CSA evaluates t feature sets for each coalition size in the range 1 to d, over on the order of n/e phases, leading to O(tdn/e) sets being evaluated. However, in order to obtain reliable estimates of the contribution values, td should scale linearly with n. Therefore, in practice, CSA requires O(n²) evaluations, like standard forward selection or backward elimination (wrappers). The filter methods, ranking features by their Pearson correlation (Corr) or by their mutual information (MI) with the targets, both have linear time complexity. The time complexity of Random Forests (RF) is difficult to assess: in order to get a reliable estimate of feature importance, one should increase the number of trees; the resulting trees are usually deeper, and at each node O(√n) features are considered. The O(n²) behavior of CSA poses a real challenge when using it on nontrivial problems. To cope with it, fast induction algorithms such as naive Bayes, KNN, or decision trees must be used. Running times can be further reduced by parallelizing, an advantage not shared by wrapper algorithms,
which use sequential search methods such as hill climbing: in CSA, the permutations within each phase can be computed in parallel and combined on completion to obtain an estimate of the contribution values. Furthermore, as the algorithm progresses, the number of candidate features for either selection (forward selection) or elimination (backward elimination) decreases, so the number of permutations sampled may be reduced, speeding up the algorithm significantly.

The restriction on the learning algorithm used within CSA does not apply to the prediction once the features are selected: after a set of features is found by CSA, it may be used by any induction algorithm, as demonstrated in section 3.3 with the Dexter and Arcene data sets. Several learning algorithms, such as KNN with a Euclidean metric, the naive Bayes classifier, and Fisher's linear discriminant, allow an optimization that dramatically reduces the time spent in the contribution subroutine. Without such optimizations, running times can be considerable. For example, using C4.5 as the base learning algorithm, the total running time for the Arrhythmia data set averaged 41 minutes on a 1.73 GHz Pentium 4 for backward elimination and 34 minutes for forward selection, with standard deviations of approximately 8 and 6 minutes, respectively. For comparison, the running times for the filter algorithms (mutual information and Pearson correlation) were less than 4 minutes, the forward selection and backward elimination wrapper methods took 28 and 33 minutes, respectively, and Random Forests ran in 12 minutes on the same data set.

Since the contribution value is based on extensive sampling of feature sets, the CSA algorithm is capable of identifying intricate dependencies between the features and the target. We therefore expect CSA to be effective for data sets where feature independence is strongly violated, as demonstrated in section 3.1. However, CSA may fail in certain circumstances, for example, when a large coalition must be formed to aid prediction. In such a case, this coalition may well never be sampled, and hence the contribution values of the corresponding features will not be increased; obviously, forward selection CSA is more prone to this pitfall. Furthermore, when data are scarce, overfitting can pose a real problem for CSA: the significance of contribution estimates can be rather low, and the resulting noise can play a substantial role in driving the algorithm. Further incorporation of regularization into CSA may help to deal with such situations.

5 Conclusion

The contribution-selection algorithm presented in this letter views the task of feature selection in the context of coalitional games. It uses a wrapper-like technique combined with a novel ranking method based on the Shapley contribution values of the features to the classification accuracy. CSA works in an iterative manner,
each time selecting new features (or eliminating them) while taking into account the features that were selected (or eliminated) so far. We verified that the feature sets selected by CSA are significantly different from those selected by other filter methods: the first strong features are selected by most methods, but within a few iterations, CSA selects entirely different features from the other methods, because the contribution values of the candidate features are modified, sometimes drastically, along the run of the algorithm according to the features already selected.

CSA was tested on a number of data sets, and the results show that the algorithm can improve the performance of the classifier and successfully compete with an existing array of filter and feature selection methods, especially in cases where the features interact with each other. In such cases, performing feature selection with a permutation size higher than one, that is, departing from the common greedy wrapper approach, can enhance the classifier's performance significantly. The results successfully demonstrate the value of applying game theory concepts to feature selection. While the forward selection version of the algorithm is competitive with other feature selection methods, our experiments show that overall, the backward elimination version is significantly superior to them and produces feature sets that can be used to generate a high-performing classifier.

Acknowledgments

We thank David Deutscher and Alon Keinan for their valuable comments and assistance. The MSA toolbox was developed and is maintained by David Deutscher, Alon Keinan, and Alon Kaufman.

References

Alon, U., Barkai, N., Notterman, D. A., Gish, K., Ybarra, S., Mack, D., & Levine, A. J. (1999). Broad patterns of gene expression revealed by clustering of tumor and normal colon tissues probed by oligonucleotide arrays. Proc. Natl. Acad. Sci., 96, 6745–6750.

Bengtsson, H. (2003, March). The R.oo package: Object-oriented programming with references using standard R code. In K. Hornik, F. Leisch, & A. Zeileis (Eds.), Proceedings of the 3rd International Workshop on Distributed Statistical Computing (DSC 2003). Vienna, Austria. Available online at http://www.ci.tuwien.ac.at/Conferences/DSC-2003/.

Billera, L. J., Heath, D., & Raanan, J. (1978). Internal telephone billing rates: A novel application of non-atomic game theory. Operations Research, 26, 956–965.

Blake, C., & Merz, C. (1998). UCI repository of machine learning databases. Irvine: University of California, Irvine, Department of Information and Computer Sciences. Available online at http://www.ics.uci.edu/∼mlearn/MLRepository.html.
Blum, A., & Langley, P. (1997). Selection of relevant features and examples in machine learning. Artificial Intelligence, 97(1–2), 245–271.

Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32.

Feigenbaum, J., Papadimitriou, C. H., & Shenker, S. (2001). Sharing the cost of multicast transmissions. Journal of Computer and System Sciences, 63(1), 21–41.

Gefeller, O., Land, M., & Eide, G. E. (1998). Averaging attributable fractions in the multifactorial situation: Assumptions and interpretation. J. Clin. Epidemiol., 51(5), 437–441.

Gillick, L., & Cox, S. (1989). Some statistical issues in the comparison of speech recognition algorithms. ICASSP, 1, 532–534.

Guyon, I. (2003). Design of experiments for the NIPS 2003 variable selection benchmark. Tutorial at the NIPS 2003 Workshop on Feature Extraction and Feature Selection.

Guyon, I., & Elisseeff, A. (2003). An introduction to variable and feature selection. Journal of Machine Learning Research, 3, 1157–1182.

Guyon, I., Weston, J., Barnhill, S., & Vapnik, V. (2002). Gene selection for cancer classification using support vector machines. Machine Learning, 46(1–3), 389–422.

Joachims, T. (1999). Making large-scale SVM learning practical. In B. Schölkopf, C. Burges, & A. Smola (Eds.), Advances in kernel methods: Support vector learning. Cambridge, MA: MIT Press.

Kanji, G. (1994). 100 statistical tests. Thousand Oaks, CA: Sage.

Keinan, A., Sandbank, B., Hilgetag, C., Meilijson, I., & Ruppin, E. (2004). Fair attribution of functional contribution in artificial and biological networks. Neural Computation, 16(9), 1887–1915.

Keinan, A., Sandbank, B., Hilgetag, C., Meilijson, I., & Ruppin, E. (2006). Axiomatic scalable neurocontroller analysis via the Shapley value. Artificial Life, 12, 333–352.

Kohavi, R., & John, G. H. (1997). Wrappers for feature subset selection. Artificial Intelligence, 97, 273–324.

Koller, D., & Sahami, M. (1996). Toward optimal feature selection. In Proceedings of the 13th International Conference on Machine Learning (ML) (pp. 284–292). San Francisco: Morgan Kaufmann.

Kushmerick, N. (1999). Learning to remove Internet advertisements. In Proceedings of the 3rd International Conference on Autonomous Agents. New York: Association for Computing Machinery.

Perkins, S., Lacker, K., & Theiler, J. (2003). Grafting: Fast, incremental feature selection by gradient descent in function space. Journal of Machine Learning Research, 3, 1333–1356.

Reuters. (1997). Reuters collection. Distributed for research purposes by David Lewis. Available online at http://www.daviddlewis.com/resources/testcollections/reuters21578/.

Rivals, I., & Personnaz, L. (2003). MLPs (mono-layer polynomials and multilayer perceptrons) for nonlinear modeling. Journal of Machine Learning Research, 3, 1383–1398.

Roth, A. E. (1979). Axiomatic models of bargaining. Berlin: Springer-Verlag.

Shapley, L. S. (1953). A value for n-person games. In H. W. Kuhn & A. W. Tucker (Eds.), Contributions to the theory of games (Vol. 2, pp. 307–317). Princeton, NJ: Princeton University Press.
Shapley, L. S., & Shubik, M. (1954). A method for evaluating the distribution of power in a committee system. American Political Science Review, 48(3), 787–792.

Shubik, M. (1962). Incentives, decentralized control, the assignment of joint costs and internal pricing. Management Science, 8, 325–343.

Shubik, M. (1985). Game theory in the social sciences. Cambridge, MA: MIT Press.

Stoppiglia, H., Dreyfus, G., Dubois, R., & Oussar, Y. (2003). Ranking a random feature for variable and feature selection. Journal of Machine Learning Research, 3, 1399–1414.

Weston, J., Mukherjee, S., Chapelle, O., Pontil, M., Poggio, T., & Vapnik, V. (2000). Feature selection for SVMs. In T. K. Leen, T. G. Dietterich, & K.-R. Müller (Eds.), Advances in neural information processing systems, 13 (pp. 668–674). Cambridge, MA: MIT Press.
Received August 26, 2005; accepted October 14, 2006.
LETTER
Communicated by Aapo Hyvärinen
Outliers Detection in Multivariate Time Series by Independent Component Analysis

Roberto Baragona
[email protected]
Dipartimento di Sociologia e Comunicazione, Università di Roma "La Sapienza," 00198 Roma, Italy
Francesco Battaglia
[email protected]
Dipartimento di Statistica, Probabilità e Statistiche Applicate, Università di Roma "La Sapienza," 00100 Roma, Italy
In multivariate time series, outlying data that do not fit the common pattern may often be observed. Occurrences of outliers are unpredictable events that may severely distort the analysis of the multivariate time series. For instance, model building, seasonality assessment, and forecasting may be seriously affected by undetected outliers. The dependence structure of the multivariate time series gives rise to the well-known smearing and masking phenomena that prevent the use of most outlier identification techniques. It may be noticed, however, that a convenient way of representing multiple outliers consists of superimposing a deterministic disturbance on a gaussian multivariate time series. Outliers may then be modeled as nongaussian time series components. Independent component analysis is a recently developed tool that is likely to be able to extract possible outlier patterns. In practice, independent component analysis may be used to analyze observable multivariate time series and separate regular and outlying unobservable components. It is shown that independent component analysis is a useful tool for the detection of outliers in multivariate time series in the factor models framework as well. Some algorithms that perform independent component analysis are compared. All algorithms have been found to be effective in detecting various types of outliers, such as patches, level shifts, and isolated outliers, even at the beginning or the end of the stretch of observations. Also, there is no appreciable difference in the ability of the different algorithms to display the outlying observations pattern.

Neural Computation 19, 1962–1984 (2007)
C 2007 Massachusetts Institute of Technology

1 Introduction

Independent component analysis (ICA) aims at identifying a set of independent unobservable variables that are supposed to generate the data set
of interest. An unknown mixing matrix is postulated to linearly transform the unobservable variables to produce a set of observable mixed ones. Both the unobservable variables and the mixing matrix have to be estimated from the data. ICA has been applied successfully to a variety of fields, such as biomedicine, speech, sonar and radar, signal processing, and time series. A comprehensive account of ICA theory and applications may be found in Hyvärinen, Karhunen, and Oja (2001).

A problem that arises in time series analysis is outlier detection in multivariate time series. An outlier signal may usually be described by a simple deterministic sequence. This signal is, however, difficult to recognize, as it is embedded in a data set whose dynamic structure may either mask the outlying observations or spread the disturbance to data otherwise unaffected. Therefore, the outlier problem is harder in time series than in a random sample, since the extent of the anomaly is judged with respect to its conditional distribution given the remaining observations rather than to the marginal distribution. It follows that in time series, outliers are not necessarily the largest or smallest data, as in random samples; they are data that are not locally coherent with their surrounding observations, and often they are not easily detected by simple graphical inspection. Methods that extend well-established procedures for univariate time series to the multivariate framework have been proposed by Tsay, Peña, and Pankratz (2000). There, detection of the time points where outliers occur and estimation of the outliers' magnitudes are taken into account simultaneously. The former task, however, is known to be much more difficult than the latter: if the times of the outliers' occurrences were known, estimating their magnitudes would be considerably simplified.

Galeano, Peña, and Tsay (2004) used projection pursuit techniques to find the linear combination of a multivariate time series that maximizes kurtosis, in order to best reproduce the outlying signal. Detection of the time points and estimation of the magnitudes of multiple outliers may then be accomplished by employing univariate searching methods. Here we propose ICA as a tool capable of identifying the locations of multiple outliers in multivariate time series. The reason is that outlying components have a very large kurtosis, as we explain in the next section. The only task that is left consists of estimating the outliers' magnitudes, for which several methods are available, for instance, intervention analysis (Abraham, 1980).

Suppose that we observe a contaminated multivariate time series obtained by linearly mixing some independent gaussian signals and adding, only at some fixed time points, a constant to each observed component. When the series is decomposed by ICA, the most important nongaussian component is likely to represent the outlying pattern, while the remaining independent components would be essentially similar to gaussian linear combinations of the outlier-free time series. As an illustrative example, let four time series be independently generated by the first-order autoregressive models s_it = 0.7 s_{i,t-1} + a_it, i = 1, ..., 4 and t = 1, ..., 100, where the {a_it} are sequences of independent standard gaussian variables.
Let the outlier pattern be a single outlier at time t = 50, described by the deterministic function ℓ_t = δ_{t-50}, where δ is Kronecker's delta. The mixing matrix is chosen as

A = [ 1  0  0 -1  1
      0  0  1  0  1
      0  1  0  1  1
      0  0  0  0  1
      0  0  1  0  1 ],

that is, each component is affected by an outlier of magnitude 1 at time point 50. Multiplying A by s_t = (s_1t, s_2t, s_3t, s_4t, ℓ_t)' yields the multivariate time series {x_t} displayed in Figure 1.

Figure 1: An example of multivariate time series (5 components and 100 observations) generated by linear combinations of first-order autoregressive models with an outlier in t = 50.

To ease the visual inspection of the plots, the lower and upper limits (from Chebyshev's inequality at the 5% significance level) µ_i ± 4.47σ_i, i = 1, ..., 5, are plotted as well. There is no evidence that the outlier at t = 50 occurs, as the series exhibit many values far greater than the 50th observation.
The outliers are small, since their magnitude equals the residual standard error of the autoregressive models; therefore, no univariate method is able to detect them. The ICA method, however, yields the components displayed in Figure 2. Here the outlier pattern is neatly outlined, and the outlier magnitude is far outside the upper limit. All other components are inside the limits; that is, the signal corresponding to the outlying observation is entirely included in the first component.

Figure 2: Components (estimated by the ICA) that originate the observed multivariate time series displayed in Figure 1.

This example shows that an approach based on ICA may be very effective for the detection of outliers in multivariate time series.

The rest of the letter is organized as follows. In the next section, the multivariate time series model considered for investigating the occurrence of outlying observations is introduced. In section 3, a simulation experiment is presented, and results are discussed. Applications to real time series and comparisons with alternative multivariate outlier detection methods are presented in section 4. Section 5 concludes.

2 The Multivariate Outlier Model

Let us denote by {ℓ_t} the sequence describing the outlier pattern; that is, ℓ_t = 1 if at time t = t_0 there is an outlying observation, and ℓ_t = 0
otherwise. A level change at time t = t_0 may be represented by the sequence ℓ_t = 1 if t ≥ t_0 and 0 otherwise, and an outlier patch (a sequence of consecutive outliers) is easily modeled. The perturbation series {ℓ_t} is far from normality, and its kurtosis coefficient is large in modulus. If n data are recorded and ℓ_t = 1 for q distinct times 1 ≤ t_1 < ... < t_q ≤ n, while ℓ_t = 0 for the remaining times, then letting α = q/n, the standardized ℓ_t takes the value √((1 − α)/α) with frequency q and the value −√(α/(1 − α)) with frequency n − q, and its kurtosis coefficient is

κ_4 = µ_4 − 3(σ²)² = 1/(α(1 − α)) − 6,

which is large in modulus for moderate α (the proportion of outlying observations). The plot of the kurtosis coefficient as a function of the outlier percentage is displayed in Figure 3.

Figure 3: Kurtosis coefficient of a standardized outlier signal as a function of the proportion of outlying observations.

It is symmetric around α = 0.5: the modulus of the kurtosis coefficient decreases when α increases from zero up to 0.5(1 − 1/√3) ≈ 0.211 and then increases.
Since this result does not depend on where the outliers are located, only on their number q relative to the total number of observations n, the analysis of kurtosis may also be useful in the case of a level change, unless the ratio of the two regime periods is near (√3 + 1)/(√3 − 1) ≈ 3.73.

Suppose that the observed series y_t = (y_1t ... y_ht)', t = 1, ..., n, is obtained by adding perturbations to an outlier-free time series x_t = (x_1t ... x_ht)' having a multivariate gaussian distribution with E(x_t) = 0 and E(x_t x_t') = Σ_x positive definite. Then

y_t = x_t + ω ℓ_t,

where ω = (ω_1 ... ω_h)' is the outlier magnitude. In order to apply ICA, we shall write y_t as a linear combination of independent, possibly nongaussian series. Since the variance-covariance matrix Σ_x is positive definite, the regular symmetric matrix S = Σ_x^{-1/2} exists. Let λ = Sω, and denote the standardized variables by u_t = Sx_t, so that E(u_t u_t') = I. The observed series may be written

y_t = S^{-1} u_t + S^{-1} λ ℓ_t.

Next, we define an orthonormal basis of R^h, (m_1, m_2, ..., m_h) with m_1 = λ/‖λ‖, and consider the matrix M = [m_1, m_2, ..., m_h]; then M'λ = (‖λ‖, 0, ..., 0)', and z_t = M'u_t = M'Sx_t are also independent. The observed series may be written

y_t = S^{-1} M {z_t + (‖λ‖, 0, ..., 0)' ℓ_t},

where the first component in the linear transformation, z_1t + ‖λ‖ℓ_t, is contaminated by outliers and has large kurtosis, while the other components z_jt, j = 2, ..., h, are gaussian variables. Therefore, ICA will be able to identify the demixing matrix, and the outliers' effects will be confined to just one of the extracted components. Whether such effects will be evident and clearly recoverable depends on the stochastic behavior of z_1t, which has zero mean and variance one. The outlier will therefore be clearly evident in the component if ‖λ‖ = {ω'Σ_x^{-1}ω}^{1/2} is large compared to a standard gaussian variate. However, if the x_jt's are highly autocorrelated series, then the occurrence of a perturbation in z_1t may be better recognized using a univariate outlier detection method (see, e.g., Tsay, 1988; Chen & Liu, 1993).

We now turn to a different model, which appears to be suitable for representing real data, especially in large dimensions, and allows for a factor structure and noise. Let y_t = (y_1t ... y_ht)', t = 1, ..., n, denote as before the observed data, and let ε_t = (ε_1t ... ε_ht)' denote a sequence of independent and identically distributed (i.i.d.) vector gaussian variables with mean (0, 0, ..., 0)' and variance-covariance matrix Σ_ε = diag{σ_1², σ_2², ..., σ_h²}.
We suppose that the observed time series is obtained by a linear transformation of a k-variate (unobservable) time series x_t = (x_1t ... x_kt)' with k ≪ h, plus an outlier with vector magnitude ω and pattern ℓ_t, plus a noise represented by the {ε_t} series:

y_t = A x_t + ω ℓ_t + ε_t.    (2.1)
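As a concrete illustration, a short simulation sketch of model 2.1 may help. It is ours, not the authors' code, and it borrows the dimensions and parameter values (h = 5, k = 4, AR coefficients ±0.7, ω = (3, ..., 3)', noise variance 0.04, and the bidiagonal factor matrix) from the simulation experiment of section 3.

```python
import numpy as np

rng = np.random.default_rng(0)
n, h, k = 100, 5, 4

# k unobservable AR(1) factor series x_t with unit-variance innovations
phi = np.array([0.7, -0.7, 0.7, -0.7])
x = np.zeros((n, k))
for t in range(1, n):
    x[t] = phi * x[t - 1] + rng.standard_normal(k)

A = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.5, 1.0, 0.0, 0.0],
              [0.0, 0.5, 1.0, 0.0],
              [0.0, 0.0, 0.5, 1.0],
              [0.0, 0.0, 0.0, 0.5]])      # h x k factor matrix (section 3)

omega = np.full(h, 3.0)                   # outlier magnitude vector
ell = np.zeros(n)
ell[n - 1] = 1.0                          # isolated outlier at t = 100
eps = rng.normal(scale=0.2, size=(n, h))  # noise with variance 0.04

# Model 2.1: y_t = A x_t + omega * ell_t + eps_t
y = x @ A.T + np.outer(ell, omega) + eps
```

The rows of y are the observed h-variate series; as noted in section 3, the contaminated time point is hard to spot by eye because the outlier magnitude is comparable to the standard deviations of the observed series.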
Model 2.1 with ω = 0 is known as a dynamic factor model and is often used in time series analysis and econometric studies (see, e.g., Geweke, 1977; Peña & Box, 1987; Forni & Reichlin, 1998; Bai & Ng, 2002). It has recently attracted new interest as a framework for forecasting from a large number of predictors (Stock & Watson, 2002) and for classifying outliers in innovations (Georgiev, 2005). The relationship between dynamic factor models and ICA in macroeconomics and finance is addressed by Vessereau (2000) and Yau (2004). We may assume without loss of generality that the k column vectors a_1, a_2, ..., a_k of the factor matrix A are linearly independent and that the variance-covariance matrix of {x_t} is diagonal, E(x_t x_t') = diag{s_1², s_2², ..., s_k²} with s_j² > 0, j = 1, ..., k, so that there are exactly k independent factors. Model 2.1 is still undetermined, since for any orthogonal k × k matrix C we may write A x_t = A* x*_t, where A* = AC and x*_t = C'x_t. Therefore, we may assume that A'A = I, which allows the model to have independent innovations even after demixing, as we explain later.

Let us now complete the orthonormal set (a_1, a_2, ..., a_k) with (h − k) vectors to obtain an orthonormal basis of R^h. Since ω ∈ R^h, it may be projected onto the space spanned by the columns of A to obtain

ω = α_1 a_1 + α_2 a_2 + ... + α_k a_k + (ω − Σ_{j=1}^{k} α_j a_j) = Aα + (ω − Aα),

where α = (α_1, α_2, ..., α_k)', and ω − Aα is orthogonal to each a_j. Now, defining the unit norm vector b_1 = (ω − Aα)/‖ω − Aα‖, we may consider the k + 1 vectors a_1, a_2, ..., a_k, b_1 and complete an orthonormal basis of R^h by choosing (h − k − 1) further orthonormal vectors b_2, b_3, ..., b_{h−k}. The resulting basis is denoted by [A, B] = [a_1, a_2, ..., a_k, b_1, b_2, ..., b_{h−k}]. Let us further define the sequence {η_t} by the relationship ε_t = [A, B] η_t. As we chose [A, B] orthonormal, the elements of η_t are mutually independent for each t, because this property was assumed to hold for the sequence {ε_t}. Then model 2.1 may be written in the equivalent form
y_t = A x_t + [A, b_1] (α', ‖ω − Aα‖)' ℓ_t + [A, B] η_t
    = [A, B] (x_1t + α_1 ℓ_t + η_1t, ..., x_kt + α_k ℓ_t + η_kt, ‖ω − Aα‖ ℓ_t + η_{k+1,t}, η_{k+2,t}, ..., η_ht)',
where y_t is written as a linear combination of independent series: the (k + 1)th source is proportional to ℓ_t plus a noise term and has the largest kurtosis, the first k components are also perturbed by the outliers and have nonzero kurtosis, and the remaining h − k − 1 sources are purely gaussian noise. Therefore, ICA may extract the demixing matrix [A, B], and the h obtained components may be characterized as follows (a numerical sketch follows the list):

• The component ‖ω − Aα‖ ℓ_t + η_{k+1,t} contains only the term closely related to the outlier pattern, plus a noise term.
• The k components x_jt + α_j ℓ_t + η_jt, j = 1, 2, ..., k, include the original factor series, a possible term due to the outliers, and a random error.
• The remaining h − k − 1 components η_jt, j = k + 2, ..., h, are purely random noise.
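The practical detection rule adopted later in section 3 flags any time point at which at least one extracted component leaves the band mean ± 4.47 standard deviations, the Chebyshev bound at the 5% level. Continuing the simulation sketch given after equation 2.1, a minimal version of this rule might read as follows; scikit-learn's FastICA is our stand-in for the Matlab toolboxes cited in section 3, and all variable names are ours.

```python
import numpy as np
from sklearn.decomposition import FastICA

# `y` and `h` come from the sketch after equation 2.1.
ica = FastICA(n_components=h, random_state=0)
sources = ica.fit_transform(y)       # n x h matrix of extracted components

mu = sources.mean(axis=0)
sd = sources.std(axis=0)
exceed = np.abs(sources - mu) > 4.47 * sd   # Chebyshev band at the 5% level
outlier_times = np.flatnonzero(exceed.any(axis=1)) + 1   # 1-based time index
print("suspected outlier times:", outlier_times)
```

On a draw like the one above, the (k + 1)th-type source should carry the t = 100 spike, though whether it stands out depends on ‖ω − Aα‖ relative to the noise, exactly as in the two cases discussed next.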
Note that the row corresponding to the (k + 1)th component in the demixing matrix equals b_1' = (ω − Aα)'/‖ω − Aα‖ and may give an indication of the behavior of the perturbing vector ω. With a view to practical outlier detection, two cases may occur:

1. The outlier magnitude vector ω is exactly a linear combination of the columns of the factor matrix A, ω = Aα, so that the perturbation in the observed series may be entirely explained by outliers in the factors. In this case, we can recover only the perturbed factor series x_jt + α_j ℓ_t + η_jt, and the evidence of the outlier depends on the relative magnitude α_j = ω'a_j with respect to the variability and the dynamic properties of the corresponding series x_jt, as in the univariate case. Thus, it may happen that the presence of an outlier will not be noticed in any extracted source.

2. The outlier cannot be explained entirely in terms of factors, and ω − Aα ≠ 0. In that case, one extracted source behaves proportionally to ℓ_t plus a random noise term, and the outlier may be easily discovered by simple inspection if ‖ω − Aα‖ is sufficiently large
with respect to var(η_{k+1,t}) = (ω − Aα)' Σ_ε (ω − Aα)/‖ω − Aα‖² ≤ max{σ_1², σ_2², ..., σ_h²}.

In both cases, formal statistical tests for the null hypothesis of absence of outliers at a given time could be easily derived, owing to the gaussianity of the components, if the factor matrix coefficients were known. This is obviously not true in general, and estimation of such parameters may be biased by the presence of the outliers themselves. Thus, a more cautious procedure may be based on checking each value against a bound m̂ ± c ŝ, where m̂ and ŝ are the extracted source's mean and standard error, and detecting an outlier whenever at least one of the extracted sources exceeds such bounds. To assess the value of c, we think that the gaussian quantiles are not appropriate, since our procedure tends to enhance the nongaussianity of the data; a more conservative choice may be obtained by resorting to the Chebyshev inequality. Moreover, we may consider each source separately and treat it as a univariate time series, applying methods for univariate outlier detection, which would certainly be advisable in case 1.

Suppose now that the observed multivariate series contains two outliers, which occur at different times with different sizes. Provided that h ≥ k + 2, we write

y_t = A x_t + ω_1 ℓ_t^(1) + ω_2 ℓ_t^(2) + ε_t,    (2.2)

where ℓ_t^(1) ≠ ℓ_t^(2). Following the above argument, it may easily be seen that the two outliers are always recovered as two different sources except in the case of proportionality: if ω_1 = k_1 ω_2, their effects will remain mixed and indistinguishable in the extracted sources. For example, let us consider in model 2.2 the matrix A, the multivariate time series {x_t}, and the random noise {ε_t} as in the simulation experiment that will be described in the next section. Then assume ℓ_t^(1) = 1 for t = 30 and zero elsewhere, and ℓ_t^(2) = 1 for t = 70 and zero elsewhere. Let us first consider two outliers that define the same direction, ω_1 = (3, 3, 3, 3, 3, 3)' and ω_2 = (−3, −3, −3, −3, −3, −3)'. From a simulated multivariate time series, ICA extracted the components displayed in Figure 4; the same source includes the effects of both outliers.

Figure 4: Independent components extracted from an artificial multivariate time series, with two parallel outliers at t = 30 and t = 70.

Let us next define two independent outliers, for example, ω_1 = (3, 3, 3, 3, 3, 3)' and ω_2 = (−3, 0, −3, −3, 0, −3)'. The extracted components of a simulated multivariate time series are displayed in Figure 5; each of the two outliers is included in a different source.

Figure 5: Independent components extracted from an artificial multivariate time series, with two independent outliers at t = 30 and t = 70.

3 A Simulation Experiment

Let us consider model 2.1 with h = 5 observed time series and k = 4 factors.
We first generated the unobservable factor time series x_it, i = 1, ..., k, t = 1, 2, ..., n, according to first-order autoregressive (AR) models, x_it = φ_i x_{i,t−1} + a_it, where the {a_it} are (unit variance) white noises and i = 1, ..., k. Sets of 1000 multivariate time series were simulated with length n = 100. The AR model parameters were all taken equal to 0.7 in absolute value and alternating in sign, that is,

(φ_1, φ_2, φ_3, φ_4) = (0.7, −0.7, 0.7, −0.7).
The noise series {ε_t} was assumed to have mean zero and diagonal variance-covariance matrix Σ_ε with all entries on the main diagonal equal to 0.04. Then we considered the perturbation magnitude vector

ω = (3, 3, 3, 3, 3)'
and the factor matrix

A = [ 1    0    0    0
      0.5  1    0    0
      0    0.5  1    0
      0    0    0.5  1
      0    0    0    0.5 ],

and proceeded to simulate the observed time series y_t as in model 2.1. The outlier is chosen so that it affects all time series. Note that the common outlier magnitude value is equal to 3, while the variance of the observed time series is about 2 for the first one, 0.5 for the last one, and about 2.5 for the remaining series. The factor components have a common variance of about 2, so the outlier magnitude is rather small compared to the variances of both the observed time series and the factors. The outlier cannot be explained entirely in terms of factors, as the vector ω is not a linear combination of the columns of A; therefore, the outlier is expected to be recoverable in a single extracted source. Three different outlier patterns are assumed. The first one is an isolated outlier that occurs at the end of the
series (t = 100). Methods developed in the time series analysis framework usually are not able to detect outliers in the last (or the first) observation. The second pattern is an outlier patch. In this case also, methods that are commonplace in time series analysis may fail to recognize the outlying observations in the middle of the patch because of a kind of masking. The perturbation times range from t = 51 to t = 53. The third pattern is a level shift at t = 51 (i.e., ℓ_t = 1 for t ≥ 51 and zero elsewhere). A change of level in a time series may be viewed as a structural break, so investigating level changes matters beyond the problem of outlier detection.

To implement the ICA, three algorithms that have proved effective in applications are used. The Matlab code has been downloaded from the Web, and the default version of each algorithm is adopted. The first is the JADE algorithm (Cardoso & Souloumiac, 1993),1 which aims at maximizing the nongaussianity of the independent components and operates on the full set of fourth-order cumulants. The second method performs the ICA decomposition using the infomax algorithm of Bell and Sejnowski (1995), which is closely related to maximum likelihood estimation (Makeig, Bell, Jung, & Sejnowski, 1996); the EEGLAB toolbox implements several features and extensions.2 The third method is Hyvärinen's fixed-point algorithm for approximate negentropy maximization (Hyvärinen & Oja, 2000), implemented by the FastICA toolbox.3

An outlier is identified at any time point where at least one of the estimated components exceeds the limits µ_i ± 4.47σ_i, where µ_i and σ_i are, respectively, the mean and the standard error of the ith extracted component. Such a choice is suitable for identifying rather large outliers but strongly prevents type 1 errors. As far as the level shift is concerned, the differenced series (y_t − y_{t−1}) shows an isolated outlier at the time when the level shift occurs; therefore, when searching for level shifts, we apply ICA to the differenced series and search for an isolated outlier.

Experiments concerning 1000 replications have been repeated 100 times. In Table 1, the averages and standard errors of the percentages of correct identification of outliers are reported. Note that since the replications are independent, the standard error of the percentage estimates could be evaluated, according to the binomial distribution, as 100{p(1 − p)/1000}^{1/2} ≤ 1.58. Results are given for the three patterns: a single outlier, a patch of three consecutive outlying observations, and a level shift at the middle of the series. Each of the three algorithms is perfectly able to detect the single outlier even though it comes at the end of the time series. Detection of the set of consecutive outliers seems much more difficult.
1. Available online at http://www.tsi.enst.fr/∼cardoso/guidesepsou.html.
2. Available online at http://www.sccn.ucsd.edu/eeglab/.
3. Available online at http://www.cis.hut.fi/projects/ica/fastica/.
Table 1: Results from the Simulation Experiment: Average Percentages of Outliers' Correct Identification over 1000 Replications, Each Repeated 100 Times, Varying the Algorithm and the Outlier Pattern.

                                 Patch, t = 51, 52, 53
Algorithm   Outlier, t = 100   ≥1              All             Average Number   Level Shift
JADE        100 (0.0)          80.01 (1.394)   42.25 (1.382)   2.37 (0.607)     98.83 (0.343)
Infomax     99.84 (0.111)      73.44 (1.483)   34.25 (1.449)   2.19 (0.690)     88.16 (1.06)
FastICA     100 (0.0)          79.64 (1.464)   42.03 (1.397)   2.37 (0.611)     98.49 (0.419)

Note: The standard errors of the percentages are in parentheses.
In Table 1, the percentages of detection of at least one outlier and of all three outliers simultaneously are reported, along with the average number of outlying observations detected. There is nearly 80% success in detecting the presence of at least one of the three outliers, and the whole patch is recognized in about 40% of the cases; the average number of outliers detected is greater than 2. This circumstance may be explained by considering that the nongaussianity measures (the kurtosis, for instance) decrease when the number of outliers increases. As far as the level shift is concerned, the percentages of correct identification are comparable to those obtained in the single-outlier case but slightly smaller. Indeed, the level shift is detected when a single outlier is detected in the differenced series; the differencing operator is likely to make the outlier's occurrence less evident, and the noise variance is larger. The standard errors of the percentages were close to their expected values in all cases.

Outliers that do not affect all time series, for instance, ω = (0, 0, 0, 3, 3)', which affects only the last two series, were also considered. Although the results for single-outlier identification are only slightly worse than those reported in Table 1, the percentages of identification drop far below them in the case of a patch and are considerably smaller for the level shift. We also considered the case of outliers that are linear combinations of columns of A, for instance, ω = (4, 4, 3, 5, 2)'. Here the identification is difficult: the isolated outlier at t = 100 is detected, on average, in only about 57% of the replications (standard error about 1.32) using the FastICA algorithm. The outlier may be entirely explained by perturbations in the factor series, so its presence may go unnoticed.

Let us now consider the case where there are two distinct perturbations in the data. The first case concerns parallel outliers, that is, outliers whose magnitude vectors define the same direction; the second concerns independent outliers, whose vectors define different directions.
Table 2: Percentages of Identification of Pairs of Parallel Outliers in the Same Components and Independent Outliers in Two Distinct Components, Computed in 1000 Replications and Averaged over 100 Iterations.

Algorithm   Parallel Outliers, Same Components   Independent Outliers, Different Components
JADE        97.146 (0.5341)                      98.653 (0.4148)
Infomax     94.807 (0.7147)                      91.522 (0.7734)
FastICA     97.238 (0.5268)                      98.347 (0.430)

Note: Standard errors are in parentheses.
We expect the algorithms to extract a single component including both outliers if the magnitude vectors are parallel, and two distinct components, each including one outlier, if the vectors are independent. In Table 2, the percentages of correct identification for 1000 artificial time series simulated according to model 2.2 and averaged over 100 iterations are reported. Parameters are chosen as in the preceding simulation experiment, but the matrix A is augmented by the further row (1, −1, 0, 1), so there are six observed series. The ICA algorithms are again JADE, infomax, and FastICA. The results displayed in Table 2 show that the two outliers at t = 30 and t = 70 are extracted as a single component if their magnitude vectors are parallel, while they are extracted as two distinct components if their magnitude vectors are independent. All algorithms perform satisfactorily.

Note that it is the percentage of outlying observations, not their time location, that mainly influences the ability of ICA to detect them. In fact, the results do not change substantially if, for instance, we take t = 1 and t = 3 instead of t = 30 and t = 70. That is, we may move the outlying observations and obtain similar results, provided their number remains unchanged. In the simulation study, performance seems to be affected mostly by the number of outliers and their impact on the time series. The overall performance deteriorates as the percentage of outlying observations increases, while the percentage of correct detections improves as the outliers affect more time series. In general, the three ICA algorithms perform in a similar way. Indeed, all ICA algorithms aim at maximizing some approximate measure of independence and are likely to yield similar results if the basic assumptions are fulfilled. The differences among alternative ICA implementations, which appear when more than one outlier occurs, may be due to the different behavior of the independence measures (kurtosis and mutual information) as the number of outliers increases. An explanation of the differences that may arise in practical applications, however, would require systematic studies of the relative performance of the algorithms (see, e.g., Delorme & Makeig, 2004).
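The expectation stated above, one extracted component for parallel magnitude vectors and two for independent ones, reduces to a rank condition on the pair of vectors. The following check is a sketch of ours, not part of the original procedure.

```python
import numpy as np

def expected_outlier_components(omega1, omega2, tol=1e-10):
    """Return 1 if the two magnitude vectors are parallel (rank 1), in which
    case both outliers should load on a single component, and 2 if the
    vectors are linearly independent."""
    M = np.column_stack([omega1, omega2])
    return 1 if np.linalg.matrix_rank(M, tol=tol) == 1 else 2
```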
4 Application to Real Data

4.1 Industrial Production Index. Italian industrial production indexes for 16 branches are detailed in Table 3, where the weights determined according to their relative production values are also displayed. The weighted average of the 16 branch indexes yields the general industrial production index. We considered monthly data from January 1990 to December 2000 provided by the Italian National Statistical Institute (ISTAT).

We performed a search for outliers to compare the ICA-based method with the procedure suggested in Tsay et al. (2000). A preliminary univariate analysis showed that all 16 time series are outlier free. We then inserted an outlier at time 50 by adding to each of the 16 indexes a value ω varying from 10 to 60 in steps of 5. The experiment is aimed at evaluating the impact of the outlier's magnitude on each method's ability to locate the time at which the outlier occurs. In Figure 6, the plots of the first four indexes (out of the original set of 16 time series) and the general index are displayed with the outlier inserted at time 50 (ω = 10, 30, 50). The outlier can hardly be noticed unless it equals 50, and the strong seasonality is likely to make detection even more difficult. Upper and lower limit lines (computed as the average index plus or minus c times the standard error) are also plotted in Figure 6. As before, Chebyshev's inequality has been used at the 5% significance level, yielding c = 4.47. The outliers do not exceed the limits even for ω = 50.

For the ICA-based method, we used the FastICA algorithm for the computations. For Tsay et al.'s method, vector autoregressive integrated moving-average (ARIMA) seasonal models have been identified and estimated. Then the statistic therein called J (loosely speaking, the maximum over t of the Mahalanobis norm of each multivariate estimated potential outlier) has been computed and compared to the critical value 42, which we assessed by simulation at the 5% significance level. The multivariate time series is then cleaned and the procedure iterated until no more outliers are detected. Performing the outlier detection repeatedly with increasing perturbation magnitude, we obtained the following results. The ICA-based method was able to recognize the outlier as soon as ω = 35; the difference between the component value at time 50 and the mean was about seven times the estimated standard error, and for greater values of ω this difference increased only slightly. Tsay et al.'s procedure identified the outlier only when ω = 50, with J = 47.9. Thus, in this case, the ICA-based approach seemed more sensitive to outlying observations.
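The magnitude sweep just described is easy to reproduce in outline. The sketch below is our own illustration, again using scikit-learn's FastICA in place of the FastICA toolbox; t0 = 49 is the 0-based index of time 50, and the grid of magnitudes follows the text (10 to 60 in steps of 5).

```python
import numpy as np
from sklearn.decomposition import FastICA

def smallest_detectable_magnitude(X, t0=49, magnitudes=range(10, 65, 5), c=4.47):
    """Add an outlier of increasing magnitude at time t0 to every series
    and report the smallest magnitude flagged by the ICA rule."""
    for omega in magnitudes:
        Xp = X.copy()
        Xp[t0, :] = Xp[t0, :] + omega
        S = FastICA(random_state=0).fit_transform(Xp)
        z = np.abs(S - S.mean(axis=0)) / S.std(axis=0)
        if (z[t0] > c).any():
            return omega
    return None  # not detected within the tested range
```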
Table 3: Industrial Production Indexes, Italy 1990–2000.

Index Description                                    Weight βi
Mining and quarrying                                 0.019
Food, beverages, and tobacco                         0.086
Textiles and textile products                        0.098
Leather and leather products                         0.028
Wood and wood products                               0.021
Pulp, paper, and paper products                      0.056
Coke, refined petroleum, and nuclear fuel            0.024
Chemicals, chemical products, and man-made fibers    0.070
Rubber and plastic products                          0.039
Other nonmetallic mineral products                   0.052
Basic metals and fabricated metal products           0.131
Machinery and equipment                              0.103
Electrical and optical equipment                     0.083
Transport equipment                                  0.054
Manufacturing, not classified elsewhere              0.038
Electricity, gas, and water supply                   0.098
Figure 6: Italy’s industrial production indexes (branches 1–4 plus the overall index, 132 monthly observations). Outliers in t = 50 of magnitude 10, 30, 50, are added in turn to each time series.
Figure 7: First five independent components extracted by the FastICA algorithm from the set of 17 time series (the branch indexes plus the general index) with an outlier of magnitude 10 added only to the general index. The outlier appears as the first component.
Finally, in order to verify the performance of our method on factor models, we included the general industrial production index in the time series set and inserted a small outlier in that series only, while the 16 branch indexes were kept unperturbed. This corresponds to a factor model (h = 17, k = 16) in which A is the 17 × 16 matrix obtained by stacking the 16 × 16 identity matrix on top of the row of weights (β1, β2, . . . , β16), with

ω = (0, . . . , 0, 10)′

and εt = 0. Since the weights βi sum to one, ω is not linearly dependent on the columns of A, and ICA is expected to detect the outlier rather easily. This indeed happens, as may be seen in Figure 7, even though the perturbation magnitude is rather small. For comparison, Tsay et al.'s procedure identified the outlier at t = 50 as an innovation outlier, then identified an additive outlier at t = 44 and another innovation outlier at t = 62. The ICA, however, displays the location of the outlying observation with greater clarity.
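The claim that ω cannot be absorbed by the factors amounts to saying that ω lies outside the column space of A, which can be verified numerically. The snippet below is our own sketch of that check, with the weights taken from Table 3.

```python
import numpy as np

# The 16 branch weights of Table 3 (they sum to one).
beta = np.array([0.019, 0.086, 0.098, 0.028, 0.021, 0.056, 0.024, 0.070,
                 0.039, 0.052, 0.131, 0.103, 0.083, 0.054, 0.038, 0.098])

A = np.vstack([np.eye(16), beta])   # 17 x 16: identity stacked on the weights
omega = np.zeros(17)
omega[-1] = 10.0                    # the outlier hits only the general index

# omega is in the column space of A iff appending it leaves the rank unchanged
rank_A = np.linalg.matrix_rank(A)
rank_Aw = np.linalg.matrix_rank(np.column_stack([A, omega]))
print(rank_Aw == rank_A)            # False: the factors cannot explain omega
```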
4.2 Italian Quarterly Economic Data. We consider a set of economic data concerning Italy from 1960 to 1999, detailed in Table 4.

Table 4: Quarterly Economic Italian Data and Their Preliminary Transformations.

Label     Description                                      Transformation
CPI       Consumer price index                             Δ²ln
IP        Index of industrial production                   Δln
PGDP      Gross domestic product deflator                  Δ²ln
RBNDC     Interest rate of long-term government bonds      Δ
RBNDM     Interest rate of medium-term government bonds    Δ
RCOMMOD   Real commodity price index                       Δln
RGDP      Real gross domestic product                      Δln
ROIL      Real oil prices                                  Δln
RGOLD     Real gold prices                                 Δln
RSTOCK    Real stock price index                           Δln
UNEMP     Unemployment rate                                Δ

Note: Δ stands for first difference (Δxt = xt − xt−1); ln stands for natural logarithm.
The data have been employed by Stock and Watson (2004) to obtain combination forecasts of output growth; the preliminary transformations (also listed in Table 4) have been applied as suggested by Stock and Watson. A univariate analysis detects large outliers at times 23 and 24 (the first and second quarters of 1967) for the gross domestic product and the deflator (RGDP and PGDP), several outliers in the real oil price series (at times 55, 103, and 121, corresponding to the first quarter of 1975, the first quarter of 1987, and the third quarter of 1991), and one outlier in the unemployment rate for the fourth quarter of 1993 (time 130). The multivariate procedure of Tsay et al. (2000) detects a single innovation outlier at time 23. In this case, a second-order vector AR model was fitted to the transformed data, and the critical value for the J statistic at the 5% significance level was found by simulation to equal 34.2.

The ICA analyses of the 11 series, applying JADE, infomax, and FastICA, produce substantially similar results: two sources with very large kurtosis are found (values around 40 and 30, respectively) and a third source with kurtosis about 10, while the remaining extracted sources have moderate kurtosis values. The behavior of the three sources, as extracted by FastICA, is displayed in Figure 8. The first source detects the outliers at t = 23 and t = 24, and the estimated mixing matrix indicates that it depends essentially on the gross domestic product and the deflator (RGDP and PGDP), but also on the consumer price index (CPI). The second source is determined essentially by real oil prices and the consumer price index and has a large outlier value at t = 55; large values are also noticeable at t = 103 and t = 121, but they are less evident. Finally, the third source has a significant value at t = 130 and is determined mainly by the real gross domestic product, with additional contributions from the GDP deflator, the industrial production index, and the real gold price.
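Ranking the extracted sources by kurtosis, as done here, is straightforward to sketch. The helper below is ours; it uses the plain (Pearson) kurtosis, for which a gaussian scores 3, matching the order of magnitude of the values quoted in the text.

```python
import numpy as np
from scipy.stats import kurtosis
from sklearn.decomposition import FastICA

def sources_by_kurtosis(X):
    """Extract independent components and sort them by decreasing kurtosis;
    the high-kurtosis sources are the outlier-bearing candidates."""
    S = FastICA(random_state=0).fit_transform(X)
    k = kurtosis(S, axis=0, fisher=False)   # Pearson kurtosis, gaussian = 3
    order = np.argsort(k)[::-1]
    return S[:, order], k[order]
```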
Figure 8: The three sources with the greatest kurtosis, extracted by the FastICA algorithm from the Italian quarterly economic data.
Interestingly, those series do not show a clear univariate perturbation at t = 130, while the unemployment rate series, which has a univariate outlier at that time, has almost no weight in this source. This indicates that the anomaly in unemployment may be induced by the concurrent behavior of other relevant economic aggregates.

4.3 Magnetoencephalographic Data. We considered magnetoencephalographic data on brain magnetic fields, recorded by a 28-channel system with active channels regularly distributed on a spherical surface covering an area of about 180 cm². The data are taken from Barbati, Porcaro, Zappasodi, and Tecchio (2005). Magnetoencephalographic recordings of brain activity are often affected by artifacts arising from heart and eye activity, and ICA has been proposed for separating genuine magnetic signals from sources linked to the electrocardiogram and electro-oculogram (see, e.g., Barbati, Porcaro, Zappasodi, Rossini, & Tecchio, 2004). We considered 12 channels with a sampling frequency of 1000 Hz and analyzed 600 data points, for a total recording time of 0.6 seconds. Figure 9 displays the channels that the outlier analysis showed to contain the most relevant anomalous data: C5, C6, and C11 (outliers detected using univariate methods, and by ICA as well for C11), and C2, C9, and C10 (detection performed using ICA).
Figure 9: Channels (from top to bottom) C5, C6, C11, C2, C9, and C10 from the magnetoencephalographic data. The channels displayed are those in which the largest-magnitude outlying observations were detected.
The univariate outlier analysis, by means of autoregressive moving-average (ARMA) models, detects no outliers for most series, with the exception of the observations at times 237, 257, 259, and 426 for channel C5; at times 124, 500, and 208 for channel C6; and at times 47 and 48 in series C11, though with a moderate outlying value. For the multivariate procedure of Tsay et al. (2000), the statistic J has been compared to the critical value 39.68 determined by simulation (at the 5% significance level); a vector AR model of order 10 was fitted to the data. An outlier of the innovation type was detected at time 46; then several outliers scattered over the time interval 71 to 221 and two other innovation outliers at times 246 and 271 were also detected.

The ICA, carried out by means of the FastICA algorithm, reveals a first source with a kurtosis value of about 7.5 and a second component with kurtosis about 5, while the remaining extracted sources have negligible kurtosis values. The two relevant sources (see Figure 10) are clearly related to anomalous data, since they have very small variability except at a few observations.
Figure 10: Independent components extracted from magnetoencephalographic data that are relevant for outlier detection.
The first source indicates clear-cut outliers at times 249 to 254 (linked to heart activity, as shown by a simultaneous electrocardiogram). The second extracted component has outlying values, though not very large ones, at times 46 to 48, corresponding to the anomalous values discovered by Tsay et al.'s (2000) procedure and by univariate methods in channel C11. However, the demixing matrix coefficients indicate that both the first and the second sources are essentially driven by channels C2, C9, and C10 (with the most relevant source depending also on C1 and C7). If we consider these channels only, the two estimated mixing vectors are linearly independent; therefore, the two perturbations appear in different components, as expected.

5 Conclusion

A procedure based on ICA for detecting outliers in multivariate time series has been introduced in this letter. The approach is well motivated, as the outlier pattern can usually be modeled as a simple deterministic function that is zero almost everywhere except where the disturbances occur. Such behavior is far from that of a gaussian time series, so ICA is likely to identify this component easily.
A simulation experiment supports this view. As far as the outlier pattern is concerned, the proposed method performed best for single-outlier detection, even at the end of the time series, and good performance was also observed for patches and level shifts. Note that detecting outliers that occur at the end of a time series, as well as sequences of outliers, is difficult for ARIMA model-based methods. There is no evidence that either the mixing matrix or the algorithm used to implement the ICA (JADE, infomax, or FastICA) makes any appreciable difference to the results.

We assume an instantaneous ICA mixing model. Thus, our method does not exploit the time dynamics of the series, though the objective function to be maximized could possibly be better structured to account for dynamic relationships; we hope that this will be a subject for further research. Nevertheless, the proposed procedure may yield useful results even if moderately nonstationary characteristics are present in the multivariate series, provided they behave approximately like common features (for example, cointegration or seasonal cointegration) so that stationary sources may be found.

Acknowledgments

We gratefully acknowledge financial support provided by MIUR (Italy) as part of the 2003 grant Statistical Methods and Models for Non-Stationary and Non-Linear Time Series Forecasting, Theory and Applications. We also thank two anonymous referees whose comments helped us to improve both our arguments and the presentation of this letter.

References

Abraham, B. (1980). Intervention analysis and multiple time series. Biometrika, 67, 73–78.
Bai, J., & Ng, S. (2002). Determining the number of factors in approximate factor models. Econometrica, 70, 191–221.
Barbati, G., Porcaro, C., Zappasodi, F., Rossini, P. M., & Tecchio, F. (2004). Optimization of an independent component analysis approach for artifact identification and removal in magnetoencephalographic signals. Clinical Neurophysiology, 115, 1220–1232.
Barbati, G., Porcaro, C., Zappasodi, F., & Tecchio, F. (2005). An ICA approach to detect functionally different intra-regional neuronal signals in MEG data. Lecture Notes in Computer Science, 3512, 1083–1090.
Bell, A. J., & Sejnowski, T. J. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7, 1129–1159.
Cardoso, J.-F., & Souloumiac, A. (1993). Blind beamforming for nongaussian signals. IEEE Proceedings F, 140, 362–370.
Chen, C., & Liu, L. (1993). Joint estimation of model parameters and outlier effects in time series. Journal of the American Statistical Association, 88, 284–297.
Delorme, A., & Makeig, S. (2004). EEGLAB: An open source toolbox for analysis of single-trial EEG dynamics including independent component analysis. Journal of Neuroscience Methods, 134, 9–21.
Forni, M., & Reichlin, L. (1998). Let's get real: A dynamic factor analytical approach to disaggregated business cycle. Review of Economic Studies, 65, 453–474.
Galeano, P., Peña, D., & Tsay, R. S. (2004). Outlier detection in multivariate time series via projection pursuit (Working Paper 04-42). Madrid: Departamento de Estadística, Universidad Carlos III de Madrid.
Georgiev, I. (2005). A factor model for innovational outliers in multivariate time series. Paper presented at the First Italian Congress of Econometrics and Empirical Economics, Venice. Available online at http://www.cide.info/conf old/papers/1151.pdf.
Geweke, J. (1977). The dynamic factor analysis of economic time-series models. In D. J. Aigner & A. S. Goldberger (Eds.), Latent variables in socio-economic models (pp. 365–383). New York: North-Holland.
Hyvärinen, A., Karhunen, J., & Oja, E. (2001). Independent component analysis. New York: Wiley.
Hyvärinen, A., & Oja, E. (2000). Independent component analysis: Algorithms and applications. Neural Networks, 13, 411–430.
Makeig, S., Bell, A. J., Jung, T.-P., & Sejnowski, T. J. (1996). Independent component analysis of electroencephalographic data. In D. Touretzky, M. Mozer, & M. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 145–151). Cambridge, MA: MIT Press.
Peña, D., & Box, G. E. P. (1987). Identifying a simplifying structure in time series. Journal of the American Statistical Association, 82, 836–843.
Stock, J. H., & Watson, M. W. (2002). Forecasting using principal components from a large number of predictors. Journal of the American Statistical Association, 97, 1167–1179.
Stock, J. H., & Watson, M. W. (2004). Combination forecasts of output growth in a seven-country data set. Journal of Forecasting, 23, 405–430.
Tsay, R. S. (1988). Outliers, level shifts and variance changes in time series. Journal of Forecasting, 7, 1–20.
Tsay, R. S., Peña, D., & Pankratz, A. E. (2000). Outliers in multivariate time series. Biometrika, 87, 789–804.
Vessereau, T. (2000). Factor analysis and independent component analysis in presence of high idiosyncratic risks (CIRANO Working Paper 2000s-46). Montréal: CIRANO.
Yau, R. (2004). Macroeconomic forecasting with independent component analysis. Paper presented at the Econometric Society 2004 Far Eastern Meetings 741, Econometric Society. Available online at http://ideas.repec.org/s/ecm/feam04.html.
Received June 29, 2005; accepted January 16, 2007.
VIEW
Communicated by Dmitri Chklovskii
Models Wagging the Dog: Are Circuits Constructed with Disparate Parameters? Thomas Nowotny [email protected] Institute for Nonlinear Science, University of California, San Diego, La Jolla, CA 92093-0402, U.S.A.
Attila Szücs [email protected] Institute for Nonlinear Science, University of California, San Diego, La Jolla, CA 92093-0402, U.S.A., and Balaton Limnological Research Institute of the Hungarian Academy of Sciences, Tihany, H-8237, Hungary
Rafael Levi Institute for Nonlinear Science, University of California, San Diego, La Jolla, CA 92093-0402, U.S.A., and Escuela Politechnica Superior, Universidad Autonoma de Madrid, 28049, Madrid, Spain
Allen I. Selverston [email protected] Institute for Nonlinear Science, University of California, San Diego, La Jolla, CA 92093-0402, U.S.A.
In a recent article, Prinz, Bucher, and Marder (2004) addressed the fundamental question of whether neural systems are built with a fixed blueprint of tightly controlled parameters or in a way in which properties can vary largely from one individual to another, using a database modeling approach. Here, we examine the main conclusion that neural circuits indeed are built with largely varying parameters in the light of our own experimental and modeling observations. We critically discuss the experimental and theoretical evidence, including the general adequacy of database approaches for questions of this kind, and come to the conclusion that the last word for this fundamental question has not yet been spoken.
Current address for Thomas Nowotny: Center for Computational Neuroscience and Robotics, University of Sussex, Falmer, Brighton BN1 9QJ, England. Neural Computation 19, 1985–2003 (2007)
© 2007 Massachusetts Institute of Technology
1 Introduction

A major factor in the success of determining the precise synaptic connectivity of invertebrate circuits has been the ability to identify individual neurons. The advantage of working with identifiable neurons was recognized early on by those working on Aplysia and lobster ganglia (Arvanitaki & Chalazonitis, 1958; Kandel, Frazier, Waziri, & Coggeshall, 1967; Otsuka, Kravitz, & Potter, 1967), who gave names to individual neurons based on their physiological and anatomical properties. Although the identified neurons always behaved consistently, with small variations attributed to the dissection or the physiological techniques used, little thought was given to how different populations of ion channels were combined to produce the characteristic voltage output of each cell. When neural circuits were established between identified cells, similar questions relating to the variability of identified synapses were not addressed either. Are identified neurons made up of the same type and density of ionic channels, or are the channels arranged differently, but in a way that gives each neuron its desired properties? Are the synapses within a particular circuit the same, or different for each animal but balanced in a way that ensures that the circuit performs equivalent operations from animal to animal? These two fundamentally different ways of building cells and circuits would have important implications for biologists as well as for modelers.

A new modeling study based on a three-neuron circuit from the lobster pyloric central pattern generator (CPG) (Prinz, Bucher, & Marder, 2004) raises important issues directly related to this question. It is well known that when constructing conductance-based models with different ionic channels and synaptic types, extreme sensitivity to parameter values is common, even in relatively simple systems. Thus, the method used to assign these values becomes critical when trying to tune a circuit. Using a novel method to address this problem, Prinz et al. (2004) developed a large database by computing and evaluating the output of thousands of circuits over a wide range of parameter values and selecting those sets that produced outputs matching the biological pattern. The outcome of this modeling work suggested that the values of the neuronal and synaptic parameters in the pyloric CPG could in fact be combined in many different ways, each able to generate an equivalent pyloric pattern. These results have extremely important implications for how neural circuits might be assembled in general, but they also raise some troubling questions for research on small CPG circuits.

There are many reasons to be initially skeptical of the idea that ion channels and other cell and circuit properties can, by some unknown mechanism, be mixed and matched but still generate the same functional pyloric pattern. One should, however, never dismiss an interesting idea without probing it carefully. We have therefore tested several predictions based on this hypothesis and examined the details of the database approach that led to it.
On the circuit level, we carried out a series of experiments blocking a single channel type in all neurons of the intact pyloric circuit. We reasoned that if there were disparities in the conductance of this channel, there would also be variations in the effects of its selective blocking. We repeated the same manipulations in the equivalent circuit models, which included different values for the conductance of the blocked channel. On the single-cell level, we compared the variability of isolated-cell dynamics in experiments and models and tested the sensitivity of the models to manipulations in order to assess the explanatory power of a database approach.
2 A Simple Test for the Consistency of Parameters in the Pyloric CPG

The results in Prinz et al. (2004), obtained with a database approach using the pyloric subset model, suggest that the entire neural circuit produces similar firing patterns even if the parameters of the intrinsic voltage-dependent properties vary over a wide range, in fact, over several orders of magnitude. Hence, there are several equally acceptable combinations of cellular and synaptic parameters that can produce equivalent firing patterns. A particular voltage-gated current may be very strong in an identified neuron from one animal but very weak in the same neuron from a different animal. The desired network output would then come about as a result of correct combinations of cellular and synaptic properties; that is, strong hyperpolarizing currents would balance strong depolarizing currents. This kind of balancing requires compensatory mechanisms not only at the network level but also in the individual neurons. Developing muscle cells in Xenopus, for example, compensate for the overexpression of exogenous Na+ channels by upregulating the expression of at least two endogenous K+ channels (Linsdell & Moody, 1994).

One way to see whether such a possibility exists for the pyloric system is to remove one of the currents and observe the behavior of the neurons. We chose to do this with the transient potassium current IA, which has long been known to be one of the most important voltage-gated currents in shaping the voltage output of single neurons as well as the patterned activity of circuits containing those neurons (Tierney & Harris-Warrick, 1992). Neurons with such fast potassium currents would be expected to display strong changes in their behavior following blocking of IA. Therefore, if we assume that the same type of neuron from different animals may indeed have very different maximal conductances for IA, then the compensatory currents, such as the H-current suggested by others (MacLean, Zhang, Johnson, & Harris-Warrick, 2003), will also vary over a wide range. Totally blocking an identified cell's A-current in different preparations should then result in populations of neurons with zero conductance in IA but with very different conductances in the compensatory voltage-gated currents. Blocking IA would therefore unbalance the entire pyloric circuit. Accordingly, the activity pattern of single pyloric cells, as well as the overall pyloric pattern, should vary widely among different preparations.
To test this possibility, we performed a set of experiments with bath-applied 4-aminopyridine (4-AP), a potent and specific blocker of IA in lobster pyloric neurons (Tierney & Harris-Warrick, 1992), and examined how consistent its effects are. At the network level, 4-AP had the obvious effect of speeding up the pyloric rhythm, by 52.6 ± 10.8% (n = 8). This is consistent with the general expectation that removal of a hyperpolarizing current, without alteration of other currents, results in a net depolarization of the neurons. Application of 4-AP did not disrupt the phase relationship between the bursts of the pyloric neurons or the overall regularity of the three-phasic rhythm. At the single-neuron level, 4-AP affected the voltage output and spike patterns of the neurons in a cell-specific manner, with the PD neuron showing the most profound changes.

The key question here is how much variability is observed after the A-currents are eliminated. Remarkably, the changes in burst shape and interspike interval (ISI) pattern for the PD neuron in 4-AP were very reproducible and characteristic (see Figure 1 and Table 1). In fact, 4-AP turned the accelerating-type (ISIs decreasing) PD neuron into a decelerating-type neuron. As shown in the table, the standard deviations of the changes in the statistical parameters were consistently small. Consequently, 4-AP produces very similar effects in pyloric neurons from different animals. These effects are consistent and reproducible not only in the network parameters but also in the parameters describing the spike dynamics of single neurons. Usually, burst-timing-based gross parameters are used to characterize the activity of oscillatory networks. These gross parameters of neural activity are the functionally most important ones and hence are understandably very similar among different animals.
Figure 1: 4-AP induces consistent changes in the voltage output and spike dynamics of the PD neuron. Voltage waveforms of PD neurons from four different preparations (arranged in four rows) are displayed before (left) and during (right) the application of the A-current blocker 4-AP (4 mM). There are six overlapping bursts in each panel. The smooth traces with gray lines running on both sides are the average intraburst spike density functions (± SD) calculated from approximately 100 successive bursts of each neuron. In control conditions, all PD neurons are accelerating-type bursters, while 4-AP turns them into decelerating-type ones. The maps are joint interspike interval (ISI) density plots showing the serial dependence of ISIs within the bursts of the PD neurons. While in control conditions all the neurons display the characteristic PD signature (V-shaped and clustered maps; compare Szücs et al., 2003), the joint ISI maps during 4-AP indicate a different behavior: rapid onset of the burst with gradually increasing ISI durations. The analysis reveals very similar dynamics in the PD neurons from different animals, both in normal conditions and when they are exposed to 4-AP (modified from Szücs & Selverston, 2006).
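Two of the analysis tools used in this figure, the intraburst spike density function and the joint ISI map, are simple to define. The sketch below is our reading of these standard constructions, with a Gaussian kernel of width σ (the captions elsewhere use σ = 20 ms); it is not the authors' code.

```python
import numpy as np

def spike_density(spike_times, t_grid, sigma=0.020):
    """Gaussian-kernel estimate of the spike density (times in seconds)."""
    density = np.zeros_like(t_grid, dtype=float)
    for s in spike_times:
        density += np.exp(-0.5 * ((t_grid - s) / sigma) ** 2)
    return density / (sigma * np.sqrt(2.0 * np.pi))

def joint_isi_pairs(spike_times):
    """Successive interspike-interval pairs (ISI_n, ISI_{n+1}) for a
    joint ISI density map."""
    isi = np.diff(np.sort(np.asarray(spike_times)))
    return isi[:-1], isi[1:]
```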
However, even the more subtle parameters, like the interspike-interval-based metrics and the spike density, are consistent as well. The pyloric neurons apparently tune their biophysical properties in a way that makes the bursts of the individual neurons remarkably similar across different animals; indeed, this is one of the reasons that the identification of such neurons is straightforward for the experienced neurophysiologist. As we and other investigators have previously shown, synaptic and neuromodulatory factors not only induce consistent changes in the overall burst pattern of the pyloric circuit but also do the same, consistently, to the burst waveforms of the individual cells (Szücs, Abarbanel, Rabinovich, & Selverston, 2005; Szücs, Pinto, Rabinovich, Abarbanel, & Selverston, 2003).
Table 1: 4-AP Induces Consistent Effects in the Gross Burst Parameters and Spike Patterns of the PD Neuron from Different Preparations (n = 8).

Circuit Property                              Unit   Average ± STD
Pyloric burst frequency, control              Hz     1.56 ± 0.13
Pyloric burst frequency with 4-AP             Hz     2.37 ± 0.14
Change in pyloric burst frequency             %      +52.6 ± 10.8
PD number of spikes, control                         10.3 ± 1.7
PD number of spikes with 4-AP                        6.9 ± 1.3
Change in number of spikes                    %      −32.4 ± 5.6
PD intraburst spike frequency, control        Hz     54.9 ± 9.2
PD intraburst spike frequency with 4-AP       Hz     78.9 ± 15.3
Change in PD intraburst spike frequency       %      +43.6 ± 11.0
LP-PD relative phase, control                        0.42 ± 0.03
LP-PD relative phase, 4-AP                           0.59 ± 0.03
Change in LP-PD relative phase                %      +41.7 ± 9.0

Notes: The intraburst spike frequency was calculated by dividing the number of spikes by the duration of the bursts. The LP-PD relative phase was calculated by dividing the intervals between the LP and PD burst onset times by the PD burst cycle period. The "Change in" rows show the most relevant changes in circuit properties.
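The two derived quantities defined in the table notes translate directly into code. This is a sketch under our own naming; burst onset times and spike counts are assumed to have already been extracted from the recordings, with each LP onset following the PD onset of the same cycle.

```python
import numpy as np

def intraburst_spike_frequency(n_spikes, burst_duration):
    """Number of spikes in a burst divided by the burst duration (Hz)."""
    return n_spikes / burst_duration

def lp_pd_relative_phase(lp_onsets, pd_onsets):
    """Interval from each PD burst onset to the following LP burst onset,
    normalized by the PD burst cycle period."""
    lp_onsets, pd_onsets = np.asarray(lp_onsets), np.asarray(pd_onsets)
    pd_period = np.diff(pd_onsets)
    return (lp_onsets[:-1] - pd_onsets[:-1]) / pd_period
```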
When inspecting the experiments individually, we did not notice any clear differences in the changes in the circuit characteristics listed summarily in Table 1. In particular, there were no indications that the remaining variability, as indicated by the nonzero standard deviations, was due to differences in the effect of blocking IA rather than to the a priori variability of the quantities in the intact circuits. Figure 2 shows the range of four parameter values before and during 4-AP application for the eight preparations. Natural variations of the burst frequency, PD spike number, PD intraburst spike frequency, and LP-to-PD relative phase are reflected by the 1.3- to 1.9-fold variability in the scatter plots (maximum-to-minimum ratios of the populations are indicated in each plot). A closer look at the burst frequency data (see Figure 2A) reveals that the population mean frequency is increased by 4-AP, but the dispersion of the data points remains very similar: the ratios are 1.29 and 1.21 for the control and 4-AP data, respectively. Relative changes of the burst frequency of each pyloric circuit are displayed in the same format, and we again find low dispersion of the data (a ratio of 1.23). Following the same procedure for the other three temporal parameters, we find that 4-AP clearly shifts the populations, but the variability does not change significantly in any case. Remarkably, the max/min ratio for the relative changes is always close to or less than 1.25. These values, as well as the max/min ratios of the raw data, are much less than those reported by Golowasch and Marder (1992) or Schulz, Goaillard, and Marder (2006) for maximal conductance densities of selected K-currents.
Figure 2: Natural variability and the effect of 4-AP on four burst-related temporal parameters in eight pyloric preparations. Each point corresponds to a different preparation. Ratios between the population maxima and minima are indicated above the points. (A) Burst frequency. (B) Intraburst spike number of the PD neuron. (C) Intraburst spike frequency of the PD neuron. (D) LP-to-PD relative phase. Relative changes of the parameters were calculated by normalizing the 4-AP values by their corresponding control data. The spread of relative changes is remarkably low, never exceeding 1.25.
Looking at the original data (before normalization), we find somewhat greater variability in the PD intraburst spike number and spike frequency (close to twofold variability). Nonetheless, 4-AP does not increase the variability of these parameters either.

The observed consistency of the pyloric CPG with respect to removal of IA can now be compared to corresponding results with the model used by Prinz et al. (2004). Figure 3 shows two control patterns that would meet their criteria for identification as pyloric rhythms (the same models as shown in Figure 5 of Prinz et al., 2004). If the IA values of the three neurons are different in the two models, one would expect a uniform reduction of this conductance to produce different effects in each model.
Figure 3: Effect of reducing IA in the CPG models of Prinz et al. (2004). The left panels in A and B show the output of two "equivalent models" of the lobster pyloric CPG (parameters taken from Figure 5 in Prinz et al., 2004). Scale bars are 50 mV/0.5 s. The middle and right panels show the output of the models with IA in all three neurons reduced to 75% and 50% of its original conductance. The panels in C compare the spike density functions (σ = 20 ms) of CPG A (black) and CPG B (gray). While one would not reasonably expect the models to reproduce the behavior of the real pyloric CPG correctly, they also fail to show dynamics equivalent to each other, in contrast to the claim of equivalence. This failure of reproducibility stands in strong contrast to the observations in the pyloric system (cf. Figure 1). Neuron models were integrated with a 5/6-order variable-time-step Runge-Kutta algorithm with a maximal time step of 20 µs and a relative error goal per time step of 10⁻⁹.
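For readers who want to reproduce this kind of integration setup, the sketch below uses SciPy's adaptive Runge-Kutta integrator in place of the 5/6-order scheme quoted in the caption, with a comparable step cap and error goal; the FitzHugh-Nagumo equations stand in for the full conductance-based model, so none of this is the original code.

```python
from scipy.integrate import solve_ivp

def rhs(t, y, I=0.5):
    """FitzHugh-Nagumo stand-in for a conductance-based neuron model."""
    v, w = y
    dv = v - v ** 3 / 3.0 - w + I        # fast, voltage-like variable
    dw = 0.08 * (v + 0.7 - 0.8 * w)      # slow recovery variable
    return [dv, dw]

# Variable-step integration with a capped maximal step (here 0.02 time
# units, the analog of the 20 microsecond cap) and a tight error goal.
sol = solve_ivp(rhs, (0.0, 200.0), [-1.0, 1.0],
                method="RK45", max_step=0.02, rtol=1e-9)
print(sol.y[0, -1])                      # final value of the fast variable
```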
As expected, a small reduction in IA already produces noticeable differences in frequency between the two models, whereas a larger reduction of 50% leads to completely different behavior in each. For complete removal of IA, both models cease bursting, in sharp contrast to the real system, which continues almost normal activity; compare the AB/PD model traces in Figures 3A and 3B with the effects of complete IA removal from PD neurons shown in Figure 1.

The pyloric CPG has many regulatory mechanisms and internal feedback loops driving it to a stable pattern even when some of its components are severely disrupted. Could this make the experimentally observed changes consistent in spite of very disparate parameter sets? While one should not exclude this possibility from the outset, there is no evidence that regulatory mechanisms can act in acute preparations.
Furthermore, one would expect feedback mechanisms to stabilize the normal physiological pattern rather than the deviation from it: instead of large but consistent changes, one would expect to see no or only very slight changes overall. Along the same lines, one would expect the presumably functionally more important properties, like frequencies and duty cycles, to be stabilized, but not details like the ISI signature of each neuron.
3 When Are Models Equivalent?

The apparent discrepancies between the two circuit models with respect to blocking the IA current led us to a more fundamental question: When can one consider two models equivalent, or similar enough to be in the same relationship to each other as the identified cells of two different animals? To approach this question, we compared the variability among the dynamics of models that are equivalent with respect to the criteria in Prinz et al. (2004) to the variability of the dynamics observed in isolated cells of the lobster stomatogastric ganglion. Figure 4 illustrates our findings. The panels show the membrane potential of four "equivalent models" (A) and of four different isolated LP neurons (B) in response to six typical levels of DC current injection. Clearly, the LP cells show noticeable variability, but their overall bifurcations, from slow, short bursting through long, irregular bursting to irregular spiking and eventually fast, tonic spiking, are very typical. It is important to note that, as in the case of the network, regulatory mechanisms such as feedback loops are not likely to shape the dynamics of the isolated neurons, because the natural system has never been exposed to such conditions, making it unlikely that a special mechanism to control the activity of the LP neuron in isolation would have evolved. Therefore, it seems fair to compare the results from isolation experiments to the behavior of single-neuron models. The neuron models reproduce the irregular spiking at 0 nA current injection, the criterion on which they were selected, but otherwise do not match the data, or each other, very well. Apparently, the models chosen from the database were not yet quite adequate.

Assuming there are models in the pool of tested models with matching dynamics over a wide range of stationary current injections, would this requirement restrict their parameters to similar values? And if not, are there other criteria that would do so? One might argue that to address this question, one just needs to go into the database of models and select with increasingly restrictive criteria. This assumes that all possible models are covered by the database and, conversely, that all parameter sets in the database constitute in some sense equivalent and viable model entities. To gain some insight into this question, we analyzed a subset of models from the pyloric neuron model database of Prinz, Billimoria, and Marder (2003).
Figure 4: Models and data of isolated LP neurons. (A) Models LP2 through LP5 from Prinz et al. (2004) at six levels of current injection. Current injections range from −0.15 nA to 0 nA (control) in steps of 0.03 nA. Scale bars are 1 s/50 mV. (B) Data from isolated LP neurons (PD and VD neurons were killed by photoablation; 10 µM picrotoxin) in four preparations of the stomatogastric ganglion of the lobster. Current injections into the soma through a separate sharp electrode ranged from −4 nA to 0.5 nA, as noted next to the graphs. Scale bars are 1 s/25 mV. Taking into account the chaotic nature of the LP neuron and the expected experimental variability in the isolation procedure, the data appear, at least qualitatively, rather consistent across preparations. The models, on the other hand, differ quantitatively and qualitatively from each other. Models were integrated with the same algorithm as in Figure 3.
4 Database Approaches

At first sight, it is an intriguing and bold idea to map out the complete parameter space of a model in a database approach and observe all of the dynamical regimes, especially the bifurcations. Furthermore, this approach seems to offer a completely new wealth of information on how typical certain dynamical properties of models are, for example, by allowing statements about how many of all "equivalent LP models" have a large IA conductance. After careful inspection, our enthusiasm was, however, somewhat dampened by the following observations.

Conductance-based models of the Hodgkin-Huxley type are fairly complex, containing on the order of tens of parameters. Even if one restricts the free parameters to the maximal conductances only, one is easily left with 5 to 10 parameters for a given neuron.
This already raises some skepticism about the feasibility of a complete database. Furthermore, experience with this type of model has demonstrated that they are highly sensitive to the values of some parameters (see Figure 5 for an example). Parameter mismatches often lead to dynamics inappropriate not only for the neurons being described but for neurons in general; two common, and very surprising, examples from conductance-based models are metastable depolarized states and a lack of stability in the hyperpolarized regime. Because of the sensitivity of the models to some of the parameters (and their combinations), a complete survey of the parameter space obviously needs to be done with appropriately small increments in each of the parameters. This quickly increases the computational cost. For example, if we take just 1000 different values for each of, say, five parameters, we have to compute the dynamics of 10¹⁵ models, which might take on the order of 10¹⁶ seconds on a modern PC, that is, about 300 million years. At the other end of the spectrum of possibilities, if we take only five different values for each of five parameters, we will not cover any relevant portion of the parameter space (compare the gray panels in Figure 5 to the wealth of different dynamics in the remaining panels). Continuing this approach to the network level makes the combinatorial complexity even more daunting.

Figure 5 illustrates a few further difficulties of brute-force database approaches. The region in which a parameter change is relevant to the neuron dynamics is a priori not known. In the example of Figure 5, it is meaningful only to vary gNa between 0 and about 220 mS/cm²; for all values beyond this range, the dynamics of the model neuron is insensitive to changes in gNa. Following Prinz et al. (2003) in counting the number of model neurons with a certain property (e.g., all neurons that are tonic spikers), one will count basically the same model several times, for gNa = 300, 400, and 500 mS/cm². The wealth of different models between gNa = 20 and 90 mS/cm², however, will go unnoticed. The head count of models with given properties thus depends strongly on the choice of parameter values examined (in terms of both spacing and total range) and can become a highly ambiguous statement. Inspecting the data manually, as we did here for illustration purposes, does not resolve this problem, because the relevant range of values of, for example, the sodium conductance will likely depend in a nontrivial way on the values of all the other parameters.

Furthermore, given the different ways parameters enter into models, it is not even clear whether linear, logarithmic, or other increments in parameter values are appropriate. A simple example illustrates this basic problem. Suppose we are looking for sets of three parameters with the property p1 · p2 · p3 ≤ 1. If we sample the three parameters on a logarithmic-like scale, from {0, 0.1, 0.2, 0.5, 1, 2, 5}, we find that 82.8% of parameter sets have this property. If we sample from a linear scale, like {0, 1, 2, 3, 4, 5}, however, we obtain a count of only 42.6% of the parameter sets with this property.
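The two percentages can be verified by direct enumeration. The short script below is our own check; a small tolerance guards the products that land exactly on the boundary against floating-point round-off.

```python
import itertools

def fraction_leq_one(values):
    """Fraction of ordered triples (p1, p2, p3) with p1 * p2 * p3 <= 1."""
    triples = list(itertools.product(values, repeat=3))
    hits = sum(p1 * p2 * p3 <= 1.0 + 1e-12 for (p1, p2, p3) in triples)
    return hits / len(triples)

print(fraction_leq_one([0, 0.1, 0.2, 0.5, 1, 2, 5]))  # ~0.828 (log-like grid)
print(fraction_leq_one([0, 1, 2, 3, 4, 5]))           # ~0.426 (linear grid)
```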
The percentage count thus depends almost entirely on the prior choice of the sampling method. Another dangerous pitfall is the assumption that counting models with a given property becomes more meaningful once the relevant range of parameter values is tightly constrained by additional knowledge. A simple back-of-the-envelope calculation shows that if we know the relevant parameter range only to ±10% accuracy (e.g., we know that six parameters have relevant values between 0 and 1 ± 0.1), then the parameter-space volume examined varies between 0.9⁶ ≈ 0.531 and 1.1⁶ ≈ 1.772. One could therefore easily obtain a result in which the fraction of models having a given property varies between 90% and 27%, depending on whether one samples with the lower or the upper value for the parameter range. The problem becomes even more aggravated for higher-dimensional parameter spaces.

With respect to the models that are overlooked by too coarse a parameter sampling, one could argue that parameter regions for which the models are highly sensitive to their parameters and initial conditions (like the models between gNa = 20 and 90 mS/cm² above) are not relevant for describing neural systems, which need to be robust and reliable. It has, however, been observed that neurons can have chaotic regimes (Elson, Huerta, Abarbanel, Rabinovich, & Selverston, 1999; Ren et al., 1997; Schiff et al., 1994), and it is known that self-organized systems often approach such unstable parameter regions for greater flexibility (Bak, Tang, & Wiesenfeld, 1988; Bertschinger & Natschläger, 2004; Kauffman & Johnsen, 1991).
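The volume argument is a one-liner to check; the specific property volume used below (0.478) is a hypothetical value of ours, chosen only to reproduce the quoted 90% versus 27% endpoints.

```python
# Six parameters whose range is known only to within +/-10%:
low_vol, high_vol = 0.9 ** 6, 1.1 ** 6
print(low_vol, high_vol)            # ~0.531 and ~1.772

# A fixed volume of models with the property yields very different
# fractions depending on which sampled volume it is divided by.
prop_vol = 0.478                    # hypothetical, for illustration
print(prop_vol / low_vol)           # ~0.90
print(prop_vol / high_vol)          # ~0.27
```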
Figure 5: Illustration of the principal difficulties of a database approach. (A) Each panel shows six membrane potential traces of the model LP5 of Prinz et al. (2004) for a given value of the Na conductance, as noted above the panels. The six traces correspond to DC current injections from −0.125 to 0 nA in increments of 0.025 nA (top to bottom). Gray panels are models that were examined in the database of Prinz et al. (2003); the other panels were not. Relevant bifurcations in the neuron dynamics (transitions from silent to fast tonic spiking to slower tonic activity and eventually irregular bursting) were not included in the database. The remaining models for even stronger Na currents (not shown here) are all comparatively similar and need not be examined. (B) A 10-fold finer resolution in the parameter values shown in A reveals further transitions in the activity patterns. (C) Another 10-fold increase in resolution over B. The model appears to be so sensitive to this parameter that even this resolution does not suffice to clearly resolve the transition between fast tonic spiking and irregular bursting. It seems that the resolution necessary to cover the whole dynamics of a neuron model is very fine, whereas this resolution is needed only in very small but a priori unknown regions. A brute-force database approach therefore leads either to incomplete observations or to an explosion in computational cost (see the text). Neuron models were integrated with the same procedure as before. Correct integration was verified in some examples with a linear Euler algorithm with time step 10⁻⁶ ms, which revealed no relevant deviation.
In this light, disregarding models solely on the basis of observed sensitive or irregular dynamics appears somewhat presumptuous.

The approach of systematically mapping the dynamics resulting from all possible parameter combinations might be better suited to models that are fairly insensitive to the parameters in question, or to models for which the structure of the bifurcations is well known and simple enough that sampling only a few points in every known dynamical regime permits identification of the dynamics over wide regions of the parameter space. Neither of these conditions is likely to hold for any Hodgkin-Huxley-type model (Izhikevich, 2006).

A different, more important function of building databases of model neurons can be to identify potentially useful models for other modeling purposes or to gain a general overview of possible model behaviors. As such, it can help circumvent tedious hand adjustment of models and become a valuable part of the portfolio of modeling tools used in neuroscience.
5 Summary and Conclusions

Modeling is an extremely useful tool for handling large amounts of neurophysiological data from complex, nonlinear systems. If it goes hand in hand with experimental approaches, it can help explain the experimental findings as well as generate new hypotheses. Models can also provide consistency tests and testbeds for principles that may underlie the observations. All models in neuroscience are, however, by necessity phenomenological in nature, and a tight connection to experimental observations is therefore indispensable. Without a direct experimental foundation, models can be fairly ambiguous because, unlike theories in physics, they are not constrained by clear fundamental principles, and the specific details (e.g., values of parameters) are not precisely known. Furthermore, it is the very nature of biological systems to vary between animals, which might not allow building universal models of biological systems in the way one can in chemistry or physics. This exact premise led Prinz et al. (2004) to the intriguing idea of circuits being assembled in very different ways, with very different components, across animals. Do existing data support the idea that individual neurons and entire circuits can be made with such variable combinations of parameters? If each circuit is assembled de novo, from one animal to another, so that each one is unique, we would have to take the following into consideration.

5.1 Regulatory Mechanisms. Assuming that neurons and circuits in every animal can be considerably different, there must be a mechanism that regulates neuron and synapse properties to lead to a successful activity pattern. Furthermore, once this is achieved, individual biophysical and synaptic properties have to be maintained in the face of continuous protein turnover and activity-dependent changes (Turrigiano, 1999). While potential regulatory mechanisms have been suggested on the single-cell level (Golowasch, Casey, Abbott, & Marder, 1999; LeMasson, Marder, & Abbott, 1993; Liu, Golowasch, Marder, & Abbott, 1998), it remains rather unclear how this could be realized for synapses at the network level.

5.2 Action of Neuromodulators. When CPG circuits are exposed to neuromodulators, delivered artificially or by stimulation of the neurons that contain them, they produce characteristic changes in ongoing motor patterns that are consistent from animal to animal (Harris-Warrick & Marder, 1991). A large body of evidence indicates that the modulators bind to particular receptors on specific identified neurons and that they activate second-messenger systems that are also cell specific (Hempel, Vincent, Adams, Tsien, & Selverston, 1996). These actions in turn lead to specific changes in membrane conductances and alterations in the biophysical properties of the neurons and synapses in a circuit.
Would consistent effects of neuromodulatory action be possible if the types and distribution of ion channels in identifiable neurons were inconsistent? One could argue that regulatory mechanisms and specific neuromodulator actions are tuned carefully to allow consistent effects in spite of highly variable substrates. But our experiment with 4-AP also showed very consistent effects of blocking IA channels, a perturbation that the system has never been, and would not be, exposed to under natural circumstances. The fact that neuromodulators always produce consistent results is neither a paradox nor a conundrum that can be dismissed and relegated to future research (Marder & Goaillard, 2006). Modulators have reproducible effects on networks because each has specific target cells to which it binds and affects specific second-messenger pathways and phosphorylation sites. Of course, modulators exist within the framework of the entire system, so it is not surprising that they can produce variable results when other modulators or sensory inputs are present. But a single modulator applied to a ganglion in its standard experimental condition will always produce consistent effects. This would be extremely unlikely if the target channel had a 30-fold density range.

5.3 Feedback and Descending Control Mechanisms. For a CPG to be effective in the control of a behavior, it must be able to respond to sensory inputs. Such sensory feedback may impinge directly onto the CPG circuitry, or it may arrive in the form of commands from higher centers after different sensory inputs have been integrated and decisions made about how to respond. These feedback pathways are specific to particular neurons. Can feedback control mechanisms provide consistent results if the target neurons are inconsistent?

5.4 Time Constants of Channels. The mixing and matching of channels in a neuron in a way that produces the same overall physiological properties could be done in theory if only the effective polarity of the response mattered; that is, a little too much inward current could be offset with a corresponding amount of outward current. But ionic currents also have different activation and inactivation curves as well as different kinetics, ranging from transient to persistent. These factors would make it almost impossible to mix ionic currents in a way that achieves similar actions over a wide enough dynamic regime, because not only would the polarity have to be compensated for, but the balancing currents would also have to possess identical activation and inactivation curves and kinetic properties.

5.5 Additional Experimental Evidence. Baro et al. (1997) examined the expression of Shal channels in identified neurons in the pyloric CPG and found that it is typical for each identified cell type, with small variations between cells of the same type.
2000
¨ T. Nowotny, A. Szucs, R. Levi, and A. Selverston
found that it is typical for each identified cell type with small variations between cells of the same type. They also found rather small variability in the maximal conductance of IA channels for each identified cell in contrast to the results in (Golowasch & Marder, 1992). Furthermore in a recent work (Schulz, Goaillard, & Marder, 2006) it has been shown that there is a certain amount of variability in some channels, while others are more controlled. This is not unexpected, as parameters that have a large impact on the system dynamics need to be more tightly controlled than others with less impact. In particular, Schulz et al. found in LP neurons gKd in the range of 0.09 to 0.12 µS/nF, gA in 0.05 to 0.16 µS/nF, and gKCa in 0.2 to 0.6 µS/nF, which corresponds to maximally three-fold differences. This is less variability in parameters than postulated from the models discussed above, but more than most experimenters would have expected. 5.6 Implications for the Concept of Identified Cells. Identifiability of neurons has been a cornerstone of the success of invertebrate CPG “circuit chasing.” Identifiable neurons have identical physiological properties, the same connections to other identifiable cells, a similar morphology, and identical biochemical and molecular signatures. The application of molecular biological techniques to the study of CPGs will depend on the assumption that different types of neurons have consistent cellular and synaptic channel compositions and receptor properties (Callaway, 2005; Kiehn & Kullander, 2004; Wulff & Wisden, 2005). The implications of disparate parameter compositions occurring in the same neuron type, on the contrary, would make such molecular approaches less applicable for pharmacological development because channel-specific drugs would affect each cell of the same type differently. Another interesting observation in the recent work of Schulz et al. (2006) is that the identified neurons, though having a fairly wide range of certain parameters, are well separated in parameter space. In this way, wide ranges of parameter values might be consistent with the original idea of identified cells. Each identified cell is characterized by a more or less wide region in parameter space. If the action of neuromodulators and other neuroactive substances were sufficiently robust, this could still allow consistent effects on the neuronal activity. How much variability would be sustainable in this view remains an interesting open research direction. The voltage output of a neuron is not a simple algebraic sum of channel populations but a complex nonlinear computation that includes both biophysical and anatomical factors. The novel methodology of using largescale databases developed by Prinz et al. (2003, 2004) is an exciting approach that incites the thinking of how neural systems are composed and maintained. One has to remember, though, that modelers, by necessity, have to make assumptions and simplifications, which may greatly influence the modeling results. This is aggravated by the fact that we do not know which aspects of the observed circuit and neuron dynamics really matter. The
Models Wagging the Dog
2001
fact that we observe very typical ISI signatures in the neurons of every preparation is suggestive but does not prove that ISI properties are indeed important. It might be, for example, that the A-current in the PD neuron is critically important to achieve reliable phase shift during the action of a specific endogenous neuromodulator such as dopamine and only as a side effect the A current also reshapes the burst of the PD neuron and produces those nice V-shaped and clustered ISI signatures. In a model, the modeler will have to choose which properties of the system (e.g., bursting frequencies, bursting phases, ISIs, ISI signatures, spike shape) are important to model and which can be neglected. When drawing conclusions from models, researchers must have the limitations due to these assumptions and simplifications very clear in their minds. Here we have pointed out some problems that come with a database approach, including the arbitrary choice of range, step, and type (e.g., logarithmic, linear) of parameter sampling, the high sensitivity and nonlinearity of Hodgkin-Huxley type neuron models, the ambiguities in assessing the accuracy and adequacy of models, and the difficult interpretation of “model counting” results. All this again stresses the need to work closely with experimental data. There is a danger that modeling becomes a substitute for experimental work while making assumptions that are not, or cannot be obtained experimentally. To our mind, the important biological question of how consistent circuit and neuron parameters really have to be remains wide open. Acknowledgments This work was supported by the NIH, grant RO 1 NS-050945; the NSF, grant PHY-0414174; and the Hungarian National Science and Research Foundation, grant T043162. References Arvanitaki, A., & Chalazonitis, N. (1958). Modal configurations of the activity of different neurons emanating from a common center. J. Physiol. (Paris), 50, 122– 125. Bak, P., Tang, C., & Wiesenfeld, K. (1988). Self-organized criticality. Physical Review A, 38, 364–374. Baro, D. J., Levini, R. M., Kim, M. T., Willms, A. R., Lanning, C. C., Rodriguez, H. E., & Harris-Warrick, R. M. (1997). Quantitative single-cell-reverse transcription-PCR demonstrates that A-current magnitude varies as a linear function of Shal gene expression in identified stomatogastric neurons. J. Neurosci., 17, 6597–6610. Bertschinger, N., & Natschlager, T. (2004). Real-time computation at the edge of chaos in recurrent neural networks. Neural Comput., 16, 1413–1436. Callaway, E. M. (2005). A molecular and genetic arsenal for systems neuroscience. Trends Neurosci., 28, 196–201.
2002
¨ T. Nowotny, A. Szucs, R. Levi, and A. Selverston
Elson, R. C., Huerta, R., Abarbanel, H. D., Rabinovich, M. I., & Selverston, A. I. (1999). Dynamic control of irregular bursting in an identified neuron of an oscillatory circuit. J. Neurophysiol., 82, 115–122. Golowasch, J., Casey, M., Abbott, L. F., & Marder, E. (1999). Network stability from activity-dependent regulation of neuronal conductances. Neural Comput., 11, 1079–1096. Golowasch, J., & Marder, E. (1992). Ionic currents of the lateral pyloric neuron of the stomatogastric ganglion of the crab. J. Neurophysiol., 67, 318–331. Harris-Warrick, R. M., & Marder, E. (1991). Modulation of neural networks for behaivor. Ann. Rev. Neurosci., 14, 39–57. Hempel, C. M., Vincent, P., Adams, S. R., Tsien, R. Y., & Selverston, A. I. (1996). Spatiotemporal dynamics of cyclic AMP signals in an intact neural circuit. Nature, 384, 166–169. Izhikevich, E. M. (2006). Dynamical systems in neuroscience: The geometry of excitability and bursting. Cambridge, MA: MIT Press. Kandel, E. R., Frazier, W. T., Waziri, R., & Coggeshall, R. E. (1967). Direct and common connections among identified neurons in Aplysia. J. Neurophysiol., 30, 1352–1376. Kauffman, S. A., & Johnsen, S. (1991). Coevolution to the edge of chaos: Coupled fitness landscapes, poised states, and coevolutionary avalanches. J. Theor. Biol., 149, 467–505. Kiehn, O., & Kullander, K. (2004). Central pattern generators deciphered by molecular genetics. Neuron, 41, 317–321. LeMasson, G., Marder, E., & Abbott, L. F. (1993). Activity-dependent regulation of conductances in model neurons. Science, 259, 1915–1917. Linsdell, P., & Moody, W. J. (1994). Na+ channel mis-expression accelerates K+ channel development in embryonic Xenopus laevis skeletal muscle. J. Physiol. (Lond.), 480, 405–410. Liu, Z., Golowasch, J., Marder, E., & Abbott, L. F. (1998). A model neuron with activity-dependent conductances regulated by multiple calcium sensors. J. Neurosci., 18, 2309–2320. MacLean, J. N., Zhang, Y., Johnson, B. R., & Harris-Warrick, R. M. (2003). Activityindependent homeostasis in rhythmically active neurons. Neuron, 37, 109–120. Marder, E., & Goaillard, J. M. (2006). Variability, compensation and homeostasis in neuron and network function. Nat. Rev. Neurosci., 7, 563–574. Otsuka, M., Kravitz, E. A., & Potter, D. D. (1967). Physiological and chemical architecture of a lobster ganglion with particular reference to gamma-aminobutyrate and glutamate. J. Neurophysiol., 30, 725–752. Prinz, A. A., Billimoria, C. P., & Marder, E. (2003). Alternative to hand-tuning conductance-based models: Construction and analysis of databases of model neurons. J. Neurophysiol., 90, 3998–4015. Prinz, A. A., Bucher, D., & Marder, E. (2004). Similar network activity from disparate circuit parameters. Nat. Neurosc., 7, 1345–1352. Ren, W., Hu, S. J., Zhang, B. J., Wang, F. Z., Gong, Y. F., & Xu, J. X. (1997). Period-adding bifurcation with chaos in the interspike intervals generated by an experimental pacemaker. International Journal of Bifurcation Chaos, 7, 1867–1872. Schiff, S. J., Jerger, K., Duong, D. H., Chang, T., Spano, M. L., & Ditto, W. L. (1994). Controlling chaos in the brain. Nature, 370, 615–620.
Models Wagging the Dog
2003
Schulz, D. J., Goaillard, J. M., & Marder, E. (2006). Variable channel expression in identified single and electrically coupled neurons in different animals. Nat. Neurosci., 9, 356–362. ¨ A., Abarbanel, H. D., Rabinovich, M. I., & Selverston, A. I. (2005). Dopaminae Szucs, modulation of spike dynamics in bursting neurons. Eur. J. Neurosci., 21, 763–772. ¨ A., Pinto, R. D., Rabinovich, M. I., Abarbanel, H. D., & Selverston, A. I. (2003). Szucs, Synaptic modulation of the interspike interval signatures of bursting pyloric neurons. J. Neurophysiol., 89, 1363–1377. ¨ A., & Selverston, A. I. (2006). Consistent dynamics suggest tight regulation of Szucs, biophysical parameters in a small network of bursting neurons. J. Neurobiol., 66, 1584–1601. Tierney, A. J., & Harris-Warrick, R. M. (1992). Physiological role of the transient potassium current in the pyloric circuit of the lobster stomatogastric ganglion. J. Neurophysiol., 67, 599–609. Turrigiano, G. G. (1999). Homeostatic plasticity in neuronal networks: The more things change, the more they stay the same. Trends Neurosci., 22, 221–227. Wulff, P., & Wisden, W. (2005). Dissecting neural circuitry by combining genetics and pharmacology. Trends Neurosci., 28, 44–50.
Received February 2, 2006; accepted July 25, 2006.
ARTICLE
Communicated by Sebastian Seung
Multiplicative Updates for Nonnegative Quadratic Programming Fei Sha [email protected] Computer Science Division, University of California, Berkeley, Berkeley, CA 94720, U.S.A.
Yuanqing Lin [email protected] Department of Electrical and Systems Engineering, University of Pennsylvania, Philadelphia, PA 19104, U.S.A.
Lawrence K. Saul [email protected] Department of Computer Science and Engineering, University of California, San Diego, La Jolla, CA 92093, U.S.A.
Daniel D. Lee [email protected] Department of Electrical and Systems Engineering, University of Pennsylvania, Philadelphia, PA 19104, U.S.A.
Many problems in neural computation and statistical learning involve optimizations with nonnegativity constraints. In this article, we study convex problems in quadratic programming where the optimization is confined to an axis-aligned region in the nonnegative orthant. For these problems, we derive multiplicative updates that improve the value of the objective function at each iteration and converge monotonically to the global minimum. The updates have a simple closed form and do not involve any heuristics or free parameters that must be tuned to ensure convergence. Despite their simplicity, they differ strikingly in form from other multiplicative updates used in machine learning. We provide complete proofs of convergence for these updates and describe their application to problems in signal processing and pattern recognition. 1 Introduction Many problems in neural computation and statistical learning involve optimizations with nonnegativity constraints. Examples include large margin classification by support vector machines (Vapnik, 1998), density estimation Neural Computation 19, 2004–2031 (2007)
C 2007 Massachusetts Institute of Technology
Multiplicative Updates for Nonnegative Quadratic Programming
2005
in Bayesian networks (Bauer, Koller, & Singer, 1997), dimensionality reduction by nonnegative matrix factorization (Lee & Seung, 1999), and acoustic echo cancellation (Lin, Lee, & Saul, 2004). The optimizations for these problems cannot be solved in closed form; thus, iterative learning rules are required that converge in the limit to actual solutions. The simplest such learning rule is gradient descent. Minimizing an objective function F (v) by gradient descent involves the additive update, vi ← vi − η(∂ F /∂vi ),
(1.1)
where η > 0 is a positive learning rate and all the elements of the parameter vector v = (v1 , v2 , . . . , v N ) are updated in parallel. Gradient descent is not particularly well suited to constrained optimizations, however, because the additive update in equation 1.1 can lead to violations of the constraints. A simple extension enforces the nonnegativity constraints: vi ← max(vi − η(∂ F /∂vi ), 0).
(1.2)
The update rule in equation 1.2 is a special instance of gradient projection methods (Bertsekas, 1999; Serafini, Zanghirati, & Zanni, 2005). The nonnegativity constraints are enforced by projecting the gradient-based updates in equation 1.1 onto the convex feasible set—namely, the nonnegative orthant vi ≥ 0. The projected gradient updates also depend on a learning rate parameter η. For optimizations with nonnegativity constraints, an equally simple but more appropriate learning rule involves the so-called exponentiated gradient (EG) (Kivinen & Warmuth, 1997): vi ← vi e −η(∂ F /∂vi ) .
(1.3)
Equation 1.3 is an example of a multiplicative update. Because the elements of the exponentiated gradient are always positive, this update naturally enforces the nonnegativity constraints on vi . By taking the logarithm of both sides of equation 1.3, we can view the EG update as an additive update1 in the log domain: log vi ← log vi − η(∂ F /∂vi ).
(1.4)
Multiplicative updates such as EG typically lead to faster convergence than additive updates (Kivinen & Warmuth, 1997) if the solution v∗ of the
This update differs slightly from gradient descent in the variable ui = log vi , which would involve the partial derivative ∂ F /∂ui = vi (∂ F /∂vi ) as opposed to what appears in equation 1.4. 1
2006
F. Sha, Y. Lin, L. Saul, and D. Lee
optimization problem is sparse, containing a large number of zero elements. Note, moreover, that sparse solutions are more likely to arise in problems with nonnegativity constraints because in these problems, minima can emerge at vi∗ = 0 without the precise vanishing of the partial derivative (∂ F /∂vi )|v∗ (as would be required in an unconstrained optimization). The EG update in equation 1.3, like gradient descent in equation 1.1 and projected gradient descent in equation 1.2, depends on the explicit introduction of a learning rate η > 0. The size of the learning rate must be chosen to avoid divergent oscillations (if η is too large) and unacceptably slow convergence (if η is too small). The necessity of choosing a learning rate can be viewed as a consequence of the generality of these learning rules; they do not assume or exploit any structure in the objective function F (v) beyond the fact that it is differentiable. Not surprisingly, many objective functions in machine learning have structure that can be exploited in their optimizations—and in particular, by multiplicative updates. Such updates need not involve learning rates, and they may also involve intuitions rather different from the connection between EG and gradient descent in equations 1.3 and 1.4. For example, the expectation-maximization (EM) algorithm for latent variable models (Dempster, Laird, & Rubin, 1977) and the generalized iterative scaling (GIS) algorithm for logistic regression (Darroch & Ratcliff, 1972) can be viewed as multiplicative updates (Saul, Sha, & Lee, 2003), but unlike the EG update, they cannot be cast as simple variants of gradient descent in the log domain. In this article, we derive multiplicative updates for convex problems in quadratic programming where the optimization is confined to an axisaligned region in the nonnegative orthant. Our multiplicative updates have the property that they improve the value of the objective function at each iteration and converge monotonically to the global minimum. Despite their simplicity, they differ strikingly in form from other multiplicative updates used in statistical learning, including EG, EM, and GIS. This article provides a complete derivation and proof of convergence for the multiplicative updates, originally described in previous work (Sha, Saul, & Lee, 2003a, 2003b). The proof techniques should be of general interest to researchers in neural computation and statistical learning faced with problems in constrained optimization. The basic problem that we study in this article is quadratic programming with nonnegativity constraints: minimize
F (v) = 12 vT Av + bT v
subject to
v ≥ 0.
(1.5)
The constraint indicates that the variable v is confined to the nonnegative orthant. We assume that the matrix A is symmetric and strictly positive definite, so that the objective function F (v) in equation 1.5 is bounded below,
Multiplicative Updates for Nonnegative Quadratic Programming
2007
and its optimization is convex. In particular, it has one global minimum and no local minima. Monotonically convergent multiplicative updates for minimizing equation 1.5 were previously developed for the special case of nonnegative matrix factorization (NMF) (Lee & Seung, 2001). In this setting, the matrix elements of A are nonnegative, and the vector elements of b are negative. The updates for NMF are derived from an auxiliary function similar to the one used in EM algorithms. They take the simple, elementwise multiplicative form, vi ←−
|b i | vi , (Av)i
(1.6)
which is guaranteed to preserve the nonnegativity constraints on v. The validity of these updates for NMF hinges on the assumption that the matrix elements of A are nonnegative: otherwise, the denominator in equation 1.6 could become negative, leading to a violation of the nonnegativity constraints on v. In this article, we generalize the multiplicative updates in equation 1.6 to a wider range of problems in nonnegative quadratic programming (NQP). Our updates assume only that the matrix A is positive semidefinite: in particular, it may have negative elements off the diagonal, and the vector b may have both positive and negative elements. Despite the greater generality of our updates, they retain a simple, elementwise multiplicative form. The multiplicative factors in the updates involve only two matrix-vector multiplications and reduce to equation 1.6 for the special case of NMF. The updates can also be extended in a straightforward way to the more general problem of NQP with upper-bound constraints on the variable v ≤ . Under these additional constraints, the variable v is restricted to an axis-aligned “box” in the nonnegative orthant with opposing vertices at the origin and the nonnegative vector . We prove that our multiplicative updates converge monotonically to the global minimum of the objective function for NQP. The proof relies on constructing an auxiliary function, as in earlier proofs for EM and NMF algorithms (Dempster et al., 1977; Lee & Seung, 2001). In general, monotonic improvement in an auxiliary function suffices only to establish convergence to a local stationary point, not necessarily a global minimum. For our updates, however, we are able to prove global convergence by exploiting the particular structure of their fixed points as well as the convexity of the objective function. The rest of this article is organized as follows. In section 2, we present the multiplicative updates and develop some simple intuitions behind their form. The updates are then derived more formally and their convergence properties established in section 3, which completes the proofs sketched in earlier work (Sha et al., 2003a, 2003b). In section 4, we briefly describe
2008
F. Sha, Y. Lin, L. Saul, and D. Lee
some applications to problems in signal processing (Lin et al., 2004) and pattern recognition (Cristianini & Shawe-Taylor, 2000). Finally, in section 5, we conclude by summarizing the main advantages of our approach. 2 Algorithm We begin by presenting the multiplicative updates for the basic problem of NQP in equation 1.5. Some simple intuitions behind the updates are developed by analyzing the Karush-Kuhn-Tucker (KKT) conditions for this problem. We then extend the multiplicative updates to handle the more general problem of NQP with additional upper-bound constraints v ≤ . 2.1 Updates for NQP. The multiplicative updates for NQP are expressed in terms of the positive and negative components of the matrix A. In particular, let A+ and A− denote the nonnegative matrices with elements: +
Aij =
Aij if Aij > 0, 0
otherwise,
and
−
Aij =
|Aij | if Aij < 0, 0
otherwise.
(2.1)
It follows that A = A+ −A− . In terms of these nonnegative matrices, the objective function in equation 1.5 can be decomposed as the combination of three terms, which we write as F (v) = Fa (v) + Fb (v) − Fc (v)
(2.2)
for reasons that will become clear shortly. We use the first and third terms in equation 2.2 to “split” the quadratic piece of F (v) and the second term to capture the linear piece: Fa (v) = 12 vT A+ v, Fb (v) = bT v,
(2.3)
Fc (v) = 12 vT A− v. The decomposition (see equation 2.2) follows trivially from the definitions in equations 2.1 and 2.3. The gradient of F (v) can be similarly decomposed in terms of contributions from these three pieces. We have chosen our notation in equation 2.3 so that b i = ∂ Fb /∂vi ; for the quadratic terms in the objective function, we define the corresponding derivatives: ai =
∂ Fa = (A+ v)i , ∂vi
(2.4)
Multiplicative Updates for Nonnegative Quadratic Programming
ci =
∂ Fc = (A− v)i . ∂vi
2009
(2.5)
Note that the partial derivatives in equations 2.4 and 2.5 are guaranteed to be nonnegative when evaluated at vectors v in the nonnegative orthant. The multiplicative updates are expressed in terms of these partial derivatives as vi ←−
−b i +
b i2 + 4a i c i
2a i
vi .
(2.6)
Note that these updates reduce to the special case of equation 1.6 for NMF when the matrix A has no negative elements. The updates in equation 2.6 are meant to be applied in parallel to all the elements of v. They are remarkably simple to implement and notable for their absence of free parameters or heuristic criteria that must be tuned to ensure convergence. Since a i ≥ 0 and c i ≥ 0, it follows that the multiplicative prefactor in equation 2.6 is always nonnegative; thus, the optimization remains confined to the feasible region for NQP. As we show in section 3, moreover, these updates are guaranteed to decrease the value of F (v) at each iteration. There is a close link between the sign of the partial derivative ∂ F /∂vi and the effect of the multiplicative update on vi . In particular, using the fact that ∂ F /∂vi = a i +b i −c i , it is easy to show that the update decreases vi if ∂ F /∂vi > 0 and increases vi if ∂ F /∂vi < 0. Thus, the multiplicative update in equation 2.6 moves each element vi in an opposite direction to its partial derivative. 2.2 Fixed Points. Further intuition for the updates in equation 2.6 can be gained by examining their fixed points. Let mi denote the multiplicative prefactor inside the brackets on the right-hand side of equation 2.6. Fixed points of the updates occur when either (1) vi = 0 or (2) mi = 1. What does the latter condition imply? Note that the expression for mi is simply the quadratic formula for the larger root of the polynomial p(m) = a i m2 +b i m−c i . Thus, mi = 1 implies that a i +b i −c i = 0. From the definitions in equations 2.2 to 2.5, moreover, it follows that ∂ F /∂vi = 0. Thus, the two criteria for fixed points can be restated as (1) vi = 0 or (2) ∂ F /∂vi = 0. These are consistent with the Karush-Kuhn-Tucker (KKT) conditions for the NQP problem in equation 1.5, as we now show. Let λi denote the Lagrange multiplier used to enforce the nonnegativity constraint on vi . The KKT conditions are given by Av + b = λ,
v ≥ 0,
λ ≥ 0,
λ ◦ v = 0,
(2.7)
2010
F. Sha, Y. Lin, L. Saul, and D. Lee
in which ◦ stands for elementwise vector multiplication. A necessary and sufficient condition for v to solve equation 1.5 is that there exists a vector λ such that v and λ satisfy this system. It follows from equation 2.7 that the gradient of F (v) at its minimum is nonnegative: ∇F = Av + b ≥ 0. Moreover, for inactive constraints (corresponding to elements of the minimizer that are strictly positive), the corresponding partial derivatives of the objective function must vanish: ∂ F /∂vi = 0 if vi > 0. Thus, the KKT conditions imply that (1) vi = 0 or (2) ∂ F /∂vi = 0, and any solution satisfying the KKT conditions corresponds to a fixed point of the multiplicative updates, though not vice versa. 2.3 Upper-Bound Constraints. The multiplicative updates in equation 2.6 can also be extended to incorporate upper-bound constraints of the form v ≤ . A simple way of enforcing such constraints is to clip the output of the updates in equation 2.6:
vi ←− min i ,
−b i +
b i2 + 4a i c i
2a i
vi .
(2.8)
As we show in the next section, this clipped update is also guaranteed to decrease the objective function F (v) in equation 1.5 if it results in a change of vi . 3 Convergence Analysis In this section, we prove that the multiplicative updates in equation 2.6 converge monotonically to the global minimum of the objective function F (v). Our proof is based on the derivation of an auxiliary function that provides an upper bound on the objective function. Similar techniques have been used to establish the convergence of many algorithms in statistical learning (e.g., the EM algorithm, Dempster et al., 1977, for maximum likelihood estimation) and nonnegative matrix factorization (Lee & Seung, 2001). The proof is composed of two parts. We first show that the multiplicative updates monotonically decrease the objective function F (v). Then we show that the updates converge to the global minimum. We assume throughout the article that the matrix A is positive definite such that the objective function is convex. (Though theorem 2 does not depend on this assumption, convexity is used to establish the stronger convergence results that follow.) 3.1 Monotonic Convergence. An auxiliary function G(u, v) for the objective function in equation 1.5 has two crucial properties: (1) F (u) ≤ G(u, v) and (2) F (v) = G(v, v) for all positive vectors u and v. From such an auxiliary
Multiplicative Updates for Nonnegative Quadratic Programming
2011
Figure 1: Using an auxiliary function G(u, v) to minimize an objective function F (v). The auxiliary function is constructed around the current estimate of the minimizer; the next estimate is found by minimizing the auxiliary function, which provides an upper bound on the objective function. The procedure is iterated until it converges to a stationary point (generally a local minimum) of the objective function.
function, we can derive the update rule v = arg minu G(u, v), which never increases (and generally decreases) the objective function F (v): F (v ) ≤ G(v , v) ≤ G(v, v) = F (v).
(3.1)
By iterating this update, we obtain a series of values of v that improve the objective function. Figure 1 graphically illustrates how the auxiliary function G(u, v) is used to compute a minimum of the objective function F (v) at v = v∗ . To derive an auxiliary function for NQP, we first decompose the objective function F (v) in equation 1.5 into three terms as in equations 2.2 and 2.3 and then derive the upper bounds for each of them separately. The following two lemmas establish the bounds relevant to the quadratic terms Fa (u) and Fc (u). Lemma 1. Let A+ denote the matrix composed of the positive elements of the matrix A, as defined in equation 2.1. Then for all positive vectors u and v, the quadratic form Fa (u) = 12 u T A+ u satisfies the following inequality: Fa (u) ≤
1 (A+ v)i 2 ui . 2 i vi
(3.2)
2012
F. Sha, Y. Lin, L. Saul, and D. Lee
Proof. Let δij denote the Kronecker delta function, and let K be the diagonal matrix with elements K ij = δij
(A+ v)i . vi
(3.3)
Since Fa (u) = 12 uT A+ u, the inequality in equation 3.2 is equivalent to the statement that the matrix (K − A+ ) is positive semidefinite. Consider the matrix M whose elements Mij = vi (K ij − A+ ij )v j
(3.4)
are obtained by rescaling componentwise the elements of (K − A+ ). Thus, (K − A+ ) is positive semidefinite if M is positive semidefinite. We note that for all vectors u, uT Mu =
ui vi (K ij − A+ ij )v j u j
ij
=
δij (A+ v)i ui u j v j −
ij
=
2 A+ ij vi v j ui −
ij
=
(3.5) A+ ij vi v j ui u j
(3.6)
ij
A+ ij vi v j ui u j
(3.7)
ij
1 + Aij vi v j (ui − u j )2 ≥ 0. 2
(3.8)
ij
Thus, (K − A+ ) is positive semidefinite, proving the bound in equation 3.2. An alternative proof that (K − A+ ) is semidefinite positive can also be made by appealing to the Frobenius-Perron Theorem (Lee & Seung, 2001). For the terms related to the negative elements in the matrix A, we have following result: Lemma 2. Let A− denote the matrix composed of the negative elements of the matrix A, as defined in equation 2.1. Then for all positive vectors u and v, the quadratic form Fc (u) = 12 u T A− u satisfies the following inequality: −Fc (u) ≤ −
1 2
ij
ui u j A− v v 1 + log . ij i j vi v j
(3.9)
Multiplicative Updates for Nonnegative Quadratic Programming
2013
Proof. To prove this bound, we use the simple inequality: z ≥ 1 + log z. Substituting z = ui u j /(vi v j ) into this inequality gives ui u j ≥ vi v j
ui u j 1 + log vi v j
.
(3.10)
Substituting the above inequality into Fc (u) = 12 ij A− ij ui u j and noting the negative sign, we arrive at the bound in equation 3.9. Combining lemmas 1 and 2 and noting that Fb (u) = proved the following theorem:
i
b i ui , we have
Theorem 1. Define a function G(u, v) on positive vectors u and v by G(u, v) =
1
(A+ v)i
2
i
vi
ui2
−
1 2
ij
ui u j + Aij vi v j 1 + log b i ui . vi v j i −
(3.11) Then G(u, v) is an auxiliary function for the function F (v) = 12 vT Av + bT v, satisfying F (u) ≤ G(u, v) and F (v) = G(v, v). As explained previously, a new estimate that improves the objective function F (v) at its current estimate v is obtained by minimizing the auxiliary function G(u, v) with respect to its first argument u, as shown by following theorem: Theorem 2. Given a positive vector v and a mapping v = M(v) such that v = arg minu G(u, v), we have F (v ) ≤ F (v).
(3.12)
Moreover, if v = v, then the inequality holds strictly. Therefore, the objective function is strictly decreased unless at the fixed point of the mapping M(v), where v = M(v). The mapping M(v) takes the form of equation 2.6 if v is constrained only to be nonnegative and takes the form of equation 2.8 if v is box-constrained. Proof. The inequality in equation 3.12 is a direct result from the definition of the auxiliary function and its relation to the objective function. The derivation in equation 3.1 is reproduced here for easy reference: F (v ) ≤ G(v , v) ≤ G(v, v) = F (v).
(3.13)
2014
F. Sha, Y. Lin, L. Saul, and D. Lee
To show that the objective function is strictly decreased if the new estimate v is not the same as the old estimate v, we must also show that the auxiliary function is strictly decreased: if v = v, then G(v , v) < G(v, v). This can be proved by further examining the properties of the auxiliary function. We begin by showing that G(u, v) is the sum of strictly convex functions of u. For a strictly convex function, the minimizer is unique, and the minimum is strictly less than any other values of the function. We reorganize the expression of the auxiliary function G(u, v) given by equation 3.11 such that there are no interaction terms among the variables ui : G(u, v) =
1 (A+ v)i 2 − ui 1 − ui − (A v)i vi log + b i ui − Aij vi v j . 2 i vi vi 2 i i ij
We identify the auxiliary function with G(u, v) = where G i (ui ) is a single-variable function of ui : G i (ui ) =
1 (A+ v)i 2 ui ui − (A− v)i vi log + b i ui . 2 vi vi
i
G i (ui ) − 12 vT A− v,
(3.14)
Note that the minimizer of G(u, v) can be easily found by minimizing each G i (ui ) separately: vi = arg minui G i (ui ). Moreover, we will show that G i (ui ) is strictly convex in ui . To see this, we examine its second derivative with respect to ui : G i (ui ) =
(A+ v)i (A− v)i + vi . vi ui2
(3.15)
For a positive vector v, (A+ v)i and (A− v)i cannot be simultaneously equal to zero. Otherwise, the ith row of A is all-zero, contradicting our assumption that A is strictly convex. This implies that G i (ui ) is strictly positive and G i (ui ) is strictly convex in ui . Theorem 2 follows directly from the above observation. In particular, if vi is not a minimizer of G i (ui ), then vi = vi and G i (vi ) < G i (ui ). Since the auxiliary function G(u, v) is the sum of all the individual terms G i (ui ) plus a term independent of u, we have shown that G(v , v) is strictly less than G(v, v) if v = v. This leads to F (v ) < F (v). As explained previously, the minimizer v can be computed by finding the minimizer of each individual term G i (ui ). Computing the derivative of G i (ui ) with respect to ui , setting it to zero, and solving for ui lead to the multiplicative updates in equation 2.6. Minimizing G i (ui ) subject to box constraints ui ∈ [0, i ] leads to the clipped multiplicative updates in equation 2.8.
Multiplicative Updates for Nonnegative Quadratic Programming
2015
3.2 Global Convergence. The multiplicative updates define a mapping M from the current estimate v of the minimizer to a new estimate v . By iteration, the updates generate a sequence of estimates {v1 , v2 , . . .}, satisfying vk+1 = M(vk ). The sequence monotonically improves the objective function F (v). Since the sequence {F (v1 ), F (v2 ), . . . , } is monotonically decreasing and is bounded below by the global minimum value of F (v), the sequence converges to some value when k is taken to the limit of infinity. While establishing monotonic convergence of the sequence, however, the above observation does not rule out the possibility that the sequence converges to spurious fixed points of the iterative procedure vk+1 = M(vk ) that are not the global minimizer of the objective function. In this section, we prove that the multiplicative updates do indeed converge to the global minimizer and attain the global minimum of the objective function. (The technical details of this section are not necessary for understanding how to derive or implement the multiplicative updates.) 3.2.1 Outline of the Proof. Our proof relies on a detailed investigation of the fixed points of the mapping M defined by the multiplicative updates. In what follows, we distinguish between the “spurious” fixed points of M that violate the KKT conditions versus the unique fixed point of M that satisfies the KKT conditions and attains the global minimum value of F (v). The basic idea of the proof is to rule out both the possibility that the multiplicative updates converge to a spurious fixed point, as well as the possibility that they lead to oscillations among two or more fixed points. Our proof consists of three stages. First, we show that any accumulation point of the sequence {v1 , v2 , . . .} must be a fixed point of the multiplicative updates—either a spurious fixed point or the global minimizer. Such a result is considerably weaker than global convergence to the minimizer. Second, we show that there do not exist convergent subsequences S of the mapping M with spurious fixed points as accumulation points. In particular, we show that if such a sequence S converges to a spurious fixed point, then it must have a subsequence converging to a different fixed point, yielding a contradiction. Therefore, the accumulation point of any convergent subsequence must be the global minimizer. Third, we strengthen the result on subsequence convergence and show that the sequence {v1 , v2 , . . .} converges to the global minimizer. Our proof starts from Zangwill’s convergence theorem (Zangwill, 1969), a well-known convergence result for general iterative methods, but our final result does not follow simply from this general framework. We review Zangwill’s convergence theorem in appendix A. The application of this theorem in our setting yields the weaker result in the first step of our proof: convergence to a fixed point of the multiplicative updates. As explained in the appendix, however, Zangwill’s convergence theorem does not exclude the possibility of convergence to spurious fixed points. We derive our stronger result of global convergence by exploiting the particular
2016
F. Sha, Y. Lin, L. Saul, and D. Lee
structure of the objective function and the multiplicative update rules for NQP. A key step (see lemma 4) in our proof is to analyze the mapping M on sequences that are in the vicinity of spurious fixed points. Our analysis appeals repeatedly to the specific properties of the objective function and the mapping induced by the multiplicative updates. The following notation and preliminary observations will be useful. We 1 2 k k ∞ use {vk }∞ 1 to denote the sequence {v , v , . . . , v , . . .} and {F (v )}1 to denote 1 2 k the corresponding sequence {F (v ), F (v ), . . . , F (v ), . . .}. We assume that the matrix A in equation 1.5 is strictly positive definite so that the objective function has a unique global minimum. From this, it also follows that the matrix A+ in equation 1.7 has strictly positive elements along the diagonal. 3.2.2 Positivity. Our proof will repeatedly invoke the observation that for a strictly positive vector v, the multiplicative updates in equation 2.6 yield a strictly positive vector v = M(v). There is one exception to this rule, which we address here. Starting from a strictly positive vector v, the multiplicative updates will set an element vi = 0 directly to zero in the case that b i ≥ 0 and the ith row of the matrix A has no negative elements. It is easy to verify in this case, however, that the global minimizer v∗ of the objective function does have a zero element at vi∗ = 0. Once an element in zeroed by the multiplicative updates, it remains zero under successive updates. In effect, when this happens, the original problem in NQP reduces to a smaller problem—of dimensionality equal to the number of nontrivial modes in the original system. Without loss of generality, therefore, we will assume in what follows that any trivial degrees of freedom have already been removed from the problem. More specifically, we will assume that the ith row of the matrix A has one or more negative elements whenever b i ≥ 0 and that consequently, a strictly positive vector v is always mapped to a strictly positive vector v = M(v). 3.2.3 Accumulation Points of {vk }∞ 1 . The following lemma is a direct result of Zangwill’s convergence theorem, as reviewed in appendix A. It establishes the link between the accumulation points of {vk }∞ 1 and the fixed points of M. Lemma 3. Given a point v1 , suppose the update rule in equation 2.6 generates a sequence {vk }∞ 1 . Then either the algorithm terminates at a fixed point of M or the accumulation point of any convergent subsequence in {vk }∞ 1 is a fixed point of M. Proof. If there is a k ≥ 1 such that vk is a fixed point of M, then the update rule terminates. Therefore, we consider the case that an infinite sequence is generated and show how to apply Zangwill’s convergence theorem. Let M be the update procedure in Zangwill’s convergence theorem. We first verify that the sequence {vk }∞ 1 generated by M is in a compact set.
Multiplicative Updates for Nonnegative Quadratic Programming
2017
Because {F (vk )}∞ 1 is a monotonically decreasing sequence, it follows that for all k, vk ∈ = {v|F (v) ≤ F (v1 )}.
(3.16)
Note that the set is compact because it defines an ellipsoid confined to the positive orthant. We define the desired set S to be the collection of all the fixed points of M. If v ∈ / S, then from theorem 2, we have that F (M(v)) < F (v). On the other hand, if v ∈ S, then we have that F (M(v)) = F (v). This shows that the mapping M maintains strict monotonicity of the objective function outside the desired set. The last condition to verify is that M is closed at v if v is not in the desired set. Note that M is continuous if v = 0. Therefore, if the origin v = 0 is a fixed point of M, then M is closed outside the desired set. If the origin is not a fixed point of M, then it cannot be the global minimizer. Moreover, we can choose the initial estimate v1 such that F (v1 ) < F (0). With this choice, it follows from the monotonicity of M that the origin is not contained in and that M is continuous on . Either way, we have shown that M is closed on a proper domain. Therefore, we can apply Zangwill’s convergence theorem to the mapping M restricted on : the limit of any convergent subsequence in {vk }∞ 1 is in the desired set or equivalently, a fixed point of M. Remark. It is easy to check whether the global minimizer occurs at the origin with value F (0) = 0. In particular, if all the elements of b are nonnegative, then the origin is the global minimizer. On the other hand, if there is a nonnegative element of b, then we can choose the initial estimate v1 such that F (v1 ) < F (0). For example, suppose b k < 0. Then we can choose v1 such that its kth element is σ and all other elements are τ . A positive σ and τ can be found such that F (v1 ) < 0 by noting F (v1 ) =
1 1 Aij τ 2 + Aik τ σ + b i τ + Akk σ 2 + b k σ 2 i, j =k 2 i =k i =k
1 1 2 ≤ Akk σ + b k σ. |Aij |τ + |Aik |σ + |b i | τ + 2 2 i i ij
(3.17) Note that if we choose a positive σ < −2b k /Akk , we can always find a positive τ such that F (v1 ) < 0 because the right-most term of the inequality in equation 3.17 is negative and the left and middle terms vanish as τ → 0+ .
2018
F. Sha, Y. Lin, L. Saul, and D. Lee
Figure 2: Fixed points of the multiplicative updates: the global minimizer v∗ , indicated by a star and spurious fixed points vˆ and vˆ , indicated by squares. Contour lines of the objective function are shown as ellipses. A hypothetical ˆ is sequence {vk }∞ 1 with a subsequence converging to the spurious fixed point v represented by solid lines connecting small black circles. The δ-ball around the spurious fixed point vˆ does not intersect the δ -ball around the other spurious fixed point vˆ .
3.2.4 Properties of the Fixed Points. As stated in section 2.2, the minimizer of F (v) satisfies the KKT conditions and corresponds to a fixed point of the mapping M defined by the multiplicative update rule in equation 2.6. The mapping M, however, also has fixed points that do not satisfy the KKT conditions. We refer to these as spurious fixed points. Lemma 3 states that any convergent subsequence in {vk }∞ 1 must have a fixed point of M as its accumulation point. To prove that the multiplicative updates converge to the global minimizer, we will show that spurious fixed points cannot be accumulation points. Our strategy is to demonstrate that any subsequence S converging to a spurious fixed point must itself have a subsequence “running away” from the fixed point. The idea of the proof is shown schematically in Figure 2. The star in the center of the figure denotes the global minimizer v∗ . Black squares denote spurious fixed points vˆ and vˆ . The figure also shows a hypothetical subsequence that converges to the spurious fixed point v. ˆ
Multiplicative Updates for Nonnegative Quadratic Programming
2019
At a high level, the proof (by contradiction) is as follows. Suppose that there exists a convergent subsequence as shown in Figure 2. Then we can draw a very small -ball around the spurious fixed point vˆ containing an infinite number of elements of the subsequence. We will show that under the mapping M, the subsequence must have an infinite number of successors that are outside the -ball yet inside a δ-ball where δ > . This bounded successor sequence must have a subsequence converging to an accumulation point, which by lemma 3 must also be a fixed point. However, we can choose the δ-ball to be sufficiently small such that the annulus between the -ball and δ-ball contains no other fixed points. This yields a contradiction. More formally, we begin by proving the following lemma: Lemma 4. Let v1 denote a positive initial vector satisfying F (v1 ) < F (0). Suppose that the sequence vk+1 = M(vk ) generated by the iterative update in equation 2.6 has a subsequence that converges to the spurious fixed point v. ˆ Then there exists an > 0 and a δ > 0 such that for every v = vˆ such that v − v ˆ < , there exists an integer p ≥ 1 such that < M p (v) − v ˆ < δ, where M p (v) is p times composition of M applied to v: M ◦ M· · · ◦ M(v). p
Proof. If vˆ is a fixed point, then either vˆ i = 0 or vˆ i = 0 for any i. If the former is true, as shown in section 2.2, it follows that (∂ F /∂vi )|vˆ = 0. When vˆ i = 0, then either (∂ F /∂vi )|vˆ ≥ 0 or (∂ F /∂vi )|vˆ < 0. If vˆ is a spurious fixed point that violates the KKT conditions, then there exists at least one i such that vˆ i = 0 and
∂ F < 0. ∂vi vˆ
ˆ < }. By Let vˆ be a small ball centered at vˆ with radius of : vˆ = {v| v − v continuity, there exists an such that (∂ F /∂vi ) < 0 for all v ∈ vˆ . Let be the image of vˆ under the mapping M. Since M is a continuous mapping, we can find a minimum ball δvˆ = {v| v − v ˆ < δ} to encircle . We claim that the and δ satisfy the lemma. As observed in section 2.2, the multiplicative update increases vi if ∂ F /∂vi is negative. Consider the sequence {M(v), M ◦ M(v), . . . , Mk (v), . . .}. The ith component of the sequence is monotonically increasing until the condition (∂ F /∂vi ) becomes nonnegative. This happens only if the element of the sequence is outside the vˆ ball. Thus, for every v ∈ vˆ , the update will push vi to larger and larger values until it escapes from the ball. Let p be the smallest integer such that M p−1 (v) ∈ vˆ and M p (v) ∈ / vˆ
2020
F. Sha, Y. Lin, L. Saul, and D. Lee
By construction of the δvˆ ball, M p is inside the ball since is the image of the vˆ under the mapping M and M p−1 (v) ∈ vˆ . Therefore, M p (v) ∈ ⊂ δvˆ . The size of the δ-ball depends on the spurious fixed point vˆ and . Can δvˆ contain another spurious fixed point vˆ ? The following lemma shows that we can choose sufficiently small such that the ball δvˆ contains no other fixed points. Lemma 5. If vˆ and vˆ are two different spurious fixed points, let and δ be the radii for vˆ such that lemma 4 holds and and δ be the radii for vˆ . There exists an 0 > 0 such that max( , ) ≤ 0 and δvˆ ∩ δvˆ is empty. Proof. It suffices to show that M(v) becomes arbitrarily close to vˆ as v approaches v. ˆ Since M is a bounded and continuous mapping, the image
of vˆ under M becomes arbitrarily small as → 0. Note that vˆ is a fixed point of M, and we have vˆ ∈ . Therefore, the δ-ball centered at vˆ can be made arbitrarily small as v approaches v. ˆ Let us choose sufficiently small such that δ is less than the half of the distance between vˆ and vˆ and choose likewise. Then the intersection of δvˆ and δvˆ is empty. This is illustrated in Figure 2. Because the spurious fixed points are separated by their δ-balls, we will show that the existence of a subsequence converging to a spurious fixed point leads to a contradiction. This observation leads to the following theorem. Theorem 3. If the matrix A is strictly positive definite, then no convergent subsequence of {vk }∞ 1 can have a spurious fixed point of the multiplicative updates as an accumulation point. Proof. The number of spurious fixed points is finite and bounded above by the number of ways of choosing zero elements of v. Let > 0 be the minimum pairwise distance between these fixed points. By lemma 4, we can choose a small such that the radius of the δ-ball for any spurious fixed point is less than /2. With this choice, the δ-balls at different spurious fixed points are nonoverlapping. k Suppose there is a convergent subsequence {vk } ⊂ {vk }∞ 1 such that v −→ vˆ with k ∈ K, where K is an index subset and vˆ is a spurious fixed point. Without loss of generality, we assume the whole subsequence is contained in the -ball of v. ˆ For each element vk of the subsequence, by lemma 4, there exists an integer pk such that M pk (vk ) is outside the -ball yet inside the δ-ball. Consider the “successor” sequence M pk (vk ) with k ∈ K, schematically shown in Figure 2 as the black circles between the -ball and the δ-ball. The infinite successor sequence is bounded between the -ball and δ-ball and therefore
Multiplicative Updates for Nonnegative Quadratic Programming
2021
must have a convergent subsequence. By lemma 3, the accumulation point of this subsequence must be a fixed point of M. However, this leads to contradiction. On one hand, the subsequence is outside the -ball of vˆ so vˆ is not the accumulation point. On the other hand, the subsequence is inside the δ-ball of v: ˆ therefore, it cannot have any other spurious fixed point vˆ as its accumulation point, because we have shown that all pairs of fixed points are separated by their respective δ-balls. Therefore, the accumulation point of the subsequence cannot be a fixed point. Thus, we arrive at a contradiction, showing that spurious fixed points cannot be accumulation points of any convergent subsequence. 3.2.5 Convergence to the Global Minimizer. We have shown that the only possible accumulation point of {vk }∞ 1 is the global minimizer: the fixed point of M that satisfies the KKT conditions. We now show that the sequence {vk }∞ 1 itself does indeed converge to the global minimizer. Theorem 4. Suppose that the origin is not the global minimizer and that we choose a positive initial vector v1 such that F (v1 ) < 0. Then the sequence {vk }∞ 1 converges to the global minimizer v∗ , and {F (vk )}∞ 1 converges to the optimal value F ∗ = F (v∗ ). Proof. As shown in equation 3.16, the infinite sequence {vk }∞ 1 is a bounded set; therefore, it must have an accumulation point. By the preceding theorem, the accumulation point of any convergent subsequence of {vk }∞ 1 cannot be a spurious fixed point; thus, any convergent subsequence must converge to the fixed point that satisfies the KKT conditions: the global minimizer. By monotonicity, it immediately follows that {F (vk )}∞ 1 converges to the optimal value F ∗ of the objective function. Since F ∗ < F (v) ˆ for all spurious fixed points v, ˆ we can find an ∗ > 0 such that the set ∗ = {v|F (v) ≤ F ∗ + ∗ } contains no spurious fixed points of M. Moreover, since {F (vk )}∞ 1 is a monotonically decreasing sequence converging to F ∗ , there exists a k0 such that vk ∈ ∗ for all k ≥ k0 . We now prove the theorem by contradiction. Suppose {vk }∞ 1 does not converge to the minimizer v∗ . Then there exists an > 0 such that the set = {vk : vk − v∗ > } has an infinite number of elements. In other words, there must be a subsequence of {vk }∞ 1 in which every element has distance at least from the minimizer. Moreover, the intersection of and ∗ must have an infinite number of elements. Note that by construction, ∗ contains no fixed points
2022
F. Sha, Y. Lin, L. Saul, and D. Lee
Figure 3: Estimating time delay from signals and their echos in reverberant environments.
other than the global minimizer, and does not contain the global minimizer. Thus, there are no fixed points in ∩ ∗ . The infinite set ∩ ∗ , however, is bounded, and therefore must have an accumulation point; by lemma 3, this accumulation point must be a fixed point. This yields a contradiction. Hence, the set cannot have an infinite number of elements, and the sequence {vk }∞ 1 must converge to the global minimizer. 4 Applications In this section, we sketch two real-world applications of the multiplicative updates to problems in signal processing and pattern recognition. Recently, the updates have also been applied by astrophysicists to estimate the mass distribution of a gravitational lens and the positions of the sources from combined strong and weak lensing data (Diego, Tegmark, Protopapas, & Sandvik, 2007). 4.1 Acoustic Time Delay Estimation. In a reverberant acoustic environment, microphone recordings capture echos reflected from objects such as walls and furniture in addition to the signals that propagate directly from a sound source. As illustrated in Figure 3, the signal at the microphone x(t) can be modeled as a linear combination of the source signal s(t) at different delay times: s(t − t), s(t − t ) and s(t − t ). Given a received signal x(t)
Multiplicative Updates for Nonnegative Quadratic Programming
2023
and its source signal s(t), how can we identify in x(t) all the time-delayed components of s(t)? To this end, we consider the model (Lin et al., 2004) N
x(t) =
αi s(t − ti )
αi ≥ 0,
with
(4.1)
i N in which {ti }i=1 are all possible time delays (discretized to some finite resolution) and {αi } are the relative amplitudes (or attenuations) of the time-delayed components. The nonnegativity constraints in equation 4.1 incorporate the assumption that only the amplitudes of acoustic waves are affected by reflections while the phases are retained (Allen & Berkley, 1979). Within this model, the time-delayed components of s(t) can be identified by computing the amplitudes αi that best reconstruct x(t). The reconstruction with least-squares error is obtained from the nonnegative deconvolution:
1 αi s(t − ti )|2 . α ∗ = arg min |x(t) − αi ≥0 2 i N
(4.2)
Nonzero weights αi∗ in the least-squares reconstruction are interpreted as indicating time-delayed components in the received signal with delays ti . It is convenient to rewrite this optimization in the frequency domain. Let x˜ ( f ) and s˜ ( f ) denote the Fourier transforms of x(t) and s(t), respectively. Also, define the positive semidefinite matrix K ij and the real-valued coefficients c i by K ij =
|˜s ( f )|2 e j2π f (t j −ti ) ,
(4.3)
s˜ ∗ ( f ) x˜ ( f )e j2π f ti ,
(4.4)
f
ci =
f
where the sums are over positive and negative frequencies. In terms of the matrix K ij and coefficients c i , the optimization in equation 4.2 can be rewritten as
ij αi K ij α j
minimize
1 2
subject to
αi ≥ 0.
−
i
c i αi
(4.5)
This has the same form as the NQP problem in equation 1.5 and can be solved by the multiplicative updates in equation 2.6. Note that K ij defines a Toeplitz matrix if the possible time delays ti are linearly spaced. Using fast Fourier transforms, the Toeplitz structure of K ij can be exploited for much faster matrix-vector operations per multiplicative update.
2024
F. Sha, Y. Lin, L. Saul, and D. Lee iteration: 0
iteration: 512
1
1
1
0.8
0.8
0.8
0.6
0.6
0.6
0.4
0.4
0.4
0.2
0.2
0.2
0
0
0
iteration: 256
iteration: 1024
1
1
0.8
0.8
0.8
0.6
0.6
0.6
0.4
0.4
0.4
0.2
0.2
0.2
0
0
0 5 time delay (Ts)
10
0 5 time delay (T ) s
1
10
0
iteration: 2048
iteration: 4196
0 5 time delay (T )
10
s
Figure 4: Convergence of the multiplicative updates for acoustic time delay estimation.
Figure 4 shows the convergence of the multiplicative updates for a problem in acoustic time delay estimation. The source signal s(t) in this example was a 30 ms window of speech, and the received signal x(t) was given by x(t) = s(t − Ts ) + 0.5 s(t − 8.5Ts ), where Ts was the sampling period. The vertical axes measure the estimated amplitudes αi after different numbers of iterations of the multiplicative updates; the horizontal axes measure the time delays ti in the units of Ts . The vertical dashed lines indicate the delays at Ts and 8.5Ts . The figure shows that as the number of iterations is increased, the time delays are accurately predicted by the peaks in the estimated amplitudes αi . 4.2 Large Margin Classification. Large margin classifiers have been applied successfully to many problems in machine learning and statistical pattern recognition (Cristianini & Shawe-Taylor, 2000; Vapnik, 1998). These classifiers use hyperplanes as decision boundaries to separate positively and negatively labeled examples represented by multidimensional vectors. Generally the hyperplanes are chosen to maximize the minimum distance (known as the margin) from any labeled example to the decision boundary (see Figure 5). N Let {(xi , yi )}i=1 denote a data set of N labeled “training” examples with binary class labels yi = ±1. The simplest case, shown in Figure 5 (left),
Multiplicative Updates for Nonnegative Quadratic Programming H
2025
H H
H H
H
Figure 5: Large margin classifiers. Positively and negatively labeled examples are indicated by filled and hollowed circles, respectively. (Left) A linearly separable data set. The support vectors for the maximum margin hyperplane H lie on two hyperplanes H + and H − parallel to the decision boundary. Large margin classifiers maximize the distance between these hyperplanes. (Right) A linearly inseparable data set. The support vectors in this case also include examples lying between H + and H − that cannot be classified correctly.
is that the two classes are linearly separable by a hyperplane that passes through the origin. Let w∗ denote the hyperplane’s normal vector; the classification rule, given by y = sgn(w∗ T x), labels examples based on whether they lie above or below the hyperplane. The maximum margin hyperplane is computed by solving the constrained optimization: minimize subject to
1 T w w 2 yi wT xi
≥ 1, i = 1, 2, . . . , N.
(4.6)
The constraints in this optimization ensure that all the training examples are correctly labeled by the classifier’s decision rule. While a potentially infinite number of hyperplanes satisfy these constraints, the classifier with minimal ||w|| (and thus maximal margin) has provably small error rates (Vapnik, 1998) on unseen examples. The optimization problem in equation 4.6 is a convex quadratic programming problem in the vector w. Its dual formulation is N
αi α j yi y j xiT x j −
minimize
1 2
subject to
αi ≥ 0, i = 1, 2, . . . , N.
i, j=1
i
αi
(4.7)
Let A denote the positive semidefinite matrix with elements yi y j xiT x j , and let e denote the column vector of all ones. The objective function for the dual in equation 4.7 can then be written as the NQP problem: L(α) =
1 T α Aα − eT α, 2
(4.8)
2026
F. Sha, Y. Lin, L. Saul, and D. Lee support vectors 2
support vectors 5 iteration: 0 0 5
iteration: 400
0 1 iteration: 100
0 4
0.5
iteration: 500
0 1 iteration: 200
2 0 2 1 0 0
1
0.5
iteration: 600
0 1 iteration: 300 200 400 600 800 1000 1200
0.5 0 0
iteration: 700 200 400 600 800 1000 1200
Figure 6: Convergence of the multiplicative updates in equation 2.6 for a large margin classifier distinguishing handwritten digits (2s versus 3s). The coefficients corresponding to nonsupport vectors are quickly attenuated to zero.
and the multiplicative updates in equation 2.6 can be used to find the optimal α ∗ . Labeled examples that correspond to active constraints in equation 4.6 are called support vectors. The normal vector w∗ is completely determined by support vectors since the solution to the primal problem, equation 4.6, is given by w∗ =
yi αi∗ xi .
(4.9)
i
For nonsupport vectors, the inequalities are strictly satisfied in equation 4.6, and their corresponding Lagrange multipliers vanish (that is, αi∗ = 0). Figure 6 illustrates the convergence of the multiplicative updates for ¨ large margin classification of handwritten digits (Scholkopf et al., 1997). The plots show the estimated support vector coefficients αi after different numbers of iterations of the multiplicative updates. The horizontal axes in these plots index the coefficients αi of the N = 1389 training examples, and the vertical axes show their values. For ease of visualization, the training examples were ordered so that support vectors appear to the left and nonsupport vectors to the right. The coefficients were uniformly initialized as αi = 1. Note that the nonsupport vector coefficients are quickly attenuated to zero. Multiplicative updates can also be used to train large margin classifiers when the labeled examples are not linearly separable, as shown in Figure 5 (right). In this case, the constraints in equation 4.6 cannot all be simultaneously satisfied, and some of them must be relaxed. One simple relaxation
Multiplicative Updates for Nonnegative Quadratic Programming
2027
is to permit some slack in the constraints but to penalize the degree of slack as measured by the 1 -norm:
minimize
wT w + C
subject to
yi w xi ≥ 1 − ξi
i
ξi
T
(4.10)
ξi ≥ 0. The parameter C balances the slackness penalty versus the large margin criterion; the resulting classifiers are known as soft margin classifiers. The dual of this optimization is an NQP problem with the same quadratic form as the linearly separable case but with box constraints: minimize
1 T α Aα 2
− eT α
subject to
0 ≤ αi ≤ C, i = 1, 2, . . . , N.
(4.11)
The clipped multiplicative updates in equation 2.8 can be used to perform this optimization for soft margin classifiers. Another way of handling linear inseparability is to embed the data in a high-dimensional nonlinear feature space and then to construct the maximum margin hyperplane in feature space. The nonlinear mapping is performed implicitly by specifying a kernel function that computes the inner product in feature space. The optimization for the maximum margin hyperplane in feature space has the same form as equation 4.7 except that the original Gram matrix with elements xiT x j is replaced by the kernel matrix of inner products in feature space (Cristianini & Shawe-Taylor, 2000; Vapnik, 1998). The use of multiplicative updates for large margin classification of linearly inseparable data sets is discussed further in (Sha et al., 2003a, 2003b). A large number of algorithms have been investigated for nonnegative quadratic programming in SVMs. Among them, there are many criteria that could be compared, such as speed of convergence, memory requirements, and ease of implementation. The main utility of the multiplicative updates appears to lie in their ease of implementation. The updates are very well suited for applications involving small to moderately sized data sets, where computation time is not a primary concern and where the simple, parallel form of the updates makes them easy to implement in high-level languages such as Matlab. The most popular methods for training SVMs—so-called subset methods—take a fundamentally different approach to NQP. In contrast to the parallel form of the multiplicative updates, subset methods split the variables at each iteration into two sets: a fixed set in which the variables are held constant and a working set in which the variables are optimized
2028
F. Sha, Y. Lin, L. Saul, and D. Lee
by an internal subroutine. At the end of each iteration, a heuristic is used to transfer variables between the two sets and improve the objective function. Two subset methods have been widely used for training SVMs. The first is the method of sequential minimal optimization (SMO) (Platt, 1999), which updates only two coefficients of the weight vector per iteration. In this case, there exists an analytical solution for the updates, so that one avoids the expense of an iterative optimization within each iteration of the main loop. SMO enforces the sum and box constraints for soft margin classifiers. If the sum constraint is lifted, then it is possible to update the coefficients of the weight vector sequentially, one at a time, with an adaptive learning rate that ensures monotonic convergence. This coordinate descent approach is also known as the kernel Adatron (Friess, Cristianini, & Campbell, 1998). SMO and kernel Adatron are among the most viable methods for training SVMs on large data sets, and experiments have shown that they converge much faster than the multiplicative updates (Sha et al., 2003b). Nevertheless, for simplicity and ease of implementation, we believe that the multiplicative updates provide an attractive starting point for experimenting with large margin classifiers.
5 Summary and Discussion In this article, we have described multiplicative updates for solving convex problems in NQP. The updates are distinguished by their simplicity in both form and computation. We showed that the updates lead to monotonic improvement in the objective function for NQP and converge to global minima. The updates can be viewed as generalizations of the iterative rules previously developed for nonnegative matrix factorization (Lee & Seung, 2001). They have a strikingly different form from other additive and multiplicative updates used in statistical learning. We can also compare the multiplicative updates to interior point methods (Wright, 1997) that have been studied for NQP. These methods start from an interior point of the feasible region and then iteratively update the current estimate of the minimizer along particular search directions. There are many ways to determine the search directions—for example, using Newton’s method to solve the equations characterizing primal and dual optimality, or approximating the original optimization problem by a simpler subproblem inside the trust region of the current estimate. The resulting updates take an elementwise additive form, stepping along a particular search direction in the nonnegative orthant. The step size is chosen to ensure that the search remains in the feasible region while making progress toward the minimizer. If the search direction corresponds to the negative gradient of the objective function F (v), then the updates reduce to steepest descent. Most of the computational effort in interior point methods is devoted to deriving search directions and maintaining the feasibility of the updated estimates.
Multiplicative Updates for Nonnegative Quadratic Programming
2029
The multiplicative updates are similar to trust region methods in spirit. Instead of constructing an ellipsoidal trust region centered at the current estimate of the minimizer, however, we have derived the updates from a nonlinear yet analytically tractable auxiliary function. Optimizing the auxiliary function guarantees the improvement of the objective function in the nonnegative orthant, which can be viewed as the trust region for each update. The search direction derived from the auxiliary function is very simple to compute, as opposed to that of many interior point methods. It remains an open question to quantify more precisely the rate of convergence of the multiplicative updates. Though not as well theoretically characterized as traditional methods for NQP, the multiplicative updates have nevertheless proven extremely useful in practice. In this article, we have described our own use of the updates for acoustic echo cancellation and large margin classification. Meanwhile, others have applied the updates to NQP problems that arise in the analysis of astrophysical data (Diego et al., 2007). We are hopeful that more applications will continue to emerge in other areas of neural computation and statistical learning. Appendix: Zangwill’s Convergence Theorem Zangwill’s convergence theorem enumerates the conditions for global convergence of general iterative procedures. The theorem is presented in its most general form in Zangwill (1969). For our purposes here, we specifically state the theorem in the context of optimization. Let F (v) denote the objective function to be minimized by an iterative update rule vk+1 = M(vk ), where the domain and range of the mapping M : V → V lie in the feasible set. Suppose that we apply the mapping M to generate a sequence of parameters {vk }∞ k=1 . Our goal is to examine whether this sequence converges to a point in a “desired” set V ∗ ⊂ V. Assume that the objective function F (v) and the mapping M are continuous and that the following conditions are met: 1. All points vk are in a compact set that is a subset of V. 2. If v ∈ / V ∗ , then the update leads to a strict reduction in the objective function. That is, F (M(v)) < F (v). 3. If v ∈ V ∗ , then either M(v) = v or F (M(v)) ≤ F (v). Zangwill’s convergence theorem states that under these conditions, either ∗ the sequence {vk }∞ k=1 stops at a point in the set V , or all accumulation points of the sequence are in the set V ∗ . The theorem can be used to analyze the convergence of iterative update rules by verifying that these three conditions hold for particular “desired” sets. For the multiplicative updates in equation 2.6, the theorem implies convergence to a fixed point v = M(v). It does not, however, imply
2030
F. Sha, Y. Lin, L. Saul, and D. Lee
convergence to the unique fixed point that is the global minimizer of the objective function. In particular, if we constrain V ∗ to contain only the global minimizer, then condition (2), which stipulates that F (v) strictly decreases under the mapping M for all V ∈ / V ∗ , is clearly violated due to the existence of “spurious” fixed points. The proof of global convergence thus requires the extra machinery of section 3.2. Acknowledgments This work was supported by NSF Award 0238323. References Allen, J. B., & Berkley, D. A. (1979). Image method for efficiently simulating smallroom acoustics. Journal of the Acoustical Society of America, 65, 943–950. Bauer, E., Koller, D., & Singer, Y. (1997). Update rules for parameter estimation in Bayesian networks. In Proceedings of the Thirteenth Annual Conference on Uncertainty in AI (pp. 3–13). San Francisco: Morgan Kaufmann. Bertsekas, D. P. (1999). Nonlinear programming (2nd ed.). Belmont, MA: Athena Scientific. Cristianini, N., & Shawe-Taylor, J. (2000). An introduction to support vector machines. Cambridge: Cambridge University Press. Darroch, J. N., & Ratcliff, D. (1972). Generalized iterative scaling for log-linear models. Annals of Mathematical Statistics, 43, 1470–1480. Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39, 1–38. Diego, J. M., Tegmark, M., Protopapas, P., & Sandvik, H. B. (2007). Combined reconstruction of weak and strong lensing data with WSLAP. Monthly Notices of the Royal Astronomical Society, 375, 958–970. Friess, T., Cristianini, N., & Campbell, C. (1998). The Kernel-Adatron algorithm: A fast and simple learning procedure for support vector machines. In Proceedings of the Fifteenth International Conference on Machine Learning (pp. 188–196). San Francisco: Morgan Kaufmann. Kivinen, J., & Warmuth, M. (1997). Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132, 1–63. Lee, D. D., & Seung, H. S. (1999). Learning the parts of objects with nonnegative matrix factorization. Nature, 401, 788–791. Lee, D. D., & Seung, H. S. (2001). Algorithms for non-negative matrix factorization. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 556–562). Cambridge, MA: MIT Press. Lin, Y., Lee, D. D., & Saul, L. K. (2004). Nonnegative deconvolution for time of arrival estimation. In Proceedings of the International Conference of Speech, Acoustics, and Signal Processing (ICASSP-2004) (Vol. 2, pp. 377–380). Piscataway, NJ: IEEE. Platt, J. (1999). Fast training of support vector machines using sequential minimal ¨ optimization. In B. Scholkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances
Multiplicative Updates for Nonnegative Quadratic Programming
2031
in kernel methods—Support vector learning (pp. 185–208). Cambridge, MA: MIT Press. Saul, L. K., Sha, F., & Lee, D. D. (2003). Statistical signal processing with nonnegativity constraints. In Proceedings of the Eighth European Conference on Speech Communication and Technology (Vol. 2, pp. 1001–1004). Geneva, Switzerland. ¨ Scholkopf, B., Sung, K., Burges, C., Girosi, F., Niyogi, P., Poggio, T., & Vapnik, V. (1997). Comparing support vector machines with gaussian kernels to radial basis function classiers. IEEE Transactions on Signal Processing, 45, 2758–2765. Serafini, T., Zanghirati, G., & Zanni, L. (2005). Gradient projection methods for quadratic programs and applications in training support vector machines. Optimization Methods and Software, 20, 353–378. Sha, F., Saul, L. K., & Lee, D. D. (2003a). Multiplicative updates for nonnegative quadratic programming in support vector machines. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15 (pp. 897– 904). Cambridge, MA: MIT Press. Sha, F., Saul, L. K., & Lee, D. D. (2003b). Multiplicative updates for large margin classifiers. In Proceedings of the Sixteenth Annual Conference on Computational Learning Theory (COLT-03) (pp. 188–202). Berlin: Springer. Vapnik, V. (1998). Statistical learning theory. New York: Wiley. Wright, S. J. (1997). Primal-dual interior point methods. Philadelphia, PA: SIAM. Zangwill, W. J. (1969). Nonlinear programming: A unified approach. Englewood Cliffs, NJ: Prentice Hall.
Received April 10, 2006; accepted August 10, 2006.
ARTICLE
Communicated by Nicolas Brunel
Critical Analysis of Dimension Reduction by a Moment Closure Method in a Population Density Approach to Neural Network Modeling Cheng Ly [email protected] Courant Institute of Mathematical Sciences, New York University, New York, NY 10012, U.S.A.
Daniel Tranchina [email protected] Department of Biology, Courant Institute of Mathematical Sciences and Center for Neural Science, New York University, New York, NY 10003, U.S.A.
Computational techniques within the population density function (PDF) framework have provided time-saving alternatives to classical Monte Carlo simulations of neural network activity. Efficiency of the PDF method is lost as the underlying neuron model is made more realistic and the number of state variables increases. In a detailed theoretical and computational study, we elucidate strengths and weaknesses of dimension reduction by a particular moment closure method (Cai, Tao, Shelley, & McLaughlin, 2004; Cai, Tao, Rangan, & McLaughlin, 2006) as applied to integrate-and-fire neurons that receive excitatory synaptic input only. When the unitary postsynaptic conductance event has a singleexponential time course, the evolution equation for the PDF is a partial differential integral equation in two state variables, voltage and excitatory conductance. In the moment closure method, one approximates the conditional kth centered moment of excitatory conductance given voltage by the corresponding unconditioned moment. The result is a system of k coupled partial differential equations with one state variable, voltage, and k coupled ordinary differential equations. Moment closure at k = 2 works well, and at k = 3 works even better, in the regime of high dynamically varying synaptic input rates. Both closures break down at lower synaptic input rates. Phase-plane analysis of the k = 2 problem with typical parameters proves, and reveals why, no steady-state solutions exist below a synaptic input rate that gives a firing rate of 59 s−1 in the full 2D problem. Closure at k = 3 fails for similar reasons. Low firing-rate solutions can be obtained only with parameters for the amplitude or kinetics (or both) of the unitary postsynaptic conductance event that are on the edge of the physiological range. We conclude that this
Neural Computation 19, 2032–2092 (2007)
C 2007 Massachusetts Institute of Technology
Moment Closure in Population Density Theory
2033
dimension-reduction method gives ill-posed problems for a wide range of physiological parameters, and we suggest future directions.
1 Introduction A growing area of computational neuroscience is the simulation and analysis of electrical activity in large-scale neuronal networks, such as cortical areas involved in visual information processing (Ursino & La Cara, 2005; Rangan, Cai, & McLaughlin, 2005; McLaughlin, Shapley, & Shelley, 2003; Krukowski & Miller, 2001; McLaughlin, Shapley, Shelley, & Wielaard, 2000; Somers, Nelson, & Sur, 1995) and working memory (Miller & Wang, 2006; Teramae & Fukai, 2005; Wang, 1999). The large number of neurons and even larger number of synapses involved in these networks present daunting analytical and computational problems. Population density methods (PDM) can help on both fronts. Population density methods have been applied fruitfully to problems in theoretical and computational neuroscience for many years (Wilbur & Rinzel, 1983; Tuckwell, 1988; Kuramoto, 1991; Abbott & van Vreeswijk, 1993; Gerstner, 1999; Knight, Manin, & Sirovich, 1996; Knight, Omurtag, & Sirovich, 2000; Omurtag, Knight, & Sirovich, 2000; Knight, 2000; Sirovich, Omurtag, & Knight, 2000; Casti et al., 2002; Sirovich, 2003; Gerstner, 1995, 1999; Brunel & Hakim, 1999; Brunel, 2000; Amit & Brunel, 1997; Mattia & Giudice, 2002; Nykamp & Tranchina, 2000; Fourcaud & Brunel, 2002; Brunel, Chance, Fourcaud, & Abbott, 2001; Brown, Moehlis, & Holmes, 2004; Huertas & Smith, 2006; Doiron, Rinzel, & Reyes, 2006; Moreno-Bote & Parga, 2004, 2005). The PDM has provided great analytical tools, but it can also be used as a time-saving computational tool (Knight et al., 1996; Omurtag et al., 2000; Sirovich et al., 2000; Knight, 2000; Nykamp & Tranchina, 2000, 2001; Huertas & Smith, 2006). Population density methods provide an alternative to classical Monte Carlo simulations of activity in large-scale neural networks, in which one follows the individual states of thousands of neurons, including millions of synapses. While the time to simulate an arbitrary network by conventional methods grows quadratically with the size of the network (Hansel, Mato, Meunier, & Neltner, 1998), the simulation time with population density methods grows only with the number of different populations. Recently, there has been progress in speeding up the computation times of large networks of integrate-and-fire neurons (Rangan & Cai, 2007; Rangan et al., 2005). The PDM is a statistical mechanical method in which one groups similar neurons together into a population and then tracks the distribution of neurons over state space. This distribution is described by the population, or probability, density function, whose dimension depends on the number of state variables in the underlying single-neuron model. The population
2034
C. Ly and D. Tranchina
firing rate is given by the flux of probability across a surface in phase space. In the case of an integrate-and-fire (I&F) neuron with instantaneous synaptic kinetics, there is only one state variable, the membrane voltage V, and the population firing rate is given by the flux of probability across threshold voltage, vth . In the case of an integrate-and-fire-or-burst neuron with instantaneous synaptic kinetics, there are two state variables: membrane voltage, V, and the calcium current inactivation variable, h (Casti et al., 2002; Huertas & Smith, 2006). For an I&F neuron receiving excitatory synaptic input only, in which the unitary postsynaptic conductance event has a single exponential time course (first-order kinetics), there are two state variables: membrane voltage, V, and excitatory synaptic conductance, G e (Apfaltrer, Ly, & Tranchina, 2006); the population firing rate is given by the integral, over all synaptic conductance values, of the flux of probability per unit conductance across threshold voltage vth . Brunel et al. (2001) examined the dynamical input-output properties of a similar 2D system in which I&F neurons receive correlated current noise injections to mimic background synaptic input with first-order kinetics. If inhibitory synaptic input with first-order kinetics is added, there are three state variables: membrane voltage V, excitatory synaptic conductance G e , and inhibitory conductance G i . Because synaptic kinetics play an important role in network dynamics (Treves, 1993; Abbott & van Vreeswijk, 1993; Brunel & Hakim, 1999; Nykamp & Tranchina, 2001; Haskell, Nykamp, & Tranchina, 2001; Wang, 1999; Krukowski & Miller, 2001; Whittington, Traub, Kopell, Ermentrout, & Buhl, 2000; Silberberg, Wu, & Markram, 2004; Fourcaud & Brunel, 2002), it is important to model accurately the time course of unitary postsynaptic conductance events. Consequently, it seems that three state variables would be a minimum for realistic simulations of neural network activity using PDMs. A problem with population density methods is that as the dimension of the state space increases to make the underlying neuron model more realistic, so does the computation time for numerical solution of these equations. At some point, population density methods lose their edge in efficiency over direct Monte Carlo simulations. In the case of two state variables (the largest number explored so far), PDMs are still substantially faster than Monte Carlo simulations (Casti et al., 2002; Huertas & Smith, 2006). Apfaltrer et al. (2006) found that for an error tolerance of a few percent in estimates of the true population firing rate, numerically solving a set of 2D population density evolution equations was about 30 times faster than the corresponding Monte Carlo simulations. It appears that adding one more state variable would make the computation times comparable for population density and Monte Carlo methods. Good dimension-reduction methods might make population density methods practical in state spaces with three or more dimensions and could speed up computations for state spaces of lower dimension.
Moment Closure in Population Density Theory
2035
A number of dimension-reduction methods have been suggested, and some of these have been implemented (see in section 9 the discussion of Apfaltrer et al., 2006). A common goal of these methods is a low-dimensional kinetic theory that approximates a truly high-dimensional system. WilsonCowan type models (Wilson & Cowan, 1972, 1973), in which population firing rates are governed by a small system of ordinary differential equations (ODEs), sit at the low end of spectrum of dimension number in these methods. Stunning progress in the development of low-dimensional rate models is discussed in Dyan and Abbott (2001). Our focus is on the application of dimension reduction by a moment closure method in a population density approach to neural network modeling. Moment closure methods have a rich history in statistical physics (Dreyer, Herrmann, & Kunik, 2004; Junk & Unterreiter, 2002; Dreyer, Junk, & Kunik, 2001; Bardos, Galose, & Levermore, 1993, 1991). In this article, we present a detailed theoretical and computational study of a particular moment closure method (Cai, Tao, Shelley, & McLaughlin, 2004; Cai, Tao, Rangan, & McLaughlin, 2006), as applied to I&F neurons that receive excitatory synaptic input only. In the special case where the unitary postsynaptic conductance event has a single-exponential time course (firstorder kinetics), the evolution equation for the population density function is a partial differential integral equation in a two-dimensional state space with state variables, voltage, and excitatory conductance. In the moment closure method we consider, one approximates the kth conditional moment of excitatory conductance given voltage by a function of the k − 1 lower conditional moments and the first k unconditioned conductance moments. The particular function we consider is obtained by an ansatz: the kth centered moment of excitatory conductance given voltage is equal to the corresponding unconditioned moment. We shall refer to this as the centered moment closure method for short. This ansatz is a simple generalization of the moment closure that gives rise to the mean-field firing-rate model discussed below and in appendix A. Recently Rangan and Cai (2006) have shown that this closure for k = 2 can be derived, for a closely related steady-state problem, at high synaptic input rates, by a maximum-entropy method. The result is a system of k coupled partial differential equations with one state variable, membrane voltage, and k coupled ordinary differential equations for the unconditioned moments of synaptic conductance. We prove by phase-plane analysis that solutions to the centered moment closure problem at k = 2 do not exist over a wide range of physiologically relevant parameters. Moment closure at k = 3 has similar problems. Our findings and conclusions conflict with those of Cai et al. (2004, 2006) (see section 9.1 for resolution). The purpose of this article is to demonstrate, by computations and analysis, the successes of the centered moment closure method and
2036
C. Ly and D. Tranchina
Table 1: Relevant Functions and Variables. Dynamical Problem
Steady-State Problem
V, v
V, v
G, g
G, g
t ρ(v, g, t)
Not defined ρ(v, g)
(0)
(0)
f V (v, t)
f V (v)
J V (v, g, t)
J V (v, g)
∞
r (t) = 0 J V (vth , g, t) dg f G|V (v, g, t)
( j)
∞ r∞ = 0 J V (vth , g) dg f G|V (v, g)
( j)
f V (v, t)
f V (v)
( j) J V (v, t)
( j) J V (v)
µG j |V (v, t)
µG j |V (v)
2 (v, t) σG|V
2 (v) σG|V
µG j (t)
µG j (constant)
σG2 (t) Not defined
σG2 (constant) h(v) ≡ µG|V (v) v − εr h min (v) ≡ εe − v
Not defined Not defined
q (v) ≡ h(v) − h min (v)
Not defined
w(v) ≡ µG 2 |V (v)
Definition Random and specific membrane voltage (mV) Random and specific conductance (normalized) Time (s) Probability density of neurons at state (v, g) (0) Marginal voltage density: f V = ∞ ρ dg 0 Voltage flux density; J V = − τ1m [(v − εr ) + g(v − εe )]ρ Firing rate (s−1 ) Conditional conductance density (given V): (0) f G|V = ρ/ f V ( j) ∞ f V = 0 g j ρ dg ∞ j ( j) J V = 0 g J V dg Conditional jth moment of G (given V): ( j) (0) µG j |V = f V / f V Conditional variance of G (given V): 2 σG|V = µG 2 |V − µ2G|V The ∞ jthj moment of G: µG j = 0 g ζ dg Variance of G: σG2 = µG 2 − µ2G Conditional mean of G (given V) Minimal conditional mean of G for r∞ > 0 Conditional mean minus minimal h(v) Conditional second moment of G (given V)
the restricted parameter range where this method works; to promote progress in extending this range; and to encourage future development of other moment closure methods or entirely different dimension-reduction methods. For facilitating the reading of this article, we refer readers to Table 1 for a list of relevant functions and variables, Table 2 for a list of biophysical parameters, and Table 3 for dimensionless variables, functions, and parameters used throughout this article.
Moment Closure in Population Density Theory
2037
Table 2: Parameters. Name
Units
εr vth εe τm νe (t) τe µA
mV mV mV s s−1 s s
Definition Resting potential Threshold for action potential Excitatory reversal potential Membrane time constant (leak) Input rate of synaptic events Time constant of unitary postsynaptic conductance event Mean area under unitary postsynaptic conductance event
Note: The mean excitatory postsynaptic potential amplitude at peak time on receiving a unitary event (µEPSP ) is determined by all parameters except νe .
Table 3: Dimensionless Functions, Variables, and Parameters. Name
Ú Õ δ µG γ α
Definition v − εr vth − εr Õ(Ú) = q (v) τe τm νe µ A σA 2 µA 1+ 2τm µA εe − εr vth − εr
Typical Value or Range 0≤Ú≤1 Same as q (v) 0.25 0.1–0.3 0.006 6.5 or 4.7
2 Integrate-and-Fire neuron with Excitation Only The neuron’s behavior is governed by two random differential equations (RDEs).1 The first is an evolution equation for membrane voltage (potential), V, that is based on current balance: c
dV ˆ e (t)(V − εe ) = 0, + gr (V − εr ) + G dt
ˆ e is the random excitatory where gr is the resting membrane conductance, G postsynaptic conductance, εr is the resting membrane voltage, εe is the reversal potential for excitatory postsynaptic current, and c is the membrane capacitance. This equation can be rewritten as dV 1 (V − εr ) + G e (t)(V − εe ) , =− dt τm
(2.1)
1 For RDEs throughout this article, uppercase roman letters are used to indicate random quantities.
2038
C. Ly and D. Tranchina
ˆ e (t)/gr is the where τm = c/gr , the membrane time constant, and G e (t) = G dimensionless excitatory postsynaptic conductance, measured relative to the resting membrane conductance. The second equation is an evolution equation for G e (t), Ak dG e 1 δ(t − Tk ), = − Ge + dt τe τe k
(2.2)
where Tk is the arrival time of the kth unitary postsynaptic conductance event, Ak is the area under the conductance time course of the kth event, Ak /τe is the instantaneous conductance jump on arrival of the kth event and τe is the time constant for the decay of the exponential unitary postsynaptic conductance event. The arrival times Tk of all synaptic events are assumed to be random variables governed by a modulated Poisson process with mean rate νe (t). The random variables Ak are assumed to be independent and identically distributed random variables (IIDRVs) from a given distribution. When a neuron’s voltage reaches threshold (vth ), the neuron is said to fire an action potential. For simplicity in exposition, we do not include a refractory period, so that at the moment of action potential firing the voltage is instantaneously reset to vreset . Proper treatment of a refractory period is covered in appendix F. 3 Population Density Approach Throughout this article, we use the following common conventions in probability theory (see also Tables 1 and 2): uppercase and lowercase letters (e. g., V and v), are used to denote a random variable and a particular value, respectively; ρ and ψ are used to denote 2D probability density functions; f denotes a marginal or a conditional probability density function; F and F˜ denote cumulative and complementary cumulative distribution functions, respectively; the symbol µ denotes a mean (average, or expected value); random variable subscripts on density and distribution functions are used when needed to denote the particular random variable referred to, such as f V (v, t); random variables and powers of random variables are used as subscripts on µ to refer to the particular type of mean; and conditional densities and means are denoted by subscripts containing the | symbol (e. g., µG|V (v, t) denotes the mean of the random variable G at time t, given that the random variable V is equal to the particular value v). In the population density method, neurons with similar biophysical properties are grouped together, and the density function ρ(v, g, t) describes the distribution of those neurons over state space. Integrating the density over a region in state space gives the probability that a neuron randomly
Moment Closure in Population Density Theory
2039
chosen from the population will be in that region of state space: ρ(v, g, t) dg dv. Pr (V(t), G e (t)) ∈ =
(3.1)
Let us drop the subscript e on the G variable because we have only excitatory conductance. We assume the populations are sparsely connected, so that the arrival times of synaptic events, conditioned on the overall rate for each population, can be approximated as independent for all neurons in a population (Nykamp & Tranchina, 2000). The evolution equation of the density function ρ(v, g, t) as derived in Apfaltrer et al. (2006) is ∂ρ + ∇ · J (v, g, t) = 0, ∂t
(3.2)
where (3.3) J (v, g, t) = (J V , J G ),
1 J V (v, g, t) = − (v − εr ) + g · (v − εe ) ρ(v, g, t), (3.4) τm g gρ(v, g, t) + νe (t) F˜ A τe (g − g ) ρ(v, g , t) dg . (3.5) J G (v, g, t) = − τe 0 F˜ A is the complementary cumulative distribution function, so Pr(A > x) = F˜ A(x). A drift-diffusion approximation of the integral component of equation 3.5, which gives a perhaps more familiar Fokker-Planck equation, is discussed in section 6.1. The results reported here are not changed by this approximation. Each component of the probability flux vector J (v, g, t) is a signed quantity. By convention, positive or negative flux J V is probability per unit time per unit conductance of crossing v from below or above, and similarly for J G . The boundary conditions, with vreset = εr (for simplicity), are J V (vth , g, t) = J V (εr , g, t),
for g > gth
J V (vth , g, t) = J V (εr , g, t) = 0,
for g ≤ gth
(3.6)
and J G (v, 0, t) = lim J G (v, g, t) = 0, g→∞
(3.7)
−εr is the minimum value of g that gives a positive voltage where gth = vεeth−v th component of the probability flux at v = vth , the threshold voltage (i.e, a
2040
C. Ly and D. Tranchina
nonzero firing rate per unit conductance). The population firing rate is given by the total flux of probability across vth : r (t) =
∞
J V (vth , g , t) dg .
(3.8)
0
One inspiration for dimension reduction by moment closure comes from the fact that the firing rate, equation 3.8, can be written in terms of the marginal density,
(0)
f V (v, t) =
∞
ρ(v, g , t) dg ,
(3.9)
0
and µG|V (v, t), the mean of G conditioned on V. If we define f G|V (v, g, t) as (0) the conditional conductance density, then ρ(v, g, t) = f G|V (v, g, t) f V (v, t), and it can be seen that
∞
0
(0)
gρ(v, g, t) dg = f V (v, t) µG|V (v, t).
(3.10)
More generally, the kth conditional conductance moment µG k |V (v, t) is defined by
∞
0
(0)
g k ρ(v, g, t) dg = f V (v, t) µG k |V (v, t).
(3.11)
It is convenient to define the right-hand side of equation 3.11 as (k)
(0)
f V (v, t) ≡ f V (v, t) µG k |V (v, t).
(3.12)
Substituting J V (vth , g, t) = −
1
(vth − εr ) + g · (vth − εe ) ρ(vth , g, t), τm
(3.13)
into the right-hand side of equation 3.8 and performing the integration, with the definitions above, yields r (t) = −
1 (0) (vth − εr ) + µG|V (vth , t) · (vth − εe ) f V (vth , t). τm (0)
(3.14)
The firing rate depends on only two functions, f V (v, t) and µG|V (v, t), evaluated at v = vth , and both depend on only one state variable v and
Moment Closure in Population Density Theory
2041
time t. The idea of dimension reduction by moment closure follows from the derivation of partial differential equations for these functions. 4 Dimension Reduction by Moment Closure at the kth Moment Moment closure techniques are well established in other areas of physics (Dreyer et al., 2001, 2004; Junk & Unterreiter, 2002), but it appears that methods going beyond mean-field versions have not been applied in computational neuroscience until recently (Cai et al., 2004, 2006; Rangan & Cai, 2006). We can integrate equation 3.2 over g to get a partial differential equation of lower dimension for the evolution of the marginal voltage density, (0) f V (v, t): (0) ∂ f V (v, t) ∂
1 (0) =− − (v − εr ) + µG|V (v, t)(v − εe ) f V (v, t) , ∂t ∂v τm
(4.1)
or equivalently, (0) ∂ f V (v, t) 1 ∂
(0) (1) − (v − εr ) f V (v, t) + (v − εe ) f V (v, t) . =− ∂t ∂v τm
(4.2)
(0)
The evolution equation for f V (v, t) depends on the unknown function (1) (1) f V (v, t). To derive an equation for f V , we can multiply equation 3.2 by g (2) and integrate in g again. Then we find that the function f V (v, t) appears (1) on the right-hand side of the evolution equation for f V (v, t). Continuing in this vein, one can derive a system of k-coupled partial differential equations (0) (k−1) (k) (0) for f V through f V , in which the function f V (v, t) ≡ f V (v, t) µG k |V (v, t) (k−1) appears on the right-hand side of the evolution equations for f V (v, t). Thus, the system of equations is not closed, because there is one more function than equations. If the system were closed, these k equations could be computationally efficient to solve because of the low dimension of each equation. The system of equations (simplified by using Fubini’s theorem and various algebraic manipulations) is summarized below: (i) ∂ f V (v, t) 1 ∂ (i) (i+1) (v − εr ) f V (v, t) + (v − εe ) f V (v, t) =− − ∂t ∂v τm µ Aj f V (v, t) i! i (i) f V (v, t) + νe (t) τe (i − j)! j! (τe ) j i
−
j=1
(i)
for 0 ≤ i ≤ k − 1, where f V (v, t) =
∞ 0
g i ρ(v, g, t) dg.
(i− j)
(4.3)
2042
C. Ly and D. Tranchina
The original boundary condition, equation 3.6, can be multiplied by g i and integrated over g to give (i)
(i)
J V (vth , t) = J V (εr , t),
(4.4)
where (i) J V (v, t) =
∞ 0
=−
g i J V (v, g, t) dg
1 (i) (i+1) (v − εr ) f V (v, t) + (v − εe ) f V (v, t) . τm
(4.5)
In the closure we study, the ansatz is that the kth centered conditional moment of G is independent of V (Cai et al., 2004, 2006, for k = 2): E
k k G(t) − µG|V (v, t) V(t) = v = E G(t) − µG (t)
for k ≥ 2, (4.6)
and µG|V (v, t) = µG (t)
in the special case k = 1.
(4.7)
The assumed relationship, equation 4.6, is a logical generalization of the k = 1 case, which underlies the mean-field firing-rate model discussed below. This ansatz, equation 4.6, can also be thought of as a particularly simple special case of a slightly more general one: that the kth conditional moment of excitatory conductance G, µG k |V can be expressed in terms of the first k unconditioned moments of G and the k − 1 lower conditional moments of G. Intuitive reasoning suggesting that the moments of the random synaptic conductance variable G might depend only weakly on voltage V at high enough synaptic input rates goes as follows. At high synaptic input rates, the distribution of conductance G is tight in the sense that the coefficient of variation (σG /µG ) is inversely proportional to the square root of the synaptic input rate. In this regime, the vast majority of neurons in a population will have large conductance values and will be firing at a high rate. The resetting of voltage after firing will tend to decorrelate conductance G and voltage V. However, at low synaptic input rates, where neurons seldom fire, we expect G and V to be more tightly coupled. A neuron with voltage near threshold is likely to be there because it has a conductance at the high end of the distribution. At intermediate population firing rates, a neuron with voltage near reset is probably there because it has just fired (with high conductance).
Moment Closure in Population Density Theory
2043
At very low synaptic input rates (output firing rates), a neuron with voltage near reset (rest) voltage is likely to be there because its synaptic conductance is very low. We do not know of an intuitive argument suggesting that the higher moments of G would depend more weakly on V. The notion that making the closure approximation at higher values of k might be better could be based on the following ideas: higher moments, at least at low synaptic input rates, might be smaller in magnitude; consequently, the centered moment approximation might be more innocuous at higher k; lower moments that are larger in magnitude might be approximated better when the closure is made at higher k. For example, the first three centered unconditioned moments of the conductance random variable, G, are all proportional to the synaptic input rate, νe ; the expressions and numerical values with typical parameters (see below) are as follows: µG /νe = µ A = 2.4 × 10−4 ; σG2 /νe = µ A2 /(2τe ) = 7.2 × 10−6 ; and κG3 /νe = µ A3 /(3τe2 ) = 3.1 × 10−7 . A cautionary note is that although higher moments get progressively smaller at low enough synaptic input rates, higher even moments become progressively larger at high synaptic input rates. We emphasize that although this moment closure method is certainly reasonable, it does not have a formal theoretical foundation. It is quite different from the Chapman-Enskog moment closure procedure that has been used with great success in gas and fluid dynamics (Chapman & Cowling, 1970). The moment identity that follows from the closure assumption above is µG k |V (v, t) = E
k k G(t) − µG (t) − i=1
= µG k (t) +
k i=1
k! µi (v, t) µG k−i |V (v, t) (k − i)!i! G|V
k! µG k−i (t) µiG (t) − µG k−i |V (v, t) µiG|V (v, t) . (k − i)!i! (4.8)
(k)
(0)
Because f V (v, t) = f V (v, t)µG k |V (v, t) implies µG i |V (v, t) = (k)
(0)
(1)
(i)
f V (v,t) (0) f V (v,t)
(2)
, equa(k−1)
tion 4.8 allows us to express f V (v, t) in terms of f V , f V , f V , . . . . , f V and k unconditioned moments of G. Therefore, the system of evolution (0) (k−1) equations for f V through f V , equation 4.3, must be supplemented by a system of k ordinary differential equations for unconditioned moments of G, µG i (t) for 1 ≤ i ≤ k. 4.1 Evolution Equations for Unconditioned Moments of G. The evolution equations for the unconditioned moments of G can be derived
2044
C. Ly and D. Tranchina
from the evolution equation for the marginal density function of G, v ζ (g, t) ≡ εr th ρ(v, g, t) dv:
g ∂ζ (g, t) ∂ 1 F˜ A τe (g − g ) ζ (g , t) dg . =− − gζ (g, t) + νe (t) ∂t ∂g τe 0 (4.9) Because µG i (t) = g i ζ (g, t) dg, we can derive the ODEs that govern the evolution of µG i (t) by multiplying equation 4.9 by g i and integrating in g. The result is i dµG i (t) µ Aj µG i− j (t) i! 1 iµG i (t) − νe (t) =− dt τe (i − j)! j! (τe ) j−1
(4.10)
j=1
for 1 ≤ i ≤ k, where µG 0 (t) = 1. 5 Moment Closure at k = 1 5.1 Dynamical Equations for Moment Closure at k = 1. In the moment closure method for k = 1, the assumption is that µG|V (v, t) = µG (t). The dynamical equations are
(0) ∂ f V (v, t) ∂ 1 (0) (v − εr ) + µG (t) (v − εe ) f V (v, t) , =− − ∂t ∂v τm
(5.1)
dµG (t) 1 µG (t) − νe (t)µ A . =− dt τe
(5.2)
5.2 Mean-Field, Quasi-Steady-State, Rate Model from Moment Closure at k = 1. Equation 5.1 corresponds to the evolution equation for the probability density function for a population of I&F neurons receiving deterministic, smoothly varying synaptic conductance whose time course is described by equation 5.2. Random, asynchronous behavior of the population could stem only from random initial conditions. We do not present numerical results from solution of equation 5.1 because of the well-known problem that I&F neurons can phase-lock with deterministic synaptic input (Knight, 1972). It is easy to see that equations 5.1 and 5.2 would give behavior unlike that in the original 2D problem following a period of very small synaptic input, during which the membrane voltage for each neuron would approach εr . Afterward, any synaptic conductance time course that would cause firing would do so with near-perfect population synchrony.
Moment Closure in Population Density Theory
2045
Nevertheless, the steady-state solution for closure at k = 1 is of interest for two reasons: a solution exists for all synaptic input rates, unlike the problem for k = 2 as shown below; and the steady-state solution is the basis for a particular firing-rate model, often referred to as a mean-field (firingrate) model. In this model, the dynamical equation for µG (t), equation 5.2, is solved, and the firing rate is obtained from the steady-state solution of equation 5.1 with the µG set equal to µG (t). This method can be regarded as an approximate solution to the system 5.1 and 5.2 that is good when the synaptic conductance changes on a timescale that is long compared to the interspike interval. Details can be found in appendix A. 6 Moment Closure at k = 2 6.1 Dynamical Equations for Moment Closure at k = 2. For moment closure at k = 2, we assume the second centered conditional moment of synaptic conductance G is independent of voltage V: E
2 2 G(t) − µG|V (v, t) V = v = E G(t) − µG (t) .
This implies µG 2 |V (v, t) − µ2G|V (v, t) = σG2 (t). Hence, (2) (0) (0) f V (v, t) = f V (v, t)µG 2 |V (v, t) = f V (v, t) σG2 (t) + µ2G|V (v, t) (0) = σG2 (t) f V (v, t)
+
2 (1) f V (v, t) (0)
f V (v, t)
.
(6.1)
Summary of equations: (0) ∂ fV ∂ 1 (0) (1) (v − εr ) f V + (v − εe ) f V =− − ∂t ∂v τm
(1)
∂ fV ∂ =− ∂t ∂v
(6.2)
(1) 2 f 1 (1) (0) 2 − (v − εr ) f V + (v − εe ) σG (t) f V + V(0) τm fV 1 (1) (0) − f V − νe (t)µ A f V τe
(6.3)
2046
C. Ly and D. Tranchina
where σG2 (t) = µG 2 (t) − µ2G (t), and the evolution equations for µG (t) and µG 2 (t) are: dµG 1 µG − νe (t)µ A . =− dt τe
(6.4)
dµG 2 µ A2 2 µG 2 − νe (t) + µ A µG . =− dt τe 2τe
(6.5)
The boundary conditions are (0)
(0)
(6.6)
(1)
(1)
(6.7)
(0)
1 (0) (1) (v − εr ) f V + (v − εe ) f V , τm
J V (εr , t) = J V (vth , t) J V (εr , t) = J V (vth , t) where J V (v, t) = −
(6.8)
and (1) J V (v, t)
(1) 2 f 1 (1) (0) =− (v − εr ) f V + (v − εe ) σG2 (t) f V + V(0) . τm fV
(6.9)
For completeness and to relate our work to previous work on this subject (Cai et al., 2004, 2006; Rangan & Cai, 2006), we mention that if one makes a truncated Taylor series approximation, ρ(v, g , t) ≈ ρ(v, g, t) +
∂ρ(v, g, t) g −g , ∂g
in the integral term of J G (v, g, t) in the original equations 3.2 to 3.5, equation 3.5 becomes J G (v, g, t) = −
gρ(v, g, t) + νe (t) τe
+
∂ρ(v, g, t) ∂g
g − g
g
F˜ A τe (g − g ) ρ(v, g, t)
0
dg
g ∂ρ(v, g, t) , (6.10) = − + νe (t)c 1 (g) ρ(v, g, t) + νe (t)c 2 (g) τe ∂g
Moment Closure in Population Density Theory
2047
g g where c 1 (g) ≡ 0 F˜ A(τe (g − g )) dg and c 2 (g) ≡ 0 (g − g) F˜ A(τe (g − g )) dg . Using a change of variables and integration by parts, we obtain the expressions ∞ 1 F˜ A(x) d x µA − τe τe g µ A2 1 1 ∞ − + x F˜ A(x) d x . c 2 (g) = τe 2τe τe τe g c 1 (g) =
In the special case where A is deterministic and equal to a (Cai et al., 2004, 2006; Rangan & Cai, 2006), F˜ A(x) = H(a − x), where H(y) is the unit (Heaviside) step function. The result is a , if g > a ; 1 τe c 1 (g) = a − H(a − τe g) · (a − τe g) = τe g, if 0 ≤ g ≤ τe
a τe
.
a 2 2 − 2τ 2 , if g > τae ; a2 1 a g2 e − = + H(a − τe g) · − c 2 (g) = 2 τe 2τe 2τe τe − g2 , if 0 ≤ g ≤
a τe
Cai et al. (2004, 2006) and Rangan and Cai (2006) make a further approximation by assuming c 1 (g) and c 2 (g) are independent of g; they simply set 2 c 1 (g) = τae and c 2 (g) = − 2(τa e )2 to get J G (v, g, t) =
νe (t)a − g νe (t)a 2 ∂ρ(v, g, t) . ρ(v, g, t) − τe 2τe2 ∂g
(6.11)
Equation 6.11 is the diffusion approximation used in Cai et al. (2004, 2006) and Rangan and Cai (2006) in the absence of feedback. The moment closure method at k = 2 applied to this drift-diffusion approximation, equation 6.11, will yield the same equations above: 6.2 to 6.9. 6.1.1 Numerical Results for the Dynamical Problem from Moment Closure at k = 2. For high synaptic input rates νe (high mean excitatory postsynaptic conductance µG ), moment closure at the second moment performs well, and it provides improved performance over the mean-field firing-rate model (see appendix A) above. Figure 1B shows a comparison of firing rates in response to sinusoidal modulation of the synaptic input rate (see Figure 1A), computed by three different methods: solution of the full 2D problem (thin solid black line; Apfaltrer et al., 2006), moment closure at k = 2 (thick dashed line), and mean-field firing-rate model (thin gray line). Results from moment closure
.
2048
C. Ly and D. Tranchina
A
B 180 µG|V=µG 160
2500
Firing Rates (s )
Synaptic Input Rate (s )
3000
2000
Full 2 D 2 2 σG|V=σG
140 120 100
1500 80
1000 0
0.01
0.02 0.03 Time(s)
0.04
0.05
60 0
0.01
0.02 0.03 Time (s)
0.04
0.05
Figure 1: Numerical solution of dynamical system from moment closure at k = 2, equations 6.2–6.9. (A) The input to an uncoupled population of neurons is sinusoidal modulation of the rate of unitary postsynaptic conductance events, νe (t) = 2000 [1 − c sin(2π f t)] s−1 , with c = 0.379, f = 20 s−1 . (B) Comparison of firing-rate responses to the input in A, computed by three different methods: “exact” results from numerical solution of the full 2D problem (thin solid black line; Apfaltrer et al., 2006), moment closure at k = 2 (thick dashed line), and mean-field firing-rate model (thin gray line). Results from moment closure at k = 2 match closely the exact results. Parameters: τm = 20 ms; τe = 5 ms; µ A = 2.44 × 10−4 s, giving a mean unitary excitatory postsynaptic potential amplitude of 0.5 mV at peak time (µEPSP ); εr = −65 mV; εe = 0 mV; vth = −55 mV.
at k = 2 deviate only slightly from the exact results and provide a clearly better match than the mean-field firing-rate model, particularly at the lower firing rates. For this and many other figures and analyses in this article, we used the same set of parameters as Apfaltrer et al. (2006), which are typical in the literature (Dyan & Abbott, 2001): membrane time constant, τm = 20 ms; time constant for the decay of the unitary postsynaptic conductance event, τe = 5 ms (Edmonds, Bigg, & Colquhoun, 1995); µ A = 2.44 × 10−4 s, giving a mean excitatory postsynaptic potential amplitude at peak time (µEPSP ) of 0.5 mV (Shadlen & Newsome, 1998); εr = −65 mV; vth = −55 mV; εe = 0 mV. (see appendix H for further discussion of µ A and µEPSP ). We encountered problems in numerical solutions of the dynamical system of equations from moment closure method at k = 2, equations 6.2 to 6.9, with the parameters above, when the synaptic input rate drops below roughly 1240 s−1 (depending on the dynamic structure of the synaptic input rate), where the firing rate drops precipitously to zero. This synaptic input rate gives a steady-state firing rate of roughly 60 s−1 for the solution of the full 2D problem (Apfaltrer et al., 2006). At the point in time when the zero firing rate was achieved during our numerical integration, the marginal (0) probability density function f V had developed a sharp peak near v = vth . (0) When the numerical integration was allowed to proceed, f V underwent
Moment Closure in Population Density Theory
2049
wild oscillations in v and in t on the way to “blowing up” shortly thereafter. It is worth noting that a zero firing rate cannot be obtained as a steady-state solution to the moment closure problem (see appendix B). Because of the obvious instabilities surrounding the zero-firing-rate time point, we do not interpret the numerical solution during this period should as a real solution. We obtained similar negative results for moment closure at k = 2 in numerical solution of the dynamical system, equations 6.2 to 6.5, with numerous numerical methods, including implicit, explicit, or mixture schemes, with various orders of accuracy, fixed or adaptive time steps, and various ways of approximating the partial derivatives. Neither refinement of the voltage grid nor taking smaller time steps overcame the problem of the precipitous drop of the firing rate to zero when νe (t) falls below approximately 1240 s−1 , where the firing rate is in the neighborhood of 60 s−1 . The (1) alternative of solving a ∂t∂ µG|V equation, rather than the ∂t∂ f V , gave the same behavior. Adding a refractory period (see appendix F) and modifying the equations accordingly did not rectify the problem. This finding caused us to ask whether the system of equations for moment closure at k = 2 represents an ill-posed problem when the synaptic input rate is below some critical value. We address this question below by analyzing the steady-state problem, because it suffices to answer the question and because it gives a nicely tractable problem. 6.2 The Steady-State Problem for Moment Closure at k = 2: Proof by Phase-Plane Analysis That Solutions Sometimes Do Not Exist. We ana(0) lyze the steady-state problem using the two functions f V (v), and µG|V (v) (0) (1) rather than f V (v) and f V (v), because this choice facilitates analysis. In this (0) section, we reduce a coupled pair of ODEs for f V (v) and µG|V (v), with two boundary conditions, to a single ODE for µG|V (v) with a single boundary condition. Phase-plane analysis will provide a method for proving the nonexistence of solutions where the numerical results above suggest that solutions might not exist. In the absence of such proof, one could not know whether the results above suggesting the nonexistence of solutions are simply a reflection of inadequate numerical methods. This question is especially poignant in the light of the apparent success and advocation of this moment closure method in Cai et al. (2004, 2006). The technical reason that phase-plane analysis provides a useful and appropriate tool stems from the nonautonomous nature of the ODE for µG|V (v). In order to streamline the notation, let h(v) ≡ µG|V (v). We set equations (0) 6.2 to 6.5 equal to 0 to obtain two coupled ODEs for the functions f V (v) and h(v), and expressions for the constants µG and σG2 . The result is:
1 d (0) (v − εr ) + h(v) (v − εe ) f V (v) , − 0=− dv τm
(6.12)
2050
C. Ly and D. Tranchina
0=−
1 d (0) (v − εr ) h(v) + σG2 + h 2 (v − εe ) f V (v) − dv τm 1 (0) h(v) − µG f V (v), − τe
(6.13)
0=−
1 µG − νe µ A , τe
(6.14)
0=−
µ A2 2 + µ A µG . µG 2 − νe τe 2τe
(6.15)
The boundary conditions are (0)
(0)
(6.16)
(1)
(1)
(6.17)
J V (εr ) = J V (vth ); J V (εr ) = J V (vth ), where 1 (0) (v − εr ) + h(v) (v − εe ) f V (v), τm 1 (1) (0) (v − εr ) h(v) + σG2 + h 2 (v) (v − εe ) f V (v). J V (v) = − τm (0)
J V (v) = −
(6.18) (6.19)
Equations 6.14 and 6.15 imply that µG and σG2 are constants given by µG = νe µ A
(6.20)
and σG2 =
νe µ A2 . 2τe
(6.21)
6.2.1 Foundation of the proof of Nonexistence of Solutions. An important implication of equation 6.13 and the boundary condition, equation 6.17, is that
vth
εr
(0)
h(v) f V (v) dv = µG .
(6.22)
Moment Closure in Population Density Theory
2051
This fact can be seen by integrating equation 6.13 from v = εr to v = vth , (0) where f V (v) is a density function that integrates to 1. Equation 6.22 implies that h(v) = µG ,
for some v ∈ (εr , vth ),
(6.23)
if we assume that h(v) is continuous. Thus, equation 6.23 is a necessary condition for the existence of a steady-state solution. Phase-plane analysis allows us to use this fact below to demonstrate the nonexistence of solutions in a wide range of cases. 6.2.2 Steady-State Solutions with Nonzero Firing Rates Exist for Moment Closure at k = 2 at High Synaptic Input Rates. For the steady-state firing rate (0) r∞ > 0, we set the marginal voltage probability flux J v (v) equal to r∞ , thus satisfying the boundary condition 6.16. Then we use equation 6.12 to solve (0) for f V (v) in terms of h(v) and v: (0)
f V (v) =
τm r∞ . −(v − εr ) + (εe − v)h(v)
(6.24) (0)
Next, we substitute our expression for f V (v), equation 6.24, into equation 6.13 and simplify to obtain an ODE in terms of h(v) only: εe − εr σG2 + ττme µG − h v − εr + h(v − εe ) dh . = dv v − εr 2 2 2 (εe − v) σG − h − εe − v
(6.25)
It is important to note that the denominator of equation 6.25 is equal to zero, and consequently dh/dv is singular, along a curve in the v-h plane that spans the entire domain from v = vreset = εr to v = vth. This is the source of the nonexistence-of-solutions (ill-posedness) problem that we demonstrate below. The moment closure method we are studying creates a pathological differential equation for h(v) ≡ µ_G|V(v). The second boundary condition, equation 6.17, when written in terms of h only, becomes b₂ʰ(εr) = b₂ʰ(vth), where

$$b_2^h(v) = \frac{(v-\varepsilon_r)\,h(v) + (v-\varepsilon_e)\big(\sigma_G^2 + h^2(v)\big)}{(v-\varepsilon_r) + (v-\varepsilon_e)\,h(v)}.$$

In equation 6.24, f_V^(0)(v) ≥ 0 implies that h(v) > h_min(v) ≡ (v − εr)/(εe − v), the minimal h that will give a nonzero firing rate.
Further analysis by phase-plane techniques is facilitated by defining a new function, q(v), that gives a vector field and nullclines that are easier to interpret than those of h(v):

$$q(v) \equiv h(v) - h_{\min}(v) = h(v) - \frac{v-\varepsilon_r}{\varepsilon_e-v}. \tag{6.26}$$

Because we consider only r∞ > 0, we have q(v) > 0. The relation between q and f_V^(0) from equation 6.26 is

$$f_V^{(0)}(v) = \frac{\tau_m r_\infty}{q(v)\,(\varepsilon_e - v)}. \tag{6.27}$$
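Equation 6.27 also fixes the firing rate of any candidate solution by normalization: since f_V^(0) must integrate to 1, we have r∞ = 1/[τm ∫ dv/(q(v)(εe − v))]. A minimal sketch of this bookkeeping step (not from the original article; the grid v and the trajectory q are assumed to come from a solver such as the shooting method of appendix D):

```python
import numpy as np

def rate_from_q(v, q, eps_e=0.0, tau_m=20e-3):
    """Given a candidate trajectory q sampled on a voltage grid v (volts),
    normalize f_V^(0)(v) = tau_m * r_inf / (q(v) * (eps_e - v)) so that it
    integrates to 1, and return the implied steady-state rate r_inf."""
    integrand = 1.0 / (q * (eps_e - v))
    return 1.0 / (tau_m * np.trapz(integrand, v))
```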
Let us rewrite our ODE, equation 6.25, and the boundary condition for h(v) in terms of q(v):

$$\frac{dq}{dv} = -q\,\frac{\tau_m}{\tau_e}\;\frac{q\left(1 + \dfrac{\tau_e}{\tau_m}\,\dfrac{\varepsilon_e-\varepsilon_r}{\varepsilon_e-v}\right) - \big(\mu_G - h_{\min}(v)\big)}{(\varepsilon_e-v)\big(q^2 - \sigma_G^2\big)}, \tag{6.28}$$

with boundary condition

$$b_2^q(\varepsilon_r) = b_2^q(v_{th}), \tag{6.29}$$

where

$$b_2^q(v) \equiv \frac{q^2(v) + h_{\min}(v)\,q(v) + \sigma_G^2}{q(v)}. \tag{6.30}$$
Note that the denominator of equation 6.28 is equal to zero, making dq/dv singular, along the horizontal line q = σG in the v-q plane. This is a view from the q-variable perspective hinting at trouble ahead in finding solutions to the differential equation and boundary condition. For purposes of codifying our problem in the simplest form, we show in appendix C that the ODE for q(v), equation 6.28, and the boundary condition, equation 6.29, can be written in terms of a dimensionless voltage ṽ ≡ (v − εr)/(vth − εr), where 0 ≤ ṽ ≤ 1, and four dimensionless parameters, defined in Table 3.

Because the ODE for q(v), equation 6.28, is nonautonomous, insight can be provided by phase-plane analysis of a dynamical system that can be associated with it. We introduce an auxiliary variable s and examine the dynamical system (v(s), q(s)). This extension not only allows phase-plane analysis, but it also has the desirable feature that the s derivatives can never be singular (unlike dq/dv). For simplicity, let

$$\frac{dq}{ds} = -q\,\frac{\tau_m}{\tau_e}\;\frac{q\left(1 + \dfrac{\tau_e}{\tau_m}\,\dfrac{\varepsilon_e-\varepsilon_r}{\varepsilon_e-v}\right) - \big(\mu_G - h_{\min}(v)\big)}{\varepsilon_e-v} \equiv \beta(q, v); \tag{6.31}$$

$$\frac{dv}{ds} = q^2 - \sigma_G^2 \equiv \alpha(q, v). \tag{6.32}$$
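The sketch below (Python with SciPy; not part of the original article) traces one trajectory of the auxiliary system, equations 6.31 and 6.32, and stops either when it reaches vth or when it hits the v-nullcline q = σG, in which case it is not a candidate solution. The value of σG depends on the amplitude statistics and is an assumed, illustrative number here, as is the trial initial value q0.

```python
import numpy as np
from scipy.integrate import solve_ivp

tau_m, tau_e = 20e-3, 5e-3                 # s
eps_r, eps_e, v_th = -65e-3, 0.0, -55e-3   # V
mu_G, sigma_G = 0.3272, 0.09               # sigma_G assumed for illustration

def h_min(v):
    return (v - eps_r) / (eps_e - v)

def rhs(s, y):                             # equations 6.31 and 6.32
    q, v = y
    beta = -q * (tau_m / tau_e) * (
        q * (1.0 + (tau_e / tau_m) * (eps_e - eps_r) / (eps_e - v))
        - (mu_G - h_min(v))) / (eps_e - v)
    alpha = q**2 - sigma_G**2
    return [beta, alpha]

def hits_nullcline(s, y):                  # dv/ds = 0 here; not a candidate
    return y[0] - sigma_G
hits_nullcline.terminal = True

def reaches_vth(s, y):                     # candidate trajectory endpoint
    return y[1] - v_th
reaches_vth.terminal = True

q0 = 0.25                                  # trial initial value q(eps_r)
sol = solve_ivp(rhs, [0.0, 5.0], [q0, eps_r],
                events=[hits_nullcline, reaches_vth],
                rtol=1e-9, max_step=1e-2)
print("hit v-nullcline:", sol.t_events[0].size > 0)
print("reached v_th:   ", sol.t_events[1].size > 0)
```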
By dividing dq/ds by dv/ds, we recover our original equation for dq/dv, equation 6.28. Figure 2 shows the v-q phase plane for a high synaptic input rate, νe = 1341 s⁻¹ (giving µG = 0.3272), where a solution of equation 6.28 obeying boundary condition 6.29 exists. The v-nullcline, where dv/ds = 0 (and dq/dv is singular), is given by q(v) = σG, and it is plotted as a dot-dashed line. Note that dv/ds changes sign as a trajectory crosses the v-nullcline. The q-nullcline, where dq/ds = 0, is given by

$$q(v) = \frac{\mu_G - h_{\min}(v)}{1 + \dfrac{\tau_e}{\tau_m}\,\dfrac{\varepsilon_e-\varepsilon_r}{\varepsilon_e-v}}, \tag{6.33}$$

and it is plotted as a dashed line. The curve q = µG − h_min(v), which in terms of h(v) corresponds to h(v) ≡ µ_G|V(v) = µG, is plotted as a dotted line. Candidate solutions of equation 6.28 correspond to trajectories in the v-q phase plane that start at (εr, q(εr)) and evolve according to equations 6.31 and 6.32 with a consistent sign of dv/ds for all v ∈ (εr, vth), as required for q(v), and hence f_V^(0)(v), to be single-valued functions. Thus, any trajectory that hits the v-nullcline for v ∈ (εr, vth) is not a candidate solution. The dynamical system evolves with increasing s if dv/ds ≥ 0 at v = εr, or with decreasing s if dv/ds ≤ 0 at v = εr. A solution to our problem, if it exists, is a candidate trajectory that also satisfies the boundary condition, equation 6.29. The shaded gray bars at εr and vth are provided as a visual aid; they are gray-scale plots of the values of the boundary condition function b₂^q(v) in equation 6.29 evaluated at v = vreset = εr (left bar) and at v = vth (right bar). A trajectory that satisfies the boundary condition must start and terminate at the same gray-scale value. In Figure 2, the solution q trajectory is plotted as a solid black line. Note that the solution q(v) straddles the curve corresponding to µ_G|V(v) = µG (dotted line), so that h(v) ≡ µ_G|V(v) is sometimes greater and sometimes less than the unconditioned moment µG, and µ_G|V(v) = µG at some point along the trajectory, as required by the necessary condition for a solution to exist (see equation 6.23).
Figure 2: Steady-state moment closure at k = 2 (second moment) for a high synaptic input rate (νe = 1341 s⁻¹, giving µG = 0.3272), where a solution exists. Phase plane for a corresponding dynamical system (v(s), q(s)) governed by differential equations 6.31 and 6.32. The function q(v) ≡ h(v) − h_min(v) ≡ µ_G|V(v) − µ_G|V^min(v), where µ_G|V^min(v) is the minimal value of the conditional conductance at each v that corresponds to a positive marginal voltage probability flux (see the text). The direction of flow (dq, dv) is indicated by arrows. The v-nullcline (dot-dashed line), where dv/ds = 0 (dq/dv is singular), is given by q(v) = σG. Note that dv/ds changes sign as a trajectory crosses the v-nullcline. The q-nullcline (dashed curve) is given by equation 6.33 in the text. The curve µG − h_min(v) (dotted line) corresponds to h(v) = µG, that is, µ_G|V(v) = µG. A trajectory q(v) corresponding to a solution of the ODE, equation 6.28, that satisfies the boundary condition, equation 6.29, is plotted by a solid black line. Parameters: νe = 1341 s⁻¹; τm = 20 ms; τe = 5 ms; µA = 2.44 × 10⁻⁴ s; vreset = εr = −65 mV; εe = 0 mV; vth = −55 mV; giving µEPSP = 0.5 mV and µG = 0.3272.
We found the solution illustrated in Figure 2, and other solutions at high synaptic input rates where they exist, by numerical implementation of a shooting method (see appendix D). Figure 3 shows the marginal voltage density f_V^(0)(v) (see Figure 3A, solid gray line) corresponding to the solution q(v) plotted in Figure 2. This marginal density matches well the marginal density computed by
Figure 3: Comparison of results obtained by moment closure at k = 2 to those from the full 2D PDM at a high synaptic input rate (νe = 1341 s⁻¹, giving µG = 0.3272). (A) The marginal voltage density computed by solving the steady-state equations (solid line) for moment closure at the second moment (k = 2) is plotted along with the exact solution (full 2D PDM; dot-dashed). The firing rates are 66.95 s⁻¹ and 67.01 s⁻¹, respectively. (B) The corresponding mean of G given V, µ_G|V(v), computed by moment closure at the second moment (k = 2, solid line) and by the full PDM (dot-dashed line). Moment closure at the second moment is a good approximation in this case. (C) The conditional variance σ²_G|V(v) from the full 2D PDM (solid) is plotted with the unconditioned variance σG² (dot-dashed line). The approximation in the moment closure method at k = 2, that σ²_G|V(v) = σG², is fairly good for the synaptic input rate νe = 1341 s⁻¹ used in this example. Parameters are the same as in Figure 2.
numerical solution of the original full 2D problem (Apfaltrer et al., 2006), plotted by a dot-dashed line. Figure 3B shows a good match between the conditional conductance mean µ_G|V(v) computed by moment closure at k = 2 (solid gray) and that computed by solution of the full 2D problem (dot-dashed line). Figure 3C shows that σ²_G|V(v) (solid gray line) computed
from the full 2D problem varies by less than 20% from its maximum value; consequently, it is fairly well approximated by the unconditioned variance σG² (dot-dashed line) in this example. The firing rate computed by moment closure at k = 2 (corresponding to Figure 3A), 66.95 s⁻¹, closely matches that computed by solution of the full 2D problem (Apfaltrer et al., 2006), 67.01 s⁻¹. The firing rate given by the mean-field firing-rate model (discussed above and in appendix A) is 67.86 s⁻¹.

6.2.3 Check That the ODE Method for Moment Closure at k = 2 Solves the Same Problem as the PDE Method with Steady Synaptic Input. The derivation of the ODE for q(v) involved many exact algebraic manipulations of the original equations 6.2 to 6.9 (the pair of coupled PDEs for f_V^(0)(v, t) and f_V^(1)(v, t) and the ODEs for the first two unconditioned moments of G), in which there are a number of opportunities for inadvertent error. Furthermore, it is conceivable that the shooting method we used to solve the ODE for q(v) had some undetected flaw. Therefore, as an independent check that our ODE analysis and numerical methods are correct, we compared the steady-state results obtained by solving our ODE for q(v) with those obtained by solving the dynamical system of PDEs and ODEs, equations 6.2 to 6.9, in the special case of a steady synaptic input rate. Figure 4A compares the marginal voltage density function f_V^(0)(v) obtained by solving the system of PDEs, equations 6.2 to 6.9, and letting the system relax to a steady state, to that obtained by solving the ODE for q, equation 6.28, with boundary condition 6.29. Figure 4B compares the conditional means µ_G|V(v) obtained by these two methods. The solutions match in both cases.

Our conclusion from the results above and from similar numerical tests is that dimension reduction by moment closure at k = 2 (second moment), with the parameters used above, is a good method for high synaptic input rates, corresponding to large values of the unconditioned mean conductance µG, which give high firing rates.

6.2.4 Numerical Evidence That Steady-State Solutions Do Not Exist for Moment Closure at k = 2 at Low Synaptic Input Rates. Once we had found a solution at a high synaptic input rate, as described above, we found a solution at a nearby lower synaptic input rate, if it existed, by starting our shooting algorithm at the initial condition (εr, q(εr)) from the solution at the slightly higher synaptic input rate. In computations not shown, we found that the initial value q(εr) that gives a solution trajectory for equations 6.31 and 6.32 varies smoothly as the synaptic input rate νe is varied. By following this procedure, we were able to find solutions at synaptic input rates down to νe = 1241 s⁻¹ (µG = 0.303), below which the boundary condition for q(v) could not be satisfied. We also used an alternative method of finding a steady-state solution, in which we solved the dynamical problem, equations 6.2 to
Figure 4: Comparison of the steady-state solution from the dynamical PDE moment closure equations 6.2–6.5 to the solution of the ODE, equation 6.28, with boundary condition 6.29. This figure provides a check that the steady-state analysis in the q(v) function and the computations are correct. The solid lines are from the ODE in q; the dot-dashed lines are from the fully dynamical equations. (A) The two densities. (B) The two conditioned means. The match is perfect in both cases. Parameters are the same as in Figures 2 and 3.
6.5, for a time-varying synaptic input rate that ramps down very slowly (on the timescale of the membrane time constant, τm). This method yielded the same value of 1241 s⁻¹ as the lowest synaptic input rate where a solution could be found.

6.2.5 Phase-Plane Analysis Demonstrating That No (Nonzero Firing-Rate) Solutions Exist at Low Synaptic Input Rates. To illustrate that no solution exists for low synaptic input rates, with other parameters held fixed, let us consider a particular synaptic input rate of 1000 s⁻¹ (µG = 0.244). This synaptic input rate gives a mean-field firing rate (see appendix A) of 40.62 s⁻¹ and a firing rate of 39.22 s⁻¹ in the full 2D problem with our physiologically relevant parameters (τm = 20 ms, τe = 5 ms, µA = 2.44 × 10⁻⁴ s, giving µEPSP = 0.5 mV; vreset = εr = −65 mV, εe = 0 mV, vth = −55 mV). The phase plane corresponding to νe = 1000 s⁻¹ is plotted in Figure 5. Recall that in order for a solution to equation 6.28 to exist, the following three conditions must be satisfied: (1) the solution q(v) trajectory must intersect the line q(v) = µG − h_min(v), by equation 6.23; (2) the trajectory q(v) must be a function (i.e., it cannot cross the v-nullcline q(v) = σG, which would make q(v) multivalued); and (3) the q(v) trajectory must satisfy the boundary condition 6.29. For some parameter choices (e.g., those in Figure 5), the line q(v) = µG − h_min(v) and the attracting v-nullcline, q(v) = σG, are nearby. Consequently, trajectories that are close to the curve q(v) = µG − h_min(v) will intersect the v-nullcline at some point. In Figure 5, two trajectories are
Figure 5: The phase plane as defined in Figure 2, for a synaptic input rate νe = 1000 s⁻¹ (µG = 0.244), where a solution does not exist. The v-nullcline (dot-dashed line), q-nullcline (dashed line), and the line µG − h_min(v) (dotted line) are plotted. The intersection of the q- and v-nullclines is denoted by an open circle. Two trajectories are shown: trajectory 1 (gray) starts at q(εr) = σG⁻ and terminates at q(vth) < µG − h_min(vth); trajectory 2 (black) starts at (εr, q(εr) = q_2init > µG) and ends at (vth, q(vth) = σG⁺). Trajectory 1 lies entirely below and trajectory 2 entirely above the curve q(v) = µG − h_min(v). There are no candidate solution trajectories because every trajectory with σG < q(εr) < q_2init = 0.759 will intersect the v-nullcline, and all other trajectories will not intersect the line µG − h_min(v). A necessary condition that must be satisfied by a q(v) trajectory is that q(v) = µG − h_min(v) for some v ∈ (εr, vth) (see the text). Parameters other than νe are the same as in previous figures.
shown: trajectory 1 starts at the v-nullcline, at (εr, q(εr) = σG⁻), and terminates at (vth, q(vth) < µG − h_min(vth)); trajectory 2, obtained by integrating backward, starts at (εr, q(εr) = q_2init > µG), where q_2init = 0.759, and terminates at (vth, q(vth) = σG⁺). The notation σG± is used because dq/dv is singular at q = σG; in the dynamical system, dv/ds crosses 0 at q = σG. The convention is that the sign in σG± indicates whether a trajectory starting at (εr, σG) begins with increasing or decreasing q as v advances forward. Trajectory 1 lies entirely below and trajectory 2 lies entirely above
the curve q = µG − h_min(v). Consequently, neither trajectory satisfies the necessary condition 6.23 for a solution: h(v) = µG (i.e., µ_G|V(v) = µG) for some v ∈ (εr, vth) or, equivalently, q(v) = µG − h_min(v) for some v ∈ (εr, vth). It is clear from the fact that the differential equation for q has a unique solution (nonintersecting trajectories) that any initial q(εr) between σG⁺ and q_2init will give a trajectory that intersects the v-nullcline. Thus, no initial q(εr) value with σG < q(εr) < q_2init will give a trajectory that is a candidate solution to the differential equation for q. Moreover, any trajectory starting from q(εr) < σG will lie below trajectory 1, and therefore entirely below q(v) = µG − h_min(v). Any trajectory starting from q(εr) > q_2init will lie above trajectory 2, and therefore entirely above q(v) = µG − h_min(v). One distinguished remaining possibility for a solution is a trajectory that starts at σG⁺ < q(εr) < q_2init, passes through the critical point (vc, qc = σG) given by the intersection of the v- and q-nullclines, terminates with v = vth, and satisfies the boundary condition. We found that such solutions do exist for some other choices of parameters. However, as one might suspect from inspecting the flow lines in the phase plane (see Figure 5), such a trajectory does not exist in this case. We show this analytically by linearizing our dynamical system around the critical point. In a small enough neighborhood of (vc, qc), the linearization around (vc, qc) provides a good description of the behavior of trajectories. Equations for the critical point can be solved analytically:

$$(v_c,\, q_c) = \left(\frac{\dfrac{\tau_m}{\tau_e}\Big[(\mu_G-\sigma_G)\,\varepsilon_e + \varepsilon_r\Big] - \sigma_G\,(\varepsilon_e-\varepsilon_r)}{\dfrac{\tau_m}{\tau_e}\big(\mu_G + 1 - \sigma_G\big)},\; \sigma_G\right). \tag{6.34}$$
The linearized system is

$$\begin{pmatrix} \dfrac{dq}{ds} \\[4pt] \dfrac{dv}{ds} \end{pmatrix} = \begin{pmatrix} \dfrac{\partial\beta}{\partial q} & \dfrac{\partial\beta}{\partial v} \\[4pt] \dfrac{\partial\alpha}{\partial q} & \dfrac{\partial\alpha}{\partial v} \end{pmatrix}_{(v_c,\,q_c)} \begin{pmatrix} q \\ v \end{pmatrix} = \begin{pmatrix} -\sigma_G\,\dfrac{(\tau_m/\tau_e)(\varepsilon_e - v_c) + (\varepsilon_e-\varepsilon_r)}{(\varepsilon_e - v_c)^2} & \dfrac{\tau_m}{\tau_e}\,\dfrac{\sigma_G\,(\sigma_G - \mu_G - 1)}{(\varepsilon_e - v_c)^2} \\[4pt] 2\sigma_G & 0 \end{pmatrix} \begin{pmatrix} q \\ v \end{pmatrix}. \tag{6.35}$$
With the parameters used in Figure 5, the eigenvalues of this matrix are complex valued, with negative real part. This implies that a trajectory starting at any point in the neighborhood of (vc , q c ) will spiral into the critical
point and consequently intersect the v-nullcline. Thus, there is no trajectory that simply passes through (vc, qc) for the case illustrated in Figure 5. It is worth noting that with other parameter choices (example and theoretical analysis not shown) that give real eigenvalues of equation 6.35, a trajectory that passes through (vc, qc) must do so along an eigenvector of 6.35. We conclude that there is no solution to the steady-state problem for the particular synaptic input rate of νe = 1000 s⁻¹ used for illustration in Figure 5. The eigenvalues of equation 6.35 are complex, and the phase plane in Figure 5 is generic, for all synaptic input rates that give 0 < µG − h_min(vth) < σG. For our parameters, this range of synaptic input rates is 750 < νe < 1109 s⁻¹ (0.183 < µG < 0.271). Therefore, the arguments above apply and show that there are no nonzero firing-rate solutions for synaptic input rates 750 < νe < 1109 s⁻¹. We also computed candidate q trajectories for synaptic input rates 1109 < νe < 1241 s⁻¹ and for νe < 750 s⁻¹; these are trajectories that do not intersect the v-nullcline but do intersect q(v) = µG − h_min(v). We found no such trajectories (using our shooting method restricted to admissible q(εr) values) that satisfied the boundary condition for q. We conclude that there are no nonzero firing-rate solutions for the steady-state moment closure problem at k = 2, with our choice of parameters, for νe < 1241 s⁻¹ (µG < 0.303). A formal mathematical proof is beyond the scope of this article. In appendix B, we show that zero firing-rate solutions do not exist for any range of synaptic input rates. Thus, there is no solution of any kind for the steady-state moment closure at k = 2 for νe < 1241 s⁻¹. This contrasts with the k = 1 case, in which zero firing-rate solutions do exist for synaptic input rates below the threshold rate for firing. We wish to emphasize that for our physiologically realistic parameters, the minimal value of the synaptic input rate where we find a solution, νe = 1241 s⁻¹ (µG = 0.303), is much higher than the threshold synaptic input rate for the mean-field firing-rate model, νe = 744.5 s⁻¹ (µG = 0.182). The firing rates given by solution of the full 2D problem (Apfaltrer et al., 2006) at νe = 1241 s⁻¹ and at νe = 744.5 s⁻¹ are 59 s⁻¹ and 17 s⁻¹, respectively. Therefore, there is a sense in which moment closure at k = 2 (second moment) is worse than moment closure at k = 1. But when moment closure at k = 2 does work, it more closely approximates the full 2D solution. It is worth noting that at low synaptic input rates, where the moment closure problem at k = 2 fails to have a solution, nothing special happens in the original 2D problem. However, these low synaptic input rates correspond to marginal voltage density functions f_V^(0)(v) that have a negative slope at v = vth, the threshold voltage. Figure 6A shows the marginal density computed from the full 2D PDM (Apfaltrer et al., 2006) for the synaptic input rate of 1000 s⁻¹ in Figure 5. Figures 6B and 6C compare the unconditioned mean and variance to the actual conditioned mean and variance
Figure 6: Steady-state density, mean, and variance from the full 2D PDM for νe = 1000 s⁻¹. (A) The marginal density f_V^(0)(v) computed from the full 2D PDM for νe = 1000 s⁻¹, where a solution to the moment closure at k = 2 problem fails to exist. (B) The mean of G given V, µ_G|V(v) (solid line), is plotted from the full 2D PDM along with the unconditioned mean of G, µG (dot-dashed line). (C) The variance of G given V, σ²_G|V(v) (solid line), from the full 2D PDM is plotted with the unconditioned variance of G, σG² (dot-dashed line). Parameters are the same as those in Figure 5: νe = 1000 s⁻¹, τm = 20 ms, τe = 5 ms, µA = 2.44 × 10⁻⁴ s, giving µEPSP = 0.5 mV, and µG = 0.244.
given by the full 2D solution. Note that the variations in µ_G|V(v) and σ²_G|V(v) in Figure 6 (νe = 1000 s⁻¹) are much greater than those in Figure 3 (νe = 1341 s⁻¹, where µG = 0.327). Figure 6C shows that at νe = 1000 s⁻¹, where the moment closure at k = 2 problem fails to have a solution, the conditional conductance variance σ²_G|V(v) is not well approximated by the unconditioned variance σG². The key approximation (assumption) for moment closure at k = 2, measured as fractional deviation, is better at higher synaptic input rates.
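The linearization argument above can be sanity-checked numerically. The sketch below (Python; not from the original article) locates (vc, qc) from equation 6.34 and estimates the Jacobian of (β, α) at that point by central differences, rather than committing to a hand-derived Jacobian; σG depends on the amplitude statistics and is an assumed, illustrative value here. For the Figure 5 parameters, the text's claim is that the eigenvalues are complex with negative real part.

```python
import numpy as np

tau_m, tau_e = 20e-3, 5e-3
eps_r, eps_e, v_th = -65e-3, 0.0, -55e-3
mu_G, sigma_G = 0.244, 0.077   # nu_e = 1000/s; sigma_G assumed for illustration
r = tau_m / tau_e

# Critical point, equation 6.34.
v_c = (r * ((mu_G - sigma_G) * eps_e + eps_r) - sigma_G * (eps_e - eps_r)) \
      / (r * (mu_G + 1.0 - sigma_G))
q_c = sigma_G

def f(y):                      # (beta, alpha) of equations 6.31 and 6.32
    q, v = y
    h_min = (v - eps_r) / (eps_e - v)
    beta = -q * r * (q * (1.0 + (eps_e - eps_r) / (r * (eps_e - v)))
                     - (mu_G - h_min)) / (eps_e - v)
    return np.array([beta, q**2 - sigma_G**2])

# Central-difference Jacobian at (q_c, v_c); cross-checks equation 6.35.
y_c, d = np.array([q_c, v_c]), 1e-7
J = np.column_stack([(f(y_c + d * e) - f(y_c - d * e)) / (2.0 * d)
                     for e in np.eye(2)])
print("v_c =", v_c, "eigenvalues:", np.linalg.eigvals(J))
```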
6.2.6 Smaller τe Gives Steady-State Solutions for the Moment Closure Problem at k = 2 at Lower Synaptic Input Rates. A PDE for f_V^(0)(v, t) can be derived by moment closure at k = 2 in the limit τe → 0 (Cai et al., 2004; see also appendix E). A steady-state solution to this PDE exists for all synaptic input rates. This result inspired us to try to find solutions by the methods above for smaller values of τe, with other parameters held fixed. It is worth noting that when µA and νe are held fixed, lim_{τe→0} µEPSP = 0.77 mV, which does not differ greatly from the value of 0.5 mV when τe = 5 ms. In this numerical experiment, we chose νe = 744.5 s⁻¹ (µG = 0.182), the threshold value for the mean-field firing-rate model when τe = 5 ms and µA = 2.44 × 10⁻⁴ s (µEPSP = 0.5 mV). We did not attempt to find with precision the largest value of τe that would give a steady-state solution, but we estimate it to be approximately 0.2 ms, much smaller than 5 ms. It is worth noting that as τe is decreased with all other parameters held fixed, the voltage vc at which the v- and q-nullclines intersect decreases. However, in computations not shown, we found that the condition vc < εr (outside the domain that pertains to our problem) is not sufficient for a solution to exist. Figure 7 shows the v-q phase plane for τe = 0.1 ms, νe = 744.5 s⁻¹ (µG = 0.182), and other parameters as above. The solution trajectory is plotted by the thick black line.

The marginal voltage density f_V^(0)(v) corresponding to the q(v) trajectory, shown in Figure 8A, has the shape that is expected at low firing rates (Apfaltrer et al., 2006). The computed firing rate is 23.02 s⁻¹, while the mean-field firing rate is 0 s⁻¹. Figure 8A shows that the density for the limit τe → 0, computed analytically following Cai et al. (2004), matches the density we computed for τe = 1 × 10⁻⁴ s. This result is expected because τe ≪ τm, the only other parameter with units of time in the problem. This serves as a check of our analysis and numerical methods and provides further evidence that our steady-state analysis and numerical methods using the q(v) function are indeed correct.

Whenever τe ≪ τm, we expect the full 2D problem (Apfaltrer et al., 2006) to be well approximated by the 1D (instantaneous synaptic kinetics) problem (Nykamp & Tranchina, 2000); we also expect the moment closure problem with τe ≪ τm to be well approximated by the τe → 0 limit of this problem. As a check of the accuracy of moment closure at k = 2 when τe is small, we compared the input-output curves computed according to four methods: the τe → 0 limit for moment closure at k = 2; τe = 0.1 ms for moment closure at k = 2 (as in the application in Rangan & Cai, 2006); the 1D limit of the 2D PDM corresponding to instantaneous synaptic kinetics (Nykamp & Tranchina, 2000); and the full 2D problem with τe = 0.1 ms. The results in Figure 8 show that moment closure at the second moment with τe = 0.1 ms underestimates the actual firing rate (from the full 2D PDM) by
Figure 7: Phase plane and solution q(v) trajectory (solid black line) for a low synaptic input rate, νe = 744.5 s⁻¹ (µG = 0.182), but with a very small excitatory synaptic conductance time constant, τe = 0.1 ms. The q-nullcline (dot-dashed line) and the line µG − h_min(v) (dashed line) are indistinguishable. The v-nullcline (q = σG = 1.83) is off the scale of this plot. Other parameters: τm = 20 ms, εr = −65 mV, εe = 0 mV, vth = −55 mV, µA = 2.44 × 10⁻⁴ s, giving µEPSP = 0.77 mV.
a few s⁻¹. For higher input rates, the match between the two curves, as well as with the mean-field model (not shown), is very good. The full 2D solution with τe = 0.1 ms is matched by the 1D PDM (Apfaltrer et al., 2006). The moment closure at k = 2 results with τe = 0.1 ms are matched by the moment closure results in the τe → 0 limit. One can conclude that when moment closure at k = 2 does work in the regime of very small τe and low synaptic input rates, it is not as accurate as the solution to the 1D problem, which is easier to compute.

6.2.7 Some Edges of Parameter Space Where Solutions to the Moment Closure Problem at k = 2 Exist. To obtain a representative view of the failures of moment closure at k = 2, we explored other regions of parameter space; the results are reported in Tables 4 and 5 (where we used the reversal, threshold, leak, and reset voltages of Cai et al., 2004, 2006, to allow easier comparison with their work). The same shooting method (see appendix D) was used to compute solutions and to determine the nonexistence of solutions.
Figure 8: Small τe. (A) Marginal voltage densities for small τe and for the limit τe → 0. The solid gray plot corresponds to the computed moment closure at k = 2 solution, q(v) in Figure 7, where τe = 0.1 ms. The dot-dashed plot is the density computed from the moment closure solution in the limit τe → 0, equation E.2. The marginal density rises up and comes down to a lower value at vth, a characteristic of low firing rates (τe = 1 × 10⁻⁴ s, τm = 20 ms, µA = 2.44 × 10⁻⁴ s, giving µEPSP = 0.77 mV). The firing rate is r∞ = 23.02 s⁻¹ for both densities. (B) Results from moment closure at k = 2 in the regime of small τe and low firing rates, which are less accurate than those of the full 2D PDM. Input-output curves: moment closure at k = 2 in the limit τe → 0, equation E.3 (solid gray line); steady-state moment closure at k = 2, equations 6.28 and 6.29, with τe = 0.1 ms (dot-dashed black line); the 1D PDM corresponding to instantaneous synaptic kinetics (solid black line); the full 2D PDM with τe = 0.1 ms (dot-dashed gray line). The parameter µA = 2.44 × 10⁻⁴ s is fixed at its value as in Figures 2–7. The 1D PDM matches well with the 2D PDM (τe = 0.1 ms), as expected. Moment closure at k = 2 with τe = 0.1 ms matches the τe → 0 equations, but both underestimate the actual firing rate for low input rates.
Table 4 reports, for various other values of τe, the minimal synaptic input rate νe^min at which the moment closure equations at k = 2 have a steady-state solution, with all other parameters held fixed. To obtain Table 4, the parameters εr = −70 mV, vth = −55 mV, τm = 20 ms, and µA = 2.27 × 10⁻⁴ s, giving µEPSP = 0.5 mV, were held fixed while the excitatory synaptic time constant τe was varied. In terms of the dimensionless parameters (see Table 3) that characterize the problem, α and γ were held fixed; for each chosen value of δ, the minimal µG that would allow a solution was found. The minimal synaptic input rates correspond to substantially high firing rates r∞ (computed by the full 2D PDM) over a wide range of τe values. As τe is decreased from 5 ms, the firing rate corresponding to νe^min at first gets higher (the method gets worse), until τe = 0.2 ms.
Table 4: Minimal Input Rate for Steady-State Moment Closure at k = 2.

  τe        νe^min          r∞ from Full 2D PDM
  5 ms      1804.11 s⁻¹     51.35 s⁻¹
  2 ms      2377.84 s⁻¹     74.22 s⁻¹
  1 ms      2584.88 s⁻¹     85.82 s⁻¹
  0.5 ms    2760.47 s⁻¹     109.16 s⁻¹
  0.2 ms    539.05 s⁻¹      < 0.01 s⁻¹
  0.1 ms    535.19 s⁻¹      < 0.01 s⁻¹

Notes: The minimal input rate νe^min for each value of τe was found using the same shooting method described above (see appendix D). Parameters: εr = −70 mV, εe = 0 mV, vth = −55 mV, τm = 20 ms, µA = 2.27 × 10⁻⁴ s, giving µEPSP = 0.5 mV.
Table 5: Minimal EPSP at Peak Time (Determined by µA) Required for a Solution to Exist for the Steady-State Moment Closure at k = 2 Problem, with a Fixed Value of µG.

  τe      EPSP       µA               r∞ from Full 2D PDM
  5 ms    7.47 mV    3.94 × 10⁻³ s    61.15 s⁻¹
  2 ms    4.89 mV    2.03 × 10⁻³ s    65.70 s⁻¹
  1 ms    3.40 mV    1.26 × 10⁻³ s    69.24 s⁻¹

Notes: µG = µG^th, the threshold value for the mean-field firing-rate model (moment closure at k = 1). For each τe, the input rate νe was decreased as µA was increased until there was a solution to the steady-state moment closure problem at k = 2. Because µG = νe µA = (vth − εr)/(εe − vth) = µG^th, the explored values of νe and µA are locked in inverse proportion. Parameters as in Table 4: εr = −70 mV, εe = 0 mV, vth = −55 mV, τm = 20 ms.
Table 5 reports, for various values of τe, the minimal value of µA, and the corresponding mean EPSP amplitude (at peak time), required for a solution to exist for the moment closure equations at k = 2, with µG held fixed at µG^th, the threshold for the mean-field firing-rate model (see appendix A). To obtain Table 5, the parameters εr = −70 mV, vth = −55 mV, τm = 20 ms, and µG were held fixed; for each chosen value of τe, µA was "titrated" downward until the minimal µA that would permit a solution was found. In terms of the dimensionless parameters (see Table 3), α and µG were held fixed; for each chosen value of δ, the minimal value of γ was found. The corresponding minimal average EPSP amplitude of 7.5 mV is quite large, but as τe is decreased, the minimal EPSP also decreases.

The minimal EPSP of 7.5 mV above may be considered large because, upon receiving only two synaptic input events in close temporal proximity, the neuron will fire (on average). Large EPSPs have been reported in the literature. Fricker and Miles (2000) recorded large EPSPs (3–5 mV) in
pyramidal cells in hippocampal slices of Sprague-Dawley rats. Yoshimura, Sato, Imamura, and Watanabe (2000) recorded large EPSPs in pyramidal cells in the superficial layers of cat cortex slices with paired-pulse stimulation. Their EPSPs varied over a wide range, with some as large as 7 mV. Shadlen and Newsome (1998) state that neurons typically fire an action potential when receiving 10 to 40 synaptic input events within 20 to 40 ms of one another; this range corresponds to smaller EPSPs than those reported by Fricker and Miles (2000) and Yoshimura et al. (2000). EPSPs of 0.3 to 1 mV are commonly reported (Kitano, Okamoto, & Fukai, 2003; Biella, Uva, & Curtis, 2001; Thompson, Girdlestone, & West, 1988; McNaughton, Barnes, & Anderson, 1981), and this range is consistent with that of Shadlen and Newsome (1998). The results above show clearly that solutions to the steady-state moment closure equations at k = 2 do not exist over a large volume of parameter space.

7 Moment Closure at k = 3

7.1 Dynamical Equations for Closure at k = 3. For moment closure at the third moment, we assume that the third centered conditional moment of the synaptic conductance G, κ³_G|V, is independent of the voltage V:
$$E\Big[\big(G(t) - \mu_{G|V}(v,t)\big)^3 \,\Big|\, V = v\Big] = E\Big[\big(G(t) - \mu_G(t)\big)^3\Big].$$

We conclude:

$$f_V^{(3)}(v,t) = f_V^{(0)}(v,t)\Big[\mu_{G^3}(t) - 3\mu_{G^2}(t)\,\mu_G(t) + 2\mu_G^3(t) + 3\mu_{G^2|V}(v,t)\,\mu_{G|V}(v,t) - 2\mu_{G|V}^3(v,t)\Big].$$

Summary of equations:

$$\frac{\partial f_V^{(0)}(v,t)}{\partial t} = -\frac{\partial}{\partial v}\left\{-\frac{1}{\tau_m}\Big[(v-\varepsilon_r)\,f_V^{(0)}(v,t) + (v-\varepsilon_e)\,f_V^{(1)}(v,t)\Big]\right\}, \tag{7.1}$$

$$\frac{\partial f_V^{(1)}(v,t)}{\partial t} = -\frac{\partial}{\partial v}\left\{-\frac{1}{\tau_m}\Big[(v-\varepsilon_r)\,f_V^{(1)}(v,t) + (v-\varepsilon_e)\,f_V^{(2)}(v,t)\Big]\right\} - \frac{1}{\tau_e}\Big[f_V^{(1)}(v,t) - \nu_e(t)\,\mu_A\, f_V^{(0)}(v,t)\Big], \tag{7.2}$$

$$\frac{\partial f_V^{(2)}(v,t)}{\partial t} = -\frac{\partial}{\partial v}\left\{-\frac{1}{\tau_m}\Big[(v-\varepsilon_r)\,f_V^{(2)}(v,t) + (v-\varepsilon_e)\,f_V^{(3)}(v,t)\Big]\right\} - \frac{2}{\tau_e}\left[f_V^{(2)}(v,t) - \nu_e(t)\left(\mu_A\, f_V^{(1)}(v,t) + \frac{\mu_{A^2}}{2\tau_e}\, f_V^{(0)}(v,t)\right)\right], \tag{7.3}$$

$$\frac{d\mu_G(t)}{dt} = -\frac{1}{\tau_e}\Big[\mu_G(t) - \nu_e(t)\,\mu_A\Big], \tag{7.4}$$

$$\frac{d\mu_{G^2}(t)}{dt} = -\frac{2}{\tau_e}\left[\mu_{G^2}(t) - \nu_e(t)\left(\frac{\mu_{A^2}}{2\tau_e} + \mu_A\,\mu_G(t)\right)\right], \tag{7.5}$$

$$\frac{d\mu_{G^3}(t)}{dt} = -\frac{3}{\tau_e}\left[\mu_{G^3}(t) - \nu_e(t)\left(\mu_{G^2}(t)\,\mu_A + \frac{\mu_G(t)\,\mu_{A^2}}{\tau_e} + \frac{\mu_{A^3}}{3\tau_e^2}\right)\right]. \tag{7.6}$$

The boundary conditions are

$$J_V^{(0)}(v_{th}, t) = J_V^{(0)}(\varepsilon_r, t), \tag{7.7}$$

$$J_V^{(1)}(v_{th}, t) = J_V^{(1)}(\varepsilon_r, t), \tag{7.8}$$

$$J_V^{(2)}(v_{th}, t) = J_V^{(2)}(\varepsilon_r, t), \tag{7.9}$$

where

$$J_V^{(i)}(v,t) = -\frac{1}{\tau_m}\Big[(v-\varepsilon_r)\,f_V^{(i)}(v,t) + (v-\varepsilon_e)\,f_V^{(i+1)}(v,t)\Big]. \tag{7.10}$$
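The unconditioned-moment subsystem, equations 7.4 to 7.6, closes on itself and is easy to integrate. A minimal sketch (Python/SciPy; not from the original article) under a steady input rate follows; the amplitude moments µ_A² and µ_A³ are not specified in this section, so they are explicit inputs, with deterministic-amplitude values used purely for illustration.

```python
import numpy as np
from scipy.integrate import solve_ivp

tau_e, nu_e, mu_A = 5e-3, 1480.0, 2.44e-4
mu_A2, mu_A3 = mu_A**2, mu_A**3   # illustrative: deterministic amplitudes

def moments_rhs(t, m):
    mG, mG2, mG3 = m
    dmG  = -(1/tau_e) * (mG - nu_e * mu_A)                           # eq. 7.4
    dmG2 = -(2/tau_e) * (mG2 - nu_e * (mu_A2/(2*tau_e) + mu_A*mG))   # eq. 7.5
    dmG3 = -(3/tau_e) * (mG3 - nu_e * (mG2*mu_A + mG*mu_A2/tau_e
                                       + mu_A3/(3*tau_e**2)))        # eq. 7.6
    return [dmG, dmG2, dmG3]

# Integrate for 20 synaptic time constants; the moments relax to steady state.
sol = solve_ivp(moments_rhs, [0.0, 0.1], [0.0, 0.0, 0.0], rtol=1e-9)
mG, mG2, mG3 = sol.y[:, -1]
print(mG)            # -> nu_e * mu_A = 0.3611..., as equation 6.20 requires
print(mG2 - mG**2)   # steady-state unconditioned variance of G
```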
7.1.1 Numerical Results for the Dynamical Problem from Moment Closure at k = 3 and Testing Against Those from the Original Full 2D Problem. Moment closure at the third moment performs better than closure at the second moment in the regime where solutions exist for both closures. For example, Figure 9 shows the firing-rate responses to square-wave modulation of the synaptic input rate computed by three different methods: the full 2D PDM and moment closure at k = 2 and at k = 3. There is excessive "ringing" with closure at the second moment; closure at the third moment smooths this out to a large extent and is therefore a better approximation.

7.2 The Steady-State Problem for Moment Closure at k = 3. We proceed in a fashion similar to that for moment closure at k = 2, and we define an additional function w(v) ≡ µ_G²|V(v). In appendix G, a pair of ODEs and boundary conditions for q(v) (same definition as above) and w(v) are derived.
Figure 9: Performance of moment closure at k = 3 in a dynamical problem with time-dependent synaptic input. (A) Numerical simulation of responses to a square-wave modulation of synaptic input (mean rate of 1480 s−1 , contrast = [νmax − νmin ]/[νmax + νmin ] = 0.4) computed by three different methods: numerical solution of the full 2D problem (thick black line), moment closure at k = 2 (thin black line), and moment closure at k = 3 (dashed gray line). The solution obtained by moment closure at k = 3 matches the exact result better than that obtained by moment closure at k = 2. (B) Zoomed-in figure corresponding to the “ringing” time period in A.
The equations for q(v) and w(v) (see appendix G, equations G.2 and G.3) can be written as a matrix-vector product:

$$M(q, w, v)\begin{pmatrix} \dfrac{dq}{dv} \\[4pt] \dfrac{dw}{dv} \end{pmatrix} = \vec{u}(q, w, v). \tag{7.11}$$

When det(M) ≠ 0, the solution of equation 7.11 is of the form

$$\frac{dq}{dv} = \frac{x(q, w, v)}{\det(M)}, \qquad \frac{dw}{dv} = \frac{y(q, w, v)}{\det(M)}.$$

As in the analysis above, we introduce an auxiliary variable s and examine the dynamical system (q(s), w(s), v(s)), where

$$\frac{dq}{ds} = x(q, w, v), \tag{7.12}$$

$$\frac{dw}{ds} = y(q, w, v), \tag{7.13}$$

$$\frac{dv}{ds} = \det\big(M(q, w, v)\big). \tag{7.14}$$
The system evolves with increasing or decreasing s depending on the sign of dv/ds evaluated at v = εr. As in the k = 2 moment closure problem, if dv/ds crosses zero on a trajectory, the trajectory does not correspond to a candidate solution, because q and/or w would take on multiple values for a single v value.

7.2.1 Steady-State Solutions with Nonzero Firing Rates for Moment Closure at k = 3 at High Synaptic Input Rates. As in the moment closure problem at k = 2, when the synaptic input rate is high (giving large µG), we can easily find a solution. Figure 10 shows an example in which the steady-state equations have a solution that satisfies the two boundary conditions. Notice that the correct q(v) and w(v) trajectories straddle the lines µG − h_min(v) and µ_G², respectively (see Figure 10A), as required (see the discussion of the steady-state k = 2 problem and the necessary condition for solution existence). The marginal voltage density f_V^(0)(v) computed by moment closure at k = 3 matches that computed by solution of the full 2D problem (see Figure 10B). Figures 10C and 10D compare the conditional mean µ_G|V and variance σ²_G|V obtained by solving the steady-state equations for moment closure at k = 3 (solid line) to those obtained by the full 2D PDM (dot-dashed line) with steady synaptic input, after the dynamical problem settles to a steady state. The conditional means obtained by the full 2D PDM and moment closure at k = 3 match well (see Figure 10C). The variances differ somewhat but are qualitatively similar (see Figure 10D).

7.2.2 Check That the ODE Method for Moment Closure at k = 3 Solves the Same Problem as the Corresponding PDE Method for Moment Closure with Steady Synaptic Input. As a check of our mathematical analysis of the steady-state problem in the q(v) and w(v) functions, we computed the conditional means and variances obtained by two different methods: solving the ODEs, equations G.2 to G.5, and solving the coupled PDEs, equations 7.1 to 7.10, with steady synaptic input, after the solution of the PDEs has settled to a steady state. Figure 11 shows that the results obtained by these two methods are indistinguishable.

7.2.3 Numerical Evidence That Steady-State Solutions Do Not Exist for Moment Closure at k = 3 at Low Synaptic Input Rates. We obtained numerical evidence that solutions do not exist at two particular synaptic input rates discussed above: νe = (vth − εr)/[µA(εe − vth)] = 744.5 s⁻¹, the threshold frequency for the mean-field firing-rate model, and νe = 1241 s⁻¹, the minimal rate at which we could find a solution for the moment closure at k = 2 (with
Figure 10: Steady-state moment closure equations at k = 3 with a high synaptic input rate. (A) Top panel: solution q trajectory (solid black line) obtained by solving the system G.2 to G.5, plotted with the line µG − h_min(v) (dotted line). Lower panel: solution w trajectory (solid black line) plotted with the line µ_G² (dotted line). (B) The computed marginal density obtained by moment closure at k = 3 (solid gray line), plotted with the marginal density of the full 2D PDM (dot-dashed line). Computed firing rates are 78.18 s⁻¹ and 77.96 s⁻¹ from moment closure at k = 3 and from the full 2D PDM, respectively. (C, D) The mean µ_G|V and variance σ²_G|V for a high synaptic input rate νe = 1480 s⁻¹, where the steady-state moment closure at k = 3 is a good approximation to the full population density equations. (C) The conditional mean from moment closure at k = 3 (solid gray line) and from the full 2D PDM (dot-dashed line). (D) The conditional variance from moment closure at k = 3 (solid gray line) and from the full 2D PDM (dot-dashed line). Parameters: νe = 1480 s⁻¹; εr = −65 mV; εe = 0 mV; vth = −55 mV; τm = 20 ms; τe = 5 ms; µA = 2.44 × 10⁻⁴ s, giving µEPSP = 0.5 mV.
parameters: εr = −65 mV, εe = 0 mV, vth = −55 mV, τm = 20 ms, τe = 5 ms, µA = 2.44 × 10⁻⁴ s, giving µEPSP = 0.5 mV). We formed a grid of starting values (q0, w0), solved the system of equations G.2 to G.5, and compared the boundary values. With the value of νe = 744.5 s⁻¹ (above), a trajectory satisfying both boundary conditions did not exist. For νe = 1241 s⁻¹, the
Figure 11: Check of the steady-state analysis and computations for moment closure at k = 3, comparing the steady-state solutions from the full dynamical equations 7.1–7.10 to those of the ODEs G.2–G.5 for an input rate high enough that a solution exists (νe = 1480 s⁻¹). The solid lines are from the ODEs in q and w; the dot-dashed lines are from the fully dynamical equations. (A) The two densities, with computed firing rates of 78.18 s⁻¹. (B) The two conditional means. (C) The two conditional variances. All are perfect matches. This figure provides evidence that the steady-state ODE analysis in the q and w variables is correct. Parameters are the same as in Figure 10.
minimal input rate at which closure at k = 2 has a solution, a trajectory satisfying both boundary conditions still did not exist. We found no initial conditions that gave trajectories in which the boundary conditions were nearly satisfied for either synaptic input rate, νe = 744.5 s⁻¹ or νe = 1241 s⁻¹.

7.2.4 Minimal Synaptic Input Rate for Which a Steady-State Solution Exists for the Moment Closure Problem at k = 3. We used a shooting method similar to that above (see appendix D) to find νe^min for moment closure at k = 3, with τe, µA, and the voltage parameters held fixed. For the ODEs in q and w, equations G.2 to G.5, we found that the lowest synaptic input rate at which we were able to find a solution is νe^min = 1459 s⁻¹. This value of νe^min is in
Figure 12: Input-output curves for four methods of solving the 2D PDM, plotted as functions of the constant input rate νe (s⁻¹). The gold standard is the full 2D PDM (black solid line). Moment closure at k = 1 (mean field) is plotted in dark gray, with a solid line (extending to νe = 0). Moment closure at k = 2 is plotted in a gray dot-dashed line. Moment closure at k = 3 is plotted in a light gray solid line. The vertical lines for moment closure at k = 2 and k = 3 indicate the minimal synaptic input rate where a solution exists. Parameters: τm = 20 ms; τe = 5 ms; µA = 2.44 × 10⁻⁴ s, giving a mean unitary excitatory postsynaptic potential amplitude of 0.5 mV at peak time (µEPSP); εr = −65 mV; εe = 0 mV; vth = −55 mV.
good agreement with the point at which the dynamical problem, equations 7.1 to 7.10, breaks down, which is νe^min = 1457 s⁻¹ (using 100 voltage grid points). This rate of synaptic events corresponds to a firing rate of 76.33 s⁻¹ in the full 2D problem and corresponding Monte Carlo simulations.

8 Input-Output Curves for Moment Closure Methods and the Full 2D PDM

Figure 12 summarizes the shortcomings of the dimension-reduction methods discussed in this article. Each input-output curve is a plot of the steady-state firing rate as a function of the steady synaptic input rate, over the range
of input rates where a solution exists. The gold standard is the full 2D PDM (black solid line). Moment closure at k = 1, the mean-field model (dark gray, solid line), has a solution for all synaptic input rates, but there is a threshold synaptic input rate below which the firing rate is equal to zero. The figure also shows that moment closure at k = 2 (gray dot-dashed line) is somewhat better than the mean-field model where solutions exist, but solutions exist over a more restricted range of synaptic input rates (also see Figure 1). For closure at k = 2, there is a synaptic input rate below which the output firing rate is undefined, because no solutions of any kind exist for the moment closure equations in this regime. The story is similar for moment closure at k = 3 (light gray, solid line). Figure 9 shows that closure at k = 3 is better than closure at k = 2 at high output firing rates, but Figure 12 shows that solutions exist for k = 3 over an even more restricted range of synaptic input rates.

9 Discussion

We have analyzed a dimension-reduction method based on a particular type of moment closure, as applied to an uncoupled population of I&F neurons receiving external, conductance-driven excitatory synaptic input only, with first-order synaptic kinetics. The population density function for the full problem has two state variables, voltage V and excitatory synaptic conductance G. The centered moment closure method we studied (Cai et al., 2004, 2006) was suggested by Haskell et al. (2001) in the discussion section of their article, and it is an extension of a method used by Nykamp and Tranchina (2001). This closure method is based on the assumption or approximation that the kth centered moment of the conductance, conditioned on the membrane voltage, is independent of voltage. The result is a system of k coupled partial differential equations with one state variable, membrane voltage, and k coupled ordinary differential equations for the first k unconditioned moments of the excitatory conductance random variable.

We used phase-plane analysis for moment closure at the second moment to prove that no nonzero steady-state firing-rate solutions exist over a wide range of synaptic input rates for an example set of physiological parameters. Additional computations with the same set of parameters showed that a solution to the moment closure system of equations does not exist for synaptic input rates below that which gives an output firing rate of 59 s⁻¹ in the original (full PDM) dynamical 2D problem with excitation only (Apfaltrer et al., 2006) and corresponding Monte Carlo simulations. We showed that other reasonable sets of parameters for the underlying single-neuron model give the same existence problems.

Although we have focused, for the sake of exposition, on a single uncoupled population of neurons receiving excitatory synaptic input only, our conclusions pertain just as well to networks of excitatory neurons with arbitrary architecture, including feedback. Equation 6.28 shows that the existence of
steady-state solutions to the moment closure equations, or the lack thereof, for any excitatory population is completely determined by the unconditioned mean and variance of the net excitatory postsynaptic conductance input (once the reversal potentials and time constants are chosen). Furthermore, we checked to see whether the dynamic moment closure partial differential equations resulting from the addition of inhibitory synaptic input (Cai et al., 2006) behave differently. We found similar evidence suggesting the nonexistence of solutions in our attempts to solve the equations numerically over a wide range of excitatory and inhibitory synaptic input rates (results not shown).

The centered moment closure method at the second moment performs well, and closure at the third moment performs even better, with physiologically typical EPSPs, in a regime where τe and τm are not on disparate timescales, only at high synaptic input rates that give high output firing rates. However, the much more computationally efficient mean-field firing-rate model does well, although not quite so well, in this regime also. Moment closure at k = 2 also works well for small τe and a wide range of synaptic input rates, including those that give a low output firing rate, where the mean-field firing-rate method fails.

9.1 Previous Applications of the Centered Moment Closure Method. There are minor theoretical differences between our implementation of the centered moment closure method at k = 2 and that of Cai et al. (2004, 2006). However, they do not alter qualitatively the results reported here. Our results and conclusions differ radically from those of Cai et al. (2004, 2006). We have shown, for a wide range of reasonable parameter values (not including τe ≪ τm), that the moment closure kinetic equations, taken together with the boundary conditions, are too rigid to admit solutions. Yet the numerical results of Cai et al. (2004, 2006) showed, in similar parameter regions, good agreement of the moment closure kinetic theory with simulations of the corresponding full integrate-and-fire systems. The difference (D. Cai and D. McLaughlin, private communication with the authors, February 2006) was the presence of numerical or artificial viscosity in the equations for µ_G|V and f_V^(0) (ρ in their notation) in the simulations of the kinetic theory of Cai et al. (2004, 2006). This viscosity was active only in the vicinity of the boundary and, through the creation of small boundary layers, relaxed the rigidity and allowed both steady-state and dynamical numerical solutions to exist. This additional viscosity in the µ_G|V and f_V^(0) equations, if formalized mathematically, might give a different closure. This idea merits further study, given that their numerical solutions agreed accurately with simulations.

9.2 Accounting for Successes and Failures of the Centered Moment Closure Method. Our analysis gives insight into why the centered moment closure method at k = 2 works when it works and why it fails when it fails. Moment closure at k = 2, based on the assumption that σ²_G|V = σG², leads to
a pathological ODE for h(v) ≡ µ_G|V(v); it produces an attracting curve in the v-µ_G|V plane, spanning the entire domain from v = vreset = εr to v = vth, along which dµ_G|V/dv is singular. There are two requirements for a solution to the ODE for µ_G|V (satisfying a boundary condition) that are incompatible at low enough synaptic input rates: a solution trajectory in the v-µ_G|V plane must not intersect the singular curve, because it cannot be crossed, but the trajectory must intersect the line µ_G|V(v) = µG at some v in the domain. At low synaptic input rates, where the moment closure assumption σ²_G|V = σG² is not very good, all trajectories that intersect µ_G|V(v) = µG also intersect the singular curve. The attracting singular curve cannot be avoided because it lies in the same vicinity of the v-µ_G|V plane as the line µ_G|V(v) = µG (which must be intersected). It is worth noting that in these cases the exact µ_G|V(v) computed from solution of the full 2D problem does intersect the singular curve (not shown); it can do so because the forbidden intersection is a product of the moment closure method and is not relevant to the full 2D problem.

When the synaptic input rate is high enough, in which case the approximation σ²_G|V = σG² is better, the singular curve and the line µ_G|V(v) = µG are far away from one another. Consequently, the singular curve is out of the way, because it is not located in the relevant region of the v-µ_G|V plane where the µ_G|V(v) solutions lie. The story for moment closure at k = 3 seems to be analogous; in this case, a derived pair of ODEs can be singular on a 2D surface.

It is worth noting that once a moment closure approximation has been made, the resulting equations no longer pertain exactly to any physical system. Even if the approximation is not bad in a percentage-error sense, there is no guarantee in general that the resulting differential equations will have solutions that satisfy the boundary conditions. We do not have any intuition about why moment closures at k = 1, k = 2, and k = 3 break down at systematically increasing values of the synaptic input rate, nor do we know whether this pattern will continue for higher moments. The trouble is that the centered moment closure method we have studied does not provide any theoretical basis for knowing what to expect when going to higher and higher moments once one is out of the regime where the closure approximation is excellent. One could argue that this disqualifies this moment closure method as a rational method.

9.3 Concluding Remarks. Moment closure methods have been applied to kinetic problems in statistical physics for many years; we cite here only a few recent references. For example, Bardos et al. (1991, 1993) derived various hydrodynamic systems from the Boltzmann equation using moment closure methods. One moment closure method uses a maximum-entropy argument (Dreyer et al., 2001, 2004; Rangan & Cai, 2006; Junk & Unterreiter, 2002). Dreyer et al. (2001, 2004) have studied drawbacks in the application of the maximum-entropy principle to the Fokker-Planck equation, and they
compared this approach to the Hermite/Grad approach. Junk and Unterreiter (2002) also analyzed problems with the maximum-entropy closure approach as applied to the Boltzmann equation for rarefied gas dynamics.

Recently, Rangan and Cai (2006) applied the maximum relative entropy closure technique to a diffusion-approximation form of the steady-state 2D PDM equations considered here (see equation 6.11) to derive the closure σ²_G|V(v) = σG². They used a steady-state closure, valid at high synaptic input rates where the marginal density of G is well approximated by a gaussian, to motivate the assumption in the time-dependent problem, at all synaptic input rates, that σ²_G|V(v, t) = σG²(t). This closure yields the same equations studied here and used by Cai et al. (2004, 2006). Rangan and Cai (2006) also used a maximum relative entropy argument to derive a new closure at the third moment. The resulting closure equation says that the third conditional (conductance) centered moment is equal to 0. We have found, for both the steady-state and the dynamical problems, that this closure at the third moment breaks down like the closure at k = 3 studied above, but at a slightly lower synaptic input rate (results not shown).

There is a possibility that moment closure methods might work better when applied to population density formulations of other underlying single-neuron models (see the review by Izhikevich, 2004), such as phase models (Brown et al., 2004) or generalized nonlinear leaky I&F models (Abbott & van Vreeswijk, 1993; Fourcaud-Trocmé, Hansel, van Vreeswijk, & Brunel, 2003; Jolivet, Lewis, & Gerstner, 2004). It is also possible that other moment closures could be derived, perhaps along the lines of Chapman-Enskog methods (Chapman & Cowling, 1970), that would work well for the original 2D population density equation considered here. Several authors (Fourcaud & Brunel, 2002; Moreno-Bote & Parga, 2004, 2005) have studied a similar 2D population density method in which the density evolves via a Fokker-Planck equation. Assuming the two timescales are disparate, they derive asymptotic expansions for the 2D density and the output firing rate that give good results even when the timescales are not disparate. Similar asymptotic expansions may give insight into why the centered moment closure method considered here fails at low synaptic input rates. We are currently exploring this possibility. We hope that our analysis will inspire further progress in dimension reduction by moment closure and further development of alternative dimension-reduction techniques.

Appendix A: Mean-Field, Quasi-Steady-State, Firing-Rate Model from Moment Closure at k = 1

In the steady state, we set the right-hand sides of equations 5.1 and 5.2 equal to 0. Therefore, µG = νe µA, and
$$0 = -\frac{\partial}{\partial v}\left\{-\frac{1}{\tau_m}\Big[(v-\varepsilon_r) + \mu_G\,(v-\varepsilon_e)\Big]\, f_V^{(0)}(v)\right\}.$$

The term inside the braces is the steady-state flux in voltage; it is independent of v because its derivative in v is 0. In particular, this flux must be equal to the flux at v = vth, which is the steady-state firing rate r∞:

$$r_\infty = -\frac{1}{\tau_m}\Big[(v-\varepsilon_r) + \mu_G\,(v-\varepsilon_e)\Big]\, f_V^{(0)}(v). \tag{A.1}$$

Since f_V^(0)(v) ≥ 0, µG ≤ (vth − εr)/(εe − vth) would give a nonpositive flux across vth, with r∞ < 0. This is not allowed, so the conclusion is that r∞ = 0 for this range of µG. When µG > (vth − εr)/(εe − vth), the steady-state firing rate is nonzero. We can divide equation A.1 by the term in brackets, leaving f_V^(0) on the right-hand side, integrate both sides from εr to vth, and use the fact that ∫ f_V^(0)(v) dv = 1 over (εr, vth) to get the firing rate:

$$r_\infty = \begin{cases} 0, & \text{if } \mu_G \le \dfrac{v_{th}-\varepsilon_r}{\varepsilon_e - v_{th}}; \\[8pt] \dfrac{1+\mu_G}{\tau_m}\left[\log\dfrac{\mu_G(\varepsilon_e-\varepsilon_r)}{\mu_G(\varepsilon_e-v_{th}) - (v_{th}-\varepsilon_r)}\right]^{-1}, & \text{if } \mu_G > \dfrac{v_{th}-\varepsilon_r}{\varepsilon_e - v_{th}}. \end{cases} \tag{A.2}$$

From here, we derive an explicit formula for f_V^(0)(v):

$$f_V^{(0)}(v) = \delta\!\left(v - \frac{\varepsilon_r + \mu_G\,\varepsilon_e}{1+\mu_G}\right), \quad \text{if } \mu_G \le \frac{v_{th}-\varepsilon_r}{\varepsilon_e - v_{th}};$$

$$\big[-(v-\varepsilon_r) + \mu_G(\varepsilon_e - v)\big]\, f_V^{(0)}(v) = (1+\mu_G)\left[\log\frac{\mu_G(\varepsilon_e-\varepsilon_r)}{\mu_G(\varepsilon_e-v_{th}) - (v_{th}-\varepsilon_r)}\right]^{-1}, \quad \text{if } \mu_G > \frac{v_{th}-\varepsilon_r}{\varepsilon_e - v_{th}}. \tag{A.3}$$

According to the mean-field firing-rate model, the firing rate at any moment is approximated by

$$r(t) = \begin{cases} 0, & \text{if } \mu_G(t) \le \dfrac{v_{th}-\varepsilon_r}{\varepsilon_e - v_{th}}; \\[8pt] \dfrac{1+\mu_G(t)}{\tau_m}\left[\log\dfrac{\mu_G(t)(\varepsilon_e-\varepsilon_r)}{\mu_G(t)(\varepsilon_e-v_{th}) - (v_{th}-\varepsilon_r)}\right]^{-1}, & \text{if } \mu_G(t) > \dfrac{v_{th}-\varepsilon_r}{\varepsilon_e - v_{th}}. \end{cases} \tag{A.4}$$
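A direct implementation of equations A.2 and A.4 (a minimal sketch, not from the original article) reproduces the mean-field rates quoted in the text to within rounding:

```python
import numpy as np

def mean_field_rate(mu_G, tau_m=20e-3, eps_r=-65e-3, eps_e=0.0, v_th=-55e-3):
    """Steady-state mean-field firing rate, equation A.2 (k = 1 closure)."""
    mu_G_th = (v_th - eps_r) / (eps_e - v_th)   # threshold value of mu_G
    if mu_G <= mu_G_th:
        return 0.0
    log_arg = mu_G * (eps_e - eps_r) / (mu_G * (eps_e - v_th) - (v_th - eps_r))
    return (1.0 + mu_G) / (tau_m * np.log(log_arg))

mu_A = 2.44e-4
print(mean_field_rate(1341.0 * mu_A))  # ~67.8 1/s; the text quotes 67.86
print(mean_field_rate(1000.0 * mu_A))  # ~40.5 1/s; the text quotes 40.62
print(mean_field_rate(744.5 * mu_A))   # 0, at the threshold input rate
```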
2078
C. Ly and D. Tranchina
Appendix B: For Moment Closure at k = 2, Zero Firing-Rate Solution Does Not Exist for Any Range of Synaptic Input Rates We prove here that a zero firing-rate solution of the steady-state problem from moment closure at k = 2 does not exist for any range of synaptic input rates. Our approach will be to derive a first-order ordinary differential (0) equation and boundary condition for f V (v) alone. Then we will show that no solution to the ODE can simultaneously satisfy the boundary condition and integrate to 1, as required for a marginal density function. Suppose that there is a steady synaptic input rate νe that gives a steady firing-rate r∞ = 0. Equation 6.12 tells us that the term in braces, the marginal voltage probability flux, equation 6.18, is independent of v. Hence, it is equal to the value at vth , which is r∞ : (0)
J V (v) = −
1 (0) (v − εr ) + h(v) (v − εe ) f V (v) = r∞ = 0, τm
∀ v.
(B.1)
(0)
Thus, any f V and h satisfying equation B.1 automatically satisfies the first (0) (0) boundary condition, J V (εr ) = J V (vth ). Equation B.1 allows us to simplify equation 6.13 by the following manipulations. First, rewrite equation 6.13 as
−
d dv
1 1 2 (0) (0) (v − εr ) + h(v) (v − εe ) f V (v) h(v) − − σG (v − εe ) f V (v) τm τm (0)
=
f V (v) (h(v) − µG ) . τe
(B.2)
(0)
The term in brackets in equation B.2 is J v and is equal to 0 by equation B.1. Therefore, equation B.2 simplifies to
d dv
(0) f (v) 1 2 (0) h(v) − µG . σG (v − εe ) f V (v) = V τm τe
(B.3)
The function h(v) can be eliminated from equation B.3 by substituting the relation implied by equation B.1,
(0)
(0)
f V (v) h(v) = f V (v)
v − εr , εe − v
(B.4)
Moment Closure in Population Density Theory
2079
into equation B.3 to yield
(0) d f V (v) v − εr µG 1 τm (0) = f V (v) − . + dv (εe − v)2 εe − v σG2 τe εe − v
(B.5)
Using equation B.1, the second boundary condition simplifies to (0)
(0)
f V (vth ) = f V (εr )
εe − εr . εe − vth
(B.6)
Equation B.5 has an analytic solution:
τm (µG + 1) 2 v − ε εe − εr 1 + τ m r (0) (0) τ σ e G f V (v) = f V (εr ) exp − . εe − v τe σG2 εe − v
(B.7) (0)
(0)
Notice that if f V (εr ) = 0, then f V (v) = 0 for all v; the boundary condi(0) tion B.6 is automatically satisfied, but f V does not integrate to 1 as required for a density function. Furthermore, it can be shown that if boundary condition B.6 is satisfied, then equation B.7 forces νe to be one particular value, and the boundary condition cannot be satisfied by any other value of νe . We conclude that solutions with r∞ = 0 for moment closure at the second moment do not exist for any range of synaptic input rates νe . This is unlike the k = 1 case, where a nontrivial solution exists for r∞ = 0 for a range of νe values below a threshold value. Appendix C: ODE and Boundary Condition for q (v) in Dimensionless Form We define Õ(Ú) ≡ q (v), where Ú ≡ v −εr , and 0 ≤ ≤ 1. In so doing, th r one can see that ( ) is determined by four dimensionless parameters: v−ε
d =− d α−
α 1+δ α−
− µG −
δ − µG γ 2
α−
,
(C.1)
with boundary condition
2 (1) = 2 (0),
(C.2)
2080
C. Ly and D. Tranchina
where
2 ( ) ≡ ( ) + µG γ
1
( )
+
α−
.
(C.3)
The four dimensionless parameters in equations C.1 to C.3 (also listed in 2 µA σA r [1 + ], and α ≡ vεthe −ε . Table 3) are δ ≡ ττme , µG = νe µ A, γ ≡ 2τ µA −εr m Appendix D: Shooting Algorithm for Solution of ODEs For high-input rates, the correct trajectory of q (v) is relatively easy to find. A Matlab stiff ODE solver was used to compute the trajectories q (v) for a given q (εr ) = q 0 . We start at a very high νe where we know a solution exists and is easy to find. Fminsearch is used to try to zero the difference in the two boundary conditions. The root finding program may use any q 0 ; however, we want only trajectories that end at vth , so |v(end) − vth | must be zeroed as well (v(end) is the final voltage value where the ODE solver terminates.) Since v(end) = vth is necessary for a solution to exist, a heavier weight is put on this term, and fminsearch must zero the function q
q
|b 2 (εr ) − b 2 (v(end))| + 1040 |v(end) − vth | in order for a solution to exist. Once a solution is found, νe is decreased by a few s−1 , and the correct q (εr ) is the initial guess (q (εr ) varies smoothly with νe ). When the function q q |b 2 (εr ) − b 2 (v(end))| + 1040 |v(end) − vth | becomes larger than a certain tolerance (1 × 10−3 ), the boundary conditions can no longer be satisfied up to the specified precision. Finding a solution for small νe where either τe is made small or the EPSP is large is not easy by guessing and checking with different q 0 values. The shooting algorithm is very robust and can easily find a solution in those cases. It is important to note that when we suspected a solution did not exist, the fminsearch algorithm was started at a large grid of q 0 values. Appendix E: Steady-State Moment Closure Problem at k = 2 in the Limit τ e → 0 We study the following equations: 6.12, limτe →0 6.13 ∗ τe (Cai et al., 2004). We ν µ (0) still have µG = νe µ A and σG2 = e2τeA2 . The boundary conditions are J V (εr ) = (0) (1) (1) J V (vth ) and J˜ V (εr ) = J˜ V (vth ), where (0)
J V (v) = −
(0) f V (v)
(v − εr ) + (v − εe )h(v) , τm
Moment Closure in Population Density Theory
(1) J˜ V (v) = lim
τe →0
=−
−
νe µ A2 2τm
2081
(0) f V (v) v − εr h(v) + v − εe σG2 + h 2 (v) ∗ τe τm (0) f V (v) v − εe .
(0)
The second boundary condition simplifies to f V (εr )(εr − εe ) = (0) f V (vth )(vth − εe ). The equation limτe →0 (6.13) ∗τe can be rewritten as νµ 2 d
e A (0) (0) f V (v)(v − εe ) . f V (v) h(v) − µG = 2τm dv (0)
So we now have an expression for f V (v)h(v), which we can substitute into equation 6.12. However, we know the term inside the brackets of equation 6.12 is the firing rate r∞ . The result is a first-order ODE with nonconstant coefficients: (0) d f V (v) (εe − v)(Y + µG ) − (v − εr ) (εe − v)2 Y (0) − + f V (v) = r∞ dv τm τm (E.1) νe µ A2 . We can solve this by multiplying by an integrating factor 2τm εe −εr η(v) e , where η(v) = log(εe − v) Y+µYG +1 + Y(ε . The solution is e −v)
where Y
=
(0)
f V (v) =
−τm r∞ Y
v
exp η(v) − η(v ) (εe − v )2
εr
dv + K exp − η(v) .
(E.2)
In equation E.2, the constant K can be solved for by using the boundary con(0) (0) dition f V (εr )(εr − εe ) = f V (vth )(vth − εe ). The firing rate r∞ can be solved (0) for by integrating equation E.2 from εr to vth and using the fact that f V (v) integrates to 1:
K =
vth
e η(v )−η(vth ) dv (εe − v )2 εr εe − vth −η(vth ) e −η(εr ) − e εe − εr
εe − vth εe − εr
2082
C. Ly and D. Tranchina
and r∞ =
−Y τm
vth
εr
−1 exp(η(vth ) − η(v )) dv + K exp(−η(v )) . th (εe − v )2
(E.3)
For any input rate νe , a solution limit τe → 0 problem always exists. Appendix F: Steady-State Moment Closure at k = 2 with Refractory Period τ ref When a refractory period is included in the I&F model, the voltage is no longer instantaneously reset after the firing of an action potential. Rather, the neuron goes into a refractory state for a specified time τref , during which the excitatory conductance variable evolves according to equation 2.2. After this period τref has elapsed, the voltage is reset to vreset = εr . We first discuss the equations for the refractory density for the full problem (the nonrefractory density ρ is similar to before). We then derive the moment closure equations at k = 2 for the refractory density and study the steady-state equations. Let ψ(s, g, t) denote the probability density function for the refractory pool of neurons. The variables g (conductance) and t (time) have the same meaning as they do for ρ, and s is the time since entering refractory state (0 ≤ s ≤ τref ). By conservation of probability,
vth ∞
εr
ρ(v, g, t) dg dv +
0
0
τref ∞
ψ(s, g, t) dg ds = 1.
0
The evolution equation for ψ(s, g, t) is ∂ψ ∂ s ∂ = − J S (s, g, t) − J (s, g, t), ∂t ∂s ∂g G
(F.1)
where J S (s, g, t) = 1 · ψ(s, g, t) J Gs (s, g, t) = −
gψ(s, g, t) + νe (t) τe
g
F˜ A τe (g − g ) ψ(s, g , t) dg .
0
The boundary conditions are now J V (vth , g, t) = J S (0, g, t),
(F.2)
J S (τref , g, t) = J V (εr , g, t),
(F.3)
Moment Closure in Population Density Theory
2083
and J G (v, 0, t) = lim J G (v, g, t) = 0, g→∞
J Gs (s, 0, t) = lim J Gs (s, g, t) g→∞
= 0.
(F.4)
∞ (k) (k) and J S (s, t) ≡ Let us define f S (s, t) = 0 g k ψ(s, g, t) dg ∞ k 0 g J S (s, g, t) dg. If we perform the same type of moment closure at k = 2 manipulation as above, the resulting equations for the density of refractory pool of neurons are (0)
(0)
∂ fS ∂f =− S , ∂t ∂s
(F.5)
∂ fS ∂f 1 (1) (0) f S − νe (t)µ A f S , =− S − ∂t ∂s τe
(F.6)
(1)
(1)
with boundary conditions: (0)
(0)
J S (0, t) = J V (vth , t), (0)
⇒ f S (0, t) = r (t); (0)
(0)
(0)
(0)
(F.7a)
J V (εr , t) = J S (τref , t), ⇒ J V (εr , t) = f S (τref , t); (1)
(1)
(1)
(1)
(F.7b)
J S (0, t) = J V (vth , t), ⇒ f S (0, t) = J V (vth , t); (1)
(1)
(1)
(1)
(F.8a)
J V (εr , t) = J S (τref , t), ⇒ J V (εr , t) = f S (τref , t).
(F.8b)
The equations for the nonrefractory density ρ are similar to equations 6.2 to 6.9. Equation F.5 with boundary condition F.7a can be solved by the method (0) of characteristics: f S (s, t) = r (t − s). Equation F.6 can also be solved by the (1) method of characteristics to give an ODE for f S : (1) ∂ fS 1 (1) f S − νe (t − τref + s)µ A r (t − τref ) . =− ∂s τe
2084
C. Ly and D. Tranchina
Let us consider the steady-state problem for r∞ > 0. Equation F.5 with
(0)
∂ fS ∂t
(0)
= 0 tells us that f S is constant (in s) in the steady state, and the first boundary condition, F.7a, implies (0)
r∞ = f S . Notice that the boundary conditions F.7 are automatically satisfied. In the steady state, equation F.6 becomes ∂ fS 1 (1) f S − µG r ∞ , =− ∂s τe (1)
which has solution
(1) f S (s)
= r ∞ µG +
(1) f S (0) s − µG exp − . r∞ τe
(1)
The value f S (0) is given by equation F.8a. Summary of equations: τe εe − εr q 1+ − µG − h min (v) dq τm τm εe − v = −q , dv τe (εe − v) q 2 − σG2
(F.9)
with boundary condition: q b 2 (εr )
q b 2 (vth )
=
+ µG −
q b 2 (vth )
1 − exp −
τref τe
,
(F.10)
where q
b 2 (v) =
q 2 (v) + h min (v)q (v) + σG2 . q (v) (0)
Once a q (v) that satisfies the boundary condition is computed, f V (v) is com(0) τm r ∞ puted from the formula f V (v) = q (v)(ε . The firing rate r∞ is determined e −v) vth (0) τref (0) by the constraint that εr f V (v) dv + 0 f S (s)ds = 1, which gives r∞
vth
εr
τm dv + τref = 1. q (v)(εe − v)
Moment Closure in Population Density Theory
2085
Notice if τref = 0 in equation F.10, we simply recover our boundary condition q q without a refractory period: b 2 (εr ) = b 2 (vth ). Introducing a refractory period (0) τref does not alter the dynamics of the q (v) (and hence f V ) in the steady state; it changes only the boundary condition to equation F.10. Appendix G: Steady-State Equations for Moment Closure at k = 3 We define h(v) ≡ µG|V (v), q (v) = h(v) − h min (v) (as for moment closure at k = 2) and a new function, w(v) ≡ µG 2 |V (v). Now set equations 7.1 to 7.3 (0) equal to 0. Equation 7.1 will give us an equation for f V (v) in terms of h(v) (the same as equation 6.24). Substituting this equation into the next two equations (from equations 7.2 and 7.3), gives a system of ODEs (in v) for h(v) and w(v), which we write in terms of q (v) and w(v). The steady-state equations for the unconditioned moments of G are obtained by setting equations 7.4 to 7.6 equal to 0 to get µG = νe µ A νe µ A2 µG 2 = + (νe µ A)2 2τe µG 3 =
νe µ A3 3(νe )2 µ Aµ A2 + + (νe µ A)3 . 3(τe )2 2τe
(G.1)
The ODEs in terms of q and w are dq dw − (v − εe )2 q (v) (v − εe )2 w(v) − (v − εr )2 dv dv τm q (v)
= (εr − εe ) q (v) + 2h min (v) q (v) + v − εr τe + v − εe · µG − q (v)
(G.2)
νe µ A3 dq 3 v − εe 2h min (v) w(v) + 3q 2 (v) − 2h 3min (v) + + 4q (v) 2 3(τe ) dv dw −q (v) 3 v − εe q (v) − 2 v − εr = dv 2 εe − εr
− 6q (v) q (v) + h min (v) + 2w(v)q (v) εe − v νµ 2 2q (v)τm e A µG q (v) + h min (v) + −w . + τe 2τe (G.3)
2086
C. Ly and D. Tranchina
The boundary conditions are B1 (εr ) = B1 (vth )
(G.4)
B2 (εr ) = B2 (vth ),
(G.5)
where B1 (v) =
−(v − εr )[(εe − v)q (v) + (v − εr )] + (εe − v)2 w(v) (εe − v)2 q (v)
B2 (v) =
νe µ A3 3 + 3w(v)(q (v) + h (v)) − 2(q (v) + h (v)) −(v − εr )w(v) + (εe − v) min min 3(τe )2 . (εe − v)q (v)
Appendix H: Handy Approximations Relating µ A and µEPSP The mean excitatory postsynaptic potential amplitude at peak time (µEPSP ) was computed numerically with a specified µ A. We will discuss two very good analytic approximations that allows one to solve the moment closure equations without computing µEPSP or µ A numerically. Let us assume the arrival time of the excitatory synaptic event is at t = 0. H.1 Approximation for µEPSP < 1 mV. The I&F neuron equation for voltage is (see equation 2.1): dv 1 = − {(v − εr ) + ge (t) · (v − εe )}. dt τm Since we are considering what happens on receiving one excitatory conductance event, ge (t) is completely determined: ge (t) =
µA t . exp − τe τe
For small enough µ A, v is close enough to εr , so that ge (t) · (v − εe ) is well approximated by ge (t) · (εr − εe ). The resulting equation is: dv εr µA t εr − εe v = − exp − . + dt τm τm τe τe τm
Moment Closure in Population Density Theory
2087
Since v(0) = εr , the solution is: v(t) = εr +
t µ A(εr − εe ) − t e τm − e − τe . τe − τm
The peak voltage time is t ∗ = τm τe
log(τm /τe ) , so on average: τm − τe
µEPSP = v(t ∗ ) − v(0) τe τm ε −ε τ τe τm − τe e r e τ − τ m e − = µA · . τm τm τm − τe
(H.1)
H.2 Approximation for τ e τm . When τe is small compared to the membrane time constant (τm ), the excitatory conductance time course is well approximated by a delta function: ge (t) = µ A δ(t). Equation 2.1 becomes: dv 1
(v − εr ) + µ A δ(t)(v − εe ) =− dt τm 1 dv 1 v − εr ⇒ + µ A δ(t) . =− v − εe dt τm v − εe The jump in voltage is instantaneous, so the equation must be integrated on an infinitesimally small interval 0− to 0+ . The first time on the right-hand side goes to 0, and since v(0− ) = εr , the final result is µA µEPSP = v(0+ ) − εr = (εe − εr ) 1 − exp − . τm
(H.2)
Appendix I: Distribution for Random Variable A The distribution of random jumps in conductance (dimensionless) is given by A/τe . The distribution function chosen for A is parabolic with compact support: f A(x) =
−3 x(x 4(µ A)3
0,
− 2µ A), if 0 ≤ x ≤ 2µ A; otherwise.
The probability density function for the random variable A was chosen for numerical and analytical convenience. Its particular form does not play an important role. The compact support of f A(x) results in banded matrices
2088
C. Ly and D. Tranchina
when discretizing the operators in the full 2D PDM (Apfaltrer et al., 2006). The moments are easily computed: µ A2 =
6 (µ A)2 , 5
σ A2 =
1 (µ A)2 , 5
(I.1)
and in general we have: µ An = (µ A)n
3 · 2n+1 . (n + 2)(n + 3)
(I.2)
Acknowledgments We are grateful to Bruce Knight for his advice, particularly for his suggestions on applications of phase-plane analysis. We thank Charles Peskin for helpful discussions and his insight on the necessary condition for the existence of solutions. We thank David Cai and Larry Sirovich for introducing us to moment closure methods in statistical physics and to David Cai for teaching us about the maximum entropy method in particular. We are grateful to David McLaughlin, David Cai, and Aaditya Rangan for detailed discussion of this article and its relationship to that of Cai et al. (2004, 2006). We thank Eric Shea-Brown and Brent Doiron for their careful reading of the manuscript and their detailed suggestions. This work was supported by NSF grant number BNS0090159. C.L. was supported by the Department of Defense Graduate Fellowship (NDSEG). References Abbott, L. F., & van Vreeswijk, C. (1993). Asynchronous states in networks of pulsecoupled oscillators. Physical Review E, 48, 1483–1490. Amit, D., & Brunel, N. (1997). Model of global spontaneous activity and local structured activity during delay periods in the cerebral cortex. Cerebral Cortex, 7, 237–252. Apfaltrer, F., Ly, C., & Tranchina, D. (2006). Population density methods for stochastic neurons with a 2-D state space: Application to neural networks with realistic synaptic kinetics. Network: Computation in Neural Systems, 17, 373–418. Bardos, C., Galose, F., & Levermore, D. (1991). Fluid dynamic limit of kinetic equations 1: Formal derivations. Journal of Statistical Physics, 63, 323–344. Bardos, C., Galose, F., & Levermore, D. (1993). Fluid dynamic limit of kinetic equations 2: Convergence proofs for the Boltzmann equation. Communication on Pure and Applied Math, 46, 667–753. Biella, G., Uva, L., & Curtis, M. D. (2001). Network activity evoked by neocortical stimulation in area 36 of the guinea pig perirhinal cortex. Journal of Neurophysiology, 86, 164–172.
Moment Closure in Population Density Theory
2089
Brown, E., Moehlis, J., & Holmes, P. (2004). On the phase reduction and response dynamics of neural oscillator populations. Neural Computation, 16, 673– 715. Brunel, N. (2000). Dynamics of sparsely connected networks of excitatory and inhibitory spiking neurons. Journal of Computational Neuroscience, 8, 183–208. Brunel, N., Chance, F., Fourcaud, N., & Abbott, L. (2001). Effects of synaptic noise and filtering on the frequency response of spiking neurons. Physical Review Letter, 86, 2186–2189. Brunel, N., & Hakim, V. (1999). Fast global oscillations in networks of integrate-andfire neurons with low firing rates. Neural Computation, 11, 1621–1671. Cai, D., Tao, L., Rangan, A., & McLaughlin, D. (2006). Maximum-entropy closures for kinetic theories of neuronal network dynamics. Communications in Mathematical Sciences, 4, 97–127. Cai, D., Tao, L., Shelley, M., & McLaughlin, D. (2004). An effective kinetic representation of fluctuation-driven neuronal networks with application to simple and complex cells in visual cortex. Proceedings of the National Academy of Science, 101, 7757–7762. Casti, A. R., Omurtag, A., Sornborger, A., Kaplan, E., Knight, B., Sirovich, L., & Victor, J. (2002). A population study of integrate-and-fire-or-burst neurons. Neural Computation, 14, 957–986. Chapman, S. I., & Cowling, T. (1970). The mathematical theory of non-uniform gases. Cambridge: Cambridge University Press. Doiron, B., Rinzel, J., & Reyes, A. (2006). Stochastic synchronization in finite size spiking networks. Physical Review E, 74, 030903. Dreyer, W., Herrmann, M., & Kunik, M. (2004). Kinetic solutions of the BoltzmannPeierls equation and its moment systems. Continuum Mechanics and Thermodynamics, 16, 453–469. Dreyer, W., Junk, M., & Kunik, M. (2001). On the approximation of the Fokker-Planck equation by moment systems. Institute of Physics Publishing, 14, 881–906. Dyan, P., & Abbott, L. (2001). Theoretical neuroscience: Computational and mathematical modeling of neural systems. Cambridge, MA: MIT Press. Edmonds, B., Bigg, A. J., & Colquhoun, D. (1995). Mechanisms of activation of glutamate receptors and the time course of excitatory synaptic Currents. Annual Review of Physiology, 57, 495–519. Fourcaud, N., & Brunel, N. (2002). Dynamics of the firing probability of noisy integrate-and-fire neurons. Neural Computation, 14, 2057–2110. Fourcard-Trocme, N., Hansel, D., van Vreeswijk, C., & Brunel, N. (2003). How spike generation mechanisms determine the neuronal response to fluctuating inputs. Journal of Neuroscience, 23, 11628–11640. Fricker, D., & Miles, R. (2000). EPSP amplification and the precision of spike timing in hippocampal neurons. Neuron, 28, 559–569. Gerstner, W. (1995). Time structure of the activity in neural network models. Phys. Rev. E, 51, 738–758. Gerstner, W. (1999). Population dynamics of spiking neurons: Fast transients, asynchronous states, and locking. Neural Computation, 12, 43–90. Hansel, D., Mato, G., Meunier, C., & Neltner, L. (1998). On numerical simulations of integrate-and-fire neural networks. Neural Computation, 10, 467–483.
2090
C. Ly and D. Tranchina
Haskell, E., Nykamp, D. Q., & Tranchina, D. (2001). Population density methods for large-scale modeling of neuronal networks with realistic synaptic kinetics: Cutting the dimension down to size. Network: Compututation in Neural Systems, 12, 141–174. Huertas, M. A., & Smith, G. D. (2006). A multivariate population density model of the dLGN/PGN relay. Journal of Computational Neuroscience, 21(2), 171–189. Izhikevich, E. (2004). Which model to use for cortical spiking neurons? IEEE Transactions on Neural Networks, 15, 1063–1070. Jolivet, R., Lewis, T., & Gerstner, W. (2004). Generalized integrate-and-fire models of neuronal activity approximate spike trains of a detailed model to a high degree of accuracy. Journal of Neurophysiology, 92, 952–976. Junk, M., & Unterreiter, A. (2002). Maximum entropy moment systems and Galilean invariance. Continuum Mechanics and Thermodynamics, 14, 536–576. Kitano, K., Okamoto, H., & Fukai, T. (2003). Time representing cortical activities: Two models inspired by prefrontal persistent activity. Biological Cybernetics, 88, 387–394. Knight, B. W. (1972). Dynamics of encoding in a population of neurons. Journal of General Physiology, 59, 734–766. Knight, B. W. (2000). Dynamics of encoding in neuron populations: Some general mathematical features. Neural Computation, 12, 473–518. Knight, B. W., Manin, D., & Sirovich, L. (1996). Dynamical models of interacting neuron populations. In E. C. Gerf (Ed.), Symposium on Robotics and Cybernetics: Computational Engineering in Systems Applications. Lille, France: Cite Scientifique. Knight, B. W., Omurtag, A., & Sirovich, L. (2000). The approach of a neuron population firing rate to a new equilibrium: An exact theoretical result. Neural Computation, 12, 1045–1055. Krukowski, A. E., & Miller, K. D. (2001). Thalamocortical NMDA conductances and intracortical inhibition can explain cortical temporal tuning. Nature Neuroscience, 4, 424–430. Kuramoto, Y. (1991). Collective synchronization of pulse-coupled oscillators and excitable units. Physica D, 50, 15–30. Mattia, M., & Giudice, P. D. (2002). Population dynamics of interacting spiking neurons. Physical Review E, 66, 1–19. McLaughlin, D., Shapley, R., & Shelley, M. (2003). Large-scale modeling of the primary visual cortex: Influence of cortical architecture upon neuronal response. Journal of Physiology–Paris, 97, 237–252. McLaughlin, D., Shapley, R., Shelley, M., & Wielaard, D. J. (2000). A neuronal network model of macaque primary visual cortex (V1): Orientation tuning and dynamics in the input layer 4Cα. Proceedings of National Academy of Science, 97, 8087– 8092. McNaughton, B. L., Barnes, C. A., & Anderson, P. (1981). Synaptic efficacy and EPSP summation in granule cells of rat fascia dentata studied in vitro. Journal of Neurophysiology, 46, 952–966. Miller, P., & Wang, X. (2006). Power-law neuronal fluctuations in a recurrent network model of parametric working memory. Journal of Neurophysiology, 95, 1099–1114. Moreno-Bote, R., & Parga, N. (2004). Role of synaptic filtering on the firing response of simple model neurons. Physical Review Letters, 92, 028102.
Moment Closure in Population Density Theory
2091
Moreno-Bote, R., & Parga, N. (2005). Membrane potential and response properties of populations of cortical neurons in the high conductance state. Physical Review Letters, 94, 088103. Nykamp, D. Q., & Tranchina, D. (2000). A population density approach that facilitates large-scale modeling of neural networks: Analysis and an application to orientation tuning. Journal of Computational Neuroscience, 8, 19–50. Nykamp, D. Q., & Tranchina, D. (2001). A population density approach that facilitates large-scale modeling of neural networks: Extension to slow inhibitory synapses. Neural Computation, 13, 511–546. Omurtag, A., Knight, B. W., & Sirovich, L. (2000). On the simulation of large populations of neurons. Journal of Computational Neuroscience, 8, 51–63. Rangan, A., & Cai, D. (2006). Maximum-entropy closures for kinetic theories of neuronal network dynamics. Physical Review Letters, 96, 178101. Rangan, A., & Cai, D. (2007). Fast numerical methods for simulating large-scale integrate-and-fire neuronal networks. Journal of Computational Neuroscience, 22, 81–100. Rangan, A., Cai, D., & McLaughlin, D. (2005). Modeling the spatiotemporal cortical activity associated with the line-motion illusion in primary visual cortex. Proceedings of the National Academy of Science, 102, 18793–18800. Shadlen, M. N., & Newsome, W. T. (1998). The variable discharge of cortical neurons: Implications for connectivity, computation, and information coding. Journal of Neuroscience, 18, 3870–3896. Silberberg, G., Wu, C., & Markram, H. (2004). Synaptic dynamics control the timing of neuronal excitation in the activated neocortical microcircuit. Journal of Physiology, 556, 19–27. Sirovich, L. (2003). Dynamics of neuronal populations: Eigenfunction theory: Some solvable cases. Network: Computational Neural Systems, 14, 249–272. Sirovich, L., Omurtag, A. & Knight, B., (2000). Dynamics of neuronal populations: The equilibrium solution. SIAM Journal of Applied Mathematics, 60, 2009–2028. Somers, D. C., Nelson, S. B., & Sur, M. (1995). An emergent model of orientation selectivity in cat visual cortical simple cells. Journal of Neuroscience, 15, 5448–5465. Teramae, J., & Fukai, T. (2005). A cellular mechanism for graded persistent activity in a model neuron and its implication on working memory. Journal of Computational Neuroscience, 18, 105–121. Thompson, A. M., Girdlestone, D., & West, D. C. (1988). Voltage-dependent currents prolong single-axon postsynaptic potentials in layer 3 pyramidal neurons in rate neocortical slices. Journal of Neurophysiology, 60, 1896–1907. Treves, A. (1993). Mean-field analysis of neuronal spike dynamics. Network: Compputation in Neural Systems, 4, 259–284. Tuckwell, H. C. (1988). Introduction to theoretical neurobiology, Vol. 1: Linear cable theory and dendritic structure and stochastic theories. Cambridge: Cambridge University Press. Ursino, M., & La Cara, G. (2005). Dependence of visual cell properties on intracortical synapses among hypercolumns: Analysis by a computer model. Journal of Computational Neuroscience, 19, 291–310. Wang, X. J. (1999). Synaptic basis of cortical persistent activity: The importance of NMDA receptors to working memory. Journal of Neuroscience, 19, 9587–9603.
2092
C. Ly and D. Tranchina
Whittington, M. A., Traub, R., Kopell, N., Ermentrout B., & Buhl, E. H. (2000). Inhibition-based rhythms: Experimental and mathematical observations on network dynamics. International Journal of Psychophysiology, 38, 315–336. Wilbur, W., & Rinzel, J. (1983). A theoretical basis for large coefficient of variation and bimodality in neuronal interspike interval distributions. Journal of Theoretical Biology, 105, 345–368. Wilson, H. R., & Cowan, J. D. (1972). Excitatory and inhibitory interactions in localized populations of model neurons. Biophysical Journal, 12, 1–24. Wilson, H. R., & Cowan, J. D. (1973). A mathematical theory of the functional dynamics of cortical and thalamic nervous tissue. Biological Cybernetics, 13, 55–80. Yoshimura, Y., Sato, H., Imamura, K., & Watanabe, Y. (2000). Properties of horizontal and vertical inputs to pyramidal cells in the superficial layers of the cat visual cortex. Journal of Neuroscience, 20, 1931–1940.
Received March 30, 2006; accepted August 2, 2006.
LETTER
Communicated by DeLiang Wang
Oscillatory Neural Networks with Self-Organized Segmentation of Overlapping Patterns Thomas Burwick [email protected] ¨ Neuroinformatik, Ruhr-Universit¨at Bochum, 44306 Bochum, Germany Institut fur
Temporal coding is considered with an oscillatory network model that generalizes the Cohen-Grossberg-Hopfield model. It is assumed that the frequency of oscillating units increases with stronger and more coherent input. We refer to this mechanism as acceleration. In the context of Hebbian memory, synchronization and acceleration take complementary roles, and their combined effect on the storage of patterns is profound. Acceleration implies the desynchronization that is needed for self-organized segmention of two overlapping patterns. The superposition problem is thereby solved even without including competition couplings. With respect to brain dynamics, we point to analogies with oscillation spindles in the gamma range and responses to perceptual rivalries.
1 Introduction The segmentation of a pattern that is overlapping with other patterns, each of them memorized according to the Hebbian rule, is a classical challenge and a severe problem of neural network applications. The origin of the problem is easy to understand. Assume that the pattern has been retrieved and is activated. A sufficiently large overlap of this pattern with other patterns may then imply a large input to these patterns. As a consequence, the overlapping patterns may also get activated, and the unambigous segmentation of the first pattern becomes impossible unless some additional information is provided. In this letter, we consider this problem in the context of temporal coding. The additional information is then given by the temporal structure of the active neural units. This should permit distinguishing the separate patterns with respect to coherence. (For an introduction to theoretical aspects related to the memory of overlapping patterns, see Hertz, Krogh, & Palmer, 1991. For an introduction to temporal coding, see von der Malsburg, 1999. For reviews of its growing biological evidence, see Engel & Singer, 2001; Engel, Fries, & Singer, 2001. And for references to models of temporal coding based on oscillatory networks, see Burwick, 2005, 2006.) Neural Computation 19, 2093–2123 (2007)
C 2007 Massachusetts Institute of Technology
2094
T. Burwick
In this letter, we present a new approach of dealing with this problem. The approach is a consequence of including an effect that we denote as acceleration. In contrast to the proposals described in Burwick (2005), the approach of this letter does not need competition couplings. Acceleration means that the frequency of oscillating neural units in a network increases with stronger and more coherent input. This is obviously an effect that may be in agreement with fundamental assumptions about neural behavior. It is interesting to see that it also arises naturally from a formal point of view. The consequences, however, may be surprising with respect to the storage of overlapping patterns. Synchronization and acceleration take complementary roles where acceleration introduces a desynchronization that implies self-organized segmentation of overlapping patterns. Given the problem that was described above, such a property may be an essential feature, and it is certainly worthwhile to study its consequences. The relevance of temporal coding for information processing results from its relevance for the formation of assemblies. These are distributed representations of objects, where different parts of the network correspond to different features of the object (Hebb, 1949). Temporal coding proposes that the formation of these assemblies is based on temporal correlation of the corresponding parts of the network (von der Malsburg, 1981); see also (von der Malsburg, 1986). The challenge of linking the different features is also referred to as the binding problem. The presence of the binding problem is particularly obvious with some psychophysical phenomena where the brain fails to solve it. (In that respect, see the review on illusory conjunctions in Wolfe & Cave, 1999.) If perception is related to the storage and recall of patterns, as realized with Hebbian memory, then the problem of overlapping patterns will appear whenever different patterns share some features. A particular class of temporal coding applications is figure-ground segregation. As an example, see Shareef, Wang, and Yagel (1999) for an application to medical image segmention. Other applications are described in Wang, Freeman, Kozma, Lozowski, & Minai (2004). We implement temporal coding with an oscillatory network model that is obtained as complex-valued generalization of the Cohen-GrossbergHopfield (CGH) model (Burwick, 2005). In section 2, the model is described, and acceleration is introduced by allowing for a complex phase of the couplings. In section 3, the couplings are chosen according to Hebbian memory, and the dynamics is related to coherences of the patterns. In section 4, the complementary role of synchronization and acceleration becomes obvious from studying the example of two overlapping patterns. In section 5, we comment on the possible inclusion of acceleration in the context of networks with nonsinusoidal couplings, in particular, networks of relaxation oscillators. In section 6, we briefly comment on fast implementations. Section 7 contains the summary.
Self-Organized Segmentation of Patterns
2095
2 Oscillatory Neural Networks with Synchronization and Acceleration In section 2.1, the model that we use for the discussion in this letter is introduced. It may be fomulated as a complex-valued gradient system. In section 2.2, we discuss the role of complex-valued potential functions for oscillatory systems and their relation to minimization principles. In section 2.3, we give the potential function that implies the model. In section 2.4, this is then used to derive the exact forms of the couplings in terms of real coordinates. These will be given for all-order mode couplings. For the main arguments of this letter, however, it is sufficient to consider only the first-mode couplings that are specified in section 2.5. 2.1 The Model. The CGH model (Cohen & Grossberg, 1983; Hopfield, 1984) may be extended to an oscillatory network model that allows implementing features of temporal coding (Burwick, 2005); see also (Burwick, 2006). Given a network with N units, where each unit k is described in terms of the real-valued uk and phase θk , k = 1, . . . , N, the oscillatory system is of the form N duk 1 τ = (uk ) Ik − uk + wkl (u, θ )Vl , dt N
(2.1a)
l=1
1 dθk skl (u, θ )Vl , = τ ωk (u, θ ) + dt N l=1 N
τ
(2.1b)
synchronization terms where t is the time, τ is a timescale, Ik is an external input, and the amplitude Vk is related to uk via the activation function g, chosen as Vk = g(uk ) =
1 (1 + tanh(uk )). 2
(2.2)
In this letter, the frequency terms are 1 ωkl (u, θ )Vl , N l=1 N
ωk (u, θ ) = ω1,k + ω2,k Vk +
(2.3)
acceleration terms where the ω1,k are the eigenfrequencies of the oscillators, and the ω2,k parameterize the shear terms.
2096
T. Burwick
In the following sections, equation 2.1 with explicit expressions for the couplings w, s, and ω and the scaling factor will be derived from a complex-valued gradient system. In section 2.3, we make an ansatz for a complex-valued potential function L and postulate that the dynamics takes the form τ
dzk ∂L =− , dt ∂ z¯ k
(2.4)
where the complex coordinates are given by xk = zk2 = Vk exp(iθk ),
x¯ k = z¯ k2 = Vk exp(−iθk ).
(2.5)
Due to equation 2.2, the dynamics is restricted to the punctured unit disk, 0 < |xk | = zk z¯ k = Vk = g(uk ) < 1.
(2.6)
(Given the quadratic dependencies in equation 2.5, using coordinates zk , z¯ k corresponds to a formulation on the two-sheet Riemann surface.) Equation 2.1 generalizes the classical CGH model by introducing phase and amplitude dependencies of the couplings in equation 2.1a and phase dynamics in equation 2.1b. The sum in equation 2.1b describes the synchronization terms, while the sum in equation 2.3 gives what we denote as acceleration terms. We consider only excitatory couplings. It will then be obvious from the explicit form of the skl that the synchronization terms have a synchronizing effect on the phases. The explicit form of the ωkl will make obvious that the corresponding terms in equation 2.3 imply an increase in phase velocity of the units for stronger and more coherent input from the other units. We refer to this effect as acceleration. The system of equation 2.1 that results from equation 2.4 will agree with the model presented in Burwick (2005) except for the presence of the acceleration terms (and the related modification of the couplings w). It turns out that the inclusion of acceleration terms is not a minor modification. Below, when formulating equation 2.1 as a complex-valued gradient system in sections 2.3 and 2.4, we find that the acceleration terms result from a complex phase of the coupling parameters. In the next section, we argue that including such complex-valued couplings is a natural choice in the context of oscillatory systems. In fact, complex-valued couplings are well-known ingredients of several physical systems (e.g., see Kuramoto, 1984). Given the reasonable interpretation in the context of neural systems, it may be oversimplifying to neglect the effects related to acceleration. Our examples will demonstrate that these effects are indeed substantial.
Self-Organized Segmentation of Patterns
2097
2.2 Complex-Valued Gradient Systems and Their Relation to Minimization Principles. In section 2.1, we mentioned that equation 2.1 may be derived as the gradient system of equation 2.4. A crucial aspect of this letter is that we allow for a complex-valued potential function L, L = R + i ,
(2.7)
where R is the real and the imaginary part. The acceleration terms result from an imaginary part of the coupling parameters, while the discussion in Burwick (2005) was focused on real-valued couplings. Before going into the details of deriving equation 2.1 from equation 2.4, we add two comments on the imaginary part of L. In the first comment, we emphasize that such imaginary parts are natural for the most fundamental oscillator models. Consider a single oscillator with a complex-valued second-order potential function given by L0 = −
1 1 ω2 z¯z − (z¯z)2 + iτ ω1 z¯z + (z¯z)2 , 2 2 2
(2.8)
where ω1 , ω2 are real parameters and τ is the timescale. Assuming a gradient dynamics like the one in equation 2.4, one obtains τ
dz ∂L0 z =− = (1 − z¯z + iτ (ω1 + ω2 z¯z)). dt ∂ z¯ 2
(2.9)
With coordinates like the ones in equation 2.5 (without the index k), we may use dz z 1 dV dθ = +i (2.10) dt 2 V dt dt to arrive at the real coordinate version of equation 2.9: τ
dV = V(1 − V), dt
dθ = ω1 + ω2 V. dt
(2.11a) (2.11b)
This is then recognized as a basic oscillator model. The ω1 is the eigenfrequency and ω2 V is a shear term. The shear term implies a larger phase velocity for a larger amplitude. The oscillator amplitude approaches V = 1, and the frequency approaches ω = ω1 + ω2 . Thus, the frequency is determined by the imaginary part of L0 . It should therefore be seen as a natural step to allow for an imaginary part also in equation 2.7. Moreover, given
2098
T. Burwick
this simple example, it should not be surprising that the frequency dependency in equation 2.3 will result from such an imaginary part. The second comment is on the relation of the imaginary part to the minimization principle as it appears in the classical CGH model. Using ∂L ∂ ∂L ∂ d z¯ k + 2i = + 2i = −τ , ∂zk ∂zk ∂zk dt ∂zk
(2.12)
one obtains N d z¯ k ∂L dL dzk ∂L + = dt dt ∂zk dt ∂ z¯ k k=1
N dzk 2 dz ∂ i k = −2τ dt − τ dt ∂z . k
(2.13)
k=1
With ≡ 0, the system will have dL/dt ≤ 0, and local minima will be approached. With = 0, this may no longer be the case. The examples of section 3 will indeed show that the real part of L, that is, the analog of the CGH Lyapunov function, is not decreasing for all times if = 0. However, this should not be taken as a motivation to eliminate the imaginary part . After all, such terms are natural ingredients of oscillatory systems. In fact, we will observe that this deviation from the classical picture of a monotonically decreasing Lyapunov function has a well-justified functional relevance. This will be obvious when considering the examples in section 4. 2.3 The Model as a Complex-Valued Gradient System. The dynamical system of equation 2.1 is obtained from equation 2.4 by choosing L = P + W,
(2.14)
where P describes the oscillator model of the single units, 1 (zk z¯ k ln(zk z¯ k ) + (1 − zk z¯ k ) ln (1 − zk z¯ k )) 2 N
P=
k=1
−
N k=1
1 2 J 1,k zk z¯ k + J 2,k (zk z¯ k ) , 2
(2.15)
Self-Organized Segmentation of Patterns
2099
and W describes the couplings of the network, W =−
H N 1 h kl b µν (zk zk )µ (¯zl z¯ l )ν (2.16) a µν (zk z¯ k )µ (zl z¯ l )ν + 2N k,l=1
µ,ν=1
phase-couplings (Burwick, 2005). The h kl > 0 are the weights of the network and assumed to be symmetric. In section 3, they will be specified according to Hebbian memory. The a µν > 0 describe the coupling strengths of the different modes. For equation 2.16, it was assumed that these two dependencies factorize. For the coefficients in equation 2.15, we set J 1,k = Ik +
i τ ω1,k , 2
J 2,k =
i τ ω2,k , 2
(2.17)
and for the phase-coupling coefficient in equation 2.16, we set
b µν = b µν + ic µν .
(2.18)
The coefficients a µν , b µν , and c µν are assumed to be real and symmetric with h kl a µν = h lk a νµ , h kl b µν = h lk b νµ , and h kl c µν = h lk c νµ .
(2.19)
The couplings c µν imply the acceleration terms, as may be seen in the next section. 2.4 The Dynamics in Real Coordinates. Using the function L of section 2.3 in equation 2.4, the resulting dynamics may be formulated in terms of the real coordinates given by equation 2.5. These coordinates imply 1 dzk 1 = zk dt 2
1 d Vk dθk +i Vk dt dt
= (1 − g(uk ))
duk i dθk + , dt 2 dt
(2.20)
(2.21)
using dg(u)/du = 2g(u)(1 − g(u)),
(2.22)
and the derivative is given by 1 ∂ ∂ i ∂ = + . zk ∂ z¯ k ∂ Vk Vk ∂θk
(2.23)
2100
T. Burwick
We also need to apply the inverse g −1 of the activation given in equation 2.2: uk = g −1 (Vk ) =
Vk 1 ln . 2 1 − Vk
(2.24)
Equations 2.4 and 2.14 to 2.19 then imply equation 2.1, where the scaling factor is (uk ) =
1 1 − Vk
(2.25)
(see Burwick, 2005, for a discussion of this factor), and the coupling terms are given by wkl (u, θ ) = h kl
H
µ−1
µ Wµν (νθl − µθk ) Vk
Vlν−1 ,
(2.26)
µ,ν=1
skl (u, θ ) = h kl
H
µ−1
2µ Sµν (νθl − µθk ) Vk
Vlν−1 ,
(2.27)
Vlν−1 ,
(2.28)
µ,ν=1
ωkl (u, θ ) = h kl
H
µ−1
2µ Cµν (νθl − µθk ) Vk
µ,ν=1
with Wµν (θ ) = a µν + b µν cos (θ ) − c µν sin (θ ) , Sµν (θ ) = b µν sin (θ ) , Cµν (θ ) = c µν cos (θ ) . For the sake of completeness, we have given this general dynamics. However, introducing acceleration and its effect does not require the inclusion of higher-mode couplings. In the next section, we therefore give the simpler form that describes the dynamics with first-mode couplings only. 2.5 The Dynamics with First-Mode Couplings Only. We set a 11 = a
and
1
b 11 = (σ + iτ ω3 ) , 2
(2.29)
where a > 0, σ > 0, and ω3 > 0 are real and assume that all other highermode couplings vanish. In terms of real coordinates and using only the
Self-Organized Segmentation of Patterns
2101
first-mode couplings, the real and imaginary parts of L = R + i are then given by 1 (Vk ln Vk + (1 − Vk ) ln (1 − Vk ) − 2I1,k Vk ) 2 N
R=
k=1
−
N
1 σ h kl a + cos(θk − θl ) Vk Vl , 2N 2
(2.30)
k,l=1
=−
N N τ τ ω3 1 ω1,k Vk + ω2,k Vk2 − h kl cos(θk − θl )Vk Vl . 2 2 4N k=1
(2.31)
k,l=1
With the examples of section 4, we demonstrate how the dynamics of the real part relates to the retrieval of the patterns. Using the parameter choices of equation 2.29, equation 2.1 reduces to the first-mode coupling part, where the couplings have no amplitude dependency: N duk 1 τ wkl (θl − θk )Vl , = (uk ) Ik − uk + dt N
(2.32a)
l=1
1 dθk skl (θl − θk )Vl , = τ ωk (u, θ ) + dt N l=1 N
τ
(2.32b)
synchronization terms with 1 ωkl (θl − θk )Vl , N l=1 N
ωk (u, θ ) = ω1,k + ω2,k Vk +
(2.33)
acceleration terms and
σ τ ω3 wkl (θ ) = h kl a + cos (θ ) − sin (θ ) , 2 2 skl (θ ) = h kl σ sin (θ ) , ωkl (θ ) = h kl ω3 cos (θ ) .
(2.34) (2.35) (2.36)
For introducing acceleration and its consequences, it is sufficient to consider only these first-mode couplings. Therefore, the simple form of equations 2.32 will be used for the remainder of this letter.
2102
T. Burwick
3 Hebbian Memory and Pattern Coherences So far, we have not specified the couplings h kl . In section 3.1, these are defined in terms of patterns that are stored according to Hebbian memory. In section 3.2, the coherence measures for patterns are introduced. These serve to describe the dynamics, in particular the ωk of equations 2.33, in terms of activities, coherences, and phases of the patterns. In section 3.3, we begin to discuss the complementary role of synchronization and acceleration, which will become more apparent with the examples in the following section. p
3.1 Hebbian memory. Consider the storage of P patterns ξk , with p = 1, . . . , P and k = 1, . . . , N. In this letter, it will be sufficient to assume that p ξk ∈ {0, 1}, where 1 (0) corresponds to an on-state (off-state). We refer to p the units k with ξk = 1 as the units of pattern p. In equation 2.1, Hebbian memory may be used that is defined by h kl =
P
p p
λ p ξk ξl ,
(3.1)
p=1
with λ p > 0. The λ p give the weights for patterns p. Let us also introduce p = ρpλp,
(3.2)
where ρp =
Np , N
(3.3)
and Np is the number of nonvanishing units of pattern p. The p may be interpreted as the weight of pattern p that also includes the pattern density ρ p , that is, with equal λ p , a larger pattern contributes more than a smaller pattern. 3.2 Dynamics in Terms of Activities, Coherences, and Phases of Patterns. The collective dynamics of the network may be described in terms of activity R, coherence C, and phase of each pattern, given by Rp =
N 1 p ξk Vk , Np
(3.4)
N 1 p ξk Vk exp(iθk ), Np
(3.5)
k=1
Z p = C p exp(i p ) =
k=1
Self-Organized Segmentation of Patterns
2103
for p = 1, . . . , P. Let us define Apk (u, θ ) = C p cos p − θk ,
(3.6)
Spk (u, θ ) = C p sin p − θk ,
(3.7)
where Apk < 1, Spk < 1. For example, the synchronizing terms in equation 2.32b take the form σ σ p p h kl sin(θl − θk )Vl = Im λ p ξk ξl exp (i(θl − θk )) Vl N N N
P
l=1
N
p=1 l=1
= σ Im
P
p ξk λ p
p=1
=σ
P
N ρp p ξl Vl e iθl e −iθk , Np l=1
p
ξk p Skp (u, θ ).
(3.8)
p=1
Steps analogous to that of equation 3.8 may be used to reformulate the other coupling terms of equations 2.32. This leads to an alternative form of equations 2.32 that relates the dynamics to pattern activities, coherences, and phases: duk σ = Ik − uk + a Qk (u) + Ak (u, θ ), dt 2
(3.9a)
dθk = τ ωk (u, θ ) + σ Sk (u, θ ), dt
(3.9b)
ωk (u, θ ) = ω1,k + ω2,k Vk + ω3 Ak (u, θ ),
(3.10)
τ (1 − Vk ) τ where
and Qk (u) =
P
p
(3.11)
p
(3.12)
p
(3.13)
ξk p R p (u),
p=1
Ak (u, θ ) =
P
ξk p Apk (u, θ ),
p=1
Sk (u, θ ) =
P p=1
ξk p Spk (u, θ ).
2104
T. Burwick
Equation 3.11 gives the classical (i.e., phase-independent) contribution, while equations 3.12 and 3.13 describe, respectively, acceleration and synp chronization terms. Due to the factor ξk in these definitions of Qk , Ak , and Sk , every unit k couples only to patterns p that are active at unit k. The tendency to synchronize two units k and l has to overcome any desynchronizing tendency that arises with different value of ωk and ωl . The synchronizing couplings in equation 3.9b are given by the σ -dependent terms. According to equations 3.13 and 3.7, these σ -dependent terms force each unit k to synchronize with the phases p of patterns that are nonzero p at this unit, that is, patterns p with ξk = 1. These patterns compete for the influence on unit k with a strength that is given by the product of the weight λ p of the pattern, the pattern density ρ p , and its coherence C p . Thus, for example, with p = ρ p /P for every p, a less coherent or smaller pattern will have less influence on the unit, and a highly coherent and larger pattern will have a stronger influence. With p = 1, the influence of each pattern will depend only on its coherence C p . Notice also that the σ -dependent terms of equation 3.9a support the synchronization. With respect to equation 3.9a, the coupling contribution from pattern p is maximal for θk = p and minimal for θk = p + π. This should be expected for couplings that favor synchronization. The a -dependent terms in equation 3.9a describe the classical CGH couplings that lead to onand off-state amplitudes (together with the the phase-dependent modification of the couplings). 3.3 Avoiding Global Coherence: Complementarity of Synchronization and Acceleration. With the examples of section 4, we consider choices for the eigenfrequency and shear coefficients that obey ω1,k = ω1 ,
(3.14)
ω2,k = ω2 ,
(3.15)
for some ω1 , ω2 . This is done to study the desynchronizing effect of acceleration without confusing it with desynchronizing effects that result from distributed eigenfrequency or shear coefficients. In this section, we begin the discussion of what dynamics should be expected. In section 1, we described the superposition problems of overlapping patterns. This is not solved simply by introducing phases. Consider the situation without acceleration: ω3 = 0. Due to the synchronizing terms, every phase of a connected network component, given the choices of equations 3.14 and 3.15, will then approach the same value θ and arrive at a stable configuration of global coherence (this will be demonstrated with examples 1a and 2a in section 4). The superposition problem is present as for the classical case without phases.
Self-Organized Segmentation of Patterns
2105
The situation changes drastically, however, in the presence of acceleration: with ω3 > 0. Then global coherence may no longer be stable in case of overlapping patterns. A simple argument for this is the following. Consider the situation of global coherence. Using equations 3.14 and 3.15, and assuming that the units are on-state with a phase θ , one obtains Vk 1,
θk = for every k :
p dθk ξk p . ω1 + ω 2 + ω3 dt p
(3.16)
With ω3 > 0, it follows immediately from equation 3.16 that globalcoherp ence may not be stable in case of overlapping patterns, since the ξk p will not be the same for different units k that participate in different sets of patterns (this will be demonstrated with examples 1b and 2b in section 4). Different parts of the patterns with overlap receive different inputs from the different patterns and thereby are subject to different accelerations. Thus, different parts of the pattern will have different phase velocities that destroy the coherence. It is then obvious that synchronization and acceleration take complementary roles: synchronization tends to establish coherence of the network, but this will then imply acceleration effects that desynchronize overlapping patterns. The following examples of two overlapping patterns will show how this interplay of synchronization and acceleration may help to overcome the superposition problem that plagues Hebbian storage of overlapping patterns without acceleration. 4 Examples: Self-Organized Segmentation of Two Overlapping Patterns In this section, we demonstrate the combined effect of synchronization and acceleration terms by considering numerical examples for the case of two overlapping patterns. In section 4.1, we describe the network architectures, some parameters, the numerical approach, and initial conditions that are used. Then we compare the dynamics with and without the acceleration terms. In section 4.2, this is done for the case that one pattern is dominating, 1 > 2 . In section 4.3, neither of the patterns is dominating: 1 = 2 . In section 4.4, we comment on the corresponding behavior of the potential functions (Lyapunov functions in case of vanishing acceleration). In sections 4.5 and 4.6, network behavior is traced back to the dynamics of overlapping and nonoverlapping parts (“patches”) of the patterns. The observed dynamics corresponds to oscillation spindles. This is illustrated in section 4.7, where the effect of changing the shear parameter ω2 is demonstrated. 4.1 Network Architectures and Numerical Setup. We consider P = 2 patterns for networks with N = 36 (examples 1 and 2) and N = 30 units
2106
T. Burwick
A
B 21
9
14
6
p=1
p=2
8
p=1
14
p=2
C 10
10
p=1
10
p=2
Figure 1: The three architectures of the examples in section 4. (A) Example 1. The first of the two patterns has N1 = 30 nonvanishing units and the second N2 = 15. The patterns overlap at 9 units. (B, C) The analog architectures for examples 2 and 3. The example networks have (A, B) N = 36 and (C) N = 30 units.
(example 3) (see Figure 1). The patterns in the Hebbian couplings of equation 3.1 are equally weighted with λ p = 1/2 for p = 1, 2. Thus, p = ρ p /2. The coupling parameters a , σ , and ω3 will be specified for each example. Let us use equation 2.2 to define a shifted input Ik through Ik +
a a h kl g(ul ) = Ik + h kl tanh(ul ). N 2N N
N
l=1
l=1
(4.1)
The input is then chosen so that Ik = 0. This implies that uk = 0: Vk = 1/2 is a threshold for (decoherent) on- and off-states of unit k (see Burwick, 2005). We use identical eigenfrequencies and shear parameters as specified in equations 3.14 and 3.15. This ensures that any desynchronizing effect arises from the acceleration terms. With examples 1 and 2, we choose ω1 = ω2 = 0. The phase velocity will then be generated only by acceleration and synchronization. Only example 3 will use ω2 > 0. This example will then serve to demonstrate the effect of nonvanishing shear terms and the relation to oscillation spindles. The numerical study is performed with a real parameter 0 < ε 2 1, and modifying the effective timescale τ (1 − Vk ) of equation 2.1a with equation 2.25 by using this regularization parameter for replacement: τ (1 − Vk ) → τ (1 − Vk + ε 2 )
(4.2)
Self-Organized Segmentation of Patterns
2107
(see Burwick, 2005). The system of equation 2.1 is then discretized with the Euler approximation, where the time step is given by dt = ε 2 τ . The following examples are based on ε 2 = 0.001. The initial values are chosen so that initial amplitudes are small, while phases are equally distributed. Examples 1 and 2 use the same initial conditions (example 3 has different N). 4.2 Examples 1: With a Dominating Pattern. We study the network with the architecture displayed in Figure 1A. Due to our choice of p = ρ p /2 (see section 4.1), a comparison gives 1 = ρ1 /2 = 10/24 > 2 = ρ2 /2 = 5/24.
(4.3)
In consequence, one may expect that the first pattern is dominating the second. Coupling parameters are chosen as a /N = 2, σ = a . Since we are interested in the acceleration effect, we compare two examples. Example 1a uses ω3 = 0, and example 1b uses ω3 = πσ/(2τ ) > 0. For the resulting dynamics in terms of the pattern coherences, see the top diagrams of Figure 2. 4.2.1 Example 1a. In case of ω3 = 0, we find that both patterns become coherent, and due to the overlap, we may conclude that global coherence is reached (see the top diagram of Figure 2A). The individual patterns may no longer be distinguished by activity or coherence. This is the superposition problem for overlapping patterns. 4.2.2 Example 1b. In case of ω3 = πσ/(2τ ) > 0, the situation changes decisively. Now, only pattern 1 becomes coherent, while pattern 2 alternates between moments of coherence and periods of decoherence (see the top diagram of Figure 2B). The dominating pattern 1 has been segmented from the nonoverlapping part of pattern 2. No superposition problem arises. 4.3 Examples 2: Without a Dominating Pattern—Pattern Switching. Next, we consider the architecture of Figure 1B: N1 = N2 = 22 with 8 overlapping units. The pattern weights are now equal: 1 = 2 = ρ1 /2 = 11/36.
(4.4)
Neither of the patterns may be expected to dominate the other. Coupling parameters are a /N = 2, σ = a /2. Again, we compare the cases with and without acceleration (see Figure 3).
2108
T. Burwick
Cp
A
B
Coherence of Patterns
Coherence of Patterns
1
1
0.8
0.8
0.6
0.6
0.4
0.4
0.2
0.2
0 0
0.5
1
1.5
2
0 0
Re(L/N)
Potential function (real part) 5
0
0
-5
-5
0
0.5
1
1.5
1
1.5
2
Potential function (real part)
5
-10
0.5
2
-10
0
0.5
1
1.5
2
Figure 2: Examples 1a and 1b, with the network architecture of Figure 1A. (A) Example 1a: Without acceleration, ω3 = 0. (B) Example 1b: with acceleration, ω3 > 0. The top diagrams show the coherences for patterns p = 1 (solid line) and p = 2 (dotted line). The bottom diagrams give the real part of the potential function (i.e., Lyapunov function in case of ω3 = 0). See the discussion in sections 4.2 and 4.4.
4.3.1 Example 2a. In case of ω3 = 0, we find the superposition problem for overlapping patterns as with examples 1a (see the top diagram of Figure 3A). 4.3.2 Example 2b. In case of ω3 = πσ/τ > 0, the effect of the acceleration terms is again profound (see the top diagram of Figure 3B). The dynamics changed qualitatively and is now different from example 1b. Again, each pattern is retrieved coherently but not with a steady state. Instead, an alternating series of states is generated, where each state corresponds to the coherent retrieval of one individual pattern, while the other pattern approaches decoherence. We refer to the kind of behavior that is illustrated with Figure 3C as pattern switching. The superposition problem of
Figure 3: Examples 2a and 2b, with the network architecture of Figure 1B. (A) Example 2a: without acceleration, ω3 = 0. (B) Example 2b: with acceleration, ω3 > 0. The top diagrams show the coherences for patterns p = 1 (solid line) and p = 2 (dotted line). The bottom diagrams of A and B give the real part of the potential function (i.e., the Lyapunov function in case of ω3 = 0). (C) A magnified view of the coherences in the top diagram of B, with a time window from t = 1.5τ to t = 3.0τ. The dynamics shows pattern-switching behavior. See the discussion in sections 4.3 and 4.4.
The superposition problem of example 2a is no longer present. With pattern switching, there is no simultaneous coherent retrieval of the two patterns.

4.4 Nonclassical Gradient Systems. Considering examples 1 and 2, we may also compare the dynamics of the potential functions. We concentrate
this discussion on R, the real part of L. This allows us to compare the acceleration effect with the classical situation, where the dynamics minimizes R. Notice that for computing R, we use equation 2.30 with

(1 − Vk) ln(1 − Vk) → (1 − Vk + ε²) ln(1 − Vk + ε²)    (4.5)
in order to avoid numerical problems with the case Vk → 1. The parameter 0 < ε² ≪ 1 was introduced in section 4.1.

4.4.1 Examples 1a and 2a. Without acceleration, ω3 = 0, the potential function L is a real-valued Lyapunov function, L = R. The corresponding dynamics of R is displayed in the bottom diagrams of Figures 2A and 3A. We find that R is minimized, as expected. Although the model describes a phase dynamics, the behavior corresponds to the classical picture of the CGH model.

4.4.2 Examples 1b and 2b. With acceleration, ω3 > 0, the potential function has a nonvanishing imaginary part, L = R + i·Im(L) (see the discussion in section 2.2). The resulting dynamics of R is displayed in the bottom diagrams of Figures 2B and 3B. In contrast to the cases of vanishing acceleration, the real part of L is not decreasing for all time. Instead, it reaches a minimal level, and then its values oscillate around this level. With example 2b, this oscillation frequency is related to the switching between the patterns (see Figure 3B). This confirms that a deviation from the classical picture of a monotonically decreasing Lyapunov function may have functional relevance. With examples 1b and 2b, it is related to the segmentation of overlapping patterns.

4.5 Decomposing the Networks into Patches. In the next section, examples 1b and 2b are discussed based on the concept of patches (Burwick, 2005). In this section, we review this concept and introduce the quantities needed for the discussion. The networks of this section consist of three patches, referred to as 10, 11, and 01. These notations correspond to the binary values α = ξk¹ξk² of the pattern memberships. Thus, patch α = 10 collects the units k with ξk¹ = 1, ξk² = 0, and so forth. The nonoverlapping part of pattern p = 1 is patch α = 10, the nonoverlapping part of pattern p = 2 is patch α = 01, and the overlap is given by patch α = 11. Patch α = 00 (the “background”) is empty, since the patterns cover the complete network.
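As an illustration of this decomposition, the sketch below groups unit indices into the three patches from the binary membership vectors of the two patterns. The vector names are illustrative; only the bookkeeping is shown.

```python
import numpy as np

def patches_from_patterns(xi1, xi2):
    """Group unit indices into patches 10, 11, 01 (and 00).

    xi1, xi2: 0/1 membership arrays of patterns p = 1 and p = 2
    (illustrative names). Patch '10' holds units with xi1 = 1, xi2 = 0,
    and so forth; patch '00' is the background.
    """
    xi1, xi2 = xi1.astype(bool), xi2.astype(bool)
    return {
        "10": np.where(xi1 & ~xi2)[0],   # nonoverlapping part of pattern 1
        "11": np.where(xi1 & xi2)[0],    # overlap of both patterns
        "01": np.where(~xi1 & xi2)[0],   # nonoverlapping part of pattern 2
        "00": np.where(~xi1 & ~xi2)[0],  # background (empty in our examples)
    }
```

For the architecture of Figure 1B (N1 = N2 = 22 with 8 overlapping units), patch 11 would then contain the 8 overlapping indices and patch 00 none.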
The patch coherences C̃α and phases Ψ̃α are defined in analogy to the pattern coherences of equation 3.5. This leads to

Z10 = C̃10 exp(iΨ̃10) = (1/N10) Σk∈K10 xk,    (4.6)
Z11 = C̃11 exp(iΨ̃11) = (1/N11) Σk∈K11 xk,    (4.7)
Z01 = C̃01 exp(iΨ̃01) = (1/N01) Σk∈K01 xk,    (4.8)
where xk = Vk exp(iθk), Kα is the set of units k that constitute patch α, and Nα is the number of units that belong to this patch. These expressions are related to the pattern coherences and phases via

Z1 = C1 exp(iΨ1) = ρ1^10 Z10 + ρ1^11 Z11,    (4.9a)
Z2 = C2 exp(iΨ2) = ρ2^11 Z11 + ρ2^01 Z01,    (4.9b)

where ρ1^10 = N10/N1, and so forth. Analogous definitions may be given for the patch activities R̃10, R̃11, and R̃01.

4.6 Dynamics in Terms of Patches. In this section, we use the concept of patches to arrive at some qualitative understanding of the dynamics observed with examples 1b and 2b. For these examples, we assumed equations 3.14 and 3.15 for every k. This was done to restrict the origin of desynchronization to the acceleration terms in order to get a better view of their effect. Units may then have different ωk only due to different accelerations (see equation 2.33). In section 3.3, we argued that global coherence of a pattern may not be stable if the pattern overlaps with other patterns. The argument, however, does not exclude the coherence of units that participate in the same set of patterns, that is, units that belong to the same patch (see section 4.5). Indeed, in the following, we observe that the dynamics of examples 1b and 2b generate coherent patches. The dynamics of the complete patterns may then be understood with respect to phase differences between the patches. The observed decoherences of the patterns are related to mutual decoherence of coherent patches.

4.6.1 Example 1b. As illustrated in Figure 2B, the dominating pattern p = 1 becomes coherent, with small oscillations close to C1 ≈ 1, while pattern p = 2 alternates between states of coherence and decoherence. Despite the observed decoherence, it turns out that all three patches of the network reach
coherent states with C̃α ≈ 1 (see Figure 4). This implies that the observed decoherences in Figure 2B result from phase differences between the patch phases Ψ̃α. This is indeed the case (see Figure 5). Figure 5B illustrates that the two patches α = 10 and α = 11 of the dominating pattern p = 1 are close to synchronization. A careful inspection of the figure allows an understanding of the origin of the small oscillation at C1 ≈ 1.
Figure 4: Example 1b. (A) Activities R̃α and (B) coherences C̃α of the patches, with α = 10 (solid line), α = 01 (dotted line), and α = 11 (dash-dotted line). See the discussion in section 4.6.
The acceleration terms in equations 3.9b and 3.10, together with equations 3.12 and 3.6, are key to understanding this oscillation. Every patch is part of a set of patterns and receives acceleration from each of these patterns that is proportional to the pattern coherence (the Cp in equation 3.6) and increases with closer phase differences (the cos(Ψp − θk) in equation 3.6). Thus, consider first a situation where the patch phases Ψ̃10, Ψ̃11 are close to each other, that is, close to the pattern phase Ψ1, while the patch phase Ψ̃01 is close to Ψ1 + π. This implies minimal coherence C2. In consequence, the overlap will receive only minimal acceleration from pattern p = 2. Thus, both patches will be closest in phase velocity, and this makes maximal coherence C1 possible. This explains why maximal coherence C1 and minimal coherence C2 are correlated in Figure 5A. Correspondingly, we find maximal synchronization in Figure 5B and minimal synchronization in Figure 5C. Next, we study how such a state of maximal C1 and minimal C2 evolves. With our example, we use ω1 = ω2 = 0. Therefore, the phase velocities are generated only by acceleration and synchronization. Since patches α = 10, 11 are driven by pattern p = 1 with maximal C1, while patch α = 01 is driven only by the minimal C2, the former two patches will have a higher phase velocity than the latter patch. In consequence, the assumed situation, where Ψ̃01 is close to Ψ1 + π, changes, and the phase of the overlap α = 11 will approach the phase of the nonoverlapping part of pattern p = 2, patch α = 01, and therefore C2 increases. In consequence, the overlap is more strongly
Figure 5: Example 1b. (A) The pattern coherences Cp, with time window from t = 0.5τ to t = 2τ, for p = 1 (solid line) and p = 2 (dotted line). The dominating pattern p = 1 gets coherent. (B) Patch coherences and phases in terms of C̃α cos(Ψ̃α). The solid line gives α = 10 and the dash-dotted line α = 11. (C) The corresponding diagram for α = 11 and α = 01, where the latter is given by the dotted line. See the discussion in section 4.6.
accelerated by pattern p = 2, an acceleration that the nonoverlapping part of pattern p = 1 does not feel. Therefore, the phase velocities of the patches α = 10 and α = 11 become different, and the coherence C1 decreases. Of course, the described effect becomes maximal when the phase of the overlap, Ψ̃11, reaches the phase Ψ̃01 (i.e., when C2 is maximal). This explains why maximal C2 is also correlated with minimal C1. Again, these considerations are confirmed by the correlation of minimal synchronization in Figure 5B and maximal synchronization in Figure 5C.

4.6.2 Example 2b. With example 2b, the patches become active and coherent as with example 1b (see Figure 6). Therefore, the observed decoherences of example 2b, in particular the phenomenon of pattern switching, should be traced back to differences between the phases of the patches (see Figure 7). We describe example 2b with reference to the foregoing discussion of example 1b.
Figure 6: Example 2b. (A) Activities R̃α and (B) coherences C̃α of the patches, with α = 10 (solid line), α = 01 (dotted line), and α = 11 (dash-dotted line). See the discussion in section 4.6.
Consider again the situation where pattern p = 1 is coherent and the phases of patches α = 10, 11 move toward the phase of patch α = 01 with higher velocity. We have already described that this leads to patch α = 11 receiving stronger acceleration than patch α = 10. This may be interpreted as patch α = 01 pulling phase Ψ̃11 away from phase Ψ̃10 and toward phase Ψ̃01. In case this effect is strong enough, one may even expect that the pulling leads to a transfer of synchronization, away from the synchronization of Ψ̃10 and Ψ̃11 and toward synchronization of Ψ̃11 and Ψ̃01. This is not the case with example 1b, where pattern p = 1 dominated pattern p = 2. However, this is exactly what happens with example 2b, where neither pattern dominates the other (see Figure 7). The same mechanism that caused the dynamics of example 1b, resulting only in a weak oscillation of C1 close to C1 ≈ 1 and the alternation of C2 between coherence and decoherence, here leads to the transfer of synchronization that was denoted as pattern switching. In summary, the breakdown of the coherent retrieval of one of the patterns and the switch to coherently retrieving the other pattern may be understood as an effect of sufficiently strong pulling of the overlap phase Ψ̃11 away from the nonoverlapping phase Ψ̃10 (Ψ̃01) of pattern p = 1 (p = 2), due to its approaching Ψ̃01 (Ψ̃10), so that the synchronization of the overlap is transferred to the other patch phase Ψ̃01 (Ψ̃10), making the other pattern p = 2 (p = 1) coherent. In consequence, the phases Ψ̃11 and Ψ̃01 (Ψ̃10) begin to move
Figure 7: Example 2b. (A) Pattern coherences Cp, with a time window from t = 1.5τ to t = 3τ, with p = 1 (solid line) and p = 2 (dotted line). The diagram illustrates the pattern switching. (B) Patch coherences and phases in terms of C̃α cos(Ψ̃α). The solid line gives α = 10 and the dash-dotted line α = 11. (C) The corresponding diagram for α = 11 and α = 01, where the latter is given by the dotted line. See the discussion in section 4.6.
faster than Ψ̃10 (Ψ̃01), and the situation repeats, now transferring the synchronization from pattern p = 2 (p = 1) back to p = 1 (p = 2), and so on.

4.7 Examples 3: Shear and Oscillation Spindles. So far, we have used vanishing shear, ω2,k = ω2 = 0. Now we want to study the effect of nonvanishing shear. We demonstrate that nonvanishing shear generates an oscillation spindle structure. Consider a network with N = 30 and P = 2 patterns of N1 = N2 = 20 nonvanishing units. We assume that the patterns overlap at 10 units (see Figure 1C). The parameter choices are a/N = 2, σ = a/2, ω3 = πσ/τ, and ω1,k = ω1 = 0 for every k. Example 3a is defined by ω2 = 0 and example 3b by ω2 = ω3.

4.7.1 Example 3a. First, we consider the case of vanishing shear terms, ω2 = 0 in equation 2.33. The example shows pattern switching, as in the case of example 2b (see Figure 8).
Figure 8: Example 3a, with vanishing shear terms, ω2 = 0. (A) Coherence and phase of pattern p = 1 in terms of C1 cos(Ψ1) and (B) pattern p = 2 in terms of C2 cos(Ψ2). (C) Coherences C1 (solid line) and C2 (dotted line). (D) The corresponding values of R/N, the real part of the potential function L/N. See the discussion in section 4.7.
Figures 8C and 8D display the characteristic alternating coherent retrieval of the patterns and the oscillating real part of L that we observed and discussed in sections 4.3 and 4.4. Here, we are interested in the oscillation behavior of the single patterns p = 1 and p = 2, as displayed in Figures 8A and 8B, respectively.

4.7.2 Example 3b. Second, we allow for nonvanishing shear terms and choose ω2 = ω3. The resulting dynamics is displayed in Figure 9. The occurrence of pattern switching (see Figures 9C and 9D) is rather unchanged, as are the frequency of pattern switching and the oscillations of L. However, as is to be expected due to the larger ω2, the patterns now oscillate with a higher frequency. The combination of low-frequency pattern switching and high-frequency oscillation of the single oscillators results in oscillation spindles of the single patterns (see Figures 9A and 9B). This is the main statement that we want to make with this example: the frequency of pattern switching and the frequency of the single oscillators may be different, allowing for oscillation spindles.
Figure 9: Example 3b, where the shear terms are nonvanishing, ω2 = ω3. (A) Coherence and phase of pattern p = 1 in terms of C1 cos(Ψ1) and (B) pattern p = 2 in terms of C2 cos(Ψ2). (C) Coherences C1 (solid line) and C2 (dotted line). (D) The corresponding values of R/N, the real part of the potential function L/N. See the discussion in section 4.7.
Notice that we also included the dynamics of the real part of L in Figure 9D to illustrate that its oscillation is related to the pattern-switching frequency and not to the frequency of the single oscillators.

5 Higher-Mode Couplings and Relaxation Oscillators

In this letter, we studied phase models for the oscillators with sinusoidal couplings. In section 2.4, we also gave the form of the higher-mode couplings, but these were not used for the examples in section 4. The relevance of higher-mode couplings for the model with inhibition and without acceleration has been demonstrated in Burwick (2005). In the following, we comment on their effect, with a comparison to Wilson-Cowan or relaxation oscillators. Pattern-switching phenomena have been observed with other models. For example, in Wang, Buhmann, and von der Malsburg (1990), the oscillators have been described as extended Wilson-Cowan oscillators, and different patterns were stored according to Hebbian memory. There, pattern switching was observed as a consequence of delayed self-inhibition
in the presence of competition couplings (noise or modulation of external input were mentioned as alternative mechanisms to cause the switching). In succeeding work, networks of coupled relaxation oscillators were considered by including a second timescale (Terman & Wang, 1995; Wang & Terman, 1995, 1997). Coupled relaxation oscillators have been compared to phase oscillators with sinusoidal couplings, finding differences in synchronization time and sensitivity to the distribution of eigenfrequencies (Somers & Kopell, 1993, 1995). Further clarification of this comparison has been obtained by giving a phase model formulation for relaxation oscillator networks. These were formulated (in the limit of weak couplings) as phase oscillators with discontinuous connection functions (Izhikevich, 2000). Given the phase formulation of the relaxation oscillator networks, it may be of interest to study whether acceleration can also be implemented with these relaxation networks. This could be achieved by using their phase model formulation and the higher-order terms that were given in section 2.4. Even the noted discontinuities may be implemented, at least in principle, by using a Fourier decomposition similar to that of the Heaviside (step) function. As a future research topic, these higher-mode couplings should be considered. The studies in Somers and Kopell (1993, 1995) and Izhikevich (2000) showed that the synchronization behavior of relaxation oscillator networks is less sensitive to the distribution of eigenfrequencies in comparison with sinusoidally coupled networks. To study the corresponding feature in the presence of acceleration terms would be of particular interest, given the possible interpretation of acceleration terms as generalized eigenfrequency terms with dynamical structure (see equations 2.1, 2.3, and 2.28).

6 A Note on Fast Implementation Models

We want to add one remark on a possible modification of the system of equation 2.1. The proposed modification may be of use for faster implementation and should be of particular value when realizing more advanced applications. The uk dynamics of the considered model comes with an inverse timescale given by

Ω(uk)/τ = (1/τ) · 1/(1 − Vk) = lim_{κ→∞} (1/τ)(1 + Vk + Vk² + · · · + Vk^κ) = lim_{κ→∞} (1/τ) Ωκ(uk).    (6.1)
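As a small numerical illustration of equation 6.1 and of the finite-κ truncation proposed below, the following sketch compares the factor 1/(1 − Vk) with its truncated geometric series. The symbol Ω is our reconstruction of the factor's name in the extracted text, and the code is illustrative only.

```python
def omega_full(V):
    """Inverse-timescale factor 1/(1 - V) of equation 6.1 (times 1/tau)."""
    return 1.0 / (1.0 - V)

def omega_truncated(V, kappa):
    """Truncated geometric series 1 + V + V^2 + ... + V^kappa.

    For 0 < V < 1 this converges to omega_full(V) as kappa grows; a
    finite kappa bounds the effective rate and so permits larger Euler
    time steps.
    """
    return sum(V**j for j in range(kappa + 1))

# Example: near an on-state (V close to 1) the full factor is large,
# while a finite truncation stays bounded.
V = 0.99
print(omega_full(V))            # 100.0
print(omega_truncated(V, 10))   # about 10.5
```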
The limit in equation 6.1 is finite, since 0 < Vk = g(uk) < 1. The presence of Ω gives the model the simplest form in terms of complex coordinates, allowing for a straightforward derivation of the model as a
simple gradient system. In effect, it implies shorter timescales for on-states. Therefore, in the context of finite implementations, for example, with a Euler approximation, using the Ω then requires choosing short discrete time steps dt. With the regularization of section 4.1, these were restricted to dt = ε²τ. Otherwise, overshooting occurs in the case of on-states. Consider the following modification. For the sake of fast implementation, it may be advisable to choose Ωκ with some finite κ instead of Ω. Using some Ωκ instead of Ω will then destroy the simple form of the system in terms of complex coordinates but may allow fast implementations while keeping the essential features studied in this letter. Another possibility would be using fourth-order Runge-Kutta methods. A detailed treatment of these proposals is beyond the scope of this letter. For the purpose of this letter, the described discretization allowed sufficiently fast implementations.

7 Summary

In this letter, we studied an oscillatory network model that is a generalization of the classical Cohen-Grossberg-Hopfield (CGH) system. In Burwick (2005), this generalization was obtained by going from the real-valued to a complex-valued system. Here, we took another step in this direction by allowing for complex-valued couplings. It turns out that the resulting couplings have a natural interpretation in terms of neural properties. They describe an increase in the phase velocity of the neural units for stronger and more coherent input. This mechanism is what we refer to as acceleration. It embeds into the framework of temporal coding a core feature that is usually identified with rate coding. Nevertheless, the new perspectives opened by including acceleration may come as a surprise. We concentrated our discussion on the superposition problem of overlapping patterns. The combined effect of synchronization and acceleration was demonstrated for two overlapping patterns with a number of examples. The patterns were stored according to Hebbian memory, implying the usual excitatory couplings. With the parameters of our examples, the patterns became active, and without including acceleration, global coherence was reached. The patterns therefore could not be segmented with respect to coherence. The situation changed in the presence of acceleration. We distinguished two cases. First, one of the patterns dominates the other. In that case, the dominating pattern becomes coherent, thereby segmenting itself from the nonoverlapping part of the other pattern, which shows a different phase dynamics. Second, neither of the patterns dominated the other. In that case, temporal segmentation of the two patterns occurred, realized with an alternating chain of transient coherent retrievals of the individual patterns. We referred to this behavior as pattern switching. For both cases, the superposition problem was solved with respect to the pattern coherences. The approach of this letter does not need inhibitory couplings. In particular, we did not include the mutually inhibitory couplings that constitute
competition. In Burwick (2005), four mechanisms were described that may solve the superposition problem for overlapping patterns. Each of these mechanisms was built on competition. Therefore, it may come as a surprise that acceleration allows segmenting overlapping patterns without explicitly including such competition. Notice also that the resulting segmentation is self-organized in the sense that it arises not in response to some external input but as a consequence of the internal (excitatory) couplings. We mentioned that even the simplest oscillatory systems have a natural formulation in terms of complex-valued gradient systems, where “complex valued” refers not only to taking derivatives with respect to complex coordinates but also to taking derivatives of a potential function L with an imaginary part. This gradient structure is reminiscent of formulating the CGH model as a real-valued gradient system that minimizes some real-valued Lyapunov function. However, there is an essential difference: nontrivial phase dynamics follows from the imaginary part of L, and this is not compatible with the classical picture of minimizing the Lyapunov function (at least not with respect to the phase dynamics). There is a choice to be made between a classical system that minimizes the Lyapunov function and a system that has a more standard oscillatory character. Several earlier approaches toward oscillatory neural networks based on phase models used the more classical approach. Consider the treatment in Hoppensteadt and Izhikevich (1997, section 10.4), where the different eigenfrequencies and shear terms were eliminated from the general Ginzburg-Landau phase model in order to arrive at a gradient system that is complex valued (i.e., the derivative is with respect to complex coordinates) but with a real-valued potential function (Lyapunov function). Here, we took another approach by keeping the network dynamics truly oscillatory. From a formal point of view, this amounted to allowing for imaginary parts of the potential function L. In fact, we allowed for the imaginary parts not only with regard to the oscillator models, allowing for different eigenfrequencies and shear, but also with regard to the couplings, leading to acceleration. This implied the favorable features that were mentioned before. In consequence, the real part of L (corresponding to the Lyapunov function in the classical and earlier phase model approaches) decreased toward a minimal level, where the decrease is mainly caused by the amplitude dynamics. Then, however, it oscillates around this level. For example, in case of pattern switching, we found that the frequency of the pattern switching is related to the frequency of this oscillating L. Future work should study network behavior with more than two overlapping patterns, as well as the presence of asymmetric connections. The latter aspect may contribute to understanding how the mechanism that was presented in this letter may become part of some hierarchical architecture. Also, a stability analysis may provide a more systematic approach to the phenomena described here. We also mentioned that the applicability of the acceleration mechanism to relaxation oscillator networks should
be considered. Moreover, it would be of interest to study the combined effect of synchronization and acceleration in the context of more advanced applications. This could be based on the fast implementation models described in section 6. In this letter, a new mechanism is introduced and described in the context of simple examples. More challenging real-world applications will be the decisive test for the relevance of the described mechanism. Discussing these is beyond the scope of this letter. Notice, however, that synchronization and acceleration may be part of the solution, but the combination with additional mechanisms may be needed. Advanced applications may also involve additional problems like occlusion and may require implementations with a hierarchical architecture. In analogy to brain dynamics, it could be argued that a combination of temporal and labeled-line coding might be necessary (see Singer, 2003, for an introduction to the differences between these coding schemes). Although we argued that inhibition is not necessary to generate desynchronization, competition based on inhibition may play an essential role in the complete picture of the hierarchical architecture, supporting also the matching of bottom-up and top-down information flow. Given the biological example, only such a complete machinery should be expected to realize advanced and truly intelligent applications. In that respect, we consider the specification of the hierarchical processing to be of utmost importance. Finally, let us add some remarks on biological analogies. It is quite common to identify temporal coding with synchronization. Here, we argued that it may be worthwhile to also consider the combination of synchronization with acceleration. The examples of this letter demonstrated that the impact of including acceleration may be fundamental. We emphasized that the observed dynamics corresponds to oscillation spindles, a fundamental dynamics that is also observed in brain dynamics. Moreover, the pattern switching that we described seems to be analogous to Gestalt switching in response to perceptual rivalries. Nevertheless, whether the combination of synchronization and acceleration will be of relevance for understanding information processing in the brain is an open question. Its answer will depend on whether a meaningful interpretation of the network properties may be found. It should be obvious that any interpretation in terms of biological quantities needs to be accompanied by some averaging, temporal or spatial, or both. Consider, for example, an interpretation of network units as time-averaged groups of biological neurons that oscillate with frequencies in the gamma range. Despite the participation in the high-frequency wave, the underlying biological neurons may not be at saturation with respect to their individual firing rates. Thus, there may indeed be some mechanism whereby stronger and more coherent input implies a higher frequency of the collective firing, even without leaving the gamma range. Such a property would correspond to acceleration.
Acknowledgments

It is a pleasure to thank Christoph von der Malsburg for valuable discussions.

References

Burwick, T. (2005). Pattern recognition based on coherent activity in complex-valued neural networks. Manuscript submitted for publication.
Burwick, T. (2006). Oscillatory networks: Pattern recognition without a superposition problem. Neural Computation, 18, 356–380.
Cohen, M. A., & Grossberg, S. (1983). Absolute stability of global pattern formation and parallel memory storage by competitive neural networks. IEEE Transactions on Systems, Man, and Cybernetics, SMC-13, 815–826.
Engel, A. K., Fries, P., & Singer, W. (2001). Dynamic predictions: Oscillations and synchrony in top-down processing. Nature Reviews Neuroscience, 2, 704–716.
Engel, A. K., & Singer, W. (2001). Temporal binding and the neural correlates of sensory awareness. Trends in Cognitive Sciences, 5, 16–25.
Hebb, D. O. (1949). The organization of behavior: A neuropsychological theory. New York: Wiley.
Hertz, J., Krogh, A., & Palmer, R. G. (1991). Introduction to the theory of neural computation. Reading, MA: Addison-Wesley.
Hopfield, J. J. (1984). Neurons with graded response have collective computational properties like those of two-state neurons. Proceedings of the National Academy of Sciences of the U.S.A., 81, 3088–3092.
Hoppensteadt, F. C., & Izhikevich, E. M. (1997). Weakly connected neural networks. New York: Springer-Verlag.
Izhikevich, E. M. (2000). Phase equations for relaxation oscillators. SIAM Journal on Applied Mathematics, 60, 1789–1805.
Kuramoto, Y. (1984). Chemical oscillations, waves, and turbulence. New York: Springer-Verlag.
Shareef, N., Wang, D., & Yagel, R. (1999). Segmentation of medical images using LEGION. IEEE Transactions on Medical Imaging, 18, 74–91.
Singer, W. (2003). Synchronization, binding and expectancy. In M. Arbib (Ed.), Brain theory and neural networks (2nd ed., pp. 1136–1143). Cambridge, MA: MIT Press.
Somers, D., & Kopell, N. (1993). Rapid synchronization through fast threshold modulation. Biological Cybernetics, 68, 393–407.
Somers, D., & Kopell, N. (1995). Waves and synchrony in networks of oscillators of relaxation and non-relaxation type. Physica D, 88, 1–14.
Terman, D., & Wang, D. (1995). Global competition and local cooperation in a network of neural oscillators. Physica D, 81, 148–176.
von der Malsburg, C. (1981). The correlation theory of brain function (Internal Rep. 81-2). Göttingen: Max-Planck Institute for Biophysical Chemistry.
von der Malsburg, C. (1986). Am I thinking assemblies? In G. Palm & A. Aertsen (Eds.), Brain theory: Proceedings of the First Trieste Meeting on Brain Theory, October 1–4, 1984 (pp. 161–176). New York: Springer-Verlag.
von der Malsburg, C. (1999). The what and why of binding: The modeler's perspective. Neuron, 24, 95–104.
Wang, D., Buhmann, J., & von der Malsburg, C. (1990). Pattern segmentation in associative memory. Neural Computation, 2, 94–106.
Wang, D., Freeman, W., Kozma, R., Lozowski, A., & Minai, A. (Eds.). (2004). IEEE Transactions on Neural Networks, 15(5). [Special issue].
Wang, D., & Terman, D. (1995). Locally excitatory globally inhibitory oscillator networks. IEEE Transactions on Neural Networks, 6, 283–286.
Wang, D., & Terman, D. (1997). Image segmentation based on oscillatory correlation. Neural Computation, 9, 805–836.
Wolfe, J. M., & Cave, K. R. (1999). The psychophysical evidence for a binding problem in human vision. Neuron, 24, 11–17.
Received April 3, 2006; accepted August 14, 2006.
LETTER
Communicated by John Milton
Multistability in Spiking Neuron Models of Delayed Recurrent Inhibitory Loops

Jianfu Ma
[email protected]
Jianhong Wu
[email protected]
Laboratory for Industrial and Applied Mathematics, Department of Mathematics and Statistics, York University, Toronto, Ontario M3J 1P3, Canada
We consider the effect of the effective timing of a delayed feedback on the excitatory neuron in a recurrent inhibitory loop, when biological realities of firing and absolute refractory period are incorporated into a phenomenological spiking linear or quadratic integrate-and-fire neuron model. We show that such models are capable of generating a large number of asymptotically stable periodic solutions with predictable patterns of oscillations. We observe that the number of fixed points of the so-called phase-resetting map coincides with the number of distinct periods of all stable periodic solutions rather than with the number of stable patterns. We demonstrate how configurational information corresponding to these distinct periods can be exploited to calculate and predict the number of stable patterns.

1 Introduction

In a living nervous system, recurrent loops involving two or more neurons are ubiquitous and are particularly prevalent in cortical regions for memory, such as the hippocampal-mesial temporal lobe complex (Traub & Miles, 1991). The simplest recurrent inhibitory loop (see Figure 1) in a neural network consists of an excitatory neuron E and an inhibitory neuron I, where neuron E gives off collateral branches and excites the inhibitory neuron I, which in turn inhibits the firing of E after a delay time. (See Foss, Longtin, Mensour, & Milton, 1996; Mackey & an der Heiden, 1984; Mackey & Milton, 1987.) These previous studies show that two-neuron inhibitory loops with delay display complex dynamic behaviors, such as multistability, similar to those of larger networks, and many techniques developed to deal with two-neuron networks can carry over to large-size networks. Therefore, two-neuron recurrent inhibitory loops have been used as prototypes to improve our understanding of how the interaction of time delay and inhibitory feedback affects a network's computational performance.

Neural Computation 19, 2124–2148 (2007)
© 2007 Massachusetts Institute of Technology
Figure 1: A recurrent neuron loop consists of an excitatory neuron E and an inhibitory neuron I, where neuron E excites the neuron I, which delivers an inhibitory postsynaptic potential (IPSP) to the neuron E after a time lag τ.
An interesting computational behavior of a delayed inhibitory loop is multistability: the coexistence of multiple stable patterns such as equilibria and periodic orbits. The coexistence of multiple equilibria in a neural network is the basis of the mechanism for (associative) content-addressable memory and retrieval (Foss et al., 1996; Hopfield, 1982, 1984; Milton, 2000; Morita, 1993), where each equilibrium is identified with a static memory, while stable periodic orbits are associated with temporally patterned spike trains (Canavier, Baxter, Clark, & Byrne, 1994; Foss et al., 1996; Foss, Moss, & Milton, 1997). Multistability in a delayed neural network has been extensively studied in the literature, in particular for delayed neural recurrent loops (Foss et al., 1996, 1997), and experimentally in electrical circuits (Foss et al., 1997) and recurrently clamped neurons (Foss & Milton, 2000). The rigorous mathematical investigation of the coexistence of multiple periodic orbits and the description of the domains of attraction of periodic solutions, as well as the structure of the global attractor, can be found in Chen and Wu (1999, 2000, 2001a, 2001b, 2001c) and Chen, Wu, and Krisztin (2000). Time delays, a powerful mechanism for multistability, are intrinsic properties of nervous systems and are unavoidable in electronic implementations due to axonal conduction times, distances between interneurons, and the finite switching speeds of amplifiers. An essential requirement for delay-induced multistability in a delayed feedback loop is that the delay must be longer than the response time of the system or the intrinsic period of the spiking neuron (Foss et al., 1996, 1997; Ikeda & Matsumoto, 1987). An example is the delay differential equation

dx(t)/dt = −αx(t) + F(x(t − τ)),    (1.1)
where α is a positive constant and F is the feedback function. The work of an der Heiden and Mackey (1992) and Ikeda and Matsumoto (1987) suggested that multistability occurs when the time delay τ is longer than the intrinsic timescale of the control mechanism, that is, τ > α⁻¹, and the feedback function F is nonmonotonic. Losson (1991) and Losson, Mackey, and Longtin (1993) studied equation 1.1 with the feedback function

F(x) = { c if x ∈ [x1, x2]; 0 otherwise (x < x1 or x > x2) },    (1.2)
and observed three coexisting attracting nonconstant periodic solutions when α = 3.25, c = 20.5, x1 = 1, x2 = 2, and τ = 1. The Mackey-Glass delay differential equation (Mackey & Glass, 1997), given by

dx(t)/dt = −αx(t) + βx(t − 1)/(1 + x(t − 1)¹⁰),    (1.3)
and originally introduced to model oscillations in neutrophil populations, was found to have four coexisting, attracting periodic solutions in symmetric positive and negative pairs. Furthermore, Foss et al. (1996) studied neural recurrent inhibitory loops using the well-known Hodgkin-Huxley model (see details below) and found three coexisting attracting periodic solutions. Another important concept related to multistability is the phase resetting curve (PRC), which describes the phase shift of an oscillation in response to a perturbation in the neuron’s spike times. PRC is a powerful tool to study the effects of perturbations on biological periodic oscillations. Various mathematical models based on phase-resetting properties have been extensively used to study the effect of periodic inhibitory stimulation of cortical slices (Schindler, Bernasconi, Stoop, Goodman, & Douglas, 1997) and the occurrence of multistability in ring circuit models for CPGs (Canavier et al., 1999), and to explain complex dynamical behaviors of large populations of neurons (Hoppensteadt & Izhikevich, 1997). In particular, Foss and Milton (2000) developed a mathematical model that incorporates two measurable parameters involving time delay and phase resetting to investigate the occurrence of multistability in the delayed recurrent loop, where they used multiple fixed points in a phase-resetting map to predict the number of coexistent stable patterns. Most of this work on inhibitory loops and feedbacks has been based on the consideration of the dynamical behaviors of the single excitatory neuron in the two-neuron inhibitory loop, and hence the model equation takes the form of a scalar delay differential equation that can also arise in modeling of a single neuron with delayed self-feedback. Many of the results require the signal function to be nonmonotone. There are also two popular models
for a single neuron in theoretical neuroscience: phenomenological spiking neuron models and detailed conductance-based neuron models. In comparison with simple phenomenological spiking neuron models, detailed conductance-based neuron models have a high degree of accuracy in capturing the intrinsic complexity of a single neuron, such as firing and inhibitory rebound spikes. A well-known example of the detailed conductance-based models is the following Hodgkin-Huxley differential equation:

C x′(t) = Iion(x, m, n, h) + Isyn(t) + Is(t),
Iion = −gNa m³h (x(t) − ENa) − gK n⁴ (x(t) − EK) − gL (x(t) − EL),    (1.4)
where x(t) is the membrane potential of the neuron under consideration at time t, C is the membrane capacitance, and Iion is the sum of the currents through the sodium ion channel, potassium ion channel, and leakage channel. The applied currents have two parts: the postsynaptic current Isyn (PSC) and the external stimulus Is. Constants gNa and gK are the maximum conductances of the sodium and potassium ion channels, the constant gL is the conductance of the leakage channel, and the constants ENa, EK, and EL are empirical parameters called reversal potentials. Parameters commonly used are C = 1, gNa = 120, gK = 36, gL = 0.3, ENa = 115, EK = −12, and EL = 10.613. There are three (gating) variables (m, n, h) that describe the probability that a certain channel is open, and these variables evolve according to the following differential equations:

m′(t) = αm(x)(1 − m) − βm(x)m,
n′(t) = αn(x)(1 − n) − βn(x)n,
h′(t) = αh(x)(1 − h) − βh(x)h,    (1.5)
where the functions α and β indexed by (m, n, h) are given by

αn(x) = (0.1 − 0.01x)/(exp(1 − 0.1x) − 1),  βn(x) = 0.125/exp(x/80),
αm(x) = (2.5 − 0.1x)/(exp(2.5 − 0.1x) − 1),  βm(x) = 4/exp(x/18),
αh(x) = 0.07/exp(x/20),  βh(x) = 1/(exp(3 − 0.1x) + 1).    (1.6)
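Since equations 1.4 to 1.6 fully specify the rate functions and parameters, they translate directly into code. The sketch below is an illustrative transcription, not code from the letter.

```python
import numpy as np

# Commonly used Hodgkin-Huxley parameters, as listed after equation 1.4.
g_Na, g_K, g_L = 120.0, 36.0, 0.3
E_Na, E_K, E_L = 115.0, -12.0, 10.613

# Gating rate functions of equation 1.6 (x: membrane potential in mV).
# Note the removable singularities of alpha_n at x = 10 and alpha_m at x = 25.
def alpha_n(x): return (0.1 - 0.01 * x) / (np.exp(1.0 - 0.1 * x) - 1.0)
def beta_n(x):  return 0.125 / np.exp(x / 80.0)
def alpha_m(x): return (2.5 - 0.1 * x) / (np.exp(2.5 - 0.1 * x) - 1.0)
def beta_m(x):  return 4.0 / np.exp(x / 18.0)
def alpha_h(x): return 0.07 / np.exp(x / 20.0)
def beta_h(x):  return 1.0 / (np.exp(3.0 - 0.1 * x) + 1.0)

def ionic_current(x, m, n, h):
    """I_ion of equation 1.4."""
    return (-g_Na * m**3 * h * (x - E_Na)
            - g_K * n**4 * (x - E_K)
            - g_L * (x - E_L))

def gating_derivatives(x, m, n, h):
    """Right-hand sides of equation 1.5: (m', n', h')."""
    return (alpha_m(x) * (1 - m) - beta_m(x) * m,
            alpha_n(x) * (1 - n) - beta_n(x) * n,
            alpha_h(x) * (1 - h) - beta_h(x) * h)
```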
Unfortunately, the intrinsic complexity of the conductance-based model makes a careful theoretical and qualitative analysis difficult. Thus, simple phenomenological spiking neuron models are introduced, especially for the study of neural coding, memory, and network dynamics, in addition to facilitating engineering implementation. A widely used such model
is the following linear integrate-and-fire (LIF) model for the membrane potential,

dx(t)/dt = −βx(t) + Isyn + Is(t),    (1.7)

with the reset condition lim_{t→tf⁺} x(t) = Vr, where β is the decay rate, Vr is the reset potential, and the firing time tf is defined by a threshold criterion,

tf : x(tf) = ϑ1 and x′(t)|t=tf > 0,    (1.8)
with ϑ1 being the firing threshold. Such a model can be seen as a reduced neural model with a mean excitatory and inhibitory membrane potential (averaged over the neuron). A nonlinear analog is the following quadratic integrate-and-fire (QIF) model, given by

dx(t)/dt = β(x − µ)(x − γ) + Isyn + Is(t),
tf : x(tf) = ϑ1 and x′(t)|t=tf > 0,    (1.9)
with the reset condition lim_{t→tf⁺} x(t) = Vr, where β is a constant, µ is the critical reversal potential of the resting state or inhibitory rebound, and γ is the critical reversal potential of firing. (See Feng, 2001; Hansel & Mato, 2001; Latham, Richmond, Nelson, & Nirenberg, 2000; and the discussions below.) However, simple phenomenological models alone are unable to capture some important biological features exhibited by the detailed conductance-based models, such as the absolute refractory period, which seems to be an important factor for the occurrence of a large number of stable patterns, and the inhibitory rebound that enables spikes in initial functions to propagate in future time. This letter analyzes the effects of these biological features on multistability in phenomenological models in order to provide effective mechanisms for phenomenological models to generate a huge number of coexisting stable periodic solutions. The core of these mechanisms is the interaction of time lag, inhibitory feedback, firing, rebound, and absolute refractory period, incorporated in a (QIF) model with delayed feedback. The relative simplicity of the phenomenological QIF model enables us to formulate analytically the relevant phase-resetting map, based on which we find that the number of fixed points of the phase-resetting map coincides exactly with the number of distinct periods of stable periodic solutions rather than with the number of distinct stable patterns. However, using the configurational information corresponding to these distinct periods, we can calculate the number of distinct periodic solutions and precisely predict their patterns.
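To fix ideas, the following sketch integrates the LIF model of equations 1.7 and 1.8 with a forward Euler scheme, applying the threshold-and-reset rule at each step. It is a minimal illustration, not the simulation code used for the figures below; the QIF right-hand side of equation 1.9 could be substituted in the same loop. The example parameter values are borrowed from section 3 but are used here without feedback, so the output simply illustrates the intrinsic spiking period.

```python
def simulate_lif(beta, I_s, theta1, V_r, T=200.0, dt=0.01, x0=0.0):
    """Euler integration of dx/dt = -beta*x + I_s (equation 1.7 without
    synaptic input), with firing threshold theta1 and reset to V_r
    (equation 1.8). Returns the list of firing times.
    """
    x, t, spikes = x0, 0.0, []
    while t < T:
        x += dt * (-beta * x + I_s)
        if x >= theta1:           # threshold crossing from below
            spikes.append(t)      # record firing time t_f
            x = V_r               # reset condition
        t += dt
    return spikes

# A constant suprathreshold stimulus yields a regular spike train whose
# period is the intrinsic spiking period T of section 2.
print(simulate_lif(beta=0.08, I_s=0.38, theta1=1.2, V_r=-1.1)[:3])
```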
Basic frameworks of the QIF models with delayed feedback are discussed in section 2, and detailed discussions of the QIF models are given in section 3, along with a description of the numerical simulations and the connections with phase-resetting techniques. Further comments are presented in the final section.

2 Recurrent Inhibitory Loops: Model Formulation

A recurrent inhibitory loop is composed of two neurons: an excitatory neuron E and an inhibitory neuron I. Injecting a large enough stimulus Is into the dendrite of the neuron E causes its membrane potential to increase until it reaches its firing threshold ϑ1. The neuron E fires and emits an action potential, or spike, and then the membrane potential is reset to a certain value Vr (called the reset membrane potential). In the absence of a stimulus, a single neuron is at rest, and the corresponding membrane potential is called the resting membrane potential V0. A perpetuated stimulus causes the neuron to emit a sequence of spikes called a spike train. In the absence of recurrent inhibition, the period of spikes in the spike train is called the intrinsic spiking period of the excitatory neuron E, denoted by T in the remaining part of this article. The firing of neuron E excites the inhibitory interneuron I, which delivers an inhibitory postsynaptic potential (IPSP) to the excitatory neuron E. Foss et al. (1996) described the membrane potential of the excitatory neuron E using the Hodgkin-Huxley (HH) model by considering the effect of the IPSP as self-feedback. They obtained the following delay differential system:

C x′(t) = −gNa m³h (x(t) − ENa) − gK n⁴ (x(t) − EK) − gL (x(t) − EL) − F(x(t − τ)) + Is(t),
m′(t) = αm(x)(1 − m) − βm(x)m,
n′(t) = αn(x)(1 − n) − βn(x)n,
h′(t) = αh(x)(1 − h) − βh(x)h,    (2.1)
where F(x) is the signal function that describes the effect of the inhibitory neuron I on the membrane potential of the excitatory neuron E, and τ is the time lag. The corresponding linear integrate-and-fire (LIF) and quadratic integrate-and-fire (QIF) models are given by

x′(t) = −βx(t) − F(x(t − τ)) + Is(t),    (2.2)
x′(t) = β(x − µ)(x − γ) − F(x(t − τ)) + Is(t),    (2.3)
with firing time tf,

tf : x(tf) = ϑ1 and x′(t)|t=tf > 0,

and firing threshold ϑ1. In order to specify a solution of the above delay differential equations, it is necessary to give an initial function φ on the interval [−τ, 0]. In what follows, the initial function has the form of a neural spike train. Namely, it is given by a sum of square pulse functions:

φ(t) = Σi ψ(t − ti),    (2.4)

where

ψ(t) = { c if t ∈ [0, d]; 0 otherwise },    (2.5)
with c being the amplitude of the action potential (e.g., c = 100 mV in the Hodgkin-Huxley model) and d the duration of a spike. We shall use the following piecewise nondecreasing constant function,

F1(x) = { a if x ≥ ϑ1; 0 otherwise },    (2.6)
where a is a positive constant. For simplification, we set the resting membrane potential to zero (V0 = 0).

3 Simulations and Analysis

In what follows, we address the multistability issue in two regimes classified by the nature of the stimulus Is: the excitable regime, where Is cannot make the neuron fire, and the periodic regime, where Is makes the neuron fire successively.

3.1 Hodgkin-Huxley Model (HH). When τ = 116 msec and the signal function is F(x) = µx with a constant µ, Foss et al. (1996) found that in the excitable regime Is = 0, each initial function φ gives rise to a solution that is eventually periodic. These eventually periodic solutions are stable, but the model has no coexisting attractors. In the periodic regime Is = 10 µA, however, they found three coexisting attracting periodic solutions. We obtained similar results using the piecewise constant function F(x) = F1(x). In particular, for the piecewise constant function and in the excitable regime Is = 0, we found that different initial functions give rise to different, eventually periodic, solutions that are stable, and there is no coexisting attractor.
Figure 2: Four coexisting attracting periodic solutions generated by the excitatory neuron E for the Hodgkin-Huxley (HH) model given by equations 2.1 with the piecewise constant signal function F1(x) when Is = 10 µA, τ = 116 msec, ϑ1 = 18.6 mV, and a = 16. Spikes in the initial function φ on the interval (−τ, 0) are 4 msec in duration. After 50τ of transients, the periodic solutions are output in six delay intervals. The right-hand side is the blow-up of the solutions in a given period (not delay τ) to clearly illustrate the patterns of solutions.
But in the periodic regime Is = 10 µA with parameters τ = 116 msec, ϑ1 = 18.6 mV, and a = 16, we found four coexisting periodic attractors, shown in Figure 2. The corresponding trajectories exhibit exotic transient behaviors but eventually become periodic.

3.2 LIF Model with Firing and Absolute Refractoriness. Each time the excitatory neuron fires a spike, a feedback is delivered a time τ later. Since the exact time course of the action potential carries no information, for most of our study of the simple phenomenological spiking neuron models, the form of the action potential may be neglected, and the action potential is characterized by the firing time tf and the reset condition. For our linear and quadratic integrate-and-fire models, the type of multistability depends not only on the time delay τ but also on the effective timing of the feedback that affects the excitatory neuron. The total timing of the feedback, denoted by Tϑ, is the portion of the duration when the spike is above the firing threshold. This total timing of the feedback is not characterized by the reset condition and firing time mentioned in the simple phenomenological
spiking neuron models. To simplify our presentation and simulations, we now introduce an approximation of the spike time course.

3.3 Firing. If the membrane potential reaches its firing threshold ϑ1 from below at a moment tf, an action potential (spike) is generated, and then the membrane potential is reset to the reset potential Vr < 0. As the shapes of spikes vary little and the form of an action potential is insignificant to the LIF model, we can represent the time course of the membrane potential by any function that first increases from ϑ1 to c and then decreases to Vr. In our work, we choose the following continuous linear function,

xf(t) = x̂1(t) = { ϑ1 + ((c − ϑ1)/(s1 − tf))(t − tf) if t ∈ [tf, s1); Vr + ((c − Vr)/(s2 − s1))(s2 − t) if t ∈ [s1, s2] },    (3.1)
or the continuous exponential function,

xf(t) = x̂2(t) = { ϑ1 e^{α1(t−tf)} if t ∈ [tf, s1]; e^{−α2(s2−t)} [ (Vr/(s2 − s1))(t − s1) + (c e^{α2(s2−s1)}/(s2 − s1))(s2 − t) ] if t ∈ (s1, s2] },    (3.2)
where c is the amplitude of the action potential and s1 < s2 are the times when x(t) = c and x(t) = Vr, respectively. All simulation results reported below are based on the function x̂1. To simplify our simulations, we take c = 10 mV for the rest of this article. We emphasize again that we introduce explicitly the firing time course of the membrane potential in order to have continuity of the action potential; all that is really needed for the analysis and simulations is the timing of firing and the three parameters ϑ1, c, and Vr.

3.4 Absolute Refractory Period. The second factor determining the effective timing of the feedback is the absolute refractory period, a short period after the firing of a spike during which the neuron is not affected by inputs at all. Without the input term I(t) = Is(t) − F(x(t − τ)), equation 2.2 becomes

dx/dt = −βx,    (3.3)

with the corresponding solution given by

xabs(t) = Vr e^{−β(t−s2)},  for t ∈ [s2, s2 + dabs],    (3.4)

where dabs is the length of this period.
Figure 3: Three coexisting attracting periodic solutions (patterns 3w8v, 1w2v2w6v, 1w3v1w3v1w2v) generated by the LIF model given by equation 2.2 with the signal function F1(x) for Is = 0.38 µA. Parameters are τ = 116 msec, β = 0.08, c = 10 mV, a = 0.6, s1 − tf = 0.60 msec, s2 − s1 = 2.7 msec, ϑ1 = 1.2 mV, Vr = −1.1 mV, and dabs = 1.1 msec. The right-hand side is the blow-up of the solutions in a given period (not delay τ) to clearly illustrate the patterns of solutions.
The solution xabs(t) = Vr e^{−β(t−s2)} is the same as the function η that describes the form of the spike and spike afterpotential in the spike response model (SRM) when the integrate-and-fire model is mapped to the SRM (Gerstner & Kistler, 2002). The absolute refractoriness allows the membrane potential to decay back from the hyperpolarization (afterpotential) to the resting potential even if feedback is delivered during this period. In the periodic regime, for example Is = 0.38 µA, we observe multiple coexisting attracting periodic solutions, shown in Figure 3, with parameters τ = 116 msec, β = 0.08, c = 10 mV, a = 0.6, s1 − tf = 0.60 msec, s2 − s1 = 2.7 msec, ϑ1 = 1.2 mV, Vr = −1.1 mV, and dabs = 1.1 msec. Trajectories are output in six delay intervals after 50τ of transients. The right-hand side is the blow-up of the solutions in a given period (not delay τ) to clearly illustrate the patterns of solutions. Ignoring the action potential and focusing on the small oscillatory part of these solutions, we found that there exist only two shapes, denoted by w and v, respectively. Because all solutions are composed of w and v, we can describe patterns in terms of the number and order in which w and v appear within one period. Therefore, Figure 3 shows three different patterns, denoted 3w8v, 1w2v2w6v, and 1w3v1w3v1w2v, respectively. Figure 4 lists three other patterns: 3w2v, 4w1v2w3v, and all v. These numerical observations clearly show the capacity of the LIF model for generating a large number of attracting periodic solutions.
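A minimal simulation of this delayed recurrent loop is sketched below. It integrates equation 2.2 with the piecewise constant feedback F1 of equation 2.6, the linear spike shape x̂1 of equation 3.1, and the absolute refractory period, using the parameter values quoted above. It is an illustrative reimplementation under stated assumptions (Euler stepping, a simple history buffer, our choice of dt), not the authors' code.

```python
import numpy as np

# Parameter values as quoted above for Figure 3 (LIF model, periodic regime).
tau, beta, I_s = 116.0, 0.08, 0.38
a, c, theta1, V_r = 0.6, 10.0, 1.2, -1.1
rise, fall, d_abs = 0.60, 2.7, 1.1     # s1 - tf, s2 - s1, refractory length
dt = 0.01                              # integration step (our choice)
n_hist = int(round(tau / dt))          # samples needed for the initial function

def simulate_loop(phi, T_total=56 * tau):
    """Euler simulation of x'(t) = -beta*x - F1(x(t - tau)) + I_s
    (equation 2.2 with the feedback F1 of equation 2.6), with the linear
    spike shape of equation 3.1 and the absolute refractory period.
    phi: samples of the initial function on [-tau, 0), len(phi) == n_hist.
    """
    hist = list(phi)                   # provides the delayed value x(t - tau)
    x, mode, phase = 0.0, "free", 0.0
    out = np.empty(int(T_total / dt))
    for i in range(out.size):
        feedback = a if hist[i] >= theta1 else 0.0   # F1(x(t - tau))
        if mode == "free":
            x += dt * (-beta * x - feedback + I_s)
            if x >= theta1:            # firing: enter the spike time course
                mode, phase = "spike", 0.0
        elif mode == "spike":          # linear spike shape x_hat_1
            phase += dt
            if phase < rise:           # rise from theta1 to c on [tf, s1)
                x = theta1 + (c - theta1) * phase / rise
            elif phase < rise + fall:  # fall from c to V_r on [s1, s2]
                x = V_r + (c - V_r) * (rise + fall - phase) / fall
            else:
                mode, phase = "refractory", 0.0
        else:                          # absolute refractory period
            phase += dt
            x = V_r * np.exp(-beta * phase)   # equation 3.4; inputs ignored
            if phase >= d_abs:
                mode = "free"
        hist.append(x)
        out[i] = x
    return out
```

Feeding in different initial spike trains φ and inspecting the output after the transients would be one way to explore coexisting patterns of the kind shown in Figures 3 and 4.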
Figure 4: Three other asymptotically stable periodic patterns (3w2v, 4w1v2w3v, and all v) generated by the LIF model given by equation 2.2 with the signal function F1(x) for Is = 0.38 µA and other parameters identical to those used in Figure 3. The right-hand side is the blow-up of the solutions in a given period (not delay τ) to clearly illustrate the patterns of solutions.
The advantage of the LIF model is that the equation can be integrated analytically, and such a dynamical system exhibits multistability in the periodic regime. Two important biological features of a single neuron are incorporated in the LIF model: the firing and the absolute refractory period. The firing determines the possible timing Tϑ of the corresponding feedback, and the absolute refractoriness determines whether such a feedback has an impact on the excitatory neuron. If a feedback arrives at a time s3 during the absolute refractory period (s3 ∈ [s2, s2 + dabs)), then the feedback affects the membrane potential of the excitatory neuron in only a small time interval; that is, the effective timing is of duration Tϑ − (s2 + dabs − s3). In contrast to the results of Foss et al. (1996), in the excitable regime and using various signal functions F, including the one used in the work of Foss et al., spikes in an initial function for the LIF model cannot propagate in the subsequent delay intervals, and spike train patterns cannot be generated. The reason is that another important biological function of a single neuron, inhibitory rebound, is not captured by the LIF model. The quadratic integrate-and-fire model, incorporating a rebound equation, naturally solves this problem.

3.5 Quadratic Integrate-and-Fire Model. The quadratic integrate-and-fire (QIF) model was proposed in Feng (2001), Hansel and Mato (2001), and Latham et al. (2000):

τ dx/dt = a0 (x − xrest)(x − xc) + R I(t),    (3.5)
where xrest is the resting membrane potential, a0 and R are positive constants, and xc > xrest. When there is no external stimulus (I(t) = 0) and the initial condition satisfies x(t) < xc, the membrane potential converges to the resting membrane potential xrest. For x(t) > xc, the membrane potential increases so that an action potential is triggered. The parameter xc can therefore be interpreted as the critical potential for firing. From a biophysical point of view, xc can be considered a reversal potential due to the different ion concentrations between the interior of the cell membrane and its surrounding liquid. In the QIF model, the relationship between a weak stimulus and the equilibrium of the membrane potential is a quadratic function. One motivation for us to introduce the QIF model is to have a similar equation for the inhibitory rebound mechanism; another reason to use the QIF instead of the LIF model is the capability of a QIF model to generate a large number of stable periodic patterns. In particular, we have used both LIF and QIF models for intensive simulations, and our simulations show that in the excitable regime, both LIF and QIF models with an inhibitory rebound mechanism (described below) generate similar results. However, in the periodic regime, the QIF model can generate many more stable patterns than the LIF model. Furthermore, it is much easier to choose relevant parameter values that generate stable patterns with the QIF model, and the generated patterns are more distinguishable.

3.6 Inhibitory Rebound Spike. When an inhibitory input is imposed on a group of neurons for a short period of time and is then removed, some neurons respond with one or more inhibitory rebound spikes. The rebound occurs at the rebound time tb, when the inhibitory input makes the membrane potential less than the so-called rebound threshold ϑ2 and when the input is switched off, so that I(tb⁻) < 0 and 0 ≤ I(tb) < Imax, where Imax is the maximum value of an input that cannot make a neuron fire. The QIF model alone does not give inhibitory rebound spikes, which seem to be a very important factor in permitting spike propagation in the excitable regime of recurrent inhibitory loops. To provide this inhibitory rebound mechanism, we need to introduce the following rebound equation,

τ dx/dt = a0 (x − xI)(x − xc) + R I(t),    (3.6)
where xc > xI , xI is the critical reversal potential of inhibitory rebound and xc is the critical reversal potential of the firing introduced above. The rebound equation describes the dynamical behavior of the neuron from the rebound time tb to the next spiking time. After the inhibitory input is switched off, if x < xI and if xI < ϑ1 , then the membrane potential x(t) increases and finally approaches its equilibrium xI when I (t) = 0, or reaches higher potential than xI when I (t) > 0. If xI > ϑ1 (the firing threshold) and
2136
J. Ma and J. Wu
when the membrane potential reaches its firing threshold ϑ1 , the neuron fires and creates an action potential. We can now formulate our rules that govern the evolution of the neuron potential for recurrent inhibitory loops. Let xrest = 0. Then we have x (t) = β(x − µ)(x − γ ) − F (x(t − τ )) + Is ,
(3.7)
where µ=
xI
if x(t) ≤ ϑ2 and the inhibitory input is switched off
xrest = 0
otherwise, (3.8)
where β is a positive constant, ϑ2 < 0 is the rebound threshold, xI > ϑ1 is the reversal potential of inhibitory rebound, and γ is the reversal potential of the firing. After the membrane potential reaches the firing threshold ϑ1 at t = t f , the potential time course is given by x(t) = x f (t), x(t) = xabs (t),
t ∈ [t f , s2 ] t ∈ [s2 , s2 + dabs ],
where xabs (t) is given by d xabs (t) = β(xabs − xrest )(xabs − γ ), dt
t ∈ [s2 , s2 + dabs ],
(3.9)
with dabs being the length of the absolute refractory period. Consequently, the rebound mechanism, the firing mechanism, and the absolute refractory period as intrinsic properties of a neuron are all incorporated into the QIF model. Figure 5 shows three spike patterns (left) in the excitable regime Is = 0 generated by the quadratic integrate-and-fire model, in comparison with the Hodgkin-Huxley model (right). For the QIF model, parameters are τ = 116 msec, β = 0.08, c = 10 mV, a = 0.9, s1 − t f = 0.60 msec, s2 − s1 = 2.7 msec, ϑ1 = 1.2 mV, Vr = −1.1 mV, ϑ2 = −0.8 mV, xI = 2.5, γ = 3.0, and dabs = 1.1 msec. We note that solutions of the QIF model are very similar to those of the HH model. If spikes in an initial function φ are sufficiently separated, spikes will propagate and finally form stable periodic solutions as shown in Figures 5a and 5b. In such cases, different initial functions give rise to different solutions. If spikes in an initial function φ are too close, as in Figure 5c, spikes will eventually merge. There are no coexisting periodic attractors in the excitable regime. We point out that the rebound mechanism enables IPSP to bring the system state across the firing threshold separatrix,
Multistability in Neuron Models
100
2137
(a)
100
50
50
MEMBRANE POTENTIAL (mV)
0
0 τ
100
τ
0
2τ
3τ
4τ
τ
5τ
(b)
100
50
0
τ
2τ
3τ
4τ
5τ
0
τ
2τ
3τ
4τ
5τ
0
τ
2τ
3τ
4τ
5τ
(b)
50
0
0 τ
100
(a)
τ
0
2τ
3τ
4τ
τ
5τ
(c)
100
50
(c)
50
0
0 τ
τ
0
2τ
3τ
4τ
5τ
τ
TIME (MSEC)
Figure 5: Three spike train patterns generated by the QIF model (left), in comparison with the Hodgkin-Huxley model (right) in the excitable regime Is = 0. For the QIF model, parameters are τ = 116 msec, β = 0.08, c = 10 mV, a = 0.9, s1 − t f = 0.60 msec, s2 − s1 = 2.7 msec, ϑ1 = 1.2 mV, Vr = −1.1 mV, ϑ2 = −0.8 mV, xI = 2.5, γ = 3.0, and dabs = 1.1 msec. In order to compare with the HH model, values of the QIF model are amplified 10 times. If spikes in an initial function are sufficiently separated (a, b), spikes will propagate without merging. If spikes are too close (c), spikes will merge.
following the release of an inhibition so that spikes in initial functions can propagate in future time. Figure 6 shows, in the periodic regime Is = 0.38 µA, five coexisting attracting periodic solutions generated by the QIF model with parameters τ = 116 msec, β = 0.08, c = 10 mV, a = 0.9, s1 − t f = 0.60 msec, s2 − s1 = 2.7 msec, ϑ1 = 1.2 mV, Vr = −1.1 mV, ϑ2 = −0.8 mV, xI = 2.5, γ = 3.0, and dabs = 1.1 msec. After 50τ of transients, the periodic solutions are output in six delay intervals. The number of attractive periodic solutions composed of two w oscillations and nine v oscillations within one period is 5, and this is not a coincidence. It is equal to the number of different ways to arrange (11)(9) two w’s and nine v’s in a ring: 211 9 = 5, and all possible arrangements of two w’s and nine v’s in a ring are
wwvvvvvvvvv, wvwvvvvvvvv, wvvwvvvvvvv, . wvvvwvvvvvv, wvvvvwvvvvv
2138
10
J. Ma and J. Wu
(a)
(a)
5 0
MEMBRANE POTENTIAL (mV)
τ 10
0 50τ
51τ
52τ
53τ
54τ
55τ
56τ
(b)
(b)
5 0
τ 10
0 50τ
51τ
52τ
53τ
54τ
55τ
56τ
(c)
(c)
5 0
τ 10
0 50τ
51τ
52τ
53τ
54τ
55τ
56τ
(d)
(d)
5 0
τ 10
0 50τ
51τ
52τ
53τ
54τ
55τ
56τ
(e)
(e)
5 0
τ
0 50τ
51τ
52τ
53τ
54τ
55τ
56τ
TIME (MSEC)
TIME (MSEC)
Figure 6: Five coexisting attracting periodic solutions of patterns: 2w9v, 1w1v1w8v, 1w2v1w7v, 1w3v1w6v, and 1w4v1w5v, generated by the QIF model in the periodic regime Is = 0.38 µA. Parameters are the same as those used in Figure 5. The right-hand side is the blow-up of the solutions in a given period (not delay τ ) to clearly illustrate the patterns of solutions. 10
(a)
(a)
5 0
MEMBRANE POTENTIAL (mV)
τ 10
0 50τ
51τ
52τ
53τ
54τ
55τ
56τ
(b)
(b)
5 0
τ 10
0 50τ
51τ
52τ
53τ
54τ
55τ
56τ
(c)
(c)
5 0
τ 10
0 50τ
51τ
52τ
53τ
54τ
55τ
56τ
(d)
(d)
5 0
τ 10
0 50τ
51τ
52τ
53τ
54τ
55τ
56τ
(e)
(e)
5 0
τ
0 50τ
51τ
52τ
53τ
TIME (MSEC)
54τ
55τ
56τ
TIME (MSEC)
Figure 7: Five other patterns generated by the QIF model in the periodic regime Is = 0.38 µA. Parameters are identical to those in Figure 6. (a) Pattern 4w6v. (b) Pattern 5w5v. (c) Pattern 4w1v2w2v. (d) Pattern 6w1v1w1v. (e) Pattern wwwwwwwww. The right-hand side is the blow-up of the solutions in a given period (not delay τ ) to clearly illustrate the patterns of solutions.
Figure 7 displays five other patterns: 4w6v, 5w5v, 4w1v2w2v, 6w1v1w1v, and wwwwwwwww. Similar to the linear integrate-and-fire model, equation 3.7 can be integrated analytically. Let I = Is − F (x(t − τ )). In the intervals where I is a
Multistability in Neuron Models
2139
Table 1: Number of Stable Patterns and Number of Distinct Periods When τ = nT. τ
Number of Patterns
Number of Periods
τ
Number of Patterns
Number of Periods
1 2 2 3
1 2 2 3
5T 6T 7T 8T
4 6 8 13
3 4 4 5
T 2T 3T 4T
constant, equation 3.7 can be written in the form dx = β(x − λ1 )(x − λ2 ), dt
(3.10)
where µ+γ λ1 = + 2
µ−γ 2
2
I − β
and
µ+γ λ2 = − 2
µ−γ 2
2 −
I . β
The solution of the above equation is given by
x(t) =
µ+γ 2 + λ + 2
x(t0 )− µ+γ 2 1−β(x(t0 )− µ+γ 2 )(t−t0 )
(x(t0 )−λ2 )(λ1 −λ2 ) x(t0 )−λ2 −(x(t0 )−λ1 )e β(λ1 −λ2 )(t−t0 )
if (µ − γ )2 =
4I β
;
if (µ − γ )2 >
4I β
,
(3.11)
where t0 is an initial time. A similar but slightly more complicated formula can be derived if (µ − γ )2 < 4Iβ . Therefore, the trajectory for a given initial function φ can be analytically obtained. However, for most initial functions, it seems that trajectories exhibit quite exotic transient behaviors before they eventually become periodic. Using the above analytical expression for solutions of the QIF model, we can calculate the number of stable periodic patterns and the number of distinct periods of stable periodic patterns, listed in Table 1 when τ = nT, n is an integer from 1 to 8, and T is the intrinsic spiking period. We explain how Table 1 is obtained at the end of this section. We point out that when τ = nT, the results become more complicated due to two different types of W oscillation coexisting in one stable pattern and due to the transition from one subset of stable patterns to another at certain critical values of the time delay τ . We defer detailed discussions to future work but list in Tables 2, 3, and 4 some illustrative examples when τ ∈ (T, 2T), τ ∈ (2T, 3T), and τ ∈ (3T, 4T). In particular, in the region (T, 2T), multistability occurs only in the subinterval τ ∈ (T + 0.5488T, T + 0.5552T].
2140
J. Ma and J. Wu
Table 2: Number of Stable Patterns and Number of Distinct Periods When τ ∈ (T, 2T). τ (T, T + 0.1815T] (T + 0.1815T, T + 0.3890T] (T + 0.3890T, T + 0.5488T] (T + 0.5488T, T + 0.5552T] (T + 0.5552T, T + 0.6329T] (T + 0.6329T, 2T)
Number of Patterns
Number of Periods
1 1 1 2 1 1
1 1 1 2 1 1
Table 3: Number of Stable Patterns and Number of Distinct Periods When τ ∈ (2T, 3T). τ (2T, 2T + 0.1815T] (2T + 0.1815T, 2T + 0.3890T] (2T + 0.3890T, 2T + 0.5488T] (2T + 0.5488T, 2T + 0.5552T] (2T + 0.5552T, 2T + 0.6329T] (2T + 0.6329T, 2T + 0.7586T] (2T + 0.7586T, 2T + 0.8486T] (2T + 0.8486T, 2T + 0.8767T] (2T + 0.8767T, 2T + 0.9290T] (2T + 0.9290T, 3T)
Number of Patterns
Number of Periods
2 2 2 4 3 2 1 2 2 1
2 2 2 3 2 2 1 2 2 1
Table 4: Number of Stable Patterns and Number of Distinct Periods When τ ∈ (3T, 4T). τ (3T, 3T + 0.1815T] (3T + 0.1815T, 3T + 0.3890T] (3T + 0.3890T, 3T + 0.5488T] (3T + 0.5488T, 3T + 0.5552T] (3T + 0.5552T, 3T + 0.6329T] (3T + 0.6329T, 3T + 0.8486T] (3T + 0.8486T, 3T + 0.8767T] (3T + 0.8767T, 3T + 0.9290T] (3T + 0.9290T, 4T)
Number of Patterns
Number of Periods
2 2 2 5 4 3 6 4 2
2 2 2 3 2 3 4 4 2
In the region (2T, 3T), multistability cannot occur in the intervals τ ∈ (2T + 0.7586T, 2T + 0.8486T] ∪ (2T + 0.9290T, 3T). And in region (3T, 4T), multistability occurs everywhere. 3.7 Phase Resetting Properties. The effect of the IPSP in the QIF model can be summarized by the phase reset curve (PRC). PRC describes the phase shift related to the phase at which the IPSP arrives in the neuron’s spike
Multistability in Neuron Models 1
phase resetting curve approximation no phase resetting
0.9 0.8 0.7 New Phase Φ’
2141
Firing and Absolute Refractoriness
0.6 0.5
Firing Procedure
0.4 0.3
C 0.2
A B
0.1 0
0
0.2
0.4
Phase Φ
0.6
0.8
1
Figure 8: The phase-resetting curve (point curve), its approximation (solid curve) of the QIF model, and the dotted line of no phase resetting ( = ).
time. The phase at which the IPSP arrives is
=
tp − t f , T
(3.12)
where tp is the time that the IPSP is delivered, t f is the firing time of the last spike, and T is the intrinsic spiking period (Foss & Milton, 2000). Following the arrival of the IPSP, the phase has been reset by an amount called the phase reset, which is defined by =1−
t1 − t f , T
(3.13)
where t1 is the firing time of the next spike. The new phase is easily calculated by
= + .
(3.14)
Figure 8 shows the phase-resetting curve (point curve) of the QIF model, a plot of the new phase versus the original phase , and the approximation (solid curve) and the dotted line of no phase resetting ( = ) when T = 10.54 msec and other parameters are identical to those used in
2142
J. Ma and J. Wu
Figure 7. From the original point (0,0) to point A, the new phase is equal to the original phase ( = ) because the IPSP does not affect the membrane potential during the period of firing. From point A to B, the new phase ( ) decreases as the original phase ( ) increases because the absolute refractoriness reduces the effective timing of the feedback that affects the excitatory neuron. After the absolute refractoriness, from point B to point C, the full IPSP affects the excitatory neuron, and the new phase increases. The approximation was obtained by a least-square fitting to the phase-reset curve that gives rise to ( ) =
0 a 3 3 + a 2 2 + a 1 + a 0
if < 0.1575 otherwise,
(3.15)
where a 3 = −0.8287, a 2 = 1.7939, a 1 = −2.0261, and a 0 = 0.27859, and the root mean square (rms) error is 0.003577. Each time the neuron fires a spike, an IPSP or feedback is delivered at time τ later. We denote n as the phase of the nth IPSP due to a spike Sn . The phase n is determined by τ ( n−i ) − N, + T k
n =
(3.16)
i=1
where k is the number of other IPSPs in the interval between n and the spike Sn , and N is the number of spikes in this interval (see details in Foss & Milton, 2000). We introduce a new variable , defined by = + N.
(3.17)
Then equation 3.16 can be rewritten as τ ( n−i ) = F ( n−1 , n−1 , ..., n−k ), + T k
n =
(3.18)
i=1
where i = i mod 1. When a pattern is asymptotically approached, approaches one of its fixed points ∗ , which are the solutions of the equation f ( ) = g( ),
(3.19)
where f ( ) = −
τ , T
and
g( ) = k( ),
Multistability in Neuron Models
2143
2 g(Ψ) τ=T τ=2T f(Ψ)
f(Ψ) and g(Ψ)
1
τ=3T
τ=4T
D
0
τ=5T
τ=6T
τ=7T
τ=8T
7
8
9
G
E
H
F I J
0
1
2
3
4
5 6 Variable Ψ
Figure 9: Critical points of equation 3.19 for different values of the time delay τ . Function f ( ) is the dotted line, and function g( ) is the solid curve. Critical points for equation 3.19 are 1 for τ = T, 2 for τ = 2T, 2 for τ = 3T, 3 for τ = 4T, 3 for τ = 5T, 4 for τ = 6T, 4 for τ = 7T, and 5 for τ = 8T. We label three critical points (D, E, F) for τ = 4T and four points (G, H, I, J) for τ = 6T.
k is the nearest integer that is rounded down to. Foss and Milton (2000) suggested that the condition for these solutions to be stable is that the slope S∗ of ( ) evaluated at the fixed point satisfies −1 < S∗ < k −1 . Figure 9 shows the plot of f ( ) and g( ) as functions of , where the approximation 3.15 is used. Critical points for equation 3.19 are 1 for τ = T, 2 for τ = 2T, 2 for τ = 3T, 3 for τ = 4T, 3 for τ = 5T, 4 for τ = 6T, 4 for τ = 7T, and 5 for τ = 8T. Having only one fixed point when τ < T implies that multistability seems unlikely to arise in such a case. These numbers of critical points coincide exactly with the numbers of distinct periods of stable patterns listed in Table 1. This coincidence suggests that each critical point represents a subset of stable periodic solutions, in which different periodic solutions have the same period rather than just one stable pattern. All periodic solutions in such a subset have the same configuration; they are composed of the same number of w oscillations and the same number of v oscillations within one period, but the orders where w and v appear may be different. Knowing the configuration of each subset, we are able to calculate the number of periodic solutions in the subset and predict the patterns of these
2144
J. Ma and J. Wu
periodic solutions. To be more precise, we denote the set of all periodic solutions of τ = nT as nT and a subset composed of m w oscillations and h v oscillations within one period as m,h , where m + h ≤ n. When τ = 4T, 4T = 1,0 ∪ 0,1 ∪ 3,1 . There is only one periodic solution 1w in
1,0 and one periodic solution 1v in 0,1 . The number of periodic solutions (4)(3) in 3,1 is 3 4 3 = 1 and 3,1 = {3w1v}. Hence, 4T = {1w, 1v, 3w1v}. When τ = 6T, 6T = 1,0 ∪ 0,1 ∪ 1,1 ∪ 3,3 . For the subset 1,1 , the number of (2)(1) periodic solutions is 1 2 1 = 1, and the periodic solution is 1w1v1w1v1w1v. For the subset 3,3 , the number of periodic solutions composed of three (6)(3)−2 = 3 because two possible arrangements for w’s and three v’s is 3 63 pattern 1w1v (actually 1w1v1w1v1w1v has been counted in the subset 1,1 ) must be deducted; hence, 3,3 = {3w3v, 2w2v1w1v, 2w1v1w2v}. Using this approach to calculate the number of periodic solutions in a subset, we can (8)(5) calculate the number of periodic solutions for τ = 8T as 1 + 1 + 3 8 5 + (75)(22) (76)(11) + 7 = 13. 7 From Table 1, we conclude that both the number of stable patterns and the number of distinct periods of stable patterns increase as n increases when τ = nT and n is an integer. However, when τ = nT + t0 and 0 < t0 < T, the stable periodic patterns become more complicated because in addition to a v oscillation, two types of w oscillation may coexist in one pattern in a certain region (a neighborhood around t0 = 0.6T), which are denoted by W1 (down and up) and W2 (up, down, and up) (the notation is introduced to demonstrate the shape of oscillations after the absolute refractory period (from time s2 + dabs to the next firing time t f )). The coexistence of two types of w oscillations significantly increases the number of stable patterns, though it does not increase the number of distinct periods of stable patterns. This suggests that as far as multistability is concerned, we need to pay more attention to the distinct periods of stable patterns. Compared with the linear integrate-and-fire model (LIF), the neural intrinsic inhibitory rebound mechanism is incorporated into the quadratic integrate-and-fire model, and it allows spikes in an initial function to propagate in subsequent delay intervals in the excitable regime. In the periodic regime, both the firing and absolute refractory period are the key for simple phenomenological spiking neuron models to generate a large number of coexisting stable patterns. The firing determines the possible timing of the feedback, and the absolute refractory period determines whether the feedback has an impact on the excitatory neuron. The coexistence of three different oscillations W1 , W2 , and v significantly increases the number of stable periodic patterns but does not affect the number of distinct periods of stable periodic patterns. The number of critical points of the phase-resetting map coincides exactly with the number of the distinct periods of stable patterns we analytically calculated. Properties of the multiple
Multistability in Neuron Models
2145
distinct periods of stable patterns seem to be an important issue in the study of multistability. Perhaps it is worth mentioning that the quadratic integrate-and-fire neuron is closely related to the so-called θ -neuron, a type of canonical type I neuron (Latham et al., 2000), given by dϕ = (1 − cos ϕ) + I (1 + cos ϕ), dt
(3.20)
where I = I − Iθ and Iθ denotes the minimal current necessary for repetitive firing of the model. 3.8 More on the Number of Periodic Patterns. We now briefly explain how the results in Table 1 were obtained. The key is to determine what kind of subsets m,h (composed of m W oscillations and h V oscillations) can exist when τ = nT for a given integer n. Once this is done, we use the approach described above to calculate the number of stable patterns. First, we note that 0,1 ⊆ nT for any positive integer n, that is, periodic pattern 1V, can always be generated. Second, we remark that 1,0 ⊆ nT when n ≥ 2, that is, periodic pattern 1W can always be generated when n ≥ 2. The existence of the other subset m,h with m, h ≥ 1 depends on variables T, TF R , and Tϑ determined by the QIF model, where T is the intrinsic spiking period, Tϑ is the possible feedback timing, and TF R is the sum of the firing period and absolute refractory period. For a periodic pattern in m,h , all W oscillations have the same shape, characterized by (tup , Tϑ , t1ϑ )(up, down, up). Here the notation means that after the absolute refractoriness, the membrane potential first increases for tup msec, then decreases for Tϑ msec due to a feedback, and finally increases to the firing threshold ϑ1 for t1ϑ msec. The condition of the existence of such periodic patterns is 0 < tup =
(n + 1 − m − h)T − [TF R + (m − 1)(Tϑ + t1ϑ − T + TF R )] ≤ Tc m
for a parameter Tc determined by the QIF model. We are able to use the relationship among T, TF R , and Tϑ in our simulations to determine what kind of subset m,h can exist for a given n. For example, when τ = 4T, the above condition for the subset 3,1 is satisfied. Therefore, 4T = {1w, 1v, 3w1v}. 4 Discussion We have focused on the asymptotic behaviors of the excitation neuron in a recurrent inhibitory loop, and our emphasis is on the capability of such a single loop to generate multiple stable patterns. Although the wellknown Hodgkin-Huxley model has been previously used to address this
2146
J. Ma and J. Wu
multistability issue in Foss et al. (1996), such a dynamic system of four degrees of freedom is difficult to visualize and analyze. Our study shows that the key to generating a large number of coexisting stable patterns in simple phenomenological models is the incorporation of firing and the absolute refractory period. The effect of both time delay and effective timing of the feedback on the excitatory neuron determines the type of the multistability. The firing procedure determines the possible timing of the feedback that affects the excitatory neuron. The absolute refractoriness increases the complexity of stable periodic patterns by changing the effective timing of the feedback on the excitatory neuron. Our simulation results show that a recurrent inhibitory loop based on the simple phenomenological spiking neuron models exhibits coexisting multiple attracting periodic solutions in the periodic regime. These solutions can be categorized in terms of the symbols w and v, and the corresponding periodic patterns can be predicted by considering different ways of arranging w and v in a ring. The inhibitory rebound mechanism seems to play a very important role in spike propagation in the excitable regime for recurrent inhibitory loops. The quadratic integrate-and-fire model (QIF) captures very well this inhibitory rebound period using the rebound equation. The phase-resetting curve provides a powerful tool to study multistability in recurrent inhibitory loops based on simple phenomenological spiking neuron models. Our study shows that the number of fixed points of the phase-resetting map coincides with the number of distinct periods of stable patterns rather than the number of stable patterns when τ = nT and n is an integer. With the help of the configurational information corresponding to these critical points, we are able to calculate the number of the stable periodic patterns and predict these stable periodic patterns. For the recurrent inhibitory loop discussed in this article, the effect of the inhibitory postsynaptic potential (IPSP) from the inhibitory neuron I to the excitatory neuron E is simplified as a delayed self-feedback of the membrane potential of the excitatory neuron E. Under this simplification, we need only to consider dynamical behaviors of the single excitatory neuron E, and hence the model equation takes the form of a scalar delay differential equation. However, a more realistic approach for such a twoneuron network must be based on the coupled differential equations,
x (t) = β(x − µ)(x − γ ) − F (y(t)) + Is y (t) = β(y − µ)(y − γ ) + F (x(t − τ )),
(4.21)
with the firing time t f x and t f y
t f x : x(t) = ϑ1 t f y : y(t) = ϑ1
and and
x (t)|t=t f x > 0 y (t)|t=t f y > 0,
(4.22)
Multistability in Neuron Models
2147
for the excitatory neuron and the inhibitory neuron, where x(t) is the membrane potential of the excitatory neuron E, y(t) is the membrane potential of the inhibitory neuron I , and F (x) is the signal function. Detailed study of equations 4.21 remains a challenging task for the future. Acknowledgments This research was partially supported by Canada Research Chairs Program, by National Sciences and Engineering Research Council of Canada, and by Mathematics for Information Technology and Complex Systems. We thank the two reviewers whose comments led to a substantial improvement of our manuscript. References an der Heiden, U., & Mackey, M. C. (1982). The dynamics of production and destruction: Analytic insight into complex behaviour. Math. Biol., 16, 75–101. Canavier, C., Baxter, D., Clark, J., & Byrne, J. (1994). Multiple modes of activity in a model neuron suggest a novel mechanism for the effects of neuromodulators. J. Neurophysiol., 72, 872–882. Canavier, C., Butera, R., Dror, R., Baxter, D., Clark, J., & Byrne, J. (1999). Phase response characteristics of model neurons determine which patterns are expressed in a ring circuit model of gait generation. Bio. Cybern., 77, 367–380. Chen, Y., & Wu, J. (1999). Minimal instability and unstable set of a phase-locked periodic orbit in a delayed neural network. Phys. D, 134, 185–199. Chen, Y., & Wu, J. (2000). Limiting profiles of periodic solutions of neural with synaptic delays. Singapore: World Scientific. Chen, Y., & Wu, J. (2001). Existence and attraction of a phase-locked oscillation in a delayed network of two neurons. Differential Integral Equations, 14, 1181–1236. Chen, Y., & Wu, J. (2001b). Slowly oscillating periodic solutions for a delayed frustrated network of two neurons. J. Math. Appl., 259, 118–208. Chen, Y., & Wu, J. (2001c). The asymptotic shapes of periodic solutions of a singular delay differential system. J. Differential Equations, 169, 614–632. Chen, Y., Wu, J., & Krisztin, T. (2000). Connecting orbits from synchronous periodic solutions in phase-locked periodic solutions in a delay differential system. J. Differential Equations, 163, 130–173. Feng, J. (2001). Is the integrate-and-fire model good enough: A review. Neural Comput., 13, 955–975. Foss, J., Longtin, A., Mensour, B., & Milton, J. (1996). Multistability and delayed recurrent loops. Phys. Rev. Lett., 76, 708–711. Foss, J., & Milton, J. (2000). Multistability in recurrent neural loops arising from delay. J. Neurophysiol., 84(2), 975–985. Foss, J., Moss, F., & Milton, J. (1997). Noise, multistability, and delayed recurrent loops. Phys. Rev. E, 55, 4536–4543. Gerstner, W., & Kistler, W. M. (2002). Spiking neuron models, single neurons, populations, plasticity. Cambridge: Cambridge University Press.
2148
J. Ma and J. Wu
Hansel, D., & Mato, G. (2001). Existence and stability of persistent states in large neuronal networks. Phys. Rev. Lett., 86, 4175–4178. Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci., 79, 2554–2558. Hopfield, J. J. (1984). Neurons with grades response have collective computational properties like those of two-state neurons. Proc. Natl. Acad. Sci., 81, 3088–3092. Hoppensteadt, F., & Izhikevich, E. (1997). Weakly connected neural networks. New York: Springer-Verlag. Ikeda, K., & Matsumoto, K. (1987). High-dimensional chaotic behavior in systems with time-delayed feedback. Physica D, 29, 223–235. Latham, P. E., Richmond, B. J., Nelson, P. G., & Nirenberg, S. (2000). Intrinsic dynamics in neuronal networks. I. theory. J. Neurophysiol., 83, 808–827. Losson, J. (1991). Multistability and probabilistic properties of delay differential equations. Unpublished master’s thesis, McGill University. Losson, J., Mackey, M. C., & Longtin, A. (1993). Solution multistability in first-order nonlinear differential delay equations. Chaos, 3, 167–176. Mackey, M. C., & Glass, L. (1997). Oscillation and chaos in physiological control systems. Science, 197, 287–289. Mackey, M. C., & an der Heiden, U. (1984). The dynamics of recurrent inhibition. Math. Biol., 19, 211–225. Mackey, M. C., & Milton, J. G. (1987). Dynamical diseases. Ann. New York Acad. Sci., 504, 16–32. Milton, J. (2000). Epilepsy: Multistability in a dynamic disease. In J. Walleczed (Ed.), Self-organized biological dynamics and nonlinear control (pp. 374–386). Cambridge: Cambridge University Press. Morita, M. (1993). Associative memory with non-monotone dynamics. Neural Networks, 6, 115–123. Schindler, K., Bernasconi, C., Stoop, R., Goodman, P., & Douglas, R. (1997). Chaotic spike patterns evoked by periodic inhibition of rat cortical neurons. Z. Naturforsch, 25a, 509–512. Traub, R. D., & Miles, R. (1991). Neuronal networks of the hippocampus. Cambridge: Cambridge University Press.
Received February 5, 2006; accepted November 29, 2006.
LETTER
Communicated by K. K. Tan
Analysis and Design of Associative Memories Based on Recurrent Neural Networks with Linear Saturation Activation Functions and Time-Varying Delays Zhigang Zeng [email protected] School of Automation, Wuhan University of Technology, Wuhan, Hubei, 430070, China, and Department of Mechanical and Automation Engineering, Chinese University of Hong Kong, Shatin, New Territories, Hong Kong
Jun Wang [email protected] Department of Mechanical and Automation Engineering, Chinese University of Hong Kong, Shatin, New Territories, Hong Kong
In this letter, some sufficient conditions are obtained to guarantee recurrent neural networks with linear saturation activation functions, and time-varying delays have multiequilibria located in the saturation region and the boundaries of the saturation region. These results on pattern characterization are used to analyze and design autoassociative memories, which are directly based on the parameters of the neural networks. Moreover, a formula for the numbers of spurious equilibria is also derived. Four design procedures for recurrent neural networks with linear saturation activation functions and time-varying delays are developed based on stability results. Two of these procedures allow the neural network to be capable of learning and forgetting. Finally, simulation results demonstrate the validity and characteristics of the proposed approach. 1 Introduction Recurrent neural networks with linear saturation activation functions (RNNs), including cellular neural networks (CNNs) as a special case, are dynamical systems with rich neurodynamic properties. In addition, RNNs are also suitable for realizing large-scale associative memories. Associative memories are designed to store a set of desired patterns as stable equilibria such that the stored patterns can be retrieved when the initial probes contain sufficient information about the patterns. The desired memory patterns in associative memories are usually represented by bipolar vectors (or binary vectors). For instance, cellular neural networks (CNNs) are directly regarded as associative memories in Michel and Farrell (1990), Seiler, Schuler, and Nossek (1993), Brucoli, Carnimeo, and Grassi (1995), Liu Neural Computation 19, 2149–2182 (2007)
C 2007 Massachusetts Institute of Technology
2150
Z. Zeng and J. Wang
(1997), Liu and Lu (1997), and Lu and Liu (1998). The synthesis of CNNs with space-invariant cloning templates is discussed in Liu (1997). A design algorithm based on the eigenstructure method was developed in Michel and Farrell (1990). The realization of associative memories was considered in Lu and Liu (1998) via the class of (zero-input) CNNs introduced in Chua and Yang (1988), and a new synthesis procedure (design algorithm) for CNNs with space-invariant cloning templates based on the well-known perceptron training procedure was developed in Lu and Liu (1998). Using the eigenstructure method developed in Li, Michel, and Porod (1989a, 1989b) and Michel, Si, and Yen (1991), one can guarantee that any set of given memory patterns be stored as memory vectors. However, it is usually required that k < n, where k is the number of the desired memory patterns and n is the number of neurons. Simulation results show that when k is very close to n − 1, there will be many spurious equilibria in the synthesized associative memory. Experimental studies that compare the eigenstructure method (implemented on various types of artificial neural networks) with other methods have been conducted in several previous works (see, e.g., Personnaz, Guyon, & Dreyfus, 1985, 1986; Li, Michel, & Porod, 1988, 1989a, 1989b; Michel, Farrell, & Sun, 1990; Michel & Gray, 1990; Waugh, Marcus, & Westervelt, 1990; Perfetti, 1991; Salam, Wang, & Choi, 1991; Yen, & Michel, 1991, 1992; Liu & Michel, 1994; Park, Cho, & Park, 1999; Park & Park, 2000). These works indicate that the capacity of neural network paradigms making use of the eigenstructure method compares rather well with other paradigms (which make use, e.g., of the outer product method, pseudo-inverse techniques, and the like). In many engineering applications and hardware implementations of neural networks, time delays—even varying delays in neural signal transmission—are often inevitable. Neural networks with time-varying delays thus deserve in-depth investigation. The estimation of the total number of spurious equilibria is also interesting and important for research. Until now, no formula on computing has been reported. One of the desirable features of neural associative memories is their learning capability (the capability to store further asymptotically stable equilibrium points) and forgetting capability (the ability to delete specified equilibrium points from a given set of stored vectors without affecting the existing equilibria). For this reason, design techniques for associative memories with learning and forgetting capabilities via artificial neural networks have received increasing interest (Yen & Michel, 1991, 1992; Brucoli et al., 1995). Until now, these kind of memories have been designed by recomputing all of the interconnection matrices. The stability of DRNN (recurrent neural network with delays) has been widely investigated (e.g., Liao & Wang, 2003; Yi, Tan, & Lee, 2003; Zeng, Wang, & Liao, 2003, 2004; Zeng & Wang, 2006a, 2006b). Most existing studies (e.g., Liao & Wang, 2003) focus on the global stability of DRNN, which
Associative Memories Based on Recurrent Neural Networks
2151
implies only one equilibrium point. However, in associative memories, DRNN must have as many stable equilibrium points as possible. Motivated by the above discussions, one of our aims in this letter is to provide some new design procedures for recurrent neural networks with linear saturation activation functions and time-varying delays (DRNN) based on attractability of multiequilibria. In addition, the convergence can be guaranteed by some stability theorems. These design procedures have the following features:
r r r
The number of the desired memory patterns can be greater than n. By using the new design procedures, the total number of spurious equilibria can be made as small as possible, and a formula on computing numbers of spurious equilibria can also be given. Design procedures of DRNN with learning and forgetting capabilities without recomputing the interconnection matrices can be obtained by using varying external inputs.
The rest of this letter is organized as follows. Section 2 describes some preliminaries. Section 3 elucidates the DRNN stability results for associative memories. In sections 4 and 5, some sufficient conditions are obtained that guarantee that DRNNs with variable or fixed external inputs have one or multiple stable memory patterns. In section 6, four design procedures of DRNN for pattern memory are presented. In section 7, two illustrative examples are given to demonstrate the effectiveness of our proposed approach. Finally, some concluding remarks and directions for future research are presented in section 8. 2 Preliminaries 2.1 Model. Consider the DRNN model, d xi (t) a ij f (x j (t)) + b ij f (x j (t − τij (t))) + ui , = −xi (t) + dt n
n
j=1
j=1
(2.1)
where i = 1, 2, . . . , n, x = (x1 , x2 , . . . , xn )T ∈ n is the state vector, A = (a ij ) ∈ n×n and B = (b ij ) ∈ n×n are connection weight matrices, 0 ≤ τij (t) ≤ τ is time delay, ∀r ∈ , f (r ) =
1 (|r + 1| − |r − 1|), 2
and u = (u1 , . . . , un )T ∈ n is the external input vector.
(2.2)
2152
Z. Zeng and J. Wang
Let C([t0 − τ, t0 ], D) be the space of continuous functions mapping [t0 − τ, t0 ] into D ⊂ n with the norm defined by ||φ||t0 = max { sup
1≤i≤n u∈[t −τ,t ] 0 0
|φi (u)|},
where φ(r ) = (φ1 (r ), φ2 (r ), . . . , φn (r ))T . The initial condition of DRNN, equation 2.1, is assumed to be φ(ϑ) = (φ1 (ϑ), φ2 (ϑ), . . . , φn (ϑ))T , where φ(ϑ) ∈ C([t0 − τ, t0 ], D). Denote x(t; t0 , φ) as the state of equation 2.1 with the above initial condition. It means that x(t; t0 , φ) is continuous and satisfies equation 2.1 and x(ϑ; t0 , φ) = φ(ϑ), for ϑ ∈ [t0 − τ, t0 ]. Define a region as follows: n (i) (i) (−∞, −1]δ × [1, +∞)1−δ , δ ( j) = 1 or 0, j = 1, 2, . . . , n . = i=1 Hence, is made up of 2n elements. A vector s will be called a (stable) memory vector (or simply a memory ) of DRNN if s = f (β), where β is an asymptotically stable equilibrium point of DRNN, equation 2.1. Denote B n as the set of n-dimensional bipolar vectors: B n = {−1, 1}n = {x ∈ n , xi = 1 or − 1, i = 1, 2, . . . , n}. Hence, B n is made up of 2n elements. For any (s1 , s2 , . . . , sn )T ∈ B n , define an operator: L(si ) =
[1, ∞),
si = 1,
(−∞, −1],
si = −1.
Consequently, (s1 , s2 , . . . , sn )T , and is a one-to-one correspondence.
n i=1
L(si ) = L(s1 ) × L(s2 ) × · · · × L(sn )
2.2 Properties and Design Problem Definition 1. An equilibrium point x ∗ of equation 2.1 is said to be locally exponentially stable in region D, if there exist constants α > 0, β > 0 such that ∀t ≥ t0 , x(t; t0 , φ) − x ∗ ≤ β||φ||t exp{−α(t − t0 )}, 0
Associative Memories Based on Recurrent Neural Networks
2153
where x(t; t0 , φ) is the solution of DRNN, equation 2.1 with any initial condition φ(ϑ) ∈ D, ϑ ∈ [t0 − τ, t0 ], and D is said to be a locally exponentially attracting set of the equilibrium point x ∗ . When D = n , x ∗ is said to be globally exponentially stable. Definition 2. An equilibrium point x ∗ is said to be an isolated equilibrium point of equation 2.1, if there exists δ > 0 such that ∀x˜ ∈ {x| 0 < ||x − x ∗ || < δ, x ∈ Rn }, x˜ is not an equilibrium point of equation 2.1. Design problem. Given k (k ≤ 2n ) vectors s (1) , s (2) , . . . , s (k) ∈ B n , choose the connection weight between the ith and jth neurons a ij , b ij , and ui such that s (1) , s (2) , . . . , s (k) are stable memory vectors of DRNN. Definition 3. σ = (σ1 , σ2 , . . . , σn )T is said to be an index of memory if ∀i ∈ {1, 2, . . . , n}, σi ≥ 1. σ = (σ1 , σ2 , . . . , σn )T is said to be an index of forgetting if there exists i ∈ {1, 2, . . . , n} such that σi < 1. 2.3 Notations. Throughout this letter, we always assume that 2n vecn tors s (1) , s (2) , . . . , s (2 ) in B n satisfy that ∀i, j ∈ {1, 2, . . . , 2n }, i = j, s (i) = s ( j) . Denote ∂ as a boundary of the set and int as the interior of the () () () set . Denote the desired memory patterns s () = (s1 , s2 , . . . , sn )T for = 1, 2, . . . , k. () For i, j ∈ {1, 2, . . . , n}, ∈ {1, 2, . . . , k}, let σi and σi be constants: c ij = c˜ ij =
a ii + b ii − σi , i = j, a ij + b ij ,
i = j;
a ii + b ii + ui si , i = j, a ij + b ij ,
i = j.
Denote E as an n × n identity matrix, and for k ∈ {1, 2, . . . , 2n }, S(k) = s (1) , s (2) , . . . , s (k) n×k , T (k) (k) s¯ (k) = s1 (s (k) )T , s2 (s (k) )T , . . . , sn(k) (s (k) )T 1×(n×n) , (k) S¯ Bn = s¯ (1) , s¯ (2) , . . . , s¯ (k) (n×n)×k ,
νi = (a i1 + b i1 , a i2 + b i2 , . . . , a in + b in )T , T ν1 , 0, ..., 0 ν2T , . . . , 0 0, V= ···, T 0, 0, . . . , νn n×(n×n),
2154
Z. Zeng and J. Wang
u1 ,
u1 ,
. . . , u1
u2 , u2 , . . . , u2 ˜k = U ..., un , un , . . . , un n×k, (1) (1) (2) (2) (k) (k) σ1 s1 , σ1 s1 , . . . , σ1 s1 (1) (1) σ2 s2 , σ2(2) s2(2) , . . . , σ2(k) s2(k) Wk = ..., (1) (1)
σn sn ,
(1)
u1 s1 , (1) (k) u2 s2 , UBn = ..., (1) un sn ,
(2) (2)
(k) (k)
σn sn , . . . , σn sn n×k, (2) (k) u1 s1 , . . . , u1 s1 (2) (k) u2 s2 , . . . , u2 s2 (2) (k) un sn , . . . , un sn n×k.
When k ≤ n and Rank{s (1) , s (2) , . . . , s (k) } = k, choose s (k+1) , s (k+2) , . . . , s (n) ∈ B n such that Rank{s (1) , . . . , s (k) , s (k+1) , . . . , s (n) } = n. Hence, S(k)T S(k) and S(n) are invertible matrices. Denote ˜ n )S(n)−1 ; P = ( pij )n×n = (Wn − U ˜ k )(S(k)T S(k))−1 S(k)T . P˜ = ( p˜ ij )n×n = (Wk − U When k > n and Rank{s (1) , s (2) , . . . , s (k) } = n, S(k)S(k)T is an invertible matrix. Denote ˜ k S(k)T (S(k)S(k)T )−1 ; Q = (q ij )n×n = −U ˜ = (q˜ ij )n×n = (Wk − U ˜ k )S(k)T (S(k)S(k)T )−1 . Q 3 Stability Results for Associative Memories Denote N1 , N2 , . . . , N2n −1 , N2n as all subsets of {1, 2, . . . , n}. Let Nq = {l1 , l2 , . . . , lmq }, where for q = 2n , mq ∈ {1, 2, . . . , n} and for i = j, li = l j , li , l j ∈ {1, 2, . . . , n} (i, j ∈ {1, 2, . . . , mq }). Let N2n be an empty set and N1 = {1, 2, . . . , n}, that is, m1 = n. For q = 2n , let Nq (s) = {(x1 , . . . , xn )| xi si > 1, i ∈ Nq ; xi = si , i ∈ Nq }, N1 (s) = {(x1 , . . . , xn )| xi = si , i ∈ {1, 2, . . . , n}};
(3.1)
Associative Memories Based on Recurrent Neural Networks
2155
that is, N1 (s) degrades into a point. Obviously, ∂s =
n 2 −1
Nq (s).
(3.2)
q =1
Hence, ∂s is made up of 2n − 1 parts, and s is made up of 2n parts. For example, when n = 2, s = (1, 1)T , s = ints
{1} (s)
{2} (s)
{1,2} (s),
where {1} (s) = {(x1 , x2 )| x2 > 1; x1 = 1}, {2} (s) = {(x1 , x2 )| x1 > 1; x2 = 1}, {1,2} (s) = {(x1 , x2 )| x1 = 1; x2 = 1}. Denote A˜ = (a˜ ij )n×n , where a˜ ij =
a ii + b ii − 1, i = j a ij + b ij ,
i = j.
Let A˜ Nq be made up of the ith row and the ith column of matrix A˜ for all i ∈ Nq . n , DRNN, equation 2.1, has one Theorem 1. For any s = (s1 , s2 , . . . , sn )T ∈ B n isolated equilibrium point located in s = i=1 L(si ) if and only if one of the following conditions holds:
i. ∀i ∈ {1, 2, . . . , n}, n (a ij + b ij )s j + ui si > 1;
(3.3)
j=1
ii. There exists q ∈ {1, 2, . . . , 2n − 1}, such that n (a ij + b ij )s j + ui si > 1, j=1
i ∈ Nq ,
n (a ij + b ij )s j + ui = si ; det A˜ Nq = 0, i ∈ Nq . j=1
(3.4)
2156
Z. Zeng and J. Wang
In addition, if an equilibrium point x ∗ ∈ ints , then it is locally exponentially stable, and ints is locally exponentially attractive at x ∗ . Proof. x ∗ = (x1∗ , . . . , xn∗ )T is an equilibrium point of equation 2.1 if and only if xi∗ =
n (a ij + b ij ) f x ∗j + ui .
(3.5)
j=1 ∗ equation 3.5 and x ∗ ∈ ints , then xi∗ = nProof of necessity: If x satisfies ∗ j=1 (a ij + b ij )s j + ui , and xi si > 1. Hence, equation 3.3 holds. If x ∗ satisfies equation 3.5 and x ∗ ∈ ∂s , then from equation 3.2 and equation 3.5, there exists q ∈ {1, 2, . . . , 2n − 1}, such that xi∗ si > 1, i ∈ Nq ; xi∗ = ∗ si , i ∈ Nq , and xi = nj=1 (a ij + b ij )s j + ui . If det A˜Nq = 0, then x ∗ is not isolated. Hence, det A˜Nq = 0. Thus, equation 3.4 holds. Proof of sufficiency: Choose xi∗ = nj=1 (a ij + b ij )s j + ui . From equations 3.3 and 3.4, x ∗ ∈ s . In addition, −xi∗ + nj=1 (a ij + b ij ) f (x ∗j ) + ui = 0; that is, x ∗ is an equilibrium point located in s . Since in the saturation region, the equilibrium point of equation 2.1 is always isolated, if x ∗ ∈ ints , then x ∗ is isolated. If x ∗ ∈ ∂s , that is, there exists q ∈ {1, 2, . . . , 2n − 1}, such that x ∗ ∈ Nq , then there exists a small neighborhood of x ∗ such that equation 3.5 is equivalent to
∗ x = (a ij + b ij ) f x ∗j + (a ij + b ij )s j + ui , i ∈ Nq , i j∈Nq j ∈ Nq ∗ (a ij + b ij ) f x ∗j + (a ij + b ij )s j + ui , i ∈ Nq . xi =
(3.6)
j ∈ Nq
j∈Nq
Assume there exists x¯ ∗ = (x¯ 1∗ , x¯ 2∗ , . . . , x¯ n∗ )T , |x¯ i∗ | ≤ 1 (i = 1, 2, . . . , n) such that for i ∈ Nq , x¯ i∗ =
(a ij + b ij ) f x¯ ∗j + (a ij + b ij )s j + ui .
(3.7)
j ∈ Nq
j∈Nq
From equations 3.6 and 3.7, for i ∈ Nq , (a ij + b ij ) f x¯ ∗j − x ∗j . f x¯ i∗ − xi∗ = j∈Nq
Since det A˜Nq = 0, the solution of the above linear system is unique. Hence, x ∗ is isolated. Thus, DRNN, equation 2.1, has one isolated equilibrium point located in region s .
Associative Memories Based on Recurrent Neural Networks
2157
If x ∗ ∈ ints and φ(θ ) ∈ s , θ ∈ [t0 − τ, t0 ], then from equations 2.1 and 2.2, d xi (t) (a ij + b ij )s j + ui . = −xi (t) + dt n
(3.8)
j=1
Obviously, if equation 3.8 has an equilibrium point, then it is globally exponentially stable. Hence, x ∗ is locally exponentially stable, and ints is a locally and exponentially attracting region of x ∗ . Remark 1. If ∀i ∈ {1, 2, . . . , n}, a ii + b ii −
n
(|a ij | + |b ij |) − |ui | > 1,
(3.9)
j=1, j =i
1, DRNN, then for all s ∈ B n , equation 3.2 holds. According to theorem n equation 2.1, has one isolated equilibrium point located in s = i=1 L(si ). n Hence, it has 2 locally exponentially stable equilibrium points. Thus, if the connection weights of DRNN satisfy equation 3.9, then, the number of spurious equilibria in the associative memory may be very large. Reducing the number of spurious equilibria requires prior attention. Remark 2. According to theorem 1, when a diagonal dominance condition (see equation 3.9) does not hold, DRNN, equation 2.1, can also have multiple isolated equilibrium points. For example, consider a DRNN with n = 2, b ij ≡ 0, u1 = 0.1, u2 = 0.2: A=
0,
1.2
1.3,
0
.
It is obvious that this DRNN does not satisfy a diagonal dominance condition. However, from equation 3.3, such a DRNN has two locally exponentially stable equilibrium points located in [1, 1] × [1, 1] and [−1, −1] × [−1, −1]. For Nq = {l1 , l2 , . . . , lmq }, i ∈ Nq , let ++ J Nq (A, s, i) = { j| s j J +− (A, s, i) = { j| s j Nq −+ J Nq (A, s, i) = { j| s j −− J Nq (A, s, i) = { j| s j
= 1, a ij ≥ 0, j ∈ Nq }, = 1, a ij < 0, j ∈ Nq }, = −1, a ij ≥ 0, j ∈ Nq }, = −1, a ij < 0, j ∈ Nq };
(3.10)
2158
Z. Zeng and J. Wang
++ J Nq (B, s, i) = { j| s j J +− (B, s, i) = { j| s j Nq −+ J Nq (B, s, i) = { j| s j −− J Nq (B, s, i) = { j| s j
= 1, b ij ≥ 0, j ∈ Nq }, = 1, b ij < 0, j ∈ Nq },
(3.11)
= −1, b ij ≥ 0, j ∈ Nq }, = −1, b ij < 0, j ∈ Nq }.
From theorem 1, if the isolated equilibrium point is located in ints , then the equilibrium point is locally exponentially stable. When the isolated equilibrium point is located in ∂s , we have the following result: Theorem 2. For s = (s1 , s2 , . . . , sn )T ∈ B n , if there exists q ∈ {1, 2, . . . , 2n − 1}, such that n (a + b )s + u s > 1, i ∈ N , i i q ij ij j j=1
(3.12)
n (a ij + b ij )s j + ui = si ;
i ∈ Nq ,
j=1
and for i ∈ Nq , si = 1, −+ +− j∈J N (A,s,i)∪J N (A,s,i), j =i q q
|b ij | < 1,
(3.13)
+− −+ j∈J N (B,s,i)∪J N (B,s,i) q q
a ii +
|a ij | +
|a ij | +
++ −− j∈J N (A,s,i)∪J N (A,s,i), j =i q q
|b ij | < 1,
(3.14)
|b ij | < 1,
(3.15)
++ −− j∈J N (B,s,i)∪J N (B,s,i) q q
for i ∈ Nq , si = −1,
a ii +
|a ij | +
+− −+ j∈J N (A,s,i)∪J N (A,s,i), j =i q q
+− −+ j∈J N (B,s,i)∪J N (B,s,i) q q
|a ij | +
++ −− j∈J N (A,s,i)∪J N (A,s,i), j =i q q
|b ij | < 1,
(3.16)
|b ij | < 1,
(3.17)
|b ij | < 1,
(3.18)
++ −− j∈J N (B,s,i)∪J N (B,s,i) q q
for i ∈ Nq ,
|a ij | +
j∈J N+− (A,s,i)∪J N−+ (A,s,i) q q
j∈J N++ (A,s,i)∪J N−− (A,s,i) q q
j∈J N+− (B,s,i)∪J N−+ (B,s,i) q q
|a ij | +
j∈J N++ (B,s,i)∪J N−− (B,s,i) q q
Associative Memories Based on Recurrent Neural Networks
2159
then DRNN, equation 2.1, has one isolated equilibrium point located in Nq (s), which is locally exponentially stable. Proof. Let xi∗ = nj=1 (a ij + b ij )s j + ui . (x1∗ , . . . , xn∗ )T ∈ Nq (s). And −xi∗ +
From
equation
3.12,
n (a ij + b ij ) f x ∗j + ui = 0;
x∗ =
(3.19)
j=1
that is, x ∗ is an equilibrium point located in Nq (s). Choose δ > 0 enough small (e.g., δ = mini ∈ Nq {|si − xi∗ |/2}) such that ∀x ∈ δ = {x| ||x − x ∗ || ≤ δ}, ∀i ∈ Nq , xi si > 1. From equations 3.13 to 3.18, there exists µ > 0 such that for i ∈ Nq , si = 1,
a ij −
j∈J N−+ (A,s,i) q
j∈J N+− (A,s,i), j =i q
a ii +
a ij +
|a ij | +
j∈J N++ (A,s,i)∪J N−− (A,s,i), j =i q q
|b ij | exp{µτ } + µ < 1,
j∈J N+− (B,s,i)∪J N−+ (B,s,i) q q
(3.20) |b ij | exp{µτ } + µ < 1,
j∈J N++ (B,s,i)∪J N−− (B,s,i) q q
(3.21) for i ∈ Nq , si = −1, a ii −
a ij +
+− j∈J N (A,s,i) q
−+ j∈J N (A,s,i), j =i q
|a ij | +
++ −− j∈J N (A,s,i)∪J N (A,s,i), j =i q q
a ij +
|b ij | exp{µτ } + µ < 1,
+− −+ j∈J N (B,s,i)∪J N (B,s,i) q q
(3.22) |b ij | exp{µτ } + µ < 1,
(3.23)
++ −− j∈J N (B,s,i)∪J N (B,s,i) q q
for i ∈ Nq ,
|a ij | +
j∈J N+− (A,s,i)∪J N−+ (A,s,i) q q
j∈J N++ (A,s,i)∪J N−− (A,s,i) q q
|b ij | exp{µτ } + µ < 1,
(3.24)
|b ij | exp{µτ } + µ < 1.
(3.25)
j∈J N+− (B,s,i)∪J N−+ (B,s,i) q q
|a ij | +
j∈J N++ (B,s,i)∪J N−− (B,s,i) q q
2160
Z. Zeng and J. Wang
Let (1) G i (t) = xi (t) − xi∗ + δ exp{−µ(t − t0 )} , (2) G i (t) = xi (t) − xi∗ − δ exp{−µ(t − t0 )} . If φ(θ ) ∈ δ , θ ∈ [t0 − τ, t0 ], then ∀i ∈ {1, 2, . . . , n}, xi (t) − x ∗ ≤ δ exp{−µ(t − t0 )}. i
(3.26)
Otherwise, since for all i ∈ {1, 2, . . . , n}, ∀θ ∈ [t0 − τ, t0 ], φ(θ ) ∈ δ , there exist p ∈ {1, 2, . . . , n}, t2 > t1 ≥ t0 , ε > 0 sufficiently small such that ∀ j = 1, 2, . . . , n, ∀t ∈ [t0 − τ, t2 ], xi (t) − x ∗ ≤ δ exp{−µ(t − t0 )} + ε, i
(3.27)
and one of the following cases holds: Case 1: (1) G (1) p (t1 ) = 0, G p (t2 ) = ε,
(3.28)
+ (1) D+ G (1) p (t1 ) ≥ 0, D G p (t2 ) > 0,
(3.29)
Case 2: (2) G (2) p (t1 ) = 0, G p (t2 ) = −ε, + (2) D+ G (2) p (t1 ) ≤ 0, D G p (t2 ) < 0,
(3.30) (3.31)
From equation 3.27, for i ∈ Nq , ∀t ∈ [t0 − τ, t2 ], si − δ exp{−µ(t − t0 )} − ε ≤ f (xi (t)) ≤ 1, si = 1;
(3.32)
si ≤ f (xi (t)) ≤ si + δ exp{−µ(t − t0 )} + ε, si = −1.
(3.33)
For i ∈ Nq , ∀t ∈ [t0 − τ, t2 ], f (xi (t)) = si .
(3.34)
Consider case 1, when p ∈ Nq , s p = 1, from equations 2.1, 3.28, and 3.32 to 3.34: D+ G (1) p (t2 ) |(2.1) = −x p (t2 ) +
j∈Nq
(a pj f (x j (t2 )) + b pj f (x j (t2 − τ pj (t2 ))))
Associative Memories Based on Recurrent Neural Networks
+
2161
(a pj + b pj )s j + u p + µδ exp{−µ(t2 − t0 )}
j ∈ Nq
≤ −(s p + δ exp{−µ(t2 − t0 )} + ε) + a pp + a pj s j + b pj s j j∈J N++ (A,s, p), j = p q
+
j∈J N++ (B,s, p) q
a pj (s j − δ exp{−µ(t2 − t0 )} − ε)
j∈J N+− (A,s, p), j = p q
+
b pj (s j − exp{µτ }δ exp{−µ(t2 − t0 )} − ε)
j∈J N+− (B,s, p) q
+
a pj (s j + δ exp{−µ(t2 − t0 )} + ε)
j∈J N−+ (A,s, p) q
+
b pj (s j + exp{µτ }δ exp{−µ(t2 − t0 )} + ε)
j∈J N−+ (B,s, p) q
+
a pj s j
j∈J N−− (A,s, p) q
+
j∈J N−− (B,s, p) q
b pj s j +
(a pj + b pj )s j + u p + µδ exp{−µ(t2 − t0 )}.
j ∈ Nq
(3.35)
From equations 3.19 and 3.35: D+ G (1) p (t2 ) |(2.1) ≤ −(δ exp{−µ(t2 − t0 )} + ε) + +
a pj (−δ exp{−µ(t2 − t0 )} − ε)
j∈J N+− (A,s, p), j = p q
b pj (− exp{µτ }δ exp{−µ(t2 − t0 )} − ε)
j∈J N+− (B,s, p) q
+
a pj (δ exp{−µ(t2 − t0 )} + ε)
j∈J N−+ (A,s, p) q
+
b pj (exp{µτ }δ exp{−µ(t2 − t0 )} + ε) + µδ exp{−µ(t2 − t0 )}.
j∈J N−+ (B,s, p) q
(3.36)
2162
Z. Zeng and J. Wang
From equations 3.20 and 3.36: D+ G (1) p (t2 ) |(2.1)
≤ −δ exp{−µ(t2 − t0 )}(1 +
j∈J N+− (A,s, p), j = p q
−
a pj −
j∈J N−+ (B,s, p) q
b pj −
j∈J N+− (B,s, p) q
b pj exp{µτ }
j∈J N+− (B,s, p) q
b pj exp{µτ } − µ) − ε 1 +
j∈J N−+ (A,s, p) q
+
a pj +
a pj −
j∈J N−+ (A,s, p) q
a pj
j∈J N+− (A,s, p), j = p q
b pj ≤ 0.
(3.37)
j∈J N−+ (B,s, p) q
When p ∈ Nq , s p = −1, from equations 2.1, 3.28, 3.32 to 3.34, 3.19, and 3.22: D+ G (1) p (t2 ) |(2.1) ≤ −(s p + δ exp{−µ(t2 − t0 )} + ε) + a pp (s p + δ exp{−µ(t2 − t0 )} + ε) a pj s j + b pj s j + j∈J N++ (A,s, p) q
+
j∈J N++ (B,s, p) q
a pj (s j − δ exp{−µ(t2 − t0 )} − ε)
j∈J N+− (A,s, p) q
+
b pj (s j − exp{µτ }δ exp{−µ(t2 − t0 )} − ε)
j∈J N+− (B,s, p) q
+
a pj (s j + δ exp{−µ(t2 − t0 )} + ε)
j∈J N−+ (A,s, p), j = p q
+
b pj (s j + exp{µτ }δ exp{−µ(t2 − t0 )} + ε)
j∈J N−+ (B,s, p) q
+
a pj s j
j∈J N−− (A,s, p), j = p q
+
j∈J N−− (B,s, p) q
b pj s j +
(a pj + b pj )s j + u p + µδ exp{−µ(t2 − t0 )} ≤ 0.
j ∈ Nq
(3.38) When p ∈ Nq , from equations 2.1, 3.28, 3.32 to 3.34, 3.19, and 3.24: D+ G (1) p (t2 ) |(2.1) ≤ −(s p + δ exp{−µ(t2 − t0 )} + ε) +
j∈J N++ (A,s, p) q
a pj s j +
j∈J N++ (B,s, p) q
b pj s j
Associative Memories Based on Recurrent Neural Networks
+
2163
a pj (s j − δ exp{−µ(t2 − t0 )} − ε)
j∈J N+− (A,s, p) q
+
b pj (s j − exp{µτ }δ exp{−µ(t2 − t0 )} − ε)
j∈J N+− (B,s, p) q
+
a pj (s j + δ exp{−µ(t2 − t0 )} + ε)
j∈J N−+ (A,s, p) q
+
j∈J N−+ (B,s, p) q
+
b pj (s j + exp{µτ }δ exp{−µ(t2 − t0 )} + ε) +
b pj s j +
a pj s j
j∈J N−− (A,s, p) q
(a pj + b pj )s j + u p + µδ exp{−µ(t2 − t0 )} ≤ 0.
j ∈ Nq
j∈J N−− (B,s, p) q
(3.39)
Since equations 3.37 to 3.39 contradict equation 3.29, case 1 does not hold. Consider case 2, when p ∈ Nq , s p = 1, from equations 2.1, 3.30, and 3.32 to 3.34: D+ G (1) p (t2 ) |(2.1) ≥ −(s p − δ exp{−µ(t2 − t0 )} − ε) + a pp (s p − δ exp{−µ(t2 − t0 )} − ε) + a pj (s j − δ exp{−µ(t2 − t0 )} − ε) j∈J N++ (A,s, p), j = p q
+
b pj (s j − exp{µτ }δ exp{−µ(t2 − t0 )} − ε)
j∈J N++ (B,s, p) q
+
a pj s j +
j∈J N+− (A,s, p), j = p q
+
j∈J N+− (B,s, p) q
b pj s j +
a pj s j
j∈J N−+ (A,s, p) q
b pj s j
j∈J N−+ (B,s, p) q
+
a pj (s j + δ exp{−µ(t2 − t0 )} + ε)
j∈J N−− (A,s, p) q
+
b pj (s j + exp{µτ }δ exp{−µ(t2 − t0 )} + ε)
j∈J N−− (B,s, p) q
+
j ∈ Nq
(a pj + b pj )s j + u p − µδ exp{−µ(t2 − t0 )}.
(3.40)
2164
Z. Zeng and J. Wang
From equations 3.19, 3.21, and 3.40: D+ G (1) p (t2 ) |(2.1)
≥ δ exp{−µ(t2 − t0 )}(1 − a pp −
−
b pj exp{µτ }
j∈J N++ (B,s, p) q
+
a pj
j∈J N++ (A,s, p), j = p q
a pj +
j∈J N−− (A,s, p) q
b pj exp{µτ } − µ)
j∈J N−− (B,s, p) q
+ ε 1 − a pp −
a pj
j∈J N++ (A,s, p), j = p q
−
b pj +
j∈J N++ (B,s, p) q
j∈J N−− (A,s, p) q
a pj +
b pj
≥ 0.
(3.41)
j∈J N−− (B,s, p) q
When p ∈ Nq , s p = −1, from equations 2.1, 3.30, 3.32 to 3.34, 3.19, and 3.23: D+ G (1) p (t2 ) |(2.1) ≥ −(s p − δ exp{−µ(t2 − t0 )} − ε) + a pp s p + a pj (s j − δ exp{−µ(t2 − t0 )} − ε) j∈J N++ (A,s, p) q
+
b pj (s j − exp{µτ }δ exp{−µ(t2 − t0 )} − ε)
j∈J N++ (B,s, p) q
+
a pj s j +
j∈J N+− (A,s, p) q
+
j∈J N+− (B,s, p) q
b pj s j +
a pj s j
j∈J N−+ (A,s, p), j = p q
b pj s j
j∈J N−+ (B,s, p) q
+
a pj (s j + δ exp{−µ(t2 − t0 )} + ε)
j∈J N−− (A,s, p), j = p q
+
b pj (s j + exp{µτ }δ exp{−µ(t2 − t0 )} + ε)
j∈J N−− (B,s, p) q
+
j ∈ Nq
(a pj + b pj )s j + u p − µδ exp{−µ(t2 − t0 )} ≥ 0.
(3.42)
Associative Memories Based on Recurrent Neural Networks
2165
When p ∈ Nq , from equations 2.1, 3.30, 3.32 to 3.34, 3.19, and 3.25: D+ G (1) p (t2 ) |(2.1) ≥ −(s p − δ exp{−µ(t2 − t0 )} − ε) + a pj (s j − δ exp{−µ(t2 − t0 )} − ε) j∈J N++ (A,s, p) q
+
b pj (s j − exp{µτ }δ exp{−µ(t2 − t0 )} − ε)
j∈J N++ (B,s, p) q
+
j∈J N+− (A,s, p) q
+
a pj s j +
b pj s j +
j∈J N+− (B,s, p) q
a pj s j +
j∈J N−+ (A,s, p) q
b pj s j
j∈J N−+ (B,s, p) q
a pj (s j + δ exp{−µ(t2 − t0 )} + ε)
j∈J N−− (A,s, p) q
+
b pj (s j + exp{µτ }δ exp{−µ(t2 − t0 )} + ε)
j∈J N−− (B,s, p) q
+
(a pj + b pj )s j + u p − µδ exp{−µ(t2 − t0 )} ≥ 0.
(3.43)
j ∈ Nq
Since equations 3.41 to 3.43 all contradict equation 3.31, case 2 does not hold. Hence, equation 2.26 holds: DRNN, equation 2.1, has one isolated equilibrium point located in Nq (s), which is locally exponentially stable. Remark 3. If ∀i ∈ {1, 2, . . . , n}, j∈J N+− (A,s,i)∪J N−+ (A,s,i) q q
j∈J N++ (A,s,i)∪J N−− (A,s,i) q q
|a ij | +
|b ij | < 1,
(3.44)
|b ij | < 1,
(3.45)
j∈J N+− (B,s,i)∪J N−+ (B,s,i) q q
|a ij | +
j∈J N++ (B,s,i)∪J N−− (B,s,i) q q
then equations 3.13 to 3.19 all hold. For N1 = {1, 2, . . . , n}, according to theorem 2, we have the following corollary:
2166
Z. Zeng and J. Wang
Corollary 1. For s = (s1 , . . . , sn )T ∈ B n , if ∀i ∈ {1, . . . , n}, equations 3.44 and 3.45 hold, and n (a ij + b ij )s j + ui = si ,
(3.46)
j=1
then DRNN, equation 2.1, has one isolated equilibrium point x ∗ = s, which is locally exponentially stable. If Nq = N2n is an empty set, according to theorem 2, we have the following corollary: Corollary 2. For s = (s1 , s2 , . . . , sn )T ∈ B n , Nq = N2n , if equation 3.12 holds, then DRNN, equation 2.1, has one isolated equilibrium point x ∗ = s, which is locally exponentially stable. Theorem 3. For any s = (s1 , s2 , . . . , sn )T ∈ B n , DRNN, equation 2.1, has only one globally exponentially stable equilibrium point in s if n − (|a ij | + |b ij |)si + ui si ≥ 1, i ∈ {1, 2, . . . , n}.
(3.47)
j=1
Proof. From equation 3.47, ∀i ∈ {1, 2, . . . , n},
−
n (|a ij | + |b ij |) + ui ≥ 1, si = 1;
(3.48)
j=1
−
n (|a ij | + |b ij |) − ui ≥ 1, si = −1.
(3.49)
j=1
When si = 1, from equations 2.1 and 3.48, for all xi < 1, d xi (t) a ij f (x j (t)) + b ij f (x j (t − τij (t))) + ui = −xi (t) + dt > −1 −
n
n
n
j=1
j=1
(|a ij | + |b ij |) + ui ≥ 0.
j=1
(3.50)
Associative Memories Based on Recurrent Neural Networks
2167
When si = −1, from equations 2.1 and 3.49, for all xi > −1, d xi (t) a ij f (x j (t)) + b ij f (x j (t − τij (t))) + ui = −xi (t) + dt <1 +
n
n
j=1
j=1
n
(|a ij | + |b ij |) + ui ≤ 0.
(3.51)
j=1
Equations 3.50 and 3.51 imply that ∀φ ∈ C([t0 − τ, t0 ], n ), x(t; t0 , φ) will enter into and keep in s . Hence, there exists T > t0 such that for all t ≥ T, i = 1, 2, . . . , n, d xi (t) (a ij + b ij )s j + ui . = −xi (t) + dt n
(3.52)
j=1
Obviously, if equation 3.52 has an equilibrium point, then it is globally exponentially stable. Hence, DRNN, equation 2.1, has only one globally exponentially stable equilibrium point. Remark 4. According to theorem 1, if condition 3.3 or 3.4 holds, then DRNN, equation 2.1, has one isolated equilibrium point located in s and can have other isolated equilibrium points located in n . However, according to theorem 3, if condition equation 3.47 holds, then DRNN has one isolated equilibrium point located in n . In addition, since equations 3.3 and 3.4 are necessary and sufficient conditions, for any s = (s1 , s2 , . . . , n)T ∈ B n , using equations 3.3 and 3.4, one can know whether n DRNN, equation 2.1, has one isolated equilibrium point located in i=1 L(si ). Hence, one can know the number of the memorized patterns by equations 3.3 and 3.4. For k ∈ {1, 2, . . . , 2n }, let (k)
(k)
H (k) = V S¯ Bn + UBn . Then H (k) is an n × k matrix. Denote H (k) = (h ij )n×k , h (i) = (h 1i , h 2i , . . . , h ni )T (i = 1, 2, . . . , k), 1, ∀ j ∈ {1, 2, . . . , n}, h ji > 1, (i) µ1 (h ) = 0, ∃ j ∈ {1, 2, . . . , n}, h ji < 1, µ ¯ 1 (H (k) ) =
k i=1
µ1 h (i) .
(3.53)
2168
Z. Zeng and J. Wang
Theorem 4. For DRNN, equation 2.1, if h ji = 1, ∀i ∈ {1, 2, . . . , k}, ∀ j ∈ k {1, 2 . . . , n}, then µ ¯ 1 (H (k) ) = i=1 µ1 (h (i) ) is the number of the memorized pat(1) (2) (k) terns in k patterns s , s , . . . , s . Proof. According to theorem 1, if the pattern s (1 ) is memorized, then ( ) ( ) (k) ( nj=1 (a ij + b ij )s j 1 + ui )si 1 ≥ 1, ∀i ∈ {1, 2, . . . , n}. Since H (k) = V S¯ Bn + (k)
UBn and h ji = 1, from equation 3.53, µ1 (h (1 ) ) = 1.
(3.54)
If the pattern s (2 ) is forgotten, then there exists i ∈ {1, 2, . . . , n} such that ( ) ( ) ( nj=1 (a ij + b ij )s j 2 + ui )si 2 < 1. Hence, from equation 3.53, µ1 h (2 ) = 0.
(3.55)
k From equations 3.54 and 3.55, µ ¯ 1 (H (k) ) = i=1 µ1 (h (i) ) is the number of the memorized patterns in DRNN, equation 2.1. For k ∈ {1, 2, . . . , 2n }, and given Nq , let 1, (i) µ2 (h ) = 1, 0, µ ¯ 2 (H (k) ) =
k
∀ j ∈ {1, 2, . . . , n}, h ji > 1, j ∈ Nq , h ji > 1; j ∈ Nq , h ji = 1,
(3.56)
∃ j ∈ {1, 2, . . . , n}, h ji < 1,
µ2 (h (i) ).
i=1
Similar to the proof of theorem 4, we have the following result. Theorem 5. Given Nq , if for all k patterns s (1) , s (2) , . . . , s (k) , equations 3.13 to k µ2 (h (i) ) is the number of the memorized patterns 3.18 hold, then µ ¯ 2 (H (k) ) = i=1 (1) (2) (k) in k patterns s , s , . . . , s for DRNN, equation 2.1. n
Remark 5. Assume vectors s (1) , s (2) , . . . , s (2 ) satisfy ∀i, j ∈ {1, 2, . . . , 2n }, i = j, s (i) = s ( j) . According to theorem 4, if ∀i ∈ {1, 2, . . . , 2n }, ∀ j ∈ 2n n {1, 2, . . . , n}, h ji = 1, then µ ¯ 1 (H (2 ) ) = i=1 µ1 (h (i) ) is the number of the memorized patterns in DRNN, equation 2.1. If s (1) , s (2) , . . . , s (k) are den sired patterns, then µ ¯ 1 (H (2 ) ) − µ ¯ 1 (H (k) ) is the number of spurious equilibria. If there exists Nq , such that ∀ j ∈ Nq , ∀i ∈ {1, 2, . . . , k}, h ji = 1 and ∀ j ∈ Nq , ∀i ∈ {k + 1, 2, . . . , 2n }, h ji = 1, then according to theorem 5, n µ ¯ 2 (H (2 ) ) − µ ¯ 2 (H (k) ) is the number of spurious equilibria.
Associative Memories Based on Recurrent Neural Networks
2169
4 DRNNs with Fixed External Inputs 4.1 Single Pattern Corollary 3. Assume that u is a fixed external input. If c ij = −s j ui /n, σi > 1, ∀i, j = 1, 2, . . . , n,
(4.1)
then the pattern s is memorized. If c ij = −s j ui /n for all i, j = 1, 2, . . . , n, and there exists ∈ {1, 2, . . . , n} such that σ < 1, then the pattern s is forgotten. Proof. Since c ij = −s j ui /n (i, j = 1, 2, . . . , n), n
c ij s j = −ui .
j=1
Hence, for i = 1, 2, . . . , n, n (a ij + b ij )s j + ui = σi si ; j=1
that is, n (a ij + b ij )s j + ui si = σi . j=1
According to theorem 1, if for all i = 1, 2, . . . , n, σi > 1, then the pattern s is memorized; if there exists ∈ {1, 2, . . . , n} such that σ < 1, then the pattern s is forgotten. Remark 6. Since for all i = 1, 2, . . . , n, σi = a ii + b ii − c ii , a pattern can be memorized or forgotten by adjusting a ii + b ii . For example, if a ii + b ii = 1 (i = 1, 2, . . . , n) such as σi = 0.5 < 1 (i = 1, 2, . . . , n), then σ¯ i (i = 1, 2, . . . , n) can be equal to 1.5 > 1 by choosing a¯ ii + b¯ ii = 2 (i = 1, 2, . . . , n), where a¯ ii and b¯ ii are new connection weights. Thus, the corresponding pattern shifts to being memorized from being forgotten. In order to decrease the total number of spurious equilibria, we have the following corollary:
2170
Z. Zeng and J. Wang
Corollary 4. Assume that u is a fixed external input. If for all i, j = 1, 2, . . . , n, |a ii | + |b ii | + σi = |a ij | + |b ij | = sinui , σi > 1,
si ui n
,
if i = j, if i = j,
(4.2)
then DRNN, equation 2.1, has only one memorized pattern s. Proof. Equation 4.2 implies − nj=1 (|a ij | + |b ij |) + ui si = σi . If σi > 1, for all i = 1, 2, . . . , n, then equation 3.47 holds. According to theorem 4, DRNN, equation 2.1, has only one memorized pattern s. According to remark 1, the difference between corollaries 3 and 4 is that DRNN, equation 2.1, can have other memorized patterns if condition 4.1 holds but only one isolated equilibrium point located in n if condition 4.2 holds. 4.2 Multiple Patterns. Corollaries 3 and 4 can be used only to ascertain the local or global stability of DRNN, equation 2.1. These are used to memorize or store single patterns. However, in most applications, DRNN must exhibit more than one stable equilibrium point instead of a single globally stable equilibrium point, for example, associative memories of multiple patterns. Corollary 5. Assume that u is a fixed external input, Rank{s (1) , s (2) , . . . , s (k) } = k ≤ n, Rank{s (1) , . . . , s (k) , s (k+1) , . . . , s (n) } = n. If a ij + b ij = pij for all i, j = 1, 2, . . . , n, and σi > 1 for i = 1, 2, . . . , n, = 1, 2, . . . , k; σi < 1 for i = 1, 2, . . . , n, = k + 1, k + 2, . . . , n, then the patterns s (1) , s (2) , . . . , s (k) are memorized, and the patterns s (k+1) , s (k+2) , . . . , s (n) are forgotten. Proof. Since a ij + b ij = pij (i, j = 1, 2, . . . , n), ˜ n. (a ij + b ij )n×n S(n) = Wn − U Hence, for all i = 1, 2, . . . , n, = 1, 2, . . . , n, n () () () (a ij + b ij )s j + ui = σi si ; j=1
that is, n () (a ij + b ij )s j + ui si() = σi() . j=1
Associative Memories Based on Recurrent Neural Networks
2171 ()
According to theorem 1, if for all i = 1, 2, . . . , n, = 1, 2, . . . , k, σi > 1, then the patterns s (1) , s (2) , . . . , s (k) are memorized; if for all i = 1, 2, . . . , () n, = k + 1, k + 2, . . . , n, σi < 1, the patterns s (k+1) , s (k+2) , . . . , s (n) are forgotten. Corollary 6. Assume that u is a fixed external input and Rank{s (1) , s (2) , . . . , s (k) } = k > n. If ˜k ˜ k S(k)T (S(k)S(k)T )−1 S(k) = U U
(4.3)
and c ij = q ij , σi > 1, for all i, j = 1, 2, . . . , n, then the patterns s (1) , s (2) , . . . , s (k) are memorized; if c ij = q ij , for all i, j = 1, 2, . . . , n and there exists ∈ {1, 2, . . . , n} such that σ < 1, then the patterns s (1) , s (2) , . . . , s (k) are forgotten. ˜ k S(k)T (S(k)S(k)T )−1 S(k) = U ˜ k and Q = (q ij )n×n = −U ˜ k S(k)T Proof. Since U (S(k)S(k)T )−1 , ˜ k. QS(k) = −U Hence, from c ij = q ij (i, j = 1, 2, . . . , n), for all i = 1, 2, . . . , n, 1, 2, . . . , k,
=
n () () (a ij + b ij )s j + ui = σi si ; j=1
that is, n () (a ij + b ij )s j + ui si() = σi . j=1
According to theorem 1, if for all i = 1, 2, . . . , n, σi > 1, the patterns s (1) , s (2) , . . . , s (k) are memorized; if for all i = 1, 2, . . . , n, σi < 1, the patterns s (1) , s (2) , . . . , s (k) are forgotten. Remark 7. Since for all i = 1, 2, . . . , n, σi = a ii + b ii − c ii , the patterns s (1) , s (2) , . . . , s (k) can be memorized or forgotten at the same time by adjusting a ii + b ii . 5 DRNNs with Variable External Inputs When the external input is fixed, a DRNN is not capable of learning and forgetting. If the external input is variable, DRNN is capable of learning and
2172
Z. Zeng and J. Wang
forgetting. In other words, when s is memorized by DRNN, equation 2.1, the variable external input can change s into a forgotten pattern of DRNN. In addition, when s is forgotten by DRNN, DRNN is capable of learning such that s is memorized by DRNN by variable external input. Hence, when the external input is variable, DRNN is provided with the capability of learning new patterns and forgetting old ones without recomputing the whole interconnection matrices A and B. 5.1 Single Pattern Corollary 7. If c˜ ij = s j si σi /n, σi > 1, for all i, j = 1, 2, . . . , n, then the pattern s is memorized; if c˜ ij = s j si σi /n, for all i, j = 1, 2, . . . , n, and there exists ∈ {1, 2, . . . , n} such that σ < 1, then the pattern s is forgotten. Proof. Since for all i, j = 1, 2, . . . , n, c˜ ij = s j si σi /n, n
c˜ ij s j si = σi .
j=1
Hence, n (a ij + b ij )s j si + ui si = σi ; j=1
that is, n (a ij + b ij )s j + ui si = σi . j=1
According to theorem 1, if for all i = 1, 2, . . . , n, σi > 1, then the pattern s is memorized; if for all i = 1, 2, . . . , n, σi < 1, then the pattern s is forgotten. Remark 8. Since for all i = 1, 2, . . . , n, σi = nj=1 (a ij + b ij )s j si + ui si , the patterns s can be memorized or forgotten by fixing a ij + b ij and adjusting ui . In other words, for a given neural network, a pattern can shift to being memorized from being forgotten or shift to being forgotten from being memorized by adjusting ui . Corollary 8. If for all i, j = 1, 2, . . . , n, −(|a ii | + |b ii |) + ui si = −(|a ij | + |b ij |) = σni , σi > 1,
σi n
, if i = j, if i = j,
then the DRNN in equation 2.1 can memorize one pattern only.
Associative Memories Based on Recurrent Neural Networks
2173
Proof. Since for all i, j = 1, 2, . . . , n, −(|a ii | + |b ii |) + ui si = σi /n, − (|a ij | + |b ij |) = σi /n, −
n (|a ij | + |b ij |) + ui si = σi . j=1
If for all i = 1, 2, . . . , n, σi > 1, then equation 3.47 holds. According to theorem 3, the DRNN in equation 2.1 can memorize one pattern only.
5.2 Multiple Patterns Corollary 9. Assume that Rank{s (1) , s (2) , . . . , s (k) } = k ≤ n. If a ij + b ij = () p˜ ij , σi > 1, for ∈ {1, 2, . . . , k}, and all i, j = 1, 2, . . . , n, then the pattern s () is memorized, if a ij + b ij = p˜ ij , for ∈ {1, 2, . . . , k}, all i, j = 1, 2, . . . , n, and () there exists i ∈ {1, 2, . . . , n} such that σi < 1, then the pattern s () is forgotten. ˜ k )(S(k)T S(k))−1 S(k)T S(k) = Wk − U ˜ k, Proof. Since (Wk − U (i, j = 1, 2, . . . , n) imply
a ij + b ij = p˜ ij
˜ k. (a ij + b ij )n×n S(k) = Wk − U Hence, for all i = 1, 2, . . . , n, = 1, 2, . . . , k, n () () () (a ij + b ij )s j + ui = σi si ; j=1
that is, n () (a ij + b ij )s j + ui si() = σi() . j=1 ()
According to theorem 1, if for all i = 1, 2, . . . , n, σi > 1, then the pattern () s () is memorized; if there exists i ∈ {1, 2, . . . , n} such that σi < 1, then the () pattern s is forgotten. ˜ k satisfies Corollary 10. Assume that Rank{s (1) , s (2) , . . . , s (k) } = k > n, and U ˜ k )(E − S(k)T (S(k)S(k)T )−1 S(k)) = 0. (Wk − U ()
If a ij + b ij = q˜ ij , σi > 1, for ∈ {1, 2, . . . , k}, all i, j = 1, 2, . . . , n, then the pattern s () is memorized. If a ij + b ij = q˜ ij , for ∈ {1, 2, . . . , k}, all i, j = 1, 2, . . . , n,
2174
Z. Zeng and J. Wang ()
and there exists i ∈ {1, 2, . . . , n} such that σi ten.
< 1, then the pattern s () is forgot-
˜ k )(E − S(k)T (S(k)S(k)T )−1 S(k)) = 0, ˜ k satisfies (Wk − U Proof. Since U ˜ k )S(k)T (S(k)S(k)T )−1 S(k) = Wk − U ˜ k . From Q ˜ = (Wk − U ˜ k )S(k)T (S(k) (Wk − U S(k)T )−1 , ˜ k. (a ij + b ij )n×n S(k) = Wk − U Hence, for all i = 1, 2, . . . , n, = 1, 2, . . . , k, n () () () (a ij + b ij )s j + ui = σi si ; j=1
that is, n () (a ij + b ij )s j + ui si() = σi() . j=1
()
According to theorem 1, if for all i = 1, 2, . . . , n, σi > 1, then the pattern, () s () is memorized; if there exists i ∈ {1, 2, . . . , n} such that σi < 1, then the () pattern s is forgotten. Remark 9. According to corollaries 5, 9, and 10, the pattern s () and σ () is a one-to-one correspondence. For two patterns, s (1 ) and s (2 ) , σ (1 ) can differ from σ (2 ) . Hence, the pattern s (1 ) can be memorized; at the same time, the pattern s (2 ) can be forgotten. But corollary 6 can be used only to judge whether the patterns s (1 ) and s (2 ) are memorized or are forgotten. Remark 10. Since B n is made up of 2n elements, there exist 2n memn ory vectors s (1) , s (2) , . . . , s (2 ) such that ∀i, j ∈ {1, 2, . . . , 2n }, i = j, s (i) = s ( j) . () For all i = 1, 2, . . . , n, = 1, 2, . . . , k, choose σi > 1. For all = k + 1, k + () 2, . . . , 2n , choose σ1 < 1. According to corollary 10, for k desired patterns s (1) , s (2) , . . . , s (k) , it is possible that there exist desired connection weights a ij , b ij , ui such that the designed DRNN have only k stable memory vectors; that is, there is no spurious equilibrium. 6 Design Procedures When u is fixed and k ≤ n, we have the following procedure:
Associative Memories Based on Recurrent Neural Networks
2175
Design Procedure 1 Step 1: Denote desired patterns using vectors in {−1, 1}n ; that is, obtain k vectors s (1) , s (2) , . . . , s (k) ∈ B n , where k is the number of the desired patterns and n is the number of neurons of designed DRNN. Step 2: Choose s (k+1) , s (k+2) , . . . , s (n) ∈ B n such that Rank{s (1) , . . . , s (k) , s (k+1) , . . . , s (n) } = n. Step 3: Choose σi > 1, (i = 1, 2, . . . , n; = 1, 2, . . . , k); σi < 1, (i = 1, 2, . . . , n; = k + 1, . . . , n). ˜ n )S(n)−1 . Step 4: Compute ( pij )n×n = (Wn − U Step 5: Choose the desired connection weights a ij + b ij = pij , i, j = 1, 2, . . . , n. Then s (1) , s (2) , . . . , s (k) are stable memory vectors of DRNN. When u is fixed and k > n, we have the following procedure: Design Procedure 2 Step 1: Obtain k vectors s (1) , s (2) , . . . , s (k) ∈ B n . ˜ k S(k)T (S(k)S(k)T )−1 = (c ij )n×n . Step 2: Compute −U Step 3: Choose σi > 1 (i = 1, 2, . . . , n) and the desired connection weights, a ii + b ii − σi = c ii , if i = j a ij + b ij = c ij ,
if i = j,
for i, j = 1, 2, . . . , n. Then s (1) , s (2) , . . . , s (k) are stable memory vectors of DRNN. When u is variable and k ≤ n, we have the following procedure: Design Procedure 3 Step 1: Obtain k vectors s (1) , s (2) , . . . , s (k) ∈ B n . () Step 2: Choose ui (i = 1, 2, . . . , n) and σi (i = 1, 2, . . . , n, = 1, 2, . . . , k). ˜ k )(S(k)T S(k))−1 S(k)T = (q ij )n×n . Choose the deStep 3: Compute (Wk − U sired connection weights a ij + b ij = q ij , i, j = 1, 2, . . . , n. ()
Step 4: If for all i = 1, 2, . . . , n, σi > 1 then s () is a stable memory vector () of DRNN. If there exists i ∈ {1, 2, . . . , n} such that σi < 1, then s () is not a stable memory vector of DRNN. When u is variable and k > n, we have the following procedure: Design Procedure 4 Step 1: Obtain k vectors s (1) , s (2) , . . . , s (k) ∈ B n . Step 2: Choose ui , σi (i = 1, 2, . . . , n, = 1, 2, . . . , k), such that (Wk − ˜ k )(E − S(k)T (S(k)S(k)T )−1 S(k)) = 0. U ˜ k )S(k)T (S(k)S(k)T )−1 = (q˜ ij )n×n . Choose the deStep 3: Compute (Wk − U sired connection weights a ij + b ij = q˜ ij , i, j = 1, 2, . . . , n.
2176
Z. Zeng and J. Wang
Figure 1: Four desired memory patterns (3 × 4) in Example 1. ()
Step 4: If for all i = 1, 2, . . . , n, σi > 1, then s () is a stable memory vector () of DRNN. If there exists i ∈ {1, 2, . . . , n} such that σi < 1, then s () is not a stable memory vector of DRNN. 7 Illustrative Examples
7.1 Example 1. The same example introduced in Lu and Liu (1998) is used here. Consider DRNN with 12 neurons (n = 12). The objective is to store the four patterns shown in Figure 1 as stable memories (black = −1 and white = 1). Denote the desired patterns s (1) = (1, 1, 1, 1; −1, −1, −1, −1; 1, 1, 1, 1)T , s = (−1, −1, −1, −1; 1, 1, 1, −1; 1, 1, 1, −1)T , s (3) = (−1, −1, −1, −1; 1, 1, 1, −1; −1, −1, −1, −1)T , and s (4) = (−1, −1, 1, 1; −1, −1, 1, 1; −1, −1, −1, −1)T . Choose expiatory patterns s (4+1) , . . . , s (4+8) such that (2)
1 1 1 1 −1 −1 S(12) = −1 −1 1 1 1 1
−1 −1 −1 −1 1 1 1 −1 1 1 1 −1
−1 −1 −1 −1 1 1 1 −1 −1 −1 −1 −1
−1 −1 1 1 −1 −1 1 1 −1 −1 −1 −1
−1 −1 −1 −1 −1 1 1 1 1 1 1 1
−1 −1 −1 1 1 −1 1 1 1 1 1 1
−1 −1 1 1 1 −1 −1 1 1 1 1 1
−1 1 1 1 1 −1 −1 −1 1 1 1 1
−1 −1 −1 −1 −1 −1 −1 −1 −1 1 1 1
−1 −1 −1 −1 −1 −1 −1 −1 −1 −1 1 1
−1 −1 −1 −1 −1 −1 −1 −1 −1 −1 −1 1
−1 −1 −1 −1 −1 −1 . −1 −1 −1 −1 −1 −1
ui = 1.825 (i = 1, 2, . . . , 12) as introduced in Lu and Liu (1998) is a fixed external input. Choose ()
σi = 1.1 (i = 1, 2, . . . , 12; = 1, 2, 3, 4); ()
σi = 0.5 (i = 1, 2, . . . , 12; = 5, 6, . . . , 12).
Associative Memories Based on Recurrent Neural Networks
2177
According to design procedure 1 in section 6, the desired connection weight matrices A+ B = 2.3250 1.8250 1.8250 2.3250 1.8250 1.8250 1.8250 1.8250 1.8250 1.8250 1.8250 1.8250 1.8250 1.8250 1.8250 2.4250 2.4250 1.8250 2.4250 1.8250 2.4250 1.8250 1.8250 1.8250
1.2250 1.2250 2.3250 1.8250 1.8250 1.8250 2.4250 2.4250 2.4250 2.4250 2.4250 1.2250
−4.5750 −4.5750 −5.1750 −4.6750 −5.7750 −5.7750 −6.3750 −6.9750 −6.9750 −6.9750 −6.9750 −4.5750
1.5250 1.5250 1.5250 1.5250 2.6250 2.1250 2.1250 2.1250 2.1250 2.1250 2.1250 1.5250
−3.0500 −3.0500 −3.6500 −3.6500 −3.6500 −3.1500 −4.2500 −4.8500 −4.8500 −4.8500 −4.8500 −3.0500
1.2250 1.2250 1.8250 1.8250 1.8250 1.8250 2.9250 2.4250 2.4250 2.4250 2.4250 1.2250
1.8250 1.8250 1.8250 1.8250 1.8250 1.8250 1.8250 2.9250 1.8250 1.8250 1.8250 1.8250
0 0 0 0 0 0 0 0 1.1 0.6 0.6 0
0 0 0 0 0 0 0 0 0 0.5 0 0
0 0 0 0 0 0 0 0 0 0 0.5 0
0 0 0 0 0 0 . 0 0 0 0 0 0.5
Corollary 5 implies that the four patterns depicted in Figure 1 are stable memory patterns of the designed DRNN: d xi (t) a ij ( f (x j (t)) + b ij f (x j (t − τij (t)))) + ui , i = 1, . . . , 12. = −xi (t) + dt n
j=1
(7.1) In fact, µ(H ¯ (4) ) = 4. Hence, according to theorem 4, s (1) , s (2) , s (3) , s (4) are stable memory patterns of equation 7.1. Let a ij = b ij , τij (t) = i (i, j ∈ {1, 2, . . . , 12}). Simulation results are depicted in Figure 2, where all the trajectories from 60 random initial points converge to one of seven stable equilibrium points. The number of the spurious equilibria using the results in Lu and Liu (1998) is greater than or equal to 22. The number of the spurious equilibria using the results in Liu and Lu (1997) is greater than or equal to 4. By com12 puting, µ(H ¯ (2 ) ) = 7. According to remark 5, the number of the spurious equilibria is 3 in equation 7.1. 7.2 Example 2. The same example introduced in Liu and Michel (1994) is used here. Consider DRNN with 16 neurons (n = 16). The objective is to store the three patterns shown in Figure 3 as stable memories (black = −1 and white = 1). Denote the desired patterns s (1) = (−1, −1, −1, 1; 1, 1, −1, 1; −1, −1, −1, −1; 1, 1, −1, 1)T , s (2) = (1, 1, 1, 1; 1, 1, 1, 1; −1, −1, −1, −1; 1, 1, 1, 1)T , and s (3) = (−1, 1, 1, 1; −1, 1, 1, −1; −1, −1, −1, −1; −1, 1, 1, 1)T .
2178
Z. Zeng and J. Wang
Figure 2: Transient behavior of xi in equation 7.1.
Choose ui = −2 (i = 1, 2, 3, 5, 7, 8, 13, 15), ui = 1 (i = 4, 6, 9, 10, 11, 12, () 14, 16), σi = 1.1, i ∈ {1, 2, . . . , 16}, = 1, 2, 3. According to design procedure 3 in section 6, the desired connection weight matrices
˜ 3 )(S(3)T S(3))−1 S(3)T = A + B = (W3 − U
0.3098 0.0146 0.0146 −0.0073 0.0683 0.0073 0.0146 0.0683 0.1537 0.1537 0.1537 0.1537 0.0683 −0.0073 0.0146 −0.0073
0.1976 0.2713 0.2713 0.0018 −0.0171 0.0018 0.2713 −0.0171 −0.0384 −0.0384 −0.0384 −0.0384 −0.0171 0.0018 0.2713 0.0018
0.1976 0.2713 0.2713 0.0018 −0.0171 0.0018 0.2713 −0.0171 −0.0384 −0.0384 −0.0384 −0.0384 −0.0171 0.0018 0.2713 0.0018
0.1512 0.2518 0.2518 0.0116 0.2585 0.0116 0.2518 0.2585 −0.2433 −0.2433 −0.2433 −0.2433 0.2585 0.0116 0.2518 0.0116
0.2634 −0.0049 −0.0049 0.0024 0.3439 0.0024 −0.0049 0.3439 −0.0512 −0.0512 −0.0512 −0.0512 0.3439 0.0024 −0.0049 0.0024
0.1512 0.2518 0.2518 0.0116 0.2585 0.0116 0.2518 0.2585 −0.2433 −0.2433 −0.2433 −0.2433 0.2585 0.0116 0.2518 0.0116
0.1976 0.2713 0.2713 0.0018 −0.0171 0.0018 0.2713 −0.0171 −0.0384 −0.0384 −0.0384 −0.0384 −0.0171 0.0018 0.2713 0.0018
0.2634 −0.0049 −0.0049 0.0024 0.3439 0.0024 −0.0049 0.3439 −0.0512 −0.0512 −0.0512 −0.0512 0.3439 0.0024 −0.0049 0.0024
Associative Memories Based on Recurrent Neural Networks
2179
Figure 3: Three desired memory patterns (4 × 4) in example 2.
−0.1512 −0.2518 −0.2518 −0.0116 −0.2585 −0.0116 −0.2518 −0.2585 0.2433 0.2433 0.2433 0.2433 −0.2585 −0.0116 −0.2518 −0.0116
−0.1512 −0.2518 −0.2518 −0.0116 −0.2585 −0.0116 −0.2518 −0.2585 0.2433 0.2433 0.2433 0.2433 −0.2585 −0.0116 −0.2518 −0.0116
−0.1512 −0.2518 −0.2518 −0.0116 −0.2585 −0.0116 −0.2518 −0.2585 0.2433 0.2433 0.2433 0.2433 −0.2585 −0.0116 −0.2518 −0.0116
−0.1512 −0.2518 −0.2518 −0.0116 −0.2585 −0.0116 −0.2518 −0.2585 0.2433 0.2433 0.2433 0.2433 −0.2585 −0.0116 −0.2518 −0.0116
0.2634 −0.0049 −0.0049 0.0024 0.3439 0.0024 −0.0049 0.3439 −0.0512 −0.0512 −0.0512 −0.0512 0.3439 0.0024 −0.0049 0.0024
0.1512 0.2518 0.2518 0.0116 0.2585 0.0116 0.2518 0.2585 −0.2433 −0.2433 −0.2433 −0.2433 0.2585 0.0116 0.2518 0.0116
0.1976 0.2713 0.2713 0.0018 −0.0171 0.0018 0.2713 −0.0171 −0.0384 −0.0384 −0.0384 −0.0384 −0.0171 0.0018 0.2713 0.0018
0.1512 0.2518 0.2518 0.0116 0.2585 0.0116 0.2518 0.2585 . −0.2433 −0.2433 −0.2433 −0.2433 0.2585 0.0116 0.2518 0.0116
Corollary 9 implies that the three patterns depicted in Figure 3 are stable memory patterns of the designed DRNN: d xi (t) = −xi (t) + a ij ( f (x j (t)) + b ij f (x j (t − τij (t)))) + ui , i = 1, . . . , 16. dt j=1 n
(7.2)
Since µ(H ¯ (3) ) = 3, according to theorem 4, s (1) , s (2) , s (3) are stable memory patterns of equation 7.2. Let a ij = b ij , τij (t) = i, i, j ∈ {1, 2, . . . , 16}. Simulation results are depicted in Figure 4, where all the trajectories from 160 random initial points converge to one of three stable equilibrium points. 16 By computing, µ(H ¯ (2 ) ) = 3. According to remark 5, there is no spurious equilibrium. In other words, DRNN, equation 7.2, has only three stable memory patterns: s (1) , s (2) , and s (3) .
2180
Z. Zeng and J. Wang
Figure 4: Transient behavior of xi in equation 7.2.
8 Conclusion In this letter, four new design procedures for DRNN are developed based on stability theory, and the convergence of the designed associative memories can be guaranteed by some stability results. Two of these design procedures allow the neural network with time-varying delays to be capable of learning and forgetting. These results can guarantee that the desired patterns are stable memory or forgotten patterns of the designed DRNN. In addition, when the new design procedures are used, the total number of spurious equilibria can be as small as possible. Hence, it is very convenient to design DRNN for storing or forgetting desired memory patterns effectively. Finally, the simulation results showed the validity and feasibility of the proposed approach. Further investigations will include how to use our proposed approach with specific neural network models (e.g., designing cellular neural networks with space-invariant cloning templates) and how to extend the presented approach with other cases (e.g., design brain-state-in-a-box neural model as a memory model based on neurophysiological considerations). One of the advantages of considering a space-invariant cloning template for neural networks is the small number of parameters for describing a neural network, which also simplifies the hardware implementation process. In addition, it is also possible to design globally asymptotically stable neural associative memories, where the memory storage and retrieval are achieved
Associative Memories Based on Recurrent Neural Networks
2181
by external inputs rather than initial states, which enables both heteroassociative and autoassociative memories to be designed. Hence, further investigations will also include how to ensure the one-to-one correspondence of the input and output patterns without spurious equilibria. Furthermore, because the robustness issues are rarely considered, it is desirable to design robust associative memories to withstand the parameter perturbations and random disturbance on patterns and characterize the allowable region of the bias vector in both heteroassociative and autoassociative memories. Acknowledgments This work was supported by the Hong Kong Research Grants Council under Grant CUHK4165/03E, the Natural Science Foundation of China under Grant 60405002. References Brucoli, M., Carnimeo, L., & Grassi, G. (1995). Discrete-time cellular neural networks for associative memories with learning and forgetting capabilities. IEEE Trans. Circuits and Systems I, 42, 396–399. Chua, L. O., & Yang, L. (1988). Cellular neural networks: Theory. IEEE Trans. Circuits and Systems, 35, 1257–1272. Li, J. H., Michel, A. N., & Porod, W. (1988). Qualitative analysis and synthesis of a class of neural networks. IEEE Trans. Circuits and Systems, 35(8), 976–986. Li, J. H., Michel, A. N., & Porod, W. (1989a). Analysis and synthesis of a class of neural networks: Variable structure systems with infinite grain. IEEE Trans. Circuits and Systems, 36(5), 713–731. Li, J. H., Michel, A. N., & Porod, W. (1989b). Analysis and synthesis of a class of neural networks: Linear systems operating on a closed hypercube. IEEE Trans. Circuits and Systems, 36(11), 1405–1422. Liao, X. X., & Wang, J. (2003). Algebraic criteria for global exponential stability of cellular neural networks with multiple time delays. IEEE Trans. Circuits and Systems I, 50, 268–275. Liu, D. (1997). Cloning template design of cellular neural networks for associative memories. IEEE Trans. Circuits and Systems I, 44, 646–650. Liu, D., & Lu, Z. (1997). A new synthesis approach for feedback neural networks based on the perceptron training algorithm. IEEE Trans. Neural Networks, 8, 1468– 1482. Liu, D., & Michel, A. N. (1994). Sparsely interconnected neural networks for associative memories with applications to cellular neural networks. IEEE Trans. Circuits and Systems II, 41(4), 295–307. Lu, Z. J., & Liu, D. (1998). A new synthesis procedure for a class of cellular neural networks with space-invariant cloning template. IEEE Trans. Circuits and Systems II, 45(12), 1601–1605. Michel, A. N., & Farrell, J. A. (1990). Associative memories via artificial neural networks. IEEE Contr. Syst. Mag., 10, 6–17.
2182
Z. Zeng and J. Wang
Michel, A. N., Farrell, J. A., & Sun, H. F. (1990). Analysis and synthesis techniques for Hopfield type synchronous discrete time neural networks with application to associative memory. IEEE Trans. Circuits and Systems, 37(11), 1356–1366. Michel, A. N., & Gray, D. L. (1990). Analysis and synthesis of neural networks with lower block triangular interconnecting structure. IEEE Trans. Circuits and Systems, 37(10), 1267–1283. Michel, A. N., Si, J., & Yen, G. (1991). Analysis and synthesis of a class of discretetime neural networks described on hypercubes. IEEE Trans. Neural Networks, 2(1), 32–46. Park, J., Cho, H., & Park, D. (1999). On the design of BSB neural associative memories using semidefinite programming. Neural Computation, 11, 1985–1994. Park, J., & Park, Y. (2000). An optimization approach to design of generalized BSB neural associative memories. Neural Computation, 12, 1449–1462. Perfetti, R. (1991). A neural network to design neural networks. IEEE Trans. Circuits and Systems, 38(9), 1099–1103. Personnaz, L., Guyon, I., & Dreyfus, G. (1985). Information storage and retrieval in spin-glass like neural networks. Journal de Physique Lettres, 46(8), 359–365. Personnaz, L., Guyon, I., & Dreyfus, G. (1986). Collective computational properties of neural networks: New learning mechanisms. Physical Review A, 34(5), 4217–4228. Salam, F. M. A., Wang, Y. W., & Choi, M. R. (1991). On the analysis of dynamic feedback neural nets. IEEE Trans. Circuits and Systems, 38(2), 196–201. Seiler, G., Schuler, A. J., & Nossek, J. A. (1993). Design of robust cellular neural networks. IEEE Trans. Circuits and Systems I, 40, 358–364. Waugh, F. R., Marcus, C. M., & Westervelt, R. M. (1990). Fixed-point attractors in analog neural computation. Physical Review Letters, 64(16), 1986–1989. Yen, G., & Michel, A. N. (1991). A learning and forgetting algorithm in associative memories: Results involving pseudo-inverses. IEEE Trans. Circuits and Systems, 38(10), 1193–1205. Yen, G., & Michel, A. N. (1992). A learning and forgetting algorithm in associative memories: The eigenstructure method. IEEE Trans. Circuits and Systems, 39(4), 212–225. Yi, Z., Tan, K. K., & Lee, T. H. (2003). Multistability analysis for recurrent neural networks with unsaturating piecewise linear transfer functions. Neural Computation, 15(3), 639–662. Zeng, Z. G., & Wang, J. (2006a). Complete stability of cellular neural networks with time-varying delays. IEEE Transactions on Circuits and Systems–I, 53(4), 944–955. Zeng, Z. G., & Wang, J. (2006b). Multiperiodicity and exponential attractivity evoked by periodic external inputs in delayed cellular neural networks. Neural Computation, 18, 848–870. Zeng, Z. G., Wang, J., & Liao, X. X. (2003). Global exponential stability of a general class of recurrent neural networks with time-varying delays. IEEE Trans. Circuits and Systems I, 50(10), 1353–1358. Zeng, Z. G., Wang, J., & Liao, X. X. (2004). Stability analysis of delayed cellular neural networks described using cloning templates. IEEE Trans. Circuits and Systems I, 51(11), 2313–2324.
Received April 11, 2006; accepted July 19, 2006.
LETTER
¨ Communicated by Klaus-Robert Muller
Robust Loss Functions for Boosting Takafumi Kanamori [email protected] Department of Mathematical and Computing Sciences, Tokyo Institute of Technology, Tokyo 152-8552, Japan
Takashi Takenouchi [email protected] Nara Institute of Science and Technology, Ikoma, Nara 630-0192, Japan
Shinto Eguchi [email protected] Institute of Statistical Mathematics, and Department of Statistical Science, Graduate University of Advanced Studies, Minato-ku, Tokyo 106-8569, Japan
Noboru Murata [email protected] School of Science and Engineering, Waseda University, Shinjuku, Tokyo 169-8555, Japan
Boosting is known as a gradient descent algorithm over loss functions. It is often pointed out that the typical boosting algorithm, Adaboost, is highly affected by outliers. In this letter, loss functions for robust boosting are studied. Based on the concept of robust statistics, we propose a transformation of loss functions that makes boosting algorithms robust against extreme outliers. Next, the truncation of loss functions is applied to contamination models that describe the occurrence of mislabels near decision boundaries. Numerical experiments illustrate that the proposed loss functions derived from the contamination models are useful for handling highly noisy data in comparison with other loss functions. 1 Introduction Boosting is a learning method to construct a good predictor by combining so-called weak hypotheses. A typical implementation of boosting algorithms is Adaboost (Freund & Schapire, 1997), whose ability has been demonstrated from practical and theoretical viewpoints. Some statistical properties have been clarified based on the argument of the uniform convergence of empirical processes (Schapire, Freund, Bartlett, & Lee, 1998; Bartlett, Jordan, & McAuliffe, 2003). In addition, the relation between Neural Computation 19, 2183–2244 (2007)
C 2007 Massachusetts Institute of Technology
2184
T. Kanamori, T. Takenouchi, S. Eguchi, and N. Murata
boosting algorithms and statistical estimators such as the maximum likelihood estimator has been shown from the viewpoint of information geometry (Lebanon & Lafferty, 2002; Murata, Takenouchi, Kanamori, & Eguchi, 2004). Friedman, Hastie, and Tibshirani (2000) pointed out that boosting algorithms can be uniformly understood as coordinate descent methods derived from some classes of loss functions. In the past decade, some useful loss functions for classification problems have been proposed, for example, the ¨ hinge loss for support vector machine (Cortes & Vapnik, 1995; Scholkopf & Smola, 2001) and the exponential loss for Adaboost (Friedman et al., 2000). These typical loss functions are convex, and therefore highly developed optimization techniques can be applied to attain the global optima of loss functions. The relation between loss functions and prediction performance is widely studied in statistics and machine learning communities, since loss functions play an important role in statistical inference. In this letter, we discuss the robustness of boosting algorithms. It is pointed out that in the Adaboost procedure, excessively heavy weights are often assigned to solitary examples, even though these examples are outliers that should not be learned. Several proposals to improve robustness ¨ have already been made (R¨atsch, Onoda, & Muller, 2001; Servedio, 2003). Kalai and Servedio (2003) proposed a theoretical foundation of boosting in the presence of noisy data from the viewpoint of probably approximately correct (PAC) learning. On the other hand, we deal with those problems by applying theoretical techniques in robust statistics. We focus on two typical contamination situations in binary classification problems: outliers and mislabels. Outliers are a small number of observations that lie outside the scope of the assumption for data. Outliers may occur in recording data. It is often indicated that outliers seriously degrade the generalization performance of Adaboost, even though the ratio of outliers is small (R¨atsch et al., 2001). There are some criteria to measure the robustness of estimators. For example, gross error sensitivity (Hampel, Rousseeuw, Ronchetti, & Stahel, 1986) evaluates the worst case of estimator fluctuations against outliers. An estimator that minimizes the gross error sensitivity is called the most B-robust estimator, which is expected to be resistant to outliers. Here, “B” denotes “bias.” Moreover, robust estimators for observations with several sorts of outliers are introduced (see, e.g., Victoria-Feser, 2002). In this letter, we investigate the condition of loss functions that derive the most B-robust estimator when there are outliers and apply these loss functions to construct boosting algorithms. One of the other types of contamination is mislabeling, which may occur independent of the feature vector or may occur near the decision boundary of the feature space. A typical example of mislabels is the false diagnosis that usually occurs near the decision boundary. In the context of binary regression problems, statistical models describing mislabels, which are
Robust Loss Functions for Boosting
2185
called contamination models, have been studied (Copas, 1988; Takenouchi & Eguchi, 2004). We use contamination models to deal with a certain number of mislabels. Our main objective is to propose boosting algorithms that are robust against both outliers and mislabels at the same time. This letter is organized as follows. In section 2, we briefly introduce boosting algorithms from the viewpoint of optimizing loss functions. In section 3, we expound on the relation between loss functions and statistical models. In section 4, we explain some concepts in robust statistics and derive robust loss functions. Consequently, we propose a transformation formula of loss functions. The transformation of loss functions makes boosting algorithms robust against outliers. The results in section 4 are applied to contamination models in section 5, and their validity is illustrated with numerical experiments in section 6. The last section is devoted to concluding remarks. The appendices provide some proofs of theorems. In this letter, mainly statistical properties of boosting in the case of the limit of infinitely many large training sets are studied theoretically. The performance in finite sample size is assessed by numerical experiments. The preliminary results in section 4 have been presented earlier (Kanamori, Takenouchi, Eguchi, & Murata, 2004), and are expanded significantly here; the results in section 5 are new. 2 Boosting Algorithms Several studies have clarified that boosting algorithms are derived from gradient descent methods for loss functions (Friedman et al., 2000; Mason, Baxter, Bartlett, & Frean, 1999). (For gradient descent methods, refer to Bertsekas, 1999.) The derivation is illustrated in this section. Suppose that a set of samples, {(x1 , y1 ), . . . , (xn , yn )}, is observed, where xi is an element in an input space X , and yi takes 1 or −1 as class labels. We denote a set of hypotheses by H = {h t : X → {1, −1} | t = 1, 2, 3, . . .}. Each hypothesis is a class label function from input space X . Outputs of hypothesis denote a prediction of labels for given inputs. The cardinality of H may be uncountably infinite. In this letter, H is assumed to be finite set or countably infinite set for the sake of simple explanations. Our aim is to construct a powerful predictor from observed samples by combining hypotheses in H. In this letter, the term predictor denotes functions from X to R. Let predictor Hα (x) be a linear combination of hypotheses in H, that is, Hα (x) =
∞ t=1
αt h t (x),
(2.1)
2186
T. Kanamori, T. Takenouchi, S. Eguchi, and N. Murata
where α denotes α = {αt }∞ t=1 and αt ∈ R for all t. The prediction of class labels for input x ∈ X is given by sign(Hα (x)), where sign(z) denotes the sign of z. Let S[H] be the set of predictors, S[H] = {Hα | α1 < ∞} , where α1 denotes ∞ t=1 |αt |. For any input x ∈ X , the value of Hα (x) in S[H] is finite because the absolute convergence of infinite series, equation 2.1, is ensured. To estimate a predictor, we often use a loss function L : R → R. Some assumptions on loss functions will be shown in the next section. Prediction accuracy of a predictor is measured by the risk defined as follows. Definition 1 (Risk). Let Z be X × {1, −1}. For a probability measure Q on Z, risk RL (Q, H) of predictor H : X → R under loss function L : R → R is defined as RL (Q, H) =
Z
Q(dz)L(−yH(x)).
Typically, increasing functions are employed as loss functions. The prediction accuracy of H(x) tends to be higher as the value of the risk is smaller, because yH(x) takes large values with a high probability. Note that the positivity of yH(x) indicates the correct prediction. In practical situations, empirical distribution of a set of samples is substituted into risk as probability measure Q. The empirical distribution P is defined as 1 P(B, y) = I (xi ∈ B, yi = y), n n
i=1
where I (·) denotes the indicator function and B is any measurable subset of X . Then the risk is equal to RL ( P, H) =
1 L(−yi H(xi )), n n
i=1
which is called empirical risk. We show a generic boosting algorithm. In practice, input to the boosting algorithm is training data or empirical distribution. Here, for theoretical analysis, we introduce a boosting algorithm that takes any probability measure Q on Z as input. Thus, the boosting algorithm is described as statistical functional (Hampel et al., 1986).
Robust Loss Functions for Boosting
2187
The boosting algorithm is derived from the gradient method for minimizing risk RL (Q, H) with respect to H. Predictor H(x) is updated to H(x) + αh(x) to decrease the value of risk, where H ∈ S[H] and h ∈ H. Suppose that α is infinitesimal; then RL (Q, H + αh) − RL (Q, H) = −α
Z
Q(dz)L (−yH(x))yh(x) + o(α) (2.2)
holds. If α is positive, a preferable hypothesis h ∈ H would be a minimizer of equation 2.2. One has yh(x) = 1 − 2I (y = h(x)), and thus the equality arg min − h∈H
Z
Q(dz)L (−yH(x))yh(x)
= arg min h∈H
Z
Q(dz)L (−yH(x))I (y = h(x))
(2.3)
holds. Equation 2.3 is regarded as the gradient direction of risk RL (Q, H) at H. Although exact minimization with respect to h ∈ H is difficult, in practice it is enough to find hypotheses that approximately minimize the differential of risk at H. The value of equation 2.3 is interpreted as a weighted error of hypothesis h, and the weight on a sample (x, y) is given as L (−yH(x)). When the value of yH(x) is large (small), the weight on the sample (x, y) is small (large). This means that weights on samples will increase if the samples are not well trained. Adjusting weights on samples is an effective technique for boosting algorithms. We can apply common learning algorithms to find hypothesis h, which approximately minimizes weighted error on training samples, because many learning algorithms accept weighted samples as inputs. Even if a learning algorithm does not accept weighted samples as inputs, resampling techniques can be used to generate pseudo-weighted samples. In the context of boosting, the learning algorithm for hypothesis h is referred to as a weak learner. As a consequence of the above argument, a boosting algorithm based on loss function L is given in Figure 1. Note that the boosting algorithm with exponential loss L(z) = e z provides Adaboost when distribution Q is an empirical distribution and the initial predictor is given as H (0) (x) = 0. In the case of Adaboost, each coefficient α (m) is given in an analytical form. 3 Loss Functions and Associated Models In learning algorithms for classification problems, loss functions are often used to construct predictors that achieve high prediction performance.
2188
T. Kanamori, T. Takenouchi, S. Eguchi, and N. Murata
Figure 1: Pseudocode for generic boosting algorithms.
Thus, the design of loss functions is a central issue in learning theory. Typically, convex loss functions are used because of their ease of optimization. For example, exponential loss L(z) = e z in the Adaboost algorithm (Freund & Schapire, 1997; Friedman et al., 2000) and hinge loss L(z) = max{0, z + 1} in the support vector machine (Cortes & Vapnik, 1995) have been proposed for practical learning algorithms. For rigorous arguments, let us define loss functions as follows: Definition 2 (Loss Function). Loss function L : R → R is a function that is strictly increasing and convex on R and is strictly convex on the set of negative real numbers. Note that nonnegativity or lower-boundedness of loss function is not assumed. In section 5, we introduce an example of unbounded loss function originally proposed by Takenouchi and Eguchi (2004). The condition on convexity is a bit different between the positive and negative numbers. The exponential loss function satisfies the definition, but the hinge loss does not. When a loss function L is differentiable, one has L (z) ≥ 0 for all z ∈ R due to monotone increasing and L (z) > 0 for all z < 0 due to strict convexity. (See proposition B.4 in Bertsekas, 1999, for details.)
Robust Loss Functions for Boosting
2189
In this letter, the analysis is restricted to convex loss functions, since most boosting algorithms use convex loss functions. Although there are some boosting algorithms derived from nonconvex loss functions, usually risk function defined from nonconvex loss function has some local minima, and even a searching approximation of optimal solution is hopeless. The relation between loss function and statistical model is illustrated in this section, and for nonconvex loss functions, the corresponding statistical model is unlikely to have the property of identifiability. This is very undesirable for estimating conditional probability. From a practical viewpoint, using only convex loss functions is not a significant restriction. To prove the theorems in this letter, we assume the conditions on differentiability of loss functions shown below. We clearly specify which conditions are required to prove theorems: A1: Loss function L is once continuously differentiable on R. A2: Loss function L is twice continuously differentiable except at zero, and inequality sup |L (z)| < ∞
z=0,|z|≤A
holds for any A > 0. A3: Loss function L is three times continuously differentiable except at zero, and inequality sup |L (z)| < ∞
z=0,|z|≤A
holds for any A > 0. Here, we explore the relation among predictors, loss functions, and conditional probabilities. Let P(y|x) be a conditional probability of class labels given input x. Suppose that samples are identically and independently distributed. When the number of samples approaches infinity, empirical risk converges in probability to the expectation of the loss function by the law of large numbers, RL ( P, H) −→ µ(d x) P(y|x)L(−yH(x)) = RL (µP, H) (n → ∞), X
y=±1
(3.1) where µ is a probability measure on X and µP is the probability measure on X × {1, −1} defined as µ(d x)P(y|x), (3.2) µP(B, y) = B
for any measurable set B ⊂ X .
2190
T. Kanamori, T. Takenouchi, S. Eguchi, and N. Murata
We derive predictor H that minimizes risk RL (µP, H). The integrand of equation 3.1 is minimized at H(x), which satisfies ∂
y=±1
P(y|x)L(−yH(x)) ∂ H(x)
= 0,
(3.3)
because y=±1 P(y|x)L(−yH(x)) is convex in H(x) under A1 and extreme values are identical to the minimum value in convex functions. Equation 3.3 is equivalent to P(1|x) L (H(x)) = . P(−1|x) L (−H(x))
(3.4)
Note that if P(−1|x) is equal to zero, equation 3.3 is reduced to L (−H(x)) = 0, and it has no solution in H(x) because L is always positive under A1, as shown in the proof of lemma 1. To solve equation 3.4, let us define an odd function ρ L (z) as ρ L (z) =
L (z) 1 log . 2 L (−z)
When L(z) = e z , the function ρ L is equal to the identity function: ρ L (z) = z. For this reason, ρ L is multiplied by a factor of 12 . To define ρ L based on L, one needs not only nonnegativity of L but also positivity of L . Indeed, the following lemma holds. We leave the proof to appendix A. Lemma 1. Let L be a loss function. Under condition A1, function ρ L is well defined on R and is strictly increasing. Equation 3.4 is equivalent to P(1|x) 1 log = ρ L (H(x)), 2 P(−1|x) and if a solution exists, it is unique by lemma 1. The solution is given by H(x) =
ρ L−1
P(1|x) 1 log . 2 P(−1|x)
(3.5)
Robust Loss Functions for Boosting
2191
Note that the sign of the predictor, sign(H(x)), is identical to the Bayes rule of P(y|x) (McLachlan, 1992). When predictor H(x) satisfies equation 3.5 for arbitrary x ∈ X , the conditional probability has the form of P(y|x) =
1 . 1 + exp{−2ρ L (yH(x))}
(3.6)
and vice versa. Based on equality 3.6, let us define a statistical model M[ρ L ] as M[ρ L ] = {PρL (y|x, H) | H ∈ S[H]}, where PρL (y|x, H) =
1 . 1 + exp{−2ρ L (yH(x))}
The statistical model M[ρ L ] is specified by L, so M[ρ L ] is called the associated model of loss function L. The relation between loss functions and associated models is summarized in theorem 1. Theorem 1. Suppose that loss function L satisfies condition A1. If P(y|x) = PρL (y|x, Hα ) ∈ M[ρ L ] holds, then the optimal predictor that minimizes risk RL (µP, H) is equal to Hα ∈ S[H]. According to theorem 1, the estimator of conditional probability is constructed as follows. Suppose that predictor H is given by boosting algorithm under loss function L. The estimator for the underlying conditional probability is given by PρL (y|x, H). Typically, in boosting algorithms, the dimension of S[H] is high, and statistical model M[ρ L ] is large enough to find a good approximation of conditional probability P(y|x). For odd function ρ, which is strictly increasing, we can define statistical model M[ρ] as well as M[ρ L ], that is, P(y|x) ∈ M[ρ] ⇐⇒ ∃ H ∈ S[H], P(y|x) = Pρ (y|x, H) =
1 . 1 + exp{−2ρ(yH(x))}
If ρ is not strictly increasing, the identifiability of model M[ρ] is likely to be violated. For nonconvex function L, usually the monotonicity of ρ L does not hold. In such a case, different predictors may indicate the same conditional probability. To avoid such complicated situations, we impose strict monotonicity on ρ.
2192
T. Kanamori, T. Takenouchi, S. Eguchi, and N. Murata
The relation among loss functions is clarified by corollary 1. Corollary 1. Suppose that loss functions L 1 and L 2 satisfy condition A1 and that P(y|x) = Pρ (y|x, Hα ) ∈ M[ρ] holds. If equality ρ L 1 = ρ L 2 = ρ holds, then the minimizer of RL 1 (µP, H) in H is identical to that of RL 2 (µP, H), and the solution is given as Hα . Proof. Under the condition of ρ L 1 = ρ L 2 = ρ, we have M[ρ] = M[ρ L 1 ] = M[ρ L 2 ] and P(y|x) = Pρ (y|x, Hα ) = PρL 1 (y|x, Hα ) = PρL 2 (y|x, Hα ). Then the statement in the corollary is clear from theorem 1. Note that generally there are infinitely many loss functions that satisfy ρ L = ρ for an odd function ρ. The minimizer of RL (Q, H) is identical for any loss function L with ρ L = ρ, when Q ∈ M[ρ]. Example 1. The following three loss functions, L 1 (z) = e , z
L 2 (z) = log(1 + e ), and L 3 (z) = 2z
z≥0
z 1 2z e 2
−
1 2
z<0
, (3.7)
provide the same associated model, because ρ L 1 (z) = ρ L 2 (z) = ρ L 3 (z) = z holds. Note that these loss functions satisfy condition A1. The associated model of these loss functions is given as Pρ (y|x, H) =
1 , 1 + exp{−2yH(x)}
(3.8)
where ρ is the identity function ρ(z) = z. Model 3.8 is a logistic model that is widely used for statistical data analysis. Note that loss function L 1 is used for Adaboost, L 2 for Logitboost (Friedman et al., 2000), and L 3 for Madaboost (Domingo & Watanabe, 2000). Logistic models belong to exponential families in which function ρ is linear. If ρ is a nonlinear function, M[ρ] is not an exponential family. That is, the associated model has statistical curvature, which is defined in terms of information geometry (Amari & Nagaoka, 2000). Under a finite number of samples, the chosen predictors under each empirical risk are not necessarily identical. Thus, under the finite samples, the statistical properties of estimated predictors depend on loss functions. In the context of statistics, loss function L is regarded as an M-estimator under statistical model M[ρ L ] when the dimension of M[ρ L ] is finite (van der Vaart, 1998). It is important to design appropriate loss functions according to the situation in which training data are observed. If the underlying probability P(y|x) lies in M[ρ L ] of finite dimension, the log-likelihood
Robust Loss Functions for Boosting
2193
estimator for M[ρ L ] has good statistical properties. For a training set with few outliers, the log-likelihood estimator tends to be unstable, and some robust estimators would be useful to deal with contaminated data. 4 Robust Loss Functions In data analysis, observed samples are often contaminated by outliers or unspecified noise. In the machine learning community, it has been pointed out that Adaboost is sensitive to outliers and that predictors given by Adaboost do not have high prediction performance for noisy data. Some boosting algorithms are proposed to overcome this drawback (Servedio, 2003; R¨atsch et al., 2001). In this section we derive robust loss functions for boosting algorithms. Concepts and techniques in the field of robust statistics (Hampel et al., 1986) are applied to make boosting algorithms robust against outliers. First, we study robust loss functions for one-dimensional statistical models. Next, we apply theoretical results of one-dimensional models to infinite-dimensional models M[ρ] to derive robust loss functions for boosting algorithms. 4.1 Most B-Robust Loss Functions. First, we consider estimators in one-dimensional models. Let us define one-dimensional model M1 [ρ, h, H] as
M1 [ρ, h, H] = Pρ1 (y|x, α) α ∈ R , where h ∈ H, H ∈ S[H], and Pρ1 (y|x, α) =
1 . 1 + exp{−2ρ(yH(x) + αyh(x))}
Statistical model M1 [ρ, h, H] is regarded as a submodel of M[ρ] because H + αh is an element in S[H] for arbitrary α ∈ R. As explained in theorem 1, when P(y|x) = Pρ1 (y|x, α0 ) ∈ M1 [ρ, h, H], parameter α, which minimizes the risk RL (µP, H + αh), is given by α = α0 under the condition of ρ L = ρ. When some outliers may be included in observed samples, loss functions that are tolerant of outliers should be used. Otherwise a few outliers would undermine the reliability of estimated predictors. To construct robust loss functions, we need to quantify the influence of outliers on estimators. We use influence functions, which are commonly studied in the field of robust statistics (Hampel et al., 1986), to measure the influence of outliers. If joint probability µ(d x)Pρ1 (y|x, α0 ) is affected by an infinitesimal contamination from outlier (x˜ , y˜ ), contaminated probability Pε1 is defined by
Pε1 (B, y) = (1 − ε)
B
µ(d x)Pρ1 (y|x, α0 ) + ε I (x˜ ∈ B, y = y˜ ),
(4.1)
2194
T. Kanamori, T. Takenouchi, S. Eguchi, and N. Murata
where B is an arbitrary measurable subset in X and ε is the occurrence probability of outliers. When the risk is measured by loss function L, optimal parameter αε (x˜ , y˜ ), under the contaminated probability Pε1 , is given as αε (x˜ , y˜ ) = arg min RL ( Pε1 , H + αh). α∈R
Note that when ε = 0, αε (x˜ , y˜ ) is equal to α0 . Influence function φ((x˜ , y˜ ), L , α0 ) is defined by the infinitesimal shift of the estimated parameter as: φ((x˜ , y˜ ), L , α0 ) = lim
ε→+0
αε (x˜ , y˜ ) − α0 . ε
(4.2)
From the definition of the influence function, parameter αε (x˜ , y˜ ) is approximately given as αε (x˜ , y˜ ) = α0 + ε φ((x˜ , y˜ ), L , α0 ) + o(ε), and this expression suggests that the absolute value of φ((x˜ , y˜ ), L , α0 ) can be used as a measure of the influence of outliers. To assess the worst-case influence of contamination, let us define gross error sensitivity γ (L , α0 ) as γ (L , α0 ) = ess sup |φ((x˜ , y˜ ), L , α0 )|,
(4.3)
x˜ ∈X , y˜ =±1
where the essential supremum is taken instead of the supremum for a technical reason. For a measurable function f : Z → R, the essential supremum is defined as ess sup f (z) = inf{c | µP({z| f (z) > c}) = 0}, z∈Z
where µP is the probability measure on Z = X × {1, −1}, defined in equation 3.2. Note that µP(B, y) = 0 if and only if µ(B) = 0, under the condition that P(y|x) > 0 holds for any (x, y) ∈ X × {1, −1}. Gross error sensitivity depends on loss function L. A loss function that minimizes γ (L , α0 ) is regarded as the most robust loss function against outlier contamination. Let us define the most robust loss function based on gross error sensitivity: Definition 3 (Most B-Robust Loss Function). A loss function that minimizes gross error sensitivity, equation 4.3, subject to ρ L = ρ is called the most B-robust loss function at α0 for model M1 [ρ, h, H].
Robust Loss Functions for Boosting
2195
The condition ρ L = ρ ensures that the minimizer of RL (µP, H + αh) in α is identical to α0 . Besides ρ L = ρ, we may impose some constraints on loss functions such as A1, A2, or A3. The term most B-robust loss function is also used under such conditions. The gross error sensitivity and the most B-robustness are important concepts in robust statistics. Influence function and gross error sensitivity are defined in the limit of infinitely large training sets. For finite training sets, empirical influence function is defined according to Hampel et al. (1986). The theoretical analysis of empirical influence function is, however, not simple. In this letter, the theoretical analysis for finite training sets is not treated, and some numerical simulations are presented in section 6 to assess the validity of the theoretical results. Generally the most B-robust loss function is specified for each parameter α0 in M1 [ρ, h, H]. We show that in our context, the most B-robust loss function does not depend on α0 . Thus, the minimum value of the gross error sensitivity is uniformly attained by loss function L ρ , which is defined in theorem 3. The form of the influence function, equation 4.2, is given by theorem 2. The proof is in appendix B. Theorem 2. Let L be a loss function that satisfies A1 and A2. We suppose that there exists some constant δ > 0 such that ∀α ∈ (α0 − δ, α0 + δ),
µ({x|Hβ (x) + αh(x) = 0}) = 0
(4.4)
holds, where Hβ ∈ S[H] and h ∈ H. For outlier (x˜ , y˜ ), we assume Hβ (x˜ ) + α0 h(x˜ ) = 0.
(4.5)
Then the influence function of loss function L at α0 for model M1 [ρ L , h, Hβ ] is given as φ((x˜ , y˜ ), L , α0 ) =
L (− y˜ Hβ (x˜ ) − y˜ α0 h(x˜ )) y˜ h(x˜ ), C(L , ρ L )
(4.6)
where C(L , ρ) is defined as C(L , ρ) =
X
µ(d x)
Pρ1 (y|x, α0 )L (−yHβ (x) − α0 yh(x)).
(4.7)
y=±1
Note that integral 4.7 is well defined from assumption 4.4, though L (z) is not necessarily defined at z = 0 under conditions A1 and A2.
2196
T. Kanamori, T. Takenouchi, S. Eguchi, and N. Murata
The form of the most B-robust loss functions for M1 [ρ, h, H] is specified by the following theorem. The proof is in appendix C. Theorem 3. Let ρ : R → R be an odd function, which is once continuously differentiable, and ρ (z) > 0 holds for any z ∈ R. For parameter α0 of M1 [ρ, h, Hβ ] and probability measure µ on X , we suppose that equation 4.4 holds. Then the most B-robust loss function at α0 for M1 [ρ, h, Hβ ] subject to A1 and A2 is given as z z L ρ (z) = e 2ρ(w) dw
z≥0 z<0
.
(4.8)
0
Note that loss function L ρ does not depend on parameter α0 . This is significant in practice because L ρ always minimizes the gross error sensitivity regardless of the parameter in M1 [ρ, h, Hβ ]. The derivative of L ρ is equal to L ρ (z)
=
1 e
2ρ(z)
z≥0 z<0
,
which is proportional to weights on samples in boosting algorithms. Derivative L ρ (z) is constant for z ≥ 0, and thus significant weights are not assigned to outliers. This is an intuitive reason that why L ρ is robust against outliers. The loss function for Madaboost, L 3 , is the most B-robust loss function for one-dimensional logistic models in which the function ρ is given as ρ(z) = z. Loss functions for Adaboost, Logitboost, and Madaboost, which are defined by L 1 , L 2 , and L 3 in equation 3.7, are plotted in Figure 2. The derivatives of these loss functions are also plotted. Next, we study how loss functions control the effect of each sample on estimators. When loss function L is used to estimate parameter α of M1 [ρ, h, H], the estimator is given as the solution of equations n
L (−yi H(xi ) − yi αh(xi ))yi h(xi ) = 0.
i=1
This equation denotes that the derivative of empirical risk vanishes at the estimated parameter. We find that the previous equation is equivalent to n i=1
yi + 1 1 v(H(xi ) + αh(xi )) − Pρ (+1|xi , α) h(xi ) = 0, 2
2197
4
2.0
5
2.5
Robust Loss Functions for Boosting
Adaboost Logitboost Madaboost
0
0.0
1
0.5
2
1.0
L(z)
L’(z)
3
1.5
Adaboost Logitboost Madaboost
−2
−1
0
1
2
z
−2
−1
0
1
2
z
Figure 2: Loss functions for logistic models M1 [ρ, h, H], where ρ(z) = z. (Left) Loss functions for Adaboost, Logitboost, and Madaboost. Constant values are added to loss function such that limz→−∞ L(z) = 0. (Right) Derivatives of functions are plotted.
where weight function v is defined by v(z) = L (z) + L (−z), because L (−yi H(xi ) − yi αh(xi ))yi h(xi ) L (−yi H(xi ) − yi αh(xi )) v(−yi H(xi ) − yi αh(xi ))yi h(xi ) v(−yi H(xi ) − yi αh(xi )) = yi 1 − Pρ1 (yi |x, α) v(H(xi ) + αh(xi ))h(xi ) yi + 1 1 = − Pρ (1|x, α) v(H(xi ) + αh(xi ))h(xi ) 2 =
holds. Note that v is an even function. Therefore, loss function L determines weights of samples via function v(z). Under logistic models with ρ(z) = z, function v(z) for the maximum likelihood estimator is a constant function. The weight functions for Adaboost, Logitboost, and Madaboost are plotted in Figure 3. Note that the absolute value of H(x) + αh(x) becomes large when conditional probability Pρ1 (+1|x, α) is close to 0 or 1. In the case of Adaboost with L(z) = e z , if Pρ1 (+1|x, α) is close to 1, the change in sign of the class labels from 1 to −1 at x significantly affects the estimation result because weight function v(H(x) + αh(x)) takes a large value, while the weight function for Madaboost depresses the influence of such outliers. Likewise, a loss function such as L ρ in equation 4.8 depresses the influence of outliers.
T. Kanamori, T. Takenouchi, S. Eguchi, and N. Murata
2.0
Adaboost Logitboost Madaboost
1.0
1.5
v(z)
2.5
3.0
2198
0
1
2
z
Figure 3: Weight functions for Adaboost, Logitboost, and Madaboost are plotted. Loss functions for these boosting algorithms are given in equation 3.7.
Victoria-Feser (2002) proposed some other weight function candidates to depress the influence of outliers. 4.2 Robustness of Boosting Algorithms. The most B-robust loss functions for one-dimensional model M1 [ρ, h, H] were studied in the previous section. Next, we study the robustness of boosting algorithms under model M[ρ]. Here, contaminated probability Pε of Pρ (y|x, H) ∈ M[ρ] is defined by
Pε (B, y) = (1 − ε)
B
µ(d x)Pρ (y|x, H) + ε I (x˜ ∈ B, y = y˜ ),
where B is any measurable subset in X . We study the influence of contaminations for Pρ (y|x, H). This means that we consider the situation such that Q= Pε , H (0) = H, and M = 1
(4.9)
in the boosting algorithm in Figure 1. Let an output of the boosting algo L ,ε . rithm under loss function L be H
Robust Loss Functions for Boosting
2199
The setup of equation 4.9 may seem unrealistic. This situation is considered a final phase of the boosting algorithm. That is, when enough accurate predictor is already obtained, we would like to evaluate how much one-step, that is, M = 1, boosting algorithm affects the estimator under contamination by outliers.
L ,ε , we define the changeTo measure the influence of outlier (x˜ , y˜ ) on H of-risk function ψ L 0 ((x˜ , y˜ ), L , H) as ψ L 0 ((x˜ , y˜ ), L , H) = lim
ε→+0
L ,ε ) − RL 0 (µP, H) RL 0 (µP, H , ε2
(4.10)
where P(y|x) = Pρ (y|x, H) ∈ M[ρ] and L 0 is a loss function to measure the
L ,ε . We suppose that L and L 0 satisfy equality difference between H and H ρ L = ρ L 0 = ρ. A typical example of L 0 is L 0 (z) = log(1 + e 2ρ(z) ), because L 0 (−yH(x)) corresponds to the negative log likelihood of Pρ (y|x, H) ∈ M[ρ]. The change-of-risk sensitivity γ L 0 (L , H) for M[ρ] is also defined as γ L 0 (L , H) = ess sup ψ L 0 ((x˜ , y˜ ), L , H).
(4.11)
x˜ ∈X , y˜ =±1
In Hampel et al. (1986), the change-of-variance function and the changeof-variance sensitivity are defined to investigate the influence of outliers on the asymptotic variance of estimators. In our context, the number of parameters in model M[ρ] is usually infinite, and the standard definition in Hampel et al. (1986) does not work well. Therefore, we define the change-ofrisk function and the change-of-risk sensitivity based on risk measure. The minimizer of γ L 0 (L , H) in L subject to ρ L = ρ is called the most B-robust loss function for model M[ρ] according to the definition for one-dimensional models. When P(y|x) = Pρ1 (y|x, α0 ) ∈ M1 [ρ, h, H] holds and the parameter α0 of M1 [ρ, h, H] is estimated by using loss function L, proportionality
lim
ε→+0
RL 0 (µP, H + αε (x˜ , y˜ )h) − RL 0 (µP, H + α0 h) ∝ φ((x˜ , y˜ ), L , H)2 ε2
2200
T. Kanamori, T. Takenouchi, S. Eguchi, and N. Murata
holds. This is shown in the proof of theorem 4. Thus, for the one-dimensional model M1 [ρ, h, H], we have ψ L 0 ((x˜ , y˜ ), L , H) ∝ φ((x˜ , y˜ ), L , H)2 and γ L 0 (L , H) ∝ γ (L , H)2 .
(4.12)
The proportional constant depends on L 0 . Hence, change-of-risk function 4.10 and change-of-risk sensitivity 4.11 are regarded as a generalization of influence function 4.2 and gross error sensitivity 4.3, respectively. According to equation 4.12, we expect that loss function L ρ defined in theorem 3 also minimizes change-of-risk sensitivity 4.11 for model M[ρ]. The rigorous statement is shown in theorem 4. The proof is in appendix D. Theorem 4. Let ρ : R → R be an odd function that is twice continuously differentiable and ρ (z) > 0 holds for any z ∈ R. Suppose that loss function L 0 satisfies A1 and A2 and that equality ρ L 0 = ρ holds. For a probability measure µ on X and a predictor Hβ ∈ S[H], we assume that there exists some constant δ0 > 0 such that ∀c ∈ (−δ0 , δ0 ),
µ({x|Hβ (x) = c}) = 0
(4.13)
holds. Then the loss function L ρ is a minimizer of the change-of-risk sensitivity γ L 0 (L , Hβ ) in L subject to conditions, A1, A2, A3, and ρ L = ρ. Theorem 4 explains that the boosting algorithm based on L ρ is robust against outliers at probability Pρ (y|x, Hβ ). If the initial predictor, H (0) , is far from Hβ (x), the influence of outliers on boosting algorithms is still not clearly understood. However, we think that theorem 4 suggests a practical way of constructing robust-boosting algorithms. 4.3 Transformation of Loss Functions. Theorems 3 and 4 give us a way of transforming loss functions for robust boosting algorithms. For loss function L, the most B-robust loss function L ρL for model M[ρ L ] is given as L ρL (z) =
z z
L (w) 0 L (−w) dw
because e 2ρL (w) =
L (w) L (−w)
z≥0 z<0
,
(4.14)
Robust Loss Functions for Boosting
2201
holds. Note that we can apply Newton methods to determine the value of α (m) in boosting algorithms even though L ρL is not given in the analytical form. A few examples of L ρL are shown below. Example 2. Adaboost uses L 1 (z) in equation 3.7 as the loss function. We obtain L ρL 1 (z) = L 3 (z) because the function ρ L 1 is given as ρ L 1 (z) = z. That is, the most B-robust boosting algorithm for logistic model M[ρ L 1 ] is Madaboost. The target of estimators under model M[ρ L 1 ] is log-odds of conditional probabilities, 1 P(1|x) log ∈ S[H], 2 P(−1|x) which is a basic quantity in statistical analysis (MacCullagh & Nelder, 1989). Example 3. Truncated quadratic loss function L 4 (z) = (max{0, 1 + z})2 is also used for classification problems. The support vector machine with ¨ 2-norm soft margin (Scholkopf & Smola, 2001) uses this loss function. Since the domain on which L 4 is strictly convex is restricted, the set of predictors should also be restricted as follows: ¯ S[H] = {Hα | α1 < 1}. On the interval (−1, 1), L 4 satisfies conditions A1, A2 and A3. Thus, ¯ theorems 3 and 4 hold for S[H] instead of S[H]. Consequently, the most B-robust loss function for L 4 is L ρL 4 (z) =
z
0≤z<1
−z − 2 log(1 − z)
−1 < z < 0
,
where ρ L 4 (z) =
1+z 1 log , 2 1−z
−1 < z < 1.
The set of conditional probabilities in associated model M[ρ L 4 ] is given as PρL 4 (y|x, H) =
1 (1 + yH(x)) , 2
¯ H ∈ S[H].
2202
T. Kanamori, T. Takenouchi, S. Eguchi, and N. Murata
Then the relation between predictors and conditional probabilities is specified by affine transformation. Rosset (2005) has also proposed a transformation of loss functions to robustify boosting algorithms against outliers. The loss function is referred to as Huberized loss. The definition of Huberized loss is similar to equation 4.14. Indeed, the Huberized loss also has linear part beyond some point. Rosset (2005) investigated the relation between Huberized loss and bagging (Breiman, 1994) and gave the rationale of Huberized loss as hybrids of standard boosting loss functions and the bagging linear loss function. 5 Contamination Models In this section, we introduce contamination models to deal with a change in sign of class labels or mislabels. Contamination models describe the change in sign of class labels near decision boundaries. Transformation formula 4.14 can be applied to loss functions for contamination models. Then we obtain loss functions that are resistant to an infinitesimally small number of outliers and substantially positive percentage of mislabels. 5.1 Statistical Models for Mislabels. Contamination models were proposed for describing the occurrence of mislabels (Copas, 1988). Typically, conditional probability P(y|x) is contaminated, such as (1 − κ)P(y|x) + κ P(−y|x),
(5.1)
where κ denotes the ratio of mislabels. Note that the decision boundary of equation 5.1 is identical to that of P(y|x) for values of κ less than 0.5. In an ideal situation, the probability we try to estimate from the observed samples is P(y|x), while in a practical situation, the samples are generated according to the contaminated model, equation 5.1. A typical example of a mislabel is a false diagnosis. For example, suppose that x denotes a diagnostic test results and y denotes whether a tumor is benign or malignant. Contamination model 5.1 represents the conditional probability of a diagnostic result based on x in the presence of the false diagnosis, while P(y|x) denotes the probability whether a tumor is actually benign or malignant based on diagnostic test result x. Therefore, the value of κ denotes the precision of the diagnosis. It is valid to assume that the samples we can use are generated according to the contamination models. In the context of boosting, Takenouchi and Eguchi (2004) discussed an extension of model 5.1 with ratio κ mislabeled depending on x. It would be more realistic if κ becomes larger as x is near the decision boundary. The loss function is defined by L(z; η) = (1 − η)e z + ηz,
(5.2)
Robust Loss Functions for Boosting
2203
where η is a nonnegative constant less than one. The boosting algorithm based on equation 5.2 is called eta-boost. Note that loss function 5.2 consists of the loss function for Adaboost and a linear term. Derivative L (z; η) with respect to z satisfies z ≥ 0 =⇒ 1 ≤ L (z; η) ≤ e z and z < 0 =⇒ 1 ≥ L (z; η) ≥ e z because of the linear term. This means that the weight distribution of samples is shifted to the uniform distribution in comparison to Adaboost. Thus, the linear term is expected to moderate the overweighting to mislabels. A model associated with equation 5.2 is given as P(y|x, H) = (1 − κη (x, H))P0 (y|x, H) + κη (x, H)P0 (−y|x, H),
(5.3)
where P0 (y|x, H) is logistic model P0 (y|x, H) =
1 1+
e −2yH(x)
,
and κ is defined by κη (x, H) =
(1 − η)
η y=±1
e −yH(x) + 2η
.
Function ρ L for the associated model of equation 5.2 is also derived as ρ L (z) =
1 (1 − η)e z + η log . 2 (1 − η)e −z + η
Note that κη (x, H) is maximized at x such that H(x) = 0 and the maximum value is η/2. Therefore, the mislabel occurs primarily near the decision boundary, and the probability of contamination, κη (x, H), decreases exponentially as input x moves away from the decision boundary. 5.2 Robust Eta-Boost Algorithm. In Takenouchi and Eguchi (2004), the eta-boost algorithm is shown to work well on noisy samples. Possible extreme outliers, however, would degrade the generalization performance of eta-boost because the loss function increases exponentially as well as that of Adaboost.
T. Kanamori, T. Takenouchi, S. Eguchi, and N. Murata
2.0
4
5
2.5
2204
Logitboost eta−boost robust eta−boost
L’(z)
−1
0.0
0
0.5
1
1.0
2
L(z)
1.5
3
Logitboost eta−boost robust eta−boost
−2
−1
0
1
−2
2
−1
0
1
2
z
z
Figure 4: Loss functions for the contamination models. (Left) The loss functions for Logitboost, eta-boost, and robust eta-boost. The value of η is set to 0.5. Constant values are added to the loss function for robust eta-boost such that limz→−∞ L(z; η) = 0. (Right) Derivatives of functions are plotted.
As studied in the last section, transformation 4.14 of loss function 5.2 provides a robust modification of eta-boost. The transformed loss function is analytically given as L ρL (z; η) =
z≥0
z 1 [(1 η2
− η)(e − 1)η + (2η − 1) log(1 + (e − 1)η)] z
z
z<0
, (5.4)
and the derivative is L ρL (z; η)
=
1
z≥0
(1−η)e z +η (1−η)e −z +η
z<0
.
We refer to the boosting algorithm under L ρL (z; η) as robust eta-boost. Note that lim L ρL (z, η) = L 3 (z),
η→0
which means that L ρL is an extension of L 3 . By using loss function 5.4, the boosting algorithm is expected to be robust against both mislabels near the decision boundary and outliers far from the decision boundary. Loss functions for eta-boost and robust eta-boost are plotted with the loss function of Logitboost in Figure 4. The derivatives are also plotted.
Robust Loss Functions for Boosting
2205
6 Illustrative Examples In classification problems, the concept of margin is important in achieving high prediction accuracy (Vapnik, 1998; Schapire et al., 1998). A common definition of margin on a training sample (x, y) is given as yHα (x)/α1 . Most binary classification methods are designed to maximize the smallest margin over training samples, that is, the optimization problem max min α
i
yi Hα (xi ) α1
(6.1)
is solved approximately. In Adaboost algorithm, the predictor minimizing the exponential loss, n 1 yi Hα (xi ) exp −α1 , n α1 i=1
can be regarded as an approximate solution of equation 6.1. In the same way, boosting under loss function L is also regarded as a learning algorithm that approximately provides predictors maximizing the smallest margin. Under noisy data, the predictor, which is an exact solution of equation 6.1, typically overfits the training data. As a result, estimated classifiers do not achieve high prediction accuracy. Many researchers point out that maximization of the smallest soft margin often provides promising results for noisy data (Vapnik, 1998; Meir & R¨atsch, 2003). A soft margin relaxes the penalty of misclassification in comparison to a margin. This relaxation depends on each learning algorithm, and the design of the soft margin is significant for achieving stable estimation of predictors under noisy data. Robust boosting algorithms (Demiriz, Bennett, & Shawe-Taylor, 2002; R¨atsch, 2001; R¨atsch et al., 2000; R¨atsch et al., 2001; R¨atsch, Demiriz, & Bennett, 2002) also maximize the smallest soft margin. For example, in RoBoost (R¨atsch et al., 2000) and Adaboost Reg (R¨atsch et al., 2001), a soft margin is substituted into exponential loss of Adaboost instead of a margin, where the definition of soft margin is different between RoBoost and Adaboost Reg . In this letter, we propose the robust boosting algorithm under loss function L ρ . Our methods do not use soft margin directly. We examine boosting with the most B-robust loss function L ρ under noisy data and compare the results with existing boosting algorithms. In this section, six boosting algorithms are compared: Adaboost, Logitboost, Madaboost, eta-boost, robust eta-boost, and Adaboost Reg . Loss functions for Adaboost, Logitboost, and Madaboost are introduced in equation 3.7. Loss functions for eta-boost and
2206
T. Kanamori, T. Takenouchi, S. Eguchi, and N. Murata
robust eta-boost are shown in section 5. Adaboost Reg differs slightly from other boosting algorithms. In the learning process of Adaboost Reg , changes of the weights are monitored, and the history of weights in the process is used to detect outliers. The algorithm of Adaboost Reg is more complicated, and its calculation is more time-consuming than other boosting algorithms based on coordinate descent methods derived from loss functions. In numerical experiments, the computational cost of Adaboost Reg is more than twice that of robust eta-boost, including hyperparameter estimation. First, we study two-dimensional binary classification problems. The robustness of the proposed method is examined for toy problems. Next, we apply boosting algorithms to the data in the IDA benchmark repository (R¨atsch et al., 2001; R¨atsch et al., 2000). 6.1 Synthetic Example. Labeled samples are generated from a fixed probability, and class labels of a few samples are flipped as outliers. The detailed setup is as follows. Input vector x is uniformly distributed on the two-dimensional region [−π, π] × [−π, π], and the conditional probability of the class label is defined by P(y|x1 , x2 ; η, H) =
(1 − η)e yH(x1 ,x2 ) + η , (1 − η) y =±1 e y H(x1 ,x2 ) + 2η
(6.2)
where H(x1 , x2 ) = x2 − 3 sin x1 . Note that P(y|x1 , x2 ; η, H) corresponds to model 5.3 with H(x1 , x2 ) = x2 − 3 sin x1 . Moreover, a % samples are flipped as outliers, where these are randomly chosen from the top 10a % samples sorted in descending order of |H(x1 , x2 )|. Hence, outliers are scattered far from the decision boundary. Typical training samples are shown in Figure 5. In these numerical experiments, the learning algorithm called “decision stumps” (Friedman et al., 2000) is used as a weak learner. First, test errors of robust eta-boost and Adaboost Reg (R¨atsch et al., 2001) are compared. The value of η in P(y|x1 , x2 ; η, H) is set to 0.5; that is, there are about 25% mislabels around the decision boundary. The averaged test errors in 100 different runs for robust eta-boost and Adaboost Reg are shown in Figure 6. The axis of the abscissa denotes the number of m in the boosting algorithm. In each run, the number of training data is set to 200, where 2% outliers are mixed with the training data. The test error is calculated using 5000 test samples that include neither mislabels nor outliers, because our main objective of the numerical experiments is to observe the influence of contamination on the test error of label prediction. Both Adaboost Reg and robust eta-boost include hyperparameters such as the regularization
2207
0
x2
1
2
3
Robust Loss Functions for Boosting
+1 mislabel outlier
0
1
2
3
x1
Figure 5: Typical training samples used in numerical experiments are plotted. The value of η is set to 0.5, and 2% outliers are mixed in these training samples. Mislabels are observed around the decision boundary, and some outliers are scattered far from the decision boundary.
parameter C and the mislabel ratio η, respectively. Although appropriate selection of the value of C in Adaboost Reg results in a fairly good generalization performance, the estimation of C is often difficult because of its wide range of values. On the other hand, the value of η for robust eta-boost is restricted to the interval [0, 1], where the value has a clear meaning. Next, the training results by Adaboost, Madaboost, and Adaboost Reg are compared from the viewpoint of the robustness. The mislabel ratio η in equation 6.2 is set to 0.0. Thus, there are no mislabeled data. The regularization parameter, C, for Adaboost Reg is set to C = 500. The averaged test errors in 50 different runs for Adaboost, Madaboost, and Adaboost Reg are shown in Figures 7a and 7b, with respect to the number of boosting iterations. As shown in the figure, Adaboost is sensitive to outliers. To demonstrate this sensitivity to the outliers, we plot the test error differences against the number of boosting iterations. In Figure 7c, the difference of test errors between 2% outliers and nonoutliers is plotted. In fact, for the training data with outliers, Figure 7b shows that Madaboost and Adaboost Reg stably provide an optimal performance around 50 boosting iterations, while Adaboost fails to capture such an optimal performance, which was obtained for the training
T. Kanamori, T. Takenouchi, S. Eguchi, and N. Murata .
AdaboostReg, C=10 AdaboostReg, C=500
16.5 15.5
16.0
test error (%)
17.0
17.5
2208
20
40
60
80
100
boost
Figure 6: Averaged test errors for Adaboost Reg and robust eta-boost. Samples are contaminated by 2% outliers. Regularization parameter C for Adaboost Reg is set to 10 or 500; η in robust eta-boost is set to 0.1 or 0.5.
data without outliers as observed in Figure 7a. Thus, we conclude that Adaboost is more sensitive to outliers than Madaboost and Adaboost Reg under this numerical experiment. Next, we compare six boosting algorithms. The results are shown in Table 1. In this case, the mislabel ratio η in equation 6.2 is set to 0.5. We apply 10-fold cross validation to determine hyperparameters such as the number of boosting iterations, the regularization parameter of Adaboost Reg , and the mislabel ratio in eta-boost and robust eta-boost. The test error is evaluated for 5000 test samples that include neither mislabels nor outliers. In Table 1, asterisks denote that the test error of a particular learning algorithm is 5%(∗) or 1%(∗∗), significantly less than that of Adaboost under the paired t-test. The generalization performance is shown for each level of outliers, and some of the results in Table 1 are depicted in Figure 8 by box-and-whisker plots. In case of nonoutliers, Adaboost Reg and eta-boost are better than the other algorithms, although the significance of their superiority is not clear. This result suggests that Adaboost Reg provides good performance even under contamination by mislabels near the decision boundary. As the ratio
Robust Loss Functions for Boosting
2209
of outliers increases, boosting algorithms other than robust eta-boost and Madaboost are affected by outliers. As indicated by the analysis of loss functions based on gross error sensitivity, loss functions with bounded derivatives are tolerant of outliers. When there are a few outliers in the training data, the averaged results for robust eta-boost are slightly better than those for Madaboost. This suggests that the parameter η in robust eta-boost effectively depresses the influence of mislabels on estimators. Finally, we study another kind of contamination: adding nuisance di¨ mensions of pure noise (Blanchard, Sch¨afer, Rozenholc, & Muller, 2005). More precisely, for a two-dimensional input (x1 , x2 ) of training data, nuisance dimensions are added as (x1 , x2 , ε3 , . . . , ε ), where εi ’s are independently generated from the normal distribution with mean 0 and standard deviation 100. That is, the noise level of nuisance dimension is quite large. The dimension of input in test data is also extended, (x1 , x2 , 0, . . . , 0) ∈ R . In this experiment, the parameters of the conditional probability 6.2 are given as η = 0 and H(x1 , x2 ) = 10(x2 − 3 sin x1 ). Blanchard et al. (2005) have reported that the support vector machine is not robust in this setting. The results of the experiments are shown in Table 2, where “dimension” denotes
Figure 7: Test error of boosting algorithms. (a) Outliers: 0%. (b) Outliers: 2%. (c) Difference between test errors of 2%—outliers and nonoutliers.
2210
Figure 7: Continued
T. Kanamori, T. Takenouchi, S. Eguchi, and N. Murata
Robust Loss Functions for Boosting
2211
Table 1: Averaged Test Errors for Six Boosting Algorithms. Outliers
0%
Adaboost Logitboost Madaboost Eta-boost Robust eta-boost Adaboost Reg
14.86 ± 0.23 14.95 ± 0.22 15.04 ± 0.22 14.70 ± 0.19 14.87 ± 0.20 14.57 ± 0.18
2%
5%
16.50 ± 0.24
17.45 ± 0.30
∗ 16.01 ± 0.22
∗ 16.93 ± 0.24
16.90 ± 0.26 ∗∗ 15.70 ± 0.22 16.09 ± 0.30
∗∗ 16.62 ± 0.21
∗ 15.98 ± 0.24
∗ 16.89 ± 0.29
17.43 ± 0.26 16.96 ± 0.30
Notes: Asterisks denote that the test errors of the learning algorithm are 5%(∗) or 1%(∗∗), significantly less than that of Adaboost under the paired t-test. Boldface denotes the best method for each level of outliers.
Synthetic Data (outlier: 5%)
25
test error(%)
20
20
15
15
test error(%)
25
30
30
Synthetic Data (outlier: 0%)
robust eta
eta
Ada
Mada
Logit
Ada Reg
robust eta
eta
Ada
Mada
Logit
Ada Reg
Figure 8: Test results of synthetic data. Test errors of robust eta-boost, eta-boost, Adaboost, Madaboost, Logitboost, and Adaboost Reg over 100 trials are shown by box-and-whisker plots. The boxes have lines at the lower quartile, median, and upper quartile values. The whiskers are lines extending from each end of the box to the most extreme data value within 1.5 · IQR (interquartile range) of the box. Outliers are data with values beyond the end of the whiskers, which are displayed as open circles. (Left) Results without outliers. (Right) Results with 5% outliers.
the dimension of the input vector. For each setup, we consider 50 repetitions of a training set of size 200 and a test set of size 5000. In the numerical experiments, all hyperparameters are estimated from contaminated training data by applying five-fold cross validation. All of the methods except the support vector machine are stable against additional noisy dimensions. Note that the applied weak learner is the decision stumps. In general, decision
2212
T. Kanamori, T. Takenouchi, S. Eguchi, and N. Murata
Table 2: Results of Boosting Methods, with Contamination of the Data by Extra Noisy Dimensions. Dimension Adaboost Logitboost Madaboost Eta-boost Robust eta-boost Adaboost Reg SVM
2
4
6
8
7.6 ± 1.8 7.4 ± 1.3 7.6 ± 1.8 10.2 ± 2.2 7.2 ± 1.3 7.6 ± 1.5 8.1 ± 1.6
7.7 ± 1.6 7.8 ± 1.4 7.7 ± 1.2 11.2 ± 2.8 7.9 ± 1.5 7.9 ± 1.3 14.0 ± 1.1
8.9 ± 2.0 8.4 ± 1.2 8.6 ± 1.7 11.0 ± 2.2 8.7 ± 1.6 8.8 ± 1.8 16.5 ± 1.0
9.1 ± 1.7 9.3 ± 1.9 9.3 ± 1.5 11.0 ± 2.5 9.2 ± 1.7 9.2 ± 1.5 18.4 ± 1.3
Notes: The “dimension” in the table includes both the informative part and the purely noisy part. For each setup, we consider 50 repetitions of a training set of size 200 and a test set of size 5000. The averaged test error and the standard deviation are reported in percentage.
stumps are quite robust to various kinds of noises. We think that the stability of the boosting algorithms to additional nuisance dimensions comes from the robustness of decision stumps. 6.2 Experiments on Benchmark Data. We use 13 artificial and realworld data sets from the UCI, DELVE, and STATLOG benchmark repositories: Banana, Breast-Cancer, Diabetes, German, Heart, Image, Ringnorm, Flare-Solar, Splice, Thyroid, Titanic, Twonorm, and Waveform. All data sets are provided as IDA benchmark repository. (See R¨atsch et al., 2001, and R¨atsch et al., 2000, for details of data sets.) Each data set includes training sets and test sets. We divide each training set into two parts: 70% for parameter estimation and 30% for hyperparameter estimation. The properties of each data set are shown in Table 3, where dim, #train, #val, #test, and #real denote the input dimension, the size of training set for parameter estimation, the size of validation set for hyperparameter estimation, the size of test set, and the number of realizations of the data, respectively. For real-world data, realization of data denotes different random partitions of data. We apply “decision stumps” (Friedman et al., 2000) as a weak learner. In the boosting algorithm, the parameter α of the predictor Hα is estimated on the training set for parameter estimation. The validation set is used to estimate hyperparameters such as the number of boosting iterations m, the value of η in eta-boost and robust eta-boost, or the regularization parameter C in Adaboost Reg . We can apply the cross-validation technique to estimate the hyperparameters. In the experiments, however, we do not use cross validation because it is time-consuming. Accuracy of estimators is evaluated as follows. The prediction accuracy of H is measured by the test error of sign(H) on the test set of each data set.
Robust Loss Functions for Boosting
2213
Table 3: Properties of Each Data Set. Data Property Name Banana Breast-Cancer Diabetes German Heart Image Ringnorm Flare-solar Splice Thyroid Titanic Twonorm Waveform
dim
# train
# val
# test
# real
2 9 8 20 13 18 20 9 60 5 3 20 21
280 140 327 489 118 909 280 466 700 98 105 280 280
120 60 141 211 52 391 120 200 300 42 45 120 120
4900 77 300 300 100 1010 7000 400 2175 75 2051 7000 4600
100 100 100 100 100 20 100 100 20 100 100 100 100
Note: Dim, # train, # val, # test, and # real denote the input dimension, the size of the training set for parameter estimation, the size of the validation set, the size of the test set, and the number of realizations of the data, respectively.
The log-loss, defined as 1 log PρL ( y¯ i |x¯ i , H), N N
log-loss = −
i=1
is used to measure the estimation accuracy of the conditional probability PρL (y|x, H) given by the loss function L, where {(x¯ i , y¯ i ), i = 1, . . . , N} denotes the test set of each data set. Log-loss is commonly used to measure discrepancy between the underlying probability and the estimated one ¨ (Grunwald & Dawid, 2004). To estimate the accurate classifier, we adopt the hyperparameter minimizing test error on the validation set. On the other hand, for the estimation of conditional probabilities, hyperparameter minimizing log-loss on validation set will be preferable. For Adaboost Reg , however, there is no clear correspondence between predictors and conditional probabilities. In Adaboost Reg algorithm, the soft margin is substituted into the loss function L(−z) = e −z/2 , and thus the statistical model associated with the loss function is expected to provide a natural correspondence between predictors and conditional probabilities. Therefore, we define the conditional probability estimated by Adaboost Reg as P(y|x, H) =
1 , 1 + exp{−yH(x)}
where H is a predictor given by Adaboost Reg .
2214
T. Kanamori, T. Takenouchi, S. Eguchi, and N. Murata
Our main concern is to observe the influence of outliers or mislabels on estimators. Thus, outliers and mislabels are added to the training set, where training set includes all the samples for parameter estimation and validation. First, labels of training set are flipped randomly as mislabels. Note that mislabels are mixed uniformly. To add outliers to the training set, we require rough estimate of the conditional probability of labels, since the probability of outliers will be small under ideal situation. We input the training data without mislabels and outliers into rpart algorithm (Breiman, Friedman, Olshen, & Stone, 1984), and obtain rough estimates of conditional probability. New training samples with low conditional probability are generated and added to the training set as outliers. In the same way, mislabels are mixed to test sets, but outliers are not added. Since we are interested in estimation accuracy for underlying conditional probability, test sets are also contaminated by mislabels. A predictor, H, is estimated by using training and validation sets in each realized data set, and the accuracy of H is evaluated on the test set. If the number of realizations is 100, one has 100 estimates. The averaged values of test error and log-loss × 10 for these estimates are shown with standard deviation in Tables 5 to 17 in appendix E, and the summary of experiments over 13 data sets are shown in Table 4. Table 4 indicates the winning percentage of a particular method over 13 data sets, where averaged value of test error or log-loss is used to compare with each other methods. For example, under the criterion of test error, robust eta-boost wins seven times over 13 data sets with 0% mislabels and 3% outliers. In Tables 5 to 17, detailed results are shown, where the p-values are calculated by one-side paired t-test against Adaboost. R¨atsch et al. (2001) used the same data sets (IDA repository) to assess the prediction accuracy of some boosting algorithms including Adaboost Reg . In their experiments, for example, the test error on Banana is about 10%. In our experiments, the test error on Banana is roughly 27% under 0% mislabels and 0% outliers as shown in Table 5. The difference mainly comes from the choice of weak learner. R¨atsch et al. (2001) applied radial basis function nets as the weak learner, and we use decision stumps. When we apply rpart as the weak learner, the test error on Banana improves to about 14%. On Splice and Flare-Solar, our results are better than those in R¨atsch et al. (2001). In practice, the choice of weak learner is significant. In this letter, however, we focus only on investigating the properties of loss functions. Our experiments show that:
r
r
From the summary of the test error in Table 4, robust eta-boost and Adaboost Reg perform better than the other methods when there are no mislabels. In particular, Adaboost Reg can depress the influence of outliers even if their ratio is high. When the ratio of mislabel is high (20%), Madaboost seems to provide a good classifier. However, as a whole, it is not a dominating method.
Robust Loss Functions for Boosting
2215
Table 4: Frequency That a Particular Method Wins over 13 Data Sets, Where Averaged Values of the Test Error or Log-Loss Are Used for Comparison. Mislabel Outlier
0%
20%
0%
3%
5%
0%
3%
5%
Summary: Test error Adaboost Logitboost Madaboost Eta-boost Robust eta-boost Adaboost Reg
0/13 1/13 1/13 1/13 7/13 3/13
2/13 0/13 1/13 0/13 7/13 3/13
0/13 1/13 0/13 1/13 4/13 7/13
1/13 2/13 1/13 0/13 4/13 5/13
0/13 0/13 7/13 0/13 3/13 3/13
2/13 1/13 0/13 1/13 3/13 6/13
Summary: Log-loss Adaboost Logitboost Madaboost Eta-boost Robust eta-boost Adaboost Reg
2/13 2/13 2/13 0/13 6/13 1/13
0/13 1/13 2/13 1/13 8/13 1/13
1/13 1/13 1/13 0/13 9/13 1/13
0/13 2/13 3/13 0/13 8/13 0/13
0/13 1/13 6/13 0/13 6/13 0/13
1/13 1/13 7/13 0/13 4/13 0/13
Note: Boosting algorithms in boldface represent wins over more than half over 13 data sets.
r
From the lower part of Table 4, robust eta-boost outperforms the other methods in estimation of conditional probabilities. When the ratio of mislabel is high (20%), Madaboost also provides accurate estimates of the conditional probability as well as robust eta-boost.
In summary, even if mislabels or outliers contaminate training data, robust eta-boost can depress the influence of contamination on estimators in comparison to the other methods. When the target of learning is an accurate classifier, Adaboost Reg and robust eta-boost are better than the other methods. Regarding the estimation of conditional probabilities, Adaboost Reg does not perform well. Robust eta-boost and Madaboost provide a stable and accurate estimator of conditional probabilities. For Adaboost Reg , rigorous correspondence between predictors and conditional probability is unknown. This may be a main reason that the probability estimator given by Adaboost Reg is not accurate in comparison to the other boosting algorithms. 7 Conclusion We formulated a way of constructing robust boosting algorithms. We proposed a transformation formula that derives the most B-robust loss function. Applying the transformation formula to loss functions whose associated
2216
T. Kanamori, T. Takenouchi, S. Eguchi, and N. Murata
models are contamination models, we obtain the robust eta-boost algorithm, which is robust against both mislabels and outliers, especially for the estimation of conditional probability. In numerical experiments, we verified that robust eta-boost is robust against outliers and mislabels in comparison to the other boosting algorithms. For the estimation of conditional probability, Adaboost Reg does not perform well, because the relation between predictors and conditional probabilities is unclear. In order to establish the theoretical validity of Adaboost Reg , it will be helpful to investigate the relation between predictors and conditional probabilities in Adaboost Reg algorithm. Alternatively, as an estimator of conditional probability, robust eta-boost and Madaboost perform better than the others, especially in the case of contamination by outliers. These results are compatible with theoretical conclusions developed in this letter. We focused on only the most B-robust loss function, which is expected to be the least affected by outliers in the sense of gross error sensitivity. There are, however, many candidates that improve robustness from different points of view. For example, the most B-robust loss function is derived without considering the efficiency, which is measured by the variance of the estimator caused by fluctuations of given samples. Under certain assumptions, efficiency can be compatible with robustness. A loss function that copes with efficiency and robustness is called the optimal B-robust loss function. The optimal B-robust loss function has one parameter to adjust the balance between efficiency and robustness. Roughly, the optimal B-robust loss function can be composed as an intermediate of the most B-robust loss function and the efficient loss function. For logistic models, the efficient loss function is given as L 2 in equation 3.7, which is used for Logitboost. The optimal B-robust loss function for logistic models is shown in Figure 9. Regularization terms are other important factors for good generalization performance, because the dimension of statistical models for boosting algorithms is high in general. We think that the next step of our research is theoretical analysis to derive appropriate regularization terms for more accurate prediction under contaminated data. Appendix A: Proof of Lemma 1 Because L is convex on R, inequality L(z1 ) + L (z1 )(z2 − z1 ) ≤ L(z2 ) ⇐⇒ L (z1 ) ≤
L(z2 ) − L(z1 ) z2 − z1
is valid for any z1 < z2 . Likewise, we have L(z2 ) + L (z2 )(z1 − z2 ) ≤ L(z1 ) ⇐⇒
L(z2 ) − L(z1 ) ≤ L (z2 ) z2 − z1
2217
2.0
2.0
Robust Loss Functions for Boosting
Logitboost Optimal B−robust Most B−robust
1.0
L’(z)
0.5
1.0 0.0
0.0
0.5
L(z)
1.5
1.5
Logitboost Optimal B−robust Most B−robust
−2
−1
0
1
2
−2
−1
z
0
1
2
z
Figure 9: Loss functions for logistic models, where function ρ is defined as ρ(z) = z. (Left) Loss functions. Logitboost loss, optimal B-robust loss, and most B-robust loss (Madaboost) are plotted. Constant values are added such that limz→−∞ L(z) = 0. (Right) Derivatives of functions are plotted.
for z1 < z2 . Hence for z1 < z2 , L (z1 ) ≤ L (z2 ) holds. Moreover, since L is strictly convex for negative real values, inequality L(z1 ) + L (z1 )(z2 − z1 ) < L(z2 ) ⇐⇒ L (z1 ) <
L(z2 ) − L(z1 ) z2 − z1
is valid for z1 < z2 < 0, and then, for z1 < z2 < 0, we have L (z1 ) < L (z2 ). When z1 < 0 and z1 < z2 , there exists a negative value z3 < 0 such that z1 < z3 < z2 . Then the following inequality holds: L(z1 ) + L (z1 )(z2 − z1 ) = L(z1 ) + L (z1 )(z3 − z1 ) + L (z1 )(z2 − z3 ) < L(z3 ) + L (z1 )(z2 − z3 ) ≤ L(z3 ) + L (z3 )(z2 − z3 ) ≤ L(z2 ).
(A.1)
2218
T. Kanamori, T. Takenouchi, S. Eguchi, and N. Murata
From equation A.1, monotonicity z1 < 0, z1 < z2 =⇒ L (z1 ) < L (z2 )
(A.2)
is derived. Next, we show that inequality 0 < L (z) also holds for any z ∈ R. For any z ∈ R, there exists z1 < 0 such that z1 < z, and then L (z1 ) < L (z) holds because of inequality A.2. Since loss function L is strictly increasing, we have 0 ≤ L (z1 ), and then we find that 0 < L (z) holds for any z ∈ R. Thus, ρ L is well defined on R. If 0 ≤ z1 < z2 holds, we have inequalities 0 < L (z1 ) ≤ L (z2 ), and 0 < L (−z2 ) < L (−z1 ). Thus, ρ L (z1 ) < ρ L (z2 ) holds. Similarly, we can prove that ρ L (z) is strictly increasing in other cases such as z1 < 0 ≤ z2 or z1 < z2 < 0. Appendix B: Proof of Theorem 2 Inequality | − yHβ (x) − yαh(x)| ≤ β1 + |α0 | + δ holds for any α ∈ (α0 − δ, α0 + δ), and then we have ∂ L(−yHβ (x) − yαh(x)) = | − L (−yHβ (x) − yαh(x))yh(x)| ∂α ≤ L (β1 + |α0 | + δ)
(B.1)
because L is a positive and nondecreasing function as shown in the proof of lemma 1. Hence, by Lebesgue’s bounded convergence theorem (Halmos, 1974), the exchange of integration and differentiation, ∂ ∂α
X y=±1
=−
µ(d x)Pρ1L (y|x, α0 )L(−yHβ (x) − yαh(x))
X y=±1
µ(d x)Pρ1L (y|x, α0 )L (−yHβ (x) − yαh(x))yh(x),
(B.2)
Robust Loss Functions for Boosting
2219
is valid for α ∈ (α0 − δ, α0 + δ) under condition A1. This means that RL ( Pε1 , Hβ + αh) is differentiable in α, where Pε1 is given by equation 4.1 with ρ = ρ L . Let us define η(α, ε) by η(α, ε) =
1 ∂ RL Pε , Hβ + αh . ∂α
From equation B.2, we find that η(α, ε) is of the form η(α, ε) = −(1 − ε)
X y=±1
µ(d x)Pρ1L (y|x, α0 )L (−yHβ (x) − yαh(x))yh(x)
− εL (− y˜ Hβ (x˜ ) − y˜ αh(x˜ )) y˜ h(x˜ ),
(B.3)
where η(α, ε) is defined on (α, ε) ∈ (α0 − δ, α0 + δ) × [0, 1]. The domain of ε can be analytically prolonged to (−c, c) for some c ∈ (0, 1). Thus, we can assume that η(α, ε) is defined on (α, ε) ∈ (α0 − δ, α0 + δ) × (−c, c). Since RL ( Pε1 , Hβ + αh) is convex in α, a solution of equation η(α, ε) = 0 in α is a minimizer of RL ( Pε1 , Hβ + αh). That is, 1 αε (x˜ , y˜ ) = arg min RL Pε , Hβ + αh ⇐⇒ η(αε (x˜ , y˜ ), ε) = 0 α∈R
holds, at least for 0 ≤ ε < c. Even for ε < 0, we can define αε (x˜ , y˜ ) by a zero of η(α, ε) in α. We show that η(α, ε) is once continuously differentiable in a vicinity of (α0 , 0), and we apply the implicit function theorem to obtain αε (x˜ , y˜ ). Note that when αε (x˜ , y˜ ) is differentiable at ε = 0, the differential is equal to influence function φ((x˜ , y˜ ), L , α0 ) because ∂ αε (x˜ , y˜ ) − α0 αε (x˜ , y˜ ) − α0 = lim αε (x˜ , y˜ ) = lim = φ((x˜ , y˜ ), L , α0 ) ε→0 ε→+0 ∂ε ε ε ε=0 holds. First, the integration part of equation B.3 is continuous in α because we can apply Lebesgue’s bounded convergence theorem to the first term of η(α, ε) with inequality B.1 and the continuity of L . The second term of equation B.3 is also continuous in α due to condition A1. Thus, the ∂ continuity of η(α, ε) in a vicinity of (α0 , 0) is valid, and so is ∂ε η(α, ε). Next, we consider the differentiability of η(α, ε) by α. From A2 and equation 4.4, for almost every (x, y) ∈ X × {1, −1}, L (−yHβ (x) − yαh(x)) is
2220
T. Kanamori, T. Takenouchi, S. Eguchi, and N. Murata
differentiable by α around α0 . Moreover, for α ∈ (α0 − δ, α0 + δ), inequality |L (−yHβ (x) − yαh(x))| ≤
sup z=0,|z|≤β1 +|α0 |+δ
|L (z)| < ∞
holds almost everywhere. Thus, the exchange of integration and differential, ∂ µ(d x)Pρ1L (y|x, α0 )L (−yHβ (x) − yαh(x))yh(x) ∂α X y=±1 =− µ(d x)Pρ1L (y|x, α0 )L (−yHβ (x) − yαh(x)), (B.4) X y=±1
is valid for α ∈ (α0 − δ, α0 + δ). We can prove the continuity of equation B.4 around α = α0 in a similar fashion. The second term of η(α, ε) is also differentiable by α because of equation 4.5 and A2. Therefore, η(α, ε) is differentiable by α, and its derivative is equal to ∂ η(α, ε) = (1 − ε) µ(d x)Pρ1L (y|x, α0 )L (−yHβ (x) − yαh(x)) ∂α X y=±1
+ εL (− y˜ Hβ (x˜ ) − y˜ αh(x˜ )). Note that the second term, L (− y˜ Hβ (x˜ ) − y˜ αh(x˜ )), is continuous around α = α0 under equation 4.5 and A2. Therefore, η(α, ε) is once continuously differentiable. Let X+ and X− be X+ = {x ∈ X |Hβ (x) + α0 h(x) > 0} and X− = {x ∈ X |Hβ (x) + α0 h(x) < 0}, respectively. From condition 4.4, µ(X+ ) > 0 or µ(X− ) > 0 holds. Thus, the derivative of η(α, ε) by α at (α, ε) = (α0 , 0) satisfies the following inequality, ∂ η(α, ε) = µ(d x) Pρ1L (y|x, α0 )L (−yHβ (x) − yα0 h(x)) ∂α X α=α0 ,ε=0 y=±1 ≥ µ(d x)Pρ1L (1|x, α0 )L (−Hβ (x) − α0 h(x)) X+
+ > 0,
X−
µ(d x)Pρ1L (−1|x, α0 )L (Hβ (x) + α0 h(x)) (B.5)
Robust Loss Functions for Boosting
2221
since L is strictly convex for z < 0, and L (z) > 0 holds for z < 0. Given the continuous differentiability of η(α, ε) around (α, ε) = (α0 , 0), and the ∂η positivity of ∂α (α0 , 0), we can apply the implicit function theorem to η(α, ε). Consequently, αε (x˜ , y˜ ) is once continuously differentiable at ε = 0, and the derivative is given by −1 ∂ ∂ ∂ αε (x˜ , y˜ ) η(α, ε) η(α, ε) =− , φ((x˜ , y˜ ), L , α0 ) = ∂ε ∂α ∂ε ε=0 α=α0 ,ε=0 (B.6) from the implicit function theorem. The right-hand side of equation B.6 is equal to equation 4.6 because equalities ∂ η(α0 , 0) = ∂α
X
µ(d x)
Pρ1 (y|x, α0 )L (−yHβ (x) − α0 yh(x))
y=±1
and ∂ ∂ η(α0 , 0) = − ε · L (− y˜ Hβ (x˜ ) − y˜ α0 h(x˜ )) y˜ h(x˜ ) ∂ε ∂ε ε=0 = −L (− y˜ Hβ (x˜ ) − y˜ α0 h(x˜ )) y˜ h(x˜ ) hold. For the last equation, the fact that η(α0 , 0) = 0 is used. Appendix C: Proof of Theorem 3 First, we show the existence of loss functions, such as ρ L = ρ, under conditions A1 and A2. Lemma 2 is provided for technical details in the proof of theorem 3. Lemma 2. Let ρ : R → R be an odd function, which is once continuously differentiable, and ρ (z) > 0 holds for any z ∈ R. Then loss function L(z) =
z z 0
z≥0 e
2ρ(w)
dw
satisfies A1, A2, and ρ L = ρ.
z<0
2222
T. Kanamori, T. Takenouchi, S. Eguchi, and N. Murata
Proof. The derivative of L is equal to
L (z) =
z≥0
1 e
2ρ(z)
z<0
.
We see that ρ L = ρ holds and that L (z) is continuous. Loss function, L, is a strictly increasing function because L (z) > 0. We can also verify that function L is convex because L is a nondecreasing function. For z < 0, we have L (z) = 2ρ (z)e 2ρ(z) > 0, and this inequality denotes that L is strictly convex for z < 0. Thus, condition A1 is verified. Twice continuous differentiability except at zero is also satisfied. The twice differential of L except at z = 0 is given as
L (z) =
0
z>0
2ρ (z)e 2ρ(z)
z<0
,
and then, for any A > 0, inequality sup |L (z)| = sup 2ρ (z)e 2ρ(z) ≤ sup 2ρ (z)e 2ρ(z) < ∞
z=0,|z|≤A
−A≤z<0
−A≤z≤0
holds because of the positivity and the continuity of 2ρ (z)e 2ρ(z) on R. Remark 1. If ρ is twice continuously differentiable, loss function L in lemma 2 satisfies A1, A2, and A3. The proof is almost the same as that in lemma 2. This result is applied to the proof of theorem 4. Proof of Theorem 3. Suppose that loss function L satisfies A1, A2, and ρ L = ρ. The existence of such a loss function is ensured by lemma 2. The influence function is equal to φ((x˜ , y˜ ), L , α0 ) =
L (− y˜ Hβ (x˜ ) − y˜ α0 h(x˜ )) y˜ h(x˜ ), C(L , ρ)
that is, given in theorem 2, and then the gross error sensitivity of L is given as γ (L , α0 ) =
ess supx˜ ∈X , y˜ =±1 L (− y˜ Hβ (x˜ ) − y˜ α0 h(x˜ )) C(L , ρ)
.
We assumed Hβ (x˜ ) + α0 h(x˜ ) = 0 to calculate the influence function in theorem 2, but this constraint does not affect the essential supremum because
Robust Loss Functions for Boosting
2223
of condition 4.4. Note that inequality 0 < ess sup L (− y˜ Hβ (x˜ ) − y˜ α0 h(x˜ )) < ∞ x˜ ∈X , y˜ =±1
holds because supx˜ ∈X , y˜ =±1 | − y˜ Hβ (x˜ ) − y˜ α0 h(x˜ )| is bounded and L is positive and continuous. The multiplication of constant to loss functions does not affect the value of the gross error sensitivity. For this reason, we can assume that
ess sup L (− y˜ Hβ (x˜ ) − y˜ α0 h(x˜ )) = 1
(C.1)
x˜ ∈X , y˜ =±1
holds for arbitrary loss functions. To derive the most B-robust loss functions, we need to solve the minimization problem of the gross error sensitivity, which is equivalent to the maximization of the functional, max C(L , ρ) L
s.t.
ρ L = ρ,
ess sup L (− y˜ Hβ (x˜ ) − y˜ α0 h(x˜ )) = 1,
x˜ ∈X , y˜ =±1
under the condition that L is a loss function satisfying A1 and A2. Since ρ L = ρ holds, we have L (z) = L (−z)e 2ρ(z) , and the derivative of both sides gives us L (z) + L (−z)e 2ρ(z) = 2ρ (z)L (−z)e 2ρ(z) ,
(C.2)
except at z = 0. By substituting equation C.2 into C(L , ρ), we obtain C(L , ρ) = 2
X
µ(d x)Pρ1 (1|x, α0 )ρ (Hβ (x) + α0 h(x))L (−Hβ (x) − α0 h(x)),
where the integrand takes positive value. For any loss function L with ρ L = ρ, the inequalities L (−yHβ (x) − yα0 h(x)) ≤ 1
2224
T. Kanamori, T. Takenouchi, S. Eguchi, and N. Murata
and L (−yHβ (x) − yα0 h(x)) = L (yHβ (x) + yα0 h(x))e 2ρ(−yHβ (x)−yα0 h(x)) ≤ e 2ρ(−yHβ (x)−yα0 h(x)) hold for almost every (x, y) due to equation C.1. Hence, the inequality
L (yHβ (x) + yα0 h(x)) ≤ min 1, e 2ρ(yHβ (x)+yα0 h(x)) is valid with probability one. Let us define L ρ as L ρ (z) =
z
min 1, e 2ρ(w) dw.
(C.3)
0
We see that ρ L ρ = ρ and L ρ (z) ≤ 1 holds for any z ∈ R. Consequently, we obtain the inequality C(L , α0 ) ≤ C(L ρ , α0 ), because L (−Hβ (x) − α0 h(x)) ≤ L ρ (−Hβ (x) − α0 h(x)) holds with probability one. Due to lemma 2, L ρ satisfies A1 and A2, that is, the influence function for loss function L ρ exists. Thus, the above argument is valid. Integration C.3 is equal to equation 4.8. Appendix D: Proof of Theorem 4 The proof consists of four propositions. Proposition 1. The optimal weak hypothesis for the boosting algorithm with equation 4.9 does not depend on ε. Proof. In the boosting algorithm under loss function L, chosen hypothesis h (1) ∈ H is the minimizer of
µ(d x) Pε (y|x, Hβ )L (−yHβ (x))I (y = h(x)) X
=
1 2
y=±1
X
µ(d x)
y=±1
ε
Pε (y|x, Hβ )L (−yHβ (x)) − L (− y˜ Hβ (x˜ )) y˜ h(x˜ ), 2 (D.1)
Robust Loss Functions for Boosting
2225
where we used equality X
µ(d x)
Pρ (y|x, Hβ )L (−yHβ (x))yh(x) = 0
(D.2)
y=±1
which comes from the optimality of Hβ in M[ρ]. Due to equation D.1, h (1) does not depend on the value of ε > 0 but only on outlier (x˜ , y˜ ). Thus, for given (x˜ , y˜ ), hypothesis h (1) is fixed. Proposition 2. Suppose that Hβ (x˜ ) = 0. Let η(α, ε) be η(α, ε) = −(1 − ε)
X
µ(d x)
Pρ (y|x, Hβ )L (−yHβ (x) − yαh (1) (x))yh(x)
y=±1
− εL (− y˜ Hβ (x˜ ) − y˜ αh (1) (x˜ )) y˜ h(x˜ ), as in the proof of theorem 2. Then η(α, ε) is twice continuously differentiable. Proof. Under condition A1, η(α, ε) is identical to the derivative of RL ( Pε , Hβ + αh (1) ) with respect to α for 0 ≤ ε ≤ 1. Applying the same argument in the proof of theorem 2, we can derive the change-of-risk function. We will show that there exist δx˜ > 0 and ε0 > 0 such that η(α, ε) is twice continuously differentiable on (−δx˜ , δx˜ ) × (−ε0 , ε0 ). The differentiability of η(α, ε) with respect to ε is clear, and any small positive constant ε0 is valid. Next, we study the differentiability with respect to α. If we set δx˜ = min{|Hβ (x˜ )|, δ0 }, where δ0 is given in equation 4.13, then Hβ (x˜ ) + αh (1) (x˜ ) = 0 for α ∈ (−δx˜ , δx˜ ), and it is clear that the second term of η(α, ε) is twice continuously differentiable under A3. Note that for any α ∈ (−δx˜ , δx˜ ), Hβ (x) + αh (1) (x) is not equal to zero almost everywhere because we have µ({x|Hβ (x) + αh (1) (x) = 0}) = µ({x|Hβ (x) + α = 0, h (1) (x) = 1}) +µ({x|Hβ (x) − α = 0, h (1) (x) = −1}) ≤ µ({x|Hβ (x) + α = 0}) + µ({x|Hβ (x) − α = 0}) =0 from equation 4.13. Then integrations X
µ(d x)
y=±1
Pρ (y|x, Hβ )L (−yHβ (x) − yαh (1) (x))
(D.3)
2226
T. Kanamori, T. Takenouchi, S. Eguchi, and N. Murata
and X
µ(d x)
Pρ (y|x, Hβ )L (−yHβ (x) − yαh (1) (x))yh (1) (x)
(D.4)
y=±1
are well defined under the condition that L satisfies A2 and A3. Under A2, boundedness |L (−yHβ (x) − yαh (1) (x))| ≤ ≤
sup x∈X ,y=±1
|L (−yHβ (x) − yαh (1) (x))|
sup z=0,|z|≤β1 +δx˜
|L (z)|
<∞ holds almost everywhere for α ∈ (−δx˜ , δx˜ ). Thus, by Lebesgue’s bounded convergence theorem, the exchange of integration and differentiation −
∂ ∂α
X y=±1
=
µ(d x)Pρ (y|x, Hβ )L (−yHβ (x) − yαh (1) (x))yh (1) (x)
X y=±1
µ(d x)Pρ (y|x, Hβ )L (−yHβ (x) − yαh (1) (x))
is valid and the continuity of equation D.3 in α also holds in a similar fashion. Likewise, from A3, boundedness |L (−yHβ (x) − yαh (1) (x))yh (1) (x)| ≤ ≤
sup x∈X ,y=±1
|L (−yHβ (x) − yαh (1) (x))|
sup z=0,|z|≤β1 +δx˜
|L (z)|
<∞ holds almost everywhere for α ∈ (−δx˜ , δx˜ ). Then the exchange of integration and differentiation, −
∂ ∂α
X y=±1
=
µ(d x)Pρ (y|x, Hβ )L (−yHβ (x) − yαh (1) (x))
X y=±1
µ(d x)Pρ (y|x, Hβ )L (−yHβ (x) − yαh (1) (x))yh (1) (x),
Robust Loss Functions for Boosting
2227
is valid, and we see that the continuity of equation D.4 in α also holds in a similar fashion. Consequently, we find that η(α, ε) is twice continuously differentiable on (−δx˜ , δx˜ ) × (−ε0 , ε0 ). Proposition 3. The change-of-risk function, ψ L 0 ((x˜ , y˜ ), L , H), is given as ψ L 0 ((x˜ , y˜ ), L , Hβ ) = ess sup L (|Hβ (x˜ )|)2 · x˜ ∈X
1 2C(L , ρ)2
X
µ(d x) Pρ (y|x, Hβ )L 0 (−yHβ (x)), y=±1
where the definition of C(L , ρ) is given by equation 4.7. ∂η (0, 0) can be proved in a similar way to Proof. The positivity of ∂α equation B.5. Thus, applying the implicit function theorem, we find that there exists a parameter αε satisfying equation
η(αε , ε) = 0 in a vicinity of ε = 0. Note that the function αε is also twice continuously differentiable with respect to ε on (−2ε1 , 2ε1 ), where ε1 satisfies 0 < 2ε1 < ε0 . 2 ε The inequality |αε | < δx˜ holds for ε ∈ (−2ε1 , 2ε1 ). Since ∂α and ∂∂εα2ε are ∂ε ε continuous, there are positive constants, C1 and C2 , such that | ∂α | ≤ C1 ∂ε 2 ∂ αε
and | ∂ε2 | ≤ C2 hold for any ε ∈ (−ε1 , ε1 ). Predictor HL ,ε derived from loss function L under the contaminated probability Pε is given as
L ,ε = Hβ + αε h (1) . H Note that αε depends on ε and (x˜ , y˜ ) while h (1) depends only on (x˜ , y˜ ). By using Lebesgue’s bounded convergence theorem, we find that the
L ,ε ) with respect to ε on twice-continuous differentiability of RL 0 (P, H (−ε1 , ε1 ) also holds under A1 and A2 for L 0 , because the boundedness of the differentials is satisfied almost everywhere. Indeed, the following inequalities hold with probability one: ∂ L 0 (−yHβ (x) − yαε h (1) (x)) = L (−yHβ (x) − yαε h (1) (x)) ∂αε ∂ε 0 ∂ε ≤ C1 sup L (z) |z|≤β1 +δx˜
0
≤ C1 L 0 (β1 + δx˜ ), < ∞,
(D.5)
2228
T. Kanamori, T. Takenouchi, S. Eguchi, and N. Murata
and 2 ∂ ∂αε 2 (1) (1) ∂ε 2 L 0 (−yHβ (x) − yαε h (x)) = L 0 (−yHβ (x) − yαε h (x)) ∂ε −
L 0 (−yHβ (x)
≤ C12
∂ 2 αε − yαε h (x))yh (x) 2 ∂ε
sup z=0,|z|≤β1 +δx˜
+ C2
(1)
(1)
|L 0 (z)|
sup z=0,|z|≤β1 +δx˜
|L 0 (z)|
< ∞.
(D.6)
L ,ε ) exists up to the Consequently, the Taylor expansion of risk RL 0 (P, H order of two,
L ,ε ) RL 0 (P, H
= RL 0 (P, Hβ ) − ε
+
ε2 2
−
ε 2
2
X
µ(d x)
y=±1
X
X
µ(d x)
y=±1
µ(d x)
Pρ (y|x, Hβ )L 0 (−yHβ (x))yh (1) (x)
y=±1
∂αε ∂ε
2 ∂αε Pρ (y|x, Hβ )L 0 − yHβ (x) − yαθ ε h (1) (θ ε) ∂ε ∂ 2 αε Pρ (y|x, Hβ )L 0 − yHβ (x) − yαθ ε h (1) yh (1) 2 (θ ε), ∂ε
where θ ∈ [0, 1]. The exchange of limitation ε → 0 and the integration is valid because of equations D.5 and D.6. Since equation D.2 holds even for L = L 0 and equality ∂αε = φ((x˜ , y˜ ), L , 0) ∂ε ε=0 holds, we have 1 ψ L 0 ((x˜ , y˜ ), L , Hβ ) = φ((x˜ , y˜ ), L , 0)2 2
X
µ(d x)
Pρ (y|x, Hβ )L 0 (−yHβ (x)),
y=±1
where φ((x˜ , y˜ ), L , 0) is the influence function at α = 0 for model M1 [ρ, h (1) , Hβ ]. Note that influence function φ((x˜ , y˜ ), L , 0) does not depend on the choice of hypothesis h (1) .
Robust Loss Functions for Boosting
2229
Therefore, change-of-risk sensitivity is equal to 1 µ(d x) Pρ (y|x, Hβ )L 0 (−yHβ (x)) γ L 0 (L , Hβ ) = ess sup φ((x˜ , y˜ ), L , 0)2 2 X x˜ ∈X , y˜ =±1 y=±1
=
ess supx˜ ∈X L (|Hβ (x˜ )|)2 2C(L , ρ)2
X
µ(d x)
Pρ (y|x, Hβ )L 0 (−yHβ (x)).
y=±1
We have assumed that Hβ (x˜ ) = 0 to calculate change-of-risk function, but this constraint does not affect the essential supremum of L (|Hβ (x˜ )|) because µ({x|Hβ (x) = 0}) = 0 holds from equation 4.13. Proposition 4. The loss function L ρ minimizes the change-of-risk sensitivity γ L 0 (L , Hβ ). Proof. We solve the minimization problem analogous to the proof of theorem 3, max C(L , ρ) L
s. t.
ρ L = ρ, ess sup L (|Hβ (x˜ )|) = 1, x˜ ∈X
under the condition that L is a loss function with conditions A1, A2, and A3. Due to the constraint of the minimization problem, loss function L satisfies the following inequality with probability one, L (H(x)) ≤ 1, L (H(x)) = L (−H(x))e 2ρ(H(x)) ≤ e 2ρ(H(x)) , and then,
L (H(x)) ≤ min 1, e 2ρ(H(x)) = L ρ (H(x))
(D.7)
holds almost everywhere with respect to measure µ on X . Since ρ is twice continuously differentiable, we find that L ρ is three times continuously differentiable except at zero. We can verify that loss function L ρ satisfies conditions A1, A2, and A3 as pointed out in remark 1 in appendix C. Under the imposed constraints, inequality C(L , ρ) ≤ C(L ρ , ρ) follows from equation D.7 in a similar way of the proof in theorem 3. Consequently, we find that L ρ minimizes the change-of-risk sensitivity, γ L 0 (L , Hβ ).
2230
T. Kanamori, T. Takenouchi, S. Eguchi, and N. Murata
Appendix E: Results of Experiments Table 5: Error and Log-Loss×10 on the Test Set of Banana Among Six Boosting Algorithms: Adaboost (AB), Logitboost (LB), Madaboost (MB), Eta-Boost (Eta), Robust Eta-Boost (r-eta), and Adaboost Reg (AB R ). Banana: Test Error Mislabel
0%
Outlier
AB LB MB Eta r-eta AB R
0%
3%
Error
p-val
Error
p-val
Error
p-val
28.4 ± 1.9 27.7 ± 1.7 26.0 ± 2.2 28.4 ± 2.1 • 25.5 ± 2.1 27.7 ± 1.8
− 0.00 0.00 0.54 0.00 0.00
29.1 ± 2.2 28.5 ± 2.3 27.0 ± 2.5 28.8 ± 2.3 • 26.6 ± 2.5 28.4 ± 2.0
− 0.02 0.00 0.13 0.00 0.01
28.9 ± 2.2 28.7 ± 2.0 27.4 ± 2.6 29.2 ± 2.2 • 26.5 ± 2.1 28.4 ± 2.0
− 0.15 0.00 0.87 0.00 0.04
Mislabel
20%
Outlier
AB LB MB Eta r-eta AB R
5%
0%
3%
5%
Error
p-val
Error
p-val
Error
p-val
38.8 ± 2.8 38.7 ± 2.6 38.2 ± 2.9 38.6 ± 2.4 • 37.5 ± 1.8 38.9 ± 2.6
− 0.26 0.04 0.20 0.00 0.64
38.5 ± 2.1 38.6 ± 2.6 • 37.8 ± 2.4 38.2 ± 2.0 37.8 ± 1.9 38.6 ± 2.3
− 0.67 0.01 0.13 0.00 0.66
38.7 ± 2.7 38.6 ± 2.5 38.0 ± 2.5 38.6 ± 2.6 • 37.5 ± 2.4 38.6 ± 2.7
− 0.39 0.01 0.42 0.00 0.38
Log-Loss
p-val
6.03 ± 0.12 6.00 ± 0.13 6.01 ± 0.14 6.14 ± 0.41 6.02 ± 0.17 6.08 ± 0.22
− 0.03 0.14 0.99 0.18 0.98
Banana: Log-Loss×10 Mislabel
0%
Outlier
AB LB MB Eta r-eta AB R
0%
•
3% p-val
Log-Loss
p-val
6.01 ± 0.18 5.99 ± 0.15 6.03 ± 0.17 6.04 ± 0.25 5.99 ± 0.17 6.03 ± 0.17
− 0.12 0.82 0.82 0.17 0.76
6.03 ± 0.15 6.01 ± 0.14 6.02 ± 0.13 6.09 ± 0.34 • 5.99 ± 0.12 6.04 ± 0.17
− 0.05 0.30 0.95 0.00 0.59
Mislabel Outlier
AB LB MB Eta r-eta AB R
5%
Log-Loss
•
20% 0%
3%
5%
Log-Loss
p-val
Log-Loss
p-val
Log-Loss
p-val
6.76 ± 0.11 6.76 ± 0.11 • 6.74 ± 0.10 6.82 ± 0.33 6.74 ± 0.10 6.79 ± 0.15
− 0.41 0.01 0.95 0.01 0.93
6.77 ± 0.09 6.77 ± 0.11 • 6.75 ± 0.10 6.80 ± 0.26 6.76 ± 0.11 6.82 ± 0.14
− 0.51 0.04 0.91 0.26 0.99
6.78 ± 0.11 6.77 ± 0.10 • 6.75 ± 0.09 6.91 ± 0.43 6.75 ± 0.11 6.82 ± 0.16
− 0.28 0.01 0.99 0.03 0.99
Notes: The p-values are calculated by one-side paired t-test against Adaboost. The dot indicates the best method under each noise condition.
Robust Loss Functions for Boosting
2231
Table 6: Error and Log-Loss×10 on the Test Set of Breast-Cancer Among Six Boosting Algorithms: Adaboost (AB), Logitboost (LB), Madaboost (MB), EtaBoost (Eta), Robust Eta-Boost (r-eta), and Adaboost Reg (AB R ). Breast-Cancer: Test Error Mislabel
0%
Outlier
AB LB MB Eta r-eta AB R
0%
3%
Error
p-val
Error
p-val
Error
p-val
28.1 ± 4.5 27.6 ± 4.4 27.4 ± 4.3 27.6 ± 4.2 • 27.0 ± 4.8 27.6 ± 4.7
− 0.14 0.06 0.12 0.01 0.17
27.9 ± 4.5 27.3 ± 4.4 27.3 ± 4.1 28.4 ± 4.3 • 27.3 ± 4.4 27.7 ± 4.3
− 0.10 0.07 0.85 0.11 0.35
27.5 ± 4.4 27.6 ± 4.1 27.7 ± 4.6 28.0 ± 4.7 27.3 ± 4.9 • 27.2 ± 4.9
− 0.58 0.74 0.89 0.28 0.24
Mislabel
20%
Outlier
AB LB MB Eta r-eta AB R
5%
0%
3%
5%
Error
p-val
Error
p-val
Error
p-val
37.6 ± 5.0 37.4 ± 5.6 36.8 ± 5.8 37.3 ± 5.9 • 36.5 ± 5.9 37.7 ± 5.6
− 0.30 0.03 0.23 0.01 0.64
38.2 ± 5.2 38.2 ± 5.6 • 37.7 ± 5.3 38.6 ± 5.2 38.6 ± 5.3 38.6 ± 5.4
− 0.47 0.13 0.82 0.87 0.79
38.3 ± 5.8 38.5 ± 5.6 37.8 ± 6.2 • 37.7 ± 5.3 37.9 ± 5.2 37.8 ± 5.9
− 0.75 0.15 0.12 0.17 0.16
Breast-Cancer: Log-Loss×10 Mislabel Outlier
AB LB MB Eta r-eta AB R
0% 0%
3%
Log-Loss
p-val
Log-Loss
p-val
Log-Loss
p-val
5.87 ± 0.69 5.81 ± 0.70 5.81 ± 0.67 6.18 ± 2.15 • 5.74 ± 0.65 5.98 ± 0.66
− 0.12 0.15 0.94 0.00 0.98
5.81 ± 0.63 5.82 ± 0.64 5.77 ± 0.60 5.85 ± 0.89 • 5.76 ± 0.68 5.82 ± 0.69
− 0.57 0.22 0.71 0.17 0.60
5.77 ± 0.66 5.78 ± 0.61 5.86 ± 0.65 5.86 ± 1.04 • 5.74 ± 0.64 5.92 ± 0.88
− 0.59 0.96 0.82 0.22 0.99
Mislabel Outlier
AB LB MB Eta r-eta AB R
5%
20% 0%
3%
5%
Log-Loss
p-val
Log-Loss
p-val
Log-Loss
p-val
6.71 ± 0.46 6.70 ± 0.49 6.66 ± 0.46 6.68 ± 0.50 • 6.63 ± 0.41 6.76 ± 0.51
− 0.35 0.07 0.24 0.01 0.93
6.78 ± 0.53 6.81 ± 0.47 6.78 ± 0.46 6.79 ± 0.57 • 6.75 ± 0.53 6.89 ± 0.63
− 0.76 0.48 0.59 0.20 0.99
6.74 ± 0.50 6.81 ± 0.53 • 6.72 ± 0.48 6.84 ± 1.01 6.79 ± 0.56 6.86 ± 0.60
− 0.98 0.30 0.86 0.91 0.99
Notes: The p-values are calculated by one-side paired t-test against Adaboost. The dot indicates the best method under each noise condition.
2232
T. Kanamori, T. Takenouchi, S. Eguchi, and N. Murata
Table 7: Error and Log-Loss×10 on the Test Set of Diabetes Among Six Boosting Algorithms: Adaboost (AB), Logitboost (LB), Madaboost (MB), Eta-Boost (Eta), Robust Eta-Boost (r-eta), and Adaboost Reg (AB R ). Diabetes: Test Error Mislabel
0%
Outlier
AB LB MB Eta r-eta AB R
0%
3%
Error
p-val
Error
p-val
Error
p-val
25.2 ± 1.8 25.3 ± 2.1 25.5 ± 1.8 25.7 ± 2.1 25.7 ± 1.9 • 25.0 ± 1.7
− 0.70 0.95 0.97 0.99 0.22
25.5 ± 1.8 25.4 ± 1.8 25.8 ± 1.8 25.6 ± 1.8 25.8 ± 1.9 • 25.1 ± 1.8
− 0.39 0.95 0.80 0.94 0.02
25.6 ± 1.9 25.7 ± 2.1 25.9 ± 1.9 25.6 ± 2.1 25.6 ± 1.9 • 25.1 ± 1.9
− 0.57 0.86 0.40 0.37 0.02
Mislabel
20%
Outlier
AB LB MB Eta r-eta AB R
5%
0%
3%
5%
Error
p-val
Error
p-val
Error
p-val
36.6 ± 2.8 37.0 ± 2.8 36.6 ± 2.6 36.8 ± 2.8 36.8 ± 2.9 • 36.4 ± 3.0
− 0.93 0.58 0.75 0.76 0.31
36.9 ± 2.9 37.2 ± 2.7 36.9 ± 3.0 36.9 ± 3.3 • 36.9 ± 2.8 37.2 ± 3.1
− 0.86 0.42 0.43 0.40 0.77
37.1 ± 2.9 37.0 ± 2.9 37.4 ± 3.0 37.1 ± 2.8 • 37.0 ± 2.4 37.1 ± 2.5
− 0.26 0.75 0.40 0.27 0.39
Diabetes: Log-Loss×10 Mislabel Outlier
AB LB MB Eta r-eta AB R
0% 0%
3%
Log-Loss
p-val
Log-Loss
p-val
Log-Loss
p-val
5.29 ± 0.30 5.26 ± 0.33 5.26 ± 0.29 5.40 ± 0.50 • 5.25 ± 0.29 5.38 ± 0.33
− 0.12 0.13 0.99 0.04 0.99
5.34 ± 0.31 5.26 ± 0.33 5.28 ± 0.29 5.40 ± 0.35 • 5.25 ± 0.28 5.39 ± 0.34
− 0.01 0.03 0.98 0.00 0.93
5.30 ± 0.31 5.29 ± 0.32 5.28 ± 0.27 5.40 ± 0.75 • 5.27 ± 0.29 5.39 ± 0.38
− 0.30 0.20 0.90 0.10 0.99
Mislabel Outlier
AB LB MB Eta r-eta AB R
5%
20% 0%
3%
5%
Log-Loss
p-val
Log-Loss
p-val
Log-Loss
p-val
6.61 ± 0.30 6.63 ± 0.28 6.58 ± 0.27 6.68 ± 0.62 • 6.57 ± 0.33 6.61 ± 0.30
− 0.75 0.14 0.90 0.06 0.52
6.64 ± 0.26 6.65 ± 0.29 • 6.60 ± 0.28 6.65 ± 0.38 6.62 ± 0.25 6.66 ± 0.28
− 0.68 0.04 0.65 0.14 0.90
6.62 ± 0.25 6.62 ± 0.27 • 6.60 ± 0.23 6.68 ± 0.38 6.62 ± 0.24 6.67 ± 0.27
− 0.59 0.24 0.96 0.50 0.99
Notes: The p-values are calculated by one-side paired t-test against Adaboost. The dot indicates the best method under each noise condition.
Robust Loss Functions for Boosting
2233
Table 8: Error and Log-Loss×10 on the Test Set of German Among Six Boosting Algorithms: Adaboost (AB), Logitboost (LB), Madaboost (MB), Eta-Boost (Eta), Robust Eta-Boost (r-eta), and Adaboost Reg (AB R ). German: Test Error Mislabel
0%
Outlier
AB LB MB Eta r-eta AB R
0%
3%
Error
p-val
Error
p-val
Error
p-val
25.7 ± 2.6 25.4 ± 2.5 25.5 ± 2.5 26.0 ± 2.9 • 25.3 ± 2.6 25.5 ± 2.5
− 0.13 0.25 0.87 0.06 0.21
25.9 ± 2.5 25.7 ± 2.4 25.6 ± 2.4 25.9 ± 2.5 • 25.3 ± 2.4 25.6 ± 2.7
− 0.26 0.18 0.56 0.01 0.06
26.2 ± 2.7 26.0 ± 2.6 25.6 ± 2.6 26.8 ± 2.8 25.9 ± 2.7 • 25.5 ± 2.7
− 0.20 0.01 0.99 0.09 0.00
Mislabel
20%
Outlier
0% Error
AB LB MB Eta r-eta AB R
5%
•
3% p-val
36.8 ± 3.0 36.6 ± 2.7 36.8 ± 2.9 36.8 ± 3.1 36.8 ± 3.0 36.9 ± 2.9
− 0.28 0.51 0.53 0.47 0.63
Error
5% p-val
37.5 ± 2.8 37.3 ± 2.7 • 37.1 ± 2.7 37.7 ± 2.8 37.1 ± 2.8 37.6 ± 2.6
− 0.16 0.05 0.66 0.04 0.61
•
Error
p-val
37.2 ± 2.8 38.0 ± 2.9 37.5 ± 2.8 37.9 ± 2.9 37.5 ± 2.9 37.6 ± 2.6
− 0.99 0.89 0.99 0.85 0.95
German: Log-Loss×10 Mislabel
0%
Outlier
AB LB MB Eta r-eta AB R
0%
3%
Log-Loss
p-val
Log-Loss
p-val
Log-Loss
p-val
5.29 ± 0.32 5.22 ± 0.32 • 5.20 ± 0.31 5.34 ± 0.39 5.20 ± 0.34 5.31 ± 0.34
− 0.00 0.00 0.98 0.00 0.83
5.24 ± 0.33 5.22 ± 0.31 5.20 ± 0.28 5.31 ± 0.38 • 5.19 ± 0.30 5.29 ± 0.34
− 0.20 0.02 0.99 0.01 0.98
5.28 ± 0.32 5.19 ± 0.30 • 5.16 ± 0.27 5.33 ± 0.35 5.21 ± 0.30 5.32 ± 0.35
− 0.00 0.00 0.99 0.00 0.95
Mislabel
20%
Outlier
AB LB MB Eta r-eta AB R
5%
0%
•
3%
5%
Log-Loss
p-val
Log-Loss
p-val
Log-Loss
p-val
6.51 ± 0.23 6.48 ± 0.22 6.50 ± 0.22 6.50 ± 0.25 6.49 ± 0.23 6.52 ± 0.23
− 0.05 0.27 0.32 0.15 0.60
6.55 ± 0.20 6.55 ± 0.21 6.54 ± 0.20 6.56 ± 0.21 • 6.53 ± 0.21 6.58 ± 0.23
− 0.48 0.24 0.69 0.08 0.96
6.54 ± 0.22 6.54 ± 0.22 6.51 ± 0.21 6.54 ± 0.21 • 6.51 ± 0.21 6.58 ± 0.22
− 0.60 0.03 0.56 0.01 0.98
Notes: The p-values are calculated by one-side paired t-test against Adaboost. The dot indicates the best method under each noise condition.
2234
T. Kanamori, T. Takenouchi, S. Eguchi, and N. Murata
Table 9: Error and Log-Loss×10 on the Test Set of Heart Among Six Boosting Algorithms: Adaboost (AB), Logitboost (LB), Madaboost (MB), Eta-Boost (Eta), Robust Eta-Boost (r-eta), and Adaboost Reg (AB R ). Heart: Test Error Mislabel
0%
Outlier
AB LB MB Eta r-eta AB R
0%
3%
Error
p-val
Error
p-val
Error
p-val
19.6 ± 4.0 19.4 ± 4.2 19.5 ± 4.6 19.4 ± 4.2 • 18.7 ± 4.0 19.1 ± 4.1
− 0.34 0.47 0.3 0.02 0.12
19.0 ± 4.2 19.4 ± 4.0 19.4 ± 3.8 19.2 ± 3.9 18.7 ± 4.0 • 18.5 ± 4.0
− 0.8 0.77 0.63 0.26 0.10
19.9 ± 4.4 19.4 ± 4.1 18.9 ± 4.6 19.7 ± 4.8 19.5 ± 4.1 • 18.9 ± 3.8
− 0.07 0.03 0.33 0.19 0.01
Mislabel
20%
Outlier
0% Error
AB LB MB Eta r-eta AB R
5%
•
3% p-val
35.2 ± 5.2 35.3 ± 5.1 36.0 ± 5.5 35.2 ± 5.6 35.6 ± 5.1 35.6 ± 5.1
− 0.59 0.94 0.50 0.79 0.80
Error
5% p-val
36.0 ± 5.2 36.1 ± 5.0 • 35.2 ± 5.3 35.4 ± 5.7 35.4 ± 5.0 35.6 ± 5.6
− 0.64 0.04 0.11 0.11 0.23
•
Error
p-val
35.6 ± 5.4 35.7 ± 5.5 35.7 ± 5.9 36.2 ± 5.2 35.9 ± 5.2 35.9 ± 4.8
− 0.55 0.54 0.90 0.68 0.72
Heart: Log-Loss×10 Mislabel
0%
Outlier
AB LB MB Eta r-eta AB R
0%
3%
Log-Loss
p-val
Log-Loss
p-val
Log-Loss
p-val
5.02 ± 1.01 4.87 ± 0.98 • 4.80 ± 0.94 5.14 ± 1.27 4.84 ± 1.04 5.09 ± 1.06
− 0.03 0.01 0.87 0.03 0.77
4.79 ± 0.92 4.68 ± 0.87 • 4.63 ± 0.88 5.02 ± 1.37 4.70 ± 0.95 5.06 ± 1.22
− 0.09 0.03 0.97 0.16 0.99
4.89 ± 0.92 4.74 ± 0.88 4.63 ± 0.93 5.04 ± 1.15 • 4.58 ± 0.82 4.97 ± 1.04
− 0.02 0.00 0.91 0.00 0.80
Mislabel
20%
Outlier
0% Log-Loss
AB LB MB Eta r-eta AB R
5%
•
6.74 ± 0.71 6.72 ± 0.63 6.73 ± 0.67 6.86 ± 1.35 6.83 ± 0.80 6.85 ± 0.84
3% p-val − 0.35 0.43 0.83 0.93 0.95
Log-Loss 6.74 ± 0.65 6.78 ± 0.70 • 6.72 ± 0.66 6.83 ± 0.91 6.74 ± 0.73 6.94 ± 0.94
5% p-val − 0.73 0.32 0.91 0.47 0.99
•
Log-Loss
p-val
6.76 ± 0.70 6.78 ± 0.66 6.80 ± 0.64 6.96 ± 1.36 6.79 ± 0.66 6.86 ± 0.70
− 0.56 0.69 0.92 0.63 0.90
Notes: The p-values are calculated by one-side paired t-test against Adaboost. The dot indicates the best method under each noise condition.
Robust Loss Functions for Boosting
2235
Table 10: Error and Log-Loss×10 on the Test Set of Image Among Six Boosting Algorithms: Adaboost (AB), Logitboost (LB), Madaboost (MB), Eta-Boost (Eta), Robust Eta-Boost (r-eta), and Adaboost Reg (AB R ). Image: Test Error Mislabel
0%
Outlier
AB LB MB Eta r-eta AB R
0%
•
3%
Error
p-val
Error
p-val
Error
p-val
4.4 ± 0.6 4.1 ± 0.6 4.2 ± 0.6 6.2 ± 0.9 4.2 ± 0.6 4.4 ± 0.6
− 0.06 0.03 0.99 0.06 0.60
4.4 ± 0.7 4.6 ± 0.6 4.3 ± 0.6 6.2 ± 1.0 • 4.2 ± 0.6 4.4 ± 0.8
− 0.89 0.31 0.99 0.21 0.60
5.4 ± 0.9 4.4 ± 0.6 4.8 ± 0.8 6.4 ± 1.4 • 4.2 ± 0.8 4.7 ± 0.8
− 0.00 0.01 0.99 0.00 0.00
Mislabel
20%
Outlier
AB LB MB Eta r-eta AB R
5%
0%
3%
5%
Error
p-val
Error
p-val
Error
p-val
25.7 ± 2.4 25.1 ± 1.9 24.7 ± 1.9 25.1 ± 1.9 • 24.6 ± 1.5 24.7 ± 2.0
− 0.05 0.01 0.12 0.01 0.02
25.6 ± 2.0 25.6 ± 1.8 • 24.4 ± 1.5 25.7 ± 1.7 24.6 ± 1.7 24.6 ± 1.5
− 0.60 0.00 0.62 0.00 0.01
25.5 ± 1.8 24.9 ± 1.7 24.7 ± 2.0 25.9 ± 2.0 24.7 ± 1.7 • 24.5 ± 1.2
− 0.08 0.03 0.88 0.01 0.01
Image: Log-Loss×10 Mislabel Outlier
AB LB MB Eta r-eta AB R
0% 0%
3%
Log-Loss
p-val
Log-Loss
p-val
Log-Loss
p-val
1.33 ± 0.19 1.43 ± 0.12 1.62 ± 0.10 1.97 ± 0.19 1.60 ± 0.12 • 1.32 ± 0.22
− 0.99 0.99 0.99 0.99 0.34
1.32 ± 0.22 1.40 ± 0.15 1.59 ± 0.13 1.86 ± 0.35 1.57 ± 0.14 • 1.28 ± 0.19
− 0.96 0.99 0.99 0.99 0.16
1.46 ± 0.12 1.55 ± 0.14 1.67 ± 0.11 2.07 ± 0.41 1.72 ± 0.14 • 1.38 ± 0.20
− 0.99 0.99 0.99 0.99 0.03
Mislabel Outlier
AB LB MB Eta r-eta AB R
5%
20% 0%
3%
5%
Log-Loss
p-val
Log-Loss
p-val
Log-Loss
p-val
5.71 ± 0.18 5.68 ± 0.16 5.65 ± 0.16 5.72 ± 0.17 • 5.64 ± 0.17 5.71 ± 0.17
− 0.12 0.02 0.66 0.01 0.56
5.71 ± 0.14 5.68 ± 0.14 • 5.65 ± 0.12 5.77 ± 0.15 5.67 ± 0.14 5.70 ± 0.16
− 0.08 0.01 0.96 0.08 0.34
5.69 ± 0.15 5.66 ± 0.16 • 5.63 ± 0.14 5.77 ± 0.19 5.67 ± 0.16 5.69 ± 0.17
− 0.09 0.01 0.99 0.18 0.41
Notes: The p-values are calculated by one-side paired t-test against Adaboost. The dot indicates the best method under each noise condition.
2236
T. Kanamori, T. Takenouchi, S. Eguchi, and N. Murata
Table 11: Error and Log-Loss×10 on the Test Set of Ringnorm Among Six Boosting Algorithms: Adaboost (AB), Logitboost (LB), Madaboost (MB), Eta-Boost (Eta), Robust Eta-Boost (r-eta), and Adaboost Reg (AB R ). Ringnorm: Test Error Mislabel
0%
Outlier
0%
AB LB MB Eta r-eta AB R
3%
Error
p-val
Error
p-val
Error
p-val
8.7 ± 1.2 8.6 ± 1.3 8.7 ± 1.1 10.8 ± 1.9 • 8.2 ± 1.1 8.5 ± 1.0
− 0.34 0.43 0.99 0.00 0.14
11.7 ± 1.7 11.2 ± 1.6 11.5 ± 1.7 13.4 ± 2.3 • 11.0 ± 1.6 11.4 ± 1.7
− 0.00 0.08 0.99 0.00 0.05
11.8 ± 1.9 11.5 ± 1.8 11.6 ± 2.2 14.2 ± 2.7 • 11.4 ± 1.8 11.7 ± 1.6
− 0.05 0.27 0.99 0.02 0.33
Mislabel
20%
Outlier
AB LB MB Eta r-eta AB R
5%
0%
3%
5%
Error
p-val
Error
p-val
Error
p-val
31.0 ± 2.2 31.0 ± 2.0 31.2 ± 2.2 31.4 ± 2.1 30.8 ± 1.7 • 30.5 ± 2.0
− 0.53 0.76 0.92 0.19 0.01
33.2 ± 2.3 32.7 ± 2.0 32.9 ± 2.2 33.4 ± 2.1 32.8 ± 2.1 • 32.5 ± 2.3
− 0.03 0.18 0.84 0.06 0.01
33.5 ± 2.0 33.4 ± 2.2 33.5 ± 2.2 33.9 ± 2.2 33.2 ± 2.1 • 33.0 ± 2.4
− 0.38 0.53 0.95 0.11 0.02
Ringnorm: Log-Loss×10 Mislabel
0%
Outlier
AB LB MB Eta r-eta AB R
0%
•
3%
Log-Loss
p-val
Log-Loss
p-val
Log-Loss
p-val
1.98 ± 0.21 1.97 ± 0.12 2.10 ± 0.13 2.75 ± 0.80 2.07 ± 0.11 1.99 ± 0.20
− 0.33 0.99 0.99 0.99 0.68
2.91 ± 0.42 2.70 ± 0.39 2.71 ± 0.32 3.54 ± 1.10 • 2.64 ± 0.25 2.84 ± 0.46
− 0.00 0.00 0.99 0.00 0.08
3.04 ± 0.40 2.73 ± 0.35 2.76 ± 0.30 3.67 ± 0.65 • 2.70 ± 0.22 3.03 ± 0.43
− 0.00 0.00 0.99 0.00 0.42
Mislabel Outlier
AB LB MB Eta r-eta AB R
5%
20% 0%
3%
5%
Log-Loss
p-val
Log-Loss
p-val
Log-Loss
p-val
6.30 ± 0.21 6.26 ± 0.22 • 6.22 ± 0.24 6.45 ± 0.58 6.23 ± 0.24 6.35 ± 0.25
− 0.06 0.00 0.99 0.00 0.93
6.52 ± 0.22 6.44 ± 0.22 6.43 ± 0.23 6.49 ± 0.23 • 6.42 ± 0.27 6.58 ± 0.29
− 0.00 0.00 0.09 0.00 0.97
6.56 ± 0.24 6.51 ± 0.24 6.48 ± 0.22 6.60 ± 0.32 • 6.42 ± 0.24 6.59 ± 0.24
− 0.02 0.00 0.86 0.00 0.84
Notes: The p-values are calculated by one-side paired t-test against Adaboost. The dot indicates the best method under each noise condition.
Robust Loss Functions for Boosting
2237
Table 12: Error and Log-Loss×10 on the Test Set of Flare-Solar Among Six Boosting Algorithms: Adaboost (AB), Logitboost (LB), Madaboost (MB), EtaBoost (Eta), Robust Eta-Boost (r-eta), and Adaboost Reg (AB R ). Flare-Solar: Test Error Mislabel
0%
Outlier
AB LB MB Eta r-eta AB R
0%
3%
Error
p-val
Error
p-val
Error
p-val
32.5 ± 1.9 32.6 ± 2.3 32.6 ± 1.9 • 32.4 ± 1.8 32.4 ± 1.7 32.8 ± 2.6
− 0.79 0.92 0.09 0.41 0.93
33.0 ± 2.5 32.7 ± 1.6 • 32.6 ± 1.8 32.8 ± 1.9 32.6 ± 1.7 32.8 ± 1.8
− 0.07 0.03 0.17 0.03 0.16
32.9 ± 1.8 32.9 ± 1.7 32.9 ± 1.8 • 32.7 ± 1.8 32.8 ± 1.9 32.8 ± 1.9
− 0.58 0.38 0.00 0.04 0.03
Mislabel
20%
Outlier
AB LB MB Eta r-eta AB R
5%
0%
3%
5%
Error
p-val
Error
p-val
40.9 ± 3.0 40.8 ± 3.1 • 40.8 ± 2.9 41.2 ± 3.3 40.9 ± 3.2 41.4 ± 3.3
− 0.44 0.36 0.87 0.53 0.97
41.0 ± 3.0 41.1 ± 3.0 • 40.6 ± 2.8 40.8 ± 2.9 40.9 ± 2.8 41.1 ± 2.6
− 0.65 0.11 0.28 0.32 0.65
•
Error
p-val
40.8 ± 3.2 40.8 ± 3.2 40.9 ± 3.4 41.4 ± 3.5 40.8 ± 3.1 41.2 ± 3.0
− 0.44 0.61 0.92 0.49 0.88
Flare-Solar: Log-Loss×10 Mislabel
0%
Outlier
0% Log-Loss
AB LB MB Eta r-eta AB R
•
5.73 ± 0.11 5.75 ± 0.11 5.76 ± 0.11 5.80 ± 0.14 5.76 ± 0.10 5.75 ± 0.16
3% p-val − 0.99 0.99 0.99 0.99 0.98
Mislabel Outlier
AB LB MB Eta r-eta AB R
Log-Loss 5.85 ± 0.14 5.87 ± 0.13 5.90 ± 0.11 • 5.85 ± 0.13 5.88 ± 0.10 5.86 ± 0.18
5% p-val − 0.99 0.99 0.50 0.99 0.81
•
Log-Loss
p-val
5.82 ± 0.12 5.86 ± 0.11 5.88 ± 0.11 5.83 ± 0.12 5.88 ± 0.10 5.83 ± 0.15
− 0.99 0.99 0.92 0.99 0.62
20% 0%
3%
5%
Log-Loss
p-val
Log-Loss
p-val
Log-Loss
p-val
6.66 ± 0.15 6.66 ± 0.14 6.66 ± 0.13 6.66 ± 0.13 • 6.65 ± 0.14 6.69 ± 0.16
− 0.11 0.18 0.14 0.03 0.98
6.67 ± 0.12 6.68 ± 0.12 • 6.67 ± 0.12 6.70 ± 0.13 6.67 ± 0.10 6.73 ± 0.17
− 0.74 0.31 0.99 0.35 0.99
6.70 ± 0.13 6.67 ± 0.12 • 6.67 ± 0.11 6.72 ± 0.15 6.68 ± 0.11 6.70 ± 0.17
− 0.01 0.00 0.99 0.02 0.73
Notes: The p-values are calculated by one-side paired t-test against Adaboost. The dot indicates the best method under each noise condition.
2238
T. Kanamori, T. Takenouchi, S. Eguchi, and N. Murata
Table 13: Error and Log-Loss×10 on the Test Set of Splice Among Six Boosting Algorithms: Adaboost (AB), Logitboost (LB), Madaboost (MB), Eta-Boost (Eta), Robust Eta-Boost (r-eta), and Adaboost Reg (AB R ). Splice: Test Error Mislabel
0%
Outlier
0%
AB LB MB Eta r-eta AB R
3%
Error
p-val
Error
p-val
Error
p-val
8.6 ± 1.0 8.4 ± 1.4 8.7 ± 1.0 10.3 ± 2.0 • 8.0 ± 0.9 8.2 ± 1.0
− 0.28 0.54 0.99 0.04 0.07
11.8 ± 2.1 10.9 ± 1.7 11.6 ± 1.8 14.8 ± 2.6 • 10.8 ± 1.5 11.4 ± 2.0
− 0.03 0.21 0.99 0.01 0.25
11.9 ± 1.9 11.6 ± 1.7 11.5 ± 1.4 14.0 ± 2.0 11.8 ± 2.0 • 11.2 ± 1.5
− 0.26 0.14 0.99 0.38 0.05
Mislabel
20%
Outlier
AB LB MB Eta r-eta AB R
5%
0%
3%
5%
Error
p-val
Error
p-val
Error
p-val
31.0 ± 1.9 31.4 ± 1.7 31.1 ± 1.8 31.5 ± 2.1 30.6 ± 2.0 • 30.5 ± 1.6
− 0.88 0.58 0.87 0.21 0.14
33.0 ± 2.7 33.0 ± 1.9 33.5 ± 2.5 32.9 ± 2.0 • 32.9 ± 2.2 32.9 ± 2.1
− 0.51 0.82 0.45 0.43 0.44
33.1 ± 2.0 33.5 ± 1.8 32.8 ± 1.8 32.7 ± 1.4 32.7 ± 1.6 • 32.0 ± 1.6
− 0.75 0.29 0.22 0.20 0.03
Splice: Log-Loss×10 Mislabel
0%
Outlier
AB LB MB Eta r-eta AB R
0%
•
3%
Log-Loss
p-val
Log-Loss
p-val
Log-Loss
p-val
1.94 ± 0.16 1.98 ± 0.16 2.09 ± 0.11 2.47 ± 0.44 2.08 ± 0.13 1.95 ± 0.18
− 0.85 0.99 0.99 0.99 0.53
2.96 ± 0.57 2.63 ± 0.31 2.83 ± 0.43 3.49 ± 0.50 • 2.61 ± 0.28 2.92 ± 0.43
− 0.00 0.14 0.99 0.00 0.37
2.91 ± 0.40 2.80 ± 0.36 2.75 ± 0.30 3.62 ± 0.56 • 2.70 ± 0.31 2.96 ± 0.29
− 0.19 0.02 0.99 0.01 0.71
Mislabel Outlier
AB LB MB Eta r-eta AB R
5%
20% 0%
3%
5%
Log-Loss
p-val
Log-Loss
p-val
Log-Loss
p-val
6.34 ± 0.17 6.32 ± 0.26 6.24 ± 0.22 6.34 ± 0.27 • 6.16 ± 0.21 6.32 ± 0.24
− 0.40 0.02 0.53 0.00 0.40
6.53 ± 0.26 6.56 ± 0.23 6.49 ± 0.26 6.65 ± 0.97 • 6.43 ± 0.24 6.57 ± 0.19
− 0.66 0.31 0.70 0.10 0.76
6.55 ± 0.15 6.54 ± 0.25 6.46 ± 0.16 6.63 ± 0.29 • 6.46 ± 0.19 6.57 ± 0.27
− 0.36 0.01 0.84 0.04 0.65
Notes: The p-values are calculated by one-side paired t-test against Adaboost. The dot indicates the best method under each noise condition.
Robust Loss Functions for Boosting
2239
Table 14: Error and Log-Loss×10 on the Test Set of Thyroid Among Six Boosting Algorithms: Adaboost (AB), Logitboost (LB), Madaboost (MB), Eta-Boost (Eta), Robust Eta-Boost (r-eta), and Adaboost Reg (AB R ). Thyroid: Test Error Mislabel
0%
Outlier
AB LB MB Eta r-eta AB R
0%
3%
Error
p-val
Error
p-val
Error
p-val
7.7 ± 3.3 7.7 ± 3.2 7.2 ± 3.5 9.4 ± 4.5 • 7.2 ± 2.9 7.9 ± 3.3
− 0.50 0.09 0.99 0.07 0.74
8.8 ± 3.9 8.9 ± 4.5 8.6 ± 3.8 10.0 ± 5.0 • 8.1 ± 3.3 8.5 ± 4.0
− 0.56 0.29 0.99 0.02 0.28
8.5 ± 3.9 8.1 ± 3.6 8.7 ± 3.8 10.0 ± 5.8 • 7.8 ± 3.2 8.4 ± 3.4
− 0.13 0.66 0.99 0.02 0.36
Mislabel
20%
Outlier
AB LB MB Eta r-eta AB R
5%
0%
•
3%
5%
Error
p-val
Error
p-val
Error
p-val
28.4 ± 6.0 28.1 ± 5.2 28.5 ± 5.2 28.6 ± 5.3 28.6 ± 5.3 28.5 ± 5.5
− 0.29 0.56 0.60 0.66 0.57
29.1 ± 6.2 28.5 ± 6.3 • 28.3 ± 6.5 29.5 ± 6.6 28.7 ± 6.6 29.0 ± 5.7
− 0.14 0.08 0.76 0.27 0.42
29.9 ± 6.5 30.1 ± 6.1 29.7 ± 6.2 30.7 ± 6.2 29.8 ± 6.5 • 29.4 ± 6.2
− 0.64 0.30 0.93 0.39 0.16
Thyroid: Log-Loss×10 Mislabel Outlier
AB LB MB Eta r-eta AB R
0% 0%
3%
Log-Loss
p-val
Log-Loss
p-val
Log-Loss
p-val
2.29 ± 1.55 1.98 ± 1.14 1.87 ± 0.93 8.46 ± 16.0 • 1.64 ± 0.96 3.25 ± 4.49
− 0.01 0.00 0.99 0.00 0.99
2.58 ± 1.26 2.33 ± 0.91 2.42 ± 1.07 8.00 ± 12.7 • 2.10 ± 0.94 2.83 ± 1.47
− 0.01 0.09 0.99 0.00 0.97
2.53 ± 1.15 2.08 ± 0.83 2.15 ± 0.88 5.78 ± 9.22 • 1.95 ± 0.82 2.91 ± 1.77
− 0.00 0.00 0.99 0.00 0.99
Mislabel Outlier
AB LB MB Eta r-eta AB R
5%
20% 0%
3%
Log-Loss
p-val
6.18 ± 0.69 6.07 ± 0.71 • 5.98 ± 0.68 7.03 ± 3.54 6.11 ± 0.77 6.30 ± 1.07
− 0.04 0.00 0.99 0.17 0.93
•
5%
Log-Loss
p-val
Log-Loss
p-val
6.16 ± 0.73 6.12 ± 0.71 6.12 ± 0.75 6.46 ± 2.10 6.17 ± 0.79 6.34 ± 0.98
− 0.23 0.28 0.94 0.56 0.99
6.47 ± 0.99 6.24 ± 0.87 • 6.22 ± 0.82 7.19 ± 3.58 6.35 ± 1.03 6.42 ± 0.93
− 0.00 0.00 0.98 0.05 0.28
Notes: The p-values are calculated by one-side paired t-test against Adaboost. The dot indicates the best method under each noise condition.
2240
T. Kanamori, T. Takenouchi, S. Eguchi, and N. Murata
Table 15: Error and Log-Loss×10 on the Test Set of Titanic Among Six Boosting Algorithms: Adaboost (AB), Logitboost (LB), Madaboost (MB), Eta-Boost (Eta), Robust Eta-Boost (r-eta), and Adaboost Reg (AB R ). Titanic: Test Error Mislabel
0%
Outlier
0% Error
AB LB MB Eta r-eta AB R
3% p-val
22.8 ± 1.5 22.8 ± 1.3 • 22.7 ± 1.1 23.0 ± 1.6 22.8 ± 1.5 23.0 ± 1.8
− 0.33 0.22 0.81 0.41 0.78
•
Mislabel
Error
p-val
22.5 ± 0.9 22.7 ± 1.5 22.7 ± 1.2 22.9 ± 1.6 22.6 ± 1.1 22.9 ± 1.6
− 0.86 0.94 0.99 0.85 0.99
•
Error
p-val
23.1 ± 1.9 22.7 ± 1.2 22.9 ± 1.4 23.3 ± 2.2 22.9 ± 1.5 23.1 ± 2.1
− 0.03 0.11 0.84 0.06 0.54
20%
Outlier
AB LB MB Eta r-eta AB R
5%
0%
3%
5%
Error
p-val
Error
p-val
Error
p-val
34.6 ± 2.6 34.6 ± 2.0 34.7 ± 2.6 34.5 ± 2.0 • 34.5 ± 2.3 34.7 ± 3.8
− 0.43 0.65 0.32 0.19 0.53
35.1 ± 2.2 35.0 ± 2.2 35.2 ± 2.1 35.2 ± 2.2 • 35.0 ± 2.0 35.4 ± 3.5
− 0.26 0.61 0.57 0.18 0.77
34.8 ± 2.3 34.4 ± 2.0 34.2 ± 1.6 34.4 ± 1.9 • 34.1 ± 1.9 34.4 ± 2.0
− 0.01 0.00 0.02 0.00 0.02
Titanic: Log-Loss×10 Mislabel Outlier
AB LB MB Eta r-eta AB R
0% 0%
3%
Log-Loss
p-val
5.32 ± 0.20 5.32 ± 0.20 5.32 ± 0.16 5.34 ± 0.24 • 5.30 ± 0.20 5.43 ± 0.29
− 0.55 0.39 0.80 0.09 0.99
Mislabel Outlier
AB LB MB Eta r-eta AB R
•
5%
Log-Loss
p-val
Log-Loss
p-val
5.34 ± 0.24 5.30 ± 0.18 5.33 ± 0.17 5.37 ± 0.25 5.31 ± 0.16 5.46 ± 0.47
− 0.02 0.34 0.89 0.06 0.99
5.37 ± 0.22 5.35 ± 0.20 5.35 ± 0.22 5.39 ± 0.33 • 5.33 ± 0.20 5.45 ± 0.30
− 0.09 0.09 0.69 0.02 0.99
20% 0%
3%
5%
Log-Loss
p-val
Log-Loss
p-val
6.49 ± 0.16 6.49 ± 0.17 6.48 ± 0.16 6.46 ± 0.14 • 6.46 ± 0.13 6.57 ± 0.26
− 0.61 0.33 0.02 0.02 0.99
6.52 ± 0.15 6.54 ± 0.18 6.54 ± 0.19 6.53 ± 0.21 • 6.52 ± 0.15 6.64 ± 0.38
− 0.92 0.88 0.74 0.46 0.99
•
Log-Loss
p-val
6.52 ± 0.17 6.49 ± 0.14 6.50 ± 0.15 6.60 ± 0.97 6.49 ± 0.16 6.59 ± 0.29
− 0.03 0.08 0.81 0.03 0.99
Notes: The p-values are calculated by one-side paired t-test against Adaboost. The dot indicates the best method under each noise condition.
Robust Loss Functions for Boosting
2241
Table 16: Error and Log-Loss×10 on the Test Set of Twonorm Among Six Boosting Algorithms: Adaboost (AB), Logitboost (LB), Madaboost (MB), Eta-Boost (Eta), Robust Eta-Boost (r-eta), and Adaboost Reg (AB R ). Twonorm: Test Error Mislabel
0%
Outlier
0% Error
AB LB MB Eta r-eta AB R
3% p-val
5.9 ± 1.2 5.9 ± 1.2 5.9 ± 1.1 9.7 ± 1.4 5.7 ± 0.8 • 5.7 ± 0.9
− 0.45 0.40 0.99 0.09 0.09
Mislabel
Error
p-val
Error
p-val
6.5 ± 1.0 6.8 ± 1.0 6.8 ± 1.0 9.5 ± 1.3 6.9 ± 1.2 6.5 ± 1.1
− 0.99 0.99 0.99 0.99 0.60
6.8 ± 1.0 7.1 ± 1.0 7.1 ± 1.0 9.6 ± 1.2 7.1 ± 1.2 • 6.4 ± 1.0
− 0.99 0.99 0.99 0.99 0.01
20%
Outlier
AB LB MB Eta r-eta AB R
•
5%
0%
3%
5%
Error
p-val
Error
p-val
Error
p-val
30.6 ± 1.4 31.4 ± 1.6 31.4 ± 1.8 30.4 ± 1.7 31.3 ± 1.5 • 29.6 ± 1.6
− 0.99 0.99 0.19 0.99 0.00
30.4 ± 1.5 31.4 ± 1.4 31.5 ± 1.9 30.3 ± 1.6 31.0 ± 1.5 • 29.3 ± 2.0
− 0.99 0.99 0.43 0.99 0.00
30.5 ± 1.6 30.9 ± 1.6 31.1 ± 1.5 30.4 ± 1.5 31.1 ± 1.5 • 29.2 ± 1.7
− 0.99 0.99 0.26 0.99 0.00
Twonorm: Log-Loss×10 Mislabel Outlier
AB LB MB Eta r-eta AB R
0% 0%
3%
Log-Loss
p-val
Log-Loss
p-val
Log-Loss
p-val
2.13 ± 0.35 1.73 ± 0.20 1.66 ± 0.21 4.08 ± 1.47 • 1.64 ± 0.22 2.27 ± 0.59
− 0.00 0.00 0.99 0.00 0.98
2.06 ± 0.29 1.83 ± 0.26 • 1.74 ± 0.20 3.40 ± 0.91 1.78 ± 0.27 2.15 ± 0.31
− 0.00 0.00 0.99 0.00 0.99
2.10 ± 0.41 1.84 ± 0.25 1.79 ± 0.24 3.27 ± 0.87 • 1.75 ± 0.24 2.03 ± 0.33
− 0.00 0.00 0.99 0.00 0.06
Mislabel Outlier
AB LB MB Eta r-eta AB R
5%
20% 0%
3%
5%
Log-Loss
p-val
Log-Loss
p-val
Log-Loss
p-val
6.69 ± 0.21 6.64 ± 0.22 6.64 ± 0.23 6.66 ± 0.27 • 6.61 ± 0.24 6.75 ± 0.32
− 0.03 0.03 0.19 0.00 0.96
6.66 ± 0.22 6.62 ± 0.22 6.63 ± 0.23 6.69 ± 0.35 • 6.59 ± 0.27 6.70 ± 0.26
− 0.07 0.14 0.80 0.01 0.93
6.61 ± 0.23 6.59 ± 0.26 6.57 ± 0.24 6.62 ± 0.30 • 6.48 ± 0.26 6.71 ± 0.24
− 0.18 0.06 0.56 0.00 0.99
Notes: The p-values are calculated by one-side paired t-test against Adaboost. The dot indicates the best method under each noise condition.
2242
T. Kanamori, T. Takenouchi, S. Eguchi, and N. Murata
Table 17: Error and Log-Loss×10 on the Test Set of Waveform Among Six Boosting Algorithms: Adaboost (AB), Logitboost (LB), Madaboost (MB), Eta-Boost (Eta), Robust Eta-Boost (r-eta), and Adaboost Reg (AB R ). Waveform: Test Error Mislabel
0%
Outlier
AB LB MB Eta r-eta AB R
0%
3%
Error
p-val
Error
p-val
Error
p-val
13.7 ± 1.2 14.1 ± 1.3 14.2 ± 1.1 14.7 ± 1.3 14.3 ± 1.3 • 13.2 ± 1.0
− 0.99 0.99 0.99 0.99 0.00
14.3 ± 1.3 14.6 ± 1.4 14.9 ± 1.4 15.0 ± 1.4 14.7 ± 1.3 • 13.8 ± 1.3
− 0.97 0.99 0.99 0.99 0.00
14.2 ± 1.3 14.7 ± 1.3 15.1 ± 1.3 15.0 ± 1.3 14.8 ± 1.3 • 13.8 ± 1.1
− 0.99 0.99 0.99 0.99 0.01
Mislabel
20%
Outlier
AB LB MB Eta r-eta AB R
5%
0%
3%
5%
Error
p-val
Error
p-val
Error
p-val
32.9 ± 1.5 33.1 ± 1.5 33.2 ± 1.5 32.7 ± 1.4 33.1 ± 1.5 • 32.3 ± 1.7
− 0.86 0.96 0.15 0.81 0.00
32.8 ± 1.4 32.9 ± 1.2 33.0 ± 1.3 32.6 ± 1.3 32.6 ± 1.5 • 32.4 ± 1.5
− 0.82 0.97 0.05 0.15 0.01
33.0 ± 1.2 32.9 ± 1.1 33.1 ± 1.2 32.8 ± 1.3 33.0 ± 1.3 • 32.4 ± 1.3
− 0.17 0.85 0.11 0.64 0.00
Waveform: Log-Loss×10 Mislabel Outlier
AB LB MB Eta r-eta AB R
0% 0%
3%
Log-Loss
p-val
Log-Loss
p-val
Log-Loss
p-val
3.68 ± 0.36 3.60 ± 0.34 3.52 ± 0.32 4.19 ± 0.69 • 3.44 ± 0.35 3.81 ± 0.54
− 0.02 0.00 0.99 0.00 0.99
3.76 ± 0.42 3.61 ± 0.38 3.60 ± 0.35 4.21 ± 0.94 • 3.47 ± 0.29 3.82 ± 0.43
− 0.00 0.00 0.99 0.00 0.88
3.69 ± 0.34 3.62 ± 0.33 3.54 ± 0.33 4.13 ± 0.94 • 3.47 ± 0.31 3.75 ± 0.37
− 0.02 0.00 0.99 0.00 0.92
Mislabel Outlier
AB LB MB Eta r-eta AB R
5%
20% 0%
3%
5%
Log-Loss
p-val
Log-Loss
p-val
Log-Loss
p-val
6.48 ± 0.21 6.49 ± 0.23 6.49 ± 0.23 6.50 ± 0.34 • 6.45 ± 0.20 6.53 ± 0.26
− 0.64 0.71 0.69 0.12 0.97
6.45 ± 0.22 6.47 ± 0.21 • 6.43 ± 0.18 6.48 ± 0.32 6.44 ± 0.24 6.53 ± 0.26
− 0.82 0.18 0.84 0.34 0.99
6.50 ± 0.24 6.47 ± 0.19 • 6.46 ± 0.17 6.54 ± 0.44 6.48 ± 0.22 6.50 ± 0.25
− 0.14 0.08 0.84 0.27 0.61
Notes: The p-values are calculated by one-side paired t-test against Adaboost. The dot indicates the best method under each noise condition.
Robust Loss Functions for Boosting
2243
Acknowledgments We thank the anonymous referees for helpful comments. This research was partially supported by the Ministry of Education, Culture, Sports, Science and Technology, Grant-in-Aid for Young Scientists (B), 17700277, 2005.
References Amari, S., & Nagaoka, H. (2000). Methods of information geometry. New York: Oxford University Press. Bartlett, P. L., Jordan, M. I., & McAuliffe, J. D. (2003). Convexity, classification, and risk bounds (Tech. rep. 638). Berkeley: Statistics Department, University of California, Berkeley. Bertsekas, D. P. (1999). Nonlinear programming (2nd ed.). Belmont, MA: Athena Scientific. ¨ Blanchard, G., Sch¨afer, C., Rozenholc, Y., & Muller, K.-R. (2005). Optimal dyadic decision trees (Tech. rep.) Berlin: Fraunhofer FIRST, 2005. Available online at http://ida.first.fraunhofer.de/blanchard/publi/index.html. Breiman, L. (1994). Bagging predictors (Tech. rep. 421). Berkeley: Statistics Department, University of California, Berkeley. Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and regression trees. Belmont, CA: Wadsworth and Brooks/Cole. Copas, J. (1988). Binary regression models for contaminated data. J. Royal Statist. Soc. B., 50, 225–265. Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273–297. Demiriz, A., Bennett, K. P., & Shawe-Taylor, J. (2002). Linear programming boosting via column generation. Machine Learning, 46(1–3), 225–254. Domingo, C., & Watanabe, O. (2000). MadaBoost: A modification of AdaBoost. In Proc. of the 13th Conference on Computational Learning Theory, COLT’00. San Francisco: Morgan Kaufmann. Freund, Y., & Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1), 119–139. Friedman, J. H., Hastie, T., & Tibshirani, R. (2000). Additive logistic regression: A statistical view of boosting. Annals of Statistics, 28, 337–407. ¨ Grunwald, P. D., & Dawid, A. P. (2004). Game theory, maximum entropy, minimum discrepancy, and robust Bayesian decision theory. Annals of Statistics, 32(4), 1367– 1433. Halmos, P. R. (1974). Measure theory. New York: Springer-Verlag. Hampel, F. R., Rousseeuw, P. J., Ronchetti, E. M., & Stahel, W. A. (1986). Robust statistics: The approach based on influence functions. New York: Wiley. Kalai, A., & Servedio, R. A. (2003). Boosting in the presence of noise. In STOC ’03: Proceedings of the Thirty-Fifth Annual ACM Symposium on Theory of Computing. New York: ACM.
2244
T. Kanamori, T. Takenouchi, S. Eguchi, and N. Murata
Kanamori, T., Takenouchi, T., Eguchi, S., & Murata, N. (2004). The most robust loss function for boosting. In Neural Information Processing: 11th International Conference, ICONIP (pp. 496–501). Berlin: Springer. Lebanon, G., & Lafferty, J. (2002). Boosting and maximum likelihood for exponential models. In T. G. Dieterrich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14. Cambridge, MA: MIT Press. MacCullagh, P. A., & Nelder, J. (1989). Generalized linear models. London: Chapman and Hall. Mason, L., Baxter, J., Bartlett, P. L., & Frean, M. (1999). Boosting algorithms as gradient descent. In M. S. Stearns, S. Solla, & D. Cohen (Eds.), Advances in neural information processing systems, 11. Cambridge, MA: MIT Press. McLachlan, G. J. (1992). Discriminant analysis and statistical pattern recognition. New York: Wiley. Meir, R., & R¨atsch, G. (2003). An introduction to boosting and leveraging. New York: Springer. Murata, N., Takenouchi, T., Kanamori, T., & Eguchi, S. (2004). Information geometry of u-boost and Bregman divergence. Neural Computation, 16(7), 1437–1481. R¨atsch, G. (2001). Robust boosting via convex optimization. Unpublished doctoral dissertation, University of Potsdam. R¨atsch, G., Demiriz, A., & Bennett, K. (2002). Sparse regression ensembles in infinite and finite hypothesis spaces. Machine Learning, 48, 193–221. ¨ R¨atsch, G., Onoda, T., & Muller, K.-R. (2001). Soft margins for Adaboost. Machine Learning, 42(3), 287–320. ¨ ¨ R¨atsch, G., Scholkopf, B., Smola, A. J., Mika, S., Onoda, T., & Muller, K.-R. (2000). Robust ensemble learning. Cambridge, MA: MIT Press. Rosset, S. (2005). Robust boosting and its relation to bagging. In Proceedings of the 11th ACM SICKDD International Conference on Knowledge Discovery in Data Mining (pp. 249–255). New York: ACM Press. Schapire, R. E., Freund, Y., Bartlett, P., & Lee, W. S. (1998). Boosting the margin: A new explanation for the effectiveness of voting methods. Annals of Statistics, 26(5), 1651–1686. ¨ Scholkopf, B., & Smola, A. J. (2001). Learning with kernels: Support vector machines, regularization, optimization, and beyond. Cambridge, MA: MIT Press. Servedio, R. (2003). Smooth boosting and learning with malicious noise. Journal of Machine Learning Research, 4, 633–648. Takenouchi, T., & Eguchi, S. (2004). Robustifying Adaboost by adding the naive error rate. Neural Computation, 16(4), 767–787. van der Vaart, A. W. (1998). Asymptotic statistics. Cambridge: Cambridge University Press. Vapnik, V. N. (1998). Statistical learning theory. New York: Wiley. Victoria-Feser, M.-P. (2002). Robust inference with binary data. Psychometrika, 67(1), 21–32.
Received January 29, 2006; accepted August 2, 2006.
LETTER
Communicated by Jochen Triesch
Reinforcement Learning, Spike-Time-Dependent Plasticity, and the BCM Rule Dorit Baras [email protected]
Ron Meir [email protected] Department of Electrical Engineering, Technion, Haifa 32000, Israel
Learning agents, whether natural or artificial, must update their internal parameters in order to improve their behavior over time. In reinforcement learning, this plasticity is influenced by an environmental signal, termed a reward, that directs the changes in appropriate directions. We apply a recently introduced policy learning algorithm from machine learning to networks of spiking neurons and derive a spike-time-dependent plasticity rule that ensures convergence to a local optimum of the expected average reward. The approach is applicable to a broad class of neuronal models, including the Hodgkin-Huxley model. We demonstrate the effectiveness of the derived rule in several toy problems. Finally, through statistical analysis, we show that the synaptic plasticity rule established is closely related to the widely used BCM rule, for which good biological evidence exists.
1 Policy Learning and Neuronal Dynamics Reinforcement learning (RL) is a general term used for a class of learning problems in which an agent attempts to improve its performance over time at a given task (e.g., Bertsekas & Tsitsiklis, 1996; Sutton & Barto, 1998). Formally, it is the problem of mapping situations to actions in order to maximize a given reward signal. The interaction between the agent and the environment is modeled mathematically as a partially observable Markov decision process (POMDP, see appendix C.3.2). At each (discrete) time step t, the agent and the environment are in a particular state x(t). The state determines a, possibly noisy, observation vector y(t) that is seen by the agent. Upon observing y(t), the agent performs an action a (t) and receives a reward r (t) based on the quality of its action. The mapping from states to actions is referred to as a policy. The objective of RL is to learn an optimal policy through exploring and interacting with the environment. An optimal policy is a policy that maximizes the long term average reward. Appendix C.3 provides definitions of MDPs and POMDPs. (We refer Neural Computation 19, 2245–2279 (2007)
C 2007 Massachusetts Institute of Technology
2246
D. Baras and R. Meir
readers to Bertsekas & Tsitsiklis, 1996, and Baxter & Bartlett, 2001, for the precise mathematical definition of policy learning in general.) We begin with the simple leaky integrate-and-fire (LIF) neuronal model (e.g., Gerstner & Kistler, 2002; Koch, 1999). A more general model neuron will be introduced in section 3. Consider a population of N LIF neurons, each characterized by a potential vi (t), i = 1, 2, . . . , N. Each neuron receives inputs from both external sources and from other neurons in the network. We assume that each LIF neuron is characterized by a reset value vr , a leakage voltage v L , and threshold vθ . The dynamics of neurons between spikes is given by
τm
dvi (t) = −(vi (t) − v L ) + RIi (t) + RIiext (t) dt
(i = 1, 2, . . . , N).
(1.1)
Upon reaching a threshold vθ , a spike is emitted, and the neuron is reset to vr (we neglect the refractory period in this analysis). The current Iiext (t) denotes the external current entering neuron i, and the synaptic current Ii (t) is given by Ii (t) =
j
wi j
(f) , α t − tj
(1.2)
f
where wi j denotes the synaptic efficacy. The first summation in equation 1.2 is over all neurons that connect to neuron i, while the sum over f runs (f) over the firing times of the presynaptic neurons, where {t j } denote the firing times of neuron j. Although the derivation below applies to general α functions, we specify two special and analytically tractable cases in section 2.1. An advantage of the simple linear dynamics 1.1 and 1.2 is that the potential value between spikes can be precisely computed, yielding (0) t − tˆi 1 t t−s vi (t) = vr exp − + exp − (Ii (s) + Iiext (t)) ds, τm c tˆi(0) τm
(1.3)
(0)
where c = τm /R and tˆi is the last spiking time of neuron i preceding t. The mapping of our problem to a POMDP is defined as follows (see also Bartlett & Baxter, 1999): State: The membrane potential value. We use discrete values within a fixed resolution (the voltage range was [−60, 0] mV,
Reinforcement Learning and Synaptic Plasticity
2247
divided into 120 discrete bins of equal size), along with a binary variable denoting a spiking event in the neuron.1 Observation: A noisy measurement of the state. In this work, the observation is taken to be the state itself, although the formalism is more general. Actions: Given the membrane potential, each neuron may fire or remain quiescent. The action, to fire or not to fire, depends on the state stochastically. Reward: The reward is provided as a real valued number, based on the underlying task. The reward process is assumed to be global, available to all the neurons in the network. The dynamics described by equations 1.1 and 1.2, including the spiking process, is purely deterministic. In order to allow for stochasticity in action selection, required within the POMDP framework (Baxter & Bartlett, 2001), we assume that the spiking process is random. For each neuron, we define a spiking point process ζi (t) ∈ {0, 1} through P [ζi (t) = 1|vi (t)] = [1 + exp(−(λvi (t) − ϕi ))]−1
= σ (λvi (t) − ϕi ) ,
(1.4)
where ζi (t) = 1 denotes a spike at time t, ζi (t) = 0 denotes the absence of a spike, and λ determines the steepness of the sigmoidal function. In the sequel, we use the same threshold ϕ for all neurons, but for generality we allow different ϕi . When applying the direct RL approach presented in section 2 to networks of spiking neurons, the set of synaptic weights parameterizes the stochastic policy. In previous work, as well as in ours, the neurons are modeled as MDPs or POMDPs (see section C.3) with the possible actions of generating or not generating a spike. The approach presented in Bartlett and Baxter (1999) is similar to ours, but neurons are modeled as simple McCulloch-Pitts binary elements. In Xie and Seung (2004) and Seung (2003), a more realistic neural model is used, but learning is based on spiking rates rather than on the membrane potential itself, and the derivation of the algorithm requires additional statistical assumptions about the spiking events. In addition, we are able to relate the plasticity rule derived to the well-known BCM rule. 1 This state space suffices for deriving the algorithm. In simulations presented later in this letter, the state space is larger and contains the following additional ingredients: exact spiking times of each presynaptic neuron (required for exact computation of the weight update) and a variable denoting the number of spike events that occurred in the particular neuron (required for the computation of the reward signal). We emphasize that this extension of the state space does not affect the derivation of the algorithm as appears in section 2 and is required only for implementation of the simulations as described in section 4.
2248
D. Baras and R. Meir
The remainder of the letter is organized as follows. Section 2 is devoted to the derivation of the basic synaptic plasticity rule in the context of LIF neurons. The framework is then extended to general neural models in section 3. Some results demonstrating the applicability of the approach to several toy problems are presented in section 4. A detailed statistical analysis of the derived rule is presented in section 5, showing the close relationship to the BCM rule. We conclude with a discussion and a presentation of several open problems. 2 Derivation of the Weight Update In order to limit the state space and efficiently implement the algorithm, a discretization procedure is required. Discretizing the differential equation 1.1, we have v(t + ) = v(t) +
1 (v L − v(t) + RI (t)) , τm
(2.1)
where I (t) = Ii (t) + Iiext (t). The notations t + , t − refer to the next and previous time steps, respectively. This procedure is analogous to solving the original differential equation using Euler’s method. A similar approach is taken with respect to ζi (t): spiking can occur with a certain probability in each discrete time step bin. Note that as in equation 1.1, the voltage is reset after every spiking event. We use the update rule for the parameters as suggested in the GPOMDP algorithm (Baxter & Bartlett, 2001). In general, let θ denote the vector of parameters describing a system, and let µa (y(t), θ ) denote a probabilistic policy based on the observation y(t) and the parameter value θ . More precisely, µa (y(t), θ ) = Pr choosing action a |y(t), θ . The GPOMDP parameter update rule (Baxter & Bartlett, 2001) is given by θ (t) = θ (t − ) + γ r (t)z(t),
(2.2)
where r (t) is the reward signal, and the eligibility trace, z(t), is given by z(t + ) = βz(t) +
∇µa (t) (y(t), θ ) . µa (t) (y(t), θ )
(2.3)
Here, γ > 0 is a small-step-size parameter and β ∈ [0, 1) is a bias-variance trade-off parameter. The convergence of the (temporal) average reward to its expected value, based on the parameter update rules 2.2 and 2.3, is
Reinforcement Learning and Synaptic Plasticity
2249
established in Baxter and Bartlett (2001) under the appropriate technical conditions. Here, the parameters denote the synaptic weights W = {wi j }i,Nj=1 , and the actions are denoted by the firing pattern {ζi (t)}. From equation 1.4, µζi (t) (y(t), W) = σ ((2ζi (t) − 1)(λvi (t) − ϕi )), where we have used σ (x) = 1 − σ (−x). We compute the gradient required in equation 2.3. Assume initially that ζi (t) = 1. Setting xi (t) = λ(vi (t) − ϕi ), we have ∂ xi (t) ∂µ1 (y(t), W) = σ (xi (t)) ∂wi j ∂wi j ∂vi (t) ∂wi j λ t t−s (f) ds, = σ (xi (t))(1 − σ (xi (t))) exp − α s − tj c tˆi(0) τm f = λσ (xi (t))(1 − σ (xi (t)))
where we have used σ (x) = σ (x)(1 − σ (x)) and the explicit solution 1.3 in the derivation. A similar result can be established for the case ζi (t) = 0, leading to the relation ∂µζi (t) (y(t), W)/∂wi j λ t = (ζi (t) − σ (xi (t)))
c tˆi(0) µζi (t) y(t), W t−s (f) ds. × exp − α s − tj τm f Summarizing the derivation, we obtain the following update rule for the synaptic weights: wi j (t) = wi j (t − ) + γ r (t)zi j (t), zi j (t + ) = βzi j (t) + (ζi (t) − σi (xi (t)))
λ c
t
(0) tˆi
t−s (f) ds. × exp − α s − tj τm f
(2.4)
Note that the weight update depends on the correlation between the reward and the eligibility trace zi j . The eligibility trace of a weight is updated based on the spiking activity of the presynaptic and postsynaptic neurons only. The presynaptic spikes contribute only if they occurred after the last postsynaptic spike (the reset effect), and the sign of the change depends
2250
D. Baras and R. Meir
on the event of postsynaptic firing, through the term (ζi (t) − σi (xi (t))) in equation 2.4. For example, the largest positive contribution to the eligibility trace occurs in the event that both the presynaptic and the postsynaptic cells fire vigorously while a positive reward is delivered. 2.1 Two Explicit Choices for α. In order to better understand the meaning of the update rule established, we consider two special choices for the function α: Pulse function: Setting α(t) = q δ(t), the integral in equation 2.4 can be computed exactly, leading to the update zi j (t + ) = βzi j (t) + (ζi (t) − σi (t))
where
f
(f) t − tj λq exp − , c f τm
(2.5)
(f)
denotes a summation over all presynaptic firing times t j (0)
(f)
such that tˆi < t j ≤ t. Exponentially decaying window: A common choice for α is α(t) =
q −s/τs e I {t ≥ 0}. τs
An explicit (but tedious) calculation (see section C.1 for details) leads to the update zi j (t + ) = βzi j (t) + (ζi (t) − σi (t))F
(f) (0) , tˆi , tj
where F
(f) (0) , tˆi = tj
(f)
λq τm − t−tτ j s e c (τm − τs ) f
τm − τs (f) (0) −1 . min t − t j , t − tˆi × exp τm τs 3 Extensions to General Neuronal Models The synaptic plasticity rules derived in section 2 are based on a simple integrate-and-fire model, which neglects both nonlinearities and adaptation mechanisms known to take place within a neuron. We consider now a network of generalized neural elements. Let vi (t) denote the membrane potential of neuron i, and set ui (t) = (ui,1 (t), . . . , ui,P (t)) to be a vector
Reinforcement Learning and Synaptic Plasticity
2251
of additional neural variables, describing, for example, adaptation mechanisms neglected in the simple LIF model. The network model is given by dvi = F (vi (t), ui (t)) + G(vi (t), ui (t))Ii (t), dt dui,k (i = 1, 2, . . . , N, k = 1, 2, . . . , P), = Hk (vi (t), ui (t)) dt with appropriate reset definitions if needed (e.g., reset vi (t) after spiking in integrate-and-fire type models). For example, such a model may describe many biophysically motivated models such as the Hodgkin-Huxley model and simpler two-dimensional approximations. In general, one cannot solve the differential equation and find an explicit solution for vi (t) as was done in equation 1.3, and one has to resort to a different approach in order to determine the synaptic update rules. Proceeding similarly to the derivation in section 2, we find that wi j (t) = wi j (t − ) + γ r (t)zi j (t), zi j (t + ) = βzi j (t) + λ(ζi (t) − σi (t))
∂vi (t) . ∂wi j
In order to compute ∂vi (t)/∂wi j we use an idea similar to the real-time recurrent learning algorithm described in Haykin (1999). Discretizing time, as in section 2 (and neglecting I ext for simplicity), the neural dynamics is given by (f) vi (t + ) = vi (t)+F (vi (t), ui (t)) + G(vi (t), ui (t)) wi j αi t − t j
j
ui,k (t + ) = ui,k (t) + Hk (vi (t), ui (t)).
f
(3.1)
(f)
Note that t j = for some integer . In order to obtain an algorithm that is both computationally efficient and requires low memory, we demand that (and demonstrate in section 3.1) αi (t + ) = A(αi (t), n, ξ i (t + )),
(3.2)
for some function A, where ξ i (t) is a binary vector, which indicates a spike at a presynaptic neuron, that is, ξ ij (t) = 1 if a presynaptic spike occurred at neuron j at time t, and zero otherwise. In other words, the value of αi (t + ) depends only on the values of αi (t) at time t and not on previous times.
2252
D. Baras and R. Meir
Denote wi = (wi1 , wi2 , . . . , wi N ) , the vector of weights afferent to neuron i, and set α i (t) =
(f) (f) (f) αi t − t1 , αi t − t2 , . . . , αi t − tN .
f
f
f
We express equation 3.1 as
vi (t + ) = vi (t) + F vi (t), ui (t) + G vi (t), ui (t) wi α i (t) .
(3.3)
Differentiating equation 3.3 with respect to wi and using the chain rule, we obtain ∇w vi (t + ) = ∇w vi (t) + F (vi (t), ui (t))∇w vi (t) + G (vi (t), ui (t))∇w vi (t)wi α i (t) + G(vi (t), ui (t))α i (t), or, equivalently,
∇w vi (t + ) = ∇w vi (t) 1 + F vi (t), ui (t) + G vi (t), ui (t) wi α i (t) . +G(vi (t), ui (t))α i (t).
(3.4)
This is a recursive update rule for ∇w vi (t), with an initial condition set at every postsynaptic spike: ∇w vi (t)t=tˆi = 0.
(3.5)
Summarizing the learning algorithm along with the neuronal dynamics we obtain algorithm 1: Algorithm 1: Synaptic Update Rule for a Generalized Neuronal Model Initialization Initialize weights wi j (0), eligibility traces zi j (0) = 0 and αi (0) = 0 Initialize vi (0), ui,k (0). Step 1: Weight update wi j (t) = wi j (t − ) + γ r (t)zi j (t),
Reinforcement Learning and Synaptic Plasticity
2253
∂vi (t) zi j (t + ) = βzi j (t) + λ ζi (t) − σ ∂wi j Step 2: Dynamic update
vi (t + ) = vi (t) + F vi (t), ui (t)
(f) +G vi (t), ui (t) wi j αi t − t j
j
f
ui,k (t + ) = ui,k (t) + Hk (vi (t), ui (t)) Step 3: Parameter update ∂vi (t + ) ∂vi (t) = 1 + F (vi (t), ui (t)) ∂wi j ∂wi j + G (vi (t), ui (t))wi α i (t) (f) + G(vi (t), ui (t)) αi t − t j
(f)
αi t + − t j
f
=
f
(f) A αi (t − t j ), n, ξ i (t + )
f
At every postsynaptic spike at neuron i Reset vi (t), ui (t) and ∂vi (t)/∂wi j
3.1 Explicit Calculation of the Update Rules for Different α Functions. As mentioned above, we assume that α(t) obeys equation 3.2. Here we demonstrate the update rule of the α function. 3.1.1 Demonstration for α(s) = q δ(s). In the present discrete time setting, the Dirac delta function is replaced by a Kronecker delta function, δ(t) → δ0,t where t = for some integer . Thus, f
(f) = αi t + − t j δt+,t( f ) = q ξ ij (t + ), f
j
since the only contribution to the last sum arises when a presynaptic spike occurs at t + .
2254
D. Baras and R. Meir
3.1.2 Demonstration for α(s) = f
q − τss e τs
. In this case,
q t+−t(j f )
− (f) τs = αi t + − t j e τs f (f)
q q − − t−tτ j τ s s + ξ ij (t + ) = e e τs τ s f
= e − τs
f
q (f) + ξ ij (t + ), αi t − t j τs
with the appropriate initial conditions αi (0) = 0. 3.2 Depressing Synapses. As a simple application of the general procedure presented, we consider the case of LIF neurons possessing a synaptic depression mechanism. While several formulations currently exist for depression, we consider the simple formulation proposed in Richardson, Melamed, Silberberg, Gerstner, and Markram (2005), leading to the dynamic equations (between spikes), τm
dvi (f) D j (t) wi j α t − tj = − vi (t) − vr + R dt j f
d D j (t) 1 − D j (t) = dt τd
( f −1) (f) tj , ≤ t ≤ tj
(f)
where t j are the presynaptic spiking times. Whenever a presynaptic spike occurs, the depression variable, D j (t), is updated according to the rule (f) (f) (f) ← Dj tj − vD Dj tj , Dj tj where v D is related to the level of synaptic resource depletion (Richardson et al., 2005). These equations describe a depletion of the synaptic resources whenever a presynaptic spike arrives, followed by a recovery process at a temporal scale of τd . Note that setting D j (t) = 1, we recover the standard LIF model. An explicit expression for D j (t) between two presynaptic spikes can be easily derived. The initial conditions are D f = D j (t)t=t( f )+ = D j (t) − v D D j (t) t=t( f )− , j
and for the first presynaptic spike, D j (0) = 1.
j
Reinforcement Learning and Synaptic Plasticity
2255
Integrating the differential equation for D j (t), we obtain D j (t) = 1 − e
(f) − t−t j /τ D
( f −1) (f) . tj ≤ t ≤ tj
[1 − D f ]
The exact solutions for vi (t) and D j (t) enable an explicit calculation of the update rule for the case of depressing synapses. The general solution is rather cumbersome, and we present only the result for the case where α(t) = q δ(t). The full derivation and general solution can be found in section C.2. wi j (t) = wi j (t − ) + γ r (t)zi j (t), λq zi j (t + ) = βzi j (t) + (ζi (t) − σi (t)) c
(f)
(f) t − tj (f) , exp − Dj tj τ m (0)
f :t j >tˆi
which is very similar to equation 2.5, except that the presynaptic contribution is multiplied by the depression variable D j (t). 4 Simulation Results In order to study the effectiveness of the update rules derived above, we present two learning experiments. The experiments are performed with the simple LIF neuron model augmented with the stochastic spiking procedure described in section 1. We note that these experiments are aimed at demonstrating the derived synaptic rules rather than proving their efficiency. Moreover, the results presented were obtained using episodic learning only (see the definition below). (We refer readers to Baras, 2006, for further numerical experiments and the application to online learning.) The learning rule presented in section 2 is derived for a single neuron. However, based on theorem 1 in Bartlett and Baxter (1999), the same learning rule applies at the network level. What these authors show is that each neuron essentially treats other neurons as though they were part of the environment; the only communication between neurons is through the global reward signal and the influence of other neurons on the state observations. The claim can be formally demonstrated by noting that the derivative of the average reward with respect to the (concatenated) vector of parameters corresponding to all neurons can be broken down into derivatives with respect to each of the neurons’ parameters, since the values of the weights afferent to each neuron are independent of those relating to other neurons. This argument is similar to the one used to derive the real time recurrent learning algorithm (e.g., Haykin, 1999). The derivation in section 2 assumed an online learning setup, where both the eligibility trace and the weights are updated continually in time.
2256
D. Baras and R. Meir
In many applications, especially where the inputs and outputs are encoded using rates averaged over a temporal window, it is more meaningful to consider episodic learning. Episodic learning over a temporal window T is obtained when the eligibility traces are updated at each Euler step, but the weights are updated only at the end of the period T. More precisely, Episodic learning:
wi j (T) = wi j (0) + γ r (T)¯zi j (T),
(4.1)
z¯ i j (T) = zi j (t) T
=
T− 1 zi j (t), Ne p
(4.2)
t=0
T + 1, and zi j (t) is given by equation where T is the episode length, Ne p = 2.4, that is, the eligibility traces are updated at every (discrete) time step (we assume for simplicity that T/ is an integer). We comment briefly on the rate coding used in the experiments discussed below. From initial experimentation, we found that a high rate input stimulus is required in order to generate a network response. Therefore, input coding consists of high firing rates (the exact values are provided in section 4). Since the network weights are restricted, limiting the possible firing rates of output neurons, we used a lower threshold for output coding. Specifically an output value of 1 was coded by a firing rate above 80 Hz. Low rates are encoded as 0 Hz for both input and output regardless of the problem. We consider the following learning tasks. XOR: The network shown in Figure 1a is a fully connected feedforward network with two input neurons, two hidden neurons, and a single output neuron. We denote the weights entering hidden neuron i by wihj , j = 1, 2 and the output weights by wiout , i = 1, 2. The task consists of solving the XOR problem, where inputs 0 and 1 are encoded as spike trains with low and high firing rates, respectively. The high input and output rates used in this experiment were 200 Hz and 120 Hz, respectively. During the learning process, time is split into different learning epochs (or episodes). In each episode, the network is presented with a randomly chosen pattern (the probability of all four patterns is equal). At the end of each episode, the output is determined as correct, incorrect, or undetermined. The decision is obtained by comparing the output spiking rate to a predetermined threshold. An undetermined output relates to the case of a spiking rate very close to the threshold. Reward is assigned depending on the output: positive reward for correct output and negative reward otherwise. Weights are randomly inih h tialized, but the following constraints were used: w11 and w22 are initialized h h with positive values, and w12 and w21 are initialized with negative values. This is the only problem (among those solved in Baras, 2006) in which a solution forces some of the weights to be negative (inhibitory synapses).
Reinforcement Learning and Synaptic Plasticity
2257
(a) XOR learning
(b) Path learning in a small network
(c) Path learning in a large network Figure 1: Networks architectures used in the simulations.
All technical details and further explanations (e.g., exact reward values, parameter values) are given in appendix B. Figure 2 displays learning curves averaged over all successful learning simulations (99% successful convergence out of 100 experiments). We observe that hill climbing occurs for the average reward. Since the curves are averaged, it is impossible to see that the learning speed differs from simulation to simulation, although this does occur (Baras, 2006). Path learning The networks shown in Figures 1b and 1c consist of a single input neuron, two output neurons, and a varying number of hidden neurons. The aim here is to cause one of the output neurons to fire vigorously on stimulation of the input neuron, while demanding that the other output neuron remains silent. There are no limitations on the behavior of other neurons in the network (hidden neurons). We consider both feedforward and recurrent networks here.
2258
D. Baras and R. Meir 35
100
80
Average reward value
Pattern 01 Pattren 10 Pattern 11 Pattern 00 Threshold value
90
Rate value
70 60 50 40 30 20
30
25
20
15
10 0
0
100
200
300
400
500
600
700
10
800
0
100
200
300
400
500
600
700
800
Episodes
Episodes
(a) Firing rates of output neurons
(b) Average reward 0.7
0.8
w outh1 w outh2
w h1i1 w h1i2 w h2i1 w h2i2
0.6
0.6
0.4
0.5
Weights value
Weights value
0.2
0
0.4
0.3
0.2
0.1
0
100
200
300
400
500
Episodes
600
700
800
0
0
100
200
300
400
500
600
700
800
Episodes
(c) Evolution of first layer weights (d)Evolution of second layer weights Figure 2: XOR learning.
The reward scheme used in this case is more informative than the one used in the previous task and depends on the exact spiking rate of each output separately. The idea is that the reward should be proportional to the performance of the network. Every output neuron contributes a positive term when it is correct and a negative term otherwise. The exact details are given in appendix B. The high input and output rates used in this experiment were 200 Hz and 160 Hz. Figure 3 displays learning curves averaged over all successful learning simulations for the small network (94% successful convergence). We also studied sequential switching between the two tasks for the smaller network of Figure 1b. When considering switching tasks, we allow the network to learn a certain path for 2000 episodes. We then reverse the learned task. This process is repeated three times (six tasks overall). Figure 4 displays learning curves averaged over all successful learning simulations in the small network (99% successful convergence). Note especially the
Reinforcement Learning and Synaptic Plasticity 200
100
N3 output rate N4 output rate Threshold value
180
50
160
Average reward value
140
Rate value
2259
120
100
80
60
0
40
20
0
0
500
1000
1500
0
500
Episodes
1000
1500
Episodes
(a) Firing rates of output neurons
(b) Average reward
0.7 w31 w32 w41 w42
Weights value
0.6
0.5
0.4
0.3
0.2
0.1
0
0
500
1000
1500
Episodes
(c) Weight evolution Figure 3: Path learning in a small network.
behavior of weights during the learning process. Several observations are in order:
• From Figure 4c, weights reaching the same output neuron are high together or low together. Namely, w31 and w32 increase together when the first output should be high, while w41 and w42 decrease together, and vice versa, depending on the task.

• The speed of learning differs between weights. Observe that w31 and w41 reach stable values (either low or high) faster than w32 and w42. This occurs because the input neuron stimulus possesses a high rate, but the weights are constrained: no matter how high w21 is, neuron 2 will always have a lower rate than the input neuron. Therefore, the input neuron has more influence on both output neurons, and hence on the reward. Thus, the correlation between the reward and the eligibility traces of w31 and w41 is higher than the correlation between the reward and the eligibility traces of w32 and w42. This causes the slower learning of these “indirect” weights.

[Figure 4: Path learning: switching tasks. (a) Firing rates of output neurons N3 and N4 against the threshold value. (b) Average reward for each task. (c) Weight evolution (w31, w32, w41, w42).]

Figure 5 displays learning curves averaged over all successful learning simulations in the large network of Figure 1c (82% successful convergence). The “hidden” units rate curve (Baras, 2006) shows that the recurrence (closed loop in the network) did not cause saturation in the rates, and they all remain within a reasonable range. The convergence rate is lower than for the previous network, for the following reasons:

1. A large network implies a large number of parameters. When the number of parameters increases, the correlation between the reward and each eligibility trace is reduced, sometimes leading to unsuccessful learning.

2. A large number of parameters increases the probability of converging to a local maximum due to a larger number of extrema.
[Figure 5: Path learning in a large network. (a) Firing rates of the high and low output neurons against the threshold value. (b) Average reward. (c) First-layer weight evolution (w35, w36, w47, w48).]
3. The state space is larger (proportional to the number of neurons in the network). Therefore, more exploration might be required.

All learning experiments described perform hill climbing in the sense that the average reward increases with time. However, in some cases, the system reaches a local maximum of the average reward. By unsuccessful learning, we refer to situations where the reward did not reach a prescribed reward threshold value within the allotted time. This may result from either slow convergence or a poor local maximum.

5 Relation to the BCM Rule

The synaptic plasticity rules derived in section 2 belong to the class of spike-time-dependent plasticity rules. In this section, we consider the average behavior of the derived rule, where the expectation is taken over a temporal window, allowing us to transform individual spikes to firing rates. We will show that the derived average rule behaves similarly to the well-known BCM rule (Bienenstock, Cooper, & Munro, 1982), for which good experimental evidence exists (e.g., Cooper, Intrator, Blais, & Shouval, 2004).
A related result for several forms of STDP was presented in Izhikevich and Desai (2003); we comment on this work at the end of this section. For the analysis presented in this section, we assume α(t) = qδ(t) and episodic learning. In order to average the synaptic rule, one needs to make appropriate statistical assumptions. Here we assume that the presynaptic firing pattern is described by a stationary Poisson process, while the output firing is given by a Poisson process with a rate function that depends on the presynaptic neurons. More precisely:
• The input spike train is a homogeneous Poisson process with rate x_j.

• The output spike train is an inhomogeneous Poisson process with rate λ_i(t), where

\lambda_i(t) = x_i + \eta w_{ij} \sum_f e^{-(t - t_j^{(f)})/\tau_m}, \qquad (5.1)

the sum over f running over all presynaptic firing times following the last postsynaptic spike, and x_i being the output rate when no presynaptic activity at neuron j is present.

This relation captures the effect of increased postsynaptic firing rate due to both enhanced presynaptic activity and a large synaptic weight (a similar and slightly more general approach is presented in Gerstner & Kistler, 2002, p. 408). In the sequel, we will allow η to be a function of the weight w_{ij} and the presynaptic firing rate x_j. This provides extra flexibility in determining a postsynaptic firing rate based on equation 5.1.
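To make the rate model concrete, the following minimal sketch simulates equation 5.1 over one episode. The step size and τ_m follow Table 1, and the normalization ηw_ij = 0.1 x_j follows the choice used later in this section; the remaining rates are illustrative values, not the letter's.

import numpy as np

# Sketch of the output-rate model of equation 5.1: the postsynaptic rate
# is a baseline x_i plus exponentially decaying contributions from
# presynaptic spikes since the last postsynaptic spike.
rng = np.random.default_rng(0)
dt = 5e-4              # Euler step Delta (Table 1)
tau_m = 0.030          # membrane time constant (Table 1)
T = 0.25               # episode length (Table 1)
x_j, x_i = 20.0, 5.0   # illustrative presynaptic and baseline rates (Hz)
eta_w = 0.1 * x_j      # eta * w_ij, normalized as in section 5

n = int(T / dt)
pre = rng.random(n) < x_j * dt   # homogeneous Poisson presynaptic train
post = np.zeros(n, dtype=bool)
kernel = 0.0                     # running sum over presynaptic spikes
for t in range(n):               # since the last postsynaptic spike
    kernel *= np.exp(-dt / tau_m)
    if pre[t]:
        kernel += 1.0
    lam = x_i + eta_w * kernel   # equation 5.1
    if rng.random() < lam * dt:  # spike with probability lambda * Delta
        post[t] = True
        kernel = 0.0             # earlier presynaptic spikes drop out
print(pre.sum(), "pre spikes,", post.sum(), "post spikes")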
Based on equations 4.1 and 4.2, the expected weight change over an episode is given by

E\,w_{ij}(T) = E_{\mathrm{post,pre}}\Bigg[\sum_{t=0}^{T-\Delta} z_{ij}(t+\Delta)\Bigg] = E_{\mathrm{post|pre}}\, E_{\mathrm{pre}}\Bigg[\sum_{t=0}^{T-\Delta} z_{ij}(t+\Delta)\Bigg], \qquad (5.2)
where we have assumed for simplicity that w_{ij}(0) = 0 and have neglected the constant term Ne_p, which does not affect the form of the result. The expectation in equation 5.2 is with respect to the presynaptic and postsynaptic spike trains. The computation of the expectations in equation 5.2 is rather tedious and can be found in the appendix.
Based on this derivation, we obtain an expression that holds in the low firing rate limit, where (x_j Δ)^2 ≪ 1:

E\,w_{ij}(T) \approx \bar r(T)\,(x_j\Delta) \sum_{t=0}^{T-\Delta} (1 - x_j\Delta)^{h/\Delta - 1} \Big[ x_i\Delta\, H_1(x_i, t) - H_2(x_i, w_{ij}, t) \Big], \qquad (5.3)

where

H_1(x_i, t) = \sum_{k=1}^{h/\Delta} e^{-(t - t_k)/\tau_m},

H_2(x_i, w_{ij}, t) = \sum_{k=1}^{h/\Delta} \Big[ e^{-(t - t_k)/\tau_m}\, f\big(e^{-(t - t_k)/\tau_m}\big) - \eta w_{ij}\Delta\, \big(e^{-(t - t_k)/\tau_m}\big)^2 \Big],

and where h = min(t, m), f(x) = σ(v_r + w_{ij} x), t_k = t − h + kΔ, and m is defined through equation A.1 in appendix A. Note that both H_1 and H_2 depend on x_i through h, which depends on x_i through equation A.1. While the expressions are rather cumbersome, it is easy to compute them numerically and compare them graphically with the BCM plasticity rule. Figure 6 presents examples of the synaptic modification curve. The curves were created with η such that ηw_{ij} = 0.1 x_j, meaning that the contribution of a presynaptic spike to the weight update is no more than 10% of the presynaptic rate. This type of normalization guarantees that the firing of a single presynaptic cell does not dominate the postsynaptic firing rate. The specific choice of 0.1 is arbitrary, and different values lead to very similar behaviors. For example, we have tried corrections of the form ηw_{ij} = const · x_j. This relation is not mandatory, but some restriction on the value of η should be enforced in order to prevent unrestrained growth of the postsynaptic firing rate. Note, however, that the effect of η is in any event small, as can be seen in equation A.12, where it is multiplied by an additional (small) term Δ. Other terms in equation A.12 contain products of the form x_iΔ or x_jΔ, where x_i and x_j, the firing rates, are much larger.
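Since the expressions are easy to evaluate numerically, here is a minimal sketch that tabulates the bracket of equation 5.3 (as reconstructed above) over a range of postsynaptic rates; the logistic form of σ and its constants are assumptions for illustration only.

import numpy as np

# Numerical sketch of equation 5.3; sigma is an assumed logistic
# squashing function, and w_ij and the rates are illustrative.
dt, tau_m, T, w_ij = 5e-4, 0.030, 0.25, 0.3

def f(y, theta=0.15, beta_s=10.0):
    # f(y) = sigma(v_r + w_ij * y); logistic sigma with an assumed
    # threshold theta on the synaptic drive w_ij * y
    return 1.0 / (1.0 + np.exp(-beta_s * (w_ij * y - theta)))

def expected_dw(x_i, x_j):
    eta_w = 0.1 * x_j                      # normalization of section 5
    m = max(dt, (1.0 / x_i) // dt * dt)    # window of equation A.1
    total = 0.0
    for t in np.arange(dt, T, dt):
        h = min(t, m)
        K = int(round(h / dt))
        e = np.exp(-(h - dt * np.arange(1, K + 1)) / tau_m)
        H1 = e.sum()
        H2 = (e * f(e)).sum() - eta_w * dt * (e ** 2).sum()
        total += (1 - x_j * dt) ** (K - 1) * (x_i * dt * H1 - H2)
    return x_j * dt * total

for x_i in (5.0, 20.0, 50.0, 100.0):
    print(x_i, expected_dw(x_i, x_j=20.0))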
Examining equation 5.3 and Figure 6, we observe the following properties:

• The qualitative form of the expected weight change is similar to that of the BCM rule.

• Both potentiation and depression are observed, based on an activity-dependent threshold.

• The zero crossing depends on both pre- and postsynaptic activities. Higher presynaptic activity leads to an increase in the threshold, which prevents uncontrolled weight increase (self-regulation). In fact, we find that the threshold is essentially linear in the presynaptic firing rate.

[Figure 6: Average weight change as a function of the firing rate x_i of the postsynaptic neuron for three presynaptic firing rates: (a) x_j = 20 Hz, (b) x_j = 100 Hz, (c) x_j = 180 Hz.]

The curves were obtained using the same parameter values used in the simulations of the toy problems in section 4. Using other parameter values (for example, τ_m, T) led to qualitatively similar results.

Izhikevich and Desai (2003) have also recently presented some theoretical work related to the relationship between STDP and the BCM rule. Essentially, this work showed that BCM followed directly from STDP when the pre- and postsynaptic neurons fire weakly correlated Poisson spike trains, and only near-neighbor spike interactions are taken into account (in contrast to some implementations of STDP, which use all spikes to update the weight). It is interesting to observe that our formulation leads to a similar result, except that instead of considering only nearest neighbors, we take into account all spikes between two consecutive postsynaptic spikes. This follows from the resetting mechanism taking place on reaching threshold. Additionally, the STDP rule used here was derived from the gradient POMDP framework rather than being postulated. Note that in the statistical analysis of the learning rule, we take into account all spikes over a fixed temporal window.

6 Discussion

Learning is a behavioral process, taking place at the level of an organism, based on interacting with the environment and receiving feedback. Synaptic plasticity is an internal process that depends on neural activity. The aim of this work was to suggest an answer to how these two processes are related in the context of learning a stochastic policy in a network of spiking neurons. By formalizing the learning process as a reinforcement learning problem and using the direct RL approach (Baxter & Bartlett, 2001), we derived an algorithm that updates the synaptic weights, based on the reward obtained as feedback. While previous work has addressed these issues
(Bartlett & Baxter, 1999; Seung, 2003; Xie & Seung, 2004), we have been able to extend these approaches to general neural models, which can be made as biologically realistic as is required. Additionally, a detailed analysis has shown that the derived rule is related to the BCM rule. We are not aware of any previous work relating the BCM rule to reinforcement learning. The synaptic update rule derived is computationally efficient and yields good results in several learning tasks. Moreover, the basic rule derived in section 2 depends on local activity and data available at the synaptic site and requires mechanisms that are available at the cellular level. The update rule derived in section 3 for more realistic neural models is indeed rather complex. A question left open at this point relates to the possible biological implementation of such rules.

Finally, we comment on the relation of our work to other recent contributions (we have already related our work to Bartlett & Baxter, 1999, and Xie & Seung, 2004). The work of Rao and Sejnowski (2001) begins with a biophysical model of a neuron and shows how the temporal difference learning rule (e.g., Sutton & Barto, 1998) can reproduce the temporally asymmetric window observed in STDP plasticity rules. Wörgötter and Porr (2004) present a general overview of temporal difference learning and its implementation in networks of spiking neurons. Both of these approaches relate to ours in that they connect a well-known computational approach, temporal difference learning, to synaptic plasticity rules. Our work differs from these in at least three ways. First, we derive (rather than relate) a synaptic modification rule for learning an optimal policy (a control problem) rather than attempting to estimate an optimal value function (a prediction task), as is done in temporal difference methods. The latter approach always requires an additional step of deciding on an appropriate action, given the value estimate. Second, the relationship to the BCM rule and its direct relation to policy improvement is novel. Third, the approach taken here enables the extension of the approach to a broad range of novel policy improvement algorithms proposed in the recent machine learning literature (see below).

As a final note, we comment that Toyoizumi, Pfister, Aihara, and Gerstner (2005) have recently derived a BCM-like learning rule for spiking neurons based on information-theoretic arguments within an unsupervised learning framework. Based on an integrate-and-fire neuronal model and Poisson statistics assumptions, they show that maximizing the mutual information between the input and output, subject to constraints on the output neuron's firing rate, leads to a BCM-like plasticity rule. It would be interesting to see whether this type of result may shed an information-theoretic light on the reinforcement learning setup we have considered in this letter.

The framework developed here can be extended in many directions. For example, more realistic neural models may be easily incorporated within the setting presented in section 3. Moreover, general synaptic window functions can be considered, based on more detailed physiological information. Finally, recent extensions of gradient POMDP approaches, based on
actor-critic architectures (Konda & Tsitsiklis, 2003), are based on a temporal difference signal rather than on the reward itself, as in this work. Given the flurry of recent interest in the relationship between the temporal difference signal in RL and the behavior of dopaminergic neurons in the brain stem (e.g., Schultz, 2002), it would be particularly interesting to extend the framework presented here accordingly.

Appendix A: Computing Expectations

Our objective here is to compute the expectation
E_{\mathrm{post,pre}}\Bigg[\sum_{t=0}^{T-\Delta} z_{ij}(t+\Delta)\Bigg] = E_{\mathrm{post,pre}}\Bigg[\sum_{t=0}^{T-\Delta} \big(\zeta_i(t) - \sigma_i(t)\big) \sum_{f:\, t_j^{(f)} > \hat t_i^{(0)}} \exp\Big(-\frac{t - t_j^{(f)}}{\tau_m}\Big)\Bigg],
where the expectation is with respect to the presynaptic and postsynaptic spike processes. Calculating this expression directly is very difficult. A possible solution to this difficulty is to sum over a fixed time window of length m. A reasonable choice for the window length is given by the average interspike interval of a Poisson process with rate x_i. Since these intervals are exponentially distributed with the same parameter as the original process, we set²

\tilde m \equiv E(\text{interspike interval}) = \frac{1}{x_i} \qquad \text{and} \qquad m = \Big\lfloor \frac{\tilde m}{\Delta} \Big\rfloor\, \Delta, \qquad (A.1)
implying the approximation
\sum_{f:\, t_j^{(f)} > \hat t_i^{(0)}} \exp\Big(-\frac{t - t_j^{(f)}}{\tau_m}\Big) \;\approx\; \sum_{f:\, t_j^{(f)} > t - m} \exp\Big(-\frac{t - t_j^{(f)}}{\tau_m}\Big).
This approximate equality can also be expressed as
\sum_{f:\, t_j^{(f)} > \hat t_i^{(0)}} \exp\Big(-\frac{t - t_j^{(f)}}{\tau_m}\Big) \;\approx\; \sum_{t'=t-h}^{t} I(\text{spike at time } t')\, \exp\Big(-\frac{t - t'}{\tau_m}\Big),

² Note that the postsynaptic rate is not strictly x_i (see equation 5.1).
where we have introduced an indicator function at the times of the presynaptic spikes, and where h = min(t, m) and all times are multiples of Δ. Defining

y(t) \equiv \sum_{t'=t-h}^{t} I(\text{spike at time } t')\, \exp\Big(-\frac{t - t'}{\tau_m}\Big) \qquad (A.2)

and

f(\theta) = \sigma_i\big(v_r + w_{ij}\theta\big), \qquad (A.3)

we are left with

E_{\mathrm{post,pre}} \sum_t z_{ij}(t+\Delta) \approx E_{\mathrm{pre}}\, E_{\mathrm{post|pre}} \sum_t \big[(\zeta_i(t) - f(y(t)))\, y(t)\big]. \qquad (A.4)
A.1 Expectation with Respect to the Postsynaptic Spike Train. The expectation E_{post|pre}[·] can be computed easily, based on the relation

\zeta_i(t) = \begin{cases} 1 & \text{with probability } \lambda_i(t)\Delta, \\ 0 & \text{with probability } 1 - \lambda_i(t)\Delta, \end{cases}

where λ_i(t) = x_i + η w_{ij} y(t). We obtain

E_{\mathrm{post|pre}} \sum_t z_{ij}(t+\Delta) = E_{\mathrm{post|pre}} \sum_t \big[(\zeta_i(t) - f(y(t)))\, y(t)\big]
= \sum_t \big[\lambda_i(t)\Delta\,(1 - f(y(t)))\, y(t) + (1 - \lambda_i(t)\Delta)\,(0 - f(y(t)))\, y(t)\big]
= \sum_t \big[y(t)\,\big(\lambda_i(t)\Delta - f(y(t))\big)\big]
= \sum_t \big[y(t)\,\big(x_i\Delta + \eta w_{ij}\Delta\, y(t) - f(y(t))\big)\big]. \qquad (A.5)
A.2 Expectation with Respect to the Presynaptic Spike Train. Based on equation A.5, we are left with the task of computing

E_{\mathrm{pre}} \sum_t \big[y(t)\,(x_i\Delta + \eta w_{ij}\Delta\, y(t) - f(y(t)))\big] = \sum_t \big[x_i\Delta\, E_{\mathrm{pre}}\, y(t) + \eta w_{ij}\Delta\, E_{\mathrm{pre}}\, y^2(t) - E_{\mathrm{pre}}\big(y(t) f(y(t))\big)\big], \qquad (A.6)

which involves expectations of functions of the form E_{pre} G_i(y(t)), with G_1(y) = y, G_2(y) = y², and G_3(y) = y f(y). To do so, we present the computation of E_{pre} G(y(t)) for any function G. We first prove a simple lemma:
Lemma 1. Let y be a random variable with characteristic function Ψ(ω) = E e^{−jωy}. Then

E\,G(y) = \int_{-\infty}^{\infty} ds\, G(s)\, F^{-1}(\Psi(\omega))(s),

where F^{-1}(\Psi(\omega)) is the inverse Fourier transform of Ψ(ω), given by

F^{-1}(\Psi(\omega))(s) = \frac{1}{2\pi} \int_{-\infty}^{\infty} \Psi(\omega)\, e^{j\omega s}\, d\omega.
Proof. Using the equality δ(t) = \frac{1}{2\pi}\int_{-\infty}^{\infty} d\omega\, e^{j\omega t}, we have

E\,G(y) = E \int_{-\infty}^{\infty} ds\, G(s)\, \delta(s - y)
= E\, \frac{1}{2\pi} \int_{-\infty}^{\infty} ds\, G(s) \int_{-\infty}^{\infty} d\omega\, e^{j\omega(s - y)}
= \int_{-\infty}^{\infty} ds\, G(s)\, \frac{1}{2\pi} \int_{-\infty}^{\infty} d\omega\, e^{j\omega s}\, E\, e^{-j\omega y}
= \int_{-\infty}^{\infty} ds\, G(s)\, F^{-1}\big(E\, e^{-j\omega y}\big)(s). \qquad \blacksquare
Note that lemma 1 follows easily from the equality E\,G(y) = \int G(s)\, p(s)\, ds, where p is the pdf of the random variable y, and recalling that the pdf is the inverse Fourier transform of the characteristic function. Using lemma 1, we have

E_{\mathrm{pre}}\, G(y(t)) = \int_{-\infty}^{\infty} G(s)\, F^{-1}\big(E_{\mathrm{pre}}\, e^{-j\omega y(t)}\big)(s)\, ds. \qquad (A.7)
Defining the Fourier pair

\Psi_t(\omega) = E_{\mathrm{pre}}\, e^{-j\omega y(t)}, \qquad \phi_t(s) = F^{-1}(\Psi_t(\omega))(s),

we have

\Psi_t(\omega) = E_{\mathrm{pre}} \exp\Big(-j\omega \sum_{t'=t-h}^{t} I(\text{spike at } t')\, e^{-(t-t')/\tau_m}\Big)
\overset{(a)}{=} \prod_{t'=t-h}^{t} E \exp\Big(-j\omega\, I(\text{spike at } t')\, e^{-(t-t')/\tau_m}\Big)
= \prod_{t'=t-h}^{t} \Big( x_j\Delta\, e^{-j\omega \exp(-(t-t')/\tau_m)} + 1 - x_j\Delta \Big),

where (a) used independence. Inverse Fourier transforming Ψ_t(ω), we have

\phi_t(s) = F^{-1}\Bigg( \prod_{t'=t-h}^{t} \Big[ x_j\Delta\, e^{-j\omega \exp(-(t-t')/\tau_m)} + 1 - x_j\Delta \Big] \Bigg)(s).

Based on the properties of the Fourier transform, the product of transforms in the frequency domain becomes a convolution in the s domain. Setting K = h/Δ and defining

\Psi_t^k(\omega) \equiv x_j\Delta\, e^{-j\omega \exp(-(t-t_k)/\tau_m)} + 1 - x_j\Delta \qquad (k = 0, 1, \ldots, K),

where t_k = t − h + kΔ, and

\phi_t^k(s) \equiv F^{-1}\big(\Psi_t^k(\omega)\big)(s),

we have that

\Psi_t(\omega) = \prod_{k=0}^{K} \Psi_t^k(\omega),

and, denoting a convolution by ∗, we have that

\phi_t(s) = \big(\phi_t^0 \ast \phi_t^1 \ast \cdots \ast \phi_t^K\big)(s). \qquad (A.8)

Now,

\phi_t^k(s) = F^{-1}\Big( x_j\Delta\, e^{-j\omega \exp(-(t-t_k)/\tau_m)} + 1 - x_j\Delta \Big)(s)
= x_j\Delta\, \delta\big(s - e^{-(t-t_k)/\tau_m}\big) + (1 - x_j\Delta)\, \delta(s) \qquad (t_k = t - h + k\Delta).
In computing equation A.8, we face the following task: given m numbers {x_1, x_2, …, x_m} and a constant a, compute

A = \big(a\,\delta(x - x_1) + (1-a)\,\delta(x)\big) \ast \cdots \ast \big(a\,\delta(x - x_m) + (1-a)\,\delta(x)\big).

Using the properties of the δ-function, it is not hard to show that

A = a^m\, \delta\Big(x - \sum_{k=1}^{m} x_k\Big) + a^{m-1}(1-a) \sum_{i=1}^{m} \delta\Big(x - \sum_{k \ne i} x_k\Big) + \cdots + (1-a)^m\, \delta(x).

Thus, we find

\phi_t(s) = (x_j\Delta)^K\, \delta\Big(s - \sum_{k=1}^{K} e^{-(t-t_k)/\tau_m}\Big) + (x_j\Delta)^{K-1}(1 - x_j\Delta) \sum_{i=1}^{K} \delta\Big(s - \sum_{k \ne i} e^{-(t-t_k)/\tau_m}\Big) + \cdots + (1 - x_j\Delta)^K\, \delta(s), \qquad (A.9)

where K = h/Δ. Recalling that φ_t(s) = F^{-1}(Ψ_t(ω)) and substituting equation A.9 into A.7 yields

E_{\mathrm{pre}}\, G(y(t)) = (x_j\Delta)^K\, G\Big(\sum_{k=1}^{K} e^{-(t-t_k)/\tau_m}\Big) + (x_j\Delta)^{K-1}(1 - x_j\Delta) \sum_{i=1}^{K} G\Big(\sum_{k \ne i} e^{-(t-t_k)/\tau_m}\Big) + \cdots + (1 - x_j\Delta)^K\, G(0). \qquad (A.10)
Thus, the expectation of a function of y(t) equals a weighted sum of samples of this function at particular points in time. At this stage, we assume a single presynaptic neuron. The extension to several presynaptic neurons is straightforward.
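Because equation A.10 expresses the expectation as a weighted sum of samples of G, its structure can be checked directly for a small window by enumerating all spike patterns; the following sketch compares that exact sum against a Monte Carlo estimate (all values illustrative).

import itertools
import numpy as np

# Check of the structure of equation A.10: E G(y(t)) is a weighted sum
# of G evaluated at sums of exponentials over subsets of spike bins.
rng = np.random.default_rng(1)
dt, tau_m, K = 5e-4, 0.030, 8
p = 40.0 * dt                                       # x_j * Delta per bin
e = np.exp(-dt * np.arange(K - 1, -1, -1) / tau_m)  # e^{-(t - t_k)/tau_m}
G = lambda y: y ** 2                                # e.g., G_2 of eq. A.6

exact = sum(p ** sum(b) * (1 - p) ** (K - sum(b)) * G(np.dot(b, e))
            for b in itertools.product((0, 1), repeat=K))
spikes = (rng.random((200_000, K)) < p).astype(float)
print(exact, G(spikes @ e).mean())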
Combining equations A.5, A.6, and A.10 yields the final result. While the computation can be performed exactly, it leads to the following very cumbersome expressions:

E_{\mathrm{pre}} \sum_t \big[y(t)\,(x_i\Delta + \eta w_{ij}\Delta\, y(t) - f(y(t)))\big]
= \sum_t \big[x_i\Delta\, E_{\mathrm{pre}}\, y(t) + \eta w_{ij}\Delta\, E_{\mathrm{pre}}\, y^2(t) - E_{\mathrm{pre}}\big(y(t) f(y(t))\big)\big]
= \sum_t \Bigg\{ x_i\Delta \Bigg[ (x_j\Delta)^{h/\Delta} \sum_{i=1}^{h/\Delta} e^{-(t-t_i)/\tau_m} + (x_j\Delta)^{h/\Delta - 1}(1 - x_j\Delta) \sum_{i=1}^{h/\Delta} \sum_{k \ne i} e^{-(t-t_k)/\tau_m} + \cdots \Bigg]
+ \eta w_{ij}\Delta \Bigg[ (x_j\Delta)^{h/\Delta} \Bigg( \sum_{i=1}^{h/\Delta} e^{-(t-t_i)/\tau_m} \Bigg)^2 + (x_j\Delta)^{h/\Delta - 1}(1 - x_j\Delta) \sum_{i=1}^{h/\Delta} \Bigg( \sum_{k \ne i} e^{-(t-t_k)/\tau_m} \Bigg)^2 + \cdots \Bigg]
- \Bigg[ (x_j\Delta)^{h/\Delta} \Bigg( \sum_{i=1}^{h/\Delta} e^{-(t-t_i)/\tau_m} \Bigg) f\Bigg( \sum_{i=1}^{h/\Delta} e^{-(t-t_i)/\tau_m} \Bigg) + (x_j\Delta)^{h/\Delta - 1}(1 - x_j\Delta) \sum_{i=1}^{h/\Delta} \Bigg( \sum_{k \ne i} e^{-(t-t_k)/\tau_m} \Bigg) f\Bigg( \sum_{k \ne i} e^{-(t-t_k)/\tau_m} \Bigg) + \cdots \Bigg] \Bigg\}.
Note that here, t_1 = t − h, t_K = t, and G_i(0) = 0, i = 1, 2, 3. The expressions become easier to handle under the approximation of low firing rates, where (x_jΔ)² ≪ 1, which allows us to neglect higher-order terms. Specifically, neglecting terms of order (x_jΔ)^k, k ≥ 2, we obtain
E_{\mathrm{pre}} \sum_t \big[y(t)\,(x_i\Delta + \eta w_{ij}\Delta\, y(t) - f(y(t)))\big]
= \sum_t \big[x_i\Delta\, E_{\mathrm{pre}}\, y(t) + \eta w_{ij}\Delta\, E_{\mathrm{pre}}\, y^2(t) - E_{\mathrm{pre}}\big(y(t) f(y(t))\big)\big]
= \sum_t (x_j\Delta)(1 - x_j\Delta)^{h/\Delta - 1} \Bigg[ x_i\Delta \sum_{i=1}^{h/\Delta} e^{-(t-t_i)/\tau_m} + \eta w_{ij}\Delta \sum_{i=1}^{h/\Delta} \big(e^{-(t-t_i)/\tau_m}\big)^2 - \sum_{i=1}^{h/\Delta} e^{-(t-t_i)/\tau_m}\, f\big(e^{-(t-t_i)/\tau_m}\big) \Bigg], \qquad (A.11)
which can be rewritten as

E_{\mathrm{post,pre}} \sum_{t=0}^{T-\Delta} z_{ij}(t+\Delta) \approx (x_j\Delta) \sum_{t=0}^{T-\Delta} (1 - x_j\Delta)^{h/\Delta - 1} \Bigg[ x_i\Delta \sum_{k=1}^{h/\Delta} e^{-(t-t_k)/\tau_m} + \eta w_{ij}\Delta \sum_{k=1}^{h/\Delta} \big(e^{-(t-t_k)/\tau_m}\big)^2 - \sum_{k=1}^{h/\Delta} e^{-(t-t_k)/\tau_m}\, f\big(e^{-(t-t_k)/\tau_m}\big) \Bigg], \qquad (A.12)
where t_k = t − h + kΔ.

Appendix B: Simulation Details

The values used for the main simulation parameters are shown in Table 1. The reward values for the XOR problem are summarized in Table 2. Table 3 contains some technical details related to the simulation, and Table 4 describes the weight constraints and initializations used. For the path learning problem, we first define the following variables:
• n_th: the threshold value for determining “high” and “low” output.

• S_out1, S_out2: the number of spikes that occurred in the first and second outputs, respectively.

• F_1high, F_2low: Boolean flags denoting whether output 1 is high and output 2 is low, respectively.
The reward is determined based on the formula

R = 3\big(F_{2\mathrm{low}}(n_{\mathrm{th}} - S_{\mathrm{out2}}) - 2(1 - F_{2\mathrm{low}})S_{\mathrm{out2}} + F_{1\mathrm{high}}S_{\mathrm{out1}} - 2(1 - F_{1\mathrm{high}})(n_{\mathrm{th}} - S_{\mathrm{out1}}) - 50\big).

We also comment on the reward signals used to train the networks. In the case of the XOR learning network, the reward values are constants based on the network's performance. The specific values were chosen by trial and error.
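A direct transcription of this reward into code reads as follows; the comparison directions for the two Boolean flags are our reading of the definitions above, and the spike counts in the usage lines are invented.

def path_reward(s_out1, s_out2, n_th):
    # Boolean flags of appendix B (assumed threshold comparisons)
    f1_high = s_out1 >= n_th
    f2_low = s_out2 < n_th
    return 3 * (f2_low * (n_th - s_out2) - 2 * (1 - f2_low) * s_out2
                + f1_high * s_out1 - 2 * (1 - f1_high) * (n_th - s_out1)
                - 50)

print(path_reward(s_out1=40, s_out2=5, n_th=25))   # rewarded episode
print(path_reward(s_out1=5, s_out2=40, n_th=25))   # penalized episode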
Table 1: Simulation Parameter Values.

Parameter    Value         Comments
Δ            5 · 10⁻⁴      Euler step size
λ            120           Squashing function parameter
β            0             Bias/variance trade-off variable
γ            0.001         Learning rate for the policy gradient method
T            250 msec      Episode length
v_r = v_L    −60 mV        Neuron resting potential
R_m          10⁶ Ω         Membrane resistance
C_m          3 · 10⁻⁸ F    Membrane capacitance
τ_m          30 msec       τ_m = R_m · C_m
τ_s          3 msec        Time constant of α function
q            1.8 · 10⁻⁹ C
Table 2: XOR Reward Values.

Description           Value
Correct output        R+ = 96
Incorrect output      R− = −66
Undetermined output   Ru = −69
Table 3: Technical Details and Convergence Rates.

Problem                                          Episode Length (msec)   Number of Episodes   Successful Convergence Percentage
XOR                                              250                     800                  99%
Path learning, small network                     500                     1500                 94%
Path learning, small network, switching tasks    500                     2000 per task        99%
Path learning, large network                     500                     5000                 82%
The fact that the reward scheme affects the quality of the solution is a well-known phenomenon in direct policy search methods. As far as we are aware, there do not appear at present to be systematic approaches to selecting optimal reward schemes. In the case of the path learning task, we tried a more realistic and informative reward signal in both networks. The general approach that led us to the reward function form has already been explained in section 4, and the exact constants and coefficients were also determined by trial and error.
Table 4: Weight Constraints and Initialization Used.

Problem                                          Weight Constraints   Weight Initialization
XOR                                              Range [−1, 1]        Uniform over [−0.1, 0] or [0, 0.1], sign predetermined (see section 4 for details)
Path learning, small network                     Range [0, 0.5]       Uniform over [0, 0.1]
Path learning, small network, switching tasks    Range [0, 0.5]       Uniform over [0, 0.1]
Path learning, large network                     Range [0, 0.5]       Uniform over [0, 0.1]
Appendix C: Technical Derivations

This appendix provides some explicit technical derivations that are not essential to understanding the main results.

C.1 Decaying Exponential α Function. We present the calculation leading to the expression for the case of a decaying exponential α function presented in section 2.1. The α function is α(s) = (q/τ_s) exp(−s/τ_s) I{s ≥ 0}. Substituting this choice of α(s) on the right-hand side of equation 2.4, we obtain

\int_{\hat t_i^{(0)}}^{t} ds\, \exp\Big(-\frac{t-s}{\tau_m}\Big) \sum_f \alpha\big(s - t_j^{(f)}\big)
= \frac{q}{\tau_s} \sum_f \int_{\hat t_i^{(0)}}^{t} \exp\Big(-\frac{t-s}{\tau_m}\Big) \exp\Big(-\frac{s - t_j^{(f)}}{\tau_s}\Big) I\big\{s - t_j^{(f)} \ge 0\big\}\, ds
= \frac{q}{\tau_s} \sum_f \exp\Big(\frac{t_j^{(f)}}{\tau_s} - \frac{t}{\tau_m}\Big) \int_{\max(t_j^{(f)},\, \hat t_i^{(0)})}^{t} \exp\Big(s\Big(\frac{1}{\tau_m} - \frac{1}{\tau_s}\Big)\Big)\, ds
= \frac{q}{\tau_s} \sum_f \exp\Big(\frac{t_j^{(f)}}{\tau_s} - \frac{t}{\tau_m}\Big) \frac{1}{\frac{1}{\tau_m} - \frac{1}{\tau_s}} \Big[ \exp\Big(t\Big(\frac{1}{\tau_m} - \frac{1}{\tau_s}\Big)\Big) - \exp\Big(\max\big(t_j^{(f)}, \hat t_i^{(0)}\big)\Big(\frac{1}{\tau_m} - \frac{1}{\tau_s}\Big)\Big) \Big]
= \frac{q\,\tau_m}{\tau_s(\tau_m - \tau_s)} \sum_f \exp\Big(-\frac{t - t_j^{(f)}}{\tau_s}\Big) \Big[ \exp\Big(\frac{\tau_m - \tau_s}{\tau_m \tau_s}\, \min\big(t - t_j^{(f)},\, t - \hat t_i^{(0)}\big)\Big) - 1 \Big];

hence,

z_{ij}(t + \Delta t) = \beta z_{ij}(t) + \frac{(\zeta_i(t) - \sigma_i(t))\,\lambda q\,\tau_m}{c\,(\tau_m - \tau_s)} \sum_f e^{-(t - t_j^{(f)})/\tau_s} \Big[ \exp\Big(\frac{\tau_m - \tau_s}{\tau_m \tau_s}\, \min\big(t - t_j^{(f)},\, t - \hat t_i^{(0)}\big)\Big) - 1 \Big].
C.2 Depressing Synapses. We present the exact details and derivation in the case of depressing synapses. We will not repeat details presented in section 3.2. We use the same POMDP model as in the previous case (see section 1), except that the state of each neuron also includes the current value of the depression variable.

C.2.1 Parameter Update. The same update rule for the parameters is used. Let us now calculate an explicit expression for the weight update for this neural model.
As presented in section 3.2, the update rule is given by

w_{ij}(t) = w_{ij}(t - \Delta t) + \gamma\, r(t)\, z_{ij}(t), \qquad (C.1)

z_{ij}(t + \Delta t) = \beta z_{ij}(t) + \frac{\lambda}{c}\,(\zeta_i(t) - \sigma) \int_{\hat t_i^{(0)}}^{t} \exp\Big(-\frac{t-s}{\tau_m}\Big) \sum_f \alpha\big(s - t_j^{(f)}\big)\, D_j(s)\, ds. \qquad (C.2)

We distinguish between two cases: no presynaptic spike occurred since the last postsynaptic spike, or at least one presynaptic spike occurred since then.

Case 1: No Presynaptic Spike Occurred Since the Last Postsynaptic Spike. Denote t_j^{(f),last} as the time of the last presynaptic spike and D_{f,last} as its initial condition. In the first case, the algorithm has the form

w_{ij}(t) = w_{ij}(t - \Delta t) + \gamma\, r(t)\, z_{ij}(t),

z_{ij}(t + \Delta t) = \beta z_{ij}(t) + \frac{\lambda}{c}\,(\zeta_i(t) - \sigma) \int_{\hat t_i^{(0)}}^{t} e^{-(t-s)/\tau_m} \Big[ 1 - e^{-(s - t_j^{(f),last})/\tau_D}\,\big(1 - D_{f,last}\big) \Big] \sum_f \alpha\big(s - t_j^{(f)}\big)\, ds. \qquad (C.3)
Case 2: At Least One Presynaptic Spike Occurred Since the Last Postsynaptic Spike. In this case, assume that k spikes occurred since \hat t_i^{(0)}. Denote t_j^{(0)} as the time of the last presynaptic spike before \hat t_i^{(0)}, and denote the appropriate initial condition of D_j(t) for this spike as D_{s0}. Let t_j^{(s1)}, …, t_j^{(sk)} denote the presynaptic spike times that occurred in the time interval [\hat t_i^{(0)}, t], and denote the appropriate initial conditions of D_j(t) at those times as D_{s1}, …, D_{sk}, respectively. Then the integral in equation C.2 is given by

\int_{\hat t_i^{(0)}}^{t} \exp\Big(-\frac{t-s}{\tau_m}\Big) \sum_f \alpha\big(s - t_j^{(f)}\big)\, D_j(s)\, ds
= \int_{\hat t_i^{(0)}}^{t_j^{(s1)}} e^{-(t-s)/\tau_m} \sum_f \alpha\big(s - t_j^{(f)}\big)\, D_j(s)\, ds + \int_{t_j^{(s1)}}^{t_j^{(s2)}} e^{-(t-s)/\tau_m} \sum_f \alpha\big(s - t_j^{(f)}\big)\, D_j(s)\, ds + \cdots + \int_{t_j^{(sk)}}^{t} e^{-(t-s)/\tau_m} \sum_f \alpha\big(s - t_j^{(f)}\big)\, D_j(s)\, ds \qquad (C.4)
= \int_{\hat t_i^{(0)}}^{t_j^{(s1)}} e^{-(t-s)/\tau_m} \Big[ 1 - e^{-(s - t_j^{(f),last})/\tau_D}\big(1 - D_{f,last}\big) \Big] \sum_f \alpha\big(s - t_j^{(f)}\big)\, ds
+ \int_{t_j^{(s1)}}^{t_j^{(s2)}} e^{-(t-s)/\tau_m} \Big[ 1 - e^{-(s - t_j^{(s1)})/\tau_D}\big(1 - D_{s1}\big) \Big] \sum_f \alpha\big(s - t_j^{(f)}\big)\, ds + \cdots
+ \int_{t_j^{(sk)}}^{t} e^{-(t-s)/\tau_m} \Big[ 1 - e^{-(s - t_j^{(sk)})/\tau_D}\big(1 - D_{sk}\big) \Big] \sum_f \alpha\big(s - t_j^{(f)}\big)\, ds,
and the algorithm can be written using the expressions above.

C.2.2 Explicit Expression for α(t) = qδ(t). In this case,

\int_{\hat t_i^{(0)}}^{t} \exp\Big(-\frac{t-s}{\tau_m}\Big) \sum_f \alpha\big(s - t_j^{(f)}\big)\, D_j(s)\, ds
= \int_{\hat t_i^{(0)}}^{t} \exp\Big(-\frac{t-s}{\tau_m}\Big) \sum_f q\,\delta\big(s - t_j^{(f)}\big)\, D_j(s)\, ds
= q \sum_f \int_{\hat t_i^{(0)}}^{t} \exp\Big(-\frac{t-s}{\tau_m}\Big)\, \delta\big(s - t_j^{(f)}\big)\, D_j(s)\, ds
= q \sum_{f:\, t_j^{(f)} > \hat t_i^{(0)}} \exp\Big(-\frac{t - t_j^{(f)}}{\tau_m}\Big)\, D_j\big(t_j^{(f)}\big);

hence,

z_{ij}(t + \Delta t) = \beta z_{ij}(t) + \frac{\lambda q}{c}\,(\zeta_i(t) - \sigma) \sum_{f:\, t_j^{(f)} > \hat t_i^{(0)}} \exp\Big(-\frac{t - t_j^{(f)}}{\tau_m}\Big)\, D_j\big(t_j^{(f)}\big).
Two remarks:

• The contribution of the depression variable is considered only at presynaptic spiking times (for this α function); see the sketch below.

• An explicit expression for D_j(t_j^{(f)}) is available. The expression depends on t_j^{(f−1)}.
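A minimal sketch of this update follows; τ_m, β, λ, and q are taken from Table 1, while τ_D, the scale c, the multiplicative use of the depression jump, and all spike times are assumptions for illustration.

import numpy as np

tau_m, tau_D = 0.030, 0.300
beta, lam, q, c = 0.0, 120.0, 1.8e-9, 1.0

def depression(t, t_last, d_last):
    # D_j recovering toward 1 since the last presynaptic spike
    return 1.0 - np.exp(-(t - t_last) / tau_D) * (1.0 - d_last)

def trace_step(z, zeta, sigma_i, pre_times, d_at_spikes, t):
    # z_ij(t + dt) = beta * z_ij(t) + (lam*q/c) * (zeta - sigma_i)
    #                * sum_f exp(-(t - t_f)/tau_m) * D_j(t_f)
    s = sum(np.exp(-(t - tf) / tau_m) * d
            for tf, d in zip(pre_times, d_at_spikes))
    return beta * z + (lam * q / c) * (zeta - sigma_i) * s

d1 = 1.0
d2 = 0.6 * depression(0.020, 0.010, d1)   # assumed per-spike scaling
print(trace_step(0.0, 1, 0.4, [0.010, 0.020], [d1, d2], t=0.025))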
C.3 MDPs, POMDPs

C.3.1 MDPs. An MDP (Markov decision process) is a quartet {S, A, P, r} where

• S is the state space.
• A is the action space.
• P(a) = p_{ij}(a) = P(x_{t+1} = j | x_t = i, a_t = a) is the transition probability when action a is taken.
• P_r(r) = P(r | x, a) is the reward distribution when choosing action a in state x.

Additionally, we set

• π : S → A (policy), a mapping from state space to action space.
• A(x), the set of available actions from A in state x.
C.3.2 POMDPs. A POMDP (partially observed Markov decision process) is a sextet {S, A, O, P_s, P_r, P_o} where

• S is the state space.
• A is the action space.
• O is the observation space.
• P_s = p^s_{ij}(a) = P(x_{t+1} = j | x_t = i, a_t = a) is the transition probability when action a is taken.
• P_r(r) = P(r | x, a) is the reward distribution.
• P_o(y) = P(y | x) is the observation process distribution.

Additionally, set

• π : S → A (policy), a mapping from state space to action space.
• A(x), the set of available actions from A in state x.

Note that the states (hidden variables) form a standard MDP.
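The definitions above translate directly into a small container; this is only a sketch of the notation, with field names of our choosing.

from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class POMDP:
    states: Sequence        # S, the (hidden) state space
    actions: Sequence       # A, the action space
    observations: Sequence  # O, the observation space
    p_s: Callable           # P(x' | x, a), transition probabilities
    p_r: Callable           # P(r | x, a), reward distribution
    p_o: Callable           # P(y | x), observation distribution

# An MDP is the special case in which the state is observed directly,
# i.e., O = S and p_o(y | x) = 1 exactly when y = x.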
Acknowledgments

This work was partially supported by EU Project PASCAL, by the Technion VPR fund for promotion of research, and by the Ollendorf Foundation. We are grateful to the anonymous reviewers for helpful remarks, which greatly improved the clarity of the presentation.
References

Baras, D. (2006). Direct policy search in reinforcement learning and synaptic plasticity in biological neural networks. Unpublished master's thesis, Technion. Available online at http://www.ee.technion.ac.il/rmeir/BarasThesis06.pdf.
Bartlett, P. L., & Baxter, J. (1999). Hebbian synaptic modifications in spiking neurons that learn (Technical rep.). Canberra: Research School of Information Sciences and Engineering, Australian National University.
Baxter, J., & Bartlett, P. L. (2001). Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research, 15, 319–350.
Bertsekas, D. P., & Tsitsiklis, J. N. (1996). Neuro-dynamic programming. Belmont, MA: Athena Scientific.
Bienenstock, E. L., Cooper, L. N., & Munro, P. W. (1982). Theory for the development of neuron selectivity: Orientation specificity and binocular interaction in visual cortex. Journal of Neuroscience, 2(1), 32–48.
Cooper, L., Intrator, N., Blais, B. B., & Shouval, H. Z. (2004). Theory of cortical plasticity. Singapore: World Scientific.
Gerstner, W., & Kistler, W. (2002). Spiking neuron models. Cambridge: Cambridge University Press.
Haykin, S. (1999). Neural networks: A comprehensive foundation (2nd ed.). Upper Saddle River, NJ: Prentice Hall.
Izhikevich, E. M., & Desai, N. S. (2003). Relating STDP to BCM. Neural Computation, 15(7), 1511–1523.
Koch, C. (1999). Biophysics of computation. New York: Oxford University Press.
Konda, V. R., & Tsitsiklis, J. N. (2003). On actor-critic algorithms. SIAM Journal on Control and Optimization, 42(4), 1143–1166.
Rao, R., & Sejnowski, T. (2001). Spike-timing-dependent Hebbian plasticity as temporal difference learning. Neural Computation, 13(10), 2221–2237.
Richardson, M. J. E., Melamed, O., Silberberg, G., Gerstner, W., & Markram, H. (2005). Short-term synaptic plasticity orchestrates the response of pyramidal cells and interneurons to population bursts. Journal of Computational Neuroscience, 18(3), 323–331.
Schultz, W. (2002). Getting formal with dopamine and reward. Neuron, 36(2), 241–263.
Seung, H. S. (2003). Learning in spiking neural networks by reinforcement of stochastic synaptic transmission. Neuron, 40, 1063–1073.
Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. Cambridge, MA: MIT Press.
Toyoizumi, T., Pfister, J., Aihara, K., & Gerstner, W. (2005). Generalized Bienenstock-Cooper-Munro rule for spiking neurons that maximizes information transmission. Proceedings of the National Academy of Sciences USA, 102(14), 5239–5244.
Wörgötter, F., & Porr, B. (2004). Temporal sequence learning, prediction, and control: A review. Neural Computation, 17, 1–75.
Xie, X., & Seung, H. S. (2004). Learning in neural networks by reinforcement of irregular spiking. Physical Review E, 69, 041909.
Received January 26, 2006; accepted September 1, 2006.
ARTICLE
Communicated by Tobias Delbrück
A Multichip Neuromorphic System for Spike-Based Visual Information Processing R. Jacob Vogelstein [email protected] Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21205, U.S.A.
Udayan Mallik [email protected] Department of Electrical and Computer Engineering, Johns Hopkins University, Baltimore, MD 21218, U.S.A.
Eugenio Culurciello [email protected] Department of Electrical Engineering, Yale University, New Haven, CT 06511, U.S.A.
Gert Cauwenberghs [email protected] Division of Biological Sciences, University of California, San Diego, La Jolla, CA 92093, U.S.A.
Ralph Etienne-Cummings [email protected] Department of Electrical and Computer Engineering, Johns Hopkins University, Baltimore, MD 21218, U.S.A.
We present a multichip, mixed-signal VLSI system for spike-based vision processing. The system consists of an 80 × 60 pixel neuromorphic retina and a 4800 neuron silicon cortex with 4,194,304 synapses. Its functionality is illustrated with experimental data on multiple components of an attention-based hierarchical model of cortical object recognition, including feature coding, salience detection, and foveation. This model exploits arbitrary and reconfigurable connectivity between cells in the multichip architecture, achieved by asynchronously routing neural spike events within and between chips according to a memory-based look-up table. Synaptic parameters, including conductance and reversal potential, are also stored in memory and are used to dynamically configure synapse circuits within the silicon neurons.
Neural Computation 19, 2281–2300 (2007)
© 2007 Massachusetts Institute of Technology
1 Introduction

The brain must process sensory information in real time in order to analyze its surroundings and prescribe appropriate actions. In contrast, most simulations of neural functions to date have been executed in software programs that run much more slowly than real time. This places fundamental limits on the kinds of studies that can be done, because most software neural networks are unable to interact with their environment. However, software models have the advantages of being flexible, reconfigurable, and completely observable, and much has been learned about the brain through the use of software (see Dayan & Abbott, 2001).

Neuromorphic hardware aims to emulate the functionality of the brain using silicon analogs of biological neural elements (Mead, 1989). Typically, unlike most software, hardware models can operate in real time (or even faster than their biological counterparts), providing the opportunity to create artificial nervous systems that can interact with their environment (Horiuchi & Koch, 1999; Indiveri, 1999; Simoni, Cymbalyuk, Sorensen, Calabrese, & DeWeerth, 2001; Indiveri, Murer, & Kramer, 2001; Jung, Brauer, & Abbas, 2001; Cheely & Horiuchi, 2003; Lewis, Etienne-Cummings, Hartmann, Cohen, & Xu, 2003; Zaghloul & Boahen, 2004; Reichel, Leichti, Presser, & Liu, 2005). Unfortunately, silicon designs take a few months to be fabricated, after which they are usually constrained by limited flexibility, so fixing a bug or changing the system's operation may require more time than that required for an equivalent software model (although a mature hardware design can be reused in many different systems; see, e.g., Zaghloul & Boahen, 2004). Additionally, the models are not usually as detailed as software models, due to the limited computational primitives available from silicon transistors and the deliberate use of reductionist models to simplify the hardware infrastructure by reducing the dimensionality of parameter space.

Reconfigurable neuromorphic systems represent a compromise between fast, dedicated silicon hardware and slower but versatile software. They are useful for studying real-time operation of high-level (e.g., cortical), large-scale neural networks and prototyping neuromorphic systems prior to fabricating application-specific chips. Instead of hardwiring connections between neurons, most reconfigurable neuromorphic systems use the address-event representation (AER) communication protocol (Sivilotti, 1991; Lazzaro, Wawrzynek, Mahowald, Sivilotti, & Gillespie, 1993; Mahowald, 1994). In an address-event system, connections between neurons are emulated by time-multiplexing neural events (also called action potentials, or spikes) onto a fast serial bus, and AER “synapses” are implemented with encoders and decoders that monitor the bus and route incoming and outgoing spikes to their appropriate neural targets. These systems can be reconfigured by changing the routing functions, and multiple authors have demonstrated versions of AER that use memory-based
projective field mappings toward this end (Deiss, Douglas, & Whatley, 1999; Higgins & Koch, 1999; Goldberg, Cauwenberghs, & Andreou, 2001; Häfliger, 2001; Liu, Kramer, Indiveri, Delbrück, & Douglas, 2002; Indiveri, Chicca, & Douglas, 2004; Ros, Ortigosa, Agis, Carrillo, & Arnold, 2006; Vogelstein, Mallik, Vogelstein, & Cauwenberghs, 2007).

We have developed a reconfigurable multichip AER-based system for emulating cortical spike processing of visual information. The system uses one AER subnet to communicate spikes between an 80 × 60 pixel silicon retina (Culurciello, Etienne-Cummings, & Boahen, 2003; Culurciello & Etienne-Cummings, 2004) and a 4800-neuron silicon cortex (Vogelstein, Mallik, Cauwenberghs, Culurciello, & Etienne-Cummings, 2005), and a second AER subnet to communicate spikes between cortical cells (see Figure 1). Each cell in the silicon retina converts light intensity into spike frequency (see section 2.2). Each cell in the silicon cortex implements an integrate-and-fire neuron with conductance-like synapses (see section 2.1). Neural connectivity patterns and synaptic parameters are stored in digital memory, allowing “virtual synapses” to be implemented by routing spikes to one or more locations on the silicon cortex.

A number of multichip reconfigurable neuromorphic systems have been described in the literature (Goldberg et al., 2001; Horiuchi & Hynna, 2001; Taba & Boahen, 2003; Arthur & Boahen, 2004; Indiveri et al., 2004; Paz et al., 2005; Riis & Häfliger, 2005; Serrano-Gotarredona et al., 2006; Zou, Bornat, Tomas, Renaud, & Destexhe, 2006), but ours differs in some important ways. First, the 4800-neuron silicon cortex is the largest general-purpose neuromorphic array presented to date. Second, unlike the other systems, our silicon cortex has no local or hardwired connectivity, and each neuron implements a synapse with programmable weight and equilibrium potential, so all 4800 neurons can be utilized for any arbitrary connection topology, limited only by the capacity of the digital memory (and by bandwidth if real-time operation is desired). Third, the hardware infrastructure supports up to 1 million address events per second and allows real-time operation of large networks. Finally, the silicon cortex can function as a standalone AER transceiver, enabling the creation of even larger networks by connecting multiple cortices together.

To explicate all of these features, we conducted a series of four experiments within the common framework of a hierarchical model of visual information processing (see Figure 2). Section 3.1 demonstrates the speed and scale of the hardware by operating all 4800 cortical cells in real time. Sections 3.2 and 3.4 highlight the versatility of the hardware by reconfiguring the cortex into both feedforward and feedback networks. Section 3.3 uses the neurons' dynamically programmable synapses to multiplex a wide range of synaptic connections onto individual cells. Finally, section 4 details how the complete hierarchical model could be implemented in real time by partitioning the network into functional units, each organized around one silicon cortex.
2 Hardware

Every neuron on the silicon retina and cortex is assigned a unique address at design time, which is transmitted as an address event (AE) over an AER bus when that neuron fires an action potential. All of the address-event transactions in the multichip system illustrated in Figure 1 are processed by a field-programmable gate array (FPGA) located within the integrate-and-fire array transceiver (IFAT) component (Vogelstein, Mallik, Vogelstein, & Cauwenberghs, 2007). In addition to the FPGA, the IFAT contains 128 Mb of nonvolatile digital memory in a 4 M × 32-bit array (RAM), the 4800-cell silicon cortex, and an 8-bit digital-to-analog converter (DAC) required to operate the silicon cortex (see section 2.1).

The path of an AE through the system is illustrated in Figure 1. In this example, an outgoing presynaptic address from the silicon retina is placed on the external AER bus and captured by the FPGA, which uses the neuron's address as an index into the RAM. Each of the 4,194,304 lines of RAM stores information on a single synaptic connection, including its equilibrium potential, its synaptic weight, and the destination (postsynaptic) address (see Figure 3 and Deiss et al., 1999). This information is then used by the FPGA to activate a particular cell in the silicon cortex. Divergent connectivity is achieved by allowing the FPGA to access sequential lines in memory until it retrieves a stop code, as well as by implementing reserved address words that are used to activate multiple cells on one or more chips simultaneously.

Each application in section 3 requires a different number of synapses. Full-field spatial feature extraction (see section 3.1) and salience detection (see section 3.2) can be implemented with approximately 19,200 synapses each. Spatial acuity modulation (see section 3.3) with a 16 × 16 fovea surrounded by three concentric rings of geometrically decreasing resolution uses 60,736 synapses. And computing the maximum of N neurons (see section 3.4) relies on N + N² synapses (903 in the example shown here, or 90,300 to compute the maximum of all local salience estimates for an 80 × 60-pixel visual field).
Figure 1: (a) Block diagram of the multichip system with silicon retina (OR) and cortex (I&F). The silicon cortex is located within the IFAT subsystem, which also contains a field programmable gate array (FPGA), digital memory (RAM) for storing the synaptic connection matrix, and a digital-to-analog converter (DAC) required to operate the I&F chips. The FPGA controls two AER buses: one internal bus for events sent to and from the silicon cortex and one external bus for events sent to and from external neuromorphic devices or a computer (CPU). Circled numbers 1–6 highlight the path of incoming events from the OR (see section 2.2). (b) Photograph of the system.
Figure 2: Hierarchical model of visual information processing based on work by Riesenhuber and Poggio (1999). Spatial features are extracted from retinal images using a small set of oriented spatial filters, whose outputs are combined to form estimates of local salience. The region with maximum salience is selected by a winner-take-all network (WTA) and used to foveate the image by spatial acuity modulation (SAM). A large set of simple cells with many different preferred orientations is then used to process this bandwidth-limited signal. The simple cells’ outputs are combined with a MAX function to form spatially invariant complex cells, and the resulting data are combined in various ways to form feature cells, composite cells, and, finally, “view-tuned cells” that selectively respond to a particular view of an object.
2.1 Silicon Cortex. The silicon cortex used in this system is composed of 4800 random-access integrate-and-fire (I&F) neurons implemented on two custom aVLSI chips, each of which contains 2400 cells (Vogelstein, Mallik, & Cauwenberghs, 2004; Vogelstein, Mallik, Vogelstein, & Cauwenberghs, 2007). All 4800 neurons are identical; every one implements a conductance-like model of a general-purpose synapse using a switched-capacitor architecture. The synapses have two internal parameters, the synaptic equilibrium potential and the synaptic weight, that can be set to different values for each incoming event. Additionally, the range of synaptic weights can be extended by two dynamically controlled external parameters: the probability of sending an event and the number of postsynaptic events sent for every presynaptic event (Koch, 1999; Vogelstein, Mallik, Vogelstein, & Cauwenberghs, 2007). By storing values for these parameters along with
Figure 3: Example of IFAT RAM contents. Each line stores parameters for one synaptic connection. The presynaptic neuron’s address is used as a base index (a) into the lookup table, while the FPGA increments an offset counter (b) as it iterates through the list of postsynaptic targets (c). Synaptic weight is represented as a product of the three values stored in columns d–f, which represent the size of the postsynaptic response to an event, the number of postsynaptic events to generate for each presynaptic event, and the probability of generating an event, respectively (Koch, 1999; Vogelstein et al., 2007). The synaptic equilibrium potential is stored in column g and is used to control the DAC (see Figure 1). The reserved word shown at offset 0x01 is used to indicate the end of the synapse list for presynaptic neuron 0x0000, so the data at offsets 0x02–0xFF is undefined.
the pre- and postsynaptic addresses in RAM (see Figure 3), the FPGA on the IFAT can implement a different type of synapse for every virtual connection between neurons. The maximum rate of event transmission from the silicon cortex and its associated IFAT components is approximately 1,000,000 AE per second and is primarily limited by the speed of the internal arbitration circuits.
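The look-up-and-route cycle just described can be sketched in a few lines; the dictionary layout, field names, and values below are assumptions for illustration, not the chip's actual memory format.

import random

STOP = None   # stands in for the reserved stop code
ram = {       # one synapse list per presynaptic address
    0x0000: [
        {"post": 0x0009, "size": 7, "n_events": 2, "p_send": 0.8,
         "e_rev": 0xA0},
        STOP,
    ],
}

def route_event(pre_addr, send_to_cortex):
    # Walk sequential RAM lines until the stop code, emulating the FPGA.
    for line in ram.get(pre_addr, [STOP]):
        if line is STOP:
            break
        if random.random() < line["p_send"]:   # probabilistic synapse
            for _ in range(line["n_events"]):  # event multiplication
                send_to_cortex(line["post"], line["size"], line["e_rev"])

route_event(0x0000, lambda post, size, e_rev:
            print(f"event -> {post:#06x} size={size} E={e_rev:#04x}"))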
2.2 Silicon Retina. The silicon retina used in this system is called the octopus retina (OR) because its design is based on the phototransduction mechanism found in the retinae of octopi (Culurciello et al., 2003; Culurciello & Etienne-Cummings, 2004). Functionally, the OR is an asynchronous imager that translates light intensity levels into interspike interval times at each pixel. However, unlike a biological octopus retina, in which each photosensor's output travels along a dedicated axon to its target(s), all of the OR's outputs are collected on its AER bus and transmitted serially off-chip to the IFAT. Under uniform indoor lighting (0.1 mW/cm²), the OR produces an average of 200,000 address events per second (41.7 effective fps) while consuming 3.4 mW. However, most visual scenes do not have uniform lighting, so the typical range of event rates for this application is approximately 5,000 to 50,000 address events per second.
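The intensity-to-interval coding can be sketched as follows: brighter pixels integrate to threshold sooner, so spike frequency grows with intensity. The constants are illustrative, not the OR's measured parameters.

import numpy as np

def pixel_spike_times(intensity, T=1.0, k=100.0):
    # Interspike interval inversely proportional to light intensity;
    # rate = k * intensity (k is an assumed gain).
    isi = 1.0 / (k * max(intensity, 1e-9))
    return np.arange(isi, T, isi)

print(len(pixel_spike_times(0.5)), len(pixel_spike_times(0.05)))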
3 Results: Spike Domain Image Processing

As described in section 1, we chose to exploit different aspects of the reconfigurable multichip system in a series of experiments organized around the common framework of a hierarchical model of visual information processing (see Figure 2). This model was selected to showcase the system's versatility because each processing stage places different requirements on the fundamentally similar neurons within the silicon cortex, just as sensory processing in the human cortex requires fundamentally similar pyramidal cells in different locations to execute different functions (Kandel, Schwartz, & Jessell, 2000).

In the model (see Figure 2), retinal outputs are first processed through oriented spatial filters that highlight regions of high contrast (Mallik, Vogelstein, Culurciello, Etienne-Cummings, & Cauwenberghs, 2005). This information is then used by a salience detector network that focuses attention on a region of interest and decreases the resolution in surrounding areas to reduce the amount of data being used for computation and transmission (Vogelstein et al., 2005). Within the foveated center, data from the local spatial filters are combined with a nonlinear pooling function to form global spatial filters, which are subsequently combined to create feature cells, composite cells, and view-tuned cells (Riesenhuber & Poggio, 1999). Results from implementations of the first few stages of this network, computed entirely in the spike domain on our multichip system, are described below. Because this reconfigurable system is optimized not for any particular application but for flexibility, these data are primarily intended to illustrate the breadth of computations that can be performed and to confirm the general functionality of the proposed network architecture.

3.1 Spatial Feature Extraction. In the human visual cortex, the first stage of processing is spatial feature extraction, performed by simple cells (Kandel et al., 2000). Simple cells act as oriented spatial filters that detect local changes in contrast, and their receptive fields and preferred orientations are both functions of the input they receive from the retina. Spatial feature extraction is used twice in the hierarchical model of visual information processing in Figure 2: first coarsely over the entire visual field to estimate salience, and then more finely within a small region of interest.

Figure 4 illustrates how the silicon cortex can be used to perform spatial feature extraction by emulating eight different simple cell types (see Figures 4B1–I1) with overlapping receptive fields (Mallik et al., 2005). Note that because the OR output is proportional to light intensity, these simple cells respond to intensity gradients, not contrast gradients. In this example, each cortical cell integrates inputs from four pixels in the OR, two of which make excitatory synapses and two of which make inhibitory synapses. The excitatory and inhibitory synaptic weights are balanced so that there is no net response to uniform light.
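One such balanced kernel can be sketched as follows; the particular +/− arrangement is one example of the eight compositions in Figure 4.

import numpy as np

kernel = np.array([+1, +1, -1, -1])   # 4x1 receptive field: two excitatory
                                      # and two inhibitory retinal inputs

def simple_cell_drive(counts):
    # Net drive from the spike counts of the four retinal pixels;
    # balanced weights give ~0 for uniform illumination.
    return int(np.dot(kernel, counts))

print(simple_cell_drive([12, 11, 12, 12]))   # near-uniform input -> ~0
print(simple_cell_drive([25, 24, 3, 2]))     # intensity edge -> large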
Figure 4: (B1–I1) Orientation-selective kernel compositions in the simple cell network. Each simple cell has a 4 × 1 receptive field and receives two excitatory (+) and two inhibitory (−) inputs from the silicon retina. (A2) Original image captured by silicon retina. (B2–I2) Frames captured from real-time video sequences of retinal images processed by simple cell networks implemented on the silicon cortex. Each frame is composed from the output of 4800 simple cells that were all configured in the orientation shown above (e.g., B2 shows output from cells with receptive field drawn in B1) (Mallik et al., 2005).
Figures 4B2–I2 show a few sample frames from real-time video images generated by a simple cell network implemented on the silicon cortex (Mallik et al., 2005). Because both the silicon cortex and the silicon retina contain 4800 neurons, there is necessarily a trade-off between the spacing of similarly oriented simple cells throughout the visual field and the number of differently oriented simple cells with overlapping receptive fields. For the images in Figure 4, this trade-off was resolved in favor of increased resolution: each frame was captured from a different configuration of the system wherein all 4800 simple cells had identical preferred orientations. However, we have also generated similar results with lower resolution when the cortex is configured to simultaneously process two or four different orientations (data not shown). In addition to illustrating the principle of spatial feature extraction, these data demonstrate that the multichip system is capable of executing large-scale networks in real time.

3.2 Salience Detection. Salient regions of an image are areas of high information content. In the hierarchical model of visual information processing, estimates of salience are used to select a region of interest that will undergo further processing. There are many ways to compute salience; one simple technique uses the magnitude of spatial derivatives of light intensity within a given region as an approximate measure. A neural network architecture for computing this metric is illustrated in Figure 5. In this scheme, outputs from simple cells with overlapping receptive fields and different preferred orientations are linearly pooled by second-level cells to form estimates of local salience.
Figure 5: Pictorial representation of network for computing salience. Each local salience detector cell (large black circle) integrates inputs from a neighborhood of simple cells (small gray circles) with multiple different preferred orientations. A large output from a salience detector cell indicates a strong change in spatial image intensity, which frequently coincides with high information content.
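A minimal sketch of this pooling, using the 8 × 8 receptive fields shifted by 4 pixels described in the next paragraph, is given below; the input rates are simulated rather than recorded.

import numpy as np

def salience_map(simple_rates):
    # Each second-level cell sums simple-cell activity over an 8x8 patch;
    # patches overlap because receptive fields shift by 4 pixels.
    h, w = simple_rates.shape
    out = np.zeros(((h - 8) // 4 + 1, (w - 8) // 4 + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = simple_rates[4*r:4*r+8, 4*c:4*c+8].sum()
    return out

rates = np.random.default_rng(2).poisson(2.0, (60, 80)).astype(float)
print(salience_map(rates).shape)   # (14, 19) for the 80 x 60 field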
A winner-take-all (WTA) circuit with one input from each second-level cell could then be used to detect the region of overall greatest salience (see section 3.4).

Data from the silicon cortex configured to compute salience are illustrated in Figure 6 (Vogelstein et al., 2005). Figure 6a shows a raw image generated by the silicon retina (focused on a photograph) under normal indoor lighting. This image is then processed by the coarse oriented spatial filtering network described in section 3.1, with four sets of 1200 simple cells simultaneously processing horizontal and vertical intensity changes (cell types RF5–RF8 as designated by Figure 4; see Figure 6b for the simple cell output). To compute the local salience estimates (see Figure 6c), outputs from 64 simple cells of various orientations spanning an 8 × 8-pixel visual space are pooled by a single second-level cell. Smooth transitions between adjacent estimates are ensured by shifting each second-level cell's receptive field by four pixels in either the horizontal or vertical direction (see Figure 5).

Because the silicon cortex contains only 4800 neurons, the spatial filtering and salience detection cannot both be implemented for the entire visual field simultaneously. Therefore, to generate the images in Figure 6, each stage of network processing was executed serially. This was achieved by using a computer to log the output of the silicon cortex configured as spatial filters, changing the cortical network to pool simple cell outputs, and then playing back the sequence of events to the silicon cortex to compute the local salience estimates. This strategy highlights the versatility of the hardware; the same approach could also be used to perform a WTA or MAX operation on the local salience estimates (see section 3.4). Moreover, this
Figure 6: (a) Frame capture of image generated by silicon retina. (b) Output of feature detectors (simple cells) using silicon retina data as input. (c) Output of salience detectors using simple cell data as input (Vogelstein et al., 2005).
technique faithfully simulates the operation of any hierarchical feedforward network (feedback can be implemented within a given processing stage), while allowing analysis of each stage's output independently.

3.3 Spatial Acuity Modulation. In a human retina, there is a natural distribution of photoreceptors throughout the visual field, with the highest density of light-sensitive elements in the center of vision and lower numbers of photoreceptors in the periphery (Kandel et al., 2000). In combination with reflex circuits that guide the center of the eye to salient regions in space, this configuration conserves computational resources and transmission bandwidth between levels of the network. The same principles of conservation are important in our multichip hardware system. However, because the silicon retina used as the front end to our visual system has a fixed position and uniform resolution throughout its field of view, we modulate the spatial acuity of the image in the address domain.

Spatial acuity modulation is performed by pooling the outputs from neighboring pixels in the retina onto single cells in the silicon cortex, using overlapping gaussian kernels with broad spatial bandwidths in the periphery and narrow bandwidths in the center of the image (see Figure 7a). Because synaptic weights in the IFAT can be dynamically configured using multiple degrees of freedom, these kernel functions can be reasonably approximated using discrete changes to the internal weight variable, the number of output events sent per input event, and the synaptic equilibrium potential. To relocate the center of vision (called the fovea) to an area of interest in the visual field, the system could reprogram the RAM with a different connectivity pattern, but instead the FPGA performs simple arithmetic manipulations on incoming address events, adding or subtracting a fixed value from the row and column addresses to offset their position (Vogelstein, Mallik, Culurciello, Etienne-Cummings, & Cauwenberghs, 2004).
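The address manipulation can be sketched as follows; the ring boundaries and subsampling factors below are assumptions chosen to illustrate a 16 × 16 fovea with pooling that doubles per ring, not the exact kernels used on the IFAT.

def foveate(row, col, fov_row, fov_col):
    # Shift so the fovea center maps to the origin (the FPGA's offset
    # arithmetic), then subsample by a ring-dependent factor.
    r, c = row - fov_row, col - fov_col
    ring = max((max(abs(r), abs(c)) - 1) // 8, 0)   # 0 = fovea
    scale = 1 << ring                               # 1, 2, 4, 8, ...
    return ring, (r // scale, c // scale)

for p in [(30, 40), (34, 44), (10, 40), (30, 5)]:
    print(p, "->", foveate(*p, fov_row=30, fov_col=40))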
Figure 7: (a) Pictorial representation of the spatial acuity modulation network with fovea positioned over the center of the image. Circles represent cortical cells, with the diameter of the circle proportional to the size of the spatial receptive field. The outermost cortical cells integrate inputs from 64 retinal cells, while the innermost cortical cells receive inputs from a single retinal cell. One cell within each group is shaded to illustrate the pattern of synaptic weights onto the cells in that group. Light shading represents a small synaptic weight, and dark shading represents a large synaptic weight. (b) Example image output from silicon retina with spatial acuity modulation performed by silicon cortex. The nine subfigures show how the output varies as the center of vision (fovea) moves from the top left corner of the image to the bottom right corner (Vogelstein, Mallik, Culurciello, et al., 2004).
An example image with nine different foveations is shown in Figure 7b. With a 16 × 16-pixel fovea surrounded by k concentric rings of geometrically decreasing resolution, the number of cortical neurons (M) required to represent the foveated image from an N × N-pixel retina is given by M = 16^2 + (2^{(k+2)} − 4)N/2. This allows for a significant reduction (75% for the example shown here) in the number of address events processed by the hardware as well as a reduced communication cost of transmitting an image "frame" (Vogelstein, Mallik, Culurciello, et al., 2004).

3.4 MAX Computation. The cortical process of object recognition is modeled in Figure 2 with a series of linear and nonlinear poolings of simple cell outputs (Riesenhuber & Poggio, 1999). In the first set of computations, outputs from simple cells with similar preferred orientations and different receptive fields are combined with a maximum operation to form complex cells, which essentially act as position-invariant oriented spatial filters. Because of the large bandwidth required, the maximum is taken over only a subset of the image with high salience. This is similar to the attention spotlight model of human perception (Posner, Snyder, & Davidson, 1980; Eriksen & St. James, 1986).

The maximum operation (MAX) is defined here as a nonlinear saturating pooling function on a set of inputs whose output codes the magnitude of the largest input, regardless of the number and levels of lesser inputs. A neural implementation of the MAX is illustrated in Figure 8a, where a set of input neurons {x} causes the output neuron z to generate spikes at a rate proportional to the input with the fastest firing rate. The MAX operation is closely related to the WTA function, except that a standard WTA network allows only one of many potential output neurons to be active, and that neuron's activity level depends only on the relative magnitude of the inputs, not their absolute value. (For distractor input spike frequencies up to about 80% of the maximum, the y neurons in the MAX network compute a WTA as an intermediate step toward computing the maximum. Higher distractor input spike frequencies can be accommodated by increasing the reciprocal inhibitory feedback between y neurons at the expense of the accuracy of the z neuron.) When used in a neural network to pool responses from different feature detectors, such as simple cells, a MAX neuron can simultaneously achieve high feature specificity and invariance (Riesenhuber & Poggio, 1999).

We implemented a MAX network model originally proposed by Yu, Giese, and Poggio (2002), shown in Figure 8a. The network is highly interconnected, with all-to-all reciprocally inhibitory feedback connections between y neurons, confirming the ability of the silicon cortex to implement recurrent networks. The invariance of the network to the number of inputs n is illustrated in Figure 8b. Thirty configurations, with n ∈ [1, 30], were tested on the silicon cortex.
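A rough rate-based sketch of this dynamics is given below; the hardware uses spiking IFAT neurons, and the weights, time step, and iteration count here are illustrative assumptions rather than the chip's values:

import numpy as np

def max_network(x_rates, w_exc=1.0, w_inh=2.0, steps=5000, dt=0.005):
    # Each y neuron is excited by its own x input and inhibited by every
    # other y neuron; the z output sums the y activities. Once the
    # competition settles, only the strongest input's y cell remains
    # active, so z tracks max(x_rates).
    x = np.asarray(x_rates, dtype=float)
    y = np.zeros(len(x))
    for _ in range(steps):
        inhibition = w_inh * (y.sum() - y)      # all-to-all reciprocal inhibition
        drive = np.maximum(w_exc * x - inhibition, 0.0)
        y += dt * (drive - y)                   # leaky rate dynamics
    return y.sum()                              # z output

print(max_network([50] + [30] * 24))   # ~50: invariant to 24 distractors at 30 Hz
print(max_network([50] + [30] * 9))    # ~50: invariant to the number of inputs

As in the description above, raising w_inh suppresses faster distractors at some cost in how faithfully z reports the winning rate.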
Figure 8: (a) Pictorial representation of MAX network. Excitatory connections are shown by solid lines and triangular synapses. Inhibitory connections are shown by dashed lines and circular synapses. Input to the x neurons is provided by a computer that generates independent homogeneous Poisson processes. Each y neuron makes inhibitory synapses with all other y neurons, but only some connections are shown for clarity. The output of the z neuron is monitored by a computer. (b) Invariance of the silicon cortex-based MAX network to the number of inputs, n. (c) Invariance of the silicon cortex-based MAX network to the firing rate of nonmaximum inputs.
In each configuration, the networks were allowed to run for 60 seconds, with the x cells' inputs generated by independent homogeneous Poisson processes with parameter λ = 30 Hz for nonmaximum inputs and λ = 50 Hz for x_max. As shown in Figure 8b, the firing rate of the output neuron z is approximately the same for any value of n. In addition to invariance toward the number of inputs n, the MAX network is invariant to the firing rate of nonmaximum inputs. This was tested by fixing n = 25 and allowing λ to vary for nonmaximum inputs from 2 Hz to 50 Hz.
The results are shown in Figure 8c, where the firing rate of the output neuron z is seen to be roughly constant for rates up to 40 Hz. Because the inputs to the network are stochastic, its performance is weakened by very high firing rates of nonmaximal inputs or very large numbers of nonmaximal inputs.

4 Discussion

The above experiments have demonstrated the operation of the primary components of the hierarchical visual processing model (see Figure 2). A full implementation would require larger numbers of neurons than can be simultaneously accommodated in the present multichip system. However, to construct the complete network, multiple IFATs could be connected together, one for each stage of processing, to form a very large silicon cortex. In addition to providing more neurons, this arrangement would reduce the constraints on bandwidth. For example, the spatial feature extraction architecture described requires each retinal cell to project to 64 simple cells at full resolution. However, if the silicon retina produces 50,000 AE per second and each IFAT is limited to processing 1,000,000 AE per second, the maximum fan-out from each retinal cell to any individual IFAT is only 20 (1,000,000/50,000). By dividing the orientations among multiple IFATs (see Choi, Merolla, Arthur, Boahen, & Shi, 2005), a fan-out of 64 could easily be sustained without overtaxing the system. Furthermore, because the number of connections between neurons within a given level is larger than the number of connections between neurons in different levels (especially in recurrent networks like the MAX network), giving each processing stage its own IFAT will conserve energy by reducing the number of events transmitted across the external AER bus.

Connecting multiple IFATs together in a feedforward structure requires little hardware beyond that shown in Figure 1. Because each IFAT functions as an address event transceiver, sending and receiving events according to a lookup table in RAM, it needs to know only the addresses of neurons in the subsequent processing stage to communicate with them directly over its AER output bus. For recurrent connections between IFATs, a central arbiter would be required to merge incoming events from multiple AER buses and route them to their appropriate targets. This can be achieved with simple logic circuits implemented in a fast complex programmable logic device (CPLD) or FPGA.

The same hardware that supports multiple IFATs could also support multiple neuromorphic sensors. Because the silicon cortex implements a general-purpose neural model, it is well suited for multimodal computations. Even without additional hardware, the system in Figure 1 can be adapted for sensory modalities other than vision: any neuromorphic sensor using AER can be attached to the port currently occupied by the OR.
Under normal operating conditions, such as when implementing the networks described in section 3, each IFAT neuron executes approximately 1 million to 10 million operations per second (MOPS; addition, subtraction, multiplication, and comparison are all considered single operations). The exact number of MOPS can be computed if the numbers of input and output spikes are known, because each input event requires approximately six basic operations per neuron, and every output event requires two or three operations (Vogelstein, Mallik, Vogelstein, & Cauwenberghs, 2007). However, if the network architecture is optimized to take advantage of parallel activation of multiple cells, the number of OPS increases significantly. For example, if every incoming spike is routed to an entire row of neurons simultaneously, the IFAT would perform more than 360 operations per spike, or at least 360 MOPS for 1,000,000 input spikes per second. In the current hardware, the upper bound on operations per second is 19,200 MOPS if all 2400 neurons on one chip are activated simultaneously, or 38,400 MOPS if all 4800 neurons in the silicon cortex are used in parallel. (These figures will improve with technology and are not fundamental limits of our approach.) To date, we have utilized only the parallel activation functions of the IFAT to implement global "leakage" events, but one can easily imagine future applications that take advantage of this feature, such as a fully connected winner-take-all network (Abrahamsen, Häfliger, & Lande, 2004; Oster & Liu, 2004).

5 Conclusion

We have described a novel multichip neuromorphic system capable of processing visual images in real time. The system contains a silicon cortex with 4800 neurons that can be (re)configured into arbitrary network topologies for processing any spike-based input. Results from the first few stages of a hierarchical model for salience detection and object recognition confirm the utility of the system for prototyping large-scale sensory information processing networks. Future work will focus on increasing the number of neurons in the silicon cortex, so that the entire hierarchical visual processing model can be tested while it interacts with the environment.

Acknowledgments

This work was partially funded by the National Science Foundation, the National Institute on Aging, the Defense Advanced Research Projects Agency, and the Institute for Neuromorphic Engineering. Additionally, R.J.V. is supported by an NSF Graduate Research Fellowship.

References

Abrahamsen, J., Häfliger, P., & Lande, T. S. (2004). A time domain winner-take-all network of integrate-and-fire neurons. In Proceedings of the IEEE International Symposium on Circuits and Systems (Vol. 5, pp. 361–364). Piscataway, NJ: IEEE.
Arthur, J. V., & Boahen, K. A. (2004). Recurrently connected silicon neurons with active dendrites for one-shot learning. In Proceedings of the IEEE International Joint Conference on Neural Networks (Vol. 3, pp. 1699–1704). Piscataway, NJ: IEEE.
Cheely, M., & Horiuchi, T. (2003). A VLSI model of range-tuned neurons in the bat echolocation system. In Proceedings of the IEEE International Symposium on Circuits and Systems (Vol. 4, pp. 872–875). Piscataway, NJ: IEEE.
Choi, T. Y. W., Merolla, P. A., Arthur, J. V., Boahen, K. A., & Shi, B. E. (2005). Neuromorphic implementation of orientation hypercolumns. IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications, 52(6), 1049–1060.
Culurciello, E., & Etienne-Cummings, R. (2004). Second generation of high dynamic range, arbitrated digital imager. In Proceedings of the IEEE International Symposium on Circuits and Systems (Vol. 4, pp. 828–831). Piscataway, NJ: IEEE.
Culurciello, E., Etienne-Cummings, R., & Boahen, K. A. (2003). A biomorphic digital image sensor. IEEE Journal of Solid-State Circuits, 38(2), 281–294.
Dayan, P., & Abbott, L. F. (2001). Theoretical neuroscience: Computational and mathematical modeling of neural systems. Cambridge, MA: MIT Press.
Deiss, S. R., Douglas, R. J., & Whatley, A. M. (1999). A pulse-coded communications infrastructure for neuromorphic systems. In W. Maass & C. M. Bishop (Eds.), Pulsed neural networks (pp. 157–178). Cambridge, MA: MIT Press.
Eriksen, C. W., & St. James, J. D. (1986). Visual attention within and around the field of focal attention: A zoom lens model. Perception and Psychophysics, 40, 225–240.
Goldberg, D. H., Cauwenberghs, G., & Andreou, A. G. (2001). Probabilistic synaptic weighting in a reconfigurable network of VLSI integrate-and-fire neurons. Neural Networks, 14(6–7), 781–793.
Häfliger, P. (2001). Asynchronous event redirecting in bio-inspired communication. In Proceedings of the IEEE International Conference on Electronics, Circuits and Systems (Vol. 1, pp. 87–90). Piscataway, NJ: IEEE.
Higgins, C. M., & Koch, C. (1999). Multi-chip neuromorphic motion processing. In D. S. Wills & S. P. DeWeerth (Eds.), Proceedings of the 20th Anniversary Conference on Advanced Research in VLSI (pp. 309–323). Los Alamitos, CA: IEEE Computer Society.
Horiuchi, T., & Hynna, K. (2001). Spike-based VLSI modeling of the ILD system in the echolocating bat. Neural Networks, 14, 755–762.
Horiuchi, T. K., & Koch, C. (1999). Analog VLSI-based modeling of the primate oculomotor system. Neural Computation, 11, 243–265.
Indiveri, G. (1999). Neuromorphic analog VLSI sensor for visual tracking: Circuits and application examples. IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, 46(11), 1337–1347.
Indiveri, G., Chicca, E., & Douglas, R. J. (2004). A VLSI reconfigurable network of integrate-and-fire neurons with spike-based learning synapses. In Proceedings of the European Symposium on Artificial Neural Networks (pp. 405–410). Bruges, Belgium: D-Facto.
Indiveri, G., Murer, R., & Kramer, J. (2001). Active vision using an analog VLSI model of selective attention. IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, 48(5), 492–500.
Jung, R., Brauer, E. J., & Abbas, J. J. (2001). Real-time interaction between a neuromorphic electronic circuit and the spinal cord. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 9(3), 319–326.
Kandel, E. R., Schwartz, J. H., & Jessell, T. M. (2000). Principles of neural science (4th ed.). New York: McGraw-Hill.
Koch, C. (1999). Biophysics of computation: Information processing in single neurons. New York: Oxford University Press.
Lazzaro, J., Wawrzynek, J., Mahowald, M., Sivilotti, M., & Gillespie, D. (1993). Silicon auditory processors as computer peripherals. IEEE Transactions on Neural Networks, 4(3), 523–528.
Lewis, M. A., Etienne-Cummings, R., Hartmann, M. H., Cohen, A. H., & Xu, Z. R. (2003). An in silico central pattern generator: Silicon oscillator, coupling, entrainment, physical computation and biped mechanism control. Biological Cybernetics, 88(2), 137–151.
Liu, S.-C., Kramer, J., Indiveri, G., Delbrück, T., & Douglas, R. (2002). Orientation-selective aVLSI spiking neurons. In T. G. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14. Cambridge, MA: MIT Press.
Mahowald, M. (1994). An analog VLSI system for stereoscopic vision. Boston: Kluwer.
Mallik, U., Vogelstein, R. J., Culurciello, E., Etienne-Cummings, R., & Cauwenberghs, G. (2005). A real-time spike-domain sensory information processing system. In Proceedings of the IEEE International Symposium on Circuits and Systems (Vol. 3, pp. 1919–1922). Piscataway, NJ: IEEE.
Mead, C. (1989). Analog VLSI and neural systems. Reading, MA: Addison-Wesley.
Oster, M., & Liu, S.-C. (2004). A winner-take-all spiking network with spiking inputs. In Proceedings of the IEEE International Conference on Electronics, Circuits and Systems (pp. 203–206). Piscataway, NJ: IEEE.
Paz, R., Gomez-Rodriguez, F., Rodriguez, M. A., Linares-Barranco, A., Jimenez, G., & Civit, A. (2005). Test infrastructure for address-event-representation communications. Lecture Notes in Computer Science, 3512, 518–526.
Posner, M. I., Snyder, C. R. R., & Davidson, B. J. (1980). Attention and the detection of signals. Journal of Experimental Psychology: General, 109, 160–174.
Reichel, L., Leichti, D., Presser, K., & Liu, S.-C. (2005). Robot guidance with neuromorphic motion sensors. In Proceedings of the IEEE International Conference on Robotics and Automation (pp. 3540–3544). Piscataway, NJ: IEEE.
Riesenhuber, M., & Poggio, T. (1999). Hierarchical models of object recognition in cortex. Nature Neuroscience, 2(11), 1019–1025.
Riis, H. K., & Häfliger, P. (2005). An asynchronous 4-to-4 AER mapper. Lecture Notes in Computer Science, 3512, 494–501.
Ros, E., Ortigosa, E. M., Agis, R., Carrillo, R., & Arnold, M. (2006). Real-time computing platform for spiking neurons (RT-spike). IEEE Transactions on Neural Networks, 17(4), 1050–1063.
Serrano-Gotarredona, R., Oster, M., Lichtsteiner, P., Linares-Barranco, A., Paz-Vicente, R., Gómez-Rodríguez, F., et al. (2006). AER building blocks for multi-layer multi-chip neuromorphic vision systems. In Y. Weiss, B. Schölkopf, & J. Platt (Eds.), Advances in neural information processing systems, 18 (pp. 1217–1224). Cambridge, MA: MIT Press.
Simoni, M. F., Cymbalyuk, G. S., Sorensen, M. Q., Calabrese, R. L., & DeWeerth, S. P. (2001). Development of hybrid systems: Interfacing a silicon neuron to a leech heart interneuron. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 173–179). Cambridge, MA: MIT Press.
Sivilotti, M. (1991). Wiring considerations in analog VLSI systems, with application to field-programmable networks. Unpublished doctoral dissertation, California Institute of Technology.
Taba, B., & Boahen, K. A. (2003). Topographic map formation by silicon growth cones. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15 (pp. 1139–1146). Cambridge, MA: MIT Press.
Vogelstein, R. J., Mallik, U., & Cauwenberghs, G. (2004). Silicon spike-based synaptic array and address-event transceiver. In Proceedings of the IEEE International Symposium on Circuits and Systems (Vol. 5, pp. 385–388). Piscataway, NJ: IEEE.
Vogelstein, R. J., Mallik, U., Cauwenberghs, G., Culurciello, E., & Etienne-Cummings, R. (2005). Saliency-driven image acuity modulation on a reconfigurable silicon array of spiking neurons. In L. K. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing systems, 17 (pp. 1457–1464). Cambridge, MA: MIT Press.
Vogelstein, R. J., Mallik, U., Culurciello, E., Etienne-Cummings, R., & Cauwenberghs, G. (2004). Spatial acuity modulation of an address-event imager. In Proceedings of the IEEE International Conference on Electronics, Circuits and Systems (pp. 207–210). Piscataway, NJ: IEEE.
Vogelstein, R. J., Mallik, U., Vogelstein, J. T., & Cauwenberghs, G. (2007). Dynamically reconfigurable silicon array of spiking neurons with conductance-based synapses. IEEE Transactions on Neural Networks, 18(1), 253–265.
Yu, A. J., Giese, M. A., & Poggio, T. A. (2002). Biophysiologically plausible implementations of the maximum operation. Neural Computation, 14(12), 2857–2881.
Zaghloul, K. A., & Boahen, K. (2004). Optic nerve signals in a neuromorphic chip I: Outer and inner retina models. IEEE Transactions on Biomedical Engineering, 51(4), 657–666.
Zou, Q., Bornat, Y., Tomas, J., Renaud, S., & Destexhe, A. (2006). Real-time simulations of networks of Hodgkin-Huxley neurons using analog circuits. Neurocomputing, 69, 1137–1140.
Received March 29, 2006; accepted October 7, 2006.
LETTER
Communicated by Rajesh Rao
Visual Recognition and Inference Using Dynamic Overcomplete Sparse Learning

Joseph F. Murray
[email protected]
Massachusetts Institute of Technology, Brain and Cognitive Sciences Department, Cambridge, MA 02139, U.S.A.
Kenneth Kreutz-Delgado
[email protected]
University of California, San Diego, Electrical and Computer Engineering Department, La Jolla, CA 92093-0407, U.S.A.
We present a hierarchical architecture and learning algorithm for visual recognition and other visual inference tasks such as imagination, reconstruction of occluded images, and expectation-driven segmentation. Using properties of biological vision for guidance, we posit a stochastic generative world model and from it develop a simplified world model (SWM) based on a tractable variational approximation that is designed to enforce sparse coding. Recent developments in computational methods for learning overcomplete representations (Lewicki & Sejnowski, 2000; Teh, Welling, Osindero, & Hinton, 2003) suggest that overcompleteness can be useful for visual tasks, and we use an overcomplete dictionary learning algorithm (Kreutz-Delgado et al., 2003) as a preprocessing stage to produce accurate, sparse codings of images. Inference is performed by constructing a dynamic multilayer network with feedforward, feedback, and lateral connections, which is trained to approximate the SWM. Learning is done with a variant of the backpropagation-through-time algorithm, which encourages convergence to desired states within a fixed number of iterations. Vision tasks require large networks, and to make learning efficient, we take advantage of the sparsity of each layer to update only a small subset of elements in a large weight matrix at each iteration. Experiments on a set of rotated objects demonstrate various types of visual inference and show that increasing the degree of overcompleteness improves recognition performance in difficult scenes with occluded objects in clutter.

Neural Computation 19, 2301–2352 (2007)
C 2007 Massachusetts Institute of Technology
1 Introduction

Vision, whether in the brain or computer, can be characterized as the process of inferring certain unknown quantities using an input image and predictions or expectations based on prior exposure to the environment. Visual inference includes tasks such as recognizing objects, reconstructing missing or occluded features, imagining previously learned or entirely novel objects, and segmentation (finding which features in a cluttered image correspond to a particular object). Performing these inference tasks requires combining information about the current image (bottom-up processing) and abstract concepts of objects (top-down processing). These tasks can naturally be placed into the framework of Bayesian probabilistic models, and determining the structure and priors for such models is a great challenge for both understanding vision in the brain and application-oriented computer vision.

A primary goal of this letter is to derive an effective probabilistic model of visual inference consistent with current understanding of biological vision. A number of important properties have emerged from neuroscience:

1. Vision in the brain is a hierarchical process, with information flowing from the retina to the lateral geniculate nucleus (LGN) and the occipital and temporal regions of the cortex (Kandel, Schwartz, & Jessell, 2000).
2. This hierarchy has extensive recurrence, with reciprocal connections between most regions (Felleman & Van Essen, 1991).
3. There is extensive recurrence within cortical regions, as typified by lateral inhibition, a mechanism for how sparse coding can arise (Callaway, 2004).
4. The primary visual cortex (V1) is strikingly overcomplete, meaning there are many more cells than are needed to represent the retinal information. In humans, there are 200 to 300 V1 neurons for each LGN neuron, and a lesser degree of overcompleteness in other primates (Stevens, 2001; Ejima et al., 2003).
5. The firing patterns of cortical neurons give evidence for sparse distributed representations, in which only a few neurons are active out of a large population, and information is encoded in these ensembles (Vinje & Gallant, 2000; Quiroga, Reddy, Kreiman, Koch, & Fried, 2005).
6. Although there are differences among areas, the basic structure of the cortex is qualitatively similar, and the notion of cortical similarity states that the underlying cortical operation should be similar from area to area (Mountcastle, 1978; Hawkins & Blakeslee, 2004).

Since these six properties are present in animals with high visual acuity, it is reasonable to assume they are important for inference, and we will adopt them in a network model.
While many computational models of vision have been developed that incorporate some of the properties listed above (Fukushima & Miyake, 1982; Rao & Ballard, 1997; Riesenhuber & Poggio, 1999; Rolls & Milward, 2000; Lee & Mumford, 2003; Fukushima, 2005), we propose a model that takes into account all six properties. For example, the recognition models of Rolls and Milward (2000) and Riesenhuber and Poggio (1999) do not use feedback (and so are incapable of inference tasks such as reconstruction or imagination), and the dynamic system of Rao and Ballard (1997) does not use overcomplete representations. The use of learned overcomplete representations for preprocessing is a new and largely unexplored approach for visual recognition and inference algorithms. Recent developments in learning overcomplete dictionaries (Lewicki & Sejnowski, 2000; Kreutz-Delgado et al., 2003; Teh, Welling, Osindero, & Hinton, 2003) and the associated methods for sparse image coding (Murray & Kreutz-Delgado, 2006) now make possible the investigation of their utility for visual inference.

Real-world images are high-dimensional data that can be explained in terms of a much smaller number of causes, such as objects and textures. Each object in turn can appear in many different orientations but in fact is seen in only one particular orientation. For each orientation, an object can be represented with a concise set of features, such as lines, arcs, and textures. The key feature of these various types of image descriptions is that they can be represented as sparse vectors, where only a few of the many possible choices suffice as explanation. While pixel values of images have nonsparse distributions (they are unlikely to be zero), these more abstract representations are very sparse (each component is likely to be zero), and only a few nonzero components at a time succinctly describe the scene. This intuition, along with the biological evidence for sparsity, is the justification for our use of sparse prior distributions. Other advantages of sparsity include reduced metabolic cost and increased storage capacity in associative memories (Olshausen & Field, 1997).
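As a concrete illustration of sparse coding over an overcomplete dictionary, the following sketch uses plain ISTA (iterative soft thresholding) in place of the FOCUSS+ algorithm actually used in this letter; the dictionary, dimensions, and penalty are illustrative assumptions only:

import numpy as np

def sparse_code(y, A, lam=0.1, iters=200):
    # ISTA for min_x 0.5*||y - A x||^2 + lam*||x||_1; a stand-in for
    # FOCUSS+. A is an n x m overcomplete dictionary (m > n).
    L = np.linalg.norm(A, 2) ** 2            # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        x = x - A.T @ (A @ x - y) / L        # gradient step
        x = np.sign(x) * np.maximum(np.abs(x) - lam / L, 0.0)  # soft threshold
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((64, 128))           # 2x overcomplete dictionary
A /= np.linalg.norm(A, axis=0)               # unit-norm columns
y = A[:, :5] @ rng.standard_normal(5)        # signal generated by 5 active causes
x = sparse_code(y, A)
print((np.abs(x) > 1e-3).sum(), "nonzero coefficients out of", A.shape[1])

Although the pixel vector y is dense, the recovered code x has only a handful of significant entries, which is exactly the sparse-cause structure argued for above.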
1.1 Overview and Organization. Beginning with a hypothetical hierarchical generative world model (GWM) that is presumed to create images of objects seen in the world, we discuss in section 2 how the GWM can be used for visual inference. The GWM requires the selection of a probability distribution, and a suitable choice is required to create practical algorithms. As a first move, we consider a Boltzmann-like distribution that captures the desired top-down, bottom-up, and lateral influences between and within layers, but it is computationally intractable. Then a simplified world model (SWM) distribution is created based on a variational approximation to the Boltzmann-like distribution and specifically designed to model sparse densities (see section 2.4). By designing a dynamic network that rapidly converges to a self-consistency condition of the SWM, we can perform inference tasks if we have the weights that parameterize the network (see section 3). The dynamic network arises as a way of estimating the fixed-point state of the SWM. Although we consider only the problem of estimating static world models, generalization to dynamic worlds is also possible. To determine the unknown weights, we develop a learning algorithm based on the backpropagation-through-time algorithm (Williams & Peng, 1990), which operates on the preactivation state and includes a sparsity-enforcing prior (see section 4). This algorithm can be seen as a natural extension of the sparse-coding principles that are useful in modeling V1 response properties (Olshausen & Field, 1997) to the full visual inference task. We demonstrate experimentally several types of visual inference: recognition, reconstruction, segmentation, and imagination. These simulations show that overcomplete representations can provide better recognition performance than complete codes when used in the early stages of vision (see section 6). A discussion of the biological motivations and comparison to prior work is given in section 7, and conclusions are drawn in section 8.

1.2 Notation. We use the following notation:

a        Activation function parameters
B        Error-related term in learning algorithm
c(m)     Object code for object m (sparse binary code)
D        Sparsity-enforcing term in learning algorithm
f(·)     Sigmoid activation function
I{·}     Indicator function: 1 if the expression is true, 0 otherwise
J_PA     Cost function on preactivation state, minimized by learning rule
K        Number of images in training set Y
L_l      Lateral weights between units in layer l
M        Number of unique objects in training set
n        Number of layers in network
N        Number of elements in state vector X
r        Number of nonzero elements, r = [r_1, ..., r_n], where r_l is the number of nonzero elements in layer l (diversity × n)
s        Size of layers, s = [s_1, ..., s_n], where s_l is the size of layer l
U_t      Network input at time t
v, v_l   Unit weight sum v (for an entire layer, v_l), preactivation function
V        Preactivation state of all layers
V_t      Certainty-equivalence approximation of preactivation values
W_lm     Weights from layer m to layer l
W        Complete weight matrix for all layers (including all W_lm and L_l), W ∈ R^{N×N}
x_l      Activation vector at layer l, expected values of P(z_l | z_{l−1}, z_{l+1})
X        State vector of all layers, X = [x_1^T, ..., x_n^T]^T
Y̌        Training data, Y̌ = [y̌_1^T, 0, ..., 0, y̌_n^T]^T, where y̌_1 is a sparsely coded image and y̌_n is an object code
Y_t      Dynamic network output at time t
Y, V, U  Sets of multiple state vectors (e.g., Y = {Y^(1), ..., Y^(K)})
z_l      True state of generative model at layer l, binary random vector ∈ {0, 1}^{s_l}
Z        True state of generative model, all layers, Z = [z_1^T, ..., z_n^T]^T, binary random vector ∈ {0, 1}^N
β_t      Indicator vector of whether target values are available for each element of V_t
ε        Error between variational approximation and true state
ε        Error between data set and network approximation V_t
ζ        Normalization constant (partition function)
η        Learning rate
λ        Regularization parameter
µ        Target mean for hidden layers
Φ        Error between true and approximate state, Φ = Z − X = [φ_1^T, ..., φ_n^T]^T
ξ        Energy-like function
τ        Number of time steps network is run for (maximum value of t)
GWM      Generative world model (Boltzmann-like distribution)
NLCP     Neighboring-layer conditional probability
SWM      Simplified world model, variational approximation to Boltzmann-like distribution
DN       Dynamic network that settles to the self-consistency condition of the SWM
2 Generative Models for Visual Inference

In this section, we postulate a hierarchical generative visual-world model (GWM) and discuss its properties, particularly that of independence of the nodes of a layer conditioned on its immediately neighboring layers. We then discuss how the GWM can be used for visual inference tasks such as recognition, imagination, reconstruction, and expectation-driven segmentation. Specific forms of the probability distribution in the model must be chosen, and as a starting point, we posit a Boltzmann-like distribution. Since inference with the Boltzmann-like distribution is generally intractable, a variational approximation is developed, leading to a simplified world model (SWM). The key assumption of sparsely distributed activations is enforced and used extensively. In this section we consider static world models; in section 3.1, we will use dynamic networks to implement inference by settling to the fixed points of the SWM.

2.1 Hierarchical Generative Visual-World Model. Images of objects seen in the world can be thought of as being created by a hierarchical, stochastic generative model (the GWM).
Figure 1: Hierarchical generative visual-world model (GWM) for objects. At each layer zl , the image can be represented by a large (possibly overcomplete) sparse vector. In this generative model, each layer is a binary random vector, which, given only the layer immediately above it in the hierarchy, is independent of other higher layers.
Although it cannot be rigorously claimed that the real world uses such a model to generate images, the idea of the GWM is a useful fiction that guides the development of learning algorithms (Hinton & Ghahramani, 1997). For the GWM, we assume a hierarchical binary state model of the form shown in Figure 1. The number of layers is somewhat arbitrary, though there should be enough layers to capture the structure of the data to be modeled, and four to five appears to be reasonable for images of objects (Riesenhuber & Poggio, 1999; Lee & Mumford, 2003; Hinton, Osindero, & Teh, 2006). The arrows in Figure 1 indicate that each layer, given the layer directly above it, is independent of higher layers. At the highest level, the vector z5 is a sparse binary coding of the object in the image, and its value is drawn from the prior distribution P(z5). The representation of the particular orientation z4 of an object depends on only the object representation z5. The invariant, composite, and local features, z3, z2, and z1, depend only on the layer immediately above them, for example, P(z3|z4, z5) = P(z3|z4), and the local features z1 model the image I. The sequence can be summarized as

z_5 \xrightarrow{P(z_4|z_5)} z_4 \xrightarrow{P(z_3|z_4)} z_3 \xrightarrow{P(z_2|z_3)} z_2 \xrightarrow{P(z_1|z_2)} z_1 \xrightarrow{P(I|z_1)} I.   (2.1)
The joint distribution of the image and generative states zl is

P(I, z_1, z_2, z_3, z_4, z_5) = P(I|z_1) P(z_1|z_2) P(z_2|z_3) P(z_3|z_4) P(z_4|z_5) P(z_5),   (2.2)
where each layer zl is a binary vector of size sl. We postulate that the zl are sparse: they have very few nonzero components (Olshausen & Field, 1997). For example, in every image, only a few of all possible objects will be present, and each object will be in only one of its possible orientations, and so forth. Sparsity is proportional to the number of zero components in a vector z ∈ R^n, sparsity ≡ #{z_i = 0}/n. A related quantity, diversity, is proportional to the number of nonzero components, diversity ≡ #{z_i ≠ 0}/n = 1 − sparsity. Many studies have confirmed that natural images can be represented accurately by sparse vectors, corresponding to z1 (Olshausen & Field, 1996; Kreutz-Delgado et al., 2003; Murray & Kreutz-Delgado, 2006). These studies have mainly dealt with small patches of images (on the order of 8 × 8 to 16 × 16 pixels), and it is clear that features larger than such patches will be represented nonoptimally. This further redundancy in larger-scale features can be reduced at higher levels, which can also have the property of sparseness.

2.1.1 Neighboring Layer Conditional Probability (NLCP). For a middle layer zl given all the other layers, we find that zl conditioned on its immediate neighbors zl−1, zl+1 is independent of all the remaining layers; for example,
P(z_2 | I, z_1, z_3, z_4, z_5) = \frac{P(I|z_1) P(z_1|z_2) P(z_2|z_3) P(z_3|z_4) P(z_4|z_5) P(z_5)}{P(I|z_1) P(z_1|z_3) P(z_3|z_4) P(z_4|z_5) P(z_5)} = \frac{P(z_1|z_2) P(z_2|z_3)}{P(z_1|z_3)}.   (2.3)
For an arbitrary layer, we can find the neighboring layer conditional probability (NLCP),
P(z_l | z_{l-1}, z_{l+1}) = \frac{P(z_{l-1}|z_l) P(z_l|z_{l+1})}{P(z_{l-1}|z_{l+1})}   (NLCP).   (2.4)
This important independence assumption is equivalent to saying that each layer learns about the world only through its neighboring layers (Lee & Mumford, 2003).¹ Returning to the joint distribution and substituting in the NLCPs,

P(I, z_1, z_2, z_3, z_4, z_5) = P(I|z_1) · P(z_1|z_2) P(z_2|z_3) · P(z_3|z_4) P(z_4|z_5) · P(z_5)
= P(I|z_1) · P(z_2|z_1, z_3) P(z_1|z_3) · P(z_4|z_3, z_5) P(z_3|z_5) · P(z_5).   (2.5)
So the joint can be recovered given the NLCP and additional terms. Of course, other factorizations of the joint are possible, but these are also consistent with the NLCP for their respective layers (Brook, 1964).²

2.1.2 Properties of Generative World Model (GWM). We now summarize the four properties of our generative world model (GWM):

1. There is a hierarchy of n hidden-layer vectors z_1, ..., z_n that model each image I.
2. Each layer is independent of all higher layers given the neighboring layer above, P(z_l | z_{l+1}, ..., z_n) = P(z_l | z_{l+1}).
3. Each layer is independent of all lower layers given the neighboring layer below, P(z_l | z_{l−1}, ..., z_1) = P(z_l | z_{l−1}) (as shown in Murray, 2005, section 1.2).
4. Given its immediate neighboring layers, a layer z_l is independent of all other higher and lower layers, P(z_l | I, z_1, ..., z_n) = P(z_l | z_{l−1}, z_{l+1}).

2.2 Types of Inference: Recognition, Imagination, Reconstruction, and Expectation-Driven Segmentation. For object recognition, the goal is to infer the highest layer representation z_n given an image I. However, recognition is only one type of inference that might be required. Another type is running a model generatively using a high-level object representation to imagine an image of that object. In the brain, imagining a particular instance of an object will not correspond to the level of detail in the retinal representation, but there is evidence of activity in many of the lower visual areas (such as medial temporal, V1, and V2) during imagination (Kosslyn, Thompson, & Alpert, 1997).
¹ The NLCP is closely related to the Markov blanket, which is defined for a single node in a Bayesian network as that node's parents, children, and children's parents. The NLCP is defined over all the units in a given layer.
² Brook (1964) proves that any system specified by the NLCP, P(z_j | z_i, i ≠ j) = P(z_j | z_{j−1}, z_{j+1}), has a joint distribution that can be factored as P(z_1, ..., z_n) = \prod_{i=1}^{n+1} Q_i(z_i, z_{i−1}), which is the joint factorization of a simple Markov chain. This proof is for the case of scalar z, but since our z_l are binary vectors, they can be equivalently represented as scalar integer variables ∈ {1, ..., 2^{s_l}}. Thus, any system defined by the vector NLCP is consistent with a joint distribution that can be specified as the product of neighboring-layer factors, that is, the Markov assumption in equation 2.2.
Table 1: Types of Inference That Can Be Performed with the Hierarchical Generative World Model and the Types of Information Flow Required.

                                                                      Requires
Type of Inference                  Inputs          Outputs     Bottom Up   Top Down
Recognition                        (I → z1)        zn          Y           N
Imagination                        zn              (z1 → I)    N           Y
Reconstruction                     (I → z1)        (z1 → I)    Y           Y
Expectation-driven segmentation    (I → z1), zn    (z1 → I)    Y           Y
Expectation-driven detection       (I → z1), zn    zn          Y           Y

Notes: We wish to find a good approximation to the layer zl of interest. The approximation used is the conditional expected value of zl under the variational approximation, EQ[zl | zl−1, zl+1] = xl, as discussed in section 2.4.
Certain types of inference involve the use of top-down influences interacting with bottom-up inputs. For example, given a partially occluded image that has been recognized by higher layers, top-down influences can be used to reconstruct the hidden parts of the object (those features that are most likely given the input). Another type of inference is expectation-driven segmentation, where a prediction is presented at a higher level that may be used to explain cluttered, incomplete, or conflicting inputs at the lowest layer, and the desired output is the segmented object at the first layer (Grossberg, 1976; Rao & Ballard, 1997; Hecht-Nielsen, 1998). The expectation input (higher-layer, top-down) must come from a source external to the visual system, which in the brain could be higher cortical areas or other senses and in computer vision could be dependent on the task or provided by a user. If we wish to find which objects are in a cluttered scene (i.e., the desired output is the highest-layer object representation) based on prior knowledge of what might be there (higher-layer input), we perform expectation-driven detection. If the high-level prediction about the scene is consistent with the input, the system converges with the expectation at the highest layer, and the prediction is confirmed. If the system converges to a different pattern, this indicates that the expected object is not present (which could be considered a state of surprise).

Table 1 shows types of inference and the necessary information flow (top down or bottom up) needed in the model. As discussed below, we use a sparse-coding algorithm to transform the image into the first layer representation, z1, and vice versa (denoted by → in the table).

2.3 Boltzmann-Like Distributions for Layer-Conditional Probabilities.
Our next task is to postulate a form for the GWM distributions P that is powerful enough to generate the images seen in the world. A common choice in probabilistic modeling is the Boltzmann distribution, P(z) = ζ^{-1} exp(−βξ(z)), where the probabilities are related to a function ξ that assigns an energy to each state, ζ is a normalizing function, and β is a constant (which is a degree-of-randomness parameter related to temperature in physical systems, β ∝ T^{-1}; Hopfield, 1982; Hinton & Sejnowski, 1983; Hertz, Palmer, & Krogh, 1991). In thermodynamics and physical systems such as magnetic materials, the energy function captures the influence of each particle on its neighbors, where lower-energy states are more probable. The energy function usually has the form ξ(z) = −(1/2) \sum_{ij} w_{ij} z_i z_j, where w_{ij} is the symmetric interaction weight (w_{ij} = w_{ji}) between z_i and z_j. In the context of associative memories, the weights of the energy function are adjusted so that learned patterns form low-energy basins of attraction (e.g., using the Boltzmann machine learning rule; Ackley, Hinton, & Sejnowski, 1985). The Boltzmann distribution requires the weights w_{ij} to be symmetric and have zero self-energy, w_{ii} = 0 (Kappen & Spanjers, 2000).

There are three main advantages of symmetric weights. First, a dynamic network with symmetric interactions is guaranteed to be asymptotically stable and settle to a fixed point (the zero-temperature solution), which minimizes the Boltzmann energy function (Mezard, Parisi, & Virasoro, 1987). Second, there is a procedure (Gibbs sampling with simulated annealing) that generates samples from this distribution at a given nonzero temperature T. Finally, given a gradual enough annealing schedule for reducing T, Gibbs sampling will track the global minimum-energy state (highest-probability state) of the network and guarantee convergence to the zero-temperature solution as the temperature is lowered (Geman & Geman, 1984).

While the above properties are attractive and help explain the wide interest in the Boltzmann distribution and the Boltzmann machine, they may be of limited use in practice. It often takes considerable time for a stochastic network with symmetric weights to settle to an equilibrium state, possibly longer than a brain or artificial network has to make a decision (Welling & Teh, 2003), which accounts for the interest in simplifying approximations such as mean-field annealing (Peterson & Anderson, 1987). Furthermore, it has also been argued that the use of asymmetric weights can improve performance (such as by suppressing spurious memory states) and has greater biological plausibility (Parisi, 1986; Crisanti & Sompolinsky, 1988; Sompolinsky, 1988; Gutfreund, 1990; Apolloni, Bertoni, Campadelli, & de Falco, 1991; Kappen & Spanjers, 2000; Chengxiang, Dasgupta, & Singh, 2000). An additional motivation for admitting asymmetric weights is the notion that in hierarchical networks designed for invariant recognition, the relative strengths of the feedforward and feedback pathways will need to be different. Since neurons in higher layers tend to require inputs from multiple units to activate, the relative strength of the feedback connections to those units must be stronger than the feedforward weights to enable lower-layer activity (i.e., generative ability). The primary deterrent to the use of asymmetric weights is the difficulty associated with ensuring asymptotic stability of the resulting algorithms, which involves the use of significantly more complex stability arguments (Apolloni et al., 1991).
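A minimal sketch of the Gibbs-sampling procedure just mentioned, written for the symmetric-weight Boltzmann case; the weight matrix, bias, and annealing schedule are illustrative assumptions:

import numpy as np

def gibbs_sweep(W, theta, beta, z, rng):
    # One Gibbs sweep for P(z) proportional to exp(-beta * xi(z)) with
    # xi(z) = -0.5 z^T W z - theta^T z, W symmetric with zero diagonal.
    # Each unit is resampled from its conditional given all the others.
    for i in rng.permutation(len(z)):
        drive = W[i] @ z + theta[i]                 # input from the other units
        p_on = 1.0 / (1.0 + np.exp(-beta * drive))
        z[i] = 1 if rng.random() < p_on else 0
    return z

rng = np.random.default_rng(0)
n = 50
W = 0.1 * rng.standard_normal((n, n))
W = (W + W.T) / 2.0                                 # symmetric weights
np.fill_diagonal(W, 0.0)                            # zero self-energy
theta = 0.1 * rng.standard_normal(n)
z = rng.integers(0, 2, n)
for beta in np.linspace(0.1, 5.0, 50):              # crude annealing: raise beta (lower T)
    z = gibbs_sweep(W, theta, beta, z, rng)

The slow sweep-after-sweep settling visible in such a loop is precisely the practical cost discussed above.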
We allow for asymmetric weights and sidestep the stability issue by working within a finite-time horizon framework. The resulting simplicity of the finite horizon problem relative to the infinite horizon problem is well known in the dynamical systems literature (Bertsekas, 1995). In appendix B we design a learning rule that encourages convergence to the desired state within a small number of time steps τ. Also, the use of symmetric weights is merely sufficient for fixed points to exist; it is not a necessary condition. We use the terms Boltzmann-like and energy-like to distinguish our model (with asymmetric weights) from stricter Boltzmann distribution assumptions. The Boltzmann-like form of the NLCP is

P_B(z_l | z_{l-1}, z_{l+1}) = \frac{1}{\zeta(z_{l-1}, z_{l+1})} \exp(-\xi(z_l, z_{l-1}, z_{l+1}))   (NLCP-B),   (2.6)
where ξ is the energy-like function and ζ is a normalizing function, with

ξ(z_l, z_{l-1}, z_{l+1}) = -z_l^T W_{l,l-1} z_{l-1} - z_l^T L_l z_l - z_l^T W_{l,l+1} z_{l+1} - θ_l^T z_l,
ζ(z_{l-1}, z_{l+1}) = \sum_{z_l} \exp(-ξ(z_l, z_{l-1}, z_{l+1})),   (2.7)
where W_{l,l+1} are top-down weights from layer l + 1 to l, W_{l,l−1} are the bottom-up weights from layer l − 1 to l, L_l encodes the influence of units in layer l on other units in that layer (lateral weights), and θ_l is a bias vector. The summation in ζ is over all states of layer l. Note that if the properties of symmetric weights are desired, they can be used without changes to the variational approximation developed in section 2.4.

An important question is whether the Boltzmann-like distribution, equations 2.6 and 2.7, is adequate to model the GWM model of Figure 1. It is possible to construct densities that are not well represented by any set of weights W, L in equation 2.7. However, we do not need to model an arbitrary density, only densities that are sparse and therefore have more limited forms of dependence. Algorithms related to the Boltzmann machine have shown success on real-world vision tasks (Teh & Hinton, 2001; Hinton et al., 2006) and tend to confirm that the Boltzmann-like distribution is a reasonable starting point.

2.4 Simplified World Model Developed with a Variational Method. The Boltzmann-like distribution, equations 2.6 and 2.7, provides a reasonable form of the probabilities in the GWM, which allows feedforward, feedback, and lateral influences. Unfortunately, exact inference on z_l given z_{l−1}, z_{l+1} is intractable for reasonably sized models even when the parameters of P_B(z_l | z_{l−1}, z_{l+1}) are known, because of the need to sum over every possible state z_l in the normalizing function ζ.
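To make the intractability concrete, here is a brute-force sketch of the NLCP of equations 2.6 and 2.7; all weights are small random stand-ins, and the layer is kept tiny so that ζ can actually be enumerated:

import numpy as np
from itertools import product

def nlcp_table(z_lo, z_hi, W_dn, W_up, L, theta):
    # Enumerates P_B(z_l | z_{l-1}, z_{l+1}) by brute force. The
    # normalizer zeta requires a sum over all 2^{s_l} binary states of
    # the layer, which is what makes exact inference intractable for
    # realistic layer sizes.
    s = len(theta)
    states = np.array(list(product([0, 1], repeat=s)), dtype=float)
    xi = -(states @ W_dn @ z_lo
           + np.einsum('si,ij,sj->s', states, L, states)
           + states @ W_up @ z_hi
           + states @ theta)
    p = np.exp(-xi)
    return states, p / p.sum()               # zeta = p.sum()

s = 12                                        # 2^12 = 4096 states; s_l in the hundreds is hopeless
rng = np.random.default_rng(1)
states, probs = nlcp_table(rng.integers(0, 2, s), rng.integers(0, 2, s),
                           0.1 * rng.standard_normal((s, s)),
                           0.1 * rng.standard_normal((s, s)),
                           0.1 * rng.standard_normal((s, s)),
                           0.1 * rng.standard_normal(s))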
In this section, we use a variational method that approximates P_B(z_l | z_{l−1}, z_{l+1}) with a factorial distribution, P_Q(z_l | z_{l−1}, z_{l+1}). By variational, we mean that there are certain parameters x_l = {x_{l,i}} that are varied to make the distribution P_Q as close to P_B as possible. The form of P_Q is taken to be a generalized factorial Bernoulli distribution,

P_Q(z_l | z_{l-1}, z_{l+1}) = \prod_{i=1}^{s_l} \left( \frac{x_{l,i} - a_4}{a_1} \right)^{\frac{z_{l,i} - a_4}{a_1}} \left( 1 - \frac{x_{l,i} - a_4}{a_1} \right)^{1 - \frac{z_{l,i} - a_4}{a_1}},   (2.8)
where x_{l,i} are the variational parameters and a = [a_1, a_2, a_3, a_4] are additional constant parameters (a_2 and a_3 will be introduced later) that are used to encourage sparsity-inducing densities (see section 2.5). The dependence on z_{l−1}, z_{l+1} will be introduced through x_{l,i} as derived below. A sufficient condition for equation 2.8 to be a probability distribution is that (x_{l,i} − a_4)/a_1 + (1 − (x_{l,i} − a_4)/a_1) = 1 and (x_{l,i} − a_4)/a_1 ≥ 0, which is true for a_1 > 0 and x_{l,i} ≥ a_4. The slightly generalized Bernoulli distribution, equation 2.8, is based on a shift in the logical values of z_{l,i} in the energy function from {0, 1} to {a_4, a_1 + a_4} (the experiments below use {−0.05, 1.05}, which improves computational efficiency). Our formulation encompasses the two common choices for logical levels, {0, 1} and {−1, 1}; for example, if logical levels of {−1, 1} are needed, then a_4 = −1, a_1 = 2. Collecting the x_{l,i} into vectors x_l of the same size as z_l for each layer, it can be shown that the x_l are the conditional expected values for each layer,

x_l = E_Q[z_l | z_{l-1}, z_{l+1}].   (2.9)
Note that the variational parameter vector x_l is a random variable which is the minimum mean-squared error (MMSE) estimate of z_l given the values of its neighboring layers (Kay, 1993). We now find the x_{l,i} that minimize the Kullback-Leibler divergence (Cover & Thomas, 1991) between the conditional probabilities P_B(z_l | z_{l−1}, z_{l+1}) and P_Q(z_l | z_{l−1}, z_{l+1}),

KL(P_Q || P_B) = E_Q[\log P_Q(z_l | z_{l-1}, z_{l+1})] - E_Q[\log P_B(z_l | z_{l-1}, z_{l+1})],   (2.10)
where E_Q is the expected-value operator with respect to the distribution P_Q(z_l | z_{l−1}, z_{l+1}). When the expected value E_Q[z_{l,i}] = x_{l,i} is used, the first term is

E_Q[\log P_Q(z_l | z_{l-1}, z_{l+1})] = \sum_i \left[ \frac{x_{l,i} - a_4}{a_1} \log \frac{x_{l,i} - a_4}{a_1} + \frac{a_1 - x_{l,i} + a_4}{a_1} \log \left( 1 - \frac{x_{l,i} - a_4}{a_1} \right) \right].   (2.11)
The second term in equation 2.10 can be expanded:

E_Q[\log P_B(z_l | z_{l-1}, z_{l+1})] = E_Q[-\log(ζ) - ξ(z_l, z_{l-1}, z_{l+1})]
= E_Q[-\log(ζ) - z_l^T W_{l,l-1} z_{l-1} - z_l^T L_l z_l - z_l^T W_{l,l+1} z_{l+1} - θ_l^T z_l].   (2.12)
Again using the expected value E_Q[z_{l,i}] = x_{l,i},

E_Q[\log P_B(z_l | z_{l-1}, z_{l+1})] = -\log(ζ) - \sum_{ik} W^-_{ik} z_{l-1,k} x_{l,i} - \sum_{ik} L_{ik} x_{l,k} x_{l,i} - \sum_{ik} W^+_{ik} z_{l+1,k} x_{l,i} - \sum_i θ_{l,i} x_{l,i} + c_l,   (2.13)
where W^+_{ik}, W^-_{ik}, and L_{ik} are elements of the weight matrices W_{l,l+1}, W_{l,l−1}, and L_l, respectively, and, defining φ_l = (z_l − x_l), the term c_l = E_Q[(z_l − x_l)^T L_l (z_l − x_l)] = E_Q[φ_l^T L_l φ_l], which is zero assuming that L_{ii} = 0.³

2.4.1 Self-Consistency Conditions of the Variational Approximation. The variational parameters x_{l,i} that minimize the distance between P_B and P_Q, equation 2.10, are found by solving

\frac{∂ KL(P_Q || P_B)}{∂ x_{l,i}} = 0 = a_2 \left( \sum_k W^-_{ik} z_{l-1,k} + \sum_k L_{ik} x_{l,k} + \sum_k W^+_{ik} z_{l+1,k} \right) + \log \frac{a_1 - x_{l,i} + a_4}{x_{l,i} - a_4} - a_3,   (2.14)

using a constant term a_3 for the bias θ_{l,i}⁴ and factoring out a_2 from W^+, W^-, and L (with a slight abuse of notation, including factoring 1/a_1 into a_2, a_3; see equation 2.11). Setting equation 2.14 equal to zero and solving for x_{l,i},

x_{l,i} = f(v_{l,i}), \qquad v_{l,i} = \sum_k W^-_{ik} z_{l-1,k} + \sum_k L_{ik} x_{l,k} + \sum_k W^+_{ik} z_{l+1,k},   (2.15)
³ The term c_l = E_Q[(z_l − x_l)^T L_l (z_l − x_l)] = Tr[L_l Σ_{z_l}], where Σ_{z_l} is the covariance matrix of z_l under P_Q. Since z_l is assumed conditionally independent under P_Q, the nondiagonal elements of the covariance matrix are zero. We will disallow self-feedback (i.e., L_{ii} = 0), so that Tr[L_l Σ_{z_l}] is zero. However, it is straightforward to handle the case when L_{ii} ≠ 0 given the factorial form of P_Q.
⁴ For simplicity, we set θ_{l,i} = a_3 for all l, i. However, this assumption can be relaxed.
where f(·) is a sigmoid activation function parameterized by a = [a_1, a_2, a_3, a_4],

f(v) = \frac{a_1}{1 + \exp(-a_2 v + a_3)} + a_4.   (2.16)
Defining φ_{l,i} = z_{l,i} − x_{l,i} to be the approximation error, the state z is equal to x plus a random noise component, z_{l,i} = x_{l,i} + φ_{l,i}. This yields

v_{l,i} = \sum_k W^-_{ik}(x_{l-1,k} + φ_{l-1,k}) + \sum_k L_{ik} x_{l,k} + \sum_k W^+_{ik}(x_{l+1,k} + φ_{l+1,k})
        = \sum_k W^-_{ik} x_{l-1,k} + \sum_k L_{ik} x_{l,k} + \sum_k W^+_{ik} x_{l+1,k} + ε_{l,i},   (2.17)
where the φ terms have been collected into εl,i . By collecting all the terms for each layer into a vector, we obtain the single equation,
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_n \end{bmatrix}
= f\left(
\begin{bmatrix}
L_1 & W_{12} & 0 & \cdots & 0 \\
W_{21} & L_2 & W_{23} & & \vdots \\
0 & W_{32} & L_3 & W_{34} & \\
\vdots & & \ddots & \ddots & \\
0 & \cdots & 0 & W_{n,n-1} & L_n
\end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_n \end{bmatrix}
+
\begin{bmatrix} ε_1 \\ ε_2 \\ ε_3 \\ \vdots \\ ε_n \end{bmatrix}
\right),   (2.18)
which is the self-consistency condition for the variational approximation. The self-consistency condition, equation 2.18, is a necessary condition for the factorial NLCP (see equation 2.8) of the SWM to have been optimally fit to the GWM NLCP of equation 2.6.

2.4.2 Simplified World Model Forms. Further collecting all the estimates for each layer into a single vector X = [x_1^T, ..., x_n^T]^T and all the weights into a global weight matrix W, equation 2.18 can be written concisely:

X = f(WX + ε)   (SWM-E),   (2.19)
which is called the simplified world model on the expected values (SWM-E) and where the vector forms of the errors are

Φ = Z − X,  ε = (W − L)Φ.   (2.20)
The error ε is generally unobservable, and later we will have to make approximations to perform inference and learn the weights W. In particular, for inference and learning, we will neglect ε and use a certainty equivalence approximation (see section 4). The SWM can be written equivalently in terms of the binary state Z,

Z = f(WZ − LΦ) + Φ   (SWM-B),   (2.21)
which is called the SWM on the binary state (SWM-B). The SWM can also be written in an equivalent dual form on the preactivation state. Collecting the v_{l,i} into a state vector V ≡ WX + ε (and X = f(V)), we have

V = W f(V) + ε   (SWM-P),   (2.22)
which is called the SWM on the preactivation state (SWM-P). Equations 2.19, 2.21, and 2.22 are self-consistency conditions for the SWM. We will return to these key results in section 3.1, where we discuss how to find solutions to these conditions through evolution of updates in time. Note that, with a slight abuse of notation, we refer to the self-consistency conditions themselves as the SWMs.

2.4.3 Relation to Other Variational Methods. Our approach is based on using a neighboring-layer conditional probability model matched to the hierarchical NLCP GWM using a factorial variational approximation. The use of a factorial variational approximation is known as a mean-field (MF) approximation in the statistical physics community (Peterson & Anderson, 1987), where the probabilities to be approximated are either unconditional (as typically done in statistical physics) or conditioned only on visible layers (as in the Boltzmann machine). A distinguishing feature of our method is that the Boltzmann-like NLCP admits the use of an approximating distribution P_Q that is factorial when conditioned on its neighboring layers; the resulting approximation is not factorial when conditioned only on the visible layers. This modeling assumption removes less randomness (and allows more generative capability) than the MF approximation conditioned on the visible layers (as in the deterministic Boltzmann machine; Galland, 1993). This is a richer model than the deterministic Boltzmann machine, as our conditional expectations, x_l = E_Q[z_l | z_{l−1}, z_{l+1}], retain more randomness than the nonhidden expectations x_l = E_Q[z_l | visible layers] of the deterministic Boltzmann machine. Our factorial approximation is reasonable, as it is equivalent to saying that the meaningful information about the world contained in any layer is provided by its immediate neighboring layers, so if we condition on neighboring layers, then only random ("meaningless") noise remains.
Figure 2: (A) Activation function f(v) with parameters a = [1.1, 2.0, 4.0, −0.05] (see equation 2.16). The limits of the activation function are [a_4, a_1 + a_4] = [−.05, 1.05], and the slope is set by a_2 and the bias by a_3. The shape of the activation function encourages sparsity by ensuring that small input activities v do not produce any positive output activity. In the simulations, the values of x = f(v) are thresholded so that x = [f(v)] ∈ [0, 1]; however, the values of f(v) are kept for use in the weight updates (see appendix B). (B) The probability density P(x; a) of a normal random variable (µ = 0, σ² = 1) after being transformed by the activation function, f(v) in equation 2.16, is a sparsity-inducing density if the parameters a are chosen properly. The parameters used are a = [1.1, 2.0, 4.0, −0.05]. (C) Probability P(x; a′) is not sparsity inducing with the standard set of parameters for sigmoid activation functions, a′ = [1, 1, 0, 0].
The factorial Bernoulli distribution is one of the most tractable and commonly used variational approximations (Jordan, Ghahramani, Jaakkola, & Saul, 1998); however, if more accurate approximations are desired, other distributions may be used, such as the second-order methods described by Welling and Teh (2003) and Kappen and Spanjers (2000), although at higher computational cost.

2.5 Activation Functions Can Encourage Sparse Distributions. The parameterized sigmoid activation function, equation 2.16, can be used to encourage sparse activation by appropriate choice of parameters a. Figure 2 shows the activation function, equation 2.16, when parameterized with a = [1.1, 2.0, 4.0, −0.05], which was chosen so that small levels of activation do not lead to positive values of f(v). Parameters a_2, a_3 can be viewed as prior constraints on the network weights. Theoretically, these weight scaling and bias terms could be learned, but from a practical standpoint, our networks are quite large, and the critical property of sparsity that makes learning tractable must be enforced from the early epochs, or too many extra weights would be updated.

We can reasonably assume that v = v_{l,i} as given by equation 2.15 is a normally distributed random variable due to the central limit theorem (Johnson, 2004), because v is the sum of (nearly) independent and identically distributed values with bounded variance.⁵ The density P(x; a) can then be found by transforming the normal density P(v) = N(µ, σ²) by the activation function 2.16 (see equation 1.2.26 of Murray, 2005). For the values of a given above and for µ = 0, σ² = 1, Figure 2B shows that P(x; a) is indeed a sharply peaked sparsity-inducing distribution. In contrast, Figure 2C shows P(x; a′) after being transformed by the sigmoid activation function with parameters a′ = [1, 1, 0, 0], which does not lead to a sparsity-inducing distribution. The choice of parameters µ, σ² is also important for the transformed distributions to be sparse. For example, if µ = 0, the variance must be less than about 2.0, or the resulting density will be bimodal. However, with the proper choice of initial conditions, we are able to ensure these conditions are met (see section 5).
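An empirical version of this check (illustrative code, not from the letter) passes normal samples through f with the two parameter sets compared in Figure 2:

import numpy as np

def f(v, a=(1.1, 2.0, 4.0, -0.05)):
    # Parameterized sigmoid of equation 2.16.
    a1, a2, a3, a4 = a
    return a1 / (1.0 + np.exp(-a2 * v + a3)) + a4

v = np.random.default_rng(0).standard_normal(100_000)   # P(v) = N(0, 1)
x_sparse = f(v)                    # the letter's parameters (Figure 2B)
x_plain = f(v, a=(1, 1, 0, 0))     # standard sigmoid (Figure 2C)
print((x_sparse <= 0).mean())      # most samples at or below zero: sparsity inducing
print((x_plain <= 0).mean())       # 0.0: the standard sigmoid never silences a unit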
then be found by transforming the normal density P(v) = N(µ, σ²) by the activation function of equation 2.16 (see equation 1.2.26 of Murray, 2005). For the values of a given above and for µ = 0, σ² = 1, Figure 2B shows that P(x; a) is indeed a sharply peaked, sparsity-inducing distribution. In contrast, Figure 2C shows P(x; a′) after the transformation by the sigmoid activation function with parameters a′ = [1, 1, 0, 0], which does not lead to a sparsity-inducing distribution. The choice of parameters µ, σ² is also important for the transformed distributions to be sparse. For example, if µ = 0, the variance must be less than about 2.0, or the resulting density will be bimodal. However, with the proper choice of initial conditions, we are able to ensure these conditions are met (see section 5).

5 There will be dependence in v between units that all represent the same feature or object; however, since all network layers are constrained to be sparse, these dependencies will be much less than the typical pixel-wise dependencies in the original images. Central limit theorems (CLT) that relax the independence assumptions have been developed (Johnson, 2004), and while these extensions are not strictly valid here, they give some level of credence to the belief that CLT-like results should hold in environments with statistical dependencies. The approximate normality of v is confirmed by simulations in Figure 8. Thus, we conclude that assuming normality of v is a reasonable and useful modeling choice, and one that has been made in other work on large, layered networks (Saul & Jordan, 1998; Barber & Sollich, 2000).

3 Recurrent Dynamic Network

Recognizing that solutions to important inference problems correspond to solutions of the self-consistency conditions derived in section 2.4, we generalize these conditions into a dynamic network capable of converging to a solution satisfying equation 2.18 in order to estimate the states x_l. We introduce a time index t for the iterations of this dynamic network, while our goal remains to estimate the state of the static SWM. There are n layers in the network, and the vector of activations for the lth layer at time t is denoted x_{l,t}, l = 1, ..., n, with layer sizes s = [s_1, ..., s_n]. The network is designed to enforce rapid convergence to the self-consistency conditions (see equation 2.18) for x_l, such that x_{l,t} → x_l. The state vector of all the layers at time t is denoted

$$X_t = \left[ x_1^T, x_2^T, \ldots, x_n^T \right]^T \in \mathbb{R}^N, \qquad (3.1)$$
where N is the size of the state vector (dropping the time index on the x_l inside the vector for clarity). The activity in all layers x_l is enforced to be sparse, and the number of nonzero elements in the layers is denoted r_t = [r_1, ..., r_n]. Figure 3 shows the four-layer network structure used for the experiments in this letter. Dotted lines indicate inputs and connections that are not used here but are allowed in the model. The layers used for input and output depend on the type of inference required. In this work, inputs are usually injected at either the highest or lowest layer (although in general, we may have inputs at any layer if additional types of inference are required).

Figure 3: Dynamic network used in the experiments. Input images I are first sparsely coded using the FOCUSS+ algorithm, which operates on nonoverlapping patches of the input image (see appendix A). This sparse overcomplete code u_1 is used as bottom-up input to the four-layer hierarchical network. Dotted lines indicate inputs (u_3) and connections (L_1) that are not used in the experiments in this letter but are allowed by the network.

We define an input vector U_t^X (again dropping the time index inside the vector),

$$U_t^X = \left[ u_1^T, u_2^T, \ldots, u_n^T \right]^T, \qquad (3.2)$$
where u_1 is a sparsely coded input image (see appendix A) and u_n is an m-out-of-n binary code called the object code, which represents the classification of the object. The advantage of using an m-out-of-n object code is that it allows more objects to be represented than the size n of the highest layer, which is the limitation of 1-out-of-n codes. The object code provides a high representational capacity and robustness to the failure of any individual unit or neuron, both of which are desirable from a biological perspective. In addition, we can represent new objects without adjusting the size of the highest layer, u_n, by creating new random object codes. For recognition and reconstruction, the input u_1 is the coded image, and the object code input is zero, u_n = 0. When the network is used for imagination, the input is the object code presented at the highest layer u_n and random noise at u_2, and the output is the reconstructed image at the lowest layer. For expectation-driven segmentation, both the u_1 and u_n inputs are used. Table 1 shows the layers used for input and output for each type of inference. A sketch of how such object codes can be generated is given below.
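To illustrate the capacity argument, the following minimal Python sketch generates a random m-out-of-n object code and counts the distinct codes available; the helper name is illustrative, and the sizes are those used later in section 6.1, not part of this derivation:

import numpy as np
from math import comb

def random_object_code(n: int, m: int, rng: np.random.Generator) -> np.ndarray:
    """Generate a random m-out-of-n binary object code."""
    code = np.zeros(n)
    code[rng.choice(n, size=m, replace=False)] = 1.0
    return code

rng = np.random.default_rng(0)
c = random_object_code(256, 10, rng)  # layer 4 size and r_4 = 10 used in section 6.1
print(int(c.sum()))                   # -> 10 active units
print(comb(256, 10))                  # ~2.8e16 possible codes, versus 256 for 1-out-of-n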
3.1 Dynamic Network Form. The recurrent dynamic network (DN-E) is the time-dependent generalization of the self-consistency conditions 2.18 of the SWM-E, given by

$$\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_n \end{bmatrix}_{t+1} = f\left( \begin{bmatrix} L_1 & W_{12} & 0 & \cdots & 0 \\ W_{21} & L_2 & W_{23} & \ddots & \vdots \\ 0 & W_{32} & L_3 & \ddots & 0 \\ \vdots & \ddots & \ddots & \ddots & W_{n-1,n} \\ 0 & \cdots & 0 & W_{n,n-1} & L_n \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_n \end{bmatrix}_t + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \\ \vdots \\ \varepsilon_n \end{bmatrix}_t \right) + \begin{bmatrix} u_1 \\ u_2 \\ u_3 \\ \vdots \\ u_n \end{bmatrix}_{t+1}, \qquad (3.3)$$
which can be written in the compact form

$$X_{t+1} = f(W X_t + \varepsilon_t) + U_{t+1}^X \qquad \text{(DN-E)}, \qquad (3.4)$$
where U_t^X is the input to the network, which can include a sparsely coded input image u_1, a top-down input u_n consisting of an object code, or both. Our goal will be to learn a W such that network 3.4 rapidly converges to a steady state, given transient or constant inputs U_t^X. We will attempt to enforce the steady-state self-consistency behavior at a finite time horizon t = τ, where the horizon τ is a design parameter chosen large enough to ensure that information flows from top to bottom and bottom to top, and small enough for rapid convergence. Because of the block structure of W, information can pass only to adjacent layers during one time step t. (We use the terms time step and iteration interchangeably.) For example, in a four-layer network, it takes only four time steps to propagate information from the highest to the lowest layer, while the network may require more iterations to converge to an accurate estimate. A relatively small number of iterations, on the order of 8 to 15, will be shown to work well.
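As a concrete illustration of equations 3.3 and 3.4, the following minimal sketch builds a block-tridiagonal W and iterates the DN-E update. The layer sizes, the weight scales, and the exact parameterization assumed for f (a shifted, scaled logistic matching the limits, slope, and bias roles described in section 2.5) are illustrative assumptions, not the trained values used in the experiments:

import numpy as np

def f(v, a=(1.1, 2.0, 4.0, -0.05)):
    # Assumed form of the parameterized sigmoid of equation 2.16:
    # limits [a4, a1 + a4], slope set by a2, bias by a3.
    a1, a2, a3, a4 = a
    return a1 / (1.0 + np.exp(-(a2 * v - a3))) + a4

def block_tridiagonal_W(sizes, rng):
    # Random W with the sparsity pattern of equation 3.3: lateral blocks L_l
    # on the diagonal, W_{l,l+1} (feedback) and W_{l+1,l} (feedforward) off it.
    N = sum(sizes)
    W = np.zeros((N, N))
    offs = np.cumsum([0] + sizes)
    for l in range(len(sizes)):
        a, b = offs[l], offs[l + 1]
        W[a:b, a:b] = -0.0005 * rng.random((sizes[l], sizes[l]))  # inhibitory lateral L_l
        if l + 1 < len(sizes):
            c, d = offs[l + 1], offs[l + 2]
            W[a:b, c:d] = 0.005 * rng.standard_normal((sizes[l], sizes[l + 1]))  # feedback
            W[c:d, a:b] = 0.005 * rng.standard_normal((sizes[l + 1], sizes[l]))  # feedforward
    return W

rng = np.random.default_rng(0)
sizes = [64, 32, 32, 16]                      # illustrative layer sizes
W = block_tridiagonal_W(sizes, rng)
X = np.zeros(sum(sizes))
U = np.zeros(sum(sizes))                      # zero input; in practice u_1 or u_n is nonzero
for t in range(10):                           # tau on the order of 8 to 15
    eps = 0.0 * rng.standard_normal(X.size)   # certainty-equivalence mode: eps_t = 0
    X = np.clip(f(W @ X + eps) + U, 0.0, 1.0) # thresholded to [0, 1] as in section 5.3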
Table 2: Progression of Models Developed in Sections 2 and 3.

Hierarchical generative world model (GWM)
    Inference given neighboring layers: P(z_l | z_{l-1}, z_{l+1})           (GWM, equation 2.6)
        ↓
Simplified world model (SWM) (self-consistency conditions)
    Variational approximation E_Q[Z] = X leads to: X = f(WX + ε)            (SWM-E, equation 2.19)
    Binary state: Z = f(WZ − L) + ε                                         (SWM-B, equation 2.21)
    Equivalent preactivation state: V = W f(V) + ε                          (SWM-P, equation 2.22)
        ↓
Dynamic network (DN) (discrete time)
    State update: X_{t+1} = f(W X_t + ε_t) + U_{t+1}^X                      (DN-E, equation 3.4)
    Preactivation state update: V_{t+1} = W f(V_t) + ε_{t+1} + U_{t+1}^V    (DN-P, equation 3.5)
3.2 Preactivation State-Space Model. In the previous section, we created a dynamic network on the state vector X_t based on the SWM-E. By defining an equivalent model on the preactivation vector V_t, we create another dynamic network, which will be used in deriving the learning algorithm (see section 4). Generalizing the preactivation model (SWM-P, equation 2.22) to a dynamic network,

$$V_{t+1} = W f(V_t) + \varepsilon_{t+1} + U_{t+1}^V \qquad \text{(DN-P)}, \qquad (3.5)$$
where V_t is assumed to be a gaussian vector, as discussed in section 2.5, and U_{t+1}^V is the input or initial conditions for the preactivation state (compare with U_{t+1}^X for the state X). The DN-E and DN-P are equivalent representations of a dynamic generative world model. Interpreting the layers of V_t as the hidden states of the generative visual-world model, the visible world is found with the read-out map,

$$Y_t = C\, g(V_t) + \text{noise}, \qquad (3.6)$$
where g(·) is the output nonlinearity and C = [1, 0, ..., 0] hides the internal states (only the first-layer block is read out). Table 2 summarizes the moves made from the generative world model of section 2 to the dynamic networks of this section.
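A minimal sketch of one DN-P step (equation 3.5) and the read-out map (equation 3.6), reusing f and NumPy from the sketch in section 3.1; g and the layer-1 size s1 are illustrative placeholders:

def step_dn_p(W, V, U_V, noise_std, rng):
    # One preactivation update, equation 3.5.
    eps = noise_std * rng.standard_normal(V.size)
    return W @ f(V) + eps + U_V

def read_out(V, s1, g):
    # Y_t = C g(V_t) with C = [1, 0, ..., 0]: keep only the first-layer block.
    return g(V)[:s1]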
4 Finding a Cost Function for Learning the Weights W

The dynamic networks of the previous section can perform visual inference by being forced to approximate the self-consistency conditions of the simplified world model (SWM). This can be done assuming that the weights W are known. Now we turn to the problem of learning these weights given a set of training data. In this section, we proceed in a Bayesian framework assuming W is a random variable^6 and derive a cost function after suitable approximations. The labeled training set is denoted Y̌ = {Y̌^(1), ..., Y̌^(K)}, where the kth element Y̌^(k) is a vector with a sparse coding of image k at its first layer y̌_1^(k) and the corresponding object code y̌_n^(k) at the highest layer, and zero vectors at the other layers,

$$\check{Y}^{(k)} = \left[ \check{y}_1^T \; 0 \; \ldots \; 0 \; \check{y}_n^T \right]^T, \qquad (4.1)$$
where the superscript index of the pattern k on each layer (i.e., y̌_n^(k)) has been omitted for clarity. The cost function for W is derived using the DN-P dynamics on the preactivation state V_t, equation 3.5. During training, for each pattern k, we create an input time series U_t^X from the data set as follows: U_t^X = Y̌^(k) for t = 1, 2, 3 and U_t^X = 0 for 4 ≤ t ≤ τ. This choice of U_t^X starts the dynamic network in the desired basin of attraction for the training pattern Y̌^(k) (U_t^X = Y̌^(k) for t = 1, 2, 3). The network is then allowed to iterate without input (U_t^X = 0 for 4 ≤ t ≤ τ), which with untrained weights W will in general not converge to the same basin of attraction. The learning process attempts to update the weights so that the training inputs are basins of attraction and to create middle-layer states consistent with that input. The set of inputs for pattern k for all the time steps is denoted U^(k) = {U_1^(k), ..., U_τ^(k)}, and for the entire data set we have U = {U^(1), ..., U^(K)}. Similarly, for each pattern in the preactivation state, we have V^(k) = {V_1^(k), ..., V_τ^(k)}, and for the whole data set, V = {V^(1), ..., V^(K)}. Assuming that the weights W are random variables, their posterior distribution is found by Bayes' rule,

$$P(W \mid V; U) = \frac{P(V \mid W; U)\, P(W)}{P(V; U)}. \qquad (4.2)$$
Our goal is to find the weights W that are most likely given the data and the generative model, and we use the maximum a posteriori (MAP) estimate,

$$\hat{W} = \arg\max_W P(W \mid V; U) = \arg\min_W \left[ -\ln P(V \mid W; U) - \ln P(W) \right], \qquad (4.3)$$

6 If a noninformative prior on W is used, this reduces to the maximum likelihood approach.
since the denominator in equation 4.2 does not depend on W. Correct assumptions about W are important for successful learning, which requires some form of constraint, such as prior normalization to use all of the network's capacity (see section 5). Assuming the patterns in the training set are independent, P(V | W; U) = ∏_k P(V^(k) | W; U^(k)), so that

$$\hat{W} = \arg\min_W \left[ -\sum_k \ln P(V^{(k)} \mid W; U^{(k)}) - \ln P(W) \right]. \qquad (4.4)$$
Note that the dynamic system 3.5 is Markovian under our assumption that the ε_t are independent (Bertsekas, 1995). The probability of the sequence of time steps can then be factored (omitting the pattern index k on the V_t for clarity),

$$P(V^{(k)} \mid W; U^{(k)}) = P(V_\tau, V_{\tau-1}, V_{\tau-2}, \ldots, V_1 \mid W; U^{(k)}) = \prod_{t=1}^{\tau} P(V_t \mid V_{t-1}, W; U^{(k)}), \qquad (4.5)$$
from the chain rule of probabilities. The preactivation state at each time V_t can be expressed in terms of each layer v_{l,t},

$$P(V_t^{(k)} \mid V_{t-1}^{(k)}, W; U^{(k)}) = \prod_{l=1}^{n} P(v_{l,t}^{(k)} \mid V_{t-1}^{(k)}, W; U^{(k)}), \qquad (4.6)$$
if we assume that the layers are conditionally independent of one another at time t given the state at the previous time V_{t-1}. Combining equations 4.4, 4.5, and 4.6,

$$\hat{W} = \arg\min_W \left[ -\sum_k \sum_{t=1}^{\tau} \sum_{l=1}^{n} \ln P(v_{l,t}^{(k)} \mid V_{t-1}^{(k)}, W; U^{(k)}) - \ln P(W) \right]. \qquad (4.7)$$
Since v_{l,t} is approximately normal (see section 2.5), for those layers where and when we have target values y_{l,t} from the data set and corresponding target states for v_{l,t},^7 the probability of the layer is

$$P_{targ}(v_{l,t} \mid V_{t-1}, W; U) = \frac{1}{(2\pi\sigma_v^2)^{s_l/2}} \exp\left( -\frac{1}{2\sigma_v^2}\, \varepsilon_{l,t}^T \varepsilon_{l,t} \right), \qquad (4.8)$$
where σ_v^2 is the variance of each component (assumed identical for all components). At other layers and times, the states v_{l,t} are still approximately gaussian, but we have no desired state, and so we enforce sparsity in these cases. We model the distributions at these layers by independent gaussians with fixed mean µ and variance σ_s^2,

$$P_{spar}(v_{l,t} \mid V_{t-1}, W; U) = \frac{1}{(2\pi\sigma_s^2)^{s_l/2}} \exp\left( -\frac{1}{2\sigma_s^2}\, \| v_{l,t} - \boldsymbol{\mu} \|^2 \right), \qquad (4.9)$$

where µ = [µ, ..., µ]^T of the appropriate size, with µ a design parameter. Introducing an indicator variable β that selects between P_spar and P_targ, we define β_{l,t} = 1 if we have target values for layer l at time t and β_{l,t} = 0 otherwise. The probability of each layer becomes

$$P(v_{l,t} \mid V_{t-1}, W; U) = \beta_{l,t} P_{targ}(\cdot) + (1 - \beta_{l,t}) P_{spar}(\cdot). \qquad (4.10)$$
Substituting equation 4.10 into equation 4.7 yields

$$\hat{W} = \arg\min_W \left[ \sum_k \sum_{t=1}^{\tau} \left( \varepsilon_t^T (\varepsilon_t \circ \beta_t) + \lambda (V_t - \mu)^T \left[ (V_t - \mu) \circ (1 - \beta_t) \right] \right) - \ln P(W) \right], \qquad (4.11)$$
where β_t ∈ R^N is the indicator vector for all elements of V_t, ∘ is the elementwise (Hadamard) vector product, and the constant terms depending on σ_v^2 and σ_s^2 have been combined into a new constant λ (again omitting the k inside the summation). Several things should be noted about this formulation. First, the objective function is derived in terms of the preactivation vector V_t instead of the postactivation vector X_t. This is done to use the gaussian form of equation 4.8 and is reminiscent of the technique in the generalized linear model literature of working with the linear structure vector of a nonlinear model (Gill, 2001). Second, the cost function, equation 4.11, is similar in form to those used in overcomplete coding algorithms, which are unsupervised and are designed to minimize the reconstruction error using as sparse a code as possible (Olshausen & Field, 1997; Kreutz-Delgado et al., 2003).

7 We assume that noise = 0 in equation 3.6 and that, given Y_t, we can solve for a corresponding value of V_t.
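The following minimal sketch evaluates the per-pattern cost of equation 4.11 in the notation above; the function name and the argument layout are illustrative assumptions:

import numpy as np

def cost_JPA(V, targets, beta, lam, mu):
    """V, targets, beta: lists over t of length-N arrays; beta_t in {0,1}^N."""
    J = 0.0
    for V_t, y_t, b_t in zip(V, targets, beta):
        eps_t = (V_t - y_t) * b_t                           # masked error, as in equation 4.13
        J += eps_t @ (eps_t * b_t)                          # reconstruction term of equation 4.11
        J += lam * ((V_t - mu) * (1 - b_t)) @ (V_t - mu)    # sparsity term on untargeted units
    return J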
The cost function for W, equation 4.11, is a function of the true state V_t and the error ε_t, to which we generally do not have access. In practice, we resolve this problem by generating estimates of the unknown V_t using a current estimate of the weights in the dynamic network (DN-P), under the certainty-equivalence approximation that ε_t = 0 (Bertsekas, 1995). Certainty equivalence is a standard technique in optimization when certain variables are random: for example, an unknown random variable can be replaced by its mean before optimizing the cost function. In our case, we estimate the unknown random V_t by the dynamic network's output V̂_t, which is then used to find the W that minimizes the cost function 4.11. For each pattern in the data set, we run DN-P (see equation 3.5) using the input sequence U_t^V = v_h U_t^X, where v_h = f^{-1}(1.0) (see Figure 2).^8 Running the network with certainty equivalence gives the estimated states

$$\hat{V}_t = W f(\hat{V}_{t-1}) + U_t^V. \qquad (4.12)$$

The errors ε_t used for learning are then the difference between V̂_t and the desired target states found from the data set,

$$\varepsilon_t = \left( \hat{V}_t - v_h \check{Y}^{(k)} \right) \circ \beta_t, \qquad (4.13)$$

where layers with no target values are set to 0 by the effect of β_t.

8 This is an approximation to U_t^V = f^{-1}(U_t^X) when U_t^X is binary, assuming elements U_{j,t}^V = 0 when U_{j,t}^X = 0.
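A minimal sketch of this certainty-equivalence forward pass and the masked error (equations 4.12 and 4.13), reusing f from the sketch in section 3.1; the function name and argument layout are illustrative:

def forward_errors(W, U_V, Y_check, beta, v_h, tau):
    """Run DN-P with eps_t = 0 and collect the masked training errors."""
    V_hat = np.zeros(W.shape[0])
    errors = []
    for t in range(tau):
        V_hat = W @ f(V_hat) + U_V[t]                     # equation 4.12
        errors.append((V_hat - v_h * Y_check) * beta[t])  # equation 4.13
    return errors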
Using the cost function in equation 4.11, we find a learning algorithm for the weights W (see appendix B) that is closely related to the backpropagation-through-time (BPTT) algorithm for training recurrent networks (Williams & Peng, 1990). The main drawback of the BPTT algorithm is that it is computationally inefficient due to the unrolling of the network for each time step. Our approach overcomes this drawback by using a small number of time steps τ and by taking advantage of the sparsity of every layer to update only weights between units with some nonzero activity.

5 Algorithm Implementation

This section summarizes the implementation details of the dynamic network and learning algorithm as used in the experiments.

5.1 Preparing the Data Set. The data set consists of K images representing M unique objects, where in general we have many different views or transformations of each object, so K > M. For each object m, we generate a sparse object code c_(m) ∈ R^{s_n} (the size of the highest layer) with r_n randomly selected nonzero elements, which is used as the desired value of the highest
layer. Each image k is preprocessed and converted into a sparse code (see section 6), which is used as the first-layer input, y̌_1. The data set of all images is Y̌ = {Y̌^(1), ..., Y̌^(K)}, where each pattern is

$$\check{Y}^{(k)} = \left[ \check{y}_1^T \; 0 \; \ldots \; 0 \; \check{y}_n^T \right]^T, \qquad (5.1)$$
and the highest layer is the object code, y̌_n^T = c_(m)^T.

5.2 Network Initialization. The network weights are initialized with small, random values uniformly distributed within certain ranges. The initial weight ranges are feedforward and feedback weights W ∈ [−0.01, 0.01] and lateral weights L ∈ [−0.001, 0.000] (which enables only lateral inhibition, not excitation). Self-feedback is not allowed, L_ii = 0, and lateral weights are not used in layer 1, for computational efficiency. Feedback weights are initialized to be the transposes of the corresponding feedforward weights, W_lm = W_ml^T, but are not restricted to stay symmetric during training.

5.3 Performing Inference Given Known Weights W. To run the network for the experiments, we create an input time series U_t^X from the images and object codes in the data set Y̌. The input can include y̌_1 or y̌_n (or both), as determined by the type of inference desired (see Table 1). For example, when the network is run for recognition, the inputs for the first few time steps are the coded image y̌_1, so that (U_t^X)^T = [y̌_1^T, 0, ..., 0], t = 1, 2, 3, and U_t^X = 0, t ≥ 4. When the network is run generatively, the object code is used as input, such that (U_t^X)^T = [0, ..., y̌_n^T], t = 1, ..., τ, and the network is then run for τ steps, after which the first layer contains a representation of an imagined image. Given a sequence of inputs U_t^X, the network is run in certainty-equivalence mode (no added noise) for a fixed number of discrete time steps, 0 ≤ t ≤ τ, with τ being 8 to 15 for the experiments below. With an initial state X̂_0 = 0, the network is run using

$$\hat{V}_t = W \hat{X}_{t-1}, \qquad \hat{X}_t = f(\hat{V}_t) + U_t^X, \qquad 1 \le t \le \tau. \qquad (5.2)$$
The state X̂_t is further restricted to lie in the unit cube, X̂_t ∈ [0, 1]^N. To improve computational efficiency, only a limited number of nonzero elements are allowed in each layer, r̄ = [r̄_1, ..., r̄_n], which is enforced on V̂_t at each layer by allowing only the largest r̄_l elements to remain nonzero.
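A minimal sketch of this inference iteration (equation 5.2 with the per-layer top-r̄ sparsification and the unit-cube restriction), again reusing f from the sketch in section 3.1; layer_slices and the helper names are illustrative choices:

def keep_top(v, r_bar):
    """Zero all but the r_bar largest entries of v."""
    out = np.zeros_like(v)
    if r_bar > 0:
        idx = np.argsort(v)[-r_bar:]
        out[idx] = v[idx]
    return out

def infer(W, U, layer_slices, r_bar, tau):
    """layer_slices: list of slice objects marking each layer in the stacked state."""
    X = np.zeros(W.shape[0])
    for t in range(1, tau + 1):
        V = W @ X                                # V_t = W X_{t-1}
        for s, r in zip(layer_slices, r_bar):
            V[s] = keep_top(V[s], r)             # enforce per-layer sparsity
        X = np.clip(f(V) + U[t], 0.0, 1.0)       # keep X_t in the unit cube
    return X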
5.4 Learning the Weights W. Training proceeds in an online, epoch-wise fashion. In each epoch, a subset of patterns is chosen from Y̌, and inputs are created with the coded image in the first layer for the first three time steps, so that U_t^X = [y̌_1^T, 0, 0, y̌_n^T]^T, t = 1, 2, 3, and U_t^X = 0, t ≥ 4. Input patterns must be removed at some point during training because otherwise there would be no error gradient to enforce learning of reconstruction. Presenting the input for three time steps was found to give better performance than other input lengths (see section 6.1). The state X̂_t and preactivation state V̂_t from running the network (see equation 5.2) are saved for each t ≤ τ. The error vector for weight updates is ε_t = (V̂_t − v_h Y̌^(k)) ∘ β_t (see equation 4.13). Weight updates Δw_ji are given by equation B.14. In standard gradient descent, weight updates naturally become small when errors are small. However, since we use an additional sparsity-enforcing term, weight updates will still occur in order to sparsify the middle layers even when both the highest- and lowest-layer errors are zero. Training stops after a certain number of epochs are completed. For computational efficiency when learning sparse patterns, only a small set of weights w_ji is updated for each pattern. During our simulations, X̂_t is found by thresholding the activation function output f(V̂_{t-1}) to [0, 1], resulting in a sparse X̂_t under certain conditions (see section 2.5). Weights are then updated between units only when the source unit X_{i,t} is active and the target unit X_{j,t} is either active or has nonzero error ε_{j,t}.^9 During the initial epochs of learning, there must be enough weight strength to cause activation throughout the middle layers. As learning progresses, the activity is reduced through the sparseness-enforcing term.

9 In theory, the thresholding should not significantly affect the learning. However, due to the size of the network, it was not practical to compare thresholded versus nonthresholded performance. Even with the smallest dictionary size (64 × 64, layer 1 input size = 4096), there are about 5,570,000 weights. Thresholding reduces the actual number of weights updated to about 45,000 per pattern, an increase in speed of over 100 times.
5.5 Testing for Classification. To classify an input image once the network has settled into a stable state, the last layer's activation x_n is compared with the object codes c_(m) to find the class estimate,

$$\mathrm{Class}(x_n) = \arg\min_{m \in \{1, \ldots, M\}} \left\| x_n - c_{(m)} \right\|. \qquad (5.3)$$
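A one-line sketch of this classifier, where codes is the M × s_n matrix whose rows are the object codes c_(m):

def classify(x_n, codes):
    # Equation 5.3: nearest object code in Euclidean distance.
    return int(np.argmin(np.linalg.norm(codes - x_n, axis=1)))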
5.6 Weight Normalization. In early experiments with the learning algorithm, we found that some units were much more active than others, with the corresponding rows in the weight matrices much larger than average. This suggests that constraints need to be added to the weight matrices to ensure that all units have reasonably equal chances of firing. These constraints can also be thought of as a way of avoiding certain units being starved of connection weights. A similar issue arose in the development of our dictionary learning algorithm (Kreutz-Delgado et al., 2003) and led us to enforce equality among the norms of each column of the weight matrix. Here, both row and column normalization are performed on each weight matrix (feedforward, lateral, and feedback). Normalization values are set heuristically for each layer, starting from an initial value of 1.0 and increasing the layer normalization until sufficient activity can be supported by that layer. The normalization values remain constant during network training and are adjusted from trial to trial.
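The normalization procedure is not given in closed form in the text; the following is a sketch of one plausible implementation, in which each row and then each column of a weight matrix is rescaled toward a per-layer target norm (the helper name and the exact rescaling rule are assumptions):

def normalize_rows_cols(W, target_norm, eps=1e-12):
    """Rescale rows, then columns, of W toward the target Euclidean norm."""
    row_norms = np.linalg.norm(W, axis=1, keepdims=True)
    W = W * (target_norm / np.maximum(row_norms, eps))
    col_norms = np.linalg.norm(W, axis=0, keepdims=True)
    W = W * (target_norm / np.maximum(col_norms, eps))
    return W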
6 Visual Recognition and Inference Experiments

In this section, we detail experiments with the learning algorithm developed above and demonstrate four types of visual inference: recognition, reconstruction, imagination, and expectation-driven segmentation. A set of gray-scale images was generated using the Lightwave photorealistic rendering software.^10 Each of 10 objects was rotated 360 degrees about its vertical axis in 2-degree increments, for a total of 10 × 180 = 1800 images, of which 1440 were used for training and the remaining 360 were held out for testing (see Figure 4). All images were 64 × 64 pixels. Before images can be presented to the network, they must be sparsely coded, which is done with a sequence of preprocessing steps (see Figure 5). First, each image is edge-detected^11 to simulate the on-center/off-center contrast enhancement performed by the retina and LGN (see Figure 5B). Edge-detected images are then scaled by subtracting 128 and dividing by 256, so that pixel values lie in [−0.5, 0.5]. Next, each image is divided into 8 × 8 pixel patches and sparsely coded with FOCUSS+ using a dictionary learned by FOCUSS-CNDL+ as described in appendix A (see Figure 5C). Dictionaries of size 64 × 64, 64 × 128, and 64 × 196 were learned to compare the effect of varying degrees of overcompleteness on recognition performance. (Figures 6 to 14 in this section are from experiments with the 64 × 196 dictionary.) Table 3 shows the accuracy and diversity of the image codes. As dictionary overcompleteness increases from 64 × 128 to 64 × 196, both mean square error (MSE) and mean diversity decrease; images are more accurately represented using a smaller number of active elements chosen from the larger overcomplete dictionary. As seen in Figure 5C, the reconstructed images accurately represent the edge information even though they are sparsely coded (on average, 192 of 12,288 coefficients are nonzero). Finally, the nonnegative sparse codes are thresholded to {0, 1} binary values before being presented to the network; any value greater than 0.02 is set to 1 (see Figure 5D). This stage, however, does introduce errors in the reconstruction process, and the fidelity of the network's reconstructions will be limited by the binarization. A histogram of coefficient values before binarization is given in Figure 8.

10 Available online at www.newtek.com/products/lightwave/.
11 Edge detection was done with XnView software (www.xnview.com) using the edge detect light filter, which uses the 3 × 3 convolution kernel [0 −1 0; −1 4 −1; 0 −1 0].
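A minimal sketch of this preprocessing chain (edge detection with the kernel of note 11, scaling, patching, sparse coding, and binarization), assuming a FOCUSS+ coder focuss_plus(A, y) such as the one sketched in appendix A:

import numpy as np
from scipy.signal import convolve2d

EDGE_KERNEL = np.array([[0, -1, 0], [-1, 4, -1], [0, -1, 0]], dtype=float)

def preprocess(image, A, threshold=0.02, patch=8):
    """image: 64 x 64 gray-scale array; A: learned dictionary; returns binary u_1."""
    edges = convolve2d(image, EDGE_KERNEL, mode="same")
    edges = (edges - 128.0) / 256.0                   # scale pixel values toward [-0.5, 0.5]
    codes = []
    for i in range(0, edges.shape[0], patch):
        for j in range(0, edges.shape[1], patch):
            y = edges[i:i + patch, j:j + patch].reshape(-1)
            codes.append(focuss_plus(A, y))           # sparse nonnegative code per patch
    u1 = np.concatenate(codes)
    return (u1 > threshold).astype(float)             # binarize to {0, 1}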
Figure 4: (A) Objects used in the experiments, showing one of the 180 views of each object. Images are 64 × 64 pixel gray scale. (B) Sample rotated object images in the data set.
6.1 Recognition with a Four-Layer Network. To test recognition performance, a four-layer network was trained using the data set described above. The training parameters of the network are given in Table 4. Note that all the lateral interactions were forced to be inhibitory or 0, and that no lateral connections were used in the first layer (we assume the increase in sparsity produced by the FOCUSS+ iterations models the layer 1 lateral connections). Coded images were presented to the first layer of the network for the initial three time steps. Random object codes with r_4 = 10 nonzero elements were used on the highest layer. Training took between 11 and 22 hours, depending on dictionary size, using an Intel Xeon 2.8 GHz processor. Classification performance reached 100% accuracy on the test set after 135 epochs, but training continued until 1000 epochs to increase the reconstruction accuracy at the first layer. Figure 6 shows the iterations of the network state X̂_t during classification of a test set image.
Figure 5: Several preprocessing steps are done before presentation to the network: edge detection, FOCUSS+ sparse coding, and binarization of coefficients. (A) Original images. (B) Edge-detected images. (C) Reconstructions from FOCUSS+ codes using a learned overcomplete dictionary. (D) Reconstructions from binarized FOCUSS+ codes.

Table 3: Coding Performance on 64 × 64 Pixel Images (Blocked into 8 × 8 Patches) Using Complete and Overcomplete Dictionaries.

                                            Diversity
Dictionary Size   Layer 1 Size   MSE        Maximum   Mean     Minimum
64 × 64           4096           0.00460    0.0449    0.0266   0.0103
64 × 128          8192           0.00398    0.0339    0.0240   0.0128
64 × 196          12,288         0.00292    0.0221    0.0156   0.0085

Note: Mean squared error (MSE) is calculated over all 8 × 8 patches in the image, and diversity = (#nonzero coefficients)/(layer 1 size).
The first row shows the FOCUSS+ coded input image and the original. The next rows show the activity of each layer and the reconstructed image from the first layer. The object was presented for three time steps and then removed, so that all activity on layer 1 for t ≥ 4 results from network feedback. As the iterations proceed, the reconstruction completes the outline of the airplane and becomes stronger in intensity.
Figure 6: Recognition of a test set object. Each row shows the network activity X̂_t at a time step. In layer 4, a filled marker indicates that the unit is active and is part of the correct object code, ◦ that the unit is in the object code but inactive, and × that the unit is active but should not be. When t > 3, there is no external input, and the reconstructed image in layer 1 is due only to network feedback. At t = 4 in layer 4, there are four incorrectly activated units (×), but at later times, the dynamics of the network suppress these incorrectly active units.
Table 4: Network Parameters for Training the Four-Layer Network with the 64 × 196 Overcomplete Dictionary, Corresponding to a Layer 1 Size of 12,288.

s (layer size)                      [12,288, 512, 512, 256]
r̄ (maximum diversity of layer)      [430, 100, 100, 100]
τ (time iterations per pattern)     8
η (learning rate)                   0.002
λ (regularization parameter)        0.005
µ (target mean for hidden layers)   −4.0
Epoch size (number of patterns)     100
Maximum number of epochs            1000
Feedforward weight range            [−5.0, 5.0]
Feedback weight range               [−5.0, 5.0]
Lateral weight range                [−5.0, 0.0]
Layer 1 norms (FB)                  [12.0]
Layer 2 norms (FF, L, FB)           [12.0, 2.1, 2.1]
Layer 3 norms (FF, L, FB)           [5.9, 2.1, 1.5]
Layer 4 norms (FF, L)               [1.5, 1.5]

Note: For other dictionary sizes, the size of the first layer was 8192 (64 × 128 dictionary) or 4096 (64 × 64 dictionary), with all other parameters as listed in the table.
In layer 4, the marker shape indicates whether the unit is active and part of the correct object code (filled marker), part of the object code but inactive (◦), or active when it should not be (×). At t = 4, all 10 of the highest-layer units in the object code for the airplane are active, so the image is classified correctly; however, four other units are active that should not be (×). At later iterations, these extra incorrect units are deactivated (or "sparsified away"), so that at t ≥ 5, only those units in the object code are active, demonstrating the importance of lateral connections in the highest layer. Activity in layers 2 and 3 also decreases with time. Presenting rotated test set views of the object shows that the network has learned basins of attraction for the other orientations. Figure 7 shows the state of the network at t = 7 after presenting various rotations of the airplane. The invariance of the representation increases from layer 1 (with nearly completely different units active) through layer 3 (with many of the same units active) to layer 4 (which has identical activity for all four orientations of the airplane). Training on rotated objects gives the network some robustness to small translations: when tested on images translated ±1 pixel in each direction, recognition accuracy is 96.9% on the test set. However, in general, we make no claim that our network has learned transformations (such as translation or scaling) that it has not seen. The network includes many parameters (see Table 4), and learning is more sensitive to some than to others. For example, the maximum activities r̄_3 and r̄_4 for layers 3 and 4 can vary quite widely without noticeable effect on performance, while increasing r̄_2 to 512 (from 100) increases the training time by more than an order of magnitude. This is because weights are updated only between active units, and increasing the maximum number of active units on layer 2 results in a very large number of weights to and from layer 1 that must be updated.
Figure 7: Each row is the network state Xt at t = 7 after presenting various rotated images of the airplane (test set images, views unseen during training), demonstrating that multiple basins of attraction can be learned for each object. Higher layers show more invariant representations than lower layers, with layer 4 showing the fully invariant representation of the airplane.
When the diversity penalty is turned off (λ = 0), the average diversity of the second and third layers increases by about 30% and 60%, respectively, with no significant change in MSE or classification rates. This demonstrates that using the diversity penalty results in more efficient (sparser) representations, consistent with our sparse generative-world model. We also experimented with different variations of the input time series, and these changes had more dramatic effects on performance. Two experiments were done with 12 time steps: (1) with the input presented for 6 time steps and turned off for 6 steps, and (2) with a linear decay 1 − t/6 (input presented at full strength at the first time step and decaying to 0 at the sixth time step). The performance of both of these experiments was worse than that of the original method (input presented for three steps). For the six-step input, the recognition accuracy reached only 85% on the test set; for the decaying six-step input, the accuracy was 90%, compared with 100% using the original method. It is not clear why performance drops, but there seems to be a reduction in middle-layer activity. Perhaps adjusting the normalization or other parameters could improve these results.
Figure 8: (A) Histogram of FOCUSS+ coefficient values before binarization. There are 2.9 × 10^6 elements in the zero bin, and the maximum coefficient value is 2.9. (B) Histogram and gaussian fit of a randomly chosen unit in the preactivation state V_t in layer 3 before training. (C) Histogram and gaussian fit of the residual ε_t in layer 1 after training.
To test the gaussian assumptions made regarding V_t and the errors ε_t, we plot histograms and normal curve fits of randomly chosen units in Figures 8B and 8C. From Figure 8B, we can see that the distribution of units in V_t is quite normal before training. In Figure 8C, we see that after training, the errors ε_t for layer 1 are less gaussian but still reasonably modeled as such. Also, there is more mass in the negative tail, indicating patterns where the target values are 1 but the network output is much lower.

6.2 Reconstruction of Occluded Images. Using the same network trained in section 6.1, reconstruction is demonstrated using occluded images from the test set. Approximately 50% of the pixels are set to black by choosing a random contiguous portion of the image to hide. Figure 9 shows the network iterations during reconstruction, where an occluded image is presented for the first three time steps. By t = 3, the feedback connections to the first layer have reconstructed much of the outline of the copier object, showing that feedback from the second layer contains much of the orientation-dependent information for this object. Further iterations increase the completion of the outline, particularly of the bottom corner and lower-right panel. (Another example of reconstruction is shown in Figure 1.13 of Murray, 2005.) The network also performs well when recognizing occluded objects. Accuracy is 90% on the occluded test set objects with the complete dictionary (64 × 64) and 96% to 97% with the overcomplete dictionaries. Figure 9 shows that (as above) there are incorrectly activated units in layer 4 at t = 4, which are suppressed at later times. In contrast with Figure 6, here there is more activity in layer 2 as time progresses, presumably due to the activation of missing features during reconstruction. More insight into reconstruction can be gained by examining the receptive and projective fields of units in the middle layers.
Figure 9: Reconstruction of an occluded input image. As early as t = 3, feedback from layer 2 results in reconstruction of some of the outer edges of the object. More detail is filled in at later time steps. Layer 4 legend: filled marker = unit is active and in the correct object code; ◦ = unit is in the object code but inactive; × = unit is active but should not be (not in the object code).
Figure 10: Receptive and projective fields of four units in layer 2. For each unit, the top row shows the receptive fields (feedforward weights from layer 1 to 2), and the bottom row shows the projective field (feedback weights from layer 2 to 1). The weight vectors are converted into images by multiplying by the learned dictionary. The unit in the first column is tuned to respond to both the plane and the table, while its projective field appears to include many possible orientations of the plane.
Considering layer 2 (see Figure 10), we find that the receptive fields (top row) tend to learn a large-scale representation of a particular orientation of an object. This is mainly because the receptive fields are allowed to cover the entire first layer, and no topology is enforced on the weights. Some receptive fields (such as the first column of Figure 10) are tuned to two very different objects, suggesting that units are recruited to represent more than one object, as would be expected from an efficient distributed code. The projective fields are not as clearly specific to a particular orientation and include strong noise, which indicates that there must be inhibitory feedback from other layer 2 units contributing to the cleaner version of the layer 1 outputs when the full network is run.

6.3 Imagination: Running the Network Generatively. Imagination is the process of running the network generatively, with the input given as an object code at the highest layer. For this experiment, the network trained in section 6.1 is used with an object code clamped on the highest layer for all time steps. Random activity is added to the second layer at t = 3 so that the network has a means of choosing which view of the object to generate. It was found that increasing the feedback strength to the first and second layers (by multiplying the feedback weights by 5.0) increased the activity and quality of the imagined image at the first layer. Without this increase, the layer 1 reconstruction was very likely to settle to the 0 state. Figure 11 shows the results when the object code for the knight is presented. At t = 4, the reconstruction is a superposition of many features from many objects, but at later times, the outline of the object can be seen. The orientation of the generated image alternates between a front view (t = 5, 7) and a side view (t = 6, 8), which is reminiscent of the bistable percept effect.
Figure 11: Imagination using the object code for the knight as the top-down input and the injection of random activity in layer 2 at t = 3. The reconstruction is a bistable (oscillating) pattern of the object from the front and side views.
Figure 12: Imagination using random object codes as input to layer 4 and, at t = 3, random activity at layer 2. The images are the network's layer 1 state at t = 15, with top-layer object codes of the fire hydrant, grill, knight, copier, and airplane. For the first and last images, the network has settled into a superposition of multiple objects (fire hydrant and copier) or multiple orientations of the same object (airplane).
Not all trials of this experiment result in a bistable state; the majority converged to a single orientation. Interestingly, some orientations of certain objects appear to be generated much more often than others. These "canonical views" represent high-probability (low-energy) states of the network. A random sample of five imagined objects is shown in Figure 12, showing that a superposition of states can also occur, which is consistent with the projective field properties shown in Figure 10.

6.4 Expectation-Driven Segmentation: Out from Clutter. In expectation-driven inference, both an input image and a top-down expectation are presented to the network, and the output can be either the highest-layer classification or the lowest-layer reconstructed image. Here, we consider the latter case, where the desired output is a segmented image reconstructed from the first layer. The same network trained in section 6.1 is used here, with increased feedback strength as described in section 6.3. Cluttered input images are created by combining many objects from the data set at random translations, overlaid with a portion of the desired image (the same portion, 50%, used in the reconstruction experiment). This is a fairly difficult recognition problem, as the clutter in each image is composed of features from trained objects, so that competing features tend to confound recognition algorithms. Although the features from the clutter objects are likely to be in different locations than those seen during training, it is still a more difficult task than segmentation from a randomly chosen (untrained) background. The problem of expectation-driven segmentation differs from recognition in that we ask the network not, "What object is this?" but, "Assuming object X is here, what features in the image most likely correspond to it?" For this experiment, we present at t = 2, 3 the image of the occluded object in clutter, and at t = 1, ..., 4 the expectation that the object is present at the
highest layer. Figure 13 shows the network states when presented with a cluttered image and a top-level expectation of the knight object. The timing of the inputs was arranged so that the feedback and feedforward inputs first interact at t = 3 in layer 3. When t = 4, the input image is no longer presented, and the network feedback has isolated some features of the object. Later time steps show a sharper and more accurate outline of the knight, including edges that were occluded in the input image. At the highest layer, feedforward interactions from lower layers cause the correct object code to degrade. At t = 12, all the units in the object code for the knight were active, as well as four incorrectly active units, which still allows correct classification. To illustrate the need for the top-down expectation input in this case, Figure 14 shows the states at t = 1, 4, 8 when no object code is presented at layer 4: the activity gradually decays, and there is no reconstruction at layer 1. Comparing Figure 11 (imagination) and Figure 13 shows that the partial information provided in the cluttered image is enough to keep the network at a stable estimate of the segmentation, and in this case it prevents oscillations between two orientations (which occurred when only top-down input was present).
6.5 Overcompleteness Improves Recognition Performance. One of the central questions addressed in this work is how a sparse overcomplete representation in the early stages of visual processing, such as V1 in primates (Sereno et al., 1995), could be useful for visual inference. As described above, we trained networks using learned dictionaries of varying degrees of overcompleteness, 64 × 64, 64 × 128, and 64 × 196, with corresponding sizes of the first layer of 4096, 8192, and 12,288. Performance was compared on the test set objects, occluded objects, and objects in clutter. The cluttered images were created by overlaying the entire object on a cluttered background, resulting in a somewhat easier problem than the occluded-object-in-clutter images used in section 6.4, although here no top-down expectations were used to inform the recognition. Figure 15 shows the recognition accuracy on these three image sets. For the test set (complete images), all three networks performed at 99% to 100%, but for the occluded and cluttered images, there is a gain in accuracy when using overcomplete representations, and the effect is more pronounced for the more difficult cluttered images. For occluded objects, accuracy was 90% (324/360) for the complete dictionary and 97% (349/360) for the 3X overcomplete dictionary. The most significant improvement was with the cluttered images: accuracy was 44% (160/360) for the complete dictionary and 73% (263/360) for the 3X overcomplete dictionary. While the absolute classification rate for the cluttered images might appear low (44-73%), many of the misclassified objects were those of smaller size (e.g., the airplane and fire hydrant), which allowed more features from other, larger objects to be visible and confound the recognition. In addition, neither the dictionary nor the network was trained on images with clutter, so the network had no previous experience with this particular type of cluttered image.
Figure 13: Expectation-driven segmentation using occluded objects over a cluttered background. The cluttered input is presented at the lowest layer for t = 2, 3. Top-down expectations (the object code for the knight) are presented at the highest layer for t = 1, ..., 4. By t = 12, the network converges to a segmented outline of the knight in the correct orientation at the first layer. Layer 4 legend: filled marker = unit is active and in the correct object code; ◦ = unit is in the object code but inactive; × = unit is active but should not be (not in the object code).
Figure 14: Recognizing an occluded object in a cluttered background is difficult without top-down expectations. The same input image used in Figure 13 is presented for t = 1, 2, 3; however, no top-down inputs are present. A few representative time steps show that the activity gradually decays over time, and no object is reconstructed at layer 1. Layer 4 legend: filled marker = unit is active and in the correct object code; ◦ = unit is in the object code but inactive; × = unit is active but should not be (not in the object code).
7 Discussion

In this section, we discuss the motivations for our network and compare it with other recurrent and probabilistic models of vision. Additional discussion can be found in Murray (2005, sec. 1.8).

7.1 Why Sparse Overcomplete Coding and Recurrence? In the brain, early visual areas are highly overcomplete, with about 200 to 300 million neurons in V1 compared to only about 1 million neurons that represent the retina in the lateral geniculate nucleus (LGN) of the thalamus (Stevens, 2001; Ejima et al., 2003). As primate evolution has progressed, there has been an increase in the ratio of V1 to LGN size. While even the smallest of primates shows a high degree of overcompleteness, the increase in higher primates is linked with an increase in retinal resolution
Figure 15: Recognition performance on the test set (full object), occluded images (50% occlusion), and cluttered images with three different degrees of overcompleteness in layer 1 representation and learned dictionaries. Recognition performance improves with increased overcompleteness, particularly in the difficult cluttered scenes. Test set size is 360 images (36 views of 10 objects).
and presumably improved visual acuity (e.g., about 87 times overcomplete for the tarsier, compared with 200 to 300 times for humans). Mathematically, sparse coding strategies are necessary to make efficient use of overcomplete dictionaries because the dictionary elements are generically nonorthogonal. To provide a low-redundancy representation (Attneave, 1954; Barlow, 1959), a sparse set of elements must be chosen that accurately represents the input. If we have faith in the generative model postulated in Figure 1, real-world images can be accurately modeled as being caused by a small number of features and objects, supporting the choice of a sparse prior (even in the case of complete coding). Other benefits of sparse coding include making it easier to find correspondences and higher-order correlations, increasing the signal-to-noise ratio, and increasing the storage and representational capacity of associative memories (Field, 1994). Biological evidence for sparse coding ranges from the simple fact that average neural firing rates are low, 4 to 10 Hz (Kreiman, Koch, & Fried, 2000), to experiments finding that sparseness in V1 increases as larger patches of natural images are presented, indicating that a concise representation can be found by deactivating redundant features, presumably through the interaction of lateral and feedback inhibition (Vinje & Gallant, 2000). One of the successes of sparse-coding theory has been the learning of receptive fields that resemble the orientation and location selectivity of V1 neurons (Olshausen & Field, 1997), and extensions have been made to model complex cells (Hoyer & Hyvärinen, 2002).
While overcompleteness and sparse coding are important features of early vision in V1, perhaps the most striking aspect of higher visual areas is the amount of lateral and feedback connectivity within and between areas (Felleman & Van Essen, 1991). Even in V1, lateral and feedback input from other cortical areas accounts for about 65% of activity, with only 35% of the response directly due to feedforward connections from the LGN (Olshausen & Field, 2005). We showed in section 2 that feedback and lateral connections are required for many types of inference. In some recognition tasks, there is evidence that the brain is fast enough to complete recognition without extensive recurrent interaction (Thorpe, Fize, & Marlot, 1996). Consistent with this, our model is capable of quickly recognizing objects in tasks such as that of Figure 6, where the correct object code is found at t = 5. However, more difficult tasks such as segmentation (see Figure 13) require recurrence and would take longer for the brain (Lee, Mumford, Romero, & Lamme, 1998).

7.2 Related Work: Biologically Motivated Models of Vision. There have been many hierarchical models created to explain vision, and these fall into two main categories: feedforward only, or recurrent (which includes various types of feedback and lateral connections between layers). Some examples of the feedforward class are the Neocognitron model of Fukushima and Miyake (1982), the VisNet of Rolls and Milward (2000), and the invariant-recognition networks of Földiák (1991) and Riesenhuber and Poggio (1999). While many of these models use sparsity with some form of winner-take-all competition, which is usually interpreted as lateral interaction, they do not include feedback connections, so they are not capable of the range of inference described in section 2.2 and will not be discussed further here. One of the more closely related works is the dynamic network developed by Rao and Ballard (1997). A stochastic generative model for images is presented, and a hierarchical network is developed to estimate the underlying state. Their network includes multiple layers with feedforward and feedback connections, which are interpreted as passing the residuals from predictions at higher levels back to lower levels (but with no explicit learnable lateral connections or overcomplete representations). Experiments demonstrate recognition, reconstruction of occluded images, learning of biologically plausible receptive fields, and the ability to tell that an object had not been seen during training. Perhaps because of the computational requirements, only fairly limited recognition experiments were performed, using five objects (one orientation per object) and rotation-invariant recognition with two objects, each with 36 views used for training and testing (Rao, 1999). Newer versions of the Neocognitron include feedback connections and are demonstrated for recognition and reconstruction (Fukushima, 2005). The model posits two types of cells in each region of the system, S-cells and C-cells, in analogy with the simple and complex cells categorized by Hubel and Wiesel (1959). The S-cells are feature detectors, and the C-cells
pool the outputs of S-cells to create invariant feature detectors. To solve the reconstruction problem, further cell types and layers are added, and many of the layers have different learning rules. In contrast, our network is able to perform the various inference types without changes to the architecture or learning rule.

7.3 Related Work: Probabilistic Models in Computer Vision. Recent work in computer vision has investigated probabilistic generative models apart from direct biological motivation (Hinton et al., 2006; Hinton & Salakhutdinov, 2006; Fergus, Perona, & Zisserman, 2007; Sudderth, Torralba, Freeman, & Willsky, 2005). Most closely related to our work is the learning algorithm of Hinton et al. (2006) for hierarchical belief networks. The network has multiple hidden layers, followed by a much larger overcomplete associative memory (whereas our overcomplete stage occurs at the second layer), and a highest layer with a 1-out-of-n code for the object class. The first layer has real-valued inputs, while stochastic binary values are used at higher layers. Feedforward and feedback weights are learned, but no lateral connections are used, and during testing, only one forward-backward pass is made at each layer. When trained on a benchmark handwritten digit data set, the accuracy is competitive with the best machine learning methods, showing that generative hierarchical networks are promising for real-world vision tasks. Using a similar learning procedure with an autoencoder network architecture, Hinton and Salakhutdinov (2006) show applications to data compression and text classification. While there are many differences between this work and our algorithm, both address the same basic question of how to train hierarchical generative models. One important difference is that Hinton et al. (2006) use stochastic units and Gibbs sampling for generative inference, while we use a neighboring-layer conditional variational approximation. We believe the factorial approximation of equation 2.8 can be sufficiently accurate in the case of sparse activations and that enforcing a short enough time horizon τ makes learning computationally tractable. More experiments with known generative models will be needed to further evaluate the differences between these algorithms.

Fergus et al. (2007) develop a model for the classification of object categories in unsegmented images. The first step is finding a small set of interesting features using a saliency detector. For each category, a probabilistic model is learned for these features, including their relative position and scale. Impressive detection performance is achieved on real-world data sets. In contrast to our work, which models all the features in the image, Fergus et al. (2007) use only a small number of features (fewer than 30), so that, if run generatively, their model would reconstruct only a small subset of the features in each object. Using a saliency detector improves position and scale invariance (which would benefit our network); however, using only
this small feature set reduces performance when features of a class model cannot be found. In a related work, Sudderth et al. (2005) present a probabilistic model of object features and apply it to object categorization in real-world scenes. Similar to our model, and in contrast with Fergus et al. (2007), their model is a true multiclass classifier, which allows features to be shared between models of different objects and allows for more rapid classification without the need to run multiple classifiers. As above, the small feature set limits the potential detail of generative reconstruction. However, segmentation results show that regions such as "building," "car," and "street" can be detected in city scenes.

8 Conclusion

We have developed a framework and learning algorithm for visual recognition and other types of inference, such as imagination, reconstruction of occluded objects, and expectation-driven segmentation. Guided by properties of biological vision, particularly sparse overcomplete representations, we posit a stochastic generative world model. Visual tasks are formulated as inference problems on this model, in which inputs can be presented at the highest layer, the lowest layer, or both, depending on the task. A variational approximation (the simplified world model) is developed for inference, which is generalized into a discrete-time dynamic network. An algorithm is derived for learning the weights in the dynamic network, with sparsity-enforcing priors and error-driven learning based on the preactivation state vector. Experiments with rotated objects show that the network dynamics quickly settle into easily interpretable states. We demonstrate the importance of top-down connections for expectation-driven segmentation of cluttered and occluded images. Four types of inference were demonstrated using the same network architecture, learning algorithm, and training data. We show that an increase in overcompleteness leads directly to improved recognition and segmentation in occluded and cluttered scenes. Our intuition as to why these benefits arise is that overcomplete codes allow the formation of more basins of attraction and a higher representational capacity.

Appendix A: Sparse Image Coding with Learned Overcomplete Dictionaries

The dynamic network and learning algorithm presented above require that the inputs u_l at each layer be sparse vectors. To transform the input image into a suitable sparse vector, we use the focal underdetermined system solver (FOCUSS) algorithm for finding solutions to inverse problems. The FOCUSS algorithm represents data in terms of a linear combination of a small number of vectors from a dictionary, which may be overcomplete.
Other methods for sparsely coding signals include matching pursuit, basis pursuit, and sparse Bayesian learning, which were also evaluated for image coding (Murray & Kreutz-Delgado, 2006). The overcomplete dictionary is learned using the FOCUSS-CNDL (column-normalized dictionary learning) algorithm developed by Murray and Kreutz-Delgado (2001) and Kreutz-Delgado et al. (2003). The problem that FOCUSS-CNDL addresses here is that of representing a small patch of an image y ∈ R^m using a small number of nonzero components in the source vector x ∈ R^n under the linear generative model,

$$y = Ax, \tag{A.1}$$
where the dictionary A may be overcomplete, n ≥ m. The update equations and further discussion of the FOCUSS-CNDL algorithm in this context are given in Murray (2005, section 1.A). Parameters for FOCUSS-CNDL are: data set size = 20,000 image patches; block size N = 200; dictionary size = 64 × 64, 64 × 128, or 64 × 196; diversity measure p = 1.0; regularization parameter λmax = 2 × 10⁻⁴; learning rate γ = 0.01; number of training epochs = 150; reinitialization every 50 epochs. After each dictionary update, A is normalized to have unit Frobenius norm, ‖A‖F = 1, and equal column norms. Figure 1.18 of Murray (2005) shows the learned 64 × 196 dictionary after training on edge-detected patches of man-made objects (the data set described in section 6). Once the dictionary A has been learned, input images for the dynamic network (DN) are coded using the FOCUSS+ algorithm (Murray & Kreutz-Delgado, 2006). The input images are divided into consecutive nonoverlapping patches of the same 8 × 8 size used for dictionary learning. The FOCUSS+ algorithm consists of repeated iterations of equation 6 from Murray and Kreutz-Delgado (2006) over an image patch yk to estimate xk. Each xk is updated for 15 iterations with p = 0.5.
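For readers who want a concrete picture of the coding step, the following is a minimal NumPy sketch of a regularized FOCUSS iteration with the nonnegativity clamp used by FOCUSS+. The function name, the minimum-norm initialization, and the choice of λ (borrowed from the training parameters above) are our own assumptions; the authoritative update is equation 6 of Murray and Kreutz-Delgado (2006).

```python
import numpy as np

def focuss_plus(A, y, p=0.5, lam=2e-4, iters=15):
    """Sketch of a regularized FOCUSS iteration (hypothetical helper, not
    the paper's exact code): find a sparse, nonnegative x with y = A x."""
    m, _ = A.shape
    # Minimum 2-norm initialization (our choice; any dense start works).
    x = A.T @ np.linalg.solve(A @ A.T, y)
    for _ in range(iters):
        Pi = np.diag(np.abs(x) ** (2.0 - p))          # diag(|x_i|^{2-p})
        x = Pi @ A.T @ np.linalg.solve(A @ Pi @ A.T + lam * np.eye(m), y)
        x = np.maximum(x, 0.0)                        # FOCUSS+ nonnegativity clamp
    return x
```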
Appendix B: Derivation of Learning Algorithm for W

The learning algorithm for the weights W is derived similarly to the backpropagation through time (BPTT) algorithm (Williams & Peng, 1990). Using the preactivation cost function, equation 4.11, for an individual pattern,

$$J_{PA} = \frac{1}{2}\sum_{t=1}^{\tau}\Big[\varepsilon_t^T(\varepsilon_t \odot \beta_t) + \lambda\,(V_t-\mu)^T\big[(V_t-\mu)\odot(1-\beta_t)\big]\Big], \tag{B.1}$$
which uses states V_t generated from running the network in certainty-equivalence mode (⊙ denotes the elementwise product). The effect of the weight prior −ln P(W) will not be considered in this section, as it was found that enforcing periodic weight
normalization is more computationally efficient than using prior constraints in every weight update (see section 5). To minimize the cost J_PA, we update the weights using gradient descent,

$$\Delta w_{ji} = -\eta\,\frac{\partial J_{PA}}{\partial w_{ji}} = -\eta\sum_{t=1}^{\tau}\frac{\partial J_{PA}}{\partial V_{j,t}}\cdot\frac{\partial V_{j,t}}{\partial w_{ji}}, \tag{B.2}$$
where w_ji is the element from the jth row and ith column of W. The second term on the right is

$$\frac{\partial V_{j,t}}{\partial w_{ji}} = \frac{\partial}{\partial w_{ji}}\,W_{j\cdot}X_t = X_{i,t}, \tag{B.3}$$
where W_j· is the jth row of W. The first term on the right of equation B.2 is divided into two parts,

$$\frac{\partial J_{PA}}{\partial V_{j,t}} = B_{j,t} + D_{j,t},$$
$$B_{j,t} = \frac{\partial}{\partial V_{j,t}}\,\frac{1}{2}\sum_{\rho=1}^{\tau}\varepsilon_\rho^T(\varepsilon_\rho\odot\beta_\rho),$$
$$D_{j,t} = \frac{\partial}{\partial V_{j,t}}\,\frac{\lambda}{2}\sum_{\rho=1}^{\tau}(V_\rho-\mu)^T\big[(V_\rho-\mu)\odot(1-\beta_\rho)\big], \tag{B.4}$$
where B is related to reconstruction error and D increases sparsity on those layers without desired values. Recursion expressions can now be found for B_{j,t} and D_{j,t}. First, some notation: the jth row of the weight matrix W is denoted W_j·, and the element from the jth row and ith column is w_ji. Beginning with the reconstruction-enforcing term B (temporarily omitting the binary indicator variable β for notational clarity),

$$B_{j,t} = \frac{\partial}{\partial V_{j,t}}\,\frac{1}{2}\sum_{\rho=1}^{\tau}\varepsilon_\rho^T\varepsilon_\rho. \tag{B.5}$$
For B_{j,t}, at the last time step in equation B.2, when t = τ, only the ρ = τ term depends on V_{j,τ},

$$B_{j,\tau} = \frac{\partial}{\partial V_{j,\tau}}\left(\frac{1}{2}\,\varepsilon_\tau^T\varepsilon_\tau\right) = \varepsilon_\tau^T\,\frac{\partial\varepsilon_\tau}{\partial V_{j,\tau}} = \varepsilon_{j,\tau}, \tag{B.6}$$
where ε_{j,t} is the jth element of the error vector ε_t. When t = τ − 1,

$$B_{j,\tau-1} = \frac{\partial}{\partial V_{j,\tau-1}}\left(\frac{1}{2}\,\varepsilon_\tau^T\varepsilon_\tau + \frac{1}{2}\,\varepsilon_{\tau-1}^T\varepsilon_{\tau-1}\right). \tag{B.7}$$
The second term on the right can be found to be ε_{j,τ−1}, as in equation B.6. For the first term,

$$\frac{\partial}{\partial V_{j,\tau-1}}\,\frac{1}{2}\,\varepsilon_\tau^T\varepsilon_\tau = \varepsilon_\tau^T\,\frac{\partial\varepsilon_\tau}{\partial V_{j,\tau-1}} = \varepsilon_\tau^T\,\frac{\partial}{\partial V_{j,\tau-1}}\,W f(V_{\tau-1}) = f'(V_{j,\tau-1})\sum_{k=1}^{N}\varepsilon_{k,\tau}\,w_{kj}. \tag{B.8}$$
Substituting equations B.8 and B.6 into the expression for B_{j,τ−1} (see equation B.7),

$$B_{j,\tau-1} = \varepsilon_{j,\tau-1} + f'(V_{j,\tau-1})\sum_{k=1}^{N}B_{k,\tau}\,w_{kj}. \tag{B.9}$$
The general recursion for B_{j,t} is (after reintroducing the indicator variable β)

$$B_{j,t} = \begin{cases} \varepsilon_{j,t}\,\beta_{j,t} & t=\tau \\ \varepsilon_{j,t}\,\beta_{j,t} + f'(V_{j,t})\displaystyle\sum_{k=1}^{N}B_{k,t+1}\,w_{kj} & 1\le t\le\tau-1 \end{cases}. \tag{B.10}$$
Turning to the sparsity-enforcing term (again omitting β),

$$D_{j,t} = \frac{\partial}{\partial V_{j,t}}\,\frac{\lambda}{2}\sum_{\rho=1}^{\tau}(V_\rho-\mu)^T(V_\rho-\mu). \tag{B.11}$$
Following similarly to the derivation for B above, when t = τ,

$$D_{j,\tau} = \frac{\partial}{\partial V_{j,\tau}}\,\frac{\lambda}{2}\,(V_\tau-\mu)^T(V_\tau-\mu) = \lambda(V_\tau-\mu)^T\frac{\partial(V_\tau-\mu)}{\partial V_{j,\tau}} = \lambda(V_{j,\tau}-\mu). \tag{B.12}$$
When t < τ, we follow equations B.7 to B.9 and find the general recursion for D_{j,t} (reintroducing β),

$$D_{j,t} = \begin{cases} \lambda(V_{j,t}-\mu)(1-\beta_{j,t}) & t=\tau \\ \lambda(V_{j,t}-\mu)(1-\beta_{j,t}) + f'(V_{j,t})\displaystyle\sum_{k=1}^{N}D_{k,t+1}\,w_{kj} & 1\le t\le\tau-1 \end{cases}. \tag{B.13}$$
The recursions B.10 and B.13 are used in the final weight update,

$$\Delta w_{ji} = -\eta\sum_{t=1}^{\tau}\big(B_{j,t}+D_{j,t}\big)\,X_{i,t}, \tag{B.14}$$
where the activation function derivative is

$$f'(v) = \frac{a_1\,a_2\exp(-a_2 v + a_3)}{\big[1+\exp(-a_2 v + a_3)\big]^2}. \tag{B.15}$$
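As an illustration (ours, not part of the original appendix), recursions B.10 and B.13 and the update B.14 translate into a short backward pass. The array layout, function names, and the assumption that ε, β, V, and X are precomputed for t = 1, …, τ are our own.

```python
import numpy as np

def fprime(v, a1=1.0, a2=1.0, a3=0.0):
    """Activation derivative of equation B.15 (parameter values assumed)."""
    e = np.exp(-a2 * v + a3)
    return a1 * a2 * e / (1.0 + e) ** 2

def weight_update(W, V, X, eps, beta, mu, lam, eta):
    """Backward recursions B.10 and B.13 feeding the update B.14.

    V, X, eps, beta have shape (tau, N): preactivations, inputs, errors,
    and binary indicators for t = 1..tau. Returns Delta W."""
    tau, N = V.shape
    dW = np.zeros_like(W)
    B = eps[-1] * beta[-1]                                # B_{j,tau}
    D = lam * (V[-1] - mu) * (1.0 - beta[-1])             # D_{j,tau}
    dW += np.outer(B + D, X[-1])
    for t in range(tau - 2, -1, -1):                      # t = tau-1 down to 1
        fp = fprime(V[t])
        B = eps[t] * beta[t] + fp * (W.T @ B)             # eq. B.10
        D = lam * (V[t] - mu) * (1.0 - beta[t]) + fp * (W.T @ D)  # eq. B.13
        dW += np.outer(B + D, X[t])                       # accumulate eq. B.14
    return -eta * dW
```

Here `W.T @ B` computes the sum over k of B_{k,t+1} w_{kj}, and each `np.outer` term contributes (B_{j,t} + D_{j,t}) X_{i,t} to Δw_{ji}.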
Acknowledgments

J.F.M. gratefully acknowledges support from the ARCS Foundation. This research was supported in part by NSF cooperative agreement ACI-9619020 through computing resources provided by the National Partnership for Advanced Computational Infrastructure at the San Diego Supercomputer Center. Thanks also to Virginia de Sa, Robert Hecht-Nielsen, Jason Palmer, Terry Sejnowski, Sebastian Seung, Tom Sullivan, Mohan Trivedi, and David Wipf for comments and discussions.

References

Ackley, D. H., Hinton, G. E., & Sejnowski, T. J. (1985). A learning algorithm for Boltzmann machines. Cognitive Science, 9(1), 147–169.
Apolloni, B., Bertoni, A., Campadelli, P., & de Falco, D. (1991). Asymmetric Boltzmann machines. Biological Cybernetics, 61, 61–70.
Attneave, F. (1954). Informational aspects of visual perception. Psychological Review, 61, 183–193.
Barber, D., & Sollich, P. (2000). Gaussian fields for approximate inference in layered sigmoid belief networks. In S. A. Solla, T. K. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems, 12. Cambridge, MA: MIT Press.
Barlow, H. B. (1959). The mechanisation of thought processes. London: Her Majesty's Stationery Office.
Bertsekas, D. P. (1995). Dynamic programming and optimal control. Belmont, MA: Athena Scientific.
Brook, D. (1964). On the distinction between the conditional probability and the joint probability approaches in the specification of nearest-neighbor systems. Biometrika, 51(3/4), 481–483.
Callaway, E. M. (2004). Feedforward, feedback and inhibitory connections in primary visual cortex. Neural Networks, 17, 625–632.
Chengxiang, Z., Dasgupta, C., & Singh, M. (2000). Retrieval properties of a Hopfield model with random asymmetric interactions. Neural Computation, 12, 865–880.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Crisanti, A., & Sompolinsky, H. (1988). Dynamics of spin systems with randomly asymmetric bonds: Ising spins and Glauber dynamics. Physical Review A, 37(12), 4865–4874.
Ejima, Y., Takahashi, S., Yamamoto, H., Fukunaga, M., Tanaka, C., Ebisu, T., & Umeda, M. (2003). Interindividual and interspecies variations of the extrastriate visual cortex. Neuroreport, 14(12), 1579–1583.
Felleman, D. J., & Van Essen, D. C. (1991). Distributed hierarchical processing in the primate cerebral cortex. Cerebral Cortex, 1, 1–47.
Fergus, R., Perona, P., & Zisserman, A. (2007). Weakly supervised scale-invariant learning of models for visual recognition. International Journal of Computer Vision, 71, 273–303.
Field, D. J. (1994). What is the goal of sensory coding? Neural Computation, 6, 559–601.
Földiák, P. (1991). Learning invariance from transformation sequences. Neural Computation, 3, 194–200.
Fukushima, K. (2005). Restoring partly occluded patterns: A neural network model. Neural Networks, 18, 33–43.
Fukushima, K., & Miyake, S. (1982). Neocognitron: A new algorithm for pattern recognition tolerant of deformations and shifts in position. Pattern Recognition, 15(6), 455–469.
Galland, C. C. (1993). The limitations of deterministic Boltzmann machine learning. Network, 4, 355–379.
Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6(6), 721–741.
Gill, J. (2001). Generalized linear models: A unified approach. Thousand Oaks, CA: Sage.
Grossberg, S. (1976). Adaptive pattern classification and universal recoding, II: Feedback, expectation, olfaction, and illusions. Biological Cybernetics, 23, 187–202.
Gutfreund, H. (1990). Neural networks and spin glasses. Singapore: World Scientific.
Hawkins, J., & Blakeslee, S. (2004). On intelligence. New York: Times Books.
Hecht-Nielsen, R. (1998). A theory of the cerebral cortex. In Proceedings of the 1998 International Conference on Neural Information Processing (ICONIP'98) (pp. 1459–1464). Burke, VA: IOS Press.
Hertz, J. A., Palmer, R. G., & Krogh, A. S. (1991). Introduction to the theory of neural computation. Redwood City, CA: Addison-Wesley.
Hinton, G. E., & Ghahramani, Z. (1997). Generative models for discovering sparse distributed representations. Phil. Trans. R. Soc. Lond. B, 352, 1177–1190.
Hinton, G. E., Osindero, S., & Teh, Y. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18, 1527–1554.
Hinton, G. E., & Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313, 504–507.
Hinton, G. E., & Sejnowski, T. J. (1983). Optimal perceptual inference. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 448–453). New York: IEEE.
Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79(8), 2554–2558.
Hoyer, P. O., & Hyvärinen, A. (2002). A multi-layer sparse coding network learns contour coding from natural images. Vision Research, 42(12), 1593–1605.
Hubel, D. H., & Wiesel, T. N. (1959). Receptive fields of single neurones in the cat's striate cortex. Journal of Physiology, 148, 574–591.
Johnson, O. (2004). Information theory and the central limit theorem. London: Imperial College Press.
Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., & Saul, L. K. (1998). Learning in graphical models. Cambridge, MA: MIT Press.
Kandel, E. R., Schwartz, J. H., & Jessell, T. M. (2000). Principles of neural science (4th ed.). New York: McGraw-Hill.
Kappen, H. J., & Spanjers, J. J. (2000). Mean field theory for asymmetric neural networks. Physical Review E, 61(5), 5658–5661.
Kay, S. M. (1993). Fundamentals of statistical signal processing. Upper Saddle River, NJ: Prentice Hall.
Kosslyn, S. M., Thompson, W. L., & Alpert, N. M. (1997). Neural systems shared by visual imagery and visual perception: A positron emission tomography study. Neuroimage, 6, 320–334.
Kreiman, G., Koch, C., & Fried, I. (2000). Category-specific visual responses of single neurons in the human medial temporal lobe. Nature Neuroscience, 3(9), 946–953.
Kreutz-Delgado, K., Murray, J. F., Rao, B. D., Engan, K., Lee, T. W., & Sejnowski, T. J. (2003). Dictionary learning algorithms for sparse representation. Neural Computation, 15(2), 349–396.
Lee, T. S., & Mumford, D. (2003). Hierarchical Bayesian inference in the visual cortex. Journal of the Optical Society of America A, 20(7), 1434–1448.
Lee, T. S., Mumford, D., Romero, R., & Lamme, V. A. F. (1998). The role of the primary visual cortex in higher level vision. Vision Research, 38, 2429–2454.
Lewicki, M. S., & Sejnowski, T. J. (2000). Learning overcomplete representations. Neural Computation, 12(2), 337–365.
Mezard, M., Parisi, G., & Virasoro, M. A. (1987). Spin glass theory and beyond. Singapore: World Scientific.
Mountcastle, V. B. (1978). The mindful brain. Cambridge, MA: MIT Press.
Murray, J. F. (2005). Visual recognition, inference and coding using learned sparse overcomplete representations. Unpublished doctoral dissertation, University of California, San Diego.
Murray, J. F., & Kreutz-Delgado, K. (2001). An improved FOCUSS-based learning algorithm for solving sparse linear inverse problems. In Conference Record of the
35th Asilomar Conference on Signals, Systems and Computers (Vol. 1, pp. 347–351). New York: IEEE.
Murray, J. F., & Kreutz-Delgado, K. (2006). Learning sparse overcomplete codes for images. Journal of VLSI Signal Processing, 45, 97–110.
Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381, 607–609.
Olshausen, B. A., & Field, D. J. (1997). Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37, 3311–3325.
Olshausen, B. A., & Field, D. J. (2005). 23 problems in systems neuroscience. New York: Oxford University Press.
Parisi, G. (1986). Asymmetric neural networks and the process of learning. J. Phys. A: Math. Gen., 19, L675–680.
Peterson, C., & Anderson, J. R. (1987). A mean field theory learning algorithm for neural networks. Complex Systems, 1(5), 995–1019.
Quiroga, R. Q., Reddy, L., Kreiman, G., Koch, C., & Fried, I. (2005). Invariant visual representation by single neurons in the human brain. Nature, 435, 1102–1107.
Rao, R. P. N. (1999). An optimal estimation approach to visual perception and learning. Vision Research, 39(11), 1963–1989.
Rao, R. P. N., & Ballard, D. H. (1997). Dynamic model of visual recognition predicts neural response properties in the visual cortex. Neural Computation, 9(4), 721–763.
Riesenhuber, M., & Poggio, T. (1999). Hierarchical models of object recognition in cortex. Nature Neuroscience, 2, 1019–1025.
Rolls, E. T., & Milward, T. (2000). A model of invariant object recognition in the visual system: Learning rules, activation functions, lateral inhibition, and information-based performance measures. Neural Computation, 12(11), 2547–2572.
Saul, L. K., & Jordan, M. I. (1998). Learning in graphical models. Cambridge, MA: MIT Press.
Sereno, M. I., Dale, A. M., Reppas, J. B., Kwong, K. K., Belliveau, J. W., Brady, T. J., Rosen, B. R., & Tootell, R. B. H. (1995). Borders of multiple visual areas in humans revealed by functional magnetic resonance imaging. Science, 268(5212), 889–893.
Sompolinsky, H. (1988). Statistical mechanics of neural networks. Physics Today, 41(12), 70–80.
Stevens, C. F. (2001). An evolutionary scaling law for the primate visual system and its basis in cortical function. Nature, 411, 193–195.
Sudderth, E. B., Torralba, A., Freeman, W. T., & Willsky, A. S. (2005). Learning hierarchical models of scenes, objects and parts. In International Conference on Computer Vision (ICCV 2005) (Vol. 2, pp. 1331–1338). Berlin: Springer.
Teh, Y. W., & Hinton, G. E. (2001). Rate-coded restricted Boltzmann machines for face recognition. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 908–914). Cambridge, MA: MIT Press.
Teh, Y. W., Welling, M., Osindero, S., & Hinton, G. E. (2003). Energy-based models for sparse overcomplete representations. Journal of Machine Learning Research, 4, 1235–1260.
Thorpe, S., Fize, D., & Marlot, C. (1996). Speed of processing in the human visual system. Nature, 381, 520–522.
Vinje, W. E., & Gallant, J. L. (2000). Sparse coding and decorrelation in primary visual cortex during natural vision. Science, 287, 1273–1276.
Welling, M., & Teh, Y. W. (2003). Approximate inference in Boltzmann machines. Artificial Intelligence, 143(1), 19–50.
Williams, R. J., & Peng, J. (1990). An efficient gradient-based algorithm for on-line training of recurrent network trajectories. Neural Computation, 2, 490–501.
Received August 2, 2005; accepted December 15, 2006.
LETTER
Communicated by Emilio Salinas
Visual Remapping by Vector Subtraction: Analysis of Multiplicative Gain Field Models

Carlos R. Cassanello
[email protected]

Vincent P. Ferrera
[email protected]

Columbia University, Department of Psychiatry, Center for Neurobiology and Behavior, David Mahoney Center for Brain and Behavior Research, New York, NY 10032, U.S.A.
Saccadic eye movements remain spatially accurate even when the target becomes invisible and the initial eye position is perturbed. The brain accomplishes this in part by remapping the remembered target location in retinal coordinates. The computation that underlies this visual remapping is approximated by vector subtraction: the original saccade vector is updated by subtracting the vector corresponding to the intervening eye movement. The neural mechanism by which vector subtraction is implemented is not fully understood. Here, we investigate vector subtraction within a framework in which eye position and retinal target position signals interact multiplicatively (gain field). When the eyes move, they induce a spatial modulation of the firing rates across a retinotopic map of neurons. The updated saccade metric can be read from the shift of the peak of the population activity across the map. This model uses a quasi-linear (half-rectified) dependence on the eye position and requires the slope of the eye position input to be negatively proportional to the preferred retinal position of each neuron. We derive this constraint analytically and study its range of validity. We discuss how this mechanism relates to experimental results reported in the frontal eye fields of macaque monkeys.

Neural Computation 19, 2353–2386 (2007)
© 2007 Massachusetts Institute of Technology

1 Introduction

The retinal image of an object changes whenever the object moves or the eyes move. However, monkeys and humans are able to saccade accurately to the remembered location of a target that has been rendered invisible, even when there is an intervening shift of gaze between the target's disappearance and the final saccade. The brain accomplishes this by updating the displacement needed to acquire the target based on the most recent position of the eyes. This displacement can be approximated as a vector subtraction between the retinal representation of the location of the target at the instant it was
perceived and the change in eye position. This computation requires that the brain integrate the retinal position of a target stimulus with the eye position in order to acquire the target. In this letter, we discuss a way of performing this updating online by vector subtraction. There is substantial evidence that retinal and eye position signals interact multiplicatively in several regions of visual cortex (Andersen & Mountcastle, 1983; Andersen, Essick, & Siegel, 1985; Andersen, Bracewell, Barash, Gnadt, & Fogassi, 1990; Van Opstal, Hepp, Suzuki, & Henn, 1995; Campos, Cherian, & Segraves, 2006). Updating saccade plans in multiplicative gain field models is not straightforward, and most previous approaches have used neural networks (Zipser & Andersen, 1988; Salinas & Abbott, 1995, 1997; Pouget & Sejnowski, 1997; Xing & Andersen, 2000; Quaia, Optican, & Goldberg, 1998; Mitchell & Zipser, 2003; Pouget & Snyder, 2000; but see also Siegel, 1998). Although these networks achieve the appropriate transformation, an analytical approach can provide insight into how this is accomplished. The approach taken here is similar in that we start with the goal of the computation: to perform vector subtraction. However, instead of constructing a network, we use analytic methods to derive a minimal model that solves the problem. The goal of this study is to derive analytic solutions for computing saccade updating by vector subtraction in a population of neurons with gain field–like properties. We discuss the conditions that need to be imposed on the gain field in order to perform this computation at a single-neuron level given certain assumptions, and we show using simple simulations how these conditions may be implemented by the neuronal hardware.

2 Results

2.1 Gain Field Differential Equation and Gain Field Modulation Model

The problem to be solved is shown in Figure 1. The eyes are fixating at FP1 when a target is momentarily presented. After the target disappears, the eyes move to FP2. This is the first step of a double-step saccade, the ultimate goal of which is to acquire the remembered location of the target. The vector of the saccade required for the second step needs to be updated based on the information provided by x and y, which represent retinal target position (RTP) and eye position (EP), respectively. Both x and y are measured relative to the first fixation position (FP1). The model does not incorporate any time dependence in the response of the neurons. It is assumed that the response to the initial target presentation is sustained after the target disappears. It is this sustained memory- or motor-plan-related activity that is modulated by eye position. It is further assumed that the target is briefly presented and that all subsequent behavior occurs in total darkness. The model takes as inputs the retinal target position given by x and the change in eye position denoted y before the final saccade (see Figure 1).
[Figure 1 appears here, showing fixation points FP1 and FP2, the target at x, the gaze shift y, and the required second saccade x − y.]

Figure 1: Memory-guided saccade with intervening gaze shift (double-step saccade). The subject fixates at FP1 when the target is flashed at location x. Eye movements are shown with solid black arrows. After the target is turned off and before the second saccade, there is a shift in gaze by an eye position displacement in the dark given by y. To accurately acquire the target (x), the second saccade needs to be updated to the vector difference x − y.
The output is the saccade vector x − y. Each model cell has a response of the general form

$$R(x, y, \rho) = f(x, \rho)\cdot g(y, \rho). \tag{2.1}$$
The arguments of the functions in equation 2.1 are vectors. However, for simplicity, we restrict the analysis to one dimension and discuss later how it generalizes. Here f is a retinal sensitivity function, and g is the eye position sensitivity function that modulates the retinal input; ρ denotes the centers of the retinal receptive fields of neurons in the model. Saccade direction and amplitude are encoded by the location of the peak response of a population of model neurons. Therefore, for a saccade of amplitude (x − y)₀, the response R must have a maximum over the variable ρ at the location of those cells for which ρ₀ = (x − y)₀. Given a particular form for the retinal sensitivity function f(x, ρ), the eye position sensitivity function g(y, ρ) that satisfies the vector subtraction constraint can be derived by finding the maximum of R with respect to the preferred retinal location (ρ). Since R is given by the product of f and g, g must obey the following differential equation,

$$\frac{\partial}{\partial\rho}\log g\Big|_{y} = -\frac{\partial}{\partial\rho}\log f\Big|_{x} = -\frac{1}{\sigma^2}\,(x-\rho) \equiv -\frac{1}{\sigma^2}\,y, \tag{2.2}$$
This one-dimensional equation generalizes by using the two-dimensional gradient operator and the two-dimensional vector analogs to x, y, and ρ. The last identity on the right-hand side of equation 2.2 enforces the constraint ρ = x − y in order to obtain a function of only y and ρ. We call this the vector subtraction condition. The structure of equation 2.2 is a consequence of the assumptions of a multiplicative modulation and the independence of the variables. The vector subtraction constraint allows one to obtain a closed differential equation, which can be solved for g. When f(x, ρ) is a gaussian receptive field of size σ centered at ρ,

$$f(x,\rho) = f_0\,\frac{1}{\sqrt{2\pi}\,\sigma}\exp\left[-\frac{1}{2\sigma^2}(x-\rho)^2\right], \tag{2.3}$$
the derived eye position sensitivity function has the form

$$g(y,\rho) = g_0\cdot\exp\left(-\frac{\rho}{\sigma^2}\,y\right) \approx g_0\cdot\left[1-\frac{\rho}{\sigma^2}\,y+\cdots\right]_+. \tag{2.4}$$
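As a quick consistency check (our own addition, not in the original derivation), the product of equations 2.3 and 2.4 indeed peaks at the updated saccade vector:

```latex
\[
  f(x,\rho)\,g(y,\rho) \;\propto\;
  \exp\!\left[-\frac{(x-\rho)^2}{2\sigma^2}-\frac{\rho\,y}{\sigma^2}\right],
  \qquad
  \frac{\partial}{\partial\rho}\!\left[-\frac{(x-\rho)^2}{2\sigma^2}-\frac{\rho\,y}{\sigma^2}\right]
  = \frac{x-\rho-y}{\sigma^2} = 0
  \;\Longrightarrow\; \rho^{\star} = x-y .
\]
```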
Together f(x, ρ) and g(y, ρ) describe the responses of a population of neurons that form a saccade vector map in retinotopic coordinates. Saccade vectors are encoded by the peak of the population response. This peak moves as the eyes move to update the saccade command. This solution is unique up to an arbitrary function of x and y. The + subscript on the right-hand side of equation 2.4 indicates that under certain conditions to be specified, we can approximate the exponential gain field by the linear rectified term of its polynomial expansion. What is important from the standpoint of vector subtraction is not only the functional form of g(y, ρ) but the relationship between the parameters in g(y, ρ) and f(x, ρ). For each cell, the preferred retinal location, ρ, and size of the receptive field, σ, determine the slope of the eye position sensitivity function. Across the population, the slope of the eye position sensitivity function increases in magnitude with the eccentricity of the receptive field center but is opposite in sign. The same strategy can be applied to other receptive field shapes. If the visual receptive field is modeled as a sigmoid,

$$f(x,\rho) = \frac{A}{2}\cdot\left[1+\tanh\left(\frac{x-\rho}{\eta}\right)\right], \tag{2.5}$$
where A is the amplitude of the neuron’s response, ρ is the point where the visual response reaches half of its amplitude, and η is the half-width of the spatial rise of the receptive field, the companion eye position sensitivity
function results:

$$g(y,\rho) = g_0\cdot\exp\left[\left(1-\tanh\frac{y}{\eta}\right)\frac{\rho}{\eta}\right] \cong g_0\cdot\left[1-y\cdot\frac{\rho}{\eta^2}+\cdots\right]_+. \tag{2.6}$$
We can pick g₀ to be exp(−ρ/η) so that g(y = 0, ρ) = 1. For both gaussian and sigmoidal receptive fields, the derived eye position sensitivity function is an exponential function of eye position. The exponential function implements the mapping of an additive structure into a multiplicative one, which allows the model to take into account successive gaze shifts. However, the brain can compute an approximate vector subtraction using a linear rectified approximation to the exponential gain field function. This approximation, which uses the right-hand side of equations 2.4 and 2.6, is used in the model simulations, and its validity is discussed in a later section.

2.2 Eye Position versus Eye Displacement

In the derivations above, y represents eye position relative to the center of gaze, which may or may not be identical to the primary position of the eye. When the eye is in this position, the eye position gain for all cells is 1.0. What happens if the animal performs a double-step saccade task where the initial fixation position (FP1) is at an eccentric location? In this case, y represents the change in eye position, FP2 − FP1 (see Figure 1), or eye displacement. In terms of the model presented above, it does not matter if y represents absolute eye position or eye displacement. The activity in the retinal map will shift by an amount −y in either case. However, there is an interesting wrinkle. If the input to the model encodes absolute eye position and the eyes start at an eccentric position, then the gain factor will vary across neurons depending on their eye position sensitivity functions. If eye position gain modulates visual responses, then the response to a visual target will be distorted whenever the animal is fixating eccentrically. Specifically, if a target is presented at position x₀ while the eyes are at position y₀, then the maximal activity across the population will be not at x₀ but at x₀ − y₀. A full treatment of this case is beyond the scope of this letter. However, one potential solution might be to incorporate temporal dynamics into the eye position sensitivity function such that whenever the eye starts in an eccentric position, all of the gains relax back to the initial, uniform state (e.g., all gains = 1.0). In this case, the model should behave as if the eye is always in primary position at the start of the task, regardless of the location of FP1. However, the eye position gains should not relax to a uniform state when the eye moves to FP2. Hence, the dynamics would need to be sensitive to behavioral state (i.e., fixating versus planning the next saccade). These considerations suggest
that the eye position signal should not be a tonic representation of absolute eye position but a dynamic representation of eye displacement.

2.3 One-Dimensional Model Simulations

We simulate the response of a one-dimensional array of neurons with gaussian receptive fields, modulated by an eye position sensitivity function that has a quasi-linear (half-rectified) dependence on eye position. The coefficient of the eye position term is proportional to the receptive field eccentricity but with the opposite sign. All gain fields intersect at the center of gaze (see Figures 2A and 2B). The array consists of 81 neurons with receptive fields equally spaced covering retinal positions within 60 degrees of the fovea. Each unit's response has the form

$$R(x,y,\rho) = A_1\cdot\exp\left[-\frac{1}{2\sigma^2}(x-\rho)^2\right]\cdot\big[1-A_2\cdot\rho\cdot y\big]_+. \tag{2.7}$$

Figure 2A shows the visual receptive fields of a subset of the units. Figure 2B shows the corresponding gain field functions, which have been half rectified. Figure 2C shows the population response for three different saccades: solid dots represent a target 10 degrees to the right with initial eye position at the primary position; open squares have initial eye position at 10 degrees to the left of the primary position with the target flashed 10 degrees to the right of the fovea; and open circles represent a target flashed at 10 degrees to the right of the fovea but with the eyes shifted to 30 degrees to the right, therefore requiring a leftward saccade. The population response is always unimodal and peaks around the value of the correct saccade amplitude and direction, indicating that the most active cells in the population are those with receptive field eccentricity congruent with the saccade. Figure 2D shows the discrepancy between the peak of the population activity (updated saccade vector) predicted by the model and the actual size of the ideal second-step saccade for various locations of the target. The diagonal dashed lines indicate constant intervening gaze shifts before the second-step saccade, ranging from 20 degrees to the left of the primary eye position for the upper diagonal curve to 20 degrees to the right for the lowest curve, in steps of 10 degrees. These simulations use the following parameters. The intercept of the linear gain field function has been set to 1.0. The slope of the gain field function has both a linear and a small quadratic dependence on the receptive field eccentricity, given by slope = −0.0009·ρ − sgn(ρ)·2.9 × 10⁻⁶·ρ². The overall magnitude of the gain field does not affect the performance of the model as long as the ratio of slope to intercept remains the same. The receptive field size of the neurons increases with eccentricity following the relation σ = 30° + 0.21·|ρ|. The receptive field sizes are well within the range reported by Bruce and Goldberg (1985) for frontal eye field neurons (average direction tuning width = 45 degrees, average amplitude tuning = 18 degrees). A more extensive discussion of the effect of the change in receptive field size on the slope of the gain field function is given in the appendix.
[Figure 2 appears here; panel D carries the annotation "Avg error = 0.96 deg."]

Figure 2: One-dimensional gain field model comprising an array of 81 neurons with gaussian receptive fields and quasi-linear gain fields. (A) Visual receptive fields of a subset of neurons. (B) Eye position sensitivity functions (gain fields) for the same neurons as in A. The darkness of the lines indicates the correspondence between the curves in A and B; for example, the heavy black curves represent the retinal and eye position sensitivity functions for the same neuron. (C) Population response for the three conditions: solid dots, initial target position = 10 degrees and eye position = 0; open squares, initial target position = 10 degrees and eye position = −10 degrees, requiring a 20 degree rightward saccade; open circles, initial target position = 10 degrees and eye position = 30 degrees, requiring a 20 degree leftward saccade. (D) Predicted second saccade amplitudes (symbols) compared to the exact required update (dashed lines) for various initial target locations and eye positions, both ranging from −30 to 30 degrees in 10 degree steps. Each group of symbols represents saccade amplitude to various initial target locations for one particular eye position.
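For readers who want to reproduce the gist of Figure 2, here is a minimal NumPy sketch of the 81-unit array of equation 2.7 with the parameters listed above. The argmax readout and the variable names are our own simplifications of "the peak of the population activity," so the decoded value lands only approximately on x − y.

```python
import numpy as np

rho = np.linspace(-60.0, 60.0, 81)            # preferred retinal positions (deg)
sigma = 30.0 + 0.21 * np.abs(rho)             # RF size grows with eccentricity
slope = -0.0009 * rho - np.sign(rho) * 2.9e-6 * rho**2   # gain field slopes

def population_response(x, y, A1=1.0):
    """Equation 2.7: gaussian RF times half-rectified linear gain field."""
    rf = A1 * np.exp(-(x - rho) ** 2 / (2.0 * sigma ** 2))
    gain = np.maximum(1.0 + slope * y, 0.0)
    return rf * gain

# Target flashed 10 deg right, gaze then shifts 30 deg right: the peak
# should land near the required saccade x - y = -20 deg (cf. the open
# circles in Figure 2C).
r = population_response(x=10.0, y=30.0)
print("decoded saccade ~ %.1f deg" % rho[np.argmax(r)])
```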
Figure 3: (A) Population tuning for a 12 degree leftward saccade (vertical line) using the gaussian-linear rectified model. The different curves represent various combinations of initial target position and eye position, all giving rise to the same saccade. (B) Population tuning for a 12 degree leftward saccade (vertical line) using a gaussian-exponential gain field model, again for various combinations of initial target position and eye position.
Figure 3 compares the gaussian-exponential model in equations 2.3 and 2.4 to the linear rectified approximation used in the simulations. Figure 3A shows the population response to a 12 degree leftward saccade for various target locations using the gaussian-linear rectified model. The solid vertical line shows the retinal location of 12 degrees to the left of the fovea. The derivation given in the previous section predicted exponential eye sensitivity functions. Plots similar to those of Figure 3A are presented in Figure 3B for a gaussian-exponential gain field model. It is interesting that both the population response and the individual neuron selectivity for saccades (not shown) obtained with the exponential model are not as sharp as in the quasi-linear model. In addition, the response of the quasi-linear model can be driven higher for a comparable set of parameters and relatively small saccades using a gain field intercept larger than one. For the gaussian-exponential gain field model, the coefficient of the eye position term (which becomes the slope of the gain field function in the quasi-linear model) is predicted to be −ρ/σ². This condition can be relaxed in the quasi-linear model. The gaussian-linear rectified model does not require the gain field
slope to be strictly equal to −ρ/σ² because there is a trade-off between the gain field slopes and intercept.

2.4 Saccade Tuning of Single Cells

The model proposed in this letter is based on a retinotopic map that remains fixed as the eye moves. Saccades are updated by shifting the peak of activity within this map in coordination with the movement of the eyes. This can be understood by considering what happens at the level of single neurons. A neuron with a gaussian receptive field centered at position ρ₀ and exponential eye position sensitivity will have a firing rate given by the product of equations 2.3 and 2.4, which can be rewritten as

$$R(x,y,\rho_0) = A\cdot\exp\left[-\frac{1}{2\sigma^2}(x-y-\rho_0)^2\right]\cdot\exp\left[\frac{1}{2\sigma^2}\left(y^2-2xy\right)\right]. \tag{2.8}$$

This equation remains valid when its arguments are replaced by corresponding two-dimensional vectors. The first exponential term produces a response that is maximal for saccades for which x − y = ρ₀. The difference, x − y, is precisely the motor error of the second saccade in a double-saccade task. Hence the gain field model can be thought of as producing neurons that represent motor error. This is a consequence of enforcing the subtraction condition on equation 2.2. If we had enforced a vector addition constraint, the first exponential in equation 2.8, while still centered at ρ₀, would have resulted in tuning to the head-centered coordinates of the target, x + y, instead of saccade tuning. The second exponential in equation 2.8 shows a dependence on eye and retinal target position. It gives an enhancement proportional to the square of the eye displacement, y², and a factor −2x·y that favors the response when x and y have opposite signs, that is, when the eye moves in a direction opposite to the initial target position. Now assume we always set the saccade target to be at the retinal receptive field of the neuron so that x = ρ₀. Then the response will be

$$R(\rho_0,y,\rho_0) = A\cdot\exp\left(-y\cdot\frac{\rho_0}{\sigma^2}\right). \tag{2.9}$$
Therefore, the response is enhanced the farther the fixation point lies from the center of the cell's receptive field. In particular, it will be highest if the target and the saccade starting point are in opposite visual hemifields. This feature of neurons in the frontal eye field (FEF) has been reported before as evidence of gain field–like eye position modulation and in support of the hypothesis that these neurons perform vector subtraction computations (Balan & Ferrera, 2003).
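To verify the rewriting in equation 2.8 (our own algebra, added for clarity), complete the square in the exponents:

```latex
\[
  -\frac{(x-y-\rho_0)^2}{2\sigma^2}+\frac{y^2-2xy}{2\sigma^2}
  \;=\; -\frac{(x-\rho_0)^2}{2\sigma^2}-\frac{\rho_0}{\sigma^2}\,y ,
\]
```

which is exactly the exponent of the product of equations 2.3 and 2.4.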
2.5 Eye Movement–Induced Shift of the Neuronal Receptive Fields

There exists an alternative approximation to derive the form of the gain field. The solution turns out to be identical to the solution to the differential equation (see equation 2.2) if only a quasi-linear (half-rectified) term in eye displacement is considered. The approximation consists in truncating the expansion of the shift operator of Fourier analysis to the first-order term, followed by a half-rectification. Gaussian receptive fields belong to a class of functions that admit a series representation written as a linear superposition of a basis set given by the family of complex exponentials. In fact, almost any spatially tuned and reasonably localized (bell-shaped) receptive field function would belong to this functional class. If its center is given by ρ, the receptive field can be written in a Fourier decomposition involving the functions exp(i·k·ρ):

$$f(x,\rho) = A\cdot\exp\left[-\frac{1}{2\sigma^2}(x-\rho)^2\right] = \int h(x,k)\cdot\exp(i\,k\cdot\rho)\,dk. \tag{2.10}$$

The explicit form of the function h is not necessary; k is related to the spatial frequency, and only the complex exponential depends on ρ on the right-hand side. In order to shift the underlying retinotopic map to which the coordinate ρ belongs due to an eye movement of amplitude y, it suffices to shift the argument in the complex exponentials. This can be done using the following operator:

$$\exp\left(y\,\frac{\partial}{\partial\rho}\right) \equiv \sum_n \frac{y^n}{n!}\,\frac{\partial^n}{\partial\rho^n},$$

since

$$\exp\left(y\,\frac{\partial}{\partial\rho}\right)\exp(i\,k\cdot\rho) = \exp(i\,k\cdot\rho)\sum_n \frac{1}{n!}\,y^n(i\,k)^n = \exp\big[i\,k\cdot(\rho+y)\big].$$

When we apply this operator to a gaussian, f(x, ρ) = A·exp[−(x − ρ)²/2σ²], we get

$$\exp\left(y\,\frac{\partial}{\partial\rho}\right)f = \sum_n \frac{y^n}{n!}\,\frac{\partial^n f}{\partial\rho^n} \cong \left[1 + y\,\frac{\partial}{\partial\rho} + \frac{y^2}{2!}\,\frac{\partial^2}{\partial\rho^2} + \cdots\right]f$$
$$= A\cdot\exp\left[-\frac{1}{2\sigma^2}(x-\rho)^2\right]\left\{1 - \frac{\rho}{\sigma^2}\,y - \frac{1}{2\sigma^2}\left(y^2-2xy\right) + \frac{y^2}{2\sigma^4}(x-\rho)^2 + \cdots\right\}.$$

The expression in braces contains the first terms of the expansion of the function exp(−(ρ/σ²)·y)·exp(−(1/2σ²)(y² − 2xy)). These are the factors that result when we shift the gaussian receptive field by an amount y. The solution must be half-wave rectified to avoid negative firing rates. This operator will always generate a first-order term linear in the shift parameter y. A quadratic dependence of the receptive field shape on the eccentricity ρ will generate a linear dependence of the slope of the gain field on the eccentricity. However, the parameters should remain in the domain in which the approximation is valid. This operator can be generalized to two dimensions using a scalar product of the eye displacement and the gradient operator on the receptive field center map. This approximation is not the same as the one used to obtain equation 2.2. The rationales for the two approximations are quite distinct. In the first case, a closed differential equation for g was derived, requiring that the population response be unimodal and enforcing the vector subtraction condition. In the second case, the shift operator quantifies the changes in the neuronal responses due to the shift of the origin of the retinotopic map, describing how the entire sensory field is updated by the movement of the eyes. The coincidence between the two approximations implies that saccade updating and sensory field remapping are consistent.

2.6 How and Where Does the Quasi-Linear (Half-Rectified) Approximation Work?

The model works by shifting the peak of activity from the retinal location given by the position of the target when flashed to a new locus at the retinal location corresponding to the required saccade. Figure 4A shows this shift from the location of the target x (where there will be a neuron with receptive field ρ₁) in the retinal map when the eyes are at the fixation point before the gaze shift, to x − y (neuron with retinal receptive field ρ₂). After the gaze shift and before the second saccade, each of these neurons is modulated by a planar rectified gain field whose slope is negatively proportional to its eccentricity. In Figure 4A, the gray circles indicate the pattern of activity based on the position of the target before the gaze shift. The black circles show the pattern of activity after gaze shifts to position y. In the absence of further visual stimulus, the effect of the gain fields is to shift the locus of activity to the neighborhood of x − y in the retinotopic map. In other words, the most active cell after the gaze shift in darkness is the same cell that would have been most responsive if the eyes had moved first and the target were subsequently turned on at the same spatial location without any gain field present. For the gaussian-linear
[Figure 4 appears here. Panel B plots F(|y|) = |y|·(exp(y²/2σ²) − 1)⁻¹ − σ²·|y|⁻¹ against retinal target position/eye position (deg) for receptive field sizes σ = 20°, 25°, 30°, and 40°, together with the line |x|; the validity criterion is F(|y|) > |x|, with saccade = x − y.]
Figure 4: (A) A two-dimensional eye translation of vector y induces a shift of the retinal map origin, so that the peak of activity will shift from retinotopic map location x when the target is flashed to retinotopic location x − y right before the second-step saccade. This new location of the peak of activity is the same as if the target were flashed again after the gaze shift. (B) The approximation used in this letter will work as long as the values of retinal target location and eye displacement remain below the intersection point between the straight dashed line given by |x| and the curves representing F(|y|); that is, for F(|y|) > |x|. Each curve corresponds to a different size of the neuronal receptive fields (see text and appendix A.1 for discussion).
rectified model using receptive fields with uniform amplitude and size, the activities at retinal locations x and x − y are

$$R_1 = A\cdot\left[1-\frac{x\cdot y}{\sigma^2}\right] \quad\text{and}\quad R_2 = A\cdot\exp\left(-\frac{y^2}{2\sigma^2}\right)\cdot\left[1-\frac{(x-y)\cdot y}{\sigma^2}\right]. \tag{2.11}$$
These formulas give the responses at retinal locations x (R₁) and x − y (R₂) after the eyes have moved to the new fixation point but before the final saccade to the target. Updating the metric for the final saccade requires that R₂ surpass R₁ after the gaze shift and before the final saccade to the memorized position of the target. The range within which the linear rectified model is valid is shown in Figure 4B for four different values of the receptive field size. It requires that the absolute values of retinal target position and eye displacement remain below the point of intersection of
the straight line (|x|) and the curves F(|y|) in Figure 4B. A derivation of the validity condition is given in appendix A.1.

3 Extensions of the Basic Model

3.1 Multistep Saccades

The model can account for multistep gaze shifts in darkness because the exponential function maps multiplicative factors into addition of their arguments. Therefore, the successive application of the multiplicative modulation, first by an eye shift y followed by a shift z, will result in a global multiplicative modulation dependent on the total shift y + z. In the same way, a product of two linear-rectified factors (1 − y·ρ/σ²)·(1 − z·ρ/σ²) will approximate (1 − (y + z)·ρ/σ² + ···). Suppose that instead of one flash and two eye displacements in the dark, there are two successive flashes at different locations. After the first target is foveated, the model will correctly provide the updated metric for a subsequent saccade to the second flash. The order in which these saccades take place represents a decision that would have to be added to the model through extra variables. The model does not require the use of a temporal queue to solve this sequence of saccades.

3.2 Eye Movements in Two Dimensions

The model can be readily extended to two dimensions. In this case, the shape and orientation of the receptive field become important. Assume that the receptive field of a cell is a two-dimensional gaussian with principal variances σ₁ and σ₂ along the corresponding principal axes. The principal axis with the largest variance forms an angle θ with the x-axis in a fixed (spatial) frame of coordinates, which instantaneously coincides with the retinal coordinates. As before, x is the position vector of the saccade target, and ρ is the center of the neuron's receptive field, assumed gaussian and with principal directions 1 and 2. These vectors can be represented in polar coordinates by their size and their angle with respect to the horizontal axis (see Figure 8 and appendix A.2): x₁ = x·cos(α), x₂ = x·sin(α), ρ₁ = ρ·cos(β), and ρ₂ = ρ·sin(β). Substituting these, the argument of the gaussian describing the receptive field reads

$$-\frac{1}{2\sigma_1^2}\big(x\cos(\alpha-\theta)-\rho\cos(\beta-\theta)\big)^2 - \frac{1}{2\sigma_2^2}\big(x\sin(\alpha-\theta)-\rho\sin(\beta-\theta)\big)^2 \equiv -E(x,\rho,\alpha,\beta,\theta,\sigma_{1,2}).$$
The intuitive interpretation of this expression is that both vectors x and ρ are projected onto the principal axes of the receptive field. If we apply the two-dimensional version of the shift operator to the two-dimensional
gaussian, we have

$$\left[1 + y_1\frac{\partial}{\partial\rho_1} + y_2\frac{\partial}{\partial\rho_2} + \cdots\right]\circ\Big\{A_1\cdot\exp\big[-E(x,\rho,\alpha,\beta,\theta,\sigma_{1,2})\big]\Big\}$$
$$= A_1\cdot\exp[-E]\cdot\Big\{1 - y\cdot\rho\cdot\big[A\cos(\beta-\gamma) + B\cos(\beta+\gamma-2\theta)\big] + \cdots\Big\}. \tag{3.1}$$

Here

$$A = \frac{1}{2}\left(\frac{1}{\sigma_1^2}+\frac{1}{\sigma_2^2}\right), \qquad B = \frac{1}{2}\left(\frac{1}{\sigma_1^2}-\frac{1}{\sigma_2^2}\right),$$

and the eye displacement vector is given by y = (y·cos(γ), y·sin(γ)). The first term gives the projection of the eye displacement onto the receptive field center vector, weighted by the geometric mean of the principal variances. This is an isotropic term and carries no information about the shape of the receptive field. It accounts for the contributions in the direction collinear with the receptive field center. The second term measures the discrepancies between the receptive field orientation and the orientations of the receptive field center and the eye displacement. The one-dimensional case is recovered when γ = β = θ and when γ = θ, β = θ ± π. The contribution from σ₂ cancels out, and the one-dimensional scenario applies along the direction of the receptive field eccentricity. For circularly symmetric receptive fields, B = 0 and A = 1/σ². Figure 5 shows the receptive and gain fields of two neurons in two dimensions. The slope of the planar rectified gain field is negatively proportional to the eccentricity of the receptive field center. Figure 6 shows the simulation of a two-dimensional model using circularly symmetric receptive fields. Figure 6A shows the intended eye movement. A target is flashed at position A, and there is an intervening gaze shift to position B before the final movement to the remembered target location. Figure 6B shows the population response to a single-step saccade from the center of gaze to position A. Figures 6C and 6D show the population activity before the first and second saccade, respectively. Figure 6D shows the population encoding of the correct second saccade vector, computed as the difference between vectors A and B. Figure 7A shows plots of the gain field slope given by the coefficient of the eye displacement y in equation 3.1 for cells with elliptical receptive fields oriented at an angle θ with respect to the horizontal axis. This direction need not be the same as that of the vector pointing to the center of the receptive field (β, taken to be zero in Figure 7A). Even when the eye movement is aligned with the direction β, equation 3.1 shows that the slope of the gain field can be changed by the effect of the receptive field orientation. In fact,
Figure 5: Example receptive fields and gain field functions in two dimensions. Firing rate is coded as gray-scale intensity level. (A) A cell with circularly symmetric receptive field centered in the lower left quadrant at (−25°, −25°). (B) The gain field function for this unit has a maximum in the upper right quadrant. (C) Unit with circularly symmetric receptive field centered in the upper right at (25°, 25°). (D) The gain field function now increases along the (−1, −1) direction.
the slope of the gain field may be changed smoothly between the values required for subtractive and summation conditions by appropriately tuning the orientation and aspect ratio of the neuronal receptive fields and appropriately weighting the symmetric and asymmetric contributions that enter equation 3.1. In particular, there may exist subpopulations of neurons with eye position sensitivities positively correlated with the receptive field
Figure 6: Remapping of the saccade encoding in two dimensions. (A) Eye movement from the origin to point A in one- or two-step saccades. (B) Population activity for a single-step saccade to position A in panel A. (C, D) A double-step saccade with the same end point as in panel B. (C) Step 1 shows the population activity before the first saccade to the intermediate gaze position (vector B in panel A). (D) Step 2 shows the updated population activity encoding the correct saccade vector A-B once the gain field modulation has been activated by the first eye displacement.
eccentricity that would respond as if they were spatially rather than retinotopically tuned. These cells will do something other than computing vector subtraction; they could be summation cells or even cells compensating to get a pseudo-spatial or head-centered tuning fully built from retinal and eye position signals.
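A two-dimensional analog of the earlier one-dimensional sketch (again our own illustration, using circularly symmetric receptive fields so that B = 0 and A = 1/σ²) behaves as Figure 6 illustrates; the grid, parameters, and argmax readout are assumptions.

```python
import numpy as np

grid = np.linspace(-50.0, 50.0, 41)
PX, PY = np.meshgrid(grid, grid)              # preferred positions rho (deg)
sigma = 30.0

def response(x, y):
    """x: flashed target (2,); y: intervening gaze shift (2,).
    Gain field is the half-rectified 2-D linearization [1 - rho.y/sigma^2]_+."""
    rf = np.exp(-((x[0] - PX) ** 2 + (x[1] - PY) ** 2) / (2.0 * sigma ** 2))
    gain = np.maximum(1.0 - (PX * y[0] + PY * y[1]) / sigma ** 2, 0.0)
    return rf * gain

r = response(np.array([25.0, 30.0]), np.array([5.0, -10.0]))
i, j = np.unravel_index(np.argmax(r), r.shape)
print("decoded saccade ~", (grid[j], grid[i]))   # near x - y = (20, 40)
```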
[Figure 7 appears here. Panel A ("Effect of RF orientation (θ)") plots the gain field slope against eye displacement direction γ for θ = 0°, 15°, and −15°, with ρ = 24°, σ₁ = 16°, σ₂ = 8°. Panel B ("Effect of variable RF size (σ)") plots the slope −ρ/σ² against receptive field eccentricity for σ = 4° + 0.21·ρ, 4° + 0.4·ρ, 4° + 0.6·ρ, 8° + 0.6·ρ, and 30° + 0.21·ρ, and for slope = −0.9×10⁻³·ρ − 2.9×10⁻⁶·ρ².]
Figure 7: (A) Slope of the gain field function in two dimensions (see equation 3.1) as a function of the angle between the eye displacement (γ) and the receptive field direction (β), here taken to be horizontal (0°). The principal direction of the largest variance has an angle θ with respect to the horizontal axis (see Figures 8A and 8B). The tilt of the receptive field with respect to the radial direction is given by |β − θ|. (B) Gain field slope versus RF eccentricity for various dependences of the receptive field (RF) size σ on the eccentricity ρ (see the appendix for a discussion). Slopes were computed using the theoretical prediction −ρ/σ². The solid black curve uses the parameters of the simulations without further correction. The dotted black curve shows the slopes for the simulations of Figure 2.
3.3 Rotations

This section addresses the case of a simple rigid rotation of the head (head tilt) around the center of gaze, without any eye displacement. These head rotations require head position signals originating from the vestibular system rather than eye position signals. We do not address torsional movements of the eyes or the issue of noncommutativity of rigid body rotations when eye orientation is taken into account (Crawford & Guitton, 1997; Crawford, Medendorp, & Marotta, 2004; Henriques, Klier, Smith, Lowy, & Crawford, 1998; Klier & Crawford, 1998; Misslisch, Tweed, & Vilis, 1998; Smith & Crawford, 2001; Tweed, 1997; Tweed, Haslwanter, & Fetter, 1998; Medendorp, Smith, Tweed, & Crawford, 2002; see also Quaia & Optican, 1998). However, we show that a subpopulation of neurons with planar gain fields negatively correlated with their receptive field eccentricity can compensate for the visual effects of a rigid head rotation. This is at least in part consistent with the rotational remapping discussed by Medendorp et al. (2002), although we address only the simplest case without dealing with errors that accumulate across multiple saccades.
By comparing the responses before and after the rotation, one obtains an estimate of the form of the extraretinal signals required if the process takes place in the absence of visual stimulus. A visual target is flashed at a location described by a vector of size x and angle α with respect to the horizontal axis and elicits a response from a neuron with receptive field center described by a vector of eccentricity ρ and direction β. Finally, the cell receptive field itself has an orientation with respect to the horizontal axis given by θ (see Figures 8A and 8B). Using the parameterization we developed in the previous section, we can write the visual response of our generic neuron as

$$R(0) = A\exp\left\{-\frac{1}{2\sigma_1^2}\big[x\cos(\alpha-\theta)-\rho\cos(\beta-\theta)\big]^2 - \frac{1}{2\sigma_2^2}\big[x\sin(\alpha-\theta)-\rho\sin(\beta-\theta)\big]^2\right\}.$$
Here α − θ and β − θ are, respectively, the angles between the directions of the vectors x and ρ and the first principal axis of the receptive field. Thus, this expression describes the projections of these vectors onto the cell's receptive field. When the head rotates around the center of gaze by an angle ω, it induces a rotation of the retinal map. From the reference frame of a single neuron whose receptive field moves with the map, this operation is equivalent to a rotation of the vector x in the opposite direction, changing the angle α to α − ω:

$$R(\omega) = A\exp\left\{-\frac{1}{2\sigma_1^2}\big[x\cos(\alpha-\omega-\theta)-\rho\cos(\beta-\theta)\big]^2 - \frac{1}{2\sigma_2^2}\big[x\sin(\alpha-\omega-\theta)-\rho\sin(\beta-\theta)\big]^2\right\}.$$

The cross terms involving the magnitudes of the vectors ρ and x that arise solely due to the rotation are

$$R(\omega) \approx A\cdot f(\omega=0)\cdot\exp\left\{2x^2 B\sin\frac{\omega}{2}\left[\sin\frac{\omega}{2}\,\cos 2\left(\alpha-\theta-\frac{\omega}{2}\right) - \sin\left(2(\alpha-\theta)-\frac{\omega}{2}\right)\right]\right\}$$
$$\cdot\exp\left\{2x\rho\sin\frac{\omega}{2}\left[A\sin\left(\alpha-\beta-\frac{\omega}{2}\right) + B\sin\left(\alpha+\beta-2\theta-\frac{\omega}{2}\right)\right]\right\}.$$
Figure 8: (A) Effect of gaze shifts for two different configurations of target x and eye position displacement y. The cell with receptive field ρ = x − y = x₁ − y₁ will give the correct updating for arbitrary values of x and y as long as the difference remains congruent with its RF, because its RF will always land on the remembered target location. This is independent of the position of FP. γ is the direction angle of the eye displacement vector. β is the direction of the vector pointing to the center of the RF in the retinal map. (B) Gaussian receptive field having its axis of largest principal variance at an angle θ with respect to the horizontal axis in two dimensions. Black arrows indicate the unit vectors in the canonical directions. Gray arrows show the corresponding vectors along the principal directions of the RF. (C) Effect of a (head) counterclockwise rotation around the CG by an angle ω. The neuron with RF center given by ρ and angle β shifts to a new orientation given by β + ω with the same eccentricity. The target position vector x with orientation α stays fixed in space, so that when viewed from the retinal map reference frame it appears rotated clockwise, by an angle −ω. The neuron's RF is assumed to be a gaussian with principal axis oriented at an angle θ with respect to the horizontal axis. This angle need not be aligned with the RF vector direction.
If the receptive field is reasonably round, so that σ1 ∼ σ2, then A approximates 1/σ² and B ∼ 0. In this case the equation, up to linear terms, simplifies to

\[
R(\omega) \approx A\, f(\omega{=}0)\left[ 1 + \frac{2x\rho}{\sigma^2}\sin\frac{\omega}{2}\,\sin\!\left(\alpha-\beta-\frac{\omega}{2}\right) + \cdots \right].
\]

For simplicity, assume that the target flashes on the horizontal axis in the right visual hemifield, and consider small
counterclockwise rotations. Then α = 0 and ω > 0 but small, so that sin(ω/2) ≈ ω/2, and we can further neglect ω when compared with β. The gain field then reads

\[
g(x,\rho,\beta,\omega) = g_0\left[ 1 - x\omega\,\frac{\rho}{\sigma^2}\,\sin(\beta) + \cdots \right].
\]

The extraretinal signal is provided by xω, which plays a role similar to that of the eye displacement in the case of translations. It (xω) is the information about the shift of the target location over the retinal map that must be supplied by extraretinal signals if the head rotation occurs in the dark. The factor sin(β) has its maximum value of 1 at β = π/2 = 90°, which is perpendicular to the direction of the target position vector. Hence, neurons with receptive field centers along the positive vertical axis will have a maximal negative slope in their gain fields. This suggests that the circuitry implementing the gain field modulation for translations would also work for this type of rotation. The only difference is that the rotation signal should be fed—presumably from a different extraretinal source—into the neurons with receptive field vector aligned perpendicular to the target position vector.

3.4 Effect of Noise on Model Performance. To address the probabilistic nature of the neuronal responses, we ran simulations of the model with Poisson-distributed noise in the output. We considered each node in the model to represent a pool of neurons, all having the same receptive field and eye position sensitivity. The response of each node is given by

\[
r(x, y) = \mathrm{median}\big[ \mathrm{Poisson}\big(n,\; A \cdot f(x) \cdot g(y)\big) \big],
\]

where r(x, y) is the response as a function of retinal target location (x) and eye position (y); n is the number of cells in the pool; A is a constant that determines the average firing rate and is the same for all nodes; and f(x) and g(y) are the underlying retinal and eye position sensitivities. The function "Poisson" generates n Poisson-distributed random deviates with λ = A · f(x) · g(y). The response of the node is computed by taking the median of this sample.
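A minimal Python sketch of this noise model (not from the original text; the gaussian f and linear-rectified g below are illustrative stand-ins for the model's sensitivity functions):

```python
import numpy as np

rng = np.random.default_rng(0)

def node_response(x, y, rho, slope, n=100, A=25.0, sigma=20.0):
    """Median of n Poisson deviates with rate A * f(x) * g(y)."""
    f = np.exp(-(x - rho)**2 / (2 * sigma**2))   # retinal sensitivity
    g = max(1.0 - slope * y, 0.0)                # linear-rectified gain field
    return np.median(rng.poisson(A * f * g, size=n))

# Larger pools give less variable responses (cf. Table 1).
for n in (1, 10, 100, 1000):
    print(n, node_response(x=5.0, y=2.0, rho=5.0, slope=0.01, n=n))
```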
Table 1: Saccade Error versus Neuronal Pool Size.

  n      A (spikes/s)   Error (deg)
  1           25           5.89
  10          25           3.64
  100         25           2.67
  1000        25           3.13
  1          100           5.32
  10         100           2.68
  100        100           1.74
  1000       100           1.43
In a range of saccade amplitudes between −60 and 60 degrees, the performance depends on the overall firing rate (A) and the pool size (n). Table 1 shows the average saccade error for A = 25 or 100 spikes per second and neuronal pool sizes (n) from 1 to 1000. Thus, with n larger than 10 and firing rates over 25 spikes per second, the average error remains around or below 3 degrees, which represents a typical spread in saccade end points in stimulation experiments of the FEF (Bruce & Goldberg, 1985).

4 Discussion

In a population of neurons with tuned retinal receptive fields that are modulated multiplicatively by eye position–dependent gain fields, vector subtraction can be computed by shifting the peak of the population activity in a retinotopic map, provided that there is a systematic relationship between the parameters of the retinal and eye position sensitivity functions. The vector subtraction computation and the requirement of response unimodality impose constraints on this relationship across the neuronal population. Specifically, the slope of the eye position sensitivity should be negatively proportional to the receptive field eccentricity. This computation generalizes to any system in which one input is encoded in a topographic map and the other input follows a rate code proportional to the displacement of the center of that map. We have applied this general idea to the specific case of double-step saccades in the FEF. We have shown how this integration of sensory and eye displacement signals gives rise to a topographic organization of saccade vector preferences independent of the initial gaze direction, in accordance with results from microstimulation experiments (Robinson & Fuchs, 1969; Mushiake, Tanatsugu, & Tanji, 1997; Mushiake, Fujii, & Tanji, 1999; Russo & Bruce, 1993). We supported the theoretical derivations with simulations in one and two dimensions using units with gaussian receptive fields and linear-rectified eye position–dependent gain fields.
Neuronal responses can be tailored to compensate to various degrees for the eye movement–induced displacement of their receptive fields by adjusting the slope of the eye position dependence of the multiplicative modulation. In the model described here, all receptive fields are retinal. They do not shift or change form. The eye position modulation is exclusively an amplitude modulation. In this sense, the model is parsimonious: it does not invoke shifting receptive fields or intermediate encoding stages or representations. Plasticity in the shape and orientation of the receptive field could allow recruitment of neurons to encode saccades in a given direction.

4.1 Relation to Previous Work.

4.1.1 Neural Network Approaches. Xing and Andersen (2000), elaborating on previous work by Zipser and Andersen (1988), implemented a neural network that could keep a memory of the positions of subsequent targets and was able to perform a double-step saccade. Their model was explicitly intended to reproduce the behavior of neurons from the lateral intraparietal area (LIP). Their network stored a queue of the various target representations and maintained the memory of them in delayed neuronal activity until the saccade was executed. Analysis of the hidden units of the network showed that the cells developed very large receptive fields and that their responses could be represented using multiplicative gain fields with a planar dependence on eye position. Interestingly, the hidden unit population splits into two subgroups. Type I units encode the location of the target in head-centered coordinates by computing the addition of the target retinal representation vector and the eye position vector. These units display a planar rectified gain field encoding the eye position that ramps up in the same direction as the retinal receptive field of the unit. Type II units encode the desired saccade vector to the second target. They also display a planar rectified gain field encoding the eye position, but in this case the planar function ramps up opposite to the preferred location of their visual receptive field. Our model suggests how cells with those characteristics can perform the computations of vector addition and subtraction and how this relates to the slope of the gain field function encoding the eye position. Salinas and Abbott (1995) have shown very generally how to use gain fields to compute virtually any kind of linear transformation downstream from the neurons on which the information about retinal location and eye position converges. These authors used linear-rectified gain fields with positive and negative slopes for the eye position coefficient. A difference between their model and the one presented here is that their gain fields have preferred gaze locations distributed uniformly across the eye position domain. Their model requires a translation-invariant encoding of the eye position input. In our model, this requirement is absent, as the only preferred gaze location is the center of gaze. Salinas and Abbott (1996) have also shown how to obtain multiplicative responses out of additive inputs by
virtue of recurrent connections in a network of neurons. We shall not discuss this point further here, since our starting point already assumes multiplicative gain fields; our results are, however, consistent in principle with theirs. Pouget and Sejnowski (1997) and Pouget, Fisher, and Sejnowski (1993) suggested that neurons in the parietal cortex may display responses given by products of gaussians representing retinal position and sigmoids that encode eye position. A set of functions constructed in that way constitutes a basis of their functional space, so that any function of those variables can be written as a linear superposition of the elements of the basis. They also pointed out that linear-rectified functions of eye position can replace the sigmoids, since the saturation at zero provides the required nonlinearity. A significant difference between our model and that of Pouget and Sejnowski (1997) is that the latter uses a family of sigmoids with constant eye position coefficient and inflection points uniformly spaced across the eye position domain. While the encoding of any nonlinear function of retinal and eye positions is guaranteed, the lack of a systematic dependence on the (oculocentric) topography of the population of neurons results in encoding through a distribution of learnable weights across the members of the basis functions and the need for a population-average readout stage downstream. In our model, each unit is already able to respond maximally across the map when the saccade metric congruent with its retinotopic field is required. No weight learning is required, although it is not precluded, and the saccade updating can be read from the location of the activity peak. The model presented here uses the eye displacement with respect to the instantaneous center of gaze, which is not equivalent to other points in the eye position field. Encoding of the eye position in a translation-invariant manner is therefore not explicitly required, and although not precluded, there is no need for a representation of the orbital position of the eye or the direction of gaze. Quaia and collaborators (Quaia, Optican, & Goldberg, 1998) addressed the computation of vector subtraction in the FEF using switching circuits. Their model was intended to predict the saccade vector for the second saccade of a double-saccade task. They constructed a neural network, with units resembling the responses of neurons from areas LIP and FEF, that is able to perform this calculation. Their main premise is that the brain does not require an explicit supraretinal representation of the target given, for example, in external space or body-centered coordinates. Instead it suffices to buffer information in the tonic and sustained activity of neurons in areas LIP and FEF in order to obtain a dynamic remapping of the visual space.
and the eye position. This computation is a weighted average of the retinal locations using the eye position gain field as the weight. In order to link the two factors, one needs to assume a dependence of the gain field parameters on the receptive field location of the cell encoding the retinal location of the target. Siegel showed that it is possible to encode linearly the head-centered coordinates of the object, assuming that the responses of the neurons of area 7a are modulated by a gain field that is linear in the eye position and that the intercept parameter of this gain field function varies in linear proportion to the location of the center of the receptive field of the cell.

4.2 Relation to Behavioral Tasks and Electrophysiological Experiments. A necessary condition for this model to work is the presence of appropriate eye position signals in the FEF. We report on the experimental quantification of these signals in separate work (Cassanello & Ferrera, 2004, 2005). This modulation would allow the cell population to predict the new position of the target with respect to the displaced retinal map. This change in response could be anticipatory to a saccade (efference copy) and provide the correct metric for the desired eye movement (Colby, Duhamel, & Goldberg, 1995; Duhamel, Colby, & Goldberg, 1992; Goldberg & Bruce, 1990; Umeno & Goldberg, 1997; Walker, Fitzgibbon, & Goldberg, 1995). Alternatively, it could be an afferent signal from extraocular proprioceptors. In an electrophysiological experiment, if the fixation point is eccentric to the center of gaze, the background activity or the response to a peripheral visual target should reveal the afferent signal. If this circuitry is hard-wired in the neuronal population, a task consisting of visually guided saccades with starting points at various eccentric locations along the preferred direction of a cell's receptive field should be able to uncover the negative correlation between receptive field eccentricity and the slope of the eye position gain field (e.g., Andersen et al., 1990). The model predicts a correlation between the value of the slope of the eye position sensitivity function and the ratio between the receptive field eccentricity and the square of the receptive field size (ρ/σ²). That correlation is an identity in the case of the exponential gain field and is relaxed in the linear-rectified model. It will also be affected by variation of the receptive field size with eccentricity and by the presence of background firing (see the appendix). The model presented here can be enriched by adding further dependencies on other variables, such as eye velocity or acceleration, or more complicated dynamics.

5 Conclusion

Eye position signals that modulate visual activity have proven to be pervasive across several areas of the brain, such as LIP, 7a, V4, and the superior colliculus (Andersen et al., 1985; Andersen et al., 1990; Siegel, 1998; Salinas & Abbott, 1995; Bremmer, Ilg, Thiele, Distler, & Hoffmann, 1997; Bremmer, Pouget,
& Hoffmann, 1998; DeSouza et al., 2000; Funahashi & Takeda, 2002; Xing & Andersen, 2000; Snyder, 2000; Van Opstal et al., 1995; Campos et al., 2006), and there is plenty of evidence from neurophysiological data in the FEF (Cassanello & Ferrera, 2004, 2005). The idea of gain field modulations taking place in several areas of the brain is not new, nor is the proposal of using linear-rectified gain fields. However, to the best of our knowledge, this algorithm and the rationale used to derive it have not been presented before. In our opinion, this work complements and illuminates approaches based on neural networks. The general algorithm can be applied to maps in other areas of the brain and to maps of other features (e.g., an auditory map), which can be integrated with position signals from different areas of the brain known to encode such signals. The gain field modulation, appropriately tuned, can implement coordinate transformations into various frames of reference and produce neuronal responses that may look retinotopically tuned, head-centered tuned, or intermediate between these limits. This may explain intermediate coordinate frames found in studies of foveating or reaching for objects (Mullette-Gillman, Cohen, & Groh, 2005; Baker, Harper, & Snyder, 2003; Batista, Buneo, Snyder, & Andersen, 1999; Buneo, Jarvis, Batista, & Andersen, 2002). This property gives large flexibility to the algorithm, and because of the simplicity of its implementation, it may well be the computational tool of choice for the brain. What has to be tailored is the way in which the different extraretinal signals that one wants to integrate with the visual ones are fed in. However, this is just a matter of choosing the correct circuitry at the level of the inputs. The planar slope in the relevant gain field is virtually all that is needed, and it is present consistently.

Appendix

A.1 Validity of the Linear-Rectified Approximation. The shift in the peak of population activity can be understood by comparing the response elicited by a target flashed twice at the same spatial location: the first time with the eyes at FP1 and the second time after an eye displacement y to FP2 (see Figure 4A). The cell with receptive field center at position ρ1 = x will give a response A when the target is flashed with the eyes at FP1. The cell with receptive field at ρ2 = x − y will give a smaller response of value A · exp(−y²/2σ²), because their receptive field centers are separated by y. When the eyes fixate at FP2 and the target is flashed at the same spatial location, cell 1 will decrease its response to A · exp(−y²/2σ²) and cell 2 will increase its response to A, since the target is now at the center of its receptive field. Our working hypothesis is that the brain mimics this switch in activity when there is no second flashing of the target. For the linear-rectified approximation, these activities are shown in equation 2.11. The regime of values of x and y in which R2 grows higher than R1, and consequently the peak of activity shifts from ρ1 to ρ2, is defined by the
condition

\[
1 - \frac{x \cdot y}{\sigma^2} \le \exp\!\left(-\frac{y^2}{2\sigma^2}\right)\left(1 - \frac{(x-y)\cdot y}{\sigma^2}\right),
\]

which implies

\[
-|x| \cos(\alpha) \le |y| \left[ \exp\!\left(\frac{y^2}{2\sigma^2}\right) - 1 \right]^{-1} - \frac{\sigma^2}{|y|}.
\]

Here α is the angle between the vectors x and y. The most unfavorable situation occurs when these vectors point in opposite directions, that is, cos(α) = −1. These are the curves plotted in Figure 4B for several values of σ. If the eye sensitivity functions have an intercept larger than one, the range of validity will be enhanced, because the curves F(|y|) will be shifted to the right. Equivalently, the intercept can be kept at unity while the gain field slope is reduced by the inverse of the intercept. The error in the correct position of ρ2 = x − y can be computed by taking the derivative of equation 2.7 with A2 = 1/σ², at fixed x and y, finding the root of this derivative set equal to zero, and comparing the value of the root to the correct saccade metric x − y.

A.2 Two-Dimensional Receptive Field Parameterization. The unit vectors along the principal axes of the cell receptive field centered at ρ can be written in terms of the unit vectors along the canonical axes (see Figure 8B):

\[
e_1 = \cos(\theta)\, e_x + \sin(\theta)\, e_y, \qquad e_2 = -\sin(\theta)\, e_x + \cos(\theta)\, e_y,
\]

so that the inverted relation is

\[
e_x = \cos(\theta)\, e_1 - \sin(\theta)\, e_2, \qquad e_y = \sin(\theta)\, e_1 + \cos(\theta)\, e_2.
\]

A vector ν = (νx, νy) can be written in the basis {e1, e2}, and the components will be

\[
\nu = \nu_x e_x + \nu_y e_y = \big(\nu_x \cos(\theta) + \nu_y \sin(\theta)\big)\, e_1 + \big(-\nu_x \sin(\theta) + \nu_y \cos(\theta)\big)\, e_2 .
\]
If we substitute the vector ν = x − ρ = (x1 − ρ1) ex + (x2 − ρ2) ey, the argument of the gaussian representing the retinal receptive field becomes

\[
-\frac{1}{2\sigma_1^2}\big((x_1-\rho_1)\cos(\theta) + (x_2-\rho_2)\sin(\theta)\big)^2 - \frac{1}{2\sigma_2^2}\big(-(x_1-\rho_1)\sin(\theta) + (x_2-\rho_2)\cos(\theta)\big)^2 .
\]
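A short Python sketch of this parameterization (illustrative, not from the original text) evaluates the elliptical gaussian receptive field with principal axes rotated by θ:

```python
import numpy as np

def rf_response(x, rho, theta, sigma1, sigma2, A=1.0):
    """Gaussian RF centered at rho with principal axes at angle theta.
    x and rho are 2-vectors in the canonical (retinal) frame."""
    d = np.asarray(x, float) - np.asarray(rho, float)
    u = d[0] * np.cos(theta) + d[1] * np.sin(theta)    # projection on e1
    v = -d[0] * np.sin(theta) + d[1] * np.cos(theta)   # projection on e2
    return A * np.exp(-u**2 / (2 * sigma1**2) - v**2 / (2 * sigma2**2))

print(rf_response([12, 4], [10, 5], theta=0.3, sigma1=8, sigma2=5))
```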
A.3 Generalization to Other Shapes of Receptive Fields. Equation 2.2 can be applied to other receptive field shapes. Examples of visual receptive fields f(x, ρ) that have the shape of unimodal, localized spatial functions are listed below, each with its companion gain field g(y, ρ):

\[
f(x,\rho) = f_0\left[1 - \left(\frac{x-\rho}{\sigma}\right)^2\right],
\qquad
g(y,\rho) = g_0 \exp\!\left[-2y\,\frac{\rho}{\sigma^2}\,\frac{1}{1-(y/\sigma)^2}\right]
\cong g_0 \exp\!\left[-2y\,\frac{\rho}{\sigma^2}\left(1 + \left(\frac{y}{\sigma}\right)^2 + \cdots\right)\right];
\]

\[
f(x,\rho) = \begin{cases} f_0 \cos^{2n}\!\left(\dfrac{\pi}{2\sigma}(x-\rho)\right), & |x-\rho| \le \sigma \\[4pt] 0, & \text{otherwise,} \end{cases}
\qquad
g(y,\rho) = g_0 \exp\!\left[-\frac{n\pi}{2\sigma}\,\rho\,\tan\!\left(\frac{\pi}{2\sigma}\,y\right)\right]
\cong g_0\left[1 - y\,\frac{n\pi}{2}\,\frac{\rho}{\sigma^2} + \cdots\right];
\]

\[
f(x,\rho) = f_0 \Big/ \cosh\!\left(\frac{x-\rho}{\sigma}\right),
\qquad
g(y,\rho) = g_0 \exp\!\left[-\frac{\rho}{\sigma}\tanh\!\left(\frac{y}{\sigma}\right)\right]
\approx g_0\left[1 - y\,\frac{\rho}{\sigma^2} + \cdots\right];
\]

\[
f(x,\rho) = \frac{f_0}{(x-\rho)^2 + \sigma^2} = \frac{f_0}{\sigma^2}\left[1 + \left(\frac{x-\rho}{\sigma}\right)^2\right]^{-1},
\qquad
g(y,\rho) = g_0 \exp\!\left[-2y\,\frac{\rho}{y^2+\sigma^2}\right]
= g_0 \exp\!\left[-2y\,\frac{\rho}{\sigma^2}\,\frac{1}{1+(y/\sigma)^2}\right].
\]
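The functional requirement on each companion pair is that the population activity f(x, ρ)g(y, ρ), viewed as a function of ρ, peaks at ρ = x − y. A small Python check of this property for two of the pairs above (illustrative values; not from the original text):

```python
import numpy as np

def peak_location(f, g, x=10.0, y=3.0, sigma=20.0):
    """Return the rho maximizing f(x, rho) * g(y, rho); should be x - y."""
    rho = np.linspace(-40.0, 40.0, 200001)
    return rho[np.argmax(f(x, rho, sigma) * g(y, rho, sigma))]

# sech-shaped receptive field and its companion gain field
f_sech = lambda x, rho, s: 1.0 / np.cosh((x - rho) / s)
g_sech = lambda y, rho, s: np.exp(-(rho / s) * np.tanh(y / s))

# Lorentzian receptive field and its companion gain field
f_lor = lambda x, rho, s: 1.0 / ((x - rho)**2 + s**2)
g_lor = lambda y, rho, s: np.exp(-2.0 * y * rho / (y**2 + s**2))

print(peak_location(f_sech, g_sech))  # ~7.0 = x - y
print(peak_location(f_lor, g_lor))    # ~7.0
```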
In all of these examples, the companion gain field is an exponential of a function of the eye displacement, with coefficients negatively proportional to the receptive field eccentricity and inversely proportional to the square of the receptive field width. The first-order approximations to these expressions are linear-rectified gain fields, with slopes proportional to the same combination of parameters.

A.4 Effect of Receptive Field Size. If there is a systematic dependence of the receptive field size on the receptive field eccentricity, this dependence
enters equation 2.2 and affects both the exact solution and the linear-rectified approximation of the gain field. In particular, the gain field slope can differ from the ratio −ρ/σ². When the receptive field width σ depends on the eccentricity ρ, equation 2.2 changes into

\[
\frac{1}{\sigma^3}\frac{\partial\sigma}{\partial\rho}(x-\rho)^2 + \frac{1}{\sigma^2}(x-\rho) + \frac{\partial}{\partial\rho}\log(g)\Big|_{y} = 0. \tag{A.1}
\]
A simple linear dependence of the receptive field size on the eccentricity is σ(ρ) = σ0(1 + ρ/λ), so that for foveal values of ρ the size of the receptive field is roughly σ0, and λ gives the spatial scale over which σ becomes proportional to the eccentricity. Figure 7B shows the region of expected values of the gain field slope for single cells, for reasonable ranges of σ0 and k = σ0/λ. The equation that determines g becomes

\[
\frac{\partial}{\partial\rho}\log(g)\Big|_{y} = -\frac{1}{\sigma^2}(x-\rho) - \frac{1}{\sigma^3}\frac{d\sigma}{d\rho}(x-\rho)^2 = -\frac{1}{\sigma^2}\,y - \frac{1}{\sigma^3}\frac{d\sigma}{d\rho}\,y^2. \tag{A.2}
\]
The second term on the right-hand side arises from the dependence of σ on ρ. This term introduces a direct quadratic dependence on the eye position in the linear-rectified approximation of the exponential gain field: a "parabolic" gain field. However, this dependence has the opposite sign to a similar term arising from higher orders in the expansion of the exponential gain field; it therefore competes with it in magnitude, helping to stabilize the validity of the linear dependence on the eye position. When k = 0, one recovers equation 2.2. Integrating equation A.2 over ρ gives the following formal expression:

\[
\log(g/g_0) = -y \int_{\sigma(0)}^{\sigma(\rho)} \frac{d\sigma}{\sigma^2(t)\, d\sigma/dt} - y^2 \int_{\sigma(0)}^{\sigma(\rho)} \frac{d\sigma}{\sigma^3(t)}
= -y \int_{\sigma(0)}^{\sigma(\rho)} \frac{d\sigma}{\sigma^2(t)\, d\sigma/dt} - \frac{y^2}{2}\, \frac{\sigma^2(\rho)-\sigma^2(0)}{\sigma^2(0)\,\sigma^2(\rho)}.
\]
If we use for σ the specific linear form given below equation A.1, we get

\[
\log(g/g_0) = -\frac{\rho}{\sigma_0\,\sigma}\, y - \frac{y^2}{2}\,\frac{\sigma^2-\sigma_0^2}{\sigma_0^2\,\sigma^2}
= -\frac{\rho}{\sigma_0(\sigma_0+k\rho)}\, y - \frac{y^2}{2}\,\frac{k\rho\,(2\sigma_0+k\rho)}{\sigma_0^2\,(\sigma_0+k\rho)^2}.
\]
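A short Python sketch (illustrative; parameter values follow the text) of how this linear-in-y slope, ρ/(σ0(σ0 + kρ)), behaves in the two limits discussed next:

```python
import numpy as np

def gain_field_slope(rho, sigma0, k):
    """Magnitude of the linear-in-y coefficient of log(g/g0) when
    sigma(rho) = sigma0 + k * rho."""
    return rho / (sigma0 * (sigma0 + k * rho))

rho = np.array([1.0, 10.0, 50.0, 200.0, 1000.0])
for k in (0.21, 0.4, 0.6):
    print(k, gain_field_slope(rho, sigma0=5.0, k=k))
    # For rho >> sigma0 the slope saturates near 1/(k * sigma0).
```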
Notice that now the slope of the gain field function is given by ρ/(σ0(σ0 + kρ)). There are two limits to consider here. When σ0 ≫ ρ, the
slope of the gain field approximates the ratio ρ/σ0², as when the width of the receptive field does not depend on the eccentricity. When σ0 ≪ ρ, the slope of the gain field is given by the ratio (1/kσ0)(1 − σ0/kρ). The second factor is almost one because ρ ≫ σ0. The slope becomes independent of the eccentricity and could become very flat or very steep, depending on the values of k and σ0. If the size of the receptive field changes from, say, 5 to 25 degrees over a 50 degree range, then k ≈ 0.4. Figure 7B shows plots of the value of the gain field slope versus the receptive field eccentricity for k = 0.21, 0.4, and 0.6, with small foveal σ0. There is also a plot of the slopes obtained using the parameters of the simulations (see Figure 2), for which σ0 = 30° and k = 0.21, and a plot of the gain field slopes including the small quadratic dependence on the receptive field eccentricity. One can consider two other cases, one of which can be computed exactly:

1. σ(ρ) = σ0 exp(ρ/λ), in which case σ grows faster than any power of ρ/λ.
2. σ(ρ) = σ0 + σ0 ln(1 + ρ/eλ), where σ grows slower than any power of ρ/eλ.

Case 1 gives the solution

\[
\log(g/g_0) = -y\,\frac{\rho}{\sigma_0^2}\left(1-\frac{\rho}{\lambda}\right)\left(1+\frac{y}{\lambda}\right).
\]

The slope of the gain field can change dramatically depending on λ. An intuitive interpretation of this parameter would be the scale of the foveal representation in cortex. For eccentricities below λ, the receptive field sizes would be roughly constant. Beyond that scale, the receptive field size grows strongly with the eccentricity. If λ is very large, we recover the result derived before, and the slope of the gain field would be the ratio ρ/σ². Loosely speaking, this would be the case in which our entire retina is foveal. If λ is of the order of 5 to 10 degrees, the interesting twist is that the slope of the gain field function reverses sign beyond that eccentricity. If this were the correct scenario, the gain field slopes of neurons with very eccentric receptive fields could be positively correlated with the receptive field center. Admittedly, a rate of receptive field growth exponential in the eccentricity is extreme, but this example points out that the predicted negative correlation between gain field slopes and receptive field eccentricity should not be demanded for every neuron. The other case can be estimated. We arrive at the following expression:

\[
\log(g/g_0) = -y\,\frac{\lambda}{\sigma_0^2} \int_1^{\sigma(\rho)/\sigma_0} du\; u^{-2}\, e^{u} \;-\; \frac{y^2}{2}\,\frac{1}{\sigma_0^2}\left[\,1 - \big(1 + \ln(1+\rho/e\lambda)\big)^{-2}\right].
\]
As long as σ0 < σ < 2σ0, the integral in the first term is bounded:

\[
2^{-2} e^2 \,\ln\!\left(1+\frac{\rho}{e\lambda}\right) \;\le\; \int_1^{\sigma(\rho)/\sigma_0} du\; u^{-2}\, e^{u} \;\le\; e \,\ln\!\left(1+\frac{\rho}{e\lambda}\right).
\]

The second term introduces a quadratic dependence on the eye position. An estimate of the gain field for values of the receptive field eccentricity smaller than e · λ would be given by

\[
\log(g/g_0) = -y\, k\,\frac{\rho}{\sigma_0^2} - y^2\,\frac{\rho}{e\lambda\,\sigma_0^2} + \cdots,
\]

with 0.6931 < k < 1.

A.5 Effect of the Background Firing Rate. The derivations presented so far neglected the presence of significant background firing in the response of the neurons. If there is background firing,

\[
f(x,\rho) = A_0 + A_1 \exp\!\left(-\frac{1}{2\sigma^2}(x-\rho)^2\right),
\]

with the same conventions used before. The derivative of the log of this function, when A0 is not negligible with respect to A1, is

\[
\frac{\partial \log[f(x,\rho)]}{\partial\rho} = \left[A_0 + A_1 \exp\!\left(-\frac{1}{2\sigma^2}(x-\rho)^2\right)\right]^{-1} A_1\,\frac{1}{\sigma^2}\,(x-\rho)\exp\!\left(-\frac{1}{2\sigma^2}(x-\rho)^2\right),
\]

which should be set equal to the derivative of the log of the gain field with the constraint enforced:

\[
\frac{\partial \log[g(y,\rho)]}{\partial\rho} = -\left[A_0 + A_1 \exp\!\left(-\frac{1}{2\sigma^2}y^2\right)\right]^{-1} A_1\,\frac{y}{\sigma^2}\,\exp\!\left(-\frac{1}{2\sigma^2}y^2\right).
\]

Integrating over ρ, the gain field takes the form

\[
g(y,\rho) = g_0 \exp\!\left[-\frac{\rho}{\sigma^2}\, y \,\frac{A_1 \exp\!\left(-\frac{1}{2\sigma^2}y^2\right)}{A_0 + A_1 \exp\!\left(-\frac{1}{2\sigma^2}y^2\right)}\right].
\]
For y small compared to √2·σ,

\[
\frac{A_1 \exp\!\left(-\frac{1}{2\sigma^2}y^2\right)}{A_0 + A_1 \exp\!\left(-\frac{1}{2\sigma^2}y^2\right)} \approx \frac{A_1}{A_0+A_1}\left[1 - \frac{A_0}{A_0+A_1}\,\frac{1}{2\sigma^2}\, y^2 + \cdots\right]
\]

and

\[
g(y,\rho) \approx g_0 \exp\!\left[-\frac{\rho}{\sigma^2}\,\frac{A_1}{A_0+A_1}\, y + \frac{\rho}{2(\sigma^2)^2}\,\frac{A_1 A_0}{(A_0+A_1)^2}\, y^3 + \cdots\right].
\]
The ratio A1/(A0 + A1) reduces the size of the slope of the gain field if A0 becomes a sizable proportion of A1. There is also a competition between the linear and the third-order terms in the eye displacement, which we shall not discuss further.
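A brief Python sketch (illustrative values, not from the original text) of how background firing scales down the linear gain field slope:

```python
def effective_slope(rho, sigma, A0, A1):
    """Linear-in-y slope of log(g/g0) with background firing A0."""
    return (rho / sigma**2) * (A1 / (A0 + A1))

for A0 in (0.0, 5.0, 25.0):
    print(A0, effective_slope(rho=10.0, sigma=20.0, A0=A0, A1=50.0))
    # The slope shrinks by A1/(A0 + A1) as background firing grows.
```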
Acknowledgments

This research was supported by grant NIMH MH59244. We are indebted to Lawrence Abbott for pointing out important differences between gain field models and for encouraging and enlightening suggestions. C.R.C. thanks Ning Qian and Suresh Krishna for interesting suggestions and illuminating discussions.

References

Andersen, R. A., Bracewell, R. M., Barash, S., Gnadt, J. W., & Fogassi, L. (1990). Eye position effects on visual, memory and saccade-related activity in areas LIP and 7a of macaque. Journal of Neuroscience, 10(4), 1176–1196.
Andersen, R. A., Essick, G. K., & Siegel, R. M. (1985). Encoding of spatial location by posterior parietal neurons. Science, 230, 456–458.
Andersen, R. A., & Mountcastle, V. B. (1983). The influence of the angle of gaze upon the excitability of the light-sensitive neurons of the posterior parietal cortex. Journal of Neuroscience, 3, 532–548.
Baker, J. T., Harper, T. M., & Snyder, L. H. (2003). Spatial memory following shifts of gaze: Saccades to memorized world-fixed and gaze-fixed targets. Journal of Neurophysiology, 89, 2564–2576.
Balan, P. F., & Ferrera, V. P. (2003). Effects of gaze shifts on maintenance of spatial memory in macaque frontal eye field. Journal of Neuroscience, 23, 5446–5454.
Batista, A. P., Buneo, C. A., Snyder, L. H., & Andersen, R. A. (1999). Reach plans in eye-centered coordinates. Science, 285, 257–260.
Bremmer, F., Ilg, U. J., Thiele, A., Distler, C., & Hoffmann, K.-P. (1997). Eye position effects in monkey cortex. I. Visual and pursuit-related activity in extrastriate areas MT and MST. Journal of Neurophysiology, 77, 944–961.
Bremmer, F., Pouget, A., & Hoffmann, K.-P. (1998). Eye position encoding in the macaque posterior parietal cortex. European Journal of Neuroscience, 10, 153–160.
Bruce, C. J., & Goldberg, M. E. (1985). Primate frontal eye fields. I. Single neurons discharging before saccades. Journal of Neurophysiology, 53, 603–635.
Buneo, C. A., Jarvis, M. R., Batista, A. P., & Andersen, R. A. (2002). Direct visuomotor transformations for reaching. Nature, 416, 632–636.
Campos, M., Cherian, A., & Segraves, M. A. (2006). Effects of eye position upon activity of neurons in macaque superior colliculus. Journal of Neurophysiology, 95, 505–526.
Cassanello, C. R., & Ferrera, V. P. (2004). Vector subtraction using gain fields in the FEF of macaque monkeys. Society for Neuroscience abstract, 34th Annual Meeting.
Cassanello, C. R., & Ferrera, V. P. (2005). Eye position signals and vector subtraction in FEF neurons. Society for Neuroscience abstract, 35th Annual Meeting.
Colby, C. L., Duhamel, J. R., & Goldberg, M. E. (1995). Oculocentric spatial representation in parietal cortex. Cerebral Cortex, 5, 470–481.
Crawford, J. D., & Guitton, D. (1997). Visual-motor transformations for accurate and kinematically correct saccades. Journal of Neurophysiology, 78, 1447–1467.
Crawford, J. D., Medendorp, W. P., & Marotta, J. J. (2004). Spatial transformations for eye-hand coordination. Journal of Neurophysiology, 92, 10–19.
DeSouza, J. F. X., Dukelow, S. P., Gati, J. S., Menon, R. S., Andersen, R. A., & Vilis, T. (2000). Eye position signal modulates a human parietal pointing region during memory-guided movements. Journal of Neuroscience, 20, 5835–5840.
Duhamel, J. R., Colby, C. L., & Goldberg, M. E. (1992). The updating of the representation of visual space in parietal cortex by intended eye movements. Science, 255, 90–92.
Funahashi, S., & Takeda, K. (2002). Information processes in the primate prefrontal cortex in relation to working memory processes. Reviews in the Neurosciences, 12, 313–345.
Goldberg, M. E., & Bruce, C. J. (1990). Primate frontal eye fields. III. Maintenance of a spatially accurate saccade signal. Journal of Neurophysiology, 64, 489–508.
Henriques, D. Y. P., Klier, E. M., Smith, M. A., Lowy, D., & Crawford, J. D. (1998). Gaze-centered remapping of remembered visual space in an open-loop pointing task. Journal of Neuroscience, 15, 1583–1594.
Klier, E. M., & Crawford, J. D. (1998). Human oculomotor system accounts for 3-D eye orientation in the visual-motor transformation for saccades. Journal of Neurophysiology, 80, 2274–2294.
Medendorp, W. P., Smith, M. A., Tweed, D. B., & Crawford, J. D. (2002). Rotational remapping in human spatial memory during eye and head motion. Journal of Neuroscience, 22, RC196.
Misslisch, H., Tweed, D., & Vilis, T. (1998). Neural constraints on eye motion in human eye-head saccades. Journal of Neurophysiology, 79, 859–869.
Mitchell, J. F., & Zipser, D. (2003). Sequential memory-guided saccades and target selection: A neural model of the frontal eye fields. Vision Research, 43, 2669–2695.
Mullette-Gillman, O. A., Cohen, Y. E., & Groh, J. M. (2005). Eye-centered, head-centered, and complex coding of visual and auditory targets in the intraparietal sulcus. Journal of Neurophysiology, 94, 2331–2352.
Mushiake, H., Fujii, N., & Tanji, J. (1999). Microstimulation of the lateral wall of the intraparietal sulcus compared with the frontal eye field during oculomotor tasks. Journal of Neurophysiology, 81, 1443–1448.
Mushiake, H., Tanatsugu, Y., & Tanji, J. (1997). Neuronal activity in the ventral part of premotor cortex during target-reach movement is modulated by direction of gaze. Journal of Neurophysiology, 78, 567–571.
Pouget, A., Fisher, S. A., & Sejnowski, T. J. (1993). Egocentric spatial representation in early vision. Journal of Cognitive Neuroscience, 5, 150–161.
Pouget, A., & Sejnowski, T. J. (1997). Spatial transformations in the parietal cortex using basis functions. Journal of Cognitive Neuroscience, 9, 222–237.
Pouget, A., & Snyder, L. H. (2000). Computational approaches to sensorimotor transformations. Nature Neuroscience, 3, 1192–1198.
Quaia, C., & Optican, L. M. (1998). Commutative saccadic generator is sufficient to control a 3-D ocular plant with pulleys. Journal of Neurophysiology, 79, 3197–3215.
Quaia, C., Optican, L. M., & Goldberg, M. E. (1998). The maintenance of spatial accuracy by the perisaccadic remapping of visual receptive fields. Neural Networks, 11, 1229–1240.
Robinson, D. A., & Fuchs, A. F. (1969). Eye movements evoked by stimulation of frontal eye fields. Journal of Neurophysiology, 32, 637–648.
Russo, G. S., & Bruce, C. J. (1993). Effect of eye position within the orbit on electrically elicited saccadic eye movements: A comparison of the macaque monkey's frontal and supplementary eye fields. Journal of Neurophysiology, 69, 800–818.
Salinas, E., & Abbott, L. F. (1995). Transfer of coded information from sensory to motor networks. Journal of Neuroscience, 15, 6461–6474.
Salinas, E., & Abbott, L. F. (1996). A model of multiplicative neural responses in parietal cortex. Proceedings of the National Academy of Sciences USA, 93, 11956–11961.
Salinas, E., & Abbott, L. F. (1997). Invariant visual responses from attentional gain fields. Journal of Neurophysiology, 77, 3267–3272.
Siegel, R. M. (1998). Representation of visual space in area 7a neurons using the center of mass equation. Journal of Computational Neuroscience, 5, 365–381.
Smith, M. A., & Crawford, J. D. (2001). Implications of ocular kinetics for the internal updating of visual space. Journal of Neurophysiology, 86, 2112–2117.
Snyder, L. H. (2000). Coordinate transformations for eye and arm movements in the brain. Current Opinion in Neurobiology, 10, 747–754.
Tweed, D. (1997). Three-dimensional model of the human eye-head saccadic system. Journal of Neurophysiology, 77, 654–666.
Tweed, D., Haslwanter, T., & Fetter, M. (1998). Optimising gaze control in three dimensions. Science, 281, 1363–1366.
Umeno, M. M., & Goldberg, M. E. (1997). Spatial processing in the monkey frontal eye field. I. Predictive visual responses. Journal of Neurophysiology, 78(3), 1373–1383.
Van Opstal, A. J., Hepp, K., Suzuki, Y., & Henn, V. (1995). Influence of eye position on activity in monkey superior colliculus. Journal of Neurophysiology, 74, 1593–1610.
Walker, M. F., Fitzgibbon, E. J., & Goldberg, M. E. (1995). Neurons in the monkey superior colliculus predict the visual result of impending saccadic eye movements. Journal of Neurophysiology, 73, 1988–2003.
Xing, J., & Andersen, R. A. (2000). Memory activity of LIP neurons for sequential eye movements simulated with neural networks. Journal of Neurophysiology, 84, 651–665.
Zipser, D., & Andersen, R. A. (1988). A back-propagation programmed network that simulates response properties of a subset of posterior parietal neurons. Nature, 331, 679–684.
Received November 8, 2005; accepted November 1, 2006.
LETTER
Communicated by Nihat Ay
Representations of Space and Time in the Maximization of Information Flow in the Perception-Action Loop Alexander S. Klyubin [email protected] Adaptive Systems Research Group, School of Computer Science, University of Hertfordshire, Hatfield Herts AL10 9AB, U.K.
Daniel Polani [email protected] Adaptive Systems and Algorithms Research Groups, School of Computer Science, University of Hertfordshire, Hatfield Herts AL10 9AB, U.K.
Chrystopher L. Nehaniv [email protected] Adaptive Systems and Algorithms Research Groups, School of Computer Science, University of Hertfordshire, Hatfield Herts AL10 9AB, U.K.
Sensor evolution in nature aims at improving the acquisition of information from the environment and is intimately related with selection pressure toward adaptivity and robustness. Our work in the area indicates that information theory can be applied to the perception-action loop. This letter studies the perception-action loop of agents, which is modeled as a causal Bayesian network. Finite state automata are evolved as agent controllers in a simple virtual world to maximize information flow through the perception-action loop. The information flow maximization organizes the agent's behavior as well as its information processing. To gain more insight into the results, the evolved implicit representations of space and time are analyzed in an information-theoretic manner, which paves the way toward a principled and general understanding of the mechanisms guiding the evolution of sensors in nature and provides insights into the design of mechanisms for artificial sensor evolution.

1 Introduction

1.1 Information in the Perception-Action Loop. In nature, sensors and actuators of living agents are constantly engaged in complicated interaction. Despite the sensors and the actuators being specific to agents' niches and needs, a general and at the same time a minimal model of sensorimotor interaction is desired.

Neural Computation 19, 2387–2432 (2007)
C 2007 Massachusetts Institute of Technology
Although a large body of research identifies information as useful for understanding sensors, less work has been done on actuators. The pressure to capture information relevant to an agent's survival may push sensors to the limits. For example, vision sensors can operate at the limits set by physics (Bouman, 1959; Bialek, 1987). Capturing and processing sensory information have their metabolic costs (Laughlin, de Ruyter van Steveninck, & Anderson, 1998). Neural pathways have limited information transmission rates. Optimization of information-related quantities in artificial neural networks results in phenomena similar to ones found in biological sensory pathways (Barlow, 1959; Linsker, 1988; Nadal & Parga, 1995; Bell & Sejnowski, 1995), and information-theoretic tools provide versatile ways to analyze biological and artificial neural systems (Avraham, Chechik, & Ruppin, 2003; Schneidman, Bialek, & Berry, 2003). Actions have also been identified as important from an informational perspective; however, to the best of our knowledge, less research has treated actions information-theoretically on par with sensors. Actions may facilitate active perception to uncover hidden information or simplify information processing in sensory pathways (Kirsh & Maglio, 1994; Nolfi & Marocco, 2002). Actions are also crucial for offloading and later reacquisition of information (Hutchins, 1995; Klyubin, Polani, & Nehaniv, 2004c). In this letter, the use of causal Bayesian networks is introduced to model the perception-action loop. This formalism allows the treatment of perception-action in a consistent information-theoretic manner: from the capture of environmental information through the sensors, to the "imprinting" of information onto the environment through the actuators. The causal structure of the networks allows the formulation of the concept of information flow (Ay & Wennekers, 2003; Klyubin, Polani, & Nehaniv, 2004a; Ay & Krakauer, 2007). Information flows are a signature of the causal structure, and they allow one to track the flow of pieces of information through the system, analogous to how the flow of radioisotope tracers is tracked through a living organism. Quantitative information-theoretic treatment of sensors and actions as parts of the perception-action loop was implicitly done in the early work of Ashby (1956) and recently revived by Touchette and Lloyd (2000). This letter proposes an information-theoretic method with a minimal set of assumptions for studying and constructing perception-action loops, which may span an arbitrary number of time steps as opposed to just a single time step as in the models of Ashby, and Touchette and Lloyd. The use of information theory reduces the bias of the processing architecture to a minimum, creating a "coordinate-free"1 view of what happens with information processing.
1 We are indebted to Peter Dauscher for suggesting this analogy.
1.2 Contributions of the Letter. The main contributions of this letter are:

1. Formulating an information-theoretic picture of the perception-action loop with a minimal set of assumptions. The interactions of an agent's sensors, actuators, memory, and the environment are modeled using causal Bayesian networks.

2. Maximizing information flow through the full perception-action loop of an agent, replacing structural assumptions such as those of the infomax of Linsker (1988; e.g., receptive fields, layers) or the compositionality of the temporal infomax of Ay and Wennekers (Ay & Wennekers, 2003; Wennekers & Ay, 2005; i.e., components of a finely grained computational architecture, which are identifiable through time) by agent-environment interaction over many time steps and minimal assumptions about the agent's information processing.

3. Using information-theoretic tools to analyze representations2 of space and time resulting from the perception-action loop infomax. Since some of the representations are hard to interpret due to the lack of architectural assumptions, the judicious use of information-theoretic tools is essential.

1.3 Outline of the Letter. This letter is structured as follows. First, related work is discussed in relation to this letter in section 2. In section 3, a formalism for treating the perception-action loop using information-theoretic methods is laid out. In section 4, the formalism is used to maximize an information flow through a simulated agent and its environment through time. Evolutionary search is employed in the experiment to find agent controllers that capture the maximum amount of information about the state of the world. The evolved controllers are analyzed in the second part of the letter by looking at their mechanisms and their behavior through information-theoretic tools. The following questions are addressed. What information about the initial position do the evolved automata capture? How do the controllers capture the information? How is the information represented in the state of the controller? In section 5, the representation of the agent's initial position in the state of its controller at the end of the run is analyzed, extending the results of Klyubin et al. (2004a) by studying how the representations change with small variations in the agent's "embodiment" (what the sensors capture from the environment and what the actuators do in the environment).
2 Representation is a loaded term (cf. Millikan, 2002; Usher, 2001). In this letter, the "representation" of one variable by another means the probabilistic mapping or correspondence between the two variables.
In section 6, the representation of time inside the agent and by the agent-environment system as a whole is studied. A method to quantify the time dependence of representations is proposed in section 7. Since some of the representations appear to be factorizable (decomposable into independent components), in section 8 the representations of space (initial position) and time are factorized. Finally, overall discussion and conclusions are presented in section 9. In appendix A, the information-theoretic quantities used in the letter are briefly defined. Appendix B remarks on modeling an agent's controller using causal Bayesian networks. Appendices C and D explain how information flows are measured and how the controllers are evolved in this letter. Appendix E proves the assertion for the grid-world model used in this letter that when an agent follows the gradient and reaches the source, most of the information about the agent's initial position relative to the source passes through its sensors. Appendix F continues the discussion of time dependence of representation by considering two examples. Appendix G contains the details of the factorization method used in section 8 and discusses its relation to synergy and the parallel bottleneck case of the multivariate information bottleneck method.
2 Background

2.1 Related Work.

2.1.1 Perception-Action Loop and Information Theory. Gibson (1979) proposed that animals and humans do not normally view the world in terms of a geometrical space, an independent arrow of time, and Newtonian mechanics and that the natural description is rather in terms of what an agent can perceive and do. The interplay between sensors and actuators, the perception-action loop, is also central to the notion of homeostasis, introduced by Cannon (1939) and treated in more detail by Ashby (1952, 1956). The main assumption is that agents act so as to keep certain internal variables (essential variables) within desirable ranges. Considering that the essential variables are the result of perception of external or internal states, homeostasis creates a purely internal goal, where action is for the purpose of controlling perception (Powers, 1974). Homeokinesis (Der, Steinmetz, & Pasemann, 1999) is a dynamic version of homeostasis where the agent acts so as to minimize the prediction error of its past and future sensoric input. If the sensors can be seen as taking information in from the environment, the actuators can then also be thought of as having the capacity to modify the environment in terms of information. Interestingly, Gibson (1979) argued against applying information theory to the agent's perceptions, his main argument being that the environment does not intentionally "speak" to the observer and that "the assumption that information can be transmitted and
the assumption that it can be stored" are not appropriate for the theory of perception (p. 242). However, work in perceptual neural networks (see section 2.1.2), recent work in control (Touchette & Lloyd, 2004), and implicitly also Ashby's work (1956) show that information theory can be productively applied to perception-action without the need to assume any intentionality. Touchette and Lloyd (2000, 2004) provided bounds on the efficiency of any sensor for one-step control (entropy reduction). Their result for closed-loop control can be interpreted as a generalized statement of Ashby's law of requisite variety, which states that only variety (entropy) in actions can reduce the variety (entropy) in the environment.

2.1.2 Neural Networks and Information Theory. Attneave and Barlow in the mid-1950s recognized the importance for the brain not just of capturing information but also of representing it in a convenient way. Attneave (1954) observed that the sensoric input of living organisms is highly redundant. He hypothesized that sensory pathways take advantage of the redundancy and also abstract away unnecessary or overwhelming details to represent the remaining sensoric information in an economical way. Barlow (1959, 1963) proposed the principle of redundancy reduction, according to which sensory information in the brain is incrementally recoded to reduce the redundancy of its representation. Barlow (2001) recently relaxed the claim by noting that (1) further up the processing chain, redundancy of representations can actually increase rather than decrease, and (2) highly compressed representations may be harder to use than more redundant ones. The emphasis was thus shifted from redundancy reduction to redundancy exploitation. In a layered or hierarchical system, it is unclear what information each of the layers ought to preserve without designing knowledge of the ultimate goals of the system into each layer. This is addressed by Linsker's principle of maximum information preservation, or infomax (Linsker, 1988). The principle requires each layer to maximize information transmission from its inputs to its outputs, without bias as to what information to preserve and without the need for a layer to know anything about other layers. Infomax usually assumes spatial structure. For example, applied to a simulated retina-like layered feedforward neural network with localized receptive fields, the principle made the network self-organize and develop various feature detectors also found in the retinas of living beings. In his later work, Barlow extended what he meant by redundancy to include not only the unused capacity of a channel but also what he called "hidden redundancy"—the statistical interdependencies between components of output (Barlow, 1963, 2001). He proposed minimal entropy coding, which reduces "hidden" redundancy by decorrelating the components and leads to factorial codes where probabilities of individual components are independent of each other (Barlow, 1989). Nadal and Parga (1995) demonstrate that this particular type of Barlow's redundancy reduction and Linsker's
infomax are related and sometimes equivalent. A similar result is obtained in the context of source separation by Bell and Sejnowski (1995). A related method for arriving at factorial codes and feature detectors is the principle of predictability minimization by Schmidhuber (1992; Schmidhuber, Eldracher, & Foltin, 1996). The parallel information bottleneck principle proposed by Friedman, Mosenzon, Slonim, and Tishby (2001), a variety of the multivariate information bottleneck, can also create factorial codes, its advantage over the other methods above being the capability for lossy relevance-based compression. Ay (2002) noted that although infomax may explain the functioning of early sensory stages, at later stages information from various pathways is integrated. He proposed the hypothesis of invariant maximization of interaction, according to which "learning . . . is driven by mechanisms that maximize the stochastic interaction" between components of systems. The maximization of global stochastic interaction results in local infomax consistent with Linsker's idea. By incorporating the dynamical aspect, Ay and Wennekers (Ay & Wennekers, 2003; Wennekers & Ay, 2005) proposed the principle of temporal infomax, according to which the spatiotemporal stochastic interaction between various pathways or components is maximized. The approach is compositional, as it assumes that an information processing system consists of components (e.g., neurons) that retain their identity through time. The complexity of its information processing is quantified information-theoretically by comparing components at two adjacent time steps. Temporal infomax consists of the maximization of this complexity.

2.2 Relation of Previous Work to This Letter. The prior work already noted is related to this work as follows:

1. The minimalistic model of an agent employed in this letter (see section 3.1) is defined in terms of what the agent's controller can access: sensors, actuators, and memory, conforming to Gibson's philosophy.

2. In this letter, the treatment of the perception-action loop by Ashby and by Touchette and Lloyd is generalized by modeling an arbitrary number of time steps.

3. The conceptual difference between this work and Linsker's infomax is that internal spatial structure in the processing architecture (i.e., receptive fields and layers of neurons) is replaced by natural constraints coming only from the agent-environment interaction ("embodiment") extended over many time steps.

4. Although information flow is maximized through time in this letter, the perception-action loop infomax is conceptually different from the temporal infomax of Ay and Wennekers. First of all, compositionality, the existence of separate finely grained components, is central to their
quantification method, whereas the split into sensors, actuators, and memory in this letter is not crucial for determining information flow quantities. Secondly, the temporal aspect of temporal infomax spans only two adjacent time steps, whereas this letter deals with information flows spanning an arbitrary number of time steps. The formalism presented here can be used to quantify and maximize information flows occurring between different parts of a system unconstrained by any temporal restrictions (e.g., information flows occurring across components but not through time), whereas the temporal aspect is central to temporal infomax.

5. The factorization of the representations resulting from perception-action infomax (see section 8) is intimately related to but different from the creation of factorial codes. Here not only the output but also the mapping between the input and the output is required to be factorized.

3 Bayesian Model of the Perception-Action Loop

3.1 The Model. Central to the methodology is to model the perception-action loop of an agent as a causal Bayesian network. This can be a general and minimal model of an agent with a controller, possibly with memory. The following assumptions are made in the model:
- The agent is part of a larger agent-environment system.
- The system has discrete states.
- The system is discrete in time.
- Consecutive states of the system form a Markov chain. The momentary state of the system makes the past of the system statistically independent from its future.
- The agent's controller selects actions and modifies its own memory having access only to the momentary state of the sensors and its own memory.
There are many ways to partition such a system into subsystems. Here the partitioning is done from the perspective of the agent’s controller. The controller has direct access only to the agent’s sensor,3 the actuator, and its own memory. Everything else in the system that is not encompassed by these three components is the rest of the system.
3 Although the agent can have more than one sensor and actuator, from now on they will be referred to in the singular. Multiple sensors can of course be treated as one composite sensor. Similar reasoning applies to actuators.
All the constituents of the perception-action loop are modeled as random variables:
- S—the state of the sensor
- A—the action performed by the actuator
- M—the state of the controller's memory
- R—the state of the rest of the system
R formally accounts for the effects of actuation, the agent's environment, and morphology on the sensors. Sensors and actuators enable information flows between the agent and its environment. Time is introduced to model the temporal aspect of the flows. The states of the sensor, the actuator, the controller's memory, and the rest of the system at discrete time t are denoted by the random variables St, At, Mt, and Rt, respectively. Modeling relations between the variables at different time steps unrolls the perception-action loop in time. The approach is not new (Touchette & Lloyd, 2000). However, as opposed to Touchette and Lloyd (2000), where only one time step is modeled, an arbitrary number of time steps is modeled here. The relationships between the variables are modeled as a causal4 Bayesian network (Pearl, 2001), which is a directed acyclic graph where any node, given its parents, is conditionally independent from any other node that is not its parent or successor (any node directly or indirectly reachable from the node). The pattern of relations between variables at consecutive time steps is shown in Figure 1. Assume that the pattern of relations is time invariant and thus holds for any t. Thus, the graph in Figure 1 is just a section of the network. The agent's controller is modeled with minimum assumptions about its architecture: the controller is discrete state, discrete time, and obeys the laws of classical probabilities. In the most general case, the controller performs a probabilistic mapping (St, Mt) → (At, Mt+1).5 See appendix B for a more detailed discussion of the controller's model as a causal Bayesian network.
4 A Bayesian network captures statistical relationships between observed variables. More than one Bayesian network may fit observed data. In a causal Bayesian network, the relationships between variables are not just statistical but also causal, corresponding to the underlying mechanisms that generate the data. This allows one to calculate the effects of interventions—changes in some of the mechanisms (consult Pearl, 2001, for an in-depth treatment). Examples of interventions include fixing a variable to a particular value, or "injecting" information into the system (see below). The latter is used in this letter. In general, interventions can be defined only for causal Bayesian networks.
5 Throughout the letter, the probabilistic mapping from a random variable X to a random variable Y is denoted using the shorthand notation X → Y, which stands for the Markov kernel X × Y → [0, 1], (x, y) → Pr(Y = y|X = x).
Figure 1: Perception-action loop as a Bayesian network. S—state of the sensor; A—action performed by the actuator; M—state of the memory of the controller; R—state of the rest of the agent-environment system. The diagram can be read as follows: action At is picked given sensor state St and memory state Mt , sensor state St is obtained from the state of the rest of the agent-environment system Rt , and Rt+1 is obtained from Rt and At .
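As a concrete (and entirely illustrative) reading of this network, the following Python sketch samples one rollout of the loop; the state-space sizes and all conditional distributions are arbitrary placeholders, not taken from the letter:

```python
import numpy as np

rng = np.random.default_rng(0)
N_R, N_S, N_A, N_M = 8, 4, 3, 5   # sizes of R, S, A, M (arbitrary)

def random_kernel(shape):
    """Random conditional distribution over the last axis."""
    k = rng.random(shape)
    return k / k.sum(axis=-1, keepdims=True)

p_s_given_r = random_kernel((N_R, N_S))               # S_t from R_t
p_r_next = random_kernel((N_R, N_A, N_R))             # R_{t+1} from R_t, A_t
p_am_given_sm = random_kernel((N_S, N_M, N_A * N_M))  # controller (S, M) -> (A, M')

def step(r, m):
    s = rng.choice(N_S, p=p_s_given_r[r])
    am = rng.choice(N_A * N_M, p=p_am_given_sm[s, m])
    a, m_next = divmod(am, N_M)
    r_next = rng.choice(N_R, p=p_r_next[r, a])
    return r_next, m_next

r, m = int(rng.integers(N_R)), 0
for t in range(10):
    r, m = step(r, m)
print(r, m)
```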
The causal Bayesian network model of the perception-action loop is useful for several reasons. First, the relations between the components of the perception-action loop are visualized. Second, various paths of the information flow (see section 3.2) and constraints on its bandwidth are visualized. For example, it is clear that information flows between the memory M and the rest of the system R are mediated by the sensors and actuators; hence, the bandwidth of the sensors and actuators sets a limit on the rate of information flow between the agent's memory and the rest of the system. Third, the model, being a causal Bayesian network, makes it possible to analytically calculate various probability distributions as well as the results of interventions (see Pearl, 2001).

3.2 Information Flow. In the material world, the concept of matter flow is natural and widespread owing to the conservation laws and additivity of matter. For example, in nuclear medicine, radioisotope tracers are injected into a patient, and their flow through the organism is tracked. Once one tries to formulate an analogous concept of information flow, one encounters a number of differences, one being that classical information, unlike matter, is not conserved. Classical information can be duplicated or destroyed. Furthermore, information can be "encrypted," making it disappear unless one knows another "piece" of information that serves as the "decryption" key.6

As an illustration, consider an organization where confidential information is leaked to the outside and there are several suspects who have access to the information. To identify which one is leaking the information, one would provide each suspect with a different and deliberately made-up piece of information and see what information gets out. The presence of the
6. Here encryption is used in the broad sense of the word. The one-time pad encryption scheme (Schneier, 1995) is one example. Another example is the use of an address as a key to access information on a storage medium.
particular piece of injected information outside the organization will have to be solely due to the information flowing from the confidential source via one of the suspects.

Analogously, in this letter, information flow is defined for an arbitrary system modeled as a causal Bayesian network. Given a causal Bayesian network, the information flow from a random variable X to a random variable Y is defined as the amount of information about X causally transmitted from X to Y. Analogous to the information leakage example, if X is an input node of the network, then any information about X in a node Y is necessarily due to the causal effect (Pearl, 2001) of X on Y. In that case, the amount of information flow from X to Y is I(X; Y), the information-theoretic mutual information between X and Y.

Information flow is conceptually different from mutual information. Mutual information is a correlational and thus symmetric quantity: I(X; Y) = I(Y; X) for any X and Y. Information flow is a causal quantity and is therefore directional and asymmetric.7

It may be possible to quantify the information flow in the general case, where X is an arbitrary node of the network. One needs to distinguish between information shared between X and Y due to the causal effect of X on Y and information arriving from common predecessors. One could "block" unwanted paths of information flow to Y, leaving only those passing via X (see Klyubin, Polani, & Nehaniv, 2004b, section 8.2, for an example and Ay & Krakauer, 2007, for a more general treatment). A causal Bayesian network model helps identify suitable sets of blocking variables.

4 Information Capture Experiment

4.1 Motivation. Linsker's infomax principle for neural networks is concerned with the maximization of the information flow from the inputs of a perceptual layer to its outputs (Linsker, 1988). As a consequence of preserving information about inputs, layers of center-surround cells, orientation-sensitive cells, and orientation columns appeared. An extended and generalized version of the infomax experiment can be performed using the causal Bayesian model of the perception-action loop. Instead of maximizing the information flow across just one layer of neurons, one can maximize the information flow through the perception-action loop of an agent through an arbitrary number of time steps. Moreover, since the model is more general, one need not assume linearity or a particular information processing substrate such as locally connected layers of neurons.

Chemotaxis, phototaxis, or gradient following in general is a good scenario because information about the position of the gradient's point source must flow through the agent's perception-action loop. In the following
7. In fact, information flow is nonzero only in the direction of the causal future (descendants) of X.
experiment, an agent controller is evolved to capture the agent's initial position relative to the source of the gradient. The goal may also be interpreted as the agent reconstructing the position of the gradient source at the beginning of the experiment.

4.2 Outline of the Experiment. An agent moves in an infinite two-dimensional grid world.8 The position of the agent corresponds to R, the rest of the system, in the Bayesian model of the perception-action loop. The agent's controller has no direct access to the agent's position. The controller has access to a gradient sensor that provides only a small amount of information about the position. The agent's controller is evolved to maximize the information flow from the initial position R0 into M15, the state of the controller's memory at time step 15, under the condition that the controller is always started in the same initial state. Similar to the infomax experiment, the features of the initial position captured by the memory of the evolved controllers are analyzed. It turns out that most of the features have strong spatial continuity properties due to the way the agent perceives the world and acts in it. Therefore, small changes in the sensorimotor apparatus of the agent are made, and the impact on the captured features is studied.

4.3 Test Bed. The environment is a two-dimensional grid of infinite size. A source is located at the origin of the grid. The source emits a signal, the strength P of which in any cell of the grid is

P(r) = r^{-2}, P(0) = 2,

where r is the distance to the source. The exact relation is not important for the experiments. It is only important that the decrease is strictly monotonic with distance.

An agent is situated in a single cell at a time. The agent has a gradient sensor. The gradient points to the cell with the highest signal strength among the four adjacent cells (north, east, south, west). If there are several cells with the highest signal strength, the gradient randomly points to one of these with equal probability (see Figure 2). The agent also has an actuator: at each time step the agent performs one of the four available actions: move north, east, south, or west. The agent has a controller, which takes sensoric input and then moves the agent. The global state of the system consists of just the agent's position on the grid and the state of the agent's controller.
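The signal law and the tie-breaking gradient sensor can be sketched directly; the code below is a hypothetical rendering of the test bed just described, not the original implementation. Swapping the offset table for a diagonal one gives the alternative embodiment used later in section 5.3.

```python
import random

def signal(x, y):
    """Signal strength at cell (x, y): P(r) = r**-2 away from the source, P(0) = 2."""
    r2 = x * x + y * y
    return 2.0 if r2 == 0 else 1.0 / r2

# Cross neighborhood; a diagonal sensor would use, e.g., {"ne": (1, 1), ...}.
CROSS = {"n": (0, 1), "e": (1, 0), "s": (0, -1), "w": (-1, 0)}

def gradient_sensor(x, y, offsets=CROSS):
    """Direction of the adjacent cell with the highest signal strength,
    with ties broken uniformly at random (cf. Figure 2)."""
    strengths = {d: signal(x + dx, y + dy) for d, (dx, dy) in offsets.items()}
    best = max(strengths.values())
    return random.choice([d for d, v in strengths.items() if v == best])
```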
8. The grid is large enough that the agent never experiences the boundary conditions. Without loss of generality, the grid can be considered infinite; however, only a finite number of cells is ever reached.
Figure 2: Sensoric input S depending on the position R. In cells with only one arrow, the sensor input is constant. In cells with multiple arrows, it is chosen at random from among the arrows shown in the cell. The source is located at the center of the grid.
Following the perception-action loop model from section 3, denote the sensoric input (the gradient) with the random variable S taking values from the set S = {sN, sE, sS, sW}; the action with the random variable A taking values from the set A = {aN, aE, aS, aW}; the internal state or the memory of the agent's controller with the random variable M taking values from the set M = {1, 2, . . . , N}, where N is the number of states; and the two-dimensional position of the agent with the random variable R taking values from the set Z². S, A, M, and R completely describe the state of the whole system at any instant. At and Mt+1 depend directly only on (St, Mt) for any t (see Figure 1).

The agent's controller, regardless of its implementation, performs a probabilistic mapping (St, Mt) → (At, Mt+1). The mapping is assumed to be time invariant. The probabilistic mapping can be implemented by a nondeterministic Mealy finite state automaton operating with input set S, output set A, and state set M. Preliminary experiments showed that the evolved mappings tend to be deterministic; hence, the experiments are constrained to deterministic automata, allowing only deterministic mappings. However, without any loss of generality, the approach can be used with stochastic mappings. Determinism of the controller is an experimental choice, not a limitation of the model.

4.4 Initial Position Capture. For this experimental setup, it can be shown theoretically that when an agent follows the gradient and ends up near the source, most of the information about the agent's initial position passes through the gradient sensor (see appendix E for a proof). Can this flow be directed into the agent's memory to capture information about
Figure 3: Initial position capture as optimization of the communication channel between R0 and Mt . Only the mappings represented by wavy lines can be modified.
the initial position of the agent, using a controller with limited memory and given limited time? In finding a controller that maximizes the information flow9 from R0 into Mt, where t is the duration of the run (see Figure 3), the controller can be seen as constructing a temporally extended communication channel between these two temporally separated parts of the system, passing not only through the sensor but also via the memory, the actuator, and the environment.

Two types of controllers are evolved:10 (1) unconstrained controllers, which are not constrained in terms of the implemented mapping (see Figure 3a), and (2) gradient-following controllers (see Figure 3b), which are constrained to follow the gradient. Clearly, gradient-following controllers are a subset of unconstrained controllers where the information flow from M to A is blocked.

To evaluate a controller, the agent's initial position R0 is randomly and uniformly distributed over a square with side d centered at the source. At the beginning of the run, the controller is always in state 1. The agent moves for t time steps, after which its fitness I(R0; Mt) is evaluated. I(R0; Mt) is the amount of information about the agent's initial position captured by the state of the controller at time step t. It quantifies how well one
9. See appendix C for a brief description of how the flow is measured.
10. The details of the evolutionary algorithm employed can be found in appendix D.
Figure 4: Information flow for best controllers. The amount of information flow from R0 into Mt (ordinate) is plotted against the size of the controller’s memory (|M|, abscissa). Both quantities are measured in bits. Dotted line: gradient-following controllers; solid line: unconstrained; dashed line: upper bound dictated by the IB method for gradient-following controllers. Each point is the maximum of five evolutionary runs. d = 5, t = 5, 1000 generations.
can predict the state of the controller at time step t from the initial position of the agent or, alternatively, how well one can deduce the initial position of the agent knowing only the state of its controller at time step t.

Evolutionary search is not guaranteed to find optimal solutions. However, an upper bound for achievable performance is the amount of information about R0 in principle extractable from the cumulative sensoric input S^c_{t−1} = (S0, S1, . . . , S_{t−1}) into Mt. Since R0, S^c_{t−1}, and Mt form a data processing chain, it follows from the data processing inequality that no controller can extract more information into Mt about R0 than it obtains via S^c_{t−1}. However, since the cumulative sensoric input contains in the (temporal) limit all the information about the initial position, the main constraint derives from the agent's controller having access only to the momentary sensoric input, one input at a time, and having limited memory.

The bound can be estimated for gradient-following controllers. Since all gradient-following controllers by definition employ exactly the same actuation strategy, that is, follow the gradient, the probability distribution of the random variable S^c_{t−1} is exactly the same for all these controllers. To calculate the bound, the information bottleneck (IB) principle (Tishby, Pereira, & Bialek, 1999) can be used to find a mapping S^c_{t−1} → Mt that maximizes the mutual information I(R0; Mt). The latter is an upper bound for the amount of information about R0 any gradient-following controller can capture in Mt.

In Figure 4, the performance of the best-evolved controllers is plotted against the size of their memory in bits. The controllers manage to extract information that is temporally "smeared" in the sensoric input. Unconstrained
Figure 5: Best-evolved four-state unconstrained controller as a finite state automaton. Circles are states. Arrows are transitions between states. Every transition is annotated with the sensoric input from {n, e, s, w} that triggers it, followed by the performed action from {n, e, s, w}.
controllers clearly outperform gradient-following ones. However, as the performance of unconstrained controllers never exceeded the upper bound for the gradient-following controllers, nothing can be concluded about the nonoptimality of gradient-following controllers. The results suggest, though, that this upper bound obtained using the IB principle may also apply to unconstrained controllers.

5 Representation of Positional Information

5.1 Introduction. In section 4.4, controllers were evolved to maximize the amount of information they capture about the agent's initial position. What information about the initial position is captured? How do the controllers capture the information? How is it represented in the state of the controllers?

The evolved controllers are finite state automata. One such automaton is shown in Figure 5 as an example. It is possible to analyze the static structure of the automata, for example, through hierarchical decomposition (Dömösi & Nehaniv, 2005). However, since the controllers were evolved to maximize information flow, it is natural to analyze them from the perspective of Shannon information, demonstrating that information-theoretic tools can be used not only for evolving systems but also for analyzing them. The following sections compare and supplement the insights obtained using information-theoretic tools with prior knowledge about the analyzed system. A later goal, however, is to analyze systems by relying solely on the information-theoretic tools.
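As a data structure, such a deterministic controller is nothing more than a finite transition table. The following minimal sketch shows this representation; the entries are made up for illustration and are not the evolved automaton of Figure 5.

```python
# Deterministic Mealy controller: (memory state, sensor input) -> (action, next state).
TABLE = {
    (1, "n"): ("n", 2), (1, "e"): ("e", 1), (1, "s"): ("s", 2), (1, "w"): ("w", 1),
    (2, "n"): ("n", 1), (2, "e"): ("e", 2), (2, "s"): ("s", 1), (2, "w"): ("w", 2),
}

def controller(m, s):
    """The mapping (S_t, M_t) -> (A_t, M_{t+1}), here a degenerate (deterministic)
    special case of the probabilistic mapping of section 4.3."""
    a, m_next = TABLE[(m, s)]
    return a, m_next
```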
In this section in particular, the representation of the agent's initial position by the state of the controller at the end of the run is studied. The analysis extends the results of Klyubin et al. (2004a) by also studying how the representations change with small changes in the agent's embodiment (what the sensors capture about the environment and what the actuators do in the environment).

5.2 Representations as Mappings. Representations are treated as probabilistic mappings. To study the representation of the agent's initial position R0 in the state of its controller at the end of the run Mt, the mapping of Mt onto R0 is visualized. The evolutionary experiment is rerun with d = 11 and t = 15 to obtain spatially larger maps. Mappings implemented by some of the best-evolved controllers are shown in Figure 6.

The controllers perform lossy compression of the agent's initial position: some of the information about the agent's initial state R0 is contained in the controller's state M15 at the end of the run. Unconstrained controllers successfully retain nearly the maximum amount of information that fits in their memory. Gradient-following controllers retain less information. In some cases, factorizable codes are produced: it is possible to find projections of M that independently code for independent features of position, such as an odd and even cell checkerboard pattern or the left and right half (see Figure 6, gf++, |M| = 8). This suggests interesting parallels with the results obtained for information flow maximization in neural networks (Nadal & Parga, 1995; Bell & Sejnowski, 1995), where factorial codes are produced as a result of maximizing information transmission by a network of neurons. (Factorization is discussed in more detail in section 8.)

Maximizing the information flow from R0 to M15 tends to create maps that induce near-hard partitions over R0. This is not surprising, since information maximization between any two variables will tend to create hard partitions. However, any permutation of a map with respect to the 11 × 11 set of initial positions will preserve the mutual information between R0 and M15. Also, the form of the partitions is not important for capturing information. Theoretically, there is no preferred grouping of initial positions. The employed method of evolving the controller mappings is not biased in this respect either. One could thus expect that the form of the partitions of initial positions would be completely different between different controllers due to random factors of the evolutionary process.

However, it turns out that the evolved maps are not arbitrary. There are certain repeating motifs in the maps: checkerboard patterns for gradient-following controllers and spatially solid tiles for unconstrained controllers. These regularities arise because not all of the maps from R0 to M15 are possible. In short, the maps are constrained by and thus reflect the agent's embodiment and its information processing capabilities.
Figure 6: Representations of initial position (M15 → R0) used by best-evolved controllers with different memory sizes. Types: gf—gradient following, u—unconstrained, ++—cross sensor and cross actuator, +×—cross sensor and diagonal actuator, ××—diagonal sensor and diagonal actuator, ++ (××)—evolved for ++ but mappings obtained with ××. Each box shows a mapping from a controller's state at time step 15 onto the agent's initial position (11 × 11 cells centered at the source). The intensity of each cell represents the probability of the agent having started in that cell: black—highest, white—zero. For example, state 1 of the unconstrained controller (u++, |M| = 2) at time step 15 means that the agent started in the lower-right half of the world. State 2 means the opposite.
To illustrate where the constraints arise, it is useful to briefly study the map R0 → M15, the inverse of M15 → R0. The map is a product of identical time-invariant maps (Rt, Mt) → (Rt+1, Mt+1) (the time evolution operator) projected onto the Mt component. Already this fact is a constraint. The form of the time evolution operator is also constrained: the agent can move only into
adjacent cells, the sensor captures only certain features of the world, and, in the case of following the gradient, the agent's controller is constrained to follow the gradient strictly. The evolved maps R0 → M15, and also M15 → R0, are thus not arbitrary. They reflect the structure of the agent's sensorimotor apparatus and the correlations between the actuator and the sensor—in other words, the agent's embodiment. The maps also reflect constraints on the agent's information processing capabilities. This is illustrated by the qualitative differences between the maps of gradient-following (constrained) and unconstrained controllers (see Figure 6), as well as by the quantitative differences between maps created by controllers with different numbers of states.

5.3 Different Embodiment. In the previous experiment, the agent's body/embodiment was kept constant, and it was the controller that was allowed to change. To illustrate the idea that the evolved maps R0 → M15 reflect both the embodiment and the constraints on the controllers, an evolutionary experiment was performed with different sensors and actuators but with the same constraints on the controller.

The simplest change to sensors and actuators is to permute them. This can be seen as changing the meaning of the sensoric states and actions. For instance, the action "go west" can be changed to move the agent one cell north. Correspondingly, the former "go north" action can be changed to move the agent west. The evolutionary search is not biased toward a particular meaning of sensoric states and actions. It treats sensoric inputs and actions as two unordered sets. Similarly, the fitness function, information flow, being an information-theoretic quantity, is not influenced by the permutations either. As a result, the evolutionary search is completely immune to permutations of the meaning externally assigned to elements of the sensor and action sets.

What would happen if, instead of permuting sensoric states and actions, a new meaning were assigned to them in terms of what features of the position the sensors capture and how the actions change the position? In the previous experiment, the sensor was measuring the gradient by comparing the signal strength in four cells located north, east, south, and west of the agent. Due to this layout, the sensor is called a cross sensor. In addition to the cross sensor, a diagonal sensor, which compares signal strength in four diagonally adjacent cells (northeast, southeast, southwest, northwest), will be used in the following experiments. Analogous to the two types of sensors, the agent can have a cross (original) or a diagonal actuator. Two more evolutionary experiments were run optimizing the information flow from R0 to M15: one for the case of a cross sensor and diagonal actuator and one for the case of a diagonal sensor and diagonal actuator. The resulting maps implemented by the best-evolved controllers are visualized in Figure 6, types +× and ××, respectively.
The second case, diagonal sensor and diagonal actuator, is interesting because the experiment is almost isomorphic to the original one rotated by 45 degrees. The difference comes from the fact that the initial position of the agent is still distributed around the source in a square, exactly as in the original experiment. The best-evolved mappings M15 → R0 for the diagonal sensor and diagonal actuator case (see Figure 6, ××) are indeed quite similar to the best-evolved mappings of the original experiment (see Figure 6, ++), where cross sensor and cross actuator were used, rotated by 45 degrees.

Due to this observation, it was interesting to see how the controllers evolved for the original scenario (cross sensor, cross actuator) behaved in the agent with diagonal sensor and diagonal actuator. It turns out that for a small number of states (|M| ≤ 6), they produce the original maps rotated by 45 degrees (see Figure 6, ++ (××)) and actually capture slightly more information about R0 than in the original scenario. For larger numbers of states, the performance of gradient-following controllers drops compared to the original embodiment, and the maps no longer represent the original features in a clean way (see Figure 6, ++ (××), |M| = 8).

5.4 Discussion

5.4.1 Experiments. The mappings from the controller's state at the end of the run M15 to the agent's initial position R0 show what information about the agent's initial position gets captured by different controllers. Gradient-following and unconstrained controllers as a class capture different features of the initial position (see Figure 6). The main qualitative difference is that unconstrained controllers divide the initial position into spatially solid chunks: neighboring initial positions are generally coded by the same state of the controller at the end of the run. Representations produced by gradient-following controllers lack the solidity of the chunks; these representations tend to have stripe or checkerboard patterns. The differences in representations derive from the differences in information processing capabilities (unconstrained versus gradient-following) and embodiment.

The results may thus be viewed as having extended and generalized Linsker's work on the infomax principle (Linsker, 1988) (see section 4.1), but with fewer assumptions: this letter does not assume linearity, it does not assume a particular information processing substrate for the agent's controller, and it allows for more than just one time step for information processing. The unconstrained controllers were even allowed to perform actions based on past inputs. The results, however, are similar to Linsker's: without being told what is important about the agent's initial position, evolution found controllers that extract various features of the initial position as a result of maximizing information preservation. Despite the agent having no concept of geometry, most of the features, especially the ones created
by unconstrained controllers, have strong spatial continuity properties due to the way the agent perceives the world and acts in it.

5.4.2 Scalability. The main aim here was to find basic principles—hence the minimal bias and high inefficiency of the method. A natural question is how to scale the method up. More efficient methods of approximating information flow could be devised for particular environments or computational structures. A promising way of improving efficiency could be to factorize (see section 8) the information flows, thus creating small and manageable building blocks from which more complex hierarchical systems could be composed.

5.4.3 Biological Relevance. It may be argued that natural evolution is influenced by many factors and constraints that have no relation to information processing. However, assuming that the constraints take precedence, Linsker's infomax (Linsker, 1988) is a useful guide because more information is likely to be advantageous given the constraints: information may provide a selective advantage (Howard, 1966; Polani, Nehaniv, Martinetz, & Kim, 2006). Thus, if nothing is known (or assumed) about the system in advance, infomax is a maximally unbiased approach: transmit as much information as possible, since it is not known what information will be important in later stages. If, however, there is bias in terms of what information is important, then that information should be transmitted, and evolution will adapt accordingly by removing or reusing the unnecessary capacity.

The information capture experiment based on the perception-action loop model demonstrates how, under very limited constraints (limited sensors and actuators), the infomax principle results in self-organization of the information processing in an unstructured system. Additional constraints are expected not to reduce but rather to modify the effect (cf. unconstrained versus gradient-following controllers). The minimum-assumption perception-action loop model here, combined with infomax, provides an organizing principle for information processing. Infomax may, however, be more than an organizing principle: it may itself be evolutionarily important, resulting in trade-offs with other constraints. In fact, information may be a primary resource on par with energy: in the resting state, the human brain is responsible for around 20% of the total oxygen consumption of the body (Kandel, Schwartz, & Jessell, 1991), and the fly photoreceptors consume around 10% (Laughlin et al., 1998) of the total energy consumed. Whether information is a primary resource or whether it is superseded by other constraints, the above indicates that infomax-like principles may help in understanding the organization of information processing in evolved natural and also artificial systems.

6 Representation of Time

Assume that during the experiment the state of the controller was observed without knowing the time. Would the state of the controller contain
Figure 7: Relation of MT , T, and the Bayesian model of the perception-action loop. MT is a readout of the process (Mt )t=0,1,...,15 at an unknown time step T. Neither T nor MT has an effect on the dynamics of the loop.
information about the time step? It turns out that it would, provided the probability distribution of the controller's state depends on time.

Time is modeled as a random variable T with values t in T = {0, 1, . . . , 15}. A special case is a peaked distribution where p(t) = 1 for a single t and 0 for all others, resulting in the observed state of memory Mt. In general, however, the distribution of T may be arbitrary. Denote the readout of the process (Mt)t=0,1,...,15 observed at an unknown time step T by a random variable MT (see Figure 7) with values mT. Here p(mT) is given by

p(mT) = Pr(MT = mT) = Σ_{t∈T} Pr(T = t) Pr(MT = mT | T = t),    (6.1)

where the probability distribution of the state of the readout for a particular time step t is inherited from the memory variable at that time step:

Pr(MT = mT | T = t) = Pr(Mt = mT).    (6.2)

The relation between MT, T, and the process (Mt)t=0,1,...,15 follows trivially from equations 6.1 and 6.2:

Pr(MT = mT) = Σ_{t∈T} Pr(T = t) Pr(Mt = mT).    (6.3)
The amount of information captured by MT about the time step T is I(MT; T). Sensoric input can also be included, in which case the amount of information about time available to the controller is I((S, M)T; T) ≥ I(MT; T), where (S, M)T is defined analogously to MT. Informally, these quantities can be interpreted as the amount of information the controller has, on average, about the time step it is at.
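Equations 6.1 to 6.3 amount to mixing the per-step distributions of Mt under p(t). A minimal sketch (the input tables are hypothetical; the per-step distributions Pr(Mt = m) are assumed to be available, e.g., from Monte Carlo runs):

```python
from collections import defaultdict
from math import log2

def readout_distribution(p_t, p_m_given_t):
    """Equation 6.3: p(m_T) = sum over t of p(t) * Pr(M_t = m)."""
    p_m = defaultdict(float)
    for t, pt in p_t.items():
        for m, pm in p_m_given_t[t].items():
            p_m[m] += pt * pm
    return dict(p_m)

def information_about_time(p_t, p_m_given_t):
    """I(M_T; T) = H(M_T) - H(M_T | T), in bits."""
    p_m = readout_distribution(p_t, p_m_given_t)
    h_m = -sum(p * log2(p) for p in p_m.values() if p > 0)
    h_m_given_t = -sum(pt * pm * log2(pm)
                       for t, pt in p_t.items()
                       for pm in p_m_given_t[t].values() if pm > 0)
    return h_m - h_m_given_t
```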
Figure 8: p(t|(s, m)T)—representation of time by the best two- and nine-state controllers. Vertical axis: (s, m)T (2 × 4 states); horizontal axis: t ∈ {0, 1, . . . , 15}. The darker the cell, the higher the conditional probability p(t|(s, m)T). I = I((S, M)T; T), measured in bits.
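The maps shown in Figure 8 are the Bayes inversion of the per-step distributions under the assumed distribution of T; a brief sketch with hypothetical input tables:

```python
def time_posterior(p_t, p_sm_given_t):
    """p(t | (s, m)_T) by Bayes' rule from the per-step distributions p((s, m) | t)."""
    posterior = {}
    for t, pt in p_t.items():
        for sm, p in p_sm_given_t[t].items():
            posterior.setdefault(sm, {})[t] = pt * p
    for row in posterior.values():          # normalize each row over t
        z = sum(row.values())
        for t in row:
            row[t] /= z
    return posterior
```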
Assuming a uniformly distributed T, the controllers evolved in the initial position capture experiment capture from 0.1 to 0.84 bits of information about time in their sensors and memory. For comparison, the global state of the system (R, M)T captures between 0.93 and 2.00 bits of information about time. The small amounts of the former are not surprising considering that the controllers were not evolved for capturing information about time.11

To investigate how time is represented by the controllers, the mapping (S, M)T → T will now be studied.12 For every tuple of sensoric input and state of the controller, the mapping gives the probability of being at a particular time step. To illustrate the idea, Figure 8 shows the conditional probability distribution p(t|(s, m)T) for some of the controllers.

Different states (s, m)T capture different features of time. One type of state captures local features of time. In Figure 8b, rows 2, 3, 6, and 7 (starting

11. If the controllers were evolved to capture as much information about time as possible, they would evolve into clocks. The simplest solution is to shut out the environment and permute the controller's state.
12. Strictly speaking, the reconstructed time is not metric time but an ordinal sequence.
from the top) are such examples. These states differentiate between odd and even time steps. In Figure 8d, there are states that capture time modulo 2, 3, and 4. Another type of state captures global properties of time, such as being close to the beginning, the middle, or the end of the experiment. Rows 1 and 4 in Figure 8 are an example. There are also hybrid states, which capture both local and global properties of time. Row 1 and the middle rows in Figure 8 are such states. They locate the system in time in terms of being at a particular modulo-4 time step and also in terms of being at the beginning or the end of the experiment.

The controllers were evolved to capture information about the agent's initial position at the end of 15 time steps. Why do the controllers capture information about time? Is capturing information about time an evolutionarily neutral side effect, or is it crucial to capturing information about the agent's initial position? For example, it may be that the processing and integration of information by the controller depend on time, or that different methods of obtaining information about the agent's position are used at different time steps, or that the controller represents the information in a time-dependent way. The issue of time dependence of representation is addressed in the next section.
7 Time Dependence of Representation

Section 5 showed how M15, the state of the controller at the end of the experiment, represents R0, the agent's position at the beginning of the experiment. It turns out that the representation of the initial position in the controller's memory throughout the experiment is time dependent. For illustration purposes, the representation used by the best eight-state unconstrained controller is shown in Figure 9.

The degree of time dependence of the representation of the initial position R0 in the process (Mt)t=0,1,...,15 is quantified as the probabilistic dependence of the mapping MT → R0 on the time step T, where MT is the readout of the process at an unknown time T (see section 6 for the definitions of T and MT). For every observed state mT of the controller, the mapping MT → R0, characterized by the conditional probability distribution p(r0|mT), describes where and with what probability the agent could have started given that the controller was found in that state at an unknown time step. If, in addition to knowing mT, the time step t is known, the mapping

p(r0|mT, t) = Pr(R0 = r0 | MT = mT, T = t) = Pr(R0 = r0 | Mt = mT)    (7.1)
Figure 9: Time dependence of the representation of the initial position R0 by the process (Mt )t=0,1,...,15 of the best eight-state unconstrained controller. MT is the readout of the process at an unknown time step T. Vertical axis: eight states of the controller; horizontal axis: time step (0 to 15). Darker cells code for higher probability of having started there. If the observation time step T is uniformly distributed, then knowing whether the time step is odd or even greatly helps determine what the controller’s states mean in terms of initial position.
can be used to find out where the agent could have started. For example, in the case of the best-evolved eight-state unconstrained controller (see Figure 9), knowing whether the time step is odd or even helps to significantly reduce the uncertainty about the agent's initial position given the state of the controller.

The difference in how the two conditional distributions (time-less and time-dependent) describe the initial position can be used to quantify the dependence of the mapping MT → R0 on the time step T. Here the degree of time dependence of the representation is quantified as the Kullback-Leibler distance from the conditional distribution p(r0|mT, t) to the conditional distribution p(r0|mT):13

D_{p(mT,t)}( p(r0|mT, t) || p(r0|mT) ) = H(R0|MT) − H(R0|MT, T) = I(R0; T|MT).

13. The method for measuring the time dependence of a mapping between two random variables is quite general. In fact, equation 7.2 applies to the general case of having three arbitrary random variables X, Y, Z and measuring the dependency of the mapping X → Y on Z:

D_{p(x,z)}( p(y|x, z) || p(y|x) ) = I(Y; Z|X),
I(Y; Z|X) + I(X; Z) = I(X; Z|Y) + I(Y; Z).
The distance can be interpreted as the average increase in the information about the initial position R0 due to knowledge of the time step T when the controller's state MT is known. For example, for the eight-state controller shown in Figure 9, the time dependence is 0.57 bits, assuming a uniform distribution for T.

The time-dependence measure is nonnegative. It is zero if and only if knowing the time step does not improve the knowledge about the initial position, that is, when the state of the controller represents the initial position in a time-independent way. The higher the measure, the higher the time dependence of the representation.

The time-dependence measure is not symmetric with respect to R0 and MT. However, the dependence of MT → R0 on T and that of its inverse, R0 → MT, are related:

I(R0; T|MT) + I(MT; T) = I(MT; T|R0) + I(R0; T).    (7.2)

(Both sides equal I(R0, MT; T), so the identity is an instance of the chain rule for mutual information.) Since the initial position of the agent does not depend on time, the last term, I(R0; T), vanishes. The time dependence of the two representations and the amount of information captured by the controller about time, I(MT; T), are then related as

I(R0; T|MT) + I(MT; T) = I(MT; T|R0).    (7.3)
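Computationally, the time-dependence measure is the conditional mutual information of footnote 13 evaluated on the joint distribution p(r0, mT, t). A minimal sketch (the joint table is a hypothetical input, not the authors' code):

```python
from math import log2

def conditional_mutual_information(p_xyz):
    """I(X; Z | Y) in bits from a joint table {(x, y, z): probability}.
    With X = R0, Y = M_T, Z = T this is the time-dependence measure I(R0; T | M_T)."""
    def marginal(keep):
        out = {}
        for key, p in p_xyz.items():
            k = tuple(key[i] for i in keep)
            out[k] = out.get(k, 0.0) + p
        return out

    p_xy, p_yz, p_y = marginal((0, 1)), marginal((1, 2)), marginal((1,))
    return sum(p * log2(p * p_y[(y,)] / (p_xy[(x, y)] * p_yz[(y, z)]))
               for (x, y, z), p in p_xyz.items() if p > 0)
```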
See appendix F for further discussion of time dependence.

8 Factorization of Representations

8.1 Rationale. In section 5, it was mentioned that some of the representations of the agent's initial position by the state of the controller at the end of the run shown in Figure 6 are factorizable. In this letter, factorization of a representation means that the corresponding probabilistic mapping can be represented as a product of two or more component mappings such that they capture independent pieces of the information captured by the original mapping, creating an orthogonal coordinate system.

Another way of looking at the factorization of the representation of the initial position is to think of it as factorization of the initial position itself. However, only certain features of the initial position are factorized. The mapping between the initial position and the state of the controller at the end of the run defines what features are relevant. This is more in line with the relevance-based compression view adopted in the parallel bottleneck method (Friedman et al., 2001), discussed in more detail in appendix G.

8.2 Problem Statement. Assume two random variables X and Y with a given joint probability distribution p(x, y). Y represents X in some way. The goal is to factorize this representation, that is, to factorize Y with respect to X.
Figure 10: Factorization of Y with respect to X into Z1 and Z2 . X and Y are given. The task is to create two new variables, Z1 and Z2 , based on Y.
Y is to be split into two new random variables, Z1 and Z2 (see Figure 10), such that they meet the following criteria.14 First, the two new variables should capture as much information about X as possible. Hence, I(Z1, Z2; X) should be as large as possible. Second, the two variables should comprise an orthogonal coordinate system for X. Hence, the two variables should be as independent as possible (I(Z1; Z2) close to zero), and the conditional probability distribution p(z1, z2|x), describing the projection of X onto Z1 and Z2, should be closely approximated by the product of the partial projections p(z1|x) and p(z2|x). The distance between p(z1, z2|x) and the product p(z1|x) p(z2|x) can be measured as the Kullback-Leibler distance

D_{p(x)}( p(z1, z2|x) || p(z1|x) p(z2|x) ) = I(Z1; Z2|X).    (8.1)

This distance should be as close to zero as possible. Note that although the distance is between conditional distributions, it still depends on the distribution of X. Minimizing I(Z1; Z2|X) makes Z1 and Z2 as conditionally independent given X as possible. This avoids synergetic or XOR-like coding schemes where, taken separately, Z1 and Z2 contain less information about X than when they are combined.15 (See section G.3 for more details.)

To summarize the task: given a joint distribution p(x, y), find two conditional distributions p(z1|y) and p(z2|y) satisfying the following three criteria:

1. I(Z1, Z2; X) ≤ I(X; Y) is maximal.
2. I(Z1; Z2) ≥ 0 is minimal.
3. I(Z1; Z2|X) ≥ 0 is minimal (goal: p(z1, z2|x) = p(z1|x) p(z2|x)).

14. Factorizing into three or more variables places more constraints on the solution. Two variables are sufficient to start with. Later, each of the new variables can be factorized further if necessary.
15. Synergy colloquially refers to the case when the whole is more than the sum of its parts.
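Given candidate channels p(z1|y) and p(z2|y), all three criteria can be scored directly from p(x, y). The sketch below assumes, as in Figure 10, that Z1 and Z2 are generated from Y independently of each other; the input tables and function name are hypothetical:

```python
from math import log2

def factorization_scores(p_xy, p_z1_given_y, p_z2_given_y):
    """Return (I(Z1, Z2; X), I(Z1; Z2), I(Z1; Z2 | X)) in bits for a candidate
    factorization of Y with respect to X (criteria 1-3 above)."""
    joint = {}  # p(x, z1, z2) = sum over y of p(x, y) p(z1|y) p(z2|y)
    for (x, y), p in p_xy.items():
        for z1, q1 in p_z1_given_y[y].items():
            for z2, q2 in p_z2_given_y[y].items():
                joint[(x, z1, z2)] = joint.get((x, z1, z2), 0.0) + p * q1 * q2

    def marg(keep):
        out = {}
        for key, p in joint.items():
            k = tuple(key[i] for i in keep)
            out[k] = out.get(k, 0.0) + p
        return out

    p_x, p_z1, p_z2 = marg((0,)), marg((1,)), marg((2,))
    p_z1z2, p_xz1, p_xz2 = marg((1, 2)), marg((0, 1)), marg((0, 2))

    i_x_z12 = sum(p * log2(p / (p_x[(x,)] * p_z1z2[(z1, z2)]))
                  for (x, z1, z2), p in joint.items() if p > 0)
    i_z1_z2 = sum(p * log2(p / (p_z1[(z1,)] * p_z2[(z2,)]))
                  for (z1, z2), p in p_z1z2.items() if p > 0)
    i_z1_z2_x = sum(p * log2(p * p_x[(x,)] / (p_xz1[(x, z1)] * p_xz2[(x, z2)]))
                    for (x, z1, z2), p in joint.items() if p > 0)
    return i_x_z12, i_z1_z2, i_z1_z2_x
```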
Appendix G discusses the details of the factorization method used in this letter, as well as its relation to synergy and the parallel information bottleneck method.

8.3 Factorization of the Representation of the Initial Position. The mappings M15 → R0 created by the best controllers evolved for capturing information about the initial position of the agent (see section 5) are factorized here. In terms of the factorization task, R0 is X, and M15 is Y. Thus, the information about R0 available in M15 was factorized into two new variables, Z1 and Z2, by means of mappings M15 → Z1 and M15 → Z2. The original mappings, as well as the mappings from Z1, Z2, and (Z1, Z2) back to the initial position R0, are shown in Figure 11.

The factorization of the representation of initial position used by the best-evolved eight-state gradient-following controller (see Figure 11, |M| = 8) is a good example of the previous claims that these representations are factorizable. In the first factorization, the two new variables are constrained to be binary (|Z1| = |Z2| = 2). The first variable captures the initial position in terms of being left or right of the source. The second variable captures the position roughly in terms of being on either the white or the black cells of a checkerboard pattern. Note that there is a vertical discontinuity in the pattern in the middle. The factorization is not at the ideal point (d = (0.008, 0.000, 0.000)) (see appendix G) because 0.008 bit of the information about R0 captured by M is not captured by the new variables. To capture this residual information, one of the new variables was grown to have three states. The resulting best factorization is shown in the middle of Figure 11 (8 → 3 × 2). There, only 0.001 bit of the original information is lost. The representation of the initial position employed by the second variable is exactly the same as that used by the first variable in the previous case. However, the first variable now uses a proper checkerboard coordinate system, without the vertical fault in the middle.

The controllers were originally evolved to capture at the end of the run as much information about the agent's initial position as possible. Alternatively, this task may be seen as establishing a temporally extended information flow from R0 to M15. Factorization of the initial position through the state of the controller at the end of the run can be seen as splitting the information flow, albeit only at the end, into two or more independent components. Later, each component can be processed separately.

Such factorization has several advantages. First, the state space of the variables representing information is reduced due to compression. Second, information is split into independent parts or features. The state spaces of the features are normally smaller than the state space of the original variables. A compact representation may allow for more efficient processing, requiring less space to store and transmit. Third, the fact that the individual components are independent of each other allows for their parallel, independent processing. At a later stage, the results of the parallel processing can be again
Figure 11: Factorization of information about the initial position captured by the best-evolved two-, three-, four-, and eight-state controllers into pairs of binary (2 × 2) or ternary (3 × 3) variables. Mapping M15 → R0 is factorized into Z1 → R0 and Z2 → R0 , and then reconstructed as (Z1 , Z2 ) → R0 . Each box shows the mapping from a state (m or z) onto 11 × 11 cells of initial position. The darker the cell, the higher the probability.
combined. Moreover, if the individual components can be factorized further, they could serve as starting points for hierarchies.

8.4 Factorization of Representations of Time. In this section, the information about time available to the agent is factorized. To be more specific,
the information about T as seen through (S, M)T is factorized. T is assumed to be uniformly distributed over {0, 1, . . . , 15}. In terms of the original definition of the factorization task, T is X, and (S, M)T is Y. (S, M)T can be thought of as containing the information available to the agent's controller.

The amount of information about time contained in the internal state (S, M)T of the best-evolved gradient-following controllers is quite small. The best factorizations into two binary variables are shown in Figure 12a. Most of the resolution of the new variables is concentrated on the first half of the run. Because of the very low level of information about time in (S, M)T, the factorized variables contain even less information. Consequently, it is hard to interpret them.

Factorizations of the information about time in (S, M)T of the best-evolved unconstrained controllers produce features distinguishing mostly between odd and even time steps (see Figure 12b), although in the case of the best three-state controller, the first feature distinguishes between the beginning and the rest of the run.

8.5 Discussion. Why is factorization of the agent's initial position as seen through the state of the controller at the end of the run (see section 8.3) much more successful than factorization of time as seen through the agent's internal state and the sensor (see section 8.4)?

To quickly rehash the structure of the factorization problem: random variables X and Y are given, and two new random variables, Z1 and Z2, are individually mapped from Y so that they capture independent pieces of information about X. The factorization method used in this section can be thought of, approximately, as consisting of two stages. First, similar to the information bottleneck principle, it compresses the information contained in Y about X into the new variables Z1 and Z2, which have fewer states than Y, resulting in a compact representation of the information. Second, factorization splits this information into two independent variables.

The initial position of the agent as seen through the state of its controller at the end of the run was factorized first (see section 8.3). The controllers were evolved so that the state of the controller captured as much information about the initial position as possible. Thus, the state of the controller at the end of the run already is a compact, compressed representation of the initial position. Factorization simply re-represents the captured information in terms of two independent variables. In most cases, the results of the factorization algorithm were very close to the ideal point.

The factorization of the information about time as seen through the sensor-controller state tuple (see section 8.4) was much further away from the ideal point. In most cases, the algorithm failed to capture a significant share of the information about time that was to be factorized. The explanation may be quite simple. Although the amount of information about time captured by the agent's internal state or the system's global state is not high, the information cannot be squeezed into two binary variables. It is
"smeared" over a large number of states (|T| · |S| · |M|) and, being an average quantity, cannot be extracted without noticeable loss into two binary variables. This hypothesis was confirmed by relaxing the factorization constraints and simply trying to extract as much of the information as possible into just four states. The shortfall of the information extracted this way gives a lower bound on the distance from the ideal point that the factorization algorithm can achieve. Thus, the original factorization may have been less successful simply because the controllers were not evolved to capture information about time in a conveniently compressed way.

The factorization of the highly compressed information about the initial position offers the advantage of viewing the captured features in terms of two smaller independent features. The case of the best-evolved eight-state gradient-following controller in Figure 11 is a good example. There, factorization creates two natural variables: left versus right, and black versus white cell of the checkerboard.

When factorizing the relatively scarce and widely "smeared" information about time, the most useful aspect of factorization is that relevance-based compression is performed. This projects the information about time from a high number of states onto just a few, which are easier to interpret. For example, it becomes obvious that the two most common features of time captured are roughly (1) whether the time step is odd or even and (2) whether roughly half of the run has elapsed.

9 Conclusion

An information-theoretic approach to the analysis and construction of perception-action loops was presented. The approach enables measuring information flows between various parts of the loop through time, constructing perception-action loops with desired properties, and analyzing the resulting behavior at the level of information.

To show the approach in action, a two-dimensional grid world with an agent controlled by a finite state automaton and having access only to a gradient sensor was presented. The agent's sensor captures partial information about the agent's position. It was shown that it is possible to capture more of this information by integrating sensoric input over time and employing an evolved movement strategy.
Figure 12: Factorization of the internal state (S, M)T of best-evolved controllers with respect to time. Each section shows the pair of mappings Z1 → T and Z2 → T created from (S, M)T → T. Four boxes show the probabilistic mappings described by p(t|z1) and p(t|z2) from the states of Z1 and Z2 onto the 16 states of time t (left to right, from 0 to 15). Darker color codes for higher probability. Since the new variables Z are binary, p(t|Zn = 1) = 1 − p(t|Zn = 0) for all t.
Controllers with a limited amount of memory were evolved to capture as much information as possible about the agent's initial position under the constraints of the agent's embodiment. To use the memory efficiently, the controllers performed lossy compression, which resulted in a near-hard partitioning of the position with respect to the controller's state. From the information-theoretic perspective, these controllers were evolved to create a temporally extended communication channel of maximum bandwidth between two temporally separate parts of the system: the agent's position at the beginning of the run and the controller's state at the end of it. The channel is created through the interaction of the agent with the environment. Compression, partitioning, and the movement strategy employed by the controllers are all induced by maximization of information transfer (infomax) through the perception-action loop through time. This is related to results for neural networks, where information transfer maximization resulted in phenomena such as feature detectors (Linsker, 1988), factorial codes (Nadal & Parga, 1995; Bell & Sejnowski, 1995), and source separation (Bell & Sejnowski, 1995), but is more general, since it allows arbitrary temporally extended channels and removes assumptions about the internal structure of the underlying information processing.

In the second part of this letter, a way to use information-theoretic tools to analyze representations of space (initial position) and time by the best-evolved controllers was presented. The intention was twofold: to provide more insight into what information the controllers capture and to demonstrate the power of several information-theoretic tools.

The representations of the agent's initial position by the state of the controllers at the end of the experiment were studied. The representations reflect some properties of the agent's embodiment, the interaction with and constraints of the world, and the agent's sensorimotor apparatus. Slight changes to the agent's actuator produce similarly slight changes in the representations. In some cases, even the old representations are suitable for a slightly changed actuator.

A method for factorizing representations was demonstrated. It turned out that some of the representations factorize nicely into smaller independent features. Factorized features, being independent by definition, can serve as starting points for hierarchies if factorized further.

The amount of information captured by the agent about time was quantified by treating time as a random variable, and the representation of time in the agent's controller was studied. A method for measuring, information theoretically, the time dependence of representations was demonstrated, linking time dependence with the amount of information captured about time. Last but not least, the representations of time were factorized to understand what is captured. The most common features of time captured are (1) whether the time step is odd or even and (2) whether the first half of the run has elapsed.

The applications of information-theoretic tools to the perception-action loop are quite promising. They allow one to analyze information processing
even when one does not understand the underlying substrate. For example, there is no requirement to assume finely grained components (e.g., neurons) that perform the information processing. The model of computation is thus more abstract and accommodating than that of traditional infomax. The various information-theoretic tools presented here can provide a peek into systems that do information processing in a nonobvious way.

Viewing information as the currency or essence of computation in the perception-action loop provides a plausible mechanism whereby a few assumptions and constraints, such as embodiment with various bottlenecks coming from the limited capacity of sensory and motor channels, and limited memory, can give rise to structure in a self-organized way. This paves the way toward a principled and general understanding of the mechanisms guiding the evolution of sensors, actuators, and information processing substrates in nature, and it provides insights into the design of mechanisms for artificial evolution.

The main contributions of this letter are (1) building up an information-theoretic picture of the perception-action loop with a minimal set of assumptions, (2) maximizing information flow through the full perception-action loop of an agent through time, as opposed to the well-established maximization of the flow through layers in neural networks, and (3) introducing the use of information-theoretic tools to analyze the representations arising from the maximization of information flow through the perception-action loop.
Appendix A: Information Theory

In this section, the central information-theoretic notions used in this letter are briefly introduced. For a detailed introduction to information theory (Shannon, 1948), consult, for example, Cover and Thomas (1991).

A random variable can assume various values with various probabilities. In this letter, exclusively discrete random variables are considered. Denote random variables with uppercase letters (e.g., X), their sets of values with calligraphic letters (e.g., X), and their values with lowercase letters (e.g., x). Denote composite random variables by listing their elements inside parentheses (e.g., (X, Y)) with values (x, y) from the set X × Y. By abuse of notation, denote Pr(X = x), the probability that X assumes the value x, by p(x). Similarly, the joint probability of X and Y is denoted by p(x, y) and the conditional probability of X given Y by p(x|y).

The entropy of X, denoted by H(X), is defined as a measure of the uncertainty of the probability distribution of X:

H(X) := − Σ_{x∈X} p(x) log2 p(x).
Entropy, as well as all other information-theoretic measures used in this letter, is measured in bits. Note that all of the information-theoretic measures presented in this appendix are nonnegative.

The conditional entropy of X given Y, denoted H(X|Y), is defined as the uncertainty of X knowing Y, weighted by the probability of Y:

$$H(X|Y) := \sum_{y \in \mathcal{Y}} p(y)\, H(X|Y = y) = -\sum_{y \in \mathcal{Y}} p(y) \sum_{x \in \mathcal{X}} p(x|y) \log_2 p(x|y).$$
The mutual information between X and Y, denoted I(X; Y), is defined as the average reduction in the uncertainty of X given Y:

$$I(X; Y) := H(X) - H(X|Y).$$

The Kullback-Leibler distance (Kullback & Leibler, 1951) from a distribution p1(x) to a distribution p2(x), denoted D(p1 || p2), is a generalized measure of distance between the two distributions:

$$D(p_1 \,\|\, p_2) = \sum_{x \in \mathcal{X}} p_1(x) \log_2 \frac{p_1(x)}{p_2(x)}.$$
D(p1 || p2) = 0 if and only if p1 is identical to p2. The measure can be infinite if there exists an x for which p1(x) > 0 and p2(x) = 0. The Kullback-Leibler distance is not actually a distance in the mathematical sense, in particular since it is not symmetric with respect to p1 and p2.

The Kullback-Leibler distance from a conditional distribution p1(x|y) to a conditional distribution p2(x|y) is defined as the sum of the Kullback-Leibler distances from p1(x|Y = y) to p2(x|Y = y), weighted by the probability of Y:

$$D_{p(y)}(p_1 \,\|\, p_2) = \sum_{y \in \mathcal{Y}} p(y)\, D\bigl(p_1(x|Y = y) \,\|\, p_2(x|Y = y)\bigr) = \sum_{y \in \mathcal{Y}} p(y) \sum_{x \in \mathcal{X}} p_1(x|y) \log_2 \frac{p_1(x|y)}{p_2(x|y)}.$$
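As a concrete illustration of these definitions, the quantities above can be computed directly from distributions given as tables. The following is a minimal Python sketch (ours, not part of the original letter); all function names are illustrative.

```python
import numpy as np

def entropy(p):
    """H(X) = -sum_x p(x) log2 p(x); zero-probability terms contribute 0."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mutual_information(pxy):
    """I(X;Y) = H(X) - H(X|Y), computed from a joint table pxy[x, y]."""
    px = pxy.sum(axis=1)          # marginal p(x)
    py = pxy.sum(axis=0)          # marginal p(y)
    # H(X|Y) = sum_y p(y) H(X | Y = y)
    h_x_given_y = 0.0
    for y in range(pxy.shape[1]):
        if py[y] > 0:
            h_x_given_y += py[y] * entropy(pxy[:, y] / py[y])
    return entropy(px) - h_x_given_y

def kl_divergence(p1, p2):
    """D(p1 || p2) in bits; infinite if p1(x) > 0 where p2(x) = 0."""
    mask = p1 > 0
    if np.any(p2[mask] == 0):
        return np.inf
    return np.sum(p1[mask] * np.log2(p1[mask] / p2[mask]))

# Example: a noiseless binary channel carries I(X;Y) = H(X) = 1 bit.
pxy = np.array([[0.5, 0.0],
                [0.0, 0.5]])
print(mutual_information(pxy))   # 1.0
```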
Appendix B: Bayesian Network Model of the Agent's Controller

The agent's controller selects actions based on sensoric input and its own memory. The assumptions about the agent's controller are minimal: the controller is finite time and finite state. Given the sets of sensoric states S, actions A, and the memory states of the controller M, any such controller
Figure 13: Causal Bayesian network models of an agent’s controller. S—state of the sensor; A—action performed by the actuator; M—state of the memory of the controller. (a) General model. (b) Model with an extra causality assumption. The new state of memory is influenced by the choice of action.
can be viewed as performing a probabilistic mapping S × M → A × M. The interpretation is that the controller picks an action and the next state of its memory based on the current sensoric input and the current state of its memory. The causal Bayesian network model for the general case of any such probabilistic mapping is shown in Figure 13a.

To simplify the diagrams, in the main body of the letter a different model, shown in Figure 13b, was used. This model, although simpler to understand at first glance, contains an extra assumption: that the new state of memory can be influenced by the choice of action. In general, this need not be the case. For example, it is easy to envisage a controller in which it is the action that is influenced by the new state of memory. The general model, shown in Figure 13a, on the other hand, does not assume any particular causality between actions and the new state of the controller's memory.

Appendix C: Measuring Information Flows

The goal of the information capture experiment in section 4 is to evolve agent controllers that maximize the information flow from the agent's initial position to the memory of the controller at a later time. To calculate the amount of information flow, the joint probability distribution of the parts of the system that are the source and the destination of the flow is required. Here the relevant parts are the agent's initial position R0 and M15, the state of the memory of the agent's controller at time step 15.

Although the joint distribution could be obtained from the Bayesian network analytically, here Monte Carlo simulations are used to estimate the probability distributions. The system is started in an initial state drawn from the probability distribution of the initial state (see section 4.4 for an example) and then iterated for the required number of time steps, after which data of interest are gathered. The process is repeated until the estimates of the probability distributions of interest are stable enough.
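A minimal sketch of this estimation procedure (ours, not the authors' code). The world dynamics are left abstract behind the `sample_state`, `sense`, and `world_step` callables, and the controller follows the deterministic table form of appendix B; all names and the initial memory state are illustrative assumptions.

```python
import numpy as np
from collections import Counter

def estimate_info_flow(initial_positions, sample_state, world_step, sense,
                       controller, t_final=15, n_samples=256):
    """Monte Carlo estimate of I(R0; M_t): build the joint histogram of the
    initial position and the controller's memory at time step t_final."""
    joint = Counter()
    for r0 in initial_positions:
        for _ in range(n_samples):
            pos, m = sample_state(r0), 0        # initial world state and memory
            for _ in range(t_final):
                s = sense(pos)                  # sensoric input
                a, m = controller[(s, m)]       # deterministic (s, m) -> (a, m')
                pos = world_step(pos, a)        # environment update
            joint[(r0, m)] += 1
    total = sum(joint.values())
    p = {k: v / total for k, v in joint.items()}
    pr, pm = Counter(), Counter()
    for (r, m), q in p.items():
        pr[r] += q
        pm[m] += q
    # I(R0; M_t) = sum p(r, m) log2 [p(r, m) / (p(r) p(m))]
    return sum(q * np.log2(q / (pr[r] * pm[m])) for (r, m), q in p.items())
```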
To spot and avoid undersampling, for each of the possible initial states of the system, 256 or more samples are produced, depending on the quantities measured. The number of samples is then increased by a factor of at least 16 to check whether the quantities of interest remain stable.

Appendix D: Method of Evolution

In the information capture experiment in section 4, evolution is used as a search in the space of controllers. The aim is not to develop new evolutionary computation techniques but to search for controllers with the maximum information flow from the agent's initial position to its controller's memory at a fixed time step. To do that in a straightforward, transparent, and unbiased way, a minimal setup is employed. This is to emphasize that nothing in the general approach is specific to the particular model or the search methods employed.

The controller is what is evolved. The controller is described by a probabilistic time-invariant mapping (St, Mt) → (At, Mt+1) (see section 4.3). Evolving the controller means evolving the mapping for fixed sizes of state (memory), sensoric input, and action sets. The search is limited to deterministic mappings only: to any pair of sensoric input and state (st, mt) there corresponds exactly one action and new state (at, mt+1). A mutation is performed by picking a pair (st, mt) at random with uniform probability from the set S × M and setting its image to a pair (at, mt+1) picked at random with uniform probability from the set A × M. This type of mutation is unbiased in the sense that it is not designed for a particular fitness function. Mutation is the only evolutionary operator employed to generate variation.

Fitness is evaluated by letting the controller control the agent. The fitness function is the amount of information flow from the agent's initial position R0 to the memory of the controller Mt at a later time step. The population is initialized with five randomly generated controllers. After each generation, the five best controllers are kept in the population and each produces five offspring by mutation. Thus, the size of the population is between 5 and 30. To speed up the search, ideas from integer programming (Glover, 1986) are used: (1) random shakeup: the number of mutations applied to each offspring is uniformly distributed between 1 and 1 + (G mod 20), where G is the generation, and (2) a permanent tabu list: offspring controllers that have been evaluated before or are present in the population are discarded. Additionally, at least five separate evolutionary runs are performed to sample different solutions.
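A compact sketch of this search over deterministic controller tables (ours, under the assumptions stated above); `fitness` stands in for the Monte Carlo information-flow estimate, and all names are illustrative.

```python
import random

def random_controller(S, M, A):
    """A deterministic mapping (s, m) -> (a, m'): one table entry per (s, m) pair."""
    return {(s, m): (random.choice(A), random.choice(M)) for s in S for m in M}

def mutate(ctrl, S, M, A, n_mutations):
    """Re-map randomly chosen (s, m) pairs to random (a, m') pairs."""
    child = dict(ctrl)
    for _ in range(n_mutations):
        key = (random.choice(S), random.choice(M))
        child[key] = (random.choice(A), random.choice(M))
    return child

def evolve(fitness, S, M, A, generations=1000):
    population = [random_controller(S, M, A) for _ in range(5)]
    seen = set()                                   # permanent tabu list
    for g in range(1, generations + 1):
        parents = sorted(population, key=fitness, reverse=True)[:5]
        offspring = []
        for p in parents:
            for _ in range(5):
                # random shakeup: 1 to 1 + (G mod 20) mutations per offspring
                child = mutate(p, S, M, A, random.randint(1, 1 + g % 20))
                key = tuple(sorted(child.items()))
                if key not in seen:                # discard previously seen controllers
                    seen.add(key)
                    offspring.append(child)
        population = parents + offspring
    return max(population, key=fitness)
```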
Appendix E: Gradient Following as Initial Position Capture

This section proves the assertion in section 4.4 that when the agent follows the gradient, most of the information about the agent's initial position passes through its sensors.

Denote the agent's initial position with a random variable R0. The entropy of the initial position is H(R0). This is the maximum amount of information there is to know about the initial position. Assume that the agent follows the gradient for t time steps. Denote the agent's position after t time steps with a random variable Rt. Denote the sequence of sensoric inputs up to but excluding time step t with a random variable $S^c_{t-1} = (S_0, S_1, \ldots, S_{t-1})$. The goal is to prove that

$$I(S^c_{t-1}; R_0) = H(R_0) - \epsilon, \tag{E.1}$$

where $\epsilon$ is a small quantity.

Denote the sequence of actions the agent performed up to but excluding time step t with a random variable $A^c_{t-1} = (A_0, A_1, \ldots, A_{t-1})$. Using standard information-theoretic equalities, the amount of information the sequence of actions $A^c_{t-1}$ contains about the agent's initial position R0 can be expressed as

$$I(A^c_{t-1}; R_0) = H(R_0) - I(R_0; R_t \mid A^c_{t-1}) - H(R_0 \mid R_t, A^c_{t-1}). \tag{E.2}$$

All actions move the agent deterministically over the grid. Given a final position $r_t$, the sequence of actions performed, $a^c_{t-1}$, identifies the exact starting position $r_0$. This means that the initial position is a deterministic function of the final position and the sequence of actions:

$$H(R_0 \mid R_t, A^c_{t-1}) = 0. \tag{E.3}$$

Equation E.2 becomes

$$I(A^c_{t-1}; R_0) = H(R_0) - I(R_0; R_t \mid A^c_{t-1}). \tag{E.4}$$
When the agent follows the gradient, it does not end up exactly at the source. Because there is no "do nothing" action, once the agent is at the source, it cannot stay there and moves into one of the four adjacent cells at the next time step. One time step later, it comes back following the gradient. Thus, after a certain amount of time, the agent cycles through two states: (1) being in the center and moving in one of the four randomly chosen directions and (2) being in one of the four cells adjacent to the central cell and moving back to the central cell.

Due to this fact, given enough time, the tails of sequences of actions will consist of a repeating pattern: a random action followed by the action opposite to it, for example, $a^c_t = (\ldots, \uparrow, \downarrow, \rightarrow, \leftarrow, \uparrow)$. When the agent is at the central cell, it enters the loop of performing an action followed by its counteraction. When not in the central cell, the agent performs an action followed by its
counteraction only in one situation: when the agent is in a cell adjacent to the center. The following two sequences are examples of what the agent can perform if it starts in the cell adjacent from the left to the central cell: (→, ←, →, ←, →, ←) and (→, ←, →, ↑, ↓, ↑). The first sequence consists of the same repeating tuple of an action and its counteraction. In principle, the same sequence of actions could be performed by the agent starting from the central cell. The second sequence cannot, because the agent must follow the gradient consistently. Thus, only a tuple of an action and its counteraction followed by a tuple of a different action and its counteraction signifies that the agent has reached the center. Sequences of one and the same tuple of an action and its counteraction do not mean that the agent has reached the center; however, the probability of such sequences decreases exponentially with their length.

The above observation means that the higher t is, the more the sequence of actions determines where the agent finished and, hence, where it started. This means that $H(R_0 \mid A^c_{t-1})$ tends to zero. Since

$$I(R_0; R_t \mid A^c_{t-1}) = H(R_0 \mid A^c_{t-1}) - H(R_0 \mid A^c_{t-1}, R_t) \tag{E.5}$$
$$\leq H(R_0 \mid A^c_{t-1}), \tag{E.6}$$

it follows that

$$I(R_0; R_t \mid A^c_{t-1}) \to 0, \tag{E.7}$$

and from equation E.4,

$$I(A^c_{t-1}; R_0) \to H(R_0). \tag{E.8}$$

When the agent follows the gradient, its actions are completely determined by the sensoric input. The sequence of actions is in one-to-one correspondence with the sequence of sensoric inputs. Hence, the sequence of actions contains the same amount of information about the initial position as the sequence of sensoric inputs:

$$I(S^c_{t-1}; R_0) = I(A^c_{t-1}; R_0). \tag{E.9}$$

Thus,

$$I(S^c_{t-1}; R_0) \to H(R_0) \tag{E.10}$$

with increasing t.
Appendix F: Examples of Time Dependence

F.1 Clock Example. It is possible to imagine a case where the mapping $M_T \to R_0$ is time independent (the first term in equation 7.3 is 0), whereas the reverse mapping $R_0 \to M_T$ is time dependent. Assume a controller with two substates: $M = (M^{\mathrm{pos}}, M^{\mathrm{clock}})$. Substate $M^{\mathrm{pos}}_T$ permanently encodes some information about the agent's initial position ($m^{\mathrm{pos}}$ is constant in time), and substate $M^{\mathrm{clock}}_T$ captures some information about time, for example, counts modulo some number.

The combined state $(M^{\mathrm{pos}}, M^{\mathrm{clock}})$ of the controller will be changing with time because $M^{\mathrm{clock}}$ does. Nevertheless, the mapping $(M^{\mathrm{pos}}, M^{\mathrm{clock}})_T \to R_0$ will be time independent, with all pairs $(m^{\mathrm{pos}}_T, *)$ mapping to the same subset of initial positions. The reverse mapping $R_0 \to (M^{\mathrm{pos}}, M^{\mathrm{clock}})_T$ will be time dependent because the substate $M^{\mathrm{clock}}$ changes with time. Knowing only the initial position predicts only $M^{\mathrm{pos}}_T$. Without knowledge of time, $M^{\mathrm{clock}}_T$ remains completely random. In this scenario, the time dependence of $R_0 \to (M^{\mathrm{pos}}, M^{\mathrm{clock}})_T$ is exactly the amount of information about time T contained in $M^{\mathrm{clock}}_T$:

$$I\bigl((M^{\mathrm{pos}}, M^{\mathrm{clock}})_T;\, T \mid R_0\bigr) = I\bigl((M^{\mathrm{clock}})_T;\, T\bigr). \tag{F.1}$$
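Equation F.1 can be checked numerically on a toy version of the clock example. The sketch below (ours, with illustrative distributions: R0 and T uniform and independent, $M^{\mathrm{pos}}$ copying R0, $M^{\mathrm{clock}}$ counting time modulo 2) computes both sides of the equality.

```python
import numpy as np
from itertools import product

def mi(p, xs, ys):
    """Mutual information I(X;Y) from a dict of joint probabilities p[(x, y)]."""
    px = {x: sum(p.get((x, y), 0) for y in ys) for x in xs}
    py = {y: sum(p.get((x, y), 0) for x in xs) for y in ys}
    return sum(v * np.log2(v / (px[x] * py[y]))
               for (x, y), v in p.items() if v > 0)

R0_vals, T_vals = [0, 1], [0, 1, 2, 3]
joint = {}
for r0, t in product(R0_vals, T_vals):
    m = (r0, t % 2)                       # (M^pos, M^clock)
    joint[(r0, t, m)] = 1 / (len(R0_vals) * len(T_vals))

# Left side: I(M_T; T | R0) = sum_r0 p(r0) I(M_T; T | R0 = r0)
m_vals = sorted({k[2] for k in joint})
lhs = 0.0
for r0 in R0_vals:
    cond = {(t, m): 0.0 for t in T_vals for m in m_vals}
    for (r, t, m), v in joint.items():
        if r == r0:
            cond[(t, m)] += v
    p_r0 = sum(cond.values())
    lhs += p_r0 * mi({k: v / p_r0 for k, v in cond.items()}, T_vals, m_vals)

# Right side: I(M^clock_T; T), the clock alone carries the same bit about time.
clock_joint = {}
for (r, t, m), v in joint.items():
    clock_joint[(t, m[1])] = clock_joint.get((t, m[1]), 0) + v
print(lhs, mi(clock_joint, T_vals, [0, 1]))   # both 1.0
```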
F.2 XOR Example. At the other extreme of the time-dependence scale lies the case where both mappings, $M_T \to R_0$ and $R_0 \to M_T$, are time dependent, though the state of the controller does not contain any information about time. Consider any two-state controller that is started in state 1 if the agent is above the signal source, and in state 2 otherwise. At each time step, regardless of the sensoric input, the controller permutes its state: 1 → 2, 2 → 1. In this case, the state of the controller contains no information about time: $I(M_T; T) = 0$. For instance, the fact that the state is 1 means that the agent started above the source provided the time step is odd, and that the agent started elsewhere provided the time step is even. Thus, without knowledge of the time step, the state of the controller tells nothing about the initial position of the agent.

The coarsely grained initial position (above the source or not), the time step T modulo 2, and the state of the controller M are in an XOR relationship. Knowing only one of the variables does not provide information about either of the other two. Knowing the states of any two variables provides full information about the third. The mapping $M_T \to R_0$ maps every state $m_T$ uniformly onto the initial positions $\mathbb{Z}_2$. However, with the knowledge of time, the mapping becomes more deterministic, mapping state 1 onto one half of $\mathbb{Z}_2$ and state 2 onto the other. Similarly, the reverse mapping $R_0 \to M_T$ is made completely deterministic by knowing whether the current time step is odd or even. Agreeing with
equation 7.3, the time dependence of both mappings is identical, since $I(M_T; T) = 0$:

$$I(R_0; T \mid M_T) = I(M_T; T \mid R_0) = 1. \tag{F.2}$$
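The XOR relationship of equation F.2 is easy to verify by enumeration. A minimal sketch (ours): four equally likely worlds over the coarse start position, the time parity, and the state (relabeled 0/1), with the state being the XOR of the other two.

```python
import numpy as np
from itertools import product

def H(counts):
    """Entropy in bits of an empirical distribution given as a list of counts."""
    p = np.array(counts, dtype=float)
    p = p[p > 0] / p.sum()
    return -np.sum(p * np.log2(p))

# All four equally likely worlds: coarse start r0, time parity t, state m = r0 XOR t.
worlds = [(r0, t, r0 ^ t) for r0, t in product([0, 1], repeat=2)]

def I(a, b, ws):
    """I(A;B) over uniform worlds, via I = H(A) + H(B) - H(A, B)."""
    def h(f):
        vals = [f(w) for w in ws]
        return H([vals.count(v) for v in set(vals)])
    return h(a) + h(b) - h(lambda w: (a(w), b(w)))

def I_cond(a, b, c, ws):
    """I(A;B|C) as the average of I(A;B) within each slice C = c."""
    cs = {c(w) for w in ws}
    return sum(len([w for w in ws if c(w) == cv]) / len(ws) *
               I(a, b, [w for w in ws if c(w) == cv]) for cv in cs)

r0 = lambda w: w[0]; t = lambda w: w[1]; m = lambda w: w[2]
print(I(m, t, worlds))            # 0.0: the state alone says nothing about time
print(I_cond(m, t, r0, worlds))   # 1.0: given R0, the state reveals time parity
print(I_cond(r0, t, m, worlds))   # 1.0: matches equation F.2
```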
F.3 Factoring Time Out. In the light of the two examples above, it is possible to give an interpretation to equation 7.3, linking the time dependence of the mappings between R0 and MT and the amount of information about time contained in MT. The state of the controller MT can be thought of as containing two pieces of information about time. One is available without reference to any other variable; its amount is quantified by I(MT; T). The other is "encrypted" through R0: knowledge of R0 provides I(R0; T|MT) more bits of information about time. The total amount of information about time extractable from MT with the help of R0 is then I(MT; T|R0).

One could attempt to factor MT into two variables: one containing information about time, the other containing information about the rest of MT. In the clock example above, this type of factorization is easy; the new variables correspond to $M^{\mathrm{clock}}_T$ and $M^{\mathrm{pos}}_T$ by definition. In the XOR example, though, it is not possible to split MT, due to the fact that all information about time in MT is "encrypted" through R0 and is thus not "freely" available without the knowledge of R0.

The term I(MT; T) in equation 7.3 is the upper bound on the amount of information about time that can be factored out of MT. Factoring out or removing information about time from MT reduces the time dependence of the R0 → MT mapping by exactly the amount of information about time removed. However, this still does not guarantee that the mapping will become time independent. At best, the time dependence of R0 → MT can be lowered to the time dependence of its inverse, MT → R0. The time dependence of the latter cannot be influenced by removing information about time from MT.

Appendix G: Factorization: Details

G.1 Ideal Solutions. An ideal situation is one in which (1) all of the information about X contained in Y is captured by (Z1, Z2), (2) Z1 and Z2 are independent, and (3) the mapping X → (Z1, Z2) is the product of the mappings X → Z1 and X → Z2. The ideal point in terms of the left-hand-side values of the three criteria above is thus (I(X; Y), 0, 0).

It should be noted that in case Z1 or Z2 is large enough to accommodate all the states of Y, it is possible to create a trivial factorization, for example, one where Y → Z1 is information preserving and Y → Z2 preserves no information about Y. This trivial factorization meets the ideal-point criterion but is not interesting. A similarly trivial factorization is known for numbers: any number can be factorized into the product of itself and 1. To avoid trivial factorizations, both |Z1| and |Z2| were made smaller than |Y|.
G.2 The Method. Each solution is a tuple of conditional distributions p(z1|y) and p(z2|y). Comparing different solutions requires trading off between the three criteria above. As there is no known precise method for finding an optimal factorization, the following method was used.

When the number of states for the two new random variables Z1 and Z2 is given, the goodness or fitness of a tuple is defined in terms of how well it satisfies the three criteria above. The ideal situation is where all of the information about X contained in Y is captured by Z1 and Z2, and the left-hand side of the other two criteria is zero. The fitness is defined as the Manhattan distance in the criteria space from the ideal point,

$$d = \bigl(c - I(Z_1, Z_2; X)\bigr) + I(Z_1; Z_2) + I(Z_1; Z_2 \mid X), \tag{G.1}$$

where $c = \min\bigl(I(X; Y),\, \log_2 |\mathcal{Z}_1||\mathcal{Z}_2|\bigr)$ is the maximum amount of information that Z1 and Z2 could in principle capture about X. c is a constant for all tuples with the same sizes of Z1 and Z2. The goal is to find a tuple with minimal distance from the ideal point.

Here hill climbing is employed in the space of tuples of mappings (p(z1|y), p(z2|y)) to search for the best tuple. The search is restricted to deterministic mappings. The best solution is chosen from five independent runs of the hill-climbing search. Each search run is started with a randomly chosen tuple and run for 50,000 steps. At each step, the tuple is modified 1 + (t mod 20) times, where t is the step. A modification is performed by randomly choosing one of the two mappings and then deterministically mapping a randomly chosen state y to a randomly chosen state z1 or z2, depending on which of the two mappings is modified. If after the modifications the new tuple is closer to the ideal point than the old one, the new tuple is used as the base for the next step; otherwise, the new tuple is discarded, and the old one is used at the next step.

G.3 Relation to Synergy. Synergy colloquially refers to the case when the whole is more than the sum of its parts. In neural networks, synergy refers to an ensemble of neurons providing more information about a stimulus than the sum of the information provided by the neurons individually (Brenner, Strong, Koberle, Bialek, & de Ruyter van Steveninck, 2000; Schneidman et al., 2003). Synergy can also be applied to the factorization problem presented in section 8. The total amount of information about X contained in Z = (Z1, Z2) can be expressed as

$$I(Z; X) = I(Z_1; X) + I(Z_2; X) + I(Z_1; Z_2 \mid X) - I(Z_1; Z_2).$$

I(Z1; X) and I(Z2; X) are the amounts of information about X contained in Z1 and Z2, respectively. The extra term I(Z1; Z2|X) − I(Z1; Z2), called synergy
(Brenner et al., 2000; Schneidman et al., 2003),

$$\mathrm{Syn}(Z_1, Z_2) = I(Z_1, Z_2; X) - I(Z_1; X) - I(Z_2; X) = I(Z_1; Z_2 \mid X) - I(Z_1; Z_2),$$

can be seen as the amount of information about X that is "jointly" contained in Z1 and Z2 in an "encrypted" form and can be extracted only when both variables are known. Synergy can be negative; when it is, information about X is contained in Z1 and Z2 in a redundant way.

An extreme example of a synergetic representation of information is the binary XOR case, where $x = z_1 \oplus z_2$ and Z is uniformly distributed. Then $I(Z_1; X) = I(Z_2; X) = I(Z_1; Z_2) = 0$ bits, whereas $I(Z; X) = I(Z_1; Z_2 \mid X) = \mathrm{Syn}(Z_1, Z_2) = 1$ bit. Knowing either Z1 or Z2 alone tells nothing about X, whereas knowing both tells everything.

Conditions 2 and 3 of the factorization task minimize I(Z1; Z2) and I(Z1; Z2|X), respectively. When these two terms reach zero, synergy also becomes zero. In this case, the resulting representation of X by Z1 and Z2 is neither synergetic nor redundant. This is exactly what is desired, since Z1 and Z2 should be independent components or orthogonal coordinates of X.

G.4 Relation to Parallel Information Bottleneck Method. The factorization task is similar in spirit to the parallel bottleneck case of the multivariate information bottleneck (Friedman et al., 2001). In this section, the similarities and differences between the two tasks are pointed out briefly. To facilitate easier comparison, this letter's notation for the parallel bottleneck problem is used instead of the original notation from Friedman et al. (2001). The random variable Z = (Z1, Z2) is introduced to save space.

The parallel bottleneck problem is stated as follows. Given a joint distribution p(x, y), find two conditional distributions p(z1|y) and p(z2|y) such that the two random variables Z1 and Z2 compress Y in "parallel," predicting X as much as possible. The Bayesian network of relations between the random variables X, Y, Z1, and Z2 is identical to the Bayesian network of the factorization problem in Figure 10. The quality of a solution is quantified by the following Lagrangian, which should be minimized in the space of tuples (p(z1|y), p(z2|y)) to find a good solution:

$$L^{(1)} = -\beta\, I(Z; X) + I(Z_1; Z_2) + I(Z; Y), \tag{G.2}$$

where β is a trade-off parameter that specifies the relative importance of preserving information about X.

The parallel bottleneck problem is very similar to the factorization problem described in this article. The relations between the random variables are identical, and the goal is almost identical too. Here the aim is to
factorize the information about X captured by Y into new variables Z1 and Z2. In the parallel bottleneck, this requirement is less explicit. Both methods express the goodness of a solution (a tuple of mappings p(z1|y) and p(z2|y)) in terms of a single measure. Here it is the Manhattan distance from the ideal point (see equation G.1); in the parallel bottleneck case, it is the Lagrangian (see equation G.2). Without the constant term c, the distance specified in equation G.1 is

$$d = -I(Z; X) + I(Z_1; Z_2) + I(Z_1; Z_2 \mid X). \tag{G.3}$$
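For concreteness, the distance of equation G.3 can be evaluated directly for a candidate tuple of deterministic mappings. A sketch (ours, not from the letter): `f1` and `f2` are arrays giving z1 and z2 for each state y, and `mi` is a small mutual information helper like the one sketched for appendix A.

```python
import numpy as np

def mi(pxy):
    """I(X;Y) in bits from a joint table pxy[x, y]."""
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])))

def distance(pxy, f1, f2, nz1, nz2):
    """Equation G.3 for deterministic mappings f1, f2 : y -> z1, z2."""
    nx, ny = pxy.shape
    # joint p(x, (z1, z2)) induced by mapping y through (f1, f2)
    pxz = np.zeros((nx, nz1 * nz2))
    for y in range(ny):
        pxz[:, f1[y] * nz2 + f2[y]] += pxy[:, y]
    # I(Z1; Z2) from the marginal joint of (z1, z2)
    pz = pxz.sum(axis=0).reshape(nz1, nz2)
    # I(Z1; Z2 | X) as the p(x)-weighted average of within-slice informations
    i_z1_z2_given_x = sum(
        pxz[x].sum() * mi(pxz[x].reshape(nz1, nz2) / pxz[x].sum())
        for x in range(nx) if pxz[x].sum() > 0)
    return -mi(pxz) + mi(pz) + i_z1_z2_given_x
```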
X, Y, and Z form a Markov chain X → Y → Z (see Figure 10), and therefore I(Z; Y) = I(Z; X) + I(Z; Y|X). Hence, the Lagrangian from equation G.2 can be rewritten as

$$L^{(1)} = -(\beta - 1)\, I(Z; X) + I(Z_1; Z_2) + I(Z; Y \mid X). \tag{G.4}$$

Now two differences between the factorization and the parallel bottleneck are apparent. First, the factorization method in this article can be seen as using a constant parallel bottleneck trade-off β = 2. Second, the last term in the two equations is different. The latter difference needs closer attention.

I(Z1; Z2|X) = 0 requires that Z1 and Z2 be conditionally independent given X. This avoids synergetic, XOR-like coding schemes in which Z1 and Z2 individually cannot be used to obtain information about X and only their combination is helpful. I(Z; Y|X) = 0 requires that Z and Y share no information other than information about X; in other words, it requires that Z1 and Z2 contain information only about X. The two requirements are different, but they are related. Keeping in mind that Z = (Z1, Z2) and that Z1 and Z2 are obtained from Y using independent mappings,

$$I(Z; Y \mid X) = I(Z_1; Y \mid X) + I(Z_2; Y \mid X) - I(Z_1; Z_2 \mid X). \tag{G.5}$$
Minimizing only I(Z1; Z2|X) allows Z1 and Z2 to contain "irrelevant" information, that is, information about Y that is not information about X. The parallel bottleneck, on the other hand, attempts to forbid such irrelevant information. Interestingly, I(Z; Y|X) ≥ I(Z1; Z2|X).16 As a result of the nonnegativity of mutual information, if I(Z; Y|X) = 0, then I(Z1; Z2|X) = 0. Hence, a

16 From equation G.5, $I(Z; Y|X) = I(Z_1; Y|X) + I(Z_2; Y|X) - I(Z_1; Z_2|X)$. Since $I(Z_1; Z_2|X) \le I(Z_1; Y|X)$ and $I(Z_1; Z_2|X) \le I(Z_2; Y|X)$,

$$I(Z; Y \mid X) \ge I(Z_1; Z_2 \mid X) + I(Z_1; Z_2 \mid X) - I(Z_1; Z_2 \mid X) = I(Z_1; Z_2 \mid X).$$
minimum of the parallel bottleneck Lagrangian in the case of I(Z; Y|X) = 0 is also an optimal solution for the factorization method in this article.

An important advantage of the multivariate bottleneck method, of which the parallel bottleneck is just one instance, is that it provides an iterative optimization algorithm for finding a good solution. However, the algorithm is designed only for certain types of Lagrangians; the distance d (see equation G.3) used in this article seems to be unsuitable for it.

Acknowledgments

We thank the Condor Team from the University of Wisconsin, whose high-throughput computing system Condor provided a convenient way to run large numbers of simulations on ordinary workstations.
References

Ashby, W. R. (1952). Design for a brain. New York: Wiley.
Ashby, W. R. (1956). An introduction to cybernetics. London: Chapman & Hall.
Attneave, F. (1954). Some informational aspects of visual perception. Psychological Review, 61(3), 183–193.
Avraham, H., Chechik, G., & Ruppin, E. (2003). Are there representations in embodied evolved agents? Taking measures. In W. Banzhaf, T. Christaller, P. Dittrich, J. T. Kim, & J. Ziegler (Eds.), Proceedings of the 7th European Conference on Artificial Life (ECAL) (pp. 743–752). Berlin: Springer-Verlag.
Ay, N. (2002). Locality of global stochastic interaction in directed acyclic networks. Neural Computation, 14(12), 2959–2980.
Ay, N., & Krakauer, D. C. (2007). Geometric robustness theory and biological networks. Theory in Biosciences, 125(2), 93–121.
Ay, N., & Wennekers, T. (2003). Dynamical properties of strongly interacting Markov chains. Neural Networks, 16(10), 1483–1497.
Barlow, H. B. (1959). Possible principles underlying the transformations of sensory messages. In W. A. Rosenblith (Ed.), Sensory communication: Contributions to the Symposium on Principles of Sensory Communication (pp. 217–234). Cambridge, MA: MIT Press.
Barlow, H. B. (1963). The coding of sensory messages. In W. H. Thorpe & O. L. Zangwill (Eds.), Current problems in animal behaviour (pp. 331–360). Cambridge: Cambridge University Press.
Barlow, H. B. (1989). Unsupervised learning. Neural Computation, 1(3), 295–311.
Barlow, H. B. (2001). Redundancy reduction revisited. Network: Computation in Neural Systems, 12(3), 241–253.
Bell, A. J., & Sejnowski, T. J. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7(6), 1129–1159.
Bialek, W. (1987). Physical limits to sensation and perception. Ann. Rev. Biophys. Biophys. Chem., 16, 455–478.
Bouman, M. A. (1959). History and present status of quantum theory in vision. In W. A. Rosenblith (Ed.), Sensory communication: Contributions to the Symposium on Principles of Sensory Communication (pp. 377–401). Cambridge, MA: MIT Press.
Brenner, N., Strong, S. P., Koberle, R., Bialek, W., & de Ruyter van Steveninck, R. R. (2000). Synergy in a neural code. Neural Computation, 12(7), 1531–1552.
Cannon, W. B. (1939). The wisdom of the body. New York: Norton.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Der, R., Steinmetz, U., & Pasemann, F. (1999). Homeokinesis—a new principle to back up evolution with learning. In M. Mohammadian (Ed.), Computational intelligence for modelling, control, and automation (pp. 43–47). Amsterdam: IOS Press.
Dömösi, P., & Nehaniv, C. L. (2005). Algebraic theory of finite automata networks: An introduction. Philadelphia: Society for Industrial and Applied Mathematics.
Friedman, N., Mosenzon, O., Slonim, N., & Tishby, N. (2001). Multivariate information bottleneck. In Uncertainty in Artificial Intelligence: Proceedings of the Seventeenth Conference (UAI-2001) (pp. 152–161). San Francisco: Morgan Kaufmann.
Gibson, J. J. (1979). The ecological approach to visual perception. Boston: Houghton Mifflin.
Glover, F. (1986). Future paths for integer programming and links to artificial intelligence. Computers and Operations Research, 13(5), 533–549.
Howard, R. A. (1966). Information value theory. IEEE Transactions on Systems Science and Cybernetics, SSC-2, 22–26.
Hutchins, E. (1995). How a cockpit remembers its speeds. Cognitive Science, 19(3), 265–288.
Kandel, E. R., Schwartz, J. H., & Jessell, T. M. (1991). Principles of neural science (3rd ed.). New York: McGraw-Hill.
Kirsh, D., & Maglio, P. (1994). On distinguishing epistemic from pragmatic action. Cognitive Science, 18(4), 513–549.
Klyubin, A. S., Polani, D., & Nehaniv, C. L. (2004a). Organization of the information flow in the perception-action loop of evolved agents. In R. S. Zebulum, D. Gwaltney, G. Hornby, D. Keymeulen, J. Lohn, & A. Stoica (Eds.), Proceedings of 2004 NASA/DoD Conference on Evolvable Hardware (pp. 177–180). Los Alamitos, CA: IEEE Computer Society.
Klyubin, A. S., Polani, D., & Nehaniv, C. L. (2004b). Organization of the information flow in the perception-action loop of evolved agents (Tech. Rep. 400). Hertfordshire: Department of Computer Science, University of Hertfordshire.
Klyubin, A. S., Polani, D., & Nehaniv, C. L. (2004c). Tracking information flow through the environment: Simple cases of stigmergy. In J. Pollack, M. Bedau, P. Husbands, T. Ikegami, & R. A. Watson (Eds.), Artificial Life IX: Proceedings of the Ninth International Conference on the Simulation and Synthesis of Living Systems (pp. 563–568). Cambridge, MA: MIT Press.
Kullback, S., & Leibler, R. A. (1951). On information and sufficiency. Annals of Mathematical Statistics, 22(1), 79–86.
Laughlin, S. B., de Ruyter van Steveninck, R. R., & Anderson, J. C. (1998). The metabolic cost of neural information. Nature Neuroscience, 1(1), 36–41.
Linsker, R. (1988). Self-organization in a perceptual network. IEEE Computer, 21(3), 105–117.
Millikan, R. G. (2002). Varieties of meaning. Cambridge, MA: MIT Press.
Nadal, J.-P., & Parga, N. (1995). Information transmission by networks of nonlinear neurons. In Proc. of the Third Workshop on Neural Networks: From Biology to High Energy Physics (1994, Elba, Italy). International Journal of Neural Systems (Suppl.), 153–157.
Nolfi, S., & Marocco, D. (2002). Active perception: A sensorimotor account of object categorization. In B. Hallam, D. Floreano, J. Hallam, G. Hayes, & J.-A. Meyer (Eds.), From Animals to Animats 7: Proceedings of the VII International Conference on Simulation of Adaptive Behavior (pp. 266–271). Cambridge, MA: MIT Press.
Pearl, J. (2001). Causality: Models, reasoning, and inference. Cambridge: Cambridge University Press.
Polani, D., Nehaniv, C. L., Martinetz, T., & Kim, J. T. (2006). Relevant information in optimized persistence vs. progeny strategies. In L. M. Rocha, M. Bedau, D. Floreano, R. Goldstone, A. Vespignani, & L. Yaeger (Eds.), Proc. Artificial Life X (pp. 337–343). Cambridge, MA: MIT Press.
Powers, W. T. (1974). Behavior: The control of perception. London: Wildwood House.
Schmidhuber, J. (1992). Learning factorial codes by predictability minimization. Neural Computation, 4(6), 863–879.
Schmidhuber, J., Eldracher, M., & Foltin, B. (1996). Semilinear predictability minimization produces well-known feature detectors. Neural Computation, 8(4), 773–786.
Schneidman, E., Bialek, W., & Berry II, M. J. (2003). Synergy, redundancy, and independence in population codes. Journal of Neuroscience, 23(37), 11539–11553.
Schneier, B. (1995). Applied cryptography: Protocols, algorithms, and source code in C (2nd ed.). New York: Wiley.
Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27, 379–423.
Tishby, N., Pereira, F. C., & Bialek, W. (1999). The information bottleneck method. In Proceedings of the 37th Annual Allerton Conference on Communication, Control, and Computing (pp. 368–377). Urbana-Champaign: Department of Electrical and Computer Engineering, University of Illinois.
Touchette, H., & Lloyd, S. (2000). Information-theoretic limits of control. Phys. Rev. Lett., 84, 1156.
Touchette, H., & Lloyd, S. (2004). Information-theoretic approach to the study of control systems. Physica A, 331(1–2), 140–172.
Usher, M. (2001). A statistical referential theory of content: Using information theory to account for misrepresentation. Mind and Language, 16(3), 311–334.
Wennekers, T., & Ay, N. (2005). Finite state automata resulting from temporal information maximization. Neural Computation, 17(10), 2258–2290.
Received November 21, 2005; accepted October 16, 2006.
LETTER
Communicated by George Gerstein
Computational Properties of Networks of Synchronous Groups of Spiking Neurons

Judith E. Dayhoff
[email protected]
Complexity Research Solutions, Silver Spring, MD 20906, U.S.A.
We demonstrate a model in which synchronously firing ensembles of neurons are networked to produce computational results. Each ensemble is a group of biological integrate-and-fire spiking neurons, with probabilistic interconnections between groups. An analogy is drawn in which each individual processing unit of an artificial neural network corresponds to a neuronal group in a biological model. The activation value of a unit in the artificial neural network corresponds to the fraction of active neurons, synchronously firing, in a biological neuronal group. Weights of the artificial neural network correspond to the product of the interconnection density between groups, the group size of the presynaptic group, and the postsynaptic potential heights in the synchronous group model. All three of these parameters can modulate connection strengths between neuronal groups in the synchronous group models. We give an example of nonlinear classification (XOR) and a function approximation example in which the capability of the artificial neural network can be captured by a neural network model with biological integrate-and-fire neurons configured as a network of synchronously firing ensembles of such neurons. We point out that the general function approximation capability proven for feedforward artificial neural networks appears to be approximated by networks of neuronal groups that fire in synchrony, where the groups comprise integrate-and-fire neurons. We discuss the advantages of this type of model for biological systems, its possible learning mechanisms, and the associated timing relationships.

1 Introduction

Artificial neural networks have long been shown to have powerful computational capabilities, including general function approximation, prediction, and pattern classification (Haykin, 1998). They are built from networks of interconnected processing units, sometimes called neurons or nodes. An activation value is associated with each node, and a weight value is associated with each interconnection. The activation level is obtained by calculating an incoming sum of arriving signals and applying a squashing function, usually a sigmoid. There are many different training paradigms, designed

Neural Computation 19, 2433–2467 (2007)
C 2007 Massachusetts Institute of Technology
to find suitable values for the weights. Since biological neurons do not have features such as continuous-valued activation functions and numeric weights, it is an open question as to how these artificial neural network models could possibly correspond to the activity of biological neurons. In this letter, we show a correspondence between the activity of artificial neural networks and networks of synchronously firing groups of biological integrate-and-fire neurons. We choose the classical feedforward artificial neural network for this demonstration and alter the artificial neural network slightly by disallowing negative activation values on its nodes, translating and rescaling the squashing function, and allowing for a sigmoid-like shape rather than a true sigmoid. With these alterations, the calculations performed by the artificial neural network can be directly translated and performed by a network of neuronal groups with corresponding structure, where each neuronal group has integrate-and-fire (IAF) mechanisms and displays synchronous firing among its members.

Previous work has shown analogies between artificial neural networks and networks with spiking neurons. Although some have even shown arbitrary function approximation in networks of spiking neurons, their constructs usually utilize properties of IAF neurons that are different from synchronous groups networked to each other (Ianella & Back, 2001; De Kamps & Van der Velde, 2001; Maass, 1997a–1997c, 1999, 2002, 2003; Sanger, 1998; Cotter & Conwell, 1993; Cotter & Mian, 1992). Maass has considered the firing of populations of neurons in which neuronal groups are networked to each other, but did not develop the theme of networked synchronies as far as we do here. Other research has shown neuronal groups, called synfire chains, that transmit signals down a chain of synchronously firing groups. Proposed by Abeles, these models form the foundation from which this letter is developed (Abeles, 1991; Aviel, Pavlov, Abeles, & Horn, 2002; Diesmann, Gewaltig, & Aertsen, 1999). Synfire chains can increase their activity level as activation propagates down the chain and increase the temporal precision of their synchrony. However, these models have not previously been configured into branching networks of neuronal groups that tend to fire in synchrony, as we do here, but rather have been limited to linearly connected groups.

Extensive work has been done to characterize the computational capabilities of artificial neural networks, including the capability of general function approximation as delineated by Hornik, Stinchcombe, and White (1989) and investigated by others (Cybenko, 1989; Funahashi, 1989; Castro, Mantas, & Benitez, 2000; Ito, 1991; Kalouptsides, 1997; Chen & Chen, 1993, 1995a, 1995b; Williamson & Helinke, 1995; Kurkova, 1992; Koutroumbas, 2003; Gibson & Cowan, 1990; Makhoul, El-Jaroudi, & Schwartz, 1991; Shonkwiler, 1993; Sandberg, 1994, 1999, 2001; Sandberg et al., 2001). General function approximation is a powerful capability because function approximation is used to solve problems in system modeling, prediction, control,
diagnostics, and pattern classification. Thus, applications for general function approximation span a wide spectrum of domains, including finance, medicine, and engineering. If an analogy to the feedforward artificial neural network can be constructed with a biologically realistic model, then these powerful computational capabilities can be inferred as a possible outcome for a realistic biological model. Biological uses of this computational capability could potentially be extensive. In our model, groups of IAF neurons arranged in networks can be constructed so that they mimic the computational mechanisms of artificial neural networks and, hence, appear to inherit a computational capability for general function approximation.

This letter is organized as follows. In section 2, we discuss the concept of neural coding and how it applies to synchronous groups. Section 3 covers the basic structure of our model, consisting of networks of neuronal groups that fire in synchrony. In section 4, we discuss the feedforward artificial neural network model. Section 5 covers models of both types, performing the exclusive-or (XOR) task and calculating a function. Section 6 shows a mapping between artificial neural networks and synchronous group networks, and section 7 discusses the trade-offs and implications of biological systems utilizing networks of synchronous groups.

2 Neural Coding

Neural coding refers to schemes by which the nervous system encodes information when signals are carried. These signals may be carried from one neuron to another, from one group of neurons to another group, from one part of the brain to another part of the brain, or between the brain and the body. Neural coding also refers to schemes for information representation used during neural computational processes. These processes include pattern recognition, feature binding, generation of motor patterns, and other higher-order processes associated with brain function and consciousness. Neural coding schemes are typically analyzed by researchers to attach meaning to the spiking signals observed between neurons.

A variety of neural coding schemes have been proposed in the literature (Perkel & Bullock, 1968). The most prominently considered code is the temporally averaged firing rate. Many neurophysiology experiments have been performed in which the spiking activity of a neuron was recorded and the average firing rate correlated with a stimulus presented to the animal. Firing-rate codes have also been studied in neural models.

When the precise timing of spikes is considered significant, the scheme is called a temporal code or temporal pattern code. A temporal code could arise from the spike timing of a single neuron or a group of neurons; thus, temporal codes include single-unit codes and multiunit codes. An individual firing pattern, in a temporal code, would have significance in terms of signal representation and transmission. Each occurrence of a spike pattern could be either a precise duplicate of each other occurrence or
involve statistical variations as well as extra or missing spikes (Dayhoff & Gerstein, 1983a, 1983b; Abeles & Gerstein, 1988). Interval codes have also been considered, in which interspike intervals are assigned specific meanings. Interspike intervals have been studied biologically through interspike histograms. Neural network models utilizing spike intervals and delays have also been built (Ianella & Back, 2001; Maass, 2003).

Population coding refers to a scheme by which information is encoded in the activity of a population of neurons firing within a temporal window. In a population code, the spikes issued by an entire population or group of neurons are the feature considered to be encoding information. These spikes are usually summed or time-averaged within a time interval, or window. The details of the spiking patterns within the time window are not important; only the total number of spikes per window would be considered.

Synchrony coding is proposed here to be a distinct concept from the population code or the temporal code, although it could be considered to be included in either group. In synchrony coding, an ensemble of neurons fires in near-synchrony one or more times. Relevant parameters are the size of the synchrony and the temporal fidelity of the synchrony. A synchronous group that sends axons to another neuron or neuronal group could trigger a rapid postsynaptic potential response and a firing of the target neuron or neuronal group.

Here we propose to delineate the synchrony code as being distinct from any of the coding schemes mentioned above because it could be an important operating principle of real nervous systems. A synchrony could occur at any time and does not have to occur within a prespecified time window; the prespecified time window would be needed only in the population code, to take a population average. The group of neurons that fires in near-synchrony would not have to be exactly the same each time that a synchrony occurs. A synchronous firing, for example, could occur from a subset of a larger group of neurons. Each instance of firing could be a different, possibly overlapping subset of the same larger group.

Synchronous firing of two or more neurons has been observed in many experimental preparations utilizing established methods for detection (Lindsey, Morris, Shannon, & Gerstein, 1997; Maldonado & Gerstein, 1996; Lindsey, Hernandez, Morris, Shannon, & Gerstein, 1992; Salinas & Sejnowski, 2001; Steinmetz et al., 2000; Gray & Viana Di Prisco, 1997; Fries, Roelfsema, Engel, Konig, & Singer, 1997; Baker, Spinks, Jackson, & Lemon, 2001; Aertsen, Gerstein, Habib, & Palm, 1989; Gerstein, Perkel, & Dayhoff, 1985; Gerstein & Aertsen, 1985; Dayhoff, 1994; Grün, Diesmann, & Aertsen, 2002; Gerstein & Perkel, 1972).

There is no a priori reason that there could not be neurons that are members of two different groups of neurons, each group producing synchronous firings among its members at different times. A synchronous firing from the first group would have a different meaning from a synchronous firing from the second group. Different information would thus be represented
and signaled for synchronies of the two groups of neurons, even though they have overlapping members. The results in this letter show that the use of synchrony codes can lead to unique operating properties in models that network synchronously firing groups to one another, with powerful computational capabilities. Although this letter deals with models that have nonoverlapping groups, models with overlapping synchronous groups are expected to behave similarly if the amount of overlap is limited.

3 Model with Synchronous Ensembles

In this section, we describe a model in which synchronously firing ensembles of neurons are networked together. All neurons have integrate-and-fire mechanisms, with a constant threshold. The excitatory postsynaptic potentials (EPSPs) and inhibitory postsynaptic potentials (IPSPs) are assumed to have shapes similar to experimental observations. The effects of arriving spikes on the postsynaptic potential are summed at each successive time step during the simulation. When threshold is reached, the target neuron fires, and the potential is reset to zero. Neuronal membrane potential is allowed to decay after partial depolarization, but in our simulations, the initial response to a group of near-simultaneous incoming pulses constitutes the measured output, after which leakage occurs and the response dies down.

Figure 1a shows a neural network model with two neuronal groups, in which one group sends axons to the other. The groups are disjoint, sharing no neurons, and connections are chosen probabilistically. Each neuron in group j has a probability pji of receiving a connection from each neuron in group i. In the current model, we allow at most one connection between a given pair of neurons, but the results provided in this letter could be extended to allow more than one connection.

An external stimulus must be given to the model to initiate a flow of synchronous activations. In general, one or more groups are selected to receive external input. If a group i receives an external stimulus, then the value a i is the fraction of its neurons that fire synchronously during the stimulus. This fraction is chosen randomly. If a simulation is rerun, the same total number of neurons in the stimulated group fires, but the identity of the neurons can be different.

We simulate a scenario in which the first group, group i, receives external input. Figure 1b shows the result of simulating the network shown in Figure 1a. At time step 1, a fraction a i of the neurons in group i fires in synchrony. At time step 2, some or all neurons in group j fire. Data collected during the simulations consisted of the fraction firing of group i (at time step 1) and the fraction firing of group j (at time step 2). The group sizes, ni and n j, were set to 500. The value of e ji, the height of the postsynaptic potential from a single spike received by a neuron in group j from a neuron
Figure 1: (a) A model with two neuronal groups, in which neurons in group i send axons to synapse onto neurons in group j. (b) Results from simulating the network shown in (a). Fraction firing in group j is graphed as a function of the fraction firing in group i. There were 500 simulated integrate-and-fire neurons in each group. The probability of there being a connection from a randomly chosen neuron in group i to a randomly chosen neuron in group j was 0.4, and the PSP height was 0.0106. Threshold was set at 1.0. A single simulation run was performed for each activation level from 0 to 100%. There were 167 points per trace.
in group i, was 0.0106. T, the threshold for firing, was set to 1.0 in our simulations, although the results here can be generalized to other values. We set the probability for the existence of each pairwise connection to group j from group i as pji = 0.4. A single run was performed for each fraction active level, a i , between 0 and 1.0. Figure 1b displays the fraction firing in group j as a function of the fraction firing in group i. There is a basic sigmoid-like shape to the curve. A zero firing in group i causes no firing in group j, and a 100% firing in group i causes a 100% firing in group j. The graph passes approximately through a center point (0.5,0.5), although the curve would not necessarily intersect this center point if different parameters were used in the simulation. The sigmoid-like shape has a slightly different curvature at the top right compared to the bottom left. Figure 2 shows what happens to the transfer measurement in Figure 1b when the simulation parameters are varied. In Figure 2a, the height e ji of the PSP is varied. There are eight traces in the figure, with the PSP height moving from higher values to lower values from left to right. The PSP heights on individual traces were 0.020, 0.018, 0.016, 0.014, 0.012, 0.010, 0.008, and 0.006. Five hundred neurons were in each group, and the probability pji of each interneuronal connection to group j from group i was 0.4. Note that the slope of the center of the trace appears to increase as one moves from the right trace to the left trace. Thus, for higher PSP heights,
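A minimal Python sketch of this two-group experiment (ours, not the author's code). Parameter names follow the text; the PSP time course is simplified to a single-step pulse, so the curve's exact shape and midpoint will differ slightly from Figure 1b.

```python
import numpy as np

rng = np.random.default_rng(0)

def fraction_firing(a_i, n_i=500, n_j=500, p_ji=0.4, e_ji=0.0106, T=1.0):
    """Fraction of group j crossing threshold when a fraction a_i of group i
    fires in synchrony. Connections exist independently with probability
    p_ji, and each arriving spike adds a PSP of height e_ji."""
    n_active = int(round(a_i * n_i))
    # For each neuron in group j, count connections from the active neurons.
    incoming = rng.binomial(n_active, p_ji, size=n_j)
    potentials = incoming * e_ji          # summed EPSPs from the synchronous volley
    return np.mean(potentials >= T)

# One run per activation level, as in Figure 1b.
for a_i in np.linspace(0, 1, 11):
    print(f"{a_i:.1f} -> {fraction_firing(a_i):.3f}")
```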
Figure 2: Results from varying parameters on the groups and intergroup connections for the model shown in Figure 1a. Fraction firing in group j is plotted as a function of fraction firing in group i. (a) Results from varying PSP height. From left to right, the PSP heights on individual traces were: 0.020, 0.018, 0.016, 0.014, 0.012, 0.010, 0.008, and 0.006. Five hundred neurons were in each group. The probability pji of each interneuronal connection to group j from group i was 0.4, and the threshold for all neurons was set at 1.0. There were 167 points per trace. (b) For each neuron in group j, the probability pji of receiving a connection from a particular neuron in group i was varied. From left to right, traces were made at pji = 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, and 0.2. Five hundred neurons were in each group, the PSP height was set at 0.10, and threshold was set at 1.0. There were 167 points per trace. (c) Results from varying group size, simulated with the size of both groups the same. From left to right, the values for group size were 1000, 750, 500, and 250. The probability pji of each interneuronal connection from group i to group j was 0.03, PSP height was 0.02, and threshold was 1.0. There were 101 points per trace.
the slope of the sigmoid-like curve is increased. The curve on the left, with the highest slope, shows a gain of 90% in output activity when the fraction firing of the input group goes from 0.20 to 0.30. This means that the neuronal circuit could be considered to act like a switch, turning on or off in response to a small signal, or change, in the input activity level.
The lower values of e ji yield curves that have lower slopes and become spaced farther apart as the value of e ji decreases. Concomitantly, for the higher values of e ji , the curves are placed closer together for the same amount of increase in e ji . Thus, not only does the circuit perform like a switch at higher values of e ji , but the precision of the switch is increased as e ji increases. Figure 2b shows the result of varying the parameter pji , the probability that for each neuron in group j and each neuron in group i, there exists an interneuronal connection between them. From left to right, traces were made at equally spaced values of pji : pji = 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, and 0.2. Five hundred neurons were in each group, the PSP height was set at 0.10, and threshold was set at 1.0. As in Figure 2a, the curves on the left have higher slopes at midpoint, and the slope increases as one moves from right to left. The left-most curve, where pji = 0.8, shows a 90% activity increase in the output in response to less than a 10% activity increase (22% to 30%) in the input. This switch-like property appears extremely precise for the left-most curve. Similar to Figure 2a, the spacing between curves becomes narrower as one moves from the right to the left, implying that the precision of the switch increases at higher interconnection densities. Figure 2c shows input-output results for the same model, but with different sizes for the neuronal groups. For each simulation, the size of both groups was the same. From left to right, the traces show results from having 1000 neurons, 750 neurons, 500 neurons, and 250 neurons in each group. A switch-like behavior is seen in the curves at the left, with the left-most curve increasing by more than 90% from an increase of under 10% in its input. The spacing between the midpoints of the curves decreases as one moves from right to left, indicating the increased precision to the switch at higher group sizes. For larger group sizes, the curve appears smoother. The probability pji for the presence of each interneuronal connection from group i to group j was 0.03, the PSP height was 0.02, and the threshold was 1.0. The results in Figure 2 demonstrate that there is a trade-off between the parameters ni , the size of the incoming neuronal group, pji , the interconnection probability, and e ji , the PSP height. If the sigmoid is to be moved to the left, then there must be an increase in the neuronal group size, an increase in the interconnection probability, or an increase in the PSP height. A lowering of one of these parameters could be offset by the raising of another. We propose that there is an intergroup connection strength that is modulated by the three parameters: ni , the size of the presynaptic group; pji , the interconnection density; and e ji , the postsynaptic potential height. We define the midpoint of the sigmoid-like curve to be the point where its height is 0.5. Then the horizontal placement of the midpoint is modulated by these three parameters, and its position reflects the weight, or connection strength, between the two groups.
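The trade-off can be made quantitative with a back-of-envelope mean-input approximation (our sketch, not a derivation from the letter): a neuron in group j receives on average a i · ni · pji · e ji of depolarization from the synchronous volley, so the sigmoid midpoint sits near the activation level at which this mean reaches threshold, consistent with the abstract's statement that the effective weight is the product of group size, interconnection density, and PSP height.

```python
def midpoint(n_i, p_ji, e_ji, T=1.0):
    """Approximate input fraction a_i at which the mean summed PSP hits
    threshold: a_i * n_i * p_ji * e_ji = T. The product n_i * p_ji * e_ji
    acts as the effective intergroup weight."""
    return T / (n_i * p_ji * e_ji)

# Raising any one of the three parameters shifts the midpoint left:
print(midpoint(500, 0.4, 0.0106))    # ~0.47, near the (0.5, 0.5) crossing in Figure 1b
print(midpoint(1000, 0.4, 0.0106))   # ~0.24: larger presynaptic group
print(midpoint(500, 0.8, 0.0106))    # ~0.24: denser interconnection
```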
4 Artificial Neural Network Model

Here we describe a feedforward artificial neural network, the most popular structure used in applications such as engineering, finance, and medicine. We show a modification to its architecture that will allow us to draw analogies between its computational architecture and that of the interconnected neuronal groups covered in section 3.

The classic feedforward artificial neural network arose from initial work by Rosenblatt (1962) and Widrow (Widrow & Hoff, 1960) and was extended by Werbos (1974, 1988) to include the ability to learn nonlinear functions, using a gradient-descent approach to training its weights. The latter form was independently invented by Parker (1982) and by Rumelhart and McClelland (1986). Many neural network architectures use the same functional structure in their artificial neurons, as covered in the textbooks by Haykin (1998) and by Mehrotra, Mohan, and Ranka (1997). Applications of these computing structures abound (Jain & Vemuri, 1999; Miller, Sutton, & Werbos, 1990; Sejnowski & Rosenberg, 1987; Dayhoff, 1990).

The basic structure of the popular artificial neural network is shown in Figure 3. Individual processing units are connected in a forward direction as shown in Figure 3a. Usually an interconnection architecture is chosen that consists of a single layer of input units, a single internal layer of hidden units, and a layer of output units, with successive layers fully interconnected. Figure 3b shows a schematic of the processing of a single unit. Each processing unit i has an activation value bi. Each interconnection has an associated weight, where wji is the weight to unit j from unit i. Input units have their activation values set by externally imposed values. Each input is a vector, or pattern, with one entry corresponding to each input unit, and d input units in total. Units in the hidden layer and the output layer compute an incoming sum, formed by adding the products of activation values and weights from feeder units. This incoming sum is then subjected to a squashing function, most popularly a sigmoid function with range (−1, 1) (see Figure 3c). The incoming sum is computed as
S_j = \sum_{i=1}^{d} w_{ji} b_i,

with the sigmoid squashing function

F(x) = \frac{2}{1 + e^{-x}} - 1

applied to the result.
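As a concrete reference, here is a minimal transcription of this unit computation; the function names are ours, and the example values are arbitrary.

```python
import numpy as np

def squash(x):
    """Sigmoid squashing function with range (-1, 1), as defined above."""
    return 2.0 / (1.0 + np.exp(-x)) - 1.0

def unit_output(weights, activations):
    """Incoming sum S_j = sum_i w_ji * b_i, followed by the squashing function."""
    s_j = np.dot(weights, activations)
    return squash(s_j)

# Example: a unit fed by three units with the given weights and activations.
print(unit_output(np.array([0.5, -1.0, 2.0]), np.array([1.0, 0.2, 0.4])))
```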
Figure 3: Schematic of the basic structure of an artificial neural network. (a) Processing units interconnect in a feedforward structure, with each layer fully connected to the following layer. (b) An individual processing unit. Unit j receives input from d units in the previous layer, and each interconnection has an associated weight. An incoming sum S_j is computed, and a squashing function, often a sigmoid, is applied to the result to obtain the new activation value for unit j. (c) Sigmoid function often used as a squashing function for an ANN.
Artificial neural networks are given a training set of data, which is used to adjust the internal weights of the network to produce a desired output. Typically the training set is presented to the network many times while the network adjusts its weights incrementally. After training, hidden units
Figure 4: Alteration of the basic artificial neural network structure. The activation level is restricted to nonnegative numbers, and the sigmoid is moved to center at 0.5 with a domain of 0 to 1. The formula f(x) = \frac{1}{1 + e^{-\eta(x - 0.5)}} is shown plotted, where η, a scale factor on the slope of the sigmoid, was set at η = 20.
detect individual features in the input vector. The output of the network is a vector built from the activation values of the output units. Many successful training rules for adjusting the weights have been proposed, as reviewed in current textbooks (Mehrotra et al., 1997; Haykin, 1998). Alteration of the basic artificial neural network is illustrated in Figure 4. First, we restrict the activation level of each processing unit to a nonnegative number between 0 and 1, inclusive. This will correspond to the activation level (fraction active) of a neuronal group. We continue to use a sigmoid-like transfer function, but we move its domain to the interval [0,1] and center the sigmoid at 0.5 on the horizontal axis. This domain will correspond to the fraction active in the neuronal group that serves as input. The range is similarly set to be the interval [0,1], which will correspond to the fraction active in the neuronal group that serves as output. Figure 5a shows an artificial neural network configuration in which a single artificial unit feeds another. The sigmoid transfer function shown in Figure 4 was used for the ANN’s squashing function. In Figure 5b, the activation level of the input unit is given on the x-axis, and the activation level of the output unit is given on the y-axis, with both scales being [0,1]. The basic sigmoid structure appears, similar to Figures 1b, 2a, and 2c. The weight on
Figure 5: (a) A configuration in which a single unit in an ANN feeds another unit. (b) Transfer functions at different weights. Activation of target unit j is plotted as a function of activation level of input unit i. Weights plotted are, from left to right, 4, 3, 2, 1, and 0.5.
the ANN was modulated to produce a series of traces, with weight values of 4, 3, 2, 1, and 0.5, from left to right. These look nearly identical to the graphs from the synchronous group architectures shown in Figures 2a to 2c. Thus, the modulation of the weight in the ANN produces results similar to the modulation of the parameters that were varied in Figure 2, specifically, the incoming group size, the probability of interconnections between groups, and the PSP height.

5 Computational Examples

In this section, we demonstrate how a network of synchronously firing neuronal groups can perform computational tasks. The first task selected is the XOR function, a traditional example of a nonlinear computation. The second task selected is a function calculation with five intermediary processing units. In both examples, the computations of an artificial network are mimicked by a synchronous group network.

Figure 6a shows a network consisting of two neuronal groups that receive external input, which feed two groups that comprise a second layer. The latter groups feed a final neuronal group (third layer) that provides
Figure 6: (a) Two different neuronal groups receive external input, and each sends connections to two additional groups. The two receiving groups in turn send connections to the final neuronal group shown at the top. Intergroup weights are shown along each interconnection. These weights were chosen to implement an exclusive-or (XOR) function. (b) Results of simulating the network shown in a as a synchronous group model. The x-axis shows the fraction active for the first input group and the y-axis the fraction active for the second input group. The intensity of each square in the grid is scaled from 0 (black) to 1 (white) and represents the fraction active for the output neuronal group. (c) Results of simulating the same network configuration with an ANN.
Table 1: Boolean Exclusive-Or (XOR) Function.

      0   1
  0   0   1
  1   1   0

Note: The horizontal and vertical axes of the table indicate the two inputs to the function, and the entry in the table is the output.
output from the network. To simulate the model in Figure 6a, we assigned values for the parameters that modulate the intergroup activity transfer. We chose the value p_ji = 0.2 for all interconnections between groups and set all group sizes to 100. For simplicity, we have set the PSP height as the product of two parameters. The first parameter is allowed to be different for each intergroup connection; the second parameter is the same for all intergroup connections. When multiplied, they give the PSP height for an individual intergroup connection, and all neurons connected between the same two groups have this same PSP height. We call the first parameter the intergroup weight and the second parameter the weight multiplier. The weight multiplier rescales the postsynaptic response so that convenient units of measurement can be chosen. In this case, we set the firing threshold at 1.0, so that the postsynaptic response represents a fraction of the firing threshold; another rescaling would allow millivolts to be used. Figure 6a shows the first of these parameters, the intergroup weight, associated with each intergroup connection. Note that we have included inhibition as well as excitation in this model, as two of the intergroup connection weights are negative.

Mathematical notation is needed to describe the model, as follows. Let U be a connectivity matrix, in which u_ji represents the strength value for the intergroup weight to group j from group i. Let α be the weight multiplier. The value of α is multiplied by the value of each entry u_ji to obtain e_ji, the height of the PSP for neuronal connections to group j from group i; the value of e_ji is the depth of an IPSP if e_ji is less than zero. Let P be an interconnectivity matrix in which p_ji is the probability that each individual neuron in group j receives a connection from each individual neuron in group i. In the XOR simulation presented here, all values of the matrix P were set to 0.2, and the weight multiplier α was set to 0.12.

The output of the synchronous group model is shown in Figure 6b. Each input value was stepped from 0 to 1, and the output of the network is shaded in the corresponding square. Lighter outputs correspond to a higher fraction of synchronous firing in the output group. The outputs at the four corners represent the correct outputs of the XOR function, given in Table 1. Next we modeled an artificial neural network with the same configuration as shown in Figure 6a, replacing each neuronal group with a single
artificial unit and using the same weights as the intergroup weights shown in Figure 6a. The result is given in Figure 6c and demonstrates that we have an ANN that operates as an XOR function, as the outputs at the four corners represent the correct outputs of the XOR function. This artificial model corresponds functionally to the model of networked synchronous groups, and the output shown in Figure 6c approximately matches the output of the synchronous group model shown in Figure 6b.

A function calculation example is shown in Figure 7. Figure 7a gives a model configuration in which one group provides input to five groups, which in turn feed a single output group. The intergroup connection strengths from the input group to the hidden units were (0.7, 1.0, 1.6, 3.0, and 5.0), and the weights from the hidden units to the output group alternated between 1 and −1 (i.e., 1, −1, 1, −1, 1). A synchronous group simulation of this network was done using the parameters p_ji = 0.25 for all interconnections, an intergroup multiplier α of 0.03, and group sizes of 300 neurons. Figure 7b shows the input-output relationship found for this network. Figure 7c shows the input-output relationship of an artificial neural network with the same configuration. A near match between the two outputs is apparent. The slope of the sigmoid in the artificial neural network model was adjusted to increase the match with the synchronous group network's output; since the mathematical relationship between the two networks is derived by matching the midpoints of the sigmoid-like transfer functions, additional matching of the function's shape or slope is acceptable, and other small corrections may be needed as well. Weights in both layers were the same as in Figure 7a; however, an additional scale factor of 1.125 was used for this figure to produce a higher-fidelity match to the synchronous group model's output.

Figures 6 and 7 offer a brief demonstration that, within the parametric operating ranges used here, the computations made by an artificial neural network can be mimicked by a network of synchronous biological neuronal groups. Although the match in Figure 7 is not exact from the formulas derived, a slight adjustment brings about a near-perfect correspondence.
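For readers who want to reproduce the ANN side of Figure 6, the following sketch wires a 2-2-1 network with the shifted sigmoid of Figure 4 (η = 20). The actual intergroup weights appear only in Figure 6a, which is not reproduced in the text, so the weights below are a standard, hypothetical XOR assignment (an OR-like hidden unit, an AND-like hidden unit, and an output unit that subtracts the two), not the paper's values.

```python
import numpy as np

def f(x, eta=20.0):
    """Sigmoid of Figure 4: range (0, 1), centered at 0.5."""
    return 1.0 / (1.0 + np.exp(-eta * (x - 0.5)))

# Hypothetical weights (not those of Figure 6a): hidden unit 1 is OR-like,
# hidden unit 2 is AND-like, and the output unit subtracts the two.
W_hidden = np.array([[0.60, 0.60],    # weights to hidden unit 1
                     [0.35, 0.35]])   # weights to hidden unit 2
w_out = np.array([0.9, -0.8])         # weights to the output unit

def xor_net(b1, b2):
    hidden = f(W_hidden @ np.array([b1, b2]))
    return f(w_out @ hidden)

for b1, b2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(b1, b2, round(float(xor_net(b1, b2)), 3))
# The corners approximate the XOR truth table of Table 1: 0, 1, 1, 0.
```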
6 Mapping ANNs to Synchronous Groups

In this section, we show how to map an ANN to a synchronous group architecture. We present the needed mathematical notation and show formulas and procedures for converting any ANN to a synchronous group architecture. Through this process, it becomes apparent that within reasonable operating ranges, the synchronous group networks can be expected to behave approximately as the ANNs do.

First, consider a neural architecture where one group (group i) feeds another group (group j), as in Figure 1a. We simulate this network with
Figure 7: The same function is computed by an ANN and a synchronous group network. (a) One input unit receives external inputs and feeds five hidden units, which in turn feed a single output unit that produces a signal in the interval [0,1]. (b) Output from a synchronous group model. Fraction active of the output group is plotted as a function of fraction active of the input group. Three hundred neurons were in each group, the interconnection density p_ji was 0.25, the intergroup multiplier was 0.03, and the intergroup connection strengths are shown in a. (c) Output from the ANN model. Activation of the output unit is plotted as a function of activation of the input unit. The plot matches the plot in b approximately. Both layers of weights were uniformly adjusted by a multiplier of 1.125 to produce a better match to the synchronous net's output, and the slope of the sigmoid was uniformly adjusted so that the artificial neural network output approximately matched the slopes in b.
two time steps. In the first time step, neurons in group i fire. In the second time step, neurons in group j fire. Let a_i be the fraction of group i that is active (generates spikes) at time = 1, and let a_j be the fraction of group j that is active (generates spikes) at
time = 2. Let n_i be the number of neurons in group i, and let n_j be the number of neurons in group j. Then n_i a_i is the number of active neurons in group i (at time = 1). Let T be the threshold for firing for all neurons. For simplicity, we have set T to be the same for all neurons, but the results here can be generalized to different thresholds, and the threshold can also be rescaled to fit biological numeric values more closely. Here, T is set to 1. Let e_ji be the PSP height for connections to group j from group i. For simplicity, we set all PSP heights on a given intergroup connection to be the same, but the results given in this letter can be generalized to a distribution of values. Define p_ji as the probability that an individual neuron in group j receives a connection from an individual neuron in group i. For an individual neuron, neuron k, in group j, p_ji is then the expected fraction of neurons in group i that connect to neuron k in group j. For simplicity, we allow here only one connection between any two neurons, but our results can be generalized to situations where more than one connection is allowed between two neurons. Then p_ji n_i is the expected (average) number of neurons in group i that connect to a neuron k in group j, and n_i a_i p_ji is the expected (average) number of incoming spikes to a neuron k in group j.

For a recipient neuron in group j, define the boost from group i as the amount of change in the neuron's PSP as a result of inputs from group i. This boost occurs after neurons in group i have fired (at time = 1). The average boost for a neuron in group j is then given by m_i:

m_i = n_i a_i p_{ji} e_{ji}.   (6.1)
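As a quick check of equation 6.1, a one-line implementation (the function name is ours) with the parameter values used in the Figure 8 example below places the mean boost essentially at the threshold T = 1, so about half of the target group fires:

```python
def mean_boost(n_i, a_i, p_ji, e_ji):
    """Average boost to a neuron in group j (equation 6.1)."""
    return n_i * a_i * p_ji * e_ji

# Parameters of the Figure 8 example: 600 * 1.0 * 0.2 * 0.0084 = 1.008.
print(mean_boost(n_i=600, a_i=1.0, p_ji=0.2, e_ji=0.0084))
```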
We next look at the population of neurons in group j and how they respond differently from one another. We begin by defining the probability distribution for the number of incoming connections received by each neuron in group j. Let h(q) be the probability that a neuron k in group j receives exactly q incoming axons from group i, assuming that neuron k is chosen at random from group j. Correspondingly, let g(z) be the probability that a fraction z of the neurons in group i feed an individual neuron k in group j. Then let R_i(x) be the probability that the boost from group i to an individual neuron k in group j is equal to x. It is possible to start with the distribution h(q), then calculate the distribution g(z), and then calculate the distribution R_i(x).

Let us suppose that for neuron k in group j, there are q_c incoming axons from q_c distinct neurons, and let z_c be the fraction of neurons in group i that feed neuron k. Then

\frac{q_c}{n_i} = z_c.   (6.2)
Then the fraction of neurons in group j that receive axons from exactly z_c n_i neurons in group i is g(z_c) = h(q_c), and

R_i(n_i a_i e_{ji} z_c) = g(z_c),   (6.3)

since n_i a_i e_ji z_c is the boost from q_c incoming axons. So the R_i distribution is derived from the g distribution with only a rescaling of the horizontal axis. Note that since z_c takes discrete values, n_i a_i e_ji z_c also takes discrete values. Let D_i be the set of all the discrete values that n_i a_i e_ji z_c can take on; the set D_i is used later in summing probabilities over boost values. Note also that the R_i(x) distribution depends on the values of n_i, e_ji, and a_i, so we expand our notation for R_i(x) to R_{i,a_i,n_i,e_ji}(x).

Now let us look at the result from input a_i. The fraction firing in group j would be

\sum_{x \ge T,\; x \in D_i}^{\infty} R_{i,a_i,n_i,e_{ji}}(x),
where the sum is taken over all the discrete values in D_i that are at least the value of the threshold T.

Example. Two groups configured as in Figure 1a were modeled. Both groups had 600 neurons, with interconnection probability p_ji = 0.2 and postsynaptic response e_ji = 0.0084. In this example, we calculate h(q) and R_{i,a_i,n_i,e_ji}(x). Shown in Figure 8a is the plot of h(q), the probability that a neuron k in group j receives exactly q incoming axons from group i. The peak is at 120. The plot of the boost distribution R_{i,a_i,n_i,e_ji}(x) is given in Figure 8b, in which the horizontal axis is rescaled compared to Figure 8a. The peak is centered at 1, which means that approximately half of the output neurons are above the threshold of 1. Thus, a_j = 0.5.

Next, we consider a scenario where a_i changes values, which shows how the sigmoid-like shape of the input-output function arises. First, consider the probability distribution R_{i,a_i,n_i,e_ji}(x). From equation 6.3, R_{i,a_i,n_i,e_ji}(x) = g(z_c) when x = z_c n_i a_i e_ji. We hold the values of n_i and e_ji constant and vary a_i. This corresponds biologically to a situation in which the group size n_i and the PSP response do not change, while a_i represents an external stimulus strength, which can change rapidly in biological situations.

Figure 9 graphs the probability that a neuron in group j receives a boost value of x. We set the group sizes at 500, p_ji = 0.15, and e_ji = 0.028. Each
Figure 8: (a) Probability distribution for the number of incoming axons to an arbitrarily chosen neuron in group j. The x-axis shows the number of incoming axons (to a single neuron in group j), and the y-axis shows the probability that a randomly selected neuron in group j receives exactly x incoming axons. (b) Probability distribution for the boost due to incoming axons to a (randomly selected) neuron in group j. The x-axis shows the possible values of the incoming boost to a neuron in group j. The y-axis shows the probability of each boost value being received by an arbitrarily chosen neuron in group j. The probability distribution is the same as in a except for a rescaling of the x-axis. There were 600 neurons in each group, an interconnection probability p_ji of 0.2, a PSP height e_ji of 0.0084, and an input activation fraction of 1.0.
graph shows this probability for a different value of the incoming stimulus a_i. To find the fraction of neurons in group j that fire as a result of stimulus a_i, we calculate the area under the curve where the boost is greater than the threshold (x ≥ T). Figures 9a to 9e illustrate the distribution R_{i,a_i,n_i,e_ji}(x) moving from left to right as a_i increases; the shaded area to the right of the threshold T = 1 corresponds to the network's output a_j.

The visualization in Figure 9 makes it possible to see where the sigmoid-like shape of the transfer function in Figure 1b comes from. The slope of the sigmoid corresponds to the height of the distribution at T = 1. The point on the sigmoid with the highest slope occurs when the peak of the distribution passes over the threshold value T = 1, as shown in Figure 9c. As the peak gets farther from T, the height of the distribution at T gets lower and, correspondingly, so does the slope of the sigmoid-like curve. There is an asymmetry between Figures 9a and 9e and between Figures 9b and 9d, which causes a corresponding asymmetry in the sigmoid-like graph of Figure 1b, whereby the bottom curl of the sigmoid is more curved than the top curl. Changing the value of a_i is equivalent to rescaling the x-axis of the probability distribution, and rescaling the x-axis does not produce a reflection between these figure pairs; hence the asymmetry.
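The following sketch reproduces this construction numerically. The paper does not state h(q) in closed form, but independent connections, each present with probability p_ji, imply a Binomial(n_i, p_ji) in-degree; the boost axis is then the rescaling of equation 6.3. Function names are ours; the parameters are those of the Figure 8 example, for which roughly half of group j ends up above threshold.

```python
import math
import numpy as np

def boost_distribution(n_i, p_ji, e_ji, a_i):
    """Return (boost values D_i, probabilities R_i) for the boost that a
    randomly chosen neuron in group j receives from group i."""
    q = np.arange(n_i + 1)
    # h(q): Binomial(n_i, p_ji) in-degree distribution
    h = np.array([math.comb(n_i, k) * p_ji**k * (1 - p_ji)**(n_i - k)
                  for k in q])
    boosts = q * a_i * e_ji    # rescaled axis of equation 6.3
    return boosts, h

def fraction_firing(n_i, p_ji, e_ji, a_i, T=1.0):
    """Shaded area of Figure 9: probability mass at or above threshold."""
    boosts, probs = boost_distribution(n_i, p_ji, e_ji, a_i)
    return probs[boosts >= T].sum()

# Figure 8's parameters: the in-degree peaks at 120 axons, the boost peaks
# near the threshold T = 1, and about half the group fires (a_j close to 0.5).
print(fraction_firing(n_i=600, p_ji=0.2, e_ji=0.0084, a_i=1.0))
```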
Figure 9: Distribution of R_{i,a_i,n_i,e_ji}(x) for a model in which a single neuronal group feeds a target neuronal group. Each plot shows results for a different value of a_i. The threshold is T = 1, and the area to the right of the threshold is shaded. The shaded area corresponds to the fraction firing of the target neuronal group. The distribution moves to the right as the incoming fraction active increases. (a) a_i = 0.3. (b) a_i = 0.43. (c) a_i = 0.5. (d) a_i = 0.57. (e) a_i = 0.7.
6.1 Translating Weights from ANN to Synchronous Group Networks: One Group Feeds a Target Group. Here we show how weights in the ANNs correspond to intergroup weights in the synchronous group networks. To accomplish this correspondence, we match the midpoint of a transfer function in the ANN to the midpoint of the input-output function in the synchronous group architecture. When these positions match, the
weight in the ANN is set equal to the intergroup weight in the synchronous group network. The midpoint is defined as the point (b_i, b_j) where the transfer function is at half height, that is, b_j = 0.5.

6.1.1 ANN. Take an ANN with a single input unit (activation level b_i) and a single output unit (activation level b_j), as in Figure 1a. Let the weight on the single interconnection be w_ji. Then the output is given by b_j = f(w_ji b_i), where f is the squashing function, here the sigmoid of Figure 4. But at the midpoint of the sigmoid, f(w_ji b_i) = 0.5, and

w_{ji} b_i = 0.5.   (6.4)

Therefore,

w_{ji} = \frac{1}{2 b_i}

at the midpoint.
6.1.2 Corresponding Synchronous Group Model. Consider a synchronous group model in which a single input group feeds a single output group. The fraction activation of the input group is denoted a_i, and the fraction activation of the output group is denoted a_j. We set the inputs of the two models equal: a_i = b_i. To place the midpoint at (a_i, 0.5), as in the ANN, we
need to find what values of n_i, p_ji, and e_ji yield the value a_j = 0.5. If a_j = 0.5, then the threshold T equals the point at which half the area of the boost distribution lies on each side of T. We approximate this point by the average boost, given in equation 6.1, because of the approximate symmetry of the two sides of the peak: T ≈ n_i a_i p_ji e_ji. Since we previously set T = 1,

1 ≈ n_i a_i p_{ji} e_{ji}.

This implies that

a_i ≈ \frac{1}{n_i p_{ji} e_{ji}}

at the midpoint in the synchronous group model. From equation 6.4, at the midpoint in the ANN,

a_i = \frac{1}{2 w_{ji}}.

Matching the formulas for a_i at the midpoints of the two models, we get

\frac{1}{2 w_{ji}} \approx \frac{1}{n_i p_{ji} e_{ji}},

which implies that setting

w_{ji} \approx \frac{n_i p_{ji} e_{ji}}{2}   (6.5)
will cause the midpoints of the two models to match approximately. Equation 6.5 now gives us a way to convert the weight w_ji in the ANN to an intergroup weight in the synchronous group network. The intergroup weight in the synchronous group network is the product of n_i, the size of the incoming group; p_ji, the probability of individual interconnections between groups; and e_ji, the size of the PSP, with a multiplier of 1/2.

Example. Let us check this correspondence on the same configuration of neurons, but with a different value for the weight w_ji. Using the configuration of Figure 1a, let n_i = 200, p_ji = 0.1, and e_ji = 0.2. Then w_ji = (1/2)(200)(0.1)(0.2) = 2. So, if a synchronous group model has the above values for n_i, p_ji, and e_ji, it will function similarly to an ANN with a weight of w_ji = 2.
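Equation 6.5 is simple enough to state as a conversion utility (the function name is ours); the worked example above recovers w_ji = 2:

```python
def group_params_to_ann_weight(n_i, p_ji, e_ji):
    """Equation 6.5: equivalent ANN weight for an intergroup connection
    with group size n_i, connection probability p_ji, and PSP height e_ji."""
    return 0.5 * n_i * p_ji * e_ji

# The example above: n_i = 200, p_ji = 0.1, e_ji = 0.2 gives w_ji = 2.0.
print(group_params_to_ann_weight(200, 0.1, 0.2))
```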
6.2 Translating Weights from ANN to Synchronous Group Networks: Multiple Groups Feed a Target Group. In this section we address the neural network configuration in which multiple groups feed a target group. Given d feeder groups labeled i = 1, 2, . . . , d and a single target group j, we match the midpoint of the squashing function in the ANN to the midpoint of the sigmoid-like function in the synchronous group model.

6.2.1 ANN. Take an ANN with d input units i (i = 1, 2, . . . , d) feeding a single target unit j. Let the respective weights be w_{j1}, w_{j2}, . . . , w_{jd}. Then the output of the ANN is

f\left( \sum_{i=1}^{d} w_{ji} b_i \right).

At the midpoint, f(\sum_i w_{ji} b_i) = 0.5, which implies

S_j = \sum_{i=1}^{d} w_{ji} b_i = 0.5.
6.2.2 Corresponding Synchronous Group Model. Take a synchronous group model in which d groups feed a single target group. Let a_i denote the fraction activation of input group i, and let a_j denote the fraction activation of the output group. We set the inputs of the two models equal: a_i = b_i. Let R_{i,a_i,n_i,e_ji}(x) be the probability that the value of the PSP boost from group i is x, given parameters a_i, n_i, and e_ji for input group i. Let R_all(y) be the probability that the total PSP boost from groups 1 to d is y, given the parameters a_i, n_i, and e_ji of all input groups i. Then

R_{\text{all}}(y) = \sum_{x_1=0}^{y} R_{1,a_1,n_1,e_{j1}}(x_1) \sum_{x_2=0}^{y-x_1} R_{2,a_2,n_2,e_{j2}}(x_2) \sum_{x_3=0}^{y-x_1-x_2} R_{3,a_3,n_3,e_{j3}}(x_3) \cdots \sum_{x_{d-1}=0}^{y-x_1-\cdots-x_{d-2}} R_{d-1,a_{d-1},n_{d-1},e_{j,d-1}}(x_{d-1}) \, R_{d,a_d,n_d,e_{jd}}(x_d),   (6.6)

where x_d = y − x_1 − x_2 − · · · − x_{d−1}.
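Equation 6.6 is the distribution of a sum of independent boosts, that is, a discrete convolution of the per-group boost distributions. The sketch below checks the mean-additivity property used next (equation 6.7) for three binomial boost distributions; to keep all boosts on a single lattice it assumes, purely for convenience, a common activation a and PSP height e across the feeder groups.

```python
import math
import numpy as np

def in_degree_pmf(n_i, p_ji):
    """pmf of the number of incoming axons, Binomial(n_i, p_ji)."""
    q = np.arange(n_i + 1)
    return np.array([math.comb(n_i, k) * p_ji**k * (1 - p_ji)**(n_i - k)
                     for k in q])

# Three feeder groups; with a common a and e, every boost is a multiple of
# step = a * e, and equation 6.6 reduces to an ordinary discrete convolution.
a, e = 0.5, 0.01
pmfs = [in_degree_pmf(300, 0.2), in_degree_pmf(500, 0.1), in_degree_pmf(400, 0.15)]

R_all = pmfs[0]
for pmf in pmfs[1:]:
    R_all = np.convolve(R_all, pmf)   # right-hand side of equation 6.6

step = a * e
mean_all = step * np.dot(np.arange(len(R_all)), R_all)
mean_sum = step * sum(np.dot(np.arange(len(p)), p) for p in pmfs)
print(mean_all, mean_sum)             # equal, as equation 6.7 states
```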
We have studied the properties of equation 6.6 empirically through simulations of examples with different distributions of R and multiple terms. We found that the mean of the total distribution (the left side of equation 6.6) is equal to the sum of the means of the distributions on the right side of equation 6.6, as expected, since R_all is the distribution of a sum of independent boosts. Thus, we utilize the following equation:

m_{\text{all}} = m_1 + m_2 + m_3 + \cdots + m_d,   (6.7)

where m_i is the mean of the distribution R_{i,a_i,n_i,e_ji}(x) and m_all is the mean of the distribution R_all.

Next we look at the midpoints. Note that m_i = n_i a_i p_ji e_ji from equation 6.1. At the midpoint in the synchronous group model, T = 1 ≈ m_all, which equals

\sum_{i=1}^{d} n_i a_i p_{ji} e_{ji}

by equations 6.7 and 6.1. We next utilize the weight correspondence found for the one-to-one model, given in equation 6.5:

w_{ji} \approx \frac{e_{ji} n_i p_{ji}}{2}.

At the midpoint, the threshold T satisfies

1 \approx \sum_{i=1}^{d} n_i a_i p_{ji} e_{ji} = 2 \sum_{i=1}^{d} a_i w_{ji},

which implies

\sum_{i=1}^{d} a_i w_{ji} \approx 0.5.
This confirms that the midpoint of the synchronous group network approximately matches the midpoint in the ANN. Thus, the configuration with multiple units feeding a target unit is expected to work the same way, or similarly, for the ANN and for the synchronous group network.

The sigmoid-like shapes of the synchronous group transfer functions are not an exact match to the actual sigmoid shown in Figure 3c. A better match
between the two models can be obtained by first determining the sigmoid-like shapes in the synchronous group network and then using the same shapes for the squashing function in the ANN. After training the ANN, the trained weights can be put on the synchronous group architecture.

Because of the statistical nature of choosing at random a fraction a_i of the input group neurons and of choosing a fraction p_ji of the interconnections to be present, the result is itself statistical in nature. For small values of n_i, one would need to average over many runs to get a smooth input-output relationship that matches between the two models. For large values of n_i, such averaging would not be needed, and the transfer functions appear smooth. Other distributions would occur in nature, such as a distribution of heights e_ji for the maximum PSP response, a distribution of thresholds, noise added to the postsynaptic potential as it sums responses evoked by incoming spikes, and temporal delays due to dendritic organization. A typical biological neuron would be expected to require hundreds (or more) of incoming spikes to fire, which would allow a high degree of averaging and could produce sufficiently smooth curves, even in the presence of all of these sources of noise and variation.

7 Discussion

In this letter, we have shown a set of models in which groups of IAF neurons are networked together to perform computations in which a spatiotemporal pattern of synchronies sweeps across the network, and the computational results can mimic any function or calculation attainable by an artificial neural network. Each instance of synchronous firing may involve all, or some, of the neurons in a group. A correspondence, or mapping, is set up between ANNs and synchronous group networks, in which the ANN connection weights correspond to intergroup weights attainable with biological parameters. As connection weights increase, interconnections in both types of network can behave like switches, mediating a wide variety of computations. The neurons in the synchronous group models behave as functional assemblies, or ensembles, in two respects: (1) each neuronal group, capable of firing in synchrony, constitutes an ensemble with redundant neurons, and (2) the different groups that are networked together work as an assembly to accomplish computations, analogous to ANN units working as a distributed assembly of simple subunits that, taken as a whole, can compute a powerful array of results.

In the synchronous group architecture, there are three ways to modulate the relative strengths of the weights that interconnect the synchronous groups: changing (1) the probability p_ji for each individual interconnection, (2) the height e_ji of the EPSP or IPSP, or (3) the size n_i of the incoming neuronal group. There is only one way in the artificial neural network: changing the numeric weight w_ji on an interconnection between
nodes. Using the correspondence w_ji = (1/2) p_ji e_ji n_i, a biological neural system could implement weight changes by changing one parameter and then resetting that parameter to its original value while changing another, thus retaining the new connection strength. For example, if p_ji increases, then later p_ji could decrease while e_ji or n_i increases, resulting in the same intergroup weight. The biological net would then function in the same way computationally as the corresponding ANN. New neuronal generation (neurogenesis) has been observed in adult mammals, but typically these neurons live for only a few months (Gould & Gross, 2002; Gould, Vail, Wagers, & Gross, 2001). This observation could reflect a mechanism whereby weight changes are first implemented by changes in group size due to newly formed neurons. Later, when the new neurons die, changes in interconnectivity or in EPSP or IPSP heights could implement the needed weight changes. Alternatively, additional older neurons could be recruited into a group to strengthen the weight, or additional connections between existing neuronal groups could be generated. These mechanisms would allow more options for implementing long-term weight changes.

In the models presented here, groups are disjoint (share no neurons), but in general this does not have to be the case. Some overlap between the groups could likely be tolerated, resulting in extra noisy spikes. As observed in Figures 2a to 2c, the traces on the left of each figure contain sigmoid-like curves with high slopes. This means that the action of the target group is like a switch. The highest value of a_i that produces a 5% response could be considered the threshold of the switch. The lowest value of a_i that produces a 95% response would be considered the "on" position of the switch, corresponding to the top of the sigmoid. Extra spikes arriving below threshold, or above the "on" position of the switch, would not make much difference in the action of the target group. Thus, a low level of extra spikes due to overlapping groups could leave the operation of the network essentially unchanged. Alternatively, a different architectural configuration, yet to be discovered, might implement the same calculations but with substantially overlapping groups.

Spontaneous activity arriving from outside neurons would be expected to produce noise for the system. Since ANN outputs have varied sensitivity to fluctuations in their inputs, the same would be expected of the corresponding synchronous group nets. Synchronies that arrive spontaneously from outside the network would have potentially larger effects on the network output than individual nonsynchronized spikes; however, unless the arriving synchronies were trained in some as yet unknown way, their effects would be random. Previous studies have examined the role of spontaneous activity in the case of synfire chains and logic gates (XOR) (Van Rossum, Turrigiano, & Nelson, 2002; Vogels & Abbott, 2005). It should be noted that it is possible to build models in which background noise consists of many small PSPs, which in effect raise the resting potential,
thus lowering the distance to threshold, which is equivalent to lowering T in our model.

Many learning rules have been reported in the literature on biological neurons. It is beyond the scope of this letter to infer which training or learning rules might be used by a biological system that utilizes synchronous groups. However, the models presented here can be used to elucidate the structures that would need to be modified if general learning capabilities occur in networked synchronous ensembles. With the synchronous group network, there is an additional question as to how the neuronal groups capable of firing in synchrony organize in the first place. Simulation studies showing how neuronal synchronies can self-organize have been presented for several network configurations (Amari, Nakahara, Wu, & Sakai, 2003; Masudo & Aihara, 2003).

Although this letter shows implementations of only feedforward networks, a variety of different types of artificial neural networks have been proposed in the literature, including attractor networks, self-organizing feature maps, recurrent networks, and competitive architectures (Haykin, 1998; Mehrotra et al., 1997). Many of these models, if set up to limit the squashing function to the interval [0,1], as in Figure 4, would be amenable to implementation by networked synchronous groups of biologically structured integrate-and-fire neurons. In this way, the models proposed here are starting points for an entire field of models that show how computations of artificial neural networks could occur in realistic biological neurons via networks of synchronously firing neuronal groups.

Many pulsed architectures proposed in the literature are designed with hardware implementation in mind. In these implementations, it is assumed that individual neurons will not be lost during the lifetime of the hardware. Often the design makes each neuron, and possibly every spike, essential to the computational performance, a feature undesirable in biological systems. In contrast, a biological system needs enough redundancy among neurons that an individual neuron could die or become defective without destroying the functionality of the entire system. The synchronous group model proposed here addresses this limitation. If a single neuron in a synchronous group is lost, the others are still present to perform the needed activity. Furthermore, a lost neuron could be compensated for by increases in the EPSP or IPSP heights of other units in the same group, or by increased interconnection density, so as to retain the intergroup connection strength.

Synchronously firing neurons have been observed in biological systems. Lindsey et al. (1997) observed repeated patterns of distributed synchrony in simultaneously recorded respiratory neurons. Morris and colleagues observed changes in cat medullary neuron firing rates and synchrony following induction of respiratory long-term facilitation (Morris, Shannon, & Lindsey, 2001; Morris et al., 2003). Multiunit synchronies were observed
to be modulated by attention in primate somatosensory cortex (Steinmetz et al., 2000). Synchronies have been identified in multiunit recordings from hippocampus, phase-locked to the theta rhythm (Harris et al., 2002; Harris, Csicsvari, Hirase, Dragoi, & Buzsaki, 2003). Recordings of neurons from monkey motor cortex have exhibited context-dependent patterns of coincident action potentials (Riehle, Grun, Diesmann, & Aertsen, 1997). Salinas and Sejnowski (2001) review experimental and theoretical results indicating that correlated fluctuations in nerve impulses could be important for cortical processes. Synchronies among two or more spike trains are expected in the event of oscillations of larger sets of neurons. Local synchronization in striate cortex of the alert cat was observed in the context of stimulus-dependent neuronal oscillations (Gray & Di Prisco, 1997), and the synchronization of oscillatory responses was observed in visual cortex (Fries et al., 1997). However, these synchronies occur in the context of regular oscillations that could cause them, rather than as individual synchronies that could mediate a feedforward calculation, as we propose here.

A network of neuronal groups configured as we show here can behave like a network of on-off switches. With sufficiently strong weights, the input-output relationship of an intergroup connection becomes a sigmoid-like curve with a high slope, as shown in the left traces of Figures 2a to 2c. This behavior is switch-like because the difference between the input fraction that triggers a 10% output response and the one that triggers a 90% output response is very small (as small as 10% in Figure 2). When the target group turns on at more than 90% active, it does so quickly, as a group's synchrony can be triggered nearly instantaneously by a previous group's synchrony. Thus, a network with relatively high weights (at least 1 in our models) would enact a dramatic computational mechanism, in which a pattern of neuronal groups is rapidly switched on, propagating forward through the network. An incoming synchronous volley would affect the target dendrites, which would propagate their reaction to the axon hillock, which would then trigger the target group's firing response. The integration time for the PSP to reach threshold, plus the propagation time along the dendrite and axon, would be the delay δ from the input synchrony to the synchrony generated by the target group. As a result, the processing time for a pattern recognition task by a biological synchronous group network could be exceedingly fast. Each layer would propagate a signal with delay δ, so a network of β layers would produce a computational result at time δβ. General function approximation ability could be achieved with two layers of weights, corresponding to a time of approximately 2δ. In contrast, a population code makes it necessary to wait to integrate the number of arriving spikes from a population before the output of a computation is known, which is expected to take longer.
It should be noted that an important temporal parameter in a biological implementation of a network of synchronous groups would be the temporal fidelity of the synchrony. In a synchrony code, a group of neurons that fires in synchrony, or near-synchrony, would have a spread of spikes over time, and this spread in turn affects the quickness of the calculation. In addition, a layer of units, or neuronal groups, would need to fire at the same time to be summed correctly by the next layer of units. Thus, the timing of the synchrony has to be within tolerance limits for the calculation to occur correctly. It is for this reason that a burst of multiple synchronies, instead of a single synchrony, could be a biological advantage, as it would remove the need for precise synchronization of synchronies in the interior layers of the network. The model proposed here is consistent with the use of bursting by neuronal groups. A group that bursts in near-synchrony would satisfy the criterion for the computational action of the model, except that it would have several synchronies in place of just a single one. The computation would be triggered by an outside stimulus that causes a burst in the input units (groups). Bursting would increase the redundancy of the system, a biologically desirable result.

8 Conclusion

This letter shows that powerful computational abilities previously demonstrated in artificial neural networks can occur in a network of neuronal groups in which each group consists of spiking IAF neurons, some or all of which fire synchronously or near-synchronously. It demonstrates how the classic artificial neural network construct can be altered, without changing its computational properties, to have a new structure that corresponds to models constructed of networks of synchronously firing spiking neurons with integrate-and-fire mechanisms. It extends the synfire chain concept to a network of synchronously firing groups that have computational capabilities, thus comprising a synfire net that can compute as well as transmit signals. We use midpoint matching to show the specifics of the equivalence between ANNs and spiking neuron models, resulting in a formula that translates ANN weights into parameters pertaining to the interconnections between groups of IAF neurons. Through this correspondence, we can show a structure by which the computational properties of artificial neural networks could occur biologically, and demonstrate how the classic artificial network weight becomes the product of three biological entities: the size of the presynaptic group, the intergroup connection density, and the postsynaptic potential height.

We show how the use of synchronies can drive fast-acting, forward-propagating calculations. We demonstrate the presence of switch-like behavior in ANNs as well as in synchronous group biological models and
show how a sigmoid-like shape emerges as a transfer function in the synchronous spiking model. The networks defined in this letter can be configured with any interconnection topology and thus can mimic feedforward networks, competitive networks, attractor networks, and other networks reported in the artificial neural network literature. With these structures, we can then study the dynamics, the architecture, and the computational properties of the corresponding biological models in a new light and obtain insights into how biological systems can, or might, compute.

Acknowledgments

I acknowledge Moshe Abeles for suggesting the idea of using the fraction active in this model and for his historical work on synfire chains, which establishes the background for this letter. I thank George Gerstein for his discussions on synchronous groups and his emphasis on the importance of functional groups in the nervous system.

References

Abeles, M. (1991). Corticonics: Neural circuits of the cerebral cortex. Cambridge: Cambridge University Press.
Abeles, M., & Gerstein, G. L. (1988). Detecting spatio-temporal firing patterns among simultaneously recorded single neurons. J. Neurophysiology, 60, 909–924.
Aertsen, A. M. H. J., Gerstein, G. L., Habib, M. K., & Palm, G. (1989). Dynamics of neuronal firing correlation: Modulation of effective connectivity. J. Neurophysiology, 61, 900–917.
Amari, S.-I., Nakahara, H., Wu, S., & Sakai, Y. (2003). Synchronous firing and higher-order interactions in neuron pool. Neural Computation, 15, 127–142.
Aviel, Y., Pavlov, E., Abeles, M., & Horn, D. (2002). Synfire chain in a balanced network. Neurocomputing, 44–46, 285–292.
Baker, S. N., Spinks, R., Jackson, A., & Lemon, R. N. (2001). Synchronization in monkey motor cortex during a precision grip task. I. Task-dependent modulation in single-unit synchrony. J. Neurophysiology, 85, 869–885.
Castro, J. L., Mantas, C. J., & Benitez, J. M. (2000). Neural networks with a continuous squashing function in the output are universal approximators. Neural Networks, 13, 561–563.
Chen, T., & Chen, H. (1993). Approximation of continuous functionals by neural networks with application to dynamic systems. IEEE Transactions on Neural Networks, 4(5), 910–918.
Chen, T., & Chen, H. (1995a). Approximation capability in C(R^n) by multilayer feedforward networks and related problems. IEEE Transactions on Neural Networks, 6(1), 25–31.
Chen, T., & Chen, H. (1995b). Universal approximation to nonlinear operators by neural networks with arbitrary activation functions and its application to dynamical systems. IEEE Transactions on Neural Networks, 6(4), 911–917.
Cotter, N. E., & Conwell, P. R. (1993). Universal approximation by phase series and fixed-weight networks. Neural Computation, 5(3), 359–362.
Cotter, N. E., & Mian, O. N. (1992). A pulsed neural network capable of universal approximation. IEEE Transactions on Neural Networks, 3(2), 308–314.
Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2, 304–314.
Dayhoff, J. E. (1990). Neural network architectures: An introduction. New York: Van Nostrand Reinhold.
Dayhoff, J. E. (1994). Synchrony detection in neural assemblies. Biological Cybernetics, 71(3), 263–270.
Dayhoff, J. E., & Gerstein, G. L. (1983a). Favored patterns in nerve spike trains I. Detection. Journal of Neurophysiology, 49(6), 1334–1348.
Dayhoff, J. E., & Gerstein, G. L. (1983b). Favored patterns in nerve spike trains II. Application. Journal of Neurophysiology, 49(6), 1349–1363.
De Kamps, M., & Van Der Velde, F. (2001). From artificial neural networks to spiking neuron populations and back again. Neural Networks, 14, 941–953.
Diesmann, M., Gewaltig, M. O., & Aertsen, A. (1999). Conditions for stable propagation of synchronous spiking in cortical neural networks. Nature, 402, 529–533.
Fries, P., Roelfsema, P. R., Engel, A. K., Konig, P., & Singer, W. (1997). Synchronization of oscillatory responses in visual cortex correlates with perception in inter-ocular rivalry. Proc. Natl. Acad. Sci., 94, 12699–12704.
Funahashi, K.-I. (1989). On the approximate realization of continuous mappings by neural networks. Neural Networks, 2, 183–192.
Gerstein, G. L., & Aertsen, A. M. H. J. (1985). Representation of cooperative firing activity among simultaneously recorded neurons. J. Neurophysiol., 54(6), 1513–1528.
Gerstein, G. L., & Perkel, D. H. (1972). Mutual temporal relationships among neuronal spike trains. Biophysical Journal, 12, 453–473.
Gerstein, G. L., Perkel, D. H., & Dayhoff, J. E. (1985). Cooperative firing activity in simultaneously recorded populations of neurons: Detection and measurement. Journal of Neuroscience, 5(4), 881–889.
Gibson, G. J., & Cowan, C. F. N. (1990). On the decision regions of multilayer perceptrons. Proceedings of the IEEE, 78(10), 1590–1594.
Gould, E., & Gross, C. G. (2002). Neurogenesis in adult mammals: Some progress and problems. Journal of Neuroscience, 22, 619–623.
Gould, E., Vail, N., Wagers, M., & Gross, C. G. (2001). Adult-generated hippocampal and neocortical neurons in macaques have a transient existence. Proc. Natl. Acad. Sci., 98, 10910–10917.
Gray, C. M., & Viana Di Prisco, G. (1997). Stimulus-dependent neuronal oscillations and local synchronization in striate cortex of the alert cat. J. Neurosci., 17(9), 3239–3253.
Grün, S., Diesmann, M., & Aertsen, A. (2002). Unitary events in multiple single-neuron spiking activity: Detection and significance. Neural Computation, 14(1), 43–80.
Harris, K. D., Csicsvari, J., Hirase, H., Dragoi, G., & Buzsaki, G. (2003). Organization of cell assemblies in the hippocampus. Nature, 424, 552–556.
Harris, K. D., Henze, D. A., Hirase, H., Leinekugel, X., Dragoi, G., Czurko, A., & Buzsaki, G. (2002). Spike train dynamics predicts theta-related phase precession in hippocampal pyramidal cells. Nature, 417, 738–741.
Haykin, S. (1998). Neural networks: A comprehensive foundation (2nd ed.). Upper Saddle River, NJ: Prentice Hall.
Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2(5), 359–366.
Iannella, N., & Back, A. D. (2001). A spiking neural network architecture for nonlinear function approximation. Neural Networks, 14, 933–939.
Ito, Y. (1991). Representation of functions by superpositions of a step or sigmoid function and their applications to neural network theory. Neural Networks, 4(3), 385–394.
Jain, L. C., & Vemuri, V. R. (1999). Industrial applications of neural networks. New York: CRC Press.
Kalouptsidis, N. (1997). Signal processing systems: Theory and design. New York: Wiley.
Koutroumbas, K. (2003). On the partitioning capabilities of feedforward neural networks with sigmoid nodes. Neural Computation, 15, 2457–2481.
Kurkova, V. (1992). Kolmogorov's theorem and multilayer neural networks. Neural Networks, 5(3), 501–506.
Lindsey, B. G., Hernandez, Y. M., Morris, K. F., Shannon, R., & Gerstein, G. L. (1992). Dynamic reconfiguration of brain stem neural assemblies: Respiratory phase-dependent synchrony vs. modulation of firing rates. J. Neurophysiol., 67, 923–930.
Lindsey, B. G., Morris, K. F., Shannon, R., & Gerstein, G. L. (1997). Repeated patterns of distributed synchrony in neuronal assemblies. J. Neurophysiol., 78, 1714–1719.
Maass, W. (1997a). Fast sigmoidal networks via spiking neurons. Neural Computation, 9, 279–304.
Maass, W. (1997b). Networks of spiking neurons: The third generation of neural network models. Neural Networks, 10, 1659–1671.
Maass, W. (1997c). Noisy spiking neurons with temporal coding have more computational power than sigmoidal neurons. In M. Mozer, M. I. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9 (pp. 211–217). Cambridge, MA: MIT Press.
Maass, W. (1999). Computing with spiking neurons. In W. Maass & C. M. Bishop (Eds.), Pulsed neural networks (pp. 55–85). Cambridge, MA: MIT Press.
Maass, W. (2002). Paradigms for computing with spiking neurons. In J. L. van Hemmen, J. D. Cowan, & E. Domany (Eds.), Models of neural networks: Early vision and attention (Vol. 4, pp. 373–402). Berlin: Springer.
Maass, W. (2003). Computation with spiking neurons. In M. A. Arbib (Ed.), The handbook of brain theory and neural networks (2nd ed., pp. 1080–1083). Cambridge, MA: MIT Press.
Makhoul, J., El-Jaroudi, A., & Schwartz, R. (1991). Partitioning capabilities of two-layer neural networks. IEEE Transactions on Signal Processing, 39(6), 1435–1440.
Maldonado, P. E., & Gerstein, G. L. (1996). Reorganization in the auditory cortex of the rat induced by intracortical microstimulation: A multiple single-unit study. Experimental Brain Research, 112, 431–441.
Masudo, N., & Aihara, K. (2003). Duality of rate coding and temporal coding in multilayered feedforward networks. Neural Computation, 15, 103–125.
Mehrotra, K., Mohan, C. K., & Ranka, S. (1997). Elements of artificial neural networks. Cambridge, MA: MIT Press.
Miller, W. T. III, Sutton, R. S., & Werbos, P. J. (1990). Neural networks for control. Cambridge, MA: MIT Press.
Morris, K. F., Baekey, D. M., Nuding, S. C., Dick, T. E., Shannon, R., & Lindsey, B. G. (2003). Invited review: Neural network plasticity in respiratory control. Journal of Applied Physiology, 94, 1242–1252.
Morris, K. F., Shannon, R., & Lindsey, B. G. (2001). Changes in cat medullary neurone firing rates and synchrony following induction of respiratory long-term facilitation. Journal of Physiology, 532(Pt. 2), 483–497.
Parker, D. B. (1982). Learning logic (Invention Rep. S81-64, File 1). Palo Alto, CA: Stanford University.
Perkel, D. H., & Bullock, T. H. (1968). Neural coding. Neuroscience Research Program Bulletin, 6(3).
Riehle, A., Grün, S., Diesmann, M., & Aertsen, A. (1997). Spike synchronization and rate modulation differentially involved in motor cortical function. Science, 278, 1950–1953.
Rosenblatt, F. (1962). Principles of neurodynamics. Washington, DC: Spartan Books.
Rumelhart, D. E., & McClelland, J. L. (1986). Parallel distributed processing. Cambridge, MA: MIT Press.
Salinas, E., & Sejnowski, T. J. (2001). Correlated neuronal activity and the flow of neural information. Nature Reviews Neuroscience, 2(8), 539–550.
Sandberg, I. W. (1994). General structures for classification. IEEE Transactions on Circuits and Systems I, 41(5), 372–376.
Sandberg, I. W. (1999). A note on classification and the approximation of functionals. International Journal of Circuit Theory and Applications, 27, 443–445.
Sandberg, I. W. (2001). Gaussian radial basis functions and inner-product spaces. Circuits, Systems, and Signal Processing, 20, 635–642.
Sandberg, I. W., Lo, J. T., Fancourt, C. L., Principe, J. C., Katagiri, S., & Haykin, S. (2001). Nonlinear dynamical systems: Feedforward neural network perspectives. New York: Wiley.
Sanger, T. D. (1998). Probability density methods for smooth function approximation and learning in populations of tuned spiking neurons. Neural Computation, 10, 1567–1586.
Sejnowski, T., & Rosenberg, C. (1987). Parallel networks that learn to pronounce English text. Complex Systems, 1(1), 145–168.
Shonkwiler, R. (1993). Separating the vertices of N-cubes by hyperplanes and its application to artificial neural networks. IEEE Transactions on Neural Networks, 4(2), 343–347.
Steinmetz, P. N., Roy, A., Fitzgerald, P., Hsiao, S. S., Johnson, K. O., & Niebur, E. (2000). Attention modulates synchronized neuronal firing in primate somatosensory cortex. Nature, 404, 187–190.
Van Rossum, M., Turrigiano, G. G., & Nelson, S. B. (2002). Fast propagation of firing rates through layered networks of noisy neurons. J. Neuroscience, 22(5), 1956–1966.
Vogels, T. P., & Abbott, L. F. (2005). Signal propagation and logic gating in networks of integrate-and-fire neurons. J. Neurosci., 25(46), 10786–10795.
Werbos, P. J. (1974). Beyond regression: New tools for prediction and analysis in the behavioral sciences. Unpublished doctoral dissertation, Harvard University.
Werbos, P. J. (1988). The roots of backpropagation: From ordered derivatives to neural networks and political forecasting. New York: Wiley-Interscience.
Widrow, B., & Hoff, M. (1960). Adaptive switching circuits. IRE WESCON Convention Record (Pt. 4), 96–104.
Williamson, R. C., & Helmke, U. (1995). Existence and uniqueness results for neural network approximations. IEEE Transactions on Neural Networks, 6(1), 2–14.
Received January 23, 2006; accepted September 2, 2006.
LETTER
Communicated by Liam Paninski
Variable Timescales of Repeated Spike Patterns in Synfire Chain with Mexican-Hat Connectivity

Kosuke Hamaguchi
[email protected]
RIKEN, Brain Science Institute, Wako-shi, Saitama, 351-0198, Japan
Masato Okada [email protected] RIKEN, Brain Science Institute, Wako-shi, Saitama, 351-0198, Japan; Department of Complexity Science and Engineering, University of Tokyo, Kashiwa, Chiba, 277-8561, Japan; and Intelligent Cooperation and Control, PRESTO, JST, Saitama, 351-0198, Japan
Kazuyuki Aihara [email protected] Institute of Industrial Science, University of Tokyo, Meguro, Tokyo 153-8505, Japan, and ERATO Aihara Complexity Modeling Project, JST, Shibuya-ku, Tokyo 151-0065, Japan
Repetitions of precise spike patterns observed both in vivo and in vitro have been reported for more than a decade. Studies on the spike volley (a pulse packet) propagating through a homogeneous feedforward network have demonstrated its capability of generating spike patterns with millisecond fidelity. This model is called the synfire chain and suggests a possible mechanism for generating repeated spike patterns (RSPs). The propagation speed of the pulse packet determines the temporal property of RSPs. However, the relationship between propagation speed and network structure is not well understood. We studied a feedforward network with Mexican-hat connectivity by using the leaky integrate-and-fire neuron model and analyzed the network dynamics with the Fokker-Planck equation. We examined the effect of the spatial pattern of pulse packets on RSPs in the network with multistability. Pulse packets can take spatially uniform or localized shapes in a multistable regime, and they propagate with different speeds. These distinct pulse packets generate RSPs with different timescales, but the order of spikes and the ratios between interspike intervals are preserved. This result indicates that the RSPs can be transformed into the same template pattern through the expanding or contracting operation of the timescale.
Neural Computation 19, 2468–2491 (2007)
© 2007 Massachusetts Institute of Technology
1 Introduction Repeated spike patterns (RSPs) are, literally, patterns of spikes that repeatedly appear amid the apparent random activities of a neuron population. They have been observed both in vivo and in vitro through several measurement methods, such as multielectrode array recordings and calcium imaging (Abeles, Bergman, Margalit, & Vaadia, 1993; Prut et al., 1998; Mao, Hamzei-Sichani, Aronov, Froemke, & Yuste, 2001; Ikegaya et al., 2004). The generating mechanism of RSPs is, however, not clear yet. One possible mechanism is the propagation of spike synchrony through a feedforward network, which is often called the synfire chain. Abeles (1991) defined the term synfire chain as a network that can support the synchronous spike propagation mode under a certain condition. A homogeneous feedforward network of spiking neurons is the simplest neural entity of the synfire chain. Its capability of transmitting a synchronized spike volley has been extensively studied in several spiking neuron models (Diesmann, Gewaltig, & Aertsen, 1999; Cˆateau & Fukai, 2001; Gewaltig, Diesmann, & Aertsen, 2001; Kistler & Gerstner, 2002). A feedforward network model with synaptic delay does not show the explicit synchrony of spikes, but the essential mechanism of activity propagation—the synchrony of arriving spikes—is still required (Izhikevich, 2006). The feasibility of such synchronized activity propagation has been experimentally confirmed in an iteratively constructed biological network in vitro (Reyes, 2003). In this letter, we refer to the synchronized spike discharge in one layer as a spike volley and refer to its propagation through a network as a pulse packet. The pulse packet propagation can explain the RSP phenomenon as follows. Assume that a multielectrode array was injected into the feedforward neural network, and several events of pulse packet propagation occurred. When one pulse packet propagates, electrodes can detect the spike event correlated to the activity propagation. The development of the propagating speed has been shown to be quite stable (Gewaltig et al., 2001). Therefore, those spikes detected over several electrodes have a specific interspike interval (ISI). Since each synchronous spike volley propagates through the network with approximately the same speed in all trials, the same ISIs repeatedly appear among uncorrelated spikes from spontaneous firings. Those statistically significant ISIs are defined as RSPs. Therefore, the speed of a pulse packet is an important property of RSPs. In this letter, our interest resides in the relationship between the propagation speed of a pulse packet and the resultant RSPs in a multistable network. If a network had multistability, the pulse packets in different stable fixed points would have different propagation speeds depending on the configuration of the network structure. Given that, the temporal order of an evoked precise spike sequence would not change, but the intervals of spikes within the RSPs would change. Many of the studies on synfire chains, however, have treated a homogeneous network structure, and they
had only one stable pulse packet shape: a spatially uniform synchronized activity. In this case, there is no chance to observe different RSPs. To understand the mechanism of RSPs, we consider a biologically plausible network structure and its resultant activities. Electrophysiological and anatomical data indicate that the cerebral cortex is spatially organized (Mountcastle, 1997). Columnar activities are widely observed in several cortical regions, including the primary visual cortex (Hubel & Wiesel, 1977) and the prefrontal cortex (Goldman & Nauta, 1977). Both orientation and ocular dominance columns have been identified through intensive studies with optical imaging (Blasdel, 1992). The recent development of two-photon calcium imaging allows us to study the activity of neuron populations with cellular-level resolution. The spontaneous activity of slices from the visual cortex shows that the repetition of activity patterns is also spatially organized (Cossart, Aronov, & Yuste, 2003). To understand the properties of repeated activity patterns in a spatially organized network, we chose Mexican-hat connectivity, which is widely accepted as one of the biologically realistic connectivity structures in the cortex. Studies on spatially localized activity date back to the 1970s. Intensive modeling studies have shown that spatially localized activity is stable in recurrent neural circuits with nearby excitation and global inhibition (Wilson & Cowan, 1972; Amari, 1977; Ben-Yishai, Bar-Or, & Sompolinsky, 1995; Compte, Brunel, Goldman-Rakic, & Wang, 2000), an interaction also called the Mexican-hat type interaction. This interaction is widely used as a neural substrate for general columnar activities (Wang, 2001). A numerical study of a network with feedforward Mexican-hat connectivity has shown that even without a reverberation loop, localized propagating activity is stable (van Rossum, Turrigiano, & Nelson, 2002). The coexistence of uniform and localized activity propagation has been shown in an analysis with binary neurons (Hamaguchi, Okada, Yamana, & Aihara, 2005), but because of the discrete-time dynamics of the binary neuron model, the speed of the pulse packet was not studied there. In this letter, we report our study of the dynamics of feedforward networks with Mexican-hat connectivity composed of leaky integrate-and-fire (LIF) neurons without refractoriness. In combination with the LIF neuron simulations, we used the equivalent Fokker-Planck equation (FPE) for the analysis. The FPE generates continuous firing-rate dynamics, which is useful for the analysis of pulse packet propagation, especially when the firing rate is very low. Our strategy is to embed multistability into the network, observe what types of RSPs appear, and seek the common feature among them. We describe the activity of a network with a small number of macroscopic variables indicating the population firing rate, the localization parameter, and the position of the activity. This allows us to reduce the complexity of the spatially organized network dynamics to a low-dimensional parameter space. Using this approach, we studied the stability of spike packets
within a certain parameter region and found a multistable regime. To understand the nature of the RSPs generated in multistable networks, we simulated several trials of multielectrode array recordings in the multistable synfire chain. We show that the RSPs generated from different pulse packets are similar but have different time constants because of the difference in the propagation speeds of the pulse packets.
2 Network Structure and Neuron Model

In this section, we first define the dynamics of the neuron model that we use in this letter. We then connect the neurons by specifying the network structure, that is, the synaptic efficacies between neurons. The dynamics of the membrane potential $v_\theta^l$ of a neuron indexed with $\theta$ in layer $l$ is described by a differential equation,

$$C\frac{dv_\theta^l}{dt} = -\frac{v_\theta^l}{R} + I_\theta^l(t) + \mu + D\xi_\theta(t), \tag{2.1}$$

$$I_\theta^l(t) = \sum_{\theta'} W(\theta - \theta')\sum_k \int_0^\infty d\tau\,\alpha(\tau)\,\delta\!\left(t - t_{\theta',k}^{l-1} - \tau\right), \tag{2.2}$$
where $C$ is the membrane capacitance and $R$ is the membrane resistance. The index $\theta \in \{-\pi, -\pi + 2\pi/N, \ldots, \pi - 2\pi/N\}$ is the functional distance between two neurons. We assume that the input to the soma of neuron $\theta$ consists of $I_\theta(t)$, a weighted sum of outputs from presynaptic neurons, and white gaussian noise with mean $\mu$ and standard deviation $D$. The white gaussian noise $\xi_\theta(t)$ is independently and identically distributed, satisfying $\langle\xi_\theta(t)\rangle = 0$ and $\langle\xi_\theta(t)\,\xi_{\theta'}(t')\rangle = \delta_{\theta\theta'}\delta(t - t')$. The first summation over $\theta'$ in equation 2.2 is a sum of the synaptic currents from neurons $\theta'$ with dimensionless weight $W(\theta - \theta')$, and the second summation over $k$ is a sum over the spikes arriving at times $t = t_{\theta',k}^{l-1}$, where $t_{\theta',k}^{l-1}$ is the $k$th spike time of neuron $\theta'$ in layer $l - 1$. Excitatory postsynaptic current (EPSC) or inhibitory postsynaptic current (IPSC) time courses are described by $\alpha(t)$, where $\alpha(t) = \beta\alpha^2 t\,e^{-\alpha t}H(t)$ and $\beta$ is chosen such that a single excitatory postsynaptic potential (EPSP) generates a 0.14 mV depolarization from the resting potential. Here, $H(t)$ is the Heaviside step function. Note that the synaptic efficacy will be normalized by the system size $N$ in equation 2.3. The convolution of the function $\alpha(t)$ with the spikes gives the current from the presynaptic neurons. The membrane potential dynamics follows the spike-and-reset rule: when $v_\theta^l$ reaches the threshold $V_{th}$, a spike is fired, and $v_\theta^l$ is reset to the resting potential $V_{rest}$.
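As a concrete illustration of equations 2.1 and 2.2, the following minimal sketch integrates the layer dynamics with the Euler-Maruyama method and applies the spike-and-reset rule. It is our own reconstruction in Python, not the authors' code (which, as noted in section 3, was written in Matlab); the names alpha_kernel and step_layer and the choice of time step are our assumptions.

```python
import numpy as np

# Parameters from section 2 (pF, MOhm, mV, pA, ms).
C, R = 100.0, 100.0
V_TH, V_REST = 15.0, 0.0
MU, D = 0.075, 100.0
ALPHA, BETA = 2.0, 1.7e-4

def alpha_kernel(t):
    """EPSC time course alpha(t) = beta * alpha^2 * t * exp(-alpha t) * H(t)."""
    return np.where(t > 0, BETA * ALPHA**2 * t * np.exp(-ALPHA * t), 0.0)

def step_layer(v, I_syn, dt, rng):
    """One Euler-Maruyama step of equation 2.1 plus the spike-and-reset rule.
    v: membrane potentials of the neurons in one layer; I_syn: the summed
    synaptic current of equation 2.2 evaluated at this time step."""
    noise = D * np.sqrt(dt) * rng.standard_normal(v.shape)
    v = v + (dt * (-v / R + I_syn + MU) + noise) / C
    spiked = v >= V_TH
    v[spiked] = V_REST  # reset to the resting potential after a spike
    return v, spiked

# Example: one 0.01 ms step for a silent layer of 10^4 neurons.
rng = np.random.default_rng(0)
v, spiked = step_layer(np.zeros(10**4), np.zeros(10**4), 0.01, rng)
```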
Figure 1: Network architecture. Each layer consists of N units of leaky integrate-and-fire (LIF) neurons, which are arranged in a circle. Each neuron projects its axons to the postsynaptic layer with Mexican-hat connectivity. The actual dynamics of one layer, which corresponds to the collective dynamics of LIF neurons, is calculated by using the FPE.
Mexican-hat type synaptic efficacy is described with a uniform term $W_0$ and a spatial modulation term $W_1\cos(\theta)$ (Ben-Yishai et al., 1995),

$$W(\theta - \theta') = \frac{W_0}{N} + \frac{W_1}{N}\cos(\theta - \theta'). \tag{2.3}$$
Note that this connection is only feedforward; there is no recurrent interaction within one layer. The whole network is a structured feedforward network composed of identical LIF neurons, aligned in one-dimensional ring layers (see Figure 1). Throughout this letter, the parameter values are fixed as follows: $C = 100$ pF, $R = 100$ M$\Omega$, $V_{th} = 15$ mV, $V_{rest} = 0$ mV, $D = 100$, $\mu = 0.075$ pA, $\alpha = 2$ ms$^{-1}$, and $\beta = 1.7 \times 10^{-4}$. For the LIF neuron simulations, $N = 10^4$.

3 Fokker-Planck Equation and Macroscopic Variables

The probability distribution of the membrane potential evolving in time according to the dynamics of equation 2.1 can be described by the following Fokker-Planck equation (FPE) (Risken, 1996),

$$\partial_t P_\theta^l(v, t) = \partial_v\left[\frac{v - R\left(I_\theta^l(t) + \mu\right)}{\tau} + \frac{1}{2}\left(\frac{D}{C}\right)^2\partial_v\right]P_\theta^l(v, t), \tag{3.1}$$
where $\tau = RC$, $\partial_t = \partial/\partial t$, and $\partial_v = \partial/\partial v$. The probability distribution $P_\theta^l(v, t)$ indicates the probability that the membrane potential of neuron $\theta$ in layer $l$ is $v_\theta^l = v$ at time $t$. Since we regard $P_\theta^l(v, t)$ as a probability, the normalization condition is

$$\int_{-\infty}^{V_{th}} dv\, P_\theta^l(v, t) = 1. \tag{3.2}$$
The resetting mechanism of the LIF neuron requires an absorbing boundary condition at the threshold potential $V_{th}$ and a current source at the resting potential $V_{rest}$. The boundary conditions of the partial differential equation are

$$P_\theta^l(V_{th}, t) = 0, \tag{3.3}$$

$$r^l(\theta, t) \equiv \frac{1}{2}\left(\frac{D}{C}\right)^2\left[-\partial_v P_\theta^l(V_{rest}^+, t) + \partial_v P_\theta^l(V_{rest}^-, t)\right] \tag{3.4}$$

$$= -\frac{1}{2}\left(\frac{D}{C}\right)^2 \partial_v P_\theta^l(V_{th}, t). \tag{3.5}$$
Here, $r^l(\theta, t)$ is the instantaneous firing rate of neuron $\theta$ in layer $l$ at time $t$. The initial condition for the membrane potential distribution is the stationary distribution $P_{st}(v)$ for no external input, $I_\theta(t) = 0$. Details of the derivation of $P_{st}(v)$ are given in appendix A. Given the initial condition $P_\theta(v, 0) = P_{st}(v)$ and the boundary conditions in equations 3.3 to 3.5, the dynamics of the membrane potential distribution is calculated by numerical integration of equation 3.1, and the firing rate is obtained through the boundary condition (see equation 3.5). The firing-rate dynamics of one layer depends on the firing rate of the presynaptic layer. Therefore, we start our calculation from the $l = 1$ to the $L$th layer in a sequential manner. The input current $I_\theta^l(t)$ is now a function of the firing rates of neurons in the $(l - 1)$th layer:

$$I_\theta^l(t) = \int_{-\pi}^{\pi}\frac{d\theta'}{2\pi}\, W(\theta - \theta')\int_0^\infty d\tau\,\alpha(\tau)\, r^{l-1}(\theta', t - \tau). \tag{3.6}$$

Note that equation 3.6 for the FPE is the counterpart of equation 2.2 for the LIF model. Let us further transform equation 3.6 to introduce some useful macroscopic variables of the network. By introducing equation 2.3 into equation 3.6 and exchanging the two integrals, we get

$$I_\theta^l(t) = \int_0^\infty d\tau\,\alpha(\tau)\left[W_0\int_{-\pi}^{\pi}\frac{d\theta'}{2\pi}\, r^{l-1}(\theta', t - \tau) + W_1\int_{-\pi}^{\pi}\frac{d\theta'}{2\pi}\, r^{l-1}(\theta', t - \tau)\bigl(\cos(\theta)\cos(\theta') + \sin(\theta)\sin(\theta')\bigr)\right] \tag{3.7}$$
$$= \int_0^\infty d\tau\,\alpha(\tau)\left[W_0\, r_0^{l-1}(t - \tau) + W_1\left(r_c^{l-1}(t - \tau)\cos(\theta) + r_s^{l-1}(t - \tau)\sin(\theta)\right)\right], \tag{3.8}$$
where

$$r_0^l(t) = \int_{-\pi}^{\pi}\frac{d\theta}{2\pi}\, r^l(\theta, t), \tag{3.9}$$

$$r_c^l(t) = \int_{-\pi}^{\pi}\frac{d\theta}{2\pi}\, r^l(\theta, t)\cos(\theta), \tag{3.10}$$

$$r_s^l(t) = \int_{-\pi}^{\pi}\frac{d\theta}{2\pi}\, r^l(\theta, t)\sin(\theta). \tag{3.11}$$
Here, $r_0^l(t)$ is the population firing rate, and $r_c^l(t)$ and $r_s^l(t)$ are the coefficients of the Fourier transformation of the spatial firing pattern, which represent the degree of localization around $\theta = 0$ or $\pi/2$ at time $t$, respectively. These macroscopic parameters have the dimension of a firing rate. Since the network has translational invariance, we can transform $(r_c^l(t), r_s^l(t))$ into the following variables to separate out the dependence on the position of the bump:

$$r_1^l(t) = \sqrt{\left(r_c^l(t)\right)^2 + \left(r_s^l(t)\right)^2}, \tag{3.12}$$

$$\phi^l(t) = \arctan\left(r_s^l(t)/r_c^l(t)\right). \tag{3.13}$$
From the above transformations, we obtain the bump-position-invariant index $r_1^l(t)$, which indicates the degree of localization at time $t$, and $\phi^l(t)$, which indicates the position of the bump in terms of angle. In this way, equation 3.8 is expressed as

$$I_\theta^l(t) = \int_0^\infty d\tau\,\alpha(\tau)\left[W_0\, r_0^{l-1}(t - \tau) + W_1\, r_1^{l-1}(t - \tau)\cos\left(\theta - \phi^{l-1}(t - \tau)\right)\right]. \tag{3.14}$$
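To make the reduction concrete, the following sketch (our own Python illustration; the discretization of θ and the function names are assumptions, not the authors' Matlab code) computes the macroscopic variables of equations 3.9 to 3.13 from a firing-rate profile and evaluates the input current of equation 3.14:

```python
import numpy as np

def macroscopic_vars(r, theta):
    """Population rate r0, localization r1, and bump position phi
    (eqs. 3.9-3.13) from a firing-rate profile r(theta) on [-pi, pi)."""
    dtheta = theta[1] - theta[0]
    r0 = np.sum(r) * dtheta / (2 * np.pi)
    rc = np.sum(r * np.cos(theta)) * dtheta / (2 * np.pi)
    rs = np.sum(r * np.sin(theta)) * dtheta / (2 * np.pi)
    r1 = np.hypot(rc, rs)          # eq. 3.12
    phi = np.arctan2(rs, rc)       # eq. 3.13 (robust two-argument form)
    return r0, r1, phi

def input_current(theta, t_grid, r0_hist, r1_hist, phi_hist, W0, W1, alpha_kernel):
    """Input current of eq. 3.14 at the latest time in t_grid, obtained by
    convolving the macroscopic-variable histories with the EPSC kernel."""
    dt = t_grid[1] - t_grid[0]
    tau = t_grid[-1] - t_grid          # elapsed time since each sample
    a = alpha_kernel(tau) * dt         # quadrature weights of the convolution
    uniform = W0 * np.sum(a * r0_hist)
    bump = W1 * np.sum(a * r1_hist * np.cos(theta[:, None] - phi_hist), axis=1)
    return uniform + bump
```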
We have simplified the activity of the network into a small number of variables without any approximation. Pulse packet propagation, or any other type of activity propagation, in this feedforward network can be understood as the transformation of these macroscopic variables. The actual procedure of our calculation is as follows. Given the time courses of the $(l-1)$th-layer macroscopic variables $(r_0^{l-1}(t'), r_1^{l-1}(t'), \phi^{l-1}(t'))$ in the range $t' < t$, the firing rate of each neuron at time $t$ is calculated from equation 3.5. By summing up the firing rates of the postsynaptic neurons using the definitions in equations 3.9 to 3.13, we obtain the macroscopic variables on
the postsynaptic neural layer, $(r_0^l(t), r_1^l(t), \phi^l(t))$, at time $t$. These steps are performed recursively until the last neural layer is reached. We note that the FPE in equation 3.1 is essentially equivalent to the stochastic differential equation 2.1 in the limit of a large neuron number $N$. When the network consists of neurons with inhomogeneous properties, such as different mean inputs, one FPE is not enough to capture the inhomogeneity. However, by using more than one FPE to interpolate the differences, we can still use the Fokker-Planck analysis, provided that the number of FPEs is large enough to cover the spatial frequency of the inputs. Since the input takes, at most, a unimodal shape, we found that a few Fokker-Planck equations are enough to give qualitatively similar results to the LIF simulations. Here, we divide $\theta$ space into 100 regions to guarantee quantitatively good agreement with the LIF simulation. For the actual numerical calculation of the FPE, we used a modified Chang-Cooper algorithm (Chang & Cooper, 1970), which is a fully implicit method for solving the advection-diffusion equation. To support the Fokker-Planck analysis, we performed LIF neuron network simulations using the stochastic second-order Runge-Kutta algorithm (Honeycutt, 1992). Their numerical complexities are compared in appendix C. All code was written in Matlab.

4 Results

4.1 Membrane Potential Distribution Dynamics. In this section, we show the dynamics of the membrane potential distribution of one neural layer. First, we define the input currents to neurons in the first layer. They are formulated in terms of the firing rate of a virtual presynaptic layer ($l = 0$) as follows:

$$r^0(\theta, t) = \frac{r_0^0 + r_1^0\cos(\theta)}{\sqrt{2\pi}\,\sigma^0}\exp\left(-\frac{(t - \bar{t}^{\,0})^2}{2(\sigma^0)^2}\right). \tag{4.1}$$
Throughout this letter, we use $r_0^0 = 500$, $r_1^0 = 350$ for the localized stimulation (see Figure 2a) and $r_0^0 = 900$, $r_1^0 = 0$ for the uniform stimulation case. Here, the temporal dispersion of the input spikes $\sigma^0$ and the timing of the stimulation $\bar{t}^{\,0}$ are set to $\sigma^0 = 1$ and $\bar{t}^{\,0} = 5$. Note that this localized stimulation has $\phi^0(t) = 0$ for all $t$. When a neural layer is activated by localized stimuli with $\phi^0(t) = 0$, membrane potentials around the origin are driven and reach the threshold (see Figure 2b). Higher-probability regions of $P_\theta(v, t)$ are shown in black. Probability densities of the membrane potential distributions averaged over $\theta$ at several times are indicated by the shaded regions in Figure 2c. Figure 2d shows the dynamics of the membrane potential distribution of the FPEs at positions $\theta = \{0, \pi/2, 3\pi/4\}$. The membrane potential distributions were driven to the threshold $V_{th}$ at approximately $t = 5$ ms and reappeared from
Figure 2: Overview of the dynamics of the membrane potential distributions of one layer in response to a localized stimulus. (a) Input rate in terms of $r_0(t)$ and $r_1(t)$ generated from equation 4.1 with parameters $r_0^0 = 500$ and $r_1^0 = 350$. (b) Snapshots of the membrane potential distributions. The horizontal axis is space $\theta$, and the vertical axis is the membrane potential. The darker region has a higher density of membrane potential $P_\theta(v, t)$. (c) Probability density of the membrane potential distribution averaged over $\theta$. Numerical calculations of the FPEs are shown by the shaded regions, and simulations of $10^4$ LIF neurons are shown by solid lines to support the Fokker-Planck calculation. They are almost identical. (d) Time courses of the membrane potential of neurons at different positions $\theta = 0, \pi/2, 3\pi/4$.
the resting potential $V_{rest}$. After the stimulation, the membrane potential distribution gradually relaxed toward its stationary distribution. Numerical simulations of many identical LIF neurons are used to support the analysis by the FPE (Brunel & Hakim, 1999; Omurtag, Knight, & Sirovich, 2000; Brunel, 2000). To support our analysis, we simulated $10^4$ LIF neurons distributed over a ring layer. Their membrane potential distributions are superposed in Figure 2c, which shows good agreement between the FPE (shaded regions) and the LIF simulation (solid lines) results.

4.2 Dynamics of Macroscopic Variables. In section 3, we saw that some averaged quantities can help us reduce the dimension of the output. Although the macroscopic variables defined in equations 3.9 to 3.13 fully describe the output of one layer, they are still functions of time $t$. Therefore, let us consider a method of reducing the complexity in the temporal direction in order to illustrate the network activity. We first show typical time courses of $r_0^1(t)$ and $r_1^1(t)$ in response to the local stimulation in Figure 3a, for both the numerical simulation of LIF neurons and the FPE. Here, the firing rate of the LIF neurons at each time was calculated from the number of spikes within 0.5 ms sliding windows. For a quantitative evaluation of the evolution of the spike packet shape, the gaussian-approximation method provides a powerful tool for capturing the temporal profile of a spike volley (Diesmann et al., 1999). Diesmann et al. characterized a spatially uniform spike volley by two parameters of the gaussian function: its area and its variance. These correspond to the number of spikes and the temporal dispersion of the spikes, respectively. We follow this approach to characterize the activity of a localized spike volley. The effect of a spike volley on the next layer can be fully described by the time courses of the population firing rate $r_0(t)$, the localization parameter $r_1(t)$, and the position $\phi(t)$. Without loss of generality, we can neglect the position parameter $\phi^l(t)$ as long as $\phi^l(t)$ is constant, because of the translational invariance of the network. We therefore characterize a spike volley by approximating $r_0(t)$ and $r_1(t)$ with two gaussian functions (see Figure 3b). We first approximate $r_0(t) - \nu_0$ with a gaussian function and derive three parameters—$r_0$, $\sigma$, and $\bar{t}$—which correspond to the area, standard deviation, and peak time of the gaussian. Here, $\nu_0$ is the spontaneous firing rate (see equation A.2). Since the Mexican-hat type interaction adds the new dimension $r_1(t)$, we then derive another parameter, $r_1$, from the area of the gaussian approximating $r_1(t)$. The practical advantage of using the FPE is that the firing rate at any moment can be obtained as a smooth and continuous time series. This is useful for approximating a spike volley with a gaussian, especially when the firing rate is very low. In contrast to the FPE, a simulation with LIF neurons gives the firing rate through statistical sampling of spiking events. In the low-firing-rate regime, statistical sampling of spikes gives
Figure 3: (a) Response of a neural layer driven by a localized stimulation, described by the macroscopic variables $r_0(t)$ and $r_1(t)$. The solid line, $r_0(t)$, and the dashed line, $r_1(t)$, were calculated from equation 3.1. LIF neuron simulations are shown again to confirm our analysis: the population firing rate $r_0(t)$ and degree of localization $r_1(t)$ are plotted as squares and triangles, respectively. (b) Four parameters that characterize the temporal profile of a spike volley: $r_0$, $r_1$, $\sigma$, and $\bar{t}$. They were derived from the gaussian approximation of the temporal profiles of $r_0(t)$ and $r_1(t)$. Three of them, $r_0$, $\sigma$, and $\bar{t}$, are derived from the gaussian approximating $r_0(t)$, as its area, standard deviation, and mean time. Here, $r_1$ was obtained from the area of the gaussian that approximates the temporal profile of $r_1(t)$. The parameters of the gaussian functions are obtained by minimizing the mean squared error between $r_0(t)$ or $r_1(t)$ and the gaussian function.
relatively large fluctuations, which are not suitable for estimating the gaussian parameters.

4.3 Propagation of Pulse Packets. So far, we have considered the activities of one layer to prepare for the analysis of the multilayer case. To understand the stability of pulse packet propagation, we numerically calculated the FPE for 20 layers and checked the convergence of the parameters $r_0$, $r_1$, and $\sigma$. In the spontaneous firing state, the firing rate satisfies $r(\theta, t) = \nu_0$. This implies $r_0(t) - \nu_0 = 0$ and therefore $r_0 = 0$. Since the firing profile is spatially uniform, the spontaneous firing state is described as $r_0 = r_1 = 0$ in terms of the estimated gaussian parameters. On the other hand, positive $r_0$ values after convergence indicate that a pulse packet is stable. When the activity
[Figure 4 panels: firing rate (gray scale) as a function of layer (1-8) and time t [ms], for uniform and local stimulation, in the uniform phase (W0, W1) = (1, 0.6), the nonfiring phase (W0, W1) = (0.7, 1), the localized phase (W0, W1) = (0.7, 2.5), and the multistable phase (W0, W1) = (1, 1.4).]
Figure 4: Typical activity profiles of the feedforward networks in terms of firing rate in the four phases: N, U, L, and M. The evolutions of the firing rates are illustrated with colors (see the color bar at right). Equation 4.1 was used to describe the input current to the first layer, with parameters $r_0^0 = 500$, $r_1^0 = 350$ for the localized stimulus cases and $r_0^0 = 900$, $r_1^0 = 0$ for the uniform stimulus cases. The other parameters are $\sigma^0 = 1$ and $\bar{t}^{\,0} = 2$.
profile is spatially uniform, r1 (t) vanishes. Therefore, a spatially uniform spike volley is defined as r0 > 0 and r1 = 0. If the localized pulse packet is stable, r0 > 0 and r1 > 0. To summarize, we classified the network state into three states: spontaneous firing (r0 = r1 = 0), uniform pulse packet mode (r0 > 0, r1 = 0), and localized pulse packet mode (r0 > 0, r1 > 0). This classification is consistent with the analysis of the network with feedforward Mexican-hat type connections composed of binary neurons (Hamaguchi, Okada, Yamana et al., 2005). In Figure 4, typical examples of network dynamics in response to the local or uniform stimulus to the first layer are exhibited in terms of firing rate. The network has four phases depending on parameters (W0 , W1 ). When the values of both W0 and W1 are small, no pulse packet can propagate, and the
spontaneous firing state is the only stable state: the nonfiring phase (N). When the uniform excitation term $W_0$ is sufficiently strong, a uniform pulse packet is stable in addition to the spontaneous firing: the uniform phase (U). When the Mexican-hat amplitude $W_1$ is strong enough, a localized pulse packet is stable in addition to the spontaneous firing: the localized phase (L). When $W_0$ and $W_1$ are balanced, there exist multistable states where both the uniform and localized pulse packets are stable, depending on the initial input: the multistable phase (M). Note that spontaneous firing is stable in every phase. We note that the localized and uniform pulse packets are the only possible propagating modes. This is because the inputs are described with a spatially constant term and a $\cos(\theta)$ function, as in equation 3.14, so the possible inputs to a layer are spatially unimodal or flat, no matter how spatially complex the previous layer's activity is. Therefore, the uniform and the localized bump states are the only possible states. If we used higher-frequency cosine functions in the weight function $W(\theta)$, it would be possible to create propagation modes with more than one localized bump. If we went beyond the pulse packet mode by increasing the connection strength even more, there would be other states, such as a bursting-instability mode, where the firing rate increases without bound as the activity propagates through the layers. In this letter, we avoid such unrealistic cases. The evolution of a pulse packet and its convergence to an attractor can be illustrated in flow diagrams (Diesmann et al., 1999; Câteau & Fukai, 2001). Figures 5a to 5c show the evolution of pulse packets in the $(r_0, \sigma, r_1)$ space. One line segment with a small arrowhead indicates the evolution of
Figure 5: Flow diagrams for the U, L, and M phases. (a-c) These panels depict the flow diagram in the three-dimensional space $(r_0, \sigma, r_1)$. Each line with a small arrowhead indicates the evolution of a spike volley per layer, and one sequence of line segments represents the evolution of a pulse packet. The colors of the arrowheads and lines represent the final stable states of the pulse packets: spontaneous firing state (blue), uniform pulse packet mode (green), or localized pulse packet mode (red). Insets indicate the connection profiles $W(\theta)$, and the weight highlighted with color is the one used in the flow diagram. The blue rectangular frames in a to c correspond to d to f. The position of each frame was chosen to include at least one attractor of the pulse packet mode. Large arrowheads in d to f show the direction of the spike packet evolution. The choice of colors is the same as in a to c. (a) Flow diagram in the uniform phase (U) with parameters $(W_0, W_1) = (1, 0.6)$. The bottom panel, d, is the $r_1 = 0$ plane. (b) Flow diagram in the localized phase (L) with parameters $(W_0, W_1) = (0.7, 2.5)$. The bottom panel, e, is the $\sigma = 0.37$ plane. (c) Flow diagram in the multistable phase (M) with parameters $(W_0, W_1) = (1, 1.5)$. The bottom panel, f, includes two attractors: the uniform and localized pulse packet modes. The broken lines indicate the schematic boundary of the basins of attraction.
a spike volley from one layer to the next, and the whole sequence of line segments represents the evolution of a pulse packet. The colors of the arrows represent the stable states of the pulse packets after convergence: spontaneous firing state (blue), uniform pulse packet mode (green), or localized pulse packet mode (red). Insets indicate the connection profiles $W(\theta)$, and the weight highlighted with color is the one used in the flow diagram. The blue rectangular frames in Figures 5a to 5c correspond to Figures 5d to 5f, which show the details of the flow. The position of each frame was chosen to include at least one attractor of the pulse packet mode. Depending on the connection profile, the attracting point where several lines converge changes. The U phase with parameters $(W_0, W_1) = (1, 0.6)$, described in Figure 5a, shows that the pulse packets converge on the $r_1 = 0$ plane. This indicates that the pulse packet becomes spatially uniform even when the initial state is localized. The green lines converge to a point of high firing rate $r_0$ and low spike timing dispersion $\sigma$ (uniform pulse packet mode). The blue arrows converge to low firing rate $r_0$ and high spike timing dispersion $\sigma$ (failure of pulse propagation). Figure 5d shows the flow diagram on the $r_1 = 0$ plane. Large arrowheads show the direction of the spike packet evolution. This is qualitatively equivalent to that of the conventional synfire chain model (Diesmann et al., 1999). When $W_1$ was increased, the L phase appeared. The flow diagram of the L phase with parameters $(W_0, W_1) = (0.7, 2.5)$ illustrates the existence of an attracting point in the nonzero $r_1$ region (see Figure 5b). Several red lines converge to high firing rate $r_0$ and high localization parameter $r_1$ with low temporal dispersion $\sigma$ (localized pulse packet mode), indicating the stable propagation of localized pulse packets. Figure 5e shows a flow diagram on the plane $\sigma = 0.37$. This plane contains the attracting point of the localized pulse packet. In the M phase, depending on the initial stimulus, pulse packets converge to either the uniform or the localized pulse packet mode. In Figure 5c, green and red lines converge to the uniform and localized pulse packet attractors, respectively. Figure 5f includes both of these attractors. The broken lines in Figures 5d to 5f indicate the schematic boundary of the basins of attraction. The converging dynamics toward the attractor of a synchronized spike volley are similar in the U, L, and M phases. If a spike volley is within the basin of attraction of a pulse packet mode, spike volleys are rapidly shaped into a stable spike packet of high firing rate with submillisecond dispersion. Otherwise, spike packets gradually die out ($r_0 \to 0$ and large $\sigma$). Near the boundary, a nonmonotonic evolution of spike packets is commonly observed; weakly activated spike volleys, which have a submillisecond dispersion of spikes but a relatively small number of spikes, evolve with increasing spike jitter (increasing $\sigma$) in the initial stages. If the initial number of spikes exceeds a certain threshold, the network evolves to a pulse packet mode, and the spike jitter is reduced again ($\sigma$ decreases). Otherwise, the pulse packet dies out, and $\sigma$ continues to increase. This
Figure 6: Phase diagram of the system in $(W_0, W_1)$ space. The parameter set is as in Figure 4. In the localized phase (L) and the uniform phase (U), the low-rate spontaneous firing state is also stable, as it is in the nonfiring phase (N). When the values of $W_0$ or $W_1$ are too high, they lead to the bursting phase (B).
nonmonotonic phenomenon is illustrated by the curved red or green lines starting near the boundary of the spontaneous state. It is observed in all the development processes of spike volleys in the U, L, and M phases. The phase diagram in Figure 6 shows the stability of the U phase and the L phase in $(W_0, W_1)$ space with the same parameter set. As we have seen, larger $W_0$ leads to the U phase, and larger $W_1$ leads to the L phase. In between them, there is a multistable region M. Note that the nonfiring state (N) is stable within these regions. If we increase the connection strength beyond a certain threshold, the system becomes unstable, and the firing rate of the pulse packet grows without bound as it propagates to deeper layers. We refer to this state as the bursting state (B).

4.4 Propagation Speed of Pulse Packets. The timescale of an RSP is strongly related to the propagation speed of a pulse packet. In this section, our interest resides in the propagation speed under multistability of pulse packets. The time required for a pulse to propagate from the $(l-1)$th layer to the $l$th layer is defined as $\Delta\bar{t}^{\,l} = \bar{t}^{\,l} - \bar{t}^{\,l-1}$. The mean arrival time $\bar{t}^{\,l}$ is obtained from the gaussian function estimating $r_0^l(t)$, as illustrated in Figure 3b. We let $\Delta\bar{t}$ denote the characteristic propagation time of a stable pulse packet, which is obtained after the pulse packets converge to their stable states. The characteristic propagation time of stable pulse packets is shown in Figure 7 for various values of the parameter $W_1$. The characteristic propagation time $\Delta\bar{t}$ depends on both $(W_0, W_1)$ and $(r_0(t), r_1(t))$, but in the M phase, the effect of the activity pattern $(r_0(t), r_1(t))$ on the speed can be studied directly by comparing the values of $\Delta\bar{t}$. In Figure 7, circles represent $\Delta\bar{t}$ for the uniform pulse packet case, and triangles represent those of localized pulse packets. Increasing $W_1$ reduces the propagation time $\Delta\bar{t}$
Figure 7: Plot of the characteristic propagation time $\Delta\bar{t}$ in the M phase, where $\Delta\bar{t} = \bar{t}^{\,l} - \bar{t}^{\,l-1}$ after convergence. Localized pulse packets propagate more slowly than uniform ones. Since the $W_0$ parameter was fixed here, $\Delta\bar{t}$ of the uniform pulse packets was constant.
of the localized pulse packets. In contrast, the uniform pulse packet does not depend on $W_1$, because uniform activity has vanishing $r_1(t)$; thus, the $W_1$ term in equation 3.14 can be neglected. In this letter, we used LIF neuron models without refractoriness for simplicity, and the pulse packet propagation phenomena are studied within the parameter region without rate instability, where the firing rate of a pulse packet does not grow to infinity as it propagates through the network. Under this condition, the propagation speeds of localized pulse packets are slower than those of uniform ones. We note that if we introduced a long refractoriness after firing and set the value of the parameter $W_1$ much larger than that of $W_0$, the speeds of localized pulse packets could be faster than those of uniform ones.

4.5 Repeated Spike Patterns Generated by a Multistable Synfire Chain. In the previous section, we showed that a pulse packet has its own characteristic propagation speed. Here we consider the RSPs generated from different pulse packets. We simulated multielectrode recordings from randomly chosen neurons in the 20 layers of a multistable feedforward network, as shown in Figure 8a. Parameters were set to $(W_0, W_1) = (1, 1.5)$. The simulated multielectrode recordings were performed as follows. First, 10 recording sites $\{\theta, l\}$ were randomly determined. Then we performed five trials of the LIF neuron simulations for each of the uniform and localized pulse packet cases. In Figure 8b, the spike events are plotted with squares for the uniform pulse packet cases and triangles for the localized pulse packet cases. The vertical position of the squares and triangles corresponds to the trial number. To show the overall shape of the probability of observing a spike, one trial of the FPE calculation was performed for each pulse packet shape. The probability of detecting a spike at each recording
Figure 8: In silico experiments of multielectrode array recordings in a multistable synfire chain. (a) Placement of electrodes in the multistable feedforward network, with the same parameters as in Figure 5c. (b) Spike raster plot from 10 electrodes. The responses of the network were measured in five trials with uniform stimulation to the first layer and five trials with localized stimulation. The spikes in the uniform pulse packet mode (squares) are plotted in the upper halves of the raster plots for each electrode, and those of the localized pulse packet mode (triangles) are plotted in the lower halves. The firing rate $r(\theta, t)$ is also shown to support the LIF simulations, with the scale given by the right y-axes. Parameters were set to $(W_0, W_1) = (1, 1.5)$.
site, $r(\theta, t)\Delta t$, is obtained from the firing rate of the FPE at the electrode position $\theta$. The solid and dashed lines represent the probability of observing spikes for the uniform and localized pulse packet propagating cases, respectively. An RSP is defined as a specific combination of interspike intervals (ISIs), $\{\tau_{ij}^{ISI}, \tau_{jk}^{ISI}\}$, which indicates that a spike from electrode unit $i$ is followed by a spike from unit $j$ after exactly $\tau_{ij}^{ISI}$ ms, followed by another spike from unit $k$ after exactly $\tau_{jk}^{ISI}$ ms. The notation $\{\tau_{ij}^{ISI}, \tau_{jk}^{ISI}\} \pm \tau_c$ allows events that occurred within a time window of $\pm\tau_c$ ms around the exact ISI. We can define a longer combination of ISIs as $\{\tau_{ij}^{ISI}, \tau_{jk}^{ISI}, \ldots\} \pm \tau_c$. We apply a suffix to each RSP to indicate which pulse packet the RSP is generated from. Hereafter, we refer to $\{\tau_{ij}^{ISI(U)}, \tau_{jk}^{ISI(U)}, \ldots\}$ as the RSP generated from the uniform pulse packets and $\{\tau_{ij}^{ISI(L)}, \tau_{jk}^{ISI(L)}, \ldots\}$ as the RSP generated from the localized pulse packets. Note that $\{\tau_{ij}^{ISI}, \tau_{jk}^{ISI}\}$ itself can be used to describe any ISIs. An RSP is a special combination of ISIs that occurs more frequently than a certain threshold determined from a reasonable assumption, such as stationary Poisson firing. Therefore, an ISI combination with high statistical significance is called a repeated spike pattern. RSPs are often searched for by counting the number of events $\{\tau_{ij}^{ISI}, \tau_{jk}^{ISI}, \ldots\} \pm \tau_c$ throughout the recording data. However, since our simulations had a low spontaneous firing rate, the RSPs were easily recognizable in the spike rasters, as shown in Figure 8b. Details of our definition of RSPs are given in appendix B. In Figure 8b, the RSPs generated by a uniform pulse packet (squares) were detected in all the randomly inserted electrodes, but the RSPs generated by a localized pulse packet (triangles) did not appear in all the electrodes. The initial stimulation position was set to $\phi^0 = 0$, so only the electrodes around $\theta \sim 0$ could detect the RSP. Spike patterns aligned to a specific spike time of one electrode are commonly used to show RSPs (Abeles et al., 1993; Prut et al., 1998). Here, we realigned the spike trains according to the first spike of the first electrode, as plotted in Figure 9a. For comparison, we chose electrodes that detected RSPs generated from both uniform and localized pulse packets. We observed different RSPs depending on the stability of the pulse packet. Each RSP itself retained millisecond-order fidelity, but the two RSPs were clearly separated depending on the propagating pattern. The final goal of this letter is to find the relationship between these different RSPs generated in the same network. If we plot $(\tau_{ij}^{ISI(U)}, \tau_{ij}^{ISI(L)})$ for several $\{ij\}$ pairs, we can find a specific relationship between the two RSPs. For fixed $i$ ($= 1$) and varied $j = \{1, 2, 5, 6, 7, 9, 10\}$, we plotted $(\tau_{1j}^{ISI(U)}, \tau_{1j}^{ISI(L)})$ in Figure 9b. It is clear that they are aligned on one line with slope $\Delta\bar{t}_L/\Delta\bar{t}_U$, which satisfies

$$\tau_{ij}^{ISI(L)} = \frac{\Delta\bar{t}_L}{\Delta\bar{t}_U}\,\tau_{ij}^{ISI(U)}. \tag{4.2}$$
Figure 9: (a) Spike raster aligned to the spikes of the first electrode unit. It corresponds to $\tau_{1j}^{ISI}$. (b) Plot of $(\tau_{1j}^{ISI(U)}, \tau_{1j}^{ISI(L)})$ (crosses), representing the ratio of ISI lengths. The ISIs were measured between the first electrode and the others, $j = \{1, 2, 5, 6, 7, 9, 10\}$, as illustrated in Figure 8a. The dotted line represents equation 4.2, whose slope $\Delta\bar{t}_L/\Delta\bar{t}_U = 1.17/0.81$ was obtained from the result in Figure 7. Parameters were set to $(W_0, W_1) = (1, 1.5)$.
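The rescaling relation of equation 4.2 can be checked directly from measured ISI pairs by fitting a line through the origin. The following sketch (our own construction with synthetic numbers, not data from the simulations) illustrates the procedure:

```python
import numpy as np

def timescale_ratio(isi_uniform, isi_local):
    """Least-squares slope through the origin for the pairs
    (tau_ij^ISI(U), tau_ij^ISI(L)), cf. equation 4.2."""
    u = np.asarray(isi_uniform)
    l = np.asarray(isi_local)
    return np.sum(u * l) / np.sum(u * u)

# Synthetic check: localized ISIs are uniform ISIs stretched by 1.17/0.81.
u = np.array([2.0, 4.5, 7.1, 10.3, 13.8])
l = (1.17 / 0.81) * u
print(timescale_ratio(u, l))  # ~1.44, the ratio of propagation times
```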
The slope approximately equals the ratio of the propagation times $\Delta\bar{t}$ of the uniform and localized spike packets. These results indicate that the ratios of the ISIs are constant over the different RSPs, and that the ratio is determined by the ratio of the propagation speeds of the pulse packets. Since the order of the spikes does not change, different RSPs can be transformed into each other by an expansion or contraction of the timescale with a certain ratio. Therefore, in a multistable feedforward network with a stable synfire chain state, RSPs can have variable timescales, but the RSPs are strongly connected through the timescale expansion or contraction operation.

5 Conclusion

In spite of the importance of the propagation speed of neural activity in discussions of repeated spike patterns (RSPs), the relationships among the speed, the network structure, and the spatiotemporal pattern of the pulse packet had not been well studied. We used the Fokker-Planck equation to study the dynamics of a feedforward network with Mexican-hat connectivity. The network has a spatial structure in its connections, but by using macroscopic variables to describe the network activity, we can summarize the output of one neural layer by the time courses of three macroscopic variables: the population firing rate, the localization parameter, and the position of the activity. The Fokker-Planck analysis allowed us to numerically calculate the deterministic dynamics of the macroscopic variables and their stability. We found that there are four phases in the $W_0$-$W_1$ space: the nonfiring phase (N), the localized phase (L), the uniform phase (U), and the multistable phase (M). In the M phase, two types of pulse packets (the uniform spike packet and the localized
one) can propagate through the neural layers. These two pulse packets have their own characteristic propagation speeds, which indicates that the speed of information processing depends on the spiking pattern, or the representation of the stored information. When we observe the pulse packets' propagation through a multielectrode array, the activity is observed as RSPs. Different pulse packets generate different RSPs because of the difference in propagation speed, but the different RSPs can be mapped onto the same template pattern through the timescale expansion or contraction operation.
Appendix A: Stationary Membrane Potential Distribution and Spontaneous Firing Rate

The initial condition for the membrane potential distribution is the stationary distribution for no external input, $I_\theta(t) = 0$. The stationary distribution under the dynamics of equation 3.1 and the boundary conditions (threshold potential $V_{th}$ and reset potential $V_{rest}$) is obtained as follows:

$$P_{st}(v) = e^{-U(v)}\int_v^{V_{th}} du\,\frac{2\nu_0 C^2}{D^2}\, H(u - V_{rest})\, e^{U(u)}, \tag{A.1}$$
where $U(v) = C(v - R\mu)^2/(D^2 R)$ and $\nu_0$ is the spontaneous firing rate for the $I_\theta(t) = 0$ case. From the normalization condition (see equation 3.2), $\nu_0$ is obtained as

$$(\nu_0)^{-1} = \tau\sqrt{\pi}\int_{\frac{\sqrt{C}}{D\sqrt{R}}(V_{rest} - R\mu)}^{\frac{\sqrt{C}}{D\sqrt{R}}(V_{th} - R\mu)} dy\,\exp(y^2)\bigl(1 + \operatorname{erf}(y)\bigr). \tag{A.2}$$

Here, $\operatorname{erf}(y) = \frac{2}{\sqrt{\pi}}\int_0^y dx\,\exp(-x^2)$.
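Equation A.2 is readily evaluated by quadrature. The sketch below is our own; it assumes the integration limits $\sqrt{C}(V - R\mu)/(D\sqrt{R})$ as reconstructed above and a consistent choice of units (e.g., pF, MOhm, mV, pA, ms):

```python
import numpy as np
from math import erf, pi, sqrt

def nu0(C=100.0, R=100.0, D=100.0, mu=0.075, V_th=15.0, V_rest=0.0, n=20001):
    """Spontaneous firing rate from equation A.2, via the trapezoidal rule."""
    tau = R * C
    a = sqrt(C) * (V_rest - R * mu) / (D * sqrt(R))
    b = sqrt(C) * (V_th - R * mu) / (D * sqrt(R))
    y = np.linspace(a, b, n)
    integrand = np.exp(y**2) * (1.0 + np.array([erf(s) for s in y]))
    # Trapezoidal quadrature of the integral in eq. A.2.
    integral = float(np.sum((integrand[1:] + integrand[:-1]) * np.diff(y)) / 2)
    return 1.0 / (tau * sqrt(pi) * integral)
```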
Appendix B: Estimation of an RSP

To estimate $\tau_{ij}^{ISI}$, we calculated the probability of observing a spike at position $\theta$ from the FPE, as shown in Figure 8b. Then we switched to the spike data from the LIF simulations and collected the spikes that fall within a small time window $\tau_c = 1$ ms around the peak of the probability of observing spikes, $r(\theta, t)$, calculated by the FPE. This process removes uncorrelated spikes from the ISI estimation. Finally, we took the mean of the ISIs from these spike sets as the estimated $\tau_{ij}^{ISI}$.
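A minimal sketch of this procedure (names and data layout are our assumptions): spikes falling within $\pm\tau_c$ of the FPE-predicted rate peak at each electrode are kept, and the mean interval between the retained spike sets is the estimate of $\tau_{ij}^{ISI}$.

```python
import numpy as np

def estimate_isi(spikes_i, spikes_j, peak_i, peak_j, tau_c=1.0):
    """Estimated tau_ij^ISI between electrodes i and j (times in ms).
    Keep only spikes within +/- tau_c of the FPE rate peak at each
    electrode, removing uncorrelated spontaneous spikes, then take the
    mean interval between the two retained spike sets."""
    s_i = np.array([t for t in spikes_i if abs(t - peak_i) <= tau_c])
    s_j = np.array([t for t in spikes_j if abs(t - peak_j) <= tau_c])
    if s_i.size == 0 or s_j.size == 0:
        return None  # no correlated spikes detected at one electrode
    return float(np.mean(s_j) - np.mean(s_i))
```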
Appendix C: Numerical Complexity

Here we compare the numerical complexity of calculating the LIF and FPE models. In one layer, the numbers of LIF neurons and of FPEs are $N$ and $\theta_M$, respectively. The membrane potential in an FPE is discretized into $M$ bins. The LIF neuron dynamics in equation 2.1 is calculated using the second-order stochastic Runge-Kutta algorithm reported in Honeycutt (1992):

$$v(t + \Delta t) = v(t) + \frac{\Delta t}{2}(F_1 + F_2) + \frac{D}{C}\sqrt{\Delta t}\,\xi, \tag{C.1}$$

$$F_1 = f(v(t)), \tag{C.2}$$

$$F_2 = f\left(v(t) + \Delta t\, F_1 + \frac{D}{C}\sqrt{\Delta t}\,\xi\right), \tag{C.3}$$

$$f(v) = \left(-\frac{v}{R} + I(t) + \mu\right)\Big/\,C. \tag{C.4}$$
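Transcribed directly, equations C.1 to C.4 give the following update (a Python sketch of the scheme; the authors' implementation was in Matlab):

```python
import numpy as np

def srk2_step(v, I_t, dt, rng, C=100.0, R=100.0, mu=0.075, D=100.0):
    """One step of the stochastic second-order Runge-Kutta scheme of
    equations C.1-C.4 (Honeycutt, 1992) for the membrane potential v."""
    f = lambda u: (-u / R + I_t + mu) / C          # equation C.4
    noise = (D / C) * np.sqrt(dt) * rng.standard_normal(np.shape(v))
    F1 = f(v)                                      # equation C.2
    F2 = f(v + dt * F1 + noise)                    # equation C.3
    return v + 0.5 * dt * (F1 + F2) + noise        # equation C.1
```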
This process requires $13N$ operations. To calculate $I(t)$, the $\alpha$-function-convolved macroscopic variables $r_0^\alpha(t)$, $r_c^\alpha(t)$, and $r_s^\alpha(t)$ are required, where $r_x^\alpha(t) = \int_0^\infty d\tau\,\alpha(\tau)\, r_x(t - \tau)$. Assuming that these inputs are given, equation 2.2 can be written as

$$I_\theta(t) = W_0\, r_0^\alpha(t) + W_1\left(r_c^\alpha(t)\cos(\theta) + r_s^\alpha(t)\sin(\theta)\right), \tag{C.5}$$
which requires $5N + 1$ operations. In total, a simulation of $N$ LIF neurons takes approximately $23N$ operations per time step. The Fokker-Planck calculation requires the inverse of a matrix when we use the implicit method. Given the probability of observing a membrane potential $v$ at time $t$ as $P_t(v)$ (an $M \times 1$ column vector) and the transition probability matrix $F$, we can calculate $P_{t+\Delta t}(v)$ through the following linear matrix equation,

$$F P_{t+\Delta t}(v) = P_t(v), \tag{C.6}$$
where $F$ is a tridiagonal matrix. We can therefore solve this equation through LU decomposition and gaussian elimination (Strang, 1988), which requires only $4M - 2$ operations. When the input changes, it takes $13M$ operations to construct $F$ (for details, see Chang & Cooper, 1970). In total, therefore, one update requires $17M\theta_M$ operations. The coefficient may vary depending on the order of the arithmetic operations and the number of terms included in the equation. Here, we used $N = 10^4$ LIF neurons, $\theta_M = 100$, and $M = 800$ for the calculations. The time step sizes were the same: 0.01 ms. Therefore, the LIF simulations require $2.3 \times 10^5$ operations, and the Fokker-Planck method
requires $1.36 \times 10^6$ operations per time step. In this case, the Fokker-Planck method has a higher computational cost, but it can provide more stable results in any firing-rate regime. This computational complexity depends on the choice of the spatial discretization $M$ and the number of equations $\theta_M$ compared with the number of neurons $N$. For example, an FPE simulation of a simple homogeneous synfire chain with $\theta_M = 1$ will be much faster than the LIF simulations.

Acknowledgments

The preliminary result in this letter was reported in Hamaguchi, Okada, and Aihara (2005). This work is partially supported by the Japanese Society for the Promotion of Science, Research Fellowships for Young Scientists; the Advanced and Innovational Research Program in Life Sciences; Grants-in-Aid for Scientific Research on Priority Areas No. 17022012, No. 14084212, No. 18020007, and No. 18079003; and Scientific Research (C) No. 16500093 from the Ministry of Education, Culture, Sports, Science, and Technology of the Japanese government.

References

Abeles, M. (1991). Corticonics: Neural circuits of the cerebral cortex. Cambridge: Cambridge University Press.
Abeles, M., Bergman, H., Margalit, E., & Vaadia, E. (1993). Spatiotemporal firing patterns in the frontal cortex of behaving monkeys. J. Neurophysiol., 70, 1629–1638.
Amari, S. (1977). Dynamics of pattern formation in lateral-inhibition type neural fields. Biol. Cybern., 27, 77–87.
Ben-Yishai, R., Bar-Or, R. L., & Sompolinsky, H. (1995). Theory of orientation tuning in visual cortex. Proc. Natl. Acad. Sci. USA, 92, 3844–3848.
Blasdel, G. (1992). Orientation selectivity, preference, and continuity in monkey striate cortex. J. Neurosci., 12(8), 3139–3161.
Brunel, N. (2000). Dynamics of sparsely connected networks of excitatory and inhibitory spiking neurons. J. Comp. Neurosci., 8(3), 183–208.
Brunel, N., & Hakim, V. (1999). Fast global oscillations in networks of integrate-and-fire neurons with low firing rates. Neural Comp., 11(7), 1621–1671.
Câteau, H., & Fukai, T. (2001). Fokker-Planck approach to the pulse packet propagation in synfire chain. Neural Networks, 14, 675–685.
Chang, J. S., & Cooper, G. (1970). A practical difference scheme for Fokker-Planck equations. J. Comp. Phys., 6, 1–16.
Compte, A., Brunel, N., Goldman-Rakic, P. S., & Wang, X.-J. (2000). Synaptic mechanisms and network dynamics underlying spatial working memory in a cortical network model. Cerebral Cortex, 10, 910–923.
Cossart, R., Aronov, D., & Yuste, R. (2003). Attractor dynamics of network up states in the neocortex. Nature, 423(6937), 283–288.
Diesmann, M., Gewaltig, M.-O., & Aertsen, A. (1999). Stable propagation of synchronous spiking in cortical neural networks. Nature, 402, 529–533.
Gewaltig, M.-O., Diesmann, M., & Aertsen, A. (2001). Propagation of cortical synfire activity: Survival probability in single trials and stability in the mean. Neural Networks, 14, 657–673.
Goldman, P. S., & Nauta, W. J. H. (1977). Columnar distribution of cortico-cortical fibers in the frontal association, limbic, and motor cortex of the developing rhesus monkey. Brain Research, 122(3), 393–413.
Hamaguchi, K., Okada, M., & Aihara, K. (2005). Theory of localized synfire chain: Characteristic propagation speed of stable spike patterns. In L. K. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing systems, 17 (pp. 553–560). Cambridge, MA: MIT Press.
Hamaguchi, K., Okada, M., Yamana, M., & Aihara, K. (2005). Correlated firing in a feedforward network with Mexican-hat type connectivity. Neural Computation, 17, 1–26.
Honeycutt, R. L. (1992). Stochastic Runge-Kutta algorithms. I. White noise. Phys. Rev. A, 45, 600–604.
Hubel, D., & Wiesel, T. (1977). Functional architecture of macaque monkey visual cortex. Proc. R. Soc. Lond. B Biol. Sci., 198, 1–59.
Ikegaya, Y., Aaron, G., Cossart, R., Aronov, D., Lampl, I., Ferster, D., & Yuste, R. (2004). Synfire chains and cortical songs: Temporal modules of cortical activity. Science, 304, 559–564.
Izhikevich, E. M. (2006). Polychronization: Computation with spikes. Neural Computation, 18, 245–282.
Kistler, W. M., & Gerstner, W. (2002). Stable propagation of activity pulses in populations of spiking neurons. Neural Comp., 14, 987–997.
Mao, B., Hamzei-Sichani, F., Aronov, D., Froemke, R., & Yuste, R. (2001). Dynamics of spontaneous activity in neocortical slices. Neuron, 32, 883–898.
Mountcastle, V. (1997). The columnar organization of the neocortex. Brain, 120(4), 701–722.
Omurtag, A., Knight, B., & Sirovich, L. (2000). On the simulation of large populations of neurons. J. Comp. Neurosci., 8, 51–63.
Prut, Y., Vaadia, E., Bergman, H., Haalman, I., Slovin, H., & Abeles, M. (1998). Spatiotemporal structure of cortical activity: Properties and behavioral relevance. J. Neurophysiol., 79(6), 2857–2874.
Reyes, A. (2003). Synchrony-dependent propagation of firing rate in iteratively constructed networks in vitro. Nature Neuroscience, 6, 593–599.
Risken, H. (1996). The Fokker-Planck equation. Berlin: Springer-Verlag.
Strang, G. (1988). Linear algebra and its applications. Belmont, CA: Thomson Learning.
van Rossum, M. C. W., Turrigiano, G. G., & Nelson, S. B. (2002). Fast propagation of firing rates through layered networks of noisy neurons. J. Neurosci., 22, 1956–1966.
Wang, X.-J. (2001). Synaptic reverberation underlying mnemonic persistent activity. Trends in Neurosciences, 24(8), 455–463.
Wilson, H. R., & Cowan, J. D. (1972). Excitatory and inhibitory interactions in localized populations of model neurons. Biophysical Journal, 12, 1–24.
Received March 9, 2006; accepted September 21, 2006.
LETTER
Communicated by Takashi Nishikawa
Asymptotic Behavior and Synchronizability Characteristics of a Class of Recurrent Neural Networks Christof Cebulla [email protected] Institute for Applied Mathematics, Universität Bonn, D-53115 Bonn, Germany
We propose an approach to the analysis of the influence of the topology of a neural network on its synchronizability, in the sense of equal output activity rates, in a particular neural network model. The model we introduce is a variation of the Zhang model. We investigate the time-asymptotic behavior of the corresponding dynamical system (in particular, the conditions for the existence of an invariant compact asymptotic set) and apply the results of the synchronizability analysis to a class of random scale-free networks and to classical random networks with Poisson connectivity distribution.

Neural Computation 19, 2492–2514 (2007)
C 2007 Massachusetts Institute of Technology

1 Introduction

One of the reasons for the great interest in the analysis of neural networks is the possibility of better understanding brain functions. Besides a variety of models using sigmoidal transfer functions as the input-output characteristics (Feng & Brown, 1998; Dauce, Moynot, Pinaud, & Samuelides, 2001; Trappenberg, 2002), there exist models where the energy transfer between nodes in the network follows a threshold dynamics (see Hertz, Krogh, & Palmer, 1991; Anderson, 1995; Albeverio & Tirozzi, 1997, for an overview). This kind of threshold excitation is a better approximation of the brain dynamics on the level of spiking neuron systems (to be distinguished from rate models) (Kandel, Schwartz, & Jessel, 2000; Gerstner & Kistler, 2002). Another advantage lies in the intrinsic properties of these models, which reflect the corresponding dynamic behavior of neural signal processing, such as temporal and spatial integration and the absolute refractory period. Also, the existence of inhibitory synapses can be included in the model. Nevertheless, a systematic mathematical approach to the analysis of this class of models is still lacking (most of the approaches concern discrete models; see, e.g., Lübeck, Rajewsky, & Wolf, 2000; Bagnoli, Cecconi, Flammini, & Vespignani, 2003; Markosova & Markos, 1992; Allouche, Courbage, & Skordev, 2001). In particular, the aspect of synchronization of output activity, which is assumed to play an important role in, for example, the emergence of epileptic seizures in the human brain (Kandel et al., 2000; Lehnertz et al., 2000; Volk, 2000), until now has not been discussed sufficiently (see
Lago-Fernandez, Corbacho, & Huerta, 2001; Hopfield & Brody, 2001; Dauce et al., 2001; Golomb & Hansel, 2000; Börgers & Kopell, 2003; Burkitt & Clark, 2001; Giacometti & Diaz-Guilera, 1998; Antonelli, Dias, Golubitsky, & Wang, 2005, for existing approaches). The dependence of the synchronizability of a system (respectively, of the corresponding model) on its network structure is of particular interest in this context. In classical neural networks, the time evolution of the system is defined by

$$x(t + 1) = f\left(Wx(t) - \omega\right), \tag{1.1}$$
where $x(t)$ is the (multidimensional) state of the system at time $t \in \mathbb{N}$, $W$ is the synaptic connection matrix (weight matrix), and $\omega$ is the bias or threshold vector. In most cases, the mapping $f$ is chosen to have a sigmoidal form in every coordinate. With this choice, dissipation is introduced into the system, and one can expect that without any new stimulus, the system will evolve to a stationary state. In this work, we are especially interested in the dependence of the dynamics on the network topology (represented in equation 1.1 by the weight matrix $W$). Motivated by this, we propose another model, described algorithmically in section 2, where (in the representation corresponding to equation 1.1) the transfer function $f$ is defined by

$$f(x) := \begin{cases} 0, & x < 0 \\ x, & x \geq 0 \end{cases}$$
and the dissipation occurs only at the network boundary as an outflow of energy (which can be interpreted as a transmission of signals to other brain regions). The interest in the algorithmic model discussed in this work is twofold. First, we are interested in a (mathematical) formalization of this object as a dynamical system in discrete time. Then the asymptotic behavior of the system can be studied rigorously. Considering the system as a model of neuronal activity, one is interested in properties that may be significant for the emergence of certain phenomena of brain dynamics. In particular, the role of network topology in the occurrence of spontaneous synchronization may give new insight into the process of the genesis of epileptic seizures. This letter is organized as follows. In section 2, the model is introduced in its algorithmic form (similar models can be found in, e.g., Chen, Eu, Guo, & Yang, 1995; Zhang, 1989). In section 3 we provide a formalization of the model with explicit dependence on the network topology, represented by the weight matrix, implementing the ideas of Blanchard, Cessac, and Krüger (2000, 2001). Section 4 and the following sections include the results concerning the asymptotic behavior of the introduced model. Beginning
in section 6, the mean activity of the system is introduced and discussed, together with the notion of mean activity synchronization. The detailed discussion of the synchronizability criteria is given in sections 6.1 to 6.3.

2 The Model

The neural network model introduced in the following is a variant of the Zhang model (Zhang, 1989; Perez, Corral, & Diaz-Guilera, 1996). It belongs to the class of nonabelian models (Dhar, 1990) and includes the ideas of self-organized criticality (SOC; Jensen, 1998; Hergarten, 2002; Dickman, Muñoz, Vespignani, & Zapperi, 2000; Bak, Tang, & Wiesenfeld, 1988; Goh, Lee, Kahng, & Kim, 2003). According to some recent work on this matter (see, e.g., Beggs & Plenz, 2003), the principles of SOC may play a role in signal transmission within the brain.

Definition 1. Let $\Lambda$ denote a set of vertices $v_i$, $1 \leq i \leq |\Lambda|$. Let $T \in \mathrm{Mat}_{|\Lambda|\times|\Lambda|}$ be a matrix with the following properties:

1. $(T)_{ii} = 0\ \forall i$
2. $0 \leq (T)_{ij} \leq 1\ \forall i \neq j$
3. $\sum_i (T)_{ij} \leq 1\ \forall j$.

Then $(\Lambda, T)$ defines a graph or network (without self-linking), which in the following will be denoted for simplicity by $\Lambda$. The matrix $T$ will be called the network matrix. Let a network $(\Lambda, T)$ be given. Then the SAS (successively activated sites) model is a system evolving in discrete time $t \in \mathbb{N}_0 \equiv \{0\} \cup \mathbb{N}$, defined by the following algorithm:
Then (, T ) defines a graph or network (without self-linking), which in the following will be denoted for simplicity by . The matrix T will be called the network matrix. Let a network (,T ) be given. Then the SAS (successively activated sites) model is a system evolving in discrete time t ∈ N0 ≡ {0} ∪ N defined by the following algorithm: Definition 2. To every vertex v ∈ , there is attached a real number Xv (the state of v) with Xv ∈ [0,1]. The states of all vertices in form a state vector X ∈ [0,1]|| such that the i-th coordinate of X is the state of i ∈ , (X)i ≡ Xi . At time t = 0, one sets a starting configuration X(0) such that X(0) ∞ = 1. Then for t ∈ N, t ≥ 1, one performs the following algorithm: 1. For every vertex v such that (X(t))v ≥ 1, its energy is redistributed among its neighbors (where the neighborhood is defined by the nonzero entries of the matrix T , that is, (X(t))i → (X(t))i + (T )iv (X(t))v , i, v = 1..||, where the energy of those vertices is set to be zero: (X(t))v = 0. 2. If there are no vertices v with (X)v ≥ 1, one sets X(t + 1) = F (X(t)) where F : [0,1]|| × N → [0,1]|| is the activation mapping, which may be deterministic or random. The vertices a ∈ for which (F (X))a = 0 will be called activation vertices, and the set of all activation vertices will be denoted by A.
Note that definition 2 uses what is called timescale separation, which means that the time difference between two successive activations is much longer than the duration of an avalanche. That is, the process of an avalanche, given by the redistribution of energies (repetitions of step 1 of definition 2 until an inactive configuration is reached), takes place within one time interval $[t, t + 1]$. In other words, one assumes that the duration of an avalanche is always smaller than the time difference between two subsequent activations.

3 Formalization

By setting $M := [0, 1]^{|\Lambda|}$, one has that the model described in definition 2 can be considered as a transformation $\mathcal{T} : \partial M \to \partial M$ (where $\partial M$ denotes the boundary of $M$) defining the evolution of the vector $X(t)$ in time. Formally, one has

$$X(t + 1) \equiv \mathcal{T}X(t) \equiv X(t) + (T - \mathbf{1})Y_1(X(t)) + \alpha_F(X, t),$$

where $Y_1$ is a piecewise linear mapping given by
Y1 (X) ≡
|| ∞ j j TE X i − 1 TE X i e i ,
(3.1)
j=0 i=1
where TE X := X + (T − 1)
||
((X)i − 1)(X)i e i ,
i=1
and is the Heaviside function. The mapping α F corresponds to the acti|| vation F (X), α F (X, t) ≡ i=1 (F (X, t))i e i . The intuitive interpretation of the mapping Y1 (X) is that its ith component (Y1 (X))i corresponds to the amount of energy the ith node transfers to its neigbors between time step t and t + 1. n→∞
Proposition 1. For all lattices such that Tn X −→0 and Tn X = 0 (with Tn as the nth n-th iteration of the transformation T ) as long as X = 0, (X)i ≥ 0 ∀i, the sum on the right-hand side of equation 3.1 is finite. Proof. From the assumptions on T , there is a j ∈ N, j < ∞ such that ((TEk X)i − 1) = 0 for all k ≥ j. From this, the sum on the right-hand side of equation 3.1 is finite.
The networks satisfying the assumptions of proposition 1 are said to have an open boundary condition.

Corollary 1. Let $T$ be a network matrix satisfying the conditions of proposition 1. Then $\det(T - \mathbb{1}) \ne 0$.

Proof. The assertion is equivalent to the nonexistence of a vector $X \in \mathbb{R}^{|\Lambda|}$, $X \ne 0$, with $TX = X$. But from the assumption on $T$ in proposition 1, such a vector does not exist.

4 Asymptotic Behavior

We shall discuss the (time) asymptotic behavior of the model using the formalism of section 3. Let $\partial M_a$, $a \in \Lambda$, denote the sets defined by $(X)_a = 1$ for all $X \in \partial M_a$. The transformation $\mathcal{T}$ is piecewise continuous. Its continuity domains $\partial M_a^{k_a}$ (half-open cubes) form a disjunct partitioning of the state space, $\partial M = \bigcup_{a, k_a} \partial M_a^{k_a}$, $k_a = 1, \ldots, k_a^{\max}$. In the following, we omit the index $a$ of $k_a$ for simplicity. Without loss of generality, we consider activation mappings $F$ for which only one vertex at a time $t$ is activated, that is, for all $t \in \mathbb{N}$ there is an $a_t \in \Lambda$ with $(F(X, t))_a = 0$ for all $a \ne a_t$. In this case, one can define an activation sequence $\omega \equiv (a_1, a_2, \ldots)$ with $a_t \in \Lambda$, $t \in \mathbb{N}$, and $\mathcal{T}(X, t) = \mathcal{T}_{a_t}(X)$. The set of all activation sequences $\omega$ will be denoted by $\Omega$. With this, we use the notation $\mathcal{T}(X) \equiv \mathcal{T}(X, \omega)$, $\omega \in \Omega$, and $\mathcal{T}^m(X) \equiv \mathcal{T}_{a_m} \cdots \mathcal{T}_{a_1}(X)$.

Proposition 2. Let $\mathcal{T}$ satisfy the following condition: there is an $m \in \mathbb{N}$, $m < \infty$, and a compact subset $\partial\breve{M}_a^k \subset \partial M_a^k$ for every partition element $\partial M_a^k$ such that $\mathcal{T}^m(\partial M) = \bigcup_{a,k} \partial\breve{M}_a^k$. Then there is an asymptotic invariant set $K$ with the following properties:
1. $K$ is compact.
2. $\lim_{n \to \infty} \mathcal{T}^n(A) \subset K$ for all compact sets $A \subset \partial M$.
3. $\mathcal{T}(K) = K$.
4. If $K_\omega(X)$ denotes the part of $K$ reachable by a trajectory starting in $X \in \partial M$ with a given arbitrary activation sequence $\omega$, then $K \equiv \bigcup_\omega \bigcup_X K_\omega(X)$, and $K_\omega(X)$ is countable.
Proof. See appendix A for the proof.

Proposition 2 holds for arbitrary activations of the system (e.g., also for random activations). Our particular interest is to consider periodic activation mappings $F$, where the period is finite. For this case, one finds that the number of elements $|K|$ constituting the asymptotic set is also finite. This implies that every orbit $(\mathcal{T}^n X)_{n \in \mathbb{N}}$ ends up on a periodic orbit in the phase space $\partial M$.
Corollary 2. Let $\mathcal{T}$ satisfy the assumptions of proposition 2. Let the activation mapping $F$ be periodic in time with a finite period $q \in \mathbb{N}$. Then one has that $|K| < \infty$.

Proof. See appendix B for the proof.

The assumptions of proposition 2 require that for every trajectory $(X_t)_{t \in \mathbb{N}}$ in $\partial M$ constituting a Cauchy subsequence in $\partial M_a^k$, there exists a limit $\lim_{t \to \infty} X_t = X_\infty \in \partial M_a^k$. This condition is rather abstract and may involve a high computational effort to be verified analytically for large systems (in Blanchard et al., 2000, it is formulated as a conjecture for the Zhang model). Still, for some simple systems (e.g., two-vertex systems or fully connected networks), it can be shown that it is satisfied. On the other hand, for any numerical simulation of a particular system executed with finite computational precision, every Cauchy sequence converges, making the assumption of proposition 2 trivially satisfied. For further analysis, from now on it will be assumed that $\Lambda$ has an open boundary and $\mathcal{T}$ satisfies the assumptions of proposition 2.

5 Periodic Orbits

Consider an asymptotic periodic orbit in $\partial M$ of period $p$.

Lemma 1. Let $X_0 \equiv X_p, X_1, \ldots, X_{p-1}$ be periodic points with $\mathcal{T}X_i = X_{i+1}$, $0 \le i \le p-1$. Then

$$\mathcal{T}^p X = X + (T - \mathbb{1})\,Y_p(X) + \alpha_F^p(X) = X + (T - \mathbb{1}) \sum_{n=0}^{p-1} Y_1(\mathcal{T}^n X) + \sum_{n=0}^{p-1} \alpha_F(X, n) \tag{5.1}$$
and $Y_p(X_j) = Y_p(X_k)$ for all $0 \le j, k \le p-1$.

Proof. From the definitions of $Y(X)$ and $\alpha_F$, one has the identities

$$Y_n(X) = \sum_{i=0}^{n-1} Y_1(\mathcal{T}^i X), \qquad \alpha_F^p = \sum_{n=0}^{p-1} \alpha_F(X, n),$$
and from these identities, equation 5.1 follows. It follows, then, that

$$Y_p(X_j) = \sum_{i=0}^{p-1} Y_1(\mathcal{T}^i X_j) = \sum_{i=0}^{p-2} Y_1(\mathcal{T}^i X_j) + Y_1(\mathcal{T}^{p-1} X_j) = \sum_{i=0}^{p-2} Y_1(\mathcal{T}^{i+1} X_{j-1}) + Y_1(X_{j-1}) \overset{l = i+1}{=} \sum_{l=1}^{p-1} Y_1(\mathcal{T}^l X_{j-1}) + Y_1(\mathcal{T}^0 X_{j-1}) = \sum_{l=0}^{p-1} Y_1(\mathcal{T}^l X_{j-1}) = Y_p(X_{j-1})$$
for all $0 \le j \le p-1$. With this, one has the following equation for a periodic trajectory:

$$(\mathbb{1} - T)\,Y_p(X) = \alpha_F^p(X). \tag{5.2}$$

If the right-hand side is independent of $X$, $\alpha_F^p(X) = \alpha_F^p$ for all $X$, then so is the left-hand side, and equation 5.2 is valid for all periodic orbits (of period $p$) for a given $\Lambda$. As we assumed, the activation mapping $F$ is periodic with some finite period $q$, $F \equiv F(t \bmod q)$. Starting at the recurrent point $X$ (in the sense that $X(t+p) = X(t)$) and considering $q$ periods of the trajectory (instead of one), we obtain

$$(\mathbb{1} - T)\,Y_{qp}(X) = \alpha_F^{qp}(X) = p\,\alpha_F^q(X), \tag{5.3}$$

where $\alpha_F^q$ then depends only on the activation mapping $F$ and is the amount of activation during one period of the activation mapping $F$. Considering finally the mean activity $\bar{Y}$ of the system during one periodic orbit (of period $p > 0$), one has $\bar{Y} = \frac{1}{qp} Y_{qp}$. From this,

$$(\mathbb{1} - T)\,\bar{Y} = \frac{1}{q}\,\alpha_F^q. \tag{5.4}$$
If again $\alpha_F^q(X) = \alpha_F^q$ for all $X$, then equation 5.4 depends only on the activation mapping $F$ and the network matrix $T$. It is thus independent of the starting point of a trajectory and of the period of the asymptotic orbit. In general, the value of $\alpha_F^p$ in equation 5.2 is not independent of $X$. In the next section, networks shall be considered where the activation vertices do not receive any energy from any other vertices, so that their own energy is always $(X)_a \in \{0, 1\}$ for all $a \in A$, and $\alpha_F^q$ takes a fixed value for all periodic orbits.
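As a small illustration of equation 5.4, the mean activity of a concrete finite network is obtained by solving one linear system; the numbers below are arbitrary examples, with the activation term written directly as $\frac{1}{q}\alpha_F^q$:

```python
# Sketch: solving equation 5.4 for the mean activity of a small network
# (hypothetical example numbers; by corollary 1, (1 - T) is invertible
# for open-boundary networks).
import numpy as np

T = np.array([[0.0, 0.5, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 0.5, 0.0]])
alpha_q_over_q = np.array([1.0, 0.0, 0.0])  # vertex 0 activated once per period

Y_bar = np.linalg.solve(np.eye(3) - T, alpha_q_over_q)
print(Y_bar)   # mean energy each node transfers per time step
```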
Figure 1: Sketch of a particular example of a network with the corresponding network matrix T of the form 6.1. The set of activation vertices is denoted by A. There are no (directed) links pointing at nodes i ∈ A.
6 Mean Activity Synchronization

In the following, networks $\Lambda$ having a network matrix of the form

$$T = \begin{pmatrix} 0 & 0 \\ B & C \end{pmatrix} \tag{6.1}$$

shall be considered (see Figure 1 for an example). In this representation, the right-hand side of equation 5.4 is independent of $X$. Furthermore, $B \in M_{(|\Lambda|-|A|) \times |A|}$ consists of the outgoing links of the activation vertices $v \in A$ (where $A$ denotes the set of all activation vertices $v \in \Lambda$), and $C \in M_{(|\Lambda|-|A|) \times (|\Lambda|-|A|)}$ is again a network matrix. From the properties of $T$ as in definition 1, one has that $\sum_{i=1}^{|\Lambda|-|A|} (B)_{ij} \le 1$ for all $1 \le j \le |A|$ and $\sum_{i=1}^{|\Lambda|-|A|} (C)_{ij} \le 1$ for all $1 \le j \le |\Lambda| - |A|$.

Lemma 2. For a network matrix $T$ of the form 6.1, one has

$$(\mathbb{1} - T)^{-1} = \begin{pmatrix} \mathbb{1} & 0 \\ (\mathbb{1} - C)^{-1} B & (\mathbb{1} - C)^{-1} \end{pmatrix}. \tag{6.2}$$
Proof. Because $C$ is a network matrix satisfying the assumptions of proposition 2, we know that $(\mathbb{1} - C)^{-1}$ exists, and one has

$$\begin{pmatrix} \mathbb{1} & 0 \\ -B & \mathbb{1} - C \end{pmatrix} \begin{pmatrix} \mathbb{1} & 0 \\ (\mathbb{1} - C)^{-1} B & (\mathbb{1} - C)^{-1} \end{pmatrix} = \begin{pmatrix} \mathbb{1} & 0 \\ 0 & \mathbb{1} \end{pmatrix}.$$
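Lemma 2 is easy to sanity-check numerically. The sketch below builds a 5-vertex $T$ of the form 6.1 from arbitrary small matrices $B$ and $C$ (chosen to respect the column-sum constraints of definition 1) and compares both sides of equation 6.2:

```python
# Numerical check of the block inverse (6.2); B and C are arbitrary examples.
import numpy as np

B = np.array([[0.4, 0.0],
              [0.3, 0.5],
              [0.0, 0.2]])            # (|Lambda|-|A|) x |A|, column sums <= 1
C = np.array([[0.0, 0.3, 0.1],
              [0.2, 0.0, 0.3],
              [0.1, 0.2, 0.0]])       # network submatrix, zero diagonal

T = np.zeros((5, 5))
T[2:, :2] = B
T[2:, 2:] = C

lhs = np.linalg.inv(np.eye(5) - T)
inv_IC = np.linalg.inv(np.eye(3) - C)
rhs = np.block([[np.eye(2), np.zeros((2, 3))],
                [inv_IC @ B, inv_IC]])
print(np.allclose(lhs, rhs))   # True
```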
The mean activity of a vertex, $(\bar{Y})_i$, $1 \le i \le |\Lambda|$, corresponds to the mean energy transferred from the node $i \in \Lambda$ to its neighbors during one time step of the asymptotic period. Setting $\bar{Y} = (x_1, \ldots, x_{|A|}, c, \ldots, c)$, $c > 0$, in equation 5.4, we obtain that $x_i$, $1 \le i \le |A|$, corresponds to the mean input rate, and the mean activity of all nodes $j \in \Lambda \setminus A$ is equal. We will call such a state a synchronized one. Indeed, this kind of synchronization is considered in rate models (see, e.g., Hopfield & Brody, 2001, where this notion of synchronization was also introduced). Setting $c = 1$, we expect a relaxation of each node $j$ at all times $t \in \mathbb{N}$ (the neurons fire with the maximal rate). In this special case, we speak of strong synchronization.

Notation 1. In the following, the synchronization problem will denote the equation

$$(\mathbb{1} - T) \begin{pmatrix} X \\ e \end{pmatrix} = \begin{pmatrix} X \\ 0 \end{pmatrix}, \tag{6.3}$$

where $X \in \mathbb{R}^{|A|}$, $1 \ge X \ge 0$ (in the sense that $1 \ge (X)_i \ge 0$ for all $i$), is the unknown to be found, and $e$ denotes the all-ones vector $(1, \ldots, 1)$. Equation 6.3 is, for network matrices of the form 6.1, equivalent to

$$BX = (\mathbb{1} - C)e. \tag{6.4}$$
The meaning of equation 6.3 is the following. Given a network matrix $T$, find an activation mapping $F$ (in the sense of definition 2) such that the total mean activity during one period of the asymptotic orbit is equal for all vertices $v \in \Lambda \setminus A$. Additionally, for matrices of the form 6.1, one may fix only the submatrix $B$ and search for an activation mapping $F$ and a set of $C$'s for which equation 6.4 is satisfied. In the following, we shall restrict ourselves to the case where $B \ge 0$ (in the sense that $(B)_{ij} \ge 0$ for all $i, j$). A network satisfying equation 6.4 for a $B \ge 0$ and a $1 \ge X \ge 0$ will be called synchronizable. The next goal is to find conditions for the synchronizability of networks under the given restriction of equation 6.1.

6.1 Networks Without Inhibitory Links. Consider network matrices of the form 6.1 with $1 \ge C \ge 0$. We start with the following lemma.

Lemma 3. Let $v \in \mathbb{R}^{|\Lambda|-|A|}$, $1 \ge v \ge 0$, $\|v\|_1 \le |A|$. Then there exists a $B: \mathbb{R}^{|A|} \to \mathbb{R}^{|\Lambda|-|A|}$ with $1 \ge B \ge 0$, $\sum_{i=1}^{|\Lambda|-|A|} (B)_{ij} \le 1$, and an $x \in \mathbb{R}^{|A|}$, $1 \ge x \ge 0$, such that $Bx = v$.

Proof. Define $B_\infty^+(\mathbb{R}^{|A|}) := \{x \in \mathbb{R}^{|A|} \mid 1 \ge x \ge 0\}$. Take any $B: \mathbb{R}^{|A|} \to \mathbb{R}^{|\Lambda|-|A|}$ with $1 \ge B \ge 0$ and $\sum_{i=1}^{|\Lambda|-|A|} (B)_{ij} \le 1$. We shall investigate the image
$B(B_\infty^+(\mathbb{R}^{|A|}))$. For all $x \in B_\infty^+(\mathbb{R}^{|A|})$, one has that $x = \sum_i a_i e_i$ with coefficients $0 \le a_i \le 1$. From this, we get $Bx = \sum_{i=1}^{|A|} a_i B e_i$. Consider first the vectors $Be_i$. $Be_i$ is the $i$th column of the matrix $B$. Due to the fact that $\sum_{i=1}^{|\Lambda|-|A|} (B)_{ij} \le 1$ for all $j$, one has that $Be_i \in B_1^+(\mathbb{R}^{|\Lambda|-|A|}) := \{x \in \mathbb{R}^{|\Lambda|-|A|} \mid \|x\|_1 \le 1,\ 1 \ge x \ge 0\}$. As a consequence, the union of the images $B(B_\infty^+(\mathbb{R}^{|A|}))$ over all $B$ allowed by the assumptions is equal to $B_{1,|A|}^+ \equiv \{x \in \mathbb{R}^{|\Lambda|-|A|} \mid \|x\|_1 \le |A|,\ 0 \le x \le 1\}$, because it is the set of all linear combinations (with coefficients $0 \le a_i \le 1$) of all possible column vectors of $B$. One always has (with $0 \le x \le 1$)

$$\|Bx\|_1 = \Big\|\sum_i a_i B e_i\Big\|_1 \le \sum_i a_i \|Be_i\|_1 \le \sum_i a_i \le |A|.$$
The following result gives the conditions for the synchronizability of positive network matrices.

Proposition 3. Consider networks $\Lambda$ with network matrices of the form 6.1, where $1 \ge C \ge 0$ is again a network matrix. Then one has that $\Lambda$ is synchronizable, that is, there exist $1 \ge B \ge 0$ with $\sum_{i=1}^{|\Lambda|-|A|} (B)_{ij} \le 1$ and $1 \ge x \ge 0$ such that $(\mathbb{1} - C)e = Bx$, iff the following inequalities hold for $C$:
1. $\sum_{j=1}^{|\Lambda|-|A|} (C)_{ij} \le 1$ for all $i$,
2. $\sum_{i=1}^{|\Lambda|-|A|} \sum_{j=1}^{|\Lambda|-|A|} (C)_{ij} \ge |\Lambda| - 2|A|$.
Proof. $\Leftarrow$: Assume inequalities 1 and 2 hold. Then one has

$$\sum_{j=1}^{|\Lambda|-|A|} (C)_{ij} \le 1 \quad \Leftrightarrow \quad 1 - \sum_j (C)_{ij} \ge 0 \quad \text{for all } i.$$

From this, one has that $1 \ge (\mathbb{1} - C)e \ge 0$. Let $v := (\mathbb{1} - C)e$. Then $1 \ge v \ge 0$. On the other hand, from inequality 2, one has

$$\sum_{i,j=1}^{|\Lambda|-|A|} (C)_{ij} \ge |\Lambda| - 2|A| \quad \Leftrightarrow \quad \sum_i \Big( 1 - \sum_j (C)_{ij} \Big) \le |A| \quad \Leftrightarrow \quad \|(\mathbb{1} - C)e\|_1 \le |A|.$$

Thus, $\|v\|_1 \le |A|$. The assertion now follows from lemma 3.

$\Rightarrow$: Assume the existence of a $B$ and $x$ with the required properties and $(\mathbb{1} - C)e = Bx$. From $1 \ge Bx \ge 0$ it follows that $\sum_j (C)_{ij} \le 1$ for all $i$. Let $v := (\mathbb{1} - C)e$. Then it follows from lemma 3 that $v = Bx \in B_{1,|A|}^+(\mathbb{R}^{|\Lambda|-|A|})$.
Thus, one has that $\|v\|_1 \le |A|$. But then

$$\|(\mathbb{1} - C)e\|_1 = |\Lambda| - |A| - \sum_{i,j} (C)_{ij} \le |A|,$$

and from this, $\sum_{i,j} (C)_{ij} \ge |\Lambda| - 2|A|$.
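Conditions 1 and 2 of proposition 3 are straightforward to test numerically; a sketch follows (the helper name `synchronizable` and the example matrix are ours):

```python
# Checker for the two synchronizability conditions of proposition 3,
# for a nonnegative submatrix C and |A| activation vertices.
import numpy as np

def synchronizable(C, n_activation):
    n = C.shape[0]                   # n = |Lambda| - |A|
    total = n + n_activation         # |Lambda|
    cond1 = np.all(C.sum(axis=1) <= 1.0)            # row sums at most 1
    cond2 = C.sum() >= total - 2 * n_activation     # enough total weight
    return cond1 and cond2

C = np.array([[0.0, 0.6],
              [0.7, 0.0]])
print(synchronizable(C, n_activation=1))  # True: row sums <= 1 and 1.3 >= 3 - 2
```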
Remark 1. Condition 2 in proposition 3 becomes nontrivial for $|A| < \frac{|\Lambda|}{2}$. Thus it is convenient to choose the activation set $A \subset \Lambda$ sufficiently small.

Corollary 3. Consider a situation as in proposition 3, but let $B$ be fixed; thus, one searches for matrices $C$ for which the corresponding network is synchronizable under the given $B$. In this case, $\Lambda$ is synchronizable if and only if the vector $(\mathbb{1} - C)e$ lies within the parallelepiped defined by the linear combinations (with coefficients $0 \le a_i \le 1$) of the column vectors of $B$:

$$(\mathbb{1} - C)e = \sum_{i=1}^{|A|} a_i b_i, \quad \text{where} \quad B \equiv (b_1, \ldots, b_{|A|}),\ b_i \in \mathbb{R}^{|\Lambda|-|A|}. \tag{6.5}$$
Proof. Let $\Lambda$ be synchronizable for the given $B$. Then there is a $1 \ge x \ge 0$ such that $(\mathbb{1} - C)e = Bx$. With $a_i = x_i$, equation 6.5 holds. On the other hand, if equation 6.5 holds, then by setting $x_i = a_i$, one obtains the required vector satisfying $(\mathbb{1} - C)e = Bx$.

6.2 Asymptotic Symmetric Networks. In this section, the asymptotics $|\Lambda| \to \infty$ of symmetric networks are discussed.

6.2.1 Networks. Before the synchronization problem is discussed in the most general sense, we consider it for more concrete cases. Two kinds of networks shall be introduced in the following. The first is the classical random network (RN; Bollobás, 1985). Let a set of $|\Lambda|$ vertices be given. Then the RN is constructed by connecting each pair of vertices $i, j \in \Lambda$ with a fixed probability $0 < p \le 1$. The connectivity distribution of an RN is in the asymptotic case $|\Lambda| \to \infty$ equal to the Poisson distribution (Bollobás, 1985). The second network is the generalized Barabasi-Albert network (g-BA; Barabasi & Albert, 1999; Newman, Strogatz, & Watts, 2001). It is constructed by the following algorithm:
1. One starts with $n$ vertices connected with each other at time $t = 0$.
2. At every time step $t > 0$, $t \in \{1, \ldots, |\Lambda| - n\}$, a new vertex is added to the lattice.
3. This new vertex is attached to $n$ (different) vertices that are already there.
4. The probability $p(k, i, t)$ for an existing vertex to be linked to the new one is proportional to the number of links outgoing from this vertex (preferential linking following the motto, "the rich get richer").

For the connectivity distribution $P(k)$, one has in this case $P(k)\,k^3 \to \text{const.}$ for $k \to \infty$ (Dorogovtsev & Mendes, 2003).

Lemma 4. Consider a network $\Lambda$ with the representation 6.1. Let the activation set $A$ be such that $\frac{|A|}{|\Lambda|} < \frac{1}{2}$. Let $C$ be symmetric, $(C)_{ij} \in \{0, c\}$ for all $i, j$, $c > 0$.
1. Consider the classical random network (RN) with the linking probability $p = \frac{\nu}{|\Lambda|}$, for some constant $\nu > 0$.
2. And consider the g-BA network of order $n \in \mathbb{N}$.
Let $c = \alpha \frac{1}{|\Lambda|}$, for some constant $\alpha > 0$. Then one has that for $|\Lambda| \to \infty$, there is no triple $\{\nu, \alpha, n\}$ such that the considered networks are synchronizable.
Proof. With $b := \frac{|A|}{|\Lambda|}$, one has that condition 2 in proposition 3 is equivalent to

$$\frac{1}{|\Lambda|} \sum_{ij} (C)_{ij} \ge (1 - 2b) > 0. \tag{6.6}$$

On the other hand, one has that

$$\lim_{|\Lambda| \to \infty} \sum_{ij} (C)_{ij} = \alpha \frac{\nu}{2} \quad \text{for RN}, \qquad \text{resp.} \qquad \lim_{|\Lambda| \to \infty} \sum_{ij} (C)_{ij} = 2n\alpha \quad \text{for g-BA}.$$

Thus,

$$\lim_{|\Lambda| \to \infty} \frac{1}{|\Lambda|} \sum_{ij} (C)_{ij} = 0 \quad \text{for all } \nu, \alpha \text{ (RN)}, \qquad \text{resp.} \qquad \lim_{|\Lambda| \to \infty} \frac{1}{|\Lambda|} \sum_{ij} (C)_{ij} = 0 \quad \text{for all } n, \alpha \text{ (g-BA)}.$$

But this means that condition 6.6 cannot be satisfied for any finite $\nu, n, \alpha$.

Lemma 5. Consider a situation as in lemma 4 but with a fixed $c > 0$. Then one has that the probability $p_s$ for an RN or g-BA network to be synchronizable decreases exponentially with $|\Lambda|$ if the vertex connectivities are assumed to be independent.
Proof. With $c > 0$ fixed, one has

$$\lim_{|\Lambda| \to \infty} \frac{1}{|\Lambda|} \sum_{i,j} (C)_{ij} = \begin{cases} \frac{\nu}{2}\,c & \text{for RN}, \\ 2nc & \text{for g-BA}. \end{cases} \tag{6.7}$$

From this, one has that condition 6.6 is satisfied for $\nu c \ge 2(1 - 2b)$ resp. $nc \ge \frac{1 - 2b}{2}$. Thus, suitable values for the parameters $\nu, n$ can always be found such that condition 2 in proposition 3 is satisfied. On the other hand, the probability $p_s$ that the considered (random) realization of the network satisfies condition 1 in proposition 3 is equal to the probability that all vertices $v \in \Lambda$ independently have connectivity $k_v$ less than or equal to $m$, $m := \max_{k \in \mathbb{N}} \{kc \le 1\}$. If $P$ denotes the corresponding probability for one vertex, then one has

$$P = \begin{cases} e^{-\nu} \sum_{j=0}^{m-1} \frac{\nu^j}{j!} & \text{for RN}, \\[4pt] 1 - \frac{1}{N(n)} \sum_{j=m}^{\infty} j^{-3} & \text{for g-BA}, \quad m \gg 1. \end{cases}$$

With this, $P < 1$, because from $c > 0$ it follows that $m < \infty$. For the probability $p_s$, one has then, using the independence of the distributions of the vertex connectivities,

$$\lim_{|\Lambda| \to \infty} p_s = \lim_{|\Lambda| \to \infty} P^{|\Lambda| - |A|} = 0,$$
and from this the assertion follows.

Corollary 4. Let $|\Lambda| \gg 1$ be fixed, and let $p_s[\mathrm{RN}]$ and $p_s[\mathrm{BA}]$ denote, for the given $|\Lambda|$, the probability for a realization of a classical random network and a g-BA network, respectively, to be synchronizable. Let the constants $n$, $\nu$, and $c > 0$ be defined as in lemma 4 and $m$ as in lemma 5. Let $\nu$ be chosen such that $\nu < (m!\,m^{-3})^{1/m}$, and finally let $N(n) = e^{\nu}$. Then one has that $p_s[\mathrm{RN}] > p_s[\mathrm{BA}]$.

Proof. One has, following the notation of lemma 5, that

$$P[\mathrm{RN}] - P[\mathrm{BA}] = e^{-\nu} \sum_{j=m}^{\infty} \left( j^{-3} - \frac{\nu^j}{j!} \right) > 0$$
from the assumption $\nu < (m!\,m^{-3})^{1/m}$, which gives $\nu < (j!\,j^{-3})^{1/j}$ for all $j \ge m$. Recalling that $p_s = P^{|\Lambda|}$, one obtains the assertion immediately.
As a remark, one has to state that the condition $\nu < (m!\,m^{-3})^{1/m}$ on $\nu$ is quite natural: dealing with large networks ($|\Lambda| \gg 1$), it is necessary to choose $c$ correspondingly small, which implies a large value for $m$, and for large $m$ the above condition is naturally satisfied.
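The exponential decay of $p_s$ asserted in lemma 5 can be probed by a small Monte Carlo experiment; the sketch below (all parameter values and generator details are our own simplifications, not the paper's) estimates the probability that condition 1 of proposition 3 holds, which with entries $(C)_{ij} \in \{0, c\}$ amounts to every vertex having at most $m = \lfloor 1/c \rfloor$ links:

```python
# Monte Carlo sketch for lemma 5: fraction of random realizations whose
# degrees all stay at or below m = floor(1/c) (hypothetical parameters).
import numpy as np

rng = np.random.default_rng(0)

def degrees_rn(n, nu):
    """Degrees of a classical random network with linking probability nu/n."""
    adj = np.triu(rng.random((n, n)) < nu / n, 1)
    adj = adj | adj.T
    return adj.sum(axis=1)

def degrees_ba(n, m_links):
    """Degrees under preferential attachment (simplified g-BA growth)."""
    deg = np.ones(m_links, dtype=int)        # small fully connected seed
    for _ in range(n - m_links):
        p = deg / deg.sum()
        targets = rng.choice(len(deg), size=m_links, replace=False, p=p)
        deg[targets] += 1
        deg = np.append(deg, m_links)
    return deg

def p_s(degree_fn, n, c, trials=200):
    m = int(np.floor(1.0 / c))
    return np.mean([np.all(degree_fn(n) <= m) for _ in range(trials)])

for n in (50, 100, 200):
    print(n, p_s(lambda k: degrees_rn(k, nu=1.0), n, c=0.3),
             p_s(lambda k: degrees_ba(k, 2), n, c=0.3))
```

The estimated probabilities shrink with the network size for both ensembles, in line with the lemma, and the heavy degree tail of the g-BA ensemble tends to make it the less synchronizable of the two, as in corollary 4.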
6.3 Inhibitory Networks. Real neuronal networks contain units (in this model, they correspond to vertices) with inhibitory connections to other units (see, e.g., Kandel et al., 2000). This fact can easily be included in our SAS model by allowing negative entries in the network matrices. In the following, the assumptions on the matrix $C$ are:
1. $C$ is a network matrix.
2. $-1 \le C \le 1$.
3. For every $j$, one has either $(C)_{ij} \ge 0$ for all $i$ or $(C)_{ij} \le 0$ for all $i$ (according to Dale's law; see, e.g., Kandel et al., 2000).

In particular, one creates inhibitory networks by constructing a network without inhibitory vertices and then replacing $(C)_{vi} \to -(C)_{vi}$ for a vertex $v \in \Lambda \setminus A$ with a certain probability $p_-$, $0 \le p_- \le 1$. For such networks, where inhibitory nodes are allowed, one obtains the following statement corresponding to proposition 3:

Proposition 4. Consider networks $\Lambda$ with network matrices of the form 6.1, where $1 \ge C \ge -1$ is again a network matrix and satisfies assumption 3 above. Then one has that $\Lambda$ is synchronizable, that is, there exist $1 \ge B \ge 0$ with $\sum_{i=1}^{|\Lambda|-|A|} (B)_{ij} \le 1$ and $1 \ge x \ge 0$ such that $(\mathbb{1} - C)e = Bx$, iff the following inequalities hold for $C$:
1. $0 \le \sum_{j=1}^{|\Lambda|-|A|} (C)_{ij} \le 1$ for all $i$,
2. $\sum_{i=1}^{|\Lambda|-|A|} \sum_{j=1}^{|\Lambda|-|A|} (C)_{ij} \ge |\Lambda| - 2|A|$.
Proof. The proof runs in the same way as the proof of proposition 3.

Definition 3. A network $\Lambda$ having the representation 6.1 with a given corresponding network submatrix $C$ is called irreducible if there exists an $\varepsilon > 0$ such that the matrix $\tilde{C}$ defined by

$$(\tilde{C})_{ij} := \begin{cases} (C)_{ij} & \text{for } (C)_{ij} \ge \varepsilon, \\ 0 & \text{else} \end{cases}$$

is irreducible in the common sense.
Finally, one obtains the following proposition, which is valid for both positive and inhibitory randomly generated networks (assuming that the number of inhibitory vertices is on average smaller than the number of excitatory ones).

Proposition 5. Let $\Lambda$ have the representation 6.1, with $C$ irreducible in the sense of definition 3 and $1 \ge C \ge -1$, where the probability $p_- < \frac{1}{2}$ for the elements of the $j$-th column (according to Dale's law) to be negative is equal for all $j$. Let $b := \frac{|A|}{|\Lambda|}$ be constant and $\lambda := |\Lambda| - |A|$. Let finally $\big(\sum_j (C)_{ij}\big)_i$ be a family of independent and identically distributed (i.i.d.) random variables for $i \in \Lambda \setminus A$, and, for the case of inhibitory networks, for all $i$ let the entries $(C)_{ij}$ be i.i.d. random variables with expectation value $E((C)_{ij}) = \varepsilon_i$ (resp. $-\varepsilon_i$ if negative). Then $\Lambda$ is not synchronizable in the limit $|\Lambda| \to \infty$.

Proof. Let $P(|\Lambda|)$ denote the probability that one has $\sum_j^{\lambda} (C)_{ij} > 1$ for a vertex $i \in \Lambda \setminus A$. Let $\lim_{|\Lambda| \to \infty} P(|\Lambda|) > 0$. Then it follows that $\lim_{|\Lambda| \to \infty} (1 - P) < 1$ and, with $p_s$ denoting as before the probability for $\Lambda$ to be synchronizable, $\lim_{|\Lambda| \to \infty} p_s = \lim_{|\Lambda| \to \infty} (1 - P)^{|\Lambda|} = 0$.

On the other hand, let $\lim_{|\Lambda| \to \infty} P = 0$. Then it is sufficient to consider the case where $\sum_j (C)_{ij} < 1$ in the asymptotic limit $|\Lambda| \to \infty$ for all $i$. Define $J^+ := \{1 \le j \le \lambda \mid (C)_{ij} > 0\}$ and $J^- := \{1 \le j \le \lambda \mid (C)_{ij} < 0\}$. Then one has that $\sum_j^{\lambda} (C)_{ij} = \sum_{j \in J^+}^{\lambda} (C)_{ij} + \sum_{j \in J^-}^{\lambda} (C)_{ij}$. Two cases have to be considered:
1. Let both sums (over $J^-$ and $J^+$) converge. Then one has from the Leibniz criterion that the sequences $((C)_{ij})_{j \in J^+}$ and $(-(C)_{ij})_{j \in J^-}$ are isomorphic to sequences $(a_n)_{n \in \mathbb{N}}$ resp. $(b_n)_{n \in \mathbb{N}}$ with $a_n, b_n > 0$ for all $n$ and $a_n, b_n \xrightarrow{n \to \infty} 0$. Thus, following definition 3, $C$ is not irreducible.
2. Let both sums over $J^-$ and $J^+$ diverge for an $i \in \Lambda \setminus A$. Then one has

$$\lim_{|\Lambda| \to \infty} \frac{1}{|\Lambda|} \sum_j (C)_{ij} = E((C)_{ij})(1 - p_-) - E(-(C)_{ij})\,p_- = (1 - 2p_-)\,\varepsilon_i > 0$$

for this $i$. From this, one has $\sum_j^{\lambda} (C)_{ij} \xrightarrow{|\Lambda| \to \infty} \infty$, and thus the network is again not synchronizable.
7 Weak Synchronizability

In the context of applications of the model to brain dynamics, finite networks are of particular interest. Further, the notion of synchronization as defined by equation 6.3 is very close to the strong form of synchronization where all units fire simultaneously. Thus, equation 6.3 is much too strong a criterion for synchronization in living systems. One is interested in neurons
with equal mean firing rates $r$, allowing for the case $r < 1$. This weaker form of synchronization shall be introduced by the following definition:

Definition 4. A network $\Lambda$, $|\Lambda| < \infty$, with the representation 6.1 is called weakly synchronizable if there exists an $0 < \alpha \le 1$ such that

$$(\mathbb{1} - C)\,\alpha e = Bx \tag{7.1}$$

for a $B$ such that $0 \le B \le 1$, $\sum_i (B)_{ij} \le 1$ for all $j$, and $0 \le x \le 1$. In the following, we shall assume $C \ge 0$.

Remark 2. Equation 7.1 is equivalent to the condition

$$(\mathbb{1} - C)e = Bx \tag{7.2}$$

for an $x \in \mathbb{R}_+^{|A|}$.
(C)i j ≥
|A| ||
< 12 ,
(1 − b)α − b ||. α
(7.3)
b then equation 7.3 is trivially satisfied. FurtherThis means that if α ≤ 1−b b more, if equation 7.1 is satisfied for an x and an α such that α > 1−b , then b one can always find a suitable number 0 < < 1 such that α ≤ 1−b , and equation 7.2 is satisfied for x = x.
Next, we consider the situation where the activation part of the network, given by the matrix $B$ in equation 6.1, is fixed. In this case, the network will be called weakly synchronizable by $B$. Now we turn our attention to the set of weakly synchronizable networks $\Lambda$ that are irreducible, because the latter networks are of particular interest in applications. One expects the set of irreducible networks weakly synchronizable by a given $B$ to be convex. Nevertheless, one is interested in a more precise description of this set. Let $\lambda \equiv |\Lambda| - |A|$. Since $C$ is a $\lambda \times \lambda$ matrix with zero diagonal and nonnegative, nondiagonal components, one can define the (in general not injective) mapping
$$\mathcal{A}: \mathrm{Set}_{\ge 0} := \Big\{ C \text{ network matrix} \,\Big|\, C \ge 0,\ \sum_j (C)_{ij} \le 1 \Big\} \to \mathbb{R}_+^{\lambda}, \qquad C \mapsto (\mathbb{1} - C)e,$$

from the set of network submatrices $C$, where we demand $\sum_j (C)_{ij} \le 1$, to $\mathbb{R}_+^{\lambda}$. On the other hand, one has the mapping $\mathcal{B}: \mathbb{R}_+^{|A|} \to \mathbb{R}_+^{\lambda}$, $x \mapsto Bx$. Thus, one
obtains that the set of irreducible networks weakly synchronizable by $B$ is equal to the preimage of the set $(\mathrm{Im}(\mathcal{A}|_{\mathcal{C}}) \cap \mathrm{Im}\,\mathcal{B})$ under the mapping $\mathcal{A}$, where $\mathcal{C}$ denotes the set of irreducible network submatrices.

Lemma 6. The image of the mapping $\mathcal{A}|_{\mathrm{Set}_{\ge 0}}$ is given by $\mathrm{Im}\,\mathcal{A}|_{\mathrm{Set}_{\ge 0}} = \{x \in \mathbb{R}^{\lambda} \mid 0 \le x \le 1\} \setminus \{0\}$. Moreover, $\mathrm{Im}(\mathcal{A}|_{\mathcal{C}}) = \{x \in \mathbb{R}^{\lambda} \mid 0 \le x < 1\} \setminus \{0\}$.

Proof. From the assumptions on $C$, one has that $((\mathbb{1} - C)e)_i \le 1$ for all $i$. From this, $\|(\mathbb{1} - C)e\|_1 \le \lambda$. Thus, let $x = (x_1, \ldots, x_{\lambda})^T \in \mathbb{R}^{\lambda}$ with $0 \le x \le 1$. In order to construct a matrix $C$ with $(\mathbb{1} - C)e = x$, set $(C)_{ij} = \frac{1 - x_i}{\lambda - 1}$ for all $j \ne i$. With this, one has
1. $((\mathbb{1} - C)e)_i = 1 - \sum_j (C)_{ij} = 1 - (\lambda - 1)\,\frac{1 - x_i}{\lambda - 1} = x_i$ for all $i$,
2. $\sum_j (C)_{ij} = 1 - x_i \le 1$,
3. $\sum_i (C)_{ij} = \sum_{i=1, i \ne j}^{\lambda} \frac{1 - x_i}{\lambda - 1} = 1 - \frac{\sum_{i=1, i \ne j}^{\lambda} x_i}{\lambda - 1} \le 1$.

Thus, $\mathrm{Im}\,\mathcal{A}|_{\mathrm{Set}_{\ge 0}} = \{x \in \mathbb{R}^{\lambda} \mid 0 \le x \le 1\} \setminus \{0\}$. Moreover, if $x < 1$, then the constructed $C$ is irreducible. From this, one has that $\mathrm{Im}(\mathcal{A}|_{\mathcal{C}}) = \{x \in \mathbb{R}^{\lambda} \mid 0 \le x < 1\} \setminus \{0\}$.

Remark 4. Consider the set of inhibitory subnetwork matrices $C$ with $\sum_j (C)_{ij} \le 1$ as before and $\sum_{i,j} (C)_{ij} \ge 0$. Then one has for the image of the mapping

$$\mathcal{A}: \mathrm{Set}_{\mathrm{inh}} := \Big\{ \text{inhibitory network matrix } C \,\Big|\, \sum_j (C)_{ij} \le 1,\ \sum_{i,j} (C)_{ij} \ge 0 \Big\} \to \mathbb{R}_+^{\lambda}, \qquad C \mapsto (\mathbb{1} - C)e,$$

that $\mathrm{Set}_{\ge 0} \subset \mathrm{Set}_{\mathrm{inh}}$ and thus $\mathrm{Im}\,\mathcal{A}|_{\mathrm{Set}_{\ge 0}} \subset \mathrm{Im}\,\mathcal{A}|_{\mathrm{Set}_{\mathrm{inh}}}$. Finally one obtains the following result:
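The proof of lemma 6 is constructive, and the construction is easy to state as code; the following sketch builds, for a given target vector $x$, the submatrix $C$ with off-diagonal entries $(1 - x_i)/(\lambda - 1)$ and checks $(\mathbb{1} - C)e = x$:

```python
# Constructive step of lemma 6: realize a given mean-activity vector x.
import numpy as np

def build_C(x):
    lam = len(x)
    C = np.tile((1.0 - x)[:, None] / (lam - 1), (1, lam))  # row i constant
    np.fill_diagonal(C, 0.0)                               # zero diagonal
    return C

x = np.array([0.2, 0.5, 0.9, 0.4])
C = build_C(x)
print(np.allclose((np.eye(len(x)) - C) @ np.ones(len(x)), x))   # True
```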
Proposition 6. For every given $B$, there exists an irreducible network $\Lambda = \left(\begin{smallmatrix} 0 & 0 \\ B & C \end{smallmatrix}\right)$ that is weakly synchronizable by $B$. The set of those networks is the preimage of $(\mathrm{Im}(\mathcal{A}|_{\mathcal{C}}) \cap \mathrm{Im}\,\mathcal{B})$. In particular, this is a (Lebesgue) null set in the set $\mathcal{C}$.

Proof. From lemma 6, one has that $(\mathrm{Im}(\mathcal{A}|_{\mathcal{C}}) \cap \mathrm{Im}\,\mathcal{B}) \ne \emptyset$. This implies the existence of the required network immediately. Because $\dim(\mathrm{Im}\,\mathcal{B}) = |A| < \lambda$ by definition, one has that $(\mathrm{Im}(\mathcal{A}|_{\mathcal{C}}) \cap \mathrm{Im}\,\mathcal{B})$ is a (Lebesgue) null set in $\mathbb{R}^{\lambda}$ and thus also a null set in the set of all network submatrices $C \in \mathbb{R}_+^{\lambda(\lambda - 1)}$.

The consequence of proposition 6 is the following. We know that there exists some network that is synchronizable by a given (but arbitrary) input subnetwork $B$. But if one were to choose a network $\Lambda_0$ at random out of the set of all networks of a given size, then the probability for $\Lambda_0$ to be synchronizable by $B$ would be equal to zero. For this reason, it is convenient to assume the following: if a particular system (e.g., a living system approximated by the considered model) is found to be in a synchronized state, then its structure is not synchronizable by chance; it is a product of some dynamic (but not necessarily deterministic) process (e.g., a learning process).

8 Conclusion

In this work, we have presented a neural network model where the dissipation occurs not through a sigmoidal form of the transfer function but rather through the boundary of the underlying recurrent network (in contrast with other models). This approach allows a systematic analysis of the influence of the network topology on the properties of the network dynamics. We have discussed the asymptotic behavior of the introduced dynamics, where the development in (discrete) time produces a sequence of avalanches in the form of an energy transmission through the network. Also in this context, we have investigated the conditions on the system implying the existence of an asymptotic invariant set in phase space. Further, we have introduced the notion of synchronizability of networks in the sense of equal mean activity in time of the system units (neurons). The analysis of the synchronizability resulted in general conditions on the network topology. We have seen that under certain assumptions (see corollary 4), scale-free networks (scale-free in the sense of networks with a thick tail of their connectivity distribution) are, in a certain probabilistic sense, less synchronizable than the corresponding classical random networks. It is convenient to mention that similar observations were made in Pecora and Barahona (2005) and Nishikawa, Motter, Lai, and Hoppensteadt (2003), where the synchronizability of networks of chaotic oscillators was analyzed.

Drawing an outline of our future work, we point out the following issues. In this letter, we did not consider the plasticity of networks, an important
issue in the context of an analysis of certain phenomena in brain dynamics. Thus, the next step is to make Hebbian plasticity of the network structure the focus of the analysis. Besides the emergence of synchronization in brain dynamics, there exists a wide field of applications for the presented analysis, as, for example, in the analysis of highway traffic (Lee, Lee, & Kim, 1998). The particular interest of such applications is forecasting synchronization phenomena. With particular assumptions on the input rates, our results can be extended to the analysis of waiting times (in the sense of a transient of the dynamics itself given a fixed input, as well as of some estimates of probabilities of the occurrence of synchronizing input rates) on the way into a synchronized state.

Appendix A: Proof of Proposition 2

Let $\mathcal{T}$ fulfill the assumptions of proposition 2. Then one has the following lemma.

Lemma A.1. With the assumptions of proposition 2, one has for all $\partial M_a^k$ that $\mathcal{T}\,\partial\breve{M}_a^k \subset \partial\breve{M}_{a'}^{k'}$ for suitable $a'$ and $k'$.

Proof. Because of the continuity of $\mathcal{T}$ on $\partial M_a^k$, the image $\mathcal{T}(\partial\breve{M}_a^k)$ of $\partial\breve{M}_a^k$ is compact and connected. From this, if the assertion does not hold, there must exist an $X$ with $\mathcal{T}X$ lying outside all the compact subsets $\partial\breve{M}_{a'}^{k'}$. Assume that there is an $X \in \partial\breve{M}_a^k$ such that $\mathcal{T}X \notin \bigcup_{a', k'} \partial\breve{M}_{a'}^{k'}$. From the assumptions, there must be an $X_0$ with $\mathcal{T}^m X_0 = X$ and $\mathcal{T}^{m+1} X_0 \notin \bigcup_{a', k'} \partial\breve{M}_{a'}^{k'}$, which is a contradiction to those assumptions.

Consider the space $\prod_{a,k} H(\partial\breve{M}_a^k)$ of compact and countable sets in all $\partial\breve{M}_a^k$ (Hutchinson, 1981). Then one can consider the mapping $\mathcal{T}$ as acting on $\prod_{a,k} H(\partial\breve{M}_a^k)$ in the following way:

$$\mathcal{T}(A) := \bigcup_{b \in \Lambda} \bigcup_{a,k} \big( \mathcal{T}_b A \cap \partial\breve{M}_a^k \big), \qquad \text{for all } A \in \prod_{a,k} H\big(\partial\breve{M}_a^k\big).$$
Then the following theorem is applicable to the model given by $\mathcal{T}$.

Theorem A.1. Let $(V, E, \{r_i\})$ be a strongly connected, strictly contracting graph $G \equiv (V, E)$. Let $(f_e)_{e \in E}$ realize the graph in the sense that to every vertex $v \in V$ there is attached a compact metric space $S_v$, and to every edge a contracting mapping $f_e: S_v \to S_u$, where $e \in E$ is the edge from $v$ to $u \in V$ (notation: $e \in E_{uv}$). Then the function $F$ defined as follows is a contraction on the space $\prod_{v \in V} H(S_v)$:

$$F\big((A_v)_{v \in V}\big) := \bigg( \bigcup_{v \in V,\, e \in E_{uv}} f_e(A_v) \bigg)_{u \in V}.$$

Here $(A_v)_{v \in V}$ is an element of $\prod_{v \in V} H(S_v)$. Thus, there is a unique list of nonempty compact sets $(K_v \subseteq S_v)$ such that

$$K_u = \bigcup_{v \in V,\, e \in E_{uv}} f_e(K_v) \qquad \forall u \in V.$$

One has $\big(\lim_{n \to \infty} F^n((A_v)_{v \in V})\big) \cap S_u \subset K_u$ for all $u$ and all $(A_v)_{v \in V} \in \prod_{v \in V} H(S_v)$.
Proof. See Edgar (1992), for example.

For every restricted mapping $\mathcal{T}|_{\partial\breve{M}_a^k}$, its image lies within one other continuity domain of $\mathcal{T}$:

$$\mathcal{T}|_{\partial\breve{M}_a^k}: \partial\breve{M}_a^k \to \mathrm{img}\,\mathcal{T}|_{\partial\breve{M}_a^k} \subset \partial\breve{M}_{a'}^{k'}$$

(this follows from lemma A.1). From this, it follows that $\mathcal{T}|_{\partial\breve{M}_a^k}$ connects exactly two domains of continuity, defining a directed link between them. Finally, the sets $\partial\breve{M}_a^k$ can be considered as vertices of a graph (the transition graph). A link between two vertices is given by the existence of a nonempty image of

$$\mathcal{T}|_{\partial\breve{M}_a^k}: \partial\breve{M}_a^k \to \partial\breve{M}_{a'}^{l}, \qquad a, a' \in \Lambda,\ k, l \in \mathbb{N},$$

as defined above. This graph is not necessarily irreducible, so only the restriction to one of the (finitely many) irreducible subgraphs of the transition graph is considered. One also has to state that the transformation $\mathcal{T}$ is in general not a contraction. But the open boundary condition on the network guarantees that there exists a finite number $k \in \mathbb{N}$ such that $\mathcal{T}^k$ is a contraction, and with this, theorem A.1 is still applicable (Edgar & Golds, 1999; Edgar, 1992).

Appendix B: Proof of Corollary 2

Let the assumptions of proposition 2 be satisfied. With this, we can apply the following lemma, identifying the mapping $f$ with the transformation $\mathcal{T}^q$ and the partition elements $M_i$ with the continuity domains $\partial M_{a_1, \ldots, a_q}^{k_1, \ldots, k_q}$ of $\mathcal{T}^q$, for any deterministic activation sequence $\omega = (a_1, \ldots, a_q, a_1, \ldots, a_q, \ldots)$ and the corresponding activation mapping $F(X, t \bmod q)$.

Lemma B.1. Consider a compact metric space $M$ with a representation as a disjunct union of half-open sets, $M = \bigcup_{i=1}^{N} M_i$. Let $f: M \to M$ be such that $f|_{M_i}$ is a contraction for all $i = 1, \ldots, N$ (or there is a $k < \infty$ such that the $k$-th iterate $f^k$ of $f$ is a contraction). Define the sequence $(x_n)_{n \in \mathbb{N}}$ by $x_0 \in M$ and $x_{n+1} := f(x_n)$. Because of the finiteness of the number of partition elements $M_i$, there exists an $i = 1, \ldots, N$ such that there is no $m \in \mathbb{N}$ with $x_k \notin M_i$ for all $k > m$ (let this condition for $M_i$ be denoted by (COND1)). Let $f$ satisfy the following condition (denoted by (COND2)) with respect to all $x \in M$: for every partition element $M_i$ such that $\{(x_n)_{n \in \mathbb{N}}\} \cap M_i \ne \emptyset$, there is a compact subset $\hat{M}_i \subset M_i$ and an $m \in \mathbb{N}$ such that $x_k \notin M_i \setminus \hat{M}_i$ for all $k > m$. Then every trajectory is asymptotically periodic.

Proof. Consider a sequence $(x_n)$ and a partition element $M_i$ satisfying (COND1), and let $f$ satisfy (COND2). Then there is a cluster point of the sequence $(x_n)$ lying in $\hat{M}_i$. Let $h$ denote this cluster point. From this there is, due to the fact that $f$ is a piecewise contraction on the finite partition $\bigcup_{i=1}^{N} M_i$, an open set $V_h$ with $h \in V_h$ and a $t \in \mathbb{N}$ such that $f^t(V_h) \subset V_h$. From this, it follows that there is a unique fixed point of $f^t$ in $V_h$, which can only be the cluster point $h$ itself. Thus, there exists a point $h \in M$ and a number $n \in \mathbb{N}$ such that $f^n(h) = h$. From this, it follows that the trajectory $(x_n)$ ends up asymptotically in the periodic trajectory including the point $h$. The assertion follows immediately, because the consideration above can be made for any trajectory $(x_n)$ in $M$.
Acknowledgments

I am very grateful to Sergio Albeverio for his instructive supervision and to Klaus Lehnertz for fruitful discussion. I also thank the Heinrich-Böll Foundation and the German Research Foundation for financially supporting this work. Last but not least, I thank the reviewers for their constructive suggestions concerning this letter.

References

Albeverio, S., & Tirozzi, B. (1997). An introduction to the mathematical theory of neural networks. In P. L. Garrido & J. Marro (Eds.), Fourth Granada lectures in computational physics (pp. 197–222). Berlin: Springer.
Allouche, J.-P., Courbage, M., & Skordev, G. (2001). Notes on cellular automata. Cubo, Matemática Educacional, 3, 213–244.
Anderson, J. A. (1995). An introduction to neural networks. Cambridge, MA: MIT Press.
Antonelli, F., Dias, A. P. S., Golubitsky, M., & Wang, Y. (2005). Patterns of synchrony in lattice dynamical systems. Nonlinearity, 18, 2193–2209.
Bagnoli, F., Cecconi, F., Flammini, A., & Vespignani, A. (2003). Short period attractors and non-ergodic behavior in the deterministic fixed energy sandpile model. Europhysics Letters, 63(4), 512–518.
Bak, P., Tang, C., & Wiesenfeld, K. (1988). Self-organized criticality. Phys. Rev. A, 38(1), 364–374.
Barabasi, A.-L., & Albert, R. (1999). Emergence of scaling in random networks. Science, 286, 509–512.
Beggs, J. M., & Plenz, D. (2003). Neuronal avalanches in neocortical circuits. J. Neurosci., 23(35), 11167–11177.
Blanchard, P., Cessac, B., & Krüger, T. (2000). What can we learn about self-organized criticality from dynamical systems theory? J. Stat. Phys., 98, 357–404.
Blanchard, P., Cessac, B., & Krüger, T. (2001). Lyapunov exponents and transport in the Zhang model of self-organized criticality. Phys. Rev. E, 64(1), 016133.
Börgers, C., & Kopell, N. (2003). Synchronization in networks of excitatory and inhibitory neurons with sparse, random connectivity. Neural Computation, 15, 509–538.
Bollobás, B. (1985). Random graphs. San Diego, CA: Academic Press.
Burkitt, A. N., & Clark, G. M. (2001). Synchronization of the neural response to noisy periodic synaptic input. Neural Computation, 13, 2639–2672.
Chen, D., Eu, S., Guo, A., & Yang, Z. R. (1995). Self-organized criticality in a cellular automaton model of pulse-coupled integrate-and-fire neurons. J. Phys. A: Math. Gen., 28, 5177–5182.
Dauce, E., Moynot, O., Pinaud, O., & Samuelides, M. (2001). Mean-field theory and synchronization in random recurrent neural networks. Neural Processing Letters, 14, 115–126.
Dhar, D. (1990). Self-organized state of sandpile automaton models. Phys. Rev. Lett., 64(14), 1613–1616.
Dickman, R., Muñoz, M. A., Vespignani, A., & Zapperi, S. (2000). Paths to self-organized criticality. Brazilian J. Physics, 30(1), 27–41.
Dorogovtsev, S. N., & Mendes, J. F. F. (2003). Evolution of networks. New York: Oxford University Press.
Edgar, G. A. (1992). Measure, topology, and fractal geometry. Berlin: Springer.
Edgar, G. A., & Golds, J. (1999). A fractal dimension estimate for a graph-directed iterated function system of non-similarities. Indiana Univ. Math. Journal, 48(2), 429–447.
Feng, J., & Brown, D. (1998). Fixed-point attractor analysis for a class of neurodynamics. Neural Computation, 10, 189–213.
Gerstner, W., & Kistler, W. (2002). Spiking neuron models. Cambridge: Cambridge University Press.
Giacometti, A., & Diaz-Guilera, A. (1998). Dynamical properties of the Zhang model of self-organized criticality. Phys. Rev. E, 58(1), 247–253.
Goh, K.-I., Lee, D.-S., Kahng, B., & Kim, D. (2003). Sandpile on scale-free networks. Phys. Rev. Lett., 91(14), 148701.
Golomb, D., & Hansel, D. (2000). The number of synaptic inputs and the synchrony of large, sparse neuronal networks. Neural Computation, 12, 1095–1139.
Hergarten, S. (2002). Self-organized criticality in earth systems. Berlin: Springer.
Hertz, J., Krogh, A., & Palmer, R. G. (1991). Introduction to the theory of neural computation. Redwood City, CA: Addison-Wesley.
Hopfield, J. J., & Brody, C. D. (2001). What is a moment? Transient synchrony as a collective mechanism for spatiotemporal integration. PNAS, 98(3), 1282–1287.
Hutchinson, J. E. (1981). Fractals and self similarity. Indiana Univ. Math. Journal, 30, 713–747.
Jensen, H. J. (1998). Self-organized criticality. Cambridge: Cambridge University Press.
Kandel, E. R., Schwartz, J. H., & Jessel, T. M. (Eds.). (2000). Principles of neural science (4th ed.). New York: McGraw-Hill.
Lago-Fernandez, L. F., Corbacho, F. J., & Huerta, R. (2001). Connection topology dependence of synchronization of neural assemblies on class 1 and 2 excitability. Neural Networks, 14, 687–696.
Lee, H. Y., Lee, H.-W., & Kim, D. (1998). Origin of synchronized traffic flow on highways and its dynamic phase transition. Phys. Rev. Lett., 81(3), 1130–1133.
Lehnertz, K., Andrzejak, R. G., Arnhold, J., Widman, G., Burr, W., David, P., & Elger, C. E. (2000). Possible clinical and research applications of nonlinear EEG analysis in humans. In K. Lehnertz, J. Arnhold, P. Grassberger, & C. E. Elger (Eds.), Chaos in brain? (pp. 134–155). Singapore: World Scientific.
Lübeck, S., Rajewsky, N., & Wolf, D. E. (2000). A deterministic sandpile automaton revisited. Eur. Phys. J. B, 13, 715–721.
Markosova, M., & Markos, P. (1992). Analytical calculation of the attractor periods of deterministic sandpiles. Phys. Rev. A, 46(6), 3531–3534.
Newman, M. E. J., Strogatz, S. H., & Watts, D. J. (2001). Random graphs with arbitrary degree distributions and their applications. Phys. Rev. E, 64(2), 026118.
Nishikawa, T., Motter, A. E., Lai, Y.-C., & Hoppensteadt, F. C. (2003). Heterogeneity in oscillator networks: Are smaller worlds easier to synchronize? Phys. Rev. Lett., 91(1), 014101.
Pecora, L. M., & Barahona, M. (2005). Synchronization of oscillators in complex networks. Chaos and Compl. Letters, 1(1), 61–91.
Perez, C. J., Corral, A., & Diaz-Guilera, A. (1996). On self-organized criticality and synchronization in lattice models of coupled dynamical systems. Int. J. Mod. Phys. B, 10(8–9), 1–41.
Trappenberg, T. P. (2002). Fundamentals of computational neuroscience. New York: Oxford University Press.
Volk, D. (2000). Spontaneous synchronization in a discrete neural network model. In K. Lehnertz, J. Arnhold, P. Grassberger, & C. E. Elger (Eds.), Chaos in brain? (pp. 234–237). Singapore: World Scientific.
Zhang, Y. C. (1989). Scaling theory of self-organized criticality. Phys. Rev. Lett., 63(5), 470–473.
Received February 23, 2006; accepted October 14, 2006.
LETTER
Communicated by James A. Bednar
Self-Organizing Maps with Asymmetric Neighborhood Function Takaaki Aoki [email protected] Graduate School of Informatics, Kyoto University, Kyoto 606-8501, Japan, and CREST, JST, Kyoto 606-8501, Japan
Toshio Aoyagi [email protected] Graduate School of Informatics, Kyoto University, Kyoto 606-8501, Japan
The self-organizing map (SOM) is an unsupervised learning method as well as a type of nonlinear principal component analysis that forms a topologically ordered mapping from the high-dimensional data space to a low-dimensional representation space. It has recently found wide applications in such areas as visualization, classification, and mining of various data. However, when the data sets to be processed are very large, a copious amount of time is often required to train the map, which seems to restrict the range of putative applications. One of the major culprits for this slow ordering time is that a kind of topological defect (e.g., a kink in one dimension or a twist in two dimensions) gets created in the map during training. Once such a defect appears in the map during training, the ordered map cannot be obtained until the defect is eliminated, for which the number of iterations required is typically several times larger than in the absence of the defect. In order to overcome this weakness, we propose that an asymmetric neighborhood function be used for the SOM algorithm. Compared with the commonly used symmetric neighborhood function, we found that an asymmetric neighborhood function accelerates the ordering process of the SOM algorithm, though this asymmetry tends to distort the generated ordered map. We demonstrate that the distortion of the map can be suppressed by improving the asymmetric neighborhood function SOM algorithm. The number of learning steps required for perfect ordering in the case of the one-dimensional SOM is numerically shown to be reduced from $O(N^3)$ to $O(N^2)$ with an asymmetric neighborhood function, even when the improved algorithm is used to get the final map without distortion.
Neural Computation 19, 2515–2535 (2007)
1 Introduction

The self-organizing map (SOM) algorithm was proposed by Kohonen (1982, 2001) as a simplified neural network model having some essential properties to reproduce topographic representations observed in the brain (Hubel & Wiesel, 1962, 1974; von der Malsburg, 1973; Takeuchi & Amari, 1979). The SOM algorithm can be used to construct an ordered mapping from a high-dimensional input data space onto a low-dimensional array of units according to the topological relationships between input data. This implies that the SOM algorithm is capable of extracting the essential information from hugely complicated data. From the viewpoint of applied information processing, the SOM algorithm can be regarded as a generalized, nonlinear type of principal component analysis and has proven valuable in the fields of visualization, compression, and mining of various complicated data. The learning performance becomes an important issue in many of the applications of the SOM algorithm. It is therefore desirable for real applications that the algorithm converge quickly to a map with the correct topological order. In addition, it is desirable for the resultant map to have as little distortion as possible, to faithfully represent the structure of the data.

In this letter, we examine the effect of the form of the neighborhood function on the performance of the SOM learning algorithm. An inappropriate choice of neighborhood function may have a detrimental effect on learning performance. Erwin, Obermayer, and Schulten (1992) have reported that if the form of the symmetric neighborhood function is not convex, many undesirable metastable states will be present in the system. At each metastable state, the feature map of the SOM is ill structured, unlike the correct map. Since the learning process becomes trapped in these metastable states, the system can escape from these undesirable states only through a sheer number of iterations, because of their local stability. They have closely investigated certain relationships between the form of the symmetric neighborhood function and the existence of metastable states and have concluded that no metastable states exist when a convex neighborhood function is used in the one-dimensional SOM. Hence, it is evident that a suitable selection of neighborhood function will pay dividends in the performance of the algorithm.

However, even if such a suitable symmetric neighborhood function is selected, there remains another important factor that spoils the learning performance. This is the emergence of a topological defect, characterized by a globally conflicting point between multiple, locally ordered regions. A topological defect in the feature map appears occasionally during the learning process, especially when using a neighborhood function that is narrow compared with the total size of the SOM array. Figure 1A shows an example of a topological defect in a two-dimensional array of SOM with a uniform, rectangular input data space. The feature map is twisted at the center but topologically ordered everywhere else. Figure 1B shows an example of a topological defect in a one-dimensional array of SOM with scalar input data, which we
Figure 1: (A) Example of a topological defect in a two-dimensional array of SOM with a uniform rectangular input space. The triangle point indicates the conflicting point in the feature map. (B) Another example of a topological defect, in a one-dimensional array with scalar input data. The triangle points also indicate the conflicting points. (C) Method of generating an asymmetric neighborhood function by scaling the distance $r_{ic}$ asymmetrically. The degree of asymmetry is parameterized by $\beta$. The distance of a node in the positive direction of the asymmetry unit vector $\mathbf{k}$ is scaled by $1/\beta$; the distance in the negative direction is scaled by $\beta$. Therefore, the asymmetric gaussian function is described by $h_\beta(r_{ic}) = \frac{2}{\beta + 1/\beta}\,h(\tilde{r}_{ic})$, where $\tilde{r}_{ic}$ is the scaled distance of node $i$ from the winner $c$. (D) An example of an asymmetric gaussian function in a two-dimensional SOM. (E) The initial reference vectors used in numerical simulations. There is a single kink point at the center of the array. In addition, the reference vectors are disturbed with small noise. The dashed line depicts one of the possible perfect maps, which should be formed from the input data given by the uniform distribution in [0, 1].
call the kink state in this letter. In this kink state, the feature map is piecewise ordered on the scale of the width of the neighborhood function, but there are one or more topological defects over the entire map. Once such a kink appears in the feature map, the learning process for getting the kink out of the map requires a large number of steps, typically several times larger than without a kink (Geszti et al., 1990). This slow convergence of learning due to a kink state is a serious problem that needs to be resolved, especially when large, complicated data are the inputs. To avoid the trap in the kink state, several conventional and empirical methods have been used, such as finding a suitable choice of the initial reference vectors or rearranging the sequence of input vectors.

In this letter, we demonstrate that the asymmetry of the neighborhood function can greatly improve the learning performance, particularly when a kink state appears during the learning process. The reason that the asymmetric neighborhood function is effective in the presence of the kink state is as follows. In the process of getting a kink out of the feature map, the topological defect of the kink must move outside the boundary of the SOM array and vanish. Therefore, the convergence time of the learning process in the presence of the kink state is determined by the speed of the motion of topological defects. In general, the motion of topological defects depends on the form of the neighborhood function. In the case of a symmetric gaussian function, which is commonly used in the conventional SOM algorithm, the movement of topological defects can be regarded as a random walk, in which the motion of the defect behaves like a diffusion process. It is hypothesized that if an asymmetric neighborhood function is used, the motion of the defect will instead behave like a drift, a faster and more coherent process. Motivated by this idea, we investigate the effect of an asymmetric neighborhood function on the performance of the SOM algorithm, in particular on the possibility that the asymmetry realizes faster ordering than a conventional symmetric neighborhood function in the presence of topological defects.

2 Methods

2.1 One-Dimensional SOM. The SOM constructs a mapping from the input data space to the array of nodes that we call the feature map. To each node $i$, a parametric reference vector $m_i$ is assigned. Through SOM learning, these reference vectors are rearranged according to the following iterative procedure. An input vector $x(t)$ is presented at each time step $t$, and the best matching unit, whose reference vector is closest to the given input vector $x(t)$, is chosen. The distance between the input vector and the reference vector is here prescribed by the Euclidean distance $\|x(t) - m_i\|$ in the input data space. The best matching unit $c$, called the winner, is given by

$$c = \arg\min_i \|x(t) - m_i\|. \tag{2.1}$$
In other words, the data $x(t)$ in the input data space are mapped onto the node $c$ associated with the reference vector $m_i$ closest to $x(t)$. In SOM learning, the update rule for the reference vectors is given by

$$m_i(t+1) = m_i(t) + \alpha \cdot h(r_{ic})\,[x(t) - m_i(t)], \qquad \mathbf{r}_{ic} \equiv \mathbf{r}_i - \mathbf{r}_c, \tag{2.2}$$

where $\alpha$, the learning rate, is some small constant. The function $h(r)$ is called the neighborhood function, in which $r_{ic}$ is the distance from the position $\mathbf{r}_c$ of the winner node $c$ to the position $\mathbf{r}_i$ of a node $i$ on the array of units. A widely used neighborhood function is the gaussian function defined by

$$h(r_{ic}) = \exp\left( -\frac{r_{ic}^2}{2\sigma^2} \right). \tag{2.3}$$
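A minimal sketch of one learning iteration, equations 2.1 to 2.3, might look as follows in plain NumPy (variable names are ours; the parameter values are small placeholders, not those of section 2.4):

```python
# One step of the 1D SOM: winner selection (2.1), gaussian neighborhood
# (2.3), and reference-vector update (2.2).
import numpy as np

rng = np.random.default_rng(0)
N, alpha, sigma = 100, 0.05, 5.0
m = rng.random(N)                      # scalar reference vectors m_i

def som_step(m, x):
    c = np.argmin(np.abs(x - m))       # winner: best matching unit (2.1)
    r = np.arange(N) - c               # array distances r_ic
    h = np.exp(-r**2 / (2 * sigma**2)) # symmetric gaussian neighborhood (2.3)
    return m + alpha * h * (x - m)     # update rule (2.2)

for _ in range(10000):
    m = som_step(m, rng.random())      # inputs uniform in [0, 1]
```

Iterating this step with the asymmetric function of section 2.2 only requires replacing `h`.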
We expect an ordered mapping after iterating the above procedure a sufficient number of times. In this letter, we investigate mainly the one-dimensional SOM, because its simplicity clarifies the essential dynamical process of SOM learning as a first step. Note that the reference vector in the case of the one-dimensional SOM is a scalar value (a one-dimensional vector). In the following sections, we denote the reference vector by $m_i$ and the input vector by $x$.

2.2 Asymmetric Neighborhood Function. The conventional neighborhood function widely used in practical applications is usually symmetric about the origin, that is, the position of the winner node; the gaussian function is an example of such a function. We now consider an asymmetric neighborhood function. We would like a method to transform any given symmetric neighborhood function into an asymmetric one and also to characterize the degree of asymmetry with a one-parameter variable. In addition, in order to single out the effect of asymmetry on the learning process, we require that the overall area of the neighborhood function, $\int_{-\infty}^{\infty} h(r)\,dr$, be preserved under the transformation from a symmetric to an asymmetric function. Keeping in mind the above two points, the transformation of a given symmetric neighborhood function $h(r)$ proceeds as follows (see Figure 1C). Let us define an asymmetry parameter $\beta$ ($\beta \ge 1$), representing the degree of asymmetry, and the unit vector $\mathbf{k}$, indicating the direction of asymmetry. If a unit $i$ is located in the positive direction with respect to $\mathbf{k}$, then the component parallel to $\mathbf{k}$ of the distance from the winner to the unit is scaled by $1/\beta$. If a unit $i$ is located in the negative direction with respect to $\mathbf{k}$, the parallel component of the distance is scaled by $\beta$. Hence, the asymmetric function
$h_\beta(r)$, obtained from its symmetric counterpart $h(r)$, is described by

$$h_\beta(\mathbf{r}_{ic}) = A_\beta \cdot h(\tilde{r}_{ic}), \qquad A_\beta = 2\left( \frac{1}{\beta} + \beta \right)^{-1}, \qquad \tilde{r}_{ic} = \begin{cases} \sqrt{\left(\frac{r_\parallel}{\beta}\right)^2 + \mathbf{r}_\perp^2}, & \text{if } \mathbf{r}_{ic} \cdot \mathbf{k} \ge 0, \\[4pt] \sqrt{(\beta r_\parallel)^2 + \mathbf{r}_\perp^2}, & \text{if } \mathbf{r}_{ic} \cdot \mathbf{k} < 0, \end{cases} \tag{2.4}$$

where $\tilde{r}_{ic}$ is the scaled distance from the winner, $r_\parallel$ is the component of $\mathbf{r}_{ic}$ projected onto $\mathbf{k}$, and $\mathbf{r}_\perp$ collects the remaining components perpendicular to $\mathbf{k}$. For example, in the 1D SOM, the positive direction of asymmetry is the increasing direction of the index of the SOM array. In this case, the asymmetric gaussian function is described by

$$h_\beta(r_{ic}) = \begin{cases} A_\beta \exp\left( -\frac{r_{ic}^2}{2\beta^2\sigma^2} \right), & \text{if } i \ge c, \\[4pt] A_\beta \exp\left( -\frac{\beta^2 r_{ic}^2}{2\sigma^2} \right), & \text{if } i < c. \end{cases} \tag{2.5}$$
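Equation 2.5 translates directly into code; a sketch (assuming, as in the text, that the positive asymmetry direction is the increasing array index):

```python
# 1D asymmetric gaussian of equation 2.5; beta = 1 recovers the symmetric h.
import numpy as np

def h_beta(r_ic, sigma, beta):
    A = 2.0 / (beta + 1.0 / beta)                  # area-preserving factor
    scale = np.where(r_ic >= 0, 1.0 / beta, beta)  # stretch right, squeeze left
    return A * np.exp(-(scale * r_ic) ** 2 / (2 * sigma**2))

r = np.arange(-20, 21)
print(h_beta(r, sigma=5.0, beta=1.5).round(3))
```

The prefactor $A_\beta$ keeps the total area $\int h_\beta(r)\,dr$ equal to that of the symmetric gaussian, which is precisely the normalization requirement stated above.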
In the special case of the asymmetry parameter $\beta = 1$, $h_\beta(r)$ is equal to the original symmetric function $h(r)$. Figure 1D displays an example of an asymmetric gaussian neighborhood function in the two-dimensional array of SOM.

2.3 Topological Order and Distortion of the Feature Map. We define two measures that characterize the property of the feature map. One is the topological order $\eta$, for quantifying the order of the reference vectors in the SOM array. In a topologically ordered state of the 1D SOM, the units of the SOM array should be arranged according to the magnitude of their reference vectors $m_i$, satisfying the condition

$$m_{i-1} \le m_i \le m_{i+1}, \quad \text{or} \quad m_{i-1} \ge m_i \ge m_{i+1}. \tag{2.6}$$

In a kink state, although a large population of the reference vectors satisfies the above order condition 2.6, there are topological defects that violate it. On the feature map, we consider domains within which the reference vectors satisfy the order condition 2.6. The topological order $\eta$ can then be defined as the ratio of the maximum domain size to the total number of units $N$, given by

$$\eta \equiv \frac{\max_l N_l}{N}, \tag{2.7}$$

where $N_l$ is the size of domain $l$.
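One straightforward reading of equation 2.7 in code, taking a domain to be a maximal run of units on which the $m_i$ are monotone (our interpretation of the domain definition):

```python
# Topological order eta (2.7): largest monotonically ordered run / N.
import numpy as np

def topological_order(m):
    s = np.sign(np.diff(m))            # +1 / -1 between neighboring units
    best, run = 1, 1
    for a, b in zip(s[:-1], s[1:]):
        run = run + 1 if a == b else 1 # extend or restart the ordered domain
        best = max(best, run)
    return (best + 1) / len(m)         # domain size in units, divided by N

print(topological_order(np.array([0.0, 0.2, 0.5, 0.4, 0.8, 0.9])))  # 0.5
```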
The other is the distortion $\chi$, which measures the distortion of the feature map. The distribution of reference vectors obtained by SOM learning reproduces qualitative features of the probability density of the input vectors. There are several theoretical studies on the qualitative features of the distribution of the reference vectors depending on the probability density of input vectors (Ritter & Schulten, 1986; Ritter, 1991; Villmann & Claussen, 2006). However, as demonstrated in the following sections, the asymmetry of the neighborhood function tends to distort the distribution of reference vectors, making it quite different from the correct probability density of input vectors. For example, when the probability density of input vectors is uniform in the range [0, 1], a nonuniform distribution of reference vectors is formed through SOM learning with an asymmetric neighborhood function. Hence, for measuring the nonuniformity in the distribution of reference vectors, let us define the distortion $\chi$ as the coefficient of variation of the size distribution of the units' Voronoi tessellation cells,

$$\chi = \frac{\sqrt{\mathrm{Var}(\ell_i)}}{E(\ell_i)}, \tag{2.8}$$

where $\ell_i$ is the size of the Voronoi cell of unit $i$. To eliminate the boundary effect of the SOM algorithm, the Voronoi cells on the edges of the array are excluded from the calculation of the mean and variance of $\ell_i$. When the reference vectors are distributed uniformly, the distortion $\chi$ converges to 0.

2.4 Numerical Simulations. In the numerical simulations, we use the following parameter values: the total number of units $N = 1000$, the learning rate $\alpha = 0.05$ (constant), and the neighborhood radius $\sigma = 50$. The asymmetry parameter is $\beta = 1.5$, and the asymmetric direction $\mathbf{k}$ is set to the positive direction in the array. The aim is to examine the learning performance in the presence of a kink state; for this purpose, we use the initial condition that a single kink appears at the center of the array (see Figure 1E). The initial reference vectors are then given by

$$m_i = 4\,\frac{i(N - i)}{N^2} + \xi_i, \tag{2.9}$$

where $\xi_i$ is a small noise term, uniform in the range $-0.02$ to $0.02$. The probability density of the input vectors $\{x\}$ is the uniform distribution in [0, 1]. Because the density of input vectors is uniform, it follows that the desirable feature map is a linear arrangement of SOM nodes; an increasing map is shown as the dashed line in Figure 1E (a decreasing one is not shown). In many numerical simulations, we confirmed that the following results hold in a wide range of model parameters.
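The initial condition 2.9 and the distortion 2.8 can be sketched together as follows; in 1D, the Voronoi cell of a unit is the interval between the midpoints to its neighboring reference vectors, and, following the text, the two boundary cells are excluded:

```python
# Kink initial condition (2.9) and distortion chi (2.8) in 1D.
import numpy as np

rng = np.random.default_rng(0)
N = 1000
i = np.arange(N)
m = 4 * i * (N - i) / N**2 + rng.uniform(-0.02, 0.02, N)  # single central kink

def distortion(m):
    s = np.sort(m)
    mid = (s[1:] + s[:-1]) / 2         # boundaries between neighboring cells
    cells = np.diff(mid)               # interior Voronoi cell sizes
    return cells.std() / cells.mean()  # coefficient of variation, eq. 2.8

print(distortion(m))
```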
3 Results

In this section, we investigate the ordering process of SOM learning in the presence of a kink state for both symmetric and asymmetric neighborhood functions. Here we use the gaussian function for the original symmetric neighborhood function, as shown in equation 2.3. Figure 2A shows a typical time development of the reference vectors $m_i$. In both cases, almost all topological defects due to the small noise $\xi_i$ vanish within the first 1000 steps. However, in the case of the symmetric neighborhood function, a single kink point remains around the center of the array even after 10,000 steps. In contrast, in the case of the asymmetric function, this kink moves out of the feature map to the right, so that the reference vectors are ordered within 3000 steps. This fast ordering with an asymmetric neighborhood function can also be confirmed in Figure 2B, which shows the time dependence of the topological order $\eta$. In the case of the asymmetric neighborhood function, $\eta$ rapidly converges to 1 (the completely ordered state) within 3000 steps. For the symmetric function, $\eta$ also increases up to 0.5 within 3000 steps, which results from the elimination of almost all the topological defects created by the initial small noise. However, the process of eliminating the last single kink takes a large amount of time (about 18,000 steps). To quantify the performance of the ordering process, let us define the ordering time as the time at which $\eta$ reaches 1 (the completely ordered state). Figure 2D shows the ordering time as a function of the total number of units $N$ for both the asymmetric and symmetric neighborhood functions. It is found that the ordering time scales roughly as $N^3$ and $N^2$ for the symmetric and asymmetric neighborhood functions, respectively. Therefore, the asymmetric neighborhood function accelerates the learning process in the presence of a kink state, with the performance being improved by about one order of magnitude. We should remark that the range $\sigma$ of the neighborhood function in Figure 2D is fixed when $N$ is changed. With a fixed $\sigma$, increasing the number of units $N$ can improve the
Figure 2: The asymmetric neighborhood function enhances the ordering process of SOM, though the asymmetry causes distortion of the feature map. (A) A typical time development of the reference vectors m_i for symmetric and asymmetric neighborhood functions. The input vectors are randomly selected from a uniform distribution on [0, 1]. (B) Time dependence of the topological order η for symmetric and asymmetric neighborhood functions, estimated from 10 trials. The standard deviations are denoted by error bars, which cannot be seen because they are smaller than the graphed symbols. (C) Time dependence of the distortion χ. (D) Ordering time as a function of the total number of units N. Each data point is estimated from 12 trials. The fitting function is Const. · N^γ.
However, one problem arises in the feature map obtained through the learning process with the asymmetric neighborhood function. After 10,000 steps, the distribution of the reference vectors in the feature map develops an unusual bias (see Figure 2A). In general, the distribution of reference vectors in the learned feature map should represent the probability density of the input vectors, which here is uniform on [0, 1]. In the case of the symmetric neighborhood function, the magnification law of the one-dimensional SOM has been studied theoretically (Ritter & Schulten, 1986); by this magnification law, uniform input vectors create a uniform linear map. However, this result cannot be applied to the case of the asymmetric neighborhood function. Instead, the drift force induced by the asymmetric neighborhood function produces a nonuniform, distorted feature map. To measure this distortion of the feature map, we adopt the distortion χ (see equation 2.8), the coefficient of variation of the size distribution of Voronoi cells. Figure 2C shows the time dependence of the distortion χ during learning. In the case of the symmetric neighborhood function, χ eventually converges to almost 0. This indicates that the feature map obtained with the symmetric neighborhood function has an almost uniform size distribution of Voronoi cells, reflecting the uniform probability density of the input vectors. In contrast, in the case of the asymmetric function, χ converges to a finite value (≠ 0).

To solve this distorted-feature-map problem, we introduce an improved algorithm for the asymmetric neighborhood function. The improved algorithm includes two novel steps during learning. First, we introduce an inversion operation on the direction of the asymmetric neighborhood function. After every time interval T, the direction of the asymmetry is reversed, as illustrated in Figure 3. Since the distortion in the feature map is along the direction of the asymmetry of the neighborhood function, this periodic inversion is expected to average out the distortion, leading to the formation of a feature map without distortion. We set T = 3000 in the following numerical simulations. Note that the interval T should be set to a value larger than the typical ordering time for the asymmetric neighborhood function; the optimal value of the flip period T is considered in the discussion. Second, we introduce an operation that decreases the degree of asymmetry of the neighborhood function. The degree of asymmetry is controlled by the single parameter β; when β = 1, the neighborhood function equals the original symmetric function. With this operation, β is gradually decreased toward 1 during learning, as illustrated in Figure 3. In our numerical simulations, we adopt a linearly decreasing function,

β(t) = 1 + (β_0 − 1)(1 − t / t_Total),  (3.1)

where t_Total is the total number of time steps and β_0 is the initial value of the asymmetry parameter β.
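The following is a minimal sketch (assumed code, not the authors') of the improved algorithm's schedule: the asymmetry direction flips every T steps, and β decays linearly to 1 following equation 3.1.

```python
def asymmetry_schedule(t, t_total, beta0=1.5, T=3000):
    """Return (beta, direction) of the asymmetric neighborhood at step t."""
    beta = 1.0 + (beta0 - 1.0) * (1.0 - t / t_total)  # equation 3.1
    direction = +1 if (t // T) % 2 == 0 else -1       # flip every T steps
    return beta, direction
```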
Figure 3: Illustration of the improved algorithm for the asymmetric neighborhood function.
Using this improved algorithm for the asymmetric neighborhood function, we obtain the results summarized in Figure 4. The improved algorithm preserves the fast ordering (see Figure 4A). Furthermore, as shown in Figure 4C, the ordering time scales as N^2 for the improved algorithm, the same as for the original asymmetric neighborhood function. Therefore, the improved algorithm retains the ability to accelerate the learning process. The improved algorithm also suppresses the distortion of the feature map. As shown in Figure 4B, the distortion χ converges to almost 0, which implies that the feature map represents the probability density of the input vectors without distortion. This result can also be confirmed by another measure of distortion of the feature map. Der, Herrmann, and Villmann (1997) proposed an effective visualization method for topological defects and distortions of the feature map that uses the spatial spectral density of the deviation from the uniform ordered feature map. With the deviation defined by u_i ≡ m_i − (1/2)(m_{i−1} + m_{i+1}), the spectral density S(k) is given by

S(k) = |ũ_k|²,  ũ_k = (1/N) Σ_n u_n exp(2πikn/N).  (3.2)
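A minimal sketch (assumed implementation) of this measure: S(k) is the squared magnitude of the discrete Fourier transform of the local deviation u_i, which can be computed with an FFT.

```python
import numpy as np

def spectral_density(m):
    u = m[1:-1] - 0.5 * (m[:-2] + m[2:])  # deviation u_i (interior units)
    u_tilde = np.fft.fft(u) / len(u)      # (1/N) sum_n u_n exp(2 pi i k n / N)
    return np.abs(u_tilde) ** 2           # S(k) = |u_tilde_k|^2
```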
When topological defects or distortions exist in the feature map, characteristic peaks appear in the spectral density at the wave numbers
corresponding to the typical spatial scales of the structures in the feature map. When the feature map is formed successfully, the spectral density becomes nearly flat and has no peaks. Figure 4D shows the time development of the spectral density for the symmetric, the asymmetric, and the improved asymmetric neighborhood functions. For the improved asymmetric neighborhood function, the spectral density rapidly converges to a flat distribution without peaks, whereas some significant peaks remain for the original asymmetric neighborhood function. Although the density finally converges to a flat distribution in the case of the symmetric neighborhood function, the convergence is much slower than for the improved asymmetric neighborhood function. In summary, the improved algorithm with the asymmetric neighborhood function confers the full benefit of both fast ordering and an undistorted feature map.

4 Applicability to More General Situations

In this section, we demonstrate that the asymmetric neighborhood function retains its useful properties in more general situations, such as the one-dimensional ring SOM (periodic boundary condition on the SOM lattice) and the two-dimensional SOM. These results suggest that the proposed method can be widely applied to various types of real data.

4.1 One-Dimensional Ring SOM. Let us consider the case of a ring of SOM units, which is useful when one-dimensional input data are periodic. For example, we assume a uniform distribution on the unit circle as input data, given by

x(t) = {cos θ(t), sin θ(t)},  θ(t) uniform in [0, 2π).  (4.1)
Figure 4: The improved algorithm utilizing an asymmetric neighborhood function not only expedites the learning process but also suppresses the distortion of the feature map. (A) Time dependence of the topological order η (interval T = 3000). After 3000 steps, the SOM array is perfectly ordered in the case of the improved algorithm, as with the original asymmetric neighborhood function; these data points therefore overlap at η = 1. Note that the error bars cannot be seen owing to their smallness, as in Figure 2B. (B) Time dependence of the distortion χ, in the same situation as A. (C) Ordering time as a function of the total number of units N. (D) Another measure of the distortion of the feature map. The spectral density S(k) comprises the Fourier components of the local deformation of the feature map. A salient peak in the spectral density indicates the existence of a distortion or topological defect in the feature map.
For such input data, it is desirable that the position of the SOM units make one rotation as θ varies over one period (see Figure 5A). To start in an artificial kink-like state, we set the initial reference vectors so that the position of the units makes two rotations over one period of θ ∈ [0, 2π), as shown in Figure 5B. The model parameters used in these simulations are the same as in the previous section. Figure 5C shows a typical time development of the reference vectors. During learning, the system must first destroy the initial nonoptimal ordered state and then rearrange itself into the ordered state via an intermediate disordered state. We can see from Figure 5C that the asymmetric neighborhood function achieves faster ordering than the symmetric one. Hence, the asymmetry of the neighborhood function is also effective for quick reformation from the kink-like undesirable state to the optimal state through the intermediate disordered state.
4.2 Two-Dimensional SOM. Next, we show some results for the two-dimensional SOM, which is also important in practical applications. We demonstrate that a similarly fast ordering process can be realized with the improved asymmetric neighborhood function in the two-dimensional SOM (see Figure 5D). The conventional symmetric neighborhood function has trouble correcting the twisted state in the two-dimensional SOM: with the symmetric neighborhood function, this twisted state cannot be corrected even in 200,000 steps, whereas with the improved asymmetric neighborhood function, the feature map converges to the completely ordered map in much less time. This suggests that the asymmetric neighborhood function is also effective in improving the convergence process in higher-dimensional SOMs. Finally, although the advantage of the asymmetric neighborhood function is seen in both of the above cases, further study is required to characterize this improvement of convergence time more fully.
Figure 5: (A) Ordered state of the one-dimensional ring SOM lattice for the given input data, x(t) = {cos θ(t), sin θ(t)}, where θ(t) is randomly generated according to a uniform distribution in [0, 2π). (B) Initial reference vectors intended to create a kink-like state during the learning process. (C) Comparison of typical ordering behaviors in the 1D ring SOM between the symmetric and the asymmetric neighborhood functions. (D) Typical time development of reference vectors in a two-dimensional array of SOM for the cases of symmetric, asymmetric, and improved asymmetric neighborhood functions. The input vectors are randomly selected from a uniform distribution on [0, 1] × [0, 1]. Simulation parameters: α = 0.05, β = 1.5, σ = 5, N = 30 × 30, and T = 2000.
5 Discussion

In this letter, we discussed the learning process of the self-organizing map, especially in the presence of a kink (a 1D topological defect). Once a kink appears in the feature map, the learning process tends to take a very long time to eliminate it and achieve the completely ordered state. Interestingly, even in the presence of a kink, we found that the asymmetry of the neighborhood function enables the system to accelerate the learning process. Compared with the conventional symmetric function with a constant neighborhood range, the convergence time of the learning process is roughly reduced from O(N^3) to O(N^2), where N is the total number of units.

The difference in the scaling exponent of the ordering time can be explained as follows. To produce the correct topological order, all topological defects need to move out of the feature map and vanish, so the movement of the topological defects determines the performance of the ordering process. Let X(t̂) be the net distance of the topological defect from its initial position as a function of the actual steps t̂ in which the defect moves, where t̂ is the number of times the defect has actually stepped in t learning steps. Figure 6A shows the mean distance X(t̂) as a function of the actual steps t̂, estimated from 1000 samples (α = 0.05, β = 1.5, σ = 50, and N = 1000). For the case of a symmetric neighborhood function, the mean distance obeys

X(t̂) ∝ t̂^{1/2}.  (5.1)
This result indicates that the movement of the topological defect is well described by a random walk, because the defect moves to the right or to the left with equal probability.
Figure 6: (A) Average distance X(t̂) of the topological defect from its initial point as a function of the actual steps t̂ in which the defect moved; t̂ is the number of times the defect has actually stepped in t learning steps. Each data point is estimated from 1000 simulations. (B) Effect of the width of the neighborhood function: log-log plot of ordering time versus the scaled number of units N/σ, with σ = 8, 16, 32, 64. The fitting functions are Const. · (N/σ)^k, k = 2, 3. Each data point is estimated from 12 simulations. (C) Distribution of ordering times when the initial reference vectors are generated by uniform random values in [0, 1], estimated from 10,000 simulations. The arrow indicates the mean ordering time in the case of the single-kink initial condition. (D) The optimal value of the flip period T in the improved asymmetric neighborhood function. Three typical cases are presented: short (T = 500), near-optimal (T = 2500), and long (T = 8000) flip periods.
In contrast, for the case of an asymmetric neighborhood function, the mean distance obeys

X(t̂) ∝ t̂,  (5.2)
which indicates that the topological defect undergoes drift motion along the positive direction of the asymmetric function. This drift arises from the asymmetry of the neighborhood function. Thus, equations 5.1 and 5.2 suggest that to eliminate the defect from an N-unit array, the required number of actual defect steps t̂ is of order O(N^2) for the symmetric neighborhood function and O(N) for the asymmetric one.

For the topological defect to move, some unit near the defect needs to be a winner, so that the reference vector associated with the defect is updated. In general, the probability that a unit near the defect becomes a winner depends on both the distribution of the reference vectors of all units and the probability density of the input vectors. As a first approximation, however, it is reasonable to estimate that this probability is proportional to 1/N when the input vectors are randomly selected from a uniform distribution. Therefore, the probability that a defect move occurs at a given learning step is of order O(1/N); in other words, O(N) learning steps are required for each actual move of the defect. Combining these two factors, the convergence time is estimated as O(N^3) and O(N^2) for the symmetric and asymmetric neighborhood functions, respectively, with a constant neighborhood range.

Several conventional techniques have been used to avoid the slow convergence of the learning process in the presence of a kink state, such as choosing a suitable set of initial reference vectors or rearranging the sequence of input vectors. Further, it is well known that using a neighborhood function with a large width is effective in creating an ordered map from a very random initial condition, partially because a kink state is likely to appear with a narrow neighborhood function. However, a large width tends to average out fine, detailed, ordered structures. Therefore, in many cases, the width of the neighborhood function is initially set to be large, such as half the width of the array of units, and is gradually decreased to a small final value. In Figure 6B, we investigate the dependence of the ordering time on the width of the neighborhood function in the presence of a single kink. Figure 6B shows the ordering time as a function of the scaled number of units N/σ; the data follow Const. · (N/σ)^k, with k = 2 for the asymmetric neighborhood function and k = 3 for the symmetric one. Thus, a large σ reduces the effective number of units N, independent of the shape of the neighborhood function. In other words, if the number of units N is fixed, the ordering time is proportional to (1/σ)^k. What is especially important is that even if the ratio N/σ is kept constant, the ordering time with an asymmetric neighborhood function is only about one-tenth of that with a symmetric one. This result implies that the combined use of the asymmetric neighborhood function and the method of adjusting the width of the neighborhood function is more effective for achieving fast ordering than using either technique by itself.
Table 1: Summary of the Scaling Exponent of the Ordering Time with the Total Number of Units N for Various Neighborhood Functions.

            Symmetric         Asymmetric
Gaussian    2.992 ± 0.001     1.912 ± 0.004
Tent        2.980 ± 0.001     1.912 ± 0.003
Parabola    2.977 ± 0.0004    1.919 ± 0.003

Notes: The tent function is h(r) = [1 − r/σ]_+ and the parabola function is h(r) = [1 − (r/σ)²]_+, where [x]_+ = x if x ≥ 0 and [x]_+ = 0 otherwise. The simulation parameters are the same as in Figure 2D.
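For concreteness, the following is a minimal sketch (assumed code) of the three neighborhood-function shapes compared in Table 1; r is the lattice distance from the winner, and [x]_+ = max(x, 0), as in the table notes.

```python
import numpy as np

def gaussian(r, sigma):
    return np.exp(-r**2 / (2 * sigma**2))

def tent(r, sigma):
    return np.maximum(1 - np.abs(r) / sigma, 0.0)   # [1 - r/sigma]_+

def parabola(r, sigma):
    return np.maximum(1 - (r / sigma) ** 2, 0.0)    # [1 - (r/sigma)^2]_+
```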
Although the gaussian neighborhood function is commonly used in SOM, some studies show that the shape of the neighborhood function is important for fast learning. We have examined the performance of the ordering process with various types of neighborhood functions, and the results are summarized in Table 1. Table 1 shows that the result found with the gaussian function holds quantitatively for other types of neighborhood functions, such as the parabola and tent functions. However, in the case of a nonconvex neighborhood function, the dynamical properties of the ordering process change drastically owing to the presence of many metastable states. Erwin, Obermayer, and Schulten (1992) proved that in the case of nonconvex neighborhood functions, there are metastable states in which the system gets trapped during the learning process. In fact, in the case of the concave exponential neighborhood function defined in Erwin et al.'s letter, the system is trapped in a metastable state and cannot reach a completely ordered state. Consequently, to realize a fast ordering process, the neighborhood function should be convex over most of its support.

To investigate the dynamical properties of the topological defect, we have considered a simple situation in which a single topological defect exists around the center of the feature map as the initial condition. However, when the initial reference vectors are set randomly, the number of topological defects appearing in the feature map is not generally equal to one. Therefore, we need to consider the statistical distribution of the ordering time, because the number of topological defects and the convergence process generally depend on the initial conditions. Figure 6C shows the distribution of the ordering time in the 1D SOM when the initial reference vectors are randomly selected from the uniform distribution on [0, 1]. The arrows indicate the average ordering time in the case of the single-kink initial condition. For the asymmetric neighborhood function, the distribution of the ordering time has a single sharp peak near the value of the average ordering time
in the case of a single-kink initial condition. In contrast, for the symmetric neighborhood function, the distribution of the ordering time has two peaks and large variation. By examining the time development of the topological order η, we find that the first, higher and sharper peak results from the fast ordering process in the absence of a kink state. The second, lower and broader peak arises from the slow process in which some kinks appear during learning. In fact, the average position of this broad peak (≥ 7500 steps) is around 1.59 × 10^4 steps, which is close to the ordering time in the case of the single-kink initial condition (1.70 × 10^4 steps). So although a fast ordering process is observed in some successful cases (lucky initial conditions), the average behavior of the ordering process is qualitatively well described by the results obtained in the case of a single-kink initial state.

Finally, we briefly consider the optimal value of the flip period T in the improved asymmetric neighborhood function. If T is too short, the system cannot eliminate the topological defects, because the number of learning steps within one flip period is too small to move the defects out of the edge of the array of units. Therefore, T must be longer than some critical value for ordering the feature map. On the other hand, if T is too long, few flips of the direction of the asymmetric function occur, which tends to interfere with the formation of a complete feature map without distortion. Taking these two extreme cases into account, we expect that an optimal value of the flip period T exists. In fact, this is confirmed in Figure 6D, which shows the time development of the distortion χ for several values of the flip period T. From many simulations, we find that the optimal value of the flip period T is about 2500, which is almost equal to the typical ordering time of the feature map.

Acknowledgments

We thank Koji Kurata and Shigeru Shinomoto for helpful discussions and comments on the manuscript. We are also grateful to Kei Takahashi for preparing the preliminary simulation data. This work was supported by CREST (Core Research for Evolutional Science and Technology) of the Japan Science and Technology Corporation (JST) and by Grants-in-Aid for Scientific Research from the Ministry of Education, Science, Sports, and Culture of Japan: Grants 18047014, 18019019, and 18300079.

References

Der, R., Herrmann, M., & Villmann, T. (1997). Time behavior of topological ordering in self-organizing feature mapping. Biol. Cybern., 77(6), 419–427.

Erwin, E., Obermayer, K., & Schulten, K. (1992). Self-organizing maps: Stationary states, metastability and convergence rate. Biol. Cybern., 67(1), 35–45.
Geszti, T., Csabai, I., Czakó, F., Szakács, T., Serneels, R., & Vattay, G. (1990). Dynamics of the Kohonen map. In Statistical mechanics of neural networks: Proceedings of the XIth Sitges Conference (pp. 341–349). New York: Springer-Verlag.

Hubel, D. H., & Wiesel, T. N. (1962). Receptive fields, binocular interaction and functional architecture in cat's visual cortex. J. Physiol., London, 160(1), 106–154.

Hubel, D. H., & Wiesel, T. N. (1974). Sequence regularity and geometry of orientation columns in monkey striate cortex. J. Comp. Neurol., 158(3), 267–294.

Kohonen, T. (1982). Self-organized formation of topologically correct feature maps. Biol. Cybern., 43(1), 59–69.

Kohonen, T. (2001). Self-organizing maps (3rd ed.). New York: Springer-Verlag.

Ritter, H. (1991). Asymptotic level density for a class of vector quantization processes. IEEE Trans. Neural Netw., 2(1), 173–175.

Ritter, H., & Schulten, K. (1986). On the stationary state of Kohonen self-organizing sensory mapping. Biol. Cybern., 54(2), 99–106.

Takeuchi, A., & Amari, S. (1979). Formation of topographic maps and columnar microstructures in nerve fields. Biol. Cybern., 35(2), 63–72.

Villmann, T., & Claussen, J. C. (2006). Magnification control in self-organizing maps and neural gas. Neural Comp., 18, 446–469.

von der Malsburg, C. (1973). Self-organization of orientation sensitive cells in striate cortex. Kybernetik, 14(2), 85–100.
Received January 16, 2006; accepted October 16, 2006.
LETTER
Communicated by Gal Chechik
Parametric Embedding for Class Visualization Tomoharu Iwata [email protected]
Kazumi Saito [email protected]
Naonori Ueda [email protected] NTT Communication Science Laboratories, Kyoto 619-0237, Japan
Sean Stromsten [email protected] BAE Systems Advanced Information Technologies, Burlington, MA 01803, U.S.A.
Thomas L. Griffiths tom [email protected] Department of Psychology, University of California, Berkeley, Berkeley, CA 94720, U.S.A.
Joshua B. Tenenbaum [email protected] Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA 02142, U.S.A.
We propose a new method, parametric embedding (PE), that embeds objects with the class structure into a low-dimensional visualization space. PE takes as input a set of class conditional probabilities for given data points and tries to preserve the structure in an embedding space by minimizing a sum of Kullback-Leibler divergences, under the assumption that samples are generated by a gaussian mixture with equal covariances in the embedding space. PE has many potential uses depending on the source of the input data, providing insight into the classifier’s behavior in supervised, semisupervised, and unsupervised settings. The PE algorithm has a computational advantage over conventional embedding methods based on pairwise object relations since its complexity scales with the product of the number of objects and the number of classes. We demonstrate PE by visualizing supervised categorization of Web pages, semisupervised categorization of digits, and the relations of words and latent topics found by an unsupervised algorithm, latent Dirichlet allocation.
Neural Computation 19, 2536–2556 (2007)
C 2007 Massachusetts Institute of Technology
1 Introduction

Recently there has been great interest in algorithms for constructing low-dimensional feature-space embeddings of high-dimensional data sets. These algorithms seek to capture some aspect of the data set's intrinsic structure in a low-dimensional representation that is easier to visualize or more efficient to process by other learning algorithms. Typical embedding algorithms take as input a matrix of data coordinates in a high-dimensional ambient space (e.g., PCA; Jolliffe, 1980) or a matrix of metric relations between pairs of data points (e.g., MDS, Torgerson, 1958; Isomap, Tenenbaum, de Silva, & Langford, 2000; stochastic neighbor embedding (SNE), Hinton & Roweis, 2002). The algorithms generally attempt to map nearby input points onto nearby points in the output embedding.

Here we consider a different sort of embedding problem, with two sets of points X = {x_1, . . . , x_N} and C = {c_1, . . . , c_K}, which we call objects (X) and classes (C). The input consists of conditional probabilities p(c_k|x_n) associating each object x_n with each class c_k. Many kinds of data take this form: for a classification problem, C may be the set of classes and p(c_k|x_n) the posterior distribution over these classes for each object x_n; in a marketing context, C might be a set of products and p(c_k|x_n) the probabilistic preferences of a consumer; or in language modeling, C might be a set of semantic topics and p(c_k|x_n) the distribution over topics for a particular document, as produced by a method like latent Dirichlet allocation (LDA; Blei, Ng, & Jordan, 2003). Typically the number of classes is much smaller than the number of objects, K ≪ N. We seek a low-dimensional embedding of both objects and classes such that the distance between object x_n and class c_k is monotonically related to the probability p(c_k|x_n). This embedding simultaneously represents not only the relations between objects and classes but also the relations within the set of objects and within the set of classes, each defined in terms of relations to points in the other set. That is, objects that tend to be associated with the same classes should be embedded nearby, as should classes that tend to have the same objects associated with them. Our primary goals are visualization and structure discovery, so we typically work with two- or three-dimensional embeddings.

Object-class embeddings have many potential uses, depending on the source of the input data. If p(c_k|x_n) represents the posterior probabilities from a supervised Bayesian classifier, an object-class embedding provides insight into the behavior of the classifier: how well separated the classes are, where the errors cluster, whether there are clusters of objects that slip through a crack between two classes, which objects are not well captured by any class, and which classes are intrinsically most confusable with each other. Answers to these questions could be useful for improved classifier design. The probabilities p(c_k|x_n) may also be the product of unsupervised or semisupervised learning, where the classes c_k represent components in
a generative mixture model. Then an object-class embedding shows how well the intrinsic structure of the objects (and, in a semisupervised setting, any given labels) accords with the clustering assumptions of the mixture model.

Our specific formulation of the embedding problem assumes that each class c_k can be represented by a spherical gaussian distribution in the embedding space, so that the embedding as a whole represents a simple gaussian mixture model for each object x_n. We seek an embedding that matches the conditional probabilities for each object under this gaussian mixture model to the input probabilities p(c_k|x_n). Minimizing the Kullback-Leibler (KL) divergence between these two probability distributions leads to an efficient algorithm, which we call parametric embedding (PE).

The rest of this letter is organized as follows. In the next section, we formulate PE, and in section 3, we describe the optimization procedures. In section 4, we briefly review related work. In section 5, we compare PE with conventional methods by visualizing classified Web pages. In section 6, we visualize handwritten digits with two classifiers and show that PE can visualize the characteristics of the assumed models as well as the given data. In section 7, we show that PE can visualize latent topic structure discovered by an unsupervised method. Finally, we present concluding remarks and a discussion of future work in section 8.

2 Parametric Embedding

Given as input conditional probabilities p(c_k|x_n), PE seeks an embedding of objects with coordinates R = {r_n}_{n=1}^N and classes with coordinates Φ = {φ_k}_{k=1}^K, such that p(c_k|x_n) is approximated as closely as possible by the conditional probabilities under the assumption of a unit-variance spherical gaussian mixture model in the embedding space:

p(c_k|r_n) = p(c_k) exp(−(1/2)‖r_n − φ_k‖²) / Σ_{l=1}^K p(c_l) exp(−(1/2)‖r_n − φ_l‖²),  (2.1)

where ‖·‖ is the Euclidean norm in the embedding space. The dimension of the embedding space is D, and r_n ∈ R^D, φ_k ∈ R^D.
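The following is a minimal sketch (assumed code, not the authors' implementation) of equation 2.1: the class posterior of each embedded object r_n under a unit-variance spherical gaussian mixture with centers φ_k and priors p(c_k).

```python
import numpy as np

def embedding_posteriors(R, Phi, prior):
    # R: (N, D) object coordinates; Phi: (K, D) class coordinates;
    # prior: (K,) class priors p(c_k).
    sq_dist = ((R[:, None, :] - Phi[None, :, :]) ** 2).sum(-1)  # (N, K)
    logits = np.log(prior)[None, :] - 0.5 * sq_dist
    logits -= logits.max(axis=1, keepdims=True)  # for numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)      # rows are p(c_k | r_n)
```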
When the conditional probabilities p(c_k|x_n) arise as posterior probabilities from a mixture model, we will typically also be given the priors p(c_k) as input; otherwise, the p(c_k) terms above may be assumed equal. Under this model in the embedding space, if the Euclidean distance between object r_n and class φ_k is small, the conditional probability p(c_k|r_n) becomes high. Therefore, we can read the input conditional probabilities off the visualization result. It is natural to measure the degree of correspondence between the input probabilities and the embedding-space probabilities using a sum of KL divergences, one for each object: Σ_{n=1}^N KL(p(c_k|x_n) ‖ p(c_k|r_n)). Minimizing this
sum with regard to p(c_k|r_n) is equivalent to minimizing the objective function

E(R, Φ) = − Σ_{n=1}^N Σ_{k=1}^K p(c_k|x_n) log p(c_k|r_n).  (2.2)
Gradients of E with regard to r_n and φ_k are, respectively (see appendix A),

∂E/∂r_n = Σ_{k=1}^K (p(c_k|x_n) − p(c_k|r_n))(r_n − φ_k),  (2.3)

∂E/∂φ_k = Σ_{n=1}^N (p(c_k|x_n) − p(c_k|r_n))(φ_k − r_n).  (2.4)
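A minimal sketch (assumed code) of these gradients, reusing embedding_posteriors from above; P_in is the (N, K) matrix of input probabilities p(c_k|x_n).

```python
def gradients(R, Phi, prior, P_in):
    P_emb = embedding_posteriors(R, Phi, prior)  # p(c_k | r_n), (N, K)
    diff = P_in - P_emb
    # dE/dr_n = sum_k diff[n, k] * (r_n - phi_k)   (equation 2.3);
    # since each row of diff sums to 0, this reduces to -diff @ Phi,
    # matching the simplified form at the end of appendix A.
    grad_R = diff.sum(1, keepdims=True) * R - diff @ Phi
    # dE/dphi_k = sum_n diff[n, k] * (phi_k - r_n)  (equation 2.4)
    grad_Phi = diff.sum(0)[:, None] * Phi - diff.T @ R
    return grad_R, grad_Phi
```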
These learning rules have an intuitive interpretation (analogous to those in SNE) as a sum of forces pulling or pushing r_n (φ_k), depending on the difference between the conditional probabilities. Importantly, the Hessian of E with regard to r_n is a positive semidefinite matrix (see appendix B):

∂²E/∂r_n∂r_n^T = Σ_{k=1}^K p(c_k|r_n) φ_k φ_k^T − (Σ_{k=1}^K p(c_k|r_n) φ_k)(Σ_{k=1}^K p(c_k|r_n) φ_k)^T,  (2.5)

since the right-hand side of equation 2.5 is exactly a covariance matrix, where T denotes transpose. When we add regularization terms to the above objective function as follows,

J(R, Φ) = E(R, Φ) + η_r Σ_{n=1}^N ‖r_n‖² + η_φ Σ_{k=1}^K ‖φ_k‖²,  η_r, η_φ > 0,  (2.6)
the Hessian of the objective function J with regard to r_n becomes positive definite. Thus, we can find the globally optimal solution for R given Φ. Consequently, the visualization result depends only on the initial coordinates of the classes Φ, not on the initial coordinates of the objects R. In contrast, the results of conventional nonlinear embedding methods (e.g., SNE) depend on the initial coordinates of the objects R. Therefore, we can obtain more stable results than conventional nonlinear methods when the number of classes is much smaller
than the number of objects. The dependence on initial conditions was small in our experiments (see section 5.4).

3 Algorithm

We minimize the objective function J by alternately optimizing R while fixing Φ and optimizing Φ while fixing R, until J converges. The optimization procedure can be summarized as follows:

1. Initialize R and Φ randomly.
2. Calculate {∂J/∂r_n}_{n=1}^N.
3. If ‖∂J/∂r_n‖² < ε_r for all n = 1, . . . , N, go to step 6.
4. Calculate the modification vectors {Δr_n}_{n=1}^N.
5. Update R by r_n = r_n + Δr_n, and go to step 2.
6. Calculate ∂J/∂φ.
7. If ‖∂J/∂φ‖² < ε_φ, output R and Φ, and terminate.
8. Calculate the modification vector Δφ.
9. Update Φ by φ = φ + Δφ, and go to step 2.

Here φ = (φ_1^T, . . . , φ_K^T)^T, and ε_r and ε_φ are convergence precisions. In step 1, if we have information about the classes, we may use the result of another embedding method, such as MDS, as the initial values for Φ. From step 2 to step 5, R is moved to minimize J while Φ is fixed. In step 4, following the Newton method, Δr_n is calculated using the Hessian with regard to r_n:

Δr_n = −(∂²J/∂r_n∂r_n^T)^{−1} ∂J/∂r_n.  (3.1)
Since ∂²J/∂r_n∂r_n^T is positive definite, as described above, the inverse always exists. From step 6 to step 9, Φ is moved to minimize J while R is fixed. In step 9, following a quasi-Newton method, Δφ is calculated as

Δφ = −λ G^{−1} ∂J/∂φ,  (3.2)
where G^{−1} is an approximation of (∂²J/∂φ∂φ^T)^{−1} calculated by limited-memory Broyden-Fletcher-Goldfarb-Shanno (BFGS; Saito & Nakano, 1997), and the step length λ is chosen so as to minimize the objective function.
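A minimal sketch (assumed code) of the Newton step for the object coordinates (equation 3.1): the Hessian of J with respect to r_n is the covariance-like matrix of equation 2.5 plus 2η_r I from the regularizer, which makes it positive definite and hence invertible.

```python
def newton_step_R(R, Phi, prior, P_in, eta_r=0.1):
    P_emb = embedding_posteriors(R, Phi, prior)
    N, D = R.shape
    R_new = np.empty_like(R)
    for n in range(N):
        p = P_emb[n]                                   # (K,)
        mean_phi = p @ Phi                             # sum_k p_k phi_k
        cov = (Phi.T * p) @ Phi - np.outer(mean_phi, mean_phi)  # eq. 2.5
        H = cov + 2.0 * eta_r * np.eye(D)              # Hessian of J
        g = (P_emb[n] - P_in[n]) @ Phi + 2.0 * eta_r * R[n]  # dJ/dr_n
        R_new[n] = R[n] - np.linalg.solve(H, g)        # equation 3.1
    return R_new
```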
Steps 2 and 6 can be calculated using O(NKD) multiplications, and step 4 can be calculated using O(ND^3) multiplications. The complexity of step 9 is O(KDs), where s is the memory size in limited-memory BFGS (Saito & Nakano, 1997). Thus, the complexity of a single iteration of PE is O(NK) if we assume that the dimension of the embedding space D and the memory size s are constant. We found experimentally that the number of iterations does not grow with N (see Figure 3).

4 Related Work

MDS and PCA are representative linear embedding methods. MDS embeds objects so as to preserve the objects' pairwise distances, and PCA embeds objects so as to maximize variance. These methods can find globally optimal embeddings and are computationally efficient, but they cannot represent nonlinear structure. Recently, therefore, a number of nonlinear embedding methods have been proposed, such as Isomap, locally linear embedding (Roweis & Saul, 2000), SNE, and connectivity-preserving embedding (Yamada, Saito, & Ueda, 2003). However, these nonlinear embedding methods are more computationally expensive than linear methods and PE. Furthermore, these embedding methods do not use any class information.

Fisher linear discriminant analysis (FLDA; Fisher, 1950) and kernel discriminant analysis (KDA; Baudat & Anouar, 2000; Mika, Rätsch, Weston, Schölkopf, & Müller, 1999) are embedding methods that use class information. FLDA embeds objects so as to maximize between-class variance and minimize within-class variance. KDA extends FLDA to nonlinear embedding by using the kernel method. FLDA and KDA are dimensionality-reduction methods for data given as a set of class-object pairs {x_n, c(n)}_{n=1}^N, where c(n) is the class label of object x_n. PE, by contrast, uses conditional class probabilities rather than hard classifications.

PE can be seen as a generalization of stochastic neighbor embedding (SNE). SNE corresponds to a special case of PE in which the objects and classes are identical sets. In SNE, the class conditional probabilities p(c_k|x_n) are replaced by the probability p(x_m|x_n) of object x_n under a gaussian distribution centered on x_m. When the inputs (conditional probabilities) to PE come from an unsupervised mixture model, PE performs unsupervised dimensionality reduction just like SNE. However, it has several advantages over SNE and other methods for embedding a single set of data points based on their pairwise relations. When class labels are available, it can be applied in supervised or semisupervised modes. Because its computational complexity scales with NK, the product of the number of objects and the number of classes, it can be applied efficiently to data sets with very many objects (as long as the number of classes remains small). In this sense, PE is closely related to landmark MDS (LMDS; de Silva & Tenenbaum, 2003) if we equate classes with landmarks, objects with data points, and −log p(c_k|x_n) with the squared distances input to LMDS. However,
LMDS lacks a probabilistic semantics and is suitable only for unsupervised settings. The formulation of co-occurrence data embedding (Globerson, Chechik, Pereira, & Tishby, 2005) is similar to PE, but it embeds objects of different types based on their co-occurrence statistics, whereas PE embeds objects and classes based on parametric models that describe their relationships. Mei and Shelton (2006) also proposed a method to embed objects of different types, but they focused on visualizing collaborative filtering data with ratings.

5 Visualization of Labeled Data

In this section, we show how PE helps visualize labeled data and compare PE with conventional methods (MDS, Isomap, SNE, FLDA, KDA) in terms of visualization, conditional probability approximation, and computational complexity.

5.1 Experimental Setting. We visualized 5000 Japanese Web pages categorized into 10 topics by the Open Directory Project (http://dmoz.org), where the objects are Web pages and the classes are topic categories. We omitted pages with fewer than 50 words and those in multiple categories. Each page is represented as a word frequency vector, and the class prior and conditional probabilities are obtained from a naive Bayes model (McCallum & Nigam, 1998) trained on these data (see appendix C). The dimension of the word frequency vector is 34,248. We used η_r = 0.1 and η_φ = 50 as the parameters of PE.

5.2 Compared Methods. We used seven methods for comparison that are closely related to PE:
- MDS1: The input is squared Euclidean distances between word frequency vectors divided by their L_2 norms:

  d_ij^{MDS1} = ‖x_i/‖x_i‖ − x_j/‖x_j‖‖².  (5.1)

- MDS2: The input is squared Euclidean distances between conditional probabilities:

  d_ij^{MDS2} = Σ_{k=1}^K (p(c_k|x_i) − p(c_k|x_j))².  (5.2)

- Isomap1: The input is squared Euclidean distances between word frequency vectors divided by their L_2 norms, as in equation 5.1. We used the 10-nearest-neighbor approach to construct the graph.
Figure 1: Visualizations of categorized Web pages by (a) PE, (b) MDS1, (c) MDS2, and (d) Isomap1. Each of the 5000 Web pages is shown by a particle, with shape indicating the page's class (arts, sports, business, computers, health, home, recreation, regional, science, online shopping).
- Isomap2: The input is KL divergences between conditional probabilities:

  d_ij^{Isomap2} = KL(p(c_k|x_i) ‖ p(c_k|x_j)).  (5.3)

  We used the 10-nearest-neighbor approach to construct the graph. The input distances of Isomap need not be symmetric, since the shortest-path distances become symmetric even if the input distances are not.

- SNE: The input is KL divergences between conditional probabilities, as in equation 5.3.

- FLDA: The input is word frequency vectors, reduced to dimension 2000 by PCA, together with their classes. If the dimension of an object is higher than N − K, the between-class covariance matrix becomes singular, and FLDA is not applicable (the small sample size problem;
Figure 1: Continued. (e) Isomap2, (f) SNE, (g) FLDA, (h) KDA.
Fukunaga, 1990). We avoided this problem by using PCA, as in Belhumeur, Hespanha, and Kriegman (1997).

- KDA: The input is word frequency vectors and their classes. We used gaussian kernels with variance 1 and regularization, as in Mika et al. (1999), with regularization parameter 10^{−3}.
5.3 Visualization Results. Figure 1a shows the result of PE. Each point represents a Web page, and its shape represents the class. Pages from the same class cluster together, and closely related pairs of classes, such as sports and health or computers and online shopping, are located nearby. There are few objects near the sports cluster, so sports pages are easy to distinguish from the others. The regional cluster is central and diffuse, with many objects from other classes mixed in; regional is apparently a vague topic. These observations are confirmed by the F-measure for each class (see Table 1), which is the harmonic mean of precision and recall. The precision of class c_k is the fraction of the objects assigned to class c_k that truly belong to it, and
Table 1: F-Measure for Each Class.

Arts     Sports   Business   Computers   Health
0.973    0.978    0.929      0.924       0.967

Home     Recreation   Regional   Science   Online Shopping
0.957    0.958        0.909      0.964     0.941
the recall of class c_k is the fraction of the objects truly belonging to class c_k that are assigned to it. The estimated class is the class with the highest conditional probability. The high F-measure of sports reflects the ease of classifying it, and the low F-measure of regional reflects the difficulty of classifying it. Furthermore, we can visualize not only the relations among classes but also how pages relate to their classes. For example, pages located at the center of a cluster are typical pages for the class, and pages located between clusters have multiple topics. Some pages are located in the clusters of other classes; these may be misclassified pages.

MDS1 and Isomap1 do not use class information; therefore, they yield no clear class structure (see Figures 1b and 1d). Figure 1c shows the result of MDS2. Pages from the same class are embedded close together, but some classes overlap, so we do not see the class structure as clearly as with PE. In the results of Isomap2 and SNE (see Figures 1e and 1f), we can see the class structure as clearly as with PE. Figure 1g shows the result of FLDA. Since FLDA uses class information, pages are more class-clustered than in MDS1. However, many clusters overlap, and it is difficult to understand the relationships among classes; linear embedding methods cannot in general separate all the classes. Figure 1h shows the result of KDA. All clusters are separated perfectly, and we can understand the relationships among classes, but little within-class structure is visible.

5.4 Comparison on Conditional Probability Approximation. We evaluate the degree of conditional probability approximation quantitatively. Since conditional probabilities are not given as inputs to MDS1, Isomap1, FLDA, and KDA, we compare PE, MDS2, Isomap2, and SNE. Let X_k^{high}(h) = {x_{k,1}^{high}, . . . , x_{k,h}^{high}} be the set of h objects with the highest conditional probabilities p(c_k|x_n) for class c_k, and let X_k^{close}(h) = {x_{k,1}^{close}, . . . , x_{k,h}^{close}} be the set of h objects closest to the class center φ_k in the embedding space. If the conditional probabilities are approximated perfectly, X_k^{high}(h) and X_k^{close}(h) should be identical for each class c_k, since high-posterior objects should be embedded close to the class. As a measure of
Figure 2: Experimental comparisons of the degree of conditional probability approximation: precision versus the number of neighbors for PE, MDS2, SNE, and Isomap2. Each error bar of PE represents the standard deviation of 100 results with different initial conditions.
conditional probability approximation, we use the precision between X_k^{high}(h) and X_k^{close}(h):

prec(h) = (1/K) Σ_{k=1}^K (1/h) |X_k^{high}(h) ∩ X_k^{close}(h)|,  (5.4)

where |·| denotes the number of elements in a set.
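A minimal sketch (assumed code) of this precision measure: for each class, compare the h objects with the highest input posterior p(c_k|x_n) against the h objects embedded closest to the class center.

```python
import numpy as np

def precision_at_h(P_in, R, Phi, h):
    N, K = P_in.shape
    total = 0.0
    for k in range(K):
        high = set(np.argsort(-P_in[:, k])[:h])    # X_k^high(h)
        dist = ((R - Phi[k]) ** 2).sum(axis=1)
        close = set(np.argsort(dist)[:h])          # X_k^close(h)
        total += len(high & close) / h
    return total / K                               # equation 5.4
```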
MDS2, Isomap2, and SNE do not output Φ. For these methods, we take the object with the highest conditional probability as a representative point of the class and define the coordinates of class c_k to be the coordinates of that object (i.e., φ_k = r_{arg max_n p(c_k|x_n)}). Figure 2 shows the precisions of PE, MDS2, Isomap2, and SNE as h goes from 10 to 500. Each error bar of PE represents the standard deviation of 100 results with different initial conditions. By this measure, the conditional probability approximation of PE is clearly better than those of the other methods. For Isomap2 and SNE, precision is low for small h, because Isomap2 and SNE preserve not object-class relationships but the objects' pairwise neighborhood relationships. For MDS2, precision decreases as h increases, because the classes overlap, as in Figure 1c.

5.5 Comparison on Computational Time. One of the main advantages of PE is its efficiency. As described in section 3, the computational complexity of a single iteration of PE is O(NK); that is, the computational time increases linearly with the number of objects. We also evaluated the number of iterations of PE experimentally. Figure 3 shows that the number of iterations does not depend on the number of objects; each error bar represents the standard deviation of 1000 results with different initial conditions.
Figure 3: The number of iterations of PE with 10 classes (iterations versus the number of samples). Each error bar represents the standard deviation of 1000 results with different initial conditions.
MDS computes eigenvectors of the matrix B = −(1/2) H A² H, where H = I − (1/N) 1 1^T is the centering matrix and A is the distance matrix. If the input of MDS is squared Euclidean distances (A_ij = ‖x_i − x_j‖²), the complexity of MDS increases linearly with the number of objects under Lanczos methods (Golub & Van Loan, 1996), since the matrix-vector product of B = H X^T X H can be calculated using O(NV) multiplications, where V is the number of nonzero elements in each row of X = (x_1, . . . , x_N). Isomap has O(N^3) complexity (de Silva & Tenenbaum, 2003). SNE has O(N^2) complexity, since it uses the objects' pairwise relationships. FLDA and KDA lead to generalized eigenvalue problems, whose complexity is of the order of the cube of the matrix size.

We measured computational time experimentally, varying the number of objects from 500 to 5000, on a PC with a Xeon 3.2 GHz CPU and 2 GB of memory. Figure 4 shows the results. As a preprocessing step of FLDA, input vectors are reduced by PCA to N − K dimensions if N ≤ 2000. The x-axis and the y-axis show the logarithm of the number of objects and the logarithm of the computational time (sec), respectively. The dotted lines are the regression lines in the log-log plot. Note that preprocessing time (conditional probability estimation in PE, MDS2, Isomap2, and SNE, and dimensionality reduction by PCA in FLDA) is omitted. Table 2 shows the slopes of the regression lines, which represent how computational time increases with the number of objects. The results are consistent with the theoretical computational complexities described above, though not identical to them, since iterative methods are used in all cases. In the case of 5000 objects, the computational time of PE is 3.13 sec. Even taking into account the preprocessing time of PE (2.33 sec), PE is more efficient than Isomap1, Isomap2, SNE, FLDA, and KDA (with
Figure 4: Experimental comparisons of the computational complexity (log-log plot of computational time in seconds versus the number of samples for KDA, SNE, Isomap1, FLDA, Isomap2, PE, MDS1, and MDS2).

Table 2: Slopes of Regression Lines in Figure 4.

PE      MDS1    MDS2    Isomap1   Isomap2   SNE     FLDA    KDA
0.749   0.770   0.834   2.722     2.824     2.232   2.898   2.998
computational times of 1593 sec, 704 sec, 1869 sec, 211 sec, and 6752 sec, respectively).

5.6 Summary of Comparison. In our experiments, we showed that PE approximates conditional probabilities well and is quite efficient compared to conventional methods. MDS is also efficient but does not extract the class structure. SNE and Isomap2 achieve results similar to those of PE but take more time. FLDA and KDA differ from PE in their input information and also take more time.

6 Visualization of Classifiers

The utility of PE for analyzing classifier performance may best be illustrated in a semisupervised setting, with a large unlabeled set of objects and a smaller set of labeled objects. We fit a probabilistic classifier based on the labeled objects, and we would like to visualize the behavior of the classifier applied to the unlabeled objects in a way that suggests how accurate the classifier is likely to be and what kinds of errors it is likely to make. We constructed a simple probabilistic classifier for 2558 handwritten digits (classes 0–4) from the MNIST database. The classifier was based on a mixture model for the density of each class, defined by selecting either 10 or 100 digits uniformly at random from each class and centering a fixed-covariance gaussian (in pixel space) on each of these examples, essentially a soft nearest-neighbor method (see appendix D).
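A minimal sketch (assumed code) of such a soft nearest-neighbor classifier: each class density is a mixture of fixed-variance spherical gaussians centered on the labeled examples, and the resulting posteriors p(c_k|x_n) over all digits are what PE takes as input. The variance value and equal class priors are illustrative assumptions; the exact construction is given in the letter's appendix D.

```python
import numpy as np

def class_posteriors(X, labeled_X, labeled_y, n_classes, var=1.0):
    # X: (N, P) images; labeled_X: (M, P) labeled examples; labeled_y: (M,)
    sq_dist = ((X[:, None, :] - labeled_X[None, :, :]) ** 2).sum(-1)
    resp = np.exp(-sq_dist / (2.0 * var))               # gaussian kernels
    scores = np.stack([resp[:, labeled_y == k].mean(axis=1)
                       for k in range(n_classes)], axis=1)
    return scores / scores.sum(axis=1, keepdims=True)   # p(c_k | x_n)
```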
Figure 5: Parametric embeddings for handwritten digit classification. Each dot represents the coordinates r_n of one image. Boxed numbers represent the class means φ_k. ×'s show labeled examples used to train the classifier. Images of several unlabeled digits are shown for each class.
The posterior distribution over this classifier for all 2558 digits was submitted as input to PE. The resulting embeddings allow us to predict the classifier's patterns of confusions, calculated from the true labels for all 2558 objects. Figure 5 shows embeddings for both 10 labels per class and 100 labels per class. In both cases, we see five clouds of points corresponding to the five classes. The clouds are elongated and oriented roughly toward a common center, forming a star shape (also seen to some extent in our other applications). Objects that concentrate their probability on only one class will lie as far from the center of the plot as possible, ideally even farther than the mean of their class, because this maximizes their posterior probability on that class. Moving toward the center of the plot, objects become increasingly confused with other classes. Relative to using only 10 labels per class, using 100 labels yields clusters that are more distinct, reflecting better between-class discrimination. Also, the labeled examples are more evenly spread throughout each cluster, reflecting more faithful within-class models and less overfitting. In both cases, the 1 class is much closer than any other to the center of the plot, reflecting the fact that instances of other classes tend to be mistaken for 1's. Instances of other classes near the 1 center also tend to look rather "one-like": thinner and more elongated. The dense cluster of points just outside the mean for 1 reflects the fact that 1's are rarely mistaken for other digits. In Figure 5a, the 0 and 3 distributions are particularly overlapping, reflecting that those two digits are most readily confused with each other (apart from 1). The
webbing between the diffuse 2 arm and the tighter 3 arm reflects the large number of 2's taken for 3's. In Figure 5b, that webbing persists, consistent with the observation that (again, apart from many mistaken responses of 1) the confusion of 2's for 3's is the only large-scale error these larger data permit.

7 Visualization of Latent Structure of Unlabeled Data

In the applications above, PE was applied to visualize the structure of classes based at least to some degree on labeled examples. The algorithm can also be used in a completely unsupervised setting, to visualize the structure of a probabilistic generative model based on latent classes. Here we illustrate this application of PE by visualizing a semantic space of word meanings: objects correspond to words, and classes correspond to topics in a latent Dirichlet allocation (LDA) model (Blei et al., 2003) fit to a large corpus (more than 37,000 documents and more than 12,000,000 word tokens) of educational materials for first grade to college (TASA). The problem of mapping a large vocabulary is particularly challenging and, with over 26,000 objects (word types), prohibitively expensive for pairwise methods. Again, PE solves for the configuration shown in about a minute.

In LDA (not to be confused with FLDA), each topic defines a probability distribution over the word types that can occur in a document. This model can be inverted to give the probability that topic c_k was responsible for generating word x_n; these probabilities p(c_k|x_n) provide the input needed to construct a space of word and topic meanings in PE (see appendix E). More specifically, we fit a 50-topic LDA model to the TASA corpus. Then, for each word type, we computed its posterior distribution restricted to a subset of five topics and input these conditional probabilities to PE (with N = 26,243 and K = 5).

Figure 6 shows the resulting embedding. As with the embeddings in Figures 1 and 5, the topics are arranged roughly in a star shape, with a tight cluster of points at each corner of the star corresponding to words that place almost all of their probability mass on that topic. Semantically, the words in these extreme clusters often (though not always) have a fairly specialized meaning particular to the nearest topic. Moving toward the center of the plot, words take on increasingly general meanings. This embedding also shows structures not visible in the previous figures: in particular, dense curves of points connecting every pair of clusters. This pattern reflects the characteristic probabilistic structure of topic models of semantics: in addition to the clusters of words that associate with just one topic, there are many words that associate with just two topics, or just three, and so on. The dense curves in Figure 6 show that for any pair of topics in this corpus, there exists a substantial subset of words that associate with just those topics.
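A minimal sketch (assumed code, approximating the construction detailed in the letter's appendix E) of inverting a topic model to obtain word-level topic posteriors by Bayes' rule from the topic-word distributions p(w|c_k) and topic priors p(c_k):

```python
import numpy as np

def topic_posteriors_per_word(word_given_topic, topic_prior):
    # word_given_topic: (K, V), rows are p(w | c_k); topic_prior: (K,)
    joint = topic_prior[:, None] * word_given_topic      # p(c_k, w)
    return (joint / joint.sum(axis=0, keepdims=True)).T  # (V, K): p(c_k | w)
```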
Figure 6: Parametric embedding for word meanings and topics based on posterior distributions from an LDA model. Each dot represents the coordinates r_n of one word. Large phrases indicate the positions of topic means φ_k (with topics labeled intuitively). Examples of words that belong to one or more topics are also shown.
For words with probability sharply concentrated on two topics, points along these curves minimize the sum of the KL and regularization terms. This kind of distribution is intrinsically high-dimensional and cannot be captured with complete fidelity in any two-dimensional embedding. As shown by the examples labeled in Figure 6, points along the curves connecting two apparently unrelated topics often have multiple meanings or senses that join them to each topic: deposit has both a geological and a financial sense, phase has both an everyday and a chemical sense, and so on.

8 Conclusions and Future Work

We have proposed a probabilistic embedding method, PE, that embeds objects and classes simultaneously. PE takes as input a probability distribution for objects over classes, or more generally of one set of points over another set, and attempts to fit that distribution with a simple class-conditional parametric mixture in the embedding space. Computationally,
PE is inexpensive relative to methods based on similarities or distances between all pairs of objects and converges quickly on many thousands of data points. The visualization results of PE shed light on features of both the data set and the classification model used to generate the input conditional probabilities, as shown in applications to classified Web pages, partially classified digits, and the latent topics discovered by an unsupervised method, LDA. PE may also prove useful for similarity-preserving dimension reduction, where the high-dimensional model is not of primary interest, or more generally, in analysis of large conditional probability tables that arise in a range of applied domains. As an example of an application we have not yet explored, purchases, Web surfing histories, and other preference data naturally form distributions over items or categories of items. Conversely, items define distributions over people or categories thereof. Instances of such dyadic data abound (restaurants and patrons, readers and books, authors and publications, species and foods), with patterns that might be visualized. PE provides a tractable, principled, and effective visualization method for large volumes of such data for which pairwise methods are not appropriate.
Appendix A: Gradients

This appendix describes the gradients of the objective function E with respect to r_n and φ_k. We rewrite the objective function, equation 2.2, as follows:

E(R, \Phi) = -\sum_{n=1}^{N} \sum_{k=1}^{K} p(c_k|x_n) \log p(c_k|r_n)

= -\sum_{n=1}^{N} \sum_{k=1}^{K} p(c_k|x_n) \left[ \log p(c_k) - \tfrac{1}{2}\|r_n - \phi_k\|^2 - \log \sum_{l=1}^{K} p(c_l) \exp\left( -\tfrac{1}{2}\|r_n - \phi_l\|^2 \right) \right]

= -\sum_{n=1}^{N} \left[ \sum_{k=1}^{K} p(c_k|x_n) \log p(c_k) - \tfrac{1}{2} \sum_{k=1}^{K} p(c_k|x_n)\|r_n - \phi_k\|^2 - \log \sum_{k=1}^{K} p(c_k) \exp\left( -\tfrac{1}{2}\|r_n - \phi_k\|^2 \right) \right].   (A.1)
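For readers who want to experiment with the objective, a minimal numerical sketch of equation A.1 follows (not part of the original letter; the function and variable names, such as pe_objective and P_cx, are ours):

```python
import numpy as np
from scipy.special import logsumexp

def pe_objective(R, Phi, P_cx, p_c):
    """Equation A.1: R is (N, 2), Phi is (K, 2), P_cx is (N, K), p_c is (K,)."""
    D = ((R[:, None, :] - Phi[None, :, :]) ** 2).sum(-1)   # ||r_n - phi_k||^2
    logits = np.log(p_c)[None, :] - 0.5 * D                # log p(c_k) - D/2
    log_pcr = logits - logsumexp(logits, axis=1, keepdims=True)  # log p(c_k|r_n)
    return -(P_cx * log_pcr).sum()
```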
Differentiating equation A.1 with respect to r_n, we obtain:

\frac{\partial E}{\partial r_n} = \sum_{k=1}^{K} p(c_k|x_n)(r_n - \phi_k) - \frac{\sum_{k=1}^{K} (r_n - \phi_k)\, p(c_k) \exp(-\tfrac{1}{2}\|r_n - \phi_k\|^2)}{\sum_{k=1}^{K} p(c_k) \exp(-\tfrac{1}{2}\|r_n - \phi_k\|^2)}

= \sum_{k=1}^{K} \left( p(c_k|x_n) - p(c_k|r_n) \right)(r_n - \phi_k)

= \sum_{k=1}^{K} \left( p(c_k|r_n) - p(c_k|x_n) \right) \phi_k,   (A.2)

where the last equality uses \sum_{k} p(c_k|x_n) = \sum_{k} p(c_k|r_n) = 1, so the terms in r_n cancel.
Differentiating equation A.1 with respect to φ_k, we obtain:

\frac{\partial E}{\partial \phi_k} = -\sum_{n=1}^{N} p(c_k|x_n)(r_n - \phi_k) + \sum_{n=1}^{N} \frac{(r_n - \phi_k)\, p(c_k) \exp(-\tfrac{1}{2}\|r_n - \phi_k\|^2)}{\sum_{l=1}^{K} p(c_l) \exp(-\tfrac{1}{2}\|r_n - \phi_l\|^2)}

= \sum_{n=1}^{N} \left( p(c_k|x_n) - p(c_k|r_n) \right)(\phi_k - r_n).   (A.3)
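A corresponding sketch of the gradient computations in equations A.2 and A.3, under the same assumed conventions as pe_objective above:

```python
import numpy as np
from scipy.special import softmax

def pe_gradients(R, Phi, P_cx, p_c):
    """Gradients A.2 and A.3; returns dE/dR of shape (N, 2) and dE/dPhi of (K, 2)."""
    D = ((R[:, None, :] - Phi[None, :, :]) ** 2).sum(-1)
    P_cr = softmax(np.log(p_c)[None, :] - 0.5 * D, axis=1)   # p(c_k|r_n), (N, K)
    dE_dR = (P_cr - P_cx) @ Phi                              # eq. A.2
    W = P_cx - P_cr                                          # weights in eq. A.3
    dE_dPhi = W.sum(0)[:, None] * Phi - W.T @ R              # sum_n w_nk (phi_k - r_n)
    return dE_dR, dE_dPhi
```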
Appendix B: Hessian

This appendix describes the Hessian of the objective function E with respect to r_n. Differentiating equation A.2 with respect to r_n^T, we obtain:

\frac{\partial^2 E}{\partial r_n \partial r_n^T} = -\sum_{k=1}^{K} \phi_k \left[ \frac{(r_n - \phi_k)^T p(c_k) \exp(-\tfrac{1}{2}\|r_n - \phi_k\|^2)}{\sum_{l=1}^{K} p(c_l) \exp(-\tfrac{1}{2}\|r_n - \phi_l\|^2)} - \frac{p(c_k) \exp(-\tfrac{1}{2}\|r_n - \phi_k\|^2) \sum_{l=1}^{K} (r_n - \phi_l)^T p(c_l) \exp(-\tfrac{1}{2}\|r_n - \phi_l\|^2)}{\left( \sum_{l=1}^{K} p(c_l) \exp(-\tfrac{1}{2}\|r_n - \phi_l\|^2) \right)^2} \right]

= -\sum_{k=1}^{K} p(c_k|r_n) \phi_k r_n^T + \sum_{k=1}^{K} p(c_k|r_n) \phi_k \phi_k^T + \left( \sum_{k=1}^{K} p(c_k|r_n) \phi_k \right) r_n^T - \left( \sum_{k=1}^{K} p(c_k|r_n) \phi_k \right) \left( \sum_{k=1}^{K} p(c_k|r_n) \phi_k \right)^T

= \sum_{k=1}^{K} p(c_k|r_n) \phi_k \phi_k^T - \left( \sum_{k=1}^{K} p(c_k|r_n) \phi_k \right) \left( \sum_{k=1}^{K} p(c_k|r_n) \phi_k \right)^T.   (B.1)
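Equation B.1 is exactly the covariance of the class coordinates φ_k under p(c_k|r_n), hence positive semidefinite, so E is convex in each r_n for fixed Φ. A small sketch under the same assumed conventions as above:

```python
import numpy as np
from scipy.special import softmax

def pe_hessian(r_n, Phi, p_c):
    """Equation B.1 for a single object coordinate r_n (shape (2,))."""
    d = ((r_n[None, :] - Phi) ** 2).sum(-1)
    p = softmax(np.log(p_c) - 0.5 * d)               # p(c_k|r_n), shape (K,)
    mean = p @ Phi                                    # E[phi] under p(c|r_n)
    return (Phi.T * p) @ Phi - np.outer(mean, mean)   # a (2, 2) covariance matrix
```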
Appendix C: Web Page Classifier

We explain here the naive Bayes model used for the estimation of conditional probabilities in section 5. We assume that a Web page (considered as a bag of words) x_n in class c_k is generated from a multinomial distribution as follows:

p(x_n|c_k) \propto \prod_{j=1}^{V} \theta_{kj}^{x_{nj}},   (C.1)

where V is the number of word types, x_{nj} is the number of tokens of word type w_j in a page x_n, and θ_{kj} is the probability that a word token is of type w_j in a page of class c_k (θ_{kj} > 0, \sum_{j=1}^{V} \theta_{kj} = 1). We approximate θ_{kj} by its maximum a posteriori (MAP) estimate:

\hat{\theta}_{kj} = \frac{\sum_{n \in C_k} x_{nj} + \lambda_k}{N_k + \lambda_k V},   (C.2)
where N_k is the number of pages in class c_k, C_k is the set of pages in class c_k, and λ_k is a hyperparameter. We estimated λ_k by a leave-one-out cross-validation method.

Appendix D: Handwritten Digits Classifier

The handwritten digits classifier discussed in section 6 represents each class as a mixture of gaussians. The mean of each component is a random sample from the class, and the covariance of each component is the covariance Σ of the entire data set:

p(x_n|c_k) \propto \sum_{m \in C_k} \exp\left( -\tfrac{1}{2} (x_n - x_m)^T \Sigma^{-1} (x_n - x_m) \right),   (D.1)
where x_n is a 256-dimensional vector of pixel gray levels for a handwritten digit, and C_k is the set of samples defining the model for class c_k.

Appendix E: Latent Topic Model

The conditional probabilities of topics in section 7 are from a latent Dirichlet allocation (LDA) model of a text corpus. In LDA, each document is assumed to be generated by a mixture of latent topics. The topic proportion vector for each document is drawn from a Dirichlet distribution. Each word is generated by first drawing a topic from this distribution and then drawing a word from a topic-specific multinomial distribution. Let x be a document, w_m be the mth word in the
document, M be the number of words in the document, z_k be a latent topic, and K be the number of latent topics. The generative model of a document is as follows:

p(x) = \int_{\theta} \int_{\psi} \left[ \prod_{m=1}^{M} \sum_{k=1}^{K} p(w_m|z_k, \psi)\, p(z_k|\theta) \right] p(\psi|\beta)\, p(\theta|\alpha)\, d\psi\, d\theta,   (E.1)
where p(w_m|z_k, ψ) and p(z_k|θ) are multinomial distributions, and p(ψ|β) and p(θ|α) are Dirichlet distributions. We estimated ψ_{km} (the probability of word w_m given latent topic z_k) by Gibbs sampling (Gilks, Richardson, & Spiegelhalter, 1996; Griffiths & Steyvers, 2004) and obtained the conditional probabilities (probabilities of latent topics given a word) by Bayesian inversion.

References

Baudat, G., & Anouar, F. (2000). Generalized discriminant analysis using a kernel approach. Neural Computation, 12, 2385–2404.
Belhumeur, P., Hespanha, P., & Kriegman, D. (1997). Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection. IEEE Trans. Pattern Analysis and Machine Intelligence, 19, 711–720.
Blei, D., Ng, A., & Jordan, M. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022.
de Silva, V., & Tenenbaum, J. (2003). Global versus local methods in nonlinear dimensionality reduction. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15 (pp. 705–712). Cambridge, MA: MIT Press.
Fisher, R. (1950). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7, 179–188.
Fukunaga, K. (1990). Introduction to statistical pattern recognition. New York: Academic Press.
Gilks, W. R., Richardson, S., & Spiegelhalter, D. J. (1996). Markov chain Monte Carlo in practice. New York: Chapman & Hall.
Globerson, A., Chechik, G., Pereira, F., & Tishby, N. (2005). Euclidean embedding of co-occurrence data. In L. K. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing systems, 17 (pp. 497–504). Cambridge, MA: MIT Press.
Golub, G., & Van Loan, C. (1996). Matrix computations (3rd ed.). Baltimore, MD: Johns Hopkins University Press.
Griffiths, T., & Steyvers, M. (2004). Finding scientific topics. Proceedings of the National Academy of Sciences, 101, 5228–5235.
Hinton, G., & Roweis, S. (2002). Stochastic neighbor embedding. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15 (pp. 833–840). Cambridge, MA: MIT Press.
Jolliffe, I. (1980). Principal component analysis. New York: Springer-Verlag.
McCallum, A., & Nigam, K. (1998). A comparison of event models for naive Bayes text classification. In Proceedings of the AAAI Workshop on Learning for Text Categorization (pp. 41–48). New York: AAAI Press.
Mei, G., & Shelton, C. (2006). Visualization of collaborative data. In Proceedings of the International Conference on Uncertainty in Artificial Intelligence (pp. 341–348). Cambridge, MA: AUAI Press.
Mika, S., Rätsch, G., Weston, J., Schölkopf, B., & Müller, K.-R. (1999). Fisher discriminant analysis with kernels. In Proceedings of the IEEE Neural Networks for Signal Processing IX (pp. 41–48). Piscataway, NJ: IEEE Press.
Roweis, S., & Saul, L. (2000). Nonlinear dimensionality reduction by locally linear embedding. Science, 290, 2323–2326.
Saito, K., & Nakano, R. (1997). Partial BFGS update and efficient step-length calculation for three-layer neural networks. Neural Computation, 9, 123–141.
Tenenbaum, J., de Silva, V., & Langford, J. (2000). A global geometric framework for nonlinear dimensionality reduction. Science, 290, 2319–2323.
Torgerson, W. (1958). Theory and methods of scaling. New York: Wiley.
Yamada, T., Saito, K., & Ueda, N. (2003). Cross-entropy directed embedding of network data. In Proceedings of the International Conference on Machine Learning (pp. 832–839). New York: AAAI Press.
Received January 11, 2006; accepted September 6, 2006.
LETTER
Communicated by Luis B. Almeida
MISEP Method for Postnonlinear Blind Source Separation Chun-Hou Zheng [email protected] Intelligent Computing Lab, Institute of Intelligent Machines, Chinese Academy of Sciences, Hefei, Anhui, 230031, China, and School of Information and Technology, Qufu Normal University, Rizhao, Shandong, 276826, China
De-Shuang Huang [email protected] Intelligent Computing Lab, Institute of Intelligent Machines, Chinese Academy of Sciences, Hefei, Anhui, 230031, China
Kang Li [email protected]
George Irwin [email protected] School of Electronics, Electrical Engineering and Computer Science, Queen’s University of Belfast, Belfast BT7 1NN, U.K.
Zhan-Li Sun sun [email protected] Intelligent Computing Lab, Institute of Intelligent Machines, Chinese Academy of Sciences, Hefei, Anhui, 230031, China
In this letter, a standard postnonlinear blind source separation algorithm is proposed, based on the MISEP method, which is widely used in linear and nonlinear independent component analysis. To best suit a wide class of postnonlinear mixtures, we adapt the MISEP method to incorporate a priori information about the mixtures. In particular, a group of three-layered perceptrons and a linear network are used as the unmixing system to separate sources in the postnonlinear mixtures, and another group of three-layered perceptrons is used as the auxiliary network. The learning algorithm for the unmixing system is then obtained by maximizing the output entropy of the auxiliary network. The proposed method is applied to postnonlinear blind source separation of both simulated signals and real speech signals, and the experimental results demonstrate its effectiveness and efficiency in comparison with existing methods.
Neural Computation 19, 2557–2578 (2007)
C 2007 Massachusetts Institute of Technology
1 Introduction

Blind source separation (BSS) has become an important research subject in the past decade. Most BSS algorithms are based on independent component analysis (ICA) for linear mixture models (Comon, 1994; Hyvärinen, Karhunen, & Oja, 2001; Hyvärinen, 1999). The fundamental idea of ICA is to discover independent components from the mixture of statistically independent source signals. In comparison with linear mixtures, the nonlinear mixture of signals is more prevalent in many real-world applications; however, linear ICA is not applicable to such mixtures in general (Hyvärinen & Pajunen, 1999). Research on nonlinear blind source separation has recently led to the proposal of a number of unmixing algorithms (Burel, 1992; Deco & Brauer, 1995; Pajunen, Hyvärinen, & Karhunen, 1996; Pajunen, 1998; Lee, Koehler, & Orglmeister, 1997; Taleb, Jutten, & Olympieff, 1998; Bell & Sejnowski, 1995; Martinez & Bray, 2003; Valpola, 2000), such as self-organizing maps (Pajunen et al., 1996; Herrmann & Yang, 1996; Lin, Grier, & Cowan, 1996), generative topographic mapping (GTM) (Bishop, Svensen, & Williams, 1998), neural networks (Yang, Amari, & Cichocki, 1997, 1998), Bayesian ensemble learning (Valpola, 2000), and the kernel trick (Martinez & Bray, 2003). In particular, nonlinear ICA has been developed by extending linear ICA to nonlinear mixtures (Deco & Brauer, 1995; Almeida, 2004). Despite these attempts, solid theoretical research on nonlinear BSS is still lacking, and it has been pointed out that nonlinear BSS is possibly a collection of problems that should be addressed on a case-by-case basis (Taleb, 2002).

An important class of nonlinear BSS is the postnonlinear (PNL) mixture, which was introduced by Taleb and Jutten (1997a, 1999), and a number of algorithms have been proposed for it. For example, Achard, Pham, and Jutten (2001) introduced an alternative algorithm to that proposed by Taleb and Jutten; Babaie-Zadeh, Jutten, and Nayebi (2002) proposed a geometric method that requires a sufficient number of samples, and its extension to mixtures of more than two sources seems complicated and difficult. Ziehe, Kawanabe, Harmeling, and Müller (2003) presented two BSS methods using linearizing transformations and temporal decorrelation; Solé-Casals, Babaie-Zadeh, Jutten, and Pham (2003) derived a simple method to increase separation speed using an inverting Wiener system, but it requires weak a priori knowledge about the source distribution; Ilin and Honkela (2004) presented a variational Bayesian approach for noninvertible postnonlinearities; and Ilin, Achard, and Jutten (2004) compared two approaches for separating PNL mixtures, one of their major conclusions being that PNL ICA (Taleb & Jutten, 1999; Ilin et al., 2004) outperforms the Bayesian method in classical PNL mixtures with an equal number of sources and observations when all PNL distortions are invertible.

In this letter, a network paradigm is proposed for the PNL mixture problems, where two groups of three-layered perceptrons and a linear
network are utilized. The gradient descent method is applied to maximize the entropy of the output signals in order to optimize the network parameters. Based on Almeida (2004), a PNL MISEP algorithm is derived for a special case of postnonlinear mixtures by incorporating a priori information hidden in the mixtures. In comparison with other methods (Taleb & Jutten, 1999; Almeida, 2004; Ilin et al., 2004; Solé-Casals et al., 2003), the PNL MISEP algorithm proposed in this letter can either improve the separation precision or speed up the learning process.

This letter is organized as follows. In section 2, we briefly discuss the relation between ICA and BSS, especially for nonlinear mixtures; in addition, we develop the mixing-separating model for PNL mixtures. Section 3 presents several independence criteria between signals, and their relations are discussed. Section 4 details the PNL MISEP algorithm, followed by discussion, and section 5 presents the simulation results. Finally, concluding remarks and future research work are given in section 6.

2 The BSS Problem and Nonlinear ICA

Considering an n-sensor array that produces signals x(t) = (x_1(t), x_2(t), ..., x_n(t))^T, which are the mixtures of n unknown statistically independent sources s(t) = (s_1(t), s_2(t), ..., s_n(t))^T, the general nonlinear mixture model for BSS can be described as follows (Hyvärinen et al., 2001):

x(t) = f(s(t)),   (2.1)
where f is some unknown one-to-one mapping from R^n to R^n (note that x and s need not have the same dimension; we make this assumption here for simplicity). The mixing system is in general memoryless; therefore, the time dependence of the variables can be omitted, and the model can be further simplified as

x = f(s).   (2.2)
The objective now is to find the original signals s from the observed data x. This is referred to as the BSS problem in the literature. Here, "blind" means that we know little (if anything) about the sources. It is noted that independent component analysis (ICA) (Comon, 1994) is closely related to BSS. The basic idea of ICA is to estimate a mapping g such that the components of

y = g(x)   (2.3)

are statistically independent.
The question is to determine whether y generated by ICA is the unknown source s. If they are the same, what are the necessary constraints on the mixtures? To help with understanding the postnonlinear BSS problem, we review the general linear and nonlinear mixtures in the following sections. 2.1 Linear Mixtures. If the mixing function f in equation 2.1 is linear and can be represented by a square matrix A, then equation 2.2 becomes x = As.
(2.4)
The main objective of ICA is to estimate a square nonsingular matrix B (i.e., g(·) in equation 2.3) such that

y = Bx = BAs   (2.5)
has statistically independent components; B is called the separating matrix. Denoting C = BA and using an independence criterion, we can estimate the sources (among which there is at most one gaussian source) up to a diagonal scaling matrix Λ and a permutation matrix P (Comon, 1994). That is, separation is achieved when C = ΛP, which ensures the equivalence between linear ICA and linear BSS.

2.2 Nonlinear Mixtures. As described above, the general nonlinear ICA problem is to estimate a mapping g : R^n → R^n that yields components y = g(x) that are statistically independent, using only the observations x. A fundamental problem with nonlinear ICA is that, in general, it has no unique solution: one can easily design a nonlinear mapping that mixes the sources and still provides statistically independent variables y_i = h_i(s) (Jutten, Zadeh, & Hosseini, 2004). Moreover, even if separation is achieved in the sense that each estimated output depends on only a unique source, y_i = h_i(s_{σ(i)}), where σ(i) is a permutation over {1, 2, ..., n}, strong distortions can still be introduced by the mappings h_i(·). This is because if u and v are two independent random variables, any componentwise transformations ς(u) and ϱ(v) are also statistically independent, provided the nonlinear functions ς(·) and ϱ(·) are both invertible (Hyvärinen & Pajunen, 1999; Jutten et al., 2004; Taleb & Jutten, 1999).

In nonlinear BSS, one needs to find the original signals s that generated the observed data x, with indeterminacies no stronger than the scale and permutation indeterminacies of linear mixtures. Obviously, nonlinear BSS is a distinct and more meaningful problem than nonlinear ICA: even if some arbitrary independent components are found from the observed data using equation 2.3, they could be quite different from the true source signals (Jutten et al., 2004). In other words, ICA and BSS are not equivalent for nonlinear mixtures.

Figure 1: The mixing-separating system for postnonlinear mixtures.

The above analysis shows that nonlinear BSS is a difficult and ill-posed problem. Like other ill-posed problems in practice, regularization is necessary for achieving ICA solutions that are consistent with the corresponding BSS. For this purpose, two main approaches can be used. First, one can solve the nonlinear BSS problem based on the independence assumption, which is feasible only if the mixtures and the separation structure are constrained, as for postnonlinear mixtures (Achard & Jutten, 2005). Second, a priori information about the sources (e.g., bounded or temporally correlated sources) can be assumed, to simplify the algorithms or to reduce the indeterminacies in the solutions (Jutten et al., 2004). In the following, we study the postnonlinear BSS problem and propose an effective algorithm.

2.3 Postnonlinear Mixtures. The postnonlinear mixtures model was introduced by Taleb and Jutten (1997a). It can be regarded as a hybrid model consisting of a linear part in series with a nonlinear part.

2.3.1 PNL Model. In the postnonlinear mixtures model, the observations x(t) = (x_1(t), x_2(t), ..., x_n(t))^T have the following specific form (as shown in Figure 1):

x_i(t) = f_i\!\left( \sum_{j=1}^{n} a_{ij}\, s_j(t) \right), \qquad i = 1, \dots, n.   (2.6)
Equation 2.6 can be reformulated in vector-matrix form: x = f (As).
(2.7)
It is obvious that the postnonlinear mixtures (PNL) model consists of a linear mixture followed by a componentwise nonlinearity f i . Here,
the nonlinear functions f_i are assumed to be differentiable and invertible (Taleb & Jutten, 1999). The PNL model, equation 2.7, which partitions the complex nonlinear mixture into a linear mixing stage followed by a componentwise nonlinear transformation, represents an important class of real-life problems, although it does not model any nonlinear cross-channel mixing. For example, many mixing systems in the communication, speech, and biomedical areas (Larson, 1998; Korenberg & Hunter, 1986; Prakriya & Hatzinakos, 1995; Maeda, Song, & Ishii, 2005; Lee & Batzoglou, 2003; Parashiv-Ionescu, Jutten, & Bouvier, 2002) can be modeled using equation 2.7, as illustrated in Figure 1; examples include the nonlinearities introduced by a sensor's saturation distortions and the nonlinear characteristics of individual receiver channels (Parashiv-Ionescu et al., 2002). Therefore, this research may have many potential applications (Maeda, Song, & Ishii, 2005; Lee & Batzoglou, 2003).

2.3.2 Separability of PNL. According to the above discussion, separability is the most important issue in dealing with nonlinear mixtures. In contrast to general nonlinear mixtures, PNL mixtures have a favorable separability property. If the corresponding separating model for postnonlinear mixtures, as shown in Figure 1, is written as
y_i = \sum_{j=1}^{n} b_{ij}\, g_j(x_j),   (2.8)
then it has been shown (Taleb & Jutten, 1999; Achard & Jutten, 2005) that, under weak conditions on the mixing matrix A and the source distribution, output independence can be obtained if and only if the compositions h_i = g_i ∘ f_i, ∀i = 1, ..., n, are linear. This implies that the sources, which can be estimated using an independence criterion on the outputs, are equal to the unknown sources up to the same indeterminacies observed in linear memoryless mixtures:

y = \Lambda P s,   (2.9)

where Λ is a diagonal scaling matrix and P is a permutation matrix.
The conditions can be set as follows. First, there exist at least two nonzero elements in each row or column of A. Second, the joint density function of the sources is supposed to be differentiable, and its derivative is continuous on its support (no hypothesis is made on the support of the density functions). (For more details, refer to Taleb & Jutten, 1999; Achard & Jutten, 2005.)
3 Mutual Information Minimization

3.1 Independence Criterion. Since statistical independence of the sources is the main assumption in this study, the important issue is to tune the separation structure such that the components of its output y are statistically independent. In practice, there are several methods for measuring mutual dependence, and Shannon's mutual information is perhaps the most popular one (Hyvärinen et al., 2001):

I(y) = \sum_i H(y_i) - H(y),   (3.1)

where H(y_i) = -\int p(y_i) \log p(y_i)\, dy_i denotes Shannon's differential entropy of the random variable y_i (Hyvärinen et al., 2001), and H(y) is the differential entropy of the multidimensional random variable y. The mutual information I(y) measures the amount of information shared by the components of y; it is always nonnegative and vanishes if and only if the components of y are mutually independent, that is, iff

p(y) = \prod_i p(y_i).   (3.2)
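As a rough numerical illustration of equation 3.1 (our own histogram-based sketch, not the estimator developed in this letter):

```python
import numpy as np

def mutual_information(Y, bins=30):
    """Rough histogram estimate of I(y) = H(y1) + H(y2) - H(y1, y2) for Y of shape (T, 2)."""
    def H(p):
        p = p[p > 0] / p.sum()
        return -(p * np.log(p)).sum()
    p12, _, _ = np.histogram2d(Y[:, 0], Y[:, 1], bins=bins)  # joint histogram
    p1, p2 = p12.sum(axis=1), p12.sum(axis=0)                # consistent marginals
    return H(p1) + H(p2) - H(p12.ravel())   # >= 0; near 0 iff components independent
```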
In addition, the mutual information has another important property that will be used in the following: if transformations z_i = ρ_i(y_i) are applied, resulting in new random variables z_i, and these transformations are all continuous and monotonic (and thus invertible), then I(z) = I(y) (Almeida, 2004).

3.2 Mutual Information Minimization Method. The mutual information index has been used as the dependence measure within the nonlinear ICA context (Deco & Brauer, 1995; Yang et al., 1998; Taleb & Jutten, 1997b). However, using the mutual information as a cost function is not easy, since H(y_i) depends directly on the marginal densities p(y_i), which are unknown and vary while I(y) is being minimized. Two methods are often used to tackle this problem. One is to approximate the mutual information using a series expansion (Yang et al., 1998), but the approximation precision is not ensured. The other is to estimate p(y_i) directly from the estimated values of y_i during learning, using some unsupervised learning algorithm (Taleb & Jutten, 1999; Taleb et al., 1998). The disadvantage of the second method as used in the literature (Taleb & Jutten, 1999; Taleb et al., 1998) is that a nested form is used, which is computationally very demanding. In this letter, we use another method to calculate the mutual information criterion.

The structure of the PNL BSS system proposed in this letter is shown in Figure 2, where B and the g_i form the unmixing structure for PNL, the y_i are the extracted independent components, and the ψ_i are nonlinear mappings used only for network optimization. The corresponding cost function is introduced in the next section.

Figure 2: The structure of the PNL BSS system proposed in this letter. The g_i and B blocks perform the PNL ICA operation, and the ψ_i blocks are auxiliary, used only for the optimization.

Assume that each function ψ_i is the cumulative probability function (CPF) of the corresponding component y_i, that is,

z_i = \psi_i(y_i) = \int_{-\infty}^{y_i} p_{y_i}(t)\, dt, \qquad 0 \le z_i \le 1.   (3.3)
Then

p(z_i) = \frac{p(y_i)}{|\partial z_i / \partial y_i|} = \frac{p(y_i)}{p(y_i)} = 1.   (3.4)
That is, the z_i are uniformly distributed within [0, 1]; consequently, H(z_i) = 0 (Almeida, 2004). Therefore, we have

I(y) = I(z) = \sum_i H(z_i) - H(z) = -H(z).   (3.5)
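The effect of equations 3.3 to 3.5 is easy to check numerically; the following small demonstration (ours, with arbitrary parameter choices) pushes a sample through its empirical CPF and recovers an approximately uniform z:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.laplace(size=10000)            # an arbitrary marginal density
ranks = np.argsort(np.argsort(y))      # empirical CPF, eq. 3.3 (rank transform)
z = (ranks + 0.5) / y.size             # z ~ Uniform[0, 1], so H(z_i) = 0
hist, _ = np.histogram(z, bins=20, density=True)
print(hist.round(2))                   # approximately flat at 1.0
```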
According to equation 3.5, maximizing the output entropy H(z) is equivalent to minimizing the mutual information of the extracted components y_i. It has been proved (Almeida, 2004) that if suitable constraints are placed on the functions ψ_i, then maximizing the entropy H(z) leads those functions to become estimates of the CPFs of the corresponding y_i components. The above analysis shows that if we maximize H(z) with appropriate constraints placed on the ψ_i (see section 4.2), I(y) is minimized; consequently, each y_i should be a duplicate of the corresponding s_i, with only its sign and scale unknown. The fundamental problem that remains is to optimize the
networks (the g_i, B, and ψ_i blocks) by maximizing H(z). In the next section, we discuss the proposed algorithm in detail.

4 Unsupervised Learning of the Separating System

4.1 Learning Algorithm. Given the separation structure, the joint probability density function of the vector z can be calculated as

p(z) = \frac{p(x)}{|\det(B)|\, \prod_{i=1}^{n} |g'_i(\theta_i, x_i)|\, \prod_{i=1}^{n} |\psi'_i(\varphi_i, y_i)|},   (4.1)
where g'_i(θ_i, x_i) is the derivative of g_i(θ_i, x_i) with respect to x_i, and ψ'_i(φ_i, y_i) is the derivative of ψ_i(φ_i, y_i) with respect to y_i. The vectors θ and φ are the structure parameters of g and ψ, respectively. From equation 4.1, it follows that

H(z) = H(x) + \log|\det(B)| + \sum_{i=1}^{n} E\left[\log|g'_i(\theta_i, x_i)|\right] + \sum_{i=1}^{n} E\left[\log|\psi'_i(\varphi_i, y_i)|\right].   (4.2)
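Given sampled values of the derivative terms, the data-dependent part of equation 4.2 can be estimated directly; a minimal sketch with our own naming:

```python
import numpy as np

def entropy_gain(B, g_prime, psi_prime):
    """H(z) - H(x) from eq. 4.2; g_prime and psi_prime are (T, n) arrays holding
    sampled derivatives g'_i(theta_i, x_i) and psi'_i(phi_i, y_i)."""
    return (np.log(np.abs(np.linalg.det(B)))
            + np.log(np.abs(g_prime)).sum(axis=1).mean()
            + np.log(np.abs(psi_prime)).sum(axis=1).mean())
```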
Minimizing I (y), which is equivalent to maximizing H(z) here, requires the computation of its gradient with respect to the separation structure parameters B, θ , and ϕ. Since H(x) does not depend on B, θ , and ϕ, its derivatives with respect to these parameters become null, hence:
\frac{\partial H(z)}{\partial B} = \frac{\partial \log|\det(B)|}{\partial B} + \frac{\partial}{\partial B} \sum_{i=1}^{n} E\left[\log|\psi'_i(\varphi_i, y_i)|\right]   (4.3)

\frac{\partial H(z)}{\partial \theta_k} = E\left[\frac{\partial \log|g'_k(\theta_k, x_k)|}{\partial \theta_k}\right] + \frac{\partial}{\partial \theta_k} \sum_{i=1}^{n} E\left[\log|\psi'_i(\varphi_i, y_i)|\right]   (4.4)

\frac{\partial H(z)}{\partial \varphi_k} = E\left[\frac{\partial \log|\psi'_k(\varphi_k, y_k)|}{\partial \varphi_k}\right].   (4.5)
Obviously, these computations depend on the structure of g and ψ. In this letter, we use a multilayer perceptron (MLP) (Huang, 1998) to model each nonlinear parametric function g_k(θ_k, x_k). Thus, it can be written as

g_k(\theta_k, x_k) = g_k(\xi^k, \omega^k, \eta^k, x_k) = \sum_{j=1}^{N_k} \xi_j^k\, \sigma\!\left(\omega_j^k x_k - \eta_j^k\right),   (4.6)
where ω^k and ξ^k are the weight vectors of the input layer and the output layer, respectively, η^k is the hidden units' bias vector, and σ(·) is the activation function of the hidden layer; N_k denotes the number of hidden units in the kth MLP, and the superscript k indexes the vector elements. In addition, we use a multilayer perceptron to model each nonlinear parametric function ψ_k(φ_k, y_k) as

\psi_k(\varphi_k, y_k) = \psi_k(\alpha^k, \beta^k, \mu^k, y_k) = \sum_{j=1}^{M_k} \alpha_j^k\, \tau\!\left(\beta_j^k y_k - \mu_j^k\right).   (4.7)
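For concreteness, the two parametric blocks of equations 4.6 and 4.7 can be sketched as follows (tanh is used here as a stand-in for σ and τ; the letter constrains the ψ blocks differently, as described in the remark of section 4.2):

```python
import numpy as np

def g_block(x_k, xi, omega, eta):
    """Eq. 4.6: scalar-in, scalar-out MLP with N_k hidden units (all args are (N_k,))."""
    return xi @ np.tanh(omega * x_k - eta)

def psi_block(y_k, alpha, beta, mu):
    """Eq. 4.7: auxiliary CPF-estimating MLP with M_k hidden units."""
    return alpha @ np.tanh(beta * y_k - mu)
```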
Figure 3: Structure of the PNL BSS network. The linear subnet is composed of the light gray units, and the nonlinear subnets are composed of the dark gray units.

Figure 3 shows the structure of the PNL BSS network. The gradient-ascent updates for maximizing H(z) are given as

B(t+1) = B(t) + \kappa_B\, \frac{\partial H(z(t))}{\partial B(t)}   (4.8)

\theta_k(t+1) = \theta_k(t) + \kappa_\theta\, \frac{\partial H(z(t))}{\partial \theta_k(t)}   (4.9)

\varphi_k(t+1) = \varphi_k(t) + \kappa_\varphi\, \frac{\partial H(z(t))}{\partial \varphi_k(t)},   (4.10)
where κ_B, κ_θ, and κ_φ are the iteration step sizes, θ_k = (ξ^{kT}, ω^{kT}, η^{kT})^T, and φ_k = (α^{kT}, β^{kT}, μ^{kT})^T. The details of the algorithm are given in the appendix.

4.2 Performance Index. It is well known that the performance of an algorithm can be measured by indices such as efficacy and computational complexity. For the proposed model, we intend to achieve independence of the outputs y of the separation system. Therefore, the
algorithm performance can be measured by I(y):

I(y) = -H(z) = -\left( H(x) + \log|\det(B)| + \sum_{i=1}^{n} E\left[\log|g'_i(\theta_i, x_i)|\right] + \sum_{i=1}^{n} E\left[\log|\psi'_i(\varphi_i, y_i)|\right] \right) = -\left( H(x) - I_P(x; B, \theta, \varphi) \right).   (4.11)
Since H(x) is unchanged during the learning process, we can use I_P(x; B, θ, φ) as the performance index; note that it may take negative values. The performance index is

I_P(x; B, \theta, \varphi) = -\log|\det(B)| - \sum_{i=1}^{n} E\left[\log|g'_i(\theta_i, x_i)|\right] - \sum_{i=1}^{n} E\left[\log|\psi'_i(\varphi_i, y_i)|\right].   (4.12)
For an iterative algorithm, a stopping criterion is needed. For this purpose, we define the amount of change in I_P(x, B(t), θ(t), φ(t)) from instant t to instant t + 1 as

e_r = |I_P(x, B(t+1), \theta(t+1), \varphi(t+1)) - I_P(x, B(t), \theta(t), \varphi(t))|.   (4.13)
If e_r is smaller than a predetermined small positive constant ε, the algorithm terminates. According to the performance index and stopping criterion, the learning algorithm can be summarized as follows:

Step 1: Initialize B, θ, and φ.
Step 2: Adjust B, θ, and φ of the network using equations 4.8 to 4.10.
Step 3: Compute the performance index I_P(x; B, θ, φ) according to equation 4.12 using the current model parameters.
Step 4: Verify the termination criterion (e_r < ε). If it is satisfied, the algorithm stops; otherwise, go to Step 2.

Remark. To improve the efficiency of the algorithm, momentum and adaptive step sizes with error control were used (Almeida, 1997). In addition, in the experiments shown in section 5, N_i = 5 (the number of hidden-layer neurons for the g_i blocks) and M_i = 3 (the number of hidden-layer neurons for the ψ_i blocks). Finally, to implement the constraints on ψ_i so that it becomes an increasing function within a finite interval, inverse tangent sigmoids were chosen as the activation function for the hidden units in the ψ_i blocks, the vector of weights from the hidden units to the output units was normalized at the end of each epoch, and all weights in each ψ_i block were initialized with positive values. Furthermore, the interval of z_i was chosen as [−1, 1] instead of [0, 1], which does not change the fact that the maximum of H(z) corresponds to the minimum of I(y); it also allows the use of bipolar sigmoids in the hidden units, and consequently, a faster training speed can be achieved (Almeida, 2004).

5 Experimental Results

In this section, three experiments were carried out to verify the effectiveness of the proposed method (PNL MISEP). We used artificial signals in the first two experiments; in the third experiment, recorded speech signals were used.

5.1 Experiment 1. In the first experiment, the source signals consist of a sinusoid signal and a curve signal (Hyvärinen et al., 2001): s(t) = [sin(t/5), ((rem(t, 23) − 11)/9)^5]^T. Here rem(u, v) stands for the remainder of u divided by v. The first signal is subgaussian, and the second one is supergaussian. The two source signals, shown in Figure 4a, were first linearly mixed with the (randomly chosen) mixture matrix

A = \begin{pmatrix} -0.4389 & 0.2310 \\ 0.1267 & 0.6564 \end{pmatrix}.   (5.1)
Then the two nonlinear distortion functions

f_1(u) = \tfrac{1}{2}u + \tfrac{1}{6}u^3   (5.2)

f_2(u) = u + \tfrac{3}{5}\tanh(u)   (5.3)
were applied componentwise to the linear mixture u = (u_1, u_2)^T = As to produce a PNL mixture. Thus,

[x_1(t), x_2(t)]^T = [f_1(u_1), f_2(u_2)]^T.   (5.4)
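Because equations 5.1 to 5.4 fully specify the data of this experiment, the PNL mixtures can be regenerated in a few lines (a sketch of ours; 500 samples are used, matching the learning curves of Figure 6):

```python
import numpy as np

t = np.arange(500)
s = np.vstack([np.sin(t / 5.0),
               ((np.remainder(t, 23) - 11) / 9.0) ** 5])   # source signals s(t)
A = np.array([[-0.4389, 0.2310],
              [ 0.1267, 0.6564]])                           # eq. 5.1
u = A @ s                                                   # linear mixture
x = np.vstack([u[0] / 2 + u[0] ** 3 / 6,                    # f1, eq. 5.2
               u[1] + 3 * np.tanh(u[1]) / 5])               # f2, eq. 5.3; eq. 5.4
```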
Figure 4b shows the separated signals using our method. For comparison, we also applied the PNL ICA algorithm developed at INPG (Taleb & Jutten, 1999; Ilin et al., 2004), the general MISEP (G-MISEP) method (Almeida, 2004), and the gauss-TD algorithm (Ziehe et al., 2003) to the same data. Figure 5 shows the separated signals obtained with these three methods. The correlations between the two recovered signals and the two original sources for all four methods are listed in Table 1.
Figure 4: The two sets of signals in experiment 1. (a) Source signals. (b) Separated signals using PNL MISEP.
It is obvious that the separated signals obtained using the method proposed in this letter are closer to the original signals. The CPU times of the four methods on a 2 GHz Pentium IV are also listed in Table 1. We can see that the gauss-TD method is the fastest; however, its separation result is poor. The PNL ICA algorithm has the same network structure as our method, yet its estimation procedure is nested, so the algorithm is very time-consuming. For the G-MISEP method, a network for computing the Jacobian (Almeida, 2004) is added to the unmixing network, so it took longer to run than ours. The performance indices I_P(x, B(t), θ(t), φ(t)) of our algorithm and G-MISEP are shown in Figure 6. Each iteration step corresponds to one update of the parameters, followed by the presentation of all 500 samples. It can be seen that our proposed algorithm outperforms G-MISEP in terms of accuracy. (The other two methods do not use this performance index directly; therefore, they are not shown in Figure 6.)
Figure 5: (a) Separated signals using PNL ICA. (b) Separated signals using G-MISEP. (c) Separated signals using the gauss-TD method.
Table 1: Correlations Between the Two Recovered Signals and the Original Sources, and Time Used in Experiment 1.

Algorithm    c1^a     c2       Time (sec)
PNL ICA      0.9902   0.9573   110.016
G-MISEP      0.9823   0.9922   30.562
PNL MISEP    0.9960   0.9915   6.046
Gauss-TD     0.9830   0.9517   0.110

^a Correlation between the ith recovered signal and the ith original source.
Figure 6: Learning curve of the PNL MISEP and G-MISEP algorithm.
5.2 Experiment 2. To further test the effectiveness of the proposed method, we conducted another experiment with three source signals (Hyvärinen et al., 2001), s(t) = [sin(t/5), ((rem(t, 23) − 11)/9)^5, sign(r_1(t) − 1/2) log(r_2(t))]^T, where the r_i(t) are random numbers uniformly distributed between zero and one. The sources are first linearly mixed using the matrix

A = \begin{pmatrix} 0.1312 & 0.5913 & 0.1243 \\ 0.5164 & -0.2015 & 0.1234 \\ 0.2341 & 0.0981 & 0.6321 \end{pmatrix}   (5.5)
and then distorted by the nonlinear functions

f_1(u) = f_2(u) = f_3(u) = \tanh(u).   (5.6)
Table 2: Correlations Between the Three Recovered Signals and the Original Sources, and Time Used in Experiment 2.

Algorithm    c1       c2       c3       Time (sec)
PNL ICA      0.9832   0.9596   0.9648   219.950
G-MISEP      0.9704   0.9728   0.9002   38.672
PNL MISEP    0.9813   0.9716   0.9427   9.485
Gauss-TD     0.9780   0.9560   0.9620   0.156
Table 3: Correlations Between the Two Recovered Signals and the Original Sources, and Time Used in Experiment 3.

Algorithm    c1       c2       Time (sec)
PNL ICA      0.9812   0.9810   413.187
G-MISEP      0.9878   0.9833   65.406
PNL MISEP    0.9954   0.9940   9.938
Gauss-TD     0.9378   0.9410   0.281
Figure 7a shows the three source signals. The separated signals obtained by applying our unmixing method are shown in Figure 7b. In this example, we also used the PNL ICA, G-MISEP, and gauss-TD algorithms to separate the mixed signals; the correlations between the three recovered signals and the original sources are shown in Table 2. From the experimental results, we can reach the same conclusion as in the first experiment. We also found that as the number of mixed source signals increases, the quality of the separated signals worsens, a trend observed for all of the separating methods. The main reason may be that with more source signals, the separation problem becomes more difficult.

5.3 Experiment 3. Finally, we tested the method on real-life speech signals. In this experiment, two speech signals (with 15,000 samples each, sampled at 8 kHz, obtained online from http://www.ece.mcmaster.ca/~reilly/kamran/id18.htm) are postnonlinearly mixed by

A = \begin{pmatrix} 0.8325 & 0.5310 \\ -0.4267 & 0.9564 \end{pmatrix}   (5.7)

f_1(u) = f_2(u) = \tanh(3u).   (5.8)
We used the first 1000 samples of mixture signals to train the unmixing network and then applied the trained network to separate the source signals. The experimental results are shown in Figure 8 and Table 3, which confirm the conclusions reached in the first two experiments.
Figure 7: The two sets of signals in experiment 2. (a) Source signals. (b) Separated signals using PNL MISEP.
Figure 8: The two sets of speech signals in experiment 3. (a) Source signals. (b) Separated signals using PNL MISEP.
Almeida (2004) conjectured that if the mixture is relatively smooth and does not involve nonlinearities that are too strong, nonlinear ICA will lead to nonlinear BSS and the MISEP method will be useful. For the postnonlinear BSS problem, the general MISEP method produced relatively good results in the above three experiments, but its structure cannot theoretically guarantee the equivalence of nonlinear ICA and BSS. In contrast, the PNL MISEP method that we propose in this letter is more competitive, since its unmixing system is specifically designed to separate postnonlinearly mixed signals and makes full use of the information about the mixing structure, thereby theoretically ensuring separability. Generally, compared with PNL MISEP, general MISEP can address more general nonlinear BSS problems, but PNL MISEP is more efficient at solving the PNL BSS problem.
6 Conclusion

In this letter, we have discussed ICA and BSS problems for nonlinear mixture models. Since general nonlinear BSS and ICA are difficult and ill-posed problems, we have studied a wide class of mixtures known as postnonlinear mixtures. We have proposed an adaptive algorithm for postnonlinear BSS based on the general nonlinear BSS method MISEP, which uses an entropy criterion that differs from the one introduced in the literature (Bell & Sejnowski, 1995; Lee, Girolami, & Sejnowski, 1999). This adaptive method optimizes a network with a specialized architecture using the output entropy as the objective function, which is equivalent to the mutual information criterion, but the computation of the marginal entropies of the outputs is not required. The experimental results showed that the proposed method is competitive with existing algorithms. We also observed in the experiments that as the number of mixed source signals increased, the quality of the separated signals became worse. In future research, we shall address this problem by optimizing the network structure and using a more efficient optimization algorithm. We shall also apply this method to other problems, such as real-life nonlinear image separation.

Appendix: Details of the Proposed Algorithm

We denote the derivatives of the functions as

u_k = g'_k(\theta_k, x_k) = \sum_{j=1}^{N_k} \xi_j^k \omega_j^k\, \sigma'\!\left(\omega_j^k x_k - \eta_j^k\right)   (A.1)

v_k = \psi'_k(\varphi_k, y_k) = \sum_{j=1}^{M_k} \alpha_j^k \beta_j^k\, \tau'\!\left(\beta_j^k y_k - \mu_j^k\right)   (A.2)

\vartheta_i = \psi''_i(\varphi_i, y_i) = \sum_{j=1}^{M_i} \alpha_j^i \left(\beta_j^i\right)^2 \tau''\!\left(\beta_j^i y_i - \mu_j^i\right).   (A.3)
In stochastic gradient mode, we can easily calculate the gradients of the contrast function H(z) with respect to each parameter as

\frac{\partial H(z)}{\partial \alpha_j^k} = \frac{\beta_j^k\, \tau'(\beta_j^k y_k - \mu_j^k)}{v_k}   (A.4)

\frac{\partial H(z)}{\partial \beta_j^k} = \frac{\alpha_j^k\, \tau'(\beta_j^k y_k - \mu_j^k)}{v_k} + \frac{\alpha_j^k \beta_j^k y_k\, \tau''(\beta_j^k y_k - \mu_j^k)}{v_k}   (A.5)

\frac{\partial H(z)}{\partial \mu_j^k} = \frac{-\alpha_j^k \beta_j^k\, \tau''(\beta_j^k y_k - \mu_j^k)}{v_k}.   (A.6)
Applying the chain rule of partial derivatives, we get

\frac{\partial H(z)}{\partial \xi_j^k} = \frac{\omega_j^k\, \sigma'(\omega_j^k x_k - \eta_j^k)}{u_k} + \sum_{i=1}^{n} \frac{\vartheta_i\, b_{ik}\, \sigma(\omega_j^k x_k - \eta_j^k)}{v_i}   (A.7)

\frac{\partial H(z)}{\partial \omega_j^k} = \xi_j^k\, \frac{\sigma'(\omega_j^k x_k - \eta_j^k) + \omega_j^k x_k\, \sigma''(\omega_j^k x_k - \eta_j^k)}{u_k} + \sum_{i=1}^{n} \frac{\vartheta_i\, b_{ik}\, \xi_j^k x_k\, \sigma'(\omega_j^k x_k - \eta_j^k)}{v_i}   (A.8)

\frac{\partial H(z)}{\partial \eta_j^k} = -\frac{\xi_j^k \omega_j^k\, \sigma''(\omega_j^k x_k - \eta_j^k)}{u_k} - \sum_{i=1}^{n} \frac{\vartheta_i\, b_{ik}\, \xi_j^k\, \sigma'(\omega_j^k x_k - \eta_j^k)}{v_i}   (A.9)

\frac{\partial H(z)}{\partial b_{ij}} = (B^{-T})_{ij} + \frac{\vartheta_i\, g_j(x_j)}{v_i}.   (A.10)
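To make the entropy-BP updates concrete, here is a sketch of how equations A.4 to A.6 translate into code (our own naming; tanh hidden units are assumed, whereas the letter uses inverse-tangent sigmoids in the ψ blocks, which would change τ' and τ'' accordingly):

```python
import numpy as np

def psi_param_grads(y_k, alpha, beta, mu):
    """Per-sample gradients of H(z) w.r.t. (alpha, beta, mu), eqs. A.4-A.6.
    With tanh units: tau' = 1 - tanh^2, tau'' = -2 * tanh * (1 - tanh^2)."""
    a = beta * y_k - mu
    tp = 1.0 - np.tanh(a) ** 2                       # tau'(a)
    tpp = -2.0 * np.tanh(a) * tp                     # tau''(a)
    v = (alpha * beta) @ tp                          # v_k = psi'_k(y_k), eq. A.2
    d_alpha = beta * tp / v                          # eq. A.4
    d_beta = alpha * (tp + beta * y_k * tpp) / v     # eq. A.5
    d_mu = -alpha * beta * tpp / v                   # eq. A.6
    return d_alpha, d_beta, d_mu
```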
According to the above formulas, this algorithm is derived in a way similar to error backpropagation (BP) for training a multilayer network, but in an unsupervised fashion. Therefore, the unsupervised algorithm is also called the entropy BP algorithm.

References

Achard, S., & Jutten, C. (2005). Identifiability of postnonlinear mixtures. IEEE Signal Processing Letters, 12(5), 423–426.
Achard, S., Pham, D. T., & Jutten, C. (2001). Blind source separation in post nonlinear mixtures. In Proceedings of the International Workshop on Independent Component Analysis and Blind Signal Separation, ICA2001 (pp. 295–300). Piscataway, NJ: IEEE.
Almeida, L. B. (1997). Multilayer perceptrons. In E. Fiesler & R. Beale (Eds.), Handbook of neural computation. New York: Oxford University Press.
Almeida, L. B. (2004). Linear and nonlinear ICA based on mutual information—the MISEP method. Signal Processing, 84, 231–245.
Babaie-Zadeh, M., Jutten, C., & Nayebi, K. (2002). A geometric approach for separating post non-linear mixtures. In Proc. Int. Workshop EUSIPCO 2002 (pp. 11–14). Piscataway, NJ: IEEE.
Bell, A., & Sejnowski, T. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7, 1129–1159.
Bishop, C. M., Svensen, M., & Williams, C. K. I. (1998). GTM: The generative topographic mapping. Neural Computation, 10, 215–234.
Burel, G. (1992). Blind separation of sources: A nonlinear neural algorithm. Neural Networks, 5, 937–947.
Comon, P. (1994). Independent component analysis: A new concept? Signal Processing, 36(3), 287–314.
Deco, G., & Brauer, W. (1995). Nonlinear higher-order statistical decorrelation by volume-conserving neural architectures. Neural Networks, 8, 525–535.
Herrmann, M., & Yang, H. H. (1996). Perspectives and limitations of self-organizing maps in blind separation of source signals. In Progress in Neural Information Processing: Proc. ICONIP'96 (pp. 1211–1216). Berlin: Springer.
Huang, D. S. (1998). The local minima free condition of feedforward neural networks for outer-supervised learning. IEEE Transactions on Systems, Man and Cybernetics, 28B(3), 477–480.
Hyvärinen, A. (1999). Fast and robust fixed-point algorithms for independent component analysis. IEEE Transactions on Neural Networks, 10(3), 626–634.
Hyvärinen, A., Karhunen, J., & Oja, E. (2001). Independent component analysis. New York: Wiley.
Hyvärinen, A., & Pajunen, P. (1999). Nonlinear independent component analysis: Existence and uniqueness results. Neural Networks, 12(3), 429–439.
Ilin, A., Achard, S., & Jutten, C. (2004). Bayesian versus constrained structure approaches for source separation in postnonlinear mixtures. In Proc. IJCNN 2004 (pp. 2181–2186). Piscataway, NJ: IEEE.
Ilin, A., & Honkela, A. (2004). Postnonlinear independent component analysis by variational Bayesian learning. In Proc. ICA 2004 (pp. 766–773). Berlin: Springer.
Jutten, C., Zadeh, M. B., & Hosseini, S. (2004). Three easy ways for separating nonlinear mixtures? Signal Processing, 84, 217–229.
Korenberg, M., & Hunter, I. (1986). The identification of nonlinear biological systems: LNL cascade models. Biological Cybernetics, 55, 125–134.
Larson, L. E. (1998). Radio frequency integrated circuit technology for low-power wireless communications. IEEE Personal Communications, 5(3), 11–19.
Lee, S., & Batzoglou, S. (2003). Application of independent component analysis to microarrays. Genome Biology, 4(11), R76.
Lee, T.-W., Girolami, M., & Sejnowski, T. (1999). Independent component analysis using an extended infomax algorithm for mixed sub-gaussian and super-gaussian sources. Neural Computation, 11, 417–441.
Lee, T.-W., Koehler, B.-U., & Orglmeister, R. (1997). Blind source separation of nonlinear mixing models. In Proc. 1997 IEEE Workshop, Neural Networks for Signal Processing VII (pp. 405–415). Piscataway, NJ: IEEE.
Lin, J. K., Grier, D. G., & Cowan, J. D. (1996). Source separation and density estimation by faithful equivariant SOM. In M. Mozer, M. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9 (pp. 536–542). Cambridge, MA: MIT Press.
Maeda, S., Song, W. J., & Ishii, S. (2005). Nonlinear and noisy extension of independent component analysis: Theory and its application to a pitch sensation model. Neural Computation, 17(1), 115–144.
Martinez, D., & Bray, A. (2003). Nonlinear blind source separation using kernels. IEEE Transactions on Neural Networks, 14(1), 228–235.
Pajunen, P. (1998). Blind source separation using algorithmic information theory. Neurocomputing, 22, 35–48.
Pajunen, P., Hyvärinen, A., & Karhunen, J. (1996). Nonlinear blind source separation by self-organizing maps. In Progress in Neural Information Processing: Proc. ICONIP'96 (Vol. 2, pp. 1207–1210). Berlin: Springer.
Parashiv-Ionescu, A., Jutten, C., & Bouvier, G. (2002). Source separation based processing for integrated Hall sensor arrays. IEEE Sensors Journal, 2(6), 663–673.
Prakriya, S., & Hatzinakos, D. (1995). Blind identification of LTI-ZMNL-LTI nonlinear channel models. IEEE Transactions on Signal Processing, 43(12), 3007–3013.
Solé-Casals, J., Babaie-Zadeh, M., Jutten, C., & Pham, D. (2003). Improving algorithm speed in PNL mixture separation and Wiener system inversion. In Proc. ICA 2003 (pp. 639–644). Piscataway, NJ: IEEE.
Taleb, A. (2002). A generic framework for blind source separation in structured nonlinear models. IEEE Transactions on Signal Processing, 50(8), 1819–1830.
Taleb, A., & Jutten, C. (1997a). Nonlinear source separation: The postnonlinear mixtures. In Proc. ESANN'97 (pp. 279–284). Brussels: D-Facto.
Taleb, A., & Jutten, C. (1997b). Entropy optimization, application to blind separation of sources. In Proc. ICANN'97 (pp. 529–534). Berlin: Springer.
Taleb, A., & Jutten, C. (1999). Source separation in post-nonlinear mixtures. IEEE Transactions on Signal Processing, 47, 2807–2820.
Taleb, A., Jutten, C., & Olympieff, S. (1998). Source separation in post nonlinear mixtures: An entropy-based algorithm. In Proc. ESANN'98 (pp. 2089–2092). Piscataway, NJ: IEEE.
Valpola, H. (2000). Nonlinear independent component analysis using ensemble learning: Theory. In Proceedings of the Second International Workshop on Independent Component Analysis and Blind Signal Separation (pp. 251–256). Piscataway, NJ: IEEE.
Yang, H. H., Amari, S., & Cichocki, A. (1997). Information backpropagation for blind separation of sources from nonlinear mixture. In Proc. IEEE ICNN (pp. 2141–2146). Piscataway, NJ: IEEE.
Yang, H. H., Amari, S., & Cichocki, A. (1998). Information-theoretic approach to blind separation of sources in nonlinear mixture. Signal Processing, 64(3), 291–300.
Ziehe, A., Kawanabe, M., Harmeling, S., & Müller, K.-R. (2003). Blind separation of postnonlinear mixtures using linearizing transformations and temporal decorrelation. Journal of Machine Learning Research, 4, 1319–1338.
Received March 14, 2006; accepted September 22, 2006.
Erratum

In the March 2007 issue of Neural Computation (Volume 19, Number 3), the letter "Reinforcement Learning State Estimator" by Jun Morimoto and Kenji Doya was printed with incomplete author affiliations. The correct affiliations should read:

Jun Morimoto, [email protected], JST, ICORP, Computational Brain Project, 4-1-8 Honcho, Kawaguchi, Saitama 332-0012, Japan; ATR Computational Neuroscience Laboratories, 2-2-2 Hikaridai, Soraku-gun, Seika-cho, Kyoto 619-0288, Japan.

Kenji Doya, [email protected], Initial Research Project, Okinawa Institute of Science and Technology, Uruma, Okinawa 904-2234, Japan; ATR Computational Neuroscience Laboratories, 2-2-2 Hikaridai, Soraku-gun, Seika-cho, Kyoto 619-0288, Japan.
ARTICLE
Communicated by Christal Gordon
Synaptic Dynamics in Analog VLSI Chiara Bartolozzi [email protected]
Giacomo Indiveri [email protected] Institute for Neuroinformatics, UNI-ETH Zürich, Zürich, Switzerland
Synapses are crucial elements for computation and information transfer in both real and artificial neural systems. Recent experimental findings and theoretical models of pulse-based neural networks suggest that synaptic dynamics can play a crucial role for learning neural codes and encoding spatiotemporal spike patterns. Within the context of hardware implementations of pulse-based neural networks, several analog VLSI circuits modeling synaptic functionality have been proposed. We present an overview of previously proposed circuits and describe a novel analog VLSI synaptic circuit suitable for integration in large VLSI spike-based neural systems. The circuit proposed is based on a computational model that fits the real postsynaptic currents with exponentials. We present experimental data showing how the circuit exhibits realistic dynamics and show how it can be connected to additional modules for implementing a wide range of synaptic properties.

1 Introduction

Synapses are highly specialized structures that, by means of complex chemical reactions, allow neurons to transmit signals to other neurons. When an action potential generated by a neuron reaches a presynaptic terminal, a cascade of events leads to the release of neurotransmitters that give rise to a flow of ionic currents into or out of the postsynaptic neuron's membrane. These excitatory or inhibitory postsynaptic currents (EPSC or IPSC, respectively) have temporal dynamics with a characteristic time course that can last up to several hundreds of milliseconds (Koch, 1999). In computational models of neural systems, the temporal dynamics of synaptic currents have often been neglected. In models that represent the information with mean firing rates, synaptic transmission is typically modeled as an instantaneous multiplier operator (Hertz, Krogh, & Palmer, 1991). Similarly, in pulse-based neural models, where the precise timing of spikes and the dynamics of the neuron's transfer function play an important role, synaptic currents are often reduced to simple instantaneous charge impulses. Also in VLSI implementations of neural systems, silicon

Neural Computation 19, 2581–2603 (2007)
C 2007 Massachusetts Institute of Technology
synapses have often been reduced to simple multiplier circuits (Borgstrom, Ismail, & Bibyk, 1990; Satyanarayana, Tsividis, & Graf, 1992) or constant current sources activated only for the duration of the presynaptic input pulse (Mead, 1989; Fusi, Annunziato, Badoni, Salamon, & Amit, 2000; Chicca, Badoni, et al., 2003). Within the context of pulse-based neural networks, modeling the detailed dynamics of postsynaptic currents can be a crucial step for learning neural codes and encoding spatiotemporal patterns of spikes. Leaky integrate-and-fire (I&F) neurons can distinguish between different temporal input spike patterns only if the synapses stimulated by the input spike patterns exhibit dynamics with time constants comparable to the time constant of the neuron's membrane potential (Gütig & Sompolinsky, 2006). Modeling the temporal dynamics of each synapse in a network of I&F neurons can be onerous in terms of CPU usage for software simulations and in terms of silicon real estate for dedicated VLSI implementations. A compromise between highly detailed models of synaptic dynamics and no dynamics at all is to use computationally efficient models that account for the basic properties of synaptic transmission. A very efficient model that reproduces the macroscopic properties of synaptic transmission and accounts for the linear summation property of postsynaptic currents is the one based on pure exponentials proposed by Destexhe, Mainen, and Sejnowski (1998). Here we propose a novel VLSI synaptic circuit, the diffpair integrator (DPI), that implements the model proposed in Destexhe et al. (1998) as a log-domain linear temporal filter and supports a wide range of synaptic properties, ranging from short-term depression to conductance-based EPSC generation. The design of the DPI synapse is inspired by a series of similar circuits proposed in the literature that collectively share many of the advantages of our solution but individually lack one or more of the features of our design. In the next section, we present an overview of previously proposed synaptic circuits and describe the DPI synapse, pointing out the advantages that the DPI offers over each of them. In section 3 we present experimental data from a VLSI chip showing the properties of the circuit in response to a single pulse and to sequences of spikes. In section 4 we show how the DPI is compatible with additional circuits used to implement various types of synaptic dynamics, and in section 5, we discuss possible uses of the DPI circuit in massively parallel networks of I&F neurons implemented on single or multichip neuromorphic systems.

2 Synaptic VLSI Circuits

Synaptic circuits translate presynaptic voltage pulses into postsynaptic currents injected in the membrane of their target neuron, with a gain typically referred to as the synaptic weight. The function of translating "fast" presynaptic pulses into long-lasting postsynaptic currents, with elaborate
temporal dynamics, can be efficiently mapped onto silicon using subthreshold (or weak-inversion) analog VLSI (aVLSI) circuits (Liu et al., 2002). In typical VLSI neural network architectures, the currents generated by multiple synapses are integrated by a single postsynaptic neuron circuit. The neuron circuit carries out a weighted sum of the input signals, produces postsynaptic potentials, and eventually generates output spikes that are typically transmitted to synaptic circuits in further processing stages. A common neuron model used in VLSI spike-based neural networks is the point neuron. With this model, the spatial position of the synaptic circuits connected to the neuron is not relevant, and the currents produced by the synapses are summed linearly into the single neuron’s membrane capacitance node. Alternatively, synaptic circuits (including the one presented in this article) can be integrated in multicompartmental models of neurons, and the neuron’s dendrite, comprising the spatial arrangement of VLSI synapses connected to the neuron, implements the spatial summation of synaptic currents (Northmore & Elias, 1998; Arthur & Boahen, 2004). Regardless of the neuron model used, one of the main requirements for synaptic circuits in large VLSI neural networks is compactness: the less silicon area is used, the more synapses can be integrated on the chip. On the other hand, implementing synaptic integrator circuits with linear response properties and time constants of the order of tens of milliseconds can require substantial silicon area. Therefore, designing VLSI synaptic circuits that are compact and linear and model relevant functional properties of biological synapses is a challenging task still being actively pursued. Several subthreshold synaptic circuit designs have been proposed (Mead, 1989; Lazzaro, 1994; Boahen, 1998; Fusi et al., 2000; Chicca, Indiveri, & Douglas, 2003; Shi & Horiuchi, 2004a; Gordon, Farquhar, & Hasler, 2004; Hynna & Boahen, 2006) covering a range of trade-offs between functionality and complexity of temporal dynamics versus circuit and layout size. Some of the circuits proposed require floating-gate devices (Gordon et al., 2004) or restrict the signals used to a very limited dynamic range (Hynna & Boahen, 2006) to reproduce in great detail the physics of biological synaptic channels. Here we focus on the synaptic circuits that implement kinetic models of synaptic transmission functionally equivalent to the one implemented by the DPI, which can be directly integrated into large arrays of address-event-based neural networks (Lazzaro, 1994; Boahen, 1998). 2.1 Pulsed Current-Source Synapse. The pulsed current-source synapse, originally proposed by Mead (1989) in the late 1980s, was one of the first synaptic circuits implemented using transistors operated in the subthreshold domain. The circuit schematics are shown in Figure 1 (left); it consists of a voltage-controlled current source activated by an active-low input spike. In VLSI pulsed neural networks, input spikes are typically brief digital voltage pulses that last at most a few microseconds. The output of this circuit is a pulsed current Isyn that lasts as long as the input spike.
Figure 1: (Left) Pulsed current-source synaptic circuit. (Right) Reset-and-discharge synapse.
Assuming that the output p-FET Mw is saturated (i.e., that its Vds is greater than 4UT), the current Isyn can be expressed as

I_{syn} = I_0 \, e^{-\kappa (V_w - V_{dd}) / U_T},    (2.1)
where Vdd is the power supply voltage, I0 the leakage current, κ the subthreshold slope factor, and UT the thermal voltage (Liu et al., 2002).

This circuit is extremely compact but does not integrate input spikes into continuous output currents. Whenever a presynaptic spike reaches Mpre, the postsynaptic membrane potential undergoes a step increase proportional to Isyn. As integration happens only at the level of the postsynaptic I&F neuron, input spike trains with the same mean rates but different spike timing distributions cannot be distinguished. However, given its simplicity and compactness, this circuit has been used in a wide variety of VLSI implementations of pulse-based neural networks that use mean firing rates as the neural code (Murray, 1998; Fusi et al., 2000; Chicca, Badoni, et al., 2003).

2.2 Reset-and-Discharge Synapse. In the early 1990s, Lazzaro (1994) proposed a synaptic circuit where the duration of the output EPSC Isyn(t) could be extended with respect to the input voltage pulse by means of a tunable exponential decay (see also Shi & Horiuchi, 2004b, for a recent application example). This circuit, shown in Figure 1 (right), comprises three p-FET transistors and one capacitor; the p-FET Mpre is used as a digital switch that is turned on by the synapse's input spikes; the p-FET Mτ is
operated in subthreshold and is used as a constant current source to linearly charge the capacitor Csyn; the output p-FET Msyn is used to generate an EPSC that is exponentially dependent on the Vsyn node (assuming subthreshold operation and saturation):

I_{syn}(t) = I_0 \, e^{-\kappa (V_{syn}(t) - V_{dd}) / U_T}.    (2.2)
At the onset of each presynaptic pulse, the node Vsyn is (re)set to the bias Vw. When the input pulse ends, the p-FET Mpre is switched off, and the node Vsyn is linearly driven back to Vdd, at a rate set by Iτ/Csyn. For subthreshold values of (Vdd − Vw), the EPSC generated by an input spike is therefore

I_{syn} = I_{w0} \, e^{-t/\tau},    (2.3)

where I_{w0} = I_0 \, e^{-\kappa (V_w - V_{dd}) / U_T} and \tau = U_T C_{syn} / (\kappa I_\tau). In general, given a generic spike sequence of n spikes,

\rho(t) = \sum_{i}^{n} \delta(t - t_i),    (2.4)
the response of the reset-and-discharge synapse can be formally expressed as

I_{syn}(t) = I_{w0} \, e^{-t/\tau} \int_0^t \delta(\xi - t_n) \, e^{\xi/\tau} \, d\xi = I_{w0} \, e^{-(t - t_n)/\tau}.    (2.5)
Although this synaptic circuit produces an EPSC that lasts longer than the duration of its input pulses and decays exponentially with time, its response depends on only the last (nth) input spike. This nonlinear property of the circuit fails to reproduce the linear summation property of postsynaptic currents often desired in synaptic models and makes the theoretical analysis of networks of neurons interconnected with this synapse intractable.

2.3 Linear Charge-and-Discharge Synapse. In Figure 2 (left), we show a modification of the reset-and-discharge synapse that has often been used by the neuromorphic engineering community and was recently presented in Arthur and Boahen (2004). Here the presynaptic pulse, applied to the input n-FET Mpre, is active high. Assuming that all transistors are saturated and operate in subthreshold, the circuit behavior is the following. During an input pulse, the node Vsyn(t) decreases linearly, at a rate set by the net current Iw − Iτ, and the synapse EPSC Isyn(t) increases exponentially (charge phase). In between spikes, the Vsyn(t) node is recharged toward Vdd at a rate set by Iτ,
Figure 2: (Left) Linear charge-and-discharge synapse. (Right) Current mirror integrator synapse.
and Isyn(t) decreases exponentially with time (discharge phase). The circuit equations that describe this behavior are

I_{syn}(t) = I_{syn}^- \, e^{(t - t_i^-)/\tau_c}    (charge phase)
I_{syn}(t) = I_{syn}^+ \, e^{-(t - t_i^+)/\tau_d}    (discharge phase),    (2.6)

where t_i^- is the time at which the ith input spike arrives, t_i^+ the time at which it ends, I_{syn}^- the initial condition at t_i^-, I_{syn}^+ the initial condition at t_i^+, \tau_c = U_T C_{syn} / (\kappa (I_w - I_\tau)) is the charge phase time constant, and \tau_d = U_T C_{syn} / (\kappa I_\tau) the discharge phase time constant.

Assuming that each spike lasts a fixed brief period \Delta t, and considering two successive spikes arriving at times t_i^- and t_{i+1}^-, we can then write

I_{syn}(t_{i+1}^-) = I_{syn}(t_i^-) \, e^{\Delta t \, (1/\tau_c + 1/\tau_d)} \, e^{-(t_{i+1}^- - t_i^-)/\tau_d}.    (2.7)
From this recursive equation, we derive the response of the linear charge-and-discharge synapse to a generic spike sequence ρ(t) of n spikes,

I_{syn}(t) = I_0 \, e^{n \Delta t \, (1/\tau_c + 1/\tau_d)} \, e^{-t/\tau_d},    (2.8)

assuming as the initial condition Vsyn(0) = Vdd. The EPSC dynamics depend on the total number of spikes n received at time t and on the circuit's time constants τc and τd. If we denote the input
spike train frequency at time t with f = (n/t), we can express equation 2.8 as

I_{syn}(t) = I_0 \, e^{f t \Delta t \, (\tau_c + \tau_d)/(\tau_c \tau_d)} \, e^{-t/\tau_d}.    (2.9)
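The runaway behavior predicted by equation 2.9, discussed in the next paragraph, is easy to check numerically. The short Python sketch below (our addition; the time constants and pulse width are arbitrary illustrative values) evaluates the equation for input rates on either side of the critical frequency f = (1/Δt)(Iτ/Iw) = τc/(Δt(τc + τd)):

```python
import numpy as np

# Illustrative parameters (assumed): charge/discharge time constants
# tau_c = U_T*C_syn/(kappa*(I_w - I_tau)), tau_d = U_T*C_syn/(kappa*I_tau),
# and a fixed input pulse width Delta t.
tau_c, tau_d, dt = 0.1e-3, 10e-3, 0.1e-3   # s

def I_syn(t, f, I0=1e-12):
    """Envelope of equation 2.9 for a regular input train of rate f."""
    return I0 * np.exp(f * t * dt * (tau_c + tau_d) / (tau_c * tau_d)
                       - t / tau_d)

# The exponent is positive (exponential growth, i.e., saturation of the
# real circuit) when f exceeds tau_c / (dt * (tau_c + tau_d)).
f_crit = tau_c / (dt * (tau_c + tau_d))
print(f"critical frequency ~ {f_crit:.0f} Hz")
for f in (0.5 * f_crit, 2.0 * f_crit):
    print(f"f = {f:6.1f} Hz: Isyn(1 s)/I0 = {I_syn(1.0, f) / 1e-12:.3g}")
```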
The major drawback of this circuit, aside from its not being a linear integrator, is that if the argument of the exponential in equation 2.9 is positive (i.e., if f > (1/\Delta t)(I_\tau / I_w)), the output current increases exponentially with time, and the circuit's response saturates: Vsyn(t) decreases all the way to Gnd, and Isyn(t) increases to its maximum value. This can be a problem because in these conditions, the circuit's steady-state response does not encode the input frequency.

2.4 Current-Mirror-Integrator Synapse. In his doctoral dissertation, Boahen (1997) proposed a synaptic circuit that differs from the linear charge-and-discharge one by a single node connection (see Figure 2) but that has a dramatically different behavior. The two transistors Mτ − Msyn of Figure 2 (right) implement a p-type current mirror, and together with the capacitor Csyn, they form a current mirror integrator (CMI). The CMI synapse implements a nonlinear pulse integrator circuit that produces a mean output current Isyn that increases with input firing rates and has a saturating nonlinearity with maximum amplitude that depends on the circuit's synaptic weight bias Vw and on its time constant bias Vτ.¹

The CMI response properties have been derived analytically in Hynna and Boahen (2001) for steady-state conditions. An explicit solution of the CMI response to a generic spike train, which does not require the steady-state assumption, was also derived in Chicca (2006). According to the analysis presented in Chicca, the CMI response to a spike arriving at t_i^- and ending at t_i^+ is

I_{syn}(t) = \alpha I_w / \left( 1 + (\alpha I_w / I_{syn}^- - 1) \, e^{-(t - t_i^-)/\tau_c} \right)    (charge phase)
I_{syn}(t) = I_w / \left( I_w / I_{syn}^+ + (t - t_i^+)/\tau_d \right)    (discharge phase),    (2.10)
where t_i^-, t_i^+, I_{syn}^-, and I_{syn}^+ are the same as defined in equation 2.6, \alpha = e^{(V_\tau - V_{dd})/U_T}, \tau_c = C_{syn} U_T / (\kappa I_w), and \tau_d = \alpha \tau_c.
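As a numerical illustration of these expressions (our addition; all parameter values are assumptions chosen for readability rather than values from a fabricated circuit), the following Python sketch evaluates equation 2.10 and compares the hyperbolic discharge with an exponential of the same time constant:

```python
import numpy as np

# Illustrative parameters (assumed): weight current, gain factor alpha,
# and the charge time constant tau_c = C_syn*U_T/(kappa*I_w).
I_w, alpha, tau_c = 1e-9, 10.0, 1e-3
tau_d = alpha * tau_c

def cmi_charge(t, I_minus):
    """Sigmoidal rise toward alpha*I_w during an input pulse (eq. 2.10)."""
    return alpha * I_w / (1 + (alpha * I_w / I_minus - 1) * np.exp(-t / tau_c))

def cmi_discharge(t, I_plus):
    """Hyperbolic (1/t) decay between pulses (eq. 2.10)."""
    return I_w / (I_w / I_plus + t / tau_d)

# After a long pulse the EPSC saturates near alpha*I_w. The 1/t decay is
# initially much faster than an exponential with time constant tau_d,
# although it has a longer tail at large t.
I_plus = cmi_charge(15 * tau_c, 1e-12)
for t in (tau_d, 5 * tau_d, 20 * tau_d):
    print(f"t = {t * 1e3:5.0f} ms: CMI {cmi_discharge(t, I_plus):.2e} A, "
          f"exponential {I_plus * np.exp(-t / tau_d):.2e} A")
```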
¹ The CMI does not implement a linear integrator filter; therefore the term time constant is improperly used. We use it in this context to denote a parameter that controls the temporal extension of the CMI's impulse response.
Figure 3: Log-domain integrator synapse.
During the charge phase, the EPSC increases over time as a sigmoidal function, while during the discharge phase, it decreases with a 1/t profile. The discharge of the EPSC is therefore extremely fast compared to the typical exponential decay profiles of other synaptic circuits. The parameter α (set by the Vτ bias voltage) can be used to slow the EPSC response profile. However, this parameter affects both the length of the EPSC discharge profile and the maximum amplitude of the EPSC charge phase: longer response times (larger values of τd) produce higher EPSC values. Despite these problems and although the CMI cannot be used to linearly sum postsynaptic currents, this circuit was very popular and has been extensively used by the neuromorphic engineering community (Boahen, 1998; Horiuchi & Hynna, 2001; Indiveri, 2000; Liu et al., 2001).

2.5 Log-Domain Integrator Synapse. More recently, Merolla and Boahen (2004) proposed another variant of the linear charge-and-discharge synapse that implements a true linear integrator circuit. This circuit (shown in Figure 3) exploits the logarithmic relationship between subthreshold MOSFET gate-to-source voltages and their channel currents and is therefore called a log-domain filter. The output current Isyn of this circuit has the same exponential dependence on its gate voltage Vsyn as all other synapses presented (see equation 2.2). Therefore, we can express its derivative with respect to time as

\frac{d}{dt} I_{syn} = -I_{syn} \frac{\kappa}{U_T} \frac{d}{dt} V_{syn}.    (2.11)
During an input spike (charge phase), the dynamics of Vsyn are governed by the equation C_{syn} \frac{d}{dt} V_{syn} = -(I_w - I_\tau). Combining this first-order differential equation with equation 2.11, we obtain

\tau \frac{d}{dt} I_{syn} + I_{syn} = \frac{I_w}{I_\tau} I_{syn},    (2.12)

where \tau = C_{syn} U_T / (\kappa I_\tau). The beauty of this circuit lies in the fact that the term Iw is inversely proportional to Isyn itself:
I_w = I_0 \, e^{-\kappa (V_w - V_{syn})/U_T} = I_0 \, e^{-\kappa (V_w - V_{dd})/U_T} \, e^{\kappa (V_{syn} - V_{dd})/U_T} = I_{w0} \frac{I_0}{I_{syn}},    (2.13)
where I0 is the leakage current and Iw0 is the current flowing through Mw in the initial condition, when Vsyn = Vdd. When this expression of Iw is substituted in equation 2.12, the right term of the differential equation loses the Isyn dependence and becomes the constant factor I_0 I_{w0} / I_\tau. Therefore, the log-domain integrator transfer function takes the form of a canonical first-order low-pass filter equation, and its response to a spike arriving at t_i^- and ending at t_i^+ is

I_{syn}(t) = \frac{I_0 I_{w0}}{I_\tau} \left( 1 - e^{-(t - t_i^-)/\tau} \right) + I_{syn}^- \, e^{-(t - t_i^-)/\tau}    (charge phase)
I_{syn}(t) = I_{syn}^+ \, e^{-(t - t_i^+)/\tau}    (discharge phase).    (2.14)
This is the only synaptic circuit of the ones described up to now that has linear filtering properties. The same silicon synapse can be shared to sum the contributions of spikes potentially arriving from different sources in a linear way. This could save significant amounts of silicon real estate in neural architectures where the synapses do not implement learning or local adaptation mechanisms and could therefore solve many of the problems that have hindered the development of large-scale VLSI multineuron chips up to now.

However, this particular circuit has two drawbacks. One problem is that the VLSI layout of the schematic shown in Figure 3 requires more area than the layout of other synaptic circuits, because the Mw p-FET has to live in an "isolated well" structure (Liu et al., 2002). The second, and more serious, problem is that the spike lengths used in pulse-based neural network systems, which typically last less than a few microseconds, are too short to inject enough charge into the membrane capacitor of the postsynaptic neuron to see any effect. The maximum amount of charge possible is Q = (I_0 I_{w0} / I_\tau) \Delta t, and Iw0 cannot be increased beyond subthreshold current limits (of the order of nano-amperes); otherwise, the log-domain properties of the filter break down (note that also Iτ is fixed, to set the desired time constant τ).
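To illustrate both points at once, the Python sketch below (our addition; the parameter values are assumptions of typical subthreshold magnitudes) iterates equation 2.14 over spike trains on a time grid. It confirms that the responses superpose linearly, and the tiny per-spike increment obtained with brief pulses also shows why so little charge reaches the postsynaptic neuron:

```python
import numpy as np

# Illustrative parameters (assumed): leakage, weight, and tau-setting
# currents, and the filter time constant tau = U_T*C_syn/(kappa*I_tau).
I0, Iw0, Itau = 1e-12, 1e-9, 5e-11   # A
tau = 10e-3                          # s

def response(spike_times, t_end, dt=1e-5):
    """EPSC of the log-domain integrator (equation 2.14), stepped over a
    regular grid; each input pulse is taken to last one step of width dt."""
    drive = I0 * Iw0 / Itau
    t_grid = np.arange(0.0, t_end, dt)
    I = np.zeros_like(t_grid)
    for j in range(1, len(t_grid)):
        if any(s <= t_grid[j] < s + dt for s in spike_times):
            # charge phase: first-order relaxation toward the drive current
            I[j] = drive + (I[j - 1] - drive) * np.exp(-dt / tau)
        else:
            # discharge phase: plain exponential decay
            I[j] = I[j - 1] * np.exp(-dt / tau)
    return I

# Linearity: the response to the union of two spike trains equals the sum
# of the individual responses (to machine precision on this grid).
I_ab = response([0.001, 0.013, 0.027], 0.05)
I_a = response([0.001, 0.027], 0.05)
I_b = response([0.013], 0.05)
print("superposition error:", np.max(np.abs(I_ab - (I_a + I_b))))
print("peak EPSC per spike:", I_b.max(), "A  (tiny for brief pulses)")
```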
Figure 4: Diff-pair integrator synapse.
A possible solution is to increase the fast (off-chip) input pulse lengths with on-chip pulse extenders (e.g., with CMI circuits). But this solution requires additional circuitry at each input synapse and makes the layout of the overall circuit even larger (Merolla & Boahen, 2004).

2.6 Diff-Pair Integrator Synapse. The DPI circuit that we designed solves the problems of the log-domain integrator synapse while maintaining its linear filtering properties, thus preserving the possibility of multiplexing in time spikes arriving from different sources. The schematic diagram of the DPI synapse is shown in Figure 4. This circuit comprises four n-FETs, two p-FETs, and a capacitor. The n-FETs form a differential pair whose branch current Iin represents the input to the synapse during the charge phase. Assuming subthreshold operation and saturation regime, the diff-pair branch current Iin can be expressed as

I_{in} = I_w \frac{e^{\kappa V_{syn}/U_T}}{e^{\kappa V_{syn}/U_T} + e^{\kappa V_{thr}/U_T}},    (2.15)
and multiplying the numerator and denominator of equation 2.15 by e^{-\kappa V_{dd}/U_T}, we can express Iin as

I_{in} = \frac{I_w}{1 + I_{syn}/I_{gain}},    (2.16)

where the term I_{gain} = I_0 \, e^{-\kappa (V_{thr} - V_{dd})/U_T} represents a virtual p-type subthreshold current that is not tied to any p-FET in the circuit. As for the log-domain integrator, we can combine the Csyn capacitor equation C_{syn} \frac{d}{dt} V_{syn} = -(I_{in} - I_\tau) with equation 2.11 and write
\tau \frac{d}{dt} I_{syn} = -I_{syn} \left( 1 - \frac{I_{in}}{I_\tau} \right),    (2.17)

where (as usual) \tau = C_{syn} U_T / (\kappa I_\tau). Replacing Iin from equation 2.16 into equation 2.17, we obtain

\tau \frac{d}{dt} I_{syn} + I_{syn} = \frac{I_w}{I_\tau} \frac{I_{syn}}{1 + I_{syn}/I_{gain}}.    (2.18)
This is a first-order nonlinear differential equation; however, the steady-state condition can be solved in closed form, and its solution is

I_{syn} = \frac{I_{gain}}{I_\tau} (I_w - I_\tau).    (2.19)
If I_w \gg I_\tau, the output current Isyn will eventually rise to values such that I_{syn} \gg I_{gain} when the circuit is stimulated with an input step signal. If I_{syn}/I_{gain} \gg 1, the Isyn dependence in the second term of equation 2.18 cancels out, and the nonlinear differential equation simplifies to the canonical first-order low-pass filter equation:

\tau \frac{d}{dt} I_{syn} + I_{syn} = \frac{I_w I_{gain}}{I_\tau}.    (2.20)
In this case, the response of the DPI synapse to a spike arriving at t_i^- and ending at t_i^+ is

I_{syn}(t) = \frac{I_{gain} I_w}{I_\tau} \left( 1 - e^{-(t - t_i^-)/\tau} \right) + I_{syn}^- \, e^{-(t - t_i^-)/\tau}    (charge phase)
I_{syn}(t) = I_{syn}^+ \, e^{-(t - t_i^+)/\tau}    (discharge phase).    (2.21)
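To check these derivations numerically, the following Python sketch (our addition; the current values are assumptions chosen so that Iw ≫ Iτ) integrates the full nonlinear equation 2.18 with forward Euler for a step input and compares its steady state with the closed-form predictions of equations 2.19 and 2.20:

```python
import numpy as np

# Illustrative parameters (assumed), with I_w >> I_tau as in the text.
Iw, Itau, Igain = 1e-9, 1e-11, 5e-12   # A
tau = 10e-3                            # tau = U_T*C_syn/(kappa*I_tau), s

dt, T = 1e-6, 0.1
t = np.arange(0.0, T, dt)
I = np.zeros_like(t)
I[0] = 1e-15   # small seed current (the ODE has a fixed point at zero)
for j in range(1, len(t)):
    # forward-Euler step of the full nonlinear DPI equation (eq. 2.18)
    dI = (-I[j - 1] + (Iw / Itau) * I[j - 1] / (1 + I[j - 1] / Igain)) / tau
    I[j] = I[j - 1] + dt * dI

print(f"numerical steady state    : {I[-1]:.4e} A")
print(f"closed form (eq. 2.19)    : {(Igain / Itau) * (Iw - Itau):.4e} A")
print(f"linear approx. (eq. 2.20) : {Igain * Iw / Itau:.4e} A")
```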
The solution of the DPI synapse is almost identical to the one of the log-domain integrator synapse, described in equation 2.14. The only difference is that the term I0 of equation 2.14 is replaced by Igain. This scaling factor can be used to amplify the charge phase response amplitude, therefore solving the problem of generating sufficiently large charge packets sourced into the neuron's integrating capacitor for input spikes of very brief duration, while keeping all currents in the subthreshold regime and without requiring additional pulse-extender circuits. In addition, the layout of the DPI does not require isolated well structures and can be implemented in a very compact way.

As for the log-domain integrator synapse described in section 2.5, the DPI synapse implements a low-pass filter with linear transfer function (under the realistic assumption that I_w \gg I_\tau). Although it is less compact than the synaptic circuits described in sections 2.1, 2.2, 2.3, and 2.4, it is the only one that can reproduce the exponential dynamics observed in excitatory and inhibitory postsynaptic currents of biological synapses (Destexhe et al., 1998), without requiring additional input pulse-extender circuits. Moreover, the DPI synapse we propose has independent control of time constant, synaptic weight, and synaptic scaling parameters. The extra degree of freedom obtained with the Vthr parameter can be used to globally scale the efficacies of the DPI circuits that share the same Vthr bias. This feature could in turn be employed to implement global homeostatic plasticity mechanisms complementary to local spike-based plasticity ones acting on the synaptic weight Vw node (see also section 4). In the next section, we present experimental results from a VLSI chip comprising an array of DPI synapses connected to low-power leaky I&F neurons (Indiveri, Chicca, & Douglas, 2006) that validate the analytical derivations presented here.

3 Experimental Results

We fabricated a prototype chip in standard AMS 0.35 µm CMOS technology comprising the DPI circuit and additional test structures to augment the synapse's functionality. Here we present experimental results measured from the basic DPI circuit of Figure 4, while the characteristics and measurements from the additional test circuits are described in section 4.

In Figure 5, we show a picture of the synaptic circuit layout. The full layout occupies an area of 1360 µm². These types of synaptic circuits can therefore be used to implement networks of spiking neurons with a very large number of synapses on a small chip area. For example, in a recent chip, we implemented a network comprising 8192 synapses and 32 neurons (256 synapses per neuron) using an area of only 12 mm² (Mitra, Fusi, & Indiveri, 2006). The silicon area occupied by the synaptic circuit can vary significantly, as it depends on the choice of layout design solutions. More conservative solutions use large transistors, have lower mismatch, and require more area. More aggressive solutions require less area, but multiple instances of the
Figure 5: Layout of the fabricated DPI synapse and additional circuits that augment the synapse’s functionality. The schematic diagram and properties of the STD, NMDA, and G blocks are described in section 4.
same layout cell produce currents with larger deviations. The layout of Figure 5 implements a very conservative solution.

To validate the theoretical analysis of section 2.6, we measured the DPI step response and fitted the experimental data with equation 2.21. In Figure 6 (left), we plot the circuit's step response for different synaptic weight Vw bias values. The rise and decay parts of the data were fitted with the charge phase and discharge phase parts of equation 2.21 using slightly different parameters for the estimated time constant. The small differences in the time constants are most likely due to leakage currents and parasitic capacitance effects, not considered in the analytical derivations. These results, however, show that the DPI time constant does not depend on Vw and can be independently tuned with Vτ.

Silicon synapses are typically stimulated with trains of pulses (spikes) of very brief duration, separated by longer interspike intervals (ISIs). It can be easily shown from equation 2.20 that when the DPI is stimulated with a spike train of average frequency f_{in} and pulse duration \Delta t, its steady-state response is

\langle I_{syn} \rangle = \frac{I_{gain} I_w}{I_\tau} \Delta t \, f_{in}.    (3.1)
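The following Python sketch (our addition; the parameters are the same illustrative assumptions used in the previous sketches) iterates the linear solution of equation 2.21 to its periodic regime and compares the resulting mean current with equation 3.1:

```python
import numpy as np

# Illustrative parameters (assumed): currents, time constant, pulse width.
Iw, Itau, Igain = 1e-9, 1e-11, 5e-12   # A
tau, dt_pulse = 10e-3, 1e-5            # s

def mean_epsc(f_in):
    """Mean steady-state EPSC of the linearized DPI driven at f_in,
    obtained by iterating equation 2.21 over many input periods."""
    drive = Igain * Iw / Itau
    I_minus, T = 0.0, 1.0 / f_in
    for _ in range(2000):
        I_plus = drive + (I_minus - drive) * np.exp(-dt_pulse / tau)  # charge
        I_minus = I_plus * np.exp(-(T - dt_pulse) / tau)              # discharge
    # the charge phase is brief, so over one period the EPSC is essentially
    # an exponential of amplitude I_plus decaying for a time T
    return I_plus * (tau / T) * (1 - np.exp(-T / tau))

for f in (10, 50, 100, 200):
    print(f"f_in = {f:3d} Hz: simulated {mean_epsc(f):.3e} A, "
          f"eq. 3.1: {Igain * Iw / Itau * dt_pulse * f:.3e} A")
```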
Figure 6: DPI circuit response properties. (Left) Step response for three different values of Vw. The response is fitted with equation 2.21, and the fitting functions (dotted and dashed lines) are superimposed on the measured data. The time constants estimated by the fit are τ = 3 ms for the charge phase and τ = 4 ms for the discharge phase. (Right) Response to spike trains of increasing frequencies. The output mean current is linear with the synaptic input frequency, and its gain can be changed with the synaptic weight bias Vw.
We also verified this derivation by measuring the mean EPSC of the circuit in response to spike trains of increasing frequencies. In Figure 6 (right), we show the I–f curve for typical biological spiking frequencies, ranging from 10 to 200 Hz. The mean output current is linear over a wide range of input frequencies (extending well beyond the ones shown in the plot).

4 Synaptic Dynamics

The results of the previous sections showed how the DPI response models the EPSC generated by biological excitatory synapses of AMPA type (Destexhe et al., 1998). Inhibitory (GABAa) type synapses can be easily emulated by using the complementary version of the DPI circuit of Figure 4 (with a p-type diff-pair and n-type output transistor). Additional circuits can be attached to the DPI synapse to extend the model with additional features typical of biological synapses and implement various types of plasticity. For example, by adding two extra transistors, we can implement voltage-gated channels that model NMDA synapse behavior. Similarly, by using two more transistors, we can extend the synaptic model to be conductance based (Kandel, Schwartz, & Jessell, 2000). Furthermore, the DPI circuit is compatible with previously proposed circuits for implementing synaptic plasticity, both on short timescales with models of short-term depression (STD) (Rasche & Hahnloser, 2001; Boegerhausen, Suter, & Liu, 2003) and on longer timescales with spike-based learning mechanisms, such as spike-timing-dependent plasticity (STDP) (Indiveri et al., 2006). Finally, the DPI's extra degree of freedom for modifying the overall gain of the synapse
Figure 7: Schematic diagram of the DPI connected to additional test circuits that augment the synapse’s functionality. The names of the functional blocks correspond to the ones used in the layout of Figure 5: The STD block comprises the circuit modeling short-term depression of the synaptic weight, the NMDA block comprises the transistors modeling NMDA voltage-gated channels, and the G block includes transistors that render the synapse conductance based.
either with Vthr or with Vw allows the implementation of synaptic homeostatic mechanisms (Bartolozzi & Indiveri, 2006), such as global activity-dependent synaptic scaling (Turrigiano, Leslie, Desai, Rutherford, & Nelson, 1998).

In Figure 7, we show the schematics of the extension circuits mentioned above implemented on the test chip (with the exception of the STDP and homeostatic circuits). In the next paragraphs, we describe the behavior of these additional circuits, characterized by measuring the membrane potential Vmem of a low-power leaky I&F neuron (Indiveri et al., 2006) that receives the synaptic EPSC as input.

4.1 NMDA Synapse. With the DPI we reproduce phenomenologically the current flow through ionic ligand-gated membrane channels that open and let the ions flow across the postsynaptic membrane as soon as they sense the neurotransmitters released by the presynaptic boutons (e.g., AMPA channels). Another important class of ligand-gated synaptic
Figure 8: NMDA-type synapse response properties. (Left) Membrane potential of an I&F neuron connected to the synapse and stimulated by a constant injection current. The NMDA threshold voltage is set to Vnmda = 400 mV. The small bumps in Vmem represent the excitatory postsynaptic potentials (EPSPs) produced by the synapse, when Vmem > Vnmda, in response to the presynaptic input spikes. (Right) EPSP amplitude versus the membrane potential, for increasing values of the NMDA threshold Vnmda and for a fixed value of Vw.
channels, the NMDA receptors, is, in addition, voltage gated: these channels open to let the ions flow only if the membrane voltage is depolarized above a given threshold while in the presence of their neurotransmitter (glutamate). We can implement this behavior by exploiting the thresholding property of the differential pair circuit, as shown in Figure 7: if the node Vmem is lower than the externally set bias Vnmda, the output current Isyn flows through the transistor Mnmda in the left branch of the diff-pair and has no effect on the postsynaptic depolarization. On the other hand, if Vmem is higher than Vnmda, the current flows also into the membrane potential node, depolarizing the I&F neuron, and thus implementing the voltage gating typical of NMDA synapses.

In Figure 8, we show the results measured from the test circuit on the prototype chip: we stimulate the synapse with presynaptic spikes, while also injecting constant current into the neuron's membrane. The synapse's EPSC amplitude depends on the difference between the membrane potential and the NMDA threshold Vnmda. As expected, when Vmem is smaller than Vnmda, the synaptic current is null, and the membrane potential increases solely due to the constant injection current. As Vmem increases above Vnmda, the contribution of the synaptic current injected with each presynaptic spike becomes visible.

The time constant of the DPI circuit used in this way can be easily extended to hundreds of milliseconds (values typical of NMDA-type synaptic dynamics) by increasing the Vτ bias voltage of Figure 7. This allows us to faithfully reproduce both the voltage-gated and temporal dynamic properties of real NMDA synapses. It is important to be able to implement
these properties in our VLSI devices because there is evidence that they play an important role in detecting coincidence between presynaptic activity and postsynaptic depolarization for inducing long-term potentiation (LTP) (Morris, Davis, & Butcher, 1990). Furthermore, the NMDA synapse's stabilizing role, hypothesized by computational studies within the context of working memory (Wang, 1999), could be useful for stabilizing persistent activity of recurrent VLSI networks of spiking neurons.

4.2 Conductance-Based Synapse. So far we have reproduced the total current flowing through the synaptic channels independent of the postsynaptic membrane potential. However, in real synapses, the current is proportional to the difference between the postsynaptic membrane voltage and the synaptic ion reversal potential E_{ion}:

I_{syn} = g_{syn} (V_{mem} - E_{ion}).    (4.1)

Exploiting once more the properties of the differential pair circuit, we can model this dependence with just two more transistors (see the G block of Figure 7) and obtain a behavior that, to first-order approximation, is equivalent to that described by equation 4.1. Formally, the conductance-based synapse output is

I_{syn} = I_{syn} \frac{1}{1 + e^{\kappa (V_{mem} - V_{gthr})/U_T}},    (4.2)

so if we consider the first-order term of the Taylor expansion, when V_{mem} \cong V_{gthr}, we obtain

I_{syn} = \frac{I_{syn}}{2} + g_{syn} (V_{mem} - V_{gthr}),    (4.3)

where the conductance term g_{syn} = I_{syn} \, \kappa / (4 U_T).

In Figure 9 we plot the EPSPs measured from the I&F neuron connected to the conductance-based synapse for different values of Vgthr. These experimental results show that our synapse can reproduce the behavior of conductance-based synapses. This behavior is especially relevant in inhibitory synapses, where the dependence expressed in equation 4.1 results in shunting inhibition. Computational and biological studies have attributed different roles to shunting inhibition, such as logical AND-NOT (Koch, Poggio, & Torre, 1983) and normalization (Carandini, Heeger, & Movshon, 1997) functions. Evidence for these and other hypotheses continues to be the subject of further investigation (Anderson, Carandini, & Ferster, 2000; Chance, Abbott, & Reyes, 2002). The implementation of shunting inhibition in large arrays of VLSI synapses and spiking neurons provides an additional means for exploring the computational role of this computational primitive.
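The Python sketch below (our addition; κ, UT, and the current value are generic subthreshold assumptions, not chip measurements) compares the exact sigmoidal dependence of equation 4.2 with its first-order expansion, whose slope magnitude is the conductance term of equation 4.3:

```python
import numpy as np

# Generic subthreshold constants (assumed) and a nominal synaptic current.
UT, kappa = 0.025, 0.7      # thermal voltage (V), subthreshold slope factor
Isyn_tot = 1e-9             # total synaptic current steered by the G block (A)
V_gthr = 0.5                # bias playing the role of the reversal potential (V)

def I_exact(Vmem):
    """Conductance-based output of equation 4.2 (diff-pair sigmoid)."""
    return Isyn_tot / (1 + np.exp(kappa * (Vmem - V_gthr) / UT))

def I_linear(Vmem):
    """First-order Taylor expansion around Vmem = V_gthr; the slope
    magnitude is g_syn = Isyn_tot*kappa/(4*UT), as in equation 4.3."""
    g_syn = Isyn_tot * kappa / (4 * UT)
    return Isyn_tot / 2 - g_syn * (Vmem - V_gthr)

# The two agree near V_gthr and diverge at the tails of the sigmoid.
for Vmem in (0.45, 0.48, 0.50, 0.52, 0.55):
    print(f"Vmem = {Vmem:.2f} V: exact {I_exact(Vmem):.3e} A, "
          f"linear {I_linear(Vmem):.3e} A")
```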
Figure 9: Conductance-based synapse. (Left) Membrane potential of the I&F neuron stimulated by the synapse for different values of the synaptic reversal potential Vgthr. (Right) EPSP amplitude as a function of Vmem for different values of Vgthr.
4.3 Synaptic Plasticity. In the previous sections, we showed that our circuit can model biologically realistic synaptic current dynamics. The main feature of synapses exploited in neural networks, though, is plasticity: the ability to change the synaptic efficacy in order to learn and adapt to the environment. In neural networks with large arrays of synapses and neurons, usually all the synapses belonging to one population share the same bias that sets their initial weight² (Indiveri et al., 2006; Mitra et al., 2006; Arthur & Boahen, 2004; Shi & Horiuchi, 2004b). In addition, each synapse can be connected to a local circuit for the short- and/or long-term modification of its weight. Our silicon synapse supports all of the short-term and long-term plasticity mechanisms for inducing long-term potentiation (LTP) and long-term depression (LTD) in the synaptic weight that have been proposed in the literature. Specifically, the possibility of biasing Mw with subthreshold voltages on the order of hundreds of mV makes the DPI compatible with many of the spike-timing-dependent plasticity circuits previously proposed (Indiveri et al., 2006; Mitra et al., 2006; Arthur & Boahen, 2006; Bofill, Murray, & Thompson, 2002).

Similarly, the DPI synapse is naturally extended with the short-term depression circuit proposed by Rasche and Hahnloser (2001), where the synaptic weight decreases with increasing number of input spikes and recovers during periods of presynaptic inactivity.

² The initial weight Vw can be set by an external voltage reference or by on-chip bias generators.
Figure 10: Short-term depression: Membrane potential of the leaky I&F neuron, when the short-term depressing synapse is stimulated with a regular spike train at 50 Hz. The different traces of the membrane potential correspond to different values of the leakage current of the neuron. Note how (from the second spike on) the EPSP amplitude decreases with each input spike.
From the computational point of view, STD is a nonlinear mechanism that plays an important role for implementing selectivity to transient stimuli and contrast adaptation (Chance, Nelson, & Abbott, 1998). In Figure 10, we show the EPSPs of the I&F neuron connected to the synapse, having activated the STD block of Figure 7. These results confirm the compatibility between the DPI and the STD circuits and show qualitatively the effect of short-term depression. Quantitative considerations and comparisons to short-term depression computational models have already been presented elsewhere (Rasche & Hahnloser, 2001; Boegerhausen et al., 2003).

Another valuable property of biological synapses is the homeostatic mechanism known as activity-dependent synaptic scaling (Turrigiano et al., 1998). It acts by scaling the synaptic weights in order to keep the neurons' firing rate within a functional range in the face of chronic changes of their activity level while preserving the relative differences between individual synapses. As demonstrated in section 2.6 and Figure 11, we can scale the total synaptic efficacy of the DPI by independently varying either Iw or Igain (see also Figure 6 (left)). We can exploit these two independent degrees of freedom for learning the synaptic weight Vw with "fast" spike-based learning rules, while adapting the bias Vthr to implement homeostatic synaptic scaling on much slower timescales. A control algorithm that exploits the
Figure 11: Independent scaling of EPSC amplitude by adjusting either Vthr or Vw. The plots show the time course of mean and standard deviation (over 10 repetitions of the same experiment) of the current Isyn, in response to a single input voltage pulse. In both plots, the lower EPSC traces share the same set of Vthr and Vw. (Left) The higher EPSC is obtained by increasing Vw and (Right) by decreasing Vthr, with respect to the initial bias set. Superimposed on the experimental data, we plot theoretical fits of the decay from equation 2.21. The time constant of all plots is the same and equal to 5 ms.
properties of the DPI to implement the activity-dependent synaptic scaling homeostatic mechanism has recently been proposed by Bartolozzi and Indiveri (2006).

5 Conclusion

We have proposed a new analog VLSI synapse circuit (the DPI of section 2.6) useful for implementing postsynaptic currents in neuromorphic VLSI networks of spiking neurons with biologically realistic temporal dynamics. We showed in analytical derivations and experimental data that the circuit proposed matches detailed computational models of synapses. We compared our VLSI synapse to previously proposed circuits that implement an equivalent functionality and derived analytically their transfer functions. Our analysis showed that the DPI circuit incorporates most of the strengths of previously proposed circuits, while providing additional favorable properties. Specifically, the DPI implements a linear integrator circuit with two independent tunable gain parameters and one independently tunable time-constant parameter. The circuit's mean output current encodes linearly the input frequency of the presynaptic spike train. As the DPI performs linear temporal summation of its input spikes, it can be used for processing multiple spike trains generated by different sources multiplexed together, modeling the contribution of many different synapses that share the same weight.
Next to being linear and compact, this circuit is compatible with existing implementations of both short-term and long-term plasticity. The favorable features of linearity, compactness, and compatibility with existing synaptic circuit elements make it an ideal building block for constructing adaptive dynamic synapses and implementing dense and massively parallel networks of spiking neurons capable of processing spatiotemporal signals in real time. The VLSI implementation of such networks constitutes a powerful tool for exploring the computational role of each element described in this work, from the voltage-gated NMDA channels, to shunting inhibition, and homeostasis, using real-world stimuli while observing the network's behavior in real time.

Acknowledgments

This work was supported in part by the EU grants ALAVLSI (IST2001-38099) and DAISY (FP6-2005-015803) and in part by the ETH TH under Project 0-20174-04. The chip was fabricated via the EUROPRACTICE service. We thank Pratap Kumar for fruitful discussions about biological synapses.
References

Anderson, J., Carandini, M., & Ferster, D. (2000). Orientation tuning of input conductance, excitation, and inhibition in cat primary visual cortex. Journal of Physiology, 84, 909–926.
Arthur, J., & Boahen, K. (2004, July). Recurrently connected silicon neurons with active dendrites for one-shot learning. In IEEE International Joint Conference on Neural Networks (Vol. 3, pp. 1699–1704). Piscataway, NJ: IEEE.
Arthur, J., & Boahen, K. (2006). Learning in silicon: Timing is everything. In Y. Weiss, B. Schölkopf, & J. Platt (Eds.), Advances in neural information processing systems, 18. Cambridge, MA: MIT Press.
Bartolozzi, C., & Indiveri, G. (2006). Silicon synaptic homeostasis. In Brain Inspired Cognitive Systems 2006. [CD]
Boahen, K. A. (1997). Retinomorphic vision systems: Reverse engineering the vertebrate retina. Unpublished doctoral dissertation, California Institute of Technology.
Boahen, K. (1998). Communicating neuronal ensembles between neuromorphic chips. In T. S. Lande (Ed.), Neuromorphic systems engineering (pp. 229–259). Norwell, MA: Kluwer Academic.
Boegerhausen, M., Suter, P., & Liu, S.-C. (2003). Modeling short-term synaptic depression in silicon. Neural Computation, 15(2), 331–348.
Bofill, A., Murray, A., & Thompson, D. (2002). Circuits for VLSI implementation of temporally asymmetric Hebbian learning. In T. G. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14. Cambridge, MA: MIT Press.
Borgstrom, T., Ismail, M., & Bibyk, S. (1990). Programmable current-mode neural network for implementation in analogue MOS VLSI. IEE Proceedings G, 137(2), 175–184.
Carandini, M., Heeger, D. J., & Movshon, J. A. (1997). Linearity and normalization in simple cells of the macaque primary visual cortex. Journal of Neuroscience, 17(21), 8621–8644.
Chance, F., Abbott, L., & Reyes, A. (2002). Gain modulation from background synaptic input. Neuron, 35, 773–782.
Chance, F. S., Nelson, S. B., & Abbott, L. F. (1998). Synaptic depression and the temporal response characteristics of V1 cells. Journal of Neuroscience, 18(12), 4785–4799.
Chicca, E. (2006). A neuromorphic VLSI system for modeling spike-based cooperative competitive neural networks. Unpublished doctoral dissertation, ETH Zurich, Zurich, Switzerland.
Chicca, E., Badoni, D., Dante, V., D'Andreagiovanni, M., Salina, G., Fusi, S., et al. (2003). A VLSI recurrent network of integrate-and-fire neurons connected by plastic synapses with long term memory. IEEE Transactions on Neural Networks, 14(5), 1297–1307.
Chicca, E., Indiveri, G., & Douglas, R. (2003). An adaptive silicon synapse. In Proc. IEEE International Symposium on Circuits and Systems (pp. I-81–I-84). Piscataway, NJ: IEEE.
Destexhe, A., Mainen, Z., & Sejnowski, T. (1998). Kinetic models of synaptic transmission. In C. Koch & I. Segev (Eds.), Methods in neuronal modelling, from ions to networks (pp. 1–25). Cambridge, MA: MIT Press.
Fusi, S., Annunziato, M., Badoni, D., Salamon, A., & Amit, D. J. (2000). Spike-driven synaptic plasticity: Theory, simulation, VLSI implementation. Neural Computation, 12, 2227–2258.
Gordon, C., Farquhar, E., & Hasler, P. (2004, May). A family of floating-gate adapting synapses based upon transistor channel models. In 2004 IEEE International Symposium on Circuits and Systems (Vol. 1, pp. 317–320). Piscataway, NJ: IEEE.
Gütig, R., & Sompolinsky, H. (2006). The tempotron: A neuron that learns spike timing-based decisions. Nature Neuroscience, 9, 420–428.
Hertz, J., Krogh, A., & Palmer, R. G. (1991). Introduction to the theory of neural computation. Reading, MA: Addison-Wesley.
Horiuchi, T., & Hynna, K. (2001). A VLSI-based model of azimuthal echolocation in the big brown bat. Autonomous Robots, 11(3), 241–247.
Hynna, K., & Boahen, K. (2001). Space-rate coding in an adaptive silicon neuron. Neural Networks, 14, 645–656.
Hynna, K. M., & Boahen, K. (2006, May). Neuronal ion-channel dynamics in silicon. In 2006 IEEE International Symposium on Circuits and Systems (pp. 3614–3617). Piscataway, NJ: IEEE.
Indiveri, G. (2000). Modeling selective attention using a neuromorphic analog VLSI device. Neural Computation, 12(12), 2857–2880.
Indiveri, G., Chicca, E., & Douglas, R. (2006). A VLSI array of low-power spiking neurons and bistable synapses with spike-timing dependent plasticity. IEEE Transactions on Neural Networks, 17(1), 211–221.
Kandel, E. R., Schwartz, J., & Jessell, T. M. (2000). Principles of neural science. New York: McGraw-Hill.
Koch, C. (1999). Synaptic input. In M. Stryker (Ed.), Biophysics of computation: Information processing in single neurons (pp. 85–116). New York: Oxford University Press.
Koch, C., Poggio, T., & Torre, V. (1983). Nonlinear interactions in a dendritic tree: Localization, timing, and role in information processing. PNAS, 80, 2799–2802.
Lazzaro, J. P. (1994). Low-power silicon axons, neurons, and synapses. In M. E. Zaghloul, J. L. Meador, & R. W. Newcomb (Eds.), Silicon implementation of pulse coded neural networks (pp. 153–164). Norwell, MA: Kluwer.
Liu, S.-C., Kramer, J., Indiveri, G., Delbrück, T., Burg, T., & Douglas, R. (2001). Orientation-selective aVLSI spiking neurons. Neural Networks, 14(6/7), 629–643.
Liu, S.-C., Kramer, J., Indiveri, G., Delbrück, T., & Douglas, R. (2002). Analog VLSI: Circuits and principles. Cambridge, MA: MIT Press.
Mead, C. (1989). Analog VLSI and neural systems. Reading, MA: Addison-Wesley.
Merolla, P., & Boahen, K. (2004). A recurrent model of orientation maps with simple and complex cells. In S. Thrun, L. K. Saul, & B. Schölkopf (Eds.), Advances in neural information processing systems, 16 (pp. 995–1002). Cambridge, MA: MIT Press.
Mitra, S., Fusi, S., & Indiveri, G. (2006, May). A VLSI spike-driven dynamic synapse which learns. In Proceedings of IEEE International Symposium on Circuits and Systems (pp. 2777–2780). Piscataway, NJ: IEEE.
Morris, R., Davis, S., & Butcher, S. (1990). Hippocampal synaptic plasticity and NMDA receptors: A role in information storage? Philosophical Transactions: Biological Sciences, 329(1253), 187–204.
Murray, A. F. (1998). Pulse-based computation in VLSI neural networks. In W. Maass & C. M. Bishop (Eds.), Pulsed neural networks (pp. 87–109). Cambridge, MA: MIT Press.
Northmore, D. P. M., & Elias, J. G. (1998). Building silicon nervous systems with dendritic tree neuromorphs. In W. Maass & C. M. Bishop (Eds.), Pulsed neural networks (pp. 135–156). Cambridge, MA: MIT Press.
Rasche, C., & Hahnloser, R. (2001). Silicon synaptic depression. Biological Cybernetics, 84(1), 57–62.
Satyanarayana, S., Tsividis, Y., & Graf, H. (1992). A reconfigurable VLSI neural network. IEEE J. Solid-State Circuits, 27(1), 67–81.
Shi, R., & Horiuchi, T. (2004a). A summating, exponentially-decaying CMOS synapse for spiking neural systems. In S. Thrun, L. Saul, & B. Schölkopf (Eds.), Advances in neural information processing systems, 16. Cambridge, MA: MIT Press.
Shi, R., & Horiuchi, T. (2004b). A VLSI model of the bat lateral superior olive for azimuthal echolocation. In Proceedings of the 2004 International Symposium on Circuits and Systems (ISCAS04) (Vol. 4, pp. 900–903). Piscataway, NJ: IEEE.
Turrigiano, G., Leslie, K., Desai, N., Rutherford, L., & Nelson, S. (1998). Activity-dependent scaling of quantal amplitude in neocortical neurons. Nature, 391, 892–896.
Wang, X. (1999). Synaptic basis of cortical persistent activity: The importance of NMDA receptors to working memory. Journal of Neuroscience, 19, 9587–9603.
Received May 11, 2006; accepted September 27, 2006.
NOTE
Communicated by Dominique Martinez
Exact Simulation of Integrate-and-Fire Models with Exponential Currents

Romain Brette
[email protected]
Odyssee Lab (ENPC Certis/ENS Paris/INRIA Sophia), Département d'Informatique, École Normale Supérieure, 75230 Paris Cedex 05, France
Neural networks can be simulated exactly using event-driven strategies, in which the algorithm advances directly from one spike to the next spike. It applies to neuron models for which we have (1) an explicit expression for the evolution of the state variables between spikes and (2) an explicit test on the state variables that predicts whether and when a spike will be emitted. In a previous work, we proposed a method that allows exact simulation of an integrate-and-fire model with exponential conductances, with the constraint of a single synaptic time constant. In this note, we propose a method, based on polynomial root finding, that applies to integrate-and-fire models with exponential currents, with possibly many different synaptic time constants. Models can include biexponential synaptic currents and spike-triggered adaptation currents.

1 Introduction

General event-driven strategies have been devised for the case when spikes can be emitted only at times of incoming spikes (Mattia & Del Giudice, 2000). The relevant models are mostly simple pulse-coupled integrate-and-fire models, but it is also possible to include the effect of synaptic conductances (Rudolph & Destexhe, 2006). Event-driven strategies can be extended to the more realistic case when outgoing spikes are delayed (Rochel & Martinez, 2003; Lee & Farhat, 2001), by using provisory events that can be cancelled by incoming spikes. However, fitting neuron models to the requirements of exact event-driven simulation is generally not trivial, because one needs a guaranteed method that predicts whether and when a spike will be emitted given the values of the state variables. In a previous work, we devised a method that allows exact event-driven simulation of integrate-and-fire models with exponential synaptic conductances (Brette, 2006), but it was limited to the case when all synaptic time constants are identical. Here we turn to integrate-and-fire models with exponential synaptic currents, in which the membrane potential V and the synaptic inputs Ii evolve according
to the following system of differential equations:

\tau V' = V_0 - V + \sum_{i=1}^{k} I_i
\tau_1 I_1' = -I_1
\tau_2 I_2' = -I_2
...
\tau_k I_k' = -I_k,

where V0 is the rest potential, τ is the membrane time constant, and τi are the synaptic time constants (note that for clarity, we have not included the membrane resistance; it is contained in the inputs Ii, which have the dimension of voltages). Incoming spikes trigger instantaneous additive changes in one or several input variables Ij, and an outgoing spike is emitted when V = Vt, upon which the voltage is reset (V → Vr). This formalism includes (1) exponential synaptic currents; (2) biexponential synaptic currents (exp(−t/τ1) − exp(−t/τ2), which simply use two variables Ii); and (3) exponential adaptation currents (each outgoing spike triggers a negative exponential current). Since the differential system is linear, it can be solved analytically in intervals with no spike:

V(t) = V_0 + (V(0) - V_0) \, e^{-t/\tau} + \sum_{i=1}^{k} I_i(0) \frac{\tau_i}{\tau - \tau_i} \left( e^{-t/\tau} - e^{-t/\tau_i} \right).    (1.1)
It is, however, not trivial to tell whether and when V(t) will cross the threshold Vt for a given initial state (V(0), I1(0), ..., Ik(0)). Figure 1A shows the three difficulties that can arise (here with a response to a biexponential current): (1) evaluating the trajectory at the end of a short interval (as in discrete-time simulations) can miss the threshold crossing; (2) there are in general several threshold crossings; and (3) nonlinear root finding methods (e.g., Newton-Raphson) may not converge or may converge to the wrong crossing time, because the trajectory of V(t) is not concave. Our contribution in this note is to describe a method that tells with 100% certainty whether and when a threshold crossing will occur given an initial state. The method relies on polynomial root finding algorithms and applies to the case when the time constants are commensurable (i.e., they are related to each other by rational factors), which is the case in practice.
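As a concrete illustration of the first difficulty (our addition; the biexponential input amplitude and the unit threshold are arbitrary choices, with time constants matching Figure 1A), the Python sketch below evaluates the closed-form trajectory of equation 1.1 on a fine grid:

```python
import numpy as np

# Time constants as in Figure 1A (s); rest and threshold in arbitrary units.
tau, tau1, tau2 = 0.020, 0.003, 0.007
V0, Vt = 0.0, 1.0

def V_of_t(t, V_init, I_init):
    """Closed-form membrane trajectory between spikes (equation 1.1)."""
    V = V0 + (V_init - V0) * np.exp(-t / tau)
    for tau_i, I_i in zip((tau1, tau2), I_init):
        V += I_i * tau_i / (tau - tau_i) * (np.exp(-t / tau) - np.exp(-t / tau_i))
    return V

# A biexponential current (I1 = -I2 with different time constants) gives a
# non-monotonic trajectory whose peak barely exceeds threshold: coarse
# sampling of V(t) could easily miss the crossing.
t = np.linspace(0.0, 0.1, 10001)
V = V_of_t(t, V0, (-10.0, 10.0))
print("peak V =", V.max(), "; crosses threshold:", bool((V > Vt).any()))
```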
Figure 1: Exact simulation of current-based integrate-and-fire models. (A) Response of an integrate-and-fire model to a biexponential synaptic current (here, membrane time constant is 20 ms, and synaptic time constants are 3 ms and 7 ms). The horizontal line is the threshold potential (Vt). (B) Statistics of the outcome of the spike test in the random network model. (C) Architecture of the network model (see the text). (D) Power spectral density of the firing rate of the network simulated for 10 minutes with the present event-driven algorithm (solid line) and clock-driven integration with time step 0.1 ms (broken line).
2 Spike Timing Computation

2.1 Polynomial Formulation. Let τlcm be the least common multiple of τ, τ1, ..., τk. Define c, c1, ..., ck as the integers such that τlcm = cτ and τlcm = ci τi for all i, and let x = e^{-t/\tau_{lcm}}. Then V(t) (see equation 1.1) is the value of a polynomial at x, and finding t > 0 such that V(t) = Vt means finding a root in [0, 1] of the polynomial P defined by

P(X) = V_0 - V_t + \left( V(0) - V_0 + \sum_{i=1}^{k} \frac{\tau_i I_i(0)}{\tau - \tau_i} \right) X^c - \sum_{i=1}^{k} I_i(0) \frac{\tau_i}{\tau - \tau_i} X^{c_i}.    (2.1)
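A minimal Python sketch of this construction follows (our addition; the time constants are chosen to match the network example of section 3.1, and `threshold_polynomial` is a hypothetical helper name, not code from the paper):

```python
import numpy as np

# Time constants matching the example of section 3.1: tau = 20 ms,
# tau_e = 5 ms, tau_i = 10 ms, hence tau_lcm = 20 ms, c = 1, c_e = 4, c_i = 2.
tau, taus, tau_lcm = 0.020, (0.005, 0.010), 0.020
c = round(tau_lcm / tau)
cs = [round(tau_lcm / t_i) for t_i in taus]
V0, Vt = 0.0, 1.0

def threshold_polynomial(V_init, I_init):
    """Ascending coefficients of the polynomial P of equation 2.1, whose
    roots in [0, 1] (in X = exp(-t/tau_lcm)) are the threshold crossings."""
    a = np.zeros(max([c] + cs) + 1)
    a[0] = V0 - Vt
    a[c] += V_init - V0 + sum(t_i * I_i / (tau - t_i)
                              for t_i, I_i in zip(taus, I_init))
    for c_i, t_i, I_i in zip(cs, taus, I_init):
        a[c_i] -= I_i * t_i / (tau - t_i)
    return a

a = threshold_polynomial(0.5, (3.0, -2.0))
print(a)                          # a[0] = V0 - Vt < 0
print(np.polyval(a[::-1], 1.0))   # P(1) = V(0) - Vt, the state "now"
```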
2.2 The Spike Test. There are several algorithms to compute the number of roots of a real polynomial in a given interval [a, b] (Collins & Loos, 1983; Pan, 1992; Mignotte & Stefanescu, 1999). The most popular one is based on Sturm sequences, which we briefly describe here. Define F0 = P, F1 = P′, and for i > 1, −Fi is the remainder of the Euclidean division of Fi−2 by Fi−1. The sequence terminates in m ≤ n steps (where n is the degree of P). Then the number of roots of P in any interval [a, b] is S(a) − S(b), where S(x) is the number of sign changes in the sequence [F0(x), F1(x), ..., Fm(x)]. Note that in the present case, evaluation of the polynomials is simple since Fi(0) is the constant coefficient of Fi and Fi(1) is the sum of all its coefficients. Thus, Sturm's algorithm provides us with an O(1) (constant time) method to decide whether a spike will occur.

2.3 Computation of Spike Timing. Computing the spike timing means finding the largest root of P in [0, 1]. Sturm's algorithm also answers this problem. One can use, for example, a simple bisection method: divide the interval in two; select the right-hand interval if it contains a root (calculate S(a) − S(b) as above); otherwise select the left-hand interval; and iterate until the required precision is reached. The number of iterations is proportional to the number of digits. Convergence can be accelerated using a hybrid bisection/Newton-Raphson method.

2.4 A Quicker Spike Test. Although the spike test based on Sturm sequences requires a constant number of operations, it is useful to have a faster test because an event-driven simulator will call the spike test every time an incoming spike is received. It turns out that in balanced networks, the answer of the spike test is negative most of the time because the average input current is subthreshold. In the model considered in Brette (2006), we showed that most of the time, simple inequalities could discard the possibility of a spike. Here we describe a simple test based on another standard algorithm for polynomials, Descartes' rule of signs (Collins & Loos, 1983). Consider a polynomial with real coefficients:

P(X) = a_0 + a_1 X + a_2 X^2 + \dots + a_n X^n.

According to Descartes' rule of signs, the number of positive roots of P does not exceed the number of sign changes in the sequence a_0, a_1, ..., a_n (null coefficients are discarded). In the present case, we know that the number of roots in [0, 1] is even because P(0) < 0 and P(1) < 0 (provided V0 < Vt). Therefore, if Descartes' rule of signs predicts at most zero or one root, then we know that there is none in [0, 1]. If the maximum number of roots predicted by the rule is two or more, then we must use the full Sturm's algorithm in order to obtain a certain answer.
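The two tests can be prototyped in a few lines of Python (our addition; this is an illustrative floating-point sketch, whereas a production implementation would need care with numerical tolerances in the Sturm sequence):

```python
import numpy as np

def sign_changes(vals, tol=1e-12):
    """Number of sign changes in a sequence, discarding (near-)zero entries."""
    signs = [np.sign(v) for v in vals if abs(v) > tol]
    return sum(1 for s0, s1 in zip(signs, signs[1:]) if s0 != s1)

def descartes_quick_test(a):
    """True if Descartes' rule of signs already excludes a crossing: at most
    one sign change in the coefficients means at most one positive root, and
    the number of roots in [0, 1] must be even since P(0) < 0 and P(1) < 0."""
    return sign_changes(a) <= 1

def sturm_roots_01(a):
    """Number of distinct roots of P in (0, 1], with P given by its ascending
    coefficients `a`, using a Sturm sequence (section 2.2)."""
    seq = [np.trim_zeros(a[::-1], 'f')]          # descending coefficients
    seq.append(np.polyder(seq[0]))               # F1 = P'
    while len(seq[-1]) > 1:
        rem = np.trim_zeros(np.polydiv(seq[-2], seq[-1])[1], 'f')
        if rem.size == 0:
            break
        seq.append(-rem)                         # F_i = -remainder
    S = lambda x: sign_changes([np.polyval(F, x) for F in seq])
    return S(0.0) - S(1.0)

# Polynomial from the sketch above: the quick test is inconclusive here
# (two sign changes), so the exact Sturm count settles the question.
a = np.array([-1.0, -0.5, 2.0, 0.0, -1.0])
if descartes_quick_test(a):
    print("no spike")
else:
    print("roots in [0, 1]:", sturm_roots_01(a))   # 0 -> no spike after all
```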
In practice, this method is useful when there are exponential excitatory and inhibitory currents for which the excitatory time constant τe is smaller than the inhibitory time constant τi (which is fortunately the typical case). Then we can see from equation 2.1 that Descartes' rule of signs predicts that no spike will occur if

V(0) - V_0 + I_e(0) \frac{\tau_e}{\tau - \tau_e} + I_i(0) \frac{\tau_i}{\tau - \tau_i} \leq 0.

The rule also applies when there is a slow adaptation current (τa > τ).

3 Example and Discussion

3.1 A Network of Excitatory and Inhibitory Neurons. In order to illustrate our method, we simulated a randomly connected network (connection probability 20%) of 250 excitatory and 150 inhibitory integrate-and-fire neurons, with models described in Vogels and Abbott (2005); the network was driven by excitatory and inhibitory external events, modeled as Poisson spike trains (rates were 500 Hz). Synaptic currents were exponential, with excitatory time constant τe = 5 ms and inhibitory time constant τi = 10 ms. The membrane time constant is τ = 20 ms, so that τlcm = 20 ms = τ = 4τe = 2τi. The code is available online at http://www.di.ens.fr/~brette/papers/Brette2006NC.htm.

We implemented our algorithm in C++ and compared it with a clock-driven algorithm with exact subthreshold integration and time step dt = 0.1 ms (i.e., the differential equations are integrated exactly, but spike times are bound to a discrete time grid). The event-driven implementation is quite complex because it involves a queue management system and polynomial operations, so we also provide an implementation with Scilab, a free vector-based scientific software package developed by INRIA and ENPC (http://www.scilab.org), which has built-in polynomial functions (note that the simulation is much slower with Scilab because programs are interpreted).

Figure 1B shows that 97% of the time, the outcome of the spike test is negative and that the simple test based on Descartes' rule of signs is negative 90% of the time, so that the Sturm sequence is calculated in only 10% of the cases. We simulated the network model for 10 minutes of biological time; the simulation time was about 20 minutes for the event-driven implementation and 5 minutes for the clock-driven implementation. The average firing rate of the network was about 8 Hz in both cases. However, Figure 1D shows that the time-varying population rate in the clock-driven simulation lacked high frequencies (>100 Hz), compared to the exact event-driven simulation (consistently, the high-frequency power increased with decreasing dt).
3.2 Perspectives. We have described a method to implement current-based integrate-and-fire models with exponential (or biexponential) currents in an exact event-driven fashion. Our method also applies to models with spike-triggered adaptation (which is formally equivalent to an inhibitory self-connection). An important and difficult extension would be to design methods to simulate two-variable models such as those of Izhikevich (2003) and Brette and Gerstner (2005), which are more realistic because the threshold is "soft" (with a quadratic or exponential term in the first differential equation).

Acknowledgments

This work was partially supported by the EC IP project FP6-015879, FACETS, and the EADS Corporate Research Foundation.

References

Brette, R. (2006). Exact simulation of integrate-and-fire models with synaptic conductances. Neural Computation, 18(8), 2004–2027.
Brette, R., & Gerstner, W. (2005). Adaptive exponential integrate-and-fire model as an effective description of neuronal activity. Journal of Neurophysiology, 94, 3637–3642.
Collins, G., & Loos, R. (1983). Real zeroes of polynomials. In B. Buchberger, G. E. Collins, R. Loos, & R. Albrecht (Eds.), Computer algebra: Symbolic and algebraic computation (2nd ed., pp. 83–94). Berlin: Springer-Verlag.
Izhikevich, E. M. (2003). Simple model of spiking neurons. IEEE Transactions on Neural Networks, 14(6), 1569–1572.
Lee, G., & Farhat, N. H. (2001). The double queue method: A numerical method for integrate-and-fire neuron networks. Neural Networks, 14(6–7), 921–932.
Mattia, M., & Del Giudice, P. (2000). Efficient event-driven simulation of large networks of spiking neurons and dynamical synapses. Neural Computation, 12(10), 2305–2329.
Mignotte, M., & Stefanescu, D. (1999). Polynomials: An algorithmic approach. Berlin: Springer-Verlag.
Pan, V. (1992). Complexity of computations with matrices and polynomials. SIAM Review, 34(2), 225–262.
Rochel, O., & Martinez, D. (2003). An event-driven framework for the simulation of networks of spiking neurons. In Proc. 11th European Symposium on Artificial Neural Networks (pp. 295–300). Bruges, Belgium: D-facto.
Rudolph, M., & Destexhe, A. (2006). Analytical integrate-and-fire neuron models with conductance-based dynamics for event-driven simulation strategies. Neural Computation, 18, 2146–2210.
Vogels, T. P., & Abbott, L. F. (2005). Signal propagation and logic gating in networks of integrate-and-fire neurons. Journal of Neuroscience, 25(46), 10786–10795.
Received April 18, 2006; accepted October 29, 2006.
LETTER
Communicated by Thomas Wachtler
The Relation Between Color Discrimination and Color Constancy: When Is Optimal Adaptation Task Dependent?

Alicia B. Abrams
[email protected]
James M. Hillis [email protected]
David H. Brainard
[email protected]
University of Pennsylvania, Department of Psychology, Philadelphia, PA 19104, U.S.A.
Color vision supports two distinct visual functions: discrimination and constancy. Discrimination requires that the visual response to distinct objects within a scene be different. Constancy requires that the visual response to any object be the same across scenes. Across changes in scene, adaptation can improve discrimination by optimizing the use of the available response range. Similarly, adaptation can improve constancy by stabilizing the visual response to any fixed object across changes in illumination. Can common mechanisms of adaptation achieve these two goals simultaneously? We develop a theoretical framework for answering this question and present several example calculations. In the examples studied, the answer is largely yes when the change of scene consists of a change in illumination and considerably less so when the change of scene consists of a change in the statistical ensemble of surface reflectances in the environment.

1 Introduction

Color vision supports two distinct visual functions: discrimination and constancy (Jacobs, 1981; Mollon, 1982). Color discrimination, the ability to determine that two spectra differ, is useful for segmenting an image into regions corresponding to distinct objects. Effective discrimination requires that the visual response to distinct objects within a scene be different. Across changes in scene, adaptation can improve discrimination by optimizing the use of the available response range for objects in the scene (Walraven, Enroth-Cugell, Hood, MacLeod, & Schnapf, 1990).
James M. Hillis is now at the Department of Psychology, University of Glasgow, Glasgow, Scotland, G12 8QB, U.K.

Neural Computation 19, 2610–2637 (2007)
© 2007 Massachusetts Institute of Technology
Color constancy is the ability to identify objects on the basis of their color appearance (Brainard, 2004). Because the light reflected from an object to the eye depends on both the object’s surface reflectance and the illumination, constancy requires that some process stabilize the visual representation of surfaces across changes in illumination. Early visual adaptation can mediate constancy if it compensates for the physical changes in reflected light caused by illumination changes (Wandell, 1995). Although there are large theoretical and empirical literatures concerned with both how adaptation affects color appearance and constancy on the one hand (Wyszecki, 1986; Zaidi, 1999; Foster, 2003; Shevell, 2003; Brainard, 2004), and discrimination on the other (Wyszecki & Stiles, 1982; Walraven et al., 1990; Hood & Finkelstein, 1986; Lennie & D’Zmura, 1988; Kaiser & Boynton, 1996; Eskew, McLellan, & Giulianini, 1999), it is rare that the two functions are considered simultaneously. Still, it is clear that they are intimately linked since they rely on the same initial representation of spectral information. In addition, constancy is useful only if color vision also supports some amount of discrimination performance; in the absence of any requirement for discrimination, constancy can be achieved trivially by a visual system that assigns the same color descriptor to every object in every scene.1 The recognition that constancy (or its close cousin, appearance) and discrimination are profitably considered jointly has been exploited in a few recent papers (Robilotto & Zaidi, 2004; Hillis & Brainard, 2005). Here we ask whether applying the same adaptive transformations to visual responses can simultaneously optimize performance for both constancy and discrimination. If the visual system adapts to each of two environments so as to produce optimal color discrimination within each, what degree of constancy is achieved? How does this compare with what is possible if adaptation is instead tailored to optimize constancy, and what cost would such an alternative adaptation strategy impose on discrimination performance? To address these questions, we adopt the basic theoretical framework introduced by Grzywacz and colleagues (Grzywacz & Balboa, 2002; Grzywacz & de Juan, 2003; also Brenner, Bialek, & de Ruyter van Steveninck, 2000; von der Twer & MacLeod, 2001; Foster, Nascimento, & Amano, 2004; Stocker & Simoncelli, 2005) by analyzing task performance using explicit models of the visual environment and early visual processing. Parameters in the model visual system specify the system’s state of adaptation, and we study how these parameters should be set to maximize performance, where the evaluation is made across scenes drawn from a statistical model of the visual environment (see Grzywacz & Balboa, 2002; Grzywacz
1 This is sometimes referred to as the Ford algorithm, after a quip attributed to Henry Ford: "People can have the Model T in any color—so long as it's black" (http://en.wikiquote.org/wiki/Henry_Ford).
& de Juan, 2003). Within this framework, we investigate the trade-offs between optimizing performance for discrimination and for constancy. We begin in section 2 with consideration of a simple one-dimensional example that illustrates the basic ideas and then generalize in section 3. The work presented here complements our recent experimental efforts directed toward understanding the degree to which measured adaptation of the visual pathways mediates judgments of both color discrimination and color appearance (Hillis & Brainard, 2005).

2 Univariate Example

We begin with the specification of a model visual system, a visual environment, and performance measures. The basic structure of our problem is well illustrated for the case of lightness/brightness constancy and discrimination, and we begin with a treatment of this case.

2.1 Visual Environment. The model visual environment consists of achromatic matte surfaces lit by a diffuse illuminant. Each surface j is characterized by its reflectance r_j, which specifies the fraction of incident illumination that is reflected. Each illuminant is specified by its intensity e_i. The intensity of light c_{i,j} reflected from surface j under illuminant i is thus given by

c_{i,j} = e_i r_j.    (2.1)
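As a concrete illustration, the following minimal sketch (in Python; the variable names, seed, and parameter values are ours, chosen purely for illustration) samples surfaces from the truncated normal ensemble defined in equation 2.2 below and computes the reflected intensities of equation 2.1.

```python
# Minimal sketch of the scene model (equations 2.1 and 2.2); parameter
# values are illustrative, not the authors' fitted values.
import numpy as np

rng = np.random.default_rng(0)

def draw_reflectances(mu_r, sigma_r, n_surfaces):
    """Sample reflectances from a normal distribution truncated to [0, 1]
    by rejection; the truncation keeps reflectances physically realizable."""
    out = []
    while len(out) < n_surfaces:
        r = rng.normal(mu_r, sigma_r)
        if 0.0 <= r <= 1.0:
            out.append(r)
    return np.array(out)

e = 100.0                                 # illuminant intensity e_i
r = draw_reflectances(0.5, 0.3, 500)      # surface reflectances r_j
c = e * r                                 # c_{i,j} = e_i * r_j  (equation 2.1)
```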
At any given moment, we assume that the illuminant e_i is known and that the particular surfaces in the scene have been drawn from an ensemble of surfaces. The ensemble statistics characterize regularities of the visual environment. In particular, we suppose that

r_j ∼ N̄(μ_r, σ_r²),    (2.2)

where ∼ indicates "distributed as" and N̄(μ_r, σ_r²) represents a truncated normal distribution with mean parameter μ_r and variance parameter σ_r². The overbar in the notation indicates the truncation, which means that the probability of obtaining a reflectance in the range 0 ≤ r_j ≤ 1 is proportional to the standard normal density function, while the probability of obtaining a reflectance outside this range is zero. The truncated distribution is normalized so that the total probability across all possible values of r_j is unity. We are interested in (1) how well a simple model visual system can discriminate and identify randomly chosen surfaces viewed within a single scene and (2) how well the same visual system can discriminate and
identify randomly chosen surfaces viewed across different scenes where the illumination, surface ensemble, or state of adaptation has changed.

2.2 Model Visual System. The model visual system has a single class of photoreceptor. At each location, the information transmitted by this photoreceptor is limited in two ways. First, the receptor has a limited response range. We capture this by supposing that the deterministic component of the response to surface j under illuminant i is given by

u_{i,j} = (g c_{i,j})^n / [(g c_{i,j})^n + 1],    (2.3)

where u_{i,j} represents the visual response, c_{i,j} represents the intensity of incident light obtained through equation 2.1, g is a gain parameter, and n is a steepness parameter that controls the slope of the visual response function. For this model visual system, the adaptation parameters g and n characterize the system's state of adaptation. Across scenes, the visual system may set g and n to optimize its performance. The second limit on the transmitted information is that the responses are noisy. We can capture this by supposing that the deterministic component of the visual responses is perturbed by zero-mean additive visual noise, normally distributed with variance σ_n².

2.3 Discrimination Task and Performance Measure. To characterize discrimination performance, we need to specify a discrimination task. We consider a same-different task. On each trial, the observer sees either two views of the same surface (same trials) or one view each of different surfaces (different trials), all viewed under the same light e_i. The observer's task is to respond "same" on the same trials and "different" on the different trials. On same trials, a single surface is drawn at random from the surface ensemble and viewed twice. On different trials, two surfaces are drawn independently from the surface ensemble. Independently drawn noise is added to the response for each view of each surface. This task is referred to in the signal detection literature as a roving same-different design (Macmillan & Creelman, 2005). The observer's performance is characterized by a hit rate (fraction of "same" responses on same trials) and a false alarm rate (fraction of "same" responses on different trials). It is well known that the hit and false alarm rates obtained by an observer in a same-different task depend on both the quality of the information supplied by the visual responses (i.e., signal-to-noise ratio) and how the observer chooses to trade off hits and false alarms (Green & Swets, 1966; Macmillan & Creelman, 2005). By tolerating more false alarms, an observer can increase his or her hit rate. Indeed, by varying the response criterion used in the hit–false alarm trade-off, an observer can obtain performance
Figure 1: ROC diagram. The ROC (receiver operating characteristic) diagram plots hit rate versus the false alarm rate. An observer can maximize hit rate by responding "same" on every trial. This will lead to a high false alarm rate, and performance will plot at (1,1) in the diagram. An observer can minimize false alarms by responding "different" on every trial and achieve performance at (0,0). Varying criteria between these two extremes produces a trade-off between hits and false alarms. The exact locus traced out by this trade-off depends on the information used at the decision stage. Better information leads to performance curves that tend more toward the upper left of the plot (the solid curve indicates better information than the dashed curve). The area under the ROC curve, referred to as A′, is a task-specific measure of information that does not depend on criterion. The hatched area is A′ for the dashed ROC curve. The ROC curves shown were computed for two surfaces with reflectances r_1 = 0.15 and r_2 = 0.29 presented in a roving same-different design. The illuminant had intensity e = 100, and the deterministic component of the visual responses was computed from equation 2.3 with g = 0.02 and n = 2. The solid line corresponds to σ_n = 0.05 and A′ = 0.85, and the dashed line corresponds to σ_n = 0.065 and A′ = 0.76. Hit and false alarm rates were computed using the decision rule described in section 2.4.
denoted by a locus of points in what is referred to as a receiver operating characteristic (ROC) diagram (see Figure 1 and its caption). A standard criterion-free measure of the quality of information available in the visual responses is A′, the area under the ROC curve (Green & Swets, 1966; Macmillan & Creelman, 2005). In this letter, we use A′ as our measure of performance for both discrimination (as is standard) and constancy (see below).

2.4 Effect of Adaptation on Discrimination. To understand the effect of adaptation, we ask how the average A′ depends on the adaptation
parameters, given the surface ensemble, illuminant, and noise. For a roving design, a near-optimal strategy is to compute the magnitude of the difference between the visual responses for two surfaces and compare this difference to a criterion C (Macmillan & Creelman, 2005). The intuition is that when the visual responses to the two surfaces are similar, the observer should say "same." Let u_{i,j} be the visual response to one surface under the given illuminant e_i, and u_{i,k} the response to the other surface. The observer responds "same" if the squared response difference Δu²_{i,jk} = |u_{i,j} − u_{i,k}|² is less than C and "different" if Δu²_{i,jk} ≥ C. For any pair of surfaces r_j and r_k, we can compute the values of the deterministic component of the corresponding visual responses (u_{i,j} and u_{i,k}), once we know the illuminant e_i and the adaptation parameters g and n. Because of noise, the observed response difference Δu²_{i,jk} varies from trial to trial. If the variance of the noise is σ_n², the quantity (Δũ_{i,jk})² = (Δu_{i,jk}/(√2 σ_n))² is distributed as noncentral chi-squared with 1 degree of freedom and noncentrality parameter (Δū_{i,jk}/(√2 σ_n))², where Δū_{i,jk} denotes the deterministic component of the response difference.² Because scaling the visual responses on both same and different trials by the common factor 1/(√2 σ_n) does not affect the information contained in these responses, a decision rule that compares (Δũ_{i,jk})² to the criterion C̃ = C/(2σ_n²) leads to the same performance as one that compares Δu²_{i,jk} to C. Thus, the known noncentral chi-squared distributions on same and different trials may be used, along with standard signal detection methods, to compute hit and false alarm rates for a set of criteria. The resultant ROC curve may then be numerically integrated to find the value of A′_{i,jk}. To evaluate overall discrimination performance, we compute A′_{i,jk} for many pairs of surfaces drawn according to the surface ensemble and compute an aggregate measure. Figure 2 illustrates how the gain parameter g affects discrimination performance for a single illuminant and surface ensemble, when the steepness parameter is held fixed at n = 2. The top shows histograms of A′_{i,jk} for two choices of gain. As the gain parameter is changed from g = 0.010 (solid bars) to g = 0.021 (hatched bars), the values of A′_{i,jk} increase, with fewer values near 0.5 and more values near 1.0. To study how performance varies parametrically with changes in gain, we need a measure that summarizes the change in the distribution of the A′_{i,jk}. In this letter, we use the mean of the A′_{i,jk} for this purpose. For simplicity of notation, we denote this value by the symbol 𝒟 (script "D" for discrimination). The bottom panel of Figure 2 shows how 𝒟 varies with gain for four noise levels. There is an optimal choice of gain for each noise level (cf. Brenner et al., 2000), and the optimal gain does not vary appreciably with noise level.
2 On same trials, the deterministic difference Δū²_{i,jk} is 0, and the distribution reduces to an ordinary chi-squared distribution.
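The A′ computation just described can be made concrete as follows (a sketch only: the helper name and the criterion grid are our choices, and the full calculation in the text additionally averages A′ over many surface pairs).

```python
# Sketch of the A' computation for one surface pair (section 2.4).
import numpy as np
from scipy.stats import chi2, ncx2

def aprime_same_different(du_bar, sigma_n, n_crit=2001):
    """Area under the ROC curve for the roving same-different task.
    du_bar: deterministic response difference for the surface pair."""
    lam = (du_bar / (np.sqrt(2.0) * sigma_n)) ** 2   # noncentrality parameter
    crit = np.linspace(0.0, lam + 50.0, n_crit)      # sweep of criteria C-tilde
    hit = chi2.cdf(crit, df=1)           # P("same" | same): central chi-squared
    fa = ncx2.cdf(crit, df=1, nc=lam)    # P("same" | different): noncentral
    return np.trapz(hit, fa)             # numerical integration of the ROC

# For the Figure 1 pair (r1 = 0.15, r2 = 0.29, e = 100, g = 0.02, n = 2),
# equation 2.3 gives responses of roughly 0.083 and 0.252; with sigma_n = 0.05
# this should recover an A' close to the 0.85 reported in the caption.
```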
Figure 2: Effect of gain and noise on discrimination performance. (Top) Histograms of the discrimination measure A′_{i,jk} for two values of the gain parameter (solid bars, g = 0.010; hatched bars, g = 0.021). In the calculations, we set σ_n = 0.05, n = 2, μ_r = 0.5, σ_r = 0.3, and e = 100. Calculations were performed for 500 draws from the surface ensemble, and A′_{i,jk} was evaluated for all possible 124,750 surface pairs formed from these draws. (Bottom) The mean of the A′_{i,jk} (denoted by 𝒟) as a function of the gain parameter for four noise levels.
Figure 3: State of model visual system for optimal choice of gain. The top right panel shows the response function for the optimal choice of gain (g = 0.021) when the noise is σn = 0.05. Below the response function is a histogram of the light intensities c i, j reaching the eye, while to the left is a histogram of the resultant visual responses. Calculations were performed for 500 draws from the surface ensemble. Choices of gain less than or greater than the optimum would shift the response function right or left. For these nonoptimal choices, visual responses would tend to cluster nearer to the floor or ceiling of the response range, resulting in poorer discrimination performance.
To provide some intuition about the state of the model visual system when the gain is optimized, the top right panel of Figure 3 shows the response function obtained for the optimal choice of gain when σ_n = 0.05. The histogram below the x-axis of this panel shows the distribution of reflected light intensities, while that to the left of the y-axis shows the distribution of visual responses. The histogram of responses appears more uniform than the histogram of light intensities. This general effect is expected from standard results in information theory, where the information transmitted by a channel is maximized when the distribution of channel responses is uniform (Cover & Thomas, 1991). The response
Figure 4: Effect of illuminant change on discrimination and constancy performance. The filled circles and solid line show how discrimination performance decreases when the illuminant intensity is changed and the adaptation parameters are held constant. Here the x-axis indicates the single scene illuminant intensity used in the calculations for the corresponding point. The open circles and dashed line show how constancy performance decreases when the test illuminant intensity is changed and the adaptation parameters are held fixed across the change. Here the x-axis indicates the test illuminant intensity, with the reference illuminant intensity held fixed at 100. All calculations performed with adaptation parameters held fixed (g = 0.021, n = 2) and for σn = 0.05. The surface distribution had parameters µr = 0.5 and σr = 0.3.
histogram is not perfectly uniform because varying the gain alone cannot produce this result and because our performance measure is 𝒟 rather than bits transmitted. Figure 6 below shows response histograms when both gain and steepness parameters are allowed to vary.

2.5 Effect of Illuminant Change on Discrimination. We can also investigate the effect of illumination changes on performance and how adaptation can compensate for such changes. First, consider the case where the adaptation parameters are held fixed. We can compute the performance measure 𝒟 for any visual environment. The filled circles and solid line in Figure 4 plot 𝒟 as a function of the illuminant intensity when the adaptation parameters, noise, and surface ensemble are held fixed. Not surprisingly, performance falls off with the change of illuminant: increasing the illuminant intensity pushes the intensity of the reflected light toward the saturating region of the visual response function and compresses the response range used. The effect of increasing the illuminant intensity is multiplicative, so this effect can be compensated for by decreasing the gain (which also acts multiplicatively) so as to keep the distribution of responses constant. Perfect compensation is possible in this example because of the match between
the physical effect of an illuminant change (multiplication of all reflected light intensities by the same factor) and the effect of a gain change (also multiplication of the same intensities by a common factor). In general, such perfect compensation is not possible.

2.6 Constancy. Suppose that instead of discriminating between surfaces seen under a common illuminant, we ask the question of constancy: How well can the visual responses be used to judge whether two surfaces are the same, when on each trial one surface is viewed in a reference environment and the other is viewed in a test environment? On same trials of the constancy experiment, the observer sees a surface in the reference environment and the same surface in the test environment. On different trials, the observer sees one surface in the reference environment and a different surface in the test environment. As in the discrimination experiment, the observer must respond "same" or "different." The test and reference environments can differ through a change in illuminant, a change in surface ensemble, a change in adaptation parameters, or all of these. We assume that the observer continues to employ the same basic distance decision rule applied to the visual responses, with the decision variable evaluated across the change of environment: Δu⃗²_{jk} = |u_{ref,j} − u_{test,k}|². On same trials, the expression is evaluated for a single surface across the change (r_k = r_j), and on different trials, the expression is evaluated for two draws from the surface ensemble. In the notation, the arrow indicates that the response difference is evaluated across the change from reference to test environment. Basing the decision rule on the response difference models the fact that in our framework, the observer has no explicit knowledge of the illuminant, surface ensemble, or state of adaptation—all effects of adaptation on performance are modeled explicitly with a change in adaptation parameters.³ The quantity A′ remains an appropriate measure of performance across a change in visual environments, as it continues to characterize how well hits and false alarms trade off as a function of a decision criterion. For any pair of surfaces, we denote the value of A′ obtained across the change as A⃗′_{jk}. We obtain an aggregate performance measure by computing the mean of the A⃗′_{jk}. To emphasize the fact that the performance measure for constancy is computed across a change in visual environments, we denote this measure by the symbol 𝒞 (script "C" for constancy) rather than overloading the meaning of the symbol 𝒟. Evaluating 𝒞 requires specification of the
3 This choice may be contrasted with work where the measure of performance is bits of information transmitted (Foster et al., 2004). Measurements of information transmitted are silent about what subsequent processing is required to extract the information. Here we are explicitly interested in the performance supported directly by the visual response representation.
illuminant and adaptation parameters for both test and reference environments, as well as the surface ensemble and noise level over which the A⃗′_{jk} are evaluated. Note that 𝒟 may be regarded as a special case of 𝒞 when the illumination, surface, and adaptation parameters are all held fixed, so that the test and reference environments are identical. As an example, the open circles and dashed line in Figure 4 show how constancy performance falls off with a change in test illuminant when there is no compensatory change in adaptation parameters. The reference illuminant had intensity 100, and the x-axis provides the intensity of the test illuminant. Constancy performance was evaluated across draws from the surface ensemble common to the reference and test environments. As with the effect of the illuminant change on within-illuminant discrimination performance, the deleterious effect of the illuminant change on constancy may be eliminated if an appropriate gain change occurs between the test and reference scenes. This is because changing the gain with the illumination can restore the responses under the test light back to their values under the reference light.
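This point is easy to verify numerically. The sketch below (under the univariate model of equation 2.3; the parameter values are arbitrary) checks that multiplying the illuminant intensity by a factor k while dividing the gain by the same factor leaves the responses unchanged.

```python
import numpy as np

def response(c, g, n):
    """Deterministic visual response of equation 2.3."""
    gc_n = (g * c) ** n
    return gc_n / (gc_n + 1.0)

rng = np.random.default_rng(1)
r = rng.uniform(0.0, 1.0, 100)         # arbitrary surface reflectances
e, g, n, k = 100.0, 0.021, 2.0, 1.6    # k scales the illuminant intensity

u_ref = response(e * r, g, n)              # responses under the reference light
u_test = response(k * e * r, g / k, n)     # gain rescaled with the illuminant
assert np.allclose(u_ref, u_test)          # compensation is exact
```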
2.7 Trade-offs Between Discrimination and Constancy. When the adaptation parameters include a gain change, adaptation can compensate perfectly for changes in illumination intensity so that discrimination performance 𝒟 (obtained in a discrimination experiment) remains unchanged and constancy performance 𝒞 (obtained in a constancy experiment with a change of illuminant intensity and a gain change that compensates for it) is at ceiling given discrimination. More generally, there will be cases where the adaptation parameters available within a given model visual system are not able to compensate completely for environmental changes. This raises the possibility that the adaptation parameters that optimize discrimination may differ from those that optimize constancy. Consider the case of an illuminant change where the gain parameter is held fixed and the steepness parameter n is allowed to vary between test and reference environments. Each connected set of solid circles in the left panel of Figure 5 plots 𝒞 against 𝒟 for various choices of the steepness parameter. The five different sets shown were computed for different levels of noise. The point at the lower right of each set indicates performance when the steepness parameter was chosen to maximize 𝒟 for the test illuminant, while the point at the upper left plots performance when the steepness parameter was chosen to maximize 𝒞, evaluated across the change from reference to test illuminant. The points between these two extremes represent performance obtained when the steepness parameter was chosen to maximize either the expression 𝒟 − w(𝒞 − 𝒞₀)² or the expression 𝒞 − w(𝒟 − 𝒟₀)². Maximizing these expressions pushes the value of the leading measure as high as possible while holding the value of the other measure close to a target value. Which expression was used, as well as the values of w, 𝒞₀, and 𝒟₀,
Figure 5: Trade-off between discrimination and constancy. (Left) Each set of connected solid circles shows the trade-off between 𝒟 and 𝒞 for various optimizations of the steepness parameter n (see the text). Each set is for a different noise level (σ_n = 0.01, 0.025, 0.05, 0.075, 0.10), with the set closest to the upper right of the plot corresponding to the lowest noise level. The reference illuminant had intensity e_ref = 100, and the test illuminant had intensity e_test = 160. The surface ensemble was specified by μ_r = 0.5 and σ_r = 0.3 and was common to both the reference and test environments. Both 𝒟 and 𝒞 were evaluated with respect to draws from this surface ensemble. The gain parameter was held fixed at g = 0.02045. The steepness parameter for the reference environment was n = 4.5. Parameters g = 0.02045 and n = 4.5 optimize discrimination performance for the reference environment when σ_n = 0.05. The open circles connected by the dashed line show the performance points that could be obtained for each noise level if there were no trade-off between discrimination and constancy. (Right) Equivalent trade-off noise plotted against visual noise level σ_n. See the discussion in the text.
were chosen by hand so that the trade-off curve represented by each connected set was well sampled. Values of w and 𝒞₀ or 𝒟₀ were held fixed during the optimization for individual points on the trade-off curves. All optimizations were performed in Matlab using routines from its Optimization Toolbox (Version 3). Figure 5 shows that there is a trade-off between the two performance measures—optimizing for constancy results in decreased discrimination performance and vice versa. Indeed, the open circles connected by the dashed line show the performance points that could be obtained for each noise level if there were no trade-off between discrimination and constancy. These were obtained as the points (𝒟_max, 𝒞_max), where the maxima were obtained over the trade-off curves for each noise level.
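A sketch of one such penalized optimization is given below. It assumes callables discrim(n) and constancy(n) that return the scalar measures 𝒟 and 𝒞 for a candidate steepness n, computed as in sections 2.4 and 2.6; the helper name and the search bounds are our own illustrative choices (the text used the Matlab Optimization Toolbox, whereas this sketch uses scipy).

```python
from scipy.optimize import minimize_scalar

def tradeoff_point(discrim, constancy, w, c0, n_bounds=(0.5, 10.0)):
    """Maximize D - w*(C - C0)^2 over the steepness parameter n.
    discrim(n) and constancy(n) are assumed callables returning D and C."""
    objective = lambda n: -(discrim(n) - w * (constancy(n) - c0) ** 2)
    res = minimize_scalar(objective, bounds=n_bounds, method="bounded")
    return res.x, discrim(res.x), constancy(res.x)
```

Swapping the roles of the two measures (maximizing 𝒞 − w(𝒟 − 𝒟₀)²) traces out the other family of intermediate points on each trade-off curve.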
Although the trade-off curves do not include the open points, the distance between each trade-off curve and its corresponding no-trade-off point is not large. One way to quantify this distance is to ask, for each trade-off curve, how much the visual noise would have to be reduced so that performance at the no-trade-off point was feasible. We call this noise reduction the equivalent trade-off noise. To find its value, we treat each solid point as a triplet (𝒟, 𝒞, σ_n) and use bilinear interpolation to find σ_n as a function of 𝒟 and 𝒞. We then identify the value of σ_n corresponding to each (𝒟_max, 𝒞_max). For the trade-off curve corresponding to visual noise level σ_n = 0.10 (the lowest left curve in the left panel of Figure 5), for example, the noise level corresponding to (𝒟_max, 𝒞_max) is 0.092, leading to an equivalent trade-off noise value of 0.008. The right panel of Figure 5 plots the equivalent trade-off noise versus noise level σ_n.⁴ The mean value was 0.005 ± 0.003 S.D. If there were no trade-off, performance at the point (𝒟_max, 𝒞_max) would be possible for each noise level, but to achieve each (𝒟_max, 𝒞_max) requires, on average, a reduction of the visual noise by 0.005.

2.8 Intermediate Discussion. The example above illustrates our basic approach to understanding how adaptation affects both discrimination and constancy. The example illustrates a number of key points. First, as is well known, adaptation is required to maintain optimal discrimination performance across changes in the state of the visual environment (Walraven et al., 1990). Second, adaptation is also necessary to optimize performance for constancy, when we require that surface identity be judged directly in terms of the visual responses. The link between adaptation and constancy has also been explored previously (Burnham, Evans, & Newhall, 1957; D'Zmura & Lennie, 1986; Wandell, 1995). What is new about our approach is that we have set our evaluations of both discrimination and constancy in a common framework by using an A′ measure for both. This allows us to ask questions about how any given adaptation strategy affects both tasks and whether common mechanisms of adaptation can simultaneously optimize performance for both. The theory we develop is closely related to measurements of lightness constancy made by Robilotto and Zaidi (2004), who used a forced-choice method to measure both discrimination within and identification across changes in illuminant. Our theory allows us to quantify the trade-off between constancy and discrimination, either by examination of the shape of trade-off curves or through the equivalent trade-off noise concept.
4 Equivalent trade-off noise was computed only for visual noise levels where the corresponding point (𝒟_max, 𝒞_max) was well within the region of the trade-off diagram where interpolation was possible. For points (𝒟_max, 𝒞_max) outside this region, the equivalent trade-off noise is not well constrained by the trade-off curves that we computed.
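The interpolation step behind the equivalent trade-off noise can be sketched as follows; we use scipy's piecewise-linear interpolation over the scattered trade-off points as a stand-in for the bilinear interpolation described in the text, and all names are illustrative.

```python
import numpy as np
from scipy.interpolate import griddata

def equivalent_tradeoff_noise(D, C, sigma, d_max, c_max, sigma_level):
    """Interpolate sigma_n over the (D, C) plane from all sampled trade-off
    points, then subtract the interpolated noise at the no-trade-off point
    (d_max, c_max) from the actual noise level of the curve in question."""
    points = np.column_stack([D, C])             # one row per solid point
    sigma_at_best = griddata(points, sigma, (d_max, c_max), method="linear")
    return sigma_level - sigma_at_best

# Example from the text: for the sigma_n = 0.10 curve, the interpolated
# noise at (D_max, C_max) is 0.092, so the equivalent trade-off noise is 0.008.
```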
2.9 Contrast Adaptation. Changing the illuminant is not the only way to change the properties of the environment. Within the context of the univariate case introduced above, we can also vary both the mean and variance of the surface ensemble. Such variation might occur as a person travels from, say, the city to the suburbs during an afternoon commute. There is good evidence that the visual system adapts to changes in the variance of the reflected light. This is generally called contrast adaptation (Krauskopf, Williams, & Heeley, 1982; Webster & Mollon, 1991; Chubb, Sperling, & Solomon, 1989; Zaidi & Shapiro, 1993; Jenness & Shevell, 1995; Schirillo & Shevell, 1996; Brown & MacLeod, 1997; Bindman & Chubb, 2004; Chander & Chichilnisky, 2001; Solomon, Peirce, Dhruv, & Lennie, 2004). Here we expand our analysis by considering changes to the mean and variance of the surface ensemble and explore the effect of adaptation to such changes on both discrimination and constancy, using the approach developed above. Given a particular illuminant, we used numerical search to find the values of g and n that optimized 𝒟 for two visual environments that differed in terms of their surface ensembles (surface ensemble 1 and surface ensemble 2). The illuminant was held constant across the change in visual environment. This calculation tells us how the adaptation parameters should be chosen under a discrimination criterion. Figure 6 shows the results. We see in the middle panel in the top row that the visual response function under surface ensemble 2 has shifted to the right and become steeper. The effect of this adaptation is to distribute the visual responses fairly evenly across the available response range for each ensemble (cf. Fairhall, Lewen, Bialek, & de Ruyter van Steveninck, 2001). The gain and steepness parameters that optimize discrimination for surface ensemble 1 and surface ensemble 2 are different. Suppose now that we evaluate constancy by computing 𝒞 across the change in visual environment and adaptation parameters required to optimize discrimination for the two surface ensembles.⁵ The lower right points on each of the trade-off curves in Figure 7 plot this value of 𝒞 against the corresponding value of 𝒟 for five different noise levels. The resultant 𝒞 value is very low (see Figure 7). This low value occurs because the change in adaptation parameters remaps the relation between surface reflectance and visual response (see the dashed lines in the response function panel of Figure 6).
5 Since there are now two separate surface ensembles under consideration, evaluation of 𝒞 requires a decision about what surface ensemble performance should be evaluated over. For this evaluation purpose, we used a surface ensemble that was a 50–50 mixture of the reference environment ensemble (surface ensemble 1) and the test environment ensemble (surface ensemble 2). That is, each surface drawn during the evaluation of 𝒞 was chosen at random from surface ensemble 1 with probability 50% and from surface ensemble 2 with probability 50%. Evaluation of 𝒟 was with respect to surface ensemble 2.
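The mixture evaluation described in this footnote amounts to the following draw procedure (a sketch; ensembles are represented here as arrays of reflectances, and all names are ours).

```python
import numpy as np

rng = np.random.default_rng(2)

def draw_from_mixture(ensemble1, ensemble2, n_draws):
    """Draw surfaces from a 50-50 mixture of two surface ensembles,
    as used when evaluating C across a surface ensemble change."""
    pick_first = rng.random(n_draws) < 0.5
    idx1 = rng.integers(0, len(ensemble1), n_draws)
    idx2 = rng.integers(0, len(ensemble2), n_draws)
    return np.where(pick_first, ensemble1[idx1], ensemble2[idx2])
```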
Figure 6: Adaptation to surface ensemble change for discrimination. Numerical search was used to optimize 𝒟 for two visual environments characterized by a common illuminant but different surface ensembles (surface ensemble 1 and surface ensemble 2). The illuminant intensity was 100. In surface ensemble 1, μ_r = 0.5 and σ_r = 0.3. In surface ensemble 2, μ_r = 0.7 and σ_r = 0.1. The histogram under the graph shows the distributions of reflected light intensities for the two ensembles. The graph shows the resultant visual response function for each case. The solid line corresponds to surface ensemble 1 and the dotted line to surface ensemble 2. The histogram to the left of the graph shows the response distribution for surface ensemble 1 under the surface ensemble 1 response function, while the histogram to the right shows the response distribution for surface ensemble 2 under the surface ensemble 2 response function. All calculations were done for σ_n = 0.05 and e = 100. In evaluating 𝒟 for surface ensemble 1, performance was averaged over draws from surface ensemble 1; in evaluating 𝒟 for surface ensemble 2, performance was averaged over draws from surface ensemble 2. The dashed lines show how the visual response to the light intensity reflected from a fixed surface varies with the change in adaptation parameters. (Since the illuminant is held constant, a fixed surface corresponds to a fixed light intensity.)
Rather than choosing adaptation parameters for the test environment to optimize discrimination performance 𝒟, one can instead choose them to optimize constancy performance 𝒞. This choice leads to quite different adaptation parameters and to different values of 𝒟 and 𝒞. Here the best adaptation parameters for surface ensemble 2 are very similar (but not
Figure 7: Trade-off between discrimination and constancy for change in surface ensemble. (Left) The plot shows the trade-off between 𝒟 and 𝒞 in the same format as the left panel of Figure 5. When the adaptation parameters are chosen to optimize discrimination (𝒟) for the test environment (surface ensemble 2), constancy performance (𝒞) is poor (lower right end of each set of connected dots). When the adaptation parameters are chosen to optimize constancy, discrimination performance is poor (upper left end of each set). The connected sets of dots show how performance on the two tasks trades off for five noise levels σ_n = 0.01, 0.025, 0.05, 0.075, 0.10. Surface ensemble parameters and illuminant intensity are given in the caption for Figure 6. In evaluating 𝒞, the adaptation parameters used for computing responses in the reference environment (surface ensemble 1) were held fixed at g = 0.02045 and n = 4.5. These parameters optimize discrimination performance for the reference environment when σ_n = 0.05. (Right) Equivalent trade-off noise plotted against noise level σ_n.
identical) to their values for surface ensemble 1,⁶ and constancy performance is better (upper left points on the trade-off curves shown in Figure 7). Discrimination performance suffers, however. More generally, the visual system can choose adaptation parameters that trade off 𝒟 against 𝒞. This trade-off is shown in Figure 7 in the same format
6 One might initially intuit that the best adaptation parameters for constancy would be identical to the reference parameters in this case, since the illuminant does not change. The reason that a small change in parameters helps is that the cost of variation in visual response to a fixed surface caused by the shift in parameters is offset by an improved use of the available response range. The fact that constancy can sometimes be improved by changing responses to fixed surfaces is an insight that we obtain by assessing constancy with a signal detection theoretic measure.
as Figure 5.⁷ If there were no trade-off, each set of connected dots would pass through the corresponding open circle. We quantify the deviation as before, using the equivalent trade-off noise. The right panel of Figure 7 plots equivalent trade-off noise against visual noise. The average value is 0.029 ± 0.012 SD, larger than the average value of 0.005 obtained for the illuminant change example. The comparison shows that adapting to optimize discrimination in the face of changes in the distribution of surfaces in the environment is not always compatible with adapting to maximize constancy across the same change in surface ensemble. The intuition underlying this result is relatively straightforward: changing the surface ensemble affects the optimal use of response range and hence leads to a change in adaptation parameters for optimizing discrimination, but it does not affect the mapping between surface reflectance and receptor response. Thus, any change in adaptation parameters perturbs the visual response to any fixed surface.

3 Chromatic Adaptation

The univariate example presented above illustrates the key idea of our approach. In this section, we generalize the calculations to a more realistic trichromatic model visual system and a more general parametric model of adaptation. In addition, in comparing adaptation to changes in illumination and changes in surface ensemble, we develop a principled method of equating the magnitude of the changes. The overall approach, however, is the same as for the univariate example. We begin with a standard description (Wandell, 1987; Brainard, 1995) of the color stimulus and its initial encoding by the visual system. Each illuminant is specified by its spectral power distribution, which we represent by a column vector e. The entries of e provide the power of the illuminant in a set of N_λ discretely sampled wavelength bands. Each surface is specified by its spectral reflectance function, which we represent by a column vector s. The entries of s provide the fraction of incident light power reflected in each wavelength band. The light reflected to the eye has a spectrum described by the column vector

c = diag(e) s,    (3.1)
7 The trade-off curves obtained for Figure 7 (and other similar plots below) are not convex. If the visual system adopts a strategy of switching, on a trial-by-trial basis, adaptation parameters for the test environment between those corresponding to any two of the obtained points, then it can achieve performance anywhere on the line connecting those two points. More generally, a visual system that adopts the switching strategy can achieve a trade-off curve that is the convex hull of points shown. This is analogous to the standard result that ROC curves are convex if the system is allowed to switch decision criterion on a trial-by-trial basis (see Green & Swets, 1966).
where diag() is a function that returns a square diagonal matrix with the elements of its argument along the diagonal. The initial encoding of the reflected light is the quantal absorption rate of the L-, M-, and S-cones. We represent the spectral sensitivity of the three cone types by a 3 × N_λ matrix T. Each row of T provides the sensitivity of the corresponding cone type (L, M, or S) to the incident light. The quantal absorption rates may then be computed as

q = T c = T diag(e) s,    (3.2)
where q is a three-dimensional column vector whose three entries represent the L-, M-, and S-cone quantal absorption rates. We used the Stockman and Sharpe (2000) estimates of the human cone spectral sensitivities (2 degree), and we tabulate these in the supplemental material (http://color.psych.upenn.edu/supplements/adaptdiscrimappear/). As with the univariate example, we model visual processing as a transformation between cone quantal absorptions (q) and visual responses. Here we model the deterministic component of this transformation as

u = f(M D q − q₀),    (3.3)
where u is a three-dimensional column vector representing trivariate visual responses, D is a 3 × 3 diagonal matrix whose entries specify multiplicative gain control applied to the cone quantal absorption rates, M is a fixed 3 × 3 matrix that describes a postreceptoral recombination of cone signals, q₀ is a three-dimensional column vector that describes subtractive adaptation, and the vector-valued function f() applies the function f_i() to the ith entry of its vector argument. Because incorporation of subtractive adaptation allows the argument to the nonlinearity to be negative, we used a modified form of the nonlinearity used in the univariate example:

f_i(x) = (x + 1)^{n_i} / [(x + 1)^{n_i} + 1],          x > 0
f_i(x) = 0.5,                                          x = 0    (3.4)
f_i(x) = 1 − (1 − x)^{n_i} / [(1 − x)^{n_i} + 1],      x < 0
This nonlinearity maps input x from the real line [−∞, ∞] into the range [0, 1]. We allow the exponent n_i to vary across entries. The matrix M was chosen to model, in broad outline, the postreceptoral processing
of color information (Wandell, 1995; Kaiser & Boynton, 1996; Eskew et al., 1999; Brainard, 2001):
    M = [  0.33   0.33   0.33
           0.50  −0.50   0
          −0.25  −0.25   0.50 ].    (3.5)
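Equations 3.3 to 3.5 compose into the following response pipeline (a sketch; the gain, subtractive, and exponent values shown are placeholders rather than the fitted parameters used in the text).

```python
import numpy as np

M = np.array([[ 0.33,  0.33, 0.33],
              [ 0.50, -0.50, 0.00],
              [-0.25, -0.25, 0.50]])          # equation 3.5

def f(x, n_exp):
    """Entrywise piecewise nonlinearity of equation 3.4."""
    out = np.full_like(x, 0.5)                # handles the x == 0 branch
    pos, neg = x > 0, x < 0
    xp = (x[pos] + 1.0) ** n_exp[pos]
    xn = (1.0 - x[neg]) ** n_exp[neg]
    out[pos] = xp / (xp + 1.0)
    out[neg] = 1.0 - xn / (xn + 1.0)
    return out

def visual_response(q, gains, q0, n_exp):
    """Deterministic trivariate response of equation 3.3."""
    return f(M @ (gains * q) - q0, n_exp)     # diag(gains) @ q == gains * q

q = np.array([0.4, 0.3, 0.2])                 # L, M, S quantal absorption rates
u = visual_response(q,
                    gains=np.array([1.0, 1.0, 1.0]),
                    q0=np.array([0.1, 0.1, 0.1]),
                    n_exp=np.array([2.0, 2.0, 2.0]))
```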
This choice of M improves discrimination performance by approximately decorrelating the three entries of the visual response vector prior to application of the nonlinearity and the injection of noise (Buchsbaum & Gottschalk, 1983; Wandell, 1995). As with the univariate example, we assume that each entry of the deterministic component of the visual response vector is perturbed by independent zero-mean additive visual noise that is normally distributed with variance σ_n². We characterized the reference environment surface ensemble using the approach developed by Brainard and Freeman (Brainard & Freeman, 1997; Zhang & Brainard, 2004). We assumed that the spectral reflectance of each surface could be written as a linear combination of N_s basis functions via

s = B_s w_s.    (3.6)
Here B_s is an N_λ × N_s matrix whose columns provide the basis functions, and w_s is an N_s-dimensional vector whose entries provide the weights that describe any particular surface as a linear combination of the columns of B_s. We then assume that surfaces are drawn from an ensemble where w_s is drawn from a truncated multivariate normal distribution with mean vector w̄ and covariance matrix K_w. The truncation is chosen so that the reflectance in each wavelength band lies within the range [0, 1].

Figure 8: Chromatic example, illuminant change results. Trade-off between discrimination and constancy for illuminant change. Each pair of horizontally aligned panels is in the same format as Figure 5. (Left panels) 𝒞 versus 𝒟 trade-off curves with respect to illuminant changes. The reference environment illuminant was D65; the test environment illuminants were (from top to bottom) the blue, yellow, and red illuminants. The reference and test environment surface ensembles were the baseline surface ensemble in each case. The individual sets of connected points show performance for noise levels σ_n = 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.50. In evaluating 𝒞, the adaptation parameters for the reference environment were those that optimized discrimination performance in the reference environment. These parameters were optimized separately for each noise level. (Right panels) Equivalent trade-off noise plotted against noise level σ_n.

We obtained B_s by computing the first eight principal components
of the reflectance spectra measured by Vrhel, Gershon, and Iwan (1994). We obtained w̄ and K_w by taking the mean and covariance of the set of w_s required to best approximate each of the measured spectra with respect to B_s. Computations were run using an ensemble consisting of 400 draws from this distribution.⁸ The 400 reflectances in the reference environment surface ensemble are tabulated in the online supplement (http://color.psych.upenn.edu/supplements/adaptdiscrimappear/), and we refer to this ensemble below as the baseline surface ensemble. Given the visual system model and surface ensemble defined above, we can proceed as with the univariate case and ask how the adaptation parameters affect 𝒟 and 𝒞, the discrimination and constancy performance measures, respectively. The only modification required is that the decision rule now operates on the difference variable Δu²_{i,jk} = |u_{i,j} − u_{i,k}|², and this variable is distributed as a noncentral chi-squared distribution with 3 degrees of freedom rather than 1. The adaptation parameters are the three diagonal entries of D, the three entries of q₀, and the three exponents n_i. Figure 8 shows how 𝒟 and 𝒞 trade off when the illuminant is changed from CIE illuminant D65 to three separate changed illuminants. Each of the changed illuminants was constructed as a linear combination of the CIE daylight basis functions. We refer to the three changed illuminants as the blue, yellow, and red illuminants, respectively. Their CIE u′v′ chromaticities are provided in Table 1, and their spectra are tabulated in the supplemental material (http://color.psych.upenn.edu/supplements/adaptdiscrimappear/). The relative illuminant spectra are essentially the same as the neutral (here D65), Blue 60 (here blue), Yellow 60 (here yellow), and Red 60 (here red) illuminants used by Delahunt and Brainard (2004) in a psychophysical study of color constancy.

Figure 9: Chromatic example, surface ensemble change results. Trade-off between discrimination and constancy for surface ensemble changes. Same format as Figure 8. The reference and test environment illuminants were D65, the reference environment surface ensemble was the baseline ensemble, and the test environment surface ensembles were (from top to bottom) the blue, yellow, and red ensembles. The individual sets of connected points in the left panels show performance for noise levels σ_n = 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.50. In evaluating 𝒞, the adaptation parameters for the reference environment were those that optimized discrimination performance in the reference environment. These parameters were optimized separately for each noise level.

The changes between D65 and the blue and yellow illuminants are typical of variation in natural daylight
8 In drawing the 400 surfaces for the ensemble used in the calculations, we also imposed a requirement that the drawn surfaces be compatible with our procedure for constructing the changed surface ensembles. This procedure is described in more detail where it is introduced below.
Table 1: Illuminant Chromaticities.

Illuminant    CIE u′    CIE v′
D65           0.198     0.468
Blue          0.185     0.419
Yellow        0.226     0.508
Red           0.242     0.450

Notes: CIE u′v′ chromaticity coordinates of the four illuminants used in the trichromatic calculations. Chromaticity coordinates were computed over the wavelength range 390 nm to 730 nm, which is the range for which we had surface reflectance data from the Vrhel et al. (1994) data set.
(see Delahunt & Brainard, 2004). The change between D65 and the red illuminant has a similar colorimetric magnitude but is atypical of variation in daylight. For the calculations here, the units of overall illuminant intensity are arbitrary; the four illuminant spectra were scaled to have the same CIE 1931 photopic luminance. Figure 8 shows that discrimination and constancy are highly compatible here. We also investigated discrimination-constancy trade-offs for the color case when the surface ensemble is changed. An issue that arises is how to produce a surface ensemble change whose magnitude is commensurate with that of the illuminant changes. We did not treat this magnitude issue in the univariate example above. Here we created three changed surface ensembles (the blue, yellow, and red ensembles) so that the cone responses to each changed ensemble under the reference illuminant were exactly the same as those of the baseline surface ensemble under the corresponding changed illuminant. For example, the 400 triplets of LMS cone responses from the blue ensemble under illuminant D65 were exactly the same as the 400 triplets of LMS cone responses from the baseline ensemble under the blue illuminant. Construction of changed surface ensembles such that they have this property is straightforward using the type of linear-model-based colorimetric calculations developed by Brainard (1995). We constrained the reflectance functions in the changed ensemble to be a linear combination of the first three columns of the matrix B_s that was used to construct the baseline ensemble. We also required that all of the surfaces in all four ensembles have reflectance functions with values between 0 and 1. This required rejecting some draws from the truncated normal distribution used to define the baseline ensemble at the time the ensembles were constructed. The supplemental material (http://color.psych.upenn.edu/supplements/adaptdiscrimappear/) tabulates the changed surface ensembles, as well as the baseline surface ensemble. Figure 9 shows the results from the surface ensemble change calculations, in the same format as Figure 8. Direct examination of both the trade-off curves and the summary provided by the equivalent trade-off
Figure 10: Summary of equivalent trade-off noise for the chromatic example. The solid black bars show the mean equivalent trade-off noise (± one standard deviation) for the three illuminant changes reported in Figure 8. The solid gray bars show the corresponding values for the three surface ensemble changes reported in Figure 9. In each case, the mean and standard deviation were taken over visual noise levels (that is, over the values shown in each of the right-hand panels in Figures 8 and 9).
noise measure indicates that constancy and discrimination are considerably less compatible in the surface ensemble change case than in the illuminant change case. This difference is summarized in Figure 10, where the mean equivalent trade-off noise is shown for each of the changes reported in Figures 8 and 9.

4 Summary and Discussion

The theory and calculations presented here lead to several broad conclusions. First, we note that constancy cannot be evaluated meaningfully without considering discrimination. By using a signal detection theoretic measure (A′) to quantify constancy, we explicitly incorporate discrimination into our treatment of constancy. When the environmental change is a change in illuminant, then the dual goals of discrimination and constancy are reasonably compatible for the cases we studied: a common change in adaptation parameters comes close to optimizing performance for both our discrimination and constancy
performance measures. To be more precise about the meaning of "reasonably compatible" and "comes close," we turn to the quantification in terms of the mean equivalent trade-off noise, which was less than 2% for each of the changes we studied in the chromatic example (see Figure 10). For applications where an increase of 2% in visual noise relative to the baseline visual noise (10–50% across the trade-off curves we computed) is deemed to be large, one could revise the verbal descriptions accordingly. When the environmental change is a change in the surface ensemble, discrimination and constancy were less compatible. As measured by the mean equivalent trade-off noise, the incompatibility between constancy and discrimination is approximately two to four times larger for surface ensemble changes than for the corresponding illuminant changes, under conditions where the physical effect of the corresponding illuminant and surface ensemble changes on the LMS cone responses between reference and test environments was equated. Thus, the analysis suggests that stimulus conditions where the surface ensemble changes may provide psychophysical data that are most diagnostic of whether the early visual system optimizes for discrimination or for constancy, or whether it has evolved separate sites that mediate performance on the two tasks. We have started to develop an experimental framework for approaching this question (see Hillis & Brainard, 2005; see also Robilotto & Zaidi, 2004). As noted in the introduction, our approach is similar to that of Grzywacz and colleagues (Grzywacz & Balboa, 2002; Grzywacz & de Juan, 2003; see also Foster et al., 2004). Previous authors have considered the adaptation of the visual response function required to optimize discrimination performance (Laughlin, 1989; Buchsbaum & Gottschalk, 1983; Brenner et al., 2000; von der Twer & MacLeod, 2001; Fairhall et al., 2001), as well as the nature of adaptive transformations that can mediate constancy (von Kries, 1970; Buchsbaum, 1980; West & Brill, 1982; Brainard & Wandell, 1986; D'Zmura & Lennie, 1986; Maloney & Wandell, 1986; Foster & Nascimento, 1994; Foster et al., 2004; Finlayson, Drew, & Funt, 1994; Finlayson & Funt, 1996). Here the two tasks are analyzed in a unified manner using the theory of signal detection. The work provides both a framework for a fuller theoretical exploration of adaptation across different models of adaptation and environmental changes and an ideal observer benchmark against which future experimental results may be evaluated.

Acknowledgments

Dan Lichtman helped with early versions of the computations. We thank M. Kahana and the Penn Psychology Department for access to computing power. This work was supported by NIH RO1 EY10016.
References

Bindman, D., & Chubb, C. (2004). Mechanisms of contrast induction in heterogeneous displays. Vision Research, 44, 1601–1613.
Brainard, D. H. (1995). Colorimetry. In M. Bass (Ed.), Handbook of optics, Vol. 1: Fundamentals, techniques, and design (pp. 1–54). New York: McGraw-Hill.
Brainard, D. H. (2001). Color vision theory. In N. J. Smelser & P. B. Baltes (Eds.), International encyclopedia of the social and behavioral sciences (Vol. 4, pp. 2256–2263). Oxford: Elsevier.
Brainard, D. H. (2004). Color constancy. In L. Chalupa & J. Werner (Eds.), The visual neurosciences (Vol. 1, pp. 948–961). Cambridge, MA: MIT Press.
Brainard, D. H., & Freeman, W. T. (1997). Bayesian color constancy. Journal of the Optical Society of America A, 14, 1393–1411.
Brainard, D. H., & Wandell, B. A. (1986). Analysis of the retinex theory of color vision. Journal of the Optical Society of America A, 3, 1651–1661.
Brenner, N., Bialek, W., & de Ruyter van Steveninck, R. (2000). Adaptive rescaling maximizes information transmission. Neuron, 26, 695–702.
Brown, R. O., & MacLeod, D. I. A. (1997). Color appearance depends on the variance of surround colors. Current Biology, 7, 844–849.
Buchsbaum, G. (1980). A spatial processor model for object colour perception. Journal of the Franklin Institute, 310, 1–26.
Buchsbaum, G., & Gottschalk, A. (1983). Trichromacy, opponent colours coding and optimum colour information transmission in the retina. Proceedings of the Royal Society of London B, 220, 89–113.
Burnham, R. W., Evans, R. M., & Newhall, S. M. (1957). Prediction of color appearance with different adaptation illuminations. Journal of the Optical Society of America, 47, 35–42.
Chander, D., & Chichilnisky, E. J. (2001). Adaptation to temporal contrast in primate and salamander retina. Journal of Neuroscience, 21, 9904–9916.
Chubb, C., Sperling, G., & Solomon, J. A. (1989). Texture interactions determine perceived contrast. PNAS, 86, 9631–9635.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Delahunt, P. B., & Brainard, D. H. (2004). Does human color constancy incorporate the statistical regularity of natural daylight? Journal of Vision, 4, 57–81.
D'Zmura, M., & Lennie, P. (1986). Mechanisms of color constancy. Journal of the Optical Society of America A, 3, 1662–1672.
Eskew, R. T., McLellan, J. S., & Giulianini, F. (1999). Chromatic detection and discrimination. In K. Gegenfurtner & L. T. Sharpe (Eds.), Color vision: From molecular genetics to perception (pp. 345–368). Cambridge: Cambridge University Press.
Fairhall, A. L., Lewen, G. D., Bialek, W., & de Ruyter van Steveninck, R. (2001). Efficiency and ambiguity in an adaptive neural code. Nature, 412, 787–792.
Finlayson, G. D., Drew, M. S., & Funt, B. V. (1994). Color constancy—Generalized diagonal transforms suffice. Journal of the Optical Society of America A, 11, 3011–3019.
Finlayson, G. D., & Funt, B. V. (1996). Coefficient channels—Derivation and relationship to other theoretical studies. Color Research and Application, 21, 87–96.
Foster, D. H. (2003). Does colour constancy exist? Trends in Cognitive Sciences, 7, 439–443.
Foster, D. H., & Nascimento, S. M. C. (1994). Relational colour constancy from invariant cone-excitation ratios. Proc. R. Soc. Lond. B, 257, 115–121.
Foster, D. H., Nascimento, S. M. C., & Amano, K. (2004). Information limits on neural identification of colored surfaces in natural scenes. Visual Neuroscience, 21, 1–6.
Green, D. M., & Swets, J. A. (1966). Signal detection theory and psychophysics. New York: Wiley.
Grzywacz, N. M., & Balboa, R. M. (2002). A Bayesian framework for sensory adaptation. Neural Computation, 14, 543–559.
Grzywacz, N. M., & de Juan, J. (2003). Sensory adaptation as Kalman filtering: Theory and illustration with contrast adaptation. Network: Computation in Neural Systems, 14, 465–482.
Hillis, J. M., & Brainard, D. H. (2005). Do common mechanisms of adaptation mediate color discrimination and appearance? Uniform backgrounds. Journal of the Optical Society of America A, 22, 2090–2106.
Hood, D. C., & Finkelstein, M. A. (1986). Sensitivity to light. In K. R. Boff, L. Kaufman, & J. P. Thomas (Eds.), Handbook of perception and human performance: Sensory processes and perception (Vol. 1). New York: Wiley.
Jacobs, G. H. (1981). Comparative color vision. New York: Academic Press.
Jenness, J. W., & Shevell, S. K. (1995). Color appearance with sparse chromatic context. Vision Research, 35, 797–805.
Kaiser, P. K., & Boynton, R. M. (1996). Human color vision (2nd ed.). Washington, DC: Optical Society of America.
Krauskopf, J., Williams, D. R., & Heeley, D. W. (1982). Cardinal directions of color space. Vision Research, 22, 1123–1131.
Laughlin, S. (1989). The reliability of single neurons and circuit design: A case study. In R. Durbin, C. Miall, & G. Mitchison (Eds.), The computing neuron (pp. 322–335). Reading, MA: Addison-Wesley.
Lennie, P., & D'Zmura, M. (1988). Mechanisms of color vision. CRC Critical Reviews in Neurobiology, 3, 333–400.
Macmillan, N. A., & Creelman, C. D. (2005). Detection theory: A user's guide (2nd ed.). Mahwah, NJ: Erlbaum.
Maloney, L. T., & Wandell, B. A. (1986). Color constancy: A method for recovering surface spectral reflectances. Journal of the Optical Society of America A, 3, 29–33.
Mollon, J. D. (1982). Color vision. Annual Review of Psychology, 33, 41–85.
Robilotto, R., & Zaidi, Q. (2004). Limits of lightness identification for real objects under natural viewing conditions. Journal of Vision, 4, 779–797.
Schirillo, J. A., & Shevell, S. K. (1996). Brightness contrast from inhomogeneous surrounds. Vision Research, 36, 1783–1796.
Shevell, S. K. (2003). Color appearance. In S. K. Shevell (Ed.), The science of color (2nd ed.). Washington, DC: Optical Society of America.
Solomon, S. G., Peirce, J. W., Dhruv, N. T., & Lennie, P. (2004). Profound contrast adaptation early in the visual pathway. Neuron, 42, 155–162.
The Relation Between Color Discrimination and Color Constancy
2637
Stocker, A. A., & Simoncelli, E. P. (2005). Sensory adaptation within a Bayesian ¨ framework for perception. In Y. Weiss, B. Scholkopf, & J. Platt (Eds.), Advances in neural information processing systems, 18 (pp. 1291–1298). Cambridge, MA: MIT Press. Stockman, A., & Sharpe, L. T. (2000). The spectral sensitivities of the middle- and long-wavelength-sensitive cones derived from measurements in observers of known genotype. Vision Research, 40, 1711–1737. von der Twer, T., & MacLeod, D. I. A. (2001). Optimal nonlinear codes for the perception of natural colors. Network: Computation in Neural Systems, 12, 395–407. von Kries, J. (1970). Chromatic adaptation. In D. L. MacAdam (Ed.), Sources of color vision (pp. 109–119). Cambridge, MA. (Originally published 1902). Vrhel, M. J., Gershon, R., & Iwan, L. S. (1994). Measurement and analysis of object reflectance spectra. Color Research and application, 19, 4–9. Walraven, J., Enroth-Cugell, C., Hood, D. C., MacLeod, D. I. A., & Schnapf, J. L. (1990). The control of visual sensitivity. Receptoral and postreceptoral processes. In L. Spillman & J. S. Werner (Eds.), Visual perception: The neurophysiological foundations (pp. 53–101). San Diego: Academic Press. Wandell, B. A. (1987). The synthesis and analysis of color images. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-9, 2–13. Wandell, B. A. (1995). Foundations of vision. Sunderland, MA: Sinauer. Webster, M. A., & Mollon, J. D. (1991). Changes in colour appearance following post-receptoral adaptation. Nature, 349, 235–238. West, G., & Brill, M. H. (1982). Necessary and sufficient conditions for von Kries chromatic adaptation to give color constancy. J. Theor. Biolog., 15, 249. Wyszecki, G. (1986). Color appearance. In K. R. Boff, L. Kaufman, & J. P. Thomas (Eds.), Handbook of perception and human performance: Sensory processes and perception (Vol. 1, pp. 9.1–9.56). New York: Wiley. Wyszecki, G., & Stiles, W. S. (1982). Color science—Concepts and methods, quantitative data and formulae (2nd ed.). New York: Wiley. Zaidi, Q. (1999). Color and brightness inductions: From Mach bands to threedimensional configurations. In K. R. Gegenfurtner & L. T. Sharpe (Eds.), Color vision: From genes to perception (Vol. 1, pp. 317–344). Cambridge: Cambridge University Press. Zaidi, Q., & Shapiro, A. G. (1993). Adaptive orthogonalization of opponent-color signals. Biological Cybernetics, 69, 415–428. Zhang, X., & Brainard, D. H. (2004). Bayesian color-correction method for non-colorimetric digital image sensors. Paper presented at the 12th IS&T/SID Color Imaging Conference, Scottsdale, AZ.
Received August 29, 2005; accepted February 1, 2007.
LETTER
Communicated by Carlos Brody
Deviation from Weber's Law in the Non-Pacinian I Tactile Channel: A Psychophysical and Simulation Study of Intensity Discrimination

Burak Güçlü
[email protected]
Biomedical Engineering Institute, Boğaziçi University, Istanbul 34342, Turkey
This study involves psychophysical experiments and computer simulations to investigate intensity discrimination in the non-Pacinian I (NP I) tactile channel. The simulations were based on an established population model for rapidly adapting mechanoreceptive fibers (Güçlü & Bolanowski, 2004a). Several intensity codes were tested as decision criteria: number of active neurons, total spike count, maximal spike count, distribution of spike counts among the afferent population, and synchronization of spike times. Simulations that used the number of active fibers as the intensity code gave the most accurate results. However, the Weber fractions obtained from simulations are smaller than psychophysical Weber fractions, which suggests that only a subset of the afferent population is recruited for intensity discrimination during psychophysical experiments. Simulations could also capture the deviation from Weber's law, that is, the decrease of the Weber fraction as a function of the stimulus level, which was present in the psychophysical data. Since the psychophysical task selectively activated the NP I channel, the deviation effect is probably not due to the contribution of another tactile channel but rather is explicitly produced by the NP I channel. Moreover, because simulations with all tested intensity codes resulted in the same effect, the activity of the afferent population is sufficient to explain the deviation, without the need for a higher-order network. Depending on the intensity code used, the mechanical spread of the stimulus, rate-intensity functions of the tactile fibers, and the decreasing spike-phase jitter contribute to the deviation from Weber's law.

1 Introduction

Intensity discrimination is the detection of small changes in the stimulus intensity and is of paramount importance for encoding the physical stimulus by a sensory system. The ecological significance of this cannot be overstated, since survival often depends on judging the magnitude variation of a sensory input and taking action according to the increase or decrease
Neural Computation 19, 2638–2664 (2007)
C 2007 Massachusetts Institute of Technology
of the input level. The sense of touch is acute at our fingertips, where many tactile fibers are concentrated. However, most of the studies on intensity discrimination in the human tactile system have focused on the thenar eminence of the hand (e.g., Gescheider, Bolanowski, Verrillo, Arpajian, & Ryan, 1990; Gescheider, Bolanowski, Zwislocki, Hall, & Mascia, 1994; Gescheider, Thorpe, Goodarz, & Bolanowski, 1997), because it can be mechanically stimulated more easily by contactor probes of various sizes. One goal of the study reported here is to compare its experimental results, which were obtained by stimulating the fingertip, with previous work that stimulated the thenar eminence. Psychophysical detection thresholds have been measured at the fingertip (Güçlü & Bolanowski, 2005a) and were found to be similar to the results of previous studies that applied 40 Hz vibratory stimuli to the thenar eminence. Therefore, it would be expected that the results of intensity-discrimination experiments on the fingertip should turn out to be consistent with the literature as well.

The primary importance of this letter is that it attempts to predict a psychophysical phenomenon, the deviation from Weber's law, by using a physiological population model (Güçlü & Bolanowski, 2002, 2004a, 2005b). This model mimics the properties of rapidly adapting (or fast adapting type I) nerve fibers, and has recently been improved by incorporating time-dependent activity (Güçlü & Bolanowski, 2004b). Rapidly adapting fibers are associated with Meissner's corpuscles in the skin and constitute one of four classes of mechanoreceptive fibers (Greenspan & Bolanowski, 1996). It is assumed that the input from a population of rapidly adapting fibers is processed in the non-Pacinian I (NP I) psychophysical channel (Bolanowski, Gescheider, Verrillo, & Checkosky, 1988; Gescheider, Bolanowski, & Verrillo, 2004). In order to compare the performance of the population model with the psychophysical response of the NP I channel, the NP I channel was experimentally isolated by forward masking (Güçlü & Bolanowski, 2005a, 2005b). This approach avoids interactions between channels. Although a narrower range of intensities can be tested with this method (cf. Gescheider et al., 1990), the psychophysical properties of the NP I channel can be obtained selectively.

Weber's law states that a just-noticeable change in intensity (or amplitude) is approximately proportional to stimulus intensity (or amplitude). This is an almost universal law for all sensory systems, but deviations are possible in certain conditions. For example, a near miss to Weber's law was observed when sinusoidal stimuli were used in hearing (Moore & Raab, 1974) and in touch (Knudsen, 1928; Gescheider et al., 1990). That is, some improved discrimination was found at high intensities, although Weber's law asserts a worsening of discrimination. The psychophysical basis for this deviation is unclear, but it may be due to recruitment of other sensory-input channels as the intensity is increased (Gescheider et al., 1997). This does not exclude purely cortical properties (e.g., see Dykes, 1983), but
in the simplest sense, more recruited channels may mean more input to facilitate discrimination. Therefore, it may be hypothesized that by isolating the NP I tactile channel, the deviation from Weber's law should disappear. On the other hand, if deviation is still found during the activation of a single psychophysical channel, the deviation may be attributed to either physiological properties of afferents or a certain intensity code represented in the activity of higher-order neurons. For this letter, several hypothetical codes were tested. These codes specify either the number of activated tactile neurons or the spike activity of the neurons. Note that either approach may be a modified view of the same representation, since many active neurons would increase the activity of a higher-order unit by synaptic convergence. In addition, the distribution of activity among the population of neurons and the synchronization of spikes were considered in the modeling analyses presented.
2 Methods

2.1 Population Model. The population model is similar to the one used by Güçlü and Bolanowski (2002, 2004a, 2005b); it simulated the rapidly adapting afferent output that mediates the NP I channel.

2.1.1 Anatomical Organization. Since the contactor location does not have an effect on the detection thresholds measured across the finger (Güçlü & Bolanowski, 2005b), a uniformly random distribution was used to model the receptive field locations of the afferent fibers. The innervation density of the rapidly adapting fibers was set to 0.75 fibers per mm² (Johansson & Vallbo, 1979). Therefore, the rectangular terminal-phalanx skin model (3 cm × 2 cm) contained 450 rapidly adapting fibers. The x- and y-coordinates of the receptive field centers were randomly determined by using the following independent probability density functions:
$$
f_X(x) = \begin{cases} 1/30, & 0 \le x \le 30\ \text{mm} \\ 0, & \text{else} \end{cases}
\qquad
f_Y(y) = \begin{cases} 1/20, & 0 \le y \le 20\ \text{mm} \\ 0, & \text{else.} \end{cases}
\tag{2.1}
$$
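For concreteness, this sampling step can be written in a few lines of Python (an illustrative sketch; the original simulations were in Matlab, and the variable names here are mine):

```python
import numpy as np

rng = np.random.default_rng(0)

# Rectangular terminal-phalanx skin model: 3 cm x 2 cm, 0.75 fibers/mm^2
SKIN_X_MM, SKIN_Y_MM = 30.0, 20.0
DENSITY_PER_MM2 = 0.75
N_FIBERS = round(DENSITY_PER_MM2 * SKIN_X_MM * SKIN_Y_MM)  # 450 fibers

# Independent uniform densities f_X and f_Y of equation 2.1
x = rng.uniform(0.0, SKIN_X_MM, size=N_FIBERS)
y = rng.uniform(0.0, SKIN_Y_MM, size=N_FIBERS)
rf_centers = np.column_stack([x, y])  # receptive field center of each fiber
```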
2.1.2 Time-Independent Physiological Properties. The rate-intensity functions of the rapidly adapting fibers were determined by, among others, Güçlü and Bolanowski (2003, 2005b). Since the statistical firing properties of human tactile fibers had not been adequately described at the time, the model was based on monkey data (Johnson, 1974). The rate-intensity functions can
be approximated by

$$
r = \begin{cases}
0, & 0 < A < a_0 \\[2pt]
f_s \dfrac{A - a_0}{a_1 - a_0}, & a_0 < A < a_1 \\[4pt]
f_s, & a_1 < A < a_2 \\[2pt]
f_s \left(1 + \dfrac{A - a_2}{a_3 - a_2}\right), & a_2 < A < a_3 \\[4pt]
2 f_s, & a_3 < A
\end{cases}
\tag{2.2}
$$
for a_0 < a_1 < a_2 < a_3, where r is the average firing rate, A is the effective stimulus amplitude at the fiber's receptive field center, f_s is the stimulus frequency, and (a_0, a_1, a_2, a_3) are the parameters, which are distributed randomly among fibers. These parameters are not independent, but Johnson (1974) showed independence between the random variables a_0/a_1 and a_1 and between a_2/a_3 and a_3. Therefore, the following probability density functions were used to construct the rate-intensity function of each rapidly adapting fiber:
$$
\begin{aligned}
p(a_0/a_1) &= 3.093 \exp\!\left[-30.046\left(\frac{a_0}{a_1} - 0.32\right)^{2}\right] \\
p(a_1) &= \frac{0.456}{a_1} \exp\!\left[-3.458\left(\log\frac{a_1}{35.94}\right)^{2}\right] \\
p(a_2/a_3) &= 3.093 \exp\!\left[-30.046\left(\frac{a_2}{a_3} - 0.62\right)^{2}\right] \\
p(a_3) &= \frac{0.625}{a_3} \exp\!\left[-6.516\left(\log\frac{a_3}{278.26}\right)^{2}\right].
\end{aligned}
\tag{2.3}
$$
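A direct Python transcription of the piecewise rate-intensity function in equation 2.2 might look as follows (a sketch, not the author's Matlab code; the threshold parameters a_0 through a_3 would be drawn fiber by fiber from the densities in equation 2.3, and the example values at the bottom are hypothetical):

```python
def rate_intensity(A, a0, a1, a2, a3, fs=40.0):
    """Average firing rate r (spikes/s) of one fiber, per equation 2.2.

    A is the effective stimulus amplitude at the receptive field center,
    fs the stimulus frequency, and a0 < a1 < a2 < a3 the fiber's
    intensity-characteristic parameters (sampled from equation 2.3).
    """
    if A <= a0:
        return 0.0                                 # subthreshold: silent
    if A <= a1:
        return fs * (A - a0) / (a1 - a0)           # rising toward entrainment
    if A <= a2:
        return fs                                  # entrained: one spike per cycle
    if A <= a3:
        return fs * (1.0 + (A - a2) / (a3 - a2))   # toward two spikes per cycle
    return 2.0 * fs                                # saturated: two spikes per cycle

# e.g., a fiber with hypothetical parameters a0=5, a1=15, a2=40, a3=80 um:
r = rate_intensity(A=20.0, a0=5.0, a1=15.0, a2=40.0, a3=80.0)  # -> 40.0 spikes/s
```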
Note that in equation 2.3, the probability densities of parameter ratios are normal, but the probability densities of the given parameters are log normal.

2.1.3 Spike Generation. The average firing rate of each nerve fiber specified the activity regime that affects the transition probabilities in a Markov chain used for spike generation (Güçlü & Bolanowski, 2004b). The rapidly adapting fiber's output can occupy one of three states during each cycle of the stimulus waveform: no spike at a given stimulus cycle (ε_0), one spike at the positive phase of the stimulus cycle (ε_1), and two spikes, one in the positive phase and the other in the negative phase of the stimulus cycle (ε_2). It was assumed that the fiber changed its state randomly at each cycle
c = 1, 2, 3, . . . . Starting from an initial state, the simulated process proceeded in a chain such as ξ(1) → ξ(2) → ξ(3) → · · ·, where ξ(c) is the state of the rapidly adapting fiber at cycle c. It can be shown (Güçlü & Bolanowski, 2004b) that

$$
\mathrm{Prob}\{\xi(c+1) = \varepsilon_j\} = \sum_{i} p_{ji}\,\mathrm{Prob}\{\xi(c) = \varepsilon_i\}, \qquad i, j = 0, 1, 2,
\tag{2.4}
$$
where p_ji is the probability that the fiber will go into state ε_j at the next cycle given that it is in state ε_i at some stimulus cycle c. This probability is independent of the fiber's trajectory before the cycle c. The p_ji's were obtained as a function of the average firing rate (r) from Güçlü and Bolanowski (2004b).

The timing of the first spike phase relative to the stimulus cycle was modeled using the Laplace distribution. The probability density function of this distribution is

$$
f(x) = \frac{1}{\sigma\sqrt{2}} \exp\!\left(-\frac{\sqrt{2}\,|x - \mu|}{\sigma}\right).
\tag{2.5}
$$

It was reported that the phase advances to smaller values, and the distribution becomes narrower as the stimulus amplitude is increased in monkey fibers (Talbot, Darian-Smith, Kornhuber, & Mountcastle, 1968). Therefore, the mean (µ) and the standard deviation (σ) varied as a function of the effective stimulus amplitude at the receptive field center in the simulations:

$$
\mu = m_a e^{-m_b A} + m_c, \qquad \sigma = s_a e^{-s_b A} + s_c.
\tag{2.6}
$$
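The spike-generation step can be sketched in Python as below; the transition matrix P and the phase parameters (m_a, m_b, m_c, s_a, s_b, s_c) are placeholders of my own, since the fitted values come from Güçlü and Bolanowski (2004b):

```python
import numpy as np

def simulate_states(P, n_cycles, rng):
    """One realization of the tristate Markov chain of equation 2.4.

    P[j, i] = p_ji, the probability of entering state eps_j on the next
    cycle given state eps_i now (0, 1, or 2 spikes per stimulus cycle).
    Each column of P must sum to one.
    """
    s, states = 0, []                      # start silent (eps_0)
    for _ in range(n_cycles):
        s = rng.choice(3, p=P[:, s])
        states.append(s)
    return np.array(states)

def first_spike_phase(A, ma, mb, mc, sa, sb, sc, rng):
    """Laplace-distributed spike phase (equations 2.5 and 2.6).

    numpy parameterizes the Laplace density by a scale b with std = b*sqrt(2),
    so b = sigma / sqrt(2) reproduces the density in equation 2.5.
    """
    mu = ma * np.exp(-mb * A) + mc         # mean phase shrinks with amplitude
    sigma = sa * np.exp(-sb * A) + sc      # jitter shrinks with amplitude
    return rng.laplace(loc=mu, scale=sigma / np.sqrt(2.0))
```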
The parameters (m_a, m_b, m_c, s_a, s_b, s_c) were obtained from Güçlü and Bolanowski (2004b). When there were two spikes per stimulus cycle, the earlier spike was modeled by equation 2.5, but the phase difference between the first and second spike was modeled by a normal distribution with the parameters given in Güçlü and Bolanowski (2004b). In order to simulate the propagation of action potentials from the periphery to the central nervous system, a random gaussian variation was added to the simulated spike times based on the estimate of conduction speed reported by Darian-Smith and Kenins (1980; average distance: 1 m; mean conduction speed: 50 m/s; standard deviation of conduction speed: 10 m/s).

2.1.4 Mechanical Stimulus. The stimulus used in the simulations was a 40 Hz sinusoidal wave, which had a duration of 0.5 s and varying intensity. The skin model had the anatomical properties given in section 2.1.1 and was stimulated by a circular contactor probe (radius: 2 mm; area: 0.126 cm²). The stimulation site was the center of the skin model. The effects of the stimulus
spread radially but attenuated as a function of distance from the contactor according to (Johnson, 1974)

$$
A(x, y) = \begin{cases}
A_s, & (x - x_s)^2 + (y - y_s)^2 \le r_s^2 \\[4pt]
A_s \left(\dfrac{r_s}{\sqrt{(x - x_s)^2 + (y - y_s)^2}}\right)^{\!1.9}, & \text{else},
\end{cases}
\tag{2.7}
$$
where (x_s, y_s) are the coordinates of the stimulus contactor, r_s the radius of the contactor, A_s the stimulus amplitude, (x, y) the coordinates of the receptive field center, and A the effective stimulus amplitude at the receptive field center of the rapidly adapting fiber.

It is important to note that the responses of the nerve fibers are correlated in the model because fibers close to each other are more likely to have similar firing rates due to the mechanical attenuation (see equation 2.7). On the other hand, the rate-intensity functions that actually determine the average firing rate of each fiber were generated from independent probability distributions, and the spike generation for each fiber was performed independently as described above. This is because the tactile afferents do not interact with each other in the periphery. The model presented above is biologically accurate and mimics the population response of rapidly adapting fibers, but it is not strictly independent because of mechanical coupling. The entire model has numerous levels where stochastic variability is incorporated. Additionally, the model is noisy due to the spike generation mechanism employed. Physiological and anatomical parameters are altered in successive simulation experiments to mimic experimental variability (e.g., testing a different subject).

2.2 Simulations and Decision Criteria. The intensity discrimination experiment was simulated in Matlab Version 6.5 by using the two-interval forced-choice paradigm, which has two temporal observation intervals. Each observation interval included a stimulus, but the amplitude of the stimulus in one interval was set greater than the amplitude of the stimulus (the comparison stimulus) in the other interval. The interval with the stronger stimulus (the test stimulus) was determined randomly. The intensity of the comparison stimulus was kept constant during a simulated experiment. The simulated observer's task was to select the interval that presumably contained the test stimulus. Therefore, the decision criterion was to select the interval that had the higher neural variable (n_1 or n_2) according to the assumed intensity code (see below). The decision criterion was tested by the simulated observer for each simulated trial. If n_1 > n_2, the stimulus in the first interval was judged to be of higher amplitude by the simulated observer. If n_1 < n_2, the second interval was selected by the simulated
observer. Note that the simulated observer could make an error by selecting the interval with the higher neural output, although that interval contained the weaker stimulus. This is because of the random factors incorporated in the population model. The frequency of correct decisions in 100 trials was found, and this value was considered to be the probability of discriminating among stimuli with different amplitudes. Half of the trials that had equal neural variables (n_1 = n_2) were assumed to yield correct decisions. Probabilities of discrimination were found between a stimulus with a given amplitude (the comparison stimulus) and stimuli with slightly higher amplitudes. The amplitude change detected with 0.75 probability was set to be the discrimination threshold of the simulated observer, that is, the just-noticeable difference. When presenting the results, the just-noticeable difference in amplitude (δA) was converted to the just-noticeable difference in intensity (difference limen, DL) by using

$$
\mathrm{DL} = 20 \log_{10} \frac{A_s + \delta A}{A_s},
\tag{2.8}
$$
where A_s is the amplitude of the comparison stimulus and A_s + δA is the amplitude of the stronger stimulus. DL was given in dB units. The Weber fraction was calculated as the ratio of δA to A_s (δA/A_s); therefore, it shows the discrimination capacity relative to the stimulus amplitude. It is customary to plot the Weber fraction as a function of the sensation level (SL), which is defined as

$$
\mathrm{SL} = 20 \log_{10} \frac{A_s}{A_{th}},
\tag{2.9}
$$
where A_th is the detection threshold in amplitude units. Thus, SL was given in dB units referenced to the detection threshold, the weakest stimulus detected by the observer. Note that Weber's law implies that the Weber fraction is approximately constant as a function of the intensity of the comparison stimulus. Here, deviation refers to a decrease in the Weber fraction as a function of comparison stimulus intensity.

The simulations were repeated for five hypothetical observers who were generated by using the population model described. The criterion for detecting the stimulus was at least 10 active fibers, which has been shown to give accurate simulation results when compared to psychophysical data (Güçlü & Bolanowski, 2005b). That is, at least 10 active rapidly adapting fibers should be present in the population for detecting the 40 Hz sinusoidal stimulus. Detection thresholds of the simulated observers were found, and SLs could be calculated by using this criterion.
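Putting these definitions together, one simulated psychometric point and the dB conversions of equations 2.8 and 2.9 could be coded as below. This is a sketch under the assumption that some function `neural_variable(A)` returns the intensity code for one stimulus presentation; it stands in for the full population model:

```python
import numpy as np

def percent_correct(neural_variable, A_comp, A_test, n_trials=100):
    """Fraction of correct 2IFC decisions; ties count as half correct."""
    correct = 0.0
    for _ in range(n_trials):
        n_comp, n_test = neural_variable(A_comp), neural_variable(A_test)
        if n_test > n_comp:
            correct += 1.0
        elif n_test == n_comp:
            correct += 0.5
    return correct / n_trials

def difference_limen_db(A_s, dA):
    """Equation 2.8: just-noticeable amplitude change in dB."""
    return 20.0 * np.log10((A_s + dA) / A_s)

def sensation_level_db(A_s, A_th):
    """Equation 2.9: stimulus level in dB re the detection threshold A_th."""
    return 20.0 * np.log10(A_s / A_th)

# The Weber fraction itself is simply dA / A_s at the 0.75-correct point.
```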
2.3 Intensity Codes. The following hypothetical intensity codes were tested in the simulations. In the descriptions below, {o_k} refers to the output of the population, that is, a set of spike times for each kth fiber. i specifies the interval and is either 1 or 2. n is the neural variable that contains the intensity information.

2.3.1 Number of Active Fibers. In this code, the intensity information is represented by the number of active fibers in the population. Therefore, the temporal interval that produces the highest number of active fibers is selected by the simulated observer. This code implies that there is a vast convergence of the afferent output within the central nervous system. By synaptic convergence, the information encoded in the number of fibers may be converted to an increase of activity in a higher-order neuron. The code can be mathematically expressed as

$$
n_i = \sum_k I_{[1,\infty)}\left(|\{o_k\}|\right).
\tag{2.10}
$$
In equation 2.10, I_[...](·) is the indicator function, which has a value of 1 if its argument belongs to the set indicated by the subscript; otherwise, the function is 0. Here, the cardinality (the number of elements, shown by bar notation) of the spike time set is evaluated. Note that the cardinality is 1 or greater if there is at least one spike in the set.

2.3.2 Total Activity. This code summates all the spike counts in the population. It implies that the intensity information is represented by the entire spike activity in the population. Therefore, the temporal interval that produces the highest number of total spikes is selected by the simulated observer. This code suggests that the activity of a higher-order decision network is modulated by all afferent spikes. The mathematical expression for the code is

$$
n_i = \sum_k |\{o_k\}|.
\tag{2.11}
$$
In equation 2.11, the cardinalities of the output sets (i.e., the numbers of spikes) are summed.

2.3.3 Maximal Spike Count. This code assumes that the intensity information is represented by the activity of the afferent fiber that has the maximal spike count. Therefore, the temporal interval that produces the highest spike count in an individual fiber is selected by the simulated observer. This code implies that there is a competition between afferent fibers in a higher-order decision network (e.g., winner-takes-all type). The mathematical expression
for the code is

$$
n_i = \max_k |\{o_k\}|.
\tag{2.12}
$$
In equation 2.12, the maximum cardinality found in the population output is set as the neural variable.

2.3.4 Distribution of Activity. This code considers the distribution of spike counts (q) among the fiber population. The simulated observer decides on a temporal interval according to signal detection theory (Green & Swets, 1966; Gescheider, 1997). In this approach, the maximum likelihood rule yields the optimum decision (Srinath, Rajasekaran, & Viswanathan, 1996). If I_1 and I_2 are the stimulus intensities in the two intervals, the maximum likelihood rule can be given as

$$
\frac{p(q \mid I_2)}{p(q \mid I_1)} \;\underset{I_1}{\overset{I_2}{\gtrless}}\; 1.
\tag{2.13}
$$
In equation 2.13, p(q | I_i) is the conditional probability density of spike counts given the stimulus with intensity I_i. The simulated observer selects the interval according to the rule in equation 2.13. In other words, if the ratio is greater than one, the second interval is selected; if it is less than one, the first interval is selected. The density functions in equation 2.13 were found by repeating the simulation for 100 trials.

2.3.5 Synchronized Activity. Although temporal information in the spike train may help discrimination, the central decision unit probably does not know about stimulus parameters a priori. Therefore, synchronization with respect to the stimulus waveform may be a less probable code for intensity. With no a priori information, the interspike interval is the simplest temporal parameter that may signal the intensity of the sinusoidal stimulus. If the spikes are regular, the standard deviation of the interspike intervals is low, and the afferent fibers are synchronized. The neural variable tested in the simulations is based on the entire population output:

$$
s_i = \mathrm{std}\left(\bigcup_k \{o_k\}\right).
\tag{2.14}
$$
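The five codes reduce to a few lines each. In the sketch below (mine, not the original Matlab), `spikes` is a list with one array of spike times per fiber, standing in for the population output {o_k}:

```python
import numpy as np

def n_active(spikes):                       # equation 2.10
    return sum(1 for o in spikes if len(o) >= 1)

def n_total(spikes):                        # equation 2.11
    return sum(len(o) for o in spikes)

def n_max(spikes):                          # equation 2.12
    return max(len(o) for o in spikes)

def count_distribution(spikes, max_count):  # the q of equation 2.13
    # Empirical distribution of per-fiber spike counts; the likelihood-ratio
    # rule would compare densities estimated from repeated simulations.
    counts = [len(o) for o in spikes]
    return np.bincount(counts, minlength=max_count + 1) / len(counts)

def s_sync(spikes):                         # equation 2.14
    # Sample std of the pooled (union) spike-time set; Figure 4d reports the
    # closely related std of the interspike intervals of this pooled train.
    pooled = np.concatenate([np.asarray(o, dtype=float) for o in spikes])
    return float(np.std(pooled))
```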
In equation 2.14, the union (∪ symbol) set of all the spike times in the population is first obtained within the stimulus duration. The neural variable (s) is equal to the sample standard deviation (std) of the union set. Note that the decision criterion based on s is reversed as compared to n in the
simulations. That is, s_1 > s_2 implies that the second temporal interval contains the stronger stimulus, because the synchronization is higher.

2.4 Psychophysical Experiments. The psychophysical experiments were similar to the experiments by Güçlü and Bolanowski (2005a) except for the final procedure, which measured intensity discrimination. Again, unmasked detection thresholds were measured first. The apparatus was identical. All psychophysical experiments used the two-interval forced-choice task. The test stimulus frequency was chosen as 40 Hz because the Pacinian (P) channel is not very sensitive at this frequency (see Bolanowski et al., 1988). Thus, the NP I channel can more easily be isolated. It is also important to note that the population model has been validated specifically by 40 Hz data in the literature.

2.4.1 Subjects. Two female and three male human subjects in good health were recruited. The experiments adhered to National Institutes of Health ethical guidelines for testing human subjects and were approved by the Ethics Committee for Human Subjects of Boğaziçi University. The subjects came from a narrow age group (22 to 27 years). Note that the differential sensitivity to changes in stimulus intensity is not affected much by age (Gescheider, Edwards, Lackner, Bolanowski, & Verrillo, 1996). The terminal phalanx of the left middle finger was stimulated on the center of the volar surface. All subjects declared that they were right-handed. The average surface temperature of the second finger in the left hand was 29°C to 34°C during the experiments. The temperature was not controlled, because the NP I channel is not affected by temperature (Greenspan & Bolanowski, 1996).

2.4.2 Stimuli. The stimuli were sine wave bursts of mechanical displacements superimposed on a static indentation of 0.5 mm (see Figure 1). They were applied by a 0.126 cm² circular contactor probe (radius, 2 mm). The burst stimuli started and ended as cosine-squared ramps with 50 ms rise and fall times. The stimulus duration was determined with reference to the half-power points of the stimulus burst. In the intensity discrimination measurements, 40 Hz test stimuli were preceded by 250 Hz stimuli to mask the P channel (Güçlü & Bolanowski, 2005a). This forward-masking paradigm was to ensure that the 40 Hz stimulus is detected by the NP I channel but not by the P channel. For one set of experiments, 250 Hz sine waves were used as test stimuli to determine the masking functions of each subject. The test stimulus levels were changed during the experiment according to an up-down rule, which tracked the 0.75 correct-decision probability (Zwislocki & Relkin, 2001). In this rule, the stimulus level is increased one step for each incorrect answer, and it is decreased one step for three, not necessarily consecutive, correct answers. The step size was 1 dB for detection experiments and 0.4 dB for intensity discrimination experiments. An experiment
ended when the stimulus level was within a small range (±1 dB for detection, ±0.4 dB for intensity discrimination) for the previous 20 trials. The level corresponding to 0.75 probability is referred to as the threshold.

2.4.3 Procedures. To obtain a complete set of experiments for each subject, four serial procedures were required. Each measurement was repeated five times to enable statistical analyses of the data. The procedures were based on the four-channel theory of tactile psychophysics (Bolanowski et al., 1988). In this theory, tactile channels are assumed to be independent, and by masking (e.g., forward masking) a channel at a special frequency, the threshold of another channel can be measured (Verrillo & Gescheider, 1977; Hamer, Verrillo, & Zwislocki, 1983; Bolanowski et al., 1988; Güçlü & Bolanowski, 2005a; Hollins, Lorenz, & Harper, 2006). Figure 2 illustrates the main principles of the theory. In Figure 2a, the threshold curve (solid line) of a channel is shown. X1 to X3 are three stimuli with different frequencies. Note that X1 is below the threshold, so it is not detected. X2 and X3 are detected. However, X3 is above the threshold and induces masking of the channel. The masked channel has elevated thresholds (dotted line) shifted parallel to the original threshold curve in Figure 2a. Figure 2b depicts the detection
Figure 1: Stimulus timing diagrams for psychophysical experiments. All stimuli are sinusoidal bursts of displacements superimposed on a static indentation. Stimuli occur in two temporal intervals signaled to the subject by red and green lights. The subject gives a response when the yellow light is on. Therefore, all psychophysical tasks conform to the two-interval forced-choice paradigm. (a) In a simple detection task, the test stimulus with a given frequency and intensity occurs either in the first (shown as the gray burst) or in the second interval (shown as the dotted burst) randomly. In the experiments presented here, thresholds were measured for 40 Hz and 250 Hz test stimuli. (b) In a detection task with forward masking, a high-level 250 Hz masking stimulus is applied at the beginning of each interval. The test stimulus with a given frequency and intensity occurs in either the first (shown as the gray burst) or the second interval (shown as the dotted burst) randomly. For finding the masking function of the Pacinian channel, a 250 Hz test stimulus was used. For finding the threshold of the NP I channel, a 40 Hz test stimulus was used. Note that the threshold of the NP I channel cannot be measured with certainty if the threshold of the Pacinian (P) channel is not elevated by forward masking. (c) In an intensity discrimination task regarding the NP I channel, the P channel is first masked by high-level 250 Hz stimuli at the beginning of the intervals. Then the 40 Hz stimuli are applied in both intervals. However, the 40 Hz stimulus in one of the randomly determined intervals has a higher intensity (test stimulus) than the 40 Hz stimulus in the other interval (comparison stimulus with constant intensity). In this diagram, the comparison stimulus is in the first and the test stimulus is in the second interval.
Figure 2: Diagrams explaining the principles of the four-channel theory (Bolanowski et al., 1988) of tactile psychophysics. (a) The threshold curve (solid line) of a hypothetical tactile channel is plotted as a function of frequency. X1 to X3 are stimuli with different frequencies. Because X1 is below the threshold curve, it cannot be detected by the channel. X2 and X3 are detected by the channel. X3 is above the threshold curve and causes some masking of the channel. The thresholds of the masked channel (dotted line) are elevated. It is important to note that masking at any given frequency (e.g., frequency of X3) shifts the thresholds up, parallel to the original curve (solid line). (b) The detection mechanism that is valid in the psychophysical experiments presented in this letter is shown. A weak 40 Hz stimulus is detected by the P channel because the threshold of this channel is lower than the threshold of the NP I channel at 40 Hz. However, if the P channel is masked, the threshold of the NP I channel can be measured at 40 Hz.
mechanism based on previous experimental results (Güçlü & Bolanowski, 2005a). A weak 40 Hz stimulus is typically detected by the P channel, because this channel has a lower threshold than the NP I channel at 40 Hz. However, if the P channel is sufficiently masked, the threshold of the NP I channel can be measured at 40 Hz. Note that the inputs of the NP I channel (rapidly adapting fibers) and the P channel (Pacinian fibers) are processed separately in this theory. Masking a channel does not entirely delete it; a sufficiently intense stimulus at a particular frequency (e.g., a stimulus with an intensity that reaches the masked P curve in Figure 2b) can reactivate it.

Unmasked detection thresholds. Without applying a masking stimulus, unmasked detection thresholds were measured for each subject. In this procedure, only one of the two intervals contained the test stimulus (see Figure 1a). The interval that contained the test stimulus was determined randomly, and the subject's task was to find this interval. The intensity of the test stimulus was varied during each experiment, and the unmasked threshold was measured according to the up-down rule. The threshold was found separately for a 40 Hz test stimulus and a 250 Hz test stimulus.
Masking functions. A 250 Hz high-level masking stimulus preceded a 250 Hz test stimulus, which varied in intensity during the experiment. By using this procedure, a shift in sensitivity (masked threshold minus unmasked threshold for the 250 Hz stimulus) was obtained in the P channel as a function of the level of the masking stimulus. Again, only one of the two intervals contained the test stimulus (see Figure 1b). The interval that contained the test stimulus was determined randomly, and the subject's task was to find this interval without paying attention to the masking stimulus. A masking level that produced a sufficient masking effect (10–20 dB shift) was selected for further experiments.

Detection thresholds of the NP I channel. By using the forward-masking procedure, the sensitivity of the P channel was decreased, and the detection threshold of the NP I channel could be measured at 40 Hz. In this procedure, the 250 Hz masking stimulus with the predetermined level was applied just before the 40 Hz test stimulus (see Figure 1b). Only one of the two intervals contained the test stimulus. The interval that contained the test stimulus was determined randomly, and the subject's task was to find this interval without paying attention to the masking stimulus. The intensity of the test stimulus was varied during each experiment, and the threshold was measured according to the up-down rule.

Intensity discrimination in the NP I channel. The previous procedures were necessary to ensure detection of the 40 Hz test stimulus by the NP I channel. Once this was verified for each subject, intensity discrimination experiments could be performed. In order to mask the P channel, a 250 Hz masking stimulus with the predetermined level preceded the 40 Hz stimuli. Both of the intervals contained 40 Hz stimuli (see Figure 1c), but one of the 40 Hz stimuli (the test stimulus) had greater amplitude than that of the other stimulus (the comparison stimulus). The interval that contained the test stimulus was determined randomly, and the subject tried to find that interval. The level of the comparison stimulus was kept constant, but the intensity of the test stimulus was varied during each experiment. The discrimination threshold was measured according to the up-down rule.

3 Results

3.1 Simulations

3.1.1 Detection Thresholds. Five observers were simulated by generating five populations of afferents that were modeled with slightly different anatomical and physiological properties. Model skin regions were stimulated by 40 Hz stimuli. The psychometric function of each simulated observer is given in Figure 3. The detection thresholds (stimulus intensity at
Figure 3: Psychometric functions of simulated observers (O1–5). The 0.75 probability of detection is defined as the detection threshold. The probabilities were computed from 100 simulations.
0.75 detection probability) are 20.2, 15.8, 15.1, 20.0, and 17.0 dB (referenced to 1 µm amplitude) for simulated observers O1 to O5, respectively. There is no significant difference (two-sample t-test; p = 0.072) between these simulated detection thresholds and the thresholds of the NP I channel measured in humans (Güçlü & Bolanowski, 2005a).

3.1.2 Intensity Discrimination by Number of Fibers. The probabilities of discrimination (probabilities of correct decisions) were found for each simulated observer by using comparison stimuli with various intensities. The DLs of O5 are 0.4, 1.2, 1.2, 0.8, 0.5, 0.3, 0.7, 0.3, and 0.2 dB, respectively, for the comparison stimulus intensities of 17, 18, 19, 20, 22, 24, 26, 31, and 36 dB. For this particular intensity code (the number of active fibers), the discrimination is very good. The DL values are significantly lower than psychophysical results (approximately 1.5 dB; t-test; p < 0.001) reported by Gescheider et al. (1990). Note, however, that Gescheider et al. (1990) used a 25 Hz test stimulus and did not isolate the NP I channel by forward masking (see below). The Weber fraction is plotted as a function of SL in Figure 4a. The data points in Figure 4a are averages of five simulated observers. There is a large variation among observers near the threshold, but the variation decreases as the SL is increased. On average, the Weber fraction is 0.1 near the threshold and decreases to 0.02 at 20 dB SL. Therefore, there is a deviation from Weber's law. The data points in the 10 to 20 dB SL range
Figure 4: Weber fractions (in terms of displacement amplitude: δA/A_s) from simulations with various intensity codes. Intensity discrimination was based on the number of active fibers (a), the total spike count (b), the maximal spike count (c), and the standard deviation of interspike intervals (d). The data points are averages of five simulated observers, and the error bars are the standard deviations.
are significantly lower than the data points in the 0 to 10 dB SL range (two-sample t-test; p < 0.001).

3.1.3 Intensity Discrimination by Total Activity. The probabilities of discrimination were found for each simulated observer by using comparison stimuli with various intensities. The intensity of the comparison stimulus was set at 17, 18, 19, 20, 22, 24, 26, 31, and 36 dB. The DLs of O5 are 0.2, 0.2, 0.2, 0.3, 0.3, 0.2, 0.2, 0.2, and 0.2 dB, respectively, for the given list of comparison intensities. Therefore, the discrimination is very good. The DL values are significantly lower than the psychophysical results (approximately 1.5 dB; t-test; p < 0.001) by Gescheider et al. (1990). The Weber fraction is plotted as a function of SL in Figure 4b. The data points in Figure 4b are averages of five simulated observers. On average, the Weber fraction is 0.03 for all tested SLs. There is a slight decrease of the Weber fraction down to 0.02 in the higher half-range of the SLs; this change is statistically significant (two-sample t-test; p = 0.007).

3.1.4 Intensity Discrimination by Maximal Spike Count. For this intensity code, the DLs of O5 are 0.6, 0.4, 0.4, 7.2, 5.5, 7.9, 8.8, 9.5, and 4.7 dB,
respectively, for the comparison intensities 17, 18, 19, 20, 22, 24, 26, 31, and 36 dB. These DLs are significantly higher than psychophysical results (t-test; p = 0.011) by Gescheider et al. (1990). The Weber fraction is plotted as a function of SL in Figure 4c. Again, there is a large variance near the threshold. Nevertheless, there is also a considerable decrease of the Weber fraction from 1.66 in the lower half-range of the SLs to 0.91 in the higher half-range of the SLs; this change is statistically significant (two-sample t-test; p = 0.022).

3.1.5 Intensity Discrimination by Distribution of Activity. The distribution of spike counts in the afferent population of O5 is given in Figure 5a for various stimulus inputs. If the stimulus intensity is below the detection threshold (e.g., 15 dB), almost all fibers are inactive. If the stimulus intensity is increased, some fibers are activated, and eventually they are entrained to the stimulus frequency. That is, they fire one spike per cycle of the stimulus, and thus the distribution has a peak at 20 spikes. (Note that the stimulus frequency is 40 Hz, and the duration is 0.5 s.) As a result of this phenomenon, the distribution of activity is typically bimodal: one mode at zero activity and the second mode at 20 spikes. If the stimulus intensity is very high, there may be two spikes generated per cycle. Then the distribution of activity is trimodal, although the peaks at 0 and 40 spikes are lower than the peak at 20 spikes. For this intensity code, the DLs of O5 are 35.6, 34.6, 33.6, 32.6, 30.6, 28.6, 26.6, 21.6, and 18.4 dB, respectively, for the comparison intensities 17, 18, 19, 20, 22, 24, 26, 31, and 36 dB. These DLs are much higher (t-test; p < 0.001) than the psychophysical results reported by Gescheider et al. (1990). Weber fractions are also very high (mean: 33.93; see Figure 5b), but they decrease as a function of the stimulus level, similar to the results obtained from the other intensity codes. An intensity code based on the distribution of spike counts in the population yielded a very unfavorable discrimination capacity.

3.1.6 Intensity Discrimination by Synchronized Activity. As the stimulus intensity is increased, more afferent fibers within the population are entrained, and they generate one spike per cycle of the stimulus. Thus, the standard deviation of the interspike intervals would be expected to decrease. For the simulated observer O5, the DLs are 2.7, 1.5, 1.1, 1.0, 8.4, 16.7, 14.2, 9.9, and 0 dB, respectively, for the comparison intensities 17, 18, 19, 20, 22, 24, 26, 31, and 36 dB. These DLs are significantly higher than the psychophysical results (t-test; p = 0.029) of Gescheider et al. (1990). Because of the high variance in the discrimination thresholds, the variances of the Weber fractions are also high, especially near the detection threshold (see Figure 4d). Weber fractions are very high (mean: 5.72), but there is a considerable decrease from 7.87 in the lower half-range of the SLs to 1.44 in the higher half-range of the SLs, and this change is statistically significant (two-sample t-test; p = 0.017).
Figure 5: Simulations based on the distribution of activity as the intensity code. (a) The probabilities of spike counts among the population are given for O5 obtained from 100 simulations. Different graphs are presented for indicated stimulus intensities (dB re 1 µm peak). (b) Weber fractions are plotted as a function of the sensation level. The data points are averages of five simulated observers, and the error bars are the standard deviations.
3.2 Psychophysical Experiments

3.2.1 Detection Threshold of the NP I Channel. The results from five subjects show that the NP I channel could be isolated (see Figure 6a). This conclusion can be clarified as follows. The white bars in Figure 6a are the detection thresholds measured for the 40 Hz stimulus when there was no 250 Hz masking. The gray bars are the detection thresholds measured for the 40 Hz stimulus when there was forward masking by the 250 Hz stimulus. Since the gray bars are higher than the white bars (two-sample t-test; all p's < 0.015), detection was probably mediated by different channels. That is, the white bars must represent the threshold of the P channel at 40 Hz, and the gray bars must represent the threshold of the NP I channel while the P channel was masked by the 250 Hz stimulus (see section 2.4.3). These results are entirely consistent with previous data obtained by the same procedure (Güçlü & Bolanowski, 2005a). This explanation can be validated further by adding threshold shifts obtained from masking functions (not shown) to the white bars to obtain the black bars. The black bars represent the elevated thresholds of the P channel in each subject at 40 Hz and should be higher than the gray bars. The data in Figure 6a show that this explanation is valid (two-sample t-test; all p's < 0.001).

The NP I thresholds of S1 to S5 are 17.1, 17.7, 20.4, 18.4, and 20.8 dB, respectively, at 40 Hz. These data were compared to data previously obtained from five different subjects (Güçlü & Bolanowski, 2005a). No statistical difference could be found (two-sample t-test; p = 0.226). The current psychophysical data also were compared to the thresholds obtained from simulations. No statistical difference could be found between experimental and simulation results (two-sample t-test; p = 0.401). Therefore, the rapidly adapting population model used in the simulations mimics the psychophysical detection process of the NP I channel very well.

3.2.2 Intensity Discrimination by the NP I Channel. Since SLs were adjusted for each subject's NP I channel characteristics, the measurements of
Figure 6: Results of psychophysical experiments from five subjects (S1–5). The data points are averages of five measurements, and the error bars are the standard deviations. (a) Thresholds measured for a 40 Hz stimulus without (white bars) and with forward masking (gray bars). The white bars are considered to be the thresholds of the Pacinian channel, and the gray bars are thought to be the thresholds of the NP I channel at 40 Hz. The black bars refer to the elevated thresholds of the Pacinian channel when forward masking is used. (b) Weber fractions are plotted separately for each human subject. (c) Weber fractions from psychophysical experiments (filled circles) are plotted with simulated Weber fractions (open circles) based on the number of active fibers as the intensity code.
intensity discrimination were performed at SLs spaced irregularly apart. The DLs obtained from all subjects are a little higher on average (mean: 2.3 dB) than the average DL (1.5 dB) reported by Gescheider et al. (1990). This difference is statistically significant (two-sample t-test; p = 0.002). The Weber fraction is plotted as a function of SL in Figure 6b. The average Weber fraction is 0.32 for all SLs tested. The Weber fraction decreases from 0.38 in the lower half-range of the SL to 0.18 in the higher half-range. This change is highly significant (two-sample t-test; p < 0.001).

The Weber fraction found in psychophysical experiments (0.32) is closest to the Weber fraction found in the simulations (0.07) that were based on the number of active fibers as the intensity code. Psychophysical Weber fractions and simulated Weber fractions, based on the number of active fibers, are plotted on the same graph in Figure 6c. The psychophysical results are significantly greater than the simulation results (two-sample t-test; p < 0.001). Similarly, the psychophysical Weber fractions are significantly higher than the simulated Weber fractions (average: 0.03) when the intensity code was the total activity (two-sample t-test; p < 0.001). On the other hand, the psychophysical Weber fractions are considerably lower than the results obtained from the other intensity codes tested (two-sample t-test; p's < 0.005). The mean Weber fraction based on the maximal spike count is 1.41. It is 33.93 when the intensity code is based on the distribution of activity and 5.72 when the intensity code is based on synchronization.
4 Discussion

The population model used in the simulations produced simulated detection thresholds (average: 17.9 dB) very close to experimental thresholds (average: 19.0 dB). Similar to previous studies that used the same decision criterion of at least 10 active rapidly adapting fibers, the population model mimics the detection process very well. In addition, the results of the population model exhibited a deviation from Weber's law, just like the current psychophysical results on intensity discrimination. However, the average Weber fraction of the psychophysical data (0.32) obtained over a 20 dB SL range was higher than the Weber fraction of the nearest simulation data (0.07), which were based on the number of active fibers as the intensity code. An intensity code based on total activity gave better discrimination (Weber fraction: 0.03), whereas the other codes tested (i.e., maximal spike count, distribution of activity, synchronization) gave much worse results. Intensity discrimination based on the distribution of spike counts was especially unfavorable (Weber fraction: 33.93). The experimental and simulation results are discussed below in relation to the hypotheses put forth in section 1.
4.1 Previous Literature. The Weber fractions from psychophysical experiments (average: 0.32) are slightly higher than those reported by Gescheider et al. (1990). The average Weber fraction in their study is 0.19. Therefore, the subjects in our study could not detect small changes in amplitude as well as the subjects in Gescheider et al. (1990). There may be several explanations for this inconsistency. In addition to some differences in temporal stimulation parameters, the current study used a forward-masking paradigm to isolate the NP I channel. Therefore, the results presented here are related only to the NP I channel. On the other hand, Gescheider et al. (1990) tested the overall tactile discrimination capacity. More than one psychophysical channel could have contributed to discrimination and improved the Weber fraction in their work. This idea is also supported by the frequency insensitivity in their results. That is, the intensity discrimination capacity was probably not altered by frequency because the channels were not selectively activated.

4.2 Deviation from Weber's Law. It was implied in section 1 that the deviation from Weber's law may be because of the recruitment of additional tactile channels, and it was hypothesized that the NP I channel should not display this phenomenon. However, the current psychophysical results indicate that the deviation is present even if a single channel (i.e., the NP I channel) is activated. Then a second question arises: Is the deviation from Weber's law a result of some specific higher-order processing or of the inherent activity of the afferent population? No matter which intensity code was tested, the Weber fraction decreased as a function of the stimulus level. The activity of the afferent population seems sufficient to generate the deviation. In order to investigate the mathematical basis for that, it may be assumed that there exists a response measure (i.e., intensity code) R defined over the population activity, and it is equal to a function of the stimulus level (S): R = f(S). Additionally, it is assumed that S > 0 and dR/dS = f'(S) > 0, that is, the response measure is monotone increasing. For a given criterion change dR* > 0 in the response measure, the Weber fraction (W) is

$$
W = \frac{dS}{S} = \frac{dR^{*}}{S \cdot f'(S)}.
$$

Then the change in the Weber fraction over the change in the stimulus level is

$$
\frac{dW}{dS} = \frac{-dR^{*}\left(f'(S) + S \cdot f''(S)\right)}{S^{2} \cdot f'^{\,2}(S)}.
\tag{4.1}
$$
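As a quick numerical sanity check of equation 4.1 (my own sketch, not part of the original analysis), a power-law response measure indeed produces a Weber fraction that falls monotonically with the stimulus level:

```python
import numpy as np

alpha, beta, dR_star = 0.5, 2.0, 0.1          # arbitrary illustrative values
S = np.linspace(1.0, 100.0, 200)              # stimulus levels

f_prime = alpha * beta * S ** (alpha - 1.0)   # f'(S) for R = beta * S**alpha
W = dR_star / (S * f_prime)                   # Weber fraction, as defined above

assert np.all(np.diff(W) < 0)                 # W strictly decreases with S
```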
Since the denominator in equation 4.1 is positive, W is decreasing if and only if f'(S) + S · f''(S) > 0. Therefore, all concave-up or constant-slope (i.e., f''(S) ≥ 0) response measures yield a decrease in the Weber fraction, as seen in the experimental and simulation data. However, that is not a necessary condition; for example, power functions (R = β · S^α; α, β > 0) commonly used in neurophysiology and psychophysics also yield decreasing Weber fractions: f'(S) + S · f''(S) = α²β · S^{α−1} > 0. Note that for
Figure 7: Response measures (i.e., intensity codes) are plotted as a function of the stimulus amplitude. (a) Number of active fibers. (b) Total spike count. (c) Maximal spike count. (d) Standard deviation of interspike intervals. The data points are averages of results from five simulated observers, and the error bars are the standard deviations.
decreasing response measures (i.e., dR/dS = f'(S) < 0), like the synchronized activity code presented above, dR* < 0 and a decreasing Weber fraction requires f'(S) + S · f''(S) < 0. (The nontrivial solution of f'(S) + S · f''(S) = 0 is R = f(S) = c_1 ln S + c_2, which produces a constant Weber fraction, that is, exact Weber's law.) The values of the intensity codes obtained from the simulations presented in this letter are plotted as a function of the stimulus amplitude in Figure 7. Note that all graphs in Figure 7 are generally consistent with the mathematical explanation given above. Although the graphs in section 3 were plotted in dB units, the same mathematical conclusions are valid there as well, because the logarithm function is monotonic. There is no single response measure for the distribution of activity, but a similar explanation can be extended by considering the number of nerve fibers for each spike count used to construct the distribution.

The mathematical analyses, however, do not explain the physical causes of the deviation effect. The major physical cause for the deviation from Weber's law is the mechanical spread of the stimulus. Given a criterion amplitude A* < A_s, the skin area that is activated as the effective stimulus amplitude is attenuated to
the criterion amplitude is π · r_s² (A_s/A*)^{2/1.9} according to equation 2.7. Note that this area increases almost linearly as the stimulus amplitude is increased and hence causes an increase in the number of activated nerve fibers. This mechanical factor affects all the intensity codes tested in this letter, and because of equation 4.1, it generates a decreasing Weber fraction. The second cause for the deviation is physiological: the monotonic rate-intensity functions (see equation 2.2). This factor also affects all the tested codes. The threshold parameters (a_0) of the rate-intensity functions determine which fibers are activated. As the stimulus intensity is increased, the total spike count and the maximum spike count increase according to the rate-intensity functions. Additionally, the distribution of activity and the synchronicity of the spikes change. There is another physiological factor that is relevant only if the intensity code is based on the timing of the spikes. That is, according to equation 2.6, the phase jitter of the spikes decreases as the stimulus level is increased, and that causes a decrease in the Weber fraction.

In summary, the peripheral population response contains sufficient information to generate the deviation from Weber's law in the NP I tactile system. Similarly, in hearing, the excitation pattern of neurons is believed to be the cause of the near-miss to Weber's law (Moore & Raab, 1974). However, severe departure from Weber's law can sometimes be observed in hearing at high frequencies (e.g., 6500 Hz: Carlyon & Moore, 1986). The tactile sense, however, is very insensitive to high frequencies (Bolanowski et al., 1988).

4.3 Intensity Codes. The population model can be improved by selecting a better intensity code. It is important to note that the codes tested in this study are not optimal. More information regarding the stimulus can be obtained if the small fluctuations in the spike rates and spike timing of individual afferent fibers are considered. Nevertheless, the intensity code based on the number of active fibers has yielded better discrimination than the data obtained from psychophysical measurements. This means that the nervous system is operating at a suboptimal level. What, then, is a more probable code? Something more stringent than the number of active fibers can produce better results. For example, higher-order decision networks may choose to analyze only a subset of the afferent population. Furthermore, this subset should not be concentrated close to the stimulus-input location; otherwise, intensity discrimination would be better at lower stimulus levels, which is opposite to the deviation effect. Perhaps a uniform distribution of afferents, but fewer in number than the entire population, may be the basis for an accurate intensity code. This and other points discussed in the next section will be considered in future modeling studies.

4.4 Future Work. Gescheider, Zwislocki, and Rasmussen (1996) found that intensity discrimination was not affected by stimulus duration when
the gated-pedestal method was used (i.e., the psychophysical method used in the current study). This was attributed to the absence of a major temporal summation effect. Temporal summation is the improvement of the threshold as the stimulus duration is increased. Since the population model and the decision rules that were used in this letter did not explicitly include temporal summation, duration effects cannot be easily predicted by the population model. However, stimuli with longer durations would generate more spikes and shift the detection criteria. Therefore, it would be worthwhile to test whether some duration effects may be produced by the population model. It is interesting to note that, in hearing, stimulus duration has only a small effect on discrimination (Green, Nachmias, Kearney, & Jeffress, 1979).

The population model of intensity coding may also be tested using the psychophysical method of absolute magnitude estimation. Sensation magnitude is typically represented as a power-law function of the stimulus amplitude, with growth of sensation magnitude faster at lower stimulus intensities (Verrillo & Chamberlain, 1972; Gescheider et al., 1994). It is probable that the nervous system uses the same intensity code for intensity discrimination and magnitude estimation. This hypothesis may be tested by using the population model described in this letter. If the same intensity code gives accurate results in both discrimination and magnitude estimation tasks, the hypothesis would be supported.

Acknowledgments

This work was supported by Boğaziçi University Research Fund grant 04HX101 and TÜBİTAK grant 104S228. I thank Ronald T. Verrillo and Diane Özbal for helpful comments and editing the manuscript. I am grateful to İbrahim Mutlu for the mechanical construction of the psychophysical setup. I also thank Özge Kalkancı for her help in the psychophysical detection experiments.

References

Bolanowski, S. J., Gescheider, G. A., Verrillo, R. T., & Checkosky, C. M. (1988). Four channels mediate the mechanical aspects of touch. J. Acoust. Soc. Am., 84, 1680–1694.
Carlyon, R. P., & Moore, B. C. J. (1986). Continuous versus gated pedestals and the "severe departure" from Weber's law. J. Acoust. Soc. Am., 79, 453–460.
Darian-Smith, I., & Kenins, P. (1980). Innervation density of mechanoreceptive fibres supplying glabrous skin of the monkey's index finger. J. Physiol.-London, 309, 147–155.
Dykes, R. W. (1983). Parallel processing of somatosensory information: A theory. Brain Res., 287, 47–115.
Gescheider, G. A. (1997). Psychophysics: The fundamentals (3rd ed.). Mahwah, NJ: Erlbaum.
Gescheider, G. A., Bolanowski, S. J., & Verrillo, R. T. (2004). Some characteristics of tactile channels. Behav. Brain. Res., 148, 35–40.
Gescheider, G. A., Bolanowski, Jr., S. J., Verrillo, R. T., Arpajian, D. J., & Ryan, T. F. (1990). Vibrotactile intensity discrimination measured by three methods. J. Acoust. Soc. Am., 87, 330–338.
Gescheider, G. A., Bolanowski, S. J., Zwislocki, J. J., Hall, K. L., & Mascia, C. (1994). The effects of masking on the growth of vibrotactile sensation magnitude and on the amplitude difference limen: A test of the equal sensation magnitude–equal difference limen hypothesis. J. Acoust. Soc. Am., 96, 1479–1488.
Gescheider, G. A., Edwards, R. R., Lackner, E. A., Bolanowski, S. J., & Verrillo, R. T. (1996). The effects of aging on information-processing channels in the sense of touch: III. Differential sensitivity to changes in stimulus intensity. Somatosens. Mot. Res., 13, 73–80.
Gescheider, G. A., Thorpe, J. M., Goodarz, J., & Bolanowski, S. J. (1997). The effects of skin temperature on the detection and discrimination of tactile stimulation. Somatosens. Mot. Res., 14, 181–188.
Gescheider, G. A., Zwislocki, J. J., & Rasmussen, A. (1996). Effects of stimulus duration on the amplitude difference limen for vibrotaction. J. Acoust. Soc. Am., 100, 2312–2319.
Green, D. M., & Swets, J. A. (1966). Signal detection theory and psychophysics. New York: Wiley.
Green, D. M., Nachmias, J., Kearney, J. K., & Jeffress, L. A. (1979). Intensity discrimination with gated and continuous sinusoids. J. Acoust. Soc. Am., 66, 1051–1056.
Greenspan, J. D., & Bolanowski, S. J. (1996). The psychophysics of tactile perception and its peripheral physiological basis. In L. Kruger (Ed.), Handbook of perception and cognition 7: Pain and touch (pp. 25–103). San Diego, CA: Academic Press.
Güçlü, B., & Bolanowski, S. J. (2002). Modeling population responses of rapidly-adapting mechanoreceptive fibers. J. Comput. Neurosci., 12, 201–218.
Güçlü, B., & Bolanowski, S. J. (2003). Distribution of the intensity-characteristic parameters of cat rapidly-adapting mechanoreceptive fibers. Somatosens. Mot. Res., 20, 149–155.
Güçlü, B., & Bolanowski, S. J. (2004a). Probability of stimulus detection in a model population of rapidly adapting fibers. Neural Comput., 16, 39–58.
Güçlü, B., & Bolanowski, S. J. (2004b). Tristate Markov model for the firing statistics of rapidly-adapting mechanoreceptive fibers. J. Comput. Neurosci., 17, 107–126.
Güçlü, B., & Bolanowski, S. J. (2005a). Vibrotactile thresholds of the non-Pacinian I channel: I. Methodological issues. Somatosens. Mot. Res., 22, 49–56.
Güçlü, B., & Bolanowski, S. J. (2005b). Vibrotactile thresholds of the non-Pacinian I channel: II. Predicting the effects of contactor location on the phalanx. Somatosens. Mot. Res., 22, 57–68.
Hamer, R. D., Verrillo, R. T., & Zwislocki, J. J. (1983). Vibrotactile masking of Pacinian and non-Pacinian channels. J. Acoust. Soc. Am., 73, 1293–1303.
Hollins, M., Lorenz, F., & Harper, D. (2006). Somatosensory coding of roughness: The effect of texture adaptation in direct and indirect touch. J. Neurosci., 26, 5582–5588.
Johansson, R. S., & Vallbo, Å. B. (1979). Tactile sensibility in the human hand: Relative and absolute densities of four types of mechanoreceptive units in glabrous skin. J. Physiol.-London, 286, 283–300.
Johnson, K. O. (1974). Reconstruction of population response to a vibratory stimulus in quickly adapting mechanoreceptive afferent fiber population innervating glabrous skin of the monkey. J. Neurophysiol., 37, 48–72.
Knudsen, V. O. (1928). "Hearing" with the sense of touch. J. Gen. Psychol., 1, 320–352.
Moore, B. C. J., & Raab, D. H. (1974). Pure-tone intensity discrimination: Some experiments relating to the "near-miss" to Weber's law. J. Acoust. Soc. Am., 55, 1049–1054.
Srinath, M. D., Rajasekaran, P. K., & Viswanathan, R. (1996). Introduction to statistical signal processing with applications. Englewood Cliffs, NJ: Prentice Hall.
Talbot, W. H., Darian-Smith, I., Kornhuber, H. H., & Mountcastle, V. B. (1968). The sense of flutter-vibration: Comparison of the human capacity with response patterns of mechanoreceptive afferents from the monkey hand. J. Neurophysiol., 31, 301–334.
Verrillo, R. T., & Chamberlain, S. C. (1972). The effect of neural density and contactor surround on vibrotactile sensation magnitude. Percept. Psychophys., 11, 117–120.
Verrillo, R. T., & Gescheider, G. A. (1977). Effect of prior stimulation on vibrotactile thresholds. Sens. Processes, 1, 292–300.
Zwislocki, J. J., & Relkin, E. M. (2001). On a psychophysical transformed-rule up and down method converging on a 75% level of correct responses. PNAS, 98, 4811–4814.
Received March 31, 2006; accepted October 29, 2006.
LETTER
Communicated by Bruno Olshausen
Learning the Lie Groups of Visual Invariance

Xu Miao
[email protected]

Rajesh P. N. Rao
[email protected]
Department of Computer Science and Engineering, University of Washington, Seattle, WA 98195, U.S.A.
A fundamental problem in biological and machine vision is visual invariance: How are objects perceived to be the same despite transformations such as translations, rotations, and scaling? In this letter, we describe a new, unsupervised approach to learning invariances based on Lie group theory. Unlike traditional approaches that sacrifice information about transformations to achieve invariance, the Lie group approach explicitly models the effects of transformations in images. As a result, estimates of transformations are available for other purposes, such as pose estimation and visuomotor control. Previous approaches based on first-order Taylor series expansions of images can be regarded as special cases of the Lie group approach, which utilizes a matrix-exponential-based generative model of images and can handle arbitrarily large transformations. We present an unsupervised expectation-maximization algorithm for learning Lie transformation operators directly from image data containing examples of transformations. Our experimental results show that the Lie operators learned by the algorithm from an artificial data set containing six types of affine transformations closely match the analytically predicted affine operators. We then demonstrate that the algorithm can also recover novel transformation operators from natural image sequences. We conclude by showing that the learned operators can be used to both generate and estimate transformations in images, thereby providing a basis for achieving visual invariance.

1 Introduction

How does the visual system recognize objects despite transformations such as translations, rotations, and scaling? J. J. Gibson, an early pioneer in vision research, hypothesized that "constant perception depends on the ability of the individual to detect the invariants" (Gibson, 1966). One of the first computational approaches to perceptual invariance was proposed by Pitts and McCulloch in their article, "How We Know Universals" (Pitts & McCulloch, 1947). Pitts and McCulloch suggested that invariance could be achieved by

Neural Computation 19, 2665–2693 (2007)
C 2007 Massachusetts Institute of Technology
representing an image using characteristic values, such as averages over all transformations of functionals applied to transformed images. The method, however, suffered from the drawback that it required very large numbers of neurons for computing transformed images and their functional values. A number of other approaches have since been explored (Fukushima, 1980; Hinton, 1987; LeCun et al., 1989; Olshausen, Anderson, & Essen, 1995; Frey & Jojic, 1999; Tenenbaum & Freeman, 2000; Vasilescu & Terzopoulos, 2002; Grimes & Rao, 2005), some relying on pooling of activities in a feature-detector hierarchy (e.g., Fukushima, 1980), others relying on temporal sequences of input patterns undergoing transformations (e.g., Földiák, 1991; Wiskott & Sejnowski, 2002), and yet others utilizing modifications to the distance metric for comparing input images to stored templates (e.g., Simard, LeCun, & Denker, 1993; Vasconcelos & Lippman, 2005). In a majority of these approaches, information regarding the underlying transformations is lost in the process of achieving invariance.

In this letter, we investigate a new approach to invariance based on explicitly modeling transformations in images. Transformations are modeled using operators that are learned directly from input images. Once learned, the operators can be used to achieve visual invariance by factoring out any transformation-induced changes in the image, thereby preserving the original image. Information regarding transformations is retained and can be used for tasks such as pose estimation, visuomotor planning, and control.

Our approach is based on the notion of continuous transformations and Lie group theory. It generalizes previous approaches based on first-order Taylor series expansions of images (Black & Jepson, 1996; Rao & Ballard, 1998), which can account for only small transformations due to their assumption of a linear generative model for the transformed images. The Lie approach, on the other hand, utilizes a matrix-exponential-based generative model that can handle arbitrarily large transformations once the correct transformation operators have been learned. Although Lie groups have previously been used in visual perception (Dodwell, 1983), computer vision (Van Gool, Moons, Pauwels, & Oosterlinck, 1995), and image processing (Nordberg, 1994), the question of whether it is possible to learn these groups directly from input data has remained open. The ability to learn transformation operators from data is important because it opens the door to learning new operators that cannot be easily characterized analytically (e.g., nonrigid transformations such as those induced by changes in facial expression).

The problem of learning transformations is made difficult by the fact that not only are the transformation operators unknown, but so is the amount of transformation between any given pair of images. We present an expectation-maximization-based (EM) (Dempster, Laird, & Rubin, 1977) unsupervised learning algorithm wherein the transformation values are treated as hidden (latent) variables and the transformation operators are treated as parameters to be estimated in the M step. This
method is a natural extension of a previous gradient-descent-based algorithm proposed in Rao and Ruderman (1999). We investigate the performance of the proposed learning method on the problem of learning Lie operators for affine image transformations from an artificially constructed data set consisting of image pairs containing up to six types of transformations (translation, rotation, scaling, and two types of shear). Our experimental results demonstrate that the EM-based method has better convergence properties than the gradient-descent-based method and can accurately learn the Lie group operators for all six types of affine transformations in a completely unsupervised fashion. A major contribution of this letter is the demonstration that for cases where the transformation operators are known, the proposed EM-based algorithm does indeed produce exactly these operators, thereby opening the door to using the algorithm in other domains where the operators are unknown. As an example, we show that the algorithm can learn a novel Lie transformation operator from a natural image sequence. We conclude by demonstrating that the learned operators can be used to estimate multiple simultaneous transformations in images for invariant recognition.

2 Continuous Transformations and Lie Groups

Suppose we have a point (in general, a vector) I₀, which is an element in a space F. Let T I₀ denote a transformation of the point I₀ to another point, say I₁. The transformation operator T is completely specified by its actions on all points in the space F. Suppose T belongs to a family of operators 𝒯. Consider the case where 𝒯 is a group, that is, there exists a mapping f : 𝒯 × 𝒯 → 𝒯 from pairs of transformations to another transformation such that (1) f is associative, (2) there exists a unique identity transformation, and (3) for every T ∈ 𝒯, there exists a unique inverse transformation of T. We are interested in transformation groups because most common types of image transformation obey properties 1 to 3. For example, it is easy to see that translation is associative (T₁ · (T₂ · T₃) I₀ = (T₁ · T₂) · T₃ I₀), with a unique identity (zero translation) and a unique inverse (the inverse of T is −T).

Continuous transformations are those that can be made infinitesimally small. Due to their favorable properties as described below, we will be especially concerned with continuous transformation groups or Lie groups. Continuity is associated with both the transformation operators T and the group 𝒯. Each T ∈ 𝒯 is assumed to implement a continuous mapping from F → F. We focus on the case where T is parameterized by a single real number z (multiple transformations in an image can be handled by combining several single-parameter transformations as discussed below). Then the group 𝒯 is continuous if the function T(z) : ℝ → 𝒯 is continuous; that is, any T ∈ 𝒯 is the image of some z ∈ ℝ, and any continuous variation of z results in a continuous variation of T. Let T(0) be equivalent to the
identity transformation. Then as z → 0, the transformation T(z) gets arbitrarily close to identity. Its effect on I₀ can be written as (to first order in z) T(z) I₀ ≈ (1 + zG) I₀ for some matrix G, which is known as the generator (or operator) for the transformation group. A macroscopic transformation I₁ = I(z) = T(z) I₀ can be produced by chaining together a number of these infinitesimal transformations. For example, by dividing the parameter z into N equal parts and performing each transformation in turn, we obtain

I(z) = (1 + (z/N) G)^N I₀.  (2.1)
In the limit N → ∞, this expression reduces to the matrix exponential equation,

I(z) = e^{zG} I₀,  (2.2)
where I₀ is the initial or reference input. Thus, each of the elements of our one-parameter Lie group can be written as T(z) = e^{zG}. The generator G of the Lie group is related to the derivative of T(z) with respect to z: (d/dz) T = GT.¹ This leads to an alternate way of deriving equation 2.2. Consider the Taylor series expansion of a transformed input I(z) in terms of a previous input I(0):

I(z) = I(0) + (dI(0)/dz) z + (d²I(0)/dz²) (z²/2) + ...,  (2.3)
where z denotes the relative transformation between I(z) and I(0). Defining (d/dz) I = G I for some operator matrix G, we can rewrite equation 2.3 as I(z) = e^{zG} I₀, which is the same as equation 2.2 with I₀ = I(0). Thus, some previous approaches based on first-order Taylor series expansions (Shi & Tomasi, 1994; Black & Jepson, 1996; Rao & Ballard, 1998) can be viewed as special cases of the Lie group–based generative model.

3 Learning Lie Transformation Groups

In this section, we address the following problem: Can one learn the generators (or operators) G of particular Lie transformation groups directly from input data containing examples of transformations? Note that learning the operator for a transformation effectively allows us to remain invariant to that transformation because the effects of transformations can be modeled independent of the content of the image (see below for details).
¹ The generator G of a Lie group defines a field of tangent vectors and is in fact an element of the Lie algebra associated with the Lie group. We refer interested readers to Helgason (2001) for more details on the relationship between Lie algebras and Lie groups.
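Equations 2.1 and 2.2 are easy to verify numerically. The sketch below (an illustration added here, not code from the letter; the generator G is an arbitrary placeholder matrix) checks that chaining N infinitesimal transformations converges to the matrix exponential:

```python
import numpy as np
from scipy.linalg import expm  # matrix exponential

rng = np.random.default_rng(0)
n = 16                                   # toy input dimension
G = 0.1 * rng.standard_normal((n, n))    # placeholder generator matrix
I0 = rng.standard_normal(n)              # reference input
z = 2.0                                  # transformation amount

exact = expm(z * G) @ I0                 # equation 2.2
for N in (1, 10, 100, 1000):
    # equation 2.1: (1 + (z/N) G)^N applied to I0
    chained = np.linalg.matrix_power(np.eye(n) + (z / N) * G, N) @ I0
    print(N, np.linalg.norm(chained - exact))  # error shrinks as N grows
```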
We assume that during natural temporal sequences of images containing transformations, there are small image changes corresponding to deterministic sets of pixel changes that are independent of what the actual pixel values are. The rearrangements themselves are universal as in, for example, image translations. The question we address is: Can we learn the Lie group operator G given simply a series of before and after images?

Let the n × 1 vector I(0) be the original reference image, and let I(z) be the transformed image. Each element z_i of the T × 1 transformation vector z represents the amount of transformation of type i present in the new image. Although the methods we describe are not limited to a particular type of transformation, for concreteness, this letter focuses on the well-known group GA(2) of general affine transformations in 2D obtained under the condition of weak perspective viewing. Our focus on this group is motivated by the fact that it includes the most common types of image transformations: translations in X (∂/∂x) and Y (∂/∂y), rotations (−y ∂/∂x + x ∂/∂y), scaling (x ∂/∂x + y ∂/∂y), and two types of hyperbolic deformations (parallel hyperbolic deformation along X/Y, x ∂/∂x − y ∂/∂y, and hyperbolic deformation along the diagonals, y ∂/∂x + x ∂/∂y) that can model image distortion under weak perspective viewing.

Based on the derivation in the previous section, we propose the following statistical generative model for images:

I(z) = e^{Σ_i z_i G_i} I(0) + n,  (3.1)
where G_i is the Lie operator matrix for transformation type i and n is a zero-mean gaussian white noise process with variance σ². Note that this generalizes equation 2.2 to the case of multiple transformations. The original image I(0) can itself be modeled using, for example, the following commonly used linear generative model:

I(0) = U r + n₀,  (3.2)
where U is a matrix of learned object features (or basis vectors) and r is the feature vector (basis coefficients). Traditional approaches to appearance-based vision (such as the eigenfaces method) have focused on the generative model in equation 3.2. Equation 3.2 also forms the basis for image coding techniques such as sparse coding (Olshausen & Field, 1996) and independent component analysis (ICA; Bell & Sejnowski, 1995). None of these techniques addresses the problem of image transformations. The matrix-exponential generative model in equation 3.1 extends these previous techniques by explicitly including transformations in the generative model, independent of the actual basis matrix U used to model I(0). For the rest of the letter, we assume an arbitrary model for I(0) and focus on the problem of learning and estimating transformations on I(0) based on equation 3.1.
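To make the generative model concrete, the following sketch (an added illustration; the operators and parameter values are arbitrary placeholders rather than those used in the experiments below) samples a training pair from equation 3.1:

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(1)
N, T = 36, 2                                # 6x6 image (vectorized), two types
G = 0.05 * rng.standard_normal((T, N, N))   # placeholder Lie operators G_i
sigma = 0.1                                 # noise standard deviation

def sample_pair():
    I0 = rng.random(N)                      # reference image I(0)
    z = rng.normal(0.0, 0.1, size=T)        # transformation amounts z_i
    M = np.tensordot(z, G, axes=1)          # sum_i z_i G_i
    Iz = expm(M) @ I0 + sigma * rng.standard_normal(N)  # equation 3.1
    return I0, Iz, z

I0, Iz, z = sample_pair()
```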
In the case where the transformations are infinitesimal, the higher-order terms become negligible, and we can rewrite equation 3.1 as

ΔI = Σ_i z_i G_i I(0) + n,  (3.3)
where ΔI = I(z) − I(0) is the difference image. Note that although this model is linear, the operators G_i learned using infinitesimal transformations are the same as those used in the exponential model. Thus, once learned, these matrices can be used to handle larger transformations as well, as explored in section 4. Given the generative model in equation 3.3, our goal is to learn the various Lie operators G_i directly from image pairs containing examples of transformations. The problem is complicated by the fact that the transformation values z_i are also unknown for each pair of images. We first briefly review a previous gradient-descent-based approach (Rao & Ruderman, 1999) for tackling this problem and then present our expectation-maximization-based approach in section 3.2.
3.1 Approximate Method Based on Gradient Descent. Suppose we are given M image pairs as data. We wish to find the n × n matrices G_i and the transformation vectors z that generated the data set. To do so, we take a Bayesian maximum a posteriori approach using gaussian priors on z and G_i. Based on equation 3.3, the negative log posterior probability of a single image pair {I(z), I(0)} can be written as

E(G_i, z) = − log p[G_i, z | I(z), I(0)]
          = (1/2σ²) ‖ΔI − Σ_i z_i G_i I(0)‖² + (1/2σ_z²) Σ_i z_i² + (1/2σ_g²) Σ_i g_i^T g_i,  (3.4)

where ‖·‖ denotes the Euclidean norm, σ_z² is the variance of the zero-mean gaussian priors associated with the components z_i of the vector z, g_i is the n² × 1 vector form of G_i, and σ_g² is the variance associated with the gaussian prior on G_i. Minimizing E is equivalent to maximizing the posterior probability p[G_i, z | I(z), I(0)]. The posterior probability for the entire data set can be expressed as a sum of E over all M image pairs.
If the transformation values z_i are known for each image pair, the n × n operator matrices G_i can be estimated by performing gradient descent on E with respect to each G_i:

Ġ_i = −α ∂E/∂G_i = (α/σ²) (ΔI − Σ_j z_j G_j I(0)) (z_i I(0))^T − (α/σ_g²) G_i,  (3.5)
where α is a positive constant (the learning rate). If the z_i are not known but estimates of the G_i are available, one can estimate the z_i by performing gradient descent on E with respect to each z_i:

ż_i = −β ∂E/∂z_i = (β/σ²) (G_i I(0))^T (ΔI − Σ_j z_j G_j I(0)) − (β/σ_z²) z_i.  (3.6)
Note that if the prior distributions on the z_i are uniform rather than gaussian, one can estimate the transformation vector z directly using the following matrix pseudoinverse method:

z = [A^T A]^{−1} A^T ΔI,  (3.7)

where

A = [G_1 I(0)  G_2 I(0)  ···  G_T I(0)].
This estimate for z minimizes the squared error term (first term) in equation 3.4 for fixed G_i, as can be verified by setting the partial derivative of E with respect to z equal to zero and solving for z. In the completely unsupervised case where both the z_i and the G_i are unknown, one can repeat the following two steps: (1) estimate the z_i for all image pairs using the values for G_i from the previous iteration, and (2) adapt the G_i using these estimated values for z_i. Although there is no a priori reason for such a strategy to converge (because the parameters are being optimized separately), such an approach has been successfully used in other applications such as sparse coding of natural images (Olshausen & Field, 1996) and predictive coding (Rao & Ballard, 1999). We explore the convergence properties of this approach for unsupervised transformation learning in section 4.1.1.
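A compact sketch of this alternating scheme (my paraphrase of equations 3.5 and 3.7, assuming a uniform prior on G so the decay term drops out, and an arbitrary learning rate):

```python
import numpy as np

def estimate_z(G, I0, dI):
    """Equation 3.7: z = (A^T A)^{-1} A^T dI with A = [G_1 I0, ..., G_T I0]."""
    A = np.stack([Gi @ I0 for Gi in G], axis=1)   # N x T
    z, *_ = np.linalg.lstsq(A, dI, rcond=None)    # pseudoinverse solution
    return z

def update_G(G, I0, dI, z, alpha=0.4, sigma2=1.0):
    """Equation 3.5 without the prior term: one gradient step on each G_i."""
    resid = dI - sum(z[i] * (G[i] @ I0) for i in range(len(G)))
    for i in range(len(G)):
        G[i] += (alpha / sigma2) * np.outer(resid, z[i] * I0)
    return G

# Unsupervised alternation over image pairs (I0, dI): repeatedly call
# z = estimate_z(G, I0, dI) and then update_G(G, I0, dI, z).
```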
The estimation rules for z_i given above are based on a first-order model (see equation 3.3) and are therefore useful only for estimating small (infinitesimal) transformations. A more general rule for estimating larger transformations is obtained by performing gradient descent on the optimization function given by the matrix-exponential generative model in equation 3.1:

ż_i = γ (e^{Σ_j z_j G_j} G_i I(0))^T (I(z) − e^{Σ_j z_j G_j} I(0)) − (γ/σ_z²) z_i.  (3.8)
Note that since the optimization function for the above equation is no longer quadratic in z_i, there may exist many local minima, and the gradient descent procedure is not guaranteed to converge to a global optimum. We alleviate this problem by performing random-restart gradient descent, starting from several different initial values for z_i and picking the estimated z_i that produces the best fit to equation 3.1.

Note that each of the equations for estimating z_i minimizes an optimization function that factors a new image I(z) into a reference image I(0) and a transformation component given by z and G_i. When z and G_i are correctly estimated, any new transformations in the reference image I(0) are absorbed by the transformation component, keeping the original image stable. Thus, an object recognition system trained on reference images I(0) (e.g., eigen-based systems; see equation 3.2) will continue to recognize training objects despite the presence of transformations, thereby achieving perceptual invariance (see Rao & Ballard, 1998).

3.2 EM Algorithm for Learning Lie Operators. In the completely unsupervised case where both the transformations z_i and the operators G_i are unknown, an approach based on alternating between gradient descent over z_i and G_i was suggested above. However, such an approach is not guaranteed to converge. A more rigorous approach that is guaranteed to converge (albeit to a local minimum) is to use the expectation-maximization (EM) algorithm (Dempster et al., 1977). Rather than attempting to estimate a single best value for the transformations z_i, the EM algorithm treats the z_i as hidden variables and marginalizes over these variables in the optimization function, allowing G_i to be estimated in an unsupervised manner.
)dY, log p(X, Y|) p(Y|X, Y
is an estimate of the parameters and Y denotes the hidden variwhere ables. Recall the generative model used in equation 3.3 (to simplify notation,
Learning the Lie Groups of Visual Invariance
2673
we use I for I(z) and I0 for I(0)): I = I0 +
zi G i I0 + n,
(3.9)
i
where we assume n is a zero-mean gaussian noise process with a covariance matrix . For this generative model, we obtain:
X = I, I0 |I, I0 ∈ N×M Y = Z ∈ T×M = {G, |G ∈ N×N×T , ∈ N×N }, where I and I0 are matrices whose columns contain training images I and I0 , respectively; N = n × n is the size of the 2D image (vectorized in raster format as I or I0 ); M is the number of training image pairs; Z is the matrix whose columns store z for each training image pair; T is the number of types of transformations in the training set (size of z); and G is a 3D matrix containing Lie operator matrices G i , i = 1, . . . , T. For a given pair of training images {Ii , I0i }, we obtain from equation 3.9, p(Ii |I0i , zi , G, ) =
1 N 2
(2π) ||
1 2
e
− 12 (Ii −I0i − j zji G j I0i )T −1 (Ii −I0i −k zki G k I0i )
.
(3.10) Using Bayes’ rule and assuming that each training pair is drawn independently, we can compute the conditional distribution of the hidden variable Z: ) = p(Z|I, I0 , G,
M
) p(zi ), αi p(Ii |I0i , zi , G,
i=1
and are where αi is a normalization factor for the ith image pair, and G estimated values of G and , respectively. The optimization function thus becomes ) = log( p(I |I0 , Z, G, ) p(I0 ) p(Z)) Q(, , Z|I,I0 ,G,
(3.11)
where ·(·) denotes averaging with respect to the distribution in the subscript. To simplify notation, we define · ≡ · Z|I,I ,G , . 0
The optimization function then becomes:

Q(Θ, Θ̂) = −(M/2) log|Σ| − ½ Σ_{i=1}^M ⟨(I_i − I_{0i} − Σ_j z_ji G_j I_{0i})^T Σ^{−1} (I_i − I_{0i} − Σ_k z_ki G_k I_{0i})⟩ + Σ_{i=1}^M ⟨log p(z_i)⟩ + const.
         = −(M/2) log|Σ| − ½ Σ_i (I_i − I_{0i})^T Σ^{−1} (I_i − I_{0i}) − ½ Σ_i Σ_j Σ_k I_{0i}^T G_j^T Σ^{−1} G_k I_{0i} ⟨z_ji z_ki⟩ + Σ_i Σ_j (I_i − I_{0i})^T Σ^{−1} G_j I_{0i} ⟨z_ji⟩ + Σ_i ⟨log p(z_i)⟩ + const.
3.2.2 The Expectation Step. In the expectation step or E-step, we need to calculate ⟨z_i⟩ and ⟨z_i z_i^T⟩. We show in the appendix that

⟨z_i⟩ = μ_i,  (3.12)
⟨z_i z_i^T⟩ = Σ_{z_i} + μ_i μ_i^T,  (3.13)

where

μ_i = Σ_{z_i} A_i^T Σ^{−1} (I_i − I_{0i}),  (3.14)
Σ_{z_i}^{−1} = A_i^T Σ^{−1} A_i,  (3.15)
A_i = [G_1 I_{0i}  G_2 I_{0i}  ···  G_T I_{0i}].
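In code, the E-step is a few lines of linear algebra; a sketch (added here for illustration, assuming the uniform prior on z_i implicit in equation 3.15, with NumPy):

```python
import numpy as np

def e_step(G, Sigma_inv, I0i, Ii):
    """Posterior moments of z_i (equations 3.12-3.15), uniform prior on z_i."""
    A = np.stack([Gj @ I0i for Gj in G], axis=1)   # A_i = [G_1 I0i ... G_T I0i]
    S = np.linalg.inv(A.T @ Sigma_inv @ A)         # Sigma_{z_i}, equation 3.15
    mu = S @ A.T @ Sigma_inv @ (Ii - I0i)          # equation 3.14
    return mu, S + np.outer(mu, mu)                # <z_i>, <z_i z_i^T>
```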
3.2.3 Maximization Step. The expectations calculated in the E-step allow us to maximize Q with respect to G and Σ. Setting ∂Q/∂G_j and ∂Q/∂Σ^{−1} equal to zero, we obtain (see the appendix):

[G_1  G_2  ···] = (Σ_i kron(⟨z_i⟩^T, C_i)) (Σ_i kron(⟨z_i z_i^T⟩, D_i))^{−1},  (3.16)

Σ = (1/M) [Σ_i (I_i − I_{0i})(I_i − I_{0i})^T + Σ_{i,j,k} G_j I_{0i} I_{0i}^T G_k^T ⟨z_ji z_ki⟩ − 2 Σ_{i,j} (I_i − I_{0i}) I_{0i}^T G_j^T ⟨z_ji⟩],  (3.17)
where kron denotes the Kronecker product (or tensor product) (Eves, 1980) (see the appendix for the definition), C_i = (I_i − I_{0i}) I_{0i}^T, and D_i = I_{0i} I_{0i}^T. Reshaping the N × NT matrix [G_1  G_2  ···] into an N × N × T matrix, we obtain the updated G.

3.2.4 PCA Step. In general, the operators G_i learned via the EM algorithm are not constrained to be orthogonal to each other. As a result, EM may recover matrices that are, for example, linear combinations of the true operators. To recover orthogonal operators (e.g., affine operators), we apply principal component analysis (PCA)² to the space of N²-dimensional vectors s_i = Σ_j z_ji 𝒢_j = 𝒢 z_i, where 𝒢_j is the N²-dimensional vectorized form of the N × N matrix G_j, 𝒢 is the N² × T matrix whose columns are given by the 𝒢_j, and z_i is the T-dimensional transformation vector. Typically, the number of transformation types T is not known. PCA produces an N² × T matrix 𝒢′ whose columns are T orthogonal vectors 𝒢′_i or, equivalently, T orthogonal operator matrices G′_i of size N × N. These operator matrices obtained from PCA are then used in the E-step for the next iteration of the EM algorithm.
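A sketch of this orthogonalization (one plausible implementation added here; using an SVD of the centered sample matrix is my choice, and T is assumed known):

```python
import numpy as np

def pca_step(G, Z):
    """PCA on s_i = sum_j z_ji vec(G_j): G is (T, N, N), Z is (T, M)."""
    T, N, _ = G.shape
    Gmat = G.reshape(T, N * N).T                 # N^2 x T, columns vec(G_j)
    S = Gmat @ Z                                 # N^2 x M samples s_i
    S = S - S.mean(axis=1, keepdims=True)        # center before PCA
    U, _, _ = np.linalg.svd(S, full_matrices=False)
    return U[:, :T].T.reshape(T, N, N)           # T orthogonal N x N operators
```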
3.2.5 Summary of EM Algorithm for Learning Lie Transformation Operators. Repeat until convergence:

- E-step: Compute the expectations ⟨z_i⟩ and ⟨z_i z_i^T⟩ according to equations 3.12 and 3.13.
- M-step: Compute G and Σ using equations 3.16 and 3.17. Compute orthogonal operators G′ from G using PCA. Use G′ in the next E-step.
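For completeness, a sketch of the M-step update for G (equation 3.16; an added illustration where the expectations come from the E-step above):

```python
import numpy as np

def m_step_G(I0s, Is, Ez, Ezz):
    """Equation 3.16. I0s, Is: lists of N-vectors; Ez: list of <z_i> (T-vectors);
    Ezz: list of <z_i z_i^T> (T x T matrices). Returns G as a (T, N, N) array."""
    N, T = I0s[0].shape[0], Ez[0].shape[0]
    num = np.zeros((N, N * T))
    den = np.zeros((N * T, N * T))
    for I0i, Ii, ez, ezz in zip(I0s, Is, Ez, Ezz):
        Ci = np.outer(Ii - I0i, I0i)            # C_i = (I_i - I0i) I0i^T
        Di = np.outer(I0i, I0i)                 # D_i = I0i I0i^T
        num += np.kron(ez[None, :], Ci)         # kron(<z_i>^T, C_i), N x NT
        den += np.kron(ezz, Di)                 # kron(<z_i z_i^T>, D_i), NT x NT
    Gflat = num @ np.linalg.inv(den)            # [G_1 G_2 ...], N x NT
    return np.stack(np.split(Gflat, T, axis=1)) # reshape to (T, N, N)
```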
4 Results

The learning algorithms described above were tested in three ways. First, we used a synthetic image data set containing up to six different types of affine transformations. To generate synthetic data containing transformations, we used the periodic sinc interpolation function to continuously transform a reference image I by infinitesimal (subpixel) amounts with no assumptions about the camera sampling mechanism (Marks, 1991). The set of affine operators was derived analytically through a convolution of the numerical differential operators and the interpolation function. Figure 1 shows these analytically derived operators in the form of operator matrices G_i, along with examples of transformations obtained by using them
² An interesting alternative worth pursuing is to use independent component analysis (ICA) (Bell & Sejnowski, 1995) instead of PCA, especially when the operators are not known to be orthogonal.
Figure 1: Analytically derived Lie operators and example transformations. (A) The six affine transformation operator matrices for image size 10 × 10. From left to right and top to bottom, they are shown in the order: horizontal translation, vertical translation, rotation, scaling, parallel hyperbolic deformation, and diagonal hyperbolic deformation. (B) Examples of using the exponential generative model with the horizontal translation analytical operator to translate a grayscale image of an automobile. (C) Examples illustrating how multiple transformation types can be simultaneously applied to an image using equation 3.1. For the Mona Lisa example, the parameters used were: 3.9 (X-translation), −5.2 (Y-translation), 45 degrees (rotation), 0.1 (scaling), 0.05 (hyperbolic along X/Y), and 0.25 (hyperbolic along diagonal).
in equation 3.1. The analytically derived operators allowed us to test the performance of the gradient-descent- and EM-based learning algorithms as described in section 4.1. Second, we applied the EM-based method to the task of learning unknown transformation operators from natural images. Results from this experiment are reported in section 4.2. A final set of experiments, discussed in section 4.3, focused on using the learned operators for estimating transformations in images.
4.1 Learning Affine Transformations. Previous work (Rao & Ruderman, 1999) demonstrated the utility of using gradient descent for semi-supervised learning of a single Lie group operator: the transformation value z was assumed to be known for each training image pair. However, in most practical problems, both the transformation values and the operators themselves are unknown. We focus here on this unsupervised learning problem and test the performance of the unsupervised gradient descent approach as well as the EM-based approach described above.

We used an artificially generated data set containing one or more of the six types of 2D affine transformations: horizontal translation (H), vertical translation (V), rotation (R), scaling (S), parallel hyperbolic deformation (P), and diagonal hyperbolic deformation (D). We interpret the corresponding transformation parameter vector z as [H, V, R, S, P, D]^T. The synthetic image data set contained 20,000 image pairs. The first image in each pair was generated with uniformly random pixel intensities; the second was obtained by transforming the first image using transformation values z_i picked from zero-mean gaussian distributions with subpixel variance. To simulate the noisy image generation process in equation 3.1, zero-mean gaussian distributed noise with a variance of 0.01 was added to each artificially transformed image. An example image pair in this data set is shown in Figure 2. All intensity values were in the range 0 to 1.

4.1.1 Gradient Descent-Based Method. To test the gradient-descent approach in the unsupervised scenario where both the transformation values z and the operators G_i are unknown, we used the algorithm suggested in section 3.1: alternate between estimating z using equation 3.7 for each image pair and adapting the G_i using equation 3.5 based on the estimated value of z. The learning rate α was initialized to 0.4 and was decreased by dividing it by 1.0001 after each iteration (= 100 training pairs). The operator matrices G_i were updated after every 100 pairs of images based on the average value of the gradient.

Figure 3 shows the results of attempting to learn the six affine operator matrices from the training data set using unsupervised gradient descent. As can be seen, the approach fails to learn any of the operator matrices. In fact, the algorithm fails to converge to a stable solution, as shown in Figure 3 (right panel), which plots the optimization value (negative of the
Figure 2: Example of a training image pair from the synthetic data set. The first image is a randomly generated image of size 6 × 6 pixels. The second image was generated by transforming the first using the following value for z with components picked from zero-mean gaussian distributions: [0.0947, 0.0348, −0.0293, 0.0875, −0.0693, −0.0374]T .
Figure 3: Results from the unsupervised gradient descent algorithm. (Left) Six putative operator matrices learned simultaneously by the unsupervised gradient descent algorithm. It is clear that none of the operators resembles any of the six analytical operators for affine transformations. (Right) The value of the log posterior function (negative of first term only in equation 3.4) over successive iterations. Each iteration involved 100 image pairs. The algorithm failed to converge to a stable set of parameters even after 20,000 training image pairs.
first term in equation 3.4) over 200 iterations (20,000 image pairs). This lack of convergence motivates the use of the EM algorithm.

4.1.2 EM-Based Learning. We first tested the EM algorithm without the PCA step. As shown in Figure 4, the algorithm was unable to recover the six operators, although it did converge to a solution.
Figure 4: Results from EM without PCA step. (Left) The converged values of six operators learned by the EM algorithm without the intermediate PCA step. The algorithm failed to discover six different operators even though the training data set contained six different types of affine transformations. (Right) The algorithm converged quickly.
We then included the intermediate PCA step (see section 3.2.4) within the EM algorithm to encourage the learning of mutually orthogonal operators. Figure 5 shows that the modified EM algorithm again converges rapidly, but this time to the correct solution (see Figure 5; cf. Figure 1A). To quantify the accuracy of the learning algorithm, we computed the percentage square root error between the learned and analytical operators for both the gradient-descent- and EM-based unsupervised learning algorithms. Figure 6 shows that while the gradient descent algorithm produces operators with significant errors, the percentage error for the EM-based learned operators is consistently below 1.5%.

In a separate set of experiments, we tested the performance of the EM-based algorithm in learning each operator separately rather than simultaneously. The algorithm successfully recovered each operator and in addition converged much faster to the correct solution because no PCA step was necessary in this case.

We also tested the accuracy of the matrices learned by the EM-based algorithm using equation 2.2 to generate images with different transformation values starting from a given reference image. Figure 7 illustrates an experiment where the performance of the learned rotational operator was compared to the analytical operator. The percentage error between the transformed images using the learned versus analytical operator is shown in the bottom panel. The results from all such experiments are summarized in Table 1. The percentage errors are small (<8%) for transformations such as multipixel translations, rotations from 1 to 20 degrees, and shrinkage by
Figure 5: Results from EM algorithm with PCA step. (Left) The six operators learned using the EM algorithm with PCA step. They closely match the analytical operators depicted in Figure 1A. (Right) The value of Q function (normalized to the final attained value) over successive iterations of the E- and M-steps.
Figure 6: Comparison of operators learned using gradient descent and EM algorithm with PCA step to analytical operators. The first set of bars shows the percentage square root error between operators learned via unsupervised gradient descent and analytical operators for affine transformations. The second set shows the same error for operators learned using EM with PCA. The bars are plotted in the order of transformations H, V, R, S, P, and D (see the text for details).
Figure 7: Rotating 2D images using analytical and learned Lie operators. (A) A reference image (labeled 0) of size 40 × 40 pixels was rotated in the range −20° to +20° using both the analytically derived Lie operator matrix (bottom panel) and the learned matrix (top panel). (B) shows the percentage squared error between images generated using the analytical matrix and the learned matrix, plotted as a function of rotation value.

Table 1: Errors Between the Transformed Images Using the Learned versus Analytical Operators.

Transformation Type                    | Transformation Value Range | Percentage Error Range
Translation (horizontal and vertical) | <2 pixels                  | <0.05%
Rotation                               | <20 degrees                | <8%
Scaling (shrinkage only)^a             | 1–0.67 times               | <3%
Hyperbolic (parallel)^a                | <1 units                   | <20%

^a Other cases give poor results due to numerical errors in the matrix exponential routine.
30%. For extremely large transformations, the error does become more noticeable, primarily due to numerical errors in the matrix exponential routine.

4.2 Learning Lie Operators from Natural Image Sequences. The results in the previous section suggest that the EM-based learning algorithm
Figure 8: Two images from the Tiny Town natural image sequence. The image sequence consists of 15 frames. The first and last frame are shown. A rightward motion of the camera has caused the scene to shift leftward by a few pixels, as is evident from examining the left and right edges of the two images.
can successfully recover a set of known transformation operators from an artificially constructed data set. To test whether the algorithm can be used to learn unknown operators from natural image sequences, we used the well-known Tiny Town image sequence, consisting of 15 frames of aerial images of a toy town taken from an overhead moving camera. The camera was slowly moving rightward relative to the scene. Figure 8 shows two frames from this sequence. Each image is 480 × 512 pixels. Consecutive frames in the image sequence exhibit a close-to-infinitesimal leftward translation, making the sequence suitable for testing the EM-based learning algorithm (see equations 3.3 and 3.9).

We generated two 10,000-image-pair data sets from this image sequence: one for learning a 1D operator and the other for learning a 2D operator. To create each data set, we first randomly selected a pair of consecutive images in the sequence and then chose either a 6 × 1 (1D case) or a 6 × 6 patch (2D case) from a random location in each image. To give the learning algorithm examples of both rightward and leftward shifts, we reversed the order of the image patches with 0.5 probability.

Figure 9 shows the 1D and 2D Lie operators learned by the EM algorithm for the natural image sequence. Comparing with the translational operators, it is clear that the learned operators resemble the horizontal translation operators except for the assumption of periodicity in the input signal. Indeed, if one derives the operator for horizontal translations using the nonperiodic interpolation function, the learned operator can be seen to be similar (but not identical) to the nonperiodic analytical Lie operator for translation (see Figure 9).
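A sketch of this patch-pair construction (an added illustration; `frames` is assumed to be a (15, 480, 512) array holding the sequence):

```python
import numpy as np

def make_pairs(frames, n_pairs=10000, patch=6, seed=2):
    """Build (I0, I1) pairs of patch x patch windows from consecutive frames,
    reversing their order with probability 0.5 to mix shift directions."""
    rng = np.random.default_rng(seed)
    F, H, W = frames.shape
    pairs = []
    for _ in range(n_pairs):
        t = rng.integers(0, F - 1)            # a consecutive frame pair (t, t+1)
        r = rng.integers(0, H - patch + 1)
        c = rng.integers(0, W - patch + 1)
        a = frames[t, r:r + patch, c:c + patch].ravel()
        b = frames[t + 1, r:r + patch, c:c + patch].ravel()
        if rng.random() < 0.5:
            a, b = b, a
        pairs.append((a, b))
    return pairs
```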
Figure 9: Operators learned from natural images. The top row shows the analytical operator for translation in 1D images based on a nonperiodic (left) and periodic (right) interpolation function. The 1D and 2D operators learned from the natural image sequence are shown below.
4.3 Estimating Transformations Using the Learned Operators. Once a set of transformation operators has been learned, they can be used to estimate and factor out the transformations in images, thereby facilitating visual invariance. Given a pair of input images containing one or more transformations, the transformation values z_i can be estimated using the equations derived in section 3. We examined the efficacy of the learned operators under two conditions: (1) when the image pair contains a single type of transformation and (2) when the image pair contains multiple types of transformations simultaneously. For each case, we investigated two regimes: (a) the subpixel regime, wherein the first-order approximation can be expected to hold true, and (b) the large transformation regime, wherein the full exponential model is required.

4.3.1 Estimating Single Transformations.

Subpixel transformations. We compared the performance of the gradient descent method (see equation 3.6) and the direct method (see equation 3.7) for estimating subpixel transformations in pairs of input images. Figure 10
C Figure 10: Examples of subpixel transformations and the corresponding optimization surfaces. (A) An original image was translated vertically by 0.1 pixels (pair of images on the left). The plot shows the error function defined by the first term in equation 3.4 for this pair images. (B) The pair images on the left illustrate a rotation of 0.1 radians. The plot on the right shows the corresponding error surface. (C) The pair images on the left depict a scaling of 0.1 (i.e., 1.1 times the original image). The corresponding error surface is shown on the right. Note that for all three subpixel transformations, the error surface contains a unique global minimum.
Table 2: Estimating Subpixel Transformations.

Transformation Type | Actual Value | Analytical (Gradient Descent) | Analytical (Direct) | Learned (Gradient Descent) | Learned (Direct)
Translation         | 0.1          | 0.098                         | 0.1                 | 0.098                      | 0.098
Rotation            | 0.1          | 0.099                         | 0.1                 | 0.098                      | 0.099
Scaling             | 0.1          | 0.1001                        | 0.1001              | 0.089                      | 0.09

Note: The table compares actual transformation values with estimates obtained using the analytically derived and the learned Lie operator matrices for the image transformations in Figure 10.
shows three pairs of images and the optimization surfaces determined by equation 3.4 as a function of the transformation parameter z. For simplicity, uniform priors were used for all the parameters, allowing the optimization surface to be viewed as an "error surface" reflecting only the first term in equation 3.4. All three surfaces have a single global minimum (z = 0.1), which was found by both the gradient descent method and the direct (matrix pseudoinverse) method. Table 2 compares actual transformation values with those estimated using the direct method and the gradient descent method for the transformations depicted in Figure 10. Both the direct method and the gradient descent method recover values close to the actual value used to generate the pair of input images.

Large transformations. An advantage of the Lie approach is that the learned operator matrix can be used to estimate not just subpixel transformations but also larger transformations using gradient descent based on the matrix exponential generative model (see equation 3.8). As expected, the optimization function in this case often contains multiple local minima. Figure 11 shows examples of optimization surfaces for three pairs of images containing relatively large transformations (downward translation of 2 pixels in Figure 11A, clockwise rotation of 1 radian in Figure 11B, and scaling by 2 in Figure 11C). In all these cases, the optimization function contains a global minimum and one or more (typically shallower) local minima representing alternate (often cyclic) solutions. The two images on the bottom left of each panel in Figures 11A to 11C show the transformed images corresponding to these minima in the optimization surfaces. Except for Figure 11A (where the second minimum is a cyclic solution), the local minima are shallower and generate transformed images with greater distortion. To find the global minimum, we performed gradient descent with several equally spaced starting values centered near zero. We picked the transformation estimate that yielded the best (smallest) value for the optimization function.

Table 3 compares actual transformation values with the estimates provided by the gradient descent method for a data set of 2D image pairs containing one of three types of transformations (translation, rotation, or
C Figure 11: Examples of large transformations and the corresponding optimization surfaces. (A) Example of large translation. The top two images on the left illustrate a downward translation of two pixels obtained by using a periodic sinc interpolation function. The plot on the right shows the error function derived from applying the exponential generative model to these two images. The images representing the two local minima of this function are shown on the left at the bottom. (B, C) Examples of large rotation and large scaling, respectively. The images on the left and the plot on the right follow the scheme in A.
Table 3: Estimating Large Transformations in 2D Images.

Transformation Type | Actual Value | Analytical Estimate (by gradient descent) | Learned Estimate (by gradient descent)
Translation         | 2            | 2.001                                     | 2.002
                    | 3.14         | 3.140                                     | 3.142
Rotation            | 1            | 0.998                                     | 0.991
                    | 1.41         | 1.406                                     | 1.392
Scaling             | 2            | 2.001                                     | 2.041
                    | 1.83         | 1.832                                     | 1.850

Note: The table compares actual transformation values with estimates obtained using the analytically derived and the learned Lie operator matrices for an arbitrary set of image transformations.
scaling). Once again, the gradient descent estimates using either the analytical or the learned matrix can be seen to be close to the actual transformation values used to generate the image pair.

4.3.2 Estimating Simultaneous Transformations. Natural images typically contain a mixture of transformations. We tested whether the Lie-based approach could be used to recover multiple simultaneous transformations from image pairs containing all six affine transformations but of different magnitudes. The transformed images were generated by using the periodic interpolation function to sequentially apply each of the six types of transformations. We studied two transformation regimes: (1) simultaneous subpixel transformations and (2) simultaneous large (multipixel) transformations.

Subpixel transformations. We used the matrix pseudoinverse method (see equation 3.7) to recover the six transformation values from pairs of grayscale images that contained simultaneous known subpixel transformations. The performance was found to be comparable to that obtained in the single transformation case.

Large transformations. We tested the performance of the multiple-start gradient descent method based on equation 3.8 for estimating larger (multipixel) simultaneous transformations in image pairs. Figure 12 shows a result from an example image pair. The recovered transformations are approximately correct, though the errors are larger than in the case where the image pairs contain only a single type of transformation (see Table 3). The larger error is partly due to the fact that the estimation method (see equation 3.8) does not take into account the noncommutativity of some sequentially applied transformations (such as translation followed by rotation). An interesting direction of future research is to investigate extensions of the present approach based on, for example, Lie algebras (Helgason, 2001) for more accurately modeling the effects of large, multiple transformations.
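A sketch of the multiple-start estimation procedure (an added illustration, not the authors' code; a derivative-free local optimizer is substituted here for the explicit gradient update of equation 3.8):

```python
import numpy as np
from scipy.linalg import expm
from scipy.optimize import minimize

def estimate_large_z(G, I0, Iz, n_starts=5, scale=3.0, seed=3):
    """Random-restart minimization of the exponential-model squared error."""
    rng = np.random.default_rng(seed)
    T = len(G)

    def err(z):  # squared-error term of the optimization function for eq. 3.1
        M = sum(z[i] * G[i] for i in range(T))
        return np.sum((Iz - expm(M) @ I0) ** 2)

    best = None
    for _ in range(n_starts):
        z0 = rng.uniform(-scale, scale, size=T)        # start near zero
        res = minimize(err, z0, method="Nelder-Mead")  # local search
        if best is None or res.fun < best.fun:
            best = res
    return best.x
```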
Figure 12: Estimation of large simultaneous transformations. The second image in the top row was obtained by transforming the first using the transformation vector z = [5.1, 4.2, 0.5, 0.1, 0.1, 0.1]^T. The bar plot below shows the estimated transformation value, actual value, and error for each of the six types of affine transformations (H, V, R, S, P, and D) present in the image pair.
5 Discussion and Conclusion

Our results suggest that it is possible to learn operators (or generators) for Lie transformation groups directly from image data in a completely unsupervised manner. We demonstrated that operators for all six affine image transformations (translations, rotations, scaling, and two types of shear) can be learned directly from image pairs containing examples of these transformations. The learned operators were shown to be almost identical to their analytically derived counterparts. Furthermore, the operators were learned from a training data set containing randomly generated images, indicating that operators can be learned independent of image
content. These operators can be used in conjunction with a matrix exponential-based generative model to transform (or "warp") images by relatively large amounts. Conversely, we showed that the learned operators can be used to estimate one or more transformations directly from input images. We also provided evidence for the applicability of the approach to natural image sequences by showing that the learning algorithm can recover a novel operator from a well-known image sequence (the Tiny Town sequence) traditionally used to test optic flow algorithms.

An interesting research direction therefore is to apply the EM-based algorithm to the task of learning more complex types of transformations directly from video images. For example, nonrigid transformations such as those induced by changes in facial expression or lip movements during speech are hard to model using analytical or physics-based techniques. The unsupervised EM-based algorithm offers an alternate approach based on learning operators for nonrigid transformations directly from examples. Once learned, the operators could potentially be used for estimating the type and amount of nonrigid image transformation in an input video stream.

The Lie generative model (see equation 3.1), together with a generative model for image content (such as in equation 3.2), could be used to address certain challenging problems in computer vision such as simultaneous motion estimation and recognition of objects in natural scenes and image stabilization. The optimization function E in these cases would need to be modified to account for occlusion, multiple motions, and background clutter using, for instance, a layered model (e.g., Adelson & Wang, 1994) or a robust optimization function (as in Black & Jepson, 1996). One could also use sampling-based methods such as particle filtering (Isard & Blake, 1996) to represent multimodal distributions of transformations z_i rather than gaussian distributions. Efforts are currently underway to investigate these extensions.

An important issue concerning the estimation of large transformations based on the exponential generative model is how the globally optimal value can be found efficiently. Several possibilities exist. First, we may impose a prior on the transformation estimate z that favors small values over bigger ones; this helps in avoiding solutions that are larger cyclic counterparts of the desired solution, which is closest to zero. This is in fact already implemented in the model with the zero-mean gaussian prior on z and the restriction of the search to a region around zero. A second possibility is to use coarse-to-fine techniques, where transformation estimates obtained from images at a coarse scale are used as starting points for estimating transformations at finer scales (see, e.g., Black & Jepson, 1996; Vasconcelos & Lippman, 2005). The coarse-to-fine approach may also help in alleviating the problem of noise in the learned matrices. We hope to investigate such strategies in future work.
Appendix: Formal Derivation of EM Algorithm for Learning Lie Operators

A.1 Derivation of Expectations in the E-Step. In the expectation step (E-step), we need to calculate ⟨z_i⟩ and ⟨z_i z_i^T⟩. First, define

A_i = [G_1 I_{0i}  G_2 I_{0i}  ···  G_T I_{0i}].
Then:

p(z_i | I_i, I_{0i}, G, Σ) = p(I_i | I_{0i}, z_i, G, Σ) p(z_i) / ∫ p(I_i | I_{0i}, z_i, G, Σ) p(z_i) dz_i

= [(2π)^{−N/2} |Σ|^{−1/2} e^{−½ (I_i − I_{0i} − A_i z_i)^T Σ^{−1} (I_i − I_{0i} − A_i z_i)} (2π)^{−T/2} |Λ|^{−1/2} e^{−½ z_i^T Λ^{−1} z_i}] / ∫ p(I_i | I_{0i}, z_i, G, Σ) p(z_i) dz_i

= (2π)^{−T/2} |Σ_{z_i}|^{−1/2} e^{−½ (z_i − μ_i)^T Σ_{z_i}^{−1} (z_i − μ_i)},  (A.1)

where

Σ_{z_i}^{−1} = A_i^T Σ^{−1} A_i + Λ^{−1},  (A.2)
μ_i = Σ_{z_i} A_i^T Σ^{−1} (I_i − I_{0i}).  (A.3)
Here, we assume that the prior distribution of zi is a T -dimensional gaussian distribution, where T denotes the number of transformations (to distinguish it from matrix transpose T). When | | → ∞, the gaussian distribution becomes a uniform distribution. As zi is gaussian, we have:
zi = µi zi ziT = zi + µi µiT .
(A.4) (A.5)
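For readers who want to experiment with the E-step, here is a minimal sketch (ours, not the authors' implementation) of equations A.2 to A.5; the array names and toy parameters are illustrative, with Sigma the noise covariance and Lambda the prior covariance.

```python
# A minimal sketch of the E-step expectations (equations A.2 to A.5).
import numpy as np

def e_step(I, I0, G_list, Sigma, Lambda):
    """Posterior mean and second moment of z for one image pair (I0, I)."""
    # A_i = [G_1 I0, G_2 I0, ..., G_T I0]  (N x T)
    A = np.column_stack([G @ I0 for G in G_list])
    Sigma_inv = np.linalg.inv(Sigma)
    # Equation A.2: posterior precision, then covariance
    Sigma_z = np.linalg.inv(A.T @ Sigma_inv @ A + np.linalg.inv(Lambda))
    # Equation A.3: posterior mean
    mu = Sigma_z @ A.T @ Sigma_inv @ (I - I0)
    # Equations A.4 and A.5: the required expectations
    return mu, Sigma_z + np.outer(mu, mu)

# Toy usage with random data (N pixels, T transformations).
rng = np.random.default_rng(0)
N, T = 16, 2
G_list = [rng.standard_normal((N, N)) * 0.1 for _ in range(T)]
I0 = rng.standard_normal(N)
z_true = rng.standard_normal(T)
I = I0 + sum(z * G @ I0 for z, G in zip(z_true, G_list))
Ez, Ezz = e_step(I, I0, G_list, np.eye(N) * 0.01, np.eye(T) * 10.0)
print(z_true, Ez)   # the posterior mean should approximate z_true
```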
A.2 Derivation of the Maximization Step. The expectations calculated in the E-step allow us to maximize Q with respect to G and Σ. Taking the partial derivatives,

∂Q/∂Gj = Σ^(−1) ∑_i [ (Ii − I0i) I0i^T ⟨zji⟩ − ∑_k Gk I0i I0i^T ⟨zji zki⟩ ]

∂Q/∂Σ^(−1) = (1/2) M Σ^T − (1/2) ∑_i (Ii − I0i)(Ii − I0i)^T − (1/2) ∑_i ∑_j ∑_k Gj I0i I0i^T Gk^T ⟨zji zki⟩ + ∑_i ∑_j (Ii − I0i) I0i^T Gj^T ⟨zji⟩,

and setting these equal to zero, we obtain:

Σ = (1/M) [ ∑_i (Ii − I0i)(Ii − I0i)^T + ∑_i ∑_j ∑_k Gj I0i I0i^T Gk^T ⟨zji zki⟩ − 2 ∑_i ∑_j (Ii − I0i) I0i^T Gj^T ⟨zji⟩ ].  (A.6)
For G i , we have G1
Di z1i z1i + G 2
i
G1
Di z1i z2i + . . . =
i
Di z2i z1i + G 2
i
Ci z1i
i
Di z2i z2i + . . . =
i
...
Ci z2i
i
where Di = I0i I0Ti and Ci = (Ii − I0i )I0Ti . Solving this equation using the Kronecker product notation3 kr on, we have
G1
G2
··· =
i
−1 T T kron zi , Ci kron zi zi , Di .
(A.7)
i
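The G update of equation A.7 translates directly into a few lines of linear algebra. The following sketch is ours, not the authors' code (the Σ update of equation A.6 would follow the same pattern); it assumes the E-step expectations have already been computed for each image pair.

```python
# A minimal sketch of the M-step update for G (equation A.7), given the
# per-example E-step expectations Ez[i] and Ezz[i]; names illustrative.
import numpy as np

def m_step_G(I_list, I0_list, Ez_list, Ezz_list):
    """Solve for all operators at once via the Kronecker-product form."""
    N = I_list[0].shape[0]
    T = Ez_list[0].shape[0]
    lhs = np.zeros((N, N * T))       # sum_i kron(<z_i>^T, C_i)
    rhs = np.zeros((N * T, N * T))   # sum_i kron(<z_i z_i^T>, D_i)
    for I, I0, Ez, Ezz in zip(I_list, I0_list, Ez_list, Ezz_list):
        C = np.outer(I - I0, I0)     # C_i = (I_i - I0_i) I0_i^T
        D = np.outer(I0, I0)         # D_i = I0_i I0_i^T
        lhs += np.kron(Ez[None, :], C)
        rhs += np.kron(Ezz, D)
    G_stacked = lhs @ np.linalg.inv(rhs)          # (G_1 G_2 ...), N x NT
    # Reshape the N x NT matrix into T separate N x N operators.
    return [G_stacked[:, t * N:(t + 1) * N] for t in range(T)]
```

Alternating this update with the E-step sketched above would implement one EM iteration over a batch of image pairs.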
Finally, reshaping the N × NT matrix ( G1 G2 · · · ) into an N × N × T matrix, we obtain the updated G.

Acknowledgments

We thank the reviewers for their comments and suggestions. We are also indebted to Dan Ruderman for providing the initial impetus for this work
and for his early collaboration. This research is supported by an NGA NEGI grant, NSF Career grant no. 133592, and fellowships to R.P.N.R. from the Packard and Sloan foundations.

References

Adelson, E., & Wang, J. (1994). Representing moving images with layers. IEEE Trans. on Image Processing, 3, 625–638.
Bell, A. J., & Sejnowski, T. J. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7(6), 1129–1159.
Black, M. J., & Jepson, A. D. (1996). Eigentracking: Robust matching and tracking of articulated objects using a view-based representation. In Proc. of the Fourth European Conference on Computer Vision (ECCV) (pp. 329–342). Berlin: Springer-Verlag.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. J. Royal Statistical Society Series B, 39, 1–38.
Dodwell, P. C. (1983). The Lie transformation group model of visual perception. Perception and Psychophysics, 34(1), 1–16.
Eves, H. (1980). Elementary matrix theory. New York: Dover.
Földiák, P. (1991). Learning invariance from transformation sequences. Neural Computation, 3(2), 194–200.
Frey, B. J., & Jojic, N. (1999). Estimating mixture models of images and inferring spatial transformations using the EM algorithm. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 416–422). Piscataway, NJ: IEEE.
Fukushima, K. (1980). Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36, 193–202.
Gibson, J. (1966). The senses considered as perceptual systems. Boston: Houghton Mifflin.
Grimes, D., & Rao, R. P. N. (2005). Bilinear sparse coding for invariant vision. Neural Computation, 17(1), 47–73.
Helgason, S. (2001). Differential geometry, Lie groups, and symmetric spaces. Providence, RI: American Mathematical Society.
Hinton, G. E. (1987). Learning translation invariant recognition in a massively parallel network. In G. Goos & J. Hartmanis (Eds.), PARLE: Parallel Architectures and Languages Europe (pp. 1–13). Berlin: Springer-Verlag.
Isard, M., & Blake, A. (1996). Contour tracking by stochastic propagation of conditional density. In Proc. of the Fourth European Conference on Computer Vision (ECCV) (pp. 343–356). New York: Springer.
LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., & Jackel, L. D. (1989). Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4), 541–551.
Marks II, R. J. (1991). Introduction to Shannon sampling and interpolation theory. New York: Springer-Verlag.
Nordberg, K. (1994). Signal representation and processing using operator groups (Tech. Rep. Linköping Studies in Science and Technology, Dissertations No. 366). Linköping, Sweden: Department of Electrical Engineering, Linköping University.
Olshausen, B. A., Anderson, C. H., & Van Essen, D. C. (1995). A multiscale dynamic routing circuit for forming size- and position-invariant object representations. Journal of Computational Neuroscience, 2, 45–62.
Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381, 607–609.
Pitts, W., & McCulloch, W. (1947). How we know universals: The perception of auditory and visual forms. Bulletin of Mathematical Biophysics, 9, 127–147.
Rao, R. P. N., & Ballard, D. H. (1998). Development of localized oriented receptive fields by learning a translation-invariant code for natural images. Network: Computation in Neural Systems, 9(2), 219–234.
Rao, R. P. N., & Ballard, D. H. (1999). Predictive coding in the visual cortex: A functional interpretation of some extra-classical receptive field effects. Nature Neuroscience, 2(1), 79–87.
Rao, R. P. N., & Ruderman, D. L. (1999). Learning Lie groups for invariant visual perception. In M. S. Kearns, S. Solla, & D. Cohn (Eds.), Advances in neural information processing systems, 11 (pp. 810–816). Cambridge, MA: MIT Press.
Shi, J., & Tomasi, C. (1994). Good features to track. In Proceedings of IEEE CVPR'94 (pp. 593–600). Piscataway, NJ: IEEE.
Simard, P., LeCun, Y., & Denker, J. (1993). Efficient pattern recognition using a new transformation distance. In C. L. Giles, S. J. Hanson, & J. D. Cowen (Eds.), Advances in neural information processing systems, 5 (pp. 50–58). San Mateo, CA: Morgan Kaufmann.
Tenenbaum, J. B., & Freeman, W. T. (2000). Separating style and content with bilinear models. Neural Computation, 12(6), 1247–1283.
Van Gool, L., Moons, T., Pauwels, E., & Oosterlinck, A. (1995). Vision and Lie's approach to invariance. Image and Vision Computing, 13(4), 259–277.
Vasconcelos, N., & Lippman, A. (2005). A multiresolution manifold distance for invariant image similarity. IEEE Transactions on Multimedia, 7, 127–142.
Vasilescu, M. A. O., & Terzopoulos, D. (2002). Multilinear analysis of image ensembles: TensorFaces. In Proc. European Conf. on Computer Vision (ECCV) (pp. 447–460). Berlin: Springer-Verlag.
Wiskott, L., & Sejnowski, T. (2002). Slow feature analysis: Unsupervised learning of invariances. Neural Computation, 14(4), 715–770.
Received April 7, 2006; accepted December 16, 2006.
LETTER
Communicated by Günther Palm

Learning with "Relevance": Using a Third Factor to Stabilize Hebbian Learning

Bernd Porr
[email protected]
Department of Electronics and Electrical Engineering, University of Glasgow, Glasgow, GT12 8LT, Scotland

Florentin Wörgötter
[email protected]
Bernstein Centre for Computational Neuroscience, University of Göttingen, 37073 Göttingen, Germany
It is a well-known fact that Hebbian learning is inherently unstable because of its self-amplifying terms: the more a synapse grows, the stronger the postsynaptic activity, and therefore the faster the synaptic growth. This unwanted weight growth is driven by the autocorrelation term of Hebbian learning, where the same synapse drives its own growth. On the other hand, the cross-correlation term performs actual learning, where different inputs are correlated with each other. Consequently, we would like to minimize the autocorrelation and maximize the cross-correlation. Here we show that we can achieve this with a third factor that switches on learning when the autocorrelation is minimal or zero and the cross-correlation is maximal. The biological counterpart of such a third factor is a neuromodulator that switches on learning at a certain moment in time. We show in a behavioral experiment that our three-factor learning clearly outperforms classical Hebbian learning.

1 Introduction

Hebbian learning (Hebb, 1949) inherently suffers from a stability problem, which can be simply stated: if a synapse grows, the output will grow, leading to further growth of the synapse, and so on. Hence, in an autocorrelative manner, such a synapse influences its own growth. As long as there are only direct input-output correlations to be learned (e.g., facilitation of neuronal activity), this may not be a problem. However, there exist many cases where it is of vital importance to learn the (cross-)correlation between inputs. The most prominent example is classical conditioning (Pavlov, 1927; Balkenius & Morén, 1998), where the correlation between unconditioned and conditioned stimulus is learned. Also in a more technical context, when using Hebbian learning to extract the principal components of an input space, it

Neural Computation 19, 2694–2719 (2007)
C 2007 Massachusetts Institute of Technology
Figure 1: Learning algorithm and signal structure. (A) Differential Hebbian learning (red) uses the derivative of the output to control weight change. Its three-factor extension (black, solid line) uses in addition a "relevance" signal r to control the timing of the learning. The dashed black lines indicate that in practical applications, signals need to be fanned out into a filter bank (see below). (B) Signals in response to two δ-function inputs. The gray shapes (AC, CC) at the bottom denote a linear approximation of the absolute contribution of auto- and cross-correlation terms. The main part of the unwanted AC contribution comes directly after x1. General annotations: x0,1 = input signals; r = relevance signal; Σ stands for the summating neuron on which inputs converge with weights ρ0,1. Symbols h0,1,r represent bandpass filters and u0,1,r the signals that enter the neuron; ⊗ denotes a correlation, and the amplifier symbol stands for a changeable synaptic weight.
is required to evaluate the cross-correlations, while autocorrelations scale only the result (Oja, 1982; Linsker, 1988). In these and a variety of other situations, the self-amplification of a Hebb synapse may lead to serious difficulty in the control of learning.

Here, we concentrate on differential Hebbian learning (Kosco, 1986; Klopf, 1986; Porr & Wörgötter, 2003a), which is a variant of Hebbian learning and implements sequence learning, where two (or more) signals are correlated in time. In real life, this can happen, for example, when heat radiation precedes a pain signal or when the vision of food precedes the pleasure of eating it. Such situations occur often during the lifetime of a creature, and in these cases, it is advantageous to learn reacting to the earlier stimulus, not having to wait for the later signal. Temporal sequence learning enables the animal to react to the earlier stimulus by learning an anticipatory action (Wörgötter & Porr, 2005).

The autocorrelation problem can be better understood if we look at a simple neuron (see Figure 1, red) with just two inputs u0 and u1. The black parts of Figure 1 can be neglected for the time being. This neuron calculates a linear weighted sum:

v = ρ0 u0 + ρ1 u1.  (1.1)
The plasticity of the synapse ρ1 for differential Hebbian learning (Kosco, 1986; Porr & Wörgötter, 2003a) is defined as

dρ1/dt = µ u1 v′,  (1.2)
where µ is the learning rate. The derivative of the postsynaptic activity implements on a phenomenological level spike-timing-dependent plasticity (Markram, Lübke, Frotscher, & Sakmann, 1997; Xie & Seung, 2000; Guo-Quing & Poo, 1998; Porr & Wörgötter, 2004; Saudargiene, Porr, & Wörgötter, 2004) so that the order of the pre- and postsynaptic spikes determines if long-term potentiation (LTP) or long-term depression (LTD) occurs. Now, we can substitute v′ in equation 1.2 with the derivative of the weighted sum of equation 1.1 and get

dρ1/dt = µ ρ0 u1 u0′ + µ ρ1 u1 u1′.  (1.3)
Clearly, weight development is composed of a cross-correlation term cc (the first summand) and an autocorrelation term ac (the second summand), which is the term that causes an unwanted weight drift: a change in the weight ρ1 will cause a positive correlation in the autocorrelation term, which in turn causes further weight change, and so on.

The strategy in this letter to minimize the effect of the autocorrelation is to use the fact that in temporal sequence learning, input signals happen at different moments in time. We will show that in general, cross- and autocorrelation terms have little or no temporal overlap and that this will allow us to remove the unwanted autocorrelation term by using a third factor that switches learning on only at the moment when the autocorrelation is minimal and the cross-correlation is maximal.

In terms of biology, the application of a third factor as such is not novel. Especially in conjunction with the dopaminergic system, three-factor learning has been discussed, suggesting that dopaminergic responses could be related to the process of reward-based reinforcement learning (Miller, Sanghera, & German, 1981; Schultz, 1998; Schultz & Suri, 2001). Simply, this can be formalized as

dρ/dt = µ · pre(t) · post(t) · DA(t),  (1.4)
where pre and post represent the pre- and postsynaptic activity at the synapse and DA is the dopamine signal. Indeed there is experimental support in the striatum and other subcortical structures that dopamine could gate the plasticity of glutamatergic
synapses (for reviews, see Reynolds & Wickens, 2002; Wörgötter & Porr, 2005). Corticostriatal synapses at medium spiny neurons will show pronounced LTP if pulsed dopamine is present (Wickens, Begg, & Arbuthnott, 1996). If absent, LTD arises, which is also the case for a continuous infusion of dopamine because of D1-receptor desensitization (Memo, Lovenberg, & Hanbauer, 1982). While many interpretations show that the dopaminergic signal is regarded as an error signal (Sutton, 1988; Mirenowicz & Schultz, 1994; Schultz, Dayan, & Montague, 1997), we suggest in this letter that it might also be used to time the learning in order to stabilize synaptic weights by minimizing autocorrelation terms.

The letter is organized in the following way. In the next section, we introduce the formal framework of our three-factor learning. Then we provide a convergence proof for the open-loop condition and demonstrate how our new learning scheme behaves in a set of standard tests. Finally, we introduce behavioral feedback (closed-loop condition) and demonstrate its stability with a simple food retrieval task.

2 ISO3 Learning

We call our learning rule ISO3 learning because it is related to our differential Hebbian ISO learning rule (Porr & Wörgötter, 2003a), where we have added a third factor. We define the inputs to the system as x0 (late) and x1 (early). In all realistic situations, the interval T between x1 and x0 is not exactly known. To account for this, we introduce a filter bank hj at the input x1, defining:

u0 = h0 ∗ x0  (2.1)

uj = hj ∗ x1,  j > 0  (2.2)
with filters hj, which are given as

hj(t) = (e^(−aj t) − e^(−bj t)) / ηj,  (2.3)
where aj and bj are constants defining the rise and decay times and ηj is a normalization constant that can be used to weight the contributions of the individual filters in a filter bank.¹

¹ Note that these filters differ from the ones originally used in ISO learning. This is necessary for the convergence proof below because we need real poles for the proof instead of complex conjugate ones. However, there is no substantial difference because we have always been using highly damped resonators (e.g., Q = 0.51) in ISO learning, which can also be modeled by the difference of two exponentials. The reader who is familiar with ISO learning will notice that a resonator around Q = 0.5 has a damping of e^(−2π f) so that we define the constants aj and bj around aj = 2π f to remain compatible with our definitions from ISO learning.
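As an illustration, the filter bank of equation 2.3 can be realized in discrete time by direct evaluation of the impulse responses. The sketch below is ours, not the authors' code; the parameter convention follows the a = 0.9 · 2π/(10j), b = 2π/(10j) values used later in Figure 3, and normalizing the peak amplitude to one is one possible choice of ηj.

```python
# A sketch of the filter bank of equation 2.3 in discrete time.
import numpy as np

def filter_bank(x, n_filters=10, n_taps=2000):
    t = np.arange(n_taps)
    u = []
    for j in range(1, n_filters + 1):
        a = 0.9 * 2 * np.pi / (10 * j)
        b = 2 * np.pi / (10 * j)
        h = np.exp(-a * t) - np.exp(-b * t)     # difference of exponentials
        h /= np.abs(h).max()                    # one choice of eta_j
        u.append(np.convolve(x, h)[: len(x)])   # u_j = h_j * x1
    return np.array(u)

x1 = np.zeros(5000)
x1[100] = 1.0                                   # a delta pulse at t = 100
u = filter_bank(x1)
print(u.shape, u.argmax(axis=1))                # later filters peak later
```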
The output is a weighted sum of the filtered signals,

v = ρ0 u0 + ∑_{k=1}^{N} ρk uk.  (2.4)
Now we can define ISO3 learning by (see Figure 1A, red and black parts)

dρk/dt = µ uk v′ γ,  (2.5)
where µ is the learning rate, as before. Note that the original ISO learning rule was defined as dρk/dt = µ uk v′, but it is now augmented by a third factor, γ. For further analysis, it is useful to rewrite equation 2.5, as in section 1, in the following way:
dρk/dt = µ ( uk ρ0 u0′ + uk ∑_{j=1}^{N} ρj uj′ ) γ  (2.6)

= µ (cck + ack) γ,  (2.7)
where cck and ack represent cross- and autocorrelation contributions, respectively. Note that the cross-correlation term cck = uk ρ0 u0′ is essentially identical to the ICO rule (Porr & Wörgötter, 2006) given by dρ1/dt = µ u1 u0′. In some sections, we will refer to the ICO rule to compare its behavior to ISO and ISO3 learning. Furthermore, we define the signal γ by
γ = γ̃ if γ̃ > 0, and γ = 0 otherwise,  (2.8)
where

γ̃(t) = d/dt [r(t) ∗ hγ(t)],  (2.9)

where we call r the relevance signal. The function hγ(t) is also implemented
where we call r the relevance signal. The function h γ (t) is also implemented with ISO learning will notice that a resonator around Q = 0.5 has a damping of e −2π f so that we define the constants a j and b j around a j = 2π f to remain compatible with our definitions from ISO learning.
by equation 2.3, where the derivative turns its low-pass characteristic into a high-pass characteristic. This also guarantees that the computations in the main pathway (via xj → v) and the relevance pathway (r → γ) undergo the same computations, namely, first low-pass filtering and then calculating the derivative.

The autocorrelation term in equation 2.7 is the one that needs to be minimized by timing the learning correctly using r. To get an idea of how this could be achieved, we analyze the signal structure of this circuit (see Figure 1B) when using δ-function inputs. Signals u0,1 are obtained by filtering the input pulses x0,1 with bandpass filters h, which create an overlap between temporally shifted inputs, necessary for the correlation in Hebbian learning. Most important, however, this diagram shows the different components u1 and u0 of which v is composed during learning. Because we are employing sequence learning, the auto- and the cross-correlation terms happen at different moments in time, which immediately suggests that one should time learning by triggering r together with x0 because the autocorrelation is then zero. Hence, we define

tr = tx0.  (2.10)
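In discrete time, the third factor of equations 2.8 to 2.10 amounts to low-pass filtering the relevance pulse, differentiating, and rectifying. A minimal sketch (ours, with illustrative parameters):

```python
# A sketch of the third factor gamma (equations 2.8 to 2.10).
import numpy as np

def lowpass(x, a, b):
    t = np.arange(500)
    h = np.exp(-a * t) - np.exp(-b * t)          # filter of equation 2.3
    h /= np.abs(h).max()
    return np.convolve(x, h)[: len(x)]

r = np.zeros(200)
r[100] = 1.0                                     # relevance pulse at t_r = t_x0
gamma_tilde = np.diff(lowpass(r, 0.9 * 2 * np.pi / 20, 2 * np.pi / 20),
                      prepend=0.0)               # d/dt [r * h_gamma]
gamma = np.maximum(gamma_tilde, 0.0)             # rectification of eq. 2.8
print(np.argmax(gamma > 0))                      # learning switches on at t_r
```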
The relevance signal starts at the moment x0 is triggered and then slowly decays. The derivative in equation 2.9 is used to eliminate the time lag obtained by the convolution of r ∗ hr, and we use only positive contributions to ensure that Hebbian learning does not spuriously turn into anti-Hebbian learning.

2.1 Formal Open-Loop Convergence Condition. In order to prove that ISO3 learning with an appropriately timed r-signal can eliminate the contributions of autocorrelation terms, we will use δ-functions for all input signals. More complex input signals can be decomposed into a train of delta functions as long as the system is linear, which we assume in the following derivations:

x0 = a0 δ(t − T)  (2.11)

x1 = a1 δ(t)  (2.12)

γ = δ(t − T),  (2.13)
where x0 and x1 are scaled by the amplitude factors a 0 and a 1 , respectively, which can have any nonzero value. Having scaling factors for x0 and x1 and not for the relevance signal r stresses the fundamental difference between these signals: while x0 is an error signal that can have different polarities and fluctuating amplitudes, the relevance signal always has the same amplitude and is always triggered at the moment x0 is excited.
The idea here is to show that the distribution of weights associated with a filter bank will reach its first maximum exactly when x0 (or r) occurs, leading to a zero derivative. A situation like this has been constructed in Figure 1B with just a single filter, where the derivative curve reaches its maximum precisely when a delta pulse at x0 occurs.

We now show that learning with δ-pulse inputs with a filter bank will generate a maximum at the moment the relevance signal is triggered and that this makes the autocorrelation term zero. First, we have to calculate the overall weight change for ISO3 learning. The overall weight change for ρk is given as
Δρk = µ ∫0^∞ uk v′ γ dt  (2.14)
by integrating the ISO3 learning rule, equation 2.5, over the whole time span. This integral can also be split up into a cross- and an autocorrelation term so that we get

Δρk = µ ∫0^∞ uk ρ0 u0′ γ dt + µ ∫0^∞ uk ∑_{j=1}^{N} ρj uj′ γ dt,  (2.15)

where the first integral is cck and the second is ack.
The integral can be solved by recalling that we have defined our signal γ as a delta function, which switches learning on at time T:

Δρk = µ ρ0 u0′(0) uk(T) + µ ( ∑_j ρj uj′(T) ) uk(T)  (2.16)

= µ ρ0 u0′(0) uk(T) + µ gv′(T) uk(T),  (2.17)

where the first summand is cck, the second is ack, and gv(t) = ∑_{j=1}^{N} ρj uj(t),
which means that we have weight change only at time T. The second step now is to show that at this time T, the autocorrelation term ack remains at zero. As introduced above, this is the case if the signal gv(t) has a maximum at T so that its derivative becomes zero. The signal gv(t) is generated by the weighted filter bank responses ρ1 h1, . . . , ρN hN. Consequently, we have to show that the weights ρj are learned in a way that they generate a maximum at time T.

At the very beginning of learning, only the cross-correlation cck contributes to weight change because output v is still zero. Thus, for the first weight change, we can concentrate on the cross-correlation term. We hope that this cross-correlation term generates
a weight distribution that creates a maximum at T so that further weight growth is driven only by the cross-correlation. Thus, we must prove that the cross-correlation term cck creates a maximum of gv at T. We note that the weight change Δρk is proportional to ρ0, u0′(0), and uk(T), where ρ0 is a constant. The second term, u0′(0), is the same for all weights so that it cannot generate a distribution of different weights. The only term that can change individual weights is the filter response uk(T). This means that the weight distribution must be of the form ρk ∝ uk(T), which results in

gv(t) ∝ ∑_{k=1}^{N} uk(T) uk(t).  (2.18)
We have to show that a weighted sum of filters that uses weights equal to their own values at time T creates a maximum at time T. This will be possible ultimately only with an infinite number of filters so that all possible T are covered:

gv(t) ∝ ∫0^∞ uσ(T) uσ(t) dσ,  (2.19)
where σ scales the timing of the filters, which are defined as

uσ(t) = (e^(−taσ) − e^(−tbσ)) / ησ  (2.20)
with a given rise (a) and decay time (b). We now solve the integral, equation 2.19, with the normalization

ησ = √(σ(b − a)),  (2.21)
which guarantees that the maximum of the filter bank indeed appears at t = T. Substituting equation 2.20 into equation 2.19 gives

gv(t) ∝ ∫_{ε>0}^{∞} (e^(−taσ) − e^(−tbσ))(e^(−Taσ) − e^(−Tbσ)) / (σ(b − a)) dσ,  (2.22)
where ε is infinitely small but nonzero to avoid a singularity in the integral. To find the maximum of gv(t), we have to find the values of t where the derivative,

gv′(t) ∝ d/dt ∫_{ε>0}^{∞} (e^(−taσ) − e^(−tbσ))(e^(−Taσ) − e^(−Tbσ)) / (σ(b − a)) dσ,  (2.23)
Figure 2: (A) Plot of equation 2.24 for different values of T = 1, 5, 10, with a = 0.0001, b = 0.0005, ε = 1, and ησ = √(σ(b − a)). (B) Plot of equation 2.22 with the same parameters. (C) Filters have been normalized with η = 1 instead.
becomes zero. This is an exponential integral that can be solved by exchanging differentiation and integration. With that trick, the σ in the denominator vanishes, which makes the successive integration possible. (In the appendix, we derive the solution step-by-step.) Here, we show directly the result:

gv′(t) ∝ (1/(a − b)) [ −a e^(−ε(at+bT))/(at + bT) + a e^(−εa(t+T))/(a(t + T)) − b e^(−ε(aT+bt))/(aT + bt) + b e^(−εb(t+T))/(b(t + T)) ].  (2.24)
For small ε → 0, the exponentials in the numerator converge toward one, which yields

gv′(t) ∝ T(t − T)(a − b) / ((at + bT)(aT + bt)(t + T)).  (2.25)
This term becomes zero for t = T, which is the desired result: the derivative of the filter bank is zero at t = T so that the autocorrelation is zero at the moment the relevance signal r is triggered. Figure 2A shows a plot of equation 2.24 for different values of T. The choice of a and b is not critical as long as they are not identical. Here, we have set the constants a and b to small values so that the integration takes into account slow rise and decay times. It is clear that the extremum is at the desired position t = T. The integral, equation 2.22, has no closed-form solution but can be integrated numerically (the results are shown in Figure 2B). We have chosen T = 1, 5, and 10 as the time between x1 and x0 (and γ).
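The root of equation 2.24 at t = T can also be checked numerically. The following snippet (ours, not from the paper) uses the Figure 2 parameters, with ε = 1 as reconstructed above:

```python
# Numerical check that the derivative of equation 2.24 vanishes at t = T.
import numpy as np

a, b, eps, T = 0.0001, 0.0005, 1.0, 5.0

def gv_prime(t):
    return (1.0 / (a - b)) * (
        -a * np.exp(-eps * (a * t + b * T)) / (a * t + b * T)
        + a * np.exp(-eps * a * (t + T)) / (a * (t + T))
        - b * np.exp(-eps * (a * T + b * t)) / (a * T + b * t)
        + b * np.exp(-eps * b * (t + T)) / (b * (t + T)))

t = np.linspace(0.1, 20, 2000)
print(t[np.argmin(np.abs(gv_prime(t)))])        # root lies at t = T = 5
```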
But also with different, "wrong" normalizations, we get interesting properties, as shown for the normalization η = 1, where we get a maximum at about half of T (see Figure 2C). This might be useful in applications where the autocorrelation term only has to be minimized but where a fast reaction is required. With a stronger normalization (e.g., ησ = σ(b − a)), the maximum appears at t > T. Thus, with different normalizations, we can fine-tune the responsiveness of the system.

3 Analyzing the ISO3 Rule in an Open-Loop Condition

In this section, we present two open-loop tests for our learning rule. In these tests (see Figures 3 and 4), pulse pairs have been repeatedly presented at inputs x, which converge with initial weights ρ0 = 1 and ρ1 = 0 at the learning unit.

3.1 Comparing ISO3 Learning with ISO Learning. Figure 3 shows results for the standard test (see, e.g., Porr & Wörgötter, 2003a) for ISO and ISO3 learning. Here, the signal x0 was also used to trigger the relevance signal r. Learning rates have been adjusted to produce equally strong learning for ISO and ISO3. Note that this requires larger values of µ for ISO3 than for ISO, because weight integration (see equation 2.14) is limited to the surface under the small γ signal in ISO3, while it covers a broader surface in ISO. At time step 5000, the input x0 was switched off. The corresponding signals of the filter banks and the output during learning (x1 and x0 active) and after learning (x0 switched off) are shown in Figure 3D. According to the theory, as described in detail in Porr and Wörgötter (2003a), this should lead to weight stabilization at ρ1. Figure 3A, however, demonstrates that weights will continue to grow for ISO after switching x0 off. This is due to the autocorrelation influence only. The same thing happens when using a filter bank in ISO learning (see Figure 3B), where some weights are also shrinking. Using a relevance signal prevents this unwanted effect entirely, and the same is true for the filter bank (see Figure 3C). All weights become stable after x0 has been switched off. This is due to the fact that v reaches its first maximum at tx0, and thereby learning uses the cross-correlation term only.

3.2 Changing the Duration of the Relevance Signal. In this section, we test how a longer-lasting relevance signal influences stability, and we demonstrate that one can use different types of filters in the filter bank. To obtain the results shown in Figure 4, we have used α-functions,

h(t) = t e^(−αt),  (3.1)
instead of the filters defined by equation 2.3. It is apparent from Figure 4 that the weights stabilize as soon as the input x0 has been switched off.
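The open-loop protocol itself is easy to reproduce. The following sketch (ours, not the authors' simulation code; filter constants and learning rates are illustrative rather than those of the figures) contrasts an ISO update with its γ-gated ISO3 counterpart under the pulse-pair protocol:

```python
# A sketch of the open-loop pulse-pair protocol: x1 (early) and x0
# (late, T = 10 steps apart) repeat every 100 steps; at step 5000, x0
# (and hence r) is switched off.  ISO updates with mu*u1*v'; ISO3
# additionally gates the update with the third factor gamma.
import numpy as np

def exp_filter(x, a, b):
    t = np.arange(500)
    h = np.exp(-a * t) - np.exp(-b * t)          # filter of equation 2.3
    h /= np.abs(h).max()
    return np.convolve(x, h)[: len(x)]

steps, T, period = 10000, 10, 100
x1 = np.zeros(steps); x1[::period] = 1.0
x0 = np.zeros(steps); x0[T::period] = 1.0
x0[5000:] = 0.0                                  # reflex input removed
r = x0.copy()                                    # relevance triggered with x0

a, b = 0.09, 0.1                                 # slow, illustrative filters
u0, u1 = exp_filter(x0, a, b), exp_filter(x1, a, b)
gamma = np.maximum(np.diff(exp_filter(r, a, b), prepend=0.0), 0.0)

def run(mu, use_gamma):
    rho0, rho1, v_prev = 1.0, 0.0, 0.0
    trace = np.empty(steps)
    for t in range(steps):
        v = rho0 * u0[t] + rho1 * u1[t]
        dv, v_prev = v - v_prev, v               # discrete derivative v'
        g = gamma[t] if use_gamma else 1.0
        rho1 += mu * u1[t] * dv * g              # equation 1.2 / 2.5
        trace[t] = rho1
    return trace

iso, iso3 = run(0.005, False), run(0.05, True)
# ISO keeps creeping after x0 is gone (autocorrelation term); ISO3 is
# frozen exactly, because gamma is never triggered anymore.
print(abs(iso[-1] - iso[5000]), abs(iso3[-1] - iso3[5000]))
```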
Figure 3: Comparison of the simulation results using ISO and ISO3 learning rules with a single filter (A) or a filter bank (B, C). Experiments were performed presenting pulse pairs as inputs. The time difference between x1 and x0 was T = 10 (x1 always precedes x0). At time step 5000, x0 was switched off. (A) Filters given by a = 0.9 · 2π/10 and b = 2π/10 were used to filter inputs x0, x1 and also the relevance signal r. Learning rate was µ = 0.005 for the ISO learning rule and µ = 0.07 for the ISO3 rule. (B, C) Results when using a filter bank with 10 filters for signal x1, given by a = 0.9 · 2π/(10j), b = 2π/(10j), j = 1, . . . , 10. Filters with a = 0.9 · 2π/20, b = 2π/20 were used to filter signals x0 and r. Learning rate µ = 0.001 was used for ISO learning and µ = 0.002 for ISO3. (D) The signals of the filter bank uj and the output signal v(t) when x1 and x0 are active and when only x1 is active.
Figure 4 also shows that stability is insensitive to the length of the r-signal as long as r sets in at the same time as x0. Varying the duration over more than one order of magnitude does not affect stability. These findings suggest that there is a class of different filter functions for which ISO3 converges. The common feature of the filters used so far is their low-pass component, which generates a distinctive maximum at a certain moment in time. Together, these filters are able to create a weighted sum
Figure 4: Simulation results using the ISO3 learning rule with other filters and also with varying durations a, b. Pulse pairing protocol as in Figure 3. All filters in the signal pathways x took the form of an α-function. For the relevance pathway, we used our conventional filters (see equation 2.3). For x1, we used a filter bank, setting α1,j = 0.5/j, j = 1, . . . , 10, and for x0 we set α0 = 0.25. Panels A to D show results for varying the shape of the filter in the relevance pathway for µ = 0.001 and tr = tx0 throughout. (A) a = 0.9 · 2π/10, b = 2π/10. (B) a = 0.9 · 2π/20, b = 2π/20. (C) a = 0.9 · 2π/100, b = 2π/100. (D) a = 0.9 · 2π/200, b = 2π/200. The insets show the filtered relevance signals γ together with the output v(t); they have 200 time steps on the x-axis and a range of 0, . . . , 1.2 for v(t) and 0, . . . , 12 for γ.
that has its maximum at the moment the relevance signal is triggered. This means that we can choose different types of low-pass filters to minimize the contribution of the autocorrelation term. This is a useful property, because the choice of the filter functions will determine how the output v is shaped. Different applications may require different types of outputs, and it is now, in principle, possible to obtain them by the correct choice of filters in the filter bank (which is also true for ICO learning; Porr & Wörgötter, 2006).
3.3 Summary of the Result from Open-Loop Analysis. For ISO3, we find three possible ways to stop weight growth where the third condition is the most important one. The weights stabilize:
1. Trivially, when x1 = 0. This is obvious because then its own input is lacking.

2. When T = 0 or T → ∞. These conditions reflect the fact that ISO3 is a differential Hebbian learning rule, related to spike-timing-dependent plasticity (STDP; Saudargiene et al., 2004), where LTP turns into LTD at T = 0, or where no learning takes place at large temporal intervals.

3. When x0 = 0. This is the nontrivial case, which has been made possible with the help of the third factor γ. As will be shown below, this condition allows stable behavioral learning: as soon as the learned behavior is able to eliminate the x0 signal, the weights ρj, j > 0, stop changing. This property was known and used in the original ISO learning (Porr & Wörgötter, 2003a), but weight stability could be proven only for small learning rates µ → 0, which led to the divergence of ISO learning for high learning rates. The introduction of the relevance signal r in ISO3 finally leads to the desired stability for x0 = 0 also for higher learning rates.
4 Applying ISO3 in a Behavioral Closed Loop

In the following section, we compare ISO3 with ISO in a closed-loop scenario. First, we formalize the closed loop and provide the outline of a convergence proof. Formal convergence proofs have been given for ISO in the limit of µ → 0 (Porr, von Ferber, & Wörgötter, 2003) and for ICO (Porr & Wörgötter, 2006), and here we use the same arguments while taking into account the third factor.

4.1 Formalizing the Closed-Loop Situation. Figure 5 shows how we set up our closed-loop system. This diagram is similar to the ones shown in Porr et al. (2003) and Porr and Wörgötter (2006). Uppercase letters denote that we are treating the system in the Laplace domain. Transfer functions P are the environmental transfer functions, which are usually unknown but well behaved, most often leading to a delay or to some kind of low-pass filtering. These aspects are to a great extent discussed in Porr et al. (2003). The system is built with two loops for pathways x0 and x1 and with one additional path for r. As in ISO or ICO learning, the inner loop via x0 represents the primary reflex. The goal of the reflex is to compensate the disturbance D at summation node α by a prewired response, which is achieved by setting ρ0 so that we have classical negative feedback. This means that we demand that the closed-loop feedback system,

V = D e^(−sT) ρ0 H0 P0 / (1 − ρ0 H0 P0),  (4.1)
Figure 5: ISO3 embedded in a closed-loop framework. The gray box represents the learning agent; everything outside represents the environment. Most symbols are as in Figure 1. Uppercase letters denote that we are treating such systems in the Laplace domain. P represents environmental transfer functions, τ is a delay, and D is a disturbance.
is stable (Phillips, 2000). In this way, basic behavioral patterns are established, and the system is operational and prepared to learn. The outer loop is established by x1, which is a predictive input with the potential to generate an anticipatory action. To model the predictive nature of the input x1 against the input x0, we use a delay τ, which delays the disturbance D so that it reaches first x1 and then x0. The goal of all these systems is to adapt the behavior, expressed by the output v, such that the primary reflex is no longer triggered by x0. As soon as this is achieved, we get x0 = 0, and ρ1 will stop changing. In this way, behavioral stability arises exactly at the same time as synaptic stability.

Finally, the r signal needs to be discussed. As mentioned before, there is a major difference between the r signal and the x0 signal: while the x0 signal will be eliminated and becomes zero, the r signal is not influenced by the output v of the ISO3 learner. Thus, the r signal will still be triggered even after successful learning, when the x0 signal has become zero.
4.2 Closed-Loop Convergence of ISO3 Learning. In this section, we argue that convergence in the closed loop is not substantially different from the open-loop case. Convergence is ensured as long as the filter bank generates a maximum at the moment the r signal is triggered and x0 is happening. Consequently, we have to find the closed-loop description of the output signal v, which is, in the Laplace domain,

V = ρ0 D e^(−sT) H0 P0 / (1 − ρ0 H0 P0) + ∑_{k=1}^{N} ρk Ũk,  (4.2)

where the first term is denoted C(s), the second term A(s), and

Ũk = Uk / (1 − ρ0 H0 P0).  (4.3)
The functions A and C can then be transformed back into the time domain and applied in equation 2.5:

dρk/dt = µ (uk c′(t) + uk a′(t)) γ  (4.4)

= µ (cck + ack) γ.  (4.5)
As before, it is clear that c(t) forms the cross-correlation term and a(t) the autocorrelation term. The term a(t) will still reach its maximum at the moment the relevance signal r is triggered because it remains constituted by the sum of low-pass filtered signals (see equation 4.3). What is new is the term 1/(1 − ρ0 H0 P0), compared to the original signal Uk, which introduces in the worst case a phase shift but otherwise no substantial change, as long as the term does not generate more poles. This, however, is not the case because we have demanded that the pure feedback loop, equation 4.1, is stable. Consequently, we can still expect the maximum at the moment the r signal is switched on because we still have a weighted sum of low-pass filtered signals.

5 Simulated Robot Experiment

The behavioral experiment of this section has two purposes: it will give the signals x0, x1, and r a behavioral meaning, and it will demonstrate the superiority of ISO3 compared to ISO learning. Figures 6A and 6B present the task, where a simulated robot has to learn to retrieve "food disks" (Porr & Wörgötter, 2003b), which are also emitting simulated sound signals. Two sets
Figure 6: The robot simulation. (A) The robot has two pairs of sensors: light sensors that detect the food disk only in their direct proximity and sound detectors that are able to "hear" the food source from the distance. (B) The two light detectors (LD) establish the reflex reaction (x0). The sound detectors (SD) establish the predictive loop (x1). The weights ρ1, . . . , ρN are variable and are changed by ISO, ICO, or ISO3 learning. The signal r is generated by a third light sensor and is triggered as soon as the robot enters the food disk. The robot also has a simple retraction mechanism that operates when it collides with a wall ("retraction"), which is not used for learning. The output v is the steering angle of the robot. Filters are set to a = 0.9 · 2π/10, b = 2π/10 for the reflex, and a = 0.9 · 2π/(10k), b = 2π/(10k), k = 1, . . . , 5, for the filter bank. Reflex gain was ρ0 = 0.005. (C–E) Plots of the number of contacts needed for successful learning against the learning rate for three different learning rules. In addition, the number of failures is plotted against the learning rate.
of sensor signals are used. One sensor type (x0 ) reacts to (simulated) touch and the other sensor type (x1 ) to the sound. The reflex x0 is established by two light detectors (LD), which draw the robot into the center of the white disks (see Figure 6A1). Learning must use the sound detectors (SD; see Figure 6A2), which feed into x1 to generate an anticipatory reaction toward the “food disk” (Verschure, Voegtlin, & Douglas, 2003). The reflex reaction is established by the difference of two light-dependent resistors, which cause a steering reaction toward the white disk (see Figure 6B). Hence, x0 is equal to zero if both LDs are not stimulated or when they are stimulated at the
same time, which happens during a straight encounter with a disk. The latter situation occurs after successful learning. The reflex has a constant weight ρ0, which always guarantees a stable reaction. The predictive signal x1 is generated by using two signals coming from the SDs. The signal is simply assumed to give the Euclidean distance from the sound source. The difference in the signals from the left and the right SD is a measure of the azimuth of the sound source relative to the robot. Successful learning leads to a turning reaction, which balances both sound signals and results ideally in a straight trajectory toward the target disk, ending in a head-on contact. After a disk is encountered, the disk is removed and placed randomly elsewhere. Details of this experiment, including individual movement traces, are shown in Porr and Wörgötter (2006); however, here we want to focus on the statistical comparison between ISO and ISO3 and try to show that ISO3 essentially performs as well as ICO, whereas ISO itself is unstable for high learning rates. For this, we quantify successful and unsuccessful learning for increasing learning rates µ. To make the failures comparable between ISO and ISO3 learning, we have chosen the learning rates in a way that for both learning rules, the numbers of contacts for successful learning are the same. Learning was considered successful when we received a sequence of five contacts with the disk at a subthreshold value of |x0| < 1.1. We recorded the actual number of contacts until this criterion was reached. The plots in Figures 6D and 6E show that fewer contacts are required for successful learning with increasing learning rates.

The simulations demonstrate clearly that ISO3 learning is much more stable than the Hebbian ISO learning. It behaves very similarly to ICO, for which there is no autocorrelation contribution. ISO3 learning can therefore operate at learning rates that so far have been achieved only with ICO learning, not with ISO learning. Figures 7A to 7D show how the strongest changing weight (here, ρ9) behaves for ISO3 compared to ISO during a "food disk" experiment where we have adjusted the learning rates in such a way that weight change is similar for both ISO and ISO3 learning. For ISO learning, there is one learning experience, which leads to a correct, small weight drop close to time step 3000, but the second contact has already led to divergence. ISO3 is essentially stable, but all weights will oscillate slightly around their optimal value. As discussed above, this is due to the fact that with non-δ-function inputs, the autocorrelation term cannot be fully eliminated in all cases. The remaining small fluctuation, however, will not lead to a deterioration of learning or behavior. As long as the cross-correlation term is stronger than the autocorrelation term, weights should stabilize in the closed-loop scenario because the feedback will always correct weight drifts. To show how learning evolves without any autocorrelation term, we have added inset C, which shows the weight development of ICO learning (Porr & Wörgötter, 2006) for the same food retrieval experiment.
Figure 7: (A–D) Behavior of the strongest growing weight ρ9 in a "food disk" collection experiment using different learning rules. Other parameters as defined in Figure 6. The left side shows the results from ISO3 (µ = 10^−3) and the right side from ISO learning (µ = 10^−4). (A, B) Weight change. (C, D) Value of the weight ρ9. The first learning experience ("contact") happens around time step 3000, which is magnified in panels B and D. For ISO learning, the next contact has already led to divergence. ISO3, on the other hand, remains stable, and ρ9 fluctuates slightly at the end of learning. The inset in C shows that for ICO learning, weights will fully stabilize. (E–L) Examples of signal shapes during four learning events ("contacts"). (E) Cross-correlation contribution. (F) Autocorrelation contribution. (G) The γ signal. (H) Weight change of weight ρ9. (I–L) Magnifications of events 1 and 2 from E and F. The dashed lines give the moment when the r signal is elicited. Note that each signal has been scaled separately.
Figures 7E to 7L show in detail what the signals look like in these experiments after some learning, using the same setup but with a much higher learning rate of µ = 0.001. Traces 7E and 7F show the cross- (cc) and
autocorrelation (ac) contributions, respectively. Due to the chosen filters, cc is much shorter but also much stronger than ac (note the different scaling). Also, it is evident that ac contributions can exist before as well as after cc. Furthermore, Figure 7F shows that ac can occur without cc (events 3 and 4). Events 1, 2, and 4 are associated with a relevance signal r. Figure 7G shows the corresponding r, and Figure 7H shows the resulting weight change. Figures 7I to 7L are magnifications of Figures 7E and 7F. It is interesting to discuss the individual events in more detail:

1. In the first event, there is a large temporal difference between the two light sensor inputs, because the robot had been approaching the food disk at an angle. This results in an early, spread-out autocorrelation term with moderate amplitude. The cross-correlation cc reaches its minimum at the moment of impact with the food disk, which is at the same moment that r is triggered. As a consequence, a large negative cc contribution is summed with a much smaller positive ac contribution, leading to an overall strong negative weight change. Effectively, due to the high learning rate, the system has now slightly "overlearned" the task, which becomes clear in the second event.

2. In the second event, the robot approaches the food disk at an angle, but at a smaller one than in the first event. However, the robot oversteers and touches the disk from the other side. This results in the effect that all signals are inverted, and the weight is corrected upward to a small degree. Due to the short interval when r occurs, it is again obvious that the unwanted autocorrelation contribution does not enter into the weight change.

3. In the third event, the robot was directed by the predictive inputs but did not touch any food disk. Consequently, no learning should occur, and the overall correlation should not deviate from zero. The autocorrelation term ac is positive and would cause an unwanted change in the weights. However, learning does not occur at event 3 because, on failing to touch, the relevance signal was not triggered at all.

4. The last event shows the response when the robot approaches the food disk approximately head-on, which corresponds to x0 ≈ 0. Thus, the cross-correlation remains almost zero (the small existing contribution does not appear at this magnification). This shows that the robot has learned to approach the food disk from a distance, and a straight trajectory toward the food disk is achieved. No weight change should happen because the learning goal x0 = 0 has been reached. Learning is indeed prevented due to the fact that r occurs when both the cross- and the autocorrelation contributions are zero.

We have provided mathematical evidence above to illustrate that the ISO3 rule converges in a closed-loop behaving system. The simulated food
disk collection experiment shown here supports this. In particular, these experiments show that the third-factor control mechanism also works with real non-δ-function inputs, for which rigorous mathematical convergence proofs are no longer possible.

In an earlier study, we showed that the autocorrelation-free ICO rule can be employed in a variety of difficult simulated and real control tasks (Porr & Wörgötter, 2006). These experiments shall not be repeated here, but the similarity of the behavior of ISO3 in the food disk collection supports the view that ISO3 will not demonstrate anything really new in these tasks. In addition, our simulated robot experiment demonstrates how ISO3 input signals can be embedded in a behavioral context: the sensor signals x0 and x1 directly generate motor reactions and will change substantially during learning. The r-signal, however, is always triggered when the robot enters the food disk and stabilizes learning by its right timing but not by its amplitude, which always remains the same.
6 Discussion

Correlation-based temporal sequence learning dates back to the early approaches of Sutton and Barto (1981), Kosco (1986), and Klopf (1988). The design of these rules did not allow embedding them into a behavioral context, and they were treated only in open loop and mostly in conjunction with classical conditioning (for reviews, see Sutton & Barto, 1990; Wörgötter & Porr, 2005). The success of TD learning (Sutton, 1988) and its "acting" extension Q-learning (Watkins & Dayan, 1992), both of which use a reward signal to control the learning, soon led to the ousting of the older correlation-based approaches, and they were resurrected only after they had been successfully embedded in behaving agents (Verschure & Voegtlin, 1998; Porr & Wörgötter, 2003a). The newly introduced ISO learning rule (replotted in Figure 8A) is successful at low but not at high learning rates because of its autocorrelation term. Porr and Wörgötter (2006) present a very simple and highly efficient solution to this problem, which is shown in Figure 8C. The autocorrelation term is completely eliminated if the derivative of the output v is replaced with the derivative of the reflex input u0. This rule, called input correlation learning (ICO), is highly stable and converges extremely fast, allowing even single-shot learning, as shown in several difficult real control scenarios (Porr & Wörgötter, 2006). However, it has two clear disadvantages. First, one input, u0, has now become "special." Hence, learning is judged against this input and no longer against any strong driving input, which can be a problem if a subsumption architecture (Brooks, 1989) is needed, where learning is driven by different inputs and not just by one input. In such subsumption architectures, the driving input changes over time when one feedback loop is replaced by another, which in turn leads to
Figure 8: Four different learning rules. (A) ISO learning. (B) TD learning. (C) ICO learning. (D) ISO3 learning.
another driving input after learning. This argument can be reversed if we recall that ISO3 is implementing differential Hebbian learning, which computes predictions: the third factor defines the moment in time when learning takes place. This provides an opportunity to self-organize the development of weights that grow if their corresponding inputs can predict postsynaptic activity and that shrink if their corresponding signals are too late at the moment when the relevance signal is triggered. In such a self-organized network, strong inputs develop by themselves and need not be defined. This also offers new opportunities for self-organized structures, for example, memory models (Durstewitz, Seamans, & Sejnowski, 2000). The second central disadvantage of ICO learning is its low biological plausibility. ICO learning represents a form of pure heterosynaptic plasticity, which is found only in some rare cases (Clark & Kandel, 1984; Humeau, Shaban, Bissière, & Lüthi, 2003; Beninger & Gerdjikov, 2004; Kelley, 2004). This prompted us to search for alternative solutions to the autocorrelation problem, and we have introduced the ISO3 rule for this purpose (see Figure 8D).

In this study, we have built on some earlier convergence proofs of ISO and ICO learning (Porr & Wörgötter, 2003a; Porr & Wörgötter, 2006), and we have focused on the problem of how to eliminate the autocorrelation term. For δ-function inputs, we have now proved that eliciting the r-signal together with the later input will remove the autocorrelation term completely. In practical situations, this term cannot be fully reduced to zero,
but nonetheless we found that the ISO3 rule has much better convergence properties compared to ISO. Furthermore, as in ICO, it is no longer necessary for ISO3 to use orthogonal filters, as was the case for the plain ISO rule (Porr & Wörgötter, 2003a), because weight stabilization is achieved by the generation of a maximum at x0, not by orthogonal filter functions. We have shown that the maximum can also be generated by alpha functions instead of differences of exponentials. This result suggests that there is a class of functions that generate an approximately zero derivative at the moment x0 is triggered. Looking at equation 2.18, we see that we need functions that have one maximum that can be shifted in time. If we superimpose such functions, we intuitively get a maximum at the moment when the r signal is triggered. The difficult part is the right normalization of the functions, which in our case is defined by equation 2.21. Usually a difference of exponentials is divided by η = σ(b − a), which normalizes the amplitude (see equation 2.3) to one. However, here we have to normalize the learning (see equation 2.19), which gives us the filter response two times: the filter itself and its value at the moment when r is triggered. This is a general recipe for the design of new filter functions: we need a normalization that normalizes learning (see equation 2.19) instead of the functions themselves, and we need filter functions with one maximum that can be shifted in time. Such functions could be damped exponentials, alpha functions, or higher-order functions of the form t^n e^(−αt).

The relation of correlation-based learning to reward-based reinforcement learning has been discussed at great length in Wörgötter and Porr (2005). Here we would like to point out one interesting novel aspect of ISO3: this rule uses only the timing of the r-signal to control learning. This is different from TD learning, where a prediction error is generated that directly influences the weight values (Sutton, 1988). In other words, while the third factor in ISO3 learning determines when learning should happen, in RL the third factor determines what is learned. In machine learning, the error signal is used to control value propagation in a rigorous quantitative way, distinguishing between differently rewarding situations. The quantitative value, which an individual associates with different "rewards," is certainly also evaluated by animals and humans, but it is hard to believe that the rather broad and unspecific dopaminergic signals (Fellous & Suri, 2002), which represent the majority of responses in these cell classes, would be directly used in the specific way demanded by TD-like algorithms. Such signals seem more compatible with the assumption of three-factor ISO3 learning, where they are used only to control the timing. It will be interesting to see if this new interpretation can be substantiated by physiological experiments in the future, for example, by trying to influence the plasticity at a neuron with an ill-timed, micro-iontophoretically applied dopamine burst.
Appendix: Solving the Exponential Integral

We have to solve the integral, equation 2.22, which can be rewritten in the form

gv(t) ∝ ∫_{ε>0}^{∞} e^(−σ(at+bT))/(σ(a − b)) dσ − ∫_{ε>0}^{∞} e^(−σa(t+T))/(σ(a − b)) dσ + ∫_{ε>0}^{∞} e^(−σ(aT+bt))/(σ(a − b)) dσ − ∫_{ε>0}^{∞} e^(−σb(t+T))/(σ(a − b)) dσ,  (A.1)

from which we have to calculate its derivative gv′(t). Equation A.1 contains four terms that differ only by the arguments of the exponentials, which we call z(t). They can be solved in the following way:

d/dt ∫_{ε>0}^{∞} e^(−σz(t))/σ dσ = ∫_{ε>0}^{∞} d/dt e^(−σz(t))/σ dσ  (A.2)

= −(dz(t)/dt) ∫_{ε>0}^{∞} e^(−σz(t)) dσ  (A.3)

= −(dz(t)/dt) [ −e^(−σz(t))/z(t) ]_{σ=ε}^{∞}  (A.4)

= −(dz(t)/dt) e^(−εz(t))/z(t).  (A.5)

With this result, we can solve the four terms of equation A.1. For example, the first term of equation A.1 has the solution

(1/(a − b)) d/dt ∫_{ε>0}^{∞} e^(−σ(at+bT))/σ dσ = −(1/(a − b)) (d(at + bT)/dt) e^(−ε(at+bT))/(at + bT)  (A.6)

= −(a/(a − b)) e^(−ε(at+bT))/(at + bT).  (A.7)

The other terms of equation A.1 can be solved in the same way, and we arrive at equation 2.24.
Acknowledgments

We thank Tomas Kulvicius and Maria Thompson for running the open-loop and closed-loop simulations, respectively. We thank Christoph Kolodziejski, David Murray Smith, Nicholas Bailey, John Williamson, John O'Reilly, and Vi Romanes for their constructive feedback. We acknowledge
the support of the European Commission, IP-Project "PACO-PLUS" (IST-FP6-IP-027657), and the BMBF, BCCN-Göttingen, Proj. W3.
References

Balkenius, C., & Morén, J. (1998). Computational models of classical conditioning: A comparative study (Tech. Rep.). Lund: Lund University.
Beninger, R., & Gerdjikov, T. (2004). The role of signaling molecules in reward-related incentive learning. Neurotoxicity Research, 6(1), 91–104.
Brooks, R. A. (1989). How to build complete creatures rather than isolated cognitive simulators. In K. VanLehn (Ed.), Architectures for intelligence (pp. 225–239). Mahwah, NJ: Erlbaum.
Clark, G. A., & Kandel, E. R. (1984). Branch-specific heterosynaptic facilitation in aplysia siphon sensory cells. Proc. Natl. Acad. Sci. (USA), 81(8), 2577–2581.
Durstewitz, D., Seamans, J. K., & Sejnowski, T. J. (2000). Neurocomputational models of working memory. Nature Neurosci. (Suppl.), 3, 1184–1191.
Fellous, J. M., & Suri, R. E. (2002). The roles of dopamine. In M. A. Arbib (Ed.), The handbook of brain theory and neural networks (2nd ed.). Cambridge, MA: MIT Press.
Guo-Quing, B., & Poo, M.-M. (1998). Synaptic modifications in cultured hippocampus neurons. J. Neurosci., 18(24), 10464–10472.
Hebb, D. O. (1949). The organization of behavior: A neuropsychological study. New York: Wiley-Interscience.
Humeau, Y., Shaban, H., Bissière, S., & Lüthi, A. (2003). Presynaptic induction of heterosynaptic associative plasticity in the mammalian brain. Nature, 426(6968), 841–845.
Kelley, A. E. (2004). Ventral striatal control of appetitive motivation: Role in ingestive behaviour and reward-related learning. Neurosci. and Biobehav. Reviews, 27, 765–776.
Klopf, A. H. (1986). A drive-reinforcement model of single neuron function. In J. S. Denker (Ed.), Neural networks for computing: Snowbird, Utah. New York: American Institute of Physics.
Klopf, A. H. (1988). A neuronal model of classical conditioning. Psychobiol., 16(2), 85–123.
Kosco, B. (1986). Differential Hebbian learning. In J. S. Denker (Ed.), Neural networks for computing: Snowbird, Utah (pp. 277–282). New York: American Institute of Physics.
Linsker, R. (1988). Self-organisation in a perceptual network. Computer, 21(3), 105–117.
Markram, H., Lübke, J., Frotscher, M., & Sakmann, B. (1997). Regulation of synaptic efficacy by coincidence of postsynaptic APs and EPSPs. Science, 275, 213–215.
Memo, M., Lovenberg, W., & Hanbauer, I. (1982). Agonist-induced subsensitivity of adenylate cyclase coupled with a dopamine receptor in slices from rat corpus striatum. Proc. Natl. Acad. Sci. USA, 79, 4456–4460.
Miller, J. D., Sanghera, M. K., & German, D. C. (1981). Mesencephalic dopaminergic unit activity in the behaviorally conditioned rat. Life Sci., 29, 1255–1263.
2718
¨ otter ¨ B. Porr and F. Worg
Mirenowicz, J., & Schultz, W. (1994). Importance of unpredictability for reward responses in primate dopamine neurons. J. Neurophysiol., 72(2), 1024–1027. Oja, E. (1982). A simplified neuron model as a principal component analyzer. J. Math. Biol., 15(3), 267–273. Pavlov, I. (1927). Conditional reflexes. New York: Oxford University Press. Phillips, C. L. (2000). Feedback control systems. Upper Saddle River, NJ: Prentice Hall. ¨ otter, ¨ Porr, B., von Ferber, C., & Worg F. (2003). ISO-learning approximates a solution to the inverse-controller problem in an unsupervised behavioural paradigm. Neural Comp., 15, 865–884. ¨ otter, ¨ Porr, B., & Worg F. (2003a). Isotropic sequence order learning. Neural Comp., 15, 831–864. ¨ otter, ¨ Porr, B., & Worg F. (2003b). Isotropic sequence order learning in a closed loop behavioural system. Roy. Soc. Phil. Trans. Math., Phys. & Eng. Sciences, 361(1811), 2225–2244. ¨ otter, ¨ Porr, B., & Worg F. (2004). Analytical solution of spike-timing dependent plas¨ L. Saul, & B. Scholkopf ¨ ticity based on synaptic biophysics. In S. Thrun, (Eds.), Advances in neural information processing systems. 16. Cambridge, MA: MIT Press. ¨ otter, ¨ Porr, B., & Worg F. (2006). Strongly improved stability and faster convergence of temporal sequence learning by utilizing input correlations only. Neural Comp., 18(6), 1380–1412. Reynolds, J. N., & Wickens, J. R. (2002). Dopamine dependent plasticity of corticostriatal synapses. Neural Networks, 15, 507–521. ¨ otter, ¨ Saudargiene, A., Porr, B., & Worg F. (2004). How the shape of pre- and postsynaptic signals can influence STDP: A biophysical model. Neural Comp., 16, 595–626. Schultz, W. (1998). Predictive reward signal of dopamine neurons. J. Neurophysiol., 80(1), 1–27. Schultz, W., Dayan, P., & Montague, P. R. (1997). A neural substrate of prediction and reward. Science, 275, 1593–1599. Schultz, W., & Suri, R. E. (2001). Temporal difference model reproduces anticipatory neural activity. Neural Comp., 13(4), 841–862. Sutton, R. (1988). Learning to predict by method of temporal differences. Machine Learning, 3(1), 9–44. Sutton, R., & Barto, A. (1981). Towards a modern theory of adaptive networks: Expectation and prediction. Psychological Review, 88, 135–170. Sutton, R. S., & Barto, A. (1990). Time-derivative models of Pavlovian reinforcement. In M. Gabriel & J. Moore (Eds.), Learning and computational neuroscience (pp. 497– 537). Cambridge, MA: MIT Press. Verschure, P., & Voegtlin, T. (1998). A bottom-up approach towards the acquisition, retention, and expression of sequential representations: Distributed adaptive control III. Neural Networks, 11, 1531–1549. Verschure, P. F. M. J., Voegtlin, T., & Douglas, R. J. (2003). Environmentally mediated synergy between perception and behaviour in mobile robots. Nature, 425, 620–624. Watkins, C. J., & Dayan, P. (1992). Q-learning. Machine Learning, 8, 279–292. Wickens, J. R., Begg, A. J., & Arbuthnott, G. W. (1996). Dopamine reverses the depression of rat corticostriatal synapses which normally follows high-frequency stimulation of cortex in vitro. Neurosci., 70, 1–5.
Using a Third Factor to Stabilize Hebbian Learning
2719
¨ otter, ¨ Worg F., & Porr, B. (2005). Temporal sequence learning, prediction and control— a review of different models and their relation to biological mechanisms. Neural Comp., 17, 245–319. Xie, X., & Seung, S. (2000). Spike-based learning rules and stabilization of persistent ¨ neural activity. In S. A. Solla, T. K. Leen, & K.-R. Muller (Eds.), Advances in neural information processing systems, 12 (pp. 199–208). Cambridge, MA: MIT Press.
Received May 25, 2006; accepted October 9, 2006.
LETTER
Communicated by Markus Diesmann
Synchrony-Induced Switching Behavior of Spike Pattern Attractors Created by Spike-Timing-Dependent Plasticity

Takaaki Aoki
[email protected]
Graduate School of Informatics, Kyoto University, Kyoto 606-8501, Japan, and CREST, JST, Kyoto 606-8501, Japan
Toshio Aoyagi
[email protected]
Graduate School of Informatics, Kyoto University, Kyoto 606-8501, Japan
Although context-dependent spike synchronization among populations of neurons has been experimentally observed, its functional role remains controversial. In this modeling study, we demonstrate that in a network of spiking neurons organized according to spike-timing-dependent plasticity, an increase in the degree of synchrony of a uniform input can cause transitions between memorized activity patterns in the order presented during learning. Furthermore, context-dependent transitions from a single pattern to multiple patterns can be induced under appropriate learning conditions. These findings suggest one possible functional role of neuronal synchrony in controlling the flow of information by altering the dynamics of the network.

1 Introduction

Although synchronous activity associated with behavior and cognition has been observed in many neuronal systems (Gray, König, Engel, & Singer, 1989; Riehle, Grün, Diesmann, & Aertsen, 1997; Fries, Reynolds, Rorie, & Desimone, 2001), its functional role remains unclear (Engel, Fries, & Singer, 2001; Salinas & Sejnowski, 2001; Shadlen & Movshon, 1999; Ermentrout & Kleinfeld, 2001). It is reasonable to assume that such generated synchronous spikes drive certain cortical networks as input signals and thereby affect their functions (Brody & Hopfield, 2003; Diesmann, Gewaltig, & Aertsen, 1999). Another related phenomenon is spike-timing-dependent plasticity (STDP) (Bi & Poo, 1998; Zhang, Tao, Holt, Harris, & Poo, 1998; Debanne, Gähwiler, & Thompson, 1998; Markram, Lübke, Frotscher, & Sakmann, 1997; Song, Miller, & Abbott, 2000; Rubin, Lee, & Sompolinsky, 2001), which allows cortical networks to learn the causality of experienced events through the coding of the temporal structures of neuronal activity. Considering that both phenomena affect the functioning of cortical neurons, it is
natural to ask what effect synchronous inputs have on a neural network organized under STDP learning. For this purpose, let us consider the specific situation in which a model network of spiking neurons, whose synaptic connections are modified through STDP, receives an activity pattern as an external stimulus as a result of experiencing some external events. In general, these activity patterns possess some systematic temporal structure reflecting the causality of these external events (Hebb, 1949; Dan & Poo, 2004; Izhikevich, 2006). In this study, we demonstrate that a network organized under STDP not only is capable of memorizing the activity patterns of the external stimulus, but also exhibits a systematic transition behavior among the memorized patterns in response to uniform external synchronized spikes.

First, in section 2, we describe the structure of the network model and the learning process of STDP. Next, we investigate the behavior of the network organized under STDP for the cases of asynchronous (see section 3.1) and transiently synchronized spike inputs (see section 3.2). From these results, and in order to understand the essential mechanism of the switching behavior of the network, in section 3.3 we develop a stability analysis of the memorized patterns in a simplified version of the network model. In section 3.4, we also demonstrate that a context-dependent switching behavior can be achieved under appropriate learning conditions. Finally, in section 4, we discuss the mechanism of the synchrony-induced switching behavior of the network from the point of view of dynamical systems. In relation to recent experimental findings, we discuss the putative functional roles of neuronal synchrony.

2 Model

Figure 1A presents a schematic illustration of the model network we consider, in which leaky integrate-and-fire neurons are recurrently connected by excitatory synapses whose coupling strengths are modified according to the STDP rule (Song et al., 2000; van Rossum, Bi, & Turrigiano, 2000; Cateau & Fukai, 2003). Thus, if a presynaptic spike and a postsynaptic spike occur at times t_pre and t_post, the peak synaptic conductance g is modified by the addition of the value of the STDP window function F(t_pre − t_post), as shown in Figure 1B. The synaptic conductance g is restricted to lie within the range 0 to g^E_max; if the STDP rule would take g outside this range, it is reset to the appropriate limiting value. In addition, a globally uniform inhibition, not subject to learning, is included in an all-to-all manner (see appendix A for a full description of the simulation). We employ two types of controllable external inputs. One is a stimulus input, through which an initial stimulus pattern and a training stimulus pattern are presented during a trial and a learning session, respectively. For learning, we use a simple training stimulus pattern, as depicted in Figure 1C. This pattern is divided into three parts consisting of firing patterns referred to as A, B, and C.
Figure 1: (A) Schematic diagram depicting the structure of the network model with spike-timing-dependent plasticity (STDP). Each leaky integrate-and-fire neuron receives two external inputs. One is a stimulus input, which is a neuron-specific current used to set a given firing pattern in the network for the learning and initial stimulus. The other is an activation input, which projects to all neurons uniformly. For the activation input, there are two modes: asynchronous and synchronous. Within the network, each neuron is connected reciprocally by an excitatory synaptic coupling, whose strength is modified according to the STDP rule. In addition, uniform all-to-all inhibitory connections are included. (B) The STDP window function. (C) The activity pattern presented by the stimulus input during learning. Each dot represents a spike presented in the corresponding stimulus input. This whole pattern consists of three basic firing patterns in the fixed order A,B,C,A,B,C · · ·.
Each pattern is composed of a particular set of active neurons. The neurons that are active for a given pattern fire periodically, and in each pattern there are certain fixed phase relationships among the active neurons. During learning, these three patterns are presented as the stimulus input (see Figure 1A) in the fixed order (A,B,C,A,B,C · · ·). This can be regarded as representing a certain external sequence of events that the network is to learn; in other words, the causality of certain external events. The other type of controllable input is an activation input, which projects to all neurons uniformly. This input is introduced to examine the effect
of synchrony on the neuronal dynamics. This uniform background input serves to activate the entire network and allows each neuron to be in a firing state under suitable conditions. There are two modes of activity for the activation input: asynchronous and synchronous. In the asynchronous mode, spike trains are randomly generated by a Poisson process. During learning, the activation input is always in the asynchronous mode. In the synchronous mode, some of the input spike trains fire synchronously, while the others remain in the asynchronous firing state; the fraction of synchronously firing trains represents the degree of synchrony. To remove the influence of firing-rate modulation, the average firing rate is set to the same constant value, 25 Hz, in both modes.

3 Results

3.1 Associative Memory Behavior. The main question of interest in this study is the following: after the STDP learning process described above has been completed, what activity pattern does the resultant network exhibit? To answer this question, we first examine the case in which the activation input is set in the asynchronous mode, with the level of the total current such that the neurons are in an active state in response to an appropriate stimulus input and maintain an active state through the recurrent excitatory synapses organized under STDP. In the following simulations, the synaptic weights are fixed after learning. This condition is reasonable in a situation in which the dynamics of the synaptic modification is much slower than that of the spiking activity.

For our model under these conditions, Figure 2A illustrates some typical activity patterns displayed by the network (top), the stimulus input (middle), and the activation input (bottom) as raster plots. In each case, a stimulus input that is close to one of the learned patterns is applied for only a brief initial time interval. The activity pattern then rapidly converges to the closest learned pattern. Without an input stimulus, this recalled activity pattern is sustained by the excitatory synaptic connections organized through the learning process and suppresses the other memorized patterns through the all-to-all inhibitory connections. Therefore, if the wrong neurons fire accidentally, the network activity is corrected and maintained in a robust manner. Such behavior is of the same type as that typically seen in systems exhibiting associative memory dynamics, like the Hopfield model (Hopfield, 1982), except that in the present case the retrieval dynamics contain information not only about the firing rate but also about the temporal structure of each learned pattern.

3.2 Synchrony-Induced Switching Behavior. The situation described above is one in which the system recalls one of the three learned patterns and remains in the corresponding state indefinitely.
However, in the learning process the network learned these three patterns not independently but in the particular order depicted in Figure 1C. This large-scale structure can be interpreted as reflecting the causality of some sequence of external events. Because these transitions themselves are included in the training pattern, they too are encoded in the synaptic matrix (see Figure 2B). However, because during learning the individual patterns A, B, and C are presented many times in succession, while the transitions between them appear much less often, the network has much less experience with the transitions. Hence, the network does not learn the transitions as well as the individual patterns, and the synaptic couplings corresponding to the transitions among patterns are relatively weak. In fact, we can see clearly that three diagonal blocks of major synaptic connections, formed by the three basic stimulus patterns (A, B, C), enable the network to retrieve each pattern in an associative manner. In addition, there are three off-diagonal blocks of weak synaptic connections arising from the less frequent transitions among the stimulus patterns, as shown in Figure 1C. Because the synaptic connections for transitions are relatively weak, under ordinary conditions each individual pattern is sufficiently stable that no transition among patterns occurs.

Is there any situation in which these weak synapses, which arise through STDP from the pattern transitions presented during learning, exert a substantial effect on the dynamics of the network? Interestingly, Figure 2C demonstrates that a brief period of synchrony in the uniform activation input can enhance this weak effect embedded in the synaptic matrix and thereby cause a transition from one pattern to another. Let us describe the process depicted in that figure. First, the network exhibits pattern A, which is stable when the activation input is in the asynchronous mode. Thus, in this
Figure 2: Typical effect of a uniformly synchronized spike input on the network of spiking neurons organized under the STDP learning rule. (A) Typical activity patterns (top raster plots) in the case that the activation input is in the asynchronous mode (bottom raster plots). In response to the brief presentation of the initial firing pattern of the stimulus input (middle raster plots), the network rapidly converges to the learned pattern that is most similar to this stimulus input. (B) Gray-scale plot of the normalized strengths of excitatory synapses between neurons after the STDP learning. (C) Synchrony-induced switching behavior of the network realized through the STDP learning rule. The situation here is the same as that depicted in Figure 2A, except that in this case, the mode of the activation input temporarily changes from asynchrony to synchrony for the time interval indicated by the double-arrowed lines. In response to this brief synchronous activation input, the activity pattern becomes unstable and, consequently, a transition to the next pattern occurs. (D) The same kind of synchrony-induced switching behavior also occurs in the case of more general patterns.
case, the network exhibits ordinary associative memory. However, when the activation input is switched for a brief time to the synchronous mode, a transition from pattern A to pattern B occurs. Hence, the retrieval of a learned sequence in the presented order can be triggered by globally uniform synchronous inputs. The off-diagonal weak synaptic connections shown in Figure 2B have no effect on the retrieval dynamics under ordinary conditions, as demonstrated in Figure 2A. However, the effect of these weak synapses can be enhanced by synchronous spike inputs, as shown in Figure 2C.

Note that in the situation studied here, the sets of neurons that are active in patterns A, B, and C are mutually exclusive. This is a somewhat severe restriction. However, we have found that the qualitative nature of these results is the same in the more general case in which some of the individual neurons are allowed to be active in more than one of the patterns, as long as the number of such neurons is not too large. In fact, we can clearly see from Figure 2D that the same synchrony-induced switching behavior occurs in this case. The top graph shows a 40 × 40 dot image, in which the gray level of a dot represents the average firing rate over 200 ms (black underbars). The condition that a relatively small number of neurons are active in multiple patterns is biologically reasonable, because the mean firing rate of memory patterns is usually low in biological systems (a situation referred to as sparse coding) (Willshaw, Buneman, & Longuet-Higgins, 1969). In addition, it is important to note that many types of observed synchronous activity accompanying cognition and behavior (Gray et al., 1989; Riehle et al., 1997; Fries et al., 2001) are transient, and there have been a number of theoretical studies aimed at understanding the related neuronal mechanism (Aoyagi, Takekawa, & Fukai, 2003; Tiesinga & Sejnowski, 2004).

3.3 Stability Analysis of the Retrieval State. Theoretical analysis is helpful for understanding the essential mechanism of synchrony-induced switching, but a full analytical treatment of a synaptic matrix organized through STDP is very difficult. For this reason, we instead consider a simplified version of the synaptic matrix that keeps the essential properties of the original one, given by

\[
g^{E}_{ij} = \frac{g^{E}_{\max}}{N_{P}}\left[\Theta\!\left(\sum_{\mu=1}^{P}\zeta^{\mu}_{i}\zeta^{\mu}_{j}\right) + \epsilon\,\Theta\!\left(\sum_{\mu=1}^{P-1}\zeta^{\mu+1}_{i}\zeta^{\mu}_{j}\right)\right],
\qquad
\Theta(x)=\begin{cases} 1 & (x>0)\\ 0 & (x\le 0),\end{cases} \qquad (3.1)
\]

where P is the total number of stored patterns, and ζ^µ_i represents the state of the ith neuron in the µth pattern, taking only two values: 1 (firing) and 0 (quiescent). N_P (= ⟨Σ_i ζ^µ_i⟩_µ) is the average number of active neurons over all stored patterns.
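For illustration, the following minimal Python sketch (ours, not the authors' code) builds g^E_ij from equation 3.1 for mutually exclusive binary patterns; the pattern sizes are arbitrary placeholders, and ε = 0.3 is taken from the caption of Figure 3.

    import numpy as np

    def simplified_weights(zeta, g_max=1.0, eps=0.3):
        """Build g_ij^E of equation 3.1 from binary patterns zeta (shape (P, N)).

        Co-activity within a pattern produces the strong diagonal blocks of
        weight g_max/N_P; the transitions mu -> mu+1 produce the weaker
        off-diagonal blocks of weight eps*g_max/N_P (eps < 1).
        """
        n_p = zeta.sum(axis=1).mean()           # average active neurons per pattern
        auto = (zeta.T @ zeta) > 0              # Theta(sum_mu zeta_i^mu zeta_j^mu)
        trans = (zeta[1:].T @ zeta[:-1]) > 0    # Theta(sum_mu zeta_i^(mu+1) zeta_j^mu)
        return (g_max / n_p) * (auto + eps * trans)

    # three mutually exclusive patterns, as in Figure 1C (sizes are placeholders)
    P, N = 3, 30
    zeta = np.zeros((P, N))
    for mu in range(P):
        zeta[mu, mu * (N // P):(mu + 1) * (N // P)] = 1.0
    g = simplified_weights(zeta, eps=0.3)

Plotting g as a gray-scale image reproduces the block structure discussed below: strong diagonal blocks for retrieval and weak off-diagonal blocks for the learned transitions.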
Figure 3: (A) The same synchrony-induced switching behavior can be observed with the simplified version of the synaptic matrix described in equation 3.1 (ε = 0.3). (B) Using this simplified model, we theoretically obtained the two-parameter phase diagram showing the dependence of the stability of the learned pattern on the synaptic delay Δ and the relative coupling strength ε of the two terms. The present state, one of the learned patterns, is stable if the point (Δ, ε) is located below the solid curve (dashed curve) for the asynchronous (synchronous) mode of the activation input. In the shaded region, the learned pattern becomes unstable in response to the switching of the mode of the activation input from asynchrony to synchrony. This leads finally to a transition to the next pattern. It is thus seen that the transition-inducing effect of the weak synaptic coupling is enhanced by the synchrony. The parameter values used in Figure 3A are indicated by the +.
This simplified matrix is constructed by summing two terms with relative strength ε: the first term ensures the stable retrieval of each learned basic pattern (Hopfield, 1982; Willshaw et al., 1969; Gerstner & van Hemmen, 1992), and the second is a small term (ε < 1) whose effect is to cause transitions among the learned patterns. The simplified matrix makes the previous model theoretically tractable while providing a reasonable approximation of the original one. Figure 3A shows that the same type of synchrony-induced switching behavior can be observed in this simplified model. Using this model, we examined the effect of changing the mode of the activation input on the stability of the current retrieval pattern (see appendix B). The result is displayed in Figure 3B, where the dependence of the stability on the synaptic delay Δ and the relative coupling strength ε of the two terms is plotted. In the shaded region, the current retrieval pattern is maintained for the asynchronous mode of the activation input, whereas the same pattern is unstable for the synchronous mode. In other words, a change in the stability of the retrieval pattern occurs in response to an asynchrony-synchrony switching of the activation input. This stability analysis does not tell us what spike pattern the network exhibits after the switching. However, considering the effect of the small off-diagonal term in the synaptic weights (see equation 3.1), which tends to cause transitions among the patterns,
Figure 4: (A) Context-dependent switching behavior can be realized by using the appropriate training pattern. The top and bottom raster plots represent the activity patterns in the network and the activation input, respectively. The lengths of the red double-arrowed lines indicate one period of firing pattern A, while the red vertical lines indicate the onset times of the switching from asynchrony to synchrony in the activation input. It is thus seen how the network can be caused to make a transition from one given pattern to either of the other two patterns by appropriately controlling the onset time of the synchronous mode. (B) Timing-selective connections organized under STDP reflecting the nature of the training data. For pattern A, a neuron that fires at an early phase has a weak off-diagonal connection for a transition to pattern B, whereas a neuron that fires at a later phase has a connection for pattern C. The normalized strengths of excitatory synapses between neurons are plotted in gray scale. (C) Interpretation of the synchrony-induced switching behavior from the point of view of dynamical systems. When the network is activated by a uniformly asynchronous spike input, the system possesses some attractors formed by the STDP learning rule (left). However, a brief, uniformly synchronous input activates the paths between attractors, leading to a transition to the next pattern in the learned order (middle). Furthermore, the system can exhibit context-dependent switching behavior, in which the subsequently appearing pattern can be determined by appropriately choosing the position of the orbit at the onset time of the synchronous input, as demonstrated in Figure 4A (right).
it is likely that the state of the network moves near the next learned pattern. Consequently, when the activation input becomes asynchronous again, the network activity converges toward the next learned pattern in an associative manner. For a successful transition among the learned patterns, the duration of the synchronized spike input must be appropriate. In general, the desirable value of this duration (100–200 ms in this case) depends on the model parameters. We find that too short a period cannot trigger the transition to the next pattern, so that the network stays on the same pattern, whereas a period that is too long tends to make the transition uncontrollable. Hence, an appropriate transient synchrony of the activation input is able to activate the effect of the weak coupling and consequently induce a transition from one state to the next in the learned order. In relation to this mechanism, it has been reported that synchrony causes a similar destabilization phenomenon in the case of a winner-take-all competition (Lumer, 2000), and in conventional neural networks, a change in the uniform external field can cause a transition between memorized patterns under suitable conditions (Amit, 1988).

3.4 Context-Dependent Switching Behavior. The synchrony-induced switching mechanism realized in a model with STDP has a further ability: it allows the network to encode context-dependent associations set by the training data. Figure 4A demonstrates that the network is able to make a transition from pattern A to either pattern B or pattern C, depending on the onset time of the synchronous mode. When the mode of the activation input changes from asynchrony to synchrony early in the period of pattern A, a transition from pattern A to pattern B occurs, whereas when this mode change takes place later in the period, a transition from pattern A to pattern C occurs. This context-dependent behavior is due to the nature of the training data, in which the transition to pattern B (C) occurs during the first (second) half of a period of firing activity of pattern A. Therefore, the difference in the temporal structure of pattern A at the time of the transition organizes a selective synaptic connection among the neurons: the neurons that fire early in pattern A tend to make excitatory connections to the neurons that fire in pattern B, while the neurons at a later phase form connections to the neurons of pattern C (see Figure 4B). Owing to this selective connection organized under STDP, which pattern (B or C) the network makes a transition to depends on the onset time of the external synchronized spikes or, equivalently, on the corresponding temporal structure of pattern A. In summary, for the context-dependent switching described here to appear, it is necessary that the activity pattern have a temporal structure that varies periodically in time. The temporal firing pattern exhibited by the network depends on the onset time of the synchronous mode, and this dependence endows the network with the ability to make a transition from a single pattern to multiple patterns in a context-dependent manner through STDP learning.
4 Conclusion

We have demonstrated that a network organized under STDP is capable of exhibiting synchrony-induced switching behavior, in which a transition from one firing state to another occurs in response to brief, uniform synchronous spike inputs. In the case of uniformly random spike inputs, the firing state converges to the learned pattern that is closest to the external cue stimulus. A transition of the activation input from asynchrony to synchrony can then induce a transition from one pattern to the next, in the order in which the learned patterns were presented during training. From the perspective of dynamical systems, in the asynchronous mode the network possesses certain attractors acquired through STDP learning, as shown in Figure 4C (left). The weak synaptic connections formed through the influence of the transitions among training patterns have no effect on the retrieval dynamics in the asynchronous mode. However, when the network is subject to a uniformly synchronous input, the dynamical properties of the network change drastically. This brief synchronous input activates the paths corresponding to the transitions among individual patterns that were weakly encoded in the synaptic matrix during the learning process, leading to a transition between patterns in the order in which they were presented. This result suggests that synchronous spikes may act as a signal in biological systems, serving to link learned sequences of actions in response to external stimuli (Riehle et al., 1997; Lee, 2003). Furthermore, the network has the ability to carry out more complicated tasks, as demonstrated in Figure 4A. As shown there, the attractor corresponding to the pattern appearing after the transition can be selected in a context-dependent manner by controlling the position of the orbit on the attractor at the time the uniform synchronous input is applied (see Figure 4C). Computationally, the mechanism studied here provides a sophisticated method by which a neuronal system can carry out a task in a context-dependent manner. It has been pointed out that although the amount of information encoded by correlated activity may be small, coherent neuronal activity is observed during particular cognitive and behavioral tasks (Salinas & Sejnowski, 2001; Averbeck & Lee, 2004). Our results suggest that one possible functional role of neuronal synchrony is to control the flow of information by changing the nature of the dynamical system constituted by a set of neuronal circuits. We believe that some experimental results can be reinterpreted more clearly in the light of our results.
Appendix A: Detailed Model

A.1 Dynamics. In the network of spiking neurons we employ (see Figure 1A), the dynamics of each neuron are described by a leaky integrate-and-fire model,
\[
\tau_m \frac{dV_i(t)}{dt} = -\left(V_i(t) - V_{\mathrm{rest}}\right) + R_m\left[I^{E}_{i}(t) + I^{I}(t) + I^{\mathrm{Act}}(t) + I^{\mathrm{Stim}}_{i}(t)\right], \qquad (\mathrm{A.1})
\]
with τ_m = 20 ms, V_rest = −65 mV, and R_m = 100 MΩ. Whenever V_i(t) reaches the threshold voltage, V_th = −50 mV, a spike is generated, and V_i(t) is instantaneously reset to the resting potential, V_rest. The synaptic currents due to excitatory and inhibitory connections are given by

\[
I^{E}_{i}(t) = \sum_{j}^{N}\sum_{n} g_{ij}\,\alpha(t - t^{n}_{j} - \Delta),
\qquad
I^{I}(t) = -g^{I}\sum_{j}^{N}\sum_{n} \alpha(t - t^{n}_{j} - \Delta),
\]

respectively. Here, each spike-mediated postsynaptic potential is expressed in terms of the α function, given by α(t) = (t/τ²) exp(−t/τ) for t ≥ 0 and α(t) = 0 elsewhere, where Δ is the synaptic delay and τ is a time constant. We set Δ = 1 ms, and τ = 4 ms for excitatory and 2 ms for inhibitory synapses. The values t^n_j (n = 1, 2, · · ·) are the spike times of the jth presynaptic neuron. The inhibitory synaptic weight is uniform, g^I = 21 nS, while the excitatory weights g_ij range from 0 to g^E_max = 30 nS according to the STDP learning (see Figure 1B). We studied networks consisting of N = 1500 (see Figure 2A) and N = 1600 neurons (see Figure 2C).

The stimulus current is given by I^Stim_i(t) = g^Stim Σ_n α(t − t^n_{i(Stim)}), where the t^n_{i(Stim)} (n = 1, 2, · · ·) are the times at which the ith neuron receives a spike from the stimulus input. The uniform current from the activation input is given by I^Act(t) = (g^Act/N_Act) Σ_k Σ_n α(t − t^n_{k(Act)}), where the t^n_{k(Act)} (n = 1, 2, · · ·) are the spike times of the kth spike train in the activation input. The parameter values g^Stim = 1000 nS and g^Act/N_Act = 5.1 nS are used, and the total number of spike trains in the activation input is N_Act = 1400. In the asynchronous mode, each spike train of the activation input is generated by a Poisson process with frequency 25 Hz. In the synchronous mode, some of the trains fire synchronously, while the others remain asynchronous; the fraction of synchronized trains represents the degree of synchrony (here ρ = 0.6). In addition, these "synchronized" trains are not perfectly synchronized but exhibit some temporal jitter, generated from a gaussian distribution with σ = 3 ms. In the case of the simplified model (see Figure 3A), we use the same parameter values, except for the degree of synchrony ρ = 0.3, the maximum excitatory synaptic weight g^E_max = 4.5 µS, and the synaptic time constant τ = 4 ms for both excitatory and inhibitory synapses.
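To make the input construction concrete, the following is a minimal Python sketch (ours, not the authors' simulation code) of the α function and of the two activation-input modes, using the parameter values quoted above (N_Act = 1400 trains, 25 Hz, ρ = 0.6, σ = 3 ms); the random seed and the duration are arbitrary choices, and the periodic placement of the common spikes follows the period T^Act introduced in appendix B.

    import numpy as np

    rng = np.random.default_rng(0)

    def alpha_psp(t, tau=4.0):
        """alpha function (t/tau^2) exp(-t/tau) for t >= 0, else 0 (times in ms)."""
        t = np.asarray(t, dtype=float)
        return np.where(t >= 0.0, (t / tau**2) * np.exp(-t / tau), 0.0)

    def activation_trains(n_trains=1400, rate=25.0, duration=1000.0,
                          synchronous=False, rho=0.6, sigma=3.0):
        """Spike times (ms) for the activation input, one array per train.

        Asynchronous mode: independent Poisson trains at `rate` Hz.
        Synchronous mode: a fraction rho of the trains fires near common,
        periodic times with gaussian jitter sigma; the rest stay Poisson,
        so the mean rate is the same 25 Hz in both modes.
        """
        trains = []
        period = 1000.0 / rate                  # 40 ms between common spikes
        common = np.arange(0.0, duration, period)
        for k in range(n_trains):
            if synchronous and k < int(rho * n_trains):
                t = common + rng.normal(0.0, sigma, size=common.size)
            else:
                n = rng.poisson(rate * duration / 1000.0)
                t = rng.uniform(0.0, duration, size=n)
            trains.append(np.sort(t))
        return trains

Summing alpha_psp(t − t_spike − delay) over the spikes of a train then gives the corresponding synaptic current up to the conductance prefactor.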
A.2 STDP Learning. For each pair of pre- and postsynaptic spikes, the corresponding synaptic conductance is modified by g → g + F(t_pre − t_post). The STDP window function is given by

\[
F(\Delta t) = \begin{cases} A_{+}\,e^{\Delta t/\tau} & (\Delta t \le 0)\\ -A_{-}\,e^{-\Delta t/\tau} & (\Delta t > 0), \end{cases}
\qquad \Delta t = t_{\mathrm{pre}} - t_{\mathrm{post}}, \qquad (\mathrm{A.2})
\]
where τ is the time constant of the exponential decay, which determines the effective time range of the STDP window function. The synaptic weights are updated by this rule every time a postsynaptic neuron receives a spike. When the axonal synaptic delay Δ is taken into account, the effective time difference used in the update rule becomes Δt = t_pre + Δ − t_post. To avoid unlimited synaptic growth, the ratio of the negative to the positive area of the STDP window function, A−/A+, must be slightly larger than 1 (we set A−/A+ = 1.05) (Song et al., 2000). In addition, the synaptic conductance is restricted to the range 0 to g^E_max.

We also examined a multiplicative type of STDP rule (Gütig, Aharonov, Rotter, & Sompolinsky, 2003), given by

\[
F(\Delta t) = \begin{cases} A_{+}\,(1-w)^{\mu}\,e^{\Delta t/\tau} & (\Delta t \le 0)\\ -A_{-}\,w^{\mu}\,e^{-\Delta t/\tau} & (\Delta t > 0), \end{cases} \qquad (\mathrm{A.3})
\]

where w = g/g^E_max is the normalized synaptic weight and µ is a parameter controlling the weight dependence of synaptic growth. When µ = 0, this rule reduces to the additive STDP rule of equation A.2. For µ = 0.05, 0.2, and 1.0, the obtained synaptic weight matrices are very similar to the synaptic matrix in Figure 2B, and we confirmed that the synchrony-induced switching behavior can also be observed with these matrices (data not shown).

In STDP learning, a stimulus input is used to present the training stimulus patterns. The training stimulus pattern consists of three patterns (A, B, C in Figure 1C), in which the neurons belonging to a pattern fire with a period of 200 ms, keeping certain phase relationships among their firing times. In the preliminary learning process, each stimulus pattern is presented for 1200 ms in the fixed order shown in Figure 1C. This order can be regarded as reflecting the causality of experienced external events. After this procedure, the network is able to maintain each learned pattern without a persistent stimulus input, but further learning is required to facilitate transitions between the learned patterns. For this purpose, in the main learning process, each stimulus pattern is presented for 200 ms in the fixed order A,B,C,A,B,C,A · · ·. Between the presentations of successive stimulus patterns there is a quiescent interval whose duration is selected randomly from [100 ms, 300 ms]. The use of varying interval times is necessary to ensure that the transition is not associated specifically with a particular interval time, because it is desirable for the network to have the property that the transition can occur at any time within some suitable interval. Typically, this main learning process is repeated several times.
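Both window functions translate directly into code. The sketch below is ours, for illustration only: the absolute amplitude A_+ and the time constant τ are placeholder values, since the text above fixes only the ratio A−/A+ = 1.05.

    import numpy as np

    A_PLUS = 0.005           # placeholder amplitude (only A-/A+ is specified)
    A_MINUS = 1.05 * A_PLUS  # A-/A+ = 1.05, as in the text
    TAU = 20.0               # placeholder decay time constant (ms)

    def stdp_additive(dt):
        """Additive window of equation A.2; dt = t_pre + Delta - t_post (ms)."""
        dt = np.asarray(dt, dtype=float)
        return np.where(dt <= 0.0,
                        A_PLUS * np.exp(dt / TAU),     # pre before post: potentiate
                        -A_MINUS * np.exp(-dt / TAU))  # post before pre: depress

    def stdp_multiplicative(dt, w, mu=0.05):
        """Weight-dependent window of equation A.3, with w = g/g_max in [0, 1]."""
        dt = np.asarray(dt, dtype=float)
        return np.where(dt <= 0.0,
                        A_PLUS * (1.0 - w) ** mu * np.exp(dt / TAU),
                        -A_MINUS * w ** mu * np.exp(-dt / TAU))

After each update, g would be clipped to [0, g_max], reproducing the hard bounds described above.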
In the case depicted in Figure 4A, the learning process is the same, except that the duration of the quiescent interval between two stimulus patterns depends on the next pattern (80 ms for A → B, 180 ms for A → C), and the stimulus presentation order is A,B,A,C,A,B,A,C · · ·. This ensures that the transition property depends on the specific timing within a period of activation pattern A. During learning, the activation input is always set in the asynchronous mode.

A.3 Dot Image Plotting of the Firing Pattern of Neurons. In Figures 2D and 3A, describing the numerical simulations, the active firing pattern of the network is shown as a dot image in which neurons are arranged on grid points, and the gray level of a dot indicates the average firing rate over 200 ms (see Figure 2D) or 100 ms (see Figure 3A), normalized with respect to the maximum firing rate.

Appendix B: Stability Analysis in the Retrieval State

We derive the stability condition for the retrieval state in the simplified model with nonoverlapping learned patterns. In particular, we examine the difference in stability between the two situations in which the mode of the activation input is asynchronous and synchronous. Using the rescaled membrane potential v(t) = V(t) − V_rest, the dynamics of each neuron can be rewritten as

\[
\tau_m \frac{dv_i(t)}{dt} = -v_i(t) + R_m\left[I^{E}_{i}(t) + I^{I}(t) + I^{\mathrm{Act}}(t)\right]. \qquad (\mathrm{B.1})
\]
In the limit of an infinite number of spike trains, N_Act → ∞, the synaptic current produced by the activation input simplifies to

\[
I^{\mathrm{Act}}(t) =
\begin{cases}
\dfrac{g^{\mathrm{Act}}}{T^{\mathrm{Act}}} & \text{(asynchronous)}\\[8pt]
\dfrac{(1-\rho)\,g^{\mathrm{Act}}}{T^{\mathrm{Act}}} + \rho\, g^{\mathrm{Act}} \displaystyle\sum_{n} \alpha\!\left(t + nT^{\mathrm{Act}}\right) & \text{(synchronous)},
\end{cases} \qquad (\mathrm{B.2})
\]

with n an integer,
where T^Act is the period of the synchronized spikes in the activation input. In this analysis, the degree of synchrony is controlled by changing the fraction of synchronized trains ρ (0 ≤ ρ ≤ 1). In other words, ρN_Act spike trains are completely synchronized with no jitter, and (1 − ρ)N_Act spike trains are generated randomly by a Poisson process.
The synaptic currents of the recurrent connections, I^E_i(t) and I^I(t), are given by

\[
I^{E}_{i}(t) = \sum_{j}^{N}\sum_{n} g^{E}_{ij}\,\alpha(t - \Delta - t^{n}_{j}),
\qquad
I^{I}(t) = -g^{I}\sum_{j}^{N}\sum_{n} \alpha(t - \Delta - t^{n}_{j}),
\qquad
g^{I} = \frac{g^{I}_{c}}{N}, \qquad (\mathrm{B.3})
\]

\[
g^{E}_{ij} = \frac{g^{E}_{\max}}{N_{P}}\left[\Theta\!\left(\sum_{\mu=1}^{P}\zeta^{\mu}_{i}\zeta^{\mu}_{j}\right) + \epsilon\,\Theta\!\left(\sum_{\mu=1}^{P-1}\zeta^{\mu+1}_{i}\zeta^{\mu}_{j}\right)\right],
\qquad
\Theta(x)=\begin{cases} 1 & (x>0)\\ 0 & (x\le 0),\end{cases} \qquad (\mathrm{B.4})
\]
where N_P (= ⟨Σ_i ζ^µ_i⟩_µ) is the average number of active neurons over all stored patterns. To evaluate these currents, we have to determine the spike times t^n_j of the active neurons. Typical results of our numerical simulations show that in the asynchronous mode the active neurons tend to fire with a certain period T and independently random phases. In contrast, in the synchronous mode, the neurons tend to fire with a fixed delay θ relative to the incoming synchronized spikes. Under these assumptions, we can simplify the currents I^E_i(t) and I^I(t) in the limit N_P, N → ∞. Substituting these expressions into equation B.1, we obtain

\[
\tau_m \frac{dv_i(t)}{dt} =
\begin{cases}
-v_i(t) + R_m\left[\dfrac{g^{E}_{\max} - g^{I}_{c}}{T} + I^{\mathrm{Act}}(t)\right] & \text{(asynchronous)}\\[10pt]
-v_i(t) + R_m\left[(g^{E}_{\max} - g^{I}_{c})\displaystyle\sum_n \alpha\bigl(t - \Delta - T^{\mathrm{Act}}(n+\theta)\bigr) + I^{\mathrm{Act}}(t)\right] & \text{(synchronous)},
\end{cases} \qquad (\mathrm{B.5})
\]

where T and θ are unknown parameters to be determined. To determine them, we integrate equation B.5 over one spike period of the active neurons, which yields the self-consistent equations
\[
T = \tau_m \log\!\left[\frac{g^{\mathrm{Act}}/T^{\mathrm{Act}} + (g^{E}_{\max} - g^{I}_{c})/T}{g^{\mathrm{Act}}/T^{\mathrm{Act}} + (g^{E}_{\max} - g^{I}_{c})/T - (V_{\mathrm{th}} - V_{\mathrm{rest}})/R_m}\right] \quad \text{(asynchronous)},
\]
\[
\frac{\tau_m\,(V_{\mathrm{th}} - V_{\mathrm{rest}})}{R_m} = \frac{(1-\rho)\,\tau_m\,(1 - e^{-T^{\mathrm{Act}}/\tau_m})\,g^{\mathrm{Act}}}{T^{\mathrm{Act}}} + (g^{E}_{\max} - g^{I}_{c})\,K_{T^{\mathrm{Act}}}(-\Delta) + \rho\, g^{\mathrm{Act}}\,K_{T^{\mathrm{Act}}}(\theta T^{\mathrm{Act}}) \quad \text{(synchronous)}, \qquad (\mathrm{B.6})
\]
with

\[
K_T(s) \equiv e^{-T/\tau_m}\int_{0}^{T} e^{u/\tau_m}\sum_{n}\alpha(u + s + nT)\,du. \qquad (\mathrm{B.7})
\]
If the self-consistent values of T and θ lie within physically meaningful ranges, a solution of equation B.6 for the active neurons exists. In addition to this analysis for the active neurons, one more condition is required for the inactive neurons: an inactive neuron must not fire when it receives synaptic current from the active neurons. In fact, we found numerically that irregular firing of inactive neurons can perturb the current retrieval state and cause instability. Therefore, we need to check that the membrane potential of the inactive neurons stays below the spike threshold. Among the inactive neurons, those that should fire in the next learned pattern have the highest membrane potential, owing to the excitatory connections from the currently active neurons, that is, the second term in equation B.4. It is therefore sufficient to check only the membrane potential of these neurons, which is described by

\[
\tau_m \frac{dv_i(t)}{dt} =
\begin{cases}
-v_i(t) + R_m\left[\dfrac{\epsilon\, g^{E}_{\max} - g^{I}_{c}}{T} + I^{\mathrm{Act}}(t)\right] & \text{(asynchronous)}\\[10pt]
-v_i(t) + R_m\left[(\epsilon\, g^{E}_{\max} - g^{I}_{c})\displaystyle\sum_n \alpha\bigl(t-\Delta-T^{\mathrm{Act}}(n+\theta)\bigr) + I^{\mathrm{Act}}(t)\right] & \text{(synchronous)}.
\end{cases} \qquad (\mathrm{B.8})
\]

To evaluate the maximum of the membrane potential, it is useful to derive a return map of v_i(t), which is calculated by integrating over one spike period of the active neurons. The convergent value of the return map is

\[
v^{\infty}(t') =
\begin{cases}
R_m\left[\dfrac{\epsilon\, g^{E}_{\max} - g^{I}_{c}}{T} + \dfrac{g^{\mathrm{Act}}}{T^{\mathrm{Act}}}\right] & \text{(asynchronous)}\\[10pt]
\dfrac{R_m}{\tau_m\bigl(1-e^{-T^{\mathrm{Act}}/\tau_m}\bigr)}
\left[\dfrac{(1-\rho)\bigl(1-e^{-T^{\mathrm{Act}}/\tau_m}\bigr)\tau_m\, g^{\mathrm{Act}}}{T^{\mathrm{Act}}}
+ \bigl(\epsilon\, g^{E}_{\max} - g^{I}_{c}\bigr) K_{T^{\mathrm{Act}}}\!\bigl((t'-\theta)T^{\mathrm{Act}} - \Delta\bigr)
+ \rho\, g^{\mathrm{Act}}\, K_{T^{\mathrm{Act}}}(t'\,T^{\mathrm{Act}})\right] & \text{(synchronous)},
\end{cases} \qquad (\mathrm{B.9})
\]

where t′ runs from 0 to 1. Finally, the condition on the membrane potential of the inactive neurons is

\[
V_{\mathrm{th}} > V_{\mathrm{rest}} + \max_{t' \in [0,1)} v^{\infty}(t'). \qquad (\mathrm{B.10})
\]
The result of Figure 3B is obtained by calculating equations B.6 and B.10.
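As an illustration of how equation B.6 can be evaluated in practice, the following sketch (ours) solves the asynchronous branch for the firing period T by direct fixed-point iteration; the parameter values, units, and initial guess are placeholders, not the ones used for Figure 3B, and a bracketing root finder could be substituted for the iteration.

    import numpy as np

    def period_asynchronous(g_act, t_act, g_exc, g_inh,
                            tau_m=20.0, r_m=100.0, v_th=-50.0, v_rest=-65.0,
                            t0=50.0, n_iter=200):
        """Fixed-point iteration for the asynchronous branch of equation B.6.

        Solves T = tau_m * log(I / (I - (V_th - V_rest)/R_m)) with the mean
        drive I = g_act/t_act + (g_exc - g_inh)/T, starting from guess t0 (ms).
        """
        T = t0
        for _ in range(n_iter):
            drive = g_act / t_act + (g_exc - g_inh) / T
            excess = drive - (v_th - v_rest) / r_m
            if excess <= 0.0:       # subthreshold drive: no periodic firing solution
                return np.inf
            T = tau_m * np.log(drive / excess)
        return T

Scanning the returned T (and the analogous synchronous condition) over Δ and ε is what produces the stability boundaries shown in Figure 3B.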
Acknowledgments

This work was supported by CREST (Core Research for Evolutional Science and Technology) of the Japan Science and Technology Corporation (JST) and by Grants-in-Aid for Scientific Research from the Ministry of Education, Science, Sports, and Culture of Japan (grant numbers 18047014, 18019019, and 18300079).

References

Amit, D. J. (1988). Neural networks counting chimes. Proc. Natl. Acad. Sci. USA, 85, 2141–2145.
Aoyagi, T., Takekawa, T., & Fukai, T. (2003). Gamma rhythmic bursts: Coherence control in networks of cortical pyramidal neurons. Neural Comput., 15, 1035–1061.
Averbeck, B. B., & Lee, D. (2004). Coding and transmission of information by neural ensembles. Trends in Neurosci., 27, 225–230.
Bi, G. Q., & Poo, M. M. (1998). Synaptic modifications in cultured hippocampal neurons: Dependence on spike timing, synaptic strength, and postsynaptic cell type. J. Neurosci., 18, 10464–10472.
Brody, C. D., & Hopfield, J. J. (2003). Simple networks for spike-timing-based computation, with application to olfactory processing. Neuron, 37, 843–852.
Cateau, H., & Fukai, T. (2003). A stochastic method to predict the consequence of arbitrary forms of spike-timing-dependent plasticity. Neural Comput., 15, 597–620.
Dan, Y., & Poo, M. M. (2004). Spike timing-dependent plasticity of neural circuits. Neuron, 44, 23–30.
Debanne, D., Gähwiler, B. H., & Thompson, S. M. (1998). Long-term synaptic plasticity between pairs of individual CA3 pyramidal cells in rat hippocampal slice cultures. J. Physiol., 507, 237–247.
Diesmann, M., Gewaltig, M. O., & Aertsen, A. (1999). Stable propagation of synchronous spiking in cortical neural networks. Nature, 402, 529–533.
Engel, A. K., Fries, P., & Singer, W. (2001). Dynamic predictions: Oscillations and synchrony in top-down processing. Nat. Rev. Neurosci., 2, 704–716.
Ermentrout, G. B., & Kleinfeld, D. (2001). Traveling electrical waves in cortex: Insights from phase dynamics and speculation on a computational role. Neuron, 29, 33–44.
Fries, P., Reynolds, J. H., Rorie, A. E., & Desimone, R. (2001). Modulation of oscillatory neuronal synchronization by selective visual attention. Science, 291, 1560–1563.
Gerstner, W., & van Hemmen, J. L. (1992). Associative memory in a network of "spiking" neurons. Network: Comput. Neural Syst., 3, 139–164.
Gray, C. M., König, P., Engel, A. K., & Singer, W. (1989). Oscillatory responses in cat visual cortex exhibit inter-columnar synchronization which reflects global stimulus properties. Nature, 338, 334–337.
Gütig, R., Aharonov, R., Rotter, S., & Sompolinsky, H. (2003). Learning input correlations through nonlinear temporally asymmetric Hebbian plasticity. J. Neurosci., 23, 3697–3714.
Hebb, D. O. (1949). The organization of behavior: A neuropsychological theory. New York: Wiley.
Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci. USA, 79, 2554–2558.
Izhikevich, E. M. (2006). Polychronization: Computation with spikes. Neural Comput., 18, 245–282.
Lee, D. (2003). Coherent oscillations in neuronal activity of the supplementary motor area during a visuomotor task. J. Neurosci., 23, 6798–6809.
Lumer, E. D. (2000). Effects of spike timing on winner-take-all competition in model cortical circuits. Neural Comput., 12, 181–194.
Markram, H., Lübke, J., Frotscher, M., & Sakmann, B. (1997). Regulation of synaptic efficacy by coincidence of postsynaptic APs and EPSPs. Science, 275, 213–215.
Riehle, A., Grün, S., Diesmann, M., & Aertsen, A. (1997). Spike synchronization and rate modulation differentially involved in motor cortical function. Science, 278, 1950–1953.
Rubin, J., Lee, D. D., & Sompolinsky, H. (2001). Equilibrium properties of temporally asymmetric Hebbian plasticity. Phys. Rev. Lett., 86, 364–367.
Salinas, E., & Sejnowski, T. J. (2001). Correlated neuronal activity and the flow of neural information. Nat. Rev. Neurosci., 2, 539–550.
Shadlen, M. N., & Movshon, J. A. (1999). Synchrony unbound: A critical evaluation of the temporal binding hypothesis. Neuron, 24, 67–77.
Song, S., Miller, K. D., & Abbott, L. F. (2000). Competitive Hebbian learning through spike-timing-dependent synaptic plasticity. Nat. Neurosci., 3, 919–926.
Tiesinga, P. H., & Sejnowski, T. J. (2004). Rapid temporal modulation of synchrony by competition in cortical interneuron networks. Neural Comput., 16, 251–275.
van Rossum, M. C., Bi, G. Q., & Turrigiano, G. G. (2000). Stable Hebbian learning from spike timing-dependent plasticity. J. Neurosci., 20, 8812–8821.
Willshaw, D. J., Buneman, O. P., & Longuet-Higgins, H. C. (1969). Non-holographic associative memory. Nature, 222, 960–962.
Zhang, L. I., Tao, H. W., Holt, C. E., Harris, W. A., & Poo, M. (1998). A critical window for cooperation and competition among developing retinotectal synapses. Nature, 395, 37–44.
Received April 16, 2006; accepted September 2, 2006.
LETTER
Communicated by Klaus Pawelzik
Competition Between Synaptic Depression and Facilitation in Attractor Neural Networks

J. J. Torres
[email protected]
Institute Carlos I for Theoretical and Computational Physics, and Department of Electromagnetism and Matter Physics, University of Granada, Granada E-18071, Spain
J. M. Cortes
[email protected]
Institute Carlos I for Theoretical and Computational Physics, and Department of Electromagnetism and Matter Physics, University of Granada, Granada E-18071, Spain, and Department of Physics, Radboud University of Nijmegen, 6525 EZ Nijmegen, Netherlands

J. Marro
[email protected]
Institute Carlos I for Theoretical and Computational Physics, and Department of Electromagnetism and Matter Physics, University of Granada, Granada E-18071, Spain

H. J. Kappen
[email protected]
Department of Medical Physics and Biophysics, Radboud University of Nijmegen, 6525 EZ Nijmegen, Netherlands
We study the effect of competition between short-term synaptic depression and facilitation on the dynamic properties of attractor neural networks, using Monte Carlo simulation and a mean-field analysis. Depending on the balance of depression, facilitation, and the underlying noise, the network displays different behaviors, including associative memory and switching of activity between different attractors. We conclude that synaptic facilitation enhances the attractor instability in a way that (1) intensifies the system's adaptability to external stimuli, in agreement with experiments, and (2) favors the retrieval of information with less error during short time intervals.
The current address for J. M. Cortes is Institute for Adaptive and Neural Computation, School of Informatics, University of Edinburgh, Edinburgh, EH1 2QL, U.K.
1 Introduction and Model

Recurrent neural networks are a prominent model for information processing and memory in the brain (Hopfield, 1982; Amit, 1989). Traditionally, these models assume synapses that may change on the timescale of learning but that can be assumed constant during memory retrieval. However, synapses are reported to exhibit rapid time variations, and it is likely that this finding has important implications for our understanding of the way information is processed in the brain (Abbott & Regehr, 2004). For instance, Hopfield-like networks in which synapses undergo rather generic fluctuations have been shown to significantly improve the associative process (Marro, Garrido, & Torres, 1998). In addition, motivated by specific neurobiological observations and their theoretical interpretation (Tsodyks, Pawelzik, & Markram, 1998), activity-dependent synaptic changes that induce depression of the response have been considered (Pantic, Torres, Kappen, & Gielen, 2002; Bibitchkov, Herrmann, & Geisel, 2002). It was shown that synaptic depression induces, in addition to memories as stable attractors, special sensitivity of the network to changing stimuli as well as rapid switching of the activity among the stored patterns (Pantic et al., 2002; Cortes, Garrido, Marro, & Torres, 2004; Marro, Torres, & Cortes, 2007; Torres et al., 2005; Cortes, Torres, Marro, Garrido, & Kappen, 2006). This behavior has been observed experimentally to occur during the processing of sensory information (Laurent et al., 2001; Mazor & Laurent, 2005; Marro, Torres, Cortes, & Wemmenhove, 2006).

In this letter, we present and study networks that are inspired by the observation of certain more complex synaptic changes. That is, we assume that repeated presynaptic activation induces at short times not only depression but also facilitation of the postsynaptic potential (Thomson & Deuchars, 1994; Zucker & Regehr, 2002; Burnashev & Rozov, 2005; Wang et al., 2006). The question of how competition between depression and facilitation affects network performance has not yet been fully addressed. Here we conclude that, as in the case of depression alone (Pantic et al., 2002; Cortes et al., 2006), the system may exhibit up to three different phases or regimes: one with standard associative memory, a disordered phase in which the network lacks this property, and an oscillatory phase in which activity switches between different memories. Depending on the balance between facilitation and depression, novel, intriguing behavior emerges in the oscillatory regime. In particular, as the degree of facilitation increases, the sensitivity to external stimuli is enhanced and the frequency of the oscillations increases. It then follows that facilitation allows information to be recovered with less error, at least during a short interval of time, and can therefore play an important role in short-term memory processes.

We are concerned here with a network of binary neurons. Previous studies have shown that the behavior of such a simple network dynamics agrees qualitatively with the behavior that is observed in more realistic networks,
such as integrate-and-fire neuron models of pyramidal cells (Pantic et al., 2002). Let us consider N binary neurons, s_i = {1, 0}, i = 1, . . . , N, endowed with the probabilistic dynamics

\[
\mathrm{Prob}\{s_i(t+1) = 1\} = \tfrac{1}{2}\left\{1 + \tanh[2\beta h_i(t)]\right\}, \qquad (1.1)
\]
which is controlled by a temperature parameter, T ≡ 1/β (see, e.g., Peretto, 1992; Marro & Dickman, 2005, for details). The function h_i(t) denotes a time-dependent local field, that is, the total presynaptic current arriving at the postsynaptic neuron i. It is determined in the model following the phenomenological description of nonlinear synapses reported in Markram, Wang, and Tsodyks (1998) and Tsodyks et al. (1998), which was shown to capture well the experimentally observed properties of neocortical connections. Accordingly, we assume that

\[
h_i(t) = \sum_{j=1}^{N} \omega_{ij}\, D_j(t)\, F_j(t)\, s_j(t) - \theta_i, \qquad (1.2)
\]
where θ_i is a constant threshold associated with the firing of neuron i, and D_j(t) and F_j(t) are functions, to be determined, that describe the effect on the neuron activity of short-term synaptic depression and facilitation, respectively. We further assume that the weights ω_ij of the connections between the (presynaptic) neurons j and the (postsynaptic) neurons i are static and store a set of patterns of the network activity via the familiar covariance rule,

\[
\omega_{ij} = \frac{1}{N f(1-f)} \sum_{\nu=1}^{P} \left(\xi^{\nu}_{i} - f\right)\left(\xi^{\nu}_{j} - f\right). \qquad (1.3)
\]
Here, ξ^ν = {ξ^ν_i}, with ν = 1, . . . , P, are different binary patterns of average activity ⟨ξ^ν_i⟩ ≡ f. The standard Hopfield model is recovered for F_j = D_j = 1, ∀j = 1, . . . , N. We next implement a dynamics for F_j and D_j following the prescription in Markram et al. (1998) and Tsodyks et al. (1998). A description of varying synapses requires at least three local variables, say x_j(t), y_j(t), and z_j(t), associated with the fractions of neurotransmitters in the recovered, active, and inactive states, respectively. A simpler picture consists in dealing with only the variable x_j(t). This simplification, which seems to describe accurately both interpyramidal and pyramidal-interneuron synapses, corresponds to the fact that the decay time of the postsynaptic current is much shorter than the recovery time for synaptic depression, say τ_rec (Markram
& Tsodyks, 1996) (time intervals are in milliseconds hereafter). Within this approach, one may write

\[
x_j(t+1) = x_j(t) + \frac{1 - x_j(t)}{\tau_{\mathrm{rec}}} - D_j(t)\,F_j(t)\,s_j(t), \qquad (1.4)
\]
where

\[
D_j(t) = x_j(t) \qquad (1.5)
\]

and

\[
F_j(t) = U + (1-U)\,u_j(t). \qquad (1.6)
\]
The interpretation of this ansatz is as follows. Concerning any presynaptic neuron j, the product D_j F_j stands for the total fraction of neurotransmitters in the recovered state that are activated either by incoming spikes, U_j x_j, or by facilitation mechanisms, (1 − U_j) x_j u_j; for simplicity, we assume that U_j = U ∈ [0, 1] ∀j. The additional variable u_j(t) is assumed to satisfy, as in the quantal model of transmitter release in Markram et al. (1998),

\[
u_j(t+1) = u_j(t) - \frac{u_j(t)}{\tau_{\mathrm{fac}}} + U\,[1 - u_j(t)]\,s_j(t), \qquad (1.7)
\]
which describes an increase with each presynaptic spike and a decay to the resting value with relaxation time τ_fac. Consequently, facilitation washes out (u_j → 0, F_j → U) as τ_fac → 0, and each presynaptic spike then uses a fraction U of the available resources x_j(t). The effect of facilitation increases with decreasing U, because a smaller U leaves more neurotransmitters available to be activated by facilitation. Therefore, facilitation is controlled not only by τ_fac but also by U. The Hopfield case with static synapses is recovered by setting x_j = 1 in equation 1.5 and u_j = 0 in equation 1.6 or, equivalently, τ_rec = τ_fac = 0 in equations 1.4 and 1.7. In fact, the latter imply fields h_i(t) = Σ_j ω_ij U s_j(t) − θ_i, so that one may simply rescale both β and the threshold. This phenomenological description of dynamic synaptic changes has already been implemented in attractor neural networks (Bibitchkov et al., 2002; Pantic et al., 2002) for purely depressing synapses between excitatory pyramidal neurons (Tsodyks & Markram, 1997). We are interested here in the consequences of a competition between depression and facilitation, and we shall therefore use T, U, τ_rec, and τ_fac in the following as the relevant control parameters.
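Equations 1.1 to 1.7 translate directly into a parallel discrete-time update. The sketch below is a minimal illustration of ours (not the authors' simulation code); the random seed and any parameter values passed in are placeholders.

    import numpy as np

    rng = np.random.default_rng(1)

    def covariance_weights(xi, f):
        """Static weights of equation 1.3; xi is a (P, N) array of binary patterns."""
        return (xi - f).T @ (xi - f) / (xi.shape[1] * f * (1.0 - f))

    def step(s, x, u, w, theta, beta, U, tau_rec, tau_fac):
        """One parallel update of the network (equations 1.1-1.7)."""
        F = U + (1.0 - U) * u                      # facilitation, equation 1.6
        D = x                                      # depression, equation 1.5
        h = w @ (D * F * s) - theta                # local fields, equation 1.2
        p = 0.5 * (1.0 + np.tanh(2.0 * beta * h))  # firing probability, equation 1.1
        s_new = (rng.random(s.size) < p).astype(float)
        x_new = x + (1.0 - x) / tau_rec - D * F * s    # equation 1.4
        u_new = u - u / tau_fac + U * (1.0 - u) * s    # equation 1.7
        return s_new, x_new, u_new

Note that the synaptic variables are updated from the presynaptic activity at time t, so the whole state (s, x, u) advances synchronously.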
2 Mean-Field Solution

Let us consider the mean activities associated, respectively, with the active and inert neurons in a given pattern ν:

\[
m^{\nu}_{+}(t) \equiv \frac{1}{Nf}\sum_{j\in \mathrm{Act}(\nu)} s_j(t),
\qquad
m^{\nu}_{-}(t) \equiv \frac{1}{N(1-f)}\sum_{j\notin \mathrm{Act}(\nu)} s_j(t). \qquad (2.1)
\]
It follows for the overlap of the network activity with pattern ν that

\[
m^{\nu}(t) \equiv \frac{1}{N f (1-f)} \sum_{i} \left(\xi^{\nu}_{i} - f\right) s_i(t) = m^{\nu}_{+}(t) - m^{\nu}_{-}(t), \qquad (2.2)
\]
∀ν. One may also define the averages of the x_i and u_i over the sites that are active and inert, respectively, in a given pattern ν:

\[
x^{\nu}_{+}(t) \equiv \frac{1}{Nf}\sum_{j\in \mathrm{Act}(\nu)} x_j(t),
\qquad
x^{\nu}_{-}(t) \equiv \frac{1}{N(1-f)}\sum_{j\notin \mathrm{Act}(\nu)} x_j(t),
\]
\[
u^{\nu}_{+}(t) \equiv \frac{1}{Nf}\sum_{j\in \mathrm{Act}(\nu)} u_j(t),
\qquad
u^{\nu}_{-}(t) \equiv \frac{1}{N(1-f)}\sum_{j\notin \mathrm{Act}(\nu)} u_j(t), \qquad (2.3)
\]
∀ν, which describe depression (the x's) and facilitation (the u's), each concerning a subset of neurons, for example, N/2 neurons for f = 1/2. One may solve the model, equations 1.1 to 1.7, in the thermodynamic limit N → ∞ under the standard mean-field assumption s_i ≈ ⟨s_i⟩. This type of approach has recently been reported to satisfactorily explain the behavior of recurrent networks of spiking neurons with short-term depressing synapses (Romani, Amit, & Mongillo, 2006). Here, within this approximation, we shall substitute x_i (u_i) by the mean-field values x^ν_± (u^ν_±) (notice that one expects, and it is confirmed below by comparison with direct simulation results, that the mean-field approximation is accurate away from any possible critical point). The local fields then ensue as
$$h_i(t) = \sum_{\nu=1}^{P}\left(\xi_i^\nu - f\right) M^\nu(t), \qquad (2.4)$$
$$M^\nu(t) \equiv \left[U + (1-U)\,u_+^\nu(t)\right] x_+^\nu(t)\, m_+^\nu(t) - \left[U + (1-U)\,u_-^\nu(t)\right] x_-^\nu(t)\, m_-^\nu(t).$$
Assuming further that patterns are random with mean activity f = 1/2, one obtains the set of dynamic equations:

$$x_\pm^\nu(t+1) = x_\pm^\nu(t) + \frac{1 - x_\pm^\nu(t)}{\tau_{rec}} - \left[U + (1-U)\,u_\pm^\nu(t)\right] x_\pm^\nu(t)\, m_\pm^\nu(t),$$
$$u_\pm^\nu(t+1) = u_\pm^\nu(t) - \frac{u_\pm^\nu(t)}{\tau_{fac}} + U\left[1 - u_\pm^\nu(t)\right] m_\pm^\nu(t),$$
$$m_\pm^\nu(t+1) = \frac{1}{N}\sum_i \left\{1 \pm \tanh\left[\beta\left(M^\nu(t) \pm \sum_{\mu\neq\nu}\epsilon_i^\mu M^\mu(t)\right)\right]\right\},$$
$$m^\nu(t+1) = \frac{1}{N}\sum_i \epsilon_i^\nu \tanh\left[\beta\sum_{\mu}\epsilon_i^\mu M^\mu(t)\right], \qquad (2.5)$$

where $\epsilon_i^\mu \equiv 2\xi_i^\mu - 1$. This is a 6P-dimensional coupled map whose analytical treatment is difficult for large P, but it may be integrated numerically, at least for not too large P. One may also find the fixed-point equations for the coupled dynamics of neurons and synapses; these are

$$x_\pm^\nu = \left[1 + \left(U + (1-U)\,u_\pm^\nu\right)\tau_{rec}\, m_\pm^\nu\right]^{-1},$$
$$u_\pm^\nu = U\tau_{fac}\, m_\pm^\nu \left[1 + U\tau_{fac}\, m_\pm^\nu\right]^{-1},$$
$$2m_\pm^\nu = 1 \pm \frac{2}{N}\sum_i \tanh\beta\left[M^\nu \pm \sum_{\mu\neq\nu}\epsilon_i^\mu M^\mu\right],$$
$$m^\nu = \frac{1}{N}\sum_i \epsilon_i^\nu \tanh\beta\left[\sum_\mu \epsilon_i^\mu M^\mu\right]. \qquad (2.6)$$
The numerical solution of these transcendental equations describes the resulting order as a function of the relevant parameters. Determining the stability of these solutions for α ≡ P/N ≠ 0 is a more difficult task, because it requires linearizing equation 2.5, and the dimensionality diverges in the thermodynamic limit (see, however, Torres, Pantic, & Kappen, 2002). In the next section, we therefore deal with the case α → 0.

3 Main Results

Consider a finite number of stored patterns P, that is, α = P/N → 0 in the thermodynamic limit. In practice, it is sufficient to deal with P = 1 to illustrate the main results (we therefore suppress the index ν hereafter).
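As an illustration, the following Matlab sketch iterates the P = 1 map. The reduction of the activity update to $m_\pm = \frac{1}{2}[1 \pm \tanh(\beta M)]$ at P = 1, as well as all parameter values, are our own assumptions for this sketch, chosen only to probe the fixed-point versus oscillatory regimes.

```matlab
% Sketch: iterating the P = 1 mean-field map (equation 2.5, index nu dropped).
U = 0.1; T = 0.1; beta = 1/T; tau_rec = 8; tau_fac = 10;   % illustrative values
steps = 500;
m = [0.9; 0.1]; x = [1; 1]; u = [0; 0];       % [+; -] components
m_hist = zeros(1, steps);
for t = 1:steps
    F = U + (1 - U)*u;                         % effective release fraction
    M = F(1)*x(1)*m(1) - F(2)*x(2)*m(2);       % field amplitude (eq. 2.4)
    m_new = 0.5*[1 + tanh(beta*M); 1 - tanh(beta*M)];   % assumed P = 1 update
    x = x + (1 - x)/tau_rec - F.*x.*m;         % depression (eq. 2.5)
    u = u - u/tau_fac + U*(1 - u).*m;          % facilitation (eq. 2.5)
    m = m_new;
    m_hist(t) = m(1) - m(2);                   % overlap m = m_+ - m_-
end
plot(m_hist);                                   % settles or oscillates with tau_rec
```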
Let us define the vector of order parameters $y \equiv (m_+, m_-, x_+, x_-, u_+, u_-)$, its stationary value $y_{st}$, given by the solution of equations 2.6, and F, whose components are the functions on the right-hand side of equation 2.5. The stability of equation 2.5 around the steady state of equation 2.6 follows from the first-derivative matrix $D \equiv (\partial F/\partial y)_{y_{st}}$. This is

$$D = \begin{pmatrix}
\bar\beta A_+ & -\bar\beta A_- & \bar\beta B_+ & -\bar\beta B_- & \bar\beta C_+ & -\bar\beta C_- \\
-\bar\beta A_+ & \bar\beta A_- & -\bar\beta B_+ & \bar\beta B_- & -\bar\beta C_+ & \bar\beta C_- \\
-A_+ & 0 & \tau - B_+ & 0 & -C_+ & 0 \\
0 & -A_- & 0 & \tau - B_- & 0 & -C_- \\
D_+ & 0 & 0 & 0 & \tau - E_+ & 0 \\
0 & D_- & 0 & 0 & 0 & \tau - E_-
\end{pmatrix}, \qquad (3.1)$$
where $\bar\beta \equiv 2\beta m_+ m_-$, $A_\pm \equiv [U + (1-U)u_\pm]x_\pm$, $B_\pm \equiv [U + (1-U)u_\pm]m_\pm$, $C_\pm \equiv (1-U)x_\pm m_\pm$, $D_\pm \equiv U(1-u_\pm)$, $\tau \equiv 1 - \tau_{rec}^{-1}$, and $E_\pm \equiv Um_\pm$. After noticing that $m_+ + m_- = 1$, one may numerically diagonalize D and obtain the eigenvalues $\lambda_n$ for a given set of control parameters T, U, $\tau_{rec}$, $\tau_{fac}$. For $|\lambda_n| < 1$ ($|\lambda_n| > 1$), the system is stable (unstable) close to the fixed point. The maximum of $|\lambda_n|$ determines the local stability: for $|\lambda_n|_{max} < 1$, system 2.5 is locally stable, while for $|\lambda_n|_{max} > 1$, there is at least one direction of instability, and the system consequently becomes locally unstable. Therefore, varying the control parameters, one crosses the line $|\lambda|_{max} = 1$ that signals the bifurcation points. The resulting situation is summarized in Figure 1 for specific values of U, T, and $\tau_{fac}$. Equations 2.6 have three solutions, two of which are memory states corresponding to the pattern and antipattern, and the other a so-called paramagnetic state that has no overlap with the memory pattern. The stability of these solutions depends on $\tau_{rec}$. The region $\tau_{rec} > \tau_{rec}^{**}$ corresponds to the nonretrieval phase, where the paramagnetic solution is stable and the memory solutions are unstable; in this phase, the average network behavior has no significant overlap with the stored memory pattern. The region $\tau_{rec} < \tau_{rec}^{*}$ corresponds to the memory phase, where the paramagnetic solution is unstable and the memory solutions are stable; the network then retrieves one of the stored memory patterns. For $\tau_{rec}^{*} < \tau_{rec} < \tau_{rec}^{**}$ (denoted O in the figure), none of the solutions is stable. The activity of the network in this regime keeps moving from the neighborhood of one fixed point to that of the other (the pattern and antipattern in this simple example). This rapid switching behavior is typical of dynamic synapses and does not occur for static synapses. A similar oscillatory behavior was reported in Pantic et al. (2002) and Cortes et al. (2004) for the case of only synaptic depression. A main novelty is that the inclusion of facilitation importantly modifies the phase diagram, as discussed below (see Figure 2).
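The stability test itself is mechanical; the following Matlab sketch builds D from the definitions above and checks $\max|\lambda_n|$. The fixed-point values m, x, u are placeholders and would in practice come from solving equations 2.6.

```matlab
% Sketch of the local stability test: assemble D of equation 3.1 (P = 1) and
% compute the maximum eigenvalue modulus. All numerical values are illustrative.
beta = 10; U = 0.1; tau_rec = 5; tau_fac = 10;
m = [0.9; 0.1]; x = [0.6; 0.98]; u = [0.3; 0.05];   % assumed fixed-point values
bb = 2*beta*m(1)*m(2);                               % beta-bar
A = (U + (1-U)*u).*x;  B = (U + (1-U)*u).*m;
C = (1-U)*x.*m;        Dd = U*(1 - u);  E = U*m;
tau = 1 - 1/tau_rec;
D = [ bb*A(1), -bb*A(2),  bb*B(1),  -bb*B(2),  bb*C(1),  -bb*C(2);
     -bb*A(1),  bb*A(2), -bb*B(1),   bb*B(2), -bb*C(1),   bb*C(2);
      -A(1),    0,        tau-B(1),  0,        -C(1),     0;
       0,      -A(2),     0,         tau-B(2),  0,       -C(2);
       Dd(1),   0,        0,         0,         tau-E(1), 0;
       0,       Dd(2),    0,         0,         0,        tau-E(2) ];
lambda_max = max(abs(eig(D)));   % < 1: locally stable; > 1: locally unstable
```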
Figure 1: The three relevant regions, denoted F, O, and P, depicted through the absolute value of the maximum eigenvalue, $|\lambda_n|_{max}$, of the stability matrix D in equation 3.1, plotted as a function of the recovering time $\tau_{rec}$ for different values of the facilitation time $\tau_{fac}$. Here, $\tau_{fac}$ = 10, 15, 20, and 25 for the different curves, from bottom to top in the F and P regions. The stationary solutions lack any local stability for $\tau_{rec}^{*} < \tau_{rec} < \tau_{rec}^{**}$ (O), and the network activity then undergoes oscillations. The arrows signal $\tau_{rec}^{*}$ and $\tau_{rec}^{**}$ for $\tau_{fac}$ = 10. This graph is for U = 0.1 and T = 0.1.
On the other hand, the phases for $\tau_{rec} < \tau_{rec}^{*}$ (F) and $\tau_{rec} > \tau_{rec}^{**}$ (P) correspond, respectively, to a locally stable regime with associative memory ($m \neq 0$) and a disordered regime without memory ($m \equiv m^1 = 0$).

The values $\tau_{rec}^{*}$ and $\tau_{rec}^{**}$, which, as a function of $\tau_{fac}$, U, and T, determine the limits of the oscillatory phase, correspond to the onset of the condition $|\lambda_n|_{max} > 1$. This condition defines lines in the parameter space ($\tau_{rec}$, $\tau_{fac}$) that are illustrated in Figure 2. This reveals that for a relatively large degree of facilitation, $\tau_{rec}^{*}$ (the separation between the F and O regions) in general decreases with increasing facilitation, which implies a larger oscillatory region and consequently a reduction of the memory phase. This is due to the additional depressing effect produced by strong facilitation, which tends to destabilize the ferromagnetic solutions. On the other hand, $\tau_{rec}^{**}$ (the separation between the O and P regions) in general increases with facilitation, thus further broadening the width of the oscillatory phase, $\delta \equiv \tau_{rec}^{**} - \tau_{rec}^{*}$. Moreover, for relatively small facilitation, the behavior is more complex due to the balance between depression and facilitation: the width of the oscillatory phase increases when depression is dominant and decreases when facilitation becomes more important. In this regime, there is a competition between the tendency of facilitation to stabilize the ferromagnetic solutions and the opposite tendency due to depression, which results in a nonmonotonic behavior of δ. The behavior of δ under different conditions is illustrated in the insets
Figure 2: How the different regimes of the network activity depend on the balance between depression and facilitation. (Top graphs) Phase diagram ($\tau_{rec}$, $\tau_{fac}$) for α = 0 and U = 0.1 at temperature T = 0.1 (left) and 0.05 (right). The dashed (solid) line is for $\tau_{rec}^{*}$ ($\tau_{rec}^{**}$), signaling the first-order (second-order) phase transitions between the O and F (P) phases. The insets show the resulting width of the oscillatory region, $\delta \equiv \tau_{rec}^{**} - \tau_{rec}^{*}$, as a function of $\tau_{fac}$. (Bottom graphs) Phase diagram ($\tau_{rec}$, U) for α = 0, T = 0.1, and $\gamma \equiv \tau_{fac}/\tau_{rec}$ = 1 (left) and 0.25 (right).
of Figure 2. These insets also show that the effects just described are more evident for low levels of noise (e.g., δ is larger for smaller temperature). Also interesting is that facilitation induces changes in the phase diagram as one varies the parameter U, which measures the fraction of neurotransmitters not activated by the facilitating mechanism. In order to discuss this, we define the ratio between the two timescales, $\gamma \equiv \tau_{fac}/\tau_{rec}$, and monitor the phase diagram ($\tau_{rec}$, U) for varying γ. The result is also in Figure 2; see the bottom graphs for γ = 1 (left) and 0.25 (right), which correspond, respectively, to a situation in which depression and facilitation occur on the same timescale and one in which facilitation is four times faster. The two cases exhibit a similar behavior for large U, but they are qualitatively different for small U. In the case of faster facilitation, there is a range of U values for which $\tau_{rec}^{*}$ increases in such a way that one passes from the oscillatory to the memory phase by slightly increasing U. This means that facilitation tries to drive the network activity to one of the attractors ($\tau_{fac} < \tau_{rec}$), and for weak depression (U small), the activity will remain there. Decreasing U further then has the effect of effectively increasing the system temperature, which destabilizes the attractor. This requires only small U because the dynamics 1.7 rapidly drives the second term in $F_j$ to zero.

Figure 3: For U = T = 0.1 as in Figure 1, these graphs illustrate results from Monte Carlo simulations (symbols) and mean-field solutions (curves) for the case of associative memory under competition of depression and facilitation. This shows $m \equiv m^1$ as a function of $\tau_{rec}$ (horizontal axis) and $\tau_{fac}$ (different curves as indicated), corresponding to regimes in which the limiting value $\tau_{rec}^{*}$ decreases (left graph) or increases (right graph) with increasing $\tau_{fac}$, the two situations discussed in the text.

Figure 3 shows the variation with both $\tau_{rec}$ and $\tau_{fac}$ of the stationary, locally stable solution with associative memory, $m \neq 0$, computed this time both in the mean-field approximation and using Monte Carlo simulation. The Monte Carlo simulation consists of iterating equations 1.1, 1.4, and 1.7 using parallel dynamics. This shows a perfect agreement between our mean-field approach above and the simulations as long as one is far from the transition, a fact that is confirmed below (in Figure 5). This is because, near $\tau_{rec}^{*}$, the simulations describe hops between positive and negative m that do not compare well with the mean-field absolute value |m|.

The most interesting behavior is perhaps the one revealed by the phase diagram (T, $\tau_{fac}$) in Figure 4. Here we depict a case with U = 0.1 in order to clearly visualize the effect of facilitation (facilitation has practically no effect for any U > 0.5, as shown above) and $\tau_{rec}$ = 3 ms in order to compare with the situation of only depression in Pantic et al. (2002). A main result here is that for appropriate values of the working temperature T, one may force the system to undergo different types of transitions by simply varying $\tau_{fac}$. First note that the line $\tau_{fac}$ = 0 corresponds roughly to the case of static
Figure 4: Phase diagram (T, $\tau_{fac}$) for U = 0.1 and $\tau_{rec}$ = 3 ms. This illustrates the potentially high adaptability of the network to different tasks (e.g., around T = 0.22) by simply varying its degree of facilitation.
synapses, since $\tau_{rec}$ is very small. In this limit, the transition between the retrieval (F) and nonretrieval (P) phases is at T = U = 0.1. At low enough T, there is a transition from the nonretrieval (P) to the retrieval (F) phase as facilitation is increased. This reveals a positive effect of facilitation on memory at low temperature and suggests an improvement of the network storage capacity, which is usually measured at T = 0, a prediction that we have confirmed in preliminary simulations. At intermediate temperatures, for example, T ≈ 0.22 for U = 0.1, the system shows no memory in the absence of facilitation, but by increasing $\tau_{fac}$, one may induce consecutive transitions to a retrieval phase (F), a disordered phase (P), and then an oscillatory phase (O). The last is associated with a new instability induced by the strong depression effect that follows from a further increase of facilitation. At higher T, facilitation may drive the system directly from complete disorder to an oscillatory regime.

In addition to its influence on the onset and width of the oscillatory region, the competition between depression and facilitation, described by the relation between $\tau_{rec}$ and $\tau_{fac}$, determines the frequency of the oscillations of m. In order to study this effect, we computed the average time between consecutive minima and maxima of these oscillations, that is, a half-period, as a function of $\tau_{fac}$ and for different values of $\tau_{rec}$. The result is illustrated in the left graph of Figure 5. This shows that for relatively large facilitation, the frequency of the oscillations increases with the facilitation time. This means that the access of the network activity to the attractors is faster with increasing facilitation, though the system then remains near each attractor for a shorter time, due to the extra depression that follows strong facilitation. However, for relatively small facilitation ($\tau_{fac}$ < 10 ms), there is competition between a tendency to go to the attractors, which is favored by facilitation,
Figure 5: (Left graph) Half-period of the oscillations as a function of $\tau_{fac}$, for U = T = 0.1 and P = 1, as obtained from the mean-field solution for $\tau_{rec}$ = 12, 10, 8 ms (curves) and from simulations (symbols) for $\tau_{rec}$ = 10 ms. (Right graph) For the same conditions as in the left graph, the maximum of the absolute value of m during the oscillations. The simulation results in both graphs correspond to an average over 10³ peaks of the stationary series for m. The fact that the statistical errors are small confirms the mean-field periodic behavior.
and a tendency to leave the attractor, due to depression. For relatively weak depression (e.g., $\tau_{rec}$ = 8 ms), the activity tends to remain longer near the attractor, while the system takes more time to reach the attractor if facilitation is not too strong. This competition results in a large half-period of the oscillations. The tendency of the half-period to increase ends when facilitation becomes very large, at which point the depressing effect of strong facilitation sets in and the half-period begins to decrease. This complex behavior of the half-period due to the competition between depression and facilitation washes out for strong depression, as illustrated in Figure 5 (left) for $\tau_{rec}$ = 12 ms. We also computed the maximum of m during the oscillations, $|m|_{max}$. This quantity, depicted in the right graph of Figure 5, also increases with $\tau_{fac}$ for strong facilitation. For relatively small facilitation, complex behavior due to the competition again shows up. In this case, the effects are more evident for strong depression (see the curve for $\tau_{rec}$ = 12 ms in Figure 5, right), and there is a small region in which, though the access to stored information is faster (the half-period decreases), the error when one retrieves information increases (|m| decreases), because depression dominates. The overall conclusion is that for relatively small facilitation, there is a nontrivial competition between depression and facilitation that, for particular values of the depression, results in nonmonotonic behavior of the half-period and of the maximum of m during the oscillations. On the other hand, not only is the access to the stored information faster under relatively large facilitation, but increasing facilitation further then helps to retrieve the information with less error.
Figure 6: Time series for the overlap function, m, at T = 0.22 (the horizontal dotted line in Figure 4) as one increases the value of $\tau_{fac}$ in order to visit the different regimes (separated here by vertical lines). The simulations started with $\tau_{fac}$ = 1 at t = 1, and $\tau_{fac}$ was then increased by 10 units every 200 ms. The bottom graph corresponds to a case in which the system is under the action of an external stimulus (as described in the text). The middle graph depicts an individual run of the system without any stimulus, and the top graph corresponds to the average of |m| over 100 independent runs of the unperturbed system.
In order to provide more information on some aspects of the system behavior, we present in Figures 6 and 7 a detailed study of specific time series. The middle graph in Figure 6 corresponds to a simulation of the system evolution for increasing values of $\tau_{fac}$, as one moves along the horizontal line
Figure 7: Spectral analysis of the cases in Figure 6. On top of each of the small panels, which show typical time series for τfac = 2, 20, 50, and 100, respectively, from top to bottom and from left to right, the square panels show the corresponding power spectra. Details of the simulations as in Figure 6.
for T = 0.22 in Figure 4. The system thus visits consecutively the different regions (separated by vertical lines) over time. That is, the simulation starts with the system in the stable paramagnetic phase, denoted P1 in the figure, and then successively moves by varying τfac into the stable ferromagnetic
phase F, into another paramagnetic phase, P2, and, finally, into the oscillatory phase, O. We interpret the observed behavior in P2 as due to competition between the facilitation mechanism, which tries to bring the system to the fixed-point attractors, and the depression mechanism, which tends to destabilize the attractors. The result is a sort of intermittent behavior in which oscillations and convergence to a fixed point alternate in a way that resembles (but is not) chaos. The top graph in Figure 6, which corresponds to an average over independent runs, illustrates the typical behavior of the system in these simulations; the middle graph depicts an individual run. Further interesting behavior is shown in the bottom graph of Figure 6. This corresponds to an individual run in the presence of a very small and irregular external stimulus, represented by the black line around m = 0. The stimulus consists of an irregular series of positive and negative pulses of intensity ±0.03 ξ¹ and duration 20 ms. In addition to a great sensitivity to weak inputs from the environment, this reveals that increasing facilitation tends to significantly enhance the system response.

Figure 7 shows the power spectra of typical time series such as the ones in Figure 6, again obtained by moving along the horizontal line for T = 0.22 in Figure 4 to visit the different regimes. We plot the time series m(t) obtained, respectively, for $\tau_{fac}$ = 2, 20, 50, and 100 and, on top of each of them, the corresponding spectra. This reveals a flat, white-noise spectrum for the P1 phase and for the stable fixed-point solution in the F regime. However, the intermittent P2 phase exhibits a small peak around 65 Hz. The peak is much sharper, and occurs at 70 Hz, in the oscillatory case.

4 Conclusion

We have shown that the dynamic properties of synapses have profound consequences for the behavior, and the possible functional role, of recurrent neural networks. Depending on the relative strength of depression, facilitation, and noise in the network, one observes either attraction toward one of the stored memory patterns, or a nonretrieval situation in which the neurons fire largely at random in a fashion uncorrelated with the patterns, or else switching, where none of the patterns is stable and the network rapidly moves between (the neighborhoods of) all of them. These three behaviors were also observed in our previous work, where we studied the role of depression only. However, the transitions between the different possible phases and their nature depend importantly on the existence of competition between depression and facilitation and on the balance between these and the underlying noise.

Our analysis of the possible phases indicates a clear role for facilitation. The activity goes straight to the attractor when facilitation dominates, so that it is reached rapidly and with very small error. The system then depresses, however, which destabilizes the attractor and expedites the search
for a different one. Even more interesting is the resulting situation when facilitation is relatively small. That is, competition between depression and facilitation may then induce a critical condition in which the activity may eventually become “frozen”: for each degree of depression, there is a value of facilitation at which the time to reach the attractor increases markedly. This behavior is less marked as depression increases; depression serves to avoid such frustration induced by intermediate values of facilitation. This is quantitatively illustrated in Figure 5. On the other hand, under the action of only depression, we know that, provided thermal noise is not too large, the system may describe a transition from the retrieval phase to the oscillating one, and then to the nonretrieval phase, as depression is increased. As shown in detail in Figures 4 and 6, the presence of facilitation may essentially modify this picture. For instance, the particular sequence of phases the system will describe depends crucially on the balance of depression, facilitation, and noise. In summary, facilitation favors a more rapid and precise access to the stored information. It also induces a more complex situation in which its competition with depression seems to allow for more diverse tasks. Interestingly, Wang et al. (2006) have recently reported the existence of depressing and facilitating synapses and their possible competition to form the basis of working memories.

Acknowledgments

This work was supported by the MEyC–FEDER project FIS2005-00791 and the Junta de Andalucía project FQM–165. J.M.C. also acknowledges financial support from the EPSRC-funded COLAMN project Ref. EP/CO 10841/1. We thank Jorge F. Mejías for useful discussions.
References

Abbott, L. F., & Regehr, W. G. (2004). Synaptic computation. Nature, 431, 796–803.
Amit, D. J. (1989). Modeling brain function: The world of attractor neural networks. Cambridge: Cambridge University Press.
Bibitchkov, D., Herrmann, J. M., & Geisel, T. (2002). Pattern storage and processing in attractor networks with short-time synaptic dynamics. Network: Comput. Neural Syst., 13, 115–129.
Burnashev, N., & Rozov, A. (2005). Presynaptic Ca2+ dynamics, Ca2+ buffers and synaptic efficacy. Cell Calcium, 37, 489–495.
Cortes, J. M., Garrido, P. L., Marro, J., & Torres, J. J. (2004). Switching between memories in neural automata with synaptic noise. Neurocomputing, 58–60, 67–71.
Cortes, J. M., Torres, J. J., Marro, J., Garrido, P. L., & Kappen, H. J. (2006). Effects of fast presynaptic noise in attractor neural networks. Neural Comput., 18(3), 614–633.
Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci. USA, 79, 2554–2558.
Laurent, G., Stopfer, M., Friedrich, R. W., Rabinovich, M. I., Volkovskii, A., & Abarbanel, H. D. I. (2001). Odor encoding as an active, dynamical process: Experiments, computation and theory. Annu. Rev. Neurosci., 24, 263–297.
Markram, H. H., & Tsodyks, M. V. (1996). Redistribution of synaptic efficacy between pyramidal neurons. Nature, 382, 807–810.
Markram, H. H., Wang, Y., & Tsodyks, M. V. (1998). Differential signaling via the same axon of neocortical pyramidal neurons. Proc. Natl. Acad. Sci. USA, 95, 5323–5328.
Marro, J., & Dickman, R. (2005). Nonequilibrium phase transitions in lattice models. Cambridge: Cambridge University Press.
Marro, J., Garrido, P. L., & Torres, J. J. (1998). Effect of correlated fluctuations of synapses in the performance of neural networks. Phys. Rev. Lett., 81, 2827–2830.
Marro, J., Torres, J. J., & Cortes, J. M. (2007). Chaotic hopping between attractors in neural networks. Neural Networks, 20, 230–235.
Marro, J., Torres, J. J., Cortes, J. M., & Wemmenhove, B. (2006). Sensitivity, itinerancy and chaos in partly-synchronized weighted networks. Manuscript submitted for publication.
Mazor, O., & Laurent, G. (2005). Transient dynamics versus fixed points in odor representations by locust antennal lobe projection neurons. Neuron, 48, 661–673.
Pantic, L., Torres, J. J., Kappen, H. J., & Gielen, S. C. A. M. (2002). Associative memory with dynamic synapses. Neural Comput., 14, 2903–2923.
Peretto, P. (1992). An introduction to the modeling of neural networks. Cambridge: Cambridge University Press.
Romani, S., Amit, D. J., & Mongillo, G. (2006). Mean-field analysis of selective persistent activity in presence of short-term synaptic depression. J. Comput. Neurosci., 20, 201–217.
Thomson, A. M., & Deuchars, J. (1994). Temporal and spatial properties of local circuits in neocortex. Trends Neurosci., 17, 119–126.
Torres, J. J., Marro, J., Garrido, P. L., Cortes, J. M., Ramos, F., & Muñoz, M. A. (2005). Effects of static and dynamic disorder on the performance of neural automata. Biophysical Chemistry, 115, 285–288.
Torres, J. J., Pantic, L., & Kappen, H. J. (2002). Storage capacity of attractor neural networks with depressing synapses. Phys. Rev. E, 66, 061910.
Tsodyks, M. V., & Markram, H. H. (1997). The neural code between neocortical pyramidal neurons depends on neurotransmitter release probability. Proc. Natl. Acad. Sci. USA, 94, 719–723.
Tsodyks, M., Pawelzik, K., & Markram, H. (1998). Neural networks with dynamic synapses. Neural Comput., 10, 821–835.
Wang, Y., Markram, H., Goodman, P. H., Berger, T. K., Ma, J., & Goldman-Rakic, P. S. (2006). Heterogeneity in the pyramidal network of the medial prefrontal cortex. Nature Neurosci., 9, 534–542.
Zucker, R. S., & Regehr, W. G. (2002). Short-term synaptic plasticity. Annu. Rev. Physiol., 64, 355–405.
Received April 16, 2006; accepted September 17, 2006.
LETTER
Communicated by Yin Zhang
Projected Gradient Methods for Nonnegative Matrix Factorization

Chih-Jen Lin
[email protected]
Department of Computer Science, National Taiwan University, Taipei 106, Taiwan
Nonnegative matrix factorization (NMF) can be formulated as a minimization problem with bound constraints. Although bound-constrained optimization has been studied extensively in both theory and practice, so far no study has formally applied its techniques to NMF. In this letter, we propose two projected gradient methods for NMF, both of which exhibit strong optimization properties. We discuss efficient implementations and demonstrate that one of the proposed methods converges faster than the popular multiplicative update approach. A simple Matlab code is also provided.

1 Introduction

Nonnegative matrix factorization (NMF) (Paatero & Tapper, 1994; Lee & Seung, 1999) is useful for finding representations of nonnegative data. Given an n × m data matrix V with $V_{ij} \ge 0$ and a prespecified positive integer $r < \min(n, m)$, NMF finds two nonnegative matrices $W \in R^{n\times r}$ and $H \in R^{r\times m}$ such that V ≈ WH. If each column of V represents an object, NMF approximates it by a linear combination of r “basis” columns in W. NMF has been applied to many areas, such as finding basis vectors of images (Lee & Seung, 1999), document clustering (Xu, Liu, & Gong, 2003), and molecular pattern discovery (Brunet, Tamayo, Golub, & Mesirov, 2004). Donoho and Stodden (2004) have addressed the theoretical issues associated with the NMF approach. The conventional approach to find W and H is by minimizing the difference between V and WH:

$$\min_{W,H} \quad f(W,H) \equiv \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{m}\left(V_{ij} - (WH)_{ij}\right)^2$$
$$\text{subject to} \quad W_{ia} \ge 0,\ H_{bj} \ge 0,\ \forall i, a, b, j. \qquad (1.1)$$

Neural Computation 19, 2756–2779 (2007)    © 2007 Massachusetts Institute of Technology
Inequalities that bound variables from above and below are referred to as bound constraints; hence, equation 1.1 is a standard bound-constrained optimization problem. We also note that

$$\sum_{i=1}^{n}\sum_{j=1}^{m}\left(V_{ij} - (WH)_{ij}\right)^2 = \|V - WH\|_F^2,$$
where $\|\cdot\|_F$ is the Frobenius norm.

The most popular approach to solve equation 1.1 is the multiplicative update algorithm proposed by Lee and Seung (2001). It is simple to implement and often yields good results. At each iteration of this method, the elements of W and H are multiplied by certain factors. As the zero elements are not updated, all the components of W and H remain strictly positive throughout the iterations. This strategy is contrary to traditional bound-constrained optimization methods, which usually allow iterates to have elements at the bounds (i.e., zero elements in this case). Thus far, no study has formally applied bound-constrained optimization techniques to NMF. This letter investigates such methods in detail.

Some earlier NMF studies require all of W's column sums to be one: $\sum_{i=1}^{n} W_{ia} = 1$, $\forall a = 1, \ldots, r$. The function value does not change because $f(WD, D^{-1}H) = f(W, H)$ for any r × r positive diagonal matrix D. With the inclusion of such additional constraints, equation 1.1 no longer remains a bound-constrained problem. As adding these constraints may complicate the optimization procedures, we do not consider this modification in this study.

Among the existing bound-constrained optimization techniques, the projected gradient method is simple and effective. Although several researchers have used this method for NMF (Hoyer, 2002; Chu, Diele, Plemmons, & Ragni, 2005; Shepherd, 2004), there is neither a systematic study nor an easy implementation comparable to that of the multiplicative update method. This letter presents a comprehensive study on using projected gradient methods for NMF. Several useful modifications lead to efficient implementations. While the multiplicative update method still lacks convergence results, our proposed methods exhibit strong optimization properties. We experimentally show that one of the proposed methods converges faster than the multiplicative update method. This new method is thus an attractive approach to solve NMF. We also provide a complete Matlab implementation.

Another popular NMF optimization formulation minimizes the (generalized) Kullback-Leibler divergence between V and WH (Lee & Seung, 1999):

$$\min_{W,H} \quad \sum_{i=1}^{n}\sum_{j=1}^{m}\left(V_{ij}\log\frac{V_{ij}}{(WH)_{ij}} - V_{ij} + (WH)_{ij}\right)$$
$$\text{subject to} \quad W_{ia} \ge 0,\ H_{bj} \ge 0,\ \forall i, a, b, j.$$
Strictly speaking, this formulation is not a bound-constrained problem, which requires the objective function to be well defined at every point of the bounded region: the log function is not well defined if $V_{ij} = 0$ or $(WH)_{ij} = 0$. Hence, we do not consider this formulation in this study.

This letter is organized as follows. Section 2 discusses existing approaches for solving NMF problem 1.1 and presents several new properties. Section 3 introduces the projected gradient methods for bound-constrained optimization. Section 4 investigates specific but essential modifications for applying the proposed projected gradient methods to NMF. The stopping conditions in an NMF code are discussed in section 5. Experiments on synthetic and real data sets are presented in section 6. The discussion and conclusions are presented in section 7. Appendix B contains the Matlab code of one of the proposed approaches. All source codes used in this letter are available online at http://www.csie.ntu.edu.tw/~cjlin/nmf.

2 Existing Methods and New Properties

There are many existing methods for NMF. Some discussions are in Paatero (1999), but bound constraints are not rigorously handled. A more recent and complete survey is by Chu et al. (2005). This section briefly discusses some existing methods and presents several previously unmentioned observations. To begin, we need certain properties of the NMF problem, equation 1.1. The gradient of the function f(W, H) consists of two parts:

$$\nabla_W f(W,H) = (WH - V)H^T \quad\text{and}\quad \nabla_H f(W,H) = W^T(WH - V), \qquad (2.1)$$
which are, respectively, the partial derivatives with respect to the elements of W and H. From the Karush-Kuhn-Tucker (KKT) optimality condition (e.g., Bertsekas, 1999), (W, H) is a stationary point of equation 1.1 if and only if

$$W_{ia} \ge 0,\quad H_{bj} \ge 0,$$
$$\nabla_W f(W,H)_{ia} \ge 0,\quad \nabla_H f(W,H)_{bj} \ge 0,$$
$$W_{ia}\cdot\nabla_W f(W,H)_{ia} = 0,\quad H_{bj}\cdot\nabla_H f(W,H)_{bj} = 0,\quad \forall i, a, b, j. \qquad (2.2)$$
Optimization methods for NMF produce a sequence $\{(W^k, H^k)\}_{k=1}^{\infty}$ of iterates. Problem 1.1 is nonconvex and may have several local minima. A common misunderstanding is that limit points of the sequence are local minima. In fact, most nonconvex optimization methods guarantee only the stationarity of the limit points. Such a property is still useful, as any local minimum must be a stationary point.
2.1 Multiplicative Update Methods. The most commonly used approach to minimize equation 1.1 is a simple multiplicative update method proposed by Lee and Seung (2001):

Algorithm 1: Multiplicative Update
1. Initialize $W_{ia}^1 > 0$, $H_{bj}^1 > 0$, $\forall i, a, b, j$.
2. For k = 1, 2, . . .
$$W_{ia}^{k+1} = W_{ia}^{k}\,\frac{(V(H^k)^T)_{ia}}{(W^kH^k(H^k)^T)_{ia}},\quad \forall i, a \qquad (2.3)$$
$$H_{bj}^{k+1} = H_{bj}^{k}\,\frac{((W^{k+1})^TV)_{bj}}{((W^{k+1})^TW^{k+1}H^k)_{bj}},\quad \forall b, j. \qquad (2.4)$$
This algorithm is a fixed-point type method: if $(W^kH^k(H^k)^T)_{ia} \neq 0$ and $W_{ia}^{k+1} = W_{ia}^{k} > 0$, then

$$(V(H^k)^T)_{ia} = (W^kH^k(H^k)^T)_{ia} \quad\text{implies}\quad \nabla_W f(W^k, H^k)_{ia} = 0,$$
which is part of the KKT condition, equation 2.2. Lee and Seung (2001) have shown that the function value is nonincreasing after every update:

$$f(W^{k+1}, H^k) \le f(W^k, H^k) \quad\text{and}\quad f(W^{k+1}, H^{k+1}) \le f(W^{k+1}, H^k). \qquad (2.5)$$
They claim that the limit of the sequence $\{(W^k, H^k)\}_{k=1}^{\infty}$ is a stationary point (i.e., a point satisfying the KKT condition, equation 2.2). However, Gonzales and Zhang (2005) indicate that this claim is wrong, as having equation 2.5 may not imply convergence. Therefore, this multiplicative update method still lacks optimization properties.

To have algorithm 1 well defined, one must ensure that the denominators in equations 2.3 and 2.4 are strictly positive. Moreover, if $W_{ia}^k = 0$ at the kth iteration, then $W_{ia} = 0$ at all subsequent iterations. Thus, one should keep $W_{ia}^k > 0$ and $H_{bj}^k > 0$, ∀k. The following theorem discusses when this property holds:

Theorem 1. If V has neither a zero column nor a zero row, and $W_{ia}^1 > 0$ and $H_{bj}^1 > 0$, $\forall i, a, b, j$, then

$$W_{ia}^k > 0 \ \text{and}\ H_{bj}^k > 0,\quad \forall i, a, b, j,\ \forall k \ge 1. \qquad (2.6)$$
The proof is straightforward and is given in appendix A. If V has zero columns or rows, a division by zero may occur. Even if theorem 1 holds, denominators close to zero may still cause numerical problems. Some studies, such as Piper, Pauca, Plemmons, and Giffin (2004),
have proposed adding a small positive number to the denominators of equations 2.3 and 2.4. We observe numerical difficulties in a few situations and provide more discussion in section 6.3.

Regarding the computational complexity, $V(H^k)^T$ and $(W^{k+1})^TV$ in equations 2.3 and 2.4 are both O(nmr) operations. One can calculate the denominator in equation 2.3 by either

$$(WH)H^T \quad\text{or}\quad W(HH^T). \qquad (2.7)$$

The former takes O(nmr) operations, but the latter costs only $O(\max(m, n)r^2)$. As $r < \min(m, n)$, the latter is better. Similarly, for equation 2.4, $(W^TW)H$ should be used. This discussion indicates the importance of having as few O(nmr) operations (i.e., WH, $W^TV$, or $VH^T$) as possible in any NMF code. In summary, the overall cost of algorithm 1 is #iterations × O(nmr). All time complexity analysis in this letter assumes that V, W, and H are implemented as dense matrices.
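To make the above concrete, here is a minimal Matlab sketch of algorithm 1 that uses the cheaper multiplication orders $W(HH^T)$ and $(W^TW)H$; the small constant added to the denominators is the kind of safeguard attributed above to Piper et al. (2004) and is our own implementation choice, not part of updates 2.3 and 2.4.

```matlab
% Minimal sketch of algorithm 1 (multiplicative updates).
function [W, H] = nmf_mult(V, r, maxiter)
[n, m] = size(V);
W = rand(n, r); H = rand(r, m);            % strictly positive initial point
for k = 1:maxiter
    W = W .* (V*H') ./ (W*(H*H') + eps);   % update 2.3 via W(HH^T)
    H = H .* (W'*V) ./ ((W'*W)*H + eps);   % update 2.4 via (W^TW)H
end
```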
2.2 Alternating Nonnegative Least Squares. From the nonincreasing property 2.5, algorithm 1 is a special case of a general framework that alternately fixes one matrix and improves the other: find $W^{k+1}$ such that $f(W^{k+1}, H^k) \le f(W^k, H^k)$, and then find $H^{k+1}$ such that $f(W^{k+1}, H^{k+1}) \le f(W^{k+1}, H^k)$. The extreme situation is to obtain the best point (Paatero, 1999; Chu et al., 2005):

Algorithm 2: Alternating Nonnegative Least Squares
1. Initialize $W_{ia}^1 \ge 0$, $H_{bj}^1 \ge 0$, $\forall i, a, b, j$.
2. For k = 1, 2, . . .
$$W^{k+1} = \arg\min_{W \ge 0} f(W, H^k) \qquad (2.8)$$
$$H^{k+1} = \arg\min_{H \ge 0} f(W^{k+1}, H). \qquad (2.9)$$
This approach is the block coordinate descent method in bound-constrained optimization (Bertsekas, 1999), where sequentially one block of variables is minimized under corresponding constraints and the remaining blocks are fixed. For NMF, we have the simplest case of only two block variables W and H.
We refer to equation 2.8 or 2.9 as a subproblem in algorithm 2. When one block of variables is fixed, a subproblem is indeed a collection of several nonnegative least-squares problems. From equation 2.9, $H^{k+1}$'s jth column is the solution of

$$\min_{h \ge 0}\ \|v - W^{k+1}h\|^2, \qquad (2.10)$$

where v is the jth column of V and h is a vector variable. Chu et al. (2005) suggest projected Newton methods (Lawson & Hanson, 1974) to solve each problem (see equation 2.10). Clearly, solving subproblems 2.8 and 2.9 at each iteration could be more expensive than the simple update in algorithm 1; algorithm 2 may then be slower even though we expect it to decrease the function value more at each iteration. Efficient methods to solve the subproblems are thus essential. Section 4.1 proposes using projected gradient methods and discusses why they are suitable for solving the subproblems in algorithm 2.

Regarding the convergence of algorithm 2, one may think that it is a trivial result. For example, Paatero (1999) states that for the alternating nonnegative least-squares approach, no matter how many blocks of variables we have, convergence is guaranteed. However, this issue deserves some attention. Past convergence analysis for block coordinate descent methods requires subproblems to have unique solutions (Powell, 1973; Bertsekas, 1999), but this property does not hold here: subproblems 2.8 and 2.9 are convex, but they are not strictly convex. Hence, these subproblems may have multiple optimal solutions. For example, when $H^k$ is the zero matrix, any W is optimal for equation 2.8. Fortunately, for the case of two blocks, Grippo and Sciandrone (2000) have shown that this uniqueness condition is not needed. Directly from corollary 2 of Grippo and Sciandrone (2000), we have the following convergence result:

Theorem 2. Any limit point of the sequence $\{(W^k, H^k)\}$ generated by algorithm 2 is a stationary point of equation 1.1.

The remaining issue is whether the sequence $\{(W^k, H^k)\}$ has at least one limit point (i.e., at least one convergent subsequence). In optimization analysis, this property often comes from the boundedness of the feasible region, but our region under the constraints $W_{ia} \ge 0$ and $H_{bj} \ge 0$ is unbounded. One can easily add a large upper bound to all variables in equation 1.1. As the modification still leads to a bound-constrained problem, algorithm 2 can be applied and theorem 2 holds. In contrast, it is unclear how to easily modify the multiplicative update rules if there are upper bounds in equation 1.1. In summary, contrary to algorithm 1, which still lacks convergence results, algorithm 2 has nice optimization properties.
2.3 Gradient Approaches. In Chu et al. (2005, sec. 3.3), several gradient-type approaches have been mentioned. In this section, we briefly discuss methods that select the step size along the negative gradient direction. By defining

$$W_{ia} = E_{ia}^2 \quad\text{and}\quad H_{bj} = F_{bj}^2,$$

Chu et al. (2005) reformulate equation 1.1 as an unconstrained optimization problem in the variables $E_{ia}$ and $F_{bj}$. Then standard gradient descent methods can be applied. The same authors also mention that Shepherd (2004) uses

$$W^{k+1} = \max(0,\ W^k - \alpha_k\nabla_W f(W^k, H^k)),$$
$$H^{k+1} = \max(0,\ H^k - \alpha_k\nabla_H f(W^k, H^k)),$$

where $\alpha_k$ is the step size. This approach is already a projected gradient method. However, in the above references, details are not discussed.

3 Projected Gradient Methods for Bound-Constrained Optimization

We consider the following standard form of bound-constrained optimization problems:

$$\min_{x\in R^n}\ f(x) \quad\text{subject to}\quad l_i \le x_i \le u_i,\quad i = 1, \ldots, n, \qquad (3.1)$$
where $f(x): R^n \to R$ is a continuously differentiable function, and l and u are the lower and upper bounds, respectively. Let k be the iteration index. Projected gradient methods update the current solution $x^k$ to $x^{k+1}$ by the rule

$$x^{k+1} = P[x^k - \alpha_k\nabla f(x^k)],$$

where the projection operator

$$P[x_i] = \begin{cases} x_i & \text{if } l_i < x_i < u_i, \\ u_i & \text{if } x_i \ge u_i, \\ l_i & \text{if } x_i \le l_i, \end{cases}$$
maps a point back onto the bounded feasible region. Variants of projected gradient methods differ in how they select the step size $\alpha_k$. We consider a simple and effective one called the “Armijo rule along the projection arc” in Bertsekas (1999), which originates from Bertsekas (1976). The procedure is illustrated in algorithm 3:
Algorithm 3: A Projected Gradient Method for Bound-Constrained Optimization
1. Given 0 < β < 1, 0 < σ < 1. Initialize any feasible $x^1$.
2. For k = 1, 2, . . .
$$x^{k+1} = P[x^k - \alpha_k\nabla f(x^k)],$$
where $\alpha_k = \beta^{t_k}$ and $t_k$ is the first nonnegative integer t for which
$$f(x^{k+1}) - f(x^k) \le \sigma\,\nabla f(x^k)^T(x^{k+1} - x^k). \qquad (3.2)$$
Condition 3.2, used in most proofs of projected gradient methods, ensures a sufficient decrease of the function value per iteration. By trying the step sizes $1, \beta, \beta^2, \ldots$, Bertsekas (1976) proved that $\alpha_k > 0$ satisfying equation 3.2 always exists and that every limit point of $\{x^k\}_{k=1}^{\infty}$ is a stationary point of equation 3.1. A common choice of σ is 0.01, and we consider β = 1/10 in this letter. In the experiment section 6.2, we discuss the choice of β further.

Searching for $\alpha_k$ is the most time-consuming operation in algorithm 3, so one should check as few step sizes as possible. Since $\alpha_{k-1}$ and $\alpha_k$ may be similar, a trick in Lin and Moré (1999) uses $\alpha_{k-1}$ as the initial guess and then either increases or decreases it in order to find the largest $\beta^{t_k}$ satisfying equation 3.2. Moreover, with nonnegative $t_k$, algorithm 3 may be too conservative by restricting $\alpha_k \le 1$. Sometimes a larger step more effectively projects variables to bounds in one iteration. Algorithm 4 implements a better initial guess of α at each iteration and allows α to be larger than one:

Algorithm 4: An Improved Projected Gradient Method
1. Given 0 < β < 1, 0 < σ < 1. Initialize any feasible $x^1$. Set $\alpha_0 = 1$.
2. For k = 1, 2, . . .
(a) Assign $\alpha_k \leftarrow \alpha_{k-1}$.
(b) If $\alpha_k$ satisfies equation 3.2, repeatedly increase it by $\alpha_k \leftarrow \alpha_k/\beta$ until either $\alpha_k$ does not satisfy equation 3.2 or $x(\alpha_k/\beta) = x(\alpha_k)$. Else repeatedly decrease $\alpha_k$ by $\alpha_k \leftarrow \alpha_k\cdot\beta$ until $\alpha_k$ satisfies equation 3.2.
(c) Set $x^{k+1} = P[x^k - \alpha_k\nabla f(x^k)]$.
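The following Matlab sketch shows one way algorithm 4 could look for the special case l = 0, u = ∞ (the case needed later for NMF). The function and gradient handles and the fixed iteration budget are our own simplifications, not part of the algorithm as stated.

```matlab
% Sketch of algorithm 4 for min f(x) subject to x >= 0.
function x = projgrad(fun, grad, x, maxiter)
beta = 0.1; sigma = 0.01; alpha = 1;       % alpha persists across iterations
for k = 1:maxiter
    g = grad(x);
    xn = max(x - alpha*g, 0);
    if fun(xn) - fun(x) <= sigma * g(:)'*(xn(:) - x(:))
        while true                          % 3.2 holds: try increasing alpha
            alpha_t = alpha/beta;
            xt = max(x - alpha_t*g, 0);
            if fun(xt) - fun(x) > sigma * g(:)'*(xt(:) - x(:)) || isequal(xt, xn)
                break;                      % 3.2 fails or projection unchanged
            end
            alpha = alpha_t; xn = xt;
        end
    else
        while fun(xn) - fun(x) > sigma * g(:)'*(xn(:) - x(:))   % decrease alpha
            alpha = alpha*beta;
            xn = max(x - alpha*g, 0);
        end
    end
    x = xn;
end
```

For example, a nonnegative least-squares problem could be passed in as `projgrad(@(x) 0.5*norm(A*x-b)^2, @(x) A'*(A*x-b), x0, 50)`, where A, b, and x0 are the user's data.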
The convergence result has been proved in, for example, Calamai and Moré (1987). One may think that finding the α with the largest function reduction leads to faster convergence:

$$\alpha_k \equiv \arg\min_{\alpha \ge 0} f\left(P[x^k - \alpha\nabla f(x^k)]\right). \qquad (3.3)$$
The convergence of selecting such an $\alpha_k$ is proved in McCormick and Tapia (1972). However, equation 3.3 is a piecewise function of α, which is difficult to minimize.

A major obstacle in minimizing bounded problems is to identify the free (i.e., $l_i < x_i < u_i$) and active (i.e., $x_i = l_i$ or $u_i$) components at the convergent stationary point. Projected gradient methods are considered effective for doing so, since they are able to add several active variables at a single iteration. However, once these sets have been (almost) identified, the problem in a sense reduces to an unconstrained one, and the slow convergence of gradient-type methods may occur. We will explain in section 4.1 that for NMF problems, this issue may not be serious.

4 Projected Gradient Methods for NMF

We apply projected gradient methods to NMF in two situations. The first case solves the nonnegative least-squares problems discussed in section 2.2. The second case directly minimizes equation 1.1. The two approaches have convergence properties following from theorem 2 and from Calamai and Moré (1987), respectively. Several modifications specific to NMF will be presented.

4.1 Alternating Nonnegative Least Squares Using Projected Gradient Methods. Section 2.2 indicates that algorithm 2 relies on efficiently solving subproblems 2.8 and 2.9, each of which is a bound-constrained problem. We propose using projected gradient methods to solve them. Subproblem 2.9 consists of m independent nonnegative least-squares problems (see equation 2.10), so one could solve them separately, a situation suitable for parallel environments. However, in a serial setting, treating them together is better for the following reasons:
- These nonnegative least-squares problems are closely related, as they share the same constant matrices V and $W^{k+1}$ in equation 2.10.
- Working on the whole H rather than its individual columns implies that all operations are matrix based. Since finely tuned numerical linear algebra codes have better speedups on matrix than on vector operations, we can thus save computational time.
For a simpler description of our method, we focus on equation 2.9 and rewrite it as

$$\min_{H}\ \bar f(H) \equiv \frac{1}{2}\|V - WH\|_F^2 \quad\text{subject to}\quad H_{bj} \ge 0,\ \forall b, j. \qquad (4.1)$$
Both V and W are constant matrices in equation 4.1. If we concatenate H's columns into a vector vec(H), then

$$\bar f(H) = \frac{1}{2}\|V - WH\|_F^2 = \frac{1}{2}\,\mathrm{vec}(H)^T \begin{pmatrix} W^TW & & \\ & \ddots & \\ & & W^TW \end{pmatrix} \mathrm{vec}(H) + \text{terms linear in } H.$$

The Hessian matrix (i.e., second derivative) of $\bar f(H)$ is thus block diagonal, and each block $W^TW$ is an r × r positive semidefinite matrix. As $W \in R^{n\times r}$ and $r \ll n$, $W^TW$ and the whole Hessian matrix tend to be well conditioned, a good property for optimization algorithms. Thus, gradient-based methods may converge fast enough. A further investigation of this conjecture is in the experiment section 6.2.

The high cost of solving the two subproblems, equations 2.8 and 2.9, at each iteration is a concern. It is thus essential to analyze the time complexity and find efficient implementations. Each subproblem requires an iterative procedure, whose iterations are referred to as subiterations. When using algorithm 4 to solve equation 4.1, we must maintain the gradient $\nabla\bar f(H) = W^T(WH - V)$ at each subiteration. Following the discussion near equation 2.7, one should calculate it by $(W^TW)H - W^TV$. The constant matrices $W^TW$ and $W^TV$ can be computed in $O(nr^2)$ and O(nmr) operations, respectively, before running the subiterations.

The main computational task per subiteration is to find a step size α such that the sufficient decrease condition 3.2 is satisfied. Assume $\bar H$ is the current solution. To check whether $\tilde H \equiv P[\bar H - \alpha\nabla\bar f(\bar H)]$ satisfies equation 3.2, calculating $\bar f(\tilde H)$ takes O(nmr) operations. If there are t trials of $\tilde H$'s, the computational cost O(tnmr) is prohibitive. We propose
the following strategy to reduce the cost: for a quadratic function f(x) and any vector d,

$$f(x + d) = f(x) + \nabla f(x)^Td + \frac{1}{2}\,d^T\nabla^2 f(x)\,d. \qquad (4.2)$$

Hence, for two consecutive iterates $\bar x$ and $\tilde x$, equation 3.2 can be written as

$$(1-\sigma)\,\nabla f(\bar x)^T(\tilde x - \bar x) + \frac{1}{2}(\tilde x - \bar x)^T\nabla^2 f(\bar x)(\tilde x - \bar x) \le 0.$$

Now $\bar f(H)$ defined in equation 4.1 is quadratic, so equation 3.2 becomes

$$(1-\sigma)\,\langle\nabla\bar f(\bar H),\ \tilde H - \bar H\rangle + \frac{1}{2}\,\langle\tilde H - \bar H,\ (W^TW)(\tilde H - \bar H)\rangle \le 0, \qquad (4.3)$$
where $\langle\cdot,\cdot\rangle$ is the sum of the component-wise products of two matrices. The major operation in equation 4.3 is the matrix product $(W^TW)\cdot(\tilde H - \bar H)$, which takes $O(mr^2)$. Thus, the cost O(tnmr) of checking equation 3.2 is significantly reduced to $O(tmr^2)$. With the cost O(nmr) for calculating $W^TV$ in the beginning, the complexity of using algorithm 4 to solve subproblem 4.1 is

$$O(nmr) + \#\text{sub-iterations} \times O(tmr^2),$$

where t is the average number of checks of equation 3.2 at each subiteration. The pseudocode for optimizing equation 4.1 is in section B.2.

We can use the same procedure to obtain $W^{k+1}$ by rewriting equation 2.8 in a form similar to equation 4.1:

$$\bar f(W) \equiv \frac{1}{2}\left\|V^T - H^TW^T\right\|_F^2,$$
where $V^T$ and $H^T$ are constant matrices. The overall cost to solve equation 1.1 is

$$\#\text{iterations} \times \left(O(nmr) + \#\text{sub-iterations} \times O(tmr^2 + tnr^2)\right). \qquad (4.4)$$
At each iteration, there are two O(nmr) operations, $V(H^k)^T$ and $(W^{k+1})^TV$, the same as those in the multiplicative update method. If t and the number of subiterations are small, this method is efficient. To reduce the number of subiterations, a simple but useful technique is to warm-start the solution procedure of each subproblem: at final iterations, the $(W^k, H^k)$'s are all similar, so $W^k$ is an effective initial point for solving equation 2.8.
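The following Matlab sketch outlines such a subproblem solver: projected gradient iterations on equation 4.1 with the quadratic test 4.3, using the projected-gradient norm of equation 5.2 (defined in section 5) as the inner stopping measure. It is a simplified reading of the method — the step-size search here only decreases α, and the 20-trial cap is arbitrary; the full procedure is the one in appendix B.2.

```matlab
% Sketch of the section 4.1 subproblem solver for min_{H>=0} 0.5*||V - W*H||_F^2.
function [H, iter] = nls_subproblem(V, W, H, tol, maxiter)
WtV = W'*V; WtW = W'*W;                    % formed once, before subiterations
beta = 0.1; sigma = 0.01;
for iter = 1:maxiter
    grad = WtW*H - WtV;                    % gradient of f-bar at H
    if norm(grad(grad < 0 | H > 0)) < tol  % projected-gradient norm (eq. 5.2)
        break;
    end
    alpha = 1;
    for t = 1:20                           % step-size search (decrease only)
        Hn = max(H - alpha*grad, 0);
        d = Hn - H;
        % condition 4.3: an O(mr^2) test replaces an O(nmr) evaluation
        if (1-sigma)*sum(sum(grad.*d)) + 0.5*sum(sum(d.*(WtW*d))) <= 0
            break;
        end
        alpha = alpha*beta;
    end
    H = Hn;
end
```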
4.2 Directly Applying Projected Gradients to NMF. We may directly apply algorithm 4 to minimize equation 1.1. As in solving the nonnegative least-squares problems of section 4.1, the most expensive operation is checking the sufficient decrease condition, equation 3.2. From the current solution $(\bar W, \bar H)$, we simultaneously update both matrices to $(\tilde W, \tilde H)$:

$$(\tilde W, \tilde H) \equiv P\left[(\bar W, \bar H) - \alpha\left(\nabla_W f(\bar W, \bar H),\ \nabla_H f(\bar W, \bar H)\right)\right].$$

As f(W, H) is not a quadratic function, equation 4.2 does not hold, so the trick of equation 4.3 cannot be applied to save computational time. Calculating $f(\tilde W, \tilde H) = \frac{1}{2}\|V - \tilde W\tilde H\|_F^2$ then takes O(nmr) operations, and the total computational cost is #iterations × O(tnmr), where t is the average number of times condition 3.2 is checked per iteration.

Given any random initial $(W^1, H^1)$, if $\|V - W^1H^1\|_F^2 > \|V\|_F^2$, very often after the first iteration $(W^2, H^2) = (0, 0)$ causes the algorithm to stop. The solution (0, 0) is a useless stationary point of equation 1.1. A simple remedy is to find a new initial point $(W^1, \bar H^1)$ such that $f(W^1, \bar H^1) < f(0, 0)$. By solving $\bar H^1 = \arg\min_{H \ge 0} f(W^1, H)$ using the procedure described in section 4.1, we have

$$\|V - W^1\bar H^1\|_F^2 \le \|V - W^1\cdot 0\|_F^2 = \|V\|_F^2.$$

The strict inequality generally holds, so $f(W^1, \bar H^1) < f(0, 0)$.

5 Stopping Conditions

In all the algorithms mentioned so far, we did not specify when the procedure should stop. Several implementations of the multiplicative update method (e.g., Hoyer, 2004) have an infinite loop, which must be interrupted by users after a time or iteration limit. Some researchers (e.g., Brunet, 2004) check the difference between recent iterations; if the difference is small enough, the procedure stops. However, such a stopping condition does not reveal whether a solution is close to a stationary point. In addition to a time or iteration limit, standard conditions to check stationarity should also be included in NMF software. Moreover, in alternating least squares, each subproblem involves an optimization procedure, which needs a stopping condition as well.
In bound-constrained optimization, a common condition to check whether a point $x^k$ is close to a stationary point is the following (Lin & Moré, 1999):

$$\left\|\nabla^P f(x^k)\right\| \le \epsilon\,\left\|\nabla f(x^1)\right\|, \qquad (5.1)$$

where $\nabla^P f(x^k)$ is the projected gradient, defined as

$$\nabla^P f(x)_i \equiv \begin{cases} \nabla f(x)_i & \text{if } l_i < x_i < u_i, \\ \min(0, \nabla f(x)_i) & \text{if } x_i = l_i, \\ \max(0, \nabla f(x)_i) & \text{if } x_i = u_i. \end{cases} \qquad (5.2)$$
This condition follows from an equivalent form of the KKT condition for bounded problems: $l_i \le x_i \le u_i$, ∀i, and $\nabla^P f(x) = 0$. For NMF, equation 5.1 becomes

$$\left\|\nabla^P f(W^k, H^k)\right\|_F \le \epsilon\,\left\|\nabla f(W^1, H^1)\right\|_F. \qquad (5.3)$$
For alternating least squares, each subproblem 2.8 or 2.9 requires a stopping condition as well. Ideally, the condition for them should be related to the global one for equation 1.1, but a suitable condition is not obvious. For example, we cannot use the same stopping tolerance ε of equation 5.3 for the subproblems: a user may specify ε = 0 and terminate the code after a certain time or iteration limit, and the same ε = 0 in solving the first subproblem would then cause algorithm 2 to keep running at the first iteration. We thus use the following stopping conditions for the subproblems. The matrices $W^{k+1}$ and $H^{k+1}$ returned from the iterative procedures for solving subproblems 2.8 and 2.9 should, respectively, satisfy

$$\left\|\nabla_W^P f(W^{k+1}, H^k)\right\|_F \le \bar\epsilon_W \quad\text{and}\quad \left\|\nabla_H^P f(W^{k+1}, H^{k+1})\right\|_F \le \bar\epsilon_H, \qquad (5.4)$$

where we set

$$\bar\epsilon_W = \bar\epsilon_H \equiv \max(10^{-3}, \epsilon)\,\left\|\nabla f(W^1, H^1)\right\|_F$$

in the beginning, and ε is the tolerance in equation 6.3. If the projected gradient method for solving equation 2.8 stops without any iterations, we decrease the stopping tolerance by $\bar\epsilon_W \leftarrow \bar\epsilon_W/10$. For subproblem 2.9, $\bar\epsilon_H$ is reduced in a similar way.
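Putting the pieces together, a hedged sketch of the outer alternating loop might look as follows, reusing the hypothetical nls_subproblem from the sketch in section 4.1; the tolerance handling follows equation 5.4, while the global test 5.3 and the time/iteration limits are only indicated in comments.

```matlab
% Sketch of algorithm 2 with the subproblem tolerances of equation 5.4.
function [W, H] = alspgrad(V, W, H, epsilon, maxiter)
gradW = (W*H - V)*H'; gradH = W'*(W*H - V);
initgrad = norm([gradW; gradH'], 'fro');            % ||grad f(W^1, H^1)||_F
tolW = max(1e-3, epsilon)*initgrad; tolH = tolW;    % initial eps_W = eps_H
for k = 1:maxiter
    % W-subproblem 2.8, solved through its transposed form (section 4.1)
    [Wt, iterW] = nls_subproblem(V', H', W', tolW, 1000); W = Wt';
    if iterW <= 1, tolW = tolW/10; end              % solver returned at once
    [H, iterH] = nls_subproblem(V, W, H, tolH, 1000);
    if iterH <= 1, tolH = tolH/10; end
    % a global test such as 5.3 (projected-gradient norm <= epsilon*initgrad)
    % and a time or iteration limit would terminate this loop in practice
end
```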
6 Experiments

We compare four methods discussed in this letter and refer to them in the following way:

- mult: the multiplicative update method described in section 2.1
- alspgrad: alternating nonnegative least squares using the projected gradient method for each subproblem (see section 4.1)
- pgrad: a direct use of the projected gradient method on equation 1.1 (see section 4.2)
- lsqnonneg: using the Matlab command lsqnonneg to solve the m problems (see equation 2.10) in the alternating least-squares framework
All implementations are in Matlab (http://www.mathworks.com). We conduct experiments on an Intel Xeon 2.8 GHz computer. Results on synthetic and real data are presented in the following subsections. All source codes for the experiments are available online at http://www.csie.ntu.edu.tw/~cjlin/nmf.

6.1 Synthetic Data. We consider three problem sizes: (m, r, n) = (25, 5, 125), (50, 10, 250), and (100, 20, 500). The matrix V is randomly generated by taking absolute values of standard normal variates (mean 0 and standard deviation 1), $V_{ij} = |N(0, 1)|$. The initial $(W^1, H^1)$ is constructed in the same way, and all four methods share the same initial point. These methods may converge to different points due to the nonconvexity of the NMF problem, equation 1.1. To have a fair comparison, for the same V we try 30 different initial $(W^1, H^1)$ and report the average results. As synthetic data may not resemble practical problems, we leave the detailed analysis of the proposed algorithms to section 6.2, which considers real data.

We set ε in equation 6.3 to $10^{-3}$, $10^{-4}$, $10^{-5}$, and $10^{-6}$ in order to investigate the convergence speed to a stationary point. We also impose a time limit of 1000 seconds and a maximal number of 8000 iterations on each method. As lsqnonneg takes a long computing time, we run it only on the smallest data set. Due to the slow convergence of mult and pgrad, for $\epsilon = 10^{-6}$ we run only alspgrad. Results of average time, number of iterations, and objective values are in Table 1.
Table 1: Results of Running Synthetic Data Sets (from Small to Large) Under Various Stopping Tolerances.

a. m = 25, r = 5, n = 125

             Time (s)                       Number of Iterations        Objective Values
ε =        1e-3   1e-4    1e-5    1e-6 |  1e-3   1e-4   1e-5  1e-6 |   1e-3    1e-4    1e-5    1e-6
mult       0.10   1.22    2.60     --  |   698   4651   7639   --  |  390.4   389.3   389.3     --
alspgrad   0.03   0.10    0.45    0.97 |     6     26    100   203 |  412.9   392.8   389.2   389.1
pgrad      0.05   0.24    0.68     --  |    53    351   1082    -- |  401.6   389.9   389.1     --
lsqnonneg  6.32  27.76   57.57     --  |    23     96    198    -- |  391.1   389.1   389.0     --

b. m = 50, r = 10, n = 250

mult       0.16  14.73   21.53     --  |   349   6508   8000    -- | 1562.1  1545.7  1545.6     --
alspgrad   0.03   0.13    0.99    5.51 |     4     14     76   352 | 1866.6  1597.1  1547.8  1543.5
pgrad      0.38   3.17   10.29     --  |    47   1331   4686    -- | 1789.4  1558.4  1545.5     --

c. m = 100, r = 20, n = 500

mult       0.41   8.28  175.55     --  |   170   2687   8000    -- | 6535.2  6355.7  6342.3     --
alspgrad   0.02   0.21    1.09   10.02 |     2      8     31   234 | 9198.7  6958.6  6436.7  6332.9
pgrad      0.60   2.88   35.20     --  |     2    200   3061    -- | 8141.1  6838.7  6375.0     --

Notes: We present the average time (in seconds), number of iterations, and objective values of using 30 initial points; ε is the stopping tolerance. Approaches with the smallest time or objective values are in bold type. Note that when $\epsilon = 10^{-5}$, mult and pgrad often exceed the iteration limit of 8000.
Table 2: Image Data: Average Objective Values and Projected-Gradient Norms Using 30 Initial Points Under Specified Time Limits.

                                   CBCL              ORL                Natural
Problem size (n, r, m)        361, 49, 2429    10,304, 25, 400     288, 72, 10,000
Time limit (seconds)            25      50        25      50          25          50

Objective value     mult      963.81  914.09    16.14   14.31   370,797.31  353,709.28
                    alspgrad  923.77  870.18    14.34   13.82   377,167.27  352,355.64
||∇^P f(W,H)||_F    mult      488.72  327.28    19.37    9.30    54,534.09   21,985.99
                    alspgrad  230.67  142.13     4.82    4.33    19,357.03    4,974.84

Note: Smaller values are in bold type.
Clearly, mult quickly decreases the objective value in the beginning but slows down in the end. In contrast, alspgrad has the fastest final convergence. For the two larger problems, it gives the smallest objective value under ε = 10^-6 but takes less time than mult does under ε = 10^-5. Due to the poor performance of pgrad and lsqnonneg, we subsequently focus on comparing mult and alspgrad.

6.2 Image Data. We consider three image problems used in Hoyer (2004):
• The CBCL (MIT Center for Biological and Computational Learning) face image database (http://cbcl.mit.edu/cbcl/software-datasets/FaceData2.html).
• The ORL (Olivetti Research Laboratory) face image database (http://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html).
• The natural image data set (Hoyer, 2002).
All settings are the same as those in Hoyer (2004). We compare the objective values and projected-gradient norms of mult and alspgrad after running for 25 and 50 seconds. Table 2 presents average results over 30 random initial points. For all three problems, alspgrad gives smaller objective values. While mult may quickly lower the objective value in the beginning, alspgrad catches up very soon and has faster final convergence. These results are consistent with the findings in section 6.1. Regarding the projected-gradient norms, those of alspgrad are much smaller; hence, solutions by alspgrad are closer to stationary points.

To further illustrate the slow final convergence of mult, Figure 1 plots the objective value against the running time. The CBCL set with the first of the 30 initial (W^1, H^1) is used. The figure clearly demonstrates that mult decreases the objective value very slowly in the final iterations.
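The projected-gradient norm used in these comparisons can be computed directly from (V, W, H); a small helper (our own wrapper around the same expression used in the appendix B code):

function pg = projnorm(V, W, H)
% Projected-gradient norm of f(W,H) = 0.5*||V - W*H||_F^2.
% A gradient entry contributes if it is negative or if the
% corresponding variable is strictly positive (away from its bound).
gradW = W*(H*H') - V*H';
gradH = (W'*W)*H - W'*V;
pg = norm([gradW(gradW < 0 | W > 0); gradH(gradH < 0 | H > 0)]);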
Figure 1: Time (seconds in log scale) versus objective values for mult (dashed line) and alspgrad (solid line).

Table 3: Number of Subiterations and Condition Numbers in Solving Equations 2.8 and 2.9 of alspgrad.

                            CBCL            ORL            Natural
Time limit (seconds)      25      50      25      50      25      50
W: # subiterations     34.51   47.81    9.93   11.27   21.94   27.54
cond(HH^T)            224.88  231.33   76.44   71.75   93.88  103.64
H: # subiterations     11.93   18.15    6.84    7.70    3.13    4.39
cond(W^TW)            150.89  124.27  478.35  129.00   38.49   17.19

Notes: For subiterations, we calculate (total subiterations)/(total iterations) under each initial point and report the average of the 30 values. For condition numbers, we find the median over all iterations and report the average. Note that HH^T (respectively, W^TW) is the Hessian of the subproblem of minimizing over W (respectively, H).
The number of subiterations for solving equations 2.8 and 2.9 in alspgrad is an important issue. First, it is related to the time complexity analysis. Second, section 4.1 conjectures that the number should be small, as W^TW and HH^T are generally well conditioned. Table 3 presents the number of subiterations and the condition numbers of W^TW and HH^T. Compared to gradient-based methods in other scenarios, the number of subiterations is relatively small; another projected gradient method, pgrad, discussed in Table 1, easily takes hundreds or thousands of iterations. For the condition numbers, both the CBCL and natural sets have r < n < m, so HH^T tends to be better conditioned than W^TW. ORL has the opposite, r < m < n. All condition numbers are small, and this result confirms our earlier conjecture. For ORL, cond(W^TW) > cond(HH^T), but the number of subiterations for solving W is larger. One possible reason is the different stopping tolerances for solving equations 2.8 and 2.9.

In the implementation of alspgrad, there is a parameter β, the rate at which the step size is reduced to satisfy the sufficient decrease condition, equation 3.2. It must be between 0 and 1.
Table 4: Text Data: Average Objective Values and Projected-Gradient Norms of Using 30 Initial Points Under Specified Time Limits.

                        r = 3 (n = 5412, m = 1588)    r = 6 (n = 5737, m = 1401)
Time limit (seconds)        25          50                25          50
Objective value
  mult                  710.160     710.135           595.245     594.869
  alspgrad              710.128     710.128           594.631     594.520
||∇f^P(W, H)||_F
  mult                    4.646       1.963            13.633      11.268
  alspgrad                0.016       0.000             2.250       0.328

Notes: Smaller values are in bold type. Due to the unbalanced class distribution, interestingly, the random selection of six classes results in fewer documents (i.e., smaller m) than selecting three classes.
For the above experiments, we use β = 0.1. One may wonder about the effect of other values of β. Clearly, a smaller β reduces the step size more aggressively, but it may also produce a step size that is in the end too small. We consider the CBCL set with the first of the 30 initial (W^1, H^1) (i.e., the setting used to generate Figure 1) and check the effect of using β = 0.5 and β = 0.1. In both cases, alspgrad works well, but the one using 0.1 reduces the function value slightly more quickly. Therefore, β = 0.5 causes too many checks of the sufficient decrease condition, equation 3.2; the cost per iteration is thus higher.

6.3 Text Data. NMF is useful for document clustering, so we next consider the text set RCV1 (Lewis, Yang, Rose, & Li, 2004). This set is an archive of manually categorized newswire stories from Reuters Ltd. The collection has been fully preprocessed, including removing stop words, stemming, and transforming into vector space models. Each vector, cosine normalized, contains features of logged TF-IDF (term frequency, inverse document frequency). Training and testing splits have been defined. We remove documents in the training set that are associated with more than one class and obtain a set of 15,933 instances in 101 classes. We further remove classes with fewer than five documents. Using r = 3 and 6, we then randomly select r classes of documents to construct the n × m matrix V, where n is the size of the vocabulary and m is the number of documents. Some words never appear in the selected documents and cause zero rows in V; we remove them before the experiments. The parameter r is the number of clusters to which we intend to assign the documents.
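A sketch of this preprocessing (the variable names docs and labels are our own; docs holds the cosine-normalized TF-IDF vectors, and labels the class of each document):

% Build V from r randomly selected classes and drop empty rows.
r = 3;
classes = unique(labels);
pick = classes(randperm(numel(classes), r));
V = docs(:, ismember(labels, pick));   % n x m term-document matrix
V = V(any(V, 2), :);                   % remove words that never appear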
Results of running mult and alspgrad for 25 and 50 seconds are in Table 4. Again, alspgrad gives smaller objective values. In addition, the projected-gradient norms of alspgrad are smaller. In section 2.1, we mentioned that mult is well defined if theorem 1 holds. Now V is a sparse matrix with many zero elements, since the words appearing in a document are only a small subset of the whole vocabulary. Thus, some columns of V are close to zero vectors, and in a few situations numerical difficulties occur. In contrast, we do not face such problems with projected gradient methods.

7 Discussion and Conclusions

We discuss some future issues and draw conclusions.

7.1 Future Issues. As the resulting W and H usually have many zero components, NMF is said to produce a sparse representation of the data. To achieve better sparseness, some studies, such as Hoyer (2002) and Piper et al. (2004), add penalty terms to the NMF objective function:

\frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{m} \left(V_{ij} - (WH)_{ij}\right)^2 + \alpha \|W\|_F^2 + \beta \|H\|_F^2,    (7.1)
where α and β are positive numbers. Besides the Frobenius norm, which is quadratic, we can also use a linear penalty function,

\alpha \sum_{i,a} W_{ia} + \beta \sum_{b,j} H_{bj}.    (7.2)
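Adapting the appendix B solver to the quadratic penalty, equation 7.1, requires only a change to the subproblem Hessian; a sketch (here beta is the penalty weight of equation 7.1, not the step-size parameter of section 3):

% In nlssubprob, the H-subproblem of equation 7.1 adds beta*||H||_F^2,
% so the quadratic term becomes W'*W + 2*beta*I; that is, replace
%   WtW = W'*W;
% with
WtW = W'*W + 2*beta*eye(size(W,2));
% The gradient WtW*H - WtV and the step-size search then carry over
% unchanged, since the subproblem is still a bound-constrained quadratic.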
Our proposed methods can be used for such formulations. As the penalty parameters α and β only indirectly control the sparseness, Hoyer (2004) proposes a scheme to directly specify the desired sparsity. It is interesting to investigate how to incorporate projected gradient methods into such frameworks.

7.2 Conclusions. This letter proposes two projected gradient methods for NMF. The one solving least-squares subproblems in algorithm 2 leads to faster convergence than the popular multiplicative update method. Its success is due to the following findings:

• Subproblems in algorithm 2 for NMF generally have well-conditioned Hessian matrices (i.e., second derivatives) due to the property r ≪ min(n, m). Hence, projected gradients converge quickly, although they use only first-order information.
• The cost of selecting step sizes in the projected gradient method for subproblem 4.1 is significantly reduced by some reformulations that again use the property r ≪ min(n, m).
Therefore, taking advantage of special NMF properties is crucial when applying an optimization method to NMF.
Roughly speaking, optimization methods lie between the following two extreme situations:

    low cost per iteration,        high cost per iteration,
    slow convergence        ←→     fast convergence.

For example, Newton's methods are expensive per iteration but have very fast final convergence. Approaches with low cost per iteration usually decrease the objective value more quickly in the beginning, a nice property enjoyed by the multiplicative update method for NMF. Based on our analysis, we feel that the multiplicative update is very close to the first extreme. The proposed method of alternating least squares using projected gradients tends to be more in between. With faster convergence and strong optimization properties, it is an attractive approach for NMF.

Appendix A: Proof of Theorem 1

When k = 1, equation 2.6 holds by the assumption of this theorem. Using induction, if equation 2.6 is correct at k, then at k + 1, clearly the denominators of equations 2.3 and 2.4 are strictly positive. Moreover, as V has neither a zero column nor a zero row, both numerators are strictly positive as well. Thus, equation 2.6 holds at k + 1, and the proof is complete.

Appendix B: Matlab Code

B.1 Main Code for alspgrad (Alternating Nonnegative Least Squares Using Projected Gradients)

function [W,H] = nmf(V,Winit,Hinit,tol,timelimit,maxiter)
% NMF by alternating nonnegative least squares using projected gradients
% Author: Chih-Jen Lin, National Taiwan University
%
% W,H: output solution
% Winit,Hinit: initial solution
% tol: tolerance for a relative stopping condition
% timelimit, maxiter: limit of time and iterations
W = Winit; H = Hinit; initt = cputime;

gradW = W*(H*H') - V*H';
gradH = (W'*W)*H - W'*V;
initgrad = norm([gradW; gradH'],'fro');
fprintf('Init gradient norm %f\n', initgrad);
tolW = max(0.001,tol)*initgrad; tolH = tolW;

for iter=1:maxiter,
  % stopping condition
  projnorm = norm([gradW(gradW<0 | W>0); gradH(gradH<0 | H>0)]);
  if projnorm < tol*initgrad | cputime-initt > timelimit,
    break;
  end

  [W,gradW,iterW] = nlssubprob(V',H',W',tolW,1000);
  W = W'; gradW = gradW';
  if iterW==1,
    tolW = 0.1 * tolW;
  end

  [H,gradH,iterH] = nlssubprob(V,W,H,tolH,1000);
  if iterH==1,
    tolH = 0.1 * tolH;
  end

  if rem(iter,10)==0, fprintf('.'); end
end
fprintf('\nIter = %d Final proj-grad norm %f\n', iter, projnorm);
B.2 Solving Subproblem 4.1 by the Projected Gradient Algorithm 4

function [H,grad,iter] = nlssubprob(V,W,Hinit,tol,maxiter)
% H, grad: output solution and gradient
% iter: #iterations used
% V, W: constant matrices
% Hinit: initial solution
% tol: stopping tolerance
% maxiter: limit of iterations
H = Hinit; WtV = W'*V; WtW = W'*W;

alpha = 1; beta = 0.1;
for iter=1:maxiter,
  grad = WtW*H - WtV;
  projgrad = norm(grad(grad < 0 | H > 0));
  if projgrad < tol,
    break
  end

  % search step size
  for inner_iter=1:20,
    Hn = max(H - alpha*grad, 0); d = Hn-H;
    gradd = sum(sum(grad.*d)); dQd = sum(sum((WtW*d).*d));
    suff_decr = 0.99*gradd + 0.5*dQd < 0;
    if inner_iter==1,
      decr_alpha = ~suff_decr; Hp = H;
    end
    if decr_alpha,
      if suff_decr,
        H = Hn; break;
      else
        alpha = alpha * beta;
      end
    else
      if ~suff_decr | Hp == Hn,
        H = Hp; break;
      else
        alpha = alpha/beta; Hp = Hn;
      end
    end
  end
end

if iter==maxiter,
  fprintf('Max iter in nlssubprob\n');
end
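A usage sketch tying the routines above to the synthetic setting of section 6.1 (the tolerance and limits shown are illustrative):

% Run alspgrad on an instance (V, W1, H1) generated as in section 6.1.
[W, H] = nmf(V, W1, H1, 1e-4, 1000, 8000);
fprintf('Objective: %f\n', 0.5*norm(V - W*H, 'fro')^2);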
Acknowledgments

I thank Marco Sciandrone for pointing out the convergence of two-block coordinate descent methods in Grippo & Sciandrone (2000) and for helpful comments.
References

Bertsekas, D. P. (1976). On the Goldstein-Levitin-Polyak gradient projection method. IEEE Transactions on Automatic Control, 21, 174–184.
Bertsekas, D. P. (1999). Nonlinear programming (2nd ed.). Belmont, MA: Athena Scientific.
Brunet, J.-P. (2004). An NMF program. Available online at http://www.broad.mit.edu/mpr/publications/projects/NMF/nmf.m
Brunet, J.-P., Tamayo, P., Golub, T. R., & Mesirov, J. P. (2004). Metagenes and molecular pattern discovery using matrix factorization. Proceedings of the National Academy of Sciences, 101(12), 4164–4169.
Calamai, P. H., & Moré, J. J. (1987). Projected gradient methods for linearly constrained problems. Mathematical Programming, 39, 93–116.
Chu, M., Diele, F., Plemmons, R., & Ragni, S. (2005). Optimality, computation and interpretation of nonnegative matrix factorizations. Preprint. Available online at http://www4.ncsu.edu/~mtchu/Research/Papers/nnmf.ps
Donoho, D., & Stodden, V. (2004). When does non-negative matrix factorization give a correct decomposition into parts? In S. Thrun, L. Saul, & B. Schölkopf (Eds.), Advances in neural information processing systems, 16. Cambridge, MA: MIT Press.
Gonzales, E. F., & Zhang, Y. (2005). Accelerating the Lee-Seung algorithm for non-negative matrix factorization (Tech. Rep.). Houston, TX: Department of Computational and Applied Mathematics, Rice University.
Grippo, L., & Sciandrone, M. (2000). On the convergence of the block nonlinear Gauss-Seidel method under convex constraints. Operations Research Letters, 26, 127–136.
Hoyer, P. O. (2002). Non-negative sparse coding. In Proceedings of the IEEE Workshop on Neural Networks for Signal Processing (pp. 557–565). Piscataway, NJ: IEEE.
Hoyer, P. O. (2004). Non-negative matrix factorization with sparseness constraints. Journal of Machine Learning Research, 5, 1457–1469.
Lawson, C. L., & Hanson, R. J. (1974). Solving least squares problems. Upper Saddle River, NJ: Prentice Hall.
Lee, D. D., & Seung, H. S. (1999). Learning the parts of objects by nonnegative matrix factorization. Nature, 401, 788–791.
Lee, D. D., & Seung, H. S. (2001). Algorithms for non-negative matrix factorization. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 556–562). Cambridge, MA: MIT Press.
Lewis, D. D., Yang, Y., Rose, T. G., & Li, F. (2004). RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5, 361–397.
Lin, C.-J., & Moré, J. J. (1999). Newton's method for large-scale bound constrained problems. SIAM Journal on Optimization, 9, 1100–1127.
McCormick, G. P., & Tapia, R. A. (1972). The gradient projection method under mild differentiability conditions. SIAM Journal on Control, 10, 93–98.
Paatero, P. (1999). The multilinear engine—A table-driven, least squares program for solving multilinear problems, including the n-way parallel factor analysis model. Journal of Computational and Graphical Statistics, 8(4), 854–888.
Paatero, P., & Tapper, U. (1994). Positive matrix factorization: A non-negative factor model with optimal utilization of error. Environmetrics, 5, 111–126.
Piper, J., Pauca, P., Plemmons, R., & Giffin, M. (2004). Object characterization from spectral data using nonnegative factorization and information theory. In Proceedings of the AMOS Technical Conference. Maui, HI.
Powell, M. J. D. (1973). On search directions for minimization. Mathematical Programming, 4, 193–201.
Shepherd, S. (2004). Non-negative matrix factorization. Available online at http://www.simonshepherd.supanet.com/nnmf.htm
Xu, W., Liu, X., & Gong, Y. (2003). Document clustering based on non-negative matrix factorization. In Proceedings of the 26th Annual International ACM SIGIR Conference (pp. 267–273). New York: ACM Press.
Received May 1, 2006; accepted September 27, 2006.
LETTER
Communicated by Peter Dayan
Integration of Stochastic Models by Minimizing α-Divergence

Shun-ichi Amari
[email protected]
RIKEN Brain Science Institute, Wako-shi, Hirosawa 2-1, Saitama 351-0198, Japan
When there are a number of stochastic models in the form of probability distributions, one needs to integrate them. Mixtures of distributions are frequently used, but exponential mixtures also provide a good means of integration. This letter proposes a one-parameter family of integration, called α-integration, which includes all of these well-known integrations. These are generalizations of various averages of numbers such as arithmetic, geometric, and harmonic averages. There are psychophysical experiments that suggest that α-integrations are used in the brain. The α-divergence between two distributions is defined, which is a natural generalization of Kullback-Leibler divergence and Hellinger distance, and it is proved that α-integration is optimal in the sense of minimizing α-divergence. The theory is applied to generalize the mixture of experts and the product of experts to the α-mixture of experts. The α-predictive distribution is also stated in the Bayesian framework.
1 Introduction

When there are a number of stochastic models or probability distributions, we need to integrate them for further reasoning. For example, sensory information in the brain is believed to be encoded stochastically, such that a firing pattern of neurons represents a probability distribution of the sensory input. When a signal is processed by different channels or even in different modalities, they provide different probability distributions, which are to be integrated. Many artificial neural models also integrate stochastic models. The mixture of experts model (Jacobs, Jordan, Nowlan, & Hinton, 1991; Jordan & Jacobs, 1994) integrates stochastic models given by different experts, and the product of experts model (Hinton, 2002) is another kind that uses a different method of integration. The mosaic model (Wolpert & Kawato, 1998) is a more realistic neuronal model that integrates evidence in the cerebellum. See also the Helmholtz machine (Dayan, Hinton, Neal, & Zemel, 1995) and the Ying-Yang machine (Xu, 2004).

As a typical example, let us consider two probability distributions p_1(s) and p_2(s) concerning a random variable s that is to be inferred. Such a probability distribution may be the posterior probability in the Bayesian

Neural Computation 19, 2780–2796 (2007)
C 2007 Massachusetts Institute of Technology
framework given a prior distribution and observed data. The simplest integration is to use their mixture (arithmetic mean):

\tilde{p}(s) = \frac{1}{2} \{ p_1(s) + p_2(s) \}.    (1.1)
Another idea for integration is to use their geometric mean (called the exponential mixture),

\tilde{p}(s) = c \sqrt{p_1(s)\, p_2(s)},    (1.2)
where c is the normalization constant. There are many other ways to achieve integration. To begin, let us give a generalized version of the average of k positive numbers (Hardy, Littlewood, & Polya, 1952) and derive the α-mean of numbers. We can then propose a one-parameter family of integration of k stochastic evidences p_1(s), p_2(s), …, p_k(s) with weights w_1, …, w_k, parameterized by a scalar α. This is given in the α-family of probability distributions, which includes both the mixture and the exponential mixture as special cases. They are a natural family of probability distributions in information geometry (Amari & Nagaoka, 2000), because they form dually flat manifolds or manifolds with a constant curvature. The α-integration of stochastic evidences is carried out in the α-family, where α is the free parameter.

To characterize the α-integration, we introduce the α-divergence between two probability distributions, which includes the Kullback-Leibler divergence and the Hellinger distance as special cases. We can then see that α-integration is optimal in the sense that it minimizes the average α-divergence from the candidate probability distribution to all of the distributions to be integrated. The theory can be applied to integrate not only probability distributions but also positive measures, which may be regarded as unnormalized probability distributions. This also characterizes the optimality of the α-mean of numbers. Let us apply the theory to generalize the mixture of experts and the product of experts for processing information. We also touch on the α-version of the Bayesian predictive distribution studied by Komaki (1996), Corcuera and Giummolè (1999), and Marriott (2002).

2 α-Mean of Positive Numbers

Given two positive numbers a and b, their arithmetic mean is (a + b)/2. However, there are other types of means: the geometric mean is \sqrt{ab}, and the harmonic mean is 2/(a^{-1} + b^{-1}). We can generalize the concept of the mean in the following way (Hardy et al., 1952; see Petz & Temesi, 2005, for a more general approach).
Let f be a differentiable monotone function. We can rescale a positive number u to f(u) using the (nonlinear) transformation f. This is called the f-representation of u. Given a, b > 0, we take the arithmetic mean of their f-representations f(a) and f(b),

\frac{1}{2} \{ f(a) + f(b) \},    (2.1)

and then apply the inverse transformation f^{-1}. We then have the f-mean of a and b as

m_f(a, b) = f^{-1}\left( \frac{1}{2} ( f(a) + f(b) ) \right).    (2.2)
This can easily be generalized to the weighted f-mean of a_1, …, a_k,

m_f(a_1, \ldots, a_k) = f^{-1}\left( \sum_i w_i f(a_i) \right),    (2.3)

where w_1, …, w_k are the weights, w_i > 0, \sum_i w_i = 1. We further require the f-mean to be linear scale free; that is, for c > 0, the f-mean of ca and cb is c times their f-mean,

m_f(ca, cb) = c\, m_f(a, b).    (2.4)
Theorem 1 (Hardy et al., 1952). The f-mean is linear scale free when and only when f is the following function for some α:

f_\alpha(u) = \begin{cases} u^{\frac{1-\alpha}{2}}, & \alpha \neq 1, \\ \log u, & \alpha = 1. \end{cases}    (2.5)
There is a simple proof in the appendix. By using the function f_α, a one-parameter family of means is proposed, called the α-mean. The function f_α also originates from information geometry (Amari & Nagaoka, 2000), which elucidates an invariant structure, that is, α-affine connections together with the Fisher Riemannian metric, in the manifold of probability distributions. The α-mean of two nonnegative numbers a and b is defined by

m_\alpha(a, b) = f_\alpha^{-1}\left( \frac{1}{2} \{ f_\alpha(a) + f_\alpha(b) \} \right)
              = c_\alpha \left( a^{\frac{1-\alpha}{2}} + b^{\frac{1-\alpha}{2}} \right)^{\frac{2}{1-\alpha}}, \quad \alpha \neq 1,    (2.6)
where

c_\alpha = 2^{-2/(1-\alpha)}    (2.7)

is a constant that yields

m_\alpha(a, a) = a.    (2.8)
One can easily see that this family includes the various known means:

α = 1:     m_1(a, b) = \sqrt{ab}
α = −1:    m_{-1}(a, b) = \frac{1}{2}(a + b)
α = 0:     m_0(a, b) = \left( \frac{1}{2}(\sqrt{a} + \sqrt{b}) \right)^2 = \frac{1}{4}(a + b) + \frac{1}{2}\sqrt{ab}
α = 3:     m_3(a, b) = \frac{2}{\frac{1}{a} + \frac{1}{b}}
α = ∞:     m_\infty(a, b) = \min\{a, b\}
α = −∞:    m_{-\infty}(a, b) = \max\{a, b\}.
The last two cases show that fuzzy logic can be naturally induced from the α-mean. The α-mean is inversely monotone with respect to α,

m_\alpha(a, b) \geq m_{\alpha'}(a, b), \quad \alpha \leq \alpha'.    (2.9)

This is a generalization of the well-known inequalities

\frac{a + b}{2} \geq \sqrt{ab} \geq \frac{2}{a^{-1} + b^{-1}}.    (2.10)
Therefore, as α increases, the α-mean relies more on the smaller element of {a, b}, while as α decreases, the larger one is taken into account more seriously. We can say that a small α represents a pessimistic attitude and a larger α a more optimistic attitude. It is easy to define the weighted α-mean of a_1, …, a_k, with weights w_1, …, w_k (w_1 + ··· + w_k = 1), by

m_\alpha(a_1, \ldots, a_k; w) = f_\alpha^{-1}\left( \sum_i w_i f_\alpha(a_i) \right),    (2.11)
where w = (w_1, …, w_k). In what sense the α-mean is optimal will be discussed later.
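Numerically, the weighted α-mean, equations 2.6 and 2.11, is a one-line computation once the exponent is fixed; a minimal Matlab sketch (the function name alpha_mean is ours, and the constant c_α is absorbed because the weights sum to 1):

function m = alpha_mean(a, w, alpha)
% Weighted alpha-mean of positive numbers a (vector) with weights w,
% where w_i > 0 and sum(w) = 1; see equations 2.5 and 2.11.
if alpha == 1
    m = exp(sum(w .* log(a)));     % f_1(u) = log u: geometric mean
else
    e = (1 - alpha)/2;             % f_alpha(u) = u^((1-alpha)/2)
    m = sum(w .* a.^e)^(1/e);
end

For example, alpha_mean([a b], [0.5 0.5], -1) returns the arithmetic mean, alpha = 1 the geometric mean, and alpha = 3 the harmonic mean, matching the special cases listed after equation 2.8.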
3 α-Families of Probability Distributions

Given a number of probability distributions p_i(s), i = 1, …, k, we can define their α-mixture by using the α-mean. The α = −1 mixture is the conventional mixture, and α = 1 corresponds to the exponential mixture. The α-representation of a probability density function p(s) is given by

f_\alpha[p(s)] = \begin{cases} \frac{2}{1-\alpha}\, p(s)^{(1-\alpha)/2}, & \alpha \neq 1, \\ \log p(s), & \alpha = 1 \end{cases}    (3.1)
(Amari & Nagaoka, 2000). The α-mixture is defined by

\tilde{p}_\alpha(s) = c\, f_\alpha^{-1}\left( \frac{1}{k} \sum_i f_\alpha[p_i(s)] \right),    (3.2)
where the normalization constant c is necessary to make it a probability distribution. This is given by

c = \frac{1}{\int f_\alpha^{-1}\left( \frac{1}{k} \sum_i f_\alpha[p_i(s)] \right) ds}.    (3.3)
The α = −∞ mixture,

\tilde{p}_{-\infty}(s) = c \max_i \{ p_i(s) \},    (3.4)

is the optimistic integration of the component distributions, in the sense that for each s, it takes the largest value of the component probabilities. However, the α = ∞ mixture is pessimistic, taking the minimum of the component probabilities,

\tilde{p}_{\infty}(s) = c \min_i \{ p_i(s) \}.    (3.5)
The exponential mixture is much more pessimistic than the conventional mixture, in the sense that the resulting probability density becomes 0 at s if any of the components becomes 0 at s. Let us next consider weighted mixtures. The weighted α-mixture with weights w_1, …, w_k is given by

\tilde{p}_\alpha(s; w_1, \ldots, w_k) = c\, f_\alpha^{-1}\left( \sum_i w_i f_\alpha[p_i(s)] \right).    (3.6)
This is called the α-integration of p_1(s), …, p_k(s) with weights w_1, …, w_k. All the above forms of distributions give a family of probability distributions parameterized by w = (w_1, …, w_k), which connects the k component distributions p_1(s), …, p_k(s) continuously. When α = −1, this is the conventional mixture family,

\tilde{p}_{-1}(s; w) = \sum_i w_i p_i(s),    (3.7)

where w_i > 0 and \sum_i w_i = 1. When α = 1, this is the exponential family,

\tilde{p}_1(s) = \exp\left( \sum_i w_i \log p_i(s) - \psi(w) \right),    (3.8)

where the normalization constant is put at

c = \exp\{-\psi\}.    (3.9)
This depends on w, and ψ(w) is the cumulant-generating function, called the free energy in statistical physics. The α-family of probability distributions gives an α-flat Riemannian structure and has been extensively studied in information geometry (Amari & Nagaoka, 2000). Zhu and Rohwer (1995) used the α-version of the generalized Pythagorean theorem of information geometry to derive the α-optimal estimator in the Bayesian framework.
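On a discrete grid, the weighted α-mixture, equation 3.6, is the pointwise α-mean of the component densities followed by normalization; a sketch (our own helper; P stores the p_i(s) > 0 sampled on a grid with spacing ds):

function q = alpha_mixture(P, w, alpha, ds)
% P: k x N matrix whose row i samples p_i(s); w: 1 x k weights, sum(w) = 1.
if alpha == 1
    q = exp(w * log(P));           % exponential mixture, equation 3.8
else
    e = (1 - alpha)/2;
    q = (w * P.^e).^(1/e);         % pointwise weighted alpha-mean
end
q = q / (sum(q) * ds);             % normalization constant c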
4 α-Divergence

Given k distributions p_1(s), …, p_k(s), we have defined their α-mixture or α-integration. We characterize them from the point of view of optimality by using a divergence measure. To this end, let us define an (asymmetric) divergence D[p(s) : q(s)] from p(s) to q(s). It is required that

D[p(s) : q(s)] \geq 0,    (4.1)
D[p(s) : q(s)] = 0,    (4.2)

when and only when p(s) = q(s). There are many divergence functions (see, e.g., Zhang, 2004, and Minami & Eguchi, 2004). Csiszar (1975) introduced the f-divergence,

D_f[p(s) : q(s)] = \int f\left( \frac{q(s)}{p(s)} \right) p(s)\, ds,    (4.3)
for a convex function f. This is invariant under a change of scale of s ∈ R^1. The α-divergence is derived by putting

f(u) = \begin{cases} \frac{4}{1-\alpha^2}\left( u - u^{\frac{1+\alpha}{2}} \right), & \alpha \neq \pm 1, \\ u \log u, & \alpha = 1, \\ -\log u, & \alpha = -1, \end{cases}    (4.4)

so that we have

D_\alpha[p(s) : q(s)] = \begin{cases} \int p(s) \log \frac{p(s)}{q(s)}\, ds, & \alpha = -1, \\ \int q(s) \log \frac{q(s)}{p(s)}\, ds, & \alpha = 1, \\ \frac{4}{1-\alpha^2}\left( 1 - \int p(s)^{\frac{1-\alpha}{2}} q(s)^{\frac{1+\alpha}{2}}\, ds \right), & \alpha \neq \pm 1. \end{cases}    (4.5)
The α-divergence is continuous with respect to α, and the cases α = ±1 are obtained from the other D_α in the limit α → ±1. The coefficient 4/(1 − α²) in equation 4.5 guarantees that every D_α gives the same Fisher information metric when p(s) and q(s) are infinitesimally close. Obviously,

D_{-1}[p(s) : q(s)] = KL[p(s) : q(s)],    (4.6)
D_{1}[p(s) : q(s)] = KL[q(s) : p(s)],    (4.7)

where KL is the Kullback-Leibler divergence, and

D_0[p(s) : q(s)] = 2 \int \left( \sqrt{p(s)} - \sqrt{q(s)} \right)^2 ds    (4.8)
is the square of the well-known Hellinger distance. The divergences satisfy

D_\alpha[p(s) : q(s)] \geq 0,    (4.9)

with equality when and only when p(s) = q(s). They are not symmetric,

D_\alpha[p(s) : q(s)] \neq D_\alpha[q(s) : p(s)],    (4.10)

except for the case α = 0. It should be noted that D_α and D_{−α} are dual in the sense of

D_\alpha[p(s) : q(s)] = D_{-\alpha}[q(s) : p(s)].    (4.11)
We can generalize the α-divergence to a divergence between positive measures m(s) > 0 and l(s) > 0, where

\int m(s)\, ds > 0, \quad \int l(s)\, ds > 0.    (4.12)

This is given by

D_\alpha[m(s) : l(s)] = \begin{cases} \int l(s)\, ds - \int m(s)\, ds + \int m(s) \log \frac{m(s)}{l(s)}\, ds, & \alpha = -1, \\ \int m(s)\, ds - \int l(s)\, ds + \int l(s) \log \frac{l(s)}{m(s)}\, ds, & \alpha = 1, \\ \frac{2}{1+\alpha} \int m(s)\, ds + \frac{2}{1-\alpha} \int l(s)\, ds - \frac{4}{1-\alpha^2} \int m(s)^{\frac{1-\alpha}{2}} l(s)^{\frac{1+\alpha}{2}}\, ds, & \alpha \neq \pm 1. \end{cases}

In particular, the α-divergence between two positive numbers y and z is

D_\alpha[y : z] = \begin{cases} z - y + y \log \frac{y}{z}, & \alpha = -1, \\ y - z + z \log \frac{z}{y}, & \alpha = 1, \\ \frac{2}{1+\alpha}\, y + \frac{2}{1-\alpha}\, z - \frac{4}{1-\alpha^2}\, y^{\frac{1-\alpha}{2}} z^{\frac{1+\alpha}{2}}, & \alpha \neq \pm 1. \end{cases}    (4.13)
How does the α-divergence D_α[p(s) : q(s)] depend on α? When α = 0, it is the L²-norm in the space of p(s). As α increases, D_α emphasizes the part where p(s) is small; therefore, in order to make D_α small, q(s) is forced to be 0 at s where p(s) = 0. Dually, D_α emphasizes the part where q(s) is small when α ≤ −1; in order to keep D_α finite, p(s) is then forced to be 0 at s where q(s) = 0.

We now make some remarks on the α-divergence. Any f-divergence D_f[p(s) : q(s)] gives a unique Riemannian metric, which is the Fisher information matrix. It also gives a pair of α- and −α-affine connections, where α depends on f (Eguchi, 1983). In particular, the α-divergence gives the ±α-affine connections. On the other hand, given the α-structure of information geometry, the α-divergence D_α is the unique canonical divergence related to the α-flat structure; the α-divergence is a unique family of divergences in this sense. (See Zhang, 2004, for a characterization of the α-divergence, where it is viewed as a limiting case of a two-parameter family of divergence functions introduced through the difference between the weighted arithmetic mean and the weighted α-mean.)

The α-divergence was first introduced by Chernoff (1952) in order to evaluate the error probability of pattern classification. It is closely related to Renyi entropy (Renyi, 1961) and Tsallis entropy (Tsallis, 1988). There are many applications of the α-divergence (Amari & Nagaoka, 2000; Matsuyama, 2003; Toyoizumi & Aihara, 2006).
Recently, Minka (2005) applied the α-divergence measure to the problem of belief propagation (see also Ikeda, Tanaka, & Amari, 2004, for information geometry and belief propagation). Amari, Ikeda, and Shimokawa (2001) studied the mean-field approximation from the viewpoint of the α-divergence.

5 Optimality of α-Integration

Given k distributions p_1(s), …, p_k(s), their integration, denoted by q(s), is required to be as close to each of p_1(s), …, p_k(s) as possible. Let w_1, …, w_k be the weights of the p_i(s). We can then define the weighted average of the divergence from the p_i(s) to q(s),

R_D[q(s)] = \sum_i w_i D[p_i(s) : q(s)].    (5.1)
We search for the q(s) that minimizes R_D[q]. The minimizer of R_D[q(s)] is called the D-optimal integration of p_1(s), …, p_k(s) with weights w_1, …, w_k. The following is our main theorem:

Theorem 2 (Optimality of α-integration). The α-integration of probability distributions p_1(s), …, p_k(s) with weights w_1, …, w_k is optimal under the α-divergence criterion of minimizing

R_\alpha[q(s)] = \sum_i w_i D_\alpha[p_i(s) : q(s)].    (5.2)
Proof. Let us first prove the case where α ≠ ±1. Taking the variation of R_α[q(s)] under the normalizing constraint

\int q(s)\, ds = 1,    (5.3)

we derive

\delta\left( R_\alpha[q(s)] - \lambda \int q(s)\, ds \right)
  = -\sum_i w_i \frac{2}{1-\alpha} \int p_i(s)^{\frac{1-\alpha}{2}} q(s)^{-\frac{1-\alpha}{2}}\, \delta q(s)\, ds - \lambda \int \delta q(s)\, ds = 0,    (5.4)–(5.6)

where λ is the Lagrange multiplier. This gives

q(s)^{-\frac{1-\alpha}{2}} \sum_i w_i p_i(s)^{\frac{1-\alpha}{2}} = \mathrm{const},    (5.7)
and, hence, the optimal q(s) is

q(s) = c\, f_\alpha^{-1}\left( \sum_i w_i f_\alpha\{ p_i(s) \} \right).    (5.8)
When α = ±1, we obtain

\delta R_1[q(s)] = \sum_i w_i \int \log \frac{q(s)}{p_i(s)}\, \delta q(s)\, ds,    (5.9)

\delta R_{-1}[q(s)] = -\sum_i w_i \int \left( \frac{p_i(s)}{q(s)} + \mathrm{const} \right) \delta q(s)\, ds,    (5.10)
respectively. Hence, the optimal q is the α-integration for any α.

Unnormalized probabilities, that is, positive measures, might be used in the brain or elsewhere (see the information geometry of boosting; Murata, Takenouchi, Kanamori, & Eguchi, 2004). Here, the optimal integration \tilde{m}(s) of m_1(s), …, m_k(s) under the α-criterion is

\tilde{m}_\alpha(s) = f_\alpha^{-1}\left( \sum_i w_i f_\alpha\{ m_i(s) \} \right),    (5.11)
where the normalization constant is not necessary. In particular, the α-mean m_α(a_1, …, a_k) of k numbers a_1, …, a_k (with weights w_1, …, w_k) minimizes the average α-divergence (see equation 4.13),

R_\alpha(a) = \sum_i w_i D_\alpha(a_i : a).    (5.12)
6 Application to α-Integration of Experts

Let us consider a system composed of k experts M_1, …, M_k, which process a common input signal x and emit respective answers. The output of M_i is a probability distribution p_i(s|x) or a positive measure m_i(s|x). The entire system integrates these answers and provides an integrated answer. Let us assume that w_i(x) is given as the weight, or reliability, of M_i when x is the input. The α-risk of q(s) is then given by

R_\alpha[q(s)|x] = \sum_i w_i(x) D_\alpha[p_i(s|x) : q(s)].    (6.1)

The α-risk is defined similarly for a positive measure m(s|x) or a scalar a(x).
Theorem 3. The α-expert machine integrates the outputs by

q(s|x) = f_\alpha^{-1}\left( \sum_i w_i(x) f_\alpha\{ p_i(s|x) \} \right).    (6.2)

This is optimal under the α-divergence risk

R_\alpha[q(s)] = \sum_i w_i(x) D_\alpha[p_i(s|x) : q(s)].    (6.3)
Similar assertions hold for the positive measure and scalar output cases. The α = −1 machine is the mixture of experts (Jacobs et al., 1991), and the α = 1 machine is the product of experts (Hinton, 2002), by equations 3.7 and 3.8. It is important to determine the weight or reliability function w_i(x). When a teacher output q*(s) is available, one may use the soft-max function

w_i(x) = c \exp\{ -\beta D_\alpha[p_i(s) : q^*(s)] \},    (6.4)
where c is the normalization constant and β is the "inverse temperature," indicating the sharpness of the weights. However, the teacher distribution usually cannot be obtained explicitly. Therefore, we need to design a learning scheme. Since the w_i(x) are functions, they have infinitely many degrees of freedom. In such a case, we use parametric forms w_i(x, θ_i) such that the w_i are determined by parameters θ_i. A gating neural network is used in Jacobs et al. (1991). Let s* be the teacher output when x is the input, which is subject to q*(s|x). Let s be the output from the integrated distribution q(s|x), which depends on w_i(x) or θ_i, and let s_i be the stochastic output from p_i(s|x). A simple instantaneous loss function is D(s, s*), where s is the α-integration of the s_i, or an example drawn from the α-integration of the p_i(s|x) with weights w_i(x, θ_i). We then have the learning rule

\Delta \theta_i = -\eta\, \frac{\partial D(s, s^*)}{\partial \theta_i}.    (6.5)

The above rule can be simplified by using

D(s, s^*) = \frac{1}{2}(s - s^*)^2,    (6.6)
s = \sum_i w_i s_i.    (6.7)

Given s_1, …, s_k, we may use the winner-take-all (or a-few-winners-share-all) rule: the weights of the winners increase, while the losers decrease their weights. The behaviors of the component experts M_i are also subject to learning, where the behavior p_i(s_i|x; ξ_i) is parameterized by ξ_i.
The winners modify their behaviors, while the losers do not. This is a simple sketch of learning the weight distribution, and further research is necessary.
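Under the simplifications of equations 6.6 and 6.7, one step of the learning rule 6.5 for softmax-parameterized weights reduces to the following Matlab sketch (the direct parameterization w = softmax(θ) is our own stand-in for the gating network of Jacobs et al., 1991):

% One gradient step on the gating parameters theta (k x 1).
% si: expert outputs (k x 1); starget: teacher output s*; eta: learning rate.
w = exp(theta) / sum(exp(theta));   % softmax weights, sum to 1
s = w' * si;                        % integrated output, equation 6.7
g = (s - starget) * (w .* (si - s));% d/dtheta of 0.5*(s - s*)^2
theta = theta - eta * g;            % equation 6.5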
7 α-Bayesian Predictive Distribution

Let us consider the Bayesian framework, where we use a model of probability distributions S = {p(x|θ)} parameterized by θ. Here, p(x|θ) is the probability density function of a random variable x parameterized by θ. Let D = {x_1, …, x_n} be n independent observations and π(θ) a prior distribution of θ. We then have the joint distribution of θ and D,

p(D, \theta) = \pi(\theta) \prod_{i=1}^{n} p(x_i|\theta).    (7.1)
The posterior distribution of θ is given by

p(\theta|D) = \frac{p(D, \theta)}{p(D)},    (7.2)

where

p(D) = \int p(D, \theta)\, d\theta.    (7.3)
The Bayesian predictive distribution is given by

p(x|D) = \int p(x|\theta)\, p(\theta|D)\, d\theta.    (7.4)
This is a mixture of an infinite number of distributions p(x|θ) in S with weight function p(θ|D). The component distributions p_i(x), i = 1, …, k, of the present mixture framework are replaced by p(x|θ), where {θ} is an infinite set of indices. It is known that the Bayes predictive distribution is optimal in the sense of minimizing the expectation of the risk functional

R[q(x|D)] = \int \pi(\theta)\, p(D|\theta)\, KL[p(x|\theta) : q(x|D)]\, d\theta\, dD,    (7.5)

where

dD = dx_1 \cdots dx_n.    (7.6)
In other words, the above risk functional is the average KL-divergence from the true distribution p(x|θ) to the test distribution q(x|D), which depends on the data that have emerged from p(x|θ). The Bayesian predictive distribution is optimal when the KL-divergence is used to evaluate the estimator. However, the result depends on the loss function. We may use the α-divergence as the loss function in the Bayesian framework. We then have the following α-risk functional:

R_\alpha[q(x|D)] = \int \pi(\theta)\, p(D|\theta)\, D_\alpha[p(x|\theta) : q(x|D)]\, d\theta\, dD.    (7.7)
The minimizer of the α-risk is called the α-predictive distribution.

Theorem 4. The α-predictive distribution is given by

\tilde{p}_\alpha(x|D) = c\, f_\alpha^{-1}\left( \int f_\alpha[p(x|\theta)]\, p(\theta|D)\, d\theta \right),    (7.8)
which is optimal under the α-risk. (The theorem was derived by Corcuera & Giummolè, 1999, as a generalization of Komaki, 1996; see also Marriott, 2002.)
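Given posterior draws θ^(1), …, θ^(M) (e.g., from MCMC), equation 7.8 can be approximated by an equal-weight α-mixture of the sampled component densities; a sketch reusing alpha_mixture above (the Monte Carlo approximation is our illustration, not part of the letter):

% P(j,:) samples p(x | theta_j) on a grid of x with spacing dx,
% for M draws theta_j ~ p(theta | D).
M = size(P, 1);
q = alpha_mixture(P, ones(1, M)/M, alpha, dx);  % approximates equation 7.8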
8 Conclusion

Many different methods of averaging are known (Hardy et al., 1952). The α-average of numbers is the only family having the linear scale invariance property. It is based on the α-representation of numbers; its properties and optimality were elucidated. Given a number of probability distributions, their α-mixtures were derived by using the α-representations. They form the α-family of probability distributions, which is invariant from the point of view of information geometry, having geometric-invariant properties. The optimality of α-integration was explained from the point of view of the α-divergence, which includes the Kullback-Leibler divergence, its dual, and the Hellinger distance as special cases. This concept is known as Chernoff divergence and has been clarified in information geometry (Amari & Nagaoka, 2000). It is closely related to Renyi entropy, which has recently been used in physics under the name of Tsallis entropy (Tsallis, 1988). The generalization of the mixture and product of experts is a direct application of the results here. It is also interesting to note that we can expand the scope of Bayesian inference by using the α-predictive distribution, which is given by the α-mixture of candidate distributions.

What is the biological or psychophysical relevance of α-representation and α-integration? We first argue that the α-representation of sensation is widely known in psychophysical experiments. The Weber-Fechner law states that our perception of signals is on a logarithmic scale. This suggests that the α = 1 representation is used in the brain. The Stevens law, on the other hand, suggests that α-representation is also used in the brain. In the nineteenth century, J. Plateau investigated (see Falmagne, 1985) how the brain averages two light intensities, which leads to the α-mean. This story, together with later developments, is reviewed in Heller (2006). The functional equations A.1 and A.2 in the appendix were derived to explain the midway (average) of two gray sensations.

Let us next discuss the possibility of α-integration in the brain. It is believed that visual motion is represented in the MT or MST area in the form of population coding. When we see a pattern in which random dots are moving coherently in a common direction, we feel that the entire field is moving in that direction. It is plausible that such motion is represented by a unimodal firing pattern in a neural field in which each position represents the direction of motion. This is population coding, and such an excitation pattern might represent an unnormalized probability distribution of possible directions. When there are two coherent groups of random dots in a pattern—one group moving in one direction and the other group in another—we feel as if there were two transparent fields moving in different directions. Such motion is represented in population coding by an excitation pattern with two peaks. For example, if component motion in each direction is represented by a gaussian distribution, their mixture (α = −1 mixture) gives a gaussian mixture that has two peaks, provided the positions of the component directions are sufficiently separated. When such motion suddenly disappears, one perceives an illusory motion aftereffect such that the visual field is moving in the opposite direction. This is called the waterfall illusion. However, when we see transparent motion moving in two directions, we perceive two different directions of motion, but surprisingly its motion aftereffect is different: we perceive a unidirectional illusory motion aftereffect in the direction of the vector addition of the component velocity vectors. When each direction is represented by a gaussian distribution, their mixture (α = −1) is bimodal, while the exponential mixture (α = 1 mixture) is always a unimodal gaussian. This therefore suggests that the brain also uses the α = 1 mixture (exponential mixture). The brain makes use of various integrations depending on its purpose. The Stevens law supports the use of α-representation in the brain. Experimental studies with theoretical considerations (Hida, Ohno, Hashimoto, Amari, & Saito, 2006) have also been done on the illusory motion aftereffect of transparent motion fields.

Appendix: Derivation of the Scale-Free f-Mean

Let f be a monotonically increasing differentiable function satisfying

f(m) = \frac{1}{2} \{ f(x) + f(y) \},    (A.1)
f(cm) = \frac{1}{2} \{ f(cx) + f(cy) \},    (A.2)
for any c > 0, where

m(x, y) = f^{-1}\left( \frac{1}{2} ( f(x) + f(y) ) \right).    (A.3)
By differentiating equations A.1 and A.2 with respect to x, we derive

f'(m)\, m' = \frac{1}{2} f'(x),    (A.4)
c\, f'(cm)\, m' = \frac{1}{2}\, c\, f'(cx),    (A.5)

where

m' = \frac{\partial}{\partial x} m(x, y).    (A.6)
From these we derive

\frac{f'(cx)}{f'(x)} = \frac{f'(cm)}{f'(m)}.    (A.7)

Since m may take an arbitrary value depending on y,

\frac{f'(cx)}{f'(x)} = g(c)    (A.8)

does not depend on x, depending only on c. By substituting x = 1 and a = f'(1), we have g(c) = f'(c)/a. Let us put a = 1 tentatively and further put

h(x) = \log f'(x).    (A.9)
We then have

h(cx) = h(c) + h(x).    (A.10)

The solution to equation A.10 is

h(x) = b \log x,    (A.11)
where b is an arbitrary constant. This gives

f'(x) = x^b.    (A.12)

By integration, we finally derive

f(x) = \begin{cases} x^{\frac{1-\alpha}{2}}, & \alpha \neq 1, \\ \log x, & \alpha = 1, \end{cases}    (A.13)
neglecting the constant of proportionality, where we put α = −2b − 1; α = 1 (b = −1) is the special case giving log x.

Remark. Due to the integration constant and f'(1), for constants a, b,

\tilde{f}_\alpha(c) = a f_\alpha(c) + b    (A.14)

yields the same solution, but the α-mean is the same regardless of a and b.
Acknowledgments

I thank Jun Zhang of the University of Michigan at Ann Arbor for his discussions and for pointing out the classical work of Hardy, Littlewood, and Pólya and of Plateau.

References

Amari, S., Ikeda, S., & Shimokawa, H. (2001). Information geometry of α-projection in mean field approximation. In M. Opper & D. Saad (Eds.), Advanced mean field methods: Theory and practice (pp. 241–257). Cambridge, MA: MIT Press.
Amari, S., & Nagaoka, H. (2000). Methods of information geometry. Providence, RI: American Mathematical Society; New York: Oxford University Press.
Chernoff, H. (1952). A measure of asymptotic efficiency for tests of a hypothesis based on a sum of observations. Annals of Mathematical Statistics, 23, 493–507.
Corcuera, J. M., & Giummolè, F. (1999). A generalized Bayes rule for prediction. Scandinavian Journal of Statistics, 26, 265–279.
Csiszár, I. (1975). I-divergence geometry of probability distributions and minimization problems. Annals of Probability, 3, 146–158.
Dayan, P., Hinton, G., Neal, R., & Zemel, R. (1995). The Helmholtz machine. Neural Computation, 7, 889–904.
Eguchi, S. (1983). Second order efficiency of minimum contrast estimators in a curved exponential family. Annals of Statistics, 11, 793–803.
Falmagne, J.-C. (1985). Elements of psychophysical theory. New York: Clarendon Press.
Hardy, G. H., Littlewood, J. E., & Pólya, G. (1952). Inequalities (2nd ed.). Cambridge: Cambridge University Press.
Heller, J. (2006). Illumination-invariance of Plateau's midgray. Journal of Mathematical Psychology, 50, 263–270.
Hida, E., Ohno, H., Hashimoto, N., Amari, S., & Saito, H. (2006). Neural representation for perception of wide-field visual flow in MST: Bidirectional transparent motion and its illusory aftereffect. Manuscript submitted for publication.
Hinton, G. E. (2002). Training products of experts by minimizing contrastive divergence. Neural Computation, 14, 1771–1800.
Ikeda, S., Tanaka, T., & Amari, S. (2004). Stochastic reasoning, free energy, and information geometry. Neural Computation, 16, 1779–1810.
Jacobs, R. A., Jordan, M. I., Nowlan, S. J., & Hinton, G. E. (1991). Adaptive mixtures of local experts. Neural Computation, 3, 79–87.
Jordan, M. I., & Jacobs, R. A. (1994). Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6, 181–214.
Komaki, F. (1996). On asymptotic properties of predictive distributions. Biometrika, 83, 299–313.
Marriott, P. (2002). On the local geometry of mixture models. Biometrika, 89, 79–93.
Matsuyama, Y. (2003). The alpha-EM algorithm: Surrogate likelihood maximization using alpha-logarithmic information measures. IEEE Transactions on Information Theory, 49, 692–706.
Minami, M., & Eguchi, S. (2004). Robust blind source separation by beta-divergence. Neural Computation, 14, 1859–1886.
Minka, T. (2005). Divergence measures and message passing (Tech. Rep. MSR-TR-2005-173). Redmond, WA: Microsoft Research.
Murata, N., Takenouchi, T., Kanamori, T., & Eguchi, S. (2004). Information geometry of U-boost and Bregman divergence. Neural Computation, 16, 1437–1481.
Petz, D., & Temesi, R. (2005). Means of positive numbers and matrices. SIAM Journal on Matrix Analysis and Applications, 27, 712–720.
Rényi, A. (1961). On measures of entropy and information. In Proceedings of the 4th Berkeley Symposium on Mathematical Statistics and Probability. Berkeley: University of California Press.
Toyoizumi, T., & Aihara, K. (2006). Generalization of the mean-field method for power-law distributions. International Journal of Bifurcation and Chaos, 16, 129–135.
Tsallis, C. (1988). Possible generalization of Boltzmann-Gibbs statistics. Journal of Statistical Physics, 52, 479–487.
Wolpert, D., & Kawato, M. (1998). Multiple paired forward and inverse models for motor control. Neural Networks, 11, 1317–1329.
Xu, L. (2004). Advances on BYY harmony learning: Information theoretic perspective, generalized projection geometry, and independent factor autodetermination. IEEE Transactions on Neural Networks, 15(4), 885–902.
Zhang, J. (2004). Divergence function, duality and convex analysis. Neural Computation, 16, 159–195.
Zhu, H., & Rohwer, R. (1995). Bayesian invariant measurements of generalization. Unpublished manuscript. Available online at http://www.santafe.edu/research/publications/wpabstract/199806045
Received May 25, 2006; accepted October 4, 2006.
LETTER
Communicated by Martin Stemmler
Distortion of Neural Signals by Spike Coding

David H. Goldberg
[email protected]
Department of Physiology and Biophysics, Weill Medical College of Cornell University, New York, NY 10021, U.S.A.

Andreas G. Andreou
[email protected]
Department of Electrical and Computer Engineering, Johns Hopkins University, Baltimore, MD 21218, U.S.A.
Analog neural signals must be converted into spike trains for transmission over electrically leaky axons. This spike encoding and subsequent decoding leads to distortion. We quantify this distortion by deriving approximate expressions for the mean square error between the inputs and outputs of a spiking link. We use integrate-and-fire and Poisson encoders to convert naturalistic stimuli into spike trains, and spike count and interspike interval decoders to generate reconstructions of the stimulus. The distortion expressions enable us to compare these spike coding schemes over a large parameter space. We verify that the integrate-and-fire encoder is more effective than the Poisson encoder. The disparity between the two encoders diminishes as the stimulus coefficient of variation (CV) increases, at which point the variability attributed to the stimulus overwhelms the variability attributed to Poisson statistics. When the stimulus CV is small, the interspike interval decoder is superior, as the distortion resulting from spike count decoding is dominated by a term that is attributed to the discrete nature of the spike count. In this regime, additive noise has a greater impact on the interspike interval decoder than the spike count decoder. When the stimulus CV is large, the average signal excursion is much larger than the quantization step size, and spike count decoding is superior.

1 Introduction

The inputs and outputs of neuronal computations are continuous signals expressed as currents, voltages, and chemical concentrations. These signals are propagated throughout the brain by a massively connected communication network. Because axons—the wires of the nervous system—are electrically leaky, these signals must be encoded as spikes, which are amenable to active restoration and transmission over long distances.

Neural Computation 19, 2797–2839 (2007)
C 2007 Massachusetts Institute of Technology
In this letter, we explore how spike encoding and decoding distorts neural signals, and we consider hypothetical strategies that neurons may use to minimize this distortion. We examine integrate-and-fire and Poisson encoding schemes. We also examine two spike decoding schemes: spike count decoding, in which the decoded signal is proportional to the number of spikes that occur in a sliding window, and interspike interval decoding, in which the decoded signal is proportional to the instantaneous spike rate as computed from the inverse of the time between successive spikes. We compare the various encoding and decoding combinations by quantifying their ability to reconstruct a naturalistic stimulus. A comparison of spike codes can tell us which codes are most effective and, in turn, are most likely to be employed by neural systems.

The notion that neural systems are organized to realize efficient representations of information dates back to the seminal work of Attneave (1954) and Barlow (1961). The organism that can process information the fastest with a fixed amount of resources can react more quickly to a constantly changing world and is better fit to survive and procreate (Laughlin, Anderson, O'Carroll, & de Ruyter van Steveninck, 1999). Similarly, the organism that can process a fixed amount of information with fewer resources is also fitter. For these reasons, it is hypothesized that evolution has optimized neural systems for efficiency. The optimization criteria may be determined by the function of the neural system or the constraints that guided its development (Laughlin, 2001; Chklovskii, Schikorski, & Stevens, 2002).

One way to quantify the transfer of information over a spiking link is to compute its channel capacity, the maximum information (per unit time) that can be transmitted over the link. In information theory, a channel is characterized by an input alphabet, an output alphabet, and a probabilistic mapping between the symbols in the two alphabets. The channel capacity for a spiking link that uses alphabets of interspike interval lengths has been derived by MacKay and McCulloch (1952). Stein (1967) derived the capacity when the alphabets consist of the number of spikes in a window. Stein also compared these capacities and demonstrated that the interspike interval alphabet far outperforms the spike count alphabet^1 (see his Figure 7; see also Softky, 1996). An analysis based on channel capacity assumes that the encoder exploits the full capacity of the channel and that the decoder is capable of recovering all of the information present in the spike train. However, a spiking link that is consistent with the efficient coding hypothesis will not achieve the channel capacity if a great deal of resources are required to achieve capacity. For this reason, we quantify the transfer of information over a spiking link by computing the distortion introduced by biologically plausible encoders and decoders.
1 Stein refers to these schemes as codes, but they are not codes in the formal sense. We consider a code to be an encoding function and its corresponding decoding function.
1.1 Previous Work. Spiking links can be a challenge to analyze. Even the simplest spike encoder, the integrate-and-fire encoder, has rectifying characteristics (negative currents are poorly encoded) and a nonlinear spike generation mechanism. In pioneering work, Knight (1972) used a small-signal approximation to analyze the encoding of analog information into a spike train by an integrate-and-fire mechanism. An analytical transfer function for the integrate-and-fire encoder was found in the limiting case where many encoders capture different phases of an analog input. Stein, French, and Holden (1972) found an analytical expression for the interspike interval (ISI) distribution, and in turn the spike train power spectral density, for the case where the analog input has a white power spectral density. An analytical expression for the output of an integrate-and-fire encoder driven by a stimulus current with correlations has yet to be found. Expressions for the moments of the ISI distribution in this case have been found only recently (e.g., Salinas & Sejnowski, 2002; Lindner, 2004).

Spike decoding has been investigated by several groups. Bialek, Rieke, de Ruyter van Steveninck, and Warland (1991) used a reconstruction paradigm to analyze the ability of spike trains to convey information about a continuous stimulus. They applied this approach to a motion-sensitive neuron in the blowfly visual system and were able to find a linear filter that, when applied to the spike train, gave good reconstructions of the stimulus. August and Levy (1996) used simulations to evaluate the ability of linear and nonlinear decoders to reconstruct a stimulus from a spike train. Gabbiani (1996) modeled the encoding of a continuous stimulus as a nonhomogeneous Poisson spike train and examined how the form of the optimal linear decoding filter was influenced by changes in the stimulus statistics and the firing rate. Manwani, Steinmetz, and Koch (2001) extended this work to integrate-and-fire and stochastic ion channel encoders and explored the effects of spike timing variability on the amount of information transmitted. However, neither of these studies addressed the impulsive nature of the spike train or nonlinear decoding methods.

It has recently been demonstrated that signals represented by the output of an integrate-and-fire encoder can be perfectly recovered, provided that the interspike interval is bounded by the inverse of the Nyquist rate of the stimulus (Lazar, 2004; Lazar & Tóth, 2004). In that work's original formulation, the reconstruction required computing the pseudo-inverse of an infinite-dimensional matrix, but subsequent formulations make use of computational primitives found in neural systems (Lazar, 2006).
1.2 Overview. In section 2, we describe the general framework we use to quantify the performance of spike codes. Our metric is the mean square error distortion between a naturalistic stimulus and a reconstruction based on spike-encoded versions of the stimulus.
We derive analytical expressions for the distortion between the stimulus and reconstructions obtained from a spike count decoder in section 3 and an interspike interval decoder in section 4. The expressions reveal how parameters such as the stimulus variability, stimulus bandwidth, mean ISI length, latency of the spiking link, noise magnitude, and noise bandwidth influence spike communication. Simulations are used to ensure that the approximations made in the derivations are valid. In general, models are preferred over simulations because the models explain how the distortion is influenced by the parameters, whereas simulations give only the total distortion. Furthermore, obtaining simulation results over a large parameter space can be computationally intensive.

In section 5, we use the expressions derived in sections 3 and 4 to compare the spike encoders and decoders over a large range of parameters. In section 6, we discuss the implications of the results for the study of spike coding in neural systems and conclude the letter.

2 General Framework

We employ a simple model of a spiking link (see Figure 1A). We start with a stimulus S, an analog signal that corresponds to the activity of the neuron at the soma.2 This signal could arise from a sensory stimulus or the summed activity from neurons in a preceding processing stage. The stimulus is combined with additive noise N, which corresponds to input that is unrelated to the stimulus. The encoder, which corresponds to the axon hillock, transforms the input S + N into a spike train X. The decoder, which corresponds to the dendrite of the efferent neuron, transforms the spike train into a train of postsynaptic potentials Ŝ. This framework captures many elements of neural communication, yet it is simple enough to provide analytical results.
(2.1)
We acknowledge that the goal of sensory processing is not communication— the preservation of information—but computation—the selective reduction of information. However, for the purpose of this analysis, we adopt a simplified view that all neuronal computation occurs when information from many neurons is combined and that the spike generation mechanism is
2 Uppercase italicized letters without a succeeding argument denote a random variable. Lowercase italicized letters followed by an argument t denote a time domain representation of a signal. Uppercase italicized letters followed by an argument ω denote a frequency domain representation of a signal.
Distortion of Neural Signals by Spike Coding
2801
used only for the purpose of communication. The reconstruction paradigm then provides an upper limit on the amount of information that is available for neuronal computation and facilitates a straightforward comparison of spike coding schemes. 2.1 Stimulus and Noise. The stimulus is an electric current S that is characterized by naturalistic statistics—a gaussian distribution (mean µ S and variance σ S2 ) and a Lorentzian power spectral density with bandwidth ω S . It is helpful to characterize the variability of the stimulus by its coefficient of variation (CV): CV S = σ S /µ S .
(2.2)
The power spectral density and corresponding autocorrelation are given by SSS (ω) =
2ω S σ S2 + 2πµ2S δ(ω) ←→ RSS (τ ) = σ S2 e −ωS |τ | + µ2S . ω2S + ω2
(2.3)
We have chosen a Lorentzian stimulus power spectral density because it is the simplest form that has a naturalistic correlation structure, such as is found in natural visual stimuli (Dong & Atick, 1995). The noise N is also an electric current that has a gaussian distribution (zero mean and variance σ N2 ). Because the noise is superimposed on a background of µ S , we define its CV by CV N = σ N /µ S .
(2.4)
The power spectral density and corresponding autocorrelation are given by SNN (ω) =
σ N2 π/ω N , |ω| ≤ ω N 0,
|ω| > ω N
←→ RNN (τ ) =
σ N2 sin ω N τ . ωNτ
(2.5)
The white noise power spectral density was chosen for its simplicity. The stimulus and noise power spectral densities are plotted for nominal parameters in Figure 1B. For the sake of brevity, we confine our analysis to Lorentzian stimuli and band-limited white noise, but our approach can be applied to signals with other spectral characteristics (e.g., band-limited white stimuli and Lorentzian noise.)
2802
D. Goldberg and A. Andreou
A
S
Spike encoder
Σ
Stimulus
N
Noise
X Spike train
Spike decoder
S Reconstruction
Power spectral density (µA2/Hz)
B 0
10
Stimulus Noise
10
10
10 10
0
10 Frequency ω (rad/sec)
1
10
Figure 1: (A) Schematic depiction of a spiking link. (B) Power spectral densities for stimulus and noise for µ S = 1 µA, σ S = 0.25 µA, ω S = 1 rad/sec, σ N = 0.25 µA, ω N = 4 rad/sec. The stimulus power spectral density has an impulse at ω = 0 of height 2π µ2S that corresponds to the component at zero frequency (not shown).
2.2 Spike Encoding. A spike train X can be expressed in terms of its spike times {tkX } or its ISI lengths {TkX }, where the kth ISI length is given by X TkX = tkX − tk−1 .
(2.6)
The spike train can be expressed as a sum of delta functions shifted by the spike times {tkX }, x(t) =
δ t − tkX .
(2.7)
k
The variability of the spike train can be characterized by its ISI length CV, CVT = σT /µT ,
(2.8)
where µT and σT are the mean and standard deviations of the ISI lengths, respectively.
Distortion of Neural Signals by Spike Coding
2803
2.2.1 Integrate-and-Fire Encoding. The first spike encoder we consider is a nonleaky integrate-and-fire encoder. In an electric circuit description of the encoding process, the current S + N charges a capacitance C. When the potential V across the capacitor reaches the threshold θ , a spike is emitted, and V is reset to zero. The potential cannot go below zero. If S + N is strictly X positive during the interval [tk−1 , tkX ], then V is given by v(t) =
1 C
X tk−1 +t X tk−1
[s(t ) + n(t )]dt .
(2.9)
By setting v(t) = θ and t = TkX , we can obtain the ISI lengths {TkX } as a recursive function of S + N, θ=
1 C
X tk−1 +TkX
X tk−1
[s(t ) + n(t )]dt .
(2.10)
The ISI length that results from encoding an input with constant, positive magnitude µ S is given in terms of the integrate-and-fire parameters by µ ˆ T = Cθ/µ S .
(2.11)
When S + N is positive, µ ˆ T is equal to the mean ISI length µT . We will often use µ ˆ T as an approximation for µT , as S + N is rarely negative when the stimulus and noise CV are low. A general analytical expression for the output of an integrate-and-fire encoder has not been found. In this letter, we obtain approximate expressions for the distortion that do not require knowledge of the spike times by making two assumptions. First, we assume that the input to the integrate-and-fire encoder S + N is rarely negative. This assumption is true 3 if CV S + CV N < ∼1. Second, we assume that the resulting spike train is regular, that is, CVT < ∼1. Knight (1972) used a similar approach to find an expression for the transfer function from a stimulus to the instantaneous spike rate. When the input (i.e., stimulus plus noise) is white and band
3 Throughout the letter, we use the expression <1 (read as “approximately less than 1”) ∼ to describe the requirements that certain quantities must satisfy in order for our models to be valid. It is intended to convey the observation that as the value of these quantities increases, the disparity between the approximations and simulations increases. While the increase of the disparity is a smooth, graded process, the value of 1 was chosen because it is at that order of magnitude that the disparity empirically becomes large enough to influence the results of a comparison of different codes.
2804
D. Goldberg and A. Andreou
limited (as is the case when µs > 0, σ S = 0, and σ N > 0), the spike train CV is given by σT = CVT = µT
2Cθ σ N2 µ3S ω N Cθ/µ S
σN = µS
2 µT ω N
(2.12)
(Gerstein & Mandelbrot, 1964). Recent work on dichotomous noise (Lindner, 2004) suggests, and simulations confirm, that when the input has a Lorentzian power spectral density with bandwidth ω S , the spike train CV has a related form, σT σS = CVT = µT µS
2 f (. . .), µT ω S
(2.13)
where f (. . .) is some function of the model parameters. If the expressions in equations 2.12 and 2.13 are small, then the assumption of a regular spike train with small perturbations is valid. 2.2.2 Poisson Encoding. We also consider an encoder that generates spike trains characterized by a nonhomogeneous Poisson process. Poisson statistics adequately describe the variable discharge of some neurons and are straightforward to analyze. To encode the input as a nonhomogeneous Poisson spike train, we generate a homogeneous Poisson process U with rate 1/µT and then warp the spike times by the inverse of cumulative integral of the input signal S + N:
[s(t ) + n(t )]+ dt µS 0 tkX = −1 tkU .
(t) =
t
(2.14) (2.15)
The [·]+ operator indicates half-wave rectification, as negative inputs are not considered. This approach can also be used with a periodic spike train to generate the output of an integrate-and-fire encoder (Johnson, 1996). The derivations of the distortion expressions associated with Poisson encoding build on the approach we employ for integrate-and-fire encoding. For this reason, the constraints on the input and input-induced perturbations in the spike train described above apply here as well. 2.3 Spike Decoding. The first spike decoder we analyze is based on the spike count in a sliding window, and the second is based on the inverse of the ISI lengths. We do not claim that these decoders are in any way optimal; rather, they were chosen for their simplicity and analytical tractability. In
Distortion of Neural Signals by Spike Coding
2805
both cases, the decoders are formulated such that they are unbiased estimators of the stimulus. When we refer to the distortion of a particular encoder or decoder, we mean the distortion introduced by spike communication in a link that employs that encoder or decoder. The decoding processes are defined as noncausal transformations, meaning that the reconstruction at time t requires information from the past on up to some future time t + τ L . To make the decoder physically realizable, we assume a latency τ L between the stimulus and the reconstruction. Therefore, when we express the reconstructed signal as sˆ (t), we make an implicit assumption that this reconstruction is not available until time t + τ L . 3 Distortion of Spike Count Decoder In this section, we derive analytical expressions for the distortion introduced by encoding the stimulus as a spike train and subsequently decoding the spike train with a linear filter Ᏻ. While the expressions are valid for any linear filter, we focus on the distortion of a decoder that convolves the spike train with a rectangular impulse response, which corresponds to a spike counting operation. The impulse response and transfer function of this decoding filter are given by
rectτ M +τL [t + (τ M − τ L )/2] ←→ τM + τL
sin[ω(τ M + τ L )/2] jω(τ M −τL )/2 G SC (ω) = µ S µT , e ω(τ M + τ L )/2 gSC (t) = µ S µT
(3.1)
where rectT (t) is a rectangular function of unit height and width T, centered at t = 0. The zero frequency gain of the filter G(0) is set to µ S µT in order to make the reconstruction unbiased when the decoder operates on a spikes of unit height generated by a constant input of µ S . The filter is characterized by two parameters, the memory τ M and the latency τ L , which describe how far into the past and in the future, respectively, the filter looks to generate the reconstruction (see Figure 2). Because the decoders are constrained to be causal, τ L gives the delay between the stimulus and the reconstruction. The reconstruction is obtained by passing the spike train X through a linear filter Ᏻ, Sˆ = Ᏻ{X}.
(3.2)
This filtered point process is not straightforward to analyze. Therefore, we approximate the reconstruction as a superposition of three simpler random processes. The components capture fluctuations in the reconstruction
2806
D. Goldberg and A. Andreou
gSC(t) µ S µT τM+τL
- τM
0
τL
t
Figure 2: Impulse response of the spike count decoder. The filter is characterized by the memory τ M and the latency τ L .
attributable to the stimulus, the noise, and the impulsive nature of the spike train, respectively.
S − µS N + +P , Sˆ ≈ Ᏻ µ S µT µ S µT
(3.3)
An impulse train P with period µT is used in the case of an integrate-andfire encoder, p(t) =
∞
δ(t − kµT ).
(3.4)
k=−∞
In the case of a Poisson encoder, we replace P with a homogeneous Poisson process U with mean ISI length µT . (For the remainder of the derivation, we will consider the integrate-and-fire encoder case, while noting that the Poisson encoder case proceeds analogously.) The intuition behind equation 3.3 is that when S + N is above (below) µ S , P is superimposed on a positive (negative) signal (S − µ S + N)/µ S µT to compensate for the fact that its rate 1/µT is lower (higher) than that of the actual spike train (see Figure 3). In addition to the requirements described in section 2.2.1 (CV S +CV N < ∼1, CVT < ∼1), g(t) must be smooth. The approximation in equation 3.3 is discussed in more detail in section A.1 in the appendix. For convenience, we define the following variables: S˜ = S − µ S
˜ S µT } S˜ = Ᏻ{ S/µ
N = Ᏻ{N/µ S µT }
P = Ᏻ{P}.
(3.5)
Then equation 3.3 becomes Sˆ ≈ S˜ + N + P .
(3.6)
Distortion of Neural Signals by Spike Coding
2807
Figure 3: Illustration of integrate-and-fire encoding followed by decoding with a linear filter that has a rectangular impulse response, along with an approximation for the reconstruction that can be used to find an analytical expression for the distortion. (A) Stimulus signal S. (B) Spike train X. (C) Reconstruction ˆ (D) Signal S˜ , which results from subtracting the mean from the stimulus and S. filtering with the decoder. (E) Signal P , which results from filtering a periodic pulse train with the decoding filter. (F) Approximation of the reconstruction Sˆ given by summing S˜ and P .
The distortion can be found as follows: ˆ ≈ MSE( S˜ + µ S , S˜ + N + P ) MSE(S, S)
(3.7)
≈ E[( S˜ + µ S )2 − 2( S˜ + µ S )( S˜ + N + P ) + ( S˜ + N + P )2 ]
2808
D. Goldberg and A. Andreou
˜ S˜ ) + MSE(µ S , P ) ≈ MSE( S, + 2µ S µ S˜ − 2µ S µ S˜ − 2(σ SP ˜ + σ SN ˜ ) + 2(σ S˜ P + σ S˜ N + σ N P ) − 2µ S µ N + σ N2 + µ2N .
(3.8)
Because S˜ and N are zero mean by definition, all terms with µ S˜ and µ N drop out. Additionally, the means of the filtered signals S˜ and N drop out. Because the stimulus-related signals, the noise, and the impulse train are mutually independent, all of the corresponding covariance terms drop out. The mean of P is given by µ P = 1/µT (see section A.2), and the mean of P is µ P = G(0)/µT = µ S . This leaves us with ˆ ≈ MSE( S, ˜ S˜ ) + σ N2 + MSE(µ P , P ). MSE(S, S)
(3.9)
We recognize the final term as the variance of P and note that if we add µ S back into S˜ and S˜ , the MSE in the first term will not change. Equation 3.9 becomes ˆ ≈ MSE(S, S ) + σ N2 + σ P2 , MSE(S, S)
(3.10)
where S = S˜ + µ S = Ᏻ{S/G(0)}. The three terms in equation 3.10 correspond to components of the distortion. The first component, the stimulus2 related distortion εstim,SC = MSE(S, S ), quantifies the failure of the broad decoding window to capture high-frequency information in the stimulus. 2 The second component, the noise-induced distortion εnoise,SC = σ N2 , quantifies the impact of the additive noise, which is independent of the stimulus. 2 The final component is the impulsive distortion εimp,SC = σ P2 , which describes the high-frequency distortion introduced by the spike train itself. We next derive analytical expressions for these three terms. 3.1 Stimulus-Related Distortion. The stimulus-related distortion is given by 2 εstim,SC = MSE(S, S ) = E[S2 ] − 2E[SS ] + E[S2 ].
(3.11)
Let us define a normalized version of the decoding filter: ¯ g¯ (t) = g(t)/G(0) ←→ G(ω) = G(ω)/G(0).
(3.12)
We have E[S2 ] = σ S2 + µ2S , and the remaining terms in equation 3.11 are given by 1 E[SS ] = 2π
∞ −∞
¯ SSS (ω)G(ω)dω
(3.13)
Distortion of Neural Signals by Spike Coding
E[S2 ] =
1 2π
∞
−∞
2809
2 ¯ |G(ω)| SSS (ω)dω.
(3.14)
Equations 3.13 and 3.14 are evaluated for G SC (ω) in section A.3. The results are then plugged into equation 3.11 to obtain the stimulus-related distortion, 2 εstim,SC
2[ω S (τ M + τ L )(1 − e −ωS τ M − e −ωS τL ) − e −ωS (τ M + τL ) + 1] . (3.15) = σ S2 1 − [ω S (τ M + τ L )]2
3.2 Noise-Induced Distortion. To find the distortion resulting from the additive white noise, we simply find the variance of the white noise after it ¯ is filtered by G(ω), 2 εnoise,SC = σ N2 =
1 2π
=
1 2π
∞
−∞
∞
−∞
SN N (ω)dω 2 ¯ |G(ω)| SNN (ω)dω.
(3.16)
We plug G SC (ω) into equation 3.16 to obtain the distortion: 2 εnoise,SC
1 = 2π
ωN
−ω N
sin[ω(τ M + τ L )/2] ω(τ M + τ L )/2
2
σ N2 π dω. ωN
b
b) + a 12 b [cos(2a b) − 1] (obtained Using the relation −b [(sin ax)/ax]2 dx = 2Si(2a a from Maple; Maplesoft, Waterloo, Canada), we find
2 εnoise,SC =
2σ N2 cos[ω N (τ M + τ L )] − 1 Si[ω N (τ M + τ L )] + , ω N (τ M + τ L ) ω N (τ M + τ L ) (3.17)
x where Si(x) = 0 sin t/t dt. If we assume that ω N 1/(τ M + τ L ), we can use ∞ the simpler relation −∞ [(sin ax)/ax]2 dx = π/a to obtain 2 εnoise,SC ≈
σ N2 π . ω N (τ M + τ L )
(3.18)
3.3 Impulsive Distortion. The impulsive component of the distortion is given by the variance of a filtered impulse train. For the integrate-and-fire encoder, the input to the filter is a periodic impulse train P. For a Poisson encoder, the input is a homogeneous Poisson impulse train U.
2810
D. Goldberg and A. Andreou
Distortion of Neural Signals by Spike Coding
2811
3.3.1 Integrate-and-Fire Encoder. The autocorrelation and power spectral density of a periodic impulse train P are derived in section A.2: RP P (τ ) =
∞ 1 δ(τ − kµT ) ←→ µT k=−∞
SP P (ω) =
∞ 2π δ(ω − 2πk/µT ). µ2T k=−∞
(3.19)
The variance of P , and in turn the distortion due to impulsive effects, is given by integrating the power spectral density,
2 εimp,SC
=
σ P2
∞ 1 = SP P (ω)dω − µ2P 2π −∞
∞ 1 = |G(ω)|2 SP P (ω)dω − µ2P 2π −∞
∞ 2πk 2 2 G = 2 . µT µT
(3.20)
k=1
We plug G SC (ω) into equation 3.20 to obtain the impulsive distortion (see section A.4 for details), 2 εimp,SC =
1 6
µ S µT τM + τL
2 .
(3.21)
Figure 4: Distortion of the spike count decoder. Nominal parameters: µ S = 1 µA, σ S = 0.25 µA, ω S = 1 rad/sec, C = 1 µF, θ = 1 V, τ M = τ L = 1 sec , σ N = 0.25 µA, ω N = 4 rad/sec. Simulation data were computed over 20 trials of 10 seconds each with sampling rate of 1 kHz. Error bars indicate one standard deviation above and below the mean. (A) Distortion components as a function of decoding window length τ M + τ L . (B) Distortion components as a function of mean spike rate µT . (C) Distortion components as a function of stimulus variance σ S2 . (D) Comparison of analytical expressions for distortion (solid line) with simulations (dots with error bars) as a function of window length τ M + τ L . (E) Comparison of analytical expressions for distortion with simulations as a function of mean spike rate µT . (F) Comparison of analytical expressions for distortion with simulations as a function of stimulus variance σ S2 . (G) Optimal ∗ as a function of latency τ L for the integrate-and-fire encoder. The memory τ M dashed line corresponds to τ M = τ L .
2812
D. Goldberg and A. Andreou
This distortion can be thought of as the distortion that arises from the quantization imposed by the discrete nature of spike counts. 3.3.2 Poisson Encoder. The autocorrelation and power spectral density of a homogeneous Poisson impulse train U are given by RUU (τ ) =
1 1 1 2π δ(τ ) + 2 ←→ SUU (ω) = + 2 δ(ω). µT µ µT µT T
(3.22)
Following the steps from section 3.3.1, we obtain σU2 = µ2S
µT τM + τL
.
(3.23)
For convenience, we will express the distortion of Poisson encoding as the amount of impulsive distortion that exceeds the impulsive distortion for integrate-and-fire encoding, 2 εpoiss,SC = µ2S
µT τM + τL
1−
µT . 6(τ M + τ L )
(3.24)
3.4 Overall Behavior. The behavior of the distortion components is mostly easily understood as a function of the counting window length τ M + τ L (plotted in Figure 4A for nominal parameters, including τ M = τ L ), the mean ISI length µT (see Figure 4B), and the stimulus variance σ S2 (see Figure 4C). 2 The stimulus-related distortion εstim,SC increases as τ M + τ L increases, as a longer counting window smooths the stimulus to a greater extent. The 2 noise-induced distortion εnoise,SC decreases as τ M + τ L increases because a longer counting window averages out the noise to a greater extent. Both 2 2 εstim,SC and εnoise,SC are flat as a function of µT because these distortion terms account only for the smoothing effects of the decoding window and 2 not characteristics of the spike train. Because εstim,SC is the only term that depends on the dynamics of the stimulus, it is the only term that varies with σ S2 . The impulsive nature of individual spikes has less of an effect on the 2 reconstruction when there are more spikes in the window, so εimp,SC decreases as a function of τ M + τ L and increases as a function of µT . In the case of spike counting, impulsive distortion can be thought of as quantization noise attributable to the discrete nature of spike counts, so including more spikes in the window means a smaller quantization step and lower distortion. 2 The Poisson distortion εpoiss,SC —the extra impulsive distortion introduced when a Poisson encoder is used in place of an integrate-and-fire
Distortion of Neural Signals by Spike Coding
2813
encoder—is positive as long as the counting window is at least one-sixth of the length of the mean ISI. A counting window shorter than the mean ISI length is not realistic, so the integrate-and-fire encoder can be considered superior to the Poisson encoder for all practical parameter sets. In fact, the 2 Poisson distortion εpoiss,SC dominates the distortion for the nominal parameters used in Figure 4. Because the Poisson distortion is a form of impulsive distortion, its behavior as a function of τ M + τ L and µT is similar to that of 2 εimp,SC , except that the slopes are steeper. The total distortion expressions for integrate-and-fire encoder (without and with noise) and the Poisson encoder (without noise) closely match simulation results for these nominal parameter values (shown as a function of τ M + τ L in Figure 4D and as a function of µT in Figure 4E). This is noteworthy, as the simulations make no assumptions about the additivity of the distortion components. As anticipated, the model breaks down as σ S2 increases, as the conditions described in section 2.2.1 no longer hold (see Figure 4F). Fixing the value of τ L enables us to explore the trade-off between latency and reconstruction accuracy. In Figures 4A to 4F, we assumed τ M = τ L , but we can relax this assumption and explore the behavior of the value of τ M that ∗ minimizes the distortion (denoted τ M ) as a function of τ L (see Figure 4G).4 Recall that long windows reduce impulsive and noise-related distortion but increase stimulus-related distortion. For small τ L , noise-related and ∗ impulsive distortion dominates, so τ M > τ L in order increase the overall ∗ window length. For large τ L , stimulus-related noise dominates, and τ M < τL in order to decrease the overall window length. Only for intermediate values of τ L is τ M = τ L optimal. Asymmetric windows are penalized, as they shift the reconstruction with respect to the stimulus (following the automatic shift of τ L required for causality), giving rise to stimulus-related distortion. The presence of additive noise shifts the curve upward, as longer windows are required to minimize the noise’s impact. 4 Distortion of Interspike Interval Decoder In this section, we discuss a spike decoder that computes a reconstruction from the inverse of the ISI lengths. In the simplest version of the decoder, one ISI is considered at a time. During the spike train’s kth ISI (of length tk − tk−1 ), the reconstruction takes on a constant value of µ S µT /(tk − tk−1 ). The process can be generalized to use more than one ISI at a time. We introduce the integer variables M and L, which give the number of halfintervals extending into the past and the future, respectively, used in the reconstruction. 4 This is not feasible for the Poisson encoder for the nominal parameters because τ ∗ M is prohibitively large.
2814
D. Goldberg and A. Andreou
A M, L odd
B M, L even
(tk+1-tk−1)/2
tk-tk−1
M+L 2(tk+(L−1)/2-tk−(M+1)/2)
tk−(M+1)/2
tk−1
τM
tk
tk+1 tk+2
τL
tk+(L−1)/2
M+L 2(tk+L/2-tk-M/2)
tk−M/2
tk−1
tk
τM
tk+1 tk+2
tk+L/2
τL
Figure 5: Generalization of interspike interval decoder to multiple interspike intervals. M is the number of past half-intervals used in the reconstruction, and L is the number of future half-intervals used in the reconstruction. A portion of the reconstruction is represented by the gray box, with the center marked by an ×. (A) M, L odd. The reconstruction is centered in the center of an interspike interval. (B) M, L even. The reconstruction is centered on a spike.
r r
If M and L are odd, the reconstruction during an ISI takes on a constant value inversely proportional to the sum of the lengths of the (M − 1)/2 past ISIs, the current ISI, and the (L − 1)/2 future ISIs. If M and L are even, the reconstruction takes on a constant value between adjacent ISI midpoints. The value is inversely proportional to the sum of the M/2 past ISIs and the L/2 future ISIs.
The two cases are illustrated in Figure 5. The reconstruction can be expressed mathematically as M+L µ S µT k 2(tk+(L−1)/2 recttk −tk−1 t − −t ) k−(M+1)/2 M, L odd sˆ (t) = M+L µ S µT k 2(tk+L/2 −tk−M/2 ) rect(tk+1 −tk−1 )/2 t − M, L even.
tk +tk−1 , 2 tk+1 +2tk +tk−1 4
,
(4.1)
Figure 6: Illustration of integrate-and-fire spike encoding and interspike interval decoding for M = 1, L = 1. (A) Stimulus signal S. (B–D) Stages of encoding and decoding. (B) Spike train X. (C) Placement of impulses at the center of interspike intervals. Height of impulses given by the inverse of the interspike interval length. (D) Reconstruction obtained by interpolation between the impulses in panel C. (E–G) Approximations required for analysis. (E) Convolution of stimulus with a rectangular impulse response to obtain A. (F) Sampling of A to give B. (G) Reconstruction obtained by interpolating between the samples of B.
Distortion of Neural Signals by Spike Coding
2815
2816
D. Goldberg and A. Andreou
In order to ensure that the reconstruction is causal, we define the average latency as τ L = tk+(L−1)/1 − tk +t2k−1 for M, L odd and τ L = tk+L/2 − tk for M, L even. The memory τ M can be defined in an analogous manner. The integrateand-fire encoding and ISI decoding process is illustrated in Figures 6A to 6D. The process by which we estimate the distortion introduced by integrateand-fire encoding and ISI decoding is illustrated in Figures 6E to 6G. First, we make the approximation that τ L and τ M are given by the length of L/2 and M/2 uniformly spaced intervals, respectively, as the true spike times are unknown: τ L ≈ LµT /2
τ M ≈ MµT /2.
(4.2)
To model the inability of the decoding window to capture high-frequency content in the input, we convolve the input with a rectangular impulse response given by h LP (t) =
1 τM + τL
τL − τM , rectτ M +τL t − 2
(4.3)
which gives the signal A (see Figure 6E), a (t) = h LP (t) ∗ [s(t) + n(t)].
(4.4)
Next, we model the sampling behavior of the integrate-and-five encoder by sampling A with a period of µT to give the signal B (see Figure 6F): b(t) = a (t)
∞
δ(t − kµT ).
(4.5)
k=−∞
Finally, the reconstruction Sˆ is given by zero-order interpolation between the samples, which can be modeled by the convolution of the impulse process B with a rectangular window given by h interp (t) = rectµT (t) (see Figure 6G): sˆ (t) ≈ h interp (t) ∗ b(t) =
a (kµT )rectµT (t − kµT ).
(4.6)
k
This approximation requires that S + N is rarely negative (CV S + CV N < ∼ 1) and the that spike train must is regular (CVT < 1). The approximation is ∼ explored in greater depth in section B.1 in appendix B.
Distortion of Neural Signals by Spike Coding
2817
To aid the derivation, we decompose A, B, and the approximation for Sˆ into components corresponding to the stimulus and the noise, a S (t) = h LP (t) ∗ s(t)
a N (t) = h LP (t) ∗ n(t),
(4.7)
and use equations 4.5 and 4.6 to compute BS , B N , Sˆ S , and Sˆ N accordingly. 4.1 Stimulus-Related Distortion. The stimulus-related distortion is given by ε 2 = E[S2 ] − 2E[S Sˆ S ] + E Sˆ S2 .
(4.8)
Expressions for E[S Sˆ S ] and E[ Sˆ S2 ] are derived in section B.2 and can be plugged into equation 4.8 to obtain the distortion: 2 εstim,ISI = σ S2 1 −
1 [ω S (τ M + τ L )]2
× 2ω S (τ M + τ L ) + ω S |τ L − τ M | + 2e −ωS |τL −τ M |/2
ω S µT 4 (τ M + τ L )(e −ωS τ M + e −ωS τL ) sinh ω S µT 2
ω S (τ L − τ M ) −2e −ωS (τL +τ M ) cosh . 2
−
(4.9)
The distortion given by the low-pass filtering of the decoder MSE(S, AS ) 2 is equal to MSE(S, S ) = εstim,SC from section 3. By subtracting MSE(S, AS ) 2 from εstim,ISI , we obtain the distortion due only to the sampling and interpolation phase, 2 2 εsamp,ISI = εstim,ISI − MSE(S, AS ).
(4.10)
Interestingly, this quantity can be negative at long latencies, as the smoothing afforded by the interpolation can decrease E[ Sˆ S2 ] more than it increases 2E[S Sˆ S ]. 4.2 Noise-Induced Distortion. To find the distortion attributable to additive white noise N, we use the fact that the noise is independent of the stimulus, which means that the distortion can be found from the variance of 2 Sˆ N . Because N is zero mean, σ Sˆ2 = E[ Sˆ N ], which can be found in a manner N
2818
D. Goldberg and A. Andreou
analogous to E[ Sˆ S2 ] in section B.2. We have 2 E[ Sˆ N ]=
=
1 2π 1 2π
∞ −∞ ωN −ω N
|HLP (ω)|2 SNN (ω)dω
sin[ω(τ M + τ L )/2] ω(τ M + τ L )/2
2
σ N2 π + 2πµ S δ(ω) dω. ωN
This resembles the derivation in section 3.2, which we follow to give 2 = εnoise,ISI
≈
2σ N2 cos[ω N (τ M + τ L )] − 1 Si[ω N (τ M + τ L )] + ω N (τ M + τ L ) ω N (τ M + τ L ) σ N2 π , ω N (τ M + τ L )
(4.11)
which is exactly the same as the noise-induced distortion of the spike count decoder. 4.3 Poisson Distortion. Another term can account for the additional distortion introduced by Poisson encoding. The most straightforward way 2 to obtain this term is to consider a constant input, which causes εstim,ISI to go to zero. In the case of an integrate-and-fire encoder, the spike train would be perfectly periodic, and the reconstruction would be exact. But in the case of the Poisson encoder, the intrinsic variability of the spike train gives rise to Poisson distortion. At a particular point in time, the reconstruction value is determined from the length of a superinterval, which is composed of K = (M + L)/2 ISIs, where M and L are constrained such that K is an integer. The length of the superinterval is a random variable T. The reconstruction value during a superinterval is then given by K µ S µT /T, and the contribution to the distortion is given by T K µT
K µ S µT − µS T
2 p(T),
(4.12)
where p(T) is the probability of a superinterval of length T. The factor of T/K µT weights the error during the superinterval by the superinterval length. By definition of the Poisson process, the distribution of ISI lengths is exponential with parameter µT . Because the superinterval is given by the sum of ISI lengths, it is described by the gamma distribution, p(T) =
T K −1 e −T/µT . (K )µTK
(4.13)
Distortion of Neural Signals by Spike Coding
2819
To find the distortion, we integrate equation 4.12 over all possible superintervals:
2 K −1 −T/µT e K µ S µT T − µS dT = T (K )µTK 0
µ2S µT 2 = = µS . K −1 τ M + τ L − µT
2 εpoiss,ISI
∞
T K µT
(4.14)
4.4 Overall Behavior. Analogous to the spike count decoder, the behavior of the distortion of the interspike interval decoder is understood by considering the length of the spike train segment used in the reconstruction τ M + τ L (see Figure 7A), the mean ISI length µT (see Figure 7B), and the stimulus variance σ S2 (see Figure 7C). 2 The stimulus-related distortion εstim,ISI increases as τ M + τ L increases, as including more or longer intervals in the reconstruction leads to greater 2 smoothing in the reconstruction. The noise-induced distortion εnoise,ISI decreases as τ M + τ L increases because the noise is averaged over more intervals. Provided that τ M and τ L are kept constant, the value of µT has no 2 bearing on the smoothing effects of the decoding window. εnoise,ISI is flat as 2 a function of, µT and there is a slight dependence of εstim,ISI on µT , which 2 arises from the sampling and reconstruction process. Only the εstim,ISI term 2 is affected by varying σ S . 2 The Poisson distortion εpoiss,ISI —the additional distortion introduced by using a Poisson encoder instead of an integrate-and-fire encoder— overshadows the other distortion components when the nominal param2 eters are used. Similar to the spike count decoder, εpoiss,ISI decreases as τ M + τ L increases and increases as µT increases, as the Poisson variability is reduced when it is averaged over several intervals. The total distortion for the integrate-and-fire encoder (without and with noise) and the Poisson encoder (without noise) closely matches simulation results for these nominal parameter values (shown as a function of τ M + τ L in Figure 7D and as a function of µT in Figure 7E). The model breaks down as σ S2 increases, as the conditions described in section 2.2.1 no longer hold (see Figure 7F). As in the case of the spike count decoder, the memory τ M can be adjusted to minimize the distortion at a given value of the latency τ L (see Figure 7G). When an integrate-and-fire encoder is used in the absence of noise, the ∗ stimulus-related distortion is the only distortion component, so τ M < τL as fewer intervals are favored until the detrimental effects of asymmetric ∗ decoding set in. When noise is introduced, τ M > τ L for low τ L , and in ∗ general, the τ M curve shifts upward, mirroring the behavior of the spike count decoder. In this case, the curve has a staircase appearance because only integer numbers of intervals can be used.
2820
D. Goldberg and A. Andreou
A
D
1
1
10 Distortion ε2 (µA2)
Stimulus−related Noise−induced Poisson
2
Distortion ε (µA )
10
0
2
10
−1
10
−2
10
Integrate−and−fire, noiseless Integrate−and−fire, noise Poisson, noiseless
0
10
−1
10
−2
0
10
10
1
10 Window size τ +τ (sec) M
0
10
L
1
10 Window size τ +τ (sec) M
B
E
1
1
10 Distortion ε2 (µA2)
Stimulus−related Noise−induced Poisson
2
Distortion ε (µA )
10
0
2
10
−1
10
−2
10 −1 10
Integrate−and−fire, noiseless Integrate−and−fire, noise Poisson, noiseless
0
10
−1
10
−2
10 −1 10
0
10 Mean ISI length µ (sec) T
0
10 Mean ISI length µ (sec) T
C
F
1
1
10 Distortion ε2 (µA2)
Stimulus−related Noise−induced Poisson
2
Distortion ε (µA )
10
0
2
10
−1
10
−2
0
10
−1
10
Integrate−and−fire, noiseless Integrate−and−fire, noise Poisson, noiseless
−2
0
10 −1 10
1
10 10 2 2 Stimulus variance σ (µA ) S
0
S
G Noiseless With noise τ =τ L
M
M
0
10
0
10
1
10 10 2 2 Stimulus variance σ (µA )
1
10 Optimal memory τ* (sec)
10 −1 10
L
Latency τL (sec)
1
10
Distortion of Neural Signals by Spike Coding
2821
5 Comparison In the previous sections, we developed analytical models for the distortion introduced by spike communication. We have considered integrate-andfire and Poisson encoders and spike count (SC) and interspike interval (ISI) decoders. The expressions include several parameters: the stimulus is characterized by its mean µ S , variance σ S2 , and bandwidth ω S ; the noise is characterized by its variance σ N2 and bandwidth ω N ; the spike encoders are characterized by Cθ or µT ; and the decoders are characterized by τ M and τ L . In this section, we compare the distortion expressions side-by-side. 5.1 Reducing the Number of Parameters. The comparison of the distortion expressions is complicated by the abundance of free parameters. We can take several steps to make the expressions more manageable. First, we normalize the distortion by the stimulus variance, ε¯ 2 = ε 2 σ S2 .
(5.1)
Next, we replace the parameters with dimensionless quantities, µ ¯ T = µT ω S τ¯ M = τ M /µT σ¯ N = σ N /σ S
σ¯ S = σ S /µ S τ¯L = τ L /µT ω¯ N = ω N /ω S .
(5.2)
Finally, we effectively eliminate τ M by using the distortion at the op∗ timal memory value τ M . This leaves us with five parameters: a unitless spike period µ ¯ T , a unitless latency τ¯L , the stimulus coefficient of variation σ¯ S , a unitless noise level σ¯ N , and a unitless noise bandwidth ω¯ N .
Figure 7: Distortion of the interspike interval decoder. Same parameters as Figure 4. (A) Distortion components as a function of decoding window length τ M + τ L . (B) Distortion components as a function of mean spike rate µT . (C) Distortion components as a function of stimulus variance σs2 . (D) Comparison of analytical expressions for distortion (solid line) with simulations (dots with error bars) as a function of window length τ M + τ L . (E) Comparison of analytical expressions for distortion with simulations as a function of mean spike rate µT . (F) Comparison of analytical expressions for distortion with ∗ as simulations as a function of stimulus variance σ S2 . (G) Optimal memory τ M a function of latency τ L for the integrate-and-fire encoder. Dashed line corresponds to τ M = τ L . The curve has a staircase appearance because only integer numbers of intervals can be used.
2822
D. Goldberg and A. Andreou
The SC decoder expressions become
2 =1− ε¯ stim,SC
¯ T (1 − e −τ¯M µ¯ T − e −τ¯L µ¯ T ) − e −(τ¯M + τ¯L )µ¯ T + 1] 2[(τ¯ M + τ¯L )µ [(τ¯ M + τ¯L )µ ¯ T ]2 (5.3)
1 6σ¯ S2 (τ¯ M + τ¯L )2 1 1 2 = 2 . ε¯ poiss,SC 1− 6(τ¯ M + τ¯L ) σ¯ S (τ¯ M + τ¯L ) 2 ε¯ imp,SC =
(5.4) (5.5)
The terms specific to the ISI decoder become 2 ε¯ stim,ISI =1 −
1 [(τ¯ M + τ¯L )µ ¯ T ]2
× 2(τ¯ M + τ¯L )µ ¯ T + |τ¯L − τ¯ M |µ ¯ T + 2e −|τ¯L −τ¯M |µ¯ T /2 −4(τ¯ M + τ¯L )(e −τ¯M µ¯ T + e −τ¯L µ¯ T ) sinh −2e −(τ¯L +τ¯M )µ¯ T cosh 2 ε¯ poiss,ISI =
¯T (τ¯L − τ¯ M )µ 2
1 . σ¯ S2 (τ¯ M + τ¯L − 1)
µ ¯T 2
(5.6) (5.7)
The noise-induced distortion expression is identical for the SC and ISI decoders: 2 = ε¯ noise
≈
2σ¯ N2 ¯ T (τ¯ M + τ¯L )] Si[ω¯ N µ ω¯ N µ ¯ T (τ¯ M + τ¯L )
¯ T (τ¯ M + τ¯L )] − 1 cos[ω¯ N µ + ω¯ N µ ¯ T (τ¯ M + τ¯L ) σ¯ N2 π , ω¯ N µ ¯ T (τ¯ M + τ¯L )
(5.8) (5.9)
where the approximation requires ω¯ N µ ¯ T (τ¯ M + τ¯L ) 1. When comparing the decoders, we choose not to penalize a decoder for having high distortion at a high latency if it can achieve a lower distortion
Distortion of Neural Signals by Spike Coding
2823
at a lower latency. Therefore, at a given latency τ¯L , we plot the lowest distortion achievable at any latency at or below τ¯L , ε¯ ∗2 (τ¯L ) = min ε 2 (τ¯L ).
(5.10)
τ¯L :τ¯L ≤τ¯L
When comparing the distortion expressions, we must keep in mind that the expressions are approximations that are valid only in a limiting case. Expressing the conditions described in section 2.2.1 with the dimensionless parameters gives σ¯ S (1 + σ¯ N ) < ∼1
σ¯ S 2/µ ¯T < ∼1
σ¯ N σ¯ S 2/ω¯ N µ ¯T < ∼ 1.
(5.11)
5.2 Integrate-and-Fire Encoding. A comparison of the SC and ISI decoders’ ability to reconstruct an integrate-and-fire encoded spike train reveals that the parameter space is cleanly divided into two distinct regions—one where the ISI decoder has lower distortion and one where the SC decoder has lower distortion. 5.2.1 Noiseless Case. First, we consider the noiseless case (σ¯ N = 0; see Figures 8A to 8D). The partitioning of the parameter space can be visualized ∗2 ∗2 by plotting the surface over which ε¯ SC and ε¯ ISI are equal as a function of µ ¯ T, σ¯ S , and τ¯L (see Figure 8A). When the stimulus CV σ¯ S is low (“behind” the boundary surface in Figure 8A), the ISI decoder has lower distortion, and when the stimulus CV σ¯ S is high (“in front of” the boundary in Figure 8A), the SC decoder has lower distortion. The difference between the distortions, ∗2 ∗2 ε¯ SC − ε¯ ISI can be visualized by fixing µ ¯ T (see Figure 8B). The surface is asymmetric about zero and indicates that the absolute disparity between the decoders’ distortion is larger when the ISI decoder is superior than when the SC decoder is superior. The value of σ¯ S can be also be fixed to facilitate comparison of the distortion expressions with simulation results. For low-stimulus CV (σ¯ S = 0.25), the models for ε¯ 2 match the simulations, verifying that the ISI decoder outperforms the SC decoder (see Figure 8C). For high-stimulus CV (σ¯ S = 2), simulations verify that the reversal in the relative distortion predicted by the models does occur, but the models are quantitatively inaccurate (see Figure 8D). This inaccuracy arises because the conditions described in equation 5.11 do not hold. The partitioning of the parameter space can be explained by contrasting the behavior of the discontinuities present in the decoders’ reconstructions. (Examples of the discontinuities can be seen in Figures 3C and 6D.) For the SC decoder, the distortion that results from the discontinuities is given by 2 the impulsive distortion (εimp,SC , equation 3.21). This term scales with the square of the stimulus mean µ2S , as it is a function of the square of the quanµT tization step size τµMS+τ . For the ISI decoder, the distortion resulting from L discontinuities in the reconstruction is due to sampling and interpolation
2824
D. Goldberg and A. Andreou
2 (εsamp,ISI , equation 4.10). This term scales with the stimulus variance σ S2 because the magnitude of the discontinuities is proportional to the variability of the stimulus over the typical interspike interval. As the stimulus CV increases (i.e., σ S2 increases relative to µ2S ), the relative contribution of impulsive distortion to the total SC distortion decreases, while the relative contribution of the sampling and interpolation distortion to the total ISI distortion remains constant.
5.2.2 Adding Noise. Next, we examine the effects of noise on the distortion comparison (σ¯ N = 1 and ω¯ N = 4, Figures 8E to 8G). For low stimulus CV, noise has a greater effect on the performance of the ISI decoder than the SC decoder (compare Figures 8C and 8G). This behavior is somewhat counterintuitive considering that both decoders have the same expression for 2 the noise-induced distortion ε¯ noise . However, the distortion at a particular ∗ latency is computed at the optimal memory τ¯ M , which need not be identical for the two decoders (compare Figures 4G and 7G). In the noiseless case, the SC decoder must have a large value of τ¯ M in order to minimize the impulsive distortion. As we have observed in sections 3 and 4, the introduction of ∗ ∗ noise causes the value of τ¯ M to increase. However, because the value of τ¯ M for the SC decoder is already large, it need not increase as much as it must for the ISI decoder. The overall effect is that noise has a greater impact on the ISI decoder, and it narrows the performance gap between the decoders at low stimulus CV. Figure 8: Comparison of the analytical expressions for the distortion of integrate-and-fire encoding and spike count (SC) and interspike interval (ISI) decoding over a large portion of the dimensionless parameter space. (A–D) Noiseless case (σ¯ N = 0). (A) Surface over which the distortions of the SC decoder and the ISI decoder are equal for σ¯ N = 0. In the region behind the surface, the SC decoder has higher distortion. In the region in front of the surface, the ISI decoder has higher distortion. The color of the surface at a particular point corresponds to the value of the distortion at that point. Black corresponds to ε¯ ∗2 = 0, and white corresponds to ε¯ ∗2 = 1. (B) Difference between the SC decoder distortion and the ISI decoder distortion for µ ¯ T = 1. The color of the surface is taken from the corresponding z-axis value. The locations where the distortions are equal are marked with the dashed line. The solid black lines mark the regions that are plotted in panels C and D. (C) Distortion of the decoders for σ¯ S = 1, plotted with simulation results. SC decoder distortion in black, ISI decoder distortion in gray. The dashed lines represent ε¯ 2 , given by the analytical model. The dots correspond to simulation results. The solid lines represent ε¯ ∗2 , the minimum of the distortion values for all values of τ¯L up to the current value. (D) Distortion of the decoders for σ¯ S = 2. Same format as panel C. (E–G) σ¯ N = 1 and ω¯ N = 4. (E) Surface over which the distortion of the SC decoder and the ISI decoder are equal for σ¯ N = 0.25 and ω¯ N = 4. Same format as panel A. (F–H) Distortion of the decoders for µ ¯ T = 1. Same format as panels in B–D.
Distortion of Neural Signals by Spike Coding
2825
2826
D. Goldberg and A. Andreou
For high stimulus CV, the situation is more complicated. In the region corresponding to high σ¯ S , low µ ¯ T , and low τ¯L , the distortion expressions predict that the ISI decoder gives less distortion than the SC decoder, the opposite of the noiseless case. However, the simulation results reveal that this prediction fails qualitatively and quantitatively (see Figure 8H), as the conditions required for the models do not hold. 5.3 Poisson Encoding. Unfortunately, a similar comparison of the decoders’ ability to decode a Poisson-encoded spike train is not as informative. At lower values of stimulus CV, the distortion is dominated by the Poisson 2 2 distortion terms ε¯ poiss,SC (see equation 5.5) and ε¯ poiss,ISI (see equation 5.7). In this regime, comparisons like those depicted in Figure 8 are not meaningful because extraordinarily high values of τ¯ M are required to minimize the distortion. A comparison is possible at higher stimulus CV values, but only because 2 the ε¯ poiss terms are diminished. In this regime, the behavior exhibited by the integrate-and-fire comparison is recovered, and the results resemble those depicted in Figures 8D and 8H—the SC decoder outperforms the ISI decoder by a small margin (not shown). The Poisson distortion terms diminish because as the stimulus’s variance increases relative to its mean squared (i.e., the stimulus CV increases), the variability attributable to the stimulus overwhelms the variability attributable to Poisson spiking. 6 Discussion and Conclusion We have explored how spike coding can distort neural signals, and we have considered strategies that neurons may use to minimize this distortion. We have derived analytical expressions for the distortion that include five dimensionless parameters related to the mean ISI, the stimulus CV, the latency, the noise level, and the noise bandwidth. These models are valid in the portion of the parameter space where the variability in the stimulus and noise is small relative to the mean stimulus level, and the spike train is regular. The models enable us to predict which spike encoder and decoder will perform best for a given parameter set. The distortion expressions can be broken up into components that are affected by the parameters in differing ways, giving rise to trade-offs. For the ISI decoder, short decoding windows are required for stimulus fidelity, but long decoding windows are required for robustness to noise. The SC decoder is similar, with the exception that its distortion has a term attributable to the quantization effects of the counting window, which increases the optimal window size. In all cases, increasing the spike rate while holding all other parameters constant decreases the distortion. As expected, the deterministic integrate-and-fire encoder is superior to the stochastic Poisson encoder. It is not trivial to predict which of spike
Distortion of Neural Signals by Spike Coding
2827
count (SC) or interspike interval (ISI) decoders will best extract information from the spike train, even if the nature of the spike encoding process is completely known to the receiver. The identity of the best decoder is a function of the various parameters that describe the stimulus, the encoder, the noise, and the latency requirements. The stimulus CV appears to play a particularly influential role in the decoder comparison. At low stimulus CV, the ISI decoder is superior because the quantization effects of spike counting are prominent. At high stimulus CV, the SC decoder is superior, as the variability of the stimulus overwhelms the quantization effects. While the examples in this letter have been limited to Lorentzian stimuli and band-limited white noise, our approach can be applied to signals with other spectral characteristics. For example, we were able to reproduce the qualitative results depicted in Figures 4, 7, and 8 using band-limited white stimuli and Lorentzian noise. 6.1 Implications for the Neural Coding Debate. It is tempting to relate the results here to the ongoing debate over rate and temporal coding. However, as others have pointed out, there is no consensus on precise definitions of rate and temporal coding. In attempting to clarify the neural coding debate, Dayan and Abbott (2001) introduced a distinction between independent spike codes and correlation codes. In an independent spike code, the spike train is determined solely from a time-dependent firing rate. An example is the Poisson encoder we have described. A correlation code is one where correlations in spike times carry significant information. The integrate-and-fire encoder is consistent with a correlation code, as the information about the stimulus is encoded exclusively in the ISIs. The distinctions that Dayan and Abbott propose refer only to spike encoding. In the information-theoretic sense, a code consists of an encoder and a decoder, and discussions of neural codes must encompass both in order to be unambiguous. Poisson-encoded spike trains do not contain any information in their interspike intervals, so the Poisson encoder and the spike count decoder appear to be intuitively matched. This is consistent with the result that the SC decoder outperforms the ISI decoder on Poisson-encoded spike trains. On the other hand, because the integrate-and-fire encoder encodes information exclusively in ISIs, one might think the interspike interval decoder is best suited to decode these spike trains. But because the SC decoder slides continuously across regions of the spike train encompassing several spikes, it too is capable of extracting information contained in the correlations among spike times. As we have demonstrated, the SC decoder can outperform the ISI decoder under certain conditions (see Figure 8). The theoretical results presented here are applicable to systems that conform to the assumptions of our framework. The stimulus and noise must have low CV, the spike trains must be fairly regular, the noise must be small and additive, and the encoder must be adequately described by an integrate-and-fire or Poisson model. One system that is potentially
2828
D. Goldberg and A. Andreou
adequately described by these criteria is the mammalian retinal ganglion cell, particularly the model proposed by Passagalia and Troy (2004). In this case, the stimulus corresponds to the generator potential of the ganglion cell, and the results could provide insight into the mechanisms of spike decoding in the leteral geniculate nucleus. Unfortunately, no measurements of the statistics of the generator potential under naturalistic stimulation could be found in the literature. Current work is focused on determining such statistics from the statistics of natural scenes and models of retinal processing. 6.2 Efficiency of Spike Communication. We hypothesize that a neural system will use the most efficient code—the encoder-decoder pair that has the most favorable trade-off between performance and cost. Ideally, we would like to systematically search through all possible spike codes to determine which are the most efficient in order to test this hypothesis. Such a search is at the moment intractable, so we instead focus on specific encoders and decoders that have been discussed in the neural modeling literature. We make no claim that the encoders and decoders we have considered are optimal in any sense; they were chosen for their simplicity. They can be characterized by few parameters and therefore facilitate a parametric exploration of the spiking link’s performance. This is in contrast to decoders that can be found using optimization techniques, such as the optimal linear decoder. Biophysical models of encoders and decoders are required to estimate the cost of spike communication. Such models will enable us to further examine how energy consumption and other constraints influence spike coding (Levy & Baxter, 1996; Schreiber, Machens, Herz, & Laughlin, 2002; Goldberg, Sripati, & Andreou, 2003). The spike count decoder, for example, is a low-pass filter that is rather straightforward to implement with biological machinery. The implementation of the ISI decoder is less clear, although mechanisms based on short-term synaptic dynamics are attractive (Tsodyks, Pawelzik, & Markram, 1998; Zucker & Regehr, 2002). Recent work has demonstrated that lossless transmission with a spiking link is ´ 2004). However, the effipossible under the right conditions (Lazar & Toth, ciency of such transmission depends critically on the expense of the decoder implementation. Appendix A:
Distortion of the Spike Count Decoder
A.1 Approximation of a Filtered Spike Train. The spike train X has spikes at times {tkX } and can be expressed as a sum of delta functions, x(t) =
δ t − tkX . k
(A.1)
Distortion of Neural Signals by Spike Coding
2829
The decoding filter G has impulse response g(t) with support over [t − τ− , t + τ+ ], where τ− , τ+ > 0. The zero frequency gain of the filter is G(0) = τ+ τ− g(τ )dτ = µ S µT . The reconstruction is given by sˆ (t) = g(t) ∗ x(t) = g(t) ∗
N N δ t − tkX = g t − tkX , k=1
(A.2)
k=1
where i = 1, . . . , N encompasses all the spikes in the interval [t − τ+ , t + τ− ]. Because we do not know the spike times {tkX }, our goal is to find an expression for sˆ (t) that does not include them. When the input is positive, the output of an integrate-and-fire encoder can be expressed in terms of a periodic spike train and the warping function −1 (t) (Johnson, 1996). Similarly, section 2.2.2 discusses how the output of a Poisson encoder can be expressed in terms of a homogeneous Poisson spike train and the same warping function. For the remainder of this derivation, we focus on the integrate-and-fire case and note that the Poisson case proceeds similarly. We express the reconstruction as the sum of the periodic spike train p(t) (spike times {tkP }) and a term γ , g t − tkX = g t − tkP + γ i
(A.3)
k
γ=
g t − tkX − g t − tkP .
(A.4)
k
If tkX and tkP are not far apart relative to the mean ISI length (which is the case if the conditions described in section 2.2.1 hold) and g(t) is smooth, then the slope of the secant line from (t − tkX , g(t − tkX )) to (t − tkP , g(t − tkP )) approaches the derivative of g(t) at t − tkX or t − tkP . Then the difference between g(t − tkX ) and g(t − tkP ) can be expressed as g t − tkX − g t − tkP ≈ g t − tkX t − tkX − t − tkP ≈ g t − tkX tkP − tkX .
(A.5)
Substituting into the expression for γ , γ ≈
g t − tkX tkP − tkX = g t − tkX tkX − tkX . k
(A.6)
k
The sum in equation A.6 can be approximated by an integral. This approximation is best when the curvature of g(t) is small and there are many spikes
2830
D. Goldberg and A. Andreou
in the decoding window, γ≈
1 µT
g (t − α)((α) − α)dα
1 d 1 g(t) ∗ ((t) − t) [g (t) ∗ ((t) − t)] = µT µT dt s(t) − µ S + n(t) , = g(t) ∗ µ S µT =
(A.7)
where we use derivative property of the convolution and note (t) = s(t) + n(t). Substituting the expression for γ into equation A.3, we obtain s(t) − µ S + n(t) g t − tkX ≈ g t − tkP + g(t) ∗ . µ S µT k k
(A.8)
A.2 Statistics of Infinite Impulse Trains. The impulse train P is expressed as (see equation 3.4) p(t) =
∞
δ(t − kµT ).
(A.9)
k=−∞
The mean of P is given by
∞
µ P = E[ p(t)] = E
δ(t − kµT )
k=−∞
1 k→∞ 2kµT
= lim
kµT
−kµT
p(t)dt
1 (2k) k→∞ 2kµT
= lim
= 1/µT .
(A.10)
The autocorrelation of P is given by RP P (τ ) = E[ p(t) p(t + τ )] ∞ ∞ =E δ(t − kµT ) δ(t − kµT + τ ) . k=−∞
k=−∞
The product inside the expectation is nonzero only when a spike in the spike train coincides with a spike in the shifted version of the spike train. In
Distortion of Neural Signals by Spike Coding
2831
this case, a coincidence occurs when the lag τ is equal to a multiple of µT , RP P (τ ) =
∞
E
k=−∞
δ(t − kµT ) ,
τ = lµT , l an integer
0,
=
otherwise
1/µT , τ = lµT , l an integer 0,
otherwise
,
making use of equation A.10. This can be written as
RP P (τ ) =
∞ 1 p(τ ) δ(τ − lµT ) = . µT l=−∞ µT
(A.11)
To find the power spectral density of P, we apply the Fourier transform, SP P (ω) =
∞
∞
−∞ k=−∞
2π µT
=
δ(τ − kµT )e − jωτ dτ =
δ ω−
k=−∞
e − jωkµT
k=−∞
∞
∞
2πk µT
.
(A.12)
A.3 Stimulus-Related Distortion. The normalized transfer function of the decoding filter is ¯ SC (ω) = G
sin[ω(τ M + τ L )/2] jω(τ M −τL )/2 . e ω(τ M + τ L )/2
(A.13)
The expressions for E[SS ] and E[S2 ] are in general given by equations 3.13 and 3.14, respectively. First, we find E[SS ], E[SS ] =
1 2π
=
1 2π
∞
−∞
SSS (ω)dω =
1 2π
∞
−∞
¯ SSS (ω)G(ω)dω
2ω S σ S2 2 + 2πµ δ(ω) S 2 2 −∞ ω S + ω
sin[ω(τ M + τ L )/2] jω(τ M −τL )/2 e × dω. ω(τ M + τ L )/2
∞
(A.14)
2832
D. Goldberg and A. Andreou
The first term is evaluated with the help of the identity π (2 2b 2
−e
−a b
−e
E[SS ] =
−b(a +c)
∞ 0
sin a x e jcx d x x(b 2 +x 2 )
=
), obtained from Maple,
σ S2 (2 − e −ωS τ M − e −ωS τL ) + µ2S . ω S (τ M + τ L )
(A.15)
Next, we find E[S2 ]: 1 E[S ] = 2π 2
1 = 2π
∞
−∞
∞
−∞
S
S S
1 (ω)dω = 2π
∞
−∞
2 ¯ SSS (ω)|G(ω)| dω
2ω S σ S2 sin[ω(τ M + τ L )/2] 2 2 + 2πµ S δ(ω) δω. ω(τ M + τ L )/2 ω2S + ω2
Here, the first term is evaluated with the help of the identity π [2a 4b 2
−
1 (1 b
−e
E[S2 ] =
−2a b
(A.16)
∞ 0
sin2 a x dx x 2 (b 2 +x 2 )
=
)] (Gradshteyn & Ryzhik, 1965, equation 3.826.1):
2σ S2 ω S (τ M + τ L ) − 1 + e −ωS (τ M +τL ) + µ2S . [ω S (τ M + τ L )]2
(A.17)
The expressions for E[SS ] and E[S2 ] are then plugged into equation 3.11 to give the distortion. A.4 Impulsive Distortion. Plugging the transfer function of the decoding filter G SC (ω) (see equation 3.1) into equation 3.20 gives 2 εcount,imp
∞ 2πk 2 2 G SC = 2 µT µT k=1
∞ sin(πk(τ M + τ L )/µT ) 2 2 µ µ S T πk(τ M + τ L )/µT µ2T k=1
µ S µT =2 τM + τL
2 τ M +τ L
2 ∞ sin πk µT
k=1
πk
.
(A.18)
The value of the infinite sum depends on (τ M + τ L )/µT —the average number of spikes that fall in the rectangular window. The sum is zero if (τ M + τ L )/µT is an integer and is at its maximum when (τ M + τ L )/µT is halfway between two integers. If we assume that variability in the spike train due to the stimulus causes x = mod(τ M + τ L , µT ) to vary uniformly
Distortion of Neural Signals by Spike Coding
2833
over the interval [0, 1], we obtain
2 εcount,imp
µ S µT =2 τM + τL =2
µ S µT τM + τL
µ S µT = τM + τL
2
∞ 1
0
k=1
2 ∞
2 "
k=1
1 πk
sin(xπk) πk
2
∞ 1 1 π2 k2
1
2 dx
sin2 (xπk)d x
0
# ,
k=1
1 2 where we make use of the 0 sin (xπk)d x = 1/2 for any integer ∞identity k > 0. Using the identity k=1 1/k 2 = π 2 /6, we obtain 2 = εcount,imp
1 6
µ S µT τM + τL
2 .
(A.19)
In the case of spike counting, this distortion component is a consequence of the quantization of the reconstruction due to the discrete nature of spike counts and points the way to an alternative derivation of equation A.19. Appendix B:
Distortion of the Interspike Interval Decoder
B.1 Approximation of Integrate-and-Fire Encoding Followed by Interspike Interval Decoding. The interspike interval decoder first generates an impulse train in which the impulses are placed at spikes, and the heights of the impulses are inversely proportional to the length of the M/2 past intervals and the L/2 future intervals.5 Then the decoder interpolates between the impulses to generate the reconstruction. The value of the reconstruction Sˆ at spike time and sample point tk can be expressed as sˆ (tk ) = µ S
(M + L)µT 2
1 tk+L/2 − tk−M/2
.
(B.1)
Combining equations 2.10 and 2.11 and generalizing to multiple intervals gives µS
(M + L)µT 2
=
tk+L/2
[s(t ) + n(t )]dt .
(B.2)
tk−M/2
5 In this appendix, we confine ourselves to the case where M and L are even. The odd case proceeds analogously.
2834
D. Goldberg and A. Andreou
Recall that this expression is strictly true only when S + N > 0, which is probable if CV S + CV N < ∼ 1. Substituting equation B.2 into equation B.1 gives sˆ (tk ) =
1 tk+L/2 − tk−M/2
tk+L/2
[s(t ) + n(t )]dt .
(B.3)
tk−M/2
Our approximation of Sˆ uses impulses that are spaced uniformly with intervals of µT and have heights derived from the mean of the stimulus over the window that goes MµT /2 into the past and LµT /2 into the future:
2 sˆ (kµT ) ≈ (M + L)µT
kµT +LµT /2
kµT −MµT /2
[s(t ) + n(t )]dt .
(B.4)
Setting tk = kµT , we find that equations B.1 and B.4 are equivalent if tk − tk−K = K µT for all k, K . Put simply, the spike train must be periodic. As the spike train deviates from being perfectly periodic, the approximation becomes less appropriate. B.2 Derivation of the Stimulus-Related Distortion. To model the lowpassing effects that arise from the computation of the mean stimulus over the decoding window, the signal S is passed through a function with the τ L −τ M 1 impulse response h LP (t) = τ M +τ rect t − to give the signal AS . τ +τ M L 2 L This is expressed by a S (t) = h LP (t) ∗ s(t).
(B.5)
The mean of AS is given by µ AS = HLP (0)µ S = µ S .
(B.6)
The cross-correlation between S and AS is given by RSAS (τ ) = RSS (τ ) ∗ h LP (−τ ) = σ S2 e −ωS |τ | + µ2S ∗
1 τM + τL
rectτ M +τL
(B.7)
τL − τM −τ + 2
Distortion of Neural Signals by Spike Coding
2835
2σ S2 τL − τM −ω S (τ M +τ L )/2 cosh ω S τ − 1−e 2 ω S (τ M +2τ L ) + µ S , −τ M < τ < τ L =
2σ S2 ω S (τ M + τ L ) −ω S |τ −(τ L −τ M )/2| sinh e + µ2S . 2 ω S (τ M + τ L ) τ < −τ M , τ > τ L (B.8) The autocorrelation of AS is given by (B.9) RAS AS (τ ) = h LP (τ ) ∗ RSAS (−τ )
1 τL − τM = ∗ RAS S (−τ ) rectτ M +τL −t + τM + τL 2
$ 2σ S2 [ω S (τ M + τ L )]2 ω S [(τ M + τ L ) − |τ − (τ M − τ L )/2|] µT −
τM − τL −ω S |τ −(τ M −τ L )/2| −ω S (τ M +τ L ) + e cosh ω e τ − S 2 3τ M + τ L τ M + 3τ L 2 , = +µ S , − <τ < 2 2
4σ S2 ω S (τ M + τ L ) −ω S |τ −(τ M −τ L )/2| sinh2 + µ2S [ω S (τ M + τ L )]2 e 2 3τ M + τ L τ M + 3τ L τ <− ,τ > 2 2 (B.10) To model the staircase nature of the reconstruction, AS is sampled uniformly to give BS and then convolved with an impulse response h interp (τ ) = rectµT (τ ). BS is given by
b S (t) = a S (t)
∞
δ(t − kµT ) = a S (t) p(t).
(B.11)
k=−∞
The mean of BS is µ BS = E[a S (t) p(t)] = E[a S (t)]E[ p(t)] = µ S /µT ,
(B.12)
where we exploit the fact that AS and the impulse train P are uncorrelated and recall that µ P = 1/µT . The cross-correlation between S and BS is
2836
D. Goldberg and A. Andreou
given by RSBS (τ ) = E[s(t)b S (t + τ )] = E[s(t)a S (t + τ ) p(t + τ )] = E[s(t)a S (t + τ )]E[ p(t + τ )] = RSAS (τ )/µT .
(B.13)
The autocorrelation of BS is given by RBS BS (τ ) = E[b S (t)b S (t + τ )] = E[a S (t) p(t)a S (t + τ ) p(t + τ )] = E[a S (t)a S (t + τ )]E[ p(t) p(t + τ )] = RAS AS (τ )RP P (τ ) = RAS AS (τ ) p(τ )/µT .
(B.14)
The reconstruction Sˆ S is given by sˆ (t) = h interp (t) ∗ b S (t).
(B.15)
The mean of Sˆ S is µ Sˆ S = Hinterp (0)µ BS = µT µ BS = µ S .
(B.16)
The cross-correlation of Sˆ S and S is given by RSSˆ S (τ ) = RSBS (τ ) ∗ h interp (−τ ) = RSBS (τ ) ∗ rectµT (−τ ) µT /2 ∞ RSBS (α)rectµT (α − τ )dα = RSBS (τ − α)dα. = −∞
−µT /2
Then E[S Sˆ S ] is given by E[S Sˆ S ] = RSSˆ S (0) = =
µT /2
−µT /2
RSBS (−α)dα =
1 µT
µT /2 −µT /2
RSAS (−α)dα
ω µ 2σ S2 S T −ω S τ M −ω S τ L 2 + µ . ω µ − (e + e ) sinh S T S 2 ω2S µT (τ M + τ L ) (B.17)
The autocorrelation of Sˆ S is given by RSˆ S Sˆ S (τ ) = h interp (τ ) ∗ RBS BS (τ ) ∗ h interp (τ ) = rectµT (τ ) ∗ RBS BS (τ ) ∗ rectµT (−τ ) = RBS BS (kµT )rect(τ − kµT ) ∗ rectµT (−τ ) k
Distortion of Neural Signals by Spike Coding
=
∞
−∞
k
=
2837
RBS BS (kµT )rect(τ − β − kµT ) ∗ rectµT (β)dβ
µT /2
−µT /2
k
RBS BS (kµT )rect(τ − β − kµT )dβ.
(B.18)
E[ Sˆ S2 ] is given by E[ Sˆ S2 ] = RSˆ S Sˆ S (0) =
µT /2
−µT /2
k
RBS BS (kµT )rect(β − kµT )dβ
= µT RBS BS (0) = RAS AS (0) 2σ S2 |τ M − τ L | = ω S (τ M + τ L ) − [ω S (τ M + τ L )]2 2
ω S (τ M − τ L ) −ω S |τ M −τ L |/2 −ω S (τ M +τ L ) −e +e cosh + µ2S . 2
(B.19)
The expressions for E[ Sˆ S S] and E[ Sˆ 2 ] can be plugged into equation 4.8 to obtain the distortion, which is given by equation 4.9. Acknowledgments This work was supported by National Science Foundation grant EIA0130812. During the manuscript’s preparation, D. H. G. was also supported by National Institutes of Health grant MH068012 to Daniel Gardner. We thank Jonathan Victor and Arun Sripati for helpful discussions. References Attneave, F. (1954). Some informational aspects of visual perception. Psychological Review, 61, 183–193. August, D. A., & Levy, W. B. (1996). A simple spike train decoder inspired by the sampling theorem. Neural Computation, 8, 67–84. Barlow, H. B. (1961). Possible principles underlying the transformations of sensory messages. In W. A. Rosenblith (Ed.), Sensory communication (pp. 217–234). Cambridge, MA: MIT Press. Bialek, W., Rieke, F., de Ruyter van Steveninck, R. R., & Warland, D. (1991). Reading a neural code. Science, 252, 1854–1857. Chklovskii, D. B., Schikorski, T., & Stevens, C. F. (2002). Wiring optimization in cortical circuits. Neuron, 34, 341–347. Dayan, P., & Abbott, L. F. (2001). Theoretical neuroscience: Computational and mathematical modeling of neural systems. Cambridge, MA: MIT Press.
2838
D. Goldberg and A. Andreou
Dong, D. W., & Atick, J. J. (1995). Statistics of nature time-varying images. Network: Computation in Neural Systems, 6, 345–358. Gabbiani, F. (1996). Coding of time-varying signals in spike trains of linear and half-wave rectifying neurons. Network: Computation in Neural Systems, 7, 61–85. Gerstein, G. L., & Mandelbrot, B. (1964). Random walk models for the spike activity of a single neuron. Biophysical Journal, 4, 41–68. Goldberg, D. H., Sripati, A. P., & Andreou, A. G. (2003). Energy efficiency in a channel model for the spiking axon. Neurocomputing, 52–54, 39–44. Gradshteyn, I. S., & Ryzhik, I. M. (1965). Table of integrals, series, and products. New York: Academic Press. Johnson, D. H. (1996). Point process models of single-neuron discharges. Journal of Computational Neuroscience, 3, 275–299. Knight, B. W. (1972). Dynamics of encoding in a population of neurons. Journal of General Physiology, 59, 734–766. Laughlin, S. B. (2001). Energy as a constraint on the coding and processing of sensory information. Current Opinion in Neurobiology, 11, 475–480. Laughlin, S. B., Anderson, J. C., O’Carroll, D., & de Ruyter van Steveninck, R. (1999). Coding efficiency and the metabolic cost of sensory and neural information. In R. ¨ ak (Ed.), Information theory and the brain (pp. 41– Baddeley, P. Hancock, & P. Foldi´ 61). Cambridge: Cambridge University Press. Lazar, A. A. (2004). Time encoding with an integrate-and-fire neuron with a refractory period. Neurocomputing, 58–60, 53–58. Lazar, A. A. (2006). A simple model of spike processing. Neurocomputing, 69, 1081– 1085. ´ Lazar, A. A., & Toth, L. T. (2004). Perfect recovery and sensitivity analysis of time encoded bandlimited signals. IEEE Transactions on Circuits and Systems–I: Regular Papers, 51(10), 2060–2073. Levy, W. B., & Baxter, R. A. (1996). Energy efficient neural codes. Neural Computation, 8, 531–543. Lindner, B. (2004). Interspike interval statistics of neurons driven by colored noise. Physical Review E, 69, 022901. MacKay, D. M., & McCulloch, W. S. (1952). The limiting information capacity of a neuronal link. Bulletin of Mathematical Biophysics, 14, 127–135. Manwani, A., Steinmetz, P. N., & Koch, C. (2001). The impact of spike timing variability on the signal-encoding performance of neural spiking models. Neural Computation, 14, 347–367. Passaglia, C. L., & Troy, J. B. (2004). Information transmission rates of cat retinal ganglion cells. Journal of Neurophysiology, 91, 1217–1229. Salinas, E., & Sejnowski, T. J. (2002). Integrate-and-fire neurons driven by correlated stochastic input. Neural Computation, 14, 2111–2155. Schreiber, S., Machens, C. K., Herz, A. V. M., & Laughlin, S. B. (2002). Energy efficient coding with discrete stochastic events. Neural Computation, 14, 1323–1346. Softky, W. R. (1996). Fine analog coding minimizes information transmission. Neural Networks, 9, 15–24. Stein, R. B. (1967). The information capacity of nerve cells using a frequency code. Biophysical Journal, 7, 797–826.
Distortion of Neural Signals by Spike Coding
2839
Stein, R. B., French, A. S., & Holden, A. V. (1972). The frequency response, coherence, and information capacity of two neural models. Biophysical Journal, 12, 295–322. Tsodyks, M., Pawelzik, K., & Markram, H. (1998). Neural networks with dynamic synapses. Neural Computation, 10, 821–835. Zucker, R. S., & Regehr, W. G. (2002). Short-term synaptic plasticity. Annual Review of Physiology, 64, 355–405.
Received April 20, 2006; accepted October 9, 2006.
LETTER
Communicated by Donald Specht
Gap-Based Estimation: Choosing the Smoothing Parameters for Probabilistic and General Regression Neural Networks Mingyu Zhong [email protected]
Dave Coggeshall [email protected]
Ehsan Ghaneie [email protected]
Thomas Pope [email protected]
Mark Rivera [email protected]
Michael Georgiopoulos [email protected] School of Electrical Engineering and Computer Science, University of Central Florida, Orlando, FL 32816, U.S.A.
Georgios C. Anagnostopoulos [email protected] Department of Electrical and Computer Engineering, Florida Institute of Technology, Melbourne, FL 32901, U.S.A.
Mansooreh Mollaghasemi [email protected] Department of Industrial Engineering and Management Systems, University of Central Florida, Orlando, FL 32816, U.S.A.
Samuel Richie [email protected] School of Electrical Engineering and Computer Science, University of Central Florida, Orlando, FL 32816, U.S.A.
Probabilistic neural networks (PNN) and general regression neural networks (GRNN) represent knowledge by simple but interpretable models that approximate the optimal classifier or predictor in the sense of expected value of the accuracy. These models require the specification of an important smoothing parameter, which is usually chosen by crossvalidation or clustering. In this article, we demonstrate the problems with the cross-validation and clustering approaches to specify the smoothing parameter, discuss the relationship between this parameter and some of Neural Computation 19, 2840–2864 (2007)
C 2007 Massachusetts Institute of Technology
Gap-Based Estimation
2841
the data statistics, and attempt to develop a fast approach to determine the optimal value of this parameter. Finally, through experimentation, we show that our approach, referred to as a gap-based estimation approach, is superior in speed to the compared approaches, including support vector machine, and yields good and stable accuracy.
1 Introduction Classification and regression are two types of common problems in science, technology, and even our daily life. Assuming that the inputs and outputs have a fixed and known relationship expressed by the conditional probabilities, it can be proven that the optimal classifier is the Bayes classifier and the optimal predictor is expressed by the conditional expected value of the output given the inputs. In practice, the probabilities representing the inputoutput relationship are not known but are estimated from the given training instances by using legitimate approaches for their estimation. One such probability estimation approach is the Parzen window approach (Parzen, 1962), extended by Cacoullos (1966), which estimates the class-conditional probabilities as a sum of gaussian functions centered at the training points with appropriately chosen widths (standard deviations), designated from now on as smoothing parameters. Parzen’s approach works well under some reasonable assumptions regarding the choice of the smoothing parameters, and when the training data size becomes very large, the approximation becomes equal to the actual class-conditional probabilities. When training data sets are of finite size (as is usually the case in practice), the right choice of smoothing parameters is essential for good performance of the classifier. Probabilistic neural networks (PNN), invented by Specht (1990), approximates a Bayes classifier where the class-conditional probabilities are estimated by using Parzen’s approach with gaussian kernels. General regression neural network (GRNNs), also invented by Specht (1991), is the PNN’s regression counterpart. One of the major issues associated with the PNN and GRNN is how to choose the smoothing parameter(s) involved in the gaussian functions utilized to estimate the conditional probabilities. Unlike linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA) or expectationmaximization algorithms where the number of gaussian functions is preset to a small number, PNN or GRNN applies a variable number of kernels (one per each training example), and consequently the maximum likelihood estimate of the smoothing parameter is zero, resulting in an estimate of the class-conditional probabilities as a series of impulses, thus affecting negatively the generalization. Specht (1992, 1994) suggested using crossvalidation to estimate the smoothing parameters. This approach, although simple, has two major drawbacks. The first is that choosing the right candidate values for validation is not as simple as one expects. Although the
2842
M. Zhong et al.
smoothing parameter represents the standard deviation in the gaussian kernels, we show that in a one-dimensional example (each input has only one attribute), the smoothing parameter should be chosen less than the standard deviation of the observed instances. In fact, the optimal smoothing parameter can be much smaller than the standard deviation of the observed instances. Consequently, the theoretical range of the optimal smoothing parameter is infinitely wide if one chooses the smoothing parameter based on the standard deviation only. The second drawback is that cross-validation is very time-consuming. The time complexity of cross-validation in PNN or GRNN is O(D·PT·PV), where D is the dimensionality of the input, PT is the number of instances in the training set, and PV is the number of instances in the validation set. If N candidate smoothing parameters are examined, the time complexity of cross-validation is O(N·D·PT·PV). To overcome the first difficulty, Galleske and Castellanos (2002) proposed the ellipsoidal model to estimate the smoothing parameters. However, their approach is based on the most complicated PNN structure: each instance corresponds to an individual covariance matrix as the smoothing parameter. To optimize the matrix for each instance, one has to find out the “closest” instance belonging to another class. Therefore, the time complexity issue is not solved but rather worsened. Another way of dealing with this issue of smoothing parameters is to cluster the training data and approximate the class-conditional probabilities by gaussian functions centered at the cluster points instead of the actual training points. The clustering of the data gives the additional capability of estimating the smoothing parameters of these gaussian functions as the within-cluster standard deviations. Clustering procedures that have been used in the literature for the PNN neural network are the learning vector quantization (LVQ) approach (Burrascano, 1991), K-means clustering (Traven, 1991), mixture of gaussians (Tseng, 1991), and the dynamic decay adjustment (DDA) approach (Berthold, 1998). These approaches modify PNN by reducing the number of kernels. For example, the DDA approach actually optimizes the smoothing parameter for radial basis function (RBF) networks. In this article, we stick with the PNN model (one kernel per instance) and attempt to find out whether these clustering approaches can simplify the optimization of the smoothing parameters of PNN significantly. We apply an unsupervised variant of gaussian ARTMAP (GAM) (Williamson, 1996, 1997), which is similar to RBF, to cluster the data. We call this clustering variant gaussian ART (GART). Although this approach compresses the training data and may determine a better value for the smoothing parameter than cross-validation, it tends to deteriorate the estimate of the probability density functions (PDFs), as shown in this article. Our approach attempts to determine a good estimate of the optimal smoothing parameters with a time complexity of O(D·PT). No cross-validation is required. This is a significant computational advantage, and it is also an advantage in practical situations, where the data set given is
Gap-Based Estimation
2843
small, and we do not have the luxury of defining a sizable cross-validation set. The organization of the letter is as follows. Section 2 introduces PNN and GRNN, including their clustered versions. Section 3 discusses the effect of the smoothing parameter on PNN and GRNN. Section 4 describes our approach of choosing the smoothing parameter. Section 5 compares our approach of choosing the smoothing parameter to other commonly used approaches. Section 6 provides a summary of our work and concluding remarks. For the rest of the article, we assume that the reader is familiar with GAM. 2 Preliminaries 2.1 Probabilistic Neural Network. The Bayes classifier is based on the following formula for finding the class to which a datum x belongs: P(c j |x) =
f (x|c j )P(c j ) . f (x)
(2.1)
The above probabilities for every class j in our pattern classification task are calculated, and the datum x is classified as belonging to the class j that maximizes this probability. In order to calculate the above probabilities, one needs to estimate the class-conditional probabilities f (x|c j ) and the a priori probabilities P(c j ) for every class j (the calculation of f (x) is not needed because it is a common factor in all of these probabilities and can be cancelled out). The a priori probabilities P(c j ) are calculated from the given training data. The class-conditional probabilities f (x|c j ) can also be calculated from the training data by using the approximation suggested by Parzen (1962). In particular, the following approximation is utilized to estimate these class-conditional probabilities (see Specht, 1990): P Tj j j x − Xr x − Xr 1 f (x|c j ) = exp − , (2π) D/2 σ D PT j 2σ 2
(2.2)
r =1
where D is the dimensionality (number of attributes) of the input patterns, j PT j represents the number of training patterns belonging to class j, Xr denotes the r th such training pattern, x is the input pattern to be classified, and σ is the smoothing parameter. Cacoullos (1966) points out that this estimation will approach asymptotically the real probability density function (PDF) if both of the following equations are true: lim σ = 0
P Tj →∞
lim PT j σ = ∞.
P Tj →∞
(2.3) (2.4)
2844
M. Zhong et al.
Equation 2.2 is the basis of the PNN classifier. The smoothing parameter σ should be selected properly, as further discussed in section 3. In general, the smoothing parameters may depend on the dimension and the class of the data. For simplicity, we assume attribute independence, so that the smoothing parameter does not become a matrix for each class. The above formula for the estimation of the class-conditional probabilities now becomes
f (x|c j ) =
(2π)
1 D D/2
i=1
P Tj
σi j PT j
exp −
r =1
j 2 D xi − Xir i=1
2σi2j
,
(2.5)
where σi j is the smoothing parameter across dimension i for the training points belonging to class j. 2.2 General Regression Neural Network. For regression problems, it is not difficult to prove that the solution minimizing the expected mean square error is the conditional expected value given below: ∞
E(y|x) = −∞ ∞
yf (x, y)dy
−∞
f (x, y)dy
.
(2.6)
In practice, the joint PDF f(x, y) is estimated by PT f (x, y) =
r =1
D (xi −Xir )2
2 r) exp − (y−y − 2 i=1 2σ0 2σi2 . D P T(2π)(D+1)/2 i=0 σi
(2.7)
Equations 2.6 and 2.7 can be used to show that E(y|x) can be estimated by
D (xi −Xir )2 y exp − r 2 r =1 i=1 2σi
, E(y|x) = PT D (xi −Xir )2 exp − 2 r =1 i=1 2σ PT
(2.8)
i
where σi is the smoothing parameter across dimension i. 2.3 Clustered PNN and GRNN. As shown above, both PNN and GRNN simply memorize all the training instances to perform classification or regression. To compress the training instances, one can cluster them and represent them by the clusters centers. Thus, each cluster center is essentially a training instance with a certain multiplicity. When applied to PNN or GRNN, we should expect the smoothing parameter to depend on the clusters. When clustering is applied to the PNN training data, it is not
Gap-Based Estimation
2845
difficult to see that the modified version of equation 2.5 is N j
r =1
F (x|c j ) =
j N D r j i=1 σir
exp −
(2σ ) D/2
D
Nj
r =1
i=1
j
xi − X¯ ir
2
2σir2
,
Njr
(2.9)
j
where Nr is the multiplicity of, or the number of, original training instances j j covered by the r th cluster in class j, while σir and X¯ ir are the smoothing parameter and the mean value of the ith dimension for the r th cluster in class j, respectively. Similarly, the GRNN equation, when clustering is applied to the GRNN training data, becomes PT
yr Nr r =1 D σir
E(y|x) = PT
i=1
Nr r =1 D σir i=1
D exp − i=1 D exp − i=1
(xi − X¯ ir )2 2σir2 (xi − X¯ ir )2 2σir2
.
(2.10)
3 Analysis of the Smoothing Parameter in the One-Dimensional Cases 3.1 Nonclustered Cases. For simplicity, let us first consider only one dimension and one class without clustering. In this case, equation 2.5 reduces to f (x) =
1 (2π)1/2 σ P T
PT r =1
(x − Xr )2 exp − . 2σ 2
(3.1)
It is not difficult to prove that ˆ = VAR(X) + σ 2 , VAR( X)
(3.2)
where the left-hand side represents the expected value of the variance of the point x using the estimated PDF given in equation 3.1, and VAR(X) stands for the expected value of the variance of the point x using the true PDF. In order to produce an accurate estimate of the PDF, a necessary condition is σ STD(X),
(3.3)
where STD(X) is the unbiased standard deviation of X. This conclusion agrees with equation 2.3, since the standard deviation usually approaches a positive constant number when the number of instances goes to infinity. Note that setting σ i to a large value can smooth out the ith attribute, making the corresponding marginal PDF constant (almost zero) in the whole
2846
M. Zhong et al.
space. This could be beneficial for removing irrelevant attributes for classification, but it alters the PDFs, while our goal in this article is to accurately estimate the PDFs, which can be used by algorithms other than PNN or GRNN. On the other hand, σ i cannot be too small; otherwise, the PDF becomes spiky (consisting of a number of impulse functions) at the training points and is almost zero at other places, which causes the overtraining problem. To understand the effect of the smoothing parameter better, consider an example where the training set is {−10, −9, . . . , 10} and is taken from a uniform distribution in the range [−10.5, 10.5]. In practice, even if the points are uniformly distributed, the observed instances are not likely to be equally spaced; here we simply use this ideal case to demonstrate our idea, which will be further discussed in section 4. Figure 1 shows the resulting estimated PDF in three typical cases. In Figure 1a, the smoothing parameter is properly set, and the estimated PDF is almost the same as the true one except near the two bounds; for better generalization, we do not require the PDF to drop too quickly to 0 outside the range [−10.5, 10.5] (see Figure 4b). In Figure 4b, the smoothing parameter is too small, and the estimated PDF consists of spikes. Although in this case the PDF outside the range [−10.5, 10.5] is accurate and we will have few false positives, the PDF is too small inside the range but off any training point. In a multidimensional space, once a test point is not close enough to any training point, the estimated PDF is close to zero for all classes, and the decision of the PNN is not reliable. In Figure 4c, the smoothing parameter is too large, which means all the training points have significant influence on the PDF at the test point even if they are far away from the test point. When the true PDFs of two classes are overlapped, the crossing points of the discriminant functions tend to be shifted from the true ones (consider the extreme case σ = ∞, which shifts the crossing points to ∞, and the decision will be based on prior probabilities only). This figure also verifies our previous statements. It shows that the standard deviation of the observed attribute values is usually too large to be used as the smoothing parameter. In appendix B, we prove that to estimate the norms accurately, we should set the smoothing parameter much smaller than the standard deviation.
3.2 Clustered Case. Suppose the previous training set with 21 instances is clustered, with each cluster containing the same number of instances. Figure 2 shows the effect of the cluster size and the smoothing parameter on the PDF estimation. In this example, neither the cluster standard deviation nor a reasonable value for the smoothing parameter yields an accurate PDF for both the three-clustered and the seven-clustered case. Since clustering represents multiple instances by their mean value, and as a result loses considerable information, it is difficult to determine a good smoothing parameter quickly.
Gap-Based Estimation
2847
0.05
0.04
0.03
0.02
0.01
0
0
5
10
15
20
15
2
15
20
(a) Reasonable Smoothing Parameter (σ =2) 0.08 0.07 0.06 0.05 0.04 0.03 0.02 0.01 0
0
5
10
(b) Small Smoothing Parameter (σ =0.25) 0.05
0.04
0.03
0.02
0.01
0
0
5
10
(c) Large Smoothing Parameter (σ =Standard Deviation=6.2) Figure 1: Estimates of PDF with various σ values (no clustering). Dash-dot line: desired (true) PDF. Gray solid line: gaussian function centered at a training point. Black solid line: estimated PDF.
2848
M. Zhong et al.
Figure 2: Estimates of PDF with various σ values (with clustering). Dash-dot line: desired (true) PDF. Gray solid line: gaussian function centered at a training point (for better visualization, the gaussian functions in each cluster are summed up instead of being overlapped). Black solid line: estimated PDF.
Gap-Based Estimation
2849
4 Our Approach: Gap-Based Estimation 4.1 Result from a Simple Model. Figure 1 indicates that the width of gaussian kernels, which is directly controlled by the smoothing parameter, should be related to the gap between two nearest points. Our proposed approach, the gap-based estimation, originates from the following simple idea: when the training points are equally spaced in (−L, 0] where L 1, the estimated PDF should (1) be almost constant at points located at the left of the origin and (2) drop to almost zero at the points located at the right of the origin (we allow, though, a transition interval where the value drops from the constant value to the zero value). Numerical solution of the above two constraints indicates that 0.69d < σ < 2.18d, where d is the distance between two neighbors in the above model (see appendix C). In practice, the training points are seldom equally spaced even if uniformly distributed. In this case, we have to generalize the definition of d. If we define d as the average distance between a point and its nearest neighbor (i.e., i min j=i |xi − x j |/N), it is easy to prove that d ≤ d0 = (maxi xi − mini xi )/(N − 1), where d0 is our desired value if the points are uniformly distributed in 1D space (we cannot apply d0 directly to multidimensional spaces). This indicates that the observed d is usually smaller than our desired value. It is also difficult to provide an analytical expression to adjust d in a general case. To address this problem, first we double the bounds as 1.38d < σ < 4.36d and choose σ = 4d for better generalization. Second, we define d as the average value of the largest distances between a point and its 2D nearest neighbors. If this method is applied to our starting ideal model, the computed d is exactly the same. 4.2 Sampling in Multidimensional Space. For multidimensional cases, we should standardize each dimension first so that each dimension has unity standard deviation before computing the gap, which prevents bias toward dimensions with large data. In these cases, σi = min(4d, 0.5)STD(Xi ),
(4.1)
where d is the average gap (roughly the average value of the minimum distance) between two input points after standardization and STD(Xi ) is the standard deviation of the ith dimension before standardization. Note that 4d might be larger than 1 when the dimensions are not independent, which is a challenge to us since the assumption of equation 2.5 is violated. In this case, we replace 4d to 0.5 to retain the validity of inequality 3.3. Computing d directly has a high computational complexity of O(PT2 ). Therefore, we estimate d by sampling the points. We randomly choose M points, where M PT when PT is large, and for each one (X) of these points, we calculate the largest value of the minimum 2D distances to another small
2850
M. Zhong et al.
sample of N points, where N PT when PT is large. However, two issues are raised due to sampling. First, if the points are distributed in clusters, it is possible that all the N points are in a different cluster from the X, which causes d to be estimated as the distance between clusters, despite the fact that we desire to obtain the local distance among the points in the same cluster. Nevertheless, we assume each cluster has the same number of points, which is at least 4C, where C is the number of clusters. Hence, PT ≥ 4C 2
(4.2)
√ In this case, if we set N to be equal to 4 PT, we can prove that the probability that none of the N sampled points are in the same cluster as X does not exceed e −8 = 3.35×10−4 . Even if some of the distances may be overestimated, they are not likely to affect the final result after the sampling is repeated M times. Second, the observed average gap δ in the sampled set is not approximately the same as, but instead proportional to, the desired average gap d in the full data set. The constant of proportionality that connects δ and d depends on the parameters PT, N, and v, where v is the number of intrinsic dimensionality. In particular, we have shown that the actual relationship between δ and d can be expressed by the following equation:
δ≈
PT N
1/v d.
(4.3)
Equation 4.3 can be explained below: when N = PT, δ = d; when the number of samples per intrinsic dimension is halved but Nj is still so large that the samples are representative, N = 2−v PT and δ ≈ 2d. This also implies that under the same distribution with enough points, (PT)1/v d is a constant, which means equations 2.3 and 2.4 are satisfied if we apply equation 4.1 to set the smoothing parameter and v > 1. When equation 4.1 is applied and v is 1, equation 2.4 is not satisfied, but it is not a necessary condition for the estimated PDF to be asymptotically accurate. Our experiments show that our approach works very well for one-dimensional databases. As we mentioned above, v means the intrinsic dimensionality. For example, when the data points of the data set are residing on the unit circle in a 2D space, their attributes x and y can be expressed in terms of the angle of the (x, y) point with respect to the horizontal axis. As a result, in this case, although the data points are two-dimensional, they have only one intrinsic dimension. Equation 4.3 is a single equation with two unknowns (that is, d, which we want to compute, and v, which is unknown). Ideally, two calculations of δ with different values of sample sizes N would be sufficient to produce
Gap-Based Estimation
2851
the needed value d and the unknown intrinsic dimensionality v. Due to the randomness of the sampling procedure that leads to the computation of d, however, two calculations of δ are not enough. For the databases we have experimented with, it turned out that five calculations of δ are sufficient for the calculation of d. To calculate d from the five equations of the form depicted in equation 4.3, we utilized a least-square-error procedure. 4.3 Application to Classification Problems. So far we have discussed the one-class case. For classification problems, there are two ways to set the smoothing parameter: 1. For each class, compute σi j based on equation 4.1, by using only the patterns in the corresponding class. 2. Apply equation 4.1 to all patterns regardless of their classes, and use the computed σ i for all classes. Although the first method appears more reasonable, we argue that it is less beneficial because each class occupies only a subset of the training points, and thus for each class, the estimate of d becomes less accurate after sampling (whose confidence relies on the data size), especially when equation 4.2 is violated. Our preliminary experiments verify our argument, although the corresponding results are not set out in this article. 5 Experiments 5.1 Experimental Procedures. We compare the following approaches for choosing the smoothing parameters for both PNN and GRNN:
r
r
r
Standard deviation: Set the smoothing parameters as the standard deviation for each dimension without discriminating the class labels (which is much better than using the in-class standard deviations in our experiments, although the justification for this statement is not included in this article). Cross-validation: The training set is divided equally to a learning set and a validation set. We evaluate σi = kSTD(Xi ) for each attribute Xi , where k is chosen from {1/2, 1/4, 1/8, 1/16}, and STD(Xi ) is computed as explained in the previous approach. The best k value is selected according to the performance on the validation set. The time for finding the smoothing parameters is defined as the total time for all the runs of PNN/GRNN. Clustering: Run GART (see section 1) to cluster the training set. For PNN, we run GART for each class separately (which is remarkably faster than running GAM on the whole database). The initial standard deviation γ is set to STD(Xi ), and the baseline vigilance ρ is set to 0.5 (in fact, we tested higher values of ρ, but they required more time to
2852
r
M. Zhong et al.
converge, without significantly improving the attained accuracy). It is known that GAM/GART is not sensitive to γ as long as it is close to the final cluster standard deviation. The parameter ρ controls the size of the clusters: ρ = 0 means arbitrarily large clusters are allowed, resulting in only one cluster for GRNN since all regression instances are treated as in the same class; ρ = 1 means only zero-sized clusters can be created, reducing to the noncluster case except when repeated training instances are present. The time for finding the smoothing parameters is defined as the time required for GART to converge. Gap-based estimation: Our approach is described in section 4.
In addition, we compared PNN with C-SVC (support vector classifier) and GRNN with epsilon-SVR (support vector regressor). We applied the radial basis function (RBF) kernel for both SVC and SVR as most other researchers have done in earlier support vector machine (SVM) publications. The implementations of SVC and SVR were downloaded from Chang and Lin (2001), where the authors suggest grid search to optimize the parameters (for loose grid search, C = {2−5 , 2−4 , . . . , 215 } and gamma = {2−15 , 2−14 , . . . , 23 }). We followed these suggestions and found that using such a wide range of each parameter is reasonable, because among the optimal parameters in our experiments (one per database), min C = 2−5 , max C = 215 , min gamma = 2−15 , and max gamma = 2. All algorithms, including PNN, GRNN, GART, SVC, and SVR, are coded efficiently in the same computer language (C/C++) and with the same interface (Matlab MEX DLL). The recorded time does not include what is spent on file input and output or displaying to the screen. All experiments are carried out in the same and stable software environment. 5.2 Databases. The databases used in our experiments are listed in Table 1. We used only large databases because they guarantee statistical significance of the results presented here. Therefore, we did not use the small databases as in Galleske and Castellanos (2002) and Berthold (1998). To demonstrate that the standard deviation is not directly related to the optimal smoothing value, we created an artificial database, Grass and Trees, that contains only one attribute. The first class is uniformly distributed in [0, 1]. The second class has five clusters, centered at 0, 0.25, 0.5, 0.75, and 1, respectively. Each cluster has also uniform distribution with range 0.05. Both classes occupy 50% of the instances. It can be shown that the Bayes classifier attains 90% accuracy on this database. Figure 3 shows that it is difficult to guess the optimal smoothing value using the standard deviation, while our approach produces very accurate estimates. Note that the optimal smoothing parameter can be arbitrarily small when we increase the number of clusters in class 2 while maintaining its overall standard deviation.
Gap-Based Estimation
2853
Table 1: Statistics of Databases. Number of Training Points
Number of Test Points
Number of Numerical Attributes
Number of Classes
% Minor Classesa
2000 500 210 2000 2088 4435 7494 3823
4000 4800 2100 3473 2089 2000 3498 1797
1 2 18 10 7 36 16 62
2 2 7 5 3 6 10 10
50.000 49.812 85.714 10.193 65.677 76.950 89.623 89.872
GRNN Databases
Number of Training Points
Number of Test Points
Number of Numerical Attributes
Output Rangeb
Output Variancec
Friedman Kinematics Pumadyn Bank Abalone Computer
400 4096 4096 4096 2088 4096
1000 4096 4096 4096 2089 4096
5 8 8 8 7 12
27.116 1.4004 24.091 0.74548 26 99
27.094 0.06853 31.41 0.023478 10.544 348.85
PNN Databases Grass and Trees Modified Iris Segmentation Page Blocks Abalone Satellite Pen Digits Optical Digits
a The
percentage of the minor classes is 1 minus the percentage of the major training class in the test set. This percentage represents the misclassification rate for the blind classifier, which always predicts the class as the major training class without considering the attributes). b The output range reflects the output in the test set. c The output variance is computed as mean((y − m)2 ), where y is the output in the test set and m is the mean value of the outputs in the training set. This variance represents the mean square error of the blind predictor (that is, the predictor that always predicts the output as the mean training output without considering the attributes).
The rest of the databases are commonly used benchmark databases. We selected the ones with sufficient size in order to make our results statistically significant. We downloaded all the other classification databases from Newman, Hettich, Blake, and Merz (1998), Friedman from KEEL (2002), and Kinematics, Bank, and Computer from the Delve Repository (Rasmussen et al., 1996). Following is a brief description of each database:
r r
The Iris database represents the classification task of iris plants. We introduced noise to generate enough points and removed the two attributes with the least correlation to the class label. The Segmentation database is created from 7 outdoor images of different scenes. Three-by-three subimages are manually segmented from the 7 images and are classified with the image features. The original database contains 19 attributes, but the third attribute turns out to be constant and thus is removed in our experiments. This database
2854
M. Zhong et al. 5
4
3
2
1
0 0.2
0
0.2
0.4
0.6
0.8
1
1.2
0.8
1
1.2
0.8
1
(a) True PDF 6 5 4 3 2 1 0 0.2
0
0.2
0.4
0.6
(b) Gap-based Estimate 0.9
0.8
0.7
0.6
0.5
0.4
0
0.2
0.4
0.6
1.2
(c) Estimated PDF with σ =STD(X) Figure 3: Estimates of the PDFs for Grass and Trees database. Black line: class 1; gray line: class 2.
Gap-Based Estimation
r r
r
r
r
r
r
r
2855
contains only 210 training points, which is a representative test to our sampling methodology. The Page Blocks database consists of the attributes of page layouts in a document with various block types. The major class “text” occupies approximately 90% of the instances. The Abalone database is used for predicting the age of abalones. We removed the categorical attribute in the original database because it is not used on either PNN or GRNN. For PNN, we also grouped the outputs into three classes: 8 and lower, 9–10, and 11 and greater, as other researchers have done in the past. The resulting classes, however, are still highly overlapped in the attribute space, and current algorithms can attain only approximately 60% accuracy. The Satellite database provides the multispectral values extracted from satellite images corresponding to six types of soil (previously seven types, one of which used to be “mixed soil” and was removed due to doubts about its validity). The Pen Digits database stores the information of 250 digits. The attributes are obtained from the coordinates of the points after spatial sampling on the captured trajectories. Thirty writers contribute to the training set and 14 to the test set. The Optical Digits database is also concerned with the recognition of handwritten digits but without temporal information. The images are divided into subimages, and the number of pixels in each subimage serves as an attribute. The data from 30 writers are used for training and those from the other 13 writers for testing. After two constant attributes are removed from the original database, there are still as many as 62 attributes. The Friedman database is artificial, and was first used by Friedman (1991). The output is defined as y = 10 sin(π x1 x2 ) + 20(x3 − 0.5)2 + 10x4 + 5x5 + ε, where x1 , x2 , x3 , x4 , and x5 are independent attributes uniformly distributed in [0,1], while ε is gaussian noise with zero mean and unity variance. The Kinematics database and the Pumadyn database are chosen from two families generated from the simulation of two different robot arms. They are concerned with the prediction of the end effector from a target and the angular acceleration, respectively. For both databases, we selected the nonlinear version with eight attributes and medium noise. The Bank database is generated from a simulator that simulates bank service. The task is to predict the fraction of customers who leave the bank because all queues are full.
2856
r
M. Zhong et al.
The Computer database consists of computer activity measures, such as the reading and writing rates. The task is to predict the portion of time that the CPUs run in user mode based on the various data transfer rates.
5.3 Experimental Results. All the results are shown in Table 2. For PNN, although the computation of the standard deviation is usually too fast to be timed, it does not appear to be a good value for the smoothing parameters due to its poor accuracy (excluding the blind classifier), especially when the data are distributed in disconnected clusters (see the Grass and Trees database). The cross-validation approach works well in databases where the data corresponding to different classes are small in number and belong to a single connected cluster. However, since the number of candidate values is very limited, cross-validation is sometimes risky, as shown in the experiments with the Grass and Trees database. The clustering approach tested here could work better than crossvalidation in defining reasonable smoothing parameters when class data belong to disconnected clusters, but its accuracy is suspect, possibly due to the weak relationship between the optimal smoothing parameter and the cluster standard deviation, as illustrated in Figure 2. Note that the accuracy is surprisingly low for Optical Digits. We examined the results and found that GART outputs exactly one cluster per training point, due to the high dimensionality, which means the distance among training points tends to be large. The cluster standard deviation is usually too small, because the same point is repeatedly chosen to construct a template. Of course, these results reflect only the GART approach; one could improve the clustering approach by using other clustering algorithms or more complicated estimation of the smoothing parameters for each cluster. Our approach always yields an accuracy that is equal or close to the best one. Note that the accuracy of the Grass and Trees database is almost the theoretical optimal one. Moreover, the time spent to produce the smoothing parameters with our approach is an order of magnitude faster than the crossvalidation approach, and it is scalable to large databases (see Figure 4). The experiments also demonstrate that our approach is robust, whether or not the training set is noisy (as Grass and Trees, Modified Iris, and the Friedman), small (as Segmentation), or high dimensional (which means possible dependency among attributes; see Optical Digits). In our experiments, the SVM accuracy is always better than the gapbased estimate accuracy (except for the Grass and Trees problem), but not significantly better in most instances. The time that it takes to choose the correct SVM parameters is in most cases three to four orders of magnitude larger than the time required by the gap-based estimate approach for choosing PNN parameters, which leads us to the obvious conclusion that there is merit in using the gap-based estimate PNN classifier or GRNN regressor.
Blind Regressor
27.094 0.069 31.410 0.023 10.544 348.85
Friedman Kinematics Pumadyn Bank Abalone Computer
50.00 49.81 85.71 10.19 65.68 76.95 89.62 89.87
Blind Classifier
GRNN Databases
Grass and Trees Modified Iris Segmentation Page Blocks Abalone Satellite Pen Digits Optical Digits
PNN Databases
SVR
11.53 5.50 9.52 3.54 33.32 8.40 1.94 2.78
SVC
1.551 0.0069 11.10 0.00135 4.59 10.88
Table 2: Experimental Results.
15.60 5.38 10.52 4.06 35.47 9.55 2.57 3.78
Cross Validation
9.224 0.032 18.516 0.006 6.220 36.455
Standard Deviation 4.310 0.014 14.788 0.003 5.173 15.653
Cross Validation
Mean Square Error
38.63 6.19 14.38 5.79 37.10 13.00 8.38 3.28
Standard Deviation
Misclassification Rate (%)
4.999 0.019 15.068 0.002 5.024 333.28
Clustering
34.98 5.15 20.24 36.17 40.21 13.25 5.80 78.35
Clustering
4.292 0.014 14.808 0.003 5.513 15.764
Gap
10.60 5.63 10.81 4.78 35.38 9.70 3.23 3.62
Gap
2.00E3 1.21E5 1.24E5 2.11E3 5.49E4 1.23E5
SVR
2.34E3 7.52E1 1.66E1 5.24E2 6.69E3 1.02E4 1.95E4 8.86E3
SVC
0.094 0.063 0.031 32.203 2.922 48.328 22.891 26.5
0.047 5.343 5.344 5.297 1.250 7.328
Cross Validation
1.172 76.187 83.797 147.330 18.782 176.670
Clustering
Time (seconds)
0.578 0.047 0.031 1.156 1.015 16.063 25.219 16.954
Clustering
Time (seconds) CrossValidation
0.016 0.547 0.531 0.531 0.156 0.703
Gap
0.047 0.015 0 0.157 0.125 2.141 1.938 1.360
Gap
Gap-Based Estimation 2857
2858
M. Zhong et al.
Elapsed Time (seconds)
2.5
2
1.5
1
0.5
0 0
50k 100k 150k 200k Size of Training Database (# instances x # attributes)
250k
Figure 4: Elapsed time of gap-based estimation versus database size.
In the experiments with GRNN, we have exactly the same observations: using the standard deviation is fastest but has the worst accuracy; cross-validation has reasonable accuracy while its time is nonscalable; the tested clustering approach using cluster standard deviations as smoothing parameters is unstable in accuracy and expensive in time; our approach is the fastest one (excluding the unreliable STD approach) with good accuracy; SVM has the best accuracy, but it requires the longest time to optimize its parameters. 6 Conclusion In this letter, we presented the gap-based approach of estimating the smoothing parameters for both PNN and GRNN. Our approach was first analyzed for an ideal problem and then applied to more general problems. We utilized sampling techniques to reduce the computational complexity of finding the smoothing parameters to a linear function of the training data size. Our experiments showed that our proposed approach, the gap-based estimate of the smoothing parameter, is faster than other commonly used approaches for finding the smoothing parameters, such as cross-validation. It was also demonstrated, through experimentation, that the gap-based estimate of the smoothing parameters produced a PNN/GRNN network with good and stable accuracy, comparable (albeit inferior) to the accuracy achieved by a SVM approach, but achieving this accuracy in orders of magnitude faster than the SVM approach. The computational complexity required by the gap-based estimate approach to produce the smoothing
Gap-Based Estimation
2859
parameter estimates in PNN/GRNN was low and scalable to larger problems.
Appendix A: Pseudo-Code of Gap-Based Estimation The parameters are K max : Maximum repeat times (natural number); typical K max = 4 F M , F N : Sampling factor (small positive number); typical F M = 8, FN = 4 Main Procedure Compute STD(Xi ) √ for i = 1, 2, . . . , D M = min(P T, F M√ P T ) N = min(P T, F N P T ) K = min(K max , log2 PNT ) If K < 1 (which means PT is small) then d = AverageGap(M, PT) Else δk = AverageGap(M,2k N) for k = 0, 1, 2, . . . , K Solve the equations v1 log( 2Pk TN ) + log d = log δk for k = 0, 1, 2, . . . , K , treating ν1 and log d as unknowns. End If σi = min(4d, 0.5)STD(Xi ) for i = 1, 2, . . . , D Subroutine AverageGap (M, N) For m = 1, 2, . . . , M Randomly choose a point X. For n = 1, 2, . . . , N Randomly choose a point Xn different from X.
2 = dmn
D Xin − Xi 2 i=1
STD(Xi )
End For 2 2 2 Find the smallest 2D elements from dm1 , dm2 , . . . , dmN Find the largest one (denoted by dm2 ) in the above elements End For Return mea n[dm2 ] m
2860
M. Zhong et al.
Note that the smallest 2D elements for each m can be cached, so that when we double N next time, only N more patterns have to be chosen, which halves the computational complexity. The least-square-error solution to the linear equations can be explicitly given as:
1/v = (AT A)−1 AT B log d log(P T/N0 ) .. A= . log(P T/NK )
1 .. , . 1
log d¯ 0 B = ... . log d¯ k
In practice, computing (AT A)−1 AT B directly is not efficient in either time or space. A recursive algorithm can be applied, which is shown below: P0 =
log(P T/N0 ) log(P T/N1 )
1 , 1
−1 P = P0T P0 ,
Q = P0T
log d¯ 0 log d¯ 1
For k = 2, . . . , K a = [log(P T/Nk ) 1] P = P − (PaT )(aPaT + 1)−1 (aP) Q = Q + aT log d¯ k End For
1/v log d
= PQ
Note that the update of P is very simple because (PaT ) is only 2-by-1, (aPaT + 1) is a scalar, and (aP) is 1-by-2. Appendix B: Relationship Between σ and STD Consider the one-dimensional case. The PDF is estimated as 1
p(x|c j ) = √ 2πσ j P Tj
P Tj r =1
exp −
j 2
x − Xr 2σ j2
.
Let f Xˆ (x) = p(x|c j ) stand for the estimated PDF and f X (x) represent the real unknown PDF of the attribute X j of class j. When P Ti is sufficiently
Gap-Based Estimation
2861
large,
(x − X j )2 E exp − f Xˆ (x) = √ 2σ j2 2πσ j ∞ 1 (x − y)2 =√ exp − f X (y)dy, 2σ j2 2πσ j −∞ 1
where E[X] denotes the expected value of the random variable X. It can be easily proven that ˆ = E[X j ], E[ X]
ˆ = VAR[X j ] + σ j2 . VAR[ X]
Therefore, the estimated PDF is not the real PDF since the variance is different, unless σ j is zero, which always overfits the network to the training set. In fact, if we seek another kernel function F (x, X) such that f X (x) = E[F (x, X)] for any x, the only solution is F (x, X) = δ(x − X) = lim √ σ →0
(x − X)2 exp − . 2σ 2 2πσ 1
This observation leads to constraint 2.3. It also suggests that σ j2 VAR[X j ], or σ j STD[X j ]. The above result is also applicable to the multidimensional case. Appendix C: Relationship Between σ and d Now we assume that the attributes have been standardized within each class, that is, STD[X j ] is always one. An important requirement for the σ value of a single class can be derived from a simple model. Assume that the patterns in this class are uniformly distributed within a D-dimensional hypercube whose range in every dimension is (−L , L), and the training data set contains P T = (2m + 1) D (the subscript j is omitted since only this class is discussed) exemplar points regularly located as an array in the hyperbox, as shown in Figure 5. The figure reflects the case where m = 4, D = 2 for simplicity. Actually m and n can be much larger. One of the points is (0, 0, 0, . . . , 0) and one of its neighbors is (d, 0, 0, . . . , 0), where d = 2L/(2m + 1). This model may seem too ideal, but it will be shown that the results can be applied to more general cases. Consider the center P in a “cell” box that has the following 2 D corners (0, 0, 0, . . . , 0), (d, 0, 0, . . . , 0), (0, d, 0, . . . , 0), . . . (d, d, . . . , d). ObviouslyP = (d/2, d/2, . . . , d/2). Since the points are uniformly distributed, it is expected
2862
M. Zhong et al.
P
-L
-d
0
d
L
Figure 5: A simple model.
that the estimated PDF at P is the same as that at the origin: m m D m 1 d 2 1 ... exp − 2 il d − σ D i =−m i =−m i =−m 2σ 2 1
2
n
l=1
m m D m 1 1 2 ... exp − 2 (il d) . = D σ i =−m i =−m i =−m 2σ 1
2
n
l=1
After some calculations, the above equality leads to the following equation: ∞ i=−∞
2
k (i−0.5) =
d2 2 k i where k = exp − 2 . 2σ i=−∞ ∞
Numerical solution of this equation yields σ 2 ≥ 0.476d 2 . Now consider a similar model in which the training patterns are distributed in a hypercube with half volume as the previous one, but with the same density. That is, all the patterns with positive x1 (coordinate in the first dimension) are removed. Then consider the point Q = (q × d, 0, 0, . . . , 0) where q is positive. The estimated PDF at Q should be small enough because Q is outside the region where the training patterns are distributed. With similar derivations, the following condition should be met: 0.476d 2 ≤ σ 2 ≤ 4.58d 2 .
Gap-Based Estimation
2863
σ 2=d 2/8 (k=0.0183) σ 2=d 2/4 (k=0.1353) 2
2
σ =d /2 (k=0.3679) 2
σ =d 2 (k=0.6065) 2
2
σ =2d (k=0.7788) σ 2=4d 2 (k=0.8825) σ 2=8d 2 (k=0.9394)
10
0
2
4
6
8
10
Figure 6: Estimated PDF with various σ 2 .
Thus, the smoothing parameter is allowed to vary in a certain way. To show this more clearly, we plot the PDF with various σ 2 in Figure 6, assuming the training instances are {−50, −49, . . . , 0} (d 2 = 1). Acknowledgments This work was supported in part by the NSF grants: CRCD: 0203446, CCLI: 0341601, DUE: 05254209, IIS-REU: 0647120, and IIS-REU: 0647018. References Berthold, M. R. (1998). Constructive training of probabilistic neural networks. Neurocomputing, 19, 167–183. Burrascano P. (1991). Learning vector quantization for the probabilistic neural network. IEEE Transactions on Neural Networks, 2, 458–461. Cacoullos, T. (1966). Estimation of a multi-variate density. Annals of the Institute of Mathematical Statistics, 18(2), 179–189. Chang, T., & Lin, T. (2001). LIBSVM: A library for support vector machines. Available online at http://www.csie.ntu.edu.tw/∼cjlin/libsvm. Friedman, J. (1991). Multivariate adaptive regression splines (with discussion). Annals of Statistics, 19, 1–141. Galleske I., & Castellanos, J. (2002). Optimization of the kernel functions in a probabilistic neural network analyzing the local pattern distribution. Neural Computation, 14, 1183–1194.
2864
M. Zhong et al.
KEEL (Knowledge Extraction Based on Evolutionary Learning) data sets. (2002). Available online at http://www.keel.es/. Newman, D. J., Hettich, S., Blake, C. L., & Merz, C. J. (1998). UCI Repository of machine learning databases. Irvine: Department of Information and Computer Science, University of California, Irvine. Available online at http://www.ics. uci.edu/∼mlearn/MLRepository.html. Parzen, E. (1962). On estimation of probability density function and mode. Annals of Mathematical Statistics, 33, 1065–1073. Rasmussen, C. E., Neal, R. M., Hinton, G. E., Camp, D. van, Revow, M., Ghahramani, Z., Kustra, R., & Tibshirani, R. (1996). The DELVE manual. Available online at http://www.cs.toronto.edu/∼delve/. Specht, D. F. (1990). Probabilistic neural networks and the polynomial Adaline as complementary techniques for classification. IEEE Transactions on Neural Networks, 1(1), 111–121. Specht, D. F. (1991). A general regression neural network. IEEE Transactions on Neural Networks, 2, 568–576. Specht, D. F. (1992). Enhancements to probabilistic neural networks. In Proceedings of the IEEE International Joint Conference on Neural Networks (Vol. 1, pp. 761–768). Piscataway, NJ: IEEE. Specht, D. F. (1994). Experience with adaptive probabilistic neural networks and adaptive general regression neural networks. In Proceedings of the IEEE World Congress on Computational Intelligence (Vol. 2, pp. 1203–1208). Piscataway, NJ: IEEE. Traven, H. G. C. (1991). A neural network approach to statistical pattern classification by “semi-parametric” estimation of probability density functions. IEEE Transactions on Neural Networks, 2, 366–377. Tseng, M.-L. (1991). Integrating neural networks with influence diagrams for multiple sensor diagnostic systems. Unpublished doctoral dissertation, University of California at Berkley. Williamson, J. R. (1996). Gaussian ARTMAP: A neural network for fast incremental learning of noisy multi-dimensional maps. Neural Networks, 9(5), 881–897. Williamson, J. R. (1997). A constructive, incremental-learning network for mixture modeling and classification. Neural Computation, 9, 1517–1543.
Received May 29, 2006; accepted December 6, 2006.
NOTE
Communicated by Idan Segev
J4 at Sweet 16: A New Wrinkle?

Bardia F. Behabadi
[email protected]
Department of Biomedical Engineering, University of Southern California, Los Angeles, CA 90089, U.S.A.
Bartlett W. Mel
[email protected]
Department of Biomedical Engineering and Neuroscience Graduate Program, University of Southern California, Los Angeles, CA 90089, U.S.A.
Compartmental models provide a major source of insight into the information processing functions of single neurons. Over the past 15 years, one of the most widely used neuronal morphologies has been the cell called "j4," a layer 5 pyramidal cell from cat visual cortex originally described in Douglas, Martin, and Whitteridge (1991). The cell has since appeared in at least 28 published compartmental modeling studies, including several in this journal. In recently examining why we could not reproduce certain in vitro data involving the attenuation of signals originating in distal basal dendrites, we discovered that pronounced fluctuations in the diameter measurements of j4 lead to a bottlenecking effect that increases distal input resistances and significantly reduces voltage transfer between distal sites and the cell body. Upon smoothing these diameter fluctuations, bringing j4 more in line with other reconstructions of layer 5 pyramidal neurons, we found that the attenuation of steady-state voltage signals traveling to the cell body ($V_{distal}/V_{soma}$) was reduced by 60% at some locations in some branches (corresponding to a 2.5-fold increase in the voltage response at the soma for the same distal depolarization) and by 30% on average (corresponding to a 45% increase in somatic response). Changes of this magnitude could lead to different outcomes in some types of compartmental modeling studies. A smoothed version of the j4 morphology is available online at http://lnc.usc.edu/j4-smooth/.

1 Introduction

At the heart of many compartmental models is a 3D morphological reconstruction of a real neuron, consisting of length and diameter values for hundreds to thousands of dendritic segments. Accurate diameter measurements are particularly important since passive signal propagation in neural cables is strongly influenced by the axial resistance per unit length,

Neural Computation 19, 2865–2870 (2007)
C 2007 Massachusetts Institute of Technology
Figure 1: Sample basal dendritic subtrees from three morphological reconstructions of layer 5 pyramidal neurons. (A) Cat V1 neuron, j4. (B) Cat somatosensory neuron from Contreras et al. (1997). (C) Rat somatosensory neuron from Stuart and Spruston (1998). Scale bars in gray are each 20 µm long and 1 µm in diameter. Note that the cell in B is a large cell, with branches roughly twice as long and thick as in the other two cells. Subtrees A, B, and C start from sections a7_1, dend4[0], and dend1[808], respectively, as labeled in the NEURON morphologies.
which varies as $1/d^2$ (Rall & Rinzel, 1973). Accurate encoding of diameter fluctuations is also crucial, given that transitions between small- and large-diameter sections significantly alter voltage attenuation profiles (Jack, Noble, & Tsien, 1975; see the zone between the vertical dashed lines in Figures 2A and 2C). Modern tools for cell reconstruction typically yield smoothly tapering dendritic branches (Contreras, Destexhe, & Steriade, 1997; Stuart & Spruston, 1998; see Figures 1B and 1C). We noted, however, that j4, one of the most widely used pyramidal cell morphologies for simulation studies, has relatively large fluctuations in branch diameter (see Figure 1A). We hypothesized that this diameter jitter, whose origin is unknown, might significantly alter the spatial profiles of voltage signals traveling from distal sites to the cell body, compared to a smooth branch.

2 Results

We measured voltage attenuation from distal sites to the cell body, plotted as an attenuation factor: the distal voltage signal (measured at the stimulus site) divided by the passively propagated steady-state voltage response measured at the soma (see Figures 2A and 2B, gray curves). Dark curves show decreased attenuation for the same branch after smoothing. The smoothing method used in these plots was a least-squares linear fit to the original diameter data, shown as gray trapezoids behind the branch
compartment diagrams. The sharp increase in the attenuation factor beginning at 50 µm on branch A before smoothing is due to the narrowing of the branch to 0.3 µm at that point, compared to a diameter of 1.2 µm just proximally and 1 µm just distally. In contrast, voltage attenuation in branch B shows relatively little effect of smoothing. The differential signal propagation toward the cell body before and after smoothing can be viewed in two ways, exemplified for branch A in Figure 2C. If the local voltage response is equated at 150 µm (e.g., set to 20 mV before and after smoothing by adjusting the current input), then the somatic voltage response for the smoothed branch is 57% larger than the original (Figure 2C, dashed versus gray curves).
Figure 2: Smoothing uneven diameters reduces steady-state attenuation of voltage signals traveling to the cell body. The morphological reconstruction of j4 represents each unbranched length of dendrite as a "section" (ranging from 1–2 µm to 100s of µm in length). Short sections typically contained three diameter samples, while longer sections had samples 10 µm apart. (A, B) For compartmental simulations, sections were divided into electrical compartments, or "segments," of length no greater than one-tenth of the length constant at 100 Hz (Hines & Carnevale, 2001). Segment dimensions are indicated by white rectangular cross sections (diameters are scaled up by a factor of 50 relative to their length). Branch A had pronounced bottlenecks at 50 and 170 µm, while branch B showed relatively little bottlenecking. A least-squares linear approximation to the diameter data was used to smooth each branch, shown as gray backdrops for the branch-compartment diagrams in A and B. Graphs below each branch show signal attenuation as the attenuation factor (AF), that is, the peak local depolarization measured at the site of a distal steady-state stimulus as a fraction of the peak somatic depolarization. The steady-state attenuation factor can be as large as 50 for a very distal site. Light gray curves show AF using the original diameter data, and dark curves are for smoothed branches. (C) Steady-state voltage responses in branch A. The light gray curve and the solid dark curve show the response to the same current injection (375 pA) in the original and smoothed branches, respectively, indicating a lower input resistance distally in the smoothed branch. The dashed curve is the response of the smoothed branch to a current injection (602.5 pA) that was increased to match the local potential in the original unsmoothed branch. (D) Ratios of AF for smoothed versus unsmoothed branches. Heavy solid and dashed curves correspond to the branches in A and B, respectively. Light gray curves are for the other 43 basal dendrites. The light dashed curve is the average for all branches; the gray backdrop shows the envelope of average curves using four different smoothing methods: (1) least-squares linear fit (as discussed in the text), (2) moving average of diameters using a 10-sample window, (3) moving best-fit line using a 10-sample window, and (4) piecewise constant fitting. Note that branches A and B are sections a3_11 and a1_111, respectively, as labeled in the NEURON morphology.
[Figure 2, panels A–D: attenuation factor versus distance from soma (µm) for branches A and B (light gray, original; dark, smoothed); steady-state voltage (mV) along branch A for the same local voltage versus the same local current, from a resting membrane potential of −75 mV; and attenuation factor ratio versus distance from soma for branches A and B, the other branches, and the basal average.]
If instead the local current injection is equated in the two cases, the distal voltage response in the smoothed branch drops to nearly half its original (unsmoothed) value, reflecting a significant drop in input resistance, while the somatic response remains essentially unchanged. This means that j4's diameter bottlenecks produce relatively little current attenuation for slow signals traveling to the soma but can have a profound effect on the dendritic voltage profile. The ratio of the attenuation curves for smoothed versus unsmoothed branches is shown in Figure 2D. The lower dark and upper dashed curves correspond to branches A and B, respectively; the ratio curves for the other 43 basal branches are represented by light gray lines. An attenuation factor (AF) ratio of 0.5 at 100 µm means that whereas the voltage attenuation from 100 µm toward the soma was, say, 5 in the unsmoothed branch, after smoothing, the attenuation from that point to the soma was reduced to 2.5. The middle dashed curve is the average of the AF ratio curves for all 45 basal dendrites using least-squares linear smoothing. On average, the attenuation factor was reduced to about 70% of its original value beginning around 50 µm from the soma.
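To make the bottleneck effect concrete, here is a minimal sketch (not the authors' code) of steady-state voltage attenuation in a passive unbranched cable, comparing a branch with a single 0.3 µm bottleneck against its least-squares linear smoothing. All biophysical values below are illustrative assumptions, not the parameters used in this Note.

```python
import numpy as np

def attenuation_factor(diams_um, seg_len_um=10.0, Ra=150.0, Rm=30e3, I_pA=375.0):
    """V_distal / V_soma for steady current injected at the distal tip.

    diams_um: per-segment diameters (µm); Ra (ohm*cm) axial resistivity;
    Rm (ohm*cm^2) membrane resistivity.  The proximal end is given a large
    leak as a crude stand-in for the somatic load.
    """
    n = len(diams_um)
    L = seg_len_um * 1e-4                               # cm
    d = np.asarray(diams_um) * 1e-4                     # cm
    g_ax = (np.pi * d[:-1] ** 2 / 4.0) / (Ra * L)       # S, neighbor coupling
    g_lk = (np.pi * d * L) / Rm                         # S, per-segment leak
    g_lk[0] *= 1e4                                      # proximal ~ somatic load
    G = np.diag(g_lk)                                   # assemble G for G @ V = I
    for i in range(n - 1):
        G[i, i] += g_ax[i]; G[i + 1, i + 1] += g_ax[i]
        G[i, i + 1] -= g_ax[i]; G[i + 1, i] -= g_ax[i]
    I = np.zeros(n); I[-1] = I_pA * 1e-12               # inject distally (A)
    V = np.linalg.solve(G, I)
    return V[-1] / V[0]

pos = np.arange(20) * 10.0                              # segment centers (µm)
bumpy = np.full(20, 1.0); bumpy[5] = 0.3                # bottleneck near 50 µm
smooth = np.polyval(np.polyfit(pos, bumpy, 1), pos)     # least-squares linear fit
print("AF bumpy :", attenuation_factor(bumpy))
print("AF smooth:", attenuation_factor(smooth))
```

As in the text, the bottlenecked branch shows a much larger attenuation factor than its smoothed counterpart, even though the two have nearly the same average diameter.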
3 Discussion

Smoothing the diameter fluctuations in the j4 morphology had two main effects: (1) lowered distal input resistances and (2) smoother, flatter voltage profiles for signals traveling to the cell body. In contrast, current injected distally traveled to the cell body about equally well in smoothed versus unsmoothed branches. Results shown here are for steady-state signals in basal dendrites modeled as passive cables. Similar results were obtained in j4 for apical dendrites and for basal dendrites configured with a more realistic complement of voltage-dependent channels. In experiments with transient signals (i.e., AMPA-type synaptic inputs), the effects of smoothing were similar, if slightly more pronounced.

It is interesting that a branch with balanced (zero-mean) diameter jitter can differ significantly in its passive electrical properties from a smooth branch with the same average diameter. The main reason for this is that decreases in diameter (the bottlenecks) have a proportionally larger effect on signal propagation than same-sized increases in diameter. The origin of the diameter jitter in j4 is unknown and could be a true reflection of that neuron's morphology. Given this possibility, the smoothed version of j4 used in this Note should not be viewed as "new and improved" morphological data. On the other hand, the smoother profiles of several recent layer 5 pyramidal cell reconstructions (see Figure 1) suggest that j4's diameter jitter could be partly artifactual, arising from measurement uncertainties introduced during the reconstruction process. Diameter variation could also occur if dendrites have elliptical cross sections (as suggested by Ed White based on electron microscopy reconstructions) and twist as they wend their way through the neuropil. Clearly the extent of natural branch shape and diameter fluctuation in pyramidal neuron dendrites remains an unresolved issue. Future modelers using the j4 morphology may therefore wish to test their results using both the original and the diameter-smoothed versions of the cell (available at http://lnc.usc.edu/j4-smooth/). We note that in studies where only low spatial resolution is needed within the dendritic tree, the assignment of a small number of electrical segments per section is also effectively a form of diameter smoothing (Hines & Carnevale, 1997).

References

Contreras, D., Destexhe, A., & Steriade, M. (1997). Intracellular and computational characterization of the intracortical inhibitory control of synchronized thalamic inputs in vivo. J. Neurophysiol., 78(1), 335–350.
Douglas, R. J., Martin, K. A., & Whitteridge, D. (1991). An intracellular analysis of the visual responses of neurones in cat visual cortex. J. Physiol., 440, 659–696.
Hines, M. L., & Carnevale, N. T. (1997). The NEURON simulation environment. Neural Comput., 9(6), 1179–1209.
Hines, M. L., & Carnevale, N. T. (2001). NEURON: A tool for neuroscientists. Neuroscientist, 7(2), 123–135.
Jack, J. J. B., Noble, D., & Tsien, R. W. (1975). Electric current flow in excitable cells. New York: Oxford University Press.
Rall, W., & Rinzel, J. (1973). Branch input resistance and steady attenuation for input to one branch of a dendritic neuron model. Biophys. J., 13(7), 648–687.
Stuart, G., & Spruston, N. (1998). Determinants of voltage attenuation in neocortical pyramidal neuron dendrites. J. Neurosci., 18(10), 3501–3510.
Received January 20, 2006; accepted January 19, 2007.
NOTE
Communicated by Sonia Petrone
On the Consistency of Bayesian Function Approximation Using Step Functions

Heng Lian
[email protected]
Division of Applied Mathematics, Brown University, Providence, RI 02912, U.S.A.
We consider the problem of estimating a step function with an unknown number of jumps under noisy observations on a grid. Under mild assumptions, the Bayesian approach is shown to produce a consistent estimate, even when the underlying true function is not piecewise constant. A simple prior is constructed to illustrate our assumptions.
1 Introduction

Step functions are frequently studied in the statistical literature, either as an approximation to other functions or as a true model in themselves. For example, Green (1995) used a step function as the underlying intensity in a Poisson process. Numerous articles on change point analysis are also closely related to step functions, although the emphasis there is usually on the location of the change point. Recently there has been growing interest in the asymptotics of nonparametric Bayesian estimation. In Barron, Schervish, and Wasserman (1999), the consistency of density estimation is proved by constructing appropriate sieves and bounding the metric entropy. A similar construction is used in Ghosal, Ghosh, and van der Vaart (2000) and in Shen and Wasserman (2001). Another approach appears in Walker (2004), where simple sufficient conditions in terms of sums of prior probabilities are given. A study of some regression problems appears in Walker (2003), which uses random covariates assumed to be independent and identically distributed (i.i.d.). We will prove a similar result for the special simple case in which the underlying function is piecewise constant with a finite number of jumps, corrupted by i.i.d. gaussian noise with known variance. A main difference from the previously studied problems is that we assume the covariates lie on a regular grid in the interval $[0, 1]$, so the observations are independent but not identically distributed. We show that if the prior sufficiently penalizes frequent jumps, the Bayesian procedure is consistent in the sense that the posterior probability mass concentrates on an $L_2$ neighborhood of the true function. We construct a simple prior showing that the assumptions we make for consistency are easily satisfied in practice.

Neural Computation 19, 2871–2880 (2007)
C 2007 Massachusetts Institute of Technology
In an influential paper, Barron et al. (1999) showed how to prove consistency in nonparametric Bayesian density estimation. The main idea is to express the posterior as a ratio and then show that the numerator is exponentially small while the denominator cannot decrease exponentially fast. We follow this reasoning here.

2 Main Results

We closely follow the proofs in Barron et al. (1999). For that reason, we show only the parts of the proofs that differ from Barron et al. (1999); the complete proof can be found on the author's Web site (http://www.dam.brown.edu/people/lian/consistency.htm). We also use the notation of Barron et al. as much as possible, so readers can easily find the corresponding results in that article.
The data we observe are $y_i = f_0(x_i) + \epsilon_i$, with $x_i = i/n$ and $\epsilon_i \overset{\text{iid}}{\sim} N(0, \sigma^2)$, $i = 0, 1, \ldots, n$, where $\sigma^2$ is assumed to be known. The true underlying step function $f_0$ is assumed to come from the set
$$\mathcal{F} = \{ f : f \text{ is a step function defined on } [0,1] \text{ with a finite number of jumps, and } \|f\|_\infty := \sup_x |f(x)| < K \}.$$

The posterior distribution is given by

$$\Pi^n(A) = \frac{\int_A \prod_{i=0}^n e^{-\frac{(y_i - f(x_i))^2}{2\sigma^2}} \, \Pi(df)}{\int \prod_{i=0}^n e^{-\frac{(y_i - f(x_i))^2}{2\sigma^2}} \, \Pi(df)} = \frac{\int_A e^{-\sum_{i=0}^n \frac{(y_i - f(x_i))^2 - (y_i - f_0(x_i))^2}{2\sigma^2}} \, \Pi(df)}{\int e^{-\sum_{i=0}^n \frac{(y_i - f(x_i))^2 - (y_i - f_0(x_i))^2}{2\sigma^2}} \, \Pi(df)} =: \frac{m_n(y^n, A)}{m_n(y^n)}.$$
We construct a sequence of subsets of $\mathcal{F}$,
$$\mathcal{F}_n = \{ f \in \mathcal{F} : f \text{ has at most } k_n \text{ jumps}\},$$
and consider a small neighborhood of $f_0$ defined by
$$A_\epsilon = \left\{ f \in \mathcal{F} : \int_0^1 (f - f_0)^2 \, dx \le \epsilon^2 \right\}.$$
Denoting the prior on $\mathcal{F}$ by $\Pi(df)$, the assumptions that we use in the following proofs are:
Assumption 1. For every $\epsilon > 0$, $\Pi(A_\epsilon) > 0$.

Assumption 2. There exists a sequence $\{\mathcal{F}_n\}_{n=1}^\infty$ defined as above, with $\lim_{n\to\infty} k_n/n = 0$, and positive numbers $c_1, c_2$ such that $\Pi(\mathcal{F}_n^c) \le c_1 e^{-nc_2}$ for all but finitely many $n$.

Assumption 2 means that the prior should put only a small mass on the set of functions that have too many jumps compared to the number of observations. As in Barron et al. (1999), we can show that the denominator $m_n(y^n)$ is not exponentially small.

Lemma 1 (adapted from lemma 4 in Barron et al., 1999). Under assumption 1, for every $\epsilon > 0$, $P_0^\infty(y^\infty : m_n(y^n) \le e^{-n\epsilon} \text{ i.o.}) = 0$.

In order to prove our main theorem, the following explicit bounds are needed:

Lemma 2. Let $f, g \in \mathcal{F}_n$ be step functions with at most $k_n$ jumps. Then
$$\left| \int_0^1 |f(x) - g(x)| \, dx - \frac{\sum_i |f(x_i) - g(x_i)|}{n} \right| \le \frac{4Kk_n}{n},$$
$$\left| \int_0^1 (f(x) - g(x))^2 \, dx - \frac{\sum_i (f(x_i) - g(x_i))^2}{n} \right| \le \frac{8K^2 k_n}{n}.$$
Proof. If $f - g$ has one or more jumps inside $[\frac{i}{n}, \frac{i+1}{n})$, which happens for at most $2k_n$ intervals, then $\left| \int_{i/n}^{(i+1)/n} |f(x) - g(x)| \, dx - \frac{|f(x_i) - g(x_i)|}{n} \right| \le \frac{2K}{n}$, by the uniform boundedness of functions in $\mathcal{F}_n$. If $f - g$ has no jumps inside $[\frac{i}{n}, \frac{i+1}{n})$, then $\int_{i/n}^{(i+1)/n} |f(x) - g(x)| \, dx - \frac{|f(x_i) - g(x_i)|}{n} = 0$. Summing over $i$ gives the first inequality; the second inequality is proved in the same way.

With the above lemmas, we can derive our main theorem:

Theorem 1. Under assumptions 1 and 2, for every $\epsilon > 0$,
$$\lim_{n\to\infty} \Pi^n(A_\epsilon^c) = 0 \quad a.s.\ P_0^\infty.$$
Proof. Write $\Pi^n(A_\epsilon^c) = \Pi^n(A_\epsilon^c \cap \mathcal{F}_n) + \Pi^n(A_\epsilon^c \cap \mathcal{F}_n^c)$. The second term on the right-hand side goes to 0 by our assumption on the prior (the exact arguments of lemma 5 in Barron et al., 1999, can be followed
even though the $y_i$ are not i.i.d.). So we need to prove only that the first term also goes to 0 a.s. $P_0^\infty$. Similar to Barron et al., let $C_n = A_\epsilon^c \cap \mathcal{F}_n$, and write $\Pi^n(C_n) = \frac{m_n(y^n, C_n)}{m_n(y^n)}$. Let $r = r(n, \delta)$ be the covering number of $\mathcal{F}_n$ in the $L_2$ norm, and let $\{f_1, f_2, \ldots, f_r\}$ be the $\delta$-net for $\mathcal{F}_n$. Define $\tilde{E}_j = \{f \in \mathcal{F}_n : \|f - f_j\|_{L_2} \le \delta\}$. Let $E_1 = \tilde{E}_1$, and for $j > 1$, let $E_j = \tilde{E}_j \setminus \cup_{s=1}^{j-1} \tilde{E}_s$. Hence $\{E_1, \ldots, E_r\}$ are disjoint and cover $C_n$. We now write
$$m_n(y^n, C_n) = \sum_{j=1}^r m_n(y^n, C_n \cap E_j) = \sum_{j=1}^r \prod_{i=0}^n \exp\left(-\frac{(y_i - f_j(x_i))^2 - (y_i - f_0(x_i))^2}{2\sigma^2}\right) \cdot \int_{C_n \cap E_j} \prod_{i=0}^n \exp\left(-\frac{(y_i - f(x_i))^2 - (y_i - f_j(x_i))^2}{2\sigma^2}\right) \Pi(df),$$
where we pick $f_j$ from the $\delta$-net if $C_n \cap E_j \ne \emptyset$. The magnitude of the different terms in the last expression is estimated in lemmas 3 to 5. Combining these estimates with lemma 1 and choosing the appropriate value of $\delta$ as a function of $\epsilon$, we get $\lim_{n\to\infty} \Pi^n(A_\epsilon^c) = 0$ a.s. $P_0^\infty$.
Lemma 3.
$$\prod_{i=0}^n \exp\left(-\frac{(y_i - f_j(x_i))^2 - (y_i - f_0(x_i))^2}{2\sigma^2}\right) \le e^{-n\beta}$$
eventually, for $\beta = \frac{(\epsilon - \delta)^2}{16\sigma^2}$.
Proof.
$$P^\infty\left(y^\infty : \prod_{i=0}^n \exp\left(-\frac{(y_i - f_j(x_i))^2 - (y_i - f_0(x_i))^2}{2\sigma^2}\right) \ge e^{-n\beta}\right)$$
$$= P^\infty\left(y^\infty : \prod_{i=0}^n \exp\left(-\frac{(y_i - f_j(x_i))^2 - (y_i - f_0(x_i))^2}{4\sigma^2}\right) \ge e^{-n\beta/2}\right)$$
$$\le e^{n\beta/2}\, E \prod_i \exp\left(-\frac{(y_i - f_j(x_i))^2 - (y_i - f_0(x_i))^2}{4\sigma^2}\right) = \exp\left(n\left(\frac{\beta}{2} - \frac{\sum_i (f_j(x_i) - f_0(x_i))^2}{8n\sigma^2}\right)\right)$$
$$\le \exp\left(n\left(\frac{\beta}{2} - \frac{1}{8\sigma^2}\left(\|f_j - f_0\|_2^2 - \frac{8K^2 k_n}{n}\right)\right)\right) \le \exp\left(n\left(\frac{\beta}{2} - \frac{1}{8\sigma^2}\left((\epsilon - \delta)^2 - \frac{8K^2 k_n}{n}\right)\right)\right) \le \exp\left(-\frac{(\epsilon - \delta)^2}{32\sigma^2}\, n\right),$$
where the first inequality follows from the Markov inequality; the second inequality follows from lemma 2; the third inequality follows from the fact that if $f \in C_n \cap E_j \ne \emptyset$, then $\|f_j - f_0\| \ge \|f_0 - f\| - \|f - f_j\| \ge \epsilon - \delta$; and the last inequality follows from the fact that $k_n/n$ goes to 0, so $8K^2 k_n/n$ will be less than $(\epsilon - \delta)^2/2$ for $n$ large enough.

Lemma 4. If $\delta < 1$, we have
$$\int_{C_n \cap E_j} \prod_{i=0}^n \exp\left(-\frac{(y_i - f(x_i))^2 - (y_i - f_j(x_i))^2}{2\sigma^2}\right) \Pi(df) \le e^{n\gamma}$$
eventually, where $\gamma = \frac{2(4K+1)\delta}{\sigma^2}$.
Proof. We write
$$E \prod_{i=0}^n \exp\left(-\frac{(y_i - f(x_i))^2 - (y_i - f_j(x_i))^2}{2\sigma^2}\right) = \exp\left(\sum_i \frac{(f(x_i) - f_j(x_i))^2 - (f(x_i) - f_j(x_i))(2 f_0(x_i) - f(x_i) - f_j(x_i))}{2\sigma^2}\right)$$
$$\le \exp\left(\frac{n}{2\sigma^2}\left(\frac{\sum_i (f(x_i) - f_j(x_i))^2}{n} + 4K\, \frac{\sum_i |f(x_i) - f_j(x_i)|}{n}\right)\right) \le \exp\left(\frac{n(2\delta^2 + 4K \cdot 2\delta)}{2\sigma^2}\right) \le e^{n \frac{(4K+1)\delta}{\sigma^2}},$$
where the second inequality follows from lemma 2 and $\|f - f_j\|_{L_1} \le \|f - f_j\|_{L_2} \le \delta$. Then,
$$P^\infty\left(\int_{C_n \cap E_j} \prod_{i=0}^n \exp\left(-\frac{(y_i - f(x_i))^2 - (y_i - f_j(x_i))^2}{2\sigma^2}\right) \Pi(df) \ge e^{n\gamma}\right)$$
$$\le e^{-n\gamma} \int_{C_n \cap E_j} E \prod_{i=0}^n \exp\left(-\frac{(y_i - f(x_i))^2 - (y_i - f_j(x_i))^2}{2\sigma^2}\right) \Pi(df) \le e^{-n\left(\gamma - \frac{(4K+1)\delta}{\sigma^2}\right)} = e^{-n\gamma/2}.$$
So we get the result after applying the Borel-Cantelli lemma.
Lemma 5. For any $c > 0$, we have $r(n, \delta) \le e^{cn}$ if $k_n \le \alpha n$ for some $\alpha$ small enough ($\alpha = c \,/\, \log\left(\frac{2K}{\delta}\, e\, \frac{8K^2}{\delta^2}\right)$ suffices).

Proof. The proof is a simple calculation. Choose a grid on the domain $[0, 1]$ and a grid on the range $[-K, K]$:
$$x = \left\{ \frac{\delta^2}{8 k_n K^2}\, i : i \in \mathbb{N} \right\} \cap [0, 1], \qquad y = \{ \delta\, i : i \in \mathbb{Z} \} \cap [-K, K].$$
Let $\tilde{\mathcal{F}}_n = \{ f \in \mathcal{F}_n : f \text{ jumps only at points in } x \text{ and takes values in } y \}$. Then it can be shown that $\tilde{\mathcal{F}}_n$ is a $\delta$-covering of $\mathcal{F}_n$, and its size is at most $\left(\frac{8 k_n K^2}{\delta^2}\right)^{k_n} \left(\frac{2K}{\delta}\right)^{k_n}$.

As a corollary of theorem 1, we have $\|E(f \mid y_1, \ldots, y_n) - f_0\|_{L_2} \to 0$ a.s. $P_0^\infty$. The proof is similar to that of corollary 1 in Barron et al. (1999).

Remark 1. One referee points out that there is a general approach given in Ghosal and van der Vaart (in press) that gives rates of convergence for non-i.i.d. problems. The conditions involve constructing certain tests with exponentially small error. It is possible to adapt that general approach to our situation to obtain a consistency result.
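For concreteness, here is a minimal sketch (not from the note) of the grids used in the proof of lemma 5, together with the logarithm of the size bound; the numeric parameter values are arbitrary.

```python
import numpy as np

K, delta, k_n = 1.0, 0.2, 3
x_grid = np.arange(0.0, 1.0, delta**2 / (8 * k_n * K**2))   # allowed jump points
y_grid = np.arange(-K, K + 1e-12, delta)                    # allowed values

# log of the bound (8 k_n K^2 / delta^2)^{k_n} (2K / delta)^{k_n} on |F~_n|
log_size = k_n * (np.log(8 * k_n * K**2 / delta**2) + np.log(2 * K / delta))
print(len(x_grid), "jump points,", len(y_grid), "values;",
      "log size bound:", round(log_size, 2))
```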
3 Misspecification of the Prior

Sometimes step functions are used as an approximation to a smoother function. The natural question is then whether the posterior is still consistent in this case. We have the following result:
Theorem 2. Suppose the true function $f_0$ is contained in the set
$$\mathcal{G} = \{ f : f \text{ is a continuous function on } [0, 1],\ \|f\|_\infty < K,\ \|f\|_{Lip} < L \}$$
for some constants $K$ and $L$, where $\|f\|_{Lip} := \sup_{x \ne y \in [0,1]} \left| \frac{f(x) - f(y)}{x - y} \right|$ is the Lipschitz coefficient of $f$. We still have
$$\lim_{n\to\infty} \Pi^n(A_\epsilon^c) = 0 \quad a.s.\ P_0^\infty$$
under assumptions 1 and 2.

Proof. The proof is almost identical to that of theorem 1, except that the explicit bounds proved in lemma 2 (used in the proof of lemma 3) cannot be used. Instead, we must prove a similar bound when $f \in \mathcal{G}$ and $g \in \mathcal{F}_n$, which we do next. If $f - g$ has one or more jumps inside $[\frac{i}{n}, \frac{i+1}{n})$, which happens for at most $k_n$ intervals, then $\left| \int_{i/n}^{(i+1)/n} |f(x) - g(x)| \, dx - \frac{|f(x_i) - g(x_i)|}{n} \right| \le \frac{2K}{n}$, by the uniform boundedness of functions in $\mathcal{F}_n$ or $\mathcal{G}$. If $f - g$ has no jumps inside $[\frac{i}{n}, \frac{i+1}{n})$, then $\left| \int_{i/n}^{(i+1)/n} |f(x) - g(x)| \, dx - \frac{|f(x_i) - g(x_i)|}{n} \right| \le \frac{L}{2n^2}$. Summing over $i$, we get
$$\left| \int_0^1 |f(x) - g(x)| \, dx - \frac{\sum_i |f(x_i) - g(x_i)|}{n} \right| \le \frac{L}{2n} + \frac{2Kk_n}{n}.$$
Similar bounds can be shown for $\left| \int_0^1 (f(x) - g(x))^2 \, dx - \frac{\sum_i (f(x_i) - g(x_i))^2}{n} \right|$.
We now present a construction of a prior that satisfies assumptions 1 and 2. The construction is similar to the one used in Green (1995). A step function on $[0, 1]$ is fully determined by a vector $\{k, s_1, s_2, \ldots, s_k, h_0, h_1, \ldots, h_k\}$, where $k$ is the number of jumps, $s_i$ is the location of the $i$th jump, and $h_i$ is the value of $f$ on $[s_i, s_{i+1})$ (with the understanding that $s_0 = 0$ and $s_{k+1} = 1$). In the prior, we suppose that $k$ is drawn from a Poisson distribution Poiss($\lambda$). Given $k$, $\{s_i\}_{i=1}^k$ is distributed as the order statistics of $k$ uniform random variables on $[0, 1]$. Finally, $\{h_i\}_{i=0}^k$ are drawn independently from the uniform distribution on $[-K, K]$. Assumption 2 is satisfied by this specification of the distribution on $k$.
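A minimal sketch (not from the note) of a draw from this prior and the evaluation of the resulting step function; parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_prior_step(lam=3.0, K=1.0):
    """Draw (jump locations s, piece values h): k ~ Poiss(lam), s the order
    statistics of k uniforms on [0, 1], h uniform on [-K, K]."""
    k = rng.poisson(lam)
    s = np.sort(rng.uniform(0.0, 1.0, size=k))
    h = rng.uniform(-K, K, size=k + 1)
    return s, h

def evaluate(s, h, x):
    """f(x) = h_i on [s_i, s_{i+1}), with s_0 = 0 and s_{k+1} = 1."""
    return h[np.searchsorted(s, x, side="right")]

s, h = sample_prior_step()
print(evaluate(s, h, np.linspace(0.0, 1.0, 11, endpoint=False)))
```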
Proposition 1. Let $p(k)$ be the marginal distribution of $\Pi$ on the number of jumps. If $p$ is a Poisson distribution Poiss($\lambda$), then assumption 2 is satisfied.

Proof. We want to show that for appropriately chosen $k_n$,
$$\sum_{k=k_n}^\infty \frac{e^{-\lambda} \lambda^k}{k!} < c_1 e^{-c_2 n}.$$
This can be satisfied by, say, $k_n = n/\log n$. The detailed calculation is omitted here.

By simple calculation, if $p(k) \propto p^k$, that is, $p(k)$ is a geometric distribution, there is no sequence $k_n$ that can satisfy $\lim k_n/n = 0$ and $\sum_{k=k_n}^\infty p(k) < c_1 e^{-c_2 n}$ simultaneously. However, by inspection of the proofs, we can relax assumption 2 somewhat.

Assumption 3. For any $\alpha > 0$, there exists a sequence $k_n$, with $k_n/n \le \alpha$, and positive numbers $c_1, c_2$ such that $\Pi(\mathcal{F}_n^c) \le c_1 e^{-nc_2}$ for all but finitely many $n$.

The condition $\lim k_n/n = 0$ in assumption 2 is used only in the proofs of lemmas 3 to 5. By inspection, the proofs require only that $k_n/n$ be sufficiently small compared to $\epsilon$. So theorem 1 is still true when assumption 2 is replaced by assumption 3. A geometrically distributed prior on $k$, $p(k) \propto p^k$, satisfies assumption 3 with $k_n = \alpha n$.

Proposition 2. For the prior described above, assumption 1 is also satisfied, whether $f_0 \in \mathcal{F}$ or $f_0 \in \mathcal{G}$.

Proof. First, assume $f_0 \in \mathcal{F}$ is a step function with $k$ jumps, so $f_0$ can be equivalently specified as $\{k, s_1, s_2, \ldots, s_k, h_0, h_1, \ldots, h_k\}$. Define an $L_2$ neighborhood $M(f_0, \epsilon_1, \epsilon_2)$ of $f_0$ in $\mathcal{F}$ as follows: every step function $f$ in $M(f_0, \epsilon_1, \epsilon_2)$ has the vector description $\{k, t_1, t_2, \ldots, t_k, g_0, g_1, \ldots, g_k\}$, with $|t_i - s_i| < \epsilon_1$ and $|g_i - h_i| < \epsilon_2$. For every such $f$, we have $\int_0^1 (f(x) - f_0(x))^2 \, dx < 4K^2 k \epsilon_1 + \epsilon_2^2$. By choosing $\epsilon_1, \epsilon_2$ small enough, we have $M(f_0, \epsilon_1, \epsilon_2) \subset A_\epsilon$. Since $p(k) > 0$ and the distributions on $s_i$ and $h_i$ are continuous, $\Pi(A_\epsilon) \ge \Pi(M(f_0, \epsilon_1, \epsilon_2)) > 0$.

If $f_0 \in \mathcal{G}$, there is some $k$ large enough and some step function $g$ with $k$ jumps such that $\|f_0 - g\|_2 < \epsilon/2$. Arguing as above for the step function $g$, we can construct an $L_2$ neighborhood of $g$, $M(g, \epsilon_1, \epsilon_2)$, such that $M(g, \epsilon_1, \epsilon_2) \subset \{ f \in \mathcal{F} : \|g - f\|_{L_2} \le \epsilon/2 \}$. Then $M(g, \epsilon_1, \epsilon_2) \subset A_\epsilon$ by the triangle inequality, and $\Pi(A_\epsilon) \ge \Pi(M(g, \epsilon_1, \epsilon_2)) > 0$.

4 Changing Priors

With simple modifications of the above assumptions, we can deal with the situation in which the prior depends on $n$. From now on, the prior is denoted by $\Pi_n(df)$ (note that $\Pi^n$ with a superscript $n$ is used to denote the posterior). In place of assumption 1, we impose the following assumption:

Assumption 4. For any $\epsilon, t > 0$, there exists an $M$ such that $\Pi_n(A_\epsilon) \ge e^{-tn}$ when $n > M$.
Under assumption 4, we can prove the following (adapted from lemma 1 in Shen & Wasserman, 2001).

Lemma 6. Under assumption 4, for any $t > 0$, $P_0^n(y^n : m_n(y^n) \le e^{-tn}) \to 0$.

We can then show the weak consistency (i.e., $L_2$ consistency in probability) of the posterior distribution under changing priors; the proof is identical to the proof of theorem 1, using lemma 6 in place of lemma 1.

Theorem 3. Under assumptions 2 and 4, for any $\epsilon > 0$, $\Pi^n(A_\epsilon^c) \to 0$ in $P_0^n$ probability.

As a special case, we can show the (weak) consistency of the posterior distribution when the number of jumps is a deterministic parameter rather than random. For that purpose, we use the following prior. The prior $\Pi_n$ is supported on the set of step functions that have exactly $k_n$ jumps. With reuse of notation, we let $\mathcal{F}_n$ denote this set of functions (different from the definition of $\mathcal{F}_n$ in section 2). The distributions on the jump locations $s_i$ and function values $h_i$ are the same as specified in section 3. Using this prior, we have:

Proposition 3. If $\lim_{n\to\infty} k_n/n = 0$ and $k_n \to \infty$, then the posterior is weakly consistent, whether the true function is $f_0 \in \mathcal{F}$ or $f_0 \in \mathcal{G}$.

Proof. Assumption 2 is trivially satisfied since $\Pi_n(\mathcal{F}_n^c) = 0$. We need to verify only assumption 4. First, assume $f_0 \in \mathcal{F}$, that is, $f_0$ is a step function with $k$ jumps. When $k_n > k$, $f_0$ can also be thought of as a step function with $k_n$ jumps. From the proof of proposition 2, we have $M(f_0, \epsilon_1, \epsilon_2) \subset A_\epsilon$ when $\epsilon_1 < \epsilon^2/(8K^2 k_n)$ and $\epsilon_2 < \epsilon/2$. By our construction of the prior, we have $\Pi_n(M(f_0, \epsilon_1, \epsilon_2)) = k_n! (2\epsilon_1)^{k_n} (\epsilon_2/K)^{k_n}$. If $\epsilon_1 = \epsilon^2/(Ck_n)$ and $\epsilon_2 = \epsilon/C$ for some constant $C$ large enough, then by the assumption that $k_n/n \to 0$, we can show
$$k_n! (2\epsilon_1)^{k_n} \left(\frac{\epsilon_2}{K}\right)^{k_n} = k_n! \left(\frac{2\epsilon^2}{Ck_n}\right)^{k_n} \left(\frac{\epsilon}{CK}\right)^{k_n} > e^{-tn}$$
when $n$ is large enough. If $f_0 \in \mathcal{G}$, we can use the same argument as in the proof of the second half of proposition 2.

5 Conclusion

In this note, we find sufficient conditions under which the Bayes estimate of a piecewise constant function is consistent in the $L_2$ topology. Even when the true function is not piecewise constant, the Bayesian procedure can produce a consistent estimate. Under either a Poisson or a geometric
distribution on the number of change points, the posterior is consistent as soon as we put continuous prior distributions on the jump locations and on the function value of each segment. Consistency of the posterior immediately implies consistency of the Bayes estimate; however, it does not guarantee a consistent estimate of the number of jumps. Some additional condition on the prior penalizing functions with a large number of jumps is probably required to prevent overfitting, which might be an interesting direction for further investigation.

References

Barron, A., Schervish, M. J., & Wasserman, L. (1999). The consistency of posterior distributions in nonparametric problems. Ann. Statist., 27, 536–561.
Ghosal, S., Ghosh, J. K., & van der Vaart, A. W. (2000). Convergence rates of posterior distributions. Ann. Statist., 28, 500–531.
Ghosal, S., & van der Vaart, A. W. (in press). Convergence rates of posterior distributions for non-i.i.d. observations. Ann. Statist.
Green, P. (1995). Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika, 82, 711–732.
Shen, X., & Wasserman, L. (2001). Rates of convergence of posterior distributions. Ann. Statist., 29, 79–100.
Walker, S. G. (2003). Bayesian consistency for a class of regression problems. South African Statistical Journal, 37, 149–167.
Walker, S. G. (2004). New approaches to Bayesian consistency. Ann. Statist., 32, 2028–2043.
Received March 7, 2006; accepted December 15, 2006.
LETTER
Communicated by Wulfram Gerstner
Learning Real-World Stimuli in a Neural Network with Spike-Driven Synaptic Dynamics

Joseph M. Brader
[email protected]
Walter Senn
[email protected]
Institute of Physiology, University of Bern, Bern, Switzerland
Stefano Fusi
[email protected]
Institute of Physiology, University of Bern, Bern, Switzerland, and Institute of Neuroinformatics, ETH/University of Zurich, 8059 Zurich, Switzerland
We present a model of spike-driven synaptic plasticity inspired by experimental observations and motivated by the desire to build an electronic hardware device that can learn to classify complex stimuli in a semisupervised fashion. During training, patterns of activity are sequentially imposed on the input neurons, and an additional instructor signal drives the output neurons toward the desired activity. The network is made of integrate-and-fire neurons with constant leak and a floor. The synapses are bistable, and they are modified by the arrival of presynaptic spikes. The sign of the change is determined by both the depolarization and the state of a variable that integrates the postsynaptic action potentials. Following the training phase, the instructor signal is removed, and the output neurons are driven purely by the activity of the input neurons weighted by the plastic synapses. In the absence of stimulation, the synapses preserve their internal state indefinitely. Memories are also very robust to the disruptive action of spontaneous activity. A network of 2000 input neurons is shown to be able to classify correctly a large number (thousands) of highly overlapping patterns (300 classes of preprocessed Latex characters, 30 patterns per class, and a subset of the NIST characters data set) and to generalize with performances that are better than or comparable to those of artificial neural networks. Finally we show that the synaptic dynamics is compatible with many of the experimental observations on the induction of long-term modifications (spike-timing-dependent plasticity and its dependence on both the postsynaptic depolarization and the frequency of pre- and postsynaptic neurons).
Neural Computation 19, 2881–2912 (2007)
C 2007 Massachusetts Institute of Technology
1 Introduction

Many recent studies of spike-driven synaptic plasticity have focused on using biophysical models to reproduce experimental data on the induction of long-term changes in single synapses (Senn, Markram, & Tsodyks, 2001; Abarbanel, Huerta, & Rabinovich, 2002; Shouval, Bear, & Cooper, 2002; Karmarkar & Buonomano, 2002; Saudargiene, Porr, & Wörgötter, 2004; Shouval & Kalantzis, 2005). The regulatory properties of synaptic plasticity based on spike timing (STDP) have been studied both in recurrent neural networks and at the level of single synapses (see Abbott & Nelson, 2000; Kempter, Gerstner, & van Hemmen, 2001; Rubin, Lee, & Sompolinsky, 2001; Burkitt, Meffin, & Grayden, 2004), in which the authors study the equilibrium distribution of the synaptic weights. These are only two of the many aspects that characterize the problem of memory encoding, consolidation, maintenance, and retrieval. In general, these different aspects have been studied separately, and the computational implications have been largely neglected, despite the fact that protocols to induce long-term synaptic changes based on spike timing were initially considered in the computational context of temporal coding (Gerstner, Kempter, van Hemmen, & Wagner, 1996). More recently, spike-driven synaptic dynamics has been linked to several in vivo phenomena. For example, spike-driven plasticity has been shown to improve the performance of a visual perceptual task (Adini, Sagi, & Tsodyks, 2002), to be a good candidate mechanism for the emergence of direction-selective simple cells (Buchs & Senn, 2002; Senn & Buchs, 2003), and to shape the orientation tuning of cells in the visual cortex (Yao, Shen, & Dan, 2004). In Yao et al. (2004) and Adini et al. (2002), spike-driven models are proposed that make predictions in agreement with experimental results. Only in Buchs and Senn (2002) and Senn and Buchs (2003) did the authors consider the important problem of memory maintenance. Other work has focused on the computational aspects of synaptic plasticity but neglects the problem of memory storage. Rao and Sejnowski (2001), for example, encode simple temporal sequences. In Legenstein, Naeger, and Maass (2005), spike-timing-dependent plasticity is used to learn an arbitrary synaptic configuration by imposing the appropriate temporal pattern of spikes on the input and output neurons. In Gütig and Sompolinsky (2006), the principles of the perceptron learning rule are applied to the classification of temporal patterns of spikes. Notable exceptions are Hopfield and Brody (2004), in which the authors consider a self-repairing dynamic synapse, and Fusi, Annunziato, Badoni, Salamon, and Amit (2000), Giudice and Mattia (2001), Amit and Mongillo (2003), Giudice, Fusi, and Mattia (2003), and Mongillo, Curti, Romani, and Amit (2005), in which discrete plastic synapses are used to learn random uncorrelated patterns of mean firing rates as attractors of the neural dynamics. However, a limitation of all these studies is that the patterns stored by the network remain relatively simple.
Here we propose a model of spike-driven synaptic plasticity that can learn to classify complex patterns in a semisupervised fashion. The memory is robust against the passage of time, spontaneous activity, and the presentation of other patterns. We address the fundamental problems of memory: its formation and its maintenance.

1.1 Memory Retention. New experiences continuously generate new memories that would eventually saturate the storage capacity. The interference between memories can provoke the blackout catastrophe that would prevent the network from recalling any of the previously stored memories (Hopfield, 1982; Amit, 1989). At the same time, no new experience could be stored. In fact, the main limitation on the storage capacity does not come from interference if the synapses are realistic (i.e., they do not have an arbitrarily large number of states) and hence allow only a limited amount of information to be stored in each synapse. When subject to these constraints, old memories are forgotten (the palimpsest property; Parisi, 1986; Nadal, Toulouse, Changeux, & Dehaene, 1986). In particular, the memory trace decays in a natural way as the oldest memories are replaced by more recent experiences. This is particularly relevant for any realistic model of synaptic plasticity that allows only a limited amount of information to be stored in each synapse. The variables characterizing a realistic synaptic dynamics are bounded and do not allow for long-term modifications that are arbitrarily small. When subject to these constraints, the memory trace decays exponentially fast (Amit & Fusi, 1992, 1994; Fusi, 2002), at a rate that depends on the fraction of synapses modified by every experience: fast learning inevitably leads to fast forgetting of past memories and results in an uneven distribution of memory resources among the stimuli (the most recent experiences are better remembered than old ones). This result is very general and does not depend on the number of stable states of each synaptic variable or on the specific synaptic dynamics (Fusi, 2002; Fusi & Abbott, 2007). Slowing the learning process allows the maximal storage capacity to be recovered in the special case of uncorrelated random patterns (Amit & Fusi, 1994). The price to be paid is that all the memories must be experienced several times to produce a detectable mnemonic trace (Brunel, Carusi, & Fusi, 1998). The brain seems to be willing to pay this price in some cortical areas, such as inferotemporal cortex (see Giudice et al., 2003, for a review). The next question is how to implement a mechanism that slows learning in an unbiased way. We assume that we are dealing with realistic synapses, so it is not possible to reduce the size of the synaptic modifications induced by each stimulus to arbitrarily small values. Fortunately, each neuron sees a large number of synapses, and memory retrieval depends only on the total synaptic input. If only a small fraction of synaptic modifications is consolidated, then the change of the total synaptic input can be much smaller than NJ, where N is the total number of synapses to be changed and J
is the minimal synaptic change. Randomly selecting the synaptic changes that are consolidated provides a simple, local, unbiased mechanism to slow learning (Tsodyks, 1990; Amit & Fusi, 1992, 1994). Such a mechanism requires an independent stochastic process for each synapse; depending on the outcome of this process, the synaptic change is either consolidated or cancelled. The irregularity of the neural activity provides a natural source of randomness that can be exploited by the synapse (Fusi et al., 2000; Chicca & Fusi, 2001; Fusi, 2003). In this letter, we employ this approach and study a synapse that is bistable on long timescales. The bistability protects memories against the modifications induced by ongoing spontaneous activity and provides a simple way to implement the required stochastic selection mechanism. Not only is there accumulating evidence that single biological synaptic contacts undergo all-or-none modifications (Petersen, Malenka, Nicoll, & Hopfield, 1998; Wang, O'Connor, & Wittenberg, 2004), but additional synaptic states do not significantly improve the memory performance (Amit & Fusi, 1994; Fusi, 2002; Fusi & Abbott, 2007).

1.2 Memory Encoding. The second important issue is related to the way each synapse is modified to allow the network to recall a specific memory at a later time. Here we deal with supervised learning: each stimulus to be memorized imposes a characteristic pattern of activity on the input neurons, and an "instructor" generates an extra synaptic current that steers the activity of the output neurons in a desired direction. Note that the activity of the output neurons is not entirely determined by the instructor, because the input neurons also contribute to determining the output (semisupervised learning). The aim of learning is to modify the synaptic connections between the input and the output neurons so that the output neurons respond as desired in both the presence and the absence of the instructor. This problem can be solved with the perceptron learning rule (Rosenblatt, 1958) or with algorithms such as backpropagation (see Hertz, Krogh, & Palmer, 1991). Here we focus on a more complex and biologically inspired spike-driven synaptic dynamics that implements a learning algorithm similar to that of the perceptron. In the past, similar implementations of the synaptic dynamics have been successfully applied to learn nonoverlapping binary patterns (Giudice & Mattia, 2001; Amit & Mongillo, 2003; Giudice et al., 2003) and random uncorrelated binary patterns with a constant number of active neurons (Mongillo et al., 2005). In all these works, the authors were aiming to make the activity patterns imposed during training into stable attractors of the recurrent network dynamics. Here we consider a feedforward network, but the synaptic dynamics we develop could just as well be used in a recurrent network (see section 6.5 for more detail). We show that in order to store more complex patterns with no restriction on the correlations or the number of active neurons, the long-term synaptic dynamics should slow down when the response of the output neurons is in agreement with the one generated by the total current of the input
neurons. This is an indication that the currently presented pattern has already been learned and that it is not necessary to change the synapses further (the stop-learning condition, as in the perceptron learning rule: Rosenblatt, 1958; Block, 1962; Minsky & Papert, 1969). When a single output neuron is considered, arbitrary linearly separable patterns can be learned without errors (Senn & Fusi, 2005a, 2005b; Fusi & Senn, 2006), even in the extreme case of binary synapses. If more than one output neuron is read, then nonlinearly separable patterns can also be classified, which is not possible with a simple perceptron. We consider a network with multiple output neurons, realistic spike-driven dynamics implementing the stop-learning condition, and synapses that are binary on long timescales.

2 Abstract Learning Rule

We first describe the abstract learning rule: the schematic prescription according to which the synapses should be modified at each stimulus presentation. We then introduce the detailed spike-driven synaptic dynamics that implements this prescription in an approximate fashion. The stochastic selection and stop-learning mechanisms we require are incorporated into a simple learning rule. We consider a single output neuron receiving a total current $h$, which is the weighted sum of the activities $s_j$ of the $N$ input neurons:

$$h = \frac{1}{N} \sum_{j=1}^{N} (J_j - g_I) s_j, \tag{2.1}$$
where the $J_j \in \{0, 1\}$ are binary plastic excitatory synaptic weights and $g_I$ is a constant representing the contribution of an inhibitory population. The latter can be regarded as a group of cells uniformly connected to the input neurons and projecting their inhibitory afferents to the output neuron. Following each stimulus, the synapses are updated according to the following rule: if the instructor determines that the postsynaptic output neuron should be active and the total input $h$ is smaller than some threshold value $\theta$, then the efficacy of each synapse $J_j$ is set equal to unity with a probability $q^+ s_j$, proportional to the presynaptic activity $s_j$. In general, this activity is a continuous variable, $s_j \in\, ]0, 1[$; in this letter, we employ the binary activities $s_j = 0, 1$. On the contrary, if the output neuron should be inactive and $h$ is larger than $\theta$, then the synapse is depressed with probability $q^- s_j$. The threshold $\theta$ determines whether the output neuron is active in the absence of the instructor. The synapses are thus modified only when the output produced by the weighted sum of equation 2.1 is unsatisfactory, that is, not in agreement with the output desired by the instructor. With this prescription, learning would stop as soon as the output is satisfactory. In practice, however, it is useful to introduce a margin $\delta$
and stop potentiation only when $h > \theta + \delta$. Analogously, depression stops only when $h < \theta - \delta$. The margin $\delta$ guarantees better generalization (see section 5.6). The learning rule can be summarized as follows:

$$J_j \to 1 \ \text{with probability } q^+ s_j \quad \text{if } h < \theta + \delta \text{ and } \xi = 1$$
$$J_j \to 0 \ \text{with probability } q^- s_j \quad \text{if } h > \theta - \delta \text{ and } \xi = 0, \tag{2.2}$$

where $\xi$ is a binary variable indicating the desired output as specified by the instructor and the arrow indicates how the synapse is updated. This learning prescription allows us to learn linearly separable patterns in a finite number of iterations (Senn & Fusi, 2005a, 2005b), provided that (1) $g_I$ lies between the minimal and the maximal excitatory weights ($g_I \in\, ]0, 1[$), (2) $N$ is large enough, and (3) $\theta$, $\delta$, and $q^\pm$ are small enough.
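A minimal sketch (not the authors' code) of one application of the rule in equation 2.2 for binary synapses and binary input activities; all parameter values below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 2000                        # input neurons (N_input in Table 1)
theta, delta = 0.0, 0.05        # threshold and margin (illustrative)
q_plus = q_minus = 0.01         # consolidation probabilities (illustrative)
g_I = 0.5                       # uniform inhibitory contribution, g_I in ]0,1[
J = rng.integers(0, 2, N).astype(float)    # binary synapses J_j in {0,1}

def present(J, s, xi):
    """One stimulus: input pattern s (0/1 vector) and desired output xi."""
    h = np.mean((J - g_I) * s)                 # total current, eq. 2.1
    if xi == 1 and h < theta + delta:          # should fire, but h is too low
        J[rng.random(N) < q_plus * s] = 1.0    # potentiate with prob. q+ s_j
    elif xi == 0 and h > theta - delta:        # should stay silent, h too high
        J[rng.random(N) < q_minus * s] = 0.0   # depress with prob. q- s_j
    return J                                   # unchanged when output is OK

s = (rng.random(N) < 0.1).astype(float)        # a sparse random pattern
J = present(J, s, xi=1)
```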
3 The Synaptic Model

The design of the synaptic dynamics has been largely dictated by the need to implement the abstract learning rule described in the previous section in hardware. The components we have selected are directly or indirectly related to known properties of biological synapses. They are combined to produce a spike-driven synaptic dynamics that implements the desired learning rule and, at the same time, is compatible with the experimental protocols used to induce long-term modifications.

3.1 Memory Consolidation. The first aspect we consider is related to memory preservation against the effects of both spontaneous activity and the presentation of other stimuli. We assume that each synaptic update is triggered by the arrival of a presynaptic spike. However, in order to preserve existing memories, not all of these events will eventually lead to a long-term modification of the synapse. If many of these events change the synapse in the same direction and their effect accumulates, then the consolidation process might be activated. In such a case, the synapse is modified permanently, or at least until the next stimulation; otherwise the synaptic efficacy preceding the stimulation is restored. In the first case, a transition to a new stable state has occurred. The activation of the consolidation process depends on the specific train of presynaptic spikes and on its coincidence with other events (e.g., elevated postsynaptic depolarization). Many presynaptic trains can share the same rate (which in our case encodes the stimulus) but produce different outcomes in terms of consolidation. In particular, if the presynaptic spikes arrive at random times, then consolidation is activated with some probability (Fusi et al., 2000; Fusi, 2003). This allows implementing the stochastic selection that chooses only a small fraction of the synapses to be changed on each stimulus presentation. Notice that the synaptic dynamics can be completely deterministic (as in our model) and that the stochasticity of the selection is generated by the irregularity of the pre- and postsynaptic activities.

3.2 Memory Encoding. The main goal of our synaptic model is to encode patterns of mean firing rates. In order to guide our choice of model, we incorporate elements that have a counterpart in neurophysiological experiments on pairs of connected neurons. The specific experimental aspects we choose to consider are:

1a: Spike-timing-dependent plasticity (STDP). If a presynaptic spike precedes a postsynaptic action potential within a given temporal window, the synapse is potentiated, and the modification is stable on long timescales in the absence of other stimulations (memory is consolidated). If the phase relation is reversed, the synapse is depressed. This behavior has been observed in vitro (Markram, Lübke, Frotscher, & Sakmann, 1997; Feldman, 2000; Sjöström, Turrigiano, & Nelson, 2001), with realistic spike trains (Froemke & Dan, 2002; Sjöström et al., 2001), and in vivo (Zhang, Tao, Holt, Harris, & Poo, 1998; Zhou, Tao, & Poo, 2003), for mean pre- and postsynaptic frequencies between 5 and 20 Hz.

1b: Dependence on postsynaptic depolarization. If the STDP protocol is applied to obtain LTP but the postsynaptic neuron is hyperpolarized, the synapse remains unaltered or slightly depresses (Sjöström et al., 2001). More generally, the postsynaptic neuron needs to be sufficiently depolarized for LTP to occur.

1c: LTP dominance at high frequencies. When both pre- and postsynaptic neurons fire at elevated frequencies, LTP always dominates LTD, regardless of the phase relation between the pre- and postsynaptic spikes (Sjöström et al., 2001).

The corresponding dynamic elements we include in our model are:

2a: STDP. Two dynamical variables are needed to measure the time passed since the last pre- and postsynaptic spikes. They would be updated on the arrival or generation of a spike and then decay on the typical timescale of the temporal window of STDP (on the order of 10 ms). Other dynamical variables acting on longer timescales would be needed to restrict the STDP behavior to the frequency range of 5 to 20 Hz.

2b: Dependence on postsynaptic depolarization. A direct reading of the depolarization is sufficient. Notice that the postsynaptic depolarization can be used to encode the instantaneous firing rate of the postsynaptic neuron (Fusi et al., 2000; Fusi, 2001): the average subthreshold depolarization of the neuron is a monotonic function of the mean firing rate of the postsynaptic neuron, in both simple
models of integrate-and-fire neurons (Fusi et al., 2000) and experiments (Sjöström et al., 2001).

2c: LTP dominance at high frequencies. We assume that a relatively slow variable acting on a timescale of 100 ms (the internal calcium concentration is a good candidate; Abarbanel et al., 2002; Shouval et al., 2002) measures the mean postsynaptic frequency. For high values of this variable, LTP should dominate, and aspects related to spike timing should be disregarded.

Ingredients 2a to 2c are sufficient to implement the abstract learning rule, but without the desired stop-learning condition, that is, the condition that if the frequency of the postsynaptic neuron is too low or too high, no long-term modification should be induced. This additional regulatory mechanism could be introduced through the incorporation of a new variable or by harnessing one of the existing variables. A natural candidate to implement this mechanism is the calcium concentration, ingredient 2c, as the average depolarization is not a sufficiently sensitive function of postsynaptic frequency to be exploitable (Fusi, 2003).

3.3 Model Reduction. We now introduce the minimal model that reproduces all the necessary features and implements the abstract rule. STDP can be achieved using a combination of depolarization dependence and an effective neuronal model, as in Fusi et al. (2000) and Fusi (2003). When a presynaptic spike shortly precedes a postsynaptic action potential, it is likely that the depolarization of an integrate-and-fire neuron is high, resulting in LTP. If the presynaptic spike comes shortly after the postsynaptic action potential, the postsynaptic integrate-and-fire neuron is likely to be recovering from the reset following spike emission; it is then likely to be hyperpolarized, resulting in LTD. This behavior depends on the neuronal model. In this work, we employ simple linear integrate-and-fire neurons with a constant leak and a floor (Fusi & Mattia, 1999),

$$\frac{dV}{dt} = -\lambda + I(t), \tag{3.1}$$
where $\lambda$ is a positive constant and $I(t)$ is the total synaptic current. When a threshold $V_\theta$ is crossed, a spike is emitted, and the depolarization is reset to $V = H$. If at any time $V$ becomes negative, it is immediately reset to $V = 0$. This model can reproduce quantitatively the response function of pyramidal cells measured in experiments (Rauch, La Camera, Lüscher, Senn, & Fusi, 2003). The adoption of this neuronal model, in addition to the considerations on the temporal relations between pre- and postsynaptic spikes, allows us to reproduce STDP (point 1a of the above list) with only one dynamic variable (the depolarization $V(t)$ of the postsynaptic neuron), provided that we accept modifications in the absence of postsynaptic action
potentials. Given that we never have silent neurons in realistic conditions, this last restriction should not affect the network behavior much. These considerations allow us to eliminate the two dynamic variables of point 2a and to model the synapse with only one dynamic variable $X(t)$, which is modified on the basis of the postsynaptic depolarization $V(t)$ (ingredient 2b) and the postsynaptic calcium variable $C(t)$. We emphasize that we consider an effective model in which we attempt to subsume the description into as few parameters as possible; it is clear that different mechanisms are involved in real biological synapses and neurons.

We now specify the details of the synaptic dynamics. The synapses are bistable, with efficacies $J^+$ (potentiated) and $J^-$ (depressed). Note that the efficacies $J^+$ and $J^-$ can now be any two real numbers and are no longer restricted to the binary values (0, 1) as in the case of the abstract learning rule. The internal state of the synapse is represented by $X(t)$, and the efficacy of the synapse is determined according to whether $X(t)$ lies above or below a threshold $\theta_X$. The calcium variable $C(t)$ is an auxiliary variable with a long time constant and is a function of the postsynaptic spiking activity,

$$\frac{dC(t)}{dt} = -\frac{1}{\tau_C} C(t) + J_C \sum_i \delta(t - t_i), \tag{3.2}$$
where the sum is over postsynaptic spikes arriving at times $t_i$, $J_C$ is the contribution of a single postsynaptic spike, and $\tau_C$ is the time constant (La Camera, Rauch, Lüscher, Senn, & Fusi, 2004). The variable $X(t)$ is restricted to the interval $0 \le X \le X_{max}$ (in this work, we take $X_{max} = 1$) and is a function of $C(t)$ and of both pre- and postsynaptic activity. A presynaptic spike arriving at $t_{pre}$ reads the instantaneous values $V(t_{pre})$ and $C(t_{pre})$. The conditions for a change in $X$ depend on these instantaneous values in the following way:

$$X \to X + a \quad \text{if } V(t_{pre}) > \theta_V \ \text{ and } \ \theta_{up}^l < C(t_{pre}) < \theta_{up}^h$$
$$X \to X - b \quad \text{if } V(t_{pre}) \le \theta_V \ \text{ and } \ \theta_{down}^l < C(t_{pre}) < \theta_{down}^h, \tag{3.3}$$
where $a$ and $b$ are jump sizes, $\theta_V$ is a voltage threshold ($\theta_V < V_\theta$), and $\theta_{up}^l$, $\theta_{up}^h$, $\theta_{down}^l$, and $\theta_{down}^h$ are thresholds on the calcium variable (see Figure 2a). In the absence of a presynaptic spike, or if the conditions in equation 3.3 are not satisfied, $X(t)$ drifts toward one of two stable values,

$$\frac{dX}{dt} = \alpha \quad \text{if } X > \theta_X$$
$$\frac{dX}{dt} = -\beta \quad \text{if } X \le \theta_X, \tag{3.4}$$

where $\alpha$ and $\beta$ are positive constants and $\theta_X$ is a threshold on the internal variable. If at any point during the time course $X < 0$ or $X > 1$, then $X$ is held at the respective boundary value. The efficacy of the synapse is determined by the value of the internal variable at $t_{pre}$: if $X(t_{pre}) > \theta_X$, the synapse has efficacy $J^+$, and if $X(t_{pre}) \le \theta_X$, the synapse has efficacy $J^-$.
Figure 1: Stochastic synaptic transitions. (Left) A realization for which the accumulation of jumps causes X to cross the threshold θ X (an LTP transition). (Right) A second realization in which the jumps are not consolidated and thus give no synaptic transition. The shaded bars correspond to thresholds on C(t); see equation 3.3. In both cases illustrated here, the presynaptic neuron fires at a mean rate of 50 Hz, while the postsynaptic neuron fires at a rate of 40 Hz.
In Figure 1 we show software simulation results for two realizations of the neural and synaptic dynamics during a 300 ms stimulation period. In both cases, the input is a Poisson train with a mean rate of 50 Hz, and the mean output rate is 40 Hz. Due to the stochastic nature of the pre- and postsynaptic spiking activity, one realization displays a sequence of jumps that consolidate to produce an LTP transition ($X(t)$ crosses the threshold $\theta_X$), whereas the other does not cross $\theta_X$ and thus gives no transition. All parameter values are given in Table 1.

4 Single Synapse Behavior

4.1 Probability of Potentiating and Depressing Events. The stochastic process $X(t)$ determining the internal state of the synapse is formed by an accumulation of jumps that occur on the arrival of presynaptic spikes. Before analyzing the full transition probabilities of the synapse, we present results for the probabilities that, given a presynaptic spike, $X$ experiences either an upward or a downward jump. In Figure 2a, we show the steady-state probability density function of the calcium variable $C(t)$ for different postsynaptic frequencies. At low values of the postsynaptic firing rate $\nu_{post}$, the dynamics are dominated by the decay of the calcium concentration, resulting in a pile-up of the distribution near the origin. For larger values, $\nu_{post} > 40$ Hz, the distribution is well approximated by a gaussian.
Table 1: Parameter Values Used in the Spike-Driven Network Simulations.

Neural parameters:      λ = 10 Vθ/s;  θV = 0.8 Vθ
Teacher population:     Nex = 20;  νex = 50 Hz
Inhibitory population:  Ninh = 1000;  νinh = 50 f Hz;  Jinh = −0.035 Vθ
Input layer:            Ninput = 2000;  νstimulated = 50 Hz;  νunstimulated = 2 Hz
Calcium parameters:     τC = 60 ms;  θup^l = 3 JC;  θup^h = 13 JC;  θdown^l = 3 JC;  θdown^h = 4 JC
Synaptic parameters:    a = 0.1 Xmax;  b = 0.1 Xmax;  α = 3.5 Xmax/s;  β = 3.5 Xmax/s;  θX = 0.5 Xmax;  J+ = Jex;  J− = 0

Notes: The parameters Vθ, Xmax, and JC set the scale for the depolarization, the synaptic internal variable, and the calcium variable, respectively. All three are set equal to unity. The firing frequency of the inhibitory pool is proportional to the coding level f of the presented stimulus. The teacher population projects to the output neurons, which should be active in response to the stimulus.
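For readers who wish to experiment with the model, the tuned values of Table 1 can be collected into a small parameter structure. The following Python transcription is our own convenience sketch; the key names are our shorthand, not the paper's notation, and the units follow the notes above (Vθ = Xmax = JC = 1).

```python
# Tuned synaptic and calcium values from Table 1,
# in units where V_theta = X_max = J_C = 1.
PARAMS = {
    "theta_V": 0.8,           # voltage threshold for potentiating jumps (x V_theta)
    "tau_C": 0.060,           # calcium time constant (s)
    "J_C": 1.0,               # calcium jump per postsynaptic spike
    "theta_up_l": 3.0,   "theta_up_h": 13.0,   # LTP calcium window (x J_C)
    "theta_down_l": 3.0, "theta_down_h": 4.0,  # LTD calcium window (x J_C)
    "a": 0.1, "b": 0.1,       # jump sizes of X (x X_max)
    "alpha": 3.5, "beta": 3.5,  # drift rates of X (X_max per second)
    "theta_X": 0.5,           # bistability threshold on X (x X_max)
}
```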
The shaded bars in Figure 1 indicate the thresholds for upward and downward jumps in X(t) and correspond to the bars shown alongside C(t). The probability of an upward or downward jump is thus given by the product of the probabilities that C(tpre) and V(tpre) both fall within the defined ranges. Figure 2b shows the jump probabilities Pup and Pdown as a function of νpost.

4.2 Probability of Long-Term Modifications

In order to calculate the probability of long-term modification, we repeatedly simulate the time evolution of a single synapse for given pre- and postsynaptic rates. The stimulation period used is Tstim = 300 ms, and we ran Ntrials = 10^6 trials for each (νpre, νpost) pair to ensure good statistics. In order to calculate the probability of an LTP event, we initially set X(0) = 0 and simulated a realization of the time course X(t). If X(Tstim) > θX at the end of the stimulation period, we registered an LTP transition. An analogous procedure was followed to calculate the LTD transition probability, with the exception that the initial condition X(0) = 1 was used and a transition was registered if X(Tstim) < θX. The postsynaptic depolarization was generated by Poisson trains from additional excitatory and inhibitory populations. It is known that a given mean firing rate does not uniquely define the mean µ and variance σ² of the subthreshold depolarization, and although the sensitivity is not strong, the transition probabilities do depend on the particular path in (µ, σ²) space. We choose the linear path σ² = 0.015µ + 2, which yields the same statistics as the full network simulations considered later and thus ensures that the transition probabilities shown in Figure 2 provide an accurate guide. The linearity of the relation between σ² and µ comes from the fact that for a Poisson train of input spikes emitted at a rate ν, both σ² and µ are linear functions of ν, and the coefficients are known functions of the connectivity and the average synaptic weights (see Fusi, 2003).
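The dynamics of equations 3.2 to 3.4, together with the Monte Carlo protocol just described, can be condensed into a short simulation sketch. The Python code below is our own illustrative reconstruction, not the authors' original software; the helper postsyn_trace, which must supply postsynaptic spike times and matching depolarization samples V(t), is an assumption of this sketch, and the dictionary p refers to the transcription of Table 1 above.

```python
import numpy as np

def run_synapse(pre_times, post_times, V, dt, p, T=0.3, X0=0.0):
    """Integrate the calcium trace C(t) (eq. 3.2) and the synaptic
    internal variable X(t) (eqs. 3.3 and 3.4) over [0, T]."""
    n_steps = int(round(T / dt))
    pre = set(np.round(np.asarray(pre_times) / dt).astype(int))
    post = set(np.round(np.asarray(post_times) / dt).astype(int))
    C, X = 0.0, X0
    for k in range(n_steps):
        C -= dt * C / p["tau_C"]                 # exponential decay of calcium
        if k in post:
            C += p["J_C"]                        # jump J_C per postsynaptic spike
        jumped = False
        if k in pre:                             # presynaptic spike reads V and C
            if V[k] > p["theta_V"] and p["theta_up_l"] < C < p["theta_up_h"]:
                X += p["a"]; jumped = True
            elif V[k] <= p["theta_V"] and p["theta_down_l"] < C < p["theta_down_h"]:
                X -= p["b"]; jumped = True
        if not jumped:                           # drift toward a stable value (eq. 3.4)
            X += dt * (p["alpha"] if X > p["theta_X"] else -p["beta"])
        X = min(max(X, 0.0), 1.0)                # hold X inside [0, X_max]
    return X

def p_ltp(nu_pre, dt, p, postsyn_trace, trials=10000, T=0.3, seed=0):
    """Monte Carlo estimate of the LTP probability: start at X(0) = 0 and
    count runs that end above theta_X. The paper uses 10^6 trials; a smaller
    default is used here to keep the sketch fast."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(trials):
        n_pre = rng.poisson(nu_pre * T)
        pre = np.sort(rng.uniform(0.0, T, n_pre))   # Poisson presynaptic train
        post, V = postsyn_trace(T, dt, rng)         # assumed helper, see text
        if run_synapse(pre, post, V, dt, p, T=T, X0=0.0) > p["theta_X"]:
            hits += 1
    return hits / trials
```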
Figure 2: (a) Probability density function of the internal calcium variable for different values of νpost . The bars indicate regions for which upward and downward jumps in the synaptic internal variable, X, are allowed, labeled LTP and LTD, respectively. See equation 3.3. (b) Probability for an upward or downward jump in X as a function of postsynaptic frequency. (c, d) LTP and LTD transition probabilities, respectively, as a function of νpost for different values of νpre . The insets show the height of the peak in the transition probabilities as a function of the presynaptic frequency.
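The near-gaussian shape of the calcium density at high rates (Figure 2a) can also be checked analytically: for Poisson postsynaptic spiking at rate ν, equation 3.2 defines a shot-noise process with an exponential kernel, whose stationary mean and variance follow from Campbell's theorem as ⟨C⟩ = JC ν τC and Var(C) = JC² ν τC / 2. A gaussian estimate of the upward-jump probability of equation 3.3 then needs only the probability that V(tpre) exceeds θV, which we leave to the caller. This is our own sketch of the approximation, not a formula from the text.

```python
from math import erf, sqrt

def phi(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def p_up_gaussian(nu_post, p_V_above, p):
    """Gaussian estimate of the upward-jump probability of eq. 3.3,
    reasonable for nu_post above roughly 40 Hz (see Figure 2a).
    p_V_above = P(V(t_pre) > theta_V) must be supplied by the caller."""
    mu = p["J_C"] * nu_post * p["tau_C"]              # Campbell: mean of C
    sd = p["J_C"] * sqrt(nu_post * p["tau_C"] / 2.0)  # Campbell: std of C
    p_C = phi((p["theta_up_h"] - mu) / sd) - phi((p["theta_up_l"] - mu) / sd)
    return p_V_above * p_C
```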
In Figures 2c and 2d we show the transition probabilities as a function of νpost for different νpre values. There exists a strong correspondence between the jump probabilities and the synaptic transition probabilities of Figure 2; the consolidation mechanism yields a nonlinear relationship between the two sets of curves. The model synapse displays Hebbian behavior: LTP dominates at high νpost, and LTD dominates at low νpost when the presynaptic neuron is stimulated. When the presynaptic neuron is not stimulated, the transition probabilities become so small that the synapse remains unchanged. The decay of the transition probabilities at high and low values of νpost implements the stop-learning condition and is a consequence of the thresholds on C(t). The inset to Figure 2 shows the peak height of the LTP and LTD probabilities as a function of νpre. For νpre > 25 Hz, the maximal transition probability shows a linear dependence upon νpre.
Figure 3: A schematic of the network architecture for the special case of a data set consisting of two classes. The output units are grouped into two pools, selective to stimuli C1 and C2, respectively, and are connected to the input layer by plastic synapses. The output units receive additional inputs from teacher and inhibitory populations.
This important feature implies that the use of this synapse is not restricted to networks with binary inputs, as considered in this work; it would also prove useful in networks employing continuous-valued input frequencies. A useful memory device should be capable of maintaining the stored memories in the presence of spontaneous activity. In order to assess the stability of the memory to such activity, we have performed long simulation runs, averaging over 10^7 realizations, for νpre = 2 Hz and a range of νpost values in order to estimate the maximal LTP and LTD transition probability. We observed no synaptic transitions during any of the trials. This provides upper bounds of PLTP(νpre = 2 Hz) < 10^−7 and PLTD(νpre = 2 Hz) < 10^−7. With a stimulation time of 300 ms, this result implies at least 0.3 × 10^7 seconds (that is, 10^7 stimulations of 300 ms each) between synaptic transitions and provides a lower bound on the memory lifetime of approximately one month, under the assumption of 2 Hz spontaneous activity.

5 Network Performance

5.1 The Architecture of the Network

The network architecture we consider consists of a single feedforward layer of Ninp input neurons fully connected by plastic synapses to Nout output neurons. Neurons in the output layer have no lateral connections and are subdivided into pools of size Nout^class, each selective to a particular class of stimuli. In addition to the signal from the input layer, the output neurons receive additional signals from inhibitory and teacher populations. The inhibitory population provides a signal proportional to the coding level of the stimulus and serves to balance the excitation coming from the input layer (as required in the abstract rule; see section 2). A stimulus-dependent inhibitory signal is important, as it can compensate for large variations in the coding level of the stimuli. The teacher population is active during training and imposes the selectivity of the output pools with an additional excitatory signal. A schematic view of this network architecture is shown in Figure 3.
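To fix ideas about dimensions, the plastic connectivity just described amounts to an internal-state matrix X of shape (Nout, Ninput), with the effective efficacy of each synapse read off by thresholding X at θX. A minimal sketch, with placeholder efficacy values (the paper's J+ and J− are not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(1)
N_input = 2000                       # input units (Table 1)
N_class, N_out_per_class = 2, 20     # example values; N_out = N_class * N_out_per_class
X = rng.uniform(0.0, 1.0, size=(N_class * N_out_per_class, N_input))

def efficacies(X, theta_X=0.5, J_plus=1.0, J_minus=0.0):
    """Read out the binary efficacy of every synapse from its internal
    state: J+ above theta_X, J- at or below it. The numerical values of
    J_plus and J_minus here are placeholders, not the paper's values."""
    return np.where(X > theta_X, J_plus, J_minus)

def feedforward_drive(X, xi):
    """Total plastic-synapse drive to each output unit for a binary
    input pattern xi of length N_input."""
    return efficacies(X) @ xi
```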
Multiple arrays of random classifiers have been the subject of considerable interest in recent studies of machine learning and can achieve results for complex classification tasks far beyond those obtainable using a single classifier (Nout^class = 1) (see section 6 for more on this point). Following learning, the response of the output neurons to a given stimulus can be analyzed by selecting a single threshold frequency to determine which neurons are considered active (express a vote). The classification result is determined using a majority rule decision between the selective pools of output neurons (a minimal code sketch of this readout follows below). (For a biologically realistic model implementing a majority rule decision, see Wang, 2002.) We distinguish among three possibilities upon presentation of a pattern: (1) correctly classified—the output neurons of the correct selective pool express more votes than the other pools; (2) misclassified—an incorrect pool of output neurons wins the vote; and (3) nonclassified—no output neuron expresses a vote and the network remains silent. Nonclassifications are preferable to misclassifications, as a null response to a difficult stimulus retains the possibility that such cases can be sent to other networks for further processing. In most cases, the majority of the errors can be made nonclassifications with an appropriate choice of threshold. The fact that a single threshold can be used for neurons across all selective pools is a direct consequence of the stop-learning condition, which keeps the output response within bounds.

5.2 Illustration of Semisupervised Learning

To illustrate our approach to supervised learning, we apply our spike-driven network to a data set of 400 uncorrelated random patterns with low coding level (f = 0.05) divided equally into two classes. The initial synaptic matrix is set randomly, and an external teacher signal is applied that drives the (single) output neuron to a mean rate of 50 Hz on presentation of a stimulus of class 1 and to 6 Hz on presentation of a stimulus of class 0; the mean rates of 50 Hz and 6 Hz result from the combined input of teacher plus stimulus. The LTP and LTD transition probabilities shown in Figure 2 intersect at νpost ∼ 20 Hz. Stimuli that elicit an initial output response of more than 20 Hz on average exhibit a strengthening of this response during learning, whereas stimuli that elicit an initial response of less than 20 Hz show a weakening response during learning. The decay of the transition probabilities at both high and low νpost eventually arrests the drift in the output response and prevents overlearning of the stimuli. Patterns are presented in random order, and by definition, the presentation number increases by 1 for every 400 patterns presented to the network. In Figure 4 we show output frequency histograms across the data set as a function of presentation number. The output responses of the two classes become more strongly separated during learning and eventually begin to pile up around 90 Hz and 0 Hz due to the decay of the transition probabilities. Figure 5 shows the output frequency distributions across the data set in the absence of a teacher signal both before and after learning. Before learning, both classes have statistically identical distributions.
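Returning to the readout of section 5.1, the vote counting can be sketched in a few lines: each output unit votes if its rate exceeds the single threshold, pools are compared by majority rule, and the three outcomes above are distinguished. How a tied vote is broken is not specified in the text; treating ties as nonclassifications is our own choice here.

```python
import numpy as np

def classify(rates, n_per_class, vote_threshold):
    """Majority-rule readout. rates: array of output firing rates, grouped
    into contiguous pools of n_per_class units, one pool per class.
    Returns (predicted_class_or_None, outcome_string)."""
    votes = rates > vote_threshold                # which units express a vote
    per_class = votes.reshape(-1, n_per_class).sum(axis=1)
    if per_class.sum() == 0:
        return None, "nonclassified"              # network remains silent
    winners = np.flatnonzero(per_class == per_class.max())
    if winners.size > 1:                          # tie: treated here as no decision
        return None, "nonclassified"
    return int(winners[0]), "classified"

# Comparing the returned label with the true class then separates
# correct classifications from misclassifications.
```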
Figure 4: Frequency response histograms on presentation of the training set at different stages during learning for a simple test case consisting of a single output neuron. Throughout learning, the teacher signal is applied to enforce the correct response. The panels from back to front correspond to presentation numbers 1, 60, 120, 180, and 240, respectively. The solid and dashed curves correspond to the LTD and LTP transition probabilities, respectively, from Figure 2.

Figure 5: Frequency response histograms before and after learning. (a) Response to class 1 (filled circles) and class 0 (open squares) before learning without external teacher signal. (b) Response before learning with teacher signal. (c) Response after learning with teacher signal. (d) Situation after learning without teacher signal.
Following learning, members of class 0 have a reduced response, whereas members of class 1 have responses distributed around 45 Hz. Classification can then be made using a threshold on the output frequency. When considering more realistic stimuli with a higher level of correlation, it becomes more difficult to achieve a good separation of the classes, and there may exist considerable overlap between the final distributions. In such cases it becomes essential to use a network with multiple output units to correctly sample the statistics of the stimuli.

5.3 The Data Sets and the Classification Problem

Real-world stimuli typically have a complex statistical structure very different from the idealized case of random patterns often considered. Such stimuli are characterized by large variability within a given class and a high level of correlation between members of different classes. We consider two separate data sets. The first data set is a binary representation of 293 Latex characters preprocessed as in Amit and Mascaro (2001), and it presents in a simple way these generic features of complex stimuli. Each character class consists of 30 members generated by random distortion of the original character. Figure 6 shows the full set of characters. This data set has been previously studied in Amit and Geman (1997) and Amit and Mascaro (2001) and serves as a benchmark test for the performance of our network. In Amit and Mascaro (2001) a neural network was applied to the classification problem, and in Amit and Geman (1997), decision trees were employed. It should be noted that careful preprocessing of these characters is essential to obtain good classification results, and we study the same preprocessed data set used in Amit and Mascaro (2001). The feature space of each character is encoded as a 2000-element binary vector. The coding level f (the fraction of active neurons) is sparse but highly variable, spanning a range 0.01 < f < 0.04. On presentation of a character to the network, we assign one input unit per element, which is activated to fire at 50 Hz if the element is unity but remains at a spontaneous rate of 2 Hz if the element is zero. Due to random character deformations, there is a large variability within a given class, and despite the sparse nature of the stimuli, there exist large overlaps between different classes. The second data set we consider is the MNIST data set, a subset of the larger NIST handwritten characters data set. The data set consists of 10 classes (digits 0 → 9) on a grid of 28 × 28 pixels. (The MNIST data set is available from http://yann.lecun.com, which also lists a large number of classification results.) The MNIST characters provide a good benchmark for our network performance and have been used to test numerous classification algorithms. To input the data to the network, we construct a 784-element vector from the pixel map and assign a single input neuron to each element. As each pixel has a grayscale value, we normalize each element such that the largest element has value unity.
Figure 6: The full Latex data set containing 293 classes. (a) Percentage of nonclassified patterns in the training set as a function of the number of classes for different numbers of output units per class. Results are shown for 1(∗), 2, 5, 10, and 40(+) outputs per class using the abstract rule (points) and for 1, 2, and 5 outputs per class using the spike-driven network (squares, triangles, and circles, respectively). In all cases, the percentage of misclassified patterns is less than 0.1%. (b) Percentage of nonclassified patterns as a function of the number of output units per class for different numbers of classes (abstract rule). Note the logarithmic scale. (c) Percentage of nonclassified patterns as a function of number of classes for generalization on a test set (abstract rule). (d) Same as for c but showing percentage of misclassified patterns.
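The stimulus encodings of section 5.3 can be summarized in a few lines. Latex feature vectors are binary and map to 50 Hz (active) or 2 Hz (spontaneous); MNIST pixels are first normalized so that the largest value is one. The text does not spell out the map from the normalized gray level to a firing rate, so the linear interpolation used below is our assumption.

```python
import numpy as np

NU_STIM, NU_SPONT = 50.0, 2.0   # Hz (Table 1)

def encode_latex(binary_vec):
    """2000-element binary feature vector -> input firing rates (Hz):
    50 Hz where the element is 1, spontaneous 2 Hz where it is 0."""
    v = np.asarray(binary_vec, dtype=float)
    return np.where(v > 0, NU_STIM, NU_SPONT)

def encode_mnist(pixels):
    """784-element grayscale vector -> input firing rates (Hz).
    Normalization as in the text; the linear map from gray level to
    rate is our assumption, not stated in the original."""
    v = np.asarray(pixels, dtype=float)
    v = v / v.max()                          # largest element -> 1
    return NU_SPONT + (NU_STIM - NU_SPONT) * v
```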
From the full MNIST data set, we randomly select 20,000 examples for training and 10,000 examples for testing. The Latex data set is more convenient for investigating the trends in behavior due to the large number of classes available. The MNIST data set is complementary to this; although there are only 10 classes, the large number of examples available for both training and testing makes possible a more meaningful assessment of the asymptotic (large number of output neurons) performance of the network relative to existing results for this data set. For both data sets, the high level of correlation between members of different classes makes this a demanding classification task. We now consider the application of our network to these data sets. We first present classification results on a training set obtained from simulations employing the full spike-driven network. As such simulations are computationally demanding, we supplement the results obtained from the spike-driven network with simulations performed using the abstract learning rule in order to explore more fully the trends in performance. We then perform an analysis of the stability of the spike-driven network results with respect to parameter fluctuations, an issue of practical importance when considering hardware VLSI implementation. Finally, we consider the generalization ability of the network.

5.4 Spike-Driven Network Performance

The parameters entering the neural and synaptic dynamics as well as details of the inhibitory and teacher populations can be found in Table 1. In our spike-driven simulations, the inhibitory pool sends Poisson spike trains to all output neurons at a mean rate proportional to the coding level of the presented stimulus (note that each output neuron receives an independent realization at the same mean rate). The teacher population is purely excitatory and sends additional Poisson trains to the output neurons that should be selective to the presented stimuli. It now remains to set a value for the vote expression threshold. As a simple method to set this parameter, we performed preliminary simulations for the special case Nout^class = 1 and adjusted the output neuron threshold to obtain the best possible performance on the full data set. The optimal choice minimizes the percentage of nonclassified patterns without allowing significant misclassification errors. This value was then used in all subsequent simulations with Nout^class > 1. In Figure 6a we show the percentage of nonclassified patterns as a function of the number of classes for networks with different values of Nout^class. In order to suppress fluctuations, each data point is the average over several random subsets taken from the full data set. Given the simplicity of the network architecture, the performance on the training set is remarkable. For Nout^class = 1, the percentage of nonclassified patterns increases rapidly with the number of classes; however, as Nout^class is increased, the performance rapidly improves, and the network eventually enters a regime in which the percentage of nonclassified patterns remains
almost constant with increasing class number. In order to test the extent of this scaling, we performed a single simulation with 20 output units per class and the full 293-class data set. In this case we find 5.5% nonclassified (0.1% misclassified), confirming that the almost constant scaling continues across the entire data set. We can speculate that if the number of classes were further increased, the system would eventually enter a new regime in which the synaptic matrix becomes overloaded and errors increase more rapidly. The total error of 5.6% that we incur using the full data set with Nout^class = 20 should be contrasted with that of Amit and Mascaro (2001), who reported an error of 39.8% on the 293-class problem using a single network with 6000 output units, which is roughly equivalent to our network with Nout^class = 20. By sending all incorrectly classified patterns to subsequent networks for reanalysis ("boosting"; see section 6.6), Amit and Mascaro obtained 5.4% error on the training set using 15 boosting cycles. This compares favorably with our result. We emphasize that we use only a single network, and essentially all of our errors are nonclassifications. Figure 6b shows the percentage of nonclassified patterns as a function of the number of output units per class for fixed class number. In order to evaluate the performance for the MNIST data set, we retain all the parameter values used for the Latex experiments. Although it is likely that the results on the MNIST data set could be optimized using specially tuned parameters, the fact that the same parameter set works adequately for both MNIST and Latex cases is a tribute to the robustness of our network (see section 5.7 for more on this issue). We find that near-optimal performance on the training set is achieved for Nout^class = 15, for which we obtain 2.9% nonclassifications and 0.3% misclassifications.

5.5 Abstract Rule Performance

Due to the computational demands of simulating the full spike-driven network, we have performed complementary simulations using the abstract rule in equation 2.2 to provide a fuller picture of the behavior of our network. The parameters are chosen such that the percentage of nonclassified patterns for the full data set with Nout^class = 1 matches that obtained from the spike-driven network. In Figure 6a we show the percentage of nonclassified patterns as a function of the number of classes for different values of Nout^class. Although we have chosen the abstract rule parameters by matching only a single data point to the results from the spike-driven network, the level of agreement between the two approaches is excellent.

5.6 Generalization

We tested the generalization ability of the network by randomly selecting 20 patterns from each class for training and reserving the remaining 10 for testing. In Figures 6c and 6d, we show the percentage of mis- and nonclassified patterns in the test set as a function of the number of classes for different values of Nout^class. The monotonic increase in the percentage of nonclassified patterns in the test set is reminiscent
of the training set behavior but lies at a slightly higher value for a given number of classes. For large Nout^class, the percentage of nonclassifications exhibits the same slow increase with the number of classes as seen in the training set. Although the percentage of misclassifications increases more rapidly than the nonclassifications, it also displays a regime of slow increase for Nout^class > 20. When applied to the MNIST data set, the spiking network yields very reasonable generalization properties. We limit ourselves to a maximum value of Nout^class = 15 due to the heavy computational demand of simulating the spike-driven network. Using Nout^class = 15, we obtain 2.2% nonclassifications and 1.3% misclassifications. This compares favorably with existing results on this data set and clearly demonstrates the effectiveness of our spike-driven network. For comparison, k-nearest-neighbor classifiers typically yield a total error in the range 2% to 3%, depending on the specific implementation, with a best result of 0.63% error obtained using shape context matching (Belongie, Malik, & Puzicha, 2002). (For a large number of results relating to the MNIST data set, see http://www.lecun.com.) Convolutional nets yield a total error around 1%, depending on the implementation, with a best performance of 0.4% error using cross-entropy techniques (Simard, Steinkraus, & Platt, 2003). We have also investigated the effect of varying the parameter δ on the generalization performance using the abstract learning rule. Optimal performance on the training set is obtained for small values of δ, as this enables the network to make a more delicate distinction between highly correlated patterns. The price to be paid is a reduced generalization performance, which results from overlearning the training set. Conversely, a larger value of δ reduces performance on the training set but improves the generalization ability of the network. In the context of a recurrent attractor network, δ effectively controls the size of the basin of attraction. In general, an intermediate value of δ allows a compromise between accurate learning of the training set and reasonable generalization.

5.7 Stability with Respect to Parameter Variations

When considering hardware implementations, it is important to ensure that any proposed model is robust with respect to variations in the parameter values. A material device imposes numerous physical constraints, and so an essential prerequisite for hardware implementation is the absence of fine-tuning requirements. To test the stability of the spiking network, we investigate the change in classification performance with respect to perturbation of the parameter values for the special case of Nout^class = 20 with 50 classes. In order to identify the most sensitive parameters, we first consider the effect of independent variation. At each synapse, the parameter value is reselected from a gaussian distribution centered about the tuned value and with a standard deviation equal to 15% of that value. All other parameters are held at their tuned values. This approach approximates the natural variation that occurs in hardware components and can be expected to vary from synapse to synapse (Chicca, Indiveri, & Douglas, 2003).
Figure 7: Stability of the network with respect to variations in the key parameters; see equations 2.2, 3.1, and 3.2. The black bars indicate the change in the percentage of nonclassified patterns, and the dark gray bar indicates the change in the percentage of misclassified patterns on random perturbation of the parameter values. With no perturbation, the network yields 4% nonclassified (light gray bars) and less than 0.1% misclassified. Simultaneous perturbation of all parameters is marked {. . . } in the right-most column. The inset shows the change in the percentage of nonclassified (circles) and misclassified (squares) patterns as a function of the parameter noise level when all parameters are simultaneously perturbed.
In Figure 7 we report the effect of perturbing the key parameters on the percentage of non- and misclassified patterns. The most sensitive parameters are those related to jumps in the synaptic variable X(t), as these dominate the consolidation mechanism causing synaptic transitions. To mimic a true hardware situation more closely, we also consider simultaneous perturbation of all parameters. Although the performance degrades significantly, the overall performance drop is less than might be expected from the results of independent parameter variation; it appears that there are compensation effects within the network. The inset to Figure 7 shows how the network performance changes as a function of the parameter noise level (the standard deviation of the gaussian). For noise levels less than 10%, the performance is only weakly degraded.
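The perturbation test above amounts to resampling each parameter, per synapse, from a gaussian centered on its tuned value with a standard deviation proportional to that value. A sketch, reusing the PARAMS transcription of Table 1 given earlier and assuming a hypothetical run_network evaluation function:

```python
import numpy as np

def perturb(params, noise_level, shape, rng):
    """Resample every parameter per synapse: each tuned value v is
    replaced by a draw from a gaussian of mean v and standard deviation
    noise_level * |v|, as in the stability test above."""
    return {key: rng.normal(val, noise_level * abs(val), size=shape)
            for key, val in params.items()}

rng = np.random.default_rng(2)
# e.g., 40 output units x 2000 plastic synapses each, at the 15% noise level:
noisy_params = perturb(PARAMS, noise_level=0.15, shape=(40, 2000), rng=rng)
# errors = run_network(noisy_params)   # hypothetical evaluation, see text
```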
6 Discussion

Spike-driven synaptic dynamics can implement semisupervised learning. We showed that a simple network of integrate-and-fire neurons connected by bistable synapses can learn to classify complex patterns. To our knowledge, this is the first work in which a complex classification task is solved by a network of neurons connected by biologically plausible synapses whose dynamics is compatible with several experimental observations on long-term synaptic modifications. The network is able to acquire information from its experiences during the training phase and to preserve it against the passage of time and the presentation of a large number of other stimuli. The examples shown here are more than toy problems that merely illustrate the functioning of the network: our network can classify correctly thousands of stimuli, with performances that are better than those of more complex, multilayer traditional neural networks. The key to this success lies in the possibility of training different output units on different subsets or subsections of the stimuli (a boosting technique; Freund & Schapire, 1999). In particular, the use of random connectivity between input and output layers would allow each output neuron to sample a different subsection of every stimulus. Previous studies (Amit & Mascaro, 2001) employed deterministic synapses and a quenched random connectivity. Here we use full connectivity but with stochastic synapses to generate the different realizations. This yields a dynamic random connectivity that changes continuously in response to incoming stimuli. In order to gain some intuition into the network behavior, it is useful to consider an abstract space in which each pattern is represented by a point. The synaptic weights feeding into each output neuron define hyperplanes that divide the space of patterns. During learning, the hyperplanes follow stochastic trajectories through the space in order to separate (classify) the patterns. If at some point along the trajectory the plane separates the space such that the response at the corresponding output neuron is satisfactory, then the hyperplane does not move in response to that stimulus. With a large number of output neurons, the hyperplanes can create an intricate partitioning of the space.

6.1 Biological Relevance

6.1.1 Compatibility with the Observed Phenomenology

Although the synaptic dynamics was mostly motivated by the need to learn realistic stimuli, it is interesting to note that the resulting model remains consistent with experimental findings. In order to make contact with experiments on pairs of connected neurons, we consider the application of experimentally realistic stimulation protocols to the model synapse. A typical protocol for the induction of LTP or LTD is to pair pre- and postsynaptic spikes with a given phase relation.
Figure 8: Synaptic transition probabilities for (a) paired post-pre, (b) paired pre-post, and (c) uncorrelated stimulation as a function of νpost. The full curve is PLTP and the dashed curve PLTD. For the post-pre and pre-post protocols, the phase shift between pre- and postspikes is fixed at +6 ms and −6 ms, respectively. In all cases, νpre = νpost. (d) The tendency T (see equation 6.1) as a function of the phase between pre- and postsynaptic spikes for a mean frequency νpre = νpost = 12 Hz.
In the simulations presented in Figure 8, we impose a post-pre or pre-post pairing of spikes with a phase shift of +6 ms or −6 ms, respectively, and simulate the synaptic transition probabilities as a function of νpost = νpre over a stimulation period of 300 ms. The postsynaptic neuron is made to fire at the desired rate by application of a suitable teacher signal. When the prespike precedes the postspike, LTP dominates at all frequencies; however, when the prespike falls after the postspike, there exists a frequency range (5 Hz < νpost < 20 Hz) over which LTD dominates. This LTD region is primarily due to the voltage decrease following the emission of a postsynaptic spike. As the frequency increases, it becomes increasingly likely that the depolarization will recover from the postspike reset to attain a value larger than θV, thus favoring LTP. For completeness, we also present the LTP and LTD transition probabilities with uncorrelated pre- and postsynaptic spikes, maintaining the relation νpre = νpost. We also test the spike-timing dependence of the transition probabilities by fixing νpost = 12 Hz and varying the phase shift between pre- and postsynaptic
spikes. In Figure 8d, we plot the tendency (Fusi, 2003) as a function of the phase shift. The tendency T is defined as

T = PLTP / (PLTD + PLTP) − (1/2) max(PLTP, PLTD).
(6.1)
A positive value of T implies dominance of LTP, whereas a negative value implies dominance of LTD. Although the STDP time window is rather short (∼10 ms), the trend is consistent with the spike timing induction window observed in experiments.

6.1.2 Predictions

We already know that when pre- and postsynaptic frequencies are high enough, LTP is observed regardless of the detailed temporal statistics (Sjöström et al., 2001). As the frequencies of the pre- and postsynaptic neurons increase, we predict that the amount of LTP should progressively decrease until no long-term modification becomes possible. In the high-frequency regime, LTD can become more likely than LTP, provided that the total probability of a change decreases monotonically at a fast rate. A large LTD compensating LTP would also produce a small average modification, but it would actually lead to fast forgetting, since the synapses would be modified anyway. A real stop-learning condition is needed to achieve high classification performance. Although experimentalists have not systematically studied what happens in the high-frequency regime, there is preliminary evidence for a nonmonotonic LTP curve (Wang & Wagner, 1999). Other regulatory mechanisms that would stop learning as soon as the neuron responds correctly might also be possible. However, the mechanism we propose is probably the best candidate in cases in which a local regulatory system is required and the instructor is not "smart" enough.

6.2 Parameter Tuning

We showed that the network is robust to heterogeneities and to parameter variations. However, the implementation of the teacher requires the tuning of one global parameter controlling the strength of the teacher signal (essentially the ratio νex/νstimulated; see Table 1). This signal should be strong enough to steer the output activity in the desired direction, even in the presence of a noisy or contradicting input. Indeed, before learning, the synaptic input is likely to be uncorrelated with the novel patterns to be learned, and the total synaptic input alone would produce a rather disordered pattern of activity of the output neurons. The teacher signal should be strong enough to dominate over this noise. At the same time, it should not be so strong that it brings the synapse into the region of very slow synaptic changes (the region above ∼100 Hz in Figure 4). The balance between the teacher signal and the external input is the only parameter we need to tune to make the network operate in the proper regime. In the brain, we might hypothesize the existence of other mechanisms (e.g.,
homeostasis; Turrigiano & Nelson, 2000), which would automatically create the proper balance between the synaptic inputs of the teacher (typically a top-down signal) and the synaptic inputs of the sensory stimuli.

6.3 Learning Speed

Successful learning usually requires hundreds of presentations of each stimulus. Learning must be slow to guarantee an equal distribution of the synaptic resources among all the stimuli to be stored. Faster learning would dramatically shorten the memory lifetime, making it impossible to learn a large number of patterns. This limitation is a direct consequence of the boundedness of the synapses (Amit & Fusi, 1994; Fusi, 2002; Senn & Fusi, 2005a, 2005b) and can be overcome only by introducing other internal synaptic states that would correspond to different synaptic dynamics. In particular, it has been shown that a cascade of processes, each operating with increasingly small transition probabilities, can allow for a long memory lifetime without sacrificing the amount of information acquired at each presentation (Fusi, Drew, & Abbott, 2005). In these models, when the synapse is potentiated and the conditions for LTP are met, the synapse becomes progressively more resistant to depression. These models can be easily implemented by introducing multiple levels of bistable variables, each characterized by different dynamics and each one controlling the learning rate of the previous level. For example, the jumps a_k and b_k of level k might depend on the state of level k + 1. These models would not lead to better performance, but they would certainly guarantee a much faster convergence to a successful classification. They will be investigated in future work. Notice also that the learning rates can be easily controlled by the statistics of the pre- and postsynaptic spikes. So far we have considered a teacher signal that increases or decreases the firing rate of the output neurons. However, the higher-order statistics (e.g., the synchronization between pre- and postsynaptic neurons) can also change, and our synaptic dynamics would be rather sensitive to these alterations. Attention might modulate these statistics, and the learning rate would be immediately modified without changing any inherent parameter of the network. This idea has already been investigated in a simplified model in Chicca and Fusi (2001).

6.4 Applications

The synaptic dynamics we propose has been implemented in neuromorphic analog VLSI (very large scale integration) (Mitra, Fusi, & Indiveri, 2006; Badoni, Giulioni, Dante, & Del Giudice, 2006; Indiveri & Fusi, 2007). A similar model was introduced in Fusi et al. (2000), and the other components required to implement massively parallel networks of integrate-and-fire neurons and synapses with the new synaptic dynamics have been previously realized in VLSI (Indiveri, 2000, 2001, 2002; Chicca & Fusi, 2001). There are two main advantages of our approach. First, all the information is transmitted by events that are highly localized in time (the spikes): this leads to low power consumption and optimal communication
bandwidth usage (Douglas, Deiss, & Whatley, 1998; Boahen, 1998). Notice that each synapse, in order to be updated, requires knowledge of the time of occurrence of the presynaptic spikes, its internal state, and other dynamic variables of the postsynaptic neuron (C and V). Hence each synapse needs to be physically connected to the postsynaptic neuron, whereas the spikes from the presynaptic cell can come from other off-chip sources. Second, the memory is preserved by the bistability; in particular, it is retained indefinitely if no other presynaptic spikes arrive. All the hardware implementations require negligible power consumption to stay in one of the two stable states (Fusi et al., 2000; Indiveri, 2000, 2001, 2002; Mitra et al., 2006; Badoni, Giulioni, Dante, & Del Giudice, 2006), and the bistable circuitry does not require nonstandard technology or high voltage, as floating gates do (Diorio, Hasler, Minch, & Mead, 1996). Given the possibility of implementing the synapse in VLSI and the good performance obtained on the Latex data set, we believe that this synapse is a perfect candidate for low-power, compact devices with on-chip autonomous learning. Applications to the classification of real-world auditory stimuli (e.g., spoken digits) are presented in Coath, Brader, Fusi, and Denham (2005), who show that classification performance on the letters of the alphabet is comparable to that of humans. In these applications, the input vectors are fully analog (i.e., sets of mean firing rates ranging in a given interval), showing the capability of our network to encode these kinds of patterns as well.

6.5 Multiple Layer and Recurrent Networks

The studied architecture is a single-layer network. Our output neurons are essentially perceptrons, which resemble the cerebellar Purkinje cells, as already proposed by Marr (1969) and Albus (1971) (see Brunel, Hakim, Isope, Nadal, & Barbour, 2004, for a recent work). Not only are they the constituents of a single-layer network with a large number of inputs, but they also receive a teacher signal similar to the one in our model. However, it is natural to ask whether our synaptic dynamics can also be applied to a more complex multilayer network. If the instructor acts on all the layers, then it is likely that the same synaptic dynamics can be adopted in the multilayer case. Otherwise, if the instructor provides a bias only to the final layer, then it is unclear whether learning would converge and whether the additional synapses of the intermediate layers can be exploited to improve the performance. The absence of a theory that guarantees convergence does not necessarily imply that the learning rule would fail. Although simple counterexamples can probably be constructed for networks of a few units, it is difficult to predict what happens in more general cases. More interesting is the case of recurrent networks, in which the same neuron can be regarded as both an input and an output unit. When learning converges, each pattern imposed by the instructor (which might be the sensory stimulus) becomes a fixed point of the network dynamics.
Given the capability of our network to generalize, it is very likely that the steady states are also stable attractors. Networks in which not all the output neurons of each class are activated would probably lead to attractors that are sparser than the sensory representations. This behavior might be an explanation of the small number of cells involved in attractors in inferotemporal cortex (see Giudice et al., 2003).

6.6 Stochastic Learning and Boosting

Boosting is an effective strategy to greatly improve classification performance by considering the "opinions" of many weak classifiers. Most of the algorithms implementing boosting start by deriving simple rules of thumb for classifying the full set of stimuli. A second classifier then concentrates more on those stimuli that were most often misclassified by the previous rules of thumb. This procedure is repeated many times, and in the end, a single classification rule is obtained by combining the opinions of all the classifiers, with each opinion weighted by a quantity that depends on the classification performance on the training set. In our case, we have two of the ingredients necessary to implement boosting. First, each output unit can be regarded as a weak classifier that concentrates on a subset of stimuli. When the stimuli are presented, only a small fraction of randomly selected synapses changes: our stimuli activate only tens of neurons, and the transition probabilities are small (order of 10^−2). The consequence is that some stimuli are actually ignored because no synaptic change is consolidated. Second, each classifier concentrates on the hardest stimuli. Indeed, only the stimuli that are misclassified induce significant changes in the synaptic structure; for the others, the transition probabilities are much smaller, and again, it is as if the stimulus were ignored. In our model, each classifier does not know how the others are performing and which stimuli are misclassified by the other output units, so the classification performance cannot be as good as in the case of true boosting. However, for difficult tasks, each output neuron continuously changes its rules of thumb as the weights are stochastically updated. In general, each output neuron feels the stop-learning condition for a different weight configuration, and the aggregation of these different views often yields correct classification of the stimuli. In fact, our performance is very similar to that of Amit and Mascaro (2001) when they use a boosting technique. Notice that two factors play a fundamental role: stochastic learning and a local stop-learning criterion. Indeed, each output unit should stop learning when its own output matches the one desired by the instructor, not when the stimulus is correctly classified by the majority rule; otherwise, the output units cannot concentrate on different subsets of misclassified patterns.

Acknowledgments

We thank Massimo Mascaro for many inspiring discussions and for providing the preprocessed data set of Amit and Mascaro (2001). We are grateful
to Giancarlo La Camera for careful reading of the manuscript. Giacomo Indiveri greatly contributed with many discussions in constraining the model in such a way that it could be implemented in neuromorphic hardware. Harel Shouval drew our attention to Wang and Wagner (1999). This work was supported by the EU grant ALAVLSI and partially by SNF grant PP0A-106556.

References

Abarbanel, H. D. I., Huerta, R., & Rabinovich, M. I. (2002). Dynamical model of long-term synaptic plasticity. PNAS, 99, 10132–10137.
Abbott, L. F., & Nelson, S. B. (2000). Synaptic plasticity: Taming the beast. Nature Neuroscience, 3, 1178–1183.
Adini, Y., Sagi, D., & Tsodyks, M. (2002). Context enabled learning in human visual system. Nature, 415, 790–794.
Albus, J. (1971). A theory of cerebellar function. Math. Biosci., 10, 26–51.
Amit, D. (1989). Modeling brain function. Cambridge: Cambridge University Press.
Amit, D. J., & Fusi, S. (1992). Constraints on learning in dynamic synapses. Network, 3, 443.
Amit, D., & Fusi, S. (1994). Learning in neural networks with material synapses. Neural Comput., 6(5), 957–982.
Amit, Y., & Geman, D. (1997). Shape quantization and recognition with randomized trees. Neural Computation, 9, 1545–1588.
Amit, Y., & Mascaro, M. (2001). Attractor networks for shape recognition. Neural Computation, 13, 1415–1442.
Amit, D., & Mongillo, G. (2003). Spike-driven synaptic dynamics generating working memory states. Neural Computation, 15, 565–596.
Badoni, D., Giulioni, M., Dante, V., & Del Giudice, P. (2006). A VLSI recurrent network of spiking neurons with reconfigurable and plastic synapses. In IEEE International Symposium on Circuits and Systems ISCAS06 (pp. 1227–1230). Piscataway, NJ: IEEE.
Belongie, S., Malik, J., & Puzicha, J. (2002). Shape matching and object recognition using shape context. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(4), 509.
Block, H. (1962). The perceptron: A model for brain functioning. I. Reviews of Modern Physics, 34, 123–135.
Boahen, K. (1998). Communicating neuronal ensembles between neuromorphic chips. In S. Smith & A. Hamilton (Eds.), Neuromorphic systems engineering. Norwell, MA: Kluwer.
Brunel, N., Carusi, F., & Fusi, S. (1998). Slow stochastic Hebbian learning of classes of stimuli in a recurrent neural network. Network, 9(1), 123–152.
Brunel, N., Hakim, V., Isope, P., Nadal, J.-P., & Barbour, B. (2004). Optimal information storage and the distribution of synaptic weights: Perceptron versus Purkinje cells. Neuron, 43, 745–757.
Buchs, N., & Senn, W. (2002). Spike-based synaptic plasticity and the emergence of direction selective simple cells: Simulation results. Journal of Computational Neuroscience, 13, 167–186.
Burkitt, A. N., Meffin, H., & Grayden, D. B. (2004). Spike timing-dependent plasticity: The relationship to rate-based learning for models with weight dynamics determined by a stable fixed-point. Neural Computation, 16, 885–940.
Chicca, E., & Fusi, S. (2001). Stochastic synaptic plasticity in deterministic aVLSI networks of spiking neurons. In F. Rattay (Ed.), Proceedings of the World Congress on Neuroinformatics (pp. 468–477). Vienna: ARGESIM/ASIM Verlag.
Chicca, E., Indiveri, G., & Douglas, R. (2003). An adaptive silicon synapse. In IEEE International Symposium on Circuits and Systems. Piscataway, NJ: IEEE Press.
Coath, M., Brader, J., Fusi, S., & Denham, S. (2005). Multiple views of the response of an ensemble of spectro-temporal features support concurrent classification of utterance, prosody, sex and speaker identity. Network: Computation in Neural Systems, 16, 285–300.
Diorio, C., Hasler, P., Minch, B., & Mead, C. (1996). A single-transistor silicon synapse. IEEE Transactions on Electron Devices, 43, 1972–1980.
Douglas, R., Deiss, S., & Whatley, A. (1998). A pulse-coded communications infrastructure for neuromorphic systems. In W. Maass & C. Bishop (Eds.), Pulsed neural networks (pp. 157–178). Cambridge, MA: MIT Press.
Feldman, D. (2000). Timing-based LTP and LTD at vertical inputs to layer II/III pyramidal cells in rat barrel cortex. Neuron, 27, 45–56.
Freund, Y., & Schapire, R. (1999). A short introduction to boosting. Journal of the Japanese Society for Artificial Intelligence, 14, 771–780.
Froemke, R., & Dan, Y. (2002). Spike-timing-dependent synaptic modification induced by natural spike trains. Nature, 416, 433–438.
Fusi, S. (2001). Long-term memory: Encoding and storing strategies of the brain. In J. M. Bower (Ed.), Neurocomputing: Computational neuroscience: Trends in research 2001 (Vol. 38–40, pp. 1223–1228). Amsterdam: Elsevier Science.
Fusi, S. (2002). Hebbian spike-driven synaptic plasticity for learning patterns of mean firing rates. Biol. Cybern., 87, 459.
Fusi, S. (2003). Spike-driven synaptic plasticity for learning correlated patterns of mean firing rates. Rev. Neurosci., 14, 73–84.
Fusi, S., & Abbott, L. (2007). Limits on the memory-storage capacity of bounded synapses. Nature Neuroscience, 10, 485–493.
Fusi, S., Annunziato, M., Badoni, D., Salamon, A., & Amit, D. (2000). Spike-driven synaptic plasticity: Theory, simulation, VLSI implementation. Neural Computation, 12, 2227–2258.
Fusi, S., Drew, P., & Abbott, L. (2005). Cascade models of synaptically stored memories. Neuron, 45, 599–611.
Fusi, S., & Mattia, M. (1999). Collective behavior of networks with linear (VLSI) integrate-and-fire neurons. Neural Comput., 11(3), 633–653.
Fusi, S., & Senn, W. (2006). Eluding oblivion with smart stochastic selection of synaptic updates. Chaos, 16, 026112.
Gerstner, W., Kempter, R., van Hemmen, J., & Wagner, H. (1996). A neuronal learning rule for sub-millisecond temporal coding. Nature, 383, 76–78.
Giudice, P. D., Fusi, S., & Mattia, M. (2003). Modeling the formation of working memory with networks of integrate-and-fire neurons connected by plastic synapses. Journal of Physiology, Paris, 97, 659–681.
Giudice, P. D., & Mattia, M. (2001). Long and short-term synaptic plasticity and the formation of working memory: A case study. Neurocomputing, 38–40, 1175–1180.
Gütig, R., & Sompolinsky, H. (2006). The tempotron: A neuron that learns spike timing-based decisions. Nat. Neurosci., 9(3), 420–428.
Hertz, J., Krogh, A., & Palmer, R. (1991). Introduction to the theory of neural computation. Reading, MA: Addison-Wesley.
Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci. USA, 79, 2554.
Hopfield, J., & Brody, C. D. (2004). Learning rules and network repair in spike-timing-based computation networks. PNAS, 101, 337–342.
Indiveri, G. (2000). Modeling selective attention using a neuromorphic aVLSI device. Neural Comput., 12, 2857–2880.
Indiveri, G. (2001). A neuromorphic VLSI device for implementing 2D selective attention systems. IEEE Transactions on Neural Networks, 12, 1455–1463.
Indiveri, G. (2002). Neuromorphic bistable VLSI synapses with spike-timing-dependent plasticity. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15. Cambridge, MA: MIT Press.
Indiveri, G., & Fusi, S. (2007). Spike-based learning in VLSI networks of spiking neurons. In Proceedings of the IEEE International Symposium on Circuits and Systems (pp. 3371–3374). Piscataway, NJ: IEEE Press.
Karmarkar, U., & Buonomano, D. (2002). A model of spike-timing dependent plasticity: One or two coincidence detectors? J. Neurophysiology, 88, 507–513.
Kempter, R., Gerstner, W., & van Hemmen, J. (2001). Intrinsic stabilization of output firing rates by spike-based Hebbian learning. Neural Computation, 13, 2709–2741.
La Camera, G., Rauch, A., Lüscher, H.-R., Senn, W., & Fusi, S. (2004). Minimal models of adapted neuronal response to in vivo–like input currents. Neural Comput., 16, 2101–2124.
Legenstein, R., Naeger, C., & Maass, W. (2005). What can a neuron learn with spike-timing-dependent plasticity? Neural Comput., 17(11), 2337–2382.
Markram, H., Lübke, J., Frotscher, M., & Sakmann, B. (1997). Regulation of synaptic efficacy by coincidence of postsynaptic APs and EPSPs. Science, 275, 213–215.
Marr, D. (1969). A theory of cerebellar cortex. J. Physiol., 202, 435–470.
Minsky, M. L., & Papert, S. A. (1969). Perceptrons. Cambridge, MA: MIT Press.
Mitra, S., Fusi, S., & Indiveri, G. (2006). A VLSI spike-driven dynamic synapse which learns only when necessary. In IEEE International Symposium on Circuits and Systems ISCAS06 (pp. 2777–2780). Piscataway, NJ: IEEE Press.
Mongillo, G., Curti, E., Romani, S., & Amit, D. (2005). Learning in realistic networks of spiking neurons and spike-driven plastic synapses. European Journal of Neuroscience, 21, 3143–3160.
Nadal, J. P., Toulouse, G., Changeux, J. P., & Dehaene, S. (1986). Networks of formal neurons and memory palimpsests. Europhys. Lett., 1, 535.
Parisi, G. (1986). A memory which forgets. J. Phys. A: Math. Gen., 19, L617.
Petersen, C. C. H., Malenka, R. C., Nicoll, R. A., & Hopfield, J. J. (1998). All-or-none potentiation at CA3-CA1 synapses. Proc. Natl. Acad. Sci. USA, 95, 4732–4737.
Rao, R. P. N., & Sejnowski, T. J. (2001). Spike-timing-dependent Hebbian plasticity as temporal difference learning. Neural Computation, 13, 2221–2237.
Rauch, A., La Camera, G., Lüscher, H.-R., Senn, W., & Fusi, S. (2003). Neocortical pyramidal cells respond as integrate-and-fire neurons to in vivo–like input currents. J. Neurophysiology, 90, 1598–1612.
Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65, 386–408.
Rubin, J., Lee, D., & Sompolinsky, H. (2001). The equilibrium properties of temporally asymmetric Hebbian plasticity. Physical Review Letters, 86, 364–367.
Saudargiene, A., Porr, B., & Wörgötter, F. (2004). How the shape of pre- and postsynaptic signals can influence STDP: A biophysical model. Neural Computation, 16, 595–625.
Senn, W., & Buchs, N. (2003). Spike-based synaptic plasticity and the emergence of direction selective simple cells: Mathematical analysis. Journal of Computational Neuroscience, 14, 119–138.
Senn, W., & Fusi, S. (2005a). Convergence of stochastic learning in perceptrons with binary synapses. Phys. Rev. E, 71, 061907.
Senn, W., & Fusi, S. (2005b). Learning only when necessary: Better memories of correlated patterns in networks with bounded synapses. Neural Computation, 17, 2106–2138.
Senn, W., Markram, H., & Tsodyks, M. (2001). An algorithm for modifying neurotransmitter release probability based on pre- and postsynaptic spike timing. Neural Comput., 13(1), 35–67.
Shouval, H., & Kalantzis, G. (2005). Stochastic properties of synaptic transmission affect the shape of spike time dependent plasticity curves. J. Neurophysiology, 93, 643–655.
Shouval, H. Z., Bear, M. F., & Cooper, L. N. (2002). A unified model of NMDA receptor-dependent bidirectional synaptic plasticity. PNAS, 99, 10831–10836.
Simard, P., Steinkraus, D., & Platt, J. (2003). Best practice for convolutional neural networks applied to visual document analysis. In Proceedings of the Seventh International Conference on Document Analysis and Recognition (p. 958). Washington, DC: IEEE Computer Society.
Sjöström, P. J., Turrigiano, G. G., & Nelson, S. B. (2001). Rate, timing and cooperativity jointly determine cortical synaptic plasticity. Neuron, 32, 1149–1164.
Tsodyks, M. (1990). Associative memory in neural networks with binary synapses. Mod. Phys. Lett., B4, 713–716.
Turrigiano, G., & Nelson, S. (2000). Hebb and homeostasis in neuronal plasticity. Curr. Opin. Neurobiol., 10, 358–364.
Wang, H., & Wagner, J. (1999). Priming-induced shift in synaptic plasticity in the rat hippocampus. J. Neurophysiol., 82, 2024–2028.
Wang, S., O'Connor, D., & Wittenberg, G. (2004). Steplike unitary events underlying bidirectional hippocampal synaptic plasticity. Society for Neuroscience, p. 57.6.
Wang, X.-J. (2002). Probabilistic decision making by slow reverberation in cortical circuits. Neuron, 36, 955–968.
Yao, H., Shen, Y., & Dan, Y. (2004). Intracortical mechanism of stimulus-timing-dependent plasticity in visual cortical orientation tuning. PNAS, 101, 5081–5086.
Zhang, L. I., Tao, H. W., Holt, C. E., Harris, W. A., & Poo, M. (1998). A critical window for cooperation and competition among developing retinotectal synapses. Nature, 395, 37–44.
Zhou, Q., Tao, H., & Poo, M. (2003). Reversal and stabilization of synaptic modifications in a developing visual system. Science, 300, 1953.
Received March 23, 2005; accepted March 19, 2007.
LETTER
Communicated by Jonathan Victor
Tight Data-Robust Bounds to Mutual Information Combining Shuffling and Model Selection Techniques

M. A. Montemurro
[email protected]
R. Senatore [email protected]
S. Panzeri [email protected] University of Manchester, Faculty of Life Sciences, Manchester M60 1QD, U.K.
The estimation of the information carried by spike times is crucial for a quantitative understanding of brain function, but it is difficult because of an upward bias due to limited experimental sampling. We present new progress, based on two basic insights, on reducing the bias problem. First, we show that by means of a careful application of data-shuffling techniques, it is possible to cancel almost entirely the bias of the noise entropy, the most biased part of the information. This procedure provides a new information estimator that is much less biased than the standard direct one and has similar variance. Second, we use a nonparametric test to determine whether all the information encoded by the spike train can be decoded assuming a low-dimensional response model. If this is the case, the complexity of the response space can be fully captured by a small number of easily sampled parameters. Combining these two different procedures, we obtain a new class of precise estimators of information quantities, which can provide data-robust upper and lower bounds to the mutual information. These bounds are tight even when the number of trials per stimulus available is one order of magnitude smaller than the number of possible responses. The effectiveness and the usefulness of the methods are tested through applications to simulated data and recordings from somatosensory cortex. This application shows that even in the presence of strong correlations, our methods constrain precisely the amount of information encoded by real spike trains recorded in vivo.

1 Introduction

A recent fundamental insight from system neuroscience is that information about the external sensory world is often encoded by the precise timing of action potentials (spikes). The timing of individual spikes has been shown to encode precisely and reliably the occurrence of certain stimulus features (Rieke, Warland, de Ruyter van Steveninck, & Bialek, 1996;

Neural Computation 19, 2913–2957 (2007)
de Ruyter van Steveninck, Lewen, Strong, Koberle, & Bialek, 1997; Buracas, Zador, DeWeese, & Albright, 1998; Panzeri, Petersen, Schultz, Lebedev, & Diamond, 2001; DeWeese, Wehr, & Zador, 2003; Arabzadeh, Zorzin, & Diamond, 2005). However, one question still unanswered is how individual spikes, from either the same or a different neuron, combine together to give rise to perception. One fundamental observation is that spike times are correlated: for example, nearby cortical cells tend to fire in synchrony more than expected by chance. The presence of correlations has suggested that they are a fundamental ingredient of the neural code. The computational advantage of such a representation may be that correlations add an information channel that can be used either to represent more sensory and behavioral features (Abeles, Bergman, Margalit, & Vaadia, 1993; Dan, Alonso, Usrey, & Reid, 1998) or to bind together groups of features (Gray, König, Engel, & Singer, 1989; von der Malsburg, 1999). However, whether correlations are a crucial part of the neural code is still highly controversial (Shadlen & Movshon, 1999).

A principled and rigorous way to address how the messages carried by individual spike times are integrated together is to use information theory to quantify and compare the different ways and timescales at which spike times may convey information (Rieke et al., 1996; Borst & Theunissen, 1999; Dimitrov & Miller, 2001; Panzeri et al., 2001). The use of information theory allows an estimate of how reliably stimuli are encoded in single trials and which features of the neuronal response, such as independent spikes or the correlations, contribute to stimulus discriminability.

A problem with this approach is that quantifying reliably the information conveyed by spike timing often requires the collection of impractically large samples of data. This is mainly due to correlations: if correlations did not exist, then the statistics of spike times would be completely characterized by the time-dependent firing rate of each neuron. Since correlations do exist, however, one also needs to measure them among all possible groups of spikes. A complete characterization of these correlations requires a number of parameters that are difficult to sample with realistic amounts of neuronal data. Thus, spike timing information measures suffer from a significant sampling bias problem (Panzeri & Treves, 1996).

Generalizing previous work of Reich and colleagues (Reich, Mechler, Purpura, & Victor, 2000), we have recently proposed an approach to alleviate the sampling bias problem by developing data-robust lower bounds to the spike timing information that neglect long-lag stimulus modulations of correlations (Pola, Petersen, Thiele, Young, & Panzeri, 2005). These bounds can establish whether there is information conveyed in spike times above and beyond that conveyed by spike counts. However, these methods cannot be used to test the importance of correlations in coding because they explicitly neglect a potentially important part of the correlation structure. In this letter, we overcome this limitation by presenting several methodological advances that lead to a radical improvement of the sampling
properties of information measures and provide very tight and data-robust upper and lower bounds to spike timing information. These advances also permit constraining precisely the role of correlations in decoding. Overall, this progress provides the basis for a better determination of the role of correlations in information transmission and at the same time significantly expands the domain of applicability of information-theoretic techniques to the analysis of neural signals.

The letter is organized as follows. We first review basic concepts of information theory applications to spike trains; we then discuss how to quantify the importance of correlations in decoding; we next address the sampling properties of these information quantities and provide bounds that are biased either upward or downward; and we discuss how to use model selection techniques to give virtually unbiased and tight estimations of information. Finally, we apply the new techniques to real neuronal spike trains recorded from rat somatosensory cortex.

2 The Information Carried by Neuronal Population Responses

We consider a time period of duration T, associated with a dynamic or static sensory stimulus s (chosen with probability P(s) from a stimulus set S with S elements), during which the activity of one neuron is observed.1 We assume that the spike arrival times are binned with a timing precision Δt and transformed into a sequence of spike counts in each time bin. L denotes the number of time bins (i.e., T = LΔt). The neuronal response is denoted by a one-dimensional array r = {r(1), r(2), . . . , r(L)}, where r(t) is the number of spikes emitted by the neuron in the t-th time bin. The maximum number of spikes that can be observed in a single time bin in any trial is denoted by M. (If Δt is very short, M is 1 and r(t) is binary.) We indicate the response space by R (R contains (M + 1)^L elements). Following Shannon (1948), we write the mutual information I(R;S) (often abbreviated as I in the following) transmitted by the population response about the whole set of stimuli as

$$I(R;S) = H(R) - H(R|S), \tag{2.1}$$

where H(R) and H(R|S) are the response entropy (stimulus-unconditional) and the noise entropy (stimulus-conditional), respectively. They are defined (Cover & Thomas, 1991) as

$$H(R) = -\sum_{\mathbf{r} \in R} P(\mathbf{r}) \log_2 P(\mathbf{r}), \tag{2.2}$$

$$H(R|S) = -\sum_{s \in S} P(s) \sum_{\mathbf{r} \in R} P(\mathbf{r}|s) \log_2 P(\mathbf{r}|s). \tag{2.3}$$

1 In this letter, we consider one neuron only in order to keep notation simple. However, the generalization to neuronal populations is relatively straightforward and does not present conceptual difficulties. The main step in generalizing to populations is a slight change needed in the definitions of the Markov models described below; see Pola et al. (2005) for an example of how to carry out this generalization.
The response entropy quantifies how neuronal responses vary with the stimulus and thus sets the capacity of the spike train to convey information. The noise entropy quantifies the irreproducibility of the neuronal responses at fixed stimulus. Thus, mutual information quantifies how much of the information capacity provided by stimulus-evoked differences in neural activity is robust to the presence of trial-by-trial response variability (de Ruyter van Steveninck et al., 1997). In equations 2.2 and 2.3, the summation over r is over all possible neuronal responses. The summation over s is over all possible stimuli. P(r|s) is the probability of observing a particular response r conditional on stimulus s. Experimentally, P(r|s) is determined by repeating each stimulus on many trials while recording the neuronal responses. The probability P(s) is usually chosen by the experimenter. P(r) = ⟨P(r|s)⟩_s is its average across all stimuli (the angular brackets indicate an average over stimuli, ⟨F(s)⟩_s ≡ Σ_{s∈S} P(s)F(s)). We assume that there are enough stimuli in the presented set so that P(r) (which is computed across all trials to all stimuli) is better sampled than P(r|s). (In practice, this amounts to the requirement that more than a handful of stimuli are presented.)

Estimating the information carried by spike times of real neuronal populations is difficult because each stimulus-response probability has to be measured from a limited number of data. The statistical errors in estimating the response probabilities lead to a downward systematic error (bias) in both the noise and the response entropy (Miller, 1955). H(R) depends only on P(r), which is sampled across all trials to all stimuli. Under our assumptions, its bias is much smaller than that of H(R|S), which depends on P(r|s). This results in an overall upward bias when estimating mutual information (Panzeri & Treves, 1996). This makes it difficult to estimate the information directly from equation 2.1, especially for long time windows or precise spike time discretizations (large L).
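For readers who wish to experiment with these definitions, the following minimal numpy sketch (not part of the original letter; all function and variable names are illustrative) computes plug-in estimates of H(R), H(R|S), and I(R;S) of equations 2.1 to 2.3 from binned responses, assuming equiprobable stimuli and equal numbers of trials per stimulus:

```python
import numpy as np

def plugin_information(responses):
    """Plug-in estimates of H(R), H(R|S), and I(R;S) (equations 2.1-2.3).

    `responses` maps each stimulus s to an array of shape (trials, L) of
    binned spike counts; stimuli are assumed equiprobable here.
    """
    def entropy(words):
        # Empirical probability of each distinct response word r.
        _, counts = np.unique(words, axis=0, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    H_R = entropy(np.vstack(list(responses.values())))          # H(R)
    H_RS = np.mean([entropy(w) for w in responses.values()])    # H(R|S), P(s) = 1/S
    return H_R, H_RS, H_R - H_RS                                # I = H(R) - H(R|S)

# Toy example: two stimuli, 100 trials each, L = 4 binary bins.
rng = np.random.default_rng(0)
resp = {s: (rng.random((100, 4)) < 0.2 + 0.3 * s).astype(int) for s in range(2)}
print(plugin_information(resp))
```

Because these are plug-in estimates, they inherit the upward bias of I discussed in the preceding paragraph.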
3 Simplified Models of Correlation

Having defined the information that neuronal responses transmit about sensory stimuli, we consider how correlations in the responses affect information transmission. The first step is to define precisely what we mean by correlations. In this letter, when we say that the spike trains are correlated, we mean that, for
some stimulus s, the “true” stimulus-response probability P(r|s) is different from the probability P_ind(r|s) obtained if spikes were independent at fixed stimulus. By definition, the independent probability model P_ind(r|s) is the product of the stimulus-conditional marginal probabilities P(r(t)|s) of the responses in each time bin t:

$$P_{\text{ind}}(\mathbf{r}|s) = \prod_{t=1}^{L} P(r(t)|s). \tag{3.1}$$
Thus, when we refer to correlations, we mean correlations at fixed stimulus. These correlations are usually called noise correlations (Gawne & Richmond, 1993; Nirenberg & Latham, 2003; Pola, Thiele, Hoffmann, & Panzeri, 2003). For brevity, in the rest of this letter, when we use the term correlation, we mean “noise correlation.”

After correlations have been defined, the next step is to characterize how they affect information transmission. Correlations can affect neural information transmission in different ways in terms of both encoding and downstream decoding of neuronal messages. Here, following previous work (Latham & Nirenberg, 2005; Nirenberg & Latham, 2003), we specifically focus on whether correlations must be taken into account to decode the neuronal response. We consider a downstream neural system that bases its decoding decisions on the assumption that the spikes are generated by a simplified response model P_simp(r|s), which neglects certain aspects of the spike train correlation structure (e.g., it considers only correlations between spikes close together in time).2 We ask how much information is lost because the decoding operation is performed assuming that responses r are generated with P_simp(r|s) rather than with P(r|s). The choice of the mathematical form of P_simp(r|s) will depend on the question that the experimenter wants to address about correlations. If, for example, one is interested in whether correlations of any form are important for decoding, then one considers how much information is lost when the independent model P_ind(r|s) is used for decoding. If instead one is interested in the more specific question of whether correlations within a specified time range are important for decoding, then one considers how much information would be lost when using simplified response models P_simp(r|s) that neglect correlations at timescales outside the specified range. When considering neuronal population recordings, a similar strategy could be used to study the spatial scale at which correlations influence information transmission. In this case, P_simp(r|s) will take into account only correlations among a specific subset of neurons.
2 For example, the downstream system may decode the stimulus using, via Bayes' rule, a posterior probability based on the simplified model: P_simp(s|r) = P(s) P_simp(r|s)/P_simp(r).
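As an illustration of the simplest choice, P_simp = P_ind, here is a minimal numpy sketch (not from the original letter; names are illustrative) that builds the independent model of equation 3.1 from binary trials for one stimulus:

```python
import numpy as np

def independent_model(trials):
    """P_ind(r|s) of equation 3.1 for one stimulus, from binary trials of
    shape (N_s, L): the product of the single-bin marginal probabilities."""
    p_spike = trials.mean(axis=0)  # P(r(t) = 1 | s) for each time bin t
    def p_ind(r):
        r = np.asarray(r)
        return float(np.prod(np.where(r == 1, p_spike, 1.0 - p_spike)))
    return p_ind

rng = np.random.default_rng(1)
trials = (rng.random((200, 5)) < 0.3).astype(int)
print(independent_model(trials)([1, 0, 0, 1, 0]))
```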
3.1 Assumptions on the Simplified Models of Correlation

Although in the remainder of the letter we will focus on a particular class of simplified response models, in this section it is useful to keep the simplified probability model P_simp(r|s) as general as possible and spell out the minimal requirements on P_simp(r|s) that are necessary to develop our information-theoretic formalism. We will require only that P_simp(r|s) satisfies two assumptions. These assumptions are listed below, and their importance will be made clear in the rest of the letter. Before we list the assumptions, we note that we will describe our formalism having in mind simplified models P_simp(r|s) that depend on many fewer parameters than P(r|s) and are thus much easier to sample than P(r|s). For example, the simplest possible case of P_simp(r|s) is a parameter-free probability distribution that is uniformly flat across all times and stimuli. Another example of a model that is simple to sample is P_simp(r|s) = P_ind(r|s). In fact, while estimating P(r|s) requires an evaluation of (M + 1)^L − 1 parameters for each stimulus s, estimating P_ind(r|s) needs only ML parameters for each stimulus. Although we have in mind very simple models for P_simp(r|s), it is important to note that the formalism developed in this letter would be well defined even if P_simp(r|s) were approximately as complex to sample as P(r|s). In fact, the family of Markov models that we will study in detail below interpolates parametrically from low to high model complexity.

We require that the simplified model P_simp(r|s) to be used satisfies the following two assumptions:

Assumption 1. We require that the method used for transforming P(r|s) to P_simp(r|s) operates separately and independently on the responses conditioned on each stimulus. Thus, we require that the transformation from P(r|s) to P_simp(r|s) is independent of P(s), or of P(r|s′) and P(s′) for any s′ ≠ s. This property is useful because the resampling (or “shuffling”) techniques that, as detailed in section 4, can reduce the bias of information estimates can be applied only to models P_simp(r|s) that, for each stimulus s, are constructed only from responses collected in response to that stimulus.

Assumption 2. We require that for each stimulus s with nonzero probability, the simplified response model P_simp satisfies the following condition:

$$\sum_{\mathbf{r} \in R} P(\mathbf{r}|s) \log_2 P_{\text{simp}}(\mathbf{r}|s) = \sum_{\mathbf{r} \in R} P_{\text{simp}}(\mathbf{r}|s) \log_2 P_{\text{simp}}(\mathbf{r}|s). \tag{3.2}$$
Assumption 2 is important to our analysis for three reasons. The first is that (taking the point of view that 0 log(0) is zero and c log(0) is ill defined for any c ≠ 0) assumption 2 enforces the condition that if for some r and s, P_simp(r|s) is zero, then P(r|s) must also be zero. This fact is crucial in the present context because, as we will see in the next section, it ensures that the information-theoretic quantities to be introduced below are well defined. The second reason is that, as also shown in the next section, assumption 2 ensures that we can rewrite the information-theoretic quantities in a way
that is easier to sample. The third reason is that assumption 2 is satisfied by all maximum entropy models constrained to preserve selected features of the full probability model P(r|s). This will be demonstrated in section 3.3. Since maximum entropy smoothing is a principled way to fill in the unconstrained details of a simplified model, it is useful to ensure that our formalism is applicable to all such models.

3.2 Measures of the Information Lost in Decoding with the Simplified Model

Now we turn to determining how much information is lost when decoding the neural response with a mismatched decoding model P_simp(r|s) instead of with the true model P(r|s). This problem has been well studied in the information-theoretic literature (Merhav, Kaplan, Lapidoth, & Shamai Shitz, 1994; Latham & Nirenberg, 2005). Although this information loss cannot be expressed through a general and simple analytical expression, Latham and Nirenberg (2005) recently derived a simple closed-form expression that is an upper bound to it, as follows:

$$\Delta I_{\text{simp}} \equiv D\big(P(s|\mathbf{r}) \,\|\, P_{\text{simp}}(s|\mathbf{r})\big) \equiv \sum_{\mathbf{r}} P(\mathbf{r}) \sum_{s} P(s|\mathbf{r}) \log_2 \frac{P(s|\mathbf{r})}{P_{\text{simp}}(s|\mathbf{r})}, \tag{3.3}$$
where D is the conditional Kullback-Leibler (KL) distance (see Cover & Thomas, 1991, p. 22, eq. 2.65). Assumption 2 ensures that if for some r and s, P_simp(r|s) is zero, then P(r|s) must also be zero, and this in turn ensures that ΔI_simp is a nonnegative and nondivergent information-theoretic measure. An important problem in the practical estimation of ΔI_simp (in the case in which P_simp(s|r) is described by many fewer parameters than P(s|r)) is that it is heavily biased, approximately as much as the mutual information I (Pola et al., 2005). In the next sections, we will show how to reduce the bias problem of ΔI_simp and thus allow its estimation in practice.

A second quantity of interest is I_LB−simp (see Pola et al., 2005), the difference between the mutual information I and ΔI_simp:

$$I_{\text{LB-simp}} = I - \Delta I_{\text{simp}} = -\sum_{\mathbf{r} \in R} P(\mathbf{r}) \log_2 P_{\text{simp}}(\mathbf{r}) + \sum_{s \in S} P(s) \sum_{\mathbf{r} \in R} P(\mathbf{r}|s) \log_2 P_{\text{simp}}(\mathbf{r}|s)$$
$$= \chi_{\text{simp}}(R) + \sum_{s \in S} P(s) \sum_{\mathbf{r} \in R} P(\mathbf{r}|s) \log_2 P_{\text{simp}}(\mathbf{r}|s), \tag{3.4}$$

where

$$\chi_{\text{simp}}(R) \equiv -\sum_{\mathbf{r} \in R} P(\mathbf{r}) \log_2 P_{\text{simp}}(\mathbf{r}). \tag{3.5}$$
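A hedged numpy sketch of these two quantities (not part of the original letter; the tabular representation and all names are illustrative assumptions) may help make equations 3.3 to 3.5 concrete. It assumes the full and simplified conditional probabilities are available as small tables and that assumption 2 holds:

```python
import numpy as np

def xlog2y(x, y):
    """Elementwise x * log2(y), with the convention 0 * log2(0) = 0."""
    out = np.zeros_like(x, dtype=float)
    m = x > 0
    out[m] = x[m] * np.log2(y[m])
    return out

def delta_I_and_I_LB(P, P_simp, P_s):
    """Delta I_simp (eq. 3.3) and I_LB-simp (eq. 3.4) from probability tables.

    P and P_simp have shape (S, R), with P[s, r] = P(r|s); P_s has shape (S,).
    Assumption 2 is taken to hold: P_simp[s, r] = 0 implies P[s, r] = 0.
    """
    Pr = P_s @ P                    # P(r)
    Pr_simp = P_s @ P_simp          # P_simp(r)
    joint = P_s[:, None] * P        # P(s) P(r|s)
    # Delta I = sum_{s,r} P(s)P(r|s) log2[(P(r|s)/P(r)) / (P_simp(r|s)/P_simp(r))];
    # the factor P(s) inside P(s|r)/P_simp(s|r) cancels via Bayes' rule.
    m = joint > 0
    Pr_b = np.broadcast_to(Pr, joint.shape)
    Pr_simp_b = np.broadcast_to(Pr_simp, joint.shape)
    log_ratio = np.zeros_like(joint)
    log_ratio[m] = (np.log2(P[m]) - np.log2(Pr_b[m])
                    - np.log2(P_simp[m]) + np.log2(Pr_simp_b[m]))
    delta_I = np.sum(joint * log_ratio)
    # I_LB-simp = chi_simp(R) + sum_s P(s) sum_r P(r|s) log2 P_simp(r|s)
    chi_simp = -np.sum(xlog2y(Pr, Pr_simp))
    I_LB = chi_simp + np.sum(xlog2y(joint, P_simp))
    return delta_I, I_LB

# Sanity check: with P_simp = P, Delta I = 0 and I_LB equals the full I.
P = np.array([[0.6, 0.4, 0.0], [0.1, 0.2, 0.7]])
P_s = np.array([0.5, 0.5])
print(delta_I_and_I_LB(P, P, P_s))
```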
Since ΔI_simp is nonnegative and is an upper bound to the information lost when decoding the neuronal responses with the mismatched response model P_simp, I_LB−simp has a well-defined meaning: it quantifies information that can be decoded by using P_simp. As shown by Pola et al. (2005), this quantity is of practical importance because it is much less biased than the mutual information I; therefore, it provides a useful data-robust lower bound to the information decodable with P_simp.

A further simplification of both ΔI_simp and I_LB−simp can be obtained by making use of assumption 2, which permits rewriting I_LB−simp and ΔI_simp as

$$I_{\text{LB-simp}} = \chi_{\text{simp}}(R) - H_{\text{simp}}(R|S), \tag{3.6}$$

$$\Delta I_{\text{simp}} = H_{\text{simp}}(R|S) - H(R|S) + H(R) - \chi_{\text{simp}}(R), \tag{3.7}$$

where H_simp(R|S) is the noise entropy of the simple response model:

$$H_{\text{simp}}(R|S) = -\sum_{s \in S} P(s) \sum_{\mathbf{r} \in R} P_{\text{simp}}(\mathbf{r}|s) \log_2 P_{\text{simp}}(\mathbf{r}|s). \tag{3.8}$$
The advantage of this rewriting is that the stimulus-conditional functionals of the simplified model are now expressed as the noise entropy of P_simp(r|s). This property is important for improving the sampling properties of the information quantities because, as we will see in the next section, H_simp(R|S) can be corrected for limited sampling by very effective techniques (Nemenman, Bialek, & de Ruyter van Steveninck, 2004), and its presence in the expression for ΔI_simp will allow us to cancel out the bias of the latter with appropriate procedures.

3.3 Maximum Entropy Models

We now consider in more detail the problem of how to construct simplified correlation models that satisfy the assumptions needed by our information-theoretic framework. In constructing simplified models of correlations, it is natural to ask our model to preserve only some properties of the true probability P(r|s). A way to formalize this is to require our simplified model to satisfy, apart from the usual requirements of nonnegativity and normalization to one, a certain number m of constraints that are also satisfied by P(r|s), as follows:

$$P_{\text{simp}}(\mathbf{r}|s) \geq 0, \qquad \sum_{\mathbf{r}} P_{\text{simp}}(\mathbf{r}|s) = 1,$$
$$\sum_{\mathbf{r}} P_{\text{simp}}(\mathbf{r}|s)\, g_i(\mathbf{r}) = \sum_{\mathbf{r}} P(\mathbf{r}|s)\, g_i(\mathbf{r}), \qquad i = 1, \dots, m, \tag{3.9}$$
where g_i(r) are arbitrary functions on R.3 Once the constraints in equation 3.9 have been chosen, it is then desirable to simplify the response model by removing all types of correlation in the data apart from those enforced by the features preserved from the original distribution. A principled way to choose a P_simp(r|s) that satisfies the constraints in equation 3.9 and adds no further relationship between the data is to choose P_simp(r|s) as the distribution with the maximum entropy allowed by our constraints. This maximum entropy distribution is unique and has the following expression (Cover & Thomas, 1991):

$$P_{\text{simp}}(\mathbf{r}|s) = \exp\Big(\lambda_0 - 1 + \sum_{i=1}^{m} \lambda_i g_i(\mathbf{r})\Big), \tag{3.10}$$

where the parameters λ_0, λ_1, . . . , λ_m are fixed (independently for each conditional distribution to each stimulus) so as to satisfy the constraints in equation 3.9. The maximum entropy distribution is in some way the most reasonable choice of simplified model of correlations given the constraints: to choose a distribution with lower entropy would correspond to assuming some additional structure that we do not know; to choose one with a higher entropy would necessarily violate the constraints that we wish to enforce.

It is important to note that any maximum-entropy simplified model of the form in equation 3.10 that satisfies the constraints of equation 3.9 is a suitable simplified model for our analysis; in fact, any such maximum-entropy model satisfies by construction the two assumptions of our formalism. In particular, by using equation 3.10, equation 3.2 of assumption 2 becomes

$$\sum_{\mathbf{r}} P_{\text{simp}}(\mathbf{r}|s)\Big(\lambda_0 - 1 + \sum_{i=1}^{m} \lambda_i g_i(\mathbf{r})\Big) = \sum_{\mathbf{r}} P(\mathbf{r}|s)\Big(\lambda_0 - 1 + \sum_{i=1}^{m} \lambda_i g_i(\mathbf{r})\Big), \tag{3.11}$$

which is obviously satisfied if P_simp(r|s) meets the constraints set by equation 3.9. Another demonstration of the relationship between assumption 2 and the maximum-entropy principle is reported in appendix B.

An important class of maximum entropy distributions consists of the distributions that preserve some marginal probabilities of P(r|s), such as the independent model P_ind(r|s) of equation 3.1, the Markov models considered in the next subsection, and the hierarchical probability models of Amari (2001). Models preserving marginals are obtained from equation 3.10 by constraining P_simp and P to have equal sum on a number of subsets A_i, B_i, · · · of the response space R (each subset corresponding to the responses that have to be summed to compute the marginal probability to be preserved).

3 The functions g_i(r) could in principle be different for each stimulus-conditional distribution.
This corresponds to choosing for each subset a function g_i(r) in equations 3.9 and 3.10 with value one on the subset under consideration and zero elsewhere. For example, the independent model of equation 3.1 is obtained from equation 3.10 by partitioning the response space into the disjoint union of the “marginal” subsets A_1, A_2 (the subsets of all responses with, respectively, one or zero spikes in time bin 1), B_1, B_2 (the subsets of all responses with, respectively, one or zero spikes in time bin 2), and so on for all time bins, and then by enforcing P_simp and P to have equal sum on each of the subsets.

More complex maximum entropy models can be obtained by extending the procedure to constrain P_simp also on intersections of subsets. For example, Markov models of order 1 can be obtained by constraining the sum of P_simp not only on the marginal subsets corresponding to each time bin, but also on the pairwise intersections corresponding to adjacent time bins. Amari's hierarchical models of purely pairwise interactions (Amari, 2001) can be obtained by the stricter requirement of constraining the sum of P_simp on all the pairwise intersections of marginal subsets, not only those corresponding to adjacent time bins. All of these transformations can be applied recursively. The recursion leads to constructing Markov chains of arbitrary length, as well as hierarchical models containing higher-order interactions (Amari, 2001). It is interesting to note that in this way one can prove that all suffix tree models satisfy our assumptions and are therefore valid choices of P_simp. In fact, suffix tree models can be constructed with this recursive partitioning by constraining P_simp on a smaller number of intersections than the Markov model of corresponding length. This observation helps in understanding the relationship between the work presented here and the recent work of London, Schreibman, Hausser, Larkum, and Segev (2002) and Kennel, Shlens, Abarbanel, and Chichilnisky (2005), which use suffix tree models to estimate entropy rates.

In summary, in this section we have established that all maximum-entropy models satisfy our assumptions and can thus be used to estimate information with our formalism. These models include Markov chains, hierarchical distributions, and suffix tree models. We have also provided a general and explicit way to construct such simplified models from the data through equations 3.9 and 3.10. This construction can be successfully applied whatever the statistics of neuronal firing described by the response probability P(r|s). In fact, for any given choice of the constraints in equation 3.9, the maximum entropy model fulfilling these constraints will automatically satisfy (by construction) our assumptions, whatever the form of P(r|s). An important implication of this result is that, in practice, our assumptions do not restrict in any way the applicability of the method to data sets with specific statistical properties.
3.4 Markov Models

Despite the generality of the above constructions of P_simp, we will illustrate and develop the main idea behind our formalism by focusing on a specific class of simplified correlation models: the Markov models with finite memory. We choose to illustrate our ideas using this particular class of maximum entropy simplified models because (1) by tuning the order of the Markov process, we can vary parametrically the complexity of the model and thus illustrate clearly how the sampling behavior of the information-theoretic functionals depends on the number of parameters describing the simplified model, and (2) these models are easy to construct and apply to data. A neurophysiological motivation for the use of Markov models stems from the fact that in many neural systems, correlations are significant only between spikes that are separated by a short time lag, in the range 1 to 15 ms (Gray et al., 1989; Brosch, Bauer, & Eckhorn, 1997; Dan et al., 1998; Nirenberg, Carcieri, Jacobs, & Latham, 2001; Golledge et al., 2003). In such cases, to preserve the whole information, it is sufficient to take into account only correlations extending over a short lag. Thus, one can approximate the real probability of the current response r(t) given the past firing with a finite-memory Markov model that looks back only q time steps, as follows:

$$P_q(\mathbf{r}|s) = P(r(1)|s) \prod_{t=2}^{L} P\big(r(t)\,\big|\,r(t-q), \dots, r(t-1); s\big), \qquad q = 1, \dots, L-1,$$
$$P_0(\mathbf{r}|s) = P_{\text{ind}}(\mathbf{r}|s), \qquad q = 0. \tag{3.12}$$
The probability conditional on the responses in the previous time steps at fixed stimulus in the above equation can be computed from the experimental probabilities via

$$P\big(r(t)\,\big|\,r(t-q), \dots, r(t-1); s\big) = \frac{P\big(r(t-q), \dots, r(t-1), r(t)\,\big|\,s\big)}{P\big(r(t-q), \dots, r(t-1)\,\big|\,s\big)}, \tag{3.13}$$
where P(r(t − q), . . . , r(t − 1), r(t)|s) and P(r(t − q), . . . , r(t − 1)|s) are marginal distributions of the full model P(r|s), computed by integrating away the dependence on all the response variables that do not enter their argument. This simple procedure for constructing the Markov model is equivalent to the maximum-entropy one described above. Markov models interpolate parametrically between the independent model (q = 0) and the full probability model (obtained for q = L − 1, because P_{L−1}(r|s) = P(r|s)). P_q(r|s) preserves all correlations extending up to q time bins in the past, and it neglects all correlations of range longer than q. Thus, it is a perfect description of neuronal firing if correlations extend to a lag shorter than or equal to q time bins.
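The following sketch (again illustrative, not from the original letter) estimates the order-q model of equations 3.12 and 3.13 from binary trials for one stimulus, by counting response windows of up to q + 1 consecutive bins:

```python
import numpy as np
from collections import Counter

def fit_markov_model(trials, q):
    """Estimate the order-q model P_q(r|s) of equations 3.12-3.13 for one
    stimulus, from binary trials of shape (N_s, L). Returns a function that
    evaluates P_q on a response word r via the chain rule."""
    trials = np.asarray(trials)
    N, L = trials.shape
    counts = Counter()
    # Count every response window of width up to q + 1 ending at each bin.
    for trial in trials.tolist():
        for t in range(L):
            for width in range(1, min(q + 1, t + 1) + 1):
                counts[(t, tuple(trial[t - width + 1:t + 1]))] += 1

    def P_q(r):
        r = tuple(int(x) for x in r)
        p = counts[(0, r[:1])] / N                 # P(r(1)|s)
        for t in range(1, L):
            past = max(0, t - q)
            num = counts[(t, r[past:t + 1])]       # window of up to q+1 bins
            den = counts[(t - 1, r[past:t])]       # its first up to q bins
            p *= num / den if den > 0 else 0.0
        return p
    return P_q
```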
The information-theoretic probability functionals corresponding to the choice P_simp(r|s) = P_q(r|s) will be indicated by a subscript q in place of the subscript simp. For completeness, their expressions are reported below:

$$I_{\text{LB-}q} = \chi_q(R) - H_q(R|S), \tag{3.14}$$

$$\Delta I_q = H_q(R|S) - H(R|S) + H(R) - \chi_q(R), \tag{3.15}$$

where H_q(R|S) is the noise entropy of the simple response model,

$$H_q(R|S) = -\sum_{s \in S} P(s) \sum_{\mathbf{r} \in R} P_q(\mathbf{r}|s) \log_2 P_q(\mathbf{r}|s), \tag{3.16}$$

and χ_q(R) is

$$\chi_q(R) = -\sum_{\mathbf{r} \in R} P(\mathbf{r}) \log_2 P_q(\mathbf{r}). \tag{3.17}$$
4 Bias Cancellations Obtained by Shuffling the Responses

As discussed above, mutual information has been broken down into two terms, I_LB−simp and ΔI_simp, with radically different sampling properties, the former easy to sample and the latter very difficult to sample. Here we examine in detail the sampling behavior of ΔI_simp and show how to reduce its bias dramatically without increasing its variance. Since ΔI_simp is the most biased part of I, this will also improve the sampling properties of the mutual information. Since I, I_LB−simp, and ΔI_simp consist of four quantities (H(R|S), H_simp(R|S), H(R), and χ_simp(R)), the relative sampling properties of I, I_LB−simp, and ΔI_simp can be established by considering the sampling properties of these four quantities.

4.1 The Independent Decoder: Ignoring All Correlations

For clarity, we start by considering in detail the sampling properties of the quantities corresponding to the q = 0, independent probability model P_simp(r|s) = P_ind(r|s) = P_0(r|s). The corresponding information-theoretic functionals are I_LB−0 and ΔI_0, obtained from equations 3.14 and 3.15 with q = 0.

4.1.1 Uncorrected Estimators

There are two relevant aspects of the sampling properties of a probability functional: its bias and its variance. We will consider the bias first and the variance later. The bias of a functional F(P) of the probabilities P is defined as the difference between ⟨F(P_N)⟩_N, the ensemble-averaged value of the functional computed from the probability distributions P_N empirically obtained from N trials, and the true value of the functional F(P) computed with the true probability distribution P. Thus, the bias is a systematic error that cannot be eliminated just by averaging.
We illustrate the sampling behavior of the entropy functionals by computing them on a set of realistically simulated neuronal data, generated as follows. We simulated the spiking response of one neuron in somatosensory cortex to 49 different stimuli consisting of sinusoidal whisker vibrations with different amplitude and frequency (Arabzadeh, Petersen, & Diamond, 2003; Arabzadeh, Panzeri, & Diamond, 2004). We simulated neuronal responses over a 0 to 50 ms poststimulus time window. We then digitized, with time precision Δt equal to 5 ms, these responses into binary words of length L = 10. The simulations were performed using a Markov process with order q = 3, with all the marginal probabilities of order 3 or less and the transition probabilities taken from real responses of a cortical somatosensory neuron.4 The spike train simulated in this way retains a faithful description of all the real neuronal marginal distributions and their correlations up to q = 3 time bins.

In Figure 1A we report the sampling behavior of the four quantities χ_0(R), H(R), H_0(R|S), and H(R|S) as a function of the number of trials per stimulus. For each simulation with a fixed number of trials, we computed the plug-in value of each functional by plugging into its equation the empirical estimates of the probabilities (without application of any bias correction procedure) and then averaging over all simulations with the same number of trials.

We first consider the noise entropy H(R|S). Figure 1A shows that this is by far the most downward biased of the four functionals considered. This is because it requires the simultaneous measure of all the P(r|s) for all stimuli. To understand its sampling behavior better, it is useful to find analytical approximations to the bias. These can be easily derived in the asymptotic sampling regime. The latter is defined as the case in which the number of trials N_s for each stimulus s is so large that each response bin with nonzero probability is observed many times, N_s P(r|s) ≫ 1 (Panzeri & Treves, 1996). In this asymptotic sampling regime, the bias of H(R|S) (and, similarly, of all other functionals) can be expanded in powers of 1/N (N being the total number of trials across all stimuli), as follows:

$$\text{Bias}[H(R|S)] \approx \frac{C_1}{N} + \frac{C_2}{N^2} + \cdots, \tag{4.1}$$
4 For each stimulus condition s, N_s binary spike trains were generated with a q-order Markov model as follows (see equation 3.13). The response in the first bin was assigned to be a 1 or a 0 (spike or no spike) according to P(r(1)|s), the latter being computed from the real data. Responses in successive time bins were then generated one after the other using the corresponding transition probabilities. For instance, the response at bin k was generated according to the real-data probability P(r(k)|r(k − j), . . . , r(k − 1); s), where j ≤ min(q, k − 1).
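A minimal sketch of this surrogate-generation recipe (names and defaults are illustrative assumptions, not the authors' code) could look as follows:

```python
import numpy as np

def simulate_markov_trials(trials, q, n_surrogates, rng=None):
    """Generate binary surrogate spike trains with the footnote-4 recipe:
    bin 1 is drawn from P(r(1)|s), later bins from the order-q transition
    probabilities P(r(t)|r(t-q),...,r(t-1);s) estimated from `trials`."""
    rng = rng or np.random.default_rng()
    N, L = trials.shape
    out = np.zeros((n_surrogates, L), dtype=int)
    for i in range(n_surrogates):
        for t in range(L):
            past = out[i, max(0, t - q):t]
            # Real trials whose recent history matches the surrogate's history.
            match = np.all(trials[:, max(0, t - q):t] == past, axis=1)
            p_spike = trials[match, t].mean() if match.any() else 0.0
            out[i, t] = rng.random() < p_spike
    return out

rng = np.random.default_rng(2)
real = (rng.random((500, 10)) < 0.2).astype(int)
fake = simulate_markov_trials(real, q=3, n_surrogates=500, rng=rng)
```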
[Figure 1 appears here: panel A shows entropy estimates (bits) and panel B information estimates (bits), both plotted against log2(trials).]

Figure 1: Comparison of the sampling properties of plug-in estimations of different probability functionals. The plug-in estimators of the probability functionals are plotted as a function of the number of trials per stimulus available. Results were averaged over a number of repetitions of the simulation (decreasing from 200 to 10 as the number of trials per stimulus available increased). We simulated a neuron responding to 49 different stimuli. We considered a time precision of 5 ms and a poststimulus time window of 50 ms; thus, L was equal to 10. The spike train was simulated with a Markov process with the same mean firing rates and up-to-third-order marginal probabilities as a real spike train recorded by Arabzadeh et al. (2003) in response to 49 different sinusoidal whisker vibrations. (A) Average values of plug-in estimators of χ_0(R), H(R), H_0(R|S), H(R|S), and H_0−sh(R|S) obtained without any bias correction. (B) Average values of plug-in estimators of I(R;S), I_LB−0, ΔI_0, and ΔI_0−sh obtained without any bias correction.
The leading term in 1/N has a simple analytical expression:

$$\text{Bias}[H(R|S)] \approx -\frac{1}{2N \ln 2} \sum_{s} \big(\tilde{R}(s) - 1\big), \tag{4.2}$$
where R̃(s) denotes the number of “relevant” responses of the stimulus-conditional response probability distribution P(r|s), that is, the number of different responses r with nonzero probability of being observed when stimulus s is presented (Panzeri & Treves, 1996). R̃(s) is of order (M + 1)^L for each stimulus. Thus, it follows that for the bias in equation 4.2 to be small, N should be much bigger than S × (M + 1)^L.

Let us now consider H_0(R|S). It is apparent from Figure 1A that it is almost unbiased. The reason is that it can be expressed as a sum of single-bin entropies (see Pola et al., 2005). As a consequence, its bias is approximately SML/(2N log(2)) and is thus much smaller than that of H(R|S).

Figure 1A shows that the response entropy H(R) is considerably simpler to sample than H(R|S). The reason is that since H(R) depends only on P(r),
its bias is approximately S times smaller than the bias of H(R|S). This is an advantage when many different stimuli are presented.

Figure 1A shows that the bias of χ_0(R) is much smaller than the bias of H(R). The reason is as follows. Bias arises from the logarithmic form of the entropy functionals. The logarithm in χ_0(R) depends on P_0(r). Since P_0(r) is better sampled than P(r), χ_0(R) has less bias than H(R), whose logarithm depends on P(r).

Now we study how the properties of the four functionals combine to give rise to the sampling properties of I(R;S), I_LB−0, and ΔI_0. The bias of the mutual information I(R;S) is the difference between the biases of H(R) and H(R|S). As the most biased term is H(R|S), the mutual information is upward biased (Panzeri & Treves, 1996). The bias of I_LB−0 (see equation 3.14 with q = 0) is the difference between the biases of χ_0(R) and H_0(R|S). As both these quantities are virtually unbiased, so is I_LB−0. Let us now consider the bias of ΔI_0 (see equation 3.15 with q = 0). When many stimuli are presented, the contribution of χ_0(R) and H(R) is only very mildly biased and is determined by the bias of H(R). Thus, most of the bias comes from the stimulus-conditional term H_0(R|S) − H(R|S). Since H(R|S) is very strongly biased downward and H_0(R|S) is essentially unbiased, the two biases do not cancel out, and as a result, ΔI_0 is biased upward and behaves like −H(R|S).

An important practical problem is how to reduce the bias of H_0(R|S) − H(R|S) and thus improve the sampling properties of ΔI_0. A solution to this problem is to compute H_0(R|S) not directly from the single-bin marginal probabilities as in equation 3.16, but by randomly shuffling the responses and then computing their entropy. In the q = 0 case considered here, we can generate a new set of shuffled responses to stimulus s by randomly permuting, for each time bin, the order of the trials collected in response to the stimulus s considered, and then joining together the shuffled responses in different time bins into a response vector r_0−sh. This shuffling operation leaves each single-time-bin marginal probability unchanged (because responses in each bin are just randomly permuted), while destroying any within-trial correlation between different time bins. We define H_0−sh(R|S) as the noise entropy of the shuffled distribution:

$$H_{0\text{-sh}}(R|S) = -\sum_{s \in S} P(s) \sum_{\mathbf{r}_{sh} \in R} P_{0\text{-sh}}(\mathbf{r}|s) \log_2 P_{0\text{-sh}}(\mathbf{r}|s), \tag{4.3}$$
where P_0−sh(r|s) is the distribution of response values obtained from r_0−sh. The asymptotic large-N value of H_0−sh(R|S) is the same as that of H_0(R|S), but its scaling with the number of trials is very different. This is shown in Figure 1A. Unlike H_0(R|S), H_0−sh(R|S) scales with N approximately as H(R|S), but with a slightly more negative slope (i.e., slightly more
downward bias) as the number of trials decreases. The fact that the biases of H_0−sh(R|S) and H(R|S) are of similar size can be understood intuitively from the fact that P_0−sh(r|s) is sampled with the same number of trials as P(r|s), from responses with the same dimensionality. To understand why H_0−sh(R|S) is more biased downward than H(R|S), we computed the bias of H_0−sh(R|S) in the asymptotic sampling regime. We found that the asymptotic bias of H_0−sh(R|S) has the same expression as that of H(R|S) in equation 4.2, after replacing R̃(s) with R̃_0−sh(s), the number of responses relevant to P_0−sh(r|s). Since P_0(r|s) = 0 implies P(r|s) = 0, and since the shuffled responses are generated according to P_0(r|s), it must be that R̃_0−sh(s) ≥ R̃(s). Thus, at leading order in the asymptotic regime, H_0−sh(R|S) is never less downward biased than H(R|S).5 Therefore, if we are in the asymptotic sampling regime (or at least the number of trials is enough to make the asymptotic equations a decent estimate of the size of the bias), and if R̃_0−sh(s) and R̃(s) are roughly similar, H_0−sh(R|S) − H(R|S) will either (1) have a leading-order 1/N bias term that is negative and arises from a partial cancellation of the roughly similar leading bias terms of the two entropies, or (2) (in case R̃_0−sh(s) = R̃(s)) have a very small bias, only at order 1/N^2 or higher.

The sampling behavior of H_0−sh(R|S) suggests a simple strategy to reduce the bias of ΔI_0: use H_0−sh(R|S) instead of H_0(R|S). We call this estimator ΔI_0−sh:

$$\Delta I_{0\text{-sh}} = H_{0\text{-sh}}(R|S) - H(R|S) + H(R) - \chi_0(R). \tag{4.4}$$
Simulations in Figure 1B show that ΔI_0−sh is much less biased than ΔI_0, but it converges to the same asymptotic large-N value. The biases of H_0−sh(R|S) and H(R|S) almost cancel each other, leaving an overall small negative bias.6 The good sampling properties of ΔI_0−sh suggest a new, alternative way to estimate mutual information, as follows:

$$I_{sh} = I_{\text{LB-}0} + \Delta I_{0\text{-sh}}. \tag{4.5}$$
5 There are different types of resampling methods that have been used to generate surrogate data sets for the validation of information-theoretic calculations (Johnson, Gruner, Baggerly, & Seshagiri, 2001). It is important to note that R̃_0−sh(s) ≥ R̃(s) holds for the resampling procedure “without replacement” used here to generate the surrogate data, but it does not necessarily hold for other resampling techniques “with replacement.”

6 The idea of using shuffling to cancel out the biases in the stimulus-dependent part of I was first proposed by Nirenberg et al. (2001). However, the fact that H_0−sh(R|S) is more downward biased than H(R|S), and thus that ΔI_0 computed in this way is mildly downward biased, has not been reported before.
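To make the recipe of equations 4.3 to 4.5 concrete, here is a hedged numpy sketch (not part of the original letter; it assumes binary responses and equiprobable stimuli, and all names are illustrative):

```python
import numpy as np

def entropy(words):
    """Plug-in entropy (bits) of response words of shape (N, L)."""
    _, counts = np.unique(words, axis=0, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def shuffle_within_bins(w, rng):
    """Permute trial order independently in each time bin: single-bin
    marginals are preserved, within-trial correlations are destroyed."""
    return np.column_stack([rng.permutation(w[:, t]) for t in range(w.shape[1])])

def I_shuffled(responses, rng=None):
    """I_sh = I_LB-0 + Delta I_0-sh (equations 4.3-4.5), sketched for binary
    responses and equiprobable stimuli; `responses` maps stimulus -> (N_s, L)."""
    rng = rng or np.random.default_rng()
    words = list(responses.values())
    H_R = entropy(np.vstack(words))
    H_RS = np.mean([entropy(w) for w in words])
    H_0sh_RS = np.mean([entropy(shuffle_within_bins(w, rng)) for w in words])
    # Single-bin marginals give H_0(R|S) and the cross entropy chi_0(R).
    p_bin = [w.mean(axis=0) for w in words]  # P(r(t) = 1 | s)

    def h2(p):  # binary entropy of one bin
        return 0.0 if p in (0.0, 1.0) else -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

    H_0_RS = np.mean([sum(h2(p) for p in pb) for pb in p_bin])

    def p0(r):  # P_0(r) = <P_ind(r|s)>_s for equiprobable stimuli
        return np.mean([np.prod(np.where(r == 1, pb, 1 - pb)) for pb in p_bin])

    chi_0 = -np.mean([np.log2(p0(r)) for r in np.vstack(words)])
    I_LB0 = chi_0 - H_0_RS                        # eq. 3.14 with q = 0
    dI_0sh = H_0sh_RS - H_RS + H_R - chi_0        # eq. 4.4
    return I_LB0 + dI_0sh                         # eq. 4.5
```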
Since I_LB−0 is virtually unbiased and ΔI_0−sh is mildly biased downward, I_sh is also mildly biased downward. This is confirmed by the simulation results in Figure 1B: using I_sh compares very favorably with computing mutual information directly as I in equation 2.1.

In summary, we now have two classes of estimators of ΔI_0 and therefore of I: the direct estimators ΔI_0 and I in equations 3.3 and 2.1, which are strongly biased upward, and the shuffled estimators ΔI_0−sh and I_sh, which are mildly biased downward. The difference in the sign of the bias of the two estimators is very important in practice. In fact, computing ΔI_0 and I with both upward- and downward-biased algorithms provides a means to bound them from both above and below. This can greatly improve the confidence in estimates obtained from experimental data. In the rest of this letter, we devote our attention to how to make both the upper and the lower bounds tighter than they are in Figure 1.

4.1.2 Effect of Different Bias Correction Methods

The above shows the behavior of the probability functionals when estimated by plugging in the probabilities estimated directly from the data. However, these estimates can be improved by using available bias correction techniques. In this section, we apply different bias correction techniques to the probability functionals in order to understand how best to evaluate each functional from a limited number of data.

We first evaluated the performance of a quadratic extrapolation procedure (Strong, Koberle, de Ruyter van Steveninck, & Bialek, 1998), performed by computing the quantities from fractions of the available data and then fitting the resulting data-scaling behavior to a quadratic function of 1/N, as in equation 4.1. This procedure should work well in the asymptotic regime where the number of trials N is large.7 Simulations suggest that in practice, a good bias correction from quadratic extrapolations or other asymptotic methods requires the number of trials per stimulus to be at least as big as the number of possible responses R (Panzeri & Treves, 1996). Results are shown in Figures 2A and 2B. Figure 2A shows that χ_0(R) becomes completely unbiased after the quadratic extrapolation procedure, even with as few as 32 trials per stimulus. After applying the quadratic extrapolation, H_0(R|S) also becomes almost unbiased even at 32 trials per stimulus. This can be understood by remembering that H_0(R|S) can be expressed as a sum of single-bin entropies, and the sampling of each single-bin entropy can be considered to be in the asymptotic regime even for a small number of trials. Thus, the quadratic extrapolation performs very well on H_0(R|S). As a result, the extrapolated I_LB−0 estimator (see Figure 2B) becomes almost unbiased, even with as few as 32 trials per stimulus.

7 We also considered the performance of other bias correction methods developed to work in the asymptotic regime, such as computing analytically the coefficients of the 1/N expansion of the bias as a function of the probabilities (Panzeri & Treves, 1996; Pola et al., 2005). As we found no overall increase in performance with respect to the quadratic extrapolation, we decided to work with the latter, which is easier to implement.
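A minimal sketch of the quadratic extrapolation procedure (an illustrative implementation under the stated assumptions, not the authors' code) is given below; it treats the estimator as a black-box functional of the data:

```python
import numpy as np

def quadratic_extrapolation(estimator, responses, rng=None):
    """Bias correction by extrapolating to infinite data (Strong et al., 1998).

    Evaluates `estimator` (any functional of a stimulus -> trials dict, e.g.
    a plug-in entropy) on the full data set and on random halves and quarters,
    fits est(N) = a + b/N + c/N**2, and returns the extrapolated value a.
    """
    rng = rng or np.random.default_rng()
    fractions, values = [1, 2, 4], []
    for k in fractions:
        # Average over a few random subsamples of N/k trials per stimulus.
        vals = []
        for _ in range(4):
            sub = {s: w[rng.permutation(len(w))[:len(w) // k]]
                   for s, w in responses.items()}
            vals.append(estimator(sub))
        values.append(np.mean(vals))
    N = min(len(w) for w in responses.values())
    x = np.array([1.0 / (N // k) for k in fractions])
    a, b, c = np.polyfit(x, values, 2)[::-1]  # coefficients of 1, 1/N, 1/N**2
    return a
```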
[Figure 2 appears here: panels A and B (quadratic extrapolation) and panels C and D (NSB correction) show entropy and information estimates (bits) against log2(trials).]

Figure 2: Comparison of the sampling properties of estimators computed with the quadratic extrapolation and the NSB bias correction procedures. Results are plotted as a function of the number of trials per stimulus and were averaged over a number of repetitions of the simulation. The simulated data were obtained exactly as in Figure 1, again considering a poststimulus window of 50 ms discretized into L = 10 bins of size Δt = 5 ms. (A, B) Quadratic extrapolation. (A) Averaged estimated values of χ_0(R), H(R), H_0(R|S), H(R|S), and H_0−sh(R|S). (B) Averaged estimated values of I(R;S), I_LB−0, ΔI_0, and ΔI_0−sh. (C, D) NSB estimation. (C) Averaged estimated values of χ_0(R), H(R), H_0(R|S), H(R|S), and H_0−sh(R|S). (D) Averaged estimated values of I(R;S), I_LB−0, ΔI_0, and ΔI_0−sh.
Figure 2A shows that the response entropy H(R) is still mildly biased after correction, requiring at least 128 trials per stimulus for good estimation. In contrast, the noise entropies H_0−sh(R|S) and H(R|S) remain highly biased even after the extrapolation procedure, requiring at least 2^10 = 1024 trials per stimulus for unbiased estimation. The reason is that P(r|s) and P_0−sh(r|s) are high dimensional, and thus the number of trials needed to get them into the asymptotic sampling regime is much higher. As a consequence, ΔI_0 is still substantially biased upward, and applying a quadratic extrapolation procedure is still not enough: we still need at least 2^10 trials per stimulus for unbiased estimation (see Figure 2B). However, the good
news from Figure 2B is that ΔI_0−sh behaves better than ΔI_0. Because of the partial bias cancellation in H_0−sh(R|S) − H(R|S), the quadratic extrapolation is able to remove most of the downward bias of ΔI_0−sh from about 128 trials per stimulus onward. Similarly, estimating mutual information as I_sh (rather than directly as I from the mutual information definition in equation 2.1) leads to a much less biased estimation (see Figure 2B).

In summary, we simulated data with L = 10 time bins and 2^10 possible responses, and we used the quadratic extrapolation bias correction. Estimating ΔI_0 and I directly requires approximately 2^10 trials per stimulus, of the order of the size of the response space. However, estimating the same quantities through the ΔI_0−sh and I_sh shuffling procedures required only 2^7 trials per stimulus, eight times fewer than the size of the response space. To check how these results scaled with L, we simulated data with the same procedure as in Figures 1 and 2, but using different sizes of poststimulus windows (ranging from L = 8 to L = 14). The results were always comparable to the L = 10 case: estimating ΔI_0 and I directly requires a number of trials per stimulus of the order of the size of the response space, whereas estimating them with the ΔI_0−sh and I_sh shuffling procedures required a number of trials per stimulus eight times smaller than the size of the response space (results not shown).

We now consider the effect of using a more sophisticated Bayesian entropy bias correction proposed by Nemenman et al. (2004). This procedure (called NSB) was designed to operate beyond the asymptotic regime8 and is briefly reviewed in appendix A. As shown in Nemenman et al. (2004), it works well even when data are scarce and the response space is high dimensional, so that each response is observed in no more than a handful of trials. Figures 2C and 2D report the values of the functionals corrected with the NSB procedure and show that the NSB method in general performs well. The response entropy H(R) (see Figure 2C) is even better behaved than with the quadratic correction and is now essentially unbiased, requiring only 32 trials per stimulus for good estimation. The noise entropies H_0−sh(R|S) and H(R|S) are also better evaluated with the NSB procedure than with the quadratic extrapolation procedure. However, both H_0−sh(R|S) and H(R|S) still remain substantially biased. As a consequence, Figure 2D shows that ΔI_0 and I are still substantially biased upward, requiring at least 2^9 trials per stimulus for unbiased computation. However, the unbiased computation of ΔI_0−sh and of I_sh requires only 2^6 trials per stimulus, because the residual biases of H_0−sh(R|S) and H(R|S) cancel out almost exactly. Estimating mutual information as I_sh (see equation 4.5) works much better than estimating I directly.

8 We also considered the performance of the procedure of Paninski (2003), which also aims at working beyond the asymptotic sampling regime. We found that on the simulated data and information-theoretic quantities, the Paninski procedure helped to reduce the bias but did not perform as well as the NSB and the quadratic extrapolation procedures. Thus, we omitted the presentation of its results.
However, it should be noted that as a result of applying the NSB procedure, the residual bias of I_sh is not always negative. Thus, using the Nemenman et al. (2004) procedure to estimate I_sh gives an almost unbiased estimator for small numbers of trials, but not necessarily a lower bound to the true information (as instead happens when correcting I_sh with the quadratic extrapolation). The only entropy term that performed worse when corrected with the NSB procedure is the independent entropy H_0(R|S). As discussed in appendix A, the reason is that the Nemenman et al. (2004) procedure is tailored to high-dimensional response spaces, and thus its assumptions do not work well for H_0(R|S), which is essentially a sum of single-bin entropies. This problem is particularly serious for low firing rates (see appendix A). Finally, we note that the procedure of Nemenman et al. (2004) has so far been developed only for entropy quantities. Therefore, it cannot be applied to χ_0(R). Thus, χ_0(R) will always be corrected with the extrapolation procedure, which, as demonstrated before, works extremely well for this quantity.

It is now useful to come back to discuss why assumption 2 on the probability model, the one that enabled us to express I_LB−simp and ΔI_simp as in equations 3.6 and 3.7, is very important for improving the sampling properties. There are two reasons. The first is that the term outside the logarithm in the stimulus-conditional part of I_LB−simp is now P_simp(r|s) rather than P(r|s), and this is expected to reduce statistical fluctuations. The second, and more important, reason is that the stimulus-dependent part of ΔI_simp can now be expressed as a difference of entropies. This has a big impact on the ways the sampling bias of this quantity can be corrected. In fact, the sampling bias correction techniques of Nemenman et al. (2004) and the shuffling bias relationships used above both depend crucially on being able to express the stimulus-conditional parts of I_LB−simp and of ΔI_simp entirely in terms of entropies.

4.1.3 Variance of Shuffling-Based Estimators

We now consider whether the reduction in bias of the shuffled estimators ΔI_0−sh and I_sh comes at the expense of an increase in their variance. Equation 4.5 indicates that the variance of I_sh is largely determined by that of ΔI_0−sh, its worst-sampled component. The most biased terms in the computation of ΔI_0−sh are the stimulus-conditional entropies H(R|S) and H_0−sh(R|S), whose biases almost cancel out (see equation 4.4). The variance of ΔI_0−sh will depend on the degree of correlation (across independent realizations with the same number of trials) between the values of H(R|S) and H_0−sh(R|S). If these two quantities were positively correlated, then their statistical fluctuations would tend to cancel out, and the resulting variance of ΔI_0−sh would be under control. If instead H(R|S) and H_0−sh(R|S) were either uncorrelated or negatively correlated across different realizations, then their
statistical fluctuations would either not cancel out or even increase, thus making the variance of ΔI_0−sh larger. The scatter plot in Figure 3A shows that H_0−sh(R|S) and H(R|S) are strongly positively correlated when taken from the same realization of the simulations. (The data were simulated exactly as in Figure 2 and were obtained using 128 trials per stimulus in each simulation.) The fact that H_0−sh(R|S) and H(R|S) are correlated stems from the fact that fluctuations in the values of single-bin marginal probabilities have a major impact on the fluctuations of entropy values (Schultz & Panzeri, 2001). These fluctuations are reflected with the same sign in both H_0−sh(R|S) and H(R|S). Thus, the correlation between the entropies ensures that the variances of I_sh and I(R;S) remain comparable. This is clearly demonstrated in Figure 3B (quadratic extrapolation correction procedure used). To better understand the importance of the correlation of H(R|S) and H_0−sh(R|S) across different realizations, we paired values of H(R|S) and H_0−sh(R|S) taken from randomly chosen realizations of the simulation. The scatter plot of these randomly paired entropies is reported in Figure 3C. Computing the information I_sh from the randomly paired entropies H(R|S) and H_0−sh(R|S) leads to a considerable increase in the variance of I_sh (results reported in Figure 3D). Thus, we conclude that the reduction of bias obtained by estimating information with I_sh rather than directly with I does not come at the expense of increased variance, because the computation of I_sh involves a cancellation of fluctuations between similarly biased terms. I_sh is a very efficient estimator of information in terms of both bias and variance.

4.1.4 Performance of Shuffling-Based Estimators in the Presence of Strong Correlation

An important question is how the bias properties of H(R|S), H_0−sh(R|S), and I_sh are affected by the strength of correlations between spikes. As discussed above, when the responses are shuffled, the new number of relevant responses, R̃_0−sh(s), will be equal to or larger than the original R̃(s). Thus, because of equation 4.2, if correlations are not strong enough to induce radical differences of shape and support between P(r|s) and P_ind(r|s), ΔI_0−sh is expected to have a downward bias that is much smaller in absolute magnitude than the upward bias of ΔI_0. The previously shown simulations (based on a correlation structure taken from a real cortical neuron) confirmed this expectation and suggest that ΔI_0−sh and I_sh are only very mildly biased in realistic neurophysiological conditions. However, this argument also suggests that the magnitude of the downward bias may be affected by the overall strength of correlation. High correlation tends to favor certain patterns in the response and thus decrease the number of possible responses. The stronger the correlations, the larger the number of new response words that may appear after shuffling, and the larger the resulting downward bias. If correlations are strong enough to make the number of shuffled relevant responses R̃_0−sh(s) much larger
[Figure 3 appears here: panels A and C show scatter plots of H_0−sh(R|S) versus H(R|S); panels B and D show the standard deviation of the I and I_sh estimators against log2(trials).]

Figure 3: Variance of shuffling estimators. The plug-in estimates of the probability functionals were computed (without any bias corrections) from exactly the same simulated neural data as in Figure 1, again with L = 10. (A) Scatter plot of H_0−sh(R|S) versus H(R|S) computed from 100 random realizations of the simulated spike trains, each realization of the simulation consisting of 128 simulated trials per stimulus condition. Each point in the scatter plot represents a data pair taken from the same realization of the numerical simulation. (B) The variance (across realizations of the simulation) of the estimators of I and I_sh plotted as a function of the number of trials per stimulus condition. In this case, I_sh was estimated using values of H_0−sh(R|S) from the same realization of the simulation. (C) Scatter plot of H_0−sh(R|S) versus H(R|S) computed from 100 random realizations of the simulated spike trains, each realization being made of 128 trials per stimulus condition. Now each point in the scatter plot represents a data pair taken from different randomly paired realizations of the numerical simulation. (D) The variance (across realizations of the simulation) of the estimators of I and I_sh plotted as a function of the number of trials per stimulus condition. Now I_sh was estimated using values of H_0−sh(R|S) from different randomly paired realizations of the numerical simulation.
than the original R̃(s), then ΔI_0−sh may become strongly downward biased. It is thus important to assess numerically the dependence of the sampling properties of ΔI_0−sh on the correlation strength. To address this
issue, we next simulated spike trains as a first-order (q = 1) Markov process with a firing rate ρ for each stimulus and a given coefficient c quantifying the Pearson correlation of spikes across adjacent time bins (both parameters were taken as constant over time). Reference values for the parameters ρ_0 and c_0 were, as in the previous simulations, measured (as an average over a poststimulus time window of 20-40 ms after stimulus onset) from the experimental cortical responses recorded in response to 49 different whisker vibration stimuli by Arabzadeh et al. (2004). In the simulations, we then took the same rate ρ_0 as the real data and produced a simulated correlation strength c that was f times stronger than the real one: c = f c_0.

Figure 4 reports how the biases of H(R|S) and of H_0−sh(R|S) depend on the correlation strength (no bias correction procedure was applied). Figure 4A shows the results obtained for f = 0 (i.e., absence of correlations). In this case, the number of possible responses is, on average, unchanged by shuffling. Thus, H(R|S) and H_0−sh(R|S) have exactly the same bias, and I_sh is unbiased. For f = 1 (i.e., the real neuronal correlation value, shown in Figure 4B), the bias of H_0−sh(R|S) is still very close to that of H(R|S). For correlations four times stronger than those of the real neuron (see Figure 4C), there is some difference between the biases of H(R|S) and of H_0−sh(R|S). However, with 64 trials per stimulus or more, the biases of the two entropies cancel out and I_sh is unbiased. H_0−sh(R|S) becomes considerably more biased downward than H(R|S) only for correlations eight times stronger than those of the real neuron (see Figure 4D). Thus, we conclude that although their sampling properties will worsen as the amount of correlation in the data grows, the shuffled estimators ΔI_0−sh and I_sh can be very useful for the analysis of a wide range of neurophysiological experiments.

4.2 The Markov Model Decoder

The above section dealt with the use of the simplest decoding model, P_0(r|s). How do the sampling properties of I_LB−q and ΔI_q change when using more detailed decoding models such as P_q(r|s) with q > 0? To address this issue, we consider the properties of χ_q(R) and H_q(R|S) with q > 0. As q increases, χ_q(R) is expected to become more biased because it depends on P_q(r) logarithmically. The bias of H_q(R|S) is expected to increase even more significantly with q. In fact, it can be decomposed into a sum of stimulus-conditional entropies of the marginal probability distributions of up to q + 1 consecutive time bins (Pola et al., 2003). For this reason, although the bias of H_q(R|S) is smaller than that of H(R|S), it grows for larger q values. The bias of I_LB−q is given by the difference between the biases of χ_q(R) and H_q(R|S). Thus, as q increases, I_LB−q will be more biased. These expectations are confirmed by the simulations shown in Figure 5.
Figure 4: Effect of correlations on the performance of the shuffling estimators. We simulated spike trains as a first-order (q = 1) Markov process with constant rate ρ and a given Pearson correlation coefficient c. The value of ρ was taken from experimental data (Arabzadeh et al., 2004) as an average over a poststimulus time window of 40 ms. The correlation coefficient c was then adjusted to produce different correlation strengths commensurate with the correlation value found in the real data, c0. The simulated correlation strength was related to the real one by a multiplicative factor f : c = f c0. (A–D) Plots of the bias of H(R|S) and H0−sh(R|S) as a function of the number of trials per stimulus, for different values of the correlation strength f. No bias correction was used to compute these entropies.
A quadratic extrapolation procedure was used to correct for the bias. ILB−q, Iq, and Iq−sh were computed for q = 1, 3, and 6 and plotted in Figures 5A, 5C, and 5E, respectively. Figure 5 confirms the intuition that as the value of q increases, ILB−q becomes more biased. The asymptotic value of ILB−1 can be computed well even with as few as 32 trials (see Figure 5A). However, since the actual order of the Markov model used to generate the data was q = 3, the asymptotic value of ILB−1 is only 0.69 bits, which is less than the true value of the full mutual information carried by the spike train (0.82 bits). This difference reflects the fact that some information would be lost by a decoder neglecting correlations of range longer than one bin.
Figure 5: Bias and standard deviation of estimators making use of simplified Markov models of different orders q. We computed ILB−q, Iq, and Iq−sh for q = 1, 3, and 6 as a function of the number of trials per stimulus. (A, C, E) Report of the average of these quantities over a number of repetitions of the simulation. (B, D, F) Report of the standard deviation across simulations. The simulated data were obtained exactly as in Figure 1, again considering a poststimulus window of 50 ms discretized into L = 10 bins of size Δt = 5 ms. It is worth noticing that the underlying Markov process used to generate the simulated data was of order 3.
On the contrary, ILB−3 and ILB−6 can reach the correct asymptotic value of information (see Figures 5C and 5E). However, they are considerably more biased than ILB−1. Figures 5A, 5C, and 5E show that the bias of Iq has a completely different dependence on q: the higher q, the smaller the bias of Iq. This is because as q increases, the difference H(R|S) − Hq(R|S) becomes smaller, and thus the overall bias of Iq decreases. As in the q = 0 case considered in the previous section, the bias in Iq can be further reduced by using a shuffling procedure. In fact, it is easy to construct a "q-shuffling" procedure that generates simulated responses rq−sh preserving all marginals and correlations of up to q + 1 consecutive time bins, but destroying all the
higher length correlations.9 As in equation 4.3, we can construct from the probabilities of the q-shuffled responses rq−sh a "shuffled noise entropy of order q," Hq−sh(R|S), which converges to the same asymptotic value as Hq(R|S) for large numbers of trials but has a larger bias for small trial numbers. The introduction of Hq−sh(R|S) allows us to define, in analogy to equation 4.4, a shuffled estimator Iq−sh, as follows:

Iq−sh = Hq−sh(R|S) − H(R|S) + H(R) − χq(R).   (4.6)
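As a concrete illustration, equation 4.6 can be evaluated from plug-in (maximum likelihood) entropy estimates along the following lines. This is a minimal sketch: χq(R) is passed in as a function, since its definition is given earlier in the letter; the q-shuffling routine is the one sketched after footnote 9 below; and equiprobable stimuli are assumed, as in the simulations.

```python
import numpy as np

def plugin_entropy(words):
    """Plug-in (maximum likelihood) entropy, in bits, of an array of
    response words (one word per row); no bias correction is applied."""
    _, counts = np.unique(words, axis=0, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def noise_entropy(resp_by_stim):
    """H(R|S): conditional entropy averaged over (equiprobable) stimuli.
    resp_by_stim is a list with one (trials x L) array per stimulus."""
    return np.mean([plugin_entropy(r) for r in resp_by_stim])

def I_q_sh(resp_by_stim, q, chi_q, q_shuffle):
    """Shuffled estimator of equation 4.6; chi_q(resp_by_stim, q) and
    q_shuffle(resp, q) are assumed to be supplied."""
    H_R = plugin_entropy(np.vstack(resp_by_stim))
    H_RgS = noise_entropy(resp_by_stim)
    H_qsh = noise_entropy([q_shuffle(r, q) for r in resp_by_stim])
    return H_qsh - H_RgS + H_R - chi_q(resp_by_stim, q)
```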
Iq−sh must be less biased than Iq because there are substantial bias cancellations between Hq−sh(R|S) and H(R|S). This expectation is investigated numerically in Figure 5. Figures 5A, 5C, and 5E show the sampling behavior of Iq−sh for q = 1, 3, and 6, respectively. The bias of Iq−sh has a negative sign, for the same reasons given above for the case of I0−sh. For any q value, the asymptotic value of Iq−sh for large trial numbers is the same as that of Iq, but the two estimators differ significantly for low trial numbers, where Iq−sh is biased downward and is much less biased than Iq. Figures 5B, 5D, and 5F consider the behavior of the variances of the information-theoretic quantities as the value of q increases. The most notable finding is that, like their bias, the variances of both Iq and Iq−sh also decrease with q. Moreover, for the same reasons presented above for q = 0, the variance of Iq−sh is never higher than that of Iq. Thus, the decrease in bias of Iq−sh does not come at the expense of an increase in variance. This stresses the competitiveness of the shuffling procedures for the estimation of Iq.

5 Tighter Upper and Lower Information Bounds Using Model Selection

We introduced information-theoretic quantities Iq that quantify, for any value of the memory length q, the information-theoretic cost of using a simplified decoding model described by a Markov model of order q rather than by the full response probability distribution. Intuition suggests that the shorter the memory needed to decode the neuronal spike trains, the fewer data are needed to compute its information content.
9 The "q-shuffled" responses rq−sh can be constructed as follows. The response in the first time bin is chosen as a random permutation of the first time bins across all trials. Then the response in each bin r(k), for k = 2, . . . , L, is taken randomly without replacement from the subset of trials whose preceding bins match those already assigned to the shuffled trial, that is, r(k − j) = rq−sh(k − j) for j = 1, . . . , m, where m = min(k − 1, q). In other words, q-shuffled bins are concatenated with others taken at random without replacement from the subset of trials that have the same state of up to q past bins. Naturally, as q approaches the total length L of the time window, it could happen that, if data are scarce, the q-shuffling procedure becomes trivial by regenerating exactly the original data set for each random shuffling. Thus, care should be taken to verify that this is not happening with the data under analysis.
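A direct transcription of the procedure in footnote 9, assuming the responses to one stimulus are stored as a trials × L integer array, might read as follows. Because the context counts of the shuffled and original data match by construction, sampling without replacement never exhausts a pool.

```python
import numpy as np

rng = np.random.default_rng(1)

def q_shuffle(resp, q):
    """q-shuffling of the responses to one stimulus (footnote 9):
    preserves all marginals and correlations of up to q+1 consecutive
    bins while destroying longer-range correlations."""
    n, L = resp.shape
    sh = np.empty_like(resp)
    sh[:, 0] = rng.permutation(resp[:, 0])
    for k in range(1, L):
        m = min(k, q)  # the footnote's m = min(k - 1, q), with 1-based k
        # pool the original bin-k values by their m-bin context ...
        pools = {}
        for t in range(n):
            pools.setdefault(tuple(resp[t, k - m:k]), []).append(resp[t, k])
        for vals in pools.values():
            rng.shuffle(vals)
        # ... and draw without replacement for each shuffled trial
        # with a matching context
        for t in range(n):
            sh[t, k] = pools[tuple(sh[t, k - m:k])].pop()
    return sh
```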
However, we have not yet explored the specific advantages offered by knowing that to decode all information, we need to consider only response chains extending over a number w of time bins that is shorter than the L time bins making up the time window used to analyze the spike trains. Suppose that we know that the shortest history depth needed to decode all information is w time bins. This amounts to requiring that Iq = 0 for any q ≥ w, which will happen if P(s|r)/Pq(s|r) is not stimulus modulated for each response and each q ≥ w. This condition will be met if the stimulus-conditional probability of the neuronal responses P(r|s) is generated according to a Markov process of order not higher than w for every stimulus. What further advantages do we gain in computing information quantities when we know the actual value of w? If the shortest Markov order needed to decode all the information in the spike train is equal to w (i.e., Iq = 0 for any q ≥ w), then any ILB−q (with q ≥ w) will be equal to the total mutual information I. Therefore, this knowledge offers a huge advantage when computing I. In fact, in this case, it will be most convenient to compute the true value of I by using ILB−w, which is equal to I in the asymptotic sampling regime but is much less biased than the direct estimate of I if w is much shorter than the window length L.10 The knowledge that the shortest Markov order needed to decode all the information encoded in the spike train is equal to w also offers an advantage in the estimation of the quantities Iq, which may still be larger than zero if q < w. Using the fact that Iw = 0, it is easy to show that all Iq for q < w are in this case equal to a simpler and much less biased quantity, called δIq, defined as follows:11

δIq ≡ Iq − Iw = Hq(R|S) − Hw(R|S) + χw(R) − χq(R)   if q < w.   (4.7)
Since Iw = 0, it is clear that δIq = Iq. However, when estimating these quantities from finite samples, δIq is much less biased than Iq. If w < L − 1, then the term Hq(R|S) − Hw(R|S) (the dominant term for the bias in equation 4.7) will have an upward bias that is smaller than that of the corresponding term in the definition of Iq, namely, Hq(R|S) − H(R|S). The smaller w is with respect to L, the greater the sampling advantage of using δIq. Along the same lines explained above, we can use the q-shuffling procedure to obtain a very weakly downward-biased estimator of δIq.
10 In this case, one could in principle use any other ILB−q (with q > w) to compute the information I. However, it is more efficient to use ILB−w because the bias of ILB−q grows with q (see Figure 5). 11 The quantity δIq should carry a further index w, which we omit for compactness of notation.
By computing the noise-entropy difference Hq(R|S) − Hw(R|S) using the "q-shuffled" distribution, we can define the following quantity:

δIq−sh = Hq−sh(R|S) − Hw−sh(R|S) + χw(R) − χq(R).   (4.8)
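In terms of the routines sketched above, the two bounds of equations 4.7 and 4.8 might be computed as follows. This is a sketch: H_markov (the order-q Markov noise entropy Hq(R|S)) and chi_q are assumed to be supplied, since their decompositions into block entropies are given earlier in the letter (Pola et al., 2003).

```python
def delta_I_q(resp_by_stim, q, w, H_markov, chi_q):
    """Upper bound of equation 4.7 (valid for q < w)."""
    return (H_markov(resp_by_stim, q) - H_markov(resp_by_stim, w)
            + chi_q(resp_by_stim, w) - chi_q(resp_by_stim, q))

def delta_I_q_sh(resp_by_stim, q, w, chi_q):
    """Lower bound of equation 4.8; the shuffled noise entropies are
    computed directly from q-shuffled surrogates, using the q_shuffle
    and noise_entropy routines sketched above."""
    H_qsh = noise_entropy([q_shuffle(r, q) for r in resp_by_stim])
    H_wsh = noise_entropy([q_shuffle(r, w) for r in resp_by_stim])
    return H_qsh - H_wsh + chi_q(resp_by_stim, w) - chi_q(resp_by_stim, q)
```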
In this case, the difference of entropies Hq−sh(R|S) − Hw−sh(R|S) will have only a very small downward bias, and it will be less biased than its counterpart Hq−sh(R|S) − H(R|S) in Iq−sh. Thus, δIq−sh will be a tighter downward-biased estimator than Iq−sh. In summary, knowing the true shortest order w of the Markov model that can decode all the information encoded by the spike train is useful because it allows computing Iq through two tight estimators, δIq and δIq−sh, which bound the information precisely from above and below, respectively, and which are tighter and less biased than the corresponding estimators Iq and Iq−sh obtained in the absence of knowledge about w. In the next two sections we illustrate how the knowledge of w can reduce the sampling bias in a dramatic way. We consider two cases: when we are given some independent knowledge of the value of w, and when we lack this knowledge and have to estimate w from the data.

5.1 Using a Prior Knowledge of the Shortest Markov Order Needed to Decode All Information. There may be situations in which there is precise knowledge about the memory length w needed to decode all the information encoded in the spike train. For example, when analyzing the information properties of some simulated data, we often have theoretical insights into the value of w. Alternatively, for some well-studied neural systems, there may be data available indicating that there is no correlation between spike times exceeding a certain time separation of w time bins. In this case, we are entitled to use straightforwardly the quantities δIq and δIq−sh as bounds to the true value of Iq. In Figure 6A we analyze again the simulated cortical responses to 49 stimuli (generated as in Figures 1 and 2 over L = 10 time bins) and show the behavior of various information estimators (here, a quadratic extrapolation bias correction was used). As usual, I and I0 are strongly biased upward and require at least 2^10 trials per stimulus to give estimates close to the asymptotic values. In contrast, I0−sh, which is also computed without any assumption on the order w of the Markov model underlying the real spike train, gives an accurate estimate with around 256 trials per stimulus. What is the effect of making use of the knowledge that the true order w of the underlying Markov model is equal to 3? This amounts to using δIq and δIq−sh in equations 4.7 and 4.8 with w = 3. Figures 6A and 6B show that δI0 and δI0−sh provide much tighter and more data-robust upper- and lower-bound estimates than Iq and Iq−sh. As expected from the above considerations, δI0 is a much tighter upward-biased estimator than the direct use of I0.
Figure 6: Performance of estimators δI0 and δI0−sh when the Markov order of the underlying model is known in advance. In addition to δI0 and δI0−sh, for comparison we also show the average values of I(R, S), I0, and Ish as a function of the number of trials per stimulus. (A, C) Report of the average of these quantities over a number of repetitions of the simulation. (B, D) Report of the standard deviation across simulations. The simulated data were obtained exactly as in Figure 1, again considering a poststimulus window of 50 ms discretized into L = 10 bins of size Δt = 5 ms. The underlying Markov process used to generate the simulated data was of order 3. (A) Data corrected with the quadratic extrapolation method. (B) Standard deviations of the data shown in A. (C) Estimation based on the method of Nemenman et al. (2004). (D) Standard deviations of the data shown in C.
This is because the knowledge of w = 3 ensures that we compute the functionals in equation 4.7 on a response space with 2^4 different responses, a number much smaller than the 2^10 possible responses characterizing the probability distribution over 10 bins. The downward-biased shuffled estimator δI0−sh is even tighter than I0−sh and provides reliable estimates of information with as few as 64 trials per stimulus. The standard deviation of the estimators δI0 and δI0−sh is considered in Figure 6B. It is apparent that the reduction in bias of δI0 and δI0−sh with respect to I0 and I0−sh is not obtained at the expense of an increase in variance.
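For reference, the quadratic extrapolation correction of Strong et al. (1998) used here can be sketched as follows: plug-in entropies are estimated on the full data set and on random subsamples of one-half and one-quarter of the trials, and a quadratic polynomial in 1/N is fit to extrapolate to infinite data. The number of random subsamples per fraction is an illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(2)

def quadratic_extrapolation(words, entropy_fn, n_sub=10):
    """Estimate the asymptotic entropy by fitting
    H(N) = H_inf + a/N + b/N**2 and returning the intercept H_inf."""
    n = len(words)
    fracs = np.array([1.0, 0.5, 0.25])
    hs = [np.mean([entropy_fn(words[rng.permutation(n)[:int(f * n)]])
                   for _ in range(n_sub)]) for f in fracs]
    coeffs = np.polyfit(1.0 / (fracs * n), hs, 2)  # [b, a, H_inf]
    return coeffs[-1]
```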
In Figure 6C we show the performance of the same estimators obtained with the NSB bias-correction method rather than with the quadratic extrapolation procedure. The comparison with Figure 6A shows that the direct estimation of I and I0 is less biased than that obtained with the quadratic extrapolation. This performance is further improved when using δI0 and δI0−sh. The NSB method is in general effective, and I0−sh and δI0−sh gave a good estimate of the asymptotic value of I0 even down to 128 trials per stimulus. A potential problem with using the NSB method in this context is that, although it performs well, in general it does not preserve the sign of the residual bias. Therefore, for very low trial numbers, it could happen that the downward-biased quantities I0−sh and δI0−sh occasionally become higher than the asymptotic value of I0. Figure 6D compares the standard deviations of the estimates obtained with the NSB method. A comparison with the performance of the quadratic extrapolation suggests that the reduction of bias obtained using the NSB method comes at the expense of a relatively small but appreciable increase in variance and that this increase is higher for the shuffled estimators.

5.2 Selecting the Order of the Decoding Model with a Nonparametric Test. In general, when analyzing real spike trains recorded during a neurophysiological experiment, we do not have precise prior knowledge of the temporal extent of the correlations between spikes. How can we extend the sampling advantages offered by δI0 and δI0−sh to the case in which we do not know the value of w a priori? In this section, we address this problem and show that nonparametric statistical techniques (Efron & Tibshirani, 1993) can be used to compute empirically an effective shortest Markov order w needed to decode all the information encoded by the spike train. The crucial property in the derivation of the expression for δIq was that Iq = 0 for all q ≥ w. This suggests a simple statistical procedure to estimate the value of the memory length w to be inserted in the definition of δIq: w can be defined as the smallest value of the Markov order such that Iq is zero for all q ≥ w. In practice, this is not easy to determine from data, as individual values of Iq that should be zero in the asymptotic large-N limit may be estimated as greater than zero because of statistical fluctuations or because of a residual bias not entirely corrected for (see, e.g., Figure 5). However, whether an individual estimate of Iq is significantly greater than zero can be established easily with a statistical test evaluating the null hypothesis that Iq is zero. Once a particular statistical test has been chosen, the value of w can be determined as follows. We start from the largest possible value of q for which Iq can be greater than zero (i.e., q = L − 2), and we use the statistical test to determine whether Iq is significantly greater than 0.
If this is the case, we cannot rule out correlations of any length, and we set w = L − 1. If instead the null hypothesis that IL−2 is zero cannot be rejected, we repeat the process, decreasing q at each step and testing again whether Iq is significantly different from zero. The parameter q is decreased until we find a value for which Iq is significantly greater than 0. At this point, the procedure is stopped, and we take w as the last value of q for which the null hypothesis could not be rejected. A simple way to test the null hypothesis that Iq = 0 for some q is to use the following nonparametric "bootstrap" test (Efron & Tibshirani, 1993). We first generate a set of bootstrap data sets by means of the q-shuffling procedure described above, destroying all correlations longer than q − 1 bins. We then use these q-shuffled responses to compute Iq. For each random realization of these shuffled responses, the corresponding Iq must be zero. However, because of statistical fluctuations and errors in fully eliminating the bias, the distribution of Iq will peak above zero. We can use this empirically generated distribution to set boundaries for accepting the null hypothesis that the value of Iq computed with the original data set is zero. In our simulations, we found that the distribution of values of Iq computed on the q-shuffled data was well approximated by a gaussian (data not shown). Thus, our statistical criterion for rejection was simply that the actual value of Iq computed with the original data set be higher than two standard deviations above the mean of the bootstrap distribution (the cutoff at two standard deviations was chosen because it gave the best estimates of the information quantities when applied to the simulated data shown below). In Figures 7A and 7B, we show the resulting histograms after applying the bootstrap test to simulated data. The simulations were again the same as in Figures 1 and 2, simulating a somatosensory neuron with underlying memory length w = 3 and a response word of length L = 10. In Figure 7A, we plot the distribution of estimated w values obtained with the bootstrap test when 128 trials per stimulus were available. This distribution is broad but peaks at the correct order of the simulations. As the number of trials per stimulus increases to 256 (see Figure 7B), the test becomes more effective, as more of the estimated values of w are correct. The trend continues as the number of trials increases, and the test almost always reports the correct w for numbers of trials larger than 512 (results not shown). In Figures 8A and 8B, we plot δI0 and δI0−sh, computed from equations 4.7 and 4.8, with w determined for each simulation through the test described above (values were corrected with the quadratic extrapolation method). Figure 8A shows that even when w is not known a priori but must be determined through a statistical test, the lower bound given by δI0−sh reaches the asymptotic value of I with as few as 2^7 trials per stimulus. The upper bound, δI0, converges more slowly toward the asymptote; it nevertheless represents a significant improvement with respect to the direct estimation of Iq.
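The descending search just described can be written compactly as follows (a minimal sketch; I_q denotes an estimator of Iq such as the plug-in estimator discussed above, and q_shuffle is the routine sketched in section 4.2):

```python
import numpy as np

def select_w(resp_by_stim, L, I_q, n_boot=100):
    """Bootstrap selection of the shortest Markov order w needed to
    decode all the information (two-standard-deviation criterion)."""
    w = L - 1
    for q in range(L - 2, -1, -1):
        # null distribution of I_q from q-shuffled surrogate data sets
        null = [I_q([q_shuffle(r, q) for r in resp_by_stim], q)
                for _ in range(n_boot)]
        if I_q(resp_by_stim, q) > np.mean(null) + 2.0 * np.std(null):
            break      # I_q is significantly greater than zero: stop
        w = q          # null hypothesis not rejected: lower w to q
    return w
```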
Figure 7: Bootstrap test to estimate model complexity. We simulated spike trains as in Figure 1 using a Markov process of order 3. The poststimulus time window used in the analysis was 50 ms and was discretized into L = 10 bins with resolution Δt = 5 ms. For each realization (out of a total of 100) of the simulated spike trains, we estimated w using the bootstrap test described in the text. (A, B) Histograms of the distribution of w values obtained in this way for 128 and 256 trials per stimulus, respectively.
δI0−sh and δI0 bound the true value of I with an error of less than 10% for 256 trials per stimulus and L = 10. Thus, δI0−sh and δI0 behave much better than I(R, S), I0, and I0−sh, quantities that do not depend on the determination of the order of the process. In Figure 8C we report the information values estimated using the NSB method. It is apparent from Figure 8C that both δI0−sh and δI0 perform much better than in the quadratic extrapolation case shown in Figure 8A. The estimates bound the exact value of I0 to within less than 5% for as few as 128 trials per stimulus.12 The above results were obtained using simulated data generated by a Markov process of order w = 3. To check that this procedure also works for data generated by higher-order Markov processes, we repeated the analysis on synthetic data simulating a somatosensory neuron with Markov orders ranging from w = 3 to w = 8 and the usual response word of length L = 10. We found results consistent with those of Figures 7 and 8 (data not shown). In brief, the bootstrap test continued to give estimates of w peaked around the correct value (which in these simulations was constructed to be equal to the w value used to generate the data). The distributions of values obtained with the bootstrap test were slightly narrower for simulated processes with higher w values, probably because Iq has a smaller variance for higher q values (see Figure 5). 12 The NSB method performed less well than the quadratic extrapolation only when computing low-dimensional entropies such as H0(R|S) (see appendix A). Thus, a useful practical consideration is that, when pairing the bootstrap test with the NSB correction method, it is safer to use the quadratic extrapolation to compute entropies such as H0(R|S) that appear in equations 4.7 and 4.8 when the statistical test suggests w = 0 as the most likely value.
Figure 8: Performance of estimators δI0 and δI0−sh when the Markov order of the underlying model is selected with the bootstrap test. In addition to δI0 and δI0−sh, for comparison we also show the values of I(R, S), I0, and I0−sh as a function of the number of trials per stimulus. The data correspond to spike trains simulated as in Figure 1 using a Markov process of order w = 3. However, the knowledge of the true Markov order w used to generate the data was not used; instead, w was estimated for each simulation using the bootstrap test described in the text. (A, C) Report of the average of these quantities over a number of repetitions of the simulation. (B, D) Report of the standard deviation across simulations. (A) Data corrected with the quadratic extrapolation method. (B) Standard deviations of the data shown in A. (C) Estimation based on the method of Nemenman et al. (2004). (D) Standard deviations of the data shown in C.
As the number of trials increased, the information quantities δI0 and δI0−sh always converged (from above and below, respectively) to their correct asymptotic value and were less biased than I0 and I0−sh. Obviously, the lower the Markov order used to generate the process, the bigger the bias advantage of δI0 and δI0−sh over I0 and I0−sh. Using a simplified model of the Markov family and selecting the most economical decoding model with a statistical test thus leads to drastic reductions of the bias of the information quantities. However, it is likely that future extensions of this work to include other families of simplified models may lead to even better performance. For example, in the presence of serial correlations between adjacent but long interspike intervals, selecting among a class of suffix tree models would probably perform much better than selecting only among full Markov models, because the depth of history of a suffix tree can be made to depend on when and if the previous spike occurred (Kennel et al., 2005).
6 Analysis of Real Data

In this section we illustrate the application of the methods developed above to two data sets of real neuronal recordings from the whisker representation in somatosensory ("barrel") cortex of rats anesthetized with urethane. The first data set consisted of 21 neural clusters, each recorded from a different electrode in a silicon array made up of a total of 100 electrodes (see Petersen & Diamond, 2000, for further details). Spike times from each electrode were determined by a voltage threshold set to 2.5 times the root mean square voltage. Since it was not possible to sort well-isolated units from each channel, all spikes from the same recording channel were considered together as a single neural cluster. It has been estimated that each cluster captured the spikes of approximately two to five neurons (see Petersen & Diamond, 2000). Neural activity was recorded in response to individual stimulation of one of nine different whiskers (whisker D2 and its eight nearest neighbors); individual whiskers were stimulated near their base by a piezoelectric wafer controlled by a voltage generator. The stimulus was an up-down step function of 80 µm amplitude and 100 ms duration, delivered once per second. Between 200 and 500 trials per stimulus were available for this first data set. The time course of the estimates of the information transmitted about stimulus location by the spike times of a neural cluster in response to instantaneous whisker deflection is reported in Figure 9A. For the information analysis, we considered the response r to be a spike timing code (binarized with temporal resolution Δt = 5 ms) defined in a poststimulus time window that began at 5 ms poststimulus and whose end was gradually increased in 5 ms steps up to 55 ms poststimulus. The 21 clusters in the data set were analyzed separately, and the results were then averaged. The estimation of all the entropy quantities was done using the NSB method (similar conclusions were reached using the quadratic extrapolation; results not shown). The full spike timing mutual information was computed through the estimator Ish and was found to increase smoothly over the entire time range analyzed. Consistent with the fact that under these conditions neurons typically stop firing after 40 to 50 ms, Ish also saturated after 40 to 50 ms. We also computed the temporal evolution of I0 and I0−sh. I0−sh was small over the whole time range and accounted for no more than 8% of the total information Ish. I0 was also small, but unlike I0−sh, it increased markedly for longer time windows. This indicates that I0, unlike I0−sh, suffers sampling problems when the number of bins L is more than 8 to 10.
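The binarization into timing-code words used throughout this section might be implemented along the following lines (a minimal sketch; only the window parameters stated in the text are assumed):

```python
import numpy as np

def binarize(spike_times_s, t0=0.005, n_bins=10, dt=0.005):
    """Convert one trial's spike times (seconds after stimulus onset)
    into an L-bin binary word at 5 ms resolution, for a window starting
    5 ms poststimulus."""
    word = np.zeros(n_bins, dtype=int)
    for t in np.asarray(spike_times_s):
        k = int(np.floor((t - t0) / dt))
        if 0 <= k < n_bins:
            word[k] = 1
    return word
```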
Figure 9: Analysis of rat somatosensory cortex data. We show the total mutual information Ish, I0, and I0−sh. The gray area shows the region delimited by the upper and lower bounds δI0 and δI0−sh. The NSB method was used to correct for finite sampling. (A) Information about stimulus location conveyed by individual neural clusters recorded in rat somatosensory cortex in response to instantaneous whisker deflections (Petersen & Diamond, 2000). Results are reported as the average (±SEM) over 21 different clusters. (B) Information about the amplitude and frequency of sinusoidal whisker vibration conveyed by individual neural clusters recorded in rat somatosensory cortex (Arabzadeh et al., 2004). Results are reported as the average (±SEM) over 24 different clusters.
The fact that I0−sh is a good estimator of the asymptotic value of I0 is confirmed by the computation of the tight upper and lower bounds δI0 and δI0−sh. The latter quantities were computed by selecting the w value required for their computation with the bootstrap test. The mean value of w across channels was 3.7 ± 2.8 (mean ± SEM). In Figure 9A the gray region represents the area enclosed by the upper and lower bounds δI0 and δI0−sh. I0−sh remained within the two bounds over the whole timescale considered. In contrast, I0 overshot the data-robust upper bound δI0 for longer time windows, indicating that it was suffering from sampling problems in that time range. To summarize, this analysis shows that the use of Ish and I0−sh gives precise estimates of the asymptotic values of I and I0 for up to 10 time bins. This is an excellent performance considering that the number of trials per stimulus was only 200 to 500; it goes much beyond what could be achieved, for example, with a standard direct method to estimate I0. The second data set consisted of 24 recordings of neural clusters, again from rat barrel cortex, obtained with the same type of electrodes and anesthesia as above. In this case, the stimulation protocol was different. The set of stimuli consisted of 49 different types of sinusoidal whisker vibrations, each defined by a unique combination of amplitude and frequency of vibration and delivered for 500 ms (see Arabzadeh et al., 2004, for full
details). In this case, 200 trials per stimulus were available. As before, we considered the response r to be a spike timing code (resolution Δt = 5 ms) defined in a poststimulus time window starting at 5 ms poststimulus and gradually increasing in 5 ms steps up to 55 ms poststimulus. The time courses of the estimates of the information transmitted by the spike times of a single neural cluster (averaged over the set of available clusters) about the parameters of whisker vibration are reported in Figure 9B. In contrast to the previous case, the spike timing mutual information (computed as Ish) continued to grow over time, consistent with the fact that neurons show stimulus-evoked activity for several hundred milliseconds in response to these sinusoidal vibrations (Arabzadeh et al., 2003). As in the previous example, both Ish and I0−sh increased smoothly over time (with no sudden upward jump reflecting a potential bias problem). The accuracy of I0−sh in estimating the asymptotic value of I0 was substantiated by the fact that it was tightly bounded by δI0 and δI0−sh (gray area) up to 55 ms. (δI0 and δI0−sh were computed by selecting the w value with the bootstrap test. The estimated mean value of w across channels was 6.2 ± 1.6.) In contrast, I0 overshot the gray area for long windows, indicating that it suffers from an upward bias problem in this time range. A potentially interesting neurophysiological observation is that, unlike in the case of instantaneous whisker deflection, I0−sh is now considerable (29% of the total information Ish at 50 ms poststimulus), and it grows supralinearly over time. This suggests that correlations may be useful in decoding complex whisker vibrations from cortical neuronal activity and is consistent with the approximate analytical prediction of Panzeri and Schultz (2001), which suggests that when neurons keep firing over a sustained period of time, I0 is expected to grow at least quadratically as a function of time.

7 Discussion

Using information theory to probe the neural code at fine time resolutions, for large populations, or over large time windows remains technically difficult, because the analysis of long sequences of spikes requires the collection of large amounts of data to sample the probability of occurrence of each possible spike sequence. Under these conditions, information measures suffer from serious bias problems, which impose severe limitations on the range of timescales and population sizes available for analysis (Panzeri & Treves, 1996; Strong et al., 1998). In this letter, we have developed a new procedure to estimate the information carried by spike trains that drastically alleviates these sampling problems. The starting point of the procedure is the observation that if we break down the mutual information I into a Kullback-Leibler divergence ΔIsimp (which bounds the information lost by a decoder that ignores stimulus-dependent correlations) and the rest (called ILB-simp), then almost all the bias usually comes from ΔIsimp. The second key observation is that the bias of
ΔIsimp (and, as a consequence, that of the total mutual information) can be drastically reduced (and made negative) at no increase in variance by an appropriate shuffling procedure. The third key observation is that the bias of ΔIsimp (and thus that of the mutual information) can be further reduced by the use of a nonparametric test to find the minimal complexity, within a class of models of correlation, that still permits computing ΔIsimp correctly. From a practical point of view, the overall reduction of the bias is useful because it extends the domain of applicability of information theory. To appreciate how this domain is extended, we note that our numerical examples, based on a response space made of 2^10 different responses, showed that our techniques eliminated the bias with as few as 32 trials per stimulus, a number of the order of the square root of the number of different responses. This is a significant advance over the previous requirement that the number of trials be comparable to the number of different responses. It effectively allows one to double the timing precision, the time window length, or the population size that can be analyzed. This advance is timely because large-scale recording techniques are rapidly becoming available and because there are theoretical arguments (reviewed in Averbeck, Latham, & Pouget, 2006) suggesting that the impact of correlations on the neural code may be particularly important for larger populations. The bias-reduction techniques open up the possibility of analyzing populations twice as big as those previously considered in information-theoretic studies of neural codes. Although this is a significant step forward, it is important to recognize that, owing to the "dimensionality curse," even this advance will not ultimately be enough for the direct analysis of very large populations or very long temporal sequences at very fine timescales. Alternative approaches that may work better in these more extreme situations include algorithms that do not rely on binning (Victor & Purpura, 1997; Victor, 2002). The reduction of the bias problem may also help in extending information analysis to the domain of graded brain signals, such as fMRI signals or local field potentials (Logothetis, 2003). These graded signals are more difficult to analyze because, unlike spikes, they cannot simply be converted into a binary word sampled with a certain temporal precision. The techniques presented here may lend themselves to the analysis of LFPs/fMRI by first discretizing the graded signals into a number of different levels, characterizing the correlation structure among the levels, and finally fitting it with a low-dimensional stochastic model. A useful property of the shuffled information estimator presented here is that, besides reducing the overall magnitude of the bias when compared to a direct procedure, it makes the bias negative rather than positive. This downward bias property is useful in practical studies of neural codes because a finding of significant extra information in spike timing obtained with this new method ensures that the additional spike timing information is genuine and not an artifact due to sampling problems.
An important technical step in the reduction of the mutual information bias was the selection, within a predefined class, of the minimal complexity of the simplified model of correlation that still captures the correct value of ΔIsimp. This was achieved in practice by introducing a parametric family of Markov models to approximate the correlation structure of the real data and by using a nonparametric bootstrap statistical test to select the order q of the Markov model that best described the neural response. While we showed that the simple nonparametric test and the simple class of models described here are effective at reducing the data requirements of information calculations, it is important to note that neither of these steps must necessarily be performed exactly in this form. Particularly interesting families of maximum-entropy simplified correlation models are those considered by Amari (2001) and Schneidman, Berry, Segev, and Bialek (2006). The model selection can also be performed in different ways, for example, through log-likelihood-ratio model selection or other types of inference (see, e.g., Cover & Thomas, 1991; Kennel et al., 2005). An important step for future research is to understand which class of models describes neural data more economically and which statistical model selection technique is more powerful under different circumstances. In recent years, there has been a debate on how best to measure the role of correlations in neural coding (see Nirenberg & Latham, 2003; Pola et al., 2003; and Schneidman, Bialek, & Berry, 2003, for different points of view). The measure used in this letter is ΔI, which was proposed by Nirenberg and Latham (2003); it has an interpretation as an upper bound on the information lost by a decoder that neglects correlations. In this letter, ΔI was used to break down the information into mildly biased and strongly biased components and to obtain a more data-robust estimator of the total mutual information through this breakdown. The considerable sampling advantages in the computation of the mutual information obtained in this way would also be available in situations where the computation of ΔI is not itself of interest, either because the only purpose is to quantify the mutual information or because other measures of the importance of correlations are used. In fact, the other measures proposed (Pola et al., 2003; Schneidman et al., 2003) quantify the importance of correlations as differences of certain mutual information quantities; therefore, the bias of each such information quantity could be substantially reduced with the techniques presented here. The bias-reduction procedure presented here is thus also useful for better computing other measures of the importance of correlations for coding. In summary, the combination of the simplified models and the shuffling methods allowed us to extend the range of applicability of information theory to neuronal responses. Usually, computing information accurately with a straight application of the best available bias correction methods requires a sample size comparable to the number of possible different neural responses. By applying the same bias correction techniques to the shuffled estimators after a model selection procedure, it is now possible to estimate
accurately information quantities with amounts of data one order of magnitude smaller.

Appendix A: The NSB Method

In this appendix we sketch some theoretical considerations and present some additional numerical simulations on the performance of the NSB bias-correction method (Nemenman et al., 2004) when entropies are computed from response spaces of different sizes and for different firing rates. While these considerations follow directly from Nemenman et al., they are helpful to understand why this method is not suited for correcting ILB−q at low q values, especially at low firing rates. For simplicity, we focus on correcting the response entropy H(R); similar considerations hold for the stimulus-conditional entropies. The NSB method (Nemenman et al., 2004) is rooted in the Bayesian inference approach to estimating the entropy H(R) (a functional of a generally unknown underlying probability distribution P(r)). The Bayesian approach assumes that there exists some a priori probability density function Ppr(P) on the continuum space of all possible probability distributions over the response space R with R elements. The Bayesian estimate of the entropy after the observation of the data {n(r)} (the number of times each response r is observed experimentally) can be computed as an average of the entropy over all possible hypothetical probability distributions, weighted by their conditional probability given the data. Unless we have some other criterion to select a preferred value in the entropy range [0, log2 R], we would like to have a flat a priori distribution of entropies. However, the choice of the prior has a strong impact on the value of the Bayesian estimate HBayes unless very many data are observed. Nemenman et al. (2004) have addressed this problem by using a mixture of Dirichlet priors, the latter being defined in terms of a parameter β ranging between 0 and ∞. Nemenman et al. have shown that after fixing β (i.e., after choosing the prior within the family), the Bayesian estimate of the entropy is sharply defined and monotonically dependent on the parameter β (naturally, until the number of samples becomes large, in which case the likelihood dominates the estimate). It can be shown that at fixed β, the variance of the entropy estimate before any observation (i.e., when n(r) = 0) scales as 1/R as R grows, and it is thus small compared to the range of possible entropy values [0, log2 R]. Therefore, the goal of constructing a prior on the space of probability distributions that generates a nearly uniform distribution of entropies can be approximately achieved by averaging the Bayesian estimates over the whole one-parameter Dirichlet family of priors labeled by β. While this procedure is designed to work well for large R, it may work less well for small values of R. In fact, if R is small, the variance of the a priori entropy estimate is no longer small with respect to the range of possible entropies. Thus, the result of the integration over β will not be flat
but will typically be smaller near the edges 0 and log2 R than in the central region of possible entropies. Thus, the method is likely to give problems in the estimation of low entropy values for low R values. In particular, the NSB method is likely to give problems when estimating the entropy of processes generated with very low firing rates. In this case, since the probability of observing a spike in each bin is low, the entropy is much nearer to zero than to log2 R; since the NSB prior distribution of entropy values is instead higher in the central part of the interval [0, log2 R], the resulting NSB estimate of the entropy of a low-firing-rate process with low R may strongly overestimate the entropy unless many experimental observations are available. In the latter case, asymptotic bias correction procedures might work better anyway. We tested these considerations by applying the NSB method to a homogeneous Poisson process. The spike times generated by the Poisson process were binned using a bin size of Δt = 5 ms. In Figure 10 we report the performance of the NSB method by comparing it to the quadratic extrapolation procedure and to the uncorrected measure of the entropy. We tested two different values of R and two different firing rates. Figures 10A and 10B compare the estimates of the entropy using R = 2, for a firing rate of 2 Hz in Figure 10A and of 40 Hz in Figure 10B. In both cases the NSB method performs worse than the extrapolation procedure of Strong et al. (1998). Figures 10C and 10D show the estimators applied to Poisson processes with the same firing rates as in Figures 10A and 10B, respectively, using R = 2^6. It is apparent that the NSB method now performs substantially better than the quadratic extrapolation procedure. We consistently found that for higher R values, the NSB method was always more competitive than the quadratic extrapolation (data not shown, but see Figure 2 for the L = 10 case). When the NSB estimator is clearly the best estimator (low N and large R; see Figures 10C and 10D), it is dominated by the prior. Thus, a potential concern is that this superb performance might be specific to the Poisson process used in the simulation, perhaps because this process is a good match to a distribution within the family of Dirichlet priors. It is conceivable that for some class of strongly correlated response distributions there may not be a good match in the Dirichlet family, and thus the superiority of the NSB method in the large-R regime may suffer. To test for this potential problem, we repeated the analysis of Figure 10 using the same correlated processes used to generate the data in Figures 1 and 4 (described in the main text). We found results consistent with those plotted in Figures 10C and 10D. Although these results cannot completely rule out the above concern, they suggest that the NSB method will perform well in the large-R regime on a wide range of processes with realistic neuronal statistics. Since the Markov noise entropies Hq(R|S) of order q can be written as sums of entropies defined over q adjacent time bins, it follows that the NSB method is not suited for correcting ILB−q at low q values, especially at low firing rates.
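The fixed-β building block of the NSB estimator has a closed form: the posterior mean entropy under a symmetric Dirichlet(β) prior (Nemenman et al., 2004). The following sketch computes this quantity; the full NSB estimate, which additionally integrates over β with a weight chosen to flatten the prior distribution of entropies, is omitted here.

```python
import numpy as np
from scipy.special import digamma

def bayes_entropy_fixed_beta(counts, R, beta):
    """Posterior mean entropy (bits) of an R-valued response under a
    symmetric Dirichlet(beta) prior, given the observed counts n(r)
    (unobserved responses enter with n(r) = 0)."""
    n = np.asarray(counts, dtype=float)
    N, kappa = n.sum(), R * beta
    seen = np.sum((n + beta) / (N + kappa) * digamma(n + beta + 1.0))
    unseen = (R - len(n)) * beta / (N + kappa) * digamma(beta + 1.0)
    h_nats = digamma(N + kappa + 1.0) - (seen + unseen)
    return h_nats / np.log(2.0)
```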
Figure 10: Bias in the NSB method. Average and standard error of the bias ratio (defined as the bias divided by the true asymptotic value of the entropy) for the NSB, quadratic extrapolation, and uncorrected estimators of the entropy. The estimators were applied to a homogeneous Poisson process. The generated spike times were binned using a bin size of Δt = 5 ms. The average and standard error were computed over 1000 realizations of the simulations. (A, B) Panels correspond to R = 2; the rates of the homogeneous Poisson process were 2 Hz in A and 40 Hz in B. In both A and B, the quadratic extrapolation outperforms the NSB method. (C, D) Panels correspond to R = 2^6; the rates of the Poisson processes were the same as in A and B, respectively. For larger R, the NSB method performs better than the quadratic extrapolation.
However, the NSB method appears to be an excellent method for correcting quantities involving entropies defined over long response times, such as ILB−q at high q values.

Appendix B: A Link Between Our Assumptions and the Maximum-Entropy Principle

A direct contact between assumption 2 and the maximum-entropy principle can be made as follows. Consider a one-dimensional family of simplified
models Psimp(r|s) that are obtained from the true P(r|s) as a one-dimensional trajectory in probability space parameterized by λ:

Psimp(r|s) = P(r|s) + λu(r|s),   (B.1)

with the "modulator" u satisfying the normalization Σ_r u(r|s) = 0. By computing the zeroes of the derivative of the entropy of Psimp(r|s) with respect to λ, it is easy to show that the entropy of such a Psimp(r|s) is maximized (as a function of λ) when λ is chosen so that
Σ_r u(r|s) log2 (P(r|s) + λu(r|s)) = 0.   (B.2)
Using equation B.1, the maximum entropy condition in equation B.2 can be rewritten as
Σ_r (P(r|s) − Psimp(r|s)) log2 (Psimp(r|s)) = 0,   (B.3)
which is exactly the condition 3.2 required by our assumption 2. Thus, among all simplified models belonging to the family defined in equation B.1, the only one that satisfies our assumption 2 is the model with the highest entropy within the family. This parametric entropy maximization can be related to the construction of the classes of maximum-entropy models developed in section 3.3, for example, by considering the extremization of the entropy with respect to a large number of modulator functions. Thus, assumption 2 is related to a maximum-entropy principle.

Acknowledgments

We are grateful to R. Petersen, M. E. Diamond, and E. Arabzadeh for many useful discussions and for kindly making available to us the example data used in Figure 9. Gianni Pola contributed to the early stages of this work. We are indebted to the anonymous referees for useful insights, particularly on the relation between our work and the maximum-entropy principle. This research was supported by the International Human Frontier Science Program (M.A.M.), Pfizer Global Development (R.S.), Wellcome Trust 066372, and the Royal Society.

References

Abeles, M., Bergman, H., Margalit, E., & Vaadia, E. (1993). Spatio-temporal firing patterns in the frontal cortex of behaving monkeys. J. Neurophysiol., 70, 1629–1638.
Amari, S. (2001). Information geometry on hierarchy of probability distributions. IEEE Trans. Inform. Theory, 47, 1701–1711.
Arabzadeh, E., Panzeri, S., & Diamond, M. E. (2004). Whisker vibration information carried by rat barrel cortex neurons. J. Neurosci., 24(26), 6011–6020.
Arabzadeh, E., Petersen, R. S., & Diamond, M. E. (2003). Encoding of whisker vibration by rat barrel cortex neurons: Implications for texture discrimination. J. Neurosci., 23(27), 9146–9154.
Arabzadeh, E., Zorzin, E., & Diamond, M. E. (2005). Neuronal encoding of texture in the whisker sensory pathway. PLoS Biology, 3(1), 0155–0165.
Averbeck, B. B., Latham, P. E., & Pouget, A. (2006). Neural correlations, population coding and computation. Nature Reviews Neuroscience, 7, 358–366.
Borst, A., & Theunissen, F. E. (1999). Information theory and neural coding. Nature Neuroscience, 2, 947–957.
Brosch, M., Bauer, R., & Eckhorn, R. (1997). Stimulus dependent modulations of correlated high frequency oscillations in cat visual cortex. Cerebral Cortex, 7, 70–76.
Buracas, G. T., Zador, A. M., DeWeese, M. R., & Albright, T. D. (1998). Efficient discrimination of temporal patterns by motion-sensitive neurons in primate visual cortex. Neuron, 20, 959–969.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Dan, Y., Alonso, J.-M., Usrey, W. M., & Reid, R. C. (1998). Coding of visual information by precisely correlated spikes in the lateral geniculate nucleus. Nature Neuroscience, 1, 501–507.
de Ruyter van Steveninck, R., Lewen, G., Strong, S., Koberle, R., & Bialek, W. (1997). Reproducibility and variability in neural spike trains. Science, 275, 1805–1808.
DeWeese, M. R., Wehr, M., & Zador, A. M. (2003). Binary spiking in auditory cortex. J. Neurosci., 23, 7940–7949.
Dimitrov, A. G., & Miller, J. P. (2001). Neural coding and decoding: Communication channels and quantization. Network: Comput. Neural Syst., 12, 441–472.
Efron, B., & Tibshirani, R. (1993). An introduction to the bootstrap. New York: Chapman and Hall.
Gawne, T. J., & Richmond, B. J. (1993). How independent are the messages carried by adjacent inferior temporal cortical neurons? J. Neurosci., 13, 2758–2771.
Golledge, H. D. R., Panzeri, S., Zheng, F., Pola, G., Scannell, J. W., Giannikopoulos, D. V., Mason, R. J., Tovee, M. J., & Young, M. P. (2003). Correlations, feature binding and population coding in primary visual cortex. Neuroreport, 14, 1045–1050.
Gray, C. M., König, P., Engel, A. K., & Singer, W. (1989). Oscillatory responses in cat visual cortex exhibit inter-columnar synchronization which reflects global stimulus properties. Nature, 338, 334–337.
Johnson, D. H., Gruner, C. M., Baggerly, K., & Seshagiri, C. (2001). Information-theoretic analysis of neural coding. J. Comput. Neurosci., 10, 47–69.
Kennel, M. B., Shlens, J., Abarbanel, H. D. I., & Chichilnisky, E. J. (2005). Estimating entropy rates with Bayesian confidence intervals. Neural Comp., 17(7), 1531–1576.
Latham, P. E., & Nirenberg, S. (2005). Synergy, redundancy, and independence in population codes, revisited. J. Neurosci., 25(21), 5195–5206.
Logothetis, N. K. (2003). The underpinnings of the BOLD functional magnetic resonance imaging signal. J. Neurosci., 23, 3963–3971.
London, M., Schreibman, A., Häusser, M., Larkum, M. E., & Segev, I. (2002). The information efficacy of a synapse. Nature Neurosci., 5, 332–340.
Merhav, N., Kaplan, G., Lapidoth, A., & Shamai Shitz, S. (1994). On information rates for mismatched decoders. IEEE Trans. Inform. Theory, 40, 1953–1967.
Miller, G. A. (1955). Note on the bias of information estimates. In H. Quastler (Ed.), Information theory in psychology: Problems and methods (pp. 95–100). New York: Free Press.
Nemenman, I., Bialek, W., & de Ruyter van Steveninck, R. (2004). Entropy and information in neural spike trains: Progress on the sampling problem. Physical Review E, 69(5), 056111.
Nirenberg, S., Carcieri, S. M., Jacobs, A., & Latham, P. E. (2001). Retinal ganglion cells act largely as independent encoders. Nature, 411, 698–701.
Nirenberg, S., & Latham, P. E. (2003). Decoding neuronal spike trains: How important are correlations? Proc. Natl. Acad. Sci. USA, 100, 7348–7353.
Paninski, L. (2003). Estimation of entropy and mutual information. Neural Computation, 15, 1191–1253.
Panzeri, S., Petersen, R. S., Schultz, S. R., Lebedev, M., & Diamond, M. E. (2001). The role of spike timing in the coding of stimulus location in rat somatosensory cortex. Neuron, 29, 769–777.
Panzeri, S., & Schultz, S. (2001). A unified approach to the study of temporal, correlational and rate coding. Neural Computation, 13, 1311–1349.
Panzeri, S., & Treves, A. (1996). Analytical estimates of limited sampling biases in different information measures. Network, 7, 87–107.
Petersen, R. S., & Diamond, M. (2000). Spatio-temporal distribution of whisker-evoked activity in rat somatosensory cortex and the coding of stimulus location. J. Neurosci., 20, 6135–6143.
Pola, G., Petersen, R., Thiele, A., Young, M. P., & Panzeri, S. (2005). Data-robust tight lower bounds to the information carried by spike times of a neural population. Neural Comp., 17, 1962–2005.
Pola, G., Thiele, A., Hoffmann, K.-P., & Panzeri, S. (2003). An exact method to quantify the information transmitted by different mechanisms of correlational coding. Network, 14, 35–60.
Reich, D. S., Mechler, F., Purpura, K. P., & Victor, J. D. (2000). Interspike intervals, receptive fields, and information encoding in primary visual cortex. J. Neurosci., 20, 1964–1974.
Rieke, F., Warland, D., de Ruyter van Steveninck, R. R., & Bialek, W. (1996). Spikes: Exploring the neural code. Cambridge, MA: MIT Press.
Schneidman, E., Berry, M. J., Segev, R., & Bialek, W. (2006). Weak pairwise correlations imply strongly correlated network states in a neural population. Nature, 440, 1007–1012.
Schneidman, E., Bialek, W., & Berry, M. J., II. (2003). Synergy, redundancy, and independence in population codes. J. Neurosci., 23(37), 11539–11553.
Schultz, S., & Panzeri, S. (2001). Temporal correlations and neural spike train entropy. Phys. Rev. Lett., 86, 5823–5826.
Shadlen, M. N., & Movshon, J. A. (1999). Synchrony unbound: A critical evaluation of the temporal binding hypothesis. Neuron, 24, 67–77.
Robust Bounds to Mutual Information
2957
Shannon, C. E. (1948). A mathematical theory of communication. AT&T Bell Labs. Tech. J., 27, 379–423. Strong, S., Koberle, R., de Ruyter van Steveninck, R., & Bialek, W. (1998). Entropy and information in neural spike trains. Physical Review Letters, 80, 197–200. Victor, J. D. (2002). Binless strategies for estimation of information from neuronal data. Physical Review E, 66, 51903–51918. Victor, J. D., & Purpura, K. P. (1997). Metric-space analysis of spike trains: Theory, algorithms, and application. Network, 8, 127–164. von der Malsburg, C. (1999). The what and why of binding: The modeler’s perspective. Neuron, 24, 95–104.
Received May 30, 2006; accepted October 20, 2006.
LETTER
Communicated by Wulfram Gerstner
Spike-Frequency Adapting Neural Ensembles: Beyond Mean Adaptation and Renewal Theories

Eilif Muller
[email protected]
Lars Buesing [email protected]
Johannes Schemmel [email protected]
Karlheinz Meier [email protected] Kirchhoff Institute for Physics, University of Heidelberg, 69120 Heidelberg, Germany
Neural Computation 19, 2958–3010 (2007)
© 2007 Massachusetts Institute of Technology

We propose a Markov process model for spike-frequency adapting neural ensembles that synthesizes existing mean-adaptation approaches, population density methods, and inhomogeneous renewal theory, resulting in a unified and tractable framework that goes beyond renewal and mean-adaptation theories by accounting for correlations between subsequent interspike intervals. A method for efficiently generating inhomogeneous realizations of the proposed Markov process is given, numerical methods for solving the population equation are presented, and an expression for the first-order interspike interval correlation is derived. Further, we show that the full five-dimensional master equation for a conductance-based integrate-and-fire neuron with spike-frequency adaptation and a relative refractory mechanism driven by Poisson spike trains can be reduced to a two-dimensional generalization of the proposed Markov process by an adiabatic elimination of fast variables. For static and dynamic stimulation, negative serial interspike interval correlations and transient population responses, respectively, of Monte Carlo simulations of the full five-dimensional system can be accurately described by the proposed two-dimensional Markov process.

1 Introduction

Spike-frequency adaptation (SFA) refers to the intrinsic property of certain neurons to fire with gradually increasing interspike intervals (ISIs) in response to a steady injection of suprathreshold current. SFA is ubiquitous: It has been observed in many neural systems of diverse species (Fuhrmann, Markram, & Tsodyks, 2002). In the mammalian visual system, for example, the majority of retinal ganglion cells (RGCs) (O'Brien, Isayama, Richardson, & Berson, 2002), geniculate relay neurons (Smith, Cox, Sherman, & Rinzel, 2001), and neocortical and hippocampal regular spiking pyramidal neurons (McCormick, Connors, Lighthall, & Prince, 1985) exhibit SFA.

The in-vitro conditions used to experimentally verify the presence of SFA are far from the operational mode of a typical neuron in a network. Given cortical neuron firing rates and interconnectivity, each neuron there is under intense bombardment by both excitatory and inhibitory synapses. These mutually opposing showers of excitation and inhibition induce highly irregular fluctuations of the membrane potential reminiscent of a random walk. The resulting dominance of the mean synaptic conductances over the leak results in a markedly shortened effective membrane time constant, a dynamical regime known as the high-conductance state (Destexhe, Rudolph, & Paré, 2003; Shelley, McLaughlin, Shapley, & Wielaard, 2002). In this regime, action potentials are emitted when the membrane potential chances across the firing threshold, and the resulting ISIs appear stochastic and are, for adapting neurons, roughly gamma distributed (Softky & Koch, 1993; Destexhe, Rudolph, Fellous, & Sejnowski, 2001; Dayan & Abbott, 2001).

Conductance-based phenomenological models for SFA and related relative refractory mechanisms are standard and given in Dayan and Abbott (2001) and Koch (1999) and were recently generalized in Brette and Gerstner (2005). Benda and Herz (2003) show that a large class of biophysical mechanisms that induce SFA can be reduced to these conductance-based phenomenological models. Similar but current-based adaptation mechanisms have been studied in van Vreeswijk and Hansel (2001), and the related threshold fatigue model for adaptation, also known as dynamic threshold, in Chacron, Pakdaman, and Longtin (2003) and Lindner and Longtin (2003). See Ermentrout, Pascal, and Gutkin (2001) for a bifurcation analysis of I_ahp, the afterhyperpolarization current, a calcium-dependent potassium current, and I_m, the muscarinic slow voltage-dependent potassium current, two biophysical mechanisms behind SFA.

Mean-adaptation approximations for the firing rate of populations of spike-frequency adapting neurons augmenting the standard Wilson and Cowan equations (Wilson & Cowan, 1972) were devised in Latham, Richmond, Nelson, and Nirenberg (2000) and Fuhrmann et al. (2002) and used to study the synchronizing effects of SFA. Universal mean-adaptation methods for modeling the firing rate of adapting neurons subject to suprathreshold noise-free current input are given in Benda and Herz (2003). In La Camera, Rauch, Lüscher, Senn, and Fusi (2004), mean-adaptation methods are investigated to describe the static and dynamic firing rates of a large class of integrate-and-fire neuron models with current-based and dynamic threshold adaptation mechanisms driven by noisy input currents. The phenomenological firing rate relaxation dynamics of previous Wilson and Cowan studies is replaced in La Camera et al. (2004) with a firing rate that depends instantaneously on filtered synaptic currents, as suggested in Fourcaud and Brunel (2002) and Renart, Brunel, and Wang (2004). While
2960
E. Muller, L. Buesing, J. Schemmel, and K. Meier
for the Wilson and Cowan approaches, the relaxation time constant is a free parameter, the approach due to La Camera et al. (2004) has no free parameters, and excellent agreement is reported in the static and dynamic case for several neuron models.

Originally introduced in Knight (1972) and recently the subject of intense study, population density formalisms provide powerful tools to understand neural ensemble and network behavior in a quantitative way (Brunel, 2000; Omurtag, Knight, & Sirovich, 2000; Nykamp & Tranchina, 2000, 2001; Fourcaud & Brunel, 2002; Meffin, Burkitt, & Grayden, 2004; Renart et al., 2004). Such studies are mostly restricted to exactly solvable white noise input cases, with notable exceptions (Nykamp & Tranchina, 2001; Fourcaud & Brunel, 2002). In Fourcaud and Brunel (2002), the key observation is made that colored input noise due to synaptic filtering results in a nonzero probability density near threshold and allows neurons to respond instantaneously to injected currents. This provides the theoretical basis for studies such as La Camera et al. (2004) and will also play an important role in the work here. Conductance-based neurons with finite synaptic time constants are treated in Rudolph and Destexhe (2003a, 2005), Richardson (2004), and Richardson and Gerstner (2005), though only in the subthreshold regime, limiting their applicability for understanding firing rate and network dynamics. The problem with a threshold has yet to be solved exactly; it is, however, treated in Moreno-Bote and Parga (2004, 2005).

For neurons without SFA driven by noisy input, an alternate and fruitful approach is to apply renewal theory, as presented in detail in Gerstner and Kistler (2002). With the defining characteristic of renewal theory being that successive ISIs are statistically independent, these models neglect by definition the observation in Chacron et al. (2003) and Lindner and Longtin (2003) that SFA induces negative serial ISI correlations. While the great majority of excitatory neurons exhibit SFA, there has yet to be a population density treatment accounting for it, given the difficulty of treating the added dimension analytically and numerically.

We present here a study whereby the ensemble behavior of adapting neurons in the high-conductance state can be understood in a quantitative way. We start by considering in section 2 how to go beyond the renewal theory formalism of Gerstner and Kistler (2002) by introducing a dependence between ISIs, resulting in a Markov model described by a master equation. A connection to renewal theory is found by a suitable variable transformation, and expressions for the ISI distribution and conditional ISI distribution are derived. We then consider in section 3 the full five-dimensional master equation of the canonical conductance-based integrate-and-fire neuron model driven by Poisson spike trains augmented by SFA and a relative refractory mechanism of the form given in Dayan and Abbott (2001). By applying an adiabatic elimination of fast relaxing variables (Haken, 1983; Gardiner, 1984), we argue that this five-dimensional master equation can be approximated by a two-dimensional master equation of the same form as the
“beyond renewal theory” Markov model proposed in section 2. In section 4, we determine the generalized hazard function required for the Markov model by fitting to Monte Carlo simulations of the full system, given that the firing rate of the neuron model we employ has yet to be solved exactly. By reasoning as in Fourcaud and Brunel (2002), Renart et al. (2004), and La Camera et al. (2004), we show how the generalized hazard function applies in the dynamic case by accounting for synaptic filtering. In section 5, we provide numerical methods for solving the master equations and generating realizations of the proposed Markov processes. In section 6, predictions for ISI correlations and conditional ISI distributions in the static case, and firing rates in the dynamic case due to the proposed Markov model, are compared to Monte Carlo simulations of the full system. Finally, in section 7, the master equation is employed to analyze the domain of validity of mean-adaptation approaches.

2 Beyond Renewal Theory

Gerstner and Kistler (2002) demonstrate that for spike response models (a generalization of integrate-and-fire neuron models), the statistical ensemble of a single neuron with noise can be described using methods of inhomogeneous renewal theory, as reviewed in appendix C. The basic assumption of inhomogeneous renewal theory is that the state of the modeled system can be described by a single state variable, τ , the time since last renewal, or age of the system, and time t. The limiting probability density for the neuron to spike, or more generally, for the system to renew after surviving a time interval τ ,

\[
\rho(\tau, t) = \lim_{\Delta t \to 0^{+}} \frac{\mathrm{prob}\{> 0 \text{ renewals in } [t, t + \Delta t) \mid \tau\}}{\Delta t},
\tag{2.1}
\]
also known as the hazard function (Cox, 1962), is a function of time, t, and age, τ.¹

¹ For our discussion of renewal processes, we follow the notation of Cox (1962) but use τ to denote age, t to denote time, and ρ instead of h to denote the hazard function, as in appendix C.

Thus, subsequent interspike intervals (ISIs) are by definition independent and uncorrelated. As Gerstner and Kistler (2002, p. 245) stated, “A generalization of the [renewal] population equation to neuron models with [spike-frequency] adaptation is not straightforward since the [renewal] formalism assumes that only the last spike suffices. . . . A full treatment of adaptation would involve a density description in the high-dimensional space of the microscopic neuronal variables [as in] (Knight, 2000).”

In section 3 we provide a full treatment of the density description mentioned above. However, before we proceed, it is instructive to consider what
a model might look like that allows for a dependence between subsequent ISIs. Consider the standard phenomenological model for spike-frequency adaptation (SFA) proposed in Dayan and Abbott (2001) where a given neuron model is augmented with a conductance gs (t) that makes the jump gs (t + dt) = gs (t) + q s when the neuron spikes at time t and is otherwise governed by
\[
\frac{dg_s(t)}{dt} = -\frac{1}{\tau_s}\, g_s(t).
\tag{2.2}
\]
Now consider a neuron that has gs as a state variable and a probability density to fire of the form

\[
h_g(g_s, t) = \lim_{\Delta t \to 0^{+}} \frac{\mathrm{prob}\{> 0 \text{ spikes in } [t, t + \Delta t) \mid g_s\}}{\Delta t},
\tag{2.3}
\]
where gs evolves in time by equation 2.2. This process is analogous to a renewal process, but now with a single state variable, gs , which is not reset at each occurrence of a spike but slowly forgets with a timescale of τs due to equation 2.2. For a model of this form, it is possible for correlations to arise between subsequent ISIs. We refer to both the renewal hazard function, ρ(τ, t), and the h g (gs , t) defined here as hazard functions, as they both represent a probability density of the system to spike. It is straightforward to show that the ensemble of such neurons is governed by a master equation of the form

\[
\frac{\partial}{\partial t} P(g_s, t) = \frac{\partial}{\partial g_s}\left[\frac{g_s}{\tau_s}\, P(g_s, t)\right] + h_g(g_s - q_s, t)\, P(g_s - q_s, t) - h_g(g_s, t)\, P(g_s, t),
\tag{2.4}
\]
where P(gs , t) is the distribution of state variables gs with P(gs < 0, t) ≡ 0. The distribution P(gs , t) is analogous to the distribution of ages, f − (τ, t), of renewal theory, and equation 2.4 is analogous to the renewal theory equation C.7, both given in appendix C. The model defined by equation 2.4 is referred to as the 1D Markov (1DM) model throughout the text. (See Table 1 for an overview of the models considered in the text.)
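As a concrete picture of the process defined by equations 2.2 to 2.4, the following minimal Python sketch simulates a single neuron of this type by direct small-time-step integration. The exponential hazard shape anticipates equation 4.2 below, and all parameter values are illustrative assumptions, not fitted values:

    import numpy as np

    q_s, tau_s = 0.01, 0.1     # quantal increase (uS) and time constant (s); assumed
    a, b = 20.0, 50.0          # assumed parameters of an exponential hazard, cf. equation 4.2

    def simulate_1dm(T=100.0, dt=1e-4, rng=np.random.default_rng(0)):
        """Direct simulation of the 1DM process: decay of g_s (equation 2.2),
        jumps g_s -> g_s + q_s at spikes, spiking at rate h_g(g_s) (equation 2.3)."""
        g_s, spikes = 0.0, []
        for k in range(int(T / dt)):
            h = a * np.exp(-b * g_s)          # hazard h_g(g_s)
            if rng.uniform() < h * dt:        # spike in [t, t + dt)
                g_s += q_s
                spikes.append(k * dt)
            g_s -= dt * g_s / tau_s           # exponential decay, equation 2.2
        return np.array(spikes)

An exact alternative to this small-dt scheme, the thinning method, is given in section 5.3.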
Understanding the connection of the 1DM model to its renewal theory cousin is facilitated by transforming gs to a pseudo-age variable ts with dts /dt = 1 by²

\[
t_s = \eta(g_s) := -\tau_s \log\left(g_s / q_s\right).
\tag{2.5}
\]
The hazard function h g (gs , t) becomes h(ts , t) = h g (η−1 (ts ), t), a hazard function as in equation 2.1 of the pseudovariable ts but defined also for ts < 0. The distribution of states P(gs , t) becomes P(ts , t), where they are related by

\[
P(t_s, t) = P\left(g_s = \eta^{-1}(t_s),\, t\right) \frac{d}{dt_s}\,\eta^{-1}(t_s).
\tag{2.6}
\]
The reset condition is not ts → 0 as for a renewal process, but ts → η(gs + q s ), where the right-hand side can be expressed in terms of ts using the relation gs = η−1 (ts ). Defining the reset mapping, ψ(ts ), such that the reset condition becomes ts → ψ(ts ), it follows that

\[
\psi(t_s) = \eta\left(\eta^{-1}(t_s) + q_s\right) = -\tau_s \log\left(\exp\left(\frac{-t_s}{\tau_s}\right) + 1\right),
\tag{2.7}
\]

with its inverse given by

\[
\psi^{-1}(t_s) = -\tau_s \log\left(\exp\left(\frac{-t_s}{\tau_s}\right) - 1\right),
\tag{2.8}
\]
whereby ψ(ψ −1 (t)) = t and ψ −1 (ψ(t)) = t as required by the definition of the inverse. The variable ts is then a general state variable that no longer represents the time since the last spike, as in renewal theory. Since ψ : R → R− , it follows that all trajectories are reinserted at negative pseudo-ages, and it can be seen from the form of ψ that “younger” spiking trajectories are reinserted at more negative pseudo-ages. This dependence of the reinserted state on the state just prior to spiking yields a Markov process (Risken, 1996), which cannot be described by renewal theory.
2 We follow the convention throughout the text of using positional arguments for functions and labeled arguments for derivatives. Probability distributions are excepted from this rule, as they are not functions but densities. The notation “:=” denotes definition of a function and its positional arguments.
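The mappings η, ψ, and ψ−1 are simple to implement and check numerically; a short sketch with assumed parameter values:

    import numpy as np

    tau_s, q_s = 0.1, 0.01   # assumed values

    def eta(g_s):
        """Pseudo-age transformation, equation 2.5."""
        return -tau_s * np.log(g_s / q_s)

    def psi(t_s):
        """Reset mapping, equation 2.7: maps any pseudo-age to a negative one."""
        return -tau_s * np.log(np.exp(-t_s / tau_s) + 1.0)

    def psi_inv(t_s):
        """Inverse reset mapping, equation 2.8 (defined for t_s < 0)."""
        return -tau_s * np.log(np.exp(-t_s / tau_s) - 1.0)

    t = np.linspace(-0.5, 0.5, 11)
    assert np.allclose(psi_inv(psi(t)), t)   # psi^-1(psi(t)) = t, as required
    assert np.all(psi(t) < 0)                # reinsertion at negative pseudo-ages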
The master equation in terms of ts takes the form

\[
\frac{\partial}{\partial t} P(t_s, t) = -\frac{\partial}{\partial t_s} P(t_s, t) +
\begin{cases}
-h(t_s, t)\, P(t_s, t), & t_s \geq 0 \\
h(\psi^{-1}(t_s), t)\, P(\psi^{-1}(t_s), t) - h(t_s, t)\, P(t_s, t), & t_s < 0,
\end{cases}
\tag{2.9}
\]

revealing the advantage of the variable transformation gs → ts : The deterministic drift term in equation 2.4 for the exponential decay of gs is transformed to a constant drift term in ts analogous to age in renewal theory. As a result, much can be calculated by analogy to renewal theory, and we are freed from the difficulty of treating the nonconstant drift toward zero in equation 2.4 numerically. We will see in later sections that h(ts , t) is in practice approximately of the form

\[
h(t_s, t) = a(t) \exp\left(-b(t)\, q_s \exp\left(-t_s / \tau_s\right)\right)
\tag{2.10}
\]
when modeling spike-frequency adapting neurons in the high-conductance state, where a (t) and b(t) are determined by the stimulus.

For the static case where h(ts , t) ≡ h(ts ), P(ts ) can be found from equation 2.9 by setting ∂ P(ts , t)/∂t = 0. The resulting equation for ts ≥ 0,

\[
\frac{\partial}{\partial t_s} P(t_s) = -h(t_s)\, P(t_s),
\tag{2.11}
\]
is exactly as for a renewal process. The solution is the homogeneous survival function,

\[
P(t_s) = k\, W(t_s, 0),
\tag{2.12}
\]

where

\[
k^{-1} = \int_{-\infty}^{\infty} W(t_s, 0)\, dt_s
\tag{2.13}
\]

is a constant of normalization, and the survival function,

\[
W(t, t_s^0) = \exp\left(-\int_0^{t} h(t_s^0 + s)\, ds\right),
\tag{2.14}
\]
and analogously the inhomogeneous survival function,

\[
W(\tau, t_s^0, t) = \exp\left(-\int_0^{\tau} h(t_s^0 + s,\; t + s)\, ds\right),
\tag{2.15}
\]

represent the probability that a system with initial state ts0 ∈ R will survive for a time t, and for a time τ after t, respectively, and are analogous to the survival function of renewal theory as discussed in appendix C, except for the explicit dependence on the initial state ts0. For ts < 0, we solve P(ts ) numerically by discretizing and integrating back from ts = 0.

The distribution of pseudo-ages just prior to spiking at t, P ∗ (ts , t), is related to P(ts , t) by

\[
P^{*}(t_s, t) = \frac{h(t_s, t)\, P(t_s, t)}{\alpha(t)},
\tag{2.16}
\]

where

\[
\alpha(t) = \int_{-\infty}^{\infty} h(t_s, t)\, P(t_s, t)\, dt_s
\tag{2.17}
\]

is a normalizing constant and also the firing rate of the ensemble. The distribution of pseudo-ages just after spiking at t, P † (ts , t), is related to P ∗ (ts , t) by transforming variables by the reset mapping (see equation 2.7) for a probability distribution:

\[
P^{\dagger}(t_s, t) = P^{*}\left(\psi^{-1}(t_s), t\right) \frac{d}{dt_s}\,\psi^{-1}(t_s).
\tag{2.18}
\]
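In the static case, equations 2.12 to 2.17 can be evaluated on a grid; the following sketch computes the equilibrium distribution and firing rate for ts ≥ 0 (the hazard parameters are assumptions, and the ts < 0 branch, solved by back-integration in the text, is omitted for brevity):

    import numpy as np

    a, b, q_s, tau_s = 20.0, 50.0, 0.01, 0.1   # assumed hazard parameters

    def h(ts):
        """Static hazard of equation 2.10, h(ts) = a*exp(-b*q_s*exp(-ts/tau_s))."""
        return a * np.exp(-b * q_s * np.exp(-ts / tau_s))

    dt = 1e-4
    ts = np.arange(0.0, 2.0, dt)               # grid over ts >= 0 only

    # Survival function W(ts, 0) = exp(-int_0^ts h(s) ds), equation 2.14.
    W = np.exp(-np.cumsum(h(ts)) * dt)

    # Equilibrium distribution P(ts) = k W(ts, 0), equations 2.12 and 2.13.
    P = W / (np.sum(W) * dt)

    # Firing rate (equation 2.17) and pre-spike distribution (equation 2.16).
    alpha = np.sum(h(ts) * P) * dt
    P_star = h(ts) * P / alpha
    print(f"equilibrium firing rate alpha = {alpha:.2f} Hz")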
2.1 Computing Renewal Quantities. The various quantities of renewal theory such as the ISI distribution, hazard function, and survival function are of interest and are straightforward to calculate for the proposed Markov process. First, the renewal survival function, F(τ, t), the probability that a system that spiked at t will survive the time interval τ , is given by

\[
F(\tau, t) = \int_{-\infty}^{\infty} W(\tau, t_s, t)\, P^{\dagger}(t_s, t)\, dt_s.
\tag{2.19}
\]
The ISI distribution, f (τ, t), the probability that a neuron that spiked at t will survive for an interval τ and subsequently spike at t + τ , is

\[
f(\tau, t) = \int_{-\infty}^{\infty} h(t_s + \tau,\, t + \tau)\, W(\tau, t_s, t)\, P^{\dagger}(t_s, t)\, dt_s.
\tag{2.20}
\]
Equivalently, in terms of P ∗ (ts , t),

\[
f(\tau, t) = \int_{-\infty}^{\infty} h(\psi(t_s) + \tau,\, t + \tau)\, W(\tau, \psi(t_s), t)\, P^{*}(t_s, t)\, dt_s.
\tag{2.21}
\]
The hazard function of the system in a renewal sense, ρ(τ, t), where τ is a true age, is by definition the firing rate of the subpopulation that previously spiked at time t − τ . Thus,

\[
\rho(\tau, t) = \int_{-\infty}^{\infty} h(t_s, t)\, P(t_s, t \mid \text{spike at } t - \tau)\, dt_s,
\tag{2.22}
\]
where the state distribution of the system given a spike at t − τ , P(ts , t| spike at t − τ ), can be determined by reasoning that it is the distribution of states just after spiking with arguments ts − τ and t − τ , P † (ts − τ, t − τ ), which subsequently survive the interval τ :

\[
P(t_s, t \mid \text{spike at } t - \tau) = k_1\, W(\tau, t_s - \tau, t - \tau)\, P^{\dagger}(t_s - \tau, t - \tau),
\tag{2.23}
\]

where k1 is the normalization factor,

\[
k_1^{-1} = \int_{-\infty}^{\infty} W(\tau, t_s - \tau, t - \tau)\, P^{\dagger}(t_s - \tau, t - \tau)\, dt_s,
\tag{2.24}
\]
and by inspection of equation 2.19,

\[
k_1^{-1} = F(\tau, t - \tau),
\tag{2.25}
\]

such that

\[
\rho(\tau, t) = \frac{1}{F(\tau, t - \tau)} \int_{-\infty}^{\infty} h(t_s, t)\, W(\tau, t_s - \tau, t - \tau)\, P^{\dagger}(t_s - \tau, t - \tau)\, dt_s.
\tag{2.26}
\]

Clearly, the numerator is just f (τ, t − τ ), resulting in

\[
\rho(\tau, t) = \frac{f(\tau, t - \tau)}{F(\tau, t - \tau)}.
\tag{2.27}
\]
This verifies that the standard renewal theory relation that f (τ ) = ρ(τ )F(τ ), generalized for the inhomogeneous case, still holds even though the underlying stochastic process is not a renewal process. It is interesting to note that in the inhomogeneous case, there is an alternate definition for the ISI distribution that is equally sensible: define f̂ (τ, t) as the probability that
a neuron that spiked at t − τ will survive the interval τ and subsequently spike at t. This is the ISI distribution that treats the spike at t as the final spike of the ISI rather than the initial spike as in equation 2.21. If one prefers this alternate definition of the ISI distribution, as in Gerstner and Kistler (2002), then one has

\[
\hat{f}(\tau, t) = \int_{-\infty}^{\infty} h(t_s + \tau, t)\, W(\tau, t_s, t - \tau)\, P^{\dagger}(t_s, t - \tau)\, dt_s,
\tag{2.28}
\]

implying that f̂ (τ, t) = f (τ, t − τ ), and equation 2.27 becomes

\[
\rho(\tau, t) = \frac{\hat{f}(\tau, t)}{F(\tau, t - \tau)}.
\tag{2.29}
\]
2.2 Correlations. In this section, an expression for the joint serial ISI distribution, f (τi+1 , τi , t), will be derived for the proposed Markov process and shown to exhibit ISI correlations. Recall the definition of the absence of correlations between two random variables: τi and τi+1 are uncorrelated (independent) if and only if

\[
f(\tau_{i+1}, \tau_i) = f(\tau_{i+1})\, f(\tau_i),
\tag{2.30}
\]
where f (τi+1 , τi ) is the joint probability distribution of two back-to-back ISIs in the homogeneous case. For the inhomogeneous case, a separation of this joint distribution f (τi+1 , τi , t) by Bayes’ theorem,

\[
f(\tau_{i+1}, \tau_i, t) = f(\tau_{i+1}, t \mid \tau_i)\, f(\tau_i, t - \tau_i),
\tag{2.31}
\]
reveals a subtlety: The time argument of f (τi , t), the marginal distribution of τi , must be retarded by τi . This is due to the fact that for τi to precede τi+1 at t, it must occur at t − τi . Given that f (τ, t) is known, it is left to determine an expression for f (τi+1 , t|τi ). This can be achieved using equation 2.21 by replacing P ∗ (ts , t) with the conditional distribution of states just prior to spiking given a spike at t − τi , denoted by P ∗ (ts , t|τi ), which takes the form

\[
P^{*}(t_s, t \mid \tau_i) = k_2\, h(t_s, t)\, P(t_s, t \mid \text{spike at } t - \tau_i),
\tag{2.32}
\]
where k2 is a normalization factor, and an expression for P(ts , t| spike at t − τi ) was given in equation 2.23. By inspection of equation 2.22, it can be seen that k2−1 = ρ(τi , t). This results in

\[
P^{*}(t_s, t \mid \tau_i) = \frac{h(t_s, t)\, W(\tau_i, t_s - \tau_i, t - \tau_i)\, P^{\dagger}(t_s - \tau_i, t - \tau_i)}{f(\tau_i, t - \tau_i)},
\tag{2.33}
\]
where the denominator, ρ(τi , t)F(τi , t − τi ), was replaced by f (τi , t − τi ) using equation 2.27. Plugging this expression for P ∗ (ts , t|τi ) into equation 2.21 yields

\[
f(\tau_{i+1}, \tau_i, t) = f(\tau_{i+1}, t \mid \tau_i)\, f(\tau_i, t - \tau_i)
= \int_{-\infty}^{\infty} h(\psi(t_s) + \tau_{i+1},\, t + \tau_{i+1})\, W(\tau_{i+1}, \psi(t_s), t)\, h(t_s, t)\, W(\tau_i, t_s - \tau_i, t - \tau_i)\, P^{\dagger}(t_s - \tau_i, t - \tau_i)\, dt_s,
\tag{2.34}
\]
an inhomogeneous expression for the joint ISI distribution of two successive ISIs.

It is instructive to verify that for the case of a renewal process, equation 2.34 predicts no correlations. For a renewal process, ψ(ts ) = 0 and P † (ts , t) = δ(ts ), such that equation 2.34 becomes

\[
f(\tau_{i+1}, \tau_i, t) = h(\tau_{i+1},\, t + \tau_{i+1})\, W(\tau_{i+1}, 0, t) \cdot h(\tau_i, t)\, W(\tau_i, 0, t - \tau_i).
\tag{2.35}
\]
In addition, the ISI distribution given by equation 2.20 reduces to

\[
f(\tau, t) = h(\tau,\, t + \tau)\, W(\tau, 0, t).
\tag{2.36}
\]
Thus, it can be seen by inspection that equation 2.35 is of the form

\[
f(\tau_{i+1}, \tau_i, t) = f(\tau_{i+1}, t)\, f(\tau_i, t - \tau_i),
\tag{2.37}
\]
implying as expected that successive ISIs are uncorrelated for a renewal process.

3 Connection to a Detailed Neuron Model

In this section we show that the full five-dimensional master equation for the canonical conductance-based integrate-and-fire neuron model driven by Poisson spike trains, augmented by mechanisms for SFA and a relative refractory period, can be reduced to a two-dimensional generalization of the 1DM model by an adiabatic elimination of fast variables.
3.1 Neuron Model, Adaptation, Input. Following Rudolph and Destexhe (2003a, 2005), Richardson (2004), and Richardson and Gerstner (2005), we consider the equations for the membrane potential, v(t), and excitatory and inhibitory synaptic conductances, ge (t) and gi (t), of the conductance-based integrate-and-fire neuron driven by Poisson spike trains:

\[
c_m \frac{dv(t)}{dt} = g_l(E_l - v(t)) + g_e(t)(E_e - v(t)) + g_i(t)(E_i - v(t))
\tag{3.1}
\]
\[
\frac{dg_e(t)}{dt} = -\frac{1}{\tau_e}\, g_e(t) + q_e S_e(t)
\tag{3.2}
\]
\[
\frac{dg_i(t)}{dt} = -\frac{1}{\tau_i}\, g_i(t) + q_i S_i(t),
\tag{3.3}
\]
where c m represents the membrane capacitance, gl the leak conductance, E x the various reversal potentials, q x the quantal conductance increases, and τx the synaptic time constants. The exact parameters used are given in appendix A. The excitatory and inhibitory input spike trains, Sx (t) with x ∈ {e, i}, respectively, are given by

\[
S_x(t) = \sum_k \delta(t - s_{x,k}),
\tag{3.4}
\]
where sx,k are the spike times of a realization of an inhomogeneous Poisson process (Papoulis & Pillai, 1991). Thus, Sx (t) satisfies the constraints

\[
\langle S_x(t) \rangle = \nu_x(t)
\tag{3.5}
\]
\[
\langle S_x(t)\, S_x(t') \rangle = \nu_x(t)\, \nu_x(t') + \nu_x(t')\, \delta(t - t').
\tag{3.6}
\]

Here νx (t) represents the time-varying rate of the inhomogeneous Poisson process, and ⟨·⟩ represents the expectation value over the ensemble of realizations. In what follows, all Poisson processes are assumed inhomogeneous unless otherwise stated. To put the neuron in a state of high conductance, it is bombarded by Ne = 1000 and Ni = 250 excitatory and inhibitory Poisson processes, all with rate functions λe (t) and λi (t), respectively, so that

\[
\nu_x(t) = N_x \lambda_x(t).
\tag{3.7}
\]
A simple thresholding mechanism approximates the action potential dynamics of real neurons: If v(t) exceeds the threshold, vth , v(t) is reset to vreset . Analogous to the input spike train, we can thus define the output
spike train,

\[
A(t) = \sum_k \delta(t - s_k),
\tag{3.8}
\]
where sk are the times of membrane potential threshold crossings enumerated by k.

SFA and a relative refractory period can both be modeled with the addition of a current to equation 3.1 of the form proposed in Dayan and Abbott (2001),

\[
g_y(t)\,(E_y - v(t)),
\tag{3.9}
\]

where E y is a reversal potential. The conductance g y (t) is governed by

\[
\frac{dg_y(t)}{dt} = -\frac{1}{\tau_y}\, g_y(t) + q_y A(t),
\tag{3.10}
\]
where τ y and q y are the time constant and quantal conductance increase of the mechanism. We label SFA and the relative refractory mechanism by the subscripts y = s and y = r , respectively. Defining

\[
\beta_v(v, g_e, g_i, g_s, g_r) := g_l(E_l - v) + \sum_{\mu = e,i,s,r} g_\mu (E_\mu - v)
\tag{3.11}
\]

and, for µ = e, i, s, r ,

\[
\beta_\mu(g_\mu) := -\frac{1}{\tau_\mu}\, g_\mu,
\tag{3.12}
\]

the five-dimensional system of coupled differential equations describing the conductance-based spike-frequency adapting relative refractory integrate-and-fire neuron driven by Poisson spike trains is:

\[
c_m \frac{dv(t)}{dt} = \beta_v(v(t), \ldots, g_r(t)) - (v_{th} - v_{reset})\, A(t)
\tag{3.13}
\]
\[
\frac{dg_x(t)}{dt} = \beta_x(g_x(t), t) + q_x S_x(t)
\tag{3.14}
\]
\[
\frac{dg_y(t)}{dt} = \beta_y(g_y(t), t) + q_y A(t),
\tag{3.15}
\]

where x ∈ {e, i} and y ∈ {s, r }. We refer to equations 3.13 to 3.15 as the full five-dimensional (5DF) model throughout the text (see the model overview in Table 1). The parameters used are given in Table 3.
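For orientation, a minimal Euler-integration sketch of the 5DF model follows. All parameter values here are assumptions standing in for those of appendix A and Table 3; the actual simulations in section 5.2 use the NEST environment:

    import numpy as np

    # Illustrative parameters (assumptions); units: ms, mV, uS, nF.
    c_m, g_l = 0.2, 0.01
    E = {'l': -70.0, 'e': 0.0, 'i': -75.0, 's': -80.0, 'r': -80.0}
    v_th, v_reset = -50.0, -60.0
    tau = {'e': 1.5, 'i': 10.0, 's': 100.0, 'r': 2.0}
    q = {'e': 0.002, 'i': 0.01, 's': 0.005, 'r': 0.05}
    N_rate = {'e': 1000 * 6.5e-3, 'i': 250 * 11.4e-3}   # N_x * lambda_x in 1/ms

    def simulate_5df(T=10_000.0, dt=0.01, rng=np.random.default_rng(0)):
        """Euler integration of the 5DF model, equations 3.13 to 3.15."""
        v, g = E['l'], {'e': 0.0, 'i': 0.0, 's': 0.0, 'r': 0.0}
        spikes = []
        for k in range(int(T / dt)):
            # Poisson input increments (equations 3.2 to 3.4).
            for x in ('e', 'i'):
                g[x] += q[x] * rng.poisson(N_rate[x] * dt)
            # Membrane potential drift, beta_v of equation 3.11.
            I = g_l * (E['l'] - v) + sum(g[m] * (E[m] - v) for m in g)
            v += dt * I / c_m
            # Conductance decay (equations 3.2, 3.3, 3.10).
            for m in g:
                g[m] -= dt * g[m] / tau[m]
            # Threshold crossing: reset v and jump the slow conductances.
            if v >= v_th:
                v = v_reset
                g['s'] += q['s']
                g['r'] += q['r']
                spikes.append(k * dt)
        return np.array(spikes)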
3.2 Ensemble Behavior. It is natural to look for an ensemble description of equations 3.13 to 3.15, given that the input is described in terms of an ensemble. Equations 3.13 to 3.15 are a set of concurrent first-order differential equations; that is, the right-hand sides at time t are functions of the instantaneous values of the state variables, (v(t), ge (t), gi (t), gs (t), gr (t)), implying no delays or memory effects are to be modeled. The system is therefore a Markov process, and given an initial distribution P(v, ge , gi , gs , gr , t0 ) for some t0 , the evolution of P(v, ge , gi , gs , gr , t) can be described by a suitable master equation (Risken, 1996). For the system in question here, the master equation takes the form

\[
\frac{\partial}{\partial t} P(v, g_e, g_i, g_s, g_r, t) = -\operatorname{div} J(v, g_e, g_i, g_s, g_r, t) + \delta(v - v_{reset})\, J_v(v_{th}, g_e, g_i, g_s - q_s, g_r - q_r, t),
\tag{3.16}
\]

where the probability current density, J , is a vector with components

\[
J_v(v, g_e, g_i, g_s, g_r, t) = \beta_v(v, g_e, g_i, g_s, g_r)\, P(v, g_e, g_i, g_s, g_r, t)
\tag{3.17}
\]
\[
J_\mu := \beta_\mu(g_\mu)\, P(v, g_e, g_i, g_s, g_r, t)
\tag{3.18}
\]

with µ ∈ {s, r }. (For J e and J i , see appendix B.) The δ term in equation 3.16 implements the reinsertion of probability flux that crosses the threshold. Furthermore, we define P(v, ge , gi , gs , gr , t) = 0 if one or more of the conductances ge , . . . , gr is negative.

There exists a wealth of literature treating master equations of conductance- and current-based integrate-and-fire neuron models in the absence of adaptation and relative refractory mechanisms (Knight, 1972; Gerstner, 1995; Brunel, 2000; Omurtag et al., 2000; Nykamp & Tranchina, 2000; Knight, Omurtag, & Sirovich, 2000; Gerstner, 2000; Fourcaud & Brunel, 2002; Rudolph & Destexhe, 2003a; Richardson, 2004; Richardson & Gerstner, 2005). The usual approach is to make the so-called diffusion approximation, yielding generally a Fokker-Planck equation for the membrane potential, and perhaps one or two other dimensions treating synaptic conductances. We present here a novel approach applicable for neurons in the high-conductance state whereby the variables v, ge , gi are eliminated by a technique known as an adiabatic elimination of fast variables (Haken, 1983; Gardiner, 1984), and the system is reduced to a master equation for the two-dimensional marginal probability distribution, P(gs , gr , t), of the slow variables, gs and gr . As we will see, the membrane potential, v, and the synaptic conductances, ge and gi , are thus encapsulated in the hazard function, h g (gs , gr , t). We treat here the static input case, λe , λi . The case of dynamic external input, λe (t), λi (t), is treated in section 4.
We follow here the intuitive treatment of adiabatic elimination given in Haken (1983). We begin by integrating P(v, . . . , gr ) over the fast variables v, ge , gi , yielding the marginal distribution for the slow variables gs , gr :

\[
P(g_s, g_r, t) = \int_0^{\infty}\!\!\int_0^{\infty}\!\!\int_{-\infty}^{v_{th}} P(v, g_e, g_i, g_s, g_r, t)\, dv\, dg_e\, dg_i.
\tag{3.19}
\]
Integrating equation 3.16 over v, ge , gi yields

\[
\frac{\partial}{\partial t} P(g_s, g_r, t) = -\sum_{\mu = s,r} \frac{\partial}{\partial g_\mu}\left(\beta_\mu(g_\mu)\, P(g_s, g_r, t)\right)
- \int_0^{\infty}\!\!\int_0^{\infty} \beta_v(v_{th}, g_e, g_i, g_s, g_r)\, P(v_{th}, g_e, g_i, g_s, g_r, t)\, dg_e\, dg_i
+ \int_0^{\infty}\!\!\int_0^{\infty} \beta_v(v_{th}, g_e, g_i, g_s - q_s, g_r - q_r)\, P(v_{th}, g_e, g_i, g_s - q_s, g_r - q_r, t)\, dg_e\, dg_i.
\tag{3.20}
\]
For details of the calculation, see appendix B. Now we separate the marginal distribution for the slow variables from the full distribution by Bayes’ theorem, resulting in

\[
P(v, g_e, g_i, g_s, g_r, t) = P(v, g_e, g_i, t \mid g_s, g_r, t)\, P(g_s, g_r, t),
\tag{3.21}
\]

and make the adiabatic approximation as in Haken (1983) that

\[
P(v, g_e, g_i, t \mid g_s, g_r, t) \approx P^{(g_s, g_r)}(v, g_e, g_i, t),
\tag{3.22}
\]
where P (gs ,gr ) (v, ge , gi , t) is the solution to the three-dimensional master equation for the canonical conductance-based integrate-and-fire neuron with a constant bias current, I (gs , gr ) = gs (E s − v) + gr (Er − v), with neither SFA nor the relative refractory mechanism. This implies we assume that v, ge , gi are immediately at equilibrium given the slow variables, or in other words, the system responds adiabatically to the dynamics of the slow variables gs , gr . The adiabatic assumption ensures the two-dimensional process (gs (t), gr (t)) is a Markov process. Now defining the hazard function,

\[
h_g(g_s, g_r, t) := \int_0^{\infty}\!\!\int_0^{\infty} \beta_v(v_{th}, g_e, g_i, g_s, g_r)\, P^{(g_s, g_r)}(v_{th}, g_e, g_i, t)\, dg_e\, dg_i,
\tag{3.23}
\]
the master equation 3.20 becomes

\[
\frac{\partial}{\partial t} P(g_s, g_r, t) = -\sum_{\mu = s,r} \frac{\partial}{\partial g_\mu}\left(\beta_\mu(g_\mu)\, P(g_s, g_r, t)\right)
- h_g(g_s, g_r, t)\, P(g_s, g_r, t)
+ h_g(g_s - q_s, g_r - q_r, t)\, P(g_s - q_s, g_r - q_r, t).
\tag{3.24}
\]

We refer to the model defined by equation 3.24 as the 2D Markov (2DM) model throughout the text (see the model overview in Table 1). Since no analytical solution is yet known for P (gs ,gr ) (vth , ge , gi , t) in equation 3.23, h g (gs , gr ) was extracted from Monte Carlo simulations of equations 3.13 to 3.15, as will be discussed in section 4.1. Then given a solution to the master equation, P(gs , gr , t), the firing rate of the ensemble, denoted by α(t), is determined by the expectation value of the hazard function h g (gs , gr , t) over P(gs , gr , t):

\[
\alpha(t) = \int_0^{\infty}\!\!\int_0^{\infty} h_g(g_s, g_r, t)\, P(g_s, g_r, t)\, dg_s\, dg_r.
\tag{3.25}
\]
For the adiabatic approximation, the assumption that gs is slow compared to v, ge , gi is easily justified, as the timescale of gs is on the order of 100 ms, while the timescale of v is on the order of 2 ms in the high-conductance state. The timescales of the mean and standard deviation of ge and gi are on the order of τe = 1.5 ms and τi = 10 ms, respectively, while the fluctuations of ge and gi are the source of stochasticity of the system and are on a still shorter timescale. The timescale of gr is significantly faster than gs , though its treatment as a slow variable is also justifiable, but in a somewhat indirect manner. As has been argued in Fourcaud and Brunel (2002) and Renart et al. (2004), for neurons with synaptic time constants comparable to or larger than the effective membrane time constant and driven by sufficient input noise, as is the case here, the firing rate follows the input current almost instantaneously. It is this property that allows the dynamic firing rate to be treated as a function of the time-dependent means and variances of the synaptic conductances in La Camera et al. (2004), a method we follow in section 4. This suggests that such modulations do not push the system far from equilibrium and that the system returns to equilibrium on a timescale faster than that of the synaptic means (τe , τi ). Since over the domain of the gr trajectory for which the integrals on the right-hand side of equation 3.20 are nonzero, gr has a timescale comparable to the mean of the synapses, the argument applies equally to gr . However, since gr is spike triggered, we leave gr in the master equation, while the synaptic variables, ge and gi , determine h g (gs , gr , t) and can be treated outside the master equation formalism.
Methods to undertake a rigorous analysis of the error in the adiabatic approximation are beyond the scope of this letter. What follows are a variety of numerical comparisons to demonstrate the accuracy and domain of applicability of the proposed approximation.

4 Methods

In this section we provide methods for determining appropriate homogeneous and inhomogeneous hazard functions for the 1DM, 2DM, and renewal models. Since no analytical expression for equation 3.23, or for the renewal hazard function of the 5DF model, is yet known, we approach the problem by fitting the homogeneous hazard functions determined by 5DF Monte Carlo simulations in the static case. The inhomogeneous functions are then constructed from the homogeneous ones by discretizing time and taking one homogeneous hazard function for the duration of a single time bin.

4.1 Determining the Static Hazard Function for Markov Models. Given a finite subset of the possible realizations of the Poisson input spike trains, the 5DF model equations, 3.13 to 3.15, can be integrated for each input realization. Any statistical quantity of interest can then be approximated by averaging or histogramming over this finite set of trajectories. This approach is known as the Monte Carlo method. By increasing the number of trials in this finite set of realizations, the statistical quantities determined by the Monte Carlo method converge to the true quantities. Therefore, Monte Carlo simulations are used for determining the unknown hazard functions as well as for later benchmarking the reduced master equations.

By Monte Carlo simulations of the 5DF model under static stimulation, the quantities P ∗ (gs + gr ), P(gs + gr ), and α(t) can be obtained. Then, analogous to equation 2.16, we can determine h g (gs , gr ) by

\[
h_g(g_s, g_r) = h_g(g_s + g_r) = \frac{\alpha\, P^{*}(g_s + g_r)}{P(g_s + g_r)},
\tag{4.1}
\]
where we treat the sum of the conductances, gs + gr , rather than each independently because we have chosen their reversal potentials to be equal (see appendix A). It was found that h g (gs , gr ) can be fit well by a function of the form

\[
h_g(g_s, g_r) = a \exp\left(-b \cdot (g_s + g_r)\right),
\tag{4.2}
\]
where a and b are fit parameters. Some typical fits for various excitatory Poisson input rates are shown in Figure 1. For the 1DM model, the same fit parameters were used, but with gr = 0. Transforming to (ts , tr ) by the
Figure 1: h g (gs , gr ) = h g (gs + gr ) as a function of gtot = gs + gr , as determined from 5DF Monte Carlo (data points, 1000 trials per λe , 10 s per trial, dt = 0.01 ms) by equation 4.1, was found to be approximately exponential for a range of excitatory stimulation rates, λe , with the inhibitory stimulation rate fixed at λi = 11.4 Hz. For the definition of the 5DF model, see Table 1. The exponential fits (lines) are good for low rates (λe = 5.26, 5.56, 5.88, 6.01, 6.25, and 6.67 Hz) in A, but poorer for gs near zero for high rates (λe = 6.90, 7.14, 7.69, 8.33, 9.09, and 10.0 Hz) in B.
inverse of equation 2.5, we have

\[
h(t_s, t_r) = a \exp\left(-b \cdot \left(q_s \exp\left(\frac{-t_s}{\tau_s}\right) + q_r \exp\left(\frac{-t_r}{\tau_r}\right)\right)\right).
\tag{4.3}
\]
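Since equation 4.2 is linear in log space, the fit can be done by linear least squares. The following sketch uses synthetic stand-ins for the Monte Carlo estimates of equation 4.1, not the actual data:

    import numpy as np

    rng = np.random.default_rng(1)
    g_tot = np.linspace(0.0, 0.1, 50)     # binned g_s + g_r (assumed units: uS)
    # Synthetic stand-in for the hazard estimated by equation 4.1.
    h_est = 30.0 * np.exp(-40.0 * g_tot) * rng.lognormal(0.0, 0.05, 50)

    # log h = log a - b * g, so a linear fit in log space recovers equation 4.2.
    slope, intercept = np.polyfit(g_tot, np.log(h_est), 1)
    a_fit, b_fit = np.exp(intercept), -slope
    print(f"a = {a_fit:.2f} Hz, b = {b_fit:.2f} per unit conductance")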
4.2 Constructing Inhomogeneous Hazard Functions. Now given the hazard functions determined under static input statistics, the inhomogeneous hazard function given time-varying Poisson input rates λe (t), λi (t) can be constructed by accounting for synaptic filtering. The homogeneous hazard functions given static stimulation rates λe , λi determined by the recipes in section 4.1 are the hazard functions given synaptic conductance distributions parameterized by ⟨ge,i ⟩, neglecting higher-order moments. It can be shown that

\[
\frac{d\langle g_x(t)\rangle}{dt} = -\frac{1}{\tau_x}\left(\langle g_x(t)\rangle - q_x \tau_x N_x \lambda_x(t)\right),
\tag{4.4}
\]
with x ∈ {e, i}, a low-pass filter equation of the quantity q x τx Nx λx (t) with a cutoff frequency of 1/(2π τx ) (Gardiner, 1985; La Camera et al., 2004). As argued in Fourcaud and Brunel (2002) and Renart et al. (2004), the firing rate of neurons with nonzero synaptic time constants driven by
sufficient noise follows their input currents instantaneously. Then the hazard function h g (gs , gr , t) here is also determined instantaneously by the mean synaptic conductances. Therefore, the inhomogeneous parameters a (t) and b(t) in equation 4.3 can be determined by interpolating the parameters determined from static ⟨ge ⟩ and ⟨gi ⟩ with the instantaneous dynamic ⟨ge (t)⟩ and ⟨gi (t)⟩ determined by integrating equation 4.4 for some given arbitrary time-varying input spike trains parameterized by λe (t), λi (t). Thus, we have the hazard function

\[
h_g(g_s, g_r, t) = a(t) \exp\left(-b(t) \cdot (g_s + g_r)\right).
\tag{4.5}
\]
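The filtering step of equation 4.4 amounts to a first-order low-pass filter, which can be integrated exactly for piecewise-constant rates; a sketch with assumed parameter values:

    import numpy as np

    tau_e, q_e, N_e = 1.5e-3, 0.002, 1000   # s, uS, synapses (assumed)

    def filtered_mean_conductance(lam_e, dt):
        """Integrate equation 4.4 for a sampled rate function lam_e(t) [Hz]."""
        g_mean = np.empty_like(lam_e)
        g, decay = 0.0, np.exp(-dt / tau_e)
        for i, lam in enumerate(lam_e):
            target = q_e * tau_e * N_e * lam    # steady state for this rate
            g = target + (g - target) * decay   # exact step for constant input
            g_mean[i] = g
        return g_mean

    # Example: a step in the excitatory rate from 6.5 Hz to 8.3 Hz at t = 50 ms.
    dt = 1e-4
    lam = np.where(np.arange(0, 0.1, dt) < 0.05, 6.5, 8.3)
    g_e_mean = filtered_mean_conductance(lam, dt)

The interpolated a(t) and b(t) of equation 4.5 would then be looked up from the fitted static parameters at the instantaneous ⟨ge (t)⟩, ⟨gi (t)⟩.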
A similar approach was taken in La Camera et al. (2004), except that we do not correctly account for the dynamics of the standard deviation of the synaptic conductance by the fitting approach used here. This could be remedied given an analytically solvable neuron model, as was used in La Camera et al. In this study, we investigate only time-varying excitation. Treating inhibition in addition would require additional fits and two-dimensional interpolation of the resulting parameters but would yield no new results for this study.

4.3 Comparing to Renewal Theory Models. In inhomogeneous renewal theory, only the time since the last spike (age) enters into the hazard function (Gerstner & Kistler, 2002). While such theories cannot account for ISI correlations due to SFA, they can account for much of the gradual increase in excitability that follows a spike due to SFA by an appropriately chosen hazard function. Perhaps surprisingly, such models are sufficient to exhibit “adapting” transients to step stimuli. Like the 2DM model, we seek to calibrate such renewal models to the 5DF system and assess their suitability for modeling the ensemble firing rate under dynamic stimuli. Sufficient for such a comparison is a recipe for specifying the hazard function as a function of the static stimulus. The dynamic case can then be constructed by using the effective synaptic filtered stimulus to determine the inhomogeneous hazard function at each instant in time, as for the Markov models in the previous section.

For the static case, one can determine the hazard function as a function of the stimulus by interpolating the ISI distribution due to 5DF Monte Carlo and applying the standard renewal theory result that

\[
\rho(\tau) = \frac{f(\tau)}{F(\tau)},
\tag{4.6}
\]
where the renewal theory survival function is given by

\[
F(\tau) = \int_{\tau}^{\infty} f(s)\, ds.
\tag{4.7}
\]
The renewal model will thus reproduce the ISI distribution of 5DF Monte Carlo under static stimulation. This process is numerically unstable for large τ and, for the dynamic case, too costly. Another approach is to determine the renewal hazard function by equation 2.26, with one caveat: since the resulting renewal hazard function must be uniquely determined by the stimulus, P(ts , tr , t) in equation 2.26 must be replaced by P∞ (ts , tr , t), the instantaneous equilibrium distribution, or asymptotic state distribution for large time resulting from a h(ts , tr , t) fixed at the instantaneous value at time t. The renewal hazard function determined by this recipe, combined with the renewal master equation C.7, defines what we subsequently refer to as the effective renewal (ER) model (see the model overview in Table 1). Typical hazard functions are shown in Figure 2. Indeed, the renewal hazard functions determined by equation 2.26 are consistent with those of 5DF Monte Carlo determined by equation 4.6.

Simulation of the ER model implies that the master equation for P(ts , tr , t) must be allowed to converge to P∞ (ts , tr , t) for each time step where the stimulation changes. This is costly and makes the ER model much less efficient to simulate than the 1DM and 2DM models, but it allows a direct comparison of renewal models with the 1DM and 2DM models and 5DF Monte Carlo. When the renewal hazard function is known a priori, as would be the case for a gamma renewal process, or when the hazard functions can be fit by simple functions, the renewal theory ensemble equations given in appendix C are comparatively more efficient to simulate than the 1DM and 2DM models.

5 Numerics

In this section we describe the numerical techniques applied to solve the 1DM and 2DM master equations, generate realizations of the 1DM and 2DM processes, and solve the 5DF neuron model equations.

5.1 Numerical Solution of Master Equations. We solved the 1DM and 2DM master equations numerically by discretizing P(ts , t) and P(ts , tr , t), respectively, applying the exponential Euler method for the death term, and reinserting the lost probability by walking the negative time domain and fetching the probability sources of each bin determined by equation 2.8. We present the one-dimensional case here, which can be generalized to two dimensions.
Figure 2: The renewal hazard function, ρ(τ ), for synaptic input rates (λe = 6.5 Hz, λi = 11.4 Hz) and (λe = 8.3 Hz, λi = 11.4 Hz) resulting in an ensemble firing rate of α = 6.33 Hz (top row), and α = 18.67 Hz (bottom row), respectively. The renewal hazard function for 5DF Monte Carlo (circles) was computed by equation 4.6 with a spike train of 104 s. The renewal hazard function due to the 2DM model (solid line) was determined by equation 2.26. The renewal hazard function for a gamma renewal process (dashed line) equal to the 2DM renewal hazard function at large τ and with the same average firing rate was computed as discussed in section C.2. The small τ region is shown blown up in the right column. For the definition of the 2DM and 5DF models, see Table 1.
We discretize P(ts , t) on equally spaced grids t_s^i and t^j with grid spacings Δts := t_s^{i+1} − t_s^i and Δt := t^{j+1} − t^j , respectively, with Δts = Δt, such that P(ts , t) → P^{i,j}. The discretized form of the master equation 2.9 then takes the form

\[
P^{i+1, j+1} = P^{i, j} \exp\left(-\Delta t \cdot h(t_s^i, t^j)\right) + P_r^{i, j},
\tag{5.1}
\]

where the first term is the exponential Euler computation of the loss of probability due to the death term. On the left-hand side, the first superscript of P, i + 1, leads i by one to implement the constant drift of ts , dts /dt = 1.
Table 1: Overview of the Models and Quick Reference for the Equations and Sections for the Models.

Model | Description | Keywords | Defined in (section) | Equations | Calibration to 5DF, recipe in (section) | Numerics in (section)
1DM | One-dimensional Markov process | Beyond renewal theory, spike-frequency adaptation, master equation, ensemble | 2 | 2.4, 2.9 | 4.1, 4.2 | 5.1 (master equation), 5.3 (realizations)
5DF | Conductance-based spike-frequency adapting relative refractory integrate-and-fire neuron driven by Poisson spike trains | Full five-dimensional system, Monte Carlo, reference, benchmark | 3.1, A | 3.13–3.15 | NA | 5.2
2DM | Two-dimensional Markov process | Adiabatic elimination (of 5DF), spike-frequency adaptation, relative refractory period, master equation, ensemble | 3.2 | 3.24 | 4.1, 4.2 | 5.1 (master equation), 5.3 (realizations)
ER | Effective renewal theory model | Inhomogeneous, master equation | C | C.7 | 4.3, 4.2 | C.1 (master equation), C.2 (realizations)
The reinserted probability, P_r^{i,j}, is computed for t_s^i + Δts < 0 by

\[
P_r^{i,j} := \sum_{m = i_{\mathrm{rif}}(t_s^i)}^{i_{\mathrm{rif}}(t_s^{i+1}) - 1} P_d^{m,j}
+ \frac{\psi^{-1}(t_s^{i+1}) - t_s^{\,i_{\mathrm{rif}}(t_s^{i+1})}}{\Delta t_s}\, P_d^{i+1,j}
- \frac{\psi^{-1}(t_s^{i}) - t_s^{\,i_{\mathrm{rif}}(t_s^{i})}}{\Delta t_s}\, P_d^{i,j},
\tag{5.2}
\]

where P_d^{i,j} is the loss of probability computed by

\[
P_d^{i,j} := \Delta t \cdot P^{i,j} \cdot h(t_s^i, t^j),
\tag{5.3}
\]

and i_rif refers to the “reinserted-from” index, which satisfies

\[
t_s^{\,i_{\mathrm{rif}}(t_s^i)} \leq \psi^{-1}(t_s^i) < t_s^{\,i_{\mathrm{rif}}(t_s^i)} + \Delta t_s.
\tag{5.4}
\]
The first term in equation 5.2 is just the sum of all P_d^{m,j} except the fractional last bin, which sends probability to the interval ts ∈ [t_s^i , t_s^i + Δts ). The second two terms subtract the fractional first and add the fractional last bins of P_d^{i,j}, which are reinserted to the interval, and thus implement a sort of anti-aliasing of the reinsertion mapping. Two short sketches of the numerical schemes of this section follow below.

5.2 Neuron Simulations. Monte Carlo simulations of the full system (5DF Monte Carlo) were performed by solving equations 3.13 to 3.15 using the NEST simulation environment (Diesmann & Gewaltig, 2002) with a time step of 0.01 ms.

5.3 Generating Realizations of the Proposed Markov Processes. Generating realizations of the proposed 1DM or 2DM processes is straightforward: The thinning method for a general hazard function described in Devroye (1986) can be applied. The quantity h max = maxts ,t (h(ts , t)) for the 1DM case or h max = maxts ,tr ,t (h(ts , tr , t)) for the 2DM case must be known. The variables t and ts (1DM), or t, ts , and tr (2DM) are required and can have initial values of zero. Sequential intervals are generated using a homogeneous Poisson process with hazard rate ρ = h max . Given one such interval, Δti , it is added to t and ts (1DM), or t, ts , and tr (2DM). Next, a spike is generated at time t with probability h(ts , t)/ h max (1DM) or h(ts , tr , t)/ h max (2DM), and if a spike is generated, ts → ψs (ts ) and tr → ψr (tr ), where ψs and ψr refer to the reinsertion mappings as in equation 2.7 with the respective parameters for the two mechanisms.
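First, a minimal sketch of one step of the discretized 1DM master equation of section 5.1. For brevity it reinserts each bin's lost probability into the single nearest target bin, omitting the fractional anti-aliasing terms of equation 5.2; grid sizes and parameter values are assumptions:

    import numpy as np

    tau_s, q_s, a, b = 0.1, 0.01, 20.0, 50.0        # assumed values
    dt = 1e-3                                       # Delta t = Delta t_s
    ts = np.arange(-1.0, 2.0, dt)                   # pseudo-age grid, incl. ts < 0
    h = a * np.exp(-b * q_s * np.exp(-ts / tau_s))  # static hazard, equation 2.10

    def psi(t):
        """Reset mapping, equation 2.7; maps every pseudo-age below zero."""
        return -tau_s * np.log(np.exp(-t / tau_s) + 1.0)

    # Target bin of each source bin under the reset mapping (cf. equation 5.4);
    # probability that would leave the grid is clipped to the boundary bin.
    tgt = np.clip(np.searchsorted(ts, psi(ts)) - 1, 0, ts.size - 1)

    def step(P):
        P_d = dt * P * h                            # lost probability, equation 5.3
        P_new = np.zeros_like(P)
        # Drift by one bin plus exponential Euler death term, equation 5.1.
        P_new[1:] = P[:-1] * np.exp(-dt * h[:-1])
        np.add.at(P_new, tgt, P_d)                  # reinsertion at negative ts
        return P_new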
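Second, a realization generator following the thinning recipe of section 5.3, for the 1DM case with the static hazard of equation 2.10 (parameter values again assumed):

    import numpy as np

    a, b, q_s, tau_s = 20.0, 50.0, 0.01, 0.1   # assumed hazard parameters

    def h(ts):
        """Static 1DM hazard, equation 2.10."""
        return a * np.exp(-b * q_s * np.exp(-ts / tau_s))

    def psi(ts):
        """Reset mapping, equation 2.7."""
        return -tau_s * np.log(np.exp(-ts / tau_s) + 1.0)

    def realization(T=100.0, rng=np.random.default_rng(2)):
        """One 1DM spike train on [0, T] by thinning (Devroye, 1986)."""
        h_max = a                                 # h is bounded above by a here
        t, ts, spikes = 0.0, 0.0, []
        while True:
            dt_i = rng.exponential(1.0 / h_max)   # candidate interval at rate h_max
            t += dt_i
            ts += dt_i
            if t > T:
                return np.array(spikes)
            if rng.uniform() < h(ts) / h_max:     # accept with probability h/h_max
                spikes.append(t)
                ts = psi(ts)                      # reinsert at negative pseudo-age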
6 Results

In this section we compare ISI distributions, static ISI correlations, and firing rate dynamics of the 1DM, 2DM, and ER models to 5DF Monte Carlo.

6.1 Interspike Interval Distributions. The predictions due to the ER and 2DM models are in excellent agreement with the static ISI distribution of 5DF Monte Carlo. The prediction due to the 1DM model neglects refractory effects and is therefore poor for low ISIs, as can be seen in Figure 3.

6.2 Interspike Interval Correlations. In this section we investigate correlations between subsequent ISIs, a feature of the proposed 1DM and 2DM models that is absent by definition in renewal theory models of spike statistics. The correlation coefficient, r , for a finite number of observations is defined by

\[
r^2 = \frac{\left(\sum_i (x_i - \bar{x})(y_i - \bar{y})\right)^2}{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2},
\tag{6.1}
\]
and is the standard measure by which to quantify correlations between two random variables x, y, where xi , yi denote the individual observations and x̄ , ȳ denote the means. The correlation coefficients of subsequent ISIs under static stimulation were calculated for 100 runs of 100 s, and the mean and deviation in the mean are given in Table 2. Indeed, the renewal process shows no ISI correlations. For low and high firing rates, the difference between the correlation coefficients for 5DF Monte Carlo and realizations of the 2DM model is consistent with zero. Both exhibit negative ISI correlations, implying short ISIs are generally followed by long ISIs and vice versa, as has been observed in previous experimental and theoretical studies (Longtin & Racicot, 1997; Chacron, Longtin, & Maler, 2000; Chacron et al., 2005; Nawrot et al., 2007).

The conditional ISI distribution, f (τi+1 |τi ), can be computed for the 1DM and 2DM models by equation 2.34. Predictions due to the 2DM model are in agreement with 5DF Monte Carlo for low and high rates, and both long and short τi , as shown in Figure 3. Applying equation 2.34, we can compute the quantity

\[
\langle \tau_{i+1} \mid \tau_i \rangle_{\tau_{i+1}} = \int_0^{\infty} \tau_{i+1}\, f(\tau_{i+1} \mid \tau_i)\, d\tau_{i+1}.
\tag{6.2}
\]
As discussed in Whittaker and Robinson (1967), this is a linear function of τi for normal distributions, the slope of which is the correlation coefficient. As the ISI distributions here are not normal, there are deviations from linearity,
Figure 3: Comparison of the conditional ISI distributions due to 5DF Monte Carlo with predictions due to effective renewal theory (ER, dotted line), the 1DM model (dashed line, determined by equation 2.34), and the 2DM model (solid line, determined by the 2D generalization of equation 2.34). For the definition of the 1DM, 2DM, 5DF, and ER models, see Table 1. The left column shows three representative conditional ISI distributions for an ensemble firing rate of α = 18.67 Hz (λe = 8.3 Hz, λi = 11.4 Hz), and the right column shows the same for α = 6.33 Hz (λe = 6.5 Hz, λi = 11.4 Hz). The upper two plots show the conditional ISI distribution for τi much shorter than the mean. The middle two plots show the conditional ISI distribution for τi equal to the mean. The lower two plots show the conditional ISI distribution for τi much longer than the mean. The preceding ISI, τi , is given on each plot, and a small interval around τi is used to compute the distributions from 5DF Monte Carlo to yield sufficient statistics. The theoretical predictions of the conditional ISI distributions using the 2DM model are in good agreement with 5DF Monte Carlo for all situations considered. The ISI distribution due to 5DF Monte Carlo is consistent with the renewal ISI distribution only when the preceding ISI is equal to the mean ISI (middle row). Spike trains of duration 10⁴ s were used. Error bars represent the relative error in the histogram bin counts, 1/√count.
Table 2: Serial ISI Correlation Coefficients for Monte Carlo Simulations of the Full Five-Dimensional System (5DF), Realizations of the One- and Two-Dimensional Markov Models (1DM, 2DM), and Realizations of the Effective Renewal Theory Model (ER).

Model | Correlation coefficient
α = 6.33 Hz:
5DF | −0.148 ± 0.004
2DM | −0.147 ± 0.003
1DM | −0.160 ± 0.003
ER | 0.0042 ± 0.0043
α = 18.67 Hz:
5DF | −0.235 ± 0.002
2DM | −0.236 ± 0.002
1DM | −0.283 ± 0.002
ER | 0.001 ± 0.002
Figure 4: Mean of the conditional ISI distribution as a function of the preceding ISI, τi , for high-ensemble firing rates (A, (λe = 8.3 Hz, λi = 11.4 Hz), α = 18.67 Hz) and low-ensemble firing rates (B, (λe = 6.5 Hz, λi = 11.4 Hz), α = 6.33 Hz). The data points (triangles) shown are for 5DF Monte Carlo. Theoretical predictions due to the 1DM (dashed line), 2DM (solid line), and ER (dotted line) models are shown. For the definition of the 1DM, 2DM, 5DF, and ER models, see Table 1. A linear function with slope equal to the serial ISI correlation coefficient would be the functional form if the ISI distributions were normal. Thus, these linear functions are plotted for comparison.
as shown in Figure 4. Predictions due to equation 6.2 for the 2DM model are consistent with 5DF Monte Carlo for both low and high rates, as seen in Figure 4. Thus, the 2DM model is indistinguishable from the full system when considering static correlations.
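The serial correlation coefficients reported in Table 2 can be estimated from a spike train as follows (a sketch; realization refers to the thinning sketch of section 5.3 and is an assumption of this example):

    import numpy as np

    def serial_isi_correlation(spike_times):
        """Correlation coefficient of consecutive ISIs (cf. equation 6.1)."""
        isi = np.diff(spike_times)
        return np.corrcoef(isi[:-1], isi[1:])[0, 1]

    # Negative values indicate that short ISIs tend to be followed by long ones:
    # r = serial_isi_correlation(realization(T=100.0))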
Figure 5: (A) The ensemble firing rate, α(t), in response to a moderate step stimulus, determined by 5DF Monte Carlo (5 · 105 trials, solid line), and numerical solution of the 1DM (dotted line), 2DM (dashed line), and ER (dashed dotted line) master equations. For the definition of the 5DF, 1DM, 2DM, and ER models, see Table 1. (B) The region outlined by the dashed rectangle is enlarged, showing consistency between the two-dimensional Markov (2DM) model and the full system (5DF Monte Carlo) apart from a 0.5 ms lead of the 2DM solution. This discrepancy is likely due to the neglected membrane potential dynamics.
6.3 Firing Rate Dynamics. In this section, we compare the ensemble firing rates of the 1DM, 2DM, and ER models to 5DF Monte Carlo for small to large step stimuli, and for random fluctuating stimuli generated by an Ornstein-Uhlenbeck process.

We subject the neural ensemble to a dynamic stimulus by specifying a time-varying excitatory Poisson input rate, λe (t). Given the time-dependent hazard function determined by the Poisson input rates as described in section 4.2, the ensemble firing rate, α(t), of the 1DM and 2DM models can be calculated by solving equations 2.9 and 3.24, respectively, for the time-dependent state distribution, and subsequently applying equation 2.17 or its 2D generalization. For the ER model, the hazard function was calculated by the methods discussed in section 4.3, and the ensemble firing rate was determined by solving equation C.7.

For weak step stimuli that do not bring the system too far from equilibrium, all models (ER, 1DM, 2DM) faithfully reproduce the step stimulus response of 5DF Monte Carlo (not shown). For moderate step stimuli, only the 2DM model faithfully reproduces the step stimulus response of 5DF Monte Carlo, as shown in Figure 5. For large step stimuli, the 2DM model starts to deviate from 5DF Monte Carlo, as seen in Figure 6. The effective renewal theory (ER) model does a reasonable job of predicting the ensemble firing rate of the system for low-amplitude step stimuli.
Figure 6: (A) The ensemble firing rate, α(t), in response to a large step stimulus, determined by 5DF Monte Carlo (5 · 105 trials, solid line), and numerical solution of the 1DM (dotted line), 2DM (dashed line), and ER (dashed dotted line) master equations. For the definition of the 5DF, 1DM, 2DM, and ER models, see Table 1. (B) The region outlined by the dashed rectangle is enlarged, showing disagreement between the two-dimensional Markov (2DM) model and the full system (5DF Monte Carlo).
This is perhaps surprising, since we do not expect renewal models to faithfully reproduce the dynamical responses of spike-frequency adapting neurons, as renewal models do not account for the dependencies of the firing probability density (hazard function) on spikes prior to the most recent. However, this shows that if the ensemble is not pushed far from equilibrium, knowledge of just the last spike is sufficient to predict the firing rate.

We further compared 5DF Monte Carlo and predictions of the 2DM model for a stimulus, ν_e(t), generated by an Ornstein-Uhlenbeck (OU) process. Let ζ(t) be an OU process with a mean of 10 Hz, a standard deviation of 0.6 Hz, and a time constant of 0.2 s. The excitatory synaptic inputs were then supplied with ν_e(t) = N_e ζ(t), where N_e = 1000 is the number of excitatory synaptic inputs. The ensemble firing rates for the 2DM model, its adiabatic solution, and 5DF Monte Carlo are shown in Figure 7. The adiabatic solution of the 2DM model is defined as the system that at each instant in time has a distribution of states equal to the instantaneous equilibrium distribution, P_∞(t_s, t_r, t), the asymptotic state distribution for large time resulting from a hazard function h(t_s, t_r, t) fixed at its instantaneous value at time t. The firing rate of the adiabatic 2DM model is then calculated by

\[
\alpha(t) = \int_{-\infty}^{\infty}\!\!\int_{-\infty}^{\infty} h(t_s, t_r, t)\, P_\infty(t_s, t_r, t)\, dt_s\, dt_r.
\tag{6.3}
\]
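For concreteness, the following minimal sketch (not the authors' code; NumPy is assumed) generates the OU stimulus ζ(t) described above, using the exact one-step update of an Ornstein-Uhlenbeck process:

```python
# A minimal sketch (not the authors' code) of the OU stimulus zeta(t):
# mean 10 Hz, standard deviation 0.6 Hz, time constant 0.2 s.
import numpy as np

def ou_stimulus(t_end=10.0, dt=1e-3, mu=10.0, sd=0.6, tau=0.2,
                rng=np.random.default_rng(0)):
    """Exact one-step discretization of an Ornstein-Uhlenbeck process."""
    n = int(t_end / dt)
    decay = np.exp(-dt / tau)                 # per-step relaxation factor
    noise_sd = sd * np.sqrt(1.0 - decay**2)   # preserves the stationary SD
    zeta = np.empty(n)
    zeta[0] = mu
    for i in range(1, n):
        zeta[i] = mu + (zeta[i - 1] - mu) * decay + noise_sd * rng.normal()
    return zeta

# total excitatory input rate: nu_e(t) = N_e * zeta(t), with N_e = 1000
nu_e = 1000.0 * ou_stimulus()
```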
Figure 7: (A) The ensemble firing rate, α(t), in response to an Ornstein-Uhlenbeck process stimulus (as described in the text), determined by 5DF Monte Carlo (solid line), numerical solution of the 2DM master equation (dashed line), and the adiabatic solution (adiabatic 2DM, dotted line) computed by equation 6.3. For the definition of the 5DF and 2DM models, see Table 1. (B) The region outlined by the dashed rectangle is enlarged, showing consistency between the two-dimensional Markov (2DM) model and the full system (5DF Monte Carlo).
By comparison of the ensemble firing rates of the 2DM model with its adiabatic solution in Figure 7, we can see that the system is being driven from equilibrium by the stimulus. Furthermore, the behavior of the 2DM model far from equilibrium captures the ensemble firing rate dynamics of 5DF Monte Carlo faithfully. This result is robust under variation of neuron parameters and stimuli, as long as the ensemble is not pushed too far from equilibrium, as it is for the large step stimulus in Figure 6.

7 Beyond Mean-Adaptation Approximations

In this section we show that statistical moment theory approximations, such as the mean-adaptation theories due to La Camera et al. (2004), can be derived from the 1DM master equation. The approach generalizes, and we derive the next-order moment theory approximation, a mean+variance-adaptation theory, and use it to clarify the domain of validity of mean-adaptation theories.

Recall the 1DM master equation for a spike-frequency adapting neuron,

\[
\frac{\partial}{\partial t} P(g_s, t) = \frac{\partial}{\partial g_s}\left( \frac{g_s}{\tau_s}\, P(g_s, t) \right) + h_g(g_s - q_s, t)\, P(g_s - q_s, t) - h_g(g_s, t)\, P(g_s, t),
\tag{7.1}
\]
where P(g_s, t) is the probability density of the adaptation state variable, g_s, and P(g_s < 0, t) = 0. The ensemble firing rate, α(t), is given by

\[
\alpha(t) = \int_{-\infty}^{\infty} h_g(g_s, t)\, P(g_s, t)\, dg_s.
\tag{7.2}
\]
The mean adaptation variable is

\[
\langle g_s \rangle(t) = \int_{-\infty}^{\infty} g_s\, P(g_s, t)\, dg_s.
\tag{7.3}
\]
Multiplying equation 7.1 by g_s and integrating over g_s yields the time evolution of the mean, ⟨g_s⟩(t),

\[
\frac{d\langle g_s \rangle(t)}{dt} = -\frac{1}{\tau_s} \langle g_s \rangle(t) + q_s\, \alpha(t).
\tag{7.4}
\]
By Taylor expanding h_g(g_s) in equation 7.2 around ⟨g_s⟩(t) and keeping up to linear terms, a mean-adaptation theory as in La Camera et al. (2004) results. Keeping up to quadratic terms, we have

\[
\alpha(t) \approx \alpha\big(\langle g_s \rangle(t), \langle \Delta g_s^2 \rangle(t)\big) = h_g\big(\langle g_s \rangle(t)\big) + \frac{1}{2}\, h''_g\big(\langle g_s \rangle(t)\big) \cdot \langle \Delta g_s^2 \rangle(t),
\tag{7.5}
\]
where the h'_g(⟨g_s⟩(t)) term vanishes by a cancellation of means. A mean+variance-adaptation theory results, but we require the time evolution of the variance. Multiplying equation 7.1 by (g_s − ⟨g_s⟩(t))² and integrating over g_s yields the time evolution of the variance, ⟨Δg_s²⟩(t),

\[
\frac{d\langle \Delta g_s^2 \rangle(t)}{dt} = -\frac{2}{\tau_s} \langle \Delta g_s^2 \rangle(t) + q_s^2\, \alpha(t) + 2 q_s \int_0^{\infty} \big(g_s - \langle g_s \rangle(t)\big)\, h_g(g_s, t)\, P(g_s, t)\, dg_s.
\tag{7.6}
\]
Approximating h_g(g_s) ≈ h_g(⟨g_s⟩(t)) + h'_g(⟨g_s⟩(t))(g_s − ⟨g_s⟩(t)), equation 7.6 becomes

\[
\frac{d\langle \Delta g_s^2 \rangle(t)}{dt} \approx -\frac{2}{\tau_s} \langle \Delta g_s^2 \rangle(t) + q_s^2\, \alpha\big(\langle g_s \rangle(t), \langle \Delta g_s^2 \rangle(t)\big) + 2 q_s\, h'_g\big(\langle g_s \rangle(t)\big) \cdot \langle \Delta g_s^2 \rangle(t),
\tag{7.7}
\]
which has the steady state

\[
\langle \Delta g_s^2 \rangle = \frac{1}{2}\, \frac{q_s^2\, \alpha\big(\langle g_s \rangle, \langle \Delta g_s^2 \rangle\big)}{1/\tau_s - q_s\, h'_g(\langle g_s \rangle)}.
\tag{7.8}
\]
Thus, the mean+variance-adaptation theory consistency relation for the adapted equilibrium firing rate, α*, reads

\[
\alpha^* = h_g(q_s \tau_s \alpha^*) + \frac{1}{4}\, h''_g(q_s \tau_s \alpha^*)\, \frac{q_s^2\, \alpha^*}{1/\tau_s - q_s\, h'_g(q_s \tau_s \alpha^*)}.
\tag{7.9}
\]
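As an illustration (a sketch under stated assumptions, not the authors' code), equation 7.9 can be solved for α* by damped fixed-point iteration, given any hazard function h_g; the derivatives h'_g and h''_g are taken numerically, and the example hazard in the comment is hypothetical:

```python
# A minimal sketch of solving the mean+variance-adaptation consistency
# relation, equation 7.9, by damped fixed-point iteration.
import numpy as np

def adapted_rate(h_g, q_s, tau_s, alpha0=5.0, damp=0.5, tol=1e-8):
    """Self-consistent adapted firing rate alpha* of equation 7.9."""
    alpha = alpha0
    for _ in range(10000):
        g = q_s * tau_s * alpha              # equilibrium mean <g_s>, eq. 7.4
        d = 1e-3 * (g + q_s)                 # step for numerical derivatives
        h1 = (h_g(g + d) - h_g(g - d)) / (2 * d)             # h_g'
        h2 = (h_g(g + d) - 2 * h_g(g) + h_g(g - d)) / d**2   # h_g''
        var = 0.5 * q_s**2 * alpha / (1.0 / tau_s - q_s * h1)  # eq. 7.8
        new = (1 - damp) * alpha + damp * (h_g(g) + 0.5 * h2 * var)
        if abs(new - alpha) < tol:
            break
        alpha = new
    return alpha

# e.g., an exponential hazard (a, b hypothetical):
# alpha_star = adapted_rate(lambda g: 30.0 * np.exp(-g / 20e-9),
#                           q_s=14.48e-9, tau_s=0.110)
```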
Higher-order moment theories can be derived by keeping higher terms in the expansions in equations 7.5 and 7.7 and computing the necessary equations for the time evolution of higher statistical moments from the master equation 7.1.

7.1 Validity of Mean-Adaptation Theories. In this section we give a heuristic criterion for the validity of mean-adaptation theories in the static case, and the improved accuracy of the mean+variance-adaptation theory is demonstrated by a numerical example. It is illustrative to first investigate the exactly solvable leaky integrate-and-fire neuron driven by white noise for the parameters considered in La Camera et al. (2004), and subsequently to contrast the findings with the 5DF model defined by equations 3.13 to 3.15.

It can be seen by inspection of equation 7.9 that if h''_g(g_s) ≈ 0 over the regime where P(g_s) is appreciably nonzero, then the mean-adaptation consistency relation,

\[
\alpha^* = h_g(q_s \tau_s \alpha^*),
\tag{7.10}
\]
as in La Camera et al. (2004), results. First, we use the 1DM master equation to verify the mean-adaptation theory for the leaky integrate-and-fire neuron driven by white noise considered in La Camera et al. (2004). The hazard function, h_g(g_s, t), is referred to there as the response function in the presence of noise and has the exact solution,

\[
h_g(g_s, t) = \left( \tau \int_{\frac{C V_r - (m - g_s)\tau}{\sigma \sqrt{\tau}}}^{\frac{C \theta - (m - g_s)\tau}{\sigma \sqrt{\tau}}} \sqrt{\pi}\, e^{x^2} \big(1 + \operatorname{erf}(x)\big)\, dx \right)^{-1},
\tag{7.11}
\]

due to Siegert (1951), Ricciardi (1977), and Amit and Tsodyks (1991), where V_r is the reset potential, θ is the threshold, τ is the membrane time constant, C is the membrane capacitance, and erf(x) = (2/√π) ∫_0^x e^{−t²} dt is the error function.
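As an aside, equation 7.11 is easy to evaluate numerically. The following is a minimal sketch (not the authors' code; SciPy is assumed, the default parameters follow the Figure 8 caption, and the identity e^{x²}(1 + erf x) = erfcx(−x) gives a compact integrand):

```python
# A minimal sketch evaluating the Siegert hazard of equation 7.11 for the
# LIF neuron of La Camera et al. (2004); parameters as in Figure 8 (SI units).
import numpy as np
from scipy.integrate import quad
from scipy.special import erfcx    # erfcx(x) = exp(x^2) * erfc(x)

def siegert_hazard(gs, m=0.25e-9, s=0.6e-9, C=0.5e-9,
                   tau=20e-3, theta=20e-3, Vr=10e-3):
    """Firing rate (Hz) given the adaptation current gs (A)."""
    sigma = s * np.sqrt(2e-3)                     # sigma = s * sqrt(2 ms)
    lo = (C * Vr - (m - gs) * tau) / (sigma * np.sqrt(tau))
    hi = (C * theta - (m - gs) * tau) / (sigma * np.sqrt(tau))
    # integrand sqrt(pi) * exp(x^2) * (1 + erf(x)), written via erfcx(-x)
    val, _ = quad(lambda x: np.sqrt(np.pi) * erfcx(-x), lo, hi)
    return 1.0 / (tau * val)
```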
Figure 8: (A, top) The hazard function, h_g(g_s), and (A, bottom) the equilibrium distribution of adaptation states, P(g_s), in the low-firing-rate regime (α* = 4.83 Hz, mean current input m = 0.25 nA, and noise σ = s · √(2 ms) with s = 0.6 nA) of the leaky integrate-and-fire neuron (LIF) used in La Camera et al. (2004). P(g_s) was determined by numerical solution of the 1DM master equation using the hazard function given in equation 7.11. Neuron parameters: C = 0.5 nF, τ_m = 20 ms, θ = 20 mV, V_r = 10 mV. Adaptation parameters: τ_s = 110 ms, q_s · τ_s = 4 pA · s. For comparison, the neuron and adaptation parameters are as for Figure 1a in La Camera et al. (2004), except τ_r = 0 ms and τ_s = 110 ms. For the definition of the 1DM model, see Table 1. The hazard function is nearly linear over the distribution of states; thus, terms depending on the variance of P(g_s) in equation 7.9 can be neglected, and mean-adaptation theories will yield good approximations to the adapted ensemble firing rate. (B) The adapted ensemble firing rate, α*, for a range of mean current inputs, m, determined by numerical solution of the 1DM master equation (circles), and the mean-adaptation theory consistency relation (solid line).
As in La Camera et al. (2004), m and σ are the mean and standard deviation of the input current. Upon firing, the adaptation current, g_s, makes a jump of q_s and relaxes with a time constant τ_s. As can be seen in Figure 8A, h_g(g_s) is nearly linear over the regime where P(g_s) is appreciably nonzero, and predictions of the adapted firing rate due to a mean-adaptation theory are in excellent agreement with the 1DM master equation, as shown in Figure 8B. The mean+variance-adaptation theory helps us to understand this: agreement is good because h''_g(g_s) ≈ 0 over the regime where P(g_s) is appreciably nonzero for all firing rates considered. For the 5DF model defined by equations 3.13 to 3.15, we have h_g(g_s) ≈ a · exp(−b g_s). As can be seen in Figure 9A, h_g(g_s) has an appreciable second derivative over P(g_s), and thus we expect mean-adaptation equilibrium ensemble firing rate predictions to deviate from the ensemble firing rate of the 1DM master equation. Indeed, such deviations are observed and are corrected by the mean+variance-adaptation consistency relation, as seen in Figure 9B.
Figure 9: (A, top) The hazard function, h_g(g_s), and (A, bottom) the equilibrium distribution of adaptation states, P(g_s), determined by numerical solution of the 1DM master equation. The hazard function, h_g(g_s), was determined by fitting to 5DF Monte Carlo as in Figure 1, with λ_e = 9.75 Hz and λ_i = 11.4 Hz. For the definition of the 5DF and 1DM models, see Table 1. The hazard function has nonzero curvature (h''_g(g_s) > 0) over the distribution of states; thus, terms depending on the variance of P(g_s) in equation 7.9 cannot be neglected, and mean-adaptation theory predictions for the adapted ensemble firing rate are expected to be in error. (B) The adapted ensemble firing rate, α*, for a range of Poisson input rates, λ_e, determined by solution of the 1DM master equation (circles), the mean-adaptation theory consistency relation (dashed line), and the mean+variance-adaptation consistency relation (solid line). As expected, mean-adaptation theory predictions for the adapted firing rate are corrected by the mean+variance-adaptation theory consistency relation in equation 7.9.
Thus, a heuristic condition for the validity of mean-adaptation theories is that h''_g(g_s) ≈ 0 over the regime where P(g_s) is appreciably nonzero. Less heuristically, the contributions due to the second term (and all neglected higher-order terms) on the right-hand side of equation 7.9 must vanish compared to the first. When this condition is violated, higher-order moment theories, such as the mean+variance-adaptation theory given here, or the 1DM master equation itself, should be applied to determine the ensemble firing rate. For the neuron models considered here, the accuracy of the mean+variance-adaptation theory was also verified in the dynamic case for an OU stimulus as in Figure 7, as shown in Figure 10.

8 Discussion

In this letter, we propose a one-dimensional Markov process (the 1DM model) for modeling neural ensemble activity and spike train statistics
Figure 10: (A) The ensemble firing rate, α(t), in response to an Ornstein-Uhlenbeck process stimulus (as for Figure 7), determined by the 1DM model (solid line), the adiabatic solution (thick solid line) computed by the mean+variance-adaptation consistency relation, equation 7.9, the dynamic mean+variance-adaptation theory, equations 7.4 to 7.6 (dotted line), and the dynamic mean-adaptation theory equations (dashed line). (B) The region outlined by the dashed rectangle is enlarged, showing consistency between the 1DM model and the mean+variance-adaptation theory, while predictions due to the mean-adaptation theory are poor. For the definition of the 1DM model, see Table 1.
that goes beyond renewal theory by accounting for interspike interval (ISI) correlations due to spike-frequency adaptation (SFA) mechanisms, without the need to model the high-dimensional space of the microscopic neuronal state variables. We demonstrated that the full five-dimensional master equation of a conductance-based integrate-and-fire neuron with SFA and a refractory mechanism driven by Poisson spike trains (the 5DF model) can be reduced to a two-dimensional master equation plus filtering differential equations accounting for synaptic dynamics (the 2DM model), under an adiabatic elimination of the fast variables v, g_e, g_i, assuming the neuron has nonzero synaptic time constants and is in the high-conductance state. The resulting 2DM master equation is a two-dimensional generalization of the Markov process proposed at the outset as an extension of renewal theory to account for ISI correlations. We presented methods for generating inhomogeneous realizations of the proposed 1DM and 2DM models and a method for solving their master equations numerically. The 2DM model was shown to accurately predict firing rate profiles of the full system under dynamic stimulation, as well as conditional ISI distributions and serial ISI correlations under static stimulation.
It was shown that mean-adaptation theories for spike-frequency adapting neurons with noisy inputs, such as in La Camera et al. (2004), and higher-order statistical moment theories can be derived from the 1DM master equation as long as one neglects the refractory period. A heuristic condition for the validity of mean-adaptation theories was derived and found to be violated for the neuron model (5DF) and parameters considered here. Furthermore, a mean+variance-adaptation theory was derived that corrected the ensemble firing rate predictions of mean-adaptation theories in this case.

8.1 Comparison with Other Studies of Adaptation. Studies of the firing rates of networks and ensembles of spike-frequency adapting neurons due to Latham et al. (2000) and Fuhrmann et al. (2002) augment a Wilson and Cowan equation (Wilson & Cowan, 1972) for the firing rate with a mean adaptation variable. As is typical of the Wilson and Cowan approach, the ensemble firing rate, α, enters a differential equation of the form

\[
\tau_e \frac{d\alpha}{dt} = -\alpha + h_g\big(\langle g_s \rangle(t), \cdots\big),
\tag{8.1}
\]
where h_g(⟨g_s⟩(t), ···) is the static firing rate given the input and the mean adaptation, and τ_e is the timescale for relaxation to a firing rate equilibrium. As is suggested in Fuhrmann et al. (2002), τ_e is determined mainly by the membrane time constant of the neuron, but it is also affected by the mean amplitude of the input, and it is treated there as a free parameter. It has been argued in Gerstner (2000), Brunel, Chance, Fourcaud, and Abbott (2001), Fourcaud and Brunel (2002), Renart et al. (2004), and La Camera et al. (2004) that for current- and conductance-based synapses with nonzero time constants and biological input statistics, the ensemble firing rate responds instantaneously to input currents, and synaptic filtering dominates. In this case, the Wilson and Cowan equation for α can be replaced by an instantaneous f-I function, and the synaptic currents or conductances can be modeled by relaxation equations for their means and variances. This is the approach taken in La Camera et al. (2004). Thus, one sidesteps the concerns mentioned in Fuhrmann et al. (2002) that the Wilson and Cowan equation “cannot be rigorously derived from the detailed integrate-and-fire model” and has been “shown not to accurately describe the firing rate dynamics [by] (Gerstner, 2000).”

The models due to Latham et al. (2000), Fuhrmann et al. (2002), and La Camera et al. (2004) all approximate the evolution of the ensemble of adaptation variables by its mean value and are therefore mean-adaptation theories. La Camera et al. (2004) state that such mean-adaptation theories are a good approximation under the assumption that “adaptation is slow compared to the timescale of the neural dynamics. In such a case, the
feedback [adaptation] current . . . is a slowly fluctuating variable and does not affect the value of s [the standard deviation of the input current].” La Camera et al. (2004) explore adaptation time constants on the order of 100 ms under the assumption that the adaptation dynamics are “typically slower than the average ISI.” They report that “for irregular spike trains the agreement is remarkable also at very low frequencies, where the condition that the average ISI be smaller than τ_N [the time constant of adaptation] is violated. This may be explained by the fact that although ⟨ISI⟩ > τ_N, the ISI distribution is skewed towards smaller values, and [the mean adaptation current proportional to the firing rate] . . . is still a good approximation.”

In section 7 we used the 1DM master equation to derive a mean+variance-adaptation theory, the next correction to the mean-adaptation theories in La Camera et al. (2004), yielding another explanation for the success reported there. We found that the error in the firing rate in La Camera et al. remained small because the hazard function used there is a nearly linear function of the adaptation variable in the interesting regime where P(g_s) is appreciably nonzero. Thus, perturbing contributions to the average firing rate from deviations of the adaptation variable above and below the mean over the course of one ISI roughly cancel on average. For the neuron model (5DF) and parameters considered here, the hazard function has an appreciable nonlinearity, resulting in erroneous predictions of the firing rate when using a mean-adaptation theory. The mean+variance-adaptation theory derived here corrected the predictions.

It is appropriate to reiterate that both the 1DM master equation and the resulting mean+variance-adaptation theory approximation considered here neglect refractory dynamics. It was demonstrated by the adiabatic reduction of the 5DF model to the 2DM model that the inclusion of a relative refractory period requires a two-dimensional master equation. Indeed, as shown in Figure 6, oscillations emerge for large and fast stimulation changes that are qualitatively captured by the 2DM model but not by the 1DM model. It remains to be seen if a two-dimensional mean- or mean+variance-adaptation theory can be constructed that accounts for this behavior, and under what conditions it can be reduced to a one-dimensional model by simply rescaling the firing rate by f′ = 1/(1/f + τ_eff), as in La Camera et al. (2004) for the absolute refractory period case, where τ_eff is some effective absolute refractory period of the relative refractory mechanism.

In Benda and Herz (2003), a thorough mathematical analysis of several well-known mechanisms for SFA based on biophysical kinetics is performed for the case of a suprathreshold current. A universal phenomenological mean-adaptation model for such biophysical mechanisms for SFA is introduced, with much the same form as later used in La Camera et al. (2004) for the case of noisy drive. Methods are given to completely parameterize the model using quantities that can be easily measured by standard recording techniques. Implications for signal processing are considered there and in subsequent publications (Benda, Longtin, & Maler, 2005).
In Chacron et al. (2003), a novel approach is taken compared to Latham et al. (2000), Fuhrmann et al. (2002), Benda and Herz (2003), and La Camera et al. (2004). There, an expression is derived for the serial correlation coefficient of ISIs in the static case by employing a Markov chain. In their analysis, they define a quantity analogous to the static distribution P†(t_s) here. In their framework, they prove that adaptation of the threshold fatigue form used there results in ISI correlations, as have been observed experimentally (Longtin & Racicot, 1997; Chacron et al., 2000; Nawrot et al., 2007). Their expression, however, contains integrals that require “the computation of the FPT [first-passage time] PDF of the Ornstein-Uhlenbeck process through an exponential boundary. Given that no general analytical expression is available for this quantity, derivation of the correlation from the integrals can be computationally more demanding than estimating the same quantities from simulations” (Chacron et al., 2003). Subsequently, only simulations are performed, and the derived expression is never compared to the simulated result; thus, an important benchmark to ensure that the calculations are correct is missed. It is possible that the numerical techniques used here could be applied to compute a prediction for the correlation coefficient from the expression they derive, which could subsequently be compared to the simulated result.

Mean-adaptation theories cannot be used to model the correlation between subsequent ISIs, as they do not preserve the ensemble statistics. Our approach is simply not to replace the trajectory of the adaptation variable, g_s, by its mean. This resolves the problem in the development in La Camera et al. (2004) that the mean input current and the instantaneous g_s have an equal role in determining the instantaneous firing rate, so that g_s cannot be consistently replaced by its mean. What results is the 1DM master equation presented here. We subsequently calculated an expression for the inhomogeneous conditional ISI distribution and compared it to 5DF Monte Carlo in the static case. Furthermore, we calculated the variation of the mean of the conditional ISI distribution as a function of the preceding ISI, a generalization of the serial correlation coefficient of ISIs, and compared it to 5DF Monte Carlo. By our Markov process master equation, we avoid the difficulty encountered in Chacron et al. (2003) of treating the first passage times of an Ornstein-Uhlenbeck process through an exponential boundary, while capturing the full inhomogeneous ensemble dynamics in a tractable framework.
analogous to the 1DM master equation proposed to extend renewal theory to a class of Markov processes that account for SFA. The adiabatic reduction does not explicitly solve for the firing rate of the given neuron model (without adaptation or the refractory mechanism), nor does it rely on such a solution. We leave the firing rate dynamics of the given neuron model (without adaptation or the refractory mechanism) encapsulated in the hazard function (see equation 3.23). The approach applies to other models of adaptation, such as the adapting threshold models in Chacron et al. (2003) and the current-based adaptation models in La Camera et al. (2004). Concerning the generality of the adiabatic elimination for the adaptation variable, we expect it could be applied to a larger class of formally spiking neuron models with intrinsic dynamics that are fast compared to adaptation. For those interested in modeling a class of neurons for which a solution to equation 3.23 already exists, the framework can be easily and immediately applied. The fitting methods presented allow the connection to be made between models for which an explicit solution for the hazard function is unknown and the 1DM and 2DM models presented here. What results is a reduced state space to explore for functional implications.

The generality of treating the relative refractory mechanism as a slow variable in the adiabatic elimination is less clear. There are some issues that need to be clarified before one could specify the class of neurons to which it applies. Specifically, the relationship between the requirement that the neuron be in the high-conductance state (small effective τ_m) and the requirement that the synapses have nonvanishing time constants (τ_e > 0), resulting in a nonvanishing probability at threshold (P(v_th, ···) > 0), remains to be thoroughly investigated. The delta-conductance-based approach in Meffin et al. (2004), for example, does not satisfy the second requirement. The nonvanishing probability at threshold seems to be a criterion for the neuron to respond quickly to the synaptic signal (Fourcaud & Brunel, 2002; Renart et al., 2004).

An important step in the reduction is the treatment of the synaptic conductances. As their statistics are assumed to instantaneously determine the equilibrium statistics of the membrane potential, they were removed from the master equation. Then differential equations were found for their first statistical moments (means) in terms of the rate of the Poisson process input, as in La Camera et al. (2004). One weakness of the fitting approach used here is that we cannot account for the dynamics of the second central moment (variance), as was done in La Camera et al., and modeling both dynamic excitation and inhibition simultaneously requires a laborious fitting of a two-dimensional space of synaptic inputs. Further work will apply methods such as those due to Moreno-Bote and Parga (2004) to obtain a solution to equation 3.23 without fitting, thus allowing ensemble studies of adapting network models and analysis as in Latham et al. (2000) with the rigor of, for example, Brunel (2000), and the possibility of quantitative agreement with traditional large-scale network simulations.
8.3 Beyond Renewal Theory. We reviewed standard results of inhomogeneous renewal theory in appendix C and uncovered a conceptual error often made in the literature when using the intensity-rescaling transformation to make the transition from homogeneous (static) to inhomogeneous (dynamic) renewal theory. We clarified and remedied the problem by presenting a correct renewal process generation scheme, as discussed in Devroye (1986).

By means of a variable transformation, we provided a link between the 1DM model and inhomogeneous renewal theory methods, allowing direct comparison and contrast. The 1DM master equation was found to have an analogous structure to the renewal master equation, but with a state space spanning the whole real line. Furthermore, the 1DM state is not reborn to a universal initial value upon spiking, as in renewal theory (zero age), but is reinserted at a state that is a function of the state just prior to spiking. This fact introduces a memory into the system and results in negative ISI correlations, as reported in Chacron et al. (2003).

Based on the detailed adiabatic reduction and fitting, we proposed the nested exponential form of the hazard function as given by equation 2.10 and the state-dependent reinsertion function as given by equation 2.7 for the conductance-based SFA mechanism considered here. The hazard function (perhaps time-dependent) and the reinsertion function together are a complete specification of the proposed Markov model, given an initial distribution of states. We provided a numerical recipe to efficiently generate inhomogeneous realizations of the proposed Markov process. With an additional dimension for a relative refractory mechanism, the Markov process faithfully reproduces the transient dynamics and ISI correlations of 5DF Monte Carlo, as expected by the adiabatic reduction.

The same comparison between a one-dimensional Markov process and a neuron model without the relative refractory mechanism was not done, as we found that without refractory mechanisms, the neuron model used exhibited a high probability of spiking just after spiking, due to correlations in the synaptic conductance on the timescale of the refractory mechanism. We feel this is a bug rather than a feature of neuron models without a refractory mechanism; thus, we chose not to build a Markov process to account for it. Furthermore, the proposed relative refractory mechanism requires only slightly more effort than treating an absolute refractory period, as done in Nykamp and Tranchina (2001). When the hazard function calibrated for the 2DM model is used directly for the 1DM model, reasonable agreement with refractory neuron models was still observed for the moderate firing rates considered.

8.4 Suprathreshold Stimuli. For large and rapid changes in stimulus that bring the ensemble predominantly into the suprathreshold regime, the predictions due to numerical solutions of the 2DM model deviated somewhat from 5DF Monte Carlo simulations, as seen in Figure 6. The reasons
for this are twofold. First, the stimulus brings us into a regime where the exponential fitting procedure for h(t_s, t_r) begins to fail and where the fit was poorly populated with data points. This fact likely accounts for the larger amplitude of the oscillations of the 2DM model compared to 5DF Monte Carlo. It is likely that a choice of function that improves the fit in this regime, or a proper analytical solution for h(t_s, t_r), would improve the agreement here. Second, following the large stimulus change, a large portion of the population is in the suprathreshold regime, where neurons make large migrations from the reset potential directly to the threshold following a spike. The 2DM model completely neglects the dynamics of the membrane potential, and thus this migration period, resulting in the phase lead over the full system.

A closer inspection of Figure 6 reveals a transition from predominantly supra- to predominantly subthreshold firing. Shortly after stimulus onset, a large portion of the population fires almost immediately and is reinserted with the adaptation conductance increased by q_s, that is, a mass exodus in phase space. For the 2D case, the neurons also start a refractory period upon reinsertion; in the 1D case, they do not. The stimulus is sufficiently strong that in the 2D case, it is still suprathreshold following the refractory period. In the 1D case, there is no refractory period, the neurons can fire immediately following a spike cycle, and no lull is seen in the firing rate following the initial mass exodus. For the 2D case, and even the renewal case, the system is refractory following the mass exodus, and a lull in the firing rate results, peaking again as the neurons are released from the refractory state. With the accumulation of adaptation, the fraction of the ensemble participating in subsequent exodus events is ever diminished, as more and more neurons enter the subthreshold regime, where neurons survive for highly variable durations following the refractory period.

Thus, for large stimuli that keep the neuron suprathreshold over several spikes, the population is initially synchronized, firing at a rate determined by the refractory mechanism. As adaptation accumulates, spiking becomes more irregular, and the neurons desynchronize. The desynchronization of neurons driven by suprathreshold current has been observed experimentally in Mainen and Sejnowski (1995). It is shown there that this adaptation-induced transition to the subthreshold regime is not strictly necessary for the neurons to desynchronize, owing to the constant presence of noise. However, adaptation is also a mechanism by which an otherwise irregularly firing neural ensemble is synchronized at a stimulus onset. Following such a synchronization, the transition from the predominantly suprathreshold regime to the predominantly subthreshold regime induced by the accumulation of adaptation is akin to a transition from a noisy oscillator firing mode to a point process firing mode. While the ensemble would gradually desynchronize in the noisy oscillator firing mode, the transition to the point process firing mode over only a few spikes ensures that this occurs much more rapidly. Thus, adaptation works to both synchronize and then desynchronize the ensemble at changes in stimulus. This is an implementation of novelty
detection and is related to the high-pass filtering properties of SFA reported in Benda et al. (2005). Evidence for a synchronization-desynchronization coding scheme for natural communication signals was recently reported for the spike-frequency adapting P-unit electroreceptors of weakly electric fish in Benda, Longtin, and Maler (2006).

9 Conclusion

In this letter, we have focused on establishing a framework for rigorously treating the dynamic effects of spike-frequency adaptation and refractory mechanisms on neural ensemble spiking. The resulting master equation formalism unifies renewal theory models and previous studies on adaptation (e.g., Latham et al., 2000; Fuhrmann et al., 2002; Chacron et al., 2003; Benda & Herz, 2003; La Camera et al., 2004) into an ensemble, or population density, framework such as those due to Knight (1972, 2000), Brunel (2000), Omurtag et al. (2000), Nykamp and Tranchina (2000), Fourcaud and Brunel (2002), Richardson (2004), Rudolph and Destexhe (2005), Meffin et al. (2004), and Moreno-Bote and Parga (2004). The resulting methods are new and powerful tools for accurately modeling spike-frequency adaptation, an aspect of neuron dynamics ubiquitous in excitatory neurons that has been largely ignored in neural ensemble studies thus far due to the added difficulties of treating the extra state variable. By distilling the detailed neuron model down to two essential dimensions, spike-frequency adaptation and a relative refractory period, using an adiabatic elimination, their central role in perturbing neural firing is emphasized.

Future work will employ the framework to examine the functional implications of spike-frequency adaptation. Given the variety of intriguing and prominent consequences, such as interspike interval correlations and transient synchronization following stimulus changes, one is compelled to question whether spike-frequency adaptation can be neglected when considering, for example, the dynamic nature of the neural code (Shadlen & Newsome, 1998; Rieke, Warland, de Ruyter van Steveninck, & Bialek, 1997), the propagation of synchrony (Abeles, 1991; Diesmann, Gewaltig, & Aertsen, 1999), or the function of spike-timing-based learning rules (Gerstner, Kempter, van Hemmen, & Wagner, 1996; Song, Miller, & Abbott, 2000).

Appendix A: Neuron and Adaptation Model Parameters

The parameters of the 5DF neuron model given in equations 3.13 to 3.15 were determined by fitting to a single-compartment Hodgkin-Huxley (HH) model of a pyramidal neuron under various conditions using NEURON (Hines & Carnevale, 1997), as described in Muller (2003). The HH model and parameters are taken from Destexhe, Contreras, and Steriade (1998). The phenomenological mechanism for spike-frequency adaptation (SFA) used here, the counterpart to the M-current and AHP-current mechanisms
in the HH model, was inspired by Dayan and Abbott (2001); similar models have been proposed in Koch (1999) and Fuhrmann et al. (2002) and more recently generalized in Brette and Gerstner (2005). Additionally, a relative refractory period (RELREF) mechanism identical to the SFA mechanism was added, but with a much shorter time constant and a much larger conductance increase. Both the SFA and RELREF mechanisms consist of an action-potential (AP) activated and exponentially decaying conductance coupled to an inhibiting reversal potential, so that the standard membrane equation takes the form

\[
c_m \frac{dv(t)}{dt} = g_l (E_l - v(t)) + g_e(t)(E_e - v(t)) + g_i(t)(E_i - v(t)) + g_s(t)(E_s - v(t)) + g_r(t)(E_r - v(t)).
\]

If v exceeds the threshold v_th:

- v is reset to v_reset.
- g_s → g_s + q_s.
- g_r → g_r + q_r.
- The time of threshold crossing is added to the list of spike times.

All conductances, g_x(t), where x ∈ {s, r, e, i}, are governed by an equation of the form

\[
\frac{dg_x(t)}{dt} = -\frac{1}{\tau_x} g_x(t).
\]

The arrival of a spike at a synapse triggers g_x → g_x + q_x for x ∈ {e, i}. Poisson processes were used to supply spike trains to the 1000 excitatory and 250 inhibitory synapses, with Poisson rates in the range of 3 to 20 Hz, as described in the text for each specific simulation. The synaptic model and parameters were directly transferred from the HH models, while the remaining parameters, as determined by fits to the HH model, are given in Table 3; a minimal simulation sketch follows the table.

Appendix B: Further Details on the Adiabatic Reduction

In this appendix, we give the mathematical steps that lead from equation 3.16 to 3.20 in a more detailed way. For notational simplicity, we introduce the five-dimensional state variable x = (v, g_e, g_i, g_s, g_r). The indices 1, 2, 3, 4, 5 correspond to v, e, i, s, r, as used in the definition of the neuron model in equations 3.13 to 3.15 (e.g., τ_2 := τ_e). The partial derivatives with respect to x_µ are denoted by ∂_µ and with respect to time by ∂_t. Furthermore, we define P(x_1, x_2, x_3, x_4, x_5) = 0 if one or more of the conductances x_2, …, x_5 is negative.
Table 3: Neuron and Synapse Model Parameters Used for Simulations of the Full System (5DF) Given by Equations 3.13 to 3.15.

Parameter | Description | Value
v_th | Threshold voltage | −57 mV
v_reset | Reset voltage | −70 mV
c_m | Membrane capacitance | 289.5 pF
g_l | Membrane leak conductance | 28.95 nS
E_l | Membrane reversal potential | −70 mV
q_r | RELREF quantal conductance increase | 3214 nS
τ_r | RELREF conductance decay time | 1.97 ms
E_r | RELREF reversal potential | −70 mV
q_s | SFA quantal conductance increase | 14.48 nS
τ_s | SFA conductance decay time | 110 ms
E_s | SFA reversal potential | −70 mV
E_e, E_i | Reversal potential of excitatory and inhibitory synapses, respectively | 0 mV, −75 mV
q_e, q_i | Excitatory and inhibitory synaptic quantal conductance increase | 2 nS
τ_e, τ_i | Excitatory and inhibitory synaptic decay time | 1.5 ms, 10.0 ms
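To make the model concrete, here is a minimal forward-Euler sketch (not the authors' NEURON-fitted implementation; NumPy is assumed, and the small time step is needed because the total conductance is large just after a spike):

```python
# A minimal forward-Euler sketch of the 5DF neuron of Appendix A with the
# parameters of Table 3 (SI units).  Not the authors' code.
import numpy as np

def simulate_5df(t_end=2.0, dt=1e-5, rate_e=10.0, rate_i=10.0,
                 rng=np.random.default_rng(0)):
    vth, vreset = -57e-3, -70e-3
    cm, gl, El = 289.5e-12, 28.95e-9, -70e-3
    qr, taur, Er = 3214e-9, 1.97e-3, -70e-3        # relative refractoriness
    qs, taus, Es = 14.48e-9, 110e-3, -70e-3        # spike-frequency adaptation
    Ee, Ei, qe, qi = 0.0, -75e-3, 2e-9, 2e-9
    taue, taui, Ne, Ni = 1.5e-3, 10e-3, 1000, 250  # synapses
    v, ge, gi, gs, gr = El, 0.0, 0.0, 0.0, 0.0
    spikes = []
    for step in range(int(t_end / dt)):
        # Poisson input: total arrivals over all synapses in one time step
        ge += qe * rng.poisson(Ne * rate_e * dt)
        gi += qi * rng.poisson(Ni * rate_i * dt)
        v += dt / cm * (gl * (El - v) + ge * (Ee - v) + gi * (Ei - v)
                        + gs * (Es - v) + gr * (Er - v))
        ge -= dt * ge / taue
        gi -= dt * gi / taui
        gs -= dt * gs / taus
        gr -= dt * gr / taur
        if v > vth:
            v = vreset
            gs += qs     # adaptation conductance jump
            gr += qr     # refractory conductance jump
            spikes.append(step * dt)
    return np.array(spikes)
```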
The master equation governing the evolution of the probability density P(x, t) may be formulated as a conservation equation:

\[
\partial_t P(x, t) = -\operatorname{div} J(x, t) + \delta(x_1 - v_{reset})\, J_1(v_{th}, x_2, x_3, x_4 - q_4, x_5 - q_5, t).
\tag{B.1}
\]

The second term on the right-hand side of equation B.1 accounts for neurons that cross the threshold surface x_1 = v_th at time t with the state variables (v_th, x_2, x_3, x_4 − q_4, x_5 − q_5) and are reinserted at (v_reset, x_2, x_3, x_4, x_5). The probability current J(x, t) is determined by the underlying stochastic differential equations 3.13 to 3.15. The components J_µ(x, t) for µ = 1, …, 5 consist of the current due to the drift terms, β_µ(x), and for µ = 2, 3 of additional currents due to the excitatory and inhibitory input Poisson spike trains, respectively. The drift term for the membrane potential reads

\[
\beta_1(x) := \frac{1}{c_m} \left( \sum_{\mu=2}^{5} x_\mu (E_\mu - x_1) + g_l (E_l - x_1) \right).
\tag{B.2}
\]

For the conductances x_µ with µ = 2, …, 5, the drift terms are

\[
\beta_\mu(x) = \beta_\mu(x_\mu) := -\frac{1}{\tau_\mu} x_\mu.
\tag{B.3}
\]
The components of the probability current for µ = 1, 4, 5 obey the equation

\[
J_\mu(x, t) = \beta_\mu(x)\, P(x, t).
\tag{B.4}
\]
For the excitatory synaptic conductance x_2, the component J_2(x, t) is

\[
J_2(x, t) = \beta_2(x)\, P(x, t)
+ \int_0^{x_2}\!\!\int_0^{\infty} W_2(y_2, y_1, t)\, P(x_1, y_1, x_3, \ldots, x_5)\, dy_2\, dy_1
- \int_0^{x_2}\!\!\int_0^{\infty} W_2(y_1, y_2, t)\, P(x_1, y_2, x_3, \ldots, x_5)\, dy_2\, dy_1.
\tag{B.5}
\]
The component J_3(x, t) has a similar form, with obvious modifications. Since the synaptic input is modeled as a Poisson process, the transition rates W_µ(y_1, y_2, t) for µ = 2, 3 may be written as

\[
W_\mu(y_1, y_2, t) = \nu_\mu(t)\, \delta\big(y_1 - (y_2 + q_\mu)\big),
\tag{B.6}
\]
given the presynaptic firing rates ν_µ(t). The diffusion approximation can be obtained by a Kramers-Moyal expansion of the components J_2 and J_3 (Gardiner, 1985).

B.1 Integration. To obtain an equation for the marginal probability distribution, P(x_4, x_5, t), one integrates equation B.1 over x_1, x_2, x_3. The integral of the terms ∂_µ J_µ(x, t) on the right-hand side of B.1 for µ = 2, 3 vanishes due to the boundary condition that the probability current vanishes in the limits x_µ → 0 and x_µ → ∞ for µ = 2, 3:

\[
\int_0^{\infty} \partial_\mu J_\mu(x, t)\, dx_\mu = \lim_{x_\mu \to \infty} J_\mu(x, t) - J_\mu(x, t)\Big|_{x_\mu = 0} = 0.
\tag{B.7}
\]
The component J_1(x, t) yields a nonvanishing contribution:

\[
\int_0^{\infty}\!\!\int_0^{\infty}\!\!\int_{-\infty}^{v_{th}} \partial_1 J_1(x, t)\, dx_1\, dx_2\, dx_3
= \int_0^{\infty}\!\!\int_0^{\infty} J_1(v_{th}, x_2, \ldots, x_5, t)\, dx_2\, dx_3.
\tag{B.8}
\]
The reinsertion term involves an integration over a delta distribution:

\[
\int_0^{\infty}\!\!\int_0^{\infty}\!\!\int_{-\infty}^{v_{th}} \delta(x_1 - v_{reset})\, J_1(v_{th}, x_2, x_3, x_4 - q_4, x_5 - q_5, t)\, dx_1\, dx_2\, dx_3
= \int_0^{\infty}\!\!\int_0^{\infty} J_1(v_{th}, x_2, x_3, x_4 - q_4, x_5 - q_5, t)\, dx_2\, dx_3.
\tag{B.9}
\]
Integration of the left-hand side of equation B.1 results in

\[
\int_0^{\infty}\!\!\int_0^{\infty}\!\!\int_{-\infty}^{v_{th}} \partial_t P(x, t)\, dx_1\, dx_2\, dx_3 = \partial_t P(x_4, x_5, t).
\tag{B.10}
\]
Plugging these results into equation 3.16 yields

\[
\partial_t P(x_4, x_5, t) = -\sum_{\mu=4,5} \partial_\mu \big( \beta_\mu(x_\mu)\, P(x_4, x_5, t) \big)
+ \int_0^{\infty}\!\!\int_0^{\infty} J_1(v_{th}, x_2, x_3, x_4 - q_4, x_5 - q_5, t)\, dx_2\, dx_3
- \int_0^{\infty}\!\!\int_0^{\infty} J_1(v_{th}, x_2, \ldots, x_5, t)\, dx_2\, dx_3.
\tag{B.11}
\]
Returning to the initial notation and using the definition J_1(x, t) = β_1(x) P(x, t) yields equation 3.20.

Appendix C: Inhomogeneous Renewal Processes

The proposed Markov models can be easily understood by analogy to inhomogeneous renewal theory. We review some standard results thereof, which define the renewal models used in the text. Poisson and gamma renewal processes, both special cases of a renewal process, are used extensively to model the spike train statistics of cortical neurons (van Rossum, Bi, & Turrigiano, 2000; Song & Abbott, 2001; Rudolph & Destexhe, 2003b; Shelley et al., 2002) and are treated in detail in sections 5.2, 5.3, 6.2.2, and 6.3.2 of Gerstner and Kistler (2002). While the treatment in section 6.2.2 there is developed for spike response model neurons with escape noise, and that in section 6.3.2 for populations of neurons satisfying a few basic assumptions, it is not explicitly stated there that the analysis is that of an arbitrary inhomogeneous renewal process, though it is mentioned in Gerstner (2001). We first reiterate this fact by reproducing the main results of sections 6.2.2 and 6.3.2 of Gerstner and Kistler (2002) using an inhomogeneous generalization of the notation of Cox (1962), a classic reference work on homogeneous renewal theory. Second, we present a recipe for efficiently generating spike trains of a general inhomogeneous renewal process. In what follows, by working exclusively with the inhomogeneous hazard function, we avoid the pitfall of studies that erroneously assume that an intensity-rescaling transformation of a stationary gamma renewal process
with parameter γ (or a) yields an inhomogeneous gamma renewal process with the same parameter γ (or a) (Barbieri, Quirk, Frank, Wilson, & Brown, 2001; Gazères, Borg-Graham, & Frégnac, 1998).

A renewal process is defined here to be inhomogeneous when its hazard function takes the form ρ(τ, t), where τ denotes the age and t denotes the explicit dependence of the hazard function on time. The ensemble firing rate at t (referred to as the population activity, A(t), in Gerstner & Kistler, 2002), denoted by α(t), is the expectation value

\[
\alpha(t) = \langle \rho \rangle(t) = \int_0^{\infty} \rho(s, t)\, f^-(s, t)\, ds,
\tag{C.1}
\]
where f^−(τ, t) denotes the probability density function (PDF) of times since the last renewal, also called the backward recurrence time in Cox (1962). The PDF f^−(τ, t) can be determined by reasoning that the probability that the system has an age in the interval (τ, τ + Δτ) is equal to the probability that there is a renewal in the time interval (t − τ, t − τ + Δτ) and that the system subsequently survives until t. This yields

\[
f^-(\tau, t) = \alpha(t - \tau)\, F(\tau, t - \tau),
\tag{C.2}
\]
where F(Δt, t) is the inhomogeneous survival function, representing the probability that the system survives for a time Δt after t. Generalizing equation 1.2.10 in Cox (1962) to the inhomogeneous case, we have

\[
F(\Delta t, t) = \exp\left( -\int_0^{\Delta t} \rho(s, t + s)\, ds \right).
\tag{C.3}
\]
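Equation C.3 is straightforward to evaluate numerically. The following is a minimal sketch (not the authors' code; SciPy is assumed, and the hazard in the usage comment is hypothetical):

```python
# A minimal sketch evaluating the inhomogeneous survival function of
# equation C.3 by quadrature, for a user-supplied hazard rho(tau, t).
import numpy as np
from scipy.integrate import quad

def survival(delta_t, t, rho):
    """Probability that the system survives for a time delta_t after t."""
    integral, _ = quad(lambda s: rho(s, t + s), 0.0, delta_t)
    return np.exp(-integral)

# e.g., a hypothetical hazard rising with age toward 30 Hz:
# rho = lambda tau, t: 30.0 * (1.0 - np.exp(-tau / 0.01))
# survival(0.05, 0.0, rho)
```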
Plugging equation C.2 into C.1 results in the equivalent of equations 6.44 and 6.45 of Gerstner and Kistler (2002).

A differential formulation of equations C.1 to C.3 is possible. First, note that age increases with increasing t, and thus dτ/dt = 1. This suggests a transform of the age variable, τ → τ′ = t − τ, as in equation 6.46 of Gerstner and Kistler (2002). This new age variable, τ′, is stationary under the evolution of t. We define the stationary backward recurrence time PDF as

\[
f_s^-(\tau', t) := f^-(t - \tau', t).
\]
Thus,

\[
\frac{d}{dt} f_s^-(\tau', t) = \frac{\partial}{\partial t} f_s^-(\tau', t) = \frac{\partial}{\partial t} \big( \alpha(\tau')\, F(t - \tau', \tau') \big),
\]

and differentiation of equation C.3 yields dF(t − τ′, τ′)/dt = −F(t − τ′, τ′) ρ(t − τ′, t), whereby we have

\[
\frac{d}{dt} f_s^-(\tau', t) = -\alpha(\tau')\, F(t - \tau', \tau')\, \rho(t - \tau', t) = -f_s^-(\tau', t)\, \rho(t - \tau', t).
\tag{C.4}
\]
This relation determines d f_s^−(τ′, t)/dt for τ′ ∈ (−∞, t). Additionally, we need to ensure that the normalization of f_s^−(τ, t) is preserved, namely, that

\[
\int_{-\infty}^{\infty} \frac{\partial}{\partial t} f_s^-(\tau, t)\, d\tau = 0.
\tag{C.5}
\]
Splitting the integral into three regions of interest, we have

\[
\lim_{\Delta t \to 0^+} \left( \int_{-\infty}^{t - \Delta t} + \int_{t - \Delta t}^{t + \Delta t} + \int_{t + \Delta t}^{\infty} \right) \frac{\partial}{\partial t} f_s^-(s, t)\, ds = 0.
\]

Since f_s^−(τ′ > t, t) = 0, the third integral is zero. We then have

\[
\lim_{\Delta t \to 0^+} \int_{t - \Delta t}^{t + \Delta t} \frac{\partial}{\partial t} f_s^-(s, t)\, ds = -\int_{-\infty}^{t} \frac{\partial}{\partial t} f_s^-(s, t)\, ds.
\]

Since the contribution from equation C.4 to the integral on the left-hand side is vanishing, we need to add an additional term to d f_s^−(τ, t)/dt, which is zero for all τ ≠ t but which has a finite integral when integrating around t. This can be achieved by the addition of a δ-function term:

\[
\frac{d}{dt} f_s^-(\tau, t) \to \frac{d}{dt} f_s^-(\tau, t) - \delta(\tau - t) \int_{-\infty}^{t} \frac{\partial}{\partial t} f_s^-(s, t)\, ds.
\]
Clearly, the factor behind the δ-function is α(t). Thus, we have the final form,

\[
\frac{d}{dt} f_s^-(\tau', t) =
\begin{cases}
-f_s^-(\tau', t)\, \rho(t - \tau', t), & \tau' < t \\
0, & \tau' \ge t
\end{cases}
\; + \; \alpha(t)\, \delta(\tau' - t),
\tag{C.6}
\]

defined for τ′ ∈ (−∞, ∞). Equation C.6 expressed in terms of f^−(τ, t) results in the master equation,

\[
\frac{\partial}{\partial t} f^-(\tau, t) = -\frac{\partial}{\partial \tau} f^-(\tau, t) - f^-(\tau, t)\, \rho(\tau, t) + \alpha(t)\, \delta(\tau),
\tag{C.7}
\]
(C.7)
defined for τ ∈ [0, ∞). This is exactly equation 6.43 in Gerstner and Kistler (2002). It is analogous to equation 2.9, but with reinsertion to τ = 0 after a spike. C.1 Numerical Solution of the Renewal Master Equation. As the renewal master equation C.7 is just a special case of the 1DM master equation, it can be solved with the same numerical techniques as described in section 5.1. The content of the δ term in equation C.7 is that all probability lost due to the death term (the second term on the rhs) is accumulated and reinserted to the τ = 0 bin. Thus, we are spared the complication of treating state-dependent reinsertion of probability, as was necessary for the 1DM and 2DM master equations. C.2 Generating Random Numbers of a General Hazard Function. Random numbers with a general hazard function can be generated by the thinning method as discussed in Devroye (1986) and summarized here. The maximum of the hazard function, ρmax = maxτ,t (ρ(τ, t)), must be known. Sequential event intervals are generated using a homogeneous Poisson process with a rate of ρmax . The resulting spike train is then sequentially thinned, given the event time t and time since last event τ , by the rule: 1. Generate a uniform random number, T, on [0, 1). 2. if ρ(τ, t)/ρmax > T, keep the event in the spike train; otherwise remove it. The remaining event times are consistent with the prescribed hazard function. For the case of random number generation for a GRP, evaluation of ρ(τ, t) using equation 4.6 is numerically unstable for large τ and costly. An implementation of the algorithm (Shea, 1988) for calculating the cumulative hazard function of a gamma renewal process is available in pgamma.c of the Mathlib of the R statistics environment (Ihaka & Gentleman, 1996)
For the case of random number generation for a GRP, evaluation of ρ(τ, t) using equation 4.6 is numerically unstable for large τ and costly. An implementation of the algorithm (Shea, 1988) for calculating the cumulative hazard function of a gamma renewal process is available in pgamma.c of the Mathlib of the R statistics environment (Ihaka & Gentleman, 1996) under the GNU Public License. Alternatively, the logarithm of the function gsl_sf_gamma_inc_Q provided by the GNU Scientific Library can be used. The hazard function can then be calculated by a simple discrete-difference approximation of the derivative. Time dependence can be introduced by giving a time dependence to the parameters of the gamma distribution.

Acknowledgments

This work is supported in part by the European Union under the grant IST-2005-15879 (FACETS). Thanks to M. Rudolph, A. Destexhe, M. Diesmann, N. Brunel, M. Pospischil, and N. Nurpeissov for helpful discussions. Thanks to the reviewers for their many detailed and insightful comments, which have had a significant and positive impact on this letter.

References

Abeles, M. (1991). Corticonics. Cambridge: Cambridge University Press.
Amit, D. J., & Tsodyks, M. V. (1991). Quantitative study of attractor neural network retrieving at low spike rates: I. Substrate-spikes, rates and neuronal gain. Network, 2, 259–273.
Barbieri, R., Quirk, M., Frank, L., Wilson, M., & Brown, E. (2001). Construction and analysis of non-Poissonian stimulus-response models of neural spiking activity. Journal of Neuroscience Methods, 105, 25–37.
Benda, J., & Herz, A. (2003). A universal model for spike-frequency adaptation. Neural Computation, 15, 2523–2564.
Benda, J., Longtin, A., & Maler, L. (2005). Spike-frequency adaptation separates transient communication signals from background oscillations. Journal of Neuroscience, 25, 2312–2321.
Benda, J., Longtin, A., & Maler, L. (2006). A synchronization-desynchronization code for natural communication signals. Neuron, 52, 347–358.
Brette, R., & Gerstner, W. (2005). Adaptive exponential integrate-and-fire model as an effective description of neuronal activity. Journal of Neurophysiology, 94, 3637–3642.
Brunel, N. (2000). Dynamics of sparsely connected networks of excitatory and inhibitory spiking neurons. Journal of Computational Neuroscience, 8, 183–208.
Brunel, N., Chance, F. S., Fourcaud, N., & Abbott, L. F. (2001). Effects of synaptic noise and filtering on the frequency response of spiking neurons. Physical Review Letters, 86, 2186–2189.
Chacron, M., Longtin, A., & Maler, L. (2000). Suprathreshold stochastic firing dynamics with memory in P-type electroreceptors. Physical Review Letters, 85, 1576–1579.
Chacron, M., Pakdaman, K., & Longtin, A. (2003). Interspike interval correlations, memory, adaptation, and refractoriness in a leaky integrate-and-fire model with threshold fatigue. Neural Computation, 15, 253–278.
Cox, D. R. (1962). Renewal theory. London: Methuen.
Dayan, P., & Abbott, L. (2001). Theoretical neuroscience: Computational and mathematical modeling of neural systems. Cambridge, MA: MIT Press.
Destexhe, A., Contreras, D., & Steriade, M. (1998). Mechanisms underlying the synchronizing action of corticothalamic feedback through inhibition of thalamic relay cells. Journal of Neurophysiology, 79, 999–1016.
Destexhe, A., Rudolph, M., Fellous, J., & Sejnowski, T. J. (2001). Fluctuating synaptic conductances recreate in vivo–like activity in neocortical neurons. Neuroscience, 107, 13–24.
Destexhe, A., Rudolph, M., & Paré, D. (2003). The high-conductance state of neocortical neurons in vivo. Nature Reviews Neuroscience, 4, 739–751.
Devroye, L. (1986). Non-uniform random variate generation. New York: Springer-Verlag.
Diesmann, M., & Gewaltig, M.-O. (2002). NEST: An environment for neural systems simulations. In T. Plesser & V. Macho (Eds.), Forschung und wissenschaftliches Rechnen, Beiträge zum Heinz-Billing-Preis 2001 (Vol. 58, pp. 43–70). Göttingen: Ges. für Wiss. Datenverarbeitung.
Diesmann, M., Gewaltig, M.-O., & Aertsen, A. (1999). Stable propagation of synchronous spiking in cortical neural networks. Nature, 402, 529–533.
Ermentrout, B., Pascal, M., & Gutkin, B. (2001). The effects of spike frequency adaptation and negative feedback on the synchronization of neural oscillators. Neural Computation, 13, 1285–1310.
Fourcaud, N., & Brunel, N. (2002). Dynamics of the firing probability of noisy integrate-and-fire neurons. Neural Computation, 14, 2057–2110.
Fuhrmann, G., Markram, H., & Tsodyks, M. (2002). Spike frequency adaptation and neocortical rhythms. Journal of Neurophysiology, 88, 761–770.
Gardiner, C. W. (1984). Adiabatic elimination in stochastic systems. I. Formulation of methods and application to few-variable systems. Physical Review A, 29, 2814–2823.
Gardiner, C. W. (1985). Handbook of stochastic methods. Berlin: Springer-Verlag.
Gazères, N., Borg-Graham, L. J., & Frégnac, Y. (1998). A phenomenological model of visually evoked spike trains in cat geniculate nonlagged X-cells. Visual Neuroscience, 15, 1157–1174.
Gerstner, W. (1995). Time structure of the activity in neural network models. Physical Review E, 51, 738–758.
Gerstner, W. (2000). Population dynamics of spiking neurons: Fast transients, asynchronous states, and locking. Neural Computation, 12, 43–89.
Gerstner, W. (2001). Coding properties of spiking neurons: Reverse and cross-correlations. Neural Networks, 14, 599–610.
Gerstner, W., Kempter, R., van Hemmen, J. L., & Wagner, H. (1996). A neuronal learning rule for sub-millisecond temporal coding. Nature, 383, 76–78.
Gerstner, W., & Kistler, W. (2002). Spiking neuron models: Single neurons, populations, plasticity. Cambridge: Cambridge University Press.
Haken, H. (1983). Synergetics (3rd ed.). Berlin: Springer-Verlag.
Hines, M. L., & Carnevale, N. T. (1997). The NEURON simulation environment. Neural Computation, 9, 1179–1209.
Ihaka, R., & Gentleman, R. (1996). R: A language for data analysis and graphics. Journal of Computational and Graphical Statistics, 5(3), 299–314.
Knight, B. W. (1972). Dynamics of encoding in a population of neurons. Journal of General Physiology, 59, 734–766.
Knight, B. W. (2000). Dynamics of encoding in neuron populations: Some general mathematical features. Neural Computation, 12, 473–518.
Knight, B. W., Omurtag, A., & Sirovich, L. (2000). The approach of a neuron population firing rate to a new equilibrium: An exact theoretical result. Neural Computation, 12, 1045–1055.
Koch, C. (1999). Biophysics of computation: Information processing in single neurons. New York: Oxford University Press.
La Camera, G., Rauch, A., Lüscher, H.-R., Senn, W., & Fusi, S. (2004). Minimal models of adapted neuronal response to in vivo–like input currents. Neural Computation, 16, 2101–2124.
Latham, P. E., Richmond, B. J., Nelson, P. G., & Nirenberg, S. (2000). Intrinsic dynamics in neuronal networks. I. Theory. Journal of Neurophysiology, 83, 808–827.
Lindner, B., & Longtin, A. (2003). Nonrenewal spike trains generated by stochastic neuron models. In SPIE Proceedings 5114 (Fluctuations and Noise) (pp. 209–218). Bellingham, WA: SPIE.
Longtin, A., & Racicot, D. (1997). Spike train patterning and forecastability. Biosystems, 40, 111–118.
Mainen, Z., & Sejnowski, T. J. (1995). Reliability of spike timing in neocortical neurons. Science, 268, 1503–1505.
McCormick, D., Connors, B., Lighthall, J., & Prince, D. (1985). Comparative electrophysiology of pyramidal and sparsely spiny stellate neurons of the neocortex. Journal of Neurophysiology, 54, 782–806.
Meffin, H., Burkitt, A. N., & Grayden, D. B. (2004). An analytical model for the “large, fluctuating synaptic conductance state” typical of neocortical neurons in vivo. Journal of Computational Neuroscience, 16, 159–175.
Moreno-Bote, R., & Parga, N. (2004). Role of synaptic filtering on the firing response of simple model neurons. Physical Review Letters, 92, 028102.
Moreno-Bote, R., & Parga, N. (2005). Membrane potential and response properties of populations of cortical neurons in the high conductance state. Physical Review Letters, 94, 088103.
Muller, E. (2003). Simulation of high-conductance states in cortical neural networks. M.Sc. thesis, University of Heidelberg.
Nawrot, M. P., Boucsein, C., Rodriguez-Molina, V., Aertsen, A., Grün, S., & Rotter, S. (2007). Serial interval statistics of spontaneous activity in cortical neurons in vivo and in vitro. Neurocomputing, 70, 1717–1722.
Nykamp, D. Q., & Tranchina, D. (2000). A population density approach that facilitates large-scale modeling of neural networks: Analysis and an application to orientation tuning. Journal of Computational Neuroscience, 8, 51–63.
Nykamp, D. Q., & Tranchina, D. (2001). A population density approach that facilitates large-scale modeling of neural networks: Extension to slow inhibitory synapses. Neural Computation, 13, 511–546.
O’Brien, B. J., Isayama, T., Richardson, R., & Berson, D. M. (2002). Intrinsic physiological properties of cat retinal ganglion cells. Journal of Physiology, 538, 787–802.
Omurtag, A., Knight, B. W., & Sirovich, L. (2000). On the simulation of large populations of neurons. Journal of Computational Neuroscience, 8, 51–63.
Papoulis, A., & Pillai, S. U. (1991). Probability, random variables and stochastic processes (4th ed.). New York: McGraw-Hill.
Renart, A., Brunel, N., & Wang, X.-J. (2004). Mean-field theory of irregularly spiking neuronal populations and working memory in recurrent cortical networks. In J. Feng (Ed.), Computational neuroscience: A comprehensive approach (pp. 431–490). New York: Chapman & Hall/CRC.
Ricciardi, L. (1977). Diffusion processes and related topics in biology. Berlin: Springer-Verlag.
Richardson, M. J. E. (2004). Effects of synaptic conductance on the voltage distribution and firing rate of spiking neurons. Physical Review E, 69(5), 051918.
Richardson, M. J. E., & Gerstner, W. (2005). Synaptic shot noise and conductance fluctuations affect the membrane voltage with equal significance. Neural Computation, 17, 923–947.
Rieke, F., Warland, D., de Ruyter van Steveninck, R., & Bialek, W. (1997). Spikes: Exploring the neural code. Cambridge, MA: MIT Press.
Risken, H. (1996). The Fokker-Planck equation: Methods of solution and applications (2nd ed.). Berlin: Springer-Verlag.
Rudolph, M., & Destexhe, A. (2003a). Characterization of subthreshold voltage fluctuations in neuronal membranes. Neural Computation, 15, 2577–2618.
Rudolph, M., & Destexhe, A. (2003b). The discharge variability of neocortical neurons during high-conductance states. Neuroscience, 119, 855–873.
Rudolph, M., & Destexhe, A. (2005). An extended analytic expression for the membrane potential distribution of conductance-based synaptic noise. Neural Computation, 17, 2301–2315.
Shadlen, M., & Newsome, W. (1998). The variable discharge of cortical neurons: Implications for connectivity, computation, and information coding. Journal of Neuroscience, 18, 3870–3896.
Shea, B. L. (1988). Algorithm AS 239, incomplete gamma function. Applied Statistics, 37, 466–473.
Shelley, M., McLaughlin, D., Shapley, R., & Wielaard, D. (2002). States of high conductance in a large-scale model of the visual cortex. Journal of Computational Neuroscience, 13, 93–109.
Siegert, A. J. F. (1951). On the first passage time probability function. Physical Review, 81, 617–623.
Smith, G. D., Cox, C. L., Sherman, S. M., & Rinzel, J. (2001). A firing-rate model of spike-frequency adaptation in sinusoidally-driven thalamocortical relay neurons. Thalamus and Related Systems, 1, 135–156.
Softky, W. R., & Koch, C. (1993). The highly irregular firing of cortical cells is inconsistent with temporal integration of random EPSPs. Journal of Neuroscience, 13, 334–350.
Song, S., & Abbott, L. F. (2001). Cortical remapping through spike timing-dependent synaptic plasticity. Neuron, 32, 1–20.
Song, S., Miller, K., & Abbott, L. F. (2000). Competitive Hebbian learning through spike-timing-dependent synaptic plasticity. Nature Neuroscience, 3, 919–926.
3010
E. Muller, L. Buesing, J. Schemmel, and K. Meier
van Rossum, M. C. W., Bi, G. Q., & Turrigiano, G. G. (2000). Stable Hebbian learning from spike timing-dependent synaptic plasticity. Journal of Neuroscience, 20(23), 8812–8821. van Vreeswijk, C. A., & Hansel, D. (2001). Patterns of synchrony in neural networks with spike adaptation. Neural Computation, 13, 959–992. Whittaker, E. T., & Robinson, G. (1967). The calculus of observations: A treatise on numerical mathematics (4th ed.). New York: Dover. Wilson, H. R., & Cowan, J. D. (1972). Excitatory and inhibitory interactions in localized populations of model neurons. Biophysical Journal, 12, 1–24.
Received November 7, 2005; accepted February 15, 2007.
LETTER
Communicated by Bert Kappen

Phase Transition and Hysteresis in an Ensemble of Stochastic Spiking Neurons

Andreas Kaltenbrunner
[email protected]
Vicenç Gómez
[email protected]
Vicente López
[email protected]
Universitat Pompeu Fabra, Departament de Tecnologia, 08003 Barcelona, Spain, and Barcelona Media Centre d'Innovació, 08003 Barcelona, Spain
An ensemble of stochastic nonleaky integrate-and-fire neurons with global, delayed, and excitatory coupling and a small refractory period is analyzed. Simulations with adiabatic changes of the coupling strength indicate the presence of a phase transition accompanied by a hysteresis around a critical coupling strength. Below the critical coupling, the production of spikes in the ensemble is governed by the stochastic dynamics, whereas for coupling greater than the critical value, the stochastic dynamics loses its influence and the units organize into several clusters with self-sustained activity. All units within one cluster spike in unison, and the clusters themselves are phase-locked. Theoretical analysis leads to upper and lower bounds for the average interspike interval of the ensemble, valid for all possible coupling strengths. The bounds allow us to calculate the limit behavior for large ensembles and to characterize the phase transition analytically. These results may be extensible to pulse-coupled oscillators.
1 Introduction

The collective dynamics of networks composed of neuron-like elements that interchange messages have been studied in many areas of science. Several models have been developed to simulate and analyze phenomena produced by pacemaker cells in the heart (Peskin, 1975), neurons in the brain (Bienenstock, 1995), swarms of fireflies (Buck, 1988; Copeland & Moiseff, 1995), or the hand clapping of opera theater attendants (Néda, Ravasz, Brechet, Vicsek, & Barabási, 2000). Synchronization of the ensemble units is a common characteristic of these phenomena. The observed synchronization effects differ according to the model characteristics.

Most studies consider only instantaneous coupling (i.e., no delay in the message exchange) between the units, which simplifies the analysis of the resulting dynamics. Under this restriction, Mirollo and Strogatz (1990) demonstrated that certain types of identical leaky oscillators with global coupling synchronize for almost all initial conditions. Their result was extended by Senn and Urbanczik (2000) to nonidentical oscillators whose intrinsic frequencies, thresholds, and couplings are heterogeneous within a certain range. They showed that nonleaky linear integrate-and-fire neurons synchronize from any initial condition for almost all parameter values of the system, and they speculated, using perturbative arguments, that their results might still be valid in the presence of a small leakiness. The influence of an absolute refractory period on the Mirollo-Strogatz model has been analyzed by Chen (1994) and Kirk and Stone (1997). These authors showed that the system approaches synchrony for almost all initial conditions if the refractory period is below a critical value.

If a delay for the message exchange is added, more complex forms of synchronization are observed. Ernst, Pawelzik, and Geisel (1998) showed empirically that for both excitatory and inhibitory coupling, the neurons tend to cluster their activities. All neurons within a cluster are synchronized and fire in unison, whereas the clusters are phase-locked with constant phase differences. For inhibitory coupling, the number of clusters of the system was inversely proportional to the length of the delay. The stability of these clusters was analyzed in Timme, Wolf, and Geisel (2002); similar phenomena were observed by van Vreeswijk (1996) for coupling with α functions and have been proposed as a possible mechanism for neural information coding and processing in the form of synfire chains (Abeles, 1991; Bienenstock, 1995; Diesmann, Gewaltig, & Aertsen, 1999; Ikegaya et al., 2004). The importance of delay for neural modeling has recently been addressed by Izhikevich, Gally, and Edelman (2004) and Izhikevich (2006), who claim that it allows an unprecedented information capacity, which translates into an increase of stable firing patterns in more realistic neural populations due to heterogeneous delays.

In this study, we investigate the influence of variations in the coupling strength on a network of nonleaky integrate-and-fire neurons with delayed, pulsed coupling and a small refractory period. We show that a system of these characteristics exhibits a phase transition with a delay-induced hysteresis. Phase transition phenomena are well known in populations of weakly coupled oscillators, where the onset of synchronization represents a second-order phase transition analogous to the formation of a Bose-Einstein condensate (Winfree, 1967; Kuramoto, 1984). Interacting chaotic oscillators also exhibit a special kind of phase transition that closely resembles that seen in spin glasses (Kaneko, 1990). Recent work analyzes the existence of phase transitions for chains (Östborn, 2002) and lattices (Östborn, Åberg, & Ohlén, 2003) of pulse-coupled oscillators with a particular biologically inspired phase response curve.
Contrary to these studies, the neurons of the network we analyze are driven by stochastic input: they are pulse-coupled oscillators with stochastic frequencies. Such types of model neurons were introduced in Gerstein and Mandelbrot (1964), and one possible interpretation of the stochasticity is random input from background neurons that are not explicitly modeled (Stein, 1967). The units perform a random walk toward a threshold, and according to the characteristics of the stochastic input, several studies have analyzed the resulting distributions of the interspike intervals (ISIs) of single units for the case of nonleaky integrate-and-fire (IF) neurons, sometimes also referred to as leakless or perfect integrator neurons (Tuckwell, 1988; Fusi & Mattia, 1999; Salinas & Sejnowski, 2002; Middleton, Chacron, Lindner, & Longtin, 2003; Lindner, 2004). (For a review of those results and of the more biologically plausible models of leaky IF neurons, see Burkitt, 2006, and the references therein.)

Here we are interested in a network of such units, which can be interpreted as a simplified model of a pool of globally coupled neurons with similar properties that receive stochastic input from different regions of the brain (Gerstner, 2000). Such networks have been studied for the cases of inhibitory (Brunel & Hakim, 1999; Brunel & Hansel, 2006) and excitatory (Gerstner, 2000) coupled leaky IF neurons. The latter work mainly concentrated on noise in the thresholds or the refractoriness and only makes some comments on the effect of stochastic inputs. A combination of excitatory and inhibitory coupling, where noise led to enhanced stability of long-distance synchronization, has been analyzed in a realistic heterogeneous neural model (Hodgkin-Huxley) by McMillen and Kopell (2003).

A common problem of these more biologically plausible models is that analytical studies are hard to perform, especially if a delay and a refractory period are added. To bypass this problem, we base our analysis on a discrete-time model introduced in Rodríguez, Suárez, and López (2001) of globally coupled, nonleaky integrate-and-fire neurons with a constant transmission delay and a refractory period, where the stochastic inputs to the units are provided by a Bernoulli process. This model allows efficient simulations and a detailed analytical study and is an extension of the discrete model presented in van Vreeswijk and Abbott (1993). The use of a stochastic model allows us to observe that the nature of the dynamics changes abruptly from a regime where the units show noisy, irregular spiking behavior at low coupling to a regime with deterministic, self-sustained, and repetitive spiking behavior if the coupling is increased to values greater than the critical coupling strength. There the units organize into several clusters, have the same interspike interval (ISI), fire in phase with the units of their own cluster, and are phase-locked with constant time differences to the neurons of other clusters. The clusters are robust to modifications of the rate of the stochastic input. If the coupling strength is increased even more, the number of clusters and the length of
the ISI decrease, since some clusters merge, but the system continues with the phase-locked firing. On the contrary, if the coupling is decreased, the units of the population keep firing phase-locked, without an increase of the ISI or the number of clusters, until the critical coupling strength is reached, where the clusters start to dissolve. Thus, a hysteresis effect can be observed. The phase transition and the hysteresis can be described in detail. Upper and lower bounds for the ISI as well as an approximation for the mean ISI are obtained, and the system's behavior for large ensemble sizes (i.e., at its thermodynamic limit) is characterized. We conjecture that the observed phenomena can potentially occur in rather different models with a refractory period and delayed coupling. The results may be useful to explain certain aspects of animal behavior, such as the synchronized and nonsynchronized flashing of North American fireflies (Copeland & Moiseff, 1995), to create a simple working memory (Wang, 2001), and may be applied in information processing.

The rest of this letter is organized as follows. We first give an overview of the model in section 2 and explain the methodology we used for the experiments in section 3. In section 4 we present experimental and analytical results for homogeneous networks, which are extended to heterogeneous networks in section 5. Section 6 shows the results that can be obtained using either parallel or sequential updating. Finally, the findings are discussed in section 7, and some derivations of formulas are presented in the appendix.
2 Description of the Model

The discrete neural model studied in this work is based on the one introduced by Rodríguez et al. (2001), which is an extension of the work of van Vreeswijk and Abbott (1993). It is composed of a globally coupled network of N nonleaky stochastic integrate-and-fire units. Unlike the model of Rodríguez et al. (2001), where only a finite number of states is allowed, each unit i is, at time t, in a continuous state a_i(t) ∈ [1, ∞). Transitions between states can take place only at discrete time steps and are limited by a threshold L. L is a positive real number and limits the range of excitations in the sense that every state a_i(t) ≥ L is metastable, since it absorbs any further state transitions or incoming messages and relaxes to 1 after the end of a refractory period t_ref. We have two types of state transitions:
1. Stochastic state transitions. At every discrete time step t, a single unit can increase its state variable by 1 with probability p if the state of the neuron is below threshold L:

$$
a_i(t+1) = \begin{cases} a_i(t) + 1 & \text{with probability } p\\ a_i(t) & \text{with probability } 1-p \end{cases} \qquad \text{if } a_i(t) < L,
$$
$$
a_i(t + t_{\mathrm{ref}}) = 1 \ \text{ with probability } 1 \qquad \text{if } a_i(t) \ge L. \tag{2.1}
$$
2. State transitions due to coupling between units. A unit j that reaches the threshold L at time t emits a spike and increases the continuous state variable of a unit i by an amount ε_ij at time t + δ, where δ stands for the synaptic delay. Rodríguez et al. (2001) allowed only positive integer values for ε_ij. We will not use this restriction and will allow any nonnegative real number. The total amount of change of a single unit at time t + δ is obtained by summing over all neurons that reached the threshold at time t. This gives us the following transition function due to messages of other units (we have to consider three cases, depending on the threshold situation and the relation of t_ref and δ):

$$
a_i(t+\delta) = \begin{cases} a_i(t) + \sum_{j=1}^{N} \varepsilon_{ij}\,\Theta_L(a_j(t)) & \text{if } a_i(t) < L,\\[0.5ex] 1 + \sum_{j=1}^{N} \varepsilon_{ij}\,\Theta_L(a_j(t)) & \text{if } a_i(t) \ge L \text{ and } \delta \ge t_{\mathrm{ref}},\\[0.5ex] 1 & \text{if } a_i(t) \ge L \text{ and } \delta < t_{\mathrm{ref}}. \end{cases} \tag{2.2}
$$

Notice that Θ_L(x) = Θ(x − L), where Θ(x) is the Heaviside step function, whose value is 0 for negative inputs and 1 elsewhere. We use this function to sum only over the neurons that reached threshold L in the previous time step.
We restrict our analysis to the case of δ ≥ tref . Consequences of other choices are discussed at the end of the letter. To keep the simulations simple, we set both tref and δ equal to 1, which allows us to combine these two types of evolution, 2.1 and 2.2, into a single equation, and we get the following
dynamics:

$$
a_i(t+1) = \begin{cases} a_i(t) + \sum_{j=1}^{N} \varepsilon_{ij}\,\Theta_L(a_j(t)) + 1 & \text{with probability } p\\[0.5ex] a_i(t) + \sum_{j=1}^{N} \varepsilon_{ij}\,\Theta_L(a_j(t)) & \text{with probability } 1-p \end{cases} \quad \text{if } a_i(t) < L,
$$
$$
a_i(t+1) = 1 + \sum_{j=1}^{N} \varepsilon_{ij}\,\Theta_L(a_j(t)) \ \text{ with probability } 1 \qquad \text{if } a_i(t) \ge L. \tag{2.3}
$$

Rodríguez et al. (2001) set all ε_ij = ε for i ≠ j and all ε_ii = 0. With these prerequisites, the parameter

$$
\eta = \frac{L-1}{(N-1)\,\varepsilon} \tag{2.4}
$$
was introduced to characterize the strength of the interactions among the units. The parameter η gives the ratio between the total change in activation needed for a neuron to fire and the one provided by the coupling with the rest of the population. For η ≫ 1, the following expressions approximate well the mean and standard deviation of the ISI of a neuron in the ensemble:

$$
\tau_{mf} = t_{\mathrm{ref}} + \frac{L - (N-1)\varepsilon - 1}{p}, \qquad \sigma_{mf} = \frac{\eta - 1}{\eta}\,\frac{\sqrt{(L - (N-1)\varepsilon - 1)(1-p)}}{p}. \tag{2.5}
$$

Details and derivations of these equations, which are based on a mean-field approach replacing the state transitions due to coupling between units by their average, can be found in Rodríguez et al. (2001).¹ Equations 2.5 fail to describe the behavior of the system in regions of high coupling. At η = 1, they predict one giant cluster containing all the units, with an ISI of 1. This would be true only for a system without delay and refractory period. In our system, however, the more important the correlations due to the delayed message exchange become (i.e., the smaller η becomes), the bigger is the difference between τ_mf and the ISI of the units. This was first reported by Rodríguez, Suárez, and López (2002), who found that for η = 1, after an initial transient, the system reaches one of a large number of periodic firing patterns, composed of several clusters.
¹ Note that Rodríguez et al. (2001) used t_ref = 1 in their analysis, which can easily be extended to the general case of variable t_ref.
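Before continuing, here is a minimal simulation sketch of dynamics 2.3 for the homogeneous couplings ε_ij = ε (i ≠ j) and δ = t_ref = 1 used in the rest of the letter. The function and variable names are ours, not from the paper, and the snippet is only an illustration of the update rule under these assumptions, not the authors' simulation code.

```python
import numpy as np

def step(a, L, eps, p, rng):
    """One parallel update of all units according to equation 2.3."""
    fired = a >= L                      # units that reached threshold at time t
    # Spikes emitted at t arrive at t + 1 (delay delta = 1); a unit does not
    # receive its own spike (eps_ii = 0).
    recurrent = eps * (fired.sum() - fired)
    kick = rng.random(a.size) < p       # stochastic +1 with probability p
    return np.where(fired, 1.0 + recurrent, a + recurrent + kick)

rng = np.random.default_rng(0)
N, L, p, eta = 100, 100.0, 0.9, 1.2
eps = (L - 1) / ((N - 1) * eta)         # equation 2.4 solved for eps
a = rng.uniform(1.0, L, size=N)         # random initial states
for _ in range(200):
    a = step(a, L, eps, p, rng)
```

Recording the threshold crossings of `a` over time yields raster plots like those in Figure 1.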
The same is true for η < 1, as can be observed in Figure 1, where raster plots of the spikes of a system consisting of 100 neurons are shown. The irregular behavior of the system at η = 1.2 (see Figure 1a) changes into a regular, repetitive spiking pattern at η = 1 (see Figure 1b) if the coupling is increased. If it is increased further, some clusters merge, but the system continues with the phase-locked clustered firing (shown in Figure 1c for η = 0.9).

3 Experimental Setup

To analyze the system described in section 2, we use the following experimental procedures. The phase transition and the hysteresis are best understood by observing how the system reacts to adiabatic changes of the coupling strength.² A simple analogy is useful to understand the procedure. Consider a cloud of particles that is slowly concentrated or diluted by decreasing or increasing the volume. We begin with a very dispersed cloud with little interaction between the particles and start to concentrate it in a stepwise manner. At every concentration step, the interaction among the particles increases. At some point, the process is reversed and the cloud is diluted again, until it reaches the original state. We therefore distinguish two different processes in our experiments, to which we refer as the concentration process and the dilution process.

The particles in our case are the spiking units, and the interaction can be measured via the relation of the threshold L and the coupling strength ε multiplied by the number of units N. As explained above (see equation 2.4), this relation is reflected in the parameter η, which in our analogy represents the volume of the system. In our experiments, we choose a fixed set of N neurons with fixed threshold L. The only parameter allowed to change is the global interaction strength ε. We start with units at random initial states and in regions of high η (usually η = 2), where the system can be described with high accuracy by equations 2.5 and is ergodic in the sense that all accessible microstates are visited over a long period of time. The units in these regions can be viewed as nearly independent, with a threshold lowered by the mean activity induced by messages received from other units. The only difference from real independence is a period-focusing effect described by Rodríguez et al. (2001). This results in a standard deviation slightly lower (by a factor (η − 1)/η) than that of an independent unit with lowered threshold. Once an experiment is started, we let the system evolve enough time steps to avoid dependence on unnatural initial conditions (i.e., conditions that are not typical of the system) and let the ISI stabilize.
² We use the term adiabatic as it is used in quantum mechanics, meaning a "sufficiently slow" change of the system.
Figure 1: Raster plots of spikes (firing patterns) of 100 neurons (noise rate p = 0.9 and threshold L = 100) for different values of η. The simulation started with a random initial state for every neuron. The coupling strength is slowly increased every 100 time steps, and time was set to 0 after a transient of 50 time steps. For clarity in the visualization, the neurons are relabeled according to their spike time at η = 1. (a) We observe irregular firing at η = 1.2. (b) At η = 1, the neurons organize into nine phase-locked clusters (labeled by the boxed numbers). (c) At η = 0.9, the number of phase-locked clusters is reduced to five, since clusters 1, 2; 4, 5; 6, 7; and 8, 9 merged into new, bigger clusters. Note that the length of the ISI coincides with the number of clusters for η = 1 and η = 0.9.
Now we can start the concentration process by decreasing η in a stepwise manner. We achieve this by adapting ε. Notice that although we change η by a constant Δη, the changes of ε are not constant, due to the inverse relation of η and ε. After every decrease of η, we let the system evolve enough time steps until the ISI stabilizes again. This procedure is repeated until a value of η in the range between 1 and 0.5 is reached. Then we reverse the procedure and start the dilution process. We increase η in a stepwise manner until we again reach the starting value of η. To analyze the results, we calculate two types of statistics of the ISIs of the units for every value of η:

1. The statistics of the ISI of the units just before the parameters of the system are changed (i.e., ε is increased or decreased). We call the first two moments of these statistics τ and σ (τ denotes the mean ISI of the ensemble and σ its standard deviation).
2. The statistics of τ and σ over different experiments. We call ⟨τ⟩ the mean and σ_exp the standard deviation of the ensemble's mean ISI τ. In the case of σ, we calculate only its mean value, which we denote ⟨σ⟩.

The value of ⟨σ⟩ gives us an idea of the likelihood of ending up firing with phase-locked clusters for the given parameter values. The closer ⟨σ⟩ is to 0, the bigger is this likelihood. If all the units in one experiment have the same ISI, their standard deviation σ equals 0. σ_mf of equation 2.5 estimates ⟨σ⟩. On the other hand, σ_exp measures the influence of the initial conditions and of variations in the stochastic state transitions on the mean ISI τ of the ensemble.

4 Results

In this section, we first present the outcome of several experiments that reveal the existence of a phase transition phenomenon of the system described in section 2 around a critical value of the coupling parameter η. This phenomenon is accompanied by a hysteresis effect, which will also be described. We then give analytical bounds for the mean ISI ⟨τ⟩. This description allows us to calculate the behavior of the observed phenomena for N → ∞. We will refer to this limit behavior as the thermodynamic limit in the rest of this work.

4.1 Experimental Results

4.1.1 Dependence of the ISI on the Coupling Strength ε. As explained in the model section, one of the quantities we are interested in in this work is the ISI. In particular, we want to know how it is modified by small changes of the coupling parameter η. We therefore performed several experiments for different ensemble sizes, as described in section 3, and observed the dependence of the mean value of the ISI, ⟨τ⟩, on the coupling parameter η for different values of the rate p of the stochastic evolution. First, we analyze only the concentration process, where we slowly increase the amount of coupling between the neurons. Since the coupling parameter η is inversely proportional to the coupling strength, an increase of the coupling strength implies a decrease of η. Figures 2a and 2b show the results of 1000 such concentration experiments for noise rates of p = 0.9 (solid line) and p = 0.6 (dashed line). We used an ensemble size of N = 1000 neurons and can observe how the mean ISI ⟨τ⟩ decreases as we increase the coupling. Initially, at high values of η, there is a clear dependence on the noise rate p, which seems to disappear as we reach η = 1. A closer examination of the mean ISIs of both experiments in this region reveals that their difference for η close to 1 and below decays exponentially to 0 with decreasing η (shown in the insets of Figures 2a and 2b).
Figure 2: Mean ISI. Values of the ISI ⟨τ⟩ over all neurons and 1000 experiments for increasing ε (i.e., decreasing η). The number of neurons N = 1000 equals the threshold L in all cases. (a) Dependence of ⟨τ⟩ on η for two different noise levels p. The inset shows the difference of both curves in the interesting region around η = 1. (b) Same as a but in logarithmic scale. (c) Dependence of ⟨τ⟩ on N for different values of η and p = 0.9. (d) Same as c but in logarithmic scale. We can see a linear dependence of ⟨τ⟩ on N for η = 1.15, a square root dependence for η = 1, and nearly no dependence for η = 0.9.
We will see later in the theoretical analysis that the greater N is, the faster is this decay, and at the thermodynamic limit, the mean ISI is independent of p for all values of η < 1. When we analyze the deviation of the ISIs, we also notice a change in the behavior of the system as we approach η = 1. The mean deviation ⟨σ⟩ of several experiments drops to 0 when the critical value of η is reached, indicating that the units organize into clusters and fire phase-locked, all with the same ISI. Figure 3a shows this effect for the concentration experiments with the two different noise rates analyzed before. In the inset, we notice that for a noise level of p = 0.9 (solid line), the onset of phase locking already happens at η ≈ 1.04. This can also be observed in the deviation σ_exp of the experiments' ISIs. Figure 3b shows the corresponding value of σ_exp. Here the onset of phase locking is marked by an increase of σ_exp. If the units are not phase-locked, σ_exp is small, since it is the deviation of averages.
Figure 3: Standard deviations. Values of the deviations ⟨σ⟩ and σ_exp for the experiments with increasing ε (i.e., decreasing η) presented in Figure 2. The number of neurons N equals the threshold L in all cases. (a) Dependence of ⟨σ⟩ on η for two different values of p. The inset shows a zoom on the interesting region around η = 1. (b) Same as a but for σ_exp. (c) Dependence of ⟨σ⟩ on N for different values of η and p = 0.9. ⟨σ⟩ = 0 for all η < 1 due to synchronization. (d) Dependence of σ_exp on N for different values of η and p = 0.9.
Although there may be big differences between the ISIs of the units, as is reflected by the value of σ, once the mean ISI τ of the ensemble is calculated, these fluctuations are just averaged out. But as the units start to organize into phase-locked clusters, the ensemble dynamics starts to govern the system. The deviation of the ensemble, σ, equals 0, but the deviation over experiments, σ_exp, increases. The mean ISI τ of the ensemble is now an integer value, which depends on the evolution of all the units since the beginning of the experiment. This gives rise to a broader shape of the ISI distribution. Small fluctuations in the evolution of some units can lead to a different ISI of the whole system. Therefore, the averaging effect observed before is lost, and different experiments, although started with the same initial conditions, can lead to systems with different τ. The maximum of σ_exp is reached when the rate p of the stochastic evolution loses most of its influence on the size of the ISI (compare with Figure 2).
In this sense, the interval between the local minimum at η > 1 and the local maximum at η < 1 of σ_exp marks the transition from an ensemble governed by the stochastic evolution to an ensemble with self-sustained activity, where the stochastic evolution does not have much influence on the statistics of the system. We can see that once the local maximum is reached, the curves for p = 0.9 and p = 0.6 are practically identical if the system is concentrated further (i.e., the coupling strength is increased). σ_exp reaches a value of 0 at η = 0.5, when the system consists of only one giant cluster, which spikes at every time step. The strange bump at η ≈ 0.7 can be explained by a probability of nearly 80% of having an ISI of 2 for this value of η. The value of σ_exp thus experiences an important decrease, but starts to increase again when the concentration continues and the probability of having an ISI of 1 increases.
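The two levels of statistics defined in section 3 are straightforward to compute; the following is a minimal sketch, assuming a hypothetical data layout with one list of ISIs per unit and per experiment (the paper does not prescribe one):

```python
import numpy as np

def ensemble_stats(isis_per_unit):
    """tau and sigma: mean and standard deviation of the units' mean ISIs
    within a single experiment."""
    unit_means = [np.mean(isis) for isis in isis_per_unit]
    return np.mean(unit_means), np.std(unit_means)

def across_experiments(experiments):
    """<tau>, sigma_exp, and <sigma> over many experiments."""
    taus, sigmas = zip(*(ensemble_stats(e) for e in experiments))
    return np.mean(taus), np.std(taus), np.mean(sigmas)
```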
4.1.2 Dependence of the ISI on the Ensemble Size N. Once the properties of the deviations have been described, our analysis focuses again on the mean ISI ⟨τ⟩. The strange shape of the curves in the logarithmic scale of Figure 2b suggests that apart from the elimination of the dependency on the noise rate p, something else is going on around the value of η = 1. To investigate this point further, we observed the dependence of the ISI ⟨τ⟩ on the ensemble size N. Figures 2c and 2d reveal quite different kinds of dependence for different values of the coupling parameter η. For η = 1.15 (line with circles), we observe a linear dependence of ⟨τ⟩ on N, whereas for η = 0.9 (line with crosses), the value of the ISI stabilizes once a certain number of neurons is in the ensemble (N ≥ 300) and does not show any dependence on the ensemble size. At η = 1 (line with diamonds), another type of relationship is observed. The slope in the double logarithmic scale of Figure 2d has nearly exactly a value of 0.5, indicating a relationship of the type ⟨τ⟩ ∼ √N. The corresponding values of the deviations ⟨σ⟩ and σ_exp for p = 0.9 can be seen in Figures 3c and 3d. As expected for η ≤ 1, the mean deviation ⟨σ⟩ = 0, since the units fire phase-locked. We omit for clarity the line for η = 0.9 and show the values only for η = 1 (line with diamonds). For η = 1.15 (line with circles), we get a linear dependence of ⟨σ⟩ on √N, as predicted by equation 2.5. The additional curve for η = 1.05 (line with squares) allows observing that the smaller the ensemble, the earlier the onset of phase locking in the concentration process occurs. Here the units are already phase-locked for ensemble sizes of N > 500. The dependence of σ_exp on N can be observed in Figure 3d. For η = 1.15 (line with circles), where the stochastic state transitions govern the ensemble, σ_exp slightly decreases, since an increase of N implies an increase of the number of samples taken for every ISI τ and, due to the central limit theorem, a decrease of the deviation. For η = 0.9 (solid line with crosses) and η = 1.0 (solid line with circles), σ_exp increases as N increases. This
behavior changes in the case of η = 0.9, where σ_exp stabilizes for higher values of N and becomes independent of the ensemble size (data not shown).

4.2 Characterization of the Phase Transition

The results of the dependence of the mean ISI ⟨τ⟩ on the ensemble size N for different values of η presented in Figures 2c and 2d motivate us to investigate whether there exists a relation of the type

$$
\alpha N^{c} = \langle\tau\rangle \tag{4.1}
$$

and to analyze the dependence of c and α on the coupling parameter η. Equation 4.1 can be transformed into a linear equation with slope c and y-intercept ln α,

$$
\ln(\alpha) + c\,\ln(N) = \ln(\langle\tau\rangle). \tag{4.2}
$$
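In log-log coordinates, equation 4.2 is an ordinary least-squares problem; a minimal sketch, assuming hypothetical arrays `Ns` and `taus` of ensemble sizes and measured mean ISIs:

```python
import numpy as np

def fit_alpha_c(Ns, taus):
    """Least-squares fit of ln(<tau>) = ln(alpha) + c * ln(N)."""
    c, ln_alpha = np.polyfit(np.log(Ns), np.log(taus), deg=1)
    return np.exp(ln_alpha), c
```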
This allows us to calculate c and α by least-squares fits of the simulation data. The results of these fits are shown in Figures 4 and 5. In both, we see two curves, each representing a set of 10 different values of N. For every value of N, the mean ISI ⟨τ⟩ of 1000 concentration processes was calculated. For the solid line, N takes values from 100 to 1000 in steps of 100, and for the dashed line, values from 10³ to 10⁴. We can see that both c and α experience a sharp change of their value around η = 1. The higher the value of N, the sharper this change is. We can therefore speak of a phase transition around a critical value of η = 1. We expect that for N → ∞, the value of c should jump from 1 for η > 1 via 0.5 at η = 1 to 0 for η < 1. The gray areas represent bounds obtained from the theoretical analysis and are discussed in section 4.3.

4.2.1 Hysteresis Effect. After having analyzed the concentration process experimentally and characterized a phase transition phenomenon, we are interested in what happens if we invert the process. Instead of increasing the coupling strength, we decrease it in a stepwise manner. As described in section 3, we call this type of experiment a dilution process. If we combine the concentration and dilution processes to obtain a cyclic process, we notice a hysteresis effect when comparing the mean ISIs ⟨τ⟩ of both processes for values of η close to 1 and below. Figure 6 shows this effect for three different starting points of the dilution process. The dotted line represents the mean ISI ⟨τ⟩ of the concentration process of 1000 experiments. When the concentration process stops and the dilution process is started, τ and therefore also ⟨τ⟩ remain constant until a dilution of η > 1 is reached. The solid line with circles represents a dilution process starting at η = 0.5 and the dashed line with + markers one starting at η = 0.9.
Figure 4: Comparison between experimental and theoretical results for the parameter c under the assumption that ⟨τ⟩ = αN^c. Ten sets of 1000 experiments with p = 0.9 and different values of N were carried out. Solid line: N ∈ {100, 200, ..., 1000}. Dashed line: N ∈ {1000, 2000, ..., 10,000}. The values of c have been obtained by a least-squares fit of the experiments with the linear equation ln(⟨τ⟩) = ln(α) + c ln(N). The shaded areas show the region of possible values of c obtained by least-squares fits of the theoretical bounds for ⟨τ⟩ of equations 4.10 and 4.11 under the same assumption. For N ∈ {10⁵, ..., 10⁶} (darkest area), the value of c is already very close to its thermodynamic limit. The difference between the values of c of the two bounds is very low.
In both cases, the ISI remains unchanged until η = 1, where it jumps to a value slightly higher than the one predicted by the formulas 2.5 for this case. If we start the dilution process at η = 0.99, we observe that τ remains constant even for η = 1.01. Only if we dilute further does τ increase and start to coincide with τ of the other two dilution processes. At approximately η = 1.08, the ISI of the dilution process coincides with that of the concentration process. To understand this phenomenon in detail, we carry out a theoretical analysis, which we present in the next section.

4.3 Theoretical Description

Once the phenomena occurring in the model are identified in experiments, we make some theoretical observations to gain further insight.
Figure 5: Comparison between experimental and theoretical results for the value of α under the assumption that ⟨τ⟩ = αN^c. Ten sets of 1000 experiments with p = 0.9 and different values of N were carried out. Solid line: N ∈ {100, 200, ..., 1000}. Dashed line: N ∈ {1000, 2000, ..., 10,000}. The values of α have been obtained by a least-squares fit of the experiments with the equation ln(⟨τ⟩) = ln(α) + c ln(N). The shaded areas show the regions of possible values of α for N → ∞ according to equations 4.20.
We base these observations on a deterministic approximation of the model, where the stochastic evolution 2.1 of a neuron i is simplified into the following deterministic iterative rule:

$$
\begin{aligned} a_i(t+1) &= a_i(t) + p & &\text{if } a_i(t) < L,\\ a_i(t + t_{\mathrm{ref}}) &= 1 & &\text{if } a_i(t) \ge L. \end{aligned} \tag{4.3}
$$
The random walk performed by the units is replaced by their average behavior: a deterministic motion with constant homogeneous velocity p. An equivalent continuous-time system, but with heterogeneous velocities (frequencies) and without delay and refractory period, has been studied by Senn and Urbanczik (2000). In the following, we restrict our analysis to the case where the delay δ of the message exchange is greater than or equal to the refractory period t_ref.
Figure 6: Hysteresis effect in the comparison of the dependence of ⟨τ⟩ on η between the concentration and dilution processes of the experiments. The y-axis shows ⟨τ⟩ of 1000 experiments in logarithmic scale. Four curves are shown. η ↓ till 0.5 shows the results of the concentration process down to η = 0.5. η ↑ from 0.5 is the corresponding part of 1000 dilution processes starting from η = 0.5. η ↑ from 0.9 shows the result for the dilution process starting after a concentration to η = 0.9. And η ↑ from 0.99 is the same for a concentration until η = 0.99. The inset shows a zoom on the interesting region around η = 1 in linear scale. The number of neurons N = 1000 equals the threshold L, and p = 0.9 in all cases.
After an initial transient, the deterministic system shows a periodic pattern of spikes. The period of a pattern is the ISI of the ensemble (Figures 1b and 1c illustrate such patterns). If we take an arbitrary neuron and start the pattern at a spike of this unit, all units will spike exactly once until the next spike of the same unit. The sequence of these spikes will then start again with exactly the same time differences between the spikes, and this pattern will repeat itself forever if the parameters of the system are not changed. One can derive the following condition that the system fulfills if it shows a periodic firing pattern.

Periodic pattern condition. A system that consists of κ clusters, where every cluster i consists of k_i elements, shows a periodic firing pattern with
ISI τ if for every i ∈ {1, ..., κ},

$$
k_i > k_{\min}(\tau) = \begin{cases} (N-1)(1-\eta) + \dfrac{p\,(\tau - 1 - t_{\mathrm{ref}})}{\varepsilon} & \text{if } \tau \ge 1 + t_{\mathrm{ref}},\\[1.5ex] (N-1)(1-\eta) & \text{if } \tau < 1 + t_{\mathrm{ref}}, \end{cases} \tag{4.4}
$$
is fulfilled. For the derivation of this rule, see appendix A. Condition 4.4 simply tells us that every cluster (i.e., the units that reach the threshold at the same time step) has to be greater than a certain minimum cluster size, which depends on the system's parameters and its ISI (a code sketch of this condition follows the list below). Before we continue our analysis, we make some comments on the validity of this rule for the stochastic system. The firing patterns of the deterministic system may also occur in the stochastic system, as can be seen in Figures 1b and 1c. According to the robustness of these patterns against variations in the stochastic evolution, we can distinguish three types of patterns:
• Robust firing patterns are totally insensitive to variations of the stochastic state transitions, in the sense that no matter how they evolve, even if they are totally suppressed, the system cannot change its periodic pattern.

• Semirobust firing patterns remain unchanged if one or more units evolve more slowly than their mean velocity p, but if they evolve much faster, the spiking pattern may change. Since such changes are rare, as will be explained subsequently, and the patterns are robust against at least half of the possible stochastic events, we call them semirobust.

• Variable firing patterns may change due to very slow or very fast evolution of one or more units.
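As announced above, here is a minimal sketch of condition 4.4; the function names are ours, and (as the text notes next) replacing p by 1 in k_min would give the condition for the robust case:

```python
def k_min(tau, N, eta, p, eps, t_ref=1):
    """Minimum cluster size of condition 4.4."""
    if tau >= 1 + t_ref:
        return (N - 1) * (1 - eta) + p * (tau - 1 - t_ref) / eps
    return (N - 1) * (1 - eta)

def is_periodic(cluster_sizes, tau, N, eta, p, eps, t_ref=1):
    """True if every cluster exceeds the minimum size required by condition 4.4."""
    bound = k_min(tau, N, eta, p, eps, t_ref)
    return all(k > bound for k in cluster_sizes)
```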
At η ≤ 1, we can find only robust and semirobust patterns. Condition 4.4 gives us the rule for a semirobust pattern in the stochastic system. To get the condition for a robust pattern in this case, we would have to replace p with 1. A change of a semirobust pattern of phase-locked clusters implies that one or more units change from one cluster to the one firing directly before it. This increases the robustness of the resulting new firing pattern, since the smallest cluster has the highest probability of receiving a neuron, which leads to a certain balancing of the sizes of the clusters. Only if two small clusters spike one directly after the other might a merge of these two clusters occur, which would imply a decrease of the ISI. Every decrease of the ISI enhances the robustness of the resulting firing pattern, since it implies a decrease of the minimum cluster size k_min(τ). Such events, however, are rare, especially for large populations, and their influence is far below the
standard deviation of the experiments. Since we are mainly interested in the ISIs of our system, we can neglect them in our analysis, although we have to state that for an infinite simulation time, all semirobust patterns would transform into robust ones. For η > 1, we find only variable firing patterns (see Figure 1a). In the stochastic system, the higher the value of η, the lower is the probability of observing, between three consecutive spikes of a certain single unit, the same two spiking patterns of the rest of the neurons. It is not even guaranteed that the two ISIs of this unit have the same length. But now, unlike the case of η ≤ 1, the fluctuations of the noise cannot create irreversible effects on the ISI of the system. The mean ISI of the system coincides with that of the deterministic system, which makes the following analysis valid in this region as well. Every time the coupling strength is increased, after a short transient, the deterministic system again fulfills condition 4.4. This allows us to make some observations on the ISI of the system. Since the cluster sizes have to be integer numbers, we define
$$
\hat{k}_{\min}(\tau) = \lfloor k_{\min}(\tau)\rfloor + 1 \tag{4.5}
$$

and have the condition k_i ≥ k̂_min(τ). This definition guarantees that k̂_min(τ) ≥ 0 and is necessary for technical reasons in the derivation of the lower bound, which we will see later. A system with ISI τ consists of κ = τ/δ clusters, since a spiking cluster at time t provokes the spiking of the next one at time t + δ. With this, we calculate another quantity we need to describe our system, the mean cluster size given a certain ISI τ:
$$
\bar{k}(\tau) = \frac{N\delta}{\tau}. \tag{4.6}
$$
We then introduce a new function g(τ), which gives us the ratio between k̄(τ) and k̂_min(τ):
$$
g(\tau) = \frac{\bar{k}(\tau)}{\hat{k}_{\min}(\tau)}. \tag{4.7}
$$
Intuitively, this quantity can be seen as the frequency with which a system consisting only of clusters of the minimum cluster size k̂_min(τ) would have to fire to achieve the same ISI as the system with mean cluster size k̄(τ). Note that g(τ) > 0 since k̂_min(τ) ≥ 0.
Using equation 4.5 for the minimum integer cluster size k̂_min(τ), we arrive after some manipulation at the following inequality (see equation B.15 in appendix B for details):

$$
\frac{N\delta}{\tau} \le \langle g\rangle\,\bigl(k_{\min}(\tau) + 1\bigr). \tag{4.8}
$$
⟨g⟩ denotes the expectation of g(τ) with respect to the probability distribution of τ. We will from now on call ⟨g⟩_min the minimum value that ⟨g⟩ can take to fulfill inequality 4.8. Using ⟨g⟩_min, inequality 4.8 can be written as

$$
\frac{1}{\tau} = \langle g\rangle_{\min}\,\zeta \tag{4.9}
$$

when we replace (k_min(τ) + 1)/(Nδ) with ζ. If we plot τ and ζ in double logarithmic scale, we notice a nearly linear dependence between 1/τ and ζ. This is the reason that, by replacing ⟨g⟩_min with a constant, we can get a good approximation of τ, as we will see. Two examples of this nearly linear relation can be seen in Figure 7a for noise rates of p = 0.9 and p = 0.6. The inset shows a comparison between ζ and η for p = 0.9. If we take a closer look at ⟨g⟩_min, however, we notice that the hypothesis of a linear relationship does not hold. In Figure 7b we can observe the exact value of ⟨g⟩_min for the corresponding two sets of experiments. From empirical evidence, we can suppose that for the values of p and η of our simulations, 2 is an upper bound of ⟨g⟩_min, which allows us to replace ⟨g⟩ with 2 in inequality 4.8.

4.3.1 Bounds for τ and ⟨τ⟩. With the results of the last section, we can derive upper and lower bounds for τ and the mean ISI ⟨τ⟩. Both bounds of ⟨τ⟩ can be used to approximate ⟨τ⟩. First we derive a lower bound for ⟨τ⟩ using definition 4.7 and condition 4.8 (see section B.3 for details):³

$$
\tau_{\langle\min\rangle} = \frac{\varepsilon\left((N-1)(\eta-1) - 1\right)}{2p} + \frac{1 + t_{\mathrm{ref}}}{2} + \sqrt{\left(\frac{\varepsilon\left((N-1)(\eta-1) - 1\right)}{2p} + \frac{1 + t_{\mathrm{ref}}}{2}\right)^{2} + \frac{N\delta\varepsilon}{p\,\langle g\rangle}}. \tag{4.10}
$$

³ Note that if ⟨g⟩ is replaced by ⟨g⟩_min in equation 4.10, we get an expression for ⟨τ⟩.
Figure 7: (a) Relation of ⟨τ⟩ and ζ = (k_min(τ) + 1)/(Nδ) for two different values of p. (b) Dependence of the parameter ⟨g⟩_min on η for two different values of p. We can observe ⟨g⟩_min ≤ 2. (c) The empirical outcome ⟨τ⟩ of the concentration processes of 1000 experiments compared to its theoretical limits τ_max and τ_⟨min⟩ (gray area). (d) Difference between the theoretical limits τ_max and τ_⟨min⟩ (gray area), ⟨τ⟩ and τ_⟨min⟩ (dashed line), τ_max and ⟨τ⟩ (black solid line), and ⟨τ⟩ and τ_mf (gray solid line with circles). τ_max has been calculated using equation 4.11 and τ_⟨min⟩ by substituting 2 for ⟨g⟩ in equation 4.10.
In the following, we will use (if not stated differently) the empirical upper bound 2 of ⟨g⟩_min, as explained in the previous section, to substitute ⟨g⟩ in equation 4.10 and to calculate the numerical values of τ_⟨min⟩. To get an upper bound for all possible ISIs, we use equations 4.4 and 4.6:

$$
\tau_{\max} = \frac{\varepsilon(N-1)(\eta-1)}{2p} + \frac{1 + t_{\mathrm{ref}}}{2} + \sqrt{\left(\frac{\varepsilon(N-1)(\eta-1)}{2p} + \frac{1 + t_{\mathrm{ref}}}{2}\right)^{2} + \frac{N\delta\varepsilon}{p}}, \tag{4.11}
$$
and we have that (see appendix B for details)

$$
\tau_{\langle\min\rangle} \le \langle\tau\rangle \le \tau_{\max}. \tag{4.12}
$$
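A minimal numeric sketch of the two bounds, with ⟨g⟩ replaced by its empirical upper bound 2 as in the text (the function name and example values are ours):

```python
import numpy as np

def tau_bounds(N, eta, p, eps, delta=1.0, t_ref=1.0, g=2.0):
    """Evaluate tau_<min> (equation 4.10) and tau_max (equation 4.11)."""
    a_max = eps * (N - 1) * (eta - 1) / (2 * p) + (1 + t_ref) / 2
    tau_max = a_max + np.sqrt(a_max**2 + N * delta * eps / p)
    a_min = eps * ((N - 1) * (eta - 1) - 1) / (2 * p) + (1 + t_ref) / 2
    tau_min = a_min + np.sqrt(a_min**2 + N * delta * eps / (p * g))
    return tau_min, tau_max

# Example with the experimental setting L = N, so eps = 1/eta (equation 2.4):
lo, hi = tau_bounds(N=1000, eta=0.9, p=0.9, eps=1 / 0.9)   # roughly 4.9 and 9.4
```

The values agree with the thermodynamic predictions δ/(⟨g⟩(1 − η)) = 5 and δ/(1 − η) = 10 derived below.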
Note that τ_max is an upper bound for all ISIs τ, whereas τ_⟨min⟩ is a lower bound only for ⟨τ⟩, not for τ. Because of this fact, τ_⟨min⟩ is much closer to ⟨τ⟩ than τ_max, as we can observe in Figures 7c and 7d, which show the quality of these bounds. The solid line in Figure 7c represents ⟨τ⟩ for 1000 concentration processes in logarithmic scale. It lies within a gray area that indicates the interval limited by the two bounds τ_⟨min⟩ and τ_max. In section B.2, we derive a lower bound τ_min ≤ τ; τ_min = t_ref for η < 1 and coincides with the approximation τ_mf (see equation 2.5) for η ≥ 1. We therefore have that

$$
\tau_{mf} \le \tau \le \tau_{\max}. \tag{4.13}
$$
The exact difference between ⟨τ⟩ and its bounds is shown in Figure 7d. The difference ⟨τ⟩ − τ_⟨min⟩ is indicated by the dashed line and τ_max − ⟨τ⟩ by the solid line. The gray area shows the width of the interval bounded by τ_⟨min⟩ and τ_max. The quantity τ_⟨min⟩ is, for a wide range of values of η, not only a bound but an excellent approximation for ⟨τ⟩. It beats by far the approximation τ_mf, whose difference from ⟨τ⟩ is shown by the gray solid line with circles in Figure 7d. In spite of this fact, the importance of τ_⟨min⟩ and τ_max lies not only in their quality as approximations but in the fact that they are bounds, and both experience a similar phase transition effect as ⟨τ⟩. This allows us to derive the thermodynamic limit of ⟨τ⟩.

4.3.2 Thermodynamic Limit. The thermodynamic limits (i.e., the behavior for infinite N and L and finite η) of the bounds from equations 4.10 and 4.11 can be derived after some calculations (see appendix C for details). We thus get for the thermodynamic limit of ⟨τ⟩:

$$
\begin{aligned} &\lim_{N\to\infty} \langle\tau\rangle = \delta & &\text{if } \eta < 0.5,\\[0.5ex] &\frac{\delta}{\langle g\rangle(1-\eta)} \le \lim_{N\to\infty} \langle\tau\rangle \le \frac{\delta}{1-\eta} & &\text{if } 0.5 \le \eta < 1,\\[1ex] &\sqrt{\frac{\delta\varepsilon}{p\,\langle g\rangle}} \le \lim_{N\to\infty} \frac{\langle\tau\rangle}{\sqrt{N}} \le \sqrt{\frac{\delta\varepsilon}{p}} & &\text{if } \eta = 1,\\[1ex] &\lim_{N\to\infty} \frac{\langle\tau\rangle}{N} = \frac{\varepsilon(\eta-1)}{p} & &\text{if } \eta > 1. \end{aligned} \tag{4.14}
$$
The upper bound for 0.5 ≤ η < 1 coincides with the one obtained for a model without stochastic input similar to that of van Vreeswijk and Abbott (1993). To characterize the quality of these bounds, we examine the quantity Δτ = τ_max − τ_⟨min⟩, especially at its thermodynamic limit. The gray area
in Figure 7d shows Δτ for the case of N = 1000 and p = 0.9. From equations 4.10 and 4.11, we can derive after some algebra that

$$
\lim_{N\to\infty} \Delta\tau = \begin{cases} 0 & \text{for } \eta < 0.5,\\[0.5ex] \dfrac{\delta}{1-\eta}\left(1 - \dfrac{1}{\langle g\rangle}\right) & \text{for } 0.5 \le \eta < 1,\\[1.5ex] \lim\limits_{N\to\infty} \sqrt{\dfrac{N\delta\varepsilon}{p}}\left(1 - \dfrac{1}{\sqrt{\langle g\rangle}}\right) & \text{for } \eta = 1,\\[1.5ex] \dfrac{\varepsilon}{p} + \dfrac{\delta}{\eta-1}\left(1 - \dfrac{1}{\langle g\rangle}\right) & \text{for } \eta > 1. \end{cases} \tag{4.15}
$$
Since from τ_max ≥ τ_⟨min⟩ it follows that Δτ ≥ 0, we have that ⟨g⟩ and ⟨g⟩_min have lower bounds:

$$
\lim_{N\to\infty} \langle g\rangle \ge \lim_{N\to\infty} \langle g\rangle_{\min} \ge \begin{cases} 1 & \text{for } \eta \le 1,\\[0.5ex] \dfrac{p\delta}{p\delta + \varepsilon(\eta-1)} & \text{for } \eta > 1. \end{cases} \tag{4.16}
$$
If we compare Δτ with the difference of τ_mf and τ_max, we get

$$
\lim_{N\to\infty} \left(\tau_{\max} - \tau_{mf}\right) = 1 + \frac{\delta}{\eta-1} \qquad \text{for } \eta > 1. \tag{4.17}
$$
For high values of η, the limit of this difference tends to 1. Therefore, the formulas 2.5 of Rodríguez et al. (2001) really represent a good approximation of the dynamics for η ≫ 1. The approximation with τ_⟨min⟩, however, is always closer to ⟨τ⟩ if p ≥ ε. In the case of p < ε, we can derive, using ⟨τ⟩ ≥ τ_mf, an upper bound for ⟨g⟩_min:

$$
\lim_{N\to\infty} \langle g\rangle_{\min} \le \frac{p\delta}{(\eta-1)(\varepsilon-p)} \qquad \text{if } \eta > 1 \text{ and } p < \varepsilon. \tag{4.18}
$$
As long as we use a value lower than the right-hand side of inequality 4.18 to replace ⟨g⟩ in equation 4.10, the approximation of ⟨τ⟩ with τ_⟨min⟩ beats τ_mf also for η > 1 and p < ε and is therefore an improvement over earlier studies for all values of η.

4.3.3 Application of the Thermodynamic Limit. If we suppose a dependence of ⟨τ⟩ on N as in equation 4.1 (αN^c ∼ ⟨τ⟩), we can derive the thermodynamic limits for α and c from the above results (see appendix C for details). The
3033
(4.19)
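A quick numeric illustration of this limit, fitting c to the closed-form upper bound τ_max of equation 4.11 (a sketch; we take the experimental setting L = N, so ε = 1/η):

```python
import numpy as np

def tau_max(N, eta, p, eps, delta=1.0, t_ref=1.0):
    """Upper bound of equation 4.11."""
    A = eps * (N - 1) * (eta - 1) / (2 * p) + (1 + t_ref) / 2
    return A + np.sqrt(A**2 + N * delta * eps / p)

for eta in (0.9, 1.0, 1.15):
    Ns = np.array([1e5, 1e6])
    taus = [tau_max(N, eta, p=0.9, eps=1 / eta) for N in Ns]
    c = np.polyfit(np.log(Ns), np.log(taus), deg=1)[0]
    print(f"eta = {eta:.2f}: fitted c = {c:.2f}")    # prints ~0, ~0.5, ~1
```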
This observation coincides with the experimental results presented in Figure 4 and indicates a linear dependence between the ISI and the number of neurons N for η > 1, whereas for η < 1, the ISI does not depend on the number of units in the ensemble at all. In between, at η = 1, the mean value of the ISI distribution ⟨τ⟩ depends on √N. The thermodynamic limit of α follows directly from equation 4.14. We get:

$$
\begin{aligned} &\lim_{N\to\infty} \alpha = \delta & &\text{if } \eta < 0.5,\\[0.5ex] &\frac{\delta}{\langle g\rangle(1-\eta)} \le \lim_{N\to\infty} \alpha \le \frac{\delta}{1-\eta} & &\text{if } 0.5 \le \eta < 1,\\[1ex] &\sqrt{\frac{\delta\varepsilon}{p\,\langle g\rangle}} \le \lim_{N\to\infty} \alpha \le \sqrt{\frac{\delta\varepsilon}{p}} & &\text{if } \eta = 1,\\[1ex] &\lim_{N\to\infty} \alpha = \frac{\varepsilon(\eta-1)}{p} & &\text{if } \eta > 1. \end{aligned} \tag{4.20}
$$
Since we can replace ⟨g⟩ with ⟨g⟩_min and 1 ≤ ⟨g⟩_min ≤ 2 for η ≤ 1, we get an excellent approximation for α in the limit of large N. We can observe this in Figure 5. The shaded area shows the possible regions of α. Already for low N, the experimental data fit well into the theoretical results for the thermodynamic limit, which gives an idea of what ISI to expect for η < 1.

4.3.4 Hysteresis. Condition 4.4 for a periodic spiking pattern also gives us an explanation of the hysteresis phenomena. In the dilution process, we always start from a robust or semirobust pattern at η ≤ 1, and condition 4.4 is not violated by an increase in η. The inequality gets sharper, which means that the probability of changing a semirobust pattern is even lower the more we increase η. In other words, the semirobust patterns tend to become robust during the dilution process. This changes once we reach η = 1, since now the stochastic state transitions are needed again to maintain the ISIs. But even for η > 1, the ISI may remain constant as long as the following condition is fulfilled (see section B.2 for details and derivation):
(N − 1)(η − 1) . p
(4.21)
This inequality reflects that the approximations 2.5 of Rodríguez et al. (2001) are a lower bound for the ISI in the dilution process, which has to be
fulfilled by the deterministic system. When a value of η violating this condition is reached, the system leaves the ISI it has fired with since the beginning of the dilution process and changes to a new ISI that does fulfill the condition. We can observe this in the dash-dotted line with markers in Figure 6, which shows a dilution process starting at η = 0.99. In this case, τ remains constant even for η = 1.01, since inequality 4.21 is still fulfilled at this point. Only if we dilute further does τ increase to values fulfilling condition 4.21 and start to coincide with τ of the other two dilution processes. At approximately η = 1.08, the mean ISI of the dilution processes coincides with that of the concentration processes, since there equations 2.5 start to approximate ⟨τ⟩ well again.

5 Heterogeneous Networks

In this section we show that the results described in section 4 for homogeneous networks can be extended to networks consisting of neurons with heterogeneous coupling strengths and thresholds.

5.1 Generalization of the Model

To be able to simulate heterogeneous networks, we have to apply the following extensions to the model presented in section 2:
• Instead of setting all coupling strengths ε_ij equal to a homogeneous coupling strength ε, we take the ε_ij from a gaussian distribution with mean ⟨ε⟩.

• We allow heterogeneous thresholds L_i for every unit. Each L_i is taken from a gaussian distribution with mean ⟨L⟩.

• The relation (denoted s) between the standard deviations and the means of both distributions is fixed.

• To characterize the new extended system, we calculate the parameter η in the same way as before but use the mean values ⟨ε⟩ and ⟨L⟩ of the distributions instead of the L and ε of a homogeneous network. We call this parameter η_ext:

$$
\eta_{ext} = \frac{\langle L\rangle - 1}{(N-1)\,\langle\varepsilon\rangle}. \tag{5.1}
$$
With this model of a heterogeneous network, we perform the same type of experiments as described in section 3. In the concentration process, we start in regions with high η_ext and increase the connection strengths of the system adiabatically. As in the experiments with homogeneous networks, we do this by subtracting a constant value Δη_ext from η_ext. We then have to calculate the corresponding values of ε_ij after the change by multiplying them by η_ext/(η_ext − Δη_ext). Thus, we achieve that after every change, equation 5.1 is still valid.
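A minimal sketch of this generalized setup and of one concentration step (the function names and variable layout are ours; gaussian draws may occasionally produce negative values for large s, which we ignore here for simplicity):

```python
import numpy as np

rng = np.random.default_rng(1)
N, mean_L, s = 1000, 1000.0, 0.3          # s: relative spread of both distributions

def make_network(eta_ext):
    mean_eps = (mean_L - 1) / ((N - 1) * eta_ext)       # equation 5.1
    eps = rng.normal(mean_eps, s * mean_eps, (N, N))    # heterogeneous couplings
    np.fill_diagonal(eps, 0.0)                          # eps_ii = 0
    L = rng.normal(mean_L, s * mean_L, N)               # heterogeneous thresholds
    return eps, L

def concentrate(eps, eta_ext, d_eta):
    """One concentration step: rescale all eps_ij so that eta_ext decreases
    by d_eta while equation 5.1 stays valid for the new mean coupling."""
    return eps * eta_ext / (eta_ext - d_eta), eta_ext - d_eta
```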
Figure 8: A comparison of the results of 1000 experiments with homogeneous and heterogeneous networks. (a) shows α and (b) shows c under the assumption that ⟨τ⟩ = αN^c for N ∈ {100, 200, ..., 1000}. Coupling strengths and thresholds were taken from gaussian distributions with deviations of 10% (dash-dotted line) and 30% (gray solid line with circles) of their means. The black solid lines (homogeneous network) coincide with the results shown in Figures 4 and 5, as do the shaded areas, which represent in subfigure a the regions of possible values of α for N → ∞ according to equations 4.20 and in subfigure b the region of possible values of the parameter c obtained using equations 4.10 and 4.11 for finite N as in the experiments. (c) Comparison of only the results for N = 1000. The mean ISIs ⟨τ⟩ of the heterogeneous experiments lie within the gray area bounded by τ_⟨min⟩ (see equation 4.10) and τ_max (see equation 4.11). Compare with Figure 7c. We notice that the results obtained for the theoretical bounds of the ISI are also valid for heterogeneous networks. (d) Hysteresis in a heterogeneous network with s = 30%. Compare with Figure 6. Hysteresis also occurs in heterogeneous networks but vanishes there at η slightly lower than 1.
5.2 Results for Heterogeneous Networks

We compare experiments with the generalized model with the results of section 4 for homogeneous networks. Figure 8 shows this comparison for the values of α (see Figure 8a) and c (see Figure 8b) of equation 4.1.
The only noticeable difference between experiments with homogeneous networks (solid black line) and the corresponding heterogeneous equivalents with deviations of 10% (dash-dotted line) and 30% of the mean value (gray solid line with circles) is that for η ≤ 1, the curves show some irregular bumps that differ from the expected values (gray areas) of the theoretical analysis. A closer examination of the ISIs at η < 1 reveals that the bumps are provoked by jumps between integer values of the ISI, as can be observed in Figure 8c, which shows ⟨τ⟩ for N = 1000. Contrary to intuition, the smooth change of the mean ISI seen in the homogeneous case (black solid line) becomes steplike in heterogeneous networks, meaning that for certain intervals of the coupling strength (i.e., the plateaus in Figure 8c), nearly all the experiments (with different networks) end up with the same ISI. The observed behavior, which is reproducible and robust despite the stochastic inputs and the heterogeneity of the network, is not fully understood, but it might be of biological relevance and will be the subject of future research. The locations of the steps depend on the ensemble size N, which causes the bumps in Figures 8a and 8b. In the thermodynamic limit, however, these bumps should disappear. In spite of this effect, the upper and lower bounds for ⟨τ⟩, given by equations 4.10 and 4.11 and represented by the gray area in Figure 8c, are valid even for high deviations of the underlying probability distributions. The hysteresis effect is also present in heterogeneous networks (see Figure 8d), although it vanishes slightly before the critical coupling strength is reached at η = 1 (compare with Figure 6). This is caused by some neurons that receive less input relative to their threshold than the others. They need stochastic input to reach the threshold already at values of η slightly lower than 1, which can cause the end of self-sustained firing and an increase in the ISI of the ensemble.
- One time step of the original model is split up into N time steps, and at each new time step, only one neuron is updated. A specific neuron i is updated at every time step t that fulfills i ≡ t mod N.
- Given a synaptic delay δ^seq and a refractory period t_ref^seq, the neuron updated at time t receives all the spikes sent within the interval [t − N − δ^seq, t − δ^seq) by the other neurons. In the case that neuron i fired at its last update, the interval narrows to [t − N − δ^seq + t_ref^seq, t − δ^seq). (A short sketch of this schedule follows the list.)
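As a compact Python sketch (ours; the function names are illustrative), the two rules above amount to the following bookkeeping:

```python
def updated_unit(t, N):
    # Rule 1: unit i is updated at every step t with i == t mod N.
    return t % N

def spike_window(t, N, delta_seq, fired_at_last_update, tref_seq=1):
    # Rule 2: collect spikes from [t - N - delta_seq, t - delta_seq),
    # narrowed by the refractory period if the unit fired at its last update.
    lo = t - N - delta_seq
    if fired_at_last_update:
        lo += tref_seq
    return lo, t - delta_seq
```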
This modification implies that the synaptic delay is no longer homogeneous. A spike now affects a postsynaptic unit at the next time step at which that unit is updated, rather than at the precise time step when the spike would reach it. Therefore, the effective delay is uniformly distributed between δ^seq and δ^seq + N, leading to an average delay of

$$\langle \delta^{seq} \rangle = \delta^{seq} + N/2. \qquad (6.1)$$
To compare simulations with sequential and parallel dynamics, we have to set δ^seq so as to fulfill $N\delta = \langle\delta^{seq}\rangle$, where δ is the delay of the parallel dynamics. This leads to

$$\delta^{seq} = N\left(\delta - \frac{1}{2}\right). \qquad (6.2)$$
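A quick numerical check (our sketch, with assumed values) confirms that the choice of equation 6.2 reproduces the parallel delay on average, via the uniform effective delay of equation 6.1:

```python
import numpy as np

N = 1000                               # ensemble size
delta_par = 1                          # delay of the parallel dynamics
delta_seq = N * (delta_par - 0.5)      # equation 6.2 gives 500 here

# A spike is seen by a unit only at its next update, so the effective delay
# is (approximately) uniform over N consecutive values behind delta_seq.
rng = np.random.default_rng(0)
offsets = rng.integers(0, N, size=100_000)
effective = delta_seq + offsets
print(effective.mean() / N)            # ~1.0 = delta_par, cf. equation 6.1
```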
The equations for the upper (see equation 4.11) and lower (see equation 4.10) bounds obtained for the parallel dynamics are valid only if tref ≤ δ, which means that the only noticeable effect of tref on the ensemble dynamics is that the threshold L acts as an absorbing barrier. To achieve the same effect in the sequential model, we have to set t_ref^seq = 1. A greater t_ref^seq would lead to absorption of more spikes than in the corresponding parallel experiments and, in consequence, a greater ISI of the ensemble. If we respect these two constraints, we obtain similar results for both types of dynamics, as shown in Figure 9, where we compare simulations with increasing coupling strength for N = 1000 neurons and a noise rate p = 0.9. Since the delay δ of the parallel simulations equals 1, δ^seq was set to 500. This implies a mean delay ⟨δ^seq⟩ of N, which is consistent with the rescaling of t to t/N. We observe in Figure 9a that after rescaling (i.e., dividing the ISI by the ensemble size), the mean ISI of the sequential dynamics (dashed line with circles) nearly coincides with the result of the parallel simulations (continuous line), and the bounds obtained in section 4.3 are also valid for this type of dynamics. Hysteresis can also be observed (data not shown). Units with sequential updating show, due to the inhomogeneous delays, a slightly greater deviation of their ISIs than their equivalents with parallel updating (see Figure 9b). This effect causes the phase-locked clusters, although observable also at η = 1, to appear as a general phenomenon (i.e., ⟨σ⟩ = 0) only for couplings slightly greater than the critical coupling strength, as can be observed in the inset of Figure 9b. Despite these small differences due to the enhanced dispersion of the ISIs, we can conclude that the results derived in the previous section are also valid if sequential instead of parallel updating is used.
Figure 9: Comparison of 1000 experiments with parallel dynamics (continuous lines) and their equivalents with sequential updating (dashed lines with circles). The number of neurons N = 1000 equals the threshold L, and p = 0.9. The results for the sequential dynamics are rescaled by the factor 1000. (a) The mean ISIs ⟨τ⟩ of both types of dynamics nearly coincide and lie within the gray area bounded by ⟨τ⟩min (see equation 4.10) and τmax (see equation 4.11). (b) The sequential dynamics show a slightly higher mean deviation ⟨σ⟩ of the units' ISIs in the same experiments as the parallel dynamics for η ≥ 1 but also end up in phase-locked clusters at η < 1. The inset shows a zoom on the region of interest around η = 1.
7 Discussion

We analyzed, by varying its coupling strength ε, the behavior of an ensemble of stochastic, nonleaky integrate-and-fire neurons with delayed, excitatory global coupling and a small refractory period. Around a critical value of the coupling strength, the behavior of the system undergoes a phase transition (see Figure 4), which has three main consequences for the dynamics of the ensemble:
- Transition from irregular to clustered spiking behavior. The units of the ensemble are homogeneous but show, due to a stochastic component in their evolution toward threshold, irregular spiking behavior if they are weakly coupled (see Figure 1a). The coupling strength ε is inversely proportional to the parameter η (see equation 2.4), which characterizes the system and allows fixing the phase transition at η = 1. For coupling greater than or equal to the coupling at η = 1, the population splits into several clusters. All neurons within a cluster spike in unison, although they may still follow different trajectories toward the threshold. The clusters are phase-locked (see Figure 1b), and all have exactly the same ISI, which is proportional to the number of clusters. This number decreases if the coupling is increased further (see Figure 1c) but usually remains greater than 1 until the trivial case of only one cluster is reached, at the latest at η < 0.5. The activity at η ≤ 1 is self-sustained in the sense that the stochastic inputs are not needed to maintain the clustered spiking activity.
- Hysteresis. The phase transition is accompanied by a hysteresis effect (see Figure 6). We applied a cyclic process to our system, which consisted of two subprocesses: a concentration process, where ε is increased, and a dilution process, where ε is decreased. Starting at an initial configuration with low coupling, the process is reversible as long as we do not reach (at values of η slightly greater than 1) the onset of phase locking. If we increase the coupling strength further, the ISI τ of the concentration process decreases following the rules explained below, whereas if we invert the process (start the dilution process), the ISI remains frozen as long as η ≤ 1. It then jumps to a value fulfilling condition 4.21 and coincides again with the ISI of the concentration process at η slightly greater than 1.
- Change in the dependence of the ISI on ensemble size and noise rate. For weak coupling, the mean ISI ⟨τ⟩ of the ensemble in the concentration process depends linearly on the ensemble size N. At η = 1, there is a square root dependence on N, whereas for coupling greater than at η = 1, the mean ISI ⟨τ⟩ depends neither on N nor on the rate of the stochastic component, which governs the dynamics for low coupling. Between η = 1 and η = 0.5, the ISI depends only on the coupling strength ε itself. For coupling greater than at η = 0.5, the influence of ε on the ISI is also lost; the length of the time delay then determines the ISI in this regime. (A toy simulation illustrating these regimes follows the list.)
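The following toy Python simulation (our own simplified reconstruction, not the authors' code) illustrates the three regimes. The update rule, the reset, and the refractory handling are simplifications, and the identification η = (L − 1)/(ε(N − 1)) is an assumption inferred from equation B.8.

```python
import numpy as np

rng = np.random.default_rng(1)

def mean_isi(N=200, L=200, eps=1.0, p=0.9, steps=4000):
    x = rng.integers(1, L, size=N).astype(float)  # random initial states
    pending = 0                                   # spikes in transit (delta = 1)
    last = np.zeros(N)
    isis = []
    for t in range(1, steps + 1):
        x += rng.random(N) < p                    # stochastic drive (rate p)
        x += eps * pending                        # delayed excitatory coupling
        fired = x >= L                            # threshold crossings
        pending = int(fired.sum())
        isis.extend(t - last[fired])
        last[fired] = t
        x[fired] = 1.0                            # reset; refractoriness simplified
    return float(np.mean(isis[N:]))               # discard the initial transient

for eps in (0.5, 1.0, 2.0):                       # eta = 2, 1, 0.5 for N = L = 200
    print(eps, mean_isi(eps=eps))
```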
To obtain analytical results, we used a deterministic approach to the model dynamics, which allowed us, using the simple condition 4.4, to derive upper and lower bounds (τmax and ⟨τ⟩min) for the mean ISI ⟨τ⟩ of the ensemble in the concentration process. The lower bound ⟨τ⟩min is also an excellent approximation for ⟨τ⟩, as can be observed in Figure 7d. Using the bounds, we can calculate the behavior of the system in its thermodynamic limit (i.e., for N → ∞) and characterize the phase transition analytically. These theoretical results are also valid if sequential instead of parallel updating (Herz & Marcus, 1993) is used to simulate the dynamics (see Figure 9). In this case, the synaptic delay is no longer homogeneous, leading to a slightly higher deviation of the units' ISIs and, in consequence, a slightly later onset of clustered spiking behavior in the concentration process. The effects explained above could also be observed if, instead of an increase or decrease of the coupling strength, positive or negative external input were added to the system. To calculate the critical point in this case,
we would have to add the external input to the denominator of equation 2.4 to get the appropriate value of η. Using the hysteresis effect, one can implement a simple memory by stimulating the system with a strong input, which leads the system to a value below η = 1. If the input is then substituted by a smaller one, which is just big enough to maintain the system below the critical coupling strength and could represent the will to remember the first input, the ISI and the firing pattern of the ensemble will still be the same as if the strong input were still present. Once the system receives a short erase signal (e.g., in the form of a negative input or the absence of the small input), which allows it to reach a state corresponding to η > 1, the firing pattern produced by the strong input will disappear (i.e., the memory will be deleted). Such a mechanism might be a novel way to represent working memory functions (Wang, 2001), which are often modeled in the form of bistable dynamical attractor networks (Durstewitz, Seamans, & Sejnowski, 2000). In our case it seems that we have multistability for coupling greater than the critical coupling strength, but further analysis is needed to verify this claim.

The main difference between our model and those used in earlier studies is the use of discrete-time dynamics and the combination of delayed coupling with an implicit refractory period. Setting the delay δ and the refractory period tref identical and equal to 1 in the experiments does not represent a critical restriction on the presented results, as can be seen in the theoretical analysis, which is valid in the general case as long as δ ≥ tref and both are positive. We can observe as well from equations 4.15 and 4.17 that in a system with no delay, δ = 0, the upper and lower bounds nearly coincide, leaving no room for clustering with more than one cluster or for hysteresis. For a delay lower than the refractory period, some of the interpopulation messages would get absorbed, leading to different upper and lower bounds for the ISIs. Such a pair of bounds has been calculated for a system without stochastic input by van Vreeswijk and Abbott (1993) for the case of δ slightly lower than tref. We expect those results to be valid for our system at η < 1 in the thermodynamic limit. In the case of sequential dynamics, the condition δ ≥ tref translates into setting the refractory period to a minimum value.

It is straightforward to transfer the discrete-time dynamics onto the continuous domain, replacing the stochastic state transitions with a continuous increase of the state variable. This leads to continuous oscillators similar to the ones analyzed by Senn and Urbanczik (2000), but with the add-ons of delayed coupling and refractory period, which are both crucial to observe the reported phase transition and clustering phenomena. One can even maintain the stochastic dynamics using a continuous extension of the ISI distribution of a single uncoupled unit, a negative binomial distribution in our case (Rodríguez et al., 2001). Two different possibilities for such an extension have been presented by Gómez, Kaltenbrunner, and López (2006), both leading to gamma distributions. The length of delay and refractory period remain untouched by these extensions. An analysis of these models
is the object of current research using a novel event-driven modeling technique (Gómez et al., 2006), which allows simulation of such extended models (also with noninteger values for delay and refractory period) without precision errors and without determining the exact trajectories of the states of the units in the ensemble. The derived formulas for upper and lower bounds should, apart from some minor modifications, be valid in these extended systems as well.

For models with a leaky term (Mirollo & Strogatz, 1990), clustering phenomena have been reported by Ernst et al. (1998) if a delay is added, but these clusters turned out to be unstable for excitatory coupling if noise was added to the coupling strength. We conjecture that this behavior would change if a positive refractory period were included in the system. During the refractory period, the threshold acts as an absorbing barrier, allowing the system a tolerance against noise that is higher the greater the amount of absorption. The fact that Ernst et al. report stability for inhibition is clear evidence for this conjecture, since the reset state is a reflecting barrier in their model, allowing the absorption of noisy negative coupling. It would also be interesting to observe the effect of delay and refractory period in a model with a biologically inspired phase response curve (PRC). For this type of PRC, the existence of a phase transition has been reported for one-dimensional (Östborn, 2002) and two-dimensional (Östborn et al., 2003) oscillator lattices.

Our study shows that synaptic delay significantly changes neural dynamics. It is crucial for the observed hysteresis effect and the appearance of several phase-locked clusters. Without it, we would get a totally synchronized ensemble at the critical coupling strength. The lack of a leaky term makes our model biologically plausible only in the limit of high coupling, where integration of synaptic inputs occurs over a timescale much shorter than the decay constant (Burkitt & Clark, 1999), which is exactly where we find the phase transition and hysteresis. We therefore conjecture that the phenomena described could be found as well in more complex, realistic neural models with delay. Even in a system with inhomogeneous delays, we can find similar results, as shown for the case of sequential dynamics, which demonstrates the robustness of the findings. In a network consisting of heterogeneous neurons with different coupling strengths and thresholds drawn from gaussian distributions, the reported phenomena are also present (see Figure 8). This may be of great importance if synaptic dynamics are added to the model. We conjecture that in this context, plasticity may act as a homeostatic mechanism that maintains the system in the regime around the critical coupling strength if it experiences perturbations. In the neighborhood of the critical state, the system explores all possible clusters one after the other. Clusters that are phase-locked after crossing this point are then transient states, and the system is ready to settle into any of the phase-locked, periodic firing patterns as a reaction to an increase in the number of received messages. This could have potential
applications in a wide range of engineering problems like image segmentation (Campbell, Wang, & Jayaprakash, 1999; Rhouma & Frigui, 2001) or large-scale sensor networks (Hong & Scaglione, 2005), where clustering might be useful to optimize the information throughput. The application of the results to information processing in natural systems is the subject of current research.

As a final aspect, we would like to highlight the application of our model (interpreted as a network of pulse-coupled oscillators with stochastic frequencies) to describe animal behavior, for example, populations of flashing fireflies (Buck, 1988). For the North American firefly, several types of synchronization have been reported (Copeland & Moiseff, 1995), and it seems that a certain number of flashing flies within a certain area is needed to observe synchronization. This would be in consonance with our model, where an increase in the number of units (fireflies) would decrease the coupling parameter η. For a certain number of units, we would reach the critical point where synchronization would appear. Apart from unison synchrony, wave synchrony has been observed, which might be explained by the clusters we report.

Appendix A: Proof of Periodic Pattern Condition 4.4

In the following analysis, we use the deterministic rule 4.3 instead of the stochastic state transitions 2.1 and restrict ourselves to the case δ ≥ tref. A unit with ISI τ can make τ − 1 − tref transitions due to this rule before reaching the threshold. The term tref corresponds to the refractory period, during which no increase of the state of a unit is allowed, and the −1 to the last time step, when the threshold is reached. We therefore have a contribution of $p(\tau - 1 - t_{ref})\,\Theta(\tau - 1 - t_{ref})$ from rule 4.3 to the total evolution of a neuron before its threshold is reached.⁴ With this result, we can calculate the mean minimum cluster size of a system of κ clusters. We call the clusters $K_i$ ($i \in \{1, \ldots, \kappa\}$) and set the number of elements of every cluster $|K_i| = k_i$. The clusters are ordered according to their spiking time. At every time step, one cluster reaches threshold, starting with cluster $K_1$. After cluster $K_\kappa$ spikes, the cycle starts again with cluster $K_1$. When cluster $K_i$ reaches threshold, the elements of cluster $K_{i+1}$ (or of $K_1$ in case i = κ) have received the inputs of all clusters except cluster $K_i$, plus an increase of $p(\tau - 1 - t_{ref})\,\Theta(\tau - 1 - t_{ref})$ due to rule 4.3. This leads to the following condition:

$$1 + \varepsilon\Bigg(\sum_{\substack{j=1 \\ j \ne i}}^{\kappa} k_j - 1\Bigg) + p(\tau - 1 - t_{ref})\,\Theta(\tau - 1 - t_{ref}) < L, \qquad (A.1)$$
⁴ Note the use of the Heaviside step function Θ to have a valid formula also for the case of τ < tref + 1.
which, due to $\sum_{j=1}^{\kappa} k_j = N$, is equivalent to

$$k_i > k_{min} = N - 1 + \frac{1 + p(\tau - 1 - t_{ref})\,\Theta(\tau - 1 - t_{ref}) - L}{\varepsilon} \quad \text{for all } i \in \{1, \ldots, \kappa\}. \qquad (A.2)$$
Written in terms of η, this is equal to

$$k_i > k_{min}(\tau) = (N-1)(1-\eta) + \frac{p(\tau - 1 - t_{ref})\,\Theta(\tau - 1 - t_{ref})}{\varepsilon}. \qquad (A.3)$$
We have proved the periodic pattern condition 4.4.

Appendix B: Bounds for ⟨τ⟩

In this appendix we derive bounds for the ISI τ of the ensemble and the mean ISI ⟨τ⟩ over several experiments. Again we use the deterministic rule 4.3 instead of the stochastic dynamics 2.1. We obtain two absolute bounds such that τmin ≤ τ ≤ τmax for all possible ISIs of the system, and a lower bound for the mean value ⟨τ⟩ over all possible initial conditions, which we call ⟨τ⟩min. We thus get

$$\langle\tau\rangle_{min} \le \langle\tau\rangle \le \tau_{max}. \qquad (B.1)$$
B.1 Maximum ISI τmax. First we derive the maximum possible ISI τmax of the system. It is obvious that the mean cluster size $\bar{k}(\tau) \ge k_{min}$, which can be written using equations 4.6 and A.3 as

$$\bar{k}(\tau) = \frac{N\delta}{\tau} \ge (N-1)(1-\eta) + \frac{p(\tau - 1 - t_{ref})\,\Theta(\tau - 1 - t_{ref})}{\varepsilon}. \qquad (B.2)$$

This results in an inequality of degree 2 in τ, which has the solution

$$\tau \le \tau_{max} = \frac{\varepsilon(N-1)(\eta-1)}{2p} + \frac{1+t_{ref}}{2} + \sqrt{\left(\frac{\varepsilon(N-1)(\eta-1)}{2p} + \frac{1+t_{ref}}{2}\right)^{2} + \frac{\varepsilon N \delta}{p}}, \qquad (B.3)$$

compatible with the condition τ ≥ 0. We omitted the Θ(τ − 1 − tref) term to calculate equation B.3, which can be done if τ ≥ 1 + tref. Inequality B.2 permits
us to calculate the minimum value of η for which a certain ISI τ is possible. We get

$$\eta \ge \eta_{min}(\tau) = \frac{p(\tau - 1 - t_{ref})\,\Theta(\tau - 1 - t_{ref})}{\varepsilon(N-1)} + 1 - \frac{N\delta}{(N-1)\,\tau}. \qquad (B.4)$$

It is sufficient to calculate ηmin(2δ), since we assume that δ ≥ tref ≥ 1 and the system will be fully synchronized (τ = δ), that is, consist of only one cluster, for values of η below this limit, which we call ηsync. We get

$$\eta \ge \eta_{sync} = 0.5 - \frac{1}{2(N-1)} + \frac{p(2\delta - 1 - t_{ref})\,\Theta(2\delta - 1 - t_{ref})}{\varepsilon(N-1)}. \qquad (B.5)$$
For the limit of large N, this transforms into

$$\lim_{N\to\infty} \eta_{sync} = 0.5. \qquad (B.6)$$
Condition B.3 is therefore valid if η ≥ 0.5.

B.2 Minimum ISI τmin. If we observe the state of a neuron just after it has passed the threshold, instead of looking at the state of a neuron before the threshold is reached as in the proof of the periodic pattern condition in appendix A, we can derive a condition for the minimum possible ISI. We use the fact that a neuron with an ISI of length τ increases its state due to the deterministic rule 4.3 by p(τ − tref) during every ISI. Moreover, since the ensemble is homogeneous, we expect that during an ISI of a single neuron, all other neurons fire once, which translates into an increase of the activation state by ε(N − 1) due to the ensemble dynamics. Combining the two terms, we get

$$L \le 1 + \varepsilon(N-1) + p(\tau - t_{ref}), \qquad (B.7)$$

which we can transform into a condition for τ:

$$\tau \ge \hat{\tau}_{min} = t_{ref} + \frac{L - 1 - \varepsilon(N-1)}{p} = t_{ref} + \frac{\varepsilon(N-1)(\eta-1)}{p}. \qquad (B.8)$$

Comparing this result with equation 2.5, we observe that $\hat{\tau}_{min} = \tau_{mf}$ and have thus found that formula 2.5 is a lower bound for the ISI τ. This bound can be improved for η < 1, since there $\hat{\tau}_{min}$ is negative, although τ can never be smaller than tref. We therefore use the Heaviside step function and get

$$\tau_{min} = t_{ref} + \frac{\varepsilon(N-1)(\eta-1)}{p}\,\Theta(\eta - 1), \qquad (B.9)$$

which is a lower bound for τ valid for all values of η.
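A short numerical sketch (ours, with assumed parameter values) of this lower bound, using η = (L − 1)/(ε(N − 1)) as implied by equation B.8:

```python
def tau_min(N, L, eps, p, tref=1):
    eta = (L - 1) / (eps * (N - 1))           # coupling parameter (cf. eq. B.8)
    drift = eps * (N - 1) * (eta - 1) / p     # equation B.8 minus tref
    return tref + max(drift, 0.0)             # Heaviside cutoff of equation B.9

for L in (500, 1000, 2000):                   # eta = 0.5, 1, 2 for N = 1000, eps = 1
    print(L, tau_min(N=1000, L=L, eps=1.0, p=0.9))
```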
B.3 Minimum Mean ISI ⟨τ⟩min. To derive a lower bound for ⟨τ⟩, we start with

$$\frac{N\delta}{\tau} = g(\tau)\,\big(k_{min}(\tau) + 1\big) \le g(\tau)\left[(N-1)(1-\eta) + \frac{p(\tau - 1 - t_{ref})\,\Theta(\tau - 1 - t_{ref})}{\varepsilon} + 1\right], \qquad (B.10)$$

which can be obtained from equations 4.6, 4.7, and A.3. We calculate the mean of the left- and right-hand sides according to the probabilities P(τ), which denote the probability of the system ending up in a state with ISI τ:

$$\sum_{i=1}^{\tau_{max}} \frac{N\delta}{i}\,P(i) \le \sum_{i=1}^{\tau_{max}} g(i)\left[(N-1)(1-\eta) + \frac{p(i - 1 - t_{ref})\,\Theta(i - 1 - t_{ref})}{\varepsilon} + 1\right] P(i). \qquad (B.11)$$

We set 1/τ = f, where f is the spiking frequency, use P(τ) = P(f) = P(g(τ)), eliminate the terms of the summation that equal 0, and get

$$N\langle f\rangle\delta \le \langle g\rangle\left[(N-1)(1-\eta) + 1 - \frac{(1+t_{ref})\,p}{\varepsilon}\right] + \frac{(1+t_{ref})\,p}{\varepsilon}\sum_{i=1}^{1+t_{ref}} g(i)\,P(i) + \frac{p}{\varepsilon}\sum_{i=2+t_{ref}}^{\tau_{max}} g(i)\,i\,P(i). \qquad (B.12)$$
Using the identity ⟨XY⟩ = ⟨X⟩⟨Y⟩ + Cov(X, Y) for two random variables X and Y, we obtain

$$N\langle f\rangle\delta \le \langle g\rangle\left[(N-1)(1-\eta) + 1 - \frac{(1+t_{ref})\,p}{\varepsilon}\right] + \frac{p}{\varepsilon}\left[\mathrm{Cov}\big(g(\tau),\tau\big) - \sum_{i=2}^{1+t_{ref}} g(i)\,(i-1)\,P(i) + \langle g\rangle\langle\tau\rangle\right]. \qquad (B.13)$$

From equation B.10 it is easy to see that Cov(g(τ), τ) ≤ 0, since an increase of τ translates into a decrease of g(τ) and vice versa. Because of this fact and g(τ) ≥ 0, we can eliminate the two left-most terms in the bracket of inequality B.13 by weakening the inequality:

$$N\langle f\rangle\delta \le \langle g\rangle\left[(N-1)(1-\eta) + 1 + \frac{p\,(\langle\tau\rangle - 1 - t_{ref})}{\varepsilon}\right]. \qquad (B.14)$$
The mean frequency ⟨f⟩ is equal to the inverse of the harmonic mean h(τ). Since for a set of positive numbers the harmonic mean is never greater than the arithmetic mean (Bullen, 2003), we have $1/\langle\tau\rangle \le 1/h(\tau) = \langle f\rangle$. Applying this to inequality B.14 leads to

$$\frac{N\delta}{\langle\tau\rangle} \le \langle g\rangle\left[(N-1)(1-\eta) + 1 + \frac{p\,(\langle\tau\rangle - 1 - t_{ref})}{\varepsilon}\right]. \qquad (B.15)$$

We can transform this into a quadratic inequality in ⟨τ⟩, since ⟨τ⟩ > 0. It has the only positive solution

$$\langle\tau\rangle \ge \langle\tau\rangle_{min} = \frac{\varepsilon\big[(N-1)(\eta-1) - 1\big]}{2p} + \frac{1+t_{ref}}{2} + \sqrt{\left(\frac{\varepsilon\big[(N-1)(\eta-1) - 1\big]}{2p} + \frac{1+t_{ref}}{2}\right)^{2} + \frac{\varepsilon N\delta}{p\,\langle g\rangle}}. \qquad (B.16)$$
We have found a lower bound for the mean value of our ISI distribution.

Appendix C: Thermodynamic Limits of ⟨τ⟩min and τmax

In this appendix we calculate the thermodynamic limit (i.e., the behavior for N → ∞) of ⟨τ⟩ for the different regions of η. We first calculate the thermodynamic limits of the bounds ⟨τ⟩min and τmax of ⟨τ⟩ derived in appendix B and then situate the thermodynamic limit of ⟨τ⟩ between the limits of the bounds. We start with the limit for ⟨τ⟩min, using equation B.16:
- η > 1:

$$\lim_{N\to\infty} \frac{\langle\tau\rangle_{min}}{N} = \lim_{N\to\infty}\left[\frac{\varepsilon(\eta-1)}{2p} - \frac{\varepsilon\eta}{2pN} + \frac{1+t_{ref}}{2N} + \sqrt{\left(\frac{\varepsilon(\eta-1)}{2p} - \frac{\varepsilon\eta}{2pN} + \frac{1+t_{ref}}{2N}\right)^{2} + \frac{\varepsilon\delta}{p\langle g\rangle N}}\;\right] = \frac{\varepsilon(\eta-1)}{2p} + \frac{\varepsilon(\eta-1)}{2p}, \qquad (C.1)$$

which for η > 1 gives

$$\lim_{N\to\infty} \frac{\langle\tau\rangle_{min}}{N} = \frac{\varepsilon(\eta-1)}{p}. \qquad (C.2)$$

This coincides with the limit of $\tau_{mf}/N$ (see equation 2.5).
- η = 1. Setting η = 1 in equation B.16 leads to

$$\lim_{N\to\infty} \frac{\langle\tau\rangle_{min}}{\sqrt{N}} = \lim_{N\to\infty}\left[\frac{1+t_{ref}}{2\sqrt{N}} - \frac{\varepsilon}{2p\sqrt{N}} + \sqrt{\frac{1}{4N}\left(1+t_{ref} - \frac{\varepsilon}{p}\right)^{2} + \frac{\varepsilon\delta}{p\langle g\rangle}}\;\right] = \sqrt{\frac{\varepsilon\delta}{p\langle g\rangle}}. \qquad (C.3)$$
- η < 1. Equation C.1 for η < 1 leads to $\lim_{N\to\infty} \langle\tau\rangle_{min}/N = 0$, but we can improve this result. We transform equation B.16 slightly and calculate

$$\lim_{N\to\infty} \langle\tau\rangle_{min} = \lim_{N\to\infty}\left[-\left(\frac{\varepsilon\big[(N-1)(1-\eta) + 1\big]}{2p} - \frac{1+t_{ref}}{2}\right) + \sqrt{\left(\frac{\varepsilon\big[(N-1)(1-\eta) + 1\big]}{2p} - \frac{1+t_{ref}}{2}\right)^{2} + \frac{\varepsilon N\delta}{p\langle g\rangle}}\;\right].$$

Applying $-a + b = \frac{b^{2} - a^{2}}{a + b}$, we get

$$\lim_{N\to\infty} \langle\tau\rangle_{min} = \lim_{N\to\infty} \frac{\dfrac{\varepsilon N\delta}{p\langle g\rangle}}{\dfrac{\varepsilon[(N-1)(1-\eta)+1]}{2p} - \dfrac{1+t_{ref}}{2} + \sqrt{\left(\dfrac{\varepsilon[(N-1)(1-\eta)+1]}{2p} - \dfrac{1+t_{ref}}{2}\right)^{2} + \dfrac{\varepsilon N\delta}{p\langle g\rangle}}} \qquad (C.4)$$

and therefore have for η < 1

$$\lim_{N\to\infty} \langle\tau\rangle_{min} = \frac{\delta}{\langle g\rangle\,(1-\eta)}. \qquad (C.5)$$
The limits for τmax can be calculated analogously from equation B.3. Combining the limits for both bounds with the result of equation B.6 that the system has an ISI of δ for η < 0.5, we obtain

$$\begin{aligned} &\lim_{N\to\infty} \langle\tau\rangle = \delta && \text{if } \eta < 0.5,\\ &\frac{\delta}{\langle g\rangle(1-\eta)} \le \lim_{N\to\infty} \langle\tau\rangle \le \frac{\delta}{1-\eta} && \text{if } 0.5 \le \eta < 1,\\ &\sqrt{\frac{\varepsilon\delta}{p\langle g\rangle}} \le \lim_{N\to\infty} \frac{\langle\tau\rangle}{\sqrt{N}} \le \sqrt{\frac{\varepsilon\delta}{p}} && \text{if } \eta = 1,\\ &\lim_{N\to\infty} \frac{\langle\tau\rangle}{N} = \frac{\varepsilon(\eta-1)}{p} && \text{if } \eta > 1. \end{aligned} \qquad (C.6)$$
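As a numerical illustration (our sketch, with assumed parameters), evaluating the closed form of τmax from equation B.3 for growing N reproduces the scaling regimes of equation C.6:

```python
import numpy as np

eps, p, delta, tref = 1.0, 0.9, 1.0, 1.0

def tau_max(N, eta):
    a = eps * (N - 1) * (eta - 1) / (2 * p) + (1 + tref) / 2
    return a + np.sqrt(a**2 + eps * N * delta / p)   # equation B.3

for N in (10**3, 10**5, 10**7):
    print(N,
          tau_max(N, 0.8),                  # -> delta/(1 - eta) = 5
          tau_max(N, 1.0) / np.sqrt(N),     # -> sqrt(eps*delta/p) ~ 1.054
          tau_max(N, 1.5) / N)              # -> eps*(eta - 1)/p ~ 0.556
```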
Acknowledgments

We thank Alberto Suárez, Adan Garriga, Ernest Montbrió, Anders Ledberg, and Alex Roxin for discussions and suggestions on earlier versions of the manuscript. We also thank two anonymous referees for their comments, which helped to improve the manuscript significantly. This work has been partially supported by the Càtedra Telefónica de Producció Multimèdia de la Universitat Pompeu Fabra and by CICyT grant TIN2004-04363-C03-01.
References

Abeles, M. (1991). Corticonics: Neural circuits of the cerebral cortex. Cambridge: Cambridge University Press.
Bienenstock, E. (1995). A model of neocortex. Network: Computation in Neural Systems, 6, 179–224.
Bollé, D., & Blanco, J. B. (2004). Two-cycles in spin-systems: Sequential versus synchronous updating in multistate Ising-type ferromagnets. European Physical Journal B, 42, 397–406.
Brunel, N., & Hakim, V. (1999). Fast global oscillations in networks of integrate-and-fire neurons with low firing rates. Neural Computation, 11, 1621–1671.
Brunel, N., & Hansel, D. (2006). How noise affects the synchronization properties of recurrent networks of inhibitory neurons. Neural Computation, 18, 1066–1110.
Buck, J. (1988). Synchronous rhythmic flashing of fireflies. II. Quarterly Review of Biology, 63, 265–289.
Bullen, P. S. (2003). Handbook of means and their inequalities. Dordrecht, The Netherlands: Kluwer.
Burkitt, A. N. (2006). A review of the integrate-and-fire neuron model: I. Homogeneous synaptic input. Biological Cybernetics, 95, 1–19.
Burkitt, A. N., & Clark, G. M. (1999). Analysis of integrate-and-fire neurons: Synchronization of synaptic input and spike output. Neural Computation, 11, 871–901.
Campbell, S. R., Wang, D. L. L., & Jayaprakash, C. (1999). Synchrony and desynchrony in integrate-and-fire oscillators. Neural Computation, 11, 1595–1619.
Chen, C. C. (1994). Threshold effects on synchronization of pulse-coupled oscillators. Physical Review E, 49, 2668–2672.
Copeland, J., & Moiseff, A. (1995). The occurrence of synchrony in the North American firefly Photinus carolinus (Coleoptera: Lampyridae). Journal of Insect Behavior, 8, 381–394.
Diesmann, M., Gewaltig, M. O., & Aertsen, A. (1999). Stable propagation of synchronous spiking in cortical neural networks. Nature, 402, 529–533.
Durstewitz, D., Seamans, J. K., & Sejnowski, T. J. (2000). Neurocomputational models of working memory. Nature Neuroscience Supplement, 3, 1184–1191.
Ernst, U., Pawelzik, K., & Geisel, T. (1998). Delay-induced multistable synchronization of biological oscillators. Physical Review E, 57, 2150–2162.
Fontanari, J. F., & Köberle, R. (1988). Information-processing in synchronous neural networks. Journal de Physique, 49, 13–23.
Fusi, S., & Mattia, M. (1999). Collective behavior of networks with linear (VLSI) integrate-and-fire neurons. Neural Computation, 11, 633–652.
Gerstein, G. L., & Mandelbrot, B. (1964). Random walk models for the spike activity of a single neuron. Biophysical Journal, 4, 41–68.
Gerstner, W. (2000). Population dynamics of spiking neurons: Fast transients, asynchronous states, and locking. Neural Computation, 12, 43–89.
Gómez, V., Kaltenbrunner, A., & López, V. (2006). Event modeling of message interchange in stochastic neural ensembles. In IJCNN'06 Proceedings (pp. 81–88). Piscataway, NJ: IEEE Press.
Herz, A. V. M., & Marcus, C. M. (1993). Distributed dynamics in neural networks. Physical Review E, 47, 2155–2161.
Hong, Y. W., & Scaglione, A. (2005). A scalable synchronization protocol for large scale sensor networks and its applications. IEEE Journal on Selected Areas in Communications, 23, 1085–1099.
Ikegaya, Y., Aaron, G., Cossart, R., Aronov, D., Lampl, I., Ferster, D., & Yuste, R. (2004). Synfire chains and cortical songs: Temporal modules of cortical activity. Science, 304, 559–564.
Izhikevich, E. M. (2006). Polychronization: Computation with spikes. Neural Computation, 18, 245–282.
Izhikevich, E. M., Gally, J. A., & Edelman, G. M. (2004). Spike-timing dynamics of neuronal groups. Cerebral Cortex, 14, 933–944.
Kaneko, K. (1990). Clustering, coding, switching, hierarchical ordering, and control in a network of chaotic elements. Physica D, 41, 137–172.
Kirk, V., & Stone, E. (1997). Effect of a refractory period on the entrainment of pulse-coupled integrate-and-fire oscillators. Physics Letters A, 232, 70–76.
Kuramoto, Y. (1984). Chemical oscillations, waves and turbulence. Berlin: Springer.
Lindner, B. (2004). Interspike interval statistics of neurons driven by colored noise. Physical Review E, 69, 022901.
McMillen, D., & Kopell, N. (2003). Noise-stabilized long-distance synchronization in populations of model neurons. Journal of Computational Neuroscience, 15, 143–157.
Middleton, J. W., Chacron, M. J., Lindner, B., & Longtin, A. (2003). Firing statistics of a neuron model driven by long-range correlated noise. Physical Review E, 68, 021920.
Mirollo, R. E., & Strogatz, S. H. (1990). Synchronization of pulse-coupled biological oscillators. SIAM Journal on Applied Mathematics, 50, 1645–1662.
Néda, Z., Ravasz, E., Brechet, Y., Vicsek, T., & Barabási, A. L. (2000). The sound of many hands clapping: Tumultuous applause can transform itself into waves of synchronized clapping. Nature, 403, 849–850.
Östborn, P. (2002). Phase transition to frequency entrainment in a long chain of pulse-coupled oscillators. Physical Review E, 66, 016105.
Östborn, P., Åberg, S., & Ohlén, G. (2003). Phase transitions towards frequency entrainment in large oscillator lattices. Physical Review E, 68, 015004.
Peskin, C. (1975). Mathematical aspects of heart physiology. New York: Courant Institute of Mathematical Sciences, New York University.
Rhouma, M. B. H., & Frigui, H. (2001). Self-organization of pulse-coupled oscillators with application to clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23, 180–195.
Rodríguez, F., Suárez, A., & López, V. (2001). Period focusing induced by network feedback in populations of noisy integrate-and-fire neurons. Neural Computation, 13, 2495–2516.
Rodríguez, F., Suárez, A., & López, V. (2002). A discrete model of neural ensembles. Philosophical Transactions of the Royal Society: Mathematical, Physical and Engineering Sciences, 360, 559–573.
Salinas, E., & Sejnowski, T. J. (2002). Integrate-and-fire neurons driven by correlated stochastic input. Neural Computation, 14, 2111–2155.
Senn, W., & Urbanczik, R. (2000). Similar nonleaky integrate-and-fire neurons with instantaneous couplings always synchronize. SIAM Journal on Applied Mathematics, 61, 1143–1155.
Stein, R. B. (1967). Some models of neuronal variability. Biophysical Journal, 7, 37–68.
Timme, M., Wolf, F., & Geisel, T. (2002). Prevalence of unstable attractors in networks of pulse-coupled oscillators. Physical Review Letters, 89(15), 154105.
Tuckwell, H. C. (1988). Introduction to theoretical neurobiology: Vol. 2. Nonlinear and stochastic theories. Cambridge: Cambridge University Press.
van Vreeswijk, C. (1996). Partial synchronization in populations of pulse-coupled oscillators. Physical Review E, 54, 5522–5537.
van Vreeswijk, C., & Abbott, L. F. (1993). Self-sustained firing in populations of integrate-and-fire neurons. SIAM Journal on Applied Mathematics, 53, 253–264.
Wang, X. J. (2001). Synaptic reverberation underlying mnemonic persistent activity. Trends in Neurosciences, 24, 455–463.
Winfree, A. (1967). Biological rhythms and the behavior of populations of coupled oscillators. Journal of Theoretical Biology, 16, 15–42.
Received January 13, 2006; accepted December 14, 2006.
LETTER
Communicated by Andrew Gilpin
Model-Based Reinforcement Learning for Partially Observable Games with Sampling-Based State Estimation

Hajime Fujita
[email protected]
Shin Ishii
[email protected]
Nara Institute of Science and Technology, Graduate School of Information Science, Ikoma, Nara 630-0192, Japan

Neural Computation 19, 3051–3087 (2007)
Games constitute a challenging domain of reinforcement learning (RL) for acquiring strategies because many of them include multiple players and many unobservable variables in a large state space. The difficulty of solving such realistic multiagent problems with partial observability arises mainly from the fact that the computational cost of the estimation and prediction in the whole state space, including unobservable variables, is too heavy. To overcome this intractability and enable an agent to learn in an unknown environment, an effective approximation method with explicit learning of the environmental model is required. We present a model-based RL scheme for large-scale multiagent problems with partial observability and apply it to a card game, hearts. This game is a well-defined example of an imperfect information game and can be approximately formulated as a partially observable Markov decision process (POMDP) for a single learning agent. To reduce the computational cost, we use a sampling technique in which the heavy integration required for the estimation and prediction can be approximated by a plausible number of samples. Computer simulation results show that our method is effective in solving such a difficult, partially observable multiagent problem.

1 Introduction

Reinforcement learning (RL) (Sutton & Barto, 1998) is a machine learning framework that can solve various Markov decision processes (MDPs) based on trial-and-error interactions with a known or unknown environment. It has been studied for making an agent learn and act in its own environment and has successfully solved decision-making problems in various realistic situations (Kaelbling, Littman, & Moore, 1996); for example, a channel allocation task for a cellular telephone system (Singh & Bertsekas, 1996), a biped robot walking task (Mori, Nakamura, & Ishii, 2004), and a standup task for a real robot (Morimoto & Doya, 2001) have been studied as
© 2007 Massachusetts Institute of Technology
difficult applications. Many RL researchers in the machine learning community are now devoting much attention to more realistic situations: multiagent problems (Stone & Veloso, 2000; Shoham, Powers, & Grenager, 2004) and partially observable problems in large-scale environments (Littman & Majercik, 1997).

Multiple learning agents often exist in a common environment; such a situation is referred to as a multiagent system. A decision-making or optimal control problem in multiagent environments has a high degree of difficulty due to interactions among the agents. These interactions break the Markov property of the state space because the changing behaviors of the other agents make the setting dynamic. RL, however, has often been applied to such multiagent problems with various devices and has obtained successful results (Claus & Boutilier, 1998; Dayan, Schraudolph, & Sejnowski, 2001; Wang & Sandholm, 2003; Suematsu & Hayashi, 2002). Crites and Barto (1996), for example, achieved an optimal scheduler for an elevator dispatching task. Littman (1994) and Hu and Wellman (2003) proposed learning methods by extending conventional Q-learning (Watkins & Dayan, 1992) so as to be applicable to two-player stochastic games. Chang, Ho, and Kaelbling (2003) used a Kalman filter to estimate an appropriate reward for each agent in a cooperative situation and demonstrated faster learning than a naive approach in a large multiagent game.

The environments often have partial observability: the agents cannot directly access internal states of the environment and can get only observations that contain partial information about the state. Decision-making problems in such a situation can be formulated as partially observable Markov decision processes (POMDPs) (Kaelbling, Littman, & Cassandra, 1998). When this framework is introduced to realistic problems, however, serious computational difficulties arise because exact solutions require computing a policy over the entire belief space, which forms a simplex whose dimensionality is equal to the number of states in the underlying MDP. Not only the estimation process for a large number of unobservable states but also computing the optimal policy depending on the estimation requires heavy computation. Many studies of POMDPs have therefore devised methods to reduce the computation (Brafman, 1997; Boutilier & Poole, 2000; Meuleau, Peshkin, Kim, & Kaelbling, 2000; Littman, 1996; Thrun, Cassandra, & Kaelbling, 1995; Peshkin, Meuleau, & Kaelbling, 1999; Loch & Singh, 1998; McCallum, 1993; Chrisman, 1992; Yoshimoto, Ishii, & Sato, 2003; Nikovski & Nourbakhsh, 2000).

Many card games have both general properties described above. Most cannot be played alone (that is, they are in a multiagent setting), and cards in another player's hand or undealt cards are unobservable to each player (meaning a partially observable situation). Card games, then, have been studied as well-defined test-beds for strategy acquisition problems in the real world. Existing algorithms, however, have not attained the level of human players, unlike in completely observable games like Tesauro's
TD-gammon (Tesauro, 1994). To deal with partially observable games (Chang et al., 2003; Dahl, 2002; Nair, Marsella, Tambe, Pynadath, & Yokoo, 2003; Hansen, Bernstein, & Zilberstein, 2004; Emery-Montemerlo, Gordon, Schneider, & Thrun, 2004), three difficulties arise: the first is to estimate the distribution of unobservable states based on the history of observations, the second is to predict the opponent agents' actions based on any acquired models, and the third is to cope with the computational intractability stemming from the huge state space. A card game with partial observability therefore is a challenging target to study.

This letter deals in particular with the card game hearts (Perkins, 1998; Pfahringer, Kaindl, Kramer, & Fürnkranz, 1999; Sturtevant & White, 2006), a four-player competitive and partially observable game, and presents an automatic strategy acquisition scheme for this game. In our previous study (Ishii et al., 2005), the environment was regarded as stationary for the learning agent, under the assumption that there is a single learning agent, and our POMDP-RL method was then applied to this multiagent problem. The same assumption is used in this study: for most of this letter, we assume that there is only one learning agent in the environment. When there are multiple learning agents, the environmental dynamics can be influenced by the learning of the other agents. The environment therefore does not primarily satisfy a stationary Markov assumption even for underlying states and so cannot be properly formulated as a POMDP. In an ordinary approach, this multiagent problem should be dealt with by other appropriate frameworks such as game theory (Fudenberg & Tirole, 1991), which aims at providing solutions to problems of selecting optimal actions in nonstationary multiagent environments; the optimal solution for all agents is known as a Nash equilibrium. Although it is not difficult to obtain the equilibrium solution for zero-sum two-player games such as poker, it is intractable to do so in general-sum multiplayer games with partial observability such as the game hearts, because the highly complicated cooperative or competitive relationship among the multiple agents prevents us from calculating an analytic solution. Most studies about acquiring an optimal policy in multiagent environments therefore have restricted their targets to simple problems with two players (Stone & Veloso, 2000; Bowling & Veloso, 2000); a comprehensive examination of multiagent learning techniques applicable to general situations has not yet been undertaken. We then assume that the POMDP assumption is approximately valid if the environmental model can be learned faster than it changes. To achieve this fast learning, the learning agent is designed to have multiple function approximators, each of which tries to represent a policy of the corresponding opponent agent, and makes them learn independently to predict the unknown behaviors of the opponent agents. This model-based approach (Sutton, 1990; Moore & Atkeson, 1993; Doya, Samejima, Katagiri, & Kawato, 2002) provides us with robust learning and an ability that can be applied to various multiagent problems.
In our previous study (Ishii et al., 2005), to avoid computational intractability in the estimation of unobservable states, we used a mean-field-like analog approximation. Although this approximation was effective and the trained RL agent showed good performance, further improvement was possible; the approximation was not very accurate for belief approximation over discrete unobservable states. To obtain a better estimation of unobservable states, therefore, we use in this study a sampling technique; the heavy integration due to the large state space is approximated by a plausible number of samples, each representing a discrete state. In our previous study, the value function was approximated over the observation space according to temporal difference (TD) learning (Sutton, 1988). Although this learning method allowed the agent to acquire a good policy, it was difficult to learn accurate values for observation states due to the well-known perceptual aliasing property of partially observable environments (Singh, Jaakkola, & Jordan, 1994). In this study, therefore, we make the agent learn the state value function with a completely observable approximation (Littman et al., 1995); the agent maintains the state value function so that the self-consistency equation holds on the underlying MDP. The problems above can then be solved, and the performance of the learning agent is improved, which is shown by computer simulations using rule-based agents. These results suggest that our method is applicable to large-scale multiagent problems with partial observability.

2 Partially Observable Markov Decision Process

A POMDP (Kaelbling et al., 1998) is a framework to model an agent learning and acting in a partially observable environment and consists of:
- A finite set of real states S = {s_1, s_2, . . . , s_{|S|}}, which cannot be determined with complete certainty
- A finite set of observation states O = {o_1, o_2, . . . , o_{|O|}}, which can be perceived by the agent
- A finite set of actions A = {a_1, a_2, . . . , a_{|A|}}, which can be executed by the agent
- A reward function R : S × A → R, which maps a pair of a state and an action into a numerical reward
- Transition probability P(s_{t+1} | s_t, a_t), which defines the environmental dynamics mapping a pair of a state and an action into a next state
- Observation probability P(o_t | s_t), which provides the observation process mapping a state into an observation (a minimal container mirroring this tuple is sketched below)
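For concreteness, a minimal Python container for this tuple might look as follows (our illustration, not from the letter; dense arrays are feasible only for toy-sized problems, not for a game like hearts):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class TabularPOMDP:
    T: np.ndarray   # transition probabilities P(s'|s,a), shape (|S|, |A|, |S|)
    Z: np.ndarray   # observation probabilities P(o|s),   shape (|S|, |O|)
    R: np.ndarray   # rewards R(s,a),                     shape (|S|, |A|)

    def step(self, rng, s, a):
        """Sample a transition, its observation, and the reward."""
        s2 = rng.choice(self.T.shape[2], p=self.T[s, a])
        o = rng.choice(self.Z.shape[1], p=self.Z[s2])
        return s2, o, self.R[s, a]
```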
The objective of each agent is to acquire the policy that maximizes an expected future reward in the partially observable world. The state st is not
observable for each agent; only the observation o t , which contains partial information about the state, is available. If the policy is determined only from an immediate observation, without estimating an internal state explicitly or implicitly, it does not usually converge to a global optimum (Singh et al., 1994; Kaelbling et al., 1996), because the observation does not satisfy the Markov property. One way to overcome this problem is to use a history of the agent’s experience, Ht = {(o t , −), (o t−1 , a t−1 ), . . . , (o 1 , a 1 )}. Because it is difficult to maintain such a naive history with a limited memory capacity, a belief state b(st ) ≡ P(st |Ht ) is often used. Since the belief state summarizes the history as a probability distribution over S, it is a sufficient statistic with the Markov property; it is updated for every new observation according to the incremental Bayes formula:
$$b(s_{t+1}) \equiv P(s_{t+1} \mid H_{t+1}) = \frac{P(o_{t+1} \mid s_{t+1}) \sum_{s_t \in S} P(s_{t+1} \mid s_t, a_t)\, P(s_t \mid H_t)}{P(o_{t+1} \mid H_t, a_t)}. \qquad (2.1)$$
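A small Python sketch (ours, with made-up toy matrices) of one application of equation 2.1: predict with the transition model, correct with the observation model, then normalize.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nO = 5, 3
T = rng.dirichlet(np.ones(nS), size=nS)   # T[s, s'] = P(s'|s, a), one fixed action
Z = rng.dirichlet(np.ones(nO), size=nS)   # Z[s', o] = P(o|s')

def belief_update(b, o):
    """Equation 2.1: predict with T, correct with Z, then normalize."""
    pred = b @ T                          # sum over s of P(s'|s,a) P(s|H)
    post = Z[:, o] * pred                 # multiply by P(o_{t+1}|s')
    return post / post.sum()              # divide by P(o_{t+1}|H, a)

b = np.full(nS, 1 / nS)                   # uniform initial belief
for o in (0, 2, 1):
    b = belief_update(b, o)
print(b, b.sum())
```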
The optimal policy that maps a belief state into an action becomes a solution of a continuous-space belief-state MDP. Although this formulation has the capability of solving a POMDP, an exact solution is hard to achieve because of the requirement for computing a policy over the entire belief space, whose cost increases exponentially with the number of states of the underlying MDP. Algorithms for computing an optimal policy therefore were considered impractical for large-scale domains, and recent research has focused on approximate algorithms that scale up the application area effectively.

The problem structure of POMDPs can be classified into (I) the agent knows the environmental model and (II) the model is unknown. Existing solutions for case I tend to challenge realistic problems with effective approximations using given models and include completely observable MDP approximation (Littman et al., 1995), grid-based approximation (Brafman, 1997), factored belief states (Boutilier & Poole, 1996), point-based value iteration (Pineau, Gordon, & Thrun, 2003), hierarchical approaches (Theocharous & Mahadevan, 2002), and Monte Carlo approximation (Thrun, 2000). Solutions for case II can be classified into (IIa) model-free methods, which learn to act without learning an environmental model, or (IIb) model-based methods, which learn an environmental model and compute a policy using the learned model (Shani, 2004). Many model-free approaches have been proposed, such as finite-state controllers (Meuleau et al., 2000), memory bits (Peshkin et al., 1999), short-term memory (Loch & Singh, 1998), and utile suffix memory (McCallum, 1993). On the other hand, model-based approaches learn the model explicitly in addition to estimating internal states, which may make the problem difficult. Modern techniques, however, can provide a tractable solution for reasonably sized state spaces, for example, using a hidden Markov model (Chrisman, 1992),
a Kalman filter (Yoshimoto et al., 2003), a recurrent network (Whitehead & Lin, 1995), and state merging methods (Nikovski & Nourbakhsh, 2000). It is still difficult to apply these existing model-based methods to large-scale problems like card games.

In the RL method presented in this letter, the learning agent acquires the environmental model by multiple action predictors; the state transition of usual multiagent games depends on the opponent agents' actions. The agent learns the policy of each opponent agent explicitly by the corresponding predictor from past experiences and determines its action based on the acquired opponent agents' policies. Our method therefore is a model-based approach classified as IIb. Here, each action predictor is trained individually. When one opponent agent changes its policy, it is enough for the corresponding predictor to adapt to the change. Since this enables the agent to adapt to environmental changes quickly, our method, formulated as a single-agent system, can be applied to multiagent systems. Because the value function over the belief space has an intractable complexity in large-scale problems, we use a completely observable approximation; the agent tries to approximate the value function in the underlying state space instead of the observation or belief state space. POMDP problems are then solved based on an expectation of the value function with respect to the belief state. This is therefore similar to QMDP (Littman et al., 1995), included in case I, except that our method does not maintain the Q-function explicitly but calculates the utility function by one-step-ahead prediction based on the acquired model. To calculate the utility function, estimation and prediction must be performed; the estimation is to calculate an expectation over possible unobservable current states whose probabilities are proportional to the belief p(s_t|H_t), and the prediction is to calculate an expectation over possible next states and observations whose probabilities are proportional to the transition probability p(s_{t+1}|s_t, a_t) and observation probability p(o_t|s_t), respectively. Integration in these expectations is almost impossible because most challenging problems have a large number of states; for example, hearts has 52!/(13!)^4 ≈ 10^28 states if every combination of 52 cards is considered. In our method, this heavy integration is approximated by a sampling technique. This methodology is then similar to Monte Carlo POMDP (Thrun, 2000), included in case I.

The major targets of this study are partially observable and multiagent problems; there are multiple agents in a common environment with partial observability, and they are in a cooperative or competitive situation. In this letter, we use the following notations and assumptions. t indicates an action turn of the learning agent. The variables (state, observation, and action) for agent i (i = 0, . . . , M) are denoted by s_t^i, o_t^i, and a_t^i,¹ where M
¹ Some games have partial observability in opponent players' actions a_t^i (i = 1, . . . , M). For example, in a game where players discard cards face down, each player cannot observe the actions. Such unobservable actions, however, can be estimated if the cards discarded by the actions are included in the unobservable state s_t; the agent can select an action based on the estimated actions. Since the agent can calculate the likelihood as long as the actions involve some partial observation, our method, which estimates unobservable states based on likelihood information, is applicable. Although in this letter we do not assume any partial observability in opponent agents' actions, our method can thus be extended to deal with such situations.
is the number of opponent agents and i = 0 signifies the learning agent; s_t, o_t, and a_t are the same as s_t^0, o_t^0, and a_t^0, respectively, and s_{t+1} is the same as s_t^{M+1}. A strategy of agent i (i = 1, . . . , M) is denoted by φ^i, which corresponds to the action selection policy or the policy parameters of the agent. We make two assumptions: first, agent i selects its action a_t^i based on an immediate observation o_t^i according to its action selection probability P(a_t^i | o_t^i, φ^i); and second, the other agents' strategies φ^i are fixed for the time being, that is, there is only one learning agent in the environment, and hence the environment is stationary. In the computer experiments described in section 5, however, we relax the second assumption and show several acceptable results in dynamic environments with multiple learning agents.

3 Model

In our RL method, the learning agent selects an action according to the greedy policy,

$$\pi(H_t) = \operatorname*{argmax}_{a_t} U(H_t, a_t), \qquad (3.1)$$
where U(H_t, a_t) is the utility function at time step t. This function is defined as an expectation of a one-step-ahead future value with respect to the belief state and transition probability,

$$U(H_t, a_t) = \sum_{s_t \in S} P(s_t \mid H_t) \sum_{s_{t+1} \in S} P(s_{t+1} \mid s_t, a_t)\,\big[R(s_t, a_t, s_{t+1}) + \gamma\, V(s_{t+1})\big], \qquad (3.2)$$
where R(s_t, a_t, s_{t+1}) denotes an immediate reward at time step t, V(s_{t+1}) denotes the state value function of the next state s_{t+1}, and γ denotes a discount factor.² In large-scale problems, it is difficult to learn the value function over the belief space because optimization of the value function, whose input is a probability distribution over the high-dimensional state space, is too complex and requires heavy computation. We then use the completely
² Our study mainly aims at dealing with finite-horizon problems but could be applied to infinite-horizon problems, because our method is basically an extension of the classical value iteration algorithm.
observable approximation (Littman et al., 1995); the agent maintains the state value function so that the self-consistency equation holds on the underlying MDP and calculates the state-action value by a one-step-ahead prediction (the second summation in equation 3.2). After that, it calculates the history-action utility as an expectation of the state-action utility with respect to the belief state (the first summation in equation 3.2), under the knowledge that the optimal value function for the belief space can be approximated well by a piecewise linear and convex function (Smallwood & Sondik, 1973). The calculation of the utility function, however, includes three difficulties: (1) the summations in equation 3.2 over possible current states s_t and next states s_{t+1} have computational intractability because there are so many state candidates in a realistic problem; (2) the computation for constructing the belief state P(s_t|H_t) over possible current states is intractable due to the large state space and high dimensionality; and (3) the prediction of possible next states is difficult because the environmental model P(s_{t+1}|s_t, a_t) is unknown for the learning agent and may change in a multiagent setting. Some effective approximations therefore are required for avoiding the above difficulties.

To avoid the computational intractability problem (1), we use sampling-based approximation; the learning agent obtains independent and identically distributed (i.i.d.) random samples, ŝ_t and ŝ_{t+1}, whose probabilities are proportional to the belief state P(s_t|H_t) and the environmental model P(s_{t+1}|s_t, a_t), respectively. Note here that each sampled ŝ_{t+1} depends on a certain sampled ŝ_t. The utility function in equation 3.2 can then be approximated as

$$U(H_t, a_t) \approx \sum_{j=1}^{N} P\big(\hat{s}_t^{(j)} \mid H_t\big) \sum_{k=1}^{K} P\big(\hat{s}_{t+1}^{(k)} \mid \hat{s}_t^{(j)}, a_t\big)\,\Big[R\big(\hat{s}_t^{(j)}, a_t, \hat{s}_{t+1}^{(k)}\big) + \gamma\, V\big(\hat{s}_{t+1}^{(k)}\big)\Big]. \qquad (3.3)$$
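Read as a plain Monte Carlo average (samples drawn from the belief and the model, so the probability weights of equation 3.3 reduce to 1/(NK)), the computation looks roughly like the following Python sketch (ours; all callables are hypothetical stand-ins for the game-specific machinery):

```python
def utility(history, action, belief_sample, model_sample, reward, value,
            n_current=100, n_next=10, gamma=1.0):
    """Monte Carlo version of equation 3.3: average over sampled current
    states (j = 1..N) and sampled next states (k = 1..K)."""
    total = 0.0
    for _ in range(n_current):                  # s_t drawn from the belief
        s = belief_sample(history)
        for _ in range(n_next):                 # s_{t+1} drawn from the model
            s_next = model_sample(s, action)
            total += reward(s, action, s_next) + gamma * value(s_next)
    return total / (n_current * n_next)
```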
Samples of the current state ŝ_t can be obtained by sequential Monte Carlo methods such as particle filtering (Gilks, Richardson, & Spiegelhalter, 1996); samples of the previous state ŝ_{t−1}, whose probability is proportional to the previous belief state P(s_{t−1}|H_{t−1}), are diffused into the next time step according to equation 2.1. This process is repeated N times, and the agent obtains estimated current states {ŝ_t^(j) | j = 1, . . . , N}. Samples of the next state ŝ_{t+1} are obtained by a simple sampling technique; K samples are drawn from the environmental model P(s_{t+1}|s_t, a_t) given a sampled current state ŝ_t and an action a_t. This sampling is repeated K times, and the agent obtains predicted next states {ŝ_{t+1}^(k) | k = 1, . . . , K} for each of the N possible current states. The two summations in equation 3.2 are thus simultaneously approximated by using
KN samples in equation 3.3. In large-scale and multiagent problems, however, this naive sampling approach is insufficient due to difficulties 2 and 3. We then need further devices, as described below.

To avoid difficulty 2, we do not deal with the whole history H_t but use a one-step history h_t = {(o_t, −), (o_{t−1}, a_{t−1})}, which leads us to make a simplification (A): a belief state represents a simple one-step prior knowledge about states but does not carry the complete likelihood information. The history H_t contains two kinds of information. The first is about impossible states at the tth turn; for example, in the game hearts, if an agent played ♥9 after a leading card ♣3 in a past trick, the agent no longer has any club cards at the tth turn, and any state in which this agent holds club cards is impossible (see appendix A). The second is about likelihood, considering the characteristics of the opponent agents; for example, in the same situation as above, it is unlikely for the agent to have any heart card higher than ♥9. Although the belief state P(s_t|H_t), which is a sufficient statistic for the history H_t, should involve these two kinds of information, we partly ignore the latter kind by replacing the whole history H_t with a one-step history h_t: the belief state P(s_t|H_t) is approximated by the partial belief state P(s_t|h_t). No impossible state, on the other hand, is considered in light of the former type of information, but each possible state has a one-step likelihood between the (t − 1)th and tth time steps. Although the maintenance of likelihood over all possible states requires heavy computation and a large amount of memory in many realistic problems, even with the sampling approximation, this simplification enables us to estimate internal states easily at each time step.

To solve problem 3, the learning agent uses action predictors. Since there are M opponents' actions within the transition from a state s_t to the next state s_{t+1} in multiagent problems, the transition probability P(s_{t+1}|s_t, a_t) in equation 3.2 is represented by using the real action selection probability P(a_t^i|o_t^i, φ^i) of agent i:

$$P(s_{t+1} \mid s_t, a_t) = \sum_{\{s_t^1, \ldots, s_t^M\} \in S^M}\;\sum_{\{a_t^1, \ldots, a_t^M\} \in A^M}\; \prod_{j=0}^{M} P\big(s_t^{j+1} \mid s_t^j, a_t^j\big)\, \prod_{i=1}^{M}\,\sum_{o_t^i \in O} P\big(a_t^i \mid o_t^i, \phi^i\big)\,P\big(o_t^i \mid s_t^i\big). \qquad (3.4)$$
3060
H. Fujita and S. Ishii
a certain observation o ti , otherwise 0. Under these properties, equation 3.4 is simplified to
M
{a t1 ,...,a tM }∈A M (st ,a t ,st+1 )
i=1
P(st+1 |st , a t ) =
P(a ti |o ti , φ i ),
(3.5)
where A^M(s_t, a_t, s_{t+1}) denotes the set of possible sequences of opponents' actions {a_t^1, . . . , a_t^M} by which the state s_t reaches the next state s_{t+1} after the action a_t. According to property i, given a current state s_t and an action a_t, determining the next state s_{t+1} is the same as determining the opponent agents' actions {a_t^1, . . . , a_t^M}. For obtaining samples of the next state ŝ_{t+1}, therefore, the agent predicts all opponent agents' actions â_t^1, . . . , â_t^M. A sample path for equation 3.5 is then given by

$$P(\hat{s}_{t+1} \mid \hat{s}_t, a_t) = \prod_{i=1}^{M} P(\hat{a}_t^i \mid \hat{o}_t^i, \phi^i), \qquad (3.6)$$
where ŝ_{t+1} is the sampled next state, which is consistent with the current state ŝ_t, the action a_t, and the opponent agents' actions {â_t^1, . . . , â_t^M} ∈ A^M(ŝ_t, a_t, ŝ_{t+1}), and the observation ô_t^i of agent i is uniquely determined (that is, constant) by the current state ŝ_t, the action a_t, and the previous opponents' actions â_t^1, . . . , â_t^{i−1}, according to property ii. Since the action selection probability P(a_t^i | o_t^i, φ^i) of opponent agent i is unknown to the learning agent, as described in problem 3, the agent uses action predictors; each action predictor learns the action selection model of the corresponding opponent agent. The real environmental model in equation 3.6 is then approximated by M action predictors:

$$P(\hat{s}_{t+1} \mid \hat{s}_t, a_t) \approx P(\hat{s}_{t+1} \mid \hat{s}_t, a_t, \hat{\Phi}) = \prod_{i=1}^{M} P(\hat{a}_t^i \mid \hat{o}_t^i, \hat{\phi}^i), \qquad (3.7)$$
where Φ̂ = {φ̂^1, . . . , φ̂^M}. The action selection probability of the ith opponent agent, P(a_t^i | o_t^i, φ^i), is approximated by the ith action predictor (i = 1, . . . , M); φ̂^i in equation 3.7 is not the real strategy φ^i of agent i but a strategy approximated by the ith action predictor. Since each action predictor is realized as a function approximator, φ̂^i denotes, in effect, its parameters (see section 4). The agent predicts that the ith opponent agent selects an action a_t^i according to the soft-max policy:

$$P(a_t^i \mid o_t^i, \hat{\phi}^i) = \frac{\exp\!\left(F^i(o_t^i, a_t^i; \hat{\phi}^i)/T^i\right)}{\sum_{a_t^j \in A} \exp\!\left(F^i(o_t^i, a_t^j; \hat{\phi}^i)/T^i\right)}, \qquad (3.8)$$
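As a concrete illustration, a temperature-controlled soft-max such as equation 3.8 can be computed as in the sketch below; this is a generic sketch, and the utility values passed in are assumed to come from an action predictor's function approximator.

```python
import math

def softmax_policy(utilities, temperature=1.0):
    """Soft-max action probabilities in the style of equation 3.8.

    utilities   -- dict mapping each legal action to its assumed utility F^i(o, a)
    temperature -- T^i; smaller values make the predicted policy greedier
    """
    # Subtract the maximum utility for numerical stability before exponentiating.
    u_max = max(utilities.values())
    exp_u = {a: math.exp((u - u_max) / temperature) for a, u in utilities.items()}
    z = sum(exp_u.values())
    return {a: e / z for a, e in exp_u.items()}
```

With T^i = 1.0, as used in the experiments of section 5, the predicted policy is neither strongly greedy nor close to uniform.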
where F^i(o_t^i, a_t^i) denotes an assumed utility of taking action a_t^i for an observation o_t^i, and T^i is a constant that represents the assumed randomness of agent i's policy. Using the above two devices for problems 2 and 3, equation 3.3 is now further approximated as

$$U(H_t, a_t) \approx \sum_{j=1}^{N} P(\hat{s}_t^{(j)} \mid h_t) \sum_{k=1}^{K} \prod_{i=1}^{M} P(\hat{a}_t^{i,(k)} \mid \hat{o}_t^{i,(j)}, \hat{\phi}^i) \left[ R(\hat{s}_t^{(j)}, a_t, \hat{s}_{t+1}^{(k)}) + \gamma V(\hat{s}_{t+1}^{(k)}) \right], \qquad (3.9)$$

by replacing the belief state P(ŝ_t^{(j)} | H_t) and the environmental model P(ŝ_{t+1}^{(k)} | ŝ_t^{(j)}, a_t) with the partial belief state P(ŝ_t^{(j)} | h_t) and the action predictors P(ŝ_{t+1}^{(k)} | ŝ_t^{(j)}, a_t, Φ̂) = ∏_{i=1}^{M} P(â_t^{i,(k)} | ô_t^{i,(j)}, φ̂^i), respectively, where â^{i,(k)} is a constituent of the action sequence {â^{1,(k)}, . . . , â^{M,(k)}} ∈ A^M(ŝ_t^{(j)}, a_t, ŝ_{t+1}^{(k)}).

Samples of the current state ŝ_t are obtained in three steps: the first step is to sample a previous state ŝ*_{t−1} from a uniform distribution over states that do not violate the whole history H_t (no impossible state being sampled); the second step is to calculate the action selection probability P(a_{t−1}^i | o_{t−1}^i, φ̂^i) for the action a_{t−1}^i actually taken by agent i according to equation 3.8 and to iterate this step over all opponent agents' actions a_{t−1}^1, . . . , a_{t−1}^M; and the last step is to calculate the one-step likelihood P(ŝ_t^* | ŝ*_{t−1}, a_{t−1}, Φ̂) according to equation 3.7 and to obtain a sample of the current state ŝ_t^* given the previous state ŝ*_{t−1}, the previous action a_{t−1}, and the opponents' actual actions a_{t−1}^1, . . . , a_{t−1}^M. These three steps are repeated N times, and the agent obtains estimated current states {ŝ_t^{(j)} | j = 1, . . . , N} (a sketch of this procedure is given below). The samples of the current state ŝ_t carry only one-step likelihood information (simplification A); in the first step above, samples of the previous state ŝ*_{t−1} are obtained by uniform sampling, ignoring the earlier history. For approximating a large-scale posterior with a limited number of samples, this uniform prior sampling is a plausible approach because the complete posterior belief is likely to be very sparse, and many possible states tend to have similar probability. In addition, in problems whose state space is discrete with a deterministic observation process, samples ŝ_t inconsistent with an actual observation o_t necessarily disappear; for example, if an agent played ♥9, all state samples in which another agent holds ♥9 are no longer useful. In large-scale and high-dimensional problems, such as the game hearts, very few samples remain after each observation; this makes it difficult to maintain complete belief states incrementally with a limited number of samples according to equation 2.1. The learning agent therefore discards the previous information P(s_{t−1} | H_{t−1}) and obtains new samples ŝ*_{t−1} with uniform probability before calculating a one-step likelihood. Simplification A therefore provides an effective approximation that makes otherwise intractable problems easier.
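The following sketch shows one way to implement this three-step current-state sampling with one-step likelihood weighting; the helper names (observe, step) and the particle-weight interface are our own illustrative assumptions, not the authors' code.

```python
import random

def sample_current_states(n_samples, possible_prev_states, prev_action,
                          opponent_actions, predictors, observe, step):
    """Three-step current-state sampling with one-step likelihood weights.

    possible_prev_states -- previous states consistent with the history H_t
    prev_action          -- the learning agent's action a_{t-1}
    opponent_actions     -- actions a_{t-1}^1 .. a_{t-1}^M actually observed
    predictors           -- soft-max action predictors, one per opponent
    observe              -- deterministic observation: (state, agent) -> o^i
    step                 -- deterministic transition: (state, action) -> state
    """
    particles, weights = [], []
    for _ in range(n_samples):
        # Step 1: uniform sampling over states that do not violate the history.
        s = random.choice(possible_prev_states)
        s = step(s, prev_action)
        # Step 2: accumulate the one-step likelihood of the observed actions.
        likelihood = 1.0
        for i, (predictor, action) in enumerate(zip(predictors, opponent_actions)):
            likelihood *= predictor(observe(s, i))[action]
            s = step(s, action)  # property i: the transition is deterministic
        # Step 3: the propagated state is a current-state sample s_t^*.
        particles.append(s)
        weights.append(likelihood)
    total = sum(weights) or 1.0
    return particles, [w / total for w in weights]
```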
Samples of the next state ŝ_{t+1} are obtained by a simple sampling technique. K samples are drawn by the following three steps: first, calculate the action selection probability for opponent agent i according to equation 3.8; second, select a possible action â_t^i according to that probability (the first and second steps are iterated alternately over all opponent agents); and third, compute the next state ŝ_{t+1} given the estimated current state ŝ_t, the action a_t, and the opponent agents' actions â_t^1, . . . , â_t^M. These three steps are repeated K times, and the agent obtains predicted next states {ŝ_{t+1}^{(k)} | k = 1, . . . , K}, under the learned model, for each of the N possible current states (a sketch follows). The three approximations (the sampling technique, the partial belief state, and the action predictors) enable us to solve large-scale and partially observable problems. Our method, for example, provides an effective solution to multiagent problems whose underlying state space is discrete, including various multiagent games.
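A matching sketch for next-state sampling, again with hypothetical helper names, could look as follows; it reuses the observe and step callables assumed in the previous sketch.

```python
import random

def sample_next_states(k_samples, s_current, action, predictors, observe, step):
    """Draw K next-state samples by rolling out the opponents' predicted actions.

    Each sample is produced by alternately querying an action predictor
    (equation 3.8) and advancing the deterministic game dynamics.
    """
    samples = []
    for _ in range(k_samples):
        s = step(s_current, action)  # the learning agent plays a_t first
        for i, predictor in enumerate(predictors):
            probs = predictor(observe(s, i))               # soft-max probabilities
            actions, p = zip(*probs.items())
            a_hat = random.choices(actions, weights=p)[0]  # sample one action
            s = step(s, a_hat)                             # deterministic transition
        samples.append(s)
    return samples
```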
4 Function Approximators with Feature Extraction

In large-scale problems, the state space often has high dimensionality. In card games, for example, the state s_t is a 52-dimensional vector, each of whose dimensions represents the status of the corresponding card. This high dimensionality causes the performance of the RL agent to deteriorate because of the large and redundant representation. To achieve effective learning in a realistic problem, therefore, it is beneficial to apply feature extraction techniques, designed around the properties of the target problem, to the input and output of the value function and action predictors. Here, we explain the feature extraction techniques used to apply our approach to a specific domain: the card game hearts. To reduce the dimensionality, the 52-dimensional state s_t is converted to a 36-dimensional input p_t according to the following representation:

- p_t[8 × i + 1]: the number of club cards in agent i's (i = 0, 1, 2, 3) hand
- p_t[8 × i + 2]: same as p_t[8 × i + 1], but the suit is diamonds
- p_t[8 × i + 3]: same as p_t[8 × i + 1], but the suit is spades, except ♠Q, ♠K, and ♠A
- p_t[8 × i + 4]: a binary value for ♠Q
- p_t[8 × i + 5]: a binary value for ♠K
- p_t[8 × i + 6]: a binary value for ♠A
- p_t[8 × i + 7]: the number of heart cards in agent i's hand that are weaker than the strongest card in the current trick (only if the leading card is a heart; otherwise, zero)
- p_t[8 × i + 8]: the number of heart cards in agent i's hand that are stronger than the strongest card in the current trick (only if the leading card is a heart; otherwise, the number of all heart cards in agent i's hand)
- p_t[33] to p_t[36]: a bit sequence
The binary values of ♠Q, ♠K, and ♠A take either 1 or 0, representing whether agent i has the card or not. Since these three cards are the most important in the game hearts, we allocate one dimension to each card. Because heart cards are also important, we allocate twice as many dimensions to heart cards as to the other suits. The bit sequence represents the playing order in the current trick; for example, when the learning agent is the second player in the current trick (at the learning agent's tth playing turn), {p_t[33], p_t[34], p_t[35], p_t[36]} = {0, 1, 0, 0}. When the next state s_{t+1} is obtained by the sampling process, equation 3.9 can be calculated by replacing V(s_{t+1}) with V(p_{t+1}) according to the above feature extraction; a minimal sketch of this per-player encoding follows.
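The sketch below encodes the eight p_t dimensions allotted to one player; the (suit, rank) card representation and the simplified trick context are assumptions for illustration only.

```python
def encode_hand(hand, trick_strongest_heart=None):
    """Encode one player's hand into the 8 p_t dimensions for that player.

    hand -- iterable of (suit, rank) pairs, e.g., ('spade', 12) for ♠Q,
            with ranks 2..14 (14 = ace)
    trick_strongest_heart -- rank of the strongest heart in the current trick,
            or None if the leading card is not a heart
    """
    features = [0.0] * 8
    for suit, rank in hand:
        if suit == 'club':
            features[0] += 1
        elif suit == 'diamond':
            features[1] += 1
        elif suit == 'spade':
            if rank == 12:   features[3] = 1  # ♠Q
            elif rank == 13: features[4] = 1  # ♠K
            elif rank == 14: features[5] = 1  # ♠A
            else:            features[2] += 1
        elif suit == 'heart':
            if trick_strongest_heart is None:
                features[7] += 1  # no heart lead: count all hearts here
            elif rank < trick_strongest_heart:
                features[6] += 1  # weaker than the strongest heart in the trick
            else:
                features[7] += 1  # stronger than the strongest heart in the trick
    return features
```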
The observation o_t^i for agent i is also a 52-dimensional vector; each dimension of the observation vector represents the observable status of the corresponding card, such as whether the card has already been played. To reduce the dimensionality, the 52-dimensional observation o_t^i is converted to a 26-dimensional input q_t^i according to the following representation:

- q_t^i[1]: the number of club cards in agent i's hand that are weaker than the strongest card already played in the current trick (only if the leading card is a club; otherwise, zero)
- q_t^i[2]: the number of club cards in agent i's hand that are stronger than the strongest card already played in the current trick (only if the leading card is a club; otherwise, the number of all club cards in agent i's hand)
- q_t^i[3]: same as q_t^i[1], but the suit is diamonds
- q_t^i[4]: same as q_t^i[2], but the suit is diamonds
- q_t^i[5]: same as q_t^i[1], but the suit is spades, except ♠Q, ♠K, and ♠A
- q_t^i[6]: same as q_t^i[2], but the suit is spades, except ♠Q, ♠K, and ♠A
- q_t^i[7]: a binary value for ♠Q
- q_t^i[8]: a binary value for ♠K
- q_t^i[9]: a binary value for ♠A
- q_t^i[10] to q_t^i[22]: a binary value for each heart card
- q_t^i[23] to q_t^i[26]: a bit sequence
The binary values of ♠Q, ♠K, ♠A, and the heart cards are defined similarly to p_t. Since the statuses of these cards are important for predicting the next action in the game hearts, we allocate one dimension to each card. The bit sequence is the same as in p_t. Given an extracted input q_t^i, the utility function F^i of agent i returns a 26-dimensional output vector r_t^i, each of whose dimensions represents the following:
- r_t^i[1]: the merit value where agent i plays an arbitrary club card that is weaker than the strongest card in the current trick
- r_t^i[2]: the merit value where agent i plays the weakest club card (among the remaining club cards) that is stronger than the strongest card in the current trick
- r_t^i[3]: the merit value where agent i plays a club card (neither the weakest nor the strongest among the remaining club cards) that is stronger than the strongest card in the current trick
- r_t^i[4]: the merit value where agent i plays the strongest card (among the remaining club cards) that is stronger than the strongest card in the current trick
- r_t^i[5]: same as r_t^i[1], but the suit is diamonds
- r_t^i[6]: same as r_t^i[2], but the suit is diamonds
- r_t^i[7]: same as r_t^i[3], but the suit is diamonds
- r_t^i[8]: same as r_t^i[4], but the suit is diamonds
- r_t^i[9]: the merit value where agent i plays an arbitrary spade card (except ♠Q, ♠K, and ♠A) that is weaker than the strongest card in the current trick
- r_t^i[10]: the merit value where agent i plays an arbitrary spade card (except ♠Q, ♠K, and ♠A) that is stronger than the strongest card in the current trick
- r_t^i[11]: the merit value where agent i plays ♠Q
- r_t^i[12]: the merit value where agent i plays ♠K
- r_t^i[13]: the merit value where agent i plays ♠A
- r_t^i[14] to r_t^i[26]: the merit value where agent i plays each heart card
The merit value represents the tendency of agent i to play the corresponding card; the larger the merit value, the more likely the agent is to play that card (see equations 3.8 and 4.1).
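Because several primitive cards can share one compressed output dimension (for example, r_t^i[1] covers every club weaker than the strongest card in the trick), the predicted probability must be mapped back to primitive actions. The text does not specify how probability mass is shared within a dimension; the uniform spreading in this sketch is purely our assumption.

```python
def to_primitive_probs(compressed_probs, legal_cards, dimension_of):
    """Convert compressed action probabilities to primitive-card probabilities.

    compressed_probs -- dict: output dimension -> probability (equation 4.1)
    legal_cards      -- cards agent i may legally play in this trick
    dimension_of     -- maps a card (given the trick context) to its dimension
    """
    # Count how many legal cards fall into each compressed dimension.
    counts = {}
    for card in legal_cards:
        d = dimension_of(card)
        counts[d] = counts.get(d, 0) + 1
    # Spread each dimension's mass uniformly over the cards it covers (assumed).
    return {card: compressed_probs[dimension_of(card)] / counts[dimension_of(card)]
            for card in legal_cards}
```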
According to the above feature extraction technique, the action selection probability of agent i is calculated in four steps: first, the observation o_t^i for agent i is obtained when the state s_t is given by the sampling process; second, the 52-dimensional observation vector o_t^i is converted to the extracted, and hence compressed, 26-dimensional input q_t^i according to the above representation; third, the utility function F^i returns the compressed 26-dimensional output r_t^i given the input q_t^i; and fourth, the ith action predictor calculates the action selection probability,

$$P(r_t^i \mid q_t^i, \hat{\phi}^i) = \frac{\exp\!\left(F^i(q_t^i, r_t^i; \hat{\phi}^i)/T^i\right)}{\sum_{r_t^j \in A^j} \exp\!\left(F^i(q_t^i, r_t^j; \hat{\phi}^i)/T^i\right)}, \qquad (4.1)$$

where A^j denotes the set of possible actions for agent j. Note that the utility F^i(o_t^i, a_t^i) in equation 3.8 is replaced with F^i(q_t^i, r_t^i) by the feature extraction. The action selection probability P(r_t^i | q_t^i, φ̂^i) in equation 4.1 is a 26-dimensional vector, each of whose dimensions represents the occurrence probability of a compressed action r_t^i, which is different from the primitive action a_t^i. The action selection probability for a compressed action r_t^i is therefore converted to that of a primitive action a_t^i.

Feature extraction that exploits the properties of the target problem allows us to reduce the dimensionality and improve the learning efficiency. Large-scale realistic problems, however, still have huge state spaces and high dimensionality; for example, there are about 10^12 possible inputs q_t^i for the action predictor. This makes it difficult to learn the action selection model directly from a limited number of learning samples with a simple table-lookup approach such as a multinomial model, because of the large number of effective parameters in the lookup table. To overcome this difficulty, we use function approximators for the value function and the action predictors. A function approximator can represent an input-output relation as a nonlinear regression model with a reasonable number of parameters, and it gives the agent the ability to generalize to unknown situations.

The value function is approximated by a normalized gaussian network (NGnet) (Sato & Ishii, 2000) with the feature extraction technique applied to its input. The NGnet V(p_t) is trained to learn the relationship between the compressed input p_t and the cumulative reward in three steps: first, the real state s̄_t and the discounted cumulative reward (return) $\bar{R}_t = \sum_{i=t}^{13} \gamma^{i-t} R(s_i, a_i, s_{i+1})$ for each time t become available after one game has finished, where the immediate reward is defined as R(s_t, a_t, s_{t+1}) = n when the agent receives n penalty points (n may be 0) between the tth and (t + 1)th plays; second, the 52-dimensional state vector s̄_t is converted to the compressed 36-dimensional input p̄_t according to the above feature extraction; and third, these two values, the input vector p̄_t and the corresponding scalar output R̄_t, are given to the NGnet for supervised learning; that is, the NGnet is updated so as to approximate the return according to the Monte Carlo RL method (Sutton & Barto, 1998). The discount factor γ is 1.0 in our application.
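The Monte Carlo targets can be assembled in a few lines; the sketch below assumes a game log of (state, action, next state) transitions with the penalty-point reward defined above, and a generic fit interface standing in for the NGnet's online EM update.

```python
def monte_carlo_targets(transitions, reward, encode_state, gamma=1.0):
    """Build (p_t, R_t) training pairs for the value function after one game.

    transitions  -- list of (s_t, a_t, s_{t+1}) for the agent's 13 turns
    reward       -- R(s, a, s'): penalty points incurred between turns
    encode_state -- 52-dim state -> 36-dim feature vector p_t
    """
    targets, ret = [], 0.0
    # Walk the game backward so each return accumulates future penalties.
    for s, a, s_next in reversed(transitions):
        ret = reward(s, a, s_next) + gamma * ret
        targets.append((encode_state(s), ret))
    targets.reverse()
    return targets

# Hypothetical usage with any regressor exposing fit(inputs, outputs):
# pairs = monte_carlo_targets(game_log, reward_fn, encode_state)
# value_net.fit([p for p, _ in pairs], [r for _, r in pairs])
```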
The property that the sequence of real states becomes available for learning is specific to many card games. Nevertheless, several POMDP algorithms would be applicable for learning the state value function in various partially observable environments (Hauskrecht, 2000), for example, by calculating the expectation of the value function with respect to the belief state. The NGnet showed the smallest TD error and achieved the fastest learning in our problem compared with other function approximators;³ this is partly attributable to the fact that it is a piecewise linear model with multiple gaussian units, and its learning is based on the online expectation-maximization (EM) algorithm, whose coordinate-ascent optimization is often faster than simple gradient methods.

The utility function F^i in equations 3.8 and 4.1 is represented by a multilayered perceptron (MLP) with feature extraction applied to its input and output. The MLPs are trained in four steps similar to those for the NGnet: first, the actual observation ō_t^i for agent i becomes available after one game has finished; second, the 52-dimensional observation vector ō_t^i is converted to the compressed 26-dimensional input q̄_t^i; third, the 26-dimensional output r̄_t^i is calculated from the actual action ā_t^i; and fourth, these two vectors, the input q̄_t^i and the output r̄_t^i, are given to the MLP; that is, it is trained by supervised learning based on the error backpropagation method. The output vector r̄_t^i consists of elements of 1 or 0; the target value of the dimension corresponding to the actually taken action ā_t^i is 1, and the others are 0. Although the parameters of the MLP are tuned so that the output takes nonnegative values (from 0 to 1) in each dimension, the utility function does not directly represent a probability. In other words, the summation over the possible actions of agent i is not always 1 because the set of possible actions A^j may depend on the state s_t. We therefore use the soft-max normalization in equation 4.1 after removing impossible actions. An MLP showed the best prediction accuracy in our problem compared with other function approximators; this is partly because it is a global nonlinear model and has adequate representation ability for this problem, whose input and output have high dimensionality.

To achieve effective learning, using feature extraction suited to the target problem and using function approximators are crucial. Sturtevant and White (2006), for example, examined features of the game hearts for avoiding ♠Q and heart cards and constructed a feature representation suited for playing the game. We used a similar idea; it enables the agent to capture the important information by taking advantage of the game's properties and to improve the learning speed and approximation ability of the function approximators.

3. We compared three function approximators in their approximation ability for the problem: the normalized gaussian network (NGnet), the multilayered perceptron (MLP), and radial basis functions (RBFs).
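As an illustration of the predictor's training data, the sketch below builds the one-hot target vector r̄_t^i for one observed opponent move; the index mapping from a primitive card to its compressed output dimension is a hypothetical helper.

```python
def predictor_training_pair(observation_52d, action_card, encode_obs,
                            output_index, n_outputs=26):
    """Build one (input, target) pair for an opponent's action predictor.

    encode_obs   -- 52-dim observation -> 26-dim compressed input q_t^i
    output_index -- maps a primitive card to its compressed output dimension
    """
    q = encode_obs(observation_52d)
    target = [0.0] * n_outputs
    target[output_index(action_card)] = 1.0  # 1 for the action actually taken
    return q, target
```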
5 Computer Simulations

We applied our RL method to the card game hearts,⁴ a well-defined example of a large-scale and multiagent problem with partial observability. To evaluate our method, we carried out computer simulations in which an agent trained by our RL method played against rule-based agents that have 66 general rules for playing cards from their hands.⁵ The performance of an agent can be evaluated by the acquired penalty ratio, the ratio of the penalty points acquired by the agent to the total penalty points of the four agents. If the four agents have equal strength, their penalty ratios average 0.25. The rule-based agent used in this study is stronger than the previous one (Ishii et al., 2005), owing to improved rules; comparisons between the previous rule-based agent and the current rule-based agent are summarized in Table 1. Although the previous rule-based agent was an "experienced"-level player, the current rule-based agent has almost the same strength as a human hearts player; when this rule-based agent challenged a human player, its acquired penalty ratio was 0.256 (see Table 1). The learning agent based on our previous RL method remained weaker than this rule-based agent even after 100,000 training games (data not shown).

Since the outcome of this game tends to depend on the initial card distribution (e.g., an expert player with a bad initial hand may be defeated by an unskilled player), we prepared a fixed data set for the evaluation; the data set is a collection of initial card distributions for 100 games, each generated randomly in advance. In the evaluation games, the initial cards were distributed according to this data set. Since performance is influenced by seat position (that is, an agent may have an advantage or disadvantage based on its seat position if the agents have different strengths), we rotated the agents' positions for each initial hand to eliminate this bias; each of the 100 evaluation games was repeated four times with the four types of seating position. The performance of each agent therefore is evaluated over 400 fixed and unbiased games. Note that learning of the agent was suspended during the evaluation games. Each learning run comprised several sets of 500 games, in which initial cards were distributed to the four agents at random and seat positions of the agents were determined randomly. In each learning run, accordingly, 400 evaluation games and 500 learning games were alternated.

4. In section 3, we assumed that there are M opponent agents intervening between the tth and (t + 1)th action turns of the learning agent. In the game hearts, however, if the leading players of the tth turn and of the (t + 1)th turn are different, the number of intervening agents is not M = 3. Nonetheless, the above explanation can easily be extended to such a case. The details of the game's rules used in our experiments are described in appendix A.

5. The observation process of the rule-based agents is deterministic, and our action predictors use this fact (see equation 2.6). Although the action selection process of the rule-based agents is also deterministic, our action predictors assume that it is probabilistic, and the stochastic process is approximated by the soft-max policy in equation 4.1.

Table 1: Comparison of the Previous Rule-Based Agent with the Current Rule-Based Agent.

  Opponent              Previous             Current
  Random                (a) 0.198 ± 0.002    (b) 0.181 ± 0.006
  Another rule-based    (c) 0.300 ± 0.000    (c) 0.200 ± 0.000
  Human player          (d) 0.275 ± 0.006    (e) 0.256 ± 0.005

Notes: We carried out experiments in five types of settings: (a) one random agent and three previous rule-based agents, (b) one random agent and three new rule-based agents, (c) two previous rule-based agents and two new rule-based agents, (d) one human player and three previous rule-based agents, and (e) one human player and three new rule-based agents. The random agent, which played cards from its hand at random, is a reference agent for the absolute evaluation of the agents' strength. The human player is the designer of both rule-based agents. Three runs for each of the five experiments were carried out with the same data set as in the figures: 400 evaluation games. The values in each cell represent the mean and standard deviation of the acquired penalty ratio over the three runs. In setting c, there is no variance because the previous and new rule-based agents play in a deterministic manner. Note that the acquired penalty ratios of the current rule-based agent are smaller than those of the previous one; the current rule-based agent is thus stronger than the previous one.

5.1 Single Agent Learning in a Stationary Environment. Figure 1 shows the result when the agent trained by our method challenged the three
rule-based agents. Each point and error bar represent the average and standard deviation of the penalty ratio, respectively, for the 400 evaluation games over 17 learning runs. The penalty ratio of the RL agent decreased as learning progressed, and after 5000 training games, the agent became significantly stronger than the rule-based agents. Since the agent surpassed the rule-based agents after only several thousand training games, the new RL method based on sampling showed a dramatic improvement over the previous one in both learning speed and strength; our previous RL agent required about 20 times as many training games before learning converged. Although the three rule-based agents have the same rules, there is a distinct difference in their performances. This comes from the fact that the relative seat positions were not changed (even with the rotation) during the evaluation games; the rule-based agent that showed the worst performance was always opposite the RL agent, and the agent that showed the best performance was always at its left side. Although the evaluation is unbiased, the agents' strengths are biased (the RL agent is weaker than the rule-based agents before learning but stronger after 5000 training games), so the learning curves diverged from each other.

Figure 1: Computer simulation result in an environment consisting of one learning agent trained by our RL method and three rule-based agents. (Top) P-values of the t-test where the null hypothesis is "the RL agent has the same strength as the rule-based agents" and the alternative hypothesis is "the RL agent is stronger than the rule-based agents." The test was done independently at each point on the abscissa. The horizontal line denotes the significance level of 1%. After 5000 training games, the RL agent was significantly stronger (p < 0.01) than the rule-based agents. (Bottom) The abscissa denotes the number of training games, and the ordinate denotes the penalty ratio acquired by each agent. We executed 17 learning runs, each consisting of 5500 training games. Each point and error bar represent the average and standard deviation, respectively, for the 400 evaluation games over the 17 runs. The discount factor γ in equation 3.2 was 1.0, the constant T^i in equation 4.1 was 1.0, and the numbers of samples in equation 3.9 were N = 80 and K = 20. These parameters were determined heuristically.

Figure 2 shows frequency distributions of penalty points for the RL agent and the rule-based agent in the same experiment as Figure 1. After 5500 training games, the penalty points incurred by the rule-based agent are mainly distributed over higher values than those of the RL agent. The frequencies with which the RL agent incurred 4 and 5 penalty points, by contrast, markedly increased after the training games. This result suggests that the agent learned a policy that avoids large penalties by instead accepting relatively small ones. The figure thus supports the previous observation: the RL agent could acquire a good policy by interacting with the environment. We obtained results similar to those shown in Figures 1 and 2 with another evaluation data set (data not shown).

Figure 2: Frequency distributions of penalty points obtained by the RL agent and the rule-based agent, before learning and after 5500 training games. (Left) The abscissa denotes the penalty point number, and the ordinate denotes the frequencies of penalty points acquired by the RL agent. The vertical dashed line represents the boundary between incurring fewer than 23 and 23 or more penalty points. The total frequencies at the right side of the boundary are 117 and 80 before learning and after 5500 training games, respectively. (Right) The representation is the same as in the left panel, except that the ordinate denotes the frequencies of penalty points acquired by the rule-based agent. The total frequencies at the right side of the boundary are 99 and 124 before learning and after 5500 training games, respectively.

When the RL agent challenged the three rule-based agents used in our previous work (Ishii et al., 2005), it showed improved performance from the beginning of learning and finally became better than the rule-based agents; this is why we developed the new rule-based agent, which is stronger than the previous one. The improvement by our new RL method is attributed to the following two facts. First, the ability to approximate the utility function in equation 3.2 was improved by replacing the analog approximation method with a discrete, sampling-based one. In our previous work, to calculate the utility function, we applied a mean-field-like analog approximation to the problem, whose state space is discrete, by changing the order of summations; we calculated the summation over the current state with the approximation before calculating the summation over the next state. In this study, in contrast, the summations are calculated in a straightforward way with the sampling-based approximation; each sampled state represents a discrete state, and therefore our new approximation
is more suitable than the previous one. This enables us to calculate the expectation with a higher accuracy. Second, the expected future reward could be evaluated more accurately by the state value function. In our previous work, we made the value function learn over the observation space. Although this is an effective method for large-scale POMDP problems, it is difficult to obtain an accurate value due to the perceptual aliasing property in partially observable environments; the same observation may come from different states whose state values should be different, but the value function defined on the observation space cannot detect the difference between the values. In this study, in contrast, the learning agent predicts possible next states and evaluates a value from the value function defined on the state space. This enables the agent to perform accurate value prediction.
The ideas used in our previous work were adequate for an environment of moderate complexity, constituted by the previous rule-based agents, but the limitations of that method precluded a more remarkable result. In this study, we have improved the old model so that the method works well within only several thousand training games, even in the harder environment constituted by the stronger rule-based agents.

5.2 Multiagent Learning in Dynamic Environments. In the experiment described above, our RL method was applied to the problem under the POMDP assumption that there is only one learning agent in a stationary environment. In the following, we apply our method directly to multiagent environments in which there are multiple learning agents, so that the environment becomes dynamic.

Figure 3 (left) shows the result when one agent trained by our RL method, one agent trained by the REINFORCE algorithm (Williams, 1992), a policy-gradient-based RL method, and two rule-based agents played against each other. Note that our RL agent sat at the left side of the REINFORCE agent. The REINFORCE agent learned an action selection probability P(a_t | o_t) with a feature extraction technique applied to its 52-dimensional input and output; an observation o_t was converted to a 25-dimensional input whose representation of each dimension was the same as that in our previous work, and its output was converted to a 26-dimensional output whose representation was the same as that of the action predictor in our current study (see section 4). Since these feature extraction processes are similar to those in our RL method, which take advantage of the game's properties, we could evaluate the agents' performance under comparable conditions. As a result, the penalty ratio of the REINFORCE agent did not decrease, and its performance remained much worse than that of the other agents, whereas the agent trained by our RL method showed better performance than the rule-based agents after 3500 training games in this multiagent setting.

Figure 3 (right) shows the result when one agent trained by our RL method, one agent based on the actor-critic algorithm (Barto, Sutton, & Anderson, 1983), a well-known generalized policy iteration RL method, and two rule-based agents played against each other. Again, our RL agent sat at the left side of the actor-critic agent. The critic module of the actor-critic agent learned the value function defined on the observation space, V(o_t), with the same feature extraction technique applied to its input as in the previous experiment (see Figure 3, left). The actor module also used the same technique applied to its input and output as that of the action predictor in our current study; the actor determines its action a_t based on a utility function whose input is a 26-dimensional observation and whose output is a 26-dimensional utility value. The penalty ratio of the actor-critic agent did not decrease, as in the previous experiment, whereas the agent trained by our RL method became significantly stronger than the rule-based agents after 3500 training games.
Figure 3: Computer simulation result in an environment consisting of one learning agent trained by our RL method, one learning agent trained by another algorithm, and two rule-based agents. (Left panels) Computer simulation result in an environment consisting of one learning agent trained by our RL method, one learning agent trained by the REINFORCE algorithm, and two rule-based agents. (Right panels) Computer simulation result in an environment consisting of one learning agent trained by our RL method, one learning agent trained by the actor-critic algorithm, and two rule-based agents. (Upper panels) P-values of the t-test where the null and alternative hypotheses are the same as in the previous experiment (see Figure 1). After 3500 training games, the RL agent became significantly stronger than the rule-based agents, but the agent trained by another algorithm did not, in both experiments (P-values are not shown because they remained around 1). (Lower panels) The abscissa and the ordinate denote the same as in Figure 1, but the scale of the ordinate is larger here. We executed 15 learning runs, each consisting of 4000 training games in both experiments. The parameter values and other experimental setups were also the same.
These results are attributed to two points. The first is the disadvantage of learning over the observation space. As described above, it is difficult to obtain good performance without solving ambiguity in partially observable problems, even with effective feature extraction. The second is the limitation of model-free approaches: those such as the REINFORCE and actor-critic methods are easy to apply to various problems, including POMDPs, but it is in fact hard to achieve a good result in complex multiagent environments
with highly restricted observations. Our experimental results show that our model-based RL method with sampling-based state estimation could overcome such difficulties and thus achieve better performance than conventional RL methods.

It would be impractical to apply other existing RL methods to our problem. For example, using the least squares temporal difference algorithm (Bradtke & Barto, 1996) for hearts would be infeasible, because it is very difficult to obtain a least-squares solution in a large state space given the inevitable sparseness of learning samples. Using belief-state POMDP methods is also difficult, even with an appropriate approximation (Hauskrecht, 2000), because learning over the belief space is computationally heavy in large-scale problems, as discussed in section 2. Our method, in contrast, can solve such problems.

Figure 4 shows the result when one agent trained by our RL method, one agent trained by our previous RL method, and two rule-based agents played against each other. As before, the new RL agent sat at the left side of the previous RL agent. The penalty ratio of the new RL agent decreased as learning progressed, and after 3500 training games, the agent showed better performance than the rule-based agents. The averaged penalty ratio of the previous RL agent, on the other hand, did not decrease, and it remained much weaker than the other agents. This result shows that our new agent can acquire a better policy than the previous one through a direct match.

Figure 5 shows the result when two RL agents trained by our method and two rule-based agents played against each other. The RL agent indicated by the dashed line sat at the left side of the other RL agent, indicated by the solid line. The penalty ratios of both RL agents decreased as learning progressed, and after 5000 training games, both agents came to acquire a smaller penalty ratio than the rule-based agents. The setting of this experiment is more challenging than the previous experiments (see Figures 3 and 4), because the learning speed of an agent trained by the new RL method is much faster than that of an agent trained by other methods; in other words, the environmental dynamics change more rapidly. Even with this difficult
multiagent setting, the RL agents could adapt to the change and showed good performance. This ability is attributed to the fast learning that occurs when three action predictors are used.

Figure 4: Computer simulation result in an environment consisting of one learning agent trained by our RL method, one learning agent trained by our previous method, and two rule-based agents. (Upper panel) P-values of the t-test where the null and alternative hypotheses are the same as in the previous experiment (see Figure 1). After 3500 training games, the RL agent became significantly stronger than the rule-based agents, but the previous RL agent did not (P-values are not shown because they remained around 1). (Lower panel) The abscissa and the ordinate denote the same as in Figure 1, but the scale of the ordinate is larger here. We executed 16 learning runs, each consisting of 4000 training games. The parameter values and other experimental setups were also the same.

Figure 5: Computer simulation result in an environment consisting of two learning agents trained by our RL method and two rule-based agents. (Upper panel) P-values of the t-test where the null and alternative hypotheses are the same as in the previous experiment (see Figure 1). After 5000 training games, the two RL agents were significantly stronger than the rule-based agents. (Lower panel) The abscissa and the ordinate denote the same as in Figure 1. We executed 18 learning runs, each consisting of 5500 training games. The parameter values and other experimental setups were also the same.

5.3 Validation Matches Against Human Player and Commercial Software. To validate the general strength of the learning agent, we carried out evaluation games with the human player, the same person evaluated in Table 1. Figure 6 shows the result when one RL agent trained by our method, two rule-based agents, and the human player played together, with the RL agent sitting next to the human player. We used another evaluation data set for 25 games with seat rotation (that is, 100 games), and the learning run was executed twice. Each point of Figure 6 therefore represents a mean over 200 evaluation games.⁶ One hundred evaluation games in both learning runs were done before learning and after 1000, 2000, 3000, 4000, and 5000 training games, and these training games were carried out with three rule-based agents; one rule-based agent was then replaced by the human player in each evaluation phase. This may cause the action predictor to deviate from the human player's action selection model because of the difference between the strategy estimated by the action predictor in the training phase and the strategy of the human player in the evaluation phase. The action predictor nonetheless could have come to predict standard card play, by its generalization ability, from playing against the rule-based agents with general strategies, and such standard card prediction worked well in play against the human player. Figure 6 shows that the RL agent successfully acquired a good strategy, comparable to or slightly better than that of the human player.

Figure 6: Computer simulation result when one learning agent trained by our RL method, one human player, and two rule-based agents played together. One hundred evaluation games were done before learning and after 1000, 2000, 3000, 4000, and 5000 training games (with three rule-based agents). We repeated the training and evaluation runs twice. The abscissa and the ordinate denote the same as in Figure 1. The parameter values were also the same. Each point denotes the average of 200 (2 × 100) evaluation games.

To examine the general strength of the learning agent in a fair manner, we carried out a validation match using commercial hearts software (Freeverse Software, 2004); this software was once used to demonstrate the hearts agent developed by Sturtevant (2003), the strongest hearts program in the field of artificial intelligence. When three agents of this commercial software played against the random agent and the same human player as in Table 1 and Figure 6, the averaged penalty ratios of each agent over 200 games were 0.187 ± 0.009 and 0.259 ± 0.010, respectively; the strength of the commercial agent is thus almost the same as that of our rule-based agent used in this study (see the results of settings b and e in Table 1).⁷

6. To prevent the human player from remembering the fixed card distributions, the order of the evaluation games was shuffled in each evaluation on the abscissa.

7. Since the interface of our simulation program is different from that of the commercial software, the evaluation games were carried out by manually inputting the cards played by the commercial agents into our program; the strengths were evaluated under the initiative of the commercial software. Because, for this reason, the initial card distribution was not fixed but random, an exact comparison of the penalty ratios with those in Table 1 may be difficult. These results, however, show that there is only a small difference in strength between the current rule-based agent and the commercial agent.

The RL agent played 100 games
against the commercial software agents after learning through 5000 training games with our rule-based agents, and we repeated this experiment twice; that is, the performance of the RL agent was evaluated over 200 evaluation games. The averaged penalty ratio of the RL agent was 0.239 ± 0.012 in the evaluation games, stronger than the commercial agent. Although this result is comparable to that of the previous study (Sturtevant, 2003), in which the agent did not learn but used a hand-tuned evaluation function with a pruning technique as an expert system, the aims and contributions of our study are different from this previous work. They are summarized in the following two points: the first is to develop a reinforcement learning algorithm applicable to large-scale multiagent environments with partial observability; the second is to apply our method to the card game hearts, which exemplifies such an environment, and to demonstrate that the agent trained by our RL algorithm attains performance comparable to that of the human player.

6 Discussion

Our RL formulation provides a general solution for partially observable games that can be solved by sequential decision making based on the estimation of unobservable states and the prediction of the unknown environmental dynamics; for example, it could be applied directly to other trick-taking card games such as whist, contract bridge, gin rummy, and napoleon. To apply our method to other games, however, there are two noteworthy points: first, the feature extraction described in section 4 should be modified according to the properties of the target problem so that effective learning can be achieved; and second, additional action predictors should sometimes be prepared because some games include cooperative relations among the players as well as competitive relations. Although it is difficult to apply our method directly to other games like strategic computer games, because of their larger state spaces and more complicated relations than those of card games, our basic formulation and ideas remain applicable; such a game could be dealt with by our approach if the scale of the problem is appropriately reduced. Our method thus allows the agent to act and learn in various large-scale and multiagent problems with partial observability. Hearts is an example application with the essential properties and is thus suited to the evaluation of our method. When the method is applied to other applications, many small details may have to be changed for the specific problem, but our essential ideas carry over. Note, however, that we have used several properties specific to games; for example, since the state transition P(s_t^{i+1} | s_t^i, a_t^i) and the observation process P(o_t^i | s_t^i) are deterministic, the prediction of the opponent agents' actions can be simplified as in equation 3.6. For applying our method to more general POMDP problems, such deterministic properties should be relaxed. Even in probabilistic situations, our sampling-based method with the simplified belief calculation would work well; we plan to apply our method to probabilistic problems in our future work.
Thrun (2000) proposed the Monte Carlo POMDP; it estimated unobservable states using the Monte Carlo method, which is similar to our method because all summations were approximated by a plausible number of samples in a partially observable problem. The problem setting, however, was classified as case I in section 2: the environmental model was given to the agent, and the state space was reasonably sized, so the simple Monte Carlo integration worked well. If an adequate number of samples could be obtained from the given distribution, the summations were approximated with high accuracy, and it was possible to deal with the propagation of belief states. In contrast, the problem of the game hearts is classified as case II: the environmental model is unknown, and it is necessary to approximate the integration process with some devices. This letter shows that a sampling technique with model identification works well for a large-scale problem whose state space is discrete. We have not maintained complete beliefs throughout the game process but have calculated a one-step belief, because a good incremental approximation of the belief is not easy with restricted computer resources for problems whose observation process is deterministic.

The dynamics of the game hearts can be represented by products of opponent agents' action probabilities, as in equation 3.7. In our method, the policies of opponent agents are estimated by the corresponding action predictors, and the utility function in equation 3.2, which is necessary for action selection by the agent, is calculated based on these predictors. Since such a model-based approach is effective in unknown and partially observable environments, many practical methods have been proposed (Chrisman, 1992; Whitehead & Lin, 1995; Nikovski & Nourbakhsh, 2000; Yoshimoto et al., 2003). A model-based approach may, however, make the problem more difficult and complicated than model-free approaches, for two reasons: first, the computational cost of learning the model together with the estimation process is expensive in general problems; and second, if learning of the model fails, it may work against the estimation and policy acquisition processes, and vice versa, because they rely on each other. In our method, on the other hand, learning of the model is independent of the estimation and of learning of the value function; each predictor is trained on available information, given at the end of each game, by the supervised learning method. Learning of the model therefore always proceeds well, and policy learning accelerates as prediction improves. Predictors are prepared for each opponent agent and are learned independently. When one opponent agent changes its policy, it is enough to make the corresponding predictor adapt to the change. Since this enables the agent to adapt quickly, our RL method becomes applicable to complex multiagent settings where the Markov property fails.⁸
8. To evaluate the adaptability of the action predictor in a dynamic environment, we carried out an additional experiment; this is shown in appendix B.
Our RL method is based on the Monte Carlo RL method (Sutton & Barto, 1998) for learning the value function. When we carried out the same experiment as shown in Figure 1 without learning of the value function, the policy of the agent was not improved (data not shown). The value function therefore should be learned properly for a large state space. In general, however, it is difficult to learn the value function effectively for a large-scale, high-dimensional problem like hearts, even with function approximators, because the number of parameters increases exponentially; this is known as the curse of dimensionality. If a large problem can be reduced by dividing it into multiple subproblems with a hierarchical structure (Barto & Mahadevan, 2003), the agent will be able to learn more effectively. We plan to explore this reductive technique in our future work.

7 Conclusion

In this study, we developed a new model-based RL scheme for large-scale multiagent games with partial observability. Decision making in such environments requires estimating unobservable states and predicting opponents' actions. The computational cost of such processes, however, increases exponentially with the scale of the environment, so it is infeasible to estimate all possible current states and predict all possible next states in realistic problems. We therefore developed three devices to avoid exhaustive state enumeration for estimation and prediction: the sampling technique, the likelihood estimation based on a one-step history, and the action predictors. These ideas have enabled us to approximate the heavy integrations effectively from a plausible number of samples and have allowed the agent to select a preferable action according to the action selection model of the opponent agents. For learning of the value function and the environmental dynamics, we used function approximation methods with feature extraction. We finally applied our method to a real card game, hearts, a typical example of such difficult environments. Computer simulations showed that our model-based RL method is effective for acquiring a good strategy in this realistic, partially observable problem.

Appendix A: Hearts

The game of hearts is played by four players and uses an ordinary 52-card deck. There are four suits: spades (♠), hearts (♥), diamonds (♦), and clubs (♣). Within each suit there is an order of strength, decreasing from A, K, Q, . . . , 2; there is no strength order among the suits. Cards are distributed to the four players so that each has 13 cards at the beginning of a game. After that, according to the rules below, each player plays a card in clockwise order. When each of the four players has played a card, it is called a trick; each player plays a card once in a trick. The first card played in a trick is called the leading card, and the player who plays this card is called
the leading player. A single game ends when 13 tricks have been carried out.
- In the first trick, ♣2 is the leading card, so the player holding this card is the leading player.
- Except for the first trick, the winner of the current trick becomes the leading player of the next trick.
- Each player must play a card of the same suit as the leading card.
- If a player does not have a card of the same suit as the leading card, he or she can play any card. When a heart is played in this way for the first time in a game, the play is called breaking hearts.
- Until breaking hearts occurs, the leading player may not play a heart. If the leading player has only hearts, it is an exceptional case, and the player may lead with a heart.
- After a trick, the player who has played the strongest card of the same suit as the leading card is the winner of that trick.
- Each heart equals a one-point penalty, and ♠Q equals a 13-point penalty, so the total number of penalty points is 26. The winner of a trick receives all of the penalty points of the cards played in the trick.
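As an illustration of these rules, a minimal legal-move filter might look as follows; this is a hedged sketch with the card representation assumed earlier, and it omits the special ♣2 constraint of the first trick.

```python
def legal_cards(hand, leading_suit, hearts_broken):
    """Return the cards a player may legally play under the rules above.

    hand         -- list of (suit, rank) pairs
    leading_suit -- suit of the leading card, or None if this player leads
    """
    if leading_suit is None:  # this player leads the trick
        non_hearts = [c for c in hand if c[0] != 'heart']
        # Until breaking hearts, a heart may lead only if nothing else remains.
        return hand if hearts_broken or not non_hearts else non_hearts
    follow = [c for c in hand if c[0] == leading_suit]
    return follow if follow else hand  # void in the suit: any card is legal
```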
According to the rules above, a single game is played, and at the end of a game, the penalty points of each player are determined as the sum of the received points. The lower the points, the better. This game therefore represents a competitive situation because each player has to avoid penalty points by pushing them onto the opponents. When a player takes all the heart cards and ♠Q, this is known as shooting the moon; the player receives no penalty points, and the other players each incur 26 points. For simplicity, this rule is removed from our setting.

Appendix B: Prediction Accuracy

To examine the adaptability of the action predictors in a dynamic environment, we calculated the Kullback-Leibler (KL) divergence between the real empirical action probability, P(a_t^i | o_t^i, φ^i), and the action probability approximated by the agent, P(a_t^i | o_t^i, φ̂^i), while changing the environment every 1000 training games. We carried out the experiment using four types of rule-based agents; their acquired penalty ratios were (A) 0.206, (B) 0.193, (C) 0.186, and (D) 0.180 when they played against the random agent. This experiment used four types of evaluation data sets, which are collections of actions taken by the corresponding rule-based agent, each generated according to the agent's own rules over 1000 games in advance. We switched the rule-based agents in alphabetical order (A, B, C, and D) after every 1000 training games. The opponent agents gradually became stronger, and the performance of the predictors was evaluated based on the data set from the corresponding agent.
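For reference, a KL-divergence measurement of prediction accuracy along these lines could be computed as in this hedged sketch; the aggregation over evaluation positions is our assumption, and the predictor probabilities are assumed positive, as a soft-max guarantees.

```python
import math

def prediction_kl(eval_positions, real_policy, predictor):
    """Total KL divergence between real and predicted action distributions.

    eval_positions -- observations collected from the evaluation data set
    real_policy    -- o -> dict of empirical action probabilities P(a | o, phi)
    predictor      -- o -> dict of predicted probabilities P(a | o, phi_hat)
    """
    kl = 0.0
    for obs in eval_positions:
        p, q = real_policy(obs), predictor(obs)
        kl += sum(p[a] * math.log(p[a] / q[a]) for a in p if p[a] > 0)
    return kl
```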
Reinforcement Learning for Card Games
3083
4
3.92
x 10
A
B
C
D
3.9
KL divergence
3.88
3.86
3.84
3.82
3.8
3.78
0
500
1000
1500
2000
2500
3000
3500
4000
number of games Figure 7: The Kullback-Leibler (KL) divergence between the real empirical action probability, P(a ti |o ti , φ i ), and the action probability approximated by the agent, P(a ti |o ti , φˆ i ). The abscissa denotes the number of training games and the ordinate denotes the KL divergence. We executed 10 learning runs, each consisting of 4000 training games. The constant T i in equation 4.1 was set at 1.0.
the performance of the predictors was evaluated based on the data set from the corresponding agent. In this experiment, accordingly, 1000 evaluation games and 100 training games were alternated, and three rule-based agents were replaced every 1000 training games. Each point of Figure 7 represents an average for 1000 such evaluation games over 10 learning runs. Although the performance deteriorated slightly just after every switching of the
rule-based agents, the action predictor adapted to the changes quickly, and its performance improved steadily as learning progressed, even in this dynamic environment. This result implies that our action predictor adapts quickly enough for the POMDP assumption to remain approximately valid even in dynamic environments constituted by multiple learning agents.
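As a rough sketch, the divergence reported in Figure 7 can be computed as follows. This is a minimal illustration; p_true and p_pred stand for the empirical and predicted action distributions over the legal actions and are not names from the paper.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete action distributions (arrays summing to 1)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

# Averaged over the observation-action statistics of one evaluation set:
# mean_kl = np.mean([kl_divergence(p_true[o], p_pred[o]) for o in observations])
```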
Received May 19, 2006; accepted December 15, 2006.
LETTER
Communicated by Edward Snelson
Clustering Based on Gaussian Processes

Hyun-Chul Kim
[email protected]
Department of Computer Science, Yonsei University, 134 Shinchondong, Sudaimunku, Seoul 120-749, Korea
Jaewook Lee
[email protected]
Department of Industrial and Management Engineering, Pohang University of Science and Technology, Pohang, Kyungbuk 797-784, Korea
In this letter, we develop a gaussian process model for clustering. The variances of predictive values in gaussian processes learned from training data are shown to comprise an estimate of the support of a probability density function. The constructed variance function is then applied to construct a set of contours that enclose the data points, which correspond to cluster boundaries. To perform clustering of the data points, an associated dynamical system is built, and its topological invariance property is investigated. The experimental results show that the proposed method works successfully for clustering problems with arbitrary shapes.

Neural Computation 19, 3088–3107 (2007)
© 2007 Massachusetts Institute of Technology

1 Introduction

Gaussian process models are Bayesian kernel machines that have been successfully applied to regression and classification (Rasmussen & Williams, 2006). Gaussian processes for regression (GPR) model the density of the target values as a multivariate gaussian density. The covariance between the targets at two different points is usually controlled by a small set of hyperparameters that capture interpretable properties of the function, such as the length scale of autocorrelation, the overall scale of the function, and the amount of noise. GPR has advantages over other nonparametric regression algorithms in its ability to provide a principled hyperparameter selection method and a full distribution over the predicted target value (Williams & Rasmussen, 1995; Neal, 1998; Gibbs & MacKay, 1997), and it has been shown to be an excellent method for nonlinear regression (Rasmussen, 1996). Gaussian processes for classification employ a latent function whose sign determines the class label, rather than modeling the labels directly with a multivariate gaussian density, and integrate over both the hyperparameters and the latent values of the function at the data points. It has been reported in a number of works (Williams & Barber, 1998; Neal, 1998; Gibbs & MacKay,
2000; Opper & Winther, 2000; Minka, 2001; Csató, 2002) that they can be effectively applied to some difficult classification problems. So far, most work related to gaussian processes has concerned applications to supervised learning tasks such as regression and classification, although there are some approaches that apply gaussian process (GP) models to unsupervised learning tasks such as density estimation (Schmidt, 2000; Csató, 2002; Kim & Lee, 2006). Clustering, to which GP models have not yet been rigorously developed and applied, is also one of the important unsupervised learning tasks. Clustering is the task of grouping the input patterns naturally (Duda, Hart, & Stork, 2001; Jain, Murty, & Flynn, 1999). Clustering techniques are useful in many exploratory pattern analysis, grouping, decision-making, and machine learning situations, including data mining, document retrieval, image segmentation, and pattern classification.

In this letter, we make a rigorous development of a gaussian process model for clustering. To this end, we start with the observation that the variances in GPR capture regions of input space where the probability density is in some sense large, and we establish that the variance function in GPR represents the support of a probability density, with a generalization error bound. We then take the maximum of the variance function values at the data points as the level value, which is shown to make up the cluster boundaries. A dynamical system associated with the variance function is built and applied to facilitate the cluster labeling of the data points, as well as to partition the input space into several clustered regions. We verify the performance of the proposed method through simulation.

The organization of this letter is as follows. In section 2, we review gaussian processes for regression (GPR). In section 3, we show that the variance function of GPR learned from data estimates the support of an unknown probability density. In section 4, we present a clustering method utilizing the constructed variance function, and we show the simulation results in section 5.

2 Review of Gaussian Processes for Regression

In this section, we briefly review gaussian processes for regression (O'Hagan, 1978; Williams & Rasmussen, 1995; Gibbs & MacKay, 1997; Neal, 1998). Suppose that we have a data set D of data points $x_i \in \mathbb{R}^d$ with continuous target values $t_i$: $D = \{(x_i, t_i) \mid i = 1, 2, \ldots, N\}$, $X = \{x_i \mid i = 1, 2, \ldots, N\}$, $T = \{t_i \mid i = 1, 2, \ldots, N\}$. Given this data set, we wish to find the predictive distribution of the target value $\tilde{t}$ for a new data point $\tilde{x}$. Let $f(\cdot)$ denote the target function. We put $\mathbf{f} (= \mathbf{f}_N) = [f_1, f_2, \ldots, f_N] = [f(x_1), f(x_2), \ldots, f(x_N)]$, and put $\mathbf{f}_{N+1} = [f_1, f_2, \ldots, f_N, f_{N+1}] = [f(x_1), f(x_2), \ldots, f(x_N), f(x_{N+1})]$, where $x_{N+1} = \tilde{x}$. Gaussian processes for regression assume that the target function has a gaussian process prior. This means that the density of any collection of target function values is modeled as a multivariate gaussian density. Writing the
dependence of $\mathbf{f}$ on X implicitly, the GP prior over functions can be written as

$$p(\mathbf{f} \mid X, \Theta) = \frac{1}{(2\pi)^{N/2} |C|^{1/2}} \exp\left( -\frac{1}{2} (\mathbf{f} - \mu)^{\top} C^{-1} (\mathbf{f} - \mu) \right), \qquad (2.1)$$

where the mean µ is usually assumed to be zero, µ = 0, and each term of the covariance matrix, $C_{ij}$, is a parameterized function of $x_i$ and $x_j$ with hyperparameters Θ, that is, $C_{ij} = C(x_i, x_j; \Theta)$. Since $\mathbf{f}$ and $\mathbf{f}_{N+1}$ have a gaussian density, we get the following predictive density of $f_{N+1}$ as a conditional gaussian density:

$$p(f_{N+1} \mid D, x_{N+1}, \Theta) = \frac{p(\mathbf{f}_{N+1} \mid \Theta, x_{N+1}, X)}{p(\mathbf{f}_N \mid \Theta, X)} \qquad (2.2)$$

$$= \frac{Z_N}{Z_{N+1}} \exp\left( -\frac{1}{2} \left( \mathbf{f}_{N+1}^{\top} C_{N+1}^{-1} \mathbf{f}_{N+1} - \mathbf{f}_N^{\top} C_N^{-1} \mathbf{f}_N \right) \right), \qquad (2.3)$$

where $Z_N$ and $Z_{N+1}$ are appropriate normalization constants of $p(\mathbf{f}_N \mid \Theta, X)$ and $p(\mathbf{f}_{N+1} \mid \Theta, x_{N+1}, X)$, respectively. The relationship between $C_N$ and $C_{N+1}$ is as follows,

$$C_{N+1} = \begin{pmatrix} C_N & \mathbf{k} \\ \mathbf{k}^{\top} & \kappa \end{pmatrix}, \qquad (2.4)$$

where

$$\mathbf{k} = [C(\tilde{x}, x_1; \Theta), C(\tilde{x}, x_2; \Theta), \ldots, C(\tilde{x}, x_N; \Theta)]^{\top} \quad \text{and} \quad \kappa = C(\tilde{x}, \tilde{x}; \Theta). \qquad (2.5)$$

By rearranging the terms in equation 2.3, we get a gaussian density of $f_{N+1} (= \tilde{f})$ for a new data point $x_{N+1} (= \tilde{x})$ given by

$$p(f_{N+1} \mid D, \Theta, x_{N+1}) = \mathcal{N}\left( \mathbf{k}^{\top} C^{-1} \mathbf{t}, \; \kappa - \mathbf{k}^{\top} C^{-1} \mathbf{k} \right). \qquad (2.6)$$
The covariance function is the kernel, which defines how data points generalize to nearby data points. The commonly used covariance function, adopted in this letter, is

$$C(x_i, x_j; \Theta) = v_0 \exp\left( -\frac{1}{2} \sum_{m=1}^{d} l_m \left( x_i^m - x_j^m \right)^2 \right) + v_1 + \delta_{ij} v_2. \qquad (2.7)$$

The covariance function is parameterized by the hyperparameters $\Theta = \{l_m \mid m = 1, \ldots, d\} \cup \{v_0, v_1, v_2\}$, which we can learn from the data. The covariance between the targets at two different points is a decreasing function
[Figure 1: data points with the predictive mean µ and the bands µ − σ and µ + σ.]

Figure 1: Examples of gaussian process regression. The variance of the predictive value is related to the density.
of their distance in input space. This decreasing function is controlled by a small set of hyperparameters that capture interpretable properties of the function, such as the length scale of autocorrelation, the overall scale of the function, and the amount of noise. The posterior distributions of the hyperparameters given the data can be inferred in a Bayesian way via Markov chain Monte Carlo (MCMC) methods (Williams & Rasmussen, 1995; Neal, 1998), or the hyperparameters can be selected by maximizing the marginal likelihood (the evidence) (Gibbs & MacKay, 1997). A Bayesian treatment of multilayer perceptrons for regression has been shown to converge to a gaussian process as the number of hidden nodes approaches infinity, provided the priors on input-to-hidden weights and hidden-unit biases are independent and identically distributed (Neal, 1996). Empirically, gaussian processes have been shown to be an excellent method for nonlinear regression (Rasmussen, 1996).

3 Support Estimate from Gaussian Processes

One key observation about gaussian process regression (GPR), as illustrated in Figure 1, is that the variances of predictive values are smaller in denser areas and larger in sparse areas (Rasmussen & Williams, 2006;
Kim & Lee, 2006). This means that the variances in GPR capture regions of input space where the probability density is in some sense large, motivating us to consider the variances as a good estimate of the support of a probability density function. The support of a function $f: \mathbb{R}^d \to Y$ is mathematically defined as the closure of the set of arguments for which $f$ is not zero. The support of a probability density function is often applied to clustering problems, since each separate support domain can be considered a cluster. Clustering problems can therefore be solved, in some sense, by estimating the support of the probability distribution generating the given unlabeled data. This key observation can be explained using the following property of GPR: the variance function of the predictive distribution of GPR in equation 2.6, given by

$$\sigma^2(x) = \kappa - \mathbf{k}^{\top} C^{-1} \mathbf{k}, \qquad (3.1)$$
is nondegenerate; that is, the variance becomes large (small) in a sparse (dense) region, since the second term on the right-hand side becomes small in a sparse region of input space wherein points lie far away from the sample data points. Another good property is that the variance function in equation 3.1 does not depend on the target values, t. This feature enables us to apply GPR to unlabeled sample data, $X = \{x_i \mid i = 1, 2, \ldots, N\}$, by taking all target values to be zero with noise allowance, that is, $t_i = 0$ for all $i = 1, \ldots, N$, without changing the variance behavior. With this machinery, the sample functions from the GP posterior fluctuate around zero. In a denser area, the degree of fluctuation of the sample functions is low with high probability, because the sample function values should be smooth and close to the target values (i.e., zero) near the data points. So $E[f(x)^2]$ should be closer to zero in a denser area and bigger in a less dense area, which is exactly the variance function $\sigma^2(x)$; that is,

$$E[f(x)^2] = V[f(x)] + \{E[f(x)]\}^2 \qquad (3.2)$$

$$= \sigma^2(x). \qquad (3.3)$$
Figure 2 shows the variance of the sampled functions given 1D data points sampled from a bimodal distribution. Intuitively, we see that there are two separate support domains in the denser areas, where each domain is in some sense related to a cluster or a subcluster. The variances clearly show that the data points are sampled from a bimodal distribution.
Figure 2: Variance of $f(\cdot)$ for the 1D data set of 500 points sampled from $\mathcal{N}(-5, 1)$ and 50 points sampled from $\mathcal{N}(5, 1)$.
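The following sketch illustrates the computation of the variance function on such 1D bimodal data (as in Figure 2), using the covariance of equation 2.7. The hyperparameter values here are ours, chosen only for illustration on unscaled data; the settings reported in section 5 apply to data rescaled to [0, 1].

```python
import numpy as np

def variance_function(X, X_query, l=1.0, v0=1.0, v1=10.0, v2=0.01):
    """sigma^2(x) = kappa - k^T C^{-1} k (equation 3.1); note that the
    targets never enter, so unlabeled data can be used directly."""
    def cov(A, B):
        d2 = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
        return v0 * np.exp(-0.5 * l * d2) + v1
    C = cov(X, X) + v2 * np.eye(len(X))   # delta_ij * v2 noise term on the diagonal
    K = cov(X, X_query)                   # column q holds the vector k for query q
    kappa = v0 + v1 + v2                  # C(x, x; Theta) at the query point
    return kappa - np.sum(K * np.linalg.solve(C, K), axis=0)

# 1D bimodal sample as in Figure 2:
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-5, 1, 500), rng.normal(5, 1, 50)])[:, None]
sigma2 = variance_function(X, np.linspace(-10, 15, 200)[:, None])
```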
To be more rigorous, we next give a theoretical result showing that the variance function can indeed represent the support of a probability density, with a generalization error bound. We begin with the definition of $D(X, f, \theta)$, where $X \equiv (x_1, \ldots, x_l)$ (Schölkopf & Smola, 2002).

Definition. Let $f$ be a real-valued function on a space $\mathcal{X}$. Fix $\theta \in \mathbb{R}$. For $x \in \mathcal{X}$, let $d(x, f, \theta) = \max\{0, \theta - f(x)\}$. Similarly, for a training sequence $X \equiv (x_1, \ldots, x_l)$, we define

$$D(X, f, \theta) = \sum_{x \in X} d(x, f, \theta). \qquad (3.4)$$

The variance function estimates the support region by $\sigma^2(x) \le \theta$, where $\theta = \max_{x \in X} \sigma^2(x)$.

The following theorem provides a bound on the probability that a novel point lies outside the region slightly larger than the support region estimated by the variance function of gaussian processes, similar to the result obtained in Schölkopf, Platt, Shawe-Taylor, Smola, and Williamson (1999) for support vector machines. In the following, log and ln denote logarithms to base 2 and natural logarithms, respectively.

Theorem 1. Consider a fixed but unknown probability distribution P with no atomic components on the space $\mathcal{F}$ with support contained in a ball of radius 1 about the origin, and let $\sigma^2(x) = \kappa - \mathbf{k}^{\top} C^{-1} \mathbf{k}$. Assume that $\theta = \max_{x \in X} \sigma^2(x)$. Then with probability $1 - \delta$ over randomly drawn training sequences X of size N, for all $\gamma > 0$,

$$P\left( x : \sigma^2(x) > \theta + 2\gamma \right) \le \frac{2}{N} \left( k + \log \frac{N^2}{2\delta} \right), \qquad (3.5)$$

where

$$k = \frac{c_1 \log(c_2 \hat{\gamma}^2 N)}{\hat{\gamma}^2} + \frac{D}{\hat{\gamma}} \log\left( e \left( \frac{(2N-1)\hat{\gamma}}{D} + 1 \right) \right) + 2, \qquad (3.6)$$

$c_1 = 4c^2$, $c_2 = \ln(2)/c^2$, $c = 103$, $\hat{\gamma} = \gamma / \|w\|$, and $D = D(X, g, \kappa - \theta)$.

Proof. First, we show that $h(x) (= \mathbf{k}^{\top} C^{-1} \mathbf{k})$ is a linear function in a kernel-defined feature space. $\mathbf{k}$ can be represented as $\Phi_X \Phi(x)$, where $\Phi_X = [\Phi(x_1), \Phi(x_2), \ldots, \Phi(x_l)]^{\top}$. Then we get

$$h(x) = \Phi(x)^{\top} \Phi_X^{\top} C^{-1} \Phi_X \Phi(x) \qquad (3.7)$$

$$= \mathrm{tr}\left( \Phi(x)^{\top} \Phi_X^{\top} C^{-1} \Phi_X \Phi(x) \right) \qquad (3.8)$$

$$= \mathrm{tr}\left( \Phi(x) \Phi(x)^{\top} \Phi_X^{\top} C^{-1} \Phi_X \right). \qquad (3.9)$$

If we denote $w = \Phi_X^{\top} C^{-1} \Phi_X$ and $\Phi^2(x) = \mathrm{Vec}(\Phi(x)\Phi(x)^{\top})$ (meaning $\Phi^2(x)[l(i-1)+j] = \Phi(x)[i]\,\Phi(x)[j]$), then $h(x)$ is a linear function $f_w(\Phi^2(x)) (= w \cdot \Phi^2(x))$ in a kernel feature space. So $\sigma^2(x) = \kappa - h(x)$ is a linear function. By theorem 17 in Schölkopf et al. (1999), with probability $1 - \delta$, for $\hat{\theta} = \min_{x \in X} h(x) (= \kappa - \theta)$ and all $\gamma > 0$,

$$P\{\Phi^2(x) : f_w(\Phi^2(x)) < \hat{\theta} - 2\gamma\} = P\{x : \sigma^2(x) > \kappa - \hat{\theta} + 2\gamma\} \qquad (3.10)$$

$$= P\{x : \sigma^2(x) > \theta + 2\gamma\} \qquad (3.11)$$

$$\le \frac{2}{N} \left( k + \log \frac{N^2}{2\delta} \right), \qquad (3.12)$$
where

$$k = \frac{c_1 \log(c_2 \hat{\gamma}^2 N)}{\hat{\gamma}^2} + \frac{D}{\hat{\gamma}} \log\left( e \left( \frac{(2N-1)\hat{\gamma}}{D} + 1 \right) \right) + 2, \qquad (3.13)$$

$c_1 = 4c^2$, $c_2 = \ln(2)/c^2$, $c = 103$, $\hat{\gamma} = \gamma / \|w\|$, and $D = D(X, f_w, \hat{\theta})$.

This result shows that the variance function of a gaussian process characterizes the support of a high-dimensional distribution of a given data set and that the support set estimated by the GP has tractable complexity even in high-dimensional cases.

4 Clustering Based on the Variance Function

In the previous section, we showed that the variance function of GPR can represent an estimate of the support of a probability density function. As a by-product of the constructed variance function, $\sigma^2(\cdot)$ in equation 3.1, cluster boundaries can be obtained as a set of contours that enclose the points in data space, given by $\{x : \sigma^2(x) \le r^*\}$, where $r^* = \max_k \sigma^2(x_k)$. (See Figure 4.) Specifically, the level set of $\sigma^2(\cdot)$ is decomposed into several disjoint connected sets,

$$L(r^*) := \{x : \sigma^2(x) \le r^*\} = C_1 \cup \cdots \cup C_p, \qquad (4.1)$$

where the $C_i$, $i = 1, \ldots, p$, are disjoint connected sets corresponding to different clusters and $p$ is the number of clusters determined by $\sigma^2(\cdot)$.

4.1 Cluster Characterization from a Dynamical Systems Viewpoint. In this section, we describe a way to assign to each data point a cluster index determined by the level set in equation 4.1. To do this, we build the following dynamical system associated with the variance function $\sigma^2(\cdot)$:

$$\frac{dx}{dt} = F(x) := -\nabla \sigma^2(x). \qquad (4.2)$$
The existence of a unique time evolution solution (or trajectory) $x(t): \mathbb{R} \to \mathbb{R}^n$ for each initial condition $x(0) = x_0$ is guaranteed by the smoothness of the function F. (For more details on dynamical systems theory, see Guckenheimer & Holmes, 1986; Lee & Chiang, 2004.) A state vector $\bar{x} \in \mathbb{R}^n$ satisfying the equation $F(\bar{x}) = 0$ is called an equilibrium point of equation 4.2, and it is called an (asymptotically) stable equilibrium point if all the eigenvalues of the Hessian of $\sigma^2$ at $\bar{x}$ are positive. Geometrically, for each x, the vector field F(x) in equation 4.2 is orthogonal to the hypersurface
$\{y : \sigma^2(y) = r^*\}$, where $\sigma^2(x) = r^*$, and points inward from the surface. This property makes each trajectory flow inward and remain in one of the clusters described by the level sets of the variance function $\sigma^2$, which is rigorously proved below. The basin of attraction of a stable equilibrium point s is defined as the set of all points converging to s when the process in equation 4.2 is applied:

$$A(s) := \left\{ x(0) \in \mathbb{R}^n : \lim_{t \to \infty} x(t) = s \right\}.$$
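In practice, the flow of equation 4.2 can be followed numerically; the sketch below uses explicit Euler steps with a central-difference gradient, in the spirit of the numerical gradients used in the simulations of section 5. Step sizes, tolerances, and names are our own illustrative choices.

```python
import numpy as np

def find_sep(x0, sigma2, step=0.01, h=1e-5, tol=1e-8, max_iter=10000):
    """Integrate dx/dt = -grad sigma^2(x) until it settles at a stable
    equilibrium point; sigma2 is any callable from R^d to R."""
    x = np.asarray(x0, float)
    basis = np.eye(len(x))
    for _ in range(max_iter):
        grad = np.array([(sigma2(x + h * e) - sigma2(x - h * e)) / (2 * h)
                         for e in basis])
        x_next = x - step * grad
        if np.linalg.norm(x_next - x) < tol:
            return x_next
        x = x_next
    return x
```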
One distinguished feature of the constructed system in equation 4.2 is that it preserves a topological property (i.e., connectedness) of the level set in equation 4.1, as is shown in the next lemma.

Lemma 1. For any given level value r > 0, each connected component of the level set $L(r) = \{x : \sigma^2(x) \le r\}$ is positively invariant; that is, if a point is on a connected component of L(r), then its entire positive trajectory lies on the same component.

Proof. Let $x(0) = x_0$ be a point in a connected component, say C, of L(r), and let x(t) be the trajectory starting at $x(0) = x_0$. Since

$$\frac{d}{dt} \sigma^2(x(t)) = \nabla \sigma^2(x)^{\top} \frac{dx}{dt} = -\left\| \nabla \sigma^2(x) \right\|^2 \le 0,$$

$\sigma^2(x(t))$ is a nonincreasing function of $t \in \mathbb{R}$, and so we have $\sigma^2(x(t)) \le \sigma^2(x_0)$ for all $t \ge 0$, or equivalently, $\{x(t) : t \ge 0\} \subseteq L(r)$. Since $\{x(t) : t \ge 0\}$ is connected, we must have $x(t) \in C$ for all $t \ge 0$.

This lemma implies that when the dynamic process, equation 4.2, is applied, the generated trajectory x(·) of a data point $x(0) = x_0$ never exits the connected component of the level set in equation 4.1 in which it starts. The next theorem, which serves as a theoretical basis of clustering based on the variances of predictive values, implies that the generated trajectory must terminate at one of the stable equilibrium points in the same connected component.

Theorem 2. The trajectory of process 4.2 approaches one of the equilibrium points of the equation. In particular, almost every trajectory approaches one of the stable equilibrium points of equation 4.2.

Proof. To prove this, we first show that every trajectory is bounded. For any given $x_0 \in \mathbb{R}^n$, we consider the set $L(r) = \{x : \sigma^2(x) \le r\}$, where $r = \sigma^2(x_0)$. $x \in L(r)$ implies that $\mathbf{k}_x^{\top} C^{-1} \mathbf{k}_x \ge \alpha$, where $\alpha = \mathbf{k}_{x_0}^{\top} C^{-1} \mathbf{k}_{x_0} > 0$. Since $C^{-1}$ is positive definite, we can define a norm $\|\cdot\|_C$ on $\mathbb{R}^N$ by $\|v\|_C = (v^{\top} C^{-1} v)^{1/2}$ for a
vector $v \in \mathbb{R}^N$. Then, by the equivalence of norms, there exists a $c > 0$ such that

$$\|\mathbf{k}_x\|_2^2 := \mathbf{k}_x^{\top} \mathbf{k}_x \ge c \|\mathbf{k}_x\|_C^2 = c\, \mathbf{k}_x^{\top} C^{-1} \mathbf{k}_x \ge c\alpha.$$

In other words, we have

$$\|\mathbf{k}_x\|_2^2 = \sum_{k=1}^{N} C(x, x_k)^2 \ge c\alpha.$$

From the form of $C(x, x_k)$, we must have that for some $M_1 > 0$, $\|x - x_k\| \le M_1$ for all k. If we set $M_2 = \max_k \|x_k\|$, then

$$\|x\| \le \|x - x_k\| + \|x_k\| \le M_1 + M_2,$$

which means that x is bounded. Therefore, the set $L(r) = \{x : \sigma^2(x) \le r\}$ is bounded. Now let $x(0) = x_0$ be a point in L(r), where $r = \sigma^2(x_0)$. Since L(r) is bounded and positively invariant by lemma 1, $\{x(t) : t \ge 0\}$ is bounded.

Next, we show that every bounded trajectory converges to one of the equilibrium points. Since $\{x(t) : t \ge 0\}$ is nonempty and compact and $\phi(t) = \sigma^2(x(t))$ is a nonincreasing function of t, $\phi$ is bounded from below because $\sigma^2$ is continuous. Hence, $\phi(t)$ has a limit a as $t \to \infty$. Let $\omega(x_0)$ be the ω-limit set of $x_0$. Then for any $p \in \omega(x_0)$, there exists a sequence $\{t_n\}$ with $t_n \to \infty$ and $x(t_n) \to p$ as $n \to \infty$. By the continuity of $\sigma^2$, $\sigma^2(p) = \lim_{n \to \infty} \sigma^2(x(t_n)) = a$. Hence, $\sigma^2(p) = a$ for all $p \in \omega(x_0)$. Since $\omega(x_0)$ is an invariant set, for all $x \in \omega(x_0)$,

$$\frac{\partial}{\partial t} \sigma^2(x) = -\left\| \nabla \sigma^2(x) \right\|^2 = 0,$$

or equivalently, $F(\omega(x_0)) = 0$. Since every bounded trajectory converges to its ω-limit set and x(t) is bounded, x(t) approaches $\omega(x_0)$ as $t \to \infty$. Hence, every bounded trajectory of system 4.2 converges to one of the equilibrium points.

Therefore, the trajectory of $x(0) = x_0$ under process 4.2 approaches one of its equilibrium points, say $\bar{x}$. If $\bar{x}$ is not a stable equilibrium point, then the region of attraction of $\bar{x}$ has dimension less than or equal to $n - 1$. Therefore, we have

$$\mathbb{R}^n = \bigcup_{i=1}^{M} \overline{A(s_i)},$$

where $\{s_i : i = 1, \ldots, M\}$ is the set of the stable equilibrium points of system 4.2.
Figure 3: The partitioning property of the proposed clustering algorithm. The dashed lines represent the basin boundaries separating the clusters, and the arrows represent the direction of the system trajectories.
By this theorem, each in-sample data point converges to one of the stable equilibrium points in its own connected component when equation 4.2 is applied. Hence, we can assign a cluster label to each in-sample data point by identifying the cluster label of its corresponding stable equilibrium point (SEP) $s_i$. This theorem also enables us to partition the data space into a small number of separate clustered regions (i.e., basins of attraction), where each region is represented by the corresponding SEP as its candidate point. From a computational point of view, this result implies that we do not need to determine the exact basins to partition the whole data space. Since almost all data points converge to one of the SEPs, we can identify the basin to which a data point belongs by locating the stable equilibrium point that the data point converges to. From a clustering point of view, this result implies that we can assign a cluster label even to out-of-sample data points by identifying the cluster label of the corresponding SEP, thereby extending the clustering ability to out-of-sample points, as does the dynamic-based support vector clustering in Lee and Lee (2006). This viewpoint is illustrated in Figure 3 and is described in the following corollary.

Corollary 1. Let the level set of $\sigma^2(\cdot)$ be decomposed into several disjoint connected sets, $L(r^*) := \{x \in \mathbb{R}^n : \sigma^2(x) \le r^*\} = C_1 \cup \cdots \cup C_p$, where the $C_i$, $i = 1, \ldots, p$, are disjoint connected sets corresponding to different clusters and $p$ is the number of clusters determined by $\sigma^2(\cdot)$. Then almost all the (in-sample or out-of-sample) data points in the input space approach one of the clusters, $C_i$, when process 4.2 is applied.
4.2 Clustering Algorithm. To assign cluster labels to the stable equilibrium points, we can employ a complete-graph strategy as in Ben-Hur, Horn, Siegelmann, and Vapnik (2001), which is based on the following observation: given a pair of stable equilibrium points located in different clusters, a straight path connecting them must exit the contour boundary; that is, there must be a segment of points y on the path such that $\sigma^2(y) > r^*$. In this letter, we adopted a reduced complete-graph strategy applied to the set of stable equilibrium points, as in Lee and Lee (2005), to improve labeling speed. When this strategy is applied, we can build an adjacency matrix A whose elements are given by

$$A_{ij} = \begin{cases} 0 & \text{if } \sigma^2(y) > r^* \text{ for some point } y \text{ on the segment between } s_i \text{ and } s_j, \\ 1 & \text{otherwise.} \end{cases} \qquad (4.3)$$
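A compact sketch of this labeling strategy, combined with the connected-components step described next, might look as follows; the segment sampling density and all names are our own illustrative choices.

```python
import numpy as np
from scipy.sparse.csgraph import connected_components

def label_seps(seps, sigma2, r_star, n_check=15):
    """Two SEPs are adjacent iff sigma^2 stays below r* along the straight
    segment joining them; clusters are the connected components of A."""
    p = len(seps)
    A = np.zeros((p, p), dtype=int)
    ts = np.linspace(0.0, 1.0, n_check)
    for i in range(p):
        for j in range(i + 1, p):
            inside = all(sigma2((1 - t) * seps[i] + t * seps[j]) <= r_star
                         for t in ts)
            A[i, j] = A[j, i] = int(inside)
    n_clusters, labels = connected_components(A, directed=False)
    return labels
```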
Clusters are then defined as the connected components of the graph induced by A. The conceptual procedure of the clustering algorithm can be summarized as follows:

Step 1: For the given unlabeled training data $X = \{x_i \mid i = 1, 2, \ldots, N\}$, construct the variance function $\sigma^2(\cdot)$ in equation 3.1, and compute the level value $r^* = \max_{x_i \in X} \sigma^2(x_i)$.

Step 2: Using each data point as an initial value, apply the process in equation 4.2 to find its corresponding stable equilibrium point; denote the distinct stable equilibrium points by $s_i$, $i = 1, \ldots, p$. Let $X_i$ be the set of training data points that converge to the stable equilibrium point $s_i$, for each $i = 1, \ldots, p$.

Step 3: For each pair of stable equilibrium points $s_i$ and $s_j$, $i, j = 1, \ldots, p$, define the adjacency matrix A with elements

$$A_{ij} = \begin{cases} 0 & \text{if } \sigma^2(y) > r^* \text{ for some point } y \text{ on the segment between } s_i \text{ and } s_j, \\ 1 & \text{otherwise,} \end{cases} \qquad (4.4)$$
and assign the same cluster index to the stable equilibrium points in the same connected component of the graph induced by A.

Step 4: For each $i = 1, \ldots, p$, assign the same cluster label to all the training data points in $X_i$.

Figure 4 illustrates the proposed algorithm. In this example, a sample data set with two concentric circles is given, as shown in Figure 4a. Step 1 of the algorithm applied to this data set provides a variance function, $\sigma^2: \mathbb{R}^2 \to \mathbb{R}$, as shown in Figure 4b. After applying step 2, we obtain multiple stable equilibrium points corresponding to the data points, as shown in Figure 4d.
Figure 4: Clustering by the proposed algorithm applied to the circle-shaped data points.
By applying step 3 to the obtained stable equilibrium points, we obtain an induced graph with two connected components, with the cluster labels of the stable equilibrium points identified according to the component to which they belong. Finally, by applying step 4, we assign to each data point the cluster label of its corresponding stable equilibrium point.

Remarks. (1) Our method is methodologically similar to support vector clustering (SVC) (Ben-Hur et al., 2001; Lee & Lee, 2005) in that separate connected components of a level set of a support function (a variance function in our method, a trained kernel radius function in SVC) play an important role in identifying the clusters. Our method differs from SVC, however, in that when the parameters are changed, the variance function in our method can be obtained immediately and changes continuously with the parameters, whereas the trained kernel radius function in SVC must be retrained by invoking a quadratic programming solver and may change irregularly with the parameters, which makes it hard to choose the appropriate
parameters of the SVC from its geometry. (2) Our method requires a matrix inversion, which has complexity $O(n^3)$. We could speed up the computation by using the Nyström method (Williams & Seeger, 2001). (3) There could be two nearest-neighbor stable equilibrium points that are in the same cluster defined by the contours but whose joining straight line segment crosses the threshold contour. The graph-based clustering (Lee & Lee, 2005) would then separate these two stable equilibrium points and all the data points associated with them. In our experiments this did not happen, because the stable equilibrium points are rather densely located, but it could potentially happen more often in some practical high-dimensional cases and would require additional cluster characterizations, as in Lee and Lee (2006); this is left as a future research topic.

5 Simulation Results

We applied the proposed clustering method to six two-dimensional data sets with highly complex shapes, shown in Figures 5a to 5f. The data set in Figure 5a is from Camastra and Verri (2005); the data sets in Figures 5b to 5d are from the spectral clustering site (http://www.ms.washington.edu/˜spectral/), and the data sets in Figures 5e and 5f are synthetically made. Before the proposed method was used for clustering, all data sets were scaled so that the elements of the data points were between 0 and 1. Numerical gradients were used to find the stable equilibrium points. To check the connectedness between two stable equilibrium points, we checked 10 to 20 points along the straight line between them. (When the variance function value at any one of the checked points is higher than $r^*$, we set $A_{ij} = 0$.)

Figures 5a' to 5f' show the clustering results of the proposed method applied to the six data sets. For all six data sets, $v_0 = 1$ and $v_1 = 10$ were used. The solid lines represent the cluster boundaries, and the cross marks represent stable equilibrium points. For the data set in Figure 5c, $l_i = 200$ was used; for the rest of the data sets, $l_i = 400$ was used. Separate regions enclosed by solid lines represent different clusters. The data sets in Figures 5a to 5f are not clustered properly by conventional clustering methods such as K-means. The proposed clustering method clusters all six data sets perfectly.

We also applied the proposed clustering method to three high-dimensional data sets: one 5-dimensional, one 10-dimensional, and one 20-dimensional. The 5-dimensional data set has five clusters whose data points have gaussian distributions; the variance of all five clusters is 1, and their means are (7, 0, 0, 0, 0), (0, 7, 0, 0, 0), ..., (0, 0, 0, 0, 7). The 10-dimensional data set has five clusters whose data points have gaussian distributions; the variance of all five clusters is 1, and their means are (7, 0, 0, 0, 0, 0, 0, ..., 0), (0, 7, 0, 0, 0, 0, 0, ..., 0), ..., (0, 0, 0, 0, 7, 0, 0, ..., 0). The 20-dimensional
Figure 5: (a–f) Six original data sets. (a'–f') Clustering results of the proposed method applied to each data set.
data set has five clusters whose data points have gaussian distributions; the variance of all five clusters is 1, and their means are (5, 0, 0, 0, 0, 0, 0, ..., 0), (0, 5, 0, 0, 0, 0, 0, ..., 0), ..., (0, 0, 0, 0, 5, 0, 0, ..., 0). For all three data sets, $v_0 = 1$ and $v_1 = 10$ were used. For the 5-dimensional data set, $l_i = 60$ was used; for the 10-dimensional data set, $l_i = 40$; and for the 20-dimensional data set, $l_i = 60$. The proposed clustering method clusters all three data sets perfectly.

We also compared our method with the spectral clustering method, applying it to all the data sets used in the experiments. With the
proper parameters, the spectral clustering method clusters all nine data sets perfectly.

To measure the clustering performance quantitatively, we use three measures: the reference error rate RE, the cluster error rate CE, and the FScore. Let $n_{r,c}$ be the number of data points that belong to reference cluster r and are assigned to cluster c by a clustering algorithm. Let n, $n_r$, and $n_c$ be the number of all data points, the number of data points in reference cluster r, and the number of data points in cluster c obtained by a clustering algorithm, respectively. Then the measures RE and CE are defined as

$$RE = 1 - \frac{\sum_r \max_c n_{r,c}}{n}, \qquad CE = 1 - \frac{\sum_c \max_r n_{r,c}}{n}. \qquad (5.1)$$
The reference error rate RE is zero when all the data points belonging to every cluster obtained by a clustering algorithm belong to one reference cluster. The cluster error rate CE is zero when all the data points belonging to every reference cluster belong to one cluster obtained by a clustering algorithm. When both RE and CE are zero, the clusters exactly match the reference clusters.

Let $R_{r,c}$ and $P_{r,c}$ be the recall and the precision, defined as $n_{r,c}/n_r$ and $n_{r,c}/n_c$, respectively. Then the FScore of reference cluster r and cluster c is defined as

$$F_{r,c} = \frac{2 R_{r,c} P_{r,c}}{R_{r,c} + P_{r,c}}. \qquad (5.2)$$
The FScore of reference cluster r is the maximum FScore value over all the clusters,

$$F_r = \max_c F_{r,c}. \qquad (5.3)$$
The overall FScore is defined as

$$FScore = \sum_r \frac{n_r}{n} F_r. \qquad (5.4)$$
In general, the higher the FScore, the better the clustering result. The FScore is 1 when the clusters exactly match the reference clusters. Both the spectral clustering method and our method achieve a reference error rate (RE) of 0, a cluster error rate (CE) of 0, and an FScore of 1 for all the data sets tested. In terms of all these measures, our method gives results comparable to the spectral clustering method.
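For reference, the three measures can be computed from two label vectors as follows. This is a small sketch under the definitions above; the variable names are ours.

```python
import numpy as np

def clustering_scores(ref, pred):
    """RE and CE (equation 5.1) and the overall FScore (equation 5.4)."""
    ref, pred = np.asarray(ref), np.asarray(pred)
    n = len(ref)
    M = np.array([[np.sum((ref == r) & (pred == c)) for c in np.unique(pred)]
                  for r in np.unique(ref)])        # M[r, c] = n_{r,c}
    RE = 1.0 - M.max(axis=1).sum() / n
    CE = 1.0 - M.max(axis=0).sum() / n
    nr, nc = M.sum(axis=1), M.sum(axis=0)
    R = M / nr[:, None]                            # recall    n_{r,c} / n_r
    P = M / nc[None, :]                            # precision n_{r,c} / n_c
    with np.errstate(invalid="ignore"):
        F = np.where(M > 0, 2 * R * P / (R + P), 0.0)
    return RE, CE, float((nr / n * F.max(axis=1)).sum())
```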
Figure 6: Clustering of the Leptograpsus crabs data set with overlapped clusters.
So far, we have shown the experimental results for data sets with separate clusters. We next perform experiments on the Leptograpsus crabs data set, which has overlapped clusters, from the PRNN site (http://www.stats.ox.ac.uk/pub/PRNN/). It consists of the sizes of five parts of the crabs. Each crab is either male or female and has either of two color forms; there are 50 crabs of each sex of each of the two color forms. To visualize the experimental result on the plane, we reduced the 5-dimensional data set by principal component analysis and used the second and third principal components. In Figure 6, we plotted the data points of the crabs data set, the local minima, and the saddle points. The lines going through the saddle points are potential decision boundaries. We have four basins of attraction: A(s1), A(s2), A(s3), and A(s4). The values near the local minima in Figure 6 are the variance function values at the local minima. By controlling the cutting level, which is related to r* in equation 4.1, we can control the number of clusters from 1 to 4. When the cutting level is greater than 0.27 and smaller than 0.34, the number of clusters becomes
1 (A(s1) ∪ A(s2) ∪ A(s3) ∪ A(s4)). When the cutting level is greater than 0.34 and smaller than 0.43, the number of clusters becomes 2 (A(s1) ∪ A(s2) and A(s3) ∪ A(s4)). When the cutting level is greater than 0.43 and smaller than 0.47, the number of clusters becomes 3 (A(s1) ∪ A(s2), A(s3), and A(s4)). When the cutting level is greater than 0.47, the number of clusters becomes 4 (A(s1), A(s2), A(s3), and A(s4)). This is how we can cluster a data set with overlapped clusters and control the number of clusters.
6 Conclusion

We have proposed a clustering algorithm based on gaussian process models. The variance function of GPR learned from training data is shown to represent an estimate of the support of a probability density function. The variance function is then applied to construct a set of contours that enclose the data points, which correspond to cluster boundaries. A dynamic process associated with the variance function is built and applied to the cluster labeling of the data points. The simulation results have shown that the method can detect clusters with arbitrarily complex shapes as well as high-dimensional clusters. We also showed how the method can cluster a data set with overlapped clusters. One of the main implications of this letter is that a gaussian process regression model providing a predictive distribution with a nondegenerate variance function can be used to develop a clustering algorithm with the aid of dynamical systems machinery, which reveals an interesting connection between a Bayesian regression task and a clustering task. In the simulations, a proper hyperparameter was needed to obtain a proper result, so estimating a proper hyperparameter is important. Based on our experiment on the crabs data set, where we showed how to control the number of clusters, one could also try a hyperparameter-robust approach with the number of clusters given. These are potential future research topics.
Acknowledgments

This work was supported partially by the Korea Research Foundation under grant KRF-2005-041-D00708, partially by KOSEF under grant R01-2005-000-10746-0, and partially by the BK21 Research Center for Intelligent Mobile Software at Yonsei University.
References

Ben-Hur, A., Horn, D., Siegelmann, H. T., & Vapnik, V. (2001). Support vector clustering. Journal of Machine Learning Research, 2, 125–137.
Camastra, F., & Verri, A. (2005). A novel kernel method for clustering. IEEE Transactions on PAMI, 27, 801–805.
Csató, L. (2002). Gaussian processes: Iterative sparse approximation. Unpublished doctoral dissertation, Aston University.
Duda, R., Hart, P., & Stork, D. (2001). Pattern classification. New York: Wiley.
Gibbs, M., & MacKay, D. J. C. (1997). Efficient implementation of gaussian processes. Available online at http://citeseer.nj.nec.com/6641.html.
Gibbs, M., & MacKay, D. J. C. (2000). Variational gaussian process classifiers. IEEE Transactions on Neural Networks, 11(6), 1458.
Guckenheimer, J., & Holmes, P. (1986). Nonlinear oscillations, dynamical systems, and bifurcations of vector fields. New York: Springer.
Jain, A. K., Murty, M. N., & Flynn, P. J. (1999). Data clustering: A review. ACM Computing Surveys, 31(3), 264–323.
Kim, H.-C., & Lee, J. (2006). Pseudo-density estimation for clustering with gaussian processes. Lecture Notes in Computer Science, 3971, 1238–1243.
Lee, J., & Chiang, H.-D. (2004). A dynamical trajectory-based methodology for systematically computing multiple optimal solutions of general nonlinear programming problems. IEEE Transactions on Automatic Control, 49(6), 888–899.
Lee, J., & Lee, D. (2005). An improved cluster labelling method for support vector clustering. IEEE Transactions on PAMI, 27(3), 461–464.
Lee, J., & Lee, D. (2006). Dynamic characterization of cluster structures for robust and inductive support vector clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(11), 1869–1874.
Minka, T. (2001). A family of algorithms for approximate Bayesian inference. Unpublished doctoral dissertation, Massachusetts Institute of Technology.
Neal, R. M. (1996). Bayesian learning for neural networks (Lecture Notes in Statistics). New York: Springer-Verlag.
Neal, R. (1998). Regression and classification using gaussian process priors. Bayesian Statistics, 6, 465–501.
O'Hagan, A. (1978). On curve fitting and optimal design for regression. Journal of the Royal Statistical Society, 40, 1–42.
Opper, M., & Winther, O. (2000). Gaussian processes for classification: Mean field algorithms. Neural Computation, 12, 2655–2684.
Rasmussen, C. E. (1996). Evaluation of gaussian processes and other methods for non-linear regression. Unpublished doctoral dissertation, University of Toronto.
Rasmussen, C. E., & Williams, C. K. I. (2006). Gaussian processes for machine learning. Cambridge, MA: MIT Press.
Schmidt, D. (2000). Continuous probability distributions from finite data. Physical Review E, 61(2), 1052–1055.
Schölkopf, B., & Smola, A. J. (2002). Learning with kernels. Cambridge, MA: MIT Press.
Schölkopf, B., Platt, J., Shawe-Taylor, J., Smola, A. J., & Williamson, R. C. (1999). Estimating the support of a high-dimensional distribution (Tech. Rep. MSR-TR-99-87). Redmond, WA: Microsoft Research.
Williams, C. K. I., & Barber, D. (1998). Bayesian classification with gaussian processes. IEEE Transactions on PAMI, 20, 1342–1351.
Williams, C. K. I., & Rasmussen, C. E. (1995). Gaussian processes for regression. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems, 8. Cambridge, MA: MIT Press.
Williams, C. K. I., & Seeger, M. (2001). Using the Nyström method to speed up kernel machines. In T. G. Dietterich, T. K. Leen, & V. Tresp (Eds.), Advances in neural information processing systems, 13. Cambridge, MA: MIT Press.
Received January 17, 2006; accepted February 2, 2007.
LETTER
Communicated by Peter Tiňo
Elman Backpropagation as Reinforcement for Simple Recurrent Networks

André Grüning
[email protected]
Cognitive Neuroscience Sector, SISSA, 34014 Trieste, Italy
Simple recurrent networks (SRNs) in symbolic time-series prediction (e.g., language processing models) are frequently trained with gradient descent–based learning algorithms, notably with variants of backpropagation (BP). A major drawback for the cognitive plausibility of BP is that it is a supervised scheme in which a teacher has to provide a fully specified target answer. Yet agents in natural environments often receive only summary feedback about the degree of success or failure, a view adopted in reinforcement learning schemes. In this work, we show that for SRNs in prediction tasks for which there is a probability interpretation of the network's output vector, Elman BP can be reimplemented as a reinforcement learning scheme for which the expected weight updates agree with those of traditional Elman BP. Network simulations on formal languages corroborate this result and show that the learning behaviors of Elman backpropagation and its reinforcement variant are very similar also in online learning tasks.

Neural Computation 19, 3108–3131 (2007)
© 2007 Massachusetts Institute of Technology

1 Introduction

Artificial neural networks arose as models of real neuronal networks. They are used as biologically inspired models of all kinds of brain processes, from the dynamics of networks of real neurons to the dynamics of cognitive processes, for which the model neurons correspond to assemblies of real neurons rather than to real neurons themselves (Ellis & Humphreys, 1999). Network models are widely used in language modeling, since they are thought of as more plausible than symbol-based approaches to language, and also because network models are often able to learn from scratch parts of language that were thought to be innate (Elman, 1990; Christiansen & Chater, 1999b). Thus, there is a natural interest in learning algorithms that are cognitively plausible. If they are also biologically plausible, this is even better, since the learning algorithm then fills the gap between what we know about cognitive processes and what we know about the underlying neural hardware of the brain (Jackendoff, 2002). van der Velde and de Kamps (2006), for example, proposed a neurobiologically based model of language processing, but it is hard-wired, without learning. Thus, while the
types of processing an artificial neural network can do are fairly well understood, it is more uncertain how networks can be trained on a particular task in a biologically and cognitively plausible way. The challenge is to find learning algorithms for artificial neural networks that are both efficient and plausible.

In cognitive science, the commonly used training algorithms are variants of gradient descent algorithms: for an input (or sequence of inputs), they produce an output that is compared to a target output. Some measure of error between the actual output and the expected output is then calculated. The gradient descent algorithm changes the weights in the network such that the error is minimized, typically following the negative gradient of the error as a function of all network parameters. This is a nontrivial task, since at every synapse we need information about how much its weight contributes to the error (the credit assignment problem). This is especially true when weight updates are to be computed from quantities that are locally available at the synapse the weight belongs to, that is, in a biologically plausible manner. Backpropagation and its variants bypass this problem by using each synapse bidirectionally: to forward-propagate activity and to backpropagate each synapse's contribution to the total error. Backpropagation (BP) and other standard gradient descent algorithms are thus implausible for at least two reasons: first, biologically, because synapses are used bidirectionally; second, cognitively, because a full target with the correct output pattern has to be supplied to the network.

In this work, we deal with the second implausibility: the need for a fully specified target answer. In reinforcement learning (RL), the teacher supplies only feedback about the success or failure of an answer instead of a fully specified target answer. This is cognitively more plausible than supervised learning, since a fully specified correct answer might not always be available to the learner or even the teacher (Sutton & Barto, 2002; Wörgötter & Porr, 2005). Standard reinforcement algorithms are less powerful than the gradient descent–based ones. This seems to be due to the lack of good solutions to the credit assignment problem for hidden-layer neurons (Roelfsema & van Ooyen, 2005). Thus, there is a trade-off: BP is quite efficient but not so plausible cognitively, and vice versa for RL. Therefore, the objective of this letter is to find an algorithm that is both as efficient as BP and as plausible as RL. We concentrate on simple recurrent networks (SRNs) trained on a prediction task, since these have been used extensively as models in the cognitive sciences, in particular to model language processing (Christiansen & Chater, 1999a). The algorithm we propose shares properties with RL, such as a reward signal instead of a fully specified target. However, regarding its speed of learning and the types of solutions it finds, we will provide evidence that it essentially behaves like Elman BP. Our algorithm will not provide an improvement on the technical side. Rather, it is a reimplementation of Elman
BP in terms of RL. What is gained is an improvement in cognitive plausibility. Since the suggested algorithm turns out to behave virtually like Elman BP, this letter can also be seen as a post hoc justification of the use of Elman BP in so many articles in cognitive science.

We use ideas from the attention-gated reinforcement learning (AGREL) scheme (Roelfsema & van Ooyen, 2005). The AGREL learning scheme effects an amalgamation of BP and RL for feedforward (FF) networks in classification tasks. However, we need to recast these ideas in order to apply them to SRNs in prediction tasks. We call the resulting reimplementation of Elman BP as a reinforcement scheme recurrent BP-as-reinforcement learning (rBPRL). We start with an outline of Elman BP, since our suggested modifications build on an understanding of BP. Then we introduce modifications in order to reimplement Elman BP as a reinforcement scheme and show that, in the sense of expectations, Elman BP and its reinforcement version, rBPRL, effect the same weight changes (for a different formulation, see Grüning, 2005). After these theoretical sections, we present simulations that corroborate the claimed agreement in practice.

2 SRNs and Elman BP

An SRN in its simplest form is a three-layer feedforward (FF) network in which, in addition, the hidden layer is self-recurrent (Elman, 1990). It is a special case of a general recurrent neural network (RNN) and could thus be trained with full backpropagation through time (BPTT) and its variants (Williams & Peng, 1990). However, instead of regarding the hidden layer as self-recurrent, one introduces a so-called context layer into which the activities of the hidden neurons are stored at each time step and which acts as an additional input to the hidden layer in the next time step. Regarding the forward propagation of activity through the SRN, these two views are equivalent. In order to keep notation simple, we deal only with networks that are strictly layered (there are connections only between subsequent layers) and assume that there is only one hidden layer. Generalizations in which neurons receive input from other downstream layers, and with more hidden layers, are of course possible. Furthermore, we refrain from explicitly introducing a bias term, since its effect can easily be achieved by a unit with constant activation 1.

2.1 Forward Pass. Let $y^{(0)}(t)$, $y^{(1)}(t)$, and $y^{(2)}(t)$ denote the activation vectors of the input, hidden, and output layers, respectively, at (discrete) time step t, let $a^{(r)}(t)$, $r = 1, 2$, be the vectors of the corresponding net activations, let $f: a \mapsto \frac{1}{1+e^{-a}}$ be the standard sigmoid, and let $w^{r,r'}$ denote the weight matrices between layers $r'$ and $r$. Then the forward pass of activity through a three-layer SRN, viewed as a special case of a general RNN, can
be expressed as follows:

$$a_i^{(1)}(t) = \sum_j w_{ij}^{1,0}\, y_j^{(0)}(t) + \sum_j w_{ij}^{1,1}\, y_j^{(1)}(t-1), \qquad (2.1)$$

$$y_i^{(1)}(t) = f\big(a_i^{(1)}(t)\big), \qquad a_i^{(2)}(t) = \sum_j w_{ij}^{2,1}\, y_j^{(1)}(t), \qquad (2.2)$$

$$y_i^{(2)}(t) = f\big(a_i^{(2)}(t)\big). \qquad (2.3)$$
When viewed as an FF network with extra input from the context layer, we set

$$\tilde y^{(0)}(t) := \begin{pmatrix} y^{(0)}(t) \\ y^{(1)}(t-1) \end{pmatrix} \quad \text{and} \quad \tilde w^{1,0} := \big(w^{1,0}\; w^{1,1}\big).$$

This replaces equation 2.1 with

$$a_i^{(1)}(t) = \sum_j \tilde w_{ij}^{1,0}\, \tilde y_j^{(0)}(t), \qquad (2.4)$$
making the SRN formally equivalent to an FF network with the addition of recycling the hidden-layer activations as part of the next input. For notational simplicity, we occasionally drop the explicit dependence of the variables on t when no confusion can arise.

For each input $y^{(0)}$, the layers are updated consecutively. Finally, the output is read off from $y^{(2)}$ and the error E against a target vector u computed:

$$E(t) = \frac{1}{2} \sum_i \big(u_i(t) - y_i^{(2)}(t)\big)^2. \qquad (2.5)$$
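For concreteness, the forward pass of equations 2.1 to 2.3 and the error of equation 2.5 can be written out in a few lines; the following Python sketch is ours (variable names and array conventions are assumptions, not taken from the letter):

    import numpy as np

    def f(a):
        # Standard sigmoid f(a) = 1 / (1 + exp(-a)).
        return 1.0 / (1.0 + np.exp(-a))

    def forward(w10, w11, w21, y0, y1_prev):
        # One forward pass through the SRN (equations 2.1-2.3). w10, w11, w21
        # are the input-to-hidden, context (hidden-to-hidden), and
        # hidden-to-output weight matrices; y1_prev is the context layer,
        # that is, the hidden activations of the previous time step.
        a1 = w10 @ y0 + w11 @ y1_prev   # net hidden activation, eq. 2.1
        y1 = f(a1)                      # hidden activations, eq. 2.2
        a2 = w21 @ y1                   # net output activation, eq. 2.2
        y2 = f(a2)                      # output activations, eq. 2.3
        return a1, y1, a2, y2

    def error(u, y2):
        # Squared error against the target vector u (equation 2.5).
        return 0.5 * np.sum((u - y2) ** 2)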
2.2 Elman Backpropagation. The SRN is viewed as an FF network with an additional set of inputs from the context layer. Hence, standard BP in conjunction with copying the hidden layer into the context layer (with immutable weights between the hidden and context layer equal to 1 and linear activation function) can be used for training (Elman, 1990). This scheme is called Elman BP. In fact, any algorithm suitable for FF networks will automatically be available for SRNs. We restate the main formulas of Elman BP in the following since the reinforcement implementation will be based on them.

For the update of the weights, we need to know how much each single weight $w_{ij}^{r,r'}$ contributes to the overall error E. For given input $y^{(0)}$ and target u, E can be viewed as a function of all weights w. In order to keep track of each weight's contribution $\partial E / \partial w_{ij}^{r,r'}$ to the error, we first calculate the contribution $\Delta_i^{(r)} := \partial E / \partial a_i^{(r)}$ of each neuron's net input to the error. The $\Delta$s can be recursively computed layer-wise starting from the output layer, and
from them, the weight updates $\delta w$ are computed with learning rate $\epsilon$ as

$$\Delta_i^{(2)}(t) = f'\big(a_i^{(2)}(t)\big)\big(y_i^{(2)}(t) - u_i(t)\big), \qquad (2.6)$$

$$\Delta_j^{(1)}(t) = f'\big(a_j^{(1)}(t)\big) \sum_i \Delta_i^{(2)}(t)\, w_{ij}^{2,1}, \qquad (2.7)$$

$$\delta w_{ij}^{2,1}(t) := -\epsilon\, \Delta_i^{(2)}(t)\, y_j^{(1)}(t), \qquad \delta \tilde w_{ij}^{1,0}(t) := -\epsilon\, \Delta_i^{(1)}(t)\, \tilde y_j^{(0)}(t), \qquad (2.8)$$

since $\Delta_i^{(2)} y_j^{(1)} = \frac{\partial E}{\partial a_i^{(2)}} \frac{\partial a_i^{(2)}}{\partial w_{ij}^{2,1}} = \frac{\partial E}{\partial w_{ij}^{2,1}}$, and analogously for $\delta \tilde w$. So weight changes follow the antigradient of the error.
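Continuing the sketch above, one online Elman BP step then reads as follows (again our own illustrative code; the tilde update of equation 2.8 is split into its input and context parts):

    def elman_bp_step(w10, w11, w21, y0, y1_prev, u, eps):
        # One online weight update (equations 2.6-2.8); the matrices are
        # modified in place, and the new hidden state is returned.
        a1, y1, a2, y2 = forward(w10, w11, w21, y0, y1_prev)
        delta2 = y2 * (1.0 - y2) * (y2 - u)          # eq. 2.6, with f'(a) = y(1 - y)
        delta1 = y1 * (1.0 - y1) * (w21.T @ delta2)  # eq. 2.7
        w21 -= eps * np.outer(delta2, y1)            # eq. 2.8
        w10 -= eps * np.outer(delta1, y0)            # input part of the tilde update
        w11 -= eps * np.outer(delta1, y1_prev)       # context part of the tilde update
        return y1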
Error contributions are not propagated further back; thus, Elman BP is a variant of BPTT(n) with n = 1 (Williams & Zipser, 1989). The weight change $\delta w$ might be applied to the weights in each time step ($w(t+1) = w(t) + \delta w(t)$, online learning), or collected, summed, and applied to the weights only after a learning epoch has finished (batch learning). Since weight changes will take place at the end of an epoch or, ideally, on a timescale slower than the network dynamics, we will assume the weights to be fixed for the rest of the theory sections.

3 Prediction Tasks for SRNs

The tasks we are interested in are prediction tasks: predicting the next symbol in a temporal sequence of symbols that follow a certain rule or stochastic dependency. Such tasks often arise in the context of modeling language or action sequences in cognitive science, or generally in predicting a stochastic process used as an information source. We formalize these notions following Tiňo and Dorffner (2001).

3.1 Information Source. Let A denote a finite alphabet with n elements, which, for the sake of notational simplicity, we assume to be $A = \{1, 2, \ldots, n\}$. A stationary stochastic process (Cover & Thomas, 1991) produces a sequence $s_1 s_2 \ldots$, $\forall i: s_i \in A$, one symbol at a time, so that up to time t it has produced the finite sequence $\sigma_t := s_1 s_2 \ldots s_t$. The process is given by a family of time-invariant probability measures $P_n$ on n-blocks (length-n sequences over A) with the consistency condition that $\sum_{s \in A} P_{n+1}(\omega s) = P_n(\omega)$ for all n-blocks $\omega$. Conditional probabilities are as usual defined as

$$P(s|\omega) = \frac{P_{n+1}(\omega s)}{P_n(\omega)} \qquad (3.1)$$
and express the probability that an n-block ω will be followed by symbol s. An example of a stochastic information source is a finite automaton (FA) that selects one of the state transitions of its current state randomly and emits
the corresponding edge symbol (Hopcroft & Ullmann, 1979). However, the source can also be a more powerful automaton or a (quantized) dynamical system. The SRN's task is to predict the next symbol $s_{t+1}$ of sequence $\sigma_t$, more precisely, to infer (an approximation of) the conditional probabilities $P(s|\sigma_t)$ during training.

3.2 Input-Driven Systems of Iterated Functions. From the forward pass of activity, it is obvious that given $y^{(0)}(t)$, equations 2.1 and 2.2 can be rewritten as

$$y^{(1)}(t-1) \mapsto y^{(1)}(t) = F\big(y^{(0)}(t), y^{(1)}(t-1)\big) := \hat f\big(w^{1,0} y^{(0)}(t) + w^{1,1} y^{(1)}(t-1)\big), \qquad (3.2)$$

where $\hat f$ denotes component-wise application of the standard sigmoid f. Especially in the case where there is only a finite set of different input vectors $y_s^{(0)}$ that encode the elements s of the finite alphabet A, it is more convenient to regard the dependence on the input symbol s as a parameter rather than a variable of F:

$$F_s(\cdot) := F\big(y_s^{(0)}, \cdot\big). \qquad (3.3)$$

The forward pass of activity through the network is thus the application of a system of iterated functions that map the space of hidden-layer activations into itself, driven by the source from which the symbols s are drawn (Blair & Pollack, 1997; Moore, 1998; Tiňo, Čerňanský, & Beňušková, 2004). We finally introduce the notation $F_\sigma$ for the resulting mapping when a sequence $\sigma = s_1 s_2 \ldots s_n$ of input symbols is applied consecutively: $F_\sigma = F_{s_n} \circ F_{s_{n-1}} \circ \cdots \circ F_{s_1}$.

3.3 Standard Learning Schemes. After these formal prerequisites, let us look first at how the established learning schemes derive their error signals in order to reproduce the conditional next-symbol probabilities in the network's output layer.

For each symbol s, there is an input neuron whose activation $y_s^{(0)}$ is set to one when this symbol is input, and to zero otherwise, and there is a corresponding output neuron whose activation $y_s^{(2)}$ (possibly after normalization) is a measure for the predicted probability of that symbol as a continuation of the input sequence. (Remember that the symbols s are just natural numbers.) Finally, let $y_0 := y^{(1)}(0)$ denote the (fixed) vector that the hidden-layer activities of the networks are initialized to before any symbols are entered.

3.3.1 Distribution Learning. A straightforward idea is to take as a target vector u(t) directly the probability distribution $P(s|\sigma_t)$, that is,
$u_i(t) := P(i|\sigma_t)$ when the network has seen the initial sequence $\sigma_t$ and thus is in state $y^{(1)}(t) = F_{\sigma_t}(y_0)$, with output activations $y^{(2)}(t)$ given by equation 2.3. In this case, equation 2.6 reads

$$\Delta_i^{(2)}(t) = f'\big(a_i^{(2)}(t)\big)\big(y_i^{(2)}(t) - P(i|\sigma_t)\big). \qquad (3.4)$$
We refer to this scheme as distribution learning (DL). This type of learning might not be considered cognitively plausible, since it requires the teacher to know the family of probability distributions that define the information source.

3.3.2 Elman Learning. However, the conditional probabilities $P(s|\omega)$ might not be explicitly known to the teacher. So in this approach, after input sequence $\sigma_t$, the teacher waits until the source has produced $s_{t+1}$, and before using it as the next input symbol, it is presented as the new target vector; that is, u(t) is a unit vector with the entry corresponding to $s_{t+1}$ set to one and all others zero. Accordingly, at time t, the expected target, averaging over all possible targets after sequence $\sigma_t$, is $E_{P(\cdot|\sigma_t)}(u_i(t)) = P(i|\sigma_t)$, and the expected value of the error signal according to equation 2.6 reads

$$E_{P(\cdot|\sigma_t)}\big(\Delta_i^{(2)}(t)\big) = f'\big(a_i^{(2)}(t)\big)\big(y_i^{(2)}(t) - P(i|\sigma_t)\big), \qquad (3.5)$$
so that the right-hand side agrees with equation 3.4. The same is true for the other $\Delta$s recursively computed from this, due to the linearity of equation 2.7 in the $\Delta$s, and finally also for the $\delta w$s, due to the linearity of equation 2.8. This scheme is widely used in simulations in the cognitive and neurosciences. Its interesting feature is that the output activations will be driven toward the probability distribution $P(\cdot|\sigma_t)$ despite it not explicitly being the target.

4 Elman BP as a Reinforcement Learning Scheme

Let us now recast Elman BP as an RL scheme in the following way. In prediction learning, the activation $y_i^{(2)}(t)$ of output i corresponds to the estimated probability of symbol i. Assume that the network is only allowed to select a single symbol k as a response in each time step. It selects this answer according to the distribution Q given by $y_i^{(2)}(t)/|y^{(2)}(t)|$, with $|\cdot|$ denoting the one-norm. With network weights $w^{r,r'}$ and the initial state $y_0$ fixed, this distribution depends only on the input sequence $\sigma_t$ the network has been driven with, so we write $Q(i|\sigma_t) := y_i^{(2)}(t)/|y^{(2)}(t)|$. The neuron k thus selected according to Q by some competitive process (Usher & McClelland, 2001) is called the winning neuron.

This answer k is compared to the target $s_{t+1}$, the actual next symbol drawn from the source according to $P(s|\sigma_t)$, and the network receives a
reward r(t) = 1 only if $k = s_{t+1}$, and r(t) = 0 otherwise. The reward objectively to be expected after sequence $\sigma_t$ and with winning neuron k is $E_{P(\cdot|\sigma_t)}(r(t)) = P(k|\sigma_t)$. The network compares the received reward r(t) to the subjectively expected reward, namely, the activation $y_k^{(2)}(t)$ of the winning neuron (Tremblay & Schultz, 1999); the relative difference,

$$\delta(t) := \frac{y_k^{(2)}(t) - r(t)}{Q(k|\sigma_t)}, \qquad (4.1)$$
serves as the direct reward signal for the network. From $\delta$, we compute the error signals $\Delta^{(2)}$ for the output layer. Since attention is concentrated on the winning output k, only $\Delta_k^{(2)}(t)$ is different from zero, and we set

$$\Delta_i^{(2)}(t) := \begin{cases} 0, & i \neq k \\ f'\big(a_i^{(2)}(t)\big)\,\delta(t), & i = k, \end{cases} \qquad (4.2)$$
overwriting the original definitions of the $\Delta$s as derivatives $\partial E/\partial a_i^{(2)}$. The $\Delta^{(1)}$s and the weight updates $\delta w$ are recursively computed from $\Delta^{(2)}$ formally as in equations 2.7 and 2.8. Since the reward r(t) depends on $s_{t+1}$ drawn according to P, we need to take expected values over P, with k still as the winning unit:

$$E_{P(\cdot|\sigma_t)}\big(\Delta_i^{(2)}(t)\big) = \begin{cases} 0, & i \neq k \\ f'\big(a_i^{(2)}(t)\big)\, \dfrac{y_i^{(2)}(t) - P(i|\sigma_t)}{Q(i|\sigma_t)}, & i = k, \end{cases} \qquad (4.3)$$
since $E_{P(\cdot|\sigma_t)}(r(t)) = P(k|\sigma_t)$ and all other quantities do not depend on $s_{t+1}$. However, k is also selected randomly according to Q, so we need to take the expected value over Q too. Since, given the past input sequence $\sigma_t$, the distributions $P(\cdot|\sigma_t)$ and $Q(\cdot|\sigma_t)$ are independent, $E_{(Q,P)}(\cdot) = E_Q(E_P(\cdot))$, and we get

$$E_{Q(\cdot|\sigma_t)}\Big(E_{P(\cdot|\sigma_t)}\big(\Delta_i^{(2)}(t)\big)\Big) = Q(i|\sigma_t)\, f'\big(a_i^{(2)}(t)\big)\, \frac{y_i^{(2)}(t) - P(i|\sigma_t)}{Q(i|\sigma_t)} = f'\big(a_i^{(2)}(t)\big)\big(y_i^{(2)}(t) - P(i|\sigma_t)\big), \qquad (4.4)$$
and this agrees with equations 3.4 and 3.5. By linearity, the expected values of all other $\Delta$s and of the $\delta w$s (and $\delta \tilde w$s) in this scheme coincide as well with their counterparts in standard Elman BP. Compared to Elman BP, an error for a certain output is calculated only when it is the winning unit; it is updated less frequently by a factor $Q(i|\sigma_t)$. But $\Delta_i^{(2)}$ is larger by a factor $1/Q(i|\sigma_t)$ to compensate for this.
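In code, the only change relative to the Elman BP sketch above is the output-layer error signal; equations 4.1 and 4.2 become (our illustrative Python, as before):

    def rbprl_output_delta(y2, s_next, rng):
        # rBPRL error signal for the output layer (equations 4.1 and 4.2).
        # y2 is the output activation vector, s_next the index of the actual
        # next symbol; the returned vector replaces eq. 2.6 in elman_bp_step.
        Q = y2 / y2.sum()                    # selection distribution Q(.|sigma_t)
        k = rng.choice(len(y2), p=Q)         # stochastically selected winning neuron
        r = 1.0 if k == s_next else 0.0      # success-failure reward
        delta_rl = (y2[k] - r) / Q[k]        # direct reward signal, eq. 4.1
        delta2 = np.zeros_like(y2)
        delta2[k] = y2[k] * (1.0 - y2[k]) * delta_rl   # f'(a) * delta(t), eq. 4.2
        return delta2

All other $\Delta$s and weight updates are then computed from this vector exactly as in equations 2.7 and 2.8.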
Let us state briefly how the ingredients in this scheme fit into the standard reinforcement terminology (Sutton & Barto, 2002). First, there is a primary reward signal r from which a secondary one, $\delta$, is derived as a comparison to the expected reward for a particular network state and selected action k. An action of our network is simply the selection of an output symbol k, and the value of this action k is the activation $y_k^{(2)}$ of the corresponding output neuron. Finally, actions k are selected (fairly nongreedily) on the basis of the distribution Q given by $y^{(2)}/|y^{(2)}|$. A policy is given by the inner dynamics of the network that maps an input $y^{(0)}$ and network state $y^{(1)}$ onto a set of action values and ultimately onto the selected symbol. Furthermore, the dynamics of the network is a model of the information source. We can conclude that we have found an RL scheme in which the expected weight updates are the same as in distribution or Elman learning.

5 From Expectations to Averages

In the preceding sections, we showed that the expected values of the $\Delta$s and weight updates $\delta w$ in Elman BP and rBPRL agree with their counterparts in DL. For practical applications, we need to say how these expected values can be approximated by averages from sampling one or more sequences from the information source. Keeping the weights fixed as we did in the preceding paragraphs suggests a batch approach toward learning. However, we are ultimately interested in online learning (and Elman BP has often been used in such tasks), since it is also cognitively more plausible: no external reset of information source or network is needed, and learning proceeds in an incremental way.

5.1 Batch Learning. For DL, there are no specific problems. An initial SRN state $y_0$ is fixed. The SRN is trained epoch-wise so that at the beginning of each epoch, the network state is reset to $y_0$, the time count to t = 1, and a new sequence is drawn from the information source. The length of the epoch should be longer than the typical timescale of the source (if any and if known). For each target, the weight changes $\delta w$ of equation 2.8 are collected and are applied at the end of the current epoch.

For Elman BP, since the target vectors are selected randomly according to $P(s|\sigma_t)$, we need to sample over a sufficient number of targets for the same $\sigma_t$ before we update the weights to ensure a good approximation to the expectation of equation 3.5. Thus, within an epoch there will be several “subepochs,” at the beginning of which network and time are reset and a new sequence drawn. Weight changes are collected during several subepochs and applied only on completion of an epoch. Again, the length of a subepoch should be longer than the typical timescale of the source, and the number of subepochs within an epoch should be such that each $\sigma_t$
appears a sufficient number of times so that the sampling of their respective targets becomes reliable enough for averaging.

For rBPRL, the procedure will be just like for Elman BP. Since the error signals have an even bigger variance than in Elman BP, due to averaging over both P and Q, we need to sample the $\Delta$s even longer than before for the targets to yield reliable average $\Delta$s; thus, epochs will contain more subepochs. The difficult point is to decide on the length of subepochs and their number when one knows nothing about the timescale of the source, so that the network can generalize from the finite sequences it sees in training to longer sequences.

5.2 Online Learning. Insisting on cognitive plausibility, resetting the network to a fixed reset state $y_0$ seems unnatural, as does the epoch structure. Learning ought to be incremental, just as new data are coming in. Let us therefore discuss the problems we incur in the transition from batch to online learning.

Online learning means that weight updates $\delta w$ are applied after each target presentation, and this means that the network dynamics change constantly, so that in equation 3.5, $a^{(2)}(t)$ and $y^{(2)}(t)$, and in equations 4.3 and 4.4, $Q(k|\sigma_t)$, will be different for each presentation of the same sequence $\sigma_t$ in a later subepoch and will depend on the complete learning history in an intricate way. This is known as the moving targets problem. However, simulational results have shown for Elman BP and similar gradient descent algorithms that networks will approximate the averaged target for a sequence $\sigma_t$ well when the learning rate is small (Williams & Peng, 1990); that is, the timescale of weight changes and the timescale of the source (or epoch) are decoupled, so that $y^{(2)}$ and Q will differ only slightly for presentations of the same sequence $\sigma_t$ in different epochs. A rigorous convergence result for BPTT under certain conditions is derived in Kuan, Hornik, and White (1994). A proof for rBPRL or Elman BP would presumably have to follow the same line of argumentation. Weights are changed incrementally in every time step now, but we have retained the (sub-)epoch structure insofar as we still reset the network to $y_0$ and the time count to t = 1 at the beginning of each epoch.

Many information sources can be predicted well when one bases one's predictions on only a (fixed) finite number of recent symbols in a sequence, though, of course, cases exist where this is not true at all. When $L_n$ denotes the cut-off operator, that is, $L_n(\sigma_t) = s_{t-n+1} s_{t-n+2} \ldots s_t$ for t > n and the trivial operation for $t \le n$, this means that the probabilities $P(s|\sigma_t)$ can be well approximated by the probabilities $P(s|L_n(\sigma_t))$. For Markov processes of order n, this approximation is exact. For fully fledged finite-state processes, this approximation is exact except for a fraction of input histories whose probability measure decreases exponentially with n, due to the guaranteed existence of a synchronizing word (Grüning, 2004; Lind & Marcus,
1995). Second, networks cannot be expected to reproduce the source exactly. Instead they have to build a model of the source using structure in the sequences (Cleeremans, Servan-Schreiber, & McClelland, 1989; Bodén & Wiles, 2000; Rodriguez, 2001; Grüning, 2006; Crutchfield, 1993; Tiňo & Dorffner, 2001). While no definite exact results exist about which models for which type of information source SRNs can evolve under gradient descent training, it is known that SRNs, when initialized with small weights, have an architectural bias toward definite memory machines (Hammer & Tiňo, 2003), that is, a model class with decaying memory. They tend to base their next-symbol predictions on only a finite past since the last few input symbols have the greatest impact on the state-space activations. In addition, for the standard learning schemes, theoretical and heuristic results point in the same direction for bigger weights, in the sense that under gradient descent learning it is very hard to teach long-term dependencies to the networks (Bengio, Simard, & Frasconi, 1994). Hence, also for rBPRL, we are led to assume that deviations caused when we do not reset the network at the beginning of each (sub)epoch to a definite value $y_0$ are negligible after a few input symbols.

If the information source is stationary and ergodic, so that all finite subsequences will appear with their typical frequency in almost all long enough sequences, we can thus abolish the (sub-)epoch structure completely and train the network on a single long sequence from the source. This has the advantage, too, that we do not need to make assumptions about the internal timescale of the source to set the (sub-)epoch lengths. While there exists abundant simulational evidence that the network's output will approach the conditional probabilities of next symbols for DL and Elman BP also under online learning (Elman, 1990; Bodén & Wiles, 2000; Rodriguez, 2001; Grüning, 2006), this is of course no formal proof that rBPRL under these circumstances will converge to an error minimum too.

6 Simulations

In this section, we present some simulations corroborating the theoretical considerations. Our goal is to demonstrate that the modifications introduced to Elman BP in order to arrive at rBPRL have a minor effect on the learning behavior, hence establishing also in practice that rBPRL is equivalent to Elman BP, in the sense that it shows a similar learning speed and that the internal dynamics found as a solution to a certain task are similar too. We have already shown that the expected error signals, and thus the weight updates, in rBPRL agree with the ones from Elman BP (and from DL). Thus, the speed of learning ought to be the same when all other parameters are kept constant, and similar dynamics should emerge.

While the learning behavior agrees in theory, in practice there are at least two factors that potentially introduce deviations between rBPRL and
Figure 1: FAs for the languages used in simulations: (a) the Reber language; (b) the badiiguuu language. Double circles indicate the start states. The states carry arbitrary labels. Edge labels correspond to the symbol emitted when the edge is traversed.
Elman BP. First, we change weights online and not batch-like. This might have different influences for rBPRL and Elman BP on the strict validity of equations 3.5 and 4.4, respectively. For Elman BP, it has been found to work well. Second, due to averaging over both P and Q, the variance of the error signals in rBPRL is greater than that in Elman BP; thus, rBPRL is even more likely than Elman BP to leave narrow local minima. This is an advantage when the local minimum is shallow; however, rBPRL is also more likely to lose good local minima, which impedes learning. We will demonstrate here that one can balance these effects with a low enough learning rate $\epsilon$.

We do so using two languages that have been used frequently as benchmarks in cognitively motivated symbolic time-series prediction. The first language is the so-called Reber language (Cleeremans et al., 1989; Reber, 1967), which can be generated by the FA in Figure 1a. Its main characteristic is that the same symbol can effect transitions between otherwise unrelated states. The second language is the badiiguuu language (Elman, 1990; Jacobsson, 2006), which requires keeping track of the number of symbols i and u in order to predict the probabilities of the next symbols (see Figure 1b).
For both languages, a stream of sentences is generated as follows. We start in the accept state and move to a new state along one of the outgoing transitions, whose symbol is emitted. When there is more than one transition, each branch is chosen with equal probability. Then we move on to the next state, and so forth. We complete a sentence when we reach the accept state again.¹

6.1 Learning Speed. Both the Reber language and the badiiguuu language have six different symbols. Thus, for each, we create 50 SRNs with six input and output neurons and three neurons in the hidden layer. The weights in the 50 networks are initialized independently and identically, drawn randomly from the interval [−0.1, 0.1). We then make three identical copies of each of the 50 networks, which are trained online with one of the following three training schemes, respectively: DL, standard Elman BP, and rBPRL.

The training consists of up to 500 epochs, where each epoch comprises a training phase of 100 sentences followed by a test phase of 50 sentences from the same online source; that is, all three copies of a network receive the same input in the same order, starting from the same initial weights. Neither the network nor the information sources are reset at epoch boundaries; they are left running continuously, so epochs merely serve to organize the training. In the test phase, for each network the squared Euclidean distance between the actual output and the target distribution is calculated as a measure of error while weights are kept constant. Errors are then averaged over all patterns in a test epoch to yield the network's mean squared error (MSE) for that epoch.

6.1.1 The Reber Language. Graphs for learning rates $\epsilon$ = 1.0, 0.1, and 0.01 are displayed in Figure 2, showing the MSE averaged over the 50 independently randomly initialized networks for each test epoch as a function of the number of training epochs. For learning rate $\epsilon$ = 1.0, we see that there is a wider distance between rBPRL and Elman BP, as well as a smaller one between Elman BP and DL, so that the error for DL finally is lowest while the error of rBPRL stays above Elman BP. This is true even for the initial epochs, where larger error signals could facilitate finding a large but perhaps shallow local minimum. We conclude that the higher variance of error signals does play a role when learning rates are high and thus leads to a dissociation of the error curves.
¹ Strictly speaking, these sources are not stationary and ergodic, since they start in a fixed state instead of drawing the start state each time from their stationary state distribution (Kitchens, 1998). However, each state distribution converges exponentially fast to the stationary distribution (in an irreducible, aperiodic FA), so that the differences are negligible after a few initial symbols.
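To illustrate the sentence-generation procedure, the following sketch draws sentences from a finite automaton. The transition table is transcribed from the classic Reber grammar (Reber, 1967; Cleeremans et al., 1989); its state count and labels may differ in detail from Figure 1a, so treat it as an assumption for illustration only:

    import numpy as np

    # state -> list of (emitted symbol, next state); state 0 is the accept state.
    REBER = {
        0: [("b", 1)],
        1: [("T", 2), ("P", 3)],
        2: [("S", 2), ("X", 4)],
        3: [("T", 3), ("V", 5)],
        4: [("X", 3), ("S", 6)],
        5: [("P", 4), ("V", 6)],
        6: [("E", 0)],
    }

    def sentence(rng, fa=REBER, accept=0):
        # Walk from the accept state until it is reached again, choosing each
        # outgoing transition with equal probability and emitting its symbol.
        symbols, state = [], accept
        while True:
            symbol, state = fa[state][rng.integers(len(fa[state]))]
            symbols.append(symbol)
            if state == accept:
                return symbols

A training stream is then simply the concatenation of such sentences, for example, rng = np.random.default_rng(0); stream = [s for _ in range(100) for s in sentence(rng)].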
Figure 2: Performance in terms of epoch-wise MSE for the Reber language, averaged over the 50 networks, as a function of the number of training epochs: (a) $\epsilon$ = 1.0; (b) $\epsilon$ = 0.1; (c) $\epsilon$ = 0.01. Each panel compares Elman BP, reinforcement (rBPRL), and distribution learning. Error bars indicate the standard error of the average MSE.
For $\epsilon$ = 0.1 and $\epsilon$ = 0.01, all three learning curves agree well. Thus, the moving targets problem and the higher variance of the error signal do not cause performance to deteriorate.

6.1.2 Badiiguuu. Averaged performance curves for badiiguuu are shown in Figure 3. While there is not much difference between the performance of Elman BP and rBPRL regardless of learning rate, there remains, for all tested learning rates, a certain gap between these two on the one side and DL on the other. However, we are mainly interested in the equivalence between Elman BP and rBPRL and can thus again state that their performance levels as functions of the number of training epochs are similar. We note that the error rate is generally higher than for the Reber language. An inspection of the trained networks reveals that many of them fail to count the three u's correctly.

6.2 Inner Dynamics. A main objective of this work is to give more plausibility to neural network simulations in cognitive modeling that make use of Elman BP for learning. The idea is to justify these simulations post hoc by demonstrating that Elman BP and rBPRL show similar learning behavior and thus that using Elman BP introduces only negligible deviations from rBPRL. Therefore, our main interest is in a comparison of Elman BP and rBPRL, and we do not pursue DL any further in the following.

We have already presented some evidence that rBPRL and Elman BP learn equally fast (at least for low learning rates); however, “similar” means more than “equally fast.” Therefore, we want to demonstrate that, starting from identically initialized networks, the two learning algorithms also lead to the same internal dynamics or, put differently, that the solution space is identical for both algorithms for a given prediction task. Thus, we also recorded the activations of the hidden layer during the test phases. For each test phase, we calculate the average squared Euclidean distance between the hidden-layer activations of the copies of the network trained with Elman BP and with rBPRL, respectively.

We start for both learning algorithms from identically initialized weight matrices, and hence from identical inner dynamics. When the learning rate is sufficiently low, the dynamics of copies of the network trained with different learning algorithms should also change slowly (barring any passing of critical points in the dynamics, which usually is a rare event compared to continuous change of the dynamics with the parameters). When the dynamics stay sufficiently close to each other, the differences of hidden-layer activations between the two copies will also be small. When the dynamics evolve differently, the distance between the hidden-layer states will increase until the dynamics are not related any longer; that is, the activation distances reach a level that corresponds to the expected squared distance of two random vectors in the state space. For a three-dimensional state space $[0,1]^3$, the expected squared distance of two random vectors drawn with uniform
Figure 3: Performance in terms of average MSE for the badiiguuu language over 50 networks: (a) $\epsilon$ = 1.0; (b) $\epsilon$ = 0.1; (c) $\epsilon$ = 0.01. Error bars indicate the standard error as before.
Figure 4: Reber language. In order to make the curves comparable for different learning rates and yield equal gross weight changes, the x-axes are stretched tenfold for $\epsilon$ = 0.1 and hundredfold for $\epsilon$ = 1.0. Error bars indicate the standard error of the average mean squared error and average distances. (a) Comparison of performance curves. The networks for $\epsilon$ = 0.01 and $\epsilon$ = 0.1 perform equally after 10 times the number of epochs when trained with a 10 times smaller learning rate. (b) Differences of hidden-layer activations between Elman BP and rBPRL. At comparable performance and gross weight change, the differences are considerably smaller for smaller $\epsilon$. (c) Euclidean distance of the weight matrix $w^{1,1}$ from the initial weight matrix as a function of epoch number. (d) Euclidean distance of the weight matrices $w^{1,1}$ between Elman BP and rBPRL.
probability is 1/2. For higher learning rates, one expects this to happen earlier than for lower learning rates, since in the latter case the effect of the moving targets problem, as well as the effects of the different variances of the weight updates, are lower. When we compare the dynamics of networks across different learning rates, we should also ensure that their performances are comparable. Figure 4a shows that the performances of networks trained on the Reber language with a learning rate $\epsilon$ = 0.01 are comparable to those of networks trained with $\epsilon$ = 0.1 for 10 times as long a training, and a bit less so to
$\epsilon$ = 1.0 (for a hundred times as long a training), so that the total gross weight changes (number of epochs × learning rate) are of the same order of magnitude.

Figure 4b shows the epoch-wise squared hidden-layer distances between Elman BP and rBPRL for learning rates $\epsilon$ = 1.0, 0.1, and 0.01, averaged over the 50 networks. A comparison across the different learning rates (aided by the epoch scaling) at comparable performance levels and comparable gross weight changes reveals that the dynamics for Elman BP and rBPRL become dissimilar much faster for $\epsilon$ = 1.0 than for $\epsilon$ = 0.1 or 0.01. In fact, the difference approaches its random value 1/2 already after five epochs for $\epsilon$ = 1.0, while it still stays considerably smaller for $\epsilon$ = 0.01 than for $\epsilon$ = 0.1.

A more direct way to compare the dynamics is through the Euclidean distances of the weight matrices. In the following, we concentrate on the hidden-to-hidden matrix $w^{1,1}$ as a main ingredient of the state dynamics; the other weight matrices, not presented here, show a similar qualitative behavior. Figure 4c shows the Euclidean distance of $w^{1,1}$ to its value after initialization as a function of epochs. With scaling of the epoch axis for $\epsilon$ = 0.1 and $\epsilon$ = 1.0, the average distances traveled from the initial matrix to the current one are comparable across learning algorithms and learning rates, with rBPRL for $\epsilon$ = 1.0 deviating quite a bit. These distances show as a side effect that the matrices, and thus the dynamics, have moved considerably away from the initial dynamics.²

Finally, Figure 4d compares epoch-wise the Euclidean distances of the weight matrices $w^{1,1}$ between Elman BP and rBPRL across the different learning rates. The graphs show that the matrices become dissimilar faster for higher learning rates, at comparable gross weight changes and comparable distances from the initial matrix. For $\epsilon$ = 1.0, the distances between the algorithms are of the same order of magnitude as the distance to the initial matrix, around 6 after five epochs, while for learning rate $\epsilon$ = 0.1, the ratio of the distance to the initial matrix and the distance between the algorithms is roughly 1.5 after 50 epochs, and 4 for $\epsilon$ = 0.01. The matrix distances thus show that for low learning rates, the dynamics for Elman BP and rBPRL stay more similar for comparable distances from the initial weight matrix.

Figures 5a to 5d are the equivalent for the badiiguuu language and provide evidence for essentially the same behavior as for the Reber language. Also here, the ratios of the matrix distances from the initial matrix to the distances between the algorithm variants are about 1, 2, and 6 for $\epsilon$ = 1.0, 0.1, and 0.01, respectively. However, matrix distances are generally smaller, indicating that the dynamics have not moved as far from the initial matrix. This might also explain the generally lower performance on badiiguuu as compared to the Reber language.

² A similar picture emerges when comparing the maximal eigenvalue of $w^{1,1}$ across learning rates and algorithm variants. It rises from close to 0.15 to levels of about 6 after 500, 50, and 5 epochs for $\epsilon$ = 0.01, 0.1, and 1.0, respectively, indicating that the contractive regime of the initial matrix has probably been left.
Figure 5: badiiguuu language. In order to make the curves comparable for different learning rates and yield equal gross weight changes, the x-axes are stretched tenfold for $\epsilon$ = 0.1 and hundredfold for $\epsilon$ = 1.0. Error bars indicate the standard error. (a) Comparison of performance curves. The networks for $\epsilon$ = 0.01 and $\epsilon$ = 0.1 perform equally after 10 times the number of epochs when trained with a 10 times smaller learning rate. (b) Differences of hidden-layer activations between Elman BP and rBPRL. At comparable performance and gross weight change, the differences are considerably smaller for $\epsilon$ = 0.01 than for 0.1 or 1.0. (c) Euclidean distance of the weight matrix $w^{1,1}$ from the initial weight matrix as a function of epoch number. (d) Euclidean distance of the weight matrices $w^{1,1}$ between Elman BP and rBPRL.
From both Figures 4 and 5, we can conclude that for too high a learning rate $\epsilon$, the distances between the hidden-layer activations increase faster than for lower learning rates at comparable performance levels; for low learning rates, in contrast, the dynamics of the two networks stay similar although they are trained with two different learning algorithms. This is an indication that, again for low learning rates $\epsilon$, one can keep the search space for solutions the same for rBPRL and Elman BP. Even more, given the same network initialization and identical training data, the two algorithms will lead to identical (or nearly identical) dynamical solutions for the learning task.
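The comparison measures used in this section are straightforward to compute; a trivial sketch in the conventions of the earlier listings (function names are ours):

    import numpy as np

    def activation_distance(h_a, h_b):
        # Mean squared Euclidean distance between hidden-layer activations
        # of two network copies; for unrelated vectors drawn uniformly from
        # [0, 1]^3 it approaches the random level 1/2.
        return float(np.mean(np.sum((h_a - h_b) ** 2, axis=-1)))

    def matrix_distance(w_a, w_b):
        # Euclidean distance between two weight matrices, used both against
        # the initial matrix and between the two algorithm variants.
        return float(np.linalg.norm(w_a - w_b))

    def max_abs_eigenvalue(w11):
        # Magnitude of the leading eigenvalue of the hidden-to-hidden
        # matrix, used in note 2 to judge whether the initially contractive
        # regime has been left.
        return float(np.max(np.abs(np.linalg.eigvals(w11))))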
7 Discussion

The crucial facts that allow us to reimplement Elman BP as a reinforcement scheme are (1) a probability interpretation of the output vector, which allows us to regard the network as directing its attention to a single output that is stochastically selected and subsequently to relate the evaluative feedback to this single output, with $\Delta \neq 0$ only for this output; and (2) that the error contribution of each single neuron in the hidden layer to the total error is a linear superposition of the individual contributions of the neurons in the output layer in Elman BP (see equations 2.7 and 2.8). This enables us to concentrate on the contribution from a single output in each time step and still arrive at an expected weight update equal to that of the original Elman BP scheme.

Obviously, the ideas laid out in this letter are applicable to all kinds of SRNs, multilayered or not, and to more general RNNs, as long as their recurrence can be treated in the sense of introducing context units with immutable copy weights; they ought also to be applicable to other networks and gradient-based learning algorithms that fulfill the two above conditions (e.g., the long short-term memory algorithm; Hochreiter & Schmidhuber, 1997).

Our goal here is to make Elman BP cognitively and biologically more acceptable, not to offer a technical improvement of BP. Therefore, a discussion of aspects of rBPRL's plausibility is in order. First, we have replaced a fully specified target with a single evaluative feedback for SRNs in a symbolic prediction task, so that a more natural semisupervised RL scheme can be incorporated in Elman BP, yielding our rBPRL.

Second, consider the neural computability of the quantities we use in rBPRL. First, there is the relative difference $\delta$ of actual and expected reward. It can be realized as a prediction error neuron (Schultz, 1998; Wörgötter & Porr, 2005). Furthermore, attentive concentration on the winning neuron is physiologically plausible (Roelfsema & van Ooyen, 2005), and we are justified in injecting a $\Delta \neq 0$ only for the winning neuron. Also, the computation of Q (which also enters $\delta$) from the output activations $y^{(2)}$ is not trivial. Technically, it is just an addition of the $y_i^{(2)}$ and a division of each output activity by this sum. Positive evidence for pools of neurons doing precisely such a divisive normalization is summarized in Carandini and Heeger (1994). Furthermore, there needs to be a competitive process among the output neurons to finally select one according to Q as the winning neuron. Plausible brain mechanisms for this are discussed in Nowlan and Sejnowski (1995) and Usher and McClelland (2001). Finally, the $\Delta$s have to be calculated. The term that seems not straightforward to calculate is $f'(a)$. But here the choice of the standard sigmoid f as the activation function leads to a derivative that can be calculated from the function value y = f(a) alone, as $f'(a) = f(a)(1 - f(a)) = y(1 - y)$, so that this derivative is also natural to compute.
A third set of problems to be addressed is related to locality and how past values can be stored. In order to make all quantities for the weight update available locally at a synapse, the weights w are used bidirectionally, and that is not really biologically plausible. When model neurons stand for assemblies of real neurons, the assumption of reciprocal connections of roughly equal strength can, however, be justified (Fellemann & van Essen, 1991). Therefore, Roelfsema and van Ooyen (2005) suggest a second set of weights $\bar w$ that are used in the backward propagation of error, while w is used in the forward pass of activity only. This second set of weights $\bar w$ is initialized independently from w but updated, mutatis mutandis, according to equations 2.6 to 2.8, however with a small noise term $\eta$ with mean 0 added in order to introduce deviations between forward and backward connections:
$$\delta \bar w_{ij}^{r,r'} = -\epsilon (1 + \eta)\, \Delta_i^{(r)} y_j^{(r')}. \qquad (7.1)$$
In simulations with rBPRL and separate backpropagation weights $\bar w$, different initialization and the introduction of gaussian noise (mean zero, standard deviation 0.2) as above did not change the average performance curves. Thus, rBPRL can cope with backpropagation weights that are not exactly equal to the forward-propagation weights. In our view, the bidirectional use of weights, or the introduction of a second set of weights in this explicit manner, remains the weakest point of both Roelfsema and van Ooyen's AGREL and our rBPRL and needs further elaboration. An interesting route to pursue would be to incorporate elements of weight perturbation algorithms (Rowland, Maida, & Berkeley, 2006).

For SRNs, we also need to make plausible that the activations of the hidden layer can be retrieved after one time step when they are to act as inputs to the hidden layer again. This assumption is again plausible in view of the ample recurrent connections in the brain, which might recycle activation from a neighboring cell for some time so that its activation can be traced. Also in this case, we cannot expect that the activation can be retrieved with perfect accuracy. Hence, we conducted an additional set of simulations in which we added gaussian noise with mean zero and standard deviation 0.2 to the activations in the context layer. This modification left the performance curves for both the Reber and the badiiguuu languages unaltered, demonstrating that rBPRL can also cope with an imprecise context layer.

While our reinforcement scheme easily extends even to fully recurrent networks trained with BP through time (again, it is mainly a question of the linearity of the $\Delta$s), its naturalness would of course hinge on a mechanism that allows tracing a neuron's or assembly's activation with some precision for a longer time.
8 Conclusion

We have found an RL scheme that behaves essentially like the Elman BP scheme for SRNs in prediction tasks; however, it is cognitively more plausible in that it uses a success-failure signal instead of a prescribed target. Essential in transforming Elman BP into a reinforcement scheme was (1) that the Elman BP error signal for the complete target is a linear superposition of the errors for each single output neuron and (2) the probabilistic nature of the task: select one possible output randomly and direct the network's attention toward it until it is rewarded.

We have shown that expected weight updates are the same for DL, Elman BP, and rBPRL under the assumption of fixed weights. Furthermore, we have found evidence in the simulations that DL, rBPRL, and Elman BP also behave similarly in online learning with incremental weight updates (for low learning rates). They agree in terms of the number of epochs to reach a certain level of performance. Also, there is evidence that the solution spaces for rBPRL and Elman BP are essentially the same, at least for low learning rates $\epsilon$.

Furthermore, we have briefly discussed the cognitive and biological plausibility of other ingredients in the recurrent BP-as-reinforcement scheme. It seems that there is good evidence for all of them in at least some parts of the brain, possibly with the exception of the second set of weights used for the backward propagation phase. Enhanced cognitive plausibility for Elman BP in the form of rBPRL thus gives SRN use in cognitive science a stronger standing.

Acknowledgments

I thank three anonymous reviewers for many profound and valuable comments and Thomas Wennekers, Arjen van Ooyen, and Alessandro Treves for stimulating discussions. This work was supported by Human Frontier grant RGP0047/2004-C.

References

Bengio, Y., Simard, P., & Frasconi, P. (1994). Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2), 157–166.

Blair, A. D., & Pollack, J. B. (1997). Analysis of dynamical recognizers. Neural Computation, 9(5), 1127–1142.

Bodén, M., & Wiles, J. (2000). Context-free and context-sensitive dynamics in recurrent neural networks. Connection Science, 12(3/4), 197–210.

Carandini, M., & Heeger, D. J. (1994). Summation and division by neurons in primate visual cortex. Science, 264(5163), 1333–1336.

Christiansen, M. H., & Chater, N. (1999a). Connectionist natural language processing: The state of the art. Cognitive Science, 28(4), 417–437.
Christiansen, M. H., & Chater, N. (1999b). Toward a connectionist model of recursion in human linguistic performance. Cognitive Science, 23(2), 157–205.

Cleeremans, A., Servan-Schreiber, D., & McClelland, J. L. (1989). Finite state automata and simple recurrent networks. Neural Computation, 1, 372–381.

Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.

Crutchfield, J. P. (1993). The calculi of emergence: Computation, dynamics, and induction. Physica D, 75, 11–54.

Ellis, R., & Humphreys, G. (1999). Connectionist psychology. Hove, East Sussex: Psychology Press.

Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14, 179–211.

Fellemann, D., & van Essen, D. (1991). Distributed hierarchical processing in the primate cerebral cortex. Cerebral Cortex, 1, 1–47.

Grüning, A. (2004). Neural networks and the complexity of languages. Unpublished doctoral dissertation, University of Leipzig.

Grüning, A. (2005). Back-propagation as reinforcement in prediction tasks. In W. Duch, J. Kacprzyk, E. Oja, & S. Zadrozny (Eds.), Proceedings of the International Conference on Artificial Neural Networks (pp. 547–552). Berlin: Springer.

Grüning, A. (2006). Stack- and queue-like dynamics in recurrent neural networks. Connection Science, 18(1), 23–42.

Hammer, B., & Tiňo, P. (2003). Recurrent neural networks with small weights implement definite memory machines. Neural Computation, 15(8), 1897–1929.

Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9, 1735–1780.

Hopcroft, J. E., & Ullmann, J. D. (1979). Introduction to automata theory, languages, and computation. Reading, MA: Addison-Wesley.

Jackendoff, R. (2002). Foundations of language: Brain, meaning, grammar, evolution. New York: Oxford University Press.

Jacobsson, H. (2006). The crystallizing substochastic sequential machine extractor—CrySSMEx. Neural Computation, 18(9), 2211–2255.

Kitchens, B. P. (1998). Symbolic dynamics. Berlin: Springer.

Kuan, C.-M., Hornik, K., & White, H. (1994). A convergence result for learning in recurrent neural networks. Neural Computation, 6, 420–440.

Lind, D., & Marcus, B. (1995). An introduction to symbolic dynamics and coding. Cambridge: Cambridge University Press.

Moore, C. (1998). Dynamical recognizers: Real-time language recognition by analog computers. Theoretical Computer Science, 201, 99–136.

Nowlan, S. J., & Sejnowski, T. J. (1995). A selection model for motion processing in area MT of primates. Journal of Neuroscience, 15(2), 1195–1214.

Reber, A. (1967). Implicit learning of artificial grammars. Journal of Verbal Learning and Verbal Behavior, 77, 855–863.

Rodriguez, P. (2001). Simple recurrent networks learn context-free and context-sensitive languages by counting. Neural Computation, 13, 2093–2118.

Roelfsema, P. R., & van Ooyen, A. (2005). Attention-gated reinforcement learning of internal representations for classification. Neural Computation, 17, 1–39.

Rowland, B. A., Maida, A. S., & Berkeley, I. S. N. (2006). Synaptic noise as a means of implementing weight-perturbation learning. Connection Science, 18(1), 69–79.
Schultz, W. (1998). Predictive reward signal of dopamine neurons. Journal of Neurophysiology, 80, 1–27.

Sutton, R. S., & Barto, A. G. (2002). Reinforcement learning: An introduction. Cambridge, MA: MIT Press.

Tiňo, P., Čerňanský, M., & Beňušková, L. (2004). Markovian architectural bias of recurrent neural networks. IEEE Transactions on Neural Networks, 15(1), 6–15.

Tiňo, P., & Dorffner, G. (2001). Predicting the future of discrete sequences from fractal representations of the past. Machine Learning, 45(2), 187–217.

Tremblay, L., & Schultz, W. (1999). Relative reward preferences in primate orbitofrontal cortex. Nature, 398, 704–708.

Usher, M., & McClelland, J. L. (2001). On the time course of perceptual choice: The leaky competing accumulator model. Psychological Review, 108, 550–592.

van der Velde, F., & de Kamps, M. (2006). Neural blackboard architectures of combinatorial structures in cognition. Behavioral and Brain Sciences, 29(1), 1–72.

Williams, R. J., & Peng, J. (1990). An efficient gradient-based algorithm for on-line training of recurrent network trajectories. Neural Computation, 2, 490–501.

Williams, R. J., & Zipser, D. (1989). A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1, 270–280.

Wörgötter, F., & Porr, B. (2005). Temporal sequence learning, prediction, and control—a review of different models and their relation to biological mechanisms. Neural Computation, 17(2), 245–319.
Received February 17, 2006; accepted January 30, 2007.
ARTICLE
Communicated by Tatyana Sharpee
Single Neuron Computation: From Dynamical System to Feature Detector Sungho Hong [email protected] Department of Physiology and Biophysics, University of Washington, Seattle, WA 98195, U.S.A.
Blaise Agüera y Arcas [email protected] Program in Applied and Computational Mathematics, Princeton University, Princeton, NJ 08544
Adrienne L. Fairhall [email protected] Department of Physiology and Biophysics, University of Washington, Seattle, WA 98195, U.S.A.
White noise methods are a powerful tool for characterizing the computation performed by neural systems. These methods allow one to identify the feature or features that a neural system extracts from a complex input and to determine how these features are combined to drive the system’s spiking response. These methods have also been applied to characterize the input-output relations of single neurons driven by synaptic inputs, simulated by direct current injection. To interpret the results of white noise analysis of single neurons, we would like to understand how the obtained feature space of a single neuron maps onto the biophysical properties of the membrane, in particular, the dynamics of ion channels. Here, through analysis of a simple dynamical model neuron, we draw explicit connections between the output of a white noise analysis and the underlying dynamical system. We find that under certain assumptions, the form of the relevant features is well defined by the parameters of the dynamical system. Further, we show that under some conditions, the feature space is spanned by the spike-triggered average and its successive order time derivatives. 1 Introduction A primary goal of sensory neurophysiology is to understand how stimuli from the natural environment are encoded in the spiking output of Neural Computation 19, 3133–3172 (2007)
C 2007 Massachusetts Institute of Technology
neurons. A useful tool for performing this characterization is white noise analysis (Marmarelis & Marmarelis, 1974; Wiener, 1958; Rieke, Warland, Bialek, & de Ruyter van Steveninck, 1997), whereby a system is stimulated with a randomly varying broadband signal and the relevant features driving the system are determined by correlating output spikes with the signal. Extensions of this analysis to second order have been used to find multiple stimulus features to which the system is sensitive (de Ruyter van Steveninck & Bialek, 1988; Brenner, Bialek, & de Ruyter van Steveninck, 2000; Bialek & de Ruyter van Steveninck, 2005). Recently, these methods have been applied to characterize the computation performed by a single neuron on its current inputs (Agüera y Arcas, Fairhall, & Bialek, 2003; Agüera y Arcas & Fairhall, 2003; Slee, Higgs, Fairhall, & Spain, 2005).

Due to detailed knowledge of the dynamics of ion channels, it is possible to build dynamical models of current flow in neurons that accurately reproduce the voltage response of single neurons to current (or conductance) inputs. White noise analysis provides us with the capacity to reduce this complex set of dynamical equations to a simple functional model with concrete components: linear feature selection followed by a nonlinear decision function generating spikes. This procedure is compelling, as it provides a clear intuition about the form of the changes in the stimulus to which the system is sensitive and how the system combines these features in the decision to fire.

Second-order white noise methods have provided insight into coding properties at the systems level, in visual (Brenner et al., 2000; Fairhall et al., 2006; Touryan, Lau, & Dan, 2002; Rust, Schwartz, Movshon, & Simoncelli, 2005; Horwitz, Chichilnisky, & Albright, 2005) and somatosensory (Petersen, 2003; Maravall, Peterson, Fairhall, Arabzadeh, & Diamond, 2007) cortex. More recently, these methods have also been applied to characterize neural coding in single auditory (Slee et al., 2005), central (Powers, Dai, Bell, Percival, & Binder, 2005), and model (Agüera y Arcas et al., 2003; Agüera y Arcas & Fairhall, 2003) neurons. However, mostly missing from these analyses is a clear link between the obtained stimulus features and the biophysical properties of the circuit or neuron. This issue is particularly well posed for single neurons under stimulation by direct somatic current injection, where the biophysical parameters of the soma must dominate the neuron's computation. Thus, given the power of white noise methods, the questions that we wish to address here are: How does the dynamical system governing neural behavior map to the neuron's functional characterization derived using white noise analysis? How are the features determined by the neuron's biophysical properties?
1.1 Minimal Spiking Models. Neural dynamics are typically described by conductance-based models that describe the temporal evolution of the
voltage V due to an input current I and the variable ionic conductances of ion channels:

$$C\, dV/dt = -\sum_i g_i(V)(V - E_i) + I, \qquad (1.1)$$

where C is the membrane capacitance. Each conductance type i has a reversal potential $E_i$, where the conductances $g_i$ corresponding to different ion channels may be voltage dependent through the dynamics of the corresponding activation and inactivation gating variables $\xi_i$ and $\phi_i$:

$$g_i = \bar g_i\, \xi_i^{p_i} \phi_i^{q_i}, \qquad (1.2)$$

$$d\xi_i/dt = f_{\xi_i}(\xi, V), \qquad (1.3)$$

$$d\phi_i/dt = f_{\phi_i}(\phi, V), \qquad (1.4)$$
where $\bar g_i$ are constants and the functions $f_\xi$ and $f_\phi$ are affine in $\xi$ and $\phi$, respectively, but not in V. The exponents $p_i$ and $q_i$ are usually taken to be integers and describe the cooperativity of molecules required to gate the ion channel. This set of equations is highly nonlinear due to this cooperativity and the voltage dependence of the conductances.

While ultimately we would like to understand systems of many dynamical variables, we will begin by analyzing a simplified spiking system: the FitzHugh-Nagumo (FN) neural model (FitzHugh, 1961; Nagumo, Arimoto, & Yoshikawa, 1962). The FitzHugh-Nagumo system is cubic in V, the lowest-order nonlinearity supporting spike-like behavior, and is defined by

$$\psi\, dV/dt = V(1-V)(a+V) - W + c + I, \qquad (1.5)$$

$$dW/dt = V - bW, \qquad (1.6)$$
where V is voltage, W is a generic inactivation variable with linear recovery dynamics, I is the input current, and a, b, c, and $\psi$ are parameters, with $\psi \ll 1$. The nullclines of the system are the curves along which dV/dt = 0 (the V nullcline, a cubic) and dW/dt = 0 (the W nullcline, a straight line).

Figure 1a shows the nullclines for zero input I on the phase plane. Trajectories have been traced forward and backward in time through a randomly chosen set of initial conditions. The unique intersection of the nullclines is the stable fixed point. Elsewhere in the plane, trajectories always spiral counterclockwise, but qualitatively there are two kinds of orbits, ultimately both converging to the stable fixed point. Subthreshold orbits make only small excursions; however, for trajectories starting with either low inactivation (roughly W < −0.45) or high voltage (for example, V > 0), the system makes a large excursion around the right branch of the V nullcline, corresponding to a spike, before settling back to the fixed point. Although
Figure 1: (a) Phase plane of the FitzHugh-Nagumo neuron model with trajectories and threshold (dashed line). Parameters are a = 0.1, b = 0.5, c = −0.4, and ψ = 0.015. (b) A single extended trajectory on the FN phase plane driven by white noise input, showing multiple spikes. The threshold (dashed line) still makes an approximate boundary of the spiking and nonspiking portions of the trajectory (gray).
Although we will generally be considering a time-varying input I(t), we will make heavy use of the I = 0 phase plane; one can regard this as representing the instantaneous flow if the input current is switched off. For I = 0, one can define a separation of spiking initial conditions from subthreshold conditions, marked in light versus dark gray in Figure 1a. This separation of behavior can be thought of as a threshold. Typically the threshold for spiking is taken to be a threshold in voltage only, but this phase-plane picture demonstrates that one should really consider the threshold to depend on both V and the hidden inactivation variable(s) (Izhikevich, 2006). One can approximate this "dynamical" threshold by considering a minimal spiking trajectory. There is no absolute distinction between spiking and subspiking trajectories in type II neurons (Gerstner & Kistler, 2002) such as Hodgkin-Huxley and FitzHugh-Nagumo, so this choice is necessarily somewhat arbitrary. Starting with a point on the phase plane at the end of the spike plateau, one can eliminate time from the dynamical equations, solving parametrically for the curve passing through this point. Here, we obtained one such curve by numerically solving the equations from the local maximum of the cubic nullcline, given by $\frac{d}{dV}\left[ V(1 - V)(a + V) \right] = 0$. One of the issues we explore in this article is the relationship between the geometry of the threshold in the multidimensional space of the dynamical variables and the measurements one makes in white noise analysis.

1.2 White Noise Analysis. Our approach is based on the functional perturbation expansion method, which we briefly review here. The simplest such expansion is known as the Volterra series (Volterra, 1930; Marmarelis,
& Marmarelis, 1974; Marmarelis, 2004), where the functional y(t) = y[I(t)] is expanded as

$$y(t) = h_0 + \int dt_1\, h_1(t_1) I(t - t_1) + \frac{1}{2!} \int dt_1 dt_2\, h_2(t_1, t_2) I(t - t_1) I(t - t_2) + \cdots, \tag{1.7}$$
where the kernels h_i are called Volterra kernels. One problem with Volterra analysis is that the successive order kernels are not independent. For example, h_2 contains a constant component $\sim \int dt_1\, h_2(t_1, t_1)$. Wiener analysis (Wiener, 1958; Rieke et al., 1997) avoids this issue. In this framework, I(t) is assumed to be zero-mean gaussian white noise satisfying

$$\langle I(t) I(t') \rangle = \sigma^2 \delta(t - t').$$
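In discrete time, this delta-correlated process can be approximated by drawing each bin independently with variance σ²/Δt. A minimal sketch (the numerical values are arbitrary):

```python
import numpy as np

dt, sigma = 0.1, 1.5                    # time step and noise amplitude (assumed)
rng = np.random.default_rng(1)

# Each bin is drawn independently with variance sigma^2/dt, so that the binned
# autocorrelation approximates sigma^2 * delta(t - t') in the continuum limit.
I = rng.normal(0.0, sigma / np.sqrt(dt), size=100_000)

corr = np.array([np.mean(I[k:] * I[:len(I) - k]) for k in range(3)]) * dt
print(corr)                             # ~ [sigma^2, 0, 0]
```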
Under Wiener analysis, y[I(t)] is expanded as

$$y(t) = g_0 + \int dt_1\, g_1(t_1) I(t - t_1) + \frac{1}{2!} \left[ \int dt_1 dt_2\, g_2(t_1, t_2) I(t - t_1) I(t - t_2) - \sigma^2 \int dt'\, g_2(t', t') \right] + \cdots. \tag{1.8}$$
In the Wiener expansion, each kernel is independent of the others, since at each order one subtracts out the lower-order components. Therefore, the kernels g_i can be easily obtained by measuring the correlation functions between y(t) and I(t), for example,

$$\langle y(t) \rangle = g_0 + \int dt_1\, g_1(t_1)\langle I(t - t_1)\rangle + \frac{1}{2!}\left[\int dt_1 dt_2\, g_2(t_1, t_2)\langle I(t - t_1) I(t - t_2)\rangle - \sigma^2 \int dt'\, g_2(t', t')\right] + \cdots = g_0,$$
$$g_1(t_1) = \langle y(t) I(t - t_1)\rangle / \sigma^2,$$
$$g_2(t_1, t_2) = \langle y(t) I(t - t_1) I(t - t_2)\rangle / \sigma^4,$$

and so on.
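As a sketch of how these correlation formulas are used in practice, the following code recovers an assumed first-order kernel of a toy functional from input-output correlations; the kernel shape and constants are illustrative, not drawn from any model in the text:

```python
import numpy as np

rng = np.random.default_rng(2)
dt, n, lags, sigma = 0.1, 400_000, 40, 1.0
I = rng.normal(0.0, sigma / np.sqrt(dt), size=n)

# A toy functional with a known first-order kernel plus a weak quadratic term.
taus = np.arange(lags) * dt
h1 = np.exp(-taus / 1.5)                              # assumed kernel
u = np.convolve(I, h1, mode="full")[:n] * dt
y = 0.5 + u + 0.1 * u**2

# First-order Wiener kernel by cross-correlation: g1(t1) = <y(t) I(t-t1)>/sigma^2.
g0 = y.mean()
g1 = np.array([np.mean(y[k:] * I[:n - k]) for k in range(lags)]) / sigma**2
print(g0, np.max(np.abs(g1 - h1)))                    # g1 recovers h1; g2 is analogous
```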
The correlation method can be used to determine the relationship between Wiener and Volterra kernels. From equation 1.8,

$$g_0 = h_0 + \int dt_1\, h_1(t_1)\langle I(t - t_1)\rangle + \frac{1}{2!}\int dt_1 dt_2\, h_2(t_1, t_2)\langle I(t - t_1) I(t - t_2)\rangle + \cdots = h_0 + \frac{\sigma^2}{2!}\int dt'\, h_2(t', t') + \cdots,$$
$$g_1(t) = h_1(t) + \frac{\sigma^2}{2!}\int dt'\, h_3(t', t', t) + \cdots,$$
and so on. Wiener analysis has been successfully used to determine the relevant feature space for neural systems by reverse correlation of the spiking output with white noise input. For this application, the output is taken to be the sequence of spike times, $\rho(t) = \sum_i \delta(t - t_i)$. Let us define the stimulus s as a linear transformation of the current input I(t),

$$s_i = \int dt'\, f_i(t') I(t - t'), \tag{1.9}$$
where the f_i are some set of independent linear filters. The goal is to determine how a single spike is generated by I, reduced to a vector s comprising projections along the relevant dimensions f_i. We need to compute the probability of spiking as a function of the input, P(spike|s),

$$P(\mathrm{spike}|\mathbf{s}) = P(\mathrm{spike})\, G(\mathbf{s}), \tag{1.10}$$

where G(s) is the input-output function relating ρ(t) to the time-varying stimulus s(t), and we have separated out the scale factor P(spike), proportional to the mean firing rate.
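In practice, once a filter has been estimated, G along that dimension can be measured as a ratio of histograms of the filtered stimulus values, conditioned and unconditioned on a spike, as formalized in equation 1.11 below. A minimal sketch, with an assumed sigmoidal spike probability standing in for real data:

```python
import numpy as np

def nonlinearity(s_all, s_spike, n_bins=30):
    """Estimate G(s) = P(s|spike)/P(s) by a ratio of histograms."""
    edges = np.linspace(s_all.min(), s_all.max(), n_bins + 1)
    p_prior, _ = np.histogram(s_all, bins=edges, density=True)
    p_spike, _ = np.histogram(s_spike, bins=edges, density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])
    with np.errstate(divide="ignore", invalid="ignore"):
        G = np.where(p_prior > 0, p_spike / p_prior, np.nan)
    return centers, G

# Toy demo: spikes preferentially occur at large projection values s.
rng = np.random.default_rng(3)
s_all = rng.standard_normal(100_000)
s_spike = s_all[rng.random(s_all.size) < 1 / (1 + np.exp(-4 * (s_all - 1)))]
centers, G = nonlinearity(s_all, s_spike)
```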
In principle, since the equations of motion are deterministic, P(spike|s) is a Boolean function. However, as we will be approximating s by a finite-dimensional truncation and will be solving for P itself in successive orders, the result will no longer be Boolean. We can compute P(spike|s) using Bayes' theorem (Rieke et al., 1997):

$$G(\mathbf{s}) = \frac{P(\mathrm{spike}|\mathbf{s})}{P(\mathrm{spike})} = \frac{P(\mathbf{s}|\mathrm{spike})}{P(\mathbf{s})}. \tag{1.11}$$
This shows that the Wiener kernels of G(s) are the moments of P(s|spike), which can be easily measured. In this article, we focus on the first two moments of this distribution.1 The first moment is the spike-triggered average (STA), s̄(t),

$$\bar{s}(t) = \langle I(t_i - t) \rangle_i = \int d\mathbf{s}\, \mathbf{s}\, P(\mathbf{s}|\mathrm{spike}) = \int d\mathbf{s}\, \mathbf{s}\, P(\mathbf{s})\, G(\mathbf{s}), \tag{1.12}$$
where the average ⟨·⟩_i over the spike occurrence times {t_i} is replaced with an average over the spike-triggering ensemble. When the neuron has multiple characteristic features, the STA may contain little information about them since it points in only a single direction, which is a certain linear combination of relevant features. The second-order kernel, however, can give such information. This kernel is related to the spike-triggered covariance (STC) matrix C, which we will define as

$$C(t, t') = \langle [I(t_i - t) - \bar{s}(t)]\,[I(t_i - t') - \bar{s}(t')] \rangle_i - \langle I(\tau - t) I(\tau - t') \rangle_\tau$$
$$= \sigma^2 g_0\, \delta(t, t') + \sigma^4 g_2(t, t') - \bar{s}(t)\bar{s}(t') - \sigma^2 \delta(t, t') \tag{1.13}$$
$$= \sigma^4 g_2(t, t') - \bar{s}(t)\bar{s}(t'),$$

where g_2(t, t') is the second-order Wiener kernel of G(s). When a functional F[s] depends on gaussian noise s, $\langle s_i F[\mathbf{s}] \rangle = \sum_j \langle s_i s_j \rangle \langle \partial_j F[\mathbf{s}] \rangle$. Thus, we can expand g_2 (Bialek & de Ruyter van Steveninck, 2005),

$$g_2(t, t') = \sum_{i,j} \left\langle \frac{\partial^2 G}{\partial s_i \partial s_j} \right\rangle f_i(t) f_j(t'), \tag{1.14}$$

where equation 1.9 introduces the linear filters that define the components of s. Therefore, unless the modified Hessian $\hat{H}_{ij} = \langle \frac{\partial^2 G}{\partial s_i \partial s_j} \rangle - \langle \frac{\partial G}{\partial s_i} \rangle \langle \frac{\partial G}{\partial s_j} \rangle$ has null directions,2 the STC is spanned by the relevant feature space {f_i}. Diagonalizing the STC and extracting the eigenvectors with the largest absolute eigenvalue will determine the leading dimensions spanning the feature space (Brenner et al., 2000; Agüera y Arcas et al., 2003; Bialek & de Ruyter van Steveninck, 2005). Eigenmodes appearing with negative eigenvalue correspond to stimulus directions in which the variance of the
1. Note that the zeroth-order kernel is simply g_0 = 1.
2. Since $\bar{s}(t)\bar{s}(t')$ is only rank one, there are only two possibilities leading to null directions. The first is when $\partial_i \partial_j G$ has null directions, in which case it is excluded (for our purposes) by definition. The second case is when $\partial_i \partial_j G$ has an eigenvector $\bar{s}(t)$ with an eigenvalue $\|\bar{s}(t)\|^2$, which does not happen generically.
spike-conditional distribution is reduced relative to the stimulus prior distribution; eigenmodes with positive eigenvalues are directions with increased variance. Such cases may arise if the spike-conditional distribution is bimodal in some direction or if it forms a ring around the origin; phase-invariant (complex) cells are an example (Simoncelli, Pillow, Paninski, & Schwartz, 2004; Rust et al., 2005; Fairhall et al., 2006; Petersen, 2003; Maravall et al., 2007). In this article, we work with a multidimensional space of dynamical variables, as in equations 1.1 to 1.6, for which we obtain the Volterra series. When we denote the dynamical variables by y_i, the Wiener kernels for G(y_i) can be expanded as

$$g_1[G] = \sum_i \left\langle \frac{\partial G}{\partial y_i} \right\rangle g_1[y_i] + \cdots,$$
$$g_2[G] = \sum_i \left\langle \frac{\partial G}{\partial y_i} \right\rangle g_2[y_i] + \sum_{i,j} \left\langle \frac{\partial^2 G}{\partial y_i \partial y_j} \right\rangle g_1[y_i]\, g_1[y_j] + \cdots,$$
where g_n[y_i] is an nth-order Wiener kernel of y_i and the dots represent terms of higher than quadratic order. Here we note that g_n[y_i] is generated by the infinite series of Volterra kernels. Moreover, we will see later that the Volterra kernels at each order can be constructed from the set of first-order kernels.

2 Linear Systems

In section 1, we discussed the application of white noise methods to experimental data where the output of the system is measured as the firing times of spikes. With access to the entire dynamical system, one could attempt to model the relationship between the input I(t) and the output voltage V(t), including both subthreshold and spiking behavior. As we have discussed, neurons are highly nonlinear. Perturbative expansions such as Wiener and Volterra series generally are unable to capture the nonlinearities of spike generation with few terms. However, a very successful and highly influential approach to the modeling of neural systems has been to separate a linear filtering stage from an explicit nonlinearity capturing the very sharp voltage increase of the spike (Victor & Shapley, 1980; Gerstner & Kistler, 2002; Kistler, Gerstner, & van Hemmen, 1997; Berry & Meister, 1999). This is known as a cascade model, and in various forms it is the theoretical basis for many of the simple neuron models in use, including the integrate-and-fire neuron (Gerstner & Kistler, 2002; Agüera y Arcas & Fairhall, 2003), the spike response model (Gerstner & Kistler, 2002), generalized integrate-and-fire models (Paninski, Pillow, & Simoncelli, 2004), and cascade models of retinal ganglion cells (Victor & Shapley, 1979, 1980; Keat, Reinagel, Reid, & Meister, 2001; Pillow, Paninski, Uzzell, Simoncelli, & Chichilnisky, 2005).
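A minimal cascade (filter-and-fire) model of the kind just described can be sketched in a few lines; the filter shape, threshold, and parameters below are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(4)
dt, n = 0.1, 100_000
I = rng.normal(0.0, 1.0 / np.sqrt(dt), size=n)

# Cascade model: one linear filter followed by a hard threshold, with a spike
# emitted only on the upward crossing of the filter output (filter-and-fire).
taus = np.arange(0, 20, dt)
f = np.exp(-taus / 5.0) * (taus / 5.0)        # assumed filter (alpha function)
u = np.convolve(I, f, mode="full")[:n] * dt   # filtered stimulus
theta = 1.0                                    # threshold on the filter output
crossings = np.flatnonzero((u[1:] > theta) & (u[:-1] <= theta)) + 1
print(f"{crossings.size} spikes from the cascade model")
```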
The success of these models suggests that a breakdown of the dynamical system into its linear component and a threshold is a useful approximation to examine. In our analysis of the dynamical system, we will take the approach of experimental work, where the output is reduced to a sequence of spike times. Our goal here is to derive the form of the multidimensional linear stage that arises directly from the equations of motion and examine the consequences of thresholds of various forms. We begin by walking through some simple linearized systems analysis and clarify the connection with the cascade model picture. Rewriting equations 1.1 to 1.4 in the following simple form,

$$\partial_t y_k = f_k(y_1, y_2, \ldots, y_n) + I(t)\,\delta_{k0}, \qquad k = 0, 1, \ldots, n - 1, \tag{2.1}$$
we assume that they have a fixed-point solution $y_k(t) = y_k^{(0)}$ when I(t) = 0. Now the linear approximation of the system around the fixed point is given by3

$$D_{km}\, y_m^{(1)} = I(t)\,\delta_{k0}, \tag{2.2}$$
where

$$D_{km} = \delta_{km}\,\partial_t - J_{km}, \qquad J_{km} = \left. \frac{\partial f_k}{\partial y_m} \right|_{y^{(0)}}. \tag{2.3}$$
The linear filters are the kernels of equation 2.3, $\Gamma^{(1)} = D^{-1}$. They satisfy

$$D_{km}\, \Gamma^{(1)}_{ml}(t) = \delta_{kl}\,\delta(t), \tag{2.4}$$

giving the linear filter form for the system's evolution:

$$y_k^{(1)}(t) = \int_{-\infty}^{t} dt'\, \Gamma_k^{(1)}(t - t')\, I(t'), \tag{2.5}$$
where $\Gamma_k^{(1)} = \Gamma_{k0}^{(1)}$. Equation 2.4 can be solved by diagonalizing the Jacobian matrix $J_{km}$. Let us consider the following diagonalization with eigenvalues λ_α:

$$\lambda_\alpha\, \delta_{\alpha\beta} = U_{\alpha k}\, J_{km}\, U^{-1}_{m\beta}. \tag{2.6}$$

3. From here on, we use the Einstein summation convention, for example, $C_{km} d_m \equiv \sum_m C_{km} d_m$.
Here we employ the notation that indices in the diagonalized basis are Greek letters. Therefore, equation 2.4 can be written in the new basis as

$$D_\alpha\, \tilde{\Gamma}_\alpha^{(1)} = U_{\alpha 0}\,\delta(t), \qquad D_\alpha = \partial_t - \lambda_\alpha, \tag{2.7}$$

where the first-order kernels $\tilde{\Gamma}^{(1)}$ in this basis are $\tilde{\Gamma}_\alpha^{(1)}(t) = e^{\lambda_\alpha t} H(t)$, where H(t) is the Heaviside step function and arises from causality. The eigenvalues λ_α can be real or complex. The sign of the real part of the eigenvalues is determined by the stability properties of the fixed point at $y^{(0)}$. For a neural system, this fixed point should be attracting, so all eigenvalues have negative real part, and thus eigenvectors decay as t → ∞. This underscores that the linear filters can take only a limited number of forms: decaying exponentials or damped oscillations. We note the following facts. First, in general, an n-dimensional system will have n independent first-order kernels, except when the Jacobian has a degenerate spectrum. Further, the linear kernels can be generated from a "master kernel,"

$$\varphi(t) = \sum_\alpha e^{\lambda_\alpha t} H(t),$$
and its time derivatives up to nth order. To show this, consider the derivatives of the master kernel,

$$\partial_t^k \varphi(t) = \sum_\alpha \lambda_\alpha^k\, e^{\lambda_\alpha t} = T_{k\alpha}\, e^{\lambda_\alpha t}, \qquad T_{k\alpha} = \lambda_\alpha^k, \qquad (k, \alpha = 0, \ldots, n - 1),$$

up to additive singular terms coming from the derivatives of H(t). $T_{k\alpha}$ is a Wronskian matrix of exponentials at t = 0. The determinant of $T_{k\alpha}$ is the Vandermonde determinant $\det(T_{k\alpha}) = \prod_{\alpha > \beta} (\lambda_\alpha - \lambda_\beta)$, which cannot vanish by definition (Gradshteyn & Ryzhik, 2000). This means that the master kernel and its derivatives can be transformed to the $e^{\lambda_\alpha t}$'s in a nonsingular way. Therefore, we conclude that all first-order kernels can be expressed in terms of a master kernel and its time derivatives. Frequently the first-order kernel is nonzero at t = 0, so that the time derivative of the filter also includes a delta function at t = 0. We will see the appearance of this singular component in the white noise analysis.
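These observations are straightforward to check numerically. The sketch below linearizes the FN system of equations 1.5 and 1.6 around its fixed point, using the parameter values quoted in Figure 1, and builds the first-order kernels as sums of exponentials; the variable names and implementation details are ours:

```python
import numpy as np

a, b, c, psi = 0.1, 0.5, -0.4, 0.015        # FN parameters from Figure 1

# Fixed point: V(1-V)(a+V) - W + c = 0 with W = V/b (the W nullcline),
# giving a cubic in V.
V0 = next(r.real for r in np.roots([-1.0, 1.0 - a, a - 1.0 / b, c])
          if abs(r.imag) < 1e-9)
dcubic = -3.0 * V0**2 + 2.0 * (1.0 - a) * V0 + a   # d/dV [V(1-V)(a+V)] at V0

# Jacobian of (dV/dt, dW/dt) at the fixed point; current enters dV/dt only.
J = np.array([[dcubic / psi, -1.0 / psi],
              [1.0,          -b        ]])
lam, U = np.linalg.eig(J)
print("eigenvalues:", lam)                  # negative real parts: stable fixed point

# First-order kernels Gamma^{(1)}(t) = e^{Jt} B H(t), expanded over eigenmodes
# as in equation 2.8, with B the input direction.
B = np.array([1.0 / psi, 0.0])
w = np.linalg.solve(U, B)
t = np.arange(0.0, 3.0, 0.001)
kernels = np.real((np.exp(np.outer(t, lam)) * w) @ U.T)  # columns: Gamma_V, Gamma_W
```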
To recover the filters corresponding to the dynamical variables, one must invert the diagonalizing operation, equation 2.6, to obtain linear combinations of the eigenmodes:

$$\Gamma_i^{(1)}(t) = U^{-1}_{i\alpha}\, \tilde{\Gamma}_\alpha^{(1)}(t) = \sum_\alpha c_\alpha\, e^{\lambda_\alpha t} H(t), \tag{2.8}$$
where the coefficients c_α derive from the components of the Jacobian. Thus, the filters of the linear system are sums of exponentials, either purely real or with an imaginary component. To the extent that subthreshold dynamics are well approximated by the linearized system, this shows clearly why one might expect to find both integrate-and-fire-like neurons, with filters having purely real associated eigenvalues, and oscillate-and-fire neurons (Izhikevich, 2001; Hutcheon & Yarom, 2000) with a corresponding eigenvalue having a nonzero imaginary part. In terms of the linear system, the possible set of filters or feature space of the system is dual with the dynamical variables that define the system. The subset of features that are relevant to a spike occurring are those that contribute to crossing the internal state over "threshold." The properties outlined above have an interesting implication for white noise analysis in the case that the system is well described by subthreshold linearity. As we will see in the next section, the STA is typically a linear combination over the relevant features and their derivatives. If the components of the STA belong to the true feature space, as they typically will, one can derive the basis for the feature space from the STA and its successive order time derivatives, as has been empirically observed (Rieke, 1991).

2.1 Higher-Order Series Terms. The higher-order approximations to the system can also be expressed in terms of first-order kernels. For example, the second-order approximation is given by the following equation:

$$D_{km}\, y_m^{(2)} = \frac{1}{2!}\, H_{kmn}\, y_m^{(1)} y_n^{(1)}, \qquad H_{kmn} = \left. \frac{\partial^2 f_k}{\partial y_m \partial y_n} \right|_{y^{(0)}}, \tag{2.9}$$
whose solution is

$$y_k^{(2)} = \frac{1}{2!} \int ds_1 ds_2\, \Gamma_k^{(2)}(t - s_1, t - s_2)\, I(s_1) I(s_2),$$
$$\Gamma_k^{(2)}(t_1, t_2) = \int dt'\, H_{lmn}\, \Gamma_{kl}^{(1)}(t')\, \Gamma_m^{(1)}(t_1 - t')\, \Gamma_n^{(1)}(t_2 - t'). \tag{2.10}$$
In this way, all of the higher-order kernels can be obtained by outer products of first-order kernels. The detailed computation is summarized in appendix A.
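As an illustration of the construction, the following sketch discretizes the time integral of equation 2.10 to assemble a second-order kernel from first-order kernels; the Hessian tensor and kernels here are toy values, not those of any model in the text:

```python
import numpy as np

dt, T = 0.05, 200
taus = np.arange(T) * dt

# Toy first-order kernels Gamma_{kl}^{(1)}(t) for a 2D system (diagonal,
# decaying), and a toy Hessian tensor H_{lmn}.
G1 = np.zeros((2, 2, T))
G1[0, 0] = np.exp(-taus)
G1[1, 1] = np.exp(-2.0 * taus)
H = np.zeros((2, 2, 2))
H[0, 0, 0] = 1.0

def gamma2(k, i1, i2):
    """Discretized equation 2.10: Gamma_k^{(2)}(t1, t2) as an integral over t'
    of H_lmn Gamma_kl(t') Gamma_m0(t1 - t') Gamma_n0(t2 - t')."""
    acc = 0.0
    for tp in range(min(i1, i2) + 1):          # causality: t' <= t1, t2
        for l in range(2):
            for m in range(2):
                for n in range(2):
                    acc += (H[l, m, n] * G1[k, l, tp]
                            * G1[m, 0, i1 - tp] * G1[n, 0, i2 - tp])
    return acc * dt

print(gamma2(0, 40, 60))
```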
2.2 White Noise Analysis of Threshold-Crossing Linear Neurons. In this section, we assume that the higher-order nonlinearity of the system is well captured by a threshold. Since voltage is generally the only observed variable, "the threshold" is often taken to be a threshold on the voltage. The imposition of a threshold on the voltage alone has been applied even to neural models such as FN and the Hodgkin-Huxley model system, where there is no difficulty in defining a threshold in V and the "hidden" dynamical variables of W (FN) or m, n, and h (HH). Here we consider the implications of considering the threshold to apply in multiple dimensions. Before we discuss the validity of the threshold approximation for an FN neuron, we illustrate the results from applying white noise analysis to models with a variety of threshold structures acting on subthreshold linear dynamics given by the linearized FN system. Previous work has treated simple one-dimensional linear and nonlinear models, in which a spike is defined when the output of a single linear filter f on a random current input crosses a threshold. One can show (DeWeese, 1995; M. Meister, private communication, November 2004) that covariance analysis on this model finds two modes: the filter f and its time derivative, f′. The derivative mode is a result of the criterion defining the spike only on the upward crossing of the threshold:

$$\frac{d}{dt} \int^t d\tau\, f(t - \tau) I(\tau) > 0 \;\rightarrow\; \int^t d\tau\, f'(t - \tau) I(\tau) > 0. \tag{2.11}$$

Thus, the variance of projections onto f′ is also reduced with respect to the prior. Covariance analysis has also been performed (Fairhall et al., 2006) on more realistic models in which an exponentially decaying afterhyperpolarization (AHP) is added to the voltage following a spike (Gerstner & Kistler, 2002; Keat et al., 2001). This makes the threshold-crossing model more realistic, but the manipulation does not alter the (f, f′) eigenmode structure unless the timescale of the AHP is short enough that spiking occurs many times during the superthreshold fluctuation, which destroys the correlation with positive f′. Here we assume that the system evolves according to linear subthreshold dynamics. We treat five cases (a simulation sketch follows the list). In the first three, the threshold is linear, but in different components of the 2D space; in the last two, the threshold is a nontrivial two-dimensional structure. The thresholds used are:

1. A traditional threshold in V only
2. A threshold in W only
3. A threshold on a linear combination of V and W
4. The union of two piecewise linear segments in V and W, that is, a logical "or" on 1 and 2
5. A curved threshold on a smooth, nonlinear function of V and W
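A simulation of such threshold-crossing models, with linear subthreshold dynamics and a choice of threshold functions, can be sketched as follows; the Jacobian and threshold parameters are toy values, not those of the linearized FN system:

```python
import numpy as np

rng = np.random.default_rng(5)
dt, n = 0.01, 200_000
J = np.array([[-3.0, -5.0],                 # toy Jacobian standing in for a
              [1.0, -0.5]])                 # linearized 2D neuron model
I = rng.normal(0.0, 3.0 / np.sqrt(dt), size=n)

traj = np.empty((n, 2))
z = np.zeros(2)
for t in range(n):                          # Euler integration of dz/dt = Jz + (I, 0)
    z = z + dt * (J @ z + np.array([I[t], 0.0]))
    traj[t] = z

def upward_crossings(g):
    """Spike times where g(V, W) crosses zero from below (cf. equation 2.11)."""
    s = g(traj[:, 0], traj[:, 1])
    return np.flatnonzero((s[1:] > 0) & (s[:-1] <= 0)) + 1

spikes_V      = upward_crossings(lambda V, W: V - 0.8)               # case 1: V only
spikes_W      = upward_crossings(lambda V, W: W - 0.8)               # case 2: W only
spikes_linear = upward_crossings(lambda V, W: V + 0.5 * W - 0.8)     # case 3: linear
spikes_curved = upward_crossings(lambda V, W: V + 0.3 * W**2 - 0.8)  # case 5: curved
print(len(spikes_V), len(spikes_W), len(spikes_linear), len(spikes_curved))
```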
From the discussion in section 2, the linear space in which the n-dimensional system operates is defined by the n filter inversions of the dynamical system (see equation 2.5). The effect of the threshold is to select as relevant from those n dimensions the subspace defined by the filter directions that have a nonzero projection onto the threshold. Thus, we would predict that for the linear threshold, we will find two relevant modes: the primary direction f, which is normal to the direction of the threshold, and f′, its time derivative. For a threshold in a single dimension, V or W only, f should recover the V or W filter, respectively, and the corresponding time derivative. For a threshold that is a linear combination of V and W, f should be a linear combination of the V and W filters. The distribution of trajectories is simply a linear transformation of the original gaussian stimulus distribution. Any linear threshold in this plane will produce negative eigenmodes, as the set of points selected by the threshold must have decreased variance with respect to this prior. When the threshold is not linear in V and W, we expect to find two primary filters, f_1 and f_2, which will be some linear combination of the V and W filters, and, in principle, the time derivatives. However, the number of modes will be less than or equal to n + 1 because, as we have shown, the time derivatives span the same space as the linear variables themselves. (Note that the +1 accounts for the singular component of the time derivative, δ(t), arising from discontinuity of the filters at t = 0.) Furthermore, a threshold that is not linear may produce a spike-conditional distribution with a direction in which the variance is actually increased with respect to the prior; such a direction will appear with a positive eigenvalue.

Figure 2b shows the thresholds that we applied to the FN first-order linear dynamics. The results of the covariance analysis are shown in Figures 2c and 2d. Figure 2c shows the covariance eigenvalue spectra for all five cases. The significant eigenvalues are empty circles beyond the error level denoted by the shaded box.4 In the cases with thresholds on V only, on W only, and linear in both V and W, only two modes are obtained. With 2D thresholds, cases 4 and 5, three eigenmodes appear. Figure 2d shows the corresponding modes. We see that the V threshold picks out $\Gamma_V^{(1)}$, the W threshold selects $\Gamma_W^{(1)}$, and the (V, W)-linear threshold leads to a linear combination of these filters. In the latter two cases, the eigenmodes are a linear combination of $\Gamma_V^{(1)}$, $\Gamma_W^{(1)}$, and $\partial_t \Gamma_V^{(1)}$, itself a linear combination of the V and W kernels and δ(t). Figure 2d also shows the projection of spike-triggered stimuli onto two of the distinguished eigenvectors. In the first three cases, the linearity of the threshold is manifested in the 2D plane as a single elliptic cluster, while the nonlinear threshold cases show richer structure. Recall that the filters are generated by linear combinations of a master filter and its time derivatives. This implies that our multidimensional threshold structure has a natural interpretation as a dynamical threshold, depending not only on the voltage V but also on $\partial_t V$, $\partial_t^2 V$, and so on (Azouz & Gray, 2000). We will show how well this approximation fits for the Hodgkin-Huxley model in the final section.

4. Significance of eigenvalues can be evaluated as follows. Let N be the number of spikes and d the dimension of each sampled stimulus. To estimate the finite size or dimension error, we choose N d-dimensional stimuli at random (Rust et al., 2005). In this case, the covariance matrix C in equation 1.13 is filled with d(d + 1)/2 random numbers drawn from a normal distribution $\mathcal{N}(0, 2\sigma^4/N)$ in the large-N limit. When d ≫ 1, the eigenvalues of this matrix follow Wigner's semicircle law, and their distribution is bounded by $\pm 2\sqrt{2}\,\sigma^2\sqrt{d/N}$, which is the error level (Mehta, 1991). When the stimulus has small correlation, the semicircle is not a good approximation of the eigenvalue distribution, but the change in the upper and lower bounds rarely exceeds an order of magnitude, which can be checked by numerical simulation.

2.3 Analysis of Threshold-Crossing Models in Multiple Dimensions. An advantage of the simple models introduced in the previous section is that they can be treated analytically. From equation 1.11, we have
$$G(\mathbf{s}) = \frac{P(\mathrm{spike}|\mathbf{s})}{P(\mathrm{spike})} = \frac{P(\mathrm{spike}|\mathbf{s})}{\int \mathcal{D}\mathbf{s}\, P(\mathrm{spike}|\mathbf{s})\, P(\mathbf{s})},$$

where $\int \mathcal{D}\mathbf{s} = \int ds_1 ds_2 \cdots$. Instead of directly computing G(s), we characterize it by a moment generating function:

$$W[\mathbf{j}] = \log \int \mathcal{D}\mathbf{s}\, P(\mathrm{spike}|\mathbf{s})\, P(\mathbf{s})\, e^{\mathbf{j} \cdot \mathbf{s}}. \tag{2.12}$$
We define an n-dimensional first-order system, given by dynamical variables y_k and the corresponding linear filters $\Gamma_k$ as in equation 2.2. We assume
Figure 2: (a) First-order kernels of the FN neuron model, drawn as filters, that is, in −t. (b) Various thresholds for the linear system. V (vertical), W (horizontal), linear (dashed), piecewise linear (dotted), and curved (solid curve) thresholds are used. (c) Spectra of covariance matrices for the thresholds. The shaded box represents the level of error from finite sample size and dimension. (d) STAs and covariance modes of the linear system with various thresholds. The nonsingular part of each mode is fitted using least squares to a linear combination of the (orthogonalized) first-order kernels as $v_{fit} = c_V \Gamma_V^{(1)} + c_W \Gamma_W^{(1)}$. The kernels are normalized, excluding the δ-function component. The coefficients (c_V, c_W) are displayed above each plot, and the gray line is the fitted function. For each case, we show the projection of spike-triggered current histories onto either the leading two negative modes, v1 and v2, or v1 and v+, depending on the threshold.
that the stimulus is a gaussian white noise current I(t) with zero mean and variance σ². Thus, as before,

$$y_k(t) = \int_0^\infty d\tau\, \Gamma_k(\tau)\, I(t - \tau).$$
We denote a random segment of the current stimulus I(τ), τ ≤ t, as an infinite-dimensional vector s.5 Any sample of y is therefore a functional of the random variable s, y[s]; for simplicity, we will write y. Spiking is determined by crossing a threshold θ(y) = 0 in the phase space from below. Then,

$$P(\mathrm{spike}|\mathbf{s}) = \delta(\theta(\mathbf{y}))\, H(\dot{\mathbf{y}} \cdot \nabla_{\mathbf{y}} \theta(\mathbf{y}))\, \dot{\mathbf{y}} \cdot \nabla_{\mathbf{y}} \theta(\mathbf{y}).$$

The Heaviside function H(·) ensures that spiking occurs only on a threshold crossing from below, and the weight factor $\dot{\mathbf{y}} \cdot \nabla_{\mathbf{y}} \theta(\mathbf{y})$ is a geometric factor accounting for the flux at the threshold. The filters $\Gamma_k$ are not necessarily normalized or orthogonal to each other, and it is convenient to define the orthonormal basis {f_µ}, which spans the same space as {$\Gamma_k$}. Then there will be a linear transformation $T_{\mu k}$, $f_\mu = T_{\mu k} \Gamma_k$, and so we define a new coordinate system: $z_\mu \equiv T_{\mu k}\, y_k$. As s is gaussian, the orthonormally transformed variable z is also gaussian with variance σ². Now we can separate the stimulus into two components: its projection into the subspace spanned by the {f_µ} and the orthogonal component. We denote the corresponding directions of j as $\mathbf{j} = \mathbf{j}_\parallel + \mathbf{j}_\perp$, where

$$j_{\parallel\mu}(t) = \int_0^\infty d\tau\, j(t - \tau)\, f_\mu(\tau).$$
5. In practice, any application to data requires discretization in time. We use the convention that a function f(t) is discretized as $\hat{f}_t = f(t)\sqrt{\Delta t}$, where Δt is the time step. This implies $\int f^2 dt \approx \sum_i f(i\Delta t)^2 \Delta t = \sum_i \hat{f}_i^2 = \hat{f} \cdot \hat{f}$, and thus the vector $\hat{f}$ obtained by discretizing f(t) has a vector norm corresponding to the $L^2$ norm.
The moment generating function can then be separated as $W[\mathbf{j}] = W_\parallel[\mathbf{j}_\parallel] + W_\perp[\mathbf{j}_\perp]$, where

$$W_\perp[\mathbf{j}_\perp] = \log \int \mathcal{D}\mathbf{s}_\perp\, e^{-\mathbf{s}_\perp^2/2\sigma^2 + \mathbf{j}_\perp \cdot \mathbf{s}_\perp} = \frac{\sigma^2}{2} \int dt\, j_\perp(t)^2 + \mathrm{const}, \tag{2.13}$$
where $\mathbf{s}_\perp$ are the components of s orthogonal to the plane spanned by z, and

$$W_\parallel[\mathbf{j}_\parallel] = \log \int d^n z\, \det(T^{-1})\, e^{-z^2/2\sigma^2}\, \delta(\theta(z))\, H(\dot{z} \cdot \nabla_z \theta(z))\, \dot{z} \cdot \nabla_z \theta(z)\, e^{\mathbf{j}_\parallel \cdot z}$$
$$= \log \int d^n z\, e^{-z^2/2\sigma^2}\, \delta(\theta(z))\, w(z)\, e^{\mathbf{j}_\parallel \cdot z} + \mathrm{const}, \tag{2.14}$$

with $w(z) = H((Rz) \cdot \nabla_z \theta(z))\,(Rz) \cdot \nabla_z \theta(z)$, where $R = TJT^{-1}$ is the Jacobian matrix of the linear system, equation 2.2, in the {f_µ} basis, and we use $\dot{z}_\mu = T_{\mu k}\, \dot{y}_k = T_{\mu k} J_{kl}\, y_l = (TJT^{-1})_{\mu\nu}\, z_\nu$. As previously discussed, the closure under time differentiation of the space spanned by the first-order kernels is assured by ż, as in equation A.5 in appendix A. Note that our separation of W[j] depends on this particular property. For a fixed spike time t, restricting ourselves to the region where τ < t, $W_\perp[\mathbf{j}]$ becomes a static integral. Thus, equation
2.14 shows clearly how the properties of this model emerge: P(spike|s) is given by a threshold with a weight function w(z), up to some linear transformations. Therefore, for example, the linear threshold case reduces to the one-dimensional case. Since $\nabla_z \theta(z)$ is a constant vector, every computation reduces to a one-dimensional integral in this direction, and the corresponding filter is a linear combination of the $f_\mu$'s, as we have observed in Figure 2d. Furthermore, a variety of structures may be generated depending on how trajectories cross the threshold. From equations 2.14 and 2.13, we can derive the analytic forms of the first- and second-order moments:

$$\mathrm{STA}(\tau) = \sum_\mu c_\mu\, f_\mu(\tau), \qquad c_\mu = \frac{1}{N}\int d^n z\, z_\mu\, e^{-z^2/2\sigma^2}\, w(z), \qquad N = \int d^n z\, e^{-z^2/2\sigma^2}\, w(z), \tag{2.15}$$
and

$$C(\tau, \tau') = \sum_{\mu,\nu}\left(c_{\mu\nu} - c_\mu c_\nu - \sigma^2\delta_{\mu\nu}\right) f_\mu(\tau)\, f_\nu(\tau'), \qquad c_{\mu\nu} = \frac{1}{N}\int d^n z\, z_\mu z_\nu\, e^{-z^2/2\sigma^2}\, w(z). \tag{2.16}$$
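Away from the analytically tractable cases, the moment integrals of equations 2.15 and 2.16 can be estimated by Monte Carlo: sample z from the gaussian prior, approximate δ(θ(z)) by a thin shell around the threshold, and weight by w(z). A sketch under assumed threshold and Jacobian values:

```python
import numpy as np

rng = np.random.default_rng(6)
sigma, n_samp = 1.0, 2_000_000

R = np.array([[-1.0, -2.0], [1.0, -0.5]])          # assumed Jacobian in the f basis
theta = lambda z: z[:, 0] - 1.2                    # linear threshold at z0 = 1.2
grad = np.array([1.0, 0.0])                        # constant gradient of theta

z = sigma * rng.standard_normal((n_samp, 2))
on_thr = np.abs(theta(z)) < 0.01                   # delta(theta(z)) as a thin shell
w = np.clip((z[on_thr] @ R.T) @ grad, 0.0, None)   # H(x) x with x = (Rz).grad theta

N = w.sum()
c_mu = (z[on_thr] * w[:, None]).sum(axis=0) / N    # equation 2.15
c_munu = (z[on_thr, :, None] * z[on_thr, None, :]
          * w[:, None, None]).sum(axis=0) / N
C = c_munu - np.outer(c_mu, c_mu) - sigma**2 * np.eye(2)   # STC, equation 2.16
print("STA coefficients:", c_mu)
```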
Note that we have assumed a prior based on the distribution of randomly driven trajectories of the linear system. This takes no account of perturbations of this distribution due to flux from the system's return from a spike. This assumption is equivalent to assuming that the last spike is in the distant past, so that memory of that perturbation has vanished. In this article, we consider only this "isolated spike" case. We return to this point in the discussion.

2.4 Variance Dependence. Through equations 2.15 and 2.16, this model captures an explicit dependence of the white noise outcome on the stimulus statistics, in particular, the variance σ². While the subthreshold dynamics are linear, the dependence on σ is nonlinear due to the threshold shape and the weight function w(z). We discuss two examples below.

We first consider a linear threshold. Through a suitable linear transformation, this case simply reduces to a filter-and-fire model with a single filter, say, f_0, and a fixed threshold $z_0 = \theta_f$. Now the distribution of threshold crossing points is constrained by $z_0 = \theta_f$ and $\dot{z}_0 > 0$. As we mentioned in the previous section, $\dot{z}_0$ lies in the originally defined feature space and is given by another single filter, which we denote by f_1. In other words, when the normalized $\dot{f}_0$ is denoted by $f_0'$, we can choose $f_1 = f_0'$ by a suitable orthogonal linear transformation. Thus, our system depends on two filters, which are depicted schematically in Figure 3a. The STA is the centroid of the distribution of threshold crossing points. Equation 2.15 reproduces a previously known result (M. Meister, private communication, November 2004; Fairhall et al., 2006),
σ π/2
f 1 (τ ).
(2.17)
With high-variance σ θ f , the STA is dominated by f 1 . In the feature space, with increasing variance, the threshold stays the same, but a larger portion of it is crossed by trajectories driven by the larger variance ensemble, as can be seen in Figure 3a. When the threshold is curved, the σ dependence is considerably more complicated. We will consider an extreme but analytically tractable version
Figure 3: (a) Diagram of a single filter model. The disc of radius σ represents the prior gaussian distribution. The dotted line is the model threshold imposed at $z_0 = \theta_f$. The thick line is a distribution of spike-triggered stimuli, and the white dot represents its average, the STA. (b) Diagram of a model whose threshold is the union of two lines, each imposed at a distance $\theta_{f,g}$ from the center. Two gray dots denote the STAs for each segment.
of this case to illustrate the point: let the curved threshold be approximated by two linear ones imposed at θ_f and θ_g in the f_0 and g_0 directions, respectively, as in Figure 3b. In this case, one segment of the threshold imposes a dependence on the filter f_0 and its (normalized) derivative, f_1, while the other selects g_0 and $g_1 = \dot{g}_0 / \|\dot{g}_0\|$. The space of relevant features is still two-dimensional or three-dimensional, including the δ-function, since g_0 and g_1 are linear combinations of f_0, f_1, and possibly also δ(t). Now the STA of this system is

$$\mathrm{STA}(\tau) = \cos^2\varphi \cdot \mathrm{STA}_f(\tau) + \sin^2\varphi \cdot \mathrm{STA}_g(\tau), \qquad \varphi = \tan^{-1} e^{(\theta_f^2 - \theta_g^2)/4\sigma^2}, \tag{2.18}$$
Figure 3b, the STA does not lie on the threshold. As for a variety of experimental examples such as complex cells (Touryan et al., 2002), neurons of rat barrel cortex (Petersen, 2003), and some retinal ganglion cells (Fairhall et al., 2006), the spike-triggered stimuli are poorly represented by their first-order statistics, the STA. Also, the coefficients of STA f,g depend exponentially on θ f,g and σ . Thus, the system shows a nonlinear dependence on the stimulus variance. Figure 4 shows an example. The W threshold model does not show any significant change, except sharpening of the STA due to the increased component of ˙ W , Figure 4b, and linear broadening of the spike-triggered stimulus distribution. The piecewise linear threshold case is more dramatic: while
Figure 4: Two examples of variance dependence in the linear system model. (a) Spectra of the covariance matrix for several different variance stimuli. (b) Significant eigenmodes. (c) Projection of spike-triggering stimuli onto the filter subspace defined by v1,2 for the linear and v1,+ for the piecewise linear threshold, respectively, colored according to the stimulus variance as in the legend of b. The circled points are the centers of the distributions, corresponding to the respective STAs, colored according to the respective distribution.
the smallest variance does not drive the system hard enough to produce a positive eigenmode (hence, the one-dimensional distribution of projections seen in Figures 4b and 4c, red), a new significant mode emerges as the variance increases. The STA changes beyond sharpening and almost looks like a different model at high variance compared with low variance. Each of the significant modes changes as more trajectories cross the threshold from the other side and the principal axis of the distribution rotates, as seen in Figure 4c.

3 Dynamical Threshold

In the previous section, we considered neuron models composed of linear filters derived from the FN neuron, and some choices of imposed thresholds. However, when we consider the full FN neuron, the threshold arises from the structure of the FN equations, 1.5 and 1.6. Here we discuss the importance of the threshold identification for reverse correlation analysis. The problem of the identification of a threshold arises immediately on attempting a reverse correlation analysis; spike times are often defined by the threshold crossing of the voltage. Here also we can impose an arbitrary threshold in V to identify each spike. The STA obtained using this scheme is displayed in Figure 5a. It is clear that it cannot be well described by the first-order kernels. However, this does not mean breakdown of the analysis; rather, it underscores the point that the STA or any other single-spike quantities should be computed using the "correct" threshold, which we take to be the dynamical threshold discussed in section 1. Figure 5b compares the spike-triggered stimuli in the fixed-V and dynamical threshold cases. While the peaks of the stimuli are spread out in time in the fixed-V threshold case, for the dynamical threshold, the peaks lie in a narrow band around the spike time. As we can see from Figure 1, the spiking trajectories cross the dynamical threshold before crossing the V threshold, and this timing difference, Δt, depends on W. Additionally, since the system is driven by gaussian white noise, the variability increases with the timing difference, inducing a point-spread function on the STA. Hence, each spike-triggered stimulus is contaminated by this temporal jitter, and the estimated filters are distorted. As mentioned in section 1.1, the FN neuron does not have a clear-cut dynamical threshold; the threshold was chosen to some degree arbitrarily, and this will induce some error in the estimation of filters. However, Figure 6a shows the improvement in the STA computed using the dynamical threshold. Figures 5c and 5d show the point-spread distribution, that is, the distribution of values of Δt, and how it is correlated with W. This situation is similar to that discussed in Aldworth, Miller, Gedeon, Cummins, and Dimitrov (2005), where it was noted that temporal or spatial jitter can blur the estimation of filters and receptive fields. There, a blind deconvolution algorithm was used to "dejitter" and realign
Figure 5: Effect of a dynamical threshold in the FN model. (a) STA of the full FN system with a constant V threshold. As before, a linear fit by the kernels is also shown. (b) Comparison of spike-triggered stimuli with different threshold choices in the FN model. (c) Temporal point-spread function induced by the choice of threshold. Δt is the time delay from crossing the dynamical threshold to the fixed V threshold. (d) Time delay of each spike-triggered stimulus plotted against the value of W at its threshold crossing point.
the spike-triggered stimuli, which dramatically sharpened the estimate of the spike-triggering stimulus. In a real system, jitter may indeed be due at least partly to noise, but it is also possible that a component of such jitter is deterministic and due to variability in the point of dynamical threshold crossing, as in the FN case. Blind deconvolution may then be viewed as an empirical approach to recovering a dynamical threshold based on spike times originally recorded using a voltage threshold. It is interesting to note that estimated jitter in such a case is correlated with projection onto the STA derivative.6
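A crude version of such a realignment, standing in for the blind deconvolution algorithm referred to above rather than reproducing it, estimates each trial's jitter as the shift that best matches the current STA estimate:

```python
import numpy as np

def realign(ensemble, sta, max_shift=10):
    """Shift each spike-triggered stimulus to best match the current STA.

    Jitter is estimated per trial as the integer shift maximizing the
    correlation with the STA, then removed. np.roll wraps at the edges, a
    simplification acceptable when max_shift is small relative to the window.
    """
    aligned = np.empty_like(ensemble)
    for i, row in enumerate(ensemble):
        shifts = list(range(-max_shift, max_shift + 1))
        scores = [np.dot(np.roll(row, s), sta) for s in shifts]
        aligned[i] = np.roll(row, shifts[int(np.argmax(scores))])
    return aligned

# Iterating realign() and recomputing the STA sharpens the estimate when the
# spread comes from deterministic timing differences, as for the dynamical
# threshold in Figure 5.
```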
6. This correlation arises because the STA derivative is the linear approximation to time translation of the STA. Hence, if a current history aligns with the STA better under a small time translation or jitter, it will have a projection onto the STA derivative proportional to that jitter. In section 2, we demonstrated that the STA and its derivatives are, up to a linear transformation, the filters associated with the dynamical variables of the linearized system. Therefore, there is a relationship between blind deconvolution to optimize fit to an STA and estimation of a dynamical threshold based on the "hidden" (non-V) dynamical variables.
4 Fully Nonlinear Systems

In this section, we discuss the covariance analysis of the full FN system and compare the result with the same analysis of the first- and second-order approximations. We identified approximately 2 × 10⁶ spikes, first using a voltage threshold and then backtracing each trajectory to the point where it crosses the dynamical threshold shown in Figure 1. The trajectory might cross the threshold multiple times before spiking due to the noisy input. In this case, we used the first crossing point after the trajectory diverges from that of the second-order approximation. This is based on the assumption that the second order is a sufficiently good approximation of the system in the subthreshold regime. For both the first- and second-order systems, crossing of the dynamical threshold is used to identify spikes. Right after a spike, we imposed a postspike inhibitory period equal to about a "spike width," as empirically determined from the full system. The results are shown in Figures 6 and 7. Figure 6a shows that the STA of the first-order approximation is quite similar to that of the full system, and the second-order STA is even closer. From the covariance analysis, we see that the FN neuron, like the HH model (Agüera y Arcas et al., 2003), has one dominant negative and one dominant positive eigenvalue. However, the results from the covariance analysis of the three systems differ considerably. Both the second-order and the full system show a relatively large number of significant eigenvalues. This is due to the contribution of the second-order g2[V] and g2[W] (and higher-order) kernels in equation 2.1. However, these are relatively suppressed compared to the first-order modes. Further, the spectrum derived from the second-order system, Figure 6b, has no significant positive eigenvalue. Figure 6c provides more detailed information. The first-order case is just as we have seen previously, with three modes that are well described by linear combinations of the linear kernels. In the second-order case, v1 and v2 are comparable to those of the first order. v3, which is not approximated by the first-order kernels, must arise from the second-order kernels. In this regard, the full system is well matched with the second-order approximation. There also, v3 is comparable to that in the second order, and v1,2 are derived from the linear kernels. However, the full system also has a positive mode, v+, from the linear kernels, which resembles the first order rather than the second.
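The spike identification procedure described above, detecting spikes with a voltage threshold and then backtracing to the dynamical threshold, can be sketched as follows; the threshold test function is assumed to encode the numerically determined curve in the (V, W) plane:

```python
import numpy as np

def dynamical_spike_times(V, W, v_thresh, on_threshold, max_back=300):
    """Detect spikes with a voltage threshold, then backtrace each trajectory
    to its earlier crossing of the dynamical threshold.

    on_threshold(V, W) should return > 0 once the state has passed the
    dynamical threshold curve (assumed given, e.g., from the phase plane).
    """
    v_spikes = np.flatnonzero((V[1:] > v_thresh) & (V[:-1] <= v_thresh)) + 1
    out = []
    for t in v_spikes:
        s = t
        while s > t - max_back and s > 0 and on_threshold(V[s - 1], W[s - 1]) > 0:
            s -= 1                  # walk back while still beyond the threshold
        out.append(s)
    return np.array(out)
```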
Figure 6: (a) STAs of an FN model neuron, and the first (gray) and second (dashed) order approximation. (b) Spectra of covariance matrices of the FN model and approximations. (c) STAs and covariance modes for the first- and second-order approximations and the full FN system.
The geometry of the spike-triggering stimulus projections is shown in Figure 7a. The first order, not surprisingly, recapitulates the curved threshold case in Figure 2d. However, the second order is more like the (V, W)-linear threshold case, while again the full system resembles the curve in Figure 2d. A possible explanation is that each system probes different parts of the dynamical threshold. Figure 7b marks the density of threshold crossing points for each model. In contrast to the first-order and the full system, which access a large section of the threshold with nontrivial curvature, the second order probes only a small and almost linear section. This may be the reason for the lack of a positive eigenvalue. As in the toy models
Figure 7: (a) Spike-triggered stimuli projected into a feature space defined by v1,+ for the first-order case and full system and v1,2 for the second-order system. (b) The density of threshold crossing points in the first-order (solid), secondorder (dashed), and full (dotted) systems, plotted along the threshold curves in the V − W plane.
in section 2, the contributions of the relevant dimensions are determined not only by local information (filters) but also by the global structure of a multidimensional or dynamical threshold.

5 Abbott-Kepler Model

In this section, we apply the same analysis to a two-dimensional model that is more nonlinear and more realistic than the FN model. Abbott and Kepler (1990) developed a two-dimensional reduction of the Hodgkin-Huxley model, based on the observation that there is a separation of timescales between the faster m and the slower n and h variables. m is then replaced with its asymptotic value at the membrane voltage V, while n and h are controlled by another voltage variable U. The equations of the Abbott-Kepler (AK) model are of the form

$$C\,\frac{dV}{dt} = f(V, U) + I(t), \tag{5.1}$$
$$\frac{dU}{dt} = g(V, U), \tag{5.2}$$
Figure 8: (a) Phase plane of the Abbott-Kepler model with trajectories, nullclines, and a threshold, in the absence of input. (b) AK phase plane with the injected noisy input.
where f(V, U) and g(V, U) are nonlinear functions of V and U. Their derivation is briefly sketched in appendix C, and we refer to the original paper (Abbott & Kepler, 1990) for further detail. The nullcline for U is given by g(V, U) = 0, which is satisfied by V = U. The V nullcline, f(V, U) = 0, is more complicated and is obtained numerically. The two nullclines intersect at a fixed point V = U = −65 mV. Figure 8a shows the phase plane of this model with zero input current. As in Figure 1, the threshold structure, which can be obtained numerically, is visible. Due to the strong nonlinearity, spiking trajectories are well defined on the phase plane, and the threshold has less ambiguity than for the FN model. Again, we try to identify the dynamics of the system in the subthreshold regime with the first-order approximation. Unlike an FN neuron, the Jacobian of this system has complex eigenvalues $\lambda_\pm = -0.2118 \pm 0.4035i\ \mathrm{ms}^{-1}$, and therefore the first-order kernels $\Gamma_{V,U}^{(1)}$ oscillate, Figure 10a. This is consistent with the oscillatory linearized behavior associated with the full Hodgkin-Huxley model near equilibrium. Before we carry out covariance analysis on this model, we examine the effect of a dynamical threshold, as in section 3. Figure 9a shows the spike-triggering stimuli aligned with both a fixed threshold in V, chosen as V = −40 mV to unambiguously select spiking trajectories, and the dynamical threshold. The typical time shift is of order 1 ms or less; the overall STA suffers from only a slight time or phase shift when the V threshold is used. However, on the small timescale of 1 ms or less, the V-threshold STA differs from that in the dynamical threshold case, which is characterized by a large delta-function component at the spike time. Both STAs are well approximated by linear combinations of $\Gamma_{V,U}^{(1)}$, up to a small deformation around −15 ms due to the effects of multiple spikes.
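Given the quoted complex eigenvalues, the shape of the linearized kernels follows immediately; a short sketch (the mixing coefficients are arbitrary):

```python
import numpy as np

lam = complex(-0.2118, 0.4035)          # ms^-1, the eigenvalue quoted in the text
t = np.arange(0.0, 25.0, 0.01)          # ms
mode = np.exp(lam * t)                  # e^{lambda t} H(t), for t >= 0

# A first-order kernel is a real combination of this mode and its conjugate,
# i.e., a damped cosine; the coefficients below are arbitrary for illustration.
kernel = np.real(mode) + 0.5 * np.imag(mode)
print(f"period ~ {2 * np.pi / lam.imag:.1f} ms, "
      f"decay ~ {-1.0 / lam.real:.1f} ms")
```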
Figure 9: (a) Spike-triggering current histories with two different threshold schemes in the AK model. (b) STAs of the AK model with the fixed V = −40 mV threshold (dashed) and dynamical threshold (solid). They are also least-squares fitted by linear combinations of $\Gamma_{V,U}^{(1)}$ (gray).
Figure 10 shows the results from covariance analysis carried out with approximately 2 × 10⁶ spikes for gaussian white noise stimuli with various variances. For comparison with previous results, we selected three eigenmodes corresponding to the leading two negative and the largest positive eigenvalues. We find that they are reasonably well approximated by linear combinations of $\Gamma_{V,U}^{(1)}$ in most cases, although they are sometimes affected by a δ-function at t = 0 and a large multispike effect at high variances. The multispike effect can be eliminated by considering only the isolated spikes when the spike rate is low (Agüera y Arcas & Fairhall, 2003). At higher variances, isolated spikes are rare, and there is a stronger influence of oscillating "silence modes" (Agüera y Arcas et al., 2003). As in the FN neuron case, we identify modes other than those in Figure 10c as "nonlinear modes." We compare results at different variances with our discussion in section 2.4. Some features of variance dependence are shared with the toy model in section 2.4: the eigenvalue spectrum drifts, and the corresponding modes rotate among themselves. However, the Abbott-Kepler modes also exhibit more complicated behavior. Figure 11 shows projections of spike-triggered stimuli onto the two distinguished eigenvectors v1,+ and the corresponding threshold crossing points. At low variance, the system crosses mostly one side of the threshold while the projections trace out the curvature of the threshold segment. As the variance increases, some crossing points begin to appear on the left side of the threshold. However, this does not overcome the expansion of the crossing point distribution on the right side, and the modes corresponding to this direction dominate. At high variance, there are many crossing points on both sides, reflected in the bimodal distribution of the
Figure 10: (a) Normalized first-order kernels $\bar{\Gamma}_{V,U}^{(1)}$ of the AK model and the component of $\Gamma_U^{(1)}$ orthogonal to $\Gamma_V^{(1)}$, normalized (gray). (b) Eigenvalue spectrum of the covariance matrix of the AK model at each variance. σ0 = 13 pA. (c) Eigenmodes of the covariance matrices. As previously, v1,2 and v+ correspond, respectively, to the two smallest negative eigenvalues and the largest positive one.
Figure 11: Projection of spike-triggering stimuli onto v1 and v+ in the AK model. Below are the threshold crossing points in the phase plane for each case, marked by crosses when V ≤ −67 mV and by circles when V ≥ −67 mV, over the dynamical threshold (dashed). The other gray curves are the nullclines, with their intersection, the fixed point, marked by a circle.
projections, previously seen in the toy models. Thus, this is another example of how the results of reverse correlation analysis are affected by both the filter properties and the interaction of the stimulus ensemble with the threshold geometry. So far, we have discussed only two-dimensional dynamical models. We began with some artificial toy models with purely linear subthreshold dynamics and proceeded to the minimal FitzHugh-Nagumo spiking model. We applied the lessons learned there to the more nonlinear Abbott-Kepler model. We will conclude with a partial analysis of the higher-dimensional Hodgkin-Huxley model.

6 Hodgkin-Huxley Model

Higher-dimensional systems require nontrivial extensions of the methodology we have used with two-dimensional systems. First, it is much harder to find a dynamical threshold, which could now be a multidimensional hypersurface rather than a curve. If we do not align the spike-triggered stimuli according to the dynamical threshold, the obtained filters may include a distribution of time delays. For example, the reverse correlation analysis may result in broadened filters,

$$\tilde{f}_\mu(t) = \int d\tau\, f_\mu(\tau)\, p_\sigma(t - \tau), \tag{6.1}$$
where $p_\sigma(\tau)$ is a point-spread function, depending on the input variance σ², as in Figure 5c.7 However, we can still gain some insights from $\tilde{f}_\mu$. If p(t) has narrow support, say less than a millisecond, it has negligible effects for larger time windows. It is also possible that some consequences of our analysis will still apply to $\tilde{f}_\mu$ with only a few reasonable assumptions. For example, if $p_\sigma$ is of limited temporal extent, then the derivatives of $\tilde{f}_\mu$ are simply

$$\tilde{f}_\mu^{(n)}(t) \approx \int d\tau\, f_\mu^{(n)}(\tau)\, p_\sigma(t - \tau). \tag{6.2}$$

7. In fact, it can be much more complicated than this. Since the time delay depends on the location of threshold crossing in phase space, additional temporal correlations are introduced. Therefore, each filter may have a different point-spread function, or the filters may depend on the time delay structure in a complicated way. Here we will discuss only the simplest case.
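This closure property can be tested directly: fit the time derivative of the STA with the recovered linear modes and inspect the residual, as the next paragraph demonstrates for the HH model. A generic sketch, illustrated on modes that are exactly closed under differentiation (decaying exponentials):

```python
import numpy as np

def closure_residual(sta, modes, dt):
    """Relative least-squares residual of fitting d/dt STA with the modes.

    modes: array of shape (n_modes, T). Small values support closure of the
    mode space under time differentiation (cf. equation 6.2).
    """
    d_sta = np.gradient(sta, dt)
    coef, *_ = np.linalg.lstsq(modes.T, d_sta, rcond=None)
    fit = modes.T @ coef
    return np.linalg.norm(d_sta - fit) / np.linalg.norm(d_sta)

# Toy check with exponential modes, which are exactly closed.
dt = 0.1
t = np.arange(0, 20, dt)
modes = np.stack([np.exp(-t), np.exp(-t / 3)])
sta = 2 * modes[0] + modes[1]
print(closure_residual(sta, modes, dt))     # near zero up to edge effects
```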
Therefore, as we discussed in section 2, the linear modes among $\{\tilde{f}_\mu\}$ should still be (approximately) closed under time differentiation. To demonstrate this, we use the four-dimensional Hodgkin-Huxley model and the following strategy: we select the linear modes out of the significant filters from covariance analysis. Now, since the STA is approximately their linear combination, the time derivatives of the STA should be expressible in terms of such linear combinations as well if they are closed under time differentiation. Figure 12 shows that this holds. In the low-variance case, Figure 12a, the three linear modes chosen provide good fits to the time derivatives of the STA. The high-variance case, Figure 12b, is affected by multispike effects and a filtering artifact, but it is also well fitted. Therefore, we can conclude that the time derivatives of the STA span the same space as the three linear modes. One might expect a fourth linear mode due to the dimensionality of the model, but this is not as significant as the others; this agrees with previous covariance analysis showing that the HH model can be well described as a quasi-three-dimensional system (Agüera y Arcas et al., 2003). We note that it is not clear whether the linear modes in the two cases span the same feature space. This is difficult to ascertain since the point-spread function can in principle depend on the stimulus variance. In Figure 12c, we see that while v1 of the high-variance case can be fitted by the linear modes of the low variance, the other modes show a small deviation even around 5 ms. This might indicate an interesting variance dependence as in section 2.4, but we will not pursue this issue in this article.

7 Summary

Here we have investigated the meaning or interpretation of the features derived from the spike-triggered covariance method applied to two simple
Figure 12: Covariance bases of the linear feature space of the Hodgkin-Huxley model recovered at (a) low and (b) high variance. The three most significant eigenvalues are denoted with the same shades as the corresponding linear eigenmodes. (c) The STA and linear modes of the high-variance case and their fits using the linear modes of the low-variance case in the HH model.
neuron models: the FitzHugh-Nagumo model, the minimal spiking neuron model, and the Abbott-Kepler model, a more faithful two-dimensional reduction of the Hodgkin-Huxley system. The power of white noise analysis is that it provides a data-driven method to reduce a high-dimensional dynamical system to a functional model that captures the essential computation of the system in terms familiar to systems neuroscience: a receptive field that filters the stimulus and a threshold function over the filtered stimulus. Our goal here was to analyze the output of a white noise analysis in terms of what it can reveal and how it depends on the underlying dynamical system. In this, our approach is distinct from the elegant work of Huys, Ahrens, and Paninski (2006), where responses to white noise stimuli are used to fit the parameters of a conductance-based model. Dealing with simple two-dimensional systems, our observations of spiking dynamics in the phase plane motivated the following reduced model: dynamics in the phase plane are approximated by the perturbative expansion, in particular the linear approximation, and the system's nonlinearity is captured by a spiking threshold, determined for zero input, that extends through multiple dimensions. This model is a generalization of a basic filter-and-fire model to multiple dimensions and extends it in two significant ways. One is the identification of the filters with the dynamical variables of the system; the other is the generalization of the concept of the threshold. The simplified model with linear dynamics and a multidimensional curved threshold was treated both analytically and phenomenologically, using numerical simulation and reverse correlation (covariance) analysis of spike-triggered stimuli. This led to several insights, some of which are related to existing observations. First, for this case, the feature space derived from covariance analysis is spanned by the linear kernels. Since the set of kernels is closed under time differentiation, the same feature space can also be spanned by a generic linear combination of the kernels, such as the spike-triggered average and its n time derivatives. However, not every kernel contributes as a relevant feature. The threshold structure plays the role of selecting the relevant ones from the kernels: for example, a linear threshold selects a single filter and its time derivative. In general, a threshold spanning a d-dimensional subspace will select d + 1 features from among linear combinations of the kernels. We show further for this model that the distribution of spike-triggered stimuli in the covariance feature space corresponds to that of threshold crossing points, up to a suitable linear transformation. We also used this model to illustrate the effects of the interaction of the threshold nonlinearity with the stimulus ensemble. We show that a complex threshold geometry leads to a nontrivial variance dependence of the eigenmodes and eigenvalues of covariance analysis and, even more so, the spike-triggered average. In the linearized subthreshold model, the subspace of eigenmodes is not changed, but the spike-triggering ensemble
may rotate through this subspace, leading to a variance-dependent spike-triggered average. This is one example of stimulus-variance dependence in a nonlinear, nonadapting system (Yu & Lee, 2003; Borst, Flanagin, & Sompolinsky, 2005). We show this effect in the analysis of the Abbott-Kepler model neuron. The identification of a curved or dynamical threshold raises an issue in reverse correlation analysis regarding the determination of spike time. For the FitzHugh-Nagumo and Abbott-Kepler models, we compared two cases: when spike times are identified using a threshold in voltage and when the spike times are found using the crossing of an identified curve that leads to spiking for zero current input. Surprisingly, the spike-triggered averages showed remarkable differences. For a voltage threshold, the time delay from the earlier threshold crossing point blurs the estimated filters. It is shown that using the dynamical threshold leads to a better fitting of the estimated filters by the system's linear kernels, as well as improving the sharpness and substantially altering the shape of the STA. In all of our discussion here, we have concentrated on the subthreshold dynamics bringing the system to the point of threshold. We do not analyze, and our simplified models do not accommodate, the dynamics of the system immediately after a spike. In the phase-plane picture, spiking reinjects the system into the subthreshold regime in a nonrandom way, affecting the subsequent probability flux to threshold, analogous to the one-dimensional reset of the integrate-and-fire neuron treated in Paninski, Lau, and Reyes (2003). In our approach, we have assumed that the system has remained below threshold for long enough that its location in the subthreshold space has been randomized by the driving current. This corresponds to an analysis of isolated spikes only, a simplification we have used before (Agüera y Arcas et al., 2003) and employed here for the covariance analysis. A number of works have treated this issue explicitly with a variety of methods: treating the interspike interval as the primary symbol (de Ruyter van Steveninck & Bialek, 1988; Rieke et al., 1997), solving for the interspike interaction using the interspike intervals (Pillow & Simoncelli, 2003), simultaneously solving for the linear kernel over stimulus history and spike history using the autoregressive moving average (Powers et al., 2005; Truccolo, Eden, Fellows, Donoghue, & Brown, 2005), and fitting parameters of an explicit model for the effective postspike current (Kistler et al., 1997; Keat et al., 2001; Paninski et al., 2004; Pillow et al., 2005). While the simplification we have used here allows us to find direct connections between the covariance modes without the confound of the interspike interaction (Agüera y Arcas et al., 2003; Agüera y Arcas & Fairhall, 2003), it is clearly not a complete model for spiking responses. A first step toward a more complete spiking model may be to consider the perturbed subthreshold distributions induced by the influx of trajectories following spikes. This is in effect a mean field approximation, taking into account the overall spike rate for a given stimulus ensemble. Further steps could be taken by introducing a return map
deterministically relating the point of threshold crossing to a point of return into the subthreshold domain. The integrate-and-fire model is the most trivial implementation of such a return map; an equivalent map in multiple dimensions would reintroduce all spike trajectories into the subthreshold domain at an identified point (this is V = 0 for integrate-and-fire). Such a many-to-one map implies that the neuron's state is completely reset by a spike, which is incorrect for neurons with slow conductances that modulate spike afterpotentials. A less degenerate map seems more appropriate for such cases. Another important step is the addition of these slow conductances. Using our formalism, such conductances may be representable simply as additional dimensions of threshold curvature with corresponding longer-timescale stimulus filters.

White noise analysis allows the derivation of intuitive functional models for neural computation: What does the neuron compute? In this article, we have drawn concrete correspondences between the components of these functional models and parameters of the underlying dynamical system.
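To make the reduced model and its analysis concrete, here is a minimal sketch (our own illustration, not the simulation code used in this article) of a two-dimensional filter-and-fire neuron with a curved threshold, analyzed by spike-triggered covariance on isolated spikes. The kernel shapes, threshold geometry, and all parameter values are assumptions chosen only for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
dt, T, tau1, tau2 = 1.0, 200_000, 10.0, 25.0   # assumed timescales

# Two linear kernels (assumed exponential forms standing in for the
# system's first-order kernels).
t = np.arange(0, 100, dt)
k1 = np.exp(-t / tau1) / tau1
k2 = np.exp(-t / tau2) / tau2

stim = rng.normal(0.0, 1.0, T)                  # gaussian white noise
x1 = np.convolve(stim, k1)[:T] * dt             # projection on kernel 1
x2 = np.convolve(stim, k2)[:T] * dt             # projection on kernel 2

# Curved (quadratic) threshold in the 2D subthreshold space: a spike is
# recorded when the trajectory crosses the curve x1 + c*x2**2 = theta.
c, theta = 2.0, 0.4                             # assumed threshold shape
above = x1 + c * x2**2 > theta
spikes = np.flatnonzero(above[1:] & ~above[:-1]) + 1
# Keep isolated spikes only, as in the analysis described in the text.
spikes = spikes[np.diff(spikes, prepend=-10**9) > 200]

# Spike-triggered average and covariance over a 100-step stimulus window.
win = len(t)
segs = np.array([stim[s - win:s] for s in spikes if s >= win])
sta = segs.mean(axis=0)
stc = np.cov(segs.T) - np.var(stim) * np.eye(win)   # change from prior
evals, evecs = np.linalg.eigh(stc)

# The significant eigenmodes should span the (time-reversed) kernels.
basis = np.linalg.qr(np.stack([k1[::-1], k2[::-1]], axis=1))[0]
lead = evecs[:, np.argsort(np.abs(evals))[-2:]]
print("subspace overlap of leading STC modes:",
      np.linalg.norm(basis.T @ lead, axis=0))
```

With a quadratic threshold of this kind, covariance analysis recovers one variance-reduced (linear) and one variance-enhanced (quadratic) direction, both lying in the subspace spanned by the two kernels, consistent with the discussion above.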
Appendix A: Volterra Expansion of a Dynamical System

Let us consider an n-dimensional dynamical system, perturbed by an external input I(t) as in equation 2.1. For convenience, we introduce a dimensionless expansion parameter η, with which we will expand the equation. More precisely, equation 2.1 becomes
\[
\partial_t y_k = f_k(y_1, y_2, \ldots, y_N) + \eta\, \delta_{k0}\, I(t). \tag{A.1}
\]
We also have the perturbation expansion of y_k as
\[
y_k = y_k^{(0)} + \eta\, y_k^{(1)} + \eta^2 y_k^{(2)} + \cdots. \tag{A.2}
\]
By plugging equation A.2 into equation A.1, we obtain
\[
\partial_t \left( \eta\, y_k^{(1)} + \eta^2 y_k^{(2)} + \cdots \right)
= \eta\, I(t)\, \delta_{k0}
+ \eta\, y_m^{(1)} \frac{\partial f_k(y^{(0)})}{\partial y_m}
+ \frac{1}{2!}\, \eta^2\, y_m^{(1)} y_n^{(1)} \frac{\partial^2 f_k(y^{(0)})}{\partial y_m\, \partial y_n}
+ \eta^2\, y_m^{(2)} \frac{\partial f_k(y^{(0)})}{\partial y_m} + \cdots.
\]
By comparing the two sides order by order, we obtain a series of equations satisfied by the perturbative expansion at each order:
\[
D_{km}\, y_m^{(1)} = I\, \delta_{k0}, \qquad
D_{km} = \delta_{km}\, \partial_t - \left. \frac{\partial f_k}{\partial x_m} \right|_{y^{(0)}},
\]
which is equation 2.2, and
\[
D_{km}\, y_m^{(2)} = \frac{1}{2!}\, H_{kmn}\, y_m^{(1)} y_n^{(1)}, \qquad
H_{kmn} = \frac{\partial^2 f_k(y^{(0)})}{\partial y_m\, \partial y_n}, \tag{A.3}
\]
which is equation 2.9, and so on. These equations can be solved recursively by using a kernel K^{(1)} = D^{-1}, which therefore satisfies D_{km} K^{(1)}_{ml}(t, t') = δ_{kl} δ(t − t').
Now the solution at first order can be written as
\[
y_k^{(1)}(t) = \int dt'\, K_k^{(1)}(t - t')\, I(t'),
\]
where K_k^{(1)} = K_{k0}^{(1)}. Also from equation A.3,
\[
y_k^{(2)} = \frac{1}{2!} \int dt'\, K_{kl}^{(1)}(t - t')\, H_{lmn} \int ds_1\, ds_2\, K_m^{(1)}(t' - s_1)\, K_n^{(1)}(t' - s_2)\, I(s_1)\, I(s_2)
= \frac{1}{2!} \int ds_1\, ds_2\, K_k^{(2)}(t - s_1, t - s_2)\, I(s_1)\, I(s_2),
\]
where
\[
K_k^{(2)}(s_1, s_2) = \int dt'\, H_{lmn}\, K_{kl}^{(1)}(t')\, K_m^{(1)}(s_1 - t')\, K_n^{(1)}(s_2 - t'),
\]
which is equation 2.10. Since this procedure can be carried out to higher orders, the higher-order kernels are outer products of the linear kernels. In addition, the first-order kernels are related to each other by simple time differentiation. To show this, we rewrite equation 2.4 as follows:
\[
K_m^{(1)}(t - t') = D_{mn}^{-1}\, \delta_{n0}\, \delta(t - t'). \tag{A.4}
\]
Now the derivatives can be computed as
\[
\partial_t K_m^{(1)} = \partial_t D_{mn}^{-1}\, \delta_{n0}\, \delta(t - t')
= (\partial_t \delta_{ml} - J_{ml} + J_{ml})\, D_{ln}^{-1}\, \delta_{n0}\, \delta(t - t')
= D_{ml} D_{ln}^{-1}\, \delta_{n0}\, \delta(t - t') + J_{ml}\, K_l^{(1)}
= \delta_{m0}\, \delta(t - t') + J_{ml}\, K_l^{(1)}. \tag{A.5}
\]
This shows that the nonsingular part of the first-order kernels can be obtained as a linear combination of the derivatives of the other kernels.
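As a numerical sanity check on equations A.4 and A.5, the sketch below (assuming an arbitrary stable Jacobian, not one taken from this article) computes the first-order kernels of a linearized two-dimensional system as columns of a matrix exponential and confirms that, away from t = 0, each kernel's time derivative is a linear combination of the kernels through J.

```python
import numpy as np
from scipy.linalg import expm

# Assumed Jacobian of a stable two-dimensional linearized system.
J = np.array([[-0.5, -1.0],
              [ 1.0, -0.2]])

dt = 1e-3
t = np.arange(dt, 5.0, dt)               # t > 0: nonsingular part only

# First-order kernels: K_m(t) = [exp(J t)]_{m0} for t > 0 (equation A.4
# solved for input entering the first component, k = 0).
K = np.stack([expm(J * ti)[:, 0] for ti in t], axis=1)   # shape (2, len(t))

# Equation A.5 away from t = 0: dK_m/dt = J_{ml} K_l.
dK = np.gradient(K, dt, axis=1)
print(f"max |dK/dt - J K| = {np.max(np.abs(dK - J @ K)):.2e}")
# small residual, at the level of the finite-difference error
```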
Appendix B: A Linear Model from the FitzHugh-Nagumo System

Here we derive the linearization of the FitzHugh-Nagumo model, beginning with equations 1.5 and 1.6. The fixed point can be obtained by simultaneously solving the nullcline equations. We denote it by (V_0, W_0) and expand the system around this point. First, the Jacobian is
\[
J = \begin{pmatrix} F'(V_0)/\psi & -1/\psi \\ 1 & -b \end{pmatrix},
\]
where F(V) = V(1 − V)(a + V), and this defines a linear system:
\[
\partial_t V = \frac{1}{\psi}\left( F'(V_0)\, V - W + I(t) \right), \qquad
\partial_t W = V - bW.
\]
J has eigenvalues λ_±:
\[
\lambda_\pm = \frac{1}{\psi}\left( f_-(V_0) \pm \sqrt{f_+(V_0)^2 - \psi} \right)
= -b + \frac{1}{\psi}\left( f_+(V_0) \pm \sqrt{f_+(V_0)^2 - \psi} \right)
= -b + \kappa_\pm, \tag{B.1}
\]
where f_±(V) = (F'(V) ± bψ)/2. As in equation 2.6, J is diagonalized by a matrix U,
\[
U = \frac{1}{\lambda_+ - \lambda_-} \begin{pmatrix} 1 & -(\lambda_- + b) \\ -1 & \lambda_+ + b \end{pmatrix}.
\]
From equation 2.8, we obtain the first-order kernels, which solve the linear system:
\[
K_V^{(1)}(t) = e^{-bt}\, \partial_t S(t)\, H(t), \qquad
K_W^{(1)}(t) = e^{-bt}\, S(t)\, H(t), \tag{B.2}
\]
where S(t) = (e^{κ_+ t} − e^{κ_- t})/(κ_+ − κ_-). K_{V,W}^{(1)}(t) with our choice of parameters is drawn in Figure 2a.
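For readers who wish to evaluate equations B.1 and B.2 numerically, a small sketch follows. The parameter values a, b, and ψ are illustrative assumptions (the values used for the article's figures are set in section 1); with a < 0, the form V(1 − V)(a + V) matches the common excitable parameterization, and the resting fixed point sits at the origin.

```python
import numpy as np
from scipy.optimize import brentq

# Illustrative parameters (assumed; a < 0 gives a stable rest state).
a, b, psi = -0.1, 0.5, 0.015

F  = lambda V: V * (1.0 - V) * (a + V)                  # cubic nonlinearity
dF = lambda V: -3.0 * V**2 + 2.0 * (1.0 - a) * V + a    # F'(V)

# Fixed point with I = 0: W = F(V) and V = b*W  =>  F(V) - V/b = 0.
V0 = brentq(lambda V: F(V) - V / b, -0.4, 0.3)
W0 = V0 / b                                   # here (V0, W0) = (0, 0)

# Equation B.1: kappa may be complex (damped oscillations); the kernels
# of equation B.2 come out real either way.
f_plus = (dF(V0) + b * psi) / 2.0
root = np.sqrt(complex(f_plus**2 - psi))
kp, km = (f_plus + root) / psi, (f_plus - root) / psi

t = np.linspace(1e-6, 5.0, 2000)
S = (np.exp(kp * t) - np.exp(km * t)) / (kp - km)
dS = (kp * np.exp(kp * t) - km * np.exp(km * t)) / (kp - km)

K_V = (np.exp(-b * t) * dS).real   # first-order kernel for V (eq. B.2)
K_W = (np.exp(-b * t) * S).real    # first-order kernel for W (eq. B.2)
print(f"K_V(0+) = {K_V[0]:.3f}, oscillation freq ~ {root.imag / psi:.2f}")
```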
Appendix C: Derivation of the Abbott-Kepler Model

In this section, we show the derivation of the two-dimensional neuron model used in section 5. For further details, we refer to the original paper (Abbott & Kepler, 1990). We begin with a Hodgkin-Huxley equation, which is defined by equation 1.1 and the following parameters:
\[
g_L = \bar{g}_L, \qquad g_K = \bar{g}_K\, n^4, \qquad g_{Na} = \bar{g}_{Na}\, m^3 h, \tag{C.1}
\]
\[
\tau_z(V)\, \frac{dz}{dt} = \bar{z}(V) - z, \qquad
\bar{z} = \frac{\alpha_z}{\alpha_z + \beta_z}, \qquad
\tau_z = \frac{1}{\alpha_z + \beta_z}, \qquad z = m, n, h, \tag{C.2}
\]
\[
\begin{aligned}
\alpha_m &= \frac{0.1(V + 40)}{1 - \exp[-0.1(V + 40)]}, & \beta_m &= 4 \exp[-0.0556(V + 65)], \\
\alpha_h &= 0.07 \exp[-0.05(V + 65)], & \beta_h &= \frac{1}{1 + \exp[-0.1(V + 35)]}, \\
\alpha_n &= \frac{0.01(V + 55)}{1 - \exp[-0.1(V + 55)]}, & \beta_n &= 0.125 \exp[-0.0125(V + 65)].
\end{aligned} \tag{C.3}
\]
The key observation is that τ_m is much smaller than τ_{n,h}, while τ_h and τ_n are mutually comparable. Therefore, m can be approximated by its value at equilibrium, \bar{m}(V), and h and n can be represented by the same equilibrium voltage, U. In other words,
\[
m \approx \bar{m}(V), \qquad h \approx \bar{h}(U), \qquad n \approx \bar{n}(U).
\]
Using this, we obtain equation 5.1 as
\[
F = \sum_{i = L, K, Na} g_i (V - E_i)
\approx \bar{g}_L (V - E_L) + \bar{g}_K\, \bar{n}(U)^4 (V - E_K) + \bar{g}_{Na}\, \bar{m}(V)^3\, \bar{h}(U)\, (V - E_{Na})
= -f(V, U).
\]
An equation for U, equation 5.2, is obtained by requiring that the time dependence of the active current F due to h and n in the Hodgkin-Huxley model is mimicked by \bar{h}(U) and \bar{n}(U). This implies
\[
\frac{dF}{dh}\frac{dh}{dt} + \frac{dF}{dn}\frac{dn}{dt}
\approx \left( \frac{\partial f}{\partial \bar{h}} \frac{d\bar{h}}{dU} + \frac{\partial f}{\partial \bar{n}} \frac{d\bar{n}}{dU} \right) \frac{dU}{dt}. \tag{C.4}
\]
Again, we approximate equation C.2 in the same way,
\[
\frac{dz}{dt} \approx \frac{1}{\tau_z(V)} \left( \bar{z}(V) - \bar{z}(U) \right), \qquad z = h, n.
\]
Plugging this in equation C.4, we can solve for dU/dt as a function of V and U, which we denoted by g(V, U) in equation 5.2:
\[
g(V, U) = \frac{\bar{g}_{Na}(V - E_{Na})\, \bar{m}(V)^3 \left( \bar{h}(V) - \bar{h}(U) \right)/\tau_h(V)
+ 4 \bar{g}_K (V - E_K)\, \bar{n}(U)^3 \left( \bar{n}(V) - \bar{n}(U) \right)/\tau_n(V)}
{\bar{g}_{Na}(V - E_{Na})\, \bar{m}(V)^3\, \bar{h}'(U) + 4 \bar{g}_K (V - E_K)\, \bar{n}(U)^3\, \bar{n}'(U)}.
\]
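The reduction is straightforward to code. The sketch below implements \bar{z}(V), τ_z(V), f(V, U), and g(V, U) from equations C.1 to C.4 and integrates the two-dimensional model with forward Euler. The maximal conductances, reversal potentials, and capacitance are the standard Hodgkin-Huxley values, assumed here because the parameters of equation 1.1 are defined outside this excerpt; \bar{h}'(U) and \bar{n}'(U) are taken by finite differences, and the step-current protocol is our own.

```python
import numpy as np

# Standard Hodgkin-Huxley constants (assumed; mS/cm^2, mV, uF/cm^2).
g_Na, g_K, g_L = 120.0, 36.0, 0.3
E_Na, E_K, E_L = 50.0, -77.0, -54.4
C_m = 1.0

def rates(V):
    """Rate functions alpha_z, beta_z of equation C.3."""
    a_m = 0.1 * (V + 40.0) / (1.0 - np.exp(-0.1 * (V + 40.0)))
    b_m = 4.0 * np.exp(-0.0556 * (V + 65.0))
    a_h = 0.07 * np.exp(-0.05 * (V + 65.0))
    b_h = 1.0 / (1.0 + np.exp(-0.1 * (V + 35.0)))
    a_n = 0.01 * (V + 55.0) / (1.0 - np.exp(-0.1 * (V + 55.0)))
    b_n = 0.125 * np.exp(-0.0125 * (V + 65.0))
    return a_m, b_m, a_h, b_h, a_n, b_n

def equilibria(V):
    """zbar(V) and tau_z(V) for z = m, h, n (equation C.2)."""
    a_m, b_m, a_h, b_h, a_n, b_n = rates(V)
    mbar, hbar, nbar = a_m/(a_m+b_m), a_h/(a_h+b_h), a_n/(a_n+b_n)
    tau_h, tau_n = 1.0/(a_h+b_h), 1.0/(a_n+b_n)
    return mbar, hbar, nbar, tau_h, tau_n

def f(V, U):
    """Equation 5.1: minus the total membrane current."""
    mV, _, _, _, _ = equilibria(V)
    _, hU, nU, _, _ = equilibria(U)
    return -(g_L*(V - E_L) + g_K*nU**4*(V - E_K)
             + g_Na*mV**3*hU*(V - E_Na))

def g(V, U, dU=1e-4):
    """Equation 5.2, built from equation C.4."""
    mV, hV, nV, tau_h, tau_n = equilibria(V)
    _, hU, nU, _, _ = equilibria(U)
    _, hU2, nU2, _, _ = equilibria(U + dU)
    dh, dn = (hU2 - hU)/dU, (nU2 - nU)/dU   # hbar'(U), nbar'(U)
    num = (g_Na*(V - E_Na)*mV**3*(hV - hU)/tau_h
           + 4.0*g_K*(V - E_K)*nU**3*(nV - nU)/tau_n)
    den = g_Na*(V - E_Na)*mV**3*dh + 4.0*g_K*(V - E_K)*nU**3*dn
    return num / den

# Forward Euler with a step of injected current (assumed test protocol).
dt, V, U = 0.01, -65.0, -65.0
for step in range(int(50.0 / dt)):
    I = 10.0 if step * dt > 5.0 else 0.0
    V, U = V + dt * (f(V, U) + I) / C_m, U + dt * g(V, U)
print(f"after 50 ms: V = {V:.1f} mV, U = {U:.1f} mV")
```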
References

Abbott, L. F., & Kepler, T. (1990). Model neurons: From Hodgkin-Huxley to Hopfield. In L. Garrido (Ed.), Statistical mechanics of neural networks (pp. 5–18). Berlin: Springer-Verlag.
Agüera y Arcas, B., & Fairhall, A. (2003). What causes a neuron to spike? Neural Computation, 15, 1715–1749.
Agüera y Arcas, B., Fairhall, A., & Bialek, W. (2003). Computation in a single neuron: Hodgkin and Huxley revisited. Neural Computation, 15, 1789–1807.
Aldworth, Z., Miller, J., Gedeon, T., Cummins, G., & Dimitrov, A. (2005). Dejittered spike-conditioned stimulus waveforms yield improved estimates of neuronal feature selectivity and spike timing precision of sensory interneurons. J. Neurosci., 25, 5323–5332.
Azouz, R., & Gray, C. (2000). Dynamic spike threshold reveals a mechanism for synaptic coincidence detection in cortical neurons in vivo. Proc. Natl. Acad. Sci. USA, 97, 8110–8115.
Berry II, M. J., & Meister, M. (1999). The neural code of the retina. Neuron, 22, 435–450.
Bialek, W., & de Ruyter van Steveninck, R. R. (2005). Features and dimensions: Motion estimation in fly vision. Preprint, arXiv:q-bio/0505003 [q-bio.NC].
Borst, A., Flanagin, V. L., & Sompolinsky, H. (2005). Adaptation without parameter change: Dynamic gain control in motion detection. Proc. Natl. Acad. Sci. USA, 102(17), 6172–6176.
Brenner, N., Bialek, W., & de Ruyter van Steveninck, R. R. (2000). Adaptive rescaling maximizes information transmission. Neuron, 26, 695–702.
de Ruyter van Steveninck, R. R., & Bialek, W. (1988). Real-time performance of a movement-sensitive neuron in the blowfly visual system: Information transfer in short spike sequences. Proc. Roy. Soc. Lond. B, 234, 379–414.
DeWeese, M. (1995). Optimization principles for the neural code. Unpublished doctoral dissertation, Princeton University.
Fairhall, A. L., Burlingame, C. A., Narasimhan, R., Harris, R. A., Puchalla, J. L., & Berry II, M. J. (2006). Selectivity for multiple stimulus features in retinal ganglion cells. J. Neurophysiol., 96, 2724–2738.
FitzHugh, R. (1961). Impulses and physiological states in theoretical models of nerve membrane. Biophys. J., 1, 445–466.
Gerstner, W., & Kistler, W. (2002). Spiking neuron models: Single neurons, populations, plasticity. Cambridge: Cambridge University Press.
Gradshteyn, I. S., & Ryzhik, I. M. (2000). Table of integrals, series, and products (6th ed.). San Diego: Academic Press.
Horwitz, G., Chichilnisky, E., & Albright, T. (2005). Blue-yellow signals are enhanced by spatiotemporal luminance contrast in macaque V1. J. Neurophysiol., 93(4), 2263–2278.
Hutcheon, B., & Yarom, Y. (2000). Resonance, oscillation and the intrinsic frequency preferences of neurons. Trends Neurosci., 23(5), 216–222.
Huys, Q. J., Ahrens, M. B., & Paninski, L. (2006). Efficient estimation of detailed single-neuron models. J. Neurophysiol., 96, 872–890.
Izhikevich, E. (2001). Resonate-and-fire neurons. Neural Networks, 14, 883–894.
Izhikevich, E. M. (2006). Dynamical systems in neuroscience: The geometry of excitability and bursting. Cambridge, MA: MIT Press.
Keat, J., Reinagel, P., Reid, R. C., & Meister, M. (2001). Predicting every spike: A model for the responses of visual neurons. Neuron, 30(3), 803–817.
Kistler, W., Gerstner, W., & van Hemmen, J. L. (1997). Reduction of the Hodgkin-Huxley equations to a single-variable threshold model. Neural Computation, 9, 1015–1045.
Maravall, M., Petersen, R. S., Fairhall, A. L., Arabzadeh, E., & Diamond, M. E. (2007). Shift in coding properties and maintenance of information transmission during adaptation in barrel cortex. PLoS Biol., 5, e19.
Marmarelis, P. Z., & Marmarelis, V. Z. (1974). Measurement of the Wiener kernels of a non-linear system by cross-correlation. International Journal of Control, 2, 234–254.
Marmarelis, V. (2004). Nonlinear dynamic modeling of physiological systems. Piscataway, NJ: IEEE Press.
Mehta, M. L. (1991). Random matrices (3rd ed.). New York: Academic Press.
Nagumo, J., Arimoto, S., & Yoshizawa, S. (1962). An active pulse transmission line simulating nerve axon. Proc. IRE, 50, 2061–2071.
Paninski, L., Lau, B., & Reyes, A. (2003). Noise-driven adaptation: In vitro and mathematical analysis. Neurocomputing, 52, 877–883.
Paninski, L., Pillow, J., & Simoncelli, E. (2004). Maximum likelihood estimation of a stochastic integrate-and-fire neural encoding model. Neural Comp., 16, 2533–2561.
Petersen, R. (2003). Coding of dynamical whisker stimulation in the rat somatosensory cortex: A spike-triggered covariance analysis. Society for Neuroscience, 57.1.
Pillow, J. W., Paninski, L., Uzzell, V. J., Simoncelli, E. P., & Chichilnisky, E. J. (2005). Prediction and decoding of retinal ganglion cell responses with a probabilistic spiking model. J. Neurosci., 25(47), 11003–11013.
Pillow, J. W., & Simoncelli, E. P. (2003). Biases in white noise analysis due to non-Poisson spike generation. Neurocomputing, 52–54, 109–115.
Powers, R., Dai, Y., Bell, B., Percival, D., & Binder, M. (2005). Contributions of the input signal and prior activation history on the discharge behavior of rat motoneurones. J. Physiol., 562, 707–724.
Rieke, F. (1991). Physical principles underlying sensory processing and computation. Unpublished doctoral dissertation, University of California, Berkeley.
Rieke, F., Warland, D., Bialek, W., & de Ruyter van Steveninck, R. R. (1997). Spikes: Exploring the neural code. Cambridge, MA: MIT Press.
Rust, N., Schwartz, O., Movshon, J., & Simoncelli, E. (2005). Spatiotemporal elements of macaque V1 receptive fields. Neuron, 46, 945–956.
Simoncelli, E. P., Pillow, J., Paninski, L., & Schwartz, O. (2004). Characterization of neural responses with stochastic stimuli. In M. Gazzaniga (Ed.), The cognitive neurosciences. Cambridge, MA: MIT Press.
Slee, S. J., Higgs, M. H., Fairhall, A. L., & Spain, W. J. (2005). Two-dimensional time coding in the auditory brainstem. J. Neurosci., 25(43), 9978–9988.
Touryan, J., Lau, B., & Dan, Y. (2002). Isolation of relevant visual features from random stimuli for cortical complex cells. J. Neurosci., 22, 10811–10818.
Truccolo, W., Eden, U. T., Fellows, M. R., Donoghue, J. P., & Brown, E. N. (2005). A point process framework for relating neural spiking activity to spiking history, neural ensemble, and extrinsic covariate effects. J. Neurophysiol., 93(2), 1074–1089.
Victor, J. D., & Shapley, R. M. (1979). The nonlinear pathway of Y ganglion cells in the cat retina. J. Gen. Physiol., 74(6), 671–689.
Victor, J., & Shapley, R. (1980). A method of nonlinear analysis in the frequency domain. Biophys. J., 29(3), 459–483.
Volterra, V. (1930). Theory of functionals and of integral and integro-differential equations. New York: Dover.
Wiener, N. (1958). Nonlinear problems in random theory. Cambridge, MA: MIT Press.
Yu, Y., & Lee, T. S. (2003). Dynamical mechanisms underlying contrast gain control in single neurons. Phys. Rev. E, 68(1 Pt. 1), 011901.
Received May 31, 2006; accepted November 10, 2006.
ARTICLE
Communicated by Byron Nelson
Context Learning in the Rodent Hippocampus

Mark C. Fuhs
[email protected]
David S. Touretzky
[email protected]
Computer Science Department and Center for the Neural Basis of Cognition, Carnegie Mellon University, Pittsburgh, PA 15213, U.S.A.
We present a Bayesian statistical theory of context learning in the rodent hippocampus. While context is often defined in an experimental setting in relation to specific background cues or task demands, we advance a single, more general notion of context that suffices for a variety of learning phenomena. Specifically, a context is defined as a statistically stationary distribution of experiences, and context learning is defined as the problem of how to form contexts out of groups of experiences that cluster together in time. The challenge of context learning is solving the model selection problem: How many contexts make up the rodent's world? Solving this problem requires balancing two opposing goals: minimize the variability of the distribution of experiences within a context and minimize the likelihood of transitioning between contexts. The theory provides an understanding of why hippocampal place cell remapping sometimes develops gradually over many days of experience and why even consistent landmark differences may need to be relearned after other environmental changes. The theory provides an explanation for progressive performance improvements in serial reversal learning, based on a clear dissociation between the incremental process of context learning and the relatively abrupt context selection process. The impact of partial reinforcement on reversal learning is also addressed. Finally, the theory explains why alternating sequence learning does not consistently result in unique context-dependent sequence representations in hippocampus.

1 Introduction

Several theories have posited that the hippocampus is responsible for providing a representation of context and that this representation underpins an animal's performance of a variety of tasks sensitive to hippocampal lesions (Hirsh, 1974; Nadel & Willner, 1980; Jarrard, 1993; Levy, 1996; Wallenstein, Eichenbaum, & Hasselmo, 1998; Redish, 1999, 2001; Hasselmo & Eichenbaum, 2005). These tasks include spatial navigation, sequence
learning, and hippocampal lesion-sensitive conditioning paradigms such as reversal learning. Models have typically focused on one or two of these domains (Levy, 1989, 1996; Schmajuk & DiCarlo, 1992; Gluck & Myers, 1993; Samsonovich & McNaughton, 1997; Wallenstein & Hasselmo, 1998; Redish & Touretzky, 1998; Doboli, Minai, & Best, 2000; Hasselmo, Bodelón, & Wyble, 2002; Hasselmo & Eichenbaum, 2005), developing mechanisms of context learning that are well tailored to the domain of study but do not generalize well across domains. Moreover, empirical data from both lesion and physiology studies have called into question the viability of existing models. Attractor models of hippocampal place cell remapping (Samsonovich & McNaughton, 1997; Redish & Touretzky, 1998; Tsodyks, 1999; Doboli et al., 2000) cannot account for the gradual separation of maps between similar environments (Jeffery, 2000; Lever, Wills, Cacucci, Burgess, & O'Keefe, 2002). Hippocampal models addressing reversal learning (Schmajuk & DiCarlo, 1992; Gluck & Myers, 1993; Hasselmo et al., 2002) fail to capture the experimental observation that after a series of discrimination reversals, rats can learn to reverse behavior after a single trial (Buytendijk, 1930; Dufort, Guttman, & Kimble, 1954; Pubols, 1962). Models of sequence learning (Levy, 1989, 1996; Wallenstein & Hasselmo, 1998) are challenged by the failure to find sequence-dependent differences in hippocampal place cell activity in some studies (Lenck-Santini, Save, & Poucet, 2001; Hölscher, Jacob, & Mallot, 2004; Bower, Euston, & McNaughton, 2005).

While these previous modeling efforts have offered neural mechanisms to explain how the hippocampus remaps, the work presented here focuses on a related but distinct question: Why does the hippocampus remap? More specifically, if remapping serves to create a representation of context, then understanding remapping across problem domains requires a general and concrete definition of what a context is. We propose that context learning may be formalized as decomposing a nonstationary world of hippocampal input patterns (and the sensory input and behaviors they represent) into multiple domains, or contexts, within which the distribution of these input patterns is stationary. What critically defines a context is therefore not a particular class of stimuli (e.g., background cues) or behaviors (e.g., random versus directed foraging), but a set of time windows within which the statistical structure of sensory experiences and behaviors is stable. Such a definition of context advances its role in prediction: if recent experiences suggest that the present context is C, then other prior knowledge about context C should also be applicable for the foreseeable future (i.e., until the context changes).

The predictive power of a context is determined by both the temporal duration of the context and the (systematic) variability within a context. Contexts that generally last only a short duration fail to be useful in prediction because of the high probability that a different
context with different contingencies will soon become active. Contexts that encompass a highly variable set of sensory experiences and behaviors are also poor predictors insofar as they do not discriminate among the many possibilities within the context. As a concrete example, consider the serial reversal learning task in which animals must make a choice between two alternatives (e.g., levers, maze arms). During odd blocks of trials, one choice is rewarded; during even blocks, the other is rewarded. As training progresses, the two reward contingencies may be codified as two different contexts, each with a static reward structure in which only choice 1 is rewarded (context 1) or only choice 2 is rewarded (context 2). Thus, when a particular choice is rewarded (or not rewarded), it suggests which of the two contexts is active, thereby predicting that that choice will be rewarded on subsequent trials as well. If, however, both reward contingencies were grouped together in the same context, the identity of the active context would not be useful in predicting an optimal behavioral strategy. By contrast, in a random foraging paradigm, the random scattering of food within the environment ensures that the reward location is unpredictable. Dissecting a random foraging session into a large number of contexts, each representing a distinct location at which food was found, would result in a complex contextual representation with no predictive value. Inferring the most informative context model is therefore the central problem of interest. In this article, we formulate context learning as a model selection problem: Into how many contexts should the world be divided? We develop a statistical framework for context learning based on the hypothesis that the degree of remapping in hippocampus reflects two independent factors: the degree of similarity between contexts and the animal’s confidence that the two contexts are in fact distinct. The framework shows how contextual representations should evolve over time as the animal progresses through a training regimen, whether that regimen involves different environments, reward contingencies, or sequences. Within the framework, model selection is biased to prefer contexts that are active for longer periods of time, a key constraint that explains the development of hippocampal contextual representations in spatial and reversal learning paradigms and justifies their absence in overlapping sequence learning. The framework also distinguishes between context learning (inferring the context model), which may be a gradual process over many blocks of trials, and context selection (inferring the current context given a context model), which should be a more abrupt phenomenon. This distinction is critical to understanding serial reversal learning: substantial training is required for rats to achieve single-trial reversals, but once they are trained, single trial reversals can be realized as a context switch. Several testable predictions are made to motivate future experimental work.
2 A Statistical Framework for Context Learning

2.1 Overview

2.1.1 Hidden Markov Models and Hippocampal Activity Patterns. We model context learning as a process in which the hippocampus constructs a generative model of its inputs. Within our statistical framework, which is based on hidden Markov models (HMMs), generative models are composed of states and contexts. An HMM with a given number of states is parameterized by a distribution of expected values for each state and a transition matrix, which expresses the probability of transition from one state to another. Each state represents a conjunctive encoding of the hippocampal input and may be thought of as a particular hippocampal activity pattern. Different states are therefore used to represent different positions within an environment or different stages in a task.

A context is simply a group of states. States are grouped together into contexts such that state transitions within a context are frequent but transitions between states in different contexts are rare. Transition probabilities between states within the same context are not constrained, and each is individually parameterized, but the transition probability between states in different contexts is fixed at a small value. If, for a particular model, context switches occur frequently, the goodness of fit of the model will therefore be judged to be low. Thus, the fixed intercontext transition probability encourages context switches to be rare, or, inversely, context durations to be long.

What is the relationship between states, contexts, and hippocampal activity patterns? In an HMM, states are "identifiable" in the sense that there is an abstract label for each state that is independent of its particular parameters. This permits an HMM to contain multiple distinct states with the same parameters. This state identifiability is a theoretical construct that we view as unreasonable to apply to the hippocampus. Rather, we argue for weaker identifiability constraints:

• States within a context are distinguished only by their expected observations. Thus, within a context, different hippocampal activity patterns can be observed only when different input patterns are presented.
• Contexts are identifiable—the hippocampus forms a latent representation of context. This implies that the same input pattern presented in two different contexts will result in different hippocampal activity patterns.
These constraints suggest that an observed hippocampal activity pattern (i.e., the currently active state) is determined by two factors: afferent activity and the currently active context. Intuitively, when the same input pattern is represented by multiple hippocampal codes, each code must be part of a different context. These constraints are closely related to the behavior of
latent attractors (Doboli et al., 2000), where network activity is determined by both the input pattern driving the network and the network’s current attractor basin (which represents the current context). In order to handle input pattern noise, each state is parameterized by a distribution of input patterns. Thus, determining which state is active is a (minor) statistical inference problem, requiring the determination of the state under which the input pattern is most likely. This is qualitatively similar to pattern completion in associative memory models, where the network activity pattern resulting from a noisy input is made more similar to the best matching previously stored pattern. Several theories have implicated the hippocampus in conjunctive encoding (e.g., Rudy & Sutherland, 1995) and associative memory function (Marr, 1971; O’Reilly & McClelland, 1994; Rolls, 1996), and the formulation presented here is not incompatible with these notions. However, our work focuses not on the utility of the individual states but on their contextual organization. 2.1.2 Adopting New Models. In the hippocampus, new experimental conditions are observed to cause the creation of new place cell maps. In the model, a new experimental condition is represented by augmenting the current HMM with new states in a new model context. Given that the input patterns are noisy, under what conditions should a new context be added to the model? A larger, more complex model, one with more contexts, will inevitably provide a better fit of the input patterns. However, one should distinguish the extent to which the model is fitting the statistical structure of the inputs from the extent it is better fitting the noise. In a Bayesian setting, a model should be adopted when its posterior likelihood is higher than that of competing models. This posterior probability accounts for the complexity of the model by averaging the goodness of fit of the model over the prior uncertainty in the model’s parameterization. Larger models have more parameter values that must be specified, so the prior probability of any particular parameterization is lower. Larger models are therefore implicitly penalized for the size of their parameter space. In clear cases, such as when entering a completely new and different environment, the sensory input would differ strongly from what is expected by any state under the “current” model. A larger model, augmented with a group of states for the new environment, would therefore be immediately justifiable. In more subtle cases, such as when a new environment is similar to a familiar environment, the posterior likelihood of the larger model would rise more gradually as the animal gains more experience in each environment. This increasing experience offsets the penalty caused by the larger model’s added complexity. A distributed neural representation of context allows multiple models to be expressed simultaneously. Returning to the two-environment example, if a larger model were only weakly favored over a smaller one, then it is possible for some cells to represent the larger model, distinguishing between
the two environments, while other cells represent the smaller model, showing the same firing patterns in both. We interpret gradual increases in the number of cells that remap between contexts (gradual remapping) as observed by Lever et al. (2002) and others as a reflection of the relative degree of acceptance of each model: the increasing degree of remapping between two environments reflects the increased statistical likelihood of a model that represents them as two separate contexts. 2.1.3 Independent and Dependent Contexts. Consider again the case of two similar environments, where the input patterns at corresponding positions in each environment are similar. These two environments could be represented by a one-context HMM, where each of the p states represents a corresponding position in the two environments. Each state would therefore be optimally parameterized when tuned to the distribution of input patterns observed at that position in either environment. The two environments could also be represented by a two-context HMM, where two different groups of states, each of size p, represent the two environments. Such a two-context model would have double the number of parameters, and the Schwarz criterion (Schwarz, 1978) suggests that this corresponds to a quadratic increase in complexity. Penalized for such high complexity, preliminary simulations showed that, compared to a one-context model, a two-context model would be considered astronomically improbable until after substantial experience in both environments. This would predict that rats without substantial experience should never show any remapping between similar environments, a finding incompatible with the observation of gradual remapping. Instead of asking whether there is sufficient evidence to justify the complexity of an entirely new context, one might ask whether there is sufficient evidence to justify the complexity required to express how the second environment differs from the first. If the two environments are similar, then it may be simpler to express this difference than to express the second environment independently. This is tantamount to assuming a priori that the two contexts are related but that their differences may nonetheless be valuable to represent in separate contexts. A two-level contextual hierarchy is therefore considered in which contexts may be either independent or dependent. At the top level are independent contexts, whose states’ parameters are not statistically associated with those in any other independent context. The creation of an independent context must be justified with respect to the complexity of its entire set of parameters, which is easily done when the animal enters an environment clearly unrelated to any other. At the second level are dependent contexts, each associated with an independent context. Specifically, each state in a dependent context is paired with a corresponding state in an independent context, and states in the dependent context may share parameters with states in that independent context. In addition, the expected hippocampal
inputs of states in the dependent contexts are assumed to be similar to those of the states in their respective independent contexts. This reduces the added complexity of the dependent context to more accurately reflect the degree to which it differs from the independent context.

There is a growing body of evidence to support the notion that similar contexts are not represented independently in vivo. Very similar contexts show partially overlapping representations (partial remapping), especially when only a change in task is involved (e.g., Markus et al., 1995; Shapiro, Tanila, & Eichenbaum, 1997; Skaggs & McNaughton, 1998; Wood, Dudchenko, Robitsek, & Eichenbaum, 2000; Jeffery, 2000). Even between contexts with more pronounced differences (arena wall shape or color changes), the type of remapping observed primarily involves a change in firing rate, but not of the location of the place field (S. Leutgeb et al., 2005). We therefore interpret the degree of remapping to be determined, in the asymptote, by the similarity of the contexts. With little experience, the degree of remapping is even less, reflecting the uncertainty that the two contexts are distinct. We do not propose a specific similarity metric—any such metric would vary among subjects—but such a metric should be monotonic: changes that make two contexts more different should not result in less remapping.

Even if the contexts are not independent, why should one be represented as an independent context and the other as a dependent context? (In other words, why not represent them symmetrically as two codependent contexts without introducing an explicit hierarchy?) The asymmetry allows the models to be nested, which provides a more direct relationship between model contexts and place cell activity patterns. Consider an experiment involving two similar environmental contexts (E_1 and E_2). Before the start of the experiment, the animal has contextual representations only for its home cage, a transport box, or something similar; let this be context model M_0. Upon introduction to E_1, which is nothing like any previously experienced context, a new model is immediately adopted, M_1 = M_0 ∪ {I_1}, that is, all contexts in M_0 and an independent context I_1 for E_1. Subsequent exposure to E_2 leads to consideration of M_2 = M_1 ∪ {D_2^{I_1}} = M_0 ∪ {I_1, D_2^{I_1}}, where D_2^{I_1} is a dependent context associated with I_1 that represents E_2. As more experience is acquired in E_1 and E_2, the relative likelihood of M_2 will gradually increase.

Given a nested model structure, the only change from M_1 to M_2 is the creation of a new context D_2^{I_1} for E_2. The model context I_1 for E_1 exists (has been statistically justified) independent of any experiences in E_2. Thus, the most parsimonious hippocampal representational change would involve the creation of a contextual representation for E_2 without changing the representations of any other contexts.

Gradual remapping appears to show such an asymmetry. Lever et al. (2002) first exposed rats to a cylindrical arena (E_1) and then a square arena (E_2). Over the course of many subsequent exposures to both arenas, rats gradually remapped between them. The observed pattern of remapping
suggests that the activity of place cells in the square arena (E_2) diverged from the cylindrical arena (E_1); in contrast, fields in the cylindrical arena (E_1) appeared stable over several days.

It should be noted that whereas the experiments and simulations in this study focus on learning to distinguish two contexts, the statistical framework outlined here can be extended in a straightforward manner to allow an arbitrary number of dependent contexts. In the following sections, a detailed description is provided of how simulated hippocampal inputs are constructed, how HMMs with independent and dependent contexts are defined, and how the posterior likelihood of different models is calculated.

2.2 Simulated Hippocampal Inputs. While the inputs to the hippocampus are very high dimensional, input patterns to the model, for computational simplicity, were formulated as noisy scalar values. Input patterns were generated based on a simulated "ground truth" of the animal's actual state (e.g., location, task stage) in the world. These environmental states (e-states) compose the true generative model, not to be confused with states in the hippocampal HMM model. Each e-state was assigned a positive index i, and the hippocampal input value y generated for that e-state was 8(i − n/2) + η, where n is the number of e-states in the environment and η is a gaussian noise term; the y values are therefore roughly centered around 0, as expected under the prior (see below). The standard deviation of the noise term was 0.125, which ensures that different e-states within an environmental context are unambiguously distinct.

To simulate an experiment involving multiple sequences or tasks, a single set of e-states was used; only their order of presentation changed. For example, multiple spatial sequences were simulated using the same set of position e-states, but the order of visited positions was different for each sequence. To simulate small environmental changes, each e-state's input value was mildly perturbed (see specific simulations for details).

Experiments were modeled in discrete time: each discrete time step had an associated e-state. To model a specific experimental paradigm of duration T, a sequence of T e-states was produced, representing the entire course of the experiment over many days. A sequence of hippocampal inputs, y_{1,...,T}, was then generated based on the e-state sequence.
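To make the input model concrete, the sketch below generates such a sequence for the alternating-arena schedule used in section 3.1 (2 pedestal steps, then 10 steps alternating between positions A and B, over 16 sessions, giving 192 samples). For brevity, the arena perturbation is applied with a single sign to both positions, a simplification of the per-position signs described in section 3.1.

```python
import numpy as np

rng = np.random.default_rng(1)
NOISE_SD = 0.125          # noise term SD from the text

def inputs_from_estates(estates, n, eps=0.0):
    """y = 8 * (i - n/2) + eps + gaussian noise, for e-state indices i."""
    idx = np.asarray(estates, dtype=float)
    return 8.0 * (idx - n / 2.0) + eps + rng.normal(0.0, NOISE_SD, idx.size)

def session(arena_eps, n=3):
    """2 pedestal steps (e-state 0), then 10 steps alternating A/B."""
    pedestal = inputs_from_estates([0, 0], n)
    visits = inputs_from_estates([1, 2] * 5, n, eps=arena_eps)
    return np.concatenate([pedestal, visits])

# 16 alternating sessions: square arena (-eps) vs. cylinder (+eps).
eps = 0.175
y = np.concatenate([session(-eps if k % 2 == 0 else +eps)
                    for k in range(16)])
print(y.shape)   # (192,) -- the total sequence length quoted in section 3.1
```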
2.3 Hidden Markov Models with Independent and Dependent Contexts. Our simulations used gaussian HMMs as a model of the hippocampal representation of its input patterns. By using HMMs, we do not intend to suggest that the hippocampus is a finite state machine; rather, HMMs provide a convenient statistical framework in which to represent both a mixture of many different input patterns and their temporal structure (or a Markov approximation thereof).

Formally, a gaussian HMM is composed of a set of N_S states, each parameterized by a normal distribution of expected values, N(μ_s, σ_s²). A transition probability matrix, A_{ji}, defines the probability of transitioning from state i to state j at each time step. Additionally, a starting state probability p_0 must be specified; in these simulations, the starting state probability was assumed to be uniform over all states.

In our framework, states are organized into multiple contexts, an organization realized by restrictions placed on transition probabilities. Transition probabilities between states in the same context could vary to fit the observed sequence of transitions in y. However, the sum of transition probabilities to all states in other contexts was fixed at γ = 0.05. This is the critical parameter that determines how strongly models are biased against context changes.

If the total between-context transition probability is γ, what is the transition probability from some state s to some other state s′ in a different context? This is defined based on a hierarchical organization of states and contexts. Consider a model M_k composed of N_S states, S_1, ..., S_{N_S}. These states are organized into N_C contexts, C_1, ..., C_{N_C}. Contexts are organized into N_G context groups, G_1, ..., G_{N_G}, where each context group contains exactly one independent context and any dependent contexts associated with it. For the purpose of defining transition probabilities, the model M_k is therefore organized as a mixture of context groups, each context group a mixture of contexts, each context a mixture of states. For each mixture, the component probabilities are set to be equal. Thus, the probability of transitioning to a particular context group, G_g, is p_1 = 1/N_G. The probability of transitioning to a particular context in that group is p_2 = 1/|G_g|, the inverse of the number of contexts in group G_g. The probability of transitioning to a particular state in that context is p_3 = 1/|C_c|, the inverse of the number of states in context C_c. Thus, the probability of transitioning from state s to state s′ is γ p_1 p_2 p_3. When the values p_1 and p_2 are calculated, the context of s is first excluded, since these transitions are only between states in different contexts.

The hierarchical organization of contexts helps us take into account the fact that there are other unrelated contexts outside the experimental apparatuses. A fixed value of N_G is used to reflect that the transition probabilities among unrelated contexts would not change with the addition of one or two more. For the multiple environments simulations, N_G = 3, since up to three independent contexts are considered; for the other simulations, N_G = 1. (How exposure to just a few or to many completely different environments affects remapping is not known, but an increased willingness to remap after experiencing many previous environments would argue instead that N_G is explicitly represented in the hippocampus.)
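The transition structure just described is mechanical to construct. The following sketch (our own illustration, not the authors' code) builds the transition matrix for a toy model with a single context group containing two contexts, splitting the fixed between-context mass γ = 0.05 according to the p_1 p_2 p_3 hierarchy; within-context rows are sampled from the Dirichlet prior.

```python
import numpy as np

rng = np.random.default_rng(2)
GAMMA, N_G = 0.05, 1      # between-context mass; fixed number of groups

contexts = [2, 2]         # states per context (independent, dependent)
group = [0, 1]            # one context group holding both contexts

sizes = np.array(contexts)
offsets = np.concatenate([[0], np.cumsum(sizes)])
n_states = sizes.sum()
A = np.zeros((n_states, n_states))

for c, size in enumerate(contexts):
    rows = slice(offsets[c], offsets[c + 1])
    # Within-context rows: Dirichlet draws scaled to total mass 1 - gamma.
    A[rows, rows] = (1 - GAMMA) * rng.dirichlet(np.full(size, 0.8), size)
    for c2, size2 in enumerate(contexts):
        if c2 == c:
            continue
        p1 = 1.0 / N_G                                  # pick the group
        p2 = 1.0 / len([x for x in group if x != c])    # pick a context,
        p3 = 1.0 / size2                                # then a state
        A[rows, offsets[c2]:offsets[c2 + 1]] = GAMMA * p1 * p2 * p3

print(A.round(3))
print("row sums:", A.sum(axis=1))   # each row sums to 1 in this toy case
```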
We now describe how states in independent and dependent contexts are parameterized. In particular, we describe a model in which a state in a dependent context may share parameters with a state in an independent context, and the prior probabilities of parameters in dependent contexts may depend on corresponding parameters in their respective independent contexts. Qualitatively, the particular scheme for parameter sharing and the priors used here have a singular purpose: to reduce the size of the parameter space (i.e., the complexity) of the dependent context. Other formulations that achieve the same goal would have been equally acceptable, so readers not well versed in statistics may safely skip the remainder of this section.

To define the parameters of a model, consider a state ŝ in a dependent context and its corresponding state s in the associated independent context. The independent state's parameters are defined purely in terms of the state's transition probability vector, a_s, and mean and variance terms, μ_s and σ_s²:
\[
A_{*s} = a_s, \tag{2.1}
\]
\[
p(y \mid s) \sim N(\mu_s, \sigma_s^2), \tag{2.2}
\]
where A_{*s} denotes a row of the HMM's transition probability matrix and y is the observed input pattern. The dependent state ŝ is defined with respect to both the set of parameters for state s and a set of parameters unique to state ŝ:
\[
A_{*\hat{s}} = (1 - z_{\hat{s}})\, a_s + z_{\hat{s}}\, a_{\hat{s}}, \tag{2.3}
\]
\[
p(y \mid \hat{s}) \sim (1 - \zeta_{\hat{s}})\, N(\mu_s, \sigma_s^2) + \zeta_{\hat{s}}\, N(\mu_{\hat{s}}, \sigma_{\hat{s}}^2). \tag{2.4}
\]
The mixing parameters z_ŝ and ζ_ŝ govern the relative contribution of each component. This allows a dependent state to, for example, share the same transition probability vector with its corresponding independent state (z_ŝ ≈ 0) but adopt a different distribution of y values (ζ_ŝ ≈ 1). When a mixing parameter is close to zero, the second mixture component plays no role in the likelihood of y under the model, thus reducing the model's effective complexity.

The prior over each mixing parameter was a highly sparse beta distribution that mildly favored smaller values: z_ŝ ~ Beta(δ_1, δ_2) and ζ_ŝ ~ Beta(δ_1, δ_2). All fixed hyperparameters are listed in Table 1. For the first components' parameters, vague priors were used: μ_s ~ N(ξ, κ_1^{-1}), σ_s^{-2} ~ Γ(α_1, β_1), and a_s ~ Dirichlet(δ_a, δ_a, ..., δ_a), where δ_a specifies the prior over every within-context transition probability. The prior over each a_ŝ was the same as a_s. The priors for μ_ŝ and σ_ŝ were biased to be somewhat similar to μ_s and σ_s, further reducing the dependent context's contribution to the complexity penalty: μ_ŝ ~ N(μ_s + h, κ_2^{-1}) and σ_ŝ^{-2} ~ Γ(α_2, α_2 σ_s²). The purpose of h was to encourage model parameterizations in which ζ_ŝ = 0 instead of ζ_ŝ = 1 and μ_ŝ = μ_s. A more sophisticated approach might involve slice sampling from the nonconjugate prior μ_ŝ ~ I(μ_s, r) N(μ_s, κ_2^{-1}), where the indicator function I is zero over some interval around the mean,
[μ_s − r, μ_s + r]; however, the simpler procedure was sufficient for our simulations.

Table 1: Hyperparameter Values That Define the Prior Distributions over Each Parameter.

HMM Parameter                                    Prior      Value
Transition probabilities a_s                     δ_a        0.8
Parameter sharing probabilities z_ŝ and ζ_ŝ      δ_1, δ_2   0.1, 0.05
Mean of μ_s                                      ξ          0
Inverse variance of μ_s                          κ_1        0.01
Inverse variance of μ_ŝ                          κ_2        4
Shape parameter of σ_s^{-2}                      α_1        2
Shape parameter of σ_ŝ^{-2}                      α_2        10
Rate parameter of σ_s^{-2}                       β_1        0.1
Offset of mean of μ_ŝ from μ_s                   h          0.4

2.4 State Inference and Model Selection. Let M_k denote a model composed of N_S^k states grouped into N_C^k contexts. Let θ_k = (z, ζ, A, μ, σ²) denote the parameters of the states in M_k. Given a sequence of noisy inputs, y_{1,...,T}, the state inference problem is to infer under which sequence of states the inputs are most probable. For state inference, it is assumed that both M_k and θ_k are known and that the HMM is the generative model of y_{1,...,T}, that is, each HMM state corresponds to an e-state in the world. Thus, inferring the HMM state serves as a proxy for inferring the true e-state. This is traditionally done using the Viterbi algorithm (Viterbi, 1967).
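A minimal log-space Viterbi decoder for a gaussian-emission HMM of this kind might look as follows (a generic textbook implementation, not the authors' code; the uniform starting distribution matches the assumption stated above).

```python
import numpy as np
from scipy.stats import norm

def viterbi(y, log_A, mus, sigmas):
    """Most probable state sequence for scalar observations y under a
    gaussian-emission HMM with uniform starting probabilities."""
    y = np.asarray(y)
    n_states = len(mus)
    log_emit = norm.logpdf(y[:, None], mus, sigmas)     # (T, n_states)
    delta = -np.log(n_states) + log_emit[0]             # best path log prob
    back = np.zeros((len(y), n_states), dtype=int)      # backpointers
    for t in range(1, len(y)):
        cand = delta[:, None] + log_A                   # cand[i, j]: i -> j
        back[t] = np.argmax(cand, axis=0)
        delta = np.max(cand, axis=0) + log_emit[t]
    path = [int(np.argmax(delta))]
    for t in range(len(y) - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Two states 8 units apart (the e-state spacing), noise SD 0.125, sticky
# transitions: the decoded path tracks the generating state closely.
log_A = np.log(np.array([[0.95, 0.05], [0.05, 0.95]]))
print(viterbi([-4.1, -3.9, 4.0, 3.8, -4.0], log_A,
              np.array([-4.0, 4.0]), np.array([0.125, 0.125])))
```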
The context learning problem can be understood as part of the more fundamental problem of inferring the generative model of the input patterns: given the sequence y_{1,...,T}, infer the HMM most likely to have generated it. This can be decomposed into two parts: model parameterization and model selection. Model parameterization involves inferring the most likely set of parameters, θ_k, given an HMM M_k with a known structure. Model selection involves inferring which model structure is most likely out of a set of candidates, typically with different numbers of states. While model selection in a machine learning context typically involves a single batch analysis, the hippocampus provides some representation of context even in completely novel environments. Multiple models with different numbers of contexts were therefore compared based on initial subsequences, y_{1,...,τ}, τ < T. As τ increased, the models were provided with progressively more input patterns to fit, resulting in changes in their likelihoods.

In a Bayesian setting, models are compared based on relative likelihoods, given the observed data (y_{1,...,T}). Assuming equal prior probability across models, this can be reduced to calculating the odds of y_{1,...,τ} under each model:
\[
B_k = \frac{P(y_{1,\ldots,\tau} \mid M_k)}{P(y_{1,\ldots,\tau} \mid M_{k'})}. \tag{2.5}
\]
This is known as the Bayes factor (for review, see Kass & Raftery, 1995), and all model comparisons calculated will be presented as a log Bayes factor. p(y_{1,...,τ} | M_k) is the marginal likelihood of y_{1,...,τ} under M_k, marginalized over the entire parameter space:
\[
p(y_{1,\ldots,\tau} \mid M_k) = \int_{\theta_k \in \Theta} p(y_{1,\ldots,\tau} \mid \theta_k, M_k)\, p(\theta_k \mid M_k)\, d\theta_k, \tag{2.6}
\]
where p(y_{1,...,τ} | θ_k, M_k) is the conditional probability of the inputs given a particular parameterization of the model, and p(θ_k | M_k) is the prior probability density over the parameter space. As mentioned above, equation 2.6 provides an implicit complexity penalty. Importantly, while p(y_{1,...,τ} | θ_k, M_k) depends on y_{1,...,τ}, p(θ_k | M_k) does not. Since the difference in conditional probability between models typically grows with the length of y_{1,...,τ}, the Bayes factor is asymptotically determined by how well each model fits y_{1,...,τ}. However, for smaller input sequences (τ ≪ T), the prior uncertainty over the parameter space, p(θ_k | M_k), can strongly influence the Bayes factor.

The computational challenge in calculating Bayes factors is to accurately calculate p(y_{1,...,τ} | M_k), which requires integration over the parameter space (see equation 2.6). A closed-form solution does not exist, but any accurate numerical approximation method is equally acceptable, and the results presented here do not depend on the particular method employed. We use a Monte Carlo integration technique described in the appendix.
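To see the implicit complexity penalty of equation 2.6 at work, the following miniature compares a one-context and a two-context account of noisy samples from two overlapping clusters. It departs from the text's setup in two labeled ways: context identity is treated as known, and the model is iid gaussian with a conjugate prior on the mean, so the integral of equation 2.6 has a closed form instead of requiring Monte Carlo integration; the prior SD s = 1 is an assumption.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(3)

def log_marginal(y, s=1.0, sigma=0.125):
    """Exact marginal likelihood of y under one shared mean mu ~ N(0, s^2)
    with iid N(mu, sigma^2) noise: y ~ N(0, sigma^2 I + s^2 11^T)."""
    n = len(y)
    cov = sigma**2 * np.eye(n) + s**2 * np.ones((n, n))
    return multivariate_normal(np.zeros(n), cov).logpdf(y)

eps, sigma = 0.175, 0.125          # values from section 3.1
for tau in (1, 2, 4, 8, 16, 32):   # visits per arena
    a = rng.normal(-eps, sigma, tau)   # square-arena samples (label known)
    b = rng.normal(+eps, sigma, tau)   # cylinder samples (label known)
    one_ctx = log_marginal(np.concatenate([a, b]))   # single shared mean
    two_ctx = log_marginal(a) + log_marginal(b)      # one mean per context
    print(f"tau={tau:2d}  log Bayes factor (two vs. one contexts): "
          f"{two_ctx - one_ctx:+.1f}")
```

As experience accumulates, the log Bayes factor grows in favor of the two-context account, mirroring the gradual, evidence-driven model adoption described in the next section.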
3 Simulation Results: Multiple Environments

3.1 Gradual Remapping. Recent data have demonstrated that the development of distinct spatial maps for two environments can be gradual (Tanila, Shapiro, & Eichenbaum, 1997; Jeffery, 2000; Lever et al., 2002). Tanila et al. (1997) found that repeatedly rotating two sets of cues in opposite directions engendered an increase in remapping between the cue configurations over time. Jeffery (2000) found that when the same arena was placed in two different room locations, the number of place fields that differentiated between arena room locations gradually increased over several days. Lever et al. (2002) similarly found that when hunting for food pellets alternately in cylindrical and square arenas, different place cells developed distinct place fields in each arena on different days; two weeks of training were required to achieve complete remapping. In each case, repeated conflicts between cue configurations were gradually resolved by developing separate contexts for each configuration.

Our modeling of gradual remapping is described with reference to the Lever et al. (2002) study, though it could equally well be interpreted with respect to the others. The arena geometry change is interpreted as a small but consistent difference in the afferent input pattern for which a model with a separate context for each arena will, with enough experience, be better suited.

To model gradual remapping, we constructed input sequences, y_{1,...,T}, generated from alternating sessions in each arena, interposed by time spent on a holding pedestal (during which an experimenter would swap arenas). For computational tractability, two positions within the environment—A and B—were modeled (e.g., one in each half of the arena). The simulated rat spent 2 time steps on the pedestal and 10 time steps in each arena, alternating between the positions in order to remove trajectory differences that might suggest the environments were different. An entire sequence contained 16 environment visits, half in each arena, for a total sequence length of 192 samples. At each position, the input was perturbed depending on the shape of the arena. For example, at position A, the input had mean μ_A − ε in the square arena but μ_A + ε in the cylindrical arena. The input standard deviation for all states was σ = 0.125, and the value of ε used was 0.175, so the input distributions of the two contexts overlapped.

The likelihoods of the one- and two-arena models shown in Figure 1 were compared using the model selection framework described above. The "no pretraining" line in Figure 2 shows that as more experience is gained in each environment, the Bayes factor gradually increases toward a decisive positive value. This suggests that the observed gradual remapping reflects an underlying statistical process: an evidence-based transition to a more complex contextual model.

If one were to pretrain rats by exposing them first to one arena for multiple sessions, how should this pretraining experience affect the rate of remapping? The "N sessions" lines in Figure 2 reflect a training sequence that starts with N sessions of pretraining in one environment, followed by alternating sessions in the two environments. These simulations predict that pretraining should both delay the onset of gradual remapping and hasten its completion. With few experiences, the second arena looks like a "noisy" version of the first; however, the larger sequence of experiences due to pretraining eventually permits the environments to be distinguished more rapidly.

3.2 Failure to Generalize. In an extension of the Jeffery (2000) study, Hayman, Chakraborty, Anderson, and Jeffery (2003) trained rats in a box placed in two different locations in the room. After place cells gradually learned to remap between box positions, the color of the box and floor was
Figure 1: Diagram of one- and two-arena models used to model gradual remapping. The small white boxes represent states in each model, and the large gray boxes indicate how the states are grouped into contexts. Arrows indicate transition probability parameters that are either sampled (within a context) or fixed (between contexts).
changed from white to black. This substantial sensory change resulted in an immediate, complete remapping. Interestingly, despite multiple days of training in a white box in the same two locations, place cells did not immediately discriminate the two locations of the black box. Their result can be understood as a consequence of the relative differences of the box position and box color manipulations. Specifically, while the position shift created a relatively subtle change, the color change was large.

We model their experiment by extending the gradual remapping simulation to include two additional "black box" contexts whose hippocampal inputs are similar to each other but not to states in the original two contexts. Specifically, we constructed input sequences, y_{1,...,T}, generated from alternating sessions in the white box in room locations 1 and 2, followed by alternating sessions in the black box in room locations 1 and 2. Box visits were, as before, interposed by time spent on a holding pedestal. The two white box positions were modeled identically to the square and cylindrical arenas in the previous simulation. The two black box room locations were modeled in an analogous fashion: the hippocampal inputs for corresponding rat positions in the two black boxes differed by ±ε, but their μ values differed substantially from hippocampal inputs in the white box.

Figure 3 shows the three- and four-context models evaluated on the sequence of hippocampal inputs. Figure 4 shows the evolution of the likelihoods of the three- and four-context models over the course of training in the black box, subsequent to full training in the white box. For comparison,
Figure 2: Gradual remapping simulation results, where the log Bayes factor indicates the log likelihood of the two-arena model relative to the one-arena model (see model description in Figure 1). With more experience in each environment, the relative likelihood of the two-arena model increases toward certainty. As a result of first pretraining in one arena, the two-arena model is initially less likely, but preference for this model then increases more rapidly. The dotted lines at ±5 denote the thresholds beyond which there is essentially no statistical uncertainty in which model is preferred.
training of one- and two-context models in the white box (structured as in Figure 1) is shown as well. Since the white and black boxes differ so strongly, the question of whether the black box locations should be represented as one context or two is statistically unrelated to the representation of the white box locations. The relative likelihood of the four-context model therefore increases at the same gradual rate, predicting that rats should show the same gradual remapping between room locations in both white and black boxes. Consistent with this prediction, Hayman et al. (2003) found that the one rat whose rate of remapping was fastest remapped between black box locations at the same rate as between white box locations. (The black box sessions were not continued long enough to assess remapping in the rats that were slower to remap in the white box.)

3.3 Morph Environments. J. Leutgeb et al. (2005) trained rats in square and cylindrical environments until the degree of remapping between arenas reached asymptote. Then they exposed the rats to a sequence of arenas whose wall shape systematically morphed between the square and cylinder. They found that the rats' place cell activity in the morph environments reflected various averages of the two maps.
Figure 3: Diagram of three- and four-context models used to model the Hayman et al. (2003) study. The two white box contexts differ subtly, as do the two black box contexts. However, the white box and black box contexts are highly differentiated.
What if the rats were first trained on the morph environments and then on just the square and cylindrical environments? In the gradual remapping simulations, separate contexts for the square and cylindrical arenas develop because they result in distinct clusters of hippocampal input patterns. However, substantial pretraining in the morph environments would generate one broad cluster of input patterns, predicting that morph training should inhibit remapping during subsequent training in just the square and cylindrical arenas.

To model this hypothesized experiment, five arenas were used, and input patterns were generated that varied linearly from the square arena (arena 1) to the cylindrical arena (arena 5). For example, at position A, the inputs in the five arenas had means μ_A − ε, μ_A − ε/2, μ_A, μ_A + ε/2, and μ_A + ε.
Figure 4: Simulation results for the Hayman et al. (2003) study, where the log Bayes factor indicates the log likelihood of the four-arena model relative to the three-arena model or the two-arena model relative to the one-arena model (see model description in Figure 3). The preference for the four-arena model over the three-arena model (black box) increases at the same rate as the two-arena model over the one-context model (white box). A copy of the white box log Bayes factors is superimposed on top of the black box log Bayes factors to illustrate their similarity.
We constructed input sequences beginning with sessions in just the morph arenas, selecting arena 3 twice as often as arena 2 or 4 to reinforce a single cluster distribution of input patterns. These sessions were followed by alternating sessions in the square and cylindrical arenas. All other details were the same as in the gradual remapping simulation described previously.

Simulation results are shown in Figure 5. Compared to the baseline condition (no morph training), the 16 morph training sessions roughly doubled the time required to complete remapping during subsequent cylinder and square training. With 16 pretraining sessions, remapping was further delayed. This prediction stands in direct contrast to hippocampal models based on independent component analysis (ICA), which would predict place code differentiation during the morph pretraining, as the arena shape acts as an independent source of variation (Lőrincz & Buzsáki, 2000).

The disorientation study by Knierim, Kudrimoti, & McNaughton (1995) invites a similar interpretation. Rats were trained to forage for food in a cylinder with an orienting cue card. They found that when rats were disoriented by carrying them around in a closed box before being placed in the arena, their place fields failed to align with the cue card after several sessions, while nondisoriented rats maintained the alignment. They interpreted their results to suggest that the cue card was perceived to be in a different location each time the rats entered the arena after disorientation.
Figure 5: Morph experiment simulation results, where the log Bayes factor indicates the log likelihood of the two-context model relative to the one-context model (see model description in Figure 1). Pretraining in the morph environments inhibits adoption of a two-context model during later training in just the square and cylindrical arenas. During morph session training, the Bayes factor decreases, suggesting that the morph environments are best represented together as a single context. During subsequent training on the square and cylindrical environments, more training sessions are required to justify separate contexts for the two environments than would be required without morph environment pretraining. Simulation results involving morph training were calculated after each block of eight sessions, and simulation results not involving morph training were calculated after each block of four sessions. All plots begin after completion of the first block of training, that is, after four or eight sessions.
After several sessions, the cue card was perceived as "unstable" and therefore was ignored. Moreover, even after multiple subsequent sessions without disorientation, place fields never developed a consistent alignment with the cue card. Their experiment likely reflects the influence of the head direction system on the place code more than any computational process within the hippocampus itself. Nonetheless, their findings suggest a similar type of statistical inference elsewhere in the brain: if a cue varies in an apparently random manner, its impact on the overall representation of an environment should be minimized.

4 Simulation Results: Reversal Learning

While many conditioning paradigms are not hippocampus dependent, reversal learning has consistently shown dependence on hippocampal
function (Kimble & Kimble, 1965; Silveira & Kimble, 1968; Winocur & Olds, 1978; Berger & Orr, 1983; Neave, Lloyd, Sahgal, & Aggleton, 1994; McDonald, Ko, & Hong, 2002; Ferbinteanu & Shapiro, 2003). Consistent with this interpretation are place cell studies showing context-specific firing patterns during spatial reversal learning tasks (Ferbinteanu & Shapiro, 2003; Smith & Mizumori, 2006).

One of the most interesting aspects of reversal learning is that repeated reward reversals lead to progressively faster behavioral reversal. Two studies have demonstrated that after repeated reward reversals, rats are capable of reversing behavior after a single error trial (Buytendijk, 1930; Dufort et al., 1954). Additionally, Pubols (1962) showed near-perfect reversal performance (fewer than 0.5 errors on average after the initial error trial), and Brunswick (1939) showed that even when single-trial reversal performance is not yet achieved, most of the improvement is observed on the second trial. Thus, rats can be trained to select a different (previously learned) behavioral strategy after a single error trial.

Several studies have also explored the impact of partial reinforcement on reversal learning, considering cases in which the "correct choice" is rewarded on only some percentage of trials, as well as cases in which the "incorrect choice" is also rewarded on some smaller percentage of trials (Brunswick, 1939; Wike, 1953; Grosslight, Hall, & Scott, 1954; Elam & Tyler, 1958; Wise, 1962; Pennes & Ison, 1967). The pattern of data across these studies suggests that the more similar the original and reversed contingencies, the more slowly the animal learns the reversal. While intuitive, this suggests that the impact of a trial on a reward association is weighted by how informative the trial is perceived to be. If the expectation of a particular outcome (reward, no reward) is more uncertain, then observing the outcome provides less information about whether the distribution of outcomes has changed.

Previous approaches to modeling reversal learning have posited that the discrimination is relearned during each reversal. Hasselmo et al. (2002) theorized that the hippocampal facilitation of reversal learning was due to quick unlearning and relearning of the association between choice and reward within the hippocampal representation. Learning in their model is unsupervised (Hebbian), and they suggest that reversal learning deficits due to hippocampal or theta modulation impairment are caused by the inability to separate new learning from past associations. Unfortunately, since this model completely relearns the reward association after each reversal, it predicts no savings with repeated reversals. Also, since the current association is dissociated from the previous association, partial reinforcement would not affect the speed of reversal learning under this model.

Another series of models has proposed that the hippocampus plays a role in stimulus representation (Gluck & Myers, 1993, 1996; Myers, Gluck, & Granger, 1995), performing both stimulus compression and predictive stimulus differentiation. A similar model has been proposed by Schmajuk &
DiCarlo (1991, 1992). Both models propose that the hippocampus is critically involved in learning a conjunctive stimulus layer (hidden layer), though their network topologies differ somewhat. In addition, both models train the hidden layer using variants of the backpropagation learning rule. Schmajuk and DiCarlo propose a more direct, biologically plausible implementation, while Gluck and Myers (1993, 1996) suggest only that a functionally similar computational process occurs in vivo. Interestingly, "context" is represented in both models as an external sensory input that is integrated with other inputs in the hidden layer, rather than as an internal categorization of the input that is inferred.

Both models demonstrate some savings during repeated retraining (on serial reversals or serial extinctions and renewals). In the Gluck and Myers model, hidden layer discrimination increases gradually, making associations with the output layer simpler to learn. In the Schmajuk and DiCarlo model, the learning rate of the input-to-hidden-layer weights is higher, so the increased weighting of the hidden layer representation at the output layer over repeated reversals leads to faster relearning. In either case, both models fundamentally rely on some degree of retraining during each reversal. However, the single-trial reversals observed experimentally occurred based on an error trial alone (Buytendijk, 1930; Dufort et al., 1954). Even with an arbitrarily high learning rate, it is not clear how one could retrain a backpropagation network without at least one positive trial. In addition, the partial reinforcement conditions are not well explained by these models, or by other delta-rule-based models (e.g., Rescorla & Wagner, 1972; Pearce & Hall, 1980), since learning is not adjusted based on the entropy of the trial's outcome.

Similarly, none of these models accounts well for the partial reinforcement extinction (PRE) effect, in which extinction training is prolonged following partial reinforcement. With generalized delta rule learning, partial reinforcement results in a weaker association between behavioral choice and outcome. During a reversal (or extinction), this weaker association would be easier to expunge than a stronger association would be. However, the data show the opposite result. A few recent statistical models account well for the basic PRE effect (Gallistel & Gibbon, 2000; Courville, 2006), but they do not address the progressive improvement observed in reversal learning.

An alternative interpretation of the reversal learning data, proposed by Hirsh (1974), is that the hippocampus represents each reward condition as a separate context. If the reward contingencies of both contexts are represented simultaneously, then after initial training, no retraining should be required after each reversal. Rather, recent trial outcomes can be compared to prior knowledge of each context to infer which context is active. Similarly, "sequential theory" posits that the pattern of prior rewards forms a context that influences which conditioned associations are retrieved from memory (Capaldi, 1994; Capaldi & Neath, 1995).
Figure 6: Diagram of one- and two-context models used to model reversal learning. Arrows within each context indicate transition probabilities that are typically high; however, transitions between any two states are possible.
In a full-reinforcement reversal learning paradigm, knowledge of both contexts allows a single error trial to be sufficient to infer a context switch, since the choice made on the error trial should yield no reward only in "the other" context. One might think of this process as analogous to the problem of self-localization in spatial navigation, a function that has also been attributed to the hippocampus (Touretzky & Redish, 1996). The statistical formulation of context learning also correctly quantifies the increased uncertainty in the adoption of a new context under partial reinforcement conditions. Interestingly, if the reversal is performed in a different environment, the environmental cues substantially improve adaptation to the reversal, even in hippocampal animals (McDonald et al., 2002). This finding reinforces the notion that what the hippocampus contributes to reversal learning is a contextual cue that separates the two discriminations.

To model reversal learning, we constructed a simple model of discrimination learning (see Figure 6). The discrimination begins at the start state, denoting the availability of two options (A and B). The start state leads to either of two states reflecting the rat's choice. The choice states in turn lead to reward and no-reward states, indicating the outcome of that choice. (Such reward states are justified by studies showing reward-related responses in the hippocampus; Tabuchi, Mulder, & Wiener, 2003; Smith & Mizumori, 2006.)
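To make the state space concrete, the following sketch builds row-stochastic transition matrices for one- and two-context versions of this model. The within-context wiring follows Figure 6, but the fixed reward and switch probabilities are illustrative placeholders for the model's learned, Dirichlet-distributed transition parameters.

```python
import numpy as np

# States within one context, in order:
STATES = ["start", "choice_A", "choice_B", "reward", "no_reward"]

def context_transitions(p_reward_A, p_reward_B, p_choose_A=0.5):
    """Row-stochastic matrix T[i, j] = p(next state j | current state i)."""
    T = np.zeros((5, 5))
    T[0, 1], T[0, 2] = p_choose_A, 1 - p_choose_A   # start -> choice A/B
    T[1, 3], T[1, 4] = p_reward_A, 1 - p_reward_A   # choice A -> outcome
    T[2, 3], T[2, 4] = p_reward_B, 1 - p_reward_B   # choice B -> outcome
    T[3, 0] = T[4, 0] = 1.0                         # outcome -> start
    return T

def two_context_transitions(p_switch=0.01):
    """Context 1: A rewarded; context 2: B rewarded. Transitions between
    any two states are possible, but cross-context moves are unlikely."""
    T1 = context_transitions(p_reward_A=1.0, p_reward_B=0.0)
    T2 = context_transitions(p_reward_A=0.0, p_reward_B=1.0)
    T = np.zeros((10, 10))
    T[:5, :5] = (1 - p_switch) * T1   # stay in context 1
    T[5:, 5:] = (1 - p_switch) * T2   # stay in context 2
    T[:5, 5:] = p_switch * T2         # rare switch into context 2
    T[5:, :5] = p_switch * T1         # rare switch back to context 1
    return T
```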
Figure 7: Serial reversal learning simulation results, where the log Bayes factor indicates the log likelihood of the two-context model relative to the one-context model (see model description in Figure 6). With only three trials per reversal, the Bayes factor decreases, suggesting that rapid reversals are indistinguishable from partial reinforcement. Less frequent reversals (10 trials per reversal) lead to an increase, supporting the adoption of a separate context for each reward contingency.
4.1 Serial Reversal Learning

To simulate serial reversal learning training, alternating blocks of 10 trials were generated in which, for odd blocks, choice A was rewarded and, for even blocks, choice B was rewarded. To simulate behavioral learning, the rat's choice was selected randomly such that during the first block, the probability of selecting the correct choice exponentially approached 58.5% from 50%; during the second block, the correct choice probability exponentially approached 62% from 41.5% (100% − 58.5%); and during the third block, it exponentially approached 65.5% from 38%. This continued for 10 blocks, with the final correct choice probability increasing by 3.5% in each successive block. The exponential "decay" of choice probabilities within each block approximates the trial-by-trial error data of Brunswick (1939).

One- and two-context models, shown in Figure 6, were compared as the number of training blocks was increased; the results are shown in Figure 7. With increased training, the likelihood of the two-context model gradually increases, predicting a gradual adoption of context-specific hippocampal firing patterns. Smith and Mizumori (2006) showed that context specificity developed during reversal learning training, though they did not examine the time course of remapping across training sessions. Figure 8 illustrates context selection in the two-context model, demonstrating how a context switch can be inferred after a reversal.
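A minimal sketch of this trial-generation procedure follows. The exponential time constant tau is a placeholder, since the text specifies only each block's starting and asymptotic correct-choice probabilities.

```python
import numpy as np

rng = np.random.default_rng(1)

def serial_reversal_trials(n_blocks=10, trials_per_block=10,
                           p_start=0.5, p_final=0.585, step=0.035,
                           tau=3.0):
    """Simulated choices and rewards: within each block the probability
    of a correct choice relaxes exponentially (time constant tau, in
    trials) from its starting value toward the block's asymptote."""
    trials = []
    for block in range(n_blocks):
        correct = "A" if block % 2 == 0 else "B"   # reward alternates
        for t in range(trials_per_block):
            p = p_final + (p_start - p_final) * np.exp(-t / tau)
            is_correct = rng.random() < p
            choice = correct if is_correct else ("B" if correct == "A" else "A")
            trials.append((choice, is_correct))    # rewarded iff correct
        p_start = 1.0 - p_final    # carry-over bias after the reversal
        p_final += step            # asymptote improves 3.5% per block
    return trials
```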
Figure 8: Context selection in the two-context model. Each circle represents a state in the HMM, and each column represents the full set of states of the two-context model at a particular point in time. The gray box shows e-states (hippocampal inputs) from three trials. Trials k − 2 and k − 1 are at the end of a block of trials in which choice A is rewarded. Trial k is the first trial of a new block in which choice B is rewarded. HMM states that are probable given the hippocampal input are darkened. With full reinforcement, only one choice in each context is associated with reward. Therefore, while the dashed-line path (remaining in Context #1) initially appears to be more likely, the lack of reward indicates that trial k is actually part of Context #2. Once the context switch has been inferred, the most likely path through the states will continue within Context #2 until choice B is no longer rewarded.
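Context selection of this kind amounts to finding the most likely state path through the HMM. The following generic Viterbi sketch (assuming a row-stochastic log transition matrix and precomputed per-state observation log likelihoods) illustrates the inference; applied to the two-context matrices from the earlier sketch, a single unrewarded A-choice after a rewarded run should pull the maximum likelihood path into Context #2, as in the figure.

```python
import numpy as np

def viterbi(log_T, log_obs):
    """Most likely state path. log_T[i, j] = log p(next j | current i);
    log_obs[t, j] = log p(y_t | state j). A uniform initial-state
    distribution is assumed. -inf entries (zero probability) are fine,
    e.g., log_T = np.log(two_context_transitions()) with np.errstate
    silencing log-of-zero warnings."""
    Tn, S = log_obs.shape
    delta = np.full((Tn, S), -np.inf)      # best path log score so far
    psi = np.zeros((Tn, S), dtype=int)     # backpointers
    delta[0] = log_obs[0] - np.log(S)
    for t in range(1, Tn):
        scores = delta[t - 1][:, None] + log_T   # (prev state, next state)
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_obs[t]
    path = np.zeros(Tn, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(Tn - 2, -1, -1):        # backtrack
        path[t] = psi[t + 1, path[t + 1]]
    return path
```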
If the number of trials per block is substantially reduced, the reward structure becomes difficult to distinguish from partial reinforcement. When training sequences comprising only three trials per block were generated, the two-context model required too many context switches to be justified, since context switches are constrained to be unlikely. This resulted in the progressively decreasing likelihood of the two-context model, as shown in Figure 7.
Figure 9: Partial reinforcement reversal learning simulation results, where the log Bayes factor indicates the log likelihood of the two-context model relative to the one-context model (see model description in Figure 6). When the similarity of the two reward contingencies is increased, more trials are required to justify representing them as separate contexts.
4.2 Partial Reinforcement and Reversal

In order not to conflate the effects of partial reinforcement during the original and reversal discriminations, the training paradigm typically applies partial reinforcement to choices during the original discrimination, leaving the reversal condition unambiguous (Wike, 1953; Grosslight et al., 1954; Elam & Tyler, 1958; Wise, 1962; Pennes & Ison, 1967). We consider three cases: full reinforcement (100:0), in which choice A is always rewarded and choice B is never rewarded; partial reinforcement of A (75:0), in which choice A is rewarded only 75% of the time; and partial reinforcement of A and B (75:25), in which choice B is also rewarded 25% of the time. In all cases, only choice B is rewarded during the reversal discrimination (0:100).

Training consisted of a single original block and a single reversal block, the length of each being equal and set so that the Bayes factor would be roughly 5.0 by the end of training on both blocks. The rat's choice was selected probabilistically, where the probability of a correct choice exponentially approached 90% from 50% in the original block and 90% from 10% in the reversal block. This simulated behavior was designed to mimic the progressive bias animals show in favor of the rewarded choice over time (e.g., Wike, 1953). Figure 9 shows that as the original discrimination is made more ambiguous, learning to differentiate it as a separate context from the reversal discrimination requires progressively more trials.
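The three schedules can be simulated with a small variation of the previous sketch; the block length of 40 trials and the time constant are placeholders (the text instead fixes the block length via the Bayes factor criterion).

```python
import numpy as np

rng = np.random.default_rng(2)

def reinforcement_block(n_trials, p_reward_A, p_reward_B,
                        pA_start, pA_final, tau=5.0):
    """One block under a reward schedule such as 100:0, 75:0, or 75:25.
    The probability of choosing A relaxes exponentially from pA_start
    toward pA_final, mimicking the animals' progressive bias."""
    trials = []
    for t in range(n_trials):
        pA = pA_final + (pA_start - pA_final) * np.exp(-t / tau)
        choice = "A" if rng.random() < pA else "B"
        p_rew = p_reward_A if choice == "A" else p_reward_B
        trials.append((choice, rng.random() < p_rew))
    return trials

# Original discrimination (75:25), correct choice A: p(A) 0.5 -> 0.9.
original = reinforcement_block(40, 0.75, 0.25, pA_start=0.5, pA_final=0.9)
# Unambiguous reversal (0:100), correct choice B: p(A) 0.9 -> 0.1,
# i.e., p(correct) approaches 90% from 10%.
reversal = reinforcement_block(40, 0.00, 1.00, pA_start=0.9, pA_final=0.1)
```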
We therefore predict that context-specific hippocampal representations would develop more slowly in these partial reinforcement paradigms.

5 Simulation Results: Sequence Learning

The hippocampus has been implicated in a variety of tasks involving sequences (Kesner & Novak, 1982; Chiba, Kesner, & Reynolds, 1994; Gilbert, Kesner, & Lee, 2001; Agster, Fortin, & Eichenbaum, 2002; Fortin, Agster, & Eichenbaum, 2002; Kesner, Gilbert, & Barua, 2002). For example, Fortin et al. (2002) found that after being presented with a random sequence of odors, hippocampal rats could not choose the earlier of two odors from the sequence. Of particular interest has been the study of overlapping sequences, in which multiple sequences share common middle elements but distinct beginning and ending elements. Agster et al. (2002) found that while hippocampal rats could disambiguate two partially overlapping olfactory sequences, intertrial interference or a delay condition could impair their performance substantially.

Several models have proposed that the hippocampus develops separate contextual representations of each sequence that serve to associate the ambiguous middle elements with the rest of the sequence (Levy, 1989, 1996; Wallenstein & Hasselmo, 1998). These models predict that if a rat were to repeatedly travel down a common maze arm that was part of two different paths (e.g., a continuous figure 8 pattern), place cells on the common arm would fire differently depending on which path the animal was traveling. Some studies have confirmed this finding (Frank, Brown, & Wilson, 2000; Wood et al., 2000), while others found no path-related differences (Lenck-Santini et al., 2001; Hölscher et al., 2004). Intriguingly, Bower et al. (2005) were able to reproduce both cases by varying the shaping procedure used to train the rats.

The studies that found no path-related differences nonetheless reported that rats were able to learn the task, a result consistent with Ainge and Wood (2003), who found that hippocampal lesions did not impair the continuous version of the task. However, when a small delay was added at the start of the common maze arm, Ainge and Wood found that hippocampal rats were impaired. Subsequently, two groups trained unlesioned rats on a figure 8 maze, each finding, paradoxically, that path-specific modulation of place fields on the common maze arm occurred with no delay (when the task was not hippocampus dependent) but largely disappeared when a delay was added (Ainge, van der Meer, & Wood, 2005; Robitsek, Fortin, & Eichenbaum, 2005). Despite this disappearance, rats could still perform the task. As Bower et al. (2005) point out, the sequence dependence of place cell activity is likely attributable to differing input from some extrahippocampal brain area rather than to a contextual representation developed within the hippocampus.

An alternative hypothesis, adopted by Hasselmo and Eichenbaum (2005), is that sequence replay (Foster & Wilson, 2006) in the hippocampus
guides behavior independent of whether path-specific remapping is seen on the common arm. Specifically, an extrahippocampal brain area incrementally learns rules for completing each sequence, given its beginning; replay of the beginning of the current sequence is then sufficient to complete the task. (With a higher learning rate, Hasselmo et al., 2002, would likely provide an elegant model of such one-shot learning.) Such a division of labor explains why Agster et al. (2002) found evidence of rats learning overlapping sequences but failing, under some circumstances, to recall which sequence had most recently begun.

The rapid alternation between sequences on a figure 8 maze is incompatible with our constraint that context switches should be rare. To demonstrate this, we constructed a simulation of the figure 8 task. Five positions on the track were modeled and were traversed in a six-step loop: start left, center, end right, start right, center, end left, repeat. A trial constituted a pass through one start arm, the center arm, and the opposing end arm. One- and two-context models (see Figure 10) were compared, and the results are shown in Figure 11. Even after just 12 trials, the two-context model is astronomically unlikely. To demonstrate that the context switch penalty in the two-context model is the specific cause of the low Bayes factors, the one-context model was also compared with the generative model (see Figure 10). Figure 11 shows that decisive preference for the generative model is attained by 30 trials.

Bower et al. (2005) have considered training regimes that promote sequence disambiguation of the common path, presumably due to afferent input from another brain region. However, were rats to be trained repeatedly on one sequence and then the other, our framework would predict that sequence-specific encodings of the common path would reliably develop due to intrahippocampal mechanisms sensitive to temporal mismatch.

6 Discussion

Advances in Bayesian computational statistics techniques over the past decade have opened the door to evaluating the marginal model probabilities of many new classes of models. As researchers use these techniques, insights into animal learning and human reasoning have begun to arise from their formulation as Bayesian inference problems, often over multiple generative models that provide competing explanations of a corpus of data (Tenenbaum & Griffiths, 2001; Courville, Daw, Gordon, & Touretzky, 2003; Courville, Daw, & Touretzky, 2004; Griffiths & Tenenbaum, 2005; Daw, Niv, & Dayan, 2005). In a similar vein, the work presented here advances a theory of context learning in which hippocampal input patterns are grouped together in the same context when they reliably cluster together in time. Choosing between generative models that group the experiences into different numbers of clusters is therefore the fundamental challenge of context learning.
Figure 10: The generative model and one- and two-context models used to model sequence learning. Note that the generative model includes two states (Center LR and Center RL) in the same context that represent the same distribution of input patterns (identical µC values). Our weak-identifiability constraint (see section 2.1) that such states can be represented only when in separate contexts excludes the generative model from the class of possible hippocampal context models. Arrows within each context indicate transition probabilities that are typically high; however, transitions between any two states are possible.
Sequence learning models (Levy, 1989, 1996; Wallenstein & Hasselmo, 1998) provide a rather different notion of context, one oriented toward binding together elements of a temporal sequence. Interestingly, this binding process simultaneously serves to differentiate elements based on their order within the sequence. By contrast, in our theory, context learning groups sequence elements together without disambiguating multiple occurrences of the same element. Thus, sequence learning models predict that alternating, overlapping sequences should be represented with different contextual bindings, whereas our theory groups them together into one context. The failure of sequence-dependent hippocampal representations to be consistently observed (e.g., Bower et al., 2005) or to serve a behavioral function (Ainge & Wood, 2003) argues against such representations playing a critical role in the disambiguation of alternating, overlapping sequences. Others have argued that such sequential encodings are instead formed in medial
Figure 11: Sequence learning simulation results, where the log Bayes factor indicates the log likelihood of the two-context model relative to the one-context model (see model description in Figure 10). The relative likelihood of the two-context model decreases precipitously with experience due to the constraint that transitions between contexts are of low probability. If the generative model were admissible, it would be adopted over the one-context model, as its log likelihood relative to the one-context model increases steadily with experience.
temporal lobe areas outside the hippocampus proper, such as the entorhinal cortex (Howard, Fotedar, Datey, & Hasselmo, 2005).

An important aspect of this theory is that it distinguishes between the inference processes of context learning and context selection: context learning may be gradual over many days, while context selection should be an abrupt process. Multibasin attractor models (Samsonovich & McNaughton, 1997; Redish & Touretzky, 1998; Tsodyks, 1999; Doboli et al., 2000) have demonstrated how multiple contexts, each a stable basin of attraction, could be simultaneously represented within the same network. In such models, context selection involves restabilization of the network in the most appropriate basin. What these models lack is an explanation of the gradual development of new contextual representations.

By contrast, backpropagation models (Schmajuk & DiCarlo, 1992; Gluck & Myers, 1993) have attempted to address the gradual development of new (hidden layer) representations. However, these networks do not have multistable activity patterns; they have no "memory" of the current context across time beyond what is encoded in the weights. Thus, they fail to capture the one-trial context switching behavior observed in reversal learning. In addition, backpropagation and similar delta rule learning models do not properly adapt learning to the information provided by each trial and
therefore cannot account for the slower reversal of partial reinforcement reward contingencies.

Finally, while backpropagation is an efficient search technique for learning high-dimensional mappings, the complexity of the function represented by the neural network is not explicitly considered. (In a machine learning setting, overcomplexity can result in "overfitting," which is typically detected by cross-validation of the model on a separate data set.) Similarly, independent component analysis (used in a hippocampal model by Lőrincz & Buzsáki, 2000) does not adjust the number of independent components based on the observed data. In contrast, the framework presented here considers the inherent trade-off between model fit and model complexity that underpins any model selection problem. Our framework can most clearly be dissociated from other models that do not consider this trade-off by determining whether remapping is observed subsequent to training in a set of morph boxes: our simulations predict that remapping should not occur.

6.1 Localizing Contextual Representations Within the Hippocampus

Since our theory concerns contextual representations in the hippocampus, we discuss how it relates to known facts about hippocampal anatomy and physiology. First, we argue that abrupt and gradual remapping are mediated by distinct physiological processes. Specifically, whereas abrupt remapping is caused by a change in the path integrator representation located in dorsal medial entorhinal cortex (dMEC), gradual remapping is caused by experience-dependent representational changes within the DG/CA3 network. Then we discuss the evidence for pattern separation and completion in the DG/CA3 network and how such mechanisms could underpin context learning. We contrast the role of the DG/CA3 network with the role of CA1 in gating the projection of the DG/CA3 representation to efferent cortical areas.

6.1.1 Abrupt Remapping and Attractor Dynamics

Marr (1971) first proposed that the architecture of the hippocampus is well suited to encode new memory traces using orthogonalized representations, which minimize interference between stored patterns. Recordings from hippocampal pyramidal cells confirmed that sparse, orthogonalized representations are formed to encode different places and other features within an environment (O'Keefe & Dostrovsky, 1971; Wood, Dudchenko, & Eichenbaum, 1999), as well as different environments as a whole (Muller & Kubie, 1987). Attractor models of hippocampal function (Samsonovich & McNaughton, 1997; Redish & Touretzky, 1998; Doboli et al., 2000), which address the remapping of place fields between environments, were born out of the observation that, when place cells remap, they appear to remap together. For example, Bostock, Muller, & Kubie (1991) found that the change of a cue card's color sometimes caused a remapping, and, when it did, all simultaneously recorded cells remapped. When introduced repeatedly to the same
environment, rats with deficient LTP sometimes recalled a completely different map (Barnes, Suster, Shen, & McNaughton, 1997; Kentros et al., 1998).

One difficulty in interpreting such remapping data is in disambiguating the role of the path integrator, currently believed to reside in dorsal medial entorhinal cortex (dMEC) (Fyhn, Molden, Witter, Moser, & Moser, 2004; Hafting, Fyhn, Molden, Moser, & Moser, 2005; Fuhs & Touretzky, 2006; McNaughton, Battaglia, Jensen, Moser, & Moser, 2006), from that of the place code. As noted by Touretzky and Redish (1996), failure to reset the path integrator should cause a substantial change in the afferent input to the hippocampus, resulting in abrupt hippocampal remapping independent of any attractor dynamics in the hippocampus. This reset failure likely underpins the results of S. Leutgeb et al. (2005), who showed that switching between two rooms causes a different and more radical form of remapping than switching between arena shapes in the same room. While dMEC grid cells change phase when the arena is placed in a novel room (Hafting et al., 2005), preliminary findings by J. Leutgeb et al. (2006) show that grid cell phases remain constant when the arena changes shape in the same room, while DG and CA3 undergo rate remapping (see also Quirk, Muller, Kubie, & Ranck, 1992).

Thus, abrupt remapping, even when delayed from the first exposure to the environment (e.g., Bostock et al., 1991; Brown & Skaggs, 2002), is likely attributable to a reset failure of the path integrator, which causes remapping simultaneously in all subfields of the hippocampus. By this interpretation, delayed, abrupt remapping reflects a stochastic process in which, for rats that attend to the environmental change, there is a fixed probability of path integrator reset failure on each visit to the perturbed environment, which should result in an exponentially distributed time to first remapping.

6.1.2 Gradual Remapping, Context Learning, and the DG/CA3 Network

In contrast to abrupt remapping, the gradual differentiation of contextual representations should be attributable to circuitry within the hippocampus, specifically DG and CA3. Several authors have suggested that exposure to a novel context results in an orthogonalized representation being formed in DG, which is then propagated to CA3 (Marr, 1971; McNaughton & Morris, 1987; Treves & Rolls, 1992, 1994; O'Reilly & McClelland, 1994). Preliminary evidence suggests that the rate remapping between similar contexts observed by S. Leutgeb et al. (2005) originates in DG (J. Leutgeb et al., 2006). In a familiar context, perforant path input and recurrent collaterals in CA3 guide the recall of the previously learned contextual representation. O'Reilly and McClelland (1994) explored a simplified model of DG and CA3, showing that although similar patterns could be mapped to an even more similar CA3 representation (pattern completion), more strongly differing input patterns could be separated as well (pattern separation).

The neural mechanisms of pattern separation and completion resemble the statistical process of clustering: map noisy input patterns into more
similar representations to denote their association with the same cluster; map input patterns associated with different clusters into more distinct representations. One might therefore think of a clustering neural network as an extension of the O'Reilly and McClelland (1994) model in which the threshold between separation and completion is not static but dynamically adjusted based on the distribution of input patterns.

While O'Reilly and McClelland only elucidated the benefits of CA3 perforant path plasticity, NMDA-dependent synaptic plasticity is well known to exist within DG and the CA3 recurrent collaterals. In addition, there is intriguing new evidence that mossy fiber synapses show heterosynaptic plasticity, though changes in synaptic efficacy appear to depend on neighboring synapses in stratum radiatum instead of the depolarization of the postsynaptic cell (Schmitz, Mellor, Breustedt, & Nicoll, 2003). We propose that one function of this learning is to adjust the separation-completion threshold, gating when new contextual representations are propagated from DG to CA3.

In a familiar context, once the path integrator and any other brain state have been reset, the activity patterns projected from DG onto CA3 should match the patterns projected from the perforant path and recurrent collaterals, modulo some noise. However, if this familiar context is perturbed into a similar but distinct second context, then there should be some mismatch between the pathways: the representation in DG should more accurately reflect the current input patterns, whereas the perforant path and recurrent circuitry should reinforce a previously learned representation. Critically, our framework suggests that if the differences between recalled and input patterns are small, the impact of the DG representation should be minimal, and, if such differences do not repeat, the impact of DG should remain minimal. However, if the same differences are observed repeatedly, the impact of DG on the CA3 representation should increase, causing remapping in CA3 (Fuhs & Touretzky, 2000). In this way, more complex context models may be adopted. Small but repeated differences would be expected to cause incremental changes in synaptic connectivity in the DG perforant path and mossy fiber pathway that strengthen the impact of DG on CA3.

Interestingly, perforant path plasticity in DG can last for months (Abraham, Logan, Greenwood, & Dragunow, 2002), suggesting that this pathway is capable of accruing changes in synaptic efficacy over many training days. However, granule cell neurogenesis causes cells to be gradually replaced, slowly fading out previous associations (Feng et al., 2001). These opposing forces provide a natural balance for input pattern density estimation. New input patterns can be registered, and their impact can be strengthened with repeated exposure; however, this strengthening is tempered by the replacement of granule cells to clear out old memories. It follows from this proposal that DG principal cells should fire at a lower rate in a novel environment and, with repeated exposure, gradually increase their rates as the environment becomes familiar.
While presenting a neural network model of context learning is beyond the present scope, recent physiology, gene expression, and lesion studies are consistent with the proposal that a neural instantiation of context learning should be localized to the DG/CA3 network, whereas CA1 integrates the experience-dependent DG/CA3 representation with the entorhinal cortical representation.

Both pattern separation and pattern completion have been observed in CA3 in response to changes in environmental cues (Vazdarjanova & Guzowski, 2004; Lee, Yoganarasimha, Rao, & Knierim, 2004; Leutgeb, Leutgeb, Treves, Moser, & Moser, 2004; Guzowski, Knierim, & Moser, 2004). Vazdarjanova and Guzowski (2004) found in an immediate-early gene expression study that similar environments were represented with greater similarity in CA3 than in CA1, while two very different environments were represented with less similarity in CA3 than in CA1. Both Lee et al. (2004) and S. Leutgeb et al. (2004) found that the CA1 representation more directly reflected the current sensory environment, whereas the CA3 representation reflected either pattern completion of a previously learned contextual representation (Lee et al., 2004) or pattern separation to create a new contextual representation (S. Leutgeb et al., 2004).

A behavioral role for CA3 pattern completion is suggested by Nakazawa et al. (2002), who found that performance in a cue-degraded version of the Morris water maze is impaired by CA3 NMDA receptor knockout. Lee and Kesner (2002, 2003) showed that delayed nonmatch to place (DNMP) was impaired in a novel (but not familiar) environment by CA3 NMDA inactivation or by DG or CA3 lesion. These deficits may be interpreted as a failure to recall or learn a conjunctive representation of position and target (e.g., hidden platform, object) that could be retrieved via pattern completion using the target as an autoassociative memory cue. (Evidence for such a target representation has been found in prelimbic/infralimbic cortex; Hok, Save, Lenck-Santini, & Poucet, 2005.) Consistent with this hypothesis, DG appears necessary to create such conjunctive representations: DG lesions reduce performance to chance on a working-memory water maze task in which the platform is moved to a new location each day (Xavier, Oliveira-Filho, & Santos, 1999). In relation to our theory, these data point to DG and CA3 as constructing a model-based representation of the animal's experiences, including various conjunctive associations instrumental in solving behavioral tasks.

CA1 appears to relay a composition of the model-based CA3 representation and the context-free entorhinal cortical information to efferent cortical areas. Hasselmo and colleagues have presented a series of models and pharmacological data supporting the notion that increased cholinergic modulation decreases the contribution of CA3 to the CA1 representation but increases plasticity of both the CA3 recurrent and Schaffer collaterals (Hasselmo & Schnell, 1994; Hasselmo, Schnell, & Barkai, 1995; Hasselmo, Wyble, & Wallenstein, 1996). More recently, Yu and Dayan (2005) have proposed a theory of acetylcholine and noradrenaline in which acetylcholine
represents expected uncertainty, whether due to the context being new or to known unpredictability within a familiar context. Taken together, these models suggest that when the context is informative (low ACh), CA1 should be influenced by CA3; when the context is less informative (high ACh), that influence should be reduced. This gating of the CA3 representation has been confirmed experimentally in novel environments: CA1 shows a stable representation while the CA3 representation evolves over the course of 20 to 30 minutes (S. Leutgeb et al., 2004). Additionally, several monoaminergic neurotransmitters have been implicated in modulating the balance of CA3 and EC input to CA1 (Otmakhova & Lisman, 1998; Otmakhova, Lewey, Asrican, & Lisman, 2005).

The complementary roles of the DG/CA3 and CA1 networks provide some insight with which to interpret discrepancies between the double rotation experiments of Shapiro et al. (1997) and Lee et al. (2004). When local and distal cues were rotated in opposite directions, both studies observed "heterogeneous" or "discordant" responses. However, Shapiro et al., recording mostly from CA1 at the beginning of the experiment, initially observed many more place fields rotating with the distal cues than with the local cues. In fact, the ratio of place fields rotating with each set of cues observed by Shapiro et al. much more closely resembles the ratio of place fields in CA3 tied to each set of cues observed by Lee et al. (2004), suggesting that in the Shapiro et al. study, CA3 strongly influenced the representation in CA1.

Shapiro et al. repeatedly trained rats on two standard condition sessions and a single double rotation session (local and distal cues rotated 180 degrees apart) each day, as well as on various other, less frequent cue scrambling and deletion probe trials. They observed that over time, cells (predominantly in CA3 by the end of the experiment) more strongly remapped between the standard and double rotation conditions. This change in the degree of remapping could reflect either the change in the cell populations they recorded from or an experience-dependent effect; they do not address this issue statistically. Nonetheless, if we assume that the initial primacy of distal influence on CA1 place cells reflects pattern completion in CA3, then the increase in remapping reflects an experience-dependent transition in CA3 from pattern completion to pattern separation in order to adopt separate contexts for the standard and double rotation cue configurations. This is consistent with our simulations of gradual remapping, which show that two repeatedly experienced and distinct conditions should be differentiated by context.

By contrast, Lee et al. trained rats equally on four rotation angles in addition to the standard condition, a training paradigm more akin to our morph experiment simulations. Though the relatively short duration of training by Lee et al. prevents any decisive conclusions, their observation that CA3 maintained pattern completion throughout the course of their experiment (I. Lee & J. J. Knierim, personal communication to M. Fuhs, June 2006) is consistent with our morph experiment simulation results, which predict that experience-dependent remapping should not occur in such a case.
6.2 Future Work

While multi-unit recordings from CA3 and CA1 have provided great insight into their differing functions, data from DG and the hilus are only beginning to become available. As the activity patterns of cells in these areas are explored, physiologically based models of context learning will become more tenable.

The framework presented here assumes a perfect memory of past experiences, an assumption certain to be untrue. It would be interesting to explore to what extent memory limitations, perhaps imposed by representing the full history of experiences with a set of (in)sufficient statistics, would affect the predictions made by the present framework. Also, the framework here models context transitions as a Markov process with a fixed transition probability, which amounts to assuming that context transitions occur according to a Poisson distribution. However, Daw, Courville, and Touretzky (2006) have suggested that a more sophisticated semi-Markov process better explains physiology studies of the dopamine system. Context learning in vivo may similarly involve a more explicit inference about the amount of time an animal expects to remain in each context.
Appendix: Estimating Marginal Model Probabilities

For these simulations, equation 2.6 was numerically approximated by Monte Carlo integration of the parameter samples using an importance distribution. The importance distribution was constructed in an automated fashion using 500 samples from the posterior parameter distribution, $p(\theta_k \mid y_{1,\ldots,\tau}, M_k)$, and the Gibbs kernels from the Markov chain Monte Carlo (MCMC) sampler that generated the parameter samples (for details, see Frühwirth-Schnatter, 2004). Monte Carlo integration becomes unstable when the tails of the importance distribution are narrower than the posterior distribution along any dimension of the parameter space. To guard against this, the variances of the $\mu$ and $\sigma^{-2}$ components were doubled, and the variance of the $\vec{a}_s$ component was increased by multiplying each Dirichlet parameter by 0.75.

The MCMC sampler was constructed based on previous sampling techniques for standard HMMs (Chib, 1996). Briefly, the Markov chain was constructed from Gibbs kernels that sample sequentially from both the parameters of the model and latent indicator vectors, which indicate, for each mixture distribution within the model and each time step $t$, which mixture component is implicated in the observed value $y_t$. For the shared-parameter HMMs, there are three indicator vectors. The vector $S_{1,\ldots,T}$ indicates the active HMM state at each time step and was sampled using a standard HMM Gibbs move:

$$p(S_t = i \mid \ldots) \propto A_{i,S_{t-1}} \, p\!\left(y_t \mid \mu_i, \sigma_i^2\right) A_{S_{t+1},i}. \tag{A.1}$$
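A minimal sketch of this move, assuming Gaussian observation densities and the destination-first indexing of equation A.1:

```python
import numpy as np

rng = np.random.default_rng(3)

def gaussian_pdf(y, mu, var):
    """Density of N(mu, var) at y, vectorized over states."""
    return np.exp(-(y - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def sample_state(t, S, y, A, mu, var):
    """One Gibbs move for S[t] (equation A.1). A[i, j] = p(next = i |
    current = j); boundary time steps drop the missing neighbor factor."""
    w = gaussian_pdf(y[t], mu, var)     # p(y_t | mu_i, sigma_i^2)
    if t > 0:
        w = w * A[:, S[t - 1]]          # A_{i, S_{t-1}}
    if t + 1 < len(S):
        w = w * A[S[t + 1], :]          # A_{S_{t+1}, i}
    return rng.choice(len(mu), p=w / w.sum())
```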
The vector $u_{1,\ldots,T-1}$ indicates which component of the transition probability mixture (see equation 2.3) for the current state contributed to each transition. The vector $v_{1,\ldots,T}$ indicates which component of the hippocampal input value mixture (see equation 2.4) for the current state contributed to each observation. Gibbs moves were as for standard mixtures:

$$p(u_t = \mathrm{Dep} \mid \ldots) \propto \frac{A_{S_{t+1},\hat{s}}}{A_{S_{t+1},s} + A_{S_{t+1},\hat{s}}} \tag{A.2}$$

$$p(v_t = \mathrm{Dep} \mid \ldots) \propto \frac{p\!\left(y_t \mid \mu_{\hat{s}}, \sigma_{\hat{s}}^2\right)}{p\!\left(y_t \mid \mu_s, \sigma_s^2\right) + p\!\left(y_t \mid \mu_{\hat{s}}, \sigma_{\hat{s}}^2\right)}. \tag{A.3}$$
The values of $u_t$ and $v_t$ are defined only for times when the HMM state indicator vector indicates a state in a dependent context. Gibbs moves for the HMM parameters were complicated by the sharing of parameters between independent and dependent states. Gibbs moves for the mixing parameters were

$$p(z_{\hat{s}} \mid \ldots) \propto \mathrm{Beta}\!\left(\delta_1 + n^u_s,\ \delta_2 + n^u_{\hat{s}}\right) \tag{A.4}$$

$$p(\zeta_{\hat{s}} \mid \ldots) \propto \mathrm{Beta}\!\left(\delta_1 + n^v_s,\ \delta_2 + n^v_{\hat{s}}\right), \tag{A.5}$$
where $n^u_s = \#(u^{\hat{s}}_t = \mathrm{Ind})$, $n^u_{\hat{s}} = \#(u^{\hat{s}}_t = \mathrm{Dep})$, $n^v_s = \#(v^{\hat{s}}_t = \mathrm{Ind})$, and $n^v_{\hat{s}} = \#(v^{\hat{s}}_t = \mathrm{Dep})$. The counting function $\#()$ returns the number of occurrences over a time-indexed vector in which the specified condition is satisfied. Gibbs moves for the transition probabilities were

$$p(\vec{a}_{s_1} \mid \ldots) \propto \mathrm{Dirichlet}\!\left(\delta + n^{\mathrm{ind}}_1,\ \delta + n^{\mathrm{ind}}_2,\ \ldots\right) \tag{A.6}$$

$$p(\vec{a}_{\hat{s}_1} \mid \ldots) \propto \mathrm{Dirichlet}\!\left(\delta + n^{\mathrm{dep}}_1,\ \delta + n^{\mathrm{dep}}_2,\ \ldots\right). \tag{A.7}$$
The transition probabilities for a state in an independent context, $s_1$, were updated based on a transition count, $n^{\mathrm{ind}}_{s_2}$, which counts the number of occurrences of a transition from $s_1$ to $s_2$. If the independent context had a dependent context (with states $\hat{s}_1$ and $\hat{s}_2$ paired with states $s_1$ and $s_2$), then $n^{\mathrm{ind}}_{s_2}$ also included the number of occurrences of a transition from $\hat{s}_1$ to $\hat{s}_2$ when $u_t = \mathrm{Ind}$. For a state in a dependent context, $\hat{s}_1$, the transition count $n^{\mathrm{dep}}_{\hat{s}_2}$ included transitions from $\hat{s}_1$ to $\hat{s}_2$ when $u_t = \mathrm{Dep}$.

The observation parameters for a state in an independent context were sampled using the following Gibbs moves:

$$p(\mu_s \mid \ldots) \propto \mathcal{N}\!\left(\Big(\sigma_s^{-2} \sum_{t \in T_s} y_t + \kappa_1 \xi\Big)\sigma_\mu^2,\ \sigma_\mu^2\right) \tag{A.8}$$

$$p(\sigma_s^{-2} \mid \ldots) \propto \Gamma\!\left(\alpha_1 + \frac{1}{2}|T_s|,\ \beta_1 + \frac{1}{2}\sum_{t \in T_s}(y_t - \mu_s)^2\right), \tag{A.9}$$

where $\sigma_\mu^{-2} = \sigma_s^{-2}|T_s| + \kappa_1$ and $T_s$ is the set of times for which $S_t = s$ or for which $v_t = \mathrm{Ind}$ and $S_t = \hat{s}$. The observation parameters for a state in a dependent context were sampled using the following Gibbs moves:

$$p(\mu_{\hat{s}} \mid \ldots) \propto \mathcal{N}\!\left(\Big(\sigma_{\hat{s}}^{-2} \sum_{t \in T_{\hat{s}}} y_t + \kappa_2 (\mu_s + h)\Big)\sigma_\mu^2,\ \sigma_\mu^2\right) \tag{A.10}$$

$$p(\sigma_{\hat{s}}^{-2} \mid \ldots) \propto \Gamma\!\left(\alpha_2 + \frac{1}{2}|T_{\hat{s}}|,\ \alpha_2 \sigma_s^2 + \frac{1}{2}\sum_{t \in T_{\hat{s}}}(y_t - \mu_{\hat{s}})^2\right), \tag{A.11}$$

where $\sigma_\mu^{-2} = \sigma_{\hat{s}}^{-2}|T_{\hat{s}}| + \kappa_2$ and $T_{\hat{s}}$ is the set of times for which $v_t = \mathrm{Dep}$ and $S_t = \hat{s}$.
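A minimal sketch of the independent-context updates A.8 and A.9, assuming the precision follows the standard Gamma shape-rate form (NumPy's gamma sampler takes a scale, the reciprocal of the rate); the dependent-context moves A.10 and A.11 differ only in their prior mean (µ_s + h in place of ξ) and hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(4)

def sample_obs_params(y_s, prec, xi, kappa1, alpha1, beta1):
    """Conjugate Gibbs updates for an independent-context state: y_s are
    the observations assigned to state s (times T_s), prec is the current
    precision sigma_s^{-2}; returns a new (mu_s, precision) pair."""
    n = len(y_s)
    var_mu = 1.0 / (prec * n + kappa1)               # sigma_mu^2
    mean_mu = (prec * y_s.sum() + kappa1 * xi) * var_mu
    mu = rng.normal(mean_mu, np.sqrt(var_mu))        # equation A.8
    shape = alpha1 + 0.5 * n
    rate = beta1 + 0.5 * ((y_s - mu) ** 2).sum()
    prec = rng.gamma(shape, 1.0 / rate)              # equation A.9
    return mu, prec
```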
Five thousand samples from the importance distribution were used to estimate equation 2.6. The availability of samples from the posterior distribution permitted evaluation of equation 2.6 with the computationally more expensive technique of bridge sampling. Unlike traditional importance sampling, an accurate estimate via bridge sampling does not require the importance distribution to be broader than the posterior distribution (Meng & Wong, 1996; Frühwirth-Schnatter, 2004). In our many test cases, we found that the differences between the two integration techniques were negligible, suggesting that the importance distribution accurately covered the entire mass of the posterior distribution.

Acknowledgments

This work was supported by NIH grant MH59932 and the Pennsylvania Tobacco Settlement Fund.

References

Abraham, W. C., Logan, B., Greenwood, J. M., & Dragunow, M. (2002). Induction and experience-dependent consolidation of stable long-term potentiation lasting months in the hippocampus. Journal of Neuroscience, 22(21), 9626–9634.
Agster, K. L., Fortin, N. J., & Eichenbaum, H. (2002). The hippocampus and disambiguation of overlapping sequences. Journal of Neuroscience, 22(13), 5760–5768.
Ainge, J. A., van der Meer, M. A. A., & Wood, E. R. (2005). Disparity between sequence-dependent hippocampal activity and hippocampal lesion effects on a continuous t-maze task. Program no. 776.13. In Society for Neuroscience Abstracts. Washington, DC: Society for Neuroscience.
Ainge, J. A., & Wood, E. (2003). Excitotoxic lesions of the hippocampus impair performance on a continuous alternation t-maze task with short delays but not with no delay. Program no. 91.1. In Society for Neuroscience Abstracts. Washington, DC: Society for Neuroscience.
Barnes, C. A., Suster, M. S., Shen, J., & McNaughton, B. L. (1997). Multistability of cognitive maps in the hippocampus of old rats. Nature, 388(6639), 272–275.
Berger, T. W., & Orr, W. B. (1983). Hippocampectomy selectively disrupts discrimination reversal conditioning of the rabbit nictitating membrane response. Behav. Brain Res., 8(1), 49–68.
Bostock, E., Muller, R. U., & Kubie, J. L. (1991). Experience-dependent modifications of hippocampal place cell firing. Hippocampus, 1(2), 193–206.
Bower, M. R., Euston, D. R., & McNaughton, B. L. (2005). Sequential-context-dependent hippocampal activity is not necessary to learn sequences with repeated elements. J. Neurosci., 25(6), 1313–1323.
Brown, J. E., & Skaggs, W. E. (2002). Discordant coding of spatial locations in the rat hippocampus. Journal of Neurophysiology, 88, 1605–1613.
Brunswick, E. (1939). Probability as a determiner of rat behavior. J. Exp. Psychol., 25, 175–197.
Buytendijk, F. J. J. (1930). Über das Umlernen. Arch. Néerl. Physiol., 15, 283–310.
Capaldi, E. J. (1994). The sequential view: From rapidly fading stimulus traces to the organization of memory and the abstract concept of number. Psychonomic Bulletin and Review, 1, 156–181.
Capaldi, E. J., & Neath, I. (1995). Remembering and forgetting as context discrimination. Learning and Memory, 2, 107–132.
Chib, S. (1996). Calculating posterior distributions and modal estimates in Markov mixture models. Journal of Econometrics, 75, 79–97.
Chiba, A. A., Kesner, R. P., & Reynolds, A. M. (1994). Memory for spatial location as a function of temporal lag in rats: Role of hippocampus and medial prefrontal cortex. Behavioral and Neural Biology, 61, 123–131.
Courville, A. C. (2006). A latent cause theory of classical conditioning. Unpublished doctoral dissertation, Carnegie Mellon University.
Courville, A. C., Daw, N. D., Gordon, G. J., & Touretzky, D. S. (2003). Model uncertainty in classical conditioning. In S. Thrun, L. Saul, & B. Schölkopf (Eds.), Advances in neural information processing systems, 16 (pp. 977–984). Cambridge, MA: MIT Press.
Courville, A. C., Daw, N. D., & Touretzky, D. S. (2004). Similarity and discrimination in classical conditioning: A latent variable account. In L. K. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing systems, 17 (pp. 313–320). Cambridge, MA: MIT Press.
Daw, N. D., Courville, A. C., & Touretzky, D. S. (2006). Representation and timing in theories of the dopamine system. Neural Comput., 18(7), 1637–1677.
Daw, N. D., Niv, Y., & Dayan, P. (2005). Uncertainty-based competition between prefrontal and dorsolateral striatal systems for behavioral control. Nat. Neurosci., 8(12), 1704–1711.
Doboli, S., Minai, A. A., & Best, P. J. (2000). Latent attractors: A model for context-dependent place representations in the hippocampus. Neural Computation, 12(5), 1003–1037.
Dufort, R. H., Guttman, N., & Kimble, G. A. (1954). One-trial discrimination reversal in the white rat. J. Comp. Physiol. Psychol., 47(3), 248–249.
Elam, C. B., & Tyler, D. W. (1958). Reversal-learning following partial reinforcement. Am. J. Psychol., 71(3), 583–586.
Feng, R., Rampon, C., Tang, Y., Shrom, D., Jin, J., Kyin, M., et al. (2001). Deficient neurogenesis in forebrain-specific presenilin-1 knockout mice is associated with reduced clearance of hippocampal memory traces. Neuron, 32, 911–926.
Ferbinteanu, J., & Shapiro, M. L. (2003). Prospective and retrospective memory coding in the hippocampus. Neuron, 40(6), 1227–1239.
Fortin, N. J., Agster, K. L., & Eichenbaum, H. B. (2002). Critical role of the hippocampus in memory for sequences of events. Nature Neuroscience, 5(5), 458–462.
Foster, D. J., & Wilson, M. A. (2006). Reverse replay of behavioural sequences in hippocampal place cells during the awake state. Nature, 440, 680–683.
Frank, L. M., Brown, E. N., & Wilson, M. (2000). Trajectory encoding in the hippocampus and entorhinal cortex. Neuron, 27, 169–178.
Frühwirth-Schnatter, S. (2004). Estimating marginal likelihoods for mixture and Markov switching models using bridge sampling techniques. Econometrics Journal, 7, 143–167.
Fuhs, M. C., & Touretzky, D. S. (2000). Synaptic learning models of map separation in hippocampus. Neurocomputing, 32, 379–384.
Fuhs, M. C., & Touretzky, D. S. (2006). A spin glass model of path integration in rat medial entorhinal cortex. J. Neurosci., 26(16), 4266–4276.
Fyhn, M., Molden, S., Witter, M. P., Moser, E. I., & Moser, M. B. (2004). Spatial representation in the entorhinal cortex. Science, 305, 1258–1264.
Gallistel, C. R., & Gibbon, J. (2000). Time, rate, and conditioning. Psychol. Rev., 107(2), 289–344.
Gilbert, P. E., Kesner, R. P., & Lee, I. (2001). Dissociating hippocampal subregions: A double dissociation between dentate gyrus and CA1. Hippocampus, 11, 626–636.
Gluck, M. A., & Myers, C. E. (1993). Hippocampal mediation of stimulus representation: A computational theory. Hippocampus, 3(4), 491–516.
Gluck, M. A., & Myers, C. E. (1996). Integrating behavioral and physiological models of hippocampal function. Hippocampus, 6(6), 643–653.
Griffiths, T. L., & Tenenbaum, J. B. (2005). Structure and strength in causal induction. Cognit. Psychol., 51(4), 334–384.
Grosslight, J. H., Hall, J. F., & Scott, W. (1954). Reinforcement schedules in habit reversal: A confirmation. J. Exp. Psychol., 48(3), 173–174.
Guzowski, J. F., Knierim, J. J., & Moser, E. I. (2004). Ensemble dynamics of hippocampal regions CA3 and CA1. Neuron, 44(4), 581–584.
Hafting, T., Fyhn, M., Molden, S., Moser, M. B., & Moser, E. I. (2005). Microstructure of a spatial map in the entorhinal cortex. Nature, 436(7052), 801–806.
Hasselmo, M. E., Bodelón, C., & Wyble, B. P. (2002). A proposed function for hippocampal theta rhythm: Separate phases of encoding and retrieval enhance reversal of prior learning. Neural Comput., 14(4), 793–817.
Hasselmo, M. E., & Eichenbaum, H. (2005). Hippocampal mechanisms for the context-dependent retrieval of episodes. Neural Netw., 18(9), 1172–1190.
Hasselmo, M. E., & Schnell, E. (1994). Laminar selectivity of the cholinergic suppression of synaptic transmission in rat hippocampal region CA1: Computational modeling and brain slice physiology. Journal of Neuroscience, 14(6), 3898–3914.
Hasselmo, M. E., Schnell, E., & Barkai, E. (1995). Dynamics of learning and recall at excitatory recurrent synapses and cholinergic modulation in rat hippocampal region CA3. J. Neurosci., 15(7 Pt 2), 5249–5262.
Hasselmo, M. E., Wyble, B. P., & Wallenstein, G. V. (1996). Retrieval of episodic memories: Role of cholinergic and GABAergic modulation in the hippocampus. Hippocampus, 6(6), 693–708.
Hayman, R. M. A., Chakraborty, S., Anderson, M. I., & Jeffery, K. J. (2003). Context-specific acquisition of location discrimination by hippocampal place cells. European Journal of Neuroscience, 18, 1–10.
Hirsh, R. (1974). The hippocampus and contextual retrieval of information from memory: A theory. Behavioral Biology, 12, 421–444.
Hok, V., Save, E., Lenck-Santini, P. P., & Poucet, B. (2005). Coding for spatial goals in the prelimbic/infralimbic area of the rat frontal cortex. Proc. Natl. Acad. Sci. U.S.A., 102(12), 4602–4607.
Hölscher, C., Jacob, W., & Mallot, H. A. (2004). Learned association of allocentric and egocentric information in the hippocampus. Exp. Brain Res., 158(2), 233–240.
Howard, M. W., Fotedar, M. S., Datey, A. V., & Hasselmo, M. E. (2005). The temporal context model in spatial navigation and relational learning: Toward a common explanation of medial temporal lobe function across domains. Psychol. Rev., 112(1), 75–116.
Jarrard, L. E. (1993). On the role of the hippocampus in learning and memory in the rat. Behavioral Neural Biology, 60(1), 9–26.
Jeffery, K. J. (2000). Plasticity of the hippocampal cellular representation of space. In C. Hölscher (Ed.), Neuronal mechanisms of memory formation (pp. 100–124). Cambridge: Cambridge University Press.
Kass, R. E., & Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90, 773–795.
Kentros, C., Hargreaves, E., Hawkins, R. D., Kandel, E. R., Shapiro, M., & Muller, R. U. (1998). Abolition of long-term stability of new hippocampal place cell maps by NMDA receptor blockade. Science, 280, 2121–2126.
Kesner, R. P., Gilbert, P. E., & Barua, L. A. (2002). The role of the hippocampus in memory for the temporal order of a sequence of odors. Behavioral Neuroscience, 116(2), 286–290.
Kesner, R. P., & Novak, J. M. (1982). Serial position curve in rats: Role of the dorsal hippocampus. Science, 218(4568), 173–175.
Kimble, D. P., & Kimble, R. J. (1965). Hippocampectomy and response perseveration in the rat. J. Comp. Physiol. Psychol., 60(3), 474–476.
Knierim, J. J., Kudrimoti, H. S., & McNaughton, B. L. (1995). Place cells, head direction cells, and the learning of landmark stability. Journal of Neuroscience, 15, 1648–1659.
Lee, I., & Kesner, R. P. (2002). Differential contribution of NMDA receptors in hippocampal subregions to spatial working memory. Nature Neuroscience, 5(2), 162–168.
Lee, I., & Kesner, R. P. (2003). Differential roles of dorsal hippocampal subregions in spatial working memory with short versus intermediate delay. Behavioral Neuroscience, 117(5), 1044–1053.
3212
M. Fuhs and D. Touretzky
Lee, I., Yoganarasimha, D., Rao, G., & Knierim, J. J. (2004). Comparison of population coherence of place cells in hippocampal subfields CA1 and CA3. Nature, 430, 456– 459. Lenck-Santini, P. P., Save, E., & Poucet, B. (2001). Place-cell firing does not depend on the direction of turn in a Y-maze alternation task. Eur. J. Neurosci., 13(5), 1055–1058. Leutgeb, J. K., Leutgeb, S., Moser, M.-B., & Moser, E. I. (2006). Expansion recoding is expressed in CA3 but not in the dentate gyrus. Program no. 68.3. In Society for Neuroscience Abstracts. Washington, DC: Society for Neuroscience. Leutgeb, J. K., Leutgeb, S., Treves, A., Meyer, R., Barnes, C. A., McNaughton, B. L., et al. (2005). Progressive transformation of hippocampal neuronal representations in “morphed” environments. Neuron, 48(2), 345–358. Leutgeb, S., Leutgeb, J. K., Barnes, C. A., Moser, E. I., McNaughton, B. L., & Moser, M.-B. (2005). Independent codes for spatial and episodic memory in hippocampal neuronal ensembles. Science, 309(5734), 619–623. Leutgeb, S., Leutgeb, J. K., Treves, A., Moser, M.-B., & Moser, E. I. (2004). Distinct ensemble codes in hippocampal areas CA3 and CA1. Science, 305(5688), 1295– 1298. Lever, C., Wills, T., Cacucci, F., Burgess, N., & O’Keefe, J. (2002). Long-term placticity in hippocampal place-cell representation of environmental geometry. Nature, 416, 90–94. Levy, W. B. (1989). A computational approach to hippocampal function. In R. D. Hawkins & G. H. Bower (Eds.), Computational models of learning in simple neural systems (pp. 243–305). San Diego, CA: Academic Press. Levy, W. B. (1996). A sequence predicting CA3 is a flexible associator that learns and uses context to solve hippocampal-like tasks. Hippocampus, 6(6), 579–591. ¨ Lorincz, A., & Buzs´aki, G. (2000). Two-phase computational model training longterm memories in the entorhinal-hippocampal region. Ann. N.Y. Acad Sci., 911, 83–111. Markus, E. J., Qin, Y., Leonard, B., Skaggs, W. E., McNaughton, B. L., & Barnes, C. A. (1995). Interactions between location and task affect the spatial and directional firing of hippocampal neurons. Journal of Neuroscience, 15, 7079–7094. Marr, D. (1971). Simple memory: A theory of archicortex. Philosophical Transactions of the Royal Society of London, 262(841), 23–81. McDonald, R. J., Ko, C. H., & Hong, N. S. (2002). Attenuation of context-specific inhibition on reversal learning of a stimulus-response task in rats with neurotoxic hippocampal damage. Behav. Brain Res., 136(1), 113–126. McNaughton, B. L., Battaglia, F. P., Jensen, O., Moser, E. I., & Moser, M.-B. (2006). Path integration and the neural basis of the “cognitive map.” Nat. Rev. Neurosci., 7(8), 663–678. McNaughton, B. L., & Morris, R. G. M. (1987). Hippocampal synaptic enhancement and information storage within a distributed memory system. Trends in Neurosciences, 10(10), 408–415. Meng, X.-L. & Wong, W. A. (1996). Simulating ratios of normalizing constants via a simple identity: A theoretical exploration. Statistica Sinica, 6, 831– 860.
Context Learning in the Rodent Hippocampus
3213
Muller, R. U., & Kubie, J. L. (1987). The effects of changes in the environment on the spatial firing of hippocampal complex-spike cells. Journal of Neuroscience, 7, 1951–1968. Myers, C. E., Gluck, M. A., & Granger, R. (1995). Dissociation of hippocampal and entorhinal function in associative learning: A computational approach. Psychobiology, 23, 116–138. Nadel, L., & Willner, J. (1980). Context and conditioning: A place for space. Physiological Psychology, 8(2), 218–228. Nakazawa, K., Quirk, M. C., Chitwood, R. A., Watanabe, M., Yeckel, M. F., Sun, L. D., et al. (2002). Requirement for hippocampal CA3 NMDA receptors in associative memory recall. Science, 297, 211–218. Neave, N., Lloyd, S., Sahgal, A., & Aggleton, J. P. (1994). Lack of effect of lesions in the anterior cingulate cortex and retrosplenial cortex on certain tests of spatial memory in the rat. Behavioural Brain Research, 65, 89–101. O’Keefe, J., & Dostrovsky, J. (1971). The hippocampus as a spatial map: Preliminary evidence from unit activity in the freely moving rat. Brain Research, 34, 171–175. O’Reilly, R. C., & McClelland, J. L. (1994). Hippocampal conjunctive encoding, storage, and recall: Avoiding a trade-off. Hippocampus, 4(6), 661–682. Otmakhova, N. A., Lewey, J., Asrican, B., & Lisman, J. E. (2005). Inhibition of perforant path input to the CA1 region by serotonin and noradrenaline. J. Neurophysiol, 94(2), 1413–1422. Otmakhova, N. A., & Lisman, J. E. (1998). Dopamine selectively inhibits the direct cortical pathway to the CA1 hippocampal region. J. Neurosci., 19(4), 1437–1445. Pearce, J. M., & Hall, G. (1980). A model for Pavlovian learning: Variations in the effectiveness of conditioned but not of unconditioned stimuli. Psychol Rev., 87(6), 532–552. Pennes, E. S., & Ison, J. R. (1967). Effects of partial reinforcement on discrimination learning and subsequent reversal or extinction. J. Exp. Psychol., 74(2), 219–224. Pubols, B. H. (1962). Serial reversal learning as a function of the number of trails per reversal. Journal of Comparative and Physiological Psychology, 55(1), 66–68. Quirk, G. J., Muller, R. U., Kubie, J. L., & Ranck Jr., J. B. (1992). The positional firing properties of medial entorhinal neurons: Description and comparison with hippocampal place cells. Journal of Neuroscience, 12(5), 1945–1963. Redish, A. D. (1999). Beyond the cognitive map: From place cells to episodic memory. Cambridge, MA: MIT Press. Redish, A. D. (2001). The hippocampal debate: Are we asking the right questions? Behavioral Brain Research, 127, 81–98. Redish, A. D., & Touretzky, D. S. (1998). The role of the hippocampus in solving the Morris water maze. Neural Computation, 10(1), 73–111. Rescorla, R. A., & Wagner, A. R. (1972). A theory of Pavlovian conditiong: Variations in the effectiveness of reinforcement and nonreinforcement. In A. H. Black & W. F. Prokesy (Eds.), Classical conditioning II: Current research and theory (pp. 64– 99). New York: Appleton-Century-Crofts. Robitsek, R. J., Fortin, N., & Eichenbaum, H. (2005). Hippocampal unit activity during continuous and delayed t-maze spatial alternation. Program no. 776.7. In Society for Neuroscience Abstracts, Washington, DC: Society for Neuroscience.
3214
M. Fuhs and D. Touretzky
Rolls, E. T. (1996). A theory of hippocampal function in memory. Hippocampus, 6, 601–620. Rudy, J. W., & Sutherland, R. J. (1995). Configural association theory and the hippocampal formation: An appraisal and reconfiguration. Hippocampus, 5, 375–389. Samsonovich, A., & McNaughton, B. L. (1997). Path integration and cognitive mapping in a continuous attractor neural network model. Journal of Neuroscience, 17(15), 5900–5920. Schmajuk, N. A., & DiCarlo, J. J. (1991). A neural network approach to hippocampal function in classical conditioning. Behav. Neurosci., 105(1), 82–110. Schmajuk, N. A., & DiCarlo, J. J. (1992). Stimulus configuration, classical conditioning, and hippocampal function. Psychol. Rev., 99(2), 268–305. Schmitz, D., Mellor, J., Breustedt, J., & Nicoll, R. A. (2003). Presynaptic kainate receptors impart an associative property to hippocampal mossy fiber long-term potentiation. Nature Neuroscience, 6(10), 1058–1063. Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461–464. Shapiro, M. L., Tanila, H., & Eichenbaum, H. (1997). Cues the hippocampal place cells encode: Dynamic and hierarchcal representation of local and distal stimuli. Hippocampus, 7(6), 624–642. Silveira, J. M., & Kimble, D. P. (1968). Brightness discrimination and reversal in hippocampally-lesioned rats. Physiology and Behavior, 3, 625–630. Skaggs, W. E., & McNaughton, B. L. (1998). Spatial firing properties of hippocampal CA1 populations in an environment containing two visually identical regions. Journal of Neuroscience, 18(20), 8455–8466. Smith, D. M., & Mizumori, S. J. Y. (2006). Learning-related development of contextspecific neuronal responses to places and events: The hippocampal role in context processing. J Neurosci, 26(12), 3154–3163. Tabuchi, E., Mulder, A. B., & Wiener, S. I. (2003). Reward value invariant place responses and reward site associated activity in hippocampal neurons of behaving rats. Hippocampus, 13(1), 117–132. Tanila, H., Shapiro, M. L., & Eichenbaum, H. (1997). Discordance of spatial representation in ensembles of hippocampal place cells. Hippocampus, 7(6), 613–623. Tenenbaum, J. B., & Griffiths, T. L. (2001). Generalization, similarity, and Bayesian inference. Behav. Brain Sci., 24(4), 629–640, 652–791. Touretzky, D. S., & Redish, A. D. (1996). A theory of rodent navigation based on interacting representations of space. Hippocampus, 6(3), 247–270. Treves, A., & Rolls, E. T. (1992). Computational constraints suggest the need for two distinct input systems to the hippocampal CA3 network. Hippocampus, 2, 189–199. Treves, A., & Rolls, E. T. (1994). Computational analysis of the role of hippocampus in memory. Hippocampus, 4(3), 374–391. Tsodyks, M. (1999). Attractor neural network models of spatial maps in hippocampus. Hippocampus, 9, 481–489. Vazdarjanova, A., & Guzowski, J. F. (2004). Differences in hippocampal neuronal population responses to modifications of an environmental context: Evidence for distinct, yet complementary, functions of CA3 and CA1 ensembles. J. Neurosci., 24(29), 6489–6496.
Context Learning in the Rodent Hippocampus
3215
Viterbi, A. J. (1967). Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory, 13, 260– 269. Wallenstein, G. V., Eichenbaum, H., & Hasselmo, M. E. (1998). The hippocampus as an associator of discontiguous events. Trends in Neuroscience, 21(8), 317–323. Wallenstein, G. V., & Hasselmo, M. E. (1998). GABAergic modulation of hippocampal population activity: Sequence learning, place field development, and the phase precession effect. Journal of Neurophysiology, 78, 393–408. Wike, E. L. (1953). Extinction of a partially and continuously reinforced response with and without a rewarded alternative. J. Exp. Psychol., 46(4), 255–260. Winocur, G., & Olds, J. (1978). Effects of context manipulation on memory and reversal learning in rats with hippocampal lesions. J. Comp. Physiol. Psychol., 92(2), 312–321. Wise, L. M. (1962). Supplementary report: The Weinstock partial reinforcement effect and habit reversal. J. Exp. Psychol., 64, 647–648. Wood, E. R., Dudchenko, P. A., & Eichenbaum, H. (1999). The global record of memory in hippocampal neuronal activity. Nature, 397, 613–616. Wood, E. R., Dudchenko, P. A., Robitsek, R. J., & Eichenbaum, H. (2000) Hippocampal neurons encode information about different types of memory episodes occurring in the same location. Neuron, 27, 623–633. Xavier, G. F., Oliveira-Filho, F. J. B., & Santos, A. M. G. (1999). Dentate gyrus-selective colchicine lesion and disruption of performance in spatial tasks: Difficulties in “place strategy” because of a lack of flexibility in the use of environmental cues. Hippocampus, 9, 668–681. Yu, A. J., & Dayan, P. (2005). Uncertainty, neuromodulation, and attention. Neuron, 46(4), 681–692.
Received July 5, 2006; accepted January 30, 2007.
NOTE
Communicated by Eugene Izhikevich
Solution Methods for a New Class of Simple Model Neurons

Mark D. Humphries
[email protected]
Kevin Gurney
[email protected]
Adaptive Behaviour Research Group, Department of Psychology, University of Sheffield, Sheffield S10 2TP, U.K.
Izhikevich (2003) proposed a new canonical neuron model of spike generation. The model was surprisingly simple yet able to accurately replicate the firing patterns of different types of cortical cell. Here, we derive a solution method that allows efficient simulation of the model.

Neural Computation 19, 3216–3225 (2007)
© 2007 Massachusetts Institute of Technology

1 Introduction

Two constraining criteria govern the choice of neuron models: accuracy and speed. If a model is highly accurate in its behavior, it probably demands too much computation time to be used in large networks; if a model has a low computation time, it is probably not accurate enough to capture all key dynamic properties of the modeled cell. As modelers, we seek efficient models that maximize the trade-off between accuracy and speed and which thus allow networks of sufficiently accurate neurons to be simulated comfortably on modest computing resources: this search has led to the proposal of numerous models.

Izhikevich (2003) proposed a simple neuron model that was far more accurate in its replication of membrane potential behavior than the common leaky integrate-and-fire model, but that was not much more complex in its mathematical form. An equivalent, biophysical form of the model is given in Izhikevich (2006), which we use here. If v is the membrane potential and u is the dominant slow current of the neuron class, then
$$C\dot{v} = k(v - v_r)(v - v_t) - u + I, \tag{1.1}$$
$$\dot{u} = a\left[b(v - v_r) - u\right], \tag{1.2}$$
with reset condition: if $v \ge v_{peak}$, then $v \leftarrow c$ and $u \leftarrow u + d$, where C is capacitance, $v_r$ and $v_t$ are the resting and threshold potentials, a is the time constant of the slow current, and c is the reset potential (i.e., the value of the membrane potential immediately after an action potential is fired). Parameters k and b are derived from the current-voltage relation of the neuron, and d is tuned to achieve the desired spiking behavior. All of these parameters are given constant values for a specific neuron class, and the range of spike train phenomena captured by this two-dimensional model is impressive (Izhikevich, 2004).

The rationale behind Izhikevich's original work was to provide a better accuracy-speed trade-off than was available in existing models (Izhikevich, 2004). Notwithstanding the success of this program, our aim was to improve this trade-off even further by finding better solution methods for the model than those used in its original deployments (Izhikevich, 2004; Izhikevich, Gally, & Edelman, 2004).

2 A Zero-Order Hold Solution

In all example simulations and code that Izhikevich (2006) provides and in the large-scale model of cortex (Izhikevich et al., 2004), a forward Euler update is used as the numerical method for solving equations 1.1 and 1.2: given a time step of $\Delta t$, a differential $\dot{y}$ is approximated at time $t + \Delta t$ by $y(t + \Delta t) = y(t) + \Delta t\,\dot{y}(t)$. Forward Euler is the simplest form of numerical integration but has considerable problems of stability and accuracy under certain conditions unless a small time step is used (Hansel, Mato, Meunier, & Neltner, 1998). Its main advantage is that each equation need be computed only once per time step. However, there are other once-per-time-step numerical integration methods available that achieve better levels of accuracy, as usually required in simulating neural models.

One method, used with coupled ordinary differential equations (ODEs) of the form $\dot{y}_k = f(y_k, y_i, y_j, \ldots)$, is to treat the other time-dependent variables $(y_i, y_j, \ldots)$ in each equation as constant over some small time interval $\Delta t$ (equivalent to approximating these variables by the zeroth-order terms in their Taylor series expansions). If there is an exact solution for each of the resulting equations $\dot{y}_k = f(y_k, D)$, where D is constant, then these may be used to determine $y_k(t + \Delta t)$ by holding suitable sets of variables constant at any one time. With two equations (as in equations 1.1 and 1.2), we simply solve each equation exactly by holding the other variable constant. This scheme is referred to as the zero-order hold approximation (ZOH) and has been used, for example, to solve Hodgkin-Huxley type models (Dayan & Abbott, 2001). Solving the typical form of the equations for neural models in this way produces solutions that are grouped as a numerical method called exponential Euler: the neural simulator GENESIS uses this as its solution method (Bower & Beeman, 1998). We now show that the zero-order hold technique may be applied to the Izhikevich model equations.

2.1 Slow Current. We begin with the slow current u, as this has a straightforward solution. We take equation 1.2 and assume v to be constant
between an initial $t_0$ and current time t, and solve. Noting that the right-hand side of equation 1.2 has no explicit dependence on time, it is trivially linearly separable in u, t and can be solved exactly using an integrating factor $e^{\int a\,dt} = e^{at}$:
$$u(t) = u(t_0)e^{-a(t-t_0)} + \int_{t_0}^{t} a\,b(v - v_r)\,e^{-a(t-\tau)}\,d\tau.$$
Assuming constant v, the solution is therefore
$$u(t) = b(v - v_r)\left[1 - e^{-a(t-t_0)}\right] + u(t_0)e^{-a(t-t_0)}. \tag{2.1}$$
We use this as a discrete-time approximation of u after some small time step $\Delta t$ with the replacement $t \leftarrow t_0 + \Delta t$. Izhikevich (2006) gives some models of specific neuron species that derive u from a slightly different approximation to the dynamics of the slow current. However, the resulting form of u is similar to equation 1.2, and so rederiving the ZOH solution to replace equation 2.1 is straightforward. We give an example for the model of cortical fast-spiking interneurons, as we use this model later. The form of u for this model is
$$\dot{u} = \begin{cases} -a u, & \text{if } v < v_b, \\ a\left[b(v - v_b)^3 - u\right], & \text{if } v \ge v_b, \end{cases}$$
and there is no reset of u following a spike (i.e., d = 0). Given bounds $t_0$ and t as before, the solutions are
$$u(t) = \begin{cases} u(t_0)e^{-a(t-t_0)}, & \text{if } v(t_0) < v_b, \\ b(v - v_b)^3\left[1 - e^{-a(t-t_0)}\right] + u(t_0)e^{-a(t-t_0)}, & \text{if } v(t_0) \ge v_b. \end{cases} \tag{2.2}$$
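For concreteness, the resulting discrete-time update of u can be written directly from equations 2.1 and 2.2. The C sketch below is ours, not the authors' published code, and the function names are illustrative only:

```c
#include <math.h>

/* ZOH update of the slow current over one step dt (equation 2.1).
   v is held constant at its value from the start of the step. */
double u_zoh_step(double u, double v, double a, double b,
                  double vr, double dt)
{
    double decay = exp(-a * dt);
    return b * (v - vr) * (1.0 - decay) + u * decay;
}

/* Variant for the fast-spiking interneuron model (equation 2.2):
   u decays passively below vb and tracks b(v - vb)^3 above it. */
double u_zoh_step_fs(double u, double v, double a, double b,
                     double vb, double dt)
{
    double decay = exp(-a * dt);
    if (v < vb)
        return u * decay;
    return b * pow(v - vb, 3.0) * (1.0 - decay) + u * decay;
}
```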
2.2 The Membrane Potential. We begin the same way as above: we take equation 1.1 and assume u to be constant over some small time interval between the start of integration $t_0$ and current time t, and solve (we assume I to be a constant current injection, so the solution is exact in I as well). In general, nonlinear ODEs do not have a closed-form solution; fortunately, equation 1.1 is a special case and can be expanded out into
$$\dot{v} = \frac{k v_r v_t - u + I}{C} - \frac{k(v_r + v_t)}{C}v + \frac{k}{C}v^2, \tag{2.3}$$
which is a Riccati differential equation, of the form $\dot{v} = P(t) + Q(t)v + R(t)v^2$. These can be solved by substituting $z = 1/(v - v_1)$, where $v_1$ is a particular solution to $\dot{v}$. Since P, Q, R are in fact independent of time (because of our assumptions of constancy), the equation for $\dot{z}$ is (trivially) linearly
separable: $\dot{z} = -\left[Q(t) + 2v_1 R(t)\right]z - R(t)$. Having solved this in terms of z, we can solve for v with the relation $v = v_1 + 1/z$. Finding a particular solution to equation 2.3 by inspection is difficult, so we propose here a simple method that generates a particular solution and thus allows the solution of any Riccati differential equation (we are not aware of this method previously appearing in the literature). We seek a particular solution by putting $\dot{v} = 0$ since, with I and u constant, we then obtain a simple quadratic in v,
$$k v^2 - k(v_r + v_t)v - u + I + k v_r v_t = 0,$$
which has solutions
$$v_1 = \frac{k(v_r + v_t) \pm A}{2k}, \qquad A = \sqrt{\left[-k(v_r + v_t)\right]^2 - 4k(-u + I + k v_r v_t)}. \tag{2.4}$$
Since we need only one particular solution, we choose the root with +A. Now that we have an expression for the particular solution $v_1$, we can complete the substitution to express equation 2.3 in terms of z,
$$\dot{z} = -\left[-\frac{k(v_r + v_t)}{C} + 2\,\frac{k(v_r + v_t) + A}{2k}\,\frac{k}{C}\right]z - \frac{k}{C},$$
which simplifies to the linearly separable form
$$\dot{z} + \frac{A}{C}z = -\frac{k}{C}. \tag{2.5}$$
This can be solved with an integrating factor $e^{\int A/C\,dt} = e^{tA/C}$, giving the solution between $t_0$ and t,
$$z(t) = \int_{t_0}^{t} -\frac{k}{C}\,e^{-(t-\tau)A/C}\,d\tau + C_0 e^{-tA/C}, \tag{2.6}$$
where $C_0$ is a constant of integration (which we find below). Integrating equation 2.6 gives
$$z(t) = -\frac{k}{A}\left(1 - e^{-(t-t_0)A/C}\right) + C_0 e^{-tA/C},$$
and we go back to v with the relation given above to get
$$v(t) = \frac{k(v_r + v_t) + A}{2k} + \left[-\frac{k}{A}\left(1 - e^{-(t-t_0)A/C}\right) + C_0 e^{-tA/C}\right]^{-1}. \tag{2.7}$$
We now find $C_0$ in terms of the initial value $v(t_0)$. Thus,
$$v(t_0) = \frac{k(v_r + v_t) + A}{2k} + \frac{e^{t_0 A/C}}{C_0},$$
and solving for $C_0$ gives
$$C_0 = e^{t_0 A/C}\left[v(t_0) - \frac{k(v_r + v_t) + A}{2k}\right]^{-1}. \tag{2.8}$$
We substitute equation 2.8 into 2.7 to get the full ZOH solution for v:
$$v(t) = \frac{k(v_r + v_t) + A}{2k} + \left\{-\frac{k}{A}\left(1 - e^{-(t-t_0)A/C}\right) + e^{-(t-t_0)A/C}\left[v(t_0) - \frac{k(v_r + v_t) + A}{2k}\right]^{-1}\right\}^{-1}, \tag{2.9}$$
where A is given in equation 2.4. Again this becomes a discrete-time solution with the replacement $t \leftarrow t_0 + \Delta t$. Spike detection and reset in discrete time is handled by: if $v(t_0 + \Delta t) \ge v_{peak}$, then $v(t_0 + \Delta t) \leftarrow c$ and $u(t_0 + \Delta t) \leftarrow u(t_0 + \Delta t) + d$. Following Izhikevich (2006), we also set $v(t_0) = v_{peak}$.

2.3 Potential Limitations of the Solution. For some values of u and I, the particular solution $v_1$ found by setting $\dot{v} = 0$ may have complex roots (with A imaginary). This implies that v has no fixed point and will escape to infinity in a finite time. However, even if $v_1$ is complex, it can be shown that the resulting value for v is always real, as required. Let us assume that A is imaginary, that is, $A = iB$. Substituting this in equation 2.9 and using the Euler identity ($e^{i\theta} = \cos\theta + i\sin\theta$) gives
$$v(t) = \frac{v_r + v_t}{2} + \frac{Y\cos X + (kY^2/B)\sin X - (B/4k)\sin X}{\tfrac{1}{2}(1 + \cos X) + (2kY/B)\sin X - 2(kY/B)^2(-1 + \cos X)}, \tag{2.10}$$
where $X = -(t - t_0)B/C$ and $Y = v(t_0) - (v_r + v_t)/2$. Notice that v is always real and that our method for finding a particular solution thus produces sensible results. In addition, equation 2.10 gives a form for the calculation of v that avoids having to represent and compute complex numbers in simulation.
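To make the discrete-time update concrete, the following C sketch performs one ZOH step of v, switching to the real-valued form of equation 2.10 when A is imaginary, as the authors' C implementation (described in section 3) does. The sketch is ours, not the published code, and the names are illustrative:

```c
#include <math.h>

/* One ZOH step of the membrane potential (equations 2.9 and 2.10).
   u and I are held constant over the step dt. */
double v_zoh_step(double v0, double u, double I, double C,
                  double k, double vr, double vt, double dt)
{
    double disc = k * (vr + vt) * k * (vr + vt)
                - 4.0 * k * (-u + I + k * vr * vt);
    if (disc >= 0.0) {                       /* A real: equation 2.9 */
        double A  = sqrt(disc);
        double v1 = (k * (vr + vt) + A) / (2.0 * k);
        double e  = exp(-dt * A / C);
        double z  = -(k / A) * (1.0 - e) + e / (v0 - v1);
        return v1 + 1.0 / z;
    } else {                                 /* A imaginary: equation 2.10 */
        double B = sqrt(-disc);
        double X = -dt * B / C;
        double Y = v0 - 0.5 * (vr + vt);
        double num = Y * cos(X) + (k * Y * Y / B) * sin(X)
                   - (B / (4.0 * k)) * sin(X);
        double den = 0.5 * (1.0 + cos(X)) + (2.0 * k * Y / B) * sin(X)
                   - 2.0 * pow(k * Y / B, 2.0) * (-1.0 + cos(X));
        return 0.5 * (vr + vt) + num / den;
    }
}
```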
As with any other numerical method, it is useful to find an upper bound on $\Delta t$. Our limit here is the value beyond which the assumption of constant u in equation 1.1 and constant v in equation 1.2 breaks down. Ideally, for equation 1.1, $u(t_0) \gg \Delta t\,\dot{u}(t_0)$, so that the deviation from a constant value of u over $\Delta t$ is very small (based on the Taylor series expansion, and assuming the contribution from higher-order terms will be negligible). Thus, a reasonable upper bound is $\Delta t < u(t_0)/\dot{u}(t_0)$. Similarly, for equation 1.2, the upper bound is $\Delta t < v(t_0)/\dot{v}(t_0)$. These bounds should hold for most, but not all, t (if v does escape to infinity in a finite time, as when a spike is produced, then the ratios would also go to infinity and should be discounted).

3 Comparison of Numerical Solution Methods

For a given time step, the forward Euler method is always the fastest fixed-step solution method (in terms of computation time), as it requires one multiplication and one addition operation per ODE above the floating-point operation requirements of the ODE system itself. This, combined with its ease of implementation, explains its widespread use in neural modeling (including the solution of Izhikevich's model neurons). Conversely, one would expect that the forward Euler method is always the least accurate fixed-step solution for a given time step (Hansel et al., 1998). We thus compare our zero-order hold (ZOH) solutions and forward Euler as numerical methods for solving the simple model neurons to determine which method is the more efficient (achieving the best accuracy-to-speed trade-off).

We use four classes of cortical neurons modeled as Izhikevich simple model neurons: the regular-spiking (RS) neurons, intrinsically bursting (IB) neurons, chattering (CH) neurons, and fast-spiking (FS) interneurons. Their parameter values, taken from Izhikevich (2006), are given in the legend of Figure 1. All model neurons were coded in both Matlab 7 (Mathworks) and C versions to test differences due to language implementations (all code available at www.abrg.group.shef.ac.uk). The C version used equation 2.10 if A, given by equation 2.4, was complex, and equation 2.9 otherwise. One Matlab version ("control") also used this system, and hence acted as a control for other differences between the C and Matlab implementations; the other Matlab version ("native") made use of Matlab's built-in handling of complex numbers and used equation 2.9 throughout.

As a benchmark for accuracy, we solved each model neuron with a high-order variable-step solver, ode45 from the Matlab ODE suite: a Runge-Kutta 4th-5th order formula, the Dormand-Prince pair (Ashino, Nagase, & Vaillancourt, 2000). Each benchmark simulation for a model neuron class was run for 1000 ms, with a current step I onset at 100 ms. Using the same input, the fixed-step solutions (forward Euler and ZOH) were then assessed for each time step $\Delta t$ taken from the interval [0.01, 0.5] ms in increments of 0.01 ms, giving 50 tested time steps in total. (The upper limit of $\Delta t = 0.5$ ms was determined by computing the rough bounds on step size for the ZOH method for each neuron class, as described above.) Each simulation using a fixed-step solution terminated when the same number of spikes as the benchmark simulation (using the variable-step solver) had been produced.
Errors in solution were expressed as the mean error per spike for each $\Delta t$, where each error per spike was given as the time difference between the variable-step solution crossing $v_{peak}$ and the fixed-step solution's crossing of $v_{peak}$. The minimum possible error would thus be in the interval $[0, \Delta t/2]$, due to the quantization imposed by the fixed-step methods. Computation times for the fixed-step solutions for each model neuron class were also computed. For the Matlab versions, a batch of 30 simulations was run for each $\Delta t$, and the time taken for each simulation was recorded. The mean computation time was then calculated by averaging over that batch. For the C versions, 30 batches of 1000 simulations were run for each $\Delta t$, and the time taken for each batch was recorded. The mean computation time was then calculated by averaging over all 30 batches.

For each language-implementation and neuron-class combination, we took the mean computation times $T^{zoh}(\Delta t)$, $T^{euler}(\Delta t)$ and mean errors per spike $E^{zoh}(\Delta t)$, $E^{euler}(\Delta t)$ and calculated the following ratios: $T^{r}(\Delta t) = T^{euler}(\Delta t)/T^{zoh}(\Delta t)$ and $E^{r}(\Delta t) = E^{euler}(\Delta t)/E^{zoh}(\Delta t)$. We expected (and indeed found) that $T^{r}(\Delta t) < 1$ always, because the forward Euler method is always faster for a given time step; we expected that $E^{r}(\Delta t) > 1$ for most time steps and neuron types, because the ZOH method is expected to be more accurate than forward Euler. To express the accuracy-to-speed trade-off, we calculate $R(\Delta t) = T^{r}(\Delta t) \times E^{r}(\Delta t)$. Thus, if $R(\Delta t) > 1$, then the magnitude gain in accuracy by using the ZOH method is greater than the magnitude loss in speed for that time step; in other words, the ZOH method is more efficient. We plot $R(\Delta t)$ for each neuron class and language implementation combination in Figure 1 (the log scale is used because if $R(\Delta t) = 2$ means that the ZOH method was twice as efficient, then $R(\Delta t) = 0.5$ means it was equivalently half as efficient).
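For reference, the forward Euler baseline being compared against is one line per equation. The minimal sketch below is ours, with illustrative names:

```c
/* One forward Euler step of equations 1.1 and 1.2, with spike reset. */
void euler_step(double *v, double *u, double I, double dt,
                double C, double k, double vr, double vt,
                double a, double b, double c, double d,
                double vpeak)
{
    double dv = (k * (*v - vr) * (*v - vt) - *u + I) / C;
    double du = a * (b * (*v - vr) - *u);
    *v += dt * dv;
    *u += dt * du;
    if (*v >= vpeak) { *v = c; *u += d; }
}
```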
Figure 1: Accuracy-to-speed trade-off for zero-order hold (ZOH) and forward Euler numerical solutions to four classes of simple model neurons. A value of $R(\Delta t) > 1$ (white area) indicates that the ZOH method had a greater magnitude gain in accuracy than loss of speed for that time step, and thus achieved a better accuracy-to-speed trade-off than forward Euler. (A) Native Matlab version, making use of its built-in handling of complex numbers. (B) Control Matlab version. (C) The C version, using an equivalent scheme to the control Matlab version. The parameter values used for each model were as follows. Regular-spiking (RS) neuron: C = 100, vr = −60, vt = −40, I = 70, k = 0.7, a = 0.03, b = −2, c = −50, d = 100, vpeak = 35. Intrinsically bursting (IB) neuron: C = 150, vr = −75, vt = −45, I = 500, k = 1.2, a = 0.01, b = 5, c = −56, d = 130, vpeak = 50. Chattering (CH) neuron: C = 50, vr = −60, vt = −40, I = 200, k = 1.5, a = 0.03, b = 1, c = −40, d = 150, vpeak = 20. Fast-spiking (FS) interneuron: C = 20, vr = −55, vt = −40, I = 100, k = 1, a = 0.2, b = 0.025, c = −45, d = 0, vpeak = 25, and vb = −55 (see text).
We find consistent patterns of dependence on $\Delta t$ for each neuron class across all language implementations. Solving the RS class was always more efficient using forward Euler, an unsurprising result given the simple dynamics of this model. Solving the CH and IB classes was always more efficient using the ZOH method, except at the smallest $\Delta t$. In either Matlab version, solving the FS class was more efficient using the ZOH method over the approximate interval $\Delta t \in [0.02, 0.3]$ ms; over the same interval, the C version showed that neither method was clearly more efficient: mean $R(\Delta t)$ was ∼1.17 over that interval. Over all simulations we found (1) that the errors were identical to three significant figures between the two Matlab versions, verifying that equation 2.10 gives correct results in simulation; (2) that computation using equation 2.10 was faster in Matlab than using built-in complex numbers, hence the higher $R(\Delta t)$ for the "control" Matlab version; and (3) that there was a considerable difference in relative computation times across language implementations (which was not dependent on the form of ZOH solution). Across all neuron classes and time steps, the "control" Matlab version had $T^{r}(\Delta t) \sim 0.7$ but the C version had $T^{r}(\Delta t) \sim 0.2$.

4 Conclusion

The new simple neuron model of Izhikevich (2003) may provide the best trade-off yet between computation time and accuracy for a model of spiking activity. We have derived a new numerical solution to this model, which may allow for significantly improved efficiency in simulation. Our numerical simulations showed how this efficiency was dependent on the choice of time step for most neuron classes. In general, we expect the choice of time step to be away from the extremes of the range that we tested, around $\Delta t \sim 0.1$ ms. Too small ($\Delta t \sim 0.01$ ms), and there is no gain in efficiency over more accurate methods, including other fixed-step methods such as midpoint (in simulation, we found that the fixed-step simulations in Matlab took at least as long as those using the variable-step solver for $\Delta t \in [0.01, 0.05]$ ms). Too large ($\Delta t \sim 0.5$ ms), and enough error is introduced to the simulation to markedly affect the results (in simulation, we found that the mean error for the CH, IB, and FS neuron classes was at least 10 ms per spike). Thus, from our results, we believe that ZOH is a more efficient solution method for most neuron classes over the likely range of usable time steps.

Future work will determine how these results from a single-neuron case extrapolate to a network. One might naively expect that errors in individual neuron solutions would propagate through a network, leading to considerable accumulation of error over time; on the other hand, the firing of a spike is dependent on multiple presynaptic firings, and thus the errors may be absorbed in the naturally noisy process of spike accumulation. The choice of network time step is more critical than for the single neuron: in addition to reducing error (whether it propagates or not), the time step ideally should be smaller than any characteristic scale of the dynamics of interest, to avoid artifactual oscillations and synchronizations caused by quantizing spikes to the time step.
The solution also forms the basis of further analytic studies of this neuron class. For example, solutions 2.1 and 2.9 give exact values if equilibrium solutions to the coupled system 1.1–1.2 exist for given parameter values and injection currents.

Acknowledgments

We thank Nathan Lepora for discussions on the mathematics and Ric Wood for helpful comments on a draft of this manuscript. This work was supported by the EU Framework 6 ICEA project.

References

Ashino, R., Nagase, M., & Vaillancourt, R. (2000). Behind and beyond the MATLAB ODE suite. Computers and Mathematics with Applications, 40, 491–512.
Bower, J. M., & Beeman, D. (1998). The book of GENESIS: Exploring realistic neural models with the general neural simulation system (2nd ed.). New York: Springer-Verlag.
Dayan, P., & Abbott, L. F. (2001). Theoretical neuroscience. Cambridge, MA: MIT Press.
Hansel, D., Mato, G., Meunier, C., & Neltner, L. (1998). On numerical simulations of integrate-and-fire neural networks. Neural Comput., 10(2), 467–483.
Izhikevich, E. M. (2003). Simple model of spiking neurons. IEEE Trans. Neural Netw., 14, 1569–1572.
Izhikevich, E. M. (2004). Which model to use for cortical spiking neurons? IEEE Trans. Neural Netw., 15(5), 1063–1070.
Izhikevich, E. M. (2006). Dynamical systems in neuroscience. Cambridge, MA: MIT Press.
Izhikevich, E. M., Gally, J. A., & Edelman, G. M. (2004). Spike-timing dynamics of neuronal groups. Cereb. Cortex, 14(8), 933–944.
Received June 21, 2006; accepted October 7, 2006.
NOTE
Communicated by Raul Muresan
Event-Driven Simulations of Nonlinear Integrate-and-Fire Neurons

Arnaud Tonnelier
[email protected]
Hama Belmabrouk
[email protected]
Dominique Martinez
[email protected]
Cortex Project, LORIA, 54506 Vandoeuvre-lès-Nancy, France
Event-driven strategies have been used to simulate spiking neural networks exactly. Previous work is limited to linear integrate-and-fire neurons. In this note, we extend event-driven schemes to a class of nonlinear integrate-and-fire models. Results are presented for the quadratic integrate-and-fire model with instantaneous or exponential synaptic currents. Extensions to conductance-based currents and exponential integrate-and-fire neurons are discussed.

Neural Computation 19, 3226–3238 (2007)
© 2007 Massachusetts Institute of Technology

1 Introduction

Our current theoretical understanding of the properties of neural systems is mainly based on numerical simulations, from single-cell models to neural networks. Recent experimental evidence has accumulated suggesting that precise spike-time coding is used in various neuronal systems (VanRullen, Guyonneau, & Thorpe, 2005). Individual spikes can be highly reliable and precisely timed with a submillisecond accuracy (Berry, Warland, & Meister, 1997; Mainen & Sejnowski, 1995). Synaptic plasticity depends critically on the relative timing of pre- and postsynaptic spikes; a synapse is strengthened if the presynaptic spike occurs shortly before the postsynaptic neuron fires, and the synapse is weakened if the sequence of spikes is reversed (Bi & Poo, 1998). It is therefore important to have accurate numerical schemes to calculate spike times.

Integrate-and-fire neurons reproduce many features of neuronal dynamics (Izhikevich, 2003; Tonnelier & Gerstner, 2003; Gerstner & Kistler, 2002) and are widely used in the numerical simulation of spiking neural networks. Two strategies have been used for the simulation of integrate-and-fire neural networks: time-stepping methods that approximate the membrane voltage of neurons on a discretized time grid, and event-driven schemes where the timings of spikes are calculated exactly. By definition, time-stepping approximations are imprecise, and it has been shown
that time steps have to be chosen correctly to reproduce the synchronization properties of networks of spiking neurons (Hansel, Mato, Meunier, & Neltner, 1998). Standard time-stepping algorithms (Euler, Runge-Kutta) have to be modified to give an accurate approximation of firing times (Shelley & Tao, 2001). A fundamental limitation on the accuracy of any time-stepping method is imposed by the smoothness of the postsynaptic potentials (Shelley & Tao, 2001). High-order time-stepping algorithms can be constructed only if the onset of postsynaptic conductance changes has smooth derivatives.

Exact simulations avoid these problems. The term exact method or exact simulation means that spike timings are analytically given or are derived from the analytical formulation of the membrane potential. Thus, it is possible to have an arbitrary precision (up to the machine precision) of spike timings. This method has become increasingly popular (Mattia & Giudice, 2000; Makino, 2003; Rochel & Martinez, 2003; Brette, 2006; Rudolph & Destexhe, 2006), but it applies to a limited class of neuron models, mainly the linear integrate-and-fire models. It is known that the leaky integrate-and-fire model has some limitations: it has an unrealistic behavior close to the threshold and reproduces only some characteristics of conductance-based neurons (Fourcaud-Trocmé, Hansel, van Vreeswijk, & Brunel, 2003). More realistic models include nonlinear spike-generating currents that allow replacement of the strict voltage threshold by a smooth spike initiation zone. Here we simulate exactly the quadratic integrate-and-fire neuron (Ermentrout & Kopell, 1986). In the quadratic integrate-and-fire (QIF) model, the membrane potential follows
$$C\frac{dV}{dt} = q(V - V_{th})^2 - I_{th} + I_s(t), \tag{1.1}$$
where V is the membrane voltage, C is the membrane capacitance, q is a parameter characterizing the frequency-current response curve, $I_{th}$ is the threshold current, $I_s$ is the synaptic current, and $V_{th}$ is the voltage threshold, that is, the largest steady voltage at which the neuron can be maintained by a constant input. Without synaptic currents, $I_s = 0$, the QIF neuron presents two distinct regimes. When $I_{th} > 0$, there are two fixed points. The stable one defines the resting state $V_{rest}$ of the neuron:
$$V_{rest} = V_{th} - \sqrt{\frac{I_{th}}{q}}. \tag{1.2}$$
The unstable one is the threshold above which trajectories tend toward infinity in finite time, which defines the spike time. When $I_{th} < 0$, the neuron fires regularly. This model represents the normal form of any type I neuron near the saddle-node bifurcation (Ermentrout, 1996; Ermentrout & Kopell, 1986) and is related to the so-called θ neuron (Ermentrout, 1996; Gutkin &
Ermentrout, 1998). Since the quadratic integrate-and-fire neuron is expected to reproduce the characteristics of any type I neuron close to bifurcation, it has been widely used as a realistic neuron model (Hansel & Mato, 2001; Latham, Richmond, Nelson, & Nirenberg, 2000; Brunel & Latham, 2003; Fourcaud-Trocmé et al., 2003). The action potential is defined as a divergence of the voltage. In numerical simulations, one has to introduce a cutoff at a finite voltage $V_{peak}$. After a spike, the membrane potential is instantaneously reset to $V_{reset}$.

The synaptic current $I_s$ is induced by presynaptic spikes. An incoming spike at time $t_f$ triggers a postsynaptic current,
$$I_s(t) = w\,\delta(t - t_f), \tag{1.3}$$
where w is the synaptic weight of the synapse and δ is the Dirac delta function. More realistic models use exponential currents,
$$I_s(t) = w\,\exp(-(t - t_f)/\tau_s), \tag{1.4}$$
where $\tau_s$ is the synaptic time constant. An event-driven simulation requires an analytical expression for the membrane potential or, at least, a closed-form expression for the firing times. In many cases, a given neuron in the network will not fire (its firing time is +∞). In such situations, the efficiency of the simulation is improved using a spike test, an algorithm that checks quickly whether a neuron will fire. Note that the efficiency of the spike test crucially depends on the overall activity of the network. In this note, we describe a method to simulate exactly the quadratic integrate-and-fire model, equation 1.1, with the synaptic currents, equations 1.3 and 1.4. The membrane potential is solved analytically, and a spike test is derived.

2 Exact Simulation

We consider the dimensionless QIF model,
$$\frac{dV}{dt} = V^2 - I_{th}, \tag{2.1}$$
obtained from equation 1.1 by the change of variables $V \leftarrow q(V - V_{th})/C$ and $I_{th} \leftarrow qI_{th}/C^2$ (for convenience, we do not change the notations). We further consider an instantaneous synaptic current so that $V \to V + w$ when a spike is received and $V \to V_{reset}$ when a spike is emitted. We consider excitable neurons and take a positive threshold current. In an interval with no spike (i.e., no incoming spike and no spike emitted by the neuron itself),
equation 2.1 can be solved analytically with the separation-of-variables technique. The membrane potential evolves according to
$$V(t) = -\sqrt{I_{th}}\,\tanh\!\left(\sqrt{I_{th}}\,(t + c)\right), \tag{2.2}$$
where c is a constant of integration given by the initial condition $V(0) = -\sqrt{I_{th}}\tanh(c\sqrt{I_{th}})$. For $V(0) < \sqrt{I_{th}}$, the neuron goes back to its resting value, and no spike is emitted. A spike is emitted for $V(0) > \sqrt{I_{th}}$, that is, for an initial value greater than the unstable fixed point of the neuron. In this case, we calculate $c\sqrt{I_{th}} = -\operatorname{atanh}(\sqrt{I_{th}}/V(0)) + i\pi/2$.¹ Using $\tanh(x + i\pi/2) = \coth(x)$, we have
$$V(t) = -\sqrt{I_{th}}\,\coth\!\left(\sqrt{I_{th}}\,t - \operatorname{atanh}\!\left(\sqrt{I_{th}}/V(0)\right)\right). \tag{2.3}$$
The firing time is obtained when V(t) crosses $V_{peak}$ and is explicitly given using equation 2.3. In the limit $V_{peak} \to \infty$, the firing time is obtained by setting the argument of the coth function to zero and is simply given by
$$t_f = \frac{1}{\sqrt{I_{th}}}\,\operatorname{atanh}\!\left(\frac{\sqrt{I_{th}}}{V(0)}\right). \tag{2.4}$$
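In code, the firing-time computation of equation 2.4, including the no-spike case $V(0) \le \sqrt{I_{th}}$, is only a few lines. The following C sketch is ours, with illustrative names:

```c
#include <math.h>

#define NO_SPIKE INFINITY

/* Firing time of the dimensionless QIF neuron (equation 2.4),
   in the limit Vpeak -> infinity. Returns NO_SPIKE if V0 is at
   or below the unstable fixed point sqrt(Ith). */
double qif_firing_time(double V0, double Ith)
{
    double s = sqrt(Ith);
    if (V0 <= s)
        return NO_SPIKE;        /* relaxes back to rest */
    return atanh(s / V0) / s;
}
```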
An event-driven scheme for a network of QIF neurons with instantaneous coupling can be easily implemented using equation 2.4. The generalization of the event-based simulation to more realistic synaptic currents is not trivial. A network of QIF neurons with exponential current follows the equations
$$\frac{dV}{dt} = V^2 - I_{th} + I_s, \tag{2.5}$$
$$\tau_s\frac{dI_s}{dt} = -I_s. \tag{2.6}$$
When a spike is received by a synapse, the current Is is instantaneously modified, Is → Is + w. Excitatory or inhibitory synapses are accounted for through the sign of w, provided that they have an identical time constant τs . System 2.5 to 2.6 is analytically solvable in intervals with no spike and V(t) is expressed using Bessel functions (see appendix A). Thus, we can calculate accurately the value of the membrane potential at any time. It is possible to speed up the simulation if neurons that will not spike are detected. We will derive a test that quickly checks whether a neuron
¹ Depending on the initial condition V(0), the constant c can be a complex number.
will spike. Depending on the initial conditions $(V(0), I_s(0))$, either a spike is emitted or the neuron goes back directly to its resting state. There is a curve $V^*(I_s)$ in the phase plane that defines the threshold curve; that is, the neuron spikes if and only if its state is above this curve. The threshold curve is given by the stable manifold of the unstable fixed point $(I_s, V) = (0, \sqrt{I_{th}})$ (see Figure 1A). When the spike test is positive, we need to compute the firing time. This computation can be done efficiently with standard root-finding methods. These methods require an initial guess. We define $V^+$ and $V^-$ that evolve according to the following differential equations,
$$\frac{dV^+}{dt} = V_{peak}|V^+| - I_{th} + I_s, \tag{2.7}$$
$$\frac{dV^-}{dt} = 2\varepsilon|V^-| - \varepsilon^2 - I_{th} + I_s, \tag{2.8}$$
where $I_s$ follows equation 2.6. Here, we have used piecewise linear bounds of the quadratic nonlinearity (see Figure 1B, left); $\varepsilon$ is an arbitrary parameter, and we choose $\varepsilon = 1$. By construction we have $\forall t \ge 0$, $V^-(t) \le V(t) \le V^+(t)$ (see equations 2.5, 2.7, and 2.8 and Figure 1B, right). Since $V^+(t)$ and $V^-(t)$ are the membrane potentials of (piecewise) linear integrate-and-fire models, we can calculate accurately the times of the threshold crossing (Brette, 2006). These times give an upper and a lower bound for the firing times of the QIF neuron that allow an efficient root finding. However, when the network activity is high, the spike test is most of the time positive, and this method becomes time-consuming. Moreover, in the high-activity regime, V(0) is near the threshold curve, and it is possible to derive a more accurate bound of the firing time. Another approximation as an initial guess is obtained using the bounding QIF,
$$\frac{dV^+}{dt} = (V^+)^2 - I_{th} + \max(0, I_s(0)), \tag{2.9}$$
which gives a lower bound for the firing time. Likewise, an upper bound can be obtained by using $\min(0, I_s(0))$ in equation 2.9. The firing time of the bounding QIF model, equation 2.9, can be calculated quickly and accurately using an equation similar to 2.4. The comparison with the lower bound given by the bounding LIF is shown in Figure 1C for different values of the synaptic current. We calculate the relative error $E = |t_f - t_b^f|/t_f$, where $t_f$ is the firing time of the QIF and $t_b^f$ is the approximation given by either the bounding LIF, equation 2.7, or the bounding QIF, equation 2.9, respectively. There exists a critical value $\tilde{V}_0(I_s(0))$ of the membrane potential such that for $V > \tilde{V}_0$, the bounding QIF model gives a better approximation of the firing time. Note that for positive values of $V_0$, the bounding QIF model is always better. This case is achieved when the average activity is high, that is, neurons in the network are most of the time near the threshold curve.
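To illustrate how the bounding QIF of equation 2.9 yields an initial guess, note that freezing $I_s$ at $I_s(0)$ leaves a QIF with an effective threshold current, so the closed form of equation 2.4 (or its tan-based analog in the oscillatory regime) applies. The C sketch below is ours; the explicit case analysis is our addition, and names are illustrative:

```c
#include <math.h>

#define NO_SPIKE INFINITY

/* Initial guess for the firing time from the bounding QIF of
   equation 2.9: freeze the synaptic current at Is(0), giving an
   effective threshold current Ieff = Ith - max(0, Is(0)), and reuse
   the closed form of equation 2.4 (limit Vpeak -> infinity). This
   is a lower bound on the true firing time. */
double bounding_qif_guess(double V0, double Ith, double Is0)
{
    double Ieff = Ith - fmax(0.0, Is0);
    if (Ieff > 0.0) {                    /* excitable regime */
        double s = sqrt(Ieff);
        if (V0 <= s) return NO_SPIKE;
        return atanh(s / V0) / s;
    }
    if (Ieff < 0.0) {                    /* oscillatory regime:
                                            V(t) = s*tan(s*t + atan(V0/s)) */
        double s = sqrt(-Ieff);
        return (M_PI / 2.0 - atan(V0 / s)) / s;
    }
    return (V0 > 0.0) ? 1.0 / V0 : NO_SPIKE;   /* marginal case Ieff = 0 */
}
```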
3 Numerical Results

To illustrate our method, we simulate a network of N identical QIF neurons receiving an external excitatory spike train modeled as a Poisson process with a constant rate of 10 kHz.² The QIF neurons are coupled all-to-all through GABAergic inhibition. This scenario has been used to reproduce the oscillatory synchronization observed in early olfactory systems (Martinez, 2005; Ambard & Martinez, 2006). Numerical values of the QIF neurons, equation 1.1, are taken from Ambard and Martinez (2006): C = 0.2 nF, $V_{rest}$ = −65 mV, $V_{th}$ = −60.68 mV, q = 0.00643 mS·V$^{-1}$, $I_{th}$ = 0.12 nA, $V_{peak}$ = 30 mV, and $V_{reset}$ = −70 mV. The synaptic time constant is $\tau_s$ = 6 ms, and the synaptic strength is w = 0.05/N for the inhibitory synapses and $w = 5 \times 10^{-5}$ for the excitatory Poissonian input.

We wrote our event-driven simulator in C++ based on the event-driven library MVASpike (Rochel & Martinez, 2003), available online at http://www.sop.inria.fr/odyssee/software/mvaspike. The Bessel functions needed for the exact computation of V(t) (see appendix A) were implemented using the GNU Scientific Library (http://www.gnu.org/software/gsl). The threshold curve defining the spike test was accurately fitted by a polynomial of degree three. Each time the spike test is positive, the firing time is computed by finding the root of $V(t) - V_{peak}$. This is accomplished by a few steps of a Newton-Raphson method starting from an initial guess obtained from the bounding QIF, equation 2.9. The precision achieved in the simulations is on the order of $10^{-7}$ ms.

We first simulate a network of N = 100 neurons. Initial conditions of the membrane potentials V(0) are taken randomly between $V_{reset}$ and $V_{peak}$ so that the firing times of uncoupled neurons are uniformly distributed. This way, the network starts in a completely desynchronized state. When inhibition is blocked, individual neurons fire at 380 Hz on average (see Figure 1D, left, for the spike raster of the first 100 ms of the simulation). The simulation time was about 20 min for 1 sec of biological time on a portable PC running Linux at 1.86 GHz. This is a reasonable time in regard to the large number of events encountered during the simulation (on the order of $10^6$). Moreover, we found that the spike test is positive 100% of the time, so that any event from the Poissonian input requires spike timing computation of the N QIF neurons. When lateral inhibition is taken into account, the spike test is positive in 50% of the cases, and neurons are synchronized with precise spiking activity at about 10 Hz (see Figure 1D, right). In such a case, all neurons fire almost at the same time.

² This mimics the interaction with 1000 excitatory presynaptic neurons firing at 10 Hz on average.
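The Newton-Raphson refinement described above can be sketched as follows; note that the time derivative of V comes for free from equation 2.5, so no numerical differentiation is needed. This is our illustration, not the MVASpike implementation, and the names are hypothetical:

```c
#include <math.h>

/* A few Newton-Raphson steps on f(t) = V(t) - Vpeak, starting from
   the bounding-QIF initial guess t0. V is the analytic membrane
   potential (e.g., the Bessel-function expression of appendix A);
   its derivative is taken from equation 2.5 with
   Is(t) = Is(0) * exp(-t / tau_s). */
double refine_firing_time(double (*V)(double t, void *params), void *params,
                          double t0, double Vpeak, double Ith,
                          double Is0, double tau_s, int steps)
{
    double t = t0;
    for (int i = 0; i < steps; i++) {
        double v  = V(t, params);
        double dv = v * v - Ith + Is0 * exp(-t / tau_s);  /* dV/dt */
        t -= (v - Vpeak) / dv;
    }
    return t;
}
```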
Rather than searching for the next firing time within the entire population of neurons at each occurrence of an event, it might be sufficient to perform the search among a random subset of k neurons. This leads to considerable speed-up as $k \ll N$, albeit at the cost of introducing errors in the simulation. A similar probabilistic speed-up has been used in kernel methods (Schölkopf & Smola, 2002). In appendix C, it is shown that the error probability is given by $0.95^k$ for any N.³ Therefore, a subset of 59 neurons is already sufficient to get a low error probability. For k = 59, the error probability equals 0.05, which is small enough not to perturb the synchronization, as shown in Figure 1E, left, for a network of N = 1000 neurons coupled all-to-all ($10^6$ synapses). For comparison, Figure 1E, right, shows the dynamics obtained by simulating the same network with k = 30 (error probability of 0.21).

³ Error probability is defined as the probability that the firing time computed with the probabilistic speed-up is greater than the 5% first firing times of the entire population of neurons.

Figure 1: Exact simulation of networks of quadratic integrate-and-fire neurons. (A) The $(I_s, V)$ phase plane for the QIF neuron with exponential synaptic currents. The V-nullcline is shown (bell-shaped curve). The stable manifold of the saddle-node point $(0, \sqrt{I_{th}})$ defines the threshold curve above which the neuron fires. Trajectories starting under the threshold curve tend toward the stable resting state $(0, -\sqrt{I_{th}})$. (B) Left: Piecewise linear sector bounds on the nonlinearity. The nonlinearity is shown by a solid line; the dashed lines show the piecewise linear sector bounds. Right: Bounding of the membrane potential of the quadratic integrate-and-fire neuron using piecewise linear integrate-and-fire models. Parameters are $V_{peak} = 3$, $I_{th} = 2$. (C) The relative error on the spike timing of the QIF as a function of the initial value of the membrane potential $V_0$, for different values of the initial synaptic current. Comparison is done between the approximations obtained with the bounding LIF (bLIF) and the bounding QIF (bQIF). Left: $I_s(0) = 5$. Middle: $I_s(0) = 7$. Right: $I_s(0) = 10$. Other parameters are those of B. (D) Event-driven simulations of a network of QIF neurons (N = 100 neurons). Left: Spike trains produced without inhibition. Right: Spike trains produced with lateral inhibition ($10^4$ synapses). (E) Event-driven simulations of a fully connected network of QIF neurons (N = 1000 neurons, $10^6$ synapses). Left: Simulation with probabilistic speed-up (k = 59). Right: Simulation with probabilistic speed-up (k = 30).

4 Discussion

Due to its simple dynamics, the behavior exhibited by the leaky integrate-and-fire model is limited. The QIF model has seen increasing interest in recent years, primarily because it reproduces the properties of detailed
conductance-based neurons near the threshold. At low firing rate, the response of any type I neuron is described by the QIF neuron. In particular, the QIF model reproduces the f-I curve of any type I neuron near the threshold. We have proposed a method to simulate the QIF neuron in an event-driven way. We have presented simulations of the quadratic integrate-and-fire model with exponential synaptic currents.

In the numerical simulation of neural networks, it is difficult to predict if errors on firing times will create numerical artifacts or will remain irrelevant. Event-driven methods are more precise than traditional time-stepping integration algorithms because the spike timings are calculated with an arbitrary desired precision. In general, implementation of an event-driven method requires nonlinear root-finding algorithms, for instance, the Newton-Raphson method, which converges exponentially. Thus, calculations are exact in the sense that the limit of the computer accuracy is reached with few iterations of the method. The error made with Euler or Runge-Kutta methods is fundamentally different: it is inherent to these approximate methods, and the precision decreases polynomially with the time step. However, when the number of events is large, event-driven schemes are time-consuming. This issue is implementation dependent and, in particular, depends on how the event queue is implemented. Another important limitation of exact simulation is that it cannot be applied to any model. Implementation of exact simulations is possible if the membrane equations are analytically solvable. Extensions of our event-driven scheme to other types of synaptic connections or spike-generating currents have to be studied case by case:
• A QIF neuron with exponential conductance-based synaptic currents follows
$$\frac{dV}{dt} = V^2 - I_{th} + g(V - E_s), \qquad \tau_s\frac{dg}{dt} = -g,$$
where g and $E_s$ are the synaptic conductance and the synaptic reversal potential, respectively. The membrane voltage is analytically solvable and can be written using Whittaker functions (see appendix B). Thus, an event-driven scheme, similar to the one proposed in this note, can be used to simulate exactly QIF neurons with synaptic conductances.
• The recently proposed exponential integrate-and-fire (EIF) model (Fourcaud-Trocmé et al., 2003; Brette & Gerstner, 2005) is given by
$$\frac{dV}{dt} = -(V + E) + e^V, \tag{4.1}$$
where V and t are normalized variables and $E = (V_T - E_l)/\Delta_T$, with $V_T$ a threshold voltage, $E_l$ the leak potential, and $\Delta_T$ the spike slope
factor. Like the QIF model, the EIF model has a soft threshold. However, the spike-generating current is no longer quadratic but exponential. It has been shown that the f-I curve of the EIF model matches the f-I curve of detailed conductance-based models for a range of input currents larger than for the QIF model. The EIF model with instantaneous current, equation 1.3, has an implicit solution, and the firing time is given by
$$t_f = \int_{V(0)}^{V_{peak}} \frac{du}{-(u + E) + e^u},$$
for V(0) greater than the unstable fixed point of equation 4.1. As for the QIF neuron, $V_{peak}$ controls the shape of the spikes. It is possible to use precalculated tables with an arbitrary precision for the integral for an event-driven simulation of the EIF model. We do not know if the EIF model with exponential currents is analytically tractable or if an implicit solution can be found.
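Since the integrand is available in closed form, the firing-time integral can also be evaluated by simple quadrature rather than precalculated tables. The midpoint-rule sketch below is ours, with illustrative names; n controls the accuracy:

```c
#include <math.h>

/* Firing time of the normalized EIF model (equation 4.1) by direct
   midpoint-rule quadrature of the implicit solution. V0 must exceed
   the unstable fixed point of -(V + E) + exp(V), so the integrand
   is positive on (V0, Vpeak). */
double eif_firing_time(double V0, double E, double Vpeak, int n)
{
    double h = (Vpeak - V0) / n;
    double tf = 0.0;
    for (int i = 0; i < n; i++) {
        double u = V0 + (i + 0.5) * h;      /* midpoint of subinterval */
        tf += h / (-(u + E) + exp(u));
    }
    return tf;
}
```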
Appendix A: The QIF with Exponential Synaptic Current

The QIF neuron with an exponential synaptic current can be rewritten as a nonlinear and nonautonomous differential equation,
$$\frac{dV}{dt} = V^2 - I_{th} + I_s(0)e^{-t/\tau_s}. \tag{A.1}$$
The first step is to transform equation A.1 into a linear ordinary differential equation. We use the change of variables
$$V(t) = -\frac{1}{y(t)}\frac{dy}{dt},$$
to obtain
$$\frac{d^2y}{dt^2} = \left(-I_{th} + I_s(0)e^{-t/\tau_s}\right)y(t). \tag{A.2}$$
We use the change of variables $u(t) = 2\tau_s\sqrt{I_s(0)}\,e^{-t/(2\tau_s)}$, which transforms the equation into
$$u^2\frac{d^2y}{du^2} + u\frac{dy}{du} + \left(u^2 - 4\tau_s^2 I_{th}\right)y = 0. \tag{A.3}$$
Let $J_\alpha(x)$ and $Y_\alpha(x)$ be the two independent solutions of the so-called Bessel equation,
$$u^2\frac{d^2y}{du^2} + u\frac{dy}{du} + (u^2 - \alpha^2)y = 0,$$
where $\alpha = -2\tau_s\sqrt{I_{th}}$. We have
$$y(u) = J_\alpha(u) + cY_\alpha(u),$$
where c is a constant defined by the initial condition. Using
$$\frac{dJ_\alpha(u)}{du} = -J_{\alpha+1}(u) + \frac{\alpha}{u}J_\alpha(u),$$
and a similar equation for $Y_\alpha$, we find that the membrane potential of the QIF neuron with exponential synaptic current, equations 2.5 and 2.6, is analytically solvable. The membrane potential is given by
$$V(t) = -\sqrt{I_{th}} - \frac{u(t)}{2\tau_s}\,\frac{J_{\alpha+1} + cY_{\alpha+1}}{J_\alpha + cY_\alpha},$$
where J and Y are Bessel functions evaluated at $u(t) = 2\tau_s\sqrt{I_s(0)}\,e^{-t/(2\tau_s)}$. The Bessel functions can be expressed as a series of gamma functions. Note that calculations with infinite series are not fundamentally different from calculating nonpolynomial functions (like the exponential or logarithmic functions). Numerically, only the first terms of the series are necessary to reach good precision. It is also possible to use precalculated tables to speed up the calculations.
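A minimal sketch (ours) of evaluating V(t) with the GSL routines mentioned in section 3; the function name is hypothetical, and we assume here that gsl_sf_bessel_Jnu and gsl_sf_bessel_Ynu accept the negative fractional order α used above, which may otherwise require the standard reflection formulas:

```c
#include <math.h>
#include <gsl/gsl_sf_bessel.h>

/* Membrane potential V(t) of the QIF with exponential synaptic
   current (appendix A). alpha = -2*tau_s*sqrt(Ith); c is fixed by
   the initial condition. Assumes the GSL fractional-order Bessel
   routines handle the negative order alpha. */
double qif_exp_current_V(double t, double c, double Ith,
                         double Is0, double tau_s)
{
    double alpha = -2.0 * tau_s * sqrt(Ith);
    double u = 2.0 * tau_s * sqrt(Is0) * exp(-t / (2.0 * tau_s));
    double num = gsl_sf_bessel_Jnu(alpha + 1.0, u)
               + c * gsl_sf_bessel_Ynu(alpha + 1.0, u);
    double den = gsl_sf_bessel_Jnu(alpha, u)
               + c * gsl_sf_bessel_Ynu(alpha, u);
    return -sqrt(Ith) - (u / (2.0 * tau_s)) * (num / den);
}
```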
Appendix B: The QIF with Exponential Synaptic Conductance

The analytical integration of the QIF with exponential synaptic conductance proceeds along the same lines as the integration of the QIF with exponential currents. The analytical expression of the membrane potential uses Whittaker functions instead of Bessel functions. More precisely, let $W_{\mu,\nu}(x)$ and $M_{\mu,\nu}(x)$, the Whittaker functions, be the solutions of the differential equation
$$\frac{d^2y}{dx^2} + \left(-\frac{1}{4} + \frac{\mu}{x} + \frac{1/4 - \nu^2}{x^2}\right)y = 0.$$
The membrane potential of the QIF model with exponential synaptic conductance is given by
$$V(t) = \frac{1}{cW_{\alpha,\beta} + M_{\alpha,\beta}}\Big(E_s M_{\alpha,\beta} + (\sqrt{I_{th}} - E_s)M_{\alpha-1,\beta} + \ldots + cE_s W_{\alpha,\beta} + c(\tau_s E_s^2 - I_{th})W_{\alpha-1,\beta}\Big),$$
where W and M are Whittaker functions evaluated at $u(t) = \tau_s I_s(0)e^{-t/\tau_s}$, $\alpha = 1/2 + \tau_s E_s$, $\beta = \tau_s\sqrt{I_{th}}$, and c is a constant defined by the initial condition. Whittaker functions have power-series expansions that could be used to enhance the efficiency of the calculations.

Appendix C: The Probabilistic Speed-Up

Advancing the event-driven simulation to the next firing time requires computing the firing times of the N neurons. This can be very time-consuming, as N is large. In case of synchronization, the neurons fire at the same time, and it might be sufficient to compute the firing times $t_i$, $i = 1, \ldots, k$, among a random subset of k neurons and advance the simulation to the next firing time $\min(t_i)$ obtained from this subset. This leads to a considerable speed-up as $k \ll N$, albeit at the cost of introducing errors in the simulation. We define the error probability $p_e$ as the probability that $\min(t_i)$ is greater than the 5% first firing times of the entire population of neurons. We then have $p_e = P(\min(t_i) > t^*)$, where $t^*$ is the maximum of the 5% first firing times of the N neurons. Considering the $t_i$'s as identically and independently distributed firing times, we have $p_e = (P(t_i > t^*))^k$. Moreover, $P(t_i < t^*) = 0.05$ because we have considered the 5% first firing neurons in the definition of $t^*$, and thus $p_e = 0.95^k$.
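As a quick check of the subset size used in section 3, solving $0.95^k \le p_e$ for the smallest integer k is immediate; a minimal sketch (ours):

```c
#include <math.h>
#include <stdio.h>

/* Smallest subset size k with error probability 0.95^k below a target. */
int main(void)
{
    double target = 0.05;                       /* desired p_e */
    int k = (int)ceil(log(target) / log(0.95)); /* = 59 */
    printf("k = %d, p_e = %.3f\n", k, pow(0.95, k));
    return 0;
}
```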
Received July 28, 2006; accepted December 6, 2006.
LETTER
Communicated by Frederic Theunissen
Temporal Coding of Time-Varying Stimuli

Maoz Shamir
[email protected]
Hearing Research Center and Center for BioDynamics, Department of Biomedical Engineering, and Department of Mathematics, Boston University, Boston, MA 02215, U.S.A.
Kamal Sen
[email protected]
H. Steven Colburn
[email protected]
Hearing Research Center and Center for BioDynamics, Department of Biomedical Engineering, Boston University, Boston, MA 02115, U.S.A.
Temporal structure is an inherent property of various sensory inputs and motor outputs of the brain. For example, auditory stimuli are defined by the sound waveform. Temporal structure is also an important feature of certain visual stimuli, for example, the image on the retina of a fly during flight. In many cases, this temporal structure of the stimulus is represented by a time-dependent neuronal activity that is locked to certain features of the stimulus. Here, we study the information capacity of the temporal code. In particular, we are interested in the following questions. First, how does the information content of the code depend on the observation time of the cell's response, and what is the effect of temporal noise correlations on this information capacity? Second, what is the effect on the information content of reading the code with a finite temporal resolution for the neural response? We address these questions in the framework of a statistical model for the neuronal temporal response to a time-varying stimulus in a two-alternative forced-choice paradigm. We show that the information content of the temporal response scales linearly with the overall time of the response, even in the presence of temporal noise correlations. More precisely, we find that positive temporal noise correlations have a scaling effect that decreases the information content. Nevertheless, the information content of the response continues to scale linearly with the observation time. We further show that finite temporal resolution is sufficient for obtaining most of the information from the cell's response. This finite timescale is related to the response properties of the cell.
Neural Computation 19, 3239–3261 (2007)
© 2007 Massachusetts Institute of Technology
1 Introduction

The neural response to repetitive presentations of the same stimulus can be highly variable. This variability limits the amount of information that can be extracted from the neural response about the stimulus identity. In the attempt to understand the neural code, it was common to assume that information is coded by the mean spike count and to study the information capacity of the response (Paradiso, 1988; Seung & Sompolinsky, 1993; Zohary, Shadlen, & Newsome, 1994; Salinas & Abbott, 1994; Abbott & Dayan, 1999; Sompolinsky, Yoon, Kang, & Shamir, 2001; Wu, Nakahara, & Amari, 2001; Averbeck, Latham, & Pouget, 2006). It is generally accepted that noise correlations have a drastic detrimental effect on information capacity (Zohary et al., 1994; Sompolinsky et al., 2001; Averbeck et al., 2006). In recent years there has been an increasing interest in temporal coding (see, e.g., Gütig & Sompolinsky, 2006; Nelken, Chechik, Mrsic-Flogel, King, & Schnupp, 2005; Thorpe, Delorme, & Van Rullen, 2001; Johansson & Birznieks, 2004). A recent important work by Osborne, Bialek, and Lisberger (2004) has extended the study to the temporal domain. Osborne et al. studied the dependence of the information content on the observation time in the responses of middle temporal visual area (MT) cells to external stimuli, and investigated the effect of temporal noise correlations. It was suggested that temporal noise correlations cause the information capacity of the cell response to saturate, thus yielding a finite amount of information, even in the limit of long observation times. Here we investigate theoretically information coding by the temporal structure of the first-order statistics of neural responses (i.e., by the stimulus-dependent peristimulus time histogram). We address two main questions: First, how much information can be extracted from the neural response about the stimulus identity? Specifically, we would like to know how the information content depends on the observation time and what the effect is of temporal noise correlations. Second, how can this information be read by a biologically plausible mechanism? We address these questions in the framework of a statistical model for the neuronal dynamic response to the stimulus. This framework is general and can be applied to sensory as well as motor systems. For concreteness, we focus on field L, the auditory cortex homologue in the songbird that is characterized by a rich temporal structure of the response of neurons that are sensitive to different features of the song (Sen, Theunissen, & Doupe, 2001). We use the simplest paradigm of two-interval two-alternative forced choice (2I2AFC). In the 2I2AFC paradigm, two birdsongs are presented in random order, and the system has to infer which song was first on the basis of the responses of auditory neurons. An upper bound on the amount of information that can be extracted from the cell response about the stimulus identity is given by the performance of the maximum likelihood (ML) readout (Thomas & Cover, 1991). Throughout this letter, we are interested in computing the ML performance. We shall term efficiency the negative of the logarithm of
the error rate. Note that efficiency approaches infinity as the error rate approaches zero. The outline of this letter is as follows. In section 2, we model the cell response by an inhomogeneous Poisson process (IHPP; see, e.g., Tuckwell, 1988; van Kampen, 1997), which is the basic starting point for research of this type. The IHPP allows us to study the information content of the cell response, assuming there are no temporal noise correlations in the response. The effect of temporal noise correlations is addressed in section 3, where we use the gaussian process as a model for the neural response. The gaussian process provides a higher level of abstraction for the cell's response. This level of abstraction is needed in order to capture the essential properties of a dynamic response with temporal noise correlations and yield a rigorous mathematical understanding. The problem of implementing a temporal-code readout in the central nervous system is the topic of section 4. Finally, in section 5, we summarize our results and discuss the relation of our findings to other studies of temporal coding and to known results from the theory of population coding.

2 The Inhomogeneous Poisson Process

Given a birdsong s_i(t) (i \in \{1, 2\}), we model the response of an auditory cell by a set of spike times \{t_i\}_{i=1}^{N} that are distributed according to an IHPP with a mean rate r_i(t). Following experimental findings in the auditory system (deCharms, Blake, & Merzenich, 1998; Theunissen, Sen, & Doupe, 2000), we approximate the IHPP rate by a threshold-linear function of the birdsong,

r(t|s_i) = \left[ \sum_{k=1}^{n_f} u_k * s_i(k, t) \right]_+ ,   (2.1)
where [x]_+ = x for x > 0 and 0 otherwise; the summation is over n_f frequency bands; and s_i(k, t) represents the time-varying stimulus strength in frequency band k at time t. Here u_k(t) is a causal filter that characterizes the dynamic response, or temporal receptive field, of the cell to a stimulus in the kth frequency band. The spectrotemporal receptive field (STRF) of the cell is defined by the entire set of filters u(t) \equiv \{u_k(t)\}_{k=1}^{n_f}. The exact details of the STRF model and birdsong data that were used to generate the numerical results shown in Figures 1, 4, and 5 have been described in Narayan, Ergun, and Sen (2005). The STRF of the cell, u(t), determines the auditory feature that the cell is selective for and also governs the speed by which the cell can respond to changes in the stimulus. In our simulations, receptive fields of auditory cells are characterized by excitatory and inhibitory regions of approximately 10 msec, which determines the timescale by which the cell
can follow changes in the stimulus. For a given stimulus, s_i, the probability density of a specific spike train \{t_j\}_{j=1}^{N} from time 0 to T is given by

P_i\left(\{t_j\}_{j=1}^{N}\right) = \frac{1}{N!}\, e^{-R_{s_i}} \prod_{j=1}^{N} r(t_j|s_i),   (2.2)
where

R_{s_i} = \int_0^T r(t|s_i)\, dt.   (2.3)
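To make the IHPP concrete, here is a minimal simulation sketch (not code from the letter); the rate profile below is a hypothetical stand-in for the thresholded STRF output of equation 2.1, and spike times are drawn by thinning a homogeneous Poisson process.

```python
import numpy as np

def sample_ihpp(rate, T, rate_max, rng):
    """Draw spike times from an inhomogeneous Poisson process on [0, T)
    by thinning: candidates arrive at the constant rate rate_max and are
    kept with probability rate(t) / rate_max."""
    n_cand = rng.poisson(rate_max * T)
    t_cand = rng.uniform(0.0, T, size=n_cand)
    keep = rng.uniform(0.0, 1.0, size=n_cand) < rate(t_cand) / rate_max
    return np.sort(t_cand[keep])

rng = np.random.default_rng(0)
# hypothetical rate in Hz, standing in for the STRF-filtered song
r1 = lambda t: np.maximum(20.0 + 15.0 * np.sin(2.0 * np.pi * 8.0 * t), 0.0)
spikes = sample_ihpp(r1, T=1.0, rate_max=40.0, rng=rng)
```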
We shall refer to T hereafter as the observation time. We shall denote by \{t_i^{(1)}\}_{i=1}^{N_1} the spike times of the cell during the presentation of the first stimulus and by \{t_i^{(2)}\}_{i=1}^{N_2} those of the second. The ML readout estimates the stimulus to be the one that results in the greatest probability for the cell response. Given the cell response in the two intervals, \{t_i^{(1)}\}_{i=1}^{N_1}, \{t_j^{(2)}\}_{j=1}^{N_2}, we define the log-likelihood ratio, L, to be the logarithm of the ratio of the probability of the response given that song s_1 was first to the probability of the response given that song s_2 was first:

L\left(\{t_i^{(1)}\}_{i=1}^{N_1}, \{t_j^{(2)}\}_{j=1}^{N_2}\right) = \log \frac{P_1\left(\{t_i^{(1)}\}_{i=1}^{N_1}\right) P_2\left(\{t_j^{(2)}\}_{j=1}^{N_2}\right)}{P_1\left(\{t_j^{(2)}\}_{j=1}^{N_2}\right) P_2\left(\{t_i^{(1)}\}_{i=1}^{N_1}\right)}.   (2.4)
If the a priori probabilities of the two alternatives are equal, the ML decision is determined according to the sign of L:

decide first song was \begin{cases} s_1 & L > 0 \\ s_2 & L < 0 \end{cases}.   (2.5)
Otherwise, for L = 0, the ML readout decides the first song was s_1 (or s_2) randomly with a probability of 0.5. The ML performance is dominated asymptotically by the Chernoff distance (Thomas & Cover, 1991; Kang & Sompolinsky, 2001; see also appendix A). The Chernoff distance, D_C, is a measure of the dissimilarity between the distribution of the neural response given one stimulus and that given the other. High D_C means the two stimuli elicit very different responses statistically; hence, it is easy to discriminate between the two alternatives, yielding a low error rate. Using the saddle point approximation
(Orszag & Bender, 1991), the ML probability of error is given by (see appendix A)

P_{ML}(error) = \frac{e^{-D_C}}{\sqrt{\frac{\pi}{2}\,\big|D_C^{(2)}\big|}},   (2.6)

where in the IHPP D_C^{(2)} = -2\int_0^T \sqrt{r(t|s_1)\, r(t|s_2)}\, \left(\log\frac{r(t|s_1)}{r(t|s_2)}\right)^2 dt. For a 2I2AFC in an IHPP model from time 0 to time T, the Chernoff distance is given by (see appendix B)

D_C = \int_0^T \left(\sqrt{r(t|s_1)} - \sqrt{r(t|s_2)}\right)^2 dt.   (2.7)
Note that the Chernoff distance has an intuitive interpretation as a summation over all times of the squared difference of signal-to-noise ratios of the responses to the two birdsongs, where the signal is given by the mean response, r(t|s_i)dt, and the noise by the standard deviation, \sqrt{r(t|s_i)dt}. Due to the additivity of the integral operation, both D_C and D_C^{(2)} scale linearly with the observation time, T. The corollary of the linear scaling of D_C is that the ML error rate will decay to 0 exponentially with the observation time T. Figure 1 shows the efficiency of the ML readout (open circles) as a function of the observation time. The solid line shows the Chernoff distance, and the dashed line presents the saddle point approximation to the ML efficiency: D_C + \frac{1}{2}\log\left(\frac{\pi}{2}\big|D_C^{(2)}\big|\right). At very short observation times, T \sim 5 ms, the cell does not yet respond to the song, and the ML performance is at the chance level of 50%, which corresponds to an efficiency of -\log 0.5 \approx 0.69. As the observation time, T, is increased, the ML performance improves, and the error rate converges toward zero. From Figure 1, one can see the linear scaling of the ML efficiency and Chernoff distance with the observation time, as predicted by theory. The linear scaling of the Chernoff distance with the observation time can be viewed as the outcome of the lack of noise correlations in the IHPP. Recent studies (Osborne et al., 2004) have suggested that in the presence of temporal noise correlations, the information content of the cell's response is limited, even in the limit of large observation times. The effect of temporal noise correlations on the information capacity of the neural dynamic response cannot be addressed in the framework of the IHPP, which lacks such correlations. In the next section, we address this issue using the framework of a gaussian process that enables us to incorporate temporal correlations into the dynamic response of the cell.
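As a numerical illustration of equations 2.6 and 2.7 (a sketch with assumed rate profiles, not the recorded responses behind the figures), both quantities can be evaluated by simple quadrature:

```python
import numpy as np

def chernoff_ihpp(r1, r2, T, n=100_000):
    """Chernoff distance of equation 2.7 and the second derivative D_C^(2)
    appearing in equation 2.6, by simple Riemann quadrature."""
    t = np.linspace(0.0, T, n)
    dt = t[1] - t[0]
    dC = np.sum((np.sqrt(r1(t)) - np.sqrt(r2(t))) ** 2) * dt
    d2 = -2.0 * np.sum(np.sqrt(r1(t) * r2(t)) * np.log(r1(t) / r2(t)) ** 2) * dt
    return dC, d2

# hypothetical rate profiles (Hz), standing in for the two song responses
r1 = lambda t: 20.0 + 15.0 * np.sin(2.0 * np.pi * 8.0 * t) ** 2
r2 = lambda t: 20.0 + 15.0 * np.cos(2.0 * np.pi * 8.0 * t) ** 2

dC, d2 = chernoff_ihpp(r1, r2, T=0.2)
p_err = np.exp(-dC) / np.sqrt(0.5 * np.pi * abs(d2))   # equation 2.6
print(dC, p_err)   # D_C grows linearly with T; the error decays exponentially
```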
Figure 1: The ML efficiency, the negative of the error rate logarithm, is plotted as a function of the observation time T in the IHPP (open circles). The solid line shows the Chernoff distance, and the saddle point approximation to the ML performance is shown by the dashed line. The ML error rate was estimated numerically by simulating the IHPP for 10^6 trials for each observation time, using a bin size of 2 ms. The mean responses of the cell, r_1(t) and r_2(t), were defined as in equation 2.1, where the cell's response and spectrotemporal receptive field used in Figures 1, 4, and 5 were taken from Narayan et al. (2005).
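The estimation procedure described in the caption can be reproduced in outline as follows (with hypothetical rates again). For binned Poisson counts, the log-likelihood ratio of equation 2.4 reduces to \sum_t \log(r_1(t)/r_2(t))\,[n^{(1)}(t) - n^{(2)}(t)], since the factorial and exponential factors cancel between the two orderings, and the decision then follows equation 2.5:

```python
import numpy as np

rng = np.random.default_rng(1)
dt = 0.002                               # 2 ms bins, as in the caption
T = 0.2
t = np.arange(0.0, T, dt)
r1 = 20.0 + 15.0 * np.sin(2.0 * np.pi * 8.0 * t) ** 2   # hypothetical rates (Hz)
r2 = 20.0 + 15.0 * np.cos(2.0 * np.pi * 8.0 * t) ** 2

def ml_error_rate(n_trials=100_000):
    """Monte Carlo estimate of the 2I2AFC ML error rate for binned Poisson
    counts when song 1 is presented first."""
    w = np.log(r1 / r2)                                   # readout weights
    n1 = rng.poisson(r1 * dt, size=(n_trials, t.size))    # interval 1: song 1
    n2 = rng.poisson(r2 * dt, size=(n_trials, t.size))    # interval 2: song 2
    L = (w * (n1 - n2)).sum(axis=1)                       # log-likelihood ratio
    # errors: L < 0, plus a fair coin on exact ties (L == 0)
    return (np.sum(L < 0) + 0.5 * np.sum(L == 0)) / n_trials

print(-np.log(ml_error_rate()))          # efficiency, to compare with D_C
```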
3 The Gaussian Process

Given a birdsong, s_i(t), that persists from time 0 to time T, we model the cell response by a gaussian random process, n(t). We denote by r_i(t) the mean response of the cell—the trial average of n(t), given birdsong s_i(t). Thus, in the gaussian process, the neural response is represented by a continuous dynamic variable rather than by spike times. This abstraction is done in order to provide a rigorous mathematical understanding of the effect of the temporal correlation. Yet one can intuitively think of the neural response, n(t), as either an approximation to the cell's spike counts in short time intervals or an underlying potential that governs the firing rate of the cell. An important quantity for information coding by the gaussian process is the difference in the mean responses to song 1 and to song 2, \Delta r(t) \equiv r_1(t) - r_2(t), or, more conveniently, the spectrum of \Delta r. For simplicity, we shall assume the spectrum of \Delta r is given by

|\Delta\tilde{r}(\omega)|^2 = \begin{cases} 2\pi\rho T & |\omega| \in (\omega_1, \omega_2) \\ 0 & \text{otherwise} \end{cases},   (3.1)
where \rho is the spectral density, T is the duration of the birdsong stimulus, and we have used the following Fourier conventions: \tilde{f}(\omega) = \int_{-\infty}^{\infty} dt\, e^{-i\omega t} f(t) and f(t) = \int_{-\infty}^{\infty} \frac{d\omega}{2\pi}\, e^{i\omega t} \tilde{f}(\omega). The upper cutoff, \omega_2, reflects the smallest timescale by which the cell can respond to changes in the external stimulus and characterizes the cell's dynamics. The lower cutoff, \omega_1, reflects the long-time structure in the stimulus that the cell's response follows; for example, syllables are of the order of 100 msec. The noise in this model is defined by the response covariance at different times. Specifically, we have used the following form for the covariance, C(t, t'), of the neural responses n(t) and n(t'), at times t and t':

C(t, t') = c(t - t') = \sigma^2 \delta(t - t') + c\, e^{-|t - t'|/\tau},   (3.2)
where \tau and c are the correlation time and correlation strength, respectively. Note that c > -\frac{1}{2\tau} to ensure that C(t, t') is positive definite. This is a combination of a white noise component and a low-pass noise component. Computing the ML error rate in this case, one obtains (see appendix C; also Thomas & Cover, 1991)

P_{ML}(error) = \frac{1}{2}\left[1 - \operatorname{erf}\left(\sqrt{D_C}\right)\right] \approx \frac{1}{2\sqrt{\pi D_C}}\, e^{-D_C},   (3.3)
where D_C is the Chernoff distance, which for the gaussian process in the 2I2AFC is given by (see appendix C; also Thomas & Cover, 1991)

D_C = \frac{1}{4} \iint_{-\infty}^{\infty} \Delta r(t)\, C^{-1}(t, t')\, \Delta r(t')\, dt\, dt',   (3.4)
where C^{-1}(t, t') is the inverse kernel of C(t, t'), that is, \int dt''\, C^{-1}(t, t'')\, C(t'', t') = \delta(t - t'). From equation 3.3, one obtains that the ML efficiency is asymptotically equal to the Chernoff distance,¹ as in

¹ Note that in order to avoid technical complications, we are using a slightly different normalization here from the one used in section 2. Namely, the underlying assumption of equation 3.4 is that the ML readout can use the entire neural response, n(t), from time -∞ to time +∞. However, since the stimulus lasted only from time 0 to time T, we expect the relevant observation time, that is, the time in which the average neural response will be different in the two alternatives, to be approximately equal to T. This is the reason for scaling the spectrum of the signal, |\Delta\tilde{r}(\omega)|^2, with T in equation 3.1. The utility of this normalization is that it allows us to use the Fourier spectral representation of the temporal covariance matrix in order to invert it. Alternatively, one could assume, as in section 2, that the stimulus is ongoing and that the ML readout may only use the neural response from time 0 to time T. In that case, one has to worry about boundary terms when inverting the temporal covariance matrix. These boundary terms obscure the main results of the analysis and cause difficulties in the calculation. Nevertheless, the difference in the results when using the two normalizations is only a finite boundary effect that depends on \tau and is not expected to scale with T for large T. Hence, to a leading order in T, the Chernoff distance will be the same for both normalizations.
the IHPP (see equation 2.6). The Chernoff distance, equation 3.4, is given by a squared signal-to-noise ratio, where the signal is the difference in the mean responses to the two songs, \Delta r = r_1 - r_2, and the noise is the covariance of the temporal response. In the limit of zero temporal correlations, c = 0, the Chernoff distance of the gaussian process reduces to

D_C^0 = \frac{1}{4} \int_{-\infty}^{\infty} \left(\frac{r_1(t)}{\sigma} - \frac{r_2(t)}{\sigma}\right)^2 dt = \frac{\rho T(\omega_2 - \omega_1)}{2\sigma^2}, \qquad (c = 0),   (3.5)
which is analogous, in its form, to the Chernoff distance in the IHPP, equation 2.7. In particular, we find that D_C^0 \propto T. In the presence of temporal correlations, c \neq 0, it is convenient to use the spectral representation of C in order to compute the Chernoff distance. Note that similar methods have been applied in the field of population coding for the computation of the population Fisher information (see, e.g., Sompolinsky et al., 2001, and Wu et al., 2001). Moreover, the expression for the Chernoff distance, equation 3.4, is similar to the expression for the Fisher information used in population coding. (For further discussion on the relation to population coding, see section 5.) Since the temporal correlations in this model, equation 3.2, depend only on relative times, C(t, t') = c(t - t'), the Fourier modes are its eigenfunctions. Thus, we obtain

D_C = \frac{1}{4} \int_{-\infty}^{\infty} \frac{d\omega}{2\pi}\, \frac{|\Delta\tilde{r}(\omega)|^2}{\tilde{c}(\omega)}.   (3.6)
Intuitively, one can view the result of equation 3.6 as a sum of the squared signal-to-noise ratio over all independent information channels, where the independent information channels are the Fourier modes, the squared signal is the spectrum of the difference in mean responses to the two songs, |\Delta\tilde{r}(\omega)|^2, and the squared noise is the spectrum of the temporal noise correlations, \tilde{c}(\omega). The noise spectrum is given by \tilde{c}(\omega) = \sigma^2\left(1 + \frac{2c\tau}{1 + (\omega\tau)^2}\right). Thus, in the presence of positive temporal correlations, c > 0, the amount of noise in each Fourier mode is increased relative to the c = 0 case. This increase is proportional to the correlation strength, c; decays to zero algebraically with \omega; and has a nonmonotonic behavior in \tau. In the limit of very short correlation time, \tau \to 0, the temporal correlation converges to the c = 0 case. In the limit of long correlation time, \tau \to \infty, there are significant noise correlations but only in the slowly varying Fourier modes, \omega = O(1/\tau). For a specific value of \omega, at large correlation times, \tau \gg 1/\omega, the noise at the \omega mode decays algebraically with \tau, (\tilde{c}(\omega) - \sigma^2) \propto 1/\tau. Using equation 3.1 for the spectrum of the signal, we obtain

D_C = D_C^0 \left[1 - \frac{2c\tau}{1 + 2c\tau}\, \frac{\arctan(\omega_2\hat{\tau}) - \arctan(\omega_1\hat{\tau})}{\hat{\tau}\,(\omega_2 - \omega_1)}\right],   (3.7)
Figure 2: The ratio of the Chernoff distance in the presence of temporal noise correlations and D_C^0 in the gaussian model is plotted as a function of the correlation time, \tau, for different values of the correlation strength, from top to bottom c = -0.05, 0, 0.2, 0.4, 0.6, 0.8, 1, with \omega_1 = 1 and \omega_2 = 2 (solid lines). The bottom dashed line shows the ratio D_C/D_C^0 for c = 1, \omega_1 = 0, and \omega_2 = 2.
where \hat{\tau} = \tau/\sqrt{1 + 2c\tau}. From equation 3.7, we obtain that the Chernoff distance scales linearly with the observation time, D_C \propto T, even in the presence of correlations. This result is in contrast to recent suggestions that information capacity saturates in the presence of correlations (see section 5). Nevertheless, the Chernoff distance is a decreasing function of the correlation strength c, reflecting the fact that the noise in each Fourier channel, \tilde{c}(\omega), is an increasing function of c. The nonmonotonic dependence of the Chernoff distance on the correlation time is shown in Figure 2 for different values of the correlation strength, c. In the limit of a very short correlation time, \tau \to 0, the temporal correlations converge to the c = 0 case and D_C \to D_C^0. In the limit of \tau \to \infty, the increase in the noise spectrum, \tilde{c}(\omega), is limited to |\omega| \in [0, O(1/\tau)); hence, as \tau approaches infinity, a smaller portion of the signal is affected by the correlated noise. In the case of \omega_1 > 0 (solid lines), for \tau \gg 1/\omega_1, the noise in the informative Fourier modes decays algebraically, and one obtains (1 - D_C/D_C^0) \propto \tau^{-1} for large \tau. In the case of \omega_1 = 0, where there is no lower cutoff in the spectrum of the signal (bottom dashed line in Figure 2), one obtains D_C \to D_C^0 in the limit of large correlation time \tau \to \infty, at a slower rate, (1 - D_C/D_C^0) \propto \tau^{-1/2}. In the intermediate ranges of \tau, the temporal noise correlations cause a decrease in the information content of the neural response relative to D_C^0. Nevertheless, the Chernoff distance continues to scale linearly with the observation time, D_C \propto T. The Chernoff distance provides the efficiency of an ideal observer. Can the central nervous system implement such a readout, and to what
accuracy? In the next section, this question is addressed, where it is shown that the ML readout can be implemented by a simple linear readout mechanism, and the effect of finite computational resources is studied in terms of limited temporal resolution of the cell response.

4 The Readout Problem

In the previous sections, we used the performance of the ideal observer, the ML readout, as a measure of the information content of the cell's temporal response. However, the ideal observer assumes knowledge of the temporal structure of the neural response with infinite temporal resolution. Biological readout mechanisms can have only finite temporal resolution. Furthermore, small jitter in the timing of the different syllables in the birdsong will cause the very fine structure of the cell response to become unreliable. Hence, an important question is how sensitive the ML efficiency is to finite temporal resolution of the response. We address this issue in the framework of a coarse-grained model of the IHPP of section 2.

4.1 The Coarse-Grained Poisson Model. We "coarse-grain" the neural response of the IHPP model (see section 2) by dividing the time interval [0, T) into nonoverlapping time bins of length \Delta and representing the neural response by the number of spikes in each bin. We shall denote by n^{(i)}(t_j) the number of spikes fired during [t_{j-1}, t_j) (where t_j = j\Delta) in interval i. The \{n^{(i)}(t_j)\} is a set of independent Poisson random variables with means

r_i(t_j) = \int_{t_{j-1}}^{t_j} r(t|s_i)\, dt.   (4.1)
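A minimal sketch of this coarse-graining step (the spike-time array below is a placeholder; a sample from the IHPP of section 2 could be substituted):

```python
import numpy as np

rng = np.random.default_rng(3)
spikes = np.sort(rng.uniform(0.0, 1.0, size=rng.poisson(30)))  # placeholder spikes

def coarse_grain(spikes, T, delta):
    """Bin spike times on [0, T) into counts n(t_j) for bins of width delta;
    equation 4.1 gives the corresponding Poisson means."""
    edges = np.linspace(0.0, T, int(round(T / delta)) + 1)
    counts, _ = np.histogram(spikes, bins=edges)
    return counts

n_fine = coarse_grain(spikes, T=1.0, delta=0.002)    # 2 ms resolution
n_coarse = coarse_grain(spikes, T=1.0, delta=0.020)  # 20 ms resolution
```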
4.2 The Linear Readout. The ML estimator can be written as a linear readout for discrimination, where the decision is based on the sign of a linear filter of the neural response:

decide first song was \begin{cases} s_1 & \text{if}\ \sum_t w(t)\left[n^{(1)}(t) - n^{(2)}(t)\right] > 0 \\ s_2 & \text{if}\ \sum_t w(t)\left[n^{(1)}(t) - n^{(2)}(t)\right] < 0 \end{cases},   (4.2)

where w(t) = \log\frac{r_1(t)}{r_2(t)} and the sum is over t = \Delta, 2\Delta, \ldots, T; recall that n^{(i)}(t) is the spike count in a time interval of length \Delta and r_i(t) is its mean. How can the central nervous system implement a temporal-code readout? It is generally accepted that the central nervous system can implement a linear readout (see, e.g., Salinas & Abbott, 1994; Georgopoulos, Kalaska, Caminiti, & Massey, 1982; Georgopoulos, Schwartz, & Kettner, 1986). The unique feature of the above linear readout is that the synaptic weights, w(t), change with time. This nonbiological feature can be overcome by
Figure 3: Delay line architecture of a linear readout for temporal codes. The stimulus at different values of delays is input to a layer of linear feature detectors that convolve their temporal receptive field, U, with the delayed stimulus. The second layer of readout responds to a weighted sum of the array of delayed feature detectors.
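In code, the delay line architecture of Figure 3 reduces to one fixed weight per delay; a small sketch of the resulting decision rule of equation 4.2, with made-up binned mean counts:

```python
import numpy as np

def linear_readout(n1, n2, w, rng):
    """Linear temporal-code readout of equation 4.2: one fixed weight per
    delay line. n1, n2 are the binned spike counts of the two intervals."""
    score = np.dot(w, n1 - n2)
    if score > 0:
        return 1
    if score < 0:
        return 2
    return int(rng.integers(1, 3))       # tie: decide at random

# weights from hypothetical binned mean counts of the two songs
r1_bar = np.array([0.04, 0.08, 0.02, 0.06])
r2_bar = np.array([0.06, 0.02, 0.08, 0.04])
w = np.log(r1_bar / r2_bar)
rng = np.random.default_rng(4)
print(linear_readout(np.array([0, 1, 0, 1]), np.array([1, 0, 1, 0]), w, rng))
```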
Figure 4: The ML efficiency, in terms of the negative of the error rate logarithm, is plotted as a function of the observation time, T, in the coarse-grained Poisson model for different values of the temporal resolution \Delta, as indicated. The dashed lines show the saddle point approximation to the ML performance. The ML error rate was estimated numerically by simulating the IHPP for 10^6 trials for each observation time.
using a delay line architecture, as depicted in Figure 3, that maps time into space. Note that the delayed responses can result from various mechanisms, ranging from physical delay of conduction through different time constants to persistence. The study of the specific readout mechanism that may be used in the central nervous system is beyond the scope of this letter. The effect of temporal resolution on the efficiency of the readout is shown in Figure 4, where the ML efficiency, in terms of the negative of the log error
Figure 5: The Chernoff distance of the coarse-grained response as a function of the temporal resolution, \Delta.
rate, is plotted as a function of the observation time, T, for different values of the temporal resolution, \Delta, from top to bottom: \Delta = 2, 10, 20 msec. As can be seen from the figure, the efficiency continues to scale linearly with the observation time, T. However, the slope of the efficiency curve is a decreasing function of \Delta and is determined by the Chernoff distance of the coarse-grained response. Figure 5 shows the effect of changing the temporal resolution, \Delta, on the Chernoff distance. In the limit of very coarse temporal representation, \Delta \to \infty, D_C saturates to the rate-model limit. As \Delta decreases, D_C increases and saturates at the continuous limit, \Delta = 0, to its value in the IHPP. This limit results from the fact that once the firing rates are constant over the time bins, further partitioning of the time bins to a finer temporal resolution cannot increase the Chernoff distance. Hence, the timescale for saturation at the \Delta \to 0 limit is dominated by the timescale on which the neural response can change. Assuming the external stimulus may contain very fast changes, this small timescale is dominated by the response properties of the cell, namely, the STRF, which are characterized by a small timescale of approximately 10 msec in our simulations. This timescale determines the temporal resolution needed by the readout in order to extract most of the information from the cells' response. At the other end, in the limit of \Delta \to \infty, the timescale of saturation in the rate model is dominated by the temporal structure of the stimulus. The identity of the birdsong is defined by syllables that are 50 to 200 msec long and are spaced with gaps of about 50 msec. Once the temporal resolution of the readout is coarser than these timescales, the structure of the
syllables cannot be inferred from the neural response, and the performance is reduced to that of a rate model.

5 Summary and Discussion

5.1 Temporal Coding in the IHPP Model. We have used the performance of the ideal observer, the ML readout, as a general bound on the amount of information that can be extracted from the dynamic response of a nerve cell. In the two-alternative forced-choice paradigm, this performance is asymptotically governed by the Chernoff distance, which measures the dissimilarity between the distributions of the neural response in the two alternatives. More precisely, the logarithm of the error rate is equal to the negative of the Chernoff distance, to a leading (linear) order in the observation time (see equation 2.6). This result can be extended to a more general paradigm of multiple alternative forced choice (see Kang & Sompolinsky, 2001). The Chernoff distance was studied in the framework of an IHPP model for the cell response, where it was shown that the Chernoff distance scales linearly with the observation time (see equation 2.7 and Figure 1). It was further shown that a linear readout can extract this information from the cells' response, equation 4.2, and that a finite temporal resolution is sufficient in order to extract most of this information (see Figure 5). This finite temporal resolution is related to the timescale on which the cell can modify its response when following the external stimulus.

5.2 The Effect of Temporal Noise Correlations on the Information Capacity of the Dynamic Response. The effect of temporal noise correlations was studied in the framework of a gaussian model for the cells' dynamic response. This higher level of abstraction is essential for obtaining analytical understanding of the effect of temporal noise correlations on the information content of the dynamic response. The gaussian model can be extended to a doubly stochastic process that incorporates spiking activity with temporal correlations. One possibility is to implement an IHPP with a rate that is given by the gaussian model. From the information processing inequality, the information content of the doubly stochastic process can only decrease relative to that of the gaussian model. Nevertheless, since the Poisson noise is local in time, we do not expect qualitatively different results. On the other hand, using the doubly stochastic process, one loses the ability to obtain complete analytical understanding. We find that the information content of the cell response, in terms of the Chernoff distance, continues to scale linearly with the observation time, even in the presence of temporal noise correlations (see equations 3.5 to 3.7). In a recent work, Osborne et al. (2004) studied the mutual information (MI; see Thomas & Cover, 1991) of the stimulus and the entire spike count of the cell as a function of the observation time. The MI is expected to saturate to the stimulus entropy with a difference that is exponential with the negative
of the Chernoff distance (see Kang & Sompolinsky, 2001). Osborne et al. have shown positive temporal correlations in the cell's dynamic response. Studying the effect of temporal correlations, Osborne et al. compared the MI computed from their data to the MI computed from the shuffled data that contain no correlations with a timescale greater than the bin size used for shuffling. In their work, they show an initial fast increase in the MI as a function of T, resulting from both a strong onset response and a faster response to stimuli that are closer to the preferred stimulus of the cell. After the strong initial increase of the MI with T, the MI continues to increase at a slower rate, until the stimulus ends. During this period, the rate of increase of the MI in the shuffled data is greater than in the real data. Nevertheless, the MI of the real data continues to increase with the observation time as long as the stimulus lasts. Osborne et al. have interpreted these results to suggest that temporal noise correlations cause the saturation of information content of the dynamic response, similar to the effect of noise correlations in population coding as reported in Zohary et al. (1994). The analytical work in this letter shows that the observed saturation is incomplete and that the information continues to increase with the observation time, albeit with a decreased slope (see equation 3.7 with c > 0). Note, however, some differences between our analysis and the study of Osborne et al. In particular, Osborne et al. did not use an ideal observer, which takes into account the temporal structure of the neural response, and in this work, we have studied the response of a cell to complex natural stimuli rather than to a constant input. It has also been shown that negative temporal correlations, induced by refractoriness, increase the information capacity of the neural response (Chacron, Longtin, & Maler, 2001). These results are consistent with our analysis, equation 3.7, in the case of c < 0, in which the information content of a negatively correlated response is increased relative to the information content in the temporally uncorrelated case, D_C^0 (cf. Figure 2).

5.3 Comparison with the Theory of Population Coding. In this work, we have studied the distributed coding of information across time and the effect of temporal noise correlations. It is interesting to compare our results with known results from the theory of population coding. Population coding theory addresses questions relating to the distributed coding of information across space, that is, by the responses of many cells in a population. A central question that the theory addresses is how the information content of the population response depends on the population size. If the population response has no spatial correlations—the responses of different cells are statistically independent—then the information content of the population will scale linearly with the number of independent degrees of freedom in the system—with the number of cells in the population (Paradiso, 1988; Seung & Sompolinsky, 1993). However, in the presence of spatial correlations—correlations in the noisy responses of different cells—the
information content of the population saturates to a finite value, even in the limit of very large populations (Zohary et al., 1994; Sompolinsky et al., 2001; Averbeck et al., 2006).² Hence, the presence of spatial correlations limits the total number of independent degrees of freedom in the system to a finite value that is independent of the population size. How can we understand this discrepancy? In population coding, the biologically relevant parameter regime is such that the response fluctuation of every cell is significantly correlated with the response fluctuations of a finite fraction of the entire population (see the discussion in Sompolinsky et al., 2001). Hence, when we add a neuron to the population of cells that is used by the readout mechanism, we add only redundant information, and the information content of the entire population response is unchanged. In contrast, temporal noise correlations are characterized by a finite correlation time \tau. Intuitively, for finite correlation times, \tau < \infty, one can think of the neural response at nonoverlapping time intervals of length > \tau as effectively independent. Thus, increasing the observation time adds independent information. The effect of temporal noise correlations is to scale down the rate by which the information content of the dynamic response increases with the observation time (see Figure 2). A corollary is that the linear scaling of the Chernoff distance with the observation time, T, is not specific to the choice of temporal correlations used in this letter, namely, equation 3.2, and holds for other types of temporal correlations, such as power law correlations, as long as the spectrum of the noise correlations, \tilde{c}(\omega), is finite. This result can also be obtained by studying the Chernoff distance in the spectral representation, equation 3.6. Essentially, only the signal, |\Delta\tilde{r}(\omega)|^2, is scaled by the observation time, T, which represents the stimulus duration. The temporal noise correlations, being an internal property of the cell dynamics, do not change with the duration of the stimulus T; hence, \tilde{c}(\omega) is independent of T and D_C \propto T. In the special case of \tau = \infty, only a single mode in the noise spectrum, the uniform mode \tilde{c}(0), has a power that scales with the observation time, T. The linear scaling of the Chernoff distance in this case is due to the rest of the spectrum, in which the noise is independent of T. Note, however, that one may consider situations in which the temporal correlations do depend on the external stimulus. In that case, the higher-order statistics of the cells' response may also carry information about the stimulus identity. These effects have been studied in the context of population coding (Shamir & Sompolinsky, 2002, 2004; Wu, Amari, & Nakahara, 2004). However, studying the effects of stimulus-dependent temporal correlations on the information capacity of the cells' dynamic response is beyond the scope of this letter and will be addressed elsewhere.

² Note that a different result has been obtained by Abbott and Dayan (1999). See the discussion in Sompolinsky et al. (2001) on the origin of this difference and the choice of biologically plausible parameter regime in population coding.
Appendix A: The Chernoff Distance and the ML Efficiency

A.1 General Definition. In order for us to define the Chernoff distance, D_C, we first define a quantity that we shall denote by D_\alpha. D_\alpha is a measure of dissimilarity between two probability distributions and is related to the Rényi distance (Thomas & Cover, 1991). Given two probability distributions, p_1(x) and p_2(x), of a random variable x, D_\alpha is defined as

D_\alpha(p_1, p_2) = -\log\left( \sum_x p_1^{1-\alpha}(x)\, p_2^{\alpha}(x) \right),   (A.1)
where the sum is over all values of x. In the case of a continuous random variable, the summation is replaced with integration over all values of x. One can show that, as a function of \alpha, D_\alpha is convex; is zero for \alpha = 0 and for \alpha = 1; is nonnegative for \alpha \in (0, 1), and zero if and only if p_1 = p_2 almost always; and obeys the following symmetry: D_\alpha(p_1, p_2) = D_{1-\alpha}(p_2, p_1). The Chernoff distance is given by the value of D_\alpha at its unique maximum:

D_C \equiv \max_\alpha \{D_\alpha\}.   (A.2)
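For two discrete distributions, D_\alpha and its maximum are straightforward to evaluate; a small sketch in which a grid search over \alpha stands in for an exact maximization:

```python
import numpy as np

def d_alpha(p1, p2, alpha):
    """Renyi-type dissimilarity of equation A.1 for discrete distributions."""
    return -np.log(np.sum(p1 ** (1.0 - alpha) * p2 ** alpha))

def chernoff(p1, p2, n_grid=999):
    """Chernoff distance, equation A.2: maximize D_alpha over alpha in (0, 1)."""
    alphas = np.linspace(1e-3, 1.0 - 1e-3, n_grid)
    return max(d_alpha(p1, p2, a) for a in alphas)

p1 = np.array([0.7, 0.2, 0.1])           # made-up distributions
p2 = np.array([0.2, 0.3, 0.5])
print(chernoff(p1, p2))                   # single-interval D_C
print(2 * d_alpha(p1, p2, 0.5))           # 2I2AFC value, equation A.4
```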
A.2 The Chernoff Distance of a 2I2AFC Task. In a 2I2AFC paradigm, the random variable, X, is the responses x_1 and x_2 in the two intervals, X = (x_1, x_2). The probability distribution of the two alternatives is given by P_1(X) = p_1(x_1) p_2(x_2) and P_2(X) = p_2(x_1) p_1(x_2), where p_i(x) is the probability distribution of the response, x, in a single interval under condition i \in \{1, 2\}. Substituting the above into the definition of D_\alpha, equation A.1, we obtain

D_\alpha(P_1, P_2) = -\log \sum_{x_1, x_2} p_1^{1-\alpha}(x_1)\, p_2^{\alpha}(x_1)\, p_2^{1-\alpha}(x_2)\, p_1^{\alpha}(x_2) = D_\alpha(p_1, p_2) + D_{1-\alpha}(p_1, p_2).   (A.3)
Hence, from symmetry, we obtain that the maximal value of D_\alpha(P_1, P_2) is obtained for \alpha = 1/2, and hence the Chernoff distance is

D_C(P_1, P_2) = 2\, D_\alpha(p_1, p_2)\big|_{\alpha = 1/2}.   (A.4)
A.3 The ML Efficiency in a 2AFC Task. Let x be a random variable with two possible probability distributions, p1 (x) and p2 (x) in conditions
1 and 2, respectively. The log-likelihood ratio of the two probabilities for a specific value of x is

L(x) = \log \frac{p_1(x)}{p_2(x)}.   (A.5)
The probability distribution of L, given condition 1, is

P(L) = \sum_x p_1(x)\, \delta\!\left(L - \log\frac{p_1(x)}{p_2(x)}\right) = \int_{-\infty}^{\infty} \frac{dq}{2\pi}\, e^{-iqL} \sum_x p_1(x)\, e^{iq\log\frac{p_1(x)}{p_2(x)}} = \int_{-\infty}^{\infty} \frac{dq}{2\pi}\, e^{-iqL} \sum_x p_1^{1+iq}(x)\, p_2^{-iq}(x) = \int_{-\infty}^{\infty} \frac{dq}{2\pi}\, \exp\left(-iqL - D_{-iq}\right),   (A.6)

where we have used the Fourier representation of the Dirac delta function, and D_{-iq} is D_\alpha with \alpha = -iq. The ML readout estimates the condition according to the sign of L. In the case of positive L, the ML decides condition 1, and for negative values of L, it decides condition 2. The probability of error, given condition 1, is the probability that L is negative,

P_{ML}(error) = \int_{-\infty}^{0} P(L)\, dL = \int_{-\infty}^{\infty} \frac{dq}{2\pi}\, e^{-D_{-iq}} \int_{-\infty}^{0} e^{-iqL}\, dL.   (A.7)
In order for the last integral on the right-hand side of equation A.7 to converge, we use the following trick, which is standard in physics: we add a small imaginary part, i\varepsilon, to q, q \to q + i\varepsilon, and take the limit \varepsilon \to 0+. We can now integrate over L in equation A.7, yielding

P_{ML}(error) = \lim_{\varepsilon \to 0^+} \int_{-\infty}^{\infty} \frac{dq}{2\pi}\, \frac{e^{-D_{-iq}}}{-iq + \varepsilon} = \lim_{\varepsilon \to 0^+} \int_{-i\infty}^{i\infty} \frac{d\alpha}{2\pi i}\, \frac{e^{-D_\alpha}}{\alpha + \varepsilon},   (A.8)
where we have rotated the integration to the imaginary axis by a change of variables, \alpha = -iq. Performing the integral over \alpha in equation A.8 may not be a trivial task. However, in the case where D_\alpha scales linearly with some large parameter, such as the observation time, T, we can use an asymptotic expansion in the large parameter to evaluate the ML error rate by applying the saddle point method (Orszag & Bender, 1991),

P_{ML}(error) = \frac{e^{-D_C}}{\alpha^* \sqrt{2\pi \big|D_C^{(2)}\big|}},   (A.9)
D_C^{(2)} = \frac{d^2 D_\alpha}{d\alpha^2}\bigg|_{\alpha = \alpha^*},   (A.10)
where \alpha^* is the value of \alpha at which D_\alpha is maximal. Note that in the saddle point approximation, we have used the freedom to deform the contour of integration in order to pass via the saddle point at \alpha^* \in (0, 1). We were able to do this since in the limit of \varepsilon \to 0+, the singularity is always on the negative part of the real axis, while the saddle point is for \alpha real and positive. In contrast, when computing the probability of a correct discrimination, the integral over L in equation A.7 should be performed from 0 to +\infty. In that case, in order for the integral to converge, one needs to add an imaginary part to q, q \to q + i\varepsilon, and take the limit \varepsilon \to 0-. Hence, the contour of integration cannot be deformed in that case to pass via the saddle point. Using the fact that the residue is 1, in the limit of \varepsilon \to 0, one can obtain the trivial result P_{ML}(correct) = 1 - P_{ML}(error).

Appendix B: The Chernoff Distance of an IHPP

In the IHPP, the random variable is the vector of spike times, x = \{t_i\}_{i=1}^{N}, and the two distributions, p_1 and p_2, are given by equation 2.2 with rates r(t|s_1) and r(t|s_2), respectively. Calculating D_\alpha, we obtain

D_\alpha = -\log\left[ \sum_{N=0}^{\infty} \frac{1}{N!}\, e^{-(1-\alpha)R_1 - \alpha R_2} \prod_{i=1}^{N} \int_0^{\infty} dt_i\, r^{1-\alpha}(t_i|s_1)\, r^{\alpha}(t_i|s_2) \right]
= (1-\alpha)R_1 + \alpha R_2 - \log\left[ \sum_{N=0}^{\infty} \frac{1}{N!} \left( \int_0^{\infty} dt\, r^{1-\alpha}(t|s_1)\, r^{\alpha}(t|s_2) \right)^N \right]
= (1-\alpha)R_1 + \alpha R_2 - \int_0^{\infty} dt\, r^{1-\alpha}(t|s_1)\, r^{\alpha}(t|s_2).   (B.1)
Using the result of equation A.4 for the 2I2AFC task, we obtain the Chernoff distance in this case:

D_C = \int_0^T \left( r^{1/2}(t|s_1) - r^{1/2}(t|s_2) \right)^2 dt,   (B.2)
which is equivalent to equation 2.7. The second derivative of D_\alpha with respect to \alpha, in the 2I2AFC task at the saddle point \alpha^* = 1/2, is

D_C^{(2)} = -2 \int_0^T \sqrt{r(t|s_1)\, r(t|s_2)}\, \left( \log \frac{r(t|s_1)}{r(t|s_2)} \right)^2 dt.   (B.3)
Appendix C: The Chernoff Distance of the Gaussian Process

C.1 The Chernoff Distance of a Multivariate Gaussian Random Variable. Let x be an N-dimensional gaussian random variable with two possible distributions:

p_i(x) = \sqrt{\frac{|C_i^{-1}|}{(2\pi)^N}}\, \exp\left( -\frac{1}{2} (x - \mu_i)^T C_i^{-1} (x - \mu_i) \right), \quad i \in \{1, 2\},   (C.1)
where \mu_i and C_i are the mean and covariance of x under the distribution p_i(x); x^T is the transpose of the vector x; and |C| denotes the determinant of the covariance matrix C. Applying the definition of D_\alpha, equation A.1, to the above gaussian distributions (placing the origin midway between the means, so that \mu_1 = \Delta\mu/2 and \mu_2 = -\Delta\mu/2), we obtain

D_\alpha = -\log\left[ |C_1^{-1}|^{\frac{1-\alpha}{2}}\, |C_2^{-1}|^{\frac{\alpha}{2}} \int \frac{dx}{(2\pi)^{N/2}}\, \exp(-g(x)) \right]   (C.2)

g(x) = \frac{1-\alpha}{2}\left(x - \frac{\Delta\mu}{2}\right)^T C_1^{-1} \left(x - \frac{\Delta\mu}{2}\right) + \frac{\alpha}{2}\left(x + \frac{\Delta\mu}{2}\right)^T C_2^{-1} \left(x + \frac{\Delta\mu}{2}\right)
= \frac{1}{2} x^T \left[ (1-\alpha) C_1^{-1} + \alpha C_2^{-1} \right] x - x^T \left[ \frac{1-\alpha}{2} C_1^{-1} - \frac{\alpha}{2} C_2^{-1} \right] \Delta\mu + \Delta\mu^T \left[ \frac{1-\alpha}{8} C_1^{-1} + \frac{\alpha}{8} C_2^{-1} \right] \Delta\mu,   (C.3)

where \Delta\mu \equiv \mu_1 - \mu_2. Using the identity \int \frac{dx}{(2\pi)^{N/2}}\, e^{-\frac{1}{2} x^T A x + x^T b} = \frac{e^{\frac{1}{2} b^T A^{-1} b}}{|A|^{1/2}}, we obtain from integrating over x in equation C.2, after some algebra,

D_\alpha = \frac{1}{2}\alpha(1-\alpha)\, \Delta\mu^T \left[ \alpha C_1 + (1-\alpha) C_2 \right]^{-1} \Delta\mu + \frac{1}{2} \log\left| (1-\alpha) C_1^{-1} + \alpha C_2^{-1} \right| - \frac{1-\alpha}{2} \log|C_1^{-1}| - \frac{\alpha}{2} \log|C_2^{-1}|.   (C.4)
C.1.1 The Case of C_1 = C_2. In the case of C_1 = C_2 \equiv C, one obtains

D_\alpha = \frac{1}{2}\alpha(1-\alpha)\, \Delta\mu^T C^{-1} \Delta\mu,   (C.5)
which is maximal for \alpha^* = 1/2, yielding

D_C = \frac{1}{8}\, \Delta\mu^T C^{-1} \Delta\mu.   (C.6)
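Equation C.6 and the error rate it implies (equation C.7 below) are easy to check numerically; a sketch with an arbitrary positive-definite covariance, all values made up for illustration:

```python
import numpy as np
from scipy.special import erf

rng = np.random.default_rng(2)
N = 4
A = rng.standard_normal((N, N))
C = A @ A.T + N * np.eye(N)             # arbitrary positive-definite covariance
dmu = rng.standard_normal(N)            # difference of means, Delta mu

d_C = 0.125 * dmu @ np.linalg.solve(C, dmu)   # equation C.6
p_err = 0.5 * (1.0 - erf(np.sqrt(d_C)))       # ML error rate (equation C.7)

# Monte Carlo check: sample condition 1 (mean +dmu/2) and apply the ML rule;
# with the means centered at the origin, the log-likelihood ratio is x.C^{-1}.dmu
x = rng.multivariate_normal(mean=dmu / 2.0, cov=C, size=200_000)
L = x @ np.linalg.solve(C, dmu)
print(p_err, np.mean(L < 0.0))          # the two numbers should agree closely
```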
The second derivative of D_\alpha with respect to \alpha, at the saddle point \alpha^* = 1/2, is D_C^{(2)} = -8 D_C. Furthermore, in the case of C_1 = C_2, the log-likelihood ratio, L, is a gaussian random variable with mean 4D_C and variance 8D_C. Hence, the probability of ML error in discrimination is given by

P_{ML}(error) = \int_{-\infty}^{0} \frac{dL}{\sqrt{16\pi D_C}}\, e^{-\frac{(L - 4D_C)^2}{16 D_C}} = \frac{1}{2}\left[ 1 - \operatorname{erf}\left( \sqrt{D_C} \right) \right] \approx \frac{1}{2\sqrt{\pi D_C}}\, e^{-D_C},   (C.7)
as in equation 3.3, which is also in line with the asymptotic result of equation A.9. In this case, one also obtains from equation C.5 that D_{1-\alpha} = D_\alpha; hence, in the 2I2AFC task, D_\alpha, D_C, and D_C^{(2)} are simply multiplied by a factor of 2.

C.2 The Chernoff Distance of the Continuous Gaussian Process. Let n(t) be a family of gaussian random variables with two probability density distributions, as defined in section 3. Namely, in condition i (i \in \{1, 2\}), the mean of n(t) is r_i(t), and the covariance of n(t) and n(t'), C(t, t') = c(t - t'), is given by equation 3.2. The Chernoff distance between the distributions of n in the two conditions in the continuous limit is given by

D_C = \frac{1}{8} \iint dt\, dt'\, \Delta r(t)\, C^{-1}(t, t')\, \Delta r(t'),   (C.8)
where \Delta r \equiv r_1 - r_2, as defined in section 3, and C^{-1}(t, t') is the inverse kernel of C(t, t'), defined by the requirement

\int dt''\, C^{-1}(t, t'')\, C(t'', t') = \delta(t - t').   (C.9)
In order to invert C(t, t') and compute the Chernoff distance, it is convenient to use the spectral representation of C(t, t'). Since C(t, t') = c(t - t'), its eigenfunctions are the Fourier modes. Using the Fourier conventions as defined in section 3, we obtain the spectral representation of C,

C(t, t') = \int_{-\infty}^{\infty} \frac{d\omega}{2\pi}\, \tilde{c}(\omega)\, e^{i\omega(t - t')}   (C.10)

\tilde{c}(\omega) \equiv \int_{-\infty}^{\infty} dt\, e^{-i\omega t}\, c(t) = \sigma^2 \left( 1 + 2c\tau\, \frac{1}{1 + (\omega\tau)^2} \right),   (C.11)
where we have used the specific form of C, equation 3.2, to obtain the last equality on the right-hand side of equation C.11. The inverse of C is given by

C^{-1}(t, t') = \int_{-\infty}^{\infty} \frac{d\omega}{2\pi}\, \frac{1}{\tilde{c}(\omega)}\, e^{i\omega(t - t')},   (C.12)
where one can easily verify that equation C.9 holds with the inverse kernel as defined in equation C.12. Using the spectral representation for the calculation of the Chernoff distance yields

D_C = \frac{1}{8} \int_{-\infty}^{\infty} \frac{d\omega}{2\pi}\, \frac{|\Delta\tilde{r}(\omega)|^2}{\tilde{c}(\omega)}.   (C.13)
The spectrum of the signal, |\Delta\tilde{r}(\omega)|^2, is given by equation 3.1. Substituting it into equation C.13, one obtains

D_C = \frac{\rho T}{8} \int_{\omega_1 \le |\omega| \le \omega_2} \frac{d\omega}{\sigma^2} \left( 1 - \frac{2c\tau}{1 + 2c\tau}\, \frac{1}{1 + (\omega\hat{\tau})^2} \right) = \frac{1}{2} D_C^0 \left[ 1 - \frac{2c\tau}{1 + 2c\tau}\, \frac{\arctan(\omega_2\hat{\tau}) - \arctan(\omega_1\hat{\tau})}{\hat{\tau}\,(\omega_2 - \omega_1)} \right],   (C.14)
where D_C^0 is defined by equation 3.5. In the case of a 2I2AFC task, the Chernoff distance is multiplied by a factor of 2, yielding the result of equation 3.7.

Acknowledgments

M.S. is grateful to Y. Loewenstein for a helpful discussion. M.S. was supported by a fellowship from the Burroughs Wellcome Fund. K.S. was supported by Grant 1 R01-DC-007610-01A1. H.S.C. was supported by NIH Grant No. 5 R01 DC00100.

References

Abbott, L. F., & Dayan, P. (1999). The effect of correlated variability on the accuracy of a population code. Neural Comput., 11(1), 91–101.
Averbeck, B. B., Latham, P. E., & Pouget, A. (2006). Neural correlations, population coding and computation. Nature Reviews Neuroscience, 7, 358–366.
Chacron, M. J., Longtin, A., & Maler, L. (2001). Negative interspike interval correlations increase the neuronal capacity for encoding time-dependent stimuli. J. Neurosci., 21(14), 5328–5343.
deCharms, R. C., Blake, D. T., & Merzenich, M. M. (1998). Optimizing sound features for cortical neurons. Science, 280(5368), 1439–1443.
Georgopoulos, A. P., Kalaska, J. F., Caminiti, R., & Massey, J. T. (1982). On the relations between the direction of two-dimensional arm movements and cell discharge in primate motor cortex. J. Neurosci., 2(11), 1527–1537.
Georgopoulos, A. P., Schwartz, A. B., & Kettner, R. E. (1986). Neuronal population coding of movement direction. Science, 233(4771), 1416–1419.
Gütig, R., & Sompolinsky, H. (2006). The tempotron: A neuron that learns spike timing-based decisions. Nat. Neurosci., 9(3), 420–428.
Johansson, R. S., & Birznieks, I. (2004). First spikes in ensembles of human tactile afferents code complex spatial fingertip events. Nature Neurosci., 7(2), 170–177.
Kang, K., & Sompolinsky, H. (2001). Mutual information of population codes and distance measures in probability space. Phys. Rev. Lett., 86(21), 4958–4961.
Narayan, R., Ergun, A., & Sen, K. (2005). Delayed inhibition in cortical receptive fields and the discrimination of complex stimuli. J. Neurophysiol., 94(4), 2970–2975.
Nelken, I., Chechik, G., Mrsic-Flogel, T. D., King, A. J., & Schnupp, J. W. (2005). Encoding stimulus information by spike numbers and mean response time in primary auditory cortex. J. Comput. Neurosci., 19(2), 199–221.
Orszag, S. A., & Bender, C. M. (1991). Advanced mathematical methods for scientists and engineers. New York: Springer-Verlag.
Osborne, L. C., Bialek, W., & Lisberger, S. G. (2004). Time course of information about motion direction in visual area MT of macaque monkeys. J. Neurosci., 24(13), 3210–3222.
Paradiso, M. A. (1988). A theory for the use of visual orientation information which exploits the columnar structure of striate cortex. Biol. Cybern., 58(1), 35–49.
Salinas, E., & Abbott, L. F. (1994). Vector reconstruction from firing rates. J. Comp. Neurosci., 1(1–2), 89–107.
Sen, K., Theunissen, F. E., & Doupe, A. J. (2001). Feature analysis of natural sounds in the songbird auditory forebrain. J. Neurophysiol., 86(3), 1445–1458.
Seung, H. S., & Sompolinsky, H. (1993). Simple models for reading neuronal population codes. Proc. Natl. Acad. Sci. USA, 90(22), 10749–10753.
Shamir, M., & Sompolinsky, H. (2002). Correlation codes in neuronal networks. In T. G. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14. Cambridge, MA: MIT Press.
Shamir, M., & Sompolinsky, H. (2004). Nonlinear population codes. Neural Comput., 16(6), 1105–1136.
Sompolinsky, H., Yoon, H., Kang, K., & Shamir, M. (2001). Population coding in neuronal systems with correlated noise. Phys. Rev. E, 64(5 Pt. 1), 051904.
Theunissen, F. E., Sen, K., & Doupe, A. J. (2000). Spectral-temporal receptive fields of nonlinear auditory neurons obtained using natural sounds. J. Neurosci., 20(6), 2315–2331.
Thomas, J. A., & Cover, T. M. (1991). Elements of information theory. New York: Wiley.
Thorpe, S., Delorme, A., & Van Rullen, R. (2001). Spike-based strategies for rapid processing. Neural Networks, 14(6–7), 715–725.
Tuckwell, C. H. (1988). Introduction to theoretical neurobiology, Vol. 2. Cambridge: Cambridge University Press.
van Kampen, N. G. (1997). Stochastic processes in physics and chemistry. Amsterdam: Elsevier.
Wu, S., Amari, S., & Nakahara, H. (2004). Information processing in a neuron ensemble with the multiplicative correlation structure. Neural Networks, 17(2), 205–214.
Wu, S., Nakahara, H., & Amari, S. (2001). Population coding with correlation and an unfaithful model. Neural Comput., 13(4), 775–797.
Zohary, E., Shadlen, M. N., & Newsome, W. T. (1994). Correlated neuronal discharge rate and its implications for psychophysical performance. Nature, 370(6485), 140–143.
Received June 29, 2006; accepted October 29, 2006.
LETTER
Communicated by Alessandro Treves
Stochastic Dynamics of a Finite-Size Spiking Neural Network

Hédi Soula
[email protected]
Carson C. Chow
[email protected]
Laboratory of Biological Modeling, National Institute of Diabetes and Digestive and Kidney Diseases, National Institutes of Health, Bethesda, MD 20892, U.S.A.
We present a simple Markov model of spiking neural dynamics that can be analytically solved to characterize the stochastic dynamics of a finite-size spiking neural network. We give closed-form estimates for the equilibrium distribution, mean rate, variance, and autocorrelation function of the network activity. The model is applicable to any network where the probability of firing of a neuron in the network depends on only the number of neurons that fired in a previous temporal epoch. Networks with statistically homogeneous connectivity and membrane and synaptic time constants that are not excessively long could satisfy these conditions. Our model completely accounts for the size of the network and correlations in the firing activity. It also allows us to examine how the network dynamics can deviate from mean field theory. We show that the model and solutions are applicable to spiking neural networks in biophysically plausible parameter regimes.

Neural Computation 19, 3262–3292 (2007)
© 2007 Massachusetts Institute of Technology

1 Introduction

Neurons in the cortex, while exhibiting signs of synchrony in certain states and tasks (Singer & Gray, 1995; Steinmetz et al., 2000; Fries, Reynolds, Rorie, & Desimone, 2001; Pesaran, Pezaris, Sahani, Mitra, & Andersen, 2002), mostly fire stochastically or asynchronously (Softky & Koch, 1993). Previous theoretical and computational work on the stochastic dynamics of neuronal networks has mostly focused on the behavior of networks in the infinite size limit with and without the presence of external noise. The bulk of these studies utilize a mean field theory approach, which presumes self-consistency between the input to a given neuron from the collective dynamics of the network with the output of that neuron and either ignores fluctuations or assumes that the fluctuations obey a prescribed statistical distribution (for a review, see Amit & Brunel, 1997a, 1997b; Van Vreeswijk & Sompolinsky, 1996, 1998; Gerstner, 2000; Cai, Tao, Shelley, & McLaughlin, 2004; Gerstner & Kistler, 2002). Within the mean field framework, the statistical distribution of fluctuations may be directly computed using a
Fokker-Planck approach (Abbott & Van Vreeswijk, 1993; Treves, 1993; Fusi & Mattia, 1999; Golomb & Hansel, 2000; Brunel, 2000; Nykamp & Tranchina, 2000; Hansel & Mato, 2003; Del Giudice & Mattia, 2003; Brunel & Hansel, 2006) or estimated from the response of one individual neuron subjected to noise (Plesser & Gerstner, 2000; Salinas & Sejnowski, 2002; Fourcaud & Brunel, 2002; Soula, Beslon, & Mazet, 2006). Although correlations in mean field or near–mean field networks can be nontrivial (Ginzburg & Sompolinsky, 1994; Van Vreeswijk & Sompolinsky, 1996, 1998; Meyer & Van Vreeswijk, 2002), in general, mean field theory is strictly valid only if the size of the network is large enough, the connections are sparse enough, or the external noise is large enough to decorrelate neurons or suppress fluctuations. Hence, mean field theory may not capture correlations in the firing activity of the network that could be important for the transmission and coding of information. It is well known that finite-size effects can contribute to fluctuations and correlations (Brunel, 2003). It could be possible that for some areas of the brain, small regions are statistically homogeneous in that the probability of firing of a given neuron is mostly affected by a common input and the neural activity within a given neighborhood. These neighborhoods may then be influenced by finite-size effects. However, the effect of finite size on neural circuit dynamics is not fully understood. Finite-size effects have been considered previously using expansions around mean field theory (Ginzburg & Sompolinsky, 1994; Meyer & Van Vreeswijk, 2002; Mattia & Del Giudice, 2002). It would be useful to develop analytical methods that could account for correlated fluctuations due to the finite size of the network away from the mean field limit. In general, finite-size effects are difficult to analyze. We circumvent some of the difficulties by considering a simple Markov model where the firing activity of a neuron at a given temporal epoch depends on only the activity of the network at a previous epoch. This simplification, which presumes statistical homogeneity in the network, allows us to produce analytical expressions for the equilibrium distribution, mean, variance, and autocorrelation function of the network activity (firing rate) for arbitrary network size. We find that mean field theory can be used to estimate the mean activity but not the variance and autocorrelation function. Our model can describe the stochastic dynamics of spiking neural networks in biophysically reasonable parameter regimes. We apply our formalism to three different spiking neuronal models.

2 The Model

We consider a simple Markov model of spiking neurons. We assume that the network activity, which we define as the number of neurons firing at a given time, is characterized entirely by the network activity in a previous temporal epoch. In particular, the number of neurons that fire between t
and t + ∆t depends only on the number that fired between t − ∆t and t. This amounts to assuming that the neurons are statistically homogeneous, in that the probability that any neuron will fire is uniform across the network, and that they have a short memory of earlier network activity. Statistical homogeneity could arise, for example, if the network receives common input and the recurrent connectivity is effectively homogeneous. We note that our model does not necessarily require that the network architecture be homogeneous, only that the firing statistics of each neuron be homogeneous. As we will show later, the model in some circumstances can tolerate random connection heterogeneity. Short neuron memory can arise if the membrane and synaptic time constants are not excessively long. The crucial element for our formalism is the gain or response function of a neuron, p(i, t), which gives the probability of firing of one neuron between t and t + ∆t given that i neurons have fired in the previous epoch. The time dependence can reflect the time dependence of an external input or a network process such as adaptation or synaptic depression. Without loss of generality, we can rescale time so that ∆t = 1. Assuming p(i, t) is the same for all neurons, the probability that exactly j neurons in the network will fire between t and t + 1, given that i neurons fired between t − 1 and t, is

$$P(X(t+1) = j \mid X(t) = i) = C_N^j \, p(i,t)^j \, (1 - p(i,t))^{N-j}, \qquad (2.1)$$
where X(t) is the number of neurons firing between t − 1 and t and $C_N^k$ is the binomial coefficient. Equation 2.1 describes a Markov process on the finite set {0, . . . , N}, where N is the maximum number of neurons allowed to fire in a time interval ∆t. The process can be reexpressed in terms of a time-dependent probability density function (PDF) of the network activity, P(t), and a Markov transition matrix (MTM) defined by

$$M_{ij}(t) \equiv P(X(t+1) = j \mid X(t) = i) = C_N^j \, p(i,t)^j \, (1 - p(i,t))^{N-j}. \qquad (2.2)$$
P(t) is a row vector on the space [0, 1]^N that obeys

$$P(t+1) = P(t)\, M(t). \qquad (2.3)$$
For time-invariant transition probabilities, p(i, t) = p(i), and for fixed N, we can write

$$P(t) = P(0)\, M^t. \qquad (2.4)$$
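To make the construction concrete, here is a minimal sketch (Python; the linear response function and its parameters are hypothetical choices, not taken from the text) that builds the MTM of equation 2.2 and propagates an initial PDF according to equation 2.4:

```python
import numpy as np
from scipy.stats import binom

def transition_matrix(p, N):
    """MTM of eq. 2.2: M[i, j] = C(N, j) p(i)^j (1 - p(i))^(N - j)."""
    M = np.empty((N + 1, N + 1))
    for i in range(N + 1):
        M[i, :] = binom.pmf(np.arange(N + 1), N, p(i))
    return M

N = 100
p = lambda i: 0.05 + 0.9 * i / N    # hypothetical response function, 0 < p < 1

M = transition_matrix(p, N)
P = np.zeros(N + 1)
P[0] = 1.0                          # start with all neurons silent
for _ in range(50):                 # evolve the PDF, eqs. 2.3 and 2.4
    P = P @ M
print("mean activity:", P @ np.arange(N + 1))
```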
Our formalism is a simplified variation of previous statistical approaches that use renewal models for neuronal firing (Gerstner & van Hemmen, 1992; Gerstner, 1995; Meyer & Van Vreeswijk, 2002). In those approaches,
the probability for a given neuron to fire depends on the refractory dynamics and inputs received over the time interval since the last time that particular neuron fired. Our model assumes that the neurons are completely indistinguishable, so the only relevant measure of the network dynamics is the number of neurons active within a temporal epoch. Thus, the probability that a neuron will fire depends only on the network activity in a previous temporal epoch. This simplifying assumption allows us to compute the PDF of the network activity, the first two moments, and the autocorrelation function analytically.

3 Time-Independent Markov Model

We consider a network of size N with a time-independent response function, p(i, t) = p(i), so that equation 2.4 specifies the temporal evolution of the PDF of the network activity. If 0 < p(i) < 1 for all i ∈ [0, N] (i.e., the probability of firing is never zero or one), then the MTM is a positive matrix (i.e., it has only nonzero entries). The sum over a row of the MTM is unity by construction. Thus, the maximum row sum matrix norm of the MTM is unity, implying that the spectral radius is also unity. Hence, by the Frobenius-Perron theorem, the maximum eigenvalue is unity, and it is unique. This implies that for all initial states of the PDF, P(t) converges to a unique left eigenvector µ called the invariant measure, which satisfies the relation

$$\mu = \mu M. \qquad (3.1)$$
The invariant measure is the attractor of the network dynamics and the equilibrium PDF. Given that the spectral radius of M is unity, convergence to equilibrium will be exponential, at a rate governed by the second-largest eigenvalue. We note that if the MTM is not positive, there may not be an invariant measure. For example, if the probability of just one possible transition is zero, then the PDF may never settle to an equilibrium. The PDF specifies all the information of the system and can be found by solving equation 3.1. From equation 2.4, we see that the invariant measure is given by the column average over any arbitrary PDF of lim_{t→∞} M^t. Hence, the infinite power of the MTM must be a matrix with equal rows, each row being the invariant measure. Thus, a way to approximate the invariant measure to arbitrary precision is to take one row of a large power of the MTM. The higher the power, the more precise the approximation. Since the convergence to equilibrium is exponential, an accurate approximation can be obtained easily.
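A short sketch of this approximation, reusing `M` and `np` from the previous snippet: one row of a large matrix power of the MTM gives µ, which can be checked against equation 3.1.

```python
# Invariant measure as one row of a large power of the MTM (eq. 3.1).
M100 = np.linalg.matrix_power(M, 100)
mu = M100[0]                     # any row works once the power is large enough
print(np.allclose(mu, mu @ M))   # mu is (approximately) a fixed point of mu = mu M
```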
3.1 Mean and Variance

In general, of most interest are the first two moments of the PDF, so we derive expressions for the expectation value and variance of any function of the network activity. Let X ∈ [0, . . . , N] be a random variable representing the network activity and f be a real-valued function of X. The expectation value and variance of f at time t are thus

$$\langle f(X)\rangle_t = \sum_{k=0}^{N} f(k)\, P_k(t) \qquad (3.2)$$

and

$$\mathrm{Var}_t(f(X)) = \langle f(X)^2\rangle_t - \langle f(X)\rangle_t^2, \qquad (3.3)$$
where we denote the kth element of vector P(t) by P_k(t). We note that in numerical simulations, we will implicitly assume ergodicity at equilibrium. That is, we will consider the expectation value over all possible outcomes to be equal to the expectation over time. Inserting equation 2.2 into 2.3 and using the definition of the mean of a binomial distribution, the mean activity ⟨X⟩_t has the form

$$\langle X\rangle_t = N \langle p(X)\rangle_{t-1}. \qquad (3.4)$$

The mean firing rate of the network is ⟨X⟩_t/∆t. We can also show that the variance is given by

$$\mathrm{Var}_t(X) = N\langle p(1-p)\rangle_{t-1} + N^2\, \mathrm{Var}_{t-1}(p). \qquad (3.5)$$
Details of the calculations for the mean and variance are in appendix A. Thus, the mean and variance of the network activity are expressible in terms of the mean and variance of the response function. At equilibrium, equations 3.4 and 3.5 become ⟨X⟩_µ = N⟨p⟩_µ and Var_µ(X) = N⟨p(1−p)⟩_µ + N²Var_µ(p), respectively. Given an MTM, we can calculate the invariant measure and from that obtain all the moments. If we suppose the response function to be linear, we can compute the mean and variance in closed form. Consider the linear response function

$$p(X) = p_0 + \frac{(q - p_0)}{Nq}\, X \qquad (3.6)$$

for X ∈ [0, N]. Here, p_0 ∈ [0, 1] is the probability of firing for no inputs, and (q − p_0)/Nq is the slope of the response function. Inserting equation 3.6 into 3.4 gives

$$\langle X\rangle_\mu = N p_0 + \frac{(q - p_0)}{q}\, \langle X\rangle_\mu. \qquad (3.7)$$
Solving for ⟨X⟩_µ gives the mean activity

$$\langle X\rangle_\mu = Nq. \qquad (3.8)$$
Substituting equation 3.6 into 3.5 leads to the variance

$$\mathrm{Var}_\mu(X) = N\, \frac{q(1-q)}{1 - \lambda^2 + \lambda^2/N}, \qquad (3.9)$$

where λ = (q − p_0)/q. The details of these calculations are given in appendix A.

The expressions for the mean and variance give several insights into the model. From equation 3.6, we see that the mean activity 3.8 satisfies the condition

$$p(Nq) = q. \qquad (3.10)$$
Hence, on a graph of the response function versus the input, the mean response probability is given by the intersection of the response function and a line of slope 1/N, which we denote the diagonal. Using equation 3.8, equation 3.10 can be reexpressed as

$$\langle X\rangle_\mu = N p(\langle X\rangle_\mu). \qquad (3.11)$$
In equilibrium, the mean response of a neuron to X neurons firing is X/N. Hence, for a linear response function, the mean network activity obeys the self-consistency condition of mean field theory. This can be expected because of the assumption of statistical homogeneity. The variance 3.9 does not match a mean field theory that assumes uncorrelated, statistically independent firing. In equilibrium, the mean probability of firing of a single neuron in the network is q. Thus, each neuron obeys a binomial distribution with variance q(1 − q). Mean field theory would predict that the network variance would then be given by Nq(1 − q). Equation 3.9 shows that the variance exceeds the statistically independent result by a factor that is size dependent, except for q = p_0, which corresponds to an unconnected network. Hence, the variance of the network activity cannot be discerned from the firing characteristics of a single neuron. The network could possess very large correlations, while each constituent neuron would exhibit uncorrelated firing. When N ≫ λ², the variance scales as N, but for small N, there is a size-dependent correction. The coefficient of variation approaches zero as 1/√N as N approaches infinity. When λ = 1, the slope of the response function is 1/N and coincides with the diagonal. This is a critical point where the equilibrium state becomes a line attractor (Seung, 1996). In the limit of N → ∞, the variance diverges at the critical point. At criticality, the variance of our model has the form N²q(1 − q), which diverges as the square of N. Away from the critical point, in the limit as N goes to infinity, the variance has the form Nq(1 − q)/(1 − λ²). Thus, the deviation from mean field theory is present for all network sizes and becomes more pronounced near criticality.

The mean field solution of equation 3.10 is not a strict fixed point of the dynamics 2.4 per se. For example, if the response function is near criticality, possesses discontinuities, or crosses the diagonal multiple times, then the solution of equation 3.10 may not give the correct mean activity. Additionally, if the slope of the crossing has a magnitude greater than 1/N, then the crossing point would be locally unstable. Consider small perturbations of the mean around the fixed point given by equation 3.8:

$$\langle X\rangle_t = Nq + v(t). \qquad (3.12)$$
The response function, equation 3.6, then takes the form

$$p(X) = q + \frac{\lambda}{N}\, v, \qquad (3.13)$$
where λ = (q − p_0)/q. Substituting equations 3.12 and 3.13 into 3.4 gives

$$v_t = \lambda\, v_{t-1}. \qquad (3.14)$$
Hence, v will approach zero (i.e., the fixed point is stable) only if |λ| < 1. Finally, we note that in the case that there is only one solution to equation 3.10 (i.e., the response function crosses the diagonal only once), if the function is continuous and monotonic, then it can cross only with λ < 1 (slope less than 1/N), since 0 < p < 1. Thus, a continuous, monotonically increasing response function that crosses the diagonal only once has a stable mean activity given by this crossing. This mean activity corresponds to the mode of the invariant measure.

We can estimate the mean and variance of the activity for a smooth nonlinear response function that crosses the diagonal once by linearizing the response function around the crossing point (q = p(Nq)). Hence, near the intersection,

$$p(X) = q + \lambda\, \frac{(X - Nq)}{N}, \qquad (3.15)$$
where $\lambda = N \frac{\partial p}{\partial X}(Nq)$. Using equations 3.8 and 3.9 then gives

$$\langle X\rangle_\mu = Nq \qquad (3.16)$$

$$\mathrm{Var}_\mu(X) = \frac{N q (1-q)}{1 - \lambda^2 + \lambda^2/N}. \qquad (3.17)$$

Additionally, the crossing point is stable if |λ| < 1. These linearized estimates are likely to break down near criticality (i.e., λ near one).

Figure 1: Example for response function $p(n) = \frac{1}{\sqrt{2\pi}}\int_{(\theta - I - Jn)/\sigma}^{\infty} e^{-x^2/2}\,dx$ with N = 100, θ = 1, I = 0.1, J = 1.5/N, and σ = 1.0. (A) Np(n) (solid line) and the diagonal (dashed line). (B) The associated Markov transition matrix (in gray levels). (C) The evolution of the mean network activity (solid line) and the variance (dashed line). The steady state is quickly reached, and the activity corresponds to the crossing point between Np(n) and the diagonal. (D) The invariant measure (i.e., the PDF of the network activity).

We show an example of a response function and resulting MTM in Figure 1. Starting with a trivial density function (all neurons are silent), we show the time evolution of the mean and the variance in Figure 1C. We show the invariant measure in Figure 1D. The equilibrium state is given by the intersection of the response function with the diagonal. We see that the mode of the invariant measure is aligned with the mean activity given by the crossing point.
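The following sketch reproduces this computation (parameters from the Figure 1 caption): it extracts the invariant measure from a large power of the MTM and compares its mean and variance with the linearized estimates 3.16 and 3.17, approximating the slope factor by a finite difference.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import binom, norm

N, theta, I, J, sigma = 100, 1.0, 0.1, 1.5 / 100, 1.0
p = lambda n: norm.sf((theta - I - J * n) / sigma)   # response function of Figure 1

# Invariant measure from a large power of the MTM.
M = np.array([binom.pmf(np.arange(N + 1), N, p(i)) for i in range(N + 1)])
mu = np.linalg.matrix_power(M, 200)[0]
x = np.arange(N + 1)
mean_exact = mu @ x
var_exact = mu @ x**2 - mean_exact**2

# Crossing point p(Nq) = q (eq. 3.10) and slope factor lambda = N p'(Nq).
q = brentq(lambda q: p(N * q) - q, 1e-9, 1 - 1e-9)
lam = N * (p(N * q + 0.5) - p(N * q - 0.5))          # finite-difference slope factor

print(mean_exact, N * q)                                       # eq. 3.16
print(var_exact, N * q * (1 - q) / (1 - lam**2 + lam**2 / N))  # eq. 3.17
```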
3.2 Autocovariance and Autocorrelation Functions

We can also compute the autocovariance function of the network activity:

$$\mathrm{Cov}(\tau) = \langle (X(t) - \langle X\rangle_\mu)(X(t+\tau) - \langle X\rangle_\mu) \rangle = \langle X(t)\, X(t+\tau)\rangle - \langle X\rangle_\mu^2. \qquad (3.18)$$

Noting that

$$\langle X(t)\, X(t+\tau)\rangle = \sum_{j} \sum_{k} j\,k\, P(X(t+\tau) = k \mid X(t) = j)\, P(X(t) = j), \qquad (3.19)$$
where $P(X(t+\tau) = k \mid X(t) = j) = M^\tau_{jk}$, we show in appendix B that at equilibrium,

$$\mathrm{Cov}(\tau) = \lambda^\tau\, \mathrm{Var}_\mu(X), \qquad (3.20)$$
where λ is the slope factor (N times the slope) of the response function evaluated at the crossing point with the diagonal. The autocorrelation function AC(τ) is simply λ^τ. The correlation time is given by −1/ln |λ|. At the critical point (λ = 1), the correlation time becomes infinite. For a totally decorrelated network, λ = 0, giving a Kronecker delta function for the autocorrelation, as expected. For an inhibitory network (λ < 0), the autocorrelation exhibits decaying oscillations with period equal to the time step of the Markov approximation (i.e., Cov(1) < 0). (Since we have assumed p > 0, an inhibitory network in this formalism is presumed to be driven by an external current, and the probability of firing decreases when other neurons in the local network fire.) We stress that these correlations are apparent only at the network level. Unless the rate is very high, a single neuron will have a very short correlation time.

3.3 Multiple Crossing Points

For dynamics where 0 < p < 1, if the response function is continuous and crosses the diagonal more than once, then the number of crossings must be odd. Consider the case with three crossings with p(0) < p(N). If the slopes of all the crossings are positive (i.e., a sigmoidal response function), then the slope factor λ of the middle crossing will be greater than one, and the other two crossings will have λ less than one, implying two stable crossing points and one unstable one in the middle. Figures 2A and 2B provide an example of such a response function and its associated MTM. The invariant measure is shown in Figure 2C. We see that it is bimodal, with local peaks at the two crossing points with λ less than one. The invariant measure is well approximated by any row of a large power of the MTM (see Figure 2D).

Figure 2: Example where the response function crosses the diagonal three times. (A) Np(n) (solid line) and diagonal (dashed line). (B) The associated MTM. (C) The invariant measure. (D) MTM$^{10^5}$. All rows are equal to the invariant measure. Parameters are N = 100, I = 0.1, J = 1.8/N, and σ = 0.6.

If the stable crossing points are separated enough (beyond a few standard deviations), then we can estimate the network mean and variance by combining the two local means and variances. Assume that the stable crossing points are q_1 and q_2 (q_1 < q_2) with slope factors λ_1 and λ_2. Then the mean and the variance are:

$$\langle X\rangle_\mu = N\, \frac{q_1 + q_2}{2} \qquad (3.21)$$

$$\mathrm{Var}_\mu(X) = \frac{N}{2}\left[ \frac{q_1(1-q_1)}{1 - \lambda_1^2 + \lambda_1^2/N} + \frac{q_2(1-q_2)}{1 - \lambda_2^2 + \lambda_2^2/N} \right] + N^2 \left( \frac{q_1 - q_2}{2} \right)^2. \qquad (3.22)$$
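These network-level predictions can be checked by sampling the chain directly. The sketch below (illustrative parameters, using the erf-type response function of the Figure 1 example) estimates the autocorrelation of a simulated activity trace and compares it with λ^τ, with the slope factor obtained by a finite difference at the crossing point:

```python
import numpy as np
from math import erfc, sqrt

rng = np.random.default_rng(0)
N, theta, I, J, sigma = 100, 1.0, 0.1, 1.5, 1.0     # illustrative parameters

p = lambda n: 0.5 * erfc((theta - I - J * n / N) / (sigma * sqrt(2)))

# Sample the chain: X(t+1) ~ Binomial(N, p(X(t))), eq. 2.1.
T = 100_000
X = np.zeros(T, dtype=int)
for t in range(1, T):
    X[t] = rng.binomial(N, p(X[t - 1]))

Xc = X[1000:].astype(float) - X[1000:].mean()       # discard transient, center
ac = [np.mean(Xc[:-tau] * Xc[tau:]) / Xc.var() for tau in range(1, 6)]

q = X[1000:].mean() / N                              # empirical crossing point
lam = N * (p(N * q + 0.5) - p(N * q - 0.5))          # slope factor at the crossing
print(ac, [lam**tau for tau in range(1, 6)])         # eq. 3.20: AC(tau) = lambda^tau
```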
4 Fast Leak Spiking Model

We now consider a simple neuron model where the firing state obeys

$$V_i(t) = H\left( I(t) + \frac{1}{N}\sum_{k=1}^{N} J_{ik}\, V_k(t-1) + S(t) - \theta \right), \qquad (4.1)$$
where H(·) is the Heaviside step function, I(t) is an input current, S(t) = N(0, σ) is uncorrelated gaussian noise, θ is a threshold, and J_ik is the connection strength between neurons in the network. When the combined input of the neuron is below threshold, V_i = 0 (the neuron is silent), and when the input exceeds threshold, V_i = 1 (the neuron is active). This is a simple version of a spike response or renewal model (Gerstner & van Hemmen, 1992; Gerstner, 1995; Meyer & Van Vreeswijk, 2002).

In order to apply our Markov formalism, we need to construct the response function. We assume first that the connection matrix has constant entries J. We will also consider disordered connectivity later. The response function is the probability that a neuron will fire given that n neurons fired previously. The response function at t for the neuron is then given by

$$p(n, t) = \frac{1}{\sqrt{2\pi}} \int_{(\theta - I(t) - nJ/N)/\sigma}^{\infty} e^{-x^2/2}\, dx. \qquad (4.2)$$
If we assume a time-independent input, then we can construct the MTM using equation 2.2. The equilibrium PDF is given by equation 3.1 and can be approximated by one row of a large power of the MTM.

4.1 Mean, Variance, and Autocorrelation

Imposing the self-consistency condition, equation 3.10, on the mean activity Nq, where q = ⟨p⟩_µ and p is given by equation 4.2, leads to

$$q = \frac{1}{\sqrt{2\pi}} \int_{(\theta - I - qJ)/\sigma}^{\infty} e^{-x^2/2}\, dx. \qquad (4.3)$$
Taking the derivative of p(n, t) in equation 3.10 at n = Nq gives

$$\lambda = \frac{J}{\sigma\sqrt{2\pi}}\, e^{-\frac{(\theta - I - Jq)^2}{2\sigma^2}}. \qquad (4.4)$$
Using this in equation 3.17 gives the variance. Equation 4.4 shows the influence of noise and recurrent excitation on the slope factor and hence the variance of the network activity. For λ < 1, the variance 3.17 increases with λ. An increase in excitation J increases λ for small J but decreases λ for large J. Similarly, a decrease in noise σ increases λ for large σ but decreases it for small σ. Thus, variability in the system can increase or decrease with synaptic excitation and external noise depending on the regime. The autocovariance function, from equation 3.20, is

$$\mathrm{Cov}(\tau) = \lambda^\tau\, \frac{N q (1-q)}{1 - \lambda^2 + \lambda^2/N}. \qquad (4.5)$$
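As a sketch (with J now scaled as in equation 4.2, so that J/N matches the Figure 1 example; the values remain illustrative), the equilibrium statistics can be obtained without constructing the MTM at all: solve equation 4.3 for q with a root finder, evaluate λ from equation 4.4, and insert both into equation 3.17.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import norm

N, theta, I, J, sigma = 100, 1.0, 0.1, 1.5, 1.0

# Self-consistent firing probability, eq. 4.3.
q = brentq(lambda q: norm.sf((theta - I - q * J) / sigma) - q, 1e-9, 1 - 1e-9)
# Slope factor, eq. 4.4.
lam = (J / (sigma * np.sqrt(2 * np.pi))) * np.exp(-(theta - I - J * q) ** 2 / (2 * sigma**2))
# Network variance via eq. 3.17.
var = N * q * (1 - q) / (1 - lam**2 + lam**2 / N)
print(q, N * q, lam, var)
```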
4.2 Bifurcation Diagram

Bifurcations occur when the stability of the mean field solutions (i.e., crossing points) for the mean activity changes. We can construct the bifurcation diagram in the I–J plane by locating crossing points where λ = 1. We first consider rescaled parameters $\tilde{I} = (\theta - I)/\sigma$ and $\tilde{J} = J/\sigma$. Then equation 4.3 becomes

$$q = \frac{1}{\sqrt{2\pi}} \int_{\tilde{I} - q\tilde{J}}^{\infty} e^{-x^2/2}\, dx, \qquad (4.6)$$
and equation 4.4 becomes

$$\lambda = \frac{\tilde{J}}{\sqrt{2\pi}}\, e^{-(\tilde{I} - \tilde{J}q)^2/2}. \qquad (4.7)$$
Solving equation 4.7 for q and inserting into equation 4.6 gives

$$\tilde{I} = \frac{\tilde{J}}{\sqrt{2\pi}} \int_{\pm\sqrt{\ln\left(\tilde{J}^2/2\pi\lambda^2\right)}}^{\infty} e^{-x^2/2}\, dx \;\pm\; \sqrt{\ln \frac{\tilde{J}^2}{2\pi\lambda^2}}. \qquad (4.8)$$
The bifurcation points are obtained by setting λ = 1 in equation 4.8 and evaluating the integral. The solutions have two branches that satisfy the conditions $\tilde{J} \ge \sqrt{2\pi}$ and $\tilde{I} \ge \sqrt{\pi/2}$. Figure 3A shows the two-dimensional bifurcation diagram in the $\tilde{I}$–$\tilde{J}$ plane. The intersection of the two branches at $\tilde{J} = \sqrt{2\pi}$ and $\tilde{I} = \sqrt{\pi/2}$ is a codimension-two cusp bifurcation point (i.e., it satisfies Np′(x) = 1 and Np′′(x) = 0 with genericity conditions; Kuznetsov, 1998). Between the two branches there are three crossing points (two of which are stable), and outside there is one stable crossing point. Each traversal of a branch yields a saddle node bifurcation, as seen in Figures 3B and 3C for the vertical and horizontal traversals (shown in the inset to Figure 3A), respectively. Traversing through the critical point along the diagonal line in the inset of Figure 3A yields a pitchfork bifurcation. We have assumed that J > 0 (i.e., an excitatory network). By taking J → −J, we obtain the same bifurcation diagram for the inhibitory case, and the above conditions still hold using the absolute value of J.

Figure 3: (A) The two-dimensional bifurcation diagram (λ = 1) as a function of (θ − I)/σ and J/σ. Inset: The traversal lines of the various one-dimensional bifurcation diagrams shown in (B) vertical, (C) horizontal, and (D) oblique.

We compare our analytical estimates from the Markov model with numerical simulations of the fast leak model 4.1 in Figure 4. Figure 4A shows that the equilibrium PDF derived from the MTM matches the PDF generated from the simulation. Figure 4B shows a comparison of λ^τ versus the simulation autocorrelation function. Figure 4C shows the mean and variance of the activity versus the connection weight J. The theoretical results match the simulation results very well. On the same plot, we show the mean field theory estimate where the neurons are assumed to be completely uncorrelated. For low recurrent excitation and, hence, low activity, the correlations can be neglected, and the system behaves like an uncorrelated neural network. However, as the excitation and firing activity increase, correlations become more significant, and the variance grows more quickly than the uncorrelated result. For very strong connections, the firing activity saturates and the variance drops. In that case, mean field theory is again an adequate approximation. Figure 4D shows the variance versus N for a network at the critical point, $\tilde{J} = \sqrt{2\pi}$ and $\tilde{I} = \sqrt{\pi/2}$, where λ = 1 at the crossing point. We see that the variance grows approximately as N^{3/2}. This is faster than the mean field result but slower than the prediction for a linear response function. The breakdown of the linear estimate is expected near criticality.

Figure 5 shows the simulation autocorrelation for both the network and one neuron (chosen randomly) for N = 1000 and N = 10,000. The parameters were chosen so that the network was at a critical point (λ = 1). We see that the correlation time of an individual neuron is near zero and much shorter than that of the network. With increasing N, we find that the correlation time increases for the network and decreases for the single neuron. Thus, a single neuron can be completely uncorrelated while the network has very long correlation times.
Figure 4: (A) Numerical and theoretical PDF of the network activity at equilibrium for N = 100, θ = 1, I = 0.1, σ = 0.8, and J = 1.8. The theoretical PDF was obtained by taking one row of MTM$^{100}$. (B) Numerical and theoretical autocorrelation function for the same parameters as A. Solid line is λ^t with the predicted λ from equation 4.4. (C) Dependence of the mean (solid line) and variance (dashed line) on J. The mean field variance is Nq(1 − q) (dotted line). (D) Evolution of the variance when λ = 1 versus N for I = 1 − √(π/2), J = √(2π), σ = 1, θ = 1. The solid line is N^{3/2} q(1 − q).
4.3 Bistable Regime

In the regime where there are three crossing points, the invariant measure is bimodal. Hence, the network activity is effectively bistable. Bistable network activity has been proposed as a model of working memory (Compte, Brunel, Goldman-Rakic, & Wang, 2000; Laing & Chow, 2001; Gutkin, Laing, Colby, Chow, & Ermentrout, 2001; Compte, 2006). The two local equilibria are locally stable, but due to finite-size fluctuations, there can be spontaneous transitions from one state to the other. Figure 6A shows the activity over time for a bistable network with N = 50. The associated PDF is shown in Figure 6B. No external input has been added to the network. For N = 1000, the switching rate is sufficiently low to be neglected. The resulting probability density function for this network is displayed in Figure 6C.
Figure 5: Autocorrelation of the network and one neuron at a critical point (λ = 1). The solid line is for a network of 10,000 neurons and the dashed line is for a network of 1000 neurons. Crosses and dots are the autocorrelation of one neuron for N = 1000 and N = 10,000, respectively. The autocorrelation for one neuron decays almost immediately. Parameters are σ = 1, θ = 1, J = √(2π), and I = 1 − √(π/2).
We can estimate the scaling behavior with N of the state switching rate, and hence the lifetime of a memory, by supposing that the finite-size-induced fluctuations can be approximated by uncorrelated gaussian noise. We can then map the switching dynamics to barrier hopping of a stochastically forced particle in a double potential well V(x) with a stationary PDF given by the invariant measure µ. The switching rate is given by the formula (Risken, 1989)

$$r = (2\pi)^{-1} \sqrt{V''(c)\, |V''(a)|}\; e^{-\frac{V(a) - V(c)}{\sigma^2}}, \qquad (4.9)$$

where V = −σ² ln(µ), µ is the invariant measure, c is a stable crossing point, a is the unstable crossing point, and σ² is the forcing noise variance around c, which we take to be proportional to the variance of the network activity.

Figure 6: (A) The firing activity over time of a bistable network of 50 neurons. The switching between the two states (high and low activity) is spontaneous. (B) Theoretical PDF for N = 50. (C) Theoretical PDF for N = 1000. (D) Switching rate depends exponentially on N (solid curve). The dashed line is a linear fit on a semilog scale. Parameters are J = 1.8, σ = 0.6, I = 0.1, and θ = 1.

For the general case, we cannot derive this rate analytically, but we can obtain an estimate by assuming that the invariant measure is a sum of two gaussian PDFs whose means are exactly the two stable crossing points:

$$\mu(x) = \frac{\exp\left(-\frac{(x - Nq_1)^2}{2\sigma_1^2}\right)}{2\sqrt{2\pi}\,\sigma_1} + \frac{\exp\left(-\frac{(x - Nq_2)^2}{2\sigma_2^2}\right)}{2\sqrt{2\pi}\,\sigma_2}, \qquad (4.10)$$

where $\sigma_i^2 = \frac{N q_i (1 - q_i)}{1 - \lambda_i^2 + \lambda_i^2/N}$ for i = 1, 2. Using this in equation 4.9 then gives

$$r \propto e^{-kN} \qquad (4.11)$$
for a constant k. The switching rate is an exponentially decreasing function of N. We computed the switching rate for N ∈ [0, 150] for the parameters of the bistable network above. The results are displayed in Figure 6D. For the parameters in Figure 6D, a fit shows that k is of order 0.05. For these parameters, switching between states will not be important when N ≫ 1/k ∼ 20.
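The exponential dependence on N can be probed by direct simulation. The sketch below (parameters as in Figure 6; the half-activity threshold used to classify the two states and the run length are arbitrary choices) counts spontaneous transitions of the sampled chain:

```python
import numpy as np
from math import erfc, sqrt

rng = np.random.default_rng(1)
theta, I, J, sigma = 1.0, 0.1, 1.8, 0.6      # bistable parameters of Figure 6

def switching_rate(N, T=200_000):
    p = lambda n: 0.5 * erfc((theta - I - J * n / N) / (sigma * sqrt(2)))
    X, switches, state = 0, 0, 0
    for _ in range(T):
        X = rng.binomial(N, p(X))
        new_state = int(X > N // 2)          # crude two-state readout
        switches += new_state != state
        state = new_state
    return switches / T

for N in (30, 50, 80):
    print(N, switching_rate(N))              # rate should fall roughly as exp(-k N)
```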
4.4 Disordered Connectivity

We now consider the dynamics for disordered (random) connection weights J_ij drawn from a gaussian distribution N(J̄, σ_J²). Soula et al. (2006) showed that the input to a given neuron for a disordered network can be approximated by stochastic input. We thus choose random input drawn from the distribution N(kJ̄, kσ_J²) for k neurons having fired in the previous epoch. The response function (for time-independent input) then obeys

$$p(n) = \frac{1}{\sqrt{2\pi}} \int_{\frac{\theta - I - n\bar{J}/N}{\sqrt{\sigma^2 + n\sigma_J^2/N}}}^{\infty} e^{-x^2/2}\, dx. \qquad (4.12)$$
Applying the self-consistency condition 3.10 for the mean firing probability q gives

$$q = \frac{1}{\sqrt{2\pi}} \int_{\frac{\theta - I - q\bar{J}}{\sqrt{\sigma^2 + q\sigma_J^2}}}^{\infty} e^{-x^2/2}\, dx. \qquad (4.13)$$
The variance is given by equation 3.17 with

$$\lambda = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{(\theta - I - \bar{J}q)^2}{2(\sigma^2 + q\sigma_J^2)}} \left[ \frac{\bar{J}}{\sqrt{\sigma^2 + q\sigma_J^2}} + \frac{\theta - I - q\bar{J}}{2\left(\sigma^2 + q\sigma_J^2\right)^{3/2}}\, \sigma_J^2 \right]. \qquad (4.14)$$
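A sketch for the disordered case (the connection-weight statistics here are illustrative): evaluate the response function 4.12 and solve equation 4.13 for the self-consistent mean firing probability.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import norm

N, theta, I, sigma = 100, 1.0, 0.1, 1.0
Jbar, sigJ2 = 0.5, 0.2            # illustrative mean and variance of the weights

def p(n):                          # eq. 4.12
    return norm.sf((theta - I - n * Jbar / N) / np.sqrt(sigma**2 + n * sigJ2 / N))

# Self-consistent mean firing probability, eq. 4.13.
q = brentq(lambda q: norm.sf((theta - I - q * Jbar) / np.sqrt(sigma**2 + q * sigJ2)) - q,
           1e-9, 1 - 1e-9)
print(q, N * q)
```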
Figure 7 shows the activity mean and variance as a function of the connection weight variance. There is a close match between the prediction and the results generated from a direct numerical simulation with 100 neurons. We note that a possible consequence of disorder in a neural network is a spin glass, where many competing activity states could coexist, so that the resulting activity is strongly dependent on the initial condition (Hopfield, 1982). However, we believe that the stochastic forcing inherent in our network is large enough to overcome any spin glass effects. In the language of statistical mechanics, our system is at a high enough temperature to smooth the free energy function.

Figure 7: Mean (crosses) and variance (diamonds) from a numerical simulation of the fast leak model for N = 100 neurons as a function of the variance of the random disordered connection weights. Theoretical mean (solid line) and variance (dashed line) match well with numerical simulation values. Parameters are J = 0.5/N, I = 0.1, σ = 1.0, and θ = 1.0.

5 Integrate-and-Fire Model

We now apply our formalism to a more biophysical integrate-and-fire model. We consider two versions. The first treats synaptic input as a current source, and the second considers synaptic input in the form of a conductance change. We apply our Markov model with a discrete time step chosen to be the larger of the membrane or synaptic time constants.

5.1 Current-Based Synapse

We first consider an integrate-and-fire model where the synaptic input is applied as a current. The membrane potential obeys
$$\tau \frac{dV}{dt} = I - V + \frac{J}{N} \sum_{t_f} \alpha(t - t_f) + Z(t), \qquad (5.1)$$

where α(t) = e^{−t/τ_s}/τ_s if t > 0 and zero otherwise, I is a constant input, Z is an uncorrelated zero-mean white noise with variance σ, J is the synaptic strength coefficient, N is the network size, and t_f are the firing times of all neurons in the network. These firing times are computed whenever the membrane potential crosses a threshold V_θ, whereupon the potential is reset to V_Rs. After firing, neurons are prevented from firing during an absolute
refractory period r. The parameter values are τ = 1 ms, τ_s = 1 ms, V_r = −65 mV, V_θ = −60 mV, V_Rs = −80 mV, and r = 1 ms. The network dynamics of this model were studied previously in detail with a Fokker-Planck formalism (Brunel, 2000; Brunel & Hansel, 2006). These analyses showed that if the input current is an uncorrelated white noise N(µ_I, σ_I²) and τ_s ≪ τ, then the response function of the neuron in equilibrium is given by

$$\frac{1}{p(X)} = r + \sqrt{\pi} \int_{\frac{V_r - \mu_I - JX/N}{\sigma_I^2}}^{\frac{V_\theta - \mu_I - JX/N}{\sigma_I^2}} e^{u^2} \left(1 + \operatorname{erf}(u)\right)\, du. \qquad (5.2)$$
The mean activity is again given by Nq, where ⟨p(X)⟩_µ = q, and q is a solution of equation 3.10:

$$q = \left( r + \sqrt{\pi} \int_{\frac{V_r - \mu_I - Jq}{\sigma_I^2}}^{\frac{V_\theta - \mu_I - Jq}{\sigma_I^2}} e^{u^2} \left(1 + \operatorname{erf}(u)\right)\, du \right)^{-1}. \qquad (5.3)$$
Using equation 5.3, we can derive the slope factor λ (evaluated at the equilibrium firing probability m = q),

$$\lambda = m^2 \sqrt{\pi}\, \frac{J}{\sigma_I^2} \left[ e^{\left(\frac{V_\theta - \mu_I - Jm}{\sigma_I^2}\right)^2} \left( 1 + \operatorname{erf}\!\left( \frac{V_\theta - \mu_I - Jm}{\sigma_I^2} \right) \right) - e^{\left(\frac{V_r - \mu_I - Jm}{\sigma_I^2}\right)^2} \left( 1 + \operatorname{erf}\!\left( \frac{V_r - \mu_I - Jm}{\sigma_I^2} \right) \right) \right], \qquad (5.4)$$
which can be used to estimate the variance using equation 3.17.

We also evaluate the response function numerically. We assume that the response function can be approximated by the firing probability of a neuron whose membrane potential obeys

$$\tau \frac{dV}{dt} = I + \frac{JX}{N} - V(t) + Z(t), \qquad (5.5)$$
where X ∈ {0, . . . , N}. This presumes that the effect of discrete synaptic events can be approximated by a constant input equivalent to the mean input plus noise. We estimated the firing probability in a temporal epoch from the average firing rate of the neuron obeying equation 5.5 for each X. We used the firing probabilities directly to form the MTM. We then computed the equilibrium PDF, the mean activity (crossing point), the slope factor, and the variance. We chose the bin size of the Markov model to be the membrane time constant. All differential equations were integrated using the Euler method with a time step of 0.01 ms.

Figure 8: Current synapse model. (A) Mean and variance versus J from the numerical simulation compare well with the predictions (solid and dashed lines). The mean field variance is also plotted (dotted line). Parameters are N = 100, I = 4, and σ = 2.0. (B) The numerical PDF (histogram) and prediction (solid line) for N = 100, J = 5.0, I = 4, and σ = 2.0. (C) The autocorrelation for a network of N = 1000. Parameters are J = 5, I = 4, σ = 2.0, and λ = 0.40. (D) The variance (circles) versus N. The solid line is from equation 3.17 with λ computed using the autocorrelation decay for N = 1000 (λ = 0.4) displayed in C. Parameters are J = 5, I = 4, and σ = 2.0.

We compared the numerically simulated mean and variance of the network activity at equilibrium for 100 neurons with our theoretical predictions for differing synaptic strength J. The results are displayed in Figure 8A. A close match is observed for both quantities. The mean and variance increase with J. Figure 8A also shows the mean field (uncorrelated) variance. We see that the network variance always exceeds the mean field value, especially for midrange values of J. The PDFs from the numerical simulation of 100 neurons and from the Markov model are shown in Figure 8B. The simulated PDF was generated by taking a histogram of the network activity over
10^5 ms. For the Markov model, the PDF is the eigenvector with eigenvalue one, approximated by taking a single row of the one-hundredth power of the MTM. The model predicts an exponentially decaying autocorrelation function with time constant −1/ln |λ|. It is displayed in Figure 8C for a network of N = 1000. We used the same parameters as in Figure 8A with a connection weight where the dynamics deviates from mean field theory (J = 5). In this parameter range, the recurrent excitation is strong, and the single-neuron firing rate is very high, approximately 350 Hz (the refractory time of 1 ms imposes a maximum frequency of 1000 Hz). In this regime, the refractory time acts like inhibition, so the probability of firing actually decreases if the activity in the previous epoch increases. Thus, the autocorrelation function exhibits anticorrelations, as seen in Figure 8C. We can estimate λ by fitting the autocorrelation to λ^τ. Using the approximation λ = Cov(1)/Var gives |λ| = 0.40. The theoretical value of |λ| using equation 5.4 gives 0.37 for the same parameters (using the simulated mean number of firing neurons). Figure 8D shows a plot of the variance versus N for the numerical simulation. The simulated variance is well matched by the estimated variance from equation 3.17, with q measured from the simulation (estimated by the mean firing rate of a single neuron) and λ = 0.4.

5.2 Conductance-Based Synapse

We now consider the integrate-and-fire neuron model with conductance-based synaptic connections. The membrane potential V obeys
$$\tau \frac{dV}{dt} = I(t) - (V - V_r) - s(t)\,(V - V_s) + Z(t), \qquad (5.6)$$
where τ is the membrane time constant, I(t) is an input current, Z(t) is zero-mean white noise with variance σ, V_r is the rest potential, and V_s is the reversal potential of an excitatory synapse. The synaptic gating variable s(t) obeys

$$\tau_s \frac{ds}{dt} = \frac{J}{N} \sum_{t_f} \delta(t - t_f) - s(t), \qquad (5.7)$$

where τ_s is the synaptic time constant, J is the maximal synaptic conductance, N is the number of neurons in the network, and t_f are the firing times of all neurons in the network. The threshold is V_θ, and the reset potential is V_R. As with the previous model, a refractory period r was introduced. We used τ = 1 ms, τ_s = 1 ms, V_r = −65 mV, V_s = 0 mV, V_θ = −60 mV, V_R = −80 mV, and r = 1 ms.
Figure 9: Conductance synapse model. (A) Mean and variance versus J from the numerical simulation compare well with the predictions (solid and dashed lines). Parameters are N = 100, I = 4, and σ = 2.0. (B) The numerical PDF (histogram) and prediction (solid line) for N = 100, I = 4, σ = 2.0, and J = 0.05. (C) The autocorrelation for a network of N = 1500. Parameters are J = 0.05, I = 4, σ = 2.0, and λ = 0.31. (D) The variance (circles) versus N. The solid line is from equation 3.17 with λ computed using the autocorrelation decay for N = 1500 (λ = 0.31) displayed in C. Parameters are J = 0.05, I = 4, and σ = 2.0.
We computed the response function by measuring numerically the firing probability of a neuron that obeys

$$\tau \frac{dV}{dt} = I - (V - V_r) - \frac{JX}{N}\,(V - V_s) + Z(t) \qquad (5.8)$$
for all X ∈ {0, . . . , N} using the same method as in the current-based model. From the response function, we obtained the MTM and the invariant measure. Figure 9A compares the mean and variance of a numerical simulation with the Markov prediction for varying J . We see that there is a good match. For J = 0.05, we compare the numerically computed PDF with the invariant measure. As shown in Figure 9B, the prediction is extremely accurate.
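A sketch of this numerical estimation (Euler integration of equation 5.8 with the section 5.2 parameters; the run length is an arbitrary choice, and counting spikes per millisecond is used as the per-epoch firing probability):

```python
import numpy as np

rng = np.random.default_rng(2)
tau, dt, r_ref = 1.0, 0.01, 1.0               # membrane constant, Euler step, refractory (ms)
I, Vrest, Vs, Vth, Vreset, sigma = 4.0, -65.0, 0.0, -60.0, -80.0, 2.0
J, N = 0.05, 100                               # section 5.2 parameters

def firing_prob(X, T=20_000.0):
    """Estimate p(X) of eq. 5.8: firing probability per 1 ms epoch at fixed drive JX/N."""
    g = J * X / N
    V, spikes, t, refr = Vrest, 0, 0.0, 0.0
    while t < T:
        if refr > 0.0:
            refr -= dt                         # absolute refractory period
        else:
            V += (dt / tau) * (I - (V - Vrest) - g * (V - Vs))
            V += sigma * np.sqrt(dt) * rng.standard_normal()
            if V > Vth:
                V, refr, spikes = Vreset, r_ref, spikes + 1
        t += dt
    return spikes / T                          # spikes per ms = per epoch of 1 ms

print([round(firing_prob(X), 3) for X in (0, 50, 100)])
```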
We estimated the slope factor as in the previous section using the autocorrelation function for N = 1500, found λ = 0.31, and computed the estimated variance for various N. The comparison is shown in Figure 9D. There is again a close match.

6 Discussion

Our model relies on the assumption that the neuronal dynamics can be represented by a Markov process. We partition time into discrete epochs, and the probability that a neuron will fire in one epoch depends only on the activity of the previous epoch. For this approximation to be valid, at least two conditions must be met. The first is that the epoch time needs to be long enough so that the influence of activity in the distant past is negligible. The second is that the epoch time needs to be short enough so that a neuron fires at most once within an epoch. The influence time of a neuron is given approximately by the larger of the membrane time constant (τ_m) and the synaptic time constant (τ_s). Presumably the neuron does not have a memory of events much beyond this timescale. This gives a lower bound on the epoch time ∆t. Hence, ∆t > max[τ_m, τ_s]. The second condition is equivalent to f∆t < 1, where f is the maximum firing rate of a neuron in the network. Thus, the Markov model is applicable for

$$\max[\tau_m, \tau_s] < f^{-1}. \qquad (6.1)$$
This condition is not too difficult to satisfy. For example, a cortical neuron receiving AMPA inputs with a membrane time constant less than 20 ms can satisfy the Markov condition if its maximum firing rate is below 50 Hz. Since typical cortical neurons have a membrane time constant between 1 and 20 ms and many recordings in the cortex find rates below 50 Hz (Nicholls, Martin, & Wallace, 1992), our formalism could be applicable to a broad class of cortical circuits.

The equilibrium state of our Markov model exists and is exactly solvable if the response function is never zero or one. In other words, a neuron in the network always has a nonzero probability of firing but never has full certainty that it will fire. Hence, the neuronal dynamics are always fully stochastic. Thus, the equilibrium state of a fully stochastic network could be another definition of the asynchronous state. However, although the network is completely stochastic, the activity is not uncorrelated. These correlations are manifested in the entire network activity but not within the firing statistics of the individual neurons, which obey a simple random binomial or Poisson distribution. The autocorrelation function of the individual neuron also decays much more quickly than that of the network.

Many previous approaches to studying the asynchronous state assumed that neuronal firing was statistically independent so a mean field
description was valid. With our simple model, we can compute the equilibrium PDF directly and explicitly determine the parameter regimes where mean field theory breaks down. We also show that the order parameter that determines nearness to criticality is the slope of the response function around the equilibrium mean activity. As expected, we find that mean field theory is less applicable for small or strongly recurrent neural circuits. In our model, the mean activity of the network can be obtained using mean field theory except at the critical point. We compute the variance of the network activity directly and see precisely how the network dynamics deviate from mean field theory as we approach the critical point. We note that while our closed-form estimates for the mean and variance may break down very near criticality, our model does not. The invariant measure of the MTM still gives the equilibrium PDF and in principle can be computed to arbitrary precision. Hence, our model can serve as a simple example of when mean field theory is applicable. We note that even in the limit of N going to infinity, the variance of the network deviates from that of a network of independent neurons by a factor of (1 − λ²)^{−1}. The deviations from mean field theory persist even in an infinite-sized network.

Our model requires statistical homogeneity. While purely homogeneous networks are unrealistic for biological networks, statistical homogeneity may be more plausible. For some classes of inputs, the probability of firing for a given set of neurons could be uniform over some limited temporal duration. We found that our analytical estimates can be modified to account for disordered randomness in the connections. A model that fully incorporated all heterogeneous effects would require taking into account different populations. However, with each additional population, the dimension of the MTM increases by a power of N. Thus, for a network of two populations, say inhibitory and excitatory neurons, the resulting problem will involve an MTM with N² × N² elements.

Future work could examine the behavior at the critical point more carefully. Our variance estimate predicted that the variance at criticality would diverge as the system size squared, but simulations found an exponent of 3/2. We also showed that the correlation time of the network activity diverged at the critical point. We expect correlations to obey a power law at criticality. Perhaps a renormalization group approach may be adapted to study the scaling behavior near criticality. Recent experiments have found that cortical slices exhibit critical behavior (Beggs & Plenz, 2003, 2004). Our Markov model may be an ideal system to explore critical behavior.

Finally, we note that even at the critical point, where the network activity is highly correlated, the single-neuron dynamics can exhibit uncorrelated Poisson statistics. This shows that collective network dynamics may not be deducible from single-neuron dynamics. Using our model as a basis, it may be possible to probe characteristics of local network size, connectivity, and nearness to criticality by combining recordings of single neurons with measurements of the local field potential.
Appendix A: Mean and Variance Derivations

The mean activity is given by

$$\langle X\rangle_t = \sum_{k=0}^{N} k\, P_k(t) = \sum_{k=0}^{N} \sum_{i=0}^{N} k\, C_N^k (1 - p_i)^{N-k} p_i^k\, P_i(t-1).$$

Rearranging yields

$$\langle X\rangle_t = \sum_{i=0}^{N} P_i(t-1) \sum_{k=0}^{N} k\, C_N^k (1 - p_i)^{N-k} p_i^k.$$

Since $\sum_{k=0}^{N} k\, C_N^k (1 - p_i)^{N-k} p_i^k = N p_i$ is the mean of a binomial distribution, we obtain

$$\langle X\rangle_t = \sum_{i=0}^{N} P_i(t-1)\, N p_i = N \langle p\rangle_{t-1},$$

which is equation 3.4. The variance is given by
$$\begin{aligned}
\mathrm{Var}_t(X) &= \sum_{k=0}^{N} k^2 P_k(t) - N^2 \langle p\rangle_{t-1}^2 \\
&= \sum_{k=0}^{N} k^2 \sum_{i=0}^{N} C_N^k (1 - p_i)^{N-k} p_i^k\, P_i(t-1) - N^2 \langle p\rangle_{t-1}^2 \\
&= \sum_{i=0}^{N} P_i(t-1) \sum_{k=0}^{N} k^2\, C_N^k (1 - p_i)^{N-k} p_i^k - N^2 \langle p\rangle_{t-1}^2 \\
&= \sum_{i=0}^{N} P_i(t-1)\, N p_i (1 - p_i) + \sum_{i=0}^{N} P_i(t-1)\, N^2 p_i^2 - N^2 \langle p\rangle_{t-1}^2 \\
&= N \langle p(1-p)\rangle_{t-1} + N^2\, \mathrm{Var}_{t-1}(p),
\end{aligned}$$
where we have used the variance of a binomial distribution, Np(1 − p). For the linear case, writing p(X) = a + bX, we have:

$$\begin{aligned}
\mathrm{Var}_\mu(X) &= N\langle p(1-p)\rangle_\mu + N^2 \langle p^2\rangle_\mu - N^2 \langle p\rangle_\mu^2 \\
&= N\langle a + bX - a^2 - 2abX - b^2X^2\rangle_\mu + N^2\langle a^2 + 2abX + b^2X^2\rangle_\mu - N^2\langle a + bX\rangle_\mu^2 \\
&= Na + bN\langle X\rangle - Na^2 - 2Nab\langle X\rangle - Nb^2\langle X^2\rangle + N^2a^2 + 2N^2ab\langle X\rangle + N^2b^2\langle X^2\rangle \\
&\quad - N^2\left(a^2 + 2ab\langle X\rangle + b^2\langle X\rangle^2\right) \\
&= Na - Na^2 + bN(1 - 2a)\langle X\rangle + \left(N^2b^2 - Nb^2\right)\mathrm{Var}(X) - Nb^2\langle X\rangle^2,
\end{aligned}$$

so that

$$\mathrm{Var}_\mu(X) = \frac{Na - Na^2 + bN(1 - 2a)\langle X\rangle - Nb^2\langle X\rangle^2}{1 - N^2b^2 + Nb^2}.$$

Setting a = p_0 and b = (q − p_0)/(Nq) gives

$$\mathrm{Var}_\mu(X) = q^2 N\, \frac{p_0(1 - p_0) + (q - p_0)(1 - p_0 - q)}{q^2 - (q - p_0)^2 + (q - p_0)^2/N}.$$

If we set λ = (q − p_0)/q, we get equation 3.17.

Appendix B: Autocovariance Function

We prove the form of the autocovariance function Cov(τ) = λ^τ Var_µ(X) for the linear response function using induction. We first show that Cov(1) = λVar_µ(X) and then that Cov(τ + 1) = λCov(τ). The autocovariance function when τ = 1 is given by

$$\begin{aligned}
\mathrm{Cov}(1) &= \langle X_t X_{t+1}\rangle - \langle X\rangle_\mu^2 \\
&= \sum_{k=0}^{N} \sum_{i=0}^{N} k\,i\, P(X_{t+1} = i \mid X_t = k)\, P(X_t = k) - \langle X\rangle_\mu^2 \\
&= N \sum_{k=0}^{N} k\, p_k\, P(X_t = k) - \langle X\rangle_\mu^2 \\
&= N \langle pX\rangle_\mu - \langle X\rangle_\mu^2.
\end{aligned}$$
For a linear response function p, we obtain

$$\begin{aligned}
\mathrm{Cov}(1) &= N \langle pX\rangle_\mu - \langle X\rangle_\mu^2 \\
&= N \left\langle \left( p_0 + \frac{q - p_0}{Nq}\, X \right) X \right\rangle_\mu - N^2 q^2 \\
&= N^2 p_0 q + \frac{q - p_0}{q} \left( \mathrm{Var}_\mu(X) + N^2 q^2 \right) - N^2 q^2 \\
&= N^2 q (p_0 - q) + \frac{q - p_0}{q}\, \mathrm{Var}_\mu(X) + N^2 q (q - p_0) \\
&= \frac{q - p_0}{q}\, \mathrm{Var}_\mu(X).
\end{aligned}$$

Hence, for τ = 1, the autocovariance function is equal to the slope factor λ = (q − p_0)/q times the variance. Assume for τ that Cov(τ) = λ^τ Var_µ(X); then

$$\begin{aligned}
\mathrm{Cov}(\tau + 1) &= \langle X_t X_{t+\tau+1}\rangle - \langle X\rangle_\mu^2 \\
&= \sum_{k=0}^{N} \sum_{j=0}^{N} k\,j\, P(X_t = k)\, P(X_{t+\tau+1} = j \mid X_t = k) - \langle X\rangle_\mu^2 \\
&= \sum_{k=0}^{N} \sum_{j=0}^{N} k\,j\, P(X_t = k) \sum_{r=0}^{N} P(X_{t+\tau+1} = j \mid X_{t+\tau} = r)\, P(X_{t+\tau} = r \mid X_t = k) - \langle X\rangle_\mu^2 \\
&= \sum_{k=0}^{N} \sum_{j=0}^{N} \sum_{r=0}^{N} k\,j\, P(X_t = k)\, C_N^j (1 - p(r))^{N-j} p(r)^j\, P(X_{t+\tau} = r \mid X_t = k) - \langle X\rangle_\mu^2 \\
&= \sum_{k=0}^{N} \sum_{r=0}^{N} N k\, p(r)\, P(X_t = k)\, P(X_{t+\tau} = r \mid X_t = k) - \langle X\rangle_\mu^2,
\end{aligned}$$

using the mean of the binomial distribution. We can now insert the linear response function to obtain

$$\begin{aligned}
\mathrm{Cov}(\tau + 1) &= \sum_{k=0}^{N} \sum_{r=0}^{N} N k \left( p_0 + \frac{q - p_0}{Nq}\, r \right) P(X_t = k)\, P(X_{t+\tau} = r \mid X_t = k) - \langle X\rangle_\mu^2 \\
&= \sum_{k=0}^{N} \sum_{r=0}^{N} N k\, p_0\, P(X_t = k)\, P(X_{t+\tau} = r \mid X_t = k) + \frac{q - p_0}{q} \left( \mathrm{Cov}(\tau) + \langle X\rangle_\mu^2 \right) - \langle X\rangle_\mu^2,
\end{aligned}$$

because

$$\sum_{k=0}^{N} \sum_{r=0}^{N} k\,r\, P(X_t = k)\, P(X_{t+\tau} = r \mid X_t = k) = \mathrm{Cov}(\tau) + \langle X\rangle_\mu^2.$$

Then, since

$$\sum_{k=0}^{N} \sum_{r=0}^{N} N k\, p_0\, P(X_t = k)\, P(X_{t+\tau} = r \mid X_t = k) = \sum_{k=0}^{N} N k\, p_0\, P(X_t = k),$$

because $\sum_{r=0}^{N} P(X_{t+\tau} = r \mid X_t = k) = 1$ for all k, and

$$\sum_{k=0}^{N} N k\, p_0\, P(X_t = k) = N p_0 \langle X\rangle_\mu = N^2 p_0 q,$$

we finally obtain

$$\begin{aligned}
\mathrm{Cov}(\tau + 1) &= N^2 p_0 q + \frac{q - p_0}{q} \langle X\rangle_\mu^2 - \langle X\rangle_\mu^2 + \lambda \mathrm{Cov}(\tau) \\
&= N^2 p_0 q + \frac{q - p_0}{q}\, N^2 q^2 - N^2 q^2 + \lambda \mathrm{Cov}(\tau) \\
&= N^2 \left( p_0 q + (q - p_0) q - q^2 \right) + \lambda \mathrm{Cov}(\tau) \\
&= \lambda \mathrm{Cov}(\tau),
\end{aligned}$$

proving equation 3.20 by induction.

Acknowledgments

This work was supported by the Intramural Research Program of NIH, NIDDK. We thank Michael Buice for his helpful comments on the manuscript.
References

Abbott, L., & Van Vreeswijk, C. (1993). Asynchronous states in a network of pulse-coupled oscillators. Phys. Rev. E, 48(2), 1483–1490.
Amit, D. J., & Brunel, N. (1997a). Dynamics of recurrent spiking neurons before and following learning. Network: Comput. Neural Syst., 8, 373–404.
Amit, D. J., & Brunel, N. (1997b). Model of global spontaneous activity and local structured activity during delay periods in the cerebral cortex. Cereb. Cortex, 7(3), 237–252.
Beggs, J. M., & Plenz, D. (2003). Neuronal avalanches in neocortical circuits. J. Neurosci., 23(35), 11167–11177.
Beggs, J. M., & Plenz, D. (2004). Neuronal avalanches are diverse and precise activity patterns that are stable for many hours in cortical slice cultures. J. Neurosci., 24(22), 5216–5229.
Brunel, N. (2000). Dynamics of sparsely connected networks of excitatory and inhibitory spiking neurons. J. Comput. Neurosci., 8(3), 183–208.
Brunel, N. (2003). Dynamics and plasticity of stimulus-selective persistent activity in cortical network models. Cereb. Cortex, 13(11), 1151–1161.
Brunel, N., & Hansel, D. (2006). How noise affects the synchronization properties of recurrent networks of inhibitory neurons. Neural Comput., 18(5), 1066–1110.
Cai, D., Tao, L., Shelley, M., & McLaughlin, D. W. (2004). An effective kinetic representation of fluctuation-driven neuronal networks with application to simple and complex cells in visual cortex. Proc. Natl. Acad. Sci. USA, 101(20), 7757–7762.
Compte, A. (2006). Computational and in vitro studies of persistent activity: Edging towards cellular and synaptic mechanisms of working memory. Neuroscience, 139(1), 135–151.
Compte, A., Brunel, N., Goldman-Rakic, P. S., & Wang, X. J. (2000). Synaptic mechanisms and network dynamics underlying spatial working memory in a cortical network model. Cereb. Cortex, 10(9), 910–923.
Del Giudice, P., & Mattia, M. (2003). Stochastic dynamics of spiking neurons. In E. Korutcheva & K. Cureno (Eds.), Advances in condensed matter and statistical physics (pp. 125–153). Hauppauge, NY: Nova Science.
Fourcaud, N., & Brunel, N. (2002). Dynamics of the firing probability of noisy integrate-and-fire neurons. Neural Comput., 14(9), 2057–2110.
Fries, P., Reynolds, J. H., Rorie, A. E., & Desimone, R. (2001). Modulation of oscillatory neuronal synchronization by selective visual attention. Science, 291(5508), 1560–1563.
Fusi, S., & Mattia, M. (1999). Collective behavior of networks with linear (VLSI) integrate-and-fire neurons. Neural Comput., 11, 633–652.
Gerstner, W. (1995). Time structure of the activity in neural network models. Phys. Rev. E, 51(1), 738–758.
Gerstner, W. (2000). Population dynamics of spiking neurons: Fast transients, asynchronous states and locking. Neural Comput., 12, 43–89.
Gerstner, W., & van Hemmen, J. (1992). Associative memory in a network of "spiking" neurons. Network, 3, 139–164.
Gerstner, W., & Kistler, W. (2002). Spiking neuron models: Single neurons, populations, plasticity. Cambridge: Cambridge University Press.
Ginzburg, I., & Sompolinsky, H. (1994). Theory of correlations in stochastic neural networks. Phys. Rev. E, 50(4), 3171–3191.
Golomb, D., & Hansel, D. (2000). The number of synaptic inputs and the synchrony of large, sparse neuronal networks. Neural Comput., 12(5), 1095–1139.
Gutkin, B. S., Laing, C. R., Colby, C. L., Chow, C. C., & Ermentrout, G. B. (2001). Turning on and off with excitation: The role of spike-timing asynchrony and synchrony in sustained neural activity. J. Comput. Neurosci., 11(2), 121–134.
Hansel, D., & Mato, G. (2003). Asynchronous states and the emergence of synchrony in large networks of interacting excitatory and inhibitory neurons. Neural Comput., 15(1), 1–56.
Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci. USA, 79, 2554–2558.
Kuznetsov, Y. (1998). Elements of applied bifurcation theory (2nd ed.). Berlin: Springer.
Laing, C. R., & Chow, C. C. (2001). Stationary bumps in networks of spiking neurons. Neural Comput., 13(7), 1473–1494.
Mattia, M., & Del Giudice, P. (2002). Population dynamics of interacting spiking neurons. Phys. Rev. E, 66(5), 051917.
Meyer, C., & Van Vreeswijk, C. (2002). Temporal correlations in stochastic networks of spiking neurons. Neural Comput., 14(2), 369–404.
Nicholls, J., Martin, A., & Wallace, B. (1992). From neuron to brain (3rd ed.). Sunderland, MA: Sinauer Associates.
Nykamp, D., & Tranchina, D. (2000). A population density approach that facilitates large-scale modeling of neural networks: Analysis and an application to orientation tuning. J. Comput. Neurosci., 8, 19–50.
Pesaran, B., Pezaris, J. S., Sahani, M., Mitra, P. P., & Andersen, R. A. (2002). Temporal structure in neuronal activity during working memory in macaque parietal cortex. Nat. Neurosci., 5(8), 805–811.
Plesser, H. E., & Gerstner, W. (2000). Noise in integrate-and-fire models: From stochastic input to escape rates. Neural Comput., 12, 367–384.
Risken, H. (1989). The Fokker-Planck equation: Methods of solution and applications (2nd ed.). Berlin: Springer.
Salinas, E., & Sejnowski, T. J. (2002). Integrate-and-fire neurons driven by correlated stochastic input. Neural Comput., 14(9), 2111–2155.
Seung, H. S. (1996). How the brain keeps the eyes still. Proc. Natl. Acad. Sci. USA, 93(23), 13339–13344.
Singer, W., & Gray, C. M. (1995). Visual feature integration and the temporal correlation hypothesis. Annu. Rev. Neurosci., 18, 555–586.
Softky, W. R., & Koch, C. (1993). The highly irregular firing of cortical cells is inconsistent with temporal integration of random EPSPs. J. Neurosci., 13(1), 334–350.
Soula, H., Beslon, G., & Mazet, O. (2006). Spontaneous dynamics of asymmetric random recurrent spiking neural networks. Neural Comput., 18(1), 60–79.
Steinmetz, P. N., Roy, A., Fitzgerald, P. J., Hsiao, S. S., Johnson, K. O., & Niebur, E. (2000). Attention modulates synchronized neuronal firing in primate somatosensory cortex. Nature, 404(6774), 187–190.
Treves, A. (1993). Mean-field analysis of neuronal spike dynamics. Network, 4, 259–284.
Van Vreeswijk, C., & Sompolinsky, H. (1996). Chaos in neuronal networks with balanced excitatory and inhibitory activity. Science, 274, 1724–1726.
Van Vreeswijk, C., & Sompolinsky, H. (1998). Chaotic balanced state in a model of cortical circuits. Neural Comput., 10(6), 1321–1371.
Received August 9, 2006; accepted September 27, 2006.
LETTER
Communicated by Bruno Olshausen
What Is the Optimal Architecture for Visual Information Routing?

Philipp Wolfrum
[email protected]
Frankfurt Institute for Advanced Studies, D-60438 Frankfurt am Main, Germany

Christoph von der Malsburg
[email protected]
Frankfurt Institute for Advanced Studies, D-60438 Frankfurt am Main, Germany, and Computer Science Department, University of Southern California, Los Angeles, CA 90089, U.S.A.
Analyzing the design of networks for visual information routing is an underconstrained problem due to insufficient anatomical and physiological data. We propose here optimality criteria for the design of routing networks. For a very general architecture, we derive the number of routing layers and the fanout that minimize the required neural circuitry. The optimal fanout l is independent of network size, while the number k of layers scales logarithmically (with a prefactor below 1) with the number n of visual resolution units to be routed independently. The results are found to agree with data of the primate visual system.

1 Introduction

An impressive capability of biological vision systems is invariant object recognition. The same object seen at a different position, distance, or under rotation leads to entirely different retinal images, which have to be perceived as the same object. Only one of these variances, translation, can be compensated by movements of the eye. The approximate log-polar transform that takes place in the mapping from retina to cortex (Schwartz, 1977) replaces some kinds of transformations (scale, rotation) by others (translation on the cortex) and therefore does not fully explain invariant recognition either. Invariant recognition, and how the visual system performs it, remains a topic far from being understood.

Invariance does not mean insensitivity to the spatial arrangement of visual information. Although object recognition in our brain is invariant with respect to the transformations noted, it is sensitive to small differences in the retinal activity pattern arising from, say, seeing both of your twin sisters shortly after each other. We believe that the only way a brain can solve these two competing problems realistically is to have a general
object-independent mechanism that compensates variance transformations without distorting the image, conveying a normalized version of it to higher brain areas for recognition. Such a mechanism requires a routing network providing physical connections from all locations in the visual input region (V1) to all points in the target area (like IT). Additionally, neural machinery is required to control these connections. The need for dynamic information routing was appreciated early on in vision research (Pitts & McCulloch, 1947), and several architectures have been proposed, like shifter circuits (Olshausen, Anderson, & van Essen, 1993) and the SCAN model (Postma, van den Herik, & Hudson, 1997). What has been missing so far is a discussion of efficiency in terms of required neural resources. Different routing architectures require different numbers of intermediate feature-representing nodes (we will refer to them simply as nodes) and node-to-node connections (links from now on; if we mean both links and nodes, we will use the term units). Although we will not discuss in this letter how connections in a routing circuit are controlled (for this, refer to Olshausen et al., 1993, and to Lücke, 2005), we assume that the maintenance of a link and its control by a neural control unit has the same cost as a feature node (for a deviation from this assumption, see section 2.1). It is likely that cortical architectures have evolved that minimize this cost for the organism. In this letter, we therefore derive and analyze the routing network structure that minimizes the sum of all required units, both nodes and links.

In our analysis, we focus on two situations. In section 2 we discuss routing between two cortical regions of identical size. This corresponds to perception of an already coarsely segmented object. In section 3 we consider the architecture that must be present in real biological vision systems: routing from a large input domain to a much smaller output domain, the first corresponding to the whole visual field and the second to a small, higher-level target area engaged in object recognition. After analysis of these two cases, we interpret our results quantitatively for the case of the primate brain (see section 4) and discuss some implications and predictions arising from them (see section 5).

2 Routing Between Two Regions of the Same Size

Let us define a routing architecture with as few assumptions as possible:
- Input and output stages both consist of n image points. Each image point is represented by one feature unit (the extension of this to more than one feature per image point will be discussed below).
- The routing between input and output is established via k − 1 intermediate layers of n feature units each.
- Nodes of adjacent feature layers can be connected. For every such connection, there exists one dynamic neural unit that controls information
Figure 1: Architectures for routing networks. (a) The one-dimensional case, 1 with n = 27 and k = 3; thus, l = n k = 3. All feature nodes are shown (dots) but only selected links (lines), the others being shifted versions (with circular boundary conditions) of the shown links. All connections from one input node to the whole output stage are shown as solid lines, and the connectivity between one output node and the whole input as dashed lines. (b) The two-dimensional case, with n = 64, k = 3, and l = 4. Only the downward connections from a single output node are shown.
Under these assumptions, what is the minimal architecture providing for each input node one separate pathway to every output node? For k = 1, the situation is clear: without any intermediate layer, every input node must be connected to all n output nodes. With intermediate layers, however, we can make use of a combinatorial code to achieve full connectivity, similar to the "butterfly" computations used in the fast Fourier transform (Cooley & Tukey, 1965): We assume that each input unit is connected only to l nodes of the adjacent intermediate layer (see the solid lines in Figure 1a). Each of these l nodes in turn has connections to l nodes of the following layer, and so on, until the output stage is reached. This method yields for every input node l^k pathways to the output stage, which are unique and all lead to different output nodes if we make sure that no two separate pathways merge again on the way to the output stage. An anatomically plausible way to meet this functional requirement is to let the spacing between target points increase exactly by the factor l from one link layer to the next, as shown in Figure 1a.
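As a concrete illustration of this construction (our sketch, not part of the original letter; it assumes circular boundary conditions and target spacing growing by the factor l per link layer, as in Figure 1a), the following fragment builds the one-dimensional routing graph for n = 27 and k = 3 and checks that every input node reaches each output node through exactly one pathway:

```python
# Sketch of the 1-D combinatorial routing construction of Figure 1a
# (our illustration; circular boundary conditions assumed, link spacing
# grows by the factor l from one link layer to the next).
n, k = 27, 3
l = round(n ** (1 / k))          # fan-out per node, l = n^(1/k) = 3
assert l ** k == n

def targets(node, layer):
    """Nodes of layer+1 reached from `node`; spacing is l**layer."""
    return [(node + j * l ** layer) % n for j in range(l)]

# Follow all l**k pathways from input node 0 through the k link layers.
reached = [0]
for layer in range(k):
    reached = [t for node in reached for t in targets(node, layer)]

# Every output node is reached exactly once: the pathways are unique
# and no two of them merge on the way to the output stage.
assert sorted(reached) == list(range(n))
print(f"{l ** k} unique pathways cover all {n} output nodes")
```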
The two-dimensional case is analogous, except that here, the groups of nodes projecting to the same target are two-dimensional patches of l units with adequate spacing in between (see Figure 1b).

The connectivity described here agrees with the general anatomical finding that the spread of neuronal connections increases along the visual hierarchy. Perkel, Bullier, and Kennedy (1986), for example, found a higher divergence of projections between V1 and V4 than between V1 and V2 or V3, respectively. In Tanigawa, Wang, and Fujita (2005), a four times larger spread of horizontal axons in inferotemporal cortex than in V1 was reported. Note, however, that the specific connectivity of the routing network is irrelevant for the results derived in the following. The only requirement is that the pathways of every input node be unique and lead to different output nodes.

In order to reach the whole output stage with these pathways, their number must equal the number of output nodes: n = l^k. From this we get the necessary neuronal fanout at each stage as

\[
l = n^{\frac{1}{k}}. \tag{2.1}
\]
Let us now calculate how many nodes are needed to realize the routing architecture. Having k − 1 intermediate layers means that a total of (k + 1)n feature nodes is required. All of these nodes, except those of the output layer, have l links to the next stage, resulting in a total of knl = kn^{(k+1)/k} links. So the total number of units as a function of k and n is

\[
N(k, n) = (k + 1)n + k\, n^{\frac{k+1}{k}}. \tag{2.2}
\]
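Equation 2.2 is straightforward to explore numerically. A minimal sketch (ours; n = 1000 is chosen to match Figure 2):

```python
# Evaluate N(k, n) = (k+1)n + k*n^((k+1)/k) from equation 2.2 and find
# the number of link layers k that minimizes it (cf. Figure 2, n = 1000).
def n_units(k, n):
    return (k + 1) * n + k * n ** ((k + 1) / k)

n = 1000
costs = {k: n_units(k, n) for k in range(1, 16)}
k_best = min(costs, key=costs.get)
print(f"k = 1 (all-to-all): N = {costs[1]:.3g}")   # ~ n^2 links dominate
print(f"best k = {k_best}: N = {costs[k_best]:.3g}")
```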
As we can see in Figure 2, this number changes drastically with the number of intermediate layers being used. A direct all-to-all connectivity without any intermediate layers (k = 1) is most expensive because the number of required links scales quadratically with n in this case. For a very large number of intermediate layers, the decrease in the required fanout l is outweighed by the linear increase in nodes caused by additional layers. As we can see, there is a unique value k_opt for which the number of required units attains a minimum. To determine k_opt, we calculate the derivative of N with respect to k and set it to zero:

\[
\frac{\partial N}{\partial k} = n + n^{\frac{k+1}{k}} - k\,\frac{\ln n}{k^2}\, n^{\frac{k+1}{k}} = n \left( 1 + n^{\frac{1}{k}} - \frac{\ln n}{k}\, n^{\frac{1}{k}} \right) \stackrel{!}{=} 0.
\]
Figure 2: The number of required units for a routing architecture between two layers depends strongly on the number k − 1 of intermediate layers being used. The values shown here are for input and output stages of n = 1000 image points each.
This is satisfied for

\[
\frac{\ln n}{k} = n^{-\frac{1}{k}} + 1. \tag{2.3}
\]
With the ansatz

\[
k_{\mathrm{opt}} = c \ln n, \tag{2.4}
\]
equation 2.3 becomes independent of n:

\[
\frac{1}{c} = e^{-\frac{1}{c}} + 1. \tag{2.5}
\]
Solving this numerically, we obtain

\[
k_{\mathrm{opt}} \approx 0.7822 \ln n. \tag{2.6}
\]
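Equation 2.5 has no closed-form solution, but a one-dimensional root find recovers the prefactor. A minimal sketch (ours), assuming SciPy is available:

```python
# Solve 1/c = exp(-1/c) + 1 (equation 2.5) numerically for the prefactor c.
import math
from scipy.optimize import brentq

f = lambda c: 1 / c - math.exp(-1 / c) - 1
c = brentq(f, 0.1, 2.0)          # the root is bracketed in (0.1, 2.0)
print(f"c = {c:.4f}")            # -> c = 0.7822, i.e. k_opt = 0.7822 ln n
```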
The fact that k_opt scales logarithmically with n is not surprising by itself. Such a scaling behavior lies at the heart of many techniques that have to permute or operate on a group of nodes simultaneously, like permutation networks or the fast Fourier transform (Cooley & Tukey, 1965). Even in random graphs, the network diameter (corresponding somewhat to our
number of layers k) scales logarithmically with the number of nodes. This general logarithmic scaling behavior is independent of the specific fanout (or degree) at each node. A different fanout changes only the basis of the logarithm, which is equivalent to changing the prefactor in the logarithmic relation. Here, however, minimizing the number of components of the network leads to a specific logarithmic scaling, or phrased differently, the prefactor c in k_opt is unique. This goes hand in hand with the existence of a unique optimal fanout:

\[
l_{\mathrm{opt}} = n^{\frac{1}{k_{\mathrm{opt}}}} = e^{\frac{1}{c}}. \tag{2.7}
\]
We discuss this finding further in section 5.

2.1 More Than One Feature per Image Point. So far, we have neglected the routing of visual information when there are several feature cells at one image point. Instead of a dense pixel array, visual information in V1 is represented by a pattern of "hypercolumns" of lower density. However, each hypercolumn contains cells responsive to many different local properties of the input, such as wavelet-like features (Gabors) at different orientations and spatial frequencies, different colors, or specificity for one eye or the other. It may not be necessary to route these features independently of each other to higher areas, so one might assume that only one active link is needed to route many feature units in one image location. Nevertheless, certain feature types do require individual treatment. For example, for full orientation invariance, units of one orientation specificity of the input would need connections to all orientation specificities of the output domain. The truth probably lies somewhere in between these two extremes, as suggested in Zhu and von der Malsburg (2004): image points are not routed individually, but in small assemblies through collective links called "maplets." For every group of nodes, there exist several such maplets, responsible for routing at different scales and orientations without requiring individual links for all features in all positions.

Since the focus of this letter is not on a specific routing architecture but on finding the optimal number of layers for a very general architecture, we will merge the above arguments into a single factor α ≥ 1, representing the number of feature nodes that are controlled by a single link. If necessary, the parameter α can also be used to account for unequal expense assumed for feature units versus link units. Instead of n independent feature nodes, we now have n_α = n/α groups of nodes, each containing α nodes. With this, the number of units in the routing circuit, equation 2.2, changes to

\[
N(k, n) = \alpha (k + 1)\, n_\alpha + k\, n_\alpha^{\frac{k+1}{k}}. \tag{2.2'}
\]
Figure 3: The prefactors c and c̃ define k_opt through equations 2.4' and 3.3 in the cases of routing to an output of the same size or much smaller size, respectively.
Setting the derivative of equation 2.2' to zero leads to

\[
\frac{\ln n_\alpha}{k} = \alpha\, n_\alpha^{-\frac{1}{k}} + 1. \tag{2.3'}
\]
The new ansatz,

\[
k_{\mathrm{opt}} = c \ln n_\alpha, \tag{2.4'}
\]
yields

\[
\frac{1}{c} = \alpha e^{-\frac{1}{c}} + 1, \tag{2.5'}
\]
which we can solve numerically for explicit values of α. In Figure 3 we see that c, and with it k_opt, changes by only a factor of 2 over a reasonably large range of α.

Having determined the number of layers k_opt that minimizes the required neural circuitry for given n and α, we can calculate the size N_opt of this minimal circuitry. Inserting equation 2.4' into 2.2' yields

\[
N_{\mathrm{opt}}(n) = n + \left( \alpha + e^{\frac{1}{c}} \right) c\, n_\alpha \ln n_\alpha. \tag{2.8}
\]
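The same root finding extends to equation 2.5'. The sketch below (ours; the values of n and α are illustrative, not taken from the text) computes c(α) and evaluates k_opt and equation 2.8:

```python
# Solve 1/c = alpha*exp(-1/c) + 1 (equation 2.5') and evaluate the size
# of the optimal circuit, equation 2.8, for example values of n and alpha.
import math
from scipy.optimize import brentq

def prefactor(alpha):
    return brentq(lambda c: 1 / c - alpha * math.exp(-1 / c) - 1, 0.01, 2.0)

n, alpha = 10**6, 2          # example values, not from the original text
n_alpha = n / alpha          # number of independently routed groups
c = prefactor(alpha)
k_opt = c * math.log(n_alpha)
N_opt = n + (alpha + math.exp(1 / c)) * c * n_alpha * math.log(n_alpha)
print(f"alpha={alpha}: c={c:.3f}, k_opt={k_opt:.1f}, N_opt={N_opt:.3g}")
```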
This means that for large n, the number of units of the optimal routing architecture between two layers of n_α image points scales with

\[
N_{\mathrm{opt}}(n_\alpha) \propto n_\alpha \ln n_\alpha, \tag{2.9}
\]
as expected from classical network theory. This result also holds for routing of only a single feature (α = 1) per point.

3 Routing Circuit with Different Sizes of Input and Output Layer

Let us now discuss routing from the whole visual field to a comparatively small cortical output region. We assume that an attentional mechanism singles out, in the input domain, a region that is to be mapped to the output region. We do not claim here that invariant recognition is perfect over the whole visual field (there are studies showing that this is not the case: Dill & Fahle, 1998; Cox, Meier, Oertelt, & DiCarlo, 2005), but object recognition is possible to some degree even at high retinal eccentricities, although impaired by the poor resolution in these parts of the retina. In any case, the basic problem remains the same as before: dynamic connections must exist between all parts of the visual input region and a target area.

Computationally, the situation is similar to the one discussed in the previous section and leads to a generalization of the results derived there. We now want to connect an input stage of n units with an output stage that is smaller by the factor m and contains only n/m units. As before, the routing is established via k − 1 intermediate layers, and groups of α nodes can be routed collectively. With the same argument as in section 2, we see that now each group has to make

\[
l = \left( \frac{n_\alpha}{m} \right)^{\frac{1}{k}} \tag{3.1}
\]
connections to the next higher layer in order to connect every group of input nodes with every output group. Figure 4 shows part of the architecture required for routing from a 125-node input to an 8-node output stage. Note that due to the different input and output sizes, the downward fanout now has to be higher than the upward fanout l.

Unlike before, the size of the intermediate layers is not well defined. We will assume that the number of nodes changes linearly from the input to the output layer. This is supported by measurements of the average sizes of primary visual areas in humans (Dougherty et al., 2003). Note, however, that the same article reports variation in V1 size of more than 100% between different individuals. In general, there seem to be few undisputed data on this question in the literature. Given this uncertainty in
Figure 4: Routing network for an input of n = 125 and an output of n/m = 8 nodes via k = 3 link layers. Consequently the upward fanout is l = 2. The solid lines show the links connecting an input node with the full output stage. Downward connectivity is l_down = n^{1/k} = 5 > l (shown exemplarily for a node on the second level of the architecture by dotted lines).
the anatomical data, the simplest possible assumption is probably best for this kind of general discussion. With a linear decrease in size, the number of feature units in layer κ (κ = 0 for the input and κ = k for the output layer) is

\[
f_\kappa = n - \frac{\kappa}{k} \left( n - \frac{n}{m} \right).
\]

The number F of the feature-encoding units of all layers is then

\[
F = \sum_{\kappa=0}^{k} f_\kappa = (k + 1)n - \left( n - \frac{n}{m} \right) \frac{k(k + 1)}{2k} = \frac{n}{2}\, \frac{m + 1}{m}\, (k + 1).
\]
Adding the links emanating upward from all but the topmost layer, we get the total number of units as

\[
N(n, k) = F + \frac{1}{\alpha} \left( F - \frac{n}{m} \right) l
= \frac{n_\alpha}{2}\, \frac{m + 1}{m} \left[ \alpha (k + 1) + \left( \frac{n_\alpha}{m} \right)^{\frac{1}{k}} \left( k + 1 - \frac{2}{m + 1} \right) \right]. \tag{3.2}
\]

Setting the derivative with respect to k to zero leads to

\[
-\alpha \left( \frac{n_\alpha}{m} \right)^{-\frac{1}{k}} = 1 + \frac{\ln (n_\alpha / m)}{k^2} \left( \frac{2}{m + 1} - k - 1 \right).
\]
With the ansatz

\[
k_{\mathrm{opt}} = \tilde{c} \ln \frac{n_\alpha}{m}, \tag{3.3}
\]
this turns into

\[
-\alpha e^{-\frac{1}{\tilde{c}}} = 1 + \frac{1}{\tilde{c}^2 \ln (n_\alpha / m)} \left( \frac{2}{m + 1} - 1 \right) - \frac{1}{\tilde{c}}.
\]
For large input-output ratio m, the term 2/(m + 1) becomes negligible, so that c̃ depends only on the number n_α/m of independently routed output nodes and not on m itself:
\[
-\alpha e^{-\frac{1}{\tilde{c}}} \approx 1 - \frac{1}{\tilde{c}^2 \ln (n_\alpha / m)} - \frac{1}{\tilde{c}}. \tag{3.4}
\]
Numerical analysis of equation 3.4 shows, however, that c̃ changes by less than 10% when n_α/m is varied over three orders of magnitude. So we can say that, like c in section 2, c̃ depends only on the parameter α. Figure 3 shows that c̃ takes on similar but slightly higher values than c.

Calculating the size of the derived routing circuit by plugging equation 3.3 into equation 3.2 yields

\[
N_{\mathrm{opt}} = \left( \alpha + e^{\frac{1}{\tilde{c}}} \right) \frac{n_\alpha}{2} \left( \tilde{c} \ln \frac{n_\alpha}{m} + 1 \right) \tag{3.5}
\]
for large m. Although the relation is a bit different from the one derived for equal input and output domains, equation 2.8, the scaling with ñ_α ln ñ_α (here ñ_α = n_α/m) remains the same.

4 Physiological Interpretation

The main goal of this letter is to raise the question of optimal information routing in terms of required neural resources. In sections 2 and 3, we found that the optimal number of link layers in a routing circuit is given by

\[
k_{\mathrm{opt}} = c \ln n_\alpha \quad \text{and} \quad k_{\mathrm{opt}} = \tilde{c} \ln \frac{n_\alpha}{m}
\]

for routing to an output stage of identical size and of much smaller size, respectively. So in both cases, k_opt is proportional to the natural logarithm
of the number of independently routed output nodes. The well-defined prefactors c and c̃ are very similar (cf. Figure 3), depend only on α, and do not vary too much over large ranges of α.

How do those results match the facts in the human brain? A good starting point is the optic nerve, which is known to contain about 10^6 fibers in humans and other primates (Potts et al., 1972). Since the optic nerve is the bandwidth bottleneck of the visual system, it is safe to assume that it carries no redundant information. The number of neurons in V1, however, is far higher than the number of optic nerve fibers, mainly for two reasons. First, the cortex employs a population coding strategy in order to reduce noise and increase transmission speed. This means that several neurons together (perhaps the approximately 100 of a cortical minicolumn) represent one of our abstract feature units. Second, visual information is represented in an overcomplete code in V1 (Olshausen & Field, 1997), increasing the number of feature units over the number of optic nerve fibers. Nevertheless, the information represented in V1 cannot be higher than that transported by the optic nerve, so that overcomplete groups of units can be routed collectively. We will therefore assume that the number of feature-encoding units is of the same order as the number of fibers in the optic nerve, keeping in mind that an overcomplete basis in V1 may be accounted for by a correspondingly higher value of α.

From the primary visual area V1, visual information is routed retinotopically along the ventral pathway to a target region in inferotemporal cortex (IT). Psychophysical evidence (van Essen, Olshausen, Anderson, & Gallant, 1991) suggests that about 1000 feature nodes are sufficient to represent the contents of the two-dimensional "window of attention," and therefore the size of this target region, at any given time. One may assume that there exist multiple such target regions in parallel in IT, which are used for different object recognition tasks, but that question is outside the scope of this letter.

How would our routing architecture look for these numbers? For this, we still miss an estimate of the parameter α. Research in our lab has shown that representing an image with 40 Gabor wavelets at each image point preserves all necessary image information of gray-scale images (Wundrich, von der Malsburg, & Würtz, 2004) and allows good object identification (Lades et al., 1993). To additionally include color and temporal information (direction of motion), this number would have to be roughly twice as high. This is in line with findings concerning the number of orientation pinwheels in the primate brain. Obermayer and Blasdel (1997) report around 10^4 pinwheels for V1 of the macaque. Assuming a similar number for the human brain, we face the situation of an input region of the ventral stream containing 10^6 feature units clustered in some 10^4 pinwheels. If we assume that every pinwheel, as a first-order approximation of the functional "hypercolumn," contains the full set of visual features for a certain input location on one retina, it follows that the number of these distinct features is of the order of 100. Inputs from the two eyes are treated independently here,
Figure 5: Routing from an input stage of 10^6 to an output stage of 1000 nodes. (Top) Optimal number of link layers as a function of α. (Bottom) Total number of required units using the optimal number of layers from above.
so that successful stereoscopic fusion can be achieved for arbitrary depths by activating the right routing links.

While we have two estimates that agree on the number of feature units per resolution point, coming from computer vision and physiology, the number α of features that can be routed together is difficult to estimate. It depends on the kinds of invariance operations that are realized in the routing circuit, as discussed in section 2.1. We assume α to lie in the approximate range of 2 to 5. For these values, the optimal number of layers (k + 1) for routing from a 10^6-node input to a 1000-node output ranges from 4.3 to 5.8. Figure 5 shows these values, as well as the number of units required for the full circuit when using the optimal number of layers; a numerical sketch of this calculation is given below.

The ventral pathway comprises the areas V1, V2, V4, and IT. IT in turn consists of posterior, central, and anterior parts. In our setting, it may make sense to take this additional subdivision into account, since the receptive field sizes of these three parts are very different (Tanaka, Fujita, Kobatake, Cheng, & Ito, 1993; see also Figure 4 in Oram & Perrett, 1994), suggesting that they form different stages of the routing hierarchy.
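As announced above, here is a rough numerical cross-check of these figures (our sketch; it assumes that the 1000 output nodes form 1000/α independently routed groups and uses the large-m approximation, equation 3.4):

```python
# Solve equation 3.4 for c~ and evaluate k_opt + 1 for an input of 10^6
# feature units routed to an output of 1000, with alpha features routed
# collectively (our sketch, not the authors' computation).
import math
from scipy.optimize import brentq

def layers(out_units=10**3, alpha=2):
    groups_out = out_units / alpha            # independently routed groups
    L = math.log(groups_out)                  # ln(n_alpha / m)
    g = lambda c: alpha * math.exp(-1 / c) + 1 - 1 / (c * c * L) - 1 / c
    c_tilde = brentq(g, 0.05, 2.0)
    return c_tilde * L + 1                    # number of layers, k_opt + 1

for alpha in (2, 3, 4, 5):
    print(f"alpha={alpha}: k_opt + 1 = {layers(alpha=alpha):.1f}")
# -> about 5.8 down to 4.4 layers, close to the 4.3-5.8 range quoted above
#    (small differences come from the large-m approximation used here).
```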
Visual information is relayed from the lateral geniculate nucleus in a rather clear sequential order, V1→V2→V4→PIT→CIT→AIT, finally being combined with other signal streams in the superior temporal polysensory area (STP). Note that there exist at least equally strong feedback connections between the layers, indicating the importance of recurrent processes in vision. The number of four to six distinct cortical stages (depending on whether we regard IT as one or three stages) lies clearly in the range derived for our optimal circuit above. It is therefore possible that the ventral pathway performs computationally optimal information routing. At this point, however, this is only a hypothesis, owing to the great uncertainties in the available data. More explicit interpretations would be possible if α could be narrowed down further (also by quantifying the "overcompleteness" of V1) and if the stages involved in the routing were known for certain. Also, more psychophysical work on the information content of the window of attention would be desirable.

5 Discussion

We have seen in sections 2 and 3 that under some very general assumptions, there exists a clear optimality condition on the number of layers required to build a routing architecture with minimal neural resources. This number depends on the size of the target region as well as the number of independently routed feature types. Within the given uncertainties, the derived numbers agree well with physiological data.

Constraining the design of a routing architecture by an optimality condition, as we did, has the advantage of imposing an additional requirement on an otherwise underconstrained problem. While the shifter circuit of Olshausen et al. (1993) addresses several anatomical and physiological facts, there is no experimental or theoretical justification for some of the parameter values chosen, among them the exact doubling of link spacing from layer to layer. In the absence of experimental results dictating these values, we think it best to follow a global optimality condition like the one we have proposed.

We are well aware that our very general assumptions can be refined in several ways, possibly changing the derived quantitative results:
- We purposely avoided a detailed discussion of the kind of feature-to-feature connectivity that may be in place to achieve scale and orientation invariance. This will be addressed in future work and will help to narrow down the parameter α.
- Routing architectures with many layers are a disadvantage to the organism in terms of longer reaction times and more complicated routing dynamics. This additional cost has not been considered here; its influence would bias the biological routing architecture in favor of fewer stages than derived here.
Although the main contribution of this communication may lie in introducing optimality criteria to the design of routing circuits, it also leads to experimental predictions. One such prediction arises from the fact that the notion of a static receptive field becomes meaningless if one embraces an active routing process. During attention focusing and recognition, this process would choose a certain routing path and deactivate all alternative pathways. For a unit at the output stage of the hierarchy (IT), this would change the functional receptive field from a very broad region to a narrow and specific location. A unit at a medium stage of the hierarchy might even be bypassed by the currently established routing pathway. There is ample evidence for the behavioral plasticity of receptive fields (Moran & Desimone, 1985; Connor, Gallant, Preddie, & van Essen, 1993), and recent findings (Murray, Boyaci, & Kersten, 2006) show that even the size of representation in V1 can change with an object's perceived size (suggesting a scale-invariant routing process that already starts in the mapping from the lateral geniculate nucleus to V1). However, these findings are often interpreted as the result of a diffuse "attention modulation" mechanism, without taking the possibility of an explicit routing process seriously. In the light of the rather specific geometric changes of receptive fields implied by the presence of such a process, it should be possible to design attention experiments that can clearly prove or refute the routing hypothesis.

While the above predictions are general implications of an active routing process and have been discussed similarly before (Olshausen et al., 1993), the quantitative results obtained here make some more specific predictions. An interesting feature of the minimal architecture, mentioned in section 2, is that the number of links emanating from one node (see equation 2.1 or 3.1) is independent of network size:

\[
l_{\mathrm{opt}} = n_\alpha^{\frac{1}{k_{\mathrm{opt}}}} = \exp \left( \frac{\ln n_\alpha}{\tilde{c} \ln n_\alpha} \right) = \exp \left( \frac{1}{\tilde{c}} \right). \tag{5.1}
\]
l_opt is surprisingly low (between 3 and 9 for the range of α shown in Figure 3). This number should not be confused with the full number of connections that a cortical neuron makes, which is known to be several thousand. First, we count here only the connections necessary for information routing, not those involved in other kinds of processing or communication. Second, the functional units discussed here are abstract image points, which in the cortex are probably made up of around 100 spiking neurons (like a cortical minicolumn). Single neurons in such a group would have to devote the majority of their connections to homeostatic within-group connections (Lücke & von der Malsburg, 2004), which do not appear on our level of abstraction. Nevertheless, the small fanout necessary for optimal routing is an interesting feature and shows that by including the number of control units in our optimization, we have implicitly also minimized the required connectivity of the routing architecture.
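Evaluating equation 5.1 numerically supports this (our sketch; we use c from equation 2.5' as a stand-in for the similar c̃, and we assume that Figure 3 spans roughly α = 1 to 10):

```python
# Evaluate the optimal fan-out l_opt = exp(1/c) over a range of alpha,
# with c taken from equation 2.5' as a stand-in for the similar c~.
import math
from scipy.optimize import brentq

for alpha in (1, 2, 5, 10):
    c = brentq(lambda c: 1 / c - alpha * math.exp(-1 / c) - 1, 0.01, 2.0)
    print(f"alpha={alpha:2d}: l_opt = {math.exp(1 / c):.1f}")
# -> l_opt grows from about 3.6 to about 8.7, within the 3-to-9 range
#    quoted above.
```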
The optimal number of layers (see equation 2.4' or 3.3), on the other hand, scales logarithmically with network size: k_opt = c̃ ln n_α. This means that if more visual information has to be routed, the number of routing stages increases, while the local properties (the number of connections that each node has to make) remain the same. Consequently, for species processing different amounts of visual information, the ventral streams should contain different numbers of routing stages. While the optic nerve of primates contains on the order of 10^6 fibers (Potts et al., 1972), the number is 10^5 for the rat (Fukuda, Sugimoto, & Shirokawa, 1982), 2 × 10^5 for the cat (Hughes & Wässle, 1976), and 2.4 × 10^6 for the adult chicken (Rager & Rager, 1978). If we assume, as we did before, that the number of optic nerve fibers is a measure of the number of input units of the ventral stream, and if the number of output (IT) units changes by the same factor, then a rat would optimally have 2.3 layers fewer, a cat 1.6 layers fewer, and a chicken 0.9 routing layers more than a primate. The differences might be smaller, however, if the size of the output stage does not change as strongly as the number of optic nerve fibers, since k_opt depends on the number of output units. Although anatomical comparisons across species will be difficult, it may be interesting to investigate different brains with regard to this question.

Acknowledgments

We thank Charles Anderson for inspiring discussions, Cornelius Weber for useful comments on the manuscript, and two anonymous reviewers whose criticism has helped us to clarify a number of points. This work was supported by the European Union through project FP6-2005-015803 ("Daisy") and by the Hertie Foundation.

References

Connor, C. E., Gallant, J. L., Preddie, D. C., & van Essen, D. C. (1993). Responses in area V4 depend on the spatial relationship between stimulus and attention. Journal of Neurophysiology, 75, 1306–1308.

Cooley, J. W., & Tukey, J. W. (1965). An algorithm for the machine calculation of complex Fourier series. Mathematics of Computation, 19, 297–301.

Cox, D., Meier, P., Oertelt, N., & DiCarlo, J. J. (2005). "Breaking" position-invariant object recognition. Nature Neuroscience, 8(9), 1145–1147.

Dill, M., & Fahle, M. (1998). Limited translation invariance of human visual pattern recognition. Perception and Psychophysics, 60(1), 65–81.

Dougherty, R. F., Koch, V. M., Brewer, A. A., Fischer, B., Modersitzki, J., & Wandell, B. A. (2003). Visual field representations and locations of visual areas V1/2/3 in human visual cortex. Journal of Vision, 3, 586–598.
Fukuda, Y., Sugimoto, T., & Shirokawa, T. (1982). Strain differences in quantitative analysis of rat optic nerve. Experimental Neurology, 75, 525–532.

Hughes, A., & Wässle, H. (1976). The cat optic nerve: Fibre total count and diameter spectrum. Journal of Comparative Neurology, 169, 171–184.

Lades, M., Vorbrüggen, J., Buhmann, J., Lange, J., von der Malsburg, C., Würtz, R., & Konen, W. (1993). Distortion invariant object recognition in the dynamic link architecture. IEEE Transactions on Computers, 42, 300–311.

Lücke, J. (2005). Dynamics of cortical columns – sensitive decision making. In Proc. ICANN, LNCS 3696 (pp. 25–30). New York: Springer.

Lücke, J., & von der Malsburg, C. (2004). Rapid processing and unsupervised learning in a model of the cortical macrocolumn. Neural Computation, 16, 501–533.

Moran, J., & Desimone, R. (1985). Selective attention gates visual processing in the extrastriate cortex. Science, 229, 782–784.

Murray, S. O., Boyaci, H., & Kersten, D. (2006). The representation of perceived angular size in human primary visual cortex. Nature Neuroscience, 9, 429–434.

Obermayer, K., & Blasdel, G. G. (1997). Singularities in primate orientation maps. Neural Computation, 9, 555–575.

Olshausen, B. A., Anderson, C. H., & van Essen, D. C. (1993). A neurobiological model of visual attention and invariant pattern recognition based on dynamic routing of information. Journal of Neuroscience, 13(11), 4700–4719.

Olshausen, B. A., & Field, D. J. (1997). Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37, 3311–3325.

Oram, M. W., & Perrett, D. I. (1994). Modeling visual recognition from neurobiological constraints. Neural Networks, 7, 945–972.

Perkel, D. J., Bullier, J., & Kennedy, H. (1986). Topography of the afferent connectivity of area 17 in the macaque monkey: A double-labelling study. Journal of Comparative Neurology, 253, 374–402.

Pitts, W., & McCulloch, W. S. (1947). How we know universals: The perception of auditory and visual forms. Bulletin of Mathematical Biophysics, 9, 127–147.

Postma, E. O., van den Herik, H. J., & Hudson, P. T. W. (1997). SCAN: A scalable neural model of covert attention. Neural Networks, 10(6), 993–1015.

Potts, A. M., Hodges, D., Shelman, C. B., Fritz, K. J., Levy, N. S., & Mangnall, Y. (1972). Morphology of the primate optic nerve. I. Method and total fiber count. Investigative Ophthalmology and Visual Science, 11, 980–988.

Rager, G., & Rager, U. (1978). Systems-matching by degeneration I. A quantitative electron microscopic study of the generation and degeneration of retinal ganglion cells in the chicken. Experimental Brain Research, 33, 65–78.

Schwartz, E. L. (1977). Spatial mapping in primate sensory projection: Analytic structure and relevance to perception. Biological Cybernetics, 25, 181–194.

Tanaka, K., Fujita, I., Kobatake, E., Cheng, K., & Ito, M. (1993). Serial processing of visual object-features in the posterior and anterior parts of the inferotemporal cortex. In T. Ono, L. R. Squire, M. E. Raichle, D. Perrett, & M. Fukuda (Eds.), Brain mechanisms of perception and memory: From neuron to behaviour (pp. 34–46). New York: Oxford University Press.

Tanigawa, H., Wang, Q., & Fujita, I. (2005). Organization of horizontal axons in the inferior temporal cortex and primary visual cortex of the macaque monkey. Cerebral Cortex, 15, 1887–1899.
van Essen, D. C., Olshausen, B., Anderson, C. H., & Gallant, J. (1991). Pattern recognition, attention, and information bottlenecks in the primate visual system. In B. Mathur & C. Koch (Eds.), Proceedings of the SPIE Conference on Visual Information Processing: From Neurons to Chips (Vol. 1473, pp. 17–28). Bellingham, WA: SPIE.

Wundrich, I. J., von der Malsburg, C., & Würtz, R. P. (2004). Image representation by complex cell responses. Neural Computation, 16(12), 2563–2575.

Zhu, J., & von der Malsburg, C. (2004). Maplets for correspondence-based object recognition. Neural Networks, 17(8–9), 1311–1326.
Received May 29, 2006; accepted December 6, 2006.
LETTER
Communicated by Mike DeWeese
Enhanced Sound Perception by Widespread-Onset Neuronal Responses in Auditory Cortex

Osamu Hoshino
[email protected]
Department of Intelligent Systems Engineering, Ibaraki University, Hitachi, Ibaraki 316-8511, Japan
Accumulating evidence suggests that auditory cortical neurons exhibit widespread-onset responses and restricted sustained responses to sound stimuli. When a sound stimulus is presented to a subject, the auditory cortex first responds with transient discharges across a relatively large population of neurons, showing widespread-onset responses. As time passes, the activation becomes restricted to a small population of neurons that are preferentially driven by the stimulus, showing restricted sustained responses. The sustained responses are considered to have a role in expressing information about the stimulus, but it remains to be seen what roles the widespread-onset responses have in auditory information processing. We carried out numerical simulations of a neural network model for a lateral belt area of auditory cortex. In the network, dynamic cell assemblies expressed information about auditory sounds. Lateral excitatory and inhibitory connections were made between cell assemblies, respectively, by direct and indirect projections via interneurons. Widespread-onset neuronal responses to sound stimuli (bandpassed noises) took place over the network if lateral excitation preceded lateral inhibition, making a time window for the onset responses. The widespread-onset responses contributed to accelerating the reaction time of neurons to sensory stimulation. Lateral interaction among dynamic cell assemblies was essential for maintaining ongoing membrane potentials near thresholds for action potential generation, thereby accelerating reaction time to subsequent sensory input as well. We suggest that the widespread-onset neuronal responses and the ongoing subthreshold cortical state, for which the coordination of lateral synaptic interaction among dissimilar cell assemblies is essential, may work together in order for the auditory cortex to quickly detect the sudden occurrence of sounds from the external environment.

1 Introduction

Accumulating evidence suggests that auditory cortical neurons exhibit widespread-onset responses and restricted sustained responses to sound stimuli (Recanzone, 2000; Mickey & Middlebrooks, 2003; DeWeese, Wehr,

Neural Computation 19, 3310–3334 (2007)
© 2007 Massachusetts Institute of Technology
& Zador, 2003). A recent study (Wang, Lu, Snider, & Liang, 2005) reported that some neurons of the primary auditory (AI) and lateral belt (LB) areas responded selectively to sound stimuli (e.g., pure tones, bandpassed noises, frequency- and amplitude-modulated tones) in a sustained manner, and many other neurons responded to these stimuli in an onset manner. When a sound stimulus was presented to a subject, the auditory cortex first responded with transient discharges across a relatively large population of neurons, showing widespread-onset responses. As time passed, the activation became restricted to a small population of neurons that were preferentially driven by the stimulus, showing restricted sustained responses. The researchers suggested that the sustained discharges express information about the stimulus, for which excitatory synaptic connections within cell assemblies or feedback input from higher cortical areas (or both) might be responsible.

Wang and colleagues (2005) speculated that onset discharges evoked in the auditory cortex could have different roles in representing information about external acoustic environments. Widespread-onset responses with short latency may allow the auditory cortex to quickly detect the occurrence of sounds from the external environment. The timing of onset discharges may provide the auditory cortex with precise timing information to mark the onset of a sound or transitions in an ongoing acoustic stream. The onset discharges may be used to encode auditory information as well, because onset discharges are as great as sustained discharges.

The purpose of the study presented here is to elucidate what roles, if any, the widespread-onset neuronal responses to sound stimuli play in auditory information processing and how the widespread-onset responses are organized in the cortex. For that purpose, we carry out numerical simulations of a neural network model for the LB. Information about bandpassed noises is expressed in the LB network dynamics as population activation of neurons (or dynamic cell assemblies). Within cell assemblies, principal cells are connected via excitatory connections. We assume here both lateral excitation and inhibition among dissimilar dynamic cell assemblies, each of which has sensitivity to a different bandpassed noise. The lateral excitation and inhibition are made, respectively, by direct and indirect connections between principal cells. Inhibitory interneurons mediate the indirect connections. Although this circuitry for the auditory cortex is hypothetical and its evidence has not clearly been identified yet, similar circuitry has been found in other early sensory cortical areas. For example, in the primary visual cortex, neighboring cell assemblies (orientation columns) that have different orientation preferences are connected through both excitatory and inhibitory synapses (Eysel, 1999). The lateral inhibition between dissimilar orientation columns enhanced the tuning property of neurons to specific orientations. Das and Gilbert (1999) suggested that lateral excitatory connections between dissimilar orientation columns contribute to processing complex visual features such as corners and T-junctions even in early cortical areas.
In the rat barrel cortex, neurons are topographically organized in a precise columnar structure, and neighboring barrel columns that have dissimilar tactile preferences are connected through both excitatory and inhibitory synapses (Ajima & Tanaka, 2006). The researchers suggested that the lateral excitatory connections between dissimilar barrel columns might be important for the earliest intercolumnar integration, because excitatory inputs from whiskers are conveyed separately to the pyramidal cells in the specific barrel columns.

We focus our study on whether the widespread-onset neuronal responses have any active role in processing externally presented auditory stimuli and how these responses are organized. We also investigate how combined lateral excitation and inhibition between dissimilar dynamic cell assemblies regulate an ongoing (spontaneous) network state that precedes sensory stimulation. This was motivated by the fact that an ongoing network state, working as a ready state for sensory input, has a great impact on subsequent neuronal information processing in the cortex. For example, self-generated ongoing activity had a distinct spatiotemporal patterning, which led to the temporal coordination of input-triggered responses and their binding into coherent dynamic cell assemblies (Engel, Fries, & Singer, 2001). Preceding ongoing activity was clearly reflected in the activity of cell assemblies responding to sensory stimulation and played an essential role in cortical function, based on which Arieli and colleagues (Arieli, Shoham, Hildesheim, & Grinvald, 1995; Arieli, Sterkin, Grinvald, & Aertsen, 1996) suggested that ongoing neuronal activity cannot be neglected when we investigate the details of cognitive neuronal information processing in the brain.

2 Neural Network Model

A schematic drawing of the neural network model is shown in Figure 1A. The model consists of a primary auditory cortical network (AI) and a lateral belt network (LB). The AI and LB networks have tuning preferences to specific frequency (characteristic frequency) tones and bandpassed noises, respectively. When a bandpassed noise BNn (n = 1, 2, 3, or 4) that has a 1-octave bandwidth centered at a frequency CFn (Rauschecker, Tian, & Hauser, 1995) is presented to the AI network, a set of five frequency columns (ellipses in the figure) is simultaneously activated and sends action potentials to the cell assembly of the LB network corresponding to the stimulus (BNn) through selective feedforward projections (solid lines in the figure). The center frequencies of the four bandpassed noises are 1.0 (CF1), 2.5 (CF2), 4.0 (CF3), and 15.0 (CF4) kHz. The AI network consists of projection neurons. For simplicity, there is no connection between projection neurons, and the AI network works simply as an input layer.

As shown in Figure 1B, the LB network consists of cell units, each composed of a pyramidal cell (P), a small basket cell (S), and a large basket cell (L). In each unit, the pyramidal cell and the small basket cell are reciprocally
Figure 1: Architecture of the model. (A) The model consists of AI and LB networks, where ellipses and circles denote cell assemblies and cell units, respectively. In the LB network, some portions of cell assemblies (filled circles) are connected via lateral excitatory synapses with each other (dotted lines). Each cell assembly of the LB network consists of 10 cell units, and that of the AI network consists of 10 projection neurons. (B) In the LB network, each cell unit (gray region) is composed of a pyramidal cell (P), a small basket cell (S), and a large basket cell (L). Open and filled triangles denote excitatory and inhibitory synapses, respectively. Each dotted line denotes a lateral excitatory connection to another cell assembly.
connected by an excitatory and an inhibitory synapse. The pyramidal cell connects to the large basket cell by an excitatory synapse. The large basket cell connects to the pyramidal cells belonging to the other cell assemblies by lateral inhibitory synapses. Groups of pyramidal cells form the so-called cell assemblies. The pyramidal cells within cell assemblies are connected with each other by excitatory synapses. Portions of the cell assemblies (filled circles in Figure 1A) are connected with each other by lateral excitatory synapses (dotted lines in Figure 1A), which form a fraction of lateral excitatory connection areas. This neuronal circuitry is based in part on anatomical and neurophysiological findings in the cortex, which will be discussed in detail in section 4.

The dynamic evolution of the membrane potentials of projection neurons (PN) of the AI network and pyramidal cells (P) of the LB network is defined, respectively, by

\[
\tau_m^{PN} \frac{du_i^{PN}(t)}{dt} = -\left( u_i^{PN}(t) - u_{\mathrm{rest}}^{PN} \right) + I_i^{PN}(t), \tag{2.1}
\]
\[
\tau_m^{P} \frac{du_i^{P}(t)}{dt} = -\left( u_i^{P}(t) - u_{\mathrm{rest}}^{P} \right) + I_{\mathrm{syn}} + I_{\mathrm{inp}}, \tag{2.2}
\]
where I_syn and I_inp denote the total postsynaptic current within the LB network and the input current from the AI network, respectively. I_syn is defined by

\[
I_{\mathrm{syn}} = I_{\mathrm{AMPA}} + I_{\mathrm{GABA}_L} + I_{\mathrm{GABA}_S}, \tag{2.3}
\]
which are supplied by pyramidal cells (P), large basket cells (L), and small basket cells (S), respectively, and described by

\[
I_{\mathrm{AMPA}} = \hat{g}_{\mathrm{AMPA}} \left( u_{\mathrm{AMPA}} - u_i^P(t) \right) \sum_{j=1}^{N_{LB}} w_{ij}^{P,P}\, r_j^P(t), \tag{2.4}
\]
\[
I_{\mathrm{GABA}_L} = \hat{g}_{\mathrm{GABA}} \left( u_{\mathrm{GABA}} - u_i^P(t) \right) \sum_{j=1}^{N_{LB}} w_{ij}^{P,L}\, r_j^L(t), \tag{2.5}
\]
\[
I_{\mathrm{GABA}_S} = \hat{g}_{\mathrm{GABA}} \left( u_{\mathrm{GABA}} - u_i^P(t) \right) w_i^{P,S}\, r_i^S(t). \tag{2.6}
\]

I_inp is described by

\[
I_{\mathrm{inp}} = \tilde{g}_{\mathrm{AMPA}} \left( u_{\mathrm{AMPA}} - u_i^P(t) \right) \sum_{j=1}^{N_{AI}} L_{ij}^{LB,AI}\, r_j^{PN}(t). \tag{2.7}
\]
The dynamic evolution of the membrane potentials of L and S cells is defined, respectively, by

\[
\tau_m^{L} \frac{du_i^{L}(t)}{dt} = -\left( u_i^{L}(t) - u_{\mathrm{rest}}^{L} \right) + \hat{g}_{\mathrm{AMPA}} \left( u_{\mathrm{AMPA}} - u_i^L(t) \right) w_i^{L,P}\, r_i^P(t), \tag{2.8}
\]
\[
\tau_m^{S} \frac{du_i^{S}(t)}{dt} = -\left( u_i^{S}(t) - u_{\mathrm{rest}}^{S} \right) + \hat{g}_{\mathrm{AMPA}} \left( u_{\mathrm{AMPA}} - u_i^S(t) \right) w_i^{S,P}\, r_i^P(t). \tag{2.9}
\]
u_i^Y(t) is the membrane potential of the ith Y cell (Y = PN, P, L, S) at time t, u_rest^Y is its resting potential, and τ_m^Y is a membrane time constant. ĝ_Z and u_Z (Z = AMPA or GABA) are, respectively, the maximal conductance and the reversal potential for the postsynaptic current mediated by the Z-type receptor. g̃_AMPA is the maximal conductance for synaptic connections from the AI to the LB network. I_i^PN(t) is an external input to the ith PN with intensity ε (a positive constant). N_LB and N_AI are the numbers of cell units of the LB and AI networks, respectively. w_ij^{X,Y} is a synaptic coupling constant from the jth Y cell to the ith X cell; w_i^{X,Y} is a synaptic coupling constant from cell Y to cell X of unit i. L_ij^{LB,AI} is a synaptic coupling constant for the feedforward connections from projection neurons (PNs) to pyramidal cells (Ps) and is defined by

\[
L_{ij}^{LB,AI} =
\begin{cases}
4.8 & \text{between BN}n\text{-sensitive PNs and Ps } (n = 1\text{--}4) \\
0.0 & \text{otherwise.}
\end{cases} \tag{2.10}
\]
Note that the BNn-sensitive PNs are five sets of projection neurons that have characteristic frequencies centered at CFn (see Figure 1A). r_j^Y(t) (Y = PN, P, L, S) denotes the fraction of Z-receptors (Z = AMPA, GABA_A) in the open state induced by presynaptic cell j and is described by Destexhe, Mainen, and Sejnowski (1998),

\[
\frac{dr_j^Y(t)}{dt} = \alpha_Y [T] \left( 1 - r_j^Y(t) \right) - \beta_Y\, r_j^Y(t), \tag{2.11}
\]
where α_Y and β_Y are positive constants, and [T] (T = Glut or GABA) is the glutamate or GABA concentration in the synaptic cleft. [T] = T_max for 1 ms when presynaptic cell j fires, and [T] = 0 otherwise. The probability of firing of the jth cell is defined by

\[
\mathrm{Prob}[\,j; \mathrm{firing}\,] = \frac{1}{1 + e^{-\eta_Y (u_j^Y - \theta_Y)}}, \tag{2.12}
\]
where η_Y and θ_Y are, respectively, the steepness and the threshold of the sigmoid function. After firing, the membrane potential is reset to the resting potential.
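To make the update scheme explicit, here is a minimal sketch of equations 2.11 and 2.12 together with the membrane dynamics for a single pyramidal-small-basket pair (ours, not the authors' code; we assume a 1 ms Euler step, which the text does not specify, and omit the recurrent AMPA, lateral GABA_L, and AI input terms that a cell receives in the full network):

```python
# Minimal sketch of the model's update equations for one pyramidal cell (P)
# coupled to its small basket cell (S), following equations 2.2, 2.6, 2.9,
# 2.11, and 2.12. Illustration only: a 1 ms Euler step is assumed.
import math
import random

DT = 1.0                              # integration step in ms (assumed)
TAU = {"P": 20.0, "S": 10.0}          # membrane time constants (ms)
U_REST = {"P": -65.0, "S": -70.0}     # resting potentials (mV)
G_AMPA, G_GABA = 0.7, 0.5             # maximal conductances (nS)
U_AMPA, U_GABA = 0.0, -80.0           # reversal potentials (mV)
W_PS, W_SP = 40.0, 20.0               # couplings w^{P,S} and w^{S,P}
# Receptor rates converted to 1/(mM*ms) and 1/ms from the stated values:
ALPHA = {"P": 1.1, "S": 0.072}        # 1.1e6 and 7.2e4 per (M*s)
BETA = {"P": 0.19, "S": 0.0066}       # 190 and 6.6 per s
T_MAX = 1.0                           # transmitter pulse (mM), 1 ms long
ETA = {"P": 10.2, "S": 21.0}          # sigmoid steepness
THETA = {"P": 0.1, "S": 0.3}          # sigmoid threshold

u = dict(U_REST)                      # membrane potentials
r = {"P": 0.0, "S": 0.0}              # open-receptor fractions, eq. 2.11
fired = {"P": False, "S": False}

def sigmoid(x):
    # numerically safe logistic function for equation 2.12
    return 0.0 if x < -60 else (1.0 if x > 60 else 1 / (1 + math.exp(-x)))

def step():
    for y in ("P", "S"):              # receptor kinetics, equation 2.11
        t = T_MAX if fired[y] else 0.0
        r[y] += DT * (ALPHA[y] * t * (1 - r[y]) - BETA[y] * r[y])
    # Synaptic currents: S inhibits P (eq. 2.6), P excites S (eq. 2.9).
    i_p = G_GABA * (U_GABA - u["P"]) * W_PS * r["S"]
    i_s = G_AMPA * (U_AMPA - u["S"]) * W_SP * r["P"]
    u["P"] += DT / TAU["P"] * (-(u["P"] - U_REST["P"]) + i_p)
    u["S"] += DT / TAU["S"] * (-(u["S"] - U_REST["S"]) + i_s)
    for y in ("P", "S"):              # stochastic spiking, equation 2.12
        fired[y] = random.random() < sigmoid(ETA[y] * (u[y] - THETA[y]))
        if fired[y]:
            u[y] = U_REST[y]          # reset to rest after a spike

for _ in range(1000):                 # simulate one second
    step()
```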
Other possible postsynaptic currents are those mediated by NMDA and GABA_B receptors, which we neglect here for simplicity.

Unless stated otherwise, N_AI = 200 and N_LB = 40, where each neuronal column (cell assembly) consists of 10 projection neurons and 10 cell units, respectively. τ_m^PN = τ_m^P = 20 ms, τ_m^L = 50 ms, τ_m^S = 10 ms, u_rest^PN = u_rest^P = −65 mV, and u_rest^L = u_rest^S = −70 mV (Koch, 1999; McCormick, Connors, Lighthall, & Prince, 1985; Kawaguchi & Shindou, 1998). ĝ_AMPA = g̃_AMPA = 0.7 nS, ĝ_GABA = 0.5 nS, u_AMPA = 0 mV, and u_GABA = −80 mV. α_PN = α_P = 1.1 × 10^6 M⁻¹ s⁻¹, β_PN = β_P = 190 s⁻¹, α_L = α_S = 7.2 × 10^4 M⁻¹ s⁻¹, β_L = β_S = 6.6 s⁻¹, and T_max = 1 mM (Destexhe et al., 1998). η_PN = η_P = 10.2, η_L = η_S = 21.0, θ_PN = θ_P = 0.1, and θ_L = θ_S = 0.3. w_ij^{P,P} = 1.0 within cell assemblies and w_ij^{P,P} = 2.0 between cell assemblies; w_ij^{P,L} = 2.0, w_i^{P,S} = 40.0, w_i^{L,P} = 20.0, and w_i^{S,P} = 20.0. When the AI network is stimulated with a given bandpassed noise, the projection neurons corresponding to the stimulus receive an input I_i^PN(t) = ε with ε = 0.2, and the other projection neurons, which are irrelevant to the stimulus, receive no input: I_i^PN(t) = 0.0.

3 Results

3.1 Ongoing and Cognitive Behaviors. In this section, we show the basic neural behavior in an ongoing (spontaneous) state established in the LB network dynamics and the cognitive behavior of the network when stimulated with bandpassed noises.

Figure 2A (top) shows the raster plots for the four cell assemblies that have tuning preferences to specific bandpassed noises (BN1–4). Without any sensory stimulation, the pyramidal cells of the LB network show ongoing (spontaneous) activity, emitting action potentials in a random manner. During stimulation, the pyramidal cells corresponding to the input (BN2) show sustained discharges throughout the stimulation period (see the horizontal bar). Note that the other (stimulus-irrelevant) pyramidal cells (BN1, BN3, BN4) exhibit widespread-onset discharges; these are transient and fade quickly. If the lateral excitatory connections between cell assemblies are not included in the LB network, the onset responses disappear, as shown in Figure 2B. These results indicate that the lateral excitatory connections between cell assemblies might be crucial for the organization of widespread-onset neuronal responses within intracortical circuitry.

To investigate whether and how the widespread-onset responses influence the cognitive performance of the network, we changed the threshold values of the onset-responsive pyramidal cells (θ_P; see equation 2.12) during the stimulation period and measured reaction time. The reaction time is the response latency to stimulus onset. Figure 2C presents the reaction time of one of the sustained-responsive pyramidal cells (the circles in the figure) and the number of spikes (onset responsiveness; OR) of one of the
Figure 2: Ongoing and cognitive behaviors of the lateral belt (LB) network. (A) Top: Raster plots for the cell assemblies that have tuning preferences to specific bandpassed noises, BN1–4. Bottom: The number of total spikes of the stimulus-relevant (BN2) pyramidal cells (dotted line) and that of the stimulus-irrelevant (BN1, BN3, or BN4) pyramidal cells (solid line), counted in a time bin (10 ms). The horizontal bar indicates the stimulus presentation period. (B) Raster plots for the cell assemblies without lateral excitatory connections, where widespread-onset neuronal responses do not take place. (C) Reaction time (circles) of a stimulus-relevant pyramidal cell and onset responsiveness (OR) (triangles) of a stimulus-irrelevant pyramidal cell as a function of the threshold value for the stimulus-irrelevant (onset-responsive) pyramidal cells (θ_P; see equation 2.12). For details, see the text.
onset-responsive pyramidal cells (the triangles in the figure) as a function of the threshold value. The results indicate that a decrease in the threshold results in accelerated reaction times and enhanced onset responses. This implies that the widespread-onset responses might be essential for accelerating the reaction time to sensory stimulation, which the further examinations in sections 3.2 and 3.3 will support.

Note that the reaction time is directly affected by the threshold value for the sustained-responsive pyramidal cells (BN2), which would make it difficult to elucidate how the onset responsiveness affects reaction time. To avoid this problem, we changed the threshold value only for the onset-responsive pyramidal cells (BN1, BN3, BN4), not that for the sustained-responsive pyramidal cells (BN2). The threshold value also alters the ongoing behavior of pyramidal cells, which is supposed to influence reaction time as well. To avoid this problem, we changed the threshold value only during the stimulation period, not during the ongoing period. With these parameter settings, we were able to isolate the exact influence of widespread-onset responses on reaction time.

The acceleration of reaction time shown in Figure 2C might reflect an enhancement of cognitive performance or of sound perception. In many psychological experiments, reaction time is frequently employed to assess the cognitive performance of humans. For example, when the attention of subjects with parietal lesions was disengaged from a visual target, the subjects showed a slowed reaction time to the target (Rafal & Robertson, 1995). In that task, the subject was asked to respond to the appearance of a visual target by pressing a key. Reaction time was measured as the time period between the stimulus onset and the key-pressing onset. In another experiment, Salthouse (1996) demonstrated evidence of age-related declines in cognitive task performance (e.g., letter comparison and pattern comparison), in which reaction time as task speed slowed from the 20s toward old age. In general, reaction time is a combination of neural speed and motor speed (Maylor & Perfect, 2000). In this respect, reaction time here was measured in terms of neural speed, not motor speed or their combination.
3.2 Lateral Synaptic Interaction Among Dissimilar Cell Assemblies. In this section, we show how lateral synaptic interaction among dissimilar dynamic cell assemblies could contribute to the organization of widespread-onset responses and the acceleration of reaction time to sensory stimulation. We also show how the lateral synaptic interaction could regulate the ongoing state of the LB network and influence subsequent cognitive information processing.

Figure 3A presents the dependence of the onset responsiveness (OR) of a pyramidal cell on the lateral excitatory circuitry. A stimulus was presented to
a pyramidal cell, and its firing rate was measured and plotted as functions of the strength of lateral excitatory connections (LEC) and the fraction of lateral excitatory connection areas between cell assemblies (LECA) (see Figure 1A, filled circles). A fraction of LECA = 0% denotes that there is no lateral excitatory connection between cell assemblies, and LECA = 100% denotes that all the pyramidal cells have lateral excitatory connections with each other. The onset responsiveness tends to be enhanced as the level of lateral excitation increases, reaching a maximum at about 40% LECA.

We then investigated how the widespread-onset responses influence reaction time. A mean reaction time (RT) was calculated for a pyramidal cell over repeated (100) trials, in which the same stimulus was presented at an arbitrary time between 3 and 8 sec. Figure 3B presents the reaction time in a reversed fashion (1/RT) as functions of the strength of lateral excitatory connections (LEC) and the fraction of lateral excitatory connection areas (LECA). The reaction time tends to be accelerated (1/RT increased) as the level of lateral excitation increases. This tendency is quite similar to that of the onset responsiveness (OR) (see Figure 3A), which indicates that the widespread-onset responses may have a close relation to the acceleration of reaction time. This is confirmed quantitatively in Figure 3E.

Since the ongoing network state, as a ready state for sensory input, is supposed to have a great impact on subsequent neuronal information processing in the cortex (see section 1), we speculated that the reaction time could be influenced not only by the widespread-onset responses but also by the ongoing network behavior. Figure 3C presents the relationship between the reaction time of a pyramidal cell and its ongoing membrane potential just before stimulus onset. As expected, the greater the membrane depolarization, the shorter the reaction time. Figure 3D presents the time-averaged (10 sec) ongoing membrane potential (AOMP) of a pyramidal cell as functions of the strength of LEC and the fraction of LECAs. The overall membrane depolarization implies that the ongoing network state is near threshold for action potential generation, whereby the pyramidal cells can respond rapidly to sensory stimulation. This tendency might be reflected in the reaction time (see Figure 3B) and suggests the important notion that the acceleration of reaction time can be facilitated by an ongoing subthreshold network state.

High values of the correlation coefficient (r; see Figure 3E) indicate that the onset responsiveness (circles in Figure 3E; r = 0.912) and the ongoing membrane potential (triangles in Figure 3E; r = 0.927) have a close relationship with reaction time. Dynamic interaction among individual cell assemblies via lateral excitatory synapses might therefore be crucial for generating the widespread-onset neuronal responses and establishing the ongoing subthreshold network state, by which the reaction time could be accelerated. Note that excessive lateral excitation results in the deterioration of
selective responsiveness (SR), as shown in Figure 3F. SR is defined as the ratio of the firing rate evoked by a relevant stimulus to that evoked by an irrelevant stimulus, where the initial onset-responding periods were discarded. Excessive lateral excitation lets pyramidal cells behave as if they were members of one large dynamic cell assembly comprising the individual cell assemblies (BN1–4). In such a situation, sustained discharges activate other (e.g., onset-responsive) pyramidal cells even after their onset-responding periods, thereby deteriorating the selective responsiveness.

To understand in more detail the relevance of the ongoing subthreshold network state to the acceleration of reaction time, we calculated the probability density of reaction times. As shown in Figure 4A, the probability density of reaction times is greatly influenced by the extent of LECAs. Without lateral excitatory connections (top; LECA = 0%), the distribution of reaction times peaks at about 30 ms. As the fraction of LECA increases, the peak shifts to a shorter reaction time and reaches a minimum (about 25 ms) (middle; LECA = 40%), and then reverts to a longer reaction time (about 35 ms) (bottom; LECA = 80%). These results, together with the cumulative probabilities of reaction times (see Figure 4B), indicate that the overall reaction time could be maximally accelerated at an intermediate extent of lateral excitation (40%; the triangles in Figure 4B). Note that the depolarization of ongoing membrane potentials (AOMP = −53.7 mV; Figure 4A, middle) could occur provided that the extent of lateral excitation is intermediate.

To elucidate the underlying neuronal mechanisms, we analyzed the neuronal and synaptic behaviors of the LB network. Figure 5 presents the ongoing membrane potential of a pyramidal cell (top); the spikes of small basket cells (S), large basket cells (L), and pyramidal cells (P) that belong to the same cell assembly (middle); and the total postsynaptic current into the pyramidal cell (bottom), where the fraction of LECAs is 0% (see Figure 5A), 40% (see Figure 5B), or 80% (see Figure 5C).
Figure 3: Dependence of neuronal behaviors on the strength of lateral excitatory connections (LEC) and the fraction of lateral excitatory connection areas between cell assemblies (LECA). (A) Dependence of the onset responsiveness (OR) of a pyramidal cell. (B) Dependence of the reaction times (RT) (expressed as 1/RT) of a pyramidal cell. (C) Reaction time versus ongoing membrane potential of the pyramidal cell shown in B just before the stimulus onset. Circles, triangles, and squares denote, respectively, that the fractions of LECA are 0%, 40%, and 80%. The strength of LEC is 2.0. (D) Time-averaged (10 sec period) ongoing membrane potential (AOMP) of the pyramidal cell shown in B. (E) Correlation coefficient (r) between the onset responsiveness and the reaction times (circles; r = 0.912) and that between the ongoing membrane potentials and the reaction times (triangles; r = 0.927). (F) Selective responsiveness (SR) of the pyramidal cell shown in B. For details, see the text.
Figure 4: Distributions of reaction times. (A) Probability densities of reaction times, where the fraction of lateral excitatory connection areas (LECA) is 0% (top), 40% (middle), or 80% (bottom). Time-averaged ongoing membrane potential (AOMP) of the pyramidal cell is given. (B) Cumulative probabilities of the reaction times shown in A. Circles, triangles, and squares denote that the fractions of LECA are 0%, 40%, and 80%, respectively.
If the fraction of LECA is large (see Figure 5C, top), the pyramidal cell tends to be frequently hyperpolarized. The membrane hyperpolarization arises from an increase in outward postsynaptic current, $I_{\mathrm{syn}} < 0$ (see Figure 5C, bottom), supplied by the active small basket cells (S) and large basket cells (L) (see Figure 5C, middle). Note that the large basket cells show weaker responses to brief bursts of pyramidal cells because of their longer membrane time constant ($\tau_{mL} = 50$ ms), whereas the small basket cells show stronger responses because of their shorter time constant ($\tau_{mS} = 10$ ms). As shown in Figure 5B (bottom), the overall positive postsynaptic current into the pyramidal cell ($I_{\mathrm{syn}} > 0$) keeps it near threshold, resulting in an ongoing subthreshold cortical state (see Figure 4A, middle).
Figure 5: Ongoing membrane potentials, action potentials, and postsynaptic currents. The fraction of lateral excitatory connection areas (LECA) is 0% (A), 40% (B), or 80% (C). Membrane potential of a pyramidal cell (top); raster plots for small basket cells (S), large basket cells (L), and pyramidal cells (P) (middle); and total postsynaptic current ($I_{\mathrm{syn}}$) into the pyramidal cell (bottom). These cells belong to the same cell assembly.
3.3 Time Windows for Widespread-Onset Responses. In this section, we show how the membrane time constants of pyramidal cells, small basket cells, and large basket cells influence the widespread-onset neuronal responses, the ongoing subthreshold network state, and the reaction time. As shown in Figure 6A (top), the onset response of a pyramidal cell disappears if the membrane time constant of the large basket cells becomes short ($\tau_{mL} = 10$ ms). A longer time constant ($\tau_{mL} = 100$ ms) effectively contributes to the organization of an onset response (see Figure 6B, top).
Figure 6: Membrane and postsynaptic behaviors around an onset response. The membrane time constant of large basket cells ($\tau_{mL}$) is 10 ms (A) or 100 ms (B). From top to bottom, mem. pot., $I_{\mathrm{AMPA}}$, $I_{\mathrm{GABA_L}}$, $I_{\mathrm{GABA_S}}$, and $I_{\mathrm{syn}}$ denote, respectively, the membrane potential of a pyramidal cell, the excitatory postsynaptic current, the lateral inhibitory postsynaptic current, the feedback inhibitory postsynaptic current, and the total postsynaptic current into the pyramidal cell. Note that input current from the AI network is excluded, and that ∼10 ms and ∼70 ms denote the time windows for the onset responses.
To elucidate how the onset response is organized, we measured the excitatory postsynaptic current ($I_{\mathrm{AMPA}}$), the lateral inhibitory postsynaptic current ($I_{\mathrm{GABA_L}}$), the feedback inhibitory postsynaptic current ($I_{\mathrm{GABA_S}}$), and the total postsynaptic current ($I_{\mathrm{syn}}$) into the pyramidal cell (see equations 2.3–2.6). If the onset of $I_{\mathrm{AMPA}}$ precedes that of $I_{\mathrm{GABA_L}}$ (see Figure 6B; a window of approximately 70 ms), a time window for the onset response is formed. If the time window is too narrow (see Figure 6A; approximately 10 ms) because of the shorter time constant ($\tau_{mL} = 10$ ms), the onset response does not take place.
Figure 7: Time windows for onset responses. (A) Dependence of the time window (TW; circles) and the onset responsiveness (OR; triangles) on the membrane time constant of large basket cells ($\tau_{mL}$). Onset responsiveness is the number of spikes emitted by a pyramidal cell during an onset-responding period. (B) Dependence of the time window and the onset responsiveness on the membrane time constant of pyramidal cells ($\tau_{mP}$). Note that $\tau_{mP}$ must be short; otherwise, the pyramidal cell cannot respond in an onset manner (triangles) even though the time window for the onset response is preserved (circles).
These results indicate that the large basket cells may play an essential role in creating the time window within which onset neuronal responses are allowed. Figure 7A presents the dependence of the time window (TW) on the membrane time constant of large basket cells ($\tau_{mL}$). The time window expands (circles in Figure 7A), and the onset responsiveness (OR) of a pyramidal cell increases (triangles), as $\tau_{mL}$ increases. As shown in Figure 7B, wide time windows are preserved independent of the time constant of pyramidal cells ($\tau_{mP}$) (circles). However, the time constant does affect the onset responsiveness: the number of onset spikes emitted by the pyramidal cell decreases progressively as $\tau_{mP}$ increases (triangles in Figure 7B). This result indicates that the fast-responsive nature of pyramidal cells might be essential for them to respond within the time window.
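The role of the long membrane time constant can be illustrated with a minimal leaky-integrator sketch. This is our simplification, with unit input and Euler integration (the full model equations are those of section 2): a 50 ms membrane barely integrates a 10 ms burst, so lateral inhibition lags lateral excitation and a time window opens.

```python
import numpy as np

dt = 1.0                                        # ms
t = np.arange(0.0, 200.0, dt)
burst = ((t >= 20) & (t < 30)).astype(float)    # brief 10 ms presynaptic burst

def leaky_response(tau_m):
    """Membrane response v(t) of a leaky integrator dv/dt = (-v + I)/tau_m."""
    v = np.zeros_like(t)
    for i in range(1, len(t)):
        v[i] = v[i - 1] + dt * (-v[i - 1] + burst[i - 1]) / tau_m
    return v

v_small = leaky_response(10.0)   # small basket cell: fast, strong response
v_large = leaky_response(50.0)   # large basket cell: slow, weak response
print(v_small.max(), v_large.max())  # ~0.63 vs. ~0.18: the slow cell lags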
Figure 8: Dependence of onset responsiveness, reaction time, ongoing membrane potential, and selective responsiveness on the membrane time constants of pyramidal cells ($\tau_{mP}$) and large basket cells ($\tau_{mL}$). (A) Onset responsiveness, (B) reaction times (RT), expressed as 1/RT, (C) time-averaged (10 sec period) ongoing membrane potential (AOMP), and (D) selective responsiveness (SR) of a pyramidal cell.
These results imply that a combination of slow-responding large basket cells and fast-responding pyramidal cells might be essential for the organization of widespread-onset responses. Figure 8A supports this notion: the longer the time constant of the large basket cells and the shorter the time constant of the pyramidal cells, the greater the onset responsiveness. The reaction time (see Figure 8B) and the ongoing membrane potential (see Figure 8C) exhibit a similar tendency. Note that a further increase in the time constant of the large basket cells ($\tau_{mL}$) is advantageous for depolarizing the pyramidal cells, but it must be limited; otherwise, the selective responsiveness of the pyramidal cells deteriorates because of the depressed lateral inhibition between dynamic cell assemblies (see Figure 8D).
In this study, we assumed two distinct types of inhibitory interneurons: small basket cells and large basket cells, with shorter ($\tau_{mS} \approx 10$ ms) and longer ($\tau_{mL} \approx 50$ ms) membrane time constants, respectively. As shown in Figure 9A, the shorter time constant of the small basket cells ($\tau_{mS} \approx 10$ ms) allows the pyramidal cells to stay depolarized during an ongoing period. In contrast, if the time constant is too short (below 3 ms) or too long (above 50 ms), the pyramidal cells tend to be hyperpolarized, and two distinct inhibitory input currents are responsible. As shown in Figure 9B (left), the hyperpolarization caused by a time constant that is too short arises largely from the enhanced feedback inhibitory postsynaptic current into the pyramidal cells ($I_{\mathrm{GABA_S}}$). Because of the very short membrane time constant, the small basket cells are activated almost simultaneously with the bursts of pyramidal cells, even brief ones (top). The other inhibitory input current, the lateral inhibitory postsynaptic current $I_{\mathrm{GABA_L}}$, is unlikely to be supplied to the pyramidal cells, because the large basket cells responsible for $I_{\mathrm{GABA_L}}$ are barely activated by such brief bursts of pyramidal cells. In contrast, as shown in Figure 9B (right), the hyperpolarization caused by a longer time constant ($\tau_{mS} = 50$ ms) arises largely from the enhanced lateral inhibitory postsynaptic current ($I_{\mathrm{GABA_L}}$). The depressed activity of the small basket cells due to their longer membrane time constant, clearly reflected in the reduced feedback inhibitory current ($I_{\mathrm{GABA_S}}$), allows the pyramidal cells to generate longer bursts (top). The extended burst durations are long enough to activate the large basket cells, which then supply sufficient lateral inhibitory postsynaptic current to the pyramidal cells ($I_{\mathrm{GABA_L}}$) and thus lead to the membrane hyperpolarization.
4 Discussion

In this study, we showed that widespread-onset neuronal responses to sound stimuli can arise as the transient population activation of multiple dynamic cell assemblies in the network, for which lateral synaptic interaction among cell assemblies is crucial. The widespread-onset responses took place if the lateral excitation preceded the lateral inhibition, forming a time window for the onset responses (see Figure 6), for which the slow-responding nature of the large basket cells was essential. We also showed evidence that the widespread-onset neuronal responses can contribute to accelerating the reaction time of pyramidal cells to sensory stimulation, provided that the level of lateral excitation among dissimilar cell assemblies is moderate. This finding supports the speculation raised by Wang and colleagues (2005) that widespread-onset neuronal responses might allow the auditory cortex to quickly detect the occurrence of sounds, albeit less discriminatively. Note that we confirmed such less discriminative neuronal responses as well (see Figure 3F).
Figure 9: Dependence of ongoing membrane and postsynaptic behaviors on the membrane time constant of small basket cells ($\tau_{mS}$). (A) Time-averaged (10 sec) ongoing membrane potential of a pyramidal cell (AOMP) as a function of $\tau_{mS}$. (B) Membrane potential and postsynaptic currents for $\tau_{mS}$ = 3 ms (left) and 50 ms (right). From top to bottom, mem. pot., $I_{\mathrm{AMPA}}$, $I_{\mathrm{GABA_L}}$, $I_{\mathrm{GABA_S}}$, and $I_{\mathrm{syn}}$ denote, respectively, the membrane potential of a pyramidal cell, the excitatory postsynaptic current, the lateral inhibitory postsynaptic current, the feedback inhibitory postsynaptic current, and the total postsynaptic current into the pyramidal cell.
Wang and colleagues (2005) speculated that there is another role of onset discharges in auditory information processing: the timing of onset discharges could provide the auditory cortex with precise timing information to mark the onset of a sound or transitions in an ongoing acoustic stream. We consider that onset discharges could also be used to encode individual auditory sounds, because onset discharges are as strong as sustained discharges. A combination of widespread-onset discharges and restricted sustained discharges may thus have a role in expressing sound information in the auditory cortex, and the brain may use such a spatiotemporal representation to encode auditory information. We hope to address these details in future work.

We showed that ongoing neuronal activity is important for accelerating reaction time. That is, if the ongoing network state was kept at a subthreshold level, or if the membrane potentials of pyramidal cells oscillated near the threshold for action potential generation, the network could respond quickly to sensory stimulation. We suggest that the widespread-onset neuronal responses and the ongoing subthreshold network state may work together to accelerate reaction time.

The networks corresponded to the primary auditory cortex (AI) and the lateral belt of auditory cortex (LB). In the LB, bandpassed noises with defined center frequencies evoke greater responses than pure tones (Rauschecker et al., 1995; Rauschecker, 1998; Kaas, Hackett, & Tramo, 1999); the best center frequencies of LB neurons vary along a rostrocaudal axis, and the best bandwidths vary along a mediolateral axis. The present LB neurons (see Figure 1A) prefer a certain bandpassed noise (BNn; n = 1, 2, 3, or 4) with a defined center frequency (CFn). In our model, the best center frequencies of LB pyramidal cells varied along a single (frequency) axis, which is simple but accords in part with these experimental observations. To make LB neurons selective to specific bandpassed noises, we assumed convergent excitatory inputs from the AI to the LB network (see Figure 1A). Recent studies have revealed hierarchical processing between the LB and core areas such as the AI and the rostral area (Kaas et al., 1999; Wessinger et al., 2001): the core receives direct input from the thalamus and in turn provides input to the LB. Kaas and colleagues (1999) suggested that convergent inputs onto each LB neuron from core neurons sensitive to adjacent frequencies might be essential for integrating basic auditory features (tones) into more complex preferences such as bandpassed noises. The LB's preference for specific bandpassed noises in our model appears simple but might accord in part with these experimental observations and the suggestions derived from them.

In this study, we assumed two distinct types of interneurons: small basket cells and large basket cells. In the cortex, about 50% of all inhibitory interneurons are basket cells (Markram et al., 2004). Basket cells are known to synapse selectively on pyramidal cells (Somogyi, Kisvarday, Martin, &
Whitteridge, 1983), which are believed to play major roles in neural information processing in the brain. Basket cells are diverse, especially in their axonal arborizations, and are classified roughly into two subclasses, large basket cells and small basket cells (Wang, Gupta, Toledo-Rodriguez, Wu, & Markram, 2002), which may correspond to wide arbor basket cells and local arbor basket cells, respectively (Krimer & Goldman-Rakic, 2001). Large basket cells and small basket cells have, respectively, aspiny wide (up to about 1000 µm) and narrow (up to about 300 µm) axonal arbors that tend to synapse on the somata of target cells. Nest basket cells (Wang et al., 2002), or medium arbor basket cells (Krimer & Goldman-Rakic, 2001), could be placed in a third subclass, with axonal arborizations in a medium range of about 500 µm. Such differences in axonal arbor length have a great impact on cortical neuronal information processing (Krimer & Goldman-Rakic, 2001).

Researchers (Sillito, 1984; Eysel, 1992) have suggested that lateral inhibition between pyramidal cells, mediated by interneurons, plays an important role in neuronal information processing in the brain, such as the tuning of excitatory neurons to sound frequency in the primary auditory cortex (Yost, 1994), and to orientation or direction (Eysel, Shevelev, Lazareva, & Sharaev, 1998) and cross-orientation inhibition (Kisvarday, Kim, Eysel, & Bonhoeffer, 1994) in the primary visual cortex. Goldman-Rakic and colleagues (Wilson, O'Scalaidhe, & Goldman-Rakic, 1994; Rao, Williams, & Goldman-Rakic, 1999, 2000) demonstrated that even in higher cortical areas (e.g., the prefrontal cortex), lateral inhibitory mechanisms operate in shaping the memory fields of neurons when monkeys are engaged in working memory tasks. Large basket cells may mediate such a lateral inhibitory effect through their wide axonal arbors, which is what we assumed in the study presented here.

It is also well known that excitatory and inhibitory neurons frequently form reciprocal synaptic connections (Martin, 2002). Zilberter (2000) reported that 75% (n = 80) of pairs of a pyramidal cell and a fast-spiking interneuron form reciprocal synaptic connections. The fast-spiking interneurons might be small basket cells or chandelier cells (Krimer & Goldman-Rakic, 2001). Such reciprocal connections are likely to mediate the feedback inhibition of pyramidal cells when the activity of the pyramidal cells increases. When an excitatory input current was applied to the soma, small basket cells generated spikes at a higher firing rate, while large basket cells fired at a lower rate (Krimer & Goldman-Rakic, 2001). The shorter and longer membrane time constants of the small basket cells and large basket cells assumed in this study were employed as a simple functional means of producing such fast-spiking small basket cells and slow-spiking large basket cells, respectively.

In this model, lateral inhibitory pathways were formed by connecting pyramidal cells to the large basket cells belonging to the other cell assemblies. Such pathways could also be constructed in another plausible manner, by connecting pyramidal cells directly to the small basket cells belonging to the other cell assemblies.
However, this circuitry resulted in the disappearance of the widespread responses and in a strongly hyperpolarized ongoing network state (not shown). Lateral inhibition mediated by small basket cells instead of large basket cells corresponds in part to shortening the time constant of the large basket cells to the value used for the small basket cells ($\tau_{mL} = 10$ ms $= \tau_{mS}$). As is evident from the result shown in Figure 7A (see $\tau_{mL} = 10$ ms), the onset responses then disappear because the time windows for onset responses become too narrow. Therefore, a combinatorial inhibition of pyramidal cells by large basket cells and small basket cells might be crucial for generating widespread-onset neuronal responses.

We showed that transient-onset neuronal responses can take place within the auditory cortex. However, such responses have also been observed in lower auditory areas such as the medial geniculate body (MGB) (Kvasnak, Popelar, & Syka, 2000; He, 2001, 2002; Cetas et al., 2002). Therefore, bottom-up onset signals from the MGB could influence the dynamic (onset or sustained) neuronal behavior in the auditory cortex. In such a situation, the onset-responsive cortical neurons can be activated in both an intracortical (lateral excitatory) and an intercortical (bottom-up) manner. In the light of this study, if the onset responses of BN1, 3, and 4 (see Figure 2) are externally induced by such bottom-up onset signals and followed by the internal (lateral excitatory) activation, the latency of the sustained responses of BN2 could be further reduced: because the bottom-up signals would activate the onset-responsive neurons (BN1, 3, 4), which would then depolarize the sustained-responsive neurons (BN2) through lateral excitation, the sustained-responsive neurons could be rapidly activated (fire) by the stimulus BN2. This hypothesis will be tested with a new model in which the bottom-up onset signals from the MGB are provided to the auditory cortex.

In addition to the onset responses, transient-offset neuronal responses to an auditory stimulus have also been observed in the auditory cortex (Chimoto, Kitama, Qin, Sakayori, & Sato, 2002; Mickey & Middlebrooks, 2003; Qin, Kitama, Chimoto, Sakayori, & Sato, 2003) and in the MGB (He, 2001, 2002; Cetas et al., 2002), and these may have an impact on the neuronal processing of the current or a subsequent auditory input. The next step of our study is to investigate how the transient onset and offset signals from the MGB influence the dynamic neuronal behavior in the auditory cortex and thus its cognitive performance.

Acknowledgments

I am grateful to Hiromi Ohta for her encouragement throughout the study. I am also grateful to the anonymous reviewers for their valuable comments and suggestions on the earlier draft.
References

Ajima, A., & Tanaka, S. (2006). Spatial patterns of excitation and inhibition evoked by lateral connectivity in layer 2/3 of rat barrel cortex. Cereb. Cortex, 16, 1202–1211.
Arieli, A., Shoham, D., Hildesheim, R., & Grinvald, A. (1995). Coherent spatiotemporal patterns of ongoing activity revealed by real-time optical imaging coupled with single-unit recording in the cat visual cortex. J. Neurophysiol., 73, 2072–2093.
Arieli, A., Sterkin, A., Grinvald, A., & Aertsen, A. (1996). Dynamics of ongoing activity: Explanation of the large variability in evoked cortical responses. Science, 273, 1868–1871.
Cetas, J. S., Price, R. O., Velenovsky, D. S., Crowe, J. J., Sinex, D. G., & McMullen, N. T. (2002). Cell types and response properties of neurons in the ventral division of the medial geniculate body of the rabbit. J. Comp. Neurol., 445, 78–96.
Chimoto, S., Kitama, T., Qin, L., Sakayori, S., & Sato, Y. (2002). Tonal response patterns of primary auditory cortex neurons in alert cats. Brain Res., 934, 34–42.
Das, A., & Gilbert, C. D. (1999). Topography of contextual modulations mediated by short-range interactions in primary visual cortex. Nature, 399, 655–661.
Destexhe, A., Mainen, Z. F., & Sejnowski, T. J. (1998). Kinetic models of synaptic transmission. In C. Koch & I. Segev (Eds.), Methods in neuronal modeling (pp. 1–25). Cambridge, MA: MIT Press.
DeWeese, M. R., Wehr, M., & Zador, A. M. (2003). Binary spiking in auditory cortex. J. Neurosci., 23, 7940–7949.
Engel, A. K., Fries, P., & Singer, W. (2001). Dynamic predictions: Oscillations and synchronization in top-down processing. Nature Rev. Neurosci., 2, 704–717.
Eysel, U. T. (1992). Lateral inhibitory interactions in areas 17 and 18 of the cat visual cortex. Prog. Brain Res., 90, 407–422.
Eysel, U. T. (1999). Turning a corner in vision research. Nature, 399, 641–644.
Eysel, U. T., Shevelev, I. A., Lazareva, N. A., & Sharaev, G. A. (1998). Orientation tuning and receptive field structure in cat striate neurons during local blockade of intracortical inhibition. Neuroscience, 84, 25–36.
He, J. (2001). On and off pathways segregated at the auditory thalamus of the guinea pig. J. Neurosci., 21, 8672–8679.
He, J. (2002). OFF responses in the auditory thalamus of the guinea pig. J. Neurophysiol., 88, 2377–2386.
Kaas, J. H., Hackett, T. A., & Tramo, M. J. (1999). Auditory processing in primate cerebral cortex. Curr. Opin. Neurobiol., 9, 164–170.
Kawaguchi, Y., & Shindou, T. (1998). Noradrenergic excitation and inhibition of GABAergic cell types in rat frontal cortex. J. Neurosci., 18, 6963–6976.
Kisvarday, Z. F., Kim, D. S., Eysel, U. T., & Bonhoeffer, T. (1994). Relationship between lateral inhibitory connections and the topography of the orientation map in cat visual cortex. Eur. J. Neurosci., 6, 1619–1632.
Koch, C. (1999). Biophysics of computation. New York: Oxford University Press.
Krimer, L. S., & Goldman-Rakic, P. S. (2001). Prefrontal microcircuits: Membrane properties and excitatory input of local, medium, and wide arbor interneurons. J. Neurosci., 21, 3788–3796.
Kvasnak, E., Popelar, J., & Syka, J. (2000). Discharge properties of neurons in subdivisions of the medial geniculate body of the guinea pig. Physiol. Res., 49, 369–378.
Markram, H., Toledo-Rodriguez, M., Wang, Y., Gupta, A., Silberberg, G., & Wu, C. (2004). Interneurons of the neocortical inhibitory system. Nature Rev. Neurosci., 5, 793–807.
Martin, K. A. C. (2002). Microcircuits in visual cortex. Curr. Opin. Neurobiol., 12, 418–425.
Maylor, E. A., & Perfect, T. J. (2000). Rejecting the dull hypothesis: The relation between method and theory in cognitive aging research. In E. A. Maylor & T. J. Perfect (Eds.), Models of cognitive aging (pp. 1–18). Cambridge, MA: MIT Press.
McCormick, D. A., Connors, B. W., Lighthall, J. W., & Prince, D. A. (1985). Comparative electrophysiology of pyramidal and sparsely spiny stellate neurons of the neocortex. J. Neurophysiol., 54, 782–806.
Mickey, B. J., & Middlebrooks, J. C. (2003). Representation of auditory space by cortical neurons in awake cats. J. Neurosci., 23, 8649–8663.
Qin, L., Kitama, T., Chimoto, S., Sakayori, S., & Sato, Y. (2003). Time course of tonal frequency-response-area of primary auditory cortex neurons in alert cats. Neurosci. Res., 46, 145–152.
Rafal, R., & Robertson, L. (1995). The neurobiology of visual attention. In M. S. Gazzaniga (Ed.), The cognitive neurosciences (pp. 625–648). Cambridge, MA: MIT Press.
Rao, S. G., Williams, G. V., & Goldman-Rakic, P. S. (1999). Isodirectional tuning of adjacent interneurons and pyramidal cells during working memory: Evidence for microcolumnar organization in PFC. J. Neurophysiol., 81, 1903–1916.
Rao, S. G., Williams, G. V., & Goldman-Rakic, P. S. (2000). Destruction and creation of spatial tuning by disinhibition: GABA(A) blockade of prefrontal cortical neurons engaged by working memory. J. Neurosci., 20, 485–494.
Rauschecker, J. P. (1998). Parallel processing in the auditory cortex of primates. Audiol. Neurootol., 3, 86–103.
Rauschecker, J. P., Tian, B., & Hauser, M. (1995). Processing of complex sounds in the macaque nonprimary auditory cortex. Science, 268, 111–114.
Recanzone, G. H. (2000). Response profiles of auditory cortical neurons to tones and noise in behaving macaque monkeys. Hear. Res., 150, 104–118.
Salthouse, T. A. (1996). The processing-speed theory of adult age differences in cognition. Psychol. Rev., 103, 403–428.
Sillito, A. M. (1984). Functional considerations of the operation of GABAergic inhibitory processes in the visual cortex. In A. Peters & E. G. Jones (Eds.), Cerebral cortex (pp. 91–117). New York: Plenum Press.
Somogyi, P., Kisvarday, Z. F., Martin, K. A. C., & Whitteridge, D. (1983). Synaptic connections of morphologically identified and physiologically characterized large basket cells in the striate cortex of cat. Neuroscience, 10, 261–294.
Wang, X., Lu, T., Snider, R. K., & Liang, L. (2005). Sustained firing in auditory cortex evoked by preferred stimuli. Nature, 435, 341–346.
Wang, Y., Gupta, A., Toledo-Rodriguez, M., Wu, C. Z., & Markram, H. (2002). Anatomical, physiological, molecular and circuit properties of nest basket cells in the developing somatosensory cortex. Cerebral Cortex, 12, 395–410.
Wessinger, C. M., VanMeter, J., Tian, B., Van Lare, J., Pekar, J., & Rauschecker, J. P. (2001). Hierarchical organization of the human auditory cortex revealed by functional magnetic resonance imaging. J. Cogn. Neurosci., 13, 1–7.
Wilson, F. A., O'Scalaidhe, S. P., & Goldman-Rakic, P. S. (1994). Functional synergism between putative gamma-aminobutyrate-containing neurons and pyramidal neurons in prefrontal cortex. Proc. Natl. Acad. Sci. USA, 91, 4009–4013.
Yost, W. A. (1994). Fundamentals of hearing. San Diego, CA: Academic Press.
Zilberter, Y. (2000). Dendritic release of glutamate suppresses synaptic inhibition of pyramidal neurons in rat neocortex. J. Physiol., 528(3), 489–496.
Received May 17, 2006; accepted September 26, 2006.
LETTER
Communicated by Robert A. Jacobs
Bayesian Inference Explains Perception of Unity and Ventriloquism Aftereffect: Identification of Common Sources of Audiovisual Stimuli Yoshiyuki Sato [email protected] Department of Complexity Science and Engineering, Graduate School of Frontier Sciences, University of Tokyo, Tokyo 277-8562, and Aihara Complexity Modeling Project, ERATO, Japan Science and Technology Agency, Tokyo 151-0064, Japan
Taro Toyoizumi [email protected] RIKEN Brain Science Institute, Saitama 351-0198, Japan, and Department of Complexity Science and Engineering, Graduate School of Frontier Sciences, University of Tokyo, Tokyo 277-8562, Japan
Kazuyuki Aihara [email protected] Institute of Industrial Science, University of Tokyo, Tokyo 153-8505, Japan; Aihara Complexity Modeling Project, ERATO, Japan Science and Technology Agency, Tokyo 151-0064, Japan, and Department of Complexity Science and Engineering, Graduate School of Frontier Sciences, University of Tokyo, Tokyo 277-8562, Japan
We study a computational model of audiovisual integration by positing a Bayesian observer that localizes visual and auditory stimuli without presuming the binding of audiovisual information. The observer adopts the maximum a posteriori approach to estimate the physically delivered position or timing of the presented stimuli, simultaneously judging whether or not they come from the same source. Several experimental results on the perception of spatial unity and the ventriloquism effect can be explained comprehensively if the subjects in the experiments are regarded as Bayesian observers who try to accurately locate the stimulus. Moreover, by adaptively changing the inner representation of the Bayesian observer through experience, we show that our model reproduces the perceived spatial frame shifts due to audiovisual adaptation known as the ventriloquism aftereffect.

Neural Computation 19, 3335–3355 (2007)

C 2007 Massachusetts Institute of Technology

1 Introduction

Audiovisual integration plays an important role in our perception of external events. When light and sound are emitted from an external event, we
take both of them into account to obtain as much information as possible, for example, about the location and the timing of the event. The well-known ventriloquism effect is a good example of audiovisual integration: an auditory stimulus is perceived to be biased toward a corresponding visual stimulus when the two are presented in temporal coincidence (Howard & Templeton, 1966). Even simple audiovisual stimuli, such as a small spot of light and a beep, are known to cause a ventriloquism effect (Alais & Burr, 2004; Bertelson & Radeau, 1981; Hairston et al., 2003; Slutsky & Recanzone, 2001; Wallace et al., 2004). In general, when conflicting multisensory stimuli are presented through two or more modalities, the modality having the best acuity influences the other modalities (the modality appropriateness hypothesis; Welch & Warren, 1980). According to this hypothesis, a visual stimulus influences the localization of an auditory stimulus because vision has a greater spatial resolution than hearing. In addition, it has been shown that tactile localization is also modified by a visual stimulus (Pavani, Spence, & Driver, 2000). In the case of temporal information, audition is more accurate than vision, and the perceived arrival time of a visual event can be attracted toward the sound arrival time (temporal ventriloquism) (Bertelson & Aschersleben, 2003; Morein-Zamir, Soto-Faraco, & Kingstone, 2003).

The ventriloquism effect can be understood as a near-optimal integration of conflicting auditory and visual stimuli. Alais and Burr (2004) computed a maximum a posteriori (MAP) estimate (equivalent, in their model, to a maximum likelihood estimate) of the location of the assumed common source of the audiovisual stimuli. They found that their MAP estimator predicted the experimental ventriloquism effect very well. However, binding audiovisual information is a good strategy only when the light and the sound have a common source. Binding visual and auditory stimuli even when they do not have a common source may impair our accurate perception of external events. Therefore, the judgment of the common source of audiovisual stimuli plays a crucial role in audiovisual integration. This idea has been termed the unity assumption (Welch & Warren, 1980) or pairing (Epstein, 1975). The judgment of whether the presented audiovisual stimuli share the same source depends on their spatial and temporal discrepancies (Lewald & Guski, 2003). This is reasonable, because visual and auditory stimuli that arrive close together in space and time are likely to originate from a common event. Indeed, the strength of the ventriloquism effect decreases as the spatial and temporal disparity increases (Wallace et al., 2004). Moreover, the ventriloquism effect can be increased by cognitively compelling the perception of a common cause of the auditory and visual events (Radeau, 1994). Putting these results together, the ventriloquism effect is strongly related to the perception of whether audiovisual signals are from the same source.

Several studies have modeled audiovisual interactions using Bayesian inference (Battaglia, Jacobs, & Aslin, 2003; Ernst & Banks, 2002; Witten & Knudsen, 2005). In these models, however, it is assumed that the auditory and visual stimuli are integrated from the beginning. In this letter, we set up a Bayesian observer that uses Bayesian inference to localize audiovisual stimuli without presuming the binding of the stimuli. Hence, the observer estimates the position or the onset time of the stimuli while judging whether the auditory and visual signals are from the same source. In particular, we try to reproduce the following experimental results related to the ventriloquism effect and the perception of unity. There is a strong correlation between the ventriloquism effect and the spatial unity perception, that is, the perception of whether audiovisual stimuli are from the same place (Wallace et al., 2004). Another finding is that the perception of spatial unity depends not only on the spatial discrepancy but also on the temporal discrepancy between visual and auditory stimuli (Slutsky & Recanzone, 2001). Conversely, the perception of simultaneity is modulated by the spatial discrepancy (Bertelson & Aschersleben, 2003; Lewald & Guski, 2003; Zampini, Guest, & Shore, 2005).

Another result that we wish to reproduce in this letter is the effect of adaptation on the perception of audiovisual stimuli. Audiovisual localization is known to exhibit an adaptation phenomenon. After prolonged exposure to simultaneous audiovisual stimuli with a spatial disparity, the disparity that most strongly produces the spatial unity perception shifts toward the presented disparity (Recanzone, 1998). The aftereffect can be observed not only in audiovisual localization but also in unimodal auditory localization. This effect is known as the ventriloquism aftereffect (Canon, 1970; Lewald, 2002; Radeau & Bertelson, 1974; Recanzone, 1998): the localization of an auditory stimulus shifts in the direction of the relative visual stimulus position used during the adapting period. Visual localization is also modified by the adapting stimuli, but the magnitude of this effect is smaller than for auditory localization (Radeau & Bertelson, 1974). This asymmetry between vision and audition is also believed to be due to the difference in spatial resolutions. We model this effect of adaptation by changing the model conditional distributions at each presentation of a pair of auditory and visual stimuli.

Combining the above factors, we show that our model reproduces mainly the following three experimental results:
- A strong relationship between the ventriloquism effect and spatial unity perception
- A temporal influence on spatial perception and a spatial influence on temporal perception
- Spatial audiovisual adaptation
In section 2, we introduce adaptive Bayesian inference and apply it to the estimation of audiovisual stimuli. In section 3, on the basis of numerical simulations, we show how our model reproduces experimental results on the ventriloquism effect, the perception of spatial unity, and the
ventriloquism aftereffect. Section 4 summarizes the results and discusses the theoretical and psychological plausibility of our model.

2 Modeling Audiovisual Interaction by Bayesian Inference

Bayesian models have proven to be among the most powerful methods for understanding human psychophysical experiments on multisensory integration (Deneve & Pouget, 2004; Knill & Pouget, 2004; Körding & Wolpert, 2004). This method provides an optimal means to combine different kinds of information arriving through different sensory pathways with different resolutions. Biologically plausible implementations of the method and their relations to the activity of neural circuits have also been investigated (Deneve, Latham, & Pouget, 2001; Rao, 2004). In the following section, we introduce a Bayesian model that estimates the location or the arrival time of audiovisual stimuli.

2.1 Bayesian Estimation Paradigm. For simplicity, we begin with a spatial task that does not have any temporal factors; temporal factors are taken into consideration later. The locations encoded in the early sensory systems are not exactly the same as the physically delivered ones, owing to sensory noise in the information-processing pathways. Given the noisy sensory input, we assume that the Bayesian observer estimates the true location by using Bayesian inference (Körding & Wolpert, 2004; Witten & Knudsen, 2005). Further, we include the inference of whether the audiovisual stimuli originate from the same event.

Because of the sensory noises in the auditory and visual information pathways, the perceived locations of the light and the sound, $x_V$ and $x_A$, respectively, are shifted from their original positions, $x_{Vp}$ (light) and $x_{Ap}$ (sound). The variable $\xi$ is a binary variable indicating whether the visual and auditory stimuli originate from the same source ($\xi = 1$) or not ($\xi = 0$). The Bayesian observer uses the maximum a posteriori (MAP) rule for inference; that is, the observer determines its estimate such that the posterior probability is maximized given the perceived locations. The posterior probability is calculated by integrating $P(x_{Vp}, x_{Ap}, \xi \mid x_V, x_A)$ with respect to the unestimated parameters; for example, when only the location of an auditory stimulus is to be estimated, its MAP estimator $\hat{x}_{Ap}$ is the one that maximizes the posterior probability $P(x_{Ap} \mid x_V, x_A)$, where the hat represents an estimator. From Bayes' formula, we obtain

$$P(x_{Vp}, x_{Ap}, \xi \mid x_V, x_A) = \frac{P(x_V, x_A \mid x_{Vp}, x_{Ap}, \xi)\, P(x_{Vp}, x_{Ap}, \xi)}{P(x_V, x_A)}, \qquad (2.1)$$
where $P(x_V, x_A \mid x_{Vp}, x_{Ap}, \xi)$ is the conditional probability that the perceived positions are $x_V$ and $x_A$ given the physical parameters $x_{Vp}$, $x_{Ap}$, and $\xi$. We assume that the sensory noises in the visual and auditory pathways are independent of each other and also independent of $\xi$. Hence, the first term on the right-hand side of equation 2.1 is given by

$$P(x_V, x_A \mid x_{Vp}, x_{Ap}, \xi) = P(x_V, x_A \mid x_{Vp}, x_{Ap}) = P(x_V \mid x_{Vp})\, P(x_A \mid x_{Ap}). \qquad (2.2)$$
The visual and auditory noises $P(x_V \mid x_{Vp})$ and $P(x_A \mid x_{Ap})$ are initially set according to gaussian distributions with mean zero, that is,

$$P(x_V \mid x_{Vp}) = \frac{1}{\sqrt{2\pi}\,\sigma_{sV}} \exp\left(-\frac{(x_V - x_{Vp})^2}{2\sigma_{sV}^2}\right), \qquad (2.3)$$

$$P(x_A \mid x_{Ap}) = \frac{1}{\sqrt{2\pi}\,\sigma_{sA}} \exp\left(-\frac{(x_A - x_{Ap})^2}{2\sigma_{sA}^2}\right), \qquad (2.4)$$

where $\sigma_{sV}$ and $\sigma_{sA}$ are the standard deviations of the visual and auditory noises and also correspond to the visual and auditory spatial resolutions. Table 1 lists all the parameters used in this letter and their values. In order to reproduce experimental results on the ventriloquism aftereffect, the noise distributions of equations 2.3 and 2.4 are subject to an adaptive update at each stimulus presentation; a detailed explanation of our model of adaptation is provided later.

We define $P(x_{Vp}, x_{Ap}, \xi)$ in equation 2.1 as a joint prior distribution of $x_{Vp}$, $x_{Ap}$, and $\xi$, which can be written as

$$P(x_{Vp}, x_{Ap}, \xi) = P(x_{Vp}, x_{Ap} \mid \xi)\, P(\xi). \qquad (2.5)$$
The locations of the audiovisual stimuli should be close if they originate from the same source; otherwise, they are uncorrelated. Further, we assume that the stimulus sources are distributed uniformly. We therefore model $P(x_{Vp}, x_{Ap} \mid \xi)$ as a gaussian distribution in the difference between the visual and auditory locations for $\xi = 1$, and as a constant distribution for $\xi = 0$, that is,

$$P(x_{Vp}, x_{Ap} \mid \xi) = \begin{cases} \dfrac{1}{\sqrt{2\pi}\,\sigma_{sp} L_s} \exp\left(-\dfrac{(x_{Vp} - x_{Ap})^2}{2\sigma_{sp}^2}\right) & (\xi = 1), \\[2mm] \dfrac{1}{L_s^2} & (\xi = 0), \end{cases} \qquad (2.6)$$

where $L_s$ is the length of the spatial integration range used for normalization and $\sigma_{sp}$ is the standard deviation of the distance between audiovisual stimuli when they share a common source. We take $L_s \gg \sigma_{sp}$ so as to eliminate the boundary effect.
Table 1: List of Parameters and Their Values.

Parameter        Value 1    Value 2    Value 3    Description

Spatial or temporal resolution:
$\sigma_{sV}$    2.5°       0.8°       0.8°       Spatial resolution of vision
$\sigma_{sA}$    8°         2.5°       2.5°       Spatial resolution of audition
$\sigma_{sp}$    1°         1°         1°         SD of the distribution of the distance between audiovisual stimuli originating from the same source
$\sigma_{tV}$    —          45 ms      —          Temporal resolution of vision
$\sigma_{tA}$    —          10 ms      —          Temporal resolution of audition
$\sigma_{tp}$    —          10 ms      —          SD of the distribution of the temporal separation between audiovisual stimuli originating from the same source

Spatial adaptation:
$\alpha_V$       —          —          0.003      Coefficient used in visual adaptation
$\alpha_A$       —          —          0.003      Coefficient used in auditory adaptation

Others:
$W_s$            8°         4°         4°         Spatial window width for the judgment of spatial unity
$W_t$            —          65 ms      —          Temporal window width for the judgment of simultaneity
$L_s$            180°       180°       180°       Spatial integration range
$L_t$            —          1000 ms    —          Temporal integration range
$P(\xi = 1)$     0.005      0.2        (a)        Probability that an audiovisual stimulus pair originates from the same source

Note: The Value 1, Value 2, and Value 3 columns show the values used in simulation 1, simulation 2, and simulation 3, respectively, in section 3.
(a) See appendix B for the value used in simulation 3.
Putting all this together, equation 2.1 can be written as
$$P(x_{Vp}, x_{Ap}, \xi \mid x_V, x_A) = \frac{P(x_V \mid x_{Vp})\, P(x_A \mid x_{Ap})\, P(x_{Vp}, x_{Ap} \mid \xi)\, P(\xi)}{P(x_V, x_A)}, \qquad (2.7)$$
which is the product of the visual and auditory sensory noise distributions, the prior distributions of the visual and auditory stimulus positions, and the prior distribution of $\xi$, divided by a normalization factor.

Next, let us take temporal factors into consideration. The detailed calculation including temporal factors is provided in appendix A; the assumptions and the calculation procedure are almost the same as for the spatial factors. Assuming independent spatial and temporal noises and priors, we find that

$$P(x_{Vp}, x_{Ap}, t_{Vp}, t_{Ap}, \xi \mid x_V, x_A, t_V, t_A) = \frac{P(x_V \mid x_{Vp})\, P(x_A \mid x_{Ap})\, P(t_V \mid t_{Vp})\, P(t_A \mid t_{Ap})\, P(x_{Vp}, x_{Ap} \mid \xi)\, P(t_{Vp}, t_{Ap} \mid \xi)\, P(\xi)}{P(x_V, x_A, t_V, t_A)}. \qquad (2.8)$$

The probability $P(x_{Vp}, x_{Ap}, t_{Vp}, t_{Ap}, \xi \mid x_V, x_A, t_V, t_A)$ thus decomposes into a product of noise terms and physical variation terms, each of which relates to either the visual or the auditory stimulus and to either space or time.
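The spatial part of this computation is compact enough to sketch numerically. The following is a minimal illustration, not the authors' code: it evaluates the posterior of equations 2.1 to 2.7 on a grid (the grid and its resolution are our own choices) with the simulation 1 parameters of Table 1.

```python
import numpy as np

sigma_sV, sigma_sA, sigma_sp = 2.5, 8.0, 1.0   # degrees (Table 1, simulation 1)
L_s, p_common = 180.0, 0.005                   # integration range and P(xi = 1)

def gauss(x, mu, sigma):
    """Gaussian density with mean mu and standard deviation sigma."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

def map_estimate(x_V, x_A, grid=np.linspace(-90.0, 90.0, 361)):
    """MAP estimates (x_Vp_hat, x_Ap_hat): maximize the posterior of eq. 2.7
    summed over xi, on a coarse grid of candidate positions."""
    xv, xa = np.meshgrid(grid, grid, indexing="ij")             # candidate x_Vp, x_Ap
    lik = gauss(x_V, xv, sigma_sV) * gauss(x_A, xa, sigma_sA)   # eqs. 2.2-2.4
    prior = (p_common * gauss(xv - xa, 0.0, sigma_sp) / L_s     # eq. 2.6, xi = 1
             + (1.0 - p_common) / L_s ** 2)                     # eq. 2.6, xi = 0
    i, j = np.unravel_index(np.argmax(lik * prior), xv.shape)
    return grid[i], grid[j]

# A sound presented 8 degrees from the light is largely captured by it:
print(map_estimate(x_V=0.0, x_A=8.0))
```

With these parameters, the auditory estimate lands close to the visual one, a grid-based analog of the ventriloquism effect described in section 3.1.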
2.2 Modeling of Adaptation. We model the spatial adaptation of audiovisual perception by updating the distributions $P(x_V \mid x_{Vp})$ and $P(x_A \mid x_{Ap})$ of equations 2.3 and 2.4, respectively, at each stimulus presentation. We introduce two free parameters, $\mu_V$ and $\mu_A$, to be adapted; they shift the mean values of $P(x_V \mid x_{Vp})$ and $P(x_A \mid x_{Ap})$, respectively. Equations 2.3 and 2.4 are rewritten as

$$P(x_V \mid x_{Vp}) = \frac{1}{\sqrt{2\pi}\,\sigma_{sV}} \exp\left(-\frac{(x_V - x_{Vp} - \mu_V)^2}{2\sigma_{sV}^2}\right), \qquad (2.9)$$

$$P(x_A \mid x_{Ap}) = \frac{1}{\sqrt{2\pi}\,\sigma_{sA}} \exp\left(-\frac{(x_A - x_{Ap} - \mu_A)^2}{2\sigma_{sA}^2}\right). \qquad (2.10)$$
Each time the observer receives the adapting audiovisual stimuli, it estimates the corresponding parameters and updates the noise distributions based on its observations and estimations. The observer determines the MAP estimators $\hat{x}_{Vp}$ and $\hat{x}_{Ap}$ from $x_V$ and $x_A$ and updates $\mu_V$ and $\mu_A$ as

$$\mu_V \rightarrow (1 - \alpha_V)\,\mu_V + \alpha_V (x_V - \hat{x}_{Vp}), \qquad (2.11)$$

$$\mu_A \rightarrow (1 - \alpha_A)\,\mu_A + \alpha_A (x_A - \hat{x}_{Ap}), \qquad (2.12)$$
where $\alpha_A$ and $\alpha_V$ represent the magnitude of adaptation for each presentation of an adapting stimulus. It has been shown that the ventriloquism aftereffect is related to a coordinate transformation between the different coordinates used by different modalities (Lewald, 2002). We modeled this interpretation by shifting the mean values of $P(x_V \mid x_{Vp})$ and $P(x_A \mid x_{Ap})$ toward the biases of the previous estimations; such a shift corresponds to a shift of the visual and auditory reference frames.
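The adaptation loop itself is short. The sketch below is again illustrative: it reuses gauss and map_estimate from the previous sketch, and it approximates the shifted likelihoods of equations 2.9 and 2.10 by subtracting the current shifts from the percepts before estimating, which is our simplification of the full model. Parameter values follow Table 1 (simulation 3), with the 8° adapting disparity of section 3.4.

```python
rng = np.random.default_rng(0)
alpha_V = alpha_A = 0.003          # Table 1, simulation 3
mu_V = mu_A = 0.0                  # mean shifts of eqs. 2.9-2.10

for _ in range(2500):              # repeated adapting presentations (slow but simple)
    x_Vp, x_Ap = 8.0, 0.0          # light 8 degrees to the right of the sound
    x_V = x_Vp + rng.normal(0.0, sigma_sV)   # noisy percepts, eqs. 2.3-2.4
    x_A = x_Ap + rng.normal(0.0, sigma_sA)
    xhat_Vp, xhat_Ap = map_estimate(x_V - mu_V, x_A - mu_A)
    mu_V = (1 - alpha_V) * mu_V + alpha_V * (x_V - xhat_Vp)   # eq. 2.11
    mu_A = (1 - alpha_A) * mu_A + alpha_A * (x_A - xhat_Ap)   # eq. 2.12

print(mu_V, mu_A)   # mu_A drifts negative, so unimodal auditory estimates
                    # x_A - mu_A shift toward the light (see section 3.4)
```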
3 Results

We consider the three experimental results listed below and perform computer simulations for each of them:

1. The relationship between the unity judgment and the ventriloquism effect
2. The spatial and temporal effects on audiovisual perception
3. The ventriloquism aftereffect

In the experiments we reproduce in this letter, the subjects were not instructed to report whether the audiovisual stimuli originated from the same source. Therefore, the posterior probability to be maximized is integrated with respect to $\xi$ in the following sections.

3.1 MAP Estimation and the Ventriloquism Effect. In this section, we explain how our model produces the ventriloquism effect. Owing to the sensory noises in the auditory and visual pathways, the perceived relative location of the sound with respect to the light, $x_A - x_V$, is distributed around the presented value $x_{Ap} - x_{Vp}$. Figure 1A shows the probability distribution

$$P(x_A - x_V \mid x_{Ap} - x_{Vp}) = \int_{x'_A - x'_V = x_A - x_V} P(x'_A \mid x_{Ap})\, P(x'_V \mid x_{Vp})\, dx'_A\, dx'_V \qquad (3.1)$$
for $x_{Ap} - x_{Vp} = 0°$ and $x_{Ap} - x_{Vp} = 20°$. In Figure 1B, we show the posterior distributions of $x_{Ap} - x_{Vp}$ given the perceived relative location $x_A - x_V$, that is,

$$P(x_{Ap} - x_{Vp} \mid x_A - x_V) = \sum_{\xi} \int_{x'_{Ap} - x'_{Vp} = x_{Ap} - x_{Vp}} P(x'_{Ap}, x'_{Vp}, \xi \mid x_A, x_V)\, dx'_{Ap}\, dx'_{Vp}, \qquad (3.2)$$
for $x_A - x_V = 8°$ (circles) and $x_A - x_V = 18°$ (crosses). In Figure 1B, we observe that the posterior distributions have two peaks, located near $0°$ and near $x_A - x_V$, and that the ratio of the magnitudes of the peaks depends on $x_A - x_V$. When $x_A - x_V = 8°$, the peak near $0°$ is higher; thus, the difference between the estimates of $x_{Ap}$ and $x_{Vp}$ is near $0°$, which implies that the ventriloquism effect has taken place. It should be noted that the peak around $0°$ mainly originates from $P(x_{Ap}, x_{Vp}, \xi = 1 \mid x_A, x_V)$ (the probability for an assumed common source) and the peak around $x_A - x_V$ originates from $P(x_{Ap}, x_{Vp}, \xi = 0 \mid x_A, x_V)$ (the probability for assumed uncommon sources).
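Numerically, the two-peaked posterior in Figure 1B can be approximated with a one-dimensional integral. This is a rough sketch under the same assumptions and parameters as the earlier sketches; the discretization is ours.

```python
def relative_posterior(d_grid, x_V, x_A):
    """Posterior of d = x_Ap - x_Vp given the percepts (eq. 3.2),
    approximating the line integral by a sum over x_Vp."""
    s = np.linspace(-90.0, 90.0, 1801)              # integration variable x_Vp
    ds = s[1] - s[0]
    post = np.empty_like(d_grid)
    for k, d in enumerate(d_grid):
        lik = gauss(x_V, s, sigma_sV) * gauss(x_A, s + d, sigma_sA)
        pri = (p_common * gauss(d, 0.0, sigma_sp) / L_s
               + (1.0 - p_common) / L_s ** 2)       # mixture over xi, eq. 2.6
        post[k] = lik.sum() * ds * pri
    return post / (post.sum() * (d_grid[1] - d_grid[0]))

d = np.linspace(-20.0, 40.0, 601)
p = relative_posterior(d, x_V=0.0, x_A=18.0)   # peaks near 0 and near 18 deg
```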
Figure 1: (A) The probability distribution of $x_A - x_V$ when the presented stimuli have $x_{Ap} - x_{Vp} = 0°$ and $20°$. (B) The posterior probability distribution of $x_{Ap} - x_{Vp}$ when the observed value is $x_A - x_V = 8°$ (circles) and $18°$ (crosses). The lines with circles and crosses in A represent the cases in which $x_A - x_V$ is perceived to be $8°$ (circles) or $18°$ (crosses); each corresponds to the line with circles or crosses in B.
3.2 Simulation 1: The Relationship Between the Spatial Unity Judgment and the Ventriloquism Effect. The purpose of simulation 1 is to confirm the results of the experiment by Wallace et al. (2004), who showed a strong relationship between the spatial unity perception (the perception that the light and the sound come from the same location) and the ventriloquism effect. For simplicity, this section considers only the spatial factors. Wallace et al. (2004) showed that when spatial unity was reported, the localization of the auditory stimulus was almost completely biased to the location of the visual stimulus. When this perception was not reported, however, they observed a negative bias; that is, the localization of the auditory stimulus was biased away from the location of the visual stimulus. They also found that the standard deviation of the auditory localization was relatively small when spatial unity was reported compared with when it was not. We performed a computer simulation to compare the consequences of our model with these experimental results. In simulation 1, the task was to localize the auditory stimuli and judge the spatial unity. Gaussian noises with mean 0° and standard deviations $\sigma_{sV}$ and $\sigma_{sA}$ were added to the presented stimuli $x_{Vp}$ and $x_{Ap}$, which resulted in $x_V$ and $x_A$, respectively.
Figure 2: (A) The relation between the strength of the ventriloquism effect and the judgment of spatial unity. The vertical axis represents the auditory localization bias. (B) Standard deviation of the auditory localization. The vertical axis represents the standard deviation of the auditory localization. In both panels, the horizontal axis represents the spatial disparity between the light and the sound ($x_{Vp} - x_{Ap}$), and the two lines correspond to the unity ($|\hat{x}_{Vp} - \hat{x}_{Ap}| < W_s$) and nonunity ($|\hat{x}_{Vp} - \hat{x}_{Ap}| > W_s$) cases.
The observer determined the estimates $\hat{x}_{Vp}$ and $\hat{x}_{Ap}$ by maximizing $\sum_{\xi} P(x_{Vp}, x_{Ap}, \xi \mid x_V, x_A)$ and then judged the spatial unity as yes if $|\hat{x}_{Vp} - \hat{x}_{Ap}| < W_s$. It should be noted that the spatial unity perception ($|\hat{x}_{Vp} - \hat{x}_{Ap}| < W_s$) and the perception of a common cause (MAP estimator of $\xi$ equal to 1) are different. If audiovisual stimuli are presented at the same position but with a great temporal disparity, the stimuli are unlikely to have originated from the same event; however, subjects mostly report spatial unity because the stimuli are presented at the same place. The strength of the auditory localization bias was defined as

$$\mathrm{bias} = \frac{\hat{x}_{Ap} - x_{Ap}}{x_{Vp} - x_{Ap}}, \qquad (3.3)$$
where bias = 1 implies that the auditory localization was completely attracted to the visual stimulus, and a negative bias implies that it was repelled from the visual location. The trial was repeated 2000 times for each audiovisual stimulus pair. Other details of the simulation are provided in appendix B. After the calculation, the results were classified into two groups depending on the spatial unity judgment. Figure 2A shows that when spatial unity was judged, the auditory localization was strongly attracted by the visual signal. On the contrary, when it was not judged, the bias was negative: the auditory localization was biased away from the visual stimulus (Wallace et al., 2004). Further, this negative bias was greater when the stimuli were presented in close proximity.
Figure 3: (A) Upper-right panel: relation of $\hat{x}_{Ap}$ to $x_A$. Each circle is located at $(x_A, \hat{x}_{Ap})$, where $x_A$ is generated according to the probability distribution of $x_A$ when $x_{Vp} = x_V = 0°$ and $x_{Ap} = 15°$, and $\hat{x}_{Ap}$ is the corresponding estimate. The shaded area represents the unity case. The horizontal axis represents $x_A$, and the vertical axis represents $\hat{x}_{Ap}$. The bottom and left panels represent the probability distributions of $x_A$ and $\hat{x}_{Ap}$, respectively, when $x_{Ap} = 15°$. (B) The probability distribution of $\hat{x}_{Ap}$ when $x_{Ap} = 15°$. The brighter-shaded area represents the unity case. The darker-shaded area represents cases in which the bias is negative. The horizontal axis represents $\hat{x}_{Ap}$, and the vertical axis represents the probability density of $\hat{x}_{Ap}$.
Figure 2B shows the standard deviation of the auditory localization for each test stimulus. From Figure 2B, we observe that the standard deviation of the auditory localization was greater in the nonunity case, where the localization deviation increased as the spatial disparity decreased (Wallace et al., 2004). These results agree with the experimental results of Wallace et al.

Figure 3 explains the reason for the negative bias in the nonunity case in our model. In this figure, we assume $x_{Vp} = x_V = 0°$ for simplicity. The upper-right panel of Figure 3A shows the relation of $\hat{x}_{Ap}$ to $x_A$. Each circle is located at $(x_A, \hat{x}_{Ap})$, where $x_A$ is generated according to equation 2.4 with $x_{Ap} = 15°$ and $\hat{x}_{Ap}$ is the corresponding MAP estimator. Spatial unity is judged as yes in the shaded area ($|\hat{x}_{Ap}| < W_s$). The bottom and left panels of Figure 3A show the probability distributions of $x_A$ and $\hat{x}_{Ap}$, respectively, for $x_{Ap} = 15°$. Figure 3B again shows the probability distribution of $\hat{x}_{Ap}$ for $x_{Ap} = 15°$. Spatial unity is judged as yes in the brighter-shaded area ($|\hat{x}_{Ap}| < W_s$). In the darker-shaded area ($\hat{x}_{Ap} > x_{Ap}$), the estimated location is greater than the presented location, which means that the bias is negative. Owing to the relatively large value of $W_s$, the sharp distribution near $0°$ is always included in the brighter-shaded area. Therefore, the mean bias becomes negative in the nonunity case (outside the brighter-shaded area).
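Simulation 1 is then a short loop around the estimator. The sketch below is an illustrative reconstruction, not the authors' appendix B code, using $W_s = 8°$ from Table 1 and the rng and map_estimate of the earlier sketches.

```python
W_s = 8.0                                   # Table 1, simulation 1

def simulate_trial(x_Vp, x_Ap):
    """One trial: add sensory noise, estimate, judge unity, compute eq. 3.3."""
    x_V = x_Vp + rng.normal(0.0, sigma_sV)
    x_A = x_Ap + rng.normal(0.0, sigma_sA)
    xhat_Vp, xhat_Ap = map_estimate(x_V, x_A)
    unity = abs(xhat_Vp - xhat_Ap) < W_s
    bias = (xhat_Ap - x_Ap) / (x_Vp - x_Ap)
    return unity, bias

trials = [simulate_trial(0.0, 15.0) for _ in range(2000)]
unity_bias = [b for u, b in trials if u]
nonunity_bias = [b for u, b in trials if not u]
print(np.mean(unity_bias), np.mean(nonunity_bias))  # strong vs. negative bias
```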
3.3 Simulation 2: Spatial and Temporal Effects on Audiovisual Perception. In simulation 2, we investigated the temporal dependency of the spatial unity judgment and the spatial dependency of the simultaneity judgment. Both spatial and temporal factors were considered in this simulation. The temporal discrepancy between visual and auditory stimuli influences the perception of spatial unity (Slutsky & Recanzone, 2001): the proportion of spatial unity judgments decreases as the temporal discrepancy increases. It has also been shown that the spatial discrepancy modulates the perception of simultaneity, or the time resolution of the audiovisual stimuli (Bertelson & Aschersleben, 2003; Lewald & Guski, 2003; Zampini et al., 2005).

One possible explanation for the temporal dependency of the spatial unity is as follows. As described above, when the temporal disparity is large, the ventriloquism effect is small. The audiovisual stimuli are therefore perceived to be more spatially distant than when the temporal disparity is small, and thus are likely to be perceived as originating from different locations. Consider the task of reporting the spatial unity in our model. When the temporal discrepancy is large, the audiovisual stimuli are unlikely to originate from the same source, and the ventriloquism effect is absent. This leads to a decrease in the spatial unity perception. In mathematical terms, omitting the normalization factors in the denominators of equations 2.7 and 2.8, we observe that $P(t_V \mid t_{Vp})\, P(t_A \mid t_{Ap})\, P(t_{Vp}, t_{Ap} \mid \xi)\, P(\xi)$ in equation 2.8 plays the role of $P(\xi)$ in equation 2.7. Therefore, the decision on spatial unity depends on the temporal disparity via $\xi$.

We performed two computer simulations that investigated different tasks for audiovisual stimuli with various spatial and temporal disparities: simulation 2.1 for the spatial unity judgment task and simulation 2.2 for the simultaneity judgment task. In simulation 2.1, since the observer did not judge the temporal properties, it determined the estimators $\hat{x}_{Vp}$ and $\hat{x}_{Ap}$ by maximizing the posterior probability

$$P(x_{Vp}, x_{Ap} \mid x_V, x_A, t_V, t_A) = \sum_{\xi} \int P(x_{Vp}, x_{Ap}, t_{Vp}, t_{Ap}, \xi \mid x_V, x_A, t_V, t_A)\, dt_{Vp}\, dt_{Ap}. \qquad (3.4)$$
The observer then judged the spatial unity as yes if $|\hat{x}_{Vp} - \hat{x}_{Ap}| < W_s$. Similarly, in simulation 2.2, the observer determined the estimators $\hat{t}_{Vp}$ and $\hat{t}_{Ap}$ by maximizing

$$P(t_{Vp}, t_{Ap} \mid x_V, x_A, t_V, t_A) = \sum_{\xi} \int P(x_{Vp}, x_{Ap}, t_{Vp}, t_{Ap}, \xi \mid x_V, x_A, t_V, t_A)\, dx_{Vp}\, dx_{Ap}, \qquad (3.5)$$
Figure 4: (A) Temporal dependency of the spatial unity judgment. The vertical axis represents the proportion of spatial unity judgments. The horizontal axis denotes the temporal disparity, $t_{Vp} - t_{Ap}$, between the audiovisual stimuli. Each line represents the data calculated for a different spatial disparity, $x_{Vp} - x_{Ap}$. (B) Spatial dependency of the simultaneity judgment. The vertical axis represents the proportion of simultaneity judgments. The horizontal axis denotes the spatial disparity between the audiovisual stimuli. The different lines were calculated for different temporal disparities.
and judged the temporal simultaneity as yes if $|\hat{t}_{Vp} - \hat{t}_{Ap}| < W_t$. Other details of the simulations are provided in appendix B.

The results for simulations 2.1 and 2.2 are shown in Figures 4A and 4B, respectively. In Figure 4A, we observe that the proportion of spatial unity judgments decreases as the temporal disparity increases. This temporal dependency of the spatial unity judgment is consistent with the experimental results (Slutsky & Recanzone, 2001). Further, Figure 4B shows that the proportion of simultaneity judgments decreases as the spatial disparity increases. This result is also consistent with the experimental results (Zampini et al., 2005).
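The factorization noted above can be made concrete: marginalizing the temporal variables in equation 2.8 multiplies $P(\xi = 1)$ by a Gaussian likelihood of the perceived asynchrony. The closed form below is our own derivation (the variance $\sigma_{tV}^2 + \sigma_{tA}^2 + \sigma_{tp}^2$ follows from convolving the three Gaussians, and is not spelled out in the text); parameters are the simulation 2 values of Table 1, and gauss comes from the earlier sketch.

```python
sigma_tV, sigma_tA, sigma_tp = 45.0, 10.0, 10.0   # ms (Table 1, simulation 2)
L_t, p_common_2 = 1000.0, 0.2

def effective_p_common(dt):
    """Posterior weight of xi = 1 after folding in the temporal evidence
    dt = t_V - t_A (marginal of eq. 2.8 over t_Vp, t_Ap)."""
    s = np.sqrt(sigma_tV ** 2 + sigma_tA ** 2 + sigma_tp ** 2)
    w1 = p_common_2 * gauss(dt, 0.0, s) / L_t     # xi = 1 branch
    w0 = (1.0 - p_common_2) / L_t ** 2            # xi = 0 branch
    return w1 / (w1 + w0)

print(effective_p_common(0.0), effective_p_common(500.0))
# Large temporal disparity -> small effective P(xi = 1) -> weaker spatial
# capture and fewer spatial-unity judgments, as in Figure 4A.
```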
Figure 5: Results of spatial adaptation for audiovisual test stimuli after the 8◦ adaptation condition. The vertical axis represents the proportion of the spatial unity judgment. The horizontal axis denotes the location of the presented auditory stimulus. The visual stimulus was fixed at 0◦ .
3.4 Simulation 3: The Ventriloquism Aftereffect. In simulation 3, we investigated the audiovisual spatial adaptation known as the ventriloquism aftereffect. In this simulation, only spatial factors were considered. It is known that after prolonged exposure to simultaneous audiovisual stimuli with a spatial disparity, the disparity that most strongly produces a spatial unity perception shifts toward the presented disparity (Recanzone, 1998). The aftereffect can also be observed in unimodal auditory localization.

First, we presented the adapting stimuli and performed an adaptation process. The light and the sound were always presented at the same relative positions: the light was 8° to the right of the sound. The adapted parameters $\mu_V$ and $\mu_A$ were initially set to zero and were updated after each presentation of the adapting stimuli. The adapting stimuli were presented 2500 times. Test audiovisual stimuli were presented after the completion of the adaptation. The observer estimated $\hat{x}_{Vp}$ and $\hat{x}_{Ap}$ and judged the spatial unity. Other simulation details are provided in appendix B.

Figure 5 shows the result for audiovisual test stimuli after the presentation of the adapting stimuli with $x_{Vp} - x_{Ap} = 8°$. The figure shows the proportion of spatial unity judgments for various auditory test locations when the visual stimulus was fixed at 0°. Figure 5 indicates that after the adaptation, spatial unity was more likely to be judged when the light was located to the right of the sound, consistent with the relative position of the light during the adaptation period. This result agrees with the experimental results (Recanzone, 1998).

We also investigated the adaptation effect on unimodal localization, that is, localization when only a test auditory or visual stimulus was presented. We assumed that the unimodal location was determined by maximizing the posterior probability of the physical position. For the auditory localization, the observer received $x_A$ and maximized $P(x_{Ap} \mid x_A)$. From Bayes' formula, this is given by

$$P(x_{Ap} \mid x_A) = \frac{P(x_A \mid x_{Ap})\, P(x_{Ap})}{P(x_A)}. \qquad (3.6)$$
If we assume an isotropically distributed stimulus source and a uniform prior distribution P(xAp), the estimator x̂Ap is the one that maximizes P(xA|xAp). Other simulation details are provided in appendix B. Figure 6 shows the results for the visual and auditory unimodal test stimuli. The auditory unimodal localization was shifted in the relative direction of the light in the adaptation period. However, the visual unimodal localization was not greatly affected by the adaptation. This relatively large shift in the
Figure 6: Unimodal (A) visual and (B) auditory localization after adaptation. The solid line represents the estimated locations for the various presented locations after adaptation. The broken line shows the estimation before adaptation.
auditory localization and the small shift in the visual localization agree with the experimental results on the ventriloquism aftereffect (Lewald, 2002; Recanzone, 1998). The relatively large shift in the auditory unimodal localization is due to our assumption that the adaptation was driven by the estimated parameters. The estimate of the auditory location differed from the perceived location during the adaptation period because of the ventriloquism effect, whereas the estimate of the visual location remained close to the perceived location. This is why, in our model, auditory localization was strongly affected by the adaptation while visual localization was not.

4 Summary and Discussion

In this letter, we modeled audiovisual integration using adaptive Bayesian inference. We defined a Bayesian observer that uses Bayesian inference to judge the locations or the timing of stimuli in addition to whether the audiovisual stimuli come from the same source. The idea that the source of the audiovisual stimuli plays a crucial role in audiovisual integration has been termed the unity assumption (Welch & Warren, 1980) or pairing (Epstein, 1975). While previous models of audiovisual interaction based on Bayesian inference (Battaglia et al., 2003; Witten & Knudsen, 2005) assumed that the audiovisual stimuli are bound from the start, our model not only infers the positions and timing of the audiovisual stimuli but also infers whether the audiovisual stimuli should be bound at all. It also explains audiovisual adaptation by adaptively changing the probability distributions of the sensory noises.

The observer estimates the physically delivered position or timing of stimuli from noisy information. Although we applied the MAP rule in this
letter, it is also possible to apply the Bayes estimator, which corresponds to the expected value of the posterior probability. We obtained qualitatively similar results with this rule as well.

We showed that our model reproduces several experimental results on the binding of multisensory signals. Our model exhibits a strong relation between the spatial unity percept and the ventriloquism effect. We also showed that an audiovisual temporal discrepancy affects the perception of spatial unity and that spatial distance likewise affects the perception of simultaneity. Further, we assumed that adaptive changes in the noise distributions modify audiovisual perception. We adapted the mean value of the noise distributions based on the observed and estimated parameters, considering that the ventriloquism aftereffect may correspond to an adaptation of coordinate transformation. We showed that our model of adaptation reproduces the experimental results on the spatial audiovisual adaptation known as the ventriloquism aftereffect.

Theoretically, there are two ways to include adaptation in our model. One is to change the noise distributions P(xA|xAp) and P(xV|xVp), as in our model. These distributions characterize the noise properties of the sensory system, so their adaptation serves to track changes in one's inner sensory processing pathways with the aid of other sensory modalities. The other possibility is to change the prior distribution P(xVp, xAp, ξ) in order to adapt to changes in the stimulus properties of the external world (Sharpee et al., 2006; Smirnakis, Berry, Warland, Bialek, & Meister, 1997). The ventriloquism aftereffect is usually considered to be a recalibration process that compensates for changes in the relationship between physical input and inner representation caused by development or by injury to the sensory organs (De Gelder & Bertelson, 2003). Further, it is difficult to imagine a situation in which the physical directions of the light and the sound from the same event have a constant disparity over minutes or tens of minutes (the typical timescale needed to induce the ventriloquism aftereffect). Nevertheless, comparing the audiovisual aftereffects produced by these two adaptation mechanisms, that is, adaptation to internal changes versus adaptation to external changes, is important, since both might be plausible in reality.

Although our model can reproduce some aspects of audiovisual spatial adaptation, it cannot explain all the properties of the experimental results. For example, if the adapting stimuli were presented with larger spatial or temporal disparities than those in our simulation, the adaptation effect might disappear in the model; however, some experimental results suggest that the adaptation effect can be observed even when the disparities are relatively large (Fujisaki, Shimojo, Kashino, & Nishida, 2004; Lewald, 2002). Therefore, the adaptation mechanism can be partially, but not completely, described by our model. There may exist other mechanisms for adaptation that are not included in our adaptation model.
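The difference between the MAP rule and the posterior-mean (Bayes) estimator mentioned above can be made concrete on a discretized posterior. The following sketch is purely illustrative: the grid, the mixture shape, and all numerical values are hypothetical and are not taken from this letter.

```python
import numpy as np

# Hypothetical discretized posterior over a 1D location grid (degrees):
# a skewed mixture such as arises when summing the posterior over xi.
grid = np.arange(-30.0, 30.0, 0.5)
post = (0.7 * np.exp(-0.5 * ((grid - 4.0) / 2.0) ** 2)
        + 0.3 * np.exp(-0.5 * ((grid + 6.0) / 5.0) ** 2))
post /= post.sum()

x_map = grid[np.argmax(post)]   # MAP rule: mode of the posterior
x_mean = np.sum(grid * post)    # Bayes estimator: posterior expectation
print(f"MAP: {x_map:.2f} deg, posterior mean: {x_mean:.2f} deg")
```

For a unimodal, symmetric posterior the two estimators coincide; for the skewed mixtures produced by summing over ξ they differ, which is consistent with the qualitative (rather than exact) agreement between the two rules noted above.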
Our model has many parameters, as shown in Table 1. It is difficult to pin down these parameter values because no single set of values reproduces all the experiments. However, we can predict the parameter dependence of the experimental results from our model. For example, parameter Lt is the width of the temporal integration region, which corresponds to the temporal range over which stimuli are presented. The value of Lt influences the ratio of P(tVp, tAp|ξ = 1) to P(tVp, tAp|ξ = 0), as shown in appendix A. We can show that the proportion of temporal simultaneity judgments increases as Lt increases. In fact, it is known experimentally that when a wide range of temporal disparities is presented, the proportion of simultaneity judgments increases (Zampini et al., 2005), in agreement with our results.

Although we modeled spatial and temporal audiovisual integration, the mathematical formulation of our model is not limited to this setting. The same formulation could be applied to the integration of other modalities, such as visuotactile or sensorimotor integration. Modeling such varieties of multisensory integration is left for future work.

Appendix A: Bayesian Estimation with Spatial and Temporal Factors

In appendix A, we introduce Bayesian inference that takes temporal factors into consideration. In addition to the notation described in section 2.1, let tVp and tAp represent the times at which the visual and auditory stimuli, respectively, are presented to the observer, and let tV and tA represent the noisy information that the observer can use to infer the actual physical times. Taking spatial and temporal factors into consideration, the observer determines the physical parameters based on the conditional probability P(xVp, xAp, tVp, tAp, ξ|xV, xA, tV, tA). From Bayes' formula, we obtain

$$P(x_{Vp}, x_{Ap}, t_{Vp}, t_{Ap}, \xi \mid x_V, x_A, t_V, t_A) = \frac{P(x_V, x_A, t_V, t_A \mid x_{Vp}, x_{Ap}, t_{Vp}, t_{Ap}, \xi)\, P(x_{Vp}, x_{Ap}, t_{Vp}, t_{Ap}, \xi)}{P(x_V, x_A, t_V, t_A)}. \tag{4.1}$$
We assume that the first term in the numerator on the right-hand side does not depend on ξ, as in section 2.1, and that there is no correlation between the noises added during spatial information processing and those added during temporal information processing. Further, we assume that the visual and auditory noises are independent in both the spatial and temporal dimensions. It then follows that

$$\begin{aligned}
P(x_V, x_A, t_V, t_A \mid x_{Vp}, x_{Ap}, t_{Vp}, t_{Ap}, \xi) &= P(x_V, x_A, t_V, t_A \mid x_{Vp}, x_{Ap}, t_{Vp}, t_{Ap}) \\
&= P(x_V, x_A \mid x_{Vp}, x_{Ap})\, P(t_V, t_A \mid t_{Vp}, t_{Ap}) \\
&= P(x_V \mid x_{Vp})\, P(x_A \mid x_{Ap})\, P(t_V \mid t_{Vp})\, P(t_A \mid t_{Ap}).
\end{aligned} \tag{4.2}$$
With regard to the prior distribution term P(xVp, xAp, tVp, tAp, ξ), we assume that the spatial and temporal physical variations of the visual and auditory stimuli are independent. The term can then be written as

$$P(x_{Vp}, x_{Ap}, t_{Vp}, t_{Ap}, \xi) = P(x_{Vp}, x_{Ap}, t_{Vp}, t_{Ap} \mid \xi)\, P(\xi) = P(x_{Vp}, x_{Ap} \mid \xi)\, P(t_{Vp}, t_{Ap} \mid \xi)\, P(\xi). \tag{4.3}$$
Taking all the assumptions together, equation 4.1 is decomposed into

$$P(x_{Vp}, x_{Ap}, t_{Vp}, t_{Ap}, \xi \mid x_V, x_A, t_V, t_A) = \frac{P(x_V \mid x_{Vp})\, P(x_A \mid x_{Ap})\, P(t_V \mid t_{Vp})\, P(t_A \mid t_{Ap})\, P(x_{Vp}, x_{Ap} \mid \xi)\, P(t_{Vp}, t_{Ap} \mid \xi)\, P(\xi)}{P(x_V, x_A, t_V, t_A)}. \tag{4.4}$$

The distributions involving the temporal factors are set in the same way as those for the spatial factors:

$$P(t_V \mid t_{Vp}) = \frac{1}{\sqrt{2\pi}\,\sigma_{tV}} \exp\!\left(-\frac{(t_V - t_{Vp})^2}{2\sigma_{tV}^2}\right), \tag{4.5}$$

$$P(t_A \mid t_{Ap}) = \frac{1}{\sqrt{2\pi}\,\sigma_{tA}} \exp\!\left(-\frac{(t_A - t_{Ap})^2}{2\sigma_{tA}^2}\right), \tag{4.6}$$

$$P(t_{Vp}, t_{Ap} \mid \xi) = \begin{cases} \dfrac{1}{\sqrt{2\pi}\,\sigma_{tp} L_t} \exp\!\left(-\dfrac{(t_{Vp} - t_{Ap})^2}{2\sigma_{tp}^2}\right) & (\xi = 1), \\[2ex] \dfrac{1}{L_t^2} & (\xi = 0), \end{cases} \tag{4.7}$$

where σtV and σtA are the standard deviations of the visual and auditory noises, respectively, in temporal information processing. These parameters correspond to the visual and auditory temporal resolutions, respectively. Lt is the length of the temporal integration range used for normalization, and σtp is the standard deviation of the temporal disparity distribution of audiovisual stimuli when they share the same source.
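As a concrete illustration of the inference derived in this appendix, the sketch below evaluates the spatial part of the model on a discrete grid and performs the spatial unity judgment of simulation 1. It assumes that the spatial prior P(xVp, xAp|ξ) mirrors equation 4.7; all parameter values (σsV, σsA, σsp, Ls, P(ξ = 1), Ws) are placeholders rather than the Table 1 entries.

```python
import numpy as np

grid = np.arange(-30.0, 30.0, 0.5)                # location grid (deg)
XV, XA = np.meshgrid(grid, grid, indexing="ij")   # candidate (xVp, xAp) pairs

sV, sA = 2.0, 6.0        # assumed sensory noise s.d. (sigma_sV, sigma_sA)
sp, Ls = 5.0, 60.0       # assumed disparity s.d. and integration range
p_common, Ws = 0.5, 3.0  # assumed P(xi = 1) and unity-judgment window

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

def map_estimate(xV, xA):
    """MAP estimate of (xVp, xAp) after summing the posterior over xi."""
    like = gauss(xV, XV, sV) * gauss(xA, XA, sA)   # P(xV|xVp) P(xA|xAp)
    prior1 = gauss(XV - XA, 0.0, sp) / Ls          # spatial analog of eq. 4.7, xi = 1
    prior0 = np.full_like(XV, 1.0 / Ls ** 2)       # xi = 0: uniform over the range
    post = like * (p_common * prior1 + (1.0 - p_common) * prior0)
    i, j = np.unravel_index(np.argmax(post), post.shape)
    return grid[i], grid[j]

# One simulated trial: true positions corrupted by Gaussian sensory noise.
rng = np.random.default_rng(0)
xV = 8.0 + rng.normal(0.0, sV)   # visual stimulus at 8 deg
xA = 0.0 + rng.normal(0.0, sA)   # auditory stimulus at 0 deg
xVp_hat, xAp_hat = map_estimate(xV, xA)
print("spatial unity judged:", abs(xVp_hat - xAp_hat) < Ws)
```

The temporal factors extend this in the obvious way: the grid becomes four-dimensional over (xVp, xAp, tVp, tAp), and the judgments marginalize over the dimensions not needed for the task, as described in appendix B.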
Appendix B: Simulation Details

B.1 Simulation 1. The parameter values are shown in Table 1 (value1). The spatial axis was divided into bins of 0.5° width, and the results were calculated numerically. The physical location of the auditory stimulus was fixed at xAp = 0°, and a visual stimulus was delivered to one of {−25°, −20°, ..., +25°}. Gaussian noises with mean 0° and standard deviations σsV and σsA were added to the presented stimuli xVp and xAp, respectively, resulting in xV and xA. The observer determined the estimates x̂Vp and x̂Ap by maximizing Σξ P(xVp, xAp, ξ|xV, xA) and then judged the spatial unity as yes if |x̂Vp − x̂Ap| < Ws. Trials for each stimulus position were repeated 2000 times. The number of iterations was set to roughly follow the experiment by Wallace et al. (2004). Since the standard deviations σsV and σsA were not provided in that paper, we adopted the values measured in another experiment (Hairston et al., 2003) conducted by the same group.

B.2 Simulation 2. The parameter values used in simulation 2.1 are shown in Table 1 (value2). The spatial axis was divided into bins of 0.3° width, and the temporal axis was divided into 5 ms bins. The physical position of the auditory stimulus was fixed at xAp = 0°. The zero point of time was set to the physical arrival time of the auditory stimulus: tAp = 0 ms. Location xVp was one of {0°, 4°, 8°, 12°, 16°, 20°}, and tVp was one of {0 ms, 50 ms, 100 ms, 150 ms, 200 ms, 250 ms}. Trials were repeated 1000 times for each pair of visual and auditory stimuli. Each time the stimuli were presented, noises with zero mean and standard deviations σsV, σsA, σtV, and σtA were added to xVp, xAp, tVp, and tAp, respectively, resulting in xV, xA, tV, and tA. The task was to judge spatial unity. The observer determined the estimators x̂Vp and x̂Ap by maximizing Σξ ∫ P(xVp, xAp, tVp, tAp, ξ|xV, xA, tV, tA) dtVp dtAp and judged the spatial unity as yes if |x̂Vp − x̂Ap| < Ws. In simulation 2.2, all the parameter values and the stimulus presentation procedures were the same as in simulation 2.1, but the task was to judge simultaneity: the observer determined the estimators t̂Vp and t̂Ap by maximizing Σξ ∫ P(xVp, xAp, tVp, tAp, ξ|xV, xA, tV, tA) dxVp dxAp and judged temporal simultaneity as yes if |t̂Vp − t̂Ap| < Wt.

In the experiments that we wished to reproduce in simulations 2 and 3, most of the parameters, such as σsA and σsV, were not given explicitly, and we could not always adopt parameter values from other, similar experiments. In fact, in the experiments by Slutsky and Recanzone (2001) and by Recanzone (1998), the spatial resolution is much better than in the experiment by Wallace et al. (2004) described in simulation 1. We therefore chose reasonable parameter values such that the simulation results could fit the experimental data.

B.3 Simulation 3. The parameter values are shown in Table 1 (value3). In simulation 3, P(ξ = 1) was set to 0.6 during the adaptation period and to 0 during the test period to reproduce the experimental design of Recanzone (1998). The spatial axis was divided into bins of 0.1° width. During the adaptation period, the position xAp of the auditory stimulus was chosen randomly from {−28°, −24°, ..., +20°}, and the visual stimulus was placed 8° to the right of the selected auditory stimulus position. The adapting stimuli were presented 2500 times. The details of the stimulus presentation
and judgment procedures were the same as those in simulation 1 for both the adaptation and the audiovisual test periods. During the audiovisual test period, the visual stimulus location was fixed at 0°, and the auditory stimulus location was one of {−28°, −26°, ..., +28°}; each stimulus was presented 150 times. During the unimodal test period, the presented test auditory or visual stimulus location was one of {−28°, −26°, ..., +28°}, and the estimated location was determined by maximizing equation 3.6. The test stimuli were presented 150 times for each stimulus location in both the visual and auditory tests.

Acknowledgments

This research is partially supported by the Japan Society for the Promotion of Science, a Grant-in-Aid for JSPS fellows (18-06772), and a Grant-in-Aid for Scientific Research on Priority Areas (system study on higher-order brain functions) from the Ministry of Education, Culture, Sports, Science and Technology of Japan (17022012).
References

Alais, D., & Burr, D. (2004). The ventriloquist effect results from near-optimal bimodal integration. Current Biology, 14(3), 257–262.
Battaglia, P. W., Jacobs, R. A., & Aslin, R. N. (2003). Bayesian integration of visual and auditory signals for spatial localization. Journal of the Optical Society of America A: Optics, Image Science, and Vision, 20(7), 1391–1397.
Bertelson, P., & Aschersleben, G. (2003). Temporal ventriloquism: Crossmodal interaction on the time dimension. 1. Evidence from auditory-visual temporal order judgment. International Journal of Psychophysiology, 50(1–2), 147–155.
Bertelson, P., & Radeau, M. (1981). Cross-modal bias and perceptual fusion with auditory-visual spatial discordance. Perception and Psychophysics, 29(6), 578–584.
Canon, L. K. (1970). Intermodality inconsistency of input and directed attention as determinants of the nature of adaptation. Journal of Experimental Psychology, 84(1), 141–147.
De Gelder, B., & Bertelson, P. (2003). Multisensory integration, perception and ecological validity. Trends in Cognitive Sciences, 7(10), 460–467.
Deneve, S., Latham, P. E., & Pouget, A. (2001). Efficient computation and cue integration with noisy population codes. Nature Neuroscience, 4(8), 826–831.
Deneve, S., & Pouget, A. (2004). Bayesian multisensory integration and cross-modal spatial links. Journal of Physiology–Paris, 98(1–3), 249–258.
Epstein, W. (1975). Recalibration by pairing: A process of perceptual learning. Perception, 4, 59–72.
Ernst, M. O., & Banks, M. S. (2002). Humans integrate visual and haptic information in a statistically optimal fashion. Nature, 415, 429–433.
Fujisaki, W., Shimojo, S., Kashino, M., & Nishida, S. (2004). Recalibration of audiovisual simultaneity. Nature Neuroscience, 7(7), 773–778.
Hairston, W. D., Wallace, M. T., Vaughan, J. W., Stein, B. E., Norris, J. L., & Schirillo, J. A. (2003). Visual localization ability influences cross-modal bias. Journal of Cognitive Neuroscience, 15(1), 20–29.
Howard, I. P., & Templeton, W. B. (1996). Human spatial orientation. New York: Wiley.
Knill, D. C., & Pouget, A. (2004). The Bayesian brain: The role of uncertainty in neural coding and computation. Trends in Neurosciences, 27(12), 712–719.
Körding, K. P., & Wolpert, D. M. (2004). Bayesian integration in sensorimotor learning. Nature, 427, 244–247.
Lewald, J. (2002). Rapid adaptation to auditory-visual spatial disparity. Learning and Memory, 9(5), 268–278.
Lewald, J., & Guski, R. (2003). Cross-modal perceptual integration of spatially and temporally disparate auditory and visual stimuli. Cognitive Brain Research, 16(3), 468–478.
Morein-Zamir, S., Soto-Faraco, S., & Kingstone, A. (2003). Auditory capture of vision: Examining temporal ventriloquism. Cognitive Brain Research, 17(1), 154–163.
Pavani, F., Spence, C., & Driver, J. (2000). Visual capture of touch: Out-of-the-body experiences with rubber gloves. Psychological Science, 11(5), 353–359.
Radeau, M. (1994). Auditory-visual spatial interaction and modularity. Cahiers de Psychologie Cognitive–Current Psychology of Cognition, 13(1), 3–51.
Radeau, M., & Bertelson, P. (1974). Aftereffects of ventriloquism. Quarterly Journal of Experimental Psychology, 26, 63–71.
Rao, R. P. N. (2004). Bayesian computation in recurrent neural circuits. Neural Computation, 16(1), 1–38.
Recanzone, G. H. (1998). Rapidly induced auditory plasticity: The ventriloquism aftereffect. Proceedings of the National Academy of Sciences of the United States of America, 95(3), 869–875.
Sharpee, T. O., Sugihara, H., Kurgansky, A. V., Rebrik, S. P., Stryker, M. P., & Miller, K. D. (2006). Adaptive filtering enhances information transmission in visual cortex. Nature, 439, 936–942.
Slutsky, D. A., & Recanzone, G. H. (2001). Temporal and spatial dependency of the ventriloquism effect. Neuroreport, 12(1), 7–10.
Smirnakis, S. M., Berry, M. J., Warland, D. K., Bialek, W., & Meister, M. (1997). Adaptation of retinal processing to image contrast and spatial scale. Nature, 386, 69–73.
Wallace, M. T., Roberson, G. E., Hairston, W. D., Stein, B. E., Vaughan, J. W., & Schirillo, J. A. (2004). Unifying multisensory signals across time and space. Experimental Brain Research, 158(2), 252–258.
Welch, R. B., & Warren, D. H. (1980). Immediate perceptual response to intersensory discrepancy. Psychological Bulletin, 88(3), 638–667.
Witten, I. B., & Knudsen, E. I. (2005). Why seeing is believing: Merging auditory and visual worlds. Neuron, 48(3), 489–496.
Zampini, M., Guest, S., & Shore, D. I. (2005). Audio-visual simultaneity judgments. Perception and Psychophysics, 67(3), 531–544.
Received June 20, 2006; accepted November 15, 2006.
LETTER
Communicated by Joydeep Ghosh
Training Pi-Sigma Network by Online Gradient Algorithm with Penalty for Small Weight Update

Yan Xiong
[email protected]
Department of Applied Mathematics, Dalian University of Technology, Dalian 116024, People's Republic of China, and Faculty of Science, University of Science and Technology Liaoning, Anshan 114051, People's Republic of China
Wei Wu
[email protected]

Xidai Kang
kxd [email protected]

Chao Zhang
zhangchao [email protected]
Department of Applied Mathematics, Dalian University of Technology, Dalian 116024, People's Republic of China
A pi-sigma network is a class of feedforward neural networks with product units in the output layer. An online gradient algorithm is the simplest and most often used training method for feedforward neural networks. But a problem arises when the online gradient algorithm is used for pi-sigma networks: the update increment of the weights may become very small, especially early in training, resulting in very slow convergence. To overcome this difficulty, we introduce an adaptive penalty term into the error function, so as to increase the magnitude of the update increment of the weights when it is too small. This strategy brings about faster convergence, as shown by the numerical experiments carried out in this letter.

1 Introduction

Recently, higher-order neural networks have attracted much attention from researchers due to their nonlinear mapping ability with fewer units. Several kinds of neural networks have been developed to use higher-order combinations of input information, including higher-order neural networks (Giles & Maxwell, 1987), sigma-pi neural networks, and product unit neural networks (Durbin & Rumelhart, 1989). The higher-order neural networks considered in this letter are pi-sigma networks (PSN), which were introduced by Shin and Ghosh (1991) and explored by Hussain and Liatsis (2002), Li (2002, 2003), Sinha, Kumar, and Kalra (2000), and
Voutriaridis, Boutalis, and Mertzios (2003). PSNs use product cells as the output units to indirectly incorporate the capabilities of higher-order networks while using fewer weights and processing units, and they have been used effectively in pattern classification and approximation (Shin & Ghosh, 1991; Xin, 2002; Jiang, Xu, & Piao, 2005).

An online gradient algorithm with a randomized modification has been suggested for training PSNs (Shin & Ghosh, 1992). However, we notice that the update increment of the weights may become very small when the magnitude of the weights is small, as usually happens early in training, resulting in very slow convergence. To overcome this difficulty, we introduce an adaptive penalty term into the error function, so as to increase the magnitude of the update increment of the weights when it is too small. This strategy brings about faster convergence, as shown by the numerical experiments carried out in this letter.

The rest of this letter is organized as follows. Section 2 provides a brief introduction to PSNs and the online gradient algorithm and specifies the small-increment problem. In section 3, we present an online algorithm with a penalty for training PSNs. Section 4 provides a few simulation results that illustrate the effectiveness of our approach.

2 PSN and Randomized Online Gradient Algorithm

2.1 Structure of PSN. Assume that the numbers of neurons in the input, summation, and product layers are P, N, and 1, respectively. The structure is shown in Figure 1. We denote by $w_n = (w_{n1}, w_{n2}, \ldots, w_{nP})^T$ $(1 \le n \le N)$ the weight vector connecting the input and summation node n, and write $w = (w_1^T, w_2^T, \ldots, w_N^T) \in R^{NP}$. $\xi = (\xi_1, \ldots, \xi_P)^T \in R^P$ stands for the input vector. Here we have added a special input unit $\xi_P = -1$ corresponding to the biases $w_{nP}$. The weights on the connections between the product node and the summation nodes are fixed to 1. Let $g : R \to R$ be a given activation function for the output layer. For any given input ξ and weight w, the output of the network is

$$y = g\left( \prod_{i=1}^{N} (w_i \cdot \xi) \right). \tag{2.1}$$
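Equation 2.1 amounts to a handful of array operations. The sketch below is a minimal forward pass with a logistic activation; the weights and input are placeholders, with the last input component fixed to −1 as the bias unit described above.

```python
import numpy as np

def psn_output(w, xi, g=lambda t: 1.0 / (1.0 + np.exp(-t))):
    """Pi-sigma output y = g(prod_i (w_i . xi)); w has shape (N, P)."""
    return g(np.prod(w @ xi))

rng = np.random.default_rng(0)
w = rng.uniform(-0.15, 0.15, size=(3, 4))   # N = 3 summing units, P = 4 inputs
xi = np.array([0.5, -0.2, 0.8, -1.0])       # xi_P = -1 is the bias input
print(psn_output(w, xi))
```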
2.2 Randomized Online Gradient Algorithm for PSN. The online gradient algorithm is a simple and efficient learning method for feedforward neural networks. PSNs and networks with pi-sigma building blocks are usually also trained with it, but with a randomized modification (or an asynchronous modification, which is not considered in this letter).
Figure 1: A pi-sigma network.
Let $\{\xi^j, O^j\}_{j=1}^{J} \subset R^P \times R$ be the training examples. The usual mean square error function for the network is

$$E(w) = \frac{1}{2} \sum_{j=1}^{J} (O^j - y^j)^2 = \frac{1}{2} \sum_{j=1}^{J} \left( O^j - g\left( \prod_{i=1}^{N} (w_i \cdot \xi^j) \right) \right)^2. \tag{2.2}$$

The gradient of the error function with respect to $w_n$ $(n = 1, 2, \ldots, N)$ is

$$D_n E(w) = -\sum_{j=1}^{J} \left( O^j - g\left( \prod_{i=1}^{N} (w_i \cdot \xi^j) \right) \right) g'\left( \prod_{i=1}^{N} (w_i \cdot \xi^j) \right) \left( \prod_{\substack{i=1 \\ i \neq n}}^{N} (w_i \cdot \xi^j) \right) \xi^j. \tag{2.3}$$
The updating rule of the online gradient algorithm with a randomized modification for the weight vector of PSN is as follows (Shin & Ghosh, 1991). Start from an arbitrary initial guess $w^0$. At the mth step $(m = 0, 1, \ldots)$ of the training iteration, we randomly choose an input example $\xi^j$ and a summing unit n and update the corresponding connecting weights while the weights of the other summing units remain unchanged, so the online gradient algorithm iteratively updates the weights as

$$w_n^{m+1} = w_n^m + \Delta^j w_n^m, \tag{2.4a}$$

$$w_q^{m+1} = w_q^m, \quad q \neq n, \tag{2.4b}$$
where

$$\Delta^j w_n^m = \eta \left( O^j - g\left( \prod_{i=1}^{N} w_i^m \cdot \xi^j \right) \right) g'\left( \prod_{i=1}^{N} w_i^m \cdot \xi^j \right) \left( \prod_{\substack{i=1 \\ i \neq n}}^{N} w_i^m \cdot \xi^j \right) \xi^j \tag{2.5}$$

and η > 0 is the learning rate. In this letter, we use C for a general positive constant, which may be different in different places but is independent of p, n, j, m, and η.
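A minimal sketch of the randomized online update of equations 2.4 and 2.5, assuming a logistic activation as in the experiments of section 4; the target value, learning rate, and initialization are placeholders.

```python
import numpy as np

def g(t):
    return 1.0 / (1.0 + np.exp(-t))

def online_step(w, xi, O, eta, rng):
    """One step of eqs. 2.4-2.5: update one randomly chosen summing unit."""
    n = rng.integers(len(w))                 # random summing unit (eq. 2.4)
    h = w @ xi                               # h[i] = w_i . xi
    net = np.prod(h)
    prod_rest = np.prod(np.delete(h, n))     # product over i != n
    g_prime = g(net) * (1.0 - g(net))        # derivative of the logistic
    w[n] += eta * (O - g(net)) * g_prime * prod_rest * xi   # eq. 2.5
    return w

rng = np.random.default_rng(0)
w = rng.uniform(-0.15, 0.15, size=(3, 4))    # small initial weights
xi = np.array([1.0, -1.0, 1.0, -1.0])        # last component: bias input
for _ in range(100):
    w = online_step(w, xi, O=1.0, eta=0.3)
```

With small initial weights, `prod_rest` is itself of order δ^(N−1), which is precisely the slow-start effect quantified by proposition 1 below.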
2.3 Influence of Small Weights on the Training. Condition A on the activation function will be needed in our analysis: |g(t)| and |g'(t)| are uniformly bounded for t ∈ R. It is satisfied by, for example, sigmoid functions, which are the most popular activation functions. Now let us illustrate the influence of small weights on the training process. The following proposition estimates the magnitudes of the increment of the weight vector, $\Delta^j w_n^m$, and of the change of the error function, $\Delta E(w^m)$, respectively.

Proposition 1. Let condition A be satisfied and the sequence $\{w^m\}_{m=1}^{\infty}$ be generated by the learning algorithm 2.4. Then there exists a positive constant C, independent of η, n, p, j, and m, such that

$$\|\Delta^j w_n^m\| \le C \eta \delta_m^{N-1} \tag{2.6}$$

and

$$|\Delta E(w^m)| \le C \eta \delta_m^{2(N-1)}, \tag{2.7}$$

where $\delta_m = \max_{1 \le n \le N} \|w_n^m\|$, and $\delta_m^{N-1}$ and $\delta_m^{2(N-1)}$ denote the (N − 1)th and 2(N − 1)th powers of $\delta_m$, respectively.

Proof. Estimate 2.6 can be obtained directly from the definition of $\delta_m$, equation 2.5, and the Cauchy–Schwarz inequality.
To verify equation 2.7, we first expand $g\bigl(\prod_{i=1}^{N} (w_i^{m+1} \cdot \xi^j)\bigr)$ at $\prod_{i=1}^{N} (w_i^m \cdot \xi^j)$ by the Taylor theorem:

$$\begin{aligned} g\left( \prod_{i=1}^{N} w_i^{m+1} \cdot \xi^j \right) &= g\left( \bigl( (w_n^m + \Delta^j w_n^m) \cdot \xi^j \bigr) \prod_{\substack{i=1 \\ i \neq n}}^{N} w_i^m \cdot \xi^j \right) \\ &= g\left( \prod_{i=1}^{N} w_i^m \cdot \xi^j \right) + g'\bigl(t_{n,j}^m\bigr) \left( \prod_{\substack{i=1 \\ i \neq n}}^{N} w_i^m \cdot \xi^j \right) \bigl( \Delta^j w_n^m \cdot \xi^j \bigr), \end{aligned} \tag{2.8}$$

where $t_{n,j}^m$ is a real number between $\prod_{i=1}^{N} (w_i^{m+1} \cdot \xi^j)$ and $\prod_{i=1}^{N} (w_i^m \cdot \xi^j)$. According to equation 2.6 and the Cauchy–Schwarz inequality, we have

$$\left| g\left( \prod_{i=1}^{N} w_i^{m+1} \cdot \xi^j \right) - g\left( \prod_{i=1}^{N} w_i^m \cdot \xi^j \right) \right| \le C \eta \delta_m^{2(N-1)} \max_{x \in \mathbb{R},\, 1 \le j \le J} \left\{ g'(x), \|\xi^j\|^N \right\} \le C \eta \delta_m^{2(N-1)}. \tag{2.9}$$

This inequality, together with equation 2.2, implies

$$\begin{aligned} |\Delta E(w^m)| &= |E(w^{m+1}) - E(w^m)| \\ &= \frac{1}{2} \left| \sum_{j=1}^{J} \left( O^j - g\left( \prod_{i=1}^{N} w_i^{m+1} \cdot \xi^j \right) \right)^2 - \sum_{j=1}^{J} \left( O^j - g\left( \prod_{i=1}^{N} w_i^m \cdot \xi^j \right) \right)^2 \right| \\ &= \frac{1}{2} \left| \sum_{j=1}^{J} \left( g\left( \prod_{i=1}^{N} w_i^m \cdot \xi^j \right) - g\left( \prod_{i=1}^{N} w_i^{m+1} \cdot \xi^j \right) \right) \left( \left( O^j - g\left( \prod_{i=1}^{N} w_i^{m+1} \cdot \xi^j \right) \right) + \left( O^j - g\left( \prod_{i=1}^{N} w_i^m \cdot \xi^j \right) \right) \right) \right| \\ &\le C J \eta \delta_m^{2(N-1)} \le C \eta \delta_m^{2(N-1)}. \end{aligned}$$
It is well known that the initial guess $w^0$, and hence $\delta_0$, is required to be small in magnitude in order to avoid premature saturation of the weights. We then notice from equations 2.6 and 2.7 that $\Delta^j w_n^m$ and $\Delta E(w^m)$ are even smaller when $\delta_m$ is small, resulting in very slow convergence early in training. The situation gets worse for large N, as illustrated in Figure 2a. By proposition 1, a possible remedy might be to use a larger learning rate η to accelerate convergence. This strategy is actually used in Shin and Ghosh (1992, 1995). But a constant large learning rate may lead to oscillation of the error function when the weights are not small, as illustrated in Figure 2b. Thus, an adaptive variable learning rate seems desirable but is
Figure 2: The behaviors of PSN without penalty for a three-dimensional parity problem (cf. section 4.2). The numbers of neurons in the input and product layers are P = 4 and 1, respectively. The initial weights are selected randomly within the interval [−0.15, 0.15]. (a) The results of PSN with different numbers of summing units. The learning rate is 0.8 for (1) and 0.3 for (2) and (3): (1) N = 2; (2) N = 3; (3) N = 4. (b) The results of PSN with different learning rates. The number of summing units is N = 3: (1) η = 0.5; (2) η = 1.0; (3) η = 1.5.
not easy to define directly. The above argument motivates us to introduce, in the next section, a penalty for small weights into the error function, so as to resolve the problem in an adaptive manner.

3 Online Gradient Algorithm with a Penalty

Recall that the original task of the network training is to find $w^*$ such that $E(w^*) = \min_w E(w)$. Now we add a constraint on $\prod_{i=1}^{N} (w_i \cdot \xi^j)$ to obtain a constrained optimization problem,

$$\min_w E(w) \quad \text{s.t.} \quad \left| \prod_{i=1}^{N} (w_i \cdot \xi^j) \right| > \gamma, \quad j = 1, 2, \ldots, J, \tag{3.1}$$
where γ > 0 is an appropriate constant. This constrained optimization problem can be converted into an unconstrained one by using a Lagrangian multiplier λ:

$$\min_w \tilde{E}(w) \equiv \min_w \left( E(w) + \frac{\lambda}{2} \sum_{j=1}^{J} p_j(w) \right), \tag{3.2}$$
where λ > 0 is the penalty coefficient, and $p_j(w)$ $(j = 1, 2, \ldots, J)$ is the penalty function defined as

$$p_j(w) = \begin{cases} \left( \gamma - \left| \prod_{i=1}^{N} (w_i \cdot \xi^j) \right| \right)^2, & -\gamma < \prod_{i=1}^{N} (w_i \cdot \xi^j) < \gamma, \\[1ex] 0, & \text{elsewhere.} \end{cases}$$
Finally, we solve the new optimization problem, equation 3.2, by the online gradient algorithm, so the gradient of the error function with respect to $w_n$ $(n = 1, 2, \ldots, N)$ is replaced by

$$D_n \tilde{E}(w) = -\sum_{j=1}^{J} \left( O^j - g\left( \prod_{i=1}^{N} (w_i \cdot \xi^j) \right) \right) g'\left( \prod_{i=1}^{N} (w_i \cdot \xi^j) \right) \left( \prod_{\substack{i=1 \\ i \neq n}}^{N} (w_i \cdot \xi^j) \right) \xi^j + \frac{\lambda}{2} \sum_{j=1}^{J} D_n p_j(w), \tag{3.3}$$
where

$$D_n p_j(w) = \begin{cases} -2 \left( \gamma - \prod_{i=1}^{N} (w_i \cdot \xi^j) \right) \left( \prod_{\substack{i=1 \\ i \neq n}}^{N} (w_i \cdot \xi^j) \right) \xi^j, & 0 \le \prod_{i=1}^{N} (w_i \cdot \xi^j) < \gamma, \\[2ex] 2 \left( \gamma + \prod_{i=1}^{N} (w_i \cdot \xi^j) \right) \left( \prod_{\substack{i=1 \\ i \neq n}}^{N} (w_i \cdot \xi^j) \right) \xi^j, & -\gamma < \prod_{i=1}^{N} (w_i \cdot \xi^j) < 0, \\[2ex] 0, & \text{elsewhere.} \end{cases}$$

In place of equations 2.4a and 2.5, we have, respectively,

$$w_n^{m+1} = w_n^m + \tilde{\Delta}^j w_n^m \tag{3.4}$$

and

$$\tilde{\Delta}^j w_n^m = \eta \left( \left( O^j - g\left( \prod_{i=1}^{N} w_i^m \cdot \xi^j \right) \right) g'\left( \prod_{i=1}^{N} w_i^m \cdot \xi^j \right) \left( \prod_{\substack{i=1 \\ i \neq n}}^{N} w_i^m \cdot \xi^j \right) \xi^j - \frac{\lambda}{2} D_n p_j(w^m) \right). \tag{3.5}$$
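For concreteness, the penalized update of equations 3.4 and 3.5 might be sketched as below, with D_n p_j(w) implemented piecewise exactly as defined above; the logistic activation and all parameter values are placeholders.

```python
import numpy as np

def d_penalty(w, xi, n, gamma):
    """D_n p_j(w): piecewise gradient of the penalty term in eq. 3.2."""
    h = w @ xi
    net = np.prod(h)
    prod_rest = np.prod(np.delete(h, n))
    if 0.0 <= net < gamma:
        return -2.0 * (gamma - net) * prod_rest * xi
    if -gamma < net < 0.0:
        return 2.0 * (gamma + net) * prod_rest * xi
    return np.zeros_like(xi)

def penalized_step(w, xi, O, eta, gamma, lam, rng):
    """One randomized step of eqs. 3.4-3.5 (logistic g assumed)."""
    g = lambda t: 1.0 / (1.0 + np.exp(-t))
    n = rng.integers(len(w))
    h = w @ xi
    net = np.prod(h)
    prod_rest = np.prod(np.delete(h, n))
    g_prime = g(net) * (1.0 - g(net))
    grad = ((O - g(net)) * g_prime * prod_rest * xi
            - 0.5 * lam * d_penalty(w, xi, n, gamma))
    w[n] += eta * grad
    return w
```

When the product of the summing-unit outputs lies inside the band (−γ, γ), the penalty gradient pushes it away from zero, enlarging the otherwise tiny update; outside that band the step reduces to the plain update of equation 2.5.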
From equations 3.1 and 3.2, we see that γ decides when the penalty starts to activate, and λ determines how strong the activation is. Because we aim to penalize only the small weight update, the value of γ should not
be too big. On the other hand, λ needs to be large, as is usually required in penalty methods for optimization problems. A practical choice of γ and λ can be made from experience or from simulation experiments. Our experiments below indicate that the values of γ and λ can be chosen in a comparatively wide range.

4 Simulation Results

In this section, we test the performance of our penalty approach on two simulation examples. The training iteration stops when one of the following criteria is met:

Criterion 1: The iteration step is less than 5000, and $\|\Delta^j w_n^m\| < 0.00001$ for three successive iterations.

Criterion 2: $\|\Delta^j w_n^m\| < 0.00001$ does not hold for three successive iterations, but the iteration step has reached 5000.

4.1 Example 1: Function Approximation Problems. First, we trained the PSN to approximate the function f(x) = cos(πx/2), x ∈ [−1, 1]. The aim of this test is to compare the behaviors of the online gradient algorithms with and without penalty. The numbers of neurons in the input, summation, and product layers are P = 2, N = 3, and 1, respectively. The output activation function is $g(x) = \frac{1}{1 + e^{-x}}$. The parameters in this example take the values η = 0.15, γ = 0.3, and λ = 40, and the initial weights are selected randomly within the interval [−0.15, 0.15]. Figure 3a shows the evolution of the error function E(w) in the case without penalty, where the curve decreases slowly and stays flat until iteration step 4012. The value of the product unit is correspondingly small, as shown in Figure 3b. The result of the online gradient algorithm with a penalty is shown in Figures 3c and 3d. The new error function $\tilde{E}(w)$ defined in equation 3.2 decreases fairly fast and meets criterion 1 at iteration step 468.

Next, we use the 2D Gabor function to show the function approximation capability of PSN with penalty. The 2D Gabor function has the following form (Shin & Ghosh, 1992):

$$h(x, y) = \frac{1}{2\pi (0.5)^2}\, e^{-\frac{x^2 + y^2}{2(0.5)^2}} \cos\bigl(2\pi (x + y)\bigr).$$
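The 2D Gabor target and the evenly spaced 6 × 6 sampling grid detailed in the next paragraph can be generated as follows; the random split into training and test halves mirrors the description in the text, and the seed is arbitrary.

```python
import numpy as np

def gabor(x, y, sigma=0.5):
    """The 2D Gabor function of Shin and Ghosh (1992)."""
    return (np.exp(-(x ** 2 + y ** 2) / (2.0 * sigma ** 2))
            / (2.0 * np.pi * sigma ** 2) * np.cos(2.0 * np.pi * (x + y)))

g = np.linspace(-0.5, 0.5, 6)                 # evenly spaced 6 x 6 grid
pts = np.array([(x, y) for x in g for y in g])
rng = np.random.default_rng(0)
idx = rng.permutation(len(pts))
train, test = pts[idx[:18]], pts[idx[18:]]    # 18 training, 18 test points
train_targets = gabor(train[:, 0], train[:, 1])
```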
Thirty-six input points are selected from an evenly spaced 6 × 6 grid on −0.5 ≤ x ≤ 0.5 and −0.5 ≤ y ≤ 0.5. Eighteen input points are randomly selected from the 36 points as the training set. The remaining 18 points are used to test generalization. The numbers of neurons in the input, summation, and product layers are P = 3, N = 6, and 1, respectively. The parameters in this example take the values η = 0.2, γ = 0.06, and λ = 15, and the initial weights are selected randomly within the interval [−2, 2].
Figure 3: Performances of the online gradient algorithm (a, b) without penalty and (c, d) with penalty.

Figure 4: The approximation results for the Gabor function: (a) Gabor function; (b) actual network outputs for PSN approximation with penalty; (c) actual network outputs for PSN approximation without penalty.
The output activation function is g(x) = tanh(x). The training and testing errors of PSN with penalty reach 0.098 and 0.89 after 2351 epochs, while those of PSN without penalty are 0.1 and 0.88 after 4673 epochs. Figure 4 shows the Gabor function and the actual network outputs.
Figure 5: Performances of the online gradient algorithm (1) with penalty and (2) without penalty.
Again we see from the experiment that our penalty algorithm can expedite the training of PSN.

4.2 Example 2: K-Dimensional Parity Problems. The K-dimensional parity problem is a difficult classification problem. There are 2^K input patterns in the K-dimensional vector space. The values of the components of the input patterns are either +1 or −1. The output is 1 when the input pattern contains an odd number of +1s and is −1 otherwise. The famous XOR problem is just the two-dimensional parity problem. In these experiments, the output activation function is $g(x) = \frac{1}{1 + e^{-x}}$.

First, we investigate the behaviors of PSN with and without penalty on the three-dimensional parity problem. The numbers of neurons in the input, summation, and product layers are P = 4, N = 3, and 1, respectively. The initial weights are selected randomly within the interval [−0.15, 0.15]. The parameters in this example are η = 0.2, γ = 0.05, and λ = 10. The result is shown in Figure 5.

Next, we compare the two approaches, using either a large learning rate or a penalty, on the four-dimensional parity problem. The numbers of neurons in the input, summation, and product layers are P = 5, N = 4, and 1, respectively. The parameters are w ∈ [−0.2, 0.2], γ = 0.05, and λ = 10. Ten tests are carried out with different parameters. It is clear from Figure 6 that the penalty approach with η ∈ [0.3, 0.5] has better convergence and stability than the large-learning-rate approach with η ∈ [1.0, 1.4].

Finally, we choose different values of λ and γ to test their influence on the performance of the online gradient algorithm with a penalty for the four-dimensional parity problem. Ten tests are carried out with different values of λ and γ.
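The parity patterns described above are easy to enumerate exhaustively. The sketch below uses the ±1 target coding of the text and appends the bias input ξ_P = −1; nothing here beyond the problem definition is taken from the letter.

```python
import numpy as np
from itertools import product

def parity_data(K):
    """All 2^K patterns of the K-dimensional parity problem (+1/-1 coding)."""
    X = np.array(list(product([-1.0, 1.0], repeat=K)))
    y = np.where((X == 1).sum(axis=1) % 2 == 1, 1.0, -1.0)
    X = np.hstack([X, -np.ones((len(X), 1))])   # bias input xi_P = -1
    return X, y

X, y = parity_data(3)   # gives P = 4 inputs including the bias, as in the text
```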
Figure 6: Performances of the online gradient algorithm (a, b) without penalty and (c, d) with penalty.
Table 1: Number of Successful Tests in 10 Tests.

           λ=4   λ=8   λ=12  λ=16  λ=20  λ=24  λ=28  λ=32  λ=36
γ = 0.01    6     6     1     6     4     8     4     6     2
γ = 0.02    4     6     4     6     5     6     5     4     4
γ = 0.03    1     9     5     6     6     5     5     3     4
γ = 0.04    6     6     5     5     7     4     3     6     5
γ = 0.05    4     7     6     5     6     8     3     3     2
γ = 0.06    5     3     5     5     4     2     3     2     2
γ = 0.07    6     5     5     5     4     5     3     5     3
γ = 0.08    9     5     5     3     5     1     0     1     1
γ = 0.09    7     4     1     3     1     3     1     2     1
The number of successful tests (namely, the tests that stop by criterion 1) out of the 10 tests is given in Table 1. The average stopping steps of the 10 tests are shown in Table 2. We observe that λ in [8, 24] and γ in [0.01, 0.05] give better convergence and stability.
Table 2: Average Iteration Steps of 10 Tests.

           λ=4    λ=8    λ=12   λ=16   λ=20   λ=24   λ=28   λ=32   λ=36
γ = 0.01   3124   3447   4808   3893   3871   3051   4459   3539   4428
γ = 0.02   3976   3439   3961   3350   2562   3864   3405   3598   3734
γ = 0.03   4859   2834   3506   3038   3131   3175   4066   4242   3615
γ = 0.04   3700   3663   3660   3869   3725   2732   4111   3036   3173
γ = 0.05   3807   3123   3133   3253   2516   3168   3959   3909   4449
γ = 0.06   3625   4030   3367   3475   3456   4245   3903   4173   4248
γ = 0.07   3596   3393   3379   3252   3677   3743   3986   3116   3812
γ = 0.08   2886   3235   3788   4100   3429   4566   5000   4563   4575
γ = 0.09   3253   3559   4558   3952   4697   4554   4693   4491   4534
Acknowledgments

This work was partly supported by the National Natural Science Foundation of China (10471017).

References

Durbin, R., & Rumelhart, D. (1989). Product units: A computationally powerful and biologically plausible extension to backpropagation networks. Neural Computation, 1, 133–142.
Giles, C. L., & Maxwell, T. (1987). Learning invariance and generalization in higher-order neural networks. Applied Optics, 26(23), 4972–4978.
Hussain, A. J., & Liatsis, P. (2002). Recurrent pi-sigma networks for DPCM image coding. Neurocomputing, 55, 363–382.
Jiang, L. J., Xu, F., & Piao, S. R. (2005). Application of pi-sigma neural network to real-time classification of seafloor sediments. Applied Acoustics, 24, 346–350.
Li, C. K. (2002). Memory-based sigma-pi-sigma neural network. International Conference on Systems, Man and Cybernetics, 4, 1–6.
Li, C. K. (2003). A sigma-pi-sigma neural network (SPSNN). Neural Processing Letters, 17, 1–19.
Shin, Y., & Ghosh, J. (1991). The pi-sigma network: An efficient higher-order neural network for pattern classification and function approximation. International Joint Conference on Neural Networks, 1, 13–18.
Shin, Y., & Ghosh, J. (1992). Approximation of multivariate functions using ridge polynomial networks. In Proceedings of the International Joint Conference on Neural Networks (Vol. 2, pp. 380–385). Piscataway, NJ: IEEE Press.
Shin, Y., & Ghosh, J. (1995). Ridge polynomial networks. IEEE Transactions on Neural Networks, 6(3), 610–622.
Sinha, M., Kumar, K., & Kalra, P. K. (2000). Some new neural network architectures with improved learning schemes. Soft Computing, 4(4), 214–223.
Voutriaridis, C., Boutalis, Y. S., & Mertzios, B. G. (2003). Ridge polynomial networks in pattern recognition. In 4th EURASIP Conference Focused on Video/Image
Processing and Multimedia Communications (Vol. 2, pp. 519–524). Piscataway, NJ: IEEE Press.
Xin, L. (2002). Interpolation by ridge polynomials and its application in neural networks. Journal of Computational and Applied Mathematics, 144, 197–209.
Received July 21, 2006; accepted January 24, 2007.
LETTER
Communicated by Hava Siegelmann
Multiorder Neurons for Evolutionary Higher-Order Clustering and Growth

Kiruthika Ramanathan
kiruthika [email protected]
Department of Electrical and Computer Engineering, National University of Singapore, Singapore 117576

Sheng-Uei Guan
[email protected]
School of Engineering and Design, Brunel University, Uxbridge, Middlesex UB8 3PH, U.K.
This letter proposes to use multiorder neurons for clustering irregularly shaped data arrangements. Multiorder neurons are an evolutionary extension of the use of higher-order neurons in clustering. Higher-order neurons parametrically model complex neuron shapes by replacing the classic synaptic weight by higher-order tensors. The multiorder neuron goes one step further and eliminates two problems associated with higher-order neurons. First, it uses evolutionary algorithms to select the best neuron order for a given problem. Second, it obtains more information about the underlying data distribution by identifying the correct order for a given cluster of patterns. Empirically we observed that when the correlation of clusters found with ground truth information is used in measuring clustering accuracy, the proposed evolutionary multiorder neurons method can be shown to outperform other related clustering methods. The simulation results from the Iris, Wine, and Glass data sets show significant improvement when compared to the results obtained using self-organizing maps and higher-order neurons. The letter also proposes an intuitive model by which multiorder neurons can be grown, thereby determining the number of clusters in data.

1 Introduction

Self-organizing maps (SOMs) (Kohonen, 1997) represent a type of neural network proposed for clustering. They work by assigning a synaptic weight (denoted by the column vector $w^{(j)}$) to each neuron. The winning neuron j(x) is the neuron that has the highest correlation with the input x, that is, the neuron for which $w^{(j)T} x$ is the largest:

$$j(x) = \arg\max_j \; w^{(j)T} x. \tag{1.1}$$
where the operator ‖·‖ represents the Euclidean norm of a vector. The idea behind equation 1.1 is to select the neuron that exhibits maximum correlation with the input. Often used instead of equation 1.1 is the minimum Euclidean distance matching criterion (Kohonen, 1997):

$$j(x) = \arg\min_j \; \|w^{(j)} - x\|. \tag{1.2}$$
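Both matching criteria reduce to a single argmax or argmin over the neurons. A minimal numpy sketch, with placeholder weights and input:

```python
import numpy as np

def winner_correlation(W, x):
    """Eq. 1.1: the neuron whose weight vector best correlates with x."""
    return int(np.argmax(W @ x))

def winner_euclidean(W, x):
    """Eq. 1.2: the neuron whose weight vector is closest to x."""
    return int(np.argmin(np.linalg.norm(W - x, axis=1)))

W = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])   # hypothetical weights
x = np.array([0.9, 0.2])
print(winner_correlation(W, x), winner_euclidean(W, x))
```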
However, the use of the highest-correlation or minimum-distance matching criterion implies that (1) the features of the input domain are spherical, that is, deviations are equal in all dimensions, and (2) the distance between features must be larger than the distance between points within a feature. These two implications can be summarized in the following equations:

$$\lambda_m \big|_I \approx \lambda_n \big|_I, \quad 1 \le m, n \le d, \; m \neq n, \tag{1.3}$$

$$\forall x, y \in I, \quad x^T y \,\big|_{j(x) = j(y)} > x^T y \,\big|_{j(x) \neq j(y)}. \tag{1.4}$$
In these equations, the $|_I$ operator denotes evaluation on the covariance of the data matrix I, $\lambda_m$ is the mth eigenvalue, d is the dimension of the input domain, and x and y are arbitrary data vectors. It can be difficult for an arbitrary data set to fulfill these conditions, especially when such distributions are difficult to visualize and detect because of the high dimensionality of the problem. Even when the distribution of the data is detected through visualization, if the conditions described in equations 1.3 and 1.4 are not satisfied, there is no guarantee that the use of an SOM will solve the problem. Consider, for example, the arrangements of data clusters in two dimensions shown in Figure 1. Table 1 summarizes the properties of these data sets in terms of their ability to satisfy equations 1.3 and 1.4. Due to the nature of the data, there is no guarantee that the SOM will be able to correctly cluster the data sets other than data A. Several methods have been proposed to solve these problems. Second-order metrics are a popular choice, as the use of an inverse covariance matrix in the clustering metric captures linear directionality properties of a cluster (Lipson, Hod, & Siegelmann, 1998). The concept of second-order curves was expanded in several cases to include second-order shells (Krishnapuram, Frigui, & Nasraoui, 1995). Kohonen discussed the use of a weighted Euclidean distance that captures different variances in the components of the input signals. The use of the Mahalanobis distance (Mao & Jain, 1996) was also considered. On the other hand, nonparametric techniques such as agglomeration (Blatt, Wiseman, & Domany, 1996) attempt to find arbitrary shapes in the clusters. Improvements have been observed empirically in the application of these second-order and arbitrarily shaped clusters. However,
Figure 1: Artificially generated two-dimensional two-class clusters illustrating the weakness of SOMs.

Table 1: Summary of the Properties of the Data in Figure 1.

Data Name   Spherical Clusters: Satisfies Equation 1.3?   Clusters Sufficiently Far Apart: Satisfies Equation 1.4?
Data A      Yes                                           Yes
Data B      Yes                                           No
Data C      No                                            Yes
Data D      No                                            No
their performance is dependent on several factors, including their ability to satisfy equation 1.4, the distribution of data, and the shape of the underlying data. Lipson and Siegelmann (2000) proposed the generalized higher-order neuron (HON) structure, a parametric algorithm that extended secondorder surfaces to arbitrary order hypersurfaces, thereby giving rise to a continuum of possible cluster shapes between the classic spherical-ellipsoidal unit and the fully nonparametric agglomerative algorithm. To summarize, HON relaxes the shape restriction of classic neurons by replacing the weight vector or covariance matrix of a classic neuron with a general higher-order tensor, which can capture multilinear correlations among the signals associated with the neurons. They can also capture shapes with holes or detached areas.
It must be noted that the higher-order neuron proposed by Lipson and Siegelmann (2000) is different from that used by Balestrino and Verona (1994). Although Balestrino and Verona also used the term higher-order neurons to denote their work, their higher-order neurons were not used in competitive learning and clustering. Lipson and Siegelmann show that their higher-order neurons are tensors and that their order pertains to their internal tensorial order rather than the degree of their inputs. The proposal of Lipson and Siegelmann's higher-order neurons paves the way for a natural extension, and it is this notion of higher-order neurons that we use in this letter.

A distinction also has to be drawn between higher-order neurons as described by Lipson and Siegelmann (2000) and used in this letter and other high-dimensional clustering methods. Ben-Hur, Horn, Siegelmann, and Vapnik (2001), for instance, proposed a method of clustering using support vectors. That method maps the data into a high-dimensional space and finds hyperspheres there; the hypersphere data are then mapped back to determine the final clustering arrangement. In contrast, HONs are parametric methods in which the clusters are represented by their eigentensors, which are derived from their covariance matrices. There is no higher-dimensional mapping.

The use of higher-order neurons opens the possibility of parametrically extending neuron shapes to a continuum of shapes, both intuitively and through simulation results. Before the development of higher-order neurons, arbitrary cluster shapes could be identified only by nonparametric methods such as agglomerative clustering. The introduction of higher-order neurons enables the representation of clusters by their corresponding eigentensors, and the clusters can be trained with Hebbian learning. Unsupervised learning with higher-order neurons can also be extended to supervised (Lipson & Siegelmann, 2000) and semisupervised learning.

Despite the advantages of higher-order neurons, preliminary experiments showed a significant difference in their performance across problems and across clusters within the same problem. We observed that the choice of neuron order plays an important part in the clustering accuracy of the neurons. In higher-order neurons, this choice of neuron order is manual and hence can be suboptimal. In this letter, we aim to solve the problem of neuron order by using evolutionary algorithms to extend the idea of higher-order neurons. The proposal of evolutionary higher-order neurons (eHONs, or multiorder neurons) aims to serve two purposes. To illustrate the problem scope, consider the simple data distribution shown in Figure 2. Due to the nonspherical nature of clusters 2 and 3, the use of SOMs may not be suitable for correctly clustering the patterns in Figure 2, and it is difficult to tell by inspection whether higher-order neurons can solve this problem. From Figure 2, we can see that cluster 2 is spherical in nature. This cluster alone can be clustered properly by an
Figure 2: Hypothetical clusters to illustrate the performance scope of eHONs.
SOM. Cluster 1 is elliptical in nature, and cluster 3 is banana shaped. If the orders of these clusters can be identified, the clustering can be performed more accurately. In higher-order neurons, all the clusters in a problem are given a single order; multiple orders, such as those required for the clusters in Figure 2, are not provided. As a result, the higher-order neuron may not be able to correctly cluster a situation such as that of cluster 1. The two purposes of eHONs are:

1. To identify the correct order for a given set of data and cluster the data according to that order

2. To identify the possibility of clusters of multiple orders, assign a suitable order to each cluster, and perform the clustering

The problem of order choice that occurs in HONs is thus circumvented by the use of eHONs. Evolutionary algorithms take into account that different clusters in the same problem may not be of the same order and use homogeneous coordinates to combine neurons of different orders. Because evolutionary higher-order neurons are used to create a multiorder solution to a problem, the terms evolutionary higher-order neurons (eHONs) and multiorder neurons are used interchangeably in this letter.

The rest of the letter is organized as follows. Section 2 explains the related theory of higher-order neurons and evolutionary algorithms. Section 3 describes the evolutionary learning of the higher-order neurons. Section 4 describes the results of higher-order neurons and evolutionary higher-order neurons on real-world data. Section 5 proposes a method for growing eHONs, enabling automatic detection of clusters. The letter concludes with some remarks and future extensions.

2 Related Theory

In the rest of the letter, we adopt the same notation as that of Lipson and Siegelmann (2000):
x, w: column vectors
W, D: matrices
$x^{(j)}, D^{(j)}$: the jth vector/matrix, corresponding to the jth neuron
$x_i, D_{ij}$: elements of a vector/matrix
$x_H, D_H$: vector/matrix in homogeneous coordinates
m: order of the higher-order neuron
d: dimensionality of the input
N: number of neurons in a cluster

We define the following terms as used in the letter:

Higher-order neurons (HON): The method proposed by Lipson and Siegelmann (2000), which represents a cluster by a neuronal order and a tensorial structure. Section 2.1 describes the higher-order neuronal structure and its training algorithm.

Self-organizing maps (SOM): A competitive neuronal structure for unsupervised learning introduced by Kohonen (1997).

Evolutionary self-organizing maps (eSOM): A model of self-organizing maps proposed by Painho and Bacao (2000).

Evolutionary higher-order neurons (eHONs), or multiorder neurons: Proposed in this letter as an evolutionary, batch-mode version of higher-order neurons. They use evolutionary methods to train HONs, evolving not only the self-organizing map structure but also the order of each neuron.

2.1 Higher-Order Neurons. The HON structure (Lipson, Hod, & Siegelmann, 1998) generalizes the spherical and ellipsoidal scheme used by self-organizing maps and second-order structures. The use of a higher-order neuron structure gives rise to a continuum of cluster shapes between the classic spherical-ellipsoidal clustering systems and the fully nonparametric approach. Clusters of data in the real world usually have arbitrary shapes. The "shape" of a cluster is characterized by the order of the neuron representing it.

The constraint of the single-order neuron (used by SOMs) comes from the fact that the weight $w^{(j)}$ of the neuron has the same dimensionality as the input x, a single point in the domain. However, the neuron is modeling a cluster of points with attributes such as directionality and curvature. To circumvent this restriction, the higher-order neuronal structure augments the neuron by allowing it to map geometric and topological information; for example, a second-order neuron allows for the orientation and size of the cluster, thus introducing a new metric to the system. Each higher-order neuron therefore aims to represent not only the mean value of the data points but also the principal directions, curvatures, and other features of the cluster in these directions. Lipson and Siegelmann (2000)
intuitively expressed a higher-order metric as defining a multidimensional shape with an adaptive topology, as opposed to defining a sphere. They also showed that higher-order neurons exhibit stability and good training performance with Hebbian learning.

2.1.1 Tensorial Multiplication and Neuron Shapes. A second-order neuron is typically expressed by Lipson and Siegelmann (2000) as a matrix $D^{(j)} \in R^{d \times d}$ that expresses the direction and size properties of a neuron for each dimension of the input, where d is the number of dimensions. The Euclidean distance metric in equation 1.2 can therefore be rewritten as

$$j(x) = \arg\min_j \; \|D^{(j)} (w^{(j)} - x)\|, \tag{2.1}$$

where $D^{(j)}$ can simply be expressed as a product of its eigenvalue and eigenvector matrices:

$$D^{(j)} = \begin{pmatrix} \frac{1}{\sqrt{\sigma_1}} & 0 & \cdots & 0 \\ 0 & \frac{1}{\sqrt{\sigma_2}} & & \vdots \\ \vdots & & \ddots & 0 \\ 0 & \cdots & 0 & \frac{1}{\sqrt{\sigma_d}} \end{pmatrix} \begin{pmatrix} v_1^T \\ v_2^T \\ \vdots \\ v_d^T \end{pmatrix}. \tag{2.2}$$

In contrast to equation 1.2, the second-order neuron in equation 2.1 is represented by two variables, $(D^{(j)}, w^{(j)})$, where $D^{(j)}$ represents rotation and scaling and $w^{(j)}$ represents translation. As $D^{(j)}$ can be obtained from the eigenstructure of the correlation matrix of the data associated with neuron j, equation 2.2 can be rewritten as

$$D^{(j)} = \lambda^{(j)-\frac{1}{2}} V^{(j)}. \tag{2.3}$$

Here, $\lambda^{(j)}$ refers to the eigenvalues of $D^{(j)}$ in diagonal form, and $V^{(j)}$ contains its unit eigenvectors in matrix form. Since the eigenvalues of a matrix A satisfy $AV = \lambda V$, so that $A = V^T \lambda V$, the squared metric $\|D^{(j)}(w^{(j)} - x)\|^2$ in equation 2.1 can be written as

$$\|D^{(j)} (w^{(j)} - x)\|^2 = (w^{(j)} - x)^T D^{(j)T} D^{(j)} (w^{(j)} - x). \tag{2.4}$$
From equation 2.2, $D^{(j)T} D^{(j)}$ can be written as

$$D^{(j)T} D^{(j)} = \begin{pmatrix} v_1 & v_2 & \cdots & v_d \end{pmatrix} \begin{pmatrix} \frac{1}{\sigma_1} & 0 & \cdots & 0 \\ 0 & \frac{1}{\sigma_2} & & \vdots \\ \vdots & & \ddots & 0 \\ 0 & \cdots & 0 & \frac{1}{\sigma_d} \end{pmatrix} \begin{pmatrix} v_1^T \\ v_2^T \\ \vdots \\ v_d^T \end{pmatrix} = R^{(j)-1}, \tag{2.5}$$

where $R^{(j)} = \sum x^{(j)} x^{(j)T} \in R^{d \times d}$ is the correlation matrix of the data associated with neuron j and $R^{(j)-1}$ is its inverse. The minimum Euclidean distance rule can therefore be written as

$$j(x) = \arg\min_j \left[ (x - w^{(j)})^T R^{(j)-1} (x - w^{(j)}) \right], \quad j = 1, 2, \ldots, N. \tag{2.6}$$
This use of the correlation matrix for clustering was used by Gustafson and Kessel (1979). Although the use of the correlation matrix somewhat simplifies the clustering process, it still requires the optimization of two terms, $R^{(j)}$ and $w^{(j)}$. Since separate optimizations cannot easily be derived, Lipson and Siegelmann (2000) proposed combining the two parameters into one expanded covariance matrix in homogeneous coordinates:

$$R_H^{(j)} = \sum x_H^{(j)} x_H^{(j)T}, \tag{2.7}$$

where $x_H^{(j)} \in R^{(d+1) \times 1}$. This expanded matrix was proposed by Faux and Pratt (1981), and it contains coordinate permutations of $x_H = \binom{x}{1}$, involving 1 as one of the multiplicands. The Euclidean distance matching criterion can now be written as

$$j(x) = \arg\min_j \left[ x_H^T R_H^{(j)-1} x_H \right], \quad j = 1, 2, \ldots, N. \tag{2.8}$$
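A second-order winner selection in homogeneous coordinates, following equations 2.7 and 2.8, might look as follows; the synthetic clusters and the query point are placeholders.

```python
import numpy as np

def expanded_covariance(X):
    """R_H = sum_x x_H x_H^T over the points assigned to one neuron (eq. 2.7)."""
    XH = np.hstack([X, np.ones((len(X), 1))])   # append homogeneous coordinate
    return XH.T @ XH

def winner(R_inv_list, x):
    """Eq. 2.8: j(x) = argmin_j x_H^T R_H^(j)^-1 x_H."""
    xH = np.append(x, 1.0)
    return int(np.argmin([xH @ R_inv @ xH for R_inv in R_inv_list]))

rng = np.random.default_rng(0)
clusters = [rng.normal(c, 0.3, size=(50, 2)) for c in ([0.0, 0.0], [3.0, 1.0])]
R_inv = [np.linalg.inv(expanded_covariance(C)) for C in clusters]
print(winner(R_inv, np.array([2.8, 0.9])))   # a point near the second cluster
```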
2.1.2 Higher Orders. Equation 2.8 can be extended beyond the second-order neuron structure. The ellipsoidal (second-order) neuron can capture only linear properties, not curved directions. Higher-order geometries were introduced by Lipson and Siegelmann (2000) to capture more complex spatial configurations. The higher-order extended covariance structure requires the creation of a higher-order correlation tensor. This can be obtained by multiplying the vector $x_H$ by itself, in successive outer products, more than once. The process yields a covariance tensor rather than a covariance matrix, such that $R_H^{(j)} \in$
$R^{(d+1) \times \cdots \times (d+1)}$. A covariance tensor of order m, for instance, can be obtained by 2(m − 1) outer products of $x_H$ with itself. Consequently, the inverse of the tensor can be used to compute the higher-order correlation of x with the tensor, using a tensor multiplication (⊕):

$$j(x) = \arg\min_j \; \left\| R_H^{(j)-1} \oplus x_H^{m-1} \right\|, \quad j = 1, 2, \ldots, N. \tag{2.9}$$
The metric in equation 2.8 will not necessarily be spherical; it may be disjoint or even take several other forms.

2.1.3 The HON Training Algorithm. The higher-order unsupervised learning algorithm proposed by Lipson and Siegelmann (2000) proceeds as follows:

1. Select the number of clusters (or number of neurons) K and the order m of the neurons for the given problem.

2. Initialize the neurons with $Z_H = \sum_{i=1}^{n} x_i^{2(m-1)}$. $Z_H$ is the covariance tensor of the data, initialized to a midpoint value. In the case of a second-order problem, the covariance tensor is simply the correlation matrix, $Z_H = \sum_{i=1}^{n} x_i x_i^T$. For higher-order tensors, this value is calculated by writing down $x_H^{m-1}$ as a vector with all the mth-degree permutations of $\{x_1, x_2, \ldots, x_d, 1\}$ and computing $Z_H$ as the sum of the outer products of all these vectors. The inverse of the tensor is computed and normalized by its determinant f to obtain $Z_H^{-1}/f$.

3. Compute the winning neuron for a given pattern using

$$j = \arg\min_j \; \left\| Z_H^{-1}/f \oplus x^{(m-1)} \right\|. \tag{2.10}$$

4. Update the winning neuron using $Z_{H,\text{new}} = Z_{H,\text{old}} + \eta x_i^{2(m-1)}$, where η is the learning rate. Store the new values of $Z_H$ and $Z_H^{-1}/f$.

5. Repeat steps 3 and 4.

Ideally, while the first-order neuron finds spherical shapes and the second-order neuron finds ellipsoidal shapes (with two principal directions), the third-order neuron, which makes use of a covariance tensor of cubical shape, finds four principal directions and copes with banana-shaped clusters. Figure 3 shows the neuron information of first-, second-, and third-order neurons, respectively.
3378
K. Ramanathan and S. Guan
Figure 3: The internal representation of a self-organizing neuron is a point for a (a) first-order neuron, (b) two eigenvectors (and values) for the second-order neuron, and (c) four eigenvectors (and values) for the third-order neuron. More eigenvectors (and values) are encoded in the higher-order tensors of neurons with a higher order.
2.2 Evolutionary Self-Organizing Maps. Evolutionary algorithms have been used to find global solutions in many applications, including neural network applications for supervised learning (Yao, 1993). Inspired by this, Painho and Bacao (2000) apply genetic algorithms to clustering problems with good effect. The genetic algorithm applied is simple and retains the form of SOMs, but with evolutionary representation of the weights. Genetic algorithms (Goldberg, 1989) make use of the Darwinian theory of the survival of the fittest. The parameters to be determined are encoded as an array of elements, also called a chromosome. The chromosomes are evaluated and allowed to reproduce according to their fitness values, so as to allow the development of new chromosomes with better fitness values. More simply, since the objective is to maximize the value w (k)T x for each pattern x, a population of real coded chromosomes encodes w (k) for each cluster k. Each chromosome therefore consists of K ∗ d elements, where K is the number of clusters and d is the dimension of the input data. The chromosomes are evaluated in batch mode so as to maximize K k=1 x∈Ck
w (k)T x.
(2.11)
Evolutionary Multiorder Neurons
3379
Crossover and mutation are performed, and a new generation of chromosomes is produced. The process is continued until the system stagnates or a maximum number of epochs is reached. 3 Evolutionary Higher-Order Neurons We propose evolutionary higher-order neurons (eHONs) as an extension of the higher-order neuron structure and the evolutionary self-organizing map. The idea of evolutionary higher-order neurons is motivated by the following reasoning. While Lipson and Siegelmann’s (2000) higher-order neurons are shown to exhibit improved performance, some of our simulations (presented in section 4) showed that the order of the neuron plays an important part in the meaning of the clusters formed. A higher-order neuron does not necessarily perform better than a lower-order neuron. In the HON algorithm described in section 2, the order of the neuron is prespecified by the user (Lipson & Siegelmann, 2000). Moreover, for a given data set, only a single order of neuron is used to represent all of the clusters in their work. We feel that this is a limitation to the algorithm. Data are usually distributed irregularly, with some classes having spherical, elliptical, banana-shaped, or even higher-order forms. We propose in this letter a messy evolutionary algorithm (Goldberg, Deb, & Korb, 1991) based multiorder HONs. The algorithm is outlined below. 3.1 Batch Version of HONs. Genetic algorithms (GA) are used in neural network applications with various scopes. Many of these GA-based neural networks (GANNs) are developed using batch mode training (Yao, 1993). A corresponding batch mode training version of HONs has also been developed to facilitate the process of developing eHONs. The batch algorithm developed is similar to the online algorithm proposed by Lipson and Siegelmann (2000) in section 3.1. However, instead (m−1) of choosing a winning neuron by using j = arg j min Z−1 , we H /f ⊕ x implemented a batch minimization criterion such as the one used in equation 2.11 of the evolutionary SOMs in section 3.2. The algorithm focuses on minimizing equation 3.1: K
−1 ZkH f k ⊕ x (m−1) .
(3.1)
k=1 x∈Ck
Preliminary simulation results to compare the effect of the batch mode algorithm with the online version developed by Lipson and Siegelmann (2000) showed that the batch mode implementation has results similar to those of the online version.
3380
K. Ramanathan and S. Guan
3.2 Chromosome Initialization. The initialization of chromosomes is done as outlined by the following steps: 1. Each data point is randomly assigned to one of the K clusters. 2. The covariance tensor of order m, ZkH , for the cluster k ∈ K is initialized as x 2(m−1) . (3.2) ZkH = x∈Ck
3.3 Chromosome Structure. Each chromosome is coded as an array of structures, each consisting of two components: the order of a chromosome and the value of the tensor, as given below: struct chromosome { neuron NEURON[K]; }; struct neuron { int order; int tensor[][]; }; A tensor, regardless of the neuron order, is flattened out into a Kronecker matrix (Graham, 1981) in a form similar to that used by Lipson and Siegelmann (2000). 3.4 Kronecker Notation. The Kronecker tensor product flattens out the elements of X ⊕ Y into a large matrix formed by taking all the possible products between the elements of X and those of Y. If X is a 2 × 2 matrix, then, X⊕Y =
Y.X1,1 Y.X2,1
Y.X1,2 Y.X2,2
(3.3)
Each term in equation 3.3 is a matrix of the size of Y. The Kronecker notation makes the internal structure of a higher-order tensor easy to perceive, and the eigenvalues of the tensor can be used in finding the equivalent of the minimum distance rule in equation 3.1. 3.5 Global Search: Muation. Global search for eHONs is simulated by large-range mutation. There are two criteria for large-range mutation: 1. A random element in a tensor is mutated with a probability p1 . 2. The order of the tensor is changed with a probability p2 . The tensor is now reinitialized using equation 3.2.
Evolutionary Multiorder Neurons
3381
3.6 Fitness Function. The fitness function is the optimization of the expression in equation 3.1. The expression is minimized so as to maximize the cluster tightness. 3.7 Algorithm Summary. The eHON algorithm can be summarized as: 1. Initialize a population of chromosomes denoting a set of multiorder tensors 2. While termination criterion is not met K a. Evaluate each chromosome x using f (x) = (ZkH )−1 / (m−1) x∈C k=1 k fk ⊕ x . b. Select the fittest chromosome. c. Perform mutation and reproduction. 4 Experimental Results Evaluation of the system is performed according to consistency with the ground truth information, which is measured by the number of misassigned patterns, that is, the number of patterns not clustered together with their true class. Mutation is carried out as follows. Mutation of a random element in the tensor is performed with an arbitrary probability of 0.2. The order of a tensor is changed with a probability of 0.1. Arbitrary values of mutation were chosen because we were interested in the scope of the eHON. Other values of the mutation and crossover probability did not show a significant change in the accuracy of the eHON. Each chromosome could take an order between 1 and 3. The relatively low order of neurons was preferred for this preliminary work because higher-order neurons require more computational effort than those of lower orders. We therefore felt that orders between 1 and 3 were sufficient to illustrate the scope of eHON. 4.1 Data and Algorithm Descriptions. This section compares the results of four algorithms: 1. eHONs: Higher-order self-organizing neurons with evolutionary capabilities that identify the best possible order for a chromosome based on a population generated with various multiorder chromosomes. 2. HON: The higher-order neuron structure proposed by Lipson and Siegelmann (2000). Simulations were run with order 2 and order 3. Order 2 detects elliptical, oval, and, to a certain extent, banana shapes; order 3 detects other higher-order shapes to a certain extent.
3382
K. Ramanathan and S. Guan
3. SOM: The self-organizing map or HON of order 1 (Kohonen, 1997). This algorithm detects spherical clusters and, to some extent, ovalshaped clusters. 4. eSOM: The self-organizing map or HON of order 1 trained using evolutionary algorithms similar to that proposed by Painho and Bacao (2000). This algorithm also detects spherical clusters and, to some extent, oval-shaped clusters. Algorithms 1 to 4 will be applied to the following benchmark data sets from the UCI machine learning repository (http://www.ics.uci.edu/ ∼mlearn/MLRepository.html): 1. Iris data set (4 dimensions, 3 clusters, 150 patterns) 2. Wine data set (13 dimensions, 3 clusters, 178 patterns) 3. Glass data set (9 dimensions, 6 classes, 214 patterns) Figure 4 shows the two-dimensional principal component projections of these data sets. 4.2 Effect of eHONs: Studies on the Iris Data Set. Figure 5 compares the clustering of the Iris data (with three clusters) using the SOMs, evolutionary SOMs, HONs, and eHONs. It is observed that over an average of 10 runs, the number of misassigned patterns of the data is as low as 3.5 using the eHON system as opposed to 4 using the HON system. Similarly, the eHON is also shown as being capable of identifying a suitable order for a given cluster. From the visualization of the Iris data in Figure 4 and the performance of the self-organizing map, we can see that a single-order neuron is sufficient to correlate one of the classes with the ground truth information. However, the other two classes are closer to each other and therefore need higher-order neurons. The eHON system correctly identified order 2 as suitable for these clusters and order 1 to be sufficient for the other cluster. 4.3 Correlation with Ground Truth Information in Real-World Data. In this section we present the correlation with ground truth information (the available class labels), for the Iris, Wine, and Glass data sets. We begin the experimental analysis by observing the clustering accuracy across different classes of each data set. Tables 2 and 3 show the correlation tables of the clusters found and the ground truth information (class labels) for the Iris, Wine, and Glass data sets when SOMs and higher-order neurons are used. The number of misclustered patterns is marked in bold in the tables. The tables can be read as follows. For instance, in Table 2, for the Iris data clustered with neuron order 3, 50 patterns from class 1 are clustered together, with 0 patterns clustered with patterns of either class 2 or 3. However, 2 patterns from class 2 are clustered with patterns of class 3 and 5 patterns
Evolutionary Multiorder Neurons
3383
Figure 4: Distribution of data in the (a) Iris (three clusters), (b) Wine (three clusters), and (c) Glass (six clusters) data sets.
from class 3 are clustered with patterns of class 2 (as indicated in bold). The total number of misclustered patterns is noted below each table. The following observations are made from Tables 2 to 4.:
r
The best solution is obtained for the Iris and Wine data sets with neuron order 2 and for the Glass data set with neuron order 1.
3384
K. Ramanathan and S. Guan
Figure 5: Clusters of the Iris data using (a) SOMs, (b) eSOMs, (c) HONs, and (d) eHONs.
Evolutionary Multiorder Neurons
3385
Table 2: Correlation Between Ground Truth Information and Cluster Information for the Iris Data Using Second- and Third-Order Neurons. Neuron Order = 2
Neuron Order = 3
Class 1
Class 2
Class 3
Class 1
Class 2
Class 3
50 0 0
0 48 2
0 2 48
50 0 0
0 48 2
0 5 45
Class 1 Class 2 Class 3
Total misclustered: 4
Total misclustered: 7
Table 3: Correlation Between Ground Truth Information and Cluster Information for the Wine Data Using SOMs (First-Order neurons), Second-Order and Third-Neurons. Order = 1
Order = 2
Order = 3
Class 1 Class 2 Class 3 Class 1 Class 2 Class 3 Class 1 Class 2 Class 3 Class 1 Class 2 Class 3
27 31 1
0 62 9
Total misclustered: 54
r
13 0 35
36 0 23
8 34 29
0 0 48
Total misclustered: 60
35 10 14
9 27 35
2 8 38
Total misclustered: 78
Different classes are better clustered with different orders of neurons. For instance, in the Wine data set, order 1 is most suitable for class 2, and order 2 is most suitable for classes 1 and 3.
The objective of the eHONs is therefore to solve for the best order for each class using evolutionary algorithms. Tables 5 to 7 compare the total number of misclustered patterns obtained when using SOMs, HONs, eSOMS, and eHONs. The eSOMs and eHON algorithms were run 10 times, and the mean results are presented. 4.4 Discussions. The results show the importance of the use of multiorder neurons in clustering, as opposed to higher-order neurons. The use of the evolutionary algorithm picks out the best possible cluster order for each class, thereby enabling better correlation of clusters with the class labels, making the clusters more meaningful. This property is observed in all three data sets. The introduction of cluster order serves a dual purpose. As shown in the tables, there is an obvious improvement in clustering accuracy when multiorder neurons are used. Although the improvement is beneficial, there is a cost encountered in computation. Each higher-order neuron requires
29 9 22 0 1 9
4 28 24 0 17 3
Class 2
Total misclustered: 119
Class 1 Class 2 Class 3 Class 4 Class 5 Class 6
Class 1
0 6 2 0 0 9
Class 3
0 0 0 10 3 0
Class 4
Order = 1
7 0 0 2 0 0
Class 5 0 0 0 1 2 26
Class 6 47 20 0 3 5 1
Class 2 11 0 2 0 1 3
Class 3
Total misclustered: 176
50 0 0 3 2 15
Class 1 8 1 0 2 0 2
Class 4
Order = 2
4 5 0 0 0 0
Class 5 4 14 0 5 1 5
Class 6 34 11 10 10 3 8
Class 2
11 3 1 1 1 0
Class 3
3 4 1 2 2 1
Class 4
Total misclustered = 157
36 5 13 2 10 4
Class 1
Order = 3
5 0 0 1 1 2
Class 5
2 3 5 6 6 7
Class 6
Table 4: Correlation Between Ground Truth Information and Cluster Information for the Glass Data Using SOMs (First-Order Neurons), Second-Order, and Third-Order Neurons.
3386 K. Ramanathan and S. Guan
Evolutionary Multiorder Neurons
3387
Table 5: Results of the Iris Data (Best Order: 1, 2, 2). Algorithm
Number of Misassigned Patterns
HON (higher-order neuron), order = 2 HON (higher-order neuron), order = 3 SOM (self-organizing map) HON order = 1 eSOMs (evolutionary evolved SOMs) eHONs (to find the optimal order for a given class)
4 7 23 16 3.5
Table 6: Results of the Wine Data (Best Order: 2, 1, 2). Algorithm
Number of Misassigned Patterns
HON (higher-order neuron), order = 2 HON (higher-order neuron), order = 3 SOM (self-organizing map) HON order = 1 eSOMs (evolutionary evolved SOMs) eHONs (to find the optimal order for a given class)
60 78 54 57 49
Table 7: Results of the Glass Data (Best Order: 2, 1, 1, 1, 3, 2). Algorithm HON (higher-order neuron), order = 2 HON (higher-order neuron), order = 3 SOM (self-organizing map) HON order = 1 eSOMs (evolutionary evolved SOMs) eHONs (to find the optimal order for a given class)
Number of Misassigned Patterns 176 157 119 115 110
the computation of the inverse of the higher-order tensor, which is intensive in nature. This is a clear setback in the use of multiorder neurons. Several other clustering algorithms, such as AHC (Blatt et al., 1996) and cluster ensembles (Fred & Jain, 2002, 2005), can also be used instead to improve clustering accuracy. On the other hand, using the eHON, we also obtain the optimal order of a particular cluster of patterns, thereby also giving information on the shape of the cluster. We feel that this is a unique contribution of the eHON algorithm that will allow it to be integrated with ensemble clustering approaches. We expect that cluster shape information can help to mold the clustering ensembles to the required shape and improve clustering accuracy. 4.5 Limitations. As the eHON model improves on the model of the higher-order neurons, their computational cost is equivalent to that of the HONs. The computational complexity of the covariance tensor ZH and
3388
K. Ramanathan and S. Guan
the inverse covariance tensor Z−1 H is high. Also, as the eHON makes use of several chromosomes, the covariance tensor of each new chromosome has to be evaluated. The current implementation of the eHON algorithm takes into account the computational complexity and implements only neurons with orders 1, 2, and 3. However, with these orders alone, it is not possible to capture complex cluster shapes and overlapping clusters. Using the eHON algorithm to identify more complex cluster structures will require higher computational complexity. 5 Growing Multiorder Neurons: A Simplified Model Cluster shape information can thus be incorporated into the ensemble algorithms and thereby improve clustering accuracy. In all the problems discussed in section 4, the number of clusters to be formed is a predetermined value that is found earlier and given to the system. However, determining the number of clusters in a set of data patterns has been widely researched (Blackmore & Miikkulainen, 1993; Dittenbach, Merkl, & Rauber, 2000; Fritze, 1995). The algorithms that these authors describe propose various ways to increase dynamically the number of clusters, using certain threshold conditions and stopping criteria. Dittenbach et al. (2000) go one step further and create a hierarchical structure for clustering. Their setup not only increases the number of clusters but also creates a hierarchy, thereby identifying the relationship between clusters. All the algorithms described above are implemented using selforganizing maps (Kohonen, 1997) or other first-order clustering methods. However, they can be converted into a batch version and adapted to the use of higher-order neurons, thus creating a growing model for the number of clusters. The method developed will then be an incremental approach to increasing the number of clusters needed until the resulting number of clusters meets the error criterion. For illustration, we describe in this section a simple extension to eHONs that can be used to determine the number of clusters. We begin the eHON algorithm with two clusters and fix the number of training iterations to be λ . We now determine the “spread” of each cluster and the total spread of eHON map. Effectively, this can be obtained by evaluating j(x) = the( j)−1 m−1 R ⊕ xH for each of the two clusters. The total spread of the x∈ j
eHON can therefore be evaluated as the sum of the spread of each of the clusters. If the total spread after λ iterations is less than a given threshold τ , the cluster with the largest spread is reduced to two clusters. The eHON algorithm is repeated, now with the patterns in the largest cluster, to find two new clusters to represent the large cluster. The pseudocode of the growing eHON can therefore be summarized as:
Evolutionary Multiorder Neurons
3389
Figure 6: The growing eHON model to incrementally determine the number of clusters.
1. Set the number of clusters as two. 2. Implement eHON with all the available patterns. 3. IF the cluster spread is less than τ : a. Complete training. 4. Else: a. Repeat steps 1 to 3 with the patterns of the cluster with the largest spread. The process can be illustrated using the Iris data, as shown in Figure 6. Figure 6a shows two clusters of the Iris data of order 1 and order 2, respectively. In Figure 6b, the larger cluster is further reduced into two smaller clusters, each of order 2. The final clustering obtained has the same accuracy as that reported in Table 5. 6 Conclusions In this letter, we have introduced the concept of multiorder neurons or evolutionary higher-order neurons (eHONs). While higher-order neurons are useful
3390
K. Ramanathan and S. Guan
in clustering, their performance is heavily dependent on the prespecified neuron order. It is desirable to determine the best order not just for the data set but for individual clusters in it. The global search capability of evolutionary algorithms makes them an ideal choice of search algorithms for a problem of this nature, therefore leading to the development of evolutionary higher-order neurons (eHONs), also known as multiorder neurons. These eHONs work by identifying the best order a cluster can take, thereby identifying the complexity of each cluster. The clusters obtained are therefore better correlated with the class labels, making them more meaningful. eHONs were designed by implementing an evolutionary algorithm to evolve the covariance tensors of neurons of multiple orders. The neurons are denoted as structures of tensors, each structure having as its components the order of the neuron and the Kronecker form of the tensor. The evolutionary algorithm was performed using mutation, where the order of the neuron and the value of tensor elements were mutated with a given probability. The use of the multiorder neurons improves the performance of the best higher-order neuron structure and is significantly better than the use of SOMs and single-order neurons. The improvement in accuracy is as high as 13% when compared to the use of SOMs (Iris data). From the results obtained, we can observe that the development of eHONs shows promise in the field of clustering. We have also proposed a minor extension to the eHON model to develop the system such that the optimal number of clusters can be incrementally found. While the method proposed is mostly heuristic, the results using the Iris data set show promise for further research using more sophisticated algorithms. While eHON is capable of finding an optimal cluster order, its use is still confined to disjointed clusters of arbitrary shapes. Tensors of lower orders such as described in this letter will not be capable of dealing with overlapping clusters. However, as Lipson and Siegelmann (2000) have mentioned, tensors of very high orders are capable of dealing with overlapping clusters. Future work will therefore include adapting the eHON structure to deal with overlapping clusters and investigate the effect of eHONs as the system deals with higher-dimensional complicated data. References Balestrino, A., & Verona, B. F. (1994). New adaptive polynomial neural network. Mathematics and Computers in Simulation, 37, 189–194. Ben-Hur, A., Horn, D., Siegelmann, H. T., & Vapnik, V. (2001). Support vector clustering. Journal of Machine Learning Research, 2, 125–137. Blackmore, J., & Miikkulainen, R. (1993). Incremental grid growing: Encoding high dimensional structure into a two-dimensional feature map. In
Evolutionary Multiorder Neurons
3391
IEEE International Conference on Neural Networks (ICNN’1993). Piscataway, NJ: IEEE. Blatt, M., Wiseman, S., & Domany, E. (1996). Supermagnetic clustering of data. Physical Review Letters, 76(18), 3251–3254. Dittenbach, M., Merkl, D., & Rauber, A. (2000). The growing hierarchical selforganizing map. In International Joint Conference on Neural Networks, IJCNN’ 2000 (Vol. 6, pp. 15–19). Piscataway, NJ: IEEE Press. Faux, I. D., & Pratt, M. J. (1981). Computational geometry for design and manufacture. New York: Wiley. Fred, A., & Jain, A. K. (2002). Data clustering using evidence accumulation. In Proceedings of the 16th International Conference on Pattern Recognition (pp. 276–280). Los Alamitos, CA: IEEE Computer Society Press. Fred, A., & Jain, A. K., (2005). Combining multiple clustering using evidence accumulation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(6), 835–850. Fritze, B. (1995). Growing grid: A self organizing network with constant neighborhood range and adaptation strength. Neural Processing Letters, 2(5), 9–13. Goldberg, D. E. (1989). Genetic algorithms in search, optimization and machine learning. Reading, MA: Addison-Wesley. Goldberg, D. E., Deb, K., & Korb, B. (1991). Don’t worry, be messy. In R. Belew & L. Booker (Eds.), Proceedings of the Fourth International Conference in Genetic Algorithms and Their Applications (pp. 24–30). San Mateo, CA: Morgan Kaufmann. Graham, A. (1981). Kronecker products and matrix calculus: With applications. New York, Wiley. Gustafson, E., & Kessel, W. C. (1979). Fuzzy clustering with fuzzy covariance matrix. In Proc. IEEE CDC (pp. 761–766). Piscataway, NJ: IEEE. Kohonen, T. (1997). Self-organizing maps. Berlin: Springer-Verlag. Krishnapuram, R., Frigui, H., & Nasraoui, O. (1995). Fuzzy and possibilistic shell clustering algorithms and their application to boundary detection and surface approximation–Parts I and II. IEEE Transactions on Fuzzy Systems, 3(1), 29–60. Lipson, H., Hod, Y., & Siegelmann, H. T. (1998). Higher-order clustering metrics for competitive learning neural networks. In Proceedings of the Israel-Korea Bi National Conference on New Themes in Computer Aided Geometric Modeling. Tel-Aviv, Israel. Lipson, H., & Siegelmann, W. T. (2000). Clustering irregular shapes using higherorder neurons. Neural Computation, 12, 2331–2353. Mao, J., & Jain, A. (1996). A self organizing network for hyper ellipsoidal clustering (HEC). IEEE Transactions on Neural Networks, 7, 16–39. Painho, M., & Bacao, R. (2000). Using genetic algorithms in clustering problems. Geocomputation 2000. Available online at http://www.geocomputation.org/ 2000/GC015/Gc015.htm. Yao, X. (1993). A review of evolutionary artificial neural networks. International Journal of Intelligent Systems, 8(4), 539–567.
Received July 13, 2006; accepted December 16, 2006.
LETTER
Communicated by Tianping Chen
Multiple Almost Periodic Solutions in Nonautonomous Delayed Neural Networks Kuang-Hui Lin [email protected]
Chih-Wen Shih [email protected] Department of Applied Mathematics, National Chiao Tung University, Hsinchu, Taiwan, R.O.C.
A general methodology that involves geometric configuration of the network structure for studying multistability and multiperiodicity is developed. We consider a general class of nonautonomous neural networks with delays and various activation functions. A geometrical formulation that leads to a decomposition of the phase space into invariant regions is employed. We further derive criteria under which the n-neuron network admits 2n exponentially stable sets. In addition, we establish the existence of 2n exponentially stable almost periodic solutions for the system, when the connection strengths, time lags, and external bias are almost periodic functions of time, through applying the contraction mapping principle. Finally, three numerical simulations are presented to illustrate our theory. 1 Introduction Rhythms are ubiquitous in nature. Experimental and theoretical studies suggest that a mammalian brain may be exploiting dynamic attractors for its storage of associative memories (Kopell, 2000; Skarda & Freeman, 1987; Izhikevich, 1999; Yau, Freeman, Burke, & Yang, 1991). Therefore, investigation of dynamic attractors as limit cycles and strange attractors is essential in neural networks. Many important studies of neural networks, however, focus mainly on static attractors, that is, equilibrium type. On the other hand, cyclic behaviors or rhythmic activities may not be represented exactly by periodic functions from a phenomenological point of view. It is therefore interesting to investigate almost periodic functions, the naturally generalized notion of periodic functions. In this letter, we consider the following neural network model, d xi (t) = −µi (t)xi (t) + αi j (t)g j (x j (t)) + βi j (t)g j (x j (t − τi j (t))) + Ii (t), dt n
n
j=1
j=1
(1.1) Neural Computation 19, 3392–3420 (2007)
C 2007 Massachusetts Institute of Technology
Multiple Almost Periodic Solutions
3393
where i = 1, 2, . . . , n; the integer n corresponds to the number of units in the network; xi (t) corresponds to the state of the ith unit at time t; µi (t) represents the rate with which the ith unit resets its potential to the resting state in isolation when disconnected from the networks and external input; αi j (t) and βi j (t) denote the connection strength of the jth unit on the ith unit at time t and time t − τi j (t), respectively; g j (x j (t)) denotes the output of the jth unit at time t; τi j (t) denotes the transmission delay of the ith unit along the axon of the jth unit at time t and is a nonnegative function; and Ii (t) corresponds to the external bias on the ith unit at time t. Many theoretical studies of neural networks are predominantly concerned with autonomous systems; namely, all the parameters in system 1.1 are constants. Such an autonomous network system is often referred to in the literature as a delayed cellular neural network (Chua & Yang, 1988; Roska & Chua, 1992). The autonomous cellular neural networks with delay and without delay have been successfully applied to many fields, such as optimization, associative memory, signal and image processing, and pattern recognition. The theory on the existence and stability of equilibrium and periodic solution for such autonomous systems has been extensively investigated (Cao, 2003; Chen, Lu, & Chen, 2005; Juang & Lin, 2000; Mohamad & Gopalsamy, 2003; Peng, Qiao, & Xu, 2002; Shih, 2000; Tu & Liao, 2005; van den Driessche & Zou, 1998; Zhou, Liu, & Chen, 2004). The autonomous neural network systems are modeled on the assumption that the environment is stationary, that is, its characteristics do not change with time. However, frequently the environment of interest is not stationary, and the parameters of the information-bearing signal generated by the environment vary with time. In general, nonautonomous phenomena often occur in many realistic systems. Particularly when we consider a long-term dynamical behavior of a system, the parameters of the system usually change with time. The investigations of nonautonomous delayed neural networks (see equation 1.1) are important in understanding the dynamical characteristics of neural behavior in time varying environments. In this letter, we develop a geometrical approach that leads to a decomposition of the phase space into several positively invariant and stable sets. We then study the existence of multiple, almost periodic solutions for the nonautonomous delayed neural network 1.1, under the assumption that all the parameters and delays in system 1.1 are almost periodic. Most of the studies of autonomous systems and nonautonomous systems center around the existence of a unique equilibrium, periodic, multiple almost periodic solutions and their global convergence. A unique periodic solution and its globally exponentially robust stability were investigated by Cao and Chen (2004) for system 1.1 with periodic external bias Ii (t), using the Lyapunov method. The globally exponential stability of a unique almost periodic solution has been analyzed by Cao, Chen, and Huang (2005), Chen and Cao (2003), Fan and Ye (2005), Huang and Cao (2003), Huang, Cao, and Ho (2006), Liu & Huang (2005), Mohamad and Gopalsamy (2000), Zhang,
3394
K. Lin and C. Shih
Wei, and Xu (2005), Mohamad (2003), and Zhao (2004). The techniques they employed include exponential dichotomy, fixed-point theorem, contraction mapping principle, Lyapunov methods and Halanay inequality. A neural network in a more general form, which includes system 1.1 and distributed delays, has been analyzed in Lu and Chen (2004, 2005), who obtained the globally exponential stability of a single periodic solution and a single almost periodic solution. Multistability in neural networks is essential in numerous applications such as content-addressable memory storage and pattern recognition. In addition, the application of multistability is important in decision making, digital selection, and analogy amplification (Hahnloser, 1998). Recently the theory of the existence of multiple stable equilibria for autonomous networks has been established in Cheng, Lin, and Shih (2006); multiple stable periodic orbits for autonomous networks with periodic inputs have been exploited in Cheng, Lin, and Shih (2007), and Zeng and Wang (2006). This presentation moves this direction of studies to wider and more general considerations. We establish the existence of 2n exponentially stable almost periodic solutions for general n-dimensional nonautonomous networks, 1.1, with various activation functions. Our derivation also indicates the basins of attraction for these almost periodic solutions. The presentation is organized as follows. In section 2, we justify that all the solutions for system 1.1 are bounded, under the assumption that all the parameters in this system are bounded continuous functions of t. 2n subsets of the phase space are shown to be positively invariant under the flows generated by system 1.1. Within these 2n subsets, we further construct 2n exponentially stable subsets (to be defined later) for system 1.1. In section 3, we derive the existence of 2n almost periodic solutions for system 1.1 with almost periodic parameters and time lags through the contraction mapping principle. Finally, in section 4, we arrange three numerical simulations on the dynamics of system 1.1 to illustrate the theory. 2 Boundedness, Invariant Sets, and Exponentially Stable Regions We consider that µi (t), αi j (t), βi j (t), Ii (t), τi j (t) in system 1.1 are bounded continuous functions defined for t ∈ R, the set of all real numbers. We denote that 0 < µi ≤ µi (t) ≤ µ ¯ i, I i ≤ Ii (t) ≤ I¯i ,
α i j ≤ αi j (t) ≤ α¯ i j ,
β i j ≤ βi j (t) ≤ β¯ i j ,
(2.1)
0 ≤ τ i j ≤ τi j (t) ≤ τ¯i j
for all i, j = 1, 2, . . . , n and t ∈ R with inf µi (t) = µi ,
t∈R
inf αi j (t) = α i j ,
t∈R
inf βi j (t) = β i j ,
t∈R
inf Ii (t) = I i ,
t∈R
Multiple Almost Periodic Solutions
inf τi j (t) = τ i j ,
sup µi (t) = µ ¯ i,
t∈R
t∈R
sup Ii (t) = I¯i ,
3395
sup αi j (t) = α¯ i j , t∈R
sup βi j (t) = β¯ i j , t∈R
sup τi j (t) = τ¯i j .
t∈R
t∈R
Moreover, we set αi∗j = max{|α i j |, |α¯ i j |},
βi∗j = max{|β i j |, |β¯ i j |},
Ii∗ = max{|I i |, | I¯i |}.
(2.2)
The activation functions g j usually admit sigmoidal configuration or are nondecreasing with saturation. In this presentation, we mainly adopt g j (ξ ) = g(ξ ) := tanh(ξ ), for all j = 1, 2, . . . , n.
(2.3)
Its graph is depicted in Figure 1a. Our theory can be applied to more general activation functions: gi ∈ C , 2
ui < gi (ξ ) < vi , gi (ξ ) > 0, (ξ − σi )gi (ξ ) < 0, for all ξ ∈ R, limξ
→+∞
gi (ξ ) = vi , limξ
→−∞
gi (ξ ) = ui ,
where ui , vi , and σi are constants with ui < vi , and the nondecreasing continuous functions with saturation if − ∞ < ξ < pi , ui if pi ≤ ξ ≤ q i , gi (ξ ) = increasing vi if q i < ξ < ∞, where ui , vi , q i , pi are constants with ui < vi , pi < q i , i = 1, 2, . . . , n. The latter contains the standard piecewise linear activation function for cellular neural networks: (|ξ − 1| + |ξ + 1|)/2. Notably, the analysis in Zeng and Wang (2006) is performed within the framework of this piecewise linear activation function. The configurations for these functions are depicted in Figures 2a to 2c, respectively. We set τ := max1≤i, j≤n {τ¯i j }. System 1.1 is one of nonautonomous delayed differential equations. The evolution of the system is described by the solution x(t; t0 ; ϕ) = x(t; t0 ) = (x1 (t; t0 ), x2 (t; t0 ), . . . , xn (t; t0 )), t ≥ t0 − τ , starting from the initial condition: xi (t0 + θ ; t0 ) = ϕi (θ ) for θ ∈ [−τ, 0], where ϕ = (ϕ1 , ϕ2 , . . . , ϕn ) ∈ C([−τ, 0], Rn ). We note that the fundamental theory for ˙ = F (t, xt ) employs the conventional expression the delayed equation x(t) xt (θ ) = x(t + θ ; t0 ), θ ∈ [−τ, 0] with phase space C([−τ, 0], Rn ). One usually regards the image of solution xt (θ ), t ≥ t0 as a trajectory {x(t; t0 ) : t ≥ t0 − τ } lying in Rn . We first illustrate that all solutions of system 1.1 are bounded and then establish criteria under which system 1.1 admits 2n positively invariant
3396
K. Lin and C. Shih
Figure 1: (a) Graph of activation function g(ξ ) = tanh(ξ ). (b) Configurations for fˆi and fˇi .
subsets of phase space C([−τ, 0], Rn ). Each of these invariant sets induces an exponentially stable region in Rn in which the trajectories {x(t; t0 ) : t ≥ t0 − τ } lie. Note that the norm ϕ = max1≤i≤n {sups∈[−τ,0] |ϕi (s)|} on C([−τ, 0], Rn ) is adopted here. Proposition 1. All solutions of system 1.1 with bounded activation functions are bounded. Moreover, for system 1.1 with activation function 2.3, |xi (t0 ; t0 )| ≤ Mi implies |xi (t; t0 )| ≤ Mi , for t > t0 , |xi (t0 ; t0 )| > Mi implies lim sup |xi (t; t0 )| ≤ Mi , t→∞
for all i = 1, 2, . . . , n, where Mi = [ nj=1 (αi∗j + βi∗j ) + Ii∗ ]/µi .
Multiple Almost Periodic Solutions
3397
(a)
(b) (c)
Figure 2: Configurations of typical (a) smooth sigmoidal activation functions and (b) saturated activation functions. (c) Graph of the standard piecewise linear activation function.
Proof. We prove only the situation of activation function 2.3. Using equations 2.1 to 2.3 in system 1.1, we derive d+ |xi (t; t0 )| ≤ −µi |xi (t; t0 )| + (αi∗j + βi∗j ) + Ii∗ , t > t0 , i = 1, 2, . . . , n, dt n
j=1
where d + /dt denotes the right-hand derivative. It follows that |xi (t; t0 )| ≤ Mi + e −µi (t−t0 ) (|xi (t0 ; t0 )| − Mi ), t > t0 , i = 1, 2, . . . , n, where Mi is defined above. Hence, |xi (t0 ; t0 )| ≤ Mi implies |xi (t; t0 )| ≤ Mi , t > t0 , i = 1, 2, . . . , n. On the other hand, if |xi (t0 ; t0 )| − Mi = Ci > 0, then |xi (t; t0 )| ≤ Mi + e −µi (t−t0 ) Ci for t > t0 and e −µi (t−t0 ) Ci → 0 as t → ∞.
3398
K. Lin and C. Shih
Let us denote by F = (F1 , F2 , . . . , Fn ) the right-hand side of equation 1.1 with activation function 2.3, Fi (t, xt ) : = −µi (t)xi (t) +
n
αi j (t)g j (x j (t))
j=1
+
n
βi j (t)g j (x j (t − τi j (t))) + Ii (t),
j=1
with g j (ξ ) = tanh(ξ ), for every j. Next, we define, for i = 1, 2, . . . , n,
−µi ξ + (α¯ ii + β¯ ii )gi (ξ ) + ki+ , ξ ≥ 0 −µ ¯ i ξ + (α ii + β ii )gi (ξ ) + ki+ , ξ < 0, ¯ i ξ + (α ii + β ii )gi (ξ ) + ki− , ξ ≥ 0 ˇf i (ξ ) = −µ −µi ξ + (α¯ ii + β¯ ii )gi (ξ ) + ki− , ξ < 0,
fˆi (ξ ) =
(2.4)
where ki+ :=
n
(αi∗j + βi∗j ) + I¯i ,
j=1, j =i
ki− := −
n
(αi∗j + βi∗j ) + I i ,
j=1, j =i
with αi∗j and βi∗j defined in equation 2.2. It follows that fˇi (xi ) ≤ Fi (t, xt ) ≤ fˆi (xi ),
(2.5)
for all xt ∈ C([−τ, 0], Rn ), xi ∈ R, t ≥ t0 and i = 1, 2, . . . , n, since |g j | ≤ 1 for all j. Our approach is motivated by the geometric configurations of fˇi and fˆi depicted in Figure 1b. To establish such a configuration, several conditions are required. The first parameter condition is (H1 ) : 0 = inf gi (ξ ) < ξ ∈R
µi
µ ¯i , < max gi (ξ ) = 1, i = 1, 2, . . . , n. ξ ∈R α¯ ii + β¯ ii α ii + β ii
Lemma 1. Assume that condition H1 holds. Then (i) there exist two points p¯ i and q¯ i with p¯ i < 0 < q¯ i such that fˆi ( p¯ i ) = 0 and fˆi (q¯ i ) = 0, i = 1, 2, . . . , n; and (ii) there exist two points pi and q i with pi < 0 < q i such that fˇi ( pi ) = 0 and fˇi (q ) = 0, i = 1, 2, . . . , n. i
Multiple Almost Periodic Solutions
3399
Proof. We prove only case i. For each i, when ξ ≥ 0, since fˆi (ξ ) = −µi + (α¯ ii + β¯ ii )gi (ξ ), we have fˆi (ξ ) = 0 if and only if gi (ξ ) =
µi
α¯ ii + β¯ ii
.
Note that the graph of function gi (ξ ) is concave down. In addition, gi at tains its maximal value at ξ = 0 and limξ →±∞ gi (ξ ) = 0. Hence, since gi is continuous, if 0 = inf gi (ξ ) < gi (ξ ) = ξ ∈R
µi
α¯ ii + β¯ ii
< max gi (ξ ) = 1, i = 1, 2, . . . , n, ξ ∈R
there exists a point q¯ i with q¯ i > 0 such that gi (q¯ i ) =
µi
α¯ ii + β¯ ii
.
Similarly, when ξ < 0, there exists a point p¯ i with p¯ i < 0, such that gi ( p¯ i ) =
µ ¯i . α ii + β ii
Note that condition H1 implies α ii + β ii > 0 for all i = 1, 2, . . . , n, since each µi is already assumed a positive constant. The second parameter condition is (H2 ) : fˆi ( p¯ i ) < 0, fˇi (q i ) > 0, i = 1, 2, . . . , n. The configuration that motivates H2 is depicted in Figure 1. Such a configuration is due to the characteristics of the activation functions gi . Under assumptions H1 and H2 , there exist points aˆ i , bˆ i , cˆ i with aˆ i < bˆ i < cˆ i such that fˆi (aˆ i ) = ˆf i (bˆ i ) = ˆf i (ˆc i ) = 0 as well as points aˇ i , bˇ i , cˇ i with aˇ i < bˇ i < cˇ i such that ˇf i (aˇ i ) = ˇf i (bˇ i ) = ˇf i (ˇc i ) = 0. We consider the following 2n subsets of C([−τ, 0], Rn ). Let e = (e 1 , . . . , e n ) with e i = l or r, where, l, r mean, respectively, “left,” and “right.” Set e = {ϕ = (ϕ1 , ϕ2 , . . . , ϕn ) | ϕi ∈ il if e i = l, ϕi ∈ ir if e i = r}, where il := {ϕi ∈ C([−τ, 0], R) | ϕi (θ ) < bˆ i , for all θ ∈ [−τ, 0]}, ir := {ϕi ∈ C([−τ, 0], R) | ϕi (θ ) > bˇ i , for all θ ∈ [−τ, 0]}.
(2.6)
3400
K. Lin and C. Shih
Theorem 1. Assume that H1 and H2 hold and βii (t) > 0 , for all t ∈ R, i = 1, 2, . . . , n. Then each e is positively invariant under the solution flow generated by system 1.1. Proof. Under assumptions H1 , H2 , and by lemma 1, we obtain the configuration as in Figure 1b and 2n subsets e of C([−τ, 0], Rn ) defined in equation 2.6. Consider one of the 2n sets e , that is, fix an e = (e 1 , e 2 , . . . , e n ), e i = l, r, i = 1, 2, . . . , n. Let φ = (φ1 , φ2 , . . . , φn ) ∈ e . Then there exists a sufficiently small constant ε0 > 0 such that φi (θ ) ≥ bˇ i + ε0 for all θ ∈ [−τ, 0], if e i = r and φi (θ ) ≤ bˆ i − ε0 for all θ ∈ [−τ, 0], if e i = l. We assert that the solution x(t) = x(t; t0 ; φ) remains in e for all t ≥ t0 . If this is not true, there exists a component of x(t) that is the first one or among the first ones escaping from il or ir . More precisely, there exist some i ∈ {1, 2, . . . , n} and t1 > t0 such that either xi (t1 ) = bˇ i + ε0 , ddtxi (t1 ) ≤ 0, and xi (t) > bˇ i + ε0 for t0 − τ ≤ t < t1 or xi (t1 ) = bˆ i − ε0 , ddtxi (t1 ) ≥ 0 and xi (t) < bˆ i − ε0 for t0 − τ ≤ t < t1 . For the first case with xi (t1 ) = bˇ i + ε0 and ddtxi (t1 ) ≤ 0, it follows from equation 1.1 that d xi (t1 ) = −µi (t1 )(bˇ i + ε0 ) + αii (t1 )gi (bˇi + ε0 ) + βii (t1 )gi (xi (t1 − τii (t1 ))) dt n n αi j (t1 )g j (x j (t1 )) + βi j (t1 )g j (x j (t1 − τi j (t1 ))) + j=1, j =i
j=1, j =i
+Ii (t1 ) ≤ 0.
(2.7)
On the other hand, H2 and the previous description of bˇ i yield fˇi (bˇ i + ε0 ) > 0:, −µ ¯ i (bˇ i + ε0 ) + (α ii + β ii )gi (bˇ i + ε0 ) − nj=1, j =i (αi∗j + βi∗j ) + I i > 0, if bˇ i + ε0 ≥ 0 −µ (bˇ i + ε0 ) + (α¯ ii + β¯ ii )gi (bˇ i + ε0 ) − nj=1, j =i (αi∗j + βi∗j ) + I i > 0, i ˇ if b i + ε0 < 0.
(2.8)
Notice that gi (xi (t1 − τii (t1 ))) > gi (bˇ i + ε0 ), by the monotonicity of function gi . In addition, t1 is the first time for xi to decrease across the value bˇ i + ε0 . By βii (t1 ) > 0 and |gi (·)| ≤ 1, we obtain from equation 2.8 that −µi (t1 )(bˇ i + ε0 ) + αii (t1 )gi ( bˇ i + ε0 ) + βii (t1 )gi (xi (t1 − τii (t1 ))) +
n j=1, j =i
αi j (t1 )g j (x j (t1 )) +
n j=1, j =i
βi j (t1 )g j (x j (t1 − τi j (t1 ))) + Ii (t1 )
Multiple Almost Periodic Solutions
3401
−µ ¯ i (bˇ i + ε0 ) + (α ii + β ii )gi (bˇ i + ε0 ) − nj=1, j =i (αi∗j + βi∗j ) + I i > 0, if bˇ i + ε0 ≥ 0 ≥ −µ (bˇ i + ε0 ) + (α¯ ii + β¯ ii )gi (bˇ i + ε0 ) − nj=1, j =i (αi∗j + βi∗j ) i + I i > 0, if bˇ i + ε0 < 0, which contradicts equation 2.7. Hence, xi (t) ≥ bˇ i + ε0 for all t > t0 . Similar arguments justify that xi (t) ≤ bˆ i − ε0 , for all t > t0 for the situation that xi (t1 ) = bˆ i − ε0 and ddtxi (t1 ) ≥ 0. Therefore, e is positively invariant under the flow generated by equation 1.1. Remark 1. The assertion of theorem 1 actually holds for the subset e in equation 2.6 with bˇ i replaced by any number between bˇ i and cˇ i , and bˆ i replaced by any number between bˆ i and aˆ i , as observed from the proof. Such an observation provides further understanding of the dynamics of the system. Definition 1. A positively invariant set ⊂ C([−τ, 0], Rn ) is called an exponentially stable set for system 1.1 if there exist constants γ > 0 and L > 0 such that xt (·; t0 ; ϕ) − xt (·; t0 ; ψ) ≤ Le −γ (t−t0 ) ϕ − ψ for all t ≥ t0 , for solutions xt (·; t0 ; ϕ) and xt (·; t0 ; ψ) of system 1.1, starting from any two initial conditions ϕ, ψ ∈ . In this situation, := {ψ(θ ), θ ∈ [−τ, 0], for all ψ ∈ } ⊂ Rn , the images of all elements in , is called an exponentially stable region for system 1.1. Let η j ∈ R such that η j > max{g j (ξ ) | ξ = cˇ j , aˆ j }, j = 1, 2, . . . , n. We consider the following criterion, which yields exponentially stable sets for system 1.1:
(H3 ) : inf µi (t) − t∈R
n j=1
η j [|αi j (t)| + |βi j (t)|] > 0, i = 1, 2, . . . , n.
We define d j and d¯ j as d j := min{ξ |g j (ξ ) = η j }, d¯ j := max{ξ |g j (ξ ) = η j }.
(2.9)
Then d j > aˆ j , d¯ j < cˇ j . We consider the following 2n subsets of C([−τ, 0], Rn ). Let e = (e 1 , e 2 , . . . , e n ) with e i = l or r, and set ˜ e = {ϕ = (ϕ1 , ϕ2 , . . . , ϕn ) | ϕi ∈ ˜ il if e i = l, ϕi ∈ ˜ ir if e i = r},
(2.10)
3402
K. Lin and C. Shih
where ˜ il := {ϕi ∈ C([−τ, 0], R) | ϕi (θ ) ≤ d i , for all θ ∈ [−τ, 0]}, ˜ ir := {ϕi ∈ C([−τ, 0], R) | ϕi (θ ) ≥ d¯ i , for all θ ∈ [−τ, 0]}. ˜ ir ⊆ ir . ˜ il ⊆ il and Notably, ˜ e of In the following, we shall derive that each of these 2n subsets n C([−τ, 0], R ) is an exponentially stable set for system 1.1. Theorem 2. Assume that conditions H1 , H2 , H3 hold and βii (t) > 0 , for all ˜ e is exponentially stable for t ∈ R, i = 1, 2, . . . , n. Then each of the 2n sets system 1.1. ˜ e , defined in equation Proof. We consider any one of these 2n subsets e ˜ follows from remark 1. Let 2.10. The positive invariance property of x(t) = x(t; t0 ; φ) and y(t) = x(t; t0 ; ψ) be two solutions of system 1.1 with ˜ e . For each i = 1, 2, . . . , n, we define respective initial conditions φ, ψ ∈ the single-variable functions G i (·) by
G i (ζ ) = inf µi (t) − ζ − t∈R
n
η j |αi j (t)| −
j=1
n j=1
η j |βi j (t)|e ζ τ¯i j
.
Then G i is continuous and G i (0) > 0 from H3 . Moreover, there exists a constant λ > 0 such that G i (λ) > 0, for all i = 1, 2, . . . , n, due to continuity of G i . From equations 2.1 to 2.3 and by using η j > max{g j (ξ ) | ξ = cˇ j , aˆ j }, we obtain d+ η j |αi j (t)||x j (t) − y j (t)| |xi (t) − yi (t)| ≤ −µi (t)|xi (t) − yi (t)| + dt n
j=1
+
n
η j |βi j (t)||x j (t − τi j (t)) − y j (t − τi j (t))|.
(2.11)
j=1
We assume that
K := max
1≤i≤n
sup s∈[t0 −τ,t0 ]
|xi (s) − yi (s)| > 0.
(2.12)
Now consider functions zi (·) defined by zi (t) = e λ(t−t0 ) |xi (t) − yi (t)| for t ≥ t0 − τ.
(2.13)
Multiple Almost Periodic Solutions
3403
Then zi (t) ≤ K for t ∈ [t0 − τ, t0 ]. Using equations 2.11 and 2.13 we obtain d+ η j |αi j (t)|z j (t) zi (t) ≤ −[µi (t) − λ]zi (t) + dt n
j=1
+
n
η j |βi j (t)|e
λτ¯i j
sup z j (s)
(2.14)
s∈[t−τ,t]
j=1
for i = 1, 2, . . . , n, t ≥ t0 . We assert that zi (t) ≤ K for all t ≥ t0 , i = 1, 2, . . . , n.
(2.15)
Suppose on the contrary, there exists an i ∈ {1, 2, . . . , n} (say, i = k) and a t1 ≥ t0 at the earliest time such that zi (t) ≤ K , t ∈ [−τ, t1 ], i = 1, 2, . . . , n, i = k, zk (t) ≤ K , t ∈ [−τ, t1 ), zk (t1 ) = K , with
d zk (t1 ) ≥ 0. dt
(2.16)
From equations 2.14 and 2.16, due to G i (λ) > 0, we obtain 0≤
d+ zk (t1 ) dt
≤ −[µk (t1 ) − λ]zk (t1 ) +
+
n
n
η j |βk j (t1 )|e
λτ¯k j
≤ − µk (t1 ) − λ −
sup s∈[t1 −τ,t1 ]
j=1
η j |αk j (t1 )|z j (t1 )
j=1
n
z j (s)
η j |αk j (t1 )| −
j=1
n
η j |βk j (t1 )|e
λτ¯k j
K < 0,
j=1
which is a contradiction. Hence assertion 2.15 holds. It then follows from equations 2.12, 2.13, and 2.15 that
|xi (t) − yi (t)| ≤ e
−λ(t−t0 )
for all i = 1, 2, . . . , n, t ≥ t0 .
max
1≤i≤n
sup s∈[t0 −τ,t0 ]
|xi (s) − yi (s)| ,
3404
K. Lin and C. Shih
3 Existence and Stability of Almost Periodic Solutions In this section, we investigate the existence and stability of almost periodic solutions for system 1.1. We consider that µi (t), αi j (t), βi j (t), τi j (t), and Ii (t), i, j = 1, 2, . . . , n, are almost periodic functions defined for all t ∈ R. First, let us recall some basic definitions and properties of almost periodic functions. Definition 2. (Bohr, 1947; Yoshizawa, 1975). Let h : R → R be continuous. h is said to to be (Bohr) almost periodic on R if, for any > 0, the set T(h, ) = { p : |h(t + p) − h(t)| < , for all t ∈ R} is relatively dense, that is, for any > 0, it is possible to find a real number l = l() > 0, and for any interval with length l(), there exists a number p = p() in the interval such that |h(t + p) − h(t)| < , for all t ∈ R. The number p is called the -translation number or -almost period of h. We denote by AP the set of all such functions. Notably, (AP, · ) with the supremum norm is a Banach space. We collect some properties of almost periodic functions, which will be used in the following discussions. Proposition 2. (Corduneanu, 1968; Fink, 1974). i. ii. iii. iv.
Every periodic function is almost periodic. Each h in AP is bounded and uniformly continuous. AP is an algebra (closed under addition, product, and scalar multiplication). If h ∈ AP and H is uniformly continuous on the range of h, then H ◦ h ∈ AP. v. If h ∈ AP, and inft∈R |h(t)| > 0 , then 1 / h ∈ AP. vi. AP is closed under uniform limits on R. vii. Suppose h 1 , h 2 , . . . , h m ∈ AP. Then for every > 0, there exist a common l() and a common -translation number p() for these functions. Proposition 2(vii) shows that definition 2 can be extended to vectorvalued functions, as the situation for our solutions x(t; t0 ) to system 1.1. We denote by AP(Rn ) the space of all almost periodic functions from R to Rn with norm h = max1≤i≤n supt∈R |h i (t)|. ˜ Proposition 3. Assume that H, h ∈ AP, and define H(t) = H(t + h(t)). Then ˜ ∈ AP. H Proof. H is uniformly continuous since H ∈ AP. Thus, for any ε > 0, there exists δ = δ(ε) > 0 (we choose δ < ε/2) such that |H(ξ ) − H(η)| < ε/2, provided |ξ − η| < δ.
(3.1)
Multiple Almost Periodic Solutions
3405
On the other hand, for the δ = δ(ε) > 0 in equation 3.1, there exists a real number l = l(δ) > 0, and for any interval of length l(δ), there exists a number p = p(δ) in the interval such that |H(t + p) − H(t)| < δ, and |h(t + p) − h(t)| < δ, for all t ∈ R, due to H, h ∈ AP and proposition 2(vii). Therefore, for any ε > 0, there exists l = l(ε) = l(δ(ε)), and p(ε) = p(δ(ε)) in any interval of length l such that |H(t + h(t + p)) − H(t + h(t))| < ε/2, |H(t + h(t + p) + p) − H(t + h(t + p))| < δ < ε/2. Subsequently, |H(t + p + h(t + p)) − H(t + h(t))| ≤ |H(t + h(t + p) + p) − H(t + h(t + p))| + |H(t + h(t + p)) −H(t + h(t))| < ε. If we set µi (t), αi j (t), βi j (t), τi j (t) and Ii (t), i, j = 1, 2, . . . , n, to be almost periodic functions, then theorems 1 and 2 can also be established under the same conditions, since every almost periodic function is bounded. In the following discussion, we investigate the existence of almost periodic solutions of system 1.1 under those conditions in theorem 2 and αii (t) > 0, for all t. Theorem 3. Let µi (t), αi j (t), βi j (t), τi j (t), and Ii (t), i, j = 1, 2, . . . , n, be almost periodic functions. Assume that conditions H1 , H2 , H3 , and αii (t) > 0 , βii (t) > 0, t ∈ R, i = 1, 2, . . . , n, hold. Then there exist 2n exponentially stable almost periodic solutions for system 1.1. Proof. We consider the following 2n subsets of C(R, Rn ). For each e = (e 1 , . . . , e n ), e i = l or r, we set ¯ e = φ = (φ1 , φ2 , . . . , φn ) | φi ∈ ¯ il if wi = l, φi ∈ ¯ ir if wi = r , (3.2) where ¯ il := {φi ∈ C(R, R) | φi (θ ) ≤ d i , for all θ ∈ R}, ¯ ir := {φi ∈ C(R, R) | φi (θ ) ≥ d¯ i , for all θ ∈ R},
3406
K. Lin and C. Shih e
¯ . Let us define a and d i , d¯ i are defined in equation 2.9. Consider a fixed mapping P, P(φ) = (P1 (φ), P2 (φ), . . . , Pn (φ)), by
∞
Pi (φ)(t) =
n
0
+
αi j (t − r )g j (φ j (t − r ))
j=1
n
βi j (t − r )g j (φ j (t − r − τi j (t − r )))
j=1
r + Ii (t − r ) × e − 0 µi (t−s)ds dr, t ∈ R.
(3.3)
¯ e AP(Rn ) → ¯ e AP(Rn ). We divide It will be shown below that P : the following justifications into four steps. ¯ e . We derive from equation 3.3 that Step 1. Let φ ∈
∞
Pi (φ)(t) =
[αii (t − r )gi (φi (t − r )) + βii (t − r )gi (φi (t − r − τii (t − r )))
0
+
n
αi j (t − r )g j (φ j (t − r ))
j=1, j =i
+
n
βi j (t − r )g j (φ j (t − r − τi j (t − r )))
j=1, j =i r
+Ii (t − r )]e − 0 µi (t−s)ds dr ∞ [αii (t − r )gi (d¯ i ) + βii (t − r )gi (d¯ i ) ≥ 0
+
n
αi j (t − r )g j (φ j (t − r ))
j=1, j =i
+
n
βi j (t − r )g j (φ j (t − r − τi j (t − r )))
j=1, j =i r
+Ii (t − r )]e − 0 µi (t−s)ds dr ∞ n r [(α ii + β ii )gi (d¯ i ) − (αi∗j + βi∗j ) + I i ]e − 0 µ¯ i ds dr, ≥ 0
j=1, j =i
(3.4)
Multiple Almost Periodic Solutions
3407
since αii (t), βii (t) ≥ 0, t ∈ R, |gi (·)| ≤ 1, d¯ i > 0, and gi (φi (t − r )), gi (φi (t − r − τii (t − r ))) ≥ gi (d¯ i ) > 0, due to the monotonicity of functions gi and H1 , H2 , and H3 . On the other hand, since fˇi (d¯ i ) > 0 and d¯ i > 0, we have ¯ i d¯ i + (α ii + β ii )gi (d¯ i ) − fˇi (d¯ i ) = −µ that is, (α ii + β ii )gi (d¯ i ) − from equation 3.4 that ¯ i d¯ i Pi (φ)(t) > µ
∞
n
∗ j=1, j =i (αi j
e−
r 0
µ ¯ i ds
n
(αi∗j + βi∗j ) + I i > 0,
j=1, j =i
+ βi∗j ) + I i > µ ¯ i d¯ i . With this, it follows
dr = d¯ i , for all t ∈ R.
0
Similar arguments can be employed to show that Pi (φ)(t) < d i , for all t ∈ R ¯e → ¯ e. and i = 1, 2, . . . , n. Therefore, P: Step 2. We justify that P : AP(Rn ) → AP(Rn ). It suffices to show that φ ∈ AP(Rn ) implies Pi (φ) ∈ AP(R1 ) for all i = 1, 2, . . . , n. Let > 0 be arbitrary. Choose such that µi2 µi , i = 1, 2, . . . , n . , µ, = min ∗ ∗ 5 nk=1 (αik + βik∗ ) + Ii∗ 5 i 5 nk=1 (αik + βik∗ )
By propositions 2 and 3, there exists a common -almost period p = p( ) for µi (·), αi j (·), βi j (·), Ii (·), φ j (·), and Hi j (·) with Hi j (t) := φ j (t − τi j (t)) such that for each i = 1, 2, . . . , n, |µi (t + p − s) − µi (t − s)| ≤ ≤ n
µi n , ∗ 5 k=1 (αik + βik∗ ) + Ii∗
|αi j (t + p − r ) − αi j (t − r )| ≤ ≤
µ, 5 i
|βi j (t + p − r ) − βi j (t − r )| ≤ ≤
µ, 5 i
j=1 n
2
j=1
µ, 5 i µ |φi (t + p − r ) − φi (t − r )| ≤ ≤ n ∗ ∗ , 5 k=1 (αk + βk )
|Ii (t + p − r ) − Ii (t − r )| ≤ ≤
|φi (t + p − r − τi (t + p − r )) − φi (t − r − τi (t − r ))| µ ≤ ≤ n ∗ ∗ , 5 k=1 (αk + βk )
(3.5)
3408
K. Lin and C. Shih
for all = 1, 2, . . . , n, and r, s, t ∈ R. We compute that
|Pi (φ)(t + p) − Pi (φ)(t)| ∞ n r |αi j (t + p − r ) − αi j (t − r )||g j (φ j (t + p − r ))|e − 0 µi (t+ p−s)ds dr ≤ 0
j=1 ∞
+ 0
e−
r 0
n
|βi j (t + p − r ) − βi j (t − r )||g j (φ j (t + p − r − τi j (t + p − r )))|
j=1
µi (t+ p−s)ds
+
∞
0
+
n
dr
|αi j (t − r )||g j (φ j (t + p − r )) − g j (φ j (t − r ))|e −
r 0
µi (t+ p−s)ds
dr
j=1 ∞
0
n
|βi j (t − r )||g j (φ j (t + p − r − τi j (t + p − r )))
j=1 r
− g j (φ j (t − r − τi j (t − r )))|e − 0 µi (t+ p−s)ds dr ∞ n r r + |αi j (t − r )||g j (φ j (t − r ))||e − 0 µi (t+ p−s)ds − e − 0 µi (t−s)ds |dr 0
+
j=1 ∞
0
n
|βi j (t − r )||g j (φ j (t − r − τi j (t − r )))|
j=1 r
r
×|e − 0 µi (t+ p−s)ds − e − 0 µi (t−s)ds |dr ∞ n r + |Ii (t + p − r ) − Ii (t − r )|e − 0 µi (t+ p−s)ds dr 0
+ 0
j=1 ∞
n
|Ii (t − r )||e −
r 0
µi (t+ p−s)ds
− e−
r 0
µi (t−s)ds
|dr
(3.6)
j=1
for φ ∈ AP(Rn ). By the mean value theorem, we obtain the following:
|g j (φ j (t + p − r )) − g j (φ j (t − r ))| = | tanh(φ j (t + p − r )) − tanh(φ j (t − r ))| = |sech2 (θo )[φ j (t + p − r ) − φ j (t − r )]| µi ≤ |φ j (t + p − r ) − φ j (t − r )| ≤ n , ∗ 5 k=1 (αik + βik∗ )
(3.7)
Multiple Almost Periodic Solutions
3409
for all i = 1, 2, . . . , n, where θo lies between φ j (t + p − r ) and φ j (t − r ), and |g j (φ j (t + p − r − τi j (t + p − r ))) − g j (φ j (t − r − τi j (t − r )))| = | tanh(φ j (t + p − r − τi j (t + p − r ))) − tanh(φ j (t − r − τi j (t − r )))| = |sech2 (θτ )[φ j (t + p − r − τi j (t + p − r )) − φ j (t − r − τi j (t − r ))]| ≤ |φ j (t + p − r − τi j (t + p − r )) − φ j (t − r − τi j (t − r ))| µi ≤ n , ∗ 5 k=1 (αik + βik∗ )
(3.8)
for all i = 1, 2, . . . , n, where θτ lies between φ j (t + p − r − τi j (t + p − r )) and φ j (t − r − τi j (t − r )). Similarly, r
r
|e − 0 µi (t+ p−s)ds − e − 0 µi (t−s)ds | r = e θe − µi (t + p − s)ds + 0
≤ e−
r 0
µi ds
r
0
r
µi (t − s)ds
|µi (t + p − s) − µi (t − s)|ds
0
≤ r e −µi r
µi , ∗ 5 nk=1 (αik + βik∗ ) + Ii∗ 2
(3.9)
for all $i = 1, 2, \ldots, n$, where $\theta_e$ lies between $-\int_0^r \mu_i(t+p-s)\,ds$ and $-\int_0^r \mu_i(t-s)\,ds$. By substituting equations 3.5 and 3.7 to 3.9 into 3.6, we have

$$
\begin{aligned}
|P_i(\phi)(t+p) - P_i(\phi)(t)|
\le{}& \frac{\epsilon\,\underline{\mu}_i}{5} \int_0^\infty e^{-\underline{\mu}_i r}\,dr
+ \frac{\epsilon\,\underline{\mu}_i}{5} \int_0^\infty e^{-\underline{\mu}_i r}\,dr\\
&+ \sum_{j=1}^n \alpha_{ij}^*\, \frac{\epsilon\,\underline{\mu}_i}{5\sum_{k=1}^n (\alpha_{ik}^* + \beta_{ik}^*)} \int_0^\infty e^{-\underline{\mu}_i r}\,dr
+ \sum_{j=1}^n \beta_{ij}^*\, \frac{\epsilon\,\underline{\mu}_i}{5\sum_{k=1}^n (\alpha_{ik}^* + \beta_{ik}^*)} \int_0^\infty e^{-\underline{\mu}_i r}\,dr\\
&+ \sum_{j=1}^n \alpha_{ij}^*\, \frac{\epsilon\,\underline{\mu}_i^2}{5\bigl[\sum_{k=1}^n (\alpha_{ik}^* + \beta_{ik}^*) + I_i^*\bigr]} \int_0^\infty r\,e^{-\underline{\mu}_i r}\,dr
+ \sum_{j=1}^n \beta_{ij}^*\, \frac{\epsilon\,\underline{\mu}_i^2}{5\bigl[\sum_{k=1}^n (\alpha_{ik}^* + \beta_{ik}^*) + I_i^*\bigr]} \int_0^\infty r\,e^{-\underline{\mu}_i r}\,dr\\
&+ \frac{\epsilon\,\underline{\mu}_i}{5} \int_0^\infty e^{-\underline{\mu}_i r}\,dr
+ I_i^*\, \frac{\epsilon\,\underline{\mu}_i^2}{5\bigl[\sum_{k=1}^n (\alpha_{ik}^* + \beta_{ik}^*) + I_i^*\bigr]} \int_0^\infty r\,e^{-\underline{\mu}_i r}\,dr\\
\le{}& \frac{\epsilon}{5} + \frac{\epsilon}{5} + \frac{\epsilon}{5} + \frac{\epsilon}{5} + \frac{\epsilon}{5} = \epsilon
\end{aligned}
\tag{3.10}
$$

for $\phi \in AP(\mathbb{R}^n)$, using $\int_0^\infty e^{-\underline{\mu}_i r}\,dr = 1/\underline{\mu}_i$ and $\int_0^\infty r\,e^{-\underline{\mu}_i r}\,dr = 1/\underline{\mu}_i^2$. Therefore, the common $\epsilon$-almost period $p$ that satisfies equation 3.5 is actually an $\epsilon$-almost period of $P_i(\phi)$ for all $i = 1, 2, \ldots, n$.

Step 3. We shall prove that $P$ is a contraction mapping. Notably, condition H3 yields

$$
\mu_i(t) > \sum_{j=1}^n \eta_j \bigl[\,|\alpha_{ij}(t)| + |\beta_{ij}(t)|\,\bigr] \quad \text{for all } t \in \mathbb{R},\; i = 1, 2, \ldots, n.
\tag{3.11}
$$
For arbitrary $\phi, \psi$ in the intersection of $AP(\mathbb{R}^n)$ with the region defined in equation 3.2, we derive

$$
\begin{aligned}
\|P(\phi) - P(\psi)\|
&= \max_{1\le i\le n} \sup_{t\in\mathbb{R}} |P_i(\phi)(t) - P_i(\psi)(t)|\\
&= \max_{1\le i\le n} \sup_{t\in\mathbb{R}} \biggl| \int_0^\infty e^{-\int_0^r \mu_i(t-s)\,ds} \biggl\{ \sum_{j=1}^n \alpha_{ij}(t-r)\,[g_j(\phi_j(t-r)) - g_j(\psi_j(t-r))]\\
&\hspace{9em} + \sum_{j=1}^n \beta_{ij}(t-r)\,[g_j(\phi_j(t-r-\tau_{ij}(t-r))) - g_j(\psi_j(t-r-\tau_{ij}(t-r)))] \biggr\}\,dr \biggr|\\
&\le \max_{1\le i\le n} \sup_{t\in\mathbb{R}} \int_0^\infty e^{-\int_0^r \mu_i(t-s)\,ds} \biggl\{ \sum_{j=1}^n \eta_j\,|\alpha_{ij}(t-r)|\,|\phi_j(t-r) - \psi_j(t-r)|\\
&\hspace{9em} + \sum_{j=1}^n \eta_j\,|\beta_{ij}(t-r)|\,|\phi_j(t-r-\tau_{ij}(t-r)) - \psi_j(t-r-\tau_{ij}(t-r))| \biggr\}\,dr\\
&\le \max_{1\le i\le n} \sup_{t\in\mathbb{R}} \int_0^\infty \sum_{j=1}^n \eta_j\,\bigl[\,|\alpha_{ij}(t-r)| + |\beta_{ij}(t-r)|\,\bigr]\, e^{-\int_0^r \mu_i(t-s)\,ds}\,dr\; \|\phi - \psi\|\\
&= \kappa\,\|\phi - \psi\|,
\end{aligned}
$$
where

$$
\kappa := \max_{1\le i\le n} \sup_{t\in\mathbb{R}} \int_0^\infty \sum_{j=1}^n \eta_j\,\bigl[\,|\alpha_{ij}(t-r)| + |\beta_{ij}(t-r)|\,\bigr]\, e^{-\int_0^r \mu_i(t-s)\,ds}\,dr
< \max_{1\le i\le n} \sup_{t\in\mathbb{R}} \int_0^\infty \mu_i(t-r)\, e^{-\int_0^r \mu_i(t-s)\,ds}\,dr = 1,
$$

due to equation 3.11 and the fact that the $\mu_i(t)$ are positive almost periodic functions of $t \in \mathbb{R}$; the last integral equals 1 because the integrand is $-\frac{d}{dr}\,e^{-\int_0^r \mu_i(t-s)\,ds}$ and $\int_0^r \mu_i(t-s)\,ds \ge \underline{\mu}_i r \to \infty$ as $r \to \infty$. Accordingly, $P$ is a contraction mapping. Thus, by the contraction mapping principle, $P$ has a unique fixed point $\bar{x} = \bar{x}(\cdot)$ in the intersection of $AP(\mathbb{R}^n)$ with the region defined in equation 3.2, that is, $P(\bar{x}) = \bar{x}$.

Step 4. Finally, we show that $\bar{x} = \bar{x}(t) = (\bar{x}_1(t), \bar{x}_2(t), \ldots, \bar{x}_n(t))$, defined for $t \in \mathbb{R}$, is a solution of equation 1.1. Indeed,

$$
\begin{aligned}
\frac{d}{dt}\bar{x}_i(t)
&= \frac{d}{dt} P_i(\bar{x})(t)\\
&= \frac{d}{dt} \int_0^\infty \Bigl[ \sum_{j=1}^n \alpha_{ij}(t-r)\,g_j(\bar{x}_j(t-r)) + \sum_{j=1}^n \beta_{ij}(t-r)\,g_j(\bar{x}_j(t-r-\tau_{ij}(t-r))) + I_i(t-r) \Bigr]\, e^{-\int_0^r \mu_i(t-s)\,ds}\,dr\\
&= \frac{d}{dt} \int_{-\infty}^{t} \Bigl[ \sum_{j=1}^n \alpha_{ij}(r)\,g_j(\bar{x}_j(r)) + \sum_{j=1}^n \beta_{ij}(r)\,g_j(\bar{x}_j(r-\tau_{ij}(r))) + I_i(r) \Bigr]\, e^{-\int_r^t \mu_i(s)\,ds}\,dr\\
&= -\mu_i(t)\,\bar{x}_i(t) + \sum_{j=1}^n \alpha_{ij}(t)\,g_j(\bar{x}_j(t)) + \sum_{j=1}^n \beta_{ij}(t)\,g_j(\bar{x}_j(t-\tau_{ij}(t))) + I_i(t).
\end{aligned}
$$
Hence, system 1.1 admits $2^n$ almost periodic solutions. In addition, each of the $2^n$ regions $\{(x_1, x_2, \ldots, x_n) \in \mathbb{R}^n : x_i \ge \bar{d}_i \text{ or } x_i \le \underline{d}_i\}$ contains one of these almost periodic solutions. Moreover, according to theorem 2, these $2^n$ almost periodic solutions are exponentially stable. This completes the proof.

Theorems 2 and 3 yield corollary 1, and the proof of theorem 3 can be modified to justify corollary 2:

Corollary 1. Each region defined in equation 3.2 contains the basin of attraction of the almost periodic solution lying in it.

Corollary 2. Let $\mu_i(t)$, $\alpha_{ij}(t)$, $\beta_{ij}(t)$, $\tau_{ij}(t)$, and $I_i(t)$, $i, j = 1, 2, \ldots, n$, be $T$-periodic functions defined for all $t \in \mathbb{R}$. Assume that conditions H1, H2, H3,
and $\alpha_{ii}(t) > 0$, $\beta_{ii}(t) > 0$, $t \in \mathbb{R}$, $i = 1, 2, \ldots, n$, hold. Then there exist exactly $2^n$ exponentially stable $T$-periodic solutions for system 1.1.

4 Numerical Illustrations

In this section, we present three examples to illustrate our theory.

4.1 Example 1. Consider the two-dimensional system 1.1 with activation functions $g_1(\xi) = g_2(\xi) = \tanh(\xi)$:

$$
\begin{aligned}
\frac{dx_1(t)}{dt} &= -\mu_1(t)x_1(t) + \alpha_{11}(t)g_1(x_1(t)) + \alpha_{12}(t)g_2(x_2(t))\\
&\quad + \beta_{11}(t)g_1(x_1(t-\tau_{11}(t))) + \beta_{12}(t)g_2(x_2(t-\tau_{12}(t))) + I_1(t),\\
\frac{dx_2(t)}{dt} &= -\mu_2(t)x_2(t) + \alpha_{21}(t)g_1(x_1(t)) + \alpha_{22}(t)g_2(x_2(t))\\
&\quad + \beta_{21}(t)g_1(x_1(t-\tau_{21}(t))) + \beta_{22}(t)g_2(x_2(t-\tau_{22}(t))) + I_2(t).
\end{aligned}
$$

The parameters are set as follows:

$$
\begin{aligned}
&\alpha_{11}(t) = 4 + 0.3\sin(t), &&\alpha_{12}(t) = 0.6 + 0.1\cos(\pi t),\\
&\beta_{11}(t) = 3 + 0.2\sin(t), &&\beta_{12}(t) = 0.4 + 0.1\cos(\pi t),\\
&\alpha_{21}(t) = 0.7 + 0.2\cos(t), &&\alpha_{22}(t) = 5 + 0.5\sin(\pi t),\\
&\beta_{21}(t) = 0.8 + 0.2\cos(t), &&\beta_{22}(t) = 7 + 0.5\sin(\pi t),\\
&\mu_1(t) = 1 + |0.1\sin(t)| + |0.2\sin(\pi t)|, &&\mu_2(t) = 3 + |0.2\cos(t)| + |0.1\cos(\pi t)|,\\
&I_1(t) = \cos(t) + \cos(\pi t), &&I_2(t) = \sin(t) + \sin(\sqrt{2}\,t),\\
&\tau_{11}(t) = \tau_{21}(t) = 10 + \sin(t) + \sin(\pi t), &&\tau_{12}(t) = \tau_{22}(t) = 5 + \cos(\sqrt{2}\,t).
\end{aligned}
$$
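These coefficient functions are elementary, so they can be transcribed directly for numerical work. A minimal Python transcription (NumPy only; the container layout and names are our choices, not the paper's):

```python
import numpy as np

# Example 1 coefficients, transcribed from the text; indices are 0-based,
# so mu[0] is mu_1, alpha[0][1] is alpha_12, and so on.
g = np.tanh  # g1 = g2 = tanh

mu = [lambda t: 1 + abs(0.1*np.sin(t)) + abs(0.2*np.sin(np.pi*t)),
      lambda t: 3 + abs(0.2*np.cos(t)) + abs(0.1*np.cos(np.pi*t))]

alpha = [[lambda t: 4 + 0.3*np.sin(t),   lambda t: 0.6 + 0.1*np.cos(np.pi*t)],
         [lambda t: 0.7 + 0.2*np.cos(t), lambda t: 5 + 0.5*np.sin(np.pi*t)]]

beta = [[lambda t: 3 + 0.2*np.sin(t),    lambda t: 0.4 + 0.1*np.cos(np.pi*t)],
        [lambda t: 0.8 + 0.2*np.cos(t),  lambda t: 7 + 0.5*np.sin(np.pi*t)]]

I = [lambda t: np.cos(t) + np.cos(np.pi*t),
     lambda t: np.sin(t) + np.sin(np.sqrt(2)*t)]

tau = [[lambda t: 10 + np.sin(t) + np.sin(np.pi*t), lambda t: 5 + np.cos(np.sqrt(2)*t)],
       [lambda t: 10 + np.sin(t) + np.sin(np.pi*t), lambda t: 5 + np.cos(np.sqrt(2)*t)]]
```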
Using equation 2.1, we obtain, for all $t \in \mathbb{R}$,

$$
\begin{aligned}
&3.7 = \underline{\alpha}_{11} \le \alpha_{11}(t) \le \bar{\alpha}_{11} = 4.3, &&0.5 = \underline{\alpha}_{12} \le \alpha_{12}(t) \le \bar{\alpha}_{12} = 0.7,\\
&2.8 = \underline{\beta}_{11} \le \beta_{11}(t) \le \bar{\beta}_{11} = 3.2, &&0.3 = \underline{\beta}_{12} \le \beta_{12}(t) \le \bar{\beta}_{12} = 0.5,\\
&0.5 = \underline{\alpha}_{21} \le \alpha_{21}(t) \le \bar{\alpha}_{21} = 0.9, &&4.5 = \underline{\alpha}_{22} \le \alpha_{22}(t) \le \bar{\alpha}_{22} = 5.5,\\
&0.6 = \underline{\beta}_{21} \le \beta_{21}(t) \le \bar{\beta}_{21} = 1, &&6.5 = \underline{\beta}_{22} \le \beta_{22}(t) \le \bar{\beta}_{22} = 7.5,\\
&1 = \underline{\mu}_1 \le \mu_1(t) \le \bar{\mu}_1 = 1.3, &&3 = \underline{\mu}_2 \le \mu_2(t) \le \bar{\mu}_2 = 3.3,\\
&-2 = \underline{I}_1 \le I_1(t) \le \bar{I}_1 = 2, &&-2 = \underline{I}_2 \le I_2(t) \le \bar{I}_2 = 2.
\end{aligned}
$$
By equation 2.4, a direct computation gives

$$
\hat{f}_1(x_1) = \begin{cases} -x_1 + 7.5\,g_1(x_1) + 3.2, & x_1 \ge 0,\\ -1.3\,x_1 + 6.5\,g_1(x_1) + 3.2, & x_1 < 0, \end{cases}
\qquad
\check{f}_1(x_1) = \begin{cases} -1.3\,x_1 + 6.5\,g_1(x_1) - 3.2, & x_1 \ge 0,\\ -x_1 + 7.5\,g_1(x_1) - 3.2, & x_1 < 0, \end{cases}
$$
Table 1: Local Extreme Points and Zeros of $\hat{f}_1$, $\check{f}_1$, $\hat{f}_2$, $\check{f}_2$.

$\hat{a}_1 = -2.466999$   $\bar{p}_1 = -1.443655$   $\hat{b}_1 = -0.768372$   $\bar{q}_1 = 1.665432$   $\hat{c}_1 = 10.7$
$\check{a}_1 = -10.7$   $\underline{p}_1 = -1.665432$   $\check{b}_1 = 0.768372$   $\underline{q}_1 = 1.443655$   $\check{c}_1 = 2.466999$
$\hat{a}_2 = -2.331622$   $\bar{p}_2 = -1.209935$   $\hat{b}_2 = -0.440329$   $\bar{q}_2 = 1.362874$   $\hat{c}_2 = 5.364478$
$\check{a}_2 = -5.364478$   $\underline{p}_2 = -1.362874$   $\check{b}_2 = 0.440329$   $\underline{q}_2 = 1.209935$   $\check{c}_2 = 2.331622$
and

$$
\hat{f}_2(x_2) = \begin{cases} -3\,x_2 + 13\,g_2(x_2) + 3.1, & x_2 \ge 0,\\ -3.3\,x_2 + 11\,g_2(x_2) + 3.1, & x_2 < 0, \end{cases}
\qquad
\check{f}_2(x_2) = \begin{cases} -3.3\,x_2 + 11\,g_2(x_2) - 3.1, & x_2 \ge 0,\\ -3\,x_2 + 13\,g_2(x_2) - 3.1, & x_2 < 0. \end{cases}
$$
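The entries of Table 1 can be reproduced numerically. A sketch for the $\hat{f}_1$ column, assuming SciPy is available (the bracketing intervals are our reading of the curve's shape; $\check{f}_1$, $\hat{f}_2$, and $\check{f}_2$ are handled the same way, using the symmetry $\check{f}_1(x) = -\hat{f}_1(-x)$):

```python
import numpy as np
from scipy.optimize import brentq

def f1_hat(x):
    # \hat{f}_1 of example 1, built from the parameter bounds computed above
    return -x + 7.5*np.tanh(x) + 3.2 if x >= 0 else -1.3*x + 6.5*np.tanh(x) + 3.2

# Interior critical points, solved analytically from f1_hat'(x) = 0 on each branch:
# -1.3 + 6.5*sech(x)^2 = 0 on x < 0 and -1 + 7.5*sech(x)^2 = 0 on x > 0.
p1_bar = -np.arccosh(np.sqrt(6.5/1.3))     # local minimum, ~ -1.443655
q1_bar =  np.arccosh(np.sqrt(7.5/1.0))     # local maximum, ~  1.665432

# Zeros of f1_hat; brackets follow from its shape (decays from +inf, dips
# below zero, rises above zero, and eventually falls to -inf).
a1_hat = brentq(f1_hat, -12.0, p1_bar)     # ~ -2.466999
b1_hat = brentq(f1_hat, p1_bar, 0.0)       # ~ -0.768372
c1_hat = brentq(f1_hat, 2.0, 12.0)         # ~ 10.7
print(p1_bar, q1_bar, a1_hat, b1_hat, c1_hat)
print(f1_hat(p1_bar))                      # ~ -0.737051, the value used in condition H2
```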
Herein, the parameters satisfy our conditions in theorem 3.

Condition H1:

$$
0 < \frac{\underline{\mu}_1}{\bar{\alpha}_{11} + \bar{\beta}_{11}} = \frac{1}{7.5}, \qquad \frac{\bar{\mu}_1}{\underline{\alpha}_{11} + \underline{\beta}_{11}} = \frac{1.3}{6.5} < 1,
$$
$$
0 < \frac{\underline{\mu}_2}{\bar{\alpha}_{22} + \bar{\beta}_{22}} = \frac{3}{13}, \qquad \frac{\bar{\mu}_2}{\underline{\alpha}_{22} + \underline{\beta}_{22}} = \frac{3.3}{11} < 1.
$$

Condition H2:

$$
\hat{f}_1(\bar{p}_1) = -0.737051 < 0, \qquad \check{f}_1(\underline{q}_1) = 0.737051 > 0,
$$
$$
\hat{f}_2(\bar{p}_2) = -2.110474 < 0, \qquad \check{f}_2(\underline{q}_2) = 2.110474 > 0.
$$

Condition H3:

$$
\inf_{t\in\mathbb{R}} \bigl\{\mu_1(t) - \eta_1(|\alpha_{11}(t)| + |\beta_{11}(t)|) - \eta_2(|\alpha_{12}(t)| + |\beta_{12}(t)|)\bigr\} = 0.106 > 0,
$$
$$
\inf_{t\in\mathbb{R}} \bigl\{\mu_2(t) - \eta_1(|\alpha_{21}(t)| + |\beta_{21}(t)|) - \eta_2(|\alpha_{22}(t)| + |\beta_{22}(t)|)\bigr\} = 1.25 > 0,
$$

where $\eta_1 = 0.1 > g'(\hat{a}_1) = g'(\check{c}_1) = 0.028381$ and $\eta_2 = 0.12 > g'(\hat{a}_2) = g'(\check{c}_2) = 0.037041$. Local extreme points and zeros of $\hat{f}_1$, $\check{f}_1$, $\hat{f}_2$, $\check{f}_2$ are listed in Table 1. The components $x_1(t)$, $x_2(t)$ of evolutions from several constant initial values and time-dependent initial values are depicted in Figures 3 and 4, respectively. The two-dimensional phase diagram is arranged in Figure 5.
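The numbers in conditions H1 through H3 follow from simple interval arithmetic over the coefficient bounds above. A short check in Python (variable names are ours):

```python
import numpy as np

# Conservative check of condition H3 for example 1: the infimum of mu_i minus
# the eta-weighted suprema of the coupling magnitudes, using equation 2.1's bounds.
eta1, eta2 = 0.1, 0.12
h3_1 = 1.0 - eta1*(4.3 + 3.2) - eta2*(0.7 + 0.5)   # = 0.106 > 0
h3_2 = 3.0 - eta1*(0.9 + 1.0) - eta2*(5.5 + 7.5)   # = 1.25  > 0
print(h3_1, h3_2)

# eta_j must dominate the activation slope at the outer zeros; g'(x) = sech(x)^2
gp = lambda x: 1.0/np.cosh(x)**2
print(gp(-2.466999), gp(-2.331622))   # ~0.028381 < 0.1 and ~0.037041 < 0.12
```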
Figure 3: Evolution of state variable x1(t) in example 1.

Figure 4: Evolution of state variable x2(t) in example 1.

Figure 5: Phase portrait for example 1.
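The trajectories in Figures 3 to 5 can be reproduced qualitatively with a fixed-step integrator. A minimal sketch, assuming forward Euler with nearest-grid lookup of the delayed states and constant initial histories of our choosing (a dedicated DDE solver would be preferable for quantitative work):

```python
import numpy as np
import matplotlib.pyplot as plt

# Forward-Euler integration of example 1; delayed terms are read back from the
# stored trajectory at the nearest grid point. Step size, horizon, and initial
# histories are our choices, not the paper's.
pi, s2 = np.pi, np.sqrt(2.0)
h, T, taumax = 1e-3, 60.0, 12.0          # tau_11 = tau_21 <= 12, tau_12 = tau_22 <= 6
n0, N = int(round(taumax/h)), int(round(T/h))
t_grid = (np.arange(n0 + N + 1) - n0)*h
g = np.tanh

def simulate(x0):
    x = np.empty((n0 + N + 1, 2))
    x[:n0 + 1] = x0                       # constant history on [-taumax, 0]
    for k in range(n0, n0 + N):
        t = t_grid[k]
        k1 = k - int(round((10 + np.sin(t) + np.sin(pi*t))/h))   # tau_i1(t)
        k2 = k - int(round((5 + np.cos(s2*t))/h))                # tau_i2(t)
        x1, x2 = x[k]
        d1, d2 = g(x[k1, 0]), g(x[k2, 1])                        # delayed activations
        mu1 = 1 + abs(0.1*np.sin(t)) + abs(0.2*np.sin(pi*t))
        mu2 = 3 + abs(0.2*np.cos(t)) + abs(0.1*np.cos(pi*t))
        dx1 = (-mu1*x1 + (4 + 0.3*np.sin(t))*g(x1) + (0.6 + 0.1*np.cos(pi*t))*g(x2)
               + (3 + 0.2*np.sin(t))*d1 + (0.4 + 0.1*np.cos(pi*t))*d2
               + np.cos(t) + np.cos(pi*t))
        dx2 = (-mu2*x2 + (0.7 + 0.2*np.cos(t))*g(x1) + (5 + 0.5*np.sin(pi*t))*g(x2)
               + (0.8 + 0.2*np.cos(t))*d1 + (7 + 0.5*np.sin(pi*t))*d2
               + np.sin(t) + np.sin(s2*t))
        x[k + 1, 0] = x1 + h*dx1
        x[k + 1, 1] = x2 + h*dx2
    return x

for x0 in [(-5.0, -5.0), (-5.0, 5.0), (5.0, -5.0), (5.0, 5.0)]:
    x = simulate(np.array(x0))
    plt.plot(x[n0:, 0], x[n0:, 1])        # phase portrait, cf. Figure 5
plt.xlabel('x1'); plt.ylabel('x2'); plt.show()
```

Each corner history should relax onto a different almost periodic solution, giving the four coexisting attractors of the theorem for $n = 2$.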
4.2 Example 2. Consider the two-dimensional system 1.1 with activation functions $g_1(\xi) = g_2(\xi) = \frac{1}{2}(|\xi + 1| - |\xi - 1|)$ and the following parameters:

$$
\begin{aligned}
&\alpha_{11}(t) = 2 + 0.2\cos\Bigl(\tfrac{t}{2}\Bigr), &&\alpha_{12}(t) = 1 + 0.1\sin\Bigl(\tfrac{\sqrt{2}\,t}{2}\Bigr),\\
&\beta_{11}(t) = 3 + 0.2\cos\Bigl(\tfrac{t}{2}\Bigr) + 0.1\sin\Bigl(\tfrac{\sqrt{2}\,t}{2}\Bigr), &&\beta_{12}(t) = 1 + 0.2\sin\Bigl(\tfrac{\sqrt{2}\,t}{2}\Bigr),\\
&\alpha_{21}(t) = -1 + 0.2\sin\Bigl(\tfrac{t}{2}\Bigr), &&\alpha_{22}(t) = 4 + 0.5\cos\Bigl(\tfrac{\sqrt{2}\,t}{2}\Bigr),\\
&\beta_{21}(t) = 2 + 0.3\sin\Bigl(\tfrac{t}{2}\Bigr), &&\beta_{22}(t) = 5 + 0.3\sin\Bigl(\tfrac{t}{2}\Bigr) + 0.3\cos\Bigl(\tfrac{\sqrt{2}\,t}{2}\Bigr),\\
&\mu_1(t) = 1 + 0.1\cos\Bigl(\tfrac{t}{2}\Bigr) + 0.1\sin\Bigl(\tfrac{\sqrt{2}\,t}{2}\Bigr), &&\mu_2(t) = 2 + 0.2\sin\Bigl(\tfrac{t}{2}\Bigr) + 0.3\sin\Bigl(\tfrac{\sqrt{2}\,t}{2}\Bigr),
\end{aligned}
$$
$$
\begin{aligned}
&I_1(t) = 0.1\sin\Bigl(\tfrac{t}{2}\Bigr) + 0.2\cos\Bigl(\tfrac{\sqrt{2}\,t}{2}\Bigr), &&I_2(t) = 1 + 0.1\cos\Bigl(\tfrac{t}{2}\Bigr) + 0.2\sin\Bigl(\tfrac{\sqrt{2}\,t}{2}\Bigr),\\
&\tau_{11}(t) = \tau_{21}(t) = 6 + \cos(t) + \sin(\sqrt{2}\,t), &&\tau_{12}(t) = \tau_{22}(t) = 15 + \sin(t) + \cos(\pi t).
\end{aligned}
$$
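For reference, a transcription of these coefficients that prints their empirical ranges, the lower and upper bounds entering the adapted conditions (sampling window, step, and names are ours):

```python
import numpy as np

# Example 2 coefficients; dense sampling recovers the infima/suprema that play
# the role of the underlined/barred bounds in the adapted conditions.
s2 = np.sqrt(2.0)
g = lambda x: 0.5*(np.abs(x + 1) - np.abs(x - 1))   # saturates at -1 and +1

t = np.arange(0.0, 2000.0, 0.01)
coeffs = {
    'mu1':     1 + 0.1*np.cos(t/2) + 0.1*np.sin(s2*t/2),
    'mu2':     2 + 0.2*np.sin(t/2) + 0.3*np.sin(s2*t/2),
    'alpha11': 2 + 0.2*np.cos(t/2),
    'alpha12': 1 + 0.1*np.sin(s2*t/2),
    'alpha21': -1 + 0.2*np.sin(t/2),
    'alpha22': 4 + 0.5*np.cos(s2*t/2),
    'beta11':  3 + 0.2*np.cos(t/2) + 0.1*np.sin(s2*t/2),
    'beta12':  1 + 0.2*np.sin(s2*t/2),
    'beta21':  2 + 0.3*np.sin(t/2),
    'beta22':  5 + 0.3*np.sin(t/2) + 0.3*np.cos(s2*t/2),
}
for name, v in coeffs.items():
    print(f'{name}: [{v.min():.3f}, {v.max():.3f}]')
```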
These parameters satisfy the conditions analogous to those of theorem 3, as adapted to the activation functions considered here. The evolutions of the components $x_1(t)$ and $x_2(t)$ are depicted in Figures 6 and 7, respectively.

Figure 6: Evolution of state variable x1(t) in example 2.

Figure 7: Evolution of state variable x2(t) in example 2.

4.3 Example 3. Consider the three-dimensional system 1.1 with activation functions $g_j(\xi) = g(\xi) := \frac{1}{1 + e^{-\xi/\epsilon}}$, $\epsilon = 0.5 > 0$, $j = 1, 2, 3$. Let $\alpha_{ij} = 0$, $i, j = 1, 2, 3$; the system then reduces to a three-dimensional nonautonomous delayed Hopfield neural network:

$$
\frac{dx_1(t)}{dt} = -\mu_1(t)x_1(t) + \beta_{11}(t)g_1(x_1(t-\tau_{11}(t))) + \beta_{12}(t)g_2(x_2(t-\tau_{12}(t))) + \beta_{13}(t)g_3(x_3(t-\tau_{13}(t))) + I_1(t),
$$
$$
\frac{dx_2(t)}{dt} = -\mu_2(t)x_2(t) + \beta_{21}(t)g_1(x_1(t-\tau_{21}(t))) + \beta_{22}(t)g_2(x_2(t-\tau_{22}(t))) + \beta_{23}(t)g_3(x_3(t-\tau_{23}(t))) + I_2(t),
$$
$$
\frac{dx_3(t)}{dt} = -\mu_3(t)x_3(t) + \beta_{31}(t)g_1(x_1(t-\tau_{31}(t))) + \beta_{32}(t)g_2(x_2(t-\tau_{32}(t))) + \beta_{33}(t)g_3(x_3(t-\tau_{33}(t))) + I_3(t).
$$

The parameters are set as follows:

$$
\begin{aligned}
&\mu_1(t) = 1 + 0.1\sin(t), &&\mu_2(t) = 1 + 0.1\cos(t), &&\mu_3(t) = 3 + 0.2\sin(t),\\
&\beta_{11}(t) = 19 + \sin(t), &&\beta_{12}(t) = 1 + \cos(t), &&\beta_{13}(t) = \sin(t),\\
&\beta_{21}(t) = 1 + \sin(t), &&\beta_{22}(t) = 22 + \cos(t) + \sin(t), &&\beta_{23}(t) = 1 + \sin(t),\\
&\beta_{31}(t) = \sin(t), &&\beta_{32}(t) = 1 + \cos(t), &&\beta_{33}(t) = 32 + 2\sin(t) + \cos(t),\\
&I_1(t) = -9 + \sin(t), &&I_2(t) = -10 + \cos(t), &&I_3(t) = -15 + \sin(t),\\
&\tau_{k1}(t) = 10 + \sin(t), &&\tau_{k2}(t) = 20 + \cos(t), &&\tau_{k3}(t) = 30 + \sin(t),
\end{aligned}
$$

where $k = 1, 2, 3$.
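The eight coexisting attractors can be probed numerically in the same way as example 1. A self-contained Euler sketch, assuming constant initial histories at the eight low/high combinations (the levels of -5 and 15, the step size, and the horizon are our choices):

```python
import itertools
import numpy as np

# Euler sketch of the three-dimensional delayed Hopfield system of example 3.
# Each of the eight "low/high" constant initial histories should settle onto a
# different stable periodic solution; the printed vectors sample them at t = T.
h, T, taumax = 1e-2, 200.0, 31.0
n0, N = int(round(taumax/h)), int(round(T/h))
g = lambda x: 1.0/(1.0 + np.exp(-x/0.5))            # epsilon = 0.5

def simulate(x0):
    x = np.empty((n0 + N + 1, 3))
    x[:n0 + 1] = x0                                  # constant history on [-taumax, 0]
    for k in range(n0, n0 + N):
        t = (k - n0)*h
        lags = (10 + np.sin(t), 20 + np.cos(t), 30 + np.sin(t))  # tau_k1, tau_k2, tau_k3
        xd = np.array([x[k - int(round(lags[j]/h)), j] for j in range(3)])
        mu = np.array([1 + 0.1*np.sin(t), 1 + 0.1*np.cos(t), 3 + 0.2*np.sin(t)])
        B = np.array([[19 + np.sin(t), 1 + np.cos(t), np.sin(t)],
                      [1 + np.sin(t), 22 + np.cos(t) + np.sin(t), 1 + np.sin(t)],
                      [np.sin(t), 1 + np.cos(t), 32 + 2*np.sin(t) + np.cos(t)]])
        Iv = np.array([-9 + np.sin(t), -10 + np.cos(t), -15 + np.sin(t)])
        x[k + 1] = x[k] + h*(-mu*x[k] + B @ g(xd) + Iv)
    return x[-1]

for levels in itertools.product([-5.0, 15.0], repeat=3):
    print(levels, '->', np.round(simulate(np.array(levels)), 2))
```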
Figure 8: The dynamics in example 3.
These parameters satisfy the conditions analogous to those of corollary 2, as adapted to the activation functions considered here. Hence, there are eight stable periodic solutions. The dynamics of the system are illustrated in Figure 8.

5 Conclusion

We have established a new methodology for investigating multistability and multiperiodicity in general nonautonomous neural networks with delays. The presentation is formulated mainly for systems with the activation function 2.3, but all the conditions of the theorems can be modified to accommodate other typical activation functions. Our approach, which makes use of the geometric configuration of the network structure, is effective and illuminating. The treatment is expected to contribute toward exploring further fruitful dynamics in neural networks.

Acknowledgments

We are grateful to the reviewers for their comments and suggestions, which led to an improvement of this presentation. This work is partially supported by the National Science Council, the National Center of Theoretical Sciences, and the MOE ATU program of the R.O.C. on Taiwan.
Received September 3, 2006; accepted October 29, 2006.